# 10 - Tracing & Observability Records agent run activities asynchronously. Spans are buffered in memory and flushed to the TracingStore in batches, with optional export to external OpenTelemetry backends. > Tracing uses PostgreSQL. The `traces` and `spans` tables store all tracing data. Optional OTel export sends spans to external backends (Jaeger, Grafana Tempo, Datadog) in addition to PostgreSQL. --- ## 1. Collector -- Buffer-Flush Architecture ```mermaid flowchart TD EMIT["EmitSpan(span)"] --> BUF["spanCh
(buffered channel, cap = 1000)"] BUF --> FLUSH["flushLoop() -- every 5s"] FLUSH --> DRAIN["Drain all spans from channel"] DRAIN --> BATCH["BatchCreateSpans() to PostgreSQL"] DRAIN --> OTEL["OTelExporter.ExportSpans()
to OTLP backend (if configured)"] DRAIN --> AGG["Update aggregates
for dirty traces"] FULL{"Buffer full?"} -.->|"Drop + warning log"| BUF ``` ### Trace Lifecycle ```mermaid flowchart LR CT["CreateTrace()
(synchronous, 1 per run)"] --> ES["EmitSpan()
(async, buffered)"] ES --> FT["FinishTrace()
(status, error, output preview)"] ``` ### Cancel Handling When a run is cancelled via `/stop` or `/stopall`, the run context is cancelled but trace finalization still needs to persist. `FinishTrace()` detects `ctx.Err() != nil` and switches to `context.Background()` for the final database write. The trace status is set to `"cancelled"` instead of `"error"`. Context values (traceID, collector) survive cancellation -- only `ctx.Done()` and `ctx.Err()` change. This allows trace finalization to find everything it needs with a fresh context for the DB call. --- ## 2. Span Types & Hierarchy | Type | Description | OTel Kind | |------|-------------|-----------| | `llm_call` | LLM provider call | Client | | `tool_call` | Tool execution | Internal | | `agent` | Root agent span (parents all child spans) | Internal | | `embedding` | Embedding generation (vector store operations) | Internal | | `event` | Discrete event marker (no duration) | Internal | ```mermaid flowchart TD AGENT["Agent Span (root)
parents all child spans"] --> LLM1["LLM Call Span 1
(model, tokens, finish reason)"] AGENT --> TOOL1["Tool Span: exec
(tool_name, duration)"] AGENT --> LLM2["LLM Call Span 2"] AGENT --> TOOL2["Tool Span: read_file"] AGENT --> EMB["Embedding Span
(vector store operation)"] AGENT --> LLM3["LLM Call Span 3"] ``` ### Token Aggregation Token counts are aggregated **only from `llm_call` spans** (not `agent` spans) to avoid double-counting. The `BatchUpdateTraceAggregates()` method sums `input_tokens` and `output_tokens` from spans where `span_type = 'llm_call'` and writes the totals to the parent trace record. --- ## 3. Verbose Mode | Mode | InputPreview | OutputPreview | |------|:---:|:---:| | Normal | Not recorded | 500 characters max | | Verbose (`GOCLAW_TRACE_VERBOSE=1`) | Up to 200KB | Up to 200KB | Verbose mode is useful for debugging LLM conversations. When enabled via `GOCLAW_TRACE_VERBOSE=1`: - **LLM spans**: Full input messages (including system prompt, history, and tool results) are serialized as JSON and stored in `InputPreview` (truncated at 200KB). LLM response content is stored in `OutputPreview` (truncated at 200KB, includes `` tag if present). - **Tool spans**: Tool input and output are both recorded up to 200KB. - **Agent span**: Input message and output are both recorded up to 200KB. In normal mode, previews are truncated to 500 characters max to minimize storage overhead. --- ## 4. OTel Export Optional OpenTelemetry OTLP exporter that sends spans to external observability backends. ```mermaid flowchart TD COLLECTOR["Collector flush cycle"] --> CHECK{"SpanExporter set?"} CHECK -->|No| PG_ONLY["Write to PostgreSQL only"] CHECK -->|Yes| BOTH["Write to PostgreSQL
+ ExportSpans() to OTLP backend"] BOTH --> BACKEND["Jaeger / Tempo / Datadog"] ``` ### OTel Configuration | Parameter | Description | |-----------|-------------| | `endpoint` | OTLP endpoint (e.g., `localhost:4317` for gRPC, `localhost:4318` for HTTP) | | `protocol` | `grpc` (default) or `http` | | `insecure` | Skip TLS for local development | | `service_name` | OTel service name (default: `goclaw-gateway`) | | `headers` | Extra headers (auth tokens, etc.) | ### Batch Processing | Parameter | Value | |-----------|-------| | Max batch size | 100 spans | | Batch timeout | 5 seconds | The exporter lives in a separate sub-package (`internal/tracing/otelexport/`) so its gRPC and protobuf dependencies are isolated. Commenting out the import and wiring removes approximately 15-20MB from the binary. The exporter is attached to the Collector via `SetExporter()`. --- ## 5. Cost Calculation Per-span cost is calculated using the `CalculateCost()` function in `internal/tracing/cost.go`. For each LLM call span: ``` Cost = (PromptTokens × InputCostPerMillion) / 1,000,000 + (CompletionTokens × OutputCostPerMillion) / 1,000,000 + (CacheReadTokens × CacheReadCostPerMillion) / 1,000,000 + (CacheCreationTokens × CacheCreateCostPerMillion) / 1,000,000 ``` Model pricing is loaded from `config.ModelPricing` and keyed by `provider/model` (with fallback to `model` only). Cost is stored in the `total_cost` field of each LLM call span. The trace aggregation sums costs from all child `llm_call` spans to compute the trace-level `total_cost`. Cache token costs (read + create) are optional and only applied if the pricing config specifies non-zero values. --- ## 6. Snapshot Worker -- Realtime Usage Aggregation The `SnapshotWorker` periodically aggregates trace and span data into hourly `usage_snapshots` for realtime analytics and dashboard displays. ### Operation - **Schedule**: Ticks every hour at HH:05:00 UTC (5 minutes past the hour) - **Catch-up**: On startup and after each tick, computes snapshots for all missed hours - **Backfill**: `Backfill()` method populates historical snapshots from the earliest trace to now ### Snapshot Dimensions For each hour `[00:00, 01:00)`, the worker creates two types of snapshot rows: 1. **Totals Row** (`provider=""`, `model=""`) — Aggregated from traces: - `request_count` — Count of root traces - `error_count` — Count of failed traces - `unique_users` — Distinct `user_id` in traces - `input_tokens`, `output_tokens` — Sum from all child `llm_call` spans - `total_cost` — Sum of costs from all child `llm_call` spans - `tool_call_count` — Sum from traces - `avg_duration_ms` — Average trace duration - `memory_docs`, `memory_chunks` — Point-in-time count (attached to agent's totals row only) - `kg_entities`, `kg_relations` — Point-in-time count (attached to agent's totals row only) 2. **Detail Rows** (`provider` + `model` specified) — Aggregated from `llm_call` spans: - `llm_call_count` — Count of LLM calls for this provider/model - `input_tokens`, `output_tokens` — Sum of tokens - `total_cost` — Sum of per-call costs - `cache_read_tokens`, `cache_create_tokens`, `thinking_tokens` — Sum from span metadata Grouping: by `(agent_id, channel)` for totals; by `(agent_id, channel, provider, model)` for details. ### Usage ```go worker := tracing.NewSnapshotWorker(db, snapshotStore) worker.Start() // Later: hoursBackfilled, err := worker.Backfill(ctx) worker.Stop() ``` --- ## 7. Trace HTTP API | Method | Path | Description | |--------|------|-------------| | GET | `/v1/traces` | List traces with pagination and filters | | GET | `/v1/traces/{id}` | Get trace details with all spans | ### Query Filters | Parameter | Type | Description | |-----------|------|-------------| | `agent_id` | UUID | Filter by agent | | `user_id` | string | Filter by user | | `status` | string | Filter by status (running, success, error, cancelled) | | `from` / `to` | timestamp | Date range filter | | `limit` | int | Page size (default 50) | | `offset` | int | Pagination offset | --- ## 8. Delegation History Delegation history records are stored in the `delegation_history` table and exposed alongside traces for cross-referencing agent interactions. | Channel | Endpoint | Details | |---------|----------|---------| | WebSocket RPC | `delegations.list` / `delegations.get` | Results truncated (500 runes for list, 8000 for detail) | | HTTP API | `GET /v1/delegations` / `GET /v1/delegations/{id}` | Full records | | Agent tool | `delegate(action="history")` | Agent self-checking past delegations | Delegation history is automatically recorded by `DelegateManager.saveDelegationHistory()` for every delegation (sync/async). Each record includes source agent, target agent, input, result, duration, and status. --- ## File Reference | File | Description | |------|-------------| | `internal/tracing/collector.go` | Collector buffer-flush, EmitSpan, FinishTrace, verbose mode | | `internal/tracing/context.go` | Trace context propagation (TraceID, ParentSpanID, DelegateParentTraceID) | | `internal/tracing/cost.go` | Cost calculation and pricing lookup | | `internal/tracing/snapshot_worker.go` | Hourly usage aggregation into snapshots | | `internal/tracing/otelexport/exporter.go` | OTel OTLP exporter (gRPC + HTTP) | | `internal/store/tracing_store.go` | TracingStore interface, span/trace type constants | | `internal/store/pg/tracing.go` | PostgreSQL trace/span persistence + aggregation | | `internal/http/traces.go` | Trace HTTP API handler (GET /v1/traces) | | `internal/agent/loop_tracing.go` | Span emission from agent loop (LLM, tool, agent spans) | | `internal/http/delegations.go` | Delegation history HTTP API handler | | `internal/gateway/methods/delegations.go` | Delegation history RPC handlers | --- ## Cross-References | Document | Relevant Content | |----------|-----------------| | [01-agent-loop.md](./01-agent-loop.md) | Span emission during agent execution, cancel handling | | [03-tools-system.md](./03-tools-system.md) | Delegation system, delegation history via agent tool | | [06-store-data-model.md](./06-store-data-model.md) | traces/spans tables schema, delegation_history table | | [08-scheduling-cron.md](./08-scheduling-cron.md) | Scheduler lanes, /stop and /stopall commands | | [09-security.md](./09-security.md) | Rate limiting, RBAC access control |