Files
goclaw/docs/10-tracing-observability.md
T
Viet Tran 037d18f711 docs: comprehensive audit and update of all documentation (#231)
* feat(ui): improve kanban UX, fix dialog scroll, remove delegation page

- Kanban: reorder columns (blocked after pending), show blocked-by info
  on cards, clickable blocker links in task detail, framer-motion card
  animation between columns
- Dialogs: standardize scroll pattern across all modals — header fixed,
  scrollbar flush with outer edge via negative margin trick
- Remove delegation page, types, events, i18n, routes, and all references
- Fix activity_logs NULL jsonb scan error (COALESCE)
- Board header: show text labels on action buttons (desktop)

* docs: comprehensive audit and update of all documentation

- Update Go 1.25 → 1.26, PostgreSQL 15+ → 18 across all docs
- Add 10 missing internal modules to CLAUDE.md project structure
- Expand provider docs from 2 to 6 packages (Anthropic, OpenAI, DashScope, Claude CLI, ACP, Codex)
- Add 8 missing store interfaces to data model docs (22 total)
- Update bootstrap files from 7 to 13 templates
- Expand tool inventory from ~35 to 60+ tools with media/KG/credential categories
- Fix Team Task Board: add blocked status, 3 missing actions, V2 versioning, delegate restrictions
- Remove all references to removed features: handoff, delegate_search, evaluate_loop, agent_links
- Fix lane defaults (2/4/1 → 30/50/100/30), ghost file references, models.list → providers.models
- Add SecureCLI, snapshot worker, cost calculation, pairing security docs
- Comprehensive changelog catch-up
- Trim docs/03-tools-system.md to 800-line limit
2026-03-16 22:51:57 +07:00

10 KiB
Raw Blame History

10 - Tracing & Observability

Records agent run activities asynchronously. Spans are buffered in memory and flushed to the TracingStore in batches, with optional export to external OpenTelemetry backends.

Tracing uses PostgreSQL. The traces and spans tables store all tracing data. Optional OTel export sends spans to external backends (Jaeger, Grafana Tempo, Datadog) in addition to PostgreSQL.


1. Collector -- Buffer-Flush Architecture

flowchart TD
    EMIT["EmitSpan(span)"] --> BUF["spanCh<br/>(buffered channel, cap = 1000)"]
    BUF --> FLUSH["flushLoop() -- every 5s"]
    FLUSH --> DRAIN["Drain all spans from channel"]
    DRAIN --> BATCH["BatchCreateSpans() to PostgreSQL"]
    DRAIN --> OTEL["OTelExporter.ExportSpans()<br/>to OTLP backend (if configured)"]
    DRAIN --> AGG["Update aggregates<br/>for dirty traces"]

    FULL{"Buffer full?"} -.->|"Drop + warning log"| BUF

Trace Lifecycle

flowchart LR
    CT["CreateTrace()<br/>(synchronous, 1 per run)"] --> ES["EmitSpan()<br/>(async, buffered)"]
    ES --> FT["FinishTrace()<br/>(status, error, output preview)"]

Cancel Handling

When a run is cancelled via /stop or /stopall, the run context is cancelled but trace finalization still needs to persist. FinishTrace() detects ctx.Err() != nil and switches to context.Background() for the final database write. The trace status is set to "cancelled" instead of "error".

Context values (traceID, collector) survive cancellation -- only ctx.Done() and ctx.Err() change. This allows trace finalization to find everything it needs with a fresh context for the DB call.


2. Span Types & Hierarchy

Type Description OTel Kind
llm_call LLM provider call Client
tool_call Tool execution Internal
agent Root agent span (parents all child spans) Internal
embedding Embedding generation (vector store operations) Internal
event Discrete event marker (no duration) Internal
flowchart TD
    AGENT["Agent Span (root)<br/>parents all child spans"] --> LLM1["LLM Call Span 1<br/>(model, tokens, finish reason)"]
    AGENT --> TOOL1["Tool Span: exec<br/>(tool_name, duration)"]
    AGENT --> LLM2["LLM Call Span 2"]
    AGENT --> TOOL2["Tool Span: read_file"]
    AGENT --> EMB["Embedding Span<br/>(vector store operation)"]
    AGENT --> LLM3["LLM Call Span 3"]

Token Aggregation

Token counts are aggregated only from llm_call spans (not agent spans) to avoid double-counting. The BatchUpdateTraceAggregates() method sums input_tokens and output_tokens from spans where span_type = 'llm_call' and writes the totals to the parent trace record.


3. Verbose Mode

Mode InputPreview OutputPreview
Normal Not recorded 500 characters max
Verbose (GOCLAW_TRACE_VERBOSE=1) Up to 200KB Up to 200KB

Verbose mode is useful for debugging LLM conversations. When enabled via GOCLAW_TRACE_VERBOSE=1:

  • LLM spans: Full input messages (including system prompt, history, and tool results) are serialized as JSON and stored in InputPreview (truncated at 200KB). LLM response content is stored in OutputPreview (truncated at 200KB, includes <thinking> tag if present).
  • Tool spans: Tool input and output are both recorded up to 200KB.
  • Agent span: Input message and output are both recorded up to 200KB.

In normal mode, previews are truncated to 500 characters max to minimize storage overhead.


4. OTel Export

Optional OpenTelemetry OTLP exporter that sends spans to external observability backends.

flowchart TD
    COLLECTOR["Collector flush cycle"] --> CHECK{"SpanExporter set?"}
    CHECK -->|No| PG_ONLY["Write to PostgreSQL only"]
    CHECK -->|Yes| BOTH["Write to PostgreSQL<br/>+ ExportSpans() to OTLP backend"]
    BOTH --> BACKEND["Jaeger / Tempo / Datadog"]

OTel Configuration

Parameter Description
endpoint OTLP endpoint (e.g., localhost:4317 for gRPC, localhost:4318 for HTTP)
protocol grpc (default) or http
insecure Skip TLS for local development
service_name OTel service name (default: goclaw-gateway)
headers Extra headers (auth tokens, etc.)

Batch Processing

Parameter Value
Max batch size 100 spans
Batch timeout 5 seconds

The exporter lives in a separate sub-package (internal/tracing/otelexport/) so its gRPC and protobuf dependencies are isolated. Commenting out the import and wiring removes approximately 15-20MB from the binary. The exporter is attached to the Collector via SetExporter().


5. Cost Calculation

Per-span cost is calculated using the CalculateCost() function in internal/tracing/cost.go. For each LLM call span:

Cost = (PromptTokens × InputCostPerMillion) / 1,000,000
      + (CompletionTokens × OutputCostPerMillion) / 1,000,000
      + (CacheReadTokens × CacheReadCostPerMillion) / 1,000,000
      + (CacheCreationTokens × CacheCreateCostPerMillion) / 1,000,000

Model pricing is loaded from config.ModelPricing and keyed by provider/model (with fallback to model only). Cost is stored in the total_cost field of each LLM call span. The trace aggregation sums costs from all child llm_call spans to compute the trace-level total_cost.

Cache token costs (read + create) are optional and only applied if the pricing config specifies non-zero values.


6. Snapshot Worker -- Realtime Usage Aggregation

The SnapshotWorker periodically aggregates trace and span data into hourly usage_snapshots for realtime analytics and dashboard displays.

Operation

  • Schedule: Ticks every hour at HH:05:00 UTC (5 minutes past the hour)
  • Catch-up: On startup and after each tick, computes snapshots for all missed hours
  • Backfill: Backfill() method populates historical snapshots from the earliest trace to now

Snapshot Dimensions

For each hour [00:00, 01:00), the worker creates two types of snapshot rows:

  1. Totals Row (provider="", model="") — Aggregated from traces:

    • request_count — Count of root traces
    • error_count — Count of failed traces
    • unique_users — Distinct user_id in traces
    • input_tokens, output_tokens — Sum from all child llm_call spans
    • total_cost — Sum of costs from all child llm_call spans
    • tool_call_count — Sum from traces
    • avg_duration_ms — Average trace duration
    • memory_docs, memory_chunks — Point-in-time count (attached to agent's totals row only)
    • kg_entities, kg_relations — Point-in-time count (attached to agent's totals row only)
  2. Detail Rows (provider + model specified) — Aggregated from llm_call spans:

    • llm_call_count — Count of LLM calls for this provider/model
    • input_tokens, output_tokens — Sum of tokens
    • total_cost — Sum of per-call costs
    • cache_read_tokens, cache_create_tokens, thinking_tokens — Sum from span metadata

Grouping: by (agent_id, channel) for totals; by (agent_id, channel, provider, model) for details.

Usage

worker := tracing.NewSnapshotWorker(db, snapshotStore)
worker.Start()

// Later:
hoursBackfilled, err := worker.Backfill(ctx)
worker.Stop()

7. Trace HTTP API

Method Path Description
GET /v1/traces List traces with pagination and filters
GET /v1/traces/{id} Get trace details with all spans

Query Filters

Parameter Type Description
agent_id UUID Filter by agent
user_id string Filter by user
status string Filter by status (running, success, error, cancelled)
from / to timestamp Date range filter
limit int Page size (default 50)
offset int Pagination offset

8. Delegation History

Delegation history records are stored in the delegation_history table and exposed alongside traces for cross-referencing agent interactions.

Channel Endpoint Details
WebSocket RPC delegations.list / delegations.get Results truncated (500 runes for list, 8000 for detail)
HTTP API GET /v1/delegations / GET /v1/delegations/{id} Full records
Agent tool delegate(action="history") Agent self-checking past delegations

Delegation history is automatically recorded by DelegateManager.saveDelegationHistory() for every delegation (sync/async). Each record includes source agent, target agent, input, result, duration, and status.


File Reference

File Description
internal/tracing/collector.go Collector buffer-flush, EmitSpan, FinishTrace, verbose mode
internal/tracing/context.go Trace context propagation (TraceID, ParentSpanID, DelegateParentTraceID)
internal/tracing/cost.go Cost calculation and pricing lookup
internal/tracing/snapshot_worker.go Hourly usage aggregation into snapshots
internal/tracing/otelexport/exporter.go OTel OTLP exporter (gRPC + HTTP)
internal/store/tracing_store.go TracingStore interface, span/trace type constants
internal/store/pg/tracing.go PostgreSQL trace/span persistence + aggregation
internal/http/traces.go Trace HTTP API handler (GET /v1/traces)
internal/agent/loop_tracing.go Span emission from agent loop (LLM, tool, agent spans)
internal/http/delegations.go Delegation history HTTP API handler
internal/gateway/methods/delegations.go Delegation history RPC handlers

Cross-References

Document Relevant Content
01-agent-loop.md Span emission during agent execution, cancel handling
03-tools-system.md Delegation system, delegation history via agent tool
06-store-data-model.md traces/spans tables schema, delegation_history table
08-scheduling-cron.md Scheduler lanes, /stop and /stopall commands
09-security.md Rate limiting, RBAC access control