10 - Tracing & Observability
The tracing subsystem records agent run activity asynchronously. Spans are buffered in memory and flushed to the TracingStore in batches, with optional export to external OpenTelemetry backends.
Tracing uses PostgreSQL. The `traces` and `spans` tables store all tracing data. Optional OTel export sends spans to external backends (Jaeger, Grafana Tempo, Datadog) in addition to PostgreSQL.
1. Collector -- Buffer-Flush Architecture
flowchart TD
EMIT["EmitSpan(span)"] --> BUF["spanCh<br/>(buffered channel, cap = 1000)"]
BUF --> FLUSH["flushLoop() -- every 5s"]
FLUSH --> DRAIN["Drain all spans from channel"]
DRAIN --> BATCH["BatchCreateSpans() to PostgreSQL"]
DRAIN --> OTEL["OTelExporter.ExportSpans()<br/>to OTLP backend (if configured)"]
DRAIN --> AGG["Update aggregates<br/>for dirty traces"]
FULL{"Buffer full?"} -.->|"Drop + warning log"| BUF
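The buffer-flush flow above can be sketched in Go as follows. This is a minimal illustration, not the actual collector: the `Span` fields, the `flush` callback, `Start`/`Stop`, and the final flush on shutdown are assumptions made for the example. Only the buffered channel, the periodic drain, and the drop-when-full behaviour come from the diagram.

```go
package main

import (
	"fmt"
	"time"
)

// Span is a hypothetical stand-in for the real span record.
type Span struct {
	TraceID string
	Name    string
}

// Collector buffers spans in a channel and hands them to flush in batches.
type Collector struct {
	spanCh  chan Span
	flush   func(batch []Span) // e.g. a batch insert into the store
	done    chan struct{}
	stopped chan struct{}
}

func NewCollector(capacity int, flush func([]Span)) *Collector {
	return &Collector{
		spanCh:  make(chan Span, capacity),
		flush:   flush,
		done:    make(chan struct{}),
		stopped: make(chan struct{}),
	}
}

// EmitSpan never blocks: when the buffer is full the span is dropped and
// false is returned (the diagram describes a warning log at this point).
func (c *Collector) EmitSpan(s Span) bool {
	select {
	case c.spanCh <- s:
		return true
	default:
		return false
	}
}

// drain empties whatever is currently buffered without blocking.
func (c *Collector) drain() []Span {
	var batch []Span
	for {
		select {
		case s := <-c.spanCh:
			batch = append(batch, s)
		default:
			return batch
		}
	}
}

func (c *Collector) flushOnce() {
	if batch := c.drain(); len(batch) > 0 {
		c.flush(batch)
	}
}

// Start runs the flush loop in a goroutine until Stop is called.
func (c *Collector) Start(interval time.Duration) {
	go func() {
		defer close(c.stopped)
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				c.flushOnce()
			case <-c.done:
				c.flushOnce() // assumed: final flush so buffered spans are not lost
				return
			}
		}
	}()
}

// Stop signals the loop and waits for the final flush to finish.
func (c *Collector) Stop() {
	close(c.done)
	<-c.stopped
}

func main() {
	c := NewCollector(2, func(b []Span) { fmt.Println("flushed", len(b), "spans") })
	c.Start(5 * time.Second)
	for i := 0; i < 3; i++ {
		ok := c.EmitSpan(Span{TraceID: "t1", Name: fmt.Sprintf("span-%d", i)})
		fmt.Println("accepted:", ok)
	}
	c.Stop()
}
```

The non-blocking `select` with a `default` branch is what keeps span emission from stalling the agent loop when the buffer is full; the trade-off is that spans can be lost under sustained load.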
Trace Lifecycle
flowchart LR
CT["CreateTrace()<br/>(synchronous, 1 per run)"] --> ES["EmitSpan()<br/>(async, buffered)"]
ES --> FT["FinishTrace()<br/>(status, error, output preview)"]
Cancel Handling
When a run is cancelled via /stop or /stopall, the run context is cancelled but trace finalization still needs to persist. FinishTrace() detects ctx.Err() != nil and switches to context.Background() for the final database write. The trace status is set to "cancelled" instead of "error".
Context values (traceID, collector) survive cancellation -- only ctx.Done() and ctx.Err() change. This allows trace finalization to find everything it needs with a fresh context for the DB call.
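A small sketch of this pattern, assuming a hypothetical context key and a simplified `finishTrace` signature (the real `FinishTrace()` signature is not shown in this document):

```go
package main

import (
	"context"
	"fmt"
)

type ctxKey string

// traceIDKey is a hypothetical key name used only for this example.
const traceIDKey ctxKey = "traceID"

// finishTrace picks the final status and a context that is safe to use for
// the last database write, even when the run context is already cancelled.
func finishTrace(ctx context.Context, runErr error) (status, traceID string, writeCtx context.Context) {
	// Values attached to ctx remain readable after cancellation.
	traceID, _ = ctx.Value(traceIDKey).(string)

	writeCtx = ctx
	switch {
	case ctx.Err() != nil:
		// The run was cancelled: persist with a fresh context instead.
		status = "cancelled"
		writeCtx = context.Background()
	case runErr != nil:
		status = "error"
	default:
		status = "success"
	}
	return status, traceID, writeCtx
}

// cancelledRun simulates a run that was stopped before finalization.
func cancelledRun() (string, string, error) {
	base := context.WithValue(context.Background(), traceIDKey, "trace-123")
	ctx, cancel := context.WithCancel(base)
	cancel() // e.g. triggered by /stop

	status, id, writeCtx := finishTrace(ctx, ctx.Err())
	return status, id, writeCtx.Err()
}

func main() {
	status, id, writeErr := cancelledRun()
	fmt.Println(status, id, writeErr)
}
```

Checking `ctx.Err()` before the run error matters here: a cancelled run usually also returns an error, and the cancellation check has to win for the status to be "cancelled" rather than "error".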
2. Span Types & Hierarchy
| Type | Description | OTel Kind |
|---|---|---|
| `llm_call` | LLM provider call | Client |
| `tool_call` | Tool execution | Internal |
| `agent` | Root agent span (parents all child spans) | Internal |
| `embedding` | Embedding generation (vector store operations) | Internal |
| `event` | Discrete event marker (no duration) | Internal |
flowchart TD
AGENT["Agent Span (root)<br/>parents all child spans"] --> LLM1["LLM Call Span 1<br/>(model, tokens, finish reason)"]
AGENT --> TOOL1["Tool Span: exec<br/>(tool_name, duration)"]
AGENT --> LLM2["LLM Call Span 2"]
AGENT --> TOOL2["Tool Span: read_file"]
AGENT --> EMB["Embedding Span<br/>(vector store operation)"]
AGENT --> LLM3["LLM Call Span 3"]
Token Aggregation
Token counts are aggregated only from llm_call spans (not agent spans) to avoid double-counting. The BatchUpdateTraceAggregates() method sums input_tokens and output_tokens from spans where span_type = 'llm_call' and writes the totals to the parent trace record.
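The aggregation rule can be illustrated in memory. This sketch is not the SQL behind `BatchUpdateTraceAggregates()`; the struct fields are assumptions, and the agent span is given rolled-up token counts purely to show what would be double-counted.

```go
package main

import "fmt"

// Span carries only the fields needed for the example (names assumed).
type Span struct {
	TraceID      string
	SpanType     string // "llm_call", "tool_call", "agent", ...
	InputTokens  int
	OutputTokens int
}

// Totals is the per-trace aggregate written back to the trace record.
type Totals struct {
	InputTokens  int
	OutputTokens int
}

// aggregateTokens sums tokens per trace, counting llm_call spans only.
func aggregateTokens(spans []Span) map[string]Totals {
	out := make(map[string]Totals)
	for _, s := range spans {
		if s.SpanType != "llm_call" {
			continue // other span types would count the same tokens twice
		}
		t := out[s.TraceID]
		t.InputTokens += s.InputTokens
		t.OutputTokens += s.OutputTokens
		out[s.TraceID] = t
	}
	return out
}

func main() {
	spans := []Span{
		{TraceID: "t1", SpanType: "agent", InputTokens: 300, OutputTokens: 90},
		{TraceID: "t1", SpanType: "llm_call", InputTokens: 100, OutputTokens: 40},
		{TraceID: "t1", SpanType: "llm_call", InputTokens: 200, OutputTokens: 50},
		{TraceID: "t1", SpanType: "tool_call"},
	}
	fmt.Printf("%+v\n", aggregateTokens(spans)["t1"])
}
```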
3. Verbose Mode
| Mode | InputPreview | OutputPreview |
|---|---|---|
| Normal | Not recorded | 500 characters max |
| Verbose (`GOCLAW_TRACE_VERBOSE=1`) | Up to 200KB | Up to 200KB |
Verbose mode is useful for debugging LLM conversations. When enabled via GOCLAW_TRACE_VERBOSE=1:
- LLM spans: Full input messages (including system prompt, history, and tool results) are serialized as JSON and stored in `InputPreview` (truncated at 200KB). LLM response content is stored in `OutputPreview` (truncated at 200KB, includes the `<thinking>` tag if present).
- Tool spans: Tool input and output are both recorded up to 200KB.
- Agent span: Input message and output are both recorded up to 200KB.
In normal mode, previews are truncated to 500 characters max to minimize storage overhead.
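The two limits could be applied with helpers along these lines. This is a sketch under stated assumptions: "500 characters" is read as 500 runes, "200KB" as 200 × 1024 bytes, and the helper names are invented for the example. The actual collector may count differently.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"unicode/utf8"
)

const (
	normalPreviewChars  = 500        // normal-mode limit from the table above
	verbosePreviewBytes = 200 * 1024 // assumes "200KB" means 200 KiB
)

// truncateChars keeps at most max characters (runes) of s.
func truncateChars(s string, max int) string {
	if utf8.RuneCountInString(s) <= max {
		return s
	}
	return string([]rune(s)[:max])
}

// truncateBytes keeps at most max bytes of s without splitting a rune.
func truncateBytes(s string, max int) string {
	if len(s) <= max {
		return s
	}
	cut := max
	for cut > 0 && !utf8.RuneStart(s[cut]) {
		cut-- // step back to the start of the rune that straddles the limit
	}
	return s[:cut]
}

// outputPreview applies the limit for the active mode.
func outputPreview(s string, verbose bool) string {
	if verbose {
		return truncateBytes(s, verbosePreviewBytes)
	}
	return truncateChars(s, normalPreviewChars)
}

func main() {
	verbose := os.Getenv("GOCLAW_TRACE_VERBOSE") == "1"
	long := strings.Repeat("x", 1200)
	fmt.Println("verbose:", verbose, "preview length:", len(outputPreview(long, verbose)))
}
```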
4. OTel Export
Optional OpenTelemetry OTLP exporter that sends spans to external observability backends.
flowchart TD
COLLECTOR["Collector flush cycle"] --> CHECK{"SpanExporter set?"}
CHECK -->|No| PG_ONLY["Write to PostgreSQL only"]
CHECK -->|Yes| BOTH["Write to PostgreSQL<br/>+ ExportSpans() to OTLP backend"]
BOTH --> BACKEND["Jaeger / Tempo / Datadog"]
OTel Configuration
| Parameter | Description |
|---|---|
| `endpoint` | OTLP endpoint (e.g., `localhost:4317` for gRPC, `localhost:4318` for HTTP) |
| `protocol` | `grpc` (default) or `http` |
| `insecure` | Skip TLS for local development |
| `service_name` | OTel service name (default: `goclaw-gateway`) |
| `headers` | Extra headers (auth tokens, etc.) |
Batch Processing
| Parameter | Value |
|---|---|
| Max batch size | 100 spans |
| Batch timeout | 5 seconds |
The exporter lives in a separate sub-package (internal/tracing/otelexport/) so its gRPC and protobuf dependencies are isolated. Commenting out the import and wiring removes approximately 15-20MB from the binary. The exporter is attached to the Collector via SetExporter().
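The optional-exporter wiring can be sketched with a small interface. The names `SpanExporter`, `ExportSpans`, and `SetExporter` appear in this document, but the method signatures, the `Collector` fields, and the error handling below are assumptions for illustration.

```go
package main

import "fmt"

// Span is a placeholder for the real span type.
type Span struct{ Name string }

// SpanExporter is the optional export hook; the signature is assumed.
type SpanExporter interface {
	ExportSpans(spans []Span) error
}

// Collector is reduced to the two sinks relevant to this example.
type Collector struct {
	stored   []Span       // stands in for the PostgreSQL write
	exporter SpanExporter // nil when OTel export is not configured
}

// SetExporter attaches an exporter after construction.
func (c *Collector) SetExporter(e SpanExporter) { c.exporter = e }

// flush always persists the batch and additionally exports it when an
// exporter is attached. An export error does not undo the stored batch.
func (c *Collector) flush(batch []Span) error {
	c.stored = append(c.stored, batch...)
	if c.exporter == nil {
		return nil
	}
	return c.exporter.ExportSpans(batch)
}

// countingExporter is a test double that records how many spans it received.
type countingExporter struct{ exported int }

func (e *countingExporter) ExportSpans(spans []Span) error {
	e.exported += len(spans)
	return nil
}

func main() {
	c := &Collector{}
	_ = c.flush([]Span{{Name: "llm_call"}}) // stored only

	exp := &countingExporter{}
	c.SetExporter(exp)
	_ = c.flush([]Span{{Name: "tool_call"}, {Name: "agent"}}) // stored and exported

	fmt.Println("stored:", len(c.stored), "exported:", exp.exported)
}
```

Keeping the collector dependent only on an interface is what allows the concrete OTLP implementation, with its gRPC and protobuf dependencies, to live in a separate sub-package.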
5. Cost Calculation
Per-span cost is calculated using the CalculateCost() function in internal/tracing/cost.go. For each LLM call span:
Cost = (PromptTokens × InputCostPerMillion) / 1,000,000
+ (CompletionTokens × OutputCostPerMillion) / 1,000,000
+ (CacheReadTokens × CacheReadCostPerMillion) / 1,000,000
+ (CacheCreationTokens × CacheCreateCostPerMillion) / 1,000,000
Model pricing is loaded from config.ModelPricing and keyed by provider/model (with fallback to model only). Cost is stored in the total_cost field of each LLM call span. The trace aggregation sums costs from all child llm_call spans to compute the trace-level total_cost.
Cache token costs (read + create) are optional and only applied if the pricing config specifies non-zero values.
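A worked version of the formula and the pricing lookup, as a sketch only. The struct and function names here (`ModelPricing` fields, `Usage`, `lookupPricing`, `estimateCost`) are chosen for the example and are not the actual signatures in `internal/tracing/cost.go`.

```go
package main

import "fmt"

// ModelPricing holds prices per one million tokens (field names assumed
// from the formula above).
type ModelPricing struct {
	InputCostPerMillion       float64
	OutputCostPerMillion      float64
	CacheReadCostPerMillion   float64
	CacheCreateCostPerMillion float64
}

// Usage holds the token counts of a single LLM call.
type Usage struct {
	PromptTokens        int
	CompletionTokens    int
	CacheReadTokens     int
	CacheCreationTokens int
}

// lookupPricing tries the "provider/model" key first, then falls back to
// the bare model name.
func lookupPricing(table map[string]ModelPricing, provider, model string) (ModelPricing, bool) {
	if p, ok := table[provider+"/"+model]; ok {
		return p, true
	}
	p, ok := table[model]
	return p, ok
}

// estimateCost applies the formula above. Cache prices left at zero
// contribute nothing, so they are effectively optional.
func estimateCost(p ModelPricing, u Usage) float64 {
	const perMillion = 1_000_000.0
	return float64(u.PromptTokens)*p.InputCostPerMillion/perMillion +
		float64(u.CompletionTokens)*p.OutputCostPerMillion/perMillion +
		float64(u.CacheReadTokens)*p.CacheReadCostPerMillion/perMillion +
		float64(u.CacheCreationTokens)*p.CacheCreateCostPerMillion/perMillion
}

func main() {
	// Illustrative prices, not real provider pricing.
	table := map[string]ModelPricing{
		"example-model": {InputCostPerMillion: 3, OutputCostPerMillion: 15, CacheReadCostPerMillion: 0.3},
	}
	p, ok := lookupPricing(table, "example-provider", "example-model")
	if !ok {
		fmt.Println("no pricing configured")
		return
	}
	u := Usage{PromptTokens: 1000, CompletionTokens: 500, CacheReadTokens: 2000}
	// 1000*3/1e6 + 500*15/1e6 + 2000*0.3/1e6 = 0.003 + 0.0075 + 0.0006
	fmt.Printf("%.4f\n", estimateCost(p, u))
}
```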
6. Snapshot Worker -- Realtime Usage Aggregation
The SnapshotWorker periodically aggregates trace and span data into hourly usage_snapshots for realtime analytics and dashboard displays.
Operation
- Schedule: Ticks every hour at HH:05:00 UTC (5 minutes past the hour)
- Catch-up: On startup and after each tick, computes snapshots for all missed hours
- Backfill: The `Backfill()` method populates historical snapshots from the earliest trace to now
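The schedule and catch-up rules can be expressed with a little time arithmetic. This is a sketch, not the worker's code: `nextTick`, `missedHours`, and the `utc` helper are hypothetical names, and it assumes that only fully elapsed hours are snapshotted.

```go
package main

import (
	"fmt"
	"time"
)

// nextTick returns the first HH:05:00 UTC instant strictly after now.
func nextTick(now time.Time) time.Time {
	now = now.UTC()
	tick := now.Truncate(time.Hour).Add(5 * time.Minute)
	if !tick.After(now) {
		tick = tick.Add(time.Hour)
	}
	return tick
}

// missedHours returns the start of every completed hour after the hour of
// the last snapshot and before the hour that contains now.
func missedHours(last, now time.Time) []time.Time {
	var hours []time.Time
	end := now.UTC().Truncate(time.Hour)
	for h := last.UTC().Truncate(time.Hour).Add(time.Hour); h.Before(end); h = h.Add(time.Hour) {
		hours = append(hours, h)
	}
	return hours
}

// utc builds a timestamp on an arbitrary fixed day for the examples.
func utc(hour, minute int) time.Time {
	return time.Date(2025, time.January, 1, hour, minute, 0, 0, time.UTC)
}

func main() {
	fmt.Println(nextTick(utc(10, 3)).Format("15:04"))  // before this hour's tick
	fmt.Println(nextTick(utc(10, 30)).Format("15:04")) // after this hour's tick

	// Last snapshot covers 07:00; the worker wakes up at 10:05.
	for _, h := range missedHours(utc(7, 0), utc(10, 5)) {
		fmt.Println("catch up hour starting", h.Format("15:04"))
	}
}
```

Running the tick five minutes past the hour leaves a margin for spans from the previous hour that are still sitting in the collector buffer when the hour ends.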
Snapshot Dimensions
For each one-hour window (for example [00:00, 01:00)), the worker creates two types of snapshot rows:
- Totals Row (`provider=""`, `model=""`), aggregated from traces:
  - `request_count`: Count of root traces
  - `error_count`: Count of failed traces
  - `unique_users`: Distinct `user_id` in traces
  - `input_tokens`, `output_tokens`: Sum from all child `llm_call` spans
  - `total_cost`: Sum of costs from all child `llm_call` spans
  - `tool_call_count`: Sum from traces
  - `avg_duration_ms`: Average trace duration
  - `memory_docs`, `memory_chunks`: Point-in-time count (attached to agent's totals row only)
  - `kg_entities`, `kg_relations`: Point-in-time count (attached to agent's totals row only)
- Detail Rows (`provider` + `model` specified), aggregated from `llm_call` spans:
  - `llm_call_count`: Count of LLM calls for this provider/model
  - `input_tokens`, `output_tokens`: Sum of tokens
  - `total_cost`: Sum of per-call costs
  - `cache_read_tokens`, `cache_create_tokens`, `thinking_tokens`: Sum from span metadata
Grouping: by (agent_id, channel) for totals; by (agent_id, channel, provider, model) for details.
Usage
worker := tracing.NewSnapshotWorker(db, snapshotStore)
worker.Start()
// Later:
hoursBackfilled, err := worker.Backfill(ctx)
worker.Stop()
7. Trace HTTP API
| Method | Path | Description |
|---|---|---|
| GET | `/v1/traces` | List traces with pagination and filters |
| GET | `/v1/traces/{id}` | Get trace details with all spans |
Query Filters
| Parameter | Type | Description |
|---|---|---|
| `agent_id` | UUID | Filter by agent |
| `user_id` | string | Filter by user |
| `status` | string | Filter by status (`running`, `success`, `error`, `cancelled`) |
| `from` / `to` | timestamp | Date range filter |
| `limit` | int | Page size (default 50) |
| `offset` | int | Pagination offset |
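A client might assemble a list request as sketched below. The base URL is a placeholder, `TraceFilter` and `listTracesURL` are hypothetical names, and `from` / `to` are left out because the expected timestamp format is not specified here. Authentication is also omitted.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// TraceFilter mirrors a subset of the query parameters in the table above.
type TraceFilter struct {
	AgentID string
	UserID  string
	Status  string
	Limit   int
	Offset  int
}

// listTracesURL builds the request URL; empty or zero fields are omitted so
// the server-side defaults apply.
func listTracesURL(base string, f TraceFilter) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = "/v1/traces"

	q := url.Values{}
	if f.AgentID != "" {
		q.Set("agent_id", f.AgentID)
	}
	if f.UserID != "" {
		q.Set("user_id", f.UserID)
	}
	if f.Status != "" {
		q.Set("status", f.Status)
	}
	if f.Limit > 0 {
		q.Set("limit", strconv.Itoa(f.Limit))
	}
	if f.Offset > 0 {
		q.Set("offset", strconv.Itoa(f.Offset))
	}
	u.RawQuery = q.Encode() // Encode sorts parameters by key
	return u.String(), nil
}

func main() {
	// http://localhost:8080 is a placeholder address, not a documented default.
	target, err := listTracesURL("http://localhost:8080", TraceFilter{Status: "error", Limit: 20, Offset: 40})
	if err != nil {
		fmt.Println("bad base URL:", err)
		return
	}
	fmt.Println(target)
}
```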
8. Delegation History
Delegation history records are stored in the delegation_history table and exposed alongside traces for cross-referencing agent interactions.
| Channel | Endpoint | Details |
|---|---|---|
| WebSocket RPC | `delegations.list` / `delegations.get` | Results truncated (500 runes for list, 8000 for detail) |
| HTTP API | `GET /v1/delegations` / `GET /v1/delegations/{id}` | Full records |
| Agent tool | `delegate(action="history")` | Agent self-checking past delegations |
Delegation history is automatically recorded by DelegateManager.saveDelegationHistory() for every delegation (sync/async). Each record includes source agent, target agent, input, result, duration, and status.
File Reference
| File | Description |
|---|---|
| `internal/tracing/collector.go` | Collector buffer-flush, EmitSpan, FinishTrace, verbose mode |
| `internal/tracing/context.go` | Trace context propagation (TraceID, ParentSpanID, DelegateParentTraceID) |
| `internal/tracing/cost.go` | Cost calculation and pricing lookup |
| `internal/tracing/snapshot_worker.go` | Hourly usage aggregation into snapshots |
| `internal/tracing/otelexport/exporter.go` | OTel OTLP exporter (gRPC + HTTP) |
| `internal/store/tracing_store.go` | TracingStore interface, span/trace type constants |
| `internal/store/pg/tracing.go` | PostgreSQL trace/span persistence + aggregation |
| `internal/http/traces.go` | Trace HTTP API handler (GET /v1/traces) |
| `internal/agent/loop_tracing.go` | Span emission from agent loop (LLM, tool, agent spans) |
| `internal/http/delegations.go` | Delegation history HTTP API handler |
| `internal/gateway/methods/delegations.go` | Delegation history RPC handlers |
Cross-References
| Document | Relevant Content |
|---|---|
| 01-agent-loop.md | Span emission during agent execution, cancel handling |
| 03-tools-system.md | Delegation system, delegation history via agent tool |
| 06-store-data-model.md | traces/spans tables schema, delegation_history table |
| 08-scheduling-cron.md | Scheduler lanes, /stop and /stopall commands |
| 09-security.md | Rate limiting, RBAC access control |