mirror of https://github.com/tiennm99/goclaw.git synced 2026-06-10 12:10:53 +00:00


Viet Tran 037d18f711 docs: comprehensive audit and update of all documentation (#231)

* feat(ui): improve kanban UX, fix dialog scroll, remove delegation page

- Kanban: reorder columns (blocked after pending), show blocked-by info
  on cards, clickable blocker links in task detail, framer-motion card
  animation between columns
- Dialogs: standardize scroll pattern across all modals — header fixed,
  scrollbar flush with outer edge via negative margin trick
- Remove delegation page, types, events, i18n, routes, and all references
- Fix activity_logs NULL jsonb scan error (COALESCE)
- Board header: show text labels on action buttons (desktop)

* docs: comprehensive audit and update of all documentation

- Update Go 1.25 → 1.26, PostgreSQL 15+ → 18 across all docs
- Add 10 missing internal modules to CLAUDE.md project structure
- Expand provider docs from 2 to 6 packages (Anthropic, OpenAI, DashScope, Claude CLI, ACP, Codex)
- Add 8 missing store interfaces to data model docs (22 total)
- Update bootstrap files from 7 to 13 templates
- Expand tool inventory from ~35 to 60+ tools with media/KG/credential categories
- Fix Team Task Board: add blocked status, 3 missing actions, V2 versioning, delegate restrictions
- Remove all references to removed features: handoff, delegate_search, evaluate_loop, agent_links
- Fix lane defaults (2/4/1 → 30/50/100/30), ghost file references, models.list → providers.models
- Add SecureCLI, snapshot worker, cost calculation, pairing security docs
- Comprehensive changelog catch-up
- Trim docs/03-tools-system.md to 800-line limit

2026-03-16 22:51:57 +07:00

10 KiB


10 - Tracing & Observability

The tracing subsystem records agent run activity asynchronously. Spans are buffered in memory and flushed to the TracingStore in batches.

All tracing data is stored in PostgreSQL, in the traces and spans tables. When OTel export is configured, spans are also sent to an external OpenTelemetry backend (Jaeger, Grafana Tempo, Datadog) in addition to PostgreSQL.

1. Collector -- Buffer-Flush Architecture

```mermaid
flowchart TD
    EMIT["EmitSpan(span)"] --> BUF["spanCh<br/>(buffered channel, cap = 1000)"]
    BUF --> FLUSH["flushLoop() -- every 5s"]
    FLUSH --> DRAIN["Drain all spans from channel"]
    DRAIN --> BATCH["BatchCreateSpans() to PostgreSQL"]
    DRAIN --> OTEL["OTelExporter.ExportSpans()<br/>to OTLP backend (if configured)"]
    DRAIN --> AGG["Update aggregates<br/>for dirty traces"]

    FULL{"Buffer full?"} -.->|"Drop + warning log"| BUF
```
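
The buffer-flush pattern in the diagram can be sketched in Go roughly as follows. This is a minimal illustration, not the project's collector: the `Span` fields, the `NewCollector` signature, and the `flush` callback are assumptions made for the example, and the real collector also exports to OTel and updates aggregates.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Span is a hypothetical, minimal stand-in for the real span type.
type Span struct {
	TraceID string
	Name    string
}

// Collector buffers spans in a channel and flushes them in batches.
type Collector struct {
	spanCh chan Span
	flush  func([]Span) // hypothetical sink, e.g. a batch insert
	stop   chan struct{}
	wg     sync.WaitGroup
}

func NewCollector(capacity int, interval time.Duration, flush func([]Span)) *Collector {
	c := &Collector{
		spanCh: make(chan Span, capacity),
		flush:  flush,
		stop:   make(chan struct{}),
	}
	c.wg.Add(1)
	go c.flushLoop(interval)
	return c
}

// EmitSpan never blocks: if the buffer is full, the span is dropped
// and a warning is printed. It reports whether the span was accepted.
func (c *Collector) EmitSpan(s Span) bool {
	select {
	case c.spanCh <- s:
		return true
	default:
		fmt.Println("warning: span buffer full, dropping span")
		return false
	}
}

func (c *Collector) flushLoop(interval time.Duration) {
	defer c.wg.Done()
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			c.drain()
		case <-c.stop:
			c.drain() // final flush on shutdown
			return
		}
	}
}

// drain empties the channel without blocking and hands the batch to flush.
func (c *Collector) drain() {
	var batch []Span
	for {
		select {
		case s := <-c.spanCh:
			batch = append(batch, s)
		default:
			if len(batch) > 0 {
				c.flush(batch)
			}
			return
		}
	}
}

// Stop signals the flush loop to exit and waits for the final flush.
func (c *Collector) Stop() {
	close(c.stop)
	c.wg.Wait()
}

func main() {
	// Tiny buffer and a long interval so the drop path is visible.
	c := NewCollector(2, time.Hour, func(b []Span) {
		fmt.Println("flushed", len(b), "spans")
	})
	c.EmitSpan(Span{TraceID: "t1", Name: "llm_call"})
	c.EmitSpan(Span{TraceID: "t1", Name: "tool_call"})
	c.EmitSpan(Span{TraceID: "t1", Name: "dropped"}) // buffer is full
	c.Stop()
}
```

The non-blocking `select` with a `default` branch is what keeps span emission from stalling the agent loop when the database is slow; the trade-off is that spans are lost under sustained overload.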

Trace Lifecycle

```mermaid
flowchart LR
    CT["CreateTrace()<br/>(synchronous, 1 per run)"] --> ES["EmitSpan()<br/>(async, buffered)"]
    ES --> FT["FinishTrace()<br/>(status, error, output preview)"]
```

Cancel Handling

When a run is cancelled via /stop or /stopall, the run context is cancelled, but the trace must still be finalized and persisted. FinishTrace() detects ctx.Err() != nil and switches to context.Background() for the final database write, because a write issued on the cancelled context would be rejected. The trace status is set to "cancelled" instead of "error".

Context values (traceID, collector) survive cancellation; only ctx.Done() and ctx.Err() change. FinishTrace() can therefore still read everything it needs from the cancelled context and use the fresh context only for the DB call.
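
A minimal sketch of this pattern, under stated assumptions: `finalizeContext`, `traceIDKey`, and `finishTraceDemo` are hypothetical names for illustration and do not reflect the actual FinishTrace() signature.

```go
package main

import (
	"context"
	"fmt"
)

type ctxKey string

// traceIDKey is a hypothetical context key used only in this example.
const traceIDKey ctxKey = "trace_id"

// finalizeContext picks the status and the context for the final write.
func finalizeContext(ctx context.Context, runErr error) (context.Context, string) {
	if ctx.Err() != nil {
		// Cancelled run: use a fresh context for the DB call and
		// report "cancelled" rather than "error".
		return context.Background(), "cancelled"
	}
	if runErr != nil {
		return ctx, "error"
	}
	return ctx, "success"
}

// finishTraceDemo simulates a /stop and shows what is still available.
func finishTraceDemo() (traceID, status string, dbCtxAlive bool) {
	parent := context.WithValue(context.Background(), traceIDKey, "trace-123")
	ctx, cancel := context.WithCancel(parent)
	cancel() // simulate /stop

	// Values remain readable from the cancelled context.
	traceID, _ = ctx.Value(traceIDKey).(string)

	dbCtx, status := finalizeContext(ctx, ctx.Err())
	return traceID, status, dbCtx.Err() == nil
}

func main() {
	fmt.Println(finishTraceDemo()) // trace-123 cancelled true
}
```

On Go 1.21 and later, `context.WithoutCancel(ctx)` is an alternative that drops cancellation while keeping the parent's values; whether it suits this codebase is a design choice not covered by the source.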

2. Span Types & Hierarchy

| Type | Description | OTel Kind |
|------|-------------|-----------|
| `llm_call` | LLM provider call | Client |
| `tool_call` | Tool execution | Internal |
| `agent` | Root agent span (parents all child spans) | Internal |
| `embedding` | Embedding generation (vector store operations) | Internal |
| `event` | Discrete event marker (no duration) | Internal |

```mermaid
flowchart TD
    AGENT["Agent Span (root)<br/>parents all child spans"] --> LLM1["LLM Call Span 1<br/>(model, tokens, finish reason)"]
    AGENT --> TOOL1["Tool Span: exec<br/>(tool_name, duration)"]
    AGENT --> LLM2["LLM Call Span 2"]
    AGENT --> TOOL2["Tool Span: read_file"]
    AGENT --> EMB["Embedding Span<br/>(vector store operation)"]
    AGENT --> LLM3["LLM Call Span 3"]
```

Token Aggregation

Token counts are aggregated only from llm_call spans (not agent spans) to avoid double-counting. The BatchUpdateTraceAggregates() method sums input_tokens and output_tokens from spans where span_type = 'llm_call' and writes the totals to the parent trace record.
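
The filtering rule can be illustrated in memory. This is a sketch only: the real aggregation runs against PostgreSQL, and the `Span` and `Totals` types here are hypothetical. It also assumes, for the sake of the example, that the agent span carries a rollup of its children's tokens.

```go
package main

import "fmt"

// Span is a hypothetical, reduced view of a span row.
type Span struct {
	TraceID      string
	SpanType     string
	InputTokens  int
	OutputTokens int
}

// Totals holds per-trace token sums.
type Totals struct{ Input, Output int }

// aggregateTokens sums tokens per trace from llm_call spans only.
func aggregateTokens(spans []Span) map[string]Totals {
	out := make(map[string]Totals)
	for _, s := range spans {
		if s.SpanType != "llm_call" {
			continue // agent, tool_call, etc. are skipped to avoid double-counting
		}
		t := out[s.TraceID]
		t.Input += s.InputTokens
		t.Output += s.OutputTokens
		out[s.TraceID] = t
	}
	return out
}

func main() {
	spans := []Span{
		{"t1", "agent", 300, 80}, // rollup on the root span; must be skipped
		{"t1", "llm_call", 100, 30},
		{"t1", "llm_call", 200, 50},
		{"t1", "tool_call", 0, 0},
	}
	fmt.Println(aggregateTokens(spans)["t1"]) // {300 80}
}
```

Had the agent span been included, the trace would report 600 input tokens instead of 300.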

3. Verbose Mode

| Mode | InputPreview | OutputPreview |
|------|--------------|---------------|
| Normal | Not recorded | 500 characters max |
| Verbose (`GOCLAW_TRACE_VERBOSE=1`) | Up to 200KB | Up to 200KB |

Verbose mode is useful for debugging LLM conversations. When enabled via GOCLAW_TRACE_VERBOSE=1:

- LLM spans: Full input messages (including system prompt, history, and tool results) are serialized as JSON and stored in InputPreview (truncated at 200KB). LLM response content is stored in OutputPreview (truncated at 200KB, includes <thinking> tag if present).
- Tool spans: Tool input and output are both recorded up to 200KB.
- Agent span: Input message and output are both recorded up to 200KB.

In normal mode, input previews are not recorded and output previews are truncated to 500 characters to minimize storage overhead.
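
The preview limits can be sketched as below. The helper names are hypothetical, and two details are assumptions not stated in the source: that "200KB" means 200 × 1024 bytes, and that the 500-character limit counts Unicode code points rather than bytes.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"unicode/utf8"
)

const (
	normalOutputLimit = 500        // characters (assumed to mean runes)
	verboseLimit      = 200 * 1024 // bytes (assumed meaning of "200KB")
)

// truncateRunes keeps at most max Unicode code points.
func truncateRunes(s string, max int) string {
	r := []rune(s)
	if len(r) <= max {
		return s
	}
	return string(r[:max])
}

// truncateBytes keeps at most max bytes without splitting a UTF-8 sequence.
func truncateBytes(s string, max int) string {
	if len(s) <= max {
		return s
	}
	for max > 0 && !utf8.RuneStart(s[max]) {
		max-- // back up to the start of the rune that straddles the cut
	}
	return s[:max]
}

// previews returns the input and output previews for the given mode.
func previews(verbose bool, input, output string) (in, out string) {
	if verbose {
		return truncateBytes(input, verboseLimit), truncateBytes(output, verboseLimit)
	}
	// Normal mode: no input preview, short output preview.
	return "", truncateRunes(output, normalOutputLimit)
}

func main() {
	verbose := os.Getenv("GOCLAW_TRACE_VERBOSE") == "1"
	in, out := previews(verbose, "prompt text", strings.Repeat("x", 1200))
	fmt.Println("input bytes:", len(in), "output bytes:", len(out))
}
```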

4. OTel Export

Optional OpenTelemetry OTLP exporter that sends spans to external observability backends.

```mermaid
flowchart TD
    COLLECTOR["Collector flush cycle"] --> CHECK{"SpanExporter set?"}
    CHECK -->|No| PG_ONLY["Write to PostgreSQL only"]
    CHECK -->|Yes| BOTH["Write to PostgreSQL<br/>+ ExportSpans() to OTLP backend"]
    BOTH --> BACKEND["Jaeger / Tempo / Datadog"]
```

OTel Configuration

| Parameter | Description |
|-----------|-------------|
| `endpoint` | OTLP endpoint (e.g., `localhost:4317` for gRPC, `localhost:4318` for HTTP) |
| `protocol` | `grpc` (default) or `http` |
| `insecure` | Skip TLS for local development |
| `service_name` | OTel service name (default: `goclaw-gateway`) |
| `headers` | Extra headers (auth tokens, etc.) |
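
As an illustration of how the documented defaults might be applied, here is a sketch with a hypothetical `OTelConfig` struct. The field names mirror the parameters above, but the struct, the `withDefaults` method, and the rule that `endpoint` is required are assumptions, not the project's actual config type.

```go
package main

import (
	"errors"
	"fmt"
)

// OTelConfig is a hypothetical mirror of the parameters in the table.
type OTelConfig struct {
	Endpoint    string
	Protocol    string
	Insecure    bool
	ServiceName string
	Headers     map[string]string
}

// withDefaults fills in the documented defaults and validates protocol.
func (c OTelConfig) withDefaults() (OTelConfig, error) {
	if c.Endpoint == "" {
		// Assumption: export cannot be enabled without an endpoint.
		return c, errors.New("otel: endpoint is required")
	}
	if c.Protocol == "" {
		c.Protocol = "grpc" // documented default
	}
	if c.Protocol != "grpc" && c.Protocol != "http" {
		return c, fmt.Errorf("otel: unsupported protocol %q", c.Protocol)
	}
	if c.ServiceName == "" {
		c.ServiceName = "goclaw-gateway" // documented default
	}
	return c, nil
}

func main() {
	cfg, err := OTelConfig{Endpoint: "localhost:4317", Insecure: true}.withDefaults()
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println(cfg.Protocol, cfg.ServiceName) // grpc goclaw-gateway
}
```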

Batch Processing

| Parameter | Value |
|-----------|-------|
| Max batch size | 100 spans |
| Batch timeout | 5 seconds |

The exporter lives in a separate sub-package (internal/tracing/otelexport/) so its gRPC and protobuf dependencies are isolated. Commenting out the import and wiring removes approximately 15-20MB from the binary. The exporter is attached to the Collector via SetExporter().

5. Cost Calculation

Per-span cost is calculated using the CalculateCost() function in internal/tracing/cost.go. For each LLM call span:

Cost = (PromptTokens × InputCostPerMillion) / 1,000,000
      + (CompletionTokens × OutputCostPerMillion) / 1,000,000
      + (CacheReadTokens × CacheReadCostPerMillion) / 1,000,000
      + (CacheCreationTokens × CacheCreateCostPerMillion) / 1,000,000

Model pricing is loaded from config.ModelPricing and keyed by provider/model (with fallback to model only). Cost is stored in the total_cost field of each LLM call span. The trace aggregation sums costs from all child llm_call spans to compute the trace-level total_cost.

Cache token costs (read + create) are optional and only applied if the pricing config specifies non-zero values.
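
The formula and the pricing lookup can be sketched as follows. The type and function names are hypothetical and do not reproduce the real CalculateCost() signature, and the model name and prices in the example are made up for illustration.

```go
package main

import "fmt"

// Pricing holds per-million-token prices (hypothetical type).
type Pricing struct {
	InputCostPerMillion       float64
	OutputCostPerMillion      float64
	CacheReadCostPerMillion   float64
	CacheCreateCostPerMillion float64
}

// Usage holds the token counts of one LLM call (hypothetical type).
type Usage struct {
	PromptTokens        int
	CompletionTokens    int
	CacheReadTokens     int
	CacheCreationTokens int
}

// lookupPricing tries "provider/model" first, then falls back to "model".
func lookupPricing(table map[string]Pricing, provider, model string) (Pricing, bool) {
	if p, ok := table[provider+"/"+model]; ok {
		return p, true
	}
	p, ok := table[model]
	return p, ok
}

// calculateCost applies the per-million formula. Cache terms contribute
// nothing when their configured price is zero.
func calculateCost(u Usage, p Pricing) float64 {
	const perMillion = 1_000_000.0
	return float64(u.PromptTokens)*p.InputCostPerMillion/perMillion +
		float64(u.CompletionTokens)*p.OutputCostPerMillion/perMillion +
		float64(u.CacheReadTokens)*p.CacheReadCostPerMillion/perMillion +
		float64(u.CacheCreationTokens)*p.CacheCreateCostPerMillion/perMillion
}

func main() {
	// Made-up prices for a made-up model, keyed by model only.
	table := map[string]Pricing{
		"example-model": {InputCostPerMillion: 3.0, OutputCostPerMillion: 15.0},
	}
	p, ok := lookupPricing(table, "example-provider", "example-model")
	if !ok {
		fmt.Println("no pricing configured")
		return
	}
	cost := calculateCost(Usage{PromptTokens: 1000, CompletionTokens: 500}, p)
	fmt.Printf("%.4f\n", cost) // 0.0105
}
```

In the example, 1,000 prompt tokens at 3.0 per million cost 0.003 and 500 completion tokens at 15.0 per million cost 0.0075, for a total of 0.0105.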

6. Snapshot Worker -- Realtime Usage Aggregation

The SnapshotWorker periodically aggregates trace and span data into hourly usage_snapshots for realtime analytics and dashboard displays.

Operation

- Schedule: Ticks every hour at HH:05:00 UTC (5 minutes past the hour)
- Catch-up: On startup and after each tick, computes snapshots for all missed hours
- Backfill: Backfill() method populates historical snapshots from the earliest trace to now
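
The schedule and catch-up rules could be computed along these lines. This is a sketch under assumptions: `nextTick`, `missedHours`, and `mustParse` are hypothetical helpers, and the worker's real bookkeeping of which hours are already snapshotted is not described in the source.

```go
package main

import (
	"fmt"
	"time"
)

// nextTick returns the next HH:05:00 UTC strictly after now.
func nextTick(now time.Time) time.Time {
	now = now.UTC()
	t := time.Date(now.Year(), now.Month(), now.Day(), now.Hour(), 5, 0, 0, time.UTC)
	if !t.After(now) {
		t = t.Add(time.Hour)
	}
	return t
}

// missedHours returns the start of every fully elapsed hour after
// lastDone (the start of the most recent hour already snapshotted),
// excluding the current, still incomplete hour.
func missedHours(lastDone, now time.Time) []time.Time {
	var out []time.Time
	end := now.UTC().Truncate(time.Hour) // start of the current hour
	for h := lastDone.UTC().Truncate(time.Hour).Add(time.Hour); h.Before(end); h = h.Add(time.Hour) {
		out = append(out, h)
	}
	return out
}

// mustParse is a small helper for the example timestamps.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	now := mustParse("2026-03-16T15:30:00Z")
	fmt.Println("next tick:", nextTick(now).Format(time.RFC3339))
	for _, h := range missedHours(mustParse("2026-03-16T12:00:00Z"), now) {
		fmt.Println("compute snapshot for hour starting", h.Format(time.RFC3339))
	}
}
```

Ticking five minutes past the hour presumably leaves time for late spans from the previous hour to be flushed before the bucket is aggregated; the source does not state the reason.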

Snapshot Dimensions

For each one-hour window (for example, [00:00, 01:00)), the worker creates two types of snapshot rows:

Totals Row (provider="", model="") — Aggregated from traces:
- request_count — Count of root traces
- error_count — Count of failed traces
- unique_users — Distinct user_id in traces
- input_tokens, output_tokens — Sum from all child llm_call spans
- total_cost — Sum of costs from all child llm_call spans
- tool_call_count — Sum from traces
- avg_duration_ms — Average trace duration
- memory_docs, memory_chunks — Point-in-time count (attached to agent's totals row only)
- kg_entities, kg_relations — Point-in-time count (attached to agent's totals row only)
Detail Rows (provider + model specified) — Aggregated from llm_call spans:
- llm_call_count — Count of LLM calls for this provider/model
- input_tokens, output_tokens — Sum of tokens
- total_cost — Sum of per-call costs
- cache_read_tokens, cache_create_tokens, thinking_tokens — Sum from span metadata

Grouping: by (agent_id, channel) for totals; by (agent_id, channel, provider, model) for details.
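
The detail-row grouping can be illustrated in memory. The real worker aggregates in the database; the `LLMCall`, `DetailKey`, and `DetailRow` types below are hypothetical and cover only a subset of the columns listed above.

```go
package main

import "fmt"

// LLMCall is a hypothetical, reduced view of one llm_call span.
type LLMCall struct {
	AgentID, Channel, Provider, Model string
	InputTokens, OutputTokens         int
	Cost                              float64
}

// DetailKey is the grouping key for detail rows.
type DetailKey struct {
	AgentID, Channel, Provider, Model string
}

// DetailRow accumulates per-provider/model metrics for one hour.
type DetailRow struct {
	LLMCallCount, InputTokens, OutputTokens int
	TotalCost                               float64
}

// detailRows groups calls by (agent_id, channel, provider, model).
func detailRows(calls []LLMCall) map[DetailKey]DetailRow {
	out := make(map[DetailKey]DetailRow)
	for _, c := range calls {
		k := DetailKey{c.AgentID, c.Channel, c.Provider, c.Model}
		r := out[k]
		r.LLMCallCount++
		r.InputTokens += c.InputTokens
		r.OutputTokens += c.OutputTokens
		r.TotalCost += c.Cost
		out[k] = r
	}
	return out
}

func main() {
	// Made-up agent, channel, provider, and model names.
	calls := []LLMCall{
		{"agent-1", "telegram", "provider-a", "model-x", 100, 20, 0.25},
		{"agent-1", "telegram", "provider-a", "model-x", 300, 40, 0.5},
		{"agent-1", "telegram", "provider-b", "model-y", 50, 10, 0.125},
	}
	for k, r := range detailRows(calls) {
		fmt.Printf("%s/%s calls=%d in=%d out=%d cost=%.3f\n",
			k.Provider, k.Model, r.LLMCallCount, r.InputTokens, r.OutputTokens, r.TotalCost)
	}
}
```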

Usage

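```go
worker := tracing.NewSnapshotWorker(db, snapshotStore)
worker.Start() // begins the hourly schedule

// Later: populate historical snapshots from the earliest trace to now.
hoursBackfilled, err := worker.Backfill(ctx)
if err != nil {
	log.Printf("snapshot backfill failed: %v", err)
}
log.Printf("backfilled %v hours", hoursBackfilled)
worker.Stop()
```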

7. Trace HTTP API

| Method | Path | Description |
|--------|------|-------------|
| GET | `/v1/traces` | List traces with pagination and filters |
| GET | `/v1/traces/{id}` | Get trace details with all spans |

Query Filters

| Parameter | Type | Description |
|-----------|------|-------------|
| `agent_id` | UUID | Filter by agent |
| `user_id` | string | Filter by user |
| `status` | string | Filter by status (running, success, error, cancelled) |
| `from` / `to` | timestamp | Date range filter |
| `limit` | int | Page size (default 50) |
| `offset` | int | Pagination offset |
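
A client could assemble a filtered list request as sketched below. The `TraceFilter` type and `tracesURL` helper are hypothetical, the base URL is a placeholder, and the RFC 3339 timestamp format for `from` / `to` is an assumption since the source only says "timestamp".

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
	"time"
)

// TraceFilter is a hypothetical client-side view of the query filters.
type TraceFilter struct {
	AgentID string
	UserID  string
	Status  string
	From    time.Time
	To      time.Time
	Limit   int
	Offset  int
}

// tracesURL builds the list URL, omitting filters that are unset.
func tracesURL(base string, f TraceFilter) string {
	q := url.Values{}
	if f.AgentID != "" {
		q.Set("agent_id", f.AgentID)
	}
	if f.UserID != "" {
		q.Set("user_id", f.UserID)
	}
	if f.Status != "" {
		q.Set("status", f.Status)
	}
	if !f.From.IsZero() {
		q.Set("from", f.From.UTC().Format(time.RFC3339)) // assumed format
	}
	if !f.To.IsZero() {
		q.Set("to", f.To.UTC().Format(time.RFC3339)) // assumed format
	}
	if f.Limit > 0 {
		q.Set("limit", strconv.Itoa(f.Limit))
	}
	if f.Offset > 0 {
		q.Set("offset", strconv.Itoa(f.Offset))
	}
	u := base + "/v1/traces"
	if enc := q.Encode(); enc != "" {
		u += "?" + enc // Encode sorts parameters by key
	}
	return u
}

func main() {
	// Third page of failed traces, 20 per page; the host is a placeholder.
	fmt.Println(tracesURL("http://localhost:8080", TraceFilter{
		Status: "error",
		Limit:  20,
		Offset: 40,
	}))
	// http://localhost:8080/v1/traces?limit=20&offset=40&status=error
}
```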

8. Delegation History

Delegation history records are stored in the delegation_history table and exposed alongside traces for cross-referencing agent interactions.

| Channel | Endpoint | Details |
|---------|----------|---------|
| WebSocket RPC | `delegations.list` / `delegations.get` | Results truncated (500 runes for list, 8000 for detail) |
| HTTP API | `GET /v1/delegations` / `GET /v1/delegations/{id}` | Full records |
| Agent tool | `delegate(action="history")` | Agent self-checking past delegations |

Delegation history is automatically recorded by DelegateManager.saveDelegationHistory() for every delegation (sync/async). Each record includes source agent, target agent, input, result, duration, and status.

File Reference

| File | Description |
|------|-------------|
| `internal/tracing/collector.go` | Collector buffer-flush, EmitSpan, FinishTrace, verbose mode |
| `internal/tracing/context.go` | Trace context propagation (TraceID, ParentSpanID, DelegateParentTraceID) |
| `internal/tracing/cost.go` | Cost calculation and pricing lookup |
| `internal/tracing/snapshot_worker.go` | Hourly usage aggregation into snapshots |
| `internal/tracing/otelexport/exporter.go` | OTel OTLP exporter (gRPC + HTTP) |
| `internal/store/tracing_store.go` | TracingStore interface, span/trace type constants |
| `internal/store/pg/tracing.go` | PostgreSQL trace/span persistence + aggregation |
| `internal/http/traces.go` | Trace HTTP API handler (GET /v1/traces) |
| `internal/agent/loop_tracing.go` | Span emission from agent loop (LLM, tool, agent spans) |
| `internal/http/delegations.go` | Delegation history HTTP API handler |
| `internal/gateway/methods/delegations.go` | Delegation history RPC handlers |

Cross-References

| Document | Relevant Content |
|----------|------------------|
| 01-agent-loop.md | Span emission during agent execution, cancel handling |
| 03-tools-system.md | Delegation system, delegation history via agent tool |
| 06-store-data-model.md | traces/spans tables schema, delegation_history table |
| 08-scheduling-cron.md | Scheduler lanes, /stop and /stopall commands |
| 09-security.md | Rate limiting, RBAC access control |
