Files
goclaw/docs/12-extended-thinking.md
T
Viet Tran 037d18f711 docs: comprehensive audit and update of all documentation (#231)
* feat(ui): improve kanban UX, fix dialog scroll, remove delegation page

- Kanban: reorder columns (blocked after pending), show blocked-by info
  on cards, clickable blocker links in task detail, framer-motion card
  animation between columns
- Dialogs: standardize scroll pattern across all modals — header fixed,
  scrollbar flush with outer edge via negative margin trick
- Remove delegation page, types, events, i18n, routes, and all references
- Fix activity_logs NULL jsonb scan error (COALESCE)
- Board header: show text labels on action buttons (desktop)

* docs: comprehensive audit and update of all documentation

- Update Go 1.25 → 1.26, PostgreSQL 15+ → 18 across all docs
- Add 10 missing internal modules to CLAUDE.md project structure
- Expand provider docs from 2 to 6 packages (Anthropic, OpenAI, DashScope, Claude CLI, ACP, Codex)
- Add 8 missing store interfaces to data model docs (22 total)
- Update bootstrap files from 7 to 13 templates
- Expand tool inventory from ~35 to 60+ tools with media/KG/credential categories
- Fix Team Task Board: add blocked status, 3 missing actions, V2 versioning, delegate restrictions
- Remove all references to removed features: handoff, delegate_search, evaluate_loop, agent_links
- Fix lane defaults (2/4/1 → 30/50/100/30), ghost file references, models.list → providers.models
- Add SecureCLI, snapshot worker, cost calculation, pairing security docs
- Comprehensive changelog catch-up
- Trim docs/03-tools-system.md to 800-line limit
2026-03-16 22:51:57 +07:00

8.1 KiB

12 - Extended Thinking

Overview

Extended thinking allows LLM providers to "think out loud" before producing a final response. When enabled, the model generates internal reasoning tokens that improve response quality for complex tasks — at the cost of additional token usage and latency. GoClaw supports extended thinking across multiple providers with a unified thinking_level configuration.


1. Configuration

Thinking is controlled per-agent through the thinking_level setting.

Level Behavior
off Thinking disabled (default)
low Minimal thinking — quick reasoning
medium Moderate thinking — balanced reasoning
high Maximum thinking — deep reasoning for complex tasks

The setting can be configured:

  • Per-agent: In the agent's configuration (applies to all users of that agent)
  • Per-user override: Via user_agent_overrides table (reserved for future use)

2. Provider Support

Each provider maps the abstract thinking_level to its own implementation parameters.

flowchart TD
    CONFIG["Agent config:<br/>thinking_level = medium"] --> CHECK{"Provider supports<br/>thinking?"}
    CHECK -->|No| SKIP["Send request<br/>without thinking"]
    CHECK -->|Yes| MAP{"Provider type?"}

    MAP -->|Anthropic| ANTH["Budget tokens: 10,000<br/>Header: anthropic-beta<br/>Strip temperature"]
    MAP -->|OpenAI-compat| OAI["Map to reasoning_effort<br/>(low/medium/high)"]
    MAP -->|DashScope| DASH["enable_thinking: true<br/>Budget: 16,384 tokens<br/>⚠ Model-specific + tools limitation"]
    MAP -->|Codex| CODEX["reasoning_tokens tracked<br/>via Responses API"]

    ANTH --> SEND["Send to LLM"]
    OAI --> SEND
    DASH --> SEND
    CODEX --> SEND

Anthropic (Native)

Thinking Level Budget Tokens
low 4,096
medium 10,000
high 32,000

When thinking is enabled:

  • Adds thinking: {type: "enabled", budget_tokens: N} to the request body
  • Sets anthropic-beta: interleaved-thinking-2025-05-14 header
  • Strips temperature parameter (Anthropic requirement — cannot use temperature with thinking)
  • Auto-adjusts max_tokens to accommodate thinking budget (budget + 8,192 buffer)

OpenAI-Compatible (OpenAI, Groq, DeepSeek, etc.)

Maps thinking_level directly to reasoning_effort:

  • lowreasoning_effort: "low"
  • mediumreasoning_effort: "medium"
  • highreasoning_effort: "high"

Reasoning content is returned in the reasoning_content field of the response delta during streaming.

DashScope (Alibaba Qwen)

Thinking Level Budget Tokens
low 4,096
medium 16,384
high 32,768

Enables thinking via enable_thinking: true plus a thinking_budget parameter.

Model-specific support: Only certain Qwen3 models accept the enable_thinking / thinking_budget parameters:

  • Qwen3.5 series: qwen3.5-plus, qwen3.5-turbo (thinking + vision)
  • Qwen3 hosted: qwen3-max
  • Qwen3 open-weight: qwen3-235b-a22b, qwen3-32b, qwen3-14b, qwen3-8b

Other models (e.g., qwen3-plus, qwen3-turbo) silently skip thinking injection to avoid API errors.

Important limitation: DashScope does not support streaming when tools are present. When an agent has tools enabled and thinking is active, the provider automatically falls back to non-streaming mode (single Chat() call) and synthesizes chunk callbacks to maintain the event flow.

Codex (Alibaba AI Reasoning)

Codex natively supports extended reasoning through its Responses API. Thinking/reasoning tokens are streamed as discrete reasoning events with summary fragments.

Token tracking: Reasoning token count is exposed in response.completed / response.incomplete events as OutputTokensDetails.ReasoningTokens and accessible via ChatResponse.Usage.ThinkingTokens.


3. Streaming

When thinking is active, reasoning content streams to the client alongside regular content.

flowchart TD
    LLM["LLM generates response"] --> THINK["Thinking tokens<br/>(internal reasoning)"]
    THINK --> CONTENT["Content tokens<br/>(final response)"]

    THINK -->|Stream| CHUNK_T["StreamChunk<br/>Thinking: 'reasoning text...'"]
    CONTENT -->|Stream| CHUNK_C["StreamChunk<br/>Content: 'response text...'"]

    CHUNK_T --> CLIENT["Client receives<br/>thinking + content separately"]
    CHUNK_C --> CLIENT

Provider-Specific Streaming Events

Provider Thinking Event Content Event
Anthropic thinking_delta in content blocks text_delta in content blocks
OpenAI-compat reasoning_content in delta content in delta
DashScope Same as OpenAI (when tools absent) Same as OpenAI
Codex reasoning items with text summaries content items

Token Estimation

Thinking tokens are estimated as character_count / 4 for context window tracking. This rough estimate ensures the agent loop can account for thinking overhead when calculating context usage.


4. Tool Loop Handling

Extended thinking interacts with multi-turn tool conversations. When the LLM calls a tool and then needs to continue reasoning, thinking blocks must be preserved correctly across turns.

flowchart TD
    TURN1["Turn 1: LLM thinks + calls tool"] --> PRESERVE["Preserve thinking blocks<br/>in raw assistant content"]
    PRESERVE --> TOOL["Tool executes,<br/>result appended to history"]
    TOOL --> TURN2["Turn 2: LLM receives history<br/>including preserved thinking blocks"]
    TURN2 --> CONTINUE["LLM continues reasoning<br/>with full context"]

Anthropic Thinking Block Preservation

Anthropic requires thinking blocks (including their cryptographic signatures) to be echoed back in subsequent turns. GoClaw handles this through RawAssistantContent:

  1. During streaming, raw content blocks are accumulated — including thinking type blocks with their signature fields
  2. When the assistant message is appended to history, the raw blocks are preserved
  3. On the next LLM call, these blocks are sent back as-is, ensuring the API can validate thinking continuity

This is critical for correctness: if thinking blocks are dropped or modified, the Anthropic API may reject the request or produce degraded responses.

Other Providers

OpenAI-compatible providers handle thinking/reasoning content as metadata. The reasoning_content is accumulated during streaming but does not require special passback handling — each turn's reasoning is independent.


5. Limitations

Provider Limitation
DashScope Cannot stream when tools are present — falls back to non-streaming mode. Only specific Qwen3 models support thinking.
Codex Reasoning tokens tracked via API response (not in streaming chunks themselves)
Anthropic Temperature parameter stripped when thinking is enabled
All Thinking tokens count against the context window budget
All Thinking increases latency and cost proportional to the budget level

File Reference

File Purpose
internal/providers/types.go ThinkingCapable interface, StreamChunk.Thinking field, Opt* thinking constants
internal/providers/anthropic.go Anthropic: budget mapping (4K/10K/32K), beta header injection, temperature stripping
internal/providers/anthropic_stream.go Anthropic streaming: thinking_delta handling, raw block accumulation
internal/providers/anthropic_request.go Anthropic request: thinking block preservation for tool loops
internal/providers/openai.go OpenAI-compat: reasoning_effort mapping, reasoning_content streaming
internal/providers/dashscope.go DashScope: model-specific thinking guard, budget mapping, tools+streaming fallback
internal/providers/codex.go Codex: reasoning event streaming, OutputTokensDetails.ReasoningTokens tracking

Cross-References

Document Relevant Content
02-providers.md Provider architecture, supported providers
01-agent-loop.md LLM iteration loop, streaming chunk handling