# 12 - Extended Thinking ## Overview Extended thinking allows LLM providers to "think out loud" before producing a final response. When enabled, the model generates internal reasoning tokens that improve response quality for complex tasks — at the cost of additional token usage and latency. GoClaw supports extended thinking across multiple providers with a unified `thinking_level` configuration. --- ## 1. Configuration Thinking is controlled per-agent through the `thinking_level` setting. | Level | Behavior | |-------|----------| | `off` | Thinking disabled (default) | | `low` | Minimal thinking — quick reasoning | | `medium` | Moderate thinking — balanced reasoning | | `high` | Maximum thinking — deep reasoning for complex tasks | The setting can be configured: - **Per-agent**: In the agent's configuration (applies to all users of that agent) - **Per-user override**: Via `user_agent_overrides` table (reserved for future use) --- ## 2. Provider Support Each provider maps the abstract `thinking_level` to its own implementation parameters. ```mermaid flowchart TD CONFIG["Agent config:
thinking_level = medium"] --> CHECK{"Provider supports
thinking?"} CHECK -->|No| SKIP["Send request
without thinking"] CHECK -->|Yes| MAP{"Provider type?"} MAP -->|Anthropic| ANTH["Budget tokens: 10,000
Header: anthropic-beta
Strip temperature"] MAP -->|OpenAI-compat| OAI["Map to reasoning_effort
(low/medium/high)"] MAP -->|DashScope| DASH["enable_thinking: true
Budget: 16,384 tokens
⚠ Model-specific + tools limitation"] MAP -->|Codex| CODEX["reasoning_tokens tracked
via Responses API"] ANTH --> SEND["Send to LLM"] OAI --> SEND DASH --> SEND CODEX --> SEND ``` ### Anthropic (Native) | Thinking Level | Budget Tokens | |:-:|:-:| | low | 4,096 | | medium | 10,000 | | high | 32,000 | When thinking is enabled: - Adds `thinking: {type: "enabled", budget_tokens: N}` to the request body - Sets `anthropic-beta: interleaved-thinking-2025-05-14` header - Strips `temperature` parameter (Anthropic requirement — cannot use temperature with thinking) - Auto-adjusts `max_tokens` to accommodate thinking budget (budget + 8,192 buffer) ### OpenAI-Compatible (OpenAI, Groq, DeepSeek, etc.) Maps `thinking_level` directly to `reasoning_effort`: - `low` → `reasoning_effort: "low"` - `medium` → `reasoning_effort: "medium"` - `high` → `reasoning_effort: "high"` Reasoning content is returned in the `reasoning_content` field of the response delta during streaming. ### DashScope (Alibaba Qwen) | Thinking Level | Budget Tokens | |:-:|:-:| | low | 4,096 | | medium | 16,384 | | high | 32,768 | Enables thinking via `enable_thinking: true` plus a `thinking_budget` parameter. **Model-specific support**: Only certain Qwen3 models accept the `enable_thinking` / `thinking_budget` parameters: - **Qwen3.5 series**: `qwen3.5-plus`, `qwen3.5-turbo` (thinking + vision) - **Qwen3 hosted**: `qwen3-max` - **Qwen3 open-weight**: `qwen3-235b-a22b`, `qwen3-32b`, `qwen3-14b`, `qwen3-8b` Other models (e.g., `qwen3-plus`, `qwen3-turbo`) silently skip thinking injection to avoid API errors. **Important limitation**: DashScope does not support streaming when tools are present. When an agent has tools enabled and thinking is active, the provider automatically falls back to non-streaming mode (single `Chat()` call) and synthesizes chunk callbacks to maintain the event flow. ### Codex (Alibaba AI Reasoning) Codex natively supports extended reasoning through its Responses API. Thinking/reasoning tokens are streamed as discrete `reasoning` events with summary fragments. **Token tracking**: Reasoning token count is exposed in `response.completed` / `response.incomplete` events as `OutputTokensDetails.ReasoningTokens` and accessible via `ChatResponse.Usage.ThinkingTokens`. --- ## 3. Streaming When thinking is active, reasoning content streams to the client alongside regular content. ```mermaid flowchart TD LLM["LLM generates response"] --> THINK["Thinking tokens
(internal reasoning)"] THINK --> CONTENT["Content tokens
(final response)"] THINK -->|Stream| CHUNK_T["StreamChunk
Thinking: 'reasoning text...'"] CONTENT -->|Stream| CHUNK_C["StreamChunk
Content: 'response text...'"] CHUNK_T --> CLIENT["Client receives
thinking + content separately"] CHUNK_C --> CLIENT ``` ### Provider-Specific Streaming Events | Provider | Thinking Event | Content Event | |----------|---------------|---------------| | Anthropic | `thinking_delta` in content blocks | `text_delta` in content blocks | | OpenAI-compat | `reasoning_content` in delta | `content` in delta | | DashScope | Same as OpenAI (when tools absent) | Same as OpenAI | | Codex | `reasoning` items with text summaries | `content` items | ### Token Estimation Thinking tokens are estimated as `character_count / 4` for context window tracking. This rough estimate ensures the agent loop can account for thinking overhead when calculating context usage. --- ## 4. Tool Loop Handling Extended thinking interacts with multi-turn tool conversations. When the LLM calls a tool and then needs to continue reasoning, thinking blocks must be preserved correctly across turns. ```mermaid flowchart TD TURN1["Turn 1: LLM thinks + calls tool"] --> PRESERVE["Preserve thinking blocks
in raw assistant content"] PRESERVE --> TOOL["Tool executes,
result appended to history"] TOOL --> TURN2["Turn 2: LLM receives history
including preserved thinking blocks"] TURN2 --> CONTINUE["LLM continues reasoning
with full context"] ``` ### Anthropic Thinking Block Preservation Anthropic requires thinking blocks (including their cryptographic signatures) to be echoed back in subsequent turns. GoClaw handles this through `RawAssistantContent`: 1. During streaming, raw content blocks are accumulated — including `thinking` type blocks with their `signature` fields 2. When the assistant message is appended to history, the raw blocks are preserved 3. On the next LLM call, these blocks are sent back as-is, ensuring the API can validate thinking continuity This is critical for correctness: if thinking blocks are dropped or modified, the Anthropic API may reject the request or produce degraded responses. ### Other Providers OpenAI-compatible providers handle thinking/reasoning content as metadata. The `reasoning_content` is accumulated during streaming but does not require special passback handling — each turn's reasoning is independent. --- ## 5. Limitations | Provider | Limitation | |----------|-----------| | DashScope | Cannot stream when tools are present — falls back to non-streaming mode. Only specific Qwen3 models support thinking. | | Codex | Reasoning tokens tracked via API response (not in streaming chunks themselves) | | Anthropic | Temperature parameter stripped when thinking is enabled | | All | Thinking tokens count against the context window budget | | All | Thinking increases latency and cost proportional to the budget level | --- ## File Reference | File | Purpose | |------|---------| | `internal/providers/types.go` | ThinkingCapable interface, StreamChunk.Thinking field, Opt* thinking constants | | `internal/providers/anthropic.go` | Anthropic: budget mapping (4K/10K/32K), beta header injection, temperature stripping | | `internal/providers/anthropic_stream.go` | Anthropic streaming: thinking_delta handling, raw block accumulation | | `internal/providers/anthropic_request.go` | Anthropic request: thinking block preservation for tool loops | | `internal/providers/openai.go` | OpenAI-compat: reasoning_effort mapping, reasoning_content streaming | | `internal/providers/dashscope.go` | DashScope: model-specific thinking guard, budget mapping, tools+streaming fallback | | `internal/providers/codex.go` | Codex: reasoning event streaming, OutputTokensDetails.ReasoningTokens tracking | --- ## Cross-References | Document | Relevant Content | |----------|-----------------| | [02-providers.md](./02-providers.md) | Provider architecture, supported providers | | [01-agent-loop.md](./01-agent-loop.md) | LLM iteration loop, streaming chunk handling |