mirror of https://github.com/tiennm99/goclaw.git synced 2026-06-10 00:13:42 +00:00


Kai (Tam Nhu) Tran 3ca3bb2062 feat: add capability-aware reasoning effort controls (#593)

* feat(reasoning): add capability-aware effort resolution

- resolve requested reasoning levels against exact model capabilities

- persist requested effort on agents and expose effective effort in traces

- add backend tests for provider models, agent store, and resolution logic

Refs #591

* feat(ui): gate reasoning controls by model capabilities

- only show supported reasoning levels when provider model metadata is available

- preserve expert reasoning selections during async model loading

- surface effective reasoning details in trace dialogs and localized copy

Refs #591

* docs(api): document capability-aware reasoning controls

- describe exact-match capability lookup and downgrade behavior

- update provider model metadata and trace response documentation

- refresh the generated OpenAPI spec for the new reasoning fields

Refs #591

* feat: add provider-first reasoning controls

* docs: refresh PR 593 UI evidence callouts

* refactor: deduplicate reasoning normalize functions and remove PR evidence

- Export NormalizeReasoningEffort/NormalizeReasoningFallback from providers
  package; store package now delegates instead of duplicating
- Store reasoning fallback constants alias providers canonical definitions
- Export deriveLegacyThinkingLevel from types/provider.ts; remove local
  copies from agent-advanced-dialog and provider-overview
- Remove unused _providerType param from useProviderModels hook
- Fix reasoning debug log to fire for all cases with a reason (not just
  non-off efforts)
- Remove docs/pr-593-evidence/ binary screenshots from repo

---------

Co-authored-by: viettranx <viettranx@gmail.com>

2026-03-31 07:56:01 +07:00

11 KiB


12 - Extended Thinking

Overview

Extended thinking allows LLM providers to "think out loud" before producing a final response. When enabled, the model generates internal reasoning tokens that improve response quality for complex tasks at the cost of additional token usage and latency. GoClaw now supports both the legacy coarse thinking_level setting and a provider-first reasoning policy for capability-aware GPT-5/Codex control.

1. Configuration

The reusable default now lives on the provider in settings.reasoning_defaults. Agents consume that default by inheriting it, or store a custom override in other_config.reasoning. thinking_level remains the backward-compatible coarse shim for older builds.

Level	Behavior
`off`	Thinking disabled (default)
`low`	Minimal thinking — quick reasoning
`medium`	Moderate thinking — balanced reasoning
`high`	Maximum thinking — deep reasoning for complex tasks

Provider default

{
  "provider_type": "chatgpt_oauth",
  "settings": {
    "reasoning_defaults": {
      "effort": "high",
      "fallback": "provider_default"
    }
  }
}

Agent inherits provider default

{
  "other_config": {
    "reasoning": {
      "override_mode": "inherit"
    }
  }
}

Agent custom override

{
  "other_config": {
    "thinking_level": "high",
    "reasoning": {
      "override_mode": "custom",
      "effort": "xhigh",
      "fallback": "downgrade"
    }
  }
}

Rules:

Unset provider defaults and unset agent reasoning both resolve to off.
settings.reasoning_defaults is provider-owned and reusable across agents.
reasoning.override_mode accepts inherit|custom.
thinking_level still accepts off|low|medium|high.
reasoning.effort accepts off|auto|none|minimal|low|medium|high|xhigh.
reasoning.fallback accepts downgrade|off|provider_default.
Existing reasoning payloads without override_mode are treated as custom overrides for backward compatibility.
Read path resolves provider defaults first, then applies agent inherit/custom semantics, then falls back to legacy thinking_level.
Write path keeps a derived coarse thinking_level only for custom agent overrides so rollback to older GoClaw builds stays safe.
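
The read-path order in the rules above can be sketched in Go as follows. This is a minimal illustration, not GoClaw's actual code: the `ReasoningConfig` type, the `resolveEffort` function, and the source labels are hypothetical, and the sketch assumes that legacy `thinking_level` is consulted only when the agent has no `reasoning` block at all.

```go
package main

import "fmt"

// ReasoningConfig mirrors the JSON keys shown above. The Go type is hypothetical.
type ReasoningConfig struct {
	OverrideMode string // "inherit", "custom", or "" for legacy payloads
	Effort       string
	Fallback     string
}

// resolveEffort sketches the documented read path. Source labels are illustrative.
func resolveEffort(provider, agent *ReasoningConfig, thinkingLevel string) (effort, source string) {
	// Step 1: start from the provider default, or off when none is set.
	effort, source = "off", "unset"
	if provider != nil && provider.Effort != "" {
		effort, source = provider.Effort, "provider_default"
	}

	// Step 2: apply agent inherit/custom semantics.
	if agent != nil {
		// A payload without override_mode counts as a custom override.
		if agent.OverrideMode != "inherit" && agent.Effort != "" {
			return agent.Effort, "agent_override"
		}
		return effort, source
	}

	// Step 3 (assumption): with no agent reasoning block, use legacy thinking_level.
	if thinkingLevel != "" {
		return thinkingLevel, "legacy_thinking_level"
	}
	return effort, source
}

func main() {
	provider := &ReasoningConfig{Effort: "high", Fallback: "provider_default"}

	effort, source := resolveEffort(provider, &ReasoningConfig{OverrideMode: "inherit"}, "")
	fmt.Println(effort, source) // high provider_default

	custom := &ReasoningConfig{OverrideMode: "custom", Effort: "xhigh", Fallback: "downgrade"}
	effort, source = resolveEffort(provider, custom, "high")
	fmt.Println(effort, source) // xhigh agent_override
}
```

The returned effort is still only the requested effort; whether the model can honor it is decided later by the provider-specific mapping.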

2. Provider Support

Each provider maps the normalized reasoning policy to its own implementation parameters.

flowchart TD
    CONFIG["Provider defaults +<br/>agent inherit/custom"] --> CHECK{"Provider supports<br/>thinking?"}
    CHECK -->|No| SKIP["Send request<br/>without thinking"]
    CHECK -->|Yes| MAP{"Provider type?"}

    MAP -->|Anthropic| ANTH["Budget tokens: 10,000<br/>Header: anthropic-beta<br/>Strip temperature"]
    MAP -->|OpenAI-compat| OAI["Capability-aware effort<br/>or provider default"]
    MAP -->|DashScope| DASH["enable_thinking: true<br/>Budget: 16,384 tokens<br/>⚠ Model-specific + tools limitation"]
    MAP -->|Codex| CODEX["Capability-aware effort<br/>+ trace metadata"]

    ANTH --> SEND["Send to LLM"]
    OAI --> SEND
    DASH --> SEND
    CODEX --> SEND

Anthropic (Native)

Thinking Level	Budget Tokens
low	4,096
medium	10,000
high	32,000

When thinking is enabled:

Adds thinking: {type: "enabled", budget_tokens: N} to the request body
Sets anthropic-beta: interleaved-thinking-2025-05-14 header
Strips temperature parameter (Anthropic requirement — cannot use temperature with thinking)
Auto-adjusts max_tokens to accommodate thinking budget (budget + 8,192 buffer)
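
The four request adjustments above can be sketched as one helper. This is an illustrative sketch, not the code in `internal/providers/anthropic.go`: the `applyThinking` function and the map-based request body are hypothetical, and it assumes `max_tokens` is raised only when it is below the budget plus the 8,192-token buffer.

```go
package main

import "fmt"

const (
	betaHeader      = "interleaved-thinking-2025-05-14"
	maxTokensBuffer = 8192
)

// budgets maps a thinking level to the Anthropic budget from the table above.
var budgets = map[string]int{"low": 4096, "medium": 10000, "high": 32000}

// applyThinking mutates a request body and header set for a thinking level.
func applyThinking(body map[string]interface{}, headers map[string]string, level string) {
	budget, ok := budgets[level]
	if !ok {
		return // "off" or an unknown level: leave the request untouched
	}
	body["thinking"] = map[string]interface{}{"type": "enabled", "budget_tokens": budget}
	headers["anthropic-beta"] = betaHeader
	delete(body, "temperature") // temperature cannot be combined with thinking

	// Assumption: raise max_tokens only when it is below budget + buffer.
	if current, _ := body["max_tokens"].(int); current < budget+maxTokensBuffer {
		body["max_tokens"] = budget + maxTokensBuffer
	}
}

func main() {
	body := map[string]interface{}{"temperature": 0.7, "max_tokens": 4096}
	headers := map[string]string{}
	applyThinking(body, headers, "medium")
	fmt.Println(body["max_tokens"], headers["anthropic-beta"])
}
```

With the `medium` level, the sketch raises `max_tokens` from 4,096 to 18,192 (10,000 + 8,192) and removes `temperature`.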

OpenAI-Compatible and Codex (GPT-5 / Codex families)

Known GPT-5/Codex models use a static capability registry. The runtime resolves:

requested effort
actual effective effort
fallback policy used
whether the model default was used
whether the source was the provider default or an agent override

If the model is known:

supported efforts pass through unchanged
unsupported efforts are normalized via downgrade, off, or provider_default
auto means "use the model default effort"

If the model is unknown:

explicit non-auto effort is passed through as requested
auto leaves provider-default reasoning untouched
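
The known-model and unknown-model rules above can be sketched as a small decision function. The `Capability` and `Decision` types, the `resolve` function, and the capability values in `main` are hypothetical and do not reproduce the real registry in `internal/providers/reasoning_capability.go`. The sketch also assumes that `downgrade` picks the strongest supported effort below the requested one, and falls back to `off` when none exists.

```go
package main

import "fmt"

// effortOrder ranks efforts from weakest to strongest (assumed ordering).
var effortOrder = []string{"none", "minimal", "low", "medium", "high", "xhigh"}

// Capability is a hypothetical registry entry for one known model.
type Capability struct {
	Supported []string
	Default   string
}

// Decision records what was asked for and what will actually be sent.
type Decision struct {
	Requested           string
	Effective           string // "" means the reasoning parameter is omitted
	Fallback            string // set only when a fallback policy was applied
	UsedProviderDefault bool
}

func rank(effort string) int {
	for i, e := range effortOrder {
		if e == effort {
			return i
		}
	}
	return -1
}

// resolve sketches the rules above. A nil capability means an unknown model.
func resolve(c *Capability, requested, fallback string) Decision {
	d := Decision{Requested: requested, Effective: requested}
	if requested == "auto" {
		d.Effective, d.UsedProviderDefault = "", true
		if c != nil {
			d.Effective = c.Default // known model: use its default effort
		}
		return d
	}
	if c == nil || requested == "off" {
		return d // unknown model or off: pass through as requested
	}
	for _, s := range c.Supported {
		if s == requested {
			return d // supported efforts pass through unchanged
		}
	}
	d.Fallback = fallback
	switch fallback {
	case "downgrade":
		// Assumption: pick the strongest supported effort below the request.
		d.Effective = "off"
		best := -1
		for _, s := range c.Supported {
			if r := rank(s); r < rank(requested) && r > best {
				best = r
			}
		}
		if best >= 0 {
			d.Effective = effortOrder[best]
		}
	case "provider_default":
		d.Effective, d.UsedProviderDefault = "", true
	default: // "off"
		d.Effective = "off"
	}
	return d
}

func main() {
	known := &Capability{Supported: []string{"low", "medium", "high"}, Default: "medium"}
	fmt.Println(resolve(known, "xhigh", "downgrade").Effective) // high
	fmt.Println(resolve(known, "auto", "downgrade").Effective)  // medium
	fmt.Println(resolve(nil, "xhigh", "downgrade").Effective)   // xhigh
}
```

Keeping the requested and effective efforts side by side in one value is what lets a trace show when a downgrade happened.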

Reasoning content still streams in the provider-native format, and span metadata now records the source plus requested versus effective effort.

DashScope (Alibaba Qwen)

Thinking Level	Budget Tokens
low	4,096
medium	16,384
high	32,768

Enables thinking via enable_thinking: true plus a thinking_budget parameter.

Model-specific support: Only certain Qwen3 models accept the enable_thinking / thinking_budget parameters:

Qwen3.5 series: qwen3.5-plus, qwen3.5-turbo (thinking + vision)
Qwen3 hosted: qwen3-max
Qwen3 open-weight: qwen3-235b-a22b, qwen3-32b, qwen3-14b, qwen3-8b

For other models (e.g., qwen3-plus, qwen3-turbo), GoClaw silently skips thinking injection to avoid API errors.

Important limitation: DashScope does not support streaming when tools are present. When an agent has tools enabled and thinking is active, the provider automatically falls back to non-streaming mode (single Chat() call) and synthesizes chunk callbacks to maintain the event flow.
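
The model guard, budget mapping, and streaming fallback described above can be sketched together. The `RequestPlan` type and `planRequest` function are hypothetical, and the sketch assumes the fallback to non-streaming mode is triggered by the presence of tools alone, as the limitations table states.

```go
package main

import "fmt"

// thinkingModels lists the models named above as accepting enable_thinking.
var thinkingModels = map[string]bool{
	"qwen3.5-plus":    true,
	"qwen3.5-turbo":   true,
	"qwen3-max":       true,
	"qwen3-235b-a22b": true,
	"qwen3-32b":       true,
	"qwen3-14b":       true,
	"qwen3-8b":        true,
}

// budgets maps a thinking level to the DashScope budget from the table above.
var budgets = map[string]int{"low": 4096, "medium": 16384, "high": 32768}

// RequestPlan is a hypothetical summary of how one request will be sent.
type RequestPlan struct {
	EnableThinking bool
	ThinkingBudget int
	Stream         bool
}

// planRequest decides thinking injection and streaming mode for one request.
func planRequest(model, level string, hasTools, wantStream bool) RequestPlan {
	// Assumption: tools alone force the non-streaming fallback.
	plan := RequestPlan{Stream: wantStream && !hasTools}

	// Inject thinking only for a known level on a model that accepts it.
	if budget, ok := budgets[level]; ok && thinkingModels[model] {
		plan.EnableThinking = true
		plan.ThinkingBudget = budget
	}
	return plan
}

func main() {
	fmt.Printf("%+v\n", planRequest("qwen3-max", "medium", true, true))
	fmt.Printf("%+v\n", planRequest("qwen3-plus", "high", false, true))
}
```

In the first call the plan enables thinking with a 16,384-token budget but disables streaming because tools are present; in the second, thinking is skipped for the unsupported model and streaming stays on.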

Codex (ChatGPT OAuth Responses API)

Codex natively supports extended reasoning through its Responses API. Thinking and reasoning tokens are streamed as discrete reasoning events with summary fragments.

Token tracking: The reasoning token count is exposed in response.completed / response.incomplete events as OutputTokensDetails.ReasoningTokens and is accessible via ChatResponse.Usage.ThinkingTokens.

Model metadata: /v1/providers/{id}/models is now the backend source of truth for the ChatGPT OAuth model list and any known reasoning capabilities.

3. Streaming

When thinking is active, reasoning content streams to the client alongside regular content.

flowchart TD
    LLM["LLM generates response"] --> THINK["Thinking tokens<br/>(internal reasoning)"]
    THINK --> CONTENT["Content tokens<br/>(final response)"]

    THINK -->|Stream| CHUNK_T["StreamChunk<br/>Thinking: 'reasoning text...'"]
    CONTENT -->|Stream| CHUNK_C["StreamChunk<br/>Content: 'response text...'"]

    CHUNK_T --> CLIENT["Client receives<br/>thinking + content separately"]
    CHUNK_C --> CLIENT

Provider-Specific Streaming Events

Provider	Thinking Event	Content Event
Anthropic	`thinking_delta` in content blocks	`text_delta` in content blocks
OpenAI-compat	`reasoning_content` in delta	`content` in delta
DashScope	Same as OpenAI (when tools absent)	Same as OpenAI
Codex	`reasoning` items with text summaries	`content` items
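
Once provider events are normalized into chunks, a client can keep the two streams apart with a small accumulator. This is a minimal sketch: the `StreamChunk` type here is a simplified stand-in that carries only the `Thinking` and `Content` fields shown in the flowchart, and the `accumulator` type and `collect` function are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// StreamChunk is a simplified stand-in for the provider chunk type.
type StreamChunk struct {
	Thinking string
	Content  string
}

// accumulator gathers thinking and content text separately.
type accumulator struct {
	thinking strings.Builder
	content  strings.Builder
}

// onChunk would be registered as the streaming callback.
func (a *accumulator) onChunk(c StreamChunk) {
	a.thinking.WriteString(c.Thinking)
	a.content.WriteString(c.Content)
}

// collect replays a chunk sequence and returns both accumulated texts.
func collect(chunks []StreamChunk) (thinking, content string) {
	var acc accumulator
	for _, c := range chunks {
		acc.onChunk(c)
	}
	return acc.thinking.String(), acc.content.String()
}

func main() {
	thinking, content := collect([]StreamChunk{
		{Thinking: "Compare both "},
		{Thinking: "options."},
		{Content: "Option A is faster."},
	})
	fmt.Printf("thinking=%q\ncontent=%q\n", thinking, content)
}
```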

Token Estimation

Thinking tokens are estimated as character_count / 4 for context window tracking. This rough estimate ensures the agent loop can account for thinking overhead when calculating context usage.
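
As a sketch, the estimate is a single integer division. The function name is hypothetical, and counting characters as Unicode code points rather than bytes is an assumption.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// estimateThinkingTokens approximates token usage as character_count / 4.
// Assumption: a "character" is one Unicode code point, not one byte.
func estimateThinkingTokens(thinking string) int {
	return utf8.RuneCountInString(thinking) / 4
}

func main() {
	thinking := strings.Repeat("x", 40)
	fmt.Println(estimateThinkingTokens(thinking)) // 10
}
```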

4. Tool Loop Handling

Extended thinking interacts with multi-turn tool conversations. When the LLM calls a tool and then needs to continue reasoning, thinking blocks must be preserved correctly across turns.

flowchart TD
    TURN1["Turn 1: LLM thinks + calls tool"] --> PRESERVE["Preserve thinking blocks<br/>in raw assistant content"]
    PRESERVE --> TOOL["Tool executes,<br/>result appended to history"]
    TOOL --> TURN2["Turn 2: LLM receives history<br/>including preserved thinking blocks"]
    TURN2 --> CONTINUE["LLM continues reasoning<br/>with full context"]

Anthropic Thinking Block Preservation

Anthropic requires thinking blocks (including their cryptographic signatures) to be echoed back in subsequent turns. GoClaw handles this through RawAssistantContent:

During streaming, raw content blocks are accumulated — including thinking type blocks with their signature fields
When the assistant message is appended to history, the raw blocks are preserved
On the next LLM call, these blocks are sent back as-is, ensuring the API can validate thinking continuity

This is critical for correctness: if thinking blocks are dropped or modified, the Anthropic API may reject the request or produce degraded responses.
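
One way to guarantee that blocks survive unmodified is to store them as undecoded JSON. The sketch below illustrates that idea with `encoding/json`; the `Message` type and all function names are hypothetical and do not reproduce GoClaw's `RawAssistantContent` implementation, and the block contents and signature value are made-up sample data.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Message is a hypothetical history entry; Content holds raw provider blocks.
type Message struct {
	Role    string          `json:"role"`
	Content json.RawMessage `json:"content"`
}

// exampleBlocks returns sample blocks as they might be accumulated from a stream.
func exampleBlocks() []json.RawMessage {
	return []json.RawMessage{
		json.RawMessage(`{"type":"thinking","thinking":"plan the lookup","signature":"c2lnLWV4YW1wbGU="}`),
		json.RawMessage(`{"type":"tool_use","id":"toolu_1","name":"search","input":{"q":"goclaw"}}`),
	}
}

// assistantMessage wraps the raw blocks without decoding them into typed
// structs, so fields such as "signature" cannot be dropped by accident.
func assistantMessage(rawBlocks []json.RawMessage) (Message, error) {
	content, err := json.Marshal(rawBlocks)
	if err != nil {
		return Message{}, err
	}
	return Message{Role: "assistant", Content: content}, nil
}

// signatureAt reads back the signature of block i to check the round trip.
func signatureAt(m Message, i int) (string, error) {
	var blocks []struct {
		Signature string `json:"signature"`
	}
	if err := json.Unmarshal(m.Content, &blocks); err != nil {
		return "", err
	}
	if i < 0 || i >= len(blocks) {
		return "", fmt.Errorf("block %d out of range", i)
	}
	return blocks[i].Signature, nil
}

func main() {
	msg, err := assistantMessage(exampleBlocks())
	if err != nil {
		panic(err)
	}
	sig, err := signatureAt(msg, 0)
	if err != nil {
		panic(err)
	}
	fmt.Println(msg.Role, sig)
}
```

Because the blocks are never mapped onto typed fields, a new field added by the provider is echoed back on the next turn without any code change.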

Other Providers

OpenAI-compatible providers handle thinking/reasoning content as metadata. The reasoning_content is accumulated during streaming but does not require special passback handling — each turn's reasoning is independent.

5. Limitations

Provider	Limitation
DashScope	Cannot stream when tools are present — falls back to non-streaming mode. Only specific Qwen3 models support thinking.
Codex	Reasoning token counts are reported in the final API response event, not in the streaming chunks themselves
Anthropic	Temperature parameter stripped when thinking is enabled
All	Thinking tokens count against the context window budget
All	Thinking increases latency and cost proportional to the budget level
GPT-5/Codex unknown models	GoClaw allows explicit effort passthrough but does not claim a capability contract

6. Observability

Each LLM span can now include a metadata.reasoning section with:

source
requested_effort
effective_effort
fallback
reason
supported_levels
used_provider_default

This makes silent downgrades or provider-default decisions visible in traces instead of leaving them implicit.
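
A consumer of trace data could model the section as a struct and flag adjusted requests. This is a sketch only: the JSON keys come from the list above, but the Go field types, the `omitempty` choices, the helper names, and the sample values are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ReasoningMetadata mirrors the metadata.reasoning keys listed above.
// Field types are assumed, not taken from the GoClaw source.
type ReasoningMetadata struct {
	Source              string   `json:"source"`
	RequestedEffort     string   `json:"requested_effort"`
	EffectiveEffort     string   `json:"effective_effort"`
	Fallback            string   `json:"fallback,omitempty"`
	Reason              string   `json:"reason,omitempty"`
	SupportedLevels     []string `json:"supported_levels,omitempty"`
	UsedProviderDefault bool     `json:"used_provider_default"`
}

// wasAdjusted reports whether the effort sent differs from the effort requested.
func wasAdjusted(m ReasoningMetadata) bool {
	return m.RequestedEffort != m.EffectiveEffort
}

// spanMetadata encodes the section under a "reasoning" key.
func spanMetadata(m ReasoningMetadata) (string, error) {
	out, err := json.Marshal(map[string]ReasoningMetadata{"reasoning": m})
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	m := ReasoningMetadata{
		Source:          "agent_override", // sample value, hypothetical
		RequestedEffort: "xhigh",
		EffectiveEffort: "high",
		Fallback:        "downgrade",
		SupportedLevels: []string{"low", "medium", "high"},
	}
	encoded, err := spanMetadata(m)
	if err != nil {
		panic(err)
	}
	fmt.Println(wasAdjusted(m), encoded)
}
```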

File Reference

File	Purpose
`internal/providers/types.go`	ThinkingCapable interface, StreamChunk.Thinking field, Opt* thinking constants
`internal/providers/anthropic.go`	Anthropic: budget mapping (4K/10K/32K), beta header injection, temperature stripping
`internal/providers/anthropic_stream.go`	Anthropic streaming: thinking_delta handling, raw block accumulation
`internal/providers/anthropic_request.go`	Anthropic request: thinking block preservation for tool loops
`internal/providers/openai.go`	OpenAI-compat: reasoning_effort mapping, reasoning_content streaming
`internal/providers/reasoning_capability.go`	Static GPT-5/Codex capability registry
`internal/providers/reasoning_resolution.go`	Requested-to-effective reasoning decision engine
`internal/providers/reasoning_observation.go`	Trace metadata merge helpers for reasoning decisions
`internal/providers/dashscope.go`	DashScope: model-specific thinking guard, budget mapping, tools+streaming fallback
`internal/providers/codex.go`	Codex: reasoning event streaming, OutputTokensDetails.ReasoningTokens tracking

Cross-References

Document	Relevant Content
02-providers.md	Provider architecture, supported providers
01-agent-loop.md	LLM iteration loop, streaming chunk handling
