* perf: reduce per-request and per-chunk overhead across Anthropic streaming hot paths
- Introduce pure-text fast-path in `_build_complete_streaming_response` that collapses O(N) `content_block_delta` events into a single equivalent SSE event before conversion, eliminating per-output-token Pydantic `ModelResponseStream` construction; non-text streams (tool_use, thinking, citations) fall back to the unchanged legacy path
- Skip agentic streaming wrapper entirely when no callback overrides `async_should_run_agentic_loop`; the wrapper buffered every chunk and rebuilt the SSE response only to call hooks that all return `(False, {})` — a pure no-op for the default config
- Serialize request body once (`json.dumps`) for both the pre-call log input and the wire, instead of twice; avoids a full O(payload) scan per request, significant for long-context Claude Code histories
- Add fast path in `async_streaming_data_generator` that bypasses the per-chunk `async_post_call_streaming_hook` coroutine await, response-string materialization, and cost-injection call when no callback/guardrail/cost-injection is active (the default config)
- Resolve `_DD_STREAMING_TRACE_ENABLED` once at import time; eliminate per-chunk `NullSpan` context manager allocation when Datadog tracing is disabled (the default)
- Memoize `get_type_hints(AnthropicMessagesRequestOptionalParams)` with `@lru_cache(maxsize=1)` — resolves once per process instead of once per `/v1/messages` request (~80µs each)
- Hoist `cost_injection_active` out of the per-chunk loop in `chunk_processor`; eliminates repeated `getattr` + endpoint-type checks on every streamed byte chunk
- Extract `_build_passthrough_logging_result` from `_route_streaming_logging_to_handler` as a standalone static method to facilitate future off-loop dispatch
- Convert `async_sse_data_generator` from an `async for: yield` trampoline to a direct return of the underlying generator, removing one async-generator layer per streamed chunk
- Skip redundant `strip_empty_text_blocks_from_anthropic_messages` scan in `anthropic_messages_handler` when the async wrapper already sanitized (signalled via `_litellm_messages_presanitized` sentinel, popped before reaching provider params)
- Gate debug log `f-string` evaluation behind `isEnabledFor(DEBUG)` in both the streaming generator and the transformation layer to avoid serializing entire message payloads on every request at non-debug log levels
- Add benchmark script (`scripts/benchmark_anthropic_messages_perf.py`) with a local mock Anthropic SSE provider for reproducible TTFT and TPM measurement across commits/branches
- Add parity tests asserting fast-path and legacy-path produce byte-identical logged/billed payloads, plus unit tests for agentic hook detection, pre-serialized body reuse, and memoized key resolution
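A rough sketch of the pure-text collapse idea from the first bullet, assuming events are plain dicts shaped like Anthropic SSE payloads. The helper name `collapse_text_deltas` is hypothetical and is not the actual `_build_complete_streaming_response` internals; it only illustrates merging N text deltas into one and signalling a fallback otherwise.

```python
from typing import Any, Dict, List, Optional

Event = Dict[str, Any]


def collapse_text_deltas(events: List[Event]) -> Optional[List[Event]]:
    """Merge the text deltas of a single-block stream into one delta event.

    Returns None when a non-text delta or a second block index is seen,
    signalling the caller to use the unchanged legacy path.
    Hypothetical helper for illustration only.
    """
    collapsed: List[Event] = []
    text_parts: List[str] = []
    delta_slot: Optional[int] = None  # position of the merged delta in `collapsed`
    block_index: Optional[int] = None

    for event in events:
        if event.get("type") != "content_block_delta":
            collapsed.append(event)  # pass non-delta events through untouched
            continue
        delta = event.get("delta", {})
        if delta.get("type") != "text_delta":
            return None  # tool_use, thinking, citations: fall back
        if block_index is None:
            block_index = event.get("index")
            delta_slot = len(collapsed)
            collapsed.append(event)  # placeholder, replaced after the loop
        elif event.get("index") != block_index:
            return None  # deltas from more than one block: fall back
        text_parts.append(delta.get("text", ""))

    if delta_slot is not None:
        collapsed[delta_slot] = {
            "type": "content_block_delta",
            "index": block_index,
            "delta": {"type": "text_delta", "text": "".join(text_parts)},
        }
    return collapsed
```

A caller would treat a `None` result as "use the legacy path" and otherwise convert the shorter event list, so the per-token conversion cost is paid once per block.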
* perf: address greptile review for anthropic streaming hot path
- Bail to legacy in `_collapse_pure_text_chunks` when `content_block_delta`
events from different block indexes are observed without an intervening
flush. Anthropic sends blocks strictly sequentially, but a defensive bail
prevents silent text-merging if the protocol ever interleaves them.
- Replace leaf-class `__dict__` check for `async_post_call_streaming_hook`
in `_callback_capabilities` with a function-identity comparison that
walks the MRO. A vendor base class can carry the override while the
registered subclass adds nothing of its own. Before this PR the hook was
unconditionally invoked, so missing an inherited override would
silently drop the hook on the streaming path.
- Add unit tests for both behaviors.
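To illustrate why a leaf-class `__dict__` check misses inherited overrides, here is a minimal sketch of a function-identity comparison. `BaseLogger` and `overrides_hook` are stand-in names assumed for this example, not the real `_callback_capabilities` code; `getattr` on the class performs the MRO walk.

```python
HOOK = "async_post_call_streaming_hook"


class BaseLogger:
    """Stand-in for the callback base class (hypothetical name)."""

    async def async_post_call_streaming_hook(self, user_api_key_dict, response):
        return response  # default: pass the chunk through unchanged


def overrides_hook(callback: object, base: type = BaseLogger, name: str = HOOK) -> bool:
    """True when the callback's class resolves `name` to a different function than `base`.

    getattr() on the class resolves the attribute through the MRO, so an
    override defined on any intermediate base class is detected.
    """
    resolved = getattr(type(callback), name, None)
    return resolved is not None and resolved is not getattr(base, name)


class VendorBase(BaseLogger):
    async def async_post_call_streaming_hook(self, user_api_key_dict, response):
        return response  # the override lives on the vendor base class


class Registered(VendorBase):
    pass  # adds nothing itself, so HOOK is absent from Registered.__dict__
```

With these classes, `HOOK in Registered.__dict__` is false even though `Registered` inherits an override, while `overrides_hook(Registered())` is true.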
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(mypy): narrow model_name to str in cost-injection branch
The hoisted cost_injection_active flag in chunk_processor encodes the
`bool(model_name)` requirement, but mypy can't track that invariant
through the local, so the per-chunk
`_process_chunk_with_cost_injection(chunk, model_name)` calls were
flagged as Optional[str] vs str. Pin a typed non-None local inside the
cost-injection branch so mypy narrows correctly without changing
runtime behavior.
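A simplified sketch of the narrowing pattern, assuming a stand-in `inject_cost` function and an `enabled` parameter in place of the real settings lookup. It is not the actual `chunk_processor` signature.

```python
from typing import Iterable, Iterator, Optional


def inject_cost(chunk: bytes, model_name: str) -> bytes:
    # Stand-in for the real cost-injection call; it requires a non-None str.
    return chunk + b"|" + model_name.encode()


def chunk_processor(
    chunks: Iterable[bytes], model_name: Optional[str], enabled: bool
) -> Iterator[bytes]:
    # Hoisted once, outside the per-chunk loop.
    cost_injection_active = enabled and bool(model_name)
    if cost_injection_active:
        # mypy cannot infer from the boolean flag that model_name is a str,
        # so pin a typed local. model_name is truthy in this branch, so the
        # `or ""` fallback never fires at runtime.
        narrowed_model_name: str = model_name or ""
        for chunk in chunks:
            yield inject_cost(chunk, narrowed_model_name)
    else:
        yield from chunks
```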
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Yassin Kortam <yassinkortam@g.ucla.edu>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When Model Armor blocks a streaming response, it correctly raises
HTTPException(status_code=400), but create_response() catches it with a
broad `except Exception` and hardcodes a 500 response, discarding the
original status code.
Fix create_response() to preserve status_code from HTTPException instead
of hardcoding 500. Also update Model Armor's streaming hook to yield an
SSE error event instead of raising (matching the Prisma Airs pattern),
and fix make_model_armor_request() to return 400 for upstream API
failures instead of passing through the upstream status code.
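The status-code handling can be sketched roughly as follows. To keep the example runnable without FastAPI, `HTTPException` here is a local stand-in class, and `run_and_map_errors` is a hypothetical helper rather than the real `create_response()`.

```python
from typing import Any, Callable, Dict, Tuple


class HTTPException(Exception):
    """Minimal stand-in for fastapi.HTTPException (assumption for this sketch)."""

    def __init__(self, status_code: int, detail: Any = None) -> None:
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def run_and_map_errors(first_chunk_fn: Callable[[], Any]) -> Tuple[int, Dict[str, Any]]:
    """Return (status, body), preserving the status code of an HTTPException."""
    try:
        return 200, {"data": first_chunk_fn()}
    except HTTPException as exc:
        # Handle the specific exception first so its status code survives.
        return exc.status_code, {
            "error": {"message": str(exc.detail), "code": exc.status_code}
        }
    except Exception as exc:
        # Only unknown failures fall through to a generic 500.
        return 500, {"error": {"message": str(exc), "code": 500}}
```

Ordering the handlers from specific to broad is what keeps a guardrail's 400 from being rewritten as a 500.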
When fastest_response=true is used with comma-separated models, the
response model field was stamped with the entire comma-separated string.
It now uses the x-litellm-model-group header from the winning response to
return the correct model name.
Made-with: Cursor
The existing AttributeError detection in proxy error handling only
checked one level deep in the exception chain (__cause__, __context__,
original_exception). In practice, the AttributeError from malformed
messages gets wrapped in multiple layers (AttributeError ->
OpenAIException -> APIConnectionError), so the check never found it.
Extracted the check into _has_attribute_error_in_chain() which walks
the full exception chain recursively (depth-capped at 10 to prevent
infinite loops from circular references).
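A minimal sketch of such a depth-capped recursive walk, following the three links named above. The function body is an illustration under those assumptions, not the actual `_has_attribute_error_in_chain()` implementation.

```python
from typing import Optional

MAX_DEPTH = 10  # cap taken from the description above


def has_attribute_error_in_chain(exc: Optional[BaseException], depth: int = 0) -> bool:
    """Return True if an AttributeError appears anywhere in the exception chain."""
    if exc is None or depth > MAX_DEPTH:
        return False  # end of chain, or depth cap reached (guards against cycles)
    if isinstance(exc, AttributeError):
        return True
    linked = (
        exc.__cause__,
        exc.__context__,
        getattr(exc, "original_exception", None),
    )
    return any(has_attribute_error_in_chain(e, depth + 1) for e in linked)
```

The depth cap bounds the recursion even when two exceptions reference each other, at the cost of missing an AttributeError buried deeper than ten links.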
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Azure Model Router transform_response: let parent extract actual model from raw response
- common_request_processing: skip model override for Azure Model Router requests
- proxy_server: skip streaming chunk model restamp for Azure Model Router
- Add _is_azure_model_router_request helper
- Add tests for non-streaming and streaming
Made-with: Cursor
* feat(proxy): add key_alias, key_hash, requested_model tags to DD APM spans
* refactor(proxy): consolidate DD APM tag helpers into DDSpanTagger class
* refactor(proxy): move DDSpanTagger to its own file litellm/proxy/dd_span_tagger.py
* fix(router): preserve _hidden_params in FallbackStreamWrapper so x-litellm-overhead-duration-ms is emitted for streaming requests
* test(router): add regression test for FallbackStreamWrapper _hidden_params preservation
* adding signoz integration to observability docs
* Fixing build
* Adding timeout for flaky test
* Fixing e2e
* fix(proxy): return json error response instead of sse format for initial streaming errors
When the first chunk of a streaming response contains an error,
return a standard JSON error response instead of SSE format.
This ensures clients receive properly formatted error responses
before the stream actually begins.
- rename create_streaming_response to create_response
- add logic to detect error in first chunk and return JSONResponse
- add _extract_error_from_sse_chunk helper function
- update all call sites to use the new function name
- update tests to reflect the function rename
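For illustration, error extraction from a first SSE chunk might look like the sketch below. It assumes the error arrives as a JSON object under an `error` key on a `data:` line; the real `_extract_error_from_sse_chunk` helper may differ in details.

```python
import json
from typing import Any, Dict, Optional, Union


def extract_error_from_sse_chunk(chunk: Union[str, bytes]) -> Optional[Dict[str, Any]]:
    """Return the error object from an SSE chunk, or None if there is none."""
    if isinstance(chunk, bytes):
        chunk = chunk.decode("utf-8", errors="replace")
    for line in chunk.splitlines():
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip event:, id:, comments, and blank lines
        payload = line[len("data:"):].strip()
        if not payload or payload == "[DONE]":
            continue
        try:
            parsed = json.loads(payload)
        except json.JSONDecodeError:
            continue  # invalid JSON is treated as "no error found"
        if isinstance(parsed, dict) and isinstance(parsed.get("error"), dict):
            return parsed["error"]
    return None
```

A caller could check only the first chunk with this helper and return a JSON response when it yields an error, streaming normally otherwise.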
* test(proxy): add comprehensive tests for error extraction from sse chunks
- Add new test class TestExtractErrorFromSSEChunk with 10 test cases
- Update existing tests to verify JSONResponse returned for initial streaming errors
- Add tests for error code as string, bytes input, invalid JSON, and edge cases
- Verify correct error format extraction from SSE chunks
---------
Co-authored-by: Goutham Karthi <goutham@signoz.io>
Co-authored-by: yuneng-jiang <yuneng.jiang@gmail.com>
Co-authored-by: YutaSaito <36355491+uc4w6c@users.noreply.github.com>
* fix: use fastuuid helper across the codebase
First batch of changes: a simple drop-in replacement.
* second batch of changes
* fix: script mistake in helper file