31 Commits

Author SHA1 Message Date
Yassin Kortam 2eab9ee2c0 perf: reduce per-request and per-chunk overhead across Anthropic streaming hot paths (#28289)
* perf: reduce per-request and per-chunk overhead across Anthropic streaming hot paths

- Introduce pure-text fast-path in `_build_complete_streaming_response` that collapses O(N) `content_block_delta` events into a single equivalent SSE event before conversion, eliminating per-output-token Pydantic `ModelResponseStream` construction; non-text streams (tool_use, thinking, citations) fall back to the unchanged legacy path
- Skip agentic streaming wrapper entirely when no callback overrides `async_should_run_agentic_loop`; the wrapper buffered every chunk and rebuilt the SSE response only to call hooks that all return `(False, {})` — a pure no-op for the default config
- Serialize request body once (`json.dumps`) for both the pre-call log input and the wire, instead of twice; avoids a full O(payload) scan per request, significant for long-context Claude Code histories
- Add fast path in `async_streaming_data_generator` that bypasses the per-chunk `async_post_call_streaming_hook` coroutine await, response-string materialization, and cost-injection call when no callback/guardrail/cost-injection is active (the default config)
- Resolve `_DD_STREAMING_TRACE_ENABLED` once at import time; eliminate per-chunk `NullSpan` context manager allocation when Datadog tracing is disabled (the default)
- Memoize `get_type_hints(AnthropicMessagesRequestOptionalParams)` with `@lru_cache(maxsize=1)` — resolves once per process instead of once per `/v1/messages` request (~80µs each)
- Hoist `cost_injection_active` out of the per-chunk loop in `chunk_processor`; eliminates repeated `getattr` + endpoint-type checks on every streamed byte chunk
- Extract `_build_passthrough_logging_result` from `_route_streaming_logging_to_handler` as a standalone static method to facilitate future off-loop dispatch
- Convert `async_sse_data_generator` from an `async for: yield` trampoline to a direct return of the underlying generator, removing one async-generator layer per streamed chunk
- Skip redundant `strip_empty_text_blocks_from_anthropic_messages` scan in `anthropic_messages_handler` when the async wrapper already sanitized (signalled via `_litellm_messages_presanitized` sentinel, popped before reaching provider params)
- Gate debug log `f-string` evaluation behind `isEnabledFor(DEBUG)` in both the streaming generator and the transformation layer to avoid serializing entire message payloads on every request at non-debug log levels
- Add benchmark script (`scripts/benchmark_anthropic_messages_perf.py`) with a local mock Anthropic SSE provider for reproducible TTFT and TPM measurement across commits/branches
- Add parity tests asserting fast-path and legacy-path produce byte-identical logged/billed payloads, plus unit tests for agentic hook detection, pre-serialized body reuse, and memoized key resolution
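
To illustrate the `get_type_hints` memoization item above, here is a minimal sketch. `ExampleOptionalParams`, `optional_param_keys`, and `split_optional_params` are hypothetical stand-ins, not names from the PR; only the `@lru_cache(maxsize=1)` pattern comes from the commit message.

```python
from functools import lru_cache
from typing import Optional, TypedDict, get_type_hints


class ExampleOptionalParams(TypedDict, total=False):
    # Hypothetical stand-in for AnthropicMessagesRequestOptionalParams.
    max_tokens: int
    temperature: Optional[float]
    stream: bool


@lru_cache(maxsize=1)
def optional_param_keys() -> frozenset:
    # get_type_hints re-resolves the annotations on every call, so the
    # result is cached and computed only once per process.
    return frozenset(get_type_hints(ExampleOptionalParams))


def split_optional_params(kwargs: dict) -> dict:
    # Per-request work is now a set-membership filter only.
    keys = optional_param_keys()
    return {k: v for k, v in kwargs.items() if k in keys}
```

The cached value is a `frozenset` so that callers cannot mutate the shared result by accident.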

* perf: address greptile review for anthropic streaming hot path

- Bail to legacy in `_collapse_pure_text_chunks` when content_block_delta
  events from different block indexes are observed without an intervening
  flush. Anthropic sends blocks strictly sequentially, but defensive bail
  prevents silent text-merging if the protocol ever interleaves.
- Replace leaf-class `__dict__` check for `async_post_call_streaming_hook`
  in `_callback_capabilities` with a function-identity comparison that
  walks the MRO. A vendor base class can carry the override and the
  registered class can add nothing else; before this PR the hook was
  unconditionally invoked, so an inherited-override miss would silently
  drop the hook on the streaming path.
- Add unit tests for both behaviors.
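
The difference between the leaf-class `__dict__` check and a function-identity comparison can be sketched as follows. All class and helper names here are hypothetical; the sketch assumes the comparison relies on ordinary class attribute lookup, which resolves through the MRO.

```python
class BaseLogger:
    # Hypothetical base class whose hook is a pass-through.
    async def async_post_call_streaming_hook(self, chunk):
        return chunk


class VendorBase(BaseLogger):
    # A vendor base class that carries the real override.
    async def async_post_call_streaming_hook(self, chunk):
        return chunk.upper()


class RegisteredCallback(VendorBase):
    # The registered class adds nothing of its own.
    pass


HOOK = "async_post_call_streaming_hook"


def overrides_in_leaf_dict(callback) -> bool:
    # Only inspects the leaf class, so inherited overrides are missed.
    return HOOK in type(callback).__dict__


def overrides_anywhere_in_mro(callback) -> bool:
    # Class attribute lookup walks the MRO; if the resolved function is
    # not the base implementation, some ancestor overrides the hook.
    return getattr(type(callback), HOOK) is not getattr(BaseLogger, HOOK)
```

With these definitions, `overrides_in_leaf_dict(RegisteredCallback())` is `False` while `overrides_anywhere_in_mro(RegisteredCallback())` is `True`, which is the inherited-override case the review fix describes.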

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(mypy): narrow model_name to str in cost-injection branch

The hoisted cost_injection_active flag in chunk_processor encodes the
`bool(model_name)` requirement but mypy can't track that invariant
through the local, so the per-chunk `_process_chunk_with_cost_injection(
chunk, model_name)` calls flagged Optional[str] vs str. Pin a typed
non-None local inside the cost-injection branch so mypy narrows
correctly without changing runtime behavior.
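
A minimal sketch of the narrowing problem, under stated assumptions: the function bodies and the `inject_cost` parameter are hypothetical, and `typing.cast` is one way to pin a typed local without runtime effect; the commit does not say which construct it uses.

```python
from typing import Iterable, Iterator, Optional, cast


def _process_chunk_with_cost_injection(chunk: bytes, model_name: str) -> bytes:
    # Hypothetical body; only the (chunk, model_name: str) shape matters here.
    return chunk + b"|" + model_name.encode()


def chunk_processor(
    chunks: Iterable[bytes], model_name: Optional[str], inject_cost: bool
) -> Iterator[bytes]:
    # Hoisted once, outside the per-chunk loop.
    cost_injection_active = inject_cost and bool(model_name)
    if cost_injection_active:
        # mypy cannot see that the flag implies model_name is not None,
        # so a typed local is pinned for the calls below. cast() does
        # nothing at runtime.
        typed_model_name: str = cast(str, model_name)
        for chunk in chunks:
            yield _process_chunk_with_cost_injection(chunk, typed_model_name)
    else:
        yield from chunks
```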

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Yassin Kortam <yassinkortam@g.ucla.edu>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 12:15:59 -07:00
ishaan-berri b891a201f8 Preserve LiteLLM headers for passthrough responses (#27412)
Co-authored-by: oss-agent-shin <279349115+oss-agent-shin@users.noreply.github.com>
Co-authored-by: ishaan-berri <ishaan-berri@users.noreply.github.com>
2026-05-07 12:59:36 -07:00
Sameer Kankute 4cecfec9f9 feat(proxy): LiteLLM headers on Google native generateContent routes (#25500)
* feat(proxy): return LiteLLM headers on Google native generateContent routes

Wire build_litellm_proxy_success_headers_from_llm_response for :generateContent
and :streamGenerateContent so x-litellm-*, rate limit, and provider headers
match the OpenAI-style proxy path. Add unit test.

Annotate httpx.HTTPStatusError branch so pyright accepts .response after optional
exception transform. Remove unused variable in streaming tracer test (Ruff F841).

Made-with: Cursor

* fix(proxy): prefill Google GenAI stream _hidden_params for proxy headers

- Pass model_id, api_base, and process_response_headers output into streaming
  iterators so streamGenerateContent gets the same x-litellm-* headers as
  non-streaming paths.
- Drop request_data deployment mutation from build_litellm_proxy_success_headers_from_llm_response.
- Avoid logging raw request key names in oversized debug payload (code scanning).
- Extend tests for streaming iterator shape, metadata fallback, and helper.

Made-with: Cursor

* Update litellm/proxy/common_request_processing.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* remove unused key count

* Fix greptile review

* Update litellm/proxy/common_request_processing.py

Co-authored-by: Mateo Wang <277851410+mateo-berri@users.noreply.github.com>

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Mateo Wang <277851410+mateo-berri@users.noreply.github.com>
2026-04-29 12:34:14 -07:00
Ishaan Jaffer e8461b5b97 style: run black formatter on files from main merge 2026-04-17 13:02:59 -07:00
michelligabriele 363f9fe5da fix(proxy): preserve dict guardrail HTTPException.detail + bedrock context (#25558) 2026-04-11 09:40:39 -07:00
michelligabriele a6dfd02610 fix(guardrails): return HTTP 400 instead of 500 for Model Armor streaming blocks (#24693)
When Model Armor blocks a streaming response, it correctly raises
HTTPException(status_code=400) but create_response() catches it with a
bare except Exception and hardcodes a 500 response, discarding the
original status code.

Fix create_response() to preserve status_code from HTTPException instead
of hardcoding 500. Also update Model Armor's streaming hook to yield an
SSE error event instead of raising (matching the Prisma Airs pattern),
and fix make_model_armor_request() to return 400 for upstream API
failures instead of passing through the upstream status code.
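
The status-code part of this fix can be sketched as below. To stay self-contained, `HTTPException` here is a local stand-in exposing the same `status_code` and `detail` attributes as FastAPI's class, and `error_status_and_body` is a hypothetical helper, not the actual `create_response()` code.

```python
from typing import Tuple


class HTTPException(Exception):
    # Local stand-in mirroring the status_code/detail attributes of
    # FastAPI's HTTPException (assumption made for a runnable example).
    def __init__(self, status_code: int, detail: object = None) -> None:
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def error_status_and_body(exc: Exception) -> Tuple[int, dict]:
    # Keep the status code chosen by the code that raised the
    # HTTPException instead of flattening every failure to 500.
    if isinstance(exc, HTTPException):
        return exc.status_code, {"error": {"message": str(exc.detail)}}
    return 500, {"error": {"message": str(exc)}}
```

A guardrail block raised as `HTTPException(status_code=400, ...)` then maps to a 400 response, while any other exception still falls back to 500.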
2026-04-02 21:28:52 -07:00
Krrish Dholakia 32adda8a49 fix: return winning model name instead of comma-separated list for fastest_response
When fastest_response=true with comma-separated models, the response
model field was stamped with the entire comma-separated string. Now
uses the x-litellm-model-group header from the winning response to
return the correct model name.

Made-with: Cursor
2026-03-27 22:34:26 -07:00
yuneng-jiang 8ca744036a [Fix] Malformed messages returning 500 instead of 400
The existing AttributeError detection in proxy error handling only
checked one level deep in the exception chain (__cause__, __context__,
original_exception). In practice, the AttributeError from malformed
messages gets wrapped in multiple layers (AttributeError ->
OpenAIException -> APIConnectionError), so the check never found it.

Extracted the check into _has_attribute_error_in_chain() which walks
the full exception chain recursively (depth-capped at 10 to prevent
infinite loops from circular references).
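
A depth-capped recursive walk of this kind might look like the sketch below. The name and exact structure are assumptions; only the three links (`__cause__`, `__context__`, `original_exception`) and the cap of 10 come from the commit message.

```python
from typing import Optional

MAX_CHAIN_DEPTH = 10  # cap taken from the commit message


def has_attribute_error_in_chain(
    exc: Optional[BaseException], depth: int = 0
) -> bool:
    # Stop on a missing link or when the cap is exceeded, so circular
    # references cannot recurse forever.
    if exc is None or depth > MAX_CHAIN_DEPTH:
        return False
    if isinstance(exc, AttributeError):
        return True
    links = (
        exc.__cause__,
        exc.__context__,
        getattr(exc, "original_exception", None),
    )
    return any(
        has_attribute_error_in_chain(link, depth + 1)
        for link in links
        if link is not None
    )
```

This finds an `AttributeError` wrapped several layers deep, which a one-level check would miss.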

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 23:01:25 -07:00
Sameer Kankute 36ec80d90c Fix azure model router 2026-03-12 12:40:37 +05:30
Sameer Kankute 5b83aae715 feat(azure_ai): show actual model used in Azure Model Router response
- Azure Model Router transform_response: let parent extract actual model from raw response
- common_request_processing: skip model override for Azure Model Router requests
- proxy_server: skip streaming chunk model restamp for Azure Model Router
- Add _is_azure_model_router_request helper
- Add tests for non-streaming and streaming

Made-with: Cursor
2026-03-12 11:41:19 +05:30
Ishaan Jaff 7befe3c78f feat(proxy): add key_alias, key_hash, requested_model DD APM span tags (#22710)
* feat(proxy): add key_alias, key_hash, requested_model tags to DD APM spans

* refactor(proxy): consolidate DD APM tag helpers into DDSpanTagger class

* refactor(proxy): move DDSpanTagger to its own file litellm/proxy/dd_span_tagger.py
2026-03-03 20:22:59 -08:00
Ishaan Jaff 9546d9b482 _add_dd_apm_tags_for_litellm_call_id (#22219) 2026-02-26 16:42:23 -08:00
Ishaan Jaff c343bfffda fix(router): emit x-litellm-overhead-duration-ms header for streaming requests (#22027)
* fix(router): preserve _hidden_params in FallbackStreamWrapper so x-litellm-overhead-duration-ms is emitted for streaming requests

* test(router): add regression test for FallbackStreamWrapper _hidden_params preservation
2026-02-24 11:56:16 -08:00
Harshit Jain 9fc3c77c42 fix: ensure arrival_time is set before calculating queue time 2026-02-23 17:04:47 +05:30
yuneng-jiang fd3ca081cc use cached keys and teams for router settings 2026-02-06 15:07:29 -08:00
yuneng-jiang 400e560ee5 Merge remote-tracking branch 'origin' into litellm_router_search_fix 2026-02-06 14:08:55 -08:00
Ishaan Jaffer 35e29c2bcd Revert "Merge pull request #18790 from BerriAI/litellm_key_team_routing_3"
This reverts commit ae26d8e68a, reversing
changes made to 864e8c6543.
2026-01-31 17:58:46 -08:00
yuneng-jiang a9eae5937f Override router settings 2026-01-31 16:04:52 -08:00
yuneng-jiang c9261c9f37 fix model name during fallback 2026-01-31 11:46:58 -08:00
Sameer Kankute 844c766c65 Merge pull request #18763 from BerriAI/litellm_staging_01_07_2026
Staging - 01/07/2026
2026-01-09 17:01:58 +05:30
yuneng-jiang 51759424a6 Key and Team Routing Setting 2026-01-07 17:17:30 -08:00
Kris Xia 91b5c66cf2 fix(proxy): return json error response instead of sse format for initial streaming errors (#18757)
* adding signoz integration to observability docs

* Fixing build

* Adding timeout for flaky test

* Fixing e2e

* fix(proxy): return json error response instead of sse format for initial streaming errors

when the first chunk of a streaming response contains an error,
return a standard json error response instead of sse format.
this ensures clients receive properly formatted error responses
before the stream actually begins.

- rename create_streaming_response to create_response
- add logic to detect error in first chunk and return JSONResponse
- add _extract_error_from_sse_chunk helper function
- update all call sites to use the new function name
- update tests to reflect the function rename
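
As a rough illustration of first-chunk error detection, the sketch below parses an SSE chunk and returns the error object if one is present. It is not the actual `_extract_error_from_sse_chunk` implementation; the payload shape (`{"error": {...}}` on a `data:` line) is an assumption.

```python
import json
from typing import Optional, Union


def extract_error_from_sse_chunk(chunk: Union[str, bytes]) -> Optional[dict]:
    # Accept both bytes and str input.
    if isinstance(chunk, bytes):
        chunk = chunk.decode("utf-8", errors="replace")
    for line in chunk.splitlines():
        line = line.strip()
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if not payload or payload == "[DONE]":
            continue
        try:
            parsed = json.loads(payload)
        except json.JSONDecodeError:
            # Invalid JSON is treated as "no error found".
            return None
        if isinstance(parsed, dict) and isinstance(parsed.get("error"), dict):
            return parsed["error"]
    return None
```

A caller could inspect the first chunk with this helper and, when it returns a dict, send a plain JSON error response instead of opening the stream.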

* test(proxy): add comprehensive tests for error extraction from sse chunks

- Add new test class TestExtractErrorFromSSEChunk with 10 test cases
- Update existing tests to verify JSONResponse returned for initial streaming errors
- Add tests for error code as string, bytes input, invalid JSON, and edge cases
- Verify correct error format extraction from SSE chunks

---------

Co-authored-by: Goutham Karthi <goutham@signoz.io>
Co-authored-by: yuneng-jiang <yuneng.jiang@gmail.com>
Co-authored-by: YutaSaito <36355491+uc4w6c@users.noreply.github.com>
2026-01-07 21:26:47 +05:30
Ishaan Jaff 1123cfa928 [Feat] AI Gateway - Add support for Platform Fee / Margins (#18427)
* init cost_margin_config

* feat: add cost margin

* init types

* LITELLM_SETTINGS_SAFE_DB_OVERRIDES

* feat _apply_cost_margin

* ui endpoint

* ui provider margins

* add margin

* refactored ui

* test cost margins

* refactored ui

* provider discounts

* add cost_breakdown to spendLogs

* add CostBreakdownViewer

* fix cost breakdown

* docs fix

* doc margins

* docs margins
2025-12-25 11:07:27 +05:30
Sameer Kankute caaf8a6784 Fix x-litellm-key-spend update 2025-12-12 11:44:51 +05:30
Krish Dholakia 1eb06f8031 Revert "fix: respect guardrail mock_response during during_call to return blo…" (#17332)
This reverts commit 6de6107673.
2025-12-01 15:40:28 -08:00
YutaSaito 6de6107673 fix: respect guardrail mock_response during during_call to return blocked output (#17247) 2025-12-01 09:59:01 -08:00
Ishaan Jaff a6c57cb5bd [Feat] Cost Tracking - specify a global vendor discount for costs. (#15546)
* fix cost_discount_config

* add CostBreakdown

* fix: set_cost_breakdown

* test_cost_discount_vertex_ai

* docs fix

* docs fix discounts

* docs fix

* docs custom pricing

* docs fix

* fixes for getting cost breakdown in response headers

* test - response headers with discount
2025-10-14 20:07:04 -07:00
Alexsander Hamir eaa04cd8ce fix: use fastuuid helper (#14903)
* fix: use fastuuid helper across the codebase

First batch of changes, simple drop in replacement.

* second batch of changes

* fixed: script mistake on helper file
2025-09-25 15:47:01 -07:00
Ishaan Jaff 98d57b5d27 [Feat] Allow using x-litellm-stream-timeout header for stream timeout in requests (#14147)
* fix: allow passing stream_timeout header

* fix: _get_stream_timeout_from_request

* test_add_litellm_data_to_request_with_stream_timeout_header

* docs: LiteLLM Headers

* test_add_litellm_data_to_request_with_stream_timeout_header
2025-09-01 15:59:14 -07:00
Ishaan Jaff 8a4b163453 [Feat] DD Trace - Add instrumentation for streaming chunks (#11338)
* fix: add tracing for litellm.completion

* fix: NULL span add trace

* fix: add tracing for litellm.completion streaming

* fix: add tracing for litellm.completion streaming

* fix: use a constant for str
2025-06-02 16:48:39 -07:00
Krish Dholakia ef42461c1e Litellm fix GitHub action testing (#11163)
* test: add __init__.py files

* refactor: rename test folder to avoid naming conflict

* test: update workflows

* test: update tests

* test: update imports

* test: update tests

* test: remove unused import

* ci(test-litellm.yml): add pytest retry to github workflow

* test: fix test
2025-05-26 14:41:42 -07:00