litellm

mirror of https://github.com/tiennm99/litellm.git synced 2026-06-17 22:48:35 +00:00

Author	SHA1	Message	Date
Yassin Kortam	2eab9ee2c0	perf: reduce per-request and per-chunk overhead across Anthropic streaming hot paths (#28289) * perf: reduce per-request and per-chunk overhead across Anthropic streaming hot paths - Introduce pure-text fast-path in `_build_complete_streaming_response` that collapses O(N) `content_block_delta` events into a single equivalent SSE event before conversion, eliminating per-output-token Pydantic `ModelResponseStream` construction; non-text streams (tool_use, thinking, citations) fall back to the unchanged legacy path - Skip agentic streaming wrapper entirely when no callback overrides `async_should_run_agentic_loop`; the wrapper buffered every chunk and rebuilt the SSE response only to call hooks that all return `(False, {})`, a pure no-op for the default config - Serialize request body once (`json.dumps`) for both the pre-call log input and the wire, instead of twice; avoids a full O(payload) scan per request, significant for long-context Claude Code histories - Add fast path in `async_streaming_data_generator` that bypasses the per-chunk `async_post_call_streaming_hook` coroutine await, response-string materialization, and cost-injection call when no callback/guardrail/cost-injection is active (the default config) - Resolve `_DD_STREAMING_TRACE_ENABLED` once at import time; eliminate per-chunk `NullSpan` context manager allocation when Datadog tracing is disabled (the default) - Memoize `get_type_hints(AnthropicMessagesRequestOptionalParams)` with `@lru_cache(maxsize=1)`; resolves once per process instead of once per `/v1/messages` request (~80µs each) - Hoist `cost_injection_active` out of the per-chunk loop in `chunk_processor`; eliminates repeated `getattr` + endpoint-type checks on every streamed byte chunk - Extract `_build_passthrough_logging_result` from `_route_streaming_logging_to_handler` as a standalone static method to facilitate future off-loop dispatch - Convert `async_sse_data_generator` from an `async for: yield` trampoline to a direct return of the underlying generator, removing one async-generator layer per streamed chunk - Skip redundant `strip_empty_text_blocks_from_anthropic_messages` scan in `anthropic_messages_handler` when the async wrapper already sanitized (signalled via `_litellm_messages_presanitized` sentinel, popped before reaching provider params) - Gate debug log `f-string` evaluation behind `isEnabledFor(DEBUG)` in both the streaming generator and the transformation layer to avoid serializing entire message payloads on every request at non-debug log levels - Add benchmark script (`scripts/benchmark_anthropic_messages_perf.py`) with a local mock Anthropic SSE provider for reproducible TTFT and TPM measurement across commits/branches - Add parity tests asserting fast-path and legacy-path produce byte-identical logged/billed payloads, plus unit tests for agentic hook detection, pre-serialized body reuse, and memoized key resolution * perf: address greptile review for anthropic streaming hot path - Bail to legacy in `_collapse_pure_text_chunks` when content_block_delta events from different block indexes are observed without an intervening flush. Anthropic sends blocks strictly sequentially, but defensive bail prevents silent text-merging if the protocol ever interleaves. - Replace leaf-class `__dict__` check for `async_post_call_streaming_hook` in `_callback_capabilities` with a function-identity comparison that walks the MRO. A vendor base class can carry the override and the registered class can add nothing else; before this PR the hook was unconditionally invoked, so an inherited-override miss would silently drop the hook on the streaming path. - Add unit tests for both behaviors. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(mypy): narrow model_name to str in cost-injection branch The hoisted cost_injection_active flag in chunk_processor encodes the `bool(model_name)` requirement but mypy can't track that invariant through the local, so the per-chunk `_process_chunk_with_cost_injection(chunk, model_name)` calls flagged Optional[str] vs str. Pin a typed non-None local inside the cost-injection branch so mypy narrows correctly without changing runtime behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Yassin Kortam <yassinkortam@g.ucla.edu> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-23 12:15:59 -07:00
ishaan-berri	b891a201f8	Preserve LiteLLM headers for passthrough responses (#27412) Co-authored-by: oss-agent-shin <279349115+oss-agent-shin@users.noreply.github.com> Co-authored-by: ishaan-berri <ishaan-berri@users.noreply.github.com>	2026-05-07 12:59:36 -07:00
Sameer Kankute	4cecfec9f9	feat(proxy): LiteLLM headers on Google native generateContent routes (#25500) * feat(proxy): return LiteLLM headers on Google native generateContent routes Wire build_litellm_proxy_success_headers_from_llm_response for :generateContent and :streamGenerateContent so x-litellm-, rate limit, and provider headers match the OpenAI-style proxy path. Add unit test. Annotate httpx.HTTPStatusError branch so pyright accepts .response after optional exception transform. Remove unused variable in streaming tracer test (Ruff F841). Made-with: Cursor fix(proxy): prefill Google GenAI stream _hidden_params for proxy headers - Pass model_id, api_base, and process_response_headers output into streaming iterators so streamGenerateContent gets the same x-litellm-* headers as non-streaming paths. - Drop request_data deployment mutation from build_litellm_proxy_success_headers_from_llm_response. - Avoid logging raw request key names in oversized debug payload (code scanning). - Extend tests for streaming iterator shape, metadata fallback, and helper. Made-with: Cursor * Update litellm/proxy/common_request_processing.py Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * remove unused key count * Fix greptile review * Update litellm/proxy/common_request_processing.py Co-authored-by: Mateo Wang <277851410+mateo-berri@users.noreply.github.com> --------- Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Co-authored-by: Mateo Wang <277851410+mateo-berri@users.noreply.github.com>	2026-04-29 12:34:14 -07:00
Ishaan Jaffer	e8461b5b97	style: run black formatter on files from main merge	2026-04-17 13:02:59 -07:00
michelligabriele	363f9fe5da	fix(proxy): preserve dict guardrail HTTPException.detail + bedrock context (#25558)	2026-04-11 09:40:39 -07:00
michelligabriele	a6dfd02610	fix(guardrails): return HTTP 400 instead of 500 for Model Armor streaming blocks (#24693) When Model Armor blocks a streaming response, it correctly raises HTTPException(status_code=400) but create_response() catches it with a bare except Exception and hardcodes a 500 response, discarding the original status code. Fix create_response() to preserve status_code from HTTPException instead of hardcoding 500. Also update Model Armor's streaming hook to yield an SSE error event instead of raising (matching the Prisma Airs pattern), and fix make_model_armor_request() to return 400 for upstream API failures instead of passing through the upstream status code.	2026-04-02 21:28:52 -07:00
Krrish Dholakia	32adda8a49	fix: return winning model name instead of comma-separated list for fastest_response When fastest_response=true with comma-separated models, the response model field was stamped with the entire comma-separated string. Now uses the x-litellm-model-group header from the winning response to return the correct model name. Made-with: Cursor	2026-03-27 22:34:26 -07:00
yuneng-jiang	8ca744036a	[Fix] Malformed messages returning 500 instead of 400 The existing AttributeError detection in proxy error handling only checked one level deep in the exception chain (__cause__, __context__, original_exception). In practice, the AttributeError from malformed messages gets wrapped in multiple layers (AttributeError -> OpenAIException -> APIConnectionError), so the check never found it. Extracted the check into _has_attribute_error_in_chain() which walks the full exception chain recursively (depth-capped at 10 to prevent infinite loops from circular references). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 23:01:25 -07:00
Sameer Kankute	36ec80d90c	Fix azure model router	2026-03-12 12:40:37 +05:30
Sameer Kankute	5b83aae715	feat(azure_ai): show actual model used in Azure Model Router response - Azure Model Router transform_response: let parent extract actual model from raw response - common_request_processing: skip model override for Azure Model Router requests - proxy_server: skip streaming chunk model restamp for Azure Model Router - Add _is_azure_model_router_request helper - Add tests for non-streaming and streaming Made-with: Cursor	2026-03-12 11:41:19 +05:30
Ishaan Jaff	7befe3c78f	feat(proxy): add key_alias, key_hash, requested_model DD APM span tags (#22710) * feat(proxy): add key_alias, key_hash, requested_model tags to DD APM spans * refactor(proxy): consolidate DD APM tag helpers into DDSpanTagger class * refactor(proxy): move DDSpanTagger to its own file litellm/proxy/dd_span_tagger.py	2026-03-03 20:22:59 -08:00
Ishaan Jaff	9546d9b482	_add_dd_apm_tags_for_litellm_call_id (#22219)	2026-02-26 16:42:23 -08:00
Ishaan Jaff	c343bfffda	fix(router): emit x-litellm-overhead-duration-ms header for streaming requests (#22027) * fix(router): preserve _hidden_params in FallbackStreamWrapper so x-litellm-overhead-duration-ms is emitted for streaming requests * test(router): add regression test for FallbackStreamWrapper _hidden_params preservation	2026-02-24 11:56:16 -08:00
Harshit Jain	9fc3c77c42	fix: ensure arrival_time is set before calculating queue time	2026-02-23 17:04:47 +05:30
yuneng-jiang	fd3ca081cc	use cached keys and teams for router settings	2026-02-06 15:07:29 -08:00
yuneng-jiang	400e560ee5	Merge remote-tracking branch 'origin' into litellm_router_search_fix	2026-02-06 14:08:55 -08:00
Ishaan Jaffer	35e29c2bcd	Revert "Merge pull request #18790 from BerriAI/litellm_key_team_routing_3" This reverts commit `ae26d8e68a`, reversing changes made to `864e8c6543`.	2026-01-31 17:58:46 -08:00
yuneng-jiang	a9eae5937f	Override router settings	2026-01-31 16:04:52 -08:00
yuneng-jiang	c9261c9f37	fix model name during fallback	2026-01-31 11:46:58 -08:00
Sameer Kankute	844c766c65	Merge pull request #18763 from BerriAI/litellm_staging_01_07_2026 Staging - 01/07/2026	2026-01-09 17:01:58 +05:30
yuneng-jiang	51759424a6	Key and Team Routing Setting	2026-01-07 17:17:30 -08:00
Kris Xia	91b5c66cf2	fix(proxy): return json error response instead of sse format for initial streaming errors (#18757) * adding signoz integration to observability docs * Fixing build * Adding timeout for flaky test * Fixing e2e * fix(proxy): return json error response instead of sse format for initial streaming errors when the first chunk of a streaming response contains an error, return a standard json error response instead of sse format. this ensures clients receive properly formatted error responses before the stream actually begins. - rename create_streaming_response to create_response - add logic to detect error in first chunk and return JSONResponse - add _extract_error_from_sse_chunk helper function - update all call sites to use the new function name - update tests to reflect the function rename * test(proxy): add comprehensive tests for error extraction from sse chunks - Add new test class TestExtractErrorFromSSEChunk with 10 test cases - Update existing tests to verify JSONResponse returned for initial streaming errors - Add tests for error code as string, bytes input, invalid JSON, and edge cases - Verify correct error format extraction from SSE chunks --------- Co-authored-by: Goutham Karthi <goutham@signoz.io> Co-authored-by: yuneng-jiang <yuneng.jiang@gmail.com> Co-authored-by: YutaSaito <36355491+uc4w6c@users.noreply.github.com>	2026-01-07 21:26:47 +05:30
Ishaan Jaff	1123cfa928	[Feat] AI Gateway - Add support for Platform Fee / Margins (#18427) * init cost_margin_config * feat: add cost margin * init types * LITELLM_SETTINGS_SAFE_DB_OVERRIDES * feat _apply_cost_margin * ui endpoint * ui provider margins * add margin * refactored ui * test cost margins * refactored ui * provider discounts * add cost_breakdown to spendLogs * add CostBreakdownViewer * fix cost breakdown * docs fix * doc margins * docs margins	2025-12-25 11:07:27 +05:30
Sameer Kankute	caaf8a6784	Fix x-litellm-key-spend update	2025-12-12 11:44:51 +05:30
Krish Dholakia	1eb06f8031	Revert "fix: respect guardrail mock_response during during_call to return blo…" (#17332) This reverts commit `6de6107673`.	2025-12-01 15:40:28 -08:00
YutaSaito	6de6107673	fix: respect guardrail mock_response during during_call to return blocked output (#17247)	2025-12-01 09:59:01 -08:00
Ishaan Jaff	a6c57cb5bd	[Feat] Cost Tracking - specify a global vendor discount for costs. (#15546) * fix cost_discount_config * add CostBreakdown * fix: set_cost_breakdown * test_cost_discount_vertex_ai * docs fix * docs fix discounts * docs fix * docs custom pricing * docs fix * fixes for getting cost breakdown in response headers * test - response headers wth discount	2025-10-14 20:07:04 -07:00
Alexsander Hamir	eaa04cd8ce	fix: use fastuuid helper (#14903) * fix: use fastuuid helper across the codebase First batch of changes, simple drop in replacement. * second batch of changes * fixed: script mistake on helper file	2025-09-25 15:47:01 -07:00
Ishaan Jaff	98d57b5d27	[Feat] Allow using `x-litellm-stream-timeout` header for stream timeout in requests (#14147) * fix: allow passing stream_timeout header * fix: _get_stream_timeout_from_request * test_add_litellm_data_to_request_with_stream_timeout_header * docs: LiteLLM Headers * test_add_litellm_data_to_request_with_stream_timeout_header	2025-09-01 15:59:14 -07:00
Ishaan Jaff	8a4b163453	[Feat] DD Trace - Add instrumentation for streaming chunks (#11338) * fix: add tracing for litellm.completion * fix: NULL span add trace * fix: add tracing for litellm.completion streaming * fix: add tracing for litellm.completion streaming * fix: use a constant for str	2025-06-02 16:48:39 -07:00
Krish Dholakia	ef42461c1e	Litellm fix GitHub action testing (#11163) * test: add __init__.py files * refactor: rename test folder to avoid naming conflict * test: update workflows * test: update tests * test: update imports * test: update tests * test: remove unused import * ci(test-litellm.yml): add pytest retry to github workflow * test: fix test	2025-05-26 14:41:42 -07:00

31 Commits
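
Note on commit 8ca744036a: the message describes walking the full exception chain (`__cause__`, `__context__`, `original_exception`) recursively with a depth cap of 10, so that an `AttributeError` buried under several wrapper exceptions is still found. The following is only a rough sketch of that idea, not the code from the commit: the function name `has_attribute_error_in_chain`, the constant `MAX_CHAIN_DEPTH`, and the exact cap semantics are assumptions made for illustration.

```python
from typing import Optional

# Assumed constant name; the value 10 is the depth cap quoted in the commit message.
MAX_CHAIN_DEPTH = 10


def has_attribute_error_in_chain(exc: Optional[BaseException], depth: int = 0) -> bool:
    """Return True if exc, or any exception it wraps, is an AttributeError."""
    # Stop at the depth cap so circular references cannot recurse forever.
    if exc is None or depth > MAX_CHAIN_DEPTH:
        return False
    if isinstance(exc, AttributeError):
        return True
    # The three links named in the commit message. original_exception is not a
    # standard attribute, so it is read defensively with getattr.
    links = (
        exc.__cause__,
        exc.__context__,
        getattr(exc, "original_exception", None),
    )
    for link in links:
        if isinstance(link, BaseException) and has_attribute_error_in_chain(link, depth + 1):
            return True
    return False
```

A one-level check would miss a chain such as AttributeError -> wrapper -> wrapper, which is the case the commit message calls out; the recursive walk reaches it while the cap bounds the work on cyclic chains.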