mirror of
https://github.com/tiennm99/litellm.git
synced 2026-06-17 16:48:54 +00:00
cfcdf8714a
* Add gpt-realtime-whisper Realtime transcription support (OpenAI + Azure) (#29775) * Add gpt-realtime-whisper Realtime transcription support (OpenAI + Azure) Adds first-class support for the gpt-realtime-whisper streaming speech-to-text model, which uses the Realtime transcription session API rather than the file-based /audio/transcriptions path. Model registration: registers gpt-realtime-whisper and azure/gpt-realtime-whisper with audio-duration pricing (input_cost_per_second = 0.017/60, matching the published $0.017/minute input audio rate). REST endpoint: implements POST /v1/realtime/transcription_sessions (plus /realtime and /openai/v1 aliases) to mint an ephemeral transcription session for the WebRTC flow. Adds request/response types, OpenAI and Azure URL builders, a shared base handler (refactored from the client_secrets handler), the acreate_realtime_transcription_session SDK function, and route registration. The proxy encrypts the ephemeral key returned under client_secret.value and records the session type in the token so the follow-up /realtime/calls replays type=transcription rather than type=realtime. WebSocket: forwards intent=transcription through to the Azure handler (OpenAI already received it) with URL-encoding, so gpt-realtime-whisper opens a transcription session. Transcription-only sessions no longer trigger an erroneous response.create. Cost tracking: transcription sessions emit no response.done events; their usage arrives on conversation.item.input_audio_transcription.completed as {type: duration, seconds}. That usage is captured out-of-band (usage only, no transcript duplication) and billed by input_cost_per_second, with a token-billed fallback for token-priced transcription models. Adds tests for pricing math, URL builders, request/response types, the proxy route and SDK function, WebSocket intent forwarding, transcription-session streaming behavior, and the /realtime/calls session-type replay. 
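The duration-based billing described above can be sketched as follows. This is a minimal illustration, not the library's code: the function name `transcription_cost` is hypothetical, and only the rate (0.017/60 per second) and the usage shape `{type: duration, seconds}` come from the commit message.

```python
# Rate taken from the commit message: $0.017 per minute of input audio.
INPUT_COST_PER_SECOND = 0.017 / 60


def transcription_cost(usage: dict, input_cost_per_second: float = INPUT_COST_PER_SECOND) -> float:
    """Hypothetical helper: bill a duration-typed usage payload by the second."""
    if usage.get("type") != "duration":
        # The commit mentions a token-billed fallback for other usage types;
        # that path is not sketched here.
        raise ValueError(f"unsupported usage type: {usage.get('type')!r}")
    return float(usage["seconds"]) * input_cost_per_second


# Ninety seconds of audio costs one and a half times the per-minute rate.
cost = transcription_cost({"type": "duration", "seconds": 90})
```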
* Address PR review: URL-encode all Azure WS query params; forward query_params through provider_config branch * Address PR review: session_type validation, model auth fix, cost perf, billing fallback, detail/docs cleanup * Improve test coverage: detection from backend, error paths, unknown usage type, resolved_model None * Backport realtime transcription websocket fixes * Enforce authorized realtime transcription model * Enforce realtime transcription model access * Enforce realtime resolved model scopes * Enforce WebRTC transcription model scope * Lazy evaluate debug log in pass-through endpoint (#30177) * Pass through debug lazy logging * fix(proxy): convert remaining eager pass-through debug logs to lazy formatting * fix(parallel_ai): migrate search integration from v1beta to v1 endpoint (#30157) * fix(parallel_ai): migrate search integration from v1beta to v1 endpoint The Parallel Search API moved from /v1beta/search (processor: base/pro, parallel-beta header) to /v1/search (mode: turbo/basic/advanced, no beta header). Request fields moved too: max_results, source_policy, and excerpt settings are now nested under advanced_settings, and source_policy uses include_domains/exclude_domains. The v1 response returns publish_date per result, which now maps to SearchResult.date instead of being hardcoded to None. The legacy processor param is mapped to the equivalent mode so existing callers keep working. * fix(parallel_ai): default mode to basic and simplify param handling The v1 API defaults to advanced mode when mode is omitted, while v1beta defaulted to the base processor. Without an explicit default, callers who pass no mode would be silently upgraded to a tier costing 2.25x more while litellm's cost map reports the basic-tier price. Sending mode=basic preserves the v1beta default and keeps cost tracking accurate. 
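A rough sketch of the request shaping described above, under stated assumptions: `build_search_body` and `LEGACY_PROCESSOR_TO_MODE` are hypothetical names, only the `base` to `basic` mapping is implied by the commit message, and the excerpt-settings key is omitted because its name is not given.

```python
from typing import Any, Dict

# Assumption: only base -> basic is implied by the commit message.
# Any other legacy processor value falls back to the default in this sketch.
LEGACY_PROCESSOR_TO_MODE = {"base": "basic"}

# Fields the commit says moved under advanced_settings in v1.
NESTED_FIELDS = ("max_results", "source_policy")


def build_search_body(params: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical v1 request builder that always sends an explicit mode."""
    remaining = dict(params)  # copy so the caller's dict is not mutated
    mode = remaining.pop("mode", None)
    processor = remaining.pop("processor", None)
    if mode is None and processor is not None:
        mode = LEGACY_PROCESSOR_TO_MODE.get(processor)

    # Default to basic so an omitted mode is not upgraded to advanced.
    body: Dict[str, Any] = {"mode": mode or "basic"}

    advanced = {key: remaining.pop(key) for key in NESTED_FIELDS if key in remaining}
    if advanced:
        body["advanced_settings"] = advanced

    body.update(remaining)  # whatever is left passes through at the top level
    return body
```

An explicit `mode` wins over a legacy `processor`, which matches the precedence the tests are said to pin.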
Also replaces the handled_params set with pop-as-consumed param handling so mapped params no longer need to be tracked in two places, and extends the tests to pin the default mode, processor=base mapping, mode-over-processor precedence, and top-level v1 param passthrough. * fix(parallel_ai): avoid double /v1 when api_base is already versioned A PARALLEL_AI_API_BASE like https://api.parallel.ai/v1 previously produced .../v1/v1/search. Strip a trailing /v1 before appending the search path and cover the api_base variants with a parametrized test. --------- Co-authored-by: shin-berri <shin-laptop@berri.ai> Co-authored-by: yuneng-jiang <yuneng@berri.ai> * feat(focus): add Mavvrik destination for FOCUS export (#29935) * fix: preserve responses streaming flag (#30189) * fix: preserve responses streaming flag * test: cover async responses streaming flag * fix(spend/daily-activity): stable offset pagination via id tiebreaker (#30164) (#30167) date alone is not a unique sort key for LiteLLM_DailyUserSpend or LiteLLM_DailyTeamSpend (many rows per date: api_key x model x model_group x provider x endpoint). Offset pagination over a non-unique sort landed on arbitrary boundaries, so a client paging through all results and summing per-page metrics (the Usage dashboard) got non-deterministic totals - sometimes inflated, sometimes deflated, different at different page_size values. Adding the row's UUID id (present on both tables) as a secondary sort gives every page a stable cursor. order=[{date desc}, {id asc}]. Fixes #30164 * fix(oci): inject a default maxTokens so omitted max_tokens doesn't truncate responses (#30018) * fix(oci): inject default maxTokens so omitted max_tokens doesn't truncate OCI GenAI applies a tiny server-side maxTokens default (~20 tokens) when the request omits it, so any call that doesn't send max_tokens comes back cut off mid-string with finishReason "length". 
MLflow judges never send max_tokens, so their JSON responses arrived as unterminated strings and json.loads failed in MLflow's gateway adapter. When no maxTokens/maxCompletionTokens target is set, inject DEFAULT_OCI_CHAT_MAX_TOKENS (env-overridable, defaults 4096), mirroring the Anthropic config's default-max-tokens behaviour. An explicit max_tokens still wins, and reasoning models still route to maxCompletionTokens. Used a fixed default rather than the catalog max_output_tokens because the catalog value is unreliable for some models (grok-4 reports max_output_tokens equal to its context window, not a real output cap, which would risk 400s). Adds TestOCIDefaultMaxTokens covering Cohere and generic injection, the explicit-override case, and the reasoning maxCompletionTokens branch. * test(oci): e2e regression that omitted max_tokens isn't truncated Real-proxy integration test asserting a chat completion that omits max_tokens completes with finish_reason "stop" instead of being cut off at OCI's ~20-token server default. Fails before the maxTokens-default injection (finish_reason "length", ~19 tokens), passes after. * test(oci): update cohere default-params test for injected maxTokens test_cohere_default_parameters asserted no maxTokens was injected, encoding the old behaviour where OCI's ~20-token server default truncated responses. Now that transform_request injects DEFAULT_OCI_CHAT_MAX_TOKENS, assert maxTokens equals that default while the other params (topK/topP/frequencyPenalty) stay pass-through with no hardcoded default. * fix(oci): make DEFAULT_OCI_CHAT_MAX_TOKENS a plain constant Drop the os.getenv override. The env knob was not requested and introducing a new env var forced a cross-repo dependency on litellm-docs (test_env_keys.py validates every referenced env var against the docs table there). A plain 4096 constant keeps the PR self-contained; callers who want a different limit pass max_tokens explicitly per request. 
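The default injection described above might look roughly like this. It is a sketch only: `apply_default_max_tokens` and its `uses_max_completion_tokens` flag are hypothetical, while the constant name and its value of 4096 come from the commit message.

```python
DEFAULT_OCI_CHAT_MAX_TOKENS = 4096  # plain constant, per the commit message


def apply_default_max_tokens(optional_params: dict, uses_max_completion_tokens: bool) -> dict:
    """Hypothetical helper: add a default output cap only when none was set."""
    params = dict(optional_params)
    if "maxTokens" in params or "maxCompletionTokens" in params:
        return params  # an explicit cap always wins
    key = "maxCompletionTokens" if uses_max_completion_tokens else "maxTokens"
    params[key] = DEFAULT_OCI_CHAT_MAX_TOKENS
    return params
```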
* fix(oci): route all OpenAI commercial models to maxCompletionTokens OCI serves OpenAI models (gpt-4.1, gpt-5.1 through 5.5, o-series) that the litellm catalog doesn't track, so the supports_reasoning lookup returned False for them and the provider sent maxTokens, which the reasoning families reject with HTTP 400. With the injected default maxTokens this broke every request to those models, not just ones with an explicit max_tokens. Route the whole openai.* vendor prefix to maxCompletionTokens since OpenAI accepts max_completion_tokens on every chat model; the openai.gpt-oss-* open weights are served by OCI's own stack and keep maxTokens. Verified live against gpt-5.2, gpt-5, gpt-4o, gpt-4.1, gpt-oss-120b, llama-3.3, command-a and grok-3-mini * test(oci): hoist transformation imports and drop unused ones Makes the generic-chat test file ruff-clean: the per-test local imports of OCIChatConfig/OCIVendors shadowed the module-level import (F811) and left it unused (F401), and json plus three OCI type imports were never referenced * fix(oci): translate response_format json_schema to OCI's accepted shape (#29691) * fix(oci): translate response_format json_schema to OCI's accepted shape OCI GenAI rejected every json_schema response_format with HTTP 400 "Please pass in correct format of request", which broke structured-output callers such as MLflow LLM judges (they always send a json_schema). The provider forwarded OpenAI's raw json_schema body unchanged. For GENERIC models OCI's ResponseJsonSchema accepts only name/description/schema/isStrict, so OpenAI's `strict` key (and any other extra) 400s the request; the key must be renamed to isStrict and the body whitelisted. For Cohere models there is no JSON_SCHEMA type at all; the schema has to ride on JSON_OBJECT as {"type": "JSON_OBJECT", "schema": ...}. Cohere type values must also be the canonical uppercase TEXT/JSON_OBJECT. 
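The two vendor-specific shapes can be sketched as below. Assumptions are labeled: the function name is hypothetical, the outer `jsonSchema` key for GENERIC models is inferred rather than stated, and uppercasing the type for GENERIC non-schema formats is a guess. The whitelisted keys, the `strict` to `isStrict` rename, and the Cohere `JSON_OBJECT` shape come from the commit message.

```python
ALLOWED_SCHEMA_KEYS = ("name", "description", "schema")


def normalize_response_format(response_format: dict, vendor: str) -> dict:
    """Hypothetical translation of an OpenAI response_format for OCI."""
    fmt_type = str(response_format.get("type", "")).lower()
    json_schema = response_format.get("json_schema") or {}

    if vendor == "cohere":
        if fmt_type == "text":
            return {"type": "TEXT"}
        # Cohere has no JSON_SCHEMA type; the schema rides on JSON_OBJECT.
        out = {"type": "JSON_OBJECT"}
        if "schema" in json_schema:
            out["schema"] = json_schema["schema"]
        return out

    if fmt_type == "json_schema":
        if not json_schema:
            # Fail with a clear message instead of sending a bodyless request.
            raise ValueError("response_format json_schema requires a json_schema object")
        body = {k: json_schema[k] for k in ALLOWED_SCHEMA_KEYS if k in json_schema}
        if "strict" in json_schema:
            body["isStrict"] = json_schema["strict"]  # renamed key
        return {"type": "JSON_SCHEMA", "jsonSchema": body}  # outer key is an assumption

    return {"type": fmt_type.upper()}  # assumption for GENERIC text/json_object
```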
_normalize_response_format now branches by vendor and emits the exact shape each one accepts (verified live against OCI GenAI for Cohere, Meta, Gemini and Grok). Drops the unused, incorrect Cohere response-format pydantic models. Two existing tests asserted the broken behavior (lowercase type, raw jsonSchema on Cohere); they are rewritten to assert the corrected shape, and generic/Cohere json_schema regression tests are added. * fix(oci): raise early on json_schema response_format with no body A GENERIC model request with {"type": "json_schema"} and no json_schema object fell through to the JSON_OBJECT branch and emitted a bodyless {"type": "JSON_SCHEMA"}, which OCI rejects with an opaque HTTP 400. Raise a descriptive 400 at translation time instead. Cohere is unaffected since it always maps to JSON_OBJECT. * test(oci): gateway integration test for response_format json_schema Added to tests/integration/ (the real-network integration suite) reusing the existing OCI proxy harness, not tests/llm_translation/ which is mock-only. --------- Co-authored-by: Sameer Kankute <sameer@berri.ai> * fix(oci): accept default n=1 on Cohere instead of hard-failing (#29705) * fix(oci): accept default n=1 on Cohere instead of hard-failing Cohere on OCI has no numGenerations field, so n was mapped to False and map_openai_params raised "param `n` is not supported on OCI" whenever a client sent n. But n=1 (and None) is the OpenAI default single-generation request, which every OCI model produces anyway, so standard clients that always send n=1 (such as the MLflow gateway) were rejected with a 500. Drop n=1/None silently for Cohere; only n>1 is genuinely unsupported and still raises (or drops under drop_params). Generic models are unaffected and keep numGenerations, including n>1. 
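The Cohere handling of `n` described above reduces to a small decision, sketched here. The function name and return convention are hypothetical; the error text is quoted from the commit message.

```python
from typing import Optional


def map_n_for_cohere(n: Optional[int], drop_params: bool = False) -> dict:
    """Hypothetical helper: return the OCI params contributed by `n` for Cohere."""
    if n is None or n == 1:
        return {}  # default single generation, nothing to send
    if drop_params:
        return {}  # caller asked for unsupported params to be dropped
    raise ValueError("param `n` is not supported on OCI")
```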
* docs(oci): explain why n is not advertised for Cohere despite tolerating n=1 * test(oci): gateway integration test for Cohere default n=1 Added to tests/integration/ (the real-network integration suite) reusing the existing OCI proxy harness, not tests/llm_translation/ which is mock-only. --------- Co-authored-by: Sameer Kankute <sameer@berri.ai> * fix(oci): drop max_retries instead of hard-failing on OCI (#29727) max_retries is a litellm-level control param (litellm applies retries itself), not a generation param OCI accepts. The provider mapped it to False and raised "param `max_retries` is not supported on OCI" whenever it was present. The litellm proxy injects max_retries on every request, so any OCI call through the proxy 500'd unless drop_params was set. Drop max_retries silently in map_openai_params. Adds a unit test (Cohere and generic) and a gateway integration test that a plain request succeeds through a proxy without drop_params. Co-authored-by: Sameer Kankute <sameer@berri.ai> * fix(spend-logs): rehydrate metadata JSONB text on ui_view_spend_logs (#29682) Fixes #29674. `/spend/logs/ui` raw-SQL path returns the JSONB metadata column as a string — prisma's query_raw skips the ORM-layer hydration. The UI reads metadata.status / metadata.error_information as object fields, so provider-failure rows look like successes. Fix: json.loads the metadata field right after query_raw, fall back to {} on malformed JSON. 3 existing error-code/error-message tests called json.loads on response.data[0]["metadata"] — they were leaning on the bug. Updated to read the dict directly. Plus 2 new regression tests (failure metadata roundtrip + invalid-json fallback). Reverting the fix makes both new tests fail with AssertionError: metadata should be dict, got <class 'str'>. 
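The metadata rehydration in the spend-logs fix can be illustrated like this. It is a sketch under assumptions: `rehydrate_metadata` is a hypothetical name, and treating non-dict JSON (such as a list) as `{}` is an extra guard not stated in the commit message.

```python
import json


def rehydrate_metadata(row: dict) -> dict:
    """Hypothetical helper: turn a JSON-text metadata column back into a dict."""
    metadata = row.get("metadata")
    if not isinstance(metadata, str):
        return row  # already hydrated (or absent)
    try:
        parsed = json.loads(metadata)
    except json.JSONDecodeError:
        parsed = {}  # malformed JSON falls back to an empty dict
    if not isinstance(parsed, dict):
        parsed = {}  # assumption: the UI only understands object metadata
    return {**row, "metadata": parsed}
```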
* fix(proxy): release max_parallel_requests slot when a stream is cancelled mid-flight (#27955) (#30020) * fix(proxy): release max_parallel_requests slot when a stream is cancelled mid-flight (#27955) * fix: refund max_parallel_requests on disconnect from outer streaming generators The cancellation refund previously lived in async_post_call_streaming_iterator_hook, but that hook is nested inside the outer streaming generators and a nested async generator only receives GeneratorExit on garbage collection (non-deterministic). With only the v3 limiter enabled, /chat/completions also bypasses the hook entirely (needs_iterator_wrap() is false). Move the release into async_data_generator and async_streaming_data_generator, the generators Starlette closes on client disconnect, so the refund fires deterministically on every streaming route. Warn when no event loop is running, and document the window TTL refresh on the decrement * fix(mcp): propagate model into model_call_details for passthrough tool calls (#30122) * fix(mcp): propagate model into model_call_details for passthrough tool calls The @client decorator on call_mcp_tool creates the logging object via function_setup without a model kwarg, so model_call_details["model"] starts as None. execute_mcp_tool only set logging_obj.model as an instance attribute, which the spend-log writer never reads (it reads kwargs["model"] from model_call_details). MCP passthrough tools/call rows therefore persisted with model="" while list_tools rows showed "MCP: list_tools", degrading the Logs UI display and bucketing all MCP tool spend under an empty model in DailyUserSpend. Propagate the model into model_call_details alongside the existing attribute assignment so the StandardLoggingPayload and SpendLogs writer pick it up. Covers the /mcp passthrough, REST /mcp-rest/tools/call, and orchestrated paths (the latter already passed model into function_setup, so this is a no-op there). 
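For the max_parallel_requests refund on cancelled streams described above, the general pattern is a `finally` block in the outermost async generator, which runs when that generator is closed. The sketch below uses a toy `SlotCounter` and generator; neither is the proxy's code.

```python
import asyncio


class SlotCounter:
    """Toy stand-in for a parallel-request limiter."""

    def __init__(self) -> None:
        self.in_flight = 0

    def acquire(self) -> None:
        self.in_flight += 1

    def release(self) -> None:
        self.in_flight -= 1


async def outer_stream(chunks, counter: SlotCounter):
    counter.acquire()
    try:
        for chunk in chunks:
            yield chunk
    finally:
        # Runs on normal exhaustion and when the consumer closes the generator.
        counter.release()


async def simulate_disconnect():
    counter = SlotCounter()
    stream = outer_stream(["a", "b", "c"], counter)
    first = await stream.__anext__()  # client reads one chunk
    mid_flight = counter.in_flight
    await stream.aclose()  # stands in for the server closing on disconnect
    return first, mid_flight, counter.in_flight
```

Closing the outer generator explicitly is what makes the release deterministic; a nested generator that is never closed directly would only see the cleanup at garbage collection.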
* test(mcp): trim regression test docstring * fix(mcp): surface upstream challenges for delegated OAuth (#30124) * fix(mcp): surface upstream challenges for delegated OAuth * docs(mcp): clarify delegated upstream auth comments * perf(benchmarks): add CPU timing metrics to streaming benchmark (#29980) * Add CPU timing metrics to streaming benchmark * Fix spacing around timing sample dataclass * fix(gemini): don't emit empty choices on metadata-only stream chunks (#29167) web_search + reasoning makes Gemini stream mid-chunks that carry only grounding/thought metadata — no content part, no finishReason. _process_candidates skips content-less candidates and the existing fallback only ran when finishReason was set, so choices stayed empty and the downstream streaming handler raised IndexError on choices[0]. Emit an empty-delta choice for content-less chunks regardless of finishReason. Fixes #28884 * fix(key): allow /key/update to clear budget_limits with [] or null (#30085) * Fix /key/update rejecting budget_limits clear requests with HTTP 400 Sending budget_limits: [] or null to /key/update returned HTTP 400, so once a key had budget windows the last one could never be removed. prepare_key_update_data only json.dumps'd budget_limits when the value was truthy, so [] and None passed through raw to the Prisma Json? column; jsonify_object only serializes dicts, and prisma-client-py has no DbNull sentinel for Json? writes, so Prisma rejected both shapes. Serialize the clear case explicitly as the JSON literal null, matching how memory_endpoints encodes metadata for the same column type. Truthy values keep the existing reset_at window initialization path. Fixes #30067. * Require admin access for budget_limits changes on /key/update Clearing budget_limits via [] or null is a budget mutation, but _validate_update_key_data only counted max_budget and spend as budget changes before deciding whether to skip _check_key_admin_access. 
A non-admin key owner or a team member with /key/update could therefore remove a key's per-window spend caps without admin authorization. Treat any explicit budget_limits value in the request (set, change, or clear) as a budget change so it gates through the same admin check as max_budget. model_fields_set is used because an explicit null is indistinguishable from an omitted field by value alone. * fix(proxy): persist guardrail info in spend logs for /v1/responses (#30092) Pre-call guardrail blocks on /v1/responses wrote guardrail_information as null in LiteLLM_SpendLogs because _handle_logging_proxy_only_error splits request_data by LoggedLiteLLMParams keys and litellm_metadata, where the Responses API stores request metadata including standard_logging_guardrail_information, was not among them. It fell into optional_params, so merge_litellm_metadata never saw it. Add litellm_metadata to LoggedLiteLLMParams so it routes into litellm_params the same way metadata does on the chat completions path Fixes #28971. * fix(proxy): handle non-standard SSE frames in Anthropic passthrough logging (#26000) Some third-party Anthropic-compatible providers emit non-standard SSE frames (OpenAI-style [DONE] sentinels, non-JSON keep-alive lines) in streaming responses. These caused json.JSONDecodeError in _build_complete_streaming_response, breaking the passthrough logging pipeline so the request was never logged or billed. Skip whole-line 'data: [DONE]' sentinels and catch JSONDecodeError per event. Matching the full line (not a substring) keeps a valid chunk whose text payload contains '[DONE]' from being dropped. Co-authored-by: shin-berri <shin-laptop@berri.ai> Co-authored-by: yuneng-jiang <yuneng@berri.ai> Co-authored-by: Sameer Kankute <sameer@berri.ai> * feat(newrelic): Add New Relic extension (#26989) * initial New Relic integration. * Minor fixes for basic observability. * Implemented basic support for the success path. 
Generates New Relic custom events needed by the AI Monitoring interface. * Supportability metric is sent on first request. * Emit supportability metric every hour instead of once a day. * Add the start/end times to the messages before sending them so that the start time and end time reflect the correct time and both are not set to 'now'. * Make use of `turn_off_message_logging` configuration that is available by default from CustomLogger. * Enabling New Relic agent to be wired when docker container starts if an environment variable is set. * If we cannot find trace information, send the AI events without the trace ID attached. * Use a fake trace_id if we cannot find one. * Implementing a configuration so that users can use litellm configuration to disable sending LLM messages to New Relic. There is a second method to do this via New Relic env var. * Missed file. * Cleaning up logic to turn off recording content via either the LiteLLM configuration or an env var. * Removing debugging. Fixed logic / comments around how often to send supportability metric. * Initial version of public doc for New Relic. * Use a proper name for the doc file. * Updating newrelic.md document. * Updating LiteLLM documentation for New Relic extension. * Moving New Relic imports into the methods to support unit tests. * Adding unit tests for the New Relic extension. * Updating linting and the unit tests that are not running in the CI environment. * Address reviewer feedback on New Relic integration. 
- Fix _record_error_metric to use app.record_custom_metric() instead of module-level newrelic.agent.record_custom_metric() so the call works outside of an active transaction context - Remove unreachable except ImportError block in _get_trace_context - Update stale "23 hours" comment to "27 hours" (matches 97200s threshold) - Remove commented-out debug code from _process_success - Fix docs typo: NEW_RELIC_CUSTOM_INSIGHTS_EVENTS_MAX_SAMPLES_STOREDA -> NEW_RELIC_CUSTOM_INSIGHTS_EVENTS_MAX_SAMPLES_STORED - Update TestRecordErrorMetric to verify app.record_custom_metric call Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Reformatting for the linter. * Addressing additional automated feedback. - Removed a legacy comment about the New Relic header - Reordered imports in one file - Switched another file to use the import at the top of the file instead of inline when used - Added unit tests for untested methods that were identified * Addressing new feedback. - Proper handling of time to floats. Created a util method and updated code to use it. - Added the missing guard to ensure the app is enabled * Addressing feedback. - When an error occurs, still check if the periodic supportability metric should be emitted - Added a check to ensure the extension is ready in the error handler to match _process_success * Updating the NR event timestamps to more accurately reflect when the messages were generated. * Addressing feedback for potential better practice. * Addressing feedback on accessing default values. Added tests for most of these cases. * Adding a new catch exception block based on feedback. * Addressing feedback about a potential issue around a timestamp for the supportability metric. * Addressing minor feedback on length of generated, fallback traceId. * Addressing feedback. - A few more cases were found where the dictionary access might not return the correct value. 
- Handling cases where `traceparent` is not lower cased * Addressed feedback where the newrelic options might not apply correctly. * Addressing some feedback. * Addressing feedback. * Validating testing / formatting for our changes. * Updating linting, adding tests, defining data type for UI. * Configuration for the logging callback definition. * Adding a newrelic image for the UI to use. * Putting the New Relic callback in proper alphabetic order. * Copying the logo to a committed output directory so it shows up in a locally built container. * Adding missing definition of new env vars that were causing a build failure. * Addressing automated feedback from greptile. * Adding a few more unit tests to increase the code coverage just a bit more. * Additional unit tests to push coverage to almost 90%. * Adding a custom newrelic docker image build process. This removes the need to add the newrelic agent to the core litellm container or dependencies. * Clarifying message when the New Relic agent is not installed and someone is trying to use the newrelic extension. Either use the proper image when using docker, or install the agent manually when running from source. * Ensuring pip is available to install the New Relic agent. * Updating the definition and handling of traceId (no spanId). Clarifying behavior of env vars vs UI configuration for the newrelic extension. * Removing entries from the New Relic logger configuration UI as these values must be set as part of running the image. * Removing a stale doc file that has moved to the litellm-docs repo. Cleanup of Dockerfile to remove a LABEL that was incorrect. * Updating container image name to be the best guess for the new name. * Addressing feedback from greptile. - Added a comment around token_count=0 - Updated the boolean parser to allow a wider set of options which matches existing patterns in other parts of LiteLLM. * Removing option for a separate New Relic container image. 
The agreement is to handle this in the New Relic integration docs. * Updating error message when New Relic agent is not available. * Wiring in the test message from the LiteLLM callback UX. * Missed saving one of the file conflicts. * Fixed a lint error I introduced. Somehow, I dropped another string and now added it back. * Adding newrelic to the schema definition. * Added an admin check on the call before sending test message as mentioned by the AI code review. * Updating to use should_redact_message_logging(kwargs) as part of the logic to determine if message content should be sent to New Relic or not. This still uses the `record_content` property as well, but both have to be true in order for content to be included. --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * Add Azure AI Foundry DeepSeek V3.1 and V4 Pro/Flash global pricing to cost map (#30134) Co-authored-by: Cursor <cursoragent@cursor.com> * fix(logging): translate Responses bridge result to ModelResponse for spend logs (#28985) PR #29394 fixed the AnthropicResponse.model_validate crash for the streaming anthropic_messages -> OpenAI Responses bridge by unwrapping terminal events and returning the inner ResponsesAPIResponse. The spend_logs row lands and usage/cost are correct, but the row's response field stores the Responses API shape (output[...].content[...].text). The proxy UI Logs tab reads response.choices[0].message via parseMessages in prettyMessagesUtils.ts with no fallback for the Responses shape, so the OutputCard renders "No response data available" for every cross-routed call. The same shape mismatch affects every downstream consumer of spend_logs that assumes the canonical chat-completion shape This change keeps the unwrap from #29394 but routes the resulting ResponsesAPIResponse (and the bare-response non-streaming path) through LiteLLMResponsesTransformationHandler.transform_response, which is the same conversion already used by the chat-completion Responses bridge. 
Spend_logs now stores a ModelResponse with choices[0].message.content, so the UI and other consumers see the assistant text. On a translation failure (eg. empty output on an incomplete response) the handler falls back to a minimal ModelResponse carrying model and usage so the row still lands rather than being dropped as a Non-Blocking error Also corrects a stale comment in the Responses adapter that implied the call type was reclassified to acompletion; the code preserves anthropic_messages and the success handler translates back to ModelResponse for the row Fixes #28595 * fix(anthropic-adapter): re-emit first delta on streaming content-block transitions (#30024) * fix(anthropic-adapter): re-emit first delta on streaming content-block transitions The `/v1/messages` -> `/v1/chat/completions` streaming adapter (`AnthropicStreamWrapper`) silently dropped the first non-empty delta of every content block that started via a *transition* (e.g. text -> tool_use -> text, text -> thinking). When an upstream chunk both triggers a new content block (its type differs from the active block) and carries that block's first delta, the wrapper emitted `content_block_stop` -> `content_block_start` and then only re-queued the trigger chunk when it was an `input_json_delta` (bundled tool args). The synthesized `content_block_start` always carries an empty body, so the first `text_delta` / `thinking_delta` was lost — the client output started from the second token (e.g. "Hi, how can I help you?" rendered as ", how can I help you?", or text resuming after a tool call lost its first sentence). This is especially visible with Claude Code-style clients that consume Anthropic Messages streaming events strictly. Fix: re-queue the trigger chunk's translated delta whenever it carries non-empty content (text/thinking/signature/tool args), via a shared `_trigger_delta_has_content` helper used by both the sync and async paths. 
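A predicate of the kind described above could be sketched as follows. This is not the adapter's helper: the function here is an illustrative stand-in, and it assumes the delta is a plain dict using the Anthropic Messages streaming field names (`text`, `partial_json`, `thinking`, `signature`).

```python
# Payload field carried by each Anthropic streaming delta type.
PAYLOAD_FIELD_BY_DELTA_TYPE = {
    "text_delta": "text",
    "input_json_delta": "partial_json",
    "thinking_delta": "thinking",
    "signature_delta": "signature",
}


def trigger_delta_has_content(delta: object) -> bool:
    """Illustrative check: does this delta carry a non-empty payload?"""
    if not isinstance(delta, dict):
        return False  # malformed or missing delta
    field = PAYLOAD_FIELD_BY_DELTA_TYPE.get(delta.get("type"))
    if field is None:
        return False  # not a content-bearing delta type
    value = delta.get(field)
    return isinstance(value, str) and value != ""
```

A chunk that both opens a new content block and passes this check would have its delta re-queued after the synthesized `content_block_start`.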
Empty trigger deltas are still suppressed so no spurious empty `content_block_delta` is introduced. Fixes #30014 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * test(anthropic-adapter): cover all _trigger_delta_has_content branches Add a direct parametrized unit test for the re-emit predicate so every delta type (text/input_json/thinking/signature), the empty-payload guards, and the malformed/non-delta cases are exercised independently of upstream chunk translation. Raises patch coverage for the new helper. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: shin-berri <shin-laptop@berri.ai> Co-authored-by: yuneng-jiang <yuneng@berri.ai> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> * feat: add opt-in healthy_only filter to GET /v1/models (#30130) * feat: add opt-in healthy_only filter to GET /v1/models Adds an opt-in `healthy_only=true` query parameter to GET /v1/models and GET /models that hides models whose backing deployments are all marked unhealthy by background health checks. - Add Router.async_get_fully_unhealthy_model_names(), mirroring the semantics of get_fully_blocked_model_names(): a model is hidden only when every backing deployment is unhealthy and the health state is not stale (fail open otherwise). - Reuses the existing DeploymentHealthCache populated by _run_background_health_check(), so no new health state is introduced. - No-op when allowed_fails_policy is set, mirroring _async_filter_health_check_unhealthy_deployments semantics. - team_public_model_name aliases are aggregated alongside model_name. - Hiding is presentation-only; default behavior is unchanged. 
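The hiding rule in the list above (hide a model only when every backing deployment is unhealthy and the health state is fresh, otherwise fail open) can be sketched like this. The data shapes, function names, and `max_age_seconds` parameter are assumptions for illustration; the real method reads from DeploymentHealthCache.

```python
from typing import Dict, List, Set


def _confirmed_unhealthy(state, now: float, max_age_seconds: float) -> bool:
    if state is None:
        return False  # no health data: fail open
    if now - state["checked_at"] > max_age_seconds:
        return False  # stale health data: fail open
    return not state["healthy"]


def fully_unhealthy_model_names(
    deployments: List[dict],
    health_state: Dict[str, dict],
    now: float,
    max_age_seconds: float,
) -> Set[str]:
    """Hypothetical sketch: names whose deployments are all freshly unhealthy."""
    ids_by_model: Dict[str, List[str]] = {}
    for deployment in deployments:
        ids_by_model.setdefault(deployment["model_name"], []).append(deployment["id"])
    return {
        name
        for name, ids in ids_by_model.items()
        if all(_confirmed_unhealthy(health_state.get(i), now, max_age_seconds) for i in ids)
    }
```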
Fixes #30128 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * docs: address Greptile review notes - Note team-alias asymmetry vs get_fully_blocked_model_names - Debug-log when healthy_only is set but no health state is available Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> --------- Co-authored-by: shin-berri <shin-laptop@berri.ai> Co-authored-by: yuneng-jiang <yuneng@berri.ai> Co-authored-by: Claude Fable 5 <noreply@anthropic.com> * Dedupe team soft budget alerts by team_id instead of token (#30097) _team_soft_budget_check sends type="soft_budget" alerts with event_group=TEAM, but SoftBudgetAlert.get_id always returned the request token. The alert cache key was therefore scoped per virtual key, so every active key in a team over its soft budget fired its own alert within budget_alert_ttl. Branch on event_group so team-level alerts dedupe by team_id, matching TeamBudgetAlert, while key and project level alerts keep per-token dedupe. Fixes #27398. * feat(bedrock guardrails): support contextual grounding qualifiers (request-side) (#30057) * test: add failing tests for Bedrock contextual grounding (request-side) Drive the request-side of Bedrock contextual grounding: callers tag message content blocks as grounding_source/query, the post_call hook assembles an ApplyGuardrail(OUTPUT) call carrying source + query + response(guard_content), and the bedrock converse transform must render the tags as prompt text instead of silently dropping them. Non-grounding payloads must stay byte-identical. * feat(bedrock guardrails): support contextual grounding qualifiers Bedrock contextual grounding scores a model response against a reference source and the user query, expressed via a per-content-block `qualifiers` array on ApplyGuardrail. The guardrail hook previously sent plain text only, so grounding could not be driven through it even though the response-side contextualGroundingPolicy parsing already existed. 
Callers now tag message content blocks `{"type":"grounding_source"}` / `{"type":"query"}` (mirroring the existing `guarded_text` marker). On the generate path the bedrock converse transform renders them as plain text; at post_call the hook harvests them from the request and assembles one ApplyGuardrail(OUTPUT) call carrying grounding_source + query + the response (as guard_content). Requests without these tags produce a byte-identical payload, so existing behaviour is unchanged. * Feat(guardrail): Adding support for custom Ovalix guardrail (#21887) * Feat(guardrail): Adding support for custom Ovalix guardrail * Internal CR comments fixes * greptileai comments fixes * fix conflict * fixes * fix sha256 * clarify Ovalix actor-id hash is for normalization, not PII protection * fix(github_copilot): normalize per-event item_id in /responses streaming (#30072) GitHub Copilot's native /v1/responses stream assigns a different item_id to every event of a single output item (output_item.added, the part.added / delta / done events, and output_item.done). Spec-strict clients like the Vercel AI SDK key streaming parts by item_id and abort with "reasoning part <id> not found" / "text part <id> not found" when a delta references an unregistered id. Override transform_streaming_response in GithubCopilotResponsesAPIConfig to anchor every event of an output item to the id from its output_item.added. Copilot accepts that id paired with the final encrypted_content on the next turn, so multi-turn replay is unaffected. Fixes #30071 * feat: add /model/block and /model/unblock endpoints (#30125) * feat: add /model/block and /model/unblock endpoints Add dedicated proxy-admin POST /model/block and /model/unblock endpoints over the existing blocked flag on LiteLLM_ProxyModelTable, mirroring the /key/block and /key/unblock pattern. 
Calling a model whose deployments are all blocked now returns a clear 403 "Model is blocked" instead of a generic no-deployment error, including direct-dispatch route types (e.g. eval) via a pre-route guard. Includes audit-log entries for block/unblock and unit tests.

Closes #29742

Signed-off-by: AgentGymLeader <264910004+AgentGymLeader@users.noreply.github.com>

* chore: regenerate dashboard API types for model block/unblock endpoints

Regenerate ui/litellm-dashboard/src/lib/http/schema.d.ts from the proxy OpenAPI spec (npm run gen:api) so it includes the new endpoints.

Signed-off-by: AgentGymLeader <264910004+AgentGymLeader@users.noreply.github.com>

* fix: widen router block-helper param type and add direct unit tests

Type the _are_all_deployments_blocked deployments parameter to match its callers (DeploymentTypedDict) so mypy passes, and add tests/test_litellm/test_router_block_helpers.py with direct unit tests for the three block helper methods so router_code_coverage recognizes them.

Signed-off-by: AgentGymLeader <264910004+AgentGymLeader@users.noreply.github.com>

* fix: restore type-ignore on messages arg after black reflow

Signed-off-by: AgentGymLeader <264910004+AgentGymLeader@users.noreply.github.com>

* refactor: raise model-block 403 in proxy layer, not SDK Router

Keep the SDK Router's documented behavior for blocked deployments (filtered -> "no healthy deployment") and move the 403 PermissionDeniedError into the proxy layer (route_llm_request), where model blocking is an admin concept. This avoids a backwards-incompatible 403 for SDK users who set blocked=True on their own deployments, per maintainer review.
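The split described above (SDK filters blocked deployments, proxy raises a 403) can be sketched roughly as follows. This is an illustration only, not the code from the PR: `ModelBlockedError`, `filter_unblocked`, `proxy_pre_route_guard`, and the plain-dict deployment shape with a top-level `blocked` key are all assumptions made for the example.

```python
from typing import Any, Dict, List


class ModelBlockedError(Exception):
    """Hypothetical stand-in for the proxy's 403 PermissionDeniedError."""

    status_code = 403


def are_all_deployments_blocked(deployments: List[Dict[str, Any]]) -> bool:
    # An empty list means "no deployment", which is not the same as "blocked".
    return bool(deployments) and all(d.get("blocked") is True for d in deployments)


def filter_unblocked(deployments: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    # SDK-style behaviour: blocked deployments are silently filtered out,
    # so a fully blocked model surfaces as "no healthy deployment".
    return [d for d in deployments if d.get("blocked") is not True]


def proxy_pre_route_guard(model: str, deployments: List[Dict[str, Any]]) -> None:
    # Proxy-style behaviour: blocking is an admin concept, so fail loudly.
    if are_all_deployments_blocked(deployments):
        raise ModelBlockedError(f"Model is blocked: {model}")
```

Keeping the guard separate from the filter means SDK callers who set `blocked=True` themselves never see the 403; only the proxy path calls the guard.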
Signed-off-by: FugoP <264910004+AgentGymLeader@users.noreply.github.com>

---------

Signed-off-by: AgentGymLeader <264910004+AgentGymLeader@users.noreply.github.com>
Signed-off-by: FugoP <264910004+AgentGymLeader@users.noreply.github.com>
Co-authored-by: AgentGymLeader <264910004+AgentGymLeader@users.noreply.github.com>
Co-authored-by: Sameer Kankute <sameer@berri.ai>

* fix: add week unit support to get_next_standardized_reset_time (#30100)

* fix: add week unit support to get_next_standardized_reset_time

The function handled d/h/m/s/mo units but silently fell through to the default next-midnight branch for the w (week) unit. This was inconsistent: _extract_from_regex already accepted w in its character class, and duration_in_seconds already returned value * 604800 for it.

Add the missing elif unit == 'w' branch that delegates to _handle_day_reset with value * 7, which reuses the existing Monday-alignment logic for 1w and the generic N-day-from-midnight path for larger multiples.

Add test_week_based_resets covering 1w from a Wednesday (expects next Monday) and 2w from a Monday (expects 14 days forward at midnight).
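The week semantics described above can be illustrated with a small standalone sketch. It is not the body of get_next_standardized_reset_time or _handle_day_reset; `next_week_reset` is a hypothetical helper, and the choice that a Monday base date rolls to the following Monday for 1w is an assumption.

```python
from datetime import datetime, timedelta, timezone


def next_week_reset(now: datetime, weeks: int) -> datetime:
    """Hypothetical sketch of the 'w' unit: 1w aligns to Monday, Nw adds N*7 days."""
    if weeks < 1:
        raise ValueError("weeks must be >= 1")
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    if weeks == 1:
        # Monday is weekday() == 0. Assumption: a Monday base date rolls
        # forward a full week rather than resetting on the same day.
        days_ahead = (7 - now.weekday()) % 7 or 7
        return midnight + timedelta(days=days_ahead)
    # Larger multiples: N * 7 days forward from today's midnight.
    return midnight + timedelta(days=7 * weeks)


wednesday = datetime(2024, 1, 3, 15, 30, tzinfo=timezone.utc)
monday = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
print(next_week_reset(wednesday, 1))  # 2024-01-08 00:00:00+00:00 (next Monday)
print(next_week_reset(monday, 2))     # 2024-01-15 00:00:00+00:00 (14 days forward)
```

The two printed cases mirror the ones named for test_week_based_resets: 1w from a Wednesday and 2w from a Monday.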
Signed-off-by: FugoP <264910004+AgentGymLeader@users.noreply.github.com>

* test: exercise relative week semantics with non-Monday base dates + add docstring

Signed-off-by: FugoP <264910004+AgentGymLeader@users.noreply.github.com>

---------

Signed-off-by: FugoP <264910004+AgentGymLeader@users.noreply.github.com>
Co-authored-by: FugoP <264910004+AgentGymLeader@users.noreply.github.com>

* fix: black formatting and remove undocumented MAVVRIK_FOCUS_FREQUENCY env var

* fix: black formatting with correct version and sync schema.d.ts for healthy_only param

* fix: resolve mypy errors and add transcription_sessions to JSON schema endpoint enum

* fix: restore MAVVRIK_FOCUS_FREQUENCY guard and exclude it from docs key scan

* fix: address Greptile P2 comments - move constant, use UTC datetime, skip redundant team lookup

* revert: restore original team lookup logic in can_key_call_resolved_model

---------

Signed-off-by: AgentGymLeader <264910004+AgentGymLeader@users.noreply.github.com>
Signed-off-by: FugoP <264910004+AgentGymLeader@users.noreply.github.com>
Co-authored-by: Emerson Gomes <emerson.gomes@thalesgroup.com>
Co-authored-by: nina-hu <nina.huuu@gmail.com>
Co-authored-by: Sahith Jagarlamudi <104647530+s-jag@users.noreply.github.com>
Co-authored-by: shin-berri <shin-laptop@berri.ai>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>
Co-authored-by: Praveen Ghuge <95286176+pghuge-cloudwiz@users.noreply.github.com>
Co-authored-by: alex107ivanov <30668368+alex107ivanov@users.noreply.github.com>
Co-authored-by: hcl <chenglunhu@gmail.com>
Co-authored-by: Fede Kamelhar <federico.kamelhar@oracle.com>
Co-authored-by: Armaan Sandhu <74664101+Ar-maan05@users.noreply.github.com>
Co-authored-by: Teo Xian Zhong Augustine <35527068+auggie246@users.noreply.github.com>
Co-authored-by: King Star <mcxin.y@gmail.com>
Co-authored-by: Saksham Maggo <122939011+SakshamMaggo@users.noreply.github.com>
Co-authored-by: Filippo Menghi <113345637+Cyberfilo@users.noreply.github.com>
Co-authored-by: Kelvin <leikaiwei@outlook.com>
Co-authored-by: Josh Bonczkowski <josh.bonczkowski@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: M. Dennis Turp <mdturp@pm.me>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Piotr Minkina <piotrminkina@users.noreply.github.com>
Co-authored-by: Martín Alcalá Rubí <martin@tryolabs.com>
Co-authored-by: T. Kobayashi <13004314+nix-tkobayashi@users.noreply.github.com>
Co-authored-by: João Costa <13508071+jpv-costa@users.noreply.github.com>
Co-authored-by: Shalom <shalom@ovalix.io>
Co-authored-by: codgician <15964984+codgician@users.noreply.github.com>
Co-authored-by: FugoP <kim@pomsora.com>
Co-authored-by: AgentGymLeader <264910004+AgentGymLeader@users.noreply.github.com>
Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com>