litellm

tiennm99/litellm
mirror of https://github.com/tiennm99/litellm.git synced 2026-08-03 06:23:06 +00:00
cfcdf8714a feat: litellm oss 110626 (#30202)
* Add gpt-realtime-whisper Realtime transcription support (OpenAI + Azure) (#29775)

* Add gpt-realtime-whisper Realtime transcription support (OpenAI + Azure)

Adds first-class support for the gpt-realtime-whisper streaming speech-to-text
model, which uses the Realtime transcription session API rather than the
file-based /audio/transcriptions path.

Model registration: registers gpt-realtime-whisper and azure/gpt-realtime-whisper
with audio-duration pricing (input_cost_per_second = 0.017/60, matching the
published $0.017/minute input audio rate).

REST endpoint: implements POST /v1/realtime/transcription_sessions (plus /realtime
and /openai/v1 aliases) to mint an ephemeral transcription session for the
WebRTC flow. Adds request/response types, OpenAI and Azure URL builders, a shared
base handler (refactored from the client_secrets handler), the
acreate_realtime_transcription_session SDK function, and route registration. The
proxy encrypts the ephemeral key returned under client_secret.value and records
the session type in the token so the follow-up /realtime/calls replays
type=transcription rather than type=realtime.

WebSocket: forwards intent=transcription through to the Azure handler (OpenAI
already received it) with URL-encoding, so gpt-realtime-whisper opens a
transcription session. Transcription-only sessions no longer trigger an
erroneous response.create.

Cost tracking: transcription sessions emit no response.done events; their usage
arrives on conversation.item.input_audio_transcription.completed as
{type: duration, seconds}. That usage is captured out-of-band (usage only, no
transcript duplication) and billed by input_cost_per_second, with a token-billed
fallback for token-priced transcription models.
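
The billing rule above can be sketched as follows. This is a minimal illustration, not litellm's actual code: the function name, the `tokens` usage shape with an `input_tokens` field, and the `input_cost_per_token` parameter are assumptions made for the example. Only the `{type: duration, seconds}` shape and the 0.017/60 rate come from this message.

```python
from typing import Optional

# $0.017 per minute of input audio, expressed per second (rate quoted above).
INPUT_COST_PER_SECOND = 0.017 / 60


def cost_from_transcription_usage(
    usage: dict,
    input_cost_per_second: Optional[float] = INPUT_COST_PER_SECOND,
    input_cost_per_token: Optional[float] = None,
) -> float:
    """Hypothetical helper: price one transcription-completed usage payload."""
    usage_type = usage.get("type")
    if usage_type == "duration" and input_cost_per_second is not None:
        # Duration-priced models: seconds of input audio times the per-second rate.
        return float(usage.get("seconds", 0.0)) * input_cost_per_second
    if usage_type == "tokens" and input_cost_per_token is not None:
        # Assumed fallback shape for token-priced transcription models.
        return int(usage.get("input_tokens", 0)) * input_cost_per_token
    # Unknown usage type, or no applicable price: bill nothing.
    return 0.0
```

Under these assumptions, one minute of audio prices at roughly $0.017.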

Adds tests for pricing math, URL builders, request/response types, the proxy
route and SDK function, WebSocket intent forwarding, transcription-session
streaming behavior, and the /realtime/calls session-type replay.

* Address PR review: URL-encode all Azure WS query params; forward query_params through provider_config branch

* Address PR review: session_type validation, model auth fix, cost perf, billing fallback, detail/docs cleanup

* Improve test coverage: detection from backend, error paths, unknown usage type, resolved_model None

* Backport realtime transcription websocket fixes

* Enforce authorized realtime transcription model

* Enforce realtime transcription model access

* Enforce realtime resolved model scopes

* Enforce WebRTC transcription model scope

* Lazy evaluate debug log in pass-through endpoint (#30177)

* Pass through debug lazy logging

* fix(proxy): convert remaining eager pass-through debug logs to lazy formatting

* fix(parallel_ai): migrate search integration from v1beta to v1 endpoint (#30157)

* fix(parallel_ai): migrate search integration from v1beta to v1 endpoint

The Parallel Search API moved from /v1beta/search (processor: base/pro,
parallel-beta header) to /v1/search (mode: turbo/basic/advanced, no beta
header). Request fields moved too: max_results, source_policy, and excerpt
settings are now nested under advanced_settings, and source_policy uses
include_domains/exclude_domains. The v1 response returns publish_date per
result, which now maps to SearchResult.date instead of being hardcoded to
None. The legacy processor param is mapped to the equivalent mode so
existing callers keep working.

* fix(parallel_ai): default mode to basic and simplify param handling

The v1 API defaults to advanced mode when mode is omitted, while v1beta
defaulted to the base processor. Without an explicit default, callers who
pass no mode would be silently upgraded to a tier costing 2.25x more while
litellm's cost map reports the basic-tier price. Sending mode=basic
preserves the v1beta default and keeps cost tracking accurate.

Also replaces the handled_params set with pop-as-consumed param handling so
mapped params no longer need to be tracked in two places, and extends the
tests to pin the default mode, processor=base mapping, mode-over-processor
precedence, and top-level v1 param passthrough.

* fix(parallel_ai): avoid double /v1 when api_base is already versioned

A PARALLEL_AI_API_BASE like https://api.parallel.ai/v1 previously produced
.../v1/v1/search. Strip a trailing /v1 before appending the search path and
cover the api_base variants with a parametrized test.

---------

Co-authored-by: shin-berri <shin-laptop@berri.ai>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>

* feat(focus): add Mavvrik destination for FOCUS export (#29935)

* fix: preserve responses streaming flag (#30189)

* fix: preserve responses streaming flag

* test: cover async responses streaming flag

* fix(spend/daily-activity): stable offset pagination via id tiebreaker (#30164) (#30167)

date alone is not a unique sort key for LiteLLM_DailyUserSpend or
LiteLLM_DailyTeamSpend (many rows per date: api_key x model x
model_group x provider x endpoint). Offset pagination over a
non-unique sort landed on arbitrary boundaries, so a client paging
through all results and summing per-page metrics (the Usage dashboard)
got non-deterministic totals - sometimes inflated, sometimes deflated,
different at different page_size values.

Adding the row's UUID id (present on both tables) as a secondary sort
gives every page a stable cursor. order=[{date desc}, {id asc}].
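
The effect of the tiebreaker can be illustrated in memory. This sketch sorts plain dicts rather than querying Prisma; the `paginate` helper and the row fields other than `date` and `id` are assumptions made for the example.

```python
from operator import itemgetter


def paginate(rows: list, page: int, page_size: int) -> list:
    """Hypothetical helper: offset pagination ordered by date desc, id asc."""
    # Python's sort is stable, so sort by the secondary key first,
    # then by the primary key; ties on date keep their id order.
    ordered = sorted(rows, key=itemgetter("id"))
    ordered = sorted(ordered, key=itemgetter("date"), reverse=True)
    start = (page - 1) * page_size
    return ordered[start : start + page_size]
```

Because (date, id) is unique per row, the page boundaries no longer depend on the order in which the store happens to return rows with the same date.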

Fixes #30164

* fix(oci): inject a default maxTokens so omitted max_tokens doesn't truncate responses (#30018)

* fix(oci): inject default maxTokens so omitted max_tokens doesn't truncate

OCI GenAI applies a tiny server-side maxTokens default (~20 tokens) when the
request omits it, so any call that doesn't send max_tokens comes back cut off
mid-string with finishReason "length". MLflow judges never send max_tokens, so
their JSON responses arrived as unterminated strings and json.loads failed in
MLflow's gateway adapter.

When no maxTokens/maxCompletionTokens target is set, inject
DEFAULT_OCI_CHAT_MAX_TOKENS (env-overridable, defaults to 4096), mirroring the
Anthropic config's default-max-tokens behaviour. An explicit max_tokens still
wins, and reasoning models still route to maxCompletionTokens. Used a fixed
default rather than the catalog max_output_tokens because the catalog value is
unreliable for some models (grok-4 reports max_output_tokens equal to its
context window, not a real output cap, which would risk 400s).
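
A rough sketch of the injection rule, assuming the request parameters are a plain dict and that reasoning models receive the default under maxCompletionTokens. The helper name and the `is_reasoning_model` flag are hypothetical; the constant is shown as a plain value.

```python
DEFAULT_OCI_CHAT_MAX_TOKENS = 4096


def apply_default_max_tokens(params: dict, is_reasoning_model: bool = False) -> dict:
    """Hypothetical helper: add a token cap only when the caller sent none."""
    out = dict(params)  # do not mutate the caller's dict
    if "maxTokens" in out or "maxCompletionTokens" in out:
        # An explicit value always wins over the default.
        return out
    key = "maxCompletionTokens" if is_reasoning_model else "maxTokens"
    out[key] = DEFAULT_OCI_CHAT_MAX_TOKENS
    return out
```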

Adds TestOCIDefaultMaxTokens covering Cohere and generic injection, the
explicit-override case, and the reasoning maxCompletionTokens branch.

* test(oci): e2e regression that omitted max_tokens isn't truncated

Real-proxy integration test asserting a chat completion that omits max_tokens
completes with finish_reason "stop" instead of being cut off at OCI's ~20-token
server default. Fails before the maxTokens-default injection (finish_reason
"length", ~19 tokens), passes after.

* test(oci): update cohere default-params test for injected maxTokens

test_cohere_default_parameters asserted no maxTokens was injected, encoding the
old behaviour where OCI's ~20-token server default truncated responses. Now
that transform_request injects DEFAULT_OCI_CHAT_MAX_TOKENS, assert maxTokens
equals that default while the other params (topK/topP/frequencyPenalty) stay
pass-through with no hardcoded default.

* fix(oci): make DEFAULT_OCI_CHAT_MAX_TOKENS a plain constant

Drop the os.getenv override. The env knob was not requested and introducing a
new env var forced a cross-repo dependency on litellm-docs (test_env_keys.py
validates every referenced env var against the docs table there). A plain 4096
constant keeps the PR self-contained; callers who want a different limit pass
max_tokens explicitly per request.

* fix(oci): route all OpenAI commercial models to maxCompletionTokens

OCI serves OpenAI models (gpt-4.1, gpt-5.1 through 5.5, o-series) that
the litellm catalog doesn't track, so the supports_reasoning lookup
returned False for them and the provider sent maxTokens, which the
reasoning families reject with HTTP 400. With the injected default
maxTokens this broke every request to those models, not just ones with
an explicit max_tokens. Route the whole openai.* vendor prefix to
maxCompletionTokens since OpenAI accepts max_completion_tokens on every
chat model; the openai.gpt-oss-* open weights are served by OCI's own
stack and keep maxTokens. Verified live against gpt-5.2, gpt-5, gpt-4o,
gpt-4.1, gpt-oss-120b, llama-3.3, command-a and grok-3-mini

* test(oci): hoist transformation imports and drop unused ones

Makes the generic-chat test file ruff-clean: the per-test local imports
of OCIChatConfig/OCIVendors shadowed the module-level import (F811) and
left it unused (F401), and json plus three OCI type imports were never
referenced

* fix(oci): translate response_format json_schema to OCI's accepted shape (#29691)

* fix(oci): translate response_format json_schema to OCI's accepted shape

OCI GenAI rejected every json_schema response_format with HTTP 400
"Please pass in correct format of request", which broke structured-output
callers such as MLflow LLM judges (they always send a json_schema).

The provider forwarded OpenAI's raw json_schema body unchanged. For GENERIC
models OCI's ResponseJsonSchema accepts only name/description/schema/isStrict,
so OpenAI's `strict` key (and any other extra) 400s the request; the key must
be renamed to isStrict and the body whitelisted. For Cohere models there is no
JSON_SCHEMA type at all; the schema has to ride on JSON_OBJECT as
{"type": "JSON_OBJECT", "schema": ...}. Cohere type values must also be the
canonical uppercase TEXT/JSON_OBJECT.

_normalize_response_format now branches by vendor and emits the exact shape
each one accepts (verified live against OCI GenAI for Cohere, Meta, Gemini and
Grok). Drops the unused, incorrect Cohere response-format pydantic models.

Two existing tests asserted the broken behavior (lowercase type, raw
jsonSchema on Cohere); they are rewritten to assert the corrected shape, and
generic/Cohere json_schema regression tests are added.

* fix(oci): raise early on json_schema response_format with no body

A GENERIC model request with {"type": "json_schema"} and no json_schema
object fell through to the JSON_OBJECT branch and emitted a bodyless
{"type": "JSON_SCHEMA"}, which OCI rejects with an opaque HTTP 400. Raise a
descriptive 400 at translation time instead. Cohere is unaffected since it
always maps to JSON_OBJECT.
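
The per-vendor translation can be sketched roughly as below. This is an illustration under assumptions, not the body of _normalize_response_format: the function name, the `vendor` strings, the `jsonSchema` wrapper key for generic models, and the use of ValueError in place of a descriptive 400 are all choices made for the example.

```python
def normalize_response_format_sketch(response_format: dict, vendor: str) -> dict:
    """Hypothetical translation of an OpenAI response_format for OCI."""
    fmt_type = response_format.get("type")
    json_schema = response_format.get("json_schema") or {}

    if vendor == "cohere":
        # Cohere has no JSON_SCHEMA type; the schema rides on JSON_OBJECT.
        out = {"type": "JSON_OBJECT"}
        if "schema" in json_schema:
            out["schema"] = json_schema["schema"]
        return out

    if fmt_type == "json_schema":
        if not json_schema:
            # Fail early instead of sending a bodyless JSON_SCHEMA to OCI.
            raise ValueError("response_format json_schema requires a json_schema object")
        # Whitelist the accepted keys and rename strict -> isStrict.
        body = {k: json_schema[k] for k in ("name", "description", "schema") if k in json_schema}
        if "strict" in json_schema:
            body["isStrict"] = json_schema["strict"]
        return {"type": "JSON_SCHEMA", "jsonSchema": body}

    return {"type": "JSON_OBJECT"}
```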

* test(oci): gateway integration test for response_format json_schema

Added to tests/integration/ (the real-network integration suite) reusing the
existing OCI proxy harness, not tests/llm_translation/ which is mock-only.

---------

Co-authored-by: Sameer Kankute <sameer@berri.ai>

* fix(oci): accept default n=1 on Cohere instead of hard-failing (#29705)

* fix(oci): accept default n=1 on Cohere instead of hard-failing

Cohere on OCI has no numGenerations field, so n was mapped to False and
map_openai_params raised "param `n` is not supported on OCI" whenever a client
sent n. But n=1 (and None) is the OpenAI default single-generation request,
which every OCI model produces anyway, so standard clients that always send
n=1 (such as the MLflow gateway) were rejected with a 500.

Drop n=1/None silently for Cohere; only n>1 is genuinely unsupported and still
raises (or drops under drop_params). Generic models are unaffected and keep
numGenerations, including n>1.
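
In outline, the Cohere rule for n looks like the sketch below. The helper name and the use of ValueError are assumptions; the error text is the one quoted above.

```python
from typing import Optional


def map_n_for_cohere(n: Optional[int], drop_params: bool = False) -> dict:
    """Hypothetical helper: return the OCI params produced by `n` on Cohere."""
    if n is None or n == 1:
        # Single generation is what the model produces anyway: drop silently.
        return {}
    if drop_params:
        # Caller opted in to dropping unsupported params.
        return {}
    raise ValueError("param `n` is not supported on OCI")
```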

* docs(oci): explain why n is not advertised for Cohere despite tolerating n=1

* test(oci): gateway integration test for Cohere default n=1

Added to tests/integration/ (the real-network integration suite) reusing the
existing OCI proxy harness, not tests/llm_translation/ which is mock-only.

---------

Co-authored-by: Sameer Kankute <sameer@berri.ai>

* fix(oci): drop max_retries instead of hard-failing on OCI (#29727)

max_retries is a litellm-level control param (litellm applies retries itself),
not a generation param OCI accepts. The provider mapped it to False and raised
"param `max_retries` is not supported on OCI" whenever it was present. The
litellm proxy injects max_retries on every request, so any OCI call through the
proxy 500'd unless drop_params was set.

Drop max_retries silently in map_openai_params. Adds a unit test (Cohere and
generic) and a gateway integration test that a plain request succeeds through a
proxy without drop_params.

Co-authored-by: Sameer Kankute <sameer@berri.ai>

* fix(spend-logs): rehydrate metadata JSONB text on ui_view_spend_logs (#29682)

Fixes #29674.

`/spend/logs/ui` raw-SQL path returns the JSONB metadata column as a
string — prisma's query_raw skips the ORM-layer hydration. The UI reads
metadata.status / metadata.error_information as object fields, so
provider-failure rows look like successes.

Fix: json.loads the metadata field right after query_raw, fall back to
{} on malformed JSON.
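
A minimal sketch of the rehydration step, assuming each raw-SQL row is a dict. The helper name is hypothetical, and coercing non-object JSON to {} is an extra guard added for the example.

```python
import json


def rehydrate_metadata(row: dict) -> dict:
    """Hypothetical helper: turn a JSONB-as-text metadata field into a dict."""
    raw = row.get("metadata")
    if isinstance(raw, str):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            parsed = {}  # malformed JSON falls back to an empty object
        row["metadata"] = parsed if isinstance(parsed, dict) else {}
    return row
```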

3 existing error-code/error-message tests called json.loads on
response.data[0]["metadata"] — they were leaning on the bug. Updated
to read the dict directly. Plus 2 new regression tests (failure metadata
roundtrip + invalid-json fallback). Reverting the fix makes both new
tests fail with AssertionError: metadata should be dict, got <class 'str'>.

* fix(proxy): release max_parallel_requests slot when a stream is cancelled mid-flight (#27955) (#30020)

* fix(proxy): release max_parallel_requests slot when a stream is cancelled mid-flight (#27955)

* fix: refund max_parallel_requests on disconnect from outer streaming generators

The cancellation refund previously lived in async_post_call_streaming_iterator_hook,
but that hook is nested inside the outer streaming generators and a nested async
generator only receives GeneratorExit on garbage collection (non-deterministic).
With only the v3 limiter enabled, /chat/completions also bypasses the hook entirely
(needs_iterator_wrap() is false). Move the release into async_data_generator and
async_streaming_data_generator, the generators Starlette closes on client disconnect,
so the refund fires deterministically on every streaming route. Warn when no event
loop is running, and document the window TTL refresh on the decrement

* fix(mcp): propagate model into model_call_details for passthrough tool calls (#30122)

* fix(mcp): propagate model into model_call_details for passthrough tool calls

The @client decorator on call_mcp_tool creates the logging object via
function_setup without a model kwarg, so model_call_details["model"]
starts as None. execute_mcp_tool only set logging_obj.model as an
instance attribute, which the spend-log writer never reads (it reads
kwargs["model"] from model_call_details). MCP passthrough tools/call
rows therefore persisted with model="" while list_tools rows showed
"MCP: list_tools", degrading the Logs UI display and bucketing all MCP
tool spend under an empty model in DailyUserSpend.

Propagate the model into model_call_details alongside the existing
attribute assignment so the StandardLoggingPayload and SpendLogs writer
pick it up. Covers the /mcp passthrough, REST /mcp-rest/tools/call, and
orchestrated paths (the latter already passed model into function_setup,
so this is a no-op there).

* test(mcp): trim regression test docstring

* fix(mcp): surface upstream challenges for delegated OAuth (#30124)

* fix(mcp): surface upstream challenges for delegated OAuth

* docs(mcp): clarify delegated upstream auth comments

* perf(benchmarks): add CPU timing metrics to streaming benchmark (#29980)

* Add CPU timing metrics to streaming benchmark

* Fix spacing around timing sample dataclass

* fix(gemini): don't emit empty choices on metadata-only stream chunks (#29167)

web_search + reasoning makes Gemini stream mid-chunks that carry only
grounding/thought metadata — no content part, no finishReason.
_process_candidates skips content-less candidates and the existing
fallback only ran when finishReason was set, so choices stayed empty
and the downstream streaming handler raised IndexError on choices[0].
Emit an empty-delta choice for content-less chunks regardless of
finishReason.

Fixes #28884

* fix(key): allow /key/update to clear budget_limits with [] or null (#30085)

* Fix /key/update rejecting budget_limits clear requests with HTTP 400

Sending budget_limits: [] or null to /key/update returned HTTP 400, so
once a key had budget windows the last one could never be removed.

prepare_key_update_data only json.dumps'd budget_limits when the value
was truthy, so [] and None passed through raw to the Prisma Json?
column; jsonify_object only serializes dicts, and prisma-client-py has
no DbNull sentinel for Json? writes, so Prisma rejected both shapes.

Serialize the clear case explicitly as the JSON literal null, matching
how memory_endpoints encodes metadata for the same column type. Truthy
values keep the existing reset_at window initialization path.

Fixes #30067.

* Require admin access for budget_limits changes on /key/update

Clearing budget_limits via [] or null is a budget mutation, but
_validate_update_key_data only counted max_budget and spend as budget
changes before deciding whether to skip _check_key_admin_access. A
non-admin key owner or a team member with /key/update could therefore
remove a key's per-window spend caps without admin authorization.

Treat any explicit budget_limits value in the request (set, change, or
clear) as a budget change so it gates through the same admin check as
max_budget. model_fields_set is used because an explicit null is
indistinguishable from an omitted field by value alone.

* fix(proxy): persist guardrail info in spend logs for /v1/responses (#30092)

Pre-call guardrail blocks on /v1/responses wrote guardrail_information
as null in LiteLLM_SpendLogs because _handle_logging_proxy_only_error
splits request_data by LoggedLiteLLMParams keys and litellm_metadata,
where the Responses API stores request metadata including
standard_logging_guardrail_information, was not among them. It fell
into optional_params, so merge_litellm_metadata never saw it. Add
litellm_metadata to LoggedLiteLLMParams so it routes into
litellm_params the same way metadata does on the chat completions path

Fixes #28971.

* fix(proxy): handle non-standard SSE frames in Anthropic passthrough logging (#26000)

Some third-party Anthropic-compatible providers emit non-standard SSE
frames (OpenAI-style [DONE] sentinels, non-JSON keep-alive lines) in
streaming responses. These caused json.JSONDecodeError in
_build_complete_streaming_response, breaking the passthrough logging
pipeline so the request was never logged or billed.

Skip whole-line 'data: [DONE]' sentinels and catch JSONDecodeError per
event. Matching the full line (not a substring) keeps a valid chunk
whose text payload contains '[DONE]' from being dropped.
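
The filtering rule can be sketched as below. The function name is hypothetical and the real _build_complete_streaming_response does more than parse lines.

```python
import json


def parse_sse_events(lines) -> list:
    """Hypothetical helper: collect JSON payloads from SSE data lines."""
    events = []
    for line in lines:
        stripped = line.strip()
        if not stripped.startswith("data:"):
            continue  # event names, comments, keep-alives, blank lines
        if stripped == "data: [DONE]":
            continue  # whole-line sentinel only, never a substring match
        payload = stripped[len("data:"):].strip()
        try:
            events.append(json.loads(payload))
        except json.JSONDecodeError:
            continue  # skip the malformed frame, keep the rest of the stream
    return events
```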

Co-authored-by: shin-berri <shin-laptop@berri.ai>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>
Co-authored-by: Sameer Kankute <sameer@berri.ai>

* feat(newrelic): Add New Relic extension  (#26989)

* initial New Relic integration.

* Minor fixes for basic observability.

* Implemented basic support for the success path. Generates New Relic
custom events needed by the AI Monitoring interface.

* Supportability metric is sent on first request.

* Emit supportability metric every hour instead of once a day.

* Add the start/end times to the messages before sending them so that the
start time and end time reflect the correct time and both are not set
to 'now'.

* Make use of `turn_off_message_logging` configuration that is available
by default from CustomLogger.

* Enabling New Relic agent to be wired when docker container starts if an environment variable
is set.

* If we cannot find trace information, send the AI events without the
trace ID attached.

* Use a fake trace_id if we cannot find one.

* Implementing a configuration so that users can use litellm configuration
to disable sending LLM messages to New Relic. There is a second method
to do this via New Relic env var.

* Missed file.

* Cleaning up logic to turn off recording content via either the
LiteLLM configuration or an env var.

* Removing debugging.
Fixed logic / comments around how often to send supportability metric.

* Initial version of public doc for New Relic.

* Use a proper name for the doc file.

* Updating newrelic.md document.

* Updating LiteLLM documentation for New Relic extension.

* Moving New Relic imports into the methods to support unit tests.

* Adding unit tests for the New Relic extension.

* Updating linting and the unit tests that are not running in the CI environment.

* Address reviewer feedback on New Relic integration.

- Fix _record_error_metric to use app.record_custom_metric() instead of
  module-level newrelic.agent.record_custom_metric() so the call works
  outside of an active transaction context
- Remove unreachable except ImportError block in _get_trace_context
- Update stale "23 hours" comment to "27 hours" (matches 97200s threshold)
- Remove commented-out debug code from _process_success
- Fix docs typo: NEW_RELIC_CUSTOM_INSIGHTS_EVENTS_MAX_SAMPLES_STOREDA ->
  NEW_RELIC_CUSTOM_INSIGHTS_EVENTS_MAX_SAMPLES_STORED
- Update TestRecordErrorMetric to verify app.record_custom_metric call

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Reformatting for the linter.

* Addressing additional automated feedback.

- Removed a legacy comment about the New Relic header
- Reordered imports in one file
- Switched another file to use the import at the top of the file instead of inline when used
- Added unit tests for untested methods that were identified

* Addressing new feedback.

- Proper handling of time to floats. Created a util method and updated code to use it.
- added the missing guard to ensure the app is enabled

* Addressing feedback.

- When an error occurs, still check if the periodic supportability metric should be emitted
- Added a check to ensure the extension is ready in the error handler to match _process_success

* Updating the NR event timestamps to more accurately reflect when
the messages were generated.

* Addressing feedback for potential better practice.

* Addressing feedback on accessing default values. Added tests for most of
these cases.

* Adding a new catch exception block based on feedback.

* Addressing feedback about a potential issue around a timestamp for the
supportability metric.

* Addressing minor feedback on length of generated, fallback traceId.

* Addressing feedback.

- A few more cases were found where the dictionary access might not return the correct value.
- Handling cases where `traceparent` is not lower cased

* Addressed feedback where the newrelic options might not apply correctly.

* Addressing some feedback.

* Addressing feedback.

* Validating testing / formatting for our changes.

* Updating linting, adding tests, defining data type for UI.

* Configuration for the logging callback definition.

* Adding a newrelic image for the UI to use.

* Putting the New Relic callback in proper alphabetic order.

* Copying the logo to a committed output directory so it shows up in a locally
built container.

* Adding missing definition of new env vars that were causing a build failure.

* Addressing automated feedback from greptile.

* Adding a few more unit tests to increase the code coverage just a bit more.

* Additional unit tests to push coverage to almost 90%.

* Adding a custom newrelic docker image build process. This removes the need to add the newrelic agent
to the core litellm container or dependencies.

* Clarifying message when the New Relic agent is not installed and someone
is trying to use the newrelic extension. Either use the proper image
when using docker, or install the agent manually when running from source.

* Ensuring pip is available to install the New Relic agent.

* Updating the definition and handling of traceId (no spanId).
Clarifying behavior of env vars vs UI configuration for
the newrelic extension.

* Removing entries from the New Relic logger configuration UI as these
values must be set as part of running the image.

* Removing a stale doc file that has moved to the litellm-docs repo.
Cleanup of Dockerfile to remove a LABEL that was incorrect.

* Updating container image name to be the best guess for the new name.

* Addressing feedback from greptile.

- Added a comment around token_count=0
- Updated the boolean parser to allow a wider set of options which matches existing patterns in other parts of LiteLLM.

* Removing option for a separate New Relic container image. The agreement
is to handle this in the New Relic integration docs.

* Updating error message when New Relic agent is not available.

* Wiring in the test message from the LiteLLM callback UX.

* Missed saving one of the file conflicts.

* Fixed a lint error I introduced. Somehow, I dropped another string
and now added it back.

* Adding newrelic to the schema definition.

* Added an admin check on the call before sending test message
as mentioned by the AI code review.

* Updating to use should_redact_message_logging(kwargs) as part of the
logic to determine if message content should be sent to New Relic
or not. This still uses the `record_content` property as well, but
both have to be true in order for content to be included.

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* Add Azure AI Foundry DeepSeek V3.1 and V4 Pro/Flash global pricing to cost map (#30134)

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(logging): translate Responses bridge result to ModelResponse for spend logs (#28985)

PR #29394 fixed the AnthropicResponse.model_validate crash for the streaming
anthropic_messages -> OpenAI Responses bridge by unwrapping terminal events
and returning the inner ResponsesAPIResponse. The spend_logs row lands and
usage/cost are correct, but the row's response field stores the Responses
API shape (output[...].content[...].text). The proxy UI Logs tab reads
response.choices[0].message via parseMessages in prettyMessagesUtils.ts
with no fallback for the Responses shape, so the OutputCard renders "No
response data available" for every cross-routed call. The same shape
mismatch affects every downstream consumer of spend_logs that assumes the
canonical chat-completion shape

This change keeps the unwrap from #29394 but routes the resulting
ResponsesAPIResponse (and the bare-response non-streaming path) through
LiteLLMResponsesTransformationHandler.transform_response, which is the
same conversion already used by the chat-completion Responses bridge.
Spend_logs now stores a ModelResponse with choices[0].message.content, so
the UI and other consumers see the assistant text. On a translation
failure (e.g. empty output on an incomplete response) the handler falls
back to a minimal ModelResponse carrying model and usage so the row still
lands rather than being dropped as a Non-Blocking error

Also corrects a stale comment in the Responses adapter that implied the
call type was reclassified to acompletion; the code preserves
anthropic_messages and the success handler translates back to
ModelResponse for the row

Fixes #28595

* fix(anthropic-adapter): re-emit first delta on streaming content-block transitions (#30024)

* fix(anthropic-adapter): re-emit first delta on streaming content-block transitions

The `/v1/messages` -> `/v1/chat/completions` streaming adapter
(`AnthropicStreamWrapper`) silently dropped the first non-empty delta of
every content block that started via a *transition* (e.g. text -> tool_use ->
text, text -> thinking).

When an upstream chunk both triggers a new content block (its type differs
from the active block) and carries that block's first delta, the wrapper
emitted `content_block_stop` -> `content_block_start` and then only re-queued
the trigger chunk when it was an `input_json_delta` (bundled tool args). The
synthesized `content_block_start` always carries an empty body, so the first
`text_delta` / `thinking_delta` was lost — the client output started from the
second token (e.g. "Hi, how can I help you?" rendered as ", how can I help
you?", or text resuming after a tool call lost its first sentence). This is
especially visible with Claude Code-style clients that consume Anthropic
Messages streaming events strictly.

Fix: re-queue the trigger chunk's translated delta whenever it carries
non-empty content (text/thinking/signature/tool args), via a shared
`_trigger_delta_has_content` helper used by both the sync and async paths.
Empty trigger deltas are still suppressed so no spurious empty
`content_block_delta` is introduced.
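
A predicate along these lines could decide whether the trigger chunk is re-queued. It is a sketch that assumes the translated chunk is a plain Anthropic `content_block_delta` dict; it is not the body of `_trigger_delta_has_content`, and its name is hypothetical.

```python
# Payload field carried by each Anthropic streaming delta type.
_PAYLOAD_FIELD_BY_DELTA_TYPE = {
    "text_delta": "text",
    "input_json_delta": "partial_json",
    "thinking_delta": "thinking",
    "signature_delta": "signature",
}


def trigger_delta_has_content_sketch(chunk) -> bool:
    """Hypothetical predicate: True when the delta carries a non-empty payload."""
    if not isinstance(chunk, dict) or chunk.get("type") != "content_block_delta":
        return False
    delta = chunk.get("delta")
    if not isinstance(delta, dict):
        return False
    field = _PAYLOAD_FIELD_BY_DELTA_TYPE.get(delta.get("type"))
    # Unknown delta types and empty payloads are not re-emitted.
    return bool(field and delta.get(field))
```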

Fixes #30014

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test(anthropic-adapter): cover all _trigger_delta_has_content branches

Add a direct parametrized unit test for the re-emit predicate so every delta
type (text/input_json/thinking/signature), the empty-payload guards, and the
malformed/non-delta cases are exercised independently of upstream chunk
translation. Raises patch coverage for the new helper.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: shin-berri <shin-laptop@berri.ai>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

* feat: add opt-in healthy_only filter to GET /v1/models (#30130)

* feat: add opt-in healthy_only filter to GET /v1/models

Adds an opt-in `healthy_only=true` query parameter to GET /v1/models and
GET /models that hides models whose backing deployments are all marked
unhealthy by background health checks.

- Add Router.async_get_fully_unhealthy_model_names(), mirroring the
  semantics of get_fully_blocked_model_names(): a model is hidden only
  when every backing deployment is unhealthy and the health state is
  not stale (fail open otherwise).
- Reuses the existing DeploymentHealthCache populated by
  _run_background_health_check(), so no new health state is introduced.
- No-op when allowed_fails_policy is set, mirroring
  _async_filter_health_check_unhealthy_deployments semantics.
- team_public_model_name aliases are aggregated alongside model_name.
- Hiding is presentation-only; default behavior is unchanged.
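
The hide rule in the list above can be sketched as follows. This is an illustration under assumptions: deployments are given as (model_name, deployment_id) pairs, health state as a dict of (is_healthy, checked_at) tuples, and staleness as a simple age threshold. None of these shapes are taken from Router or DeploymentHealthCache.

```python
from collections import defaultdict


def _confirmed_unhealthy(dep_id, health_state, now, max_age_seconds) -> bool:
    state = health_state.get(dep_id)
    if state is None:
        return False  # no health data: fail open
    is_healthy, checked_at = state
    fresh = (now - checked_at) <= max_age_seconds
    return (not is_healthy) and fresh  # stale data also fails open


def fully_unhealthy_model_names(deployments, health_state, now, max_age_seconds) -> set:
    """Hypothetical helper: models whose every deployment is freshly unhealthy."""
    by_model = defaultdict(list)
    for model_name, dep_id in deployments:
        by_model[model_name].append(dep_id)
    return {
        model_name
        for model_name, dep_ids in by_model.items()
        if all(_confirmed_unhealthy(d, health_state, now, max_age_seconds) for d in dep_ids)
    }
```

A model with one healthy, stale, or unchecked deployment stays visible; only a model with fresh unhealthy results for every deployment is hidden.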

Fixes #30128

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* docs: address Greptile review notes

- Note team-alias asymmetry vs get_fully_blocked_model_names
- Debug-log when healthy_only is set but no health state is available

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: shin-berri <shin-laptop@berri.ai>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>

* Dedupe team soft budget alerts by team_id instead of token (#30097)

_team_soft_budget_check sends type="soft_budget" alerts with
event_group=TEAM, but SoftBudgetAlert.get_id always returned the
request token. The alert cache key was therefore scoped per virtual
key, so every active key in a team over its soft budget fired its own
alert within budget_alert_ttl. Branch on event_group so team-level
alerts dedupe by team_id, matching TeamBudgetAlert, while key and
project level alerts keep per-token dedupe.
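
The dedupe change amounts to choosing a different cache id per event group. A minimal sketch, with a hypothetical function name and event_group modelled as a plain string rather than the real enum:

```python
from typing import Optional


def soft_budget_alert_id(event_group: str, token: str, team_id: Optional[str]) -> str:
    """Hypothetical helper: pick the id used to dedupe a soft-budget alert."""
    if event_group == "team" and team_id:
        # Team-level alerts: one alert per team, however many keys are active.
        return team_id
    # Key- and project-level alerts keep per-token dedupe.
    return token
```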

Fixes #27398.

* feat(bedrock guardrails): support contextual grounding qualifiers (request-side) (#30057)

* test: add failing tests for Bedrock contextual grounding (request-side)

Drive the request-side of Bedrock contextual grounding: callers tag message
content blocks as grounding_source/query, the post_call hook assembles an
ApplyGuardrail(OUTPUT) call carrying source + query + response(guard_content),
and the bedrock converse transform must render the tags as prompt text instead
of silently dropping them. Non-grounding payloads must stay byte-identical.

* feat(bedrock guardrails): support contextual grounding qualifiers

Bedrock contextual grounding scores a model response against a reference
source and the user query, expressed via a per-content-block `qualifiers`
array on ApplyGuardrail. The guardrail hook previously sent plain text only,
so grounding could not be driven through it even though the response-side
contextualGroundingPolicy parsing already existed.

Callers now tag message content blocks `{"type":"grounding_source"}` /
`{"type":"query"}` (mirroring the existing `guarded_text` marker). On the
generate path the bedrock converse transform renders them as plain text; at
post_call the hook harvests them from the request and assembles one
ApplyGuardrail(OUTPUT) call carrying grounding_source + query + the response
(as guard_content). Requests without these tags produce a byte-identical
payload, so existing behaviour is unchanged.
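
A rough sketch of the harvesting step, under stated assumptions: the
helper name is hypothetical, and the tagged blocks are assumed to carry
their text in a "text" field like the existing guarded_text marker. The
output follows the ApplyGuardrail content shape, where each text block
carries a qualifiers list.

```python
from typing import Any, Dict, List

GROUNDING_TAGS = ("grounding_source", "query")


def build_grounding_content(
    messages: List[Dict[str, Any]], response_text: str
) -> List[Dict[str, Any]]:
    """Collect tagged request blocks and append the response as guard_content."""
    content: List[Dict[str, Any]] = []
    for message in messages:
        blocks = message.get("content")
        if not isinstance(blocks, list):
            continue  # plain string content carries no grounding tags
        for block in blocks:
            if isinstance(block, dict) and block.get("type") in GROUNDING_TAGS:
                content.append(
                    {
                        "text": {
                            "text": block.get("text", ""),  # assumed field name
                            "qualifiers": [block["type"]],
                        }
                    }
                )
    if not content:
        # No tags found: the caller keeps the existing plain-text path.
        return []
    content.append({"text": {"text": response_text, "qualifiers": ["guard_content"]}})
    return content
```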

* Feat(guardrail): Adding support for custom Ovalix guardrail (#21887)

* Feat(guardrail): Adding support for custom Ovalix guardrail

* Internal CR comments fixes

* greptileai comments fixes

* fix conflict

* fixes

* fix sha256

* clarify Ovalix actor-id hash is for normalization, not PII protection

* fix(github_copilot): normalize per-event item_id in /responses streaming (#30072)

GitHub Copilot's native /v1/responses stream assigns a different item_id to
every event of a single output item (output_item.added, the part.added /
delta / done events, and output_item.done). Spec-strict clients like the
Vercel AI SDK key streaming parts by item_id and abort with
"reasoning part <id> not found" / "text part <id> not found" when a delta
references an unregistered id.

Override transform_streaming_response in GithubCopilotResponsesAPIConfig to
anchor every event of an output item to the id from its output_item.added.
Copilot accepts that id paired with the final encrypted_content on the next
turn, so multi-turn replay is unaffected.
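
The anchoring idea can be sketched as below. This is an illustration,
not the GithubCopilotResponsesAPIConfig override itself: the function
name is hypothetical, events are treated as plain dicts, and correlating
events to their output item by output_index is an assumption.

```python
from typing import Any, Dict, Iterable, Iterator


def normalize_item_ids(events: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Rewrite per-event ids so each output item keeps the id it was added with."""
    anchors: Dict[Any, str] = {}  # output_index -> id from output_item.added
    for original in events:
        event = dict(original)  # shallow copy so the caller's event is untouched
        index = event.get("output_index")
        if event.get("type") == "response.output_item.added":
            anchors[index] = event["item"]["id"]
        elif index in anchors:
            if "item_id" in event:
                event["item_id"] = anchors[index]
            if isinstance(event.get("item"), dict):
                # Covers output_item.done, which carries the id inside "item".
                event["item"] = {**event["item"], "id": anchors[index]}
        yield event
```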

Fixes #30071

* feat: add /model/block and /model/unblock endpoints (#30125)

* feat: add /model/block and /model/unblock endpoints

Add dedicated proxy-admin POST /model/block and /model/unblock endpoints
over the existing blocked flag on LiteLLM_ProxyModelTable, mirroring the
/key/block and /key/unblock pattern. Calling a model whose deployments are
all blocked now returns a clear 403 "Model is blocked" instead of a generic
no-deployment error, including direct-dispatch route types (e.g. eval) via a
pre-route guard. Includes audit-log entries for block/unblock and unit tests.

Closes #29742

Signed-off-by: AgentGymLeader <264910004+AgentGymLeader@users.noreply.github.com>

* chore: regenerate dashboard API types for model block/unblock endpoints

Regenerate ui/litellm-dashboard/src/lib/http/schema.d.ts from the proxy
OpenAPI spec (npm run gen:api) so it includes the new endpoints.

Signed-off-by: AgentGymLeader <264910004+AgentGymLeader@users.noreply.github.com>

* fix: widen router block-helper param type and add direct unit tests

Type the _are_all_deployments_blocked deployments parameter to match its
callers (DeploymentTypedDict) so mypy passes, and add
tests/test_litellm/test_router_block_helpers.py with direct unit tests for
the three block helper methods so router_code_coverage recognizes them.

Signed-off-by: AgentGymLeader <264910004+AgentGymLeader@users.noreply.github.com>

* fix: restore type-ignore on messages arg after black reflow

Signed-off-by: AgentGymLeader <264910004+AgentGymLeader@users.noreply.github.com>

* refactor: raise model-block 403 in proxy layer, not SDK Router

Keep the SDK Router's documented behavior for blocked deployments (filtered ->
"no healthy deployment") and move the 403 PermissionDeniedError into the proxy
layer (route_llm_request), where model blocking is an admin concept. This avoids
a backwards-incompatible 403 for SDK users who set blocked=True on their own
deployments, per maintainer review.
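
A simplified sketch of the resulting split, with hypothetical names
throughout: the exception class, the guard function, and the flat
"blocked" key on each deployment dict are assumptions, not the real
Router or route_llm_request code.

```python
from typing import Any, Dict, List


class ModelBlockedError(Exception):
    """Hypothetical stand-in for the proxy's 403 PermissionDeniedError."""

    status_code = 403


def are_all_deployments_blocked(deployments: List[Dict[str, Any]]) -> bool:
    # An empty list is not "blocked"; it stays a no-deployment error.
    return bool(deployments) and all(d.get("blocked") is True for d in deployments)


def proxy_pre_route_guard(model: str, deployments: List[Dict[str, Any]]) -> None:
    """Proxy-layer check; the SDK router keeps filtering blocked deployments."""
    if are_all_deployments_blocked(deployments):
        raise ModelBlockedError(f"Model is blocked: {model}")
```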

Signed-off-by: FugoP <264910004+AgentGymLeader@users.noreply.github.com>

---------

Signed-off-by: AgentGymLeader <264910004+AgentGymLeader@users.noreply.github.com>
Signed-off-by: FugoP <264910004+AgentGymLeader@users.noreply.github.com>
Co-authored-by: AgentGymLeader <264910004+AgentGymLeader@users.noreply.github.com>
Co-authored-by: Sameer Kankute <sameer@berri.ai>

* fix: add week unit support to get_next_standardized_reset_time (#30100)

* fix: add week unit support to get_next_standardized_reset_time

The function handled d/h/m/s/mo units but silently fell through to
the default next-midnight branch for the w (week) unit. This was
inconsistent: _extract_from_regex already accepted w in its character
class, and duration_in_seconds already returned value * 604800 for it.

Add the missing elif unit == 'w' branch that delegates to
_handle_day_reset with value * 7, which reuses the existing Monday-
alignment logic for 1w and the generic N-day-from-midnight path for
larger multiples.

Add test_week_based_resets covering 1w from a Wednesday (expects next
Monday) and 2w from a Monday (expects 14 days forward at midnight).
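
The two tested cases can be illustrated with a standalone sketch. The
function below is hypothetical and does not reproduce _handle_day_reset;
in particular, how 1w behaves when the base date is already a Monday is
an assumption here.

```python
from datetime import datetime, timedelta, timezone


def next_week_reset(now: datetime, weeks: int) -> datetime:
    """Return the next reset time for an N-week budget window."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    if weeks == 1:
        # Align to the next Monday; weekday() is 0 for Monday.
        # Assumption: a Monday base date rolls forward a full week.
        return midnight + timedelta(days=7 - midnight.weekday())
    # Larger multiples: N * 7 days forward from today's midnight.
    return midnight + timedelta(days=7 * weeks)


wednesday = datetime(2024, 1, 3, 15, 30, tzinfo=timezone.utc)
monday = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
print(next_week_reset(wednesday, 1))  # 2024-01-08 00:00:00+00:00
print(next_week_reset(monday, 2))     # 2024-01-15 00:00:00+00:00
```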

Signed-off-by: FugoP <264910004+AgentGymLeader@users.noreply.github.com>

* test: exercise relative week semantics with non-Monday base dates + add docstring

Signed-off-by: FugoP <264910004+AgentGymLeader@users.noreply.github.com>

---------

Signed-off-by: FugoP <264910004+AgentGymLeader@users.noreply.github.com>
Co-authored-by: FugoP <264910004+AgentGymLeader@users.noreply.github.com>

* fix: black formatting and remove undocumented MAVVRIK_FOCUS_FREQUENCY env var

* fix: black formatting with correct version and sync schema.d.ts for healthy_only param

* fix: resolve mypy errors and add transcription_sessions to JSON schema endpoint enum

* fix: restore MAVVRIK_FOCUS_FREQUENCY guard and exclude it from docs key scan

* fix: address Greptile P2 comments - move constant, use UTC datetime, skip redundant team lookup

* revert: restore original team lookup logic in can_key_call_resolved_model

---------

Signed-off-by: AgentGymLeader <264910004+AgentGymLeader@users.noreply.github.com>
Signed-off-by: FugoP <264910004+AgentGymLeader@users.noreply.github.com>
Co-authored-by: Emerson Gomes <emerson.gomes@thalesgroup.com>
Co-authored-by: nina-hu <nina.huuu@gmail.com>
Co-authored-by: Sahith Jagarlamudi <104647530+s-jag@users.noreply.github.com>
Co-authored-by: shin-berri <shin-laptop@berri.ai>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>
Co-authored-by: Praveen Ghuge <95286176+pghuge-cloudwiz@users.noreply.github.com>
Co-authored-by: alex107ivanov <30668368+alex107ivanov@users.noreply.github.com>
Co-authored-by: hcl <chenglunhu@gmail.com>
Co-authored-by: Fede Kamelhar <federico.kamelhar@oracle.com>
Co-authored-by: Armaan Sandhu <74664101+Ar-maan05@users.noreply.github.com>
Co-authored-by: Teo Xian Zhong Augustine <35527068+auggie246@users.noreply.github.com>
Co-authored-by: King Star <mcxin.y@gmail.com>
Co-authored-by: Saksham Maggo <122939011+SakshamMaggo@users.noreply.github.com>
Co-authored-by: Filippo Menghi <113345637+Cyberfilo@users.noreply.github.com>
Co-authored-by: Kelvin <leikaiwei@outlook.com>
Co-authored-by: Josh Bonczkowski <josh.bonczkowski@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: M. Dennis Turp <mdturp@pm.me>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Piotr Minkina <piotrminkina@users.noreply.github.com>
Co-authored-by: Martín Alcalá Rubí <martin@tryolabs.com>
Co-authored-by: T. Kobayashi <13004314+nix-tkobayashi@users.noreply.github.com>
Co-authored-by: João Costa <13508071+jpv-costa@users.noreply.github.com>
Co-authored-by: Shalom <shalom@ovalix.io>
Co-authored-by: codgician <15964984+codgician@users.noreply.github.com>
Co-authored-by: FugoP <kim@pomsora.com>
Co-authored-by: AgentGymLeader <264910004+AgentGymLeader@users.noreply.github.com>
Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com>