litellm/tests/local_testing
Sameer Kankute cb041966bf Litellm oss staging 040626 (#29671)
* fix(azure): apply api_version fallback chain to image edit URL

`AzureImageEditConfig.get_complete_url` only read `api_version` from
`litellm_params`. When callers configured it via `litellm.api_version`
or `AZURE_API_VERSION`, the constructed URL had no `?api-version=` and
Azure responded `404 Resource not found`.

Apply the same fallback chain the Azure chat path already uses in
`common_utils.py`:

    litellm_params > litellm.api_version > AZURE_API_VERSION env >
    litellm.AZURE_DEFAULT_API_VERSION

Adds 5 unit tests pinning each layer of the chain plus a regression
guard for `api_base` that already carries `?api-version=`.
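
The fallback order above can be sketched as follows. This is a minimal illustration, not the code in `AzureImageEditConfig`: `GLOBAL_API_VERSION`, `DEFAULT_API_VERSION`, `resolve_api_version`, and `append_api_version` are hypothetical stand-ins, and the default value is a placeholder.

```python
import os
from typing import Optional

# Hypothetical stand-ins for litellm.api_version and
# litellm.AZURE_DEFAULT_API_VERSION; the default below is a placeholder.
GLOBAL_API_VERSION: Optional[str] = None
DEFAULT_API_VERSION = "2024-02-01"


def resolve_api_version(litellm_params: dict, env: Optional[dict] = None) -> str:
    """Return the first configured api_version, in the order the commit lists."""
    env = os.environ if env is None else env
    return (
        litellm_params.get("api_version")
        or GLOBAL_API_VERSION
        or env.get("AZURE_API_VERSION")
        or DEFAULT_API_VERSION
    )


def append_api_version(url: str, api_version: str) -> str:
    """Add api-version to the query string unless the URL already carries one."""
    if "api-version=" in url:
        return url
    separator = "&" if "?" in url else "?"
    return f"{url}{separator}api-version={api_version}"
```

The second helper mirrors the regression guard: an `api_base` that already includes `?api-version=` is returned untouched.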

* feat(mcp): core sampling and elicitation flow with security hardening

- Add sampling_handler.py: full MCP sampling/createMessage flow with
  model selection (hint-based + priority-based), auth enforcement,
  budget checks, route restriction gates, and tag policy pre-auth
- Add elicitation_handler.py: MCP elicitation/create relay with
  downstream client capability detection
- Wire sampling/elicitation callbacks in mcp_server_manager.py
  gated behind allow_sampling/allow_elicitation config flags
- Add allow_sampling/allow_elicitation fields to MCPServer type
- Fix session lock deadlock: skip lock for JSON-RPC response POSTs
  (elicitation/sampling replies) with truncated-body heuristic
- Extend client.py with sampling_callback and elicitation_callback
- Security: RouteChecks gate, tag-budget bypass fix, x-forwarded-for
  spoofing fix, Latin-1 header encoding guard
- Add 4 new test modules (model access, priority selection, request
  builder, tool conversion) + update existing MCP tests

* fix(security): run pre-call guardrails before MCP sampling acompletion

Without this, an upstream MCP server with allow_sampling enabled could
send prompts that bypass every guardrail (content filtering, PII
redaction, prompt-injection detection) configured on /chat/completions.

- Call proxy_logging_obj.pre_call_hook(call_type='acompletion') before
  llm_router.acompletion so guardrails fire for sampling sub-calls
- Add HTTPException to the re-raise list so guardrail rejections
  propagate correctly instead of being swallowed as generic errors

* feat(bedrock_mantle): add Responses API support (/openai/v1/responses) (#29490)

* feat(bedrock_mantle): add Responses API transformation config

* test(bedrock_mantle): cover trailing-slash api_base normalization

* feat(bedrock_mantle): export BedrockMantleResponsesAPIConfig

* feat(bedrock_mantle): register gpt-5.x Responses config (gpt-oss unchanged)

* feat(bedrock_mantle): add gpt-5.5/gpt-5.4 Responses price-map entries

* refactor(bedrock_mantle): exclude gpt-oss instead of allow-listing gpt-5 for Responses routing

Frontier OpenAI models on Bedrock Mantle are Responses-only on /openai/v1/responses;
gpt-oss is the legacy family that also speaks chat-completions. Gate by excluding
gpt-oss (which keeps its chat-completions emulation) and defaulting everything else
to the native Responses config, so future frontier models (gpt-6, etc.) route
correctly without a code change. Verified against the live us-east-2 Mantle endpoint:
gpt-oss 400s on /openai/v1/responses while gpt-5.5 400s on both standard paths.

* test(bedrock_mantle): cover supports_native_websocket opt-out

Closes the one uncovered line flagged by codecov on the Responses config.
The assertion documents that Mantle Responses has no realtime/websocket
transport, so realtime routing must not attempt a socket it cannot serve.

* fix(bedrock_mantle): route file_search through emulation instead of forwarding to Mantle

BedrockMantleResponsesAPIConfig inherited supports_native_file_search()
-> True from OpenAIResponsesAPIConfig but never overrode it. Mantle has no
OpenAI vector stores, so a forwarded file_search tool is rejected with a
400 (verified upstream: Tool type 'file_search' is not supported). Opting
out, like the existing supports_native_websocket override, routes the tool
through LiteLLM's file_search emulation instead.

* fix(bedrock_mantle): only route openai.gpt frontier models to Responses

The previous gate excluded gpt-oss and routed every other model to the
native Responses config. But on Mantle only the OpenAI gpt frontier models
(gpt-5.x) are served on /openai/v1/responses; gpt-oss and the non-OpenAI
families (nvidia, mistral, google, zai, ...) are chat-completions only and
400 on that path. Allow-list the openai.gpt- family (excluding gpt-oss)
instead, so chat-only models fall through to the chat-completions emulation.
Verified against the live us-east-2 endpoint: nvidia.nemotron-nano-9b-v2
returns 400 on /openai/v1/responses and 200 on /v1/chat/completions.

* feat(custom_llm): allow streaming/astreaming to yield ModelResponseStream (#27580)

* fix(custom_llm): allow streaming/astreaming to yield ModelResponseStream directly

* fix(streaming): enhance ModelResponseStream handling for custom LLM providers

* fix(streaming): strip finish_reason from content chunks and ensure tool_calls are preserved

* fix(streaming): add type ignore for finish_reason assignment in CustomStreamWrapper

* fix(proxy): strip stack trace from HTTP 503 responses (CWE-209) (#28330)

* fix(proxy/cwe-209): strip Python traceback from HTTP 503 error responses

The /cache/ping endpoint included a full Python traceback in its 503 error
response body (inside the ProxyException message), leaking internal file
paths, line numbers, and call stacks to any caller. Two MCP route handlers
in proxy_server.py similarly interpolated str(e) into "Internal server
error" detail strings.

Fix: log the traceback server-side via verbose_proxy_logger.exception()
and omit it from the ProxyException payload / HTTPException detail returned
to clients. Tests updated to assert no "traceback" keyword or frame paths
appear in the 503 body, with a new dedicated regression test.
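
A minimal sketch of the log-server-side, return-static-message pattern described above. `ServiceError` and `cache_ping` are hypothetical stand-ins for ProxyException and the real handler, and the logger name is illustrative.

```python
import logging

logger = logging.getLogger("proxy_sketch")


class ServiceError(Exception):
    """Hypothetical stand-in for the exception type returned to clients."""

    def __init__(self, message: str, status_code: int) -> None:
        super().__init__(message)
        self.message = message
        self.status_code = status_code


def cache_ping(ping) -> dict:
    """Run a health check; on failure, log details and raise a static error."""
    try:
        ping()
        return {"status": "healthy"}
    except Exception:
        # logger.exception records the traceback in server logs only.
        logger.exception("cache ping failed")
        # The client-facing message carries no str(e), paths, or frames.
        raise ServiceError("Service Unhealthy", 503) from None
```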

CWE-209: Generation of Error Message Containing Sensitive Information.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(proxy/cwe-209): apply Greptile P2 fixes and add MCP exception-path tests

Greptile 4/5 review identified two remaining gaps and Codecov reported
0% coverage on the two MCP handler exception branches:

1. caching_routes.py — str(e) in "Service Unhealthy ({str(e)})" could
   still leak Redis hostnames/IPs; replaced with static "Service Unhealthy".
   HTTPException is now re-raised before the generic handler so the
   "cache not initialized" 503 still reaches callers with its detail.
   Removed the redundant str(e) arg from verbose_proxy_logger.exception()
   (exception() already appends the traceback automatically).

2. tests — two new unit tests cover the exception paths in
   dynamic_mcp_route and toolset_mcp_route that were previously at 0%:
   - test_dynamic_mcp_route_unexpected_exception_returns_500_without_traceback
   - test_toolset_mcp_route_unexpected_exception_returns_500_without_traceback

All 25 tests pass (9 caching + 16 MCP).

CWE-209: Generation of Error Message Containing Sensitive Information.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(caching_routes): restore precise assertion in test_cache_ping_no_cache_initialized

The assertion was weakened to `"Cache not initialized" in str(data)`, which
matches the raw string of the entire response dict and would pass even if the
error moved to an unexpected field or changed structure.

Restore a targeted check on the parsed response: assert the exact string in
the correct field `data["detail"]`, matching FastAPI's HTTPException
serialisation format {"detail": "<message>"}.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(caching_routes): restore precise assertion and add CWE-209 no-cache path test

The assertion in test_cache_ping_no_cache_initialized was weakened to
`"Cache not initialized" in str(data)`, which matched against the raw string
representation of the entire response dict. This would pass silently even if
the error message moved to an unexpected field or the structure changed.

Restore a targeted assertion on the parsed field:
  assert data["detail"] == "Cache not initialized. litellm.cache is None"
matching FastAPI's HTTPException serialisation format exactly.

Add test_cache_ping_no_cache_does_not_expose_internals to show the code path
is still working correctly after the CWE-209 fix: verifies that the HTTPException
is re-raised as-is (no traceback, no source paths), and asserts the complete
response structure is exactly {"detail": "Cache not initialized. litellm.cache is None"}.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(caching_routes): restore ProxyException envelope for null-cache 503

The except HTTPException: raise guard (added in the CWE-209 fix) caused
the null-cache HTTPException to escape as FastAPI's {"detail": "..."} shape
instead of the {"error": {...}} ProxyException envelope that callers expect.

Move the null-cache guard before the try block and raise ProxyException
directly so the response structure is consistent with all other /cache/ping
503s, and the except HTTPException: raise guard is only reachable by
unexpected downstream HTTPExceptions.

Update the two no-cache tests to assert the correct ProxyException envelope.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* Update utils.py (#26609)

* feat(pricing): add Snowflake Cortex REST API model pricing (#26612)

* feat(pricing): add Snowflake Cortex REST API model pricing

## Summary

Adds pricing and context window information for 20+ Snowflake Cortex REST API models to `model_prices_and_context_window.json`.

## What's included

- **7 Claude models** (sonnet-4-5, sonnet-4-6, 4-sonnet, 4-opus, haiku-4-5, 3-7-sonnet, 3-5-sonnet) — with prompt caching rates
- **4 OpenAI models** (gpt-4.1, gpt-5, gpt-5-mini, gpt-5-nano) — with prompt caching rates  
- **5 Llama models** (3.1-8b, 3.1-70b, 3.1-405b, 3.3-70b, 4-maverick)
- **1 DeepSeek model** (deepseek-r1)
- **1 Mistral model** (mistral-large2)
- **1 Snowflake model** (snowflake-llama-3.3-70b)
- **2 Embedding models** (arctic-embed-l-v2.0, arctic-embed-m-v2.0)

Each entry includes `input_cost_per_token`, `output_cost_per_token`, `cache_read_input_token_cost` (where applicable), `max_input_tokens`, `max_output_tokens`, and capability flags (`supports_function_calling`, `supports_vision`, `supports_prompt_caching`, `supports_reasoning`).

## Pricing source

All prices are in USD per token, sourced from the official [Snowflake Service Consumption Table](https://www.snowflake.com/legal-files/CreditConsumptionTable.pdf) — Tables 6(b) (REST API with Prompt Caching) and 6(c) (REST API).

## Context

The existing `snowflake/` provider has zero model entries in the pricing JSON, which means LiteLLM cannot track costs for Snowflake Cortex calls. This PR fills that gap.

## Related

- Existing provider: `litellm/llms/snowflake/`
- Cortex REST API docs: https://docs.snowflake.com/en/user-guide/snowflake-cortex/cortex-rest-api

* Update model_prices_and_context_window.json

Fix the JSON parsing error

* Update model_prices_and_context_window.json

Removed the duplicate entry

* fix(utils): copy extra_body before adding unknown params to prevent model config mutation (#29620)

Fixes #29615. In add_provider_specific_params_to_optional_params, the line:

    extra_body = passed_params.pop("extra_body", None) or {}

returns the original dict reference when extra_body is non-empty (truthy).
Subsequent writes like extra_body[k] = passed_params[k] then mutate the
shared model config object held by the router, poisoning /model/info and
all subsequent requests for that deployment.

The or {} short-circuit creates a new dict only when extra_body is falsy
(None or {}), which is why the bug does not reproduce with extra_body: {}.

Fix: wrap in dict() so we always work on a fresh shallow copy.
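
The aliasing can be reproduced in isolation. The sketch below uses hypothetical helper names (`merge_unsafe`, `merge_safe`) and a plain dict in place of the router's model config.

```python
def merge_unsafe(passed_params: dict) -> dict:
    # `x or {}` returns x itself when x is truthy, so this aliases the caller's dict.
    extra_body = passed_params.pop("extra_body", None) or {}
    extra_body["unknown_param"] = 1
    return extra_body


def merge_safe(passed_params: dict) -> dict:
    # dict(...) always builds a new top-level dict (a shallow copy).
    extra_body = dict(passed_params.pop("extra_body", None) or {})
    extra_body["unknown_param"] = 1
    return extra_body


shared_config = {"reasoning": True}  # stands in for the shared model config
merge_unsafe({"extra_body": shared_config})
mutated = "unknown_param" in shared_config  # the shared dict was modified

shared_config = {"reasoning": True}
merge_safe({"extra_body": shared_config})
untouched = "unknown_param" not in shared_config  # only the copy changed
```

Because the copy is shallow, nested values inside `extra_body` are still shared; the fix only protects top-level keys.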

* fix(vertex_ai): Bake tool_choice into Gemini CachedContent body to prevent silent drop (#29097)

* fix(vertex_ai): bake tool_choice into Gemini CachedContent body to prevent silent drop

* address greptile feedback on tool_choice cache test

* adds test that uses ToolConfig(functionCallingConfig=FunctionCallingConfig(mode=ANY)) instead of a dict literal, mirroring what map_tool_choice_values actually produces

* fix(gemini/veo): move image from parameters into instances[0] (#29501)

* fix(gemini/veo): move image from parameters into instances[0]

Veo's predictLongRunning schema puts image (and prompt) on the
instances element; parameters is for aspectRatio/durationSeconds/etc.
The Gemini path was leaving image in params_copy, so it ended up
nested under parameters and the API silently ignored it.

The Vertex path already builds the instance dict explicitly, so this
just aligns the Gemini path with it.

Fixes #29498

* address greptile: unconditional pop + BytesIO test

- Pop `image` from params_copy unconditionally so it never reaches
  GeminiVideoGenerationParameters even when None, removing implicit
  reliance on Pydantic's extra-field-ignore.
- Add test_transform_video_create_request_image_filelike_goes_to_instance
  covering the BytesIO path (_convert_image_to_gemini_format) — round-trips
  the base64 to confirm encoding.
- Add test_transform_video_create_request_image_none_is_dropped covering
  the new None branch.
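
A rough sketch of the request shape described above, assuming the image has already been converted to whatever payload the API expects. `build_veo_request` is a hypothetical helper, not the transformation function in the Gemini path.

```python
from typing import Any, Dict


def build_veo_request(prompt: str, params: Dict[str, Any]) -> Dict[str, Any]:
    """Place prompt and image on instances[0]; keep the rest under parameters."""
    params_copy = dict(params)
    # Unconditional pop: image never stays in parameters, even when it is None.
    image = params_copy.pop("image", None)
    instance: Dict[str, Any] = {"prompt": prompt}
    if image is not None:
        instance["image"] = image
    return {"instances": [instance], "parameters": params_copy}
```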

* fix(huggingface): handle special token text in embedding usage (#29660)

* fix(guardrails): recompile ToolPermissionGuardrail rules on update_in_memory_litellm_params (#29655)

* fix(guardrails): recompile ToolPermissionGuardrail rules on update_in_memory_litellm_params

ToolPermissionGuardrail builds self.rules and the compiled target/pattern
maps only in __init__. The base update_in_memory_litellm_params re-sets raw
attributes via setattr but never rebuilds those maps, so a guardrail updated
in place (PUT /guardrails, or the immediate in-memory sync) keeps enforcing
the construction-time rules until it is reinitialized (PATCH path, periodic
DB poll, or restart).

Extract the compile step into _load_rules and override
update_in_memory_litellm_params to rebuild from it (dict- and model-safe),
re-normalizing default_action / on_disallowed_action. Mirrors the existing
PresidioGuardrail override of the same method. Adds regression tests.

Fixes #29592.

* fix(guardrails): handle dict params in ToolPermissionGuardrail in-memory update

Delegate to super() only for LitellmParams input (the base setattr loop is
model-only); apply the raw-dict case inline. Fixes the mypy arg-type error
and makes the recompile work when the proxy passes the raw DB dict.

* fix(guardrails): preserve tool-permission rules on a partial in-memory update

A partial update (e.g. a LitellmParams whose rules field is None) ran through
the generic setattr, which set self.rules to None, and the recompile was
skipped, leaving the guardrail with no rules. Snapshot the previous rules and
restore them when the update carries no rules; an explicit empty list still
clears them. Adds a regression test for the rules-absent case.
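
The snapshot-and-restore behavior can be sketched as below. `RuleHolder` is a hypothetical stand-in for the guardrail and takes a plain dict; the real override also handles LitellmParams and recompiles patterns.

```python
from typing import List, Optional


class RuleHolder:
    """Hypothetical stand-in for a guardrail that holds a rules list."""

    def __init__(self, rules: List[dict]) -> None:
        self.rules: Optional[List[dict]] = list(rules)

    def update(self, params: dict) -> None:
        previous_rules = self.rules        # snapshot before the generic update
        for key, value in params.items():  # generic setattr loop
            setattr(self, key, value)
        if params.get("rules") is None:
            # The update carried no rules: keep enforcing the previous ones.
            self.rules = previous_rules
```

An explicit empty list is not None, so it still replaces the rules and clears them.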

Addresses the Greptile review note on #29655.

* fix(bedrock): stop base_model label from stripping tools/tool_choice (#29621)

* fix(bedrock): stop base_model label from stripping tools/tool_choice

A Router/proxy Bedrock deployment whose model_info.base_model is a friendly
label (e.g. claude-haiku-4-5) silently lost tools/tool_choice: the outgoing
Converse request was built without toolConfig, so the model behaved as if no
tools were provided. This worked in v1.84.0 and regressed in v1.85.0; with
drop_params=true the tools were dropped without any error.

Two changes compound into the bug. completion() passed model_info.base_model
as the model argument to get_optional_params, so the real Bedrock model id
never reached supported-param resolution; and get_supported_openai_params
resolved the provider config's params from base_model or model, letting the
label fully replace the real model. For Bedrock the label resolves to no tool
support, so tools/tool_choice were dropped before transformation.

completion() now keeps model as the real deployment model and threads the
resolved base_model (kwarg or model_info) through separately, and
get_supported_openai_params treats base_model as additive: it returns the
union of the params supported by model and by base_model. A hint can only add
capabilities, never strip ones the real model already exposes, which also
preserves the original base_model behavior from #27717 and Azure's
base_model-driven model-type detection.
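
The additive rule can be illustrated with a toy capability table. The table, the `example.real-model-id` key, and `supported_params` are all hypothetical; only the union behavior reflects the description above.

```python
from typing import Dict, List, Optional

# Hypothetical capability table: the friendly label resolves to no tool support.
SUPPORTED: Dict[str, List[str]] = {
    "example.real-model-id": ["temperature", "tools", "tool_choice"],
    "claude-haiku-4-5": ["temperature", "max_tokens"],
}


def supported_params(model: str, base_model: Optional[str] = None) -> List[str]:
    """Union of params for model and base_model, keeping the model's order first."""
    params = list(SUPPORTED.get(model, []))
    if base_model:
        for name in SUPPORTED.get(base_model, []):
            if name not in params:
                params.append(name)  # a hint may add capabilities, never remove
    return params
```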

Fixes #29618

* test(main): make base_model param test robust to new parametrize cases

Restore an explicit per-case expected_model_param literal instead of
hardcoding the gemini id, so a future case with a different model can't
produce a misleading assertion failure.

* fix(fireworks_ai): pass response_format json_schema through unchanged (#29606)

FireworksAIConfig.map_openai_params was rewriting the OpenAI strict
`{type: json_schema, json_schema: {name, strict, schema}}` shape into
`{type: json_object, schema: ...}` before sending to Fireworks, dropping
`strict` and `name` and changing the `type`. Per Fireworks' docs json_object
means "force any valid JSON output (no specific schema)", so the schema
constraint was effectively dropped and grammar-guided decoding never ran;
model output silently violated the schema.

The rewrite landed in #7085 (Dec 2024) when Fireworks did not yet accept
native json_schema. Fireworks accepts the OpenAI strict shape natively now,
so the rewrite has become a regression.

Removes the rewrite. Passes response_format through unchanged. Updates the
existing test_map_response_format to assert pass-through. Adds focused
regression tests in tests/test_litellm/ covering preservation of type,
strict, name, and schema body, plus that json_object alone still works.
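
For illustration, the difference between the removed rewrite and pass-through looks roughly like this. Both functions are hypothetical reductions of the behavior described above, not the body of `map_openai_params`.

```python
def legacy_rewrite(response_format: dict) -> dict:
    """Approximation of the removed behavior: json_schema becomes json_object."""
    if response_format.get("type") == "json_schema":
        return {
            "type": "json_object",
            "schema": response_format["json_schema"]["schema"],
        }
    return response_format


def pass_through(response_format: dict) -> dict:
    """Behavior after the fix: the OpenAI shape is forwarded unchanged."""
    return response_format


strict_format = {
    "type": "json_schema",
    "json_schema": {"name": "answer", "strict": True, "schema": {"type": "object"}},
}
```

With `strict_format` as input, the legacy path loses `name` and `strict` and changes `type`, while pass-through keeps all four fields.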

* fix(types): import Required from typing_extensions in gemini types

* style: reformat sampling_handler.py for py312 black compat

* refactor(mcp-sampling): extract helpers to fix PLR0915 too-many-statements in handle_sampling_create_message

* fix(proxy-server): add explicit ProxyLogging type annotation to proxy_logging_obj to fix mypy inference

* fix(mcp-sampling): suppress mypy assignment error on ImportError fallback for proxy_logging_obj

* fix(test): use .value when comparing LlmProviders enum against string in test_default_api_base

* fix(test): iterate LlmProviders enum in test_default_api_base to avoid str pollution from custom provider registration

litellm.provider_list is a mutable global initialized to list(LlmProviders) but custom_llm_setup() appends plain provider strings to it. When a test_custom_llm.py test runs first in the same xdist worker, provider_list contains a str and calling .value on it raises AttributeError. Iterate the immutable LlmProviders enum instead, which is deterministic and what the check intends.
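
The failure mode is easy to reproduce with a toy enum. `Providers`, `provider_list`, and the two helpers below are hypothetical stand-ins for LlmProviders and litellm.provider_list.

```python
from enum import Enum


class Providers(str, Enum):
    OPENAI = "openai"
    AZURE = "azure"


# Mutable global seeded from the enum, then polluted with a plain string,
# as a custom provider registration would do.
provider_list = list(Providers)
provider_list.append("my_custom_llm")


def values_from_list() -> list:
    # Raises AttributeError once it reaches the plain str entry.
    return [p.value for p in provider_list]


def values_from_enum() -> list:
    # The enum itself is immutable, so this is deterministic.
    return [p.value for p in Providers]
```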

* fix(mcp): depth-aware JSON-RPC response detection and neutral speed-priority fallback

Replace the flat substring check in the truncated-body routing path with a
top-level-key scan so a JSON-RPC response whose result payload nests a
"method" field is still detected as a response and skips the session lock,
removing a deadlock against the in-flight tool call awaiting it.
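
A minimal sketch of a top-level-key scan over a possibly truncated body, assuming a single JSON object. `top_level_keys` and `looks_like_jsonrpc_response` are hypothetical names, and the real routing code may differ in detail.

```python
def top_level_keys(body: str) -> set:
    """Collect object keys at nesting depth 1, tolerating a truncated body."""
    keys = set()
    depth = 0
    i, n = 0, len(body)
    while i < n:
        ch = body[i]
        if ch == '"':
            j = i + 1
            while j < n and body[j] != '"':
                if body[j] == "\\":
                    j += 1  # skip the escaped character
                j += 1
            if j >= n:
                break  # body was truncated inside a string
            k = j + 1
            while k < n and body[k] in " \t\r\n":
                k += 1
            # A string followed by ':' at depth 1 is a top-level key.
            if depth == 1 and k < n and body[k] == ":":
                keys.add(body[i + 1 : j])
            i = j + 1
            continue
        if ch in "{[":
            depth += 1
        elif ch in "}]":
            depth -= 1
        i += 1
    return keys


def looks_like_jsonrpc_response(body: str) -> bool:
    keys = top_level_keys(body)
    return "method" not in keys and bool(keys & {"result", "error"})
```

A flat substring check for `"method"` would misclassify a response whose result nests that field; the scan only considers keys at depth 1.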

Drop the inverse max_output_tokens speed proxy when no model exposes
output_tokens_per_second; context-window size does not track latency, so a
neutral score avoids biasing speedPriority toward the smallest-context model.

* fix(guardrails): make ToolPermission rule reload atomic on invalid regex

_load_rules appended each rule to self.rules before compiling its regex, so an
invalid pattern raised mid-loop after the bad rule was already live but without
a _compiled_rule_targets entry. _matches_regex reads a missing compiled target
as a None pattern and returns True, turning the bad rule into a match-all that
silently applies its decision to every tool. Via update_in_memory_litellm_params
(PUT /guardrails) this corrupted the live guardrail.

Build the parsed rules and compiled maps into locals and swap them in only after
every regex compiles, and restore the previous ruleset if a live update is
rejected, so an invalid regex now fails the update without leaving the guardrail
enforcing a broken policy.
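
The build-then-swap approach can be sketched as below. `RuleSet`, its attribute names, and the rule dict keys are hypothetical; the point is that nothing is published until every pattern compiles.

```python
import re
from typing import Dict, List


class RuleSet:
    """Hypothetical stand-in for the guardrail's rule state."""

    def __init__(self, rules: List[dict]) -> None:
        self.rules: List[dict] = []
        self.compiled: Dict[str, "re.Pattern"] = {}
        self.load_rules(rules)

    def load_rules(self, rules: List[dict]) -> None:
        new_rules: List[dict] = []
        new_compiled: Dict[str, "re.Pattern"] = {}
        for rule in rules:
            # re.compile raises re.error on a bad pattern, before any state changes.
            new_compiled[rule["id"]] = re.compile(rule["tool_name"])
            new_rules.append(rule)
        # Swap both structures in together, only after every pattern compiled.
        self.rules, self.compiled = new_rules, new_compiled
```

If an update contains an invalid regex, `load_rules` raises and the previous `rules` and `compiled` remain in force.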

* test(mcp): cover sampling conversion, model resolution, and elicitation relay paths

The MCP sampling and elicitation handlers shipped with partial test
coverage, leaving the response-to-MCP conversion, the model resolution
fallback chain, completion-kwargs assembly, guardrail routing, and the
entire elicitation relay untested. That pulled the PR's diff (patch)
coverage below the codecov threshold even though overall project
coverage rose.

Add focused unit tests for _convert_openai_response_to_mcp_result,
_convert_mcp_tools_to_openai, _convert_mcp_tool_choice_to_openai, image
and audio content conversion, the hint-matching and fallback branches of
_resolve_model_from_preferences, _build_completion_kwargs, the router and
guardrail-rejection paths of _run_guardrails_and_call_llm, the
handle_sampling_create_message success and error-propagation flows, the
marker-hoisting fallback for tool content on unexpected roles, and the
elicitation form/url/generic relay together with its decline paths.

---------

Co-authored-by: shin-berri <shin-laptop@berri.ai>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>
Co-authored-by: lengkejun <lengkejun@xd.com>
Co-authored-by: Yug <yugborana000@gmail.com>
Co-authored-by: Kent <72616338+kingdoooo@users.noreply.github.com>
Co-authored-by: tanmay958 <53569547+tanmay958@users.noreply.github.com>
Co-authored-by: DrishnaTrivedi <142084770+DrishnaTrivedi@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Navnit Shukla <Navnit.shukla25@gmail.com>
Co-authored-by: PRABHU KIRAN VANDRANKI <72809214+VANDRANKI@users.noreply.github.com>
Co-authored-by: Adrian Lopez <109683617+adriangomez24@users.noreply.github.com>
Co-authored-by: hcl <chenglunhu@gmail.com>
Co-authored-by: JooHo Lee <96564470+BWAAEEEK@users.noreply.github.com>
Co-authored-by: Dinesh Girbide <85330597+Dinesh-Girbide@users.noreply.github.com>
Co-authored-by: cloudwiz <22098246+andrey-dubnik@users.noreply.github.com>
Co-authored-by: Ahmad Khan <ahmadkhan2508@gmail.com>
Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com>
2026-06-04 11:07:20 -07:00