* fix(router.py): support base model for model group usage
allows model group info to show accurate cost information for azure models
* fix(router.py): fix changes
* test: add unit tests
* build(pyproject.toml): bump openai version requirements
support custom tool from responses api
Closes https://github.com/BerriAI/litellm/issues/13391
* docs(responses_api.md): add verbosity + free-form function calling parameters
* docs(responses_api.md): add cfg + minimal reasoning to docs
Closes https://github.com/BerriAI/litellm/issues/13391
* docs(responses_api.md): add proxy examples to docs
* refactor: fix ruff error
* added mcp guardrails doc in mcp.md
* add button to reload models
* Added button changes
* added button for scheduling reload
* add multi pod support to reloading the model price json
* fix ruff
* feat(proxy/utils.py): track pre-call hooks in OTEL
some pre-call hooks can add latency under high traffic; make sure this is tracked
* fix(router.py): move redis call on deployment_callback_on_success to pipeline operation
reduces p99 latency by half when redis is enabled
* fix(parallel_request_limiter_v3.py): only run check if any item has rate limits set
Prevents unnecessary latency added by rate limit checks
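A minimal sketch of the guard described above, assuming a hypothetical `RateLimitItem` descriptor and `should_run_rate_limit_check` helper (neither name comes from parallel_request_limiter_v3.py, whose actual structure may differ). The idea is to skip the rate limit lookup entirely when no item in the request carries an rpm or tpm limit.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RateLimitItem:
    """Hypothetical stand-in for a key/user/team rate limit descriptor."""

    name: str
    rpm_limit: Optional[int] = None
    tpm_limit: Optional[int] = None


def should_run_rate_limit_check(items: List[RateLimitItem]) -> bool:
    """Return True only if at least one item has an rpm or tpm limit set."""
    return any(
        item.rpm_limit is not None or item.tpm_limit is not None
        for item in items
    )
```

A caller would return early when this is False, so requests without limits never pay for the cache round trip.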
* test: add unit tests
* Latency Improvements: only track tpm/rpm usage when set on deployment + LLM Caching - use an in-memory cache to reduce redis calls + OTEL - track time spent on LLM caching (#13472)
* fix(router.py): only track usage for deployments with tpm/rpm set
avoids additional latency for models without tpm/rpm set
* fix(caching_handler.py): log time spent on request get cache to OTEL
enables easy debugging of call latency
* fix(caching_handler.py): use dual cache object for in-memory caching + trace redis call within caching handler
* fix(caching_handler.py): working in-memory cache for redis calls
ensures the dual cache works when a redis cache is set up for llm calls
makes calls quicker by only checking redis when the in-memory cache misses for an llm api call
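The lookup order described above can be sketched as follows. This is an illustration only: `TwoTierCache` and `FakeRedis` are hypothetical names, and the real dual cache object in caching_handler.py may handle TTLs, async calls, and eviction differently.

```python
from typing import Any, Dict, Optional


class FakeRedis:
    """Hypothetical stand-in for a Redis client; counts lookups for illustration."""

    def __init__(self, data: Dict[str, Any]) -> None:
        self.data = dict(data)
        self.get_calls = 0

    def get(self, key: str) -> Optional[Any]:
        self.get_calls += 1
        return self.data.get(key)


class TwoTierCache:
    """Check the in-memory dict first; fall back to the remote cache on a miss."""

    def __init__(self, remote: FakeRedis) -> None:
        self.local: Dict[str, Any] = {}
        self.remote = remote

    def get(self, key: str) -> Optional[Any]:
        if key in self.local:
            return self.local[key]  # in-memory hit: no remote call
        value = self.remote.get(key)
        if value is not None:
            self.local[key] = value  # populate so the next lookup stays local
        return value
```

With this shape, a repeated lookup for the same key reaches the remote cache only once.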
* test: remove redundant test
* test: add unit tests
* fix(access group): allow access group on mcp tool retrieval
* fix(test): fix broken tests and add test case for access group
* fix(mypy): fix typing issues
* fix proxy config
* fix(responses api): fix streaming ID consistency and tool format handling (#12640)
* fix(responses): ensure streaming chunk IDs use consistent encoding format
Fixes streaming ID inconsistency where streaming responses used raw provider IDs
while non-streaming responses used properly encoded IDs with provider context.
Changes:
- Updated LiteLLMCompletionStreamingIterator to accept provider context
- Added _encode_chunk_id() method using same logic as non-streaming responses
- Modified chunk transformation to encode all streaming item_ids with resp_ prefix
- Updated handlers to pass custom_llm_provider and litellm_metadata to streaming iterator
Impact:
- Streaming chunk IDs now use the format resp_<base64_encoded_provider_context>
- Enables session continuity when using streaming response IDs as previous_response_id
- Allows provider detection and load balancing with streaming responses
- Maintains backward compatibility with existing streaming functionality
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
* fix(types): add explicit Optional[str] type annotation for model_id
This resolves MyPy type checking error where model_id could be None
but wasn't explicitly typed as Optional[str].
* fix(types): handle None case for litellm_metadata access
Prevents 'Item None has no attribute get' error by checking for None
before accessing litellm_metadata dictionary.
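A minimal sketch of the two typing fixes above, assuming a hypothetical helper `get_model_id` and hypothetical `model_info` / `id` keys (the actual key layout of litellm_metadata is not stated here). The return type is explicitly `Optional[str]`, and the dictionary is checked for None before `.get` is called.

```python
from typing import Any, Dict, Optional


def get_model_id(litellm_metadata: Optional[Dict[str, Any]]) -> Optional[str]:
    """Return the model ID if present, or None when metadata is missing."""
    if litellm_metadata is None:
        # Guard first: calling .get on None is what the type checker flagged.
        return None
    # "model_info" and "id" are assumed key names, used for illustration only.
    model_info = litellm_metadata.get("model_info") or {}
    return model_info.get("id")
```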
* test: add comprehensive tests for streaming ID consistency
Adds unit and E2E tests to verify streaming chunk IDs are properly encoded
with consistent format across streaming responses.
## Tests Added
### Unit Test (test_reasoning_content_transformation.py)
- `test_streaming_chunk_id_encoding()`: Validates the `_encode_chunk_id()` method
correctly encodes chunk IDs with `resp_` prefix and provider context
### E2E Tests (test_e2e_openai_responses_api.py)
- `test_streaming_id_consistency_across_chunks()`: Tests that all streaming chunk IDs
are properly encoded across multiple chunks in a real streaming response
- `test_streaming_response_id_as_previous_response_id()`: Tests the core use case -
using streaming response IDs for session continuity with `previous_response_id`
## Key Testing Approach
- Uses **Gemini** (non-OpenAI model) to test the transformation logic rather than
OpenAI passthrough, since the streaming ID consistency issue occurs when LiteLLM
transforms responses rather than just passing through to native OpenAI responses API
- Tests validate that streaming chunk IDs now use same encoding as non-streaming responses
- Verifies session continuity works with streaming responses
Addresses @ishaan-jaff's request for unit tests covering the streaming ID consistency fix.
Co-Authored-By: Claude <noreply@anthropic.com>
* fix(lint): remove unused imports in transformation.py
Removes unused imports to fix CI linting errors:
- GenericResponseOutputItem
- OutputFunctionToolCall
* test: remove E2E tests from openai_endpoints_tests
Remove streaming ID consistency E2E tests as requested by @ishaan-jaff.
Keep only the mock/unit test in test_reasoning_content_transformation.py
* revert: remove streaming chunk ID encoding to original behavior
This reverts the streaming chunk ID encoding changes to understand the original issue better.
Original behavior was:
- Streaming chunks: raw provider IDs
- Streaming final response: raw IDs (PROBLEM!)
- Non-streaming final response: encoded IDs (correct)
The real issue: streaming final response IDs were not encoded, breaking session continuity.
* fix(responses): encode streaming final response IDs to match OpenAI behavior
Fixes streaming ID inconsistency to match OpenAI's Responses API behavior:
- Streaming chunks: raw message IDs (like OpenAI's msg_xxx)
- Final response: encoded IDs (like OpenAI's resp_xxx)
This enables session continuity by ensuring streaming final response IDs
have the same encoded format as non-streaming responses, allowing them
to be used as previous_response_id in follow-up requests.
Changes:
- Add custom_llm_provider and litellm_metadata to LiteLLMCompletionStreamingIterator
- Update handlers to pass provider context to streaming iterator
- Apply _update_responses_api_response_id_with_model_id to final streaming response
- Keep streaming chunks as raw IDs to match OpenAI format
Impact:
- Session continuity works with streaming responses
- Load balancing can detect provider from streaming final response IDs
- Format matches OpenAI's Responses API exactly
Co-Authored-By: Claude <noreply@anthropic.com>
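To illustrate why an encoded final response ID supports session continuity, here is a rough sketch of a `resp_`-prefixed ID that carries provider context. The payload layout (`provider;model_id;raw_id`) and the function names are assumptions made for this example; they are not taken from `_update_responses_api_response_id_with_model_id`, whose actual encoding may differ.

```python
import base64
from typing import Optional, Tuple

PREFIX = "resp_"


def encode_response_id(
    raw_id: str,
    custom_llm_provider: Optional[str],
    model_id: Optional[str],
) -> str:
    """Pack provider context and the raw ID into one opaque, prefixed string."""
    # Assumed layout for illustration: provider;model_id;raw_id
    payload = f"{custom_llm_provider or ''};{model_id or ''};{raw_id}"
    return PREFIX + base64.urlsafe_b64encode(payload.encode("utf-8")).decode("ascii")


def decode_response_id(encoded_id: str) -> Tuple[str, str, str]:
    """Recover (provider, model_id, raw_id) from an encoded response ID."""
    if not encoded_id.startswith(PREFIX):
        raise ValueError("not an encoded response ID")
    payload = base64.urlsafe_b64decode(encoded_id[len(PREFIX):]).decode("utf-8")
    provider, model_id, raw_id = payload.split(";", 2)
    return provider, model_id, raw_id
```

A follow-up request that sends such an ID as previous_response_id gives the router enough information to pick the same provider and deployment, which a raw provider ID alone would not.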
* test: update unit test to match correct OpenAI-compatible behavior
Updates the unit test to verify streaming chunk IDs are raw (not encoded)
to match OpenAI's responses API format:
- Streaming chunks: raw message IDs (like msg_xxx)
- Final response: encoded IDs (like resp_xxx)
This reflects the correct behavior implemented in the fix.
---------
Co-authored-by: Claude <noreply@anthropic.com>
* cleanup
* TestBaseResponsesAPIStreamingIterator
---------
Co-authored-by: Javier de la Torre <jatorre@carto.com>
Co-authored-by: Claude <noreply@anthropic.com>
* (#13284) add avector_store_create to route_type which doesn't require model
* (#13284) exclude hidden params in metadata when creating a vector store
* (#13284) fix lint error
* (#13284) keep metadata None if metadata is None (not an empty dict)
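The two metadata changes above could look roughly like this. The helper name `strip_hidden_params` and the `hidden_params` key are hypothetical, chosen only to show the behavior: hidden keys are dropped, and a None input stays None instead of becoming an empty dict.

```python
from typing import Any, Dict, FrozenSet, Optional

# Assumed key name, for illustration only.
HIDDEN_KEYS: FrozenSet[str] = frozenset({"hidden_params"})


def strip_hidden_params(
    metadata: Optional[Dict[str, Any]],
    hidden_keys: FrozenSet[str] = HIDDEN_KEYS,
) -> Optional[Dict[str, Any]]:
    """Drop hidden keys from metadata, preserving None as None."""
    if metadata is None:
        return None  # do not coerce a missing value into {}
    return {k: v for k, v in metadata.items() if k not in hidden_keys}
```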
* (#13284) add test code
* (#13284) change test code name
* (#13284) add avector_store_search to route_type which doesn't require model
* fix unsupported operand type(s) for +=: 'NoneType' and 'str' on client-side auth creds for responses
* fix the client-side auth to use the correct metadata
* add more tests
* fix tests
* fix(route_checks.py): ensure disable llm api endpoints is correctly set
* fix(route_checks.py): raise HTTPException
raise expected exceptions
* fix(router.py): handle team only wildcard models
fixes issue where team only wildcard models were not considered during auth checks