* fix: preserve tool output ordering for gemini in responses bridge
- Keep function_call_output adjacent to its function_call when building chat messages
- Normalize function_call_output.output lists (input_* parts) into tool message content
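A minimal sketch of the two behaviours listed above, not the bridge's actual code. The item shapes (`type`, `call_id`, `output`, `input_text` parts) follow the Responses API naming used in this message; the helper names `normalize_tool_output` and `pair_outputs_with_calls` are hypothetical, and non-text `input_*` parts are simply skipped here.

```python
from typing import Any, Dict, List, Union


def normalize_tool_output(output: Union[str, List[Dict[str, Any]]]) -> str:
    """Flatten a function_call_output.output value into plain text."""
    if isinstance(output, str):
        return output
    texts = []
    for part in output:
        # Assumption: only input_text parts carry text; other input_* parts are skipped.
        if part.get("type") == "input_text":
            texts.append(part.get("text", ""))
    return "\n".join(texts)


def pair_outputs_with_calls(items: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Place each function_call_output directly after its function_call."""
    outputs = {
        item["call_id"]: item
        for item in items
        if item.get("type") == "function_call_output"
    }
    ordered: List[Dict[str, Any]] = []
    for item in items:
        if item.get("type") == "function_call_output":
            continue  # emitted next to its matching call below
        ordered.append(item)
        if item.get("type") == "function_call" and item.get("call_id") in outputs:
            ordered.append(outputs.pop(item["call_id"]))
    # Outputs without a matching call keep their relative order at the end.
    ordered.extend(outputs.values())
    return ordered
```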
* fix test
* small improvements
Emit Responses API streaming events for tool calls when the underlying chat stream contains tool_call deltas, and recover tool calls into the stream when they only appear in the final response.
* fix: prevent HTTP client memory leaks in Presidio and OpenAI wrappers
Fixes multiple memory leak issues reported in #14540 and related tickets:
**Presidio Guardrail Fix (#14540)**
- Problem: Every guardrail check created a new aiohttp.ClientSession
- Impact: High-traffic proxies accumulated thousands of unclosed sessions
- Solution: Share a single session across all guardrail checks
  - Added `self._http_session` instance variable
  - Lazy session creation via `_get_http_session()`
  - Proper cleanup via `_close_http_session()` and `__del__()`
- Files: litellm/proxy/guardrails/guardrail_hooks/presidio.py
**OpenAI HTTP Client Caching (#14540)**
- Problem: `_get_async_http_client()` created new httpx.AsyncClient on each call
- Impact: OpenAI/Azure completions bypassed client caching system
- Solution: Route through `get_async_httpx_client()` for TTL-based caching
  - Caches clients by provider and SSL config
  - Falls back to direct creation if caching fails
  - Applied to both async and sync client methods
- Files: litellm/llms/openai/common_utils.py
**Test Script**
- Added validation script to demonstrate fixes
- Counts file descriptors and unclosed session objects
- Files: test_oom_fixes.py
Related issues: #14384, #13251, #12443
* fix(oom): prevent memory leaks in Presidio guardrails and OpenAI client creation
Fixes two high-impact memory leaks:
1. Presidio Guardrail Session Leak (issue #14540)
   - Problem: Created a new aiohttp.ClientSession on every guardrail check
   - Impact: Runs on EVERY proxy request when PII masking is enabled
   - Fix: Shared session pattern with lifecycle management
   - Files: litellm/proxy/guardrails/guardrail_hooks/presidio.py
2. OpenAI HTTP Client Cache Bypass (issue #14540)
   - Problem: _get_async_http_client() created a new httpx.AsyncClient, bypassing the TTL cache
   - Impact: Every completion created a new client with its own connection pool
   - Fix: Route through get_async_httpx_client() for proper caching
   - Critical: Include SSL config in the cache key for correctness
   - Files: litellm/llms/openai/common_utils.py
Validation:
- Presidio: 100 requests → 0 new sessions (was 100)
- OpenAI: 100 calls → 1 unique client (was 100)
- test_oom_fixes.py: Automated validation script
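The caching idea can be sketched roughly as follows. This is an illustration only and not the code behind `get_async_httpx_client()`: the `ClientCache` class, its `get_or_create` method, and the one-hour TTL are assumptions, and a plain factory callable stands in for building an httpx client. The point it shows is that the SSL setting is part of the cache key, so clients with different SSL config are never shared.

```python
import time
from typing import Any, Callable, Dict, Tuple


class ClientCache:
    """Tiny TTL cache keyed by (provider, ssl_verify). Hypothetical helper."""

    def __init__(self, ttl_seconds: float = 3600.0) -> None:
        self._ttl = ttl_seconds
        self._entries: Dict[Tuple[str, bool], Tuple[float, Any]] = {}

    def get_or_create(
        self, provider: str, ssl_verify: bool, factory: Callable[[], Any]
    ) -> Any:
        key = (provider, ssl_verify)  # SSL config is part of the key
        now = time.monotonic()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: reuse the cached client
        client = factory()
        self._entries[key] = (now, client)
        return client
```

A real implementation would also need to close a client when its entry expires; this sketch leaves that out.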
* fix(oom): resolve Gemini aiohttp session leak (issue #12443)
Fixes persistent "Unclosed client session" warnings when using Gemini models.
Root Causes:
1. Broken atexit cleanup - get_event_loop() fails at exit time
2. On-demand session creation without reliable cleanup
Changes:
1. Fixed atexit Cleanup (async_client_cleanup.py)
   - OLD: Used get_event_loop() which fails when loop is closed
   - NEW: Always create fresh event loop at exit time
   - Ensures cleanup runs successfully even when main loop is closed
2. Added __del__ Cleanup (aiohttp_handler.py)
   - Defense-in-depth: cleanup on garbage collection
   - Handles abnormal termination cases
   - Similar pattern to Presidio guardrail fix
3. Enhanced Cleanup Scope (async_client_cleanup.py)
   - Now closes global base_llm_aiohttp_handler instance
   - Previously only checked cache, missed module-level handler
Validation:
- Test 1: __del__ cleanup → 0 sessions leaked ✓
- Test 2: atexit cleanup → 0 sessions leaked ✓
- test_gemini_session_leak.py: Automated validation
Related: #14540 (broader OOM issue tracking)
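The "fresh event loop at exit" approach might look roughly like this. It is a sketch under assumptions, not the contents of async_client_cleanup.py: `run_cleanup_on_fresh_loop` is a hypothetical name, and the cleanup callable stands in for whatever coroutine closes the cached sessions.

```python
import asyncio
from typing import Awaitable, Callable


def run_cleanup_on_fresh_loop(cleanup: Callable[[], Awaitable[None]]) -> None:
    """Run an async cleanup callable on a brand-new event loop.

    This avoids asyncio.get_event_loop(), which can fail or return a
    closed loop while the interpreter is shutting down.
    """
    loop = asyncio.new_event_loop()
    try:
        loop.run_until_complete(cleanup())
    finally:
        loop.close()
```

It could be registered with something like `atexit.register(run_cleanup_on_fresh_loop, close_all_sessions)`, where `close_all_sessions` is a placeholder for the real cleanup coroutine function.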
* fix(types): use LlmProviders enum for get_async_httpx_client
MyPy was failing because llm_provider parameter expects Union[LlmProviders, httpxSpecialProvider], not a string.
Changed from string "openai" to LlmProviders.OPENAI enum value.
* test: move validation tests to proper CI directories
- Move test_oom_fixes.py to tests/test_litellm/llms/
- Move test_gemini_session_leak.py to tests/test_litellm/llms/custom_httpx/
- Fix pytest warning: use pytest.skip() instead of return True
This ensures CI actually runs our OOM fix validation tests.
* fix(oom): add asyncio.Lock to prevent race conditions in Presidio session creation
- Make _get_http_session() async with asyncio.Lock protection
- Prevents multiple concurrent requests from creating orphaned sessions
- Add concurrent load test (50 parallel requests) to validate fix
- Test confirms only 1 session created under concurrent load
Critical fix: Previous implementation had race condition where
concurrent guardrail checks could create multiple sessions,
defeating the shared session pattern and causing memory leaks.
* fix(presidio): eliminate race condition in session lock initialization
Move asyncio.Lock creation from lazy initialization in _get_http_session()
to __init__. The previous lazy init had a race condition where concurrent
coroutines could both see _session_lock as None, both create locks, and
end up with different lock instances - defeating the synchronization.
On Python 3.10 and later, asyncio.Lock() can be safely created without a
running event loop; it needs one only when awaited.
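A minimal sketch of the shared-session pattern with the lock created in `__init__`. The attribute and method names (`_http_session`, `_session_lock`, `_get_http_session`, `_close_http_session`) come from the commits above; `FakeSession`, `Guardrail`, and `demo` are hypothetical stand-ins so the example runs without aiohttp, and the real guardrail may differ in detail.

```python
import asyncio
from typing import Optional


class FakeSession:
    """Stand-in for aiohttp.ClientSession so the sketch has no dependencies."""

    created = 0  # counts how many sessions were ever constructed

    def __init__(self) -> None:
        FakeSession.created += 1
        self.closed = False

    async def close(self) -> None:
        self.closed = True


class Guardrail:
    def __init__(self) -> None:
        self._http_session: Optional[FakeSession] = None
        # Created eagerly so every coroutine shares the same lock instance.
        self._session_lock = asyncio.Lock()

    async def _get_http_session(self) -> FakeSession:
        async with self._session_lock:
            if self._http_session is None or self._http_session.closed:
                await asyncio.sleep(0)  # simulate a suspension point during setup
                self._http_session = FakeSession()
            return self._http_session

    async def _close_http_session(self) -> None:
        async with self._session_lock:
            if self._http_session is not None and not self._http_session.closed:
                await self._http_session.close()
            self._http_session = None


async def demo() -> int:
    """Request the session from 50 concurrent tasks; return how many distinct ones were seen."""
    guardrail = Guardrail()
    sessions = await asyncio.gather(
        *(guardrail._get_http_session() for _ in range(50))
    )
    await guardrail._close_http_session()
    return len({id(s) for s in sessions})
```

Without the lock, the suspension point inside the `if` would let several tasks pass the `None` check before any of them assigns the session.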
* Add Volcengine responses adapter
* fix llms/volcengine/responses/transformation.py:507:9: F841 Local variable `origin` is assigned to but never used
fix llms/volcengine/responses/transformation.py:95: error: Argument "headers" to "VolcEngineError" has incompatible type
add more supported optional params
removed redundant manual logging/utils fallbacks so litellm/__init__.py uses the registry only.
* fix: Avoid attaching tool calls when a call_id already exists
* fix: Prevent MCP responses from reviving past tool calls via previous_response_id
* test: Parametrize MCP streaming test to cover OpenAI and Anthropic models
* test: Fail MCP streaming test when LiteLLM logs errors during follow-up calls
* test: Let MCP tool-execution mock accept new kwargs for streaming tests
* chore: fix lint error
* docs: Add Google Workload Identity Federation (WIF) documentation to Vertex AI (#19320)
- Added new section documenting WIF support for Vertex AI authentication
- Included SDK and Proxy configuration examples
- Added sample WIF credentials file format for AWS federation
- Mentioned LLM Credentials UI as an alternative for credential management
- Added link to Google Cloud WIF documentation
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
* fix(bedrock): deduplicate tool calls in assistant history (#15178)
* fix(types): add missing Set import to factory.py
---------
Co-authored-by: Yuta Saito <uc4w6c@bma.biglobe.ne.jp>
Co-authored-by: Krish Dholakia <krrishdholakia@gmail.com>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: YutaSaito <36355491+uc4w6c@users.noreply.github.com>
* feat(gemini): add opt-in support for responseJsonSchema
Add support for Gemini's native responseJsonSchema parameter which uses
standard JSON Schema format instead of OpenAPI-style responseSchema.
Benefits of responseJsonSchema (Gemini 2.0+ only):
- Standard JSON Schema format (lowercase types)
- Supports additionalProperties for stricter validation
- Better compatibility with Pydantic's model_json_schema()
- No propertyOrdering required
Usage:
```python
response_format = {
    "type": "json_schema",
    "json_schema": {"schema": {...}},
    "use_json_schema": True,  # opt-in
}
```
This is backwards compatible - existing code continues to use
responseSchema by default.
Closes #16340
* docs: add documentation for use_json_schema parameter
Document the new use_json_schema option for Gemini 2.0+ models
in the JSON Mode documentation.
* refactor(gemini): use responseJsonSchema by default for Gemini 2.0+
Remove opt-in flag `use_json_schema` and automatically detect model version:
- Gemini 2.0+: uses responseJsonSchema (standard JSON Schema, supports additionalProperties)
- Gemini 1.5: uses responseSchema (OpenAPI format, legacy)
This follows LiteLLM's philosophy of abstracting provider differences -
users write the same code regardless of model version.
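One way the version-based selection could work is sketched below. This is an assumption about the approach, not LiteLLM's actual detection logic: the helper name `schema_field_for_model` and the regex are hypothetical, and unrecognized model names fall back to the legacy field.

```python
import re

# Matches the major version in names such as "gemini-2.0-flash".
_GEMINI_MAJOR_RE = re.compile(r"gemini-(\d+)")


def schema_field_for_model(model: str) -> str:
    """Pick the Gemini request field that should carry the JSON schema."""
    match = _GEMINI_MAJOR_RE.search(model)
    if match is None:
        # Assumption: unknown or unversioned names keep the legacy field.
        return "responseSchema"
    major = int(match.group(1))
    return "responseJsonSchema" if major >= 2 else "responseSchema"
```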
* test(vertex): update json_schema tests to accept both responseSchema formats
Gemini 2.x+ uses responseJsonSchema while Gemini 1.x uses responseSchema.
Update tests to accept both formats since litellm now auto-selects based
on model version.
* feat(azure): add support for Azure OpenAI v1 API
When api_version is 'v1', 'latest', or 'preview', use the standard
OpenAI client instead of AzureOpenAI client with base_url pointing
to /openai/v1/ endpoint.
This follows Microsoft's documentation for the new v1 API format:
https://learn.microsoft.com/en-us/azure/ai-services/openai/reference#api-specs
Changes:
- Add OpenAI/AsyncOpenAI imports to common_utils.py and azure.py
- Modify get_azure_openai_client() to detect v1 API versions and
create appropriate client type
- Update isinstance checks and type hints to accept both client types
- Add unit tests for v1 API client creation
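The detection step can be illustrated with a short sketch. The three version strings and the `/openai/v1/` path come from the description above; the helper names `is_v1_api` and `v1_base_url` are hypothetical and are not the functions in `get_azure_openai_client()`.

```python
from typing import Optional

V1_API_VERSIONS = {"v1", "latest", "preview"}


def is_v1_api(api_version: Optional[str]) -> bool:
    """Return True when the api_version selects the Azure OpenAI v1 API."""
    return api_version in V1_API_VERSIONS


def v1_base_url(api_base: str) -> str:
    """Build the base_url for the standard OpenAI client from an Azure api_base."""
    base = api_base.rstrip("/")
    if base.endswith("/openai/v1"):
        return base + "/"  # already points at the v1 endpoint
    return base + "/openai/v1/"
```

When `is_v1_api(...)` is true, the caller would construct the standard OpenAI client with this `base_url` instead of an AzureOpenAI client.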
* fix(azure): fix MyPy type errors for v1 API support
- Add type: ignore for AsyncOpenAI constructor
- Update type hints in files/handler.py and batches/handler.py
- Add OpenAI/AsyncOpenAI to Union types for client parameters
- Update isinstance checks to include OpenAI/AsyncOpenAI
* fix(azure): update type hints in files and batches handlers for v1 API
Update async method signatures to accept Union[AsyncAzureOpenAI, AsyncOpenAI]
to fix mypy errors when using v1 API client.
When using http:// api_base (converted to ws://), the websockets library
throws "ssl argument is incompatible with a ws:// URI". Only pass SSL
context for secure wss:// connections.
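The rule described above can be sketched as a small helper. The name `websocket_url_and_ssl` is hypothetical and the default SSL context is an assumption; the actual fix may build its context differently.

```python
import ssl
from typing import Optional, Tuple


def websocket_url_and_ssl(api_base: str) -> Tuple[str, Optional[ssl.SSLContext]]:
    """Convert an HTTP(S) api_base to a WebSocket URL and choose the ssl argument.

    An SSL context is returned only for wss:// URLs, because passing one
    alongside a ws:// URI is rejected by the websockets library.
    """
    if api_base.startswith("https://"):
        return "wss://" + api_base[len("https://"):], ssl.create_default_context()
    if api_base.startswith("http://"):
        return "ws://" + api_base[len("http://"):], None
    is_secure = api_base.startswith("wss://")
    return api_base, (ssl.create_default_context() if is_secure else None)
```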
Co-authored-by: Krish Dholakia <krrishdholakia@gmail.com>
Replace copy.deepcopy with model_dump + model_validate in streaming
iterator logging to handle Pydantic ValidatorIterator objects that
cannot be pickled when tool_choice uses allowed_tools mode.
Co-authored-by: Krish Dholakia <krrishdholakia@gmail.com>
* fix(agentcore): simplify agentcore streaming
* fix(agentcore): move CustomStreamWrapper import to module level
The deferred imports inside streaming methods caused initialization delays
during health check requests, leading to timeouts in ECS deployments.
- Move CustomStreamWrapper import to module-level (line 19)
- Remove deferred imports from get_sync_custom_stream_wrapper (line 588)
- Remove deferred import from get_async_custom_stream_wrapper (line 747)
- Remove from TYPE_CHECKING block to use actual import
This ensures the import happens at module load time rather than during
first request processing, preventing health check endpoint blocking.
Co-Authored-By: Claude <noreply@anthropic.com>
* test(agentcore): ensure sync response
* chore: upgrade boto3 to 1.40.76 in pyproject.toml
* chore: added taplo.toml
* fix(types): correct annotation type hint for MyPy compatibility
Update _convert_annotations_to_chat_format return type from
Dict[str, Any] to ChatCompletionAnnotation TypedDict to match
the Message class's expected type signature.
Co-Authored-By: Claude <noreply@anthropic.com>
---------
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Benedikt Óskarsson <bensi94@hotmail.com>