* add support for vertex AI QWEN API
* streaming QWEN API support
* test_partner_models_httpx
* test_partner_models_httpx_streaming
* add cost tracking for vertex_ai/qwen/qwen3-235b-a22b-instruct-2507-maa
* docs qwen models vertexAI
* fix intent params
* Add responses
* fix unrelated test
* test fix - fireworks API endpoint is down
* test fix fireworks ai is having an active outage
* test_completion_cost_databricks
* dbrx fix test API currently not responding
* Update OpenAI Realtime handler to use the correct endpoint and include all query parameters. Adjusted error messages for missing API base and key. Updated health check URL construction to pass model as a query parameter.
* Enhance OpenAI Realtime handler tests to ensure model parameter inclusion in WebSocket URL. Added new tests to verify correct URL construction with model and additional parameters, preventing 'missing_model' errors. Updated existing tests for consistency.
* Remove debug print statements for API base and key in OpenAIRealtime handler to clean up the code.
---------
Co-authored-by: Ishaan Jaff <ishaanjaffer0324@gmail.com>
* feat(proxy/utils.py): track pre-call hooks in OTEL
some pre call hooks can cause latency in high traffic - make sure this is tracked
* fix(router.py): move redis call on deployment_callback_on_success to pipeline operation
reduces p99 latency by half when redis is enabled
* fix(parallel_request_limiter_v3.py): only run check if any item has rate limits set
Prevents unnecessary latency added by rate limit checks
* test: add unit tests
* Latency Improvements: only track tpm/rpm usage when set on deployment+ LLM Caching - use an in-memory cache to reduce redis calls + OTEL - track time spent on LLM caching (#13472)
* fix(router.py): only track usage for deployments with tpm/rpm set
ensures additional latency avoided for non-tpm/rpm models
* fix(caching_handler.py): log time spent on request get cache to OTEL
enables easy debugging of call latency
* fix(caching_handler.py): use dual cache object for in-memory caching + trace redis call within caching handler
* fix(caching_handler.py): working in-memory cache for redis calls
ensures dual cache works when redis cache setup for llm calls
makes calls quicker by only checking redis when in-memory cache missed for llm api call
* test: remove redundant test
* test: add unit tests
* fix unsupported operand type(s) for +=: 'NoneType' and 'str' on clientside auth creds for responses
* fix the client side auth to use correct metadata
* add more tests
* fix tests
* fix(router.py): add acompletion_streaming_iterator inside router
allows router to catch errors mid-stream for fallbacks
Work for https://github.com/BerriAI/litellm/issues/6532
* fix(router.py): working mid-stream fallbacks
* fix(router.py): more iterations
* fix(router.py): working mid-stream fallbacks with fallbacks set on router
* fix(router.py): pass prior content back in new request as assistant prefix message
* fix(router.py): add a system prompt to help guide non-prefix supporting models to use the continued text correctly
* fix(common_utils.py): support converting `prefix: true` for non-prefix supporting models
* fix: reduce LOC in function
* test(test_router.py): add unit tests for new function
* test: add basic unit test
* fix(router.py): ensure return type of fallback stream is compatible with CustomStreamWrapper
prevent client code from breaking
* fix: cleanup
* test: update test
* fix: fix linting error
If the user specified in the configuration e.g. "user_header_name:
X-OpenWebUI-User-Email", here we were looking for a dict key
"X-OpenWebUI-User-Email" when the dict actually contained
"x-openwebui-user-email".
Switch to iteration and case insensitive string comparison instead to
fix this.
This fixes customer budget enforcement when the customer ID is passed
in as a header rather than as a "user" value in the body.