UI - allow team member to view service account keys they create + Anthropic - include cache creation tokens in prompt token total (separate out during cost tracking)
* Use _PROXY_MaxParallelRequestsHandler_v3 by default (#14352)
(cherry picked from commit f3fa45cf8fbd5f5cce2f45a7312776d5005fb08e)
(cherry picked from commit 5b680bb4a3)
* Use random api_key for parallel requests test
* Fix off-by-one error in parallel request rate limit
The rate limiter was incorrectly rejecting requests when the limit was met, but not exceeded. The check in `is_cache_list_over_limit` was `int(counter_value) + 1 > current_limit`, which caused the first request to be rejected if the limit was 1.
This commit removes the `+ 1`, changing the logic to `int(counter_value) > current_limit`. The check now correctly allows requests up to the specified parallel limit.
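
The effect of the change can be sketched with two standalone predicates. This is a minimal illustration, not the code in `is_cache_list_over_limit`: the function names are hypothetical, and it assumes the counter has already been incremented to include the current request by the time the check runs.

```python
def over_limit_before_fix(counter_value, current_limit):
    # Hypothetical stand-in for the old check: the extra "+ 1" counts the
    # current request twice if the counter already includes it (assumption).
    return int(counter_value) + 1 > current_limit


def over_limit_after_fix(counter_value, current_limit):
    # Hypothetical stand-in for the new check: reject only when the counter
    # strictly exceeds the limit.
    return int(counter_value) > current_limit


# With a limit of 1 and a counter that reads "1" for the first request,
# the old check rejects it and the new check allows it.
first_request_rejected_before = over_limit_before_fix("1", 1)
first_request_rejected_after = over_limit_after_fix("1", 1)
second_request_rejected_after = over_limit_after_fix("2", 1)
```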
* Test actual parallel requests
* Ensure rate limiting works correctly for multiple users
* Add sequential rate-limit test
* Revert random key usage
* Draft commit.
* Add user header mapping feature, backward compatible with the `user_header_name` field.
* Optimize user header mapping feature, keeping backward compatibility with the `user_header_name` field.
* Added unit tests.
* add support for Vertex AI Qwen API
* streaming Qwen API support
* test_partner_models_httpx
* test_partner_models_httpx_streaming
* add cost tracking for vertex_ai/qwen/qwen3-235b-a22b-instruct-2507-maa
* docs: Qwen models on Vertex AI
* fix intent params
* Add responses
* fix unrelated test
* test fix - fireworks API endpoint is down
* test fix fireworks ai is having an active outage
* test_completion_cost_databricks
* dbrx: fix test, API currently not responding
* Update OpenAI Realtime handler to use the correct endpoint and include all query parameters. Adjusted error messages for missing API base and key. Updated health check URL construction to pass model as a query parameter.
* Enhance OpenAI Realtime handler tests to ensure model parameter inclusion in WebSocket URL. Added new tests to verify correct URL construction with model and additional parameters, preventing 'missing_model' errors. Updated existing tests for consistency.
* Remove debug print statements for API base and key in OpenAIRealtime handler to clean up the code.
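
The URL construction described in the Realtime handler changes above could look roughly like the following. This is a sketch under assumptions, not the handler's actual code: `build_realtime_url` is a hypothetical helper, the `/v1/realtime` path is assumed, and `example_param` is a placeholder for any additional query parameter.

```python
from typing import Dict, Optional
from urllib.parse import urlencode, urlsplit, urlunsplit


def build_realtime_url(
    api_base: str, model: str, extra_params: Optional[Dict[str, str]] = None
) -> str:
    """Hypothetical helper: build a WebSocket URL with the model as a query parameter."""
    parts = urlsplit(api_base)
    # Map HTTP schemes to their WebSocket equivalents; leave ws/wss untouched.
    scheme = {"https": "wss", "http": "ws"}.get(parts.scheme, parts.scheme)
    # Assumed endpoint path; the real handler may derive it differently.
    path = parts.path.rstrip("/") + "/v1/realtime"
    # The model always goes first, followed by any additional parameters.
    query = {"model": model, **(extra_params or {})}
    return urlunsplit((scheme, parts.netloc, path, urlencode(query), ""))


url = build_realtime_url(
    "https://api.openai.com",
    "gpt-4o-realtime-preview",
    {"example_param": "value"},
)
```

Putting `model` in the query string is what the tests above check for, since a URL without it is the case reported as 'missing_model'.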
---------
Co-authored-by: Ishaan Jaff <ishaanjaffer0324@gmail.com>
* feat(proxy/utils.py): track pre-call hooks in OTEL
some pre call hooks can cause latency in high traffic - make sure this is tracked
* fix(router.py): move redis call on deployment_callback_on_success to pipeline operation
reduces p99 latency by half when redis is enabled
* fix(parallel_request_limiter_v3.py): only run check if any item has rate limits set
Prevents unnecessary latency added by rate limit checks
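
A minimal sketch of the short-circuit idea, assuming each item is a dict with optional limit fields. The field names (`rpm_limit`, `tpm_limit`, `max_parallel_requests`) and the function name are illustrative assumptions, not the limiter's actual schema.

```python
from typing import Any, Dict, List

# Assumed limit field names, for illustration only.
LIMIT_FIELDS = ("rpm_limit", "tpm_limit", "max_parallel_requests")


def any_rate_limit_set(items: List[Dict[str, Any]]) -> bool:
    """Return True if at least one item carries a non-None limit."""
    return any(
        item.get(field) is not None for item in items for field in LIMIT_FIELDS
    )


items_without_limits = [{"key": "user-a"}, {"key": "team-b", "rpm_limit": None}]
items_with_limit = [{"key": "user-a"}, {"key": "team-b", "rpm_limit": 10}]

# The expensive rate limit check would only run when this returns True.
skip_check = not any_rate_limit_set(items_without_limits)
run_check = any_rate_limit_set(items_with_limit)
```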
* test: add unit tests
* Latency Improvements: only track tpm/rpm usage when set on deployment + LLM Caching - use an in-memory cache to reduce redis calls + OTEL - track time spent on LLM caching (#13472)
* fix(router.py): only track usage for deployments with tpm/rpm set
avoids additional latency for models without tpm/rpm set
* fix(caching_handler.py): log time spent on request get cache to OTEL
enables easy debugging of call latency
* fix(caching_handler.py): use dual cache object for in-memory caching + trace redis call within caching handler
* fix(caching_handler.py): working in-memory cache for redis calls
ensures dual cache works when a redis cache is set up for llm calls
makes calls quicker by only checking redis when the in-memory cache misses for an llm api call
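
The in-memory-first lookup can be sketched as a small two-tier cache. This is an illustration of the general pattern, not the dual cache object used in `caching_handler.py`: both classes are hypothetical, the remote tier is a counting stand-in for redis, and TTLs and eviction are omitted.

```python
class CountingRemoteCache:
    """Stand-in for a redis client that counts how many lookups reach it."""

    def __init__(self):
        self.store = {}
        self.get_calls = 0

    def get(self, key):
        self.get_calls += 1
        return self.store.get(key)

    def set(self, key, value):
        self.store[key] = value


class TwoTierCache:
    """Check an in-memory dict first; fall back to the remote cache on a miss."""

    def __init__(self, remote):
        self.memory = {}
        self.remote = remote

    def get(self, key):
        if key in self.memory:
            return self.memory[key]  # hit: no remote call
        value = self.remote.get(key)
        if value is not None:
            self.memory[key] = value  # populate memory for later lookups
        return value

    def set(self, key, value):
        self.memory[key] = value
        self.remote.set(key, value)


remote = CountingRemoteCache()
remote.set("prompt-hash", "cached response")
cache = TwoTierCache(remote)
first = cache.get("prompt-hash")   # in-memory miss, goes to the remote tier
second = cache.get("prompt-hash")  # served from memory
```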
* test: remove redundant test
* test: add unit tests