* fix: use fastuuid helper across the codebase
First batch of changes, simple drop in replacement.
* second batch of changes
* fixed: script mistake on helper file
* refactor: comment out circuit breaker
causes incorrect rate limiting in high traffic
* fix(base_routing_strategy.py): don't reset value if redis val is lower than current in-memory value
Fixes issue where redis might be trailing in-memory value
* fix(parallel_request_limiter_v2.py): if in-memory higher than redis, don't reset value; add previous slot keys to redis increment to correctly 'get' them
* fix(parallel_request_limiter_v3.py): v3 implementation of parallel request limiter
does not use background redis syncing - increments redis in call
simplify rate limiting logic, to improve accuracy
* fix: fix ruff errors
* fix(parallel_request_limiter_v3.py): don't decrement limit on post call success - causes double decrements
* fix(parallel_request_limiter_v3.py): working accurate multi-instance logic
verified that exactly 100 requests are allowed with 100 users, ramp-up of 10, a 100 rpm key limit, and 2 instances
* fix(parallel_request_limiter_v3.py): working accurate rate limiting with time window resets
allows rate limiting to work across multiple windows
* test: add unit tests for v3 rate limiter
* fix(parallel_request_limiter_v3.py): return window value into in-memory cache
allows in-memory cache checks to be used correctly
* refactor(parallel_request_limiter_v3.py): refactor rate limiting to work for multiple window/counter key pairs
enables using for user/team/model rate limiting
* feat(parallel_request_limiter_v3.py): working rate limiting, across key/user/team/end-user
* fix(parallel_request_limiter_v3.py): add model specific rate limiting
* fix(parallel_request_limiter_v3.py): ignore if no rate limits set
skip unnecessary rate limit checks if no limits are set
* fix(parallel_request_limiter_v3.py): initial commit bringing token rate limits back
* fix(parallel_request_limiter_v3.py): increment by value in list + update assertions to handle tokens + max parallel requests
* test(parallel_request_limiter_v3.py): more testing
* fix(parallel_request_limiter.py): working in-memory cache limiter
* fix(redis_cache.py): ignore linting error - use safe hasattr
* fix(parallel_request_limiter_v3.py): fix linting error
* refactor: remove redundant parallel_request_limiter_v2.py
old / inaccurate implementation
* test: update tests
* style: cleanup
* test: update test
* docs(config_settings.md): document new env var
* test(test_base_routing_strategy.py): update test
* fix(helicone.py): add helicone api base support
Fixes https://github.com/BerriAI/litellm/issues/10825
* test: add unit test for cache hit response on embedding calls
* fix(caching_handler.py): fix handling cache hit on embedding when input is string
Fixes LIT-197
* docs(helicone_integration.md): document new helicone api base param
* fix(caching_handler.py): fix embedding str caching result
Fixes issue where str caching results were not being correctly assembled on str input
* feat(azure/image_generation): Support dropping response_format for azure gpt-image-1
Fixes LIT-118
* test(test_utils.py): add unit testing
* test: rename file to avoid testing conflict
* build(model_prices_and_context_window.json): add fireworks ai new 0-4b pricing tier
* build(model_prices_and_context_window.json): add more fireworks ai models
* test: update testing
* fix(caching_handler.py): handle str + list cache
Fixes issue on cache hits for embedding when initial cached input was str
* test(test_caching.py): add e2e test on caching with individual item and then list
* fix(caching_handler.py): set usage tokens for cache hits
enables token counting to work
* fix(caching_handler.py): combine usage between cached result and embedding response
Handles case of new input to embedding response
* fix: cleanup
* test: move to gpt-4o-new-test
* test: update test
* use 1 file for duration_in_seconds
* add to readme.md
* re use duration_in_seconds
* fix importing _extract_from_regex, get_last_day_of_month
* fix import
* update provider budget routing
* fix - remove dup test
* add support for using in multi instance environments
* test_in_memory_redis_sync_e2e
* fix test_in_memory_redis_sync_e2e
* fix code quality check
* fix test provider budgets
* working provider budget tests
* add fixture for provider budget routing
* fix router testing for provider budgets
* add comments on provider budget routing
* use RedisPipelineIncrementOperation
* add redis async_increment_pipeline
* use redis async_increment_pipeline
* use lower value for testing
* use redis async_increment_pipeline
* use consistent key name for increment op
* add handling for budget windows
* fix typing async_increment_pipeline
* fix set attr
* add clear doc strings
* unit testing for provider budgets
* test_redis_increment_pipeline
* fix(caching): convert arg to equivalent kwargs in llm caching handler
prevent unexpected errors
* fix(caching_handler.py): don't pass args to caching
* fix(caching): remove all *args from caching.py
* fix(caching): consistent function signatures + abc method
* test(caching_unit_tests.py): add unit tests for llm caching
ensures coverage for common caching scenarios across different implementations
* refactor(litellm_logging.py): move to using cache key from hidden params instead of regenerating one
* fix(router.py): drop redis password requirement
* fix(proxy_server.py): fix faulty slack alerting check
* fix(langfuse.py): avoid copying functions/thread lock objects in metadata
fixes metadata copy error when parent otel span in metadata
* test: update test
* fix(dual_cache.py): update in-memory check for redis batch get cache
Fixes latency delay for async_batch_redis_cache
* fix(service_logger.py): fix race condition causing otel service logging to be overwritten if service_callbacks set
* feat(user_api_key_auth.py): add parent otel component for auth
allows us to isolate how much latency is added by auth checks
* perf(parallel_request_limiter.py): move async_set_cache_pipeline (from max parallel request limiter) out of execution path (background task)
reduces latency by 200ms
* feat(user_api_key_auth.py): have user api key auth object return user tpm/rpm limits - reduces redis calls in downstream task (parallel_request_limiter)
Reduces latency by 400-800ms
* fix(parallel_request_limiter.py): use batch get cache to reduce user/key/team usage object calls
reduces latency by 50-100ms
* fix: fix linting error
* fix(_service_logger.py): fix import
* fix(user_api_key_auth.py): fix service logging
* fix(dual_cache.py): don't pass 'self'
* fix: fix python3.8 error
* fix: fix init
* fix(core_helpers.py): return None, instead of raising kwargs is None error
Closes https://github.com/BerriAI/litellm/issues/6500
* docs(cost_tracking.md): cleanup doc
* fix(vertex_and_google_ai_studio.py): handle function call with no params passed in
Closes https://github.com/BerriAI/litellm/issues/6495
* test(test_router_timeout.py): add test for router timeout + retry logic
* test: update test to use module level values
* (fix) Prometheus - Log Postgres DB latency, status on prometheus (#6484)
* fix logging DB fails on prometheus
* unit testing log to otel wrapper
* unit testing for service logger + prometheus
* use LATENCY buckets for service logging
* fix service logging
* docs clarify vertex vs gemini
* (router_strategy/) ensure all async functions use async cache methods (#6489)
* fix router strat
* use async set / get cache in router_strategy
* add coverage for router strategy
* fix imports
* fix batch_get_cache
* use async methods for least busy
* fix least busy use async methods
* fix test_dual_cache_increment
* test async_get_available_deployment when routing_strategy="least-busy"
* (fix) proxy - fix when `STORE_MODEL_IN_DB` should be set (#6492)
* set store_model_in_db at the top
* correctly use store_model_in_db global
* (fix) `PrometheusServicesLogger` `_get_metric` should return metric in Registry (#6486)
* fix logging DB fails on prometheus
* unit testing log to otel wrapper
* unit testing for service logger + prometheus
* use LATENCY buckets for service logging
* fix service logging
* fix _get_metric in prom services logger
* add clear doc string
* unit testing for prom service logger
* bump: version 1.51.0 → 1.51.1
* Add `azure/gpt-4o-mini-2024-07-18` to model_prices_and_context_window.json (#6477)
* Update utils.py (#6468)
Fixed missing keys
* (perf) Litellm redis router fix - ~100ms improvement (#6483)
* docs(exception_mapping.md): add missing exception types
Fixes https://github.com/Aider-AI/aider/issues/2120#issuecomment-2438971183
* fix(main.py): register custom model pricing with specific key
Ensure custom model pricing is registered to the specific model+provider key combination
* test: make testing more robust for custom pricing
* fix(redis_cache.py): instrument otel logging for sync redis calls
ensures complete coverage for all redis cache calls
* refactor: pass parent_otel_span for redis caching calls in router
allows for more observability into what calls are causing latency issues
* test: update tests with new params
* refactor: ensure e2e otel tracing for router
* refactor(router.py): add more otel tracing across router
catch all latency issues for router requests
* fix: fix linting error
* fix(router.py): fix linting error
* fix: fix test
* test: fix tests
* fix(dual_cache.py): pass ttl to redis cache
* fix: fix param
* perf(cooldown_cache.py): improve cooldown cache, to store cache results in memory for 5s, prevents redis call from being made on each request
reduces 100ms latency per call with caching enabled on router
* fix: fix test
* fix(cooldown_cache.py): handle if a result is None
* fix(cooldown_cache.py): add debug statements
* refactor(dual_cache.py): move to using an in-memory check for batch get cache, to prevent redis from being hit for every call
* fix(cooldown_cache.py): fix linting error
* refactor(prometheus.py): move to using standard logging payload for reading the remaining request / tokens
Ensures prometheus token tracking works for anthropic as well
* fix: fix linting error
* fix(redis_cache.py): make sure ttl is always int (handle float values)
Fixes issue where redis_client.ex was not working correctly due to float ttl
* fix: fix linting error
* test: update test
* fix: fix linting error
---------
Co-authored-by: Ishaan Jaff <ishaanjaffer0324@gmail.com>
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
Co-authored-by: vibhanshu-ob <115142120+vibhanshu-ob@users.noreply.github.com>
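
The float-ttl fix in the redis_cache.py entry above can be illustrated with a minimal sketch. It assumes a hypothetical helper `normalize_ttl` and a hypothetical 60-second default; neither the name nor the value comes from this changelog. The idea is only that expirations passed to Redis in seconds must be integers, so a float ttl is coerced before the write.

```python
from typing import Optional, Union

DEFAULT_TTL_SECONDS = 60  # hypothetical default, not taken from the source


def normalize_ttl(ttl: Optional[Union[int, float]] = None) -> int:
    """Return an integer ttl in seconds, falling back to the default when unset."""
    if ttl is None:
        return DEFAULT_TTL_SECONDS
    # Truncate floats (e.g. 2.9 -> 2), but never go below 1 second, because
    # Redis rejects a zero or negative expiry on SET ... EX.
    return max(1, int(ttl))
```

Routing every cache write through one such helper is one way to guarantee that no key is written without an expiry.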
* refactor(redis_cache.py): use a default cache value when writing to redis
prevent redis from blowing up in high traffic
* refactor(redis_cache.py): refactor all cache writes to use self.get_ttl
ensures default ttl always used when writing to redis
Prevents redis db from blowing up in prod
* refactor(main.py): streaming_chunk_builder
use <100 lines of code
refactor each component into a separate function - easier to maintain + test
* fix(utils.py): handle choices being None
openai pydantic schema updated
* fix(main.py): fix linting error
* feat(streaming_chunk_builder_utils.py): update stream chunk builder to support rebuilding audio chunks from openai
* test(test_custom_callback_input.py): test message redaction works for audio output
* fix(streaming_chunk_builder_utils.py): return anthropic token usage info directly
* fix(streaming_chunk_builder_utils.py): run validation check before entering chunk processor
* fix(main.py): fix import
* refactor - use helpers for name space and hashing
* use openai to get the relevant supported params
* use helpers for getting cache key
* fix test caching
* use get/set helpers for preset cache keys
* make get_cache_key under 100 LOC
* fix _get_model_param_value
* fix _get_caching_group
* fix linting error
* add unit testing for get cache key
* test_generate_streaming_content
* caching - use _sync_set_cache
* add sync _sync_add_streaming_response_to_cache
* use caching class for cache storage
* fix use _sync_get_cache
* fix circular import
* use _update_litellm_logging_obj_environment
* use one helper for _process_async_embedding_cached_response
* fix _is_call_type_supported_by_cache
* fix checking cache
* fix sync get cache
* fix use _combine_cached_embedding_response_with_api_result
* fix _update_litellm_logging_obj_environment
* adjust test_redis_cache_acompletion_stream_bedrock
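
The cache-key entries above (namespace and hashing helpers) could look roughly like the following. This is a sketch under stated assumptions: `build_cache_key`, its parameters, and the choice of SHA-256 over sorted JSON are illustrative and are not the helpers in the codebase.

```python
import hashlib
import json
from typing import Any, Dict, Optional


def build_cache_key(params: Dict[str, Any], namespace: Optional[str] = None) -> str:
    """Hash request params into a stable key, optionally prefixed by a namespace."""
    # sort_keys makes the serialized form independent of dict insertion order.
    serialized = json.dumps(params, sort_keys=True, default=str)
    digest = hashlib.sha256(serialized.encode("utf-8")).hexdigest()
    return f"{namespace}:{digest}" if namespace else digest
```

Keeping serialization, hashing, and namespacing in one small function makes the key derivation easy to unit test on its own.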