* feat(parallel_request_limiter_v3.py): allow admins to enforce token rate limits based on output tokens only
Useful for rate limiting primarily self-hosted model use cases
* test(test_parallel_request_limiter_v3.py): add unit test for token rate limit type
* feat(parallel_request_limiter_v3.py): return remaining token limits in response headers
* feat: return rate limit headers in response
* feat(parallel_request_limiter_v3.py): working rate limit response headers
* feat(parallel_request_limiter_v3.py): fix rate limit tracking for tpm when rpm is also set
* feat(parallel_request_limiter_v3.py): show headers for key/user/team
* feat(parallel_request_limiter_v3.py): decrement max parallel request limiter on failure event
* feat(parallel_request_limiter_v3.py): add in-memory cache implementation of parallel request rate limiter
allows the rate limiter to work even without a Redis cache set up
Work toward GA of parallel request limiter v3
* refactor(proxy/hooks/__init__.py): replace with new parallel request handler
* test: update testing
* fix: fix ruff check
* fix: revert GA of multi-instance rate limiting; needs more work to pass testing
* refactor: comment out circuit breaker
causes incorrect rate limiting under high traffic
* fix(base_routing_strategy.py): don't reset value if Redis value is lower than current in-memory value
Fixes an issue where Redis might lag behind the in-memory value
* fix(parallel_request_limiter_v2.py): if the in-memory value is higher than Redis, don't reset it; add previous slot keys to the Redis increment so they are correctly fetched
* fix(parallel_request_limiter_v3.py): v3 implementation of parallel request limiter
does not use background Redis syncing; increments Redis within the call
simplifies rate limiting logic to improve accuracy
* fix: fix ruff errors
* fix(parallel_request_limiter_v3.py): don't decrement limit on post-call success; causes double decrements
* fix(parallel_request_limiter_v3.py): working accurate multi-instance logic
verified that only 100 requests are allowed with 100 users, a ramp-up of 10, a key with a 100 rpm limit, and 2 instances
* fix(parallel_request_limiter_v3.py): working accurate rate limiting with time window resets
allows rate limiting to work across multiple windows
* test: add unit tests for v3 rate limiter
* fix(parallel_request_limiter_v3.py): return window value into in-memory cache
allows in-memory cache checks to be used correctly
* refactor(parallel_request_limiter_v3.py): refactor rate limiting to work for multiple window/counter key pairs
enables its use for user/team/model rate limiting
* feat(parallel_request_limiter_v3.py): working rate limiting across key/user/team/end-user
* fix(parallel_request_limiter_v3.py): add model specific rate limiting
* fix(parallel_request_limiter_v3.py): ignore if no rate limits set
skip unnecessary rate limit checks if no limits are set
* fix(parallel_request_limiter_v3.py): initial commit bringing token rate limits back
* fix(parallel_request_limiter_v3.py): increment by value in list + update assertions to handle tokens + max parallel requests
* test(parallel_request_limiter_v3.py): more testing
* fix(parallel_request_limiter.py): working in-memory cache limiter
* fix(redis_cache.py): ignore linting error - use safe hasattr
* fix(parallel_request_limiter_v3.py): fix linting error
* refactor: remove redundant parallel_request_limiter_v2.py
old, inaccurate implementation
* test: update tests
* style: cleanup
* test: update test
* docs(config_settings.md): document new env var
* test(test_base_routing_strategy.py): update test
* feat(parallel_request_limiter_v2.py): add sliding window logic
allows rate limiting to work across minutes
* fix(parallel_request_limiter_v2.py): decrement usage on rate limit error
* fix(base_routing_strategy.py): fix merge from Redis; preserve in-memory cache values during the gap between pushing to Redis and reading from Redis
* fix(base_routing_strategy.py): capture delta changes made during Redis sync
ensures values are kept in sync
* fix(parallel_request_limiter_v2.py): update tpm tracking to use slot key logic
* fix: fix linting error
* test: update testing
* test: update tests
* test: skip on rate limit or internal server errors
* test: use pytest fixture instead
* test: bump mistral model
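The entries above describe the v3 limiter only at the level of commit messages. Below is a minimal, self-contained Python sketch of a few of the ideas they mention: per-descriptor fixed-window counters (key/user/team), skipping checks when no limits are set, an output-only token rate limit type, and remaining-limit response headers. It is not litellm's `parallel_request_limiter_v3.py`. The class names, the `token_rate_limit_type` values, the `x-ratelimit-*` header names, and the check-then-increment ordering are all assumptions made for illustration. State is kept in process memory only, with none of the Redis syncing that the multi-instance entries refer to.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Limits:
    # Hypothetical per-entity limits (key, user, team, ...); None means "no limit set".
    rpm: Optional[int] = None
    tpm: Optional[int] = None


class RateLimitExceeded(Exception):
    pass


class WindowedRateLimiter:
    """In-memory fixed-window limiter; an illustrative sketch, not the litellm code."""

    def __init__(self, window_seconds: float = 60.0, token_rate_limit_type: str = "total"):
        self.window_seconds = window_seconds
        # Assumed values: "output" counts only completion tokens toward tpm,
        # "total" counts prompt + completion tokens.
        self.token_rate_limit_type = token_rate_limit_type
        self.windows: Dict[str, float] = {}  # descriptor -> window start time
        self.requests: Dict[str, int] = {}   # descriptor -> requests in current window
        self.tokens: Dict[str, int] = {}     # descriptor -> tokens in current window

    def _roll_window(self, descriptor: str, now: float) -> None:
        # Start a fresh window, resetting both counters, once the old one has expired.
        start = self.windows.get(descriptor)
        if start is None or now - start >= self.window_seconds:
            self.windows[descriptor] = now
            self.requests[descriptor] = 0
            self.tokens[descriptor] = 0

    def pre_call(self, limits: Dict[str, Limits], now: Optional[float] = None) -> Dict[str, str]:
        now = time.monotonic() if now is None else now
        # Skip descriptors that have no limits configured.
        active = {d: l for d, l in limits.items() if l.rpm is not None or l.tpm is not None}
        # Check every descriptor before incrementing any, so a rejected
        # request never leaves a partial increment behind.
        for d, lim in active.items():
            self._roll_window(d, now)
            if lim.rpm is not None and self.requests[d] + 1 > lim.rpm:
                raise RateLimitExceeded(f"{d}: rpm limit {lim.rpm} reached")
            # Token counts are only known after the response, so tpm is
            # checked against usage already recorded in this window.
            if lim.tpm is not None and self.tokens[d] >= lim.tpm:
                raise RateLimitExceeded(f"{d}: tpm limit {lim.tpm} reached")
        headers: Dict[str, str] = {}
        for d, lim in active.items():
            self.requests[d] += 1
            scope = d.split(":", 1)[0]  # e.g. "key" from "key:sk-1"
            if lim.rpm is not None:
                headers[f"x-ratelimit-{scope}-remaining-requests"] = str(lim.rpm - self.requests[d])
            if lim.tpm is not None:
                headers[f"x-ratelimit-{scope}-remaining-tokens"] = str(lim.tpm - self.tokens[d])
        return headers

    def post_call_success(self, limits: Dict[str, Limits], prompt_tokens: int,
                          completion_tokens: int, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        used = completion_tokens
        if self.token_rate_limit_type == "total":
            used += prompt_tokens
        for d, lim in limits.items():
            if lim.tpm is not None:
                self._roll_window(d, now)
                self.tokens[d] += used


# Usage: a key limited to 2 rpm / 100 output tpm, plus a team with no limits (skipped).
limiter = WindowedRateLimiter(window_seconds=60, token_rate_limit_type="output")
limits = {"key:sk-1": Limits(rpm=2, tpm=100), "team:t-1": Limits()}
print(limiter.pre_call(limits, now=0.0))
# {'x-ratelimit-key-remaining-requests': '1', 'x-ratelimit-key-remaining-tokens': '100'}
limiter.post_call_success(limits, prompt_tokens=500, completion_tokens=40, now=1.0)
print(limiter.pre_call(limits, now=2.0))
# {'x-ratelimit-key-remaining-requests': '0', 'x-ratelimit-key-remaining-tokens': '60'}
```

Because all descriptors are checked before any counter is incremented, this sketch never has to roll back usage when a request is rejected; the entry that decrements usage on a rate limit error takes the increment-then-undo route instead. With `token_rate_limit_type="output"`, the 500 prompt tokens in the usage example do not count against the key's tpm limit.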