Commit Graph

4 Commits

Author SHA1 Message Date
Krish Dholakia 4be0ec8e35 GA Multi-instance rate limiting v2 Requirements + New - specify token rate limit type - output / input / total (#11646)
* feat(parallel_request_limiter_v3.py): allows admin to enforce token rate limit based on just output tokens

Useful when trying to rate limit for primarily self hosted model use-cases

* test(test_parallel_request_limiter_v3.py): add unit test for token rate limit type

* feat(parallel_request_limiter_v3.py): return remaining token limits in header

* feat: return rate limit headers in response

* feat(parallel_request_limiter_v3.py): working rate limit response headers

* feat(parallel_request_limiter_v3.py): fix rate limit tracking for tpm when rpm also set

* feat(parallel_request_limiter_v3.py): show headers for key/user/team

* feat(parallel_request_limiter_v3.py): decrement max parallel request limiter on failure event

* feat(parallel_request_limiter_v3.py): add in-memory cache implementation of parallel request rate limiter

allows rate limiter to work even without redis cache setup

Work for GA of parallel request limiter v3

* refactor(proxy/hooks/__init__.py): replace with new parallel request handler

* test: update testing

* fix: fix ruff check

* fix: revert ga of multi instance rate limiting - needs more work to pass testing
2025-06-11 22:05:13 -07:00
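
Note on 4be0ec8e35: the commit lets an admin choose whether the token rate limit counts output, input, or total tokens. The sketch below shows one way such a choice could map to a counted value. It is not the code from `parallel_request_limiter_v3.py`; the function name `tokens_to_count`, its parameters, and the `"input"` / `"output"` / `"total"` strings are assumptions made for illustration.

```python
from typing import Literal

# Hypothetical names; the actual setting name and values in the repo may differ.
TokenRateLimitType = Literal["input", "output", "total"]


def tokens_to_count(
    prompt_tokens: int,
    completion_tokens: int,
    limit_type: TokenRateLimitType = "total",
) -> int:
    """Return how many tokens a request contributes to the rate limit counter."""
    if limit_type == "input":
        return prompt_tokens
    if limit_type == "output":
        # Counting only generated tokens, which the commit notes is useful
        # for primarily self-hosted model use cases.
        return completion_tokens
    if limit_type == "total":
        return prompt_tokens + completion_tokens
    raise ValueError(f"unknown token rate limit type: {limit_type!r}")


def remaining_tokens(limit: int, used: int) -> int:
    """Remaining budget, clamped at zero, e.g. for a response header value."""
    return max(limit - used, 0)
```

With a 1000-token limit and `"output"` counting, a request with 900 prompt tokens and 50 completion tokens would consume 50 tokens of the budget under these assumptions.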
Krish Dholakia c42740a4b9 Simplify experimental multi-instance rate limiter - more accurate (#11424)
* refactor: comment out circuit breaker

causes incorrect rate limiting in high traffic

* fix(base_routing_strategy.py): don't reset value if redis val is lower than current in-memory value

Fixes issue where redis might be trailing in-memory value

* fix(parallel_request_limiter_v2.py): if in-memory higher than redis, don't reset value; add previous slot keys to redis increment to correctly 'get' them

* fix(parallel_request_limiter_v3.py): v3 implementation of parallel request limiter

does not use background redis syncing - increments redis in call

simplify rate limiting logic to improve accuracy

* fix: fix ruff errors

* fix(parallel_request_limiter_v3.py): don't decrement limit on post call success - causes double decrements

* fix(parallel_request_limiter_v3.py): working accurate multi-instance logic

ensured exactly 100 requests were allowed with 100 users, a ramp-up of 10, a key with a 100 rpm limit, and 2 instances

* fix(parallel_request_limiter_v3.py): working accurate rate limiting with time window resets

allows rate limiting to work across multiple windows

* test: add unit tests for v3 rate limiter

* fix(parallel_request_limiter_v3.py): return window value into in-memory cache

allows in-memory cache checks to be used correctly

* refactor(parallel_request_limiter_v3.py): refactor rate limiting to work for multiple window/counter key pairs

enables using for user/team/model rate limiting

* feat(parallel_request_limiter_v3.py): working rate limiting, across key/user/team/end-user

* fix(parallel_request_limiter_v3.py): add model specific rate limiting

* fix(parallel_request_limiter_v3.py): ignore if no rate limits set

skip unnecessary rate limit checks if no limits are set

* fix(parallel_request_limiter_v3.py): initial commit bringing token rate limits back

* fix(parallel_request_limiter_v3.py): increment by value in list + update assertions to handle tokens + max parallel requests

* test(parallel_request_limiter_v3.py): more testing

* fix(parallel_request_limiter.py): working in-memory cache limiter

* fix(redis_cache.py): ignore linting error - use safe hasattr

* fix(parallel_request_limiter_v3.py): fix linting error

* refactor: remove redundant parallel_request_limiter_v2.py

old / inaccurate implementation

* test: update tests

* style: cleanup

* test: update test

* docs(config_settings.md): document new env var

* test(test_base_routing_strategy.py): update test
2025-06-07 11:10:55 -07:00
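
Note on c42740a4b9: several bullets in this commit describe a counter that is incremented in the request path and reset when its time window expires. A minimal single-process sketch of that idea follows. It is an illustration only: the class `WindowCounter`, its parameters, and the in-memory dictionary are assumptions, and the multi-instance version in the commit increments Redis in the call rather than a local dict.

```python
import time
from typing import Callable, Dict, Tuple


class WindowCounter:
    """Fixed-window request counter with window resets (hypothetical sketch)."""

    def __init__(
        self,
        limit: int,
        window_seconds: float = 60.0,
        now: Callable[[], float] = time.time,
    ) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self._now = now  # injectable clock, so the sketch can be tested
        # key -> (window start timestamp, requests counted in that window)
        self._state: Dict[str, Tuple[float, int]] = {}

    def allow(self, key: str) -> bool:
        now = self._now()
        window_start, count = self._state.get(key, (now, 0))
        # Start a fresh window once the previous one has expired.
        if now - window_start >= self.window_seconds:
            window_start, count = now, 0
        if count >= self.limit:
            self._state[key] = (window_start, count)
            return False
        self._state[key] = (window_start, count + 1)
        return True
```

Using one counter per (scope, window) pair, for example separate keys for an API key, a user, and a team, is one way to read the "multiple window/counter key pairs" bullet; the key naming is left to the caller here.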
Krish Dholakia 39849627f7 feat(parallel_request_limiter_v2.py): add sliding window logic (#11283)
* feat(parallel_request_limiter_v2.py): add sliding window logic

allows rate limiting to work across minutes

* fix(parallel_request_limiter_v2.py): decrement usage on rate limit error

* fix(base_routing_strategy.py): fix merge from redis - preserve values in in-memory cache during gap between push to redis and read from redis

* fix(base_routing_strategy.py): catch the delta change during redis sync

ensures values are kept in sync

* fix(parallel_request_limiter_v2.py): update tpm tracking to use slot key logic

* fix: fix linting error

* test: update testing

* test: update tests

* test: skip on rate limit or internal server errors

* test: use pytest fixture instead

* test: bump mistral model
2025-05-31 10:06:42 -07:00
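
Note on 39849627f7: the commit adds sliding window logic and moves tpm tracking to slot keys. One common way to build a sliding window from slots is to bucket usage into short time slots and sum the slots that still fall inside the window. The sketch below assumes that approach; it is not taken from `parallel_request_limiter_v2.py`, and the class name, `slot_seconds`, and the `cost` parameter are hypothetical.

```python
import time
from collections import defaultdict
from typing import Callable, DefaultDict, Dict


class SlidingWindowLimiter:
    """Slot-based sliding window limiter (illustrative sketch, in-memory only)."""

    def __init__(
        self,
        limit: int,
        window_seconds: int = 60,
        slot_seconds: int = 10,
        now: Callable[[], float] = time.time,
    ) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.slot_seconds = slot_seconds
        self._now = now
        # key -> {slot index -> usage recorded in that slot}
        self._slots: DefaultDict[str, Dict[int, int]] = defaultdict(
            lambda: defaultdict(int)
        )

    def _current_slot(self) -> int:
        return int(self._now() // self.slot_seconds)

    def usage(self, key: str) -> int:
        """Sum usage over the slots that are still inside the window."""
        oldest = self._current_slot() - self.window_seconds // self.slot_seconds + 1
        slots = self._slots[key]
        for slot in [s for s in slots if s < oldest]:
            del slots[slot]  # drop slots that have slid out of the window
        return sum(slots.values())

    def allow(self, key: str, cost: int = 1) -> bool:
        """Record `cost` units (requests or tokens) if the limit permits."""
        if self.usage(key) + cost > self.limit:
            return False
        self._slots[key][self._current_slot()] += cost
        return True
```

Because usage is summed across slots rather than reset on the minute boundary, a burst at the end of one minute still counts against requests made at the start of the next, which matches the stated goal of rate limiting that works across minutes.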
Krish Dholakia ef42461c1e Litellm fix GitHub action testing (#11163)
* test: add __init__.py files

* refactor: rename test folder to avoid naming conflict

* test: update workflows

* test: update tests

* test: update imports

* test: update tests

* test: remove unused import

* ci(test-litellm.yml): add pytest retry to github workflow

* test: fix test
2025-05-26 14:41:42 -07:00