mirror of
https://github.com/tiennm99/litellm.git
synced 2026-06-17 14:48:44 +00:00
b1b96ff3cf
* perf(router): Optimize prompt management model check with early exit
Add early return for models without '/' to avoid expensive get_model_list()
calls for 99% of standard model requests (gpt-4, claude-3, etc).
- Refactor _is_prompt_management_model() with "/" check before model lookup
- Add unit tests to verify optimization doesn't break detection
* perf(caching): optimize Redis batch cache operations and reduce unnecessary queries
This commit introduces several performance optimizations to the Redis caching layer:
**DualCache Improvements (dual_cache.py):**
1. Increase batch cache size limit from 100 to 1000
   - Allows for larger batch operations, reducing Redis round-trips
2. Throttle repeated Redis queries for cache misses
   - Update last_redis_batch_access_time for ALL queried keys, including those
     with None values
   - Prevents excessive Redis queries for frequently-accessed non-existent keys
3. Add early exit optimization
   - Short-circuit when redis_result is None or contains only None values
   - Avoids unnecessary processing when no cache hits are found
4. Optimize key lookup performance
   - Replace O(n) keys.index() calls with O(1) dict lookup via key_to_index mapping
   - Reduces algorithmic complexity in batch operations
5. Streamline cache updates
   - Combine result updates and in-memory cache updates in single loop
   - Only cache non-None values to avoid polluting in-memory cache
**CooldownCache Improvements (cooldown_cache.py):**
1. Enhanced early return logic
   - Check if all values in results are None, not just if results is None
   - Prevents unnecessary iteration when no valid cooldown data exists
These changes significantly improve Redis caching performance, especially for:
- High-throughput batch operations
- Scenarios with frequent cache misses
- Large-scale deployments with many concurrent requests
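The early exit and the key_to_index lookup described for DualCache can be illustrated with a minimal sketch. The function name `merge_batch_results` and its signature are hypothetical, chosen for illustration only; this is not the code in dual_cache.py.

```python
from typing import Any, Dict, List, Optional


def merge_batch_results(
    keys: List[str],
    result: List[Optional[Any]],
    redis_result: Optional[Dict[str, Any]],
) -> List[Optional[Any]]:
    """Fill `result` (aligned with `keys`) from `redis_result`.

    Hypothetical helper, not litellm's implementation.
    """
    # Early exit: Redis returned nothing, or every returned value is a miss.
    if redis_result is None or all(v is None for v in redis_result.values()):
        return result

    # Build the position map once, so each lookup is O(1)
    # instead of an O(n) keys.index() scan per key.
    key_to_index = {key: i for i, key in enumerate(keys)}

    for key, value in redis_result.items():
        if value is None:
            continue  # skip misses so they are never stored
        index = key_to_index.get(key)
        if index is not None:
            result[index] = value
    return result
```

In the real cache the same loop would also write each non-None value to the in-memory layer; that step is left out here to keep the sketch self-contained.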
* fix: remove unnecessary test
* refactor: move default_max_redis_batch_cache_size to constants
- Add DEFAULT_MAX_REDIS_BATCH_CACHE_SIZE constant (default: 1000)
- Update DualCache to use constant from constants.py
- Document new environment variable in config_settings.md
* fix: only use in memory cache when set
* fix(router): improve prompt management model detection with smart early return
The previous early return optimization in _is_prompt_management_model() was
checking if the model name parameter contained '/' and returning False if it
didn't. This broke detection for model aliases (e.g., 'chatbot_actions') that
don't have '/' in their name but map to prompt management models
(e.g., 'langfuse/openai-gpt-3.5-turbo').
Changed the early return logic to only exit early when:
- Model name contains '/' AND
- The prefix is NOT a known prompt management provider
This maintains the performance optimization for 99% of direct model calls
(avoiding expensive get_model_list lookups) while correctly handling:
- Direct prompt management calls (e.g., 'langfuse/model')
- Model aliases without '/' (e.g., 'chatbot_actions')
- Regular models with/without '/' (e.g., 'gpt-3.5-turbo', 'openai/gpt-4')
Fixes test: test_router_prompt_management_factory
* perf(router): optimize _pre_call_checks with shallow copy (1400x faster)
Replace deepcopy with list() in _pre_call_checks - runs on every request.
Only pops from list, never modifies deployment dicts, so shallow copy is safe.
Performance: 1400x faster on hot path
Impact: 2-5x overall throughput improvement for routing workloads
Tests: Added regression test to ensure no mutation + filtering works
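A minimal sketch of why the shallow copy is sufficient here, assuming the check only removes entries from the list. The helper name `remove_invalid` is hypothetical and is not the router's actual code.

```python
from typing import Any, Dict, Iterable, List


def remove_invalid(
    healthy_deployments: List[Dict[str, Any]], invalid_indices: Iterable[int]
) -> List[Dict[str, Any]]:
    # list() copies only the outer list; the deployment dicts are shared.
    remaining = list(healthy_deployments)
    # Pop from the highest index down so the lower indices stay valid.
    for index in sorted(invalid_indices, reverse=True):
        remaining.pop(index)
    return remaining
```

Popping from `remaining` leaves the caller's list untouched, which is the only isolation this code path needs as long as no deployment dict is modified.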
* perf(router): replace deepcopy with shallow copy for default deployment
Replace expensive copy.deepcopy() with shallow copy for default_deployment
in _common_checks_available_deployment() hot path.
Changes:
- Use dict.copy() for top-level deployment dict
- Use dict.copy() for nested litellm_params dict
- Only the 'model' field is modified, so deep recursion is unnecessary
Impact:
- 100x+ faster for default deployment path (every request when used)
- deepcopy recursively traverses entire object tree
- Shallow copy only copies two dict levels (exactly what's needed)
Test coverage:
- Added regression test to verify deployment isolation
- Ensures returned deployments don't mutate original default_deployment
- Validates multiple concurrent requests get independent copies
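The two-level copy described above might look like the following sketch. The helper name `with_model` is hypothetical; only the top-level dict and the nested litellm_params dict are copied, matching the description.

```python
from typing import Any, Dict


def with_model(default_deployment: Dict[str, Any], model: str) -> Dict[str, Any]:
    # Copy the top-level dict, then the one nested dict that is modified.
    updated = default_deployment.copy()
    updated["litellm_params"] = default_deployment["litellm_params"].copy()
    updated["litellm_params"]["model"] = model
    return updated
```

Any other nested objects remain shared with the original, so this is safe only while 'model' is the sole field being written.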
* perf(router): remove unnecessary dict copy in completion hot paths
Remove unnecessary deployment['litellm_params'].copy() in _completion
and _acompletion functions. The dict is only read and spread into a new
dict, never modified, making the defensive copy wasteful.
Changes:
- Remove .copy() in _completion (sync hot path)
- Remove .copy() in _acompletion (async hot path)
Impact:
- Every completion request (highest traffic endpoints)
- Eliminates unnecessary dict allocation and copy on every call
- Dict spreading already creates new dict, so no mutation possible
Test coverage:
- Added tests verifying deployment params unchanged after calls
- Tests both sync and async completion paths
- Validates optimization doesn't introduce mutations
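As a rough illustration of why the defensive copy is redundant: spreading a dict with `**` already builds a new dict. The helper name `build_call_kwargs` is hypothetical and does not appear in the router.

```python
from typing import Any, Dict


def build_call_kwargs(
    deployment: Dict[str, Any], kwargs: Dict[str, Any]
) -> Dict[str, Any]:
    # Read-only access to the deployment params; no .copy() needed.
    litellm_params = deployment["litellm_params"]
    # The dict display creates a fresh dict, so later writes to the
    # result cannot reach deployment["litellm_params"].
    return {**litellm_params, **kwargs}
```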
* perf(router): optimize deployment filtering in pre-call checks
Replace O(n²) list pop pattern with O(n) set-based filtering in
_pre_call_checks() to improve routing performance under high load.
Changes:
- Use set() instead of list for invalid_model_indices tracking
- Replace reversed list.pop() loop with single-pass list comprehension
- Eliminate redundant list→set conversion overhead
Impact:
- Hot path optimization: runs on every request through the router
- ~2-5x faster filtering when many deployments fail validation
- Most beneficial with 50+ deployments per model group or high
invalidation rates (rate limits, context window exceeded)
Technical details:
Old: O(k·n) worst case, where k = invalid deployments and n = total deployments (each pop shifts the remaining elements)
New: O(n) single pass with O(1) set membership checks
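The single-pass filter can be sketched as follows. The function name `filter_valid` is hypothetical; only the set named invalid_model_indices comes from the description above.

```python
from typing import Any, Dict, List, Set


def filter_valid(
    deployments: List[Dict[str, Any]], invalid_model_indices: Set[int]
) -> List[Dict[str, Any]]:
    # One pass over the list; each set membership test is O(1) on average.
    return [
        deployment
        for index, deployment in enumerate(deployments)
        if index not in invalid_model_indices
    ]
```

Because the comprehension builds a new list, no separate copy of the input is needed before filtering.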
* add: memory profiler
feat(proxy): Add configurable GC thresholds and enhance memory debugging endpoints
- Add PYTHON_GC_THRESHOLD env var to configure garbage collection thresholds
- Add POST /debug/memory/gc/configure endpoint for runtime GC tuning
- Enhance memory debugging endpoints with better structure and explanations
- Add comprehensive router and cache memory tracking
- Include worker PID in all debug responses for multi-worker debugging
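A hedged sketch of how an environment variable could drive GC thresholds. The comma-separated value format and both helper names are assumptions made for illustration; the commit message does not state how PYTHON_GC_THRESHOLD is parsed. Only `gc.set_threshold` and `gc.get_threshold` are standard library calls.

```python
import gc
from typing import Optional, Tuple


def parse_gc_threshold(raw: Optional[str]) -> Optional[Tuple[int, ...]]:
    """Parse a value such as "1000,50,50" (format is an assumption)."""
    if not raw:
        return None
    parts = tuple(int(part.strip()) for part in raw.split(","))
    if not 1 <= len(parts) <= 3:
        raise ValueError("expected between one and three integers")
    return parts


def apply_gc_threshold(raw: Optional[str]) -> Tuple[int, int, int]:
    """Apply the parsed thresholds, if any, and return the active ones."""
    thresholds = parse_gc_threshold(raw)
    if thresholds is not None:
        gc.set_threshold(*thresholds)
    return gc.get_threshold()
```

A caller would typically pass `os.getenv("PYTHON_GC_THRESHOLD")` at startup; an unset variable leaves the interpreter defaults in place.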
* refactor: reduce complexity in get_memory_details endpoint
Extract 6 helper functions from get_memory_details to fix linter
error PLR0915 (too many statements). Improves maintainability
while preserving functionality.
* fix(router): remove incorrect early exit in _is_prompt_management_model
Removes early exit optimization that checked model_name prefix instead
of the actual litellm_params model. This incorrectly returned False for
custom model aliases that map to prompt management providers.
Example: "my-langfuse-prompt/test_id" -> "langfuse_prompt/actual_id"
The method now correctly checks the underlying model's prefix.
Fixes test_is_prompt_management_model_optimization
* fix(proxy): add explicit type annotations to debug_utils dictionaries
Resolved 6 mypy type errors in proxy/common_utils/debug_utils.py by adding
explicit Dict[str, Any] annotations to dictionary variables where mypy was
incorrectly inferring narrow types. This allows the dictionaries to accept
different value types (strings, nested dicts) for error handling and various
return structures.
Fixed:
- Line 246: caches dictionary in get_memory_summary()
- Line 371: cache_stats dictionary in _get_cache_memory_stats()
- Line 439: litellm_router_memory dictionary in _get_router_memory_stats()
* fix(proxy): fix Python 3.8 compatibility in debug_utils type annotations
- Replace tuple[...], list[...] with Tuple[...], List[...] from typing
- Replace Dict | None with Optional[Dict] for Python 3.8 compatibility
- Add missing imports: List, Optional, Tuple to typing imports
Fixes TypeError: 'type' object is not subscriptable in Python 3.8
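For illustration, annotations written in the Python 3.8-compatible style look like this. The function `largest_entries` is a hypothetical example and is not taken from debug_utils.py.

```python
from typing import Dict, List, Optional, Tuple


def largest_entries(
    sizes: Optional[Dict[str, int]] = None, limit: int = 2
) -> List[Tuple[str, int]]:
    # On Python 3.8, writing list[tuple[str, int]] or Dict | None here
    # would raise a TypeError when the function is defined.
    if sizes is None:
        return []
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)[:limit]
```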
---------
Co-authored-by: AlexsanderHamir <alexsanderhamirgomesbaptista@gmail.com>
128 lines
4.0 KiB
Python
"""
Tests for Redis batch caching optimizations (commit 3f52e8c)

Verifies:

1. Batch cache size increased from 100 → 1000 (minimum 1k)
2. Repeated Redis queries for cache misses are throttled
"""

import os
import sys
import time
from unittest.mock import AsyncMock, patch

import pytest
from dotenv import load_dotenv

load_dotenv()
sys.path.insert(0, os.path.abspath("../.."))

import uuid
from litellm.caching.dual_cache import DualCache
from litellm.caching.in_memory_cache import InMemoryCache
from litellm.caching.redis_cache import RedisCache
from litellm.constants import DEFAULT_MAX_REDIS_BATCH_CACHE_SIZE


@pytest.fixture
def cache_setup():
    """Create cache instances for testing"""
    in_memory = InMemoryCache()
    redis_cache = RedisCache(
        host=os.getenv("REDIS_HOST"), port=os.getenv("REDIS_PORT")
    )
    dual_cache = DualCache(
        in_memory_cache=in_memory,
        redis_cache=redis_cache,
        default_max_redis_batch_cache_size=DEFAULT_MAX_REDIS_BATCH_CACHE_SIZE,
    )
    return dual_cache, in_memory, redis_cache


@pytest.mark.asyncio
async def test_batch_cache_size_is_1000_minimum(cache_setup):
    """Verify batch cache size is set to 1000 (never below 1k)"""
    dual_cache, _, _ = cache_setup

    # Critical: batch cache size must be at least DEFAULT_MAX_REDIS_BATCH_CACHE_SIZE
    assert dual_cache.last_redis_batch_access_time.max_size >= DEFAULT_MAX_REDIS_BATCH_CACHE_SIZE


@pytest.mark.asyncio
async def test_throttling_prevents_duplicate_redis_calls(cache_setup):
    """Test throttling prevents repeated Redis queries for cache misses"""
    dual_cache, _, redis_cache = cache_setup

    test_keys = [f"miss_{str(uuid.uuid4())}" for _ in range(3)]

    # Set short expiry for testing
    dual_cache.redis_batch_cache_expiry = 0.1  # 100ms

    with patch.object(
        redis_cache, "async_batch_get_cache", new_callable=AsyncMock
    ) as mock_redis:
        mock_redis.return_value = {key: None for key in test_keys}

        # First call hits Redis (no throttle data exists)
        await dual_cache.async_batch_get_cache(test_keys)
        assert mock_redis.call_count == 1

        # Second call immediately - throttled (within expiry window)
        await dual_cache.async_batch_get_cache(test_keys)
        assert mock_redis.call_count == 1

        # Verify all keys tracked in throttle cache
        for key in test_keys:
            assert key in dual_cache.last_redis_batch_access_time

        # Wait for expiry time to pass
        time.sleep(0.15)

        # Third call after expiry - call_count increases to 2
        await dual_cache.async_batch_get_cache(test_keys)
        assert mock_redis.call_count == 2


@pytest.mark.asyncio
async def test_basic_functionality_not_broken(cache_setup):
    """Ensure basic cache functionality still works after optimizations"""
    dual_cache, _, _ = cache_setup

    # Test basic set/get works
    test_key = f"functional_test_{str(uuid.uuid4())}"
    test_value = {"test": "data"}

    await dual_cache.async_set_cache(test_key, test_value)
    result = await dual_cache.async_get_cache(test_key)

    assert result == test_value


@pytest.mark.asyncio
async def test_batch_get_with_no_in_memory_cache():
    """Test that batch get works when in_memory_cache is None"""
    redis_cache = RedisCache(
        host=os.getenv("REDIS_HOST"), port=os.getenv("REDIS_PORT")
    )

    # Create DualCache with no in-memory cache
    dual_cache = DualCache(
        in_memory_cache=None,  # This is the edge case we're testing
        redis_cache=redis_cache,
    )

    # Set some test data directly in Redis
    test_key = f"no_memory_test_{str(uuid.uuid4())}"
    test_value = {"test": "data_without_memory_cache"}

    await redis_cache.async_set_cache(test_key, test_value)

    # Should not crash when fetching from Redis without in-memory cache
    result = await dual_cache.async_batch_get_cache([test_key])

    assert result is not None
    assert len(result) == 1
    assert result[0] == test_value