mirror of
https://github.com/tiennm99/litellm.git
synced 2026-06-17 22:48:35 +00:00
533eab4dbd
* test(vcr): make Redis-backed cassettes replay deterministically across runs

  - Pin LITELLM_LOCAL_MODEL_COST_MAP=True in the shared VCR harness so the per-test importlib.reload(litellm) no longer fetches the model cost map from raw.githubusercontent.com. That live fetch was being recorded into cassettes; for tests that subsequently skip it was the only recorded episode, so the persister refused to save it (skipped tests don't persist) and the test re-recorded it live every run (MISS:NOT_PERSISTED).
  - Compare-time symmetric matcher tolerance for Google OAuth (ya29.*) tokens, observability/telemetry payloads, credential-exchange bodies, and volatile UUID/timestamp tokens, so existing cassettes select a recorded episode instead of growing past the 50-episode cap and re-recording live.
  - Don't record fire-and-forget telemetry (langfuse/arize/otel/...) into non-telemetry tests' cassettes. Several modules set litellm.success_callback at import time, so observability logging is globally enabled and an async flush from the background logging worker lands in an unrelated test's VCR window, saved as a spurious MISS:RECORDED (observed: a Langfuse batch from another completion landing on test_lowest_latency_routing_buffer). Such a request now passes through live (telemetry hosts aren't real-spend hosts); tests that actually assert on telemetry keep recording it.
  - Dedupe + cap the VCR diagnostic dump so the classification summary survives CircleCI's ~400KB step-output truncation.
  - Stabilize a non-deterministic rate-limit test body; mark AWS Secrets Manager lifecycle tests VCR-incompatible (uniquely-named secrets can't be replayed).
  - Mark test_router_text_completion_client VCR-incompatible: it fires 300 identical requests to verify async-client reuse, but vcrpy patches the HTTP transport so replay never exercises the real connection pool the test validates, and recording 300 near-identical episodes overflows the 50-episode cap (MISS:OVERFLOW every run). It hits a free mock endpoint.
  - Mark the Vertex AI MaaS Mistral OCR tests (vertex_ai/mistral-ocr-2505) VCR-incompatible: the MaaS model is not provisioned in the CI GCP project, so the live :rawPredict call fails and the test skips every run, leaving no cassette to record (MISS:NOT_PERSISTED every run). Sibling direct-Mistral and Azure OCR tests are unaffected and still replay from cache.

* fix(tests/vcr): refresh cassette TTL on read so replayed cassettes don't expire

  The Redis VCR persister loaded cassettes with a plain GET, which does not touch the key's TTL. A cassette that is only ever replayed (HIT/NOOP, never re-recorded) therefore expired exactly 24h after its last *write*, no matter how often it was read. Whichever CI run happened to cross that boundary re-recorded the cassette live and surfaced a spurious VCR MISS on otherwise deterministic cassettes: the residual per-run flakiness floor (a different random subset of read-only cassettes expiring each run).

  Slide the expiry forward on every successful load (best-effort EXPIRE), so any cassette used at least once per TTL window stays alive indefinitely and the 2nd/3rd run of a day replays cleanly.

* fix(tests/vcr): recover from spurious GET-None for existing cassette keys

  Under concurrent CI load, the persister's load GET was observed returning None for a cassette key that demonstrably existed on the (single, non-clustered) Redis master: an external monitor saw the key present with a healthy TTL at the same instant the in-process client read None. Because None is a valid GET result (not a RedisError), the retry-on-error client config never engaged, so the cassette re-recorded live (a phantom MISS:RECORDED); for flaky/networked tests the failed live call then triggered a pytest rerun, which is why a rotating subset of otherwise deterministic tests missed each run.

  On a None result, re-check EXISTS and re-read once. If the key really exists, use the recovered value and log [vcr-transient-miss-recovered] (also counted in cassette_cache_health). A genuinely absent key (a new cassette) still falls through to CassetteNotFoundError.

* chore(tests/vcr): TEMP diagnostic for persistent-miss cassette load path

  Logs GET/EXISTS at load time for the three cassettes that re-record every run despite being present in Redis, to capture what the in-process client sees. To be reverted before merge.

* chore(tests/vcr): write load diagnostic to Redis (truncation-proof)

  CI stdout truncates to the last ~400KB, dropping the early loaddbg lines for the alphabetically-first failing test. Push the load probe to a Redis list instead so it survives. To be reverted before merge.

* fix(tests/vcr): don't drop stored telemetry episodes during cassette load

  Root cause of the residual per-run misses on present cassettes: vcrpy's Cassette._load() replays each *stored* interaction through Cassette.append(), which runs before_record_request on it, and a None return there silently drops that episode. The telemetry-leak suppressor (_should_drop_telemetry_record) returns None for telemetry requests, so when a non-telemetry-named test (or the alphabetically-first test in a worker, whose _current_test_nodeid is still empty) loaded a cassette containing a Langfuse ingestion episode, the episode was dropped on read, forcing an endless live re-record (a phantom MISS:RECORDED on a cassette that was demonstrably present in Redis). Verified by reproducing Cassette._load() against the real cassette: empty/non-telemetry nodeid -> 0 episodes survive; with the guard -> 1 survives.

  Fix: guard the suppressor with a thread-local set around Cassette._load (via a small idempotent monkeypatch), so the drop only ever stops *new* incidental telemetry from being recorded and never filters the existing cassette on read.

  Also drops the speculative GET-None recovery + its diagnostics from the previous commits: the load diagnostic showed GET returns the cassette bytes fine (get=1440B), so the persister never returned a spurious None; the loss happened later in vcrpy's append. The proven TTL-refresh-on-read fix is retained.

* fix(tests/vcr): drop incidental telemetry export POSTs to stop rotating async-flush misses

  litellm's observability loggers flush on a background thread, so a Langfuse ingestion POST scheduled by one telemetry test can fire mid-way through a *later* telemetry-named test (after that test's own httpx mock has exited) and be recorded by VCR as a phantom episode: a non-deterministic MISS:RECORDED / PARTIAL that rotates onto a different telemetry test from run to run.

  Telemetry export POSTs are fire-and-forget; no test asserts on a *recorded* export response except the pass-through proxy test (which forwards a client POST to Langfuse ingestion and replays its 207). So _should_drop_telemetry_record now drops incidental export POSTs for every test except that one. Dropping returns None (live fire-and-forget, never stored), so it can only turn a phantom miss into a harmless live call, never the reverse; recorded read-back GETs that telemetry tests assert on are matched by method and left untouched.

* fix(tests/vcr): restore assertion in test_banner_silent_when_vcr_disabled

  The assertion that the banner is suppressed when VCR is disabled was inadvertently moved into test_diagnostic_log_silent_when_no_dir when the diagnostic-log tests were added, leaving the disabled-VCR test verifying nothing.

Co-authored-by: Yassin Kortam <yassin@berri.ai>
---------
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
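The TTL-refresh-on-read fix described in the commit message above can be illustrated with a minimal sketch. It does not reproduce the persister's actual code: `FakeTTLStore`, `load_cassette`, and `CASSETTE_TTL_SECONDS` are hypothetical names, and the in-memory store only stands in for Redis so the example runs without a server. What it shows is the technique itself: a plain read leaves the deadline untouched, while a read followed by a best-effort expiry reset keeps a regularly used key alive.

```python
import time
from typing import Callable, Dict, Optional, Tuple

# Assumption: a 24h TTL window, matching the expiry described in the commit message.
CASSETTE_TTL_SECONDS = 24 * 60 * 60


class FakeTTLStore:
    """Hypothetical in-memory stand-in for a key-value store with per-key TTLs."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[bytes, float]] = {}

    def set(self, key: str, value: bytes, ttl: float) -> None:
        self._data[key] = (value, self._clock() + ttl)

    def get(self, key: str) -> Optional[bytes]:
        """Plain read: returns the value but never moves the deadline."""
        entry = self._data.get(key)
        if entry is None:
            return None
        value, deadline = entry
        if self._clock() >= deadline:
            del self._data[key]  # lazily evict an expired key
            return None
        return value

    def expire(self, key: str, ttl: float) -> bool:
        """Reset the deadline to now + ttl; False if the key is absent."""
        value = self.get(key)
        if value is None:
            return False
        self._data[key] = (value, self._clock() + ttl)
        return True


def load_cassette(
    store: FakeTTLStore, key: str, ttl: float = CASSETTE_TTL_SECONDS
) -> Optional[bytes]:
    """Read a cassette and slide its expiry forward on every successful load."""
    payload = store.get(key)
    if payload is None:
        return None  # genuinely absent: the caller records a new cassette
    try:
        store.expire(key, ttl)  # best-effort: a failed refresh must not fail the load
    except Exception:
        pass
    return payload
```

With this shape, a key that is loaded at least once per TTL window never expires, whereas a key that is only read through `get` still expires a fixed interval after its last write.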
214 lines
9.4 KiB
Python
# What is this?
## Unit tests for the max_parallel_requests feature on Router
import asyncio
import inspect
import os
import sys
import time
import traceback
from datetime import datetime

import pytest

sys.path.insert(0, os.path.abspath("../.."))
from typing import Optional

import litellm
from litellm.utils import calculate_max_parallel_requests

"""
- only rpm
- only tpm
- only max_parallel_requests
- max_parallel_requests + rpm
- max_parallel_requests + tpm
- max_parallel_requests + tpm + rpm
"""


max_parallel_requests_values = [None, 10]
tpm_values = [None, 20, 300000]
rpm_values = [None, 30]
default_max_parallel_requests = [None, 40]


@pytest.mark.parametrize(
    "max_parallel_requests, tpm, rpm, default_max_parallel_requests",
    [
        (mp, tp, rp, dmp)
        for mp in max_parallel_requests_values
        for tp in tpm_values
        for rp in rpm_values
        for dmp in default_max_parallel_requests
    ],
)
def test_scenario(max_parallel_requests, tpm, rpm, default_max_parallel_requests):
    calculated_max_parallel_requests = calculate_max_parallel_requests(
        max_parallel_requests=max_parallel_requests,
        rpm=rpm,
        tpm=tpm,
        default_max_parallel_requests=default_max_parallel_requests,
    )
    # Expected precedence: explicit max_parallel_requests, then rpm, then an
    # rpm derived from tpm, then the router-wide default, otherwise None.
    if max_parallel_requests is not None:
        assert max_parallel_requests == calculated_max_parallel_requests
    elif rpm is not None:
        assert rpm == calculated_max_parallel_requests
    elif tpm is not None:
        # Derive rpm from tpm (6 requests per 1000 tokens), with a floor of 1:
        # tpm=20 -> int(0.12) = 0 -> 1; tpm=300000 -> 1800.
        calculated_rpm = int(tpm / 1000 * 6)
        if calculated_rpm == 0:
            calculated_rpm = 1
        print(
            f"test calculated_rpm: {calculated_rpm}, calculated_max_parallel_requests={calculated_max_parallel_requests}"
        )
        assert calculated_rpm == calculated_max_parallel_requests
    elif default_max_parallel_requests is not None:
        assert calculated_max_parallel_requests == default_max_parallel_requests
    else:
        assert calculated_max_parallel_requests is None


@pytest.mark.parametrize(
    "max_parallel_requests, tpm, rpm, default_max_parallel_requests",
    [
        (mp, tp, rp, dmp)
        for mp in max_parallel_requests_values
        for tp in tpm_values
        for rp in rpm_values
        for dmp in default_max_parallel_requests
    ],
)
def test_setting_mpr_limits_per_model(
    max_parallel_requests, tpm, rpm, default_max_parallel_requests
):
    deployment = {
        "model_name": "gpt-3.5-turbo",
        "litellm_params": {
            "model": "gpt-3.5-turbo",
            "max_parallel_requests": max_parallel_requests,
            "tpm": tpm,
            "rpm": rpm,
        },
        "model_info": {"id": "my-unique-id"},
    }

    router = litellm.Router(
        model_list=[deployment],
        default_max_parallel_requests=default_max_parallel_requests,
    )

    mpr_client: Optional[asyncio.Semaphore] = router._get_client(
        deployment=deployment,
        kwargs={},
        client_type="max_parallel_requests",
    )

    if max_parallel_requests is not None:
        assert max_parallel_requests == mpr_client._value
    elif rpm is not None:
        assert rpm == mpr_client._value
    elif tpm is not None:
        calculated_rpm = int(tpm / 1000 * 6)
        if calculated_rpm == 0:
            calculated_rpm = 1
        print(
            f"test calculated_rpm: {calculated_rpm}, calculated_max_parallel_requests={mpr_client._value}"
        )
        assert calculated_rpm == mpr_client._value
    elif default_max_parallel_requests is not None:
        assert mpr_client._value == default_max_parallel_requests
    else:
        assert mpr_client is None

    # raise Exception("it worked!")


async def _handle_router_calls(router):
    pre_fill = """
    Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc ut finibus massa. Quisque a magna magna. Quisque neque diam, varius sit amet tellus eu, elementum fermentum sapien. Integer ut erat eget arcu rutrum blandit. Morbi a metus purus. Nulla porta, urna at finibus malesuada, velit ante suscipit orci, vitae laoreet dui ligula ut augue. Cras elementum pretium dui, nec luctus nulla aliquet ut. Nam faucibus, diam nec semper interdum, nisl nisi viverra nulla, vitae sodales elit ex a purus. Donec tristique malesuada lobortis. Donec posuere iaculis nisl, vitae accumsan libero dignissim dignissim. Suspendisse finibus leo et ex mattis tempor. Praesent at nisl vitae quam egestas lacinia. Donec in justo non erat aliquam accumsan sed vitae ex. Vivamus gravida diam vel ipsum tincidunt dignissim.

    Cras vitae efficitur tortor. Curabitur vel erat mollis, euismod diam quis, consequat nibh. Ut vel est eu nulla euismod finibus. Aliquam euismod at risus quis dignissim. Integer non auctor massa. Nullam vitae aliquet mauris. Etiam risus enim, dignissim ut volutpat eget, pulvinar ac augue. Mauris elit est, ultricies vel convallis at, rhoncus nec elit. Aenean ornare maximus orci, ut maximus felis cursus venenatis. Nulla facilisi.

    Maecenas aliquet ante massa, at ullamcorper nibh dictum quis. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Quisque id egestas justo. Suspendisse fringilla in massa in consectetur. Quisque scelerisque egestas lacus at posuere. Vestibulum dui sem, bibendum vehicula ultricies vel, blandit id nisi. Curabitur ullamcorper semper metus, vitae commodo magna. Nulla mi metus, suscipit in neque vitae, porttitor pharetra erat. Vestibulum libero velit, congue in diam non, efficitur suscipit diam. Integer arcu velit, fermentum vel tortor sit amet, venenatis rutrum felis. Donec ultricies enim sit amet iaculis mattis.

    Integer at purus posuere, malesuada tortor vitae, mattis nibh. Mauris ex quam, tincidunt et fermentum vitae, iaculis non elit. Nullam dapibus non nisl ac sagittis. Duis lacinia eros iaculis lectus consectetur vehicula. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Interdum et malesuada fames ac ante ipsum primis in faucibus. Ut cursus semper est, vel interdum turpis ultrices dictum. Suspendisse posuere lorem et accumsan ultrices. Duis sagittis bibendum consequat. Ut convallis vestibulum enim, non dapibus est porttitor et. Quisque suscipit pulvinar turpis, varius tempor turpis. Vestibulum semper dui nunc, vel vulputate elit convallis quis. Fusce aliquam enim nulla, eu congue nunc tempus eu.

    Nam vitae finibus eros, eu eleifend erat. Maecenas hendrerit magna quis molestie dictum. Ut consequat quam eu massa auctor pulvinar. Pellentesque vitae eros ornare urna accumsan tempor. Maecenas porta id quam at sodales. Donec quis accumsan leo, vel viverra nibh. Vestibulum congue blandit nulla, sed rhoncus libero eleifend ac. In risus lorem, rutrum et tincidunt a, interdum a lectus. Pellentesque aliquet pulvinar mauris, ut ultrices nibh ultricies nec. Mauris mi mauris, facilisis nec metus non, egestas luctus ligula. Quisque ac ligula at felis mollis blandit id nec risus. Nam sollicitudin lacus sed sapien fringilla ullamcorper. Etiam dui quam, posuere sit amet velit id, aliquet molestie ante. Integer cursus eget sapien fringilla elementum. Integer molestie, mi ac scelerisque ultrices, nunc purus condimentum est, in posuere quam nibh vitae velit.
    """
    completion = await router.acompletion(
        "gpt-3.5-turbo",
        [
            {
                "role": "user",
                # Fixed speed (was random.random()*100) so the request body is
                # deterministic and the VCR cassette replays instead of
                # appending a new episode every run. This is a rate-limiting
                # test; the prompt content is irrelevant to what it asserts.
                "content": f"{pre_fill * 3}\n\nRecite the Declaration of independence at a speed of 50.0 words per minute.",
            }
        ],
        stream=True,
        temperature=0.0,
        stream_options={"include_usage": True},
    )

    async for chunk in completion:
        pass
    print("done", chunk)


@pytest.mark.asyncio
async def test_max_parallel_requests_rpm_rate_limiting():
    """
    - make sure requests > model limits are retried successfully.
    """
    from litellm import Router

    router = Router(
        routing_strategy="usage-based-routing-v2",
        enable_pre_call_checks=True,
        model_list=[
            {
                "model_name": "gpt-3.5-turbo",
                "litellm_params": {
                    "model": "gpt-3.5-turbo",
                    "temperature": 0.0,
                    "rpm": 1,
                    "num_retries": 3,
                },
            }
        ],
    )
    await asyncio.gather(*[_handle_router_calls(router) for _ in range(3)])


@pytest.mark.asyncio
async def test_max_parallel_requests_tpm_rate_limiting_base_case():
    """
    - check error raised if defined tpm limit crossed.
    """
    from litellm import Router, token_counter

    _messages = [{"role": "user", "content": "Hey, how's it going?"}]
    router = Router(
        routing_strategy="usage-based-routing-v2",
        enable_pre_call_checks=True,
        model_list=[
            {
                "model_name": "gpt-4o-2024-08-06",
                "litellm_params": {
                    "model": "gpt-4o-2024-08-06",
                    "temperature": 0.0,
                    "tpm": 1,
                },
            }
        ],
        num_retries=0,
    )

    with pytest.raises(litellm.RateLimitError):
        for _ in range(2):
            await router.acompletion(
                model="gpt-4o-2024-08-06",
                messages=_messages,
            )