mirror of
https://github.com/tiennm99/litellm.git
synced 2026-06-18 00:48:01 +00:00
533eab4dbd
* test(vcr): make Redis-backed cassettes replay deterministically across runs

  - Pin LITELLM_LOCAL_MODEL_COST_MAP=True in the shared VCR harness so the per-test importlib.reload(litellm) no longer fetches the model cost map from raw.githubusercontent.com. That live fetch was being recorded into cassettes; for tests that subsequently skip it was the only recorded episode, so the persister refused to save it (skipped tests don't persist) and the test re-recorded it live every run (MISS:NOT_PERSISTED).
  - Compare-time symmetric matcher tolerance for Google OAuth (ya29.*) tokens, observability/telemetry payloads, credential-exchange bodies, and volatile UUID/timestamp tokens, so existing cassettes select a recorded episode instead of growing past the 50-episode cap and re-recording live.
  - Don't record fire-and-forget telemetry (langfuse/arize/otel/...) into non-telemetry tests' cassettes. Several modules set litellm.success_callback at import time, so observability logging is globally enabled and an async flush from the background logging worker lands in an unrelated test's VCR window, saved as a spurious MISS:RECORDED (observed: a Langfuse batch from another completion landing on test_lowest_latency_routing_buffer). Such a request now passes through live (telemetry hosts aren't real-spend hosts); tests that actually assert on telemetry keep recording it.
  - Dedupe + cap the VCR diagnostic dump so the classification summary survives CircleCI's ~400KB step-output truncation.
  - Stabilize a non-deterministic rate-limit test body; mark AWS Secrets Manager lifecycle tests VCR-incompatible (uniquely-named secrets can't be replayed).
  - Mark test_router_text_completion_client VCR-incompatible: it fires 300 identical requests to verify async-client reuse, but vcrpy patches the HTTP transport so replay never exercises the real connection pool the test validates, and recording 300 near-identical episodes overflows the 50-episode cap (MISS:OVERFLOW every run). It hits a free mock endpoint.
  - Mark the Vertex AI MaaS Mistral OCR tests (vertex_ai/mistral-ocr-2505) VCR-incompatible: the MaaS model is not provisioned in the CI GCP project, so the live :rawPredict call fails and the test skips every run, leaving no cassette to record (MISS:NOT_PERSISTED every run). Sibling direct-Mistral and Azure OCR tests are unaffected and still replay from cache.

* fix(tests/vcr): refresh cassette TTL on read so replayed cassettes don't expire

  The Redis VCR persister loaded cassettes with a plain GET, which does not touch the key's TTL. A cassette that is only ever replayed (HIT/NOOP, never re-recorded) therefore expired exactly 24h after its last *write*, no matter how often it was read. Whichever CI run happened to cross that boundary re-recorded the cassette live and surfaced a spurious VCR MISS on otherwise deterministic cassettes: the residual per-run flakiness floor (a different random subset of read-only cassettes expiring each run).

  Slide the expiry forward on every successful load (best-effort EXPIRE), so any cassette used at least once per TTL window stays alive indefinitely and the 2nd/3rd run of a day replays cleanly.

* fix(tests/vcr): recover from spurious GET-None for existing cassette keys

  Under concurrent CI load, the persister's load GET was observed returning None for a cassette key that demonstrably existed on the (single, non-clustered) Redis master: an external monitor saw the key present with a healthy TTL at the same instant the in-process client read None. Because None is a valid GET result (not a RedisError), the retry-on-error client config never engaged, so the cassette re-recorded live (a phantom MISS:RECORDED); for flaky/networked tests the failed live call then triggered a pytest rerun, which is why a rotating subset of otherwise deterministic tests missed each run.

  On a None result, re-check EXISTS and re-read once. If the key really exists, use the recovered value and log [vcr-transient-miss-recovered] (also counted in cassette_cache_health). A genuinely absent key (a new cassette) still falls through to CassetteNotFoundError.

* chore(tests/vcr): TEMP diagnostic for persistent-miss cassette load path

  Logs GET/EXISTS at load time for the three cassettes that re-record every run despite being present in Redis, to capture what the in-process client sees. To be reverted before merge.

* chore(tests/vcr): write load diagnostic to Redis (truncation-proof)

  CI stdout truncates to the last ~400KB, dropping the early loaddbg lines for the alphabetically-first failing test. Push the load probe to a Redis list instead so it survives. To be reverted before merge.

* fix(tests/vcr): don't drop stored telemetry episodes during cassette load

  Root cause of the residual per-run misses on present cassettes: vcrpy's Cassette._load() replays each *stored* interaction through Cassette.append(), which runs before_record_request on it, and a None return there silently drops that episode. The telemetry-leak suppressor (_should_drop_telemetry_record) returns None for telemetry requests, so when a non-telemetry-named test (or the alphabetically-first test in a worker, whose _current_test_nodeid is still empty) loaded a cassette containing a Langfuse ingestion episode, the episode was dropped on read, forcing an endless live re-record (a phantom MISS:RECORDED on a cassette that was demonstrably present in Redis).

  Verified by reproducing Cassette._load() against the real cassette: empty/non-telemetry nodeid -> 0 episodes survive; with the guard -> 1 survives.

  Fix: guard the suppressor with a thread-local set around Cassette._load (via a small idempotent monkeypatch), so the drop only ever stops *new* incidental telemetry from being recorded and never filters the existing cassette on read.

  Also drops the speculative GET-None recovery + its diagnostics from the previous commits: the load diagnostic showed GET returns the cassette bytes fine (get=1440B), so the persister never returned a spurious None; the loss happened later in vcrpy's append. The proven TTL-refresh-on-read fix is retained.

* fix(tests/vcr): drop incidental telemetry export POSTs to stop rotating async-flush misses

  litellm's observability loggers flush on a background thread, so a Langfuse ingestion POST scheduled by one telemetry test can fire mid-way through a *later* telemetry-named test (after that test's own httpx mock has exited) and be recorded by VCR as a phantom episode: a non-deterministic MISS:RECORDED / PARTIAL that rotates onto a different telemetry test from run to run.

  Telemetry export POSTs are fire-and-forget; no test asserts on a *recorded* export response except the pass-through proxy test (which forwards a client POST to Langfuse ingestion and replays its 207). So _should_drop_telemetry_record now drops incidental export POSTs for every test except that one. Dropping returns None (live fire-and-forget, never stored), so it can only turn a phantom miss into a harmless live call, never the reverse; recorded read-back GETs that telemetry tests assert on are matched by method and left untouched.

* fix(tests/vcr): restore assertion in test_banner_silent_when_vcr_disabled

  The assertion that the banner is suppressed when VCR is disabled was inadvertently moved into test_diagnostic_log_silent_when_no_dir when the diagnostic-log tests were added, leaving the disabled-VCR test verifying nothing.

  Co-authored-by: Yassin Kortam <yassin@berri.ai>

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
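The TTL-refresh-on-read fix described above can be sketched roughly as follows. This is a minimal illustration, not the persister's actual code: `FakeRedis`, `load_cassette`, and `CASSETTE_TTL_SECONDS` are hypothetical names, and the in-memory stub only imitates the `get`/`set`/`expire`/`ttl` calls a Redis client offers so the example runs without a server.

```python
import time


class FakeRedis:
    """Hypothetical in-memory stand-in for a Redis client (GET/SET/EXPIRE/TTL only)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, absolute expiry time)

    def set(self, key, value, ex):
        self._data[key] = (value, self._clock() + ex)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None or entry[1] <= self._clock():
            self._data.pop(key, None)  # expired keys behave as absent
            return None
        return entry[0]

    def expire(self, key, seconds):
        if self.get(key) is None:
            return False
        value, _ = self._data[key]
        self._data[key] = (value, self._clock() + seconds)
        return True

    def ttl(self, key):
        if self.get(key) is None:
            return -2
        return int(self._data[key][1] - self._clock())


# Assumption: 24h, matching the expiry window mentioned in the commit message.
CASSETTE_TTL_SECONDS = 24 * 60 * 60


def load_cassette(client, key, ttl=CASSETTE_TTL_SECONDS):
    """Return the stored cassette bytes, sliding the expiry forward on a hit."""
    payload = client.get(key)  # a plain GET leaves the TTL untouched
    if payload is None:
        return None  # genuinely absent: the caller would record a new cassette
    try:
        client.expire(key, ttl)  # best-effort refresh; never fail the load over it
    except Exception:
        pass
    return payload
```

With this shape, a cassette that is read at least once per TTL window never expires, while a key nobody reads still ages out after the window.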
270 lines
9.6 KiB
Python
# conftest.py
#
# xdist-compatible test isolation for local_testing tests.
# Pattern matches tests/test_litellm/conftest.py:
# - Function-scoped fixture saves/restores litellm globals (no reload)
# - Module-scoped fixture reloads only in single-process mode
#
# IMPORTANT: True defaults are captured at conftest import time (before any
# test module can pollute them via module-level assignments like
# `litellm.num_retries = 3`). The function-scoped fixture resets globals to
# these true defaults before every test, preventing cross-test contamination
# under xdist where module reload is skipped.

import importlib
import os
import sys

import pytest

sys.path.insert(
    0, os.path.abspath("../..")
)  # Adds the directory two levels up (relative to the CWD) to the system path
import litellm

# ``litellm.model_cost`` is loaded at import time from the URL pinned to
# ``main`` (``LITELLM_MODEL_COST_MAP_URL``). The in-tree backup ships with
# this branch and can include pricing entries that main has not yet picked
# up (e.g. an upstream provider rotates a model id and the test cassette
# records the new name). Backfill any entries that are missing from the
# remote-fetched map so cost-calculator lookups in tests succeed against
# the cassette state the branch is being tested with.
from litellm.litellm_core_utils.get_model_cost_map import GetModelCostMap

for _k, _v in GetModelCostMap.load_local_model_cost_map().items():
    litellm.model_cost.setdefault(_k, _v)

from tests._vcr_conftest_common import (  # noqa: E402,F401
    VerboseReporterState,
    _pin_multipart_boundary,
    apply_vcr_auto_marker_to_items,
    emit_cassette_cache_session_banner,
    emit_vcr_classification_summary,
    emit_vcr_diagnostic_log,
    install_live_call_probe,
    record_vcr_outcome,
    register_persister_if_enabled,
    reset_vcr_diag_dir,
    vcr_config_dict,
)

# Per-item respx detection (``apply_vcr_auto_marker_to_items``) auto-skips
# tests whose ``@pytest.mark.respx`` marker or ``respx_mock`` fixture
# would conflict with vcrpy's transport patch. We no longer maintain a
# file-level ``_RESPX_CONFLICTING_FILES`` list here: the previous
# entries (``test_router.py``) had only a stale ``from respx import
# MockRouter`` import with no actual respx wiring, so file-level
# blacklisting was masking valid cache opportunities.

# Files where VCR replay breaks the test:
# - ``test_assistants.py``: polls fresh per-session run IDs that no cassette
#   can match, so every CI run re-records and the suite times out.
# - ``test_router_caching.py``: asserts upstream returns a *new* id per call,
#   which a deterministic cassette replay violates.
_VCR_INCOMPATIBLE_FILES = frozenset(
    {
        "test_assistants.py",
        "test_router_caching.py",
    }
)

# Individual tests (vs. whole files above) that VCR replay can't model:
# - ``test_router_text_completion_client``: a concurrency test that fires 300
#   identical requests to verify the async OpenAI client is *reused* across
#   calls (per its own comment, it "fails when we create a new Async OpenAI
#   client per request"). vcrpy patches the HTTP transport, so replay never
#   opens real connections and cannot exercise the client pool the test exists
#   to validate. Recording instead stores ~300 near-identical episodes, which
#   blows past MAX_EPISODES_PER_CASSETTE (50) so the cassette is refused on
#   every run (MISS:OVERFLOW). The endpoint is a free mock, so the live calls
#   carry no real provider cost.
_VCR_INCOMPATIBLE_NODEID_SUFFIXES: tuple[str, ...] = (
    "test_router.py::test_router_text_completion_client",
)


_verbose_state = VerboseReporterState()


@pytest.fixture(scope="module")
def vcr_config():
    return vcr_config_dict()


def pytest_recording_configure(config, vcr):
    register_persister_if_enabled(vcr)


@pytest.hookimpl(hookwrapper=True)
def pytest_runtest_makereport(item, call):
    outcome = yield
    rep = outcome.get_result()
    setattr(item, f"rep_{rep.when}", rep)


@pytest.fixture(autouse=True)
def _vcr_outcome_gate(request, vcr):
    install_live_call_probe(request, vcr)
    yield
    record_vcr_outcome(request, vcr)


def pytest_configure(config):
    _verbose_state.remember_pluginmanager(config)
    reset_vcr_diag_dir()


def pytest_runtest_logreport(report):
    _verbose_state.maybe_emit_verdict(report)


def pytest_terminal_summary(terminalreporter, exitstatus, config):
    emit_cassette_cache_session_banner(terminalreporter)
    emit_vcr_classification_summary(terminalreporter)
    emit_vcr_diagnostic_log(terminalreporter)


# ---------------------------------------------------------------------------
# Capture TRUE defaults at conftest import time. This runs before any test
# module's top-level code (e.g. `litellm.num_retries = 3`) executes, so
# the values here are guaranteed to be the real package defaults.
# ---------------------------------------------------------------------------
_SCALAR_DEFAULTS = {
    "num_retries": getattr(litellm, "num_retries", None),
    "num_retries_per_request": getattr(litellm, "num_retries_per_request", None),
    "request_timeout": getattr(litellm, "request_timeout", None),
    "set_verbose": getattr(litellm, "set_verbose", False),
    "cache": getattr(litellm, "cache", None),
    "allowed_fails": getattr(litellm, "allowed_fails", 3),
    "default_fallbacks": getattr(litellm, "default_fallbacks", None),
    "enable_azure_ad_token_refresh": getattr(
        litellm, "enable_azure_ad_token_refresh", None
    ),
    "tag_budget_config": getattr(litellm, "tag_budget_config", None),
    "model_cost": getattr(litellm, "model_cost", None),
    "token_counter": getattr(litellm, "token_counter", None),
    "disable_aiohttp_transport": getattr(litellm, "disable_aiohttp_transport", False),
    "force_ipv4": getattr(litellm, "force_ipv4", False),
    "drop_params": getattr(litellm, "drop_params", None),
    "modify_params": getattr(litellm, "modify_params", False),
    "api_base": getattr(litellm, "api_base", None),
    "api_key": getattr(litellm, "api_key", None),
}


@pytest.fixture(scope="function", autouse=True)
def isolate_litellm_state():
    """
    Per-function isolation fixture.

    Resets litellm globals to their true defaults before each test and
    restores them afterward, so tests don't leak side effects.
    Works safely under pytest-xdist parallel execution.
    """
    # ---- Save current callback state (for teardown restore) ----
    original_state = {}
    for attr in (
        "callbacks",
        "success_callback",
        "failure_callback",
        "_async_success_callback",
        "_async_failure_callback",
    ):
        if hasattr(litellm, attr):
            val = getattr(litellm, attr)
            original_state[attr] = val.copy() if val else []

    # Save list-type globals
    for attr in ("pre_call_rules", "post_call_rules"):
        if hasattr(litellm, attr):
            val = getattr(litellm, attr)
            original_state[attr] = val.copy() if val else []

    # Save scalar globals
    for attr in _SCALAR_DEFAULTS:
        if hasattr(litellm, attr):
            original_state[attr] = getattr(litellm, attr)

    # ---- Reset to true defaults before the test ----
    # Flush HTTP client cache
    if hasattr(litellm, "in_memory_llm_clients_cache"):
        litellm.in_memory_llm_clients_cache.flush_cache()

    # Clear callbacks and rules
    for attr in (
        "callbacks",
        "success_callback",
        "failure_callback",
        "_async_success_callback",
        "_async_failure_callback",
        "pre_call_rules",
        "post_call_rules",
    ):
        if hasattr(litellm, attr):
            setattr(litellm, attr, [])

    # Reset scalar globals to true defaults (prevents contamination from
    # module-level code like `litellm.num_retries = 3` in test files)
    for attr, default_val in _SCALAR_DEFAULTS.items():
        if hasattr(litellm, attr):
            setattr(litellm, attr, default_val)

    yield

    # ---- Teardown: restore saved state ----
    if hasattr(litellm, "in_memory_llm_clients_cache"):
        litellm.in_memory_llm_clients_cache.flush_cache()

    for attr, original_value in original_state.items():
        if hasattr(litellm, attr):
            setattr(litellm, attr, original_value)


@pytest.fixture(scope="module", autouse=True)
def setup_and_teardown():
    """
    Module-scoped setup. Reloads litellm only in single-process mode
    (skipped under xdist to avoid cross-worker interference).
    """
    sys.path.insert(0, os.path.abspath("../.."))

    import litellm

    worker_id = os.environ.get("PYTEST_XDIST_WORKER", None)
    if worker_id is None:
        importlib.reload(litellm)

        try:
            if hasattr(litellm, "proxy") and hasattr(litellm.proxy, "proxy_server"):
                import litellm.proxy.proxy_server

                importlib.reload(litellm.proxy.proxy_server)
        except Exception as e:
            print(f"Error reloading litellm.proxy.proxy_server: {e}")

    if hasattr(litellm, "in_memory_llm_clients_cache"):
        litellm.in_memory_llm_clients_cache.flush_cache()

    yield


def pytest_collection_modifyitems(config, items):
    apply_vcr_auto_marker_to_items(
        items,
        skip_files=_VCR_INCOMPATIBLE_FILES,
        skip_nodeid_suffixes=_VCR_INCOMPATIBLE_NODEID_SUFFIXES,
    )

    # Separate tests in 'test_amazing_proxy_custom_logger.py' and other tests
    custom_logger_tests = [
        item for item in items if "custom_logger" in item.parent.name
    ]
    other_tests = [item for item in items if "custom_logger" not in item.parent.name]

    # Sort tests based on their names
    custom_logger_tests.sort(key=lambda x: x.name)
    other_tests.sort(key=lambda x: x.name)

    # Reorder the items list
    items[:] = custom_logger_tests + other_tests
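
The save, reset, and restore cycle that `isolate_litellm_state` performs can be tried in isolation with a small stand-in. The sketch below is an illustration under stated assumptions, not part of this conftest: `fake_litellm`, `TRUE_DEFAULTS`, `LIST_ATTRS`, and `isolated_state` are hypothetical names, and a `SimpleNamespace` replaces the real `litellm` module so the example runs without the package installed.

```python
from contextlib import contextmanager
from types import SimpleNamespace

# Hypothetical stand-in for the litellm module; the attribute names come from
# conftest.py, the values are made up for the demonstration.
fake_litellm = SimpleNamespace(num_retries=None, success_callback=[])

# Captured once, before any "test module" gets a chance to mutate the namespace.
TRUE_DEFAULTS = {"num_retries": fake_litellm.num_retries}
LIST_ATTRS = ("success_callback",)


@contextmanager
def isolated_state(module):
    """Save current values, reset to true defaults, restore on exit."""
    # Save: copy lists so later in-place mutation cannot alter the snapshot.
    saved = {attr: list(getattr(module, attr)) for attr in LIST_ATTRS}
    saved.update({attr: getattr(module, attr) for attr in TRUE_DEFAULTS})

    # Reset: empty lists and import-time defaults, regardless of prior pollution.
    for attr in LIST_ATTRS:
        setattr(module, attr, [])
    for attr, default in TRUE_DEFAULTS.items():
        setattr(module, attr, default)

    try:
        yield module
    finally:
        # Restore: put back exactly what was there before the block ran.
        for attr, value in saved.items():
            setattr(module, attr, value)
```

Inside a `with isolated_state(fake_litellm):` block the namespace shows the captured defaults even if earlier code set `num_retries = 3`, and the polluted values return once the block exits. The `try`/`finally` is needed only because this is a plain context manager; a pytest fixture runs the code after `yield` as teardown on its own.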