litellm/tests/ocr_tests/conftest.py
Mateo Wang 533eab4dbd fix(tests/vcr): make Redis cassette cache replay deterministically (zero VCR misses on consecutive runs) (#28826)
* test(vcr): make Redis-backed cassettes replay deterministically across runs

- Pin LITELLM_LOCAL_MODEL_COST_MAP=True in the shared VCR harness so the
  per-test importlib.reload(litellm) no longer fetches the model cost map
  from raw.githubusercontent.com. That live fetch was being recorded into
  cassettes; for tests that subsequently skip, it was the only recorded
  episode, so the persister refused to save it (skipped tests don't persist)
  and the test re-recorded it live every run (MISS:NOT_PERSISTED).

- Add compare-time symmetric matcher tolerance for Google OAuth (ya29.*)
  tokens, observability/telemetry payloads, credential-exchange bodies, and
  volatile UUID/timestamp tokens, so existing cassettes select a recorded
  episode instead of growing past the 50-episode cap and re-recording live.

- Don't record fire-and-forget telemetry (langfuse/arize/otel/...) into
  non-telemetry tests' cassettes. Several modules set litellm.success_callback
  at import time, so observability logging is globally enabled and an async
  flush from the background logging worker lands in an unrelated test's VCR
  window, saved as a spurious MISS:RECORDED (observed: a Langfuse batch from
  another completion landing on test_lowest_latency_routing_buffer). Such a
  request now passes through live (telemetry hosts aren't real-spend hosts);
  tests that actually assert on telemetry keep recording it.

- Dedupe + cap the VCR diagnostic dump so the classification summary survives
  CircleCI's ~400KB step-output truncation.

- Stabilize a non-deterministic rate-limit test body; mark AWS Secrets Manager
  lifecycle tests VCR-incompatible (uniquely-named secrets can't be replayed).

- Mark test_router_text_completion_client VCR-incompatible: it fires 300
  identical requests to verify async-client reuse, but vcrpy patches the HTTP
  transport so replay never exercises the real connection pool the test
  validates, and recording 300 near-identical episodes overflows the
  50-episode cap (MISS:OVERFLOW every run). It hits a free mock endpoint.

- Mark the Vertex AI MaaS Mistral OCR tests (vertex_ai/mistral-ocr-2505)
  VCR-incompatible: the MaaS model is not provisioned in the CI GCP project,
  so the live :rawPredict call fails and the test skips every run, leaving no
  cassette to record (MISS:NOT_PERSISTED every run). Sibling direct-Mistral
  and Azure OCR tests are unaffected and still replay from cache.
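
The compare-time tolerance in the second bullet above can be pictured as
normalizing both sides before comparing them. The following is a minimal
sketch, not the harness's actual matcher: the regexes, the placeholder strings,
and the names normalize_volatile and bodies_match are all assumptions made for
illustration.

```python
import re

# Hypothetical patterns; the real harness may recognise different shapes.
_VOLATILE_PATTERNS = (
    # Google OAuth access tokens start with "ya29."
    (re.compile(r"ya29\.[A-Za-z0-9._-]+"), "<oauth-token>"),
    # Canonical 8-4-4-4-12 hex UUIDs
    (
        re.compile(
            r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
            r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
        ),
        "<uuid>",
    ),
    # ISO-8601 timestamps such as 2026-05-26T11:30:44Z
    (
        re.compile(
            r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}"
            r"(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?"
        ),
        "<timestamp>",
    ),
)


def normalize_volatile(text: str) -> str:
    """Replace run-specific tokens with fixed placeholders."""
    for pattern, placeholder in _VOLATILE_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


def bodies_match(recorded: str, incoming: str) -> bool:
    """Symmetric comparison: both sides are normalized the same way."""
    return normalize_volatile(recorded) == normalize_volatile(incoming)
```

Because the normalization is applied to the recorded and the incoming body
alike, a cassette recorded with one token still matches a request carrying a
different one, while a genuine difference (for example another model name)
still fails to match.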

* fix(tests/vcr): refresh cassette TTL on read so replayed cassettes don't expire

The Redis VCR persister loaded cassettes with a plain GET, which does not
touch the key's TTL. A cassette that is only ever replayed (HIT/NOOP, never
re-recorded) therefore expired exactly 24h after its last *write*, no matter
how often it was read. Whichever CI run happened to cross that boundary
re-recorded the cassette live and surfaced a spurious VCR MISS on otherwise
deterministic cassettes — the residual per-run flakiness floor (a different
random subset of read-only cassettes expiring each run).

Slide the expiry forward on every successful load (best-effort EXPIRE), so
any cassette used at least once per TTL window stays alive indefinitely and
the 2nd/3rd run of a day replays cleanly.
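
A minimal sketch of the sliding expiry described above, assuming a client that
exposes get(key) and expire(key, seconds) the way redis-py does. The
InMemoryRedis stand-in, the function name load_cassette_bytes, and the
CASSETTE_TTL_SECONDS constant are hypothetical and exist only so the example
runs without a Redis server.

```python
from typing import Optional

CASSETTE_TTL_SECONDS = 24 * 60 * 60  # the 24h window mentioned above


class InMemoryRedis:
    """Hypothetical stand-in exposing only get/expire, for illustration."""

    def __init__(self) -> None:
        self.values: dict = {}
        self.ttls: dict = {}

    def get(self, key: str) -> Optional[bytes]:
        return self.values.get(key)

    def expire(self, key: str, seconds: int) -> bool:
        if key not in self.values:
            return False
        self.ttls[key] = seconds
        return True


def load_cassette_bytes(
    client, key: str, ttl: int = CASSETTE_TTL_SECONDS
) -> Optional[bytes]:
    """Read a cassette and slide its expiry forward on a successful load."""
    data = client.get(key)
    if data is None:
        return None  # genuinely absent: the caller treats this as a miss
    try:
        client.expire(key, ttl)  # best-effort: a failed refresh is not fatal
    except Exception:
        pass
    return data
```

The refresh is wrapped in a broad try/except on purpose: failing to extend the
TTL should never turn a successful read into a miss.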

* fix(tests/vcr): recover from spurious GET-None for existing cassette keys

Under concurrent CI load, the persister's load GET was observed returning
None for a cassette key that demonstrably existed on the (single, non-
clustered) Redis master — an external monitor saw the key present with a
healthy TTL at the same instant the in-process client read None. Because
None is a valid GET result (not a RedisError), the retry-on-error client
config never engaged, so the cassette re-recorded live (a phantom
MISS:RECORDED); for flaky/networked tests the failed live call then
triggered a pytest rerun, which is why a rotating subset of otherwise
deterministic tests missed each run.

On a None result, re-check EXISTS and re-read once. If the key really
exists, use the recovered value and log [vcr-transient-miss-recovered]
(also counted in cassette_cache_health). A genuinely absent key (a new
cassette) still falls through to CassetteNotFoundError.

* chore(tests/vcr): TEMP diagnostic for persistent-miss cassette load path

Logs GET/EXISTS at load time for the three cassettes that re-record every
run despite being present in Redis, to capture what the in-process client
sees. To be reverted before merge.

* chore(tests/vcr): write load diagnostic to Redis (truncation-proof)

CI stdout truncates to the last ~400KB, dropping the early loaddbg lines
for the alphabetically-first failing test. Push the load probe to a Redis
list instead so it survives. To be reverted before merge.

* fix(tests/vcr): don't drop stored telemetry episodes during cassette load

Root cause of the residual per-run misses on present cassettes: vcrpy's
Cassette._load() replays each *stored* interaction through Cassette.append(),
which runs before_record_request on it — and a None return there silently
drops that episode. The telemetry-leak suppressor (_should_drop_telemetry_record)
returns None for telemetry requests, so when a non-telemetry-named test (or the
alphabetically-first test in a worker, whose _current_test_nodeid is still empty)
loaded a cassette containing a Langfuse ingestion episode, the episode was
dropped on read — forcing an endless live re-record (a phantom MISS:RECORDED on
a cassette that was demonstrably present in Redis). Verified by reproducing
Cassette._load() against the real cassette: empty/non-telemetry nodeid -> 0
episodes survive; with the guard -> 1 survives.

Fix: guard the suppressor with a thread-local set around Cassette._load (via a
small idempotent monkeypatch), so the drop only ever stops *new* incidental
telemetry from being recorded and never filters the existing cassette on read.
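
The guard can be sketched as follows. This is an illustration under stated
assumptions, not the patch itself: the Cassette class below is a reduced
stand-in for vcrpy's, requests are plain URL strings, the thread-local state is
a simple flag, and the names install_load_guard, is_loading_cassette, and
drop_telemetry are hypothetical.

```python
import threading

_load_state = threading.local()


def is_loading_cassette() -> bool:
    """True while the current thread is inside Cassette._load."""
    return getattr(_load_state, "active", False)


class Cassette:
    """Stand-in for vcrpy's Cassette, reduced to the load/append path."""

    def __init__(self, stored, before_record_request):
        self._stored = stored
        self._before_record_request = before_record_request
        self.data = []

    def append(self, request):
        request = self._before_record_request(request)
        if request is None:  # a None return silently drops the episode
            return
        self.data.append(request)

    def _load(self):
        for request in self._stored:
            self.append(request)


def install_load_guard(cassette_cls) -> None:
    """Wrap _load so the filter can tell replayed episodes from new ones."""
    original = cassette_cls._load
    if getattr(original, "_load_guard_installed", False):
        return  # idempotent: already patched

    def guarded_load(self, *args, **kwargs):
        previous = is_loading_cassette()
        _load_state.active = True
        try:
            return original(self, *args, **kwargs)
        finally:
            _load_state.active = previous

    guarded_load._load_guard_installed = True
    cassette_cls._load = guarded_load


def drop_telemetry(request):
    """Drop new telemetry requests, but never filter episodes on load."""
    if is_loading_cassette():
        return request
    return None if "langfuse" in request else request
```

With the guard installed, a stored Langfuse episode survives loading, while a
telemetry request appended afterwards (a new recording) is still dropped.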

Also drops the speculative GET-None recovery + its diagnostics from the previous
commits: the load diagnostic showed GET returns the cassette bytes fine
(get=1440B), so the persister never returned a spurious None — the loss happened
later in vcrpy's append. The proven TTL-refresh-on-read fix is retained.

* fix(tests/vcr): drop incidental telemetry export POSTs to stop rotating async-flush misses

litellm's observability loggers flush on a background thread, so a Langfuse
ingestion POST scheduled by one telemetry test can fire mid-way through a
*later* telemetry-named test (after that test's own httpx mock has exited) and
be recorded by VCR as a phantom episode — a non-deterministic MISS:RECORDED /
PARTIAL that rotates onto a different telemetry test from run to run.

Telemetry export POSTs are fire-and-forget; no test asserts on a *recorded*
export response except the pass-through proxy test (which forwards a client POST
to Langfuse ingestion and replays its 207). So _should_drop_telemetry_record now
drops incidental export POSTs for every test except that one. Dropping returns
None (live fire-and-forget, never stored), so it can only turn a phantom miss
into a harmless live call, never the reverse; recorded read-back GETs that
telemetry tests assert on are matched by method and left untouched.
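
The rule above amounts to a small request filter. The sketch below is an
assumption-laden illustration, not _should_drop_telemetry_record itself: the
Request dataclass, the host marker list, the exempt test name, and the
filter_request signature are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse

# Hypothetical values, for illustration only.
TELEMETRY_HOST_MARKERS = ("langfuse",)
TESTS_KEEPING_EXPORTS = ("test_pass_through_langfuse",)


@dataclass
class Request:
    method: str
    uri: str


def filter_request(request: Request, current_test_nodeid: str) -> Optional[Request]:
    """Return None to leave an incidental telemetry export POST unrecorded."""
    host = urlparse(request.uri).hostname or ""
    is_telemetry = any(marker in host for marker in TELEMETRY_HOST_MARKERS)
    # Only export POSTs are candidates; read-back GETs stay recorded.
    if not is_telemetry or request.method.upper() != "POST":
        return request
    # The exempt test asserts on the recorded export response.
    if any(name in current_test_nodeid for name in TESTS_KEEPING_EXPORTS):
        return request
    return None
```

Matching on the HTTP method is what keeps the recorded read-back GETs intact:
only POSTs to a telemetry host are ever candidates for dropping.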

* fix(tests/vcr): restore assertion in test_banner_silent_when_vcr_disabled

The assertion that the banner is suppressed when VCR is disabled was
inadvertently moved into test_diagnostic_log_silent_when_no_dir when
the diagnostic-log tests were added, leaving the disabled-VCR test
verifying nothing.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
2026-05-26 11:30:44 -07:00

# conftest.py
#
# Wires OCR tests into the Redis-backed VCR cache so live provider
# calls (Mistral OCR, Azure AI OCR, Azure Document Intelligence,
# Vertex AI OCR) are replayed for 24h. See tests/llm_translation/Readme.md
# for the design overview.
import os
import sys

import pytest

sys.path.insert(0, os.path.abspath("../.."))

from tests._vcr_conftest_common import (  # noqa: E402,F401
    VerboseReporterState,
    _pin_multipart_boundary,
    apply_vcr_auto_marker_to_items,
    emit_cassette_cache_session_banner,
    emit_vcr_classification_summary,
    emit_vcr_diagnostic_log,
    install_live_call_probe,
    record_vcr_outcome,
    register_persister_if_enabled,
    reset_vcr_diag_dir,
    vcr_config_dict,
)

# Vertex AI MaaS Mistral OCR tests that cannot be VCR-cached in CI.
#
# ``vertex_ai/mistral-ocr-2505`` is a Model-as-a-Service partner model that
# must be explicitly enabled in the GCP project's Model Garden. It is not
# provisioned in the CI project (``litellm-ci-cd``), so the live
# ``:rawPredict`` call fails on every run and ``BaseOCRTest`` catches the
# provider error and skips. Because the doomed live call is recorded but the
# test then skips, the persister refuses to save it (skipped tests don't
# persist) and the cassette is never seeded — so the test re-records live and
# is classified MISS:NOT_PERSISTED on every single run, forever. No cassette
# can be recorded until the model is provisioned. Mark the tests VCR-
# incompatible so they are honestly accounted as live calls (UNMARKED:LIVE_CALL)
# rather than phantom cache misses; behaviour is unchanged (they still run and
# still skip on the provider error). The sibling direct-Mistral and Azure OCR
# tests replay from cache normally and are unaffected. Remove these entries if
# the MaaS model is enabled in the CI project.
_VCR_INCOMPATIBLE_NODEID_SUFFIXES: tuple[str, ...] = (
    "test_ocr_vertex_ai.py::TestVertexAIMistralOCR::test_ocr_response_structure",
    "test_ocr_vertex_ai.py::TestVertexAIMistralOCR::test_basic_ocr_with_url[True]",
    "test_ocr_vertex_ai.py::TestVertexAIMistralOCR::test_basic_ocr_with_url[False]",
)

_verbose_state = VerboseReporterState()


@pytest.fixture(scope="module")
def vcr_config():
    return vcr_config_dict()


def pytest_recording_configure(config, vcr):
    register_persister_if_enabled(vcr)


@pytest.hookimpl(hookwrapper=True)
def pytest_runtest_makereport(item, call):
    outcome = yield
    rep = outcome.get_result()
    setattr(item, f"rep_{rep.when}", rep)


@pytest.fixture(autouse=True)
def _vcr_outcome_gate(request, vcr):
    install_live_call_probe(request, vcr)
    yield
    record_vcr_outcome(request, vcr)


def pytest_configure(config):
    _verbose_state.remember_pluginmanager(config)
    reset_vcr_diag_dir()


def pytest_runtest_logreport(report):
    _verbose_state.maybe_emit_verdict(report)


def pytest_collection_modifyitems(config, items):
    apply_vcr_auto_marker_to_items(
        items,
        skip_nodeid_suffixes=_VCR_INCOMPATIBLE_NODEID_SUFFIXES,
    )


def pytest_terminal_summary(terminalreporter, exitstatus, config):
    emit_cassette_cache_session_banner(terminalreporter)
    emit_vcr_classification_summary(terminalreporter)
    emit_vcr_diagnostic_log(terminalreporter)