litellm

tiennm99/litellm
Fork 0
mirror of https://github.com/tiennm99/litellm.git synced 2026-08-01 22:22:22 +00:00
Files
T
History
7e13256fee test: add 24hr Redis-backed VCR cache to additional test suites (#27159 )
* test: add 24hr Redis-backed VCR cache to additional test suites

Extracts the existing llm_translation VCR plumbing into a reusable helper
(tests/_vcr_conftest_common.py) and wires it into the conftest.py files
of the test directories listed in LIT-2787:

  audio_tests, batches_tests, guardrails_tests, image_gen_tests,
  litellm_utils_tests, local_testing, logging_callback_tests,
  pass_through_unit_tests, router_unit_tests, unified_google_tests

The same helper is also adopted by the pre-existing llm_translation and
llm_responses_api_testing conftests to remove the copy-pasted VCR setup.

Each consuming conftest:
- registers the Redis persister via pytest_recording_configure
- auto-marks collected tests with pytest.mark.vcr (skipping respx-using
  files where applicable, since respx and vcrpy both patch httpx)
- gates cassette writes on test success via _vcr_outcome_gate

The cache is opt-in via CASSETTE_REDIS_URL; when unset, VCR is disabled
and tests hit live providers as before. LITELLM_VCR_DISABLE=1 still
forces a bypass for ad-hoc local runs.

Test directories that run LiteLLM proxy in Docker (build_and_test,
proxy_logging_guardrails_model_info_tests, proxy_store_model_in_db_tests)
are intentionally not included: VCR.py patches the in-process httpx
transport and cannot intercept calls made from inside a Docker container.
The installing_litellm_on_python* jobs make no LLM calls and don't
benefit from caching.

https://linear.app/litellm-ai/issue/LIT-2787/add-24hr-caching-to-additional-test-suites

* test(vcr): add safe-body matcher to handle JSONL and binary request bodies

vcrpy's stock body matcher inspects Content-Type and unconditionally
runs json.loads on application/json bodies. JSON Lines payloads (used
by the Bedrock batch S3 PUT and other upload paths) crash that with
json.JSONDecodeError: Extra data, before the matcher can return
'not a match'.

This was the root cause of the batches_testing CI job failing on
test_async_create_file once VCR auto-marking was applied to the
batches_tests directory.

Add a conservative byte-equality body matcher and use it in place of
'body' in the shared match_on tuple. The matcher is strictly more
conservative than vcrpy's default — the only thing it gives up is
'different JSON key order is treated as the same body', which doesn't
apply to deterministic litellm-built request payloads. It can never
produce a false positive that the default would have rejected, so
there is no cross-contamination risk.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(vcr): exclude tests that VCR replay actively breaks

A few tests are incompatible with cassette replay and were failing on
the latest CI run after VCR auto-marking was extended to local_testing
and logging_callback_tests:

- test_amazing_s3_logs.py (logging_callback_tests): the test asserts on
  a per-run response_id that should round-trip through a real S3
  PUT/LIST. vcrpy's boto3 stub intercepts the PUT and the LIST replays
  stale keys, so the freshly-generated id is never found.
- test_async_embedding_azure (logging_callback_tests) and
  test_amazing_sync_embedding (local_testing): the failure branches
  deliberately pass api_key='my-bad-key' to assert that the failure
  callback fires. We scrub auth headers from cassettes (so the bad-key
  request matches the prior good-key request), and vcrpy replays the
  recorded 200 — the failure callback never fires.
- test_assistants.py (local_testing): the OpenAI Assistants polling
  APIs mint fresh thread/run IDs every recording session and then poll
  until status=='completed'. Replays of those polled GETs can never
  match a freshly-generated run id, so every CI run effectively
  re-records and the suite blows past the 15m no_output_timeout.

Skip these from VCR auto-marking so they continue to hit live providers
as they did before this change. The remaining tests in each directory
still get cached.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(vcr): expand skip lists for second batch of incompatible tests

Followup to the previous commit. After re-running CI on the rebuilt
branch, three more tests surfaced as VCR-replay-incompatible:

- litellm_utils_testing :: test_get_valid_models_from_dynamic_api_key
  Calls GET /v1/models with api_key='123' to assert the result is empty.
  We scrub auth headers, so the bad-key request matches the prior
  good-key cassette and replays the recorded model list.
- litellm_utils_testing :: test_litellm_overhead.py
  Measures litellm_overhead_time_ms as a percentage of total wall-clock
  time. With cached responses the upstream 'network' time collapses to
  microseconds, blowing past the 40%% threshold the test asserts on.
  Skip the whole file (every parametrization is at risk).
- local_testing_part1 :: test_async_custom_handler_completion and
  test_async_custom_handler_embedding
  Same bad-key failure-callback pattern as the already-skipped
  test_amazing_sync_embedding.
- litellm_router_testing :: test_router_caching.py
  Asserts on litellm's own router-level response cache by comparing
  response1.id to response2.id across repeat upstream calls (test
  bypasses litellm cache via ttl=0 and expects upstream to return a
  *new* id). With VCR replay both upstream calls return the same
  cassette body, so the ids are identical. Skip the whole file.
- logging_callback_tests :: test_async_chat_azure (preemptive)
  Same shape as already-skipped test_async_embedding_azure; was masked
  by upstream OpenAI rate-limit failures on baseline.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(vcr): use item.path and tighten matcher docstring

- Replace pytest's deprecated item.fspath with item.path in
  apply_vcr_auto_marker_to_items so we don't emit deprecation
  warnings under pytest 8.
- Clarify _safe_body_matcher docstring to reflect actual behavior
  (direct == first, then UTF-8 bytes comparison, no repr fallback).

Addresses Greptile review feedback on PR #27159.

* test(vcr): swallow all RedisError on cassette save/load

Cassette persistence is strictly best-effort: any Redis-side failure
(connection blip, timeout, OutOfMemoryError when the maxmemory cap is
hit, READONLY replicas, etc.) should degrade to 'test passed but
cassette not cached' rather than fail the test on teardown.

Previously the persister only caught ConnectionError and TimeoutError,
so OutOfMemoryError — which Redis Cloud raises when the cassette cache
hits its memory cap and there are no evictable keys — propagated out of
vcrpy's autouse fixture and ERRORed otherwise-passing tests on
teardown. This caused the litellm_utils_testing CircleCI job to fail on
the latest commit's run, even though the underlying test was a unit
test that used mock_response and produced no real upstream traffic
(the cassette was dirtied by a background langfuse callback). The
rerun only succeeded because Redis evictions happened to free enough
room before the SET — i.e. it was timing-dependent flakiness.

Catch redis.exceptions.RedisError (the common base of all server- and
client-side Redis exceptions) on both save and load, and parametrize
the regression tests across ConnectionError, TimeoutError, and
OutOfMemoryError to pin the new behavior.

* test(vcr): surface cassette-cache failures with warnings + session banner

When the persister silently swallows a Redis OOM (or any RedisError) on
save/load there is otherwise no visible signal that the cache is
degraded — tests pass, the cassette just isn't persisted, and the next
session still hits the same Redis at the same near-cap memory.

Add three layers of observability so that failure mode is loud:

1. Per-process health counters ("save_failures", "load_failures", and
   the last error string for each), exposed via cassette_cache_health()
   and reset via reset_cassette_cache_health(). The persister
   increments these in addition to logging.

2. VCRCassetteCacheWarning (UserWarning subclass) emitted via
   warnings.warn() inside the persister's except block. Pytest's
   built-in warnings summary at session end automatically lists every
   such warning, so the failure is visible in CI logs without any
   conftest-level wiring.

3. Session-end banner via emit_cassette_cache_session_banner() and a
   stderr-fallback atexit handler registered from
   register_persister_if_enabled(). Two states:
     - red "VCR CASSETTE CACHE DEGRADED" when save_failures or
       load_failures > 0
     - yellow "VCR CASSETTE CACHE NEAR CAPACITY" (no failures, but
       used_memory >= 85% of maxmemory) so the next session knows
       the Redis is approaching OOM before any SET actually fails

Capacity comes from a best-effort INFO memory probe
(cassette_cache_capacity_snapshot) that returns None on any failure or
when maxmemory is uncapped. The atexit handler skips xdist workers so
only the controller emits.

Tests: parametrize the existing save/load swallow-error tests across
ConnectionError/TimeoutError/OutOfMemoryError, add direct tests for
the health counters and warning emission, and a new
test_vcr_conftest_common_banner.py covering banner output for every
state (silent/red/yellow/disabled/xdist-worker).

* test(vcr): bucket cassettes by API key fingerprint, drop bad-key skips

Tests that deliberately call an LLM API with a bad key (e.g. to assert
that the failure callback fires, or that check_valid_key returns False)
were being silently served the prior good-key cassette: we scrub the
real Authorization / x-api-key header from the cassette before storing
it, so a follow-up bad-key call is byte-identical to the good-key call
under the existing match_on tuple.

Add a 'key_fingerprint' custom matcher that distinguishes requests by
the SHA-256 of their API-key headers. The fingerprint is stamped into
a synthetic 'x-litellm-key-fp' header by a new before_record_request
hook, which then strips the real auth headers (we have to do the
scrubbing here instead of via vcrpy's filter_headers knob, because
filter_headers runs *first* and would erase the value we want to hash).

Bad-key requests now get a different cassette bucket than good-key
requests, so vcrpy will not replay a recorded 200 in place of the
expected 401. The fingerprint is a one-way hash of the secret, so
cassettes never contain the key.

This permanently removes the 'bad-key' category of skips:

- tests/local_testing: dropped ::test_amazing_sync_embedding,
  ::test_async_custom_handler_completion,
  ::test_async_custom_handler_embedding
- tests/logging_callback_tests: dropped ::test_async_chat_azure,
  ::test_async_embedding_azure
- tests/litellm_utils_tests: dropped
  ::test_get_valid_models_from_dynamic_api_key

Coverage: 7 new unit tests in tests/test_litellm/test_vcr_safe_body_matcher.py
covering header stripping, fingerprint determinism, no-auth bucketing,
good-vs-bad key discrimination, x-api-key (Anthropic/Azure) discrimination,
and idempotence under replay.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(vcr): drop redundant comments and docstrings

Trim narration of code that is already self-evident from function and
variable names. Keep the two genuinely non-obvious bits:

- ordering constraint between filter_headers and before_record_request,
  which would invite a maintainer to re-introduce the bug if removed
- the per-directory _VCR_INCOMPATIBLE_FILES rationale, since 'why
  exactly is this skipped' is not knowable from the test name alone

Also drop the 40-line commented-out drop-in conftest snippet at the
bottom of _vcr_conftest_common.py — the consuming conftests are the
canonical reference.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(vcr): make _before_record_request idempotent

vcrpy invokes before_record_request more than once per request:
can_play_response_for calls it, then __contains__ /
_responses (reached via play_response) call it again on the
result. The second invocation sees a request whose auth headers we
already stripped, so a naive recompute yields "no-key" and
overwrites the real fingerprint stored in the header.

This makes can_play_response_for and play_response disagree on
matchability — the former says "yes, we have a stored response for
this" (matching no-key to no-key) and the latter throws
UnhandledHTTPRequestError because it computes a fresh real
fingerprint that doesn't match the stored no-key.

In CI this manifested as ~30 failing tests across guardrails_testing,
audio_testing, batches_testing, image_gen_testing, llm_responses_api,
litellm_router_unit_testing, etc. Skip the recompute when the header
is already set, so re-applying the hook is a no-op.

Adds a regression test that fires the hook twice on the same dict and
asserts the fingerprint stays put.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(vcr): drop more redundant docstrings and headers

* test(vcr): enable 24hr cache for ocr_tests and search_tests

These two directories were the only non-dockerized test suites in the
build_and_test workflow that make live LLM/provider API calls but were
not VCR-enabled by this PR. Together they account for 96 tests:

- tests/ocr_tests/ (31): Mistral OCR, Azure AI OCR, Azure Document
  Intelligence, Vertex AI OCR. Pure-unit tests inside the same files
  (e.g. TestAzureDocumentIntelligencePagesParam) make no HTTP calls
  and become benign VCR NOOPs.
- tests/search_tests/ (65): Brave, DataForSEO, DuckDuckGo, Exa,
  Firecrawl, Google PSE, Linkup, Parallel.ai, Perplexity, SearchAPI,
  Searxng, Serper, Tavily.

Both directories use the canonical minimal conftest pattern from
tests/audio_tests/conftest.py with no skip lists. None of the test
files use respx, none assert on per-call upstream non-determinism
(no response1.id != response2.id, no overhead-as-fraction-of-total,
no live polling), so the default match_on tuple should cache cleanly.
If a flake surfaces during the first cassette-recording CI run, we
can add a targeted skip the same way we did for the other dirs.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>