mirror of
https://github.com/tiennm99/litellm.git
synced 2026-06-17 20:48:32 +00:00
bb448b0031
* fix(tests): stabilize image-edit VCR cassettes to stop live gpt-image-1 spend
The image-edit cassettes for ``gpt-image-1`` were accumulating >50
episodes and being refused by the persister
(``tests/_vcr_redis_persister.py``), so every CI run was hitting the
real OpenAI endpoint. The async parametrize was the clearest tell:
``test_openai_image_edit_litellm_sdk[True]`` cached to 1 entry, but the
``[False]`` (async) sibling grew to 51 entries and never replayed.
Two non-deterministic sources were fueling the growth, both fixed
here. After this patch, the cassettes settle at one episode per
unique call and replay for the 24-hour TTL like every other suite.
1. Pin httpx's multipart boundary at the source. The existing
``_normalize_multipart_boundary`` rewrites the boundary in the
``Content-Type`` header reliably, but on the async transport path
the body is not always a contiguous ``bytes`` object when
``before_record_request`` runs, so the body-side replacement
silently no-ops and the recorded cassette retains the random
``boundary=<hex>`` string. The next CI run gets a fresh random
boundary, the ``safe_body`` matcher misses, and
``record_mode="new_episodes"`` appends another episode. Wrapping
``httpx._multipart.MultipartStream.__init__`` so it always uses
``vcr-static-boundary`` when no boundary is supplied eliminates
the variance for both sync and async paths and leaves the normalizer
in place as a backstop. Exposed as
``pin_httpx_multipart_boundary`` so other multipart-heavy suites
(audio, ocr, batches) can adopt the same fixture later.
2. Pass raw ``bytes`` (not ``BytesIO`` streams) through the
image-edit fixtures. A ``BytesIO`` whose file pointer is at EOF
after the first multipart upload silently encodes an empty image on
the next SDK / Router retry — yet another divergent body that VCR
records as a new episode. ``bytes`` are immutable and position-less,
so retries re-encode an identical payload every time. This is also
a small production-correctness improvement: a customer passing
``BytesIO`` today would hit the same empty-body retry bug. The
BytesIO-specific smoke test
(``test_openai_image_edit_with_bytesio``) is preserved by giving
``get_test_images_as_bytesio`` its own factory instead of aliasing
the bytes one.
3. Add ``scripts/flush_image_edit_vcr_cassettes.py`` — a one-shot
Redis SCAN/DEL helper that clears the bloated pre-fix cassettes
under ``litellm:vcr:cassette:tests/image_gen_tests/test_image_edits/*``.
Without this, the next CI run still loads the existing 51-entry
cassette, the new fixed-boundary body still doesn't match any of
the stale entries, the persister still refuses to save, and the
bleed continues. Run once with the production
``CASSETTE_REDIS_URL`` after merge (dry-run by default).
* DIAGNOSTIC: log VCR body mismatches + per-episode body hashes
Temporary observability boost so we can root-cause why
``test_image_edits.py`` async parametrizes still record fresh
episodes on every CI run even though the multipart boundary is now
pinned (sync parametrizes cache cleanly as VCR HIT). The matcher
currently raises ``AssertionError("request bodies differ")`` with
zero context, so we cannot tell whether the live body genuinely
varies, the matcher is comparing a bytes object to a stream object,
or the normalizer is silently skipping the body because it is not
bytes/str.
Three logs added; the first two are worth keeping permanently, the
third is intended to be reverted after the diagnosis lands:
1. ``_safe_body_matcher`` now emits a structured stderr block on
mismatch (type of each side, length, SHA-256, first divergent
byte offset, ±100-byte window). Always-on -- mismatches are
signal, not noise, and the existing per-test verdict already
logs once per test. PERMANENT.
2. ``_normalize_multipart_boundary`` now logs to stderr when the
body type is not bytes/bytearray/str -- the silent ``else:
return`` branch was masking exactly the case we suspect is
firing on async (httpx ``MultipartStream`` handed to vcrpy
before the body is read). PERMANENT.
3. ``_RedisPersister.save_cassette`` now logs every episode's body
SHA-256, length, and 120-byte preview at save time. This lets
two consecutive CI runs be diffed: if the same test records a
different hash run-to-run, the live body genuinely varies; if
both runs record the same hash but the matcher still misses, the
bug is in the matcher itself. TEMPORARY -- revert once the
async variance is identified and fixed.
Once a single ``image_gen_testing`` CI run produces these logs,
revert this commit (or just the persister hash block) with a force
push so the cassette save path is not noisy in steady-state.
* DIAGNOSTIC: route VCR diagnostics through per-PID files (bypass xdist capture)
Re-push of the diagnostic logging from the previous commit, this
time wired so the output actually survives to the CI log. xdist
captures stdout/stderr from every passing test in the worker
process; the body-matcher and normalizer-skip diagnostics fire from
inside vcrpy machinery during the test, so for any test that
ultimately passes (which is all of them once the cassettes are
recorded), the diagnostic lines are silently swallowed.
Fix: write each diagnostic line to a per-PID file under
``test-results/vcr-diagnostics/<pid>.log`` instead of writing to
stderr. The controller's ``pytest_terminal_summary`` aggregates
those files and writes them through ``terminalreporter.write_line``,
which is not subject to per-test capture. As a bonus,
``test-results/`` is already collected by the ``store_test_results``
step in CircleCI, so the raw per-worker logs survive as build
artifacts even after the test session ends.
Three call sites updated:
1. ``_emit_body_mismatch_diagnostic`` (matcher) -- writes the
structured type/length/sha/window block via ``vcr_diag_write_line``.
2. ``_normalize_multipart_boundary`` -- logs the silent-skip path
(body not bytes/bytearray/str) the same way.
3. ``_maybe_log_episode_body_hashes`` (persister) -- replaces the
``_log.warning`` calls (which the root-logger config also
swallows in CI) with ``vcr_diag_write_line``.
Image-gen conftest is the only suite wired to dump the aggregated
log at session end. Other suites can opt in by adding
``emit_vcr_diagnostic_log(terminalreporter)`` to their own
``pytest_terminal_summary``. The diagnostic dir is cleared at the
start of each session (controller-only) so a local rerun does not
mix output from prior runs.
Same revert plan as the previous diagnostic commit: keep the
matcher + normalizer skip diagnostics permanently (they only fire
on signal events), revert the persister body-hash dump once the
async variance is identified.
* fix(tests): coalesce iterable request bodies before matching/recording
Root cause of the residual async image-edit cassette leak. The
diagnostic run for ``ba3915d9`` printed:
[vcr-safe-body-matcher] request body mismatch
body[a]: type='list_iterator' length=unknown sha256=N/A
body[b]: type='list_iterator' length=unknown sha256=N/A
httpx's async transport hands vcrpy a ``request.body`` that is a
``list_iterator`` over multipart chunks rather than a contiguous
``bytes`` blob. Two consequences:
1. ``_safe_body_matcher`` compares the two iterator objects with
``==``, which is identity comparison for arbitrary iterators -
semantically identical multipart bodies never compare equal, and
``record_mode="new_episodes"`` appends a new episode on every CI
run until the cassette crosses ``MAX_EPISODES_PER_CASSETTE`` and
the persister refuses to save (this is exactly what the OVERFLOW
warning has been catching).
2. ``_normalize_multipart_boundary`` short-circuits its
``else: return`` branch because the body is neither bytes nor
str, so any residual random boundary characters in the body bytes
are never rewritten.
Sync requests do not hit this code path: httpx's sync transport
hands vcrpy a single ``bytes`` body, so ``==`` works and the
boundary normalizer runs as intended. That is why
``test_openai_image_edit_litellm_sdk[True]`` records to ``entries=1``
and replays cleanly while ``[False]`` (async) kept growing by one
episode per run.
Fix: add ``_materialize_iterable_body`` which coalesces an iterable
``request.body`` into ``bytes`` in-place. Call it from two places:
* The top of ``_before_record_request``, so the boundary normalizer
and the cassette serializer both see bytes from then on.
* The top of ``_safe_body_matcher``, as defense in depth in case a
future vcrpy code path invokes the matcher without first going
through ``_before_record_request``.
The vcrpy ``Request`` is a wrapper used for matching and recording;
the underlying httpx transport sends its own request body
separately, so replacing the iterator on the vcrpy wrapper does
not starve the live HTTP send.
After this lands the async parametrizes should flip from
``[VCR MISS:RECORDED] entries=N+1`` to ``[VCR HIT] entries=N`` on
the next CI run, matching the sync side and dropping the residual
~$3/day to $0.
* fix(tests): handle bytes_iterator + never leave an exhausted body
Follow-up to 8e08272b. The previous attempt at coalescing iterable
request bodies bailed out (``return`` without writing
``request.body``) whenever it could not classify the chunk type.
That was the wrong failure mode for one critical case: vcrpy
sometimes presents the body as ``iter(some_bytes)``, whose Python
type is ``bytes_iterator`` and which yields ``int`` byte values
(0-255), not byte chunks. The old code saw an ``int`` chunk, hit
the ``else: return`` branch, and left ``request.body`` pointing at
the now-exhausted iterator.
The post-fix diagnostic run made this loud:
[vcr-safe-body-matcher] request body mismatch
body[a]: type='bytes_iterator' length=unknown sha256=N/A
body[b]: type='bytes_iterator' length=unknown sha256=N/A
Every async image-edit test then ballooned from entries=2 to
entries=10 in that single CI run -- the exhausted iterator meant
the live multipart upload went out as an empty body, OpenAI
returned 400, the SDK + flaky retries fired, each retry got a
fresh iterator that my hook exhausted again, and ``new_episodes``
recorded each failed attempt as a new cassette episode.
This patch:
* Recognizes ``bytes_iterator`` (chunks are ``int``) and
reconstructs the buffer via ``bytes(chunks)``.
* Keeps the existing ``list_iterator``-over-bytes-chunks handling
via ``b"".join(...)``.
* **Always writes a bytes value back to ``request.body`` after
consuming the iterator.** If the chunk shape is unrecognized,
``request.body`` is set to ``b""`` rather than left as an
exhausted iterator. That is wrong in the sense of "we lost the
body" but right in the sense of "the failure mode is now visible
(live API call sends empty body and fails fast) instead of
invisible (corrupt cassette grows silently)". Combined with the
matcher diagnostic, any future regression in this code path will
surface in the CI log immediately.
Local verification covers ``bytes_iterator``, ``list_iterator``
over bytes chunks, generator over bytes chunks, empty iterator,
already-bytes (idempotent), identical-content iterator equality
in the matcher (now matches), and differing-content iterator
inequality (still raises).
* fix(tests): clear vcrpy's sticky _was_iter flag so materialized bodies stay bytes
Actual root cause of the async image-edit cassette leak. The
previous diagnostic run produced this dead giveaway:
[vcr-episode-body-hash] ... episode[0]: body type='bytes_iterator'
is not bytes/bytearray/str -- cannot hash
[vcr-safe-body-matcher] request body mismatch
body[a]: type='bytes_iterator' length=unknown sha256=N/A
body[b]: type='bytes_iterator' length=unknown sha256=N/A
Both sides of the matcher were ``bytes_iterator`` **after** the
materializer had supposedly converted them to bytes. That made no
sense until I read vcrpy's ``Request`` class.
vcrpy's ``Request`` keeps two private flags that are set in
``__init__`` from the original body's type and **never cleared by
the setter**:
def __init__(self, method, uri, body, headers):
self._was_file = hasattr(body, "read")
self._was_iter = _is_nonsequence_iterator(body)
...
@property
def body(self):
if self._was_file: return BytesIO(self._body)
if self._was_iter: return iter(self._body)
return self._body
@body.setter
def body(self, value):
if isinstance(value, str): value = value.encode("utf-8")
self._body = value # <-- does NOT touch _was_iter / _was_file
So when httpx's async transport hands vcrpy an iterator body,
``_was_iter`` becomes ``True`` and stays there forever. Even after
``_materialize_iterable_body`` writes plain bytes via
``request.body = out``, the next read of ``.body`` re-wraps the
stored bytes in ``iter()`` -- producing a fresh ``bytes_iterator``
that compares unequal to any other ``bytes_iterator`` via object
identity. The matcher missed every time, the cassette grew by one
episode per run, and the persister saw the same iterator type when
trying to hash the body for the diagnostic log.
Fix: after writing the materialized bytes, also force
``_was_iter`` and ``_was_file`` to ``False``. vcrpy exposes no
public API for this, so we touch the private flags directly --
acknowledged as a pragmatic test-only hack with a clear unit
boundary (the only call site is ``_materialize_iterable_body``).
Local repro reproduces the exact production setup:
``Request('POST', url, iter(b'multipart-content'), {})`` on two
sides, runs the matcher, asserts HIT. Verified the matcher hits on
identical content and still raises on differing content.
Should be the last fix needed. Existing cassettes that contain
oddly-shaped bodies (lists of int chunks, etc. from the previous
``_was_iter=True`` save path) still match because the materializer
canonicalises both sides to bytes before comparison -- no fourth
re-flush required.
* revert(tests): drop the temp per-episode body-hash diagnostic
Removed now that 1c51ad13 has confirmed the root cause (vcrpy's
sticky ``_was_iter`` flag making the body getter re-wrap stored
bytes in ``iter()`` on every access). The hash dump did its job --
the post-1c51ad13 image_gen_testing run shows all five async
image-edit tests as ``[VCR HIT]`` with stable entry counts and
zero billing errors -- and is too noisy to keep on by default
(over 100 lines per session at steady state).
Kept permanently:
* ``_safe_body_matcher`` mismatch diagnostic in
``_vcr_conftest_common.py``. Only fires on a body mismatch,
which is signal worth surfacing whenever it happens.
* ``_normalize_multipart_boundary`` "skipped" log line. Same
rationale -- only fires when the body shape is something the
normalizer cannot rewrite in place.
* The ``test-results/vcr-diagnostics/<pid>.log`` per-PID file
plumbing (``vcr_diag_write_line`` /
``emit_vcr_diagnostic_log``). Useful for any future diagnostic
that needs to bypass xdist stdout/stderr capture; cheap to keep.
* chore(tests): delete unused flush script + wire VCR diagnostic dump everywhere
* Remove ``scripts/flush_image_edit_vcr_cassettes.py``. It was a
one-shot helper for the initial cassette flush; the iterator and
``_was_iter`` fixes mean no future flush should be required, and
the script was never run anywhere (the actual flushes happened
inside the CI conftest via the temp hacks that have since been
reverted).
* The matcher mismatch + normalizer skip diagnostics already write
per-PID files for every suite that imports the shared VCR
plumbing, but ``emit_vcr_diagnostic_log`` -- the controller-side
dump that surfaces those files into the CI log at session end --
was only wired into ``image_gen_tests``. Add the one-line call to
the 12 sibling conftests that already use VCR so the diagnostics
surface in any suite's terminal output if a body matcher ever
misses. No new output in steady state -- the dump is a no-op when
no diagnostics were recorded that session.
* chore(tests): trim non-essential comments per project comment policy
Strips docstrings, inline comments, and block comments that this PR
introduced where the code itself was already self-evident. Keeps the
few lines that document non-obvious behaviour (raw-bytes-not-BytesIO
rationale on the image fixtures, the per-PID-files-bypass-xdist note
on the diagnostic directory). Touches only comments this PR added --
no pre-existing comment is removed.
Net: -161 lines of comment/docstring across 3 files, no code
behaviour change.
* chore(tests): forward **kwargs in pin_httpx_multipart_boundary wrapper
Defensive against future httpx MultipartStream.__init__ adding new
optional kwargs. Without the forward, the wrapper would silently drop
them. No behaviour change today.
* chore(tests): canonicalize VCR matchers and surface shouldn't-happen branches
Bundles the "follow-up cleanup PR" into this one so it does not get
lost. Four small changes:
1. Introduce ``_canonical_body(req) -> (bytes, pre_type)`` and route
``_safe_body_matcher`` through it. The matcher now operates on
bytes by construction; the "compare two iterator objects via
``==`` and silently get object-identity semantics" failure mode
(which cost us this entire PR to diagnose) is structurally
impossible to reintroduce. ``pre_type`` is the body type *before*
canonicalization, surfaced by the mismatch diagnostic so a future
regression involving a new body shape is still visible.
2. Add a structured diagnostic to ``_key_fingerprint_matcher``. It
was previously raising a bare ``AssertionError("API key
fingerprints differ")`` with zero context -- exactly the
anti-pattern the body matcher had before this PR.
3. Surface "shouldn't-happen" branches via ``vcr_diag_write_line``:
* ``_strip_image_b64_payloads`` -- logs when ``response``,
``response['body']``, or ``response['body']['string']`` arrives
in an unexpected shape (vcrpy contract violation).
* ``_compute_key_fingerprint`` -- logs the ``"no-key"`` fallback
with the request method/URL so a stripped-auth-header bug is
visible instead of masked.
* ``_canonical_body`` -- logs its own empty-bytes fallback when a
body has a shape ``_materialize_iterable_body`` did not handle.
4. Re-introduce per-episode body-hash logging in
``_RedisPersister.save_cassette`` (was reverted in 927c5548 as
"noisy"). Quantified cost: ~25 KB of CI log per session at peak,
~ms-scale CPU, zero output in steady state (no save = no log).
Trade-off favours keeping it: lets two consecutive CI runs be
diffed by body hash, which is how we will spot the next regression
in the same class.
All call sites still work: local repro confirms iter==iter HIT,
iter!=iter raises, plain-bytes HIT, body-hash log emits via the same
per-PID file plumbing as the matcher diagnostics.
* chore(tests): symmetrize diag-log cleanup across every VCR-using conftest
``image_gen_tests/conftest.py`` was the only suite that cleared
``test-results/vcr-diagnostics/*.log`` at session start. The other 12
VCR-using conftests inherited any stale per-PID logs from a previous
local run and would dump them in the terminal summary -- harmless in
CI (fresh container) but confusing locally when running multiple
suites in sequence.
Extracts the cleanup into a ``reset_vcr_diag_dir`` helper in
``tests/_vcr_conftest_common.py`` and calls it from every VCR-using
conftest's ``pytest_configure``. Same single source of truth, no
inline duplication.
* fix(tests): gate body materialization on __next__ and strip PR comments
aiohttp/vcrpy stores the json kwarg as a dict; _materialize_iterable_body
was iterating it via __iter__ and joining the keys, replacing the request
body with concatenated key names ("textlanguageentities"). Gate on
__next__ so containers (dict/list/tuple) are left alone — only single-use
iterators like httpx's bytes_iterator / list_iterator are materialized.
Log diagnostic line when chunk type is unrecognized.
* fix(tests): JSON-encode dict bodies in canonical_body for stable matching
aiohttp stubs store the json kwarg as a dict; the fallback that compared
all dicts as b"" caused concurrent presidio analyze calls to be served
the wrong cassette episode. JSON-encode with sort_keys for stable bytes.
* fix(tests): guard emit_vcr_diagnostic_log against multi-conftest re-emission
Co-authored-by: Yassin Kortam <yassin@berri.ai>
* fix(tests): globalize multipart-boundary pin + stabilize whisper fixtures
Diagnostic shows audio_testing was silently re-recording 50+ live Whisper
episodes per CI run (over MAX_EPISODES_PER_CASSETTE, so the persister
refused to save). Two changes:
* Move the session-autouse _pin_multipart_boundary fixture into the
shared _vcr_conftest_common module so every VCR-using suite picks it
up via a single import. image_gen had it inline; the other 12 suites
silently lacked it.
* Replace the module-level open("rb") audio file handles in test_whisper
with cached bytes + a per-call (filename, bytes, mimetype) tuple,
mirroring the image_edits raw-bytes pattern. Stops the file-pointer-
at-EOF bug where the second test got an empty multipart body.
* chore(tests): drop per-episode body-hash dump and redundant emit guard
---------
Co-authored-by: shin-berri <shin-laptop@berri.ai>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
124 lines
3.1 KiB
Python
124 lines
3.1 KiB
Python
# conftest.py
|
|
|
|
import asyncio
|
|
import importlib
|
|
import os
|
|
import sys
|
|
|
|
import pytest
|
|
|
|
sys.path.insert(
|
|
0, os.path.abspath("../..")
|
|
) # Adds the parent directory to the system path
|
|
|
|
import litellm # noqa: E402
|
|
|
|
from tests._vcr_conftest_common import ( # noqa: E402,F401
|
|
VerboseReporterState,
|
|
_pin_multipart_boundary,
|
|
apply_vcr_auto_marker_to_items,
|
|
emit_cassette_cache_session_banner,
|
|
emit_vcr_classification_summary,
|
|
emit_vcr_diagnostic_log,
|
|
install_live_call_probe,
|
|
record_vcr_outcome,
|
|
register_persister_if_enabled,
|
|
reset_vcr_diag_dir,
|
|
vcr_config_dict,
|
|
)
|
|
|
|
_verbose_state = VerboseReporterState()
|
|
|
|
|
|
@pytest.fixture(scope="module")
|
|
def vcr_config():
|
|
return vcr_config_dict()
|
|
|
|
|
|
def pytest_recording_configure(config, vcr):
|
|
register_persister_if_enabled(vcr)
|
|
|
|
|
|
@pytest.hookimpl(hookwrapper=True)
|
|
def pytest_runtest_makereport(item, call):
|
|
outcome = yield
|
|
rep = outcome.get_result()
|
|
setattr(item, f"rep_{rep.when}", rep)
|
|
|
|
|
|
@pytest.fixture(autouse=True)
|
|
def _vcr_outcome_gate(request, vcr):
|
|
install_live_call_probe(request, vcr)
|
|
yield
|
|
record_vcr_outcome(request, vcr)
|
|
|
|
|
|
def pytest_configure(config):
|
|
_verbose_state.remember_pluginmanager(config)
|
|
reset_vcr_diag_dir()
|
|
|
|
|
|
def pytest_runtest_logreport(report):
|
|
_verbose_state.maybe_emit_verdict(report)
|
|
|
|
|
|
@pytest.fixture(scope="session")
|
|
def event_loop():
|
|
try:
|
|
loop = asyncio.get_running_loop()
|
|
except RuntimeError:
|
|
loop = asyncio.new_event_loop()
|
|
yield loop
|
|
loop.close()
|
|
|
|
|
|
@pytest.fixture(scope="function", autouse=True)
|
|
def setup_and_teardown():
|
|
"""
|
|
This fixture reloads litellm before every function. To speed up testing by removing callbacks being chained.
|
|
"""
|
|
sys.path.insert(
|
|
0, os.path.abspath("../..")
|
|
) # Adds the project directory to the system path
|
|
|
|
import litellm
|
|
|
|
importlib.reload(litellm)
|
|
|
|
try:
|
|
if hasattr(litellm, "proxy") and hasattr(litellm.proxy, "proxy_server"):
|
|
import litellm.proxy.proxy_server
|
|
|
|
importlib.reload(litellm.proxy.proxy_server)
|
|
except Exception as e:
|
|
print(f"Error reloading litellm.proxy.proxy_server: {e}")
|
|
|
|
loop = asyncio.get_event_loop_policy().new_event_loop()
|
|
asyncio.set_event_loop(loop)
|
|
print(litellm)
|
|
yield
|
|
|
|
# Teardown code (executes after the yield point)
|
|
loop.close() # Close the loop created earlier
|
|
asyncio.set_event_loop(None) # Remove the reference to the loop
|
|
|
|
|
|
def pytest_collection_modifyitems(config, items):
|
|
apply_vcr_auto_marker_to_items(items)
|
|
|
|
custom_logger_tests = [
|
|
item for item in items if "custom_logger" in item.parent.name
|
|
]
|
|
other_tests = [item for item in items if "custom_logger" not in item.parent.name]
|
|
|
|
custom_logger_tests.sort(key=lambda x: x.name)
|
|
other_tests.sort(key=lambda x: x.name)
|
|
|
|
items[:] = custom_logger_tests + other_tests
|
|
|
|
|
|
def pytest_terminal_summary(terminalreporter, exitstatus, config):
|
|
emit_cassette_cache_session_banner(terminalreporter)
|
|
emit_vcr_classification_summary(terminalreporter)
|
|
emit_vcr_diagnostic_log(terminalreporter)
|