362 Commits

Author SHA1 Message Date
Mateo Wang b4aee2c7dd test(vcr): close out the remaining VCR live-call leaks (#29603)
* Fix remaining VCR live-call leaks

* test(vcr): dedupe live-test helpers and drop spurious kwargs

Extract the duplicated isVertexQuotaError/runVertexRequestOrSkip Vertex
quota-skip helpers into tests/pass_through_tests/vertex_test_helpers.js and the
duplicated _skip_live_prompt_caching_test guard into tests/_live_test_helpers.py
so each lives in one place. In test_aarun_thread_litellm, build a separate
message_data carrying role/content for add_message and a thread_data without
them for run_thread/run_thread_stream/get_messages, which no longer receive the
spurious message fields.

* test(overhead): assert mock transport is exercised in non-streaming and stream tests
2026-06-03 13:46:43 -07:00
Mateo Wang bfbb5d2375 fix(ci): make litellm_internal_staging green (logging test + Bedrock Opus 4.7 self-heal) (#29344)
* test(logging): align DB metrics event_metadata assertions with safe redaction

PR #28909 hardened log_db_metrics to emit a minimal, non-sensitive
event_metadata (only table_name when present, otherwise None) instead of
dumping function_name, function_kwargs, and function_args onto the span. The
test in test_log_db_redis_services was not updated and still asserted
"function_name" in event_metadata, which raised TypeError (argument of type
'NoneType' is not iterable) and turned the logging_testing CI job red on
litellm_internal_staging.

Update test_log_db_metrics_success to assert event_metadata is None when no
table_name is passed, and add test_log_db_metrics_event_metadata_is_safe as a
regression guard verifying that only the table name surfaces and that sensitive
kwargs (tokens, prisma client) are never dumped.

* test(bedrock): self-heal opus-4-7 grid cells when unentitled on CI

The bedrock-claude-opus-4-7 converse cells are unentitled on the Bedrock CI
account, so they were marked xfail. xfail keeps reporting them as expected
failures even after access is granted, so the wire translation never gets
verified again. Now the cell makes the call and skips only when Bedrock
replies "is not available for this account"; the moment the model is
entitled the same cells run their full assertions with no edit.

A focused unit test pins the tolerance predicate so any other failure still
surfaces loudly and the available path still runs the assertions.
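
A minimal sketch of the tolerance predicate described above, assuming the
entitlement refusal can be recognized from the error text alone. The names
``is_unentitled_error`` and ``run_cell`` are hypothetical; the real helper in
the grid test may be shaped differently.

```python
# Hypothetical names; the actual helper in the test suite may differ.
UNENTITLED_MARKER = "is not available for this account"


def is_unentitled_error(exc: BaseException) -> bool:
    """Return True only for the Bedrock entitlement refusal."""
    return UNENTITLED_MARKER in str(exc)


def run_cell(call, assertions):
    """Run one grid cell: skip when unentitled, otherwise run the assertions."""
    try:
        response = call()
    except Exception as exc:
        if is_unentitled_error(exc):
            return "skipped"
        raise  # any other failure still surfaces loudly
    assertions(response)  # the available path keeps its full assertions
    return "passed"
```

Because the check runs on every call, the cell starts asserting again as soon
as the account is entitled, with no edit to the test.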
2026-05-30 13:57:18 -07:00
Mateo Wang f11c12d157 Revert "chore(tests): migrate Bedrock CI to AWS account 941277531214 (#28728)" (#29326)
This reverts the Bedrock CI account migration (#28728). The original account
(888602223428) was put under an AWS security restriction after a leaked key
and has since been reactivated, while the replacement account (941277531214)
lacks access to several models the suites exercise (legacy Bedrock Claude 3
models, Cohere, Nova Canvas image gen, Bedrock batch inference, and flagship
Opus). Pointing CI back at the reactivated account restores that coverage.

This is the exact inverse of #28728: all hardcoded 941277531214 references go
back to 888602223428 (provisioned/imported-model ARNs, AgentCore runtime ARNs
and their suffixes, batch execution role ARN, and the example proxy config),
the S3 buckets revert to litellm-proxy and load-testing-oct, the guardrail IDs
revert to wf0hkdb5x07f and ff6ujrregl1q, the SageMaker endpoint and Knowledge
Base revert to their original ids, and the live-call tests go back to the
legacy model strings. The grid_spec fail_reason workaround for the unentitled
Opus cells is dropped while keeping the unrelated bedrock_effort_ceiling field
added after the migration.

The CircleCI AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY env vars still point at
941277531214 and must be set to the reactivated account's fresh credentials
separately via the CircleCI API; AWS_REGION_NAME stays us-west-2.
2026-05-30 11:26:24 -07:00
Mateo Wang f9407bc036 chore(tests): migrate Bedrock CI to AWS account 941277531214 (#28728)
* chore(tests): migrate Bedrock CI from AWS account 888602223428 to 941277531214

The original account (888602223428) was put under a security restriction by
AWS after a root access key leaked in a PR comment. While that account works
its way through the AWS Support unlock process, Bedrock-touching CI tests have
been migrated to a fresh account (941277531214).

Changes:
  - Replace 26 hardcoded references to 888602223428 with 941277531214 across
    8 files (provisioned-model ARNs, imported-model ARNs, AgentCore runtime
    ARNs, batch execution role ARN, and example proxy config).
  - The provisioned-model and imported-model ARNs are referenced only from
    mocked unit tests — no AWS resources to recreate.
  - The batch execution IAM role has been recreated in the new account with
    the same name and equivalent permissions.
  - The two AgentCore runtimes (hosted_agent_r9jvp-3ySZuRHjLC,
    hosted_agent_13sf6-cALnp38iZD) are being recreated in the new account
    under the same names — see tools/agentcore-deploy/ in a follow-up.

CircleCI env vars AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY / AWS_REGION_NAME
were updated separately via the CircleCI API to point at the new account.

Smoke-tested locally against the new account:
  aws bedrock-runtime converse --region us-west-2 \
    --model-id us.anthropic.claude-sonnet-4-5-20250929-v1:0 \
    --messages '[{"role":"user","content":[{"text":"ping"}]}]'
  → 200, model returned 'pong'

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore(tests): refresh AgentCore ARN suffixes to match newly-deployed runtimes

The first migration commit replaced just the account ID, but AgentCore
auto-assigns a random 10-char suffix to every runtime on creation — we
can't reuse the original suffixes (`3ySZuRHjLC`, `cALnp38iZD`) in the
new account. Updated the AgentCore-runtime ARNs in the three files that
reference real runtime IDs (not the mock-based unit-test ARNs).

Deployed runtimes:
  arn:aws:bedrock-agentcore:us-west-2:941277531214:runtime/hosted_agent_r9jvp-Rq79QFC2fp
  arn:aws:bedrock-agentcore:us-west-2:941277531214:runtime/hosted_agent_13sf6-4046UzHSwy

Both runtimes are status=READY and pass a smoke invoke:
  $ aws bedrock-agentcore invoke-agent-runtime --agent-runtime-arn ... --payload '{"prompt":"ping"}'
  → 200, {"result": "echo: ping"}

The agent is a minimal echo (see /tmp/agentcore_deploy/agent.py for the
deploy artifacts). Tests that only verify the SDK wiring will pass; if any
test asserts on agent output content, swap the echo for the real agent.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore(tests): point Bedrock batch tests at new-account S3 bucket

The account migration (888602223428 -> 941277531214) was a flat
account-ID swap, which only rewrites ARNs that embed the account
number. S3 bucket names carry no account ID, so the live Bedrock
batch tests still uploaded to `litellm-proxy` — a bucket that lives
in the old account. S3 names are globally unique, and the old account
still holds that name, so it can't be recreated in the new account.

Rename to `litellm-proxy-941277531214` (account-ID suffix guarantees
global uniqueness). The bucket must be created in 941277531214 and the
batch execution role granted s3:GetObject/PutObject/ListBucket on it
before this job is run in CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(tests): point live S3 logging test at new-account bucket

Same account-ID-free blind spot as the batch bucket: `load-testing-oct`
lives in the old account and its name can't be reused globally. The
`logging_testing` CI job is wired into the workflow and runs
test_basic_s3_logging, which uploads to this bucket with the CI env
creds, then lists and deletes objects — a live dependency.

Rename to `load-testing-oct-941277531214`. The bucket must exist in the
new account with the CI IAM principal granted
s3:PutObject/GetObject/ListBucket/DeleteObject before this job runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(tests): repoint Bedrock guardrail IDs to new-account guardrails

The migration left guardrail IDs untouched (no account ID in them), so
all live guardrail tests failed with "guardrail identifier or version
does not exist" against 941277531214. Recreated both guardrails in the
new account and updated the hardcoded IDs:
  - wf0hkdb5x07f -> zgkmukebruil (PII mask: PHONE + CREDIT_DEBIT_CARD,
    with explicit inputAction=ANONYMIZE so masking applies to INPUT,
    which is the source litellm's moderation hook sends)
  - ff6ujrregl1q -> 4w3d1di3snt5 (blocks "coffee"; blocked message set
    to the exact string the tests assert on)

Updated test_bedrock_guardrails.py, otel_test_config.yaml, and the
guardrailConfig in test_bedrock_completion.py. Verified locally: the 5
previously-failing guardrail tests now pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(bedrock): migrate legacy models to current inference profiles

The new CI account (941277531214) cannot invoke legacy Bedrock models
(AWS gates them: "marked by provider as Legacy... not actively using in
the last 30 days"). Migrated the live-call tests:
  - anthropic.claude-3-sonnet-20240229    -> us.anthropic.claude-sonnet-4-5-20250929-v1:0
  - anthropic.claude-3-haiku-20240307     -> us.anthropic.claude-haiku-4-5-20251001-v1:0
Current Claude models on Bedrock require the us. inference-profile prefix
(bare on-demand ids are rejected).

cohere.command-r-plus has no working replacement (all Cohere is legacy-
gated in the new account): swapped to claude-haiku-4-5 in provider-
agnostic param lists. amazon.titan-image-generator skipped (no working
replacement). Mocked/transformation/cost tests that reference the legacy
strings are intentionally left unchanged. Verified live against the new
account.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(bedrock): repoint SageMaker + Knowledge Base to new-account resources

These referenced account-scoped resources by hardcoded id that only
existed in the old account, so the migration's account-ID swap missed
them. Recreated in 941277531214 and repointed:
  - SageMaker endpoint jumpstart-dft-hf-textgeneration1-mp-20240815-185614
    -> litellm-ci-textgen (gpt2 on a TGI container, ml.g5.xlarge)
  - Bedrock Knowledge Base T37J8R4WTM -> LCYXFBR2TU (OpenSearch Serverless
    vector store + titan-embed-text-v2, seeded with a LiteLLM doc)
Verified live: test_sagemaker.py (12 passed) and
test_bedrock_knowledgebase_hook.py (12 passed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(reasoning_effort_grid): skip bedrock claude-opus-4-7 cells (not entitled on 941277531214)

claude-opus-4-7 is listed in the new Bedrock CI account's foundation
models but invoke is denied (AccessDeniedException: "not available for
this account"). Bedrock access to the flagship Opus requires an AWS
Sales request, not the self-serve model-access toggle, so it can't be
enabled inline with the rest of the account migration.

Add an optional `skip_reason` to ModelEntry and set it on the
bedrock-claude-opus-4-7 entry; the grid test honors it via pytest.skip.
Cell count (231) and route coverage are unchanged, so the structural
asserts still pass. Restore coverage by deleting the one skip_reason
line once access is granted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(bedrock): swap/skip legacy-gated models unavailable on new CI account

The migrated AWS account (941277531214) cannot access several models that
the old account could, so the remaining red CI jobs were hitting real
Bedrock "Access denied / Legacy" and "account not authorized" errors:

- image_gen: skip both Nova Canvas test classes (amazon.nova-canvas-v1:0 is
  legacy-gated), matching the existing titan skip.
- batches: skip test_async_file_and_batch (Bedrock batch inference is not
  authorized on the new account; requires an AWS support case).
- litellm_overhead: swap legacy claude-3-5-haiku for the active
  us.anthropic.claude-haiku-4-5 inference profile.
- test_completion_claude_3_function_call: swap legacy claude-3-sonnet for the
  active us.anthropic.claude-sonnet-4-5 inference profile.

https://claude.ai/code/session_01Y7zgHYu9GX29YRwV4yiWAa

* test(bedrock): fix remaining e2e legacy-model + batch failures on new CI account

- e2e_openai_endpoints: skip test_bedrock_batches_api (Bedrock batch inference
  is not authorized on account 941277531214) and migrate the missed
  s3_bucket_name in oai_misc_config.yaml to litellm-proxy-941277531214.
- build_and_test: swap legacy bedrock claude-3-sonnet for the active
  us.anthropic.claude-sonnet-4-5 inference profile in the proxy structured
  output e2e test.

https://claude.ai/code/session_01Y7zgHYu9GX29YRwV4yiWAa

* test(bedrock): make opus-4-7 + batch cells fail loudly and mock image-gen (#28791)

Replace the silent skips added for the new CI account with noisier behavior:
- reasoning-effort grid: opus-4-7 cells now fail (when AWS creds are present)
  instead of skipping, so the missing entitlement stays visible in CI; they
  still skip when AWS creds are absent (local dev)
- Bedrock batch inference tests: drop the skip so they run and fail until
  batch access is granted
- Titan + Nova Canvas image-gen tests: mock the Bedrock HTTP call so the
  transform + cost-tracking path stays under test without live model access

https://claude.ai/code/session_01MT7SWDnXUjv6e6EPG7BDjT

Co-authored-by: Claude <noreply@anthropic.com>

* test(bedrock): use pytest.xfail for known-failing opus-4-7 cells

Replace pytest.fail with pytest.xfail when a model has a fail_reason,
so known-broken cells stay visible as XFAIL without keeping CI red.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

---------

Co-authored-by: Mateo <mateo@Mateos-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
2026-05-25 12:03:17 -07:00
Mateo Wang bb448b0031 fix(tests): stabilize image-edit VCR cassettes to stop live gpt-image-1 spend (#28110)
* fix(tests): stabilize image-edit VCR cassettes to stop live gpt-image-1 spend

The image-edit cassettes for ``gpt-image-1`` were accumulating >50
episodes and being refused by the persister
(``tests/_vcr_redis_persister.py``), so every CI run was hitting the
real OpenAI endpoint. The async parametrize was the clearest tell:
``test_openai_image_edit_litellm_sdk[True]`` cached to 1 entry, but the
``[False]`` (async) sibling grew to 51 entries and never replayed.

Two non-deterministic sources were fueling the growth, both fixed
here (items 1 and 2); item 3 adds a one-shot flush for the stale
cassettes. After this patch, the cassettes settle at one episode per
unique call and replay for the 24-hour TTL like every other suite.

1. Pin httpx's multipart boundary at the source. The existing
   ``_normalize_multipart_boundary`` rewrites the boundary in the
   ``Content-Type`` header reliably, but on the async transport path
   the body is not always a contiguous ``bytes`` object when
   ``before_record_request`` runs, so the body-side replacement
   silently no-ops and the recorded cassette retains the random
   ``boundary=<hex>`` string. The next CI run gets a fresh random
   boundary, the ``safe_body`` matcher misses, and
   ``record_mode="new_episodes"`` appends another episode. Wrapping
   ``httpx._multipart.MultipartStream.__init__`` so it always uses
   ``vcr-static-boundary`` when no boundary is supplied eliminates
   the variance for both sync and async paths and leaves the normalizer
   in place as a backstop. Exposed as
   ``pin_httpx_multipart_boundary`` so other multipart-heavy suites
   (audio, ocr, batches) can adopt the same fixture later.

2. Pass raw ``bytes`` (not ``BytesIO`` streams) through the
   image-edit fixtures. A ``BytesIO`` whose file pointer is at EOF
   after the first multipart upload silently encodes an empty image on
   the next SDK / Router retry — yet another divergent body that VCR
   records as a new episode. ``bytes`` are immutable and position-less,
   so retries re-encode an identical payload every time. This is also
   a small production-correctness improvement: a customer passing
   ``BytesIO`` today would hit the same empty-body retry bug. The
   BytesIO-specific smoke test
   (``test_openai_image_edit_with_bytesio``) is preserved by giving
   ``get_test_images_as_bytesio`` its own factory instead of aliasing
   the bytes one.

3. Add ``scripts/flush_image_edit_vcr_cassettes.py`` — a one-shot
   Redis SCAN/DEL helper that clears the bloated pre-fix cassettes
   under ``litellm:vcr:cassette:tests/image_gen_tests/test_image_edits/*``.
   Without this, the next CI run still loads the existing 51-entry
   cassette, the new fixed-boundary body still doesn't match any of
   the stale entries, the persister still refuses to save, and the
   bleed continues. Run once with the production
   ``CASSETTE_REDIS_URL`` after merge (dry-run by default).
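
The retry hazard in item 2 can be reproduced with the standard library
alone. This is a sketch, not the SDK's encoder: ``encode_upload`` is a
hypothetical stand-in that reads a stream if given one and passes bytes
through otherwise.

```python
import io

payload = b"\x89PNG-fake-image-bytes"  # stand-in for real image content


def encode_upload(image) -> bytes:
    """Hypothetical encoder: read streams, pass bytes through unchanged."""
    if hasattr(image, "read"):
        return image.read()
    return bytes(image)


stream = io.BytesIO(payload)
first_from_stream = encode_upload(stream)  # consumes the stream
retry_from_stream = encode_upload(stream)  # pointer is at EOF, so this is empty

first_from_bytes = encode_upload(payload)
retry_from_bytes = encode_upload(payload)  # identical on every attempt
```

The stream retry yields an empty body while the bytes retry repeats the
original payload, which is why the fixtures now hand over raw ``bytes``.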

* DIAGNOSTIC: log VCR body mismatches + per-episode body hashes

Temporary observability boost so we can root-cause why
``test_image_edits.py`` async parametrizes still record fresh
episodes on every CI run even though the multipart boundary is now
pinned (sync parametrizes cache cleanly as VCR HIT). The matcher
currently raises ``AssertionError("request bodies differ")`` with
zero context, so we cannot tell whether the live body genuinely
varies, the matcher is comparing a bytes object to a stream object,
or the normalizer is silently skipping the body because it is not
bytes/str.

Three logs added; the first two are worth keeping permanently, the
third is intended to be reverted after the diagnosis lands:

1. ``_safe_body_matcher`` now emits a structured stderr block on
   mismatch (type of each side, length, SHA-256, first divergent
   byte offset, ±100-byte window). Always-on -- mismatches are
   signal, not noise, and the existing per-test verdict already
   logs once per test. PERMANENT.

2. ``_normalize_multipart_boundary`` now logs to stderr when the
   body type is not bytes/bytearray/str -- the silent ``else:
   return`` branch was masking exactly the case we suspect is
   firing on async (httpx ``MultipartStream`` handed to vcrpy
   before the body is read). PERMANENT.

3. ``_RedisPersister.save_cassette`` now logs every episode's body
   SHA-256, length, and 120-byte preview at save time. This lets
   two consecutive CI runs be diffed: if the same test records a
   different hash run-to-run, the live body genuinely varies; if
   both runs record the same hash but the matcher still misses, the
   bug is in the matcher itself. TEMPORARY -- revert once the
   async variance is identified and fixed.

Once a single ``image_gen_testing`` CI run produces these logs,
revert this commit (or just the persister hash block) with a force
push so the cassette save path is not noisy in steady-state.

* DIAGNOSTIC: route VCR diagnostics through per-PID files (bypass xdist capture)

Re-push of the diagnostic logging from the previous commit, this
time wired so the output actually survives to the CI log. xdist
captures stdout/stderr from every passing test in the worker
process; the body-matcher and normalizer-skip diagnostics fire from
inside vcrpy machinery during the test, so for any test that
ultimately passes (which is all of them once the cassettes are
recorded), the diagnostic lines are silently swallowed.

Fix: write each diagnostic line to a per-PID file under
``test-results/vcr-diagnostics/<pid>.log`` instead of writing to
stderr. The controller's ``pytest_terminal_summary`` aggregates
those files and writes them through ``terminalreporter.write_line``,
which is not subject to per-test capture. As a bonus,
``test-results/`` is already collected by the ``store_test_results``
step in CircleCI, so the raw per-worker logs survive as build
artifacts even after the test session ends.
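
A rough sketch of the per-PID file approach, assuming a temp directory in
place of ``test-results/vcr-diagnostics``. The body of
``vcr_diag_write_line`` here is illustrative only, and
``collect_diag_lines`` is a hypothetical name for the controller-side
aggregation step.

```python
import os
import tempfile
from pathlib import Path

# Assumption: a temp directory stands in for test-results/vcr-diagnostics.
DIAG_DIR = Path(tempfile.mkdtemp()) / "vcr-diagnostics"


def vcr_diag_write_line(line: str, diag_dir: Path = DIAG_DIR) -> None:
    """Append one diagnostic line to this process's own log file."""
    diag_dir.mkdir(parents=True, exist_ok=True)
    with open(diag_dir / f"{os.getpid()}.log", "a", encoding="utf-8") as fh:
        fh.write(line.rstrip("\n") + "\n")


def collect_diag_lines(diag_dir: Path = DIAG_DIR) -> list[str]:
    """Controller side: read every per-PID file in a stable order."""
    lines: list[str] = []
    if not diag_dir.is_dir():
        return lines
    for path in sorted(diag_dir.glob("*.log")):
        lines.extend(path.read_text(encoding="utf-8").splitlines())
    return lines
```

Each worker only ever appends to its own file, so no cross-process locking
is needed, and output capture never sees the lines.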

Three call sites updated:

1. ``_emit_body_mismatch_diagnostic`` (matcher) -- writes the
   structured type/length/sha/window block via ``vcr_diag_write_line``.
2. ``_normalize_multipart_boundary`` -- logs the silent-skip path
   (body not bytes/bytearray/str) the same way.
3. ``_maybe_log_episode_body_hashes`` (persister) -- replaces the
   ``_log.warning`` calls (which the root-logger config also
   swallows in CI) with ``vcr_diag_write_line``.

Image-gen conftest is the only suite wired to dump the aggregated
log at session end. Other suites can opt in by adding
``emit_vcr_diagnostic_log(terminalreporter)`` to their own
``pytest_terminal_summary``. The diagnostic dir is cleared at the
start of each session (controller-only) so a local rerun does not
mix output from prior runs.

Same revert plan as the previous diagnostic commit: keep the
matcher + normalizer skip diagnostics permanently (they only fire
on signal events), revert the persister body-hash dump once the
async variance is identified.

* fix(tests): coalesce iterable request bodies before matching/recording

Root cause of the residual async image-edit cassette leak. The
diagnostic run for ``ba3915d9`` printed:

  [vcr-safe-body-matcher] request body mismatch
    body[a]: type='list_iterator' length=unknown sha256=N/A
    body[b]: type='list_iterator' length=unknown sha256=N/A

httpx's async transport hands vcrpy a ``request.body`` that is a
``list_iterator`` over multipart chunks rather than a contiguous
``bytes`` blob. Two consequences:

1. ``_safe_body_matcher`` compares the two iterator objects with
   ``==``, which is identity comparison for arbitrary iterators, so
   semantically identical multipart bodies never compare equal, and
   ``record_mode="new_episodes"`` appends a new episode on every CI
   run until the cassette crosses ``MAX_EPISODES_PER_CASSETTE`` and
   the persister refuses to save (this is exactly what the OVERFLOW
   warning has been catching).
2. ``_normalize_multipart_boundary`` falls through to its
   ``else: return`` branch because the body is neither bytes nor
   str, so any residual random boundary characters in the body bytes
   are never rewritten.

Sync requests do not hit this code path: httpx's sync transport
hands vcrpy a single ``bytes`` body, so ``==`` works and the
boundary normalizer runs as intended. That is why
``test_openai_image_edit_litellm_sdk[True]`` records to ``entries=1``
and replays cleanly while ``[False]`` (async) kept growing by one
episode per run.

Fix: add ``_materialize_iterable_body`` which coalesces an iterable
``request.body`` into ``bytes`` in-place. Call it from two places:

* The top of ``_before_record_request``, so the boundary normalizer
  and the cassette serializer both see bytes from then on.
* The top of ``_safe_body_matcher``, as defense in depth in case a
  future vcrpy code path invokes the matcher without first going
  through ``_before_record_request``.

The vcrpy ``Request`` is a wrapper used for matching and recording;
the underlying httpx transport sends its own request body
separately, so replacing the iterator on the vcrpy wrapper does
not starve the live HTTP send.
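
The identity-comparison trap is plain Python behaviour and can be shown
without vcrpy or httpx. The chunk contents below are made up for
illustration.

```python
chunks = [b"--boundary\r\n", b"payload", b"\r\n--boundary--\r\n"]

body_a = iter(chunks)
body_b = iter(chunks)

# Iterators define no value-based __eq__, so == falls back to identity.
iterators_equal = body_a == body_b

# Joining the chunks gives values that compare by content.
bytes_equal = b"".join(iter(chunks)) == b"".join(iter(chunks))
```

Here ``iterators_equal`` is ``False`` even though both iterators walk the
same chunks, while ``bytes_equal`` is ``True``.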

After this lands the async parametrizes should flip from
``[VCR MISS:RECORDED] entries=N+1`` to ``[VCR HIT] entries=N`` on
the next CI run, matching the sync side and dropping the residual
~$3/day to $0.

* fix(tests): handle bytes_iterator + never leave an exhausted body

Follow-up to 8e08272b. The previous attempt at coalescing iterable
request bodies bailed out (``return`` without writing
``request.body``) whenever it could not classify the chunk type.
That was the wrong failure mode for one critical case: vcrpy
sometimes presents the body as ``iter(some_bytes)``, whose Python
type is ``bytes_iterator`` and which yields ``int`` byte values
(0-255), not byte chunks. The old code saw an ``int`` chunk, hit
the ``else: return`` branch, and left ``request.body`` pointing at
the now-exhausted iterator.

The post-fix diagnostic run made this loud:

  [vcr-safe-body-matcher] request body mismatch
    body[a]: type='bytes_iterator' length=unknown sha256=N/A
    body[b]: type='bytes_iterator' length=unknown sha256=N/A

Every async image-edit test then ballooned from entries=2 to
entries=10 in that single CI run -- the exhausted iterator meant
the live multipart upload went out as an empty body, OpenAI
returned 400, the SDK + flaky retries fired, each retry got a
fresh iterator that my hook exhausted again, and ``new_episodes``
recorded each failed attempt as a new cassette episode.

This patch:

* Recognizes ``bytes_iterator`` (chunks are ``int``) and
  reconstructs the buffer via ``bytes(chunks)``.
* Keeps the existing ``list_iterator``-over-bytes-chunks handling
  via ``b"".join(...)``.
* **Always writes a bytes value back to ``request.body`` after
  consuming the iterator.** If the chunk shape is unrecognized,
  ``request.body`` is set to ``b""`` rather than left as an
  exhausted iterator. That is wrong in the sense of "we lost the
  body" but right in the sense of "the failure mode is now visible
  (live API call sends empty body and fails fast) instead of
  invisible (corrupt cassette grows silently)". Combined with the
  matcher diagnostic, any future regression in this code path will
  surface in the CI log immediately.

Local verification covers ``bytes_iterator``, ``list_iterator``
over bytes chunks, generator over bytes chunks, empty iterator,
already-bytes (idempotent), identical-content iterator equality
in the matcher (now matches), and differing-content iterator
inequality (still raises).
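
One way the shape handling listed above could look, as a sketch under
assumptions: ``coalesce_body`` is a hypothetical pure function, whereas the
real ``_materialize_iterable_body`` mutates the request in place.

```python
def coalesce_body(body) -> bytes:
    """Hypothetical sketch: turn a body of any handled shape into bytes."""
    if body is None:
        return b""
    if isinstance(body, (bytes, bytearray)):
        return bytes(body)  # already materialized, so this is idempotent
    if isinstance(body, str):
        return body.encode("utf-8")
    chunks = list(body)  # consumes the iterator exactly once
    if all(isinstance(c, int) for c in chunks):
        return bytes(chunks)  # bytes_iterator yields ints in 0-255
    if all(isinstance(c, (bytes, bytearray)) for c in chunks):
        return b"".join(chunks)  # list_iterator or generator over chunks
    # Unrecognized shape: return a visible empty body rather than leave
    # the caller holding an exhausted iterator.
    return b""
```

The caller would always assign the return value back to the request, so
no code path leaves a consumed iterator behind.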

* fix(tests): clear vcrpy's sticky _was_iter flag so materialized bodies stay bytes

Actual root cause of the async image-edit cassette leak. The
previous diagnostic run produced this dead giveaway:

  [vcr-episode-body-hash] ... episode[0]: body type='bytes_iterator'
    is not bytes/bytearray/str -- cannot hash
  [vcr-safe-body-matcher] request body mismatch
    body[a]: type='bytes_iterator' length=unknown sha256=N/A
    body[b]: type='bytes_iterator' length=unknown sha256=N/A

Both sides of the matcher were ``bytes_iterator`` **after** the
materializer had supposedly converted them to bytes. That made no
sense until I read vcrpy's ``Request`` class.

vcrpy's ``Request`` keeps two private flags that are set in
``__init__`` from the original body's type and **never cleared by
the setter**:

  def __init__(self, method, uri, body, headers):
      self._was_file = hasattr(body, "read")
      self._was_iter = _is_nonsequence_iterator(body)
      ...

  @property
  def body(self):
      if self._was_file: return BytesIO(self._body)
      if self._was_iter: return iter(self._body)
      return self._body

  @body.setter
  def body(self, value):
      if isinstance(value, str): value = value.encode("utf-8")
      self._body = value   # <-- does NOT touch _was_iter / _was_file

So when httpx's async transport hands vcrpy an iterator body,
``_was_iter`` becomes ``True`` and stays there forever. Even after
``_materialize_iterable_body`` writes plain bytes via
``request.body = out``, the next read of ``.body`` re-wraps the
stored bytes in ``iter()`` -- producing a fresh ``bytes_iterator``
that compares unequal to any other ``bytes_iterator`` via object
identity. The matcher missed every time, the cassette grew by one
episode per run, and the persister saw the same iterator type when
trying to hash the body for the diagnostic log.

Fix: after writing the materialized bytes, also force
``_was_iter`` and ``_was_file`` to ``False``. vcrpy exposes no
public API for this, so we touch the private flags directly --
acknowledged as a pragmatic test-only hack with a clear unit
boundary (the only call site is ``_materialize_iterable_body``).
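
The sticky-flag behaviour can be reproduced with a small stand-in class
modeled on the excerpt above. ``StickyRequest`` and ``materialize`` are
hypothetical; this is not vcrpy's real class, and the ``__next__`` check
is a simplification of its iterator test.

```python
from io import BytesIO


class StickyRequest:
    """Stand-in modeled on the excerpt above; not vcrpy's real class."""

    def __init__(self, body):
        self._was_file = hasattr(body, "read")
        self._was_iter = hasattr(body, "__next__")  # simplified check
        if self._was_file:
            self._body = body.read()
        elif self._was_iter:
            self._body = list(body)
        else:
            self._body = body

    @property
    def body(self):
        if self._was_file:
            return BytesIO(self._body)
        if self._was_iter:
            return iter(self._body)
        return self._body

    @body.setter
    def body(self, value):
        if isinstance(value, str):
            value = value.encode("utf-8")
        self._body = value  # the flags are left untouched


def materialize(request) -> None:
    """Store plain bytes, then clear the flags so the getter stops re-wrapping."""
    raw = request._body
    if not isinstance(raw, (bytes, bytearray)):
        chunks = list(raw)
        if all(isinstance(c, int) for c in chunks):
            raw = bytes(chunks)
        else:
            raw = b"".join(chunks)
    request.body = bytes(raw)
    request._was_iter = False
    request._was_file = False


req = StickyRequest(iter(b"multipart-content"))
req.body = b"multipart-content"        # the setter alone is not enough
type_before = type(req.body).__name__  # still an iterator type
materialize(req)
type_after = type(req.body).__name__   # plain bytes once the flags are cleared
```

Two requests built from equal content compare equal on ``.body`` only
after ``materialize`` has run on both.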

The local repro mirrors the exact production setup:
``Request('POST', url, iter(b'multipart-content'), {})`` on two
sides, runs the matcher, and asserts HIT. Verified the matcher hits on
identical content and still raises on differing content.

Should be the last fix needed. Existing cassettes that contain
oddly-shaped bodies (lists of int chunks, etc. from the previous
``_was_iter=True`` save path) still match because the materializer
canonicalises both sides to bytes before comparison -- no fourth
re-flush required.

* revert(tests): drop the temp per-episode body-hash diagnostic

Removed now that 1c51ad13 has confirmed the root cause (vcrpy's
sticky ``_was_iter`` flag making the body getter re-wrap stored
bytes in ``iter()`` on every access). The hash dump did its job --
the post-1c51ad13 image_gen_testing run shows all five async
image-edit tests as ``[VCR HIT]`` with stable entry counts and
zero billing errors -- and is too noisy to keep on by default
(over 100 lines per session at steady state).

Kept permanently:

* ``_safe_body_matcher`` mismatch diagnostic in
  ``_vcr_conftest_common.py``. Only fires on a body mismatch,
  which is signal worth surfacing whenever it happens.
* ``_normalize_multipart_boundary`` "skipped" log line. Same
  rationale -- only fires when the body shape is something the
  normalizer cannot rewrite in place.
* The ``test-results/vcr-diagnostics/<pid>.log`` per-PID file
  plumbing (``vcr_diag_write_line`` /
  ``emit_vcr_diagnostic_log``). Useful for any future diagnostic
  that needs to bypass xdist stdout/stderr capture; cheap to keep.

* chore(tests): delete unused flush script + wire VCR diagnostic dump everywhere

* Remove ``scripts/flush_image_edit_vcr_cassettes.py``. It was a
  one-shot helper for the initial cassette flush; the iterator and
  ``_was_iter`` fixes mean no future flush should be required, and
  the script was never run anywhere (the actual flushes happened
  inside the CI conftest via the temp hacks that have since been
  reverted).

* The matcher mismatch + normalizer skip diagnostics already write
  per-PID files for every suite that imports the shared VCR
  plumbing, but ``emit_vcr_diagnostic_log`` -- the controller-side
  dump that surfaces those files into the CI log at session end --
  was only wired into ``image_gen_tests``. Add the one-line call to
  the 12 sibling conftests that already use VCR so the diagnostics
  surface in any suite's terminal output if a body matcher ever
  misses. No new output in steady state -- the dump is a no-op when
  no diagnostics were recorded that session.

* chore(tests): trim non-essential comments per project comment policy

Strips docstrings, inline comments, and block comments that this PR
introduced where the code itself was already self-evident. Keeps the
few lines that document non-obvious behaviour (raw-bytes-not-BytesIO
rationale on the image fixtures, the per-PID-files-bypass-xdist note
on the diagnostic directory). Touches only comments this PR added --
no pre-existing comment is removed.

Net: -161 lines of comment/docstring across 3 files, no code
behaviour change.

* chore(tests): forward **kwargs in pin_httpx_multipart_boundary wrapper

Defensive against future httpx MultipartStream.__init__ adding new
optional kwargs. Without the forward, the wrapper would silently drop
them. No behaviour change today.
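
A sketch of a boundary-pinning wrapper that forwards unknown keyword
arguments. To stay self-contained it patches ``FakeMultipartStream``, a
hypothetical class, rather than httpx's real one, and it assumes
``boundary`` is passed by keyword. ``pin_boundary`` is likewise an
illustrative name.

```python
import functools

STATIC_BOUNDARY = b"vcr-static-boundary"


class FakeMultipartStream:
    """Hypothetical stand-in for a multipart stream class; not httpx's."""

    def __init__(self, data, files, boundary=None, **extra):
        self.boundary = boundary if boundary is not None else b"random-hex"
        self.extra = extra


def pin_boundary(cls):
    """Patch cls.__init__ to default the boundary; return the original."""
    original_init = cls.__init__

    @functools.wraps(original_init)
    def pinned_init(self, *args, boundary=None, **kwargs):
        if boundary is None:
            boundary = STATIC_BOUNDARY
        # Forward **kwargs so options added later are not silently dropped.
        original_init(self, *args, boundary=boundary, **kwargs)

    cls.__init__ = pinned_init
    return original_init


original_init = pin_boundary(FakeMultipartStream)
stream = FakeMultipartStream({}, [], future_option=True)
```

Returning the original ``__init__`` lets a fixture restore it on teardown.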

* chore(tests): canonicalize VCR matchers and surface shouldn't-happen branches

Bundles the "follow-up cleanup PR" into this one so it does not get
lost. Four small changes:

1. Introduce ``_canonical_body(req) -> (bytes, pre_type)`` and route
   ``_safe_body_matcher`` through it. The matcher now operates on
   bytes by construction; the "compare two iterator objects via
   ``==`` and silently get object-identity semantics" failure mode
   (which cost us this entire PR to diagnose) is structurally
   impossible to reintroduce. ``pre_type`` is the body type *before*
   canonicalization, surfaced by the mismatch diagnostic so a future
   regression involving a new body shape is still visible.

2. Add a structured diagnostic to ``_key_fingerprint_matcher``. It
   was previously raising a bare ``AssertionError("API key
   fingerprints differ")`` with zero context -- exactly the
   anti-pattern the body matcher had before this PR.

3. Surface "shouldn't-happen" branches via ``vcr_diag_write_line``:

   * ``_strip_image_b64_payloads`` -- logs when ``response``,
     ``response['body']``, or ``response['body']['string']`` arrives
     in an unexpected shape (vcrpy contract violation).
   * ``_compute_key_fingerprint`` -- logs the ``"no-key"`` fallback
     with the request method/URL so a stripped-auth-header bug is
     visible instead of masked.
   * ``_canonical_body`` -- logs its own empty-bytes fallback when a
     body has a shape ``_materialize_iterable_body`` did not handle.

4. Re-introduce per-episode body-hash logging in
   ``_RedisPersister.save_cassette`` (was reverted in 927c5548 as
   "noisy"). Quantified cost: ~25 KB of CI log per session at peak,
   ~ms-scale CPU, zero output in steady state (no save = no log).
   Trade-off favours keeping it: lets two consecutive CI runs be
   diffed by body hash, which is how we will spot the next regression
   in the same class.

All call sites still work: local repro confirms iter==iter HIT,
iter!=iter raises, plain-bytes HIT, body-hash log emits via the same
per-PID file plumbing as the matcher diagnostics.
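
The iterator-identity failure mode from item 1 can be reproduced in a few lines. This is a sketch under stated assumptions, not the helper from this PR: ``canonical_body`` here takes a raw body rather than a request object, and it assumes iterator chunks are already ``bytes``.

```python
def canonical_body(body):
    """Return (bytes, pre_type); pre_type is the type name before conversion."""
    pre_type = type(body).__name__
    if body is None:
        return b"", pre_type
    if isinstance(body, (bytes, bytearray)):
        return bytes(body), pre_type
    if isinstance(body, str):
        return body.encode("utf-8"), pre_type
    if hasattr(body, "__next__"):
        # Single-use iterator: drain it once. Assumes every chunk is bytes.
        return b"".join(body), pre_type
    # Unhandled shape: empty-bytes fallback (the real helper logs this case).
    return b"", pre_type


left = iter([b"ab", b"cd"])
right = iter([b"ab", b"cd"])

# Two distinct iterator objects never compare equal, whatever they yield.
identity_equal = left == right

# After canonicalization the comparison is on content.
bytes_equal = canonical_body(left)[0] == canonical_body(right)[0]
```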

* chore(tests): symmetrize diag-log cleanup across every VCR-using conftest

``image_gen_tests/conftest.py`` was the only suite that cleared
``test-results/vcr-diagnostics/*.log`` at session start. The other 12
VCR-using conftests inherited any stale per-PID logs from a previous
local run and would dump them in the terminal summary -- harmless in
CI (fresh container) but confusing locally when running multiple
suites in sequence.

Extracts the cleanup into a ``reset_vcr_diag_dir`` helper in
``tests/_vcr_conftest_common.py`` and calls it from every VCR-using
conftest's ``pytest_configure``. Same single source of truth, no
inline duplication.

* fix(tests): gate body materialization on __next__ and strip PR comments

aiohttp/vcrpy stores the json kwarg as a dict; _materialize_iterable_body
was iterating it via __iter__ and joining the keys, replacing the request
body with concatenated key names ("textlanguageentities"). Gate on
__next__ so containers (dict/list/tuple) are left alone — only single-use
iterators like httpx's bytes_iterator / list_iterator are materialized.
Log a diagnostic line when the chunk type is unrecognized.
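
A rough sketch of the gate, assuming a simplified signature; the ``diagnostics`` list is a hypothetical stand-in for the diagnostic log, and the function body is illustrative rather than the code in this commit:

```python
diagnostics = []  # hypothetical stand-in for the diagnostic log


def materialize_iterable_body(body):
    """Drain single-use iterators into bytes; leave containers untouched."""
    if not hasattr(body, "__next__"):
        # dict/list/tuple define __iter__ but not __next__, so they pass through.
        return body
    out = bytearray()
    for chunk in body:
        if isinstance(chunk, int):
            out.append(chunk)  # iterating a bytes object yields ints
        elif isinstance(chunk, (bytes, bytearray)):
            out.extend(chunk)
        elif isinstance(chunk, str):
            out.extend(chunk.encode("utf-8"))
        else:
            diagnostics.append(f"unrecognized chunk type: {type(chunk).__name__}")
    return bytes(out)


payload = {"text": "hi", "language": "en", "entities": []}

# What gating on __iter__ effectively did: iterate the dict and join its keys.
joined_keys = "".join(payload)
```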

* fix(tests): JSON-encode dict bodies in canonical_body for stable matching

aiohttp stubs store the json kwarg as a dict; the fallback that compared
all dicts as b"" caused concurrent presidio analyze calls to be served
the wrong cassette episode. JSON-encode with sort_keys for stable bytes.
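
For illustration, a sketch of the encoding step; ``dict_body_to_bytes`` is a hypothetical name and the compact ``separators`` are an assumption, only ``sort_keys`` comes from the commit message:

```python
import json


def dict_body_to_bytes(body):
    """Encode a dict body so equal dicts always produce equal bytes."""
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


first = dict_body_to_bytes({"text": "alice", "language": "en"})
second = dict_body_to_bytes({"language": "en", "text": "alice"})  # same content, other order
third = dict_body_to_bytes({"text": "bob", "language": "en"})     # different content
```

With the old fallback all three bodies compared as ``b""``; with sorted keys the first two match and the third does not.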

* fix(tests): guard emit_vcr_diagnostic_log against multi-conftest re-emission

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(tests): globalize multipart-boundary pin + stabilize whisper fixtures

Diagnostic shows audio_testing was silently re-recording 50+ live Whisper
episodes per CI run (over MAX_EPISODES_PER_CASSETTE, so the persister
refused to save). Two changes:

* Move the session-autouse _pin_multipart_boundary fixture into the
  shared _vcr_conftest_common module so every VCR-using suite picks it
  up via a single import. image_gen had it inline; the other 12 suites
  silently lacked it.
* Replace the module-level open("rb") audio file handles in test_whisper
  with cached bytes + a per-call (filename, bytes, mimetype) tuple,
  mirroring the image_edits raw-bytes pattern. Stops the file-pointer-
  at-EOF bug where the second test got an empty multipart body.
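
The EOF behaviour can be shown without a real audio file. In this sketch ``io.BytesIO`` stands in for a module-level ``open(path, "rb")`` handle, and ``audio_file_tuple`` plus the file name are hypothetical:

```python
import io

# A shared handle is consumed by the first reader.
handle = io.BytesIO(b"RIFF-fake-audio")
first_read = handle.read()
second_read = handle.read()  # pointer is at EOF, so nothing is left

# Cached bytes plus a fresh tuple per call cannot be exhausted.
AUDIO_BYTES = b"RIFF-fake-audio"


def audio_file_tuple():
    """Build a new (filename, bytes, mimetype) tuple for each request."""
    return ("audio.wav", AUDIO_BYTES, "audio/wav")
```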

* chore(tests): drop per-episode body-hash dump and redundant emit guard

---------

Co-authored-by: shin-berri <shin-laptop@berri.ai>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
2026-05-18 09:15:39 -07:00
Mateo Wang 2c733c00f5 chore(ci): modernize model references in tests and configs (#27856)
* test: modernize models used in CircleCI e2e test suites

Replaces obsolete models (gpt-4o, gpt-4o-mini, gpt-3.5-turbo,
claude-3-5-sonnet-20240620, claude-sonnet-4-20250514) with current
equivalents across the e2e_openai_endpoints and
proxy_e2e_anthropic_messages_tests CircleCI jobs.

- gpt-4o -> gpt-5.5 (responses API e2e tests)
- gpt-4o-mini -> gpt-5-mini (websocket responses, oai_misc_config)
- gpt-4o-mini-2024-07-18 -> gpt-4.1-mini-2025-04-14 (fine-tuning,
  still actively fine-tunable)
- gpt-4 / gpt-3.5-turbo target_model_names example -> gpt-5.5 /
  gpt-5-mini
- bedrock claude-3-5-sonnet-20240620 batch entry -> haiku-4-5-20251001
  (also aligning oai_misc_config model_name with what
  test_bedrock_batches_api.py actually requests)
- bedrock claude-sonnet-4-20250514 (deprecated, retires 2026-06-15)
  -> claude-sonnet-4-5-20250929

* test: point bedrock-claude-sonnet-4 alias at Sonnet 4.6, not 4.5

Greptile/Cursor flagged that after the previous commit, the
bedrock-claude-sonnet-4 alias collided with bedrock-claude-sonnet-4.5
(both pointed to claude-sonnet-4-5-20250929). Rename to
bedrock-claude-sonnet-4.6 and point it at the Sonnet 4.6 Bedrock ID
(us.anthropic.claude-sonnet-4-6, already in the litellm model
registry) so the alias name matches the underlying model version.

* test: modernize models across remaining CI-mounted configs & tests

Expands the modernization sweep to all CircleCI-mounted proxy configs
and to test directories where the model literal is a fixture/route key
(not the test's subject).

Config changes:
- proxy_server_config.yaml: bump gpt-3.5-turbo / gpt-3.5-turbo-1106 /
  gpt-4o / gemini-1.5-flash / dall-e-3 underlying models; rename
  gpt-3.5-turbo-end-user-test alias to gpt-5-mini-end-user-test; bump
  text-embedding-ada-002 underlying to text-embedding-3-small. User-
  facing aliases (gpt-3.5-turbo, gpt-4, text-embedding-ada-002, etc.)
  preserved for backward compatibility with tests.
- simple_config.yaml, otel_test_config.yaml, spend_tracking_config.yaml:
  bump gpt-3.5-turbo underlying to gpt-5-mini.
- pass_through_config.yaml: claude-3-5-sonnet / claude-3-7-sonnet /
  claude-3-haiku entries replaced with claude-sonnet-4-5 / claude-
  haiku-4-5 / claude-opus-4-7.
- oai_misc_config.yaml: align alias name with the gpt-5-mini rename.

Test changes (proactive: claude-sonnet-4-20250514 / claude-opus-4-
20250514 retire 2026-06-15):
- tests/llm_translation/test_anthropic_completion.py: bump 3 references
  + paired Vertex AI ID to claude-sonnet-4-5.
- tests/llm_translation/test_optional_params.py: bump 2 references.
- tests/pass_through_unit_tests/test_anthropic_messages_passthrough.py
  and test_bedrock_anthropic_messages_test.py: bump router fixtures
  using the deprecated model IDs.
- tests/pass_through_unit_tests/base_anthropic_messages_tool_search_test.py:
  modernize docstring examples.
- tests/test_end_users.py: update references to renamed alias.

* test: modernize placeholder model literals in router_unit_tests

Mass replace_all on fixture/placeholder model literals across the
router_unit_tests/ suite (model name is a routing key / label, not the
test subject). Sub-agent sweep so far — additional commits will follow
for logging_callback_tests/, enterprise/, top-level tests/test_*.py,
and other CI-mounted dirs.

Mappings applied:
- gpt-3.5-turbo -> gpt-5-mini
- gpt-4 (bare) -> gpt-5.5
- gpt-4o (bare) -> gpt-5
- text-embedding-ada-002 -> text-embedding-3-small
- claude-3-sonnet-20240229 / claude-3-opus-20240229 /
  claude-3-haiku-20240307 / claude-3-5-sonnet-20240620 ->
  claude-sonnet-4-5-20250929 / claude-opus-4-7 /
  claude-haiku-4-5-20251001 as appropriate

Explicitly preserved:
- gpt-4o-mini-* variants (transcribe, tts, etc.) where they're current
- gpt-4-turbo / gpt-4-vision-preview / gpt-4-0613 (subject literals)
- JSONL batch body literals
- Mock LLM response model fields (must match upstream)
- Fake/mock identifiers

* test: modernize placeholder model literals across remaining CI suites

Sub-agent sweep across logging_callback_tests/, guardrails_tests/,
enterprise/, pass_through_unit_tests/, otel_tests/,
llm_responses_api_testing/, batches_tests/, spend_tracking_tests/,
litellm_utils_tests/, unified_google_tests/, and a few top-level
tests/test_*.py files where the model literal is a fixture or
placeholder (router model_list, mock standard logging payload, mock
callback data) rather than the test's subject.

Mappings applied (see scope notes below):
- gpt-3.5-turbo -> gpt-5-mini
- gpt-4 (bare) -> gpt-5.5
- gpt-4o (bare) -> gpt-5.5 (corrected from initial gpt-5 — bare gpt-5
  is not a valid OpenAI alias; only gpt-5.5 / gpt-5.4 / gpt-5.2-codex
  / gpt-5-mini exist)
- gpt-4o-mini (bare) -> gpt-5-mini
- text-embedding-ada-002 -> text-embedding-3-small
- claude-3-sonnet-20240229 -> claude-sonnet-4-5-20250929
- claude-3-opus-20240229 -> claude-opus-4-7
- claude-3-haiku-20240307 -> claude-haiku-4-5-20251001
- claude-3-5-sonnet-20240620/20241022 -> claude-sonnet-4-5-20250929
- claude-3-7-sonnet-20250219 -> claude-sonnet-4-6
- gemini-1.5-flash -> gemini-2.5-flash
- gemini-1.5-pro -> gemini-2.5-pro

Explicitly preserved (not modernized):
- llm_translation/ tests where model is the SUBJECT (provider-specific
  translation/transformation logic). Only the deprecated 20250514
  references were already bumped in a prior commit.
- Cost-calc / tokenizer subject tests in test_utils.py (skip-ranges
  documented by the sub-agent).
- Bedrock model IDs in test_health_check.py path-stripping tests.
- JSONL batch request bodies and mock LLM response bodies (must match
  upstream literal).
- Langfuse expected-request-body JSON fixtures (cost values are exact-
  match-asserted; changing the model would shift response_cost).
- gpt-3.5-turbo-instruct (text-completion endpoint; no modern OpenAI
  equivalent).
- Top-level tests calling the proxy through user-facing aliases
  (gpt-3.5-turbo, gpt-4, text-embedding-ada-002, dall-e-3) — aliases
  in proxy_server_config.yaml stay; only the underlying model was
  bumped.
- tests/test_gpt5_azure_temperature_support.py (the test's whole point
  is model-name handling).
- Fake / mock / openai/fake identifiers.

Notable side fixes:
- test_spend_accuracy_tests.py: UPSTREAM_MODEL now matches what
  spend_tracking_config.yaml's proxy actually routes to (gpt-5-mini),
  resolving a latent inconsistency.
- proxy_server_config.yaml: bare `gpt-5` alias renamed to `gpt-5.5`
  (bare gpt-5 is not a valid OpenAI alias).
- test_batches_logging_unit_tests.py: explicit_models list entries
  kept distinct (gpt-5-mini + gpt-5.5) after bulk rename.

* test: fix CI failures from model modernization sweep

CI surfaced 5 categories of regression from the bulk modernization:

1. Azure deployment names are customer-specific. Reverted:
   - tests/litellm_utils_tests/test_health_check.py: azure/text-
     embedding-3-small -> azure/text-embedding-ada-002 (the CI Azure
     account does not have a text-embedding-3-small deployment).
   - tests/logging_callback_tests/test_custom_callback_router.py:
     same revert for two router fixtures driving aembedding.

2. gpt-5 family does not accept temperature != 1. Tests that pass a
   custom temperature swapped from gpt-5-mini to gpt-4.1-mini (modern
   non-reasoning OpenAI mini that still accepts temperature/logprobs):
   - tests/logging_callback_tests/test_datadog.py
   - tests/logging_callback_tests/test_langsmith_unit_test.py
   - tests/logging_callback_tests/test_otel_logging.py

3. proxy_server_config.yaml's gpt-3.5-turbo-large alias was routing to
   gpt-5.5 (a reasoning model that rejects logprobs). The proxy test
   tests/test_openai_endpoints.py::test_chat_completion_streaming
   exercises logprobs/top_logprobs through that alias. Bumped the
   underlying model to gpt-4.1 (non-reasoning, still modern).

4. tests/logging_callback_tests/test_gcs_pub_sub.py asserts against a
   pinned JSON fixture (gcs_pub_sub_body/spend_logs_payload.json) with
   hardcoded model="gpt-4o" and a model-specific spend value. Reverted
   the litellm.acompletion calls in the test to model="gpt-4o" so the
   fixture's exact-match assertions still hold.

5. tests/pass_through_unit_tests/test_anthropic_messages_passthrough.py:
   anthropic.messages.create routing to openai/gpt-5-mini returned an
   empty content[0] with max_tokens=100 (reasoning-token consumption).
   Swapped to openai/gpt-4.1-mini.

* test: fix Assistants API model + 2 cursor[bot] review nits

1. pass_through_unit_tests/test_custom_logger_passthrough.py: gpt-5.5
   isn't accepted by the /v1/assistants endpoint
   ("unsupported_model"). Switch to gpt-4.1-mini (modern, Assistants-
   API-supported, non-reasoning).

2. example_config_yaml/pass_through_config.yaml: the previous sweep
   bumped the claude-3-7-sonnet alias to claude-opus-4-7, which is a
   tier change (Sonnet -> Opus). Map to claude-sonnet-4-6 to keep the
   Sonnet tier intact. (Cursor bugbot review.)

3. example_config_yaml/simple_config.yaml: model_name was left as
   gpt-3.5-turbo while the underlying was bumped to gpt-5-mini, which
   muddles the "simple" example. Make both sides gpt-5-mini so the
   most basic example is a straight 1:1 mapping again. (Cursor bugbot
   review.)

* fix: revert gpt-4/gpt-3.5-turbo alias underlying to non-reasoning models

tests/test_openai_endpoints.py::test_completion calls the proxy alias
"gpt-4" with temperature=0, and other tests call gpt-3.5-turbo with
custom temperature / logprobs / the legacy /v1/completions endpoint.
The earlier modernization mapped both aliases to gpt-5.5 / gpt-5-mini,
which are reasoning models that reject temperature != 1 and don't
expose /v1/completions. Map the aliases to gpt-4.1 / gpt-4.1-mini
(modern non-reasoning OpenAI models) instead — keeps user-facing
aliases preserved while picking a current underlying that still
supports the parameters/endpoints the tests exercise.
2026-05-15 15:44:28 -07:00
Cursor Agent b637d9f64a test(vcr): classify cache verdicts, detect live calls, surface cost leaks
Convert the per-test VCR verdict line from a single 'NOOP / HIT / MISS /
PARTIAL' tag into a classified outcome that distinguishes the cases that
silently bill the live API on every CI run from the ones that don't:

  HIT                         pure replay
  PARTIAL                     mixed replay + new recordings
  MISS:RECORDED               new cassette saved to Redis (cached next run)
  MISS:OVERFLOW               cassette > MAX_EPISODES_PER_CASSETTE; persister
                              refused to save; re-bills every run
  MISS:NOT_PERSISTED          test failed; save_cassette skipped; re-bills
  NOOP                        VCR-marked but no HTTP traffic (mocked elsewhere)
  UNMARKED:LIVE_CALL          test bypassed VCR AND opened a TCP connection
                              to a known LLM provider host -> wasted spend
  UNMARKED:NO_TRAFFIC         test bypassed VCR but didn't call out

The UNMARKED:LIVE_CALL signal is what converts 'this test probably hits
live' into 'this test connected to api.openai.com'. We install a
socket.connect / socket.create_connection wrapper for the duration of
each non-VCR-marked test and record any outbound TCP to a known LLM
provider hostname. The probe sits below the httpx layer so vcrpy and
respx (which both patch above the socket) are unaffected.
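
A simplified sketch of such a probe, assuming it only wraps ``socket.create_connection`` (the commit also wraps ``socket.connect``); ``live_call_probe`` and the host set are hypothetical:

```python
import contextlib
import socket

# Hypothetical host list; the real one would cover every known provider.
LLM_HOSTS = frozenset({"api.openai.com", "api.anthropic.com"})


@contextlib.contextmanager
def live_call_probe(hosts=LLM_HOSTS):
    """Record outbound connections to known provider hosts while active."""
    seen = []
    original = socket.create_connection

    def recording_create_connection(address, *args, **kwargs):
        host = address[0]
        if host in hosts:
            seen.append(host)
        # Always delegate, so the probe observes without changing behaviour.
        return original(address, *args, **kwargs)

    socket.create_connection = recording_create_connection
    try:
        yield seen
    finally:
        socket.create_connection = original
```

A non-empty ``seen`` list after a test that was not VCR-marked would correspond to the ``UNMARKED:LIVE_CALL`` verdict.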

Replace the file-level _RESPX_CONFLICTING_FILES blacklists in the
llm_translation and local_testing conftests with per-item respx
detection in apply_vcr_auto_marker_to_items. A test now skips VCR when
it actually carries @pytest.mark.respx or has respx_mock in its fixture
chain - not just because some other test in the same file imports
MockRouter. Items skipped by skip_files are split into respx_conflict
(real conflict, the module wires up respx) vs file_opt_out (dead skip-
list entry whose module never touches respx) so the session summary
makes pruning obvious.

Stabilize the AWS SigV4 fingerprint: the Authorization header on
Bedrock requests rotates its Credential date and Signature on every
call, which previously pushed every Bedrock test past the 50-episode
overflow threshold. Extract the access-key id only
('aws-sigv4:AKIA...') so two requests with the same identity match.
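
One way the extraction could look, assuming the standard SigV4 ``Authorization`` header layout; ``sigv4_fingerprint`` and the regex are illustrative, and the access key is the placeholder from AWS documentation:

```python
import re

# Credential=<access-key-id>/<date>/<region>/<service>/aws4_request
_CREDENTIAL_RE = re.compile(r"Credential=([^/,\s]+)/")


def sigv4_fingerprint(authorization):
    """Return 'aws-sigv4:<access-key-id>' or None for non-SigV4 headers."""
    if not authorization.startswith("AWS4-HMAC-SHA256"):
        return None
    match = _CREDENTIAL_RE.search(authorization)
    return f"aws-sigv4:{match.group(1)}" if match else None


monday = (
    "AWS4-HMAC-SHA256 Credential=AKIAIOSFODNN7EXAMPLE/20260511/us-east-1/"
    "bedrock/aws4_request, SignedHeaders=host;x-amz-date, Signature=aaaa"
)
tuesday = (
    "AWS4-HMAC-SHA256 Credential=AKIAIOSFODNN7EXAMPLE/20260512/us-east-1/"
    "bedrock/aws4_request, SignedHeaders=host;x-amz-date, Signature=bbbb"
)
```

Both headers reduce to the same fingerprint even though the date and signature differ.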

Always emit verdict logging when VCR is active (set
LITELLM_VCR_VERBOSE=0 to opt back into the legacy quiet mode). Add a
session-end classification summary that lists overflow tests, unmarked
live-call tests, and the skip-reason breakdown.

Wire the live-call probe + summary hook into every test directory that
already uses the Redis-backed VCR cache (audio_tests, guardrails_tests,
image_gen_tests, litellm_utils_tests, llm_responses_api_testing,
llm_translation, local_testing, logging_callback_tests, ocr_tests,
pass_through_unit_tests, router_unit_tests, search_tests,
unified_google_tests).

Add tests/llm_translation/test_vcr_classification.py covering the
verdict classifier, skip-reason tagging, AWS SigV4 fingerprint stability,
live-host classification, and session summary rendering.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
2026-05-13 00:31:47 +00:00
Shivam Rawat 29e4eb16da Merge pull request #27222 from BerriAI/litellm_s3AuditParams
[Feat] Decouple S3 audit-log config via s3_audit_callback_params
2026-05-07 12:49:03 -07:00
Mateo Wang 7e13256fee test: add 24hr Redis-backed VCR cache to additional test suites (#27159)
* test: add 24hr Redis-backed VCR cache to additional test suites

Extracts the existing llm_translation VCR plumbing into a reusable helper
(tests/_vcr_conftest_common.py) and wires it into the conftest.py files
of the test directories listed in LIT-2787:

  audio_tests, batches_tests, guardrails_tests, image_gen_tests,
  litellm_utils_tests, local_testing, logging_callback_tests,
  pass_through_unit_tests, router_unit_tests, unified_google_tests

The same helper is also adopted by the pre-existing llm_translation and
llm_responses_api_testing conftests to remove the copy-pasted VCR setup.

Each consuming conftest:
- registers the Redis persister via pytest_recording_configure
- auto-marks collected tests with pytest.mark.vcr (skipping respx-using
  files where applicable, since respx and vcrpy both patch httpx)
- gates cassette writes on test success via _vcr_outcome_gate

The cache is opt-in via CASSETTE_REDIS_URL; when unset, VCR is disabled
and tests hit live providers as before. LITELLM_VCR_DISABLE=1 still
forces a bypass for ad-hoc local runs.

Test directories that run LiteLLM proxy in Docker (build_and_test,
proxy_logging_guardrails_model_info_tests, proxy_store_model_in_db_tests)
are intentionally not included: VCR.py patches the in-process httpx
transport and cannot intercept calls made from inside a Docker container.
The installing_litellm_on_python* jobs make no LLM calls and don't
benefit from caching.

https://linear.app/litellm-ai/issue/LIT-2787/add-24hr-caching-to-additional-test-suites

* test(vcr): add safe-body matcher to handle JSONL and binary request bodies

vcrpy's stock body matcher inspects Content-Type and unconditionally
runs json.loads on application/json bodies. JSON Lines payloads (used
by the Bedrock batch S3 PUT and other upload paths) crash that with
json.JSONDecodeError: Extra data, before the matcher can return
'not a match'.

This was the root cause of the batches_testing CI job failing on
test_async_create_file once VCR auto-marking was applied to the
batches_tests directory.

Add a conservative byte-equality body matcher and use it in place of
'body' in the shared match_on tuple. The matcher is strictly more
conservative than vcrpy's default — the only thing it gives up is
'different JSON key order is treated as the same body', which doesn't
apply to deterministic litellm-built request payloads. It can never
produce a false positive that the default would have rejected, so
there is no cross-contamination risk.
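
The crash and the byte-equality alternative can be sketched as follows. ``stock_style_match`` only mimics an always-JSON-decode matcher and is not vcrpy's code; ``safe_body_match`` is a hypothetical simplification that compares raw bodies rather than request objects:

```python
import json

JSONL_BODY = b'{"id": 1}\n{"id": 2}\n'


def stock_style_match(left, right):
    """Mimic a matcher that always JSON-decodes; raises on JSON Lines."""
    return json.loads(left) == json.loads(right)


def _as_bytes(body):
    if body is None:
        return b""
    if isinstance(body, str):
        return body.encode("utf-8")
    return bytes(body)


def safe_body_match(left, right):
    """Plain byte equality; never parses the payload."""
    return _as_bytes(left) == _as_bytes(right)


try:
    stock_style_match(JSONL_BODY, JSONL_BODY)
    decode_failed = False
except json.JSONDecodeError:  # "Extra data" after the first JSON object
    decode_failed = True
```

The trade-off named above is visible here: two JSON bodies that differ only in key order no longer match under byte equality.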

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(vcr): exclude tests that VCR replay actively breaks

A few tests are incompatible with cassette replay and were failing on
the latest CI run after VCR auto-marking was extended to local_testing
and logging_callback_tests:

- test_amazing_s3_logs.py (logging_callback_tests): the test asserts on
  a per-run response_id that should round-trip through a real S3
  PUT/LIST. vcrpy's boto3 stub intercepts the PUT and the LIST replays
  stale keys, so the freshly-generated id is never found.
- test_async_embedding_azure (logging_callback_tests) and
  test_amazing_sync_embedding (local_testing): the failure branches
  deliberately pass api_key='my-bad-key' to assert that the failure
  callback fires. We scrub auth headers from cassettes (so the bad-key
  request matches the prior good-key request), and vcrpy replays the
  recorded 200 — the failure callback never fires.
- test_assistants.py (local_testing): the OpenAI Assistants polling
  APIs mint fresh thread/run IDs every recording session and then poll
  until status=='completed'. Replays of those polled GETs can never
  match a freshly-generated run id, so every CI run effectively
  re-records and the suite blows past the 15m no_output_timeout.

Skip these from VCR auto-marking so they continue to hit live providers
as they did before this change. The remaining tests in each directory
still get cached.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(vcr): expand skip lists for second batch of incompatible tests

Followup to the previous commit. After re-running CI on the rebuilt
branch, several more tests surfaced as VCR-replay-incompatible:

- litellm_utils_testing :: test_get_valid_models_from_dynamic_api_key
  Calls GET /v1/models with api_key='123' to assert the result is empty.
  We scrub auth headers, so the bad-key request matches the prior
  good-key cassette and replays the recorded model list.
- litellm_utils_testing :: test_litellm_overhead.py
  Measures litellm_overhead_time_ms as a percentage of total wall-clock
  time. With cached responses the upstream 'network' time collapses to
  microseconds, blowing past the 40% threshold the test asserts on.
  Skip the whole file (every parametrization is at risk).
- local_testing_part1 :: test_async_custom_handler_completion and
  test_async_custom_handler_embedding
  Same bad-key failure-callback pattern as the already-skipped
  test_amazing_sync_embedding.
- litellm_router_testing :: test_router_caching.py
  Asserts on litellm's own router-level response cache by comparing
  response1.id to response2.id across repeat upstream calls (test
  bypasses litellm cache via ttl=0 and expects upstream to return a
  *new* id). With VCR replay both upstream calls return the same
  cassette body, so the ids are identical. Skip the whole file.
- logging_callback_tests :: test_async_chat_azure (preemptive)
  Same shape as already-skipped test_async_embedding_azure; was masked
  by upstream OpenAI rate-limit failures on baseline.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(vcr): use item.path and tighten matcher docstring

- Replace pytest's deprecated item.fspath with item.path in
  apply_vcr_auto_marker_to_items so we don't emit deprecation
  warnings under pytest 8.
- Clarify _safe_body_matcher docstring to reflect actual behavior
  (direct == first, then UTF-8 bytes comparison, no repr fallback).

Addresses Greptile review feedback on PR #27159.

* test(vcr): swallow all RedisError on cassette save/load

Cassette persistence is strictly best-effort: any Redis-side failure
(connection blip, timeout, OutOfMemoryError when the maxmemory cap is
hit, READONLY replicas, etc.) should degrade to 'test passed but
cassette not cached' rather than fail the test on teardown.

Previously the persister only caught ConnectionError and TimeoutError,
so OutOfMemoryError — which Redis Cloud raises when the cassette cache
hits its memory cap and there are no evictable keys — propagated out of
vcrpy's autouse fixture and ERRORed otherwise-passing tests on
teardown. This caused the litellm_utils_testing CircleCI job to fail on
the latest commit's run, even though the underlying test was a unit
test that used mock_response and produced no real upstream traffic
(the cassette was dirtied by a background langfuse callback). The
rerun only succeeded because Redis evictions happened to free enough
room before the SET — i.e. it was timing-dependent flakiness.

Catch redis.exceptions.RedisError (the common base of all server- and
client-side Redis exceptions) on both save and load, and parametrize
the regression tests across ConnectionError, TimeoutError, and
OutOfMemoryError to pin the new behavior.

* test(vcr): surface cassette-cache failures with warnings + session banner

When the persister silently swallows a Redis OOM (or any RedisError) on
save/load there is otherwise no visible signal that the cache is
degraded — tests pass, the cassette just isn't persisted, and the next
session still hits the same Redis at the same near-cap memory.

Add three layers of observability so that failure mode is loud:

1. Per-process health counters ("save_failures", "load_failures", and
   the last error string for each), exposed via cassette_cache_health()
   and reset via reset_cassette_cache_health(). The persister
   increments these in addition to logging.

2. VCRCassetteCacheWarning (UserWarning subclass) emitted via
   warnings.warn() inside the persister's except block. Pytest's
   built-in warnings summary at session end automatically lists every
   such warning, so the failure is visible in CI logs without any
   conftest-level wiring.

3. Session-end banner via emit_cassette_cache_session_banner() and a
   stderr-fallback atexit handler registered from
   register_persister_if_enabled(). Two states:
     - red "VCR CASSETTE CACHE DEGRADED" when save_failures or
       load_failures > 0
     - yellow "VCR CASSETTE CACHE NEAR CAPACITY" (no failures, but
       used_memory >= 85% of maxmemory) so the next session knows
       the Redis is approaching OOM before any SET actually fails

Capacity comes from a best-effort INFO memory probe
(cassette_cache_capacity_snapshot) that returns None on any failure or
when maxmemory is uncapped. The atexit handler skips xdist workers so
only the controller emits.
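
The two banner states reduce to a small decision function. This is a sketch with a hypothetical ``banner_state`` helper; the 85% threshold and the banner names come from the description above, the signature does not:

```python
NEAR_CAPACITY_RATIO = 0.85


def banner_state(save_failures, load_failures, used_memory=None, maxmemory=None):
    """Return 'DEGRADED', 'NEAR CAPACITY', or None when no banner is needed."""
    if save_failures > 0 or load_failures > 0:
        return "DEGRADED"
    # maxmemory of 0 or None means Redis is uncapped or the probe failed.
    if maxmemory and used_memory is not None:
        if used_memory >= NEAR_CAPACITY_RATIO * maxmemory:
            return "NEAR CAPACITY"
    return None
```

Failures take priority over the capacity warning, matching the "no failures, but" wording of the yellow state.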

Tests: parametrize the existing save/load swallow-error tests across
ConnectionError/TimeoutError/OutOfMemoryError, add direct tests for
the health counters and warning emission, and a new
test_vcr_conftest_common_banner.py covering banner output for every
state (silent/red/yellow/disabled/xdist-worker).

* test(vcr): bucket cassettes by API key fingerprint, drop bad-key skips

Tests that deliberately call an LLM API with a bad key (e.g. to assert
that the failure callback fires, or that check_valid_key returns False)
were being silently served the prior good-key cassette: we scrub the
real Authorization / x-api-key header from the cassette before storing
it, so a follow-up bad-key call is byte-identical to the good-key call
under the existing match_on tuple.

Add a 'key_fingerprint' custom matcher that distinguishes requests by
the SHA-256 of their API-key headers. The fingerprint is stamped into
a synthetic 'x-litellm-key-fp' header by a new before_record_request
hook, which then strips the real auth headers (we have to do the
scrubbing here instead of via vcrpy's filter_headers knob, because
filter_headers runs *first* and would erase the value we want to hash).

Bad-key requests now get a different cassette bucket than good-key
requests, so vcrpy will not replay a recorded 200 in place of the
expected 401. The fingerprint is a one-way hash of the secret, so
cassettes never contain the key.
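
A minimal sketch of the stamping step, assuming headers arrive as a plain dict. ``fingerprint_and_scrub`` and the exact list of auth header names are assumptions; the ``x-litellm-key-fp`` header and the ``no-key`` fallback are taken from this PR's description:

```python
import hashlib

AUTH_HEADER_NAMES = ("authorization", "x-api-key")  # assumed list
FP_HEADER = "x-litellm-key-fp"


def fingerprint_and_scrub(headers):
    """Return a copy with auth headers replaced by a one-way fingerprint."""
    secrets = [
        f"{name.lower()}={value}"
        for name, value in sorted(headers.items())
        if name.lower() in AUTH_HEADER_NAMES
    ]
    if secrets:
        fingerprint = hashlib.sha256("\n".join(secrets).encode("utf-8")).hexdigest()
    else:
        fingerprint = "no-key"
    scrubbed = {
        name: value
        for name, value in headers.items()
        if name.lower() not in AUTH_HEADER_NAMES
    }
    scrubbed[FP_HEADER] = fingerprint
    return scrubbed


good = fingerprint_and_scrub({"Authorization": "Bearer sk-real", "Accept": "*/*"})
bad = fingerprint_and_scrub({"Authorization": "Bearer my-bad-key", "Accept": "*/*"})
anonymous = fingerprint_and_scrub({"Accept": "*/*"})
```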

This permanently removes the 'bad-key' category of skips:

- tests/local_testing: dropped ::test_amazing_sync_embedding,
  ::test_async_custom_handler_completion,
  ::test_async_custom_handler_embedding
- tests/logging_callback_tests: dropped ::test_async_chat_azure,
  ::test_async_embedding_azure
- tests/litellm_utils_tests: dropped
  ::test_get_valid_models_from_dynamic_api_key

Coverage: 7 new unit tests in tests/test_litellm/test_vcr_safe_body_matcher.py
covering header stripping, fingerprint determinism, no-auth bucketing,
good-vs-bad key discrimination, x-api-key (Anthropic/Azure) discrimination,
and idempotence under replay.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(vcr): drop redundant comments and docstrings

Trim narration of code that is already self-evident from function and
variable names. Keep the two genuinely non-obvious bits:

- ordering constraint between filter_headers and before_record_request,
  which would invite a maintainer to re-introduce the bug if removed
- the per-directory _VCR_INCOMPATIBLE_FILES rationale, since 'why
  exactly is this skipped' is not knowable from the test name alone

Also drop the 40-line commented-out drop-in conftest snippet at the
bottom of _vcr_conftest_common.py — the consuming conftests are the
canonical reference.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(vcr): make _before_record_request idempotent

vcrpy invokes before_record_request more than once per request:
can_play_response_for calls it, then __contains__ /
_responses (reached via play_response) call it again on the
result. The second invocation sees a request whose auth headers we
already stripped, so a naive recompute yields "no-key" and
overwrites the real fingerprint stored in the header.

This makes can_play_response_for and play_response disagree on
matchability — the former says "yes, we have a stored response for
this" (matching no-key to no-key) and the latter throws
UnhandledHTTPRequestError because it computes a fresh real
fingerprint that doesn't match the stored no-key.

In CI this manifested as ~30 failing tests across guardrails_testing,
audio_testing, batches_testing, image_gen_testing, llm_responses_api,
litellm_router_unit_testing, etc. Skip the recompute when the header
is already set, so re-applying the hook is a no-op.

Adds a regression test that fires the hook twice on the same dict and
asserts the fingerprint stays put.
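
The guard can be sketched on a bare header dict. This is a simplified, hypothetical version of the hook (the real one receives a vcrpy request object), assuming a single lowercase ``authorization`` header:

```python
import hashlib

FP_HEADER = "x-litellm-key-fp"


def before_record_request(headers):
    """Stamp the key fingerprint once; later calls leave it untouched."""
    if FP_HEADER in headers:
        # Already processed: the auth header is gone, so recomputing
        # would overwrite the real fingerprint with "no-key".
        return headers
    secret = headers.pop("authorization", None)
    if secret is None:
        headers[FP_HEADER] = "no-key"
    else:
        headers[FP_HEADER] = hashlib.sha256(secret.encode("utf-8")).hexdigest()
    return headers


request_headers = {"authorization": "Bearer sk-test"}
first = dict(before_record_request(request_headers))
second = dict(before_record_request(request_headers))
```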

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(vcr): drop more redundant docstrings and headers

* test(vcr): enable 24hr cache for ocr_tests and search_tests

These two directories were the only non-dockerized test suites in the
build_and_test workflow that make live LLM/provider API calls but were
not VCR-enabled by this PR. Together they account for 96 tests:

- tests/ocr_tests/ (31): Mistral OCR, Azure AI OCR, Azure Document
  Intelligence, Vertex AI OCR. Pure-unit tests inside the same files
  (e.g. TestAzureDocumentIntelligencePagesParam) make no HTTP calls
  and become benign VCR NOOPs.
- tests/search_tests/ (65): Brave, DataForSEO, DuckDuckGo, Exa,
  Firecrawl, Google PSE, Linkup, Parallel.ai, Perplexity, SearchAPI,
  Searxng, Serper, Tavily.

Both directories use the canonical minimal conftest pattern from
tests/audio_tests/conftest.py with no skip lists. None of the test
files use respx, none assert on per-call upstream non-determinism
(no response1.id != response2.id, no overhead-as-fraction-of-total,
no live polling), so the default match_on tuple should cache cleanly.
If a flake surfaces during the first cassette-recording CI run, we
can add a targeted skip the same way we did for the other dirs.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
2026-05-05 15:13:31 -07:00
Michael Riad Zaky 87d2b98a22 decouple S3 audit-log config via s3_audit_callback_params 2026-05-05 13:23:32 -07:00
user bb6d7c9715 fix(callbacks): preserve langfuse secret alias 2026-04-30 14:36:51 -07:00
user 258edac727 test(callbacks): cover upstream langfuse debug env 2026-04-30 14:34:39 -07:00
user 15d4d51453 chore(callbacks): guard dynamic integration hosts 2026-04-30 14:27:19 -07:00
user 7497674661 fix(proxy): sanitize redaction controls at ingress 2026-04-29 22:52:31 -07:00
user 842eea0131 chore(proxy): harden request control fields 2026-04-29 22:35:17 -07:00
Sameer Kankute b516120036 Merge pull request #26737 from BerriAI/litellm_internal_staging
merge internal staging
2026-04-29 08:50:12 +05:30
Sameer Kankute cf74f55b79 Fix extra body error 2026-04-29 08:34:31 +05:30
milan-berri 10aed9e981 feat(logging): add retry settings for generic API logger (#26645)
* Add retry settings for generic API logger

Made-with: Cursor

* Refine generic API retry behavior

Made-with: Cursor
2026-04-28 08:38:17 -07:00
ishaan-berri 8a9faa81b2 feat(guardrails): LLM-as-a-Judge guardrail (#26360)
* feat(guardrails): add LLM_AS_A_JUDGE to SupportedGuardrailIntegrations

* feat(types): add EvalVerdict, StandardLoggingEvalInformation; wire eval_information into SpendLogsMetadata

* feat(guardrails): add self-contained llm_as_a_judge guardrail hook

* fix(a2a): filter agent-only litellm_params from acompletion kwargs; pass agent_id into body

* feat(ui): add LLMJudgeFields criteria builder component

* feat(ui): wire LLM-as-a-Judge into add guardrail form

* feat(ui): update EvalViewer — title 'LLM Judge Results', weighted score column, summary row

* fix(ui): wire EvalViewer into LogDetailContent to show LLM judge results on logs page

* fix(guardrails-ui): route llm_as_a_judge to criteria builder step; rename to LiteLLM LLM as a Judge; add litellm logo

* fix(guardrail-viewer): stack lifecycle + eval details vertically to avoid badge overflow in narrow drawer

* fix(guardrail-create): surface config validation errors on create instead of silently orphaning guardrail in DB

* fix(guardrail-registry): hardcode llm_as_a_judge in initializer registry so it loads regardless of package install path

* fix(llm-as-a-judge): fix P1 code quality issues - validate weights/on_failure, guard pre_call, handle multimodal, move imports to module level, fix spurious finally logging

* fix(guardrail_endpoints): use correct PK field in rollback delete and log rollback failure

* fix(llm_as_a_judge): support Pydantic object in _get_litellm_param fallback chain

* fix(LLMJudgeFields): replace @tremor/react Button with antd Button

* fix(llm_as_a_judge): remove dead registry dicts, fix KeyError in prompt builder, set correct status on judge failure

* test(llm_as_a_judge): add unit tests for guardrail hook

* fix(llm_as_a_judge): remove @log_guardrail_information decorator to fix duplicate guardrail_information entries

The decorator and the manual finally block both called add_standard_logging_guardrail_information_to_request_data, producing two entries per request. The decorator also misclassified HTTPException(422) blocks as guardrail_failed_to_respond (it checks for 400). The finally block correctly tracks status throughout, so removing the decorator is sufficient.

* fix(test_gcs_pub_sub): ignore metadata.eval_information in comparison

* fix(test_spend_management): ignore metadata.eval_information in payload comparison

* fix(types/guardrails): add input_type and messages to ApplyGuardrailRequest

* fix(guardrail_endpoints): pass input_type and messages through apply_guardrail endpoint

* fix(guardrail_endpoints): auto-detect post_call guardrails and use input_type=response

* fix(a2a_endpoints): merge agent litellm_params guardrails into data before post_call hooks

* fix(llm_as_a_judge): use float sum with tolerance for weight validation

* fix(guardrail_registry): split long import line for black formatting

* fix(llm_as_a_judge): guard guardrail_name Optional for mypy

* fix(llm_as_a_judge): set guardrail_status=guardrail_intervened when score fails, regardless of on_failure mode

* fix(a2a_endpoints): use try/finally so deferred spend log fires even when guardrail blocks with 422

* fix(litellm_logging): declare _defer_async_logging and _enqueue_deferred_logging on Logging class for mypy

* fix(logging_worker): restore queue.join() in flush() to wait for in-flight callbacks
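The weight-validation fix listed above ("use float sum with tolerance") might look roughly like the sketch below. The helper name, the expected total of 1.0, and the tolerance are assumptions made for illustration; the commit message does not state them.

```python
import math
from typing import Sequence


def validate_weights(
    weights: Sequence[float],
    expected_total: float = 1.0,  # assumption: criteria weights are fractions of 1
    abs_tol: float = 1e-6,  # assumption: tolerance chosen for illustration
) -> None:
    """Raise ValueError unless the weights sum to expected_total within abs_tol.

    Hypothetical helper, not the guardrail's actual code.
    """
    # fsum avoids accumulating rounding error across many small weights.
    total = math.fsum(weights)
    if not math.isclose(total, expected_total, rel_tol=0.0, abs_tol=abs_tol):
        raise ValueError(f"weights sum to {total}, expected {expected_total}")
```

Comparing with a tolerance rather than `==` keeps weights that differ from the target only by floating-point rounding from being rejected.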
2026-04-24 17:15:32 -07:00
ishaan-berri 8a4a775b1b fix(logging): add litellm_call_id to StandardLoggingPayload and OTel span (#26133)
* add litellm_call_id field to StandardLoggingPayload

* populate litellm_call_id in get_standard_logging_object_payload

* emit litellm.call_id span attribute in OTel integration

* test: litellm_call_id is present in StandardLoggingPayload

* test: litellm.call_id emitted as OTel span attribute

* test: allow litellm. prefix attributes in redacted span validator
2026-04-21 15:24:32 -07:00
Yuneng Jiang 11c3270cdc Merge remote-tracking branch 'origin/litellm_internal_staging' into litellm_yj_apr17
# Conflicts:
#	litellm/__init__.py
2026-04-17 17:36:40 -07:00
Yuneng Jiang ee2cf0e6e8 fix: address three CI failures from recent security PR merges
- url_utils.py: narrow sockaddr[0] from str|int to str via a helper with a
  fail-closed isinstance check. Fixes the two mypy errors introduced by
  the SSRF hardening without masking unexpected stdlib behavior.

- key_management_endpoints.py: restore the documented team member_permissions
  path for /key/update. The cross-key admin check added to close the
  cross-org rewrite attack was over-broad: it rejected non-admin team
  members even when can_team_member_execute_key_management_endpoint had
  already validated their team membership and /key/update grant. Now skip
  the admin check when the key has a team_id and the change is non-budget
  (membership + permission already enforced above). Budget/spend changes
  still require team/org admin. The cross-org attack remains blocked:
  an outside org admin fails the earlier team membership check.

- test_logging_redaction_e2e_test.py: rename and rewrite two parametrized
  tests to assert that request-body turn_off_message_logging has no effect.
  Reflects the intentional removal of turn_off_message_logging from
  _supported_callback_params so the caller cannot override admin logging
  policy via the request body.

- test_key_management_endpoints.py: add two tests covering the restored
  team member permission path — one positive (non-budget update succeeds
  for a team member with /key/update grant), one negative (max_budget
  change still rejected without admin role).
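The url_utils.py item above describes narrowing `sockaddr[0]` from `str | int` to `str` with a fail-closed check. A minimal sketch of that idea follows; the helper name and error message are hypothetical, not the ones used in the repository.

```python
from typing import Tuple, Union


def sockaddr_host(sockaddr: Tuple[Union[str, int], ...]) -> str:
    """Return the host element of a getaddrinfo-style sockaddr as a str.

    Hypothetical helper. It fails closed: an unexpected non-str host raises
    instead of being coerced, so odd stdlib behavior is not silently masked.
    """
    host = sockaddr[0]
    if not isinstance(host, str):
        raise ValueError(f"unexpected sockaddr host type: {type(host).__name__}")
    return host
```

After the `isinstance` check, a type checker such as mypy treats the return value as `str`, so no cast or ignore comment is needed.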
2026-04-17 15:11:45 -07:00
Ishaan Jaffer e8461b5b97 style: run black formatter on files from main merge 2026-04-17 13:02:59 -07:00
Ishaan Jaffer 98c2d90f5c fix(logging): update test_get_additional_headers to reflect provider header passthrough 2026-04-15 12:23:33 -07:00
David Chen b7ccc5b691 [Test Fix] fix gov pricing tests (#25022)
* fix pricing tests

* fix mypy

* fix cost expectation since a US-based model is used now.

* fix test get model info
2026-04-02 15:55:55 -07:00
David Chen d1df4e838b Litellm fix update bedrock models (#24947)
* update bedrock models in tests

* updated more tests and model_prices_and_context_window

* fix model id and pricing

* replace more sonnet models

* update tests

* git push

* update pricing

* flaky total cost

* monkey patch

* relax the cost change

* fix and revert some changes

* revert the pricing

* chore: move cost/pricing changes to bedrock-cost-fixes branch

* chore: split Bedrock file-api beta stripping to separate branch

Removes strip_unsupported_file_api_betas_for_bedrock_invoke from this branch;
see litellm_bedrock_invoke_strip_file_api_betas for that fix.

Made-with: Cursor
2026-04-01 19:22:54 -07:00
ishaan-berri e4442a4d98 test fix us.anthropic.claude-haiku-4-5-20251001-v1:0 (#24931)
* test fix us.anthropic.claude-haiku-4-5-20251001-v1:0

* ignore mypy cache files

---------

Co-authored-by: Ishaan Jaffer <ishaanjaffer0324@gmail.com>
Co-authored-by: David Chen <clfhhc@gmail.com>
2026-04-01 11:01:03 -07:00
Ishaan Jaffer 0298c1f58d test_basic_s3_v2_logging 2026-03-30 18:17:52 -07:00
Ishaan Jaffer 443566d4f5 test fixes 2026-03-30 16:59:27 -07:00
Ishaan Jaffer 28afbc152f test_async_gcs_pub_sub_v1 2026-03-30 16:52:56 -07:00
Ishaan Jaffer 431782c3fe test azure blob storage 2026-03-30 15:54:07 -07:00
Krrish Dholakia 25f2baad71 test: cleanup dead tests 2026-03-28 20:49:02 -07:00
Krrish Dholakia 0fef88d67c test: remove dead tests 2026-03-28 20:23:44 -07:00
Krrish Dholakia bc829d51f2 test: test 2026-03-28 19:17:38 -07:00
Ishaan Jaff 81dadb698a Ishaan - March 18th changes (#24056)
* add DD Tracing (#24033)

* feat(models): add Azure GPT-5.4 mini and nano variants (#24045)

Add `azure/gpt-5.4-mini` and `azure/gpt-5.4-nano` to the model
database with official pricing from Azure OpenAI:

- GPT-5.4 mini: $0.75/M input, $0.075/M cached, $4.5/M output
- GPT-5.4 nano: $0.20/M input, $0.02/M cached, $1.25/M output

Both models support:
- 1.05M input / 128K output context window
- Chat, batch, and responses endpoints
- Function calling, tools, vision, reasoning
- Prompt caching with automatic tiered pricing

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* Add new model pricing details for volcengine Doubao-Seed-2.0 series (#23871)

Add entries for volcengine Doubao-Seed-2.0 series

* fix(mcp): support refresh_token grant type in OAuth token endpoint (#23701)

* fix(mcp): support refresh_token grant type in OAuth token endpoint (#23700)

The .well-known/oauth-authorization-server metadata advertises
refresh_token as a supported grant type, but the token endpoint
rejected it with HTTP 400. This adds refresh_token grant support
so MCP clients can refresh expired tokens without re-authenticating.

* test(mcp): add tests for refresh_token grant type in OAuth token endpoint

* fix(mcp): move code_verifier guard into authorization_code branch

code_verifier is only relevant for authorization_code grants (PKCE).
Move it inside the else branch so it doesn't apply to refresh_token.

* fix(mcp): guard None client_secret and forward scope in token exchange

- Conditionally include client_secret in form data to prevent httpx
  from sending the literal string "None" (applies to both
  authorization_code and refresh_token branches)
- Forward optional scope parameter per RFC 6749 §6, allowing clients
  to request a subset of originally-granted scopes on refresh
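The None guard and scope forwarding described above can be sketched as follows. The function name and parameter list are hypothetical; only the form field names come from RFC 6749.

```python
from typing import Dict, Optional


def build_refresh_form(
    refresh_token: str,
    client_id: str,
    client_secret: Optional[str] = None,
    scope: Optional[str] = None,
) -> Dict[str, str]:
    """Build form data for a refresh_token grant (hypothetical helper)."""
    form = {
        "grant_type": "refresh_token",
        "refresh_token": refresh_token,
        "client_id": client_id,
    }
    # Only include optional fields when set, so a missing value is never
    # form-encoded as the literal string "None".
    if client_secret is not None:
        form["client_secret"] = client_secret
    if scope is not None:
        form["scope"] = scope
    return form
```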

* fix(mcp): validate code param in authorization_code grant

Guard against None code being form-encoded as literal string "None"
by httpx, symmetric with the existing refresh_token guard.

* docs: add incident report for guardrail logging secret exposure (#24059)

Add blog post documenting the guardrail logging path exposing internal
request data (e.g. Authorization headers) in spend logs and OTEL traces.
Fix available in LiteLLM 1.82.3+.

Made-with: Cursor

* [Fix] Datadog LLM Observability tags format (env, service, version missing) (#23673)

* tag fix

* greptile comment

* fix(ci): stabilize 6 failing CI jobs

1. mypy: remove duplicate type annotation for token_data in discoverable_endpoints.py
2. integrations tests: add parameterized to CI test deps
3. doc quality: document OTEL_IGNORE_CONTEXT_PROPAGATION env key
4. security: allowlist CVE-2026-2673, CVE-2026-3644, CVE-2026-4224 (no fix available)
5. proxy_store_model_in_db: fix missing x-litellm-call-id header on error responses
6. google tests: add --retries 3 for transient Vertex AI rate limits

Co-authored-by: Ishaan Jaff <ishaan-jaff@users.noreply.github.com>

* fix(streaming): handle RuntimeError during model_copy in streaming handler

The race condition occurs when model_copy(deep=True) tries to deepcopy
_hidden_params dict while it's being concurrently modified by logging
callbacks. Fall back to shallow copy if the deep copy fails.
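A minimal sketch of the deep-copy fallback described above, using the standard `copy` module rather than Pydantic's `model_copy` so that it runs on its own. The helper name is hypothetical.

```python
import copy
from typing import TypeVar

T = TypeVar("T")


def copy_with_fallback(obj: T) -> T:
    """Deep-copy obj, falling back to a shallow copy on RuntimeError.

    Hypothetical helper. A RuntimeError such as "dictionary changed size
    during iteration" can occur when a nested dict is mutated while it is
    being deep-copied.
    """
    try:
        return copy.deepcopy(obj)
    except RuntimeError:
        # The shallow copy shares nested objects with the original.
        return copy.copy(obj)
```

The trade-off is that the fallback copy shares nested state with the original, so later mutations of that state are visible through both objects.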

Co-authored-by: Ishaan Jaff <ishaan-jaff@users.noreply.github.com>

* fix(cost): handle non-string traffic_type in cost calculator + add retries

1. Fix AttributeError in _map_traffic_type_to_service_tier when traffic_type
   is an integer (cast to str before calling .upper()). This was causing
   pass-through vertex spend logging to fail silently.
2. Add --retries to llm_translation_testing for flaky external API calls.
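The cast described in item 1 above can be illustrated with a small stand-in. The function below is hypothetical and is not `_map_traffic_type_to_service_tier` itself; it shows only the normalization step.

```python
from typing import Any, Optional


def normalize_traffic_type(traffic_type: Any) -> Optional[str]:
    """Return an upper-cased string form of traffic_type, or None.

    Hypothetical helper: casting to str first means an integer value no
    longer raises AttributeError on .upper().
    """
    if traffic_type is None:
        return None
    return str(traffic_type).upper()
```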

Co-authored-by: Ishaan Jaff <ishaan-jaff@users.noreply.github.com>

---------

Co-authored-by: Emerson Gomes <emerson.gomes@thalesgroup.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: ExMatics HydrogenC <33123710+HydrogenC@users.noreply.github.com>
Co-authored-by: Jack Venberg <jack.venberg@rover.com>
Co-authored-by: milan-berri <milan@berri.ai>
Co-authored-by: Shivam Rawat <161387515+shivamrawat1@users.noreply.github.com>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Ishaan Jaff <ishaan-jaff@users.noreply.github.com>
2026-03-19 10:20:35 -07:00
Ishaan Jaff 8e61b32b8e [Staging] - Ishaan March 17th (#23903)
* feat(xai): add grok-4.20 beta 2 models with pricing (#23900)

Add three grok-4.20 beta 2 model variants from xAI:
- grok-4.20-multi-agent-beta-0309 (reasoning + multi-agent)
- grok-4.20-beta-0309-reasoning (reasoning)
- grok-4.20-beta-0309-non-reasoning

Pricing (from https://docs.x.ai/docs/models):
- Input: $2.00/1M tokens ($0.20/1M cached)
- Output: $6.00/1M tokens
- Context: 2M tokens

All variants support vision, function calling, tool choice, and web search.
Closes LIT-2171

* docs: add Quick Install section for litellm --setup wizard (#23905)

* docs: add Quick Install section for litellm --setup wizard

* docs: clarify setup wizard is for local/beginner use

* feat(setup): interactive setup wizard + install.sh (#23644)

* feat(setup): add interactive setup wizard + install.sh

Adds `litellm --setup` — a Claude Code-style TUI onboarding wizard that
guides users through provider selection, API key entry, and proxy config
generation, then optionally starts the proxy immediately.

- litellm/setup_wizard.py: wizard with ASCII art, numbered provider menu
  (OpenAI, Anthropic, Azure, Gemini, Bedrock, Ollama), API key prompts,
  port/master-key config, and litellm_config.yaml generation
- litellm/proxy/proxy_cli.py: adds --setup flag that invokes the wizard
- scripts/install.sh: curl-installable script (detect OS/Python, pip
  install litellm[proxy], launch wizard)

Usage:
  curl -fsSL https://raw.githubusercontent.com/BerriAI/litellm/main/scripts/install.sh | sh
  litellm --setup

* fix(install.sh): remove orange color, add LITELLM_BRANCH env var for branch installs

* fix(install.sh): install from git branch so --setup is available for QA

* fix(install.sh): remove stale LITELLM_BRANCH reference that caused unbound variable error

* fix(install.sh): force-reinstall from git to bypass cached PyPI version

* fix(install.sh): show pip progress bar during install

* fix(install.sh): always launch wizard via $PYTHON_BIN -m litellm, not PATH binary

* fix(install.sh): use litellm.proxy.proxy_cli module (no __main__.py exists)

* fix(install.sh): suppress RuntimeWarning from module invocation

* fix(install.sh): use Python bin-dir litellm binary to avoid CWD sys.path shadowing

* fix(install.sh): use sysconfig.get_path('scripts') to find pip-installed litellm binary

* fix(install.sh): redirect stdin from /dev/tty on exec so wizard gets terminal, not exhausted pipe

* fix(install.sh): warn about git clone duration, drop --no-cache-dir so re-runs are faster

* feat(setup_wizard): arrow-key selector, updated model names

* fix(setup_wizard): use sysconfig binary to start proxy, not python -m litellm

* feat(setup_wizard): credential validation after key entry + clear next-steps after proxy start

* style(install.sh): show git clone warning in blue

* refactor(setup_wizard): class with static methods, use check_valid_key from litellm.utils

* address greptile review: fix yaml escaping, port validation, display name collisions, tests

- setup_wizard.py: add _yaml_escape() for safe YAML embedding of API keys
- setup_wizard.py: add _styled_input() with readline ANSI ignore markers
- setup_wizard.py: change DIVIDER to _divider() fn to avoid import-time color capture
- setup_wizard.py: validate port range 1-65535, initialize before loop
- setup_wizard.py: qualify azure display names (azure-gpt-4o) to avoid collision with openai
- setup_wizard.py: work on env_copy in _build_config to avoid mutating caller's dict
- setup_wizard.py: skip model_list entries for providers with no credentials
- setup_wizard.py: prompt for azure deployment name
- setup_wizard.py: wrap os.execlp in try/except with friendly fallback
- setup_wizard.py: wrap config write in try/except OSError
- setup_wizard.py: fix _validate_and_report to use two print lines (no \r overwrite)
- setup_wizard.py: add .gitignore tip next to key storage notice
- setup_wizard.py: fix run_setup_wizard() return type annotation to None
- scripts/install.sh: drop pipefail (not supported by dash on Ubuntu when invoked as sh)
- scripts/install.sh: use litellm[proxy] from PyPI (not hardcoded dev branch)
- scripts/install.sh: guard /dev/tty read with -r check for Docker/CI compat
- scripts/install.sh: remove --force-reinstall to avoid downgrading dependencies
- tests/test_litellm/test_setup_wizard.py: 13 unit tests for _build_config and _yaml_escape

* style: black format setup_wizard.py

* fix: address remaining greptile issues - Windows compat, YAML quoting, credential flow

- guard termios/tty imports with try/except ImportError for Windows compat
- quote master_key as YAML double-quoted scalar (same as env vars)
- remove unused port param from _build_config signature
- _validate_and_report now returns the final key so re-entered creds are stored
- add test for master_key YAML quoting

* fix: add --port to suggested command, guard /dev/tty exec in install.sh

* fix: quote api_base in YAML, skip azure if no deployment, only redraw on state change

* fix: address greptile review comments

- _yaml_escape: add control character escaping (\n, \r, \t)
- test: fix tautological assertion in test_build_config_azure_no_deployment_skipped
- test: add tests for control character escaping in _yaml_escape
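The escaping described above, for values embedded in a YAML double-quoted scalar, might be sketched like this. This is an illustration under assumptions, not the wizard's actual `_yaml_escape`.

```python
def yaml_escape(value: str) -> str:
    """Escape value for use inside a YAML double-quoted scalar.

    Illustrative sketch. Backslashes are escaped first so the backslashes
    introduced by the later replacements are not doubled.
    """
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
        .replace("\t", "\\t")
    )
```

A caller would then write the result between double quotes, for example `f'api_key: "{yaml_escape(key)}"'`.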

* feat(ui): remove Chat UI page link and banner from sidebar and playground (#23908)

* feat(guardrails): MCPJWTSigner - built-in guardrail for zero trust MCP auth (#23897)

* Allow pre_mcp_call guardrail hooks to mutate outbound MCP headers

* Enhance MCPServerManager to support hook-modified arguments and extra headers. Update tests to validate argument mutation and header injection behavior, including warnings for OpenAPI-backed servers when headers are present.

* Refactor MCPServerManager to raise HTTPException for extra headers in OpenAPI-backed servers. Update tests to reflect this change, ensuring proper exception handling instead of logging warnings.

* feat(guardrails): add MCPJWTSigner built-in guardrail for zero trust MCP auth

Signs outbound MCP tool calls with a LiteLLM-issued RS256 JWT so MCP servers
can trust a single signing authority instead of every upstream IdP.

Enable in config.yaml:
  guardrails:
    - guardrail_name: mcp-jwt-signer
      litellm_params:
        guardrail: mcp_jwt_signer
        mode: pre_mcp_call
        default_on: true

JWT carries sub (user_id), act.sub (team_id, RFC 8693), tool-level scope, iss,
aud, iat/exp/nbf. RSA-2048 keypair auto-generated at startup unless
MCP_JWT_SIGNING_KEY env var is set.

Adds /.well-known/jwks.json endpoint and jwks_uri to /.well-known/openid-configuration
so MCP servers can verify LiteLLM-issued tokens via OIDC discovery.
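The claim set listed above could be assembled along these lines before signing. The function name, the default TTL, and the way the scope string is passed in are assumptions; RS256 signing itself is omitted.

```python
import time
from typing import Any, Dict, Optional


def build_claims(
    user_id: str,
    team_id: Optional[str],
    scope: str,
    issuer: str,
    audience: str,
    ttl_seconds: int = 300,  # assumption: default chosen for illustration
    now: Optional[int] = None,
) -> Dict[str, Any]:
    """Assemble the outbound JWT claims (hypothetical helper, no signing)."""
    issued_at = int(time.time()) if now is None else now
    claims: Dict[str, Any] = {
        "iss": issuer,
        "aud": audience,
        "sub": user_id,
        "scope": scope,
        "iat": issued_at,
        "nbf": issued_at,
        "exp": issued_at + ttl_seconds,
    }
    if team_id is not None:
        # RFC 8693 "act" (actor) claim carrying the team identity.
        claims["act"] = {"sub": team_id}
    return claims
```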

* Update MCPServerManager to raise HTTPException with status code 400 for extra headers in OpenAPI-backed servers. Adjust tests to verify the correct status code and exception message.

* fix: address P1 issues in MCPJWTSigner

- OpenAPI servers: warn + skip header injection instead of 500
- JWKS Cache-Control: 5min for auto-generated keys, 1h for persistent
- sub claim: fallback to apikey:{token_hash} for anonymous callers
- ttl_seconds: validate > 0 at init time

* docs: add MCP zero trust auth guide with architecture diagram

* docs: add FastMCP JWT verification guide to zero trust doc

* fix: address remaining Greptile review issues (round 2)

- mcp_server_manager: warn when hook Authorization overwrites existing header
- __init__: remove _mcp_jwt_signer_instance from __all__ (private internal)
- discoverable_endpoints: copy dict instead of mutating in-place on OIDC augmentation
- test docstring: reflect warn-and-continue behavior for OpenAPI servers
- test: update scope assertions for least-privilege (no mcp:tools/list on tool-call JWTs)

* fix: address Greptile round 3 feedback

- initialize_guardrail: validate mode='pre_mcp_call' at init time — misconfigured
  mode silently bypasses JWT injection, which is a zero-trust bypass
- _build_claims: remove duplicate inline 'import re' (module-level import already present)
- _types.py: add TODO comment explaining jwt_claims is forward-compat plumbing
  for a follow-up PR that will forward upstream IdP claims into outbound MCP JWTs

* feat(mcp_jwt_signer): add verify+re-sign, claim ops, two-token model, configurable scopes

Addresses all missing pieces from the scoping doc review:

FR-5 (Verify + re-sign): MCPJWTSigner now accepts access_token_discovery_uri
and token_introspection_endpoint.  When set, the incoming Bearer token is
extracted from raw_headers (threaded through pre_call_tool_check), verified
against the IdP's JWKS (JWT) or introspected (opaque), and only re-signed if
valid.  Falls back to user_api_key_dict.jwt_claims for LiteLLM JWT-auth mode.

FR-12 (Configurable end-user identity mapping): end_user_claim_sources
ordered list drives sub resolution — sources: token:<claim>, litellm:user_id,
litellm:email, litellm:end_user_id, litellm:team_id.
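The ordered resolution chain described above can be sketched as a first-match lookup. The function name and the two input mappings are hypothetical; only the `token:<claim>` and `litellm:<field>` source syntax comes from the commit message.

```python
from typing import Any, Mapping, Optional, Sequence


def resolve_sub(
    sources: Sequence[str],
    token_claims: Mapping[str, Any],
    litellm_fields: Mapping[str, Any],
) -> Optional[str]:
    """Return the first non-empty value named by sources, in order."""
    for source in sources:
        prefix, _, key = source.partition(":")
        if prefix == "token":
            value = token_claims.get(key)
        elif prefix == "litellm":
            value = litellm_fields.get(key)
        else:
            value = None  # unknown source kinds are skipped
        if value:
            return str(value)
    return None
```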

FR-13 (Claim operations): add_claims (insert-if-absent), set_claims (always
override), remove_claims (delete) applied in that order.
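The three claim operations and their order (add, then set, then remove) can be sketched as below. The function name is hypothetical.

```python
from typing import Any, Dict, Iterable, Mapping, Optional


def apply_claim_ops(
    claims: Mapping[str, Any],
    add_claims: Optional[Mapping[str, Any]] = None,
    set_claims: Optional[Mapping[str, Any]] = None,
    remove_claims: Optional[Iterable[str]] = None,
) -> Dict[str, Any]:
    """Apply add (insert-if-absent), set (override), remove, in that order."""
    result = dict(claims)  # work on a copy; the caller's mapping is untouched
    for key, value in (add_claims or {}).items():
        result.setdefault(key, value)
    result.update(set_claims or {})
    for key in remove_claims or ():
        result.pop(key, None)
    return result
```

Because removal runs last, a claim named in both `set_claims` and `remove_claims` ends up absent.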

FR-14 (Two-token model): channel_token_audience + channel_token_ttl issue a
second JWT injected as x-mcp-channel-token: Bearer <token>.

FR-15 (Incoming claim validation): required_claims raises HTTP 403 when any
listed claim is absent; optional_claims passes listed claims from verified
token into the outbound JWT.

FR-9 (Debug headers): debug_headers: true emits x-litellm-mcp-debug with kid,
sub, iss, exp, scope.

FR-10 (Configurable scopes): allowed_scopes replaces auto-generation.  Also
fixed: tool-call JWTs no longer grant mcp:tools/list (overpermission).

P1 fixes:
- proxy/utils.py: _convert_mcp_hook_response_to_kwargs merges rather than
  replaces extra_headers, preserving headers from prior guardrails.
- mcp_server_manager.py: warns when hook injects Authorization alongside a
  server-configured authentication_token (previously silent).
- mcp_server_manager.py: pre_call_tool_check now accepts raw_headers and
  extracts incoming_bearer_token so FR-5 verification has the raw token.
- proxy/utils.py: remove stray inline import inspect inside loop (pre-existing
  lint error, now cleaned up).

Tests: 43 passing (28 new tests covering all FR flags + P1 fixes).

* feat(mcp_jwt_signer): add verify+re-sign, claim ops, two-token model, configurable scopes (core)

Remaining files from the FR implementation:

mcp_jwt_signer.py — full rewrite with all new params:
  FR-5:  access_token_discovery_uri, token_introspection_endpoint,
         verify_issuer, verify_audience + _verify_incoming_jwt(),
         _introspect_opaque_token()
  FR-12: end_user_claim_sources ordered resolution chain
  FR-13: add_claims, set_claims, remove_claims
  FR-14: channel_token_audience, channel_token_ttl → x-mcp-channel-token
  FR-15: required_claims (raises 403), optional_claims (passthrough)
  FR-9:  debug_headers → x-litellm-mcp-debug
  FR-10: allowed_scopes; tool-call JWTs no longer over-grant tools/list

mcp_server_manager.py:
  - pre_call_tool_check gains raw_headers param to extract incoming_bearer_token
  - Silent Authorization override warning fixed: now fires when server has
    authentication_token AND hook injects Authorization

tests/test_mcp_jwt_signer.py:
  28 new tests covering all FR flags + P1 fixes (43 total, all passing)

* fix(mcp_jwt_signer): address pre-landing review issues

- Remove stale TODO comment on UserAPIKeyAuth.jwt_claims — the field is
  already populated and consumed by MCPJWTSigner in the same PR
- Fix _get_oidc_discovery to only cache the OIDC discovery doc when
  jwks_uri is present; a malformed/empty doc now retries on the next
  request instead of being permanently cached until proxy restart
- Add FR-5 test coverage for _fetch_jwks (cache hit/miss),
  _get_oidc_discovery (cache/no-cache on bad doc), _verify_incoming_jwt
  (valid token, expired token), _introspect_opaque_token (active,
  inactive, no endpoint), and the end-to-end 401 hook path — 53 tests
  total, all passing

* docs(mcp_zero_trust): rewrite as use-case guide covering all new JWT signer features

Add scenario-driven sections for each new config area:
- Verify+re-sign with Okta/Azure AD (access_token_discovery_uri,
  end_user_claim_sources, token_introspection_endpoint)
- Enforcing caller attributes with required_claims / optional_claims
- Adding metadata via add_claims / set_claims / remove_claims
- Two-token model for AWS Bedrock AgentCore Gateway
  (channel_token_audience / channel_token_ttl)
- Controlling scopes with allowed_scopes
- Debugging JWT rejections with debug_headers

Update JWT claims table to reflect configurable sub (end_user_claim_sources)

* fix(mcp_jwt_signer): wire all config.yaml params through initialize_guardrail

The factory was only passing issuer/audience/ttl_seconds to MCPJWTSigner.
All FR-5/9/10/12/13/14/15 params (access_token_discovery_uri,
end_user_claim_sources, add/set/remove_claims, channel_token_audience,
required/optional_claims, debug_headers, allowed_scopes, etc.) were
silently dropped, making every advertised advanced feature non-functional
when loaded from config.yaml.

Add regression test that asserts every param is wired through correctly.

* docs(mcp_zero_trust): add hero image

* docs(mcp_zero_trust): apply Linear-style edits

- Lead with the problem (unsigned direct calls bypass access controls)
- Shorter statement section headers instead of question-form headers
- Move diagram/OIDC discovery block after the reader is bought in
- Add 'read further only if you need to' callout after basic setup
- Two-token section now opens from the user problem not product jargon
- Add concrete 403 error response example in required_claims section
- Debug section opens from the symptom (MCP server returning 401)
- Lowercase claims reference header for consistency

* fix(mcp_jwt_signer): fix algorithm confusion attack + add OIDC discovery 24h TTL

- Remove alg from unverified JWT header; use signing_jwk.algorithm_name from JWKS key instead.
  Reading alg from attacker-controlled headers enables alg:none / HS256 confusion attacks.
- Add _oidc_discovery_fetched_at timestamp and _OIDC_DISCOVERY_TTL = 86400 (24h).
  Without a TTL the cached discovery doc never refreshes, so IdP key rotation is invisible.
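A small sketch of a discovery-document cache with a 24h TTL follows. It also applies the rule from the earlier pre-landing fix that a document without `jwks_uri` is not cached. The class name, the injected `fetch` callable, and the injectable clock are assumptions for illustration.

```python
import time
from typing import Any, Callable, Dict, Optional

OIDC_DISCOVERY_TTL = 86400  # 24h, the value named in the commit message


class DiscoveryCache:
    """Cache an OIDC discovery doc for OIDC_DISCOVERY_TTL seconds (sketch)."""

    def __init__(
        self,
        fetch: Callable[[], Dict[str, Any]],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._clock = clock
        self._doc: Optional[Dict[str, Any]] = None
        self._fetched_at = 0.0

    def get(self) -> Dict[str, Any]:
        now = self._clock()
        if self._doc is not None and now - self._fetched_at < OIDC_DISCOVERY_TTL:
            return self._doc
        doc = self._fetch()
        # Only cache a usable doc; a malformed one is retried on the next call.
        if "jwks_uri" in doc:
            self._doc = doc
            self._fetched_at = now
        return doc
```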

---------

Co-authored-by: Noah Nistler <60981020+noahnistler@users.noreply.github.com>

* fix(ci): stabilize CI - formatting, type errors, test polling, security CVEs, router bug, batch resolution

Fix 1: Run Black formatter on 35 files
Fix 2: Fix MyPy type errors:
  - setup_wizard.py: add type annotation for 'selected' set variable
  - user_api_key_auth.py: remove redundant type annotation on jwt_claims reassignment
Fix 3: Fix spend accuracy test burst 2 polling to wait for expected total
  spend instead of just 'any increase' from burst 2
Fix 4: Bump Next.js 16.1.6 -> 16.1.7 to fix CVE-2026-27978, CVE-2026-27979,
  CVE-2026-27980, CVE-2026-29057
Fix 5: Fix router _pre_call_checks model variable being overwritten inside
  loop, causing wrong model lookups on subsequent deployments. Use local
  _deployment_model variable instead.
Fix 6: Add missing resolve_output_file_ids_to_unified call in batch retrieve
  non-terminal-to-terminal path (matching the terminal path behavior)

Co-authored-by: Ishaan Jaff <ishaan-jaff@users.noreply.github.com>

* chore: regenerate poetry.lock to sync with pyproject.toml

Co-authored-by: Ishaan Jaff <ishaan-jaff@users.noreply.github.com>

* fix: format merged files from main and regenerate poetry.lock

Co-authored-by: Ishaan Jaff <ishaan-jaff@users.noreply.github.com>

* fix(mypy): annotate jwt_claims as Optional[dict] to fix type incompatibility

Co-authored-by: Ishaan Jaff <ishaan-jaff@users.noreply.github.com>

* fix(ci): update router region test to use gpt-4.1-mini (fix flaky model lookup)

Replace deprecated gpt-3.5-turbo-1106 with gpt-4.1-mini + mock_response in
test_router_region_pre_call_check, following the same pattern used in commit
717d37cc5b for test_router_context_window_check_pre_call_check_out_group.

Co-authored-by: Ishaan Jaff <ishaan-jaff@users.noreply.github.com>

* ci: retry flaky logging_testing (async event loop race condition)

Co-authored-by: Ishaan Jaff <ishaan-jaff@users.noreply.github.com>

* fix(ci): aggregate all mock calls in langfuse e2e test to fix race condition

The _verify_langfuse_call helper only inspected the last mock call
(mock_post.call_args), but the Langfuse SDK may split trace-create and
generation-create events across separate HTTP flush cycles. This caused
an IndexError when the last call's batch contained only one event type.

Fix: iterate over mock_post.call_args_list to collect batch items from
ALL calls. Also add a safety assertion after filtering by trace_id and
mark all langfuse e2e tests with @pytest.mark.flaky(retries=3) as an
extra safety net for any residual timing issues.

Co-authored-by: Ishaan Jaff <ishaan-jaff@users.noreply.github.com>

* fix(ci): black formatting + update OpenAPI compliance tests for spec changes

- Apply Black 26.x formatting to litellm_logging.py (parenthesized style)
- Update test_input_types_match_spec to follow $ref to InteractionsInput schema
  (Google updated their OpenAPI spec to use $ref instead of inline oneOf)
- Update test_content_schema_uses_discriminator to handle discriminator without
  explicit mapping (Google removed the mapping key from Content discriminator)

Co-authored-by: Ishaan Jaff <ishaan-jaff@users.noreply.github.com>

* revert: undo incorrect Black 26.x formatting on litellm_logging.py

The file was correctly formatted for Black 23.12.1 (the version pinned
in pyproject.toml). The previous commit applied Black 26.x formatting
which was incompatible with the CI's Black version.

Co-authored-by: Ishaan Jaff <ishaan-jaff@users.noreply.github.com>

* fix(ci): deduplicate and sort langfuse batch events after aggregation

The Langfuse SDK may send the same event (e.g., trace-create) in
multiple flush cycles, causing duplicates when we aggregate from all
mock calls. After filtering by trace_id, deduplicate by keeping only
the first event of each type, then sort to ensure trace-create is at
index 0 and generation-create at index 1.
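The aggregate, filter, deduplicate, and sort steps described in this fix and the earlier one can be sketched as follows. The payload shape (a `json={"batch": [...]}` keyword argument whose items carry `type` and `body`) and the helper name are assumptions; the real test helper may read the request differently.

```python
from typing import Any, Dict, List
from unittest.mock import MagicMock

# Desired position of each event type in the final list.
EVENT_ORDER = {"trace-create": 0, "generation-create": 1}


def collect_events(mock_post: MagicMock, trace_id: str) -> List[Dict[str, Any]]:
    """Gather batch items from every mock call, dedupe by type, then sort."""
    items: List[Dict[str, Any]] = []
    for call in mock_post.call_args_list:  # all calls, not only the last one
        items.extend(call.kwargs.get("json", {}).get("batch", []))

    first_by_type: Dict[str, Dict[str, Any]] = {}
    for item in items:
        body = item.get("body", {})
        if trace_id not in (body.get("id"), body.get("traceId")):
            continue
        # Keep only the first event of each type.
        first_by_type.setdefault(item["type"], item)

    return sorted(
        first_by_type.values(),
        key=lambda event: EVENT_ORDER.get(event["type"], len(EVENT_ORDER)),
    )
```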

Co-authored-by: Ishaan Jaff <ishaan-jaff@users.noreply.github.com>

---------

Co-authored-by: Noah Nistler <60981020+noahnistler@users.noreply.github.com>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Ishaan Jaff <ishaan-jaff@users.noreply.github.com>
2026-03-18 15:09:01 -07:00
yuneng-jiang 8f56ddb9c6 Merge remote main into litellm_ci_optimize
Resolved conflict in test_claude_agent_sdk.py by keeping main's additions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 00:50:22 -07:00
yuneng-jiang cc027a2b90 Fix flaky test_langsmith_queue_logging: poll instead of fixed sleep
The test waited a fixed 3s for async callbacks to populate log_queue.
Under xdist -n 4, CPU contention can delay the GLOBAL_LOGGING_WORKER
background task beyond 3s. Replace fixed sleeps with polling loops
(up to 10s) that break as soon as the expected condition is met.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 20:25:11 -07:00
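
A minimal sketch of the poll-instead-of-sleep pattern from the commit above. The helper name `wait_for_condition` and its parameters are hypothetical; only the 10s upper bound comes from the commit message.

```python
import asyncio
import time
from typing import Callable


async def wait_for_condition(
    condition: Callable[[], bool], timeout: float = 10.0, interval: float = 0.1
) -> bool:
    """Poll `condition` until it is true or `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True  # return as soon as the expected state is reached
        if time.monotonic() >= deadline:
            return False  # timed out; the caller decides how to fail
        await asyncio.sleep(interval)  # yield so background tasks can run
```

A test could then write something like `assert await wait_for_condition(lambda: len(log_queue) == 1)`, which finishes quickly on an idle machine and tolerates slower runs under xdist.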
yuneng-jiang 9b77524354 Fix logging_testing: capture true defaults at conftest import time
Module-level mutations (litellm.num_retries=3 in test_langfuse_e2e_test.py
and test_amazing_s3_logs.py, litellm.success_callback=['langfuse']) run
at import time, BEFORE any function fixture. The save/restore pattern
captured these polluted values as 'originals' and kept restoring them.

Fix: capture litellm defaults when conftest.py is first imported (before
test modules), then reset to those true defaults before each test instead
of saving/restoring the current (potentially polluted) state.
2026-03-15 19:52:46 -07:00
yuneng-jiang 13a46598e7 Fix logging_testing: clear _in_memory_loggers and add missing globals
- Clear _in_memory_loggers before/after each test to prevent cached logger
  instances (LangsmithLogger, SlackAlerting, etc.) from leaking stale state
- Add pre_call_rules, post_call_rules to list attrs save/restore
- Add vector_store_registry to scalar attrs save/restore

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 18:48:32 -07:00
yuneng-jiang 92ad90de2a Fix logging_testing: expand save/restore to cover redaction and other globals
The logging tests mutate many more litellm globals than guardrails tests
(turn_off_message_logging, s3_callback_params, datadog_params, service_callback,
etc.). The initial save/restore list only covered callbacks and a few basics,
causing state leaks like redaction settings bleeding across tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 18:37:07 -07:00
yuneng-jiang 19e8a16cce Optimize logging_testing CI: suppress DEBUG logs, fix xdist isolation
- Add LITELLM_LOG=WARNING to suppress verbose DEBUG log output
- Remove -s flag so pytest captures stdout again instead of streaming it all to the CI log
- Bump xdist workers from -n 2 to -n 4
- Add --timeout=120 for safety
- Rewrite conftest.py to use save/restore pattern (matching guardrails_tests)
  instead of per-function importlib.reload + event loop creation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 18:24:57 -07:00
Harshit28j d7c9ec6276 add tests for fix 2026-03-15 00:58:08 +05:30
yuneng-jiang 89d8401d72 Merge pull request #23483 from BerriAI/litellm_update_deprecated_test_models
[Fix] Update Deprecated Model Names in CI Tests
2026-03-12 14:16:52 -07:00
yuneng-jiang cc81e3c226 Replace deprecated model names in tests that were removed from remote model cost map
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 14:12:07 -07:00
Cesar Garcia e01d722803 Merge branch 'main' into litellm_oss_staging_03_11_2026 2026-03-12 13:53:14 -03:00
Chesars feed274aa3 Reapply "feat: add model_cost aliases expansion support"
This reverts commit 3d2df7e8b5.
2026-03-12 13:36:57 -03:00
Chesars 1be6b31e2f merge: resolve conflicts between main and litellm_oss_staging_03_11_2026 2026-03-12 09:38:31 -03:00
Sameer Kankute 36ec80d90c Fix azure model router 2026-03-12 12:40:37 +05:30
Cesar Garcia 3d2df7e8b5 Revert "feat: add model_cost aliases expansion support" 2026-03-10 22:39:19 -03:00