38 Commits

Author SHA1 Message Date
Mateo Wang 20e453f698 feat(cli): per-agent lite claude / codex / opencode commands that wrap coding agents through the proxy (#29850)
* feat(cli): add `litellm-proxy run -- <agent>` to wrap coding agents through the proxy

Wraps Claude Code, Codex, OpenCode, and any other coding agent so all of its
LLM traffic routes through a LiteLLM proxy, with the agent-vault style of "just
works" DX: one `run -- <agent>` command, auto SSO login when interactive,
env-key "agent mode" for containers/CI, and a fail-fast key check against the
proxy so bad credentials error immediately instead of deep inside the agent.

The wrapped binary is detected by name to pick the right variables. Claude Code
gets ANTHROPIC_BASE_URL (the bare proxy root, so it appends /v1/messages) and
ANTHROPIC_AUTH_TOKEN, with any stray ANTHROPIC_API_KEY cleared so the proxy
token wins. Codex and OpenCode get OPENAI_BASE_URL (proxy + /v1) and
OPENAI_API_KEY. Unrecognized commands get both sets so they work either way.
`litellm-proxy claude-code` remains as a shortcut for `run -- claude`.
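
A minimal sketch of the per-agent env wiring described above. The function name, signature, and agent-name strings are hypothetical, not the real `build_agent_env`. Leaving any existing ANTHROPIC_API_KEY in place for unrecognized commands is an assumption.

```python
from typing import Dict, Mapping


def sketch_agent_env(
    agent: str, proxy_url: str, key: str, base_env: Mapping[str, str]
) -> Dict[str, str]:
    """Hypothetical helper: copy base_env and point the agent at the proxy."""
    env = dict(base_env)
    root = proxy_url.rstrip("/")
    anthropic_vars = {"ANTHROPIC_BASE_URL": root, "ANTHROPIC_AUTH_TOKEN": key}
    openai_vars = {"OPENAI_BASE_URL": root + "/v1", "OPENAI_API_KEY": key}

    if agent == "claude":
        # Clear a stray API key so the proxy token wins.
        env.pop("ANTHROPIC_API_KEY", None)
        env.update(anthropic_vars)
    elif agent in ("codex", "opencode"):
        env.update(openai_vars)
    else:
        # Unrecognized command: export both sets so it works either way.
        env.update(anthropic_vars)
        env.update(openai_vars)
    return env
```

Passing the base environment in, instead of reading `os.environ` inside the helper, is what lets this kind of wiring be tested without monkeypatching.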

The core logic is split into dependency-injected helpers (agent_profile,
build_agent_env, verify_proxy_key, run_agent) so env wiring, the preflight, and
the launch handoff are unit-tested without monkeypatching, alongside CliRunner
tests for auth resolution, agent mode, and auto-login. Mutation-tested the env
profiles, preflight, and agent-mode branch to confirm the tests fail when the
behavior is broken.

https://claude.ai/code/session_0154VpLXW7mMvk5wfbgPRJa6

* Make each coding agent its own litellm-proxy command

Replace the `run -- <agent>` interface and the `claude-code` shortcut with
top-level commands generated per known agent, so launching is just
`litellm-proxy claude`, `litellm-proxy codex`, or `litellm-proxy opencode`,
with everything after the agent name forwarded straight to it. This drops the
ceremony of `run --` and cuts typing.

The `--model`/`--small-fast-model` wrapper flags are gone; pass the agent's
own model flag instead, or export the model env vars (the wrapper preserves
what you already have set), which keeps the surface minimal and avoids
intercepting flags the agent owns. Rename the module to agents.py to match.

* fix(cli): route `litellm-proxy codex` through the proxy via a custom provider

Codex ignores OPENAI_BASE_URL (it always dials api.openai.com over the
Responses WebSocket transport), so the OpenAI env profile alone left
`litellm-proxy codex` talking to OpenAI directly instead of the proxy. Point
Codex at the proxy with a custom provider passed as `-c` config overrides, and
force the HTTP/SSE Responses transport with supports_websockets=false since the
proxy does not speak the Responses WebSocket protocol. The provider reads its
key from OPENAI_API_KEY, which the agent env already exports.

The overrides are injected ahead of the user's args so they precede Codex's
subcommand. Claude Code and OpenCode are unaffected; they honor the exported
env vars. Adds regression tests for the per-agent launch args and the
injection ordering.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* Rename litellm-proxy CLI command to lite

The proxy management CLI was invoked as litellm-proxy, which is a lot to
type for an everyday command. Rename the console script entry point to
lite and update the in-CLI usage examples, help text, error messages and
docs to match.

* fix(sso): stop CLI auth success page from hanging on "Closing..."

The CLI opens the SSO success page with webbrowser.open, so the tab is
not script-opened and the browser refuses window.close(). The countdown
would end on "Closing..." and the tab would sit there forever.

Drop the countdown and just show "You can now close this window and
return to your terminal." from the start, while still attempting
window.close() once so the tab auto-closes in the rare case the browser
allows it. Add a regression test asserting the manual-close instruction
is always present and the misleading countdown/"Closing..." text is gone.

* fix(cli): reattach controlling terminal after SSO login, keep litellm-proxy alias

When the first `lite claude` has to log in via browser SSO, completing the login could
leave stdin detached from the terminal, so a TUI agent like Claude Code would start in
non-interactive mode and exit with "Input must be provided". The wrapper now reopens the
controlling terminal onto stdin just before handoff when the session started interactively;
piped or redirected input is detected up front and left alone, so agent-mode and
non-interactive use are unchanged.
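
A rough POSIX-only sketch of reopening the controlling terminal onto stdin, assuming `/dev/tty` is the terminal device. The function name and the `started_interactive` flag are hypothetical and this is not the wrapper's actual code.

```python
import os


def reattach_stdin(started_interactive: bool, tty_path: str = "/dev/tty") -> bool:
    """Point fd 0 back at the terminal; return True only if that happened."""
    if not started_interactive:
        # Piped or redirected input was detected up front: leave it alone.
        return False
    try:
        fd = os.open(tty_path, os.O_RDONLY)
    except OSError:
        # No controlling terminal (for example in a container or CI job).
        return False
    if fd != 0:
        os.dup2(fd, 0)  # make stdin refer to the terminal again
        os.close(fd)
    return True
```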

Also keep the `litellm-proxy` console script as an alias for `lite` so existing scripts and
CI that invoke `litellm-proxy` keep working; both names map to the same CLI.

* feat(install): make the curl installer need only curl, not a pre-existing Python

The installer now lets uv provision a managed Python 3.13 when no suitable
interpreter is found, instead of aborting. The minimum is also bumped from
3.9 to 3.10 to match the package's requires-python (>=3.10), so a system
Python 3.9 is no longer selected only to have uv tool install reject it.

* feat(cli): add thin litellm[cli] install path (install-cli.sh + brew) for the lite CLI

On a developer laptop the `lite` CLI only needs `lite login` and running coding
agents through a proxy, but the sole install path was `litellm[proxy]`, which
drags in the whole server tree (fastapi, uvicorn, boto3, polars, cryptography,
litellm-enterprise). The CLI's heavy imports are all guarded, so it runs on the
base SDK plus just rich, pyyaml and requests.

Add a `cli` extra carrying exactly those three, a `scripts/install-cli.sh` curl
one-liner that installs `litellm[cli]`, and a `BerriAI/homebrew-litellm` tap
formula with a release runbook under `packaging/homebrew/`. The installer passes
no `--python`, so uv honours litellm's requires-python and provisions a managed
interpreter, skipping a too-old (3.9) or too-new (3.14+) system Python instead
of failing to resolve.

A pyproject thin-contract test asserts the `cli` extra keeps the deps the CLI
imports and never leaks a server-only dependency from `proxy`, so the laptop
install cannot silently re-bloat.

* fix(install): let uv pick the Python via --python-preference system

Both installers detected a system Python with a floor-only check and forced it
with `uv tool install --python <interp>`. On a host whose only Python is outside
litellm's requires-python (a too-old 3.9 or, increasingly, a too-new 3.14) that
forced an incompatible interpreter and the resolve failed. Drop the detection and
pass `--python-preference system`: uv reuses a compatible system Python when
present and downloads a managed one otherwise, always honouring requires-python.

* test(router): filter aiohttp unclosed-session gc noise in test_async_fallbacks

test_async_fallbacks asserts the last three captured log records are the
router's fallback messages. Under the litellm_router_testing job (pytest -k
router -n 4) many router tests share the module-level in_memory_llm_clients_cache
(max 200, ttl 3600s). Older cached OpenAI/Azure clients get evicted while their
aiohttp ClientSession is still open, and when the gc reclaims them aiohttp emits
"Unclosed client session"/"Unclosed connector" through the asyncio logger.
Those records land in caplog mid-test and push the expected router logs out of
the last-three window, so the assertion flips to failing non-deterministically.

These warnings are async cleanup noise, not router debug logs, so filter them
out exactly like the existing leaked-task warnings before asserting order. The
assertion on the three router fallback messages is unchanged.
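
A minimal sketch of the filtering step, assuming the noise is recognized by message prefix. The helper name and the prefix tuple are illustrative and not taken from the test itself.

```python
import logging
from typing import Iterable, List

# Messages aiohttp emits when an unclosed session or connector is reclaimed.
NOISE_PREFIXES = ("Unclosed client session", "Unclosed connector")


def drop_async_cleanup_noise(
    records: Iterable[logging.LogRecord],
) -> List[logging.LogRecord]:
    """Keep every record except the async cleanup warnings."""
    return [r for r in records if not r.getMessage().startswith(NOISE_PREFIXES)]
```

In a test this would run over `caplog.records` before slicing the last three entries, so the ordering assertion only sees router logs.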

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
Co-authored-by: Claude <noreply@anthropic.com>
2026-06-10 13:52:26 -07:00
Yassin Kortam d5d6b26a72 fix: improve bedrock streaming hot path perf (#28720) 2026-05-28 11:31:37 -07:00
Yassin Kortam 2eab9ee2c0 perf: reduce per-request and per-chunk overhead across Anthropic streaming hot paths (#28289)
* perf: reduce per-request and per-chunk overhead across Anthropic streaming hot paths

- Introduce pure-text fast-path in `_build_complete_streaming_response` that collapses O(N) `content_block_delta` events into a single equivalent SSE event before conversion, eliminating per-output-token Pydantic `ModelResponseStream` construction; non-text streams (tool_use, thinking, citations) fall back to the unchanged legacy path
- Skip agentic streaming wrapper entirely when no callback overrides `async_should_run_agentic_loop`; the wrapper buffered every chunk and rebuilt the SSE response only to call hooks that all return `(False, {})` — a pure no-op for the default config
- Serialize request body once (`json.dumps`) for both the pre-call log input and the wire, instead of twice; avoids a full O(payload) scan per request, significant for long-context Claude Code histories
- Add fast path in `async_streaming_data_generator` that bypasses the per-chunk `async_post_call_streaming_hook` coroutine await, response-string materialization, and cost-injection call when no callback/guardrail/cost-injection is active (the default config)
- Resolve `_DD_STREAMING_TRACE_ENABLED` once at import time; eliminate per-chunk `NullSpan` context manager allocation when Datadog tracing is disabled (the default)
- Memoize `get_type_hints(AnthropicMessagesRequestOptionalParams)` with `@lru_cache(maxsize=1)` — resolves once per process instead of once per `/v1/messages` request (~80µs each)
- Hoist `cost_injection_active` out of the per-chunk loop in `chunk_processor`; eliminates repeated `getattr` + endpoint-type checks on every streamed byte chunk
- Extract `_build_passthrough_logging_result` from `_route_streaming_logging_to_handler` as a standalone static method to facilitate future off-loop dispatch
- Convert `async_sse_data_generator` from an `async for: yield` trampoline to a direct return of the underlying generator, removing one async-generator layer per streamed chunk
- Skip redundant `strip_empty_text_blocks_from_anthropic_messages` scan in `anthropic_messages_handler` when the async wrapper already sanitized (signalled via `_litellm_messages_presanitized` sentinel, popped before reaching provider params)
- Gate debug log `f-string` evaluation behind `isEnabledFor(DEBUG)` in both the streaming generator and the transformation layer to avoid serializing entire message payloads on every request at non-debug log levels
- Add benchmark script (`scripts/benchmark_anthropic_messages_perf.py`) with a local mock Anthropic SSE provider for reproducible TTFT and TPM measurement across commits/branches
- Add parity tests asserting fast-path and legacy-path produce byte-identical logged/billed payloads, plus unit tests for agentic hook detection, pre-serialized body reuse, and memoized key resolution
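
As an illustration of the `get_type_hints` memoization bullet above, a minimal sketch using a stand-in TypedDict. `ExampleOptionalParams` and `cached_param_names` are hypothetical names, not the real `AnthropicMessagesRequestOptionalParams` code path.

```python
from functools import lru_cache
from typing import FrozenSet, Optional, TypedDict, get_type_hints


class ExampleOptionalParams(TypedDict, total=False):
    """Stand-in for a request-params TypedDict."""

    max_tokens: int
    temperature: Optional[float]


@lru_cache(maxsize=1)
def cached_param_names() -> FrozenSet[str]:
    # get_type_hints resolves annotations on every call; caching the result
    # means the work happens once per process instead of once per request.
    return frozenset(get_type_hints(ExampleOptionalParams))
```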

* perf: address greptile review for anthropic streaming hot path

- Bail to legacy in `_collapse_pure_text_chunks` when content_block_delta
  events from different block indexes are observed without an intervening
  flush. Anthropic sends blocks strictly sequentially, but defensive bail
  prevents silent text-merging if the protocol ever interleaves.
- Replace leaf-class `__dict__` check for `async_post_call_streaming_hook`
  in `_callback_capabilities` with a function-identity comparison that
  walks the MRO. A vendor base class can carry the override and the
  registered class can add nothing else; before this PR the hook was
  unconditionally invoked, so an inherited-override miss would silently
  drop the hook on the streaming path.
- Add unit tests for both behaviors.
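
A minimal sketch of why the function-identity comparison catches what a leaf-class `__dict__` check misses. The class names here are hypothetical stand-ins for the real logger hierarchy.

```python
class BaseLogger:
    async def async_post_call_streaming_hook(self, chunk):
        return chunk  # default: no-op


class VendorBase(BaseLogger):
    async def async_post_call_streaming_hook(self, chunk):
        return chunk  # vendor override lives on the intermediate base


class RegisteredCallback(VendorBase):
    pass  # adds nothing of its own


def overrides_streaming_hook(callback: object) -> bool:
    # Attribute lookup on the class walks the MRO, so an inherited override
    # resolves to a function object different from the base implementation.
    impl = getattr(type(callback), "async_post_call_streaming_hook", None)
    return impl is not None and impl is not BaseLogger.async_post_call_streaming_hook
```

Here `"async_post_call_streaming_hook" in RegisteredCallback.__dict__` is False, while `overrides_streaming_hook(RegisteredCallback())` is True.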

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(mypy): narrow model_name to str in cost-injection branch

The hoisted cost_injection_active flag in chunk_processor encodes the
`bool(model_name)` requirement but mypy can't track that invariant
through the local, so the per-chunk `_process_chunk_with_cost_injection(
chunk, model_name)` calls flagged Optional[str] vs str. Pin a typed
non-None local inside the cost-injection branch so mypy narrows
correctly without changing runtime behavior.
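
A small sketch of the narrowing pattern, with hypothetical function names. The exact form of the typed local in `chunk_processor` is not shown in this message, so `model_name or ""` is an assumption.

```python
from typing import List, Optional


def _process_with_cost(chunk: bytes, model_name: str) -> bytes:
    return chunk + b"|" + model_name.encode()


def process_chunks(
    chunks: List[bytes], model_name: Optional[str], inject_cost: bool
) -> List[bytes]:
    # Hoisted once, outside the per-chunk loop.
    cost_injection_active = inject_cost and bool(model_name)
    if not cost_injection_active:
        return list(chunks)
    # mypy cannot tell from the boolean flag that model_name is a str here,
    # so pin a typed local. The flag guarantees it is non-empty at runtime.
    typed_model_name: str = model_name or ""
    return [_process_with_cost(c, typed_model_name) for c in chunks]
```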

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Yassin Kortam <yassinkortam@g.ucla.edu>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 12:15:59 -07:00
Yassin Kortam a6494e6fe3 perf: eliminate per-request callback scanning on proxy hot path (#27858)
- Introduce `_CallbackCapabilities` dataclass and `ProxyLogging._callback_capabilities()` static method that inspects `litellm.callbacks` once and caches capability flags keyed on (list length, member ids); invalidates automatically when the callback list mutates without per-request iteration overhead
- Replace O(n) `litellm.callbacks` walks in `async_pre_call_hook`, `during_call_hook`, `async_post_call_streaming_iterator_hook`, `async_post_call_streaming_hook`, and `post_call_response_headers_hook` with fast-path exits when no relevant callbacks are registered
- Add `needs_iterator_wrap()` and `needs_per_chunk_streaming_hook()` instance methods to decouple iterator-level wrapping from per-chunk hook execution; avoids `get_response_string` materialization per chunk when no guardrail or chunk-hook callback is active
- Introduce `_fast_serialize_simple_model_response_stream()` using `orjson` for common single-choice text streaming chunks, bypassing the full Pydantic serializer; falls back to `model_dump_json` for tool calls, logprobs, usage, and provider-specific fields
- Add early-return in `_restamp_streaming_chunk_model` when downstream model already matches the requested model, avoiding unnecessary string comparisons on every chunk
- Fix stale zero-cost cache bug in `_is_model_cost_zero`: move the per-router `_zero_cost_cache` dict onto the `Router` instance and clear it in `_invalidate_model_group_info_cache` so in-place pricing updates via `upsert_deployment` immediately resume budget enforcement
- Add `scripts/benchmark_chat_completions_perf.py`: standalone async benchmarking tool with a mock OpenAI provider, LiteLLM proxy process management, non-streaming RPS, streaming TTFT, and full-stream latency measurements with repeat/median run support
- Add comprehensive unit tests covering capability detection, cache invalidation, fast-path correctness, zero-cost cache regression, and the no-callback streaming fast path
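
A simplified sketch of a capability cache keyed on (list length, member ids), as in the first bullet above. The class and field names are hypothetical, and the real `_CallbackCapabilities` carries more flags. Note that this sketch still walks the list to build the key, which is cheaper than inspecting each callback's attributes.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Capabilities:
    has_pre_call_hook: bool


class CapabilityCache:
    def __init__(self) -> None:
        self._key: Optional[Tuple[int, Tuple[int, ...]]] = None
        self._caps: Optional[Capabilities] = None
        self.scans = 0  # how many times the callbacks were inspected

    def get(self, callbacks: List[object]) -> Capabilities:
        key = (len(callbacks), tuple(id(cb) for cb in callbacks))
        if self._caps is None or key != self._key:
            # The list changed (or first use): re-inspect and remember.
            self.scans += 1
            self._caps = Capabilities(
                has_pre_call_hook=any(
                    callable(getattr(cb, "async_pre_call_hook", None))
                    for cb in callbacks
                )
            )
            self._key = key
        return self._caps
```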

Co-authored-by: Yassin Kortam <yassinkortam@g.ucla.edu>
2026-05-14 09:28:31 -07:00
ryan-crabbe-berri be84d5cd7d ci: add manually-triggered mutation testing workflow (#27576)
* ci: add manually-triggered mutation testing smoke workflow

Adds a workflow_dispatch-only GitHub Actions workflow that runs mutmut
against a single source/test pair (router_settings_endpoints) to validate
the tooling end-to-end before scaling.

The workflow reinstalls litellm non-editable so the mutants/ sandbox is
not shadowed by the editable .pth on sys.path, and sets PYTHONPATH so
the trampolined sandbox copy wins over site-packages.

mutmut itself is pulled in via uv run --with so it does not appear in
uv.lock or affect the shared dev environment.

Includes a temporary push: trigger scoped to this branch so we can
iterate before the workflow file lands on the default branch — to be
removed before merging (workflow_dispatch only requires the file on the
default branch to surface the manual trigger button).

* ci(mutation): disable rerun and xdist plugins for mutmut runs

mutmut's in-process pytest.main() call hits
`INTERNALERROR: no option named 'filtered_exceptions'` from
pytest-retry's pytest_configure hook. Reruns are also wrong for
mutation testing — a "failed" mutant test that gets retried would
mask which mutants are killed vs. survive. Disable retry,
rerunfailures, and xdist via pytest_add_cli_args in [tool.mutmut].

* ci(mutation): uninstall pytest-retry before mutmut runs

`-p no:retry` (and similar names) didn't match pytest-retry's
entry-point name, so the plugin still loaded and crashed during
mutmut's "Running clean tests" phase. Uninstalling the package is
surgical and doesn't depend on guessing the entry-point name.

* ci(mutation): emit per-survivor diffs to run-page summary + artifact

The previous artifact only contained `mutmut results` text (which in
mutmut 3.x lists survivor names but not the actual mutations). Adds:

- `mutmut export-cicd-stats` to produce mutmut-cicd-stats.json with the
  killed/survived/total scoreboard.
- `mutmut show <name>` per surviving mutant to capture each mutation as
  a unified diff.
- A `mutmut-report.md` that combines summary + run-progress tail +
  per-survivor diffs, written to both the artifact and
  $GITHUB_STEP_SUMMARY (visible on the run page, no download needed).
- Corrected artifact paths: stats files live under mutants/, not the
  project root.
- The trampolined source file from the sandbox so survivors can be
  inspected even outside `mutmut show`.

* ci(mutation): document intended manual weekly cadence in trigger comment

* ci(mutation): generate ACH-style report with embedded function bodies

Replaces the inline bash markdown generation with a Python script that:
- Groups survivors by function (one section per function, function body
  shown once per section, surviving mutants nested as subsections)
- Embeds each enclosing function's source via Python AST (so the agent
  has full context, not just a 3-line `mutmut show` diff)
- Inlines the existing test file(s) listed in [tool.mutmut].tests_dir
- Writes an ACH-style task description at the bottom following the
  prompt template from arXiv 2501.12862

Output goes to mutation-report.md (artifact) and the head of the file
is appended to $GITHUB_STEP_SUMMARY for at-a-glance visibility.

* fix(mutation report): correctly parse function names with leading underscores

mutmut's mutant-name prefix is x_ (single underscore), so a function
named _foo produces mutants x__foo__mutmut_N. The previous regex
\.x__(.+)__mutmut_ ate the function's leading underscore as part of
the prefix. Changed to \.x_(.+)__mutmut_ so leading underscores are
preserved in the captured function name; verified for normal, leading-
underscore, and dunder-method names.
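
The difference between the two patterns can be checked directly. The mutant names below are made-up examples in the `x_<function>__mutmut_N` shape described above, and `function_name` is a hypothetical helper.

```python
import re
from typing import Optional, Pattern

OLD_PATTERN = re.compile(r"\.x__(.+)__mutmut_")  # eats a leading underscore
NEW_PATTERN = re.compile(r"\.x_(.+)__mutmut_")   # prefix is a single underscore


def function_name(mutant: str, pattern: Pattern[str] = NEW_PATTERN) -> Optional[str]:
    """Extract the function name from a mutant identifier, or None."""
    match = pattern.search(mutant)
    return match.group(1) if match else None


print(function_name("pkg.mod.x__foo__mutmut_3"))               # _foo
print(function_name("pkg.mod.x__foo__mutmut_3", OLD_PATTERN))  # foo
```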

* feat(mutation report): full Meta ACH-style rendering with MUTANT delimiters

For each surviving mutant, parse the mutmut sandbox trampoline file and
render the mutated function as it appears in the source — with the
differing lines wrapped in `# MUTANT START` / `# MUTANT END` comments,
matching the format from Meta's ACH paper (arXiv 2501.12862, Table 1).
Renames the function header back to its original name so the agent sees
the function as it would appear in the file. Falls back to the unified
diff if the trampoline lookup fails.

Handles replace, insert, and delete diff ops; uses difflib's
SequenceMatcher to find the differing line ranges.

The unified diff is preserved in a collapsible <details> block as
secondary context.
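
A minimal sketch of the delimiter rendering, assuming the original and mutated function bodies are already available as lists of lines. `mark_mutant` is a hypothetical name and the report script may structure this differently.

```python
import difflib
from typing import List


def mark_mutant(original: List[str], mutated: List[str]) -> List[str]:
    """Return the mutated lines with differing ranges wrapped in delimiters."""
    out: List[str] = []
    matcher = difflib.SequenceMatcher(a=original, b=mutated, autojunk=False)
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(mutated[j1:j2])
        else:
            # replace / insert / delete: mutated[j1:j2] is empty for a delete,
            # which leaves an empty START/END pair marking the removed lines.
            out.append("# MUTANT START")
            out.extend(mutated[j1:j2])
            out.append("# MUTANT END")
    return out
```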

* ci(mutation): scope to whole management_endpoints folder, drop temp push trigger

Final scope before merge:
- paths_to_mutate / tests_dir broadened from one file to the entire
  management_endpoints source/test folders
- Trigger is now `workflow_dispatch` only — the temporary push: block
  used during workflow iteration is removed
- timeout-minutes bumped from 60 to 350 (just under the GH-hosted job
  cap of 360); whole-folder mutation against ~15 files / ~7.5k LOC can
  take a few hours
- Artifact path for the trampoline files glob-expanded to cover all
  files under mutants/litellm/proxy/management_endpoints/

* fix(mutation report): warn when multiple functions in a file share a name

Addresses the Greptile review concern: ast.walk's first-match-wins
behavior could embed the wrong function body when a file defines the
same name in multiple places (e.g., a module-level helper and a class
method). mutmut's mutant identifier does not carry class context, so
we can't always determine which definition was mutated.

find_function_in_file now returns the start line of every matching
definition; render() surfaces a "Note: N functions named X" warning
in the report when there is more than one match. The first match is
still embedded as the body — the warning tells the reader to verify
manually instead of silently using the wrong context.

Smoke-tested against the existing artifact: single-match files render
unchanged.
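
A minimal sketch of collecting every matching definition. It takes source text, not a file path, and `find_function_lines` is a hypothetical name, not the actual `find_function_in_file`.

```python
import ast
from typing import List


def find_function_lines(source: str, name: str) -> List[int]:
    """Return the start line of every def / async def called `name`."""
    tree = ast.parse(source)
    return sorted(
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and node.name == name
    )
```

More than one returned line is the condition that would trigger the "N functions named X" note.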

* Fix mutation report anchors

* Fix mutation report TOC anchors

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
2026-05-11 15:19:57 -07:00
harish-berri a67b7a7e87 Refactor Bedrock response stream shape handling (#27257)
* Refactor Bedrock response stream shape handling

- Introduced a module-level constant `BEDROCK_RESPONSE_STREAM_SHAPE` to cache the response stream shape, eliminating the need for per-instance caching in `BedrockEventStreamDecoderBase`.
- Updated relevant methods to utilize the new constant, improving performance by avoiding redundant loading of the shape.
- Added tests to ensure the shape is loaded correctly at import time and is consistent across different modules.
- Added a new mock server script for testing Bedrock pass-through functionality.

* Refactor response parsing for Bedrock and SageMaker

- Improved code readability by formatting the parsing method calls in `AWSEventStreamDecoder` for both Bedrock and SageMaker response stream shapes.
- Added blank lines for better separation of code blocks in `invoke_handler.py` and `common_utils.py` to enhance maintainability.

* Enhance error handling for Bedrock and SageMaker response stream shape loading

- Wrapped the loading logic in `_load_bedrock_response_stream_shape` and `_load_sagemaker_response_stream_shape` with try-except blocks to gracefully handle exceptions.
- Added logging to warn when the response stream shape cannot be pre-loaded, ensuring the module imports cleanly.
- Updated tests to verify that loading failures return `None` instead of propagating exceptions.

* Implement error handling for missing response stream shapes in Bedrock and SageMaker

- Added checks in `_parse_message_from_event` methods to raise appropriate errors when `BEDROCK_RESPONSE_STREAM_SHAPE` or `SAGEMAKER_RESPONSE_STREAM_SHAPE` is None, ensuring clearer error reporting.
- Updated logging messages to reflect the unavailability of event-stream decoding for both Bedrock and SageMaker.
- Enhanced unit tests to verify that the correct exceptions are raised when the response stream shapes are not loaded.
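
A minimal sketch of the load-once-with-fallback pattern from the steps above. The loader is injected so the example stays self-contained; the function names and the use of RuntimeError are assumptions, not the module's actual API.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def load_shape_or_none(loader: Callable[[], object], label: str) -> Optional[object]:
    """Run the loader once; on failure warn and return None so import succeeds."""
    try:
        return loader()
    except Exception as exc:
        logger.warning("Could not pre-load %s response stream shape: %s", label, exc)
        return None


def parse_event(shape: Optional[object], event: bytes) -> dict:
    # Fail loudly at use time instead of at import time.
    if shape is None:
        raise RuntimeError("response stream shape not loaded; decoding unavailable")
    return {"shape": shape, "event": event}
```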
2026-05-06 17:39:38 -07:00
Yassin Kortam 950074eea2 fix: atomic TPM rate limit (#27001)
Co-authored-by: Yassin Kortam <yassinkortam@g.ucla.edu>
2026-05-05 16:58:07 -07:00
Krrish Dholakia b6fc75b3ce Merge branch 'litellm_internal_staging' into litellm_adaptive_routing 2026-04-20 15:28:08 -07:00
Krrish Dholakia 386f334fee Prompt Compression - add it to the proxy (#25729)
* refactor: new agentic loop event hook

simplifies how to create logic for tool based multi llm calls

* fix: compress - make it work on anthropic input as well

* fix(compress.py): working prompt compression for claude code

ensures claude code messages can run through proxy easily

* docs: add agentic loop hook guide

* docs: add agentic_loop_hook to sidebar

* fix: fix multiple arguments error

* fix: fix tool call loop for compression on streaming /v1/messages

* fix: fix linting errors

* fix: fix ci/cd errors

* feat(litellm_pre_call_utils.py): use claude code session for litellm session id

allows claude code logs to be stitched together, making it easy to know they were all part of the same conversation

* fix: suppress incorrect mypy warning re: module

* revert: drop PR's changes to litellm/proxy/_experimental/out/

Restores the 34 HTML files under _experimental/out/ to their pre-PR
paths (X/index.html -> X.html). All renames are R100 (content
unchanged); no other files are touched.

* fix: address greptile review comments on PR #25729

- Skip ``kwargs["tools"] = []`` injection when compression is a no-op —
  Anthropic Messages rejects empty tool arrays on requests that did not
  originally declare tools.
- Move agentic-loop safety guards (fingerprint cycle / max depth) out of
  the per-callback try/except so they propagate instead of being swallowed
  by the generic exception handler. Extracted _check_agentic_loop_safety.
- Gate generic ``x-<vendor>-session-id`` capture behind the
  LITELLM_CAPTURE_VENDOR_SESSION_HEADERS env var (off by default) to
  preserve backwards compatibility; explicit x-litellm-* headers are
  unaffected.
- Fix monkeypatch target in pre-call-hook test to patch the actual
  module-level binding
  (litellm.integrations.compression_interception.handler.compress).
- Add regression tests for empty-tools skip and opt-in session capture.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* revert: drop LITELLM_CAPTURE_VENDOR_SESSION_HEADERS flag

Generic x-<vendor>-session-id header capture is a new feature and only
runs *after* the explicit x-litellm-trace-id / x-litellm-session-id
checks, so it does not change behavior for any existing caller that was
already using the LiteLLM headers — no backwards-incompatibility to gate.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor(compress): replace input_type with CallTypes call_type

Drop the bespoke ``CompressionInputType`` literal and use the existing
``litellm.types.utils.CallTypes`` enum instead.  ``litellm.compress()``
now takes ``call_type: Union[CallTypes, str]`` (default
``CallTypes.completion``) — no new concept to learn, and the enum is
already the way the rest of the codebase talks about request shapes.

Supported values: ``completion`` / ``acompletion`` (OpenAI chat-completions
shape) and ``anthropic_messages`` (Anthropic structured content blocks).

Updated: compress(), the compression_interception handler, tests, docs,
and the two eval scripts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-20 15:08:00 -07:00
Krrish Dholakia 924fa6a3bc feat: commit new adaptive routing 2026-04-18 21:29:39 -07:00
Krrish Dholakia dd4a1d2be2 feat: add adaptive routing to litellm
allow model routing to improve based on conversation signals

ensures router is picking best model for task
2026-04-18 16:35:17 -07:00
Ishaan Jaffer e8461b5b97 style: run black formatter on files from main merge 2026-04-17 13:02:59 -07:00
ishaan-berri cb8fc480e6 Merge pull request #25732 from harish876/health-check-oom
Optimize database query to prevent OOM errors during health checks
2026-04-15 18:13:11 -07:00
harish876 d20c70f24c Optimize the database query that fetches the latest (model_id, model_name) pairs, which were previously deduplicated in memory.
This fix:
 - Updates the test case
 - Optimizes the query and documents it with a docstring; deduplication and sorting now happen in SQL
 - Adds a bench script to compare peak memory usage before and after
2026-04-15 00:54:37 +00:00
ishaan-berri 0e43050a01 Merge pull request #25650 from BerriAI/litellm_dev_04_13_2026_p1
feat: add litellm.compress() — BM25-based prompt compression with ret…
2026-04-14 12:24:47 -07:00
Krrish Dholakia 26c7412339 feat: add litellm.compress() — BM25-based prompt compression with retrieval tool (#25637)
* feat: add litellm.compress() for BM25-based context compression

Adds a compress() utility that reduces context size for LLM calls using
BM25 relevance scoring (with optional semantic embeddings via
litellm.embedding()). Messages below a token threshold pass through
unchanged; messages above are scored, ranked, and the lowest-relevance
ones replaced with stubs. Originals are cached and a retrieval tool is
injected so the model can recover dropped content on demand.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(compress): truncate high-scoring messages instead of fully stubbing them

When a relevant message was too large to fit in the token budget it was
replaced with a stub, leaving the LLM with no real content to work with.
Now the highest-scoring overflow message is truncated (first 70% + last 30%
of words) to fill the remaining budget, so the LLM always receives actual
content rather than just a retrieval pointer.
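
A rough sketch of the head-plus-tail truncation, using a word count as a stand-in for the token budget. The function name and the `[...]` marker are assumptions.

```python
def truncate_words(text: str, budget_words: int) -> str:
    """Keep roughly the first 70% and last 30% of the budget, in words."""
    words = text.split()
    if len(words) <= budget_words:
        return text
    head = budget_words * 7 // 10   # integer math avoids float rounding
    tail = budget_words - head
    kept_tail = words[-tail:] if tail > 0 else []
    return " ".join(words[:head] + ["[...]"] + kept_tail)
```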

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(bm25): add prefix expansion so query terms match inflected doc tokens

"cook" now matches "cooking", "auth" matches "authentication", etc.
Without this, short query terms scored 0 against longer inflected forms
in documents, causing the wrong message to be kept.
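
A minimal sketch of prefix expansion, assuming query terms are expanded against the document vocabulary before scoring. The function name and the `min_prefix` guard are hypothetical.

```python
from typing import Iterable, Set


def expand_query_terms(
    query_terms: Iterable[str], doc_vocab: Set[str], min_prefix: int = 3
) -> Set[str]:
    """Add every document token that starts with a query term."""
    expanded: Set[str] = set()
    for term in query_terms:
        expanded.add(term)
        if len(term) >= min_prefix:  # avoid matching on very short prefixes
            expanded.update(tok for tok in doc_vocab if tok.startswith(term))
    return expanded
```

With this, "cook" contributes "cooking" and "auth" contributes "authentication" to the set of terms scored against each document.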

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test: add routing correctness test and eval harness for litellm.compress()

- test_simple_compression: parametrized test verifying BM25 routes the
  right message based on query ("How to cook?" keeps cooking, "Fix auth"
  keeps auth content)
- eval_compression.py: end-to-end eval harness comparing baseline vs
  compressed model performance on HumanEval-style coding problems

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(eval): add SWE-bench Lite compression eval harness

Uses princeton-nlp/SWE-bench_Lite_bm25_27K which bundles ~27k tokens of
BM25-retrieved repo context per problem — large enough to meaningfully
stress litellm.compress() without Docker or GitHub API calls.

Proxy eval metrics (no test runner needed):
  - has_diff: model produced a valid unified diff
  - file_overlap: fraction of gold-patch files in generated patch
  - exact_file_match: generated patch touches exactly the right files

Run: python tests/eval_swe_bench.py --model gpt-4o --problems 10

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(eval): robust dataset loading + sys.path fix for worktree imports

- Add HuggingFace API fallback so the SWE-bench loader doesn't need
  the `datasets` library (avoids pyarrow/numpy binary compat issues)
- Insert repo root into sys.path so compression module resolves
  from worktrees
- Use direct import of litellm_compress to avoid __getattr__ issues

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* improve compression quality: line-based truncation, multi-message budget, 70% default target

- Switch truncate_message from word-based to line-based splitting to
  preserve code structure (function boundaries, indentation)
- Allow multiple messages to be truncated instead of burning entire
  budget on one overflow message
- Raise default compression target from 50% to 70% of trigger for
  better quality/cost tradeoff
- Add --compression-target CLI arg to SWE-bench eval harness
- Move tests to canonical locations (tests/test_litellm/, scripts/)
- Add docs page and sidebar entries for compress()

Eval results (5 problems, Opus, trigger=10k):
  Hunk overlap delta improved from -0.417 to -0.221
  Content similarity now matches baseline (+0.006)
  Cost savings: 72%

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add SWE-bench performance results to compress() docs

Include benchmark table from Opus eval (5 problems, trigger=10k)
showing 72% cost savings with file-level quality fully preserved.
Add metric explanations and eval runner examples.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(eval): use tolerance-based hunk overlap metric

The exact line-number matching was too brittle — LLM-generated patches
often target the right code region but with slightly offset line numbers.
Switch to hunk-level overlap with a 10-line tolerance window so nearby
edits count as matches. This better reflects actual patch quality.
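
A simplified sketch of a tolerance-based hunk overlap, assuming each hunk is reduced to a (file, start line) pair. The harness may represent hunks differently.

```python
from typing import List, Tuple

Hunk = Tuple[str, int]  # (file path, hunk start line)


def hunk_overlap(gold: List[Hunk], generated: List[Hunk], tolerance: int = 10) -> float:
    """Fraction of gold hunks with a generated hunk in the same file nearby."""
    if not gold:
        return 0.0
    matched = sum(
        1
        for g_file, g_line in gold
        if any(
            f == g_file and abs(line - g_line) <= tolerance
            for f, line in generated
        )
    )
    return matched / len(gold)
```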

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add compression_interception callback for LiteLLM Proxy

Add a proxy callback that automatically compresses incoming /v1/messages
payloads above a configurable token threshold, runs the retrieval tool
loop server-side, and returns the final response. This brings compress()
support to proxy deployments (e.g. Claude Code via /v1/messages).

- New callback: litellm/integrations/compression_interception/
- Proxy config: compression_interception_params in litellm_settings
- Support for input_type param in compress() (openai vs anthropic)
- Docs: proxy setup instructions with YAML config example
- Tests: 139-line unit test suite for the interception handler

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Revert "feat: add compression_interception callback for LiteLLM Proxy"

This reverts commit 72bd5cb152ca1df07f14a14e14a2816e188874a8.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-13 12:23:54 -07:00
stuxf a6c30b30bf build: migrate packaging, CI, and Docker from Poetry to uv (#25007)
* build: migrate packaging metadata to uv

* ci: move automation and local tooling to uv

* docker: migrate image builds and runtime setup to uv

* docs: update install and deployment guidance for uv

* chore: align auxiliary scripts and tests with uv

* test: harden test_litellm isolation

* fix: keep release and health check images self-contained

* build: pin uv tooling and health check deps

* test: isolate bedrock image request formatting from suite state

* test: cover sandbox executor requirements flow

* ci: fix circleci no-op command steps

* ci: fix circleci publish workflow parsing

* fix: stabilize remaining uv migration CI checks

* ci: increase matrix test timeout headroom

* fix: restore published docker and license coverage

* fix: restore proxy runtime build parity

* fix: restore proxy extras parity and venv migrations

* ci: persist uv path across circleci steps

* fix: keep psycopg binary in default test env

* docker: preserve prisma cache across stages

* test: run local proxy checks through uv python

* build: restore runtime deps moved into ci

* build: refresh uv lock after upstream merge

* fix: restore module import in test_check_migration after merge

The conflict resolution imported only the function but the test body
references check_migration as a module throughout.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: revert dependency promotions, remove nodejs-wheel-binaries, fix Docker layer caching

- Move google-generativeai, Pillow, tenacity back to ci group (they are
  lazily imported and bloat the base SDK install needlessly)
- Remove nodejs-wheel-binaries from extra_proxy and proxy-dev (redundant
  in Docker where system Node.js is already installed via apk)
- Remove all nodejs-wheel node replacement and venv npm patching blocks
  from Dockerfiles since the wheel is no longer installed
- Add --no-default-groups to CodSpeed benchmark workflow so the benchmark
  environment matches the old minimal pip install footprint
- Apply standard uv two-phase Docker pattern: copy metadata first, install
  deps (cached layer), then copy source and install project
- Replace CircleCI enterprise no-op with proper uv sync command

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: regenerate uv.lock after removing nodejs-wheel-binaries

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(ci): use cache/restore instead of cache to prevent cache poisoning

The old workflow used actions/cache/restore (read-only). The uv migration
changed it to actions/cache (read-write), which zizmor flags as a cache
poisoning risk. Restore the safer read-only variant.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(ci): disable setup-uv built-in cache to silence cache-poisoning alert

The setup-uv action enables caching by default, which zizmor flags as a
cache poisoning risk. Disable it since we already use a read-only
cache/restore step.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(ci): disable setup-uv cache in publish workflow

Silences zizmor cache-poisoning alert. Publishing workflow runs
infrequently on protected branches so caching adds no real benefit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(test): remove duplicate verbose_logger mock in test_check_migration

The logger was patched twice — first via mocker.patch() then via
mocker.patch.object(autospec=True). The second call fails because
autospec cannot inspect an already-mocked attribute. Remove the
redundant first patch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(ci): free disk space before Docker build in test-server-root-path

The Dockerfile.non_root build ran out of disk on the CI runner. Remove
Android SDK, .NET, Boost, and GHC toolchains (~12GB) to free space.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 11:46:23 -07:00
Yuneng Jiang 51af6fedb3 [Infra] Harden supply chain: remove unused scripts, add pip binary-only install
Remove ci_cd/publish-proxy-extras.sh (dead, unreferenced PyPI publish script)
and .pre-commit-config.yaml (pulls external repos from GitHub on git commit).
Add --only-binary :all: to scripts/install.sh to prevent execution of
malicious setup.py during pip install.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 14:13:57 -07:00
Yuneng Jiang 5f63873dca [Infra] Pin all Docker build dependencies to exact versions
Pin every dependency across all Docker builds so upgrades are intentional.
Verified by building all 3 production images and diffing pip freeze against
known-good v1.83.0-nightly baselines — zero version drift.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 00:05:39 -07:00
Ishaan Jaff 8e61b32b8e [Staging] - Ishaan March 17th (#23903)
* feat(xai): add grok-4.20 beta 2 models with pricing (#23900)

Add three grok-4.20 beta 2 model variants from xAI:
- grok-4.20-multi-agent-beta-0309 (reasoning + multi-agent)
- grok-4.20-beta-0309-reasoning (reasoning)
- grok-4.20-beta-0309-non-reasoning

Pricing (from https://docs.x.ai/docs/models):
- Input: $2.00/1M tokens ($0.20/1M cached)
- Output: $6.00/1M tokens
- Context: 2M tokens

All variants support vision, function calling, tool choice, and web search.
Closes LIT-2171

* docs: add Quick Install section for litellm --setup wizard (#23905)

* docs: add Quick Install section for litellm --setup wizard

* docs: clarify setup wizard is for local/beginner use

* feat(setup): interactive setup wizard + install.sh (#23644)

* feat(setup): add interactive setup wizard + install.sh

Adds `litellm --setup` — a Claude Code-style TUI onboarding wizard that
guides users through provider selection, API key entry, and proxy config
generation, then optionally starts the proxy immediately.

- litellm/setup_wizard.py: wizard with ASCII art, numbered provider menu
  (OpenAI, Anthropic, Azure, Gemini, Bedrock, Ollama), API key prompts,
  port/master-key config, and litellm_config.yaml generation
- litellm/proxy/proxy_cli.py: adds --setup flag that invokes the wizard
- scripts/install.sh: curl-installable script (detect OS/Python, pip
  install litellm[proxy], launch wizard)

Usage:
  curl -fsSL https://raw.githubusercontent.com/BerriAI/litellm/main/scripts/install.sh | sh
  litellm --setup

* fix(install.sh): remove orange color, add LITELLM_BRANCH env var for branch installs

* fix(install.sh): install from git branch so --setup is available for QA

* fix(install.sh): remove stale LITELLM_BRANCH reference that caused unbound variable error

* fix(install.sh): force-reinstall from git to bypass cached PyPI version

* fix(install.sh): show pip progress bar during install

* fix(install.sh): always launch wizard via $PYTHON_BIN -m litellm, not PATH binary

* fix(install.sh): use litellm.proxy.proxy_cli module (no __main__.py exists)

* fix(install.sh): suppress RuntimeWarning from module invocation

* fix(install.sh): use Python bin-dir litellm binary to avoid CWD sys.path shadowing

* fix(install.sh): use sysconfig.get_path('scripts') to find pip-installed litellm binary

* fix(install.sh): redirect stdin from /dev/tty on exec so wizard gets terminal, not exhausted pipe

* fix(install.sh): warn about git clone duration, drop --no-cache-dir so re-runs are faster

* feat(setup_wizard): arrow-key selector, updated model names

* fix(setup_wizard): use sysconfig binary to start proxy, not python -m litellm

* feat(setup_wizard): credential validation after key entry + clear next-steps after proxy start

* style(install.sh): show git clone warning in blue

* refactor(setup_wizard): class with static methods, use check_valid_key from litellm.utils

* address greptile review: fix yaml escaping, port validation, display name collisions, tests

- setup_wizard.py: add _yaml_escape() for safe YAML embedding of API keys
- setup_wizard.py: add _styled_input() with readline ANSI ignore markers
- setup_wizard.py: change DIVIDER to _divider() fn to avoid import-time color capture
- setup_wizard.py: validate port range 1-65535, initialize before loop
- setup_wizard.py: qualify azure display names (azure-gpt-4o) to avoid collision with openai
- setup_wizard.py: work on env_copy in _build_config to avoid mutating caller's dict
- setup_wizard.py: skip model_list entries for providers with no credentials
- setup_wizard.py: prompt for azure deployment name
- setup_wizard.py: wrap os.execlp in try/except with friendly fallback
- setup_wizard.py: wrap config write in try/except OSError
- setup_wizard.py: fix _validate_and_report to use two print lines (no \r overwrite)
- setup_wizard.py: add .gitignore tip next to key storage notice
- setup_wizard.py: fix run_setup_wizard() return type annotation to None
- scripts/install.sh: drop pipefail (not supported by dash on Ubuntu when invoked as sh)
- scripts/install.sh: use litellm[proxy] from PyPI (not hardcoded dev branch)
- scripts/install.sh: guard /dev/tty read with -r check for Docker/CI compat
- scripts/install.sh: remove --force-reinstall to avoid downgrading dependencies
- tests/test_litellm/test_setup_wizard.py: 13 unit tests for _build_config and _yaml_escape

* style: black format setup_wizard.py

* fix: address remaining greptile issues - Windows compat, YAML quoting, credential flow

- guard termios/tty imports with try/except ImportError for Windows compat
- quote master_key as YAML double-quoted scalar (same as env vars)
- remove unused port param from _build_config signature
- _validate_and_report now returns the final key so re-entered creds are stored
- add test for master_key YAML quoting

* fix: add --port to suggested command, guard /dev/tty exec in install.sh

* fix: quote api_base in YAML, skip azure if no deployment, only redraw on state change

* fix: address greptile review comments

- _yaml_escape: add control character escaping (\n, \r, \t)
- test: fix tautological assertion in test_build_config_azure_no_deployment_skipped
- test: add tests for control character escaping in _yaml_escape
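
For reference, escaping for a YAML double-quoted scalar might look like
the sketch below. `yaml_escape` and `yaml_quoted` are illustrative
stand-ins; the real _yaml_escape may differ in name, scope, and the exact
set of characters it handles.

```python
def yaml_escape(value: str) -> str:
    """Escape a string for embedding inside a YAML double-quoted scalar."""
    return (
        value.replace("\\", "\\\\")  # backslash first, so later escapes are not doubled
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
        .replace("\t", "\\t")
    )


def yaml_quoted(value: str) -> str:
    """Wrap the escaped value in double quotes, ready to place after `key: `."""
    return f'"{yaml_escape(value)}"'


print("master_key: " + yaml_quoted('sk-"quoted"\nkey'))
```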

* feat(ui): remove Chat UI page link and banner from sidebar and playground (#23908)

* feat(guardrails): MCPJWTSigner - built-in guardrail for zero trust MCP auth (#23897)

* Allow pre_mcp_call guardrail hooks to mutate outbound MCP headers

* Enhance MCPServerManager to support hook-modified arguments and extra headers. Update tests to validate argument mutation and header injection behavior, including warnings for OpenAPI-backed servers when headers are present.

* Refactor MCPServerManager to raise HTTPException for extra headers in OpenAPI-backed servers. Update tests to reflect this change, ensuring proper exception handling instead of logging warnings.

* feat(guardrails): add MCPJWTSigner built-in guardrail for zero trust MCP auth

Signs outbound MCP tool calls with a LiteLLM-issued RS256 JWT so MCP servers
can trust a single signing authority instead of every upstream IdP.

Enable in config.yaml:
  guardrails:
    - guardrail_name: mcp-jwt-signer
      litellm_params:
        guardrail: mcp_jwt_signer
        mode: pre_mcp_call
        default_on: true

JWT carries sub (user_id), act.sub (team_id, RFC 8693), tool-level scope, iss,
aud, iat/exp/nbf. RSA-2048 keypair auto-generated at startup unless
MCP_JWT_SIGNING_KEY env var is set.

Adds /.well-known/jwks.json endpoint and jwks_uri to /.well-known/openid-configuration
so MCP servers can verify LiteLLM-issued tokens via OIDC discovery.

* Update MCPServerManager to raise HTTPException with status code 400 for extra headers in OpenAPI-backed servers. Adjust tests to verify the correct status code and exception message.

* fix: address P1 issues in MCPJWTSigner

- OpenAPI servers: warn + skip header injection instead of 500
- JWKS Cache-Control: 5min for auto-generated keys, 1h for persistent
- sub claim: fallback to apikey:{token_hash} for anonymous callers
- ttl_seconds: validate > 0 at init time

* docs: add MCP zero trust auth guide with architecture diagram

* docs: add FastMCP JWT verification guide to zero trust doc

* fix: address remaining Greptile review issues (round 2)

- mcp_server_manager: warn when hook Authorization overwrites existing header
- __init__: remove _mcp_jwt_signer_instance from __all__ (private internal)
- discoverable_endpoints: copy dict instead of mutating in-place on OIDC augmentation
- test docstring: reflect warn-and-continue behavior for OpenAPI servers
- test: update scope assertions for least-privilege (no mcp:tools/list on tool-call JWTs)

* fix: address Greptile round 3 feedback

- initialize_guardrail: validate mode='pre_mcp_call' at init time — misconfigured
  mode silently bypasses JWT injection, which is a zero-trust bypass
- _build_claims: remove duplicate inline 'import re' (module-level import already present)
- _types.py: add TODO comment explaining jwt_claims is forward-compat plumbing
  for a follow-up PR that will forward upstream IdP claims into outbound MCP JWTs

* feat(mcp_jwt_signer): add verify+re-sign, claim ops, two-token model, configurable scopes

Addresses all missing pieces from the scoping doc review:

FR-5 (Verify + re-sign): MCPJWTSigner now accepts access_token_discovery_uri
and token_introspection_endpoint.  When set, the incoming Bearer token is
extracted from raw_headers (threaded through pre_call_tool_check), verified
against the IdP's JWKS (JWT) or introspected (opaque), and only re-signed if
valid.  Falls back to user_api_key_dict.jwt_claims for LiteLLM JWT-auth mode.

FR-12 (Configurable end-user identity mapping): end_user_claim_sources
ordered list drives sub resolution — sources: token:<claim>, litellm:user_id,
litellm:email, litellm:end_user_id, litellm:team_id.

FR-13 (Claim operations): add_claims (insert-if-absent), set_claims (always
override), remove_claims (delete) applied in that order.
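
The ordering described above can be sketched as follows. The function
name `apply_claim_operations` is hypothetical; only the three parameter
names and their order of application come from this message.

```python
from typing import Any, Dict, Iterable, Mapping, Optional


def apply_claim_operations(
    claims: Mapping[str, Any],
    add_claims: Optional[Mapping[str, Any]] = None,
    set_claims: Optional[Mapping[str, Any]] = None,
    remove_claims: Optional[Iterable[str]] = None,
) -> Dict[str, Any]:
    """Apply add (insert-if-absent), then set (override), then remove (delete)."""
    result = dict(claims)  # work on a copy so the caller's claims are untouched
    for key, value in (add_claims or {}).items():
        result.setdefault(key, value)
    result.update(set_claims or {})
    for key in remove_claims or ():
        result.pop(key, None)
    return result


print(
    apply_claim_operations(
        {"sub": "u1", "iss": "litellm"},
        add_claims={"sub": "ignored", "env": "prod"},
        set_claims={"iss": "gateway"},
        remove_claims=["env"],
    )
)
```

Because remove runs last, a claim named in both add_claims and
remove_claims ends up absent.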

FR-14 (Two-token model): channel_token_audience + channel_token_ttl issue a
second JWT injected as x-mcp-channel-token: Bearer <token>.

FR-15 (Incoming claim validation): required_claims raises HTTP 403 when any
listed claim is absent; optional_claims passes listed claims from verified
token into the outbound JWT.

FR-9 (Debug headers): debug_headers: true emits x-litellm-mcp-debug with kid,
sub, iss, exp, scope.

FR-10 (Configurable scopes): allowed_scopes replaces auto-generation.  Also
fixed: tool-call JWTs no longer grant mcp:tools/list (overpermission).

P1 fixes:
- proxy/utils.py: _convert_mcp_hook_response_to_kwargs merges rather than
  replaces extra_headers, preserving headers from prior guardrails.
- mcp_server_manager.py: warns when hook injects Authorization alongside a
  server-configured authentication_token (previously silent).
- mcp_server_manager.py: pre_call_tool_check now accepts raw_headers and
  extracts incoming_bearer_token so FR-5 verification has the raw token.
- proxy/utils.py: remove stray inline import inspect inside loop (pre-existing
  lint error, now cleaned up).

Tests: 43 passing (28 new tests covering all FR flags + P1 fixes).

* feat(mcp_jwt_signer): add verify+re-sign, claim ops, two-token model, configurable scopes (core)

Remaining files from the FR implementation:

mcp_jwt_signer.py — full rewrite with all new params:
  FR-5:  access_token_discovery_uri, token_introspection_endpoint,
         verify_issuer, verify_audience + _verify_incoming_jwt(),
         _introspect_opaque_token()
  FR-12: end_user_claim_sources ordered resolution chain
  FR-13: add_claims, set_claims, remove_claims
  FR-14: channel_token_audience, channel_token_ttl → x-mcp-channel-token
  FR-15: required_claims (raises 403), optional_claims (passthrough)
  FR-9:  debug_headers → x-litellm-mcp-debug
  FR-10: allowed_scopes; tool-call JWTs no longer over-grant tools/list

mcp_server_manager.py:
  - pre_call_tool_check gains raw_headers param to extract incoming_bearer_token
  - Silent Authorization override warning fixed: now fires when server has
    authentication_token AND hook injects Authorization

tests/test_mcp_jwt_signer.py:
  28 new tests covering all FR flags + P1 fixes (43 total, all passing)

* fix(mcp_jwt_signer): address pre-landing review issues

- Remove stale TODO comment on UserAPIKeyAuth.jwt_claims — the field is
  already populated and consumed by MCPJWTSigner in the same PR
- Fix _get_oidc_discovery to only cache the OIDC discovery doc when
  jwks_uri is present; a malformed/empty doc now retries on the next
  request instead of being permanently cached until proxy restart
- Add FR-5 test coverage for _fetch_jwks (cache hit/miss),
  _get_oidc_discovery (cache/no-cache on bad doc), _verify_incoming_jwt
  (valid token, expired token), _introspect_opaque_token (active,
  inactive, no endpoint), and the end-to-end 401 hook path — 53 tests
  total, all passing

* docs(mcp_zero_trust): rewrite as use-case guide covering all new JWT signer features

Add scenario-driven sections for each new config area:
- Verify+re-sign with Okta/Azure AD (access_token_discovery_uri,
  end_user_claim_sources, token_introspection_endpoint)
- Enforcing caller attributes with required_claims / optional_claims
- Adding metadata via add_claims / set_claims / remove_claims
- Two-token model for AWS Bedrock AgentCore Gateway
  (channel_token_audience / channel_token_ttl)
- Controlling scopes with allowed_scopes
- Debugging JWT rejections with debug_headers

Update JWT claims table to reflect configurable sub (end_user_claim_sources)

* fix(mcp_jwt_signer): wire all config.yaml params through initialize_guardrail

The factory was only passing issuer/audience/ttl_seconds to MCPJWTSigner.
All FR-5/9/10/12/13/14/15 params (access_token_discovery_uri,
end_user_claim_sources, add/set/remove_claims, channel_token_audience,
required/optional_claims, debug_headers, allowed_scopes, etc.) were
silently dropped, making every advertised advanced feature non-functional
when loaded from config.yaml.

Add regression test that asserts every param is wired through correctly.

* docs(mcp_zero_trust): add hero image

* docs(mcp_zero_trust): apply Linear-style edits

- Lead with the problem (unsigned direct calls bypass access controls)
- Shorter statement section headers instead of question-form headers
- Move diagram/OIDC discovery block after the reader is bought in
- Add 'read further only if you need to' callout after basic setup
- Two-token section now opens from the user problem not product jargon
- Add concrete 403 error response example in required_claims section
- Debug section opens from the symptom (MCP server returning 401)
- Lowercase claims reference header for consistency

* fix(mcp_jwt_signer): fix algorithm confusion attack + add OIDC discovery 24h TTL

- Remove alg from unverified JWT header; use signing_jwk.algorithm_name from JWKS key instead.
  Reading alg from attacker-controlled headers enables alg:none / HS256 confusion attacks.
- Add _oidc_discovery_fetched_at timestamp and _OIDC_DISCOVERY_TTL = 86400 (24h).
  Without a TTL the cached discovery doc never refreshes, so IdP key rotation is invisible.

---------

Co-authored-by: Noah Nistler <60981020+noahnistler@users.noreply.github.com>

* fix(ci): stabilize CI - formatting, type errors, test polling, security CVEs, router bug, batch resolution

Fix 1: Run Black formatter on 35 files
Fix 2: Fix MyPy type errors:
  - setup_wizard.py: add type annotation for 'selected' set variable
  - user_api_key_auth.py: remove redundant type annotation on jwt_claims reassignment
Fix 3: Fix spend accuracy test burst 2 polling to wait for expected total
  spend instead of just 'any increase' from burst 2
Fix 4: Bump Next.js 16.1.6 -> 16.1.7 to fix CVE-2026-27978, CVE-2026-27979,
  CVE-2026-27980, CVE-2026-29057
Fix 5: Fix router _pre_call_checks model variable being overwritten inside
  loop, causing wrong model lookups on subsequent deployments. Use local
  _deployment_model variable instead.
Fix 6: Add missing resolve_output_file_ids_to_unified call in batch retrieve
  non-terminal-to-terminal path (matching the terminal path behavior)

Co-authored-by: Ishaan Jaff <ishaan-jaff@users.noreply.github.com>

* chore: regenerate poetry.lock to sync with pyproject.toml

Co-authored-by: Ishaan Jaff <ishaan-jaff@users.noreply.github.com>

* fix: format merged files from main and regenerate poetry.lock

Co-authored-by: Ishaan Jaff <ishaan-jaff@users.noreply.github.com>

* fix(mypy): annotate jwt_claims as Optional[dict] to fix type incompatibility

Co-authored-by: Ishaan Jaff <ishaan-jaff@users.noreply.github.com>

* fix(ci): update router region test to use gpt-4.1-mini (fix flaky model lookup)

Replace deprecated gpt-3.5-turbo-1106 with gpt-4.1-mini + mock_response in
test_router_region_pre_call_check, following the same pattern used in commit
717d37cc5b for test_router_context_window_check_pre_call_check_out_group.

Co-authored-by: Ishaan Jaff <ishaan-jaff@users.noreply.github.com>

* ci: retry flaky logging_testing (async event loop race condition)

Co-authored-by: Ishaan Jaff <ishaan-jaff@users.noreply.github.com>

* fix(ci): aggregate all mock calls in langfuse e2e test to fix race condition

The _verify_langfuse_call helper only inspected the last mock call
(mock_post.call_args), but the Langfuse SDK may split trace-create and
generation-create events across separate HTTP flush cycles. This caused
an IndexError when the last call's batch contained only one event type.

Fix: iterate over mock_post.call_args_list to collect batch items from
ALL calls. Also add a safety assertion after filtering by trace_id and
mark all langfuse e2e tests with @pytest.mark.flaky(retries=3) as an
extra safety net for any residual timing issues.
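
A small sketch of the difference between call_args and call_args_list.
The `json={"batch": [...]}` payload shape, the URL, and the helper name
`collect_batch_items` are assumptions for illustration, not the test's
actual code.

```python
from unittest.mock import MagicMock


def collect_batch_items(mock_post):
    """Gather batch items from every recorded call, not only the last one."""
    items = []
    for call in mock_post.call_args_list:
        payload = call.kwargs.get("json") or {}
        items.extend(payload.get("batch", []))
    return items


# Simulate the SDK splitting its events across two flush cycles.
mock_post = MagicMock()
mock_post("https://example.invalid/ingest", json={"batch": [{"type": "trace-create"}]})
mock_post("https://example.invalid/ingest", json={"batch": [{"type": "generation-create"}]})

# call_args sees only the final call; call_args_list sees both.
print(len(mock_post.call_args.kwargs["json"]["batch"]))
print([item["type"] for item in collect_batch_items(mock_post)])
```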

Co-authored-by: Ishaan Jaff <ishaan-jaff@users.noreply.github.com>

* fix(ci): black formatting + update OpenAPI compliance tests for spec changes

- Apply Black 26.x formatting to litellm_logging.py (parenthesized style)
- Update test_input_types_match_spec to follow $ref to InteractionsInput schema
  (Google updated their OpenAPI spec to use $ref instead of inline oneOf)
- Update test_content_schema_uses_discriminator to handle discriminator without
  explicit mapping (Google removed the mapping key from Content discriminator)

Co-authored-by: Ishaan Jaff <ishaan-jaff@users.noreply.github.com>

* revert: undo incorrect Black 26.x formatting on litellm_logging.py

The file was correctly formatted for Black 23.12.1 (the version pinned
in pyproject.toml). The previous commit applied Black 26.x formatting
which was incompatible with the CI's Black version.

Co-authored-by: Ishaan Jaff <ishaan-jaff@users.noreply.github.com>

* fix(ci): deduplicate and sort langfuse batch events after aggregation

The Langfuse SDK may send the same event (e.g., trace-create) in
multiple flush cycles, causing duplicates when we aggregate from all
mock calls. After filtering by trace_id, deduplicate by keeping only
the first event of each type, then sort to ensure trace-create is at
index 0 and generation-create at index 1.
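
Roughly, the dedupe-then-sort step could look like this. It is a sketch:
`EVENT_ORDER`, `dedupe_and_sort`, and the event dict shape are
hypothetical and stand in for whatever the test helper actually uses.

```python
EVENT_ORDER = {"trace-create": 0, "generation-create": 1}


def dedupe_and_sort(events):
    """Keep the first event of each type, then order trace-create first."""
    first_by_type = {}
    for event in events:
        first_by_type.setdefault(event["type"], event)
    return sorted(
        first_by_type.values(),
        # Unknown types sort after the two known ones.
        key=lambda e: EVENT_ORDER.get(e["type"], len(EVENT_ORDER)),
    )


events = [
    {"type": "generation-create", "id": 1},
    {"type": "trace-create", "id": 2},
    {"type": "trace-create", "id": 3},  # duplicate from a later flush
]
print(dedupe_and_sort(events))
```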

Co-authored-by: Ishaan Jaff <ishaan-jaff@users.noreply.github.com>

---------

Co-authored-by: Noah Nistler <60981020+noahnistler@users.noreply.github.com>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Ishaan Jaff <ishaan-jaff@users.noreply.github.com>
2026-03-18 15:09:01 -07:00
Ishaan Jaff 1f412bc6d8 [Feat] Add Tool Policies for AI Gateway (#22732)
* fix: fix ui render

* fix: fix minor bugs

* refactor: use prisma functions instead of raw sql (safer)

* fix(add-new-tiles-to-tool-policies): allow developer to see what's available

* feat: ensure tool allowlist runs correctly for tool names + mcp's

* refactor: more ui improvements

* feat: working key tool blocking

* feat(tools): show tool logs

* refactor: backend code improvements

* refactor: improve log viewer for tools

* fix: address PR review feedback for tool access control

- Add missing blocked_tools column to root schema.prisma (schema drift)
- Invalidate ToolPolicyRegistry after policy mutations so changes take effect immediately
- Remove dead code: unused get_effective_policies, get_tool_policies_cached, and helpers

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: race condition in permission resolution and remove duplicate allowlist check

- Use atomic update_many with object_permission_id=None to prevent concurrent
  requests from creating orphaned permission rows and losing tool blocks
- Remove duplicate allowed_tools enforcement from guardrail (already enforced
  in auth layer via check_tools_allowlist)
- Move inline uuid import to module level

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* update to account for userAgent

* UI - Add ToolDetails

* input/output policy

* LiteLLM_PolicyAttachmentTable

* fix: add _enqueue_tool_registry_upsert

* fix: tool mgmt endpoints

* tool mgmt endpoints

* Update tests/test_litellm/proxy/db/test_tool_registry_writer.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Update tests/test_litellm/proxy/db/test_tool_registry_writer.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Update tests/test_litellm/proxy/db/test_tool_registry_writer.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* fix: sync root schema.prisma and fix test_tool_registry_writer for input/output policy

- Migrate root schema.prisma LiteLLM_ToolTable from call_policy to
  input_policy/output_policy, add missing user_agent and last_used_at columns
  (now consistent with litellm/proxy/schema.prisma and litellm-proxy-extras)
- Fix SpendLogToolIndex comment across all three schema files
- Fix all call_policy references in test_tool_registry_writer.py:
  swapped update_tool_policy arguments, wrong get_tools_by_names return type
  assertions, _mock_tool_row setting call_policy instead of input_policy

Addresses Greptile review feedback on PR #22732.

Made-with: Cursor

---------

Co-authored-by: Krrish Dholakia <krrishdholakia@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
2026-03-03 20:22:20 -08:00
Krish Dholakia 67f90254ed feat(guardrails): team-based guardrail registration and approval workflow (#22459)
* feat(guardrails): team-based guardrail registration and approval workflow

Add team-based guardrail submission system where teams can register
Generic Guardrail API guardrails for admin review. Includes:

- POST /guardrails/register endpoint for team-scoped submissions
- Admin review endpoints (list/get/approve/reject submissions)
- Team Guardrails tab in the UI dashboard
- extra_headers support for forwarding client headers to guardrail APIs
- Prisma schema migration for status, submitted_at, reviewed_at fields
- Documentation for team-based guardrails and static/dynamic headers

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(guardrails): address review feedback - SSRF, silent failure, redundant query

- Validate api_base URL scheme (http/https only) and hostname in
  register_guardrail to prevent SSRF via team submissions
- Return warning field in approve response when in-memory initialization
  fails so admins know the guardrail won't work until next sync cycle
- Eliminate redundant DB query in list_guardrail_submissions by fetching
  all team guardrails once and deriving both filtered list and summary
  counts from the single result set
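
The scheme and hostname check might be sketched as below.
`validate_api_base` is a hypothetical helper, not the code in
register_guardrail. Note that a scheme/hostname check on its own does not
block requests to internal addresses; that would need an additional
allowlist or IP-range check, which is not shown here.

```python
from urllib.parse import urlparse


def validate_api_base(api_base: str) -> str:
    """Reject URLs that are not plain http(s) or that lack a hostname."""
    parsed = urlparse(api_base)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"unsupported URL scheme: {parsed.scheme!r}")
    if not parsed.hostname:
        raise ValueError("api_base must include a hostname")
    return api_base


print(validate_api_base("https://guardrail.example.com/v1"))
for bad in ("file:///etc/passwd", "https://"):
    try:
        validate_api_base(bad)
    except ValueError as exc:
        print("rejected:", exc)
```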

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(guardrails): add pending_review status guard to reject endpoint

Prevent rejecting already-active or already-rejected guardrails, which
would create a DB/memory inconsistency (active in memory but rejected
in DB). Now mirrors the approve endpoint's status check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 22:06:49 -08:00
Krish Dholakia 12c4876891 Agents - assign tools (#22064)
* feat(proxy): add max_iterations limiter for agent session loops (#22058)

Adds a new proxy hook that enforces a per-session cap on the number of
LLM calls an agentic loop can make. Callers send a session_id with each
request, and the hook counts calls per session, returning 429 when the
configured max_iterations limit is exceeded.

- Uses Redis Lua script for atomic increment (multi-instance safe)
- Falls back to in-memory cache when Redis unavailable
- Follows parallel_request_limiter_v3 pattern
- Configurable via key metadata: {"max_iterations": 25}
- Session counters auto-expire via TTL (default 1hr)
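
A sketch of the in-memory fallback path only, under stated assumptions:
the class name `SessionIterationLimiter`, its `check` method, and the
injected clock are hypothetical. The Redis Lua path is not shown, and a
per-process dict like this is not multi-instance safe.

```python
import time
from typing import Callable, Dict, Tuple


class SessionIterationLimiter:
    """Per-session call counter with a TTL, kept in process memory."""

    def __init__(
        self,
        max_iterations: int,
        ttl_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_iterations = max_iterations
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        # session_id -> (call count, expiry timestamp)
        self._counters: Dict[str, Tuple[int, float]] = {}

    def check(self, session_id: str) -> int:
        """Count one call; return 200 while under the cap, 429 once over it."""
        now = self._clock()
        count, expires_at = self._counters.get(session_id, (0, now + self.ttl_seconds))
        if now >= expires_at:
            # Counter expired: start a fresh window for this session.
            count, expires_at = 0, now + self.ttl_seconds
        count += 1
        self._counters[session_id] = (count, expires_at)
        return 429 if count > self.max_iterations else 200


limiter = SessionIterationLimiter(max_iterations=2)
print([limiter.check("session-a") for _ in range(3)])
```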

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add new code execution dataset

* feat(agent_endpoints/): allow giving agents keys

* fix: ui fixes

* feat: allow assigning mcp servers to agents

* fix: eliminate duplicate DB queries in MCP agent auth and N+1 in agent listing (#22110)

- Extract _get_agent_object_permission helper so _get_allowed_mcp_servers_for_agent
  and _get_agent_tool_permissions_for_server share a single DB fetch instead of
  each independently querying the same agent row (was 1+N queries per MCP request)
- Use include={"object_permission": True} on find_many in get_all_agents_from_db
  to eagerly load permissions in one query instead of N+1
- Use include={"object_permission": True} on create/update/find_unique in all
  agent CRUD operations, removing attach_object_permission_to_dict follow-up calls

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-25 11:44:30 -08:00
Ryan Crabbe e0ddb2a525 fix: guard print_aggregate against empty latencies 2026-02-23 10:06:10 -08:00
Ryan Crabbe 95d9514054 fix: add auth headers and empty latencies guard to benchmark script 2026-02-23 09:58:40 -08:00
Ryan Crabbe 94b76ea9ad feat: add network_mock transport for benchmarking proxy overhead without real API calls
Intercepts at httpx transport layer so the full proxy path (auth, routing,
OpenAI SDK, response transformation) is exercised with zero-latency responses.
Activated via `litellm_settings: { network_mock: true }` in proxy config.
2026-02-21 17:52:39 -08:00
Alexsander Hamir 7f81dea8b3 Add custom auth header support and increase default prompt size to 100k chars (#19436) 2026-01-20 13:25:12 -08:00
Alexsander Hamir 270b41b0f4 Simplify file comments (#19382) 2026-01-19 17:01:38 -08:00
Alexsander Hamir 0cd7763d5f Add health check scripts and parallel execution support (#19295)
- Add health_check_client.py for monitoring model availability
- Add health_check_client_README.md with usage documentation
- Add health_check_requirements.txt for dependencies
- Add run_parallel_health_checks.ps1 (PowerShell version)
- Add run_parallel_health_checks.sh (Bash version)
- Organize all scripts under scripts/health_check/ directory
2026-01-19 08:38:38 -08:00
Xiaohan Fu 07fe9e8604 implement failopen option default to True on grayswan guardrail (#18266)
* implement failopen option, defaulting to True

* introduce a config to set the timeout limit (defaults to 30)
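
A minimal sketch of the fail-open pattern these two bullets describe. The function name `run_guardrail`, the `fail_open` and `timeout` parameter names, and the assumption that the timeout is in seconds are all illustrative. The commit message gives only the defaults (True and 30).

```python
from typing import Callable

DEFAULT_TIMEOUT = 30  # default from the commit message; seconds is an assumption


def run_guardrail(
    check: Callable[[str, float], bool],
    text: str,
    fail_open: bool = True,
    timeout: float = DEFAULT_TIMEOUT,
) -> bool:
    """Return True if the request may proceed.

    `check` is a hypothetical callable that contacts the guardrail service
    and returns its verdict, raising on errors or timeouts.
    """
    try:
        return check(text, timeout)
    except Exception:
        if fail_open:
            # Guardrail unavailable: let the request through.
            return True
        # Fail closed: surface the error so the request is rejected.
        raise
```

Failing open trades strictness for availability: an outage in the guardrail service does not block traffic, but unchecked requests pass while it is down.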
2026-01-06 15:17:05 +05:30
Alexsander Hamir b635f92d90 Add benchmark_proxy_vs_provider.py script to scripts directory with usage examples (#17889) 2025-12-12 11:26:34 -08:00
Alexsander Hamir 762b429d6c enhance: create_litellm_branch tool to be more robust (#17874) 2025-12-12 05:35:50 -08:00
Cesar Garcia a7ad8a36a4 chore: cleanup unused scripts and fix misplaced test file (#17611)
Remove scripts/ directory containing unused development/debug scripts:
- mock_ibm_guardrails_server.py
- test_groq_streaming_issue.py (debug for #12660)
- test_mock_ibm_guardrails.py
- update_readme_providers_table.py

Move misplaced test file to correct location:
- test_litellm/ -> tests/test_litellm/ (from PR #17221)
2025-12-08 19:00:55 -08:00
Alexsander Hamir c44e075b2d feat: add script to create branches with litellm_ prefix (#17606)
Add utility scripts to create branches with litellm_ prefix from contributor branches.
This helps maintain consistent branch naming conventions for CI/CD.

- scripts/create_litellm_branch.sh (Bash for macOS/Linux)
- scripts/create_litellm_branch.ps1 (PowerShell for Windows)

Usage:
  ./scripts/create_litellm_branch.sh [source_branch] [new_branch_name]
  ./scripts/create_litellm_branch.ps1 [source_branch] [new_branch_name]

Features:
- Auto-prefixes branch names with litellm_
- Handles existing branches gracefully
- Validates branch names
- Supports local and remote source branches
2025-12-06 10:41:39 -08:00
Rob Geada d35d9008c9 Ensure detector-id is passed as header to IBM detector server (#16649) 2025-11-14 19:35:49 -08:00
Ishaan Jaff 0428229032 [Docs] readme fixes add supported providers (#16109)
* add provider test

* docs readme.md

* docs providers

* order providers

* test_providers_alphabetically_ordered

* docs endpoint

* fix config

* add ENDPOINT_COLUMNS

* add provider endpoints

* docs fix
2025-10-30 17:23:20 -07:00
Krish Dholakia ddacaf6c32 (feat) Organizations: allow org admins to create teams on UI + (feat) IBM Guardrails (#15924)
* fix(oldteams.tsx): allow org admin to create team on ui

* fix(oldteams.tsx): show org admin a dropdown of allowed orgs for team creation

* docs(access_control.md): cleanup doc

* feat(ibm_guardrails/): initial commit adding support for ibm guardrails on litellm

allows user to use self-hosted ibm guardrails

* feat(ibm_detector.py): working detector

* docs(ibm_guardrails.md): document new ibm guardrails

* fix: fix linting errors
2025-10-25 11:13:39 -07:00
Cole McIntosh 000ecad4e2 Fix Groq streaming ASCII encoding issue
Replace iter_lines()/aiter_lines() with iter_text()/aiter_text() using explicit
UTF-8 encoding to handle non-ASCII characters like µ in streaming responses.

- Added utf8_iter_lines() and utf8_aiter_lines() helper functions
- Ensures proper UTF-8 decoding of streaming response content
- Added comprehensive tests for Unicode character handling

Fixes #12660
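
A stdlib-only sketch of the decoding idea behind this fix: decode byte chunks incrementally as UTF-8, then split into lines, so a multi-byte character such as µ survives even when a chunk boundary falls in the middle of it. `utf8_lines_from_chunks` is a hypothetical name and is not the `utf8_iter_lines()` helper added in the commit, which works on HTTP responses.

```python
import codecs
from typing import Iterable, Iterator


def utf8_lines_from_chunks(chunks: Iterable[bytes]) -> Iterator[str]:
    """Yield text lines from a stream of raw byte chunks, decoded as UTF-8."""
    # The incremental decoder keeps partial multi-byte sequences between calls.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    buffer = ""
    for chunk in chunks:
        buffer += decoder.decode(chunk)
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            yield line.rstrip("\r")
    # Flush any bytes still held by the decoder, then any unterminated line.
    buffer += decoder.decode(b"", final=True)
    if buffer:
        yield buffer.rstrip("\r")
```

For example, the chunks `b"data: 5 \xc2"` and `b"\xb5s\n"` split the two-byte encoding of µ across a boundary, and the generator still yields the single line `data: 5 µs`.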
2025-08-16 08:32:22 -05:00