Ishaan Jaff 9761ba7c7a [Bug Fix] Responses api session management for streaming responses (#13396)
* fix proxy config

* fix(responses api): fix streaming ID consistency and tool format handling (#12640)

* fix(responses): ensure streaming chunk IDs use consistent encoding format

Fixes streaming ID inconsistency where streaming responses used raw provider IDs
while non-streaming responses used properly encoded IDs with provider context.

Changes:
- Updated LiteLLMCompletionStreamingIterator to accept provider context
- Added _encode_chunk_id() method using same logic as non-streaming responses
- Modified chunk transformation to encode all streaming item_ids with resp_ prefix
- Updated handlers to pass custom_llm_provider and litellm_metadata to streaming iterator

Impact:
- Streaming chunk IDs now format: resp_<base64_encoded_provider_context>
- Enables session continuity when using streaming response IDs as previous_response_id
- Allows provider detection and load balancing with streaming responses
- Maintains backward compatibility with existing streaming functionality

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(types): add explicit Optional[str] type annotation for model_id

This resolves MyPy type checking error where model_id could be None
but wasn't explicitly typed as Optional[str].

* fix(types): handle None case for litellm_metadata access

Prevents 'Item None has no attribute get' error by checking for None
before accessing litellm_metadata dictionary.

* test: add comprehensive tests for streaming ID consistency

Adds unit and E2E tests to verify streaming chunk IDs are properly encoded
with consistent format across streaming responses.

## Tests Added

### Unit Test (test_reasoning_content_transformation.py)
- `test_streaming_chunk_id_encoding()`: Validates the `_encode_chunk_id()` method
  correctly encodes chunk IDs with `resp_` prefix and provider context

### E2E Tests (test_e2e_openai_responses_api.py)
- `test_streaming_id_consistency_across_chunks()`: Tests that all streaming chunk IDs
  are properly encoded across multiple chunks in a real streaming response
- `test_streaming_response_id_as_previous_response_id()`: Tests the core use case -
  using streaming response IDs for session continuity with `previous_response_id`

## Key Testing Approach
- Uses **Gemini** (non-OpenAI model) to test the transformation logic rather than
  OpenAI passthrough, since the streaming ID consistency issue occurs when LiteLLM
  transforms responses rather than just passing through to native OpenAI responses API
- Tests validate that streaming chunk IDs now use same encoding as non-streaming responses
- Verifies session continuity works with streaming responses

Addresses @ishaan-jaff's request for unit tests covering the streaming ID consistency fix.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(lint): remove unused imports in transformation.py

Removes unused imports to fix CI linting errors:
- GenericResponseOutputItem
- OutputFunctionToolCall

* test: remove E2E tests from openai_endpoints_tests

Remove streaming ID consistency E2E tests as requested by @ishaan-jaff.
Keep only the mock/unit test in test_reasoning_content_transformation.py

* revert: remove streaming chunk ID encoding to original behavior

This reverts the streaming chunk ID encoding changes to understand the original issue better.
Original behavior was:
- Streaming chunks: raw provider IDs
- Streaming final response: raw IDs (PROBLEM!)
- Non-streaming final response: encoded IDs (correct)

The real issue: streaming final response IDs were not encoded, breaking session continuity.

* fix(responses): encode streaming final response IDs to match OpenAI behavior

Fixes streaming ID inconsistency to match OpenAI's Responses API behavior:
- Streaming chunks: raw message IDs (like OpenAI's msg_xxx)
- Final response: encoded IDs (like OpenAI's resp_xxx)

This enables session continuity by ensuring streaming final response IDs
have the same encoded format as non-streaming responses, allowing them
to be used as previous_response_id in follow-up requests.

Changes:
- Add custom_llm_provider and litellm_metadata to LiteLLMCompletionStreamingIterator
- Update handlers to pass provider context to streaming iterator
- Apply _update_responses_api_response_id_with_model_id to final streaming response
- Keep streaming chunks as raw IDs to match OpenAI format

Impact:
- Session continuity works with streaming responses
- Load balancing can detect provider from streaming final response IDs
- Format matches OpenAI's Responses API exactly

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* test: update unit test to match correct OpenAI-compatible behavior

Updates the unit test to verify streaming chunk IDs are raw (not encoded)
to match OpenAI's responses API format:
- Streaming chunks: raw message IDs (like msg_xxx)
- Final response: encoded IDs (like resp_xxx)

This reflects the correct behavior implemented in the fix.

---------

Co-authored-by: Claude <noreply@anthropic.com>

* cleanup

* TestBaseResponsesAPIStreamingIterator

---------

Co-authored-by: Javier de la Torre <jatorre@carto.com>
Co-authored-by: Claude <noreply@anthropic.com>
2025-08-07 20:13:24 -07:00
2025-08-05 10:56:34 -07:00
2025-06-05 16:29:28 -07:00
2025-06-30 10:45:37 -07:00

🚅 LiteLLM

Deploy to Render Deploy on Railway

Call all LLM APIs using the OpenAI format [Bedrock, Huggingface, VertexAI, TogetherAI, Azure, OpenAI, Groq etc.]

LiteLLM Proxy Server (LLM Gateway) | Hosted Proxy (Preview) | Enterprise Tier

PyPI Version Y Combinator W23 Whatsapp Discord Slack

LiteLLM manages:

  • Translate inputs to provider's completion, embedding, and image_generation endpoints
  • Consistent output, text responses will always be available at ['choices'][0]['message']['content']
  • Retry/fallback logic across multiple deployments (e.g. Azure/OpenAI) - Router
  • Set Budgets & Rate limits per project, api key, model LiteLLM Proxy Server (LLM Gateway)

Jump to LiteLLM Proxy (LLM Gateway) Docs
Jump to Supported LLM Providers

🚨 Stable Release: Use docker images with the -stable tag. These have undergone 12 hour load tests, before being published. More information about the release cycle here

Support for more providers. Missing a provider or LLM Platform, raise a feature request.

Usage (Docs)

Important

LiteLLM v1.0.0 now requires openai>=1.0.0. Migration guide here
LiteLLM v1.40.14+ now requires pydantic>=2.0.0. No changes required.

Open In Colab
pip install litellm
from litellm import completion
import os

## set ENV variables
os.environ["OPENAI_API_KEY"] = "your-openai-key"
os.environ["ANTHROPIC_API_KEY"] = "your-anthropic-key"

messages = [{ "content": "Hello, how are you?","role": "user"}]

# openai call
response = completion(model="openai/gpt-4o", messages=messages)

# anthropic call
response = completion(model="anthropic/claude-sonnet-4-20250514", messages=messages)
print(response)

Response (OpenAI Format)

{
    "id": "chatcmpl-1214900a-6cdd-4148-b663-b5e2f642b4de",
    "created": 1751494488,
    "model": "claude-sonnet-4-20250514",
    "object": "chat.completion",
    "system_fingerprint": null,
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "content": "Hello! I'm doing well, thank you for asking. I'm here and ready to help with whatever you'd like to discuss or work on. How are you doing today?",
                "role": "assistant",
                "tool_calls": null,
                "function_call": null
            }
        }
    ],
    "usage": {
        "completion_tokens": 39,
        "prompt_tokens": 13,
        "total_tokens": 52,
        "completion_tokens_details": null,
        "prompt_tokens_details": {
            "audio_tokens": null,
            "cached_tokens": 0
        },
        "cache_creation_input_tokens": 0,
        "cache_read_input_tokens": 0
    }
}

Call any model supported by a provider, with model=<provider_name>/<model_name>. There might be provider-specific details here, so refer to provider docs for more information

Async (Docs)

from litellm import acompletion
import asyncio

async def test_get_response():
    user_message = "Hello, how are you?"
    messages = [{"content": user_message, "role": "user"}]
    response = await acompletion(model="openai/gpt-4o", messages=messages)
    return response

response = asyncio.run(test_get_response())
print(response)

Streaming (Docs)

liteLLM supports streaming the model response back, pass stream=True to get a streaming iterator in response.
Streaming is supported for all models (Bedrock, Huggingface, TogetherAI, Azure, OpenAI, etc.)

from litellm import completion
response = completion(model="openai/gpt-4o", messages=messages, stream=True)
for part in response:
    print(part.choices[0].delta.content or "")

# claude sonnet 4
response = completion('anthropic/claude-sonnet-4-20250514', messages, stream=True)
for part in response:
    print(part)

Response chunk (OpenAI Format)

{
    "id": "chatcmpl-fe575c37-5004-4926-ae5e-bfbc31f356ca",
    "created": 1751494808,
    "model": "claude-sonnet-4-20250514",
    "object": "chat.completion.chunk",
    "system_fingerprint": null,
    "choices": [
        {
            "finish_reason": null,
            "index": 0,
            "delta": {
                "provider_specific_fields": null,
                "content": "Hello",
                "role": "assistant",
                "function_call": null,
                "tool_calls": null,
                "audio": null
            },
            "logprobs": null
        }
    ],
    "provider_specific_fields": null,
    "stream_options": null,
    "citations": null
}

Logging Observability (Docs)

LiteLLM exposes pre defined callbacks to send data to Lunary, MLflow, Langfuse, DynamoDB, s3 Buckets, Helicone, Promptlayer, Traceloop, Athina, Slack

from litellm import completion

## set env variables for logging tools (when using MLflow, no API key set up is required)
os.environ["LUNARY_PUBLIC_KEY"] = "your-lunary-public-key"
os.environ["HELICONE_API_KEY"] = "your-helicone-auth-key"
os.environ["LANGFUSE_PUBLIC_KEY"] = ""
os.environ["LANGFUSE_SECRET_KEY"] = ""
os.environ["ATHINA_API_KEY"] = "your-athina-api-key"

os.environ["OPENAI_API_KEY"] = "your-openai-key"

# set callbacks
litellm.success_callback = ["lunary", "mlflow", "langfuse", "athina", "helicone"] # log input/output to lunary, langfuse, supabase, athina, helicone etc

#openai call
response = completion(model="openai/gpt-4o", messages=[{"role": "user", "content": "Hi 👋 - i'm openai"}])

LiteLLM Proxy Server (LLM Gateway) - (Docs)

Track spend + Load Balance across multiple projects

Hosted Proxy (Preview)

The proxy provides:

  1. Hooks for auth
  2. Hooks for logging
  3. Cost tracking
  4. Rate Limiting

📖 Proxy Endpoints - Swagger Docs

Quick Start Proxy - CLI

pip install 'litellm[proxy]'

Step 1: Start litellm proxy

$ litellm --model huggingface/bigcode/starcoder

#INFO: Proxy running on http://0.0.0.0:4000

Step 2: Make ChatCompletions Request to Proxy

Important

💡 Use LiteLLM Proxy with Langchain (Python, JS), OpenAI SDK (Python, JS) Anthropic SDK, Mistral SDK, LlamaIndex, Instructor, Curl

import openai # openai v1.0.0+
client = openai.OpenAI(api_key="anything",base_url="http://0.0.0.0:4000") # set proxy to base_url
# request sent to model set on litellm proxy, `litellm --model`
response = client.chat.completions.create(model="gpt-3.5-turbo", messages = [
    {
        "role": "user",
        "content": "this is a test request, write a short poem"
    }
])

print(response)

Proxy Key Management (Docs)

Connect the proxy with a Postgres DB to create proxy keys

# Get the code
git clone https://github.com/BerriAI/litellm

# Go to folder
cd litellm

# Add the master key - you can change this after setup
echo 'LITELLM_MASTER_KEY="sk-1234"' > .env

# Add the litellm salt key - you cannot change this after adding a model
# It is used to encrypt / decrypt your LLM API Key credentials
# We recommend - https://1password.com/password-generator/ 
# password generator to get a random hash for litellm salt key
echo 'LITELLM_SALT_KEY="sk-1234"' >> .env

source .env

# Start
docker-compose up

UI on /ui on your proxy server ui_3

Set budgets and rate limits across multiple projects POST /key/generate

Request

curl 'http://0.0.0.0:4000/key/generate' \
--header 'Authorization: Bearer sk-1234' \
--header 'Content-Type: application/json' \
--data-raw '{"models": ["gpt-3.5-turbo", "gpt-4", "claude-2"], "duration": "20m","metadata": {"user": "ishaan@berri.ai", "team": "core-infra"}}'

Expected Response

{
    "key": "sk-kdEXbIqZRwEeEiHwdg7sFA", # Bearer token
    "expires": "2023-11-19T01:38:25.838000+00:00" # datetime object
}

Supported Providers (Docs)

Provider Completion Streaming Async Completion Async Streaming Async Embedding Async Image Generation
openai
Meta - Llama API
azure
AI/ML API
aws - sagemaker
aws - bedrock
google - vertex_ai
google - palm
google AI Studio - gemini
mistral ai api
cloudflare AI Workers
cohere
anthropic
empower
huggingface
replicate
together_ai
openrouter
ai21
baseten
vllm
nlp_cloud
aleph alpha
petals
ollama
deepinfra
perplexity-ai
Groq AI
Deepseek
anyscale
IBM - watsonx.ai
voyage ai
xinference [Xorbits Inference]
FriendliAI
Galadriel
Novita AI
Featherless AI
Nebius AI Studio

Read the Docs

Contributing

Interested in contributing? Contributions to LiteLLM Python SDK, Proxy Server, and LLM integrations are both accepted and highly encouraged!

Quick start: git clonemake install-devmake formatmake lintmake test-unit

See our comprehensive Contributing Guide (CONTRIBUTING.md) for detailed instructions.

Enterprise

For companies that need better security, user management and professional support

Talk to founders

This covers:

  • Features under the LiteLLM Commercial License:
  • Feature Prioritization
  • Custom Integrations
  • Professional Support - Dedicated discord + slack
  • Custom SLAs
  • Secure access with Single Sign-On

Contributing

We welcome contributions to LiteLLM! Whether you're fixing bugs, adding features, or improving documentation, we appreciate your help.

Quick Start for Contributors

git clone https://github.com/BerriAI/litellm.git
cd litellm
make install-dev    # Install development dependencies
make format         # Format your code
make lint           # Run all linting checks
make test-unit      # Run unit tests

For detailed contributing guidelines, see CONTRIBUTING.md.

Code Quality / Linting

LiteLLM follows the Google Python Style Guide.

Our automated checks include:

  • Black for code formatting
  • Ruff for linting and code quality
  • MyPy for type checking
  • Circular import detection
  • Import safety checks

Run all checks locally:

make lint           # Run all linting (matches CI)
make format-check   # Check formatting only

All these checks must pass before your PR can be merged.

Support / talk with founders

Why did we build this

  • Need for simplicity: Our code started to get extremely complicated managing & translating calls between Azure, OpenAI and Cohere.

Contributors

Run in Developer mode

Services

  1. Setup .env file in root
  2. Run dependant services docker-compose up db prometheus

Backend

  1. (In root) create virtual environment python -m venv .venv
  2. Activate virtual environment source .venv/bin/activate
  3. Install dependencies pip install -e ".[all]"
  4. Start proxy backend uvicorn litellm.proxy.proxy_server:app --host localhost --port 4000 --reload

Frontend

  1. Navigate to ui/litellm-dashboard
  2. Install dependencies npm install
  3. Run npm run dev to start the dashboard
S
Description
Python SDK, Proxy Server (AI Gateway) to call 100+ LLM APIs in OpenAI (or native) format, with cost tracking, guardrails, loadbalancing and logging. [Bedrock, Azure, OpenAI, VertexAI, Cohere, Anthropic, Sagemaker, HuggingFace, VLLM, NVIDIA NIM]
Readme MIT 1.1 GiB
Languages
Python 81%
TypeScript 12.2%
JavaScript 5.9%
HTML 0.5%
HCL 0.2%