tiennm99/litellm

Fork 0

mirror of https://github.com/tiennm99/litellm.git synced 2026-06-22 23:36:12 +00:00

Files

T

Krrish Dholakia fa278a1886 docs: move guardrails and responses api out of beta

2025-11-05 16:48:54 -08:00

30 KiB

Raw Blame History

import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem';

/responses

LiteLLM provides a BETA endpoint in the spec of OpenAI's /responses API

Requests to /chat/completions may be bridged here automatically when the provider lacks support for that endpoint. The model’s default mode determines how bridging works.(see model_prices_and_context_window)

Feature	Supported	Notes
Cost Tracking	✅	Works with all supported models
Logging	✅	Works across all integrations
End-user Tracking	✅
Streaming	✅
Image Generation Streaming	✅	Progressive image generation with partial images (1-3)
Fallbacks	✅	Works between supported models
Loadbalancing	✅	Works between supported models
Guardrails	✅	Applies to input and output text (non-streaming only)
Supported operations	Create a response, Get a response, Delete a response
Supported LiteLLM Versions	1.63.8+
Supported LLM providers	All LiteLLM supported providers	`openai`, `anthropic`, `bedrock`, `vertex_ai`, `gemini`, `azure`, `azure_ai` etc.

Usage

LiteLLM Python SDK

Non-streaming

import litellm

# Non-streaming response
response = litellm.responses(
    model="openai/o1-pro",
    input="Tell me a three sentence bedtime story about a unicorn.",
    max_output_tokens=100
)

print(response)

Streaming

import litellm

# Streaming response
response = litellm.responses(
    model="openai/o1-pro",
    input="Tell me a three sentence bedtime story about a unicorn.",
    stream=True
)

for event in response:
    print(event)

Image Generation with Streaming

import litellm
import base64

# Streaming image generation with partial images
stream = litellm.responses(
    model="gpt-4.1",  # Use an actual image generation model
    input="Generate a gorgeous image of a river made of white owl feathers",
    stream=True,
    tools=[{"type": "image_generation", "partial_images": 2}],

)

for event in stream:
    if event.type == "response.image_generation_call.partial_image":
        idx = event.partial_image_index
        image_base64 = event.partial_image_b64
        image_bytes = base64.b64decode(image_base64)
        with open(f"river{idx}.png", "wb") as f:
            f.write(image_bytes)

GET a Response

import litellm

# First, create a response
response = litellm.responses(
    model="openai/o1-pro",
    input="Tell me a three sentence bedtime story about a unicorn.",
    max_output_tokens=100
)

# Get the response ID
response_id = response.id

# Retrieve the response by ID
retrieved_response = litellm.get_responses(
    response_id=response_id
)

print(retrieved_response)

# For async usage
# retrieved_response = await litellm.aget_responses(response_id=response_id)

CANCEL a Response

You can cancel an in-progress response (if supported by the provider):

import litellm

# First, create a response
response = litellm.responses(
    model="openai/o1-pro",
    input="Tell me a three sentence bedtime story about a unicorn.",
    max_output_tokens=100
)

# Get the response ID
response_id = response.id

# Cancel the response by ID
cancel_response = litellm.cancel_responses(
    response_id=response_id
)

print(cancel_response)

# For async usage
# cancel_response = await litellm.acancel_responses(response_id=response_id)

REST API:

curl -X POST http://localhost:4000/v1/responses/response_id/cancel \
    -H "Authorization: Bearer sk-1234"

This will attempt to cancel the in-progress response with the given ID. Note: Not all providers support response cancellation. If unsupported, an error will be raised.

DELETE a Response

import litellm

# First, create a response
response = litellm.responses(
    model="openai/o1-pro",
    input="Tell me a three sentence bedtime story about a unicorn.",
    max_output_tokens=100
)

# Get the response ID
response_id = response.id

# Delete the response by ID
delete_response = litellm.delete_responses(
    response_id=response_id
)

print(delete_response)

# For async usage
# delete_response = await litellm.adelete_responses(response_id=response_id)

Non-streaming

import litellm
import os

# Set API key
os.environ["ANTHROPIC_API_KEY"] = "your-anthropic-api-key"

# Non-streaming response
response = litellm.responses(
    model="anthropic/claude-3-5-sonnet-20240620",
    input="Tell me a three sentence bedtime story about a unicorn.",
    max_output_tokens=100
)

print(response)

Streaming

import litellm
import os

# Set API key
os.environ["ANTHROPIC_API_KEY"] = "your-anthropic-api-key"

# Streaming response
response = litellm.responses(
    model="anthropic/claude-3-5-sonnet-20240620",
    input="Tell me a three sentence bedtime story about a unicorn.",
    stream=True
)

for event in response:
    print(event)

Non-streaming

import litellm
import os

# Set credentials - Vertex AI uses application default credentials
# Run 'gcloud auth application-default login' to authenticate
os.environ["VERTEXAI_PROJECT"] = "your-gcp-project-id"
os.environ["VERTEXAI_LOCATION"] = "us-central1"

# Non-streaming response
response = litellm.responses(
    model="vertex_ai/gemini-1.5-pro",
    input="Tell me a three sentence bedtime story about a unicorn.",
    max_output_tokens=100
)

print(response)

Streaming

import litellm
import os

# Set credentials - Vertex AI uses application default credentials
# Run 'gcloud auth application-default login' to authenticate
os.environ["VERTEXAI_PROJECT"] = "your-gcp-project-id"
os.environ["VERTEXAI_LOCATION"] = "us-central1"

# Streaming response
response = litellm.responses(
    model="vertex_ai/gemini-1.5-pro",
    input="Tell me a three sentence bedtime story about a unicorn.",
    stream=True
)

for event in response:
    print(event)

Non-streaming

import litellm
import os

# Set AWS credentials
os.environ["AWS_ACCESS_KEY_ID"] = "your-access-key-id"
os.environ["AWS_SECRET_ACCESS_KEY"] = "your-secret-access-key"
os.environ["AWS_REGION_NAME"] = "us-west-2"  # or your AWS region

# Non-streaming response
response = litellm.responses(
    model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
    input="Tell me a three sentence bedtime story about a unicorn.",
    max_output_tokens=100
)

print(response)

Streaming

import litellm
import os

# Set AWS credentials
os.environ["AWS_ACCESS_KEY_ID"] = "your-access-key-id"
os.environ["AWS_SECRET_ACCESS_KEY"] = "your-secret-access-key"
os.environ["AWS_REGION_NAME"] = "us-west-2"  # or your AWS region

# Streaming response
response = litellm.responses(
    model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
    input="Tell me a three sentence bedtime story about a unicorn.",
    stream=True
)

for event in response:
    print(event)

Non-streaming

import litellm
import os

# Set API key for Google AI Studio
os.environ["GEMINI_API_KEY"] = "your-gemini-api-key"

# Non-streaming response
response = litellm.responses(
    model="gemini/gemini-1.5-flash",
    input="Tell me a three sentence bedtime story about a unicorn.",
    max_output_tokens=100
)

print(response)

Streaming

import litellm
import os

# Set API key for Google AI Studio
os.environ["GEMINI_API_KEY"] = "your-gemini-api-key"

# Streaming response
response = litellm.responses(
    model="gemini/gemini-1.5-flash",
    input="Tell me a three sentence bedtime story about a unicorn.",
    stream=True
)

for event in response:
    print(event)

LiteLLM Proxy with OpenAI SDK

First, set up and start your LiteLLM proxy server.

litellm --config /path/to/config.yaml

# RUNNING on http://0.0.0.0:4000

First, add this to your litellm proxy config.yaml:

model_list:
  - model_name: openai/o1-pro
    litellm_params:
      model: openai/o1-pro
      api_key: os.environ/OPENAI_API_KEY

Non-streaming

from openai import OpenAI

# Initialize client with your proxy URL
client = OpenAI(
    base_url="http://localhost:4000",  # Your proxy URL
    api_key="your-api-key"             # Your proxy API key
)

# Non-streaming response
response = client.responses.create(
    model="openai/o1-pro",
    input="Tell me a three sentence bedtime story about a unicorn."
)

print(response)

Streaming

from openai import OpenAI

# Initialize client with your proxy URL
client = OpenAI(
    base_url="http://localhost:4000",  # Your proxy URL
    api_key="your-api-key"             # Your proxy API key
)

# Streaming response
response = client.responses.create(
    model="openai/o1-pro",
    input="Tell me a three sentence bedtime story about a unicorn.",
    stream=True
)

for event in response:
    print(event)

Image Generation with Streaming

from openai import OpenAI
import base64

client = OpenAI(api_key="sk-1234", base_url="http://localhost:4000")

stream = client.responses.create(
    model="gpt-4.1",
    input="Draw a gorgeous image of a river made of white owl feathers, snaking its way through a serene winter landscape",
    stream=True,
    tools=[{"type": "image_generation", "partial_images": 2}],
)


for event in stream:
    print(f"event: {event}")
    if event.type == "response.image_generation_call.partial_image":
        idx = event.partial_image_index
        image_base64 = event.partial_image_b64
        image_bytes = base64.b64decode(image_base64)
        with open(f"river{idx}.png", "wb") as f:
            f.write(image_bytes)

GET a Response

from openai import OpenAI

# Initialize client with your proxy URL
client = OpenAI(
    base_url="http://localhost:4000",  # Your proxy URL
    api_key="your-api-key"             # Your proxy API key
)

# First, create a response
response = client.responses.create(
    model="openai/o1-pro",
    input="Tell me a three sentence bedtime story about a unicorn."
)

# Get the response ID
response_id = response.id

# Retrieve the response by ID
retrieved_response = client.responses.retrieve(response_id)

print(retrieved_response)

DELETE a Response

from openai import OpenAI

# Initialize client with your proxy URL
client = OpenAI(
    base_url="http://localhost:4000",  # Your proxy URL
    api_key="your-api-key"             # Your proxy API key
)

# First, create a response
response = client.responses.create(
    model="openai/o1-pro",
    input="Tell me a three sentence bedtime story about a unicorn."
)

# Get the response ID
response_id = response.id

# Delete the response by ID
delete_response = client.responses.delete(response_id)

print(delete_response)

First, add this to your litellm proxy config.yaml:

model_list:
  - model_name: anthropic/claude-3-5-sonnet-20240620
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20240620
      api_key: os.environ/ANTHROPIC_API_KEY

Non-streaming

from openai import OpenAI

# Initialize client with your proxy URL
client = OpenAI(
    base_url="http://localhost:4000",  # Your proxy URL
    api_key="your-api-key"             # Your proxy API key
)

# Non-streaming response
response = client.responses.create(
    model="anthropic/claude-3-5-sonnet-20240620",
    input="Tell me a three sentence bedtime story about a unicorn."
)

print(response)

Streaming

from openai import OpenAI

# Initialize client with your proxy URL
client = OpenAI(
    base_url="http://localhost:4000",  # Your proxy URL
    api_key="your-api-key"             # Your proxy API key
)

# Streaming response
response = client.responses.create(
    model="anthropic/claude-3-5-sonnet-20240620",
    input="Tell me a three sentence bedtime story about a unicorn.",
    stream=True
)

for event in response:
    print(event)

First, add this to your litellm proxy config.yaml:

model_list:
  - model_name: vertex_ai/gemini-1.5-pro
    litellm_params:
      model: vertex_ai/gemini-1.5-pro
      vertex_project: your-gcp-project-id
      vertex_location: us-central1

Non-streaming

from openai import OpenAI

# Initialize client with your proxy URL
client = OpenAI(
    base_url="http://localhost:4000",  # Your proxy URL
    api_key="your-api-key"             # Your proxy API key
)

# Non-streaming response
response = client.responses.create(
    model="vertex_ai/gemini-1.5-pro",
    input="Tell me a three sentence bedtime story about a unicorn."
)

print(response)

Streaming

from openai import OpenAI

# Initialize client with your proxy URL
client = OpenAI(
    base_url="http://localhost:4000",  # Your proxy URL
    api_key="your-api-key"             # Your proxy API key
)

# Streaming response
response = client.responses.create(
    model="vertex_ai/gemini-1.5-pro",
    input="Tell me a three sentence bedtime story about a unicorn.",
    stream=True
)

for event in response:
    print(event)

First, add this to your litellm proxy config.yaml:

model_list:
  - model_name: bedrock/anthropic.claude-3-sonnet-20240229-v1:0
    litellm_params:
      model: bedrock/anthropic.claude-3-sonnet-20240229-v1:0
      aws_access_key_id: os.environ/AWS_ACCESS_KEY_ID
      aws_secret_access_key: os.environ/AWS_SECRET_ACCESS_KEY
      aws_region_name: us-west-2

Non-streaming

from openai import OpenAI

# Initialize client with your proxy URL
client = OpenAI(
    base_url="http://localhost:4000",  # Your proxy URL
    api_key="your-api-key"             # Your proxy API key
)

# Non-streaming response
response = client.responses.create(
    model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
    input="Tell me a three sentence bedtime story about a unicorn."
)

print(response)

Streaming

from openai import OpenAI

# Initialize client with your proxy URL
client = OpenAI(
    base_url="http://localhost:4000",  # Your proxy URL
    api_key="your-api-key"             # Your proxy API key
)

# Streaming response
response = client.responses.create(
    model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
    input="Tell me a three sentence bedtime story about a unicorn.",
    stream=True
)

for event in response:
    print(event)

First, add this to your litellm proxy config.yaml:

model_list:
  - model_name: gemini/gemini-1.5-flash
    litellm_params:
      model: gemini/gemini-1.5-flash
      api_key: os.environ/GEMINI_API_KEY

Non-streaming

from openai import OpenAI

# Initialize client with your proxy URL
client = OpenAI(
    base_url="http://localhost:4000",  # Your proxy URL
    api_key="your-api-key"             # Your proxy API key
)

# Non-streaming response
response = client.responses.create(
    model="gemini/gemini-1.5-flash",
    input="Tell me a three sentence bedtime story about a unicorn."
)

print(response)

Streaming

from openai import OpenAI

# Initialize client with your proxy URL
client = OpenAI(
    base_url="http://localhost:4000",  # Your proxy URL
    api_key="your-api-key"             # Your proxy API key
)

# Streaming response
response = client.responses.create(
    model="gemini/gemini-1.5-flash",
    input="Tell me a three sentence bedtime story about a unicorn.",
    stream=True
)

for event in response:
    print(event)

Response ID Security

By default, LiteLLM Proxy prevents users from accessing other users' response IDs.

This is done by encrypting the response ID with the user ID, enabling users to only access their own response IDs.

Trying to access someone else's response ID returns 403:

{
  "error": {
    "message": "Forbidden. The response id is not associated with the user, who this key belongs to.",
    "code": 403
  }
}

To disable this, set disable_responses_id_security: true:

general_settings:
  disable_responses_id_security: true

This allows any user to access any response ID.

Supported Responses API Parameters

Provider	Supported Parameters
`openai`	All Responses API parameters are supported
`azure`	All Responses API parameters are supported
`anthropic`	See supported parameters here
`bedrock`	See supported parameters here
`gemini`	See supported parameters here
`vertex_ai`	See supported parameters here
`azure_ai`	See supported parameters here
All other llm api providers	See supported parameters here

Load Balancing with Session Continuity.

When using the Responses API with multiple deployments of the same model (e.g., multiple Azure OpenAI endpoints), LiteLLM provides session continuity. This ensures that follow-up requests using a previous_response_id are routed to the same deployment that generated the original response.

Example Usage

import litellm

# Set up router with multiple deployments of the same model
router = litellm.Router(
    model_list=[
        {
            "model_name": "azure-gpt4-turbo",
            "litellm_params": {
                "model": "azure/gpt-4-turbo",
                "api_key": "your-api-key-1",
                "api_version": "2024-06-01",
                "api_base": "https://endpoint1.openai.azure.com",
            },
        },
        {
            "model_name": "azure-gpt4-turbo",
            "litellm_params": {
                "model": "azure/gpt-4-turbo",
                "api_key": "your-api-key-2",
                "api_version": "2024-06-01",
                "api_base": "https://endpoint2.openai.azure.com",
            },
        },
    ],
    optional_pre_call_checks=["responses_api_deployment_check"],
)

# Initial request
response = await router.aresponses(
    model="azure-gpt4-turbo",
    input="Hello, who are you?",
    truncation="auto",
)

# Store the response ID
response_id = response.id

# Follow-up request - will be automatically routed to the same deployment
follow_up = await router.aresponses(
    model="azure-gpt4-turbo",
    input="Tell me more about yourself",
    truncation="auto",
    previous_response_id=response_id  # This ensures routing to the same deployment
)

1. Setup session continuity on proxy config.yaml

To enable session continuity for Responses API in your LiteLLM proxy, set optional_pre_call_checks: ["responses_api_deployment_check"] in your proxy config.yaml.

model_list:
  - model_name: azure-gpt4-turbo
    litellm_params:
      model: azure/gpt-4-turbo
      api_key: your-api-key-1
      api_version: 2024-06-01
      api_base: https://endpoint1.openai.azure.com
  - model_name: azure-gpt4-turbo
    litellm_params:
      model: azure/gpt-4-turbo
      api_key: your-api-key-2
      api_version: 2024-06-01
      api_base: https://endpoint2.openai.azure.com

router_settings:
  optional_pre_call_checks: ["responses_api_deployment_check"]

2. Use the OpenAI Python SDK to make requests to LiteLLM Proxy

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000",
    api_key="your-api-key"
)

# Initial request
response = client.responses.create(
    model="azure-gpt4-turbo",
    input="Hello, who are you?"
)

response_id = response.id

# Follow-up request - will be automatically routed to the same deployment
follow_up = client.responses.create(
    model="azure-gpt4-turbo",
    input="Tell me more about yourself",
    previous_response_id=response_id  # This ensures routing to the same deployment
)

Calling non-Responses API endpoints (`/responses` to `/chat/completions` Bridge)

LiteLLM allows you to call non-Responses API models via a bridge to LiteLLM's /chat/completions endpoint. This is useful for calling Anthropic, Gemini and even non-Responses API OpenAI models.

Python SDK Usage

import litellm
import os

# Set API key
os.environ["ANTHROPIC_API_KEY"] = "your-anthropic-api-key"

# Non-streaming response
response = litellm.responses(
    model="anthropic/claude-3-5-sonnet-20240620",
    input="Tell me a three sentence bedtime story about a unicorn.",
    max_output_tokens=100
)

print(response)

LiteLLM Proxy Usage

Setup Config:

model_list:
- model_name: anthropic-model
  litellm_params:
    model: anthropic/claude-3-5-sonnet-20240620
    api_key: os.environ/ANTHROPIC_API_KEY

Start Proxy:

litellm --config /path/to/config.yaml

# RUNNING on http://0.0.0.0:4000

Make Request:

curl http://localhost:4000/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-1234" \
  -d '{
    "model": "anthropic-model",
    "input": "who is Michael Jordan"
  }'

Session Management

LiteLLM Proxy supports session management for all supported models. This allows you to store and fetch conversation history (state) in LiteLLM Proxy.

Usage

Enable storing request / response content in the database

Set store_prompts_in_cold_storage: true in your proxy config.yaml. When this is enabled, LiteLLM will store the request and response content in the s3 bucket you specify.

litellm_settings:
  callbacks: ["s3_v2"]
  cold_storage_custom_logger: s3_v2
  s3_callback_params: # learn more https://docs.litellm.ai/docs/proxy/logging#s3-buckets
    s3_bucket_name: litellm-logs   # AWS Bucket Name for S3
    s3_region_name: us-west-2      

general_settings:
  store_prompts_in_cold_storage: true
  store_prompts_in_spend_logs: true

Make request 1 with no previous_response_id (new session)

Start a new conversation by making a request without specifying a previous response ID.

curl http://localhost:4000/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-1234" \
  -d '{
    "model": "anthropic/claude-3-5-sonnet-latest",
    "input": "who is Michael Jordan"
  }'

from openai import OpenAI

# Initialize the client with your LiteLLM proxy URL
client = OpenAI(
    base_url="http://localhost:4000",
    api_key="sk-1234"
)

# Make initial request to start a new conversation
response = client.responses.create(
    model="anthropic/claude-3-5-sonnet-latest",
    input="who is Michael Jordan"
)

print(response.id)  # Store this ID for future requests in same session
print(response.output[0].content[0].text)

Response:

{
  "id":"resp_123abc",
  "model":"claude-3-5-sonnet-20241022",
  "output":[{
    "type":"message",
    "content":[{
      "type":"output_text",
      "text":"Michael Jordan is widely considered one of the greatest basketball players of all time. He played for the Chicago Bulls (1984-1993, 1995-1998) and Washington Wizards (2001-2003), winning 6 NBA Championships with the Bulls."
    }]
  }]
}

Make request 2 with previous_response_id (same session)

Continue the conversation by referencing the previous response ID to maintain conversation context.

curl http://localhost:4000/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-1234" \
  -d '{
    "model": "anthropic/claude-3-5-sonnet-latest",
    "input": "can you tell me more about him",
    "previous_response_id": "resp_123abc"
  }'

from openai import OpenAI

# Initialize the client with your LiteLLM proxy URL
client = OpenAI(
    base_url="http://localhost:4000",
    api_key="sk-1234"
)

# Make follow-up request in the same conversation session
follow_up_response = client.responses.create(
    model="anthropic/claude-3-5-sonnet-latest",
    input="can you tell me more about him",
    previous_response_id="resp_123abc"  # ID from the previous response
)

print(follow_up_response.output[0].content[0].text)

Response:

{
  "id":"resp_456def",
  "model":"claude-3-5-sonnet-20241022",
  "output":[{
    "type":"message",
    "content":[{
      "type":"output_text",
      "text":"Michael Jordan was born February 17, 1963. He attended University of North Carolina before being drafted 3rd overall by the Bulls in 1984. Beyond basketball, he built the Air Jordan brand with Nike and later became owner of the Charlotte Hornets."
    }]
  }]
}

Make request 3 with no previous_response_id (new session)

Start a brand new conversation without referencing previous context to demonstrate how context is not maintained between sessions.

curl http://localhost:4000/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-1234" \
  -d '{
    "model": "anthropic/claude-3-5-sonnet-latest",
    "input": "can you tell me more about him"
  }'

from openai import OpenAI

# Initialize the client with your LiteLLM proxy URL
client = OpenAI(
    base_url="http://localhost:4000",
    api_key="sk-1234"
)

# Make a new request without previous context
new_session_response = client.responses.create(
    model="anthropic/claude-3-5-sonnet-latest",
    input="can you tell me more about him"
    # No previous_response_id means this starts a new conversation
)

print(new_session_response.output[0].content[0].text)

Response:

{
  "id":"resp_789ghi",
  "model":"claude-3-5-sonnet-20241022",
  "output":[{
    "type":"message",
    "content":[{
      "type":"output_text",
      "text":"I don't see who you're referring to in our conversation. Could you let me know which person you'd like to learn more about?"
    }]
  }]
}

30 KiB Raw Blame History Unescape Escape

/responses

Usage

LiteLLM Python SDK

Non-streaming

Streaming

Image Generation with Streaming

GET a Response

CANCEL a Response

DELETE a Response

Non-streaming

Streaming

Non-streaming

Streaming

Non-streaming

Streaming

Non-streaming

Streaming

LiteLLM Proxy with OpenAI SDK

Non-streaming

Streaming

Image Generation with Streaming

GET a Response

DELETE a Response

Non-streaming

Streaming

Non-streaming

Streaming

Non-streaming

Streaming

Non-streaming

Streaming

Response ID Security

Supported Responses API Parameters

Load Balancing with Session Continuity.

Example Usage

1. Setup session continuity on proxy config.yaml

2. Use the OpenAI Python SDK to make requests to LiteLLM Proxy

Calling non-Responses API endpoints (/responses to /chat/completions Bridge)

Python SDK Usage

LiteLLM Proxy Usage

Session Management

Usage

30 KiB

Raw Blame History

Calling non-Responses API endpoints (`/responses` to `/chat/completions` Bridge)