mirror of
https://github.com/tiennm99/litellm.git
synced 2026-06-18 03:31:23 +00:00
3872af5fe0
* fixes 1 * docs fix * docs fix * docs fix * docs fix
399 lines
16 KiB
Markdown
399 lines
16 KiB
Markdown
# LiteLLM Architecture - LiteLLM SDK + AI Gateway
|
||
|
||
This document helps contributors understand where to make changes in LiteLLM.
|
||
|
||
---
|
||
|
||
## How It Works
|
||
|
||
The LiteLLM AI Gateway (Proxy) uses the LiteLLM SDK internally for all LLM calls:
|
||
|
||
```
|
||
OpenAI SDK (client) ──▶ LiteLLM AI Gateway (proxy/) ──▶ LiteLLM SDK (litellm/) ──▶ LLM API
|
||
Anthropic SDK (client) ──▶ LiteLLMAI Gateway (proxy/) ──▶ LiteLLM SDK (litellm/) ──▶ LLM API
|
||
Any HTTP client ──▶ LiteLLMAI Gateway (proxy/) ──▶ LiteLLM SDK (litellm/) ──▶ LLM API
|
||
```
|
||
|
||
The **AI Gateway** adds authentication, rate limiting, budgets, and routing on top of the SDK.
|
||
The **SDK** handles the actual LLM provider calls, request/response transformations, and streaming.
|
||
|
||
---
|
||
|
||
## 1. AI Gateway (Proxy) Request Flow
|
||
|
||
The AI Gateway (`litellm/proxy/`) wraps the SDK with authentication, rate limiting, and management features.
|
||
|
||
```mermaid
|
||
sequenceDiagram
|
||
participant Client
|
||
participant ProxyServer as proxy/proxy_server.py
|
||
participant Auth as proxy/auth/user_api_key_auth.py
|
||
participant Redis as Redis Cache
|
||
participant Hooks as proxy/hooks/
|
||
participant Router as router.py
|
||
participant Main as main.py + utils.py
|
||
participant Handler as llms/custom_httpx/llm_http_handler.py
|
||
participant Transform as llms/{provider}/chat/transformation.py
|
||
participant Provider as LLM Provider API
|
||
participant CostCalc as cost_calculator.py
|
||
participant LoggingObj as litellm_logging.py
|
||
participant DBWriter as db/db_spend_update_writer.py
|
||
participant Postgres as PostgreSQL
|
||
|
||
%% Request Flow
|
||
Client->>ProxyServer: POST /v1/chat/completions
|
||
ProxyServer->>Auth: user_api_key_auth()
|
||
Auth->>Redis: Check API key cache
|
||
Redis-->>Auth: Key info + spend limits
|
||
ProxyServer->>Hooks: max_budget_limiter, parallel_request_limiter
|
||
Hooks->>Redis: Check/increment rate limit counters
|
||
ProxyServer->>Router: route_request()
|
||
Router->>Main: litellm.acompletion()
|
||
Main->>Handler: BaseLLMHTTPHandler.completion()
|
||
Handler->>Transform: ProviderConfig.transform_request()
|
||
Handler->>Provider: HTTP Request
|
||
Provider-->>Handler: Response
|
||
Handler->>Transform: ProviderConfig.transform_response()
|
||
Transform-->>Handler: ModelResponse
|
||
Handler-->>Main: ModelResponse
|
||
|
||
%% Cost Attribution (in utils.py wrapper)
|
||
Main->>LoggingObj: update_response_metadata()
|
||
LoggingObj->>CostCalc: _response_cost_calculator()
|
||
CostCalc->>CostCalc: completion_cost(tokens × price)
|
||
CostCalc-->>LoggingObj: response_cost
|
||
LoggingObj-->>Main: Set response._hidden_params["response_cost"]
|
||
Main-->>ProxyServer: ModelResponse (with cost in _hidden_params)
|
||
|
||
%% Response Headers + Async Logging
|
||
ProxyServer->>ProxyServer: Extract cost from hidden_params
|
||
ProxyServer->>LoggingObj: async_success_handler()
|
||
LoggingObj->>Hooks: async_log_success_event()
|
||
Hooks->>DBWriter: update_database(response_cost)
|
||
DBWriter->>Redis: Queue spend increment
|
||
DBWriter->>Postgres: Batch write spend logs (async)
|
||
ProxyServer-->>Client: ModelResponse + x-litellm-response-cost header
|
||
```
|
||
|
||
### Proxy Components
|
||
|
||
```mermaid
|
||
graph TD
|
||
subgraph "Incoming Request"
|
||
Client["POST /v1/chat/completions"]
|
||
end
|
||
|
||
subgraph "proxy/proxy_server.py"
|
||
Endpoint["chat_completion()"]
|
||
end
|
||
|
||
subgraph "proxy/auth/"
|
||
Auth["user_api_key_auth()"]
|
||
end
|
||
|
||
subgraph "proxy/"
|
||
PreCall["litellm_pre_call_utils.py"]
|
||
RouteRequest["route_llm_request.py"]
|
||
end
|
||
|
||
subgraph "litellm/"
|
||
Router["router.py"]
|
||
Main["main.py"]
|
||
end
|
||
|
||
subgraph "Infrastructure"
|
||
DualCache["DualCache<br/>(in-memory + Redis)"]
|
||
Postgres["PostgreSQL<br/>(keys, teams, spend logs)"]
|
||
end
|
||
|
||
Client --> Endpoint
|
||
Endpoint --> Auth
|
||
Auth --> DualCache
|
||
DualCache -.->|cache miss| Postgres
|
||
Auth --> PreCall
|
||
PreCall --> RouteRequest
|
||
RouteRequest --> Router
|
||
Router --> DualCache
|
||
Router --> Main
|
||
Main --> Client
|
||
```
|
||
|
||
**Key proxy files:**
|
||
- `proxy/proxy_server.py` - Main API endpoints
|
||
- `proxy/auth/` - Authentication (API keys, JWT, OAuth2)
|
||
- `proxy/hooks/` - Proxy-level callbacks
|
||
- `router.py` - Load balancing, fallbacks
|
||
- `router_strategy/` - Routing algorithms (`lowest_latency.py`, `simple_shuffle.py`, etc.)
|
||
|
||
**LLM-specific proxy endpoints:**
|
||
|
||
| Endpoint | Directory | Purpose |
|
||
|----------|-----------|---------|
|
||
| `/v1/messages` | `proxy/anthropic_endpoints/` | Anthropic Messages API |
|
||
| `/vertex-ai/*` | `proxy/vertex_ai_endpoints/` | Vertex AI passthrough |
|
||
| `/gemini/*` | `proxy/google_endpoints/` | Google AI Studio passthrough |
|
||
| `/v1/images/*` | `proxy/image_endpoints/` | Image generation |
|
||
| `/v1/batches` | `proxy/batches_endpoints/` | Batch processing |
|
||
| `/v1/files` | `proxy/openai_files_endpoints/` | File uploads |
|
||
| `/v1/fine_tuning` | `proxy/fine_tuning_endpoints/` | Fine-tuning jobs |
|
||
| `/v1/rerank` | `proxy/rerank_endpoints/` | Reranking |
|
||
| `/v1/responses` | `proxy/response_api_endpoints/` | OpenAI Responses API |
|
||
| `/v1/vector_stores` | `proxy/vector_store_endpoints/` | Vector stores |
|
||
| `/*` (passthrough) | `proxy/pass_through_endpoints/` | Direct provider passthrough |
|
||
|
||
**Proxy Hooks** (`proxy/hooks/__init__.py`):
|
||
|
||
| Hook | File | Purpose |
|
||
|------|------|---------|
|
||
| `max_budget_limiter` | `proxy/hooks/max_budget_limiter.py` | Enforce budget limits |
|
||
| `parallel_request_limiter` | `proxy/hooks/parallel_request_limiter_v3.py` | Rate limiting per key/user |
|
||
| `cache_control_check` | `proxy/hooks/cache_control_check.py` | Cache validation |
|
||
| `responses_id_security` | `proxy/hooks/responses_id_security.py` | Response ID validation |
|
||
| `litellm_skills` | `proxy/hooks/skills_injection.py` | Skills injection |
|
||
|
||
To add a new proxy hook, implement `CustomLogger` and register in `PROXY_HOOKS`.
|
||
|
||
### Infrastructure Components
|
||
|
||
The AI Gateway uses external infrastructure for persistence and caching:
|
||
|
||
```mermaid
|
||
graph LR
|
||
subgraph "AI Gateway (proxy/)"
|
||
Proxy["proxy_server.py"]
|
||
Auth["auth/user_api_key_auth.py"]
|
||
DBWriter["db/db_spend_update_writer.py<br/>DBSpendUpdateWriter"]
|
||
InternalCache["utils.py<br/>InternalUsageCache"]
|
||
CostCallback["hooks/proxy_track_cost_callback.py<br/>_ProxyDBLogger"]
|
||
Scheduler["APScheduler<br/>ProxyStartupEvent"]
|
||
end
|
||
|
||
subgraph "SDK (litellm/)"
|
||
Router["router.py<br/>Router.cache (DualCache)"]
|
||
LLMCache["caching/caching_handler.py<br/>LLMCachingHandler"]
|
||
CacheClass["caching/caching.py<br/>Cache"]
|
||
end
|
||
|
||
subgraph "Redis (caching/redis_cache.py)"
|
||
RateLimit["Rate Limit Counters"]
|
||
SpendQueue["Spend Increment Queue"]
|
||
KeyCache["API Key Cache"]
|
||
TPM_RPM["TPM/RPM Tracking"]
|
||
Cooldowns["Deployment Cooldowns"]
|
||
LLMResponseCache["LLM Response Cache"]
|
||
end
|
||
|
||
subgraph "PostgreSQL (proxy/schema.prisma)"
|
||
Keys["LiteLLM_VerificationToken"]
|
||
Teams["LiteLLM_TeamTable"]
|
||
SpendLogs["LiteLLM_SpendLogs"]
|
||
Users["LiteLLM_UserTable"]
|
||
end
|
||
|
||
Auth --> InternalCache
|
||
InternalCache --> KeyCache
|
||
InternalCache -.->|cache miss| Keys
|
||
InternalCache --> RateLimit
|
||
Router --> TPM_RPM
|
||
Router --> Cooldowns
|
||
LLMCache --> CacheClass
|
||
CacheClass --> LLMResponseCache
|
||
CostCallback --> DBWriter
|
||
DBWriter --> SpendQueue
|
||
DBWriter --> SpendLogs
|
||
Scheduler --> SpendLogs
|
||
Scheduler --> Keys
|
||
```
|
||
|
||
| Component | Purpose | Key Files/Classes |
|
||
|-----------|---------|-------------------|
|
||
| **Redis** | Rate limiting, API key caching, TPM/RPM tracking, cooldowns, LLM response caching, spend queuing | `caching/redis_cache.py` (`RedisCache`), `caching/dual_cache.py` (`DualCache`) |
|
||
| **PostgreSQL** | API keys, teams, users, spend logs | `proxy/utils.py` (`PrismaClient`), `proxy/schema.prisma` |
|
||
| **InternalUsageCache** | Proxy-level cache for rate limits + API keys (in-memory + Redis) | `proxy/utils.py` (`InternalUsageCache`) |
|
||
| **Router.cache** | TPM/RPM tracking, deployment cooldowns, client caching (in-memory + Redis) | `router.py` (`Router.cache: DualCache`) |
|
||
| **LLMCachingHandler** | SDK-level LLM response/embedding caching | `caching/caching_handler.py` (`LLMCachingHandler`), `caching/caching.py` (`Cache`) |
|
||
| **DBSpendUpdateWriter** | Batches spend updates to reduce DB writes | `proxy/db/db_spend_update_writer.py` (`DBSpendUpdateWriter`) |
|
||
| **Cost Tracking** | Calculates and logs response costs | `proxy/hooks/proxy_track_cost_callback.py` (`_ProxyDBLogger`) |
|
||
|
||
**Background Jobs** (APScheduler, initialized in `proxy/proxy_server.py` → `ProxyStartupEvent.initialize_scheduled_background_jobs()`):
|
||
|
||
| Job | Interval | Purpose | Key Files |
|
||
|-----|----------|---------|-----------|
|
||
| `update_spend` | 60s | Batch write spend logs to PostgreSQL | `proxy/db/db_spend_update_writer.py` |
|
||
| `reset_budget` | 10-12min | Reset budgets for keys/users/teams | `proxy/management_helpers/budget_reset_job.py` |
|
||
| `add_deployment` | 10s | Sync new model deployments from DB | `proxy/proxy_server.py` (`ProxyConfig`) |
|
||
| `cleanup_old_spend_logs` | cron/interval | Delete old spend logs | `proxy/management_helpers/spend_log_cleanup.py` |
|
||
| `check_batch_cost` | 30min | Calculate costs for batch jobs | `proxy/management_helpers/check_batch_cost_job.py` |
|
||
| `check_responses_cost` | 30min | Calculate costs for responses API | `proxy/management_helpers/check_responses_cost_job.py` |
|
||
| `process_rotations` | 1hr | Auto-rotate API keys | `proxy/management_helpers/key_rotation_manager.py` |
|
||
| `_run_background_health_check` | continuous | Health check model deployments | `proxy/proxy_server.py` |
|
||
| `send_weekly_spend_report` | weekly | Slack spend alerts | `proxy/utils.py` (`SlackAlerting`) |
|
||
| `send_monthly_spend_report` | monthly | Slack spend alerts | `proxy/utils.py` (`SlackAlerting`) |
|
||
|
||
**Cost Attribution Flow:**
|
||
1. LLM response returns to `utils.py` wrapper after `litellm.acompletion()` completes
|
||
2. `update_response_metadata()` (`llm_response_utils/response_metadata.py`) is called
|
||
3. `logging_obj._response_cost_calculator()` (`litellm_logging.py`) calculates cost via `litellm.completion_cost()` (`cost_calculator.py`)
|
||
4. Cost is stored in `response._hidden_params["response_cost"]`
|
||
5. `proxy/common_request_processing.py` extracts cost from `hidden_params` and adds to response headers (`x-litellm-response-cost`)
|
||
6. `logging_obj.async_success_handler()` triggers callbacks including `_ProxyDBLogger.async_log_success_event()`
|
||
7. `DBSpendUpdateWriter.update_database()` queues spend increments to Redis
|
||
8. Background job `update_spend` flushes queued spend to PostgreSQL every 60s
|
||
|
||
---
|
||
|
||
## 2. SDK Request Flow
|
||
|
||
The SDK (`litellm/`) provides the core LLM calling functionality used by both direct SDK users and the AI Gateway.
|
||
|
||
```mermaid
|
||
graph TD
|
||
subgraph "SDK Entry Points"
|
||
Completion["litellm.completion()"]
|
||
Messages["litellm.messages()"]
|
||
end
|
||
|
||
subgraph "main.py"
|
||
Main["completion()<br/>acompletion()"]
|
||
end
|
||
|
||
subgraph "utils.py"
|
||
GetProvider["get_llm_provider()"]
|
||
end
|
||
|
||
subgraph "llms/custom_httpx/"
|
||
Handler["llm_http_handler.py<br/>BaseLLMHTTPHandler"]
|
||
HTTP["http_handler.py<br/>HTTPHandler / AsyncHTTPHandler"]
|
||
end
|
||
|
||
subgraph "llms/{provider}/chat/"
|
||
TransformReq["transform_request()"]
|
||
TransformResp["transform_response()"]
|
||
end
|
||
|
||
subgraph "litellm_core_utils/"
|
||
Streaming["streaming_handler.py"]
|
||
end
|
||
|
||
subgraph "integrations/ (async, off main thread)"
|
||
Callbacks["custom_logger.py<br/>Langfuse, Datadog, etc."]
|
||
end
|
||
|
||
Completion --> Main
|
||
Messages --> Main
|
||
Main --> GetProvider
|
||
GetProvider --> Handler
|
||
Handler --> TransformReq
|
||
TransformReq --> HTTP
|
||
HTTP --> Provider["LLM Provider API"]
|
||
Provider --> HTTP
|
||
HTTP --> TransformResp
|
||
TransformResp --> Streaming
|
||
Streaming --> Response["ModelResponse"]
|
||
Response -.->|async| Callbacks
|
||
```
|
||
|
||
**Key SDK files:**
|
||
- `main.py` - Entry points: `completion()`, `acompletion()`, `embedding()`
|
||
- `utils.py` - `get_llm_provider()` resolves model → provider
|
||
- `llms/custom_httpx/llm_http_handler.py` - Central HTTP orchestrator
|
||
- `llms/custom_httpx/http_handler.py` - Low-level HTTP client
|
||
- `llms/{provider}/chat/transformation.py` - Provider-specific transformations
|
||
- `litellm_core_utils/streaming_handler.py` - Streaming response handling
|
||
- `integrations/` - Async callbacks (Langfuse, Datadog, etc.)
|
||
|
||
---
|
||
|
||
## 3. Translation Layer
|
||
|
||
When a request comes in, it goes through a **translation layer** that converts between API formats.
|
||
Each translation is isolated in its own file, making it easy to test and modify independently.
|
||
|
||
### Where to find translations
|
||
|
||
| Incoming API | Provider | Translation File |
|
||
|--------------|----------|------------------|
|
||
| `/v1/chat/completions` | Anthropic | `llms/anthropic/chat/transformation.py` |
|
||
| `/v1/chat/completions` | Bedrock Converse | `llms/bedrock/chat/converse_transformation.py` |
|
||
| `/v1/chat/completions` | Bedrock Invoke | `llms/bedrock/chat/invoke_transformations/anthropic_claude3_transformation.py` |
|
||
| `/v1/chat/completions` | Gemini | `llms/gemini/chat/transformation.py` |
|
||
| `/v1/chat/completions` | Vertex AI | `llms/vertex_ai/gemini/transformation.py` |
|
||
| `/v1/chat/completions` | OpenAI | `llms/openai/chat/gpt_transformation.py` |
|
||
| `/v1/messages` (passthrough) | Anthropic | `llms/anthropic/experimental_pass_through/messages/transformation.py` |
|
||
| `/v1/messages` (passthrough) | Bedrock | `llms/bedrock/messages/invoke_transformations/anthropic_claude3_transformation.py` |
|
||
| `/v1/messages` (passthrough) | Vertex AI | `llms/vertex_ai/vertex_ai_partner_models/anthropic/experimental_pass_through/transformation.py` |
|
||
| Passthrough endpoints | All | `proxy/pass_through_endpoints/llm_provider_handlers/` |
|
||
|
||
### Example: Debugging prompt caching
|
||
|
||
If `/v1/messages` → Bedrock Converse prompt caching isn't working but Bedrock Invoke works:
|
||
|
||
1. **Bedrock Converse translation**: `llms/bedrock/chat/converse_transformation.py`
|
||
2. **Bedrock Invoke translation**: `llms/bedrock/chat/invoke_transformations/anthropic_claude3_transformation.py`
|
||
3. Compare how each handles `cache_control` in `transform_request()`
|
||
|
||
### How translations work
|
||
|
||
Each provider has a `Config` class that inherits from `BaseConfig` (`llms/base_llm/chat/transformation.py`):
|
||
|
||
```python
|
||
class ProviderConfig(BaseConfig):
|
||
def transform_request(self, model, messages, optional_params, litellm_params, headers):
|
||
# Convert OpenAI format → Provider format
|
||
return {"messages": transformed_messages, ...}
|
||
|
||
def transform_response(self, model, raw_response, model_response, logging_obj, ...):
|
||
# Convert Provider format → OpenAI format
|
||
return ModelResponse(choices=[...], usage=Usage(...))
|
||
```
|
||
|
||
The `BaseLLMHTTPHandler` (`llms/custom_httpx/llm_http_handler.py`) calls these methods - you never need to modify the handler itself.
|
||
|
||
---
|
||
|
||
## 4. Adding/Modifying Providers
|
||
|
||
### To add a new provider:
|
||
|
||
1. Create `llms/{provider}/chat/transformation.py`
|
||
2. Implement `Config` class with `transform_request()` and `transform_response()`
|
||
3. Add tests in `tests/llm_translation/test_{provider}.py`
|
||
|
||
### To add a feature (e.g., prompt caching):
|
||
|
||
1. Find the translation file from the table above
|
||
2. Modify `transform_request()` to handle the new parameter
|
||
3. Add unit tests that verify the transformation
|
||
|
||
### Testing checklist
|
||
|
||
When adding a feature, verify it works across all paths:
|
||
|
||
| Test | File Pattern |
|
||
|------|--------------|
|
||
| OpenAI passthrough | `tests/llm_translation/test_openai*.py` |
|
||
| Anthropic direct | `tests/llm_translation/test_anthropic*.py` |
|
||
| Bedrock Invoke | `tests/llm_translation/test_bedrock*.py` |
|
||
| Bedrock Converse | `tests/llm_translation/test_bedrock*converse*.py` |
|
||
| Vertex AI | `tests/llm_translation/test_vertex*.py` |
|
||
| Gemini | `tests/llm_translation/test_gemini*.py` |
|
||
|
||
### Unit testing translations
|
||
|
||
Translations are designed to be unit testable without making API calls:
|
||
|
||
```python
|
||
from litellm.llms.bedrock.chat.converse_transformation import BedrockConverseConfig
|
||
|
||
def test_prompt_caching_transform():
|
||
config = BedrockConverseConfig()
|
||
result = config.transform_request(
|
||
model="anthropic.claude-3-opus",
|
||
messages=[{"role": "user", "content": "test", "cache_control": {"type": "ephemeral"}}],
|
||
optional_params={},
|
||
litellm_params={},
|
||
headers={}
|
||
)
|
||
assert "cachePoint" in str(result) # Verify cache_control was translated
|
||
```
|