
# System Architecture

## High-Level Overview

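Claude Central Gateway is a protocol translator between Anthropic's Messages API and OpenAI's Chat Completions API. It accepts requests in Anthropic format, forwards them to OpenAI in Chat Completions format, and converts each response back to Anthropic format before returning it to the client. The added overhead is estimated at roughly 50-200 ms per request (see Performance).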

```
Client (Claude Code)
    ↓
HTTP Request (Anthropic API format)
    ↓
[Auth Middleware] → Validates x-api-key or Authorization: Bearer token
    ↓
[Model Mapping] → Maps claude-* model names to OpenAI models
    ↓
[Request Transformation] → Anthropic format → OpenAI format
    ↓
[OpenAI Client] → Sends request to OpenAI API
    ↓
OpenAI Response (stream or JSON)
    ↓
[Response Transformation] → OpenAI format → Anthropic format
    ↓
HTTP Response (Anthropic SSE or JSON)
    ↓
Client receives response
```

## Request Flow (Detailed)

### 1. Incoming Request
```
POST /v1/messages HTTP/1.1
Host: gateway.example.com
x-api-key: my-secret-token
Content-Type: application/json

{
  "model": "claude-sonnet-4-20250514",
  "messages": [...],
  "tools": [...],
  "stream": true,
  ...
}
```

### 2. Authentication Stage
- **Middleware**: `authMiddleware()` from `auth-middleware.js`
- **Input**: HTTP request with headers
- **Process**:
  1. Extract `x-api-key` header or `Authorization: Bearer` header
  2. Compare against `GATEWAY_TOKEN` using `timingSafeEqual()` (constant-time comparison)
  3. If invalid: Return 401 Unauthorized
  4. If valid: Proceed to next middleware
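
The token check above can be sketched as follows. This is a minimal illustration, not the actual `authMiddleware()`: the helper names `extractToken`, `tokensMatch`, and `checkAuth` are hypothetical, headers are assumed to arrive as a plain object with lower-cased keys, and hashing both values with SHA-256 before calling `timingSafeEqual()` is just one common way to satisfy its equal-length requirement.

```javascript
const { createHash, timingSafeEqual } = require("node:crypto");

// Pull the client token from either supported header.
// Assumption: `headers` is a plain object with lower-cased header names.
function extractToken(headers) {
  if (headers["x-api-key"]) return headers["x-api-key"];
  const auth = headers["authorization"] || "";
  return auth.startsWith("Bearer ") ? auth.slice("Bearer ".length) : null;
}

// Constant-time comparison. timingSafeEqual() throws when the two buffers
// differ in length, so both values are hashed to fixed-size digests first.
function tokensMatch(provided, expected) {
  const a = createHash("sha256").update(String(provided)).digest();
  const b = createHash("sha256").update(String(expected)).digest();
  return timingSafeEqual(a, b);
}

// Returns the status the auth stage would produce: 401, or 200 to proceed.
function checkAuth(headers, gatewayToken) {
  const token = extractToken(headers);
  if (!token || !gatewayToken) return 401;
  return tokensMatch(token, gatewayToken) ? 200 : 401;
}
```

In a Hono middleware, a 401 result would translate into an early error response, and a 200 result into a call to the next handler.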

### 3. Model Mapping
- **Module**: `openai-client.js`
- **Input**: Model name from request (e.g., `claude-sonnet-4-20250514`)
- **Process**:
  1. Check `MODEL_MAP` environment variable (format: `claude:gpt-4o,claude-opus:gpt-4-turbo`)
  2. If mapping found: Use mapped model name
  3. If no mapping: Use original model name as fallback
- **Output**: Canonical OpenAI model name (e.g., `gpt-4o`)
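
A possible shape for the lookup is shown below. The function names `parseModelMap` and `mapModel` are hypothetical, and the sketch assumes exact-name matching; the actual module may match differently (for example, by prefix).

```javascript
// Parse a MODEL_MAP string such as "claude:gpt-4o,claude-opus:gpt-4-turbo"
// into a Map of source model name -> target model name.
function parseModelMap(raw) {
  const map = new Map();
  for (const pair of (raw || "").split(",")) {
    const idx = pair.indexOf(":");
    if (idx === -1) continue; // skip empty or malformed entries
    const from = pair.slice(0, idx).trim();
    const to = pair.slice(idx + 1).trim();
    if (from && to) map.set(from, to);
  }
  return map;
}

// Exact-match lookup; unmapped names pass through unchanged.
function mapModel(model, map) {
  return map.get(model) ?? model;
}
```

Typical usage would be `mapModel(body.model, parseModelMap(process.env.MODEL_MAP))`, with the map parsed once at startup rather than per request.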

### 4. Request Transformation
- **Module**: `transform-request.js`, function `buildOpenAIRequest()`
- **Input**: Anthropic request body + mapped model name
- **Transformations**:

  **Parameters** (passed through, renamed where the two APIs differ):
  - `max_tokens` → `max_tokens`
  - `temperature` → `temperature`
  - `top_p` → `top_p`
  - `stream` → `stream` (and adds `stream_options: { include_usage: true }`)
  - `stop_sequences` → `stop` array

  **Tools**:
  - Convert Anthropic tool definitions to OpenAI function tools
  - Map `tool_choice` enum to OpenAI tool_choice format

  **Messages Array** (complex transformation):
  - **System message**: String or array of text blocks → Single system message
  - **User messages**: Handle text, images, and tool_result blocks
  - **Assistant messages**: Handle text and tool_use blocks

  **Content Block Handling**:
  - `text`: Preserved as-is
  - `image` (base64 or URL): Converted to `image_url` format
  - `tool_use`: Converted to OpenAI `tool_calls`
  - `tool_result`: Split into separate tool messages
  - Other blocks (e.g., `thinking`) and `cache_control` annotations: Filtered out

- **Output**: OpenAI Chat Completions request payload (object, not stringified)
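
The parameter, tool, and system-message mappings listed above might look roughly like this. It is a sketch under assumptions, not the body of `buildOpenAIRequest()`: all function names are hypothetical, mapping `any` to OpenAI's `required` is an assumed equivalence, and joining system text blocks with a newline is an arbitrary choice. Message conversion is left out.

```javascript
// Anthropic tool definitions -> OpenAI function tools.
function toOpenAITools(tools = []) {
  return tools.map((t) => ({
    type: "function",
    function: { name: t.name, description: t.description, parameters: t.input_schema },
  }));
}

// Anthropic tool_choice -> OpenAI tool_choice.
function toOpenAIToolChoice(choice) {
  if (!choice) return undefined;
  switch (choice.type) {
    case "auto": return "auto";
    case "any": return "required"; // assumed closest equivalent
    case "none": return "none";
    case "tool": return { type: "function", function: { name: choice.name } };
    default: return undefined;
  }
}

// Flatten the `system` field (string or text blocks) into a single string.
function toSystemText(system) {
  if (!system) return null;
  if (typeof system === "string") return system;
  return system.filter((b) => b.type === "text").map((b) => b.text).join("\n");
}

// Build the top-level OpenAI parameters from an Anthropic request body.
function toOpenAIParams(body, model) {
  const out = { model, max_tokens: body.max_tokens };
  if (body.temperature !== undefined) out.temperature = body.temperature;
  if (body.top_p !== undefined) out.top_p = body.top_p;
  if (body.stop_sequences?.length) out.stop = body.stop_sequences;
  if (body.stream) {
    out.stream = true;
    out.stream_options = { include_usage: true }; // ask for usage in the final chunk
  }
  if (body.tools?.length) out.tools = toOpenAITools(body.tools);
  const choice = toOpenAIToolChoice(body.tool_choice);
  if (choice !== undefined) out.tool_choice = choice;
  return out;
}
```

The string returned by `toSystemText` would become the `content` of a leading `{ role: "system" }` message in the converted messages array.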

### 5. OpenAI API Call
- **Module**: `routes/messages.js` route handler
- **Process**:
  1. Serialize payload to JSON
  2. Send to OpenAI API with authentication header
  3. If streaming: Request returns async iterable of chunks
  4. If non-streaming: Request returns single response object

### 6. Response Transformation

#### Non-Streaming Path
- **Module**: `transform-response.js`, function `transformResponse()`
- **Input**: OpenAI response object + original Anthropic request
- **Process**:
  1. Extract first choice from OpenAI response
  2. Build content blocks array:
     - Extract text from `message.content` if present
     - Extract tool_calls and convert to Anthropic `tool_use` format
  3. Map OpenAI `finish_reason` to Anthropic `stop_reason`
  4. Build response envelope with message metadata
  5. Convert usage tokens (prompt/completion → input/output)
- **Output**: Single Anthropic message response object
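
Steps 2 and 5 of the non-streaming path can be illustrated with two small helpers. The names `toContentBlocks` and `toUsage` are hypothetical, and falling back to an empty `input` object when the tool arguments are not valid JSON is a choice made for this sketch rather than documented gateway behavior.

```javascript
// Convert an OpenAI assistant message into Anthropic content blocks.
function toContentBlocks(message) {
  const blocks = [];
  if (message.content) blocks.push({ type: "text", text: message.content });
  for (const call of message.tool_calls ?? []) {
    let input = {};
    try {
      // OpenAI delivers arguments as a JSON string; Anthropic expects an object.
      input = JSON.parse(call.function.arguments || "{}");
    } catch {
      // Malformed JSON: keep the empty object (sketch-only fallback).
    }
    blocks.push({ type: "tool_use", id: call.id, name: call.function.name, input });
  }
  return blocks;
}

// Rename usage counters: prompt/completion -> input/output.
function toUsage(usage) {
  return {
    input_tokens: usage?.prompt_tokens ?? 0,
    output_tokens: usage?.completion_tokens ?? 0,
  };
}
```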

#### Streaming Path
- **Module**: `transform-response.js`, function `streamAnthropicResponse()`
- **Input**: Hono context + OpenAI response stream + original Anthropic request
- **Process**:
  1. Emit `message_start` event with empty message envelope
  2. For each OpenAI chunk:
     - Track `finish_reason` for final stop_reason
     - Handle text deltas: Send `content_block_start`, `content_block_delta`, `content_block_stop`
     - Handle tool_calls deltas: Similar sequencing, buffer arguments
     - Track usage tokens from final chunk
  3. Emit `message_delta` with final stop_reason and output tokens
  4. Emit `message_stop` to mark end of stream
- **Output**: Server-Sent Events stream (Content-Type: text/event-stream)
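
The event sequencing for a text-only stream can be sketched as a generator. This is not `streamAnthropicResponse()` itself: the function name `toAnthropicEvents` and the placeholder message id are hypothetical, tool-call deltas are omitted, only two finish reasons are mapped, and a plain iterable stands in for the async stream so the example runs without a network call (a real stream would be consumed with `for await`).

```javascript
// Translate OpenAI stream chunks into Anthropic stream events (text deltas only).
function* toAnthropicEvents(chunks, model) {
  yield {
    type: "message_start",
    message: {
      id: "msg_sketch", // hypothetical placeholder id
      type: "message",
      role: "assistant",
      content: [],
      model,
      stop_reason: null,
      usage: { input_tokens: 0, output_tokens: 0 },
    },
  };

  let textOpen = false;
  let finishReason = null;
  let outputTokens = 0;

  for (const chunk of chunks) {
    const choice = chunk.choices?.[0];
    if (choice?.delta?.content) {
      if (!textOpen) {
        // Open the text block lazily, on the first non-empty delta.
        textOpen = true;
        yield { type: "content_block_start", index: 0, content_block: { type: "text", text: "" } };
      }
      yield { type: "content_block_delta", index: 0, delta: { type: "text_delta", text: choice.delta.content } };
    }
    if (choice?.finish_reason) finishReason = choice.finish_reason;
    // With include_usage, token counts arrive on a trailing chunk.
    if (chunk.usage) outputTokens = chunk.usage.completion_tokens ?? 0;
  }

  if (textOpen) yield { type: "content_block_stop", index: 0 };

  // Simplified mapping; the Stop Reason Mapping table lists the full set.
  const stopReason = finishReason === "length" ? "max_tokens" : "end_turn";
  yield {
    type: "message_delta",
    delta: { stop_reason: stopReason, stop_sequence: null },
    usage: { output_tokens: outputTokens },
  };
  yield { type: "message_stop" };
}
```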

### 7. HTTP Response
```
HTTP/1.1 200 OK
Content-Type: text/event-stream (streaming) or application/json (non-streaming)

event: message_start
data: {"type":"message_start","message":{...}}

event: content_block_start
data: {"type":"content_block_start",...}

event: content_block_delta
data: {"type":"content_block_delta",...}

event: content_block_stop
data: {"type":"content_block_stop",...}

event: message_delta
data: {"type":"message_delta",...}

event: message_stop
data: {"type":"message_stop"}
```
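
Each frame in the stream above follows the Server-Sent Events wire format: an `event:` line, a `data:` line, and a blank line as terminator. A minimal serializer, with the hypothetical name `formatSSE`, could look like this:

```javascript
// Serialize one Anthropic stream event as a Server-Sent Events frame.
function formatSSE(event) {
  return `event: ${event.type}\ndata: ${JSON.stringify(event)}\n\n`;
}
```

Because `JSON.stringify` never emits raw newlines, the payload always fits on a single `data:` line.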

## Tool Use Round-Trip (Special Case)

Complete workflow for tool execution:

### Step 1: Initial Request with Tools
```
Client sends:
{
  "messages": [{"role": "user", "content": "Search for X"}],
  "tools": [{"name": "search", "description": "...", "input_schema": {...}}]
}
```

### Step 2: Model Selects Tool
```
OpenAI responds:
{
  "choices": [{
    "message": {
      "content": null,
      "tool_calls": [{"id": "call_123", "type": "function", "function": {"name": "search", "arguments": "{...}"}}]
    },
    "finish_reason": "tool_calls"
  }]
}
```

### Step 3: Transform & Return to Client
```
Gateway converts:
{
  "content": [
    {"type": "tool_use", "id": "call_123", "name": "search", "input": {...}}
  ],
  "stop_reason": "tool_use"
}
```

### Step 4: Client Executes Tool and Responds
```
Client sends:
{
  "messages": [
    {"role": "user", "content": "Search for X"},
    {"role": "assistant", "content": [{"type": "tool_use", "id": "call_123", ...}]},
    {"role": "user", "content": [
      {"type": "tool_result", "tool_use_id": "call_123", "content": "Result: ..."}
    ]}
  ]
}
```

### Step 5: Transform & Forward to OpenAI
```
Gateway converts:
{
  "messages": [
    {"role": "user", "content": "Search for X"},
    {"role": "assistant", "content": null, "tool_calls": [...]},
    {"role": "tool", "tool_call_id": "call_123", "content": "Result: ..."}
  ]
}
```

### Step 6: Model Continues
OpenAI processes tool result and continues conversation.
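
The conversion in Step 5 can be sketched as below. The function name `toOpenAIMessages` is hypothetical, and the sketch covers only text, `tool_use`, and `tool_result` blocks (images and system messages are left out). Tool results are emitted before any user text from the same turn because OpenAI expects `tool` messages to follow the assistant message that issued the calls.

```javascript
// Convert Anthropic messages to OpenAI messages (text, tool_use, tool_result only).
function toOpenAIMessages(messages) {
  const out = [];
  for (const msg of messages) {
    if (typeof msg.content === "string") {
      out.push({ role: msg.role, content: msg.content });
      continue;
    }
    const text = msg.content.filter((b) => b.type === "text").map((b) => b.text).join("");

    if (msg.role === "assistant") {
      const toolCalls = msg.content
        .filter((b) => b.type === "tool_use")
        .map((b) => ({
          id: b.id,
          type: "function",
          // OpenAI expects arguments as a JSON string, not an object.
          function: { name: b.name, arguments: JSON.stringify(b.input ?? {}) },
        }));
      const converted = { role: "assistant", content: text || null };
      if (toolCalls.length) converted.tool_calls = toolCalls;
      out.push(converted);
      continue;
    }

    // User turn: each tool_result becomes its own `tool` message.
    for (const b of msg.content) {
      if (b.type !== "tool_result") continue;
      const content = typeof b.content === "string" ? b.content : JSON.stringify(b.content ?? "");
      out.push({ role: "tool", tool_call_id: b.tool_use_id, content });
    }
    if (text) out.push({ role: "user", content: text });
  }
  return out;
}
```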

## Stop Reason Mapping

| OpenAI `finish_reason` | Anthropic `stop_reason` | Notes |
|----------------------|----------------------|-------|
| `stop` | `end_turn` | Normal completion |
| `stop` (with stop_sequences) | `stop_sequence` | Hit a user-specified stop sequence (OpenAI reports `stop` for both cases) |
| `length` | `max_tokens` | Hit max_tokens limit |
| `tool_calls` | `tool_use` | Model selected a tool |
| `content_filter` | `end_turn` | Content filtered by safety filters |
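
The table translates into a small lookup. In this sketch the name `mapStopReason` is hypothetical, and two behaviors are assumptions: `stop` is treated as a stop-sequence hit whenever the request supplied `stop_sequences` (one reading of the second row), and unknown finish reasons fall back to `end_turn`.

```javascript
const STOP_REASONS = {
  stop: "end_turn",
  length: "max_tokens",
  tool_calls: "tool_use",
  content_filter: "end_turn",
};

// Map an OpenAI finish_reason to an Anthropic stop_reason.
// `request` is the original Anthropic request body.
function mapStopReason(finishReason, request = {}) {
  // Assumption: any `stop` on a request with stop_sequences counts as a sequence hit.
  if (finishReason === "stop" && request.stop_sequences?.length) return "stop_sequence";
  return STOP_REASONS[finishReason] ?? "end_turn";
}
```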

## Data Structures

### Request Object (Anthropic format)
```javascript
{
  model: string,
  messages: [{
    role: "user" | "assistant",
    content: string | [{
      type: "text" | "image" | "tool_use" | "tool_result",
      text?: string,
      source?: {type: "base64" | "url", media_type?: string, data?: string, url?: string},
      id?: string,
      name?: string,
      input?: object,
      tool_use_id?: string,
      is_error?: boolean
    }]
  }],
  system?: string | [{type: "text", text: string}],
  tools?: [{
    name: string,
    description: string,
    input_schema: object
  }],
  tool_choice?: {type: "auto" | "any" | "none" | "tool", name?: string},
  max_tokens: number,
  temperature?: number,
  top_p?: number,
  stop_sequences?: string[],
  stream?: boolean
}
```

### Response Object (Anthropic format)
```javascript
{
  id: string,
  type: "message",
  role: "assistant",
  content: [{
    type: "text" | "tool_use",
    text?: string,
    id?: string,
    name?: string,
    input?: object
  }],
  model: string,
  stop_reason: "end_turn" | "max_tokens" | "stop_sequence" | "tool_use",
  usage: {
    input_tokens: number,
    output_tokens: number
  }
}
```

## Deployment Topology

### Single-Instance Deployment (Typical)
```
                    ┌─────────────────────┐
                    │   Claude Code       │
                    │   (Claude IDE)      │
                    └──────────┬──────────┘
                               │ HTTP/HTTPS
                               ▼
                    ┌─────────────────────┐
                    │  Claude Central     │
                    │  Gateway (Vercel)   │
                    │  ┌────────────────┐ │
                    │  │ Auth           │ │
                    │  │ Transform Req  │ │
                    │  │ Transform Resp │ │
                    │  └────────────────┘ │
                    └──────────┬──────────┘
                               │ HTTP/HTTPS
                               ▼
                    ┌─────────────────────┐
                    │   OpenAI API        │
                    │   chat/completions  │
                    └─────────────────────┘
```

### Multi-Instance Deployment (Stateless)
Multiple gateway instances can run independently. Requests are distributed via:
- Load balancer (Vercel built-in, Cloudflare routing)
- Client-side retry on failure

Each instance:
- Shares same `GATEWAY_TOKEN` for authentication
- Shares same `MODEL_MAP` for consistent routing
- Connects independently to OpenAI

No coordination required between instances.

## Scalability Characteristics

### Horizontal Scaling
- ✅ Fully stateless: Add more instances without coordination
- ✅ No shared state: Each instance owns only active requests
- ✅ Database-free: No bottleneck or single point of failure

### Rate Limiting
- ⚠️ Currently none: Single token shared across all users
- Recommendation: Implement per-token or per-IP rate limiting if needed

### Performance
- Latency: ~50-200ms overhead per request (serialization + HTTP)
- Throughput: Limited by OpenAI API tier, not gateway capacity
- Memory: ~20MB per instance (Hono + dependencies)

## Error Handling Architecture

### Authentication Errors
```
Client → Gateway (missing/invalid token)
         └→ Return 401 with error details
            No API call made
```

### Transform Errors
```
Client → Gateway → Transform fails (malformed request)
                   └→ Return 400 Bad Request
                      No API call made
```

### OpenAI API Errors
```
Client → Gateway → OpenAI API returns error
                   └→ Convert to Anthropic error format
                      └→ Return to client
```
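
Converting an upstream error could look like the sketch below. The Anthropic error envelope (`type: "error"` wrapping an `error` object with `type` and `message`) is the documented format; the function name `toAnthropicError`, the status-to-type table, and the fallback message are assumptions of this example rather than the gateway's actual behavior.

```javascript
// Assumed mapping from upstream HTTP status to Anthropic error type.
const ERROR_TYPES = {
  400: "invalid_request_error",
  401: "authentication_error",
  403: "permission_error",
  404: "not_found_error",
  429: "rate_limit_error",
};

// Wrap an OpenAI error body in the Anthropic error envelope.
function toAnthropicError(status, openaiBody) {
  return {
    type: "error",
    error: {
      type: ERROR_TYPES[status] ?? "api_error",
      message: openaiBody?.error?.message ?? "Upstream request failed",
    },
  };
}
```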

### Network Errors
```
Client → Gateway → OpenAI unreachable
                   └→ Timeout or connection error
                      └→ Return 500 Internal Server Error
```

## Security Model

### Authentication
- **Method**: Single shared token (`GATEWAY_TOKEN`)
- **Comparison**: Constant-time (`timingSafeEqual()`), so response timing does not reveal how much of a guessed token matched
- **Suitable for**: Personal use, small teams with trusted members
- **Not suitable for**: Multi-tenant, public access, high-security requirements

### Token Locations
- Client stores in `ANTHROPIC_AUTH_TOKEN` environment variable
- Server validates against `GATEWAY_TOKEN` environment variable
- Never logged or exposed in error messages

### Recommendations for Production
1. Use strong, randomly generated token (32+ characters)
2. Rotate token periodically
3. Use HTTPS only (Vercel provides free HTTPS)
4. Consider rate limiting by IP if exposed to untrusted networks
5. Monitor token usage logs for suspicious patterns
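
For recommendation 1, one way to generate a suitable token is Node's built-in `crypto.randomBytes()`. The function name `generateGatewayToken` is hypothetical, and hex encoding is just one convenient choice.

```javascript
const { randomBytes } = require("node:crypto");

// 32 random bytes encoded as 64 hex characters, above the 32-character minimum.
function generateGatewayToken() {
  return randomBytes(32).toString("hex");
}

console.log(generateGatewayToken());
```

The printed value would be set as `GATEWAY_TOKEN` on the server and as `ANTHROPIC_AUTH_TOKEN` on the client.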

## Monitoring & Observability

### Built-in Logging
- Hono logger middleware logs all requests (method, path, status, latency)
- Errors logged to console with stack traces

### Recommended Additions
- Request/response body logging (for debugging, exclude in production)
- Token usage tracking (prompt/completion tokens)
- API error rate monitoring
- Latency percentiles (p50, p95, p99)
- OpenAI API quota tracking

## Future Architecture Considerations

### Potential Enhancements
1. **Per-user authentication**: Issue a separate API key to each user
2. **Request routing**: Route based on model, user, or other properties
3. **Response caching**: Cache repeated identical requests
4. **Rate limiting**: Token bucket or sliding window per client
5. **Webhook logging**: Send detailed logs to external system
6. **Provider abstraction**: Support multiple backends (Google, Anthropic, etc.)

### Current Constraints Preventing Enhancement
- Single-token auth: No per-user isolation
- No persistent state: Cannot track usage per user
- Stateless design: Cannot implement caching or rate limiting without adding shared storage
- Simple model mapping: Routes by a static name lookup only

These are intentional trade-offs prioritizing simplicity over flexibility.