* attempt to implement the passthrough feature
* Formatting and small change
* Fix formatting
* Format test file
---------
Co-authored-by: Xiaohan Fu <xiaohan@grayswan.ai>
* fix(router): use cacheable prefix for prompt caching cache keys
Fix an issue where requests with the same cacheable prefix but different
user messages were routed to different deployments, preventing cached token
reuse. The cache key now correctly includes only the cacheable prefix
(up to and including the last cache_control block) instead of the
entire messages array.
## New Functions
### extract_cacheable_prefix()
Static method that extracts the cacheable prefix from messages for
prompt caching. The cacheable prefix is defined as everything UP TO
AND INCLUDING the LAST content block (across all messages) that has
cache_control with type "ephemeral". This includes ALL blocks
before the last cacheable block (even if they don't have cache_control
themselves).
- Finds the last content block with cache_control across all messages
- Returns all messages and content blocks up to and including that
last cacheable block
- Excludes everything after the last cacheable block (including user
messages that come after)
- Returns empty list if no cacheable blocks are found
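
The rules above can be sketched as follows. This is a minimal illustration, not the router's actual code: it assumes OpenAI-style message dicts whose `content` is either a string or a list of content-block dicts, and the function body is an assumption reconstructed from the description.

```python
from typing import Any, Dict, List


def extract_cacheable_prefix(messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return everything up to and including the last ephemeral cache_control block."""
    last_msg_idx = None
    last_block_idx = None
    for msg_idx, message in enumerate(messages):
        content = message.get("content")
        if not isinstance(content, list):
            continue  # plain-string content carries no cache_control blocks
        for block_idx, block in enumerate(content):
            cache_control = block.get("cache_control") if isinstance(block, dict) else None
            if isinstance(cache_control, dict) and cache_control.get("type") == "ephemeral":
                # Keep overwriting so we end up with the LAST cacheable block.
                last_msg_idx, last_block_idx = msg_idx, block_idx

    if last_msg_idx is None:
        return []  # no cacheable blocks found

    # All earlier messages are kept whole; the final one is truncated
    # right after the last cacheable block.
    prefix = [dict(m) for m in messages[:last_msg_idx]]
    final = dict(messages[last_msg_idx])
    final["content"] = final["content"][: last_block_idx + 1]
    prefix.append(final)
    return prefix
```

Two requests that share the same system blocks but end in different user messages yield the same prefix under this sketch, which is the property the fix relies on.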
## Changed Functions
### get_prompt_caching_cache_key()
Modified to use the cacheable prefix instead of the full messages array
when generating cache keys. This ensures that requests with the same
cacheable prefix but different user messages generate the same cache
key, enabling proper routing to the same deployment.
- Now calls extract_cacheable_prefix() to get only cacheable content
- Returns None if no cacheable prefix is found (can't generate key)
- Cache key is now based on cacheable prefix only, not full messages
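
A hedged sketch of how a key could be derived from the prefix alone. The helper name `cache_key_from_prefix`, the SHA-256 hashing, and the `prompt_cache:` key format are all assumptions for illustration; the commit does not show the real key format.

```python
import hashlib
import json
from typing import Any, Dict, List, Optional


def cache_key_from_prefix(prefix: List[Dict[str, Any]]) -> Optional[str]:
    """Hypothetical helper: hash a cacheable prefix into a stable key."""
    if not prefix:
        return None  # no cacheable prefix, so no key can be generated
    # sort_keys makes the serialization independent of dict insertion order.
    serialized = json.dumps(prefix, sort_keys=True)
    digest = hashlib.sha256(serialized.encode("utf-8")).hexdigest()
    return f"prompt_cache:{digest}"
```

Because the trailing user message never enters the hash, requests that differ only in that message map to the same key.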
### async_get_model_id()
Completely refactored to use the cacheable prefix directly instead of
the previous workaround that checked progressively shorter message
slices. The previous implementation was inefficient and unreliable.
- Removed progressive message slicing logic (messages[:-1], messages[:-2], etc.)
- Now uses single direct cache lookup with cacheable prefix-based key
- More efficient (1 lookup instead of up to 4)
- More reliable (uses correct cache key based on cacheable prefix)
- Returns None if no cacheable prefix found
### add_model_id()
Added None check for cache_key to prevent caching when no cacheable
prefix is found. This ensures we don't attempt to cache when there's
no meaningful cache key to use.
- Added guard: returns early if cache_key is None
- Prevents attempting to cache when no cacheable prefix exists
### async_add_model_id()
Added None check for cache_key to prevent caching when no cacheable
prefix is found. Matches the behavior of add_model_id() for consistency.
- Added guard: returns early if cache_key is None
- Prevents attempting to cache when no cacheable prefix exists
### get_model_id()
Added None check for cache_key to handle cases where no cacheable
prefix is found. Ensures consistent behavior across all cache methods.
- Added guard: returns None if cache_key is None
- Prevents calling get_cache() with None key
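
The None guards described for add_model_id(), async_add_model_id(), and get_model_id() follow one pattern, sketched here against an in-memory dict. `ModelIdCacheSketch` is a hypothetical stand-in, not the router's cache class, and the method signatures are assumptions.

```python
from typing import Dict, Optional


class ModelIdCacheSketch:
    """Hypothetical stand-in for the router's prompt-caching cache."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def add_model_id(self, cache_key: Optional[str], model_id: str) -> None:
        if cache_key is None:
            return  # nothing meaningful to key on, so skip caching
        self._store[cache_key] = model_id

    def get_model_id(self, cache_key: Optional[str]) -> Optional[str]:
        if cache_key is None:
            return None  # never perform a lookup with a None key
        return self._store.get(cache_key)
```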
## Test
### test_router_prompt_caching_same_cacheable_prefix_routes_to_same_deployment()
New end-to-end test that validates the fix. Tests that requests with
the same cacheable prefix (system blocks with cache_control) but
different user messages:
1. Generate the same cache key
2. Successfully perform cache lookup
3. Route to the same deployment
This test reproduces the exact scenario from the user's bug report
where three requests with different user messages should route to the
same deployment but were previously routing to different ones.
Fixes issue where cached tokens couldn't be reused because requests
were routed to different providers due to different cache keys.
* fix(router): use cast() for proper type handling in extract_cacheable_prefix
Replace the type annotation and its `type: ignore` comment with a proper
cast() from the typing module, matching the pattern used throughout the
codebase for creating modified AllMessageValues dictionaries.
Previously, get_ssl_configuration() created a new SSL context on every
call, even when the configuration was identical. This caused continuous
memory allocation from ssl.create_default_context(), especially during:
- Proxy server startup
- Background health checks
- HTTP client creation
Solution:
- Added _ssl_context_cache to cache SSL contexts by configuration
parameters (cafile, ssl_security_level, ssl_ecdh_curve)
- Refactored SSL context creation into _create_ssl_context() helper
- Modified get_ssl_configuration() to reuse cached contexts when
configuration matches
This significantly reduces memory allocation while maintaining backward
compatibility. SSL contexts are now reused instead of being recreated
repeatedly, eliminating the memory leak observed in memray profiling.
Fixes memory allocation issue where create_default_context was allocating
6.282MB+ continuously even without any requests.
This fix addresses the same issue that was resolved for OpenAI video in PR #16708.
The GeminiVideoConfig class was importing BaseVideoConfig only within TYPE_CHECKING,
causing it to be 'Any' at runtime. This prevented the async_transform_video_content_response
method from being available during video content downloads.
Changes:
- Moved BaseVideoConfig import from TYPE_CHECKING to top-level imports
- Added test_gemini_video_config_has_async_transform() to verify the fix
- Ensures GeminiVideoConfig properly inherits BaseVideoConfig at runtime
Fixes video generation errors for Gemini Veo models:
'GeminiVideoConfig' object has no attribute 'async_transform_video_content_response'
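
A rough sketch of the kind of check test_gemini_video_config_has_async_transform() could perform. The two classes are stand-ins defined here for illustration only, and `has_async_transform` is a hypothetical helper; the real test's contents are not shown in this commit.

```python
import inspect


class BaseVideoConfigSketch:
    """Stand-in for the real base class, imported at top level after the fix."""

    async def async_transform_video_content_response(self, raw: bytes) -> bytes:
        return raw


class GeminiVideoConfigSketch(BaseVideoConfigSketch):
    """Stand-in subclass that inherits the async method at runtime."""


def has_async_transform(config_cls: type) -> bool:
    # True only if the attribute exists and is a coroutine function.
    method = getattr(config_cls, "async_transform_video_content_response", None)
    return inspect.iscoroutinefunction(method)
```

With a base class that only exists under TYPE_CHECKING, the inherited method would be missing at runtime and a check like this would fail.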
Add gemini-3-pro-image-preview model configuration for Google's new
image generation model (aka "Nano Banana Pro 🍌").
Model details:
- Input: $2.00/1M tokens (text), $0.0011/image
- Output: $12.00/1M tokens (text), $0.134/image (1K/2K)
- Context: 65k input / 32k output tokens
- Capabilities: structured outputs, web search, caching, thinking
- No function calling support
- Available on both Gemini API and Vertex AI
Added variants:
- gemini-3-pro-image-preview (base, uses Vertex AI)
- gemini/gemini-3-pro-image-preview (Gemini API)
- vertex_ai/gemini-3-pro-image-preview (Vertex AI)
Source: https://ai.google.dev/gemini-api/docs/pricing
Fixes: #16925
Change model identifier from cerebras/openai/gpt-oss-120b to
cerebras/gpt-oss-120b to match Cerebras API requirements.
The Cerebras API only accepts 'gpt-oss-120b' as the model ID, not
'openai/gpt-oss-120b'. The previous name was causing "Model does not
exist" errors when users tried to use it.
Tested with real API calls to confirm:
- cerebras/gpt-oss-120b → sends 'gpt-oss-120b' → ✅ works
- cerebras/openai/gpt-oss-120b → sends 'openai/gpt-oss-120b' → ❌ fails
Fixes #16924
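
The behavior above is consistent with splitting the model string on its first slash only, so everything after the provider prefix is sent as the model ID. A small sketch under that assumption (`split_provider` is a hypothetical helper, not the library's function):

```python
from typing import Tuple


def split_provider(model: str) -> Tuple[str, str]:
    """Split 'provider/model-id' on the first '/' only."""
    provider, _, model_id = model.partition("/")
    return provider, model_id
```

Under this sketch, `cerebras/gpt-oss-120b` yields the model ID `gpt-oss-120b`, while `cerebras/openai/gpt-oss-120b` yields `openai/gpt-oss-120b`, the ID the API rejected.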
* add _get_prompt_data_from_dotprompt_content
* fix pre call hook for prompt template
* fix: get_latest_version_prompt_id
* fix get_latest_version_prompt_id
* test_get_latest_version_prompt_id
* fix info and delete lookup for prompts
* refactor prompt table
* rename to prompt studio
* fix get_prompt_info
* fix endpoints
* add PromptCodeSnippets
* prompt info view
* add prompt info view
* show correct version for prompts
* fix version selector
* fix endpoints and version
* fix get_prompt_info
* fix version display
* Attempt CI/CD Fix
* Adding test for coverage
* Adding max depth to copilot and vertex
* Fixing mypy lint and docker database
* Fixing UI build issues
* Update playwright test
* thought signature tool call id
* [stripe] refactor and tests
* [stripe] remove md and move to factory
* [stripe] remove redundant test
* [stripe] ran black formatting
* [stripe] add thought signature docs
* [stripe] remove unused import