litellm/litellm-proxy-extras/litellm_proxy_extras/utils.py
Krish Dholakia b96f033c90 fix: prisma migrate deploy failures on pre-existing instances (#23655)
* fix: prisma migrate deploy failures on pre-existing instances

Fixes failed migrations due to idempotent schema changes on pre-existing litellm instances.

Problems:
1. P3018 recovery handler never returned True on successful resolution, causing "Database setup failed after multiple retries" even when the final recovery succeeded
2. _roll_back_migration exceptions escaped the P3018 handler, preventing _resolve_specific_migration from running
3. Migration SQL used ADD COLUMN/DROP COLUMN without IF [NOT] EXISTS, failing if schema was already modified

Changes:
- Add return True after successful P3018 idempotent error recovery
- Wrap _roll_back_migration in try/except to allow recovery continuation even if rollback fails
- Make migration.sql idempotent with IF NOT EXISTS / IF EXISTS clauses

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

* test: add migration SQL idempotency safety tests

Adds TestMigrationSQLIdempotency test class that statically validates all
migration SQL files created after 2026-03-11 use idempotent DDL:
- ADD COLUMN must use IF NOT EXISTS
- DROP COLUMN must use IF EXISTS
- DROP INDEX must use IF EXISTS
- CREATE INDEX must use IF NOT EXISTS

This prevents the class of errors where prisma migrate deploy fails on
pre-existing instances because the schema was already modified.
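
A minimal sketch of what such a static check could look like. The names `UNGUARDED_DDL` and `find_unguarded_ddl` are hypothetical and are not taken from the actual TestMigrationSQLIdempotency class; the regexes only cover the four statement types listed above and would flag variants such as CREATE INDEX CONCURRENTLY IF NOT EXISTS.

```python
import re

# Hypothetical patterns: each matches a DDL keyword that is NOT
# immediately followed by its IF [NOT] EXISTS guard.
UNGUARDED_DDL = {
    "ADD COLUMN": re.compile(
        r"\bADD\s+COLUMN\b(?!\s+IF\s+NOT\s+EXISTS\b)", re.IGNORECASE
    ),
    "DROP COLUMN": re.compile(
        r"\bDROP\s+COLUMN\b(?!\s+IF\s+EXISTS\b)", re.IGNORECASE
    ),
    "DROP INDEX": re.compile(
        r"\bDROP\s+INDEX\b(?!\s+IF\s+EXISTS\b)", re.IGNORECASE
    ),
    "CREATE INDEX": re.compile(
        r"\bCREATE\s+(?:UNIQUE\s+)?INDEX\b(?!\s+IF\s+NOT\s+EXISTS\b)",
        re.IGNORECASE,
    ),
}


def find_unguarded_ddl(sql: str) -> list:
    """Return the DDL statement types in `sql` that lack an existence guard."""
    return [name for name, pattern in UNGUARDED_DDL.items() if pattern.search(sql)]
```

A test could then read each migration.sql file and assert that `find_unguarded_ddl` returns an empty list for it.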

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: also catch TimeoutExpired in P3018 rollback handler

_roll_back_migration uses subprocess.run with timeout=60, so it can raise
subprocess.TimeoutExpired in addition to CalledProcessError. Without
catching this, a slow database during rollback would escape the handler
and bypass _resolve_specific_migration — the same class of bug.
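
The pattern can be sketched in isolation as follows. `run_best_effort` is a hypothetical helper used only for illustration; the real handler calls _roll_back_migration directly inside a try/except.

```python
import subprocess


def run_best_effort(cmd: list, timeout: float) -> bool:
    """Run `cmd`; return True on success, False if it fails or times out.

    subprocess.run with check=True raises CalledProcessError on a non-zero
    exit code, and raises TimeoutExpired when `timeout` elapses. Catching
    both lets the caller continue with the next recovery step either way.
    """
    try:
        subprocess.run(cmd, timeout=timeout, check=True, capture_output=True)
        return True
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as err:
        print(f"step failed, continuing with recovery: {err}")
        return False
```

Catching only CalledProcessError would let a timeout propagate past the code that follows the call, which is the behavior this commit removes.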

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: make all 85 migration SQL files idempotent, remove test cutoff

Fixed all existing migration files to use IF [NOT] EXISTS for DDL
statements (ADD COLUMN, DROP COLUMN, DROP INDEX, CREATE INDEX).
Removed the date cutoff from the idempotency tests so they now
validate all migrations, not just recent ones.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: make migration failure non-fatal by default, add --require_db_migration flag

By default the proxy now warns and continues when database migration
fails. Pass --require_db_migration (or set REQUIRE_DB_MIGRATION=true)
to restore the previous behavior of exiting with an error.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: wrap _resolve_specific_migration in try/except, guard RENAME COLUMN and ADD CONSTRAINT

Three fixes:

1. _resolve_specific_migration in the P3018 handler was not wrapped in
   try/except, so failures there would bypass the return True and
   propagate unexpectedly — partially defeating the rollback fix.

2. Bare RENAME COLUMN in 20260303000000_update_tool_table_policies was
   non-idempotent. Wrapped in DO $$ IF EXISTS block. Also wrapped all
   28 bare ADD CONSTRAINT statements across 9 migration files in
   DO $$ IF NOT EXISTS (pg_constraint) blocks.

3. Added test_rename_column_is_guarded and test_add_constraint_is_guarded
   to TestMigrationSQLIdempotency for full DDL coverage.
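
For reference, the guarded ADD CONSTRAINT shape described in item 2 can be sketched as a small SQL builder. `guarded_add_constraint` is a hypothetical helper, not part of the package, and the exact SQL in the migration files may differ; the lookup matches on constraint name only and assumes trusted identifiers.

```python
def guarded_add_constraint(table: str, constraint: str, definition: str) -> str:
    """Build a PostgreSQL DO block that adds a constraint only if it is missing.

    Illustration only: identifiers are interpolated without escaping, so the
    inputs are assumed to come from trusted migration code.
    """
    return (
        "DO $$\n"
        "BEGIN\n"
        "    IF NOT EXISTS ("
        f"SELECT 1 FROM pg_constraint WHERE conname = '{constraint}'"
        ") THEN\n"
        f'        ALTER TABLE "{table}" ADD CONSTRAINT "{constraint}" {definition};\n'
        "    END IF;\n"
        "END $$;"
    )


# Example with made-up table and constraint names.
print(
    guarded_add_constraint(
        "Child", "Child_parent_fkey", 'FOREIGN KEY ("parent_id") REFERENCES "Parent"("id")'
    )
)
```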

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: retry after resolving idempotent migration, guard DROP CONSTRAINT

Three fixes:

1. Both P3009 and P3018 idempotent handlers returned True after
   resolving a single migration, exiting before remaining pending
   migrations were applied. Now they continue the retry loop so
   prisma migrate deploy runs again for any remaining migrations.

2. Two migration files had bare DROP CONSTRAINT without a DO $$ IF
   EXISTS guard, which fails if the constraint was already dropped.
   Wrapped both in idempotent DO $$ blocks.

3. Added test_drop_constraint_is_guarded to catch unguarded DROP
   CONSTRAINT in future migrations.
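
The control-flow change in item 1 can be sketched with stand-in callables. `deploy_with_recovery`, `deploy`, and `resolve` are hypothetical names; in utils.py the same idea is expressed by the retry loop in setup_database falling through to the next attempt instead of returning.

```python
from typing import Callable, Optional


def deploy_with_recovery(
    deploy: Callable[[], Optional[str]],
    resolve: Callable[[str], None],
    max_attempts: int = 4,
) -> bool:
    """Retry `deploy` until it succeeds or attempts run out.

    `deploy` returns None on success, or the name of the migration that
    failed with an idempotent error. After resolving that migration the loop
    continues, so the remaining pending migrations are applied too.
    """
    for _ in range(max_attempts):
        failed = deploy()
        if failed is None:
            return True
        resolve(failed)  # mark as applied, then deploy again
    return False
```

Returning True right after `resolve(failed)` would report success while later migrations were still pending, which is the bug described above.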

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: P3009 try/except, CREATE TABLE IF NOT EXISTS, restore fail-fast default

Three fixes:

1. P3009 idempotent handler now has the same try/except around
   _roll_back_migration and _resolve_specific_migration as the P3018
   handler. Previously a rollback or resolve failure in the P3009 path
   would propagate and leave the migration unresolved.

2. Added IF NOT EXISTS to all 57 bare CREATE TABLE statements across
   34 migration files. Added test_create_table_uses_if_not_exists to
   catch this pattern.

3. Reverted the backwards-incompatible default behavior change: the
   proxy now fails fast on migration failure (original behavior).
   Added --skip_db_migration_check / SKIP_DB_MIGRATION_CHECK to
   opt into warn-and-continue instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Haiku 4.5 <noreply@anthropic.com>
2026-03-14 16:54:21 -07:00


import glob
import os
import random
import re
import shutil
import subprocess
import time
from datetime import datetime
from pathlib import Path
from typing import Optional
from litellm_proxy_extras._logging import logger


def str_to_bool(value: Optional[str]) -> bool:
    if value is None:
        return False
    return value.lower() in ("true", "1", "t", "y", "yes")


def _get_prisma_env() -> dict:
    """Get environment variables for Prisma, handling offline mode if configured."""
    prisma_env = os.environ.copy()
    if str_to_bool(os.getenv("PRISMA_OFFLINE_MODE")):
        # These env vars prevent Prisma from attempting downloads
        prisma_env["NPM_CONFIG_PREFER_OFFLINE"] = "true"
        prisma_env["NPM_CONFIG_CACHE"] = os.getenv(
            "NPM_CONFIG_CACHE", "/app/.cache/npm"
        )
    return prisma_env


def _get_prisma_command() -> str:
    """Get the Prisma command to use, bypassing Python wrapper in offline mode."""
    if str_to_bool(os.getenv("PRISMA_OFFLINE_MODE")):
        # Primary location where Prisma Python package installs the CLI
        default_cli_path = "/app/.cache/prisma-python/binaries/node_modules/.bin/prisma"
        # Check if custom path is provided (for flexibility)
        custom_cli_path = os.getenv("PRISMA_CLI_PATH")
        if custom_cli_path and os.path.exists(custom_cli_path):
            logger.info(f"Using custom Prisma CLI at {custom_cli_path}")
            return custom_cli_path
        # Check the default location
        if os.path.exists(default_cli_path):
            logger.info(f"Using cached Prisma CLI at {default_cli_path}")
            return default_cli_path
        # If not found, log warning and fall back
        logger.warning(
            f"Prisma CLI not found at {default_cli_path}. "
            "Falling back to Python wrapper (may attempt downloads)"
        )
    # Fall back to the Python wrapper (will work in online mode)
    return "prisma"


class ProxyExtrasDBManager:
    @staticmethod
    def _get_prisma_dir() -> str:
        """
        Get the path to the migrations directory
        Set os.environ["LITELLM_MIGRATION_DIR"] to a custom migrations directory, to support baselining db in read-only fs.
        """
        custom_migrations_dir = os.getenv("LITELLM_MIGRATION_DIR")
        pkg_migrations_dir = os.path.dirname(__file__)
        if custom_migrations_dir:
            # If migrations_dir exists, copy contents
            if os.path.exists(custom_migrations_dir):
                # Copy contents instead of directory itself
                for item in os.listdir(pkg_migrations_dir):
                    src_path = os.path.join(pkg_migrations_dir, item)
                    dst_path = os.path.join(custom_migrations_dir, item)
                    if os.path.isdir(src_path):
                        shutil.copytree(src_path, dst_path, dirs_exist_ok=True)
                    else:
                        shutil.copy2(src_path, dst_path)
            else:
                # If directory doesn't exist, create it and copy everything
                shutil.copytree(pkg_migrations_dir, custom_migrations_dir)
            return custom_migrations_dir
        return pkg_migrations_dir

    @staticmethod
    def _create_baseline_migration(schema_path: str) -> bool:
        """Create a baseline migration for an existing database"""
        prisma_dir = ProxyExtrasDBManager._get_prisma_dir()
        prisma_dir_path = Path(prisma_dir)
        init_dir = prisma_dir_path / "migrations" / "0_init"
        # Create migrations/0_init directory
        init_dir.mkdir(parents=True, exist_ok=True)
        database_url = os.getenv("DATABASE_URL")
        if not database_url:
            logger.error("DATABASE_URL not set")
            return False
        # Set up environment for offline mode if configured
        prisma_env = _get_prisma_env()
        try:
            # 1. Generate migration SQL file by comparing empty state to current db state
            logger.info("Generating baseline migration...")
            migration_file = init_dir / "migration.sql"
            # Context manager ensures the file is flushed and closed
            with open(migration_file, "w") as f:
                subprocess.run(
                    [
                        _get_prisma_command(),
                        "migrate",
                        "diff",
                        "--from-empty",
                        "--to-url",
                        database_url,
                        "--script",
                    ],
                    stdout=f,
                    check=True,
                    timeout=30,
                    env=prisma_env,
                )
            # 2. Mark the migration as applied since it represents current state
            logger.info("Marking baseline migration as applied...")
            subprocess.run(
                [
                    _get_prisma_command(),
                    "migrate",
                    "resolve",
                    "--applied",
                    "0_init",
                ],
                check=True,
                timeout=30,
                env=prisma_env,
            )
            return True
        except subprocess.TimeoutExpired:
            logger.warning(
                "Migration timed out - the database might be under heavy load."
            )
            return False
        except subprocess.CalledProcessError as e:
            logger.warning(
                f"Error creating baseline migration: {e}, {e.stderr}, {e.stdout}"
            )
            raise

    @staticmethod
    def _get_migration_names(migrations_dir: str) -> list:
        """Get all migration directory names from the migrations folder"""
        migration_paths = glob.glob(f"{migrations_dir}/migrations/*/migration.sql")
        logger.info(f"Found {len(migration_paths)} migrations at {migrations_dir}")
        return [Path(p).parent.name for p in migration_paths]

    @staticmethod
    def _roll_back_migration(migration_name: str):
        """Mark a specific migration as rolled back"""
        # Set up environment for offline mode if configured
        prisma_env = _get_prisma_env()
        subprocess.run(
            [
                _get_prisma_command(),
                "migrate",
                "resolve",
                "--rolled-back",
                migration_name,
            ],
            timeout=60,
            check=True,
            capture_output=True,
            env=prisma_env,
        )

    @staticmethod
    def _resolve_specific_migration(migration_name: str):
        """Mark a specific migration as applied"""
        prisma_env = _get_prisma_env()
        subprocess.run(
            [_get_prisma_command(), "migrate", "resolve", "--applied", migration_name],
            timeout=60,
            check=True,
            capture_output=True,
            env=prisma_env,
        )

    @staticmethod
    def _is_permission_error(error_message: str) -> bool:
        """
        Check if the error message indicates a database permission error.

        Permission errors should NOT be marked as applied, as the migration
        did not actually execute successfully.

        Args:
            error_message: The error message from Prisma migrate

        Returns:
            bool: True if this is a permission error, False otherwise
        """
        permission_patterns = [
            r"Database error code: 42501",  # PostgreSQL insufficient privilege
            r"must be owner of table",
            r"permission denied for schema",
            r"permission denied for table",
            r"must be owner of schema",
        ]
        for pattern in permission_patterns:
            if re.search(pattern, error_message, re.IGNORECASE):
                return True
        return False

    @staticmethod
    def _is_idempotent_error(error_message: str) -> bool:
        """
        Check if the error message indicates an idempotent operation error.

        Idempotent errors (like "column already exists") mean the migration
        has effectively already been applied, so it's safe to mark as applied.

        Args:
            error_message: The error message from Prisma migrate

        Returns:
            bool: True if this is an idempotent error, False otherwise
        """
        idempotent_patterns = [
            r"already exists",
            r"column .* already exists",
            r"duplicate key value violates",
            r"relation .* already exists",
            r"constraint .* already exists",
            r"does not exist",
            r"Can't drop database.* because it doesn't exist",
        ]
        for pattern in idempotent_patterns:
            if re.search(pattern, error_message, re.IGNORECASE):
                return True
        return False

    @staticmethod
    def _resolve_all_migrations(
        migrations_dir: str, schema_path: str, mark_all_applied: bool = True
    ):
        """
        1. Compare the current database state to schema.prisma and generate a migration for the diff.
        2. Run prisma db execute to apply the generated diff.
        3. If mark_all_applied is True, mark all existing migrations as applied.
        """
        database_url = os.getenv("DATABASE_URL")
        if not database_url:
            logger.error("DATABASE_URL not set")
            return
        diff_dir = (
            Path(migrations_dir)
            / "migrations"
            / f"{datetime.now().strftime('%Y%m%d%H%M%S')}_baseline_diff"
        )
        try:
            diff_dir.mkdir(parents=True, exist_ok=True)
        except Exception as e:
            if "Permission denied" in str(e):
                logger.warning(
                    f"Permission denied - {e}\nunable to baseline db. Set LITELLM_MIGRATION_DIR environment variable to a writable directory to enable migrations."
                )
                return
            raise
        diff_sql_path = diff_dir / "migration.sql"
        # 1. Generate migration SQL for the diff between DB and schema
        try:
            logger.info("Generating migration diff between DB and schema.prisma...")
            with open(diff_sql_path, "w") as f:
                subprocess.run(
                    [
                        _get_prisma_command(),
                        "migrate",
                        "diff",
                        "--from-url",
                        database_url,
                        "--to-schema-datamodel",
                        schema_path,
                        "--script",
                    ],
                    check=True,
                    timeout=60,
                    stdout=f,
                    env=_get_prisma_env(),
                )
        except subprocess.CalledProcessError as e:
            logger.warning(f"Failed to generate migration diff: {e.stderr}")
        except subprocess.TimeoutExpired:
            logger.warning("Migration diff generation timed out.")
        # check if the migration was created
        if not diff_sql_path.exists():
            logger.warning("Migration diff was not created")
            return
        logger.info(f"Migration diff created at {diff_sql_path}")
        # 2. Run prisma db execute to apply the migration
        try:
            logger.info("Running prisma db execute to apply the migration diff...")
            result = subprocess.run(
                [
                    _get_prisma_command(),
                    "db",
                    "execute",
                    "--file",
                    str(diff_sql_path),
                    "--schema",
                    schema_path,
                ],
                timeout=60,
                check=True,
                capture_output=True,
                text=True,
                env=_get_prisma_env(),
            )
            logger.info(f"prisma db execute stdout: {result.stdout}")
            logger.info("✅ Migration diff applied successfully")
        except subprocess.CalledProcessError as e:
            logger.warning(f"Failed to apply migration diff: {e.stderr}")
        except subprocess.TimeoutExpired:
            logger.warning("Migration diff application timed out.")
        # 3. Mark all migrations as applied
        if not mark_all_applied:
            return
        migration_names = ProxyExtrasDBManager._get_migration_names(migrations_dir)
        logger.info(f"Resolving {len(migration_names)} migrations")
        for migration_name in migration_names:
            try:
                logger.info(f"Resolving migration: {migration_name}")
                subprocess.run(
                    [
                        _get_prisma_command(),
                        "migrate",
                        "resolve",
                        "--applied",
                        migration_name,
                    ],
                    timeout=60,
                    check=True,
                    capture_output=True,
                    text=True,
                    env=_get_prisma_env(),
                )
                logger.debug(f"Resolved migration: {migration_name}")
            except subprocess.CalledProcessError as e:
                if "is already recorded as applied in the database." not in e.stderr:
                    logger.warning(
                        f"Failed to resolve migration {migration_name}: {e.stderr}"
                    )

    @staticmethod
    def setup_database(use_migrate: bool = False) -> bool:
        """
        Set up the database using either prisma migrate or prisma db push.
        Uses migrations from the litellm-proxy-extras package.

        Args:
            use_migrate (bool): Whether to use prisma migrate instead of db push

        Returns:
            bool: True if setup was successful, False otherwise
        """
        schema_path = ProxyExtrasDBManager._get_prisma_dir() + "/schema.prisma"
        for attempt in range(4):
            original_dir = os.getcwd()
            migrations_dir = ProxyExtrasDBManager._get_prisma_dir()
            os.chdir(migrations_dir)
            try:
                if use_migrate:
                    logger.info("Running prisma migrate deploy")
                    try:
                        # Set migrations directory for Prisma
                        result = subprocess.run(
                            [_get_prisma_command(), "migrate", "deploy"],
                            timeout=60,
                            check=True,
                            capture_output=True,
                            text=True,
                            env=_get_prisma_env(),
                        )
                        logger.info(f"prisma migrate deploy stdout: {result.stdout}")
                        logger.info("prisma migrate deploy completed")
                        # Run sanity check to ensure DB matches schema
                        logger.info("Running post-migration sanity check...")
                        ProxyExtrasDBManager._resolve_all_migrations(
                            migrations_dir, schema_path, mark_all_applied=False
                        )
                        logger.info("✅ Post-migration sanity check completed")
                        return True
                    except subprocess.CalledProcessError as e:
                        logger.info(f"prisma db error: {e.stderr}, e: {e.stdout}")
                        if "P3009" in e.stderr:
                            # Extract the failed migration name from the error message
                            migration_match = re.search(
                                r"`(\d+_.*)` migration", e.stderr
                            )
                            if migration_match:
                                failed_migration = migration_match.group(1)
                                if ProxyExtrasDBManager._is_idempotent_error(e.stderr):
                                    logger.info(
                                        f"Migration {failed_migration} failed due to idempotent error (e.g., column already exists), resolving as applied"
                                    )
                                    try:
                                        ProxyExtrasDBManager._roll_back_migration(
                                            failed_migration
                                        )
                                    except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as rollback_err:
                                        logger.warning(
                                            f"Failed to roll back migration {failed_migration}: {rollback_err}. "
                                            f"It may already be in a rolled-back state."
                                        )
                                    try:
                                        ProxyExtrasDBManager._resolve_specific_migration(
                                            failed_migration
                                        )
                                        logger.info(
                                            f"✅ Migration {failed_migration} resolved, retrying to apply remaining migrations"
                                        )
                                    except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as resolve_err:
                                        logger.warning(
                                            f"Failed to resolve migration {failed_migration}: {resolve_err}"
                                        )
                                else:
                                    logger.info(
                                        f"Found failed migration: {failed_migration}, marking as rolled back"
                                    )
                                    # Mark the failed migration as rolled back
                                    subprocess.run(
                                        [
                                            _get_prisma_command(),
                                            "migrate",
                                            "resolve",
                                            "--rolled-back",
                                            failed_migration,
                                        ],
                                        timeout=60,
                                        check=True,
                                        capture_output=True,
                                        text=True,
                                        env=_get_prisma_env(),
                                    )
                                    logger.info(
                                        f"✅ Migration {failed_migration} marked as rolled back... retrying"
                                    )
                        elif (
                            "P3005" in e.stderr
                            and "database schema is not empty" in e.stderr
                        ):
                            logger.info(
                                "Database schema is not empty, creating baseline migration. In read-only file system, please set an environment variable `LITELLM_MIGRATION_DIR` to a writable directory to enable migrations. Learn more - https://docs.litellm.ai/docs/proxy/prod#read-only-file-system"
                            )
                            ProxyExtrasDBManager._create_baseline_migration(schema_path)
                            logger.info(
                                "Baseline migration created, resolving all migrations"
                            )
                            ProxyExtrasDBManager._resolve_all_migrations(
                                migrations_dir, schema_path
                            )
                            logger.info("✅ All migrations resolved.")
                            return True
                        elif "P3018" in e.stderr:
                            # Check if this is a permission error or idempotent error
                            if ProxyExtrasDBManager._is_permission_error(e.stderr):
                                # Permission errors should NOT be marked as applied
                                # Extract migration name for logging
                                migration_match = re.search(
                                    r"Migration name: (\d+_.*)", e.stderr
                                )
                                migration_name = (
                                    migration_match.group(1)
                                    if migration_match
                                    else "unknown"
                                )
                                logger.error(
                                    f"❌ Migration {migration_name} failed due to insufficient permissions. "
                                    f"Please check database user privileges. Error: {e.stderr}"
                                )
                                # Mark as rolled back and exit with error
                                if migration_match:
                                    try:
                                        ProxyExtrasDBManager._roll_back_migration(
                                            migration_name
                                        )
                                        logger.info(
                                            f"Migration {migration_name} marked as rolled back"
                                        )
                                    except Exception as rollback_error:
                                        logger.warning(
                                            f"Failed to mark migration as rolled back: {rollback_error}"
                                        )
                                # Re-raise the error to prevent silent failures
                                raise RuntimeError(
                                    f"Migration failed due to permission error. Migration {migration_name} "
                                    f"was NOT applied. Please grant necessary database permissions and retry."
                                ) from e
                            elif ProxyExtrasDBManager._is_idempotent_error(e.stderr):
                                # Idempotent errors mean the migration has effectively been applied
                                logger.info(
                                    "Migration failed due to idempotent error (e.g., column already exists), "
                                    "resolving as applied"
                                )
                                # Extract the migration name from the error message
                                migration_match = re.search(
                                    r"Migration name: (\d+_.*)", e.stderr
                                )
                                if migration_match:
                                    migration_name = migration_match.group(1)
                                    try:
                                        logger.info(
                                            f"Rolling back migration {migration_name}"
                                        )
                                        ProxyExtrasDBManager._roll_back_migration(
                                            migration_name
                                        )
                                    except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as rollback_err:
                                        logger.warning(
                                            f"Failed to roll back migration {migration_name}: {rollback_err}. "
                                            f"It may already be in a rolled-back state."
                                        )
                                    try:
                                        logger.info(
                                            f"Resolving migration {migration_name} that failed "
                                            f"due to existing schema objects"
                                        )
                                        ProxyExtrasDBManager._resolve_specific_migration(
                                            migration_name
                                        )
                                        logger.info(
                                            f"✅ Migration {migration_name} resolved, "
                                            f"retrying to apply remaining migrations"
                                        )
                                    except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as resolve_err:
                                        logger.warning(
                                            f"Failed to resolve migration {migration_name}: {resolve_err}"
                                        )
                            else:
                                # Unknown P3018 error - log and re-raise for safety
                                logger.warning(
                                    f"P3018 error encountered but could not classify "
                                    f"as permission or idempotent error. "
                                    f"Error: {e.stderr}"
                                )
                                raise
                else:
                    # Use prisma db push with increased timeout
                    subprocess.run(
                        [_get_prisma_command(), "db", "push", "--accept-data-loss"],
                        timeout=60,
                        check=True,
                    )
                    return True
            except subprocess.TimeoutExpired:
                logger.info(f"Attempt {attempt + 1} timed out")
                time.sleep(random.randrange(5, 15))
            except subprocess.CalledProcessError as e:
                attempts_left = 3 - attempt
                retry_msg = (
                    f" Retrying... ({attempts_left} attempts left)"
                    if attempts_left > 0
                    else ""
                )
                logger.info(f"The process failed to execute. Details: {e}.{retry_msg}")
                time.sleep(random.randrange(5, 15))
            finally:
                # Always restore the caller's working directory
                os.chdir(original_dir)
        return False