fix(prometheus): emit litellm_remaining_tokens_metric for Bedrock and Vertex (#27705)

* fix(prometheus): emit remaining_tokens/requests gauges for bedrock + vertex (LIT-2719)

Bedrock and Vertex AI never return x-ratelimit-remaining-* response headers, so litellm_remaining_tokens_metric / litellm_remaining_requests_metric only fired for OpenAI / Azure / Anthropic deployments even when tpm/rpm was configured on the router.

Add a provider-agnostic fallback in PrometheusLogger.async_log_success_event that asks Router.get_remaining_model_group_usage() for the same model_group and emits the gauges with configured_limit - current_usage when the upstream provider didn't populate the headers itself.

Existing OpenAI / Azure / Anthropic flows are unchanged because the fallback short-circuits when both header values are already present.

Tests: 8 new tests covering bedrock + vertex emission, header short-circuit, partial-header fill, llm_router=None, missing model_group, empty router result, and router exception swallowing.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(prometheus): narrow except to ImportError, log router lookup failures via verbose_logger.exception

Address greptile review:
- The optional 'from litellm.proxy.proxy_server import llm_router' should guard against ImportError specifically, not all exceptions, so that unexpected errors (e.g. AttributeError from partially-initialized state) stay visible.
- get_remaining_model_group_usage failures are now logged via verbose_logger.exception (with traceback) instead of debug, matching the PR description's intent and avoiding silent loss of router-cache errors in production.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(prometheus): subtract in-flight delta in router-remaining fallback

The router's TPM/RPM counter is incremented by Router.deployment_callback_on_success, which fires alongside this prometheus callback in the success-log fan-out. Prometheus wins the race, so get_remaining_model_group_usage returns the pre-decrement counter for the current request, while vendor headers (OpenAI/Anthropic/Azure) are already post-decrement. That broke parity between providers on the same gauge: dashboards plotting litellm_remaining_requests_metric showed Bedrock/Vertex perpetually one request behind Anthropic for the same throughput.

Replay the in-flight increment before emit: subtract total_tokens from remaining_tokens and 1 from remaining_requests.

* Revert "fix(prometheus): subtract in-flight delta in router-remaining fallback"

This reverts commit 001ce95ecdd952b4b5a23dd2b1e62c4562c932bc.

* fix(router): post-decrement router-derived ratelimit headers

Router.set_response_headers injects x-ratelimit-remaining-{tokens, requests} for providers that don't return them natively (Bedrock, Vertex). The values come from get_remaining_model_group_usage, which reads the router's TPM/RPM counter, incremented post-response by deployment_callback_on_success. So the headers reflected the counter state before the current request was counted: pre-decrement. Vendor headers from OpenAI/Anthropic/Azure are post-decrement (the vendor counted the request before responding). Same metric name, two semantics: dashboards plotting litellm_remaining_requests_metric showed Bedrock/Vertex perpetually one request behind for the same throughput, and the HTTP response headers exposed the same skew to clients.

Subtract the in-flight delta before writing: 1 from remaining-requests, response.usage.total_tokens from remaining-tokens. Fixes both the response headers and (transitively) the prometheus gauges that read from standard_logging_payload.additional_headers.

---------

Co-authored-by: cursor <cursor@example.com>
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
2026-07-05 15:08:18 +00:00 · 2026-05-13 17:40:59 -07:00
parent 6274b4c217
commit f028a622e2
4 changed files with 496 additions and 1 deletions
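
The pre-decrement versus post-decrement adjustment described in the commit message can be illustrated in isolation. The following is a minimal sketch, not the code in this commit: `apply_in_flight_delta` is a hypothetical helper name, and only the header keys and the subtraction rule (1 request, `total_tokens` tokens) are taken from the message.

```python
from typing import Dict, Optional


def apply_in_flight_delta(
    remaining_usage: Dict[str, Optional[int]],
    total_tokens: Optional[int],
) -> Dict[str, int]:
    """Hypothetical helper: shift router-derived values to post-decrement.

    The router counter has not yet counted the current request, so the
    current request's cost is subtracted from the "remaining" headers.
    Headers without an entry in the delta map (the limits) pass through.
    """
    in_flight_delta = {
        # A missing or None token count is treated as 0 tokens.
        "x-ratelimit-remaining-tokens": total_tokens or 0,
        "x-ratelimit-remaining-requests": 1,
    }
    adjusted: Dict[str, int] = {}
    for header, value in remaining_usage.items():
        if value is None:
            continue  # nothing configured for this dimension
        adjusted[header] = value - in_flight_delta.get(header, 0)
    return adjusted


pre_decrement = {
    "x-ratelimit-remaining-tokens": 1000,
    "x-ratelimit-limit-tokens": 1000,
    "x-ratelimit-remaining-requests": 100,
    "x-ratelimit-limit-requests": 100,
}
post_decrement = apply_in_flight_delta(pre_decrement, total_tokens=42)
no_usage = apply_in_flight_delta(pre_decrement, total_tokens=None)
```

With a 42-token response the remaining-tokens value drops from 1000 to 958 and remaining-requests from 100 to 99, while both limit headers are unchanged. When no usage is available, only the request count is decremented.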
@@ -1226,6 +1226,17 @@ class PrometheusLogger(CustomLogger):
            label_context=label_context,
        )

+        # Provider-agnostic fallback: providers like Bedrock and Vertex don't return
+        # x-ratelimit-remaining-* headers, so the gauges above only fire for OpenAI /
+        # Anthropic / Azure. When the proxy router has tpm/rpm configured for the
+        # model_group, derive remaining from configured-limit minus current usage so
+        # the same metric is populated for any provider.
+        await self._async_set_router_remaining_metrics(
+            standard_logging_payload=standard_logging_payload,  # type: ignore
+            enum_values=enum_values,
+            label_context=label_context,
+        )
+
        # cache metrics
        self._increment_cache_metrics(
            standard_logging_payload=standard_logging_payload,  # type: ignore
@@ -2199,6 +2210,99 @@ class PrometheusLogger(CustomLogger):
            )
            self.litellm_deployment_rpm_limit.labels(**_labels).set(rpm)

+    async def _async_set_router_remaining_metrics(
+        self,
+        standard_logging_payload: StandardLoggingPayload,
+        enum_values: UserAPIKeyLabelValues,
+        label_context: Optional[PrometheusLabelFactoryContext] = None,
+    ) -> None:
+        """
+        Populate ``litellm_remaining_tokens_metric`` /
+        ``litellm_remaining_requests_metric`` from the router's internal usage
+        counters when the upstream provider did not return
+        ``x-ratelimit-remaining-*`` response headers.
+
+        OpenAI / Anthropic / Azure return remaining tokens/requests in response
+        headers, but Bedrock and Vertex AI do not. This fallback computes
+        ``configured_limit - current_usage`` via
+        ``Router.get_remaining_model_group_usage`` so the same gauges are
+        emitted for every provider when tpm/rpm is configured on the
+        deployment.
+        """
+        try:
+            additional_headers = (
+                standard_logging_payload.get("hidden_params", {}) or {}
+            ).get("additional_headers") or {}
+
+            already_have_tokens = (
+                additional_headers.get("x_ratelimit_remaining_tokens") is not None
+            )
+            already_have_requests = (
+                additional_headers.get("x_ratelimit_remaining_requests") is not None
+            )
+            if already_have_tokens and already_have_requests:
+                return
+
+            model_group = standard_logging_payload.get("model_group")
+            if not model_group:
+                return
+
+            try:
+                from litellm.proxy.proxy_server import llm_router
+            except ImportError:
+                llm_router = None
+
+            if llm_router is None:
+                return
+
+            try:
+                remaining_usage = await llm_router.get_remaining_model_group_usage(
+                    model_group
+                )
+            except Exception as e:
+                verbose_logger.exception(
+                    "Prometheus: get_remaining_model_group_usage failed for "
+                    "model_group=%s: %s",
+                    model_group,
+                    e,
+                )
+                return
+
+            if not remaining_usage:
+                return
+
+            remaining_tokens = remaining_usage.get("x-ratelimit-remaining-tokens")
+            remaining_requests = remaining_usage.get("x-ratelimit-remaining-requests")
+
+            if not already_have_tokens and remaining_tokens is not None:
+                _labels = prometheus_label_factory(
+                    supported_enum_labels=self.get_labels_for_metric(
+                        metric_name="litellm_remaining_tokens_metric"
+                    ),
+                    enum_values=enum_values,
+                    label_context=label_context,
+                )
+                self.litellm_remaining_tokens_metric.labels(**_labels).set(
+                    remaining_tokens
+                )
+
+            if not already_have_requests and remaining_requests is not None:
+                _labels = prometheus_label_factory(
+                    supported_enum_labels=self.get_labels_for_metric(
+                        metric_name="litellm_remaining_requests_metric"
+                    ),
+                    enum_values=enum_values,
+                    label_context=label_context,
+                )
+                self.litellm_remaining_requests_metric.labels(**_labels).set(
+                    remaining_requests
+                )
+        except Exception as e:
+            verbose_logger.exception(
+                "Prometheus Error: _async_set_router_remaining_metrics. "
+                "Exception occurred - {}".format(str(e))
+            )
+
    def set_llm_deployment_success_metrics(
        self,
        request_kwargs: dict,
@@ -8771,9 +8771,29 @@ class Router:
                    model_group
                )

+                # get_remaining_model_group_usage reads the router's TPM/RPM
+                # counter, which is incremented post-response by
+                # deployment_callback_on_success. So the values returned here
+                # are pre-decrement for the current request, while vendor
+                # headers (OpenAI/Anthropic/Azure) are post-decrement. Replay
+                # the in-flight increment so router-derived headers match
+                # vendor-derived semantics — for both the HTTP response sent
+                # to the client and the prometheus gauges that read these
+                # headers downstream (LIT-2719).
+                in_flight_tokens = 0
+                usage = getattr(response, "usage", None)
+                if usage is not None:
+                    in_flight_tokens = getattr(usage, "total_tokens", 0) or 0
+                in_flight_delta = {
+                    "x-ratelimit-remaining-tokens": in_flight_tokens,
+                    "x-ratelimit-remaining-requests": 1,
+                }
+
                for header, value in remaining_usage.items():
                    if value is not None:
-                        additional_headers[header] = value
+                        additional_headers[header] = value - in_flight_delta.get(
+                            header, 0
+                        )
        return response

    def _build_model_name_index(self, model_list: list) -> None:
@@ -879,6 +879,79 @@ async def test_set_response_headers(model_list):
    assert resp is None


+@pytest.mark.asyncio
+async def test_set_response_headers_subtracts_in_flight_delta(model_list):
+    """
+    LIT-2719: router-derived `x-ratelimit-remaining-*` headers must be
+    post-decrement (match OpenAI/Anthropic vendor semantics) so the proxy's
+    HTTP response headers and the prometheus gauges that read them stay
+    comparable across providers.
+
+    Router's TPM/RPM counter is incremented post-response by
+    `deployment_callback_on_success`, so `get_remaining_model_group_usage`
+    sees pre-decrement values. `set_response_headers` must replay the
+    in-flight increment before writing the headers.
+    """
+    from pydantic import BaseModel
+
+    class _Usage(BaseModel):
+        total_tokens: int = 42
+
+    class _Resp(BaseModel):
+        usage: _Usage = _Usage()
+        _hidden_params: dict = {}
+
+    router = Router(model_list=model_list)
+    router.get_remaining_model_group_usage = AsyncMock(
+        return_value={
+            "x-ratelimit-remaining-tokens": 1000,
+            "x-ratelimit-limit-tokens": 1000,
+            "x-ratelimit-remaining-requests": 100,
+            "x-ratelimit-limit-requests": 100,
+        }
+    )
+
+    resp = _Resp()
+    resp._hidden_params = {}
+    await router.set_response_headers(response=resp, model_group="gpt-3.5-turbo")
+
+    headers = resp._hidden_params["additional_headers"]
+    assert headers["x-ratelimit-remaining-tokens"] == 958
+    assert headers["x-ratelimit-remaining-requests"] == 99
+    # Limit headers pass through unmodified.
+    assert headers["x-ratelimit-limit-tokens"] == 1000
+    assert headers["x-ratelimit-limit-requests"] == 100
+
+
+@pytest.mark.asyncio
+async def test_set_response_headers_handles_missing_usage(model_list):
+    """
+    Streaming chunks and some response shapes may lack a `usage` attribute or
+    populated `total_tokens`. The in-flight subtraction must default to 0
+    tokens (still subtract 1 from requests) and never raise.
+    """
+    from pydantic import BaseModel
+
+    class _Resp(BaseModel):
+        _hidden_params: dict = {}
+
+    router = Router(model_list=model_list)
+    router.get_remaining_model_group_usage = AsyncMock(
+        return_value={
+            "x-ratelimit-remaining-tokens": 1000,
+            "x-ratelimit-remaining-requests": 100,
+        }
+    )
+
+    resp = _Resp()
+    resp._hidden_params = {}
+    await router.set_response_headers(response=resp, model_group="gpt-3.5-turbo")
+
+    headers = resp._hidden_params["additional_headers"]
+    assert headers["x-ratelimit-remaining-tokens"] == 1000
+    assert headers["x-ratelimit-remaining-requests"] == 99
+
+
 def test_get_all_deployments(model_list):
    """Test if the 'get_all_deployments' function is working correctly"""
    router = Router(model_list=model_list)
@@ -0,0 +1,298 @@
+"""
+LIT-2719 — `litellm_remaining_tokens_metric` and
+`litellm_remaining_requests_metric` only fired for providers that return
+`x-ratelimit-remaining-*` response headers (OpenAI, Azure, Anthropic).
+
+This guarded the gauges behind a provider-specific code path, so Bedrock and
+Vertex deployments — which never populate those headers — silently produced no
+data even when the proxy router had `tpm`/`rpm` configured.
+
+`_async_set_router_remaining_metrics` adds a provider-agnostic fallback that
+asks `Router.get_remaining_model_group_usage` for the same model_group and
+emits the gauges with `configured_limit - current_usage`.
+
+Tests cover:
+- Bedrock fallback emits both gauges.
+- Vertex AI fallback emits both gauges.
+- Already-present headers short-circuit the router lookup entirely.
+- Partial header coverage (only requests) still triggers the missing tokens
+  gauge.
+- llm_router unavailable / model_group missing / empty router result → no-op.
+- Router raises → exception is logged and swallowed; no gauges are emitted.
+"""
+
+import os
+import sys
+from unittest.mock import AsyncMock, MagicMock, patch
+
+import pytest
+from prometheus_client import REGISTRY
+
+sys.path.insert(0, os.path.abspath("../../.."))
+
+from litellm.integrations.prometheus import PrometheusLogger
+from litellm.types.integrations.prometheus import UserAPIKeyLabelValues
+
+
+@pytest.fixture(scope="function")
+def prometheus_logger():
+    collectors = list(REGISTRY._collector_to_names.keys())
+    for collector in collectors:
+        REGISTRY.unregister(collector)
+    return PrometheusLogger()
+
+
+def _build_payload(
+    model_group: str = "bedrock-claude-group",
+    custom_llm_provider: str = "bedrock",
+    additional_headers: dict | None = None,
+):
+    return {
+        "model_group": model_group,
+        "custom_llm_provider": custom_llm_provider,
+        "model": "anthropic.claude-3-sonnet-20240229-v1:0",
+        "model_id": "deployment-id-1",
+        "api_base": "https://bedrock-runtime.us-east-1.amazonaws.com",
+        "hidden_params": {
+            "additional_headers": additional_headers or {},
+        },
+        "metadata": {
+            "user_api_key_hash": "test-key",
+            "user_api_key_alias": None,
+            "user_api_key_team_id": None,
+            "user_api_key_team_alias": None,
+        },
+    }
+
+
+def _enum_values(model_group: str = "bedrock-claude-group"):
+    return UserAPIKeyLabelValues(
+        end_user=None,
+        hashed_api_key="test-key",
+        api_key_alias=None,
+        team=None,
+        team_alias=None,
+        requested_model=model_group,
+        model_group=model_group,
+        model_id="deployment-id-1",
+        api_base="https://bedrock-runtime.us-east-1.amazonaws.com",
+        api_provider="bedrock",
+        litellm_model_name="anthropic.claude-3-sonnet-20240229-v1:0",
+    )
+
+
+class TestRouterFallbackEmitsForBedrock:
+    @pytest.mark.asyncio
+    async def test_should_emit_both_gauges_for_bedrock_when_router_has_limits(
+        self, prometheus_logger
+    ):
+        payload = _build_payload(custom_llm_provider="bedrock")
+        enum_values = _enum_values()
+
+        fake_router = MagicMock()
+        fake_router.get_remaining_model_group_usage = AsyncMock(
+            return_value={
+                "x-ratelimit-remaining-tokens": 75,
+                "x-ratelimit-limit-tokens": 100,
+                "x-ratelimit-remaining-requests": 9,
+                "x-ratelimit-limit-requests": 10,
+            }
+        )
+
+        prometheus_logger.litellm_remaining_tokens_metric = MagicMock()
+        prometheus_logger.litellm_remaining_requests_metric = MagicMock()
+
+        with patch("litellm.proxy.proxy_server.llm_router", fake_router, create=True):
+            await prometheus_logger._async_set_router_remaining_metrics(
+                standard_logging_payload=payload,
+                enum_values=enum_values,
+            )
+
+        fake_router.get_remaining_model_group_usage.assert_awaited_once_with(
+            "bedrock-claude-group"
+        )
+        prometheus_logger.litellm_remaining_tokens_metric.labels.assert_called_once()
+        prometheus_logger.litellm_remaining_tokens_metric.labels().set.assert_called_once_with(
+            75
+        )
+        prometheus_logger.litellm_remaining_requests_metric.labels.assert_called_once()
+        prometheus_logger.litellm_remaining_requests_metric.labels().set.assert_called_once_with(
+            9
+        )
+
+
+class TestRouterFallbackEmitsForVertex:
+    @pytest.mark.asyncio
+    async def test_should_emit_both_gauges_for_vertex_when_router_has_limits(
+        self, prometheus_logger
+    ):
+        payload = _build_payload(
+            model_group="vertex-gemini-group",
+            custom_llm_provider="vertex_ai",
+        )
+        enum_values = _enum_values(model_group="vertex-gemini-group")
+
+        fake_router = MagicMock()
+        fake_router.get_remaining_model_group_usage = AsyncMock(
+            return_value={
+                "x-ratelimit-remaining-tokens": 12345,
+                "x-ratelimit-remaining-requests": 50,
+            }
+        )
+
+        prometheus_logger.litellm_remaining_tokens_metric = MagicMock()
+        prometheus_logger.litellm_remaining_requests_metric = MagicMock()
+
+        with patch("litellm.proxy.proxy_server.llm_router", fake_router, create=True):
+            await prometheus_logger._async_set_router_remaining_metrics(
+                standard_logging_payload=payload,
+                enum_values=enum_values,
+            )
+
+        fake_router.get_remaining_model_group_usage.assert_awaited_once_with(
+            "vertex-gemini-group"
+        )
+        prometheus_logger.litellm_remaining_tokens_metric.labels().set.assert_called_once_with(
+            12345
+        )
+        prometheus_logger.litellm_remaining_requests_metric.labels().set.assert_called_once_with(
+            50
+        )
+
+
+class TestExistingHeadersShortCircuit:
+    @pytest.mark.asyncio
+    async def test_should_skip_router_lookup_when_both_headers_already_present(
+        self, prometheus_logger
+    ):
+        payload = _build_payload(
+            additional_headers={
+                "x_ratelimit_remaining_tokens": 999,
+                "x_ratelimit_remaining_requests": 99,
+            }
+        )
+
+        fake_router = MagicMock()
+        fake_router.get_remaining_model_group_usage = AsyncMock()
+
+        prometheus_logger.litellm_remaining_tokens_metric = MagicMock()
+        prometheus_logger.litellm_remaining_requests_metric = MagicMock()
+
+        with patch("litellm.proxy.proxy_server.llm_router", fake_router, create=True):
+            await prometheus_logger._async_set_router_remaining_metrics(
+                standard_logging_payload=payload,
+                enum_values=_enum_values(),
+            )
+
+        fake_router.get_remaining_model_group_usage.assert_not_called()
+        prometheus_logger.litellm_remaining_tokens_metric.labels.assert_not_called()
+        prometheus_logger.litellm_remaining_requests_metric.labels.assert_not_called()
+
+    @pytest.mark.asyncio
+    async def test_should_only_fill_missing_dimension_when_one_header_present(
+        self, prometheus_logger
+    ):
+        payload = _build_payload(
+            additional_headers={
+                "x_ratelimit_remaining_requests": 7,
+            }
+        )
+
+        fake_router = MagicMock()
+        fake_router.get_remaining_model_group_usage = AsyncMock(
+            return_value={
+                "x-ratelimit-remaining-tokens": 555,
+                "x-ratelimit-remaining-requests": 999,
+            }
+        )
+
+        prometheus_logger.litellm_remaining_tokens_metric = MagicMock()
+        prometheus_logger.litellm_remaining_requests_metric = MagicMock()
+
+        with patch("litellm.proxy.proxy_server.llm_router", fake_router, create=True):
+            await prometheus_logger._async_set_router_remaining_metrics(
+                standard_logging_payload=payload,
+                enum_values=_enum_values(),
+            )
+
+        prometheus_logger.litellm_remaining_tokens_metric.labels().set.assert_called_once_with(
+            555
+        )
+        prometheus_logger.litellm_remaining_requests_metric.labels.assert_not_called()
+
+
+class TestRouterFallbackDefensivePaths:
+    @pytest.mark.asyncio
+    async def test_should_noop_when_llm_router_is_none(self, prometheus_logger):
+        payload = _build_payload()
+
+        prometheus_logger.litellm_remaining_tokens_metric = MagicMock()
+        prometheus_logger.litellm_remaining_requests_metric = MagicMock()
+
+        with patch("litellm.proxy.proxy_server.llm_router", None, create=True):
+            await prometheus_logger._async_set_router_remaining_metrics(
+                standard_logging_payload=payload,
+                enum_values=_enum_values(),
+            )
+
+        prometheus_logger.litellm_remaining_tokens_metric.labels.assert_not_called()
+        prometheus_logger.litellm_remaining_requests_metric.labels.assert_not_called()
+
+    @pytest.mark.asyncio
+    async def test_should_noop_when_model_group_missing(self, prometheus_logger):
+        payload = _build_payload()
+        payload["model_group"] = None
+
+        fake_router = MagicMock()
+        fake_router.get_remaining_model_group_usage = AsyncMock()
+
+        prometheus_logger.litellm_remaining_tokens_metric = MagicMock()
+        prometheus_logger.litellm_remaining_requests_metric = MagicMock()
+
+        with patch("litellm.proxy.proxy_server.llm_router", fake_router, create=True):
+            await prometheus_logger._async_set_router_remaining_metrics(
+                standard_logging_payload=payload,
+                enum_values=_enum_values(),
+            )
+
+        fake_router.get_remaining_model_group_usage.assert_not_called()
+        prometheus_logger.litellm_remaining_tokens_metric.labels.assert_not_called()
+
+    @pytest.mark.asyncio
+    async def test_should_noop_when_router_returns_empty_dict(self, prometheus_logger):
+        payload = _build_payload()
+
+        fake_router = MagicMock()
+        fake_router.get_remaining_model_group_usage = AsyncMock(return_value={})
+
+        prometheus_logger.litellm_remaining_tokens_metric = MagicMock()
+        prometheus_logger.litellm_remaining_requests_metric = MagicMock()
+
+        with patch("litellm.proxy.proxy_server.llm_router", fake_router, create=True):
+            await prometheus_logger._async_set_router_remaining_metrics(
+                standard_logging_payload=payload,
+                enum_values=_enum_values(),
+            )
+
+        prometheus_logger.litellm_remaining_tokens_metric.labels.assert_not_called()
+        prometheus_logger.litellm_remaining_requests_metric.labels.assert_not_called()
+
+    @pytest.mark.asyncio
+    async def test_should_swallow_router_exception(self, prometheus_logger):
+        payload = _build_payload()
+
+        fake_router = MagicMock()
+        fake_router.get_remaining_model_group_usage = AsyncMock(
+            side_effect=RuntimeError("router boom")
+        )
+
+        prometheus_logger.litellm_remaining_tokens_metric = MagicMock()
+        prometheus_logger.litellm_remaining_requests_metric = MagicMock()
+
+        with patch("litellm.proxy.proxy_server.llm_router", fake_router, create=True):
+            # Must not raise.
+            await prometheus_logger._async_set_router_remaining_metrics(
+                standard_logging_payload=payload,
+                enum_values=_enum_values(),
+            )
+
+        prometheus_logger.litellm_remaining_tokens_metric.labels.assert_not_called()
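
The behavior these tests pin down can be summarized as a pure decision function. This is a sketch under stated assumptions rather than the logger's implementation: `select_fallback_gauges` is a hypothetical name, and it returns a plain dict of metric name to value instead of touching any Prometheus gauge. The key spellings (underscored in the logged payload, hyphenated in the router result) follow the diff above.

```python
from typing import Any, Dict, Optional


def select_fallback_gauges(
    additional_headers: Optional[Dict[str, Any]],
    remaining_usage: Optional[Dict[str, Any]],
) -> Dict[str, Any]:
    """Hypothetical helper: decide which gauges the router fallback fills.

    A dimension is filled only when the provider header for it is absent
    and the router reported a non-None value for it.
    """
    headers = additional_headers or {}
    usage = remaining_usage or {}

    have_tokens = headers.get("x_ratelimit_remaining_tokens") is not None
    have_requests = headers.get("x_ratelimit_remaining_requests") is not None

    to_set: Dict[str, Any] = {}
    tokens = usage.get("x-ratelimit-remaining-tokens")
    requests = usage.get("x-ratelimit-remaining-requests")
    if not have_tokens and tokens is not None:
        to_set["litellm_remaining_tokens_metric"] = tokens
    if not have_requests and requests is not None:
        to_set["litellm_remaining_requests_metric"] = requests
    return to_set


router_result = {
    "x-ratelimit-remaining-tokens": 555,
    "x-ratelimit-remaining-requests": 999,
}
# No provider headers (Bedrock / Vertex case): both gauges come from the router.
no_headers = select_fallback_gauges({}, router_result)
# Provider already reported requests: only the tokens gauge is filled.
partial = select_fallback_gauges({"x_ratelimit_remaining_requests": 7}, router_result)
# Provider reported both: the fallback contributes nothing.
both_present = select_fallback_gauges(
    {"x_ratelimit_remaining_tokens": 999, "x_ratelimit_remaining_requests": 99},
    router_result,
)
```

Unlike the real method, this sketch always receives a router result; the logger returns before calling the router when both headers are present, which is what `assert_not_called()` checks in the short-circuit test.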