fix: harden webhook reliability, fix bugs, add test suite

- Statuspage webhook always returns 200 to prevent subscriber removal
- Fix parseKvKey returning string chatId instead of number
- Queue consumer retries on Telegram 5xx instead of acking (prevents message loss)
- Fix observability top-level enabled flag (false → true)
- Add defensive null checks for webhook payload body
- Cache Bot instance per isolate to avoid middleware rebuild per request
- Add vitest + @cloudflare/vitest-pool-workers with 31 tests
- Document DLQ and KV sharding as declined features
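
For reviewers, the queue consumer's delivery policy after this change can be summarized as a small pure function. This is an illustrative sketch only, not code from this commit: `classifyTelegramStatus` and its return labels are hypothetical names. The mapping follows the behavior documented in docs/system-architecture.md (429 and 5xx are retried, 403/400 remove the subscriber and acknowledge). The handling of any other status is an assumption based on the existing "unexpected HTTP" branch.

```javascript
// Hypothetical helper (not part of this commit): maps a Telegram HTTP
// status code to the action the queue consumer is expected to take.
function classifyTelegramStatus(status) {
  // Successful delivery: acknowledge the message.
  if (status >= 200 && status < 300) return "ack";

  // Bot blocked or chat invalid: drop the subscriber, then acknowledge.
  if (status === 403 || status === 400) return "remove-and-ack";

  // Rate limited or transient Telegram server error: let CF Queues retry.
  if (status === 429 || status >= 500) return "retry";

  // Assumption: anything else is logged as unexpected and not retried.
  return "ack-and-log";
}
```

Before this commit a 5xx response fell through to the final branch, so a transient Telegram outage could drop a notification instead of redelivering it.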
2026-04-17 11:20:30 +00:00 · 2026-04-09 10:29:30 +07:00
parent bb8f4dcde8
commit 8c993df72b
15 changed files with 1680 additions and 57 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -13,7 +13,10 @@ Telegram bot that forwards [status.claude.com](https://status.claude.com/) (Atla
 - `npx wrangler deploy --dry-run` — Verify build without deploying
 - `node scripts/setup-bot.js` — One-time: register bot commands + set Telegram webhook (interactive prompts)

-No test framework configured yet. No linter configured.
+- `npm test` — Run tests (vitest + @cloudflare/vitest-pool-workers, runs in Workers runtime)
+- `npm run test:watch` — Run tests in watch mode
+
+No linter configured.

 ## Secrets (set via `wrangler secret put`)

--- a/docs/feature-decisions.md
+++ b/docs/feature-decisions.md
@@ -78,7 +78,23 @@ Ordered by likelihood of future implementation (top = most likely to revisit).

 **Why this rank**: Out of scope. The bot is the product — adding a web frontend changes the project's nature.

-### 10. Digest / Quiet Mode
+### 10. Dead Letter Queue for Failed Messages
+
+**Idea**: After CF Queues exhausts 3 retries, persist failed messages to KV or a dedicated DLQ for debugging.
+
+**Decision**: Skip. CF Workers already logs all queue consumer errors (including final retry failures) via the observability config. With 100% log sampling and persisted invocation logs, failed messages are visible in the Cloudflare Dashboard. Adding a KV-based DLQ introduces write overhead on every failure and cleanup logic for stale entries — not worth it when logs already provide the same visibility.
+
+**Why this rank**: Logging is sufficient for current scale. Revisit only if log retention (3-day free tier) is too short for debugging patterns.
+
+### 11. KV List Scalability (Subscriber Sharding)
+
+**Idea**: Shard subscriber keys by event type (e.g., `sub:incident:{chatId}`, `sub:component:{chatId}`) to avoid listing all subscribers on every webhook.
+
+**Decision**: Skip. Current `kv.list({ prefix: "sub:" })` pagination works for hundreds of subscribers. Sharding requires a KV schema migration, dual-write logic during transition, and doubles storage for subscribers who want both types. Not justified until `kv.list()` latency or cost becomes measurable.
+
+**Why this rank**: Clear trigger: slow webhook response times at high subscriber counts. Migration path is straightforward when needed.
+
+### 12. Digest / Quiet Mode

 **Idea**: Batch notifications into a daily summary instead of instant alerts.

--- a/docs/system-architecture.md
+++ b/docs/system-architecture.md
@@ -56,7 +56,7 @@ A middleware in `index.js` normalizes double slashes in URL paths (Statuspage oc
 | File | Lines | Responsibility |
 |------|-------|---------------|
 | `index.js` | ~30 | Hono router, path normalization middleware, export handlers |
-| `bot-commands.js` | ~145 | `/start`, `/stop`, `/subscribe` — subscription management |
+| `bot-commands.js` | ~155 | `/start`, `/stop`, `/subscribe` — subscription management (cached Bot instance) |
 | `bot-info-commands.js` | ~125 | `/help`, `/status`, `/history`, `/uptime` — read-only info |
 | `statuspage-webhook.js` | ~85 | Webhook validation, event parsing, subscriber fan-out |
 | `queue-consumer.js` | ~65 | Batch message delivery, retry/removal logic |
@@ -94,6 +94,7 @@ Binding: `claude-status` queue
 - **Batch size**: 30 messages per consumer invocation
 - **Max retries**: 3 (configured in `wrangler.jsonc`)
 - **429 handling**: `msg.retry()` with CF Queues backoff; `Retry-After` header logged
+- **5xx handling**: `msg.retry()` for transient Telegram server errors
 - **403/400 handling**: subscriber removed from KV, message acknowledged
 - **Network errors**: `msg.retry()` for transient failures

@@ -108,6 +109,7 @@ Enabled via `wrangler.jsonc` `observability` config. Automatic — no code chang

 ## Security

+- **Statuspage webhook always-200**: Handler always returns HTTP 200 (even on errors) to prevent Statuspage from removing the webhook subscription. Errors are logged, not surfaced as HTTP status codes.
 - **Statuspage webhook auth**: URL path secret validated with timing-safe SHA-256 comparison
 - **Telegram webhook**: Registered via `setup-bot.js` — Telegram only sends to the registered URL
 - **No secrets in code**: `BOT_TOKEN` and `WEBHOOK_SECRET` stored as Cloudflare secrets
--- a/package-lock.json
+++ b/package-lock.json
--- a/package.json
+++ b/package.json
@@ -6,7 +6,9 @@
  "type": "module",
  "scripts": {
    "dev": "wrangler dev",
-    "deploy": "wrangler deploy"
+    "deploy": "wrangler deploy",
+    "test": "vitest run",
+    "test:watch": "vitest"
  },
  "repository": {
    "type": "git",
@@ -24,6 +26,8 @@
    "hono": "^4.12.12"
  },
  "devDependencies": {
+    "@cloudflare/vitest-pool-workers": "^0.14.2",
+    "vitest": "^4.1.3",
    "wrangler": "^4.81.0"
  }
 }
--- a/src/bot-commands.js
+++ b/src/bot-commands.js
@@ -8,6 +8,13 @@ import {
 } from "./kv-store.js";
 import { fetchComponentByName, escapeHtml } from "./status-fetcher.js";
 import { registerInfoCommands } from "./bot-info-commands.js";
+
+/**
+ * Module-level KV reference, updated each request.
+ * Safe because an isolate is single-threaded and every request it serves gets the same KV binding.
+ */
+let kv = null;
+
 /**
 * Extract chatId and threadId from grammY context
 */
@@ -19,11 +26,10 @@ function getChatTarget(ctx) {
 }

 /**
- * Handle incoming Telegram webhook via grammY
+ * Create Bot with all commands registered. Called once per isolate.
 */
-export async function handleTelegramWebhook(c) {
-  const bot = new Bot(c.env.BOT_TOKEN);
-  const kv = c.env.claude_status;
+function createBot(token) {
+  const bot = new Bot(token);

  bot.command("start", async (ctx) => {
    const { chatId, threadId } = getChatTarget(ctx);
@@ -140,6 +146,29 @@ export async function handleTelegramWebhook(c) {
    );
  });

-  const handler = webhookCallback(bot, "cloudflare-mod");
-  return handler(c.req.raw);
+  return bot;
+}
+
+/**
+ * Cached Bot instance — avoids rebuilding middleware chain on every request.
+ * CF Workers reuse isolates, so module-level state persists across requests.
+ */
+let cachedBot = null;
+let cachedToken = null;
+let cachedHandler = null;
+
+/**
+ * Handle incoming Telegram webhook via grammY
+ */
+export async function handleTelegramWebhook(c) {
+  // Update module-level KV ref (same binding across requests, but kept explicit)
+  kv = c.env.claude_status;
+
+  if (!cachedBot || cachedToken !== c.env.BOT_TOKEN) {
+    cachedBot = createBot(c.env.BOT_TOKEN);
+    cachedToken = c.env.BOT_TOKEN;
+    cachedHandler = webhookCallback(cachedBot, "cloudflare-mod");
+  }
+
+  return cachedHandler(c.req.raw);
 }
--- a/src/kv-store.js
+++ b/src/kv-store.js
@@ -17,7 +17,7 @@ function parseKvKey(kvKey) {
  const lastColon = raw.lastIndexOf(":");
  // No colon or only negative sign prefix — no threadId
  if (lastColon <= 0) {
-    return { chatId: raw, threadId: null };
+    return { chatId: Number(raw), threadId: null };
  }
  // Check if the part after last colon is a valid threadId (numeric)
  const possibleThread = raw.slice(lastColon + 1);
--- a/src/queue-consumer.js
+++ b/src/queue-consumer.js
@@ -46,6 +46,10 @@ export async function handleQueue(batch, env) {
        console.log(`Queue: rate limited for ${chatId}, Retry-After: ${retryAfter ?? "unknown"}`);
        retried++;
        msg.retry();
+      } else if (res.status >= 500) {
+        console.error(`Queue: Telegram 5xx (${res.status}) for ${chatId}, retrying`);
+        retried++;
+        msg.retry();
      } else {
        console.error(`Queue: unexpected HTTP ${res.status} for ${chatId}`);
        failed++;
--- a/src/statuspage-webhook.js
+++ b/src/statuspage-webhook.js
@@ -34,13 +34,16 @@ function formatComponentMessage(component, update) {
 }

 /**
- * Handle incoming Statuspage webhook
+ * Handle incoming Statuspage webhook.
+ * CRITICAL: Always return 200 — Statuspage removes subscriber webhooks on non-2xx responses.
 */
 export async function handleStatuspageWebhook(c) {
+  try {
    // Validate URL secret (timing-safe)
    const secret = c.req.param("secret");
    if (!await timingSafeEqual(secret, c.env.WEBHOOK_SECRET)) {
-    return c.text("Unauthorized", 401);
+      console.error("Statuspage webhook: invalid secret");
+      return c.text("OK", 200);
    }

    // Parse body
@@ -48,25 +51,38 @@ export async function handleStatuspageWebhook(c) {
    try {
      body = await c.req.json();
    } catch {
-    return c.text("Bad Request", 400);
+      console.error("Statuspage webhook: invalid JSON body");
+      return c.text("OK", 200);
    }

    const eventType = body?.meta?.event_type;
-  if (!eventType) return c.text("Bad Request", 400);
+    if (!eventType) {
+      console.error("Statuspage webhook: missing event_type");
+      return c.text("OK", 200);
+    }

    console.log(`Statuspage webhook: ${eventType}`);

    // Determine category and format message
    let category, html, componentName;
    if (eventType.startsWith("incident.")) {
+      if (!body.incident) {
+        console.error("Statuspage webhook: incident event missing incident data");
+        return c.text("OK", 200);
+      }
      category = "incident";
      html = formatIncidentMessage(body.incident);
    } else if (eventType.startsWith("component.")) {
+      if (!body.component) {
+        console.error("Statuspage webhook: component event missing component data");
+        return c.text("OK", 200);
+      }
      category = "component";
-    componentName = body.component?.name || null;
+      componentName = body.component.name || null;
      html = formatComponentMessage(body.component, body.component_update);
    } else {
-    return c.text("Unknown event type", 400);
+      console.error(`Statuspage webhook: unknown event type ${eventType}`);
+      return c.text("OK", 200);
    }

    // Get filtered subscribers (with component name filtering)
@@ -81,6 +97,10 @@ export async function handleStatuspageWebhook(c) {
    }

    console.log(`Enqueued ${messages.length} messages for ${category}${componentName ? `:${componentName}` : ""}`);
-
    return c.text("OK", 200);
+  } catch (err) {
+    // Catch-all: log error but still return 200 to prevent Statuspage from removing us
+    console.error("Statuspage webhook: unexpected error", err);
+    return c.text("OK", 200);
+  }
 }
--- a/test/crypto-utils.test.js
+++ b/test/crypto-utils.test.js
@@ -0,0 +1,20 @@
+import { describe, it, expect } from "vitest";
+import { timingSafeEqual } from "../src/crypto-utils.js";
+
+describe("timingSafeEqual", () => {
+  it("returns true for identical strings", async () => {
+    expect(await timingSafeEqual("secret123", "secret123")).toBe(true);
+  });
+
+  it("returns false for different strings", async () => {
+    expect(await timingSafeEqual("secret123", "wrong")).toBe(false);
+  });
+
+  it("returns false for empty vs non-empty", async () => {
+    expect(await timingSafeEqual("", "something")).toBe(false);
+  });
+
+  it("returns true for both empty", async () => {
+    expect(await timingSafeEqual("", "")).toBe(true);
+  });
+});
--- a/test/kv-store.test.js
+++ b/test/kv-store.test.js
@@ -0,0 +1,124 @@
+import { describe, it, expect } from "vitest";
+import { env } from "cloudflare:test";
+import {
+  addSubscriber,
+  removeSubscriber,
+  getSubscriber,
+  updateSubscriberTypes,
+  updateSubscriberComponents,
+  getSubscribersByType,
+} from "../src/kv-store.js";
+
+// Each test uses unique chatIds to avoid cross-test interference (miniflare KV persists across tests)
+describe("kv-store", () => {
+  const kv = env.claude_status;
+
+  describe("addSubscriber / getSubscriber", () => {
+    it("adds subscriber with default types", async () => {
+      await addSubscriber(kv, 100, null);
+      const sub = await getSubscriber(kv, 100, null);
+      expect(sub).toEqual({ types: ["incident", "component"], components: [] });
+    });
+
+    it("adds subscriber with threadId", async () => {
+      await addSubscriber(kv, 101, 456);
+      const sub = await getSubscriber(kv, 101, 456);
+      expect(sub).toEqual({ types: ["incident", "component"], components: [] });
+    });
+
+    it("handles threadId=0 (General topic)", async () => {
+      await addSubscriber(kv, 102, 0);
+      const sub = await getSubscriber(kv, 102, 0);
+      expect(sub).toEqual({ types: ["incident", "component"], components: [] });
+    });
+
+    it("preserves existing data on re-subscribe", async () => {
+      await addSubscriber(kv, 103, null);
+      await updateSubscriberTypes(kv, 103, null, ["incident"]);
+      await addSubscriber(kv, 103, null);
+      const sub = await getSubscriber(kv, 103, null);
+      expect(sub.types).toEqual(["incident"]);
+    });
+  });
+
+  describe("removeSubscriber", () => {
+    it("removes existing subscriber", async () => {
+      await addSubscriber(kv, 200, null);
+      await removeSubscriber(kv, 200, null);
+      const sub = await getSubscriber(kv, 200, null);
+      expect(sub).toBeNull();
+    });
+  });
+
+  describe("updateSubscriberTypes", () => {
+    it("updates types for existing subscriber", async () => {
+      await addSubscriber(kv, 300, null);
+      const result = await updateSubscriberTypes(kv, 300, null, ["incident"]);
+      expect(result).toBe(true);
+      const sub = await getSubscriber(kv, 300, null);
+      expect(sub.types).toEqual(["incident"]);
+    });
+
+    it("returns false for non-existent subscriber", async () => {
+      const result = await updateSubscriberTypes(kv, 99999, null, ["incident"]);
+      expect(result).toBe(false);
+    });
+  });
+
+  describe("updateSubscriberComponents", () => {
+    it("sets component filter", async () => {
+      await addSubscriber(kv, 400, null);
+      await updateSubscriberComponents(kv, 400, null, ["API"]);
+      const sub = await getSubscriber(kv, 400, null);
+      expect(sub.components).toEqual(["API"]);
+    });
+  });
+
+  describe("getSubscribersByType", () => {
+    it("filters by event type", async () => {
+      // Use unique IDs unlikely to collide with other tests
+      await addSubscriber(kv, 50001, null);
+      await updateSubscriberTypes(kv, 50001, null, ["incident"]);
+      await addSubscriber(kv, 50002, null);
+      await updateSubscriberTypes(kv, 50002, null, ["component"]);
+
+      const incident = await getSubscribersByType(kv, "incident");
+      const incidentIds = incident.map((s) => s.chatId);
+      expect(incidentIds).toContain(50001);
+      expect(incidentIds).not.toContain(50002);
+
+      const component = await getSubscribersByType(kv, "component");
+      const componentIds = component.map((s) => s.chatId);
+      expect(componentIds).toContain(50002);
+      expect(componentIds).not.toContain(50001);
+    });
+
+    it("filters by component name", async () => {
+      await addSubscriber(kv, 60001, null);
+      await updateSubscriberComponents(kv, 60001, null, ["API"]);
+      await addSubscriber(kv, 60002, null); // no component filter = all
+
+      const results = await getSubscribersByType(kv, "component", "API");
+      const ids = results.map((s) => s.chatId);
+      expect(ids).toContain(60001);
+      expect(ids).toContain(60002);
+    });
+
+    it("excludes non-matching component filter", async () => {
+      await addSubscriber(kv, 70001, null);
+      await updateSubscriberComponents(kv, 70001, null, ["Console"]);
+
+      const results = await getSubscribersByType(kv, "component", "API");
+      const ids = results.map((s) => s.chatId);
+      expect(ids).not.toContain(70001);
+    });
+
+    it("returns chatId as number", async () => {
+      await addSubscriber(kv, 80001, null);
+      const results = await getSubscribersByType(kv, "incident");
+      const match = results.find((s) => s.chatId === 80001);
+      expect(match).toBeDefined();
+      expect(typeof match.chatId).toBe("number");
+    });
+  });
+});
--- a/test/queue-consumer.test.js
+++ b/test/queue-consumer.test.js
@@ -0,0 +1,79 @@
+import { describe, it, expect, vi, beforeEach } from "vitest";
+import { handleQueue } from "../src/queue-consumer.js";
+
+/**
+ * Create a mock queue message with ack/retry tracking
+ */
+function mockMessage(body) {
+  return {
+    body,
+    ack: vi.fn(),
+    retry: vi.fn(),
+  };
+}
+
+describe("handleQueue", () => {
+  let env;
+
+  beforeEach(() => {
+    env = {
+      BOT_TOKEN: "test-token",
+      claude_status: {
+        delete: vi.fn(),
+      },
+    };
+    vi.restoreAllMocks();
+  });
+
+  it("acks on successful send", async () => {
+    vi.stubGlobal("fetch", vi.fn().mockResolvedValue({ ok: true, status: 200 }));
+    const msg = mockMessage({ chatId: 123, html: "<b>test</b>" });
+    await handleQueue({ messages: [msg] }, env);
+    expect(msg.ack).toHaveBeenCalled();
+    expect(msg.retry).not.toHaveBeenCalled();
+  });
+
+  it("removes subscriber and acks on 403", async () => {
+    vi.stubGlobal("fetch", vi.fn().mockResolvedValue({ ok: false, status: 403 }));
+    const msg = mockMessage({ chatId: 123, threadId: null, html: "<b>test</b>" });
+    await handleQueue({ messages: [msg] }, env);
+    expect(msg.ack).toHaveBeenCalled();
+    expect(env.claude_status.delete).toHaveBeenCalled();
+  });
+
+  it("retries on 429 rate limit", async () => {
+    vi.stubGlobal(
+      "fetch",
+      vi.fn().mockResolvedValue({
+        ok: false,
+        status: 429,
+        headers: new Headers({ "Retry-After": "5" }),
+      })
+    );
+    const msg = mockMessage({ chatId: 123, html: "<b>test</b>" });
+    await handleQueue({ messages: [msg] }, env);
+    expect(msg.retry).toHaveBeenCalled();
+    expect(msg.ack).not.toHaveBeenCalled();
+  });
+
+  it("retries on 5xx server error", async () => {
+    vi.stubGlobal("fetch", vi.fn().mockResolvedValue({ ok: false, status: 502 }));
+    const msg = mockMessage({ chatId: 123, html: "<b>test</b>" });
+    await handleQueue({ messages: [msg] }, env);
+    expect(msg.retry).toHaveBeenCalled();
+    expect(msg.ack).not.toHaveBeenCalled();
+  });
+
+  it("retries on network error", async () => {
+    vi.stubGlobal("fetch", vi.fn().mockRejectedValue(new Error("network fail")));
+    const msg = mockMessage({ chatId: 123, html: "<b>test</b>" });
+    await handleQueue({ messages: [msg] }, env);
+    expect(msg.retry).toHaveBeenCalled();
+  });
+
+  it("skips malformed messages", async () => {
+    const msg = mockMessage({ chatId: null, html: null });
+    await handleQueue({ messages: [msg] }, env);
+    expect(msg.ack).toHaveBeenCalled();
+  });
+});
--- a/test/status-fetcher.test.js
+++ b/test/status-fetcher.test.js
@@ -0,0 +1,63 @@
+import { describe, it, expect } from "vitest";
+import {
+  escapeHtml,
+  humanizeStatus,
+  statusIndicator,
+  formatComponentLine,
+  formatOverallStatus,
+} from "../src/status-fetcher.js";
+
+describe("escapeHtml", () => {
+  it("escapes HTML special chars", () => {
+    expect(escapeHtml('<script>"alert&"</script>')).toBe(
+      "&lt;script&gt;&quot;alert&amp;&quot;&lt;/script&gt;"
+    );
+  });
+
+  it("returns empty string for null/undefined", () => {
+    expect(escapeHtml(null)).toBe("");
+    expect(escapeHtml(undefined)).toBe("");
+  });
+});
+
+describe("humanizeStatus", () => {
+  it("maps known statuses", () => {
+    expect(humanizeStatus("operational")).toBe("Operational");
+    expect(humanizeStatus("major_outage")).toBe("Major Outage");
+    expect(humanizeStatus("resolved")).toBe("Resolved");
+  });
+
+  it("returns raw string for unknown status", () => {
+    expect(humanizeStatus("custom_status")).toBe("custom_status");
+  });
+});
+
+describe("statusIndicator", () => {
+  it("returns green check for operational", () => {
+    expect(statusIndicator("operational")).toBe("\u2705");
+  });
+
+  it("returns question mark for unknown", () => {
+    expect(statusIndicator("unknown_status")).toBe("\u2753");
+  });
+});
+
+describe("formatComponentLine", () => {
+  it("formats component with indicator and escaped name", () => {
+    const line = formatComponentLine({ name: "API", status: "operational" });
+    expect(line).toContain("\u2705");
+    expect(line).toContain("<b>API</b>");
+    expect(line).toContain("Operational");
+  });
+});
+
+describe("formatOverallStatus", () => {
+  it("maps known indicators", () => {
+    expect(formatOverallStatus("none")).toContain("All Systems Operational");
+    expect(formatOverallStatus("critical")).toContain("Critical System Outage");
+  });
+
+  it("returns raw value for unknown indicator", () => {
+    expect(formatOverallStatus("custom")).toBe("custom");
+  });
+});
--- a/vitest.config.js
+++ b/vitest.config.js
@@ -0,0 +1,22 @@
+import { defineConfig } from "vitest/config";
+import { cloudflarePool, cloudflareTest } from "@cloudflare/vitest-pool-workers";
+
+export default defineConfig({
+  plugins: [
+    cloudflareTest({
+      wrangler: { configPath: "./wrangler.jsonc" },
+      miniflare: {
+        // Override remote KV with local-only for tests
+        kvNamespaces: ["claude_status"],
+      },
+    }),
+  ],
+  test: {
+    pool: cloudflarePool({
+      wrangler: { configPath: "./wrangler.jsonc" },
+      miniflare: {
+        kvNamespaces: ["claude_status"],
+      },
+    }),
+  },
+});
--- a/wrangler.jsonc
+++ b/wrangler.jsonc
@@ -25,7 +25,7 @@
 		]
 	},
 	"observability": {
-		"enabled": false,
+		"enabled": true,
 		"head_sampling_rate": 1,
 		"logs": {
 			"enabled": true,