fix: harden webhook reliability, fix bugs, add test suite

- Statuspage webhook always returns 200 to prevent subscriber removal - Fix parseKvKey returning string chatId instead of number - Queue consumer retries on Telegram 5xx instead of acking (prevents message loss) - Fix observability top-level enabled flag (false → true) - Add defensive null checks for webhook payload body - Cache Bot instance per isolate to avoid middleware rebuild per request - Add vitest + @cloudflare/vitest-pool-workers with 31 tests - Document DLQ and KV sharding as declined features
2026-04-17 13:21:01 +00:00 · 2026-04-09 10:29:30 +07:00
parent bb8f4dcde8
commit 8c993df72b
15 changed files with 1680 additions and 57 deletions
--- a/docs/feature-decisions.md
+++ b/docs/feature-decisions.md
@@ -78,7 +78,23 @@ Ordered by likelihood of future implementation (top = most likely to revisit).

 **Why this rank**: Out of scope. The bot is the product — adding a web frontend changes the project's nature.

-### 10. Digest / Quiet Mode
+### 10. Dead Letter Queue for Failed Messages
+
+**Idea**: After CF Queues exhausts 3 retries, persist failed messages to KV or a dedicated DLQ for debugging.
+
+**Decision**: Skip. CF Workers already logs all queue consumer errors (including final retry failures) via the observability config. With 100% log sampling and persisted invocation logs, failed messages are visible in the Cloudflare Dashboard. Adding a KV-based DLQ introduces write overhead on every failure and cleanup logic for stale entries — not worth it when logs already provide the same visibility.
+
+**Why this rank**: Logging is sufficient for current scale. Revisit only if log retention (3-day free tier) is too short for debugging patterns.
+
+### 11. KV List Scalability (Subscriber Sharding)
+
+**Idea**: Shard subscriber keys by event type (e.g., `sub:incident:{chatId}`, `sub:component:{chatId}`) to avoid listing all subscribers on every webhook.
+
+**Decision**: Skip. Current `kv.list({ prefix: "sub:" })` pagination works for hundreds of subscribers. Sharding requires a KV schema migration, dual-write logic during transition, and doubles storage for subscribers who want both types. Not justified until `kv.list()` latency or cost becomes measurable.
+
+**Why this rank**: Clear trigger: slow webhook response times at high subscriber counts. Migration path is straightforward when needed.
+
+### 12. Digest / Quiet Mode

 **Idea**: Batch notifications into a daily summary instead of instant alerts.