fix: harden webhook reliability, fix bugs, add test suite

- Statuspage webhook always returns 200 to prevent subscriber removal
- Fix parseKvKey returning string chatId instead of number
- Queue consumer retries on Telegram 5xx instead of acking (prevents message loss)
- Fix observability top-level enabled flag (false → true)
- Add defensive null checks for webhook payload body
- Cache Bot instance per isolate to avoid middleware rebuild per request
- Add vitest + @cloudflare/vitest-pool-workers with 31 tests
- Document DLQ and KV sharding as declined features
This commit is contained in:
2026-04-09 10:29:30 +07:00
parent bb8f4dcde8
commit 8c993df72b
15 changed files with 1680 additions and 57 deletions

View File

@@ -78,7 +78,23 @@ Ordered by likelihood of future implementation (top = most likely to revisit).
**Why this rank**: Out of scope. The bot is the product — adding a web frontend changes the project's nature.
### 10. Digest / Quiet Mode
### 10. Dead Letter Queue for Failed Messages
**Idea**: After CF Queues exhausts 3 retries, persist failed messages to KV or a dedicated DLQ for debugging.
**Decision**: Skip. CF Workers already logs all queue consumer errors (including final retry failures) via the observability config. With 100% log sampling and persisted invocation logs, failed messages are visible in the Cloudflare Dashboard. Adding a KV-based DLQ introduces write overhead on every failure and cleanup logic for stale entries — not worth it when logs already provide the same visibility.
**Why this rank**: Logging is sufficient for current scale. Revisit only if log retention (3-day free tier) is too short for debugging patterns.
### 11. KV List Scalability (Subscriber Sharding)
**Idea**: Shard subscriber keys by event type (e.g., `sub:incident:{chatId}`, `sub:component:{chatId}`) to avoid listing all subscribers on every webhook.
**Decision**: Skip. Current `kv.list({ prefix: "sub:" })` pagination works for hundreds of subscribers. Sharding requires a KV schema migration, dual-write logic during transition, and doubles storage for subscribers who want both types. Not justified until `kv.list()` latency or cost becomes measurable.
**Why this rank**: Clear trigger: slow webhook response times at high subscriber counts. Migration path is straightforward when needed.
### 12. Digest / Quiet Mode
**Idea**: Batch notifications into a daily summary instead of instant alerts.