diff --git a/docs/feature-decisions.md b/docs/feature-decisions.md index 4aa5e34..fb10b58 100644 --- a/docs/feature-decisions.md +++ b/docs/feature-decisions.md @@ -6,7 +6,32 @@ Ordered by likelihood of future implementation (top = most likely to revisit). ## Declined Features -### 1. Admin Commands (/stats) +### 1. Fan-Out Decoupling (Two-Phase Queue) + +**Idea**: Webhook handler enqueues a single "dispatch" message; queue consumer lists subscribers and re-enqueues individual "deliver" messages. Converts O(N) webhook handler to O(1). + +**Decision**: Skip. Current subscriber count is small. The webhook handler completing in one pass is simpler to reason about and debug. Adding a two-phase queue introduces message type routing, a new queue message schema, and makes the data flow harder to follow — all for a scaling problem that doesn't exist yet. + +**Why this rank**: Clear trigger: webhook response times or CPU usage climbing in CF dashboard. Straightforward to implement when needed. + +### 2. Queue Message Idempotency Keys + +**Idea**: Include `{ incidentId, chatId }` hash as dedup key. Check short-TTL KV key before sending to prevent duplicate delivery on queue retries. + +**Decision**: Skip. Duplicate notifications are a minor UX annoyance, not a correctness issue. Adding a KV read+write per message doubles KV operations in the queue consumer for a rare edge case (crash between successful Telegram send and `msg.ack()`). CF Queues retry is already bounded to 3 attempts. + +**Why this rank**: Only worth it if users report duplicate notifications as a real problem. + +### 3. /ping Command + +**Idea**: Bot replies with worker region + timestamp for liveness check. + +**Decision**: Skip. `/status` already proves the bot is alive (it fetches from external API and replies). A dedicated `/ping` adds another command for marginal value. The web health check endpoint (`GET /`) serves the same purpose for monitoring. + +**Why this rank**: Trivial to add but not useful enough to justify another command. + +### 4. Admin Commands (/stats) + **Idea**: `/stats` to show subscriber count, recent webhook events (useful for bot operator). @@ -14,7 +39,7 @@ Ordered by likelihood of future implementation (top = most likely to revisit). **Why highest**: Low effort, no architectural changes. Just a new command + `kv.list()` count. First thing to add if the bot grows. -### 2. Webhook HMAC Signature Verification +### 5. Webhook HMAC Signature Verification **Idea**: Verify Statuspage webhook payloads using HMAC signatures as a second auth layer beyond URL secret. @@ -22,7 +47,7 @@ Ordered by likelihood of future implementation (top = most likely to revisit). **Why this rank**: Not blocked by effort — blocked by platform. Would be implemented immediately if Atlassian ships HMAC support. -### 3. Proactive Rate Limit Tracking +### 6. Proactive Rate Limit Tracking **Idea**: Track per-chat message counts to stay within Telegram's rate limits proactively. @@ -30,7 +55,7 @@ Ordered by likelihood of future implementation (top = most likely to revisit). **Why this rank**: Becomes necessary at scale. Clear trigger: frequent 429 errors in logs. -### 4. Status Change Deduplication +### 7. Status Change Deduplication **Idea**: If a component flaps (operational → degraded → operational in 2 minutes), debounce into one message. @@ -38,7 +63,7 @@ Ordered by likelihood of future implementation (top = most likely to revisit). **Why this rank**: Useful if flapping becomes noisy. Moderate effort with clear user-facing benefit. -### 5. Inline Keyboard for /subscribe +### 8. Inline Keyboard for /subscribe **Idea**: Replace text commands with clickable buttons using grammY's inline keyboard support. @@ -46,7 +71,7 @@ Ordered by likelihood of future implementation (top = most likely to revisit). **Why this rank**: Nice UX polish but not functional gap. grammY supports it well — moderate effort. -### 6. Scheduled Status Digest +### 9. Scheduled Status Digest **Idea**: CF Workers `scheduled` cron trigger sends a daily "all clear" or summary to subscribers. @@ -54,7 +79,7 @@ Ordered by likelihood of future implementation (top = most likely to revisit). **Why this rank**: Low user value. Only useful if users explicitly request daily summaries. -### 7. Mute Command (/mute \) +### 10. Mute Command (/mute \) **Idea**: Temporarily pause notifications without unsubscribing (e.g., `/mute 2h`). @@ -62,7 +87,7 @@ Ordered by likelihood of future implementation (top = most likely to revisit). **Why this rank**: Contradicts real-time purpose. `/stop` + `/start` is sufficient. -### 8. Multi-Language Support +### 11. Multi-Language Support **Idea**: At minimum English/Vietnamese support. @@ -70,7 +95,7 @@ Ordered by likelihood of future implementation (top = most likely to revisit). **Why this rank**: Source data is English-only. Translating bot chrome while incidents stay English creates a mixed-language experience. -### 9. Web Dashboard +### 12. Web Dashboard **Idea**: Replace the `/` health check with a status page showing subscriber count and recent webhook events. @@ -78,7 +103,7 @@ Ordered by likelihood of future implementation (top = most likely to revisit). **Why this rank**: Out of scope. The bot is the product — adding a web frontend changes the project's nature. -### 10. Dead Letter Queue for Failed Messages +### 13. Dead Letter Queue for Failed Messages **Idea**: After CF Queues exhausts 3 retries, persist failed messages to KV or a dedicated DLQ for debugging. @@ -86,7 +111,7 @@ Ordered by likelihood of future implementation (top = most likely to revisit). **Why this rank**: Logging is sufficient for current scale. Revisit only if log retention (3-day free tier) is too short for debugging patterns. -### 11. KV List Scalability (Subscriber Sharding) +### 14. KV List Scalability (Subscriber Sharding) **Idea**: Shard subscriber keys by event type (e.g., `sub:incident:{chatId}`, `sub:component:{chatId}`) to avoid listing all subscribers on every webhook. @@ -94,7 +119,7 @@ Ordered by likelihood of future implementation (top = most likely to revisit). **Why this rank**: Clear trigger: slow webhook response times at high subscriber counts. Migration path is straightforward when needed. -### 12. Digest / Quiet Mode +### 15. Digest / Quiet Mode **Idea**: Batch notifications into a daily summary instead of instant alerts. diff --git a/src/bot-commands.js b/src/bot-commands.js index fad2a79..1f8693c 100644 --- a/src/bot-commands.js +++ b/src/bot-commands.js @@ -1,3 +1,5 @@ +/** @import { ChatTarget } from "./types.js" */ + import { Bot, webhookCallback } from "grammy"; import { addSubscriber, @@ -17,6 +19,7 @@ let kv = null; /** * Extract chatId and threadId from grammY context + * @returns {ChatTarget} */ function getChatTarget(ctx) { return { diff --git a/src/bot-info-commands.js b/src/bot-info-commands.js index 7b81f3b..6d3d79e 100644 --- a/src/bot-info-commands.js +++ b/src/bot-info-commands.js @@ -69,7 +69,7 @@ export function registerInfoCommands(bot) { ); } } catch (err) { - console.error("status command error:", err); + console.error(JSON.stringify({ event: "command_error", command: "status", error: err.message })); await ctx.reply("Unable to fetch status. Please try again later."); } }); @@ -91,7 +91,7 @@ export function registerInfoCommands(bot) { { parse_mode: "HTML", disable_web_page_preview: true } ); } catch (err) { - console.error("history command error:", err); + console.error(JSON.stringify({ event: "command_error", command: "history", error: err.message })); await ctx.reply("Unable to fetch incident history. Please try again later."); } }); @@ -116,7 +116,7 @@ export function registerInfoCommands(bot) { { parse_mode: "HTML", disable_web_page_preview: true } ); } catch (err) { - console.error("uptime command error:", err); + console.error(JSON.stringify({ event: "command_error", command: "uptime", error: err.message })); await ctx.reply("Unable to fetch uptime data. Please try again later."); } }); diff --git a/src/kv-store.js b/src/kv-store.js index 3cd8415..389a045 100644 --- a/src/kv-store.js +++ b/src/kv-store.js @@ -1,3 +1,5 @@ +/** @import { Subscriber, ChatTarget } from "./types.js" */ + const KEY_PREFIX = "sub:"; /** @@ -54,6 +56,10 @@ async function listAllSubscriberKeys(kv) { /** * Add or re-subscribe a user. Preserves existing types and components if already subscribed. + * @param {KVNamespace} kv + * @param {number} chatId + * @param {?number} threadId + * @param {string[]} types */ export async function addSubscriber(kv, chatId, threadId, types = ["incident", "component"]) { const key = buildKvKey(chatId, threadId); @@ -105,6 +111,10 @@ export async function updateSubscriberComponents(kv, chatId, threadId, component /** * Get a single subscriber's data, or null if not subscribed + * @param {KVNamespace} kv + * @param {number} chatId + * @param {?number} threadId + * @returns {Promise} */ export async function getSubscriber(kv, chatId, threadId) { const key = buildKvKey(chatId, threadId); @@ -114,6 +124,10 @@ export async function getSubscriber(kv, chatId, threadId) { /** * Get subscribers filtered by event type and optional component name. * Uses KV metadata from list() — O(1) list call, no individual get() needed. + * @param {KVNamespace} kv + * @param {string} eventType + * @param {?string} componentName + * @returns {Promise} */ export async function getSubscribersByType(kv, eventType, componentName = null) { const keys = await listAllSubscriberKeys(kv); diff --git a/src/queue-consumer.js b/src/queue-consumer.js index 0f564c5..da119f3 100644 --- a/src/queue-consumer.js +++ b/src/queue-consumer.js @@ -1,8 +1,12 @@ +/** @import { QueueMessage } from "./types.js" */ + import { removeSubscriber } from "./kv-store.js"; import { telegramUrl } from "./telegram-api.js"; /** * Process a batch of queued messages, sending each to Telegram. * Handles rate limits (429 → retry), blocked bots (403/400 → remove subscriber). + * @param {{ messages: Array<{ body: QueueMessage, ack: () => void, retry: () => void }> }} batch + * @param {object} env */ export async function handleQueue(batch, env) { let sent = 0, failed = 0, retried = 0, removed = 0; @@ -12,7 +16,7 @@ export async function handleQueue(batch, env) { // Defensive check for malformed messages if (!chatId || !html) { - console.error("Queue: malformed message, skipping", msg.body); + console.error(JSON.stringify({ event: "queue_skip", reason: "malformed", body: msg.body })); msg.ack(); continue; } @@ -37,32 +41,32 @@ export async function handleQueue(batch, env) { sent++; msg.ack(); } else if (res.status === 403 || res.status === 400) { - console.log(`Queue: removing subscriber ${chatId}:${threadId} (HTTP ${res.status})`); + console.log(JSON.stringify({ event: "queue_remove", chatId, threadId, status: res.status })); await removeSubscriber(env.claude_status, chatId, threadId); removed++; msg.ack(); } else if (res.status === 429) { const retryAfter = res.headers.get("Retry-After"); - console.log(`Queue: rate limited for ${chatId}, Retry-After: ${retryAfter ?? "unknown"}`); + console.log(JSON.stringify({ event: "queue_ratelimit", chatId, retryAfter })); retried++; msg.retry(); } else if (res.status >= 500) { - console.error(`Queue: Telegram 5xx (${res.status}) for ${chatId}, retrying`); + console.error(JSON.stringify({ event: "queue_retry", chatId, status: res.status })); retried++; msg.retry(); } else { - console.error(`Queue: unexpected HTTP ${res.status} for ${chatId}`); + console.error(JSON.stringify({ event: "queue_error", chatId, status: res.status })); failed++; msg.ack(); } } catch (err) { - console.error("Queue: network error, retrying", err); + console.error(JSON.stringify({ event: "queue_network_error", chatId, error: err.message })); retried++; msg.retry(); } } if (sent || failed || retried || removed) { - console.log(`Queue batch: sent=${sent} failed=${failed} retried=${retried} removed=${removed}`); + console.log(JSON.stringify({ event: "queue_batch", sent, failed, retried, removed })); } } diff --git a/src/statuspage-webhook.js b/src/statuspage-webhook.js index 44add6f..d7a8cb8 100644 --- a/src/statuspage-webhook.js +++ b/src/statuspage-webhook.js @@ -40,9 +40,15 @@ function formatComponentMessage(component, update) { export async function handleStatuspageWebhook(c) { try { // Validate URL secret (timing-safe) + // Guard against misconfigured deploy (undefined env var) + if (!c.env.WEBHOOK_SECRET) { + console.error(JSON.stringify({ event: "webhook_error", reason: "WEBHOOK_SECRET not configured" })); + return c.text("OK", 200); + } + const secret = c.req.param("secret"); if (!await timingSafeEqual(secret, c.env.WEBHOOK_SECRET)) { - console.error("Statuspage webhook: invalid secret"); + console.error(JSON.stringify({ event: "webhook_error", reason: "invalid_secret" })); return c.text("OK", 200); } @@ -51,37 +57,37 @@ export async function handleStatuspageWebhook(c) { try { body = await c.req.json(); } catch { - console.error("Statuspage webhook: invalid JSON body"); + console.error(JSON.stringify({ event: "webhook_error", reason: "invalid_json" })); return c.text("OK", 200); } const eventType = body?.meta?.event_type; if (!eventType) { - console.error("Statuspage webhook: missing event_type"); + console.error(JSON.stringify({ event: "webhook_error", reason: "missing_event_type" })); return c.text("OK", 200); } - console.log(`Statuspage webhook: ${eventType}`); + console.log(JSON.stringify({ event: "webhook_received", eventType })); // Determine category and format message let category, html, componentName; if (eventType.startsWith("incident.")) { if (!body.incident) { - console.error("Statuspage webhook: incident event missing incident data"); + console.error(JSON.stringify({ event: "webhook_error", reason: "missing_incident_data", eventType })); return c.text("OK", 200); } category = "incident"; html = formatIncidentMessage(body.incident); } else if (eventType.startsWith("component.")) { if (!body.component) { - console.error("Statuspage webhook: component event missing component data"); + console.error(JSON.stringify({ event: "webhook_error", reason: "missing_component_data", eventType })); return c.text("OK", 200); } category = "component"; componentName = body.component.name || null; html = formatComponentMessage(body.component, body.component_update); } else { - console.error(`Statuspage webhook: unknown event type ${eventType}`); + console.error(JSON.stringify({ event: "webhook_error", reason: "unknown_event_type", eventType })); return c.text("OK", 200); } @@ -96,11 +102,11 @@ export async function handleStatuspageWebhook(c) { await c.env["claude-status"].sendBatch(messages.slice(i, i + 100)); } - console.log(`Enqueued ${messages.length} messages for ${category}${componentName ? `:${componentName}` : ""}`); + console.log(JSON.stringify({ event: "webhook_enqueued", category, componentName, count: messages.length })); return c.text("OK", 200); } catch (err) { // Catch-all: log error but still return 200 to prevent Statuspage from removing us - console.error("Statuspage webhook: unexpected error", err); + console.error(JSON.stringify({ event: "webhook_error", reason: "unexpected", error: err.message })); return c.text("OK", 200); } } diff --git a/src/types.js b/src/types.js new file mode 100644 index 0000000..e1c8df0 --- /dev/null +++ b/src/types.js @@ -0,0 +1,21 @@ +/** + * Shared JSDoc type definitions for the project. + * Import via: @import { Subscriber, QueueMessage, ChatTarget } from "./types.js" + */ + +/** + * Subscriber preferences stored in KV value and metadata + * @typedef {{ types: string[], components: string[] }} Subscriber + */ + +/** + * Message body enqueued to CF Queue for fan-out delivery + * @typedef {{ chatId: number, threadId: ?number, html: string }} QueueMessage + */ + +/** + * Chat target extracted from Telegram update + * @typedef {{ chatId: number, threadId: ?number }} ChatTarget + */ + +export {};