refactor(doantu): swap Workers AI bge-m3 for hosted phow2sim HTTP API

Doantu now mirrors semantle's pre-Workers-AI shape: a thin fetch wrapper
around /random + /similarity on https://phow2sim.sg.miti99.com (overridable
via PHOW2SIM_API_URL). Drops the local Viet22K wordlist + build script —
the service owns vocabulary now. Promotes commands from protected to
public so they show up in Telegram's native / menu.
2026-04-23 11:35:32 +07:00
parent fd5a1d2903
commit 4acc471f6f
9 changed files with 239 additions and 22429 deletions
+30 -30
@@ -1,48 +1,43 @@
# Doantu Module
Vietnamese "đoán từ" (guess-the-word) — same core mechanic as `semantle`,
but targets come from a Vietnamese wordlist and similarity is computed
with a multilingual embedding model. Unlimited guesses per round; solve
on exact match (case-insensitive, diacritic-sensitive).
but targets + similarity come from a Vietnamese-tuned embedding service.
Unlimited guesses per round; solve on exact match (case-insensitive,
diacritic-sensitive).
**Visibility: `protected`** — commands appear in `/help` but are hidden
from Telegram's native `/` autocomplete menu while the module is still
experimental.
**Visibility: `public`** — commands appear in both `/help` and Telegram's
native `/` autocomplete menu.
## Commands
| Command | Visibility | Description |
|---------|-----------|-------------|
| `/doantu` | protected | Show current board or submit a word guess |
| `/doantu_giveup` | protected | Reveal the answer and end the round (next `/doantu` starts a fresh one) |
| `/doantu_stats` | protected | Show per-subject stats |
| `/doantu` | public | Show current board or submit a word guess |
| `/doantu_giveup` | public | Reveal the answer and end the round (next `/doantu` starts a fresh one) |
| `/doantu_stats` | public | Show per-subject stats |
Submit with `/doantu <word>` (e.g. `/doantu con chó`). Multi-syllable words
with single spaces between them are accepted. `cá` and `ca` are different
targets.
targets. Out-of-vocabulary words don't count toward the guess tally.
Repeating a prior guess replies with a `🔁 already guessed` notice and is
ignored (no cost, no stat inflation).
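The acceptance rules above (case-insensitive, diacritic-sensitive, single spaces between syllables) can be sketched roughly as follows. This is an illustration, not the module's actual `lookup.js`: the names `normalizeGuess` and `GUESS_SHAPE` are hypothetical, and trimming plus NFC normalization are assumptions about how equivalent inputs are made comparable.

```javascript
// Hypothetical sketch; the real lookup.js may differ.
// One or more runs of Unicode letters/combining marks, separated by single spaces.
const GUESS_SHAPE = /^[\p{L}\p{M}]+(?: [\p{L}\p{M}]+)*$/u;

// Returns the normalized guess, or null when the shape is rejected.
function normalizeGuess(raw) {
  // Assumption: trim + NFC so composed and decomposed diacritics compare equal.
  const text = raw.trim().normalize("NFC").toLowerCase();
  return GUESS_SHAPE.test(text) ? text : null;
}

console.log(normalizeGuess("Con Chó"));   // lowercased, diacritics kept
console.log(normalizeGuess("con  chó"));  // null: double space is rejected
console.log(normalizeGuess("cá") === normalizeGuess("ca")); // false: different words
```

Lowercasing handles case-insensitivity, while diacritics are deliberately left untouched so that `cá` and `ca` stay distinct.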
## Data source
**Target + vocabulary:** [duyet/vietnamese-wordlist](https://github.com/duyet/vietnamese-wordlist)'s
Viet22K list (~22k entries), lowercased and deduped. The same list is
both the target pool and the vocabulary — OOV detection is `Set.has()`
with no upstream call. License: GPL-2.0 (Ho Ngoc Duc).
Regenerate with `node scripts/build-doantu-words.js`.
Target words + similarity scores come from our self-hosted **phow2sim**
instance (default: `https://phow2sim.sg.miti99.com`). Wraps two endpoints:
**Similarity:** `@cf/baai/bge-m3` multilingual text embeddings via the
`env.AI` binding. Chosen over the English-only `bge-small-en-v1.5`
because that model's tokenizer shreds Vietnamese diacritics into noisy
byte-level subwords. Each in-vocab guess runs one inference call
batching target + guess (1024-dim vectors); the module scores them with
local cosine similarity.
- `GET /random` — pick a secret Vietnamese word at round start.
- `GET /similarity?a=…&b=…` — cosine similarity + canonical forms +
`in_vocab_a` / `in_vocab_b` flags.
Override the base URL for local dev via `PHOW2SIM_API_URL`.
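A minimal sketch of what a wrapper over these two endpoints could look like, assuming a runtime with global `fetch` and `AbortSignal.timeout` (Cloudflare Workers, Node 18+). The `createClient(env, options)` signature, the `fetchImpl` option, and the 200-character body snippet are illustrative assumptions. The 5 s timeout and the `UpstreamError` carrying status and body follow the architecture notes in this README.

```javascript
// Sketch only; the real api-client.js may be shaped differently.
const DEFAULT_BASE_URL = "https://phow2sim.sg.miti99.com";

class UpstreamError extends Error {
  constructor(message, { status = 0, body = "" } = {}) {
    super(message);
    this.name = "UpstreamError";
    this.status = status; // HTTP status, or 0 for network/timeout failures
    this.body = body;     // short snippet of the response body
  }
}

// fetchImpl is a hypothetical injection point so tests can stub the network.
function createClient(env = {}, { fetchImpl = fetch, timeoutMs = 5000 } = {}) {
  const base = (env.PHOW2SIM_API_URL || DEFAULT_BASE_URL).replace(/\/+$/, "");

  async function getJson(path, params = {}) {
    const url = new URL(base + path);
    for (const [key, value] of Object.entries(params)) {
      url.searchParams.set(key, value); // percent-encodes Vietnamese text
    }
    let res;
    try {
      res = await fetchImpl(url, { signal: AbortSignal.timeout(timeoutMs) });
    } catch (err) {
      throw new UpstreamError(`request failed: ${err.message}`);
    }
    if (!res.ok) {
      const body = (await res.text()).slice(0, 200);
      throw new UpstreamError(`HTTP ${res.status}`, { status: res.status, body });
    }
    return res.json();
  }

  return {
    randomWord: () => getJson("/random"),
    similarity: (a, b) => getJson("/similarity", { a, b }),
  };
}
```

For local development, something like `createClient({ PHOW2SIM_API_URL: "http://localhost:8000" })` would point the same wrapper at a local instance; the port here is only an example.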
## Architecture
- `api-client.js` — Workers AI wrapper: `randomWord()` picks from the
local pool, `similarity(a, b)` calls `env.AI.run()` and returns
`{ in_vocab_b, similarity }`. `UpstreamError` on inference failure.
- `words-data.js` — auto-generated Viet22K dictionary.
- `wordlist.js` — one-function module exposing `randomLine()`.
- `api-client.js` — thin `fetch` wrapper around `/random` and
`/similarity`. 5 s timeout; `UpstreamError` carries HTTP status + body
snippet on failure.
- `state.js` — KV persistence for game + stats. Same shape as semantle.
- `lookup.js` — guess normalization + shape validation. Accepts Unicode
letters + combining marks + single internal spaces.
@@ -50,6 +45,9 @@ local cosine similarity.
to semantle/format.js — score display is language-agnostic).
- `render.js` — Telegram HTML `<pre>` monospace board with a 🇻🇳 header.
- `handlers.js` — subject resolution + the three command entry points.
Fast-path dedup (exact text OR prior canonical) skips wasted API calls
on repeat guesses; post-API dedup catches different inputs that
canonicalize to the same token.
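
The two dedup stages described for `handlers.js` could look roughly like the sketch below. The function names and the result `kind` values are hypothetical, and `canonical_b` is an assumed field name for the canonical form of the second word; this README only states that `/similarity` returns canonical forms, not what they are called.

```javascript
// Hypothetical sketch of the two-stage dedup; not the actual handlers.js.

// Fast path: matches the exact text typed before OR a previously seen canonical form.
function findPriorGuess(guesses, text) {
  return guesses.find((g) => g.word === text || g.canonical === text) ?? null;
}

async function scoreGuess(game, text, client) {
  if (findPriorGuess(game.guesses, text)) {
    return { kind: "duplicate" }; // no API call spent on a repeat
  }

  const res = await client.similarity(game.target, text);
  if (!res.in_vocab_b) {
    return { kind: "oov" }; // out-of-vocabulary: not added to the tally
  }

  // Assumption: canonical_b is the service's canonical form of the guess.
  const canonical = res.canonical_b ?? text;
  // Post-API check: a different input may canonicalize to an earlier guess.
  if (game.guesses.some((g) => g.canonical === canonical)) {
    return { kind: "duplicate" };
  }

  game.guesses.push({ word: text, canonical, similarity: res.similarity });
  return { kind: "scored", similarity: res.similarity };
}
```

Checking locally first keeps repeat guesses free, and the second check is only reachable after the service has revealed the canonical form.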
Near-clone of the semantle sibling — kept separate per the repo's
one-module-per-game convention rather than factoring out a shared base.
@@ -65,13 +63,15 @@ KV namespace prefix: `doantu:`
| `game:<subject>` | `{ target, startedAt, solved, guesses[] }` — active round (TTL 7 days). |
| `stats:<subject>` | `{ played, solved, totalGuesses, bestGuessCount, lastResultAt }` |
Each `guesses[]` entry is `{ word, canonical, similarity }`.
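
As a rough illustration of the record shapes above, the sketch below persists a round through a Cloudflare KV binding with a 7-day TTL. The helper names are hypothetical, `startedAt` is assumed to be a millisecond timestamp, and the `doantu:` prefix is assumed to be part of the key string itself rather than applied by a surrounding framework.

```javascript
// Hypothetical sketch of game persistence; the real state.js may differ.
const PREFIX = "doantu:";                  // assumption: prefix is part of the key
const GAME_TTL_SECONDS = 7 * 24 * 60 * 60; // 7 days

const gameKey = (subject) => `${PREFIX}game:${subject}`;

// Fresh round in the documented shape.
function newGame(target, now = Date.now()) {
  return { target, startedAt: now, solved: false, guesses: [] };
}

// kv is a Workers KV binding (or any object exposing the same put/get calls).
async function saveGame(kv, subject, game) {
  await kv.put(gameKey(subject), JSON.stringify(game), {
    expirationTtl: GAME_TTL_SECONDS,
  });
}

async function loadGame(kv, subject) {
  // Resolves to null when the key is missing or has expired.
  return kv.get(gameKey(subject), "json");
}
```

Because the TTL is set on every write, an abandoned round expires on its own seven days after the last saved guess.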
## Config
No env vars. Model defaults to `@cf/baai/bge-m3`; override with
`createClient(env.AI, { model: "..." })` in a test or alternative deploy.
| Env var | Default | Purpose |
|---------|---------|---------|
| `PHOW2SIM_API_URL` | `https://phow2sim.sg.miti99.com` | Base URL for the phow2sim service. |
## Credits
- Embeddings: [`@cf/baai/bge-m3`](https://developers.cloudflare.com/workers-ai/models/bge-m3/) on Cloudflare Workers AI (multilingual).
- Wordlist: [duyet/vietnamese-wordlist](https://github.com/duyet/vietnamese-wordlist) by Ho Ngoc Duc (GPL-2.0).
- Similarity backend: self-hosted `phow2sim` (Vietnamese word2vec/PhoBERT-style).
- Game concept: [Semantle](https://semantle.com/) by David Turner.