mirror of https://github.com/tiennm99/miti99bot.git (synced 2026-04-28 12:20:42 +00:00)
refactor(doantu): swap Workers AI bge-m3 for hosted phow2sim HTTP API
Doantu now mirrors semantle's pre-Workers-AI shape: a thin fetch wrapper around /random + /similarity on https://phow2sim.sg.miti99.com (overridable via PHOW2SIM_API_URL). Drops the local Viet22K wordlist + build script — the service owns vocabulary now. Promotes commands from protected to public so they show up in Telegram's native / menu.
@@ -1,48 +1,43 @@
 # Doantu Module

 Vietnamese "đoán từ" (guess-the-word) — same core mechanic as `semantle`,
-but targets come from a Vietnamese wordlist and similarity is computed
-with a multilingual embedding model. Unlimited guesses per round; solve
-on exact match (case-insensitive, diacritic-sensitive).
+but targets + similarity come from a Vietnamese-tuned embedding service.
+Unlimited guesses per round; solve on exact match (case-insensitive,
+diacritic-sensitive).

-**Visibility: `protected`** — commands appear in `/help` but are hidden
-from Telegram's native `/` autocomplete menu while the module is still
-experimental.
+**Visibility: `public`** — commands appear in both `/help` and Telegram's
+native `/` autocomplete menu.

 ## Commands

 | Command | Visibility | Description |
 |---------|-----------|-------------|
-| `/doantu` | protected | Show current board or submit a word guess |
-| `/doantu_giveup` | protected | Reveal the answer and end the round (next `/doantu` starts a fresh one) |
-| `/doantu_stats` | protected | Show per-subject stats |
+| `/doantu` | public | Show current board or submit a word guess |
+| `/doantu_giveup` | public | Reveal the answer and end the round (next `/doantu` starts a fresh one) |
+| `/doantu_stats` | public | Show per-subject stats |

 Submit with `/doantu <word>` (e.g. `/doantu con chó`). Multi-syllable words
 with single spaces between them are accepted. `cá` and `ca` are different
-targets.
+targets. Out-of-vocabulary words don't count toward the guess tally.
+Repeating a prior guess replies with a `🔁 already guessed` notice and is
+ignored (no cost, no stat inflation).

 ## Data source

-**Target + vocabulary:** [duyet/vietnamese-wordlist](https://github.com/duyet/vietnamese-wordlist)'s
-Viet22K list (~22k entries), lowercased and deduped. The same list is
-both the target pool and the vocabulary — OOV detection is `Set.has()`
-with no upstream call. License: GPL-2.0 (Ho Ngoc Duc).
-Regenerate with `node scripts/build-doantu-words.js`.
+Target words + similarity scores come from our self-hosted **phow2sim**
+instance (default: `https://phow2sim.sg.miti99.com`). Wraps two endpoints:

-**Similarity:** `@cf/baai/bge-m3` multilingual text embeddings via the
-`env.AI` binding. Chosen over the English-only `bge-small-en-v1.5`
-because that model's tokenizer shreds Vietnamese diacritics into noisy
-byte-level subwords. Each in-vocab guess runs one inference call
-batching target + guess (1024-dim vectors); the module scores them with
-local cosine similarity.
+- `GET /random` — pick a secret Vietnamese word at round start.
+- `GET /similarity?a=…&b=…` — cosine similarity + canonical forms +
+`in_vocab_a` / `in_vocab_b` flags.
+
+Override the base URL for local dev via `PHOW2SIM_API_URL`.

 ## Architecture

-- `api-client.js` — Workers AI wrapper: `randomWord()` picks from the
-local pool, `similarity(a, b)` calls `env.AI.run()` and returns
-`{ in_vocab_b, similarity }`. `UpstreamError` on inference failure.
-- `words-data.js` — auto-generated Viet22K dictionary.
-- `wordlist.js` — one-function module exposing `randomLine()`.
+- `api-client.js` — thin `fetch` wrapper around `/random` and
+`/similarity`. 5 s timeout; `UpstreamError` carries HTTP status + body
+snippet on failure.
 - `state.js` — KV persistence for game + stats. Same shape as semantle.
 - `lookup.js` — guess normalization + shape validation. Accepts Unicode
 letters + combining marks + single internal spaces.
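The `lookup.js` rule at the end of this hunk (Unicode letters, combining marks, single internal spaces, case-insensitive but diacritic-sensitive) could be sketched as below. This is an illustration only, not the repository's code: the name `normalizeGuess`, the `null` return for rejected input, and the NFC normalization step are all assumptions.

```javascript
// Hypothetical helper, not taken from the repository's lookup.js.
// Shape: one or more runs of letters/combining marks, separated by exactly one space.
const GUESS_SHAPE = /^[\p{L}\p{M}]+(?: [\p{L}\p{M}]+)*$/u;

function normalizeGuess(raw) {
  // Assumption: compose to NFC so "a" + U+0301 and the precomposed "á" compare equal.
  // Lowercasing makes matching case-insensitive; diacritics are left untouched.
  const text = String(raw).trim().normalize("NFC").toLowerCase();
  // Assumption: malformed input (digits, punctuation, doubled spaces) is rejected
  // with null rather than repaired.
  return GUESS_SHAPE.test(text) ? text : null;
}
```

Under these assumptions `normalizeGuess("Con Chó")` yields `"con chó"`, while `cá` and `ca` stay distinct, matching the "different targets" note in the Commands section.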
@@ -50,6 +45,9 @@ local cosine similarity.
 to semantle/format.js — score display is language-agnostic).
 - `render.js` — Telegram HTML `<pre>` monospace board with a 🇻🇳 header.
 - `handlers.js` — subject resolution + the three command entry points.
+Fast-path dedup (exact text OR prior canonical) skips wasted API calls
+on repeat guesses; post-API dedup catches different inputs that
+canonicalize to the same token.

 Near-clone of the semantle sibling — kept separate per the repo's
 one-module-per-game convention rather than factoring out a shared base.
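The two dedup passes added to `handlers.js` in this hunk might look roughly like the sketch below. It assumes each stored guess has the `{ word, canonical, similarity }` shape from the state docs. The function names, the `lookup` callback, and the returned status objects are hypothetical and not taken from the repository.

```javascript
// Hypothetical sketch of the two dedup passes; all names here are assumptions.

// Fast path: the raw text matches an earlier input or an earlier canonical form,
// so no API call is needed.
function isRepeatBeforeApi(guesses, text) {
  return guesses.some((g) => g.word === text || g.canonical === text);
}

// Post-API: a different input that the service canonicalized to a known token.
function isRepeatAfterApi(guesses, canonical) {
  return guesses.some((g) => g.canonical === canonical);
}

// `lookup` stands in for the similarity call and is assumed to resolve to
// { canonical, similarity } for the guessed text.
async function submitGuess(game, text, lookup) {
  if (isRepeatBeforeApi(game.guesses, text)) {
    return { status: "repeat", apiCalled: false };
  }
  const { canonical, similarity } = await lookup(text);
  if (isRepeatAfterApi(game.guesses, canonical)) {
    return { status: "repeat", apiCalled: true };
  }
  game.guesses.push({ word: text, canonical, similarity });
  return { status: "recorded", apiCalled: true };
}
```

In this sketch a repeat never appends to `guesses[]`, which is one way to get the "no cost, no stat inflation" behavior the README describes.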
@@ -65,13 +63,15 @@ KV namespace prefix: `doantu:`
 | `game:<subject>` | `{ target, startedAt, solved, guesses[] }` — active round (TTL 7 days). |
 | `stats:<subject>` | `{ played, solved, totalGuesses, bestGuessCount, lastResultAt }` |

+Each `guesses[]` entry is `{ word, canonical, similarity }`.
+
 ## Config

-No env vars. Model defaults to `@cf/baai/bge-m3`; override with
-`createClient(env.AI, { model: "..." })` in a test or alternative deploy.
+| Env var | Default | Purpose |
+|---------|---------|---------|
+| `PHOW2SIM_API_URL` | `https://phow2sim.sg.miti99.com` | Base URL for the phow2sim service. |

 ## Credits

-- Embeddings: [`@cf/baai/bge-m3`](https://developers.cloudflare.com/workers-ai/models/bge-m3/) on Cloudflare Workers AI (multilingual).
-- Wordlist: [duyet/vietnamese-wordlist](https://github.com/duyet/vietnamese-wordlist) by Ho Ngoc Duc (GPL-2.0).
+- Similarity backend: self-hosted `phow2sim` (Vietnamese word2vec/PhoBERT-style).
 - Game concept: [Semantle](https://semantle.com/) by David Turner.
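For orientation, the `PHOW2SIM_API_URL` override and the two endpoints this commit introduces could be wired together roughly as follows. This is a minimal sketch, not the actual `api-client.js`. The `createClient` signature, the injectable `fetchImpl` parameter, the `UpstreamError` field names, and the 200-character body snippet are assumptions. Only the endpoint paths, the default URL, the 5 s timeout, and the "status + body snippet" behavior come from the README text in the diff.

```javascript
// Hypothetical client sketch; names and shapes are illustrative assumptions.
const DEFAULT_BASE_URL = "https://phow2sim.sg.miti99.com";

class UpstreamError extends Error {
  constructor(status, bodySnippet) {
    super(`phow2sim upstream error ${status}: ${bodySnippet}`);
    this.name = "UpstreamError";
    this.status = status; // HTTP status of the failed response
    this.bodySnippet = bodySnippet; // truncated response body for debugging
  }
}

function createClient(env = {}, fetchImpl = globalThis.fetch) {
  // Fall back to the hosted instance; strip trailing slashes before joining paths.
  const baseUrl = (env.PHOW2SIM_API_URL || DEFAULT_BASE_URL).replace(/\/+$/, "");

  async function getJson(path, params = {}) {
    const url = new URL(baseUrl + path);
    for (const [key, value] of Object.entries(params)) {
      url.searchParams.set(key, value);
    }
    // AbortSignal.timeout aborts the request after 5 s. A timeout or network
    // failure rejects with the runtime's own error, not UpstreamError, here.
    const res = await fetchImpl(url.toString(), { signal: AbortSignal.timeout(5000) });
    if (!res.ok) {
      const body = await res.text();
      throw new UpstreamError(res.status, body.slice(0, 200));
    }
    // The JSON payload is returned as-is; its exact fields are not assumed.
    return res.json();
  }

  return {
    randomWord: () => getJson("/random"),
    similarity: (a, b) => getJson("/similarity", { a, b }),
  };
}
```

For local development under these assumptions, `createClient({ PHOW2SIM_API_URL: "http://localhost:8000" })` would point both calls at a local instance. Passing a stub as `fetchImpl` keeps tests off the network.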