Commit Graph

2 Commits

Author SHA1 Message Date
tiennm99 4acc471f6f refactor(doantu): swap Workers AI bge-m3 for hosted phow2sim HTTP API
Doantu now mirrors semantle's pre-Workers-AI shape: a thin fetch wrapper
around /random + /similarity on https://phow2sim.sg.miti99.com (overridable
via PHOW2SIM_API_URL). Drops the local Viet22K wordlist + build script —
the service owns vocabulary now. Promotes commands from protected to
public so they show up in Telegram's native / menu.
2026-04-23 11:35:32 +07:00
tiennm99 0740dffd6b refactor(doantu): swap ConceptNet for Workers AI bge-m3 embeddings
Mirror the semantle migration but with @cf/baai/bge-m3 — BAAI's
multilingual embedding model — because the English-only BGE variants
can't produce meaningful Vietnamese vectors (their tokenizer shreds
diacritics into noisy byte-level subwords).

bge-m3 is trained across 194 languages incl. Vietnamese and is
actually cheaper in Neurons (1,075 vs 1,841 per M tokens for
bge-small-en-v1.5). Vocab check reuses the local Viet22K wordlist as
an in-memory Set — O(1) OOV detection, no upstream call.

Also add a test file for the module (mirrors semantle coverage plus
Vietnamese-specific cases: diacritics, multi-syllable compounds).
2026-04-22 23:53:36 +07:00