Files
miti99bot/tests
tiennm99 0740dffd6b refactor(doantu): swap ConceptNet for Workers AI bge-m3 embeddings
Mirror the semantle migration but with @cf/baai/bge-m3 — BAAI's
multilingual embedding model — because the English-only BGE variants
can't produce meaningful Vietnamese vectors (their tokenizer shreds
diacritics into noisy byte-level subwords).

bge-m3 is trained across 194 languages incl. Vietnamese and is
actually cheaper in Neurons (1,075 vs 1,841 per M tokens for
bge-small-en-v1.5). Vocab check reuses the local Viet22K wordlist as
an in-memory Set — O(1) OOV detection, no upstream call.

Also add a test file for the module (mirrors semantle coverage plus
Vietnamese-specific cases: diacritics, multi-syllable compounds).
2026-04-22 23:53:36 +07:00
..