miti99bot

tiennm99/miti99bot

mirror of https://github.com/tiennm99/miti99bot.git synced 2026-04-28 18:22:47 +00:00

Commit Graph

Author: tiennm99

SHA1: 0740dffd6b

refactor(doantu): swap ConceptNet for Workers AI bge-m3 embeddings

Mirror the semantle migration but with @cf/baai/bge-m3 — BAAI's
multilingual embedding model — because the English-only BGE variants
can't produce meaningful Vietnamese vectors (their tokenizer shreds
diacritics into noisy byte-level subwords).
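
As a rough illustration of the embedding approach described above, the sketch below scores a guess against a secret word by the cosine similarity of their bge-m3 vectors. The `EmbeddingRunner` interface, the `scoreGuess` helper, and the assumed response shape (`data` holding one vector per input text) are illustrative assumptions, not code from this repository; in a Worker, the runner would presumably be the AI binding.

```typescript
// Minimal shape of the AI binding this sketch relies on (an assumption,
// not the repository's actual type).
interface EmbeddingRunner {
  run(model: string, input: { text: string[] }): Promise<{ data: number[][] }>;
}

const MODEL = "@cf/baai/bge-m3";

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vector length mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; treat it as unrelated.
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Hypothetical helper: embed secret and guess in one call, then compare.
async function scoreGuess(
  ai: EmbeddingRunner,
  secret: string,
  guess: string,
): Promise<number> {
  const res = await ai.run(MODEL, { text: [secret, guess] });
  const [secretVec, guessVec] = res.data;
  if (!secretVec || !guessVec) throw new Error("missing embedding in response");
  return cosine(secretVec, guessVec);
}
```

Batching both words into one `run` call keeps the sketch to a single upstream request per guess; a real module would likely cache the secret's vector instead.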

bge-m3 is trained across 194 languages, including Vietnamese, and is
actually cheaper in Neurons (1,075 per M tokens vs 1,841 for
bge-small-en-v1.5). The vocab check reuses the local Viet22K wordlist
as an in-memory Set: O(1) out-of-vocabulary (OOV) detection, no
upstream call.
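
The Set-based vocab check can be sketched roughly as follows. The tiny `WORDS` array is a hypothetical stand-in for the Viet22K wordlist, and `normalizeWord` and `isInVocab` are illustrative names rather than the module's real exports.

```typescript
// Hypothetical stand-in for the Viet22K wordlist; the real list would be
// loaded from the repository's local data.
const WORDS: string[] = ["con mèo", "học sinh", "tiếng", "nhà"];

// Assumed normalization: strip surrounding whitespace and lowercase.
function normalizeWord(word: string): string {
  return word.trim().toLowerCase();
}

// Build the Set once at module load so each lookup is a hash probe
// (average O(1)) instead of a scan over the list.
const VOCAB: ReadonlySet<string> = new Set(WORDS.map(normalizeWord));

// True when the guess is a known word; no network call involved.
function isInVocab(word: string): boolean {
  return VOCAB.has(normalizeWord(word));
}
```

Rejecting out-of-vocabulary guesses locally means the embedding model is only invoked for words the game would accept anyway.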

Also add a test file for the module (mirrors semantle coverage plus
Vietnamese-specific cases: diacritics, multi-syllable compounds).
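
The Vietnamese-specific cases mentioned above might exercise something like the helper below. `canonicalize` is a hypothetical function written for this sketch, assuming the module compares words in a canonical Unicode form; the actual test file may check different behaviour.

```typescript
// Hypothetical helper: canonical form for comparing Vietnamese words.
function canonicalize(input: string): string {
  return input
    .normalize("NFC") // compose base letters with their combining marks
    .trim()
    .toLowerCase()
    .replace(/\s+/g, " "); // collapse whitespace inside compounds
}

// Diacritics: the same word typed with combining marks (decomposed)
// and as precomposed code points.
const decomposed = "tie\u0302\u0301ng";
const composed = "ti\u1EBFng";

// Multi-syllable compound with stray spacing and capitalisation.
const messyCompound = "  H\u1ECDc   sinh ";

const sameWord = canonicalize(decomposed) === canonicalize(composed);
const cleanedCompound = canonicalize(messyCompound);
```

Here `sameWord` is `true` even though the two raw strings differ code point by code point, and `cleanedCompound` is the two-syllable word with a single space. Tone marks are preserved, so words that differ only by tone stay distinct.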

Date: 2026-04-22 23:53:36 +07:00

1 Commits