The hard ${MODEL_URL:?...} gate made 'docker compose up' fail at config
parsing if no .env existed — container never started, no logs beyond
compose's own error. Now:
- MODEL_URL defaults to empty in compose. The Python loader checks at
startup and raises a clear FileNotFoundError naming the missing path
and the two ways to fix it (set MODEL_URL, or mount a local file).
- Document an alternative local-mount flow in README, mirroring
word2sim's ./vectors.bin pattern.
- Add container_name: phow2sim for easier docker ps / docker logs.
phow2sim
Tiny HTTP service that returns Vietnamese word2vec similarity and nearest
neighbors — Vietnamese sibling of word2sim. Same endpoint
shapes; swap URLs and it's a drop-in replacement.
Backed by PhoW2V (VinAI / Dat Quoc Nguyen), the largest pretrained Vietnamese word vectors available. Chosen over PhoBERT for this purpose because word2vec's similarity distribution is wide enough to drive a Semantle-style warmth meter, whereas raw transformer embeddings saturate at the top.
License note. PhoW2V's research-only license forbids public redistribution, so this service doesn't — and can't — embed or auto-download the vectors from any public URL. You supply your own private mirror (typically a Nextcloud instance you control). See Quick start.
Stack
- FastAPI + uvicorn
- gensim (loads PhoW2V .txt files; caches a binary .bin alongside for 5× faster restarts)
Variants
PhoW2V ships in four flavors. Pick one per deployment.
| Variant | Dims | Size | Best for |
|---|---|---|---|
| word-100 | 100 | ~400MB | low-RAM hosts, compound-aware |
| word-300 | 300 | ~1.2GB | default: best quality, compound-aware |
| syllable-100 | 100 | ~50MB | single-syllable guesses, tiny footprint |
| syllable-300 | 300 | ~150MB | single-syllable guesses, richer vectors |
The "word" variants expect underscore-joined compounds (sinh_viên);
the "syllable" variants have no multi-token keys. The canonicalizer
tries both forms, but the client should pre-segment for the word variant
if it wants reliable coverage of compounds.
Endpoints
| Method | Path | Purpose |
|---|---|---|
| GET | /health | liveness probe |
| GET | /similarity?a=X&b=Y | cosine similarity between two keys |
| GET | /neighbors?word=X&topn=10 | nearest-neighbor keys with scores |
| GET | /vocab?word=X | check in-vocab; return canonical form |
| GET | /random | random vocab key, filtered for game-friendliness |
Response shape is identical to word2sim.
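How /random filters for game-friendliness is not spelled out here. One plausible reading of the min_rank, max_rank, min_len, and max_len parameters used in the Examples is a frequency-rank window combined with a key-length bound. The sketch below assumes the vocabulary is a list sorted by descending frequency, so a key's rank is its index; pick_random_key is a hypothetical helper, not the service's actual code.

```python
import random


def pick_random_key(vocab, min_rank=500, max_rank=20000, min_len=3, max_len=12, rng=random):
    """Pick a random key from a frequency-sorted vocab list (hypothetical helper).

    Assumes vocab[0] is the most frequent key, so rank == list index.
    """
    # Keep only keys whose rank falls inside [min_rank, max_rank).
    window = vocab[min_rank:max_rank]
    # Keep only keys whose character length is within bounds.
    candidates = [key for key in window if min_len <= len(key) <= max_len]
    # Return None rather than raising when the filters leave nothing.
    return rng.choice(candidates) if candidates else None
```

Passing a seeded random.Random instance as rng makes the pick reproducible, which can help when testing a game loop.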
Examples
curl 'http://localhost:8001/similarity?a=con_chó&b=con_mèo'
# {"a":"con_chó","b":"con_mèo","canonical_a":"con_chó","canonical_b":"con_mèo",
# "in_vocab_a":true,"in_vocab_b":true,"similarity":0.78}
curl 'http://localhost:8001/neighbors?word=đại_học&topn=5'
curl 'http://localhost:8001/vocab?word=con%20ch%C3%B3' # "con chó" → tries "con_chó"
# {"word":"con chó","canonical":"con_chó","in_vocab":true}
curl 'http://localhost:8001/random?min_rank=500&max_rank=20000&min_len=3&max_len=12'
Out-of-vocab returns in_vocab:false and similarity:null. Lookup
tries exact → lowercase → space-to-underscore variants.
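The lookup order above can be sketched as follows. This is a minimal illustration, not the code in vectors.py: the function names are hypothetical, vocab is assumed to be any container supporting the in operator, and returning None as the canonical form for out-of-vocab input is an assumption.

```python
def canonicalize(word, vocab):
    """Return the first variant of `word` found in `vocab`, or None if out-of-vocab."""
    lowered = word.lower()
    variants = [
        word,                       # exact
        lowered,                    # lowercase
        word.replace(" ", "_"),     # space-to-underscore, original case
        lowered.replace(" ", "_"),  # space-to-underscore, lowercase
    ]
    for candidate in variants:
        if candidate in vocab:
            return candidate
    return None


def vocab_response(word, vocab):
    """Build a /vocab-style payload (field names taken from the example above)."""
    canonical = canonicalize(word, vocab)
    return {"word": word, "canonical": canonical, "in_vocab": canonical is not None}
```

With a vocabulary containing con_chó, the input "con chó" misses on the exact and lowercase attempts and matches on the underscore variant, which mirrors the /vocab example.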
Quick start
- Get the vectors once. Download from the upstream Google Drive mirror (the one linked from the PhoW2V README; the original public.vinai.io URLs are dead). Four zips; keep the one matching your chosen variant.
- Host the zip somewhere a plain GET can reach it. Options:
  - Nextcloud public share with file upload, then use the /download endpoint: https://cloud.example.com/s/<token>/download. The share token acts as the capability; leave it unguessable and unlisted.
  - Any signed/pre-signed URL from your object store (S3, R2, Backblaze B2), or your own HTTP(S) endpoint.

  The service sends no auth headers; any authentication must be baked into the URL itself. This keeps the code minimal and puts hosting policy on the operator.
- Configure env. Copy .env.example to .env and set MODEL_URL:

      cp .env.example .env
      # edit .env:
      #   MODEL_URL=https://cloud.example.com/s/abc123XYZ/download
- Boot.

      docker compose up --build

  First boot streams ~1.2GB (word-300d) into the phow2v-cache volume, then parses for ~60s. A binary .bin is written alongside so later restarts load in ~10s. Health check start period is 10 min to cover the first-boot cost.
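Because the service sends no auth headers, the first-boot fetch amounts to a plain GET streamed to disk. The sketch below shows one way that could look using only the standard library; stream_to_file and the .part temporary suffix are hypothetical, and unzipping the archive afterwards is left out.

```python
import shutil
import urllib.request
from pathlib import Path


def stream_to_file(url, dest, chunk_size=1 << 20):
    """Stream `url` to `dest` in chunks so a ~1.2GB zip never sits in memory."""
    dest = Path(dest)
    # Write to a temporary name first so an interrupted download
    # is never mistaken for a complete file on the next boot.
    tmp = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as resp, open(tmp, "wb") as out:
        shutil.copyfileobj(resp, out, chunk_size)
    tmp.replace(dest)
    return dest
```

Any credential has to be part of url itself, for example the Nextcloud share token or a pre-signed query string.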
Alternative: mount a local file instead
If you've already downloaded the .txt locally and don't want to
re-upload anywhere, skip MODEL_URL entirely and mount the file. In
docker-compose.yml, uncomment the bind mount:
volumes:
- phow2v-cache:/data/phow2v
- ./models/word2vec_vi_words_300dims.txt:/data/phow2v/word2vec_vi_words_300dims.txt:ro
Then docker compose up boots straight into parse — no download step.
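With MODEL_URL optional, the loader has to decide at startup whether to use a mounted file, download, or stop with a useful error. A minimal sketch of that decision is below; resolve_model_source is a hypothetical name and the message wording is illustrative, not the text the service actually prints.

```python
from pathlib import Path


def resolve_model_source(model_path, model_url):
    """Decide how to obtain the vectors, or fail with an actionable message.

    Returns "local" if the file is already on disk (for example via a bind
    mount), "download" if MODEL_URL is set, and raises otherwise.
    """
    path = Path(model_path)
    if path.is_file():
        return "local"
    if model_url:
        return "download"
    raise FileNotFoundError(
        f"Model file not found: {path}. "
        "Either set MODEL_URL to a URL that serves the zip, "
        "or mount a local copy of the vectors at that path."
    )
```

Checking the local file first means a bind-mounted copy takes precedence even when MODEL_URL is also set; that ordering is an assumption of this sketch.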
Switching variant
Host the desired zip and update .env:
MODEL_URL=https://cloud.example.com/s/<token-for-syllables-100>/download
MODEL_PATH=/data/phow2v/word2vec_vi_syllables_100dims.txt
Delete the phow2v-cache volume when switching, otherwise the stale
.bin from the previous variant loads instead:
docker compose down -v && docker compose up --build
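A cache-first loader shows why a stale .bin wins. The sketch below assumes the cache path does not change with the variant, which the warning above implies but this README does not show; pick_source and load_vectors are hypothetical names. The gensim calls themselves, KeyedVectors.load_word2vec_format and save_word2vec_format, are part of gensim's public API.

```python
from pathlib import Path


def pick_source(txt_path, cache_path):
    """Return (path, is_binary): the cache if it exists, else the text file."""
    cache = Path(cache_path)
    if cache.exists():
        # The cache is trusted as-is; nothing checks which variant produced it.
        return cache, True
    return Path(txt_path), False


def load_vectors(txt_path, cache_path):
    """Load vectors, writing a binary cache after the first text parse."""
    from gensim.models import KeyedVectors  # imported lazily; requires gensim

    source, is_binary = pick_source(txt_path, cache_path)
    vectors = KeyedVectors.load_word2vec_format(str(source), binary=is_binary)
    if not is_binary:
        vectors.save_word2vec_format(str(cache_path), binary=True)
    return vectors
```

Under that assumption, changing MODEL_PATH alone has no effect while the old cache file exists, which is why the volume needs to be removed.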
Config (env vars)
| Var | Default | Meaning |
|---|---|---|
| MODEL_URL | (empty) | Any URL that serves the zip via a plain GET. Bake any auth into the URL. Required unless a local file is mounted at MODEL_PATH. |
| MODEL_PATH | /data/phow2v/word2vec_vi_words_300dims.txt | Where the text-format vectors are persisted. |
Using from doantu (miti99bot)
The Cloudflare Worker module's api-client.js already produces the same
response shape. Replace embedPair + local cosine with a single fetch:
const url = `${env.PHOW2SIM_URL}/similarity?a=${encodeURIComponent(a)}&b=${encodeURIComponent(b)}`;
const resp = await fetch(url, { headers: { Authorization: `Bearer ${env.PHOW2SIM_TOKEN}` } });
return await resp.json(); // { in_vocab_a, in_vocab_b, similarity, ... }
Auth is not built-in here — add a reverse proxy (Caddy, Cloudflare Tunnel, or nginx) in front that checks a bearer token before passing through. The service itself trusts its caller.
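For callers outside the Worker, the same request can be built in Python. This is a sketch with hypothetical helper names; it covers URL construction and response handling only, and leaves the HTTP call and any Authorization header for your reverse proxy to the caller.

```python
import json
from urllib.parse import urlencode


def similarity_url(base_url, a, b):
    """Build the /similarity URL with both keys percent-encoded."""
    query = urlencode({"a": a, "b": b})
    return f"{base_url.rstrip('/')}/similarity?{query}"


def parse_similarity(body):
    """Return the similarity score, or None if either key is out-of-vocab."""
    payload = json.loads(body)
    if not (payload["in_vocab_a"] and payload["in_vocab_b"]):
        return None
    return payload["similarity"]
```

Treating out-of-vocab as None keeps the caller from confusing a missing word with a similarity of zero.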
Project layout
phow2sim/
├── app/
│ ├── main.py # FastAPI routes
│ └── vectors.py # PhoW2V loader + canonicalize + similarity/neighbors/random
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
└── .env.example # copy to .env and set MODEL_URL