Stateless FastAPI service exposing word2vec cosine similarity, nearest neighbors, vocab lookup, and random-word picker. Dockerized with gensim GoogleNews pretrained model support.
word2sim
Tiny HTTP service that returns word2vec cosine similarity and nearest neighbors. Stateless. No sessions. Just the math.
Designed as a backend building block — a Semantle-style game, a search re-ranker, or a writing-assistance tool can all sit on top.
Stack
- FastAPI + uvicorn
- gensim (loads pretrained word2vec-google-news-300 by default: 3M tokens × 300 dims, ~3.4GB RAM)
Endpoints
| Method | Path | Purpose |
|---|---|---|
| GET | /health | liveness probe |
| GET | /similarity?a=X&b=Y | cosine similarity between two words |
| GET | /neighbors?word=X&topn=10 | nearest-neighbor words with scores |
| GET | /vocab?word=X | check if a word is in vocab; return canonical form |
| GET | /random | random vocab word, filtered for game-friendliness |
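For readers new to the underlying math, here is a toy sketch of cosine similarity and a brute-force nearest-neighbor search over a small dict of vectors. It is illustrative only: `cosine`, `neighbors`, and the `vectors` dict are made-up names, and the service presumably delegates this work to gensim rather than looping in pure Python.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|); 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def neighbors(word, vectors, topn=10):
    """Return the topn (word, score) pairs most similar to `word`, best first."""
    target = vectors[word]
    scored = [(w, cosine(target, v)) for w, v in vectors.items() if w != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]
```

With real embeddings the vectors would have 300 dimensions each, and the search would run as one matrix operation instead of a Python loop.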
Examples
curl 'http://localhost:8000/similarity?a=king&b=queen'
# {"a":"king","b":"queen","canonical_a":"king","canonical_b":"queen",
# "in_vocab_a":true,"in_vocab_b":true,"similarity":0.6510957}
curl 'http://localhost:8000/neighbors?word=ocean&topn=5'
# {"word":"ocean","canonical":"ocean","in_vocab":true,
# "neighbors":[{"word":"oceans","similarity":0.78},{"word":"sea","similarity":0.75}, ...]}
curl 'http://localhost:8000/vocab?word=Paris'
# {"word":"Paris","canonical":"Paris","in_vocab":true}
curl 'http://localhost:8000/random?min_rank=500&max_rank=20000&min_len=4&max_len=8'
# {"word":"harbor","rank":8421}
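The same calls can be made from Python. The sketch below uses only the standard library. `BASE_URL`, `build_url`, and `get_json` are names chosen for illustration and are not part of word2sim.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the service is running locally, as in the curl examples.
BASE_URL = "http://localhost:8000"


def build_url(path, **params):
    """Return BASE_URL + path, with params encoded as a query string."""
    query = urlencode(params)
    return f"{BASE_URL}{path}?{query}" if query else f"{BASE_URL}{path}"


def get_json(path, **params):
    """GET an endpoint and decode the JSON body.

    urlopen raises urllib.error.HTTPError on non-2xx responses, e.g. a 503.
    """
    with urlopen(build_url(path, **params), timeout=10) as resp:
        return json.load(resp)


# Usage, with the service up:
#   get_json("/similarity", a="king", b="queen")["similarity"]
#   get_json("/neighbors", word="ocean", topn=5)["neighbors"]
```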
/random query params
| Param | Default | Meaning |
|---|---|---|
| min_rank | 100 | skip the top-N most frequent tokens (common function words) |
| max_rank | 50000 | cap at top-N most frequent (avoids rare/noisy tail) |
| alpha_only | true | reject phrases (new_york), digits, punctuation |
| min_len | 3 | minimum word length |
| max_len | 12 | maximum word length |
Uses rejection sampling over the frequency-sorted vocab. Returns 503 if no word passes the filters within 1000 attempts; if that happens, loosen the filters.
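The sampling described above could look roughly like the sketch below. It is not the service's actual code: `pick_random_word` is a hypothetical name, `vocab` is assumed to be a list sorted most-frequent-first, and rank is assumed to be the 0-based index into that list.

```python
import random


def pick_random_word(vocab, min_rank=100, max_rank=50000, alpha_only=True,
                     min_len=3, max_len=12, max_attempts=1000, rng=random):
    """Rejection-sample a word from a frequency-sorted vocab list.

    Returns (word, rank), or None when no candidate passes the filters within
    max_attempts. An HTTP layer would translate None into a 503.
    """
    upper = min(max_rank, len(vocab))
    if min_rank >= upper:
        return None  # the rank window is empty
    for _ in range(max_attempts):
        rank = rng.randrange(min_rank, upper)  # uniform draw inside the window
        word = vocab[rank]
        if not (min_len <= len(word) <= max_len):
            continue  # wrong length: reject and draw again
        if alpha_only and not word.isalpha():
            continue  # contains "_", digits, or punctuation: reject
        return word, rank
    return None
```

Rejection sampling avoids pre-filtering the vocab for every parameter combination. The trade-off is that very tight filters can use up the attempt budget, which is what the 503 signals.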
Out-of-vocab words return in_vocab:false and similarity:null. Case-insensitive lookup tries exact → lower → capitalized.
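A minimal sketch of that fallback and of the null-on-miss behavior follows. Several things here are assumptions: the function names, the reading of "capitalized" as Python's `str.capitalize()`, and the choice to report `None` as the canonical form of an out-of-vocab word. `vocab` is any container supporting `in`, and `similarity_fn` is any two-word scoring callable.

```python
def canonical_form(word, vocab):
    """Return the first of exact / lowercased / capitalized found in vocab, else None."""
    for candidate in (word, word.lower(), word.capitalize()):
        if candidate in vocab:
            return candidate
    return None


def similarity_response(a, b, vocab, similarity_fn):
    """Build a dict shaped like the /similarity example, with null for misses."""
    ca, cb = canonical_form(a, vocab), canonical_form(b, vocab)
    # Only score when both words resolved; otherwise similarity stays None (JSON null).
    score = similarity_fn(ca, cb) if ca is not None and cb is not None else None
    return {
        "a": a, "b": b,
        "canonical_a": ca, "canonical_b": cb,
        "in_vocab_a": ca is not None, "in_vocab_b": cb is not None,
        "similarity": score,
    }
```

With gensim 4.x, a `KeyedVectors` object `kv` would plausibly supply `kv.key_to_index` as `vocab` and `kv.similarity` as `similarity_fn`.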
Quick start
docker compose up --build
# first boot downloads ~1.6GB model into the gensim-cache volume; later boots reuse the cache and skip the download
Using your own vectors
Skip the download by mounting a locally trained vectors.bin:
# docker-compose.yml
services:
  word2sim:
    environment:
      MODEL_PATH: /models/vectors.bin
    volumes:
      - ./vectors.bin:/models/vectors.bin:ro
(Train one with bash demo-word.sh from the upstream word2vec repo.)
Config (env vars)
| Var | Default | Meaning |
|---|---|---|
| MODEL_NAME | word2vec-google-news-300 | gensim downloader id |
| MODEL_PATH | (unset) | if set + file exists, load this .bin instead (skips download) |
| GENSIM_DATA_DIR | /data/gensim-cache | where gensim caches downloaded models |
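The precedence between MODEL_PATH and MODEL_NAME can be sketched as below. `resolve_model_source` and `load_vectors` are hypothetical names, not necessarily what app/vectors.py contains. GENSIM_DATA_DIR is not handled here, on the assumption that gensim's downloader reads it from the process environment itself.

```python
import os


def resolve_model_source(env=os.environ):
    """Decide where vectors come from.

    Returns ("file", path) when MODEL_PATH is set and the file exists,
    otherwise ("download", model_name).
    """
    path = env.get("MODEL_PATH")
    if path and os.path.isfile(path):
        return "file", path
    return "download", env.get("MODEL_NAME", "word2vec-google-news-300")


def load_vectors(env=os.environ):
    """Load KeyedVectors from the resolved source (requires gensim)."""
    # Imported lazily so resolve_model_source can be used without gensim installed.
    from gensim.models import KeyedVectors
    import gensim.downloader

    kind, value = resolve_model_source(env)
    if kind == "file":
        # word2vec C binary format, as produced by the upstream word2vec tool
        return KeyedVectors.load_word2vec_format(value, binary=True)
    return gensim.downloader.load(value)
```

A MODEL_PATH that points at a missing file falls back to the download instead of failing, which matches the "if set + file exists" wording in the table.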
Project layout
word2sim/
├── app/
│ ├── main.py # FastAPI routes
│ └── vectors.py # model loader + similarity/neighbors
├── Dockerfile
├── docker-compose.yml
└── requirements.txt
Building a Semantle-style game on top
The game server keeps state (session, secret, guess log); it calls word2sim per guess:
new game: GET /random?min_rank=500&max_rank=20000&min_len=4&max_len=10
          GET /neighbors?word={secret}&topn=1000 → cache ranks locally
on guess: GET /similarity?a={secret}&b={guess}
word2sim stays stateless and cache-friendly.
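A game server following that flow might look like the sketch below. `SemantleGame` and the injected `fetch(path, **params)` callable are hypothetical; `fetch` stands in for any helper that performs a GET and returns the decoded JSON. The sketch also assumes /neighbors returns results sorted closest-first, as the example output suggests.

```python
class SemantleGame:
    """Game-side state (secret, rank cache, guess log); word2sim stores none of it."""

    def __init__(self, fetch):
        self.fetch = fetch   # assumed helper: fetch(path, **params) -> dict
        self.secret = None
        self.ranks = {}      # neighbor word -> rank, where 1 is the closest
        self.guesses = []

    def new_game(self):
        picked = self.fetch("/random", min_rank=500, max_rank=20000,
                            min_len=4, max_len=10)
        self.secret = picked["word"]
        near = self.fetch("/neighbors", word=self.secret, topn=1000)["neighbors"]
        # Cache ranks once so each later guess costs a single /similarity call.
        self.ranks = {n["word"]: i for i, n in enumerate(near, start=1)}
        self.guesses = []

    def guess(self, word):
        res = self.fetch("/similarity", a=self.secret, b=word)
        canonical = res.get("canonical_b") or word
        entry = {
            "guess": word,
            "similarity": res["similarity"],     # None for out-of-vocab guesses
            "rank": self.ranks.get(canonical),   # None if outside the top 1000
            "solved": canonical == self.secret,
        }
        self.guesses.append(entry)
        return entry
```

Injecting `fetch` keeps the game logic testable without a running service and leaves the choice of HTTP client to the caller.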