mirror of https://github.com/tiennm99/word2sim.git synced 2026-05-19 17:29:36 +00:00


tiennm99 2e3e61dcbb feat: initial scaffold for word2sim similarity API

Stateless FastAPI service exposing word2vec cosine similarity,
nearest neighbors, vocab lookup, and random-word picker.
Dockerized with gensim GoogleNews pretrained model support.

2026-04-22 21:18:02 +07:00


word2sim

Tiny HTTP service that returns word2vec cosine similarity and nearest neighbors. Stateless. No sessions. Just the math.

Designed as a backend building block — a Semantle-style game, a search re-ranker, or a writing-assistance tool can all sit on top.

Stack

FastAPI + uvicorn
gensim (loads pretrained word2vec-google-news-300 by default: 3M tokens × 300 dims, ~3.4GB RAM)
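
The RAM figure above can be sanity-checked with simple arithmetic. This is a rough sketch that assumes the vectors are held as 32-bit floats (4 bytes per value); the vocabulary index and Python overhead come on top of the raw matrix. `matrix_bytes` is a hypothetical helper, not part of the service.

```python
# Back-of-envelope memory estimate for the default model.
# Assumption: each vector component is a 32-bit float (4 bytes).
VOCAB_SIZE = 3_000_000   # tokens, from the Stack note above
DIMENSIONS = 300         # vector length, from the Stack note above
BYTES_PER_FLOAT32 = 4    # assumed storage type


def matrix_bytes(vocab_size: int, dims: int,
                 bytes_per_value: int = BYTES_PER_FLOAT32) -> int:
    """Size in bytes of a dense vocab_size x dims matrix."""
    return vocab_size * dims * bytes_per_value


total = matrix_bytes(VOCAB_SIZE, DIMENSIONS)
print(f"{total / 1e9:.2f} GB, {total / 2**30:.2f} GiB")
# prints "3.60 GB, 3.35 GiB"
```

Under that assumption the raw matrix alone is about 3.35 GiB, which lines up with the ~3.4GB quoted above.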

Endpoints

Method	Path	Purpose
GET	`/health`	liveness probe
GET	`/similarity?a=X&b=Y`	cosine similarity between two words
GET	`/neighbors?word=X&topn=10`	nearest-neighbor words with scores
GET	`/vocab?word=X`	check if a word is in vocab; return canonical form
GET	`/random`	random vocab word, filtered for game-friendliness
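
For readers unfamiliar with the metric behind `/similarity` and `/neighbors`, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below is a dependency-free illustration on toy vectors; it is not the service's code, which presumably delegates to gensim.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Return dot(u, v) / (|u| * |v|) for two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy 2-d vectors: same direction, orthogonal, opposite direction.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # 0.0
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0
```

The result always falls in [-1, 1], so scores from the service can be compared across word pairs regardless of vector length.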

Examples

curl 'http://localhost:8000/similarity?a=king&b=queen'
# {"a":"king","b":"queen","canonical_a":"king","canonical_b":"queen",
#  "in_vocab_a":true,"in_vocab_b":true,"similarity":0.6510957}

curl 'http://localhost:8000/neighbors?word=ocean&topn=5'
# {"word":"ocean","canonical":"ocean","in_vocab":true,
#  "neighbors":[{"word":"oceans","similarity":0.78},{"word":"sea","similarity":0.75}, ...]}

curl 'http://localhost:8000/vocab?word=Paris'
# {"word":"Paris","canonical":"Paris","in_vocab":true}

curl 'http://localhost:8000/random?min_rank=500&max_rank=20000&min_len=4&max_len=8'
# {"word":"harbor","rank":8421}
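
The same calls can be made from Python with only the standard library. This is a minimal client sketch, assuming the service listens on `localhost:8000` as in the curl examples; `endpoint_url` and `get_json` are hypothetical helper names, not part of word2sim.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # same host/port as the curl examples


def endpoint_url(path: str, **params) -> str:
    """Build a request URL, percent-encoding the query parameters."""
    query = urlencode(params)
    return f"{BASE_URL}{path}?{query}" if query else f"{BASE_URL}{path}"


def get_json(path: str, **params) -> dict:
    """GET an endpoint and decode its JSON body (needs a running service)."""
    with urlopen(endpoint_url(path, **params), timeout=10) as response:
        return json.load(response)


print(endpoint_url("/similarity", a="king", b="queen"))
# prints "http://localhost:8000/similarity?a=king&b=queen"
```

With the service running, `get_json("/neighbors", word="ocean", topn=5)` would return the decoded response as a dict. Building the query with `urlencode` keeps words containing spaces or other reserved characters from breaking the URL.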

`/random` query params

Param	Default	Meaning
`min_rank`	100	skip the top-N most frequent tokens (common function words)
`max_rank`	50000	cap at top-N most frequent (avoids rare/noisy tail)
`alpha_only`	true	reject phrases (`new_york`), digits, punctuation
`min_len`	3	minimum word length in characters
`max_len`	12	maximum word length in characters

Uses rejection sampling over the frequency-sorted vocab. If no candidate passes the filters within 1000 attempts, the endpoint returns 503; loosen the filters and retry.
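
The rejection-sampling idea can be sketched as follows. This is an illustration under assumptions, not the code in `app/`: `pick_random_word` is a hypothetical name, ranks are treated as 0-based indices into a frequency-sorted list, and `alpha_only` is approximated with `str.isalpha()`.

```python
import random
from typing import List, Optional

MAX_ATTEMPTS = 1000  # attempt limit described above


def pick_random_word(
    vocab_by_rank: List[str],
    min_rank: int = 100,
    max_rank: int = 50000,
    alpha_only: bool = True,
    min_len: int = 3,
    max_len: int = 12,
    rng: Optional[random.Random] = None,
) -> Optional[dict]:
    """Draw random ranks until a word passes every filter, or give up."""
    rng = rng or random.Random()
    upper = min(max_rank, len(vocab_by_rank))
    if min_rank >= upper:
        return None  # empty rank window, nothing to sample from
    for _ in range(MAX_ATTEMPTS):
        rank = rng.randrange(min_rank, upper)
        word = vocab_by_rank[rank]
        if not (min_len <= len(word) <= max_len):
            continue
        # isalpha() rejects underscores, digits and punctuation
        # (note that it accepts non-ASCII letters).
        if alpha_only and not word.isalpha():
            continue
        return {"word": word, "rank": rank}
    return None  # a caller would translate this into HTTP 503
```

Rejection sampling avoids building a filtered copy of the vocab for every request; the trade-off is that very strict filters can exhaust the attempt budget, which is why the failure case exists.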

Out-of-vocab words return `in_vocab:false` and `similarity:null`. Lookup is case-tolerant rather than fully case-insensitive: it tries the exact form, then lowercase, then capitalized.
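
The lookup order can be sketched like this. `canonical_form` and `vocab_response` are hypothetical names, and returning `None` as the canonical form of an unknown word is an assumption; the README only shows the in-vocab response.

```python
from typing import Collection, Optional


def canonical_form(word: str, vocab: Collection[str]) -> Optional[str]:
    """Try exact, then lowercase, then capitalized; None if all miss."""
    for candidate in (word, word.lower(), word.capitalize()):
        if candidate in vocab:
            return candidate
    return None


def vocab_response(word: str, vocab: Collection[str]) -> dict:
    """Shape a /vocab-style answer (the out-of-vocab form is assumed)."""
    canonical = canonical_form(word, vocab)
    return {"word": word, "canonical": canonical,
            "in_vocab": canonical is not None}


toy_vocab = {"Paris", "king"}
print(vocab_response("paris", toy_vocab))
# prints {'word': 'paris', 'canonical': 'Paris', 'in_vocab': True}
```

Only three spellings are tried, so a mixed-case input such as `pARIS` resolves here through its capitalized form, while a vocab entry stored as `iPhone` would not be found from `iphone`.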

Quick start

docker compose up --build
# first boot downloads the ~1.6GB model into the gensim-cache volume; later boots reuse the cached copy and skip the download

Using your own vectors

Skip the download by mounting a locally trained vectors.bin:

# docker-compose.yml
services:
  word2sim:
    environment:
      MODEL_PATH: /models/vectors.bin
    volumes:
      - ./vectors.bin:/models/vectors.bin:ro

(Train one with bash demo-word.sh from the upstream word2vec repo.)

Config (env vars)

Var	Default	Meaning
`MODEL_NAME`	`word2vec-google-news-300`	gensim downloader id
`MODEL_PATH`	(unset)	if set + file exists, load this `.bin` instead (skips download)
`GENSIM_DATA_DIR`	`/data/gensim-cache`	where gensim caches downloaded models
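
The precedence between `MODEL_PATH` and `MODEL_NAME` can be sketched as below. This is an assumed reading of the table, not the code in `app/vectors.py`; `resolve_model_source` is a hypothetical name, and the file check is injectable so the logic can be exercised without a real model file.

```python
import os
from typing import Callable, Mapping, Tuple

DEFAULT_MODEL_NAME = "word2vec-google-news-300"


def resolve_model_source(
    env: Mapping[str, str],
    file_exists: Callable[[str], bool] = os.path.isfile,
) -> Tuple[str, str]:
    """Return ("file", path) or ("download", model_name)."""
    model_path = env.get("MODEL_PATH", "")
    if model_path and file_exists(model_path):
        # With gensim this would typically be loaded via
        # KeyedVectors.load_word2vec_format(model_path, binary=True).
        return ("file", model_path)
    # Otherwise fall back to gensim.downloader.load(name), which caches
    # under GENSIM_DATA_DIR.
    return ("download", env.get("MODEL_NAME", DEFAULT_MODEL_NAME))


print(resolve_model_source(os.environ))
```

Note that a `MODEL_PATH` pointing at a missing file falls through to the download rather than failing, which matches the "if set + file exists" wording above.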

Project layout

word2sim/
├── app/
│   ├── main.py       # FastAPI routes
│   └── vectors.py    # model loader + similarity/neighbors
├── Dockerfile
├── docker-compose.yml
└── requirements.txt

Building a Semantle-style game on top

The game server keeps state (session, secret, guess log); it calls word2sim per guess:

new game:  GET  /random?min_rank=500&max_rank=20000&min_len=4&max_len=10
           GET  /neighbors?word={secret}&topn=1000     → cache ranks locally
on guess:  GET  /similarity?a={secret}&b={guess}

word2sim stays stateless and cache-friendly.
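
On the game-server side, the "cache ranks locally" step might look like the sketch below. It assumes `/neighbors` returns its list sorted from most to least similar (consistent with the example output, but not stated) and that guesses are compared in their canonical form; `build_rank_table` and `describe_guess` are hypothetical names.

```python
from typing import Dict, Optional


def build_rank_table(neighbors_response: dict) -> Dict[str, int]:
    """Map each neighbor word to its 1-based position in the response."""
    return {
        entry["word"]: rank
        for rank, entry in enumerate(neighbors_response["neighbors"], start=1)
    }


def describe_guess(guess: str, similarity: Optional[float],
                   ranks: Dict[str, int]) -> str:
    """Turn a /similarity result plus the cached ranks into feedback."""
    if similarity is None:  # out-of-vocab guesses come back as null
        return f"{guess}: not in vocabulary"
    rank = ranks.get(guess)
    position = (f"#{rank} of {len(ranks)}" if rank is not None
                else "outside the cached neighbors")
    return f"{guess}: similarity {similarity:.2f} ({position})"


# Response shape copied from the /neighbors example above.
sample = {"word": "ocean", "canonical": "ocean", "in_vocab": True,
          "neighbors": [{"word": "oceans", "similarity": 0.78},
                        {"word": "sea", "similarity": 0.75}]}
ranks = build_rank_table(sample)
print(describe_guess("sea", 0.75, ranks))
# prints "sea: similarity 0.75 (#2 of 2)"
```

Fetching the neighbor list once per game means each guess costs a single `/similarity` call, and the rank lookup is a local dictionary access.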
