'Waiting for application startup.' was the last line visible for
several minutes while the lifespan hook silently downloaded 1.2 GB and
parsed the text vectors, which looks like a hang.
- Print milestones for each load phase (cache hit / download /
extract / parse / cache-write) with timings.
- During download, print every ~50 MiB with running percent if the
server sent Content-Length.
- PYTHONUNBUFFERED=1 in Dockerfile so the prints flush to
'docker compose logs' in real time.
Uses plain print (not logging) because uvicorn's default log config
filters INFO on non-uvicorn loggers, and wrestling with that for six
operator-facing status lines isn't worth the surface area.
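A minimal sketch of the download progress printing described above, not the actual loader code. The function name copy_with_progress, the "[model]" prefix, and the step parameter are hypothetical; total stands for the Content-Length value when the server sent one.

```python
import time
from typing import BinaryIO, Iterable, Optional

MIB = 1024 * 1024


def copy_with_progress(
    chunks: Iterable[bytes],
    out: BinaryIO,
    total: Optional[int] = None,  # Content-Length in bytes, if known
    step: int = 50 * MIB,         # print roughly every `step` bytes
) -> int:
    """Write chunks to `out`, printing a status line about every `step` bytes."""
    started = time.monotonic()
    written = 0
    next_mark = step
    for chunk in chunks:
        out.write(chunk)
        written += len(chunk)
        if written >= next_mark:
            # Percent is only shown when the total size is known.
            pct = f" ({100 * written / total:.0f}%)" if total else ""
            print(f"[model] downloaded {written / MIB:.0f} MiB{pct}", flush=True)
            # Skip any marks a large chunk may have jumped over.
            next_mark = (written // step + 1) * step
    elapsed = time.monotonic() - started
    print(f"[model] download done: {written / MIB:.0f} MiB in {elapsed:.1f}s", flush=True)
    return written
```

Passing flush=True makes each line appear immediately even without PYTHONUNBUFFERED=1, so the two measures overlap rather than depend on each other.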
Match word2sim's pattern: the port stays unexposed by default, and the
caller uncomments the mapping only if they want to reach the service
from the host. Keeps it network-internal-friendly when run behind a
reverse proxy on the same compose network.
Remove the doantu/miti99bot integration section — phow2sim is a
standalone building block; consumer-specific wiring doesn't belong in
its README. Replace with a short generic 'Auth' note about fronting
the service with a reverse proxy.
Also correct the MODEL_URL row — it's no longer required since the
runtime accepts either a URL or a pre-populated MODEL_PATH.
The hard ${MODEL_URL:?...} gate made 'docker compose up' fail at config
parsing if no .env existed — container never started, no logs beyond
compose's own error. Now:
- MODEL_URL defaults to empty in compose. The Python loader checks at
startup and raises a clear FileNotFoundError naming the missing path
and the two ways to fix it (set MODEL_URL, or mount a local file).
- Document an alternative local-mount flow in README, mirroring
word2sim's ./vectors.bin pattern.
- Add container_name: phow2sim for easier docker ps / docker logs.
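The startup check could look roughly like the sketch below. The function name resolve_model_source and the return values are hypothetical, and the error wording is illustrative rather than the loader's actual message.

```python
from pathlib import Path


def resolve_model_source(model_path: str, model_url: str) -> str:
    """Decide where the model comes from, or fail with an actionable error.

    Returns "local" if a file already exists at model_path, "download" if a
    URL is configured, and raises FileNotFoundError otherwise.
    """
    path = Path(model_path)
    if path.is_file():
        return "local"
    if model_url.strip():
        return "download"
    # Name the missing path and both fixes so the operator can act on the log.
    raise FileNotFoundError(
        f"No model file at {path} and MODEL_URL is empty. "
        "Either set MODEL_URL to a URL that serves the model zip, "
        f"or mount a local file at {path}."
    )
```

Because the check runs inside the container at startup, the failure shows up in the container's own logs instead of stopping at compose config parsing.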
Was set in Dockerfile/compose/.env/README but never read by the app.
Tokenization is inferred at lookup time by _variant_candidates trying
both spaced and underscore-joined forms.
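As an illustration of lookup-time inference, here is a sketch of how such a helper might generate candidates. This is an assumption about the behavior, not the body of the real _variant_candidates; the names variant_candidates and lookup are hypothetical.

```python
from typing import Dict, List, Optional


def variant_candidates(word: str) -> List[str]:
    """Return the input plus its spaced and underscore-joined forms, deduplicated."""
    spaced = " ".join(word.replace("_", " ").split())  # "sinh_viên" -> "sinh viên"
    joined = spaced.replace(" ", "_")                   # "sinh viên" -> "sinh_viên"
    candidates: List[str] = []
    for form in (word, spaced, joined):
        if form and form not in candidates:
            candidates.append(form)
    return candidates


def lookup(word: str, vocab: Dict[str, int]) -> Optional[str]:
    """Return the first candidate present in vocab, or None if none match."""
    for candidate in variant_candidates(word):
        if candidate in vocab:
            return candidate
    return None
```

With this approach the same request works against either a word-level or a syllable-level vocabulary, which is why a separate variant setting is not needed.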
Simpler contract: operator hosts the zip behind any URL that answers a
plain GET (Nextcloud public share, signed S3/R2 URL, etc.). Any auth is
baked into the URL; the service sends no Authorization headers.
Removes MODEL_DOWNLOAD_USER / MODEL_DOWNLOAD_PASSWORD and their
plumbing. .env.example and README rewritten around the URL-only flow.
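A sketch of the URL-only fetch under stated assumptions: it uses only the standard library, the function name fetch_model and the ".part" temp-file suffix are hypothetical, and the 1 MiB chunk size follows the streaming note elsewhere in this log.

```python
import shutil
import urllib.request
from pathlib import Path

CHUNK = 1024 * 1024  # 1 MiB per read keeps memory use flat for large zips


def fetch_model(url: str, dest: Path) -> int:
    """Download url to dest with a plain GET and return the size in bytes.

    No Authorization header is added; any credential must already be part
    of the URL (for example a signed query string or a share token).
    """
    tmp = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as resp, open(tmp, "wb") as out:
        shutil.copyfileobj(resp, out, CHUNK)
    # Rename only after a complete download so a partial file is never used.
    tmp.replace(dest)
    return dest.stat().st_size
```

Writing to a temporary name first is a design choice of this sketch, not something the log states.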
The upstream public.vinai.io mirror is dead and PhoW2V's research
license forbids public redistribution, so anonymous auto-download is
no longer viable. Expect a private Nextcloud (WebDAV or
password-protected public share) per deployment.
- Stream downloads in 1MiB chunks (flat RAM for ~1GB zips)
- Basic auth via MODEL_DOWNLOAD_USER / MODEL_DOWNLOAD_PASSWORD
- Drop the broken public.vinai.io default; compose requires MODEL_URL
- Add .env.example with WebDAV and public-share recipes
- Remove scripts/download-phow2v.sh (pointed at the dead mirror)
- README rewritten around the Nextcloud workflow; update license caveat
Tiny FastAPI service over PhoW2V Vietnamese word vectors. Mirrors
word2sim's endpoint shapes (/similarity /neighbors /vocab /random) so
clients can swap URLs without code changes.
- Auto-downloads VinAI's PhoW2V on first boot and caches a binary .bin
copy for ~5x faster restarts
- Viet-aware canonicalizer: exact -> lowercase -> space-to-underscore
- Supports both word (compound) and syllable variants via env
- Unicode-aware random-word filter accepts diacritics, rejects digits/punct
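One way to build a Unicode-aware word filter is to test Unicode general categories instead of ASCII ranges. The sketch below is an assumption about the approach, not the service's code; is_plain_word and allow_underscore are hypothetical names, and the log does not say whether underscore-joined compounds pass the real filter.

```python
import unicodedata


def is_plain_word(token: str, allow_underscore: bool = False) -> bool:
    """True if token is non-empty and made only of letters and combining marks.

    Letters (category L*) cover precomposed Vietnamese characters such as
    "ế"; marks (category M*) cover decomposed diacritics. Digits and
    punctuation fall into other categories and are rejected.
    """
    if not token:
        return False
    for ch in token:
        if allow_underscore and ch == "_":
            continue
        if unicodedata.category(ch)[0] not in ("L", "M"):
            return False
    return True
```

For example, is_plain_word("tiếng") passes while is_plain_word("covid19") does not, because "1" and "9" have category Nd.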