9 Commits

Author SHA1 Message Date
tiennm99 e120ddb370 chore: add Apache-2.0 license 2026-04-29 21:33:33 +07:00
tiennm99 54eaf95fc4 feat: progress logging during model download/parse
'Waiting for application startup.' was the last line visible for
several minutes while the lifespan hook silently downloaded 1.2GB and
parsed the text vectors, which looks like a hang.

- Print milestones for each load phase (cache hit / download /
  extract / parse / cache-write) with timings.
- During download, print every ~50 MiB with running percent if the
  server sent Content-Length.
- PYTHONUNBUFFERED=1 in Dockerfile so the prints flush to
  'docker compose logs' in real time.

Uses plain print (not logging) because uvicorn's default log config
filters INFO on non-uvicorn loggers, and wrestling with that for six
operator-facing status lines isn't worth the surface area.
2026-04-23 11:22:06 +07:00
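The download progress reporting described in the commit above could be sketched roughly as follows. This is a minimal illustration, not the repository's code: the function name `copy_with_progress`, the `[model]` message prefix, and the `emit` parameter are all assumptions introduced here. Only the ~50 MiB interval, the percent shown when Content-Length is known, and the use of plain print come from the commit message.

```python
import time

MIB = 1024 * 1024


def copy_with_progress(src, dst, total=None, step=50 * MIB, chunk_size=MIB, emit=None):
    """Copy src to dst in chunks, emitting a status line about every `step` bytes.

    `total` is the Content-Length if the server sent one, else None.
    All names here are hypothetical; this is a sketch of the idea only.
    """
    if emit is None:
        # flush=True so lines show up immediately even without PYTHONUNBUFFERED=1
        emit = lambda msg: print(msg, flush=True)
    done = 0
    next_report = step
    started = time.monotonic()
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        done += len(chunk)
        if done >= next_report:
            if total:
                emit(f"[model] downloaded {done / MIB:.1f} MiB ({100 * done / total:.0f}%)")
            else:
                emit(f"[model] downloaded {done / MIB:.1f} MiB")
            next_report = done + step
    elapsed = time.monotonic() - started
    emit(f"[model] download finished: {done / MIB:.1f} MiB in {elapsed:.1f}s")
    return done
```

Passing `emit` as a parameter keeps the sketch testable; a real service could simply call print directly, as the commit message says it does.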
tiennm99 503d877a94 chore: comment out port mapping in docker-compose
Match word2sim's pattern: the port stays unexposed by default, and the
caller uncomments the mapping only to reach the service from the host.
Keeps the service network-internal when run behind a reverse proxy on
the same compose network.
2026-04-23 11:19:12 +07:00
tiennm99 ec8f70e799 docs: focus README on phow2sim only
Remove the doantu/miti99bot integration section — phow2sim is a
standalone building block; consumer-specific wiring doesn't belong in
its README. Replace with a short generic 'Auth' note about fronting
the service with a reverse proxy.

Also correct the MODEL_URL row — it's no longer required since the
runtime accepts either a URL or a pre-populated MODEL_PATH.
2026-04-23 11:17:06 +07:00
tiennm99 7f8990fd30 fix: allow compose to start without MODEL_URL; defer missing-model error to runtime
The hard ${MODEL_URL:?...} gate made 'docker compose up' fail at config
parsing if no .env existed — container never started, no logs beyond
compose's own error. Now:

- MODEL_URL defaults to empty in compose. The Python loader checks at
  startup and raises a clear FileNotFoundError naming the missing path
  and the two ways to fix it (set MODEL_URL, or mount a local file).
- Document an alternative local-mount flow in README, mirroring
  word2sim's ./vectors.bin pattern.
- Add container_name: phow2sim for easier docker ps / docker logs.
2026-04-23 11:16:16 +07:00
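The deferred missing-model check from the commit above might look something like this sketch. The function name `resolve_model_source`, the default path `/data/model.bin`, and the exact error wording are assumptions made for illustration; the commit only states that the loader raises a FileNotFoundError naming the missing path and the two fixes.

```python
import os
from pathlib import Path


def resolve_model_source(env=None):
    """Decide where the model comes from, or fail with an actionable error.

    Hypothetical sketch: prefers an existing file at MODEL_PATH, falls back
    to MODEL_URL, and raises only when neither is usable.
    """
    env = os.environ if env is None else env
    raw_path = env.get("MODEL_PATH", "/data/model.bin")  # assumed default, not from the source
    model_url = env.get("MODEL_URL", "").strip()

    if Path(raw_path).exists():
        return ("local", raw_path)
    if model_url:
        return ("download", model_url)
    raise FileNotFoundError(
        f"Model not found at {raw_path} and MODEL_URL is empty. "
        "Either set MODEL_URL to a downloadable archive, "
        "or mount a local model file at that path."
    )
```

Because the check runs inside the application rather than in compose's variable interpolation, the container starts and the error lands in its own logs.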
tiennm99 8140b51d3d refactor: remove unused MODEL_VARIANT env var
Was set in Dockerfile/compose/.env/README but never read by the app.
Tokenization is inferred at lookup time by _variant_candidates trying
both spaced and underscore-joined forms.
2026-04-23 11:09:55 +07:00
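For readers unfamiliar with the lookup-time inference mentioned above, here is one way such a candidate generator could work. It is a guess at the shape of `_variant_candidates`, not its actual body: the function names below and the exact candidate order are assumptions, loosely following the "exact -> lowercase -> space-to-underscore" order from the initial commit.

```python
def variant_candidates_sketch(word):
    """Return lookup candidates for `word`, most literal first, without duplicates.

    Hypothetical stand-in for the service's _variant_candidates helper.
    """
    lowered = word.lower()
    raw = (
        word,                        # exact form as typed
        lowered,                     # lowercase
        lowered.replace(" ", "_"),   # underscore-joined (word-level vocab)
        lowered.replace("_", " "),   # spaced (in case the vocab stores spaces)
    )
    seen = []
    for cand in raw:
        if cand not in seen:
            seen.append(cand)
    return seen


def lookup_sketch(word, vocab):
    """Return the first candidate present in `vocab`, or None."""
    for cand in variant_candidates_sketch(word):
        if cand in vocab:
            return cand
    return None
```

Trying both forms against whichever vocabulary is loaded is what makes a separate MODEL_VARIANT setting unnecessary.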
tiennm99 a1fd486937 refactor: drop Basic auth; require plain-GET MODEL_URL
Simpler contract: operator hosts the zip behind any URL that answers a
plain GET (Nextcloud public share, signed S3/R2 URL, etc.). Any auth is
baked into the URL; the service sends no Authorization headers.

Removes MODEL_DOWNLOAD_USER / MODEL_DOWNLOAD_PASSWORD and their
plumbing. .env.example and README rewritten around the URL-only flow.
2026-04-23 11:07:00 +07:00
tiennm99 6b1b401283 feat: fetch model via Nextcloud WebDAV with Basic auth
The upstream public.vinai.io mirror is dead and PhoW2V's research
license forbids public redistribution, so anonymous auto-download is
no longer viable. Expect a private Nextcloud (WebDAV or password-
protected public share) per deployment.

- Stream downloads in 1MiB chunks (flat RAM for ~1GB zips)
- Basic auth via MODEL_DOWNLOAD_USER / MODEL_DOWNLOAD_PASSWORD
- Drop the broken public.vinai.io default; compose requires MODEL_URL
- Add .env.example with WebDAV and public-share recipes
- Remove scripts/download-phow2v.sh (pointed at the dead mirror)
- README rewritten around the Nextcloud workflow; update license caveat
2026-04-23 10:44:33 +07:00
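A streamed, Basic-authenticated download like the one this commit describes can be sketched with the standard library alone. The helper names are hypothetical and the actual service may use a different HTTP client; the 1 MiB chunk size and the Basic scheme are the only details taken from the commit message.

```python
import base64
import shutil
import urllib.request

CHUNK = 1024 * 1024  # 1 MiB, so memory use stays flat regardless of archive size


def basic_auth_header(user, password):
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def build_request(url, user=None, password=None):
    """Create a GET request, attaching Basic auth only when credentials are given."""
    req = urllib.request.Request(url)
    if user is not None and password is not None:
        req.add_header("Authorization", basic_auth_header(user, password))
    return req


def download(url, dest_path, user=None, password=None):
    """Stream the response body to disk one chunk at a time."""
    req = build_request(url, user, password)
    with urllib.request.urlopen(req) as resp, open(dest_path, "wb") as out:
        shutil.copyfileobj(resp, out, length=CHUNK)
```

Copying chunk by chunk means the ~1GB zip never has to fit in RAM at once, which is the point of the first bullet above.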
tiennm99 8dd17acd4f feat: initial phow2sim service
Tiny FastAPI service over PhoW2V Vietnamese word vectors. Mirrors
word2sim's endpoint shapes (/similarity /neighbors /vocab /random) so
clients can swap URLs without code changes.

- Auto-downloads VinAI's PhoW2V on first boot, caches binary .bin for ~5x faster restarts
- Viet-aware canonicalizer: exact -> lowercase -> space-to-underscore
- Supports both word (compound) and syllable variants via env
- Unicode-aware random-word filter accepts diacritics, rejects digits/punct
2026-04-23 10:05:50 +07:00
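The Unicode-aware random-word filter in the last bullet might be approximated as below. This is an illustrative sketch only: the function name is invented, and treating underscore-joined compounds as acceptable is an assumption not stated in the commit.

```python
import unicodedata


def is_plain_word_sketch(token):
    """Accept tokens made only of letters (diacritics included), reject digits/punctuation.

    Assumption: underscore-joined compounds such as "xin_chào" are allowed,
    as long as every part between underscores is purely alphabetic.
    """
    # Compose to NFC first: in decomposed form, combining marks are not
    # alphabetic and would make str.isalpha() reject valid Vietnamese words.
    normalized = unicodedata.normalize("NFC", token)
    parts = normalized.split("_")
    return all(part.isalpha() for part in parts)
```

`str.isalpha()` works on Unicode letter categories rather than ASCII ranges, so characters like "đ" and "ộ" pass while "covid19" or "a,b" do not.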