Commit Graph

4 Commits

Author SHA1 Message Date
tiennm99 8f86e4dc3b feat: refresh data from baotintuc.vn source, fix overflow sheet loss
Dataset update:
- Crawl all 63 .xls province files from baotintuc.vn CDN (original source)
- Old xlsx dataset moved to data-old/ for reference
- Net: +13,719 students (Hà Nội +7,275, HCM +6,445) — the old .xls → xlsx
  conversion silently dropped rows beyond the 65,536 per-sheet cap
- Also removes 1 bogus header row that had leaked into the old DB
- 100% identical scores on the 847,348 SBDs present in both datasets

Build pipeline:
- build-database.js: iterate ALL sheets per workbook (fixes the overflow
  loss) and accept .xls in addition to .xlsx
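
A minimal sketch of the all-sheets fix described above, not the actual contents of build-database.js. It assumes a workbook object with the SheetJS-style shape `{ SheetNames, Sheets }`. The names `collectAllRows`, `sheetToRows`, and `isSupportedWorkbook` are hypothetical.

```javascript
// Gather rows from every sheet of a workbook instead of only the first one.
// `sheetToRows` is a hypothetical converter supplied by the caller; with
// SheetJS it could be (ws) => XLSX.utils.sheet_to_json(ws, { header: 1 }).
function collectAllRows(workbook, sheetToRows) {
  const rows = [];
  for (const name of workbook.SheetNames) {
    const sheet = workbook.Sheets[name];
    if (!sheet) continue; // skip names that have no sheet data
    for (const row of sheetToRows(sheet)) rows.push(row);
  }
  return rows;
}

// Accept both legacy .xls and .xlsx inputs (case-insensitive extension match).
function isSupportedWorkbook(filename) {
  return /\.(xls|xlsx)$/i.test(filename);
}
```

Reading only the first sheet would miss any rows that a province file continues on a second sheet once the 65,536-row cap is reached, which is the loss this commit reports.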

Audit tooling:
- scripts/crawl-baotintuc.js: idempotent 63-province downloader
- scripts/diff-datasets.js: compares two DBs by SBD set and per-column
  score deltas
2026-04-14 21:42:29 +07:00
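
The comparison that commit 8f86e4dc3b attributes to scripts/diff-datasets.js (SBD set plus per-column score deltas) could look roughly like the sketch below. It is an illustration under assumptions, not the script itself: each dataset is modelled as a `Map` from SBD to a plain object of scores, and the function name `diffDatasets` and the column names are hypothetical.

```javascript
// Compare two datasets keyed by SBD (candidate number).
// oldDb / newDb: Map<string, Record<string, number>> (assumed shape).
function diffDatasets(oldDb, newDb) {
  const onlyInOld = [];
  const onlyInNew = [];
  const changedByColumn = {}; // column name -> number of SBDs whose score differs
  let shared = 0;

  for (const [sbd, oldScores] of oldDb) {
    const newScores = newDb.get(sbd);
    if (newScores === undefined) {
      onlyInOld.push(sbd);
      continue;
    }
    shared += 1;
    const columns = new Set([...Object.keys(oldScores), ...Object.keys(newScores)]);
    for (const col of columns) {
      if (oldScores[col] !== newScores[col]) {
        changedByColumn[col] = (changedByColumn[col] || 0) + 1;
      }
    }
  }
  for (const sbd of newDb.keys()) {
    if (!oldDb.has(sbd)) onlyInNew.push(sbd);
  }
  return { shared, onlyInOld, onlyInNew, changedByColumn };
}
```

Under this model, the "100% identical scores" claim corresponds to an empty `changedByColumn`, and the net student gain corresponds to `onlyInNew.length - onlyInOld.length`.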
tiennm99 718e2e9117 refactor: flatten data layout to data/, drop update/ overrides
- Move 63 Excel files from data/raw/ to data/ (single flat dir)
- Remove all 53 files in data/raw/update/: verified identical SBD
  coverage to raw/ (847,349 rows either way), so they added no new
  students — only potential score corrections that can be reintroduced
  later if source is recovered
- Update build-database.js to read data/ directly
- Add scripts/audit-row-counts.js: compares source row count to DB row
  count to verify zero-loss parsing
- Point check-duplicates.js at new data/ location
2026-04-14 21:02:47 +07:00
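
The zero-loss check that commit 718e2e9117 describes for scripts/audit-row-counts.js can be sketched as follows. This assumes per-file row counts have already been computed for both the source files and the database; `auditRowCounts` and the input shapes are hypothetical.

```javascript
// Compare per-file row counts from the source files against the database.
// sourceCounts / dbCounts: Record<string, number> keyed by file name (assumed).
// Returns one entry per file whose counts disagree; an empty array means
// every source row made it into the database.
function auditRowCounts(sourceCounts, dbCounts) {
  const mismatches = [];
  const files = new Set([...Object.keys(sourceCounts), ...Object.keys(dbCounts)]);
  for (const file of files) {
    const source = sourceCounts[file] ?? 0;
    const db = dbCounts[file] ?? 0;
    if (source !== db) {
      mismatches.push({ file, source, db, lost: source - db });
    }
  }
  return mismatches;
}
```

An audit like this is only as good as its source count: if that count is taken from the first sheet alone, rows on overflow sheets are missing from both sides and the mismatch stays hidden.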
tiennm99 f10046f63d chore: remove duplicate Excel files, add md5 audit script
- Drop 10_LamDong_GNFT (1) and 2.BacKan_YQNX(1): identical row content to
  siblings (Excel metadata differs but file size & sheet rows match)
- Add scripts/check-duplicates.js to detect byte-identical and row-identical
  files across data/raw and data/raw/update
2026-04-14 20:49:41 +07:00
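
Commit f10046f63d distinguishes byte-identical files from row-identical ones, since the removed duplicates differed in Excel metadata but not in row content. The sketch below illustrates that distinction with MD5 fingerprints from Node's built-in `crypto` module. It is an assumption about the approach, not the contents of scripts/check-duplicates.js; `groupDuplicates`, `byBytes`, `byRows`, and the `{ name, bytes, rows }` item shape are hypothetical.

```javascript
const { createHash } = require('node:crypto');

// Hex MD5 digest of a string or Buffer.
const md5 = (data) => createHash('md5').update(data).digest('hex');

// Group items that share a fingerprint; return only groups with 2+ members.
function groupDuplicates(items, fingerprint) {
  const groups = new Map();
  for (const item of items) {
    const key = fingerprint(item);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(item.name);
  }
  return [...groups.values()].filter((names) => names.length > 1);
}

// Byte-identical: hash the raw file contents.
const byBytes = (file) => md5(file.bytes);
// Row-identical: hash the parsed rows, ignoring workbook metadata.
const byRows = (file) => md5(JSON.stringify(file.rows));
```

Two copies of a workbook that differ only in metadata would fall into separate groups under `byBytes` but into the same group under `byRows`.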
tiennm99 4474547433 refactor: remove Java code, move web app to project root
- Remove Gradle build, Java sources, Hibernate config, old database.sqlite
- Move Excel data files from src/main/resources/raw/ to data/raw/
- Move Vite+React app from web/ to project root
- Merge package.json into single root-level config
- Update build script paths and CI workflow accordingly
2026-04-13 00:06:22 +07:00