Dataset update:
- Crawl all 63 .xls province files from baotintuc.vn CDN (original source)
- Move old xlsx dataset to data-old/ for reference
- Net: +13,719 rows (Hà Nội +7,275 and HCM +6,445 students, minus the 1
bogus header row noted below); the old .xls → xlsx conversion silently
dropped rows beyond the 65,536-row per-sheet cap
- Also removes 1 bogus header row that had leaked into the old DB
- 100% identical scores on the 847,348 SBDs present in both datasets
Build pipeline:
- build-database.js: iterate ALL sheets per workbook (fixes the overflow
loss) and accept .xls in addition to .xlsx
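
A minimal sketch of the all-sheets iteration, not the actual code in
build-database.js. It assumes a workbook object shaped like SheetJS's
(`SheetNames` plus `Sheets`); whether the script uses SheetJS is an
assumption. `collectAllRows` and `fakeWorkbook` are hypothetical names, and
the sheet-to-rows converter is injected so the example runs without any
dependency (with SheetJS it would be `XLSX.utils.sheet_to_json`).

```javascript
// Collect rows from every sheet of a workbook, not just the first one.
// `workbook` follows the SheetJS shape: { SheetNames: string[], Sheets: {...} }.
// `sheetToRows` converts one sheet into an array of row objects.
function collectAllRows(workbook, sheetToRows) {
  const rows = [];
  for (const name of workbook.SheetNames) {
    const sheetRows = sheetToRows(workbook.Sheets[name]);
    // Push row by row; spreading a very large array into push() can
    // exceed the engine's argument limit.
    for (const row of sheetRows) rows.push(row);
  }
  return rows;
}

// Stand-in workbook: a province whose rows overflowed into a second sheet.
const fakeWorkbook = {
  SheetNames: ['Sheet1', 'Sheet2'],
  Sheets: {
    Sheet1: [{ sbd: '01000001' }, { sbd: '01000002' }],
    Sheet2: [{ sbd: '01000003' }],
  },
};

// Here each fake sheet is already an array of rows, so the converter is identity.
const allRows = collectAllRows(fakeWorkbook, (sheet) => sheet);
console.log(allRows.length); // 3
```

Reading only `SheetNames[0]` in this example would return 2 rows and lose
the overflow sheet, which is the failure mode described above.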
Audit tooling:
- scripts/crawl-baotintuc.js: idempotent 63-province downloader
- scripts/diff-datasets.js: compares two DBs by SBD set and per-column
score deltas
- Move 63 Excel files from data/raw/ to data/ (single flat dir)
- Remove all 53 files in data/raw/update/: their SBD coverage was verified
identical to raw/ (847,349 rows either way), so they added no new
students, only potential score corrections, which can be reintroduced
later if the source is recovered
- Update build-database.js to read data/ directly
- Add scripts/audit-row-counts.js: compares source row count to DB row
count to verify zero-loss parsing
- Point check-duplicates.js at new data/ location
- Drop 10_LamDong_GNFT (1) and 2.BacKan_YQNX(1): row content is identical
to their siblings (Excel metadata differs, but file size and sheet row
counts match)
- Add scripts/check-duplicates.js to detect byte-identical and row-identical
files across data/raw and data/raw/update
- Remove Gradle build, Java sources, Hibernate config, old database.sqlite
- Move Excel data files from src/main/resources/raw/ to data/raw/
- Move Vite+React app from web/ to project root
- Merge package.json into single root-level config
- Update build script paths and CI workflow accordingly
- Node script parses 119 Excel files into SQLite (847K students)
- Vite + React frontend with sql.js for client-side querying
- Search by exam ID (số báo danh) or student name
- Gzipped DB (36MB) with download progress bar
- GitHub Actions workflow for GitHub Pages deployment
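
For reference, the SBD-set and per-column comparison attributed to
scripts/diff-datasets.js above could look roughly like the sketch below.
This is an illustration under stated assumptions, not the script itself:
each dataset is modeled as an in-memory Map from SBD to a score object, and
the names `diffDatasets`, `toan`, and `van` are hypothetical.

```javascript
// Compare two datasets keyed by SBD (exam ID).
// Each dataset is a Map: SBD -> object of per-subject scores.
function diffDatasets(oldDb, newDb) {
  const onlyInOld = [];
  const onlyInNew = [];
  const scoreDeltas = {}; // column -> count of shared SBDs whose score differs
  let shared = 0;

  for (const [sbd, oldScores] of oldDb) {
    const newScores = newDb.get(sbd);
    if (newScores === undefined) {
      onlyInOld.push(sbd);
      continue;
    }
    shared += 1;
    const columns = new Set([
      ...Object.keys(oldScores),
      ...Object.keys(newScores),
    ]);
    for (const col of columns) {
      // Treat a missing score as null so undefined and null compare equal.
      const a = oldScores[col] ?? null;
      const b = newScores[col] ?? null;
      if (a !== b) scoreDeltas[col] = (scoreDeltas[col] || 0) + 1;
    }
  }
  for (const sbd of newDb.keys()) {
    if (!oldDb.has(sbd)) onlyInNew.push(sbd);
  }
  return { shared, onlyInOld, onlyInNew, scoreDeltas };
}

// Made-up sample data, for illustration only.
const oldDb = new Map([
  ['01000001', { toan: 8.5, van: 7.0 }],
  ['01000002', { toan: 6.0, van: 6.5 }],
]);
const newDb = new Map([
  ['01000001', { toan: 8.5, van: 7.0 }],
  ['01000002', { toan: 6.0, van: 6.75 }],
  ['01000003', { toan: 9.0, van: 8.0 }],
]);

const report = diffDatasets(oldDb, newDb);
console.log(report);
```

An empty `scoreDeltas` together with an empty `onlyInOld` corresponds to
the "100% identical scores" result reported for the shared SBDs.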