Dataset update:
- Crawl all 63 .xls province files from baotintuc.vn CDN (original source)
- Old xlsx dataset moved to data-old/ for reference
- Net: +13,719 students (Hà Nội +7,275, HCM +6,445) — the old .xls → xlsx
conversion silently dropped rows beyond the 65,536 per-sheet cap
- Also removes 1 bogus header row that had leaked into the old DB
- 100% identical scores on the 847,348 SBDs present in both datasets
Build pipeline:
- build-database.js: iterate ALL sheets per workbook (fixes the overflow
loss) and accept .xls in addition to .xlsx
Audit tooling:
- scripts/crawl-baotintuc.js: idempotent 63-province downloader
- scripts/diff-datasets.js: compares two DBs by SBD set and per-column
score deltas