mirror of
https://github.com/tiennm99/thptqg2017.git
synced 2026-05-13 22:58:31 +00:00
8f86e4dc3b
Dataset update:
- Crawl all 63 .xls province files from baotintuc.vn CDN (original source)
- Old xlsx dataset moved to data-old/ for reference
- Net: +13,719 students (Hà Nội +7,275, HCM +6,445); the old .xls → xlsx conversion silently dropped rows beyond the 65,536 per-sheet cap
- Also removes 1 bogus header row that had leaked into the old DB
- 100% identical scores on the 847,348 SBDs present in both datasets

Build pipeline:
- build-database.js: iterate ALL sheets per workbook (fixes the overflow loss) and accept .xls in addition to .xlsx

Audit tooling:
- scripts/crawl-baotintuc.js: idempotent 63-province downloader
- scripts/diff-datasets.js: compares two DBs by SBD set and per-column score deltas
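
The comparison described for scripts/diff-datasets.js (SBD set difference plus per-column score deltas) could look roughly like the sketch below. This is not the script from the repository: the function name `diffDatasets`, the in-memory `Map` representation, the column names `toan` and `van`, and the sample SBDs are all assumptions made for illustration.

```javascript
// Hypothetical sketch, NOT the actual scripts/diff-datasets.js.
// Assumption: each dataset is a Map from SBD (candidate number, a string)
// to an object of per-subject scores, e.g. { toan: 8.0, van: 6.5 }.

function diffDatasets(oldDb, newDb) {
  const onlyInOld = [];   // SBDs missing from the new dataset
  const onlyInNew = [];   // SBDs missing from the old dataset
  const scoreDeltas = []; // { sbd, column, oldValue, newValue }
  let shared = 0;         // SBDs present in both datasets

  for (const [sbd, oldRow] of oldDb) {
    const newRow = newDb.get(sbd);
    if (newRow === undefined) {
      onlyInOld.push(sbd);
      continue;
    }
    shared += 1;

    // Compare every column that appears in either row; a column missing
    // on one side is treated as null.
    const columns = new Set([...Object.keys(oldRow), ...Object.keys(newRow)]);
    for (const column of columns) {
      const oldValue = oldRow[column] ?? null;
      const newValue = newRow[column] ?? null;
      if (oldValue !== newValue) {
        scoreDeltas.push({ sbd, column, oldValue, newValue });
      }
    }
  }

  for (const sbd of newDb.keys()) {
    if (!oldDb.has(sbd)) onlyInNew.push(sbd);
  }

  return { shared, onlyInOld, onlyInNew, scoreDeltas };
}

// Usage with made-up rows (the SBDs and scores are illustrative only).
const sampleOld = new Map([
  ["01000001", { toan: 8.0, van: 6.5 }],
  ["01000002", { toan: 7.0, van: 5.0 }],
]);
const sampleNew = new Map([
  ["01000001", { toan: 8.0, van: 6.5 }],
  ["01000002", { toan: 7.0, van: 5.0 }],
  ["01000003", { toan: 9.0, van: 7.5 }],
]);
const report = diffDatasets(sampleOld, sampleNew);
console.log(report.shared, report.onlyInNew, report.scoreDeltas.length);
```

An empty `scoreDeltas` together with a non-empty `onlyInNew` is the pattern the commit reports: scores agree on every shared SBD, and the new crawl only adds rows that the old conversion had dropped.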