Mirror of https://github.com/tiennm99/miti-scraper.git (synced 2026-06-16 16:48:44 +00:00)
miti-scraper
Configurable single-site web scraper in Go. Crawls one root URL, follows internal links that match whitelist regexes, strips HTML to plain text, and writes one .txt per page under data/.
Built on gocolly/colly.
Config — config.yaml
root_url: "https://example.com/"
# Only URLs matching at least one regex are crawled and saved
whitelist:
- "^https?://([^/]*\\.)?example\\.com(/[^?]*)?$"
# Newline-delimited list of already-processed URLs. Auto-resumes across runs.
data_file: "processed_urls.txt"
# Politeness — seconds between requests
delay_seconds: 1
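To illustrate how the whitelist might be applied, here is a minimal sketch using only the standard library. The helper names `compileWhitelist` and `allowed` are hypothetical and are not taken from the project's source; the pattern is the one from the config above, after YAML has unescaped `\\.` to `\.`.

```go
package main

import (
	"fmt"
	"regexp"
)

// compileWhitelist (hypothetical helper) compiles each pattern once,
// so matching stays cheap for every URL seen during the crawl.
func compileWhitelist(patterns []string) ([]*regexp.Regexp, error) {
	compiled := make([]*regexp.Regexp, 0, len(patterns))
	for _, p := range patterns {
		re, err := regexp.Compile(p)
		if err != nil {
			return nil, fmt.Errorf("invalid whitelist pattern %q: %w", p, err)
		}
		compiled = append(compiled, re)
	}
	return compiled, nil
}

// allowed (hypothetical helper) reports whether rawURL matches
// at least one whitelist regex.
func allowed(whitelist []*regexp.Regexp, rawURL string) bool {
	for _, re := range whitelist {
		if re.MatchString(rawURL) {
			return true
		}
	}
	return false
}

func main() {
	wl, err := compileWhitelist([]string{
		`^https?://([^/]*\.)?example\.com(/[^?]*)?$`,
	})
	if err != nil {
		panic(err)
	}
	for _, u := range []string{
		"https://example.com/about",    // root host with a path
		"https://blog.example.com/",    // subdomain
		"https://example.com/page?x=1", // query string
		"https://other.org/",           // foreign host
	} {
		fmt.Printf("%-32s %v\n", u, allowed(wl, u))
	}
}
```

With this pattern, subdomains of example.com are accepted, while any URL carrying a query string is rejected because the path group `(/[^?]*)?` cannot consume a `?`.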
Run
go run .
Output
data/<urlhost>_<urlpath>.txt: one file per scraped page, HTML stripped to whitespace-collapsed text (drops <script>, <style>, <noscript>, <head>)
processed_urls.txt: running ledger of visited URLs; re-running skips them
Error handling
The scraper logs the following conditions and keeps crawling instead of aborting: 301/302/303/307/308 REDIRECT, 403 BLOCKED, 404 NOT_FOUND, 429 RATE_LIMITED, and network errors.
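One way to picture the status-to-label mapping is the sketch below. The function name `classify` is hypothetical, and so are the `NETWORK_ERROR` and `HTTP_ERROR` labels and the convention that a status of 0 means no HTTP response was received; only the four labels listed above come from this README. How the project wires this into its colly callbacks is not shown here.

```go
package main

import (
	"fmt"
	"net/http"
)

// classify (hypothetical) maps an HTTP status code to a log label.
// Assumption: status 0 stands for "no response", i.e. a network error.
func classify(status int) string {
	switch status {
	case http.StatusMovedPermanently, // 301
		http.StatusFound,             // 302
		http.StatusSeeOther,          // 303
		http.StatusTemporaryRedirect, // 307
		http.StatusPermanentRedirect: // 308
		return "REDIRECT"
	case http.StatusForbidden: // 403
		return "BLOCKED"
	case http.StatusNotFound: // 404
		return "NOT_FOUND"
	case http.StatusTooManyRequests: // 429
		return "RATE_LIMITED"
	case 0:
		return "NETWORK_ERROR" // hypothetical label, not from the README
	default:
		return "HTTP_ERROR" // hypothetical catch-all label
	}
}

func main() {
	// Log and continue: no status in this list stops the loop.
	for _, status := range []int{301, 403, 404, 429, 0} {
		fmt.Printf("status=%d label=%s\n", status, classify(status))
	}
}
```

Keeping the mapping in a pure function makes it easy to unit-test separately from the crawler's request handling.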
License
Apache-2.0 — see LICENSE.