dependabot[bot] ec39695f1b build(deps): bump the gomod-minor-patch group across 1 directory with 2 updates (#7)
Bumps the gomod-minor-patch group with 1 update in the / directory: [github.com/gocolly/colly/v2](https://github.com/gocolly/colly).


Updates `github.com/gocolly/colly/v2` from 2.1.0 to 2.3.0
- [Release notes](https://github.com/gocolly/colly/releases)
- [Changelog](https://github.com/gocolly/colly/blob/master/CHANGELOG.md)
- [Commits](https://github.com/gocolly/colly/compare/v2.1.0...v2.3.0)

Updates `golang.org/x/net` from 0.43.0 to 0.47.0
- [Commits](https://github.com/golang/net/compare/v0.43.0...v0.47.0)

---
updated-dependencies:
- dependency-name: github.com/gocolly/colly/v2
  dependency-version: 2.3.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: gomod-minor-patch
- dependency-name: golang.org/x/net
  dependency-version: 0.47.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: gomod-minor-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-23 18:15:25 +07:00

miti-scraper

Configurable single-site web scraper in Go. Crawls one root URL, follows internal links that match whitelist regexes, strips HTML to plain text, and writes one .txt per page under data/.

Built on gocolly/colly.

Config — config.yaml

root_url: "https://example.com/"

# Only URLs matching at least one regex are crawled and saved
whitelist:
  - "^https?://([^/]*\\.)?example\\.com(/[^?]*)?$"

# Newline-delimited list of already-processed URLs. Auto-resumes across runs.
data_file: "processed_urls.txt"

# Politeness — seconds between requests
delay_seconds: 1
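
As an illustration of how the whitelist rule could be applied, here is a minimal sketch using only the standard regexp package. The helper names compileWhitelist and allowed are hypothetical and are not taken from the scraper's source; the pattern is the one from config.yaml after YAML has unescaped "\\." to "\.".

```go
package main

import (
	"fmt"
	"regexp"
)

// compileWhitelist compiles every pattern up front so that a bad regex
// is reported before any request is made. (Hypothetical helper.)
func compileWhitelist(patterns []string) ([]*regexp.Regexp, error) {
	out := make([]*regexp.Regexp, 0, len(patterns))
	for _, p := range patterns {
		re, err := regexp.Compile(p)
		if err != nil {
			return nil, fmt.Errorf("whitelist pattern %q: %w", p, err)
		}
		out = append(out, re)
	}
	return out, nil
}

// allowed reports whether rawURL matches at least one whitelist regex.
func allowed(whitelist []*regexp.Regexp, rawURL string) bool {
	for _, re := range whitelist {
		if re.MatchString(rawURL) {
			return true
		}
	}
	return false
}

func main() {
	wl, err := compileWhitelist([]string{
		`^https?://([^/]*\.)?example\.com(/[^?]*)?$`,
	})
	if err != nil {
		panic(err)
	}
	for _, u := range []string{
		"https://example.com/",
		"https://blog.example.com/posts/1",
		"https://example.com/search?q=go", // query strings are excluded by [^?]*
		"https://example.org/",            // different domain
	} {
		fmt.Printf("%-36s allowed=%v\n", u, allowed(wl, u))
	}
}
```

With the sample pattern, subdomains of example.com pass, while any URL carrying a query string or pointing at another domain is rejected.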

Run

go run .

Output

  • data/<urlhost>_<urlpath>.txt — one file per scraped page, HTML stripped to whitespace-collapsed text (drops <script>, <style>, <noscript>, <head>)
  • processed_urls.txt — running ledger of visited URLs; re-running skips them
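
The two outputs above can be sketched roughly as follows. This is an assumption-laden illustration, not the scraper's actual code: outputPath and parseLedger are hypothetical names, and flattening "/" in the URL path to "_" is a guess at how data/<urlhost>_<urlpath>.txt is derived.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// outputPath maps a page URL to data/<urlhost>_<urlpath>.txt.
// Assumption: slashes inside the URL path are flattened to "_".
func outputPath(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	name := u.Host
	if p := strings.Trim(u.Path, "/"); p != "" {
		name += "_" + strings.ReplaceAll(p, "/", "_")
	}
	return path.Join("data", name+".txt"), nil
}

// parseLedger turns the contents of a newline-delimited ledger file
// into a set of already-processed URLs, ignoring blank lines.
func parseLedger(contents string) map[string]struct{} {
	seen := make(map[string]struct{})
	for _, line := range strings.Split(contents, "\n") {
		if line = strings.TrimSpace(line); line != "" {
			seen[line] = struct{}{}
		}
	}
	return seen
}

func main() {
	// In a real run the contents would come from processed_urls.txt.
	seen := parseLedger("https://example.com/\nhttps://example.com/docs/intro\n")

	for _, u := range []string{
		"https://example.com/docs/intro",
		"https://example.com/docs/api",
	} {
		if _, done := seen[u]; done {
			fmt.Println("skip ", u)
			continue
		}
		p, err := outputPath(u)
		if err != nil {
			panic(err)
		}
		fmt.Println("write", u, "->", p)
	}
}
```

Keeping the ledger as a set makes the "already processed?" check a single map lookup, which is why re-running can skip visited URLs cheaply.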

Error handling

Logs the failure and keeps crawling (does not abort) on 301/302/303/307/308 REDIRECT, 403 BLOCKED, 404 NOT_FOUND, 429 RATE_LIMITED, and network errors.
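
The status-to-label mapping listed above could look something like the sketch below. statusLabel is a hypothetical helper, and the log line format is an assumption; only the status codes and label names come from this README. In a colly-based scraper the status code would typically be read from the response passed to an error callback, but how this project wires that up is not shown here.

```go
package main

import "fmt"

// statusLabel returns the log label for an HTTP status code, or ""
// for codes that the README does not list. (Hypothetical helper.)
func statusLabel(code int) string {
	switch code {
	case 301, 302, 303, 307, 308:
		return "REDIRECT"
	case 403:
		return "BLOCKED"
	case 404:
		return "NOT_FOUND"
	case 429:
		return "RATE_LIMITED"
	}
	return ""
}

func main() {
	// Simulated responses: each failure is logged and the loop moves on,
	// mirroring the "log, do not abort" behavior.
	pages := []struct {
		url  string
		code int
	}{
		{"https://example.com/old", 301},
		{"https://example.com/admin", 403},
		{"https://example.com/missing", 404},
		{"https://example.com/ok", 200},
	}
	for _, p := range pages {
		if label := statusLabel(p.code); label != "" {
			fmt.Printf("%d %s %s\n", p.code, label, p.url)
			continue
		}
		fmt.Printf("%d (no label) %s\n", p.code, p.url)
	}
}
```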

License

Apache-2.0 — see LICENSE.

Description
Scrape a website
Languages
Go 100%