Project 4: Concurrent Web Scraper
A concurrent web crawler that discovers and fetches pages from a website, respects robots.txt, limits concurrent requests, and extracts structured data—all using goroutines and channels.
Quick Reference
| Attribute | Value |
|---|---|
| Primary Language | Go |
| Alternative Languages | Python, Rust, Node.js |
| Difficulty | Level 2: Intermediate |
| Time Estimate | 1-2 weeks |
| Knowledge Area | Concurrency, HTTP Client, HTML Parsing |
| Tooling | colly (optional), goquery |
| Prerequisites | Completed Projects 1-3. Understand goroutines, channel basics, and HTTP requests. |
What You Will Build
A command-line crawler that starts from a seed URL, fetches pages concurrently with a bounded worker pool, respects robots.txt, rate-limits and deduplicates requests, follows links to a configurable depth, and writes the extracted titles and links to results.json—all built on goroutines and channels.
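Of the features above, robots.txt handling is the one with no standard-library helper. Below is a minimal sketch, assuming the third-party github.com/temoto/robotstxt parser (one common choice, not a project requirement) and treating a missing or unreadable robots.txt as allow-all:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"

	"github.com/temoto/robotstxt" // third-party parser; an assumption, any parser works
)

// fetchRobots downloads and parses /robots.txt for the target site.
// A missing or unreadable robots.txt is treated by the caller as "allow everything".
func fetchRobots(base *url.URL) (*robotstxt.RobotsData, error) {
	robotsURL := base.Scheme + "://" + base.Host + "/robots.txt"
	resp, err := http.Get(robotsURL)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return robotstxt.FromBytes(body)
}

func main() {
	base, _ := url.Parse("https://example.com")
	robots, err := fetchRobots(base)
	if err != nil {
		fmt.Println("no usable robots.txt, allowing all paths:", err)
		return
	}
	// TestAgent reports whether the given path is allowed for this user agent.
	fmt.Println(robots.TestAgent("/products/item1", "scraper"))
}
```

In the crawler, fetch robots.txt once per host and consult it before enqueueing a URL.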
Why It Matters
Coordinating many in-flight requests, limiting their rate, deduplicating work, and shutting down cleanly are skills that appear repeatedly in real-world systems and tooling: API clients, data pipelines, log shippers, and monitoring agents all rely on the same patterns.
Core Challenges
- Coordinating concurrent fetchers → maps to goroutines, channels, and WaitGroups (see the worker-pool sketch after this list)
- Rate limiting requests → maps to time.Ticker and semaphores (sketched after the Key Concepts list)
- Tracking visited URLs → maps to sync.Map or mutex-protected maps
- Graceful shutdown → maps to context.Context and cancellation
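A minimal sketch of the first and third challenges together, assuming a fixed list of seed URLs (a real crawler also feeds discovered links back into the queue and tracks depth): a pool of worker goroutines reads from a channel, a mutex-protected map deduplicates URLs, and a sync.WaitGroup waits for the pool to drain. The visitedSet type and worker count are illustrative.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
)

// visitedSet deduplicates URLs across workers; the mutex guards the map.
type visitedSet struct {
	mu   sync.Mutex
	seen map[string]bool
}

// markNew returns true the first time a URL is seen, false afterwards.
func (v *visitedSet) markNew(u string) bool {
	v.mu.Lock()
	defer v.mu.Unlock()
	if v.seen[u] {
		return false
	}
	v.seen[u] = true
	return true
}

func main() {
	seeds := []string{
		"https://example.com/",
		"https://example.com/about",
		"https://example.com/", // duplicate: skipped by the visited set
	}

	jobs := make(chan string)
	visited := &visitedSet{seen: make(map[string]bool)}

	var wg sync.WaitGroup
	const workers = 3
	for i := 1; i <= workers; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for u := range jobs { // each worker pulls URLs until the channel closes
				if !visited.markNew(u) {
					continue
				}
				fmt.Printf("[Worker %d] Fetching: %s\n", id, u)
				resp, err := http.Get(u)
				if err != nil {
					fmt.Printf("[Worker %d] error: %v\n", id, err)
					continue
				}
				// A full crawler would parse links from resp.Body here and
				// feed new URLs back into the queue before closing it.
				resp.Body.Close()
			}
		}(i)
	}

	for _, u := range seeds {
		jobs <- u
	}
	close(jobs)
	wg.Wait()
}
```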
Key Concepts
- Goroutines and channels: “Concurrency in Go” Ch. 3 - Katherine Cox-Buday
- Worker pools: “Concurrency in Go” Ch. 4 - Katherine Cox-Buday
- Context for cancellation: “Learning Go” Ch. 14 - Jon Bodner
- HTTP client: “The Go Programming Language” Ch. 5 - Donovan & Kernighan
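A rough sketch of the context and rate-limiting ideas above, assuming one shared ticker and a crawl-wide timeout (a real run would more likely cancel on Ctrl-C via signal.NotifyContext); the fetch helper name and the 100ms interval mirror the sample invocation but are otherwise illustrative:

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// fetch issues an HTTP GET that aborts as soon as ctx is cancelled.
func fetch(ctx context.Context, url string) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	fmt.Println(url, resp.Status)
	return nil
}

func main() {
	// Cancel the whole crawl after 5 seconds.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	urls := []string{"https://example.com/", "https://example.com/about"}

	// One tick every 100ms spaces requests out, mirroring --delay 100ms.
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()

	for _, u := range urls {
		select {
		case <-ctx.Done():
			fmt.Println("crawl cancelled:", ctx.Err())
			return
		case <-ticker.C:
			if err := fetch(ctx, u); err != nil {
				fmt.Println("error:", err)
			}
		}
	}
}
```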
Real-World Outcome
$ ./scraper --url https://example.com --depth 3 --workers 10 --delay 100ms
Starting crawl of https://example.com
Max depth: 3
Workers: 10
Delay between requests: 100ms
[Worker 1] Fetching: https://example.com/
[Worker 2] Fetching: https://example.com/about
[Worker 3] Fetching: https://example.com/products
[Worker 1] Found 15 links on /
[Worker 4] Fetching: https://example.com/products/item1
...
Crawl complete!
Pages fetched: 127
Time elapsed: 12.3s
Errors: 3 (404 not found)
Results saved to: results.json
$ cat results.json
{
  "pages": [
    {
      "url": "https://example.com/",
      "title": "Example - Home",
      "links": ["https://example.com/about", ...],
      "fetched_at": "2025-01-10T14:30:00Z"
    },
    ...
  ]
}
# Live progress visualization:
$ ./scraper --url https://news.ycombinator.com --live
┌─────────────────────────────────────────────────────────┐
│ Active: 10/10 workers | Queue: 234 | Done: 89 | Err: 2 │
│ ████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 27% │
└─────────────────────────────────────────────────────────┘
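The results.json sample above maps naturally onto a pair of structs with JSON tags. A sketch, where the field and file names come from the sample output and the Result wrapper name is an assumption:

```go
package main

import (
	"encoding/json"
	"os"
	"time"
)

// Page mirrors one entry in the "pages" array of results.json.
type Page struct {
	URL       string    `json:"url"`
	Title     string    `json:"title"`
	Links     []string  `json:"links"`
	FetchedAt time.Time `json:"fetched_at"`
}

// Result is the top-level document written to results.json.
type Result struct {
	Pages []Page `json:"pages"`
}

func main() {
	res := Result{Pages: []Page{{
		URL:       "https://example.com/",
		Title:     "Example - Home",
		Links:     []string{"https://example.com/about"},
		FetchedAt: time.Now().UTC(),
	}}}

	f, err := os.Create("results.json")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	enc := json.NewEncoder(f)
	enc.SetIndent("", "  ") // pretty-print, matching the sample output
	if err := enc.Encode(res); err != nil {
		panic(err)
	}
}
```

Using time.Time for fetched_at yields the RFC 3339 timestamps shown in the sample without extra formatting code.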
Implementation Guide
- Reproduce the simplest happy-path scenario: fetch one page and print its title and links (see the sketch after this list).
- Build the smallest working version of the core feature: a URL queue, a worker pool, and a visited set crawling to a fixed depth.
- Add input validation and error handling (bad flags, unreachable hosts, non-HTML responses, HTTP errors).
- Add instrumentation/logging to confirm behavior: per-worker fetch lines plus counts of pages fetched and errors.
- Refactor into clean modules with tests, for example separate fetching, parsing, and output packages.
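A sketch of that first happy-path step, assuming the goquery dependency listed under Tooling; only the --url flag from the sample invocation is wired up, and error handling is deliberately thin:

```go
package main

import (
	"flag"
	"fmt"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	start := flag.String("url", "https://example.com", "start URL")
	flag.Parse()

	resp, err := http.Get(*start)
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	defer resp.Body.Close()

	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}

	// Print the page title and every href found in an anchor tag.
	fmt.Println("Title:", doc.Find("title").First().Text())
	doc.Find("a[href]").Each(func(_ int, s *goquery.Selection) {
		if href, ok := s.Attr("href"); ok {
			fmt.Println("link:", href)
		}
	})
}
```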
Milestones
- Milestone 1: Minimal working program that runs end-to-end (fetch one page, extract and print its links).
- Milestone 2: Correct outputs for typical inputs (concurrent crawl to the requested depth, results written to results.json).
- Milestone 3: Robust handling of edge cases: relative and malformed URLs, off-site links, duplicates, and HTTP errors (see the URL-normalization sketch after this list).
- Milestone 4: Clean structure and documented usage, including the flags shown above and the optional --live view.
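For the edge-case milestone, relative and off-site links are the usual trap. A sketch of resolving and filtering them with the standard net/url package; the normalize helper name is illustrative:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normalize resolves href against the page it was found on and reports
// whether the result stays on the same host. Fragments are stripped so
// "/about#team" and "/about" dedupe to the same URL.
func normalize(pageURL, href string) (string, bool) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", false
	}
	ref, err := url.Parse(strings.TrimSpace(href))
	if err != nil {
		return "", false
	}
	abs := base.ResolveReference(ref)
	abs.Fragment = ""
	if abs.Host != base.Host || (abs.Scheme != "http" && abs.Scheme != "https") {
		return "", false // skip mailto:, javascript:, and off-site links
	}
	return abs.String(), true
}

func main() {
	page := "https://example.com/products/"
	for _, href := range []string{"item1", "/about#team", "https://other.site/", "mailto:hi@example.com"} {
		u, ok := normalize(page, href)
		fmt.Printf("%-28s -> %q %v\n", href, u, ok)
	}
}
```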
Validation Checklist
- Output matches the real-world outcome example
- Handles invalid inputs safely
- Provides clear errors and exit codes
- Repeatable results across runs (see the test sketch after this list)
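One way to make the checklist concrete is an in-process test site built with net/http/httptest. The sketch below assumes your crawler exposes a Crawl(startURL, depth, workers) entry point returning the fetched pages; that signature is an assumption about your own code, not a given:

```go
package scraper

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"testing"
)

// TestCrawlFindsAllPages spins up a tiny in-process site and checks that
// the crawler visits every page exactly once, run after run.
// Crawl is assumed to be your crawler's entry point with the hypothetical
// signature Crawl(startURL string, depth, workers int) ([]Page, error).
func TestCrawlFindsAllPages(t *testing.T) {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, `<a href="/a">a</a> <a href="/b">b</a>`)
	})
	mux.HandleFunc("/a", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, `<a href="/">home</a>`) // cycle back to the root
	})
	mux.HandleFunc("/b", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "leaf page")
	})
	srv := httptest.NewServer(mux)
	defer srv.Close()

	for run := 0; run < 3; run++ { // repeatability: same result every run
		pages, err := Crawl(srv.URL, 3, 4)
		if err != nil {
			t.Fatalf("run %d: crawl failed: %v", run, err)
		}
		if got, want := len(pages), 3; got != want {
			t.Errorf("run %d: got %d pages, want %d", run, got, want)
		}
	}
}
```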
References
- Main guide: LEARN_GO_DEEP_DIVE.md - “Concurrency in Go” by Katherine Cox-Buday