Project 4: Concurrent Web Scraper

A concurrent web crawler that discovers and fetches pages from a website, respects robots.txt, limits concurrent requests, and extracts structured data—all using goroutines and channels.

Quick Reference

Attribute              Value
Primary Language       Go
Alternative Languages  Python, Rust, Node.js
Difficulty             Level 2: Intermediate
Time Estimate          1-2 weeks
Knowledge Area         Concurrency, HTTP Client, HTML Parsing
Tooling                colly (optional), goquery
Prerequisites          Completed Projects 1-3. Understand goroutines, channel basics, and HTTP requests.

What You Will Build

A command-line crawler that starts from a seed URL, follows links up to a configurable depth with a pool of worker goroutines, respects robots.txt, limits concurrency and request rate, and writes the extracted data (page titles and outgoing links) to JSON, with all of the coordination done through goroutines and channels.

Why It Matters

This project builds core skills that appear repeatedly in real-world systems and tooling: bounded concurrency, rate limiting, safe access to shared state, and clean cancellation are the same patterns you will reuse in API clients, batch processors, and background job runners.

Core Challenges

  • Coordinating concurrent fetchers → maps to goroutines, channels, and WaitGroups (see the sketch after this list)
  • Rate limiting requests → maps to time.Ticker and semaphores
  • Tracking visited URLs → maps to sync.Map or mutex-protected maps
  • Graceful shutdown → maps to context.Context and cancellation
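
A minimal sketch of how these four pieces can fit together. The names (crawl, task, fetchLinks) are illustrative, fetchLinks is only a stub here, and this is one workable shape rather than the required design:

package main

import (
    "context"
    "fmt"
    "os"
    "os/signal"
    "sync"
    "time"
)

// task is one URL to fetch plus its distance from the start URL.
type task struct {
    url   string
    depth int
}

// fetchLinks is a stub for the real HTTP fetch + HTML parse step
// (a separate sketch under Key Concepts shows one way to implement it).
func fetchLinks(ctx context.Context, pageURL string) ([]string, error) {
    return nil, nil
}

// crawl wires the four challenges together: a worker pool reading from a
// channel, a mutex-protected visited map, a shared ticker for rate limiting,
// and a context for graceful shutdown.
func crawl(ctx context.Context, start string, maxDepth, workers int, delay time.Duration) {
    tasks := make(chan task)
    var pending sync.WaitGroup // counts enqueued-but-unfinished tasks

    var mu sync.Mutex
    visited := map[string]bool{start: true}

    // enqueue hands a task to the workers without ever blocking the caller,
    // so workers can safely enqueue the links they discover.
    enqueue := func(t task) {
        pending.Add(1)
        go func() { tasks <- t }()
    }

    limiter := time.NewTicker(delay) // at most one new request per tick, shared by all workers
    defer limiter.Stop()

    var wg sync.WaitGroup
    for i := 1; i <= workers; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            for t := range tasks {
                select {
                case <-ctx.Done(): // shutdown requested: drain the queue without fetching
                    pending.Done()
                    continue
                case <-limiter.C: // rate limit
                }
                links, err := fetchLinks(ctx, t.url)
                if err != nil {
                    fmt.Printf("[Worker %d] Error: %s: %v\n", id, t.url, err)
                } else {
                    fmt.Printf("[Worker %d] Found %d links on %s\n", id, len(links), t.url)
                }
                if err == nil && t.depth < maxDepth {
                    for _, link := range links {
                        mu.Lock()
                        seen := visited[link]
                        visited[link] = true
                        mu.Unlock()
                        if !seen {
                            enqueue(task{url: link, depth: t.depth + 1})
                        }
                    }
                }
                pending.Done()
            }
        }(i)
    }

    enqueue(task{url: start, depth: 0})
    pending.Wait() // every enqueued task has been processed...
    close(tasks)   // ...so the workers can exit their range loops
    wg.Wait()
}

func main() {
    // Ctrl-C cancels the context, which makes crawl drain and exit cleanly.
    ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt)
    defer stop()
    crawl(ctx, "https://example.com", 3, 10, 100*time.Millisecond)
}

The pending WaitGroup is what tells the crawler when the frontier is empty: every enqueue adds one, every finished task removes one, and the channel is only closed once the count reaches zero.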

Key Concepts

  • Goroutines and channels: “Concurrency in Go” Ch. 3 - Katherine Cox-Buday
  • Worker pools: “Concurrency in Go” Ch. 4 - Katherine Cox-Buday
  • Context for cancellation: “Learning Go” Ch. 14 - Jon Bodner
  • HTTP client: “The Go Programming Language” Ch. 5 - Donovan & Kernighan (a fetch-and-extract sketch follows this list)
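
For the fetch-and-extract step, a hedged sketch using net/http plus the goquery package already listed under Tooling (github.com/PuerkitoBio/goquery); golang.org/x/net/html is an equally valid choice. The function name extractLinks and the example URL are illustrative:

package main

import (
    "fmt"
    "net/http"
    "net/url"
    "time"

    "github.com/PuerkitoBio/goquery"
)

// extractLinks fetches pageURL and returns the absolute URLs of all <a href> links.
func extractLinks(client *http.Client, pageURL string) ([]string, error) {
    resp, err := client.Get(pageURL)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return nil, fmt.Errorf("GET %s: %s", pageURL, resp.Status)
    }

    base, err := url.Parse(pageURL)
    if err != nil {
        return nil, err
    }

    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        return nil, err
    }

    var links []string
    doc.Find("a[href]").Each(func(_ int, s *goquery.Selection) {
        href, _ := s.Attr("href")
        ref, err := url.Parse(href)
        if err != nil {
            return // skip malformed hrefs
        }
        abs := base.ResolveReference(ref) // turn relative links into absolute URLs
        if abs.Scheme == "http" || abs.Scheme == "https" {
            links = append(links, abs.String())
        }
    })
    return links, nil
}

func main() {
    client := &http.Client{Timeout: 10 * time.Second}
    links, err := extractLinks(client, "https://example.com/")
    if err != nil {
        fmt.Println("error:", err)
        return
    }
    for _, l := range links {
        fmt.Println(l)
    }
}

Restricting results to the seed URL's host (compare abs.Host against the start page's host) is an easy extension if the crawl should stay on one site.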

Real-World Outcome

$ ./scraper --url https://example.com --depth 3 --workers 10 --delay 100ms
Starting crawl of https://example.com
  Max depth: 3
  Workers: 10
  Delay between requests: 100ms

[Worker 1] Fetching: https://example.com/
[Worker 2] Fetching: https://example.com/about
[Worker 3] Fetching: https://example.com/products
[Worker 1] Found 15 links on /
[Worker 4] Fetching: https://example.com/products/item1
...

Crawl complete!
  Pages fetched: 127
  Time elapsed: 12.3s
  Errors: 3 (404 not found)

Results saved to: results.json

$ cat results.json
{
  "pages": [
    {
      "url": "https://example.com/",
      "title": "Example - Home",
      "links": ["https://example.com/about", ...],
      "fetched_at": "2025-01-10T14:30:00Z"
    },
    ...
  ]
}
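
The results.json layout above maps directly onto a small struct. A sketch of writing it, assuming illustrative names (Page, Results, saveResults); only the JSON field names come from the example:

package main

import (
    "encoding/json"
    "os"
    "time"
)

// Page mirrors one entry in results.json; time.Time marshals to RFC 3339,
// which is the format shown for "fetched_at" above.
type Page struct {
    URL       string    `json:"url"`
    Title     string    `json:"title"`
    Links     []string  `json:"links"`
    FetchedAt time.Time `json:"fetched_at"`
}

type Results struct {
    Pages []Page `json:"pages"`
}

// saveResults writes the collected pages as indented JSON.
func saveResults(path string, pages []Page) error {
    data, err := json.MarshalIndent(Results{Pages: pages}, "", "  ")
    if err != nil {
        return err
    }
    return os.WriteFile(path, data, 0o644)
}

func main() {
    _ = saveResults("results.json", []Page{{
        URL:       "https://example.com/",
        Title:     "Example - Home",
        Links:     []string{"https://example.com/about"},
        FetchedAt: time.Now().UTC(),
    }})
}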

# Live progress visualization:
$ ./scraper --url https://news.ycombinator.com --live
┌─────────────────────────────────────────────────────────┐
│ Active: 10/10 workers | Queue: 234 | Done: 89 | Err: 2 │
│ ████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 27%    │
└─────────────────────────────────────────────────────────┘
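
One plausible way to drive the live status line: workers bump shared counters through sync/atomic, and a single goroutine redraws the line on a timer. The counter names and output format below are illustrative, and the loop in main only simulates real workers:

package main

import (
    "fmt"
    "sync/atomic"
    "time"
)

// stats holds the counters behind the status line; workers update them
// with atomic operations so no extra locking is needed.
type stats struct {
    active int64
    queued int64
    done   int64
    errs   int64
}

func (s *stats) draw() {
    // \r rewrites the same terminal line on each refresh.
    fmt.Printf("\rActive: %d | Queue: %d | Done: %d | Err: %d",
        atomic.LoadInt64(&s.active),
        atomic.LoadInt64(&s.queued),
        atomic.LoadInt64(&s.done),
        atomic.LoadInt64(&s.errs))
}

func main() {
    var s stats
    go func() {
        for range time.Tick(200 * time.Millisecond) {
            s.draw()
        }
    }()

    // Simulate workers finishing queued tasks.
    atomic.StoreInt64(&s.active, 10)
    atomic.StoreInt64(&s.queued, 234)
    for i := 0; i < 50; i++ {
        atomic.AddInt64(&s.done, 1)
        atomic.AddInt64(&s.queued, -1)
        time.Sleep(50 * time.Millisecond)
    }
    fmt.Println()
}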

Implementation Guide

  1. Build the smallest working version of the core feature: fetch a single page and print its title and links.
  2. Add a work queue, a fixed pool of worker goroutines, and visited-URL tracking.
  3. Add input validation and error handling: bad flags, malformed URLs, timeouts, non-200 responses (see the flag-parsing sketch after this list).
  4. Add instrumentation/logging to confirm behavior: per-worker logs plus fetch and error counters.
  5. Refactor into clean modules with tests.
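
For step 3, the command-line options shown under Real-World Outcome map directly onto the standard flag package. A sketch of parsing and validating them; the defaults simply echo the example invocation and are not required:

package main

import (
    "flag"
    "fmt"
    "net/url"
    "os"
    "time"
)

func main() {
    startURL := flag.String("url", "", "start URL to crawl (required)")
    depth := flag.Int("depth", 3, "maximum link depth from the start URL")
    workers := flag.Int("workers", 10, "number of concurrent fetcher goroutines")
    delay := flag.Duration("delay", 100*time.Millisecond, "minimum delay between requests")
    flag.Parse()

    // Validate before starting any goroutines; exit non-zero on bad input.
    u, err := url.Parse(*startURL)
    if *startURL == "" || err != nil || (u.Scheme != "http" && u.Scheme != "https") {
        fmt.Fprintln(os.Stderr, "error: --url must be an absolute http(s) URL")
        flag.Usage()
        os.Exit(2)
    }
    if *depth < 0 || *workers < 1 || *delay < 0 {
        fmt.Fprintln(os.Stderr, "error: --depth and --delay must be non-negative, --workers at least 1")
        os.Exit(2)
    }

    fmt.Printf("Starting crawl of %s\n  Max depth: %d\n  Workers: %d\n  Delay between requests: %s\n",
        u.String(), *depth, *workers, *delay)
    // ... hand these values to the crawler ...
}

The --live flag from the second example would be one more boolean option (flag.Bool) on top of these.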

Milestones

  • Milestone 1: Minimal working crawler that fetches a single page end-to-end.
  • Milestone 2: Correct concurrent crawling of typical sites, with visited-URL tracking and the depth limit enforced.
  • Milestone 3: Robust handling of edge cases: link cycles, redirects, timeouts, and robots.txt disallows.
  • Milestone 4: Clean structure, JSON output, and documented usage.

Validation Checklist

  • Output matches the real-world outcome example
  • Handles invalid inputs safely
  • Provides clear errors and exit codes
  • Repeatable results across runs

References

  • Main guide: LEARN_GO_DEEP_DIVE.md
  • “Concurrency in Go” by Katherine Cox-Buday