Project 4: Concurrent Web Scraper

A concurrent web crawler that discovers and fetches pages from a website, respects robots.txt, limits concurrent requests, and extracts structured data—all using goroutines and channels.

Quick Reference

Attribute              Value
Primary Language       Go
Alternative Languages  Python, Rust, Node.js
Difficulty             Level 2: Intermediate
Time Estimate          1-2 weeks
Knowledge Area         Concurrency, HTTP Client, HTML Parsing
Tooling                colly (optional), goquery
Prerequisites          Completed Projects 1-3. Understand goroutines, channel basics, and HTTP requests.

What You Will Build

A command-line crawler that starts from a seed URL, follows links up to a configurable depth with a pool of worker goroutines, respects robots.txt, limits concurrency and request rate, and writes the extracted data (page titles and outgoing links) to JSON, with all of the coordination done through goroutines and channels.

Why It Matters

This project builds core skills that appear repeatedly in real-world systems and tooling: bounded concurrency, rate limiting, safe access to shared state, and clean cancellation are the same patterns you will reuse in API clients, batch processors, and background job runners.

Core Challenges

  • Coordinating concurrent fetchers → maps to goroutines, channels, and WaitGroups (see the sketch after this list)
  • Rate limiting requests → maps to time.Ticker and semaphores
  • Tracking visited URLs → maps to sync.Map or mutex-protected maps
  • Graceful shutdown → maps to context.Context and cancellation
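
A minimal sketch of how these four pieces can fit together. The names (crawl, task, fetchLinks) are illustrative, fetchLinks is only a stub here, and this is one workable shape rather than the required design:

package main

import (
    "context"
    "fmt"
    "os"
    "os/signal"
    "sync"
    "time"
)

// task is one URL to fetch plus its distance from the start URL.
type task struct {
    url   string
    depth int
}

// fetchLinks is a stub for the real HTTP fetch + HTML parse step
// (a separate sketch under Key Concepts shows one way to implement it).
func fetchLinks(ctx context.Context, pageURL string) ([]string, error) {
    return nil, nil
}

// crawl wires the four challenges together: a worker pool reading from a
// channel, a mutex-protected visited map, a shared ticker for rate limiting,
// and a context for graceful shutdown.
func crawl(ctx context.Context, start string, maxDepth, workers int, delay time.Duration) {
    tasks := make(chan task)
    var pending sync.WaitGroup // counts enqueued-but-unfinished tasks

    var mu sync.Mutex
    visited := map[string]bool{start: true}

    // enqueue hands a task to the workers without ever blocking the caller,
    // so workers can safely enqueue the links they discover.
    enqueue := func(t task) {
        pending.Add(1)
        go func() { tasks <- t }()
    }

    limiter := time.NewTicker(delay) // at most one new request per tick, shared by all workers
    defer limiter.Stop()

    var wg sync.WaitGroup
    for i := 1; i <= workers; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            for t := range tasks {
                select {
                case <-ctx.Done(): // shutdown requested: drain the queue without fetching
                    pending.Done()
                    continue
                case <-limiter.C: // rate limit
                }
                links, err := fetchLinks(ctx, t.url)
                if err != nil {
                    fmt.Printf("[Worker %d] Error: %s: %v\n", id, t.url, err)
                } else {
                    fmt.Printf("[Worker %d] Found %d links on %s\n", id, len(links), t.url)
                }
                if err == nil && t.depth < maxDepth {
                    for _, link := range links {
                        mu.Lock()
                        seen := visited[link]
                        visited[link] = true
                        mu.Unlock()
                        if !seen {
                            enqueue(task{url: link, depth: t.depth + 1})
                        }
                    }
                }
                pending.Done()
            }
        }(i)
    }

    enqueue(task{url: start, depth: 0})
    pending.Wait() // every enqueued task has been processed...
    close(tasks)   // ...so the workers can exit their range loops
    wg.Wait()
}

func main() {
    // Ctrl-C cancels the context, which makes crawl drain and exit cleanly.
    ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt)
    defer stop()
    crawl(ctx, "https://example.com", 3, 10, 100*time.Millisecond)
}

The pending WaitGroup is what tells the crawler when the frontier is empty: every enqueue adds one, every finished task removes one, and the channel is only closed once the count reaches zero.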

Key Concepts

  • Goroutines and channels: “Concurrency in Go” Ch. 3 - Katherine Cox-Buday
  • Worker pools: “Concurrency in Go” Ch. 4 - Katherine Cox-Buday
  • Context for cancellation: “Learning Go” Ch. 14 - Jon Bodner
  • HTTP client: “The Go Programming Language” Ch. 5 - Donovan & Kernighan (a fetch-and-extract sketch follows this list)
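
For the fetch-and-extract step, a hedged sketch using net/http plus the goquery package already listed under Tooling (github.com/PuerkitoBio/goquery); golang.org/x/net/html is an equally valid choice. The function name extractLinks and the example URL are illustrative:

package main

import (
    "fmt"
    "net/http"
    "net/url"
    "time"

    "github.com/PuerkitoBio/goquery"
)

// extractLinks fetches pageURL and returns the absolute URLs of all <a href> links.
func extractLinks(client *http.Client, pageURL string) ([]string, error) {
    resp, err := client.Get(pageURL)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return nil, fmt.Errorf("GET %s: %s", pageURL, resp.Status)
    }

    base, err := url.Parse(pageURL)
    if err != nil {
        return nil, err
    }

    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        return nil, err
    }

    var links []string
    doc.Find("a[href]").Each(func(_ int, s *goquery.Selection) {
        href, _ := s.Attr("href")
        ref, err := url.Parse(href)
        if err != nil {
            return // skip malformed hrefs
        }
        abs := base.ResolveReference(ref) // turn relative links into absolute URLs
        if abs.Scheme == "http" || abs.Scheme == "https" {
            links = append(links, abs.String())
        }
    })
    return links, nil
}

func main() {
    client := &http.Client{Timeout: 10 * time.Second}
    links, err := extractLinks(client, "https://example.com/")
    if err != nil {
        fmt.Println("error:", err)
        return
    }
    for _, l := range links {
        fmt.Println(l)
    }
}

Restricting results to the seed URL's host (compare abs.Host against the start page's host) is an easy extension if the crawl should stay on one site.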

Real-World Outcome

$ ./scraper --url https://example.com --depth 3 --workers 10 --delay 100ms
Starting crawl of https://example.com
  Max depth: 3
  Workers: 10
  Delay between requests: 100ms

[Worker 1] Fetching: https://example.com/
[Worker 2] Fetching: https://example.com/about
[Worker 3] Fetching: https://example.com/products
[Worker 1] Found 15 links on /
[Worker 4] Fetching: https://example.com/products/item1
...

Crawl complete!
  Pages fetched: 127
  Time elapsed: 12.3s
  Errors: 3 (404 not found)

Results saved to: results.json

$ cat results.json
{
  "pages": [
    {
      "url": "https://example.com/",
      "title": "Example - Home",
      "links": ["https://example.com/about", ...],
      "fetched_at": "2025-01-10T14:30:00Z"
    },
    ...
  ]
}
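
The results.json layout above maps directly onto a small struct. A sketch of writing it, assuming illustrative names (Page, Results, saveResults); only the JSON field names come from the example:

package main

import (
    "encoding/json"
    "os"
    "time"
)

// Page mirrors one entry in results.json; time.Time marshals to RFC 3339,
// which is the format shown for "fetched_at" above.
type Page struct {
    URL       string    `json:"url"`
    Title     string    `json:"title"`
    Links     []string  `json:"links"`
    FetchedAt time.Time `json:"fetched_at"`
}

type Results struct {
    Pages []Page `json:"pages"`
}

// saveResults writes the collected pages as indented JSON.
func saveResults(path string, pages []Page) error {
    data, err := json.MarshalIndent(Results{Pages: pages}, "", "  ")
    if err != nil {
        return err
    }
    return os.WriteFile(path, data, 0o644)
}

func main() {
    _ = saveResults("results.json", []Page{{
        URL:       "https://example.com/",
        Title:     "Example - Home",
        Links:     []string{"https://example.com/about"},
        FetchedAt: time.Now().UTC(),
    }})
}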

# Live progress visualization:
$ ./scraper --url https://news.ycombinator.com --live
┌─────────────────────────────────────────────────────────┐
│ Active: 10/10 workers | Queue: 234 | Done: 89 | Err: 2 │
│ ████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 27%    │
└─────────────────────────────────────────────────────────┘
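
One plausible way to drive the live status line: workers bump shared counters through sync/atomic, and a single goroutine redraws the line on a timer. The counter names and output format below are illustrative, and the loop in main only simulates real workers:

package main

import (
    "fmt"
    "sync/atomic"
    "time"
)

// stats holds the counters behind the status line; workers update them
// with atomic operations so no extra locking is needed.
type stats struct {
    active int64
    queued int64
    done   int64
    errs   int64
}

func (s *stats) draw() {
    // \r rewrites the same terminal line on each refresh.
    fmt.Printf("\rActive: %d | Queue: %d | Done: %d | Err: %d",
        atomic.LoadInt64(&s.active),
        atomic.LoadInt64(&s.queued),
        atomic.LoadInt64(&s.done),
        atomic.LoadInt64(&s.errs))
}

func main() {
    var s stats
    go func() {
        for range time.Tick(200 * time.Millisecond) {
            s.draw()
        }
    }()

    // Simulate workers finishing queued tasks.
    atomic.StoreInt64(&s.active, 10)
    atomic.StoreInt64(&s.queued, 234)
    for i := 0; i < 50; i++ {
        atomic.AddInt64(&s.done, 1)
        atomic.AddInt64(&s.queued, -1)
        time.Sleep(50 * time.Millisecond)
    }
    fmt.Println()
}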

Implementation Guide

  1. Build the smallest working version of the core feature: fetch a single page and print its title and links.
  2. Add a work queue, a fixed pool of worker goroutines, and visited-URL tracking.
  3. Add input validation and error handling: bad flags, malformed URLs, timeouts, non-200 responses (see the flag-parsing sketch after this list).
  4. Add instrumentation/logging to confirm behavior: per-worker logs plus fetch and error counters.
  5. Refactor into clean modules with tests.
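
For step 3, the command-line options shown under Real-World Outcome map directly onto the standard flag package. A sketch of parsing and validating them; the defaults simply echo the example invocation and are not required:

package main

import (
    "flag"
    "fmt"
    "net/url"
    "os"
    "time"
)

func main() {
    startURL := flag.String("url", "", "start URL to crawl (required)")
    depth := flag.Int("depth", 3, "maximum link depth from the start URL")
    workers := flag.Int("workers", 10, "number of concurrent fetcher goroutines")
    delay := flag.Duration("delay", 100*time.Millisecond, "minimum delay between requests")
    flag.Parse()

    // Validate before starting any goroutines; exit non-zero on bad input.
    u, err := url.Parse(*startURL)
    if *startURL == "" || err != nil || (u.Scheme != "http" && u.Scheme != "https") {
        fmt.Fprintln(os.Stderr, "error: --url must be an absolute http(s) URL")
        flag.Usage()
        os.Exit(2)
    }
    if *depth < 0 || *workers < 1 || *delay < 0 {
        fmt.Fprintln(os.Stderr, "error: --depth and --delay must be non-negative, --workers at least 1")
        os.Exit(2)
    }

    fmt.Printf("Starting crawl of %s\n  Max depth: %d\n  Workers: %d\n  Delay between requests: %s\n",
        u.String(), *depth, *workers, *delay)
    // ... hand these values to the crawler ...
}

The --live flag from the second example would be one more boolean option (flag.Bool) on top of these.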

Milestones

  • Milestone 1: Minimal working crawler that fetches a single page end-to-end.
  • Milestone 2: Correct concurrent crawling of typical sites, with visited-URL tracking and the depth limit enforced.
  • Milestone 3: Robust handling of edge cases: link cycles, redirects, timeouts, and robots.txt disallows.
  • Milestone 4: Clean structure, JSON output, and documented usage.

Validation Checklist

  • Output matches the real-world outcome example
  • Handles invalid inputs safely
  • Provides clear errors and exit codes
  • Repeatable results across runs

References

  • Main guide: LEARN_GO_DEEP_DIVE.md
  • “Concurrency in Go” by Katherine Cox-Buday