Project 3: Fearless Concurrent Web Scraper (Data Races Impossible)

A highly concurrent web scraper that fetches hundreds of pages simultaneously, extracts data, and aggregates results—with zero data races guaranteed by the compiler.

Quick Reference

Attribute	Value
Primary Language	Rust
Alternative Languages	Go (for comparison), Python (to feel the pain)
Difficulty	Level 2: Intermediate
Time Estimate	1-2 weeks
Knowledge Area	Concurrency / Async / Networking
Tooling	tokio, reqwest, scraper
Prerequisites	Project 1 completed, basic HTTP understanding

What You Will Build

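A command-line scraper that reads a list of URLs from a file, fetches them with a configurable number of concurrent workers, extracts data from each page, and writes the aggregated results to a JSON file. Failed requests are logged rather than aborting the run. Because the shared state is written in safe Rust, the compiler rejects any code that would contain a data race; it does not rule out higher-level problems such as deadlocks or logic races, which you still have to design around.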

Why It Matters

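Sharing state safely between tasks, limiting concurrency without blocking, collecting results from many workers, and propagating errors across task boundaries are problems that recur in servers, crawlers, data pipelines, and build tooling. This project exercises all four in one small program.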

Core Challenges

Sharing state between async tasks → maps to Arc, Mutex, and the Send/Sync traits
Handling rate limiting without blocking → maps to async/await and tokio runtime
Aggregating results from many concurrent operations → maps to channels and message passing
Graceful error handling across tasks → maps to Result propagation in async contexts
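As a rough illustration of the first challenge, the sketch below shares one result map between several workers through `Arc<Mutex<...>>`. It uses only the standard library, so OS threads stand in for tokio tasks, and `fake_fetch` and `scrape_shared` are hypothetical names invented for this example, not part of the project. The same ownership rules apply to `tokio::spawn`, which also requires the spawned work to be `Send`.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for an HTTP fetch: returns the URL's length
/// as a fake "byte count" so the example needs no network access.
fn fake_fetch(url: &str) -> usize {
    url.len()
}

/// Splits `urls` across `workers` threads that all write into one shared map.
fn scrape_shared(urls: Vec<String>, workers: usize) -> HashMap<String, usize> {
    let workers = workers.max(1); // step_by(0) would panic
    let urls = Arc::new(urls);
    let results = Arc::new(Mutex::new(HashMap::new()));

    let handles: Vec<_> = (0..workers)
        .map(|id| {
            // Each worker gets its own handle to the same data.
            // Swapping Arc for Rc here fails to compile, because Rc is not Send.
            let urls = Arc::clone(&urls);
            let results = Arc::clone(&results);
            thread::spawn(move || {
                // Worker `id` takes every `workers`-th URL, starting at index `id`.
                for url in urls.iter().skip(id).step_by(workers) {
                    let size = fake_fetch(url);
                    // Hold the lock only for the insert, not during the fetch.
                    results.lock().unwrap().insert(url.clone(), size);
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().expect("worker thread panicked");
    }

    // Copy the finished map out from behind the mutex.
    let snapshot = results.lock().unwrap().clone();
    snapshot
}
```

Calling `scrape_shared(urls, 50)` from `main` would return one entry per distinct URL. Removing the `Mutex` and mutating the map directly is rejected at compile time, which is the guarantee the project title refers to.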

Key Concepts

Send and Sync traits: “Rust Atomics and Locks”, Chapter 1 (Basics of Rust Concurrency) - Mara Bos
Async/Await: “Asynchronous Programming in Rust” - Rust Async Book (online)
Arc and Mutex: “The Rust Programming Language”, Chapter 16 (Fearless Concurrency) - Steve Klabnik and Carol Nichols
Channels (mpsc): “Programming Rust, 2nd Edition”, Chapter 19 (Concurrency) - Jim Blandy, Jason Orendorff, and Leonora F. S. Tindall
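To make the channel concept concrete, here is a minimal sketch of message passing with `std::sync::mpsc`: each worker sends a `Result` down the channel and a single receiver sorts outcomes into successes and failures. The `Page` type, `fake_scrape`, and `scrape_via_channel` are hypothetical, and spawning one thread per URL is a simplification; the actual project would presumably cap concurrency and use tokio tasks with an async channel instead.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical per-page outcome; a real scraper would carry extracted data.
#[derive(Debug, PartialEq)]
struct Page {
    url: String,
    title: String,
}

/// Hypothetical stand-in for fetch + parse: URLs containing "bad" fail.
fn fake_scrape(url: &str) -> Result<Page, String> {
    if url.contains("bad") {
        Err(format!("request failed: {}", url))
    } else {
        Ok(Page {
            url: url.to_string(),
            title: format!("Title of {}", url),
        })
    }
}

/// Spawns one thread per URL, each sending its Result down a channel,
/// then splits what arrives into successes and failures.
fn scrape_via_channel(urls: Vec<String>) -> (Vec<Page>, Vec<String>) {
    let (tx, rx) = mpsc::channel();

    for url in urls {
        let tx = tx.clone();
        thread::spawn(move || {
            // A send error only means the receiver is gone, so it is ignored.
            let _ = tx.send(fake_scrape(&url));
        });
    }
    // Drop the original sender so the receive loop ends
    // once every worker has sent its message and exited.
    drop(tx);

    let mut pages = Vec::new();
    let mut errors = Vec::new();
    for outcome in rx {
        match outcome {
            Ok(page) => pages.push(page),
            Err(message) => errors.push(message),
        }
    }

    // Arrival order is nondeterministic; sort for repeatable output.
    pages.sort_by(|a, b| a.url.cmp(&b.url));
    errors.sort();
    (pages, errors)
}
```

No lock appears anywhere in this version: ownership of each result moves from the worker to the receiver, so only one thread can touch it at a time. A failed page becomes an entry in `errors` instead of stopping the other workers.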

Real-World Outcome

$ cargo run -- --urls urls.txt --workers 50 --output results.json
🕷️  Fearless Web Scraper v1.0
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
[░░░░░░░░░░░░░░░░░░░░] 0/500 pages

... scraping ...

[████████████████████] 500/500 pages
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✅ Completed in 4.2 seconds
   • 500 pages scraped
   • 50 concurrent workers
   • 0 data races (guaranteed by Rust!)
   • 12 failed requests (logged to errors.log)

Results written to results.json

Implementation Guide

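Start with the simplest happy path: fetch a single URL and extract one piece of data from the response.
Build the smallest working version of the core feature: read the URL list and scrape the pages concurrently, capped at the configured number of workers.
Add input validation and error handling, so that a missing file, a malformed URL, or a failed request is reported without aborting the run.
Add instrumentation and logging (progress counts, a log of failed requests) to confirm the scraper behaves as expected.
Refactor into clean modules with tests.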

Milestones

Milestone 1: Minimal working program that runs end-to-end.
Milestone 2: Correct outputs for typical inputs.
Milestone 3: Robust handling of edge cases.
Milestone 4: Clean structure and documented usage.

Validation Checklist

Output matches the real-world outcome example
Handles invalid inputs safely
Provides clear errors and exit codes
Repeatable results across runs
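One possible way to satisfy the "clear errors and exit codes" item is sketched below. The `Summary` struct, both function names, and the 0/1/2 convention are assumptions made for this example; the guide itself does not prescribe specific exit codes.

```rust
use std::process::ExitCode;

/// Hypothetical run summary; the fields mirror the counts in the sample output.
struct Summary {
    scraped: usize,
    failed: usize,
}

/// Assumed convention (not from the guide):
/// 0 = every page scraped, 1 = some requests failed, 2 = nothing scraped.
fn exit_status(summary: &Summary) -> u8 {
    match (summary.scraped, summary.failed) {
        (0, _) => 2,
        (_, 0) => 0,
        _ => 1,
    }
}

/// Prints a one-line message to stderr on failure and
/// converts the numeric status into a process exit code.
fn finish(summary: &Summary) -> ExitCode {
    let status = exit_status(summary);
    match status {
        0 => {}
        1 => eprintln!("warning: {} requests failed", summary.failed),
        _ => eprintln!("error: no pages were scraped"),
    }
    ExitCode::from(status)
}
```

Declaring `fn main() -> ExitCode` and returning `finish(&summary)` as its last expression lets shell scripts and CI jobs distinguish a clean run from a partial or total failure.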

References

Main guide: LEARN_RUST_DEEP_DIVE.md
“Rust Atomics and Locks” by Mara Bos