LEARN WEB SCRAPING AND CRAWLING
Learn Web Scraping and Crawling: From Data Extraction to Building a Search Engine
Goal: Deeply understand how to extract data from the web, from targeted scraping of a single page to building a broad, recursive web crawler. Master the tools, techniques, and ethics of automated web data collection.
Why Learn Web Scraping & Crawling?
The web is the world’s largest database, but most of it is unstructured, designed for human eyes. Learning to scrape and crawl is a superpower: it allows you to turn the web into your own personal, structured API. You can track prices, gather social media data, monitor news, aggregate information, and power data science projects.
After completing these projects, you will:
- Master the difference between targeted scraping and broad crawling.
- Extract any data from any website, whether it’s static HTML or a dynamic JavaScript-powered application.
- Build a resilient, polite, and efficient web crawler from scratch.
- Understand the legal and ethical considerations of web automation.
- Be proficient with industry-standard tools like BeautifulSoup, Scrapy, and Playwright/Selenium.
Core Concept Analysis
First, let’s clarify a key distinction: while often used interchangeably, scraping and crawling are distinct processes.
- Web Scraping (Targeted Extraction): The goal is to extract specific pieces of data from a known set of web pages. Think of it as surgically removing valuable information.
- Web Crawling (Broad Discovery): The goal is to discover and index web pages by following links recursively. A crawler is a trailblazer, finding the paths for a scraper to follow.
A crawler’s job is to build a map of URLs. A scraper’s job is to extract data from those URLs. Often, a crawler will contain a scraper to process the pages it discovers.
The Scraping & Crawling Landscape
┌─────────────────────────────────────────────────────────────────────────┐
│ The Web (Unstructured Data) │
│ <html>...<a href="/page2">Link</a><p class="price">$99.99</p>...</html> │
└─────────────────────────────────────────────────────────────────────────┘
│
┌──────────────────────┼──────────────────────┐
▼ ▼ ▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ HTTP REQUEST │ │ HTML PARSING │ │ DATA EXTRACTION │
│ │ │ │ │ │
│ • GET page HTML │ │ • CSS Selectors │ │ • Find price │
│ • Set User-Agent │ │ • XPath │ │ • Find title │
│ • Handle cookies │ │ • Build DOM tree │ │ • Find reviews │
└──────────────────┘ └──────────────────┘ └──────────────────┘
│ │ │
└────────────┐ │ ┌────────────┘
▼ ▼ ▼
┌──────────────────────────────────┐
│ CRAWLING LOGIC │
│ │
│ • Extract all links (`<a>` tags) │
│ • Add new links to a queue │
│ • Avoid visiting duplicate URLs │
│ • Obey robots.txt │
└──────────────────────────────────┘
│
▼
┌────────────────────────┐
│ STRUCTURED DATA │
│ │
│ • CSV, JSON, Database │
└────────────────────────┘
Project List
These projects are designed to build your skills progressively, starting with simple scraping and culminating in a sophisticated, distributed crawler.
Project 1: Simple Quote Scraper
- File: LEARN_WEB_SCRAPING_AND_CRAWLING.md
- Main Programming Language: Python
- Alternative Programming Languages: JavaScript (Node.js), Ruby
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 1: Beginner
- Knowledge Area: Web Scraping / HTML Parsing
- Software or Tool: Requests, BeautifulSoup (Python) / Axios, Cheerio (Node.js)
- Main Book: “Web Scraping with Python, 2nd Edition” by Ryan Mitchell
What you’ll build: A command-line script that scrapes all quotes and their authors from the static website quotes.toscrape.com.
Why it teaches scraping: This is the “Hello, World!” of web scraping. It teaches the two most fundamental skills: fetching a web page’s HTML content and parsing that HTML to find the specific data you need.
Core challenges you’ll face:
- Making an HTTP GET request → maps to learning how to fetch web page content
- Inspecting the page source → maps to using browser dev tools to find the right HTML tags and classes
- Parsing HTML with CSS selectors → maps to navigating the HTML structure to pinpoint data
- Extracting text from elements → maps to getting the clean data out of the HTML tags
Key Concepts:
- HTTP GET Requests: “Web Scraping with Python” Ch. 1
- HTML Structure: MDN Web Docs - “Introduction to HTML”
- CSS Selectors: “Web Scraping with Python” Ch. 2
- BeautifulSoup Basics: Official BeautifulSoup Documentation
Difficulty: Beginner. Time estimate: A few hours. Prerequisites: Basic Python (or other language) skills.
Real world outcome: Your script will run and print a clean list of quotes and authors to the console.
$ python scrape_quotes.py
"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking." - Albert Einstein
"It is our choices, Harry, that show what we truly are, far more than our abilities." - J.K. Rowling
... and so on
Implementation Hints:
- Use the `requests` library to get the content of `http://quotes.toscrape.com`. Check `response.status_code` to make sure it’s 200.
- Use your browser’s “Inspect” tool to look at the HTML for a quote. You’ll see it’s contained in a `div` with the class `quote`.
- Create a `BeautifulSoup` object from the response text.
- Use `soup.find_all('div', class_='quote')` to get a list of all quote elements.
- Loop through this list. In each element, find the `span` with class `text` for the quote and the `small` with class `author` for the author.
- Print the `.text` of each element.
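To make the hints concrete, here is a minimal sketch of the whole script, assuming the `requests` and `beautifulsoup4` packages are installed and that the site still uses its `div.quote` / `span.text` / `small.author` markup:

```python
# Minimal sketch of the quote scraper described in the hints above.
import requests
from bs4 import BeautifulSoup

URL = "http://quotes.toscrape.com"

response = requests.get(URL)
response.raise_for_status()  # stop early if the status code isn't 200-level

soup = BeautifulSoup(response.text, "html.parser")

# Each quote lives in a <div class="quote"> block.
for quote_div in soup.find_all("div", class_="quote"):
    text = quote_div.find("span", class_="text").get_text(strip=True)
    author = quote_div.find("small", class_="author").get_text(strip=True)
    print(f"{text} - {author}")
```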
Learning milestones:
- Successfully fetch the HTML → You understand how programs access the web.
- Extract the first quote → You can target a single piece of data.
- Extract all quotes on the page → You can iterate over lists of elements.
Project 2: Multi-Page Book Scraper
- File: LEARN_WEB_SCRAPING_AND_CRAWLING.md
- Main Programming Language: Python
- Alternative Programming Languages: JavaScript (Node.js), Go
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 1-2: Beginner to Intermediate
- Knowledge Area: Web Scraping / Crawling (Light)
- Software or Tool: Requests, BeautifulSoup, CSV library
- Main Book: “Web Scraping with Python, 2nd Edition” by Ryan Mitchell
What you’ll build: A scraper that navigates through all the pages of books.toscrape.com, extracts the title, price, and rating of every book, and saves the data neatly into a CSV file.
Why it teaches scraping: This project introduces pagination, a core concept. You’re no longer just scraping one page; you’re teaching your script to find and follow “Next” links. It also teaches you how to structure and save your data, moving from printing to creating a real dataset.
Core challenges you’ll face:
- Handling pagination → maps to finding the “Next” button’s link and looping until it disappears
- Extracting multiple data points per item → maps to building a dictionary or object for each book
- Cleaning data → maps to converting price strings like “£51.77” to the float `51.77`
- Saving to a structured file → maps to using a CSV writer to create a dataset with headers
Key Concepts:
- Relative vs. Absolute URLs: “Web Scraping with Python” Ch. 3
- Data Cleaning: Basic string manipulation (strip, replace).
- Writing to CSV: Python’s `csv` module documentation.
Difficulty: Beginner to Intermediate. Time estimate: Weekend. Prerequisites: Project 1.
Real world outcome:
You will have a books.csv file that you can open in Excel or Google Sheets, containing a structured list of every book on the site.
title,price,rating
A Light in the Attic,51.77,3
Tipping the Velvet,53.74,1
...
Implementation Hints:
- Start with a single URL and a `while` loop (`while next_page_url:`).
- On each page, scrape all the books into a list of dictionaries.
- After scraping the books, look for the “Next” button. It’s usually a link (`<a>` tag) inside a list item (`<li>`) with the class `next`.
- Extract the `href` from that link. It will be a relative URL (e.g., `catalogue/page-2.html`). You must combine it with the base URL (`http://books.toscrape.com/`) to create the full URL for the next request.
- If there’s no “Next” button, the link is `None`, and your `while` loop will terminate.
- Use Python’s `csv.DictWriter` to easily write your list of dictionaries to a file.
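A compact sketch of the pagination loop and CSV export, assuming the site’s current `article.product_pod`, `p.price_color`, and `li.next > a` markup (the rating column from the sample output is left as an exercise):

```python
# Pagination sketch for books.toscrape.com following the hints above.
import csv
import re
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

next_page_url = "http://books.toscrape.com/"
books = []

while next_page_url:
    soup = BeautifulSoup(requests.get(next_page_url).text, "html.parser")

    for article in soup.find_all("article", class_="product_pod"):
        title = article.h3.a["title"]
        price_text = article.find("p", class_="price_color").get_text()
        price = float(re.search(r"[\d.]+", price_text).group())  # "£51.77" -> 51.77
        books.append({"title": title, "price": price})

    # Follow the "Next" link if present; urljoin resolves the relative href.
    next_link = soup.select_one("li.next > a")
    next_page_url = urljoin(next_page_url, next_link["href"]) if next_link else None

with open("books.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price"])
    writer.writeheader()
    writer.writerows(books)
```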
Learning milestones:
- Scrape a single page into a CSV → You can create structured data.
- Successfully navigate to the second page → You understand how to follow links.
- Scrape the entire site → You have mastered basic pagination.
Project 3: robots.txt Parser
- File: LEARN_WEB_SCRAPING_AND_CRAWLING.md
- Main Programming Language: Python
- Alternative Programming Languages: Go, Rust
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Web Crawling / Ethics
- Software or Tool: Requests
- Main Book: N/A, rely on RFCs and web standards documentation.
What you’ll build: A tool that takes a domain (e.g., google.com) and a user-agent string as input. It will fetch and parse the robots.txt file and determine if that user-agent is allowed to access a given URL path.
Why it teaches crawling: This is the first rule of ethical crawling. Before you can build a wide-ranging crawler, you MUST understand how to respect a site’s wishes. This project forces you to parse and interpret the rules that govern all well-behaved bots on the internet.
Core challenges you’ll face:
- Parsing a non-HTML text format → maps to line-by-line parsing and state management
- Handling user-agent groups → maps to applying the right rules for the right bot, including the wildcard `*`
- Matching URL paths → maps to implementing `Allow` and `Disallow` path-matching logic
- Handling edge cases → maps to `Crawl-delay` directives, comments, and non-standard extensions
Resources for key challenges:
- Google’s `robots.txt` Specification - A clear explanation of the standard.
Key Concepts:
- Robots Exclusion Protocol: The official standard.
- String Matching: Implementing the path matching rules.
- Stateful Parsing: Keeping track of the current user-agent group as you read the file.
Difficulty: Intermediate. Time estimate: Weekend. Prerequisites: Basic programming, understanding of HTTP.
Real world outcome:
A command-line tool that acts as a robots.txt validator.
$ python check_robot.py --domain=google.com --path=/search --user-agent=MyBot
Path /search is DISALLOWED for user-agent MyBot
$ python check_robot.py --domain=google.com --path=/ --user-agent=Googlebot
Path / is ALLOWED for user-agent Googlebot
Implementation Hints:
- Your function will take a domain and construct the `robots.txt` URL (e.g., `https://google.com/robots.txt`).
- Read the file line by line.
- The logic is stateful. When you see a `User-agent:` line, you are now “in” that agent’s rule block.
- Store the rules (Allow/Disallow) in a dictionary, keyed by user-agent. Remember the wildcard `*`.
- To check a path, first look for rules for your specific user-agent. If none exist, fall back to the wildcard `*` agent.
- The most specific rule wins. A `Disallow: /private/` should match `/private/page.html` but not `/private`. The Google spec has details on this.
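A minimal sketch of the stateful parser and the longest-match check, with the simplifications noted in the comments (no in-path wildcards, no `Crawl-delay`, no Allow-wins tie-breaking):

```python
# Sketch of a stateful robots.txt parser and a longest-match access check.
import requests

def fetch_rules(domain: str) -> dict:
    """Return {user_agent: [(directive, path), ...]} parsed from robots.txt."""
    text = requests.get(f"https://{domain}/robots.txt", timeout=10).text
    rules, agents, in_group = {}, [], False
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()        # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_group:                                # rule lines ended the last group
                agents, in_group = [], False
            agents.append(value.lower())
            rules.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            in_group = True
            for agent in agents:
                rules[agent].append((field, value))
    return rules

def is_allowed(rules: dict, user_agent: str, path: str) -> bool:
    """Longest matching rule wins; no matching rule means the path is allowed."""
    group = rules.get(user_agent.lower())
    if group is None:
        group = rules.get("*", [])                      # fall back to the wildcard agent
    directive, matched = "allow", ""
    for rule_directive, rule_path in group:
        if rule_path and path.startswith(rule_path) and len(rule_path) > len(matched):
            directive, matched = rule_directive, rule_path
    return directive == "allow"

if __name__ == "__main__":
    rules = fetch_rules("google.com")
    print(is_allowed(rules, "MyBot", "/search"))
```

You can sanity-check your results against Python’s built-in `urllib.robotparser` module.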
Learning milestones:
- Fetch and print `robots.txt` → You can access the rules file.
- Parse rules for a single user-agent → You understand the basic format.
- Correctly handle the wildcard user-agent → You’ve implemented the fallback logic.
- Correctly determine access for any path → You’ve mastered the core of the Robots Exclusion Protocol.
Project 4: The Single-Domain “Scrawler”
- File: LEARN_WEB_SCRAPING_AND_CRAWLING.md
- Main Programming Language: Python
- Alternative Programming Languages: Go, Rust
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Web Crawling
- Software or Tool: Requests, BeautifulSoup, Python’s `collections.deque`
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann (for concepts on data systems)
What you’ll build: A true (but simple) web crawler. Given a starting URL, it will systematically discover all reachable pages within that same domain by following links, while keeping track of visited pages to not get stuck in loops.
Why it teaches crawling: This project solidifies the difference between scraping and crawling. The primary goal is no longer data extraction but link discovery and traversal. You will implement the fundamental data structures of any web crawler: a queue for URLs to visit (the “frontier”) and a set of visited URLs.
Core challenges you’ll face:
- Managing a URL frontier → maps to using a queue data structure to manage which pages to visit next
- Tracking visited URLs → maps to using a set for efficient lookup to prevent re-visiting and infinite loops
- Distinguishing internal vs. external links → maps to ensuring your crawler doesn’t wander off the target site
- Graceful error handling → maps to what to do if a page returns a 404 or 500 error
Key Concepts:
- Breadth-First Search (BFS): The algorithm your crawler will implement, naturally suited for a queue.
- URL Frontier: “Designing Data-Intensive Applications” Ch. 11 (relevant concepts on stream processing).
- Sets for Membership Testing: Python documentation on Set types.
Difficulty: Intermediate. Time estimate: 1-2 weeks. Prerequisites: Projects 1 & 2.
Real world outcome: Your script will run and produce a site map, printing every unique internal URL it finds on a given domain.
$ python scrawler.py https://your-blog.com
Crawling https://your-blog.com...
FOUND: https://your-blog.com/
FOUND: https://your-blog.com/about
FOUND: https://your-blog.com/posts/article-1
FOUND: https://your-blog.com/posts/article-2
...
Crawl complete. Found 25 unique pages.
Implementation Hints:
- Initialize a queue (e.g., `collections.deque` in Python) with the starting URL. This is your frontier.
- Initialize an empty set called `visited_urls`.
- Start a `while` loop that continues as long as the frontier queue is not empty.
- Inside the loop, pop a URL from the front of the queue. If it’s already in `visited_urls`, `continue`.
- Add the URL to `visited_urls` and process the page (fetch, parse).
- Extract all `href` attributes from `<a>` tags.
- For each new link, resolve it to an absolute URL. Check if it belongs to the same domain as the start URL.
- If it’s an internal link and not already in `visited_urls`, add it to the back of the frontier queue.
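A bare-bones sketch of the BFS loop described above, assuming the `requests` and `beautifulsoup4` packages are installed:

```python
# Minimal single-domain BFS crawler following the steps above.
import sys
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url: str) -> set:
    domain = urlparse(start_url).netloc
    frontier = deque([start_url])   # the URL frontier: pages waiting to be visited
    visited = set()                 # pages already processed

    while frontier:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)

        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
        except requests.RequestException:
            continue                # skip pages that 404/500 or time out
        print(f"FOUND: {url}")

        soup = BeautifulSoup(response.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]  # absolute URL, fragment dropped
            if urlparse(link).netloc == domain and link not in visited:
                frontier.append(link)

    return visited

if __name__ == "__main__":
    pages = crawl(sys.argv[1])
    print(f"Crawl complete. Found {len(pages)} unique pages.")
```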
Learning milestones:
- Crawl a two-page site → You have a working frontier and visited set.
- Crawl an entire small domain without getting stuck → Your loop detection is working.
- Correctly ignore external links and mailto: links → Your link filtering logic is robust.
Project 5: JavaScript-Dependent Site Scraper
- File: LEARN_WEB_SCRAPING_AND_CRAWLING.md
- Main Programming Language: Python
- Alternative Programming Languages: JavaScript (Node.js)
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: Web Scraping / Browser Automation
- Software or Tool: Playwright, Selenium (Python) / Puppeteer (Node.js)
- Main Book: Official documentation for Playwright or Selenium.
What you’ll build: A scraper for a site where the data is loaded dynamically with JavaScript, such as the “infinite scroll” version of quotes.toscrape.com/scroll.
Why it teaches scraping: This breaks you out of the simple requests model. You’ll learn that what you see in your browser is not always what requests gets. This forces you to automate a real web browser to wait for JS to execute, simulate user actions (like scrolling), and then scrape the final HTML.
Core challenges you’ll face:
- Controlling a headless browser → maps to launching and steering a browser programmatically
- Waiting for dynamic content → maps to using explicit waits for elements to appear, instead of `time.sleep()`
- Simulating user actions → maps to programmatically scrolling the page to trigger ‘infinite scroll’ events
- Extracting HTML after JS execution → maps to getting the final DOM state from the browser
Key Concepts:
- Browser Automation: Playwright/Selenium documentation.
- The DOM vs. View Source: Understanding the difference between initial HTML and the live Document Object Model.
- Explicit Waits: A core concept in Selenium/Playwright for creating reliable scripts.
Difficulty: Advanced. Time estimate: 1-2 weeks. Prerequisites: Project 1, understanding of the DOM.
Real world outcome:
Your script will launch a headless browser, scroll down the infinite-scroll page multiple times, and successfully scrape all the quotes, including those that were loaded dynamically. It will succeed where a simple requests-based scraper would fail.
Implementation Hints:
- Use `playwright.sync_api` for a simpler synchronous coding style to start.
- Launch a browser, create a new page object, and navigate to the URL.
- Instead of fetching HTML immediately, you need to interact with the page.
- To handle infinite scroll, use a loop that calls `page.evaluate("window.scrollTo(0, document.body.scrollHeight)")` to scroll to the bottom.
- After scrolling, you must wait for the new content to load. A good way is to see how many quotes are on the page, scroll, then wait until the quote count increases. Or, wait for the “loading” spinner to disappear.
- Once you’ve scrolled enough, get the page’s full content with `page.content()` and then parse it with BeautifulSoup as you did in Project 1.
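A rough sketch of the scroll-and-wait loop, assuming Playwright is installed (`pip install playwright`, then `playwright install chromium`) and that quotes render into `div.quote` elements as on quotes.toscrape.com/scroll:

```python
# Scroll until no new quotes appear, then hand the final DOM to BeautifulSoup.
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("http://quotes.toscrape.com/scroll")

    previous_count = 0
    while True:
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        # Crude pause; waiting explicitly until the quote count grows is more reliable.
        page.wait_for_timeout(1500)
        count = page.locator("div.quote").count()
        if count == previous_count:   # nothing new loaded, assume we reached the end
            break
        previous_count = count

    html = page.content()             # the final, JS-rendered DOM
    browser.close()

soup = BeautifulSoup(html, "html.parser")
print(f"Scraped {len(soup.find_all('div', class_='quote'))} quotes")
```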
Learning milestones:
- Launch a browser and take a screenshot → You can control the browser.
- Scrape data visible on initial page load → You can get the JS-rendered HTML.
- Scrape data that appears after a button click → You can simulate user actions and wait for results.
- Successfully scrape an infinite scroll page → You’ve mastered dynamic content scraping.
Project 6: Authenticated Scraper
- File: LEARN_WEB_SCRAPING_AND_CRAWLING.md
- Main Programming Language: Python
- Alternative Programming Languages: JavaScript (Node.js)
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: Web Scraping / Session Management
- Software or Tool: Requests with Sessions, or Playwright
- Main Book: “Black Hat Python, 2nd Edition” by Justin Seitz (has relevant chapters on authentication)
What you’ll build: A script that logs into a site (like GitHub or a forum you have an account on) and scrapes information that is only visible after logging in, like your profile details.
Why it teaches scraping: This teaches you how to manage state, specifically authentication cookies. You’ll learn how websites track who you are across multiple requests and how to make your scraper act like a logged-in user.
Core challenges you’ll face:
- Inspecting the login process → maps to using browser dev tools to see what data the login form POSTs
- Managing cookies and sessions → maps to using a `requests.Session` object or browser automation to persist login cookies
- Handling CSRF tokens → maps to finding and including anti-forgery tokens in your login request
- Scraping pages that require a login → maps to navigating to protected pages after the session is established
Key Concepts:
- HTTP POST Requests: Sending data to a server.
- Cookies & Sessions: How servers remember you.
- CSRF (Cross-Site Request Forgery) Tokens: A common security measure in forms.
Difficulty: Advanced. Time estimate: 1-2 weeks. Prerequisites: Project 1, understanding of HTTP POST.
Real world outcome: Your script will run without any manual browser interaction, log into your account, navigate to your profile, and print out private information (like your email address listed on the profile), proving it was successfully authenticated.
Implementation Hints:
- Method 1: Browser Automation (Easier): Use Playwright. Navigate to the login page, use `page.fill()` to type in your username and password into the form fields, and `page.click()` to submit. The browser instance will automatically handle the cookies. Then, just navigate to the profile page.
- Method 2: `requests.Session` (Harder but faster):
  - Use a `requests.Session()` object, which will persist cookies automatically.
  - First, `GET` the login page. Parse the HTML to find any CSRF tokens (usually a hidden input field in the form).
  - Then, `POST` to the form’s action URL. The payload should be a dictionary containing your username, password, and the CSRF token.
  - After the `POST` is successful, the session object now has the login cookie. You can now `GET` any protected page using the same session object.
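A sketch of Method 2. The URLs and form field names (`username`, `password`, `csrf_token`) are placeholders; inspect the real login form in your browser’s dev tools and use whatever it actually POSTs:

```python
# Session-based login sketch; every URL and field name below is a placeholder.
import requests
from bs4 import BeautifulSoup

LOGIN_URL = "https://example.com/login"      # placeholder
PROFILE_URL = "https://example.com/profile"  # placeholder

with requests.Session() as session:          # cookies persist across all requests below
    # Step 1: GET the login page and pull the hidden CSRF token out of the form.
    login_page = session.get(LOGIN_URL)
    soup = BeautifulSoup(login_page.text, "html.parser")
    csrf_token = soup.find("input", {"name": "csrf_token"})["value"]

    # Step 2: POST credentials plus the token to the form's action URL.
    payload = {
        "username": "your_username",
        "password": "your_password",
        "csrf_token": csrf_token,
    }
    session.post(LOGIN_URL, data=payload).raise_for_status()

    # Step 3: the session now carries the login cookie, so protected pages work.
    profile = session.get(PROFILE_URL)
    print(profile.status_code)
```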
Learning milestones:
- Successfully log in via Playwright → You understand the browser-based approach.
- Identify the POST request and CSRF token for a login form → You can analyze authentication mechanisms.
- Successfully log in via `requests.Session` → You have mastered programmatic session handling.
Project 7: E-commerce Price Tracker
- File: LEARN_WEB_SCRAPING_AND_CRAWLING.md
- Main Programming Language: Python
- Alternative Programming Languages: Go, Node.js
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Web Scraping / Automation / Anti-Scraping
- Software or Tool: BeautifulSoup, Playwright, smtplib, a database (SQLite)
- Main Book: “Automate the Boring Stuff with Python” by Al Sweigart (for scheduling and email)
What you’ll build: A robust script that tracks the price of a product on an e-commerce site. It will run automatically on a schedule, save the price and timestamp to a database, and send you an email alert when the price drops below a target you set.
Why it teaches scraping: This is a practical, real-world project that combines many skills: scraping, data storage, automation, and dealing with anti-scraping measures (as e-commerce sites are often heavily protected). It forces you to write resilient code that can run unattended.
Core challenges you’ll face:
- Bypassing basic anti-scraping → maps to setting a realistic User-Agent and other headers
- Parsing complex, messy HTML → maps to finding the price in a jungle of divs and spans with dynamic class names
- Storing time-series data → maps to using a database like SQLite to log prices over time
- Scheduling and automation → maps to using cron (Linux/macOS) or Task Scheduler (Windows) to run your script
- Sending notifications → maps to using Python’s `smtplib` to send an email
Key Concepts:
- HTTP Headers: Specifically `User-Agent`.
- Database Interaction: Basic SQL for inserting and querying data.
- Task Scheduling: `cron` syntax or OS equivalents.
- SMTP Protocol: For sending emails.
Difficulty: Advanced. Time estimate: 2-3 weeks. Prerequisites: Projects 1, 5, and 6.
Real world outcome: You will receive an email in your inbox with a subject like “Price Alert! Product X is now $49.99!”, and you’ll have a database file containing the price history of that product, which you can even use to plot a graph.
Implementation Hints:
- Pick a product on a less-complex e-commerce site to start. Amazon is notoriously difficult.
- Your scraper function should be robust. It needs to handle cases where the price element isn’t found, or the page fails to load. Use `try...except` blocks.
- Set a `User-Agent` header that mimics a real browser. You can find your browser’s user agent by searching “what is my user agent”.
- Use `sqlite3` in Python. Create a simple table: `CREATE TABLE prices (timestamp TEXT, price REAL)`.
- Write a function that checks the latest price in the database and sends an email using `smtplib` if the new price is lower.
- Set up a cron job to run your Python script once a day: `0 9 * * * /usr/bin/python3 /path/to/your/script.py`.
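A sketch of the storage and alert pieces. The `scrape_price()` stub, the SMTP server, and the credentials are placeholders for your own scraper and email provider:

```python
# Storage and alert sketch for the price tracker; scraper and SMTP details are placeholders.
import smtplib
import sqlite3
from datetime import datetime, timezone
from email.message import EmailMessage

TARGET_PRICE = 50.00

def scrape_price() -> float:
    """Replace with the requests/BeautifulSoup scraper from the hints above."""
    raise NotImplementedError

def save_price(db_path: str, price: float) -> None:
    with sqlite3.connect(db_path) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS prices (timestamp TEXT, price REAL)")
        conn.execute(
            "INSERT INTO prices VALUES (?, ?)",
            (datetime.now(timezone.utc).isoformat(), price),
        )

def send_alert(price: float) -> None:
    msg = EmailMessage()
    msg["Subject"] = f"Price Alert! Product is now ${price:.2f}!"
    msg["From"] = "you@example.com"                          # placeholder
    msg["To"] = "you@example.com"                            # placeholder
    msg.set_content(f"The price dropped to ${price:.2f}.")
    with smtplib.SMTP_SSL("smtp.example.com", 465) as smtp:  # placeholder server
        smtp.login("you@example.com", "app-password")        # placeholder credentials
        smtp.send_message(msg)

if __name__ == "__main__":
    price = scrape_price()
    save_price("prices.db", price)
    if price < TARGET_PRICE:
        send_alert(price)
```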
Learning milestones:
- Reliably extract the price from the product page → Your selectors are robust.
- Save the price to a SQLite database → You can persist data.
- Send a test email from your script → You’ve integrated notifications.
- The entire system runs automatically via cron and sends a real alert → You have built a complete, automated monitoring system.
Summary
| Project | Main Programming Language |
|---|---|
| Project 1: Simple Quote Scraper | Python |
| Project 2: Multi-Page Book Scraper | Python |
| Project 3: robots.txt Parser | Python |
| Project 4: The Single-Domain “Scrawler” | Python |
| Project 5: JavaScript-Dependent Site Scraper | Python |
| Project 6: Authenticated Scraper | Python |
| Project 7: E-commerce Price Tracker | Python |