LEARN WEB SCRAPING AND CRAWLING
Learn Web Scraping and Crawling: From Data Extraction to Building a Search Engine
Goal: Deeply understand how to extract data from the web, from targeted scraping of a single page to building a broad, recursive web crawler. Master the tools, techniques, and ethics of automated web data collection.
Why Learn Web Scraping & Crawling?
The web is the world’s largest database, but most of it is unstructured, designed for human eyes. Learning to scrape and crawl is a superpower: it allows you to turn the web into your own personal, structured API. You can track prices, gather social media data, monitor news, aggregate information, and power data science projects.
After completing these projects, you will:
- Master the difference between targeted scraping and broad crawling.
- Extract any data from any website, whether it’s static HTML or a dynamic JavaScript-powered application.
- Build a resilient, polite, and efficient web crawler from scratch.
- Understand the legal and ethical considerations of web automation.
- Be proficient with industry-standard tools like BeautifulSoup, Scrapy, and Playwright/Selenium.
Core Concept Analysis
First, let’s clarify a key distinction: while often used interchangeably, scraping and crawling are distinct processes.
- Web Scraping (Targeted Extraction): The goal is to extract specific pieces of data from a known set of web pages. Think of it as surgically removing valuable information.
- Web Crawling (Broad Discovery): The goal is to discover and index web pages by following links recursively. A crawler is a trailblazer, finding the paths for a scraper to follow.
A crawler’s job is to build a map of URLs. A scraper’s job is to extract data from those URLs. Often, a crawler will contain a scraper to process the pages it discovers.
The Scraping & Crawling Landscape
┌─────────────────────────────────────────────────────────────────────────┐
│ The Web (Unstructured Data) │
│ <html>...<a href="/page2">Link</a><p class="price">$99.99</p>...</html> │
└─────────────────────────────────────────────────────────────────────────┘
│
┌──────────────────────┼──────────────────────┐
▼ ▼ ▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ HTTP REQUEST │ │ HTML PARSING │ │ DATA EXTRACTION │
│ │ │ │ │ │
│ • GET page HTML │ │ • CSS Selectors │ │ • Find price │
│ • Set User-Agent │ │ • XPath │ │ • Find title │
│ • Handle cookies │ │ • Build DOM tree │ │ • Find reviews │
└──────────────────┘ └──────────────────┘ └──────────────────┘
│ │ │
└────────────┐ │ ┌────────────┘
▼ ▼ ▼
┌──────────────────────────────────┐
│ CRAWLING LOGIC │
│ │
│ • Extract all links (`<a>` tags) │
│ • Add new links to a queue │
│ • Avoid visiting duplicate URLs │
│ • Obey robots.txt │
└──────────────────────────────────┘
│
▼
┌────────────────────────┐
│ STRUCTURED DATA │
│ │
│ • CSV, JSON, Database │
└────────────────────────┘
Project List
These projects are designed to build your skills progressively, starting with simple scraping and culminating in a sophisticated, distributed crawler.
Project 1: Simple Quote Scraper
- File: LEARN_WEB_SCRAPING_AND_CRAWLING.md
- Main Programming Language: Python
- Alternative Programming Languages: JavaScript (Node.js), Ruby
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 1: Beginner
- Knowledge Area: Web Scraping / HTML Parsing
- Software or Tool: Requests, BeautifulSoup (Python) / Axios, Cheerio (Node.js)
- Main Book: “Web Scraping with Python, 2nd Edition” by Ryan Mitchell
What you’ll build: A command-line script that scrapes all quotes and their authors from the static website quotes.toscrape.com.
Why it teaches scraping: This is the “Hello, World!” of web scraping. It teaches the two most fundamental skills: fetching a web page’s HTML content and parsing that HTML to find the specific data you need.
Core challenges you’ll face:
- Making an HTTP GET request → maps to learning how to fetch web page content
- Inspecting the page source → maps to using browser dev tools to find the right HTML tags and classes
- Parsing HTML with CSS selectors → maps to navigating the HTML structure to pinpoint data
- Extracting text from elements → maps to getting the clean data out of the HTML tags
Key Concepts:
- HTTP GET Requests: “Web Scraping with Python” Ch. 1
- HTML Structure: MDN Web Docs - “Introduction to HTML”
- CSS Selectors: “Web Scraping with Python” Ch. 2
- BeautifulSoup Basics: Official BeautifulSoup Documentation
Difficulty: Beginner. Time estimate: A few hours. Prerequisites: Basic Python (or other language) skills.
Real world outcome: Your script will run and print a clean list of quotes and authors to the console.
$ python scrape_quotes.py
"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking." - Albert Einstein
"It is our choices, Harry, that show what we truly are, far more than our abilities." - J.K. Rowling
... and so on
Implementation Hints:
- Use the `requests` library to get the content of `http://quotes.toscrape.com`. Check `response.status_code` to make sure it’s 200.
- Use your browser’s “Inspect” tool to look at the HTML for a quote. You’ll see it’s contained in a `div` with the class `quote`.
- Create a `BeautifulSoup` object from the response text.
- Use `soup.find_all('div', class_='quote')` to get a list of all quote elements.
- Loop through this list. In each element, find the `span` with class `text` for the quote and the `small` with class `author` for the author.
- Print the `.text` of each element.
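To make the hints concrete, here is a minimal sketch of the whole script, assuming the `requests` and `beautifulsoup4` packages are installed and that the site still uses its `div.quote` / `span.text` / `small.author` markup:

```python
# Minimal sketch of the quote scraper described in the hints above.
import requests
from bs4 import BeautifulSoup

URL = "http://quotes.toscrape.com"

response = requests.get(URL)
response.raise_for_status()  # stop early if the status code isn't 200-level

soup = BeautifulSoup(response.text, "html.parser")

# Each quote lives in a <div class="quote"> block.
for quote_div in soup.find_all("div", class_="quote"):
    text = quote_div.find("span", class_="text").get_text(strip=True)
    author = quote_div.find("small", class_="author").get_text(strip=True)
    print(f"{text} - {author}")
```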
Learning milestones:
- Successfully fetch the HTML → You understand how programs access the web.
- Extract the first quote → You can target a single piece of data.
- Extract all quotes on the page → You can iterate over lists of elements.
Project 2: Multi-Page Book Scraper
- File: LEARN_WEB_SCRAPING_AND_CRAWLING.md
- Main Programming Language: Python
- Alternative Programming Languages: JavaScript (Node.js), Go
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 1-2: Beginner to Intermediate
- Knowledge Area: Web Scraping / Crawling (Light)
- Software or Tool: Requests, BeautifulSoup, CSV library
- Main Book: “Web Scraping with Python, 2nd Edition” by Ryan Mitchell
What you’ll build: A scraper that navigates through all the pages of books.toscrape.com, extracts the title, price, and rating of every book, and saves the data neatly into a CSV file.
Why it teaches scraping: This project introduces pagination, a core concept. You’re no longer just scraping one page; you’re teaching your script to find and follow “Next” links. It also teaches you how to structure and save your data, moving from printing to creating a real dataset.
Core challenges you’ll face:
- Handling pagination → maps to finding the “Next” button’s link and looping until it disappears
- Extracting multiple data points per item → maps to building a dictionary or object for each book
- Cleaning data → maps to converting price strings like “£51.77” to the float `51.77`
- Saving to a structured file → maps to using a CSV writer to create a dataset with headers
Key Concepts:
- Relative vs. Absolute URLs: “Web Scraping with Python” Ch. 3
- Data Cleaning: Basic string manipulation (strip, replace).
- Writing to CSV: Python’s `csv` module documentation.
Difficulty: Beginner to Intermediate. Time estimate: Weekend. Prerequisites: Project 1.
Real world outcome:
You will have a books.csv file that you can open in Excel or Google Sheets, containing a structured list of every book on the site.
title,price,rating
A Light in the Attic,51.77,3
Tipping the Velvet,53.74,1
...
Implementation Hints:
- Start with a single URL and a `while` loop (`while next_page_url:`).
- On each page, scrape all the books into a list of dictionaries.
- After scraping the books, look for the “Next” button. It’s usually a link (`<a>` tag) inside a list item (`<li>`) with the class `next`.
- Extract the `href` from that link. It will be a relative URL (e.g., `catalogue/page-2.html`). You must combine it with the base URL (`http://books.toscrape.com/`) to create the full URL for the next request.
- If there’s no “Next” button, the link is `None`, and your `while` loop will terminate.
- Use Python’s `csv.DictWriter` to easily write your list of dictionaries to a file.
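A compact sketch of the pagination loop and CSV export, assuming the site’s current `article.product_pod`, `p.price_color`, and `li.next > a` markup (the rating column from the sample output is left as an exercise):

```python
# Pagination sketch for books.toscrape.com following the hints above.
import csv
import re
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

next_page_url = "http://books.toscrape.com/"
books = []

while next_page_url:
    soup = BeautifulSoup(requests.get(next_page_url).text, "html.parser")

    for article in soup.find_all("article", class_="product_pod"):
        title = article.h3.a["title"]
        price_text = article.find("p", class_="price_color").get_text()
        price = float(re.search(r"[\d.]+", price_text).group())  # "£51.77" -> 51.77
        books.append({"title": title, "price": price})

    # Follow the "Next" link if present; urljoin resolves the relative href.
    next_link = soup.select_one("li.next > a")
    next_page_url = urljoin(next_page_url, next_link["href"]) if next_link else None

with open("books.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price"])
    writer.writeheader()
    writer.writerows(books)
```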
Learning milestones:
- Scrape a single page into a CSV → You can create structured data.
- Successfully navigate to the second page → You understand how to follow links.
- Scrape the entire site → You have mastered basic pagination.
Project 3: robots.txt Parser
- File: LEARN_WEB_SCRAPING_AND_CRAWLING.md
- Main Programming Language: Python
- Alternative Programming Languages: Go, Rust
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Web Crawling / Ethics
- Software or Tool: Requests
- Main Book: N/A, rely on RFCs and web standards documentation.
What you’ll build: A tool that takes a domain (e.g., google.com) and a user-agent string as input. It will fetch and parse the robots.txt file and determine if that user-agent is allowed to access a given URL path.
Why it teaches crawling: This is the first rule of ethical crawling. Before you can build a wide-ranging crawler, you MUST understand how to respect a site’s wishes. This project forces you to parse and interpret the rules that govern all well-behaved bots on the internet.
Core challenges you’ll face:
- Parsing a non-HTML text format → maps to line-by-line parsing and state management
- Handling user-agent groups → maps to applying the right rules for the right bot, including the wildcard `*`
- Matching URL paths → maps to implementing `Allow` and `Disallow` path-matching logic
- Handling edge cases → maps to `Crawl-delay` directives, comments, and non-standard extensions
Resources for key challenges:
- Google’s `robots.txt` Specification - A clear explanation of the standard.
Key Concepts:
- Robots Exclusion Protocol: The official standard.
- String Matching: Implementing the path matching rules.
- Stateful Parsing: Keeping track of the current user-agent group as you read the file.
Difficulty: Intermediate. Time estimate: Weekend. Prerequisites: Basic programming, understanding of HTTP.
Real world outcome:
A command-line tool that acts as a robots.txt validator.
$ python check_robot.py --domain=google.com --path=/search --user-agent=MyBot
Path /search is DISALLOWED for user-agent MyBot
$ python check_robot.py --domain=google.com --path=/ --user-agent=Googlebot
Path / is ALLOWED for user-agent Googlebot
Implementation Hints:
- Your function will take a domain and construct the `robots.txt` URL (e.g., `https://google.com/robots.txt`).
- Read the file line by line.
- The logic is stateful. When you see a `User-agent:` line, you are now “in” that agent’s rule block.
- Store the rules (Allow/Disallow) in a dictionary, keyed by user-agent. Remember the wildcard `*`.
- To check a path, first look for rules for your specific user-agent. If none exist, fall back to the wildcard `*` agent.
- The most specific rule wins. A `Disallow: /private/` should match `/private/page.html` but not `/private`. The Google spec has details on this.
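A minimal sketch of the stateful parser and the longest-match check, with the simplifications noted in the comments (no in-path wildcards, no `Crawl-delay`, no Allow-wins tie-breaking):

```python
# Sketch of a stateful robots.txt parser and a longest-match access check.
import requests

def fetch_rules(domain: str) -> dict:
    """Return {user_agent: [(directive, path), ...]} parsed from robots.txt."""
    text = requests.get(f"https://{domain}/robots.txt", timeout=10).text
    rules, agents, in_group = {}, [], False
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()        # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_group:                                # rule lines ended the last group
                agents, in_group = [], False
            agents.append(value.lower())
            rules.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            in_group = True
            for agent in agents:
                rules[agent].append((field, value))
    return rules

def is_allowed(rules: dict, user_agent: str, path: str) -> bool:
    """Longest matching rule wins; no matching rule means the path is allowed."""
    group = rules.get(user_agent.lower())
    if group is None:
        group = rules.get("*", [])                      # fall back to the wildcard agent
    directive, matched = "allow", ""
    for rule_directive, rule_path in group:
        if rule_path and path.startswith(rule_path) and len(rule_path) > len(matched):
            directive, matched = rule_directive, rule_path
    return directive == "allow"

if __name__ == "__main__":
    rules = fetch_rules("google.com")
    print(is_allowed(rules, "MyBot", "/search"))
```

You can sanity-check your results against Python’s built-in `urllib.robotparser` module.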
Learning milestones:
- Fetch and print `robots.txt` → You can access the rules file.
- Parse rules for a single user-agent → You understand the basic format.
- Correctly handle the wildcard user-agent → You’ve implemented the fallback logic.
- Correctly determine access for any path → You’ve mastered the core of the Robots Exclusion Protocol.
Project 4: The Single-Domain “Scrawler”
- File: LEARN_WEB_SCRAPING_AND_CRAWLING.md
- Main Programming Language: Python
- Alternative Programming Languages: Go, Rust
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Web Crawling
- Software or Tool: Requests, BeautifulSoup, Python’s `collections.deque`
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann (for concepts on data systems)
What you’ll build: A true (but simple) web crawler. Given a starting URL, it will systematically discover all reachable pages within that same domain by following links, while keeping track of visited pages to not get stuck in loops.
Why it teaches crawling: This project solidifies the difference between scraping and crawling. The primary goal is no longer data extraction but link discovery and traversal. You will implement the fundamental data structures of any web crawler: a queue for URLs to visit (the “frontier”) and a set of visited URLs.
Core challenges you’ll face:
- Managing a URL frontier → maps to using a queue data structure to manage which pages to visit next
- Tracking visited URLs → maps to using a set for efficient lookup to prevent re-visiting and infinite loops
- Distinguishing internal vs. external links → maps to ensuring your crawler doesn’t wander off the target site
- Graceful error handling → maps to what to do if a page returns a 404 or 500 error
Key Concepts:
- Breadth-First Search (BFS): The algorithm your crawler will implement, naturally suited for a queue.
- URL Frontier: “Designing Data-Intensive Applications” Ch. 11 (relevant concepts on stream processing).
- Sets for Membership Testing: Python documentation on Set types.
Difficulty: Intermediate. Time estimate: 1-2 weeks. Prerequisites: Projects 1 & 2.
Real world outcome: Your script will run and produce a site map, printing every unique internal URL it finds on a given domain.
$ python scrawler.py https://your-blog.com
Crawling https://your-blog.com...
FOUND: https://your-blog.com/
FOUND: https://your-blog.com/about
FOUND: https://your-blog.com/posts/article-1
FOUND: https://your-blog.com/posts/article-2
...
Crawl complete. Found 25 unique pages.
Implementation Hints:
- Initialize a queue (e.g., `collections.deque` in Python) with the starting URL. This is your frontier.
- Initialize an empty set called `visited_urls`.
- Start a `while` loop that continues as long as the frontier queue is not empty.
- Inside the loop, pop a URL from the front of the queue. If it’s already in `visited_urls`, `continue`.
- Add the URL to `visited_urls` and process the page (fetch, parse).
- Extract all `href` attributes from `<a>` tags.
- For each new link, resolve it to an absolute URL. Check if it belongs to the same domain as the start URL.
- If it’s an internal link and not already in `visited_urls`, add it to the back of the frontier queue.
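A bare-bones sketch of the BFS loop described above, assuming the `requests` and `beautifulsoup4` packages are installed:

```python
# Minimal single-domain BFS crawler following the steps above.
import sys
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url: str) -> set:
    domain = urlparse(start_url).netloc
    frontier = deque([start_url])   # the URL frontier: pages waiting to be visited
    visited = set()                 # pages already processed

    while frontier:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)

        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
        except requests.RequestException:
            continue                # skip pages that 404/500 or time out
        print(f"FOUND: {url}")

        soup = BeautifulSoup(response.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]  # absolute URL, fragment dropped
            if urlparse(link).netloc == domain and link not in visited:
                frontier.append(link)

    return visited

if __name__ == "__main__":
    pages = crawl(sys.argv[1])
    print(f"Crawl complete. Found {len(pages)} unique pages.")
```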
Learning milestones:
- Crawl a two-page site → You have a working frontier and visited set.
- Crawl an entire small domain without getting stuck → Your loop detection is working.
- Correctly ignore external links and mailto: links → Your link filtering logic is robust.
Project 5: JavaScript-Dependent Site Scraper
- File: LEARN_WEB_SCRAPING_AND_CRAWLING.md
- Main Programming Language: Python
- Alternative Programming Languages: JavaScript (Node.js)
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: Web Scraping / Browser Automation
- Software or Tool: Playwright, Selenium (Python) / Puppeteer (Node.js)
- Main Book: Official documentation for Playwright or Selenium.
What you’ll build: A scraper for a site where the data is loaded dynamically with JavaScript, such as the “infinite scroll” version of quotes.toscrape.com/scroll.
Why it teaches scraping: This breaks you out of the simple requests model. You’ll learn that what you see in your browser is not always what requests gets. This forces you to automate a real web browser to wait for JS to execute, simulate user actions (like scrolling), and then scrape the final HTML.
Core challenges you’ll face:
- Controlling a headless browser → maps to launching and steering a browser programmatically
- Waiting for dynamic content → maps to using explicit waits for elements to appear, instead of `time.sleep()`
- Simulating user actions → maps to programmatically scrolling the page to trigger ‘infinite scroll’ events
- Extracting HTML after JS execution → maps to getting the final DOM state from the browser
Key Concepts:
- Browser Automation: Playwright/Selenium documentation.
- The DOM vs. View Source: Understanding the difference between initial HTML and the live Document Object Model.
- Explicit Waits: A core concept in Selenium/Playwright for creating reliable scripts.
Difficulty: Advanced. Time estimate: 1-2 weeks. Prerequisites: Project 1, understanding of the DOM.
Real world outcome:
Your script will launch a headless browser, scroll down the infinite-scroll page multiple times, and successfully scrape all the quotes, including those that were loaded dynamically. It will succeed where a simple requests-based scraper would fail.
Implementation Hints:
- Use `playwright.sync_api` for a simpler synchronous coding style to start.
- Launch a browser, create a new page object, and navigate to the URL.
- Instead of fetching HTML immediately, you need to interact with the page.
- To handle infinite scroll, use a loop that calls `page.evaluate("window.scrollTo(0, document.body.scrollHeight)")` to scroll to the bottom.
- After scrolling, you must wait for the new content to load. A good way is to see how many quotes are on the page, scroll, then wait until the quote count increases. Or, wait for the “loading” spinner to disappear.
- Once you’ve scrolled enough, get the page’s full content with `page.content()` and then parse it with BeautifulSoup as you did in Project 1.
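A rough sketch of the scroll-and-wait loop, assuming Playwright is installed (`pip install playwright`, then `playwright install chromium`) and that quotes render into `div.quote` elements as on quotes.toscrape.com/scroll:

```python
# Scroll until no new quotes appear, then hand the final DOM to BeautifulSoup.
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("http://quotes.toscrape.com/scroll")

    previous_count = 0
    while True:
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        # Crude pause; waiting explicitly until the quote count grows is more reliable.
        page.wait_for_timeout(1500)
        count = page.locator("div.quote").count()
        if count == previous_count:   # nothing new loaded, assume we reached the end
            break
        previous_count = count

    html = page.content()             # the final, JS-rendered DOM
    browser.close()

soup = BeautifulSoup(html, "html.parser")
print(f"Scraped {len(soup.find_all('div', class_='quote'))} quotes")
```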
Learning milestones:
- Launch a browser and take a screenshot → You can control the browser.
- Scrape data visible on initial page load → You can get the JS-rendered HTML.
- Scrape data that appears after a button click → You can simulate user actions and wait for results.
- Successfully scrape an infinite scroll page → You’ve mastered dynamic content scraping.
Project 6: Authenticated Scraper
- File: LEARN_WEB_SCRAPING_AND_CRAWLING.md
- Main Programming Language: Python
- Alternative Programming Languages: JavaScript (Node.js)
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: Web Scraping / Session Management
- Software or Tool: Requests with Sessions, or Playwright
- Main Book: “Black Hat Python, 2nd Edition” by Justin Seitz (has relevant chapters on authentication)
What you’ll build: A script that logs into a site (like GitHub or a forum you have an account on) and scrapes information that is only visible after logging in, like your profile details.
Why it teaches scraping: This teaches you how to manage state, specifically authentication cookies. You’ll learn how websites track who you are across multiple requests and how to make your scraper act like a logged-in user.
Core challenges you’ll face:
- Inspecting the login process → maps to using browser dev tools to see what data the login form POSTs
- Managing cookies and sessions → maps to using a `requests.Session` object or browser automation to persist login cookies
- Handling CSRF tokens → maps to finding and including anti-forgery tokens in your login request
- Scraping pages that require a login → maps to navigating to protected pages after the session is established
Key Concepts:
- HTTP POST Requests: Sending data to a server.
- Cookies & Sessions: How servers remember you.
- CSRF (Cross-Site Request Forgery) Tokens: A common security measure in forms.
Difficulty: Advanced. Time estimate: 1-2 weeks. Prerequisites: Project 1, understanding of HTTP POST.
Real world outcome: Your script will run without any manual browser interaction, log into your account, navigate to your profile, and print out private information (like your email address listed on the profile), proving it was successfully authenticated.
Implementation Hints:
- Method 1: Browser Automation (Easier): Use Playwright. Navigate to the login page, use `page.fill()` to type in your username and password into the form fields, and `page.click()` to submit. The browser instance will automatically handle the cookies. Then, just navigate to the profile page.
- Method 2: `requests.Session` (Harder but faster):
  - Use a `requests.Session()` object, which will persist cookies automatically.
  - First, `GET` the login page. Parse the HTML to find any CSRF tokens (usually a hidden input field in the form).
  - Then, `POST` to the form’s action URL. The payload should be a dictionary containing your username, password, and the CSRF token.
  - After the `POST` is successful, the session object now has the login cookie. You can now `GET` any protected page using the same session object.
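A sketch of Method 2. The URLs and form field names (`username`, `password`, `csrf_token`) are placeholders; inspect the real login form in your browser’s dev tools and use whatever it actually POSTs:

```python
# Session-based login sketch; every URL and field name below is a placeholder.
import requests
from bs4 import BeautifulSoup

LOGIN_URL = "https://example.com/login"      # placeholder
PROFILE_URL = "https://example.com/profile"  # placeholder

with requests.Session() as session:          # cookies persist across all requests below
    # Step 1: GET the login page and pull the hidden CSRF token out of the form.
    login_page = session.get(LOGIN_URL)
    soup = BeautifulSoup(login_page.text, "html.parser")
    csrf_token = soup.find("input", {"name": "csrf_token"})["value"]

    # Step 2: POST credentials plus the token to the form's action URL.
    payload = {
        "username": "your_username",
        "password": "your_password",
        "csrf_token": csrf_token,
    }
    session.post(LOGIN_URL, data=payload).raise_for_status()

    # Step 3: the session now carries the login cookie, so protected pages work.
    profile = session.get(PROFILE_URL)
    print(profile.status_code)
```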
Learning milestones:
- Successfully log in via Playwright → You understand the browser-based approach.
- Identify the POST request and CSRF token for a login form → You can analyze authentication mechanisms.
- Successfully log in via `requests.Session` → You have mastered programmatic session handling.
Project 7: E-commerce Price Tracker
- File: LEARN_WEB_SCRAPING_AND_CRAWLING.md
- Main Programming Language: Python
- Alternative Programming Languages: Go, Node.js
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Web Scraping / Automation / Anti-Scraping
- Software or Tool: BeautifulSoup, Playwright, smtplib, a database (SQLite)
- Main Book: “Automate the Boring Stuff with Python” by Al Sweigart (for scheduling and email)
What you’ll build: A robust script that tracks the price of a product on an e-commerce site. It will run automatically on a schedule, save the price and timestamp to a database, and send you an email alert when the price drops below a target you set.
Why it teaches scraping: This is a practical, real-world project that combines many skills: scraping, data storage, automation, and dealing with anti-scraping measures (as e-commerce sites are often heavily protected). It forces you to write resilient code that can run unattended.
Core challenges you’ll face:
- Bypassing basic anti-scraping → maps to setting a realistic User-Agent and other headers
- Parsing complex, messy HTML → maps to finding the price in a jungle of divs and spans with dynamic class names
- Storing time-series data → maps to using a database like SQLite to log prices over time
- Scheduling and automation → maps to using cron (Linux/macOS) or Task Scheduler (Windows) to run your script
- Sending notifications → maps to using Python’s `smtplib` to send an email
Key Concepts:
- HTTP Headers: Specifically `User-Agent`.
- Database Interaction: Basic SQL for inserting and querying data.
- Task Scheduling: `cron` syntax or OS equivalents.
- SMTP Protocol: For sending emails.
Difficulty: Advanced. Time estimate: 2-3 weeks. Prerequisites: Projects 1, 5, and 6.
Real world outcome: You will receive an email in your inbox with a subject like “Price Alert! Product X is now $49.99!”, and you’ll have a database file containing the price history of that product, which you can even use to plot a graph.
Implementation Hints:
- Pick a product on a less-complex e-commerce site to start. Amazon is notoriously difficult.
- Your scraper function should be robust. It needs to handle cases where the price element isn’t found, or the page fails to load. Use `try...except` blocks.
- Set a `User-Agent` header that mimics a real browser. You can find your browser’s user agent by searching “what is my user agent”.
- Use `sqlite3` in Python. Create a simple table: `CREATE TABLE prices (timestamp TEXT, price REAL)`.
- Write a function that checks the latest price in the database and sends an email using `smtplib` if the new price is lower.
- Set up a cron job to run your Python script once a day: `0 9 * * * /usr/bin/python3 /path/to/your/script.py`.
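A sketch of the storage and alert pieces. The `scrape_price()` stub, the SMTP server, and the credentials are placeholders for your own scraper and email provider:

```python
# Storage and alert sketch for the price tracker; scraper and SMTP details are placeholders.
import smtplib
import sqlite3
from datetime import datetime, timezone
from email.message import EmailMessage

TARGET_PRICE = 50.00

def scrape_price() -> float:
    """Replace with the requests/BeautifulSoup scraper from the hints above."""
    raise NotImplementedError

def save_price(db_path: str, price: float) -> None:
    with sqlite3.connect(db_path) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS prices (timestamp TEXT, price REAL)")
        conn.execute(
            "INSERT INTO prices VALUES (?, ?)",
            (datetime.now(timezone.utc).isoformat(), price),
        )

def send_alert(price: float) -> None:
    msg = EmailMessage()
    msg["Subject"] = f"Price Alert! Product is now ${price:.2f}!"
    msg["From"] = "you@example.com"                          # placeholder
    msg["To"] = "you@example.com"                            # placeholder
    msg.set_content(f"The price dropped to ${price:.2f}.")
    with smtplib.SMTP_SSL("smtp.example.com", 465) as smtp:  # placeholder server
        smtp.login("you@example.com", "app-password")        # placeholder credentials
        smtp.send_message(msg)

if __name__ == "__main__":
    price = scrape_price()
    save_price("prices.db", price)
    if price < TARGET_PRICE:
        send_alert(price)
```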
Learning milestones:
- Reliably extract the price from the product page → Your selectors are robust.
- Save the price to a SQLite database → You can persist data.
- Send a test email from your script → You’ve integrated notifications.
- The entire system runs automatically via cron and sends a real alert → You have built a complete, automated monitoring system.
Summary
| Project | Main Programming Language |
|---|---|
| Project 1: Simple Quote Scraper | Python |
| Project 2: Multi-Page Book Scraper | Python |
| Project 3: robots.txt Parser | Python |
| Project 4: The Single-Domain “Scrawler” | Python |
| Project 5: JavaScript-Dependent Site Scraper | Python |
| Project 6: Authenticated Scraper | Python |
| Project 7: E-commerce Price Tracker | Python |