Learn Web Crawling & Scraping: From Zero to Data Extraction Master
Goal: Deeply understand the web’s data layer—from fetching single pages (scraping) to discovering entire websites (crawling), handling modern web apps, and building robust, scalable data extraction pipelines.
Why Learn Web Crawling & Scraping?
The web is the largest database in the world, but most of it is unstructured. Learning to crawl and scrape is the superpower of turning messy web pages into clean, structured data. This skill is the foundation for data science, market analysis, machine learning, price comparison engines, and countless automated workflows.
After completing these projects, you will:
- Understand the crucial difference between crawling and scraping.
- Be able to extract any data from any website, static or dynamic.
- Build robust crawlers that politely navigate websites without getting blocked.
- Handle challenges like JavaScript rendering, anti-bot measures, and pagination.
- Design scalable systems to collect data from thousands of pages.
Core Concept Analysis: Crawling vs. Scraping
Though often used interchangeably, crawling and scraping are two distinct but related processes. Understanding the difference is key.
- Web Crawling is discovery. Its job is to explore. It starts with a URL, finds all the links on that page, and follows them to discover new URLs, mapping out a website’s structure.
- Web Scraping is extraction. Its job is to pull specific data out of a known page. It targets elements like prices, names, or articles and saves them in a structured format.
A crawler asks, “Where can I go from here?” A scraper asks, “What is the price on this page?”
The “Scrawling” Pipeline: How They Work Together
Most real-world data projects involve both. This is the typical workflow:
┌─────────────────┐
│ 1. Seed URL │
│ (e.g., website.com) │
└─────────────────┘
│
▼
┌─────────────────┐
│ 2. CRAWLER │
│ - Fetches page │
│ - Finds links │
│ - Adds new, │
│ in-scope URLs │
│ to a queue │
└─────────────────┘
│
▼
┌─────────────────┐
│ Queue of URLs │
│ (e.g., product/1) │
│ (e.g., product/2) │
└─────────────────┘
│
▼
┌─────────────────┐
│ 3. SCRAPER │
│ - Fetches page │
│ from queue │
│ - Parses HTML │
│ - Extracts data │
│ (price, name) │
└─────────────────┘
│
▼
┌─────────────────┐
│ 4. Structured │
│ Data │
│ (CSV, JSON, DB) │
└─────────────────┘
Key Technologies & Concepts
1. HTTP & HTML
- HTTP Verbs: GET (to fetch pages) and POST (to submit forms).
- HTML & CSS Selectors: The language for targeting data. You'll use tags (<p>), classes (.price), and IDs (#product-title) to pinpoint information.
2. Core Libraries (Python)
- requests: The de facto standard for making HTTP requests in Python.
- BeautifulSoup: The most popular library for parsing HTML and navigating the document tree.
- Scrapy: A full-fledged framework for large-scale crawling and scraping projects.
- Selenium/Playwright: Browser automation tools essential for scraping JavaScript-rendered websites.
3. Politeness & Ethics
- robots.txt: A file websites use to tell bots which parts of the site they shouldn't visit. Always respect it.
- Rate Limiting: Intentionally slowing down your requests to avoid overwhelming the server.
- User-Agent: An HTTP header that identifies your bot. Set a custom one to be transparent.
4. Advanced Challenges
- Dynamic Content (JavaScript): Websites that load data after the initial page load require a real browser (or a tool that acts like one) to scrape.
- Anti-Scraping: Websites may try to block scrapers using CAPTCHAs, IP bans, or by analyzing request patterns.
- Pagination: Navigating through multiple pages of results (Page 1, Page 2, etc.).
Project List
The following 10 projects are designed to build your skills progressively, starting with simple scraping and ending with a robust, automated data pipeline.
Project 1: Simple Static Page Scraper
- File: LEARN_WEB_CRAWLING_AND_SCRAPING.md
- Main Programming Language: Python
- Alternative Programming Languages: JavaScript (Node.js with Axios/Cheerio), Ruby (Nokogiri)
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 1: Beginner
- Knowledge Area: Web Scraping / HTML Parsing
- Software or Tool: requests, BeautifulSoup4
- Main Book: "Automate the Boring Stuff with Python, 2nd Edition" by Al Sweigart (Chapter 12)
What you’ll build: A command-line script that takes a URL of a simple blog post and prints out its title and all the paragraph text.
Why it teaches the core concepts: This project is the “Hello, World!” of web scraping. It forces you to learn the fundamental loop: make an HTTP request, check for success, parse the HTML, find specific tags, and extract their content.
Core challenges you’ll face:
- Making a GET request → maps to fetching the raw HTML from a server
- Parsing the HTML string into a navigable object → maps to understanding the role of a parser like BeautifulSoup
- Finding the title tag (<h1>) → maps to basic element selection by tag name
- Finding all paragraph tags (<p>) → maps to iterating over a list of found elements
Key Concepts:
- HTTP GET Request: “Automate the Boring Stuff” Chapter 12
- HTML Parsing: BeautifulSoup Documentation - “Quick Start”
- Element Selection: “Web Scraping with Python, 2nd Ed.” by Ryan Mitchell (Chapter 2)
Difficulty: Beginner. Time estimate: Weekend. Prerequisites: Basic Python (variables, loops, functions).
Real world outcome:
$ python scraper.py https://example-blog.com/my-first-post
Title: My First Blog Post
Body:
This is the first paragraph. It contains some interesting text.
This is the second paragraph. Scrapers need to be able to extract all of this.
Implementation Hints:
- Import the requests and BeautifulSoup libraries.
- Use requests.get(url) to fetch the page. Check response.status_code to make sure it's 200.
- Create a BeautifulSoup object from response.text.
- Use soup.find('h1').get_text() to get the title.
- Use soup.find_all('p') to get a list of all paragraph elements.
- Loop through the list and print the get_text() of each one.
- Questions to guide you: What happens if the h1 tag doesn't exist? How would you handle that gracefully? What's the difference between soup.find() and soup.find_all()?
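A minimal sketch putting these hints together (the URL is a placeholder, and the page is assumed to use a single <h1> for its title):

```python
import sys

import requests
from bs4 import BeautifulSoup

def scrape_post(url: str) -> None:
    """Fetch a blog post and print its title and paragraph text."""
    response = requests.get(url, timeout=10)
    if response.status_code != 200:
        print(f"Request failed with status {response.status_code}")
        return

    soup = BeautifulSoup(response.text, "html.parser")

    # Guard against pages that have no <h1> at all.
    title = soup.find("h1")
    print("Title:", title.get_text(strip=True) if title else "(no <h1> found)")

    print("Body:")
    for paragraph in soup.find_all("p"):
        print(paragraph.get_text(strip=True))

if __name__ == "__main__":
    scrape_post(sys.argv[1])  # e.g. python scraper.py https://example-blog.com/my-first-post
```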
Learning milestones:
- Successfully print the HTML content → You understand how to fetch a web page.
- Extract and print just the title → You can target a single, specific element.
- Extract and print all paragraphs → You can iterate over multiple elements.
Project 2: Scraping Structured Data to a CSV
- File: LEARN_WEB_CRAWLING_AND_SCRAPING.md
- Main Programming Language: Python
- Alternative Programming Languages: JavaScript (Node.js), Ruby
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 1: Beginner
- Knowledge Area: Web Scraping / Data Structuring
- Software or Tool: requests, BeautifulSoup4, csv
- Main Book: "Web Scraping with Python, 2nd Edition" by Ryan Mitchell
What you’ll build: A script that scrapes a “Top 100” list (e.g., top movies on IMDb, top books on Goodreads) and saves the rank, title, and rating into a clean output.csv file.
Why it teaches the core concepts: This project teaches you how to turn unstructured web content into structured data. You’ll move from just printing text to identifying the parent container of each item and extracting multiple related data points, then saving them in a universally usable format.
Core challenges you’ll face:
- Identifying the repeating HTML container for each item → maps to using browser developer tools to inspect the page structure
- Selecting elements by class or more complex CSS selectors → maps to targeting data more precisely than just by tag name
- Looping through the containers to extract data for each item → maps to building a list of dictionaries or tuples
- Writing the extracted data to a CSV file → maps to using Python's built-in csv module correctly
Key Concepts:
- CSS Selectors: MDN Web Docs - “CSS Selectors”
- Inspecting HTML: Chrome DevTools Documentation
- Writing CSVs in Python: “Automate the Boring Stuff” Chapter 16
Difficulty: Beginner. Time estimate: Weekend. Prerequisites: Project 1.
Real world outcome:
You will have an output.csv file that you can open in Excel or Google Sheets, with three columns: “Rank”, “Title”, and “Rating”, populated with the data from the website.
Implementation Hints:
- Right-click on the list on the website and “Inspect” to open developer tools.
- Find the HTML element that wraps a single item in the list (e.g., a <div> or <li> with a class like "list-item").
- Use soup.select('.list-item') to get a list of all item containers.
- Loop through this list. Inside the loop, use item.find() or item.select_one() to get the specific data points (rank, title, rating) relative to that item.
- Store the extracted data for each item in a dictionary.
- Use Python's csv.DictWriter to write your list of dictionaries to a file. This is better than manual string formatting as it handles escaping correctly.
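A minimal sketch of the loop-and-write pattern, assuming hypothetical selectors (.list-item, .rank, .title, .rating); inspect the real page to find the actual ones:

```python
import csv

import requests
from bs4 import BeautifulSoup

URL = "https://example.com/top-100"  # placeholder list page

response = requests.get(URL, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

rows = []
for item in soup.select(".list-item"):  # one container per list entry
    rows.append({
        "Rank": item.select_one(".rank").get_text(strip=True),
        "Title": item.select_one(".title").get_text(strip=True),
        "Rating": item.select_one(".rating").get_text(strip=True),
    })

with open("output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["Rank", "Title", "Rating"])
    writer.writeheader()    # header row
    writer.writerows(rows)  # one row per scraped item
```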
Learning milestones:
- Printing the name of just the first item → You have correctly identified the CSS selectors.
- Printing the rank, name, and rating for all items → Your loop and relative-finding logic is working.
- The output.csv file is created and correctly formatted → You have successfully structured and saved the data.
Project 3: A Basic, Recursive Web Crawler
- File: LEARN_WEB_CRAWLING_AND_SCRAPING.md
- Main Programming Language: Python
- Alternative Programming Languages: Go, Rust
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Web Crawling / Graph Traversal
- Software or Tool: requests, BeautifulSoup4, collections.deque
- Main Book: "How to Build a Web Crawler from Scratch" - freeCodeCamp article
What you’ll build: A script that starts at a single URL, finds all the links on that page, adds them to a queue, and systematically visits them, keeping track of visited URLs to avoid infinite loops. It will run up to a specified depth and print all unique URLs found.
Why it teaches the core concepts: This project is the essence of crawling. It’s not about extracting data, but about discovering pages. You’ll learn the fundamental algorithm of web traversal: using a queue (FIFO) for breadth-first search and a set for tracking visited pages.
Core challenges you’ll face:
- Extracting all <a> tags and their href attributes → maps to finding links on a page
- Managing a queue of URLs to visit → maps to the core data structure of a crawler (breadth-first search)
- Tracking visited URLs to avoid re-visiting and infinite loops → maps to using a set for efficient lookups
- Converting relative URLs to absolute URLs → maps to handling links like /about vs http://example.com/about
Key Concepts:
- Breadth-First Search (BFS): A common graph traversal algorithm perfect for crawlers.
- URL Parsing and Joining: Python's urllib.parse module (urljoin, urlparse).
- Sets for Efficient Lookups: Python documentation on set objects.
Difficulty: Intermediate. Time estimate: 1-2 weeks. Prerequisites: Project 1, basic understanding of data structures (queues, sets).
Real world outcome: When you run the script, it will print a stream of messages showing which page it’s currently crawling and, at the end, a list of all unique URLs it discovered within the given domain and depth limit.
Implementation Hints:
- Use collections.deque as your queue of URLs to visit. Initialize it with your seed URL.
- Use a set to store URLs you've already visited.
- The main loop:
  - While the queue is not empty:
    - Pop a URL from the left of the queue.
    - If it's already in the visited set, continue.
    - Add it to the visited set.
    - Fetch the page.
    - Find all <a> tags. For each tag, get the href.
    - Use urllib.parse.urljoin(base_url, href) to convert the link to an absolute URL.
    - Add the new, absolute URL to the right of the queue.
- Implement a depth limit to stop it from running forever. You can do this by storing tuples in your queue: (url, depth).
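A compact sketch of that loop, staying within the seed's domain and using a placeholder seed URL:

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(seed: str, max_depth: int = 2) -> set:
    domain = urlparse(seed).netloc
    visited = set()
    queue = deque([(seed, 0)])  # (url, depth)

    while queue:
        url, depth = queue.popleft()
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        print(f"Crawling (depth {depth}): {url}")

        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue  # skip pages that fail to load

        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]  # absolute URL, no fragment
            if urlparse(link).netloc == domain:           # stay in scope
                queue.append((link, depth + 1))

    return visited

if __name__ == "__main__":
    for u in sorted(crawl("https://example.com")):
        print(u)
```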
Learning milestones:
- The script prints all links from the first page → You can find and extract URLs.
- The script visits the links it found on the first page → Your queue logic is working.
- The script stops when it revisits a page or hits the depth limit → Your visited set and depth control are working correctly.
Project 4: Handling JavaScript-Rendered Websites
- File: LEARN_WEB_CRAWLING_AND_SCRAPING.md
- Main Programming Language: Python
- Alternative Programming Languages: JavaScript (Node.js with Playwright/Puppeteer)
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Web Scraping / Browser Automation
- Software or Tool: selenium or playwright
- Main Book: "Selenium with Python" - Official Documentation
What you’ll build: A scraper for a modern, dynamic website (like a single-page e-commerce site or a social media feed) that loads its content using JavaScript. The script will control a real browser to wait for the content to load and then extract it.
Why it teaches the core concepts: requests and BeautifulSoup can’t see content loaded by JavaScript. This project teaches you how to handle the modern web by using browser automation tools. You’ll learn about waiting for elements to appear, scrolling to trigger lazy-loading, and interacting with pages programmatically.
Core challenges you’ll face:
- Setting up a browser driver (e.g., chromedriver) → maps to the initial configuration of browser automation
- Navigating to a page and getting its content → maps to basic browser control
- Waiting explicitly for an element to be visible → maps to handling asynchronous data loading, the most critical skill in dynamic scraping
- Scrolling the page to trigger more content to load → maps to simulating user interaction for infinite scroll pages
Key Concepts:
- Document Object Model (DOM): The browser’s representation of a page, which JavaScript modifies.
- Explicit Waits: Telling Selenium to wait until a specific condition is met (e.g., an element is clickable), which is much more reliable than fixed time.sleep() calls.
- Headless Browsers: Running a browser in the background without a visible GUI, essential for server-side automation.
Difficulty: Intermediate. Time estimate: 1-2 weeks. Prerequisites: Project 2, understanding of the DOM.
Real world outcome:
You can successfully scrape product prices from a site like shopee.sg or tweets from twitter.com, which would be impossible with requests alone because the data isn’t in the initial HTML source. The script will launch a browser, navigate, wait, and then print the extracted data.
Implementation Hints:
- Install selenium and download the appropriate WebDriver for your browser (e.g., ChromeDriver).
- from selenium import webdriver and from selenium.webdriver.common.by import By.
- driver = webdriver.Chrome()
- driver.get(url)
- This is the key part: Use WebDriverWait to pause execution until the element you want is loaded. Example:

```python
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for elements with class 'product-price' to appear
price_element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "product-price"))
)
```

- Once the element is present, you can get its text with price_element.text.
- To scroll, use driver.execute_script("window.scrollTo(0, document.body.scrollHeight);").
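A fuller sketch in headless mode, assuming Selenium 4 (which fetches a matching driver for you via Selenium Manager) and the same hypothetical product-price class:

```python
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("--headless=new")       # no visible browser window
driver = webdriver.Chrome(options=options)

try:
    driver.get("https://example.com/products")  # placeholder dynamic page

    # Wait until at least one price element has been rendered by JavaScript.
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "product-price"))
    )

    # Scroll a few times to trigger lazy-loading of more items.
    for _ in range(3):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)  # crude pause; waiting on an element count is more robust

    for price in driver.find_elements(By.CLASS_NAME, "product-price"):
        print(price.text)
finally:
    driver.quit()
```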
Learning milestones:
- The script successfully opens a browser and navigates to the URL → Your driver setup is correct.
- You can extract the title of a dynamic page → You’ve successfully retrieved the post-JS DOM.
- The script waits for a specific element and extracts its data → You have mastered explicit waits.
- The script can scrape data that only appears after scrolling → You can simulate user interactions.
Project 5: Building a Polite, Respectful Crawler
- File: LEARN_WEB_CRAWLING_AND_SCRAPING.md
- Main Programming Language: Python
- Alternative Programming Languages: Go, Rust
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 2: Intermediate
- Knowledge Area: Web Crawling / Ethics
- Software or Tool: requests, urllib.robotparser, time
- Main Book: "The Web Application Hacker's Handbook" (for understanding web architecture)
What you’ll build: Enhance your basic crawler from Project 3 to be a “good internet citizen.” It will read and respect the robots.txt file of a domain, add a delay between requests to avoid overwhelming the server, and identify itself with a custom User-Agent.
Why it teaches the core concepts: Aggressive crawling is a great way to get your IP address banned. This project teaches the crucial, real-world skill of “ethical” or “responsible” crawling. You’ll learn how to interact with web servers in a way that is transparent and low-impact.
Core challenges you’ll face:
- Fetching and parsing robots.txt → maps to using urllib.robotparser to understand crawling rules
- Checking if a URL is allowed before fetching → maps to implementing the can_fetch() check in your crawl loop
- Implementing a rate limit → maps to using time.sleep() to pause between requests
- Setting a custom User-Agent header → maps to identifying your bot in HTTP requests
Key Concepts:
- Robots Exclusion Protocol (robots.txt): The standard for communicating with web crawlers.
- HTTP Headers: Metadata sent with every request. The User-Agent is a key part of your crawler's identity.
- Rate Limiting: A fundamental technique for being a polite crawler and avoiding blocks.
Difficulty: Intermediate. Time estimate: Weekend. Prerequisites: Project 3.
Real world outcome:
Your crawler will now automatically skip URLs that are disallowed (e.g., /admin, /cart). When you watch it run, you’ll see a noticeable pause between page fetches. If you inspect the server logs for the target site, you would see your custom User-Agent string in the access logs.
Implementation Hints:
- Before your main loop, initialize the RobotFileParser:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('http://example.com/robots.txt')
rp.read()
```

- Set a custom User-Agent for your requests session:

```python
headers = {'User-Agent': 'MyCoolBot/1.0 (+http://my-bot-info.com)'}
session = requests.Session()
session.headers.update(headers)
```

- Inside your crawl loop, before fetching a URL, check if you're allowed:

```python
if not rp.can_fetch(headers['User-Agent'], url_to_fetch):
    print(f"Skipping disallowed URL: {url_to_fetch}")
    continue
```

- After each successful fetch, add a delay: time.sleep(2) (2 seconds is a good starting point).
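A small sketch tying the three pieces together into a single fetch helper, reusing the placeholder bot name and domain from above:

```python
import time
import urllib.robotparser

import requests

USER_AGENT = "MyCoolBot/1.0 (+http://my-bot-info.com)"  # placeholder identity

session = requests.Session()
session.headers.update({"User-Agent": USER_AGENT})

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()

def polite_fetch(url: str, delay: float = 2.0):
    """Fetch a URL only if robots.txt allows it, then pause to rate-limit."""
    if not rp.can_fetch(USER_AGENT, url):
        print(f"Skipping disallowed URL: {url}")
        return None
    response = session.get(url, timeout=10)
    time.sleep(delay)  # be gentle with the server
    return response
```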
Learning milestones:
- The script correctly identifies your User-Agent → You can modify request headers.
- The script waits between requests → You have implemented rate limiting.
- The script skips URLs listed in robots.txt → You are now running a polite crawler.
Project 6: The “Scrawler” - Crawl and Scrape a Blog
- File: LEARN_WEB_CRAWLING_AND_SCRAPING.md
- Main Programming Language: Python
- Alternative Programming Languages: Go, Ruby
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Web Crawling & Scraping
- Software or Tool: requests, BeautifulSoup4, csv
- Main Book: "Web Scraping with Python, 2nd Edition" by Ryan Mitchell
What you’ll build: A program that crawls a blog, finds all the articles, and then scrapes the title, author, and publication date from each one, saving it all to a CSV file.
Why it teaches the core concepts: This project combines crawling and scraping into a single, powerful pipeline. You’ll learn to separate the logic of discovery (finding article URLs) from extraction (getting data from those URLs). This pattern is the foundation of almost all large-scale data collection efforts.
Core challenges you’ll face:
- Differentiating between article links and other links → maps to using URL patterns (e.g., /posts/2023/...) to identify what to scrape
- Structuring the code to separate crawling and scraping logic → maps to creating distinct functions, crawl_site() and scrape_page()
- Passing data between the crawler and the scraper → maps to using a shared queue or data structure
- Handling duplicate pages found through different links → maps to reusing the visited set from the crawler project
Key Concepts:
- URL Pattern Matching: Using regular expressions or simple string methods to classify URLs.
- Producer-Consumer Pattern: The crawler “produces” URLs, and the scraper “consumes” them.
- Code Modularity: Breaking a complex task into smaller, reusable functions.
Difficulty: Advanced. Time estimate: 1-2 weeks. Prerequisites: Projects 2 and 3.
Real world outcome: You can point the script at the homepage of a blog (e.g., a Medium publication or a personal blog), and it will produce a CSV file containing a structured list of all articles on that blog, which you could then use for analysis or archiving.
Implementation Hints:
- Start with your crawler code from Project 3.
- Create two queues (or deques): urls_to_crawl and urls_to_scrape.
- In your main loop, when you pop a URL from urls_to_crawl:
  - Fetch the page.
  - For every link found:
    - Use a condition (e.g., if '/post/' in link:) to decide if it's an article.
    - If it's an article link, add it to urls_to_scrape.
    - If it's another page on the site (like /about or a category page), add it to urls_to_crawl.
- After the crawling is finished (or in parallel), create a second loop that works through the urls_to_scrape queue.
- For each URL in this queue, call a scrape_article_page() function that contains the logic from Project 1 (find title, author, etc.).
- Append the scraped data to a list of dictionaries, and finally write it to a CSV file.
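A skeleton of the two-phase pipeline, assuming the /post/ URL pattern from the hints above; the selectors inside scrape_article_page() are placeholders you would adapt to the target blog:

```python
import csv
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def get_soup(url):
    return BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")

def scrape_article_page(url):
    soup = get_soup(url)
    return {
        "url": url,
        "title": soup.find("h1").get_text(strip=True),
        # author/date selectors depend on the blog's markup
    }

def crawl_blog(seed):
    domain = urlparse(seed).netloc
    urls_to_crawl, urls_to_scrape, visited = deque([seed]), deque(), set()

    while urls_to_crawl:                      # phase 1: discovery
        url = urls_to_crawl.popleft()
        if url in visited:
            continue
        visited.add(url)
        for a in get_soup(url).find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlparse(link).netloc != domain or link in visited:
                continue
            (urls_to_scrape if "/post/" in link else urls_to_crawl).append(link)

    rows = [scrape_article_page(u) for u in set(urls_to_scrape)]  # phase 2: extraction
    with open("articles.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["url", "title"])
        writer.writeheader()
        writer.writerows(rows)
```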
Learning milestones:
- The script correctly identifies and separates article links from navigation links → Your URL classification logic is working.
- The script successfully crawls the site to find all articles → Your crawling logic is robust.
- The final CSV contains scraped data from every article found → The entire crawl-and-scrape pipeline is functional.
Project 7: Full-Fledged Scrapy Project
- File: LEARN_WEB_CRAWLING_AND_SCRAPING.md
- Main Programming Language: Python
- Alternative Programming Languages: N/A (Scrapy is unique)
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 3: Advanced
- Knowledge Area: Web Crawling Frameworks
- Software or Tool: Scrapy
- Main Book: "Scrapy: Official Documentation"
What you’ll build: Re-implement Project 6 (the blog “scrawler”) using the Scrapy framework. You’ll define Items, write a Spider, and use Scrapy’s built-in mechanisms for following links and exporting data.
Why it teaches the core concepts: Building everything from scratch is great for learning, but professionals use frameworks for speed and scalability. This project teaches you the “Scrapy way” of doing things. You’ll learn about its architecture (Spiders, Items, Pipelines, Middlewares) and how it handles concurrency, rate-limiting, and data export out-of-the-box.
Core challenges you’ll face:
- Setting up a Scrapy project and spider → maps to learning the framework’s command-line tools and project structure
- Defining a structured Item → maps to creating a schema for your scraped data
- Writing the parse method to extract data and find new links → maps to the core logic of a Scrapy spider
- Yielding Request objects to follow links → maps to Scrapy's way of managing the crawl queue
- Exporting data using Feed Exporters → maps to using Scrapy's built-in scrapy crawl myspider -o output.json functionality
Key Concepts:
- Spiders: The core class where you define how a site (or group of sites) will be scraped.
- Items: Think of them as structured dictionaries for your scraped data.
- Pipelines: Components that process an item after it has been scraped (e.g., cleaning data, saving to a database).
- Asynchronous Processing: Scrapy is built on Twisted, an asynchronous networking library, which makes it very fast.
Difficulty: Advanced. Time estimate: 1-2 weeks. Prerequisites: Project 6, understanding of Python classes.
Real world outcome:
You will have a well-structured Scrapy project. Running scrapy crawl blog_spider -o articles.json from your terminal will execute the entire crawl and scrape process, producing a clean JSON file with the data. Your code will be much cleaner and more organized than the manual version.
Implementation Hints:
- Run scrapy startproject myblogscraper.
- Define your data schema in items.py (e.g., title = scrapy.Field(), author = scrapy.Field()).
- Create a spider in the spiders/ directory.
  - Set name, allowed_domains, and start_urls.
  - In the parse method:
    - Use response.css() or response.xpath() to select data.
    - If you are on a page with article data, yield an Item filled with your data.
    - Find links to follow using response.css('a::attr(href)').
    - For each link, yield response.follow(url=next_page, callback=self.parse) to continue the crawl.
- No need to write your own CSV/JSON export logic. Scrapy handles it automatically with the -o flag.
Learning milestones:
- The spider scrapes the first page correctly → You understand the basic parse method and Item yielding.
- The spider follows links to other pages → You understand how to yield Request or use response.follow.
- The final output.json contains all the articles from the site → You have successfully built a complete Scrapy project.
Project 8: Scalable Crawler with a Job Queue
- File: LEARN_WEB_CRAWLING_AND_SCRAPING.md
- Main Programming Language: Python
- Alternative Programming Languages: Go with RabbitMQ
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Distributed Systems / Scalability
- Software or Tool: redis, rq (Redis Queue), requests
- Main Book: "Designing Data-Intensive Applications" by Martin Kleppmann (Chapter 11)
What you’ll build: A distributed crawling system. A “master” process finds URLs and pushes them into a Redis queue. Multiple “worker” processes pull URLs from the queue, fetch the pages, and perform the scraping, allowing you to crawl much faster.
Why it teaches the core concepts: This project teaches you how to scale your data collection efforts. A single-threaded crawler is slow. By decoupling URL discovery from page fetching using a message queue, you can add more workers to increase throughput. This is a fundamental pattern in large-scale systems.
Core challenges you’ll face:
- Setting up Redis and RQ → maps to learning the basics of a message broker and job queue
- Creating a “producer” script → maps to a crawler that finds links but instead of visiting them, enqueues them as jobs
- Creating a “worker” function → maps to a scraper that receives a URL as a job, processes it, and saves the data
- Sharing a "visited" set → maps to using a Redis set to track visited URLs across all workers
Key Concepts:
- Message Queues: A system for asynchronous communication between processes (e.g., Redis, RabbitMQ, SQS).
- Worker Processes: Background processes that consume tasks from a queue.
- Distributed State: Managing shared information (like visited URLs) in a central store like Redis.
Difficulty: Expert. Time estimate: 2-3 weeks. Prerequisites: Project 6, Docker for running Redis easily.
Real world outcome: You will run one “producer” script. Then, in multiple separate terminals, you can run the “worker” script. You will see the workers pick up jobs from the queue in parallel and process them, scraping data much faster than a single script ever could.
Implementation Hints:
- Use Docker to quickly spin up a Redis container.
- Producer (producer.py):
  - This is your crawler.
  - Connect to Redis: from rq import Queue; q = Queue(connection=redis_conn).
  - When it finds a URL to scrape, instead of fetching it, enqueue it: q.enqueue(scrape_function, url).
  - To avoid adding duplicate URLs to the queue, check against a Redis set: if not redis_conn.sismember('visited_urls', url): redis_conn.sadd('visited_urls', url); q.enqueue(...).
- Worker (worker.py):
  - Define your scrape_function(url) here. This function contains the logic to fetch a page, parse it, and save the data.
  - In your terminal, run rq worker to start a worker process that listens to your queue. You can run this command in many terminals to start many workers.
- The scraper function should save its data directly to a database or a shared file system, as workers are independent.
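A minimal sketch of the producer side and the job function, assuming Redis is running locally on its default port (e.g., via Docker); the module name tasks.py and the example URLs are placeholders:

```python
# tasks.py - the job function that workers import and execute
import requests
from bs4 import BeautifulSoup

def scrape_page(url):
    """Fetch one URL and persist the result (print stands in for a DB write)."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    title = soup.title.string if soup.title else ""
    print(f"{url} -> {title}")
```

```python
# producer.py - discovers URLs and enqueues them as jobs
from redis import Redis
from rq import Queue

from tasks import scrape_page

redis_conn = Redis(host="localhost", port=6379)
q = Queue(connection=redis_conn)

def enqueue_if_new(url):
    # SADD returns 1 only the first time a member is added, so it doubles
    # as an atomic "have we seen this URL?" check shared by all workers.
    if redis_conn.sadd("visited_urls", url):
        q.enqueue(scrape_page, url)

enqueue_if_new("https://example.com/product/1")  # placeholder URLs
enqueue_if_new("https://example.com/product/2")
```

Workers are then started with rq worker (run from the same directory so tasks.py is importable); each worker pulls scrape_page jobs off the default queue independently.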
Learning milestones:
- The producer successfully adds a URL to the Redis queue → Your producer and queue are connected.
- A single worker process picks up the URL and scrapes it → Your worker function is working.
- Multiple workers run in parallel without scraping the same page twice → Your distributed state management (Redis set) is working, and you have achieved scalable crawling.
Project 9: Evading Basic Anti-Scraping Measures
- File: LEARN_WEB_CRAWLING_AND_SCRAPING.md
- Main Programming Language: Python
- Alternative Programming Languages: Go
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 4: Expert
- Knowledge Area: Web Security / Proxies
- Software or Tool: requests, public proxy lists
- Main Book: "The Web Application Hacker's Handbook"
What you’ll build: A scraper that is more resilient to blocking. It will rotate its IP address by using a list of public proxies and will cycle through a list of common User-Agent strings for each request.
Why it teaches the core concepts: Naive scrapers are easily detected and blocked. This project teaches you the basics of evasion. You’ll learn how websites identify bots (by IP and User-Agent) and the fundamental techniques used to circumvent these simple checks.
Core challenges you’ll face:
- Finding and managing a list of proxy servers → maps to scraping a public proxy site or using a proxy service
- Configuring requests to use a proxy → maps to the proxies parameter in requests.get()
- Rotating the User-Agent header for each request → maps to creating a list of common User-Agents and picking one at random
- Handling proxy errors → maps to retrying a request with a different proxy if one fails
Key Concepts:
- HTTP Proxies: An intermediary server that forwards your requests, masking your original IP address.
- User-Agent Rotation: Changing your User-Agent header to mimic different browsers and devices, making your traffic look less uniform and bot-like.
- Error Handling & Retries: A crucial part of robust scraping, especially when dealing with unreliable public proxies.
Difficulty: Expert. Time estimate: 1-2 weeks. Prerequisites: Project 2, understanding of HTTP headers.
Real world outcome: Your scraper will be able to successfully extract data from a website that employs basic IP-based rate limiting or User-Agent filtering. You can watch the script try different proxies and User-Agents as it runs.
Implementation Hints:
- Create a list of common User-Agent strings. You can find these online easily.
- Find a public proxy list website. Write a separate, simple scraper to get a list of IP addresses and ports.
- In your main scraper:
  - Before each request, select a random User-Agent: headers = {'User-Agent': random.choice(user_agents)}.
  - Select a random proxy from your list. The format for the proxies dictionary is {'http': 'http://IP:PORT', 'https': 'https://IP:PORT'}.
  - Wrap your requests.get() call in a try...except block to catch connection errors, which are common with public proxies.
  - If a proxy fails, remove it from your list and try the request again with a different proxy.
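A sketch of the rotation-and-retry loop; the User-Agent strings and proxy addresses below are placeholders you would load from your own list:

```python
import random

import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",            # truncated examples
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
]
PROXIES = ["203.0.113.10:8080", "203.0.113.11:3128"]            # placeholder IP:PORT pairs

def fetch_with_rotation(url, max_attempts=5):
    attempts = 0
    while PROXIES and attempts < max_attempts:
        proxy = random.choice(PROXIES)
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
        try:
            return requests.get(url, headers=headers, proxies=proxies, timeout=10)
        except requests.RequestException:
            PROXIES.remove(proxy)   # drop dead proxies and retry with another
            attempts += 1
    raise RuntimeError("No working proxy found")
```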
Learning milestones:
- A request is successfully sent using a custom User-Agent → You can modify headers.
- A request is successfully sent through a proxy server → You understand how to route traffic.
- The scraper successfully completes its task on a site that blocks naive requests → You have built a more resilient bot.
Project 10: Change Detection & Notification Bot
- File: LEARN_WEB_CRAWLING_AND_SCRAPING.md
- Main Programming Language: Python
- Alternative Programming Languages: JavaScript (Node.js)
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: Automation / Monitoring
- Software or Tool: requests, BeautifulSoup4, smtplib, schedule library
- Main Book: "Automate the Boring Stuff with Python, 2nd Edition"
What you’ll build: An automated bot that scrapes a single piece of information from a web page (e.g., a product’s price, the number of tickets left for an event) on a schedule. It will store the result, and if the value changes from the last check, it will send you an email notification.
Why it teaches the core concepts: This project teaches you how to turn a simple scraper into a useful, long-running monitoring service. You’ll learn about storing state, scheduling tasks, and integrating with other services (like email) to create a complete automation pipeline.
Core challenges you’ll face:
- Storing the previously scraped value → maps to saving state between runs, e.g., in a simple text file or JSON file
- Scheduling the script to run periodically → maps to using a library like schedule or a system tool like cron
- Comparing the new value to the old value → maps to the core change-detection logic
- Sending an email notification → maps to using Python's smtplib to connect to an email server and send a message
Key Concepts:
- Statefulness: Making a script that remembers information from its previous executions.
- Task Scheduling: Running code at predefined intervals.
- SMTP (Simple Mail Transfer Protocol): The standard protocol for sending email.
Difficulty: Advanced. Time estimate: 1-2 weeks. Prerequisites: Project 2, access to an SMTP server (e.g., Gmail allows this with an "App Password").
Real world outcome: The script runs quietly in the background. When the price of a product you’re watching drops, you will automatically receive an email in your inbox that says “Price Drop Alert! The price of [Product Name] is now $19.99!”.
Implementation Hints:
- Write a scraper function that targets and extracts the specific piece of data you want to monitor. Make sure it cleans the data (e.g., removes currency symbols, converts to a float).
- Create a function read_last_value() that opens a file (e.g., last_price.txt) and reads the value.
- Create a function write_new_value(value) that saves the new value to that same file.
- Main logic:
  - current_value = scrape_value()
  - last_value = read_last_value()
  - if current_value != last_value: send_notification(); write_new_value(current_value)
- Use the schedule library for easy scheduling in Python: schedule.every(1).hour.do(check_price_job).
- For email, smtplib is built-in. You'll need an SMTP server, port, username, and password.
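A sketch of the full check-compare-notify cycle, assuming a hypothetical .price selector and placeholder SMTP credentials (with Gmail, an App Password):

```python
import os
import smtplib
import time
from email.message import EmailMessage

import requests
import schedule
from bs4 import BeautifulSoup

URL = "https://example.com/product/123"   # placeholder product page
STATE_FILE = "last_price.txt"

def scrape_value():
    soup = BeautifulSoup(requests.get(URL, timeout=10).text, "html.parser")
    text = soup.select_one(".price").get_text()            # assumed selector
    return float(text.replace("$", "").replace(",", ""))   # clean to a number

def read_last_value():
    return float(open(STATE_FILE).read()) if os.path.exists(STATE_FILE) else None

def write_new_value(value):
    with open(STATE_FILE, "w") as f:
        f.write(str(value))

def send_notification(value):
    msg = EmailMessage()
    msg["Subject"] = f"Price Alert: now ${value:.2f}"
    msg["From"] = msg["To"] = "you@example.com"             # placeholder address
    msg.set_content(f"The price at {URL} changed to ${value:.2f}.")
    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as smtp:
        smtp.login("you@example.com", "app-password")       # placeholder credentials
        smtp.send_message(msg)

def check_price_job():
    current = scrape_value()
    if current != read_last_value():
        send_notification(current)
        write_new_value(current)

schedule.every(1).hour.do(check_price_job)
while True:
    schedule.run_pending()
    time.sleep(60)
```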
Learning milestones:
- The script can successfully scrape and print the target value → Your scraper is working.
- The script saves the value to a file and reads it back on the next run → You have implemented statefulness.
- The script runs automatically every few minutes → Your scheduler is working.
- You receive an email when the value on the live website is changed manually → The full notification pipeline is functional.
Project Comparison Table
| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| Simple Scraper | Beginner | Weekend | Foundational Scraping | Practical |
| Scrape to CSV | Beginner | Weekend | Data Structuring | Practical |
| Basic Crawler | Intermediate | 1-2 weeks | Foundational Crawling | Genuinely Clever |
| Dynamic Scraper | Intermediate | 1-2 weeks | Modern Web Apps | Hardcore Tech Flex |
| Polite Crawler | Intermediate | Weekend | Ethics & Robustness | Genuinely Clever |
| The “Scrawler” | Advanced | 1-2 weeks | Full Pipeline | Hardcore Tech Flex |
| Scrapy Project | Advanced | 1-2 weeks | Frameworks & Speed | Hardcore Tech Flex |
| Scalable Crawler | Expert | 2-3 weeks | Distributed Systems | Pure Magic |
| Evasion Bot | Expert | 1-2 weeks | Anti-Bot Bypass | Hardcore Tech Flex |
| Change Bot | Advanced | 1-2 weeks | Automation & Monitoring | Genuinely Clever |
Recommendation
To get a solid foundation, start with Project 1 (Simple Scraper) and then immediately do Project 2 (Scrape to CSV). This will teach you 90% of what you need for basic data extraction.
After that, jump to Project 3 (Basic Crawler) to understand the discovery aspect. Once you’ve completed those three, you’ll have a clear, practical understanding of the difference between scraping and crawling. From there, Project 6 (The “Scrawler”) is the perfect next step to combine your skills into a real-world project.
Summary
- Project 1: Simple Static Page Scraper: Python
- Project 2: Scraping Structured Data to a CSV: Python
- Project 3: A Basic, Recursive Web Crawler: Python
- Project 4: Handling JavaScript-Rendered Websites: Python
- Project 5: Building a Polite, Respectful Crawler: Python
- Project 6: The “Scrawler” - Crawl and Scrape a Blog: Python
- Project 7: Full-Fledged Scrapy Project: Python
- Project 8: Scalable Crawler with a Job Queue: Python
- Project 9: Evading Basic Anti-Scraping Measures: Python
- Project 10: Change Detection & Notification Bot: Python