Regex: Pattern Mastery - Real World Projects

Goal: Build a deep, engineering-grade understanding of regular expressions: the syntax, the engines that execute them, the performance and security trade-offs, and the real-world applications that make regex a daily tool in systems, security, data, and language tooling. By the end, you will be able to design patterns deliberately (not by trial-and-error), explain why a pattern is correct, and predict when it will be fast or dangerously slow. You will understand how regex relates to finite automata and why some features (like backreferences) change the algorithmic game. The projects culminate in building a regex engine and optimization pipeline from first principles.


Introduction

Regular expressions (regex) are a compact language for describing sets of strings and rules for finding or transforming them. In practice, regex lets you answer questions like: “Does this input match a valid format?”, “Where are the tokens in this file?”, and “How can I extract and transform just the parts I care about?”.

What you will build (by the end of this guide):

  • A grep-like pattern matcher and several real-world parsers/validators
  • A search-and-replace engine with capture-aware transformations
  • A tokenizer and a miniature regex engine (NFA/DFA)
  • A regex debugger/visualizer and a basic optimizer

Scope (what is included):

  • Regex syntax and mental models (literal matching, classes, quantifiers, anchors)
  • Engine internals (backtracking vs. Thompson NFA vs. DFA-style matching)
  • Performance and security (catastrophic backtracking, ReDoS)
  • Practical applications (validation, parsing, tokenization, log processing)

Out of scope (for this guide):

  • Full formal proofs of regular language theory
  • All vendor-specific regex extensions (we focus on portable concepts)
  • GUI-centric tools or IDE-specific regex UIs

The Big Picture (Mental Model)

                ┌───────────────────────────────┐
                │           Pattern             │
                │   "^(?<user>\w+):\s*(\d+)"    │
                └───────────────┬───────────────┘
                                │
                                v
┌───────────────────────────────┴──────────────────────────────┐
│                      Regex Engine Pipeline                   │
├───────────────────────────────────────────────────────────────┤
│ Parse -> AST -> (Optional) Optimize -> NFA/DFA -> Match Loop  │
└───────────────────────────────┬───────────────────────────────┘
                                │
                                v
                 ┌───────────────────────────┐
                 │       Match Objects       │
                 │  spans, groups, captures  │
                 └──────────────┬────────────┘
                                │
                                v
                 ┌───────────────────────────┐
                 │  Use in App / Tool / CLI  │
                 └───────────────────────────┘

Key Terms You Will See Everywhere

  • Regex flavor: A specific implementation of regex features and semantics (POSIX, PCRE, Python, JavaScript, RE2, etc.).
  • Backtracking: A matching strategy that tries alternatives by rewinding input to explore other paths; powerful but can be slow.
  • NFA/DFA: Finite automata models used to execute regex efficiently.

How to Use This Guide

  1. Read the Theory Primer first. It is the “mini-book” that explains the concepts the projects rely on.
  2. Start with Project 1 to build a minimal matcher. This anchors the mental model.
  3. Build in layers. Each project adds a new idea (lookarounds, captures, sanitization, parsing, performance).
  4. Always test patterns with real data. Build a tiny test corpus for each project.
  5. Use the project checklists. The “Definition of Done” items are your completion criteria.

Prerequisites & Background

Before starting these projects, you should have foundational understanding in these areas:

Essential Prerequisites (Must Have)

Programming Skills:

  • Proficiency in one language (Python, JavaScript, Go, or Rust)
  • Comfortable with reading files and processing strings
  • Basic command-line usage and tool invocation

CS Fundamentals:

  • Basic data structures (arrays, stacks, hash maps)
  • Big-O notation for understanding performance
  • Basic state machines (you should know what “state” and “transition” mean)

Text Processing:

  • Strings, encodings, and escaping rules
  • Understanding of ASCII vs Unicode at a high level

Helpful But Not Required

Automata Theory:

  • Formal definitions of NFA/DFA and regular languages
  • Can learn during: Projects 9, 11, 13

Compiler Concepts:

  • Tokenization and lexing
  • Can learn during: Projects 9 and 14 (Capstone)

Self-Assessment Questions

  1. Can you read a file line by line and process each line in your language of choice?
  2. Do you understand why a pattern like \n is different from a literal n?
  3. Can you explain what a “state” and “transition” are in a finite state machine?
  4. Do you know how to measure the runtime of a function or script?
  5. Are you comfortable writing a simple CLI program?

If you answered “no” to 1-3, spend 1-2 weeks reviewing basic string processing and CLI projects first.

Development Environment Setup

Required Tools:

  • A Unix-like environment (macOS or Linux recommended)
  • Python 3.11+ or Node.js 18+ (for the early projects)
  • A modern terminal and text editor

Recommended Tools:

  • ripgrep and jq for quick test data generation
  • hyperfine for benchmarking regex performance
  • time, perf, or similar profiling tools

Testing Your Setup:

$ python3 --version
Python 3.11.9

$ node --version
v20.10.0

$ rg --version
ripgrep 14.1.0

Time Investment

  • Simple projects (1-3): Weekend (4-8 hours each)
  • Moderate projects (4-8): 1 week (10-20 hours each)
  • Complex projects (9-14, including Capstone): 2+ weeks each
  • Total sprint: ~2-3 months if completed end-to-end

Important Reality Check

Regex is deceptively simple. The first time through, your patterns will be wrong. That is normal. Mastery comes from understanding why a pattern matches or fails, and how the engine explores alternatives. The projects are structured so that each new feature explains a failure you hit in the previous project.


Big Picture / Mental Model

                   ┌──────────────────────┐
                   │   Pattern Language   │
                   └─────────┬────────────┘
                             │ parse
                             v
                   ┌──────────────────────┐
                   │   Regex AST (tree)   │
                   └─────────┬────────────┘
                             │ compile
                             v
                   ┌──────────────────────┐
                   │  NFA / DFA / VM code │
                   └─────────┬────────────┘
                             │ execute
                             v
                   ┌──────────────────────┐
                   │  Match + Captures    │
                   └─────────┬────────────┘
                             │ integrate
                             v
                   ┌──────────────────────┐
                   │   Tools & Products   │
                   └──────────────────────┘

Theory Primer (Read This Before Coding)

This section is the “textbook” for the projects. Each chapter corresponds to a concept cluster that appears throughout the project list.

Chapter 1: Pattern Syntax, Literals, and Escaping

Fundamentals

Regex is a small language with its own syntax rules, and the first mental model you need is that a regex pattern is not the same as a plain string. A pattern is made of literals (characters that match themselves) and metacharacters (characters that change meaning, like . or *). The simplest patterns are literal characters: cat matches the string “cat”. As soon as you introduce metacharacters, you are describing sets of strings rather than one specific string. For example, c.t matches cat, cot, or c7t. Escaping is what keeps your intent precise: \. means “match a literal dot” instead of “any character”. Understanding escaping also requires understanding two layers of interpretation: your programming language string literal rules and the regex engine’s rules. That is why raw strings (r"..." in Python) are so important. If you do not control escaping, you will not control your pattern.

Deep Dive into the Concept

Regex syntax is best understood as a tiny grammar with precedence rules. At the core are atoms (literals, character classes, and grouped subexpressions). Atoms can be concatenated (juxtaposed), alternated (|), and repeated (quantifiers). Precedence matters: repetition binds tighter than concatenation, which binds tighter than alternation. That means ab|cd is (ab)|(cd) and not a(b|c)d. Different regex flavors implement different syntax, but the POSIX definitions are still the foundational mental model: POSIX defines basic (BRE) and extended (ERE) regular expressions, which differ mainly in whether characters like +, ?, |, and grouping parentheses need escaping. This matters when you write portable patterns for tools like sed (BRE) or grep -E (ERE). In many modern regex flavors (PCRE, Python, JavaScript), the syntax is closer to ERE but with many extensions.

The key to writing correct patterns is to recognize how your engine tokenizes the pattern. For example, in most engines, the dot . matches any character except newline, but that changes under a DOTALL/SINGLELINE flag. Similarly, metacharacters like { can either be a quantifier or a literal, depending on whether a valid quantifier follows. Escaping is therefore not cosmetic; it changes the meaning of the language. A good habit is to write regex in verbose mode (where supported) or to build it step-by-step: start with a literal baseline, then add operators one at a time, testing each expansion. The moment you add a metacharacter, re-ask: “Is this supposed to be syntax or a literal?”.

Regex flavor differences also emerge here. POSIX defines specific behavior for word boundaries and character classes, while many modern engines add shorthand like \w or Unicode property classes like \p{L}. RE2, for example, intentionally avoids features that require backtracking (like backreferences and lookarounds) to preserve linear-time guarantees. These flavor differences are not just about syntax; they change which algorithms the engine can use and thus how patterns behave in practice.

Another deep point is token boundary ambiguity. In a{2,3} the braces form a valid quantifier, but in a{2, the quantifier is malformed, so many engines fall back to treating the { (and what follows it) as literal characters. This means a malformed quantifier can silently change semantics rather than raise an error, depending on the flavor. That is why robust tools expose a "strict" mode or validate patterns before use. The safest practice is to test your pattern as a unit (in the exact engine you will use) and to prefer explicit, unambiguous syntax when possible.

Finally, there is a human syntax problem: readability. Most regex bugs are not because the regex is impossible to write, but because it is hard to read and maintain. The solution is to structure patterns: use non-capturing groups to clarify scope, whitespace/comments in verbose mode where available, and staged composition (build your regex from smaller sub-patterns). Treat regex like code. If it is too dense to explain in words, it is too dense to ship.
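
To make this concrete, here is a minimal Python sketch of staged composition with re.VERBOSE; the sub-pattern names and the date format are illustrative assumptions, not part of any project spec.

import re

# Build the regex from named sub-patterns, then assemble them in verbose mode.
YEAR = r"(?P<year>\d{4})"
MONTH = r"(?P<month>0[1-9]|1[0-2])"
DAY = r"(?P<day>[0-3]\d)"

DATE = re.compile(
    rf"""
    ^ {YEAR}    # anchored four-digit year
    - {MONTH}   # literal hyphen, then month 01-12
    - {DAY}     # literal hyphen, then a coarse day check
    $
    """,
    re.VERBOSE,
)

print(DATE.match("2024-12-25").groupdict())
# {'year': '2024', 'month': '12', 'day': '25'}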

How This Fits in Projects

You will apply this in Projects 1-6 whenever you translate requirements into regex and need to escape user input correctly.

Definitions & Key Terms

  • Literal: A character that matches itself.
  • Metacharacter: A character with special meaning in regex syntax (., *, +, ?, |, (, ), [, ], {, }, ^, $).
  • Escaping: Using \ to force a metacharacter to be treated as a literal.
  • Regex flavor: A specific implementation’s syntax and behavior.

Mental Model Diagram

pattern string --> lexer --> tokens --> AST
   "a\.b"         [a][\.][b]     (concat a . b)

How It Works (Step-by-Step)

  1. The engine parses the pattern string into tokens.
  2. Tokens are grouped into atoms, concatenations, and alternations.
  3. The parser builds an AST that reflects precedence rules.
  4. That AST is later compiled into an NFA/DFA or VM bytecode.

Minimal Concrete Example

Pattern:  "file\.(txt|md)"
Matches:  file.txt, file.md
Rejects:  file.csv, filetxt
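
A minimal Python sketch of this example, showing both escaping layers (the raw string and re.escape for user input); the filenames are made up.

import re

# Raw string: the backslash in \. reaches the regex engine intact.
pattern = re.compile(r"file\.(txt|md)")

print(bool(pattern.search("file.txt")))   # True
print(bool(pattern.search("file.csv")))   # False
print(bool(pattern.search("filetxt")))    # False

# When user input becomes part of a pattern, escape it first so that
# metacharacters such as ( ) . are matched literally.
user_text = "report(v1).txt"
safe = re.compile(re.escape(user_text))
print(bool(safe.search("see report(v1).txt for details")))  # True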

Common Misconceptions

  • “Regex is just a string.” -> It is a language with grammar and precedence.
  • “If it works in my editor, it will work everywhere.” -> Flavor differences matter.

Check-Your-Understanding Questions

  1. Why does a.b match aXb but not a\nb?
  2. What does \. mean inside a regex pattern?
  3. How does ab|cd differ from a(b|c)d?

Check-Your-Understanding Answers

  1. . matches any character except newline unless DOTALL is enabled.
  2. It escapes the dot, so it matches a literal ..
  3. ab|cd matches ab or cd; a(b|c)d matches abd or acd.

Real-World Applications

  • Building portable CLI tools that support POSIX vs PCRE modes
  • Escaping user input before inserting into a regex

Where You Will Apply It

  • Project 1: Pattern Matcher CLI
  • Project 2: Text Validator Library
  • Project 3: Search and Replace Tool

References

  • POSIX regex grammar and BRE/ERE rules: https://man.openbsd.org/re_format.7
  • Russ Cox, “Regular Expression Matching Can Be Simple And Fast”: https://swtch.com/~rsc/regexp/regexp1.html
  • RE2 design notes and syntax (linear-time, no backreferences): https://github.com/google/re2 and https://github.com/google/re2/wiki/Syntax

Key Insight: Regex is a language, not a string. Treat it like a language and you will write correct patterns.

Summary

Escaping and syntax rules are the foundation. Without them, every other concept will fail.

Homework/Exercises

  1. Convert these literal strings into correct regex literals: a+b, file?.txt, hello(world).
  2. Write both BRE and ERE versions of the same pattern: “one or more digits”.

Solutions

  1. a\+b, file\?\.txt, hello\(world\)
  2. BRE: [0-9]\{1,\} or [0-9][0-9]*; ERE: [0-9]+

Chapter 2: Character Classes and Unicode

Fundamentals

Character classes are the “vocabulary” of regex. A character class matches one character from a set. The simplest is [abc], which matches a, b, or c. Ranges like [a-z] compress common sets. Negated classes like [^0-9] match anything except digits. In practice, classes are used for validation (is this a digit?), parsing (what starts a token?), and sanitization (what should be stripped). Modern regex engines also provide shorthand classes such as \d, \w, and \s, but those are flavor-dependent: in Unicode-aware engines they match a wider set than ASCII digits or letters. That is why Unicode awareness matters even in “simple” patterns.

Deep Dive into the Concept

The tricky part of character classes is that they are not just lists of characters; they are a bridge between the regex engine and the text encoding model. In ASCII, [a-z] is a straightforward range. In Unicode, ranges can behave unexpectedly if you do not understand collation and normalization. POSIX character classes like [:alpha:] or [:digit:] are defined in terms of locale and can shift meaning based on environment. Meanwhile, Unicode properties such as \p{L} (letters) or \p{Nd} (decimal digits) are explicit and stable across locales but are not universally supported across all regex flavors. A crucial design choice for any regex-heavy system is whether to treat text as bytes or characters. Python, JavaScript, and modern libraries operate on Unicode code points, but some legacy or performance-focused tools still operate on bytes. This changes how \w and . behave.

Unicode is large and evolving. Unicode 17.0 (2025) reports 159,801 graphic+format characters and 297,334 total assigned code points, and the release added 4,803 new characters in a single year (see https://www.unicode.org/versions/stats/charcountv17_0.html and https://blog.unicode.org/2025/09/unicode-170-release-announcement.html). This means a naive ASCII-only class is insufficient for global input. If you validate a username with [A-Za-z], you are excluding most of the planet. Conversely, using \w may allow combining marks or scripts you did not intend. Regex designers must decide: do you want strict ASCII, full Unicode letters, or a custom set? This is why production-grade validation often uses explicit Unicode properties rather than shorthand classes.

Character classes also have subtle syntax rules. Inside [], most metacharacters lose their meaning, except for -, ^, ], and \. You must place - at the end or escape it to mean a literal hyphen. A common bug is forgetting that \b means backspace inside [] in some flavors, not word boundary. That is a cross-flavor hazard that breaks patterns when moved between engines.

Unicode also introduces normalization issues. The same user-visible character can be represented by different code point sequences (for example, an accented letter can be a single composed code point or a base letter plus a combining mark). A regex that matches a composed form may fail to match the decomposed form unless you normalize input or explicitly allow combining marks (\p{M}). This is why production-grade systems normalize text before regex validation, or they include combining marks in allowed classes. Without normalization, your regex might appear correct in tests but fail for real-world input from different devices or keyboards.

Case folding is another trap. In ASCII, case-insensitive matching is straightforward. In Unicode, case folding is language-dependent and not always one-to-one (some characters expand to multiple code points). Engines vary in how they implement this. If you depend on case-insensitive matching for international input, you must test with real samples and understand your engine’s Unicode mode. Otherwise, a pattern that seems correct for A-Z can quietly break for non-Latin scripts.
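
A minimal Python sketch of the normalization trap; it assumes CPython's re module, where \w does not match combining marks, and uses café purely as an example.

import re
import unicodedata

# "café" typed two ways: one composed code point vs. base letter + combining mark.
composed = "caf\u00e9"      # 'é' as U+00E9
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

word = re.compile(r"^\w+$")

print(bool(word.match(composed)))     # True
print(bool(word.match(decomposed)))   # False - the combining mark is not \w here

# Normalizing to NFC before matching makes both forms behave the same way.
print(bool(word.match(unicodedata.normalize("NFC", decomposed))))  # True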

How This Fits in Projects

Classes show up in validation (Project 2), parsing (Project 5), log processing (Project 6), and tokenization (Project 9).

Definitions & Key Terms

  • Character class: A set of characters matched by a single position.
  • POSIX class: [:alpha:], [:digit:], etc., often locale-dependent.
  • Unicode property: A named property such as \p{L} (letter).

Mental Model Diagram

[Class] -> matches exactly 1 code point
         ┌──────────────┐
   input │  a  9  _  日 │
         └─┬──┬──┬──┬───┘
           │  │  │  │
         [a-z]  \d  \w  \p{L}

How It Works (Step-by-Step)

  1. The engine parses the class and builds a set of allowed code points.
  2. For each input position, it tests membership in that set.
  3. Negated classes invert the membership test.

Minimal Concrete Example

Pattern:  "^[A-Za-z_][A-Za-z0-9_]*$"
Matches:  _var, user123
Rejects:  9start, café
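
A minimal Python sketch contrasting the ASCII-only class above with a Unicode-aware alternative; the idiom [^\W\d] (a word character that is not a digit) is one common way to say "letter or underscore" in engines without \p{...} support, and the test names are illustrative.

import re

ascii_ident = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")
uni_ident = re.compile(r"^[^\W\d]\w*$")   # \w is Unicode-aware for str patterns

for name in ["_var", "user123", "9start", "café", "日本語"]:
    print(f"{name!r:12} ascii={bool(ascii_ident.match(name))} unicode={bool(uni_ident.match(name))}")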

Common Misconceptions

  • “[A-Z] is Unicode uppercase.” -> It is ASCII unless Unicode properties are used.
  • “\w always means [A-Za-z0-9_].” -> It varies by flavor and Unicode settings.

Check-Your-Understanding Questions

  1. What does [^\s] match?
  2. Why is [A-Z] a poor choice for international names?
  3. How do you include a literal - inside a class?

Check-Your-Understanding Answers

  1. Any non-whitespace character.
  2. It excludes letters outside ASCII; Unicode names will fail.
  3. Place it first/last or escape it: [-a-z] or [a\-z].

Real-World Applications

  • Email/URL validation rules
  • Token boundaries in programming languages
  • Log parsing across multi-locale systems

Where You Will Apply It

  • Project 2: Text Validator Library
  • Project 5: URL/Email Parser
  • Project 9: Tokenizer/Lexer

References

  • Unicode character counts (Unicode 17.0): https://www.unicode.org/versions/stats/charcountv17_0.html
  • Unicode 17.0 release announcement: https://blog.unicode.org/2025/09/unicode-170-release-announcement.html
  • Unicode Standard, Chapter 3 (conformance and properties): https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-3/
  • POSIX class syntax and collation rules: https://man.openbsd.org/re_format.7

Key Insight: A character class is a policy decision. It encodes who and what you allow.

Summary

Character classes are where regex meets the real world: encodings, locales, and Unicode.

Homework/Exercises

  1. Write a pattern that matches only ASCII hex colors.
  2. Write a Unicode-friendly pattern that matches “letters” and “marks”.

Solutions

  1. ^#?[A-Fa-f0-9]{6}$
  2. ^[\p{L}\p{M}]+$ (in engines that support Unicode properties)

Chapter 3: Quantifiers and Repetition (Greedy, Lazy, Possessive)

Fundamentals

Quantifiers let you say “how many times” an atom can repeat. * means zero or more, + means one or more, ? means optional, and {m,n} lets you specify exact ranges. This is the backbone of pattern flexibility. By default, most engines are greedy: they match as much as they can while still allowing the rest of the pattern to match. But sometimes you need lazy behavior (minimal matching) or possessive behavior (no backtracking). Understanding these three modes is a core survival skill.

Deep Dive into the Concept

Quantifiers are a major reason regex can be both powerful and dangerous. Greedy quantifiers (*, +, {m,n}) cause the engine to consume as much input as possible, then backtrack if needed. This is what makes patterns like <.*> match across multiple tags: the .* is greedy and swallows everything. Lazy quantifiers (*?, +?, {m,n}?) invert that behavior: they consume as little as possible and expand only if needed. Possessive quantifiers (*+, ++, {m,n}+) are different: they consume as much as possible and never give it back. This is safer for performance but can cause matches to fail if later parts of the pattern need that input.

In many regex engines (including Python), lazy and possessive quantifiers are implemented as modifiers on the main quantifier. This creates an important mental model: the regex engine is not matching in a purely declarative way; it is executing a search strategy. Greedy means “take as much as you can; backtrack later”. Lazy means “take as little as you can; expand later”. Possessive means “take as much as you can; never backtrack”. These three strategies create different match results even though they describe similar sets of strings.

Quantifiers are also where catastrophic backtracking emerges. Nested quantifiers with overlapping matches ((a+)+, (.*a)*) create many possible paths. Backtracking engines will explore these paths and can take exponential time. This is a real-world security risk (ReDoS) and a major reason some engines (RE2) refuse to implement features like backreferences that require backtracking.

Quantifiers interact with anchors and alternation in non-obvious ways. For example, ^\d+\s+\w+ is efficient because it is anchored and linear. But .*\w at the start of a pattern is almost always a red flag because it forces the engine to attempt matches at many positions. The right mental model is “quantifiers multiply the search space”. If you add them, be sure you also add anchors or context to keep the search constrained.

Another advanced nuance is quantifier stacking and nested groups. A pattern like (\w+\s+)* is deceptively simple but often unstable: the inner \w+\s+ can match the same prefixes in multiple ways, and the outer * multiplies those choices. This ambiguity is exactly what causes exponential behavior in backtracking engines. The fix is to make the repeated unit unambiguous, for example by anchoring it to a delimiter or by using atomic/possessive constructs when supported. If you cannot make it unambiguous, consider switching to a safer engine or redesign the parsing strategy.

Quantifier ranges ({m,n}) are also a trade-off between correctness and performance. A tight range like {3,5} is predictable and efficient; a wide range like {0,10000} can be expensive because it allows huge variation. If you know realistic limits (e.g., username lengths), encode them. This not only improves performance but also improves security by bounding worst-case behavior. Think of ranges as constraints, not just matching convenience.

How This Fits in Projects

Quantifiers are everywhere: validators (Project 2), parsers (Project 5), search/replace (Project 3), and sanitizer (Project 4).

Definitions & Key Terms

  • Greedy quantifier: Matches as much as possible, then backtracks.
  • Lazy quantifier: Matches as little as possible, then expands.
  • Possessive quantifier: Matches as much as possible, no backtracking.

Mental Model Diagram

input:  <a> b <c>
pattern: <.*?>
            ^  ^
        lazy expands until '>'

How It Works (Step-by-Step)

  1. Quantifier consumes input according to its strategy.
  2. Engine tries to match the rest of the pattern.
  3. If it fails, greedy/lazy backtrack differently; possessive does not.

Minimal Concrete Example

Pattern:  "<.*?>"
Text:     "<a> b <c>"
Match:    "<a>"
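
A minimal Python sketch of the three strategies on this input; the possessive form requires Python 3.11 or newer.

import re

text = "<a> b <c>"

print(re.search(r"<.*>", text).group())    # greedy: '<a> b <c>'
print(re.search(r"<.*?>", text).group())   # lazy:   '<a>'

# Possessive: .*+ swallows the rest of the line and never gives it back,
# so the closing '>' can no longer match and the search fails.
print(re.search(r"<.*+>", text))           # None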

Common Misconceptions

  • “.* always means the same thing.” -> It depends on context and flags.
  • “Lazy is always better.” -> Sometimes you need greedy for correctness.

Check-Your-Understanding Questions

  1. What does a.*b match in aXbYb?
  2. Why might .* at the start of a regex be slow?
  3. When should you use possessive quantifiers?

Check-Your-Understanding Answers

  1. Greedy: aXbYb (whole string). Lazy: aXb.
  2. It forces the engine to explore many backtracking paths.
  3. When you want to prevent backtracking for performance and the match logic allows it.

Real-World Applications

  • Parsing HTML-like tags
  • Extracting tokens with flexible lengths
  • Avoiding ReDoS in validation patterns

Where You Will Apply It

  • Project 3: Search and Replace Tool
  • Project 4: Input Sanitizer
  • Project 6: Log Parser

References

  • Python re documentation (greedy/lazy/possessive quantifiers): https://docs.python.org/3/library/re.html
  • OWASP ReDoS guidance: https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS

Key Insight: Quantifiers are power tools. They can build or destroy performance.

Summary

Quantifiers define repetition and control the engine’s search strategy. Mastering them is essential.

Homework/Exercises

  1. Write a pattern to match quoted strings without spanning lines.
  2. Write a pattern that matches 3 to 5 digits, possessively.

Solutions

  1. "[^"\n]*"
  2. \d{3,5}+ (in engines that support possessive quantifiers)

Chapter 4: Anchors, Boundaries, and Modes

Fundamentals

Anchors match positions, not characters. ^ anchors to the beginning and $ to the end. Word boundaries (\b in many flavors, [[:<:]] and [[:>:]] in POSIX) match the transition between word and non-word characters. Anchors are critical for validation: without them, a pattern can match a substring and still succeed. Modes (flags) like MULTILINE and DOTALL change how anchors and . behave. Understanding these is the difference between “matches any line” and “matches the whole file”.

Deep Dive into the Concept

Anchors define context. Without anchors, the engine can search for a match starting at any position. With anchors, you tell the engine where a match is allowed to begin or end. This is not just a correctness feature; it is a performance feature. Anchoring a pattern to the start of the string reduces the number of candidate positions the engine must consider. This can turn a potentially expensive search into a linear one.

Word boundaries are trickier than they seem. In many engines, \b is defined as the boundary between a word character (\w) and a non-word character. That means its behavior changes if \w is Unicode-aware. In POSIX, word boundaries are defined differently (with [[:<:]] and [[:>:]]), and their behavior depends on locale and definition of “word”. This is a classic portability pitfall. If your regex must be portable across environments, explicit character classes may be safer than word boundaries.

Modes like MULTILINE and DOTALL alter the fundamental semantics of anchors and the dot. In MULTILINE mode, ^ and $ match at line boundaries within a multi-line string; in DOTALL mode, . matches newlines. These flags are powerful but also dangerous: accidentally enabling DOTALL can make .* span huge sections of text, leading to unexpected captures and performance problems. A best practice is to keep patterns local: use explicit \n when you mean newlines, and scope DOTALL to a group if the flavor supports inline modifiers (e.g., (?s:...)).

Anchors also interact with greedy quantifiers. A common performance bug is .* followed by an anchor like $, which can cause the engine to backtrack repeatedly. Anchoring early (^) and using more specific classes reduce this risk. The right mental model is that anchors and boundaries define the search space. Quantifiers explore within that space. Use anchors to make the search space as small as possible.

Another nuance is the difference between string anchors and line anchors. Some flavors provide explicit anchors for the absolute start and end of the string (e.g., \A and \Z), which are unaffected by multiline mode. This becomes important when you embed regex inside tools where the input might include newlines and you still want to match the entire string. If your engine supports these, use them when you mean "absolute start/end" to avoid surprises caused by multiline mode.

Boundaries also differ in locale-sensitive environments. POSIX word boundaries are defined in terms of character classes and locale-specific definitions of word characters. That means a regex written for one locale might behave differently in another. If you need portability, define explicit word character classes instead of relying on locale-sensitive boundaries.

How This Fits in Projects

Anchors are used in validation (Project 2), parsing structured data (Project 5), and log processing (Project 6).

Definitions & Key Terms

  • Anchor: A zero-width assertion for positions (^, $).
  • Boundary: A position between categories of characters (word vs non-word).
  • Mode/Flag: Engine setting that changes behavior (MULTILINE, DOTALL).

Mental Model Diagram

line:  "error: disk full"
        ^               $

MULTILINE:
^ matches after every \n, $ before every \n

How It Works (Step-by-Step)

  1. Engine checks whether current position satisfies anchor/boundary.
  2. If yes, it continues; if not, it fails at that position.
  3. Modes change what qualifies as “start” or “end”.

Minimal Concrete Example

Pattern:  "^ERROR:"
Text:     "INFO\nERROR: disk full"
Match:    only if MULTILINE is enabled
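
A minimal Python sketch of the same idea; the log lines are made up.

import re

log = "INFO: started\nERROR: disk full\nINFO: done"

print(re.findall(r"^ERROR:.*", log))                 # [] - ^ only matches at offset 0
print(re.findall(r"^ERROR:.*", log, re.MULTILINE))   # ['ERROR: disk full']

# \A and \Z assert the absolute start/end of the string even in MULTILINE mode.
print(bool(re.search(r"\AINFO", log, re.MULTILINE)))  # True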

Common Misconceptions

  • “^ always means start of string.” -> Not in multiline mode.
  • “\b matches word boundaries in all languages equally.” -> It depends on \w and Unicode.

Check-Your-Understanding Questions

  1. Why is ^\w+$ a safe validation pattern?
  2. What happens to $ in MULTILINE mode?
  3. How does DOTALL change .?

Check-Your-Understanding Answers

  1. It anchors the entire string so only full matches pass.
  2. It matches before newlines as well as end of string.
  3. It allows . to match newline characters.

Real-World Applications

  • Validating fields (usernames, IDs)
  • Anchoring log parsers to line boundaries

Where You Will Apply It

  • Project 2: Text Validator Library
  • Project 6: Log Parser
  • Project 10: Template Engine

References

  • Python re docs (anchor behavior, DOTALL, MULTILINE): https://docs.python.org/3/library/re.html
  • POSIX word boundary syntax: https://man.openbsd.org/re_format.7

Key Insight: Anchors do not match characters, they match positions. Use them to shrink the search space.

Summary

Anchors and boundaries define context, correctness, and performance.

Homework/Exercises

  1. Write a regex that matches a line starting with “WARN”.
  2. Write a regex that matches exactly one word (no spaces), across Unicode.

Solutions

  1. ^WARN (with MULTILINE if needed)
  2. ^\p{L}+$ (Unicode property-enabled engines)

Chapter 5: Grouping, Alternation, Captures, and Backreferences

Fundamentals

Grouping controls precedence, alternation selects between choices, and captures remember matched text. Parentheses (...) create groups; alternation | lets you say “this or that”. Capturing groups are useful because they let you reuse parts of the match later (in replacements or backreferences). But they also affect performance and clarity: groups that you do not need should be non-capturing ((?:...)) when supported.

Deep Dive into the Concept

Alternation is one of the most subtle performance traps in regex. The engine tries alternatives in order. If the first alternative is ambiguous or too broad, the engine will spend time exploring it before falling back to the second. This means that ordering matters: put the most specific alternatives first when using backtracking engines. A common optimization is prefix factoring: cat|car|cap becomes ca(?:t|r|p) which avoids re-evaluating the prefix.

Grouping also controls scope for quantifiers and anchors. (ab)* means repeat the group, while ab* means repeat only b. This is not just syntax; it changes the automaton. Capturing groups assign numbers based on their opening parenthesis, and those numbers are used in backreferences (\1, \2). Backreferences allow matching the same text that was previously captured. This breaks the regular-language model and forces backtracking engines to keep track of matched substrings. That is why RE2 and DFA-based engines do not support backreferences: they cannot be implemented with a pure automaton. The cost is expressive power; the benefit is linear-time guarantees.

Captures also power replacement. When you perform a substitution, you often reference groups by number or name (e.g., $1 or \g<name>). This is the basis of real-world text transformations. It is also a source of bugs: forgetting to escape $ or \ in replacement strings can create subtle errors. A best practice is to name your groups for clarity and to avoid numbering errors as patterns evolve.

Grouping interacts with quantifiers in subtle ways. For example, (ab)+ captures only the last iteration in many engines, not all iterations. If you need all iterations, you must collect matches using findall() or repeated matching instead of relying on a single capture group. This is a common misunderstanding that leads to data loss in extraction tools. Another common issue is catastrophic backtracking in alternations, especially when one alternative is a prefix of another (e.g., abc|ab). The engine will try abc, fail at c, then backtrack to try ab, potentially multiplying work across large inputs. Reordering or factoring alternatives fixes this.
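
A minimal Python sketch of the last-iteration behavior described above; the sample text is illustrative.

import re

m = re.match(r"(\w+\s)+", "one two three ")
print(m.group(0))   # 'one two three ' - the whole repeated region
print(m.group(1))   # 'three '         - only the final iteration is kept

# To collect every repetition, match the unit repeatedly instead of capturing the loop.
print(re.findall(r"(\w+)\s", "one two three "))   # ['one', 'two', 'three']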

Finally, group naming conventions matter in large regexes. If you build a template engine or parser, named groups become your API contract: they define the fields your code expects. Choose stable names and avoid renumbering. This makes refactoring safe and lowers the risk of silent bugs when the regex changes.

How This Fits in Projects

Captures and alternation are used in Projects 3, 5, 7, and 10, where you extract and reorganize text.

Definitions & Key Terms

  • Capturing group: A parenthesized subexpression whose match is stored.
  • Non-capturing group: (?:...), used for grouping without storage.
  • Backreference: A reference to previously captured text (\1).

Mental Model Diagram

Pattern:  (user|admin):(\d+)
Groups:   1=user|admin, 2=digits
Match:    admin:42

How It Works (Step-by-Step)

  1. The engine explores alternatives left-to-right.
  2. Each time a capturing group matches, its span is stored.
  3. Backreferences force later parts of the pattern to equal stored text.

Minimal Concrete Example

Pattern:  "^(\w+)-(\1)$"
Matches:  "abc-abc"
Rejects:  "abc-def"

Common Misconceptions

  • “Grouping always captures.” -> Not if you use non-capturing groups.
  • “Alternation order doesn’t matter.” -> It matters in backtracking engines.

Check-Your-Understanding Questions

  1. Why does ab|a behave differently than a|ab?
  2. What is a backreference and why is it expensive?
  3. When should you prefer (?:...)?

Check-Your-Understanding Answers

  1. The engine tries left-to-right; first match wins.
  2. It forces the engine to compare with previous captures, breaking pure NFA/DFA matching.
  3. When you need grouping but not the captured text.

Real-World Applications

  • Parsing key/value pairs
  • Extracting tokens with labeled fields
  • Template rendering with captures

Where You Will Apply It

  • Project 3: Search and Replace Tool
  • Project 5: URL/Email Parser
  • Project 10: Template Engine

References

  • RE2 documentation on unsupported backreferences: https://github.com/google/re2
  • POSIX re_format notes on capturing groups: https://man.openbsd.org/re_format.7

Key Insight: Grouping controls structure; capturing controls memory. Use both deliberately.

Summary

Groups and alternation create structure and reuse. They are essential for real extraction tasks.

Homework/Exercises

  1. Write a regex that matches repeated words: “hello hello”.
  2. Convert (foo|bar|baz) into a factored form.

Solutions

  1. ^(\w+)\s+\1$
  2. ba(?:r|z)|foo or (?:foo|ba(?:r|z))

Chapter 6: Lookarounds and Zero-Width Assertions

Fundamentals

Lookarounds are assertions that check context without consuming characters. Positive lookahead (?=...) requires that the following text matches; negative lookahead (?!...) requires it not to match. Lookbehind variants apply to text before the current position. These are powerful for constraints like “match a number only if followed by a percent sign” without including the percent sign in the match.

Deep Dive into the Concept

Lookarounds extend regex expressiveness without consuming input. That makes them ideal for validation and extraction tasks that require context. For example, \d+(?=%) matches the digits in 25% but not the percent sign. Negative lookahead lets you exclude patterns: ^(?!.*password).* rejects strings containing “password”. Lookbehind is even more precise for context, but it is not universally supported or has restrictions (fixed-length vs variable-length) depending on the engine. That means portability is a concern: a pattern that works in Python may fail in JavaScript or older PCRE versions.

From an engine perspective, lookarounds often require additional scanning. Many engines treat lookahead as a branch that is evaluated at the current position without advancing. In backtracking engines, that can be implemented naturally. In DFA-style engines, lookarounds are harder because they break the one-pass model. This is one reason RE2 does not support lookaround at all. In PCRE2, the alternative DFA-style matcher can handle some constructs but still has limitations around capturing and lookaround handling.

Lookarounds can also hide performance issues. A common trap is to use lookahead with .* inside, which can cause the engine to scan the remainder of the string repeatedly. You should treat lookarounds as advanced tools: use them sparingly, and prefer explicit parsing when possible.

There is also a portability dimension: some engines only allow fixed-length lookbehind, while others permit variable-length lookbehind with constraints. That means a pattern like (?<=\w+) may compile in one engine and fail in another. When portability matters, avoid lookbehind entirely or constrain it to fixed-length tokens. If you must support multiple environments, document the supported subset and provide fallback patterns that avoid lookbehind.

Lookarounds can also be replaced with capture + post-processing. For example, instead of (?<=\$)\d+, you can capture \$(\d+) and then use the captured group in your application logic. This is often more portable and easier to debug, and it avoids hidden performance costs in complex lookarounds.
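
A minimal Python sketch of both approaches; the dollar amounts are invented.

import re

text = "total: $120, tax: $12"

# Lookbehind keeps the '$' out of the match itself.
print(re.findall(r"(?<=\$)\d+", text))   # ['120', '12']

# Portable alternative: match the '$' too, but capture only the digits.
print(re.findall(r"\$(\d+)", text))      # ['120', '12']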

How This Fits in Projects

Lookarounds appear in sanitization (Project 4), URL parsing (Project 5), markdown parsing (Project 7), and template engines (Project 10).

Definitions & Key Terms

  • Lookahead: Assertion about what follows the current position.
  • Lookbehind: Assertion about what precedes the current position.
  • Zero-width: Does not consume characters.

Mental Model Diagram

text:  "$100.00"
regex: (?<=\$)\d+(?=\.)
match: 100

How It Works (Step-by-Step)

  1. Engine arrives at a position.
  2. Lookaround runs its subpattern without consuming input.
  3. If the assertion passes, the main match continues.

Minimal Concrete Example

Pattern:  "\b\w+(?=\.)"
Text:     "end." -> match "end"

Common Misconceptions

  • “Lookarounds are always zero-cost.” -> They can be expensive.
  • “All engines support lookbehind.” -> Many do not or restrict it.

Check-Your-Understanding Questions

  1. What does (?=abc) do at a position?
  2. Why does RE2 not support lookarounds?
  3. How can lookarounds cause repeated scanning?

Check-Your-Understanding Answers

  1. It asserts that abc follows without consuming it.
  2. They are not implementable with linear-time guarantees in RE2’s design.
  3. The engine may re-check the same prefix/suffix multiple times.

Real-World Applications

  • Enforcing password policies
  • Extracting substrings with context

Where You Will Apply It

  • Project 4: Input Sanitizer
  • Project 5: URL/Email Parser
  • Project 7: Markdown Parser

References

  • RE2 design notes (no lookaround for linear-time guarantee): https://github.com/google/re2
  • RE2 syntax reference: https://github.com/google/re2/wiki/Syntax
  • ECMAScript lookbehind support notes (MDN): https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Regular_expressions/Lookbehind_assertion
  • PCRE2 matching documentation: https://www.pcre.org/current/doc/html/pcre2matching.html

Key Insight: Lookarounds are context checks. They are powerful but not universally supported.

Summary

Lookarounds allow you to express context without consumption but carry performance and portability risks.

Homework/Exercises

  1. Match numbers that are followed by “ms” without capturing “ms”.
  2. Match “error” only if it is not preceded by “ignore-“.

Solutions

  1. \d+(?=ms)
  2. (?<!ignore-)error (if lookbehind is supported)

Chapter 7: Substitution, Replacement, and Match Objects

Fundamentals

Matching is only half of regex. The other half is transformation: using captured groups to rewrite text. Most engines provide a substitution operation (e.g., s/regex/repl/ or re.sub()) where you can reference groups in the replacement string. Understanding match objects (start, end, groups) lets you build tools like search-and-replace, refactoring scripts, and data extractors.

Deep Dive into the Concept

Substitution is where regex becomes a “text transformation language.” Capturing groups define what you want to keep, and the replacement string defines how to rearrange it. For example, converting YYYY-MM-DD into DD/MM/YYYY is a simple capture reorder. But real-world substitutions require more: conditional logic, case normalization, escaping special characters, and ensuring that replacements do not accidentally re-trigger patterns.

Match objects provide detailed metadata about a match: the span (start/end), groups, group names, and the original input. These details are critical when building tools that annotate or highlight matches. For example, a grep-like CLI can highlight matches by slicing the string using match.start() and match.end(). A parser can build structured objects from named groups. A refactoring tool can record match positions and apply replacements in reverse order to avoid shifting offsets.

Substitution also interacts with regex engine features like global matching and overlapping matches. Some engines treat replacements as non-overlapping: once a match is replaced, the engine continues after that match. Others allow overlapping matches through advanced APIs. This affects how you implement repeated substitutions. A safe approach is to iterate matches explicitly and build the output string manually, especially when you need custom logic.

Replacement strings introduce another layer of escaping. $1 or \1 means “insert group 1”, but a literal $ might need to be escaped. This is where bugs often happen. In production code, it is safer to use replacement functions (callbacks) rather than raw strings, because the function receives a match object and returns the intended replacement directly.
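
A minimal Python sketch of a callback replacement with re.sub; the price-bumping rule is invented for illustration.

import re

def bump_price(match: re.Match) -> str:
    # The callback receives the match object, so there are no replacement-string
    # escaping rules to get wrong.
    return f"${int(match.group('amount')) + 1}"

text = "coffee $3, bagel $2"
print(re.sub(r"\$(?P<amount>\d+)", bump_price, text))   # coffee $4, bagel $3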

Substitution pipelines also require careful attention to order of operations. If you apply multiple regex replacements sequentially, earlier replacements can affect later patterns (either by changing the text or by introducing new matches). This is not necessarily wrong, but it must be intentional. A robust design documents the pipeline order, includes tests for order-dependent cases, and provides a dry-run mode that shows diffs so users can review changes before applying them.

Finally, consider streaming replacements for large files. Many simple implementations read the entire file into memory, apply replacements, and write out the result. For multi-gigabyte logs, this is not viable. The correct approach is to stream line-by-line or chunk-by-chunk, but that introduces edge cases where matches may span chunk boundaries. If you need streaming, either design patterns that are line-local or implement a rolling buffer with overlap.
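
A minimal Python sketch of a streaming, line-local substitution; it assumes the pattern never spans lines, and the level names are illustrative.

import re
import sys

LEVEL = re.compile(r"\b(ERROR|FATAL)\b")

def rewrite(src, dst):
    # Constant memory: one line at a time. Patterns that can span lines would
    # need a rolling buffer with overlap instead.
    for line in src:
        dst.write(LEVEL.sub("ALERT", line))

if __name__ == "__main__":
    rewrite(sys.stdin, sys.stdout)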

How This Fits in Projects

This chapter powers Projects 3, 8, 10, and 12, where match objects and substitutions are core features.

Definitions & Key Terms

  • Match object: A data structure representing one match (span + groups).
  • Replacement string: Output text that can reference capture groups.
  • Global match: Find all matches, not just the first.

Mental Model Diagram

input:   2024-12-25
pattern: (\d{4})-(\d{2})-(\d{2})
repl:    $3/$2/$1
output:  25/12/2024

How It Works (Step-by-Step)

  1. Engine finds a match and records group spans.
  2. Replacement engine substitutes group references into output.
  3. Output is assembled, often by concatenating segments.

Minimal Concrete Example

Pattern:  "(\w+),(\w+)"
Replace:  "$2 $1"
Text:     "Doe,John" -> "John Doe"

Common Misconceptions

  • “Replacement is just string concatenation.” -> It depends on group references and escaping.
  • “Matches are always non-overlapping.” -> Some APIs allow overlap.

Check-Your-Understanding Questions

  1. Why is a callback replacement safer than a raw replacement string?
  2. What happens if your replacement inserts text that matches the pattern?
  3. How do you highlight matches in a CLI?

Check-Your-Understanding Answers

  1. It avoids escaping pitfalls and gives you full control via the match object.
  2. It may cause repeated replacements unless you control iteration.
  3. Use match spans to inject color codes before/after the match.

Real-World Applications

  • Data cleaning and normalization
  • Refactoring code (rename variables, reorder fields)
  • Building extraction pipelines

Where You Will Apply It

  • Project 3: Search and Replace Tool
  • Project 8: Data Extractor
  • Project 12: Regex Debugger & Visualizer

References

  • Python re documentation (match objects, replacement syntax): https://docs.python.org/3/library/re.html

Key Insight: Substitution turns regex from a checker into a transformer.

Summary

Match objects and substitutions are essential for building tools, not just validators.

Homework/Exercises

  1. Replace all dates YYYY-MM-DD with MM/DD/YYYY.
  2. Convert LAST, First into First Last with a single regex.

Solutions

  1. (\d{4})-(\d{2})-(\d{2}) -> $2/$3/$1
  2. ^\s*(\w+),\s*(\w+)\s*$ -> $2 $1

Chapter 8: Regex Engines, Automata, and Performance

Fundamentals

Regex engines are not all the same. Some are backtracking engines (Perl, PCRE, Python), while others are automata-based (Thompson NFA, DFA, RE2). Backtracking engines are flexible and support advanced features (like backreferences), but they can be slow in the worst case. Automata-based engines provide linear-time guarantees but often restrict features. Understanding this trade-off is the key to writing safe and fast regex.

Deep Dive into the Concept

The canonical automata model for regex is the NFA, which can be constructed from a regex using Thompson’s construction. An NFA can be simulated in two ways: naive backtracking (try one path, then backtrack) or parallel simulation (track all possible states at once). Russ Cox’s classic analysis shows that backtracking can explode exponentially for certain patterns, while Thompson NFA simulation runs in time proportional to input length. This is the theoretical basis for why some regex engines are “fast” and others are vulnerable to catastrophic backtracking.

PCRE2 documents this explicitly: its standard algorithm is depth-first NFA (backtracking) and its alternative algorithm is a DFA-like breadth-first approach that scans input once. The backtracking approach is fast in the common case and supports captures, but it can be exponential in the worst case. The DFA-like approach avoids backtracking but sacrifices features like backreferences and capturing. This trade-off appears across regex engines.

RE2 takes a strong stance: it guarantees linear-time matching by refusing to implement features that require backtracking, such as backreferences and lookarounds. This makes it suitable for untrusted input and high-scale environments. The cost is reduced expressiveness. If your use case requires backreferences or variable-length lookbehind, you will not be able to use RE2.

From a security perspective, backtracking engines can be exploited with ReDoS (Regular expression Denial of Service). OWASP documents how certain patterns (nested quantifiers with overlap) can take exponential time when fed crafted input. This means regex is not just about correctness; it is about security. A safe regex design includes: anchoring patterns, avoiding nested ambiguous quantifiers, and using engine-specific safety features like timeouts.

The practical takeaway is that engine internals must be part of your design process. Choose an engine appropriate for your risk profile. If regex is user-supplied or used on large untrusted inputs, prefer linear-time engines. If you need advanced features and have controlled input, backtracking engines may be acceptable but must be hardened.

Modern engines also use hybrid techniques. Some start with a fast pre-scan using literal prefixes (Boyer-Moore style), then fall back to full regex matching only when a candidate position is found. Others compile regex into a small virtual machine, which can be JIT-compiled for speed. These optimizations matter in high-throughput systems: when you run millions of regexes per second, even a small constant factor becomes significant. This is why real-world engines often include a literal "prefilter" step that finds likely matches before running the full engine.
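
A minimal Python sketch of the prefilter idea, assuming the pattern has a required literal prefix; it imitates what engines do internally rather than showing any engine's actual API.

import re

PREFIX = "ERROR: "
FULL = re.compile(r"ERROR: (?P<code>\d{3}) (?P<msg>.+)")

def find_errors(text: str):
    hits = []
    pos = 0
    while True:
        candidate = text.find(PREFIX, pos)   # cheap literal scan
        if candidate == -1:
            return hits
        m = FULL.match(text, candidate)      # full regex only at candidate positions
        if m:
            hits.append((m.group("code"), m.group("msg")))
            pos = m.end()
        else:
            pos = candidate + 1

sample = "ok\nERROR: 503 upstream down\nok\nERROR: 500 boom"
print(find_errors(sample))   # [('503', 'upstream down'), ('500', 'boom')]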

From a tooling perspective, the internal representation (AST -> NFA -> DFA) is also where optimization opportunities arise. You can detect that a regex is a literal string and bypass the engine entirely. You can factor alternations into tries, or transform patterns to reduce backtracking. These optimizations are not just academic; they are required for production-grade engines and are the heart of the regex optimizer project in this guide.

How This Fits in Projects

This chapter is the foundation for Projects 9, 11, 12, 13, and Project 14 (Capstone).

Definitions & Key Terms

  • Backtracking engine: Tries alternatives sequentially, may be exponential.
  • Thompson NFA: An efficient NFA simulation approach.
  • DFA-style matcher: Tracks multiple states in parallel for linear-time matching.
  • ReDoS: Regex Denial of Service via catastrophic backtracking.

Mental Model Diagram

Regex -> NFA ->
  backtracking: explore one path, rewind, try next
  Thompson NFA: keep a set of states, advance all at once

How It Works (Step-by-Step)

  1. Regex is compiled into an NFA.
  2. Engine either backtracks or simulates multiple states.
  3. Backtracking explores exponential paths; NFA simulation stays linear.

Minimal Concrete Example

Pattern:  "^(a+)+$"
Input:    "aaaaX"
Result:   catastrophic backtracking in many engines
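
A minimal Python sketch that makes the blow-up measurable; keep n small, since each extra character roughly doubles the work, and the largest value may already take a second or more.

import re
import time

pattern = re.compile(r"^(a+)+$")

for n in (16, 18, 20, 22):
    text = "a" * n + "X"                 # the trailing 'X' forces the match to fail
    start = time.perf_counter()
    pattern.match(text)
    print(n, f"{time.perf_counter() - start:.3f}s")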

Common Misconceptions

  • “All regex engines are equivalent.” -> Engine choice changes performance and safety.
  • “Backtracking only affects rare cases.” -> Attackers exploit those cases.

Check-Your-Understanding Questions

  1. Why does RE2 not support backreferences?
  2. What is the difference between backtracking NFA and Thompson NFA simulation?
  3. What makes (a+)+ dangerous?

Check-Your-Understanding Answers

  1. Backreferences require backtracking and break linear-time guarantees.
  2. Backtracking tries one path at a time; Thompson NFA tracks all possible states.
  3. Nested quantifiers create many overlapping paths, leading to exponential backtracking.

Real-World Applications

  • Safe regex processing at scale (logs, security filters)
  • Engine selection in production systems

Where You Will Apply It

  • Project 9: Tokenizer/Lexer
  • Project 11: Regex Engine
  • Project 13: Regex Optimizer
  • Project 14: Full Regex Suite (Capstone)

References

  • Russ Cox: “Regular Expression Matching Can Be Simple And Fast” https://swtch.com/~rsc/regexp/regexp1.html
  • PCRE2 matching documentation (standard vs DFA algorithm): https://www.pcre.org/current/doc/html/pcre2matching.html
  • RE2 README (linear-time guarantee, unsupported features): https://github.com/google/re2
  • OWASP ReDoS documentation: https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS

Key Insight: Regex is only as safe as the engine that runs it.

Summary

Regex performance is about algorithm choice. Backtracking is powerful but risky; automata are safe but limited.

Homework/Exercises

  1. Identify a regex that can cause catastrophic backtracking.
  2. Rewrite it into a safer version using atomic/possessive constructs if supported.

Solutions

  1. (a+)+$ is a classic ReDoS pattern.
  2. (?>a+)+$ or a++$ in engines that support atomic/possessive constructs.

Glossary

High-Signal Terms

  • Anchor: A zero-width assertion that matches a position, not a character.
  • Backtracking: Engine strategy that tries alternative paths by rewinding input.
  • Backreference: A feature that matches the same text captured earlier (breaks regular-language guarantees).
  • BRE/ERE (POSIX): Basic/Extended Regular Expression flavors with different escaping rules for operators like + and |.
  • Catastrophic Backtracking: Exponential-time matching caused by nested/overlapping quantifiers.
  • Capture Group: Parenthesized part of a regex whose match is stored.
  • DFA (Deterministic Finite Automaton): Automaton with one transition per symbol per state.
  • Leftmost-Longest: POSIX rule that prefers the earliest, then longest, match.
  • Lookaround: Zero-width assertion that checks context without consuming input.
  • NFA (Nondeterministic Finite Automaton): Automaton that can have multiple transitions per symbol.
  • Regex Flavor: The syntax/semantics of a particular regex implementation.
  • ReDoS: Denial of service caused by catastrophic backtracking.
  • Thompson NFA: Classic linear-time NFA construction and simulation approach.
  • Unicode Property Class: Character class based on Unicode properties (e.g., \p{L} for letters).
  • Quantifier: Operator that repeats an atom (*, +, {m,n}).

Why Regular Expressions Matter

The Modern Problem It Solves

Modern software is flooded with text: logs, configs, markup, API responses, and user input. Regex is the universal tool for quickly identifying and transforming patterns in this text. It powers search, validation, tokenization, and data extraction across every layer of software.

Real-world impact with current statistics:

  • Unicode scale (2025): Unicode 17.0 reports 159,801 graphic+format characters and 297,334 total assigned code points. Regex engines must handle far more than ASCII. https://www.unicode.org/versions/stats/charcountv17_0.html
  • Unicode growth: The Unicode Consortium’s 17.0 announcement notes 4,803 new characters added in 2025, reinforcing why Unicode-aware classes (\p{L}, \p{Nd}) matter. https://blog.unicode.org/2025/09/unicode-170-release-announcement.html
  • Security impact: OWASP defines ReDoS as a denial-of-service attack caused by regex patterns that can take extremely long (often exponential) time on crafted inputs. https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS
  • Markdown ambiguity: CommonMark exists because Markdown implementations diverged without a strict spec, and it lists broad adoption across major platforms (GitHub, GitLab, Reddit, Stack Overflow). https://commonmark.org/
OLD APPROACH                        NEW APPROACH
┌───────────────────────┐           ┌──────────────────────────┐
│ Manual parsing        │           │ Regex-driven pipelines   │
│ Dozens of if/else     │           │ Compact patterns + tests │
└───────────────────────┘           └──────────────────────────┘

Context & Evolution (Brief)

Regex began as a theoretical model (Kleene) and became practical through tools like grep and sed. Today, regex engines range from backtracking implementations (Perl/PCRE/Python) to safe automata-based engines (RE2), reflecting the trade-off between expressiveness and performance.


Concept Summary Table

Concept Cluster | What You Need to Internalize
Pattern Syntax & Escaping | Regex is a language; precedence and escaping define meaning.
Character Classes & Unicode | Classes encode policy; Unicode makes ASCII-only patterns unsafe.
Quantifiers & Repetition | Greedy/lazy/possessive quantifiers control search and performance.
Anchors & Boundaries | Anchors define positions and shrink the search space.
Grouping & Captures | Grouping creates structure; captures store text; alternation order matters.
Lookarounds | Context checks without consuming input; not universally supported.
Substitution & Match Objects | Transformations require capture-aware replacements and spans.
Engine Internals & Performance | Backtracking vs automata determines correctness, speed, and security.

Project-to-Concept Map

Project | What It Builds | Primer Chapters It Uses
Project 1: Pattern Matcher CLI | Basic grep-like matcher | 1, 3, 4, 7
Project 2: Text Validator Library | Validation patterns | 1, 2, 3, 4
Project 3: Search and Replace Tool | Capture-aware substitutions | 1, 3, 5, 7
Project 4: Input Sanitizer | Security-focused filtering | 2, 3, 6, 8
Project 5: URL/Email Parser | Structured extraction | 2, 3, 5, 6
Project 6: Log Parser | Anchored parsing | 2, 4, 7
Project 7: Markdown Parser | Nested parsing | 5, 6, 7
Project 8: Data Extractor | Bulk extraction and transforms | 2, 7
Project 9: Tokenizer/Lexer | Formal tokenization | 2, 8
Project 10: Template Engine | Substitution + context | 5, 7
Project 11: Regex Engine | NFA-based matcher | 1, 8
Project 12: Regex Debugger | Visualization and tracing | 7, 8
Project 13: Regex Optimizer | Automata transforms | 8
Project 14: Full Regex Suite (Capstone) | Full engine + tools | 1-8

Deep Dive Reading by Concept

Fundamentals & Syntax

Concept | Book & Chapter | Why This Matters
Regex syntax basics | Mastering Regular Expressions (Friedl) Ch. 1-2 | Core syntax and mental models
Pattern precedence | Mastering Regular Expressions Ch. 2 | Prevents incorrect matches
POSIX regex rules | The Linux Command Line (Shotts) Ch. 19 | CLI portability

Engine Internals & Automata

Concept | Book & Chapter | Why This Matters
Regex to NFA | Engineering a Compiler (Cooper & Torczon) Ch. 2.4 | Foundation for engine building
DFA vs NFA | Introduction to Automata Theory (Hopcroft et al.) Ch. 2 | Performance trade-offs
Practical regex mechanics | Mastering Regular Expressions Ch. 4 | Understand backtracking

Performance & Security

Concept | Book & Chapter | Why This Matters
Catastrophic backtracking | Mastering Regular Expressions Ch. 6 | Security and reliability
Input validation patterns | Regular Expressions Cookbook Ch. 4 | Real-world validators
Debugging patterns | Building a Debugger (Brand) Ch. 1-2 | Build the debugger project

Standards & Specs (High-Value Primary Sources)

Concept | Source | Why This Matters
URI/URL grammar | RFC 3986 | Canonical URL structure for Project 5
Email addr-spec | RFC 5322 + RFC 5321 | Standards for parsing and validation in Project 5
Markdown parsing rules | CommonMark Spec + test suite | Correctness baseline for Project 7
POSIX regex grammar | OpenBSD re_format.7 | Precise BRE/ERE rules and leftmost-longest behavior
Safe regex engine constraints | RE2 Syntax/README | Why some features are disallowed for linear time
Thompson NFA explanation | Russ Cox regex article | Foundation for Projects 11-13

Quick Start

Your First 48 Hours

Day 1 (4 hours):

  1. Read Chapter 1 and Chapter 3 in the primer.
  2. Skim Chapter 8 (engines) for the high-level idea.
  3. Start Project 1 and build a minimal matcher with search().
  4. Test with 5 simple patterns and 2 edge cases.

Day 2 (4 hours):

  1. Add flags -i, -n, and -v to Project 1.
  2. Add a simple highlight feature using match spans.
  3. Read Project 1’s Core Question and Hints.
  4. Run against a real log file and record results.

End of Weekend: You can explain how a regex engine walks through input and why anchors and quantifiers matter.


Path 1: The Practical Engineer

Best for: Developers who want regex for daily work

  1. Projects 1 -> 2 -> 3
  2. Projects 5 -> 6
  3. Projects 8 -> 10

Path 2: The Security-Minded Builder

Best for: Anyone working with untrusted input

  1. Projects 1 -> 2 -> 4
  2. Projects 6 -> 8
  3. Project 13 (optimizer) for mitigation

Path 3: The Language Tooling Path

Best for: Compiler or tooling builders

  1. Projects 1 -> 9 -> 11
  2. Projects 12 -> 13
  3. Project 14 (Capstone)

Path 4: The Completionist

Best for: Full mastery

  • Complete Projects 1-14 in order
  • Finish Project 14 (Capstone) as an integrated system

Success Metrics

By the end of this guide, you should be able to:

  • Design regex with explicit mental models (not trial-and-error)
  • Predict performance issues and avoid catastrophic backtracking
  • Explain engine trade-offs (backtracking vs automata)
  • Build and debug a small regex engine
  • Use regex safely for untrusted input

Optional Appendices

Appendix A: Regex Flavor Cheatsheet

Flavor | Notes
POSIX BRE/ERE | Standardized; BRE lacks +, ?, and alternation as operators and needs \(...\) and \{m,n\} for grouping and intervals
PCRE/Perl | Feature-rich; backtracking engine
Python re | PCRE-like syntax; backtracking engine
JavaScript | ECMAScript regex; lookbehind is a newer addition (ES2018) and older engines lack it
RE2 | Linear-time; no backreferences or lookaround

Appendix B: Performance Checklist

  • Anchor when possible (^ and $)
  • Avoid nested quantifiers with overlap
  • Prefer specific classes over .*
  • Use possessive/atomic constructs when supported
  • Benchmark with worst-case inputs

Appendix C: Security Checklist (ReDoS)

  • Never run untrusted regex in a backtracking engine
  • Limit input length for regex validation
  • Use timeouts or safe engines for large inputs
  • Test patterns against evil inputs like (a+)+$

Appendix D: Standards & Specs You Should Skim

  • POSIX regex grammar and flavor rules (re_format.7): https://man.openbsd.org/re_format.7
  • RE2 syntax restrictions (linear-time, no backreferences): https://github.com/google/re2/wiki/Syntax
  • URI/URL grammar (RFC 3986): https://www.rfc-editor.org/rfc/rfc3986
  • Email address grammar (RFC 5322, addr-spec): https://www.rfc-editor.org/rfc/rfc5322
  • SMTP envelope constraints (RFC 5321): https://www.rfc-editor.org/rfc/rfc5321
  • CommonMark spec + test suite (Markdown parsing): https://spec.commonmark.org/
  • Apache Common Log Format (real-world log parsing): https://httpd.apache.org/docs/2.4/logs.html

Project Overview Table

Project | Difficulty | Time | Depth of Understanding | Fun Factor | Business Value
1. Pattern Matcher CLI | Beginner | Weekend | ⭐⭐ | ⭐⭐ | Resume Gold
2. Text Validator | Beginner | Weekend | ⭐⭐⭐ | ⭐⭐ | Micro-SaaS
3. Search & Replace | Intermediate | 1 week | ⭐⭐⭐⭐ | ⭐⭐⭐ | Micro-SaaS
4. Input Sanitizer | Intermediate | 1 week | ⭐⭐⭐ | ⭐⭐⭐ | Service
5. URL/Email Parser | Intermediate | 1 week | ⭐⭐⭐⭐ | ⭐⭐⭐ | Micro-SaaS
6. Log Parser | Intermediate | 1-2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐ | Service
7. Markdown Parser | Advanced | 2 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Open Core
8. Data Extractor | Intermediate | 1-2 weeks | ⭐⭐⭐ | ⭐⭐⭐⭐ | Service
9. Tokenizer/Lexer | Advanced | 2-3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Open Core
10. Template Engine | Intermediate | 1-2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Service
11. Regex Engine | Expert | 1 month | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Disruptor
12. Regex Debugger | Advanced | 2-3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Service
13. Regex Optimizer | Expert | 1 month | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Open Core
14. Full Regex Suite (Capstone) | Master | 3-4 months | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Industry Disruptor

Project List

Project 1: Pattern Matcher CLI

  • File: LEARN_REGEX_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: JavaScript, Go, Rust
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 1: Beginner
  • Knowledge Area: Basic Pattern Matching
  • Software or Tool: grep-like CLI
  • Main Book: “Mastering Regular Expressions” by Jeffrey Friedl

What you’ll build: A command-line tool like grep that searches files for lines matching a regex pattern, with options for case-insensitive search, line numbers, and inverted matching.

Why it teaches regex: This project introduces you to basic regex operations in a practical context. You’ll learn how patterns are compiled, matched, and how different flags affect matching behavior.

Core challenges you’ll face:

  • Pattern compilation → maps to understanding regex engines
  • Line-by-line matching → maps to anchors and multiline mode
  • Case insensitivity → maps to regex flags
  • Match highlighting → maps to capturing match positions

Key Concepts:

  • Basic Pattern Matching: “Mastering Regular Expressions” Chapter 1 - Friedl
  • Regex in Python: Python re module documentation
  • grep Internals: GNU grep source code comments

Difficulty: Beginner. Time estimate: Weekend. Prerequisites: Basic programming, command line familiarity.

Real World Outcome

$ pgrep "error" server.log
server.log:42: Connection error: timeout
server.log:156: Database error: connection refused
server.log:203: Authentication error: invalid token

$ pgrep -i "ERROR" server.log    # Case insensitive
server.log:42: Connection error: timeout
server.log:89: ERROR: Disk full
server.log:156: Database error: connection refused

$ pgrep -n "\d{3}-\d{4}" contacts.txt    # Show line numbers
15: John: 555-1234
23: Jane: 555-5678
31: Bob: 555-9012

$ pgrep -v "^#" config.ini    # Invert match (show non-comments)
host=localhost
port=8080
debug=true

$ pgrep -c "TODO" *.py    # Count matches
main.py: 3
utils.py: 7
tests.py: 1

$ pgrep -o "https?://\S+" README.md    # Only show matched part
https://github.com/user/repo
http://example.com/docs
https://api.example.com/v1

Implementation Hints:

Core matching logic:

1. Compile the regex pattern (with flags)
2. For each line in the file:
   a. Apply the regex
   b. If match found (or not found for -v):
      - Store/display the result
      - Track line numbers if needed
3. Handle output formatting (highlighting, counts, etc.)

Key operations to understand:

search()   - Find pattern anywhere in string
match()    - Find pattern at start of string
findall()  - Find all occurrences
finditer() - Iterator over match objects (with positions)

Flags to implement:

-i    re.IGNORECASE    Case-insensitive matching
-v    (invert logic)   Show non-matching lines
-n    (add numbers)    Prefix with line numbers
-c    (count mode)     Only show count of matches
-o    (only matching)  Show only the matched part
-l    (files only)     Only show filenames with matches
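
A minimal sketch of the core loop, assuming Python and the flags above (argument parsing is omitted; names like ignore_case and invert are illustrative, not prescribed):

import re
import sys

def grep_file(path, pattern, ignore_case=False, invert=False, numbers=False):
    flags = re.IGNORECASE if ignore_case else 0
    try:
        regex = re.compile(pattern, flags)
    except re.error as e:
        sys.exit(f"invalid pattern at position {e.pos}: {e.msg}")
    with open(path, errors='replace') as fh:
        for lineno, line in enumerate(fh, start=1):
            matched = regex.search(line) is not None
            if matched != invert:                     # -v flips the test after matching
                prefix = f"{lineno}: " if numbers else ""
                print(prefix + line.rstrip('\n'))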

Questions to consider:

  • What happens with invalid regex syntax?
  • How do you handle binary files?
  • How do you match across multiple lines?

Learning milestones:

  1. Basic patterns work → You understand literal matching
  2. Metacharacters work → You understand special characters
  3. Flags affect matching → You understand regex modifiers
  4. Match positions available → You understand match objects

The Core Question You’re Answering

“How do regex patterns actually find matches in text, and what happens step-by-step during the matching process?”

This project forces you to understand that regex isn’t magic—it’s a systematic character-by-character comparison process. You’ll discover how the engine walks through your pattern and your text simultaneously, how backtracking works when matches fail partway through, and why some patterns are fast while others can be catastrophically slow.

Concepts You Must Understand First

Before writing a single line of code, stop and research these concepts:

  1. Finite State Machines - Can you draw a simple FSM for the pattern /ab*c/? What states exist? What transitions happen on each character? Book reference: “Mastering Regular Expressions” Chapter 4 - The Mechanics of Expression Processing.

  2. Greedy vs Lazy Matching - If the pattern is /a.*b/ and the text is “aXXbYYb”, what gets matched? Why? What happens if you change it to /a.*?b/?

  3. Pattern Compilation - Why does Python have re.compile()? What work happens during compilation vs during matching? What’s the performance difference?

  4. The Difference Between match(), search(), and findall() - Given the pattern /\d+/ and text “abc123def456”, what does each function return? Why would you use one over another? (See the sketch after this list.)

  5. Regex Flags and Their Effects - What exactly does re.IGNORECASE change internally? What about re.MULTILINE? How do ^ and $ behave differently with and without MULTILINE?
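
A quick way to explore item 4 (and the greedy/lazy question in item 2) in the REPL, sketched with Python's re module:

import re

pattern = re.compile(r'\d+')
text = 'abc123def456'

print(pattern.match(text))       # anchored at position 0: does it match here?
print(pattern.search(text))      # first match anywhere in the string
print(pattern.findall(text))     # every matched substring
for m in pattern.finditer(text): # match objects carry positions too
    print(m.group(), m.span())

print(re.search(r'a.*b', 'aXXbYYb').group())   # greedy; compare with a.*?b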

Questions to Guide Your Design

Work through these questions before coding—they’ll lead you to the right implementation:

  1. How do you read a pattern from command line args and compile it safely? What exception gets raised for invalid patterns? How do you show the user where the error is?

  2. What data structure should you use to store matches? Do you need just the matched text, or also positions? What about the groups?

  3. How do you implement the -v (invert match) flag efficiently? Do you still need to run the regex, or can you skip it somehow?

  4. What happens when the file is too large to fit in memory? Should you read line-by-line or load everything at once? What are the tradeoffs?

  5. How do you highlight the matched portion within a line? What information does a match object give you? How do you splice the highlight codes into the output?

  6. Why does grep use search() semantics rather than match()? What behavior difference would users notice if you used match() instead?

Thinking Exercise

Before coding, trace through these examples by hand on paper:

Exercise 1: For the pattern /a.+b/ and text “aXYbZb”, walk through each step of the matching process:

  • Where does the engine start?
  • What happens at each character?
  • When does backtracking occur?
  • What is the final match?

Exercise 2: For the pattern /^error:/i with flag re.IGNORECASE and these lines:

Error: disk full
WARNING: low memory
error: connection refused
  error: indented

Which lines match and why? What changes if you remove the ^ anchor?

Exercise 3: Draw a flowchart showing the logic for your CLI tool’s main loop. Include:

  • Reading lines from file(s)
  • Applying the pattern
  • Handling all flags (-i, -v, -n, -c, -o)
  • Producing output

The Interview Questions They’ll Ask

After completing this project, you should be able to answer:

  1. “What’s the difference between match() and search() in Python’s regex?” - The answer involves anchoring behavior and where the engine starts looking.

  2. “Why might a regex pattern cause exponential time complexity?” - This leads to discussions of catastrophic backtracking, ReDoS vulnerabilities, and why patterns like (a+)+$ are dangerous.

  3. “How would you implement basic grep functionality?” - Walk through your implementation choices: line-by-line processing, flag handling, output formatting.

  4. “What’s a character class and when would you use one?” - Explain [abc] vs (a|b|c), efficiency implications, and negation with [^abc].

  5. “How do regex flags work under the hood?” - Discuss how re.IGNORECASE might be implemented (case-fold comparison table), and what re.MULTILINE changes about anchor behavior.

  6. “What happens when you compile a regex pattern?” - Explain the transformation from pattern string to internal representation (NFA/DFA), and why compilation is a separate step.

  7. “How would you debug a regex that’s not matching what you expect?” - Tools like regex101.com, breaking patterns into smaller pieces, using verbose mode with comments.

Hints in Layers

Only look at these if you’re stuck. Try to solve problems yourself first.

Hint 1: Starting the CLI Structure
Use Python’s argparse module for parsing command-line arguments. Start with just one flag (-i for case-insensitive) and one file. Get that working before adding more flags. The basic structure is: parse args → compile pattern → loop through file → print matches.

Hint 2: Compiling Patterns Safely
Wrap your re.compile() call in a try/except block. When you catch re.error, the exception object has a msg attribute with the error description and a pos attribute showing where in the pattern the error was detected. Print both to help users fix their patterns.

Hint 3: Implementing Match Highlighting
The match object from search() or finditer() has .start() and .end() methods giving you the position of the match. Use ANSI escape codes for coloring: \033[31m for red, \033[0m to reset. Slice the line: line[:start] + RED + line[start:end] + RESET + line[end:].
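
Hint 3 in code (a sketch; assumes an ANSI-capable terminal and highlights only the first match per line):

import re

RED, RESET = '\033[31m', '\033[0m'

def highlight_first(line, regex):
    m = regex.search(line)
    if not m:
        return line
    s, e = m.start(), m.end()
    return line[:s] + RED + line[s:e] + RESET + line[e:]

print(highlight_first('Connection error: timeout', re.compile('error')))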

Hint 4: Debugging Unexpected Match Behavior
If your pattern isn’t matching what you expect, print out the pattern after compilation with print(repr(pattern.pattern)) to see exactly what regex is being used. Also print pattern.flags to verify your flags are being applied. Use re.VERBOSE mode to add comments to complex patterns.

Books That Will Help

Topic | Book | Chapter/Section
How regex engines work internally | “Mastering Regular Expressions” - Friedl | Chapter 4: The Mechanics of Expression Processing
Pattern syntax fundamentals | “Mastering Regular Expressions” - Friedl | Chapter 1: Introduction to Regular Expressions
Python’s re module specifics | “Mastering Regular Expressions” - Friedl | Chapter 9: Python
grep design and history | “The Linux Command Line” - Shotts | Chapter 19: Regular Expressions
Command-line tool design | “The Unix Programming Environment” - Kernighan & Pike | Chapter 4: Filters
Performance considerations | “High Performance Python” - Gorelick & Ozsvald | Chapter 2: Profiling (for understanding why regex can be slow)

Common Pitfalls & Debugging

Problem 1: “Everything matches, even when it shouldn’t”

  • Why: You used search() on every line without anchoring or forgot the -v inversion logic.
  • Fix: Use anchors when intended (^...$) and invert only after a match is computed.
  • Quick test: pgrep "^ERROR" file.log should only show lines that start with ERROR.

Problem 2: “Line numbers are off by one”

  • Why: Line counters start at 0 or the file is read in chunks without tracking offsets.
  • Fix: Start counting at 1 and increment on every newline read.
  • Quick test: Compare output with nl -ba file.log.

Problem 3: “Highlighting breaks output or colors the wrong span”

  • Why: Using the wrong span (match.start()/match.end() from a different match) or forgetting to reset ANSI codes.
  • Fix: Use the exact match object per line and wrap with \033[31m and \033[0m.
  • Quick test: Ensure colors reset after each line.

Problem 4: “Binary files crash or show garbage”

  • Why: Reading files as text without error handling.
  • Fix: Use binary detection or errors='replace' and document behavior.
  • Quick test: Run against a .png file and ensure a clean warning.

Definition of Done

  • Supports -i, -v, -n, -c, -o, -l flags with correct behavior
  • Properly handles invalid regex with clear error position
  • Highlights matches without corrupting output
  • Processes large files line-by-line without high memory use
  • Passes a test corpus covering anchors, classes, and quantifiers

Project 2: Text Validator Library

  • File: LEARN_REGEX_DEEP_DIVE.md
  • Main Programming Language: JavaScript
  • Alternative Programming Languages: Python, TypeScript, Go
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 1: Beginner
  • Knowledge Area: Input Validation
  • Software or Tool: Validation Library
  • Main Book: “Regular Expressions Cookbook” by Goyvaerts & Levithan

What you’ll build: A validation library with pre-built patterns for common data types (email, phone, URL, credit card, etc.) and the ability to create custom validators with helpful error messages.

Why it teaches regex: Validation is where most developers first encounter regex. You’ll learn character classes, quantifiers, and anchors while building something immediately useful.

Core challenges you’ll face:

  • Email validation → maps to complex character classes
  • Phone number formats → maps to alternation and optional groups
  • Credit card patterns → maps to quantifiers and grouping
  • Error messages → maps to understanding why patterns fail

Key Concepts:

  • Email Regex: RFC 5322 and practical simplifications
  • Character Classes: “Regular Expressions Cookbook” Chapter 2 - Goyvaerts
  • Anchoring: “Mastering Regular Expressions” Chapter 3 - Friedl

Difficulty: Beginner. Time estimate: Weekend. Prerequisites: Project 1, understanding of character classes.

Real World Outcome

// Your validation library
const v = require('validex');

// Built-in validators
v.isEmail('user@example.com');       // true
v.isEmail('invalid-email');           // false

v.isPhone('(555) 123-4567');          // true
v.isPhone('+1-555-123-4567');         // true
v.isPhone('123');                      // false

v.isURL('https://example.com/path');  // true
v.isCreditCard('4111111111111111');   // true (Visa)
v.isIPv4('192.168.1.1');              // true
v.isHexColor('#FF5733');              // true
v.isSlug('my-blog-post');             // true
v.isUUID('550e8400-e29b-41d4-a716-446655440000'); // true

// Validation with details
const result = v.validate('not-an-email', 'email');
// {
//   valid: false,
//   value: 'not-an-email',
//   errors: ['Missing @ symbol', 'No domain specified']
// }

// Custom validators
const usernameValidator = v.create({
  pattern: /^[a-zA-Z][a-zA-Z0-9_]{2,15}$/,
  messages: {
    format: 'Username must start with a letter',
    length: 'Username must be 3-16 characters'
  }
});

usernameValidator.test('john_doe');   // true
usernameValidator.test('2cool');       // false

// Chained validation
v.string('test@example.com')
  .isEmail()
  .maxLength(50)
  .notContains('spam')
  .validate();  // { valid: true, ... }

Implementation Hints:

Common validation patterns:

const patterns = {
  // Email (simplified but practical)
  email: /^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/,

  // Phone (US formats)
  phone: /^(\+1[-.\s]?)?(\(?\d{3}\)?[-.\s]?)?\d{3}[-.\s]?\d{4}$/,

  // URL
  url: /^(https?:\/\/)?([\da-z.-]+)\.([a-z.]{2,6})([\/\w .-]*)\/?$/,  // avoid ([...]*)* here -- nested quantifiers invite catastrophic backtracking

  // Credit card (basic Luhn-valid length)
  creditCard: /^\d{13,19}$/,

  // IPv4
  ipv4: /^((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$/,

  // Hex color
  hexColor: /^#?([a-fA-F0-9]{6}|[a-fA-F0-9]{3})$/,

  // UUID
  uuid: /^[0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/i,

  // Slug (URL-friendly)
  slug: /^[a-z0-9]+(?:-[a-z0-9]+)*$/
};

Understanding the email pattern:

/^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/
  │               │ │             │ │          │
  │               │ │             │ │          └── 2+ letter TLD
  │               │ │             │ └── literal dot before TLD
  │               │ │             └── domain (letters, numbers, dots, hyphens)
  │               │ └── literal @ symbol
  │               └── 1+ local characters
  └── start anchor

Questions to consider:

  • Should you use a simple regex or the full RFC 5322 spec for email?
  • How do you validate international phone numbers?
  • Why is credit card validation more than just a pattern (Luhn algorithm)?

Learning milestones:

  1. Character classes work → You understand [a-z] and \d
  2. Quantifiers work → You understand +, *, {n,m}
  3. Anchors work → You understand ^, $
  4. Groups work → You understand (…) for structure

The Core Question You’re Answering

“What makes a pattern ‘correct’ for validation, and how do you balance strictness with usability?”

This project confronts you with a fundamental tension: the official specification (like RFC 5322 for email) is often impractical, while overly simple patterns reject valid input. You’ll learn to make intentional tradeoffs, understand why “validating email with regex” is famously contentious, and discover that validation is as much about UX as it is about correctness.

Concepts You Must Understand First

Before writing a single line of code, stop and research these concepts:

  1. Character Classes and Ranges - What’s the difference between [a-z], [A-Za-z], and \w? What characters does \w actually include? What about Unicode letters? Book reference: “Regular Expressions Cookbook” Chapter 2 - Basic Regular Expression Skills.

  2. Anchoring Patterns - Why must validation patterns use ^ and $? What happens if you validate with /\d+/ instead of /^\d+$/? Why would “abc123xyz” pass the first but not the second?

  3. RFC Specifications - Read the actual RFC 5322 section on email address syntax. What characters are technically allowed in the local part? Why is "weird email"@example.com valid according to the spec?

  4. Quantifier Semantics - What’s the difference between {2,}, {2,4}, +, and *? When would you use ? to make something optional vs. {0,1}?

  5. The Luhn Algorithm - Why can’t you validate credit cards with regex alone? What additional check is required? How do the different card brands (Visa, Mastercard, Amex) differ in their patterns? (A sketch of the check follows.)
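
The Luhn check from item 5, sketched in Python (the project itself targets JavaScript; the logic ports directly):

def luhn_valid(number: str) -> bool:
    digits = [int(d) for d in number if d.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:        # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_valid('4111111111111111'))   # True: passes both the length pattern and Luhn
print(luhn_valid('4111111111111112'))   # False: right length, wrong checksum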

Questions to Guide Your Design

Work through these questions before coding—they’ll lead you to the right implementation:

  1. How do you structure a validator so it can report WHY something failed? A boolean isn’t enough—users need to know what to fix. How do you decompose a complex pattern into checkable parts?

  2. Should your email validator accept user@localhost? What about user@192.168.1.1? These are technically valid—how do you decide what to allow?

  3. How do you handle international phone numbers? The pattern for a US phone number is very different from UK, India, or Japan. Do you need multiple patterns or one complex one?

  4. What’s your strategy for testing edge cases? For email, how do you test: plus addressing (user+tag@example.com), subdomains (user@mail.example.co.uk), IP addresses (user@[192.168.1.1])?

  5. How do you provide helpful error messages? If the user enters user@, how do you tell them specifically that the domain is missing rather than just “invalid email”?

  6. Should validators be composable? Can you combine isEmail and maxLength(50) into a single validation pipeline? What’s the interface for that?

Thinking Exercise

Before coding, work through these validation scenarios by hand:

Exercise 1: Break down the email validation problem into sequential checks. For each check, write what you’re validating and what error message you’d show:

  1. Is there exactly one @ symbol?
  2. Is there content before the @?
  3. Is there content after the @?
  4. Does the domain have at least one dot?
  5. Is the TLD at least 2 characters?

Now consider: what valid emails would this reject? (Hint: think about IP address literals and quoted local parts)

Exercise 2: For phone number validation, list 10 different valid formats for the same US number:

  • (555) 123-4567
  • 555-123-4567
  • 555.123.4567
  • 5551234567
  • +1 555 123 4567
  • … (continue)

Now write a single pattern that matches all of them. What’s your strategy?

Exercise 3: Create a truth table for credit card validation:

Number | Matches Pattern? | Passes Luhn? | Valid?
4111111111111111 | ? | ? | ?
4111111111111112 | ? | ? | ?
1234567890123456 | ? | ? | ?

The Interview Questions They’ll Ask

After completing this project, you should be able to answer:

  1. “How would you validate an email address?” - Discuss the tradeoffs between RFC compliance and practicality. Mention why sending a confirmation email is the only true validation.

  2. “What’s the difference between [0-9] and \d?” - In most flavors they’re equivalent, but in some (like Java with UNICODE_CHARACTER_CLASS), \d matches any Unicode digit. Know your engine.

  3. “How do you validate a URL?” - Discuss the complexity (scheme, auth, host, port, path, query, fragment), and why a purpose-built URL parser is often better than regex.

  4. “What’s a character class negation and when would you use it?” - Explain [^abc] and use cases like “anything except quotes” for parsing quoted strings.

  5. “How do you make a pattern case-insensitive?” - Both the flag approach (/pattern/i) and the character class approach ([Aa][Bb][Cc]), with tradeoffs.

  6. “Why is password validation often done wrong?” - Discuss complexity requirements vs. entropy, why length matters more than character variety, and NIST guidelines.

  7. “How would you validate an IPv4 address?” - The tricky part: each octet must be 0-255. Walk through the pattern that handles this: (25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?).

  8. “What’s the difference between validation and sanitization?” - Validation checks if input is correct; sanitization makes it safe. They’re complementary, not substitutes.

Hints in Layers

Only look at these if you’re stuck. Try to solve problems yourself first.

Hint 1: Structuring Your Validator Library
Create a base Validator class with a test(input) method that returns boolean and a validate(input) method that returns { valid: boolean, errors: string[] }. Each specific validator (email, phone, etc.) extends this base class.

Hint 2: Decomposing Email Validation
Don’t write one giant regex. Instead, split the input at @, validate the local part separately from the domain, then validate domain segments separately. This makes errors more specific and patterns more maintainable.

Hint 3: Handling Multiple Phone Formats
First, normalize the input by removing all non-digit characters except the leading +. Then validate the normalized form. For US numbers: +1 followed by 10 digits, or just 10 digits. This is much simpler than matching all formatted variations.
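
Hint 3 as a sketch (Python here for brevity; the same normalize-then-validate idea applies in JavaScript):

import re

def normalize_us_phone(raw):
    digits = re.sub(r'[^\d+]', '', raw)            # drop formatting, keep digits and '+'
    if digits.startswith('+1'):
        digits = digits[2:]
    elif digits.startswith('1') and len(digits) == 11:
        digits = digits[1:]
    return digits if re.fullmatch(r'\d{10}', digits) else None

print(normalize_us_phone('(555) 123-4567'))    # 5551234567
print(normalize_us_phone('+1 555 123 4567'))   # 5551234567
print(normalize_us_phone('123'))               # None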

Hint 4: Implementing Composable Validators
Use a fluent interface pattern. Each validation method returns this, allowing chaining: v.string(input).isEmail().maxLength(50).validate(). Store the validation errors in an array and only compute the final result when .validate() is called.

Books That Will Help

Topic | Book | Chapter/Section
Character classes deep dive | “Regular Expressions Cookbook” - Goyvaerts & Levithan | Chapter 2: Basic Regular Expression Skills
Email validation patterns | “Regular Expressions Cookbook” - Goyvaerts & Levithan | Chapter 4: Validation and Formatting
Anchoring and boundaries | “Mastering Regular Expressions” - Friedl | Chapter 3: Overview of Regular Expression Features
RFC 5322 email spec | RFC 5322 | Section 3.4: Address Specification
Phone number standards | ITU-T E.164 | International Telephone Numbering Plan
Input validation security | OWASP Cheat Sheet Series | Input Validation Cheat Sheet
Credit card validation | Payment Card Industry (PCI) Data Security Standard documentation

Common Pitfalls & Debugging

Problem 1: “Valid inputs are rejected”

  • Why: Patterns are too strict (e.g., ASCII-only emails, fixed-length phone numbers).
  • Fix: Define validation policy explicitly and add tests for realistic variants.
  • Quick test: Validate internationalized emails and phone formats.

Problem 2: “Invalid inputs pass”

  • Why: Missing anchors or using search() instead of full-string matching.
  • Fix: Anchor validators with ^ and $.
  • Quick test: isEmail("xxx user@example.com yyy") should be false.

Problem 3: “Validation errors are unhelpful”

  • Why: Returning only true/false without diagnostics.
  • Fix: Provide failure reasons (missing @, invalid TLD, invalid length).
  • Quick test: validate("bad", "email") returns structured errors.

Problem 4: “Regex alone is used for complex formats”

  • Why: Some formats (full RFC email, URLs) are too complex for regex-only validation.
  • Fix: Use regex as a first pass, then apply additional parsing logic.
  • Quick test: Verify that edge cases are handled by post-processing.

Definition of Done

  • All validators are anchored and documented with examples
  • Returns structured error messages per failure type
  • Includes unit tests for valid and invalid cases per validator
  • Supports Unicode-friendly validation where appropriate
  • Provides a safe way to add custom validators

Project 3: Search and Replace Tool

  • File: LEARN_REGEX_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: JavaScript, Go, Rust
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Substitution / Backreferences
  • Software or Tool: sed-like Tool
  • Main Book: “Mastering Regular Expressions” by Jeffrey Friedl

What you’ll build: A text transformation tool that can search and replace using regex patterns with capture groups and backreferences—like sed but with a friendlier interface.

Why it teaches regex: Replacement patterns with backreferences are where regex becomes truly powerful. You’ll learn how captured groups can be rearranged and transformed in the replacement.

Core challenges you’ll face:

  • Capture groups → maps to grouping and referencing
  • Backreferences → maps to using captures in replacement
  • Named groups → maps to readable replacements
  • Global replacement → maps to replace all vs first

Key Concepts:

  • Backreferences: “Mastering Regular Expressions” Chapter 5 - Friedl
  • Named Groups: “Regular Expressions Cookbook” Chapter 3 - Goyvaerts
  • Replacement Syntax: Language-specific documentation

Difficulty: Intermediate. Time estimate: 1 week. Prerequisites: Project 2, understanding of groups.

Real World Outcome

# Basic replacement
$ rex 's/color/colour/' file.txt
Changed 5 occurrences

# Using capture groups
$ rex 's/(\w+), (\w+)/\2 \1/' names.txt
# "Smith, John" → "John Smith"

# Named groups for clarity
$ rex 's/(?P<last>\w+), (?P<first>\w+)/\g<first> \g<last>/' names.txt

# Multiple patterns
$ rex -e 's/foo/bar/' -e 's/baz/qux/' file.txt

# Date format conversion
$ rex 's/(\d{2})\/(\d{2})\/(\d{4})/\3-\1-\2/' dates.txt
# "12/25/2024" → "2024-12-25"

# Case transformation (with special replacement syntax)
$ rex 's/([a-z]+)/\U\1/' words.txt  # Uppercase
# "hello" → "HELLO"

$ rex 's/([A-Z])([A-Z]+)/\1\L\2/' shout.txt  # Title case
# "HELLO" → "Hello"

# Preview changes without applying
$ rex -n 's/error/ERROR/' log.txt
Would change:
  Line 42: Connection error → Connection ERROR
  Line 89: Database error → Database ERROR

# In-place editing with backup
$ rex -i.bak 's/old/new/' file.txt
Created backup: file.txt.bak
Modified: file.txt

Implementation Hints:

Understanding backreferences:

Pattern: (\w+), (\w+)
Input:   "Smith, John"
Match:   Group 1 = "Smith", Group 2 = "John"

Replacement: \2 \1
Output:      "John Smith"

Implementation approach:

import re

def replace(text, pattern, replacement, flags=0):
    # Handle backreference syntax differences
    # Python uses \1, JavaScript uses $1

    # Compile pattern
    regex = re.compile(pattern, flags)

    # Replace with group expansion
    result = regex.sub(replacement, text)

    return result

# Named group example
pattern = r'(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{4})'
replacement = r'\g<year>-\g<month>-\g<day>'

Case transformation in replacement:

\U    uppercase all following
\L    lowercase all following
\E    end case modification
\u    uppercase next char only
\l    lowercase next char only

Questions to consider:

  • How do you escape special characters in the replacement string?
  • How do you handle nested groups?
  • How do you implement case transformation?

Learning milestones:

  1. Basic replacement works → You understand substitution
  2. Capture groups work → You understand (…) and \1
  3. Named groups work → You understand (?P...)
  4. Case transforms work → You understand replacement modifiers

The Core Question You’re Answering

“How do capture groups and backreferences transform regex from a matching tool into a text transformation engine?”

This project reveals the real power of regex: not just finding patterns, but rearranging, transforming, and restructuring text. You’ll understand how captured groups become variables you can reference in the replacement string, enabling sophisticated transformations that would otherwise require complex string manipulation code.

Concepts You Must Understand First

Before writing a single line of code, stop and research these concepts:

  1. Capture Groups vs Non-Capturing Groups - What’s the difference between (pattern) and (?:pattern)? When would you use non-capturing groups? How do they affect the numbering of backreferences? Book reference: “Mastering Regular Expressions” Chapter 5 - Practical Regex Techniques.

  2. Backreference Syntax Across Languages - Python uses \1 and \g<1> in replacements, while JavaScript uses $1. What about Perl, sed, and vim? How do you reference named groups in each?

  3. Named Groups - Why use (?P<name>...) (Python) or (?<name>...) (JavaScript) instead of numbered groups? How do you reference them in the replacement: \g<name> vs $<name>?

  4. The sub() Function - How does Python’s re.sub(pattern, replacement, string) work? What happens when replacement is a function instead of a string? When would you use that?

  5. Replace All vs Replace First - What’s the difference between sub() and subn()? How do you limit the number of replacements? What’s the default behavior?

Questions to Guide Your Design

Work through these questions before coding—they’ll lead you to the right implementation:

  1. How do you parse the s/pattern/replacement/flags syntax? What edge cases exist? What if the pattern or replacement contains /? How does sed handle this with alternate delimiters?

  2. How do you handle escape sequences in the replacement string? When the user types \1, how do you distinguish between “backreference to group 1” and “the characters backslash and 1”?

  3. What happens when a backreference refers to a group that didn’t match? If the pattern is (a)?(b) and the input is just “b”, what does \1 evaluate to in the replacement?

  4. How do you implement preview mode (-n)? You need to show the user what would change without modifying the file. What information should you display? Line numbers? Before/after comparison?

  5. How do you handle multiple patterns (-e)? Should they be applied sequentially or independently? What if one pattern’s replacement creates text that matches another pattern?

  6. How do you implement case transformation? The \U, \L, \u, \l modifiers aren’t built into Python’s re.sub(). How would you implement them?

Thinking Exercise

Before coding, work through these transformation scenarios by hand:

Exercise 1: For the pattern (\w+), (\w+) and replacement \2 \1:

  • Input: “Smith, John”
  • What are the groups? Group 1 = ?, Group 2 = ?
  • What’s the replacement result?
  • Now try: “O’Brien, Mary-Jane” — does it work? Why or why not?

Exercise 2: For date format conversion with pattern (\d{2})/(\d{2})/(\d{4}) and replacement \3-\1-\2:

  • Input: “12/25/2024”
  • Trace through: Which group is month? Day? Year?
  • What’s the output format?
  • How would you make this more readable with named groups?

Exercise 3: Implement a mental model for nested groups:

  • Pattern: ((a)(b))(c)
  • Input: “abc”
  • What is \1? \2? \3? \4?
  • Rule: Groups are numbered by the position of their opening parenthesis, left to right.

Exercise 4: Design a replacement function (not string) for this transformation:

  • Input: “price: $1234.56”
  • Output: “price: $1,234.56” (add thousand separators)
  • Why can’t you do this with a simple replacement string? What logic is needed?
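
One way to approach Exercise 4, sketched with a replacement function (the formatting logic lives in Python code, not in the pattern):

import re

def add_separators(match):
    whole = match.group(1)
    frac = match.group(2) or ''
    return f"{int(whole):,}{frac}"        # 1234 -> 1,234

print(re.sub(r'(\d+)(\.\d+)?', add_separators, 'price: $1234.56'))
# price: $1,234.56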

The Interview Questions They’ll Ask

After completing this project, you should be able to answer:

  1. “How do capture groups work in regex replacement?” - Explain that groups create “variables” you can reference in the replacement. Walk through a simple example like swapping first/last names.

  2. “What’s the difference between $1 and \1 in replacements?” - It depends on the language: JavaScript uses $1, Python uses \1 or \g<1>. Some engines support both. Know your language.

  3. “Why would you use a function instead of a string for replacement?” - When you need logic: conditional replacements, calculations, lookups. Example: converting Celsius to Fahrenheit in text.

  4. “How do you handle overlapping matches in replacement?” - By default, regex matches don’t overlap. If you replace “aa” in “aaa”, you get one replacement, not two. Explain why.

  5. “What are named capture groups and why use them?” - They make patterns self-documenting: (?P<year>\d{4}) is clearer than (\d{4}). They also make refactoring safer—renumbering groups is error-prone.

  6. “How would you implement a sed-like tool?” - Walk through: parsing the s/pattern/replacement/flags syntax, handling escapes, reading files line-by-line, applying transformations, managing in-place editing.

  7. “What’s the danger of using regex for HTML transformation?” - Regex can’t parse nested structures correctly. A pattern to remove <script> tags will fail on nested scripts or scripts in comments. Use a proper HTML parser.

  8. “How do you test a complex replacement pattern?” - Unit tests with edge cases, interactive tools like regex101.com, preview mode before applying, version control for the files being modified.

Hints in Layers

Only look at these if you’re stuck. Try to solve problems yourself first.

Hint 1: Parsing the s/pattern/replacement/flags Syntax
Don’t try to parse it with a single regex. Instead, find the delimiter (usually /), then split carefully. Watch out for escaped delimiters (\/). Consider using the first character after s as the delimiter—sed allows s|pattern|replacement| for patterns containing slashes.
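
A naive parser for this syntax (a sketch that treats the character after s as the delimiter and ignores escaped delimiters, which a real tool still has to handle):

def parse_substitution(expr):
    if len(expr) < 4 or expr[0] != 's':
        raise ValueError('expected s<delim>pattern<delim>replacement<delim>[flags]')
    delim = expr[1]
    parts = expr[2:].split(delim)
    if len(parts) < 2:
        raise ValueError('missing replacement')
    pattern, replacement = parts[0], parts[1]
    flags = parts[2] if len(parts) > 2 else ''
    return pattern, replacement, flags

print(parse_substitution(r's/(\w+), (\w+)/\2 \1/g'))
# ('(\\w+), (\\w+)', '\\2 \\1', 'g')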

Hint 2: Implementing Backreference Expansion
Python’s re.sub() handles \1 and \g<name> automatically. You don’t need to manually expand them. But if you’re building a more sed-like syntax, you might need to translate between formats (e.g., convert \1 to \g<1> for clarity with double-digit groups).

Hint 3: Implementing Case Transformation
Python’s re doesn’t support \U, \L, etc. natively. Use a replacement function: capture what should be transformed, then apply .upper() or .lower() in Python code. Example:

import re

def case_replace(match):
    return match.group(1).upper()          # emulates \U\1 for the captured group

re.sub(r'([a-z]+)', case_replace, text)    # the pattern needs a capture group for group(1)

Hint 4: Implementing Preview Mode
For each line, run the replacement and compare old vs new. If different, display both with the line number. Use color coding (red for old, green for new) if the terminal supports it. Count total changes and summarize at the end.
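
Hint 4 as a sketch (a dry run that reports changes without writing anything; the output format is illustrative):

import re

def preview(lines, regex, replacement):
    changes = 0
    for lineno, line in enumerate(lines, start=1):
        new = regex.sub(replacement, line)
        if new != line:
            changes += 1
            print(f"  Line {lineno}: {line.rstrip()} -> {new.rstrip()}")
    print(f"Would change {changes} line(s)")

preview(['Connection error: timeout', 'all good'], re.compile('error'), 'ERROR')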

Books That Will Help

Topic | Book | Chapter/Section
Capture groups and backreferences | “Mastering Regular Expressions” - Friedl | Chapter 5: Practical Regex Techniques
Named groups in detail | “Regular Expressions Cookbook” - Goyvaerts & Levithan | Chapter 3: Programming with Regular Expressions
Replacement mechanics | “Mastering Regular Expressions” - Friedl | Chapter 6: Crafting an Efficient Expression
sed and stream editing | “sed & awk” - Dougherty & Robbins | Chapter 5: Basic sed Commands
Python’s re module | “Mastering Regular Expressions” - Friedl | Chapter 9: Python
Text transformation patterns | “The AWK Programming Language” - Aho, Kernighan & Weinberger | Chapter 2: The AWK Language
Command-line text processing | “Classic Shell Scripting” - Robbins & Beebe | Chapter 3: Searching and Substitutions

Common Pitfalls & Debugging

Problem 1: “Replacement text is wrong”

  • Why: Replacement string escaping is incorrect ($1 vs \1).
  • Fix: Use replacement callbacks or document the replacement syntax clearly.
  • Quick test: Replace 2024-12-25 -> 12/25/2024 using captures.

Problem 2: “Offsets drift after replacements”

  • Why: Applying replacements in-place without accounting for length changes.
  • Fix: Build output incrementally or apply replacements from end to start.
  • Quick test: Replace multiple matches in a single line and verify positions.

Problem 3: “Overlapping matches are missed”

  • Why: Default regex engines do not match overlaps.
  • Fix: Document behavior or implement custom overlap logic.
  • Quick test: Pattern aba in ababa should match once or twice depending on design.

Problem 4: “Huge files are slow”

  • Why: Reading entire file into memory before replacing.
  • Fix: Stream processing line-by-line or chunk-by-chunk.
  • Quick test: Replace in a 1GB file without high memory usage.

Definition of Done

  • Supports global replacements with capture references
  • Provides dry-run mode that prints diffs
  • Handles large files without loading all data into memory
  • Correctly reports number of replacements made
  • Includes tests for overlapping and non-overlapping cases

Project 4: Input Sanitizer

  • File: LEARN_REGEX_DEEP_DIVE.md
  • Main Programming Language: JavaScript
  • Alternative Programming Languages: Python, Go
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Security / Text Processing
  • Software or Tool: Sanitization Library
  • Main Book: “Regular Expressions Cookbook” by Goyvaerts & Levithan

What you’ll build: A library that sanitizes user input by removing or escaping dangerous patterns—preventing XSS, SQL injection attempts, and cleaning up formatting.

Why it teaches regex: Sanitization requires thinking about what NOT to match (dangerous patterns) and understanding edge cases attackers might exploit. You’ll learn negative patterns and the limits of regex for security.

Core challenges you’ll face:

  • HTML tag removal → maps to matching nested structures (limits of regex)
  • Script detection → maps to case-insensitive variations
  • Whitespace normalization → maps to \s and cleanup patterns
  • SQL injection patterns → maps to alternation and escaping

Key Concepts:

  • Security Considerations: OWASP Input Validation Cheat Sheet
  • Regex Limitations: Why regex can’t parse HTML properly
  • Escaping: “Regular Expressions Cookbook” Chapter 4

Difficulty: Intermediate. Time estimate: 1 week. Prerequisites: Projects 2 and 3.

Real World Outcome

const sanitize = require('sanitex');

// HTML sanitization (remove tags)
sanitize.stripTags('<script>alert("xss")</script>Hello');
// Output: "Hello"

sanitize.stripTags('<b>Bold</b> and <i>italic</i>', { allow: ['b', 'i'] });
// Output: "<b>Bold</b> and <i>italic</i>"

// XSS prevention
sanitize.escapeHtml('<script>alert("xss")</script>');
// Output: "&lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;"

// SQL special character escaping
sanitize.escapeSQL("O'Brien; DROP TABLE users--");
// Output: "O''Brien; DROP TABLE users--"

// Whitespace normalization
sanitize.normalizeWhitespace('  Hello    World  \n\n  ');
// Output: "Hello World"

// URL cleaning
sanitize.cleanURL('javascript:alert("xss")');
// Output: "" (blocked dangerous protocol)

sanitize.cleanURL('https://example.com/path?q=test');
// Output: "https://example.com/path?q=test" (allowed)

// Username sanitization
sanitize.username('John<script>Doe');
// Output: "JohnDoe" (only alphanumeric and underscore)

// Filename sanitization
sanitize.filename('../../../etc/passwd');
// Output: "etc_passwd" (path traversal removed)

// Custom sanitization
const clean = sanitize.create([
  { pattern: /[^\w\s-]/g, replacement: '' },      // Remove special chars
  { pattern: /\s+/g, replacement: ' ' },           // Normalize spaces
  { pattern: /^\s+|\s+$/g, replacement: '' }       // Trim
]);
clean('  Hello!!! World???  ');
// Output: "Hello World"

Implementation Hints:

Common sanitization patterns:

const sanitizers = {
  // Remove all HTML tags
  stripAllTags: /<[^>]*>/g,

  // Match script tags (with variations)
  scriptTags: /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi,

  // Event handlers (onclick, onerror, etc.)
  eventHandlers: /\s*on\w+\s*=\s*["'][^"']*["']/gi,

  // JavaScript URLs
  jsUrls: /javascript\s*:/gi,

  // Multiple whitespace
  multiSpace: /\s+/g,

  // SQL dangerous characters
  sqlDanger: /['";\\]/g,

  // Path traversal
  pathTraversal: /\.\.[\\/]/g,

  // Null bytes
  nullBytes: /\x00/g
};

Why regex isn’t enough for HTML:

Regex CANNOT properly parse:
  <div attr=">" class="foo">   (quote contains >)
  <div><div></div></div>       (nested same tags)
  <!-- <script>x</script> -->  (tags in comments)

Use a proper HTML parser, then regex for cleanup.

Defense in depth:

function sanitizeInput(input) {
  let clean = input;

  // 1. Remove null bytes (before any other processing)
  clean = clean.replace(/\x00/g, '');

  // 2. Normalize unicode (prevent homoglyph attacks)
  clean = clean.normalize('NFKC');

  // 3. Remove dangerous patterns
  clean = clean.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '');

  // 4. Escape remaining special chars
  clean = escapeHtml(clean);

  return clean;
}

Learning milestones:

  1. Tag stripping works → You understand basic HTML patterns
  2. Case variations caught → You understand /i flag
  3. Edge cases handled → You understand regex limitations
  4. Multiple patterns combine → You understand defense in depth

The Core Question You’re Answering

“Why can’t I just use regex to secure my application, and what does defense in depth really mean?”

This project confronts you with one of the most important lessons in software security: regex is a tool, not a solution. You will discover why attackers consistently bypass regex-based sanitization and why security requires layered defenses. The real question you are answering is: “How do I think like an attacker while building like a defender?”

Concepts You Must Understand First

Before writing any code, ensure you have a solid grasp of these concepts:

Concept | Why It Matters | Reference
XSS Attack Vectors | Understanding what you are defending against | OWASP XSS Prevention Cheat Sheet
SQL Injection Basics | Knowing how malicious input becomes dangerous | “The Web Application Hacker’s Handbook” Ch. 9
Character Encoding | Unicode can bypass naive pattern matching | “Mastering Regular Expressions” Ch. 3 (Unicode)
HTML Parsing Theory | Why regex fundamentally cannot parse HTML | Chomsky hierarchy - HTML is context-free, regex is regular
Defense in Depth | Security principle of multiple layers | OWASP Secure Coding Guidelines
Regular Expression Flags | Case insensitivity, multiline, dotall modes | “Regular Expressions Cookbook” Ch. 2

Questions to Guide Your Design

Work through these questions before and during implementation:

  1. Attack Surface Analysis: What are all the ways a user can input data into your system? (forms, URLs, headers, file uploads, etc.)

  2. Threat Modeling: For each input point, what could an attacker inject? (<script>, '; DROP TABLE, ../../../etc/passwd, null bytes)

  3. Encoding Challenges: How will you handle:
    • Mixed case attacks (<ScRiPt>)
    • HTML entities (&#60;script&#62;)
    • Unicode homoglyphs (Cyrillic ‘a’ vs Latin ‘a’)
    • Double encoding (%253Cscript%253E)
    • Null byte injection (script%00.js)
  4. Allowlist vs Blocklist: When should you define what IS allowed versus what is NOT allowed? What are the tradeoffs?

  5. Context Sensitivity: The same input might be safe in one context but dangerous in another. How do you handle < in:
    • HTML body
    • HTML attribute
    • JavaScript string
    • URL parameter
    • SQL query
  6. Performance Considerations: What happens when an attacker sends a 10MB payload? How do you handle ReDoS (Regular Expression Denial of Service)?

Thinking Exercise

Before writing any code, complete this exercise on paper or whiteboard:

Exercise: The Attacker’s Mindset

  1. Draw a simple web form with a comment field that displays user comments on a page.

  2. List 10 different XSS payloads an attacker might try:
    Example: <script>alert('xss')</script>
    Now think of 9 more variations...
    
  3. For each payload, write a regex that would catch it, then write a bypass for your own regex.

  4. Document the pattern: What category of bypass did you use?
    • Case variation
    • Encoding
    • Whitespace injection
    • Event handler alternative
    • Protocol handler
    • Nested/malformed tags
  5. Conclude: How many layers of defense would you need to catch all your bypasses?

This exercise should convince you why a single regex is never sufficient for security.

The Interview Questions They’ll Ask

  1. “Why can’t regular expressions properly parse HTML?”
    • Discuss the Chomsky hierarchy: HTML is a context-free grammar, regex can only express regular languages
    • Explain nested structure problems: <div><div></div></div> requires counting/memory regex doesn’t have
    • Mention the famous Stack Overflow answer about parsing HTML with regex
  2. “Walk me through how you would prevent XSS in a web application.”
    • Output encoding based on context (HTML, JavaScript, URL, CSS)
    • Content Security Policy (CSP) headers
    • Input validation as one layer (not the only layer)
    • Using a proper HTML sanitizer library (DOMPurify, bleach)
    • HTTPOnly and Secure cookie flags
  3. “What’s the difference between input validation, sanitization, and output encoding?”
    • Validation: Checking if input meets expected format (reject if not)
    • Sanitization: Modifying input to remove dangerous parts
    • Output encoding: Converting dangerous characters when displaying
  4. “How would you protect against SQL injection?”
    • Parameterized queries / prepared statements (primary defense)
    • Stored procedures with parameterization
    • Input validation (secondary)
    • Least privilege database accounts
    • Never use string concatenation for SQL
  5. “Explain Unicode normalization attacks.”
    • Homoglyph attacks (Cyrillic ‘a’ looks like Latin ‘a’)
    • Different normalization forms (NFC, NFD, NFKC, NFKD)
    • Why you should normalize to NFKC before validation
    • How Unicode can bypass length checks
  6. “What is ReDoS and how do you prevent it?”
    • Regular Expression Denial of Service
    • Catastrophic backtracking with nested quantifiers
    • Example: (a+)+$ with input “aaaaaaaaaaaaaaaaaaaab”
    • Prevention: avoid nested quantifiers, use atomic groups, set timeouts
  7. “Your regex sanitizer is blocking legitimate user input. How do you debug?”
    • Log what’s being blocked (without exposing in UI)
    • Use regex testing tools to understand match behavior
    • Consider if the pattern is too aggressive
    • Test with diverse international input (names, addresses)

Hints in Layers

Layer 1 - Getting Started: Start with the simplest case: removing ALL HTML tags with /<[^>]*>/g. Test it with basic tags like <b>, <script>, <div>. Don’t worry about edge cases yet.
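
Layer 1 in code (a Python sketch of the idea; the project itself targets JavaScript). Note how the crude tag-stripper leaves the script body behind, which is exactly why the later layers exist:

import html
import re

TAG = re.compile(r'<[^>]*>')

def strip_tags(text):
    return TAG.sub('', text)               # Layer 1: remove anything that looks like a tag

def escape_output(text):
    return html.escape(text, quote=True)   # separate layer: output encoding

print(strip_tags('<script>alert("xss")</script>Hello'))   # alert("xss")Hello  <- payload text survives
print(escape_output('<script>alert("xss")</script>'))     # &lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;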

Layer 2 - Handling Variations: Now handle case variations. Use the i flag for case-insensitive matching. Consider how to handle tags split across lines (hint: the s flag makes . match newlines). Build separate patterns for different threats: script tags, event handlers, javascript: URLs.

Layer 3 - Edge Cases and Bypasses: Test your sanitizer against these payloads and fix as needed:

  • <scr<script>ipt> (nested tag)
  • <img src=x onerror=alert(1)> (event handler)
  • <a href="javascript:alert(1)"> (protocol)
  • <div style="background:url(javascript:alert(1))"> (CSS expression)
  • \x3cscript\x3e (hex encoding)

Consider using an allowlist approach: define what IS allowed rather than what is NOT.

Layer 4 - Production Considerations: For real security:

  1. Use a battle-tested library (DOMPurify for JS, bleach for Python)
  2. Implement Content Security Policy headers
  3. Use output encoding appropriate to context
  4. Consider using a Web Application Firewall (WAF)
  5. Your regex sanitizer is ONE layer in defense in depth
  6. Add timeouts to prevent ReDoS attacks

Books That Will Help

Topic | Book | Chapter/Section
Regex fundamentals | “Mastering Regular Expressions” - Friedl | Chapters 1-4
Common sanitization patterns | “Regular Expressions Cookbook” - Goyvaerts | Chapter 4: Validation & Formatting
XSS attack vectors | “The Web Application Hacker’s Handbook” - Stuttard | Chapter 12: Attacking Users: XSS
SQL injection patterns | “The Web Application Hacker’s Handbook” - Stuttard | Chapter 9: Attacking Data Stores
Security mindset | “The Tangled Web” - Zalewski | Chapters on browser security
Unicode security | “Unicode Security Considerations” - Unicode TR36 | Full document
Defense in depth | “Building Secure Software” - Viega & McGraw | Chapter 5: Guiding Principles
OWASP resources | OWASP Cheat Sheet Series | XSS Prevention, SQL Injection Prevention

Common Pitfalls & Debugging

Problem 1: “Sanitization is bypassed”

  • Why: Regex misses alternate encodings or case variations.
  • Fix: Normalize input (case folding, decoding) before applying regex.
  • Quick test: Test with mixed-case and encoded inputs.

Problem 2: “Valid content is stripped”

  • Why: Overly broad patterns (e.g., .* with DOTALL) remove too much.
  • Fix: Use precise character classes and anchors.
  • Quick test: Ensure safe input remains unchanged.

Problem 3: “Regex used as full HTML sanitizer”

  • Why: HTML is not a regular language; regex-only sanitizers are brittle.
  • Fix: Combine regex with parser-based sanitization for complex formats.
  • Quick test: Inject nested tags and verify output.

Problem 4: “Performance degrades on adversarial input”

  • Why: Nested quantifiers cause catastrophic backtracking.
  • Fix: Avoid ambiguous patterns and use timeouts or safe engines.
  • Quick test: Run pattern (a+)+$ against long inputs and ensure mitigation.

Definition of Done

  • Sanitizer runs in bounded time on adversarial input
  • Clear whitelist/blacklist policy documented
  • Normalization applied consistently before regex matching
  • Test suite includes bypass attempts and false-positive cases
  • Produces structured logs for blocked input

Project 5: URL & Email Parser

  • File: LEARN_REGEX_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: JavaScript, Go, Rust
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Complex Pattern Parsing
  • Software or Tool: URL/Email Parser
  • Main Book: “Regular Expressions Cookbook” by Goyvaerts & Levithan

What you’ll build: A parser that extracts and validates URLs and emails from text, breaking them into components (protocol, host, path, query params for URLs; local part, domain for emails).

Why it teaches regex: URLs and emails are complex patterns with many optional parts. You’ll learn optional groups, alternation, and how to extract structured data from patterns.

Core challenges you’ll face:

  • Optional URL components → maps to (…)? optional groups
  • Query string parsing → maps to repeated capture groups
  • International domains → maps to Unicode in regex
  • Edge cases → maps to understanding RFC specifications

Key Concepts:

  • URL Regex: RFC 3986 and practical simplifications — https://www.rfc-editor.org/rfc/rfc3986
  • Email addr-spec: RFC 5322 (addr-spec) + RFC 5321 (SMTP constraints) — https://www.rfc-editor.org/rfc/rfc5322 and https://www.rfc-editor.org/rfc/rfc5321
  • Optional Groups: “Mastering Regular Expressions” Chapter 4 - Friedl
  • Non-Capturing Groups: “Regular Expressions Cookbook” Chapter 2

Difficulty: Intermediate Time estimate: 1 week Prerequisites: Projects 2 and 3

Real World Outcome

from urlparser import URLParser, EmailParser

# Parse URLs
parser = URLParser()
result = parser.parse('https://user:pass@example.com:8080/path/to/page?foo=bar&baz=qux#section')
# {
#   'scheme': 'https',
#   'username': 'user',
#   'password': 'pass',
#   'host': 'example.com',
#   'port': 8080,
#   'path': '/path/to/page',
#   'query': {'foo': 'bar', 'baz': 'qux'},
#   'fragment': 'section'
# }

# Extract all URLs from text
text = "Check out https://example.com and http://test.org/page for more info"
urls = parser.extract_all(text)
# ['https://example.com', 'http://test.org/page']

# Validate URL
parser.is_valid('https://example.com')  # True
parser.is_valid('not a url')             # False
parser.is_valid('ftp://files.example.com')  # True (different scheme)

# Parse email addresses
email_parser = EmailParser()
result = email_parser.parse('John.Doe+tag@mail.example.co.uk')
# {
#   'local': 'John.Doe+tag',
#   'domain': 'mail.example.co.uk',
#   'subdomain': 'mail',
#   'sld': 'example',
#   'tld': 'co.uk',
#   'tag': 'tag'  # Plus addressing
# }

# Extract emails from text
text = "Contact us at support@example.com or sales@example.org"
emails = email_parser.extract_all(text)
# ['support@example.com', 'sales@example.org']

# Handle edge cases
parser.parse('http://localhost:3000')
# { 'scheme': 'http', 'host': 'localhost', 'port': 3000, ... }

parser.parse('file:///home/user/doc.txt')
# { 'scheme': 'file', 'path': '/home/user/doc.txt', ... }

Implementation Hints:

URL regex breakdown:

URL_PATTERN = r'''
    ^
    (?P<scheme>https?|ftp)://           # Scheme
    (?:
        (?P<username>[^:@]+)             # Username (optional)
        (?::(?P<password>[^@]+))?        # Password (optional)
        @
    )?
    (?P<host>
        (?:[\w-]+\.)*[\w-]+              # Domain name
        |
        \d{1,3}(?:\.\d{1,3}){3}          # or IP address
    )
    (?::(?P<port>\d+))?                  # Port (optional)
    (?P<path>/[^?#]*)?                   # Path (optional)
    (?:\?(?P<query>[^#]*))?              # Query string (optional)
    (?:\#(?P<fragment>.*))?              # Fragment (optional)
    $
'''

Optional groups explained:

(?:...)?    Non-capturing optional group

Example: (?::(?P<port>\d+))?
         │  │ │            │
         │  │ │            └── entire group is optional
         │  │ └── capture port number
         │  └── colon before port
         └── don't capture the colon, just group it

Parsing query strings:

import re

def parse_query(query_string):
    params = {}
    pattern = r'([^&=]+)=([^&]*)'
    for match in re.finditer(pattern, query_string):
        key, value = match.groups()
        params[key] = value
    return params
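
Usage of the function above (note that it does not percent-decode values or merge repeated keys; the standard library's urllib.parse.parse_qs handles both):

print(parse_query('foo=bar&baz=qux&empty='))
# {'foo': 'bar', 'baz': 'qux', 'empty': ''}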

Learning milestones:

  1. Basic URLs parse → You understand optional groups
  2. Query strings extract → You understand repeated groups
  3. Edge cases work → You understand alternation
  4. Components accessible → You understand named groups

The Core Question You’re Answering

“How do you decompose a complex, variable-length string into its logical components using a single pattern?”

URLs and emails are the ultimate test of “optionality” in regex. A URL might have a port, might have a username, might have query parameters—or none of them. This project teaches you how to structure a pattern that remains robust whether it’s matching http://localhost or https://user:pass@example.com:8080/path?q=1#top.

Concepts You Must Understand First

  1. URI/URL Generic Syntax (RFC 3986)
    • What are the formal components of a URI? (scheme, authority, path, query, fragment)
    • Which characters are reserved vs unreserved, and how does percent-encoding change matching?
    • Spec Reference: RFC 3986 (URI Generic Syntax) — https://www.rfc-editor.org/rfc/rfc3986
  2. Email Address Grammar (RFC 5322 + RFC 5321)
    • What is the formal addr-spec and why is it more permissive than most “practical” validators?
    • How do SMTP envelope constraints (RFC 5321) differ from message header parsing (RFC 5322)?
    • Spec Reference: RFC 5322 and RFC 5321 — https://www.rfc-editor.org/rfc/rfc5322 and https://www.rfc-editor.org/rfc/rfc5321
  3. Non-Capturing Groups ((?:...))
    • Why use these instead of regular parentheses?
    • How do they help keep your group numbering clean?
    • Book Reference: “Mastering Regular Expressions” Ch. 2 - Friedl
  4. Optional Quantifiers (?)
    • What happens when you put a ? after a group?
    • How does the engine decide whether to “try” the optional part or skip it?
  5. Alternation and Precedence (|)
    • In a host pattern, why should you try to match an IP address before a domain name?
    • How does the engine resolve com|co.uk?
  6. Character Class Exclusion ([^...])
    • Why is [^/?#]+ more efficient than the lazy .+? for matching a host?
    • How does “matching everything except the next delimiter” prevent over-matching?

Questions to Guide Your Design

  1. Handling the Scheme
    • Which schemes should you support? (http, https, ftp, mailto, file)
    • Is the :// part always required? (Think about mailto:user@example.com)
  2. The “Authority” Section
    • How do you distinguish between username:password@host and just host?
    • What characters are forbidden in a hostname?
  3. Query Parameters
    • A query string looks like ?key1=val1&key2=val2. Should you parse this with one regex or a loop of regexes?
    • How do you handle encoded characters like %20?
  4. Email Local Part
    • What’s the maximum length of an email?
    • Does your pattern support “plus addressing” (user+tag@gmail.com)?

Thinking Exercise

Trace a URL Parser

Before coding, try to break this URL down using “stops”: https://admin:secret@api.v1.example.com:8443/v1/users?limit=10#results

  1. Stop at ://: Everything before is the scheme.
  2. Look for @: If it exists, split everything between :// and @ by : to get user and pass.
  3. Stop at the first /, ?, or #: Everything between @ (or ://) and this stop is the authority. Split it by : to get host and port.
  4. Stop at ? or #: Everything from the first / to here is the path.
  5. Stop at #: Everything between ? and this stop is the query.
  6. Everything else: Is the fragment.

Question: Can you write a separate small regex for each of these “segments”?
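
One possible set of per-segment patterns, shown as a hedged Python sketch against the example URL (each re.search can return None for simpler URLs):

import re

url = 'https://admin:secret@api.v1.example.com:8443/v1/users?limit=10#results'

scheme   = re.match(r'^([a-zA-Z][a-zA-Z0-9+.-]*)://', url).group(1)   # https
userinfo = re.search(r'://([^/@]+)@', url)                            # admin:secret
hostport = re.search(r'://(?:[^/@]+@)?([^/?#]+)', url).group(1)       # api.v1.example.com:8443
path     = re.search(r'://[^/?#]*(/[^?#]*)', url)                     # /v1/users
query    = re.search(r'\?([^#]*)', url)                               # limit=10
fragment = re.search(r'#(.*)$', url)                                  # results

print(scheme, hostport)
print(userinfo and userinfo.group(1), path and path.group(1))
print(query and query.group(1), fragment and fragment.group(1))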

The Interview Questions They’ll Ask

  1. “Why is validating an email with regex considered a ‘trap’?”
  2. “What is the difference between a URI and a URL?”
  3. “How would you handle internationalized domain names (IDN) like münchen.de?”
  4. “Why should you use non-capturing groups for the optional parts of a URL?”
  5. “What RFC defines the standard for URL syntax?”

Hints in Layers

Hint 1: The “Building Block” Approach Don’t write the whole regex at once. Define strings for each part:

SCHEME = r'(?P<scheme>[a-zA-Z]+)://'
HOST = r'(?P<host>[^:/ ]+)'
# ...
FULL_PATTERN = SCHEME + HOST # and so on

Hint 2: Using re.VERBOSE URL regexes are long. Use the re.VERBOSE (or re.X) flag so you can add whitespace and comments inside your pattern string.

Hint 3: Testing with finditer If you want to extract URLs from a large block of text, use re.finditer. It gives you the match object, which lets you access named groups easily.
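
The anchored validation pattern above will not work for extraction from free text; a deliberately loose, unanchored candidate-finding pattern (validate the hits afterwards) might look like this sketch:

import re

URL_IN_TEXT = re.compile(r'https?://[^\s<>"]+')

text = "Check out https://example.com and http://test.org/page for more info"
for m in URL_IN_TEXT.finditer(text):
    print(m.group(), m.span())
# https://example.com (10, 29)
# http://test.org/page (34, 54)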

Books That Will Help

Topic Book Chapter
URI generic syntax RFC 3986 Sections 1.1–3 (scheme/authority/path/query/fragment)
Email addr-spec and SMTP constraints RFC 5322 + RFC 5321 addr-spec (5322 §3.4.1) + path limits (5321 §4)
RFC-compliant URL patterns “Regular Expressions Cookbook” Ch. 7.1
Email validation nuances “Mastering Regular Expressions” Ch. 5 (The “A Realistic Example” section)
Non-capturing groups “Regular Expressions Cookbook” Ch. 2.9
Named capture groups “Python Standard Library by Example” Ch. 1 (re module)

Common Pitfalls & Debugging

Problem 1: “Regex tries to fully implement RFCs”

  • Why: RFC-complete patterns are too complex and brittle.
  • Fix: Use pragmatic regex for structure, then validate with parser logic.
  • Quick test: Run against real-world URLs and IDNs.

Problem 2: “Internationalized domains fail”

  • Why: ASCII-only character classes.
  • Fix: Support punycode or Unicode properties where applicable.
  • Quick test: Validate https://пример.рф after converting the host to its punycode form (see the sketch below).
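
A minimal sketch of the punycode-first approach using Python's built-in idna codec (münchen.de is the IDN from the interview questions above):

import re

host = 'münchen.de'
ascii_host = host.encode('idna').decode('ascii')   # 'xn--mnchen-3ya.de'
print(bool(re.fullmatch(r'(?:[\w-]+\.)+[a-zA-Z]{2,}', ascii_host)))  # True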

Problem 3: “Groups return wrong segments”

  • Why: Misplaced parentheses or alternation precedence.
  • Fix: Use named groups and test captures explicitly.
  • Quick test: Ensure groups return scheme, host, port, path consistently.

Problem 4: “Trailing punctuation is included”

  • Why: Pattern is too greedy on URL endings.
  • Fix: Exclude common delimiters like ) , . when parsing text.
  • Quick test: Parse the text See https://example.com). and confirm the trailing ). is excluded from the match.

Definition of Done

  • Correctly extracts scheme, host, port, path, query, fragment
  • Handles common real-world URL variants and email formats
  • Provides clear error messages for invalid inputs
  • Includes a regression suite with at least 50 test cases
  • Documented limitations vs full RFC compliance

Project 6: Log Parser

  • File: LEARN_REGEX_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, Rust
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Log Analysis / Data Extraction
  • Software or Tool: Log Parser
  • Main Book: “Mastering Regular Expressions” by Jeffrey Friedl

What you’ll build: A log parsing tool that can extract structured data from various log formats (Apache, Nginx, syslog, JSON logs), with support for custom format definitions.

Why it teaches regex: Real-world log formats are complex, with timestamps, IP addresses, quoted strings, and variable fields. You’ll learn to build patterns incrementally and handle real-world messiness.

Core challenges you’ll face:

  • Timestamp formats → maps to complex date/time patterns
  • Quoted strings → maps to handling escapes and nested quotes
  • IP addresses → maps to numeric patterns with validation
  • Variable fields → maps to greedy vs non-greedy matching

Key Concepts:

  • Log Format Patterns: Apache Common/Combined Log Format docs (CLF/Combined) — https://httpd.apache.org/docs/2.4/logs.html
  • Non-Greedy Matching: “Mastering Regular Expressions” Chapter 4 - Friedl
  • Named Captures: “Regular Expressions Cookbook” Chapter 3

Difficulty: Intermediate Time estimate: 1-2 weeks Prerequisites: Projects 1 and 5

Real World Outcome

This parser targets Apache Common Log Format (CLF) and Combined Log Format as defined in the official Apache HTTP Server docs, then extends to Nginx/syslog variants. https://httpd.apache.org/docs/2.4/logs.html

from logparser import LogParser

# Built-in format: Apache Combined Log
parser = LogParser('apache_combined')
entry = parser.parse('192.168.1.1 - frank [10/Oct/2024:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326 "http://www.example.com/" "Mozilla/5.0"')
# {
#   'ip': '192.168.1.1',
#   'user': 'frank',
#   'timestamp': datetime(2024, 10, 10, 13, 55, 36),
#   'method': 'GET',
#   'path': '/index.html',
#   'protocol': 'HTTP/1.0',
#   'status': 200,
#   'size': 2326,
#   'referrer': 'http://www.example.com/',
#   'user_agent': 'Mozilla/5.0'
# }

# Built-in format: Nginx
parser = LogParser('nginx')

# Built-in format: Syslog
parser = LogParser('syslog')
entry = parser.parse('Oct 11 22:14:15 server sshd[12345]: Failed password for root from 192.168.1.1')
# {
#   'timestamp': ...,
#   'host': 'server',
#   'program': 'sshd',
#   'pid': 12345,
#   'message': 'Failed password for root from 192.168.1.1'
# }

# Custom format definition
custom = LogParser.define(
    r'^\[(?P<level>\w+)\] (?P<timestamp>[\d-]+ [\d:]+) (?P<message>.*)',
    {'timestamp': '%Y-%m-%d %H:%M:%S'}
)
entry = custom.parse('[ERROR] 2024-10-11 13:55:36 Database connection failed')

# Analyze logs
results = parser.parse_file('access.log')
print(f"Total requests: {len(results)}")
print(f"Unique IPs: {len(set(r['ip'] for r in results))}")
print(f"Errors (5xx): {sum(1 for r in results if r['status'] >= 500)}")

# Stream large files
for entry in parser.stream('huge.log'):
    if entry['status'] >= 400:
        print(f"Error: {entry['path']} - {entry['status']}")

Implementation Hints:

Apache Combined Log Format pattern:

APACHE_COMBINED = r'''
    ^
    (?P<ip>\S+)\s+                       # IP address
    (?P<ident>\S+)\s+                    # Ident (usually -)
    (?P<user>\S+)\s+                     # User (or -)
    \[(?P<timestamp>[^\]]+)\]\s+         # Timestamp [...]
    "(?P<method>\w+)\s+                  # Request method
     (?P<path>\S+)\s+                    # Request path
     (?P<protocol>[^"]+)"\s+             # Protocol
    (?P<status>\d+)\s+                   # Status code
    (?P<size>\d+|-)\s+                   # Response size
    "(?P<referrer>[^"]*)"\s+             # Referrer
    "(?P<user_agent>[^"]*)"              # User agent
    $
'''

Handling quoted strings with escapes:

# Simple (no escapes): "[^"]*"
# With escapes: "(?:[^"\\]|\\.)*"
#                 │        │
#                 │        └── or escaped anything
#                 └── non-quote, non-backslash

QUOTED_STRING = r'"(?:[^"\\]|\\.)*"'

Parsing timestamps:

import re
from datetime import datetime

MONTHS = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
          'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}

def parse_apache_time(timestamp):
    # [10/Oct/2024:13:55:36 -0700]
    pattern = r'(\d{2})/(\w{3})/(\d{4}):(\d{2}):(\d{2}):(\d{2})'
    match = re.search(pattern, timestamp)   # search tolerates a leading '['
    if match:
        day, month, year, hour, minute, second = match.groups()
        # Convert month name to number, build datetime (offset ignored here)
        return datetime(int(year), MONTHS[month], int(day),
                        int(hour), int(minute), int(second))

Learning milestones:

  1. Single format parses → You understand complex patterns
  2. Quoted strings work → You understand escape handling
  3. Custom formats work → You understand pattern composition
  4. Large files stream → You understand generator patterns

The Core Question You’re Answering

“How do you extract structure from semi-structured text while ignoring the noise?”

Log files are messy. They have timestamps in 100 different formats, IP addresses that might be IPv4 or IPv6, and messages that contain escaped quotes. This project teaches you how to build a robust parser that can handle these variations without breaking on a single malformed line.

Concepts You Must Understand First

  1. Greediness vs. Laziness (* vs *?)
    • In a log line like [10/Oct/2024] "GET /index.html", why would \[.*\] match more than you want?
    • When is [^\]]+ a better choice than the lazy .*? here?
  2. Named Capture Groups ((?P<name>...))
    • How does mapping matches to names (like ‘ip’, ‘status’, ‘path’) make your downstream code cleaner?
    • Book Reference: “Regular Expressions Cookbook” Ch. 3.2
  3. Escaping Characters in Quotes
    • How do you match a string that starts and ends with " but might contain \" inside?
    • Book Reference: “Mastering Regular Expressions” Ch. 6 (The “Matching a Quoted String” section)
  4. Regex Flags (Multi-line and Dot-all)
    • What if a log entry spans multiple lines (like a Java stack trace)?
    • How do re.MULTILINE and re.DOTALL change the behavior of ^, $, and .?

Questions to Guide Your Design

  1. Timestamp Normalization
    • Apache uses [10/Oct/2024:13:55:36 -0700]. How do you convert this string into a standard ISO-8601 datetime object?
  2. Handling the “Hyphen” (-)
    • In logs, a - often means “no data available”. How does your regex handle \d+ (for size) when the log contains -?
  3. Performance on Large Files
    • If you have a 10GB log file, should you read the whole file into memory? How do you apply a regex to a stream?
  4. Security/Sanitization
    • Log files often contain malicious input from attackers (e.g., Log4Shell attempts). How do you safely parse logs that might contain control characters or long strings intended to cause catastrophic backtracking?

Thinking Exercise

Manually Tokenize a Log Line

Take this Nginx log line and identify the “boundaries”: 127.0.0.1 - - [15/Dec/2024:12:01:05 +0000] "POST /api/v1/login HTTP/1.1" 401 531 "-" "curl/7.68.0"

  1. Identify the IP (Stop at the first space).
  2. Skip the Ident and User (Usually two hyphens).
  3. Identify the Time (Everything inside the brackets []).
  4. Identify the Request (Everything inside the first set of quotes "").
  5. Identify the Status and Size (Two numbers after the request).
  6. Identify the Referrer (Everything inside the second set of quotes).
  7. Identify the User Agent (Everything inside the third set of quotes).

Exercise: Write a “mini-regex” for each of these 7 components.

The Interview Questions They’ll Ask

  1. “How would you write a regex to match an IPv4 address? What about IPv6?”
  2. “What is the difference between \S+ and .*? when parsing log fields?”
  3. “How do you prevent a ‘Log Injection’ attack?”
  4. “Explain ‘Catastrophic Backtracking’ in the context of parsing long log messages.”
  5. “How would you efficiently find the top 10 most frequent IP addresses in a log file using Python and regex?”

Hints in Layers

Hint 1: Use re.VERBOSE A combined log format regex is nearly impossible to read on one line. Split it up:

LOG_PATTERN = re.compile(r"""
    (?P<ip>\S+)\s+
    (?P<user>\S+)\s+
    \[(?P<time>.*?)\]\s+
    "(?P<request>.*?)"
""", re.VERBOSE)

Hint 2: Handling the “Optional” Quotes Sometimes fields are quoted, sometimes they are hyphens. Use alternation: (?:"(?P<field>.*?)"|(?P<raw>\S+)).

Hint 3: Data Type Conversion Don’t just extract strings. Use a dictionary or a class to store the results and convert the status to int and time to a datetime object immediately after matching.
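
A sketch of that conversion step, assuming named groups like those in the APACHE_COMBINED pattern shown earlier:

from datetime import datetime

def typed_entry(match):
    d = match.groupdict()
    d['status'] = int(d['status'])
    d['size'] = 0 if d['size'] == '-' else int(d['size'])
    # e.g. '10/Oct/2024:13:55:36 -0700'
    d['timestamp'] = datetime.strptime(d['timestamp'], '%d/%b/%Y:%H:%M:%S %z')
    return d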

Hint 4: Streaming with Generators

def log_stream(filename):
    with open(filename) as f:
        for line in f:
            match = LOG_PATTERN.match(line)
            if match:
                yield match.groupdict()

Books That Will Help

Topic Book Chapter
Apache Common/Combined Log Format Apache HTTP Server docs https://httpd.apache.org/docs/2.4/logs.html
Parsing Apache/Nginx logs “Regular Expressions Cookbook” Ch. 8.11
Mastering Greediness/Laziness “Mastering Regular Expressions” Ch. 4
High-performance log processing “High Performance Python” Ch. 8 (I/O)
Regex for Date/Time “Regular Expressions Cookbook” Ch. 4.1

Common Pitfalls & Debugging

Problem 1: “Parser misses lines”

  • Why: Missing anchors or optional fields not handled.
  • Fix: Anchor timestamps and make optional fields explicit.
  • Quick test: Run on logs with and without request IDs.

Problem 2: “Multiline stack traces break parsing”

  • Why: Regex assumes one log entry per line.
  • Fix: Detect continuation lines or pre-merge multiline entries.
  • Quick test: Parse a stack trace block and keep it as one entry.

Problem 3: “Performance is too slow”

  • Why: Unanchored patterns scan each line repeatedly.
  • Fix: Use ^ anchors and fixed prefixes where possible.
  • Quick test: Benchmark with 1M log lines.

Problem 4: “Timezone confusion”

  • Why: Timestamps parsed without timezone awareness.
  • Fix: Normalize timezones during parsing.
  • Quick test: Compare parsed timestamps across zones.

Definition of Done

  • Parses timestamps, levels, and message fields reliably
  • Supports optional fields without misclassification
  • Handles multiline entries correctly
  • Processes large logs efficiently (streaming)
  • Outputs structured JSON with consistent schema

Project 7: Markdown Parser

  • File: LEARN_REGEX_DEEP_DIVE.md
  • Main Programming Language: JavaScript
  • Alternative Programming Languages: Python, Rust
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Text Transformation
  • Software or Tool: Markdown to HTML Converter
  • Main Book: “Mastering Regular Expressions” by Jeffrey Friedl

What you’ll build: A Markdown to HTML converter that handles headings, emphasis, links, images, code blocks, lists, and blockquotes—using regex for pattern matching and transformation.

Why it teaches regex: Markdown parsing requires matching nested structures, handling multiple formats for the same element, and transforming captures into output. You’ll push regex to its limits and learn when to combine it with other techniques.

Core challenges you’ll face:

  • Inline formatting → maps to nested and overlapping patterns
  • Links and images → maps to complex capture groups
  • Code blocks → maps to multi-line matching
  • Lists → maps to context-dependent patterns

Key Concepts:

  • Multi-line Mode: “Mastering Regular Expressions” Chapter 3 - Friedl
  • Greedy vs Lazy: “Regular Expressions Cookbook” Chapter 2
  • CommonMark Specification + test suite: https://spec.commonmark.org/

Difficulty: Advanced Time estimate: 2 weeks Prerequisites: Projects 3 and 6

Real World Outcome

const md = require('markdownex');

const markdown = `
# Hello World

This is **bold** and *italic* and ***both***.

Here's a [link](https://example.com) and an ![image](img.png "title").

\`\`\`javascript
const x = 1;
console.log(x);
\`\`\`

- Item 1
- Item 2
  - Nested item

> This is a quote
> that spans multiple lines
`;

const html = md.toHtml(markdown);
// Output:
// <h1>Hello World</h1>
// <p>This is <strong>bold</strong> and <em>italic</em> and <em><strong>both</strong></em>.</p>
// <p>Here's a <a href="https://example.com">link</a> and an <img src="img.png" alt="image" title="title">.</p>
// <pre><code class="language-javascript">const x = 1;
// console.log(x);
// </code></pre>
// <ul>
// <li>Item 1</li>
// <li>Item 2
// <ul>
// <li>Nested item</li>
// </ul>
// </li>
// </ul>
// <blockquote>
// <p>This is a quote that spans multiple lines</p>
// </blockquote>

// Inline parsing only
md.parseInline('**bold** and `code`');
// '<strong>bold</strong> and <code>code</code>'

Implementation Hints:

Markdown patterns (simplified):

const patterns = {
  // Headings: # Heading
  heading: /^(#{1,6})\s+(.+)$/gm,

  // Bold: **text** or __text__
  bold: /\*\*(.+?)\*\*|__(.+?)__/g,

  // Italic: *text* or _text_
  italic: /\*(.+?)\*|_(.+?)_/g,

  // Links: [text](url "title")
  link: /\[([^\]]+)\]\(([^)\s]+)(?:\s+"([^"]+)")?\)/g,

  // Images: ![alt](src "title")
  image: /!\[([^\]]*)\]\(([^)\s]+)(?:\s+"([^"]+)")?\)/g,

  // Inline code: `code`
  inlineCode: /`([^`]+)`/g,

  // Code blocks: ```lang\ncode\n```
  codeBlock: /```(\w*)\n([\s\S]*?)```/g,

  // Blockquotes: > text
  blockquote: /^>\s?(.*)$/gm,

  // Unordered lists: - item or * item
  unorderedList: /^[\s]*[-*]\s+(.+)$/gm,

  // Ordered lists: 1. item
  orderedList: /^[\s]*\d+\.\s+(.+)$/gm,

  // Horizontal rule: --- or ***
  hr: /^[-*]{3,}$/gm
};

Processing order matters:

function parseMarkdown(text) {
  // 1. First handle code blocks (protect from other processing)
  text = text.replace(patterns.codeBlock, (match, lang, code) => {
    return `<pre><code class="language-${lang}">${escapeHtml(code)}</code></pre>`;
  });

  // 2. Handle block-level elements
  text = text.replace(patterns.heading, (match, hashes, content) => {
    const level = hashes.length;
    return `<h${level}>${content}</h${level}>`;
  });

  // 3. Handle inline elements (after blocks)
  text = text.replace(patterns.bold, '<strong>$1$2</strong>');
  text = text.replace(patterns.italic, '<em>$1$2</em>');
  text = text.replace(patterns.link, '<a href="$2" title="$3">$1</a>');

  return text;
}

Handling nested emphasis:

***bold and italic***

Must handle: <em><strong>bold and italic</strong></em>

Pattern: \*\*\*(.+?)\*\*\*
         OR process from outside in

Learning milestones:

  1. Headings and emphasis work → You understand basic patterns
  2. Links and images work → You understand complex captures
  3. Code blocks preserve content → You understand multi-line
  4. Nested formatting works → You understand processing order

The Core Question You’re Answering

“How do you transform structured text formats using regex while respecting the hierarchy and nesting rules that define the format?”

This project reveals a fundamental truth about text processing: Markdown is not just syntax, it is a contract between writer and renderer. Block-level elements (headings, code blocks, lists) form the document structure, while inline elements (emphasis, links) decorate the content within blocks. Your parser must understand this hierarchy or produce invalid HTML.

The deeper insight: Real-world parsers do not process text linearly. Code blocks must be protected before any other processing because backticks inside code should remain literal, not trigger inline parsing. This “protection-first” pattern appears throughout language processing.

Concepts You Must Understand First

Block-Level vs Inline Elements

Before you can parse Markdown, you must understand HTML’s block/inline distinction:

  • Block elements: Create structural divisions (paragraphs, headings, lists, blockquotes)
  • Inline elements: Decorate content within blocks (emphasis, links, code spans)

“HTML5: The Definitive Guide” by Matthew MacDonald, Chapter 4 explains this model in depth.

Multi-Line Matching

Standard regex . does not match newlines. For code blocks and blockquotes that span lines, you need:

  • [\s\S]*? - Match any character including newlines (non-greedy)
  • The s (dotall) flag in some engines
  • “Mastering Regular Expressions” Chapter 3, “Matching Newlines” section

Processing Order and Protection

Why must code blocks be processed first? Consider:

Here's some `code with **not bold**` in it.

If you process emphasis first, you will incorrectly wrap “not bold” in <strong> tags. “Parsing Techniques” by Grune and Jacobs, Chapter 2 discusses lexical analysis priority.

The CommonMark Specification

CommonMark (commonmark.org) standardizes Markdown parsing. Read sections on:

  • Emphasis and strong emphasis (the hardest part)
  • Links and images
  • Code spans and fenced code blocks
  • The appendix on parsing strategy

Questions to Guide Your Design

  1. Block Detection: How do you identify where a block element begins and ends? Consider blank lines, indentation, and special prefixes (#, >, -, 1.).

  2. Protection Strategy: What data structure stores “protected” regions (code blocks) so inline processing skips them? A simple approach: replace code blocks with placeholders, process inline, then restore.

  3. Emphasis Ambiguity: Given _foo_bar_, is “foo” emphasized or is “foo_bar” emphasized? How does CommonMark resolve this? (Hint: flanking delimiter runs)

  4. Nesting Limits: ***bold and italic*** should produce <em><strong>...</strong></em>. How do you handle arbitrary nesting depth? Recursive patterns? Multiple passes?

  5. Link vs Image: Links are [text](url), images are ![alt](url). How do you ensure images are matched first (otherwise the ! becomes literal text)?

  6. Escaping: How do you handle \*not emphasis\*? Backslash escapes must be processed before emphasis patterns match.

  7. Reference Links: [text][ref] with [ref]: url defined elsewhere. How do you collect definitions and resolve references?

Thinking Exercise

Before writing any code, draw out these scenarios on paper:

Scenario 1: Processing Pipeline

Input: "# Hello **world**"

Draw the step-by-step transformation showing:

  1. Initial state
  2. After heading detection
  3. After emphasis processing
  4. Final HTML output

Scenario 2: Protection Mechanism

Input: "Code: `**not bold**` and **bold**"

Draw how you would:

  1. Identify and mark the code span
  2. Store it with a placeholder token
  3. Process emphasis (which should only affect “bold”)
  4. Restore the code span
  5. Show the final HTML

Scenario 3: Emphasis Edge Cases For each input, predict the output and explain why:

  • foo_bar_baz (underscores within word)
  • _foo bar_ (underscores at word boundaries)
  • __foo__bar (double underscore followed by text)
  • *foo**bar**baz* (mixed emphasis)

Refer to CommonMark specification Appendix A for the algorithm.

The Interview Questions They’ll Ask

  1. “Why can’t you just use a single regex to convert Markdown to HTML?”
    • Expected answer: Markdown has context-dependent rules (emphasis depends on surrounding characters), recursive structures (nested lists), and order-dependent processing (code blocks protect content). A single regex cannot express these relationships.
  2. “What’s the time complexity of your Markdown parser?”
    • Explore: Linear vs quadratic behavior. Naive nested emphasis handling can be O(n^2) or worse. CommonMark’s algorithm is designed for O(n) parsing.
  3. “How do you handle malformed Markdown?”
    • Key insight: Markdown was designed to degrade gracefully. Unmatched delimiters become literal text. Your parser should never throw errors, just fall back to literal interpretation.
  4. “Explain the difference between ATX and Setext headings.”
    • ATX: # Heading (hash prefix)
    • Setext: Heading\n====== (underline style)
    • Why both? Setext is more readable in plain text but limited to two levels.
  5. “How would you extend your parser to support custom syntax (like GitHub-flavored Markdown)?”
    • Plugin architecture discussion. Pattern injection. Block extension points vs inline extension points.
  6. “What’s the most common bug in Markdown parsers?”
    • Often: Incorrect emphasis parsing around punctuation. The CommonMark spec has extensive rules about “left-flanking” and “right-flanking” delimiter runs.
  7. “How would you implement syntax highlighting for code blocks?”
    • Discussion of integration points. The parser extracts language hints and code content; a separate highlighter (like Prism or Highlight.js) handles the actual highlighting.

Hints in Layers

Layer 1 - Getting Started: Start with only headings (# ) and paragraphs (text separated by blank lines). Use ^(#{1,6})\s+(.+)$ with multiline flag to match headings. Everything else becomes a paragraph.

Layer 2 - Adding Structure: Process code blocks before anything else. Use a two-pass approach (sketched just after this list):

  1. Pass 1: Replace all fenced code blocks with placeholder tokens like <<<CODE_0>>>
  2. Pass 2: Process all other patterns
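
A minimal sketch of this two-pass protection (shown in Python for brevity; the same callback-based substitution works with String.replace in JavaScript):

import re

def protect_code_blocks(text):
    blocks = []
    def stash(match):
        blocks.append(match.group(0))
        return f'<<<CODE_{len(blocks) - 1}>>>'
    # Pass 1: swap each fenced block for a placeholder token
    return re.sub(r'```\w*\n[\s\S]*?```', stash, text), blocks

def restore_code_blocks(text, blocks):
    for i, block in enumerate(blocks):
        text = text.replace(f'<<<CODE_{i}>>>', block)
    return text

md = "before\n```js\nconst x = 1;\n```\nafter"
safe, saved = protect_code_blocks(md)
# ... run heading/emphasis passes on `safe` here ...
print(restore_code_blocks(safe, saved) == md)   # True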

The Core Question You’re Answering

“Why can regex extract data from HTML but cannot properly parse it, and how do you work within these limitations to build practical extraction tools?”

This project teaches one of regex’s most important lessons through experience: understanding when NOT to use it. HTML is a context-free grammar (requiring a stack to parse nested tags), while regex implements regular grammars (no memory of nesting depth). Yet for practical data extraction, regex often works well enough when you understand its boundaries.

The deeper insight: Real-world web scraping is not about parsing HTML perfectly, it is about extracting the specific data you need. When extracting prices, emails, or text from known tag patterns, regex is often simpler and faster than a full DOM parser. The skill is knowing which approach fits which problem.

Concepts You Must Understand First

Why Regex Cannot Parse HTML

The famous Stack Overflow answer explains: HTML tags can be nested arbitrarily deep. Regex has no mechanism to count nesting depth or match balanced tags. The pattern <div>.*?</div> will fail on <div><div>inner</div></div> because it matches the first </div>, leaving an unclosed outer tag.

“Mastering Regular Expressions” by Friedl, Chapter 6 (“Crafting an Efficient Expression”) discusses the theoretical limits of regex.

Non-Greedy Matching

The difference between .* (greedy) and .*? (non-greedy) is critical for HTML:

  • Greedy: Match as much as possible, then backtrack
  • Non-greedy: Match as little as possible, expanding only if needed

“Mastering Regular Expressions” Chapter 4, “Quantifier greediness” section explains the matching mechanics.

Alternation for Variations

Web data comes in many formats. Use alternation (|) to match multiple patterns:

\$[\d,]+\.?\d*|EUR\s*[\d,]+\.?\d*|[\d,]+\.?\d*\s*USD

“Regular Expressions Cookbook” Chapter 2, “Alternation” section.

findall vs finditer

Python’s re module offers two approaches:

  • findall(): Returns list of matched strings (or tuples for groups)
  • finditer(): Returns iterator of Match objects (includes positions)

Use finditer() when you need match positions for context extraction.
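
A quick comparison, using the simple email pattern that appears later in this project:

import re

text = "Contact support@example.com or sales@example.org"
pattern = r'[\w.+-]+@[\w.-]+'

print(re.findall(pattern, text))
# ['support@example.com', 'sales@example.org']

for m in re.finditer(pattern, text):
    # Match objects carry positions, which findall throws away
    print(m.group(), m.start(), m.end())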

Questions to Guide Your Design

  1. Greedy vs Non-Greedy: Given <span>Price: $10</span><span>Tax: $2</span>, what does <span>.*</span> match? What about <span>.*?</span>? Why does one give you two matches while the other gives only one?

  2. Pattern Fragility: Your pattern <div class="price">([^<]+)</div> works today. Tomorrow the site adds <span> inside the div. How do you make patterns more resilient to minor HTML changes?

  3. Context Windows: To extract “the paragraph containing this email address”, how do you capture text before and after the match? Consider lookahead/lookbehind vs capturing groups.

  4. Multiple Format Handling: Phone numbers appear as 555-123-4567, (555) 123-4567, 555.123.4567, and +1 555 123 4567. How do you match all variations? Should you normalize during or after extraction?

  5. False Positives: Your price pattern \$[\d.]+ matches $10.99 (correct) but also matches $ followed by unrelated numbers in URLs. How do you reduce false matches?

  6. Character Encoding: HTML entities like &amp;, &#x27;, &nbsp; appear in content. When should you decode these, and how does it affect your patterns?

  7. Rate of Change: Websites change layouts frequently. How do you design extraction patterns that are specific enough to be accurate but general enough to survive minor updates?

Thinking Exercise

Before writing any code, work through these scenarios on paper:

Scenario 1: Greedy Catastrophe

<div class="product">
  <h2>Widget</h2>
  <span class="price">$19.99</span>
</div>
<div class="product">
  <h2>Gadget</h2>
  <span class="price">$29.99</span>
</div>

Pattern: <div class="product">.*<span class="price">([^<]+)</span>.*</div>

Trace the greedy .* step by step:

  1. Where does the first .* end up?
  2. Why do you get one match instead of two?
  3. Rewrite with .*? and trace again

Scenario 2: Context Extraction

Contact us at support@example.com or call 555-1234.
For sales inquiries: sales@example.com

Design a pattern that extracts each email WITH its preceding context (the sentence fragment before it). Consider:

  • How much context is “enough”?
  • How do you handle emails at the start of text?

Scenario 3: Alternation Order For these date formats, write an alternation pattern and explain why order matters:

  • 2024-10-15 (ISO)
  • 10/15/2024 (US)
  • 15/10/2024 (EU)
  • October 15, 2024 (Long)

What happens if you put \d{2}/\d{2}/\d{4} before \d{4}-\d{2}-\d{2} in the alternation?

The Interview Questions They’ll Ask

  1. “Why not just use BeautifulSoup or an HTML parser for everything?”
    • Discuss: Performance (regex is often 10-100x faster for simple extractions), simplicity (no dependency, no DOM traversal code), and use cases where structure does not matter (extracting all emails from a page regardless of where they appear).
  2. “How do you handle pages that load content dynamically with JavaScript?”
    • Key insight: Regex operates on text. If content is not in the HTML source, you need browser automation (Selenium, Playwright) or API interception. This is a fundamental limitation, not of regex, but of static scraping.
  3. “What’s the difference between findall() and finditer() in Python? When would you use each?”
    • findall returns values only; finditer returns Match objects with .start(), .end(), .group(), .groups(). Use finditer when you need position information or want to avoid loading all matches into memory.
  4. “How would you extract product data from an e-commerce page you’ve never seen before?”
    • Methodology discussion: View page source, identify patterns (CSS classes, data attributes, consistent HTML structure), start with specific patterns, generalize as needed, test against multiple pages.
  5. “How do you make web scraping patterns maintainable?”
    • Best practices: Named capture groups, comments (using verbose mode), pattern constants, automated tests against sample HTML, monitoring for extraction failures.
  6. “What are the ethical and legal considerations of web scraping?”
    • Important discussion: robots.txt, Terms of Service, rate limiting, authentication boundaries, personal data (GDPR), copyright. Technical capability does not imply permission.
  7. “How would you extract text from a tag that might contain nested tags?”
    • Technique: Match the outer tag, then strip inner tags separately. Or use non-greedy matching with negative lookahead. Acknowledge the limits of regex here.
  8. “What’s the difference between [^<]* and .*? when matching tag content?”
    • [^<]* is a character class exclusion (more efficient, cannot match tags)
    • .*? is non-greedy any character (can accidentally span into nested tags)
    • Character class approach is generally preferred for HTML content

Hints in Layers

Layer 1 - Getting Started: Start with the simplest possible extraction: pull all email addresses from a page using [\w.+-]+@[\w-]+\.[\w.-]+. Use re.findall() and print results. This works regardless of HTML structure.
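
As a sketch (page.html is a placeholder for any saved page):

import re

html = open('page.html', encoding='utf-8').read()
emails = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', html)
print(sorted(set(emails)))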

Layer 2 - Adding Structure: Extract data from known patterns. For a price in <span class="price">$19.99</span>:

pattern = r'<span class="price">\$?([\d,]+\.?\d*)</span>'
prices = re.findall(pattern, html)

Note: Use raw strings (r'...') to avoid escape confusion.

Layer 3 - Handling Variations: Combine patterns with alternation and use non-capturing groups for efficiency:

price_pattern = r'(?:\$|USD\s*)([\d,]+\.?\d*)'  # Matches $19.99 or USD 19.99

Build a library of reusable patterns for common data types.

Layer 4 - Production Quality: Add context extraction using lookbehind/lookahead or wider capture groups. Implement pattern versioning so you can update patterns when sites change. Consider using regex only for the initial match, then BeautifulSoup for refinement when structure matters.

Books That Will Help

Topic Book Chapter/Section
Non-greedy matching “Mastering Regular Expressions” by Friedl Chapter 4: Greediness and Laziness
findall and finditer Python re module documentation Module Functions section
Practical extraction patterns “Regular Expressions Cookbook” by Goyvaerts Chapter 8: Markup and Data Extraction
HTML structure “Web Scraping with Python” by Mitchell Chapter 2: HTML and CSS Parsing
Regex limitations “Mastering Regular Expressions” by Friedl Chapter 6: Crafting Efficient Expressions
Alternation and grouping “Regular Expressions Cookbook” by Goyvaerts Chapter 2: Grouping and Alternation
Web scraping ethics “Web Scraping with Python” by Mitchell Chapter 1: Ethics and Legality

Layer 3 - Inline Processing: For emphasis, process in order: code spans, images, links, strong emphasis (**), then regular emphasis (*). The order matters because:

  • Images must match before their [text] is consumed as a link
  • ** must match before * consumes its asterisks

Layer 4 - Production Quality: Read the CommonMark parsing strategy in Appendix A. Implement the “delimiter stack” algorithm for emphasis parsing. This handles all edge cases correctly and runs in linear time. Consider building an AST (Abstract Syntax Tree) rather than directly outputting HTML, enabling multiple output formats.

Books That Will Help

Topic Book Chapter/Section
Regex multi-line matching “Mastering Regular Expressions” by Friedl Chapter 3: Matching Newlines
Non-greedy quantifiers “Mastering Regular Expressions” by Friedl Chapter 4: Lazy Quantifiers
HTML block/inline model “HTML5: The Definitive Guide” by MacDonald Chapter 4: Document Structure
Markdown specification CommonMark Spec https://spec.commonmark.org/ (full spec + Appendix A)
Text processing patterns “Beautiful Code” edited by Oram & Wilson Chapter 1: Regular Expression Matching
Parser design “Parsing Techniques” by Grune & Jacobs Chapter 2: Lexical Analysis
Practical implementation “Regular Expressions Cookbook” by Goyvaerts Recipe 8.8: Parsing Markup

Common Pitfalls & Debugging

Problem 1: “Nested constructs fail”

  • Why: Regex alone cannot correctly parse nested structures like lists or emphasis.
  • Fix: Use staged parsing (block-level, then inline) with multiple passes.
  • Quick test: Parse nested lists and nested emphasis.

Problem 2: “Code blocks are mangled”

  • Why: Greedy patterns span beyond fences.
  • Fix: Anchor code fences and avoid DOTALL unless scoped.
  • Quick test: Use a file with multiple fenced blocks.

Problem 3: “Inline links break on parentheses”

  • Why: URL patterns are too greedy.
  • Fix: Limit URL patterns or parse with a small state machine.
  • Quick test: Parse [link](https://example.com/test_(1)).

Problem 4: “Performance degrades on large files”

  • Why: Multiple passes with expensive regex.
  • Fix: Pre-tokenize or limit regex scans to relevant sections.
  • Quick test: Parse a 1MB markdown file.

Definition of Done

  • Correctly parses headings, lists, code fences, links
  • Handles nested or ambiguous constructs with defined behavior
  • Provides consistent AST or HTML output
  • Includes tests for edge cases and ambiguous markdown
  • Does not hang on malformed input

Project 8: Data Extractor (Web Scraping)

  • File: LEARN_REGEX_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: JavaScript, Go
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Data Extraction / Web Scraping
  • Software or Tool: Web Scraper
  • Main Book: “Regular Expressions Cookbook” by Goyvaerts & Levithan

What you’ll build: A data extraction tool that can pull structured data from web pages using regex patterns—extracting prices, dates, emails, phone numbers, and custom patterns.

Why it teaches regex: Web scraping forces you to handle messy, inconsistent HTML. You’ll learn to write flexible patterns that handle variations and extract multiple pieces of data from single matches.

Core challenges you’ll face:

  • HTML structure → maps to tag matching (and its limits)
  • Inconsistent formatting → maps to flexible patterns with alternation
  • Multiple items → maps to findall and iteration
  • Greedy matching pitfalls → maps to non-greedy quantifiers

Key Concepts:

  • Non-Greedy Matching: “Mastering Regular Expressions” Chapter 4 - Friedl
  • findall vs finditer: Python re module documentation
  • Why not regex for HTML: Stack Overflow famous answer (for context)

Difficulty: Intermediate Time estimate: 1-2 weeks Prerequisites: Projects 5 and 6

Real World Outcome

from extractor import DataExtractor

# Initialize with HTML content
ex = DataExtractor(html_content)

# Built-in extractors
prices = ex.extract_prices()
# ['$19.99', '$149.00', '€25.50']

emails = ex.extract_emails()
# ['contact@example.com', 'support@example.com']

phones = ex.extract_phones()
# ['(555) 123-4567', '+1-555-987-6543']

dates = ex.extract_dates()
# ['2024-10-15', 'October 15, 2024', '10/15/24']

# Extract with context (get surrounding text)
results = ex.extract_prices(context=True)
# [
#   {'value': '$19.99', 'context': 'Sale price: $19.99 (was $29.99)'},
#   {'value': '$149.00', 'context': 'Total: $149.00 including shipping'}
# ]

# Custom patterns
product_pattern = r'<div class="product">\s*<h2>([^<]+)</h2>\s*<span class="price">\$([0-9.]+)</span>'
products = ex.extract_custom(product_pattern, ['name', 'price'])
# [
#   {'name': 'Widget Pro', 'price': '19.99'},
#   {'name': 'Gadget Plus', 'price': '29.99'}
# ]

# Extract all matches with positions
for match in ex.find_all(r'\b[A-Z]{2,}\b'):
    print(f"Acronym: {match.text} at position {match.start}")

# Pipeline extraction
ex.pipeline([
    ('title', r'<title>([^<]+)</title>'),
    ('meta_desc', r'<meta name="description" content="([^"]+)"'),
    ('h1_tags', r'<h1[^>]*>([^<]+)</h1>'),
])
# {
#   'title': 'My Page Title',
#   'meta_desc': 'Page description here',
#   'h1_tags': ['Welcome', 'About Us']
# }

Implementation Hints:

Price extraction pattern:

PRICE_PATTERNS = [
    r'\$[\d,]+\.?\d*',           # $19.99, $1,234.56
    r'€[\d,]+\.?\d*',            # €25.50
    r'£[\d,]+\.?\d*',            # £15.00
    r'USD\s*[\d,]+\.?\d*',       # USD 19.99
    r'[\d,]+\.?\d*\s*(?:USD|EUR|GBP)',  # 19.99 USD
]

def extract_prices(text):
    # Non-capturing groups keep findall returning the full matched strings
    combined = '|'.join(f'(?:{p})' for p in PRICE_PATTERNS)
    return re.findall(combined, text)

Why greedy matching fails with HTML:

# BAD: Greedy matching
pattern = r'<div class="product">.*</div>'
# Will match from first <div> to LAST </div> in document!

# GOOD: Non-greedy matching
pattern = r'<div class="product">.*?</div>'
# Matches first complete <div>...</div>

# BETTER: Be more specific
pattern = r'<div class="product">[^<]*(?:<(?!/div>)[^<]*)*</div>'
# Matches content that doesn't contain </div>

Handling variations:

# Date patterns - many formats
DATE_PATTERNS = [
    r'\d{4}-\d{2}-\d{2}',              # 2024-10-15
    r'\d{2}/\d{2}/\d{4}',              # 10/15/2024
    r'\d{2}/\d{2}/\d{2}',              # 10/15/24
    r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\s+\d{1,2},?\s+\d{4}',  # October 15, 2024
    r'\d{1,2}\s+(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\s+\d{4}',     # 15 October 2024
]

Learning milestones:

  1. Simple extractions work → You understand findall
  2. Variations handled → You understand alternation
  3. Context extracted → You understand groups
  4. HTML pitfalls understood → You know when NOT to use regex

The Core Question You’re Answering

“How can you extract structured data from a format (HTML) that was never designed for structured extraction?”

Web scraping is a battle against the “Document Object Model”. While you could use a full HTML parser (like BeautifulSoup), regex is often 100x faster for specific data points like prices or IDs. This project teaches you how to look for the “fingerprints” of data within raw HTML.

Concepts You Must Understand First

  1. Greedy vs. Lazy Matching in HTML
    • If you use <div>.*</div>, it will match from the first div to the last div on the page.
    • You must master .*? to match individual elements.
  2. Lookaround Assertions ((?=...), (?<=...))
    • How do you match a price only if it’s inside a span with class="price" without including the HTML tags in the match?
    • Book Reference: “Mastering Regular Expressions” Ch. 2 (Lookahead section)
  3. Handling HTML Entities
    • How do you match &amp; as & or &copy; as ©?
    • Should you decode the HTML before or after applying the regex?
  4. Regex for Numbers and Currencies
    • Matching $1,234.56, €12.00, and 100 USD with one pattern.

Questions to Guide Your Design

  1. The “Fragility” Problem
    • If the website changes <div class="price"> to <span class="price">, will your regex break? How can you write a pattern that’s flexible about the tag name? (Hint: <\w+ class="price">)
  2. Extracting Attributes
    • How do you extract the src from <img src="image.jpg" alt="test"> regardless of the order of attributes?
  3. Handling Newlines in HTML
    • HTML often has newlines between tags. Does your . match newlines? How do you use the re.DOTALL flag?
  4. Filtering False Positives
    • An email regex might match user@example.com, but it might also match foo@bar in a piece of JavaScript. How do you ensure you’re only looking inside the “body” of the HTML?

Thinking Exercise

Trace a Scraper Pattern

Given this HTML:

<tr class="row">
    <td>Product A</td>
    <td class="price">$15.00</td>
</tr>
<tr class="row">
    <td>Product B</td>
    <td class="price">$25.50</td>
</tr>

  1. Write a regex to match one row: r'<tr class="row">.*?</tr>' (using DOTALL).
  2. Add capturing groups for the name and price: r'<td>(?P<name>.*?)</td>.*?class="price">(?P<price>.*?)</td>'
  3. What happens if there’s a nested <td>? How would you fix it?

Question: Why is [^<]+ often better than .*? for matching content between HTML tags?

The Interview Questions They’ll Ask

  1. “Why is it generally bad practice to parse HTML with regex?”
  2. “When IS it a good idea to use regex for web scraping?”
  3. “What is a ‘Possessive Quantifier’ and how does it prevent the engine from getting stuck in HTML?”
  4. “How do you handle robots.txt and rate limiting in your scraper?”
  5. “Explain the difference between re.search and re.findall when scraping a page with multiple items.”

Hints in Layers

Hint 1: Start with re.findall Web scraping usually involves finding many instances of data. re.findall is your best friend here.

Hint 2: The “Anchor to a Class” Trick Instead of trying to parse the whole table structure, look for a unique string near your data. If every price is next to class="price", start there: r'class="price">([^<]+)'.

Hint 3: Use Non-Greedy everywhere When matching between tags, .*? is safer than .*. But even safer is [^<]+ (which means “match any character that isn’t a <”).

Hint 4: Clean the data after extraction Regex extracts text. You’ll likely need to run strip(), replace &nbsp;, and convert to float in a separate Python step.
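
A small post-processing sketch (clean_price is a hypothetical helper; \xa0 is the decoded form of &nbsp;):

import re

def clean_price(raw):
    # ' $1,234.56\xa0'  ->  1234.56
    text = raw.replace('\xa0', ' ').strip()
    return float(re.sub(r'[^\d.]', '', text))

print(clean_price(' $1,234.56\xa0'))   # 1234.56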

Books That Will Help

Topic Book Chapter
Web Scraping with Regex “Regular Expressions Cookbook” Ch. 8
Why not parse HTML with regex? “Mastering Regular Expressions” Ch. 6 (The “Limits of Regex” section)
BeautifulSoup vs. Regex “Web Scraping with Python” (Mitchell) Ch. 2
Handling Unicode in Scraping “Fluent Python” Ch. 4 (Unicode)

Common Pitfalls & Debugging

Problem 1: “Patterns break on HTML changes”

  • Why: Regex is brittle against structural changes.
  • Fix: Combine regex with a DOM parser for resilience.
  • Quick test: Run on two versions of the same page layout.

Problem 2: “Matches skip content across lines”

  • Why: Dot does not match newlines without DOTALL.
  • Fix: Use (?s) or explicit [\s\S] when needed.
  • Quick test: Extract across line breaks.

Problem 3: “Extraction includes unwanted markup”

  • Why: Patterns are too greedy.
  • Fix: Use non-greedy quantifiers and anchor around tags.
  • Quick test: Extract a title without HTML tags.

Problem 4: “Encoding issues”

  • Why: Input is not decoded as UTF-8.
  • Fix: Normalize encodings before regex.
  • Quick test: Extract from pages with non-ASCII characters.

Definition of Done

  • Extracts structured fields reliably across sample pages
  • Handles line breaks and minified HTML
  • Uses safe defaults to avoid catastrophic backtracking
  • Provides a clean JSON output schema
  • Includes a regression set with 3+ site templates

Project 9: Tokenizer / Lexer

  • File: LEARN_REGEX_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: JavaScript, Rust, Go
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Compilers / Parsing
  • Software or Tool: Lexer (Tokenizer)
  • Main Book: “Compilers: Principles and Practice” by Dave & Dave

What you’ll build: A lexer (tokenizer) that breaks source code into tokens using regex patterns—handling keywords, identifiers, numbers, strings, operators, and comments.

Why it teaches regex: Lexers are the classic application of regex theory. You’ll learn how regex patterns become finite automata, how to handle overlapping patterns with priorities, and how to process input efficiently.

Core challenges you’ll face:

  • Token priority → maps to pattern ordering
  • String literals → maps to handling escapes
  • Comments → maps to multi-line patterns
  • Error recovery → maps to handling unmatched input

Key Concepts:

  • Lexical Analysis: “Compilers: Principles and Practice” Chapter 2
  • Token Priority: “Engineering a Compiler” Chapter 2
  • Finite Automata: “Introduction to Automata Theory” Chapter 2

Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Projects 6 and 7

Real World Outcome

from lexer import Lexer, Token

# Define token patterns
lexer = Lexer([
    ('COMMENT',    r'//.*|/\*[\s\S]*?\*/'),  # before OPERATOR, or '//' lexes as an operator
    ('KEYWORD',    r'\b(if|else|while|for|function|return|let|const|var)\b'),
    ('IDENTIFIER', r'[a-zA-Z_][a-zA-Z0-9_]*'),
    ('NUMBER',     r'\d+\.?\d*'),
    ('STRING',     r'"(?:[^"\\]|\\.)*"|\'(?:[^\'\\]|\\.)*\''),
    ('OPERATOR',   r'[+\-*/%=<>!&|]+'),
    ('LPAREN',     r'\('),
    ('RPAREN',     r'\)'),
    ('LBRACE',     r'\{'),
    ('RBRACE',     r'\}'),
    ('SEMICOLON',  r';'),
    ('COMMA',      r','),
    ('WHITESPACE', r'\s+'),
])

code = '''
function greet(name) {
    // This is a comment
    let message = "Hello, " + name;
    return message;
}
'''

tokens = lexer.tokenize(code)
for token in tokens:
    print(token)

# Output:
# Token(KEYWORD, 'function', line=2, col=1)
# Token(IDENTIFIER, 'greet', line=2, col=10)
# Token(LPAREN, '(', line=2, col=15)
# Token(IDENTIFIER, 'name', line=2, col=16)
# Token(RPAREN, ')', line=2, col=20)
# Token(LBRACE, '{', line=2, col=22)
# Token(COMMENT, '// This is a comment', line=3, col=5)
# Token(KEYWORD, 'let', line=4, col=5)
# Token(IDENTIFIER, 'message', line=4, col=9)
# Token(OPERATOR, '=', line=4, col=17)
# Token(STRING, '"Hello, "', line=4, col=19)
# ...

# Error handling
bad_code = 'let x = @#$;'  # Invalid characters
try:
    lexer.tokenize(bad_code)
except LexerError as e:
    print(f"Unexpected character '{e.char}' at line {e.line}, column {e.col}")

# Skip certain tokens
tokens = lexer.tokenize(code, skip=['WHITESPACE', 'COMMENT'])

Implementation Hints:

Lexer implementation:

import re

class Token:
    def __init__(self, type, value, line, col):
        self.type = type
        self.value = value
        self.line = line
        self.col = col

class Lexer:
    def __init__(self, rules):
        # Combine all patterns into one big regex
        # Using named groups for each token type
        parts = []
        for name, pattern in rules:
            parts.append(f'(?P<{name}>{pattern})')
        self.pattern = re.compile('|'.join(parts))
        self.rules = rules

    def tokenize(self, text, skip=None):
        skip = skip or []
        tokens = []
        line = 1
        line_start = 0

        # Note: finditer silently skips characters that no rule matches;
        # for strict lexing, detect gaps between matches and raise LexerError there.
        for match in self.pattern.finditer(text):
            # Find which group matched
            token_type = match.lastgroup
            value = match.group()

            if token_type not in skip:
                col = match.start() - line_start + 1
                tokens.append(Token(token_type, value, line, col))

            # Track line numbers
            newlines = value.count('\n')
            if newlines:
                line += newlines
                line_start = match.end() - len(value.split('\n')[-1])

        return tokens

String literal with escapes:

# Match: "hello", "hello\nworld", "say \"hi\""
STRING = r'"(?:[^"\\]|\\.)*"'
#          │ │       │
#          │ │       └── OR backslash + any char
#          │ └── non-quote, non-backslash chars
#          └── opening quote

# Explanation of (?:[^"\\]|\\.)*:
#   [^"\\]  = any char except " and \
#   \\.     = backslash followed by any char (escape sequence)
#   (?:...)*= repeat this group

Multi-line comments:

# Match /* ... */ comments (can span lines)
MULTILINE_COMMENT = r'/\*[\s\S]*?\*/'
#                       │      │
#                       │      └── non-greedy (stop at first */)
#                       └── any character including newlines

Learning milestones:

  1. Simple tokens work → You understand pattern matching
  2. Keywords vs identifiers → You understand priority
  3. Strings with escapes → You understand complex patterns
  4. Line/column tracking → You understand position bookkeeping

The Core Question You’re Answering

“How do programming languages convert raw text into structured tokens, and why is this the canonical application of regular expression theory?”

Building a lexer connects regex theory to its most important application: language processing. Every programming language, configuration format, and data interchange format starts with lexical analysis. The lexer is where regular expressions meet finite automata, where patterns become state machines, and where you learn why certain constructs require more than regex can offer.

The deeper insight: Token priority reveals why pattern order matters. When “if” could match both KEYWORD and IDENTIFIER patterns, the lexer must choose correctly. This is not just an implementation detail; it is the formalization of how we distinguish reserved words from user-defined names.

Concepts You Must Understand First

Lexical Analysis Theory

A lexer (or lexical analyzer or scanner) is the first phase of a compiler/interpreter. It converts a stream of characters into a stream of tokens. Each token has a type (KEYWORD, IDENTIFIER, NUMBER, etc.) and a value (the actual matched text).

“Compilers: Principles, Techniques, and Tools” (The Dragon Book) by Aho, Lam, Sethi, and Ullman, Chapter 3 (“Lexical Analysis”) is the definitive reference.

Token Priority (Maximal Munch)

When multiple patterns could match at a position, two rules apply:

  1. Longest match wins: “ifvar” matches IDENTIFIER, not KEYWORD “if” followed by IDENTIFIER “var”
  2. First pattern wins on ties: If “if” could be KEYWORD or IDENTIFIER (same length), the first rule in your list wins

“Engineering a Compiler” by Cooper and Torczon, Chapter 2 explains this clearly.
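
A minimal maximal-munch scanner makes both rules concrete. Note that a single combined alternation in Python’s re module uses first-alternative-wins rather than longest-match semantics (which is why the combined-pattern approach above leans on \b guards); the per-rule loop below is a sketch with illustrative rule names that implements longest-match explicitly:

import re

RULES = [
    ('KEYWORD',    re.compile(r'if|else|while')),
    ('IDENTIFIER', re.compile(r'[A-Za-z_]\w*')),
    ('NUMBER',     re.compile(r'\d+')),
    ('WHITESPACE', re.compile(r'\s+')),
]

def scan(text):
    pos = 0
    while pos < len(text):
        best = None  # (length, rule_index, name, lexeme)
        for i, (name, regex) in enumerate(RULES):
            m = regex.match(text, pos)
            # Strict '>' keeps the earlier rule on a length tie (rule 2 above)
            if m and (best is None or len(m.group()) > best[0]):
                best = (len(m.group()), i, name, m.group())
        if best is None:
            raise ValueError(f'Unexpected character {text[pos]!r} at {pos}')
        length, _, name, lexeme = best
        if name != 'WHITESPACE':
            yield (name, lexeme)
        pos += length

print(list(scan('ifvar if while123 123')))
# [('IDENTIFIER', 'ifvar'), ('KEYWORD', 'if'), ('IDENTIFIER', 'while123'), ('NUMBER', '123')]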

String Literals with Escape Sequences

The pattern "(?:[^"\\]|\\.)*" is a classic:

  • [^"\\] matches any character except quote and backslash
  • \\. matches backslash followed by any character (the escape)
  • Together: match either a regular character OR an escape, repeatedly

“Mastering Regular Expressions” Chapter 6 has detailed examples.
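
A quick check of this pattern with Python’s re (the output is shown in repr form, so backslashes appear doubled):

import re

STRING = r'"(?:[^"\\]|\\.)*"'
code = r'x = "hello"; y = "say \"hi\""; z = "line1\nline2"'
print(re.findall(STRING, code))
# ['"hello"', '"say \\"hi\\""', '"line1\\nline2"']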

Multi-Line Comments

Comments like /* ... */ can span lines. The pattern [\s\S]*? matches any character including newlines (non-greedy). The challenge: handle nested comments (which regex cannot do without extensions).

The Connection to Finite Automata

Every regex pattern compiles to a finite automaton (FA). When you combine patterns with |, you are building a combined automaton that can match any of them. This is why lexer generators (like lex/flex) are so efficient: they compile all patterns into one DFA.

“Introduction to Automata Theory, Languages, and Computation” by Hopcroft, Motwani, and Ullman provides the theoretical foundation.

Questions to Guide Your Design

  1. Token Order: Why must KEYWORD patterns come before IDENTIFIER in your rule list? What happens if you reverse them?

  2. Longest Match: Given input “iffy”, how do you ensure the lexer produces IDENTIFIER(“iffy”) rather than KEYWORD(“if”) + IDENTIFIER(“fy”)?

  3. String Escapes: How does "say \"hello\"" get tokenized? Trace through the string pattern step by step.

  4. Comment Nesting: Can /* /* nested */ */ be handled by regex? Why or why not? What languages support nested comments?

  5. Line/Column Tracking: How do you maintain accurate line and column numbers when tokens span multiple lines (like multi-line strings or comments)?

  6. Error Recovery: When you encounter an unexpected character, do you: skip it? include it in an ERROR token? halt? What are the tradeoffs?

  7. Unicode: How do you handle identifiers with Unicode letters (like variable names in non-English)? How does this affect your \w usage?

Thinking Exercise

Before writing any code, work through these scenarios on paper:

Scenario 1: Token Priority

Rules:

KEYWORD:    \b(if|else|while)\b
IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]*
NUMBER:     \d+

Input: if1 if while123 123

For each token:

  1. Which patterns could match at the current position?
  2. Which pattern produces the longest match?
  3. What is the final token type and value?

Scenario 2: String Escapes

Pattern: "(?:[^"\\]|\\.)*"

Input: "hello \"world\"\n"

Trace the pattern matching character by character:

  1. Match opening quote
  2. For each character: does it match [^"\\] or \\.?
  3. When does the pattern complete?
  4. What is the captured string value?

Scenario 3: Line Tracking

Input:

let x = 1;
let y = "multi
line";
let z = 2;

Track the line and column numbers for each token. Consider:

  • Where does “multi” token end?
  • What line/column is “let z” on?

The Interview Questions They’ll Ask

  1. “What’s the difference between a lexer and a parser?”
    • Lexer: Converts character stream to token stream (regular grammar)
    • Parser: Converts token stream to AST (context-free grammar)
    • Key insight: Lexers handle things parsers cannot efficiently (whitespace, comments, string escapes)
  2. “Why do lexers use DFAs instead of NFAs?”
    • DFAs have O(n) guaranteed time complexity (one state transition per character)
    • NFAs may require backtracking, giving worse worst-case performance
    • Lexer generators (flex, re2c) compile to DFAs for predictable performance
  3. “How would you handle operators like ++ and + in the same language?”
    • Longest match principle: ++ takes priority over + when both could match
    • Order the patterns: \+\+ before \+ (or rely on longest match)
  4. “What happens when you encounter invalid syntax during lexing?”
    • Options: ERROR token (continue lexing), exception (halt), skip character (recovery)
    • Best practice: Produce ERROR token with position info, continue lexing to find more errors
  5. “How do you handle keywords in a case-insensitive language?”
    • Lowercase input before matching
    • Use case-insensitive flag (re.IGNORECASE)
    • Normalize token values to canonical form
  6. “Explain how Python’s lexer handles indentation.”
    • Python tracks an indentation stack
    • Generates INDENT/DEDENT tokens based on indentation changes
    • This is context-sensitive: pure regex cannot do it, requires state
  7. “How would you tokenize a language with heredocs (<<EOF … EOF)?”
    • Recognize the heredoc start pattern
    • Switch to a special mode that reads until the terminator
    • This requires stateful lexing, not pure regex
  8. “What’s the regex for matching balanced parentheses?”
    • Trick question: Pure regex cannot match balanced/nested structures
    • This is why we have lexers AND parsers as separate phases

Hints in Layers

Layer 1 - Getting Started: Start with only three token types: NUMBER (\d+), WORD (\w+), and WHITESPACE (\s+). Use Python’s re.finditer() with a combined pattern using named groups:

import re

pattern = r'(?P<NUMBER>\d+)|(?P<WORD>\w+)|(?P<WHITESPACE>\s+)'
for match in re.finditer(pattern, text):
    print(match.lastgroup, match.group())

Layer 2 - Adding Priority: Add KEYWORD pattern before IDENTIFIER. Use word boundaries \b to ensure keywords are not prefixes of identifiers:

rules = [
    ('KEYWORD', r'\b(if|else|while)\b'),
    ('IDENTIFIER', r'[a-zA-Z_]\w*'),
    ('NUMBER', r'\d+'),
]

Combine with | and verify that “if” matches KEYWORD, not IDENTIFIER.
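
A quick sanity check of that ordering (the \b guards are what stop KEYWORD from matching inside a longer identifier):

import re

combined = re.compile(
    r'(?P<KEYWORD>\b(?:if|else|while)\b)|(?P<IDENTIFIER>[a-zA-Z_]\w*)|(?P<NUMBER>\d+)'
)

for text in ('if', 'iffy', '42'):
    m = combined.match(text)
    print(text, '->', m.lastgroup, repr(m.group()))
# if -> KEYWORD 'if'
# iffy -> IDENTIFIER 'iffy'
# 42 -> NUMBER '42'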

Layer 3 - Complex Tokens: Add string literals with escape handling:

('STRING', r'"(?:[^"\\]|\\.)*"'),

And multi-line comments:

('COMMENT', r'/\*[\s\S]*?\*/'),

Add line/column tracking by counting newlines in matched text.

Layer 4 - Production Quality: Add error handling for unmatched characters. Produce ERROR tokens with position info. Implement the combined DFA approach where all patterns become one giant regex. Add support for multiple string quote styles. Consider lookahead for context-sensitive tokenization (like Python’s indentation).
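
One hedged way to get ERROR tokens without changing the overall design is to append a catch-all rule to the combined pattern, so nothing is silently dropped (rule names here are illustrative):

import re

rules = [
    ('NUMBER',     r'\d+'),
    ('IDENTIFIER', r'[A-Za-z_]\w*'),
    ('OPERATOR',   r'[+\-*/=]'),
    ('WHITESPACE', r'\s+'),
    ('ERROR',      r'.'),   # any single character no earlier rule claimed
]
pattern = re.compile('|'.join(f'(?P<{name}>{rx})' for name, rx in rules))

def tokenize(text):
    for m in pattern.finditer(text):
        if m.lastgroup != 'WHITESPACE':
            yield (m.lastgroup, m.group(), m.start())

print(list(tokenize('let x = @1')))
# [('IDENTIFIER', 'let', 0), ('IDENTIFIER', 'x', 4), ('OPERATOR', '=', 6),
#  ('ERROR', '@', 8), ('NUMBER', '1', 9)]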

Books That Will Help

Topic Book Chapter/Section
Lexical analysis fundamentals “Compilers” (Dragon Book) by Aho et al. Chapter 3: Lexical Analysis
Token priority and DFA construction “Engineering a Compiler” by Cooper & Torczon Chapter 2: Scanners
Finite automata theory “Introduction to Automata Theory” by Hopcroft et al. Chapters 2-3: DFA and NFA
Practical lexer implementation “Crafting Interpreters” by Nystrom Chapter 4: Scanning
Regex to automata conversion “Mastering Regular Expressions” by Friedl Chapter 6: Crafting Efficient Expressions
String pattern matching “Regular Expressions Cookbook” by Goyvaerts Chapter 2: String Processing
Language design implications “Programming Language Pragmatics” by Scott Chapter 2: Scanning

Common Pitfalls & Debugging

Problem 1: “Wrong token chosen”

  • Why: Token rules are not ordered by priority (maximal munch).
  • Fix: Order tokens by specificity and length.
  • Quick test: Keywords vs identifiers (if vs identifier).

Problem 2: “Whitespace and comments leak into tokens”

  • Why: Skip rules are missing or misapplied.
  • Fix: Explicitly handle whitespace/comments as ignored tokens.
  • Quick test: Ensure tokens ignore spaces and comments.

Problem 3: “Ambiguous patterns cause backtracking”

  • Why: Overlapping token regexes.
  • Fix: Disambiguate patterns or use DFA conversion.
  • Quick test: Tokenize == vs =.

Problem 4: “Unicode identifiers fail”

  • Why: ASCII-only classes in identifier patterns.
  • Fix: Use Unicode properties if supported.
  • Quick test: Tokenize 变量 identifiers.

Definition of Done

  • Implements maximal munch (longest token wins)
  • Supports keywords, identifiers, numbers, operators
  • Skips whitespace/comments cleanly
  • Outputs token stream with spans
  • Includes tests for ambiguous and edge cases

Project 10: Template Engine

  • File: LEARN_REGEX_DEEP_DIVE.md
  • Main Programming Language: JavaScript
  • Alternative Programming Languages: Python, Go
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Text Templating
  • Software or Tool: Template Engine (like Mustache/Handlebars)
  • Main Book: “Mastering Regular Expressions” by Jeffrey Friedl

What you’ll build: A template engine that supports variable interpolation, conditionals, loops, and partials—using regex to find and replace template syntax.

Why it teaches regex: Template engines use regex to find special syntax in text. You’ll learn to handle nested structures, recursive patterns, and the interplay between regex matching and programmatic logic.

Core challenges you’ll face:

  • Variable interpolation → maps to simple captures and replacement
  • Conditionals → maps to matching paired tags
  • Loops → maps to extracting content between tags
  • Escaping → maps to distinguishing literal vs template syntax

Key Concepts:

  • Template Patterns: Mustache.js source code
  • Paired Tag Matching: “Mastering Regular Expressions” Chapter 6
  • Replacement Functions: JavaScript replace() with function

Difficulty: Intermediate Time estimate: 1-2 weeks Prerequisites: Projects 3 and 7

Real World Outcome

const tmpl = require('templatex');

// Simple interpolation
const template = 'Hello, {{name}}!';
const result = tmpl.render(template, { name: 'World' });
// 'Hello, World!'

// Conditionals
const withCondition = `
{{#if isLoggedIn}}
  Welcome back, {{username}}!
{{else}}
  Please log in.
{{/if}}
`;
tmpl.render(withCondition, { isLoggedIn: true, username: 'Alice' });
// 'Welcome back, Alice!'

// Loops
const withLoop = `
<ul>
{{#each items}}
  <li>{{name}}: {{price}}</li>
{{/each}}
</ul>
`;
tmpl.render(withLoop, {
  items: [
    { name: 'Widget', price: '$10' },
    { name: 'Gadget', price: '$20' }
  ]
});
// <ul>
//   <li>Widget: $10</li>
//   <li>Gadget: $20</li>
// </ul>

// Nested data
const nested = '{{user.profile.name}}';
tmpl.render(nested, { user: { profile: { name: 'Bob' } } });
// 'Bob'

// Escaping (raw output)
const withEscape = '{{&htmlContent}}';  // or {{{htmlContent}}}
tmpl.render(withEscape, { htmlContent: '<b>Bold</b>' });
// '<b>Bold</b>' (not escaped)

// Partials
tmpl.registerPartial('header', '<header>{{title}}</header>');
const withPartial = '{{>header}}<main>Content</main>';
tmpl.render(withPartial, { title: 'My Page' });
// '<header>My Page</header><main>Content</main>'

// Compile for repeated use
const compiled = tmpl.compile('Hello, {{name}}!');
compiled({ name: 'Alice' });  // 'Hello, Alice!'
compiled({ name: 'Bob' });    // 'Hello, Bob!'

Implementation Hints:

Template patterns:

const patterns = {
  // Variable: {{name}} or {{user.name}}
  variable: /\{\{([a-zA-Z_][\w.]*)\}\}/g,

  // Raw variable: {{{var}}} or {{&var}}
  rawVariable: /\{\{\{([a-zA-Z_][\w.]*)\}\}\}|\{\{&([a-zA-Z_][\w.]*)\}\}/g,

  // Conditional: {{#if condition}}...{{/if}}
  conditional: /\{\{#if\s+(\w+)\}\}([\s\S]*?)(?:\{\{else\}\}([\s\S]*?))?\{\{\/if\}\}/g,

  // Loop: {{#each items}}...{{/each}}
  loop: /\{\{#each\s+(\w+)\}\}([\s\S]*?)\{\{\/each\}\}/g,

  // Partial: {{>partialName}}
  partial: /\{\{>\s*(\w+)\s*\}\}/g,

  // Comment: {{! comment }}
  comment: /\{\{![\s\S]*?\}\}/g
};

Variable resolution with dot notation:

function resolve(path, context) {
  const parts = path.split('.');
  let value = context;
  for (const part of parts) {
    if (value == null) return '';
    value = value[part];
  }
  return value ?? '';
}

// Usage
resolve('user.profile.name', { user: { profile: { name: 'Bob' } } });
// Returns: 'Bob'

Processing order:

// Note: assumes `partials` (populated via registerPartial) and an `escapeHtml` helper exist elsewhere
function render(template, context) {
  let result = template;

  // 1. Remove comments first
  result = result.replace(patterns.comment, '');

  // 2. Process conditionals (before loops, they might contain loops)
  result = result.replace(patterns.conditional, (match, condition, ifContent, elseContent) => {
    return context[condition] ? ifContent : (elseContent || '');
  });

  // 3. Process loops
  result = result.replace(patterns.loop, (match, itemsKey, content) => {
    const items = context[itemsKey] || [];
    return items.map(item => render(content, { ...context, ...item })).join('');
  });

  // 4. Process partials
  result = result.replace(patterns.partial, (match, name) => {
    return render(partials[name], context);
  });

  // 5. Finally, replace variables
  result = result.replace(patterns.variable, (match, path) => {
    return escapeHtml(resolve(path, context));
  });

  return result;
}

Learning milestones:

  1. Simple variables work → You understand basic replacement
  2. Conditionals work → You understand paired tag matching
  3. Loops work → You understand iterative replacement
  4. Nesting works → You understand recursive processing

The Core Question You’re Answering

“How do template engines distinguish between content to render literally versus content to interpret as logic?”

When you write Hello, {{name}}!, the engine must recognize that Hello, and ! are literal text, but {{name}} is a command. When you write {{#if condition}}show this{{/if}}, it must match the opening and closing tags, evaluate the condition, and decide whether to include the inner content. This is the same problem compilers face: distinguishing operators from operands, matching brackets, and building a hierarchy of operations.

The deeper insight: Template engines are a gentle introduction to compiler construction. They require lexing (finding {{ and }}), parsing (understanding #if vs variables), and code generation (producing output). Working through this prepares you to understand how regex engines themselves work.

Concepts You Must Understand First

Lexical Analysis (Tokenization):

  • Breaking input into meaningful units (tokens)
  • Distinguishing delimiters from content
  • Handling escape sequences (\{{ for literal braces)
  • Reference: “Compilers: Principles, Techniques, and Tools” (Dragon Book), Chapter 3

Context-Free Grammars for Paired Tags:

  • {{#if}}...{{/if}} creates a nested structure
  • This is why regex alone cannot fully parse HTML/templates
  • But regex CAN find the individual tags for a parser to match
  • Reference: “Engineering a Compiler” by Cooper & Torczon, Chapter 3

Processing Order and Passes:

  • Why process conditionals before loops?
  • If a conditional removes a loop, no need to process that loop
  • Multi-pass vs single-pass template rendering
  • Reference: “Language Implementation Patterns” by Terence Parr, Pattern 4

Recursive Template Rendering:

  • Templates can include other templates
  • Variables might contain template syntax
  • Infinite recursion prevention
  • Reference: “Crafting Interpreters” by Robert Nystrom, Chapter 8 (Statements and State)

Questions to Guide Your Design

  1. What regex pattern identifies all template tags? Consider: {{, }}, {{#, {{/, and {{^ (inverted “else” sections).

  2. How do you handle nested conditionals? {{#if a}}{{#if b}}content{{/if}}{{/if}} - which {{/if}} matches which {{#if}}?

  3. What’s your strategy for processing order? Do you resolve all variables first, then conditionals? Or resolve variables only when needed?

  4. How do you escape template syntax? If the user wants to display literal {{name}}, what escape sequence do you support?

  5. What happens with undefined variables? Error? Empty string? Leave the placeholder?

  6. How do you handle whitespace around tags? {{#if}} content {{/if}} - preserve the spaces or trim?

  7. Can loop variables shadow outer variables? In {{#each items}}{{name}}{{/each}}, does name refer to the item’s name or an outer variable?

Thinking Exercise

Before coding, work through this example on paper:

Template:

Hello, {{name}}!
{{#if premium}}
  Welcome, premium member!
  Your items: {{#each items}}{{.}}, {{/each}}
{{/if}}
{{^premium}}
  Upgrade to premium!
{{/premium}}

Context:

{
  "name": "Alice",
  "premium": true,
  "items": ["Book", "Pen", "Notebook"]
}

Step through:

  1. Identify all tokens (tags and literal text)
  2. Build a tree structure showing nesting
  3. Evaluate each node with the context
  4. Produce the final output

Draw the tree that represents this template’s structure. What’s the parent-child relationship between tags?

The Interview Questions They’ll Ask

  1. “How would you implement variable scoping in a template engine?” (Tests understanding of context inheritance in nested blocks)

  2. “Why can’t you fully parse template syntax with just regular expressions?” (Expects: nested structures require a stack/parser, but regex finds the individual tags)

  3. “How would you prevent template injection attacks?” (Security: escaping user input, sandboxing template execution)

  4. “What’s the time complexity of your template rendering?” (Should be O(n) for simple templates, watch for O(n^2) with naive string replacement)

  5. “How would you implement template inheritance (like Jinja’s extends)?” (Tests understanding of multi-file template systems)

  6. “How do you handle errors in templates gracefully?” (Missing variables, syntax errors, infinite loops)

  7. “Compare your approach to how Mustache/Handlebars/Jinja2 work.” (Shows awareness of real-world implementations)

Hints in Layers

Layer 1 - Direction: Start with a tokenizer that finds all {{...}} blocks and the literal text between them. Use regex to find these, then build a simple array of tokens.

Layer 2 - Approach:

import re

# Tokenizer regex
TAG_PATTERN = r'(\{\{[^}]+\}\})'
tokens = re.split(TAG_PATTERN, template)
# This gives: ['Hello, ', '{{name}}', '!', ...]

Now classify each token: is it literal text, a variable, an opening block ({{#), or a closing block ({{/)?
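
A small classifier over those split tokens (a sketch; the tag-kind names are just illustrative):

import re

TAG_PATTERN = r'(\{\{[^}]+\}\})'

def classify(token):
    # Order matters: block openers/closers also start with '{{'
    if token.startswith('{{#'):
        return ('open', token[3:-2].strip())
    if token.startswith('{{/'):
        return ('close', token[3:-2].strip())
    if token.startswith('{{'):
        return ('variable', token[2:-2].strip())
    return ('text', token)

tokens = re.split(TAG_PATTERN, 'Hello, {{name}}! {{#if premium}}VIP{{/if}}')
print([classify(t) for t in tokens if t])
# [('text', 'Hello, '), ('variable', 'name'), ('text', '! '),
#  ('open', 'if premium'), ('text', 'VIP'), ('close', 'if')]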

Layer 3 - Structure: For nested blocks, build a tree. Each {{#if}} creates a new node with children. Each {{/if}} closes the current node and returns to its parent. Use a stack to track the current parent node.

class Node:
    def __init__(self, type, value):
        self.type = type  # 'root', 'text', 'variable', 'if', 'each'
        self.value = value
        self.children = []
        self.else_children = []  # For {{^}} else blocks

Layer 4 - Full Algorithm:

def parse(tokens):
    root = Node('root', None)
    stack = [root]

    for token in tokens:
        if token.startswith('{{#'):
            block_type = token[3:-2].split()[0]  # 'if', 'each'
            block_value = token[3:-2].split()[1] if ' ' in token else None
            node = Node(block_type, block_value)
            stack[-1].children.append(node)
            stack.append(node)
        elif token.startswith('{{/'):
            stack.pop()
        elif token.startswith('{{'):
            var_name = token[2:-2]
            stack[-1].children.append(Node('variable', var_name))
        else:
            stack[-1].children.append(Node('text', token))

    return root

def render(node, context):
    if node.type == 'text':
        return node.value
    elif node.type == 'variable':
        return str(context.get(node.value, ''))
    elif node.type == 'if':
        if context.get(node.value):
            return ''.join(render(child, context) for child in node.children)
        else:
            return ''.join(render(child, context) for child in node.else_children)
    # ... etc

Books That Will Help

Topic Book Chapter/Section
Tokenization with regex “Mastering Regular Expressions” by Friedl Chapter 5: Practical Applications
Building parsers from tokens “Crafting Interpreters” by Nystrom Chapters 4-6: Scanning, Representing Code
Template engine internals “Language Implementation Patterns” by Parr Pattern 2: LL(1) Recursive-Descent Parser
Context and variable scoping “Programming Language Pragmatics” by Scott Chapter 3: Names, Scopes, and Bindings
Real template engine design “Flask Web Development” by Grinberg Chapter 3: Templates (Jinja2 internals)
Security considerations “The Tangled Web” by Zalewski Chapter 11: Content Rendering Pitfalls

Common Pitfalls & Debugging

Problem 1: “Blocks consume too much content”

  • Why: Greedy regex across block delimiters.
  • Fix: Use non-greedy quantifiers and explicit delimiters.
  • Quick test: Two adjacent blocks should parse separately.

Problem 2: “Escaping is inconsistent”

  • Why: Output escaping rules are unclear.
  • Fix: Define a clear escaping policy and apply consistently.
  • Quick test: Ensure HTML contexts are safely escaped.

Problem 3: “Nested blocks fail”

  • Why: Regex alone cannot track nested structures.
  • Fix: Implement a simple stack-based parser for nesting.
  • Quick test: Nested loops/if blocks should parse.

Problem 4: “Injection vulnerabilities”

  • Why: Template expressions executed without sanitization.
  • Fix: Restrict expressions and sandbox evaluation.
  • Quick test: Attempt {{ __import__('os') }}.

Definition of Done

  • Parses variables, conditionals, loops correctly
  • Supports escape modes (raw vs escaped)
  • Handles nested blocks without corruption
  • Includes tests for security-sensitive cases
  • Produces stable output on malformed templates

Project 11: Regex Engine (Basic)

  • File: LEARN_REGEX_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Rust, Go, C
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 5. The “Industry Disruptor”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Automata Theory / Compilers
  • Software or Tool: Regex Engine
  • Main Book: “Introduction to Automata Theory” by Hopcroft, Motwani, Ullman

What you’ll build: A regex engine that compiles patterns to NFAs (Non-deterministic Finite Automata) and matches strings—implementing basic operators: concatenation, alternation (|), Kleene star (*), plus (+), and optional (?).

Why it teaches regex: Building a regex engine is the ultimate way to understand how regex works. You’ll learn Thompson’s construction, NFA simulation, and why certain patterns are slow (catastrophic backtracking).

Core challenges you’ll face:

  • Parsing regex syntax → maps to recursive descent parsing
  • Thompson’s construction → maps to NFA building
  • NFA simulation → maps to parallel state tracking
  • Handling special characters → maps to metacharacter escaping

Key Concepts:

  • Thompson’s Construction: Russ Cox, “Regular Expression Matching Can Be Simple And Fast” — https://swtch.com/~rsc/regexp/regexp1.html
  • NFA/DFA Theory: “Introduction to Automata Theory” Chapters 2-3 - Hopcroft
  • Regex Engine Internals: “Mastering Regular Expressions” Chapter 4 - Friedl
  • Linear-time engines (RE2): RE2 design notes and constraints — https://github.com/google/re2

Difficulty: Expert Time estimate: 1 month Prerequisites: Automata theory basics, Projects 7 and 9

Real World Outcome

from regex_engine import Regex

# Basic matching
r = Regex('hello')
r.match('hello')        # True
r.match('hello world')  # True (partial match)
r.fullmatch('hello')    # True (exact match)

# Alternation
r = Regex('cat|dog')
r.match('cat')          # True
r.match('dog')          # True
r.match('bird')         # False

# Kleene star
r = Regex('ab*c')
r.match('ac')           # True  (zero b's)
r.match('abc')          # True  (one b)
r.match('abbbc')        # True  (three b's)

# Plus (one or more)
r = Regex('ab+c')
r.match('ac')           # False (need at least one b)
r.match('abc')          # True
r.match('abbbc')        # True

# Optional
r = Regex('colou?r')
r.match('color')        # True
r.match('colour')       # True

# Character classes
r = Regex('[a-z]+')
r.match('hello')        # True
r.match('HELLO')        # False

# Grouping
r = Regex('(ab)+')
r.match('ab')           # True
r.match('abab')         # True
r.match('ababab')       # True

# Complex pattern
r = Regex('(a|b)*abb')
r.match('abb')          # True
r.match('aabb')         # True
r.match('babb')         # True
r.match('abababb')      # True

# Debug: show NFA
r.visualize()
# Outputs DOT format for graphviz:
# digraph NFA {
#   0 -> 1 [label="a"];
#   0 -> 2 [label="b"];
#   ...
# }

Implementation Hints:

Regex engine architecture:

┌─────────────────────────────────────────────────────────────┐
│                        Regex Engine                          │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────┐    ┌──────────────┐    ┌──────────────────┐   │
│  │  Parser  │───▶│   Thompson   │───▶│  NFA Simulator   │   │
│  │          │    │ Construction │    │                  │   │
│  └──────────┘    └──────────────┘    └──────────────────┘   │
│       │                │                       │             │
│       ▼                ▼                       ▼             │
│     AST              NFA                  Match/No Match     │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Step 1: Parse regex to AST:

# Regex: (a|b)*c
# AST:
#   Concat
#   ├── Star
#   │   └── Alternation
#   │       ├── Literal('a')
#   │       └── Literal('b')
#   └── Literal('c')

class Literal:
    def __init__(self, char): self.char = char

class Alternation:
    def __init__(self, left, right): self.left, self.right = left, right

class Concat:
    def __init__(self, left, right): self.left, self.right = left, right

class Star:
    def __init__(self, expr): self.expr = expr

Step 2: Thompson’s Construction (AST to NFA):

class State:
    def __init__(self):
        self.transitions = {}  # char -> [states]
        self.epsilon = []      # ε-transitions

class NFA:
    def __init__(self, start, accept):
        self.start = start
        self.accept = accept

def build_nfa(ast):
    if isinstance(ast, Literal):
        start = State()
        accept = State()
        start.transitions[ast.char] = [accept]
        return NFA(start, accept)

    elif isinstance(ast, Concat):
        left = build_nfa(ast.left)
        right = build_nfa(ast.right)
        left.accept.epsilon.append(right.start)
        return NFA(left.start, right.accept)

    elif isinstance(ast, Alternation):
        left = build_nfa(ast.left)
        right = build_nfa(ast.right)
        start = State()
        accept = State()
        start.epsilon = [left.start, right.start]
        left.accept.epsilon.append(accept)
        right.accept.epsilon.append(accept)
        return NFA(start, accept)

    elif isinstance(ast, Star):
        inner = build_nfa(ast.expr)
        start = State()
        accept = State()
        start.epsilon = [inner.start, accept]
        inner.accept.epsilon = [inner.start, accept]
        return NFA(start, accept)
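
    # Sketch of the remaining operators the project calls for (+ and ?); this
    # assumes Plus and Optional AST node classes analogous to Star above.
    elif isinstance(ast, Plus):
        inner = build_nfa(ast.expr)
        accept = State()
        # One mandatory pass through inner, then either loop back or accept
        inner.accept.epsilon = [inner.start, accept]
        return NFA(inner.start, accept)

    elif isinstance(ast, Optional):
        inner = build_nfa(ast.expr)
        start = State()
        accept = State()
        start.epsilon = [inner.start, accept]  # skip path: zero occurrences
        inner.accept.epsilon.append(accept)
        return NFA(start, accept)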

Step 3: NFA Simulation:

def match(nfa, text):
    # Start with all states reachable from start via ε
    current = epsilon_closure({nfa.start})

    for char in text:
        next_states = set()
        for state in current:
            if char in state.transitions:
                next_states.update(state.transitions[char])
        current = epsilon_closure(next_states)

    return nfa.accept in current

def epsilon_closure(states):
    """Find all states reachable via ε-transitions"""
    closure = set(states)
    stack = list(states)
    while stack:
        state = stack.pop()
        for next_state in state.epsilon:
            if next_state not in closure:
                closure.add(next_state)
                stack.append(next_state)
    return closure
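
Putting the three steps together on the exercise pattern a(b|c)*d, with the AST built by hand (the Step 1 parser would normally produce it):

ast = Concat(
    Literal('a'),
    Concat(Star(Alternation(Literal('b'), Literal('c'))), Literal('d')),
)
nfa = build_nfa(ast)

for text in ('ad', 'abcd', 'abcbcd', 'abd', 'abc'):
    print(text, match(nfa, text))
# ad True
# abcd True
# abcbcd True
# abd True
# abc False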

Learning milestones:

  1. Parser produces AST → You understand regex syntax
  2. NFA is constructed → You understand Thompson’s construction
  3. Matching works → You understand NFA simulation
  4. All operators work → You understand the full algorithm

The Core Question You’re Answering

“What is a regular expression, really, at its deepest level?”

This is THE ultimate way to understand regex. When you build a regex engine, you discover that a regex is not a string of cryptic symbols - it is a specification for a state machine. The pattern /ab*c/ describes a machine with states connected by transitions. When you match a string, you are simulating that machine, following transitions as you consume characters.

The profound insight: Regular expressions are equivalent to finite automata. This is not just theory - it is the actual implementation strategy used by grep, awk, and many other tools. Ken Thompson invented this in 1968, and his algorithm is still the gold standard. After building this, you will never look at regex the same way again.

Concepts You Must Understand First

Thompson’s Construction Algorithm:

  • Converting regex to NFA (Non-deterministic Finite Automata)
  • Each operator (concat, alternation, Kleene star) has a construction rule
  • The algorithm is recursive and elegant
  • Reference: “Introduction to Automata Theory” by Hopcroft, Motwani, Ullman, Chapter 3

Non-deterministic Finite Automata (NFA):

  • States, transitions, start state, accept states
  • Non-determinism: multiple possible transitions from one state
  • Epsilon transitions: transitions that consume no input
  • Reference: “Introduction to the Theory of Computation” by Sipser, Chapter 1

Epsilon Closure:

  • All states reachable from a state via epsilon transitions only
  • Critical for NFA simulation: before checking input, compute epsilon closure
  • Reference: “Compilers: Principles, Techniques, and Tools” (Dragon Book), Section 3.7

Recursive Descent Parsing:

  • Parsing the regex syntax itself (before you can compile it)
  • Operator precedence: * binds tighter than concatenation, which binds tighter than |
  • Reference: “Crafting Interpreters” by Nystrom, Chapter 6

Why NFA instead of DFA?:

  • NFA construction yields O(n) states, where n is the pattern length
  • An equivalent DFA can have exponentially many states
  • NFA simulation runs in O(nm), where n is the pattern size and m is the input length - linear in both
  • Reference: Russ Cox’s article “Regular Expression Matching Can Be Simple And Fast”

Questions to Guide Your Design

  1. How do you represent an NFA in code? States as objects? Transitions as dictionaries? Adjacency list?

  2. How do you parse the regex into an AST first? What is the grammar for regex? Something like: expr -> term ('|' term)*, term -> factor+, factor -> atom ('*' | '+' | '?')?

  3. What does Thompson’s construction look like for each operator?
    • Literal a: Two states, one transition labeled ‘a’
    • Concatenation ab: Connect a’s accept to b’s start
    • Alternation a|b: New start with epsilon to both, both accept to new accept
    • Kleene star a*: New start with epsilon to a’s start and to the new accept (allowing zero repetitions); a’s accept has epsilon back to a’s start and to the new accept
  4. How do you simulate the NFA? Track the set of current states. For each input character, compute next states. At end, check if any current state is accepting.

  5. How do you handle epsilon transitions efficiently? Compute epsilon closure once per step, not recursively during transition.

  6. What about anchors like ^ and $? These match positions, not characters. How do you encode position-matching in your NFA?

  7. How do you handle character classes like [a-z]? One transition with a predicate, or multiple transitions?

Thinking Exercise

Before coding, work through Thompson’s construction by hand:

Pattern: a(b|c)*d

  1. Draw the NFA for just a: Start -> (a) -> Accept
  2. Draw the NFA for b: Start -> (b) -> Accept
  3. Draw the NFA for c: Start -> (c) -> Accept
  4. Draw the NFA for b|c: Use alternation construction
  5. Draw the NFA for (b|c)*: Use Kleene star construction
  6. Draw the NFA for a(b|c)*d: Concatenate all pieces

Trace the match of string “abcd”:

  • Start in initial state
  • Read ‘a’: transition to next state
  • Compute epsilon closure
  • Read ‘b’: transition via ‘b’
  • Compute epsilon closure (note: can loop back due to star)
  • Read ‘c’: transition via ‘c’
  • Compute epsilon closure
  • Read ‘d’: transition to final state
  • Check: are we in an accepting state?

The Interview Questions They’ll Ask

  1. “Explain the difference between NFA and DFA.” (Non-determinism, epsilon transitions, state explosion in DFA conversion)

  2. “Why does Thompson’s construction produce an NFA with O(n) states?” (Each operator adds at most 2 states)

  3. “What is epsilon closure and why is it important?” (States reachable without consuming input, needed for NFA simulation)

  4. “How would you extend your engine to support backreferences?” (Requires backtracking, no longer regular - this is where theory meets practice)

  5. “Why are some regex implementations slow on certain patterns?” (Backtracking implementations can be exponential, NFA simulation is polynomial)

  6. “How does your parser handle operator precedence?” (Recursive descent with different functions for each precedence level)

  7. “What is the pumping lemma and what does it tell us about regex limitations?” (Proves regex cannot match balanced parentheses, palindromes, etc.)

  8. “How would you add capture groups to your engine?” (Tag transitions with group markers, track during simulation)

Hints in Layers

Layer 1 - Direction: Start by defining your data structures: State, Transition, NFA. Then implement a parser that converts regex string to AST. Finally, implement Thompson’s construction to convert AST to NFA.

Layer 2 - Approach:

class State:
    def __init__(self, is_end=False):
        self.is_end = is_end
        self.epsilon = []  # List of states reachable via epsilon
        self.transitions = {}  # char -> list of states

class NFA:
    def __init__(self, start, end):
        self.start = start
        self.end = end

For parsing, use recursive descent with three levels: parse_expr (handles |), parse_term (handles concatenation), parse_factor (handles *, +, ?, and atoms).
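
A compact sketch of that three-level parser, producing the Step 1 AST classes (Literal, Concat, Alternation, Star); error handling and the + and ? factors are left out for brevity:

class Parser:
    def __init__(self, pattern):
        self.pattern = pattern
        self.pos = 0

    def peek(self):
        return self.pattern[self.pos] if self.pos < len(self.pattern) else None

    def eat(self, char):
        assert self.peek() == char, f'expected {char!r} at {self.pos}'
        self.pos += 1

    def parse_expr(self):                      # expr -> term ('|' term)*
        node = self.parse_term()
        while self.peek() == '|':
            self.eat('|')
            node = Alternation(node, self.parse_term())
        return node

    def parse_term(self):                      # term -> factor factor*
        node = self.parse_factor()
        while self.peek() is not None and self.peek() not in '|)':
            node = Concat(node, self.parse_factor())
        return node

    def parse_factor(self):                    # factor -> atom '*'?
        if self.peek() == '(':
            self.eat('(')
            node = self.parse_expr()
            self.eat(')')
        else:
            node = Literal(self.peek())
            self.pos += 1
        if self.peek() == '*':
            self.eat('*')
            node = Star(node)
        return node

# Parser('(a|b)*c').parse_expr()  ->  Concat(Star(Alternation(a, b)), Literal('c'))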

Layer 3 - Structure: Thompson’s construction rules:

def literal(char):
    start, end = State(), State(is_end=True)
    start.transitions[char] = [end]
    return NFA(start, end)

def concatenate(nfa1, nfa2):
    nfa1.end.is_end = False
    nfa1.end.epsilon.append(nfa2.start)
    return NFA(nfa1.start, nfa2.end)

def alternate(nfa1, nfa2):
    start, end = State(), State(is_end=True)
    start.epsilon = [nfa1.start, nfa2.start]
    nfa1.end.is_end = False
    nfa1.end.epsilon.append(end)
    nfa2.end.is_end = False
    nfa2.end.epsilon.append(end)
    return NFA(start, end)

def kleene_star(nfa):
    start, end = State(), State(is_end=True)
    start.epsilon = [nfa.start, end]
    nfa.end.is_end = False
    nfa.end.epsilon = [nfa.start, end]
    return NFA(start, end)

Layer 4 - Full Algorithm:

def epsilon_closure(states):
    closure = set(states)
    stack = list(states)
    while stack:
        state = stack.pop()
        for next_state in state.epsilon:
            if next_state not in closure:
                closure.add(next_state)
                stack.append(next_state)
    return closure

def match(nfa, text):
    current_states = epsilon_closure({nfa.start})

    for char in text:
        next_states = set()
        for state in current_states:
            if char in state.transitions:
                next_states.update(state.transitions[char])
        current_states = epsilon_closure(next_states)

        if not current_states:
            return False

    return any(state.is_end for state in current_states)

Books That Will Help

Topic Book Chapter/Section
Automata theory foundations “Introduction to Automata Theory” by Hopcroft et al. Chapters 2-3: FA, Regular Expressions
Thompson’s algorithm explained “Compilers: Principles, Techniques, and Tools” Section 3.7: From Regular Expressions to Automata
Practical NFA implementation “Crafting Interpreters” by Nystrom Chapters 16-17: Scanning, Compiling
Theory of computation “Introduction to the Theory of Computation” by Sipser Chapter 1: Regular Languages
Why NFA beats backtracking Russ Cox’s “Regular Expression Matching” article (Online resource)
Regex engine internals “Mastering Regular Expressions” by Friedl Chapter 6: Crafting an Efficient Expression

Common Pitfalls & Debugging

Problem 1: “Incorrect precedence”

  • Why: Parser does not bind quantifiers tighter than alternation.
  • Fix: Implement a proper parser (shunting-yard or recursive descent).
  • Quick test: ab|cd should parse as (ab)|(cd).

Problem 2: “Infinite loops in NFA simulation”

  • Why: Epsilon-closures are not tracked or deduplicated.
  • Fix: Use a visited set for epsilon transitions.
  • Quick test: Pattern a* on empty input should terminate.

Problem 3: “Captures are wrong”

  • Why: Group boundaries are not recorded consistently.
  • Fix: Record enter/exit positions in the match state.
  • Quick test: ^(a)(b)$ returns groups 1 and 2 correctly.

Problem 4: “Performance collapses on long input”

  • Why: Backtracking implementation without pruning.
  • Fix: Use Thompson NFA simulation or add memoization.
  • Quick test: (a+)+$ on long input should not freeze.

Definition of Done

  • Supports literals, concatenation, alternation, and quantifiers
  • Parses regex with correct precedence
  • Matches empty string cases correctly
  • Provides basic capture group support
  • Includes a conformance test suite

Project 12: Regex Debugger & Visualizer

  • File: LEARN_REGEX_DEEP_DIVE.md
  • Main Programming Language: JavaScript
  • Alternative Programming Languages: Python, TypeScript
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Debugging / Visualization
  • Software or Tool: Regex Debugger (like regex101)
  • Main Book: “Mastering Regular Expressions” by Jeffrey Friedl

What you’ll build: A regex debugger that shows step-by-step matching, explains why patterns fail, highlights matched groups, and visualizes the regex as a railroad diagram.

Why it teaches regex: Building a debugger requires you to understand exactly how regex engines work internally. You’ll learn backtracking, match states, and how to explain regex behavior to others.

Core challenges you’ll face:

  • Step-by-step execution → maps to tracking match state
  • Failure explanation → maps to understanding why matches fail
  • Railroad diagrams → maps to regex AST visualization
  • Performance analysis → maps to detecting catastrophic backtracking

Key Concepts:

  • Regex Debugging: regex101.com internals
  • Railroad Diagrams: Railroad diagram conventions
  • Backtracking: “Mastering Regular Expressions” Chapter 4 - Friedl

Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Project 11

Real World Outcome

const debug = require('regex-debugger');

// Step-by-step matching
const result = debug.trace(/a+b/, 'aaab');
// {
//   matched: true,
//   steps: [
//     { position: 0, pattern: 'a+', action: 'match a', state: 'continue' },
//     { position: 1, pattern: 'a+', action: 'match a', state: 'continue' },
//     { position: 2, pattern: 'a+', action: 'match a', state: 'continue' },
//     { position: 3, pattern: 'b', action: 'match b', state: 'success' }
//   ]
// }

// Failure analysis
const failure = debug.explain(/foo/, 'bar');
// {
//   matched: false,
//   reason: "Pattern 'foo' expected 'f' at position 0, but found 'b'",
//   suggestion: "The input doesn't contain the literal text 'foo'"
// }

// Backtracking visualization
const backtrack = debug.trace(/a+ab/, 'aaab');
// Shows backtracking:
//   - a+ matches 'aaa'
//   - tries to match 'a', fails (position 3 is 'b')
//   - backtracks: a+ gives up one 'a'
//   - a+ now matches 'aa'
//   - 'a' matches at position 2
//   - 'b' matches at position 3
//   - SUCCESS

// Group highlighting
const groups = debug.groups(/((\d+)-(\d+))/, '2024-12-25');
// {
//   fullMatch: '2024-12',
//   groups: [
//     { index: 1, name: null, value: '2024-12', range: [0, 7] },
//     { index: 2, name: null, value: '2024', range: [0, 4] },
//     { index: 3, name: null, value: '12', range: [5, 7] }
//   ]
// }

// Railroad diagram
const diagram = debug.visualize(/colou?r/);
// Returns SVG/HTML of railroad diagram

// Performance analysis
const perf = debug.analyze(/(a+)+$/);
// {
//   warning: 'CATASTROPHIC_BACKTRACKING',
//   message: 'This pattern can cause exponential backtracking on inputs like "aaaaaaaaaaaaaaaaX"',
//   suggestion: 'Use atomic groups or possessive quantifiers: (?>a+)+$'
// }

// Interactive mode (CLI)
debug.interactive();
// > pattern: \d{3}-\d{4}
// > test: 555-1234
// ✓ Match found: "555-1234"
//   Group 0: "555-1234" [0-8]
// > test: abc
// ✗ No match
//   Failed at position 0: expected digit, found 'a'

Implementation Hints:

Tracing regex execution:

class RegexDebugger {
  trace(pattern, input) {
    const steps = [];
    const regex = new RegExp(pattern, 'g');

    // Use a custom matcher that logs steps
    // This is simplified - real implementation needs engine access
    let lastIndex = 0;

    for (let i = 0; i < input.length; i++) {
      // Try to match at each position
      regex.lastIndex = i;
      const match = regex.exec(input);

      if (match && match.index === i) {
        steps.push({
          position: i,
          pattern: pattern.toString(),
          action: `matched "${match[0]}"`,
          state: 'success'
        });
        break;
      } else {
        steps.push({
          position: i,
          pattern: pattern.toString(),
          action: `no match at position ${i}`,
          state: 'backtrack'
        });
      }
    }

    return { matched: steps.some(s => s.state === 'success'), steps };
  }
}

Railroad diagram generation:

function toRailroad(ast) {
  switch (ast.type) {
    case 'literal':
      return `<rect class="terminal"><text>${ast.char}</text></rect>`;

    case 'sequence':
      return ast.elements.map(toRailroad).join('');

    case 'alternation':
      return `
        <g class="choice">
          ${ast.alternatives.map(a => `<path>${toRailroad(a)}</path>`).join('')}
        </g>
      `;

    case 'quantifier':
      if (ast.min === 0 && ast.max === Infinity) {
        // *: loop with skip path
        return `<g class="loop">${toRailroad(ast.expr)}</g>`;
      }
      // ... handle other quantifiers
  }
}

Catastrophic backtracking detection:

function detectCatastrophic(pattern) {
  const warnings = [];

  // Nested quantifiers with overlap
  if (/\([^)]*[+*][^)]*\)[+*]/.test(pattern)) {
    warnings.push({
      type: 'NESTED_QUANTIFIERS',
      message: 'Nested quantifiers can cause exponential backtracking'
    });
  }

  // Overlapping alternatives
  // (a|ab)* - both alternatives start with 'a'
  // This is harder to detect accurately

  return warnings;
}

Learning milestones:

  1. Step tracing works → You understand match progression
  2. Failure explanation works → You understand why patterns fail
  3. Railroad diagrams generate → You understand regex structure
  4. Catastrophic patterns detected → You understand performance

The Core Question You’re Answering

“Why did this regex fail, and how can I see exactly what the engine was thinking?”

Regex debugging is notoriously difficult because matching happens invisibly inside the engine. When /a(b+)+c/ takes forever on “abbbbbbbbbbbx”, most developers stare in confusion. Your debugger will show exactly what is happening: the backtracking, the dead ends, the exponential exploration. When a pattern fails, it will explain why - not just “no match” but “the pattern expected ‘c’ at position 12, but found ‘x’.”

The deeper insight: Visualization transforms understanding. A railroad diagram shows the structure of a regex in a way that cryptic syntax never can. Step-by-step tracing reveals the matching algorithm. Catastrophic backtracking detection prevents production disasters. This project combines deep technical understanding with practical tooling.

Concepts You Must Understand First

Step-by-Step Execution Tracing:

  • Recording each state transition during matching
  • Capturing the current position in both pattern and input
  • Showing which alternatives are being tried
  • Reference: “Mastering Regular Expressions” by Friedl, Chapter 6

Backtracking Visualization:

  • How backtracking-based engines work (Perl, Python, JavaScript)
  • The difference between greedy and lazy quantifiers in terms of backtracking
  • Why (a+)+ creates exponential backtracking
  • Reference: “Mastering Regular Expressions” by Friedl, Chapter 4

Railroad Diagrams (Syntax Diagrams):

  • Visual representation of grammar/regex structure
  • Converting AST to SVG graphics
  • Layout algorithms for readable diagrams
  • Reference: “The Definitive ANTLR 4 Reference” by Parr, Chapter 5

Catastrophic Backtracking Detection:

  • Patterns that cause exponential time: (a+)+, (a|a)+, (a+)*
  • Static analysis to detect nested quantifiers
  • The regex (a+)+ on “aaaaaaaaaaaaaaaaax” tries 2^n paths
  • Reference: Russ Cox’s “Regular Expression Matching Can Be Simple And Fast”
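
The blow-up is easy to observe directly from a backtracking engine. Project 12’s main language is JavaScript, but Python’s re module (also a backtracking engine) makes the same point in a few lines; exact timings vary by machine, and the point is simply that each extra ‘a’ roughly doubles the time when the match must fail:

import re
import time

pattern = re.compile(r'(a+)+$')
for n in (18, 20, 22, 24):
    text = 'a' * n + 'x'
    start = time.perf_counter()
    pattern.match(text)   # fails only after exploring about 2^n ways to split the run of a's
    print(n, f'{time.perf_counter() - start:.3f}s')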

Failure Explanation:

  • Not just “no match” but “expected X at position Y, found Z”
  • Showing the furthest the match got before failing
  • Identifying which part of the pattern caused the failure
  • Reference: Error reporting in parser design (any compiler textbook)

Questions to Guide Your Design

  1. How do you instrument the matching engine for tracing? Hook into each transition, record state, position, and decision.

  2. What data structure captures a match attempt? A tree of decisions? A flat log? Consider that backtracking creates a tree structure.

  3. How do you generate railroad diagrams from AST? What are the SVG primitives you need? Rectangles for literals, rounded rectangles for character classes, arrows for flow.

  4. How do you detect catastrophic backtracking statically? Look for nested quantifiers where inner pattern can match empty string or overlaps with outer.

  5. How do you explain failure clearly? Show the pattern up to the failure point, show the input up to where it matched, show what was expected vs. found.

  6. How do you handle very long traces? Thousands of steps can happen. Summarization? Collapsing repeated patterns?

  7. How do you visualize alternation exploration? When (cat|car) tries ‘cat’ then backtracks to ‘car’, how do you show this?

Thinking Exercise

Before coding, analyze this failure case by hand:

Pattern: /colou?r/ Input: “I like colonies”

Trace the match attempt:

  1. Position 0 ‘I’: Does ‘c’ match ‘I’? No. Move to position 1.
  2. Position 1 ‘ ‘: Does ‘c’ match ‘ ‘? No. Move to position 2.
  3. … continue until position 7 ‘c’
  4. Position 7 ‘c’: Match! Advance pattern and input.
  5. Position 8 ‘o’: Match! Advance.
  6. Position 9 ‘l’: Match! Advance.
  7. Position 10 ‘o’: Match! Advance.
  8. Position 11 ‘n’: ‘u?’ is optional. Try matching ‘u’. ‘n’ != ‘u’. OK, skip ‘u?’.
  9. Position 11 ‘n’: ‘r’ expected. ‘n’ != ‘r’. Failure at position 11 (no later start position matches either).

Now explain this failure in user-friendly terms: “Pattern failed at position 11. After matching ‘colo’, expected ‘r’ (or optional ‘u’ followed by ‘r’), but found ‘n’. The word ‘colonies’ has ‘n’ after ‘colo’, not ‘u’ or ‘r’.”

Draw the railroad diagram for /colou?r/:

-->[ c ]-->[ o ]-->[ l ]-->[ o ]--+--->[ u ]---+-->[ r ]-->
                                  |            |
                                  +------------+

The Interview Questions They’ll Ask

  1. “How would you detect patterns that cause catastrophic backtracking?” (Look for nested quantifiers, analyze if inner can match what outer expects)

  2. “Explain how you would generate SVG for a railroad diagram.” (Recursive descent over AST, compute bounding boxes, connect with arrows)

  3. “What’s the difference between tracing NFA simulation vs. backtracking engines?” (NFA tracks set of states, no backtracking; backtracking tries one path, backtracks on failure)

  4. “How would you make the step-by-step visualization interactive?” (Slider for step number, highlight current position, show current states)

  5. “What makes a good failure explanation for end users?” (Position in input, what was expected, what was found, context around failure)

  6. “How would you profile a regex for performance?” (Time matching on various inputs, count backtracking steps, identify hot spots)

  7. “How does regex101.com implement their debugger?” (Understanding real-world implementation approaches)

  8. “How would you explain lookahead/lookbehind failures?” (These are particularly confusing - need to show what the assertion expected)

Hints in Layers

Layer 1 - Direction: Start with a simple step tracer that records each state transition. Use your regex engine from Project 11 as a base, adding instrumentation. Then add failure explanation. Railroad diagrams and catastrophic detection can come later.

Layer 2 - Approach:

class MatchStep {
    constructor(patternPos, inputPos, state, action, success) {
        this.patternPos = patternPos;
        this.inputPos = inputPos;
        this.state = state;           // Current NFA states or backtrack position
        this.action = action;         // 'match', 'epsilon', 'backtrack', 'fail'
        this.success = success;
    }
}

class Tracer {
    constructor() {
        this.steps = [];
    }

    record(step) {
        this.steps.push(step);
    }

    getFailureExplanation() {
        const lastStep = this.steps[this.steps.length - 1];
        // Find the furthest we got in the input
        // Explain what was expected vs. found
    }
}

Layer 3 - Structure: For railroad diagrams:

function generateRailroad(ast) {
    switch (ast.type) {
        case 'literal':
            return `<rect x="0" y="0" width="30" height="20"/>
                    <text x="15" y="15">${ast.char}</text>`;
        case 'concat':
            // Layout children horizontally, connect with arrows
        case 'alt':
            // Layout children vertically, add branching arrows
        case 'star':
            // Add loop-back arrow
        case 'optional':
            // Add bypass arrow
    }
}

For catastrophic backtracking detection:

function detectCatastrophic(ast) {
    // Look for: quantifier containing another quantifier
    // where inner can match empty OR inner matches same as outer continuation
    if (ast.type === 'star' || ast.type === 'plus') {
        const inner = ast.child;
        if (hasQuantifier(inner)) {
            return {
                warning: 'Nested quantifiers detected',
                pattern: ast,
                explanation: 'Pattern like (a+)+ can cause exponential backtracking'
            };
        }
    }
}

Layer 4 - Full Algorithm:

function explainFailure(pattern, input, trace) {
    // Find the position where we got furthest in the input
    let maxInputPos = 0;
    let stateAtMax = null;

    for (const step of trace) {
        if (step.inputPos > maxInputPos) {
            maxInputPos = step.inputPos;
            stateAtMax = step;
        }
    }

    // Determine what was expected at that position
    const expected = getExpectedChars(stateAtMax.patternPos, pattern);
    const found = input[maxInputPos] || 'end of input';

    return {
        position: maxInputPos,
        matched: input.substring(0, maxInputPos),
        expected: expected,
        found: found,
        explanation: `After matching "${input.substring(0, maxInputPos)}", ` +
                     `expected ${expected.join(' or ')} but found '${found}'`
    };
}

function generateRailroadSVG(ast, x = 0, y = 0) {
    const padding = 10;
    const charWidth = 20;
    const charHeight = 30;

    let svg = '';
    let width = 0;

    switch (ast.type) {
        case 'literal':
            svg = `<g transform="translate(${x}, ${y})">
                     <rect x="0" y="0" width="${charWidth}" height="${charHeight}"
                           rx="3" fill="#ddf" stroke="#333"/>
                     <text x="${charWidth/2}" y="${charHeight/2 + 5}"
                           text-anchor="middle">${escapeHtml(ast.char)}</text>
                   </g>`;
            width = charWidth + padding;
            break;

        case 'concat':
            let currentX = x;
            for (const child of ast.children) {
                const result = generateRailroadSVG(child, currentX, y);
                svg += result.svg;
                currentX += result.width;
            }
            width = currentX - x;
            break;

        // ... handle other AST types
    }

    return { svg, width, height: charHeight };
}

Books That Will Help

Topic Book Chapter/Section
Regex engine internals for debugging “Mastering Regular Expressions” by Friedl Chapters 4, 6: The Mechanics of Expression Processing
Backtracking behavior “Mastering Regular Expressions” by Friedl Chapter 4: Backtracking
Railroad diagram theory “The Definitive ANTLR 4 Reference” by Parr Chapter 5: Designing Grammars
SVG for diagrams “SVG Essentials” by Eisenberg Chapters 3-5: Basic Shapes, Paths, Text
Performance analysis “High Performance JavaScript” by Zakas Chapter 5: Strings and Regular Expressions
Parser error reporting “Crafting Interpreters” by Nystrom Chapter 16: Scanning (error handling)

Common Pitfalls & Debugging

Problem 1: “Trace output is misleading”

  • Why: Backtracking steps are not logged in order.
  • Fix: Log each state transition and backtrack explicitly.
  • Quick test: Visualize (a+)+$ and confirm backtrack steps.

Problem 2: “Graph rendering is unreadable”

  • Why: No layout or too many nodes.
  • Fix: Use clustering and collapse repeated subgraphs.
  • Quick test: Render a moderate regex and verify clarity.

Problem 3: “Positions drift in visualization”

  • Why: Character offsets not tracked against input.
  • Fix: Store spans and annotate them clearly.
  • Quick test: Highlight correct match spans for each step.

Problem 4: “Slow on complex patterns”

  • Why: Visualization generates full graphs for large NFAs.
  • Fix: Provide sampling or limits for graph size.
  • Quick test: Patterns with many alternations should still render.

Definition of Done

  • Shows step-by-step match trace with backtracking
  • Outputs NFA/DFA diagrams in DOT or SVG
  • Highlights matches and group spans accurately
  • Provides a CLI or web UI for inspection
  • Includes regression tests for tricky patterns

Project 13: Regex Optimizer

  • File: LEARN_REGEX_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: Optimization / Automata
  • Software or Tool: Regex Optimizer
  • Main Book: “Mastering Regular Expressions” by Jeffrey Friedl

What you’ll build: A regex optimizer that analyzes patterns, suggests improvements, converts NFA to DFA for faster matching, and identifies performance problems.

Why it teaches regex: Understanding optimization requires deep knowledge of how regex engines work, the difference between NFA and DFA, and what makes patterns slow.

Core challenges you’ll face:

  • NFA to DFA conversion → maps to subset construction
  • DFA minimization → maps to state equivalence
  • Pattern simplification → maps to algebraic laws
  • Anchoring optimization → maps to match start detection

Key Concepts:

  • Subset Construction: “Introduction to Automata Theory” Chapter 2
  • DFA Minimization: “Engineering a Compiler” Chapter 2
  • Regex Optimization: “Mastering Regular Expressions” Chapter 6

Difficulty: Expert. Time estimate: 1 month. Prerequisites: Project 11.

Real World Outcome

from regex_optimizer import optimize

# Simplify redundant patterns
optimize(r'(a|a)')
# Simplified: 'a'

optimize(r'a*a*')
# Simplified: 'a*'

optimize(r'[a-zA-Z0-9_]')
# Simplified: '\w' (if locale permits)

# Remove unnecessary groups
optimize(r'((a)(b))(c)')
# Simplified: 'abc' (if groups aren't captured)

# Factor common prefixes
optimize(r'foobar|foobaz')
# Optimized: 'fooba[rz]'

optimize(r'cat|car|cab')
# Optimized: 'ca[trb]'

# Anchor optimization
optimize(r'.*foo')
# Warning: "Starts with .* - will scan entire string. Consider anchoring or using specific start."

# NFA to DFA conversion
result = optimize(r'(a|b)*abb', convert_to_dfa=True)
# Returns optimized DFA with minimal states

# Performance analysis
analyze(r'(a+)+$')
# {
#   'complexity': 'exponential',
#   'reason': 'Nested quantifiers with overlapping matches',
#   'suggestion': 'Use possessive quantifier: (a+)++$ or atomic group: (?>a+)+$'
# }

# Equivalent simpler pattern
equivalent(r'^[0-9][0-9]*$')
# Simpler: '^\d+$'

equivalent(r'(foo|bar|baz){1,1}')
# Simpler: 'foo|bar|baz'

# Compile to DFA for repeated matching
dfa = compile_to_dfa(r'\b\w+@\w+\.\w+\b')
for line in huge_file:
    if dfa.match(line):  # O(n) guaranteed, no backtracking
        print(line)

Implementation Hints:

NFA to DFA (Subset Construction):

def nfa_to_dfa(nfa):
    """Convert an NFA to a DFA using subset construction."""
    # Each DFA state is a frozenset of NFA states
    start_state = frozenset(epsilon_closure({nfa.start}))

    dfa_states = {start_state}
    dfa_transitions = {}
    worklist = [start_state]

    while worklist:
        current = worklist.pop()

        for symbol in nfa.alphabet:  # the NFA is assumed to carry its input alphabet
            # Find all NFA states reachable from the current set via this symbol
            next_nfa_states = set()
            for nfa_state in current:
                if symbol in nfa_state.transitions:
                    next_nfa_states.update(nfa_state.transitions[symbol])

            next_dfa_state = frozenset(epsilon_closure(next_nfa_states))

            if next_dfa_state and next_dfa_state not in dfa_states:
                dfa_states.add(next_dfa_state)
                worklist.append(next_dfa_state)

            if next_dfa_state:
                dfa_transitions[(current, symbol)] = next_dfa_state

    # A DFA state is accepting if its set contains the NFA's accept state
    accept_states = {s for s in dfa_states if nfa.accept in s}
    return DFA(start_state, dfa_states, dfa_transitions, accept_states)
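
The code above calls an epsilon_closure helper that is not defined. A minimal sketch, assuming each NFA state object exposes an epsilon_transitions iterable (that attribute name is an assumption; adapt it to your NFA class):

def epsilon_closure(states):
    """Return all NFA states reachable from `states` via epsilon edges only."""
    closure = set(states)
    stack = list(states)
    while stack:
        state = stack.pop()
        # `epsilon_transitions` is an assumed attribute listing epsilon-reachable states
        for nxt in getattr(state, 'epsilon_transitions', ()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure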

DFA minimization:

def minimize_dfa(dfa):
    """Minimize a DFA using naive partition refinement."""
    # Initial partition: accepting vs non-accepting states
    accept = {s for s in dfa.states if dfa.is_accepting(s)}
    non_accept = dfa.states - accept
    partition = [g for g in (accept, non_accept) if g]  # drop an empty group if all/none accept

    # Refine until no group can be split further
    while True:
        new_partition = []
        changed = False

        for group in partition:
            # States stay together only if, for every symbol,
            # they transition into the same partition block
            subgroups = split_by_transitions(group, partition, dfa)
            new_partition.extend(subgroups)
            if len(subgroups) > 1:
                changed = True

        if not changed:
            break
        partition = new_partition

    # Each surviving block becomes one state of the minimized DFA
    return build_minimized_dfa(partition, dfa)
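
split_by_transitions is left as a helper above. One naive way to write it, assuming the DFA exposes dfa.alphabet and a dfa.transition(state, symbol) method (both names are assumptions for this sketch):

def split_by_transitions(group, partition, dfa):
    """Split `group` so states stay together only if, for every symbol,
    they transition into the same block of the current partition."""
    def block_index(state):
        for i, block in enumerate(partition):
            if state in block:
                return i
        return -1  # dead or undefined transition

    buckets = {}
    for state in group:
        # The "signature" of a state: which block it reaches on each symbol
        signature = tuple(block_index(dfa.transition(state, sym))
                          for sym in sorted(dfa.alphabet))
        buckets.setdefault(signature, set()).add(state)
    return list(buckets.values())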

Pattern simplification rules:

SIMPLIFICATION_RULES = [
    # a|a -> a
    (r'\(([^)]+)\|\1\)', r'\1'),

    # a*a* -> a*
    (r'([^*+?])\*\1\*', r'\1*'),

    # a?a? -> a{0,2}
    (r'([^*+?])\?\1\?', r'\1{0,2}'),

    # [a-za-z] -> [a-z] (overlapping ranges)
    # (more complex logic needed)

    # ^.*x -> x (if x is at start, .* is wasteful)
    # (context-dependent)
]
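
These string-level rules are fragile around escapes and nested groups; the Layer 2 hint below recommends rewriting the AST instead. A minimal AST-based sketch of the a|a -> a rule, using a throwaway dict representation ('type' plus 'children'/'child' keys, invented here purely for illustration):

def simplify(ast):
    """Bottom-up simplification: currently only collapses duplicate alternatives (a|a -> a)."""
    if ast['type'] == 'alt':
        # Simplify children first, then drop structural duplicates while preserving order
        children = [simplify(c) for c in ast['children']]
        seen, unique = set(), []
        for c in children:
            key = repr(c)  # structural key; adequate for small ASTs
            if key not in seen:
                seen.add(key)
                unique.append(c)
        return unique[0] if len(unique) == 1 else {'type': 'alt', 'children': unique}
    if 'children' in ast:
        return {**ast, 'children': [simplify(c) for c in ast['children']]}
    if 'child' in ast:
        return {**ast, 'child': simplify(ast['child'])}
    return ast

# Example: the AST for (a|a) collapses to the AST for a
ast = {'type': 'alt', 'children': [{'type': 'literal', 'char': 'a'},
                                   {'type': 'literal', 'char': 'a'}]}
print(simplify(ast))  # {'type': 'literal', 'char': 'a'}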

Learning milestones:

  1. NFA to DFA works → You understand subset construction
  2. DFA minimization works → You understand state equivalence
  3. Simplification rules apply → You understand regex algebra
  4. Performance warnings accurate → You understand complexity

The Core Question You’re Answering

“How do you transform a regex from a form that’s expressive and flexible into one that’s optimal for execution, and what trade-offs does this involve?”

This project forces you to confront the fundamental tension in regex engineering: NFAs are easy to construct but can be slow to execute; DFAs are fast but can explode in size. Your optimizer must navigate this landscape intelligently, knowing when to convert, when to simplify, and when to warn.

Concepts You Must Understand First

Before attempting this project, you need solid grounding in:

  • NFA Construction: you’re optimizing what Thompson’s construction produces. “Introduction to Automata Theory”, Chapter 2.3
  • NFA vs DFA Trade-offs: the exponential state blowup is real; you must understand when DFA conversion helps vs hurts. “Introduction to Automata Theory”, Chapter 2.4
  • Subset Construction Algorithm: the core algorithm for NFA-to-DFA conversion. “Engineering a Compiler”, Chapter 2.4
  • DFA Minimization (Hopcroft’s Algorithm): partition refinement to find the minimal equivalent DFA. “Introduction to Automata Theory”, Chapter 4.4
  • Backtracking Complexity: understanding why (a+)+$ is catastrophic requires NFA simulation knowledge. “Mastering Regular Expressions”, Chapter 6
  • Possessive Quantifiers/Atomic Groups: these eliminate backtracking; you need to detect when they’re safe to suggest. “Mastering Regular Expressions”, Chapter 4
  • Regular Expression Algebra: simplification rules like a*a* = a* come from formal language theory. “Introduction to Automata Theory”, Chapter 3

Russ Cox articles to read:

  • “Regular Expression Matching Can Be Simple And Fast” - The foundation
  • “Regular Expression Matching: the Virtual Machine Approach” - NFA simulation
  • “Regular Expression Matching in the Wild” - Real-world RE2 optimizations

Questions to Guide Your Design

  1. Subset Construction:
    • How will you compute epsilon-closure efficiently? (Hint: BFS or DFS with memoization)
    • When the DFA has exponentially many states, do you give up or use lazy construction?
    • How do you handle the resulting DFA that might be larger than available memory?
  2. DFA Minimization:
    • Will you use Hopcroft’s algorithm (O(n log n)) or naive partition refinement (O(n^2))?
    • How do you handle minimization of huge DFAs from complex regexes?
    • Can you do incremental minimization as states are added?
  3. Pattern Simplification:
    • What algebraic rules do you implement? (a|a -> a, a*a* -> a*, [a-za-z] -> [a-z])
    • How do you detect semantically equivalent patterns? (Equivalence is decidable but PSPACE-complete for pure regular expressions, and undecidable once backreferences are involved.)
    • Do you normalize patterns to a canonical form before comparison?
  4. Common Prefix Factoring:
    • How do you identify common prefixes in alternations (foo|far -> f(oo|ar))?
    • Is this a tree transformation on the AST or a trie-based approach?
    • When does factoring hurt performance (cache locality, branch prediction)?
  5. Catastrophic Backtracking Detection:
    • What patterns indicate exponential backtracking? (Nested quantifiers, overlapping alternatives)
    • Can you statically analyze the NFA to detect these without running it?
    • How do you suggest fixes (possessive quantifiers, atomic groups, pattern rewriting)?
  6. Optimization Trade-offs:
    • When is NFA simulation actually faster than DFA lookup? (Small regexes, one-time use)
    • How do you measure “optimization benefit” to decide whether to apply a transformation?
    • What’s your strategy for optimizing without changing match semantics?

Thinking Exercise

Before writing any code, work through this on paper:

Part 1: Subset Construction by Hand. Convert this NFA to a DFA using subset construction:

Regex: (a|b)*abb

Simplified Thompson-style NFA ("e" marks epsilon edges):

   q0 --e--> q1,   q1 --a--> q2,   q2 --e--> q3     (the "a" branch)
   q0 --e--> q4,   q4 --b--> q5,   q5 --e--> q3     (the "b" branch)
   q0 --e--> q3                                     (skip the star entirely)
   q3 --e--> q0                                     (loop back for another iteration)

   Then the "abb" tail:  q3 --a--> q6,  q6 --b--> q7,  q7 --b--> q8   (q8 accepts)

Work through the algorithm:

  1. Start state = epsilon-closure({q0})
  2. For each DFA state and each symbol, compute next DFA state
  3. Mark accepting states (those containing q8)
  4. How many DFA states do you get? (Spoiler: 5)

Part 2: Minimization. Now minimize this DFA by hand:

  1. Initial partition: {accepting states}, {non-accepting states}
  2. Refine based on transitions
  3. How many states in the minimal DFA? (Spoiler: 4 - two of the five subset-construction states turn out to be equivalent and merge.)
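
To check your answers to Parts 1 and 2 by machine, here is a small script that hard-codes the sketch’s edges and runs subset construction (plain Python, no project code required):

# Edges of the simplified NFA sketch for (a|b)*abb (state 8 accepts)
EPS = {0: {1, 4, 3}, 2: {3}, 5: {3}, 3: {0}}
MOVE = {(1, 'a'): {2}, (3, 'a'): {6}, (4, 'b'): {5}, (6, 'b'): {7}, (7, 'b'): {8}}

def closure(states):
    """Epsilon-closure over the hard-coded EPS edges."""
    out, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in EPS.get(s, ()):
            if t not in out:
                out.add(t)
                stack.append(t)
    return frozenset(out)

start = closure({0})
dfa_states, worklist, transitions = {start}, [start], {}
while worklist:
    cur = worklist.pop()
    for sym in 'ab':
        nxt = set()
        for s in cur:
            nxt |= MOVE.get((s, sym), set())
        nxt = closure(nxt)
        if nxt:
            transitions[(cur, sym)] = nxt
            if nxt not in dfa_states:
                dfa_states.add(nxt)
                worklist.append(nxt)

print('DFA states:', len(dfa_states))                        # 5  (Part 1 answer)
print('accepting:', sum(1 for s in dfa_states if 8 in s))    # 1
# Minimization merges two equivalent states of this DFA, so the minimal DFA has 4 states (Part 2).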

Part 3: Catastrophic Pattern Analysis. For the regex (a+)+$:

  1. Draw the NFA
  2. Trace what happens when matching “aaaaX” (X doesn’t match)
  3. Count the backtracking steps. See the exponential blowup?
  4. Now explain why (?>a+)+$ (atomic group) would fix it
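
You can also feel the blowup directly with CPython’s backtracking re module; each extra 'a' roughly doubles the time of the failing match (timings are machine-dependent, so lower n if it drags):

import re
import time

pattern = re.compile(r'(a+)+$')
for n in (16, 18, 20, 22):
    text = 'a' * n + 'X'                  # the trailing X guarantees the match fails
    start = time.perf_counter()
    pattern.match(text)                   # backtracks through ~2^(n-1) ways to split the a's
    print(n, round(time.perf_counter() - start, 3), 'seconds')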

The Interview Questions They’ll Ask

  1. “Explain the subset construction algorithm for NFA to DFA conversion.”
    • Walk through the algorithm step by step
    • Discuss the worst-case exponential blowup (2^n states)
    • Explain when lazy construction helps
  2. “What causes catastrophic backtracking, and how would you detect it statically?”
    • Nested quantifiers with overlapping matches
    • NFAs with multiple paths that can match the same prefix
    • Static analysis: look for loops in NFA with multiple epsilon paths back
  3. “Compare Hopcroft’s algorithm vs naive partition refinement for DFA minimization.”
    • Naive: O(n^2) - compare all pairs, split on differences
    • Hopcroft: O(n log n) - process splitters from a worklist
    • Explain why Hopcroft’s refinement order matters
  4. “How would you optimize the regex foob(ar|az) differently from (foobar|foobaz)?”
    • First is already optimized (common prefix factored)
    • Second needs common prefix extraction: fooba(r|z)
    • Discuss trie-based alternation optimization
  5. “What are possessive quantifiers and atomic groups? When would you suggest them?”
    • Possessive: a++ - never gives back matched characters
    • Atomic: (?>...) - group that can’t backtrack into
    • Suggest them when whatever follows the quantifier cannot match the text it consumed, so giving characters back could never produce a match
  6. “How do you decide whether to convert an NFA to a DFA for a given use case?”
    • One-time match: NFA simulation is often faster
    • Many matches on same pattern: DFA pays off
    • Memory constraints: DFA might be too large
    • Need for captures/backreferences: DFA can’t do this
  7. “Given unlimited resources, what would the ideal regex optimizer do?”
    • Profile-guided optimization based on actual input distribution
    • JIT compilation to native code
    • SIMD-accelerated character class matching
    • Hybrid engine selection based on pattern analysis
  8. “How would you test that your optimizer preserves regex semantics?”
    • Property-based testing: match(optimize(r), s) == match(r, s) for all s
    • Fuzzing with random regexes and strings
    • Known equivalence pairs as regression tests
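
The property-based test from question 8, sketched in code. It assumes this project’s optimize function returns a plain pattern string (the regex_optimizer import is the project’s own API, not a standard library): generate random strings over a small alphabet and require that the optimized pattern matches exactly when the original does.

import random
import re
from regex_optimizer import optimize   # this project's API (assumed)

def equivalent_on_random_inputs(pattern, trials=1000, alphabet='ab', max_len=12):
    """Property check: optimizing must not change which strings match."""
    original = re.compile(pattern)
    optimized = re.compile(optimize(pattern))
    for _ in range(trials):
        s = ''.join(random.choice(alphabet) for _ in range(random.randint(0, max_len)))
        if bool(original.fullmatch(s)) != bool(optimized.fullmatch(s)):
            return False, s   # counterexample found
    return True, None

ok, counterexample = equivalent_on_random_inputs(r'a*a*')
print(ok, counterexample)    # expect: True None (if optimize is semantics-preserving)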

Hints in Layers

Layer 1 - Getting Started:

  • Start with subset construction only - get NFA to DFA working first
  • Use frozensets to represent DFA states (sets of NFA states)
  • Implement epsilon-closure as a helper function with memoization
  • Test with simple regexes like a*b before tackling (a|b)*abb

Layer 2 - Making It Work:

  • For DFA minimization, start with the naive O(n^2) algorithm - it’s easier to understand
  • Implement one simplification rule at a time (a|a -> a first)
  • Use the AST for pattern simplification, not regex-on-regex transformation
  • For catastrophic backtracking, look for nested quantifiers in the AST

Layer 3 - Production Quality:

  • Switch to Hopcroft’s algorithm for O(n log n) minimization
  • Implement lazy DFA construction for patterns that would explode
  • Add a state limit: if DFA exceeds N states, fall back to NFA simulation
  • Common prefix factoring is essentially building a trie from alternatives
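
To make that last bullet concrete, here is a sketch of longest-common-prefix factoring for literal-only alternatives (alternatives containing metacharacters need the AST-based trie approach):

import os
import re

def factor_prefix(alternatives):
    """Factor the longest common prefix out of literal alternatives.
    ['foobar', 'foobaz'] -> 'fooba[rz]'."""
    prefix = os.path.commonprefix(alternatives)
    tails = [re.escape(alt[len(prefix):]) for alt in alternatives]
    if all(len(t) == 1 for t in tails):
        # Every tail is a single plain character: a character class is tightest
        return re.escape(prefix) + '[' + ''.join(tails) + ']'
    return re.escape(prefix) + '(?:' + '|'.join(tails) + ')'

print(factor_prefix(['foobar', 'foobaz']))   # fooba[rz]
print(factor_prefix(['cat', 'car', 'cab']))  # ca[trb]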

Layer 4 - Expert Optimizations:

  • Implement Boyer-Moore-like optimizations for literal prefixes
  • Detect “regex is just a literal string” and bypass the engine entirely (see the sketch after this list)
  • Static analysis for backtracking: build an “ambiguity graph” of NFA
  • Profile-guided optimization: track which states are hot paths
  • Consider implementing JIT compilation for the DFA transition table
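
A sketch of the “pattern is just a literal string” fast path from the second bullet above, reusing the throwaway dict AST from earlier; everything here is illustrative, not a fixed API:

def literal_string_of(ast):
    """Return the literal string the AST matches, or None if it is not purely literal."""
    if ast['type'] == 'literal':
        return ast['char']
    if ast['type'] == 'concat':
        parts = [literal_string_of(c) for c in ast['children']]
        return None if any(p is None for p in parts) else ''.join(parts)
    return None  # quantifiers, classes, alternation, anchors: not a plain literal

def fast_search(ast, text):
    """Bypass the regex engine when the pattern is just a literal string."""
    literal = literal_string_of(ast)
    if literal is not None:
        pos = text.find(literal)          # highly optimized substring search
        return None if pos < 0 else (pos, pos + len(literal))
    return None  # caller falls back to the real engine

# Example with the same dict-based AST shape used earlier
ast = {'type': 'concat', 'children': [{'type': 'literal', 'char': 'a'},
                                      {'type': 'literal', 'char': 'b'}]}
print(fast_search(ast, 'xxab'))  # (2, 4)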

Books That Will Help

  • Subset Construction Algorithm: “Introduction to Automata Theory” (Hopcroft, Motwani, Ullman), Chapters 2.3-2.4 (Converting NFAs to DFAs)
  • DFA Minimization (Hopcroft’s Algorithm): “Introduction to Automata Theory”, Chapter 4.4 (Minimization of DFAs)
  • Partition Refinement: “Engineering a Compiler” (Cooper & Torczon), Chapter 2.4 (From NFA to DFA)
  • Regex Engine Optimization: “Mastering Regular Expressions” (Friedl), Chapter 6 (Crafting an Efficient Expression)
  • Catastrophic Backtracking: “Mastering Regular Expressions”, Chapter 6 (Runaway Regular Expressions)
  • Possessive Quantifiers: “Mastering Regular Expressions”, Chapter 4 (Possessive Quantifiers)
  • Atomic Groups: “Mastering Regular Expressions”, Chapter 4 (Atomic Grouping)
  • Regular Expression Algebra: “Introduction to Automata Theory”, Chapter 3 (Regular Expressions and Languages)
  • Real-World Optimization: Russ Cox’s article “Regular Expression Matching in the Wild”
  • Lexer Optimization Techniques: “Engineering a Compiler”, Chapter 2.5 (Implementing Scanners)

Common Pitfalls & Debugging

Problem 1: “Optimization changes semantics”

  • Why: Algebraic rewrites are applied without equivalence checks.
  • Fix: Add property-based tests comparing original vs optimized.
  • Quick test: match(opt(r), s) == match(r, s) for random s.

Problem 2: “DFA blows up”

  • Why: Subset construction is applied without state limits.
  • Fix: Add thresholds and fallback to NFA.
  • Quick test: (a|b|c|d)* should stay tiny, while (a|b)*a(a|b){15} should hit the state limit and trigger the fallback.

Problem 3: “Simplification misses cases”

  • Why: Rewrite rules are incomplete or applied out of order.
  • Fix: Normalize patterns before applying rules.
  • Quick test: a|a should simplify to a.

Problem 4: “Performance claims are unverified”

  • Why: No benchmarks on real inputs.
  • Fix: Benchmark before/after on sample corpora.
  • Quick test: Use hyperfine on regex workloads.

Definition of Done

  • Implements NFA->DFA conversion with limits
  • Provides at least 5 safe simplification rules
  • Detects catastrophic backtracking patterns
  • Produces optimization report with metrics
  • Verified equivalence on randomized tests

Project 14: Full Regex Suite (Capstone)

  • File: LEARN_REGEX_DEEP_DIVE.md
  • Main Programming Language: Rust (for performance) or Python (for clarity)
  • Alternative Programming Languages: Go, C
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 5. The “Industry Disruptor”
  • Difficulty: Level 5: Master
  • Knowledge Area: Regex Engine Implementation
  • Software or Tool: Complete Regex Library
  • Main Book: “Introduction to Automata Theory” + “Mastering Regular Expressions”

What you’ll build: A complete regex library with:

  • Full regex syntax support (PCRE-like)
  • Both NFA (backtracking) and DFA (linear time) engines
  • Unicode support
  • Named groups, lookahead, lookbehind
  • Debugging and visualization tools
  • Performance analysis

Components to integrate:

  1. Parser for full regex syntax
  2. NFA construction (Thompson’s)
  3. NFA simulation with backtracking
  4. DFA construction for linear-time matching
  5. Optimization passes
  6. Debug/visualization output
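
Integrating both engines means deciding, per pattern, which one to run. A crude string-level heuristic is sketched below; a real implementation should make this decision from the parsed AST, but the policy is the same: features a DFA cannot express force the backtracking engine.

import re

def pick_engine(pattern):
    """Heuristic engine selection: backreferences or lookaround force the NFA/backtracking engine."""
    needs_backtracking = re.search(
        r'\\[1-9]'            # numeric backreference, e.g. \1
        r'|\(\?P=\w+\)'       # named backreference (Python syntax)
        r'|\(\?<?[=!]',       # lookahead / lookbehind opening
        pattern)
    return 'nfa' if needs_backtracking else 'dfa'

print(pick_engine(r'(a|b)*abb'))          # dfa
print(pick_engine(r'(?<=\$)\d+(?=\.)'))   # nfa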

Real World Outcome

from myregex import Regex, RegexBuilder

# Full syntax support
r = Regex(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')
match = r.match('2024-12-25')
print(match.group('year'))   # '2024'
print(match.group('month'))  # '12'

# Lookahead/lookbehind
r = Regex(r'(?<=\$)\d+(?=\.)')
r.findall('$100.00 and $200.50')  # ['100', '200']

# Engine selection
r = Regex(r'simple-pattern', engine='dfa')  # O(n) guaranteed
r = Regex(r'complex(?=pattern)', engine='nfa')  # Backtracking for lookahead

# Performance mode
r = Regex(r'pattern', optimize=True)  # Run optimization passes

# Debug mode
r = Regex(r'(a+)+$', debug=True)
# Warning: Potential catastrophic backtracking detected

# Visualization
r.to_railroad()  # SVG railroad diagram
r.to_nfa_graph()  # DOT format NFA
r.to_dfa_graph()  # DOT format DFA

Time estimate: 3-4 months. Prerequisites: Complete at least 8 of the earlier projects.

Learning milestones:

  1. Full syntax parses → You understand regex grammar
  2. Both engines work → You understand NFA vs DFA trade-offs
  3. Unicode works → You understand character encoding
  4. Lookahead works → You understand advanced features
  5. Performance is competitive → You understand optimization

Common Pitfalls & Debugging

Problem 1: “Engines disagree on semantics”

  • Why: NFA and DFA paths return different capture behavior.
  • Fix: Define one canonical semantics and enforce it across engines.
  • Quick test: Compare capture groups for the same input.

Problem 2: “Unicode support is partial”

  • Why: Character classes are ASCII-only or normalization is missing.
  • Fix: Integrate Unicode tables and test across scripts.
  • Quick test: Match identifiers with non-Latin letters.

Problem 3: “Performance regressions”

  • Why: Optimization pipeline changes without profiling.
  • Fix: Add regression benchmarks and enforce budgets.
  • Quick test: Benchmark standard patterns and alert on slowdown.

Problem 4: “Debug tools are inconsistent”

  • Why: Different output formats or incomplete trace data.
  • Fix: Standardize debug output across components.
  • Quick test: Generate NFA/DFA/trace for the same pattern.

Definition of Done

  • Supports a documented regex syntax (PCRE-like subset)
  • Provides both backtracking and automata engines
  • Includes Unicode-aware character classes
  • Exposes debugging and visualization tools
  • Includes a conformance test suite and benchmarks

Additional Resources

Books

  • “Mastering Regular Expressions” by Jeffrey Friedl - THE regex bible
  • “Regular Expressions Cookbook” by Goyvaerts & Levithan - Practical patterns
  • “Introduction to Automata Theory” by Hopcroft, Motwani, Ullman - Theory

Online Resources

  • regex101.com - Interactive regex tester with explanation
  • Russ Cox’s articles - “Regular Expression Matching Can Be Simple And Fast”
  • Debuggex - Visual regex debugger

Tools to Study

  • RE2 - Google’s linear-time regex engine (no backtracking)
  • PCRE - Perl-compatible regex (full features)
  • Hyperscan - Intel’s high-performance regex engine

Practice

  • Regex Golf - Write shortest regex for given matches
  • Regex Crossword - Puzzles using regex
  • HackerRank Regex Challenges

Cheat Sheet

Metacharacters

.       Any character (except newline)
^       Start of string/line
$       End of string/line
*       0 or more
+       1 or more
?       0 or 1 (optional)
|       Alternation (or)
()      Grouping
[]      Character class
{}      Quantifier range
\       Escape

Character Classes

[abc]   a, b, or c
[^abc]  Not a, b, or c
[a-z]   a through z
\d      Digit [0-9]
\D      Non-digit
\w      Word char [a-zA-Z0-9_]
\W      Non-word char
\s      Whitespace
\S      Non-whitespace

Quantifiers

*       0 or more (greedy)
+       1 or more (greedy)
?       0 or 1 (greedy)
{n}     Exactly n
{n,}    n or more
{n,m}   n to m
*?      0 or more (lazy)
+?      1 or more (lazy)
??      0 or 1 (lazy)
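
The greedy/lazy distinction is easiest to see on a concrete string (Python shown; the behavior is the same in most flavors):

import re

html = '<b>bold</b> and <i>italic</i>'
print(re.findall(r'<.+>', html))    # ['<b>bold</b> and <i>italic</i>']  (greedy: one huge match)
print(re.findall(r'<.+?>', html))   # ['<b>', '</b>', '<i>', '</i>']     (lazy: shortest matches)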

Groups & References

(...)       Capture group
(?:...)     Non-capturing group
(?P<n>...)  Named group (Python)
(?<n>...)   Named group (other)
\1, \2      Backreference
\g<name>    Named backreference (PCRE/Ruby; Python patterns use (?P=name))

Lookaround

(?=...)     Positive lookahead
(?!...)     Negative lookahead
(?<=...)    Positive lookbehind
(?<!...)    Negative lookbehind

Flags

i   Case-insensitive
g   Global (find all)
m   Multiline (^ and $ match line boundaries)
s   Dotall (. matches newline)
x   Verbose (allow whitespace and comments)
u   Unicode
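
Flags combine freely; for example, Python’s re.VERBOSE and re.IGNORECASE let you annotate a pattern inline (the email-ish pattern below is purely illustrative):

import re

pattern = re.compile(r'''
    (?P<user>[\w.+-]+)    # local part
    @
    (?P<host>[\w-]+       # first domain label
        (?:\.[\w-]+)+     # one or more dot-separated labels
    )
''', re.VERBOSE | re.IGNORECASE)

m = pattern.fullmatch('Regex.Fan+tips@Example.org')
print(m.group('user'), m.group('host'))   # Regex.Fan+tips Example.org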

Regular expressions are the closest thing programmers have to magic spells. Master them, and you’ll see patterns everywhere.