Learn Regular Expressions: From Zero to Pattern Master
Goal: Deeply understand regular expressions—from basic pattern matching to lookaheads, backreferences, and even building your own regex engine.
Why Regex Matters
Regular expressions are one of the most powerful tools in a programmer’s arsenal. They appear everywhere:
- Text editors: Find and replace with patterns
- Command line: grep, sed, awk
- Validation: Email, phone numbers, passwords
- Parsing: Logs, configuration files, data extraction
- Web scraping: Extracting data from HTML
- Compilers: Lexical analysis (tokenizers)
Yet most developers only know basic patterns and copy-paste the rest from Stack Overflow. After completing these projects, you will:
- Write complex patterns from scratch with confidence
- Understand the theory behind regex (finite automata)
- Debug and optimize slow patterns
- Know when regex is the right tool (and when it isn’t)
- Build a working regex engine from scratch
Core Concept Analysis
The Regex Mental Model
A regex is a pattern that describes a set of strings. Think of it as a tiny program that answers: “Does this string match?”
Pattern: /cat/
Matches: "cat", "catalog", "scatter"
Doesn't match: "dog", "Cat" (case sensitive)
Pattern: /^cat$/
Matches: "cat" (exactly)
Doesn't match: "catalog", "scatter" (anchored)
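A quick way to verify this mental model is Python's re module (a minimal illustration of the two patterns above):
import re
print(bool(re.search(r'cat', 'catalog')))    # True  (pattern found anywhere)
print(bool(re.search(r'^cat$', 'catalog')))  # False (anchors demand an exact match)
print(bool(re.search(r'^cat$', 'cat')))      # True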
Anatomy of a Regex
/^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z]{2,}$/i
 ^                   anchor (start)
 [A-Za-z0-9._%+-]+   local part
 @                   literal @
 [A-Za-z0-9.-]+      domain name
 \.                  literal dot before the TLD
 [A-Z]{2,}           TLD
 $                   anchor (end)
 i                   flag (case insensitive)
Fundamental Concepts
1. Literal Characters
Most characters match themselves:
/hello/ matches "hello" in "hello world"
2. Metacharacters (Special Characters)
These have special meaning and need escaping to match literally:
. ^ $ * + ? { } [ ] \ | ( )
/a.c/ matches "abc", "a1c", "a-c" (. = any char)
/a\.c/ matches "a.c" only (escaped dot)
3. Character Classes
Match one of several characters:
[abc] one of: a, b, or c
[a-z] any lowercase letter
[A-Za-z] any letter
[0-9] any digit
[^abc] NOT a, b, or c (negation)
4. Predefined Character Classes
Shortcuts for common patterns:
\d digit [0-9]
\D non-digit [^0-9]
\w word char [A-Za-z0-9_]
\W non-word [^A-Za-z0-9_]
\s whitespace [ \t\n\r\f\v]
\S non-whitespace
. any char (except newline, usually)
5. Anchors
Match positions, not characters:
^ start of string/line
$ end of string/line
\b word boundary
\B non-word boundary
6. Quantifiers
Specify how many times to match:
* 0 or more (greedy)
+ 1 or more (greedy)
? 0 or 1 (optional)
{3} exactly 3
{3,} 3 or more
{3,5} 3 to 5
*? 0 or more (lazy/non-greedy)
+? 1 or more (lazy)
7. Groups and Capturing
(abc) capturing group
(?:abc) non-capturing group
(?<name>x) named capturing group
\1, \2 backreferences
8. Lookahead and Lookbehind
Match based on what comes before/after, without consuming:
(?=...) positive lookahead (followed by)
(?!...) negative lookahead (not followed by)
(?<=...) positive lookbehind (preceded by)
(?<!...) negative lookbehind (not preceded by)
9. Alternation
cat|dog matches "cat" or "dog"
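These building blocks are easiest to absorb by experimenting. A short, self-contained Python session exercising greedy vs lazy quantifiers, backreferences, lookahead, and alternation:
import re
print(re.findall(r'<.+>', '<a><b>'))    # ['<a><b>']      greedy: grabs as much as possible
print(re.findall(r'<.+?>', '<a><b>'))   # ['<a>', '<b>']  lazy: stops at the first '>'
print(re.search(r'\b(\w+) \1\b', 'it is is here').group())  # 'is is' (repeated word)
print(re.findall(r'foo(?=bar)', 'foobar foobaz'))           # ['foo'] (only when followed by 'bar')
print(re.findall(r'cat|dog', 'cat dog bird'))               # ['cat', 'dog']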
Quick Reference Table
| Pattern | Meaning | Example Match |
|---|---|---|
| . | Any character | a.c → “abc”, “a1c” |
| ^ | Start of string | ^hello → “hello world” |
| $ | End of string | world$ → “hello world” |
| * | 0 or more | ab*c → “ac”, “abc”, “abbc” |
| + | 1 or more | ab+c → “abc”, “abbc” |
| ? | 0 or 1 | colou?r → “color”, “colour” |
| \d | Digit | \d+ → “123” |
| \w | Word character | \w+ → “hello_123” |
| \s | Whitespace | a\sb → “a b” |
| [abc] | Character class | [aeiou] → “a”, “e”, etc. |
| [^abc] | Negated class | [^0-9] → any non-digit |
| (...) | Capture group | (ab)+ → “abab” (captures “ab”) |
| \1 | Backreference | (.)\1 → “aa”, “bb” (repeated char) |
| (?=...) | Lookahead | foo(?=bar) → “foo” in “foobar” |
| (?<=...) | Lookbehind | (?<=@)\w+ → “gmail” in “@gmail” |
Project Progression Map
Level 1 (Beginner) Level 2 (Intermediate) Level 3 (Advanced)
───────────────────── ──────────────────────── ─────────────────────
┌─────────────────┐ ┌─────────────────────┐ ┌──────────────────┐
│ 1. Pattern │ │ 6. Log Parser │ │ 11. Regex Engine │
│ Matcher CLI │───────▶ │ │──────▶ │ (Basic) │
└─────────────────┘ └─────────────────────┘ └──────────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────────┐ ┌──────────────────┐
│ 2. Text │ │ 7. Markdown │ │ 12. Regex │
│ Validator │───────▶ │ Parser │──────▶ │ Debugger │
└─────────────────┘ └─────────────────────┘ └──────────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────────┐ ┌──────────────────┐
│ 3. Search & │ │ 8. Data Extractor │ │ 13. Regex │
│ Replace Tool │───────▶ │ (Web Scraping) │──────▶ │ Optimizer │
└─────────────────┘ └─────────────────────┘ └──────────────────┘
│ │
▼ ▼
┌─────────────────┐ ┌─────────────────────┐
│ 4. Input │ │ 9. Tokenizer / │
│ Sanitizer │───────▶ │ Lexer │
└─────────────────┘ └─────────────────────┘
│ │
▼ ▼
┌─────────────────┐ ┌─────────────────────┐
│ 5. URL/Email │ │ 10. Template │
│ Parser │───────▶ │ Engine │
└─────────────────┘ └─────────────────────┘
Project 1: Pattern Matcher CLI
- File: LEARN_REGEX_DEEP_DIVE.md
- Main Programming Language: Python
- Alternative Programming Languages: JavaScript, Go, Rust
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 1: Beginner
- Knowledge Area: Basic Pattern Matching
- Software or Tool: grep-like CLI
- Main Book: “Mastering Regular Expressions” by Jeffrey Friedl
What you’ll build: A command-line tool like grep that searches files for lines matching a regex pattern, with options for case-insensitive search, line numbers, and inverted matching.
Why it teaches regex: This project introduces you to basic regex operations in a practical context. You’ll learn how patterns are compiled, matched, and how different flags affect matching behavior.
Core challenges you’ll face:
- Pattern compilation → maps to understanding regex engines
- Line-by-line matching → maps to anchors and multiline mode
- Case insensitivity → maps to regex flags
- Match highlighting → maps to capturing match positions
Key Concepts:
- Basic Pattern Matching: “Mastering Regular Expressions” Chapter 1 - Friedl
- Regex in Python: Python `re` module documentation
- grep Internals: GNU grep source code comments
Difficulty: Beginner Time estimate: Weekend Prerequisites: Basic programming, command line familiarity
Real world outcome:
$ pgrep "error" server.log
server.log:42: Connection error: timeout
server.log:156: Database error: connection refused
server.log:203: Authentication error: invalid token
$ pgrep -i "ERROR" server.log # Case insensitive
server.log:42: Connection error: timeout
server.log:89: ERROR: Disk full
server.log:156: Database error: connection refused
$ pgrep -n "\d{3}-\d{4}" contacts.txt # Show line numbers
15: John: 555-1234
23: Jane: 555-5678
31: Bob: 555-9012
$ pgrep -v "^#" config.ini # Invert match (show non-comments)
host=localhost
port=8080
debug=true
$ pgrep -c "TODO" *.py # Count matches
main.py: 3
utils.py: 7
tests.py: 1
$ pgrep -o "https?://\S+" README.md # Only show matched part
https://github.com/user/repo
http://example.com/docs
https://api.example.com/v1
Implementation Hints:
Core matching logic:
1. Compile the regex pattern (with flags)
2. For each line in the file:
a. Apply the regex
b. If match found (or not found for -v):
- Store/display the result
- Track line numbers if needed
3. Handle output formatting (highlighting, counts, etc.)
Key operations to understand:
search() - Find pattern anywhere in string
match() - Find pattern at start of string
findall() - Find all occurrences
finditer() - Iterator over match objects (with positions)
Flags to implement:
-i re.IGNORECASE Case-insensitive matching
-v (invert logic) Show non-matching lines
-n (add numbers) Prefix with line numbers
-c (count mode) Only show count of matches
-o (only matching) Show only the matched part
-l (files only) Only show filenames with matches
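A minimal sketch of the core loop in Python (flag names follow the table above; highlighting and the -c, -o, and -l modes are left as exercises):
import argparse, re, sys

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument('pattern')
    ap.add_argument('files', nargs='+')
    ap.add_argument('-i', action='store_true', help='case-insensitive')
    ap.add_argument('-v', action='store_true', help='invert match')
    ap.add_argument('-n', action='store_true', help='show line numbers')
    args = ap.parse_args()
    try:
        regex = re.compile(args.pattern, re.IGNORECASE if args.i else 0)
    except re.error as e:
        sys.exit(f'invalid pattern: {e}')
    for path in args.files:
        with open(path) as f:
            for lineno, line in enumerate(f, 1):
                if bool(regex.search(line)) != args.v:  # != acts as XOR with the invert flag
                    prefix = f'{lineno}: ' if args.n else ''
                    print(f'{path}:{prefix}{line}', end='')

if __name__ == '__main__':
    main()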
Questions to consider:
- What happens with invalid regex syntax?
- How do you handle binary files?
- How do you match across multiple lines?
Learning milestones:
- Basic patterns work → You understand literal matching
- Metacharacters work → You understand special characters
- Flags affect matching → You understand regex modifiers
- Match positions available → You understand match objects
Project 2: Text Validator Library
- File: LEARN_REGEX_DEEP_DIVE.md
- Main Programming Language: JavaScript
- Alternative Programming Languages: Python, TypeScript, Go
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 1: Beginner
- Knowledge Area: Input Validation
- Software or Tool: Validation Library
- Main Book: “Regular Expressions Cookbook” by Goyvaerts & Levithan
What you’ll build: A validation library with pre-built patterns for common data types (email, phone, URL, credit card, etc.) and the ability to create custom validators with helpful error messages.
Why it teaches regex: Validation is where most developers first encounter regex. You’ll learn character classes, quantifiers, and anchors while building something immediately useful.
Core challenges you’ll face:
- Email validation → maps to complex character classes
- Phone number formats → maps to alternation and optional groups
- Credit card patterns → maps to quantifiers and grouping
- Error messages → maps to understanding why patterns fail
Key Concepts:
- Email Regex: RFC 5322 and practical simplifications
- Character Classes: “Regular Expressions Cookbook” Chapter 2 - Goyvaerts
- Anchoring: “Mastering Regular Expressions” Chapter 3 - Friedl
Difficulty: Beginner Time estimate: Weekend Prerequisites: Project 1, understanding of character classes
Real world outcome:
// Your validation library
const v = require('validex');
// Built-in validators
v.isEmail('user@example.com'); // true
v.isEmail('invalid-email'); // false
v.isPhone('(555) 123-4567'); // true
v.isPhone('+1-555-123-4567'); // true
v.isPhone('123'); // false
v.isURL('https://example.com/path'); // true
v.isCreditCard('4111111111111111'); // true (Visa)
v.isIPv4('192.168.1.1'); // true
v.isHexColor('#FF5733'); // true
v.isSlug('my-blog-post'); // true
v.isUUID('550e8400-e29b-41d4-a716-446655440000'); // true
// Validation with details
const result = v.validate('not-an-email', 'email');
// {
// valid: false,
// value: 'not-an-email',
// errors: ['Missing @ symbol', 'No domain specified']
// }
// Custom validators
const usernameValidator = v.create({
pattern: /^[a-zA-Z][a-zA-Z0-9_]{2,15}$/,
messages: {
format: 'Username must start with a letter',
length: 'Username must be 3-16 characters'
}
});
usernameValidator.test('john_doe'); // true
usernameValidator.test('2cool'); // false
// Chained validation
v.string('test@example.com')
.isEmail()
.maxLength(50)
.notContains('spam')
.validate(); // { valid: true, ... }
Implementation Hints:
Common validation patterns:
const patterns = {
// Email (simplified but practical)
email: /^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/,
// Phone (US formats)
phone: /^(\+1[-.\s]?)?(\(?\d{3}\)?[-.\s]?)?\d{3}[-.\s]?\d{4}$/,
// URL (keep a single quantifier on the path part; ([...]*)*  would backtrack catastrophically)
url: /^(https?:\/\/)?([\da-z.-]+)\.([a-z.]{2,6})([\/\w .-]*)\/?$/,
// Credit card (basic Luhn-valid length)
creditCard: /^\d{13,19}$/,
// IPv4
ipv4: /^((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$/,
// Hex color
hexColor: /^#?([a-fA-F0-9]{6}|[a-fA-F0-9]{3})$/,
// UUID
uuid: /^[0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/i,
// Slug (URL-friendly)
slug: /^[a-z0-9]+(?:-[a-z0-9]+)*$/
};
Understanding the email pattern:
/^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/
│ │ │ │ │ │
│ │ │ │ │ └── 2+ letter TLD
│ │ │ │ └── literal dot before TLD
│ │ │ └── domain (letters, numbers, dots, hyphens)
│ │ └── literal @ symbol
│ └── 1+ local characters
└── start anchor
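One way to produce the detailed error messages shown in the outcome above: when the full pattern fails, run cheaper sub-checks that each explain one failure. A sketch in Python (the function name and messages are illustrative, not from any library):
import re
EMAIL = re.compile(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$')

def validate_email(value):
    if EMAIL.match(value):
        return {'valid': True, 'value': value, 'errors': []}
    errors = []
    if '@' not in value:
        errors.append('Missing @ symbol')
    if not re.search(r'@[^@]+\.', value):
        errors.append('No domain specified')
    return {'valid': False, 'value': value, 'errors': errors}

print(validate_email('not-an-email'))
# {'valid': False, 'value': 'not-an-email', 'errors': ['Missing @ symbol', 'No domain specified']}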
Questions to consider:
- Should you use a simple regex or the full RFC 5322 spec for email?
- How do you validate international phone numbers?
- Why is credit card validation more than just a pattern (Luhn algorithm)?
Learning milestones:
- Character classes work → You understand [a-z] and \d
- Quantifiers work → You understand +, *, {n,m}
- Anchors work → You understand ^, $
- Groups work → You understand (…) for structure
Project 3: Search and Replace Tool
- File: LEARN_REGEX_DEEP_DIVE.md
- Main Programming Language: Python
- Alternative Programming Languages: JavaScript, Go, Rust
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Substitution / Backreferences
- Software or Tool: sed-like Tool
- Main Book: “Mastering Regular Expressions” by Jeffrey Friedl
What you’ll build: A text transformation tool that can search and replace using regex patterns with capture groups and backreferences—like sed but with a friendlier interface.
Why it teaches regex: Replacement patterns with backreferences are where regex becomes truly powerful. You’ll learn how captured groups can be rearranged and transformed in the replacement.
Core challenges you’ll face:
- Capture groups → maps to grouping and referencing
- Backreferences → maps to using captures in replacement
- Named groups → maps to readable replacements
- Global replacement → maps to replace all vs first
Key Concepts:
- Backreferences: “Mastering Regular Expressions” Chapter 5 - Friedl
- Named Groups: “Regular Expressions Cookbook” Chapter 3 - Goyvaerts
- Replacement Syntax: Language-specific documentation
Difficulty: Intermediate Time estimate: 1 week Prerequisites: Project 2, understanding of groups
Real world outcome:
# Basic replacement
$ rex 's/color/colour/' file.txt
Changed 5 occurrences
# Using capture groups
$ rex 's/(\w+), (\w+)/\2 \1/' names.txt
# "Smith, John" → "John Smith"
# Named groups for clarity
$ rex 's/(?P<last>\w+), (?P<first>\w+)/\g<first> \g<last>/' names.txt
# Multiple patterns
$ rex -e 's/foo/bar/' -e 's/baz/qux/' file.txt
# Date format conversion
$ rex 's/(\d{2})\/(\d{2})\/(\d{4})/\3-\1-\2/' dates.txt
# "12/25/2024" → "2024-12-25"
# Case transformation (with special replacement syntax)
$ rex 's/([a-z]+)/\U\1/' words.txt # Uppercase
# "hello" → "HELLO"
$ rex 's/([A-Z])([A-Z]+)/\1\L\2/' shout.txt # Title case
# "HELLO" → "Hello"
# Preview changes without applying
$ rex -n 's/error/ERROR/' log.txt
Would change:
Line 42: Connection error → Connection ERROR
Line 89: Database error → Database ERROR
# In-place editing with backup
$ rex -i.bak 's/old/new/' file.txt
Created backup: file.txt.bak
Modified: file.txt
Implementation Hints:
Understanding backreferences:
Pattern: (\w+), (\w+)
Input: "Smith, John"
Match: Group 1 = "Smith", Group 2 = "John"
Replacement: \2 \1
Output: "John Smith"
Implementation approach:
import re
def replace(text, pattern, replacement, flags=0):
# Handle backreference syntax differences
# Python uses \1, JavaScript uses $1
# Compile pattern
regex = re.compile(pattern, flags)
# Replace with group expansion
result = regex.sub(replacement, text)
return result
# Named group example
pattern = r'(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{4})'
replacement = r'\g<year>-\g<month>-\g<day>'
Case transformation in replacement:
\U uppercase all following
\L lowercase all following
\E end case modification
\u uppercase next char only
\l lowercase next char only
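Python's re module does not support \U or \L in replacement strings (sed and Perl do), so implementing these flags means substituting with a function instead of a string. A minimal sketch:
import re
# \U\1 equivalent: uppercase the whole capture
print(re.sub(r'([a-z]+)', lambda m: m.group(1).upper(), 'hello'))  # 'HELLO'
# \1\L\2 equivalent: keep the first char, lowercase the rest
print(re.sub(r'([A-Z])([A-Z]+)', lambda m: m.group(1) + m.group(2).lower(), 'HELLO'))  # 'Hello'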
Questions to consider:
- How do you escape special characters in the replacement string?
- How do you handle nested groups?
- How do you implement case transformation?
Learning milestones:
- Basic replacement works → You understand substitution
- Capture groups work → You understand (…) and \1
- Named groups work → You understand (?P<name>...)
- Case transforms work → You understand replacement modifiers
Project 4: Input Sanitizer
- File: LEARN_REGEX_DEEP_DIVE.md
- Main Programming Language: JavaScript
- Alternative Programming Languages: Python, Go
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 2: Intermediate
- Knowledge Area: Security / Text Processing
- Software or Tool: Sanitization Library
- Main Book: “Regular Expressions Cookbook” by Goyvaerts & Levithan
What you’ll build: A library that sanitizes user input by removing or escaping dangerous patterns—preventing XSS, SQL injection attempts, and cleaning up formatting.
Why it teaches regex: Sanitization requires thinking about what NOT to match (dangerous patterns) and understanding edge cases attackers might exploit. You’ll learn negative patterns and the limits of regex for security.
Core challenges you’ll face:
- HTML tag removal → maps to matching nested structures (limits of regex)
- Script detection → maps to case-insensitive variations
- Whitespace normalization → maps to \s and cleanup patterns
- SQL injection patterns → maps to alternation and escaping
Key Concepts:
- Security Considerations: OWASP Input Validation Cheat Sheet
- Regex Limitations: Why regex can’t parse HTML properly
- Escaping: “Regular Expressions Cookbook” Chapter 4
Difficulty: Intermediate Time estimate: 1 week Prerequisites: Projects 2 and 3
Real world outcome:
const sanitize = require('sanitex');
// HTML sanitization (remove tags)
sanitize.stripTags('<script>alert("xss")</script>Hello');
// Output: "Hello"
sanitize.stripTags('<b>Bold</b> and <i>italic</i>', { allow: ['b', 'i'] });
// Output: "<b>Bold</b> and <i>italic</i>"
// XSS prevention
sanitize.escapeHtml('<script>alert("xss")</script>');
// Output: "&lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;"
// SQL special character escaping
sanitize.escapeSQL("O'Brien; DROP TABLE users--");
// Output: "O''Brien; DROP TABLE users--"
// Whitespace normalization
sanitize.normalizeWhitespace(' Hello World \n\n ');
// Output: "Hello World"
// URL cleaning
sanitize.cleanURL('javascript:alert("xss")');
// Output: "" (blocked dangerous protocol)
sanitize.cleanURL('https://example.com/path?q=test');
// Output: "https://example.com/path?q=test" (allowed)
// Username sanitization
sanitize.username('John<script>Doe');
// Output: "JohnDoe" (only alphanumeric and underscore)
// Filename sanitization
sanitize.filename('../../../etc/passwd');
// Output: "etc_passwd" (path traversal removed)
// Custom sanitization
const clean = sanitize.create([
{ pattern: /[^\w\s-]/g, replacement: '' }, // Remove special chars
{ pattern: /\s+/g, replacement: ' ' }, // Normalize spaces
{ pattern: /^\s+|\s+$/g, replacement: '' } // Trim
]);
clean(' Hello!!! World??? ');
// Output: "Hello World"
Implementation Hints:
Common sanitization patterns:
const sanitizers = {
// Remove all HTML tags
stripAllTags: /<[^>]*>/g,
// Match script tags (with variations)
scriptTags: /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi,
// Event handlers (onclick, onerror, etc.)
eventHandlers: /\s*on\w+\s*=\s*["'][^"']*["']/gi,
// JavaScript URLs
jsUrls: /javascript\s*:/gi,
// Multiple whitespace
multiSpace: /\s+/g,
// SQL dangerous characters
sqlDanger: /['";\\]/g,
// Path traversal
pathTraversal: /\.\.[\\/]/g,
// Null bytes
nullBytes: /\x00/g
};
Why regex isn’t enough for HTML:
Regex CANNOT properly parse:
<div attr=">" class="foo"> (quote contains >)
<div><div></div></div> (nested same tags)
<!-- <script>x</script> --> (tags in comments)
Use a proper HTML parser, then regex for cleanup.
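A concrete demonstration of the first failure case (illustrative Python, not part of the library):
import re
html = '<div attr=">" class="foo">content</div>'
# The naive tag pattern stops at the first '>', which sits inside a quoted attribute
print(re.search(r'<[^>]*>', html).group())
# '<div attr=">'  <- a truncated, invalid "tag"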
Defense in depth:
function sanitizeInput(input) {
let clean = input;
// 1. Remove null bytes (before any other processing)
clean = clean.replace(/\x00/g, '');
// 2. Normalize unicode (prevent homoglyph attacks)
clean = clean.normalize('NFKC');
// 3. Remove dangerous patterns
clean = clean.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '');
// 4. Escape remaining special chars
clean = escapeHtml(clean);
return clean;
}
Learning milestones:
- Tag stripping works → You understand basic HTML patterns
- Case variations caught → You understand /i flag
- Edge cases handled → You understand regex limitations
- Multiple patterns combine → You understand defense in depth
Project 5: URL & Email Parser
- File: LEARN_REGEX_DEEP_DIVE.md
- Main Programming Language: Python
- Alternative Programming Languages: JavaScript, Go, Rust
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Complex Pattern Parsing
- Software or Tool: URL/Email Parser
- Main Book: “Regular Expressions Cookbook” by Goyvaerts & Levithan
What you’ll build: A parser that extracts and validates URLs and emails from text, breaking them into components (protocol, host, path, query params for URLs; local part, domain for emails).
Why it teaches regex: URLs and emails are complex patterns with many optional parts. You’ll learn optional groups, alternation, and how to extract structured data from patterns.
Core challenges you’ll face:
- Optional URL components → maps to (…)? optional groups
- Query string parsing → maps to repeated capture groups
- International domains → maps to Unicode in regex
- Edge cases → maps to understanding RFC specifications
Key Concepts:
- URL Regex: RFC 3986 and practical simplifications
- Optional Groups: “Mastering Regular Expressions” Chapter 4 - Friedl
- Non-Capturing Groups: “Regular Expressions Cookbook” Chapter 2
Difficulty: Intermediate Time estimate: 1 week Prerequisites: Projects 2 and 3
Real world outcome:
from urlparser import URLParser, EmailParser
# Parse URLs
parser = URLParser()
result = parser.parse('https://user:pass@example.com:8080/path/to/page?foo=bar&baz=qux#section')
# {
# 'scheme': 'https',
# 'username': 'user',
# 'password': 'pass',
# 'host': 'example.com',
# 'port': 8080,
# 'path': '/path/to/page',
# 'query': {'foo': 'bar', 'baz': 'qux'},
# 'fragment': 'section'
# }
# Extract all URLs from text
text = "Check out https://example.com and http://test.org/page for more info"
urls = parser.extract_all(text)
# ['https://example.com', 'http://test.org/page']
# Validate URL
parser.is_valid('https://example.com') # True
parser.is_valid('not a url') # False
parser.is_valid('ftp://files.example.com') # True (different scheme)
# Parse email addresses
email_parser = EmailParser()
result = email_parser.parse('John.Doe+tag@mail.example.co.uk')
# {
# 'local': 'John.Doe+tag',
# 'domain': 'mail.example.co.uk',
# 'subdomain': 'mail',
# 'sld': 'example',
# 'tld': 'co.uk',
# 'tag': 'tag' # Plus addressing
# }
# Extract emails from text
text = "Contact us at support@example.com or sales@example.org"
emails = email_parser.extract_all(text)
# ['support@example.com', 'sales@example.org']
# Handle edge cases
parser.parse('http://localhost:3000')
# { 'scheme': 'http', 'host': 'localhost', 'port': 3000, ... }
parser.parse('file:///home/user/doc.txt')
# { 'scheme': 'file', 'path': '/home/user/doc.txt', ... }
Implementation Hints:
URL regex breakdown:
URL_PATTERN = r'''
^
(?P<scheme>https?|ftp):// # Scheme
(?:
(?P<username>[^:@]+) # Username (optional)
(?::(?P<password>[^@]+))? # Password (optional)
@
)?
(?P<host>
(?:[\w-]+\.)*[\w-]+ # Domain name
|
\d{1,3}(?:\.\d{1,3}){3} # or IP address
)
(?::(?P<port>\d+))? # Port (optional)
(?P<path>/[^?#]*)? # Path (optional)
(?:\?(?P<query>[^#]*))? # Query string (optional)
(?:\#(?P<fragment>.*))? # Fragment (optional)
$
'''
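Because this pattern is written in verbose mode, it must be compiled with re.VERBOSE (whitespace and # comments inside the pattern are then ignored); groupdict() exposes the named captures:
import re
url_re = re.compile(URL_PATTERN, re.VERBOSE)
m = url_re.match('https://user:pass@example.com:8080/path?foo=bar#section')
if m:
    print(m.groupdict())
    # {'scheme': 'https', 'username': 'user', 'password': 'pass',
    #  'host': 'example.com', 'port': '8080', 'path': '/path',
    #  'query': 'foo=bar', 'fragment': 'section'}
Note that every capture comes back as a string; converting port to an int and splitting the query string happen in a later pass.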
Optional groups explained:
(?:...)? Non-capturing optional group
Example: (?::(?P<port>\d+))?
│ │ │ │
│ │ │ └── entire group is optional
│ │ └── capture port number
│ └── colon before port
└── don't capture the colon, just group it
Parsing query strings:
def parse_query(query_string):
params = {}
pattern = r'([^&=]+)=([^&]*)'
for match in re.finditer(pattern, query_string):
key, value = match.groups()
params[key] = value
return params
Learning milestones:
- Basic URLs parse → You understand optional groups
- Query strings extract → You understand repeated groups
- Edge cases work → You understand alternation
- Components accessible → You understand named groups
Project 6: Log Parser
- File: LEARN_REGEX_DEEP_DIVE.md
- Main Programming Language: Python
- Alternative Programming Languages: Go, Rust
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 2: Intermediate
- Knowledge Area: Log Analysis / Data Extraction
- Software or Tool: Log Parser
- Main Book: “Mastering Regular Expressions” by Jeffrey Friedl
What you’ll build: A log parsing tool that can extract structured data from various log formats (Apache, Nginx, syslog, JSON logs), with support for custom format definitions.
Why it teaches regex: Real-world log formats are complex, with timestamps, IP addresses, quoted strings, and variable fields. You’ll learn to build patterns incrementally and handle real-world messiness.
Core challenges you’ll face:
- Timestamp formats → maps to complex date/time patterns
- Quoted strings → maps to handling escapes and nested quotes
- IP addresses → maps to numeric patterns with validation
- Variable fields → maps to greedy vs non-greedy matching
Key Concepts:
- Log Format Patterns: Apache log format documentation
- Non-Greedy Matching: “Mastering Regular Expressions” Chapter 4 - Friedl
- Named Captures: “Regular Expressions Cookbook” Chapter 3
Difficulty: Intermediate Time estimate: 1-2 weeks Prerequisites: Projects 1 and 5
Real world outcome:
from logparser import LogParser
# Built-in format: Apache Combined Log
parser = LogParser('apache_combined')
entry = parser.parse('192.168.1.1 - frank [10/Oct/2024:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326 "http://www.example.com/" "Mozilla/5.0"')
# {
# 'ip': '192.168.1.1',
# 'user': 'frank',
# 'timestamp': datetime(2024, 10, 10, 13, 55, 36),
# 'method': 'GET',
# 'path': '/index.html',
# 'protocol': 'HTTP/1.0',
# 'status': 200,
# 'size': 2326,
# 'referrer': 'http://www.example.com/',
# 'user_agent': 'Mozilla/5.0'
# }
# Built-in format: Nginx
parser = LogParser('nginx')
# Built-in format: Syslog
parser = LogParser('syslog')
entry = parser.parse('Oct 11 22:14:15 server sshd[12345]: Failed password for root from 192.168.1.1')
# {
# 'timestamp': ...,
# 'host': 'server',
# 'program': 'sshd',
# 'pid': 12345,
# 'message': 'Failed password for root from 192.168.1.1'
# }
# Custom format definition
custom = LogParser.define(
r'^\[(?P<level>\w+)\] (?P<timestamp>[\d-]+ [\d:]+) (?P<message>.*)',
{'timestamp': '%Y-%m-%d %H:%M:%S'}
)
entry = custom.parse('[ERROR] 2024-10-11 13:55:36 Database connection failed')
# Analyze logs
results = parser.parse_file('access.log')
print(f"Total requests: {len(results)}")
print(f"Unique IPs: {len(set(r['ip'] for r in results))}")
print(f"Errors (5xx): {sum(1 for r in results if r['status'] >= 500)}")
# Stream large files
for entry in parser.stream('huge.log'):
if entry['status'] >= 400:
print(f"Error: {entry['path']} - {entry['status']}")
Implementation Hints:
Apache Combined Log Format pattern:
APACHE_COMBINED = r'''
^
(?P<ip>\S+)\s+ # IP address
(?P<ident>\S+)\s+ # Ident (usually -)
(?P<user>\S+)\s+ # User (or -)
\[(?P<timestamp>[^\]]+)\]\s+ # Timestamp [...]
"(?P<method>\w+)\s+ # Request method
(?P<path>\S+)\s+ # Request path
(?P<protocol>[^"]+)"\s+ # Protocol
(?P<status>\d+)\s+ # Status code
(?P<size>\d+|-)\s+ # Response size
"(?P<referrer>[^"]*)"\s+ # Referrer
"(?P<user_agent>[^"]*)" # User agent
$
'''
Handling quoted strings with escapes:
# Simple (no escapes): "[^"]*"
# With escapes: "(?:[^"\\]|\\.)*"
# │ │
# │ └── or escaped anything
# └── non-quote, non-backslash
QUOTED_STRING = r'"(?:[^"\\]|\\.)*"'
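A quick check of the escape-aware pattern in Python:
import re
QUOTED_STRING = r'"(?:[^"\\]|\\.)*"'
print(re.findall(QUOTED_STRING, r'say "hi" and "she said \"ok\""'))
# ['"hi"', '"she said \\"ok\\""']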
Parsing timestamps:
import re
from datetime import datetime
def parse_apache_time(timestamp):
    # [10/Oct/2024:13:55:36 -0700]
    pattern = r'(\d{2})/(\w{3})/(\d{4}):(\d{2}):(\d{2}):(\d{2})'
    match = re.match(pattern, timestamp)
    if match:
        day, month, year, hour, minute, second = match.groups()
        # Convert the month name to a number, then build the datetime
        months = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
                  'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}
        return datetime(int(year), months[month], int(day),
                        int(hour), int(minute), int(second))
    return None
Learning milestones:
- Single format parses → You understand complex patterns
- Quoted strings work → You understand escape handling
- Custom formats work → You understand pattern composition
- Large files stream → You understand generator patterns
Project 7: Markdown Parser
- File: LEARN_REGEX_DEEP_DIVE.md
- Main Programming Language: JavaScript
- Alternative Programming Languages: Python, Rust
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Text Transformation
- Software or Tool: Markdown to HTML Converter
- Main Book: “Mastering Regular Expressions” by Jeffrey Friedl
What you’ll build: A Markdown to HTML converter that handles headings, emphasis, links, images, code blocks, lists, and blockquotes—using regex for pattern matching and transformation.
Why it teaches regex: Markdown parsing requires matching nested structures, handling multiple formats for the same element, and transforming captures into output. You’ll push regex to its limits and learn when to combine it with other techniques.
Core challenges you’ll face:
- Inline formatting → maps to nested and overlapping patterns
- Links and images → maps to complex capture groups
- Code blocks → maps to multi-line matching
- Lists → maps to context-dependent patterns
Key Concepts:
- Multi-line Mode: “Mastering Regular Expressions” Chapter 3 - Friedl
- Greedy vs Lazy: “Regular Expressions Cookbook” Chapter 2
- CommonMark Specification: commonmark.org
Difficulty: Advanced Time estimate: 2 weeks Prerequisites: Projects 3 and 6
Real world outcome:
const md = require('markdownex');
const markdown = `
# Hello World
This is **bold** and *italic* and ***both***.
Here's a [link](https://example.com) and an ![image](img.png "title").
\`\`\`javascript
const x = 1;
console.log(x);
\`\`\`
- Item 1
- Item 2
- Nested item
> This is a quote
> that spans multiple lines
`;
const html = md.toHtml(markdown);
// Output:
// <h1>Hello World</h1>
// <p>This is <strong>bold</strong> and <em>italic</em> and <em><strong>both</strong></em>.</p>
// <p>Here's a <a href="https://example.com">link</a> and an <img src="img.png" alt="image" title="title">.</p>
// <pre><code class="language-javascript">const x = 1;
// console.log(x);
// </code></pre>
// <ul>
// <li>Item 1</li>
// <li>Item 2
// <ul>
// <li>Nested item</li>
// </ul>
// </li>
// </ul>
// <blockquote>
// <p>This is a quote that spans multiple lines</p>
// </blockquote>
// Inline parsing only
md.parseInline('**bold** and `code`');
// '<strong>bold</strong> and <code>code</code>'
Implementation Hints:
Markdown patterns (simplified):
const patterns = {
// Headings: # Heading
heading: /^(#{1,6})\s+(.+)$/gm,
// Bold: **text** or __text__
bold: /\*\*(.+?)\*\*|__(.+?)__/g,
// Italic: *text* or _text_
italic: /\*(.+?)\*|_(.+?)_/g,
// Links: [text](url "title")
link: /\[([^\]]+)\]\(([^)\s]+)(?:\s+"([^"]+)")?\)/g,
// Images: 
image: /!\[([^\]]*)\]\(([^)\s]+)(?:\s+"([^"]+)")?\)/g,
// Inline code: `code`
inlineCode: /`([^`]+)`/g,
// Code blocks: ```lang\ncode\n```
codeBlock: /```(\w*)\n([\s\S]*?)```/g,
// Blockquotes: > text
blockquote: /^>\s?(.*)$/gm,
// Unordered lists: - item or * item
unorderedList: /^[\s]*[-*]\s+(.+)$/gm,
// Ordered lists: 1. item
orderedList: /^[\s]*\d+\.\s+(.+)$/gm,
// Horizontal rule: --- or ***
hr: /^[-*]{3,}$/gm
};
Processing order matters:
function parseMarkdown(text) {
// 1. First handle code blocks (protect from other processing)
text = text.replace(patterns.codeBlock, (match, lang, code) => {
return `<pre><code class="language-${lang}">${escapeHtml(code)}</code></pre>`;
});
// 2. Handle block-level elements
text = text.replace(patterns.heading, (match, hashes, content) => {
const level = hashes.length;
return `<h${level}>${content}</h${level}>`;
});
// 3. Handle inline elements (after blocks)
text = text.replace(patterns.bold, '<strong>$1$2</strong>');
text = text.replace(patterns.italic, '<em>$1$2</em>');
text = text.replace(patterns.link, '<a href="$2" title="$3">$1</a>');
return text;
}
Handling nested emphasis:
***bold and italic***
Must handle: <em><strong>bold and italic</strong></em>
Pattern: \*\*\*(.+?)\*\*\*
OR process from outside in
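A sketch of that ordering in Python (process the triple marker before the double and single ones):
import re

def emphasis(text):
    # Order matters: *** first, then **, then *
    text = re.sub(r'\*\*\*(.+?)\*\*\*', r'<em><strong>\1</strong></em>', text)
    text = re.sub(r'\*\*(.+?)\*\*', r'<strong>\1</strong>', text)
    text = re.sub(r'\*(.+?)\*', r'<em>\1</em>', text)
    return text

print(emphasis('***both*** and **bold** and *italic*'))
# <em><strong>both</strong></em> and <strong>bold</strong> and <em>italic</em>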
Learning milestones:
- Headings and emphasis work → You understand basic patterns
- Links and images work → You understand complex captures
- Code blocks preserve content → You understand multi-line
- Nested formatting works → You understand processing order
Project 8: Data Extractor (Web Scraping)
- File: LEARN_REGEX_DEEP_DIVE.md
- Main Programming Language: Python
- Alternative Programming Languages: JavaScript, Go
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 2: Intermediate
- Knowledge Area: Data Extraction / Web Scraping
- Software or Tool: Web Scraper
- Main Book: “Regular Expressions Cookbook” by Goyvaerts & Levithan
What you’ll build: A data extraction tool that can pull structured data from web pages using regex patterns—extracting prices, dates, emails, phone numbers, and custom patterns.
Why it teaches regex: Web scraping forces you to handle messy, inconsistent HTML. You’ll learn to write flexible patterns that handle variations and extract multiple pieces of data from single matches.
Core challenges you’ll face:
- HTML structure → maps to tag matching (and its limits)
- Inconsistent formatting → maps to flexible patterns with alternation
- Multiple items → maps to findall and iteration
- Greedy matching pitfalls → maps to non-greedy quantifiers
Key Concepts:
- Non-Greedy Matching: “Mastering Regular Expressions” Chapter 4 - Friedl
- findall vs finditer: Python `re` module documentation
- Why not regex for HTML: the famous Stack Overflow answer (for context)
Difficulty: Intermediate Time estimate: 1-2 weeks Prerequisites: Projects 5 and 6
Real world outcome:
from extractor import DataExtractor
# Initialize with HTML content
ex = DataExtractor(html_content)
# Built-in extractors
prices = ex.extract_prices()
# ['$19.99', '$149.00', '€25.50']
emails = ex.extract_emails()
# ['contact@example.com', 'support@example.com']
phones = ex.extract_phones()
# ['(555) 123-4567', '+1-555-987-6543']
dates = ex.extract_dates()
# ['2024-10-15', 'October 15, 2024', '10/15/24']
# Extract with context (get surrounding text)
results = ex.extract_prices(context=True)
# [
# {'value': '$19.99', 'context': 'Sale price: $19.99 (was $29.99)'},
# {'value': '$149.00', 'context': 'Total: $149.00 including shipping'}
# ]
# Custom patterns
product_pattern = r'<div class="product">\s*<h2>([^<]+)</h2>\s*<span class="price">\$([0-9.]+)</span>'
products = ex.extract_custom(product_pattern, ['name', 'price'])
# [
# {'name': 'Widget Pro', 'price': '19.99'},
# {'name': 'Gadget Plus', 'price': '29.99'}
# ]
# Extract all matches with positions
for match in ex.find_all(r'\b[A-Z]{2,}\b'):
print(f"Acronym: {match.text} at position {match.start}")
# Pipeline extraction
ex.pipeline([
('title', r'<title>([^<]+)</title>'),
('meta_desc', r'<meta name="description" content="([^"]+)"'),
('h1_tags', r'<h1[^>]*>([^<]+)</h1>'),
])
# {
# 'title': 'My Page Title',
# 'meta_desc': 'Page description here',
# 'h1_tags': ['Welcome', 'About Us']
# }
Implementation Hints:
Price extraction pattern:
PRICE_PATTERNS = [
r'\$[\d,]+\.?\d*', # $19.99, $1,234.56
r'€[\d,]+\.?\d*', # €25.50
r'£[\d,]+\.?\d*', # £15.00
r'USD\s*[\d,]+\.?\d*', # USD 19.99
r'[\d,]+\.?\d*\s*(?:USD|EUR|GBP)', # 19.99 USD
]
def extract_prices(text):
    # Wrap each alternative in a non-capturing group so findall returns
    # whole matches instead of tuples of captured groups
    combined = '|'.join(f'(?:{p})' for p in PRICE_PATTERNS)
    return re.findall(combined, text)
Why greedy matching fails with HTML:
# BAD: Greedy matching
pattern = r'<div class="product">.*</div>'
# Will match from first <div> to LAST </div> in document!
# GOOD: Non-greedy matching
pattern = r'<div class="product">.*?</div>'
# Matches first complete <div>...</div>
# BETTER: Be more specific
pattern = r'<div class="product">[^<]*(?:<(?!/div>)[^<]*)*</div>'
# Matches content that doesn't contain </div>
Handling variations:
# Date patterns - many formats
DATE_PATTERNS = [
r'\d{4}-\d{2}-\d{2}', # 2024-10-15
r'\d{2}/\d{2}/\d{4}', # 10/15/2024
r'\d{2}/\d{2}/\d{2}', # 10/15/24
r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\s+\d{1,2},?\s+\d{4}', # October 15, 2024
r'\d{1,2}\s+(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\s+\d{4}', # 15 October 2024
]
Learning milestones:
- Simple extractions work → You understand findall
- Variations handled → You understand alternation
- Context extracted → You understand groups
- HTML pitfalls understood → You know when NOT to use regex
Project 9: Tokenizer / Lexer
- File: LEARN_REGEX_DEEP_DIVE.md
- Main Programming Language: Python
- Alternative Programming Languages: JavaScript, Rust, Go
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 3: Advanced
- Knowledge Area: Compilers / Parsing
- Software or Tool: Lexer (Tokenizer)
- Main Book: “Compilers: Principles and Practice” by Dave & Dave
What you’ll build: A lexer (tokenizer) that breaks source code into tokens using regex patterns—handling keywords, identifiers, numbers, strings, operators, and comments.
Why it teaches regex: Lexers are the classic application of regex theory. You’ll learn how regex patterns become finite automata, how to handle overlapping patterns with priorities, and how to process input efficiently.
Core challenges you’ll face:
- Token priority → maps to pattern ordering
- String literals → maps to handling escapes
- Comments → maps to multi-line patterns
- Error recovery → maps to handling unmatched input
Key Concepts:
- Lexical Analysis: “Compilers: Principles and Practice” Chapter 2
- Token Priority: “Engineering a Compiler” Chapter 2
- Finite Automata: “Introduction to Automata Theory” Chapter 2
Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Projects 6 and 7
Real world outcome:
from lexer import Lexer, Token
# Define token patterns
lexer = Lexer([
('KEYWORD', r'\b(if|else|while|for|function|return|let|const|var)\b'),
('IDENTIFIER', r'[a-zA-Z_][a-zA-Z0-9_]*'),
('NUMBER', r'\d+\.?\d*'),
('STRING', r'"(?:[^"\\]|\\.)*"|\'(?:[^\'\\]|\\.)*\''),
    ('COMMENT', r'//.*|/\*[\s\S]*?\*/'),  # must precede OPERATOR, or '//' lexes as an operator
    ('OPERATOR', r'[+\-*/%=<>!&|]+'),
('LPAREN', r'\('),
('RPAREN', r'\)'),
('LBRACE', r'\{'),
('RBRACE', r'\}'),
('SEMICOLON', r';'),
('COMMA', r','),
('WHITESPACE', r'\s+'),
])
code = '''
function greet(name) {
// This is a comment
let message = "Hello, " + name;
return message;
}
'''
tokens = lexer.tokenize(code)
for token in tokens:
print(token)
# Output:
# Token(KEYWORD, 'function', line=2, col=1)
# Token(IDENTIFIER, 'greet', line=2, col=10)
# Token(LPAREN, '(', line=2, col=15)
# Token(IDENTIFIER, 'name', line=2, col=16)
# Token(RPAREN, ')', line=2, col=20)
# Token(LBRACE, '{', line=2, col=22)
# Token(COMMENT, '// This is a comment', line=3, col=5)
# Token(KEYWORD, 'let', line=4, col=5)
# Token(IDENTIFIER, 'message', line=4, col=9)
# Token(OPERATOR, '=', line=4, col=17)
# Token(STRING, '"Hello, "', line=4, col=19)
# ...
# Error handling
bad_code = 'let x = @#$;' # Invalid characters
try:
lexer.tokenize(bad_code)
except LexerError as e:
print(f"Unexpected character '{e.char}' at line {e.line}, column {e.col}")
# Skip certain tokens
tokens = lexer.tokenize(code, skip=['WHITESPACE', 'COMMENT'])
Implementation Hints:
Lexer implementation:
import re
class Token:
def __init__(self, type, value, line, col):
self.type = type
self.value = value
self.line = line
self.col = col
class Lexer:
def __init__(self, rules):
# Combine all patterns into one big regex
# Using named groups for each token type
parts = []
for name, pattern in rules:
parts.append(f'(?P<{name}>{pattern})')
self.pattern = re.compile('|'.join(parts))
self.rules = rules
def tokenize(self, text, skip=None):
skip = skip or []
tokens = []
line = 1
line_start = 0
for match in self.pattern.finditer(text):
# Find which group matched
token_type = match.lastgroup
value = match.group()
if token_type not in skip:
col = match.start() - line_start + 1
tokens.append(Token(token_type, value, line, col))
# Track line numbers
newlines = value.count('\n')
if newlines:
line += newlines
line_start = match.end() - len(value.split('\n')[-1])
return tokens
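The finditer loop above silently skips characters that no rule matches. A follow-up sketch that turns such gaps into errors (this LexerError carries only an offset; line/column tracking works the same way as in tokenize() above):
class LexerError(Exception):
    def __init__(self, char, pos):
        super().__init__(f'unexpected character {char!r} at offset {pos}')
        self.char, self.pos = char, pos

def check_gaps(pattern, text):
    # Any text between consecutive matches is input no token rule could consume
    pos = 0
    for match in pattern.finditer(text):
        if match.start() != pos:
            raise LexerError(text[pos], pos)
        pos = match.end()
    if pos != len(text):
        raise LexerError(text[pos], pos)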
String literal with escapes:
# Match: "hello", "hello\nworld", "say \"hi\""
STRING = r'"(?:[^"\\]|\\.)*"'
# │ │ │
# │ │ └── OR backslash + any char
# │ └── non-quote, non-backslash chars
# └── opening quote
# Explanation of (?:[^"\\]|\\.)*:
# [^"\\] = any char except " and \
# \\. = backslash followed by any char (escape sequence)
# (?:...)*  = repeat this group
Multi-line comments:
# Match /* ... */ comments (can span lines)
MULTILINE_COMMENT = r'/\*[\s\S]*?\*/'
# │ │
# │ └── non-greedy (stop at first */)
# └── any character including newlines
Learning milestones:
- Simple tokens work → You understand pattern matching
- Keywords vs identifiers → You understand priority
- Strings with escapes → You understand complex patterns
- Line/column tracking → You understand position bookkeeping
Project 10: Template Engine
- File: LEARN_REGEX_DEEP_DIVE.md
- Main Programming Language: JavaScript
- Alternative Programming Languages: Python, Go
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 2: Intermediate
- Knowledge Area: Text Templating
- Software or Tool: Template Engine (like Mustache/Handlebars)
- Main Book: “Mastering Regular Expressions” by Jeffrey Friedl
What you’ll build: A template engine that supports variable interpolation, conditionals, loops, and partials—using regex to find and replace template syntax.
Why it teaches regex: Template engines use regex to find special syntax in text. You’ll learn to handle nested structures, recursive patterns, and the interplay between regex matching and programmatic logic.
Core challenges you’ll face:
- Variable interpolation → maps to simple captures and replacement
- Conditionals → maps to matching paired tags
- Loops → maps to extracting content between tags
- Escaping → maps to distinguishing literal vs template syntax
Key Concepts:
- Template Patterns: Mustache.js source code
- Paired Tag Matching: “Mastering Regular Expressions” Chapter 6
- Replacement Functions: JavaScript replace() with function
Difficulty: Intermediate Time estimate: 1-2 weeks Prerequisites: Projects 3 and 7
Real world outcome:
const tmpl = require('templatex');
// Simple interpolation
const template = 'Hello, {{name}}!';
const result = tmpl.render(template, { name: 'World' });
// 'Hello, World!'
// Conditionals
const withCondition = `
{{#if isLoggedIn}}
Welcome back, {{username}}!
{{else}}
Please log in.
{{/if}}
`;
tmpl.render(withCondition, { isLoggedIn: true, username: 'Alice' });
// 'Welcome back, Alice!'
// Loops
const withLoop = `
<ul>
{{#each items}}
<li>{{name}}: {{price}}</li>
{{/each}}
</ul>
`;
tmpl.render(withLoop, {
items: [
{ name: 'Widget', price: '$10' },
{ name: 'Gadget', price: '$20' }
]
});
// <ul>
// <li>Widget: $10</li>
// <li>Gadget: $20</li>
// </ul>
// Nested data
const nested = '{{user.profile.name}}';
tmpl.render(nested, { user: { profile: { name: 'Bob' } } });
// 'Bob'
// Escaping (raw output)
const withEscape = '{{&htmlContent}}'; // or {{{htmlContent}}}
tmpl.render(withEscape, { htmlContent: '<b>Bold</b>' });
// '<b>Bold</b>' (not escaped)
// Partials
tmpl.registerPartial('header', '<header>{{title}}</header>');
const withPartial = '{{>header}}<main>Content</main>';
tmpl.render(withPartial, { title: 'My Page' });
// '<header>My Page</header><main>Content</main>'
// Compile for repeated use
const compiled = tmpl.compile('Hello, {{name}}!');
compiled({ name: 'Alice' }); // 'Hello, Alice!'
compiled({ name: 'Bob' }); // 'Hello, Bob!'
Implementation Hints:
Template patterns:
const patterns = {
// Variable: {{name}} or {{user.name}}
variable: /\{\{([a-zA-Z_][\w.]*)\}\}/g,
// Raw variable: {{{var}}} or {{&var}}
rawVariable: /\{\{\{([a-zA-Z_][\w.]*)\}\}\}|\{\{&([a-zA-Z_][\w.]*)\}\}/g,
// Conditional: {{#if condition}}...{{/if}}
conditional: /\{\{#if\s+(\w+)\}\}([\s\S]*?)(?:\{\{else\}\}([\s\S]*?))?\{\{\/if\}\}/g,
// Loop: {{#each items}}...{{/each}}
loop: /\{\{#each\s+(\w+)\}\}([\s\S]*?)\{\{\/each\}\}/g,
// Partial: {{>partialName}}
partial: /\{\{>\s*(\w+)\s*\}\}/g,
// Comment: {{! comment }}
comment: /\{\{![\s\S]*?\}\}/g
};
Variable resolution with dot notation:
function resolve(path, context) {
const parts = path.split('.');
let value = context;
for (const part of parts) {
if (value == null) return '';
value = value[part];
}
return value ?? '';
}
// Usage
resolve('user.profile.name', { user: { profile: { name: 'Bob' } } });
// Returns: 'Bob'
Processing order:
function render(template, context) {
let result = template;
// 1. Remove comments first
result = result.replace(patterns.comment, '');
// 2. Process conditionals (before loops, they might contain loops)
result = result.replace(patterns.conditional, (match, condition, ifContent, elseContent) => {
return context[condition] ? ifContent : (elseContent || '');
});
// 3. Process loops
result = result.replace(patterns.loop, (match, itemsKey, content) => {
const items = context[itemsKey] || [];
return items.map(item => render(content, { ...context, ...item })).join('');
});
// 4. Process partials
result = result.replace(patterns.partial, (match, name) => {
return render(partials[name], context);
});
// 5. Finally, replace variables
result = result.replace(patterns.variable, (match, path) => {
return escapeHtml(resolve(path, context));
});
return result;
}
Learning milestones:
- Simple variables work → You understand basic replacement
- Conditionals work → You understand paired tag matching
- Loops work → You understand iterative replacement
- Nesting works → You understand recursive processing
Project 11: Regex Engine (Basic)
- File: LEARN_REGEX_DEEP_DIVE.md
- Main Programming Language: Python
- Alternative Programming Languages: Rust, Go, C
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 5. The “Industry Disruptor”
- Difficulty: Level 4: Expert
- Knowledge Area: Automata Theory / Compilers
- Software or Tool: Regex Engine
- Main Book: “Introduction to Automata Theory” by Hopcroft, Motwani, Ullman
What you’ll build: A regex engine that compiles patterns to NFAs (Non-deterministic Finite Automata) and matches strings—implementing the basic operators: concatenation, alternation (|), Kleene star (*), plus (+), and optional (?).
Why it teaches regex: Building a regex engine is the ultimate way to understand how regex works. You’ll learn Thompson’s construction, NFA simulation, and why certain patterns are slow (catastrophic backtracking).
Core challenges you’ll face:
- Parsing regex syntax → maps to recursive descent parsing
- Thompson’s construction → maps to NFA building
- NFA simulation → maps to parallel state tracking
- Handling special characters → maps to metacharacter escaping
Key Concepts:
- Thompson’s Construction: “Regular Expression Matching Can Be Simple And Fast” by Russ Cox
- NFA/DFA Theory: “Introduction to Automata Theory” Chapters 2-3 - Hopcroft
- Regex Engine Internals: “Mastering Regular Expressions” Chapter 4 - Friedl
Difficulty: Expert Time estimate: 1 month Prerequisites: Automata theory basics, Projects 7 and 9
Real world outcome:
from regex_engine import Regex
# Basic matching
r = Regex('hello')
r.match('hello') # True
r.match('hello world') # True (partial match)
r.fullmatch('hello') # True (exact match)
# Alternation
r = Regex('cat|dog')
r.match('cat') # True
r.match('dog') # True
r.match('bird') # False
# Kleene star
r = Regex('ab*c')
r.match('ac') # True (zero b's)
r.match('abc') # True (one b)
r.match('abbbc') # True (three b's)
# Plus (one or more)
r = Regex('ab+c')
r.match('ac') # False (need at least one b)
r.match('abc') # True
r.match('abbbc') # True
# Optional
r = Regex('colou?r')
r.match('color') # True
r.match('colour') # True
# Character classes
r = Regex('[a-z]+')
r.match('hello') # True
r.match('HELLO') # False
# Grouping
r = Regex('(ab)+')
r.match('ab') # True
r.match('abab') # True
r.match('ababab') # True
# Complex pattern
r = Regex('(a|b)*abb')
r.match('abb') # True
r.match('aabb') # True
r.match('babb') # True
r.match('abababb') # True
# Debug: show NFA
r.visualize()
# Outputs DOT format for graphviz:
# digraph NFA {
# 0 -> 1 [label="a"];
# 0 -> 2 [label="b"];
# ...
# }
Implementation Hints:
Regex engine architecture:
┌─────────────────────────────────────────────────────────────┐
│ Regex Engine │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │ Parser │───▶│ Thompson │───▶│ NFA Simulator │ │
│ │ │ │ Construction │ │ │ │
│ └──────────┘ └──────────────┘ └──────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ AST NFA Match/No Match │
│ │
└─────────────────────────────────────────────────────────────┘
Step 1: Parse regex to AST:
# Regex: (a|b)*c
# AST:
# Concat
# ├── Star
# │ └── Alternation
# │ ├── Literal('a')
# │ └── Literal('b')
# └── Literal('c')
class Literal:
def __init__(self, char): self.char = char
class Alternation:
def __init__(self, left, right): self.left, self.right = left, right
class Concat:
def __init__(self, left, right): self.left, self.right = left, right
class Star:
def __init__(self, expr): self.expr = expr
Step 2: Thompson’s Construction (AST to NFA):
class State:
def __init__(self):
self.transitions = {} # char -> [states]
self.epsilon = [] # ε-transitions
class NFA:
def __init__(self, start, accept):
self.start = start
self.accept = accept
def build_nfa(ast):
if isinstance(ast, Literal):
start = State()
accept = State()
start.transitions[ast.char] = [accept]
return NFA(start, accept)
elif isinstance(ast, Concat):
left = build_nfa(ast.left)
right = build_nfa(ast.right)
left.accept.epsilon.append(right.start)
return NFA(left.start, right.accept)
elif isinstance(ast, Alternation):
left = build_nfa(ast.left)
right = build_nfa(ast.right)
start = State()
accept = State()
start.epsilon = [left.start, right.start]
left.accept.epsilon.append(accept)
right.accept.epsilon.append(accept)
return NFA(start, accept)
elif isinstance(ast, Star):
inner = build_nfa(ast.expr)
start = State()
accept = State()
start.epsilon = [inner.start, accept]
inner.accept.epsilon = [inner.start, accept]
return NFA(start, accept)
Step 3: NFA Simulation:
def match(nfa, text):
# Start with all states reachable from start via ε
current = epsilon_closure({nfa.start})
for char in text:
next_states = set()
for state in current:
if char in state.transitions:
next_states.update(state.transitions[char])
current = epsilon_closure(next_states)
return nfa.accept in current
def epsilon_closure(states):
"""Find all states reachable via ε-transitions"""
closure = set(states)
stack = list(states)
while stack:
state = stack.pop()
for next_state in state.epsilon:
if next_state not in closure:
closure.add(next_state)
stack.append(next_state)
return closure
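Even without a parser, you can build the AST for (a|b)*c by hand and run the simulator as a quick end-to-end check of the three steps above:
# Hand-built AST for (a|b)*c
ast = Concat(Star(Alternation(Literal('a'), Literal('b'))), Literal('c'))
nfa = build_nfa(ast)
print(match(nfa, 'c'))      # True  (zero repetitions of (a|b))
print(match(nfa, 'ababc'))  # True
print(match(nfa, 'abab'))   # False (missing the final 'c')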
Learning milestones:
- Parser produces AST → You understand regex syntax
- NFA is constructed → You understand Thompson’s construction
- Matching works → You understand NFA simulation
- All operators work → You understand the full algorithm
Project 12: Regex Debugger & Visualizer
- File: LEARN_REGEX_DEEP_DIVE.md
- Main Programming Language: JavaScript
- Alternative Programming Languages: Python, TypeScript
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Debugging / Visualization
- Software or Tool: Regex Debugger (like regex101)
- Main Book: “Mastering Regular Expressions” by Jeffrey Friedl
What you’ll build: A regex debugger that shows step-by-step matching, explains why patterns fail, highlights matched groups, and visualizes the regex as a railroad diagram.
Why it teaches regex: Building a debugger requires you to understand exactly how regex engines work internally. You’ll learn backtracking, match states, and how to explain regex behavior to others.
Core challenges you’ll face:
- Step-by-step execution → maps to tracking match state
- Failure explanation → maps to understanding why matches fail
- Railroad diagrams → maps to regex AST visualization
- Performance analysis → maps to detecting catastrophic backtracking
Key Concepts:
- Regex Debugging: regex101.com internals
- Railroad Diagrams: Railroad diagram conventions
- Backtracking: “Mastering Regular Expressions” Chapter 4 - Friedl
Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Project 11
Real world outcome:
const debug = require('regex-debugger');
// Step-by-step matching
const result = debug.trace(/a+b/, 'aaab');
// {
// matched: true,
// steps: [
// { position: 0, pattern: 'a+', action: 'match a', state: 'continue' },
// { position: 1, pattern: 'a+', action: 'match a', state: 'continue' },
// { position: 2, pattern: 'a+', action: 'match a', state: 'continue' },
// { position: 3, pattern: 'b', action: 'match b', state: 'success' }
// ]
// }
// Failure analysis
const failure = debug.explain(/foo/, 'bar');
// {
// matched: false,
// reason: "Pattern 'foo' expected 'f' at position 0, but found 'b'",
// suggestion: "The input doesn't contain the literal text 'foo'"
// }
// Backtracking visualization
const backtrack = debug.trace(/a+ab/, 'aaab');
// Shows backtracking:
// - a+ matches 'aaa'
// - tries to match 'a', fails (position 3 is 'b')
// - backtracks: a+ gives up one 'a'
// - a+ now matches 'aa'
// - 'a' matches at position 2
// - 'b' matches at position 3
// - SUCCESS
// Group highlighting
const groups = debug.groups(/((\d+)-(\d+))/, '2024-12-25');
// {
// fullMatch: '2024-12',
// groups: [
// { index: 1, name: null, value: '2024-12', range: [0, 7] },
// { index: 2, name: null, value: '2024', range: [0, 4] },
// { index: 3, name: null, value: '12', range: [5, 7] }
// ]
// }
// Railroad diagram
const diagram = debug.visualize(/colou?r/);
// Returns SVG/HTML of railroad diagram
// Performance analysis
const perf = debug.analyze(/(a+)+$/);
// {
// warning: 'CATASTROPHIC_BACKTRACKING',
// message: 'This pattern can cause exponential backtracking on inputs like "aaaaaaaaaaaaaaaaX"',
// suggestion: 'Use atomic groups or possessive quantifiers: (?>a+)+$'
// }
// Interactive mode (CLI)
debug.interactive();
// > pattern: \d{3}-\d{4}
// > test: 555-1234
// ✓ Match found: "555-1234"
// Group 0: "555-1234" [0-8]
// > test: abc
// ✗ No match
// Failed at position 0: expected digit, found 'a'
Implementation Hints:
Tracing regex execution:
class RegexDebugger {
trace(pattern, input) {
const steps = [];
const regex = new RegExp(pattern, 'g');
// Use a custom matcher that logs steps
// This is simplified - real implementation needs engine access
for (let i = 0; i < input.length; i++) {
// Try to match at each position
regex.lastIndex = i;
const match = regex.exec(input);
if (match && match.index === i) {
steps.push({
position: i,
pattern: pattern.toString(),
action: `matched "${match[0]}"`,
state: 'success'
});
break;
} else {
steps.push({
position: i,
pattern: pattern.toString(),
action: `no match at position ${i}`,
state: 'backtrack'
});
}
}
return { matched: steps.some(s => s.state === 'success'), steps };
}
}
Railroad diagram generation:
function toRailroad(ast) {
switch (ast.type) {
case 'literal':
return `<rect class="terminal"><text>${ast.char}</text></rect>`;
case 'sequence':
return ast.elements.map(toRailroad).join('→');
case 'alternation':
return `
<g class="choice">
${ast.alternatives.map(a => `<path>${toRailroad(a)}</path>`).join('')}
</g>
`;
case 'quantifier':
if (ast.min === 0 && ast.max === Infinity) {
// *: loop with skip path
return `<g class="loop">${toRailroad(ast.expr)}</g>`;
}
// ... handle other quantifiers
}
}
Catastrophic backtracking detection:
function detectCatastrophic(pattern) {
const warnings = [];
// Nested quantifiers with overlap
if (/\([^)]*[+*][^)]*\)[+*]/.test(pattern)) {
warnings.push({
type: 'NESTED_QUANTIFIERS',
message: 'Nested quantifiers can cause exponential backtracking'
});
}
// Overlapping alternatives
// (a|ab)* - both alternatives start with 'a'
// This is harder to detect accurately
return warnings;
}
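Catastrophic backtracking is easy to feel directly with a backtracking engine such as Python's re; each extra 'a' roughly doubles the running time. A small illustrative benchmark:
import re, time

def time_match(pattern, text):
    start = time.perf_counter()
    re.match(pattern, text)
    return time.perf_counter() - start

for n in (18, 20, 22, 24):
    text = 'a' * n + 'X'  # the trailing 'X' forces every match attempt to fail
    print(n, f"{time_match(r'(a+)+$', text):.3f}s")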
Learning milestones:
- Step tracing works → You understand match progression
- Failure explanation works → You understand why patterns fail
- Railroad diagrams generate → You understand regex structure
- Catastrophic patterns detected → You understand performance
Project 13: Regex Optimizer
- File: LEARN_REGEX_DEEP_DIVE.md
- Main Programming Language: Python
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Optimization / Automata
- Software or Tool: Regex Optimizer
- Main Book: “Mastering Regular Expressions” by Jeffrey Friedl
What you’ll build: A regex optimizer that analyzes patterns, suggests improvements, converts NFA to DFA for faster matching, and identifies performance problems.
Why it teaches regex: Understanding optimization requires deep knowledge of how regex engines work, the difference between NFA and DFA, and what makes patterns slow.
Core challenges you’ll face:
- NFA to DFA conversion → maps to subset construction
- DFA minimization → maps to state equivalence
- Pattern simplification → maps to algebraic laws
- Anchoring optimization → maps to match start detection
Key Concepts:
- Subset Construction: “Introduction to Automata Theory” Chapter 2
- DFA Minimization: “Engineering a Compiler” Chapter 2
- Regex Optimization: “Mastering Regular Expressions” Chapter 6
Difficulty: Expert Time estimate: 1 month Prerequisites: Project 11
Real world outcome:
from regex_optimizer import optimize
# Simplify redundant patterns
optimize(r'(a|a)')
# Simplified: 'a'
optimize(r'a*a*')
# Simplified: 'a*'
optimize(r'[a-zA-Z0-9_]')
# Simplified: '\w' (equivalent under ASCII semantics; \w matches more in Unicode mode)
# Remove unnecessary groups
optimize(r'((a)(b))(c)')
# Simplified: 'abc' (only safe when the group captures are never referenced)
# Factor common prefixes
optimize(r'foobar|foobaz')
# Optimized: 'fooba[rz]'
optimize(r'cat|car|cab')
# Optimized: 'ca[trb]'
# Anchor optimization
optimize(r'.*foo')
# Warning: "Starts with .* - will scan entire string. Consider anchoring or using specific start."
# NFA to DFA conversion
result = optimize(r'(a|b)*abb', convert_to_dfa=True)
# Returns optimized DFA with minimal states
# Performance analysis
analyze(r'(a+)+$')
# {
# 'complexity': 'exponential',
# 'reason': 'Nested quantifiers with overlapping matches',
# 'suggestion': 'Use a possessive quantifier ((a+)++$) or atomic group ((?>a+)+$); here the whole pattern is equivalent to a+$'
# }
# Equivalent simpler pattern
equivalent(r'^[0-9][0-9]*$')
# Simpler: '^\d+$'
equivalent(r'(foo|bar|baz){1,1}')
# Simpler: 'foo|bar|baz'
# Compile to DFA for repeated matching
dfa = compile_to_dfa(r'\b\w+@\w+\.\w+\b')
for line in huge_file:
if dfa.match(line): # O(n) guaranteed, no backtracking
print(line)
Implementation Hints:
NFA to DFA (Subset Construction):
def nfa_to_dfa(nfa):
    """Convert NFA to DFA using subset construction"""
    # A DFA state is a frozenset of NFA states
    start_state = frozenset(epsilon_closure({nfa.start}))
    dfa_states = {start_state}
    dfa_transitions = {}
    worklist = [start_state]
    while worklist:
        current = worklist.pop()
        for symbol in nfa.alphabet:
            # Find all NFA states reachable via this symbol
            next_nfa_states = set()
            for nfa_state in current:
                if symbol in nfa_state.transitions:
                    next_nfa_states.update(nfa_state.transitions[symbol])
            next_dfa_state = frozenset(epsilon_closure(next_nfa_states))
            if next_dfa_state and next_dfa_state not in dfa_states:
                dfa_states.add(next_dfa_state)
                worklist.append(next_dfa_state)
            if next_dfa_state:
                dfa_transitions[(current, symbol)] = next_dfa_state
    # A DFA state accepts iff it contains the NFA's accept state
    accept_states = {s for s in dfa_states if nfa.accept in s}
    return DFA(start_state, dfa_states, dfa_transitions, accept_states)
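The epsilon_closure helper used above is assumed from Project 11's engine; if you don't already have it, here is a minimal sketch, assuming each NFA state exposes an epsilon list of its ε-successors:
def epsilon_closure(states):
    """All NFA states reachable from `states` via epsilon transitions alone."""
    closure = set(states)
    stack = list(states)
    while stack:
        state = stack.pop()
        for nxt in state.epsilon:  # assumed attribute: list of epsilon-successors
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure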
DFA minimization:
def minimize_dfa(dfa):
    """Minimize DFA using partition refinement (Moore's algorithm)"""
    # Initial partition: accepting vs non-accepting states
    accept = {s for s in dfa.states if dfa.is_accepting(s)}
    non_accept = dfa.states - accept
    partition = [g for g in (accept, non_accept) if g]  # drop empty groups
    # Refine until stable
    while True:
        new_partition = []
        changed = False
        for group in partition:
            # Split group based on transitions (see the sketch below)
            subgroups = split_by_transitions(group, partition, dfa)
            new_partition.extend(subgroups)
            if len(subgroups) > 1:
                changed = True
        if not changed:
            break
        partition = new_partition
    # Build minimized DFA from the final partition
    return build_minimized_dfa(partition, dfa)
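split_by_transitions does the real work of refinement: two states stay in the same block only if, for every symbol, they transition into the same partition block. A sketch, assuming the DFA exposes an alphabet and a transition(state, symbol) lookup (illustrative names, not a fixed API):
def split_by_transitions(group, partition, dfa):
    """Split `group` into subgroups whose states agree on target blocks."""
    def signature(state):
        # For each symbol, record which partition block the transition lands in
        sig = []
        for symbol in dfa.alphabet:
            target = dfa.transition(state, symbol)
            block = next((i for i, b in enumerate(partition) if target in b), None)
            sig.append(block)  # None = dead/absent transition
        return tuple(sig)

    buckets = {}
    for state in group:
        buckets.setdefault(signature(state), set()).add(state)
    return list(buckets.values())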
Pattern simplification rules:
SIMPLIFICATION_RULES = [
    # (a|a) -> a
    (r'\(([^)]+)\|\1\)', r'\1'),
    # a*a* -> a*
    (r'([^*+?])\*\1\*', r'\1*'),
    # a?a? -> a{0,2}
    (r'([^*+?])\?\1\?', r'\1{0,2}'),
    # [a-za-z] -> [a-z] (merging overlapping ranges needs more complex logic)
    # ^.*x -> x (if x is at start, .* is wasteful; context-dependent)
]
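These textual rules only catch surface-level redundancy; a production optimizer should rewrite the parsed AST instead. Still, applying them to a fixpoint makes a reasonable first pass. A sketch using the table above:
import re

def apply_simplifications(pattern, rules=SIMPLIFICATION_RULES, max_passes=10):
    """Apply rewrite rules repeatedly until the pattern stops changing."""
    for _ in range(max_passes):  # cap passes so pathological rules can't loop forever
        new_pattern = pattern
        for search, replacement in rules:
            new_pattern = re.sub(search, replacement, new_pattern)
        if new_pattern == pattern:
            break
        pattern = new_pattern
    return pattern

apply_simplifications(r'(a|a)')  # -> 'a'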
Learning milestones:
- NFA to DFA works → You understand subset construction
- DFA minimization works → You understand state equivalence
- Simplification rules apply → You understand regex algebra
- Performance warnings accurate → You understand complexity
Project Comparison Table
| Project | Difficulty | Time | Depth of Understanding | Fun Factor | Business Value |
|---|---|---|---|---|---|
| 1. Pattern Matcher CLI | Beginner | Weekend | ⭐⭐ | ⭐⭐ | Resume Gold |
| 2. Text Validator | Beginner | Weekend | ⭐⭐⭐ | ⭐⭐ | Micro-SaaS |
| 3. Search & Replace | Intermediate | 1 week | ⭐⭐⭐⭐ | ⭐⭐⭐ | Micro-SaaS |
| 4. Input Sanitizer | Intermediate | 1 week | ⭐⭐⭐ | ⭐⭐⭐ | Service |
| 5. URL/Email Parser | Intermediate | 1 week | ⭐⭐⭐⭐ | ⭐⭐⭐ | Micro-SaaS |
| 6. Log Parser | Intermediate | 1-2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐ | Service |
| 7. Markdown Parser | Advanced | 2 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Open Core |
| 8. Data Extractor | Intermediate | 1-2 weeks | ⭐⭐⭐ | ⭐⭐⭐⭐ | Service |
| 9. Tokenizer/Lexer | Advanced | 2-3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Open Core |
| 10. Template Engine | Intermediate | 1-2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Service |
| 11. Regex Engine | Expert | 1 month | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Disruptor |
| 12. Regex Debugger | Advanced | 2-3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Service |
| 13. Regex Optimizer | Expert | 1 month | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Open Core |
Recommended Learning Paths
For Beginners (New to Regex)
Start with: Project 1 → Project 2 → Project 3
This path teaches you:
- Basic pattern matching and metacharacters
- Character classes, quantifiers, anchors
- Capture groups and backreferences
Time: 2-3 weeks
For Practical Application
Start with: Project 5 → Project 6 → Project 8
This path teaches you:
- Complex patterns for real data formats
- Handling messy real-world input
- Extraction and transformation patterns
Time: 3-5 weeks
For Language/Tool Builders
Start with: Project 9 → Project 11 → Project 13
This path teaches you:
- Lexical analysis with regex
- How regex engines actually work (NFA)
- Optimization and performance
Time: 2-3 months
For Understanding Theory
Start with: Project 11 → Project 12 → Project 13
This path teaches you:
- Finite automata theory
- Debugging and visualization
- NFA to DFA conversion and optimization
Time: 2-3 months
Final Capstone Project: Full Regex Suite
- File: LEARN_REGEX_DEEP_DIVE.md
- Main Programming Language: Rust (for performance) or Python (for clarity)
- Alternative Programming Languages: Go, C
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 5. The “Industry Disruptor”
- Difficulty: Level 5: Master
- Knowledge Area: Regex Engine Implementation
- Software or Tool: Complete Regex Library
- Main Book: “Introduction to Automata Theory” + “Mastering Regular Expressions”
What you’ll build: A complete regex library with:
- Full regex syntax support (PCRE-like)
- Both NFA (backtracking) and DFA (linear time) engines
- Unicode support
- Named groups, lookahead, lookbehind
- Debugging and visualization tools
- Performance analysis
Components to integrate:
- Parser for full regex syntax
- NFA construction (Thompson’s)
- NFA simulation with backtracking
- DFA construction for linear-time matching
- Optimization passes
- Debug/visualization output
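To see how these pieces could fit together, here is a hedged composition sketch; parse, run_optimization_passes, thompson_construct, nfa_to_dfa, minimize_dfa, and BacktrackingMatcher stand for the components above, not a fixed API:
class Regex:
    def __init__(self, pattern, engine='nfa', optimize=False):
        ast = parse(pattern)                    # full regex grammar -> AST
        if optimize:
            ast = run_optimization_passes(ast)  # simplification, prefix factoring
        self.nfa = thompson_construct(ast)      # Thompson's construction
        if engine == 'dfa':
            # Linear-time engine: no backtracking, but no lookaround/backrefs
            self.machine = minimize_dfa(nfa_to_dfa(self.nfa))
        else:
            # Backtracking engine: slower worst case, full feature set
            self.machine = BacktrackingMatcher(self.nfa)

    def match(self, text):
        return self.machine.match(text)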
Real world outcome:
from myregex import Regex, RegexBuilder
# Full syntax support
r = Regex(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')
match = r.match('2024-12-25')
print(match.group('year')) # '2024'
print(match.group('month')) # '12'
# Lookahead/lookbehind
r = Regex(r'(?<=\$)\d+(?=\.)')
r.findall('$100.00 and $200.50') # ['100', '200']
# Engine selection
r = Regex(r'simple-pattern', engine='dfa') # O(n) guaranteed
r = Regex(r'complex(?=pattern)', engine='nfa') # Backtracking for lookahead
# Performance mode
r = Regex(r'pattern', optimize=True) # Run optimization passes
# Debug mode
r = Regex(r'(a+)+$', debug=True)
# Warning: Potential catastrophic backtracking detected
# Visualization
r.to_railroad() # SVG railroad diagram
r.to_nfa_graph() # DOT format NFA
r.to_dfa_graph() # DOT format DFA
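Generating the DOT output is mostly a walk over states and transitions. A minimal sketch, assuming numbered states whose transitions dict maps symbols (None for ε) to target states:
def nfa_to_dot(nfa):
    """Render an NFA as Graphviz DOT text (view with `dot -Tsvg nfa.dot`)."""
    lines = ['digraph NFA {', '  rankdir=LR;']
    for state in nfa.states:
        shape = 'doublecircle' if state is nfa.accept else 'circle'
        lines.append(f'  s{state.id} [shape={shape}];')
        for symbol, targets in state.transitions.items():
            label = 'ε' if symbol is None else symbol
            for target in targets:
                lines.append(f'  s{state.id} -> s{target.id} [label="{label}"];')
    lines.append('}')
    return '\n'.join(lines)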
Time estimate: 3-4 months Prerequisites: Complete at least 8 of the earlier projects
Learning milestones:
- Full syntax parses → You understand regex grammar
- Both engines work → You understand NFA vs DFA trade-offs
- Unicode works → You understand character encoding
- Lookahead works → You understand advanced features
- Performance is competitive → You understand optimization
Summary
| # | Project | Main Language |
|---|---|---|
| 1 | Pattern Matcher CLI | Python |
| 2 | Text Validator Library | JavaScript |
| 3 | Search and Replace Tool | Python |
| 4 | Input Sanitizer | JavaScript |
| 5 | URL & Email Parser | Python |
| 6 | Log Parser | Python |
| 7 | Markdown Parser | JavaScript |
| 8 | Data Extractor | Python |
| 9 | Tokenizer / Lexer | Python |
| 10 | Template Engine | JavaScript |
| 11 | Regex Engine (Basic) | Python |
| 12 | Regex Debugger & Visualizer | JavaScript |
| 13 | Regex Optimizer | Python |
| Capstone | Full Regex Suite | Rust/Python |
Additional Resources
Books
- “Mastering Regular Expressions” by Jeffrey Friedl - THE regex bible
- “Regular Expressions Cookbook” by Goyvaerts & Levithan - Practical patterns
- “Introduction to Automata Theory” by Hopcroft, Motwani, Ullman - Theory
Online Resources
- regex101.com - Interactive regex tester with explanation
- Russ Cox’s articles - “Regular Expression Matching Can Be Simple And Fast”
- Debuggex - Visual regex debugger
Tools to Study
- RE2 - Google’s linear-time regex engine (no backtracking)
- PCRE - Perl-compatible regex (full features)
- Hyperscan - Intel’s high-performance regex engine
Practice
- Regex Golf - Write shortest regex for given matches
- Regex Crossword - Puzzles using regex
- HackerRank Regex Challenges
Cheat Sheet
Metacharacters
. Any character (except newline)
^ Start of string/line
$ End of string/line
* 0 or more
+ 1 or more
? 0 or 1 (optional)
| Alternation (or)
() Grouping
[] Character class
{} Quantifier range
\ Escape
Character Classes
[abc] a, b, or c
[^abc] Not a, b, or c
[a-z] a through z
\d Digit [0-9]
\D Non-digit
\w Word char [a-zA-Z0-9_]
\W Non-word char
\s Whitespace
\S Non-whitespace
Quantifiers
* 0 or more (greedy)
+ 1 or more (greedy)
? 0 or 1 (greedy)
{n} Exactly n
{n,} n or more
{n,m} n to m
*? 0 or more (lazy)
+? 1 or more (lazy)
?? 0 or 1 (lazy)
Groups & References
(...) Capture group
(?:...) Non-capturing group
(?P<n>...) Named group (Python)
(?<n>...) Named group (other)
\1, \2 Backreference
(?P=name) Named backreference in pattern (Python)
\g<name> Named backreference in replacement (Python)
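For example, in Python's re, a named group pairs with (?P=name) in the pattern and \g<name> in the replacement:
import re
# Collapse doubled words: the pattern backreference finds the repeat,
# the replacement backreference keeps a single copy
re.sub(r'\b(?P<word>\w+) (?P=word)\b', r'\g<word>', 'the the cat')  # 'the cat'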
Lookaround
(?=...) Positive lookahead
(?!...) Negative lookahead
(?<=...) Positive lookbehind
(?<!...) Negative lookbehind
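A classic lookaround use is validating without consuming, e.g. a password rule in Python's re:
import re
# Each lookahead asserts a property from position 0 without consuming input;
# only .{8,} actually consumes characters
re.match(r'^(?=.*\d)(?=.*[a-z]).{8,}$', 'secret123')  # -> Match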
Flags
i Case-insensitive
g Global (find all)
m Multiline (^ and $ match line boundaries)
s Dotall (. matches newline)
x Verbose (allow whitespace and comments)
u Unicode
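In Python's re these map to constants (re.I, re.M, re.S, re.X); verbose mode in particular keeps long patterns readable:
import re
phone = re.compile(r"""
    (\d{3})   # area code
    -         # separator
    (\d{4})   # line number
""", re.VERBOSE)
phone.search('call 555-1234').groups()  # ('555', '1234')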
Regular expressions are the closest thing programmers have to magic spells. Master them, and you’ll see patterns everywhere.