Project 4: NFA Regex Engine (Thompson’s Construction)
A regex engine using Thompson’s NFA construction. Parse a regex, convert to NFA, and simulate the NFA on input text.
Quick Reference
| Attribute | Value |
|---|---|
| Primary Language | C |
| Alternative Languages | Rust, Go, Zig |
| Difficulty | Level 4: Expert |
| Time Estimate | 2-3 weeks |
| Knowledge Area | Compilers / Automata Theory |
| Tooling | Custom Implementation |
| Prerequisites | Projects 1-3, understanding of automata theory basics |
What You Will Build
A regex engine using Thompson’s NFA construction. Parse a regex, convert to NFA, and simulate the NFA on input text.
Why It Matters
This project builds core skills that appear repeatedly in real-world systems and tooling.
Core Challenges
- Regex parsing → maps to recursive descent parsing
- NFA construction (concatenation, alternation, Kleene star) → maps to automata construction
- NFA simulation (tracking multiple states) → maps to set operations
- ε-closure computation → maps to graph traversal
Key Concepts
- Thompson’s Construction: “Compilers: Principles, Techniques, and Tools” Chapter 3.7 - Dragon Book
- NFA Simulation: “Regular Expression Matching Can Be Simple And Fast” - Russ Cox (swtch.com/~rsc/regexp/regexp1.html)
- ε-transitions: How to handle “empty” transitions efficiently
- State Set Representation: Bit vectors for O(1) state operations
Real-World Outcome
$ ./nfa_regex "a(b|c)*d" input.txt
Parsing regex: a(b|c)*d
NFA states: 8
NFA transitions: 10
Found match "abcd" at position 100
Found match "abbbcd" at position 200
Found match "acbcbd" at position 300
Search completed in 15ms
NFA visualization (DOT format):
0 -a-> 1
1 -ε-> 2
1 -ε-> 4
2 -b-> 3
4 -c-> 5
3 -ε-> 6
5 -ε-> 6
6 -ε-> 1
6 -ε-> 7
7 -d-> 8 (accepting)
Pathological test:
Pattern: (a+)+$
Input: "aaaaaaaaaaaaaaaaaaaaaaaaaaaa!"
Time: 0.1ms # Still fast! No backtracking!
Implementation Guide
- Reproduce the simplest happy-path scenario.
- Build the smallest working version of the core feature.
- Add input validation and error handling.
- Add instrumentation/logging to confirm behavior.
- Refactor into clean modules with tests.
Milestones
- Milestone 1: Minimal working program that runs end-to-end.
- Milestone 2: Correct outputs for typical inputs.
- Milestone 3: Robust handling of edge cases.
- Milestone 4: Clean structure and documented usage.
Validation Checklist
- Output matches the real-world outcome example
- Handles invalid inputs safely
- Provides clear errors and exit codes
- Repeatable results across runs
References
- Main guide:
TEXT_SEARCH_TOOLS_DEEP_DIVE.md - “Compilers: Principles, Techniques, and Tools” by Aho, Sethi, Ullman (Dragon Book)