Project 4: NFA Regex Engine (Thompson’s Construction)
Build a regex engine based on Thompson NFA construction and simulate it to match patterns in linear time per input.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 4: Expert |
| Time Estimate | 2-3 weeks |
| Main Programming Language | C (Alternatives: Rust, C++) |
| Alternative Programming Languages | Rust, C++ |
| Coolness Level | Level 4: Hardcore |
| Business Potential | Level 3: Infrastructure |
| Prerequisites | P01-P03, basic automata theory |
| Key Topics | Thompson NFA, epsilon closure, regex AST, linear-time matching |
1. Learning Objectives
By completing this project, you will:
- Parse a subset of regex syntax into an AST.
- Build an NFA using Thompson construction.
- Implement NFA simulation with epsilon-closure.
- Guarantee linear-time matching without backtracking.
- Validate correctness against a reference regex engine.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Thompson NFA Construction
Fundamentals
Thompson construction turns a regular expression into an NFA that recognizes the same language. Each regex operator (literal, concatenation, alternation, repetition) is translated into a small NFA fragment with epsilon transitions. These fragments are combined into a larger NFA with a single start and accept state. Because the resulting automaton is nondeterministic, a string is accepted if any path through the NFA consumes the input and ends in an accept state.
Thompson NFAs are efficient to build (linear in regex length) and, when simulated correctly, match in O(n*m) worst case where m is NFA size. With standard optimizations (like following epsilon closures once per input position), the behavior is effectively linear in input length times the number of active states, and crucially avoids exponential blowups caused by backtracking.
Deep Dive into the concept
Each regex operator has a canonical NFA fragment. A literal is a single transition. Concatenation links the accept of the first fragment to the start of the next. Alternation creates a new start with epsilon transitions into each branch and a new accept that both branches connect to. Kleene star wraps a fragment with a new start/accept and epsilon transitions that allow zero or more repetitions.
Simulation keeps a set of active NFA states. On each input character, you advance all transitions that match the character and then compute the epsilon-closure of the resulting states. This is the heart of Thompson’s algorithm. The epsilon-closure ensures you always include all states reachable without consuming input, preventing you from missing paths. The simulation is systematic and does not backtrack or explode exponentially; each input character causes a bounded amount of work proportional to the number of NFA states.
Regex engines like RE2 and Rust’s regex crate are built around this principle: use automata, not backtracking. This is the engine that makes ripgrep safe and predictable.
How this fits on projects
This NFA engine is the foundation for literal extraction (P06) and lazy DFA construction (P05). It is also the correctness core for P15.
Definitions & key terms
- NFA -> nondeterministic finite automaton
- epsilon transition -> move without consuming input
- closure -> set of states reachable by epsilon transitions
- regex AST -> tree representation of regex structure
Mental model diagram (ASCII)
Regex: a(b|c)*d
start --a--> (split)
|
v
b ----> (join) --d--> accept
^
|
c --------
(split/join are epsilon transitions)
How it works (step-by-step)
- Parse regex into AST.
- Convert AST nodes into NFA fragments.
- Connect fragments with epsilon transitions.
- Initialize active state set with epsilon-closure of start.
- For each input byte, advance transitions and recompute closure.
Minimal concrete example
struct State { int c; State* out; State* out1; };
// c = literal or SPLIT or MATCH
Common misconceptions
- Misconception: NFA simulation is exponential. Correction: Thompson simulation is polynomial and avoids backtracking.
- Misconception: regex requires backreferences. Correction: regular expressions do not include backreferences.
Check-your-understanding questions
- Why do we need epsilon-closure?
- How is alternation represented in Thompson NFA?
- What makes this engine safe from catastrophic backtracking?
Check-your-understanding answers
- To include states reachable without consuming input.
- A split state with epsilon transitions to both branches.
- It explores all paths in parallel without recursion.
Real-world applications
- Safe regex engines (RE2, Rust regex)
- Static analysis and log filtering
Where you’ll apply it
- This project: Section 4.4 Algorithm Overview and Section 6.2 Tests.
- Also used in: P05-dfa-regex-engine-with-lazy-construction, P15-build-your-own-ripgrep.
References
- Thompson (1968), “Programming Techniques: Regular expression search algorithm”
- Russ Cox, “Regular Expression Matching Can Be Simple And Fast”
Key insights
Thompson NFAs avoid exponential blowups by exploring all paths simultaneously.
Summary
Thompson construction provides a fast, safe regex engine that guarantees predictable performance.
Homework/Exercises to practice the concept
- Hand-construct the NFA for regex
(ab|cd)*e. - Simulate the NFA on input
ababcdand list active states.
Solutions to the homework/exercises
- Use split and join nodes for alternation and star.
- Track active sets per character, including epsilon closures.
3. Project Specification
3.1 What You Will Build
A CLI tool nfa-regex that parses a regex, builds a Thompson NFA, and matches against files. Supports basic regex operators: concatenation, alternation |, grouping (), and repetition *, +, ?, and . wildcard.
3.2 Functional Requirements
- Regex parser: parse a subset of regex into AST.
- NFA builder: Thompson construction to build an automaton.
- Matcher: simulate NFA over file content.
- Output: report matching lines and offsets.
- Deterministic mode: fixed ordering and stable output.
3.3 Non-Functional Requirements
- Performance: linear in input length times NFA size.
- Reliability: no catastrophic backtracking.
- Usability: syntax error messages include position.
3.4 Example Usage / Output
$ ./nfa-regex "error|panic" logs.txt
logs.txt:12: panic
logs.txt:48: error
3.5 Data Formats / Schemas / Protocols
Text output as in P01, with regex pattern in header when --stats is enabled.
3.6 Edge Cases
- Unbalanced parentheses
- Empty alternation (e.g.,
a|) - Patterns that match empty string
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
make
./nfa-regex "a(b|c)*d" fixtures/input.txt
3.7.2 Golden Path Demo (Deterministic)
Use fixtures/regex_input.txt and --fixed-ts for stable output.
3.7.3 CLI Transcript (Success + Failure)
$ ./nfa-regex --fixed-ts "a(b|c)*d" fixtures/input.txt
fixtures/input.txt:2:1: abbd
exit code: 0
$ ./nfa-regex "a(b|" fixtures/input.txt
error: unbalanced '(' at position 3
exit code: 2
4. Solution Architecture
4.1 High-Level Design
+-----------+ +------------+ +-------------+ +-----------+
| Regex AST |-->| NFA Builder|-->| NFA Simulator|->| Formatter |
+-----------+ +------------+ +-------------+ +-----------+
4.2 Key Components
| Component | Responsibility | Key Decisions | |———–|—————-|—————| | Parser | build AST from regex | shunting-yard or recursive descent | | NFA builder | Thompson construction | explicit states with pointers | | Simulator | active set + closure | bitmap or vector of states |
4.3 Data Structures (No Full Code)
struct State { int c; int out; int out1; };
int active[MAX_STATES];
4.4 Algorithm Overview
- Parse regex into AST.
- Build NFA with epsilon transitions.
- Simulate by maintaining active states per input byte.
- Report matches per line/offset.
Complexity Analysis
- Build: O(m)
- Match: O(n * S) where S is NFA states, typically linear in regex length
5. Implementation Guide
5.1 Development Environment Setup
cc -O2 -Wall -Wextra -o nfa-regex src/main.c
5.2 Project Structure
nfa-regex/
├── src/
│ ├── main.c
│ ├── parser.c
│ ├── nfa.c
│ └── simulate.c
├── fixtures/
└── Makefile
5.3 The Core Question You’re Answering
“How can I match regex patterns without risking exponential backtracking?”
5.4 Concepts You Must Understand First
- Regex operator precedence.
- Epsilon transitions and closure.
- Active state sets.
5.5 Questions to Guide Your Design
- How will you represent concatenation explicitly?
- How will you handle
*,+,?operators? - How will you map matches back to line/col?
5.6 Thinking Exercise
Draw the NFA for regex ab*|cd and list its epsilon transitions.
5.7 The Interview Questions They’ll Ask
- Why does Thompson NFA avoid catastrophic backtracking?
- How do you simulate epsilon transitions efficiently?
5.8 Hints in Layers
Hint 1: Start with literals + concatenation only.
Hint 2: Add alternation and grouping.
Hint 3: Add *, +, ? operators.
5.9 Books That Will Help
| Topic | Book | Chapter | |——-|——|———| | Regex automata | “Compilers” (Aho, Sethi, Ullman) | automata chapters | | NFA construction | Russ Cox articles | regex series |
5.10 Implementation Phases
Phase 1: Parser (4 days)
- Parse regex into AST.
- Checkpoint: AST matches hand-crafted cases.
Phase 2: NFA Construction (4 days)
- Build NFA for all operators.
- Checkpoint: NFA matches known patterns.
Phase 3: Simulation (4 days)
- Implement active set and closure.
- Checkpoint: matches against reference engine.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale | |———-|———|—————-|———–| | Parser | recursive descent vs shunting-yard | recursive descent | easier to extend | | Active set | bitset vs vector | bitset | faster membership |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|———-|———|———-|
| Unit | parser correctness | operator precedence cases |
| Integration | matching accuracy | small fixture strings |
| Edge | empty matches | patterns like a* |
6.2 Critical Test Cases
- Pattern
a*matches empty string. - Pattern
(ab|cd)+matches repeated inputs. - Invalid regex yields error with position.
6.3 Test Data
regex: a(b|c)*d
text: abcd
expected: match
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|———|———|———-|
| Missing epsilon closure | false negatives | always compute closure after transitions |
| Parser precedence bug | wrong matches | add explicit concatenation operator |
| Infinite loops | hangs on * | guard against repeated epsilon cycles |
7.2 Debugging Strategies
- Print NFA states and transitions.
- Compare with small regex engine in Python.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add
^and$anchors. - Add character classes
[a-z].
8.2 Intermediate Extensions
- Add non-greedy quantifiers.
- Support UTF-8 literals.
8.3 Advanced Extensions
- Add named capture groups (store offsets only).
- Build a DFA cache on top of NFA.
9. Real-World Connections
9.1 Industry Applications
- Safe regex engines in search tools
- Log filtering pipelines
9.2 Related Open Source Projects
- RE2 and Rust regex
- ripgrep regex-automata crate
9.3 Interview Relevance
- Discussing NFA vs backtracking
10. Resources
10.1 Essential Reading
- Thompson (1968) regex paper
- Russ Cox regex series
10.2 Tools & Documentation
re2docs and testsregex-automatacrate docs
10.3 Related Projects in This Series
11. Self-Assessment Checklist
11.1 Understanding
- I can explain Thompson construction.
- I can explain epsilon closure.
11.2 Implementation
- Regex parsing handles precedence correctly.
- Matches agree with reference engine.
11.3 Growth
- I can explain why backtracking is unsafe.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Regex parser for literals, concatenation, alternation
- Thompson NFA construction
Full Completion:
*,+,?, and.operators- Deterministic output and fixtures
Excellence (Going Above & Beyond):
- Character classes and anchors
- Performance profile of active set updates