Project 4: NFA Regex Engine (Thompson’s Construction)

Build a regex engine based on Thompson NFA construction and simulate it to match patterns in linear time per input.

Quick Reference

Attribute	Value
Difficulty	Level 4: Expert
Time Estimate	2-3 weeks
Main Programming Language	C (Alternatives: Rust, C++)
Alternative Programming Languages	Rust, C++
Coolness Level	Level 4: Hardcore
Business Potential	Level 3: Infrastructure
Prerequisites	P01-P03, basic automata theory
Key Topics	Thompson NFA, epsilon closure, regex AST, linear-time matching

1. Learning Objectives

By completing this project, you will:

Parse a subset of regex syntax into an AST.
Build an NFA using Thompson construction.
Implement NFA simulation with epsilon-closure.
Guarantee linear-time matching without backtracking.
Validate correctness against a reference regex engine.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Thompson NFA Construction

Fundamentals

Thompson construction turns a regular expression into an NFA that recognizes the same language. Each regex operator (literal, concatenation, alternation, repetition) is translated into a small NFA fragment with epsilon transitions. These fragments are combined into a larger NFA with a single start and accept state. Because the resulting automaton is nondeterministic, a string is accepted if any path through the NFA consumes the input and ends in an accept state.

Thompson NFAs are efficient to build (linear in regex length) and, when simulated correctly, match in O(n*m) worst case where m is NFA size. With standard optimizations (like following epsilon closures once per input position), the behavior is effectively linear in input length times the number of active states, and crucially avoids exponential blowups caused by backtracking.

Deep Dive into the concept

Each regex operator has a canonical NFA fragment. A literal is a single transition. Concatenation links the accept of the first fragment to the start of the next. Alternation creates a new start with epsilon transitions into each branch and a new accept that both branches connect to. Kleene star wraps a fragment with a new start/accept and epsilon transitions that allow zero or more repetitions.

Simulation keeps a set of active NFA states. On each input character, you advance all transitions that match the character and then compute the epsilon-closure of the resulting states. This is the heart of Thompson’s algorithm. The epsilon-closure ensures you always include all states reachable without consuming input, preventing you from missing paths. The simulation is systematic and does not backtrack or explode exponentially; each input character causes a bounded amount of work proportional to the number of NFA states.

Regex engines like RE2 and Rust’s regex crate are built around this principle: use automata, not backtracking. This is the engine that makes ripgrep safe and predictable.

How this fits on projects

This NFA engine is the foundation for literal extraction (P06) and lazy DFA construction (P05). It is also the correctness core for P15.

Definitions & key terms

NFA -> nondeterministic finite automaton
epsilon transition -> move without consuming input
closure -> set of states reachable by epsilon transitions
regex AST -> tree representation of regex structure

Mental model diagram (ASCII)

Regex: a(b|c)*d

start --a--> (split)
              |        
              v        
             b ----> (join) --d--> accept
              ^        
              |        
             c --------

(split/join are epsilon transitions)

How it works (step-by-step)

Parse regex into AST.
Convert AST nodes into NFA fragments.
Connect fragments with epsilon transitions.
Initialize active state set with epsilon-closure of start.
For each input byte, advance transitions and recompute closure.

Minimal concrete example

struct State { int c; State* out; State* out1; };
// c = literal or SPLIT or MATCH

Common misconceptions

Misconception: NFA simulation is exponential. Correction: Thompson simulation is polynomial and avoids backtracking.
Misconception: regex requires backreferences. Correction: regular expressions do not include backreferences.

Check-your-understanding questions

Why do we need epsilon-closure?
How is alternation represented in Thompson NFA?
What makes this engine safe from catastrophic backtracking?

Check-your-understanding answers

To include states reachable without consuming input.
A split state with epsilon transitions to both branches.
It explores all paths in parallel without recursion.

Real-world applications

Safe regex engines (RE2, Rust regex)
Static analysis and log filtering

Where you’ll apply it

This project: Section 4.4 Algorithm Overview and Section 6.2 Tests.
Also used in: P05-dfa-regex-engine-with-lazy-construction, P15-build-your-own-ripgrep.

References

Thompson (1968), “Programming Techniques: Regular expression search algorithm”
Russ Cox, “Regular Expression Matching Can Be Simple And Fast”

Key insights

Thompson NFAs avoid exponential blowups by exploring all paths simultaneously.

Summary

Thompson construction provides a fast, safe regex engine that guarantees predictable performance.

Homework/Exercises to practice the concept

Hand-construct the NFA for regex (ab|cd)*e.
Simulate the NFA on input ababcd and list active states.

Solutions to the homework/exercises

Use split and join nodes for alternation and star.
Track active sets per character, including epsilon closures.

3. Project Specification

3.1 What You Will Build

A CLI tool nfa-regex that parses a regex, builds a Thompson NFA, and matches against files. Supports basic regex operators: concatenation, alternation |, grouping (), and repetition *, +, ?, and . wildcard.

3.2 Functional Requirements

Regex parser: parse a subset of regex into AST.
NFA builder: Thompson construction to build an automaton.
Matcher: simulate NFA over file content.
Output: report matching lines and offsets.
Deterministic mode: fixed ordering and stable output.

3.3 Non-Functional Requirements

Performance: linear in input length times NFA size.
Reliability: no catastrophic backtracking.
Usability: syntax error messages include position.

3.4 Example Usage / Output

$ ./nfa-regex "error|panic" logs.txt
logs.txt:12: panic
logs.txt:48: error

3.5 Data Formats / Schemas / Protocols

Text output as in P01, with regex pattern in header when --stats is enabled.

3.6 Edge Cases

Unbalanced parentheses
Empty alternation (e.g., a|)
Patterns that match empty string

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

make
./nfa-regex "a(b|c)*d" fixtures/input.txt

3.7.2 Golden Path Demo (Deterministic)

Use fixtures/regex_input.txt and --fixed-ts for stable output.

3.7.3 CLI Transcript (Success + Failure)

$ ./nfa-regex --fixed-ts "a(b|c)*d" fixtures/input.txt
fixtures/input.txt:2:1: abbd
exit code: 0

$ ./nfa-regex "a(b|" fixtures/input.txt
error: unbalanced '(' at position 3
exit code: 2

4. Solution Architecture

4.1 High-Level Design

+-----------+   +------------+   +-------------+   +-----------+
| Regex AST |-->| NFA Builder|-->| NFA Simulator|->| Formatter |
+-----------+   +------------+   +-------------+   +-----------+

4.2 Key Components

4.3 Data Structures (No Full Code)

struct State { int c; int out; int out1; };
int active[MAX_STATES];

4.4 Algorithm Overview

Parse regex into AST.
Build NFA with epsilon transitions.
Simulate by maintaining active states per input byte.
Report matches per line/offset.

Complexity Analysis

Build: O(m)
Match: O(n * S) where S is NFA states, typically linear in regex length

5. Implementation Guide

5.1 Development Environment Setup

cc -O2 -Wall -Wextra -o nfa-regex src/main.c

5.2 Project Structure

nfa-regex/
├── src/
│   ├── main.c
│   ├── parser.c
│   ├── nfa.c
│   └── simulate.c
├── fixtures/
└── Makefile

5.3 The Core Question You’re Answering

“How can I match regex patterns without risking exponential backtracking?”

5.4 Concepts You Must Understand First

Regex operator precedence.
Epsilon transitions and closure.
Active state sets.

5.5 Questions to Guide Your Design

How will you represent concatenation explicitly?
How will you handle *, +, ? operators?
How will you map matches back to line/col?

5.6 Thinking Exercise

Draw the NFA for regex ab*|cd and list its epsilon transitions.

5.7 The Interview Questions They’ll Ask

Why does Thompson NFA avoid catastrophic backtracking?
How do you simulate epsilon transitions efficiently?

5.8 Hints in Layers

Hint 1: Start with literals + concatenation only. Hint 2: Add alternation and grouping. Hint 3: Add *, +, ? operators.

5.9 Books That Will Help

5.10 Implementation Phases

Phase 1: Parser (4 days)

Parse regex into AST.
Checkpoint: AST matches hand-crafted cases.

Phase 2: NFA Construction (4 days)

Build NFA for all operators.
Checkpoint: NFA matches known patterns.

Phase 3: Simulation (4 days)

Implement active set and closure.
Checkpoint: matches against reference engine.

5.11 Key Implementation Decisions

6. Testing Strategy

6.1 Test Categories

6.2 Critical Test Cases

Pattern a* matches empty string.
Pattern (ab|cd)+ matches repeated inputs.
Invalid regex yields error with position.

6.3 Test Data

regex: a(b|c)*d
text: abcd
expected: match

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

7.2 Debugging Strategies

Print NFA states and transitions.
Compare with small regex engine in Python.

8. Extensions & Challenges

8.1 Beginner Extensions

Add ^ and $ anchors.
Add character classes [a-z].

8.2 Intermediate Extensions

Add non-greedy quantifiers.
Support UTF-8 literals.

8.3 Advanced Extensions

Add named capture groups (store offsets only).
Build a DFA cache on top of NFA.

9. Real-World Connections

9.1 Industry Applications

Safe regex engines in search tools
Log filtering pipelines

RE2 and Rust regex
ripgrep regex-automata crate

9.3 Interview Relevance

Discussing NFA vs backtracking

10. Resources

10.1 Essential Reading

Thompson (1968) regex paper
Russ Cox regex series

10.2 Tools & Documentation

re2 docs and tests
regex-automata crate docs

Previous: P03-aho-corasick-multi-pattern-matching
Next: P05-dfa-regex-engine-with-lazy-construction

11. Self-Assessment Checklist

11.1 Understanding

I can explain Thompson construction.
I can explain epsilon closure.

11.2 Implementation

Regex parsing handles precedence correctly.
Matches agree with reference engine.

11.3 Growth

I can explain why backtracking is unsafe.

12. Submission / Completion Criteria

Minimum Viable Completion:

Regex parser for literals, concatenation, alternation
Thompson NFA construction

Full Completion:

*, +, ?, and . operators
Deterministic output and fixtures

Excellence (Going Above & Beyond):

Character classes and anchors
Performance profile of active set updates