Project 4: NFA Regex Engine (Thompson’s Construction)

Build a regex engine based on Thompson NFA construction and simulate it to match patterns in linear time per input.

Quick Reference

Attribute Value
Difficulty Level 4: Expert
Time Estimate 2-3 weeks
Main Programming Language C (Alternatives: Rust, C++)
Alternative Programming Languages Rust, C++
Coolness Level Level 4: Hardcore
Business Potential Level 3: Infrastructure
Prerequisites P01-P03, basic automata theory
Key Topics Thompson NFA, epsilon closure, regex AST, linear-time matching

1. Learning Objectives

By completing this project, you will:

  1. Parse a subset of regex syntax into an AST.
  2. Build an NFA using Thompson construction.
  3. Implement NFA simulation with epsilon-closure.
  4. Guarantee linear-time matching without backtracking.
  5. Validate correctness against a reference regex engine.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Thompson NFA Construction

Fundamentals

Thompson construction turns a regular expression into an NFA that recognizes the same language. Each regex operator (literal, concatenation, alternation, repetition) is translated into a small NFA fragment with epsilon transitions. These fragments are combined into a larger NFA with a single start and accept state. Because the resulting automaton is nondeterministic, a string is accepted if any path through the NFA consumes the input and ends in an accept state.

Thompson NFAs are efficient to build (linear in regex length) and, when simulated correctly, match in O(n*m) worst case where m is NFA size. With standard optimizations (like following epsilon closures once per input position), the behavior is effectively linear in input length times the number of active states, and crucially avoids exponential blowups caused by backtracking.

Deep Dive into the concept

Each regex operator has a canonical NFA fragment. A literal is a single transition. Concatenation links the accept of the first fragment to the start of the next. Alternation creates a new start with epsilon transitions into each branch and a new accept that both branches connect to. Kleene star wraps a fragment with a new start/accept and epsilon transitions that allow zero or more repetitions.

Simulation keeps a set of active NFA states. On each input character, you advance all transitions that match the character and then compute the epsilon-closure of the resulting states. This is the heart of Thompson’s algorithm. The epsilon-closure ensures you always include all states reachable without consuming input, preventing you from missing paths. The simulation is systematic and does not backtrack or explode exponentially; each input character causes a bounded amount of work proportional to the number of NFA states.

Regex engines like RE2 and Rust’s regex crate are built around this principle: use automata, not backtracking. This is the engine that makes ripgrep safe and predictable.

How this fits on projects

This NFA engine is the foundation for literal extraction (P06) and lazy DFA construction (P05). It is also the correctness core for P15.

Definitions & key terms

  • NFA -> nondeterministic finite automaton
  • epsilon transition -> move without consuming input
  • closure -> set of states reachable by epsilon transitions
  • regex AST -> tree representation of regex structure

Mental model diagram (ASCII)

Regex: a(b|c)*d

start --a--> (split)
              |        
              v        
             b ----> (join) --d--> accept
              ^        
              |        
             c --------

(split/join are epsilon transitions)

How it works (step-by-step)

  1. Parse regex into AST.
  2. Convert AST nodes into NFA fragments.
  3. Connect fragments with epsilon transitions.
  4. Initialize active state set with epsilon-closure of start.
  5. For each input byte, advance transitions and recompute closure.

Minimal concrete example

struct State { int c; State* out; State* out1; };
// c = literal or SPLIT or MATCH

Common misconceptions

  • Misconception: NFA simulation is exponential. Correction: Thompson simulation is polynomial and avoids backtracking.
  • Misconception: regex requires backreferences. Correction: regular expressions do not include backreferences.

Check-your-understanding questions

  1. Why do we need epsilon-closure?
  2. How is alternation represented in Thompson NFA?
  3. What makes this engine safe from catastrophic backtracking?

Check-your-understanding answers

  1. To include states reachable without consuming input.
  2. A split state with epsilon transitions to both branches.
  3. It explores all paths in parallel without recursion.

Real-world applications

  • Safe regex engines (RE2, Rust regex)
  • Static analysis and log filtering

Where you’ll apply it

References

  • Thompson (1968), “Programming Techniques: Regular expression search algorithm”
  • Russ Cox, “Regular Expression Matching Can Be Simple And Fast”

Key insights

Thompson NFAs avoid exponential blowups by exploring all paths simultaneously.

Summary

Thompson construction provides a fast, safe regex engine that guarantees predictable performance.

Homework/Exercises to practice the concept

  1. Hand-construct the NFA for regex (ab|cd)*e.
  2. Simulate the NFA on input ababcd and list active states.

Solutions to the homework/exercises

  1. Use split and join nodes for alternation and star.
  2. Track active sets per character, including epsilon closures.

3. Project Specification

3.1 What You Will Build

A CLI tool nfa-regex that parses a regex, builds a Thompson NFA, and matches against files. Supports basic regex operators: concatenation, alternation |, grouping (), and repetition *, +, ?, and . wildcard.

3.2 Functional Requirements

  1. Regex parser: parse a subset of regex into AST.
  2. NFA builder: Thompson construction to build an automaton.
  3. Matcher: simulate NFA over file content.
  4. Output: report matching lines and offsets.
  5. Deterministic mode: fixed ordering and stable output.

3.3 Non-Functional Requirements

  • Performance: linear in input length times NFA size.
  • Reliability: no catastrophic backtracking.
  • Usability: syntax error messages include position.

3.4 Example Usage / Output

$ ./nfa-regex "error|panic" logs.txt
logs.txt:12: panic
logs.txt:48: error

3.5 Data Formats / Schemas / Protocols

Text output as in P01, with regex pattern in header when --stats is enabled.

3.6 Edge Cases

  • Unbalanced parentheses
  • Empty alternation (e.g., a|)
  • Patterns that match empty string

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

make
./nfa-regex "a(b|c)*d" fixtures/input.txt

3.7.2 Golden Path Demo (Deterministic)

Use fixtures/regex_input.txt and --fixed-ts for stable output.

3.7.3 CLI Transcript (Success + Failure)

$ ./nfa-regex --fixed-ts "a(b|c)*d" fixtures/input.txt
fixtures/input.txt:2:1: abbd
exit code: 0

$ ./nfa-regex "a(b|" fixtures/input.txt
error: unbalanced '(' at position 3
exit code: 2

4. Solution Architecture

4.1 High-Level Design

+-----------+   +------------+   +-------------+   +-----------+
| Regex AST |-->| NFA Builder|-->| NFA Simulator|->| Formatter |
+-----------+   +------------+   +-------------+   +-----------+

4.2 Key Components

| Component | Responsibility | Key Decisions | |———–|—————-|—————| | Parser | build AST from regex | shunting-yard or recursive descent | | NFA builder | Thompson construction | explicit states with pointers | | Simulator | active set + closure | bitmap or vector of states |

4.3 Data Structures (No Full Code)

struct State { int c; int out; int out1; };
int active[MAX_STATES];

4.4 Algorithm Overview

  1. Parse regex into AST.
  2. Build NFA with epsilon transitions.
  3. Simulate by maintaining active states per input byte.
  4. Report matches per line/offset.

Complexity Analysis

  • Build: O(m)
  • Match: O(n * S) where S is NFA states, typically linear in regex length

5. Implementation Guide

5.1 Development Environment Setup

cc -O2 -Wall -Wextra -o nfa-regex src/main.c

5.2 Project Structure

nfa-regex/
├── src/
│   ├── main.c
│   ├── parser.c
│   ├── nfa.c
│   └── simulate.c
├── fixtures/
└── Makefile

5.3 The Core Question You’re Answering

“How can I match regex patterns without risking exponential backtracking?”

5.4 Concepts You Must Understand First

  1. Regex operator precedence.
  2. Epsilon transitions and closure.
  3. Active state sets.

5.5 Questions to Guide Your Design

  1. How will you represent concatenation explicitly?
  2. How will you handle *, +, ? operators?
  3. How will you map matches back to line/col?

5.6 Thinking Exercise

Draw the NFA for regex ab*|cd and list its epsilon transitions.

5.7 The Interview Questions They’ll Ask

  1. Why does Thompson NFA avoid catastrophic backtracking?
  2. How do you simulate epsilon transitions efficiently?

5.8 Hints in Layers

Hint 1: Start with literals + concatenation only. Hint 2: Add alternation and grouping. Hint 3: Add *, +, ? operators.

5.9 Books That Will Help

| Topic | Book | Chapter | |——-|——|———| | Regex automata | “Compilers” (Aho, Sethi, Ullman) | automata chapters | | NFA construction | Russ Cox articles | regex series |

5.10 Implementation Phases

Phase 1: Parser (4 days)

  • Parse regex into AST.
  • Checkpoint: AST matches hand-crafted cases.

Phase 2: NFA Construction (4 days)

  • Build NFA for all operators.
  • Checkpoint: NFA matches known patterns.

Phase 3: Simulation (4 days)

  • Implement active set and closure.
  • Checkpoint: matches against reference engine.

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale | |———-|———|—————-|———–| | Parser | recursive descent vs shunting-yard | recursive descent | easier to extend | | Active set | bitset vs vector | bitset | faster membership |


6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples | |———-|———|———-| | Unit | parser correctness | operator precedence cases | | Integration | matching accuracy | small fixture strings | | Edge | empty matches | patterns like a* |

6.2 Critical Test Cases

  1. Pattern a* matches empty string.
  2. Pattern (ab|cd)+ matches repeated inputs.
  3. Invalid regex yields error with position.

6.3 Test Data

regex: a(b|c)*d
text: abcd
expected: match

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution | |———|———|———-| | Missing epsilon closure | false negatives | always compute closure after transitions | | Parser precedence bug | wrong matches | add explicit concatenation operator | | Infinite loops | hangs on * | guard against repeated epsilon cycles |

7.2 Debugging Strategies

  • Print NFA states and transitions.
  • Compare with small regex engine in Python.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add ^ and $ anchors.
  • Add character classes [a-z].

8.2 Intermediate Extensions

  • Add non-greedy quantifiers.
  • Support UTF-8 literals.

8.3 Advanced Extensions

  • Add named capture groups (store offsets only).
  • Build a DFA cache on top of NFA.

9. Real-World Connections

9.1 Industry Applications

  • Safe regex engines in search tools
  • Log filtering pipelines
  • RE2 and Rust regex
  • ripgrep regex-automata crate

9.3 Interview Relevance

  • Discussing NFA vs backtracking

10. Resources

10.1 Essential Reading

  • Thompson (1968) regex paper
  • Russ Cox regex series

10.2 Tools & Documentation

  • re2 docs and tests
  • regex-automata crate docs

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain Thompson construction.
  • I can explain epsilon closure.

11.2 Implementation

  • Regex parsing handles precedence correctly.
  • Matches agree with reference engine.

11.3 Growth

  • I can explain why backtracking is unsafe.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Regex parser for literals, concatenation, alternation
  • Thompson NFA construction

Full Completion:

  • *, +, ?, and . operators
  • Deterministic output and fixtures

Excellence (Going Above & Beyond):

  • Character classes and anchors
  • Performance profile of active set updates