Project 4: NFA Regex Engine (Thompson’s Construction)

A regex engine using Thompson’s NFA construction. Parse a regex, convert to NFA, and simulate the NFA on input text.

Quick Reference

Attribute Value
Primary Language C
Alternative Languages Rust, Go, Zig
Difficulty Level 4: Expert
Time Estimate 2-3 weeks
Knowledge Area Compilers / Automata Theory
Tooling Custom Implementation
Prerequisites Projects 1-3, understanding of automata theory basics

What You Will Build

A regex engine using Thompson’s NFA construction. Parse a regex, convert to NFA, and simulate the NFA on input text.

Why It Matters

This project builds core skills that appear repeatedly in real-world systems and tooling.

Core Challenges

  • Regex parsing → maps to recursive descent parsing
  • NFA construction (concatenation, alternation, Kleene star) → maps to automata construction
  • NFA simulation (tracking multiple states) → maps to set operations
  • ε-closure computation → maps to graph traversal

Key Concepts

  • Thompson’s Construction: “Compilers: Principles, Techniques, and Tools” Chapter 3.7 - Dragon Book
  • NFA Simulation: “Regular Expression Matching Can Be Simple And Fast” - Russ Cox (swtch.com/~rsc/regexp/regexp1.html)
  • ε-transitions: How to handle “empty” transitions efficiently
  • State Set Representation: Bit vectors for O(1) state operations

Real-World Outcome

$ ./nfa_regex "a(b|c)*d" input.txt
Parsing regex: a(b|c)*d
NFA states: 8
NFA transitions: 10

Found match "abcd" at position 100
Found match "abbbcd" at position 200
Found match "acbcbd" at position 300
Search completed in 15ms

NFA visualization (DOT format):
  0 -a-> 1
  1 -ε-> 2
  1 -ε-> 4
  2 -b-> 3
  4 -c-> 5
  3 -ε-> 6
  5 -ε-> 6
  6 -ε-> 1
  6 -ε-> 7
  7 -d-> 8 (accepting)

Pathological test:
  Pattern: (a+)+$
  Input: "aaaaaaaaaaaaaaaaaaaaaaaaaaaa!"
  Time: 0.1ms  # Still fast! No backtracking!

Implementation Guide

  1. Reproduce the simplest happy-path scenario.
  2. Build the smallest working version of the core feature.
  3. Add input validation and error handling.
  4. Add instrumentation/logging to confirm behavior.
  5. Refactor into clean modules with tests.

Milestones

  • Milestone 1: Minimal working program that runs end-to-end.
  • Milestone 2: Correct outputs for typical inputs.
  • Milestone 3: Robust handling of edge cases.
  • Milestone 4: Clean structure and documented usage.

Validation Checklist

  • Output matches the real-world outcome example
  • Handles invalid inputs safely
  • Provides clear errors and exit codes
  • Repeatable results across runs

References

  • Main guide: TEXT_SEARCH_TOOLS_DEEP_DIVE.md
  • “Compilers: Principles, Techniques, and Tools” by Aho, Sethi, Ullman (Dragon Book)