Project 5: DFA Regex Engine with Lazy Construction

Convert NFA to DFA using the subset construction, but build states lazily (on-demand) to avoid exponential blowup. Cache states for reuse.

Quick Reference

Attribute Value
Primary Language C
Alternative Languages Rust, C++, Go
Difficulty Level 4: Expert
Time Estimate 2-3 weeks
Knowledge Area Compilers / Automata Theory
Tooling Custom Implementation
Prerequisites Project 4 (NFA Engine)

What You Will Build

Convert NFA to DFA using the subset construction, but build states lazily (on-demand) to avoid exponential blowup. Cache states for reuse.

Why It Matters

This project builds core skills that appear repeatedly in real-world systems and tooling.

Core Challenges

  • Subset construction algorithm → maps to powerset of NFA states
  • State caching/memoization → maps to hash tables for state sets
  • Cache eviction under memory pressure → maps to LRU caches
  • Handling exponential blowup → maps to falling back to NFA

Key Concepts

  • Subset Construction: “Introduction to the Theory of Computation” Chapter 1.2 - Sipser
  • Lazy DFA: “Regular Expression Matching: the Virtual Machine Approach” - Russ Cox
  • State Set Hashing: Efficiently comparing and storing state sets
  • Cache Management: When to evict cached DFA states

Real-World Outcome

$ ./dfa_regex "a(b|c)*d" input.txt
NFA states: 8
DFA cache: lazy construction enabled

Searching...
DFA state 0 (NFA: {0}) -> 'a' -> DFA state 1 (NFA: {1,2,4,6,7})
DFA state 1 -> 'b' -> DFA state 2 (NFA: {1,2,3,4,6,7})
[States constructed on demand]

Found match "abcd" at position 100
Search completed in 5ms (vs 15ms NFA simulation)

Cache statistics:
  DFA states constructed: 4
  DFA states possible (worst case): 256 (2^8)
  Cache hits: 45,231
  Cache misses: 4

Benchmark:
  NFA simulation: 15ms
  Lazy DFA: 5ms
  Full DFA (if we built it all): 3ms but 2MB memory

Implementation Guide

  1. Reproduce the simplest happy-path scenario.
  2. Build the smallest working version of the core feature.
  3. Add input validation and error handling.
  4. Add instrumentation/logging to confirm behavior.
  5. Refactor into clean modules with tests.

Milestones

  • Milestone 1: Minimal working program that runs end-to-end.
  • Milestone 2: Correct outputs for typical inputs.
  • Milestone 3: Robust handling of edge cases.
  • Milestone 4: Clean structure and documented usage.

Validation Checklist

  • Output matches the real-world outcome example
  • Handles invalid inputs safely
  • Provides clear errors and exit codes
  • Repeatable results across runs

References

  • Main guide: TEXT_SEARCH_TOOLS_DEEP_DIVE.md
  • “Introduction to the Theory of Computation” by Michael Sipser