Project 5: DFA Regex Engine with Lazy Construction
Convert NFA to DFA using the subset construction, but build states lazily (on-demand) to avoid exponential blowup. Cache states for reuse.
Quick Reference
| Attribute | Value |
|---|---|
| Primary Language | C |
| Alternative Languages | Rust, C++, Go |
| Difficulty | Level 4: Expert |
| Time Estimate | 2-3 weeks |
| Knowledge Area | Compilers / Automata Theory |
| Tooling | Custom Implementation |
| Prerequisites | Project 4 (NFA Engine) |
What You Will Build
Convert NFA to DFA using the subset construction, but build states lazily (on-demand) to avoid exponential blowup. Cache states for reuse.
Why It Matters
This project builds core skills that appear repeatedly in real-world systems and tooling.
Core Challenges
- Subset construction algorithm → maps to powerset of NFA states
- State caching/memoization → maps to hash tables for state sets
- Cache eviction under memory pressure → maps to LRU caches
- Handling exponential blowup → maps to falling back to NFA
Key Concepts
- Subset Construction: “Introduction to the Theory of Computation” Chapter 1.2 - Sipser
- Lazy DFA: “Regular Expression Matching: the Virtual Machine Approach” - Russ Cox
- State Set Hashing: Efficiently comparing and storing state sets
- Cache Management: When to evict cached DFA states
Real-World Outcome
$ ./dfa_regex "a(b|c)*d" input.txt
NFA states: 8
DFA cache: lazy construction enabled
Searching...
DFA state 0 (NFA: {0}) -> 'a' -> DFA state 1 (NFA: {1,2,4,6,7})
DFA state 1 -> 'b' -> DFA state 2 (NFA: {1,2,3,4,6,7})
[States constructed on demand]
Found match "abcd" at position 100
Search completed in 5ms (vs 15ms NFA simulation)
Cache statistics:
DFA states constructed: 4
DFA states possible (worst case): 256 (2^8)
Cache hits: 45,231
Cache misses: 4
Benchmark:
NFA simulation: 15ms
Lazy DFA: 5ms
Full DFA (if we built it all): 3ms but 2MB memory
Implementation Guide
- Reproduce the simplest happy-path scenario.
- Build the smallest working version of the core feature.
- Add input validation and error handling.
- Add instrumentation/logging to confirm behavior.
- Refactor into clean modules with tests.
Milestones
- Milestone 1: Minimal working program that runs end-to-end.
- Milestone 2: Correct outputs for typical inputs.
- Milestone 3: Robust handling of edge cases.
- Milestone 4: Clean structure and documented usage.
Validation Checklist
- Output matches the real-world outcome example
- Handles invalid inputs safely
- Provides clear errors and exit codes
- Repeatable results across runs
References
- Main guide:
TEXT_SEARCH_TOOLS_DEEP_DIVE.md - “Introduction to the Theory of Computation” by Michael Sipser