Project 6: Literal Extraction Optimizer
Analyze a regex to extract literal strings that MUST appear in any match. Use these literals to quickly skip non-matching regions before running the full regex.
Quick Reference
| Attribute | Value |
|---|---|
| Primary Language | Rust (or C) |
| Alternative Languages | C++, Go |
| Difficulty | Level 3: Advanced |
| Time Estimate | 1-2 weeks |
| Knowledge Area | Regex Optimization |
| Tooling | Custom Implementation |
| Prerequisites | Projects 4-5, understanding of regex ASTs |
What You Will Build
Analyze a regex to extract literal strings that MUST appear in any match. Use these literals to quickly skip non-matching regions before running the full regex.
Why It Matters
This project builds core skills that appear repeatedly in real-world systems and tooling.
Core Challenges
- Prefix extraction → maps to “what literal MUST the match start with?”
- Suffix extraction → maps to “what literal MUST the match end with?”
- Required substring extraction → maps to “what literal MUST appear somewhere?”
- Handling alternation →
(foo|bar).*bazhas no common prefix, but “baz” is required
Key Concepts
- Literal Extraction: regex-syntax crate documentation (Rust)
- Prefix/Suffix Analysis: Walk the regex AST looking for required literals
- Prefiltering: Run memchr on literals, verify with full regex
Real-World Outcome
$ ./literal_optimizer "function\s+\w+\s*\(.*async" input.js
Regex analysis:
Prefix literals: ["function"]
Suffix literals: ["async"]
Required substrings: ["function", "async"]
Strategy:
1. memchr for "function"
2. If found, check if "async" exists after it
3. If both found, run full regex on candidate region
Benchmark (10MB JavaScript file):
Full regex scan: 150ms
Literal-optimized: 8ms
Speedup: 18.75x
Analysis of common patterns:
"TODO": 100% literal, use memchr directly
"error.*line \d+": prefix "error", can skip most text
"[a-z]+@[a-z]+\.[a-z]+": no useful literals (email regex)
"\berror\b": literal "error", boundary check at matches
Implementation Guide
- Reproduce the simplest happy-path scenario.
- Build the smallest working version of the core feature.
- Add input validation and error handling.
- Add instrumentation/logging to confirm behavior.
- Refactor into clean modules with tests.
Milestones
- Milestone 1: Minimal working program that runs end-to-end.
- Milestone 2: Correct outputs for typical inputs.
- Milestone 3: Robust handling of edge cases.
- Milestone 4: Clean structure and documented usage.
Validation Checklist
- Output matches the real-world outcome example
- Handles invalid inputs safely
- Provides clear errors and exit codes
- Repeatable results across runs
References
- Main guide:
TEXT_SEARCH_TOOLS_DEEP_DIVE.md - “Regular Expression Matching” by Russ Cox (online articles)