Project 6: Literal Extraction Optimizer

Analyze a regex to extract literal strings that MUST appear in any match. Use these literals to quickly skip non-matching regions before running the full regex.

Quick Reference

Attribute Value
Primary Language Rust (or C)
Alternative Languages C++, Go
Difficulty Level 3: Advanced
Time Estimate 1-2 weeks
Knowledge Area Regex Optimization
Tooling Custom Implementation
Prerequisites Projects 4-5, understanding of regex ASTs

What You Will Build

Analyze a regex to extract literal strings that MUST appear in any match. Use these literals to quickly skip non-matching regions before running the full regex.

Why It Matters

This project builds core skills that appear repeatedly in real-world systems and tooling.

Core Challenges

  • Prefix extraction → maps to “what literal MUST the match start with?”
  • Suffix extraction → maps to “what literal MUST the match end with?”
  • Required substring extraction → maps to “what literal MUST appear somewhere?”
  • Handling alternation(foo|bar).*baz has no common prefix, but “baz” is required

Key Concepts

  • Literal Extraction: regex-syntax crate documentation (Rust)
  • Prefix/Suffix Analysis: Walk the regex AST looking for required literals
  • Prefiltering: Run memchr on literals, verify with full regex

Real-World Outcome

$ ./literal_optimizer "function\s+\w+\s*\(.*async" input.js
Regex analysis:
  Prefix literals: ["function"]
  Suffix literals: ["async"]
  Required substrings: ["function", "async"]

Strategy:
  1. memchr for "function"
  2. If found, check if "async" exists after it
  3. If both found, run full regex on candidate region

Benchmark (10MB JavaScript file):
  Full regex scan: 150ms
  Literal-optimized: 8ms
  Speedup: 18.75x

Analysis of common patterns:
  "TODO": 100% literal, use memchr directly
  "error.*line \d+": prefix "error", can skip most text
  "[a-z]+@[a-z]+\.[a-z]+": no useful literals (email regex)
  "\berror\b": literal "error", boundary check at matches

Implementation Guide

  1. Reproduce the simplest happy-path scenario.
  2. Build the smallest working version of the core feature.
  3. Add input validation and error handling.
  4. Add instrumentation/logging to confirm behavior.
  5. Refactor into clean modules with tests.

Milestones

  • Milestone 1: Minimal working program that runs end-to-end.
  • Milestone 2: Correct outputs for typical inputs.
  • Milestone 3: Robust handling of edge cases.
  • Milestone 4: Clean structure and documented usage.

Validation Checklist

  • Output matches the real-world outcome example
  • Handles invalid inputs safely
  • Provides clear errors and exit codes
  • Repeatable results across runs

References

  • Main guide: TEXT_SEARCH_TOOLS_DEEP_DIVE.md
  • “Regular Expression Matching” by Russ Cox (online articles)