Project 6: Literal Extraction Optimizer

Build a regex literal extractor that finds required substrings and chooses fast prefilters before the full regex engine runs.

Quick Reference

Attribute Value
Difficulty Level 3: Advanced
Time Estimate 1-2 weeks
Main Programming Language Rust (Alternatives: C++, Go)
Alternative Programming Languages C++, Go
Coolness Level Level 4: Hardcore
Business Potential Level 3: Infra / Tooling
Prerequisites P04-P05, regex AST basics
Key Topics Literal extraction, regex AST, prefilters, heuristic selection

1. Learning Objectives

By completing this project, you will:

  1. Walk a regex AST and extract mandatory literals.
  2. Rank literals by selectivity and choose prefilters.
  3. Integrate a literal prefilter with a regex engine.
  4. Measure performance gains from prefiltering.
  5. Explain correctness guarantees for literal extraction.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Literal Extraction from Regex AST

Fundamentals

Most regexes contain literal substrings that must appear in any match. For example, the regex foo.*bar requires the literal “foo” to appear. Literal extraction is the process of analyzing a regex AST to find these required substrings. If you can search for these literals quickly (using a SIMD or fast literal engine), you can filter out most candidate positions before running the full regex engine.

The core idea is correctness: if a literal is guaranteed to appear in every match, then scanning for it is safe. If you only scan for optional literals, you will miss matches. Therefore, extraction focuses on “must” literals derived from concatenation and repetition rules.

Deep Dive into the concept

In a regex AST, concatenation means all parts must appear in order, alternation means one branch will appear, and repetition may mean zero or more occurrences. A safe extractor should:

  • For concatenation: collect literals from each side and combine when possible.
  • For alternation: find literals common to all branches (intersection).
  • For repetition (*, ?): treat sub-literals as optional and do not require them.

A simple strategy is to compute a set of “required literals” with length >= 1. A more advanced strategy computes “literal sequences” and picks the longest literal as the prefilter. You can also compute a best literal based on estimated selectivity: longer literals are typically more selective. Some engines use heuristics like preferring rare bytes or longer substrings.

This optimization sits between regex parsing and matching. It is a cornerstone of ripgrep: literal extraction drives the SIMD prefilters and significantly reduces the number of times the full regex engine runs. It is also the reason why regex search is often much faster than a naive approach.

How this fits on projects

You will use literal extraction to drive SIMD prefilters (P07-P09) and integrate it into the capstone search tool (P15).

Definitions & key terms

  • literal extraction -> identifying mandatory literal substrings in regex
  • prefilter -> fast check before full regex verification
  • selectivity -> probability that a literal occurs in random text
  • AST -> abstract syntax tree of regex

Mental model diagram (ASCII)

Regex:  foo.*(bar|baz)
AST:   concat(foo, star(any), alt(bar, baz))
Required literals: foo
Optional literals: bar, baz (not common)

How it works (step-by-step)

  1. Parse regex into AST.
  2. Compute required literal sets for each node.
  3. Combine for concat (sequence) and intersect for alternation.
  4. Ignore optional literals for * and ?.
  5. Pick best literal for prefilter.

Minimal concrete example

fn required_literals(node: &Ast) -> Vec<String> { /* compute must-lits */ }

Common misconceptions

  • Misconception: Any literal in the regex is safe to prefilter. Correction: Only literals that appear in all matches are safe.
  • Misconception: Longer literals are always better. Correction: Longer literals can be more selective but may be rare.

Check-your-understanding questions

  1. Why is literal extraction safe only for required literals?
  2. What happens with alternation branches that share no literals?
  3. Why do * and ? make literals optional?

Check-your-understanding answers

  1. Optional literals might not appear in a valid match.
  2. The required literal set becomes empty.
  3. Because those operators allow zero occurrences.

Real-world applications

  • ripgrep and RE2 literal prefilters
  • log search tools with regex acceleration

Where you’ll apply it

References

  • Rust regex crate internals (literal extraction)
  • Russ Cox regex series

Key insights

Finding one mandatory literal can cut regex work by orders of magnitude.

Summary

Literal extraction transforms regex search into a fast literal scan plus a precise verification step.

Homework/Exercises to practice the concept

  1. For each regex, list mandatory literals: foo.*bar, a|b, (ab|ac)d.
  2. Implement a function that returns the longest required literal.

Solutions to the homework/exercises

  1. foo is required in the first; none in a|b; d and prefix a are required in (ab|ac)d.
  2. Choose the longest literal from the computed required set.

3. Project Specification

3.1 What You Will Build

A library + CLI literal-extract that parses a regex, extracts required literals, and runs a fast prefilter (using a literal search engine). It then verifies matches with your regex engine.

3.2 Functional Requirements

  1. AST walk: parse regex into AST.
  2. Required literals: compute mandatory literal set.
  3. Selection: choose best literal by length/selectivity.
  4. Prefilter: run literal search; verify with regex.
  5. Stats: report literals found and prefilter hit rate.

3.3 Non-Functional Requirements

  • Performance: 2x+ improvement on real inputs.
  • Reliability: never miss a valid regex match.
  • Usability: clear explanation of extracted literals.

3.4 Example Usage / Output

$ ./literal-extract "foo.*bar" logs.txt
Required literals: ["foo"]
Matches verified: 12

3.5 Data Formats / Schemas / Protocols

Text output plus optional JSON:

{"regex":"foo.*bar","literals":["foo"],"prefilter_hits":42}

3.6 Edge Cases

  • Regex with no required literals
  • Regex that matches empty string
  • Very short literals (length 1)

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

cargo run -- "foo.*bar" fixtures/logs.txt

3.7.2 Golden Path Demo (Deterministic)

Use fixtures/logs.txt and fixed regex for stable outputs.

3.7.3 CLI Transcript (Success + Failure)

$ ./literal-extract "foo.*bar" fixtures/logs.txt
Required literals: ["foo"]
Verified matches: 3
exit code: 0

$ ./literal-extract "(foo|bar" fixtures/logs.txt
error: invalid regex syntax at position 5
exit code: 2

4. Solution Architecture

4.1 High-Level Design

+-----------+   +-----------+   +-----------+   +-----------+
| Regex AST |-->| Extractor |-->| Prefilter |-->| Verifier  |
+-----------+   +-----------+   +-----------+   +-----------+

4.2 Key Components

| Component | Responsibility | Key Decisions | |———–|—————-|—————| | AST parser | parse regex | reuse P04 parser | | Extractor | compute required literals | intersection for alternation | | Prefilter | search literals quickly | call BM or SIMD search | | Verifier | full regex check | NFA or DFA engine |

4.3 Data Structures (No Full Code)

struct LiteralInfo { text: String, required: bool, score: f32 }

4.4 Algorithm Overview

  1. Parse regex AST.
  2. Compute required literals.
  3. Choose best literal by length/score.
  4. Run prefilter; verify candidates with regex engine.

Complexity Analysis

  • Extraction: O(m) for regex length
  • Search: depends on prefilter; verification only for candidates

5. Implementation Guide

5.1 Development Environment Setup

cargo build --release

5.2 Project Structure

literal-extract/
├── src/
│   ├── main.rs
│   ├── ast.rs
│   ├── extract.rs
│   ├── prefilter.rs
│   └── verify.rs
└── fixtures/

5.3 The Core Question You’re Answering

“How can I make regex search fast by scanning for only what must appear?”

5.4 Concepts You Must Understand First

  1. AST traversal and recursion.
  2. Required vs optional literals.
  3. Prefiltering correctness.

5.5 Questions to Guide Your Design

  1. How will you treat alternation branches?
  2. What if there are no required literals?
  3. How will you select the best literal?

5.6 Thinking Exercise

Given regex (foo|bar)baz, which literals are required and why?

5.7 The Interview Questions They’ll Ask

  1. What is literal extraction and why does it help?
  2. How do you ensure correctness when using prefilters?

5.8 Hints in Layers

Hint 1: Start by extracting literal runs in concatenation nodes. Hint 2: For alternation, intersect literal sets. Hint 3: Add a scoring heuristic for choosing the best literal.

5.9 Books That Will Help

| Topic | Book | Chapter | |——-|——|———| | Regex internals | Russ Cox series | regex articles | | Compiler ASTs | “Compilers” (Dragon Book) | parsing |

5.10 Implementation Phases

Phase 1: AST Parsing (3 days)

  • Reuse P04 parser.
  • Checkpoint: AST matches expected tree.

Phase 2: Extraction Logic (3 days)

  • Compute required literals and unit tests.
  • Checkpoint: correct for alternation/optional cases.

Phase 3: Prefilter Integration (3 days)

  • Plug in literal search engine.
  • Checkpoint: speedup over regex alone.

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale | |———-|———|—————-|———–| | Selection heuristic | longest vs score | longest (v1) | simple and effective | | No literals | skip prefilter | yes | correctness-first |


6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples | |———-|———|———-| | Unit | extraction rules | regex AST cases | | Integration | prefilter + regex | fixtures | | Edge | no literals | alternation-only regex |

6.2 Critical Test Cases

  1. Regex a|b -> no required literals.
  2. Regex foo.*bar -> required literal foo.
  3. Regex ab?c -> required literal ac? (careful: b optional).

6.3 Test Data

regex: foo.*bar
text: foo___bar
expected: match

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution | |———|———|———-| | Treating optional as required | missed matches | mark * and ? as optional | | Incorrect alternation | missing required literals | intersect branch literals | | Weak heuristic | poor speedup | choose longest literal |

7.2 Debugging Strategies

  • Print extracted literals for each regex.
  • Compare matches with full regex engine.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Support multiple required literals (AND prefilter).
  • Output literal positions in AST.

8.2 Intermediate Extensions

  • Use frequency tables to score selectivity.
  • Integrate with SIMD literal search.

8.3 Advanced Extensions

  • Build a multi-literal prefilter with Aho-Corasick.
  • Implement automaton-based prefilters (Teddy-style).

9. Real-World Connections

9.1 Industry Applications

  • Fast grep-like tools
  • Regex search in IDEs
  • ripgrep regex-automata
  • Rust regex crate

9.3 Interview Relevance

  • Explaining optimization pipelines in search tools

10. Resources

10.1 Essential Reading

  • Russ Cox regex articles
  • Rust regex internal docs

10.2 Tools & Documentation

  • ripgrep source for literal extraction
  • regex-automata crate docs

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain required vs optional literals.
  • I can justify my literal selection heuristic.

11.2 Implementation

  • Extracted literals are correct for all fixtures.
  • Prefilter does not miss matches.

11.3 Growth

  • I can describe how this integrates with SIMD search.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Extract required literals from regex AST
  • Prefilter + verification pipeline

Full Completion:

  • Statistics on prefilter hits
  • Benchmark against regex-only search

Excellence (Going Above & Beyond):

  • Multi-literal prefilters
  • Selectivity scoring using frequency tables