Project 6: Literal Extraction Optimizer

Build a regex literal extractor that finds required substrings and chooses fast prefilters before the full regex engine runs.

Quick Reference

Attribute	Value
Difficulty	Level 3: Advanced
Time Estimate	1-2 weeks
Main Programming Language	Rust (Alternatives: C++, Go)
Alternative Programming Languages	C++, Go
Coolness Level	Level 4: Hardcore
Business Potential	Level 3: Infra / Tooling
Prerequisites	P04-P05, regex AST basics
Key Topics	Literal extraction, regex AST, prefilters, heuristic selection

1. Learning Objectives

By completing this project, you will:

Walk a regex AST and extract mandatory literals.
Rank literals by selectivity and choose prefilters.
Integrate a literal prefilter with a regex engine.
Measure performance gains from prefiltering.
Explain correctness guarantees for literal extraction.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Literal Extraction from Regex AST

Fundamentals

Most regexes contain literal substrings that must appear in any match. For example, the regex foo.*bar requires the literal “foo” to appear. Literal extraction is the process of analyzing a regex AST to find these required substrings. If you can search for these literals quickly (using a SIMD or fast literal engine), you can filter out most candidate positions before running the full regex engine.

The core idea is correctness: if a literal is guaranteed to appear in every match, then scanning for it is safe. If you only scan for optional literals, you will miss matches. Therefore, extraction focuses on “must” literals derived from concatenation and repetition rules.

Deep Dive into the concept

In a regex AST, concatenation means all parts must appear in order, alternation means one branch will appear, and repetition may mean zero or more occurrences. A safe extractor should:

For concatenation: collect literals from each side and combine when possible.
For alternation: find literals common to all branches (intersection).
For repetition (*, ?): treat sub-literals as optional and do not require them.

A simple strategy is to compute a set of “required literals” with length >= 1. A more advanced strategy computes “literal sequences” and picks the longest literal as the prefilter. You can also compute a best literal based on estimated selectivity: longer literals are typically more selective. Some engines use heuristics like preferring rare bytes or longer substrings.

This optimization sits between regex parsing and matching. It is a cornerstone of ripgrep: literal extraction drives the SIMD prefilters and significantly reduces the number of times the full regex engine runs. It is also the reason why regex search is often much faster than a naive approach.

How this fits on projects

You will use literal extraction to drive SIMD prefilters (P07-P09) and integrate it into the capstone search tool (P15).

Definitions & key terms

literal extraction -> identifying mandatory literal substrings in regex
prefilter -> fast check before full regex verification
selectivity -> probability that a literal occurs in random text
AST -> abstract syntax tree of regex

Mental model diagram (ASCII)

Regex:  foo.*(bar|baz)
AST:   concat(foo, star(any), alt(bar, baz))
Required literals: foo
Optional literals: bar, baz (not common)

How it works (step-by-step)

Parse regex into AST.
Compute required literal sets for each node.
Combine for concat (sequence) and intersect for alternation.
Ignore optional literals for * and ?.
Pick best literal for prefilter.

Minimal concrete example

fn required_literals(node: &Ast) -> Vec<String> { /* compute must-lits */ }

Common misconceptions

Misconception: Any literal in the regex is safe to prefilter. Correction: Only literals that appear in all matches are safe.
Misconception: Longer literals are always better. Correction: Longer literals can be more selective but may be rare.

Check-your-understanding questions

Why is literal extraction safe only for required literals?
What happens with alternation branches that share no literals?
Why do * and ? make literals optional?

Check-your-understanding answers

Optional literals might not appear in a valid match.
The required literal set becomes empty.
Because those operators allow zero occurrences.

Real-world applications

ripgrep and RE2 literal prefilters
log search tools with regex acceleration

Where you’ll apply it

This project: Section 3.2 Functional Requirements and Section 4.4 Algorithm.
Also used in: P07-simd-memchr-implementation, P15-build-your-own-ripgrep.

References

Rust regex crate internals (literal extraction)
Russ Cox regex series

Key insights

Finding one mandatory literal can cut regex work by orders of magnitude.

Summary

Literal extraction transforms regex search into a fast literal scan plus a precise verification step.

Homework/Exercises to practice the concept

For each regex, list mandatory literals: foo.*bar, a|b, (ab|ac)d.
Implement a function that returns the longest required literal.

Solutions to the homework/exercises

foo is required in the first; none in a|b; d and prefix a are required in (ab|ac)d.
Choose the longest literal from the computed required set.

3. Project Specification

3.1 What You Will Build

A library + CLI literal-extract that parses a regex, extracts required literals, and runs a fast prefilter (using a literal search engine). It then verifies matches with your regex engine.

3.2 Functional Requirements

AST walk: parse regex into AST.
Required literals: compute mandatory literal set.
Selection: choose best literal by length/selectivity.
Prefilter: run literal search; verify with regex.
Stats: report literals found and prefilter hit rate.

3.3 Non-Functional Requirements

Performance: 2x+ improvement on real inputs.
Reliability: never miss a valid regex match.
Usability: clear explanation of extracted literals.

3.4 Example Usage / Output

$ ./literal-extract "foo.*bar" logs.txt
Required literals: ["foo"]
Matches verified: 12

3.5 Data Formats / Schemas / Protocols

Text output plus optional JSON:

{"regex":"foo.*bar","literals":["foo"],"prefilter_hits":42}

3.6 Edge Cases

Regex with no required literals
Regex that matches empty string
Very short literals (length 1)

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

cargo run -- "foo.*bar" fixtures/logs.txt

3.7.2 Golden Path Demo (Deterministic)

Use fixtures/logs.txt and fixed regex for stable outputs.

3.7.3 CLI Transcript (Success + Failure)

$ ./literal-extract "foo.*bar" fixtures/logs.txt
Required literals: ["foo"]
Verified matches: 3
exit code: 0

$ ./literal-extract "(foo|bar" fixtures/logs.txt
error: invalid regex syntax at position 5
exit code: 2

4. Solution Architecture

4.1 High-Level Design

+-----------+   +-----------+   +-----------+   +-----------+
| Regex AST |-->| Extractor |-->| Prefilter |-->| Verifier  |
+-----------+   +-----------+   +-----------+   +-----------+

4.2 Key Components

4.3 Data Structures (No Full Code)

struct LiteralInfo { text: String, required: bool, score: f32 }

4.4 Algorithm Overview

Parse regex AST.
Compute required literals.
Choose best literal by length/score.
Run prefilter; verify candidates with regex engine.

Complexity Analysis

Extraction: O(m) for regex length
Search: depends on prefilter; verification only for candidates

5. Implementation Guide

5.1 Development Environment Setup

cargo build --release

5.2 Project Structure

literal-extract/
├── src/
│   ├── main.rs
│   ├── ast.rs
│   ├── extract.rs
│   ├── prefilter.rs
│   └── verify.rs
└── fixtures/

5.3 The Core Question You’re Answering

“How can I make regex search fast by scanning for only what must appear?”

5.4 Concepts You Must Understand First

AST traversal and recursion.
Required vs optional literals.
Prefiltering correctness.

5.5 Questions to Guide Your Design

How will you treat alternation branches?
What if there are no required literals?
How will you select the best literal?

5.6 Thinking Exercise

Given regex (foo|bar)baz, which literals are required and why?

5.7 The Interview Questions They’ll Ask

What is literal extraction and why does it help?
How do you ensure correctness when using prefilters?

5.8 Hints in Layers

Hint 1: Start by extracting literal runs in concatenation nodes. Hint 2: For alternation, intersect literal sets. Hint 3: Add a scoring heuristic for choosing the best literal.

5.9 Books That Will Help

5.10 Implementation Phases

Phase 1: AST Parsing (3 days)

Reuse P04 parser.
Checkpoint: AST matches expected tree.

Phase 2: Extraction Logic (3 days)

Compute required literals and unit tests.
Checkpoint: correct for alternation/optional cases.

Phase 3: Prefilter Integration (3 days)

Plug in literal search engine.
Checkpoint: speedup over regex alone.

5.11 Key Implementation Decisions

6. Testing Strategy

6.1 Test Categories

6.2 Critical Test Cases

Regex a|b -> no required literals.
Regex foo.*bar -> required literal foo.
Regex ab?c -> required literal ac? (careful: b optional).

6.3 Test Data

regex: foo.*bar
text: foo___bar
expected: match

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

7.2 Debugging Strategies

Print extracted literals for each regex.
Compare matches with full regex engine.

8. Extensions & Challenges

8.1 Beginner Extensions

Support multiple required literals (AND prefilter).
Output literal positions in AST.

8.2 Intermediate Extensions

Use frequency tables to score selectivity.
Integrate with SIMD literal search.

8.3 Advanced Extensions

Build a multi-literal prefilter with Aho-Corasick.
Implement automaton-based prefilters (Teddy-style).

9. Real-World Connections

9.1 Industry Applications

Fast grep-like tools
Regex search in IDEs

ripgrep regex-automata
Rust regex crate

9.3 Interview Relevance

Explaining optimization pipelines in search tools

10. Resources

10.1 Essential Reading

Russ Cox regex articles
Rust regex internal docs

10.2 Tools & Documentation

ripgrep source for literal extraction
regex-automata crate docs

Previous: P05-dfa-regex-engine-with-lazy-construction
Next: P07-simd-memchr-implementation

11. Self-Assessment Checklist

11.1 Understanding

I can explain required vs optional literals.
I can justify my literal selection heuristic.

11.2 Implementation

Extracted literals are correct for all fixtures.
Prefilter does not miss matches.

11.3 Growth

I can describe how this integrates with SIMD search.

12. Submission / Completion Criteria

Minimum Viable Completion:

Extract required literals from regex AST
Prefilter + verification pipeline

Full Completion:

Statistics on prefilter hits
Benchmark against regex-only search

Excellence (Going Above & Beyond):

Multi-literal prefilters
Selectivity scoring using frequency tables