Learn Text Search Tools: From Naive Search to Production-Grade Performance

Goal: Deeply understand why modern search tools like ripgrep can be 5-10x faster than traditional grep—not through magic, but through systematic application of algorithm theory, SIMD vectorization, parallel systems programming, and careful I/O optimization. After completing these projects, you’ll internalize the principles that make search fast: finite automata over backtracking, SIMD literal extraction, lock-free parallelism, and adaptive I/O strategies. You’ll be able to build tools that search gigabytes per second, understand performance bottlenecks in any text processing system, and answer “why is this slow?” with concrete evidence.


Introduction: What This Guide Covers

Text search tools are command-line systems that scan bytes in files, apply pattern engines (literals, regex, fuzzy scoring), and return matches with context and metadata fast enough to keep humans in flow and systems in production. You are not just learning how to “find a string”; you are learning how to design a pipeline that pushes the CPU, memory hierarchy, and filesystem to their limits without sacrificing correctness.

What you will build by the end of this guide:

  • A family of string matching engines (naive, Boyer-Moore, Aho-Corasick) you can benchmark and reason about
  • A regex engine (NFA + lazy DFA) and a literal extraction optimizer
  • SIMD-accelerated literal search (memchr/memmem/Teddy-style)
  • A parallel directory walker with ignore rules and binary detection
  • A fuzzy matcher + interactive TUI search tool (fzf-style)
  • A production-grade ripgrep-style tool that integrates everything

Scope (what’s included):

  • Exact and fuzzy pattern matching algorithms
  • Regex engine architecture (NFA/DFA, linear-time guarantees)
  • SIMD vectorization and cache-aware search loops
  • Parallel traversal, file filtering, and I/O strategy selection
  • Line-oriented output, context windows, and UX/CLI ergonomics

Out of scope (for this guide):

  • Full-text indexing engines (Lucene/Elasticsearch), ranking, stemming, BM25
  • Distributed search clusters and inverted index maintenance
  • NLP pipelines or semantic search

The Big Picture (One-Screen Mental Model)

Pattern ---> Parse ---> Optimize ----> Execute ----> Format ----> Output
    |          |            |             |            |            |
    |          |            |             |            |            +--> Context lines, colors, stats
    |          |            |             |            +--> Line/byte offsets, file paths
    |          |            |             +--> Regex engine / SIMD literal scan
    |          |            +--> Literal extraction, prefilters, heuristics
    |          +--> Regex AST, glob/ignore rules
    +--> CLI flags, encodings, case-folding

Key Terms You’ll See Everywhere

  • Haystack: The text or file(s) you’re searching.
  • Needle: The pattern (literal/regex/fuzzy) you’re searching for.
  • Literal: A fixed substring that must appear in every match.
  • Prefilter: A fast check (often SIMD) that narrows candidates.
  • Automaton: A finite-state machine that recognizes patterns.
  • Line-oriented: Matches are reported per line, not per byte stream.

How to Use This Guide

  1. Read the Theory Primer once. It’s a mini-book: concepts first, projects later.
  2. Pick a learning path (Algorithm, Performance, Tool Builder, or Interview) and do projects in order.
  3. Always benchmark. Each project must be validated against a baseline (Project 1) and a real tool (grep/rg).
  4. Optimize only after correctness. Use tests, fuzzing, and diff outputs against existing tools.
  5. Keep a lab notebook. Record inputs, flags, timing, and why something is faster/slower.

Why Text Search Tools Matter

The Modern Problem It Solves (Start Here)

Modern software systems generate massive text streams: logs, traces, codebases, datasets, config files, and test artifacts. When something breaks, the fastest path to understanding is often “search the whole tree for a clue.” A tool that’s 5x faster doesn’t just save seconds—it collapses feedback loops, makes debugging viable under pressure, and changes how you work.

Real-World Impact (Benchmarks You Can Reproduce)

  • Single huge file search: In Andrew Gallant’s ripgrep benchmark suite, searching a 9.3GB file for a literal took ~1.786s with rg vs ~5.119s with GNU grep on the same corpus. (The benchmark uses the full English subtitle corpus; the suite also includes 1GB samples.)
  • Context output is expensive: In the same suite, asking for --context 2 adds noticeable overhead; tools that compute line numbers or context lines pay for it every match.
  • Unicode correctness vs speed tradeoff: The benchmarks show that tools that cut Unicode corners sometimes look fast but fail to find the right matches. Correctness matters.

Why CLI Search Still Wins (Even With IDEs)

  • Works over SSH on remote servers
  • Composes with shell pipelines and automation
  • Searches non-indexed or generated files (logs, data, binaries with text sections)
  • Scales to directories too large for IDE indexing
  • Runs in CI/CD, containers, and ephemeral environments

Context & Evolution (History Belongs Here)

Ken Thompson’s original grep (early 1970s) used Thompson NFA construction to guarantee linear-time regex matching. As codebases and log volumes exploded, tools evolved from single-threaded grep to smarter, parallel, SIMD-accelerated tools like ag and ripgrep. The key shifts: finite automata over backtracking, literal prefilters, and parallel filesystem traversal.

Evolution Timeline:

1973: grep       (single-threaded, NFA)
2005: ack        (code-aware, Perl regex)
2011: ag         (C + mmap + pthreads + .gitignore)
2016: ripgrep    (Rust + SIMD + lock-free parallelism)
2020s: modern    (NVMe-aware, Unicode-correct, SIMD everywhere)

The Performance Gap (What You’ll Learn to Explain)

Naive string search:      O(n×m) worst case
Boyer-Moore:              O(n/m) best case (sublinear!)
Thompson NFA regex:       O(n×m) worst case, guaranteed (never exponential)
DFA regex:                O(n) but state explosion risk
Backtracking regex:       Exponential worst case (catastrophic)
SIMD literal scan:        16-32 bytes per compare (hardware parallelism)

Understanding why these gaps exist is the entire point of this guide.


Prerequisites & Background Knowledge

Essential Prerequisites (Must Have)

Before diving into these projects, you should have:

  1. Solid C or Rust Programming: You’ll be working with low-level details, pointers, and manual memory management
  2. Basic Algorithm Analysis: Understanding Big-O notation, worst/average/best case complexity
  3. String Fundamentals: What a string is in memory (array of bytes), null termination, UTF-8 basics
  4. Command-Line Comfort: Familiarity with grep, basic regex syntax, shell pipes
  5. Debugging Skills: Using gdb/lldb or rust debugger to step through code

Helpful But Not Required

These concepts will be learned through the projects:

  • SIMD programming (SSE2/AVX2 intrinsics)
  • Automata theory (NFA, DFA, finite state machines)
  • Advanced regex engines (Thompson construction, subset construction)
  • Parallel programming (pthreads, work-stealing, lock-free data structures)
  • Systems programming (mmap, read syscalls, page faults)

Self-Assessment Questions

Stop and answer these before starting. If you can’t answer 80%, review prerequisites first:

  • Can you explain what O(n×m) means for string matching?
  • Do you know the difference between a pointer and a value?
  • Can you write a basic string search in C (even if naive)?
  • Do you understand what grep "pattern" file does at a high level?
  • Can you explain what SIMD means conceptually (even if you haven’t used it)?
  • Do you know what a regex is and have you used basic patterns (.*, \w+, [a-z])?
  • Have you used a debugger to step through code line-by-line?
  • Can you read x86 assembly at a basic level (mov, cmp, jmp)?

If you answered “no” to more than 2, spend time with these first:

  • For C/pointers: “C Programming Language” by K&R, chapters 1-6
  • For algorithms: “Algorithms, Fourth Edition” by Sedgewick & Wayne, Ch. 1-3
  • For debugging: “The Art of Debugging with GDB, DDD, and Eclipse”

Development Environment Setup

Required Tools:

  • C Compiler: gcc or clang with optimization flags (-O3, -march=native)
  • Rust (for later projects): rustc 1.70+, cargo
  • Debugger: gdb (Linux), lldb (macOS)
  • Profiler: perf (Linux), Instruments (macOS), or valgrind
  • Text Editor/IDE: Any with syntax highlighting

Recommended Tools:

  • Benchmarking: hyperfine for comparing performance
  • SIMD Reference: Intel Intrinsics Guide (online)
  • Git: For version control and testing on large repos
  • Large test files: Generate with base64 /dev/urandom | head -c 100M > test.txt

System Requirements:

  • x86_64 CPU with SSE2 (almost all modern CPUs)
  • AVX2 for advanced SIMD projects (Intel Haswell+ or AMD Excavator+)
  • 8GB+ RAM for building large projects
  • Linux or macOS (Windows with WSL works too)

Time Investment

Per-project estimates (for someone with prerequisites):

Phase                  Projects  Total Time  Skill Level Required
Phase 1 (Algorithms)   1-3       2-4 weeks   Beginner → Advanced
Phase 2 (Regex)        4-6       4-6 weeks   Advanced → Expert
Phase 3 (SIMD)         7-9       4-6 weeks   Expert → Master
Phase 4 (I/O)          10-12     3-4 weeks   Advanced → Expert
Phase 5 (Fuzzy)        13-14     3-5 weeks   Advanced
Phase 6 (Integration)  15        2-3 months  Master

Total commitment: 4-8 months of dedicated part-time work (10-15 hours/week)

Realistically:

  • Working through all 15 projects sequentially is a major undertaking
  • Most people should pick 3-5 projects that interest them
  • The capstone (Project 15) requires completing at least Projects 1, 4, 7, 10

Important Reality Check

This is HARD. These projects will make you struggle. That’s the point.

You will:

  • Get stuck on subtle bugs for hours
  • Rewrite code multiple times before it works
  • See performance that’s worse than expected initially
  • Feel frustrated when SIMD code crashes mysteriously

This is normal. Professional developers building tools like ripgrep went through the same process. The difference between giving up and succeeding is persisting through the frustration.

Tips for success:

  1. Start with working (even if slow) code, then optimize
  2. Test correctness before optimizing for speed
  3. Use assert() liberally to catch bugs early
  4. Compare your output with existing tools (grep, rg) on the same input
  5. Read the source code of real tools when stuck
  6. Take breaks when frustrated—your subconscious will work on the problem

Big Picture / Mental Model

Text search tools are pipelines. Every speedup comes from improving one stage without breaking the contract of the whole pipeline.

                      +------------------------+
Query & Flags ------> |  Pattern Parsing       |
(regex/literal/fuzzy) |  + Normalization       |
                      +-----------+------------+
                                  |
                                  v
                      +-----------+------------+
                      | Optimization Layer     |
                      | - Literal extraction   |
                      | - Prefilter selection  |
                      | - DFA/NFA compilation  |
                      +-----------+------------+
                                  |
                                  v
+------------------+     +---------+--------+     +------------------+
| Directory Walker | --> | I/O Strategy     | --> | Execution Engine |
| - ignore rules   |     | - read vs mmap   |     | - SIMD scan      |
| - file types     |     | - buffer sizing  |     | - regex engine   |
+------------------+     +---------+--------+     +--------+---------+
                                   |                       |
                                   v                       v
                          +--------+-------+      +--------+---------+
                          | Line & Context |      | Output Formatter |
                          | - line index   |      | - colors, stats  |
                          | - context      |      | - paths, offsets |
                          +--------+-------+      +--------+---------+
                                   \_______________________/
                                               |
                                               v
                                          Final Results

Mental model to keep in your head:

  • Traversal decides what files you even consider.
  • I/O strategy decides how fast bytes enter your process.
  • Execution engine decides how fast you can scan those bytes.
  • Line/context/output decides how much extra work you do after a match.

Appendix A: Tool Landscape & Performance Notes

The following sections are optional context. Skim them now or return after the Theory Primer.

The Landscape: Tool Comparison

Tool                      Year  Language  Regex Engine                       Key Innovation
grep                      1974  C         Hand-rolled DFA/NFA + Boyer-Moore  Original, Unix philosophy
egrep                     1976  C         DFA                                First deterministic regex
GNU grep                  1988  C         Hybrid DFA/NFA + Boyer-Moore       Tuned Boyer-Moore
ack                       2005  Perl      Perl regex                         First “code-aware” grep
The Silver Searcher (ag)  2011  C         PCRE (JIT)                         mmap + pthreads + gitignore
fzf                       2013  Go        Custom fuzzy                       Interactive + worker goroutines
ripgrep (rg)              2016  Rust      regex-automata (DFA)               Teddy SIMD + parallel + Unicode
fd                        2017  Rust      regex                              Parallel traversal + gitignore

Why Performance Differs: The Core Factors

1. Regex Engine Architecture (Biggest Factor)

Backtracking (NFA) Engines: PCRE, Perl, Python, Java, .NET

  • Can express everything (backreferences, lookahead)
  • Catastrophic backtracking: Exponential time on pathological patterns
  • Example: (a+)+$ on aaaaaaaaaaaaaaaaaaaaaaaaaaaa! takes SECONDS

Finite Automata (DFA) Engines: RE2, Rust regex, GNU grep (partial)

  • Linear time guaranteed: O(n) where n = input length
  • Cannot do backreferences, but who uses those in grep?
  • Key insight: “Only ever in one state at a time”

2. Literal Optimization (The Secret Weapon)

Most “regex” searches are actually literal strings or have literal prefixes:

  • TODO → literal
  • function.*async → literal prefix “function”
  • \berror\b → literal “error” + boundary check

The Strategy: Find the literal fast (SIMD), THEN run the regex engine.

3. SIMD Vectorization

Modern CPUs can compare 16-32 bytes simultaneously:

  • SSE2: 128-bit registers (16 bytes at once)
  • AVX2: 256-bit registers (32 bytes at once)

SIMD lets you compare 16-32 bytes per instruction; real throughput is often limited by memory bandwidth.

4. Parallelization

  • ag: pthreads, up to 8 workers
  • ripgrep: Lock-free parallel directory traversal via crossbeam
  • fzf: Worker goroutines for matching

5. I/O Strategy

  • mmap: Map file directly into memory (good for single large files)
  • read + buffer: Traditional buffered I/O (good for many small files)
  • Adaptive: ripgrep chooses automatically

6. Smart Filtering

  • Respect .gitignore (skip node_modules, build artifacts)
  • Skip binary files
  • Skip hidden files by default


Theory Primer (Read This Before Coding)

This section is the mini-book that turns the projects into deliberate practice. If you only read one thing before coding, read this. Each chapter is a mental model you will use repeatedly while building and profiling your tools.

Chapter 1: String Matching Foundations (Naive Search + Complexity)

Fundamentals

String matching is the problem of finding every occurrence of a pattern (the needle) inside a text (the haystack). At the machine level, this is nothing more than comparing bytes. The naive approach simply tries every possible alignment: for each position in the text, compare the next m bytes to the m-byte pattern. This yields a simple and correct algorithm, but it can do an enormous amount of redundant work. The key thing to internalize is that the naive algorithm ignores information. When it compares text[i] and pattern[0], and then later compares text[i+1] and pattern[0], it is re-learning the same facts about the text. Every faster algorithm is a more disciplined way of remembering and reusing this information.

The complexity model used throughout this guide is the standard RAM model with n = len(text) and m = len(pattern). The naive algorithm is worst-case O(n×m), and the worst case happens on inputs that almost match repeatedly, such as pattern aaaaab against text aaaaaaaaaaaaaaaaab. In practice, the naive method can be very fast when m is small and the CPU cache is hot because it is branch-predictable and sequential. This is why standard libraries often use naive or near-naive implementations for very short patterns.

Understanding this baseline gives you a reference point for every optimization you will build. You should be able to estimate, before running code, how many comparisons are done, where the hot loops are, and how cache lines move through the CPU. That is what makes a baseline valuable: it is not just “slow”; it is explainably slow.

Deep Dive into the Concept

At a deeper level, string matching is about information flow. Every time the naive algorithm compares characters, it receives one bit of information: equal or not equal. If the algorithm does not store any structure derived from these bits, it must re-discover them at the next alignment. The naive algorithm uses only two pieces of state: the current position in the text and the current position in the pattern. That is why its worst-case behavior is quadratic. It has no memory of partial matches beyond the current alignment.

Consider the alignment of pattern at position i. If the first k characters match and then a mismatch happens at k+1, the naive algorithm throws away the fact that the prefix pattern[0..k] matched text[i..i+k]. But that information is not useless. It tells you about the structure of the text around i and the internal structure of the pattern. Algorithms like KMP and Aho-Corasick exploit exactly this idea by building a failure function or automaton that keeps track of the longest proper prefix that is also a suffix. Even when you do not implement KMP in this guide, you need to know that it exists, because it explains why linear time is possible.
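
To make "remembering partial matches" concrete, here is a minimal sketch of KMP's failure function (KMP itself is not one of this guide's projects; the sketch only shows the structure that makes linear time possible):

#include <stdio.h>

/* fail[k] = length of the longest proper prefix of pat[0..k] that is
   also a suffix of it -- the "remembered partial match". */
void build_failure(const char *pat, int m, int *fail) {
    fail[0] = 0;
    for (int k = 1, len = 0; k < m; k++) {
        while (len > 0 && pat[k] != pat[len])
            len = fail[len - 1];            /* fall back to a shorter border */
        if (pat[k] == pat[len]) len++;      /* extend the current border */
        fail[k] = len;
    }
}

int main(void) {
    int fail[6];
    build_failure("ababab", 6, fail);
    for (int i = 0; i < 6; i++) printf("%d ", fail[i]); /* 0 0 1 2 3 4 */
    printf("\n");
    return 0;
}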

Another subtlety is overlapping matches. Suppose pattern ana occurs in banana at positions 1 and 3. A naive algorithm with a “jump to end of match” strategy will miss the second match. Correct matching must treat overlaps carefully. This matters later for regex engines and Aho-Corasick, where matches can overlap in complex ways. For correctness, you must define: do you return all matches, first match, or leftmost-longest? Each choice changes the algorithm and its output semantics.

Finally, note that the naive algorithm’s real cost is not just comparison count. It is also branch misprediction and memory bandwidth. In random text, mismatches happen quickly and branches are predictable. In worst-case text, branches are unpredictable and inner loops do much more work, causing pipeline stalls. This is why many optimized algorithms become branchless and rely on SIMD; they trade a few extra operations for a predictable pipeline.

In short: the naive algorithm is a benchmark, a mental model, and a correctness oracle. Every project uses it as a reference, either explicitly or implicitly.

How This Fits in Projects

  • Project 1 uses the naive algorithm directly as the baseline you will measure everything against.
  • Projects 2 and 3 exist because naive search does not reuse information across alignments.
  • Projects 4-6 build on the same idea, but for regex patterns instead of fixed strings.

Definitions & Key Terms

  • Alignment: A proposed position of the pattern over the text.
  • Overlap: When two matches share text positions.
  • Worst case: Inputs that maximize the number of comparisons.
  • Prefix / Suffix: Beginning or ending substring of a string.
  • Failure function: A rule that says how far to shift after a mismatch.

Mental Model Diagram

Text:    a a a a a a a a b
Pattern: a a a a a b

Alignment i=0:  a a a a a b
                ^ ^ ^ ^ ^ x   five matches, then mismatch at the last char

Alignment i=1:    a a a a a b
                  ^ ^ ^ ^ ^ x   same prefix re-checked, mismatch again

Naive: compare almost everything again.
Optimized: reuse prefix/suffix knowledge to skip.

How It Works (Step-by-Step)

  1. For each position i from 0 to n-m, attempt to match.
  2. Compare pattern[j] to text[i+j] for j from 0 to m-1.
  3. If a mismatch occurs, break and advance i by one.
  4. If all m characters match, record a match at i and continue.

Invariant: At each alignment i, the algorithm has verified pattern[0..k] == text[i..i+k] for the current k.

Failure mode: On highly repetitive input, the same prefix is rechecked many times.

Minimal Concrete Example

#include <string.h>

// Naive search: returns index of first match, or -1 if absent
int naive_find(const char *text, const char *pat) {
    int n = strlen(text), m = strlen(pat);
    if (m == 0) return 0;                  // empty pattern matches at 0
    for (int i = 0; i <= n - m; i++) {     // try every alignment
        int j = 0;
        while (j < m && text[i + j] == pat[j]) j++;
        if (j == m) return i;              // all m pattern bytes matched
    }
    return -1;
}
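
The function above stops at the first match. A sketch of the all-matches variant discussed earlier, which advances by one position after every hit so that overlaps are reported:

#include <stdio.h>
#include <string.h>

// Report every match, including overlapping ones.
void naive_find_all(const char *text, const char *pat) {
    int n = strlen(text), m = strlen(pat);
    if (m == 0) return;
    for (int i = 0; i + m <= n; i++)
        if (memcmp(text + i, pat, m) == 0)
            printf("match at %d\n", i);
}

int main(void) {
    naive_find_all("banana", "ana");  /* reports 1 and 3 (overlapping) */
    return 0;
}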

Common Misconceptions

  • “Naive is always slow” → It can be extremely fast for short patterns and random text.
  • “Worst case rarely matters” → In text tools, adversarial inputs (logs, repetitive data) are common.
  • “Overlaps are trivial” → Overlapping matches are easy to miss if you shift too far.

Check-Your-Understanding Questions

  1. Why does aaaaab in aaaaaaaaaab cause many comparisons?
  2. How many alignments are tested for text length n and pattern length m?
  3. What happens if the pattern is empty?
  4. How does naive search handle overlapping matches?

Check-Your-Understanding Answers

  1. Because the prefix matches at almost every position before failing at the last character.
  2. n-m+1 alignments are tested.
  3. Most definitions return 0 (the empty pattern matches at position 0).
  4. It will still find overlaps because it advances by one each time.

Real-World Applications

  • Baseline for every substring search implementation.
  • Simple pattern checking in parsers and config loaders.
  • Fast searches in small buffers (e.g., protocol headers).

Where You’ll Apply It

  • Project 1 (Naive baseline)
  • As correctness oracle in Projects 2-15

References

  • “Algorithms, Fourth Edition” by Sedgewick & Wayne — String searching chapter
  • “Algorithms in C” by Sedgewick — String processing sections
  • “Computer Systems: A Programmer’s Perspective” — performance implications of loops

Key Insight

The naive algorithm is not just a baseline; it is your correctness oracle and performance yardstick.

Summary

You now understand the baseline: a correct, straightforward algorithm with worst-case quadratic behavior. Every other project is an attempt to avoid re-checking the same characters by storing knowledge about the pattern or the text. Always start here when debugging or benchmarking.

Homework / Exercises to Practice the Concept

  1. Implement naive search that returns all matches (including overlaps).
  2. Generate a worst-case input for a given pattern and measure comparisons.
  3. Modify naive search to count character comparisons precisely.

Solutions to the Homework / Exercises

  1. After each match, increment i by 1 instead of m.
  2. For pattern a...ab, use text a...ab repeated with one extra a.
  3. Add a counter increment inside the inner comparison loop.

Chapter 2: Skip-Based Sublinear Search (Boyer-Moore Family)

Fundamentals

Boyer-Moore is the archetypal example of “search faster by skipping.” Instead of scanning left-to-right, it aligns the pattern with the text and compares from right to left. When a mismatch occurs, it uses precomputed tables to skip the pattern forward by more than one position. The two classic heuristics are the bad character rule (how far to shift based on the mismatched character) and the good suffix rule (how far to shift based on matched suffixes). The result is a search algorithm that is sublinear on average: it can skip entire segments of the text without reading them.

The core mental model: if you know a character in the text does not appear in the pattern (or appears only in certain positions), you can move the pattern forward enough that the mismatched character lines up with its last possible occurrence. The right-to-left scan makes this possible because you have already compared the suffix of the pattern at the time of the mismatch.

In real-world tools, Boyer-Moore is often used for literal matching (exact strings) or as a fast path before more expensive checks. GNU grep and many standard libraries use Boyer-Moore or a variant like Boyer-Moore-Horspool when the pattern is long enough to benefit from skipping.

Deep Dive into the Concept

The bad character rule is the simplest: when pattern[j] != text[i+j], shift the pattern so that the rightmost occurrence of text[i+j] in the pattern aligns with text[i+j]. If the character does not appear in the pattern, you can shift the pattern beyond that character entirely. This is precomputed as a table of size equal to the alphabet (typically 256 for bytes). The cost of building this table is O(m + |alphabet|) and pays for itself when the pattern is used many times.

The good suffix rule is more subtle. If you have matched a suffix S of the pattern and then hit a mismatch, you want to shift the pattern so that another occurrence of S aligns with the same text substring. If S does not occur elsewhere in the pattern, you instead align the longest suffix of S that is also a prefix of the pattern. This requires computing border arrays (prefix-suffix overlaps). This rule captures information about the internal structure of the pattern, which the bad character rule cannot.

The magic of Boyer-Moore is that these shifts are safe: they never skip a possible match. The algorithm is correct because every shift preserves the invariant that any skipped alignment could not have matched based on the mismatch evidence you already have. This is a common pattern in algorithm design: use a mismatch as a certificate that a range of alignments is impossible.

In practice, many implementations use the Horspool variant, which drops the good suffix rule and uses only a simplified bad character shift based on the last character of the pattern. This is much simpler to implement and still fast for many workloads, especially when the alphabet is large and the pattern is reasonably long. However, for adversarial or highly repetitive patterns, the full Boyer-Moore algorithm can be dramatically better.

How This Fits in Projects

  • Project 2 is a direct implementation of Boyer-Moore with both heuristics.
  • Project 6 (Literal Extraction) often feeds literals into Boyer-Moore-like scanners.
  • Project 15 (ripgrep clone) uses skip-based ideas in literal prefilters.

Definitions & Key Terms

  • Bad character rule: shift pattern based on mismatched character.
  • Good suffix rule: shift pattern based on matched suffix.
  • Border: a substring that is both a prefix and suffix.
  • Horspool: simplified Boyer-Moore using only one shift rule.

Mental Model Diagram

Text:    a b c x a b c a b c
Pattern:         a b c a b d
                           ^ compare right-to-left: text 'c' vs pattern 'd'

Bad character rule:
  mismatched text char = 'c'
  rightmost 'c' in the pattern is at index 2
  shift the pattern so that index 2 lines up with that text 'c'

How It Works (Step-by-Step)

  1. Precompute bad-character shift table.
  2. Precompute good-suffix shift table.
  3. Align pattern with text at position i.
  4. Compare from right to left until mismatch or match.
  5. If mismatch at j, compute shift = max(bad_char, good_suffix).
  6. Advance i by shift and repeat.

Invariant: Shifts never skip a valid alignment.

Failure mode: If tables are computed incorrectly, the algorithm can miss matches.

Minimal Concrete Example

// Horspool-style bad-character table for bytes: table[c] holds the safe
// shift when text byte c sits under the pattern's last position.
void build_bad_char(const char *pat, int m, int table[256]) {
    for (int i = 0; i < 256; i++) table[i] = m;   // absent bytes: shift past them
    for (int i = 0; i < m - 1; i++)               // exclude last char so shifts stay > 0
        table[(unsigned char)pat[i]] = m - 1 - i;
}
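
Given that table, a minimal Horspool-style scan loop might look like the following sketch (it assumes the build_bad_char above and a non-empty pattern):

// Compare right to left; on mismatch, shift by the table entry for the
// text byte currently under the pattern's last position.
int horspool_find(const char *text, int n, const char *pat, int m) {
    int table[256];
    build_bad_char(pat, m, table);
    int i = 0;
    while (i <= n - m) {
        int j = m - 1;
        while (j >= 0 && text[i + j] == pat[j]) j--;  /* right-to-left compare */
        if (j < 0) return i;                          /* full match */
        i += table[(unsigned char)text[i + m - 1]];   /* provably safe skip */
    }
    return -1;
}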

Common Misconceptions

  • “Boyer-Moore is always faster” → It can be worse for very short patterns.
  • “Good suffix is optional” → Optional for speed, but it matters in repetitive patterns.
  • “Skipping is unsafe” → The shifts are provably safe due to mismatch evidence.

Check-Your-Understanding Questions

  1. Why does scanning from right to left enable bigger skips?
  2. What happens if the mismatched character does not appear in the pattern?
  3. How does the good suffix rule differ from the bad character rule?

Check-Your-Understanding Answers

  1. Because you already matched a suffix, you can use its structure to shift safely.
  2. You can shift the pattern past the mismatched character entirely.
  3. The good suffix rule uses matched suffix structure, not the mismatched character.

Real-World Applications

  • Literal search in grep, ag, and ripgrep fast paths.
  • Fast scanning in parsers and file format detectors.

Where You’ll Apply It

  • Project 2 (Boyer-Moore)
  • Project 6 (literal extraction + prefilter scanning)

References

  • Boyer and Moore, “A fast string searching algorithm” (1977)
  • “Algorithms on Strings, Trees, and Sequences” by Dan Gusfield — Boyer-Moore chapter
  • “Algorithms, Fourth Edition” by Sedgewick — substring search variants

Key Insight

Boyer-Moore is about turning a mismatch into a proof that you can skip ahead.

Summary

Boyer-Moore shows that substring search can be sublinear by exploiting information gathered at mismatches. This idea recurs in literal prefilters, SIMD scanning, and even regex optimization.

Homework / Exercises to Practice the Concept

  1. Implement Horspool and compare it to full Boyer-Moore on short vs long patterns.
  2. Visualize shifts for a pattern with repeated prefixes.
  3. Create a pattern where good suffix dramatically helps.

Solutions to the Homework / Exercises

  1. Horspool wins on simplicity, full BM wins on repetitive patterns.
  2. Track suffix borders and show how good suffix avoids small shifts.
  3. Use pattern like ABABABX against text with repeated ABABAB.

Chapter 3: Multi-Pattern Automata (Aho-Corasick)

Fundamentals

Aho-Corasick solves the problem of matching many patterns at once. Instead of running a separate search for each pattern, it builds a trie of all patterns and augments it with failure links. The resulting automaton processes the text in a single pass, emitting matches whenever a pattern ends at the current position. This is the algorithm behind fgrep and many malware scanners, spam filters, and intrusion detection systems.

The key insight is that a trie already encodes all prefixes of all patterns. When a mismatch occurs, you do not restart from the root; you follow a failure link to the longest proper suffix that is also a prefix. This turns multiple searches into one linear scan. The runtime is O(n + m + z) where n is the text length, m is the total length of all patterns, and z is the number of matches.

Deep Dive into the Concept

The automaton has three parts: the trie transitions, the failure links, and the output function. The trie transitions are straightforward: for each node and character, you have either a next node or you fall back. The failure link from a node points to the longest suffix of that node’s string that is also a prefix in the trie. This is essentially the same idea as KMP’s failure function, generalized to a trie.

When scanning the text, you maintain a current state in the trie. For each input character, you follow a transition if it exists; otherwise you follow failure links until you can transition or you return to the root. Each time you reach a state, you emit all patterns that end at that state or any of its failure-linked suffix states. This is how overlapping and nested matches are handled automatically. For example, if patterns include “he”, “she”, and “hers”, the automaton can emit all relevant matches when scanning “ushers” without any backtracking.

A common implementation detail is the choice between sparse transitions (hash map or sorted vector) vs dense transitions (array of size 256). Dense transitions are fast but memory-heavy; sparse transitions are memory-friendly but slower. Many production implementations use a hybrid: build a sparse trie, then generate a dense transition table for hot states.

The algorithm is also used for virus signatures and network filtering. It scales to thousands or millions of patterns. That is why this project is more than a toy: it is a building block of real systems.

How This Fits in Projects

  • Project 3 is the direct implementation of Aho-Corasick.
  • Project 6 can use Aho-Corasick as a literal prefilter for regex sets.
  • Project 15 can use multi-pattern scanning for multiple literals.

Definitions & Key Terms

  • Trie: Tree that stores strings by shared prefixes.
  • Failure link: Edge to the longest suffix that is also a prefix.
  • Output function: The set of patterns that end at a state.
  • Alphabet: Set of possible characters (often bytes).

Mental Model Diagram

Trie (patterns: he, she, his, hers)
root
 ├─ h ─┬─ e* ─ r ─ s*
 │     └─ i ─ s*
 └─ s ─ h ─ e*

Failure links connect states to longest suffix that is also a prefix.

How It Works (Step-by-Step)

  1. Insert all patterns into a trie; mark terminal nodes.
  2. Build failure links using BFS from the root.
  3. For each node, merge output sets from its failure link.
  4. Scan text one character at a time, following transitions.
  5. Emit all patterns in the current node’s output set.

Invariant: The current node always represents the longest suffix of the processed text that is also a prefix in the trie.

Failure mode: Incorrect failure links cause missed matches or infinite loops.

Minimal Concrete Example

// Pseudocode for scanning
state = root
for c in text:
    while state != root and no edge(state, c):
        state = fail[state]
    if edge(state, c): state = edge(state, c)
    for pat in output[state]: report match
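
For readers who want something compilable, here is a minimal dense-table sketch in C. The BFS both sets failure links and rewrites missing edges as failure transitions, so the scan loop needs no explicit fallback. MAXS and the bitmask output format are arbitrary choices for the sketch:

#include <stdio.h>

#define MAXS 1024       /* total trie states; an arbitrary cap for this sketch */

int go_[MAXS][256];     /* trie edges; after build_links, a full transition table */
int fail_[MAXS];
unsigned out_[MAXS];    /* bitmask of pattern indexes ending at each state */
int nstates = 1;        /* state 0 is the root */

void insert_pattern(const char *pat, int idx) {
    int s = 0;
    for (; *pat; pat++) {
        unsigned char c = (unsigned char)*pat;
        if (!go_[s][c]) go_[s][c] = nstates++;
        s = go_[s][c];
    }
    out_[s] |= 1u << idx;
}

void build_links(void) {
    int queue[MAXS], head = 0, tail = 0;
    for (int c = 0; c < 256; c++)
        if (go_[0][c]) queue[tail++] = go_[0][c];   /* depth-1 states fail to root */
    while (head < tail) {
        int s = queue[head++];
        out_[s] |= out_[fail_[s]];                  /* inherit matches from suffix state */
        for (int c = 0; c < 256; c++) {
            int t = go_[s][c];
            if (t) { fail_[t] = go_[fail_[s]][c]; queue[tail++] = t; }
            else   { go_[s][c] = go_[fail_[s]][c]; }  /* missing edge becomes failure move */
        }
    }
}

void scan(const char *text) {
    int s = 0;
    for (int i = 0; text[i]; i++) {
        s = go_[s][(unsigned char)text[i]];
        if (out_[s]) printf("match ends at index %d (pattern mask %u)\n", i, out_[s]);
    }
}

int main(void) {
    insert_pattern("he", 0); insert_pattern("she", 1); insert_pattern("hers", 2);
    build_links();
    scan("ushers");   /* "he" and "she" end at index 3; "hers" ends at index 5 */
    return 0;
}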

Common Misconceptions

  • “Aho-Corasick is only for dictionaries” → It is for any multi-pattern set.
  • “Failure links are backtracking” → They are deterministic transitions, not backtracking.
  • “Memory usage is unavoidable” → You can compress transitions or use sparse maps.

Check-Your-Understanding Questions

  1. Why is the runtime linear in the text length?
  2. What does a failure link represent in the trie?
  3. How do overlapping patterns get detected?

Check-Your-Understanding Answers

  1. Each input character triggers a bounded number of transitions; the scan never rewinds.
  2. The longest proper suffix that is also a prefix in the trie.
  3. Output sets include patterns from the current state and its suffix states.

Real-World Applications

  • Malware signature scanning
  • Spam keyword detection
  • Multiple keyword search in logs

Where You’ll Apply It

  • Project 3 (Aho-Corasick)
  • Optional prefilter in Project 15

References

  • Aho & Corasick, “Efficient string matching: An aid to bibliographic search” (1975)
  • “Algorithms, Fourth Edition” — multi-pattern searching
  • “Algorithms in C” — trie and automaton sections

Key Insight

Aho-Corasick turns many searches into one automaton scan.

Summary

By building a trie with failure links, you get a deterministic automaton that finds all patterns in one pass. This is the multi-pattern analog of KMP and a core tool for real systems.

Homework / Exercises to Practice the Concept

  1. Implement Aho-Corasick for a small alphabet and compare memory usage of dense vs sparse transitions.
  2. Add reporting of overlapping matches and verify with a test corpus.
  3. Add case-insensitive matching by normalizing input.

Solutions to the Homework / Exercises

  1. Dense transitions use 256 * states; sparse transitions use only existing edges.
  2. Output sets must include suffix matches via failure links.
  3. Normalize both pattern and input to lower case at insertion and scan.

Chapter 4: Regex Engines and Automata (NFA, DFA, and Literal Extraction)

Fundamentals

Regular expressions are patterns over strings. The central design choice in a regex engine is backtracking vs automata. Backtracking engines (PCRE, Perl, many language runtimes) are expressive but can explode exponentially on certain inputs. Automata-based engines (Thompson NFA, RE2, Rust regex) compile the regex into a state machine that guarantees linear-time matching, at the cost of giving up features like backreferences. Modern text search tools prioritize predictable performance over maximal expressiveness, because worst-case performance can become a denial-of-service vector.

Thompson’s construction turns a regex into an NFA by composing small fragments (literal, concatenation, alternation, and Kleene star). An NFA can be simulated in linear time by tracking the set of active states after each character. A DFA is obtained by subset construction: each DFA state represents a set of NFA states. The DFA can be faster at runtime but can also explode in size. Many tools use a hybrid: build an NFA, lazily build DFA states on demand, and fall back to NFA when the DFA grows too large.

Literal extraction is the practical bridge between regex theory and performance. If you can extract a literal substring that must appear in any match, you can prefilter with a fast literal search (often SIMD) and only run the full regex engine on candidate regions. This is why ripgrep is fast even with regex patterns: it does not run the full engine everywhere.

Deep Dive into the Concept

A regex is first parsed into an abstract syntax tree (AST). Operators like concatenation and alternation build tree nodes. Thompson’s construction then maps each node into an NFA fragment with one start and one accept state, with epsilon (empty) transitions connecting them. The resulting NFA has a number of states proportional to the regex length. During matching, the engine maintains a set of active states. For each input character, it computes the epsilon-closure (all states reachable via epsilon transitions) and then the set of transitions on that character.

The NFA simulation can be optimized with bitsets when the number of states is small, or with sparse sets when it is large. The critical invariant is that the set of active states after processing i characters represents all possible positions in the regex after consuming those i characters. There is no backtracking; every possible path is tracked simultaneously.

DFA construction is a power-set transform. In practice, constructing the entire DFA can be impossible for large patterns because the number of states can be exponential. A lazy DFA builds states only as needed for the input. If the DFA grows too big, the engine can fall back to the NFA simulation or use a hybrid strategy. This is what Rust’s regex ecosystem does, giving you near-DFA speed without catastrophic memory blowups.

Literal extraction is conceptually simple but practically subtle. You need to analyze the regex to find substrings that are guaranteed to appear in any match. For example, in foo.*bar, both foo and bar are literals, but only foo is guaranteed to appear before bar. The engine may choose the rarest literal for prefiltering. When a regex contains alternation, the set of required literals can be computed via intersection of literals from each branch. The more precise the extraction, the fewer candidate regions the regex engine must examine.
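
A sketch of the prefilter idea for foo.*bar, with strstr standing in for the SIMD literal scan and a cheap check standing in for the regex engine, which only runs from each candidate position onward:

#include <stdio.h>
#include <string.h>

// Prefiltered search for foo.*bar (sketch; a real engine would verify
// with its compiled automaton rather than a second strstr).
void search_foo_dotstar_bar(const char *hay) {
    for (const char *p = strstr(hay, "foo"); p; p = strstr(p + 1, "foo")) {
        if (strstr(p + 3, "bar"))                 /* verify the rest of the pattern */
            printf("match starts at offset %ld\n", (long)(p - hay));
    }
}

int main(void) {
    search_foo_dotstar_bar("xx foo yy bar zz foo zz");  /* only the first foo matches */
    return 0;
}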

Finally, regex engines must define match semantics: leftmost-longest vs leftmost-first, greedy vs lazy quantifiers, and how to handle Unicode character classes. Text search tools typically use leftmost-first with greedy quantifiers but enforce linear-time behavior by disallowing backreferences and look-around features that would require backtracking.

How This Fits in Projects

  • Project 4 (NFA regex engine) implements Thompson construction.
  • Project 5 (lazy DFA) implements subset construction on demand.
  • Project 6 (literal extraction) builds prefilters used by the engine.
  • Project 15 integrates all three for performance.

Definitions & Key Terms

  • Regex AST: Parse tree representing regex structure.
  • NFA: Nondeterministic finite automaton (multiple possible states).
  • DFA: Deterministic finite automaton (single state at a time).
  • Epsilon transition: Transition that consumes no input.
  • Leftmost-first: Match semantics where the earliest match is preferred.
  • Literal extraction: Finding required substrings in a regex.

Mental Model Diagram

Regex: a(b|c)*d

Thompson NFA (conceptual):
  (a) -> split -> (b) -> loop -> (d)
           \-> (c) -/

At runtime: track all possible states simultaneously (no backtracking).

How It Works (Step-by-Step)

  1. Parse regex into AST.
  2. Convert AST to NFA using Thompson construction.
  3. Optionally compile lazy DFA states from NFA.
  4. Extract required literals for prefiltering.
  5. Scan input: prefilter hits trigger full regex verification.

Invariant: Active NFA states always represent all possible parse positions.

Failure mode: Incorrect literal extraction can drop valid matches.

Minimal Concrete Example

Regex: ab*c
Input: abbbc

NFA states:
S0 -a-> S1
S1 -b-> S1 (loop)
S1 -c-> S2 (accept)

Scan: a -> {S1}, b -> {S1}, b -> {S1}, b -> {S1}, c -> {S2}
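
The same three-state machine, hand-coded as a bitset simulation (a sketch with the transitions hard-coded; a real engine derives them from the regex AST):

#include <stdio.h>

int match_ab_star_c(const char *input) {
    unsigned active = 1u << 0;                      /* start in S0 */
    for (const char *p = input; *p; p++) {
        unsigned next = 0;
        if ((active & (1u << 0)) && *p == 'a') next |= 1u << 1;  /* S0 -a-> S1 */
        if ((active & (1u << 1)) && *p == 'b') next |= 1u << 1;  /* S1 -b-> S1 */
        if ((active & (1u << 1)) && *p == 'c') next |= 1u << 2;  /* S1 -c-> S2 */
        active = next;
        if (!active) return 0;                      /* no live states: fail early */
    }
    return (active & (1u << 2)) != 0;               /* accept if S2 is active */
}

int main(void) {
    printf("%d\n", match_ab_star_c("abbbc"));  /* 1 */
    printf("%d\n", match_ab_star_c("ac"));     /* 1: b* matches zero b's */
    printf("%d\n", match_ab_star_c("abb"));    /* 0: never reaches accept */
    return 0;
}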

Common Misconceptions

  • “DFA is always better” → DFA can explode in size for complex regexes.
  • “Backtracking is required for regex” → Regular expressions can be matched by automata.
  • “Literal extraction is optional” → It is the main performance lever in regex search tools.

Check-Your-Understanding Questions

  1. Why can backtracking be exponential in the worst case?
  2. What does a DFA state represent in terms of the NFA?
  3. Why does literal extraction speed up regex search?

Check-Your-Understanding Answers

  1. Because backtracking explores many overlapping paths for ambiguous regexes.
  2. A DFA state is a set of NFA states.
  3. It filters the text to only candidate regions that must contain the literal.

Real-World Applications

  • Fast regex in ripgrep, grep, RE2
  • Log filtering, security scanning, and code search

Where You’ll Apply It

  • Projects 4-6 and the regex pipeline in Project 15

References

  • Russ Cox, “Regular Expression Matching Can Be Simple And Fast”
  • RE2 documentation (linear-time guarantee)
  • “Compilers: Principles, Techniques, and Tools” (the Dragon Book) — regex to automata

Key Insight

Regex performance is about automata, not backtracking.

Summary

Regex engines can be fast and safe if they use automata. Thompson NFA gives linear-time guarantees; lazy DFA gives speed without state explosion; literal extraction reduces work. These are the core ideas behind production regex tools.

Homework / Exercises to Practice the Concept

  1. Build a Thompson NFA for a regex with alternation and star.
  2. Implement a tiny NFA simulation using sets of states.
  3. Extract literals from a regex with alternation and compare to ground truth.

Solutions to the Homework / Exercises

  1. Draw fragments for literals, connect with split/merge epsilon states.
  2. Maintain a set of active states and update per input character.
  3. The required literal set is the intersection of branch literals.

Chapter 5: SIMD and Vectorized Literal Scanning

Fundamentals

SIMD (Single Instruction, Multiple Data) lets a CPU compare 16 or 32 bytes in a single instruction. This is the foundation for fast literal search. Instead of comparing one byte at a time, you load a vector of bytes and compare them all at once, producing a bitmask of matches. This bitmask is then used to compute candidate positions quickly. Libraries like memchr use SIMD internally and are heavily tuned for modern CPUs.

The key mental model is that SIMD turns the inner loop of string search into data-parallel operations. When combined with branchless code and prefetching, SIMD can approach memory bandwidth limits. This is why literal extraction is so powerful: if your regex contains a literal, you can scan for it at SIMD speed, and only run the slower regex engine at match positions.

Deep Dive into the Concept

SIMD literal scanning usually happens in two stages: fast rejection and verification. In fast rejection, you scan for a rare byte (often the last byte of the pattern) using SIMD. This yields candidate positions. In verification, you compare the full pattern at those positions. The idea is identical to Boyer-Moore, but performed in parallel on many bytes at once.

The memchr family of functions focuses on finding a single byte. memmem extends this to find a substring. For memmem, SIMD is used to find candidate positions of the first and last byte simultaneously, and then confirm the interior bytes. The Teddy algorithm (used in Hyperscan) extends this further to multiple literals at once by building bitmasks for several bytes and combining them with vector operations. This is especially effective when you have many literals (like regex alternations) and want to quickly filter candidates.

SIMD also affects line counting (Project 12). Counting newlines is equivalent to memchr for '\n', but you must also map byte offsets back to line numbers. The fastest approach is to count newlines in chunks (SIMD), compute prefix sums, and build a sparse index so you can translate byte offsets to line numbers quickly.
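
A sketch of that offset-to-line mapping with a scalar counter standing in for the SIMD step (the chunk size and index layout are arbitrary choices):

#include <stddef.h>

#define CHUNK 65536   /* index granularity */

/* cum[c] = number of newlines in data[0 .. end of chunk c) */
void build_line_index(const char *data, size_t n, size_t *cum) {
    size_t total = 0;
    for (size_t c = 0; c * CHUNK < n; c++) {
        size_t end = (c + 1) * CHUNK < n ? (c + 1) * CHUNK : n;
        for (size_t i = c * CHUNK; i < end; i++)   /* SIMD would count this chunk at once */
            if (data[i] == '\n') total++;
        cum[c] = total;
    }
}

/* Translate a byte offset to a 1-based line number: one table lookup,
   then a short scan inside a single chunk. */
size_t line_of_offset(const char *data, size_t off, const size_t *cum) {
    size_t c = off / CHUNK;
    size_t line = c ? cum[c - 1] : 0;              /* newlines before this chunk */
    for (size_t i = c * CHUNK; i < off; i++)
        if (data[i] == '\n') line++;
    return line + 1;
}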

The final performance constraint is memory bandwidth. Even perfect SIMD cannot exceed the speed at which data can be pulled from memory. That is why real-world throughput is often “limited by RAM” rather than CPU. Your goal is to reduce instruction overhead so that the loop is limited by memory bandwidth, not by computation.

How This Fits in Projects

  • Project 7 (memchr) is the SIMD baseline.
  • Project 8 (memmem) generalizes to substring search.
  • Project 9 (Teddy) handles multi-literal SIMD prefilters.
  • Project 12 uses SIMD to count newlines and accelerate line mapping.

Definitions & Key Terms

  • SIMD: Vectorized operations over multiple bytes.
  • Vector register: 128-bit (SSE2) or 256-bit (AVX2) register.
  • Bitmask: Result of SIMD compare; each bit indicates a match.
  • Prefilter: Fast scan that narrows candidate positions.
  • Teddy algorithm: Multi-literal SIMD matcher from Hyperscan.

Mental Model Diagram

Bytes:  [h e l l o ...]
SIMD:   compare 16 bytes at once to target byte 'o'
Mask:   0001000000000000 (bit set where byte == 'o')

How It Works (Step-by-Step)

  1. Load a SIMD vector from the text.
  2. Compare it to the target byte (or bytes) in parallel.
  3. Convert the comparison result to a bitmask.
  4. For each set bit, verify the full pattern at that position.
  5. Advance by vector width and repeat.

Invariant: The SIMD scan never misses a candidate position.

Failure mode: Incorrect alignment or boundary handling can skip matches.

Minimal Concrete Example

// Pseudocode: SIMD find byte using 16-byte vectors
for (i = 0; i + 16 <= n; i += 16) {
    vec = load16(text + i);
    mask = compare_eq(vec, target_byte);
    while (mask) { pos = ctz(mask); report(i + pos); mask &= mask - 1; }
}
// handle the final n % 16 bytes with a scalar loop
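
The same loop with real SSE2 intrinsics, including the scalar tail the pseudocode glosses over (a sketch, not a tuned memchr; __builtin_ctz is a gcc/clang builtin):

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdio.h>

void find_byte_sse2(const unsigned char *text, size_t n, unsigned char target) {
    __m128i needle = _mm_set1_epi8((char)target);        /* broadcast target byte */
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(text + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle));
        while (mask) {                                   /* one bit per matching byte */
            int pos = __builtin_ctz(mask);               /* index of lowest set bit */
            printf("match at %zu\n", i + (size_t)pos);
            mask &= mask - 1;                            /* clear that bit */
        }
    }
    for (; i < n; i++)                                   /* scalar tail: final n % 16 bytes */
        if (text[i] == target) printf("match at %zu\n", i);
}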

Common Misconceptions

  • “SIMD always gives 32x speedup” → Real speedup is limited by memory bandwidth.
  • “SIMD is only for experts” → You can use portable intrinsics or libraries.
  • “Alignment doesn’t matter” → Unaligned loads are fine, but boundary handling is critical.

Check-Your-Understanding Questions

  1. Why is SIMD most effective for rare literals?
  2. What does the bitmask represent after a SIMD compare?
  3. Why must you still verify the full pattern after a candidate hit?

Check-Your-Understanding Answers

  1. Rare literals reduce the number of candidate verifications.
  2. Each bit indicates whether a byte position matched the target.
  3. SIMD only checks a subset of bytes; full verification ensures correctness.

Real-World Applications

  • High-speed string scanning in ripgrep and fd
  • Network intrusion detection (Hyperscan)
  • Line counting and log filtering

Where You’ll Apply It

  • Projects 7-9 and Project 12

References

  • memchr crate documentation and source
  • Hyperscan papers (Teddy/FDR literal matchers)
  • “Computer Systems: A Programmer’s Perspective” — memory hierarchy

Key Insight

SIMD turns literal search into a bandwidth problem rather than a CPU problem.

Summary

SIMD lets you check many bytes in parallel and forms the fastest known path for literal scanning. The art is to combine SIMD with smart candidate verification and to stay within memory bandwidth limits.

Homework / Exercises to Practice the Concept

  1. Implement SIMD memchr and compare against strchr.
  2. Extend it to memmem using a two-byte prefilter.
  3. Build a newline counter using SIMD and validate counts.

Solutions to the Homework / Exercises

  1. Use vector compares and bitmasks; verify correctness with randomized tests.
  2. First scan for the last byte, then verify the rest.
  3. Count '\n' mask bits per chunk and sum, with a scalar fallback for tails.

Chapter 6: Parallel Traversal, Filtering, and I/O Strategy

Fundamentals

The fastest pattern engine in the world is useless if you cannot feed it data quickly. In real-world search tools, the bottleneck often shifts from regex execution to directory traversal and I/O. Modern SSDs can handle hundreds of thousands of file operations per second, but a single thread cannot saturate them. Parallel traversal is therefore essential. Tools like ripgrep and fd use work-stealing queues: each worker thread maintains a queue of directories; idle threads steal work from busy ones. This distributes filesystem work evenly and scales with CPU cores.

File filtering is equally important. Ignoring node_modules, .git, build artifacts, and binary files can reduce the search space by orders of magnitude. That is why ignore rules (.gitignore, .ignore, –glob) are not a UX feature but a performance requirement. A fast search tool spends as much effort on deciding what not to read as it spends on actual matching.

Finally, once files are chosen, you must decide how to read them. For large files, mmap can be faster because it avoids copying and leverages page cache directly. For many small files, read with a reusable buffer and batched syscalls is faster because mmap has per-file setup overhead. Production tools adaptively choose the strategy based on file size and count.

Deep Dive into the Concept

Parallel traversal is a scheduling problem. The cost of a directory is not just the number of entries but the depth of subdirectories and the filesystem latency. A static partition (“thread 1 gets /src, thread 2 gets /tests”) fails when one subtree is much heavier. Work-stealing solves this by letting threads dynamically rebalance. Each thread has a deque of work; it pushes new subdirectories to its own deque (LIFO) and steals from the other end (FIFO) of another thread’s deque. This minimizes contention and keeps caches hot.

Ignore rules introduce complexity: you must interpret .gitignore files in the context of directory hierarchy, apply global ignore patterns, and allow overrides. Rules cascade: a child directory inherits the parent’s ignore rules, then adds its own. This requires a stack of ignore matchers as you descend the tree. The correct behavior is surprisingly subtle, but a correct implementation can be reused across tools (as ripgrep does with its ignore crate).

Binary file detection is another filter. Most tools stop searching a file when they encounter a NUL byte or when a file fails UTF-8 validation. This is a policy decision with performance implications: strict UTF-8 validation costs CPU, but prevents garbage output. Many tools implement a heuristic: if a file contains a NUL byte in the first N bytes, treat it as binary and skip.

I/O strategy selection is about amortization. mmap costs a syscall and page table setup but allows zero-copy access. read costs a syscall per read and copies into user buffers, but can be faster for many small files because the overhead of mapping each file is high. Tools often pick a threshold (e.g., 1MB) above which they use mmap. The threshold is hardware-dependent, so production tools sometimes benchmark and cache the result or choose adaptively.
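
A minimal sketch of such a selector, with a hypothetical scan callback and an assumed 1 MiB threshold:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define MMAP_THRESHOLD (1 << 20)   /* 1 MiB; the right cutoff is hardware-dependent */

/* scan() is whatever search engine consumes the bytes. */
void search_file(const char *path, void (*scan)(const char *, size_t)) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return;
    struct stat st;
    if (fstat(fd, &st) == 0 && st.st_size >= MMAP_THRESHOLD) {
        char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (data != MAP_FAILED) {
            scan(data, (size_t)st.st_size);        /* zero-copy scan over the mapping */
            munmap(data, st.st_size);
        }
    } else {
        static char buf[1 << 16];                  /* reusable 64 KiB read buffer */
        ssize_t got;
        while ((got = read(fd, buf, sizeof buf)) > 0)
            scan(buf, (size_t)got);                /* caveat: matches can span chunks */
    }
    close(fd);
}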

How This Fits in Projects

  • Project 10 implements parallel traversal with work stealing.
  • Project 11 builds a file-size-based mmap vs read selector.
  • Project 12 depends on efficient line mapping after I/O.
  • Project 15 integrates traversal, filtering, and I/O strategy.

Definitions & Key Terms

  • Work stealing: Idle threads steal tasks from busy threads.
  • Deque: Double-ended queue used for work distribution.
  • Ignore rules: Patterns that exclude files from search.
  • Binary detection: Heuristic to skip non-text files.
  • Page cache: Kernel cache of file data.

Mental Model Diagram

Worker 0 deque: [/src, /tests, /docs]
Worker 1 deque: []
Worker 2 deque: []

Worker 1 steals /docs from Worker 0
Worker 2 steals /tests from Worker 0

All workers proceed in parallel.

How It Works (Step-by-Step)

  1. Start with a root directory in a shared queue.
  2. Each worker pops a directory, reads entries, applies ignore rules.
  3. For each subdirectory, push onto local deque.
  4. If a worker runs out of work, steal from another deque.
  5. For each file, choose mmap or read based on size.

Invariant: Each file is visited once and ignore rules are applied consistently.

Failure mode: Incorrect rule scoping can skip or include wrong files.

Minimal Concrete Example

Root /project
- .gitignore: node_modules/
- src/main.c
- node_modules/lib.js (ignored)

Traversal should enqueue src/, ignore node_modules/.
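
To make the scheduling idea concrete, here is a deliberately simplified parallel walker in C: one shared, mutex-protected stack instead of per-thread deques with stealing, and no ignore rules. It shows dynamic load balancing, not ripgrep's lock-free design:

#include <dirent.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_DIRS 65536   /* no overflow handling; this is a sketch */

static char *stack_[MAX_DIRS];
static int top_ = 0, active_ = 0;       /* active_ = workers inside a directory */
static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;

static void push_dir(char *path) {
    pthread_mutex_lock(&mu);
    stack_[top_++] = path;
    pthread_cond_signal(&cv);           /* wake one idle worker */
    pthread_mutex_unlock(&mu);
}

static char *take_dir(void) {           /* NULL means traversal is complete */
    pthread_mutex_lock(&mu);
    while (top_ == 0 && active_ > 0)    /* idle, but work may still appear */
        pthread_cond_wait(&cv, &mu);
    char *p = NULL;
    if (top_ > 0) { p = stack_[--top_]; active_++; }
    pthread_mutex_unlock(&mu);
    return p;
}

static void done_dir(void) {
    pthread_mutex_lock(&mu);
    if (--active_ == 0) pthread_cond_broadcast(&cv);   /* let idle workers exit */
    pthread_mutex_unlock(&mu);
}

static void *worker(void *arg) {
    (void)arg;
    char *dir;
    while ((dir = take_dir()) != NULL) {
        DIR *d = opendir(dir);
        if (d) {
            struct dirent *e;
            while ((e = readdir(d)) != NULL) {
                if (e->d_name[0] == '.') continue;     /* hidden files, ".", ".." */
                char *child = malloc(strlen(dir) + strlen(e->d_name) + 2);
                sprintf(child, "%s/%s", dir, e->d_name);
                if (e->d_type == DT_DIR) push_dir(child);  /* DT_UNKNOWN needs stat() */
                else { printf("%s\n", child); free(child); }
            }
            closedir(d);
        }
        free(dir);
        done_dir();
    }
    return NULL;
}

int main(void) {
    push_dir(strdup("."));
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    return 0;
}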

Common Misconceptions

  • “Traversal is trivial” → It is a major performance bottleneck in large trees.
  • “Ignore rules are just UX” → They are essential for speed and correctness.
  • “mmap is always faster” → mmap is slower for many small files.

Check-Your-Understanding Questions

  1. Why does work-stealing outperform static partitioning?
  2. How do ignore rules compose across directories?
  3. When is mmap worse than read?

Check-Your-Understanding Answers

  1. Because directory sizes are unpredictable and work stealing balances load dynamically.
  2. Rules inherit from parents and can be overridden by child .gitignore files.
  3. When files are small and the per-file mapping overhead dominates.

Real-World Applications

  • High-performance file search (rg, fd, ag)
  • Backup tools and file indexers

Where You’ll Apply It

  • Projects 10-12 and the integration in Project 15

References

  • “The Linux Programming Interface” — directory traversal and file I/O
  • “Advanced Programming in the UNIX Environment” — filesystem APIs
  • ripgrep and fd source code for work-stealing traversal

Key Insight

In real tools, traversal and I/O are often the real bottleneck, not regex.

Summary

Parallel traversal and adaptive I/O are essential for exploiting modern hardware. The best regex engine in the world is wasted if you cannot feed it bytes quickly or if you spend time scanning files you should have ignored.

Homework / Exercises to Practice the Concept

  1. Implement a single-threaded walker and measure throughput on a large tree.
  2. Add work-stealing and compare speedup vs thread count.
  3. Implement .gitignore parsing for a small subset of patterns.

Solutions to the Homework / Exercises

  1. Use readdir and record total files/sec.
  2. Expect near-linear speedup until I/O saturation.
  3. Support *, ?, and directory suffix / as a minimum.

Chapter 7: Fuzzy Matching and Interactive UX

Fundamentals

Fuzzy matching is not about exact matches; it is about ranking plausible matches. Given a query like “fzf” and a candidate like “FuzzyZoneFinder”, you want to score it highly because the letters appear in order and align with word boundaries. This is a fundamentally different problem from exact matching or regex. The algorithm must balance three goals: correctness (matches are valid subsequences), quality (better matches rank higher), and speed (interactive latency under 10ms).

Most fuzzy matchers are variants of subsequence matching. The simplest check is: can you walk through the candidate and consume each query character in order? For scoring, you add bonuses for matches at word boundaries, consecutive matches, or matches near the start; and penalties for large gaps. This is similar to sequence alignment in bioinformatics (Smith-Waterman), but tuned for interactive performance.

Deep Dive into the Concept

The scoring function is the heart of fuzzy matching. A good scoring function encodes human intuition: “matches at word boundaries are more important”, “consecutive matches feel better”, and “matches near the start are more relevant”. fzf’s algorithm uses a dynamic programming (DP) matrix where each cell stores the best score for matching the first i query characters within the first j candidate characters. You can optimize this by only keeping a sliding window or by precomputing character positions.

Performance is critical. An interactive fuzzy finder must score tens or hundreds of thousands of candidates as the user types. That means you need fast rejection (discard candidates that cannot match), incremental updates (reuse previous computations when the query extends), and parallelism (worker pools). The architecture of fzf uses a reader goroutine that streams candidates, a matcher pool that scores chunks, and a merger that keeps the top-K results. When the query changes, in-flight scoring is canceled and restarted.

Terminal UI adds its own constraints: you must render without flicker, handle raw input mode, and support resizing. The UI loop must be responsive even when matchers are busy. This leads to a classic producer-consumer design: input updates the query state, matchers work asynchronously, and the renderer paints the latest results. This architecture is useful beyond fuzzy search; it is the template for many interactive CLI tools.

How This Fits in Projects

  • Project 13 builds the scoring algorithm and subsequence matcher.
  • Project 14 integrates the matcher into a full interactive TUI.
  • Project 15 may reuse fuzzy matching for file selection or filters.

Definitions & Key Terms

  • Subsequence: Characters appear in order but not necessarily contiguously.
  • Gap penalty: Negative score for characters between matches.
  • Boundary bonus: Positive score for matching after separators or case changes.
  • Top-K: Keeping only the best K matches for display.

Mental Model Diagram

Query:      f z f
Candidate:  F u z z y Z o n e F i n d e r
Match path: ^         ^       ^
Score: bonus for word boundaries, penalty for gaps

How It Works (Step-by-Step)

  1. Verify the query is a subsequence of the candidate.
  2. Use DP to compute the best alignment score.
  3. Apply bonuses: start-of-word, camelCase, consecutive matches.
  4. Apply penalties: long gaps, late matches.
  5. Sort candidates by score, keep top-K.

Invariant: All reported matches are valid subsequences.

Failure mode: Poor scoring makes results feel “wrong” even if correct.
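
To ground steps 1-4, here is a greedy one-pass scoring sketch. Greedy alignment is an approximation of the DP described above (it can pick a worse path than the optimal alignment), and the weights are hypothetical rather than fzf's actual constants:

#include <ctype.h>

// Walk the candidate, consume query chars in order, add a bonus when a match
// lands on a word boundary, subtract a capped gap penalty.
// Returns -1 if the query is not a subsequence at all.
int fuzzy_score(const char *query, const char *cand) {
    int score = 0, gap = 0;
    const char *q = query;
    for (const char *c = cand; *c && *q; c++) {
        if (tolower((unsigned char)*c) == tolower((unsigned char)*q)) {
            int boundary = (c == cand) ||
                           !isalnum((unsigned char)c[-1]) ||   // after _ . / space
                           (islower((unsigned char)c[-1]) &&
                            isupper((unsigned char)*c));       // camelCase edge
            score += 16 + (boundary ? 8 : 0) - gap;            // hypothetical weights
            gap = 0;
            q++;
        } else if (q != query) {
            if (gap < 4) gap++;              // cap the per-match gap penalty
        }
    }
    return *q ? -1 : score;
}

Swapping the greedy walk for the DP matrix is exactly what lets a matcher prefer a boundary match later in the string over an earlier mid-word one.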

Minimal Concrete Example

Query: "rb"
Candidate: "ReadBuffer"
Matches: R (start bonus) + b (camelCase bonus)
Score higher than "crab" due to boundary bonuses.

Common Misconceptions

  • “Fuzzy matching is just substring search” → It is subsequence + scoring.
  • “Higher score always means shorter distance” → Boundary bonuses can outweigh distance.
  • “Interactive performance is optional” → Latency is the product; slow feels broken.

Check-Your-Understanding Questions

  1. What makes a match “valid” in fuzzy matching?
  2. Why do boundary bonuses matter?
  3. How do you keep latency low when scoring many candidates?

Check-Your-Understanding Answers

  1. The query must appear as a subsequence of the candidate.
  2. They align with how humans parse words and identifiers.
  3. Use fast rejection, parallelism, and incremental updates.

Real-World Applications

  • fzf, IDE quick-open, fuzzy file search
  • Command palette search in editors

Where You’ll Apply It

  • Projects 13 and 14

References

  • fzf source code (algo.go)
  • Smith-Waterman style sequence alignment (conceptual influence)

Key Insight

Fuzzy matching is a UX algorithm: speed and ranking quality matter as much as correctness.

Summary

Fuzzy matching trades exactness for usability. The right scoring heuristics and architecture make the tool feel instant and intelligent. This is where algorithms meet user experience.

Homework / Exercises to Practice the Concept

  1. Implement a subsequence matcher and compare results to fzf on small inputs.
  2. Add boundary bonuses and observe ranking changes.
  3. Add cancellation to your matcher pool when queries change.

Solutions to the Homework / Exercises

  1. Ensure all query characters appear in order; test with randomized cases.
  2. Check for underscores, spaces, or camelCase boundaries and add bonus.
  3. Use a shared atomic query version and ignore stale results.
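
Solution 3 in sketch form, using C11 atomics (the function names here are hypothetical):

#include <stdatomic.h>
#include <stdbool.h>

static atomic_uint query_version;            // bumped once per keystroke

// Input thread: called whenever the query string changes.
void on_query_changed(void) {
    atomic_fetch_add(&query_version, 1);     // invalidates all in-flight work
}

// Worker: snapshot the version, do the expensive scoring, and publish the
// result only if no newer query arrived in the meantime.
bool worker_score_chunk(int (*score_fn)(int chunk_id), int chunk_id, int *out) {
    unsigned v = atomic_load(&query_version);
    int result = score_fn(chunk_id);         // scoring happens here
    if (atomic_load(&query_version) != v)
        return false;                        // stale: drop the result
    *out = result;
    return true;
}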

Glossary (High-Signal)

  • Alignment: A candidate position where a pattern is tested against text.
  • Automaton: A finite-state machine that recognizes a set of strings.
  • Backtracking: Regex strategy that explores one path at a time and may be exponential.
  • DFA/NFA: Deterministic/nondeterministic finite automata.
  • Literal extraction: Finding substrings required by a regex to speed up matching.
  • Prefilter: Fast scan to narrow candidate positions before expensive checks.
  • SIMD: Vectorized CPU instructions that operate on many bytes at once.
  • Work stealing: Parallel scheduling strategy where idle threads steal tasks.
  • Page cache: Kernel cache of recently accessed file data.
  • Line-oriented output: Reporting matches with line numbers and context.

Concept Summary Table

This section provides a map of the mental models you will build during these projects.

| Concept Cluster | What You Need to Internalize |
| --- | --- |
| String Matching Foundations | Why naive search is O(n×m), where worst cases come from, and how overlaps work. |
| Skip-Based Sublinear Search | How Boyer-Moore turns mismatches into safe skips and why long patterns can be faster. |
| Multi-Pattern Automata | How Aho-Corasick finds many patterns in one pass using failure links. |
| Regex Engines & Literal Extraction | Automata vs backtracking, NFA/DFA tradeoffs, and how prefilters cut work. |
| SIMD & Vectorized Scanning | How to scan 16-32 bytes at once, build candidate masks, and verify matches. |
| Parallel Traversal & I/O Strategy | Work-stealing traversal, ignore rules, binary detection, and mmap vs read tradeoffs. |
| Fuzzy Matching & Interactive UX | Subsequence scoring, boundary bonuses, and low-latency TUI architecture. |

Project-to-Concept Map

| Project | What It Builds | Primer Chapters It Uses |
| --- | --- | --- |
| Project 1: Naive Search | Baseline matcher + benchmarking | String Matching Foundations |
| Project 2: Boyer-Moore | Skip-based exact search | Skip-Based Sublinear Search |
| Project 3: Aho-Corasick | Multi-pattern automaton | Multi-Pattern Automata |
| Project 4: NFA Regex | Thompson NFA engine | Regex Engines & Literal Extraction |
| Project 5: Lazy DFA | DFA-on-demand regex | Regex Engines & Literal Extraction |
| Project 6: Literal Extraction | Regex prefilters | Regex Engines & Literal Extraction, SIMD & Vectorized Scanning |
| Project 7: SIMD memchr | Byte scanning with SIMD | SIMD & Vectorized Scanning |
| Project 8: SIMD memmem | Substring SIMD | SIMD & Vectorized Scanning |
| Project 9: Teddy Algorithm | Multi-literal SIMD | SIMD & Vectorized Scanning |
| Project 10: Parallel Traversal | Work-stealing walker | Parallel Traversal & I/O Strategy |
| Project 11: mmap vs read | Adaptive I/O | Parallel Traversal & I/O Strategy |
| Project 12: SIMD Line Counting | Line indexing + SIMD | SIMD & Vectorized Scanning, Parallel Traversal & I/O Strategy |
| Project 13: Fuzzy Matcher | Scoring algorithm | Fuzzy Matching & Interactive UX |
| Project 14: fzf Clone | Interactive TUI | Fuzzy Matching & Interactive UX |
| Project 15: ripgrep Clone | Integrated tool | All chapters (1-7) |

Deep Dive Reading by Concept

This section maps each concept to specific book chapters or primary sources. Prefer the books you already own, then use the primary sources for implementation details.

String Matching Foundations

| Concept | Book & Chapter | Why This Matters |
| --- | --- | --- |
| Naive string matching | “Algorithms, Fourth Edition” by Sedgewick & Wayne — Ch. 5: String Algorithms | Establishes the baseline and complexity model. |
| Substring search basics | “Algorithms in C, Parts 1-4” by Sedgewick — String processing sections | Practical C-level perspective and implementation patterns. |
| Memory hierarchy | “Computer Systems: A Programmer’s Perspective” — Ch. 6 | Explains why cache behavior dominates naive loops. |

Skip-Based Sublinear Search (Boyer-Moore)

| Concept | Book & Chapter | Why This Matters |
| --- | --- | --- |
| Boyer-Moore family | “Algorithms, Fourth Edition” — Ch. 5 (substring search variants) | Formal foundations and variant tradeoffs. |
| Good-suffix intuition | “Algorithms in C” — String searching | Helps you reason about borders and safe shifts. |
| Original algorithm | Boyer & Moore (1977) paper | The canonical definition of the heuristics. |

Multi-Pattern Automata (Aho-Corasick)

| Concept | Book & Chapter | Why This Matters |
| --- | --- | --- |
| Tries and automata | “Algorithms, Fourth Edition” — Tries and string search | Core data structure for multi-pattern matching. |
| Failure links | Aho & Corasick (1975) paper | Defines the algorithm and correctness. |
| Practical implementation | Aho-Corasick implementations (BurntSushi aho-corasick crate) | Real-world performance choices. |

Regex Engines & Literal Extraction

| Concept | Book & Chapter | Why This Matters |
| --- | --- | --- |
| Regex to automata | “Compilers: Principles and Practice” — Regex to automata chapters | Bridges regex syntax to NFAs/DFAs. |
| NFA simulation | “Engineering a Compiler” — NFA/DFA sections | Practical DFA vs NFA tradeoffs. |
| Linear-time regex | RE2 documentation + Russ Cox articles | Guarantees and design rationale. |
| Literal extraction | ripgrep/regex internals docs | Performance-critical optimization used in modern tools. |

SIMD & Vectorized Scanning

| Concept | Book & Chapter | Why This Matters |
| --- | --- | --- |
| SIMD basics | “Low-Level Programming” by Zhirkov — SIMD and CPU internals | Explains vector instructions and data layout. |
| Memory hierarchy | “Computer Systems: A Programmer’s Perspective” — Ch. 6 | Shows why bandwidth is the limit. |
| memchr/memmem | memchr crate documentation and source | Real-world SIMD string scanning. |
| Teddy algorithm | Hyperscan papers + burntsushi regex-internals | Multi-literal SIMD prefiltering ideas. |

Parallel Traversal & I/O Strategy

| Concept | Book & Chapter | Why This Matters |
| --- | --- | --- |
| Directory traversal APIs | “The Linux Programming Interface” — directories & links | Correctly enumerating files. |
| Threads and scheduling | “Advanced Programming in the UNIX Environment” — threads | Parallel traversal mechanics. |
| mmap vs read | “Linux System Programming” — file I/O and memory | Explains the I/O tradeoffs you will benchmark. |

Fuzzy Matching & Interactive UX

| Concept | Book & Chapter | Why This Matters |
| --- | --- | --- |
| Dynamic programming | “Algorithms, Fourth Edition” — DP chapters | Basis of scoring alignment. |
| UX-driven algorithms | fzf source code (algo.go) | Real-world scoring heuristics. |
| Terminal behavior | “The Linux Command Line” — terminal basics | Required for raw mode, rendering, and key handling. |

Quick Start Guide (For the Overwhelmed)

Feeling overwhelmed by 15 projects? Start here for your first 48 hours:

Day 1: Build Intuition (3-4 hours)

  1. Read this file’s Goal and “Why This Matters” sections (30 minutes)
  2. Implement Project 1 (Naive Search) (2 hours)
    • Don’t optimize, just make it work
    • Compare performance to grep on a 10MB file
    • Notice how slow it is on worst-case input
  3. Skim Boyer-Moore explanation (Project 2) (30 minutes)
    • Focus on understanding “why backwards scanning helps”
    • Don’t implement yet, just build mental model

Day 2: See the Magic (3-4 hours)

  1. Experiment with ripgrep (1 hour)
    • Install ripgrep: cargo install ripgrep
    • Clone a large repo: git clone https://github.com/torvalds/linux
    • Compare: time grep -r "schedule" linux/ vs time rg "schedule" linux/
    • Observe the massive speed difference
    • Try: rg --debug "pattern" to see literal extraction in action
  2. Read Russ Cox’s “Regular Expression Matching Can Be Simple And Fast” (1.5 hours)
  3. Plan your path (30 minutes)
    • Based on what excites you, pick 3-5 projects from the full list
    • Recommended starter combos:
      • Algorithm focus: Projects 1, 2, 4
      • Performance focus: Projects 1, 7, 10
      • Practical tools: Projects 1, 13, 14

Week 1: First Working Project

Complete Project 1 or 2 fully, including:

  • Implementation
  • Testing on various inputs
  • Benchmarking
  • Comparing to real tools

Don’t move on until you have a working, correct implementation. Speed comes later.


Path A: Algorithm Theorist

For those who love understanding “why” through theory

  1. Project 1 (Naive) → See the O(n×m) baseline
  2. Project 2 (Boyer-Moore) → See O(n/m) sublinear search
  3. Project 3 (Aho-Corasick) → See multi-pattern O(n+m+z)
  4. Project 4 (Thompson NFA) → See guaranteed linear regex
  5. Project 5 (Lazy DFA) → See the DFA optimization

Expected outcome: Deep understanding of algorithm tradeoffs, ability to analyze complexity, knowledge of automata theory.

Path B: Performance Engineer

For those who want to make things FAST

  1. Project 1 (Naive) → Establish baseline
  2. Project 7 (SIMD memchr) → See 32x speedup from SIMD
  3. Project 8 (SIMD memmem) → Apply SIMD to substring search
  4. Project 10 (Parallel Traversal) → Saturate I/O with parallelism
  5. Project 11 (mmap vs read) → Optimize I/O strategy
  6. Project 15 (ripgrep clone) → Integrate everything

Expected outcome: Ability to profile, optimize, and understand performance bottlenecks. SIMD programming skills. Systems-level optimization intuition.

Path C: Tool Builder

For those who want to ship useful tools

  1. Project 1 (Naive) → Understand the basics
  2. Project 2 (Boyer-Moore) → Get basic algorithm optimization
  3. Project 10 (Parallel Traversal) → Handle large directory trees
  4. Project 13 (Fuzzy Matcher) → Build interactive selection
  5. Project 14 (fzf clone) → Ship a complete TUI tool
  6. Optional: Project 15 → Build production grep

Expected outcome: Multiple working tools you can use daily. Understanding of what makes tools “feel fast” to users. TUI programming skills.

Path D: Interview Preparation

For those targeting algorithm-heavy interviews

  1. Project 1 (Naive) → Baseline
  2. Project 2 (Boyer-Moore) → Classic interview algorithm
  3. Project 3 (Aho-Corasick) → Advanced pattern matching (asked at Google, Facebook)
  4. Project 4 (Thompson NFA) → Regex compilation (asked at Amazon, Microsoft)
  5. Project 13 (Fuzzy Matcher) → Dynamic programming + heuristics

Expected outcome: Ability to implement and explain classic string algorithms, confident answers to “how does grep work?” and “how do you implement regex?”, and a solid footing in algorithm interviews.


Success Metrics

By the end of this guide, you should be able to demonstrate the following:

  • Correctness: Your outputs match grep/rg on the same inputs across normal, edge, and pathological cases.
  • Performance reasoning: You can explain why a given project is faster or slower (not just that it is).
  • Benchmark discipline: You use repeatable benchmarks (fixed inputs, warmed cache, controlled flags).
  • Algorithmic fluency: You can explain and implement naive search, Boyer-Moore, Aho-Corasick, Thompson NFA, and lazy DFA from memory.
  • Systems fluency: You can profile and attribute bottlenecks to I/O, CPU, cache, or output formatting.
  • Integration skill: Your final ripgrep-style tool reaches within 2x of rg on at least one real-world corpus.

Phase 1: String Matching Algorithms


## Project 1: Naive String Search (Baseline)

  • File: TEXT_SEARCH_TOOLS_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Zig, Go
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 1. The “Resume Gold” (Educational/Personal Brand)
  • Difficulty: Level 1: Beginner
  • Knowledge Area: Algorithms / String Matching
  • Software or Tool: Custom Implementation
  • Main Book: “Introduction to Algorithms” by Cormen, Leiserson, Rivest, Stein (CLRS)

What you’ll build: A brute-force string search that compares every position in the text against the pattern. This is your baseline to understand what we’re optimizing.

Why it teaches search tools: You need to see the naive O(n×m) approach to appreciate why Boyer-Moore and SIMD matter. This is also what many standard library strstr() functions use!

Core challenges you’ll face:

  • Understanding worst case → pattern aaaaab in aaaaaaaaaaaaaaaaaab → maps to why preprocessing matters
  • Character-by-character comparison → maps to CPU cache behavior
  • Early termination → maps to optimization opportunities

Key Concepts:

  • String Matching Basics: “Introduction to Algorithms” Chapter 32.1 - CLRS
  • Big-O Analysis: Understanding O(n×m) worst case vs O(n) average case
  • Pointer Arithmetic: “C Programming Language” Chapter 5 - K&R

Difficulty: Beginner Time estimate: 2-3 hours Prerequisites: Basic C, pointers

Real world outcome:

$ ./naive_search "needle" haystack.txt
Found at position 1234
Found at position 5678
Search completed in 45ms

$ ./naive_search "aaaaaab" pathological.txt
# Watch it struggle on worst-case input
Search completed in 2300ms  # Much slower!

Benchmark:
Pattern "hello" in 1MB file:
  Naive: 12ms
  (We'll compare to Boyer-Moore later)

Implementation Hints:

// Naive search core: try every alignment, compare left to right.
// Assumes text/pattern of lengths n/m are in scope.
for (size_t i = 0; i + m <= n; i++) {        // each candidate position in text
    size_t j = 0;
    while (j < m && text[i + j] == pattern[j])
        j++;                                  // advance until mismatch or full match
    if (j == m)
        printf("Found match at position %zu\n", i);
}

The key insight is that we re-examine the same characters multiple times. If we’ve seen “abcd” and the pattern is “abce”, we know something about what the text contains—we shouldn’t forget it!

Learning milestones:

  1. Implementation works correctly → You understand the basic problem
  2. Can identify worst-case inputs → You understand algorithm analysis
  3. See why this is slow → You’re ready for optimization

The Core Question You’re Answering

“What is the absolute baseline for string matching? How slow can it be, and why?”

Before you can appreciate optimizations like Boyer-Moore or SIMD, you need to see the naive approach—both its simplicity and its inefficiency. This project answers: “What happens when we try every possible position with no cleverness whatsoever?”


Concepts You Must Understand First

Stop and research these before coding:

  1. String representation in memory
    • How is a string stored as an array of bytes?
    • What is a null terminator (in C)?
    • How do you index into a string (pointer arithmetic)?
    • Book Reference: “The C Programming Language” Ch. 5 (Pointers and Arrays) - K&R
  2. Nested loops and iteration
    • What does an outer loop over text positions + inner loop over pattern positions look like?
    • When should you break out of the inner loop?
    • How do you track the current position in both text and pattern?
  3. Big-O notation
    • What does O(n×m) mean in concrete terms?
    • When does worst case occur vs average case?
    • Book Reference: “Introduction to Algorithms” Ch. 3 (Growth of Functions) - CLRS

Questions to Guide Your Design

Before implementing, think through these:

  1. Loop Structure
    • Which loop should be outer: text or pattern? (Answer: text)
    • When do you know a match has succeeded?
    • When do you know a match has failed?
  2. Edge Cases
    • What if pattern is empty? (Everything matches? Nothing matches?)
    • What if pattern is longer than text? (No matches possible)
    • What if text is empty? (No matches unless pattern is also empty)
  3. Performance Characteristics
    • What’s the best case? (First position matches fully)
    • What’s the worst case? (Pattern almost matches everywhere)
    • What’s a pathological input that makes this maximally slow?

Thinking Exercise

Trace Execution by Hand

Before coding, trace this algorithm on paper:

Text:    "abcabcabd"
Pattern: "abcabd"

Position 0: "abcabc" vs "abcabd"
  'a' = 'a' ✓
  'b' = 'b' ✓
  'c' = 'c' ✓
  'a' = 'a' ✓
  'b' = 'b' ✓
  'c' ≠ 'd' ✗  → Mismatch after 6 comparisons

Position 1: "bcabca" vs "abcabd"
  'b' ≠ 'a' ✗  → Mismatch after 1 comparison

Position 2: "cabcab" vs "abcabd"
  'c' ≠ 'a' ✗  → Mismatch after 1 comparison

Position 3: "abcabd" vs "abcabd"
  All match! ✓ → Found at position 3

Questions while tracing:

  • How many total character comparisons did we make?
  • Did we “forget” anything we learned from position 0?
  • Could we have skipped positions 1 and 2? How?

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “Implement strstr() from scratch” (Common at Google, Microsoft, Amazon)
  2. “What’s the time complexity of naive string matching in the worst case?”
  3. “Given pattern ‘AAAA’ and text ‘AAAAAAAAAA’, how many comparisons does naive search make?”
  4. “Can you create a pattern and text where naive search takes quadratic time?”
  5. “Why is naive search still used in some standard libraries?” (Hint: simplicity, cache-friendly for short patterns)

Hints in Layers

Hint 1: Structure Start with two loops. The outer loop tries each starting position in the text. The inner loop compares the pattern character-by-character from that position.

Hint 2: Comparison Logic For each text position i, check if text[i+j] == pattern[j] for all j from 0 to pattern_length-1. If any comparison fails, break and try the next i.

Hint 3: Match Detection If the inner loop completes without breaking (meaning all characters matched), you’ve found a match at position i. Record it.

Hint 4: Edge Cases Handle these explicitly:

  • If pattern_length == 0: return immediately (or return 0, depending on semantics)
  • If text_length < pattern_length: no matches possible
  • Prevent out-of-bounds access: outer loop should run i <= text_length - pattern_length

Books That Will Help

| Topic | Book | Chapter |
| --- | --- | --- |
| Naive string matching algorithm | “Introduction to Algorithms” by CLRS | Ch. 32.1 |
| String representation in C | “The C Programming Language” by K&R | Ch. 5 |
| Algorithm analysis and Big-O | “Introduction to Algorithms” by CLRS | Ch. 3 |
| Worst-case analysis | “Algorithm Design” by Kleinberg & Tardos | Ch. 2 |

Common Pitfalls & Debugging

Problem 1: “Segmentation fault when pattern is longer than text”

  • Why: You’re accessing text[i+j] when i+j >= text_length
  • Fix: Limit outer loop to i <= text_length - pattern_length
  • Quick test: ./naive_search "toolong" "hi" should not crash

Problem 2: “Finding matches that don’t exist”

  • Why: Logic error in match detection (probably not checking all characters)
  • Fix: Ensure inner loop runs for full pattern length, use a flag to track “all matched”
  • Quick test: ./naive_search "abc" "xyz" should find zero matches

Problem 3: “Performance is worse than expected even on simple inputs”

  • Why: Not breaking early on mismatch, or redundant work
  • Fix: Ensure inner loop breaks immediately when text[i+j] != pattern[j]
  • Quick test: Profile with perf stat ./naive_search ...

Problem 4: “Matches reported at wrong positions”

  • Why: Off-by-one error in position tracking
  • Fix: Double-check that you’re reporting i, not i+pattern_length or similar
  • Quick test: ./naive_search "needle" "haystackneedlehaystack" should report position 8

Definition of Done

  • Handles empty pattern, empty text, and pattern longer than text without crashing
  • Reports all matches correctly (including overlaps if configured)
  • Comparison counter produces the expected counts on known test cases
  • Benchmarked against grep on a 10MB file and results recorded
  • Unit tests cover at least 5 edge cases (short text, long pattern, repeats, binary data, Unicode)

## Project 2: Boyer-Moore String Search

  • File: TEXT_SEARCH_TOOLS_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++, Zig
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold” (Educational/Personal Brand)
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Algorithms / String Matching
  • Software or Tool: Custom Implementation
  • Main Book: “Algorithms on Strings, Trees, and Sequences” by Dan Gusfield

What you’ll build: The Boyer-Moore algorithm with both the “bad character” and “good suffix” heuristics. This is what GNU grep uses internally.

Why it teaches search tools: Boyer-Moore is the foundation of fast string search. Its key insight—scanning backwards and skipping forward—is counterintuitive but brilliant. Longer patterns search FASTER because you can skip more.

Core challenges you’ll face:

  • Bad character table construction → maps to preprocessing patterns
  • Good suffix table construction → maps to suffix analysis
  • Rightmost occurrence lookup → maps to how much to skip
  • Understanding why backwards scanning works → maps to algorithm intuition

Key Concepts:

  • Bad Character Rule: “Algorithms on Strings, Trees, and Sequences” Chapter 2.2 - Gusfield
  • Good Suffix Rule: “Algorithms on Strings, Trees, and Sequences” Chapter 2.2 - Gusfield
  • Skip Tables: “The Boyer-Moore Fast String Searching Algorithm” - J. Strother Moore (original paper)
  • Horspool Simplification: Boyer-Moore-Horspool variant that’s simpler to implement

Difficulty: Intermediate-Advanced Time estimate: 1 week Prerequisites: Project 1 (Naive Search), understanding of suffix/prefix concepts

Real world outcome:

$ ./boyer_moore "substantially" novel.txt
Found at position 45231
Found at position 89012
Search completed in 3ms  # vs 45ms naive!

Skip table visualization:
  Pattern: "example"
  Bad character table:
    'e' -> 1 (rightmost position from end)
    'l' -> 2
    'p' -> 3
    'm' -> 4
    'a' -> 5
    'x' -> 6
    (all others) -> 7 (pattern length)

Benchmark comparison:
  Pattern length 5:  Naive 12ms, Boyer-Moore 8ms   (1.5x faster)
  Pattern length 15: Naive 15ms, Boyer-Moore 2ms   (7.5x faster)
  Pattern length 50: Naive 18ms, Boyer-Moore 0.4ms (45x faster!)

Key insight: Longer patterns = faster search!

Implementation Hints: The bad character rule: When a mismatch occurs at text[i] and pattern[j], shift the pattern so that the rightmost occurrence of text[i] in the pattern aligns with text[i]. If text[i] doesn’t occur in the pattern, shift the entire pattern past position i.

The magic is that you compare from RIGHT TO LEFT within the pattern. When you mismatch, you’ve already seen characters to the right—use that information to skip forward!
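
A minimal sketch of the bad-character preprocessing in its simpler Horspool form, where shifts are keyed by the text byte aligned with the pattern's last position:

#include <stddef.h>

// For each byte value: how far can the pattern shift when that byte of the
// text sits under the pattern's last position?
void build_bad_char(const unsigned char *pat, size_t m, size_t shift[256]) {
    for (int c = 0; c < 256; c++)
        shift[c] = m;                        // byte absent: skip the whole pattern
    for (size_t i = 0; i + 1 < m; i++)
        shift[pat[i]] = m - 1 - i;           // rightmost occurrence (excluding last) wins
}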

Learning milestones:

  1. Bad character heuristic works → You understand basic skip logic
  2. Good suffix heuristic works → You understand suffix analysis
  3. Longer patterns search faster → You’ve internalized the counterintuitive insight

Real World Outcome

By the end of this project, you will have a Boyer-Moore CLI that can explain why it skipped ahead. You will be able to print the bad-character and good-suffix tables, and show measurable speedups as pattern length increases.

What you will see:

  1. Skip tables printed for any pattern you provide.
  2. Match positions with timing information.
  3. Benchmark comparison vs naive search on the same text.

Command Line Outcome Example:

$ ./bm_search --pattern "NEGLIGENCE" --text novel.txt --print-tables
Bad-char shift (bytes):
  E -> 3
  G -> 4
  N -> 2
  L -> 6
  I -> 5
  C -> 1
  ...
Good-suffix shift:
  suffix length 1 -> 6
  suffix length 2 -> 6
  suffix length 3 -> 9

Matches:
  45123
  90211
Time: 2.9ms

$ ./bm_search --pattern "NEGLIGENCE" --text novel.txt --baseline
Naive: 39.4ms  (13.6x slower)

The Core Question You’re Answering

“How can a mismatch prove that many future alignments are impossible?”

Boyer-Moore is the first algorithm that turns a single mismatch into a proof that you can skip ahead. This project makes that intuition concrete by forcing you to compute and visualize the skip distances.


Concepts You Must Understand First

  1. Bad Character Rule
    • Why does the rightmost occurrence matter?
    • How does the alphabet size affect the table?
    • Book Reference: “Algorithms, Fourth Edition” — String search section
  2. Good Suffix Rule
    • What is a suffix that is also a prefix (a border)?
    • How do you compute border arrays efficiently?
    • Book Reference: “Algorithms in C” — string searching sections
  3. Right-to-left scanning
    • What information do you gain by scanning backwards?
    • Why does this not miss matches?

Questions to Guide Your Design

  1. Table Construction
    • Do you precompute shifts for the full alphabet (256)?
    • How do you handle patterns longer than 256?
  2. Mismatch Handling
    • How do you combine bad-character and good-suffix shifts?
    • What happens if the good-suffix shift is smaller than 1?
  3. Correctness vs Speed
    • How will you test that your shifts never skip a valid match?
    • Can you construct adversarial tests for each heuristic?

Thinking Exercise

The “Skip Proof” Trace

Trace Boyer-Moore on paper for:

Text:    "ABC ABCDAB ABCDABCDABDE"
Pattern: "ABCDABD"

Questions to answer:

  • Where does the first mismatch happen?
  • What is the bad-character shift?
  • What is the good-suffix shift?
  • Which shift is chosen and why?

The Interview Questions They’ll Ask

  1. “Explain the bad-character and good-suffix rules in Boyer-Moore.”
  2. “Why can Boyer-Moore be sublinear on average?”
  3. “What is a border of a string, and why does it matter?”
  4. “How does Horspool simplify Boyer-Moore?”
  5. “When is Boyer-Moore worse than naive?”

Hints in Layers

Hint 1: Start with bad-character only Build only the bad-character table and ensure it works before adding good-suffix.

Hint 2: Precompute border lengths The good-suffix rule depends on knowing which suffixes also appear as prefixes.

Hint 3: Combine shifts safely Use max(bad_char_shift, good_suffix_shift) to keep correctness.

Hint 4: Verify with a brute-force oracle Generate random texts and compare results to your naive search.
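
Putting the hints together, here is a minimal Horspool-style search loop that uses the bad-character table sketched earlier (good-suffix omitted, as in Hint 1):

#include <stddef.h>
#include <stdio.h>

// Compare right to left at each alignment; on mismatch or match, shift by
// the table entry for the text byte under the pattern's last position.
void horspool_search(const unsigned char *text, size_t n,
                     const unsigned char *pat, size_t m,
                     const size_t shift[256]) {
    if (m == 0 || m > n) return;
    size_t i = 0;
    while (i + m <= n) {
        size_t j = m;
        while (j > 0 && text[i + j - 1] == pat[j - 1])
            j--;                             // right-to-left comparison
        if (j == 0)
            printf("Found match at position %zu\n", i);
        i += shift[text[i + m - 1]];         // always >= 1, so we always progress
    }
}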


Books That Will Help

| Topic | Book | Chapter |
| --- | --- | --- |
| Boyer-Moore fundamentals | “Algorithms, Fourth Edition” by Sedgewick & Wayne | Ch. 5 (Substring search) |
| String borders | “Algorithms in C” by Sedgewick | String searching sections |
| Pattern preprocessing | “Mastering Algorithms with C” by Loudon | String matching chapters |

Common Pitfalls & Debugging

Problem 1: “Missing matches”

  • Why: Incorrect good-suffix table or shift combination
  • Fix: Verify table generation against a known implementation
  • Quick test: Compare results with naive search on random input

Problem 2: “Infinite loop or zero shift”

  • Why: Shift computed as 0 on mismatch
  • Fix: Enforce minimum shift of 1
  • Quick test: Pattern of length 1 should still progress

Problem 3: “Slow performance on short patterns”

  • Why: Preprocessing dominates or shifts are too small
  • Fix: Use naive for patterns below a threshold (e.g., < 8 bytes)

Definition of Done

  • Bad-character and good-suffix tables are correct and tested
  • Output matches naive search on random and adversarial inputs
  • Benchmarks show speedup for patterns >= 8 bytes
  • Horspool variant implemented and compared
  • Tables are printable with a CLI flag for inspection

## Project 3: Aho-Corasick Multi-Pattern Matching

  • File: TEXT_SEARCH_TOOLS_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++, Go
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Algorithms / Automata Theory
  • Software or Tool: Custom Implementation
  • Main Book: “Introduction to Algorithms” by CLRS

What you’ll build: An Aho-Corasick automaton that can search for multiple patterns simultaneously in a single pass through the text. This is what fgrep and antivirus scanners use.

Why it teaches search tools: When you have multiple patterns (like 1000 keywords to search for), naive approaches become O(n×m×k) where k is pattern count. Aho-Corasick is O(n + m + z) where z is the number of matches—linear regardless of pattern count!

Core challenges you’ll face:

  • Trie construction → maps to prefix tree data structures
  • Failure link computation → maps to KMP failure function generalization
  • Output function (suffix links) → maps to handling overlapping patterns
  • State machine execution → maps to automata theory

Key Concepts:

  • Trie Data Structure: “Introduction to Algorithms” Chapter 12 - CLRS
  • Failure Links: The key innovation—where to go on mismatch (like KMP’s failure function)
  • Finite State Machines: “Introduction to the Theory of Computation” Chapter 1 - Sipser
  • Original Paper: “Efficient string matching: An aid to bibliographic search” - Aho & Corasick (1975)

Difficulty: Advanced Time estimate: 2 weeks Prerequisites: Project 2, understanding of tries and FSMs

Real world outcome:

$ ./aho_corasick keywords.txt document.txt
Loaded 1000 patterns from keywords.txt
Searching document.txt (5MB)...
Found "error" at position 1234
Found "warning" at position 1235
Found "error" at position 5678
...
Total: 4523 matches
Search completed in 45ms

Automaton visualization:
  States: 4523
  Transitions: 12456
  Failure links: 4522

Benchmark:
  1000 patterns, 5MB file:
    Naive (run each pattern): 12,000ms
    Aho-Corasick: 45ms
    Speedup: 266x!

Implementation Hints: Build a trie from all patterns. Then compute “failure links”—for each state, where do you go when the next character doesn’t match any outgoing edge? The failure link points to the longest proper suffix of the current path that is also a prefix of some pattern.

Key insight: You never backtrack in the input. Each character is examined exactly once. The failure links let you “teleport” to the right state.
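
A minimal sketch of the BFS construction. The node layout is an assumption (dense child arrays, and a single output flag where a real implementation would merge full output sets):

#include <stddef.h>

#define ALPHA 256
typedef struct Node {
    struct Node *child[ALPHA];
    struct Node *fail;
    int output;                  // nonzero if some pattern ends at or through here
} Node;

// queue must be large enough to hold every node in the trie.
void build_failure_links(Node *root, Node **queue) {
    size_t head = 0, tail = 0;
    root->fail = root;                         // root falls back to itself
    for (int c = 0; c < ALPHA; c++)
        if (root->child[c]) {
            root->child[c]->fail = root;       // depth-1 nodes fail to root
            queue[tail++] = root->child[c];
        }
    while (head < tail) {                      // breadth-first over the trie
        Node *u = queue[head++];
        for (int c = 0; c < ALPHA; c++) {
            Node *v = u->child[c];
            if (!v) continue;
            Node *f = u->fail;                 // follow failures until a c-edge exists
            while (f != root && !f->child[c])
                f = f->fail;
            v->fail = f->child[c] ? f->child[c] : root;
            v->output |= v->fail->output;      // inherit outputs along the failure link
            queue[tail++] = v;
        }
    }
}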

Learning milestones:

  1. Trie construction correct → You understand prefix trees
  2. Failure links computed correctly → You understand the KMP connection
  3. Finds all overlapping matches → You understand the output function

Phase 2: Regular Expression Engines

Real World Outcome

You will have a CLI that loads hundreds or thousands of patterns and scans a file in a single pass, reporting every match with its pattern ID. You will be able to print the automaton size and see how it scales with the number of patterns.

What you will see:

  1. Automaton stats (states, transitions, failure links).
  2. All matches reported in one scan.
  3. Massive speedup over running each pattern separately.

Command Line Outcome Example:

$ ./aho_scan --patterns keywords.txt --text logs.txt --stats
Loaded 1500 patterns
States: 18432
Transitions: 72480
Failure links: 18431

Matches:
  [WARN] at 2331 (pattern #12)
  [ERROR] at 8801 (pattern #9)
  [timeout] at 9202 (pattern #103)

Total matches: 482
Time: 18ms

$ ./aho_scan --patterns keywords.txt --text logs.txt --baseline
Naive multi-scan: 3.7s
Speedup: 205x

The Core Question You’re Answering

“How can I find many patterns in one pass without re-scanning the text?”

Aho-Corasick shows that you can convert a set of patterns into a single automaton that finds all matches in linear time, regardless of how many patterns you have.


Concepts You Must Understand First

  1. Trie construction
    • How do you store multiple patterns as shared prefixes?
    • Book Reference: “Algorithms, Fourth Edition” — trie section
  2. Failure links
    • What suffix should you fall back to on mismatch?
    • How does this generalize KMP?
  3. Output function
    • Why do you need to emit patterns from suffix states too?

Questions to Guide Your Design

  1. Data structure layout
    • Dense 256-array transitions or sparse maps?
    • How will you store output sets efficiently?
  2. Construction order
    • Will you build failure links with BFS or DFS?
    • How do you ensure the root’s failure link is correct?
  3. Memory vs speed
    • What is the memory impact of dense transitions?
    • Can you compress transitions for rare edges?

Thinking Exercise

The Overlap Trap

Patterns: he, she, hers, his
Text: ushers

Trace the automaton and list all matches.


The Interview Questions They’ll Ask

  1. “Explain how failure links are computed in Aho-Corasick.”
  2. “Why is the algorithm linear in the input size?”
  3. “How do you handle overlapping patterns?”
  4. “What is the memory complexity for N patterns of total length M?”
  5. “How does Aho-Corasick relate to KMP?”

Hints in Layers

Hint 1: Build the trie first Make sure pattern insertion and terminal node marking are correct.

Hint 2: Use BFS for failure links Start from root’s children and propagate failures level by level.

Hint 3: Merge output sets Each node’s output should include its failure link’s output.

Hint 4: Test with tiny patterns Two or three patterns are enough to validate correctness.


Books That Will Help

| Topic | Book | Chapter |
| --- | --- | --- |
| Tries and automata | “Algorithms, Fourth Edition” by Sedgewick & Wayne | Tries / string search |
| Multi-pattern matching | Aho & Corasick paper (1975) | Original algorithm |
| C implementation details | “Algorithms in C” by Sedgewick | String processing |

Common Pitfalls & Debugging

Problem 1: “Missed matches”

  • Why: Output sets not propagated along failure links
  • Fix: Merge outputs during failure construction
  • Quick test: Patterns he and she in text she should emit both

Problem 2: “Infinite loop on mismatch”

  • Why: Failure link chain not terminating at root
  • Fix: Ensure root’s failure link is root itself

Problem 3: “Memory explosion”

  • Why: Dense transition tables for large alphabets
  • Fix: Use sparse transitions or compress infrequently used nodes

Definition of Done

  • Trie insertion and failure links are correct and tested
  • Overlapping patterns are detected correctly
  • Output is identical to running each pattern separately
  • Automaton stats (states/transitions) are reported
  • Benchmarks show near-linear scaling with pattern count

## Project 4: NFA Regex Engine (Thompson’s Construction)

  • File: TEXT_SEARCH_TOOLS_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go, Zig
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 1. The “Resume Gold” (Educational/Personal Brand)
  • Difficulty: Level 4: Expert
  • Knowledge Area: Compilers / Automata Theory
  • Software or Tool: Custom Implementation
  • Main Book: “Compilers: Principles, Techniques, and Tools” by Aho, Sethi, Ullman (Dragon Book)

What you’ll build: A regex engine using Thompson’s NFA construction. Parse a regex, convert to NFA, and simulate the NFA on input text.

Why it teaches search tools: This is the foundation of RE2, Rust’s regex, and one mode of GNU grep. Understanding NFA simulation is crucial to understanding why some regex engines are fast and others have “catastrophic backtracking.”

Core challenges you’ll face:

  • Regex parsing → maps to recursive descent parsing
  • NFA construction (concatenation, alternation, Kleene star) → maps to automata construction
  • NFA simulation (tracking multiple states) → maps to set operations
  • ε-closure computation → maps to graph traversal

Key Concepts:

  • Thompson’s Construction: “Compilers: Principles, Techniques, and Tools” Chapter 3.7 - Dragon Book
  • NFA Simulation: “Regular Expression Matching Can Be Simple And Fast” - Russ Cox (swtch.com/~rsc/regexp/regexp1.html)
  • ε-transitions: How to handle “empty” transitions efficiently
  • State Set Representation: Bit vectors for O(1) state operations

Difficulty: Expert Time estimate: 2-3 weeks Prerequisites: Projects 1-3, understanding of automata theory basics

Real world outcome:

$ ./nfa_regex "a(b|c)*d" input.txt
Parsing regex: a(b|c)*d
NFA states: 8
NFA transitions: 11

Found match "abcd" at position 100
Found match "abbbcd" at position 200
Found match "acbcbd" at position 300
Search completed in 15ms

NFA visualization (DOT format):
  0 -a-> 1
  1 -ε-> 2
  1 -ε-> 4
  1 -ε-> 7
  2 -b-> 3
  4 -c-> 5
  3 -ε-> 6
  5 -ε-> 6
  6 -ε-> 1
  6 -ε-> 7
  7 -d-> 8 (accepting)

Pathological test:
  Pattern: (a+)+$
  Input: "aaaaaaaaaaaaaaaaaaaaaaaaaaaa!"
  Time: 0.1ms  # Still fast! No backtracking!

Implementation Hints: Each regex operator becomes an NFA fragment:

  • Literal ‘a’: State with single outgoing edge labeled ‘a’
  • Concatenation AB: Connect A’s accepting state to B’s start state with ε
  • Alternation A|B: New start state with ε to both A and B starts
  • Kleene star A*: ε to A, ε from A’s accept back to start, ε to skip A entirely

Simulation: Maintain a SET of current states. For each input character, compute the set of states reachable by following that character, then compute the ε-closure. Accept if final set contains an accepting state.
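
A minimal simulation sketch under simplifying assumptions: states are numbered, each state has at most two ε-successors (which Thompson fragments naturally produce), and the active set is a plain boolean array where a production engine would use bit vectors:

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAXSTATES 64

typedef struct {
    int eps[MAXSTATES][2];        // ε-edges per state (-1 = none)
    int on_char[MAXSTATES][256];  // byte transitions (-1 = none)
    int nstates, start, accept;
} Nfa;

// Add a state and everything reachable from it by ε-edges (the ε-closure).
static void add_state(const Nfa *n, bool set[], int s) {
    if (s < 0 || set[s]) return;
    set[s] = true;
    add_state(n, set, n->eps[s][0]);
    add_state(n, set, n->eps[s][1]);
}

// Simulate all NFA paths at once: O(len × nstates), no backtracking.
// This checks a whole-input match; a searcher would also seed the start
// state at every input position.
bool nfa_match(const Nfa *n, const unsigned char *input, size_t len) {
    bool cur[MAXSTATES] = {false}, next[MAXSTATES];
    add_state(n, cur, n->start);
    for (size_t i = 0; i < len; i++) {
        memset(next, 0, sizeof next);
        for (int s = 0; s < n->nstates; s++)
            if (cur[s])
                add_state(n, next, n->on_char[s][input[i]]);
        memcpy(cur, next, sizeof cur);
    }
    return cur[n->accept];
}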

Learning milestones:

  1. Parses regex correctly → You understand recursive descent
  2. Constructs NFA correctly → You understand Thompson’s algorithm
  3. No catastrophic backtracking → You understand why NFAs are O(n×m)

Real World Outcome

You will have a working regex engine that accepts a subset of regex syntax (., *, |, grouping) and matches input in guaranteed linear time. You will be able to show that your engine handles pathological regexes instantly while backtracking engines blow up.

Command Line Outcome Example:

$ ./nfa_regex "(a+)+$" "aaaaaaaaaaaaaaaaaaaa!"
Match: false
Time: 0.4ms

$ python - <<'PY'
import re, time
pat = re.compile(r"(a+)+$")
text = "a"*24 + "!"
start = time.time(); pat.search(text); print(time.time()-start)
PY
# Often seconds on some engines

The Core Question You’re Answering

“How can I implement regex matching that is fast and safe against catastrophic backtracking?”

This project proves that regex can be linear-time by compiling to an NFA and simulating all paths at once.


Concepts You Must Understand First

  1. Thompson construction
    • How do you build NFAs from literals, concatenation, alternation, and star?
    • Book Reference: “Compilers: Principles and Practice” — regex to automata
  2. Epsilon closure
    • How do you compute all states reachable without consuming input?
  3. NFA simulation
    • How do you update the active state set per input character?

Questions to Guide Your Design

  1. State representation
    • Will you represent active states as a bitset or vector?
    • How do you size your state IDs?
  2. Regex features
    • Which operators are in scope (., *, +, ?, |)?
    • How will you handle escaping and grouping?
  3. Match semantics
    • Do you return first match or all matches?
    • How do you report match boundaries?

Thinking Exercise

The NFA Walkthrough

Build the Thompson NFA for regex a(b|c)*d on paper. Trace its active states for input abcbcd.


The Interview Questions They’ll Ask

  1. “How does Thompson NFA avoid catastrophic backtracking?”
  2. “What is epsilon closure and why do we need it?”
  3. “What regex features break regularity (and why)?”
  4. “Explain the difference between NFA and DFA matching.”
  5. “Why is NFA simulation linear in the input length?”

Hints in Layers

Hint 1: Start with concatenation only Implement literals and concatenation before you add alternation and star.

Hint 2: Use explicit epsilon edges You will need them to connect fragments and compute closures.

Hint 3: Precompute epsilon closure per state This speeds up the main loop dramatically.

Hint 4: Compare against a known engine Use RE2 or Rust regex for correctness tests.


Books That Will Help

| Topic | Book | Chapter |
| --- | --- | --- |
| Regex to automata | “Compilers: Principles and Practice” | Regex to automata chapters |
| Finite automata | “Compilers: Principles and Practice” | DFA/NFA sections |
| Safe regex design | Russ Cox regex articles | Online |

Common Pitfalls & Debugging

Problem 1: “Matches are missing”

  • Why: Epsilon closure not computed correctly
  • Fix: Ensure you include all reachable epsilon transitions

Problem 2: “Regex matches too much”

  • Why: Accept state reachable without consuming input
  • Fix: Check fragment wiring and epsilon edges

Problem 3: “Slow matching”

  • Why: Active set duplicates or grows unbounded
  • Fix: Use bitsets or deduplicate state sets each step

Definition of Done

  • Supports literals, concatenation, alternation, and Kleene star
  • Correct on a suite of 50+ regex tests
  • Linear-time behavior on pathological inputs
  • Reports match boundaries correctly
  • CLI exposes a debug mode to print NFA states

## Project 5: DFA Regex Engine with Lazy Construction

  • File: TEXT_SEARCH_TOOLS_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++, Go
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 4: Expert
  • Knowledge Area: Compilers / Automata Theory
  • Software or Tool: Custom Implementation
  • Main Book: “Introduction to the Theory of Computation” by Michael Sipser

What you’ll build: Convert NFA to DFA using the subset construction, but build states lazily (on-demand) to avoid exponential blowup. Cache states for reuse.

Why it teaches search tools: This is how production regex engines like RE2 and Rust’s regex work. DFA simulation is a simple table lookup per character—O(n) guaranteed. But DFAs can have exponentially many states, so lazy construction is essential.

Core challenges you’ll face:

  • Subset construction algorithm → maps to powerset of NFA states
  • State caching/memoization → maps to hash tables for state sets
  • Cache eviction under memory pressure → maps to LRU caches
  • Handling exponential blowup → maps to falling back to NFA

Key Concepts:

  • Subset Construction: “Introduction to the Theory of Computation” Chapter 1.2 - Sipser
  • Lazy DFA: “Regular Expression Matching: the Virtual Machine Approach” - Russ Cox
  • State Set Hashing: Efficiently comparing and storing state sets
  • Cache Management: When to evict cached DFA states

Difficulty: Expert Time estimate: 2-3 weeks Prerequisites: Project 4 (NFA Engine)

Real world outcome:

$ ./dfa_regex "a(b|c)*d" input.txt
NFA states: 8
DFA cache: lazy construction enabled

Searching...
DFA state 0 (NFA: {0}) -> 'a' -> DFA state 1 (NFA: {1,2,4,6,7})
DFA state 1 -> 'b' -> DFA state 2 (NFA: {1,2,3,4,6,7})
[States constructed on demand]

Found match "abcd" at position 100
Search completed in 5ms (vs 15ms NFA simulation)

Cache statistics:
  DFA states constructed: 4
  DFA states possible (worst case): 256 (2^8)
  Cache hits: 45,231
  Cache misses: 4

Benchmark:
  NFA simulation: 15ms
  Lazy DFA: 5ms
  Full DFA (if we built it all): 3ms but 2MB memory

Implementation Hints: Each DFA state represents a SET of NFA states. When you see character ‘c’ in DFA state S, compute which NFA states are reachable from any state in S via ‘c’, take the ε-closure, and that’s your new DFA state.

The trick: Use a hash table to check if you’ve seen this state set before. If so, reuse the existing DFA state. If not, create a new one and cache it.

For the cache, use LRU eviction—if memory gets tight, throw away the least-recently-used DFA states. They’ll be reconstructed if needed.
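
A minimal interning sketch under two simplifying assumptions: the NFA is small enough for a state set to fit in a 64-bit mask, and lookup is a linear scan where a real engine would use the hash table described above:

#include <stdint.h>
#include <string.h>

#define MAX_DFA 1024                 // hard cap: beyond this, fall back to NFA

typedef struct {
    uint64_t nfa_set[MAX_DFA];       // which NFA states each DFA state stands for
    int next[MAX_DFA][256];          // transitions; -1 = not built yet
    int count;
} LazyDfa;

// Return the DFA state for this NFA set, creating it on first sight.
// Returns -1 when the cap is hit so the caller can fall back to the NFA.
int intern_state(LazyDfa *d, uint64_t set) {
    for (int i = 0; i < d->count; i++)
        if (d->nfa_set[i] == set)
            return i;                        // cache hit: reuse existing state
    if (d->count == MAX_DFA)
        return -1;                           // state budget exhausted
    int id = d->count++;
    d->nfa_set[id] = set;
    memset(d->next[id], -1, sizeof d->next[id]);
    return id;
}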

Learning milestones:

  1. Subset construction produces correct DFA → You understand the powerset
  2. Lazy construction avoids memory blowup → You understand caching
  3. Performance matches or beats NFA → You understand the tradeoff

Real World Outcome

You will have a regex engine that can generate DFA states on demand and still remain safe from state explosion. You’ll be able to demonstrate a performance jump for repeated searches, while keeping memory usage bounded.

Command Line Outcome Example:

$ ./lazy_dfa "foo.*bar" logs.txt --stats
NFA states: 42
DFA states generated: 68
Matches: 120
Time: 5.4ms

$ ./lazy_dfa "(ab|ac|ad|ae)*z" logs.txt --stats
NFA states: 96
DFA states generated: 512 (capped)
Fallback: NFA simulation for remaining states
Time: 12.7ms

The Core Question You’re Answering

“How can I get DFA speed without paying DFA memory costs up front?”

Lazy DFA construction is the compromise between raw performance and exponential state growth.


Concepts You Must Understand First

  1. Subset construction
    • How does a DFA state represent a set of NFA states?
    • Book Reference: “Compilers: Principles and Practice” — NFA to DFA
  2. State explosion
    • Why can the number of DFA states be exponential?
  3. Hybrid engines
    • When do you fall back to NFA simulation?

Questions to Guide Your Design

  1. State caching
    • Will you hash DFA states by their NFA set bitmask?
  2. Size limits
    • What is your maximum DFA state budget?
    • How will you handle overflow (fallback or eviction)?
  3. Unicode handling
    • Will you operate on bytes or Unicode scalar values?

Thinking Exercise

The Explosion Case

Find a regex that produces many DFA states (e.g., nested alternations). Estimate how many subsets are possible.


The Interview Questions They’ll Ask

  1. “Why does DFA construction cause state explosion?”
  2. “How does a lazy DFA avoid that explosion?”
  3. “When would you prefer NFA over DFA?”
  4. “How do you cache DFA states efficiently?”

Hints in Layers

Hint 1: Represent DFA states as bitsets A DFA state is a set of NFA states; bitsets are compact and fast.

Hint 2: Use a hashmap for state interning Map bitset -> DFA state ID to avoid duplicates.

Hint 3: Add a hard cap If state count exceeds your cap, revert to NFA for the rest.


Books That Will Help

| Topic | Book | Chapter |
| --- | --- | --- |
| DFA construction | “Compilers: Principles and Practice” | NFA to DFA sections |
| Automata theory | “Engineering a Compiler” | Automata chapters |
| Regex performance | RE2 documentation | Online |

Common Pitfalls & Debugging

Problem 1: “Incorrect matches”

  • Why: DFA state sets computed incorrectly
  • Fix: Compare against NFA simulation for the same regex

Problem 2: “Memory blowup”

  • Why: Unlimited DFA state creation
  • Fix: Add a state cap and fallback

Problem 3: “Unicode bugs”

  • Why: Mixing bytes and Unicode code points
  • Fix: Decide on byte-level or Unicode matching and test thoroughly

Definition of Done

  • Lazy DFA generates states on demand with caching
  • Correctness matches NFA simulation on test suite
  • DFA state cap prevents memory blowups
  • Benchmarks show speedup on repeated searches
  • Debug output lists DFA state count and fallback usage

## Project 6: Literal Extraction Optimizer

  • File: TEXT_SEARCH_TOOLS_DEEP_DIVE.md
  • Main Programming Language: Rust (or C)
  • Alternative Programming Languages: C++, Go
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Regex Optimization
  • Software or Tool: Custom Implementation
  • Main Book: “Regular Expression Matching” by Russ Cox (online articles)

What you’ll build: Analyze a regex to extract literal strings that MUST appear in any match. Use these literals to quickly skip non-matching regions before running the full regex.

Why it teaches search tools: This is ripgrep’s secret weapon. Most regexes have literal prefixes, suffixes, or required substrings. Find those with memchr/SIMD, THEN run the regex engine. This is why ripgrep searching for function.*async is almost as fast as searching for function.

Core challenges you’ll face:

  • Prefix extraction → maps to “what literal MUST the match start with?”
  • Suffix extraction → maps to “what literal MUST the match end with?”
  • Required substring extraction → maps to “what literal MUST appear somewhere?”
  • Handling alternation → (foo|bar).*baz has no common prefix, but “baz” is required

Key Concepts:

  • Literal Extraction: regex-syntax crate documentation (Rust)
  • Prefix/Suffix Analysis: Walk the regex AST looking for required literals
  • Prefiltering: Run memchr on literals, verify with full regex

Difficulty: Advanced Time estimate: 1-2 weeks Prerequisites: Projects 4-5, understanding of regex ASTs

Real world outcome:

$ ./literal_optimizer "function\s+\w+\s*\(.*async" input.js
Regex analysis:
  Prefix literals: ["function"]
  Suffix literals: ["async"]
  Required substrings: ["function", "async"]

Strategy:
  1. memchr for "function"
  2. If found, check if "async" exists after it
  3. If both found, run full regex on candidate region

Benchmark (10MB JavaScript file):
  Full regex scan: 150ms
  Literal-optimized: 8ms
  Speedup: 18.75x

Analysis of common patterns:
  "TODO": 100% literal, use memchr directly
  "error.*line \d+": prefix "error", can skip most text
  "[a-z]+@[a-z]+\.[a-z]+": no useful literals (email regex)
  "\berror\b": literal "error", boundary check at matches

Implementation Hints: Walk the regex syntax tree:

  • Literal nodes: Accumulate into current prefix/suffix
  • Concatenation: Prefix of A, suffix of B
  • Alternation: Common prefix/suffix of all branches (may be empty)
  • Kleene star: Can be skipped, so it “resets” the required substring

The extracted literals are passed to memchr (for single byte), memmem (for substring), or Aho-Corasick (for multiple literals).
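
A minimal sketch of the resulting two-phase search. memmem is a GNU/BSD extension (hence _GNU_SOURCE), and verify stands in for the regex engine you built in Projects 4 and 5:

#define _GNU_SOURCE                  // for memmem on glibc
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

// Scan for a required literal, then run the expensive verifier only on the
// enclosing line of each candidate. Returns the number of verified matches.
size_t prefilter_count(const char *hay, size_t n,
                       const char *lit, size_t m,
                       bool (*verify)(const char *line, size_t len)) {
    size_t matches = 0;
    const char *p = hay, *end = hay + n;
    while (m > 0 && p < end &&
           (p = memmem(p, (size_t)(end - p), lit, m)) != NULL) {
        const char *ls = p;                       // expand to start of line
        while (ls > hay && ls[-1] != '\n') ls--;
        const char *le = memchr(p, '\n', (size_t)(end - p));
        if (!le) le = end;                        // last line has no newline
        if (verify(ls, (size_t)(le - ls)))
            matches++;                            // candidate confirmed by the regex
        p = le + (le < end);                      // resume after this line
    }
    return matches;
}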

Learning milestones:

  1. Correctly extracts prefix from simple regexes → You understand the concept
  2. Handles alternation and quantifiers → You understand the complexity
  3. 10x+ speedup on real regexes → You’ve internalized the optimization

Phase 3: SIMD and Low-Level Optimization

Real World Outcome

You will have a tool that analyzes a regex and prints the literals it can use as prefilters, then shows the speed difference when those literals are used to prune candidates.

Command Line Outcome Example:

$ ./literal_extract "error.*timeout|panic:.*"
Required literals: [error, timeout, panic:]
Prefilter strategy: choose rarest literal by frequency

$ ./prefilter_search --regex "error.*timeout" logs.txt
Prefilter hits: 432
Regex verifications: 432
Matches: 31
Time: 6.1ms

$ ./prefilter_search --regex "error.*timeout" logs.txt --no-prefilter
Regex verifications: 2,100,314
Matches: 31
Time: 220ms

The Core Question You’re Answering

“How can I avoid running an expensive regex on most of the text?”

Literal extraction transforms regex search into a two-phase pipeline: fast filtering followed by precise verification.


Concepts You Must Understand First

  1. Regex AST
    • How do you parse and traverse regex structure?
  2. Required vs optional literals
    • Which substrings must appear in any match?
  3. Prefilter strategies
    • How do you choose a literal (rarest, longest, earliest)?

Questions to Guide Your Design

  1. Literal extraction rules
    • How do you handle alternation?
    • What about .* gaps or character classes?
  2. Prefilter choice
    • Will you choose the longest literal or the rarest one?
    • How will you estimate rarity (frequency table)?
  3. Verification
    • How do you pass candidate offsets to the regex engine?

Thinking Exercise

Regex Dissection

For regex foo(bar|baz).*qux:

  • What literals are required in all matches?
  • What literals are optional?

The Interview Questions They’ll Ask

  1. “What is literal extraction and why does it help?”
  2. “How do you choose a prefilter literal?”
  3. “How do alternations affect literal extraction?”
  4. “Can literal extraction ever change correctness?”

Hints in Layers

Hint 1: Build a regex AST first Literal extraction is a tree traversal problem.

Hint 2: Use set operations For alternation, required literals are the intersection of branch literals.

Hint 3: Start with simple literals Handle concatenation and .* before complex classes.


Books That Will Help

| Topic | Book | Chapter |
| --- | --- | --- |
| Regex parsing | “Compilers: Principles and Practice” | Regex to automata chapters |
| Performance design | “Clean Architecture” by Robert C. Martin | Design tradeoffs (optional) |
| Regex internals | burntsushi regex internals | Online |

Common Pitfalls & Debugging

Problem 1: “False negatives”

  • Why: Treated optional literals as required
  • Fix: Only require literals that appear in every match

Problem 2: “No speedup”

  • Why: Choosing a common literal (e.g., ‘e’)
  • Fix: Select the rarest literal using frequency estimation

Problem 3: “Regex verification incorrect”

  • Why: Candidate offset boundaries miscomputed
  • Fix: Always verify full regex on candidate region

Definition of Done

  • Extracts required literals from concatenation and alternation
  • Chooses a prefilter literal based on frequency or length
  • Prefilter reduces regex verifications by at least 10x on a test corpus
  • No false negatives when prefiltering
  • Debug output prints extracted literals and chosen prefilter

## Project 7: SIMD memchr Implementation

  • File: TEXT_SEARCH_TOOLS_DEEP_DIVE.md
  • Main Programming Language: C (with intrinsics)
  • Alternative Programming Languages: Rust, Assembly
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: SIMD / Low-Level Optimization
  • Software or Tool: SSE2/AVX2 Intrinsics
  • Main Book: “Intel Intrinsics Guide” (online reference)

What you’ll build: A SIMD-accelerated memchr that finds a byte in a buffer at 15+ bytes per CPU cycle using SSE2/AVX2.

Why it teaches search tools: memchr is the foundation of all fast search. If you can find a byte at 15 bytes/cycle, you can prefilter gigabytes of text per second, only running the expensive regex engine on the tiny fraction that might match.

Core challenges you’ll face:

  • SIMD vector loading (_mm_loadu_si128) → maps to loading 16 bytes at once
  • Parallel comparison (_mm_cmpeq_epi8) → maps to comparing 16 bytes simultaneously
  • Mask extraction (_mm_movemask_epi8) → maps to converting results to bitmask
  • Bit scanning (__builtin_ctz) → maps to finding first match in mask
  • Alignment handling → maps to memory alignment requirements

Key Concepts:

  • SSE2 Intrinsics: Intel Intrinsics Guide (intel.com)
  • Vector Comparison: Compare 16/32 bytes in a single instruction
  • Movemask: Convert comparison result to 16/32-bit integer
  • Bit Manipulation: Find first set bit in mask

Difficulty: Expert. Time estimate: 1 week. Prerequisites: Understanding of pointers, basic assembly concepts.

Real world outcome:

$ ./simd_memchr '\n' large_file.txt
Searching for newlines in 1GB file...

Implementation comparison:
  Naive (byte-by-byte): 1200ms (0.83 GB/s)
  Standard memchr:       150ms (6.67 GB/s)
  SSE2 memchr:            80ms (12.5 GB/s)
  AVX2 memchr:            45ms (22.2 GB/s)

Throughput: 22.2 GB/s = ~9 bytes per cycle on 2.5GHz CPU

Disassembly of hot loop:
  vmovdqu  ymm0, [rdi]        ; Load 32 bytes
  vpcmpeqb ymm1, ymm0, ymm2   ; Compare with search byte
  vpmovmskb eax, ymm1         ; Extract comparison mask
  test     eax, eax           ; Any matches?
  jnz      found              ; If yes, handle match
  add      rdi, 32            ; Advance pointer
  jmp      loop               ; Continue

Implementation Hints:

// SSE2 memchr core loop, wrapped so it compiles as-is
#include <emmintrin.h>   // SSE2 intrinsics
#include <stddef.h>

const char *simd_memchr_sse2(const char *ptr, const char *end, char c) {
    __m128i needle = _mm_set1_epi8(c);  // Broadcast search byte to all 16 lanes
    while (ptr + 16 <= end) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)ptr);  // Load 16 bytes
        __m128i cmp = _mm_cmpeq_epi8(chunk, needle);            // Compare all 16
        int mask = _mm_movemask_epi8(cmp);                      // Get result as bits
        if (mask != 0)
            return ptr + __builtin_ctz(mask);  // Return first match
        ptr += 16;
    }
    for (; ptr < end; ptr++)  // Handle remaining bytes with scalar loop
        if (*ptr == c)
            return ptr;
    return NULL;  // Not found
}

Learning milestones:

  1. SSE2 version works correctly → You understand SIMD basics
  2. AVX2 version achieves 20+ GB/s → You understand wider vectors
  3. Handles alignment and tail bytes → You understand edge cases

Real World Outcome

You will have a SIMD-accelerated memchr clone and a benchmark harness that demonstrates vectorized speedups over byte-by-byte scanning.

Command Line Outcome Example:

$ ./simd_memchr x large.txt --stats
Matches: 18234
Bytes scanned: 1.0GB
Time: 12.4ms
Throughput: 80.6 GB/s (memory-bound)

$ ./simd_memchr x large.txt --baseline
Scalar memchr: 92.1ms
Speedup: 7.4x

The Core Question You’re Answering

“How do I compare many bytes at once safely and portably?”

This project is your entry point to SIMD: the foundational technique behind modern text search speed.


Concepts You Must Understand First

  1. Vector registers (SSE2/AVX2)
    • How many bytes do you compare per instruction?
  2. Bitmask extraction
    • How do you turn vector comparisons into positions?
  3. Alignment and tails
    • How do you handle the last few bytes safely?

Questions to Guide Your Design

  1. Portability
    • Will you use intrinsics, compiler builtins, or a library?
  2. Safety
    • How will you avoid reading past the end of the buffer?
  3. Benchmarking
    • How will you avoid measuring I/O instead of compute?

Thinking Exercise

Vector Mask Walkthrough

Given a 16-byte chunk and a target byte, compute the comparison mask by hand and locate the set bits.


The Interview Questions They’ll Ask

  1. “What is SIMD and why does it help string searching?”
  2. “How do you handle tail bytes when scanning with SIMD?”
  3. “Why can SIMD become memory-bandwidth bound?”

Hints in Layers

Hint 1: Start with SSE2 128-bit. SSE2 is part of the x86_64 baseline, so it is available on every x86_64 CPU; an AVX2 widening of the same loop is sketched after these hints.

Hint 2: Use unaligned loads. Unaligned loads are fast on modern CPUs and simplify code.

Hint 3: Use ctz on the mask. It extracts candidate positions quickly.

Hint 4: Build a scalar fallback. Use scalar code for the final tail bytes.
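
Building on Hint 1, here is the same loop widened to 32-byte vectors. A sketch assuming AVX2 hardware and a compiler flag such as -mavx2; a production build would select this path at runtime via CPUID detection:

// AVX2 variant of the core loop: 32 bytes per iteration.
#include <immintrin.h>
#include <stddef.h>

const char *simd_memchr_avx2(const char *ptr, const char *end, char c) {
    __m256i needle = _mm256_set1_epi8(c);             // broadcast to 32 lanes
    while (ptr + 32 <= end) {
        __m256i chunk = _mm256_loadu_si256((const __m256i *)ptr);
        unsigned mask = (unsigned)_mm256_movemask_epi8(
            _mm256_cmpeq_epi8(chunk, needle));        // one bit per lane
        if (mask)
            return ptr + __builtin_ctz(mask);         // first matching byte
        ptr += 32;
    }
    for (; ptr < end; ptr++)                          // scalar tail
        if (*ptr == c) return ptr;
    return NULL;
}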


Books That Will Help

| Topic | Book | Chapter |
| --- | --- | --- |
| SIMD fundamentals | “Low-Level Programming” by Zhirkov | SIMD chapters |
| Memory hierarchy | “Computer Systems: A Programmer’s Perspective” | Ch. 6 |
| Assembly patterns | “Write Great Code” by Randall Hyde | Instruction-level sections |

Common Pitfalls & Debugging

Problem 1: “Segfault at end of buffer”

  • Why: Over-reading past the buffer end
  • Fix: Use a tail scalar loop for the last bytes

Problem 2: “Wrong match positions”

  • Why: Incorrect bitmask interpretation (endianness/bit order)
  • Fix: Verify mask bits with unit tests

Problem 3: “No speedup”

  • Why: Benchmark includes file I/O or small data sizes
  • Fix: Use in-memory buffers and large sizes

Definition of Done

  • SIMD path used for full vector blocks
  • Scalar tail handles remaining bytes safely
  • Benchmark demonstrates speedup over scalar loop
  • Unit tests confirm match positions
  • Fallback implementation exists for non-SIMD builds

## Project 8: SIMD Substring Search (memmem)

  • File: TEXT_SEARCH_TOOLS_DEEP_DIVE.md
  • Main Programming Language: C (with intrinsics) or Rust
  • Alternative Programming Languages: C++, Assembly
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: SIMD / String Algorithms
  • Software or Tool: SSE2/AVX2 Intrinsics
  • Main Book: “Intel Intrinsics Guide” + Hyperscan papers

What you’ll build: A SIMD-accelerated substring search using the “two-byte trick” or similar techniques. Find candidates with SIMD, verify with memcmp.

Why it teaches search tools: memmem is what you use for literal substrings. The two-byte trick: search for two bytes of the needle simultaneously (in practice, usually the first and last byte at their correct offsets), then verify the rest. This is often faster than full Boyer-Moore for short needles.

Core challenges you’ll face:

  • Two-byte candidate finding → maps to reducing false positives
  • Vectorized candidate verification → maps to batched memcmp
  • Handling variable needle lengths → maps to algorithm selection
  • Combining with Boyer-Moore for long needles → maps to hybrid approaches

Key Concepts:

  • Two-Way Algorithm: Used by glibc memmem
  • SIMD Candidate Finding: Search for first+last byte simultaneously
  • False Positive Rate: Why two bytes is much better than one byte
  • Hybrid Strategies: Different algorithms for different needle lengths

Difficulty: Expert. Time estimate: 1-2 weeks. Prerequisites: Project 7 (SIMD memchr).

Real world outcome:

$ ./simd_memmem "function" source_code.txt
Searching for "function" in 100MB file...

Algorithm selection:
  Needle length: 8 bytes
  Strategy: SIMD two-byte + verify

Benchmark:
  Naive strstr:      250ms
  glibc memmem:       45ms
  SIMD two-byte:      12ms
  Boyer-Moore:        18ms (better for longer needles)

False positive analysis:
  First byte 'f' matches: 2,345,678 times
  First two bytes "fu" match: 12,345 times (190x reduction!)
  Full "function" matches: 8,234 times

Disassembly insight:
  Searching for 'f' AND 'n' (first and last byte) simultaneously
  reduces candidates by 99.5%

Implementation Hints: The two-byte trick: Instead of searching for just the first byte, search for the first AND last byte of the needle at the correct offset. If needle is “hello” (length 5), search for positions where byte[i] == ‘h’ AND byte[i+4] == ‘o’. This dramatically reduces false positives.

With SIMD, you can check 16-32 positions simultaneously for both conditions, then AND the masks together.
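
A sketch of that idea with SSE2, checking the first and last needle bytes at 16 candidate positions per iteration and verifying hits with memcmp. The function name and structure are illustrative:

// First+last byte SIMD prefilter with memcmp verification (sketch).
#include <emmintrin.h>
#include <stddef.h>
#include <string.h>

const char *simd_memmem_sse2(const char *hay, size_t hay_len,
                             const char *needle, size_t n) {
    if (n == 0 || hay_len < n) return n == 0 ? hay : NULL;
    __m128i first = _mm_set1_epi8(needle[0]);      // broadcast first needle byte
    __m128i last  = _mm_set1_epi8(needle[n - 1]);  // broadcast last needle byte
    const char *end = hay + hay_len - n;           // last valid start position
    const char *p = hay;
    while (p + 15 <= end) {  // 16 candidate starts, all loads stay in bounds
        __m128i a = _mm_loadu_si128((const __m128i *)p);            // bytes at i
        __m128i b = _mm_loadu_si128((const __m128i *)(p + n - 1));  // bytes at i+n-1
        int mask = _mm_movemask_epi8(
            _mm_and_si128(_mm_cmpeq_epi8(a, first), _mm_cmpeq_epi8(b, last)));
        while (mask) {                   // each set bit = one candidate start
            int i = __builtin_ctz(mask);
            if (memcmp(p + i, needle, n) == 0) return p + i;  // verify full needle
            mask &= mask - 1;            // clear lowest set bit, try next
        }
        p += 16;
    }
    for (; p <= end; p++)                // scalar tail for the final starts
        if (memcmp(p, needle, n) == 0) return p;
    return NULL;
}

For "hello" (length 5), this checks byte[i] == 'h' and byte[i+4] == 'o' in one vector pass; only positions where both hit pay for a memcmp.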

Learning milestones:

  1. Two-byte search works correctly → You understand the optimization
  2. Faster than Boyer-Moore for short needles → You understand the tradeoff
  3. Hybrid approach selects best algorithm → You understand adaptive strategies

Real World Outcome

You will have a SIMD-accelerated memmem that can find multi-byte substrings efficiently and demonstrate how a two-byte prefilter reduces candidate checks.

Command Line Outcome Example:

$ ./simd_memmem --pattern "needle" big.txt
Matches: 421
Bytes scanned: 2.0GB
Time: 35ms

$ ./simd_memmem --pattern "needle" big.txt --baseline
Scalar memmem: 280ms
Speedup: 8x

The Core Question You’re Answering

“How can I extend SIMD from single-byte search to full substrings?”

This project shows how to turn SIMD hits into verified substring matches without re-scanning everything.


Concepts You Must Understand First

  1. Two-byte prefiltering
    • Why check the last byte (or two) first?
  2. Candidate verification
    • How do you confirm the rest of the pattern quickly?
  3. Overlap handling
    • How do you avoid missing overlapping matches?

Questions to Guide Your Design

  1. Prefilter choice
    • Which bytes of the pattern should you scan for?
  2. Alignment
    • How do you align candidate positions for verification?
  3. Tail handling
    • How do you handle the last chunk safely?

Thinking Exercise

Candidate Explosion

If your prefilter byte is too common (like ‘e’), how many candidates do you expect in English text? How does that affect performance?


The Interview Questions They’ll Ask

  1. “How does SIMD memmem differ from SIMD memchr?”
  2. “Why do we still need verification after a SIMD hit?”
  3. “What is the benefit of checking the last byte first?”

Hints in Layers

Hint 1: Start with a last-byte prefilter. This mirrors the Boyer-Moore intuition.

Hint 2: Use SIMD for candidate positions. Scan for the last byte, then check the preceding bytes.

Hint 3: Optimize verification. Use memcmp or small SIMD compares to check the rest.


Books That Will Help

| Topic | Book | Chapter |
| --- | --- | --- |
| Substring search | “Algorithms, Fourth Edition” | String search |
| SIMD techniques | “Low-Level Programming” | SIMD chapters |
| Memory performance | “Computer Systems: A Programmer’s Perspective” | Ch. 6 |

Common Pitfalls & Debugging

Problem 1: “Missing matches near buffer end”

  • Why: Tail handling incorrect
  • Fix: Add scalar checks for the final bytes

Problem 2: “False positives”

  • Why: Prefilter hit without full verification
  • Fix: Always check full pattern before reporting

Problem 3: “Slow performance”

  • Why: Prefilter byte too common
  • Fix: Choose rarest byte or use two-byte prefilter

Definition of Done

  • SIMD prefilter identifies candidate positions
  • Full pattern verification prevents false positives
  • Benchmarks show speedup vs scalar memmem
  • Handles overlapping matches correctly
  • Scalar tail handles end-of-buffer safely

## Project 9: Teddy Algorithm (Multi-Literal SIMD)

  • File: TEXT_SEARCH_TOOLS_DEEP_DIVE.md
  • Main Programming Language: Rust
  • Alternative Programming Languages: C++
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 5: Master
  • Knowledge Area: SIMD / Multiple Pattern Matching
  • Software or Tool: SSSE3/AVX2 (PSHUFB instruction)
  • Main Book: Hyperscan papers + Rust memchr crate source code

What you’ll build: The Teddy algorithm from Intel’s Hyperscan, which finds multiple short literal patterns simultaneously using the PSHUFB instruction for parallel table lookups.

Why it teaches search tools: Teddy is what ripgrep uses when you have a small number of literal patterns (up to ~64). It “completely blows away the competition for short substrings” according to ripgrep’s author.

Core challenges you’ll face:

  • PSHUFB as a lookup table → maps to parallel 16-way lookup
  • Fingerprint-based candidate detection → maps to bloom filter-like behavior
  • Bucket management → maps to handling false positives
  • Verification phase → maps to confirming actual matches

Key Concepts:

  • PSHUFB Instruction: Use SIMD register as 16-entry lookup table
  • Fingerprinting: Map patterns to fingerprints, search for fingerprints
  • Bucket Disambiguation: Multiple patterns can share a fingerprint
  • Teddy Paper: “Teddy: An Efficient SIMD-based Literal Matching Engine” - Intel

Difficulty: Master. Time estimate: 3-4 weeks. Prerequisites: Projects 7-8, deep SIMD understanding.

Real world outcome:

$ ./teddy_search patterns.txt haystack.txt
Loading 32 patterns from patterns.txt...
Patterns: ["error", "warning", "fatal", "debug", ...]

Fingerprint table constructed:
  Bucket 0: ["error", "event"]
  Bucket 3: ["warning"]
  Bucket 7: ["fatal", "fail"]
  ...

Searching 1GB haystack...
Candidates found: 145,234
Verified matches: 12,456
Search time: 89ms (11.2 GB/s throughput)

Benchmark comparison (32 patterns, 1GB file):
  Aho-Corasick:  450ms
  Teddy SIMD:     89ms
  Speedup: 5x

Note: Teddy wins for small pattern sets (<~64 patterns)
      Aho-Corasick wins for large pattern sets (1000+)

Implementation Hints: The key insight is that PSHUFB lets you do 16 parallel lookups in a single instruction. Each byte of the input indexes into a 16-byte table.

For each input position, you check if the first 3 bytes match any pattern’s first 3 bytes by having separate lookup tables for each byte position and ANDing the results.

The “fingerprint” is which patterns might match at this position. You then verify the full patterns in a scalar loop.
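
A minimal sketch of the nibble-split PSHUFB lookup at the heart of this step. The two tables (bucket bitmasks indexed by nibble value) are assumed to be precomputed from your pattern set, which is the real work of the compile phase:

// One Teddy-style classification step (SSSE3). For each of 16 input bytes,
// produce a bucket bitmask: bit b set means "some pattern in bucket b could
// begin with this byte". lo_table/hi_table are built from the pattern set.
#include <tmmintrin.h>  // SSSE3: _mm_shuffle_epi8 (PSHUFB)

__m128i teddy_fingerprint(__m128i input, __m128i lo_table, __m128i hi_table) {
    __m128i nib_mask = _mm_set1_epi8(0x0F);
    __m128i lo = _mm_and_si128(input, nib_mask);                    // low nibbles
    __m128i hi = _mm_and_si128(_mm_srli_epi16(input, 4), nib_mask); // high nibbles
    // PSHUFB performs 16 parallel lookups: out[i] = table[nibble[i]].
    // A byte can belong to bucket b only if BOTH nibble lookups agree,
    // so AND the two results; nonzero output bytes mark candidate positions.
    return _mm_and_si128(_mm_shuffle_epi8(lo_table, lo),
                         _mm_shuffle_epi8(hi_table, hi));
}

Repeating this for the first 2-3 pattern bytes (shifting the input between steps) and ANDing those results as well is what cuts the candidate rate down far enough for cheap scalar verification.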

Learning milestones:

  1. PSHUFB-based lookup works → You understand the core trick
  2. Multiple patterns searched simultaneously → You understand fingerprinting
  3. Faster than Aho-Corasick for small sets → You’ve achieved the goal


Real World Outcome

You will implement a multi-literal SIMD prefilter (Teddy-style) that scans for multiple candidate literals at once and dramatically reduces regex verification work.

Command Line Outcome Example:

$ ./teddy_prefilter --literals error warn panic timeout logs.txt
Prefilter hits: 1,204
Regex verifications: 1,204
Matches: 412
Time: 9.2ms

$ ./teddy_prefilter --literals error warn panic timeout logs.txt --single
Single-literal prefilter time: 21.7ms
Speedup: 2.3x

The Core Question You’re Answering

“How do I scan for many literals at SIMD speed without running them one by one?”

Teddy is the answer: combine multiple literal checks into a few vector operations.


Concepts You Must Understand First

  1. Vector bitmask composition
    • How do you merge masks for multiple bytes?
  2. Literal selection
    • How do you choose which literals to include in a SIMD batch?
  3. Candidate verification
    • How do you verify matches after a multi-literal hit?

Questions to Guide Your Design

  1. Batch size
    • How many literals can you process per SIMD pass?
  2. False positives
    • What is the acceptable candidate rate before verification becomes too expensive?
  3. Integration
    • How will you feed prefilter hits into the regex engine?

Thinking Exercise

Mask Combination

Given masks for bytes A, B, C, D, how do you combine them to find positions where any literal could start?


The Interview Questions They’ll Ask

  1. “What is the Teddy algorithm and why is it fast?”
  2. “How does it differ from scanning for one literal at a time?”
  3. “What tradeoffs does multi-literal scanning introduce?”

Hints in Layers

Hint 1: Start with two literals. Prove correctness before scaling up to many literals.

Hint 2: Use bitwise OR for candidate sets. Combine masks to represent multiple literals.

Hint 3: Verify candidates with exact matchers. A prefilter only narrows; it never confirms.


Books That Will Help

| Topic | Book | Chapter |
| --- | --- | --- |
| SIMD and bitmasks | “Low-Level Programming” | SIMD chapters |
| Multi-pattern scanning | Hyperscan papers | Online |
| Practical optimization | “Computer Systems: A Programmer’s Perspective” | Ch. 6 |

Common Pitfalls & Debugging

Problem 1: “Too many false positives”

  • Why: Literals too short or too common
  • Fix: Choose rarer literals, or use more bytes per literal in the prefilter

Problem 2: “Missing matches”

  • Why: Incorrect mask alignment or offsets
  • Fix: Validate candidate positions with a slow checker

Problem 3: “No speedup”

  • Why: Verification cost dominates
  • Fix: Reduce candidates or optimize verifier

Definition of Done

  • Multi-literal SIMD prefilter implemented and tested
  • Candidate verification ensures no false positives
  • Benchmarks show improvement over single-literal scanning
  • Handles at least 8 literals per scan pass
  • Integrated into a regex prefilter pipeline

Phase 4: File System and I/O Optimization

## Project 10: Parallel Directory Traversal

  • File: TEXT_SEARCH_TOOLS_DEEP_DIVE.md
  • Main Programming Language: Rust
  • Alternative Programming Languages: C++, Go
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Concurrency / File Systems
  • Software or Tool: Custom Implementation (or study ignore crate)
  • Main Book: “The Art of Multiprocessor Programming” by Herlihy & Shavit

What you’ll build: A lock-free parallel directory walker that uses a work-stealing queue to distribute directory traversal across multiple threads.

Why it teaches search tools: Directory traversal is often the bottleneck, not the search itself. fd and ripgrep use parallel traversal to saturate modern NVMe SSDs. A single thread can’t keep up with drives that deliver 500K+ IOPS.

Core challenges you’ll face:

  • Work-stealing queues → maps to load balancing
  • Lock-free synchronization → maps to avoiding contention
  • Directory entry batching → maps to reducing syscall overhead
  • Respecting .gitignore → maps to ignore pattern matching

Key Concepts:

  • Work-Stealing: “The Art of Multiprocessor Programming” Chapter 16 - Herlihy & Shavit
  • Crossbeam Deque: Rust’s work-stealing deque implementation
  • getdents64: Batched directory reading syscall
  • Gitignore Semantics: Complex rules about inheritance and negation

Difficulty: Advanced. Time estimate: 2 weeks. Prerequisites: Basic threading, understanding of file systems.

Real world outcome:

$ ./parallel_walker /home/user/code
Walking 2.1 million files across 150,000 directories...

Thread 0: 523,456 files processed
Thread 1: 518,234 files processed
Thread 2: 530,123 files processed
Thread 3: 528,187 files processed

Total time: 1.2 seconds
Throughput: 1.75 million files/second

Benchmark comparison:
  Single-threaded: 4.8 seconds
  4 threads:       1.2 seconds
  Speedup: 4x (linear with thread count!)

Work-stealing statistics:
  Steals performed: 1,234
  Average queue depth: 45 directories

Implementation Hints: Use a deque per thread. When a thread discovers subdirectories, push them to its own deque. When a thread’s deque is empty, steal from another thread’s deque (from the opposite end to reduce contention).

For .gitignore, you need a stack of ignore matchers—each directory can have its own .gitignore that adds to the parent’s rules. Use a rope-like structure or thread-local copies to avoid contention.
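
A simplified sketch of the per-thread deque policy described above. A mutex guards each deque for clarity; production walkers achieve the same policy lock-free (e.g., Chase-Lev deques, as in the crossbeam ecosystem):

// Per-thread deque: owner works at the tail, thieves steal at the head.
#include <pthread.h>
#include <stddef.h>

#define CAP 4096
typedef struct {
    pthread_mutex_t lock;
    char *dirs[CAP];   // directories still to visit (sketch: no overflow check)
    int head, tail;
} Deque;

// Owner: push a newly discovered subdirectory onto its own deque.
void push_local(Deque *d, char *path) {
    pthread_mutex_lock(&d->lock);
    d->dirs[d->tail++] = path;
    pthread_mutex_unlock(&d->lock);
}

// Owner: pop the most recently pushed directory (LIFO end, cache-warm).
char *pop_local(Deque *d) {
    pthread_mutex_lock(&d->lock);
    char *p = (d->tail > d->head) ? d->dirs[--d->tail] : NULL;
    pthread_mutex_unlock(&d->lock);
    return p;
}

// Thief: steal the oldest directory from a victim (FIFO end). Old entries
// tend to be large unexplored subtrees, and stealing from the opposite end
// keeps thieves away from the owner's hot end of the deque.
char *steal(Deque *victim) {
    pthread_mutex_lock(&victim->lock);
    char *p = (victim->tail > victim->head) ? victim->dirs[victim->head++] : NULL;
    pthread_mutex_unlock(&victim->lock);
    return p;
}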

Learning milestones:

  1. Basic parallel traversal works → You understand work distribution
  2. Work-stealing balances load → You understand dynamic scheduling
  3. Respects .gitignore correctly → You understand the complexity

Real World Outcome

You will have a parallel directory walker that can traverse large trees, respect ignore rules, and saturate modern storage with multiple worker threads.

Command Line Outcome Example:

$ ./parallel_walk ~/code --threads 8 --stats
Directories: 12,304
Files: 201,442
Ignored: 1,920,117
Time: 0.72s

$ ./parallel_walk ~/code --threads 1 --stats
Time: 3.84s
Speedup: 5.3x

The Core Question You’re Answering

“How do I traverse massive directory trees fast and correctly?”

Traversal is the hidden bottleneck in many search tools. This project makes it explicit.


Concepts You Must Understand First

  1. Work-stealing scheduling
    • How do idle threads steal from busy ones?
  2. Ignore rules
    • How do .gitignore files compose across directories?
  3. Binary detection
    • How do you skip non-text files safely?

Questions to Guide Your Design

  1. Task granularity
    • Do you schedule by file or by directory?
  2. Rule propagation
    • How will you carry ignore rules into subdirectories?
  3. Thread safety
    • How do you avoid lock contention on shared queues?

Thinking Exercise

The Skewed Tree

Imagine one directory contains 1M files and another contains 10. How does your scheduler avoid one thread doing all the work?


The Interview Questions They’ll Ask

  1. “Why is work-stealing effective for directory traversal?”
  2. “How do ignore rules affect correctness and performance?”
  3. “When is a single-threaded walker acceptable?”
  4. “How would you detect binary files quickly?”

Hints in Layers

Hint 1: Start with a single-threaded walker. Get correctness first, then add parallelism.

Hint 2: Use per-thread deques. Push new subdirectories locally; let other threads steal.

Hint 3: Cache ignore rules per directory. Avoid recomputing patterns for every file.


Books That Will Help

| Topic | Book | Chapter |
| --- | --- | --- |
| Directory traversal APIs | “The Linux Programming Interface” | Directories and links |
| Threading basics | “Advanced Programming in the UNIX Environment” | Threads |
| Systems performance | “Computer Systems: A Programmer’s Perspective” | Storage and I/O |

Common Pitfalls & Debugging

Problem 1: “Threads idle too often”

  • Why: Work is not shared effectively
  • Fix: Implement stealing from the opposite end of the deque

Problem 2: “Ignore rules not applied”

  • Why: Missing rule propagation in subdirectories
  • Fix: Maintain a stack of ignore matchers

Problem 3: “Traversal order inconsistent”

  • Why: Parallel traversal changes ordering
  • Fix: If ordering matters, sort results at the end

Definition of Done

  • Parallel traversal uses work-stealing queues
  • .gitignore and hidden-file rules are respected
  • Binary files are skipped using a documented heuristic
  • Benchmark shows scaling with thread count
  • Traversal results match a single-threaded reference run

## Project 11: mmap vs read Strategy Selector

  • File: TEXT_SEARCH_TOOLS_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Operating Systems / I/O
  • Software or Tool: Custom Implementation
  • Main Book: “Linux System Programming” by Robert Love

What you’ll build: A file reading abstraction that automatically chooses between mmap and buffered read based on file size, access pattern, and system characteristics.

Why it teaches search tools: ripgrep “chooses the best searching strategy for you automatically.” For single large files, mmap wins (no copy into userspace). For many small files, buffered read wins (lower overhead per file).

Core challenges you’ll face:

  • mmap overhead measurement → maps to page table setup cost
  • read buffer sizing → maps to syscall amortization
  • Page fault handling → maps to understanding mmap tradeoffs
  • File size heuristics → maps to when to switch strategies

Key Concepts:

  • mmap Mechanics: “Linux System Programming” Chapter 9 - Robert Love
  • Page Faults: Demand paging and its costs
  • Buffered I/O: “Advanced Programming in the UNIX Environment” Chapter 5 - Stevens & Rago
  • Linus on mmap: Understanding the “virtual memory mapping is very expensive in itself” quote

Difficulty: Advanced. Time estimate: 1 week. Prerequisites: Basic systems programming, understanding of virtual memory.

Real world outcome:

$ ./io_selector --benchmark
Benchmarking I/O strategies...

Single 1GB file:
  mmap:          120ms (8.33 GB/s)
  read (64KB):   145ms (6.90 GB/s)
  Winner: mmap (17% faster)

1000 files, 1MB each:
  mmap:          850ms (1.18 GB/s)
  read (64KB):   320ms (3.12 GB/s)
  Winner: read (62% faster)

10000 files, 10KB each:
  mmap:          2100ms (0.05 GB/s)
  read (64KB):   180ms (0.56 GB/s)
  Winner: read (10x faster!)

Heuristic derived:
  Use mmap when: file_size > 1MB AND file_count < 10
  Use read otherwise
  Crossover point: ~500KB per file

Auto-selection in action:
  ./io_selector search "pattern" ./small_files/  → using read
  ./io_selector search "pattern" huge_file.log   → using mmap

Implementation Hints: mmap’s overhead is per-file (page table setup, TLB shootdowns on unmap). This fixed cost is amortized over large files but dominates for small files.

read’s overhead is per-syscall, but you can amortize it with large buffers. However, you pay a memory copy cost that mmap avoids.

The crossover depends on hardware—measure on target systems and encode heuristics.
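
A minimal POSIX sketch of the selector. The 1 MiB threshold, the function name, and the scan callback are placeholders; the real threshold should come from your benchmark:

// Choose mmap for large files, buffered read otherwise; fall back on failure.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define MMAP_THRESHOLD (1 << 20)  /* placeholder crossover; measure your hardware */

int search_file(const char *path, void (*scan)(const char *, size_t)) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return -1; }

    if (st.st_size >= MMAP_THRESHOLD) {
        void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            scan((const char *)p, (size_t)st.st_size);  // one contiguous view, no copy
            munmap(p, (size_t)st.st_size);
            close(fd);
            return 1;  // handled via mmap
        }
        // mmap failed (special file, etc.): fall through to read()
    }
    char buf[64 * 1024];  // 64KB buffer amortizes syscall cost
    ssize_t nread;
    while ((nread = read(fd, buf, sizeof buf)) > 0)
        scan(buf, (size_t)nread);  // NOTE: matches spanning chunk boundaries
                                   // need carry-over handling in a real searcher
    close(fd);
    return 0;  // handled via read
}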

Learning milestones:

  1. Both strategies implemented correctly → You understand the mechanics
  2. Benchmark identifies crossover point → You understand the tradeoffs
  3. Auto-selection matches or beats manual choice → You’ve built a real optimization

Real World Outcome

You will implement a strategy selector that chooses mmap or read based on file size and workload characteristics, and produce benchmarks that show the crossover point.

Command Line Outcome Example:

$ ./io_selector --file logs/huge.log --pattern ERROR
Strategy: mmap (size = 2.3GB)
Matches: 5,214
Time: 0.91s

$ ./io_selector --file logs/small.txt --pattern ERROR
Strategy: read (size = 12KB)
Matches: 3
Time: 0.3ms

$ ./io_selector --benchmark --dir logs/
Crossover point: ~850KB on this machine

The Core Question You’re Answering

“When is mmap actually faster than read, and how can a tool decide?”

This project forces you to measure rather than assume, and to encode that measurement in a real strategy.


Concepts You Must Understand First

  1. Page cache and faults
    • How does the OS cache file data?
  2. System call overhead
    • What does each read call cost?
  3. Mapping overhead
    • What does mmap cost per file?

Questions to Guide Your Design

  1. Crossover detection
    • How will you measure the size at which mmap wins?
  2. Heuristics
    • Will you use a fixed threshold or adapt at runtime?
  3. Fallback behavior
    • What do you do if mmap fails?

Thinking Exercise

The Many-Small-Files Case

If you have 100,000 files of 4KB each, why is mmap likely slower than read? Estimate the cost.


The Interview Questions They’ll Ask

  1. “Why is mmap faster for large files but slower for small ones?”
  2. “What are page faults and why do they matter?”
  3. “How would you choose the crossover threshold?”

Hints in Layers

Hint 1: Benchmark on your machine. The crossover threshold is hardware-dependent.

Hint 2: Use a fixed threshold first. Start simple, then refine with adaptive measurements.

Hint 3: Add a fallback path. If mmap fails, revert to buffered read.


Books That Will Help

| Topic | Book | Chapter |
| --- | --- | --- |
| File I/O | “Linux System Programming” by Robert Love | File I/O chapters |
| Virtual memory | “Operating System Concepts” | Virtual memory chapters |
| System call cost | “Computer Systems: A Programmer’s Perspective” | Exceptional control flow |

Common Pitfalls & Debugging

Problem 1: “Segfault while using mmap”

  • Why: Access past mapped region
  • Fix: Ensure bounds checks and correct file size

Problem 2: “No performance difference”

  • Why: File size too small or cache effects masking differences
  • Fix: Use large files and drop caches if safe

Problem 3: “mmap fails on some files”

  • Why: File permissions or special files
  • Fix: Detect and fall back to read

Definition of Done

  • Strategy selector chooses mmap or read with clear rules
  • Benchmark shows measured crossover point
  • Handles errors and falls back to safe I/O
  • Works on large files and many small files
  • Documented rationale for chosen threshold

## Project 12: Line-Oriented Search with SIMD Line Counting

  • File: TEXT_SEARCH_TOOLS_DEEP_DIVE.md
  • Main Programming Language: C or Rust
  • Alternative Programming Languages: C++, Zig
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: SIMD / Text Processing
  • Software or Tool: Custom Implementation
  • Main Book: “Intel Intrinsics Guide”

What you’ll build: A SIMD-accelerated newline counter and line boundary finder that can identify line starts/ends at gigabytes per second.

Why it teaches search tools: grep is LINE-oriented—you need to output the line containing the match. Finding line boundaries fast is critical. ripgrep uses SIMD to count newlines, which is why it can report line numbers without significant overhead.

Core challenges you’ll face:

  • SIMD newline counting → maps to POPCNT on comparison masks
  • Line boundary tracking → maps to maintaining byte offsets
  • Interleaving with search → maps to fusion of operations
  • Handling \r\n vs \n → maps to Windows compatibility

Key Concepts:

  • SIMD POPCNT: Count bits set in comparison mask
  • Prefix Sum: Computing line numbers from newline counts
  • Line Caching: Store line boundaries for random access
  • Streaming Processing: Process file in chunks without full materialization

Difficulty: Advanced. Time estimate: 1 week. Prerequisites: Project 7 (SIMD memchr).

Real world outcome:

$ ./line_counter huge_log.txt
File size: 5.2 GB
Lines: 124,567,890

Benchmark:
  Naive (byte loop):  12.5 seconds
  wc -l:               3.2 seconds
  SIMD line count:     0.45 seconds (11.6 GB/s)

With line boundary tracking:
  Line 50000000 starts at byte offset 2,345,678,901
  Lookup time: 0.001ms (from precomputed index)

Integration with search:
  Pattern found at byte 123456789
  → Line number 567890 (computed in 0.01ms)
  → Line content: "Error: connection timeout..."

Implementation Hints: Use SIMD memchr to find newlines. For each 16/32-byte chunk, get the comparison mask and use POPCNT to count the bits set (= number of newlines in that chunk).

For line boundary tracking, either store every Nth line’s offset (sparse index) or compute prefix sums on the fly.

To output context lines, you need to find the line boundaries AROUND a match, not just count lines up to it.
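
A minimal sketch of the counting kernel (SSE2 compare, movemask, then a popcount per 16-byte chunk):

// Count newlines in a buffer, 16 bytes at a time.
#include <emmintrin.h>
#include <stddef.h>

size_t count_newlines(const char *buf, size_t len) {
    __m128i nl = _mm_set1_epi8('\n');
    size_t count = 0, i = 0;
    for (; i + 16 <= len; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, nl));
        count += __builtin_popcount(mask);  // newlines in this 16-byte chunk
    }
    for (; i < len; i++)                    // scalar tail
        count += (buf[i] == '\n');
    return count;
}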

Learning milestones:

  1. SIMD newline counting correct and fast → You understand the basics
  2. Can find line containing a byte offset → You understand the application
  3. Performance matches ripgrep’s line counting → You’ve achieved production quality


Real World Outcome

You will build a line-oriented search engine that can map byte offsets to line numbers and print context lines quickly, even on huge files.

Command Line Outcome Example:

$ ./line_search "panic" server.log --context 2
1021: [INFO] starting server
1022: [DEBUG] loading config
1023: [ERROR] panic: missing key
1024: [DEBUG] shutting down
1025: [INFO] exit

Matches: 1
Line count: 2,389,120
Time: 18ms

The Core Question You’re Answering

“How do I convert raw byte offsets into line numbers and context lines fast?”

This project makes line mapping a first-class performance feature rather than an afterthought.


Concepts You Must Understand First

  1. Newline counting
    • How can SIMD speed up counting newline (`\n`) bytes?
  2. Offset-to-line mapping
    • How do you map byte positions to line numbers quickly?
  3. Context windows
    • How do you find the previous and next line boundaries efficiently?

Questions to Guide Your Design

  1. Indexing strategy
    • Will you store every Nth newline offset?
  2. Context extraction
    • How will you find start/end of a line around a match?
  3. Memory vs speed
    • How much memory are you willing to trade for fast line lookup?

Thinking Exercise

The Sparse Index

If you store every 1024th newline offset, how many bytes of index do you need for a 1GB file with ~50M lines?


The Interview Questions They’ll Ask

  1. “How would you map byte offsets to line numbers quickly?”
  2. “Why is line counting a bottleneck in search tools?”
  3. “How do you print context lines efficiently?”

Hints in Layers

Hint 1: Count newlines with SIMD. This is the same as memchr, plus a popcount on the mask.

Hint 2: Build a sparse index. Store every Nth newline offset to reduce search time.

Hint 3: Use binary search on the index. Find the nearest checkpoint and scan locally (a sketch follows these hints).
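
A sketch combining Hints 2 and 3, assuming index[k] holds the byte offset of the (k+1)*1024-th newline; the function name and checkpoint stride are illustrative:

// Map a byte offset to a 1-based line number via a sparse newline index.
#include <stddef.h>

size_t line_of_offset(const size_t *index, size_t n_checkpoints,
                      const char *buf, size_t offset) {
    size_t lo = 0, hi = n_checkpoints;
    while (lo < hi) {                        // last checkpoint at or before offset
        size_t mid = lo + (hi - lo) / 2;
        if (index[mid] <= offset) lo = mid + 1; else hi = mid;
    }
    size_t newlines = lo * 1024;             // newlines covered by checkpoints
    size_t start = lo ? index[lo - 1] + 1 : 0;
    for (size_t i = start; i < offset; i++)  // short local scan from checkpoint
        newlines += (buf[i] == '\n');
    return newlines + 1;                     // line number containing `offset`
}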


Books That Will Help

| Topic | Book | Chapter |
| --- | --- | --- |
| Data structures | “Algorithms, Fourth Edition” | Arrays and binary search |
| Performance | “Computer Systems: A Programmer’s Perspective” | Memory hierarchy |
| SIMD basics | “Low-Level Programming” | SIMD chapters |

Common Pitfalls & Debugging

Problem 1: “Wrong line numbers”

  • Why: Off-by-one errors in newline counting
  • Fix: Validate counts on small files by hand

Problem 2: “Slow context extraction”

  • Why: Scanning from file start for each match
  • Fix: Use sparse index and local scan

Problem 3: “High memory use”

  • Why: Storing every line offset
  • Fix: Store every Nth offset instead

Definition of Done

  • SIMD newline counting implemented and correct
  • Sparse index maps offsets to line numbers efficiently
  • Context lines printed accurately around matches
  • Benchmarks show improvement over naive line counting
  • Works on multi-gigabyte files

Phase 5: Fuzzy Matching

## Project 13: Fuzzy String Matcher (fzf-style)

  • File: TEXT_SEARCH_TOOLS_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, C++, Python
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 2. The “Micro-SaaS / Pro Tool” (Solo-Preneur Potential)
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Algorithms / String Matching
  • Software or Tool: Custom Implementation
  • Main Book: fzf source code (algo.go)

What you’ll build: A fuzzy matcher that finds items matching a pattern with characters in order but not necessarily contiguous. Score matches by quality (gaps, position of matches, word boundaries).

Why it teaches search tools: fzf revolutionized interactive file selection. Understanding its scoring algorithm helps you build better developer tools. The trick is balancing quality (finding the best match) with speed (interactive responsiveness).

Core challenges you’ll face:

  • Subsequence matching → maps to dynamic programming
  • Scoring function design → maps to heuristics for “good” matches
  • Word boundary bonuses → maps to camelCase and snake_case awareness
  • Performance optimization → maps to early termination, SIMD for candidate filtering

Key Concepts:

  • Longest Common Subsequence: Base algorithm for fuzzy matching
  • Scoring Heuristics: Position bonuses, gap penalties, boundary bonuses
  • Smith-Waterman: Scoring inspiration from bioinformatics
  • fzf algo.go: The actual implementation to study

Difficulty: Advanced. Time estimate: 2 weeks. Prerequisites: Dynamic programming, string algorithms.

Real world outcome:

$ ./fuzzy_matcher
Input: fzf
Candidates: [fuzzy_finder, foozbaz, fizzbuzz, FuzzyZoneFinder]

Matches (sorted by score):
  1. FuzzyZoneFinder   score=95  (f-z-f at word boundaries!)
  2. fuzzy_finder      score=78  (f-z-f with small gaps)

Filtered out (no f-z-f subsequence): fizzbuzz, foozbaz

Score breakdown for “FuzzyZoneFinder”:
  'f' matches 'F' at position 0: +20 (start bonus)
  'z' matches 'Z' at position 5: +15 (camelCase boundary)
  'f' matches 'F' at position 9: +18 (camelCase boundary)
  Gap penalty: -3
  Base match points: +45
  Total: 95

Performance:
  10,000 candidates: 2ms
  100,000 candidates: 18ms
  1,000,000 candidates: 150ms

Implementation Hints: For each candidate, find the best way to match the pattern characters as a subsequence. Use dynamic programming to track score at each (pattern_pos, candidate_pos).

Scoring heuristics (from fzf):

  • Bonus for matching at start of word (after _, space, or capital)
  • Bonus for matching after a camelCase boundary
  • Penalty for gaps between matches
  • Bonus for consecutive matches

The first character match gets extra weight because it’s likely intentional.
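
A greedy sketch of the subsequence-plus-scoring idea. All constants are invented for illustration; fzf's algo.go uses tuned weights and a DP pass (its v2 algorithm) to find the optimal alignment rather than the leftmost one:

// Score a candidate against a query; returns -1 if not a subsequence match.
#include <ctype.h>

int fuzzy_score(const char *query, const char *cand) {
    int score = 0, prev = -2;            // index of previous match; -2 = none yet
    const char *c = cand;
    for (const char *q = query; *q; q++) {
        while (*c && tolower((unsigned char)*c) != tolower((unsigned char)*q))
            c++;                          // greedy: take the leftmost occurrence
        if (!*c) return -1;               // not a subsequence: reject candidate
        int i = (int)(c - cand);
        score += 5;                       // base points per matched character
        if (i == 0 || cand[i - 1] == '_' || cand[i - 1] == ' ' ||
            (islower((unsigned char)cand[i - 1]) && isupper((unsigned char)*c)))
            score += 15;                  // start / word / camelCase boundary bonus
        if (i == prev + 1)
            score += 10;                  // consecutive-match bonus
        else if (prev >= 0)
            score -= i - prev - 1;        // gap penalty, one point per skipped char
        prev = i;
        c++;
    }
    return score;
}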

Learning milestones:

  1. Basic subsequence matching works → You understand the problem
  2. Scoring produces intuitive rankings → You understand heuristics
  3. Fast enough for interactive use (<100ms for 100K items) → You understand optimization

Real World Outcome

You will implement a fuzzy matcher with scoring that produces intuitive rankings, along with benchmarks showing interactive performance at 100K+ candidates.

Command Line Outcome Example:

$ ./fuzzy_matcher --query fzf --candidates candidates.txt
1. FuzzyZoneFinder   score=95
2. fuzzy_finder      score=78

(fizzbuzz and foozbaz rejected: "fzf" is not a subsequence of either)

Scored 100,000 candidates in 18ms

The Core Question You’re Answering

“How do I score fuzzy matches so the best results feel obvious to a human?”

The algorithm is not just about matching; it is about ranking.


Concepts You Must Understand First

  1. Subsequence matching
    • How do you check that the query appears in order?
  2. Scoring heuristics
    • Why do boundary and consecutive bonuses matter?
  3. Dynamic programming
    • How do you compute optimal alignment efficiently?

Questions to Guide Your Design

  1. Scoring function
    • What bonuses and penalties produce intuitive results?
  2. Performance constraints
    • What is your target latency per 100K candidates?
  3. Ranking stability
    • How do you handle ties consistently?

Thinking Exercise

The Ranking Test

For the query cfg, which should rank higher: ConfigParser or cfg_manager? Explain why.


The Interview Questions They’ll Ask

  1. “How does fuzzy matching differ from substring search?”
  2. “Why do word boundaries matter for scoring?”
  3. “How would you optimize for 1M candidates?”
  4. “What is the role of dynamic programming here?”

Hints in Layers

Hint 1: Implement the subsequence check first. Don’t score until you can correctly match.

Hint 2: Add bonuses gradually. Start with the consecutive bonus, then add the boundary bonus.

Hint 3: Cache candidate lengths. Avoid recomputing lengths inside the scoring loop.


Books That Will Help

| Topic | Book | Chapter |
| --- | --- | --- |
| Dynamic programming | “Algorithms, Fourth Edition” | DP chapters |
| Practical scoring | fzf source code (algo.go) | Online |
| Performance engineering | “Computer Systems: A Programmer’s Perspective” | Ch. 6 |

Common Pitfalls & Debugging

Problem 1: “Scores feel wrong”

  • Why: Bonuses/penalties poorly balanced
  • Fix: Tune with real examples and compare to fzf

Problem 2: “Too slow at 100K items”

  • Why: O(n*m) DP with large constants
  • Fix: Use pruning and early exits

Problem 3: “Incorrect ordering”

  • Why: Tie-breaking inconsistent
  • Fix: Add deterministic secondary keys

Definition of Done

  • Subsequence matching correct on test cases
  • Scoring function produces intuitive rankings
  • 100K candidates scored in < 50ms
  • Benchmarks show scaling to 1M candidates
  • Output includes per-candidate score for debugging

## Project 14: Interactive Fuzzy Finder (Full fzf Clone)

  • File: TEXT_SEARCH_TOOLS_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 2. The “Micro-SaaS / Pro Tool” (Solo-Preneur Potential)
  • Difficulty: Level 3: Advanced
  • Knowledge Area: TUI / Concurrency
  • Software or Tool: Custom Implementation
  • Main Book: fzf source code

What you’ll build: A complete interactive fuzzy finder with: terminal UI, incremental input, parallel matching with worker goroutines, result streaming, preview pane.

Why it teaches search tools: fzf shows how to build responsive interactive tools. The architecture—input reader, matcher pool, result merger, renderer—is a pattern you’ll use in many tools.

Core challenges you’ll face:

  • Terminal UI (raw mode, escape sequences) → maps to TUI rendering
  • Incremental search (update as you type) → maps to debouncing and cancellation
  • Parallel matching → maps to worker pools and result merging
  • Streaming large inputs → maps to handling unbounded data

Key Concepts:

  • Terminal Raw Mode: Reading characters without waiting for Enter
  • ANSI Escape Codes: Cursor movement, colors, clearing
  • Worker Goroutines: Parallel matching with cancellation
  • Result Merging: Combining ranked results from multiple workers

Difficulty: Advanced. Time estimate: 3-4 weeks. Prerequisites: Project 13, terminal programming basics.

Real world outcome:

$ find . -type f | ./mini_fzf
[Interactive TUI appears]

> src/ma
  4/1234
  > src/main.rs
    src/matcher.go
    src/main_test.go
    src/map.rs

Features:
- Type to filter
- Arrow keys to navigate
- Enter to select
- Ctrl-C to cancel
- Preview pane (--preview) shows file content

Performance:
- 100K items loaded in 50ms
- Typing latency <10ms (feels instant)
- Matches update incrementally as you type

Implementation Hints: Architecture (from fzf):

  1. Reader: Reads input lines, chunks them, sends to matcher
  2. Matcher: Pool of workers that score chunks against current pattern
  3. Merger: Combines worker results, maintains top-K
  4. Terminal: Renders results, handles input

When the pattern changes, cancel in-flight matching and start new match. Use channels for communication between components.
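
This project's natural language is Go, where channels make the plumbing easy; the cancellation idea itself, though, reduces to an atomic query version, sketched here in C11 with invented function names:

// Query versioning: stale work units are detected and discarded cheaply.
#include <stdatomic.h>

atomic_int query_version;  // bumped by the UI thread on every keystroke

// UI thread: the query changed, so all in-flight work is now stale.
void on_query_change(void) { atomic_fetch_add(&query_version, 1); }

// Worker: score one chunk of candidates for the version it started with.
// Returns 1 if results may be merged, 0 if they must be discarded.
int finish_chunk(int started_version) {
    // ... score candidates here, optionally polling for staleness ...
    return atomic_load(&query_version) == started_version;
}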

Learning milestones:

  1. Basic TUI works (type and see results) → You understand terminal rendering
  2. Results update smoothly while typing → You understand incremental search
  3. Handles millions of items smoothly → You understand the architecture


Real World Outcome

You will produce a usable interactive fuzzy finder with raw terminal input, incremental matching, and smooth rendering under 10ms per keystroke on large inputs.

Command Line Outcome Example:

$ find . -type f | ./mini_fzf --preview 'sed -n "1,80p" {}'
[Interactive UI]
> src/ma
  4/1234
  > src/main.rs
    src/main_test.go
    src/matcher.go

Typing latency: 7ms
Results updated: 100K items

The Core Question You’re Answering

“How do I keep a terminal UI responsive while heavy matching work runs in the background?”

This project is about architecture: separating input, matching, and rendering.


Concepts You Must Understand First

  1. Terminal raw mode
    • How do you capture keystrokes without waiting for Enter?
  2. Cancellation
    • How do you stop in-flight matching when input changes?
  3. Concurrency patterns
    • How do you stream results while work continues?

Questions to Guide Your Design

  1. Input loop
    • How will you parse arrow keys and control sequences?
  2. Result merging
    • How do you maintain a top-K list from multiple workers?
  3. Rendering
    • How do you update the screen without flicker?

Thinking Exercise

The Janky UI

Imagine matching takes 200ms. How will you keep the UI responsive and show partial results?


The Interview Questions They’ll Ask

  1. “How would you design a responsive TUI for large input sets?”
  2. “How do you cancel in-flight work in a worker pool?”
  3. “What is the role of debouncing in interactive search?”
  4. “How do you avoid flicker in terminal rendering?”

Hints in Layers

Hint 1: Separate input, matching, and rendering. Treat them as independent components connected by channels.

Hint 2: Use a query version number. Discard stale results by comparing versions.

Hint 3: Render at a fixed FPS. Update the screen at 30-60Hz for smoothness.


Books That Will Help

| Topic | Book | Chapter |
| --- | --- | --- |
| Terminal basics | “The Linux Command Line” | Terminal sections |
| Concurrency | “Advanced Programming in the UNIX Environment” | Threads |
| UX responsiveness | “Code Complete” | UI responsiveness sections |

Common Pitfalls & Debugging

Problem 1: “Input feels laggy”

  • Why: Matching blocks the UI thread
  • Fix: Move matching to worker threads

Problem 2: “Flickering output”

  • Why: Full screen redraw each frame without double buffering
  • Fix: Use a back buffer or minimal redraws

Problem 3: “Out-of-order results”

  • Why: Old workers still emitting results
  • Fix: Drop results that do not match current query version

Definition of Done

  • Terminal raw mode input works with arrow keys and control sequences
  • Incremental matching updates within 10ms for 100K items
  • Worker pool with cancellation or versioning
  • Rendering loop is flicker-free
  • Preview pane works with a user-defined command

Phase 6: Putting It All Together

## Project 15: Build Your Own ripgrep

  • File: TEXT_SEARCH_TOOLS_DEEP_DIVE.md
  • Main Programming Language: Rust
  • Alternative Programming Languages: C++, Go
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 5: Master
  • Knowledge Area: Systems Programming / Everything Combined
  • Software or Tool: Custom Implementation
  • Main Book: ripgrep source code + all previous resources

What you’ll build: A production-quality grep with: parallel directory traversal, .gitignore support, SIMD-accelerated literal search, regex support with literal extraction, colored output, context lines.

Why it teaches search tools: This is the capstone project. You’ll integrate everything: parallel I/O, SIMD algorithms, regex engines, output formatting. The result is a tool YOU built that can search gigabytes per second.

Core challenges you’ll face:

  • Integrating all previous components → maps to system architecture
  • Memory efficiency → maps to avoiding unnecessary allocations
  • Correctness edge cases → maps to handling binary files, encoding issues
  • User experience polish → maps to colors, progress, error messages

Key Concepts:

  • Component Integration: How the pieces fit together
  • Error Handling: Graceful degradation on permission errors, encoding issues
  • Output Buffering: Line buffering vs block buffering for piped output
  • Configuration: Command-line argument parsing, config files

Difficulty: Master. Time estimate: 2-3 months. Prerequisites: All previous projects.

Real world outcome:

$ ./mygrep "function.*async" ~/code --stats
Searching 150,234 files...

src/server/api.ts:145: async function handleRequest(req: Request) {
src/server/api.ts:234: export async function processData(data: Data) {
src/utils/fetch.ts:12: async function fetchWithRetry(url: string) {
...

Statistics:
  Files searched: 150,234
  Files matched: 1,234
  Lines matched: 5,678
  Time: 0.89s
  Throughput: 2.1 GB/s

$ hyperfine './mygrep TODO ~/code' 'rg TODO ~/code'
  mygrep: 0.95s ± 0.02s
  rg:     0.87s ± 0.02s

  Within 10% of ripgrep! 🎉

Implementation Hints: Start with a working but slow version, then optimize:

  1. Single-threaded, read-based file access, naive regex
  2. Add parallel directory traversal
  3. Add literal extraction and SIMD prefiltering
  4. Add mmap for large files
  5. Optimize output (buffer, avoid unnecessary work)

Profile continuously. The bottleneck shifts as you optimize—first it’s regex, then I/O, then output.

Learning milestones:

  1. Correct results on all test cases → You understand the requirements
  2. Within 2x of ripgrep performance → You understand optimization
  3. Users would actually use this → You’ve built something real

Real World Outcome

You will ship a production-grade grep-like tool that can search large repositories at multi-GB/s, respects ignore rules, supports regex, and provides colored, contextual output.

Command Line Outcome Example:

$ ./myrg "function.*async" ~/code --stats --context 1
Searching 150,234 files...

src/server/api.ts:145: async function handleRequest(req: Request) {
  ...

src/utils/fetch.ts:12: async function fetchWithRetry(url: string) {
  ...

Statistics:
  Files searched: 150,234
  Files matched: 1,234
  Lines matched: 5,678
  Time: 0.89s
  Throughput: 2.1 GB/s

$ hyperfine './myrg TODO ~/code' 'rg TODO ~/code'
  myrg: 0.95s ± 0.02s
  rg:   0.87s ± 0.02s

The Core Question You’re Answering

“How do all the pieces of a modern search tool fit together into a coherent, fast, and correct system?”

This is the capstone: integration, performance engineering, and UX polish.


Concepts You Must Understand First

  1. Traversal + ignore rules
    • How do you find the right files quickly?
  2. Regex pipeline
    • How do literal extraction and SIMD prefilters feed the regex engine?
  3. Output formatting
    • How do you print lines, context, and colors efficiently?

Questions to Guide Your Design

  1. Architecture
    • How will you structure the pipeline so each component is testable?
  2. Performance bottlenecks
    • How will you profile and identify the next bottleneck?
  3. Correctness
    • How will you validate your output against rg on real repos?

Thinking Exercise

The Bottleneck Shift

If your regex engine is 5x faster, what becomes the next bottleneck? Predict it before profiling.


The Interview Questions They’ll Ask

  1. “Walk me through the architecture of ripgrep.”
  2. “Where are the performance hotspots in a search tool?”
  3. “How do you ensure correctness across encodings and binary files?”
  4. “Why might --context slow down search?”

Hints in Layers

Hint 1: Start with correctness. Build a slow but correct tool, then optimize each stage.

Hint 2: Profile continuously. Use perf, dtrace, or Instruments to find hotspots.

Hint 3: Optimize output. Buffer output and avoid per-match allocations.


Books That Will Help

| Topic | Book | Chapter |
| --- | --- | --- |
| Systems design | “Clean Architecture” by Robert C. Martin | Architecture chapters |
| I/O performance | “The Linux Programming Interface” | File I/O chapters |
| Regex internals | ripgrep/regex internals docs | Online |

Common Pitfalls & Debugging

Problem 1: “Slower than expected”

  • Why: Output formatting or line counting dominates
  • Fix: Buffer output, precompute line indices, minimize allocations

Problem 2: “Incorrect matches”

  • Why: Unicode decoding or binary detection errors
  • Fix: Validate against rg with mixed encodings

Problem 3: “High memory usage”

  • Why: Reading entire files at once
  • Fix: Stream in chunks and reuse buffers

Definition of Done

  • Matches rg output on a real repo for at least 20 queries
  • Throughput within 2x of rg on one large corpus
  • Supports ignore rules, binary detection, and UTF-8 handling
  • --context and --stats flags implemented
  • CLI help and error messages are user-friendly

Project Comparison Table

| # | Project | Difficulty | Time | Depth of Understanding | Fun Factor |
| --- | --- | --- | --- | --- | --- |
| 1 | Naive String Search | Beginner | 2-3 hours | ⭐⭐ | ⭐⭐ |
| 2 | Boyer-Moore | Int-Adv | 1 week | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 3 | Aho-Corasick | Advanced | 2 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 4 | NFA Regex Engine | Expert | 2-3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 5 | DFA Regex (Lazy) | Expert | 2-3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 6 | Literal Extraction | Advanced | 1-2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 7 | SIMD memchr | Expert | 1 week | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 8 | SIMD memmem | Expert | 1-2 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 9 | Teddy Algorithm | Master | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 10 | Parallel Dir Traversal | Advanced | 2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| 11 | mmap vs read Selector | Advanced | 1 week | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| 12 | SIMD Line Counting | Advanced | 1 week | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 13 | Fuzzy Matcher | Advanced | 2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 14 | Interactive fzf Clone | Advanced | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 15 | Build Your Own ripgrep | Master | 2-3 months | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |

Path A: Algorithm Focus (Theory → Implementation)

  1. Projects 1-3 (String Matching Algorithms)
  2. Projects 4-5 (Regex Engines)
  3. Project 6 (Literal Extraction)
  4. Project 15 (Integration)

Path B: Performance Focus (Make It Fast)

  1. Project 1 (Baseline)
  2. Projects 7-8 (SIMD Fundamentals)
  3. Projects 10-11 (I/O Optimization)
  4. Project 15 (Integration)

Path C: Tools Focus (Build Useful Things)

  1. Projects 1-2 (Basic Algorithms)
  2. Projects 10-11 (Parallel I/O)
  3. Projects 13-14 (Fuzzy Finder)
  4. Project 15 (ripgrep Clone)

Why Each Tool Is Fast: Summary

GNU grep

  • Boyer-Moore + DFA/NFA hybrid
  • Hand-tuned inner loop
  • Falls back to NFA for complex patterns
  • Decades of optimization

The Silver Searcher (ag)

  • mmap + pthreads + PCRE JIT
  • First to respect .gitignore by default
  • Binary search for ignore patterns
  • Up to 8 worker threads

ripgrep

  • Finite automata regex (guaranteed linear time)
  • Teddy SIMD algorithm for literals
  • Lock-free parallel directory traversal
  • Automatic mmap vs read selection
  • UTF-8 decoding built into DFA
  • Literal extraction for regex optimization

fd

  • Parallel directory traversal
  • Uses ripgrep’s regex and ignore crates
  • Smart defaults (ignore hidden, respect .gitignore)
  • 10-23x faster than find for pattern matching

fzf

  • Custom fuzzy matching with scoring heuristics
  • Worker goroutine pool for parallel matching
  • Incremental search with result streaming
  • Two algorithm modes: v1 (fast) and v2 (optimal)

git ls-files (bonus)

  • Reads from Git’s index file directly
  • No directory traversal at all!
  • 5x+ faster than any directory walker
  • Only works inside Git repos

Key Insights

  1. Most “regex” searches are literal: The fastest path is finding literals with SIMD, then verifying with regex.

  2. Automata beat backtracking for guaranteed performance: Backtracking regex engines can take exponential time; finite automata (NFA simulation or a DFA) are O(n) always.

  3. Parallelization matters more than algorithm optimization: On modern SSDs, a single thread can’t saturate I/O. Use work-stealing queues.

  4. Skip what you don’t need: Respecting .gitignore often reduces search space by 10-100x.

  5. Measure, don’t assume: mmap isn’t always faster. Boyer-Moore isn’t always better. Profile on real workloads.

  6. SIMD is a superpower: 16-32 bytes per instruction changes everything. memchr at 15 bytes/cycle is 15x faster than byte-by-byte.


Summary

| Project # | Project Name | Main Language |
| --- | --- | --- |
| 1 | Naive String Search | C |
| 2 | Boyer-Moore String Search | C |
| 3 | Aho-Corasick Multi-Pattern | C |
| 4 | NFA Regex Engine | C |
| 5 | DFA Regex Engine (Lazy) | C |
| 6 | Literal Extraction Optimizer | Rust |
| 7 | SIMD memchr | C |
| 8 | SIMD memmem | C |
| 9 | Teddy Algorithm | Rust |
| 10 | Parallel Directory Traversal | Rust |
| 11 | mmap vs read Selector | C |
| 12 | SIMD Line Counting | C |
| 13 | Fuzzy String Matcher | Go |
| 14 | Interactive Fuzzy Finder | Go |
| 15 | Build Your Own ripgrep | Rust |



Note: This guide now includes a full theory primer, a concept-to-project map, deep dive reading, quick-start path, success metrics, and per-project deep dives (outcomes, pitfalls, and definitions of done) so you can move from theory to implementation with minimal guesswork.