Project 15: Build Your Own ripgrep

Integrate all components into a production-grade search tool with parallel traversal, regex + literal optimization, SIMD prefilters, and polished output.

Quick Reference

| Attribute | Value |
|-----------|-------|
| Difficulty | Level 5: Master |
| Time Estimate | 2-3 months |
| Main Programming Language | Rust |
| Alternative Programming Languages | C++, Go |
| Coolness Level | Level 5: Pure Magic |
| Business Potential | Level 4: Open Core Infrastructure |
| Prerequisites | Projects 1-14 (at least 1, 4, 7, 10) |
| Key Topics | pipeline integration, profiling, UX polish |

1. Learning Objectives

By completing this project, you will:

  1. Integrate traversal, I/O, regex, SIMD, and output into one tool.
  2. Implement literal extraction and prefilter selection.
  3. Build a robust CLI with stats, context, and color output.
  4. Achieve performance within 2x of ripgrep on real corpora.
  5. Explain system bottlenecks and optimize them iteratively.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Search Pipeline Integration

Fundamentals

A modern search tool is a pipeline: traversal finds files, I/O reads bytes, prefilters reduce candidates, regex engines verify matches, and output formatting prints results. Each stage has a contract and a cost. The tool is only as fast as its slowest stage. Integration is about making these stages communicate efficiently and avoiding redundant work.

The pipeline must also handle errors gracefully. Permission errors and invalid encodings should not crash the tool. The output must be deterministic and user-friendly, with line numbers, context, and optional color.

Deep Dive into the concept

Integration complexity comes from coupling: traversal may run in parallel, while matching and output may be sequential to preserve order. You need to choose whether to preserve order or maximize throughput. ripgrep uses a parallel traversal plus a work queue of files; each worker searches a file and emits results. Output ordering can be stable per file but not necessarily global. A deterministic mode can sort outputs if needed.

Prefilters are selected based on the pattern: literal-only patterns use a fast literal engine; regex patterns use literal extraction to pick a prefilter (SIMD memmem or Teddy) and then verify matches with a regex engine. I/O strategy selection chooses mmap or read depending on file size. Line counting uses SIMD to avoid output bottlenecks.
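As a hedged sketch of this selection logic (the `PatternPlan` name, the metacharacter set, and the 3-byte literal threshold are illustrative, not ripgrep's actual heuristics):

```rust
// Illustrative pattern classification: decide whether a pattern can be
// searched as a pure literal, prefiltered then verified, or must go
// straight to the regex engine.
#[derive(Debug, PartialEq)]
enum PatternPlan {
    Literal,          // pure literal: search with memmem directly
    PrefilteredRegex, // regex with a usable literal: prefilter + verify
    PlainRegex,       // no useful literal: run the regex engine alone
}

// Treat common regex metacharacters as the dividing line.
fn is_literal(pattern: &str) -> bool {
    !pattern.chars().any(|c| r".*+?()[]{}|^$\\".contains(c))
}

// Longest run of non-metacharacter bytes; a crude stand-in for
// real literal extraction.
fn longest_literal_run(pattern: &str) -> usize {
    pattern
        .split(|c: char| r".*+?()[]{}|^$\\".contains(c))
        .map(|s| s.len())
        .max()
        .unwrap_or(0)
}

fn plan(pattern: &str) -> PatternPlan {
    if is_literal(pattern) {
        PatternPlan::Literal
    } else if longest_literal_run(pattern) >= 3 {
        PatternPlan::PrefilteredRegex
    } else {
        PatternPlan::PlainRegex
    }
}

fn main() {
    println!("{:?}", plan("TODO"));            // Literal
    println!("{:?}", plan("function.*async")); // PrefilteredRegex
    println!("{:?}", plan("a|b"));             // PlainRegex
}
```

A real implementation would also extract the literal's byte offset within the pattern so that a prefilter hit can be mapped back to a candidate match start.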

The architecture must be measurable. You need built-in stats: bytes scanned, files searched, matches found, and time per stage. Without these, you cannot explain performance.
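A minimal sketch of such instrumentation, assuming hypothetical names (`Stats`, `record_file`) rather than ripgrep's actual internals:

```rust
use std::time::{Duration, Instant};

// Illustrative per-stage counters backing a --stats flag.
#[derive(Default, Debug)]
struct Stats {
    files_searched: u64,
    bytes_scanned: u64,
    matches_found: u64,
    search_time: Duration,
}

// Fold the results of scanning one file into the running totals.
fn record_file(stats: &mut Stats, bytes: u64, matches: u64) {
    stats.files_searched += 1;
    stats.bytes_scanned += bytes;
    stats.matches_found += matches;
}

fn main() {
    let mut stats = Stats::default();

    // Simulate scanning one file and timing the search stage.
    let start = Instant::now();
    record_file(&mut stats, 4096, 2);
    stats.search_time += start.elapsed();

    println!(
        "files={} bytes={} matches={} time={:?}",
        stats.files_searched, stats.bytes_scanned, stats.matches_found, stats.search_time
    );
}
```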

How this fits with other projects

This project is the capstone that ties together all previous components into a coherent tool.

Definitions & key terms

  • pipeline -> series of stages from traversal to output
  • prefilter -> fast filter before regex verification
  • stage bottleneck -> slowest stage in pipeline
  • backpressure -> when downstream is slower than upstream

Mental model diagram (ASCII)

Traversal -> I/O -> Prefilter -> Regex -> Line Map -> Output
   |          |        |         |         |        |
 parallel  adaptive  SIMD     NFA/DFA   SIMD      buffered

How it works (step-by-step)

  1. Parse CLI and build pattern plan (literal vs regex).
  2. Start traversal workers and file queue.
  3. For each file: choose I/O strategy and scan.
  4. Use prefilter to find candidates; verify with regex.
  5. Use line index to format output; write buffered output.

Minimal concrete example

for file in files { scan_file(file, pattern_plan, io_strategy); }

Common misconceptions

  • Misconception: Faster regex always means faster tool. Correction: Output and traversal often dominate.
  • Misconception: Parallel output is always good. Correction: It can scramble ordering and increase overhead.

Check-your-understanding questions

  1. Why is literal extraction so important in a regex pipeline?
  2. What happens if output formatting is slow?
  3. Why do we need stats per stage?

Check-your-understanding answers

  1. It enables fast prefilters that reduce regex work.
  2. It becomes the bottleneck and limits throughput.
  3. To identify which stage to optimize next.

Real-world applications

  • ripgrep, ag, ack, fd
  • log search and incident response tooling

Where you’ll apply it

  • This project: Section 3.2 Functional Requirements and Section 4.4 Algorithm.
  • Also used in: all previous projects integrated here.

References

  • ripgrep documentation and source code
  • regex-automata internals

Key insights

Integration is performance engineering: the pipeline is only as fast as its slowest stage.

Summary

A production search tool is a pipeline of specialized components that must be coordinated and profiled.

Homework/Exercises to practice the concept

  1. Draw the pipeline for your tool and label bottlenecks.
  2. Run a benchmark and identify the slowest stage.

Solutions to the homework/exercises

  1. Use the pipeline diagram above and add timings.
  2. Use --stats output to locate the slow stage.

3. Project Specification

3.1 What You Will Build

A CLI tool myrg that searches directories recursively with ignore rules, supports literal and regex patterns, uses SIMD prefilters, and outputs matches with line numbers and optional context. It includes a --stats flag and deterministic mode.

3.2 Functional Requirements

  1. Traversal: parallel directory traversal with ignore rules.
  2. I/O strategy: auto select mmap vs read.
  3. Pattern plan: literal-only vs regex + prefilter.
  4. SIMD prefilters: memchr/memmem or Teddy.
  5. Regex verification: NFA/DFA engine.
  6. Output: line numbers, file paths, colors, context.
  7. Stats: files searched, bytes scanned, time per stage.
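Requirement 2 can be sketched as a simple size-based selector. The threshold and names here are illustrative; real tools tune the cutoff empirically and also account for platform quirks:

```rust
#[derive(Debug, PartialEq)]
enum IoStrategy {
    Mmap,
    Read,
}

// Pick mmap for large regular files, buffered read otherwise.
// The 64 KiB threshold is a placeholder, not a measured optimum.
fn choose_io(file_size: u64, is_regular_file: bool) -> IoStrategy {
    const MMAP_THRESHOLD: u64 = 64 * 1024;
    if is_regular_file && file_size >= MMAP_THRESHOLD {
        IoStrategy::Mmap
    } else {
        // Small files, pipes, and special files: plain read is cheaper
        // than paying mmap setup and page-fault costs.
        IoStrategy::Read
    }
}

fn main() {
    println!("{:?}", choose_io(10 * 1024 * 1024, true)); // Mmap
    println!("{:?}", choose_io(512, true));              // Read
    println!("{:?}", choose_io(1 << 30, false));         // Read
}
```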

3.3 Non-Functional Requirements

  • Performance: within 2x of ripgrep on large corpora.
  • Reliability: robust error handling for permissions and encoding.
  • Usability: CLI help, good errors, clear output.

3.4 Example Usage / Output

$ ./myrg "function.*async" ~/code --stats --context 1
Searching 150,234 files...

src/server/api.ts:145: async function handleRequest(req: Request) {
  ...

Statistics:
  Files searched: 150,234
  Files matched: 1,234
  Lines matched: 5,678
  Time: 0.89s
  Throughput: 2.1 GB/s

3.5 Data Formats / Schemas / Protocols

Text output with optional JSON for machine-readable stats.
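A minimal hand-rolled JSON stats line might look like the following; the field names are hypothetical, and a real implementation would likely use serde instead of manual formatting:

```rust
// Illustrative machine-readable stats line for a --json flag.
fn stats_json(files: u64, lines: u64, secs: f64) -> String {
    format!(
        "{{\"files_searched\":{},\"lines_matched\":{},\"time_secs\":{}}}",
        files, lines, secs
    )
}

fn main() {
    println!("{}", stats_json(10, 1, 0.001));
}
```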

3.6 Edge Cases

  • Binary files with embedded text
  • Mixed encodings (UTF-8 errors)
  • Permission denied directories

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

cargo build --release
./target/release/myrg "TODO" ~/code --stats

3.7.2 Golden Path Demo (Deterministic)

Use fixed fixtures and --fixed-ts to freeze timing values.

3.7.3 CLI Transcript (Success + Failure)

$ ./myrg --fixed-ts "TODO" fixtures/repo --stats
fixtures/repo/README.md:3: TODO
Statistics: files=10 lines=1 time=0.001s
exit code: 0

$ ./myrg --fixed-ts "(" fixtures/repo
error: invalid regex at position 1
exit code: 2

4. Solution Architecture

4.1 High-Level Design

+----------+   +-----------+   +-----------+   +-----------+   +-----------+
| Traversal|-->| I/O Layer |-->| Prefilter |-->| Regex     |-->| Output    |
+----------+   +-----------+   +-----------+   +-----------+   +-----------+

4.2 Key Components

| Component | Responsibility | Key Decisions |
|-----------|----------------|---------------|
| Traversal | find files | parallel, ignore rules |
| I/O | read bytes | mmap vs read selector |
| Prefilter | candidate scan | literal extraction + SIMD |
| Regex | verify matches | NFA/DFA engine |
| Output | formatting | buffered, colorized |

4.3 Data Structures (No Full Code)

struct SearchTask { path: PathBuf, size: u64 }

4.4 Algorithm Overview

  1. Parse CLI and compile pattern plan.
  2. Start traversal and push files to worker queue.
  3. For each file, choose I/O strategy and scan bytes.
  4. Run prefilter; verify with regex or literal match.
  5. Format output and update stats.
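The queueing in steps 2-3 can be sketched with a channel. A single worker and hardcoded paths keep this short; a real tool would run several workers (e.g. over a shared crossbeam channel) and feed them from actual traversal:

```rust
use std::sync::mpsc;
use std::thread;

// A traversal thread feeds file paths to a search worker over a channel.
fn search_all(paths: Vec<String>) -> usize {
    let (tx, rx) = mpsc::channel::<String>();

    let producer = thread::spawn(move || {
        for path in paths {
            tx.send(path).unwrap();
        }
        // Dropping tx here closes the channel, ending the worker's loop.
    });

    let worker = thread::spawn(move || {
        let mut searched = 0;
        for _path in rx {
            // Real code would pick an I/O strategy and scan `_path` here.
            searched += 1;
        }
        searched
    });

    producer.join().unwrap();
    worker.join().unwrap()
}

fn main() {
    let files = vec!["a.rs".to_string(), "b.rs".to_string(), "c.rs".to_string()];
    println!("searched {} files", search_all(files));
}
```

Using a bounded channel instead of an unbounded one also gives you backpressure for free: traversal blocks when the workers fall behind.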

Complexity Analysis

  • Traversal: O(files)
  • Search: O(bytes) + verification cost

5. Implementation Guide

5.1 Development Environment Setup

cargo build --release

5.2 Project Structure

myrg/
├── src/
│   ├── main.rs
│   ├── cli.rs
│   ├── traverse.rs
│   ├── io.rs
│   ├── prefilter.rs
│   ├── regex.rs
│   ├── output.rs
│   └── stats.rs
└── fixtures/

5.3 The Core Question You’re Answering

“How do all the pieces of a modern search tool fit together into a fast, correct system?”

5.4 Concepts You Must Understand First

  1. Pipeline stages and data flow.
  2. Prefilter selection logic.
  3. Output formatting bottlenecks.

5.5 Questions to Guide Your Design

  1. How will you keep traversal and matching decoupled?
  2. How will you ensure output remains readable with parallel workers?
  3. How will you choose between different prefilters?

5.6 Thinking Exercise

If your regex engine becomes 5x faster, where does the bottleneck move next?

5.7 The Interview Questions They’ll Ask

  1. Walk through ripgrep’s architecture.
  2. How do you validate correctness against existing tools?

5.8 Hints in Layers

Hint 1: Start with a single-threaded correct tool.
Hint 2: Add traversal and parallelism next.
Hint 3: Add SIMD prefilters and optimize output.

5.9 Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| System design | "Clean Architecture" | architecture chapters |
| I/O | "The Linux Programming Interface" | I/O chapters |
| Regex internals | ripgrep docs | internals |

5.10 Implementation Phases

Phase 1: Correctness (2 weeks)

  • Single-threaded search with regex.
  • Checkpoint: matches equal rg on fixtures.

Phase 2: Performance (3 weeks)

  • Add traversal parallelism and I/O strategy selection.
  • Checkpoint: speedup on large corpora.

Phase 3: Polishing (2 weeks)

  • Add stats, context lines, colors.
  • Checkpoint: user-friendly CLI.

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
|----------|---------|----------------|-----------|
| Output order | stable vs fastest | stable-per-file | balances determinism and throughput |
| Prefilter | memmem vs Teddy | adaptive | choose based on extracted literals |


6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
|----------|---------|----------|
| Unit | regex + prefilter | known inputs |
| Integration | full pipeline | fixture repos |
| Performance | benchmarks | large corpora |

6.2 Critical Test Cases

  1. Matches equal rg on 20 queries.
  2. Handles binary files and permission errors.
  3. Deterministic mode output stable.

6.3 Test Data

fixtures/repo with known matches and ignored files

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution |
|---------|---------|----------|
| Output bottleneck | slow overall | buffer output and reduce formatting |
| Incorrect ignore rules | too many files | reuse ignore crate logic |
| Unicode errors | missing matches | fall back to byte-based search |

7.2 Debugging Strategies

  • Compare output against rg with diff.
  • Use stage timing to find bottlenecks.
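The diff comparison can follow a pattern like this. The printf functions stand in for the real binaries so the snippet is runnable on its own; in practice substitute `./myrg "TODO" fixtures/repo` and `rg "TODO" fixtures/repo`:

```shell
# Stand-ins for the real tools' output (path:line:match per line).
ours()   { printf 'a.txt:1:TODO\nb.txt:2:TODO\n'; }
theirs() { printf 'b.txt:2:TODO\na.txt:1:TODO\n'; }

# Sort both sides so per-file ordering differences between parallel
# runs don't register as spurious diffs.
if [ "$(ours | sort)" = "$(theirs | sort)" ]; then
  echo "outputs match"
else
  echo "outputs differ"
fi
```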

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add --files mode like rg --files.
  • Add --type file filters.

8.2 Intermediate Extensions

  • Support .ignore and .rgignore files.
  • Add JSON output mode.

8.3 Advanced Extensions

  • Implement a daemon mode for repeated queries.
  • Add SIMD-accelerated context extraction.

9. Real-World Connections

9.1 Industry Applications

  • Developer tooling and incident response
  • Codebase auditing and compliance scans

9.2 Open Source Tools and Libraries

  • ripgrep, ag, ack
  • regex-automata and ignore crates

9.3 Interview Relevance

  • Systems integration and performance engineering

10. Resources

10.1 Essential Reading

  • ripgrep design notes
  • regex-automata docs

10.2 Tools & Documentation

  • hyperfine for benchmarks
  • perf for profiling

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain each pipeline stage.
  • I can explain how prefilters reduce work.

11.2 Implementation

  • Output matches ripgrep on fixtures.
  • Performance within target range.

11.3 Growth

  • I can identify the current bottleneck and fix it.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Working grep-like tool with traversal and regex

Full Completion:

  • SIMD prefilters + stats + context output

Excellence (Going Above & Beyond):

  • Within 10% of ripgrep on benchmarks
  • Advanced output modes and JSON stats