Project 2: Boyer-Moore String Search
The Boyer-Moore algorithm with both the “bad character” and “good suffix” heuristics. This is what GNU grep uses internally.
Quick Reference
| Attribute | Value |
|---|---|
| Primary Language | C |
| Alternative Languages | Rust, C++, Zig |
| Difficulty | Level 3: Advanced |
| Time Estimate | 1 week |
| Knowledge Area | Algorithms / String Matching |
| Tooling | Custom Implementation |
| Prerequisites | Project 1 (Naive Search), understanding of suffix/prefix concepts |
What You Will Build
The Boyer-Moore algorithm with both the “bad character” and “good suffix” heuristics. This is what GNU grep uses internally.
Why It Matters
This project builds core skills that appear repeatedly in real-world systems and tooling.
Core Challenges
- Bad character table construction → maps to preprocessing patterns
- Good suffix table construction → maps to suffix analysis
- Rightmost occurrence lookup → maps to how much to skip
- Understanding why backwards scanning works → maps to algorithm intuition
Key Concepts
- Bad Character Rule: “Algorithms on Strings, Trees, and Sequences” Chapter 2.2 - Gusfield
- Good Suffix Rule: “Algorithms on Strings, Trees, and Sequences” Chapter 2.2 - Gusfield
- Skip Tables: “The Boyer-Moore Fast String Searching Algorithm” - J. Strother Moore (original paper)
- Horspool Simplification: Boyer-Moore-Horspool variant that’s simpler to implement
Real-World Outcome
$ ./boyer_moore "substantially" novel.txt
Found at position 45231
Found at position 89012
Search completed in 3ms # vs 45ms naive!
Skip table visualization:
Pattern: "example"
Bad character table:
'e' -> 1 (rightmost position from end)
'l' -> 2
'p' -> 3
'm' -> 4
'a' -> 5
'x' -> 6
(all others) -> 7 (pattern length)
Benchmark comparison:
Pattern length 5: Naive 12ms, Boyer-Moore 8ms (1.5x faster)
Pattern length 15: Naive 15ms, Boyer-Moore 2ms (7.5x faster)
Pattern length 50: Naive 18ms, Boyer-Moore 0.4ms (45x faster!)
Key insight: Longer patterns = faster search!
Implementation Guide
- Reproduce the simplest happy-path scenario.
- Build the smallest working version of the core feature.
- Add input validation and error handling.
- Add instrumentation/logging to confirm behavior.
- Refactor into clean modules with tests.
Milestones
- Milestone 1: Minimal working program that runs end-to-end.
- Milestone 2: Correct outputs for typical inputs.
- Milestone 3: Robust handling of edge cases.
- Milestone 4: Clean structure and documented usage.
Validation Checklist
- Output matches the real-world outcome example
- Handles invalid inputs safely
- Provides clear errors and exit codes
- Repeatable results across runs
References
- Main guide:
TEXT_SEARCH_TOOLS_DEEP_DIVE.md - “Algorithms on Strings, Trees, and Sequences” by Dan Gusfield