Project 2: Boyer-Moore String Search

Build a Boyer-Moore matcher with bad-character and good-suffix heuristics, then benchmark it against your naive baseline.

Quick Reference

Attribute	Value
Difficulty	Level 3: Advanced
Time Estimate	1 week
Main Programming Language	C (Alternatives: Rust, Zig)
Alternative Programming Languages	Rust, Zig
Coolness Level	Level 3: Clever
Business Potential	Level 2: Niche Tooling
Prerequisites	P01, arrays, pointers, basic complexity
Key Topics	Boyer-Moore, bad-character table, good-suffix table, sublinear search

1. Learning Objectives

By completing this project, you will:

Implement Boyer-Moore with correct preprocessing tables.
Explain why and when Boyer-Moore is sublinear in practice.
Handle overlapping matches and correct shifts.
Compare performance against naive search on real files.
Design deterministic benchmarks with fixed inputs.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Boyer-Moore Heuristics

Fundamentals

Boyer-Moore matches the pattern from right to left and uses information from mismatches to skip ahead by multiple positions. Two heuristics drive those skips: the bad-character rule and the good-suffix rule. The bad-character rule says: if a mismatch occurs at pattern index j with text character c, you can shift the pattern so that the last occurrence of c in the pattern aligns with the mismatch position. If c is not in the pattern, you can shift past the mismatch entirely. The good-suffix rule says: if a suffix of the pattern matched before the mismatch, shift so that another occurrence of that suffix aligns, or align the longest proper prefix with the matched suffix.

These rules let Boyer-Moore skip more than one position at a time, often making it sublinear for random text. It shines when the pattern is longer and the alphabet is reasonably large.

Deep Dive into the concept

The bad-character table is typically a 256-entry table for bytes. For each byte value, it stores the last index where that byte appears in the pattern. On mismatch at index j, you compute shift = max(1, j - last_occurrence[c]). This is simple and fast.

The good-suffix table is more subtle. Suppose the pattern suffix P[j+1..m-1] matched the text, but P[j] mismatched. You want to shift so that a different occurrence of that suffix in the pattern aligns with the text, or align the longest prefix of P that matches the suffix. This requires preprocessing the pattern to compute border positions (prefixes that are also suffixes). Many implementations compute two arrays: suffix (length of the longest suffix ending at position i) and shift (the amount to shift for each mismatch position). The preprocessing cost is O(m), and the search is O(n) in the worst case for Boyer-Moore with good-suffix and bad-character heuristics, though there are pathological cases; in practice it is very fast.

You must also be careful with overlapping matches. Boyer-Moore can skip past overlaps if you always shift by the full pattern length after a match. The correct behavior is to compute the shift after a match using the good-suffix table for j = -1 (full match), which typically aligns the longest proper prefix with the matched suffix to allow overlaps.

How this fits on projects

You will use this heuristic engine as a core literal matcher in later projects (P06 literal extraction, P15 ripgrep-style integration) and for benchmarking against SIMD-based methods.

Definitions & key terms

bad-character rule -> shift based on last occurrence of mismatched character
good-suffix rule -> shift based on matched suffix alignment
border -> a prefix that is also a suffix
sublinear -> average comparisons less than n

Mental model diagram (ASCII)

pattern:    a b c d a b
text:           a b x d a b
match from right -> mismatch at 'c' vs 'x'
shift using bad-character / good-suffix

How it works (step-by-step)

Precompute bad-character table for all byte values.
Precompute good-suffix shift table for each pattern index.
Align pattern at position i and compare from right to left.
On mismatch at j, compute shift = max(bad_char, good_suffix).
On full match, record it and shift using good-suffix table to allow overlaps.

Minimal concrete example

int last[256];
for (int i = 0; i < 256; i++) last[i] = -1;
for (int i = 0; i < m; i++) last[(unsigned char)pat[i]] = i;

Common misconceptions

Misconception: Boyer-Moore is always faster. Correction: For very short patterns or tiny alphabets, it can be slower.
Misconception: After match, shift by m. Correction: Use good-suffix shift to allow overlaps.

Check-your-understanding questions

Why do we match from right to left in Boyer-Moore?
What does the bad-character table store for each byte?
How does the good-suffix rule prevent skipping valid matches?

Check-your-understanding answers

So that mismatches give information about the rightmost aligned char.
The last index where the byte occurs in the pattern (or -1).
It aligns a matching suffix with another occurrence or a prefix.

Real-world applications

GNU grep and other tuned literal searchers
Antivirus signature scanning for fixed patterns

Where you’ll apply it

This project: Section 4.4 Algorithm Overview and Section 6.2 Critical Tests.
Also used in: P06-literal-extraction-optimizer, P15-build-your-own-ripgrep.

References

Boyer & Moore (1977), “A Fast String Searching Algorithm”
CLRS, String Matching chapter

Key insights

Boyer-Moore gains speed by skipping alignments that cannot possibly match.

Summary

Boyer-Moore trades preprocessing for large skips, often beating naive search by an order of magnitude on real data.

Homework/Exercises to practice the concept

Implement only bad-character and measure speedup.
Add good-suffix and compare again.

Solutions to the homework/exercises

The bad-character rule alone can already yield major gains for large alphabets.
Good-suffix improves worst cases and enables overlap correctness.

3. Project Specification

3.1 What You Will Build

A CLI tool bm-search that implements Boyer-Moore for literal search with both bad-character and good-suffix heuristics, reports matches, and provides a benchmark mode against the naive baseline.

3.2 Functional Requirements

Pattern preprocessing: build bad-character and good-suffix tables.
Search: report all matches with overlaps.
Benchmark: --bench compares with naive search on the same input.
Output: text and JSON modes like P01.
Deterministic mode: fixed timers for tests (--fixed-ts).

3.3 Non-Functional Requirements

Performance: beat naive search by 2x+ on 10MB ASCII text.
Reliability: correct on all overlap cases.
Usability: clear stats and shift counts in bench mode.

3.4 Example Usage / Output

$ ./bm-search "needle" big.txt
big.txt:10234: needle

$ ./bm-search --bench "needle" big.txt
naive: 42ms, bm: 8ms, speedup: 5.25x

3.5 Data Formats / Schemas / Protocols

Same as P01 for match output; add benchmark line:

bench: naive_ms=<n> bm_ms=<n> speedup=<n>

3.6 Edge Cases

Repeated patterns like “aaaaaa”
Very small patterns (length 1 or 2)
Pattern not found

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

make
./bm-search "TODO" repo.txt
./bm-search --bench "TODO" repo.txt

3.7.2 Golden Path Demo (Deterministic)

Use fixtures/english_1mb.txt and fixed timers to produce stable benchmark output.

3.7.3 CLI Transcript (Success + Failure)

$ ./bm-search --fixed-ts "ana" fixtures/banana.txt
fixtures/banana.txt:1:2: banana
fixtures/banana.txt:1:4: banana
exit code: 0

$ ./bm-search --bench
error: missing PATTERN and FILE
exit code: 2

4. Solution Architecture

4.1 High-Level Design

+-----------+   +------------------+   +----------------+
| CLI Parse |-->| Preprocess Tables|-->| Search Loop    |
+-----------+   +------------------+   +-------+--------+
                                                |
                                                v
                                       +--------+-------+
                                       | Formatter      |
                                       +----------------+

4.2 Key Components

4.3 Data Structures (No Full Code)

int bad_char[256];
int good_suffix[m + 1];

4.4 Algorithm Overview

Build bad-character table.
Build good-suffix table.
Align pattern at i, compare from right to left.
On mismatch, compute shift and advance i.
On match, record and shift for overlaps.

Complexity Analysis

Preprocessing: O(m)
Search: typically sublinear, worst-case O(n*m)

5. Implementation Guide

5.1 Development Environment Setup

cc -O2 -Wall -Wextra -o bm-search src/main.c

5.2 Project Structure

bm-search/
├── src/
│   ├── main.c
│   ├── preprocess.c
│   ├── bm_search.c
│   └── formatter.c
├── fixtures/
│   └── english_1mb.txt
├── tests/
└── Makefile

5.3 The Core Question You’re Answering

“How can a string matcher use mismatches to skip ahead without missing valid matches?”

5.4 Concepts You Must Understand First

Bad-character table and how to compute it.
Good-suffix table and border computation.
Overlap handling after a full match.

5.5 Questions to Guide Your Design

What shift is safe after a mismatch at index j?
How do you handle patterns of length 1?
How will you validate correctness vs the naive baseline?

5.6 Thinking Exercise

Manually compute bad-character shifts for pattern “ABCDABD” and text “ABC ABCDAB ABCDABCDABDE”.

5.7 The Interview Questions They’ll Ask

Why is Boyer-Moore sublinear in practice?
What are the bad-character and good-suffix heuristics?

5.8 Hints in Layers

Hint 1: Implement bad-character only first. Hint 2: Add good-suffix; verify with overlap tests. Hint 3: Compare with naive on the same fixture files.

5.9 Books That Will Help

5.10 Implementation Phases

Phase 1: Preprocessing (2 days)

Build bad-character table and tests.
Checkpoint: table matches expected values.

Phase 2: Search Loop (2 days)

Implement right-to-left comparisons.
Checkpoint: matches equal naive baseline.

Phase 3: Benchmarking (1 day)

Add --bench mode and deterministic fixtures.
Checkpoint: speedup reported consistently.

5.11 Key Implementation Decisions

6. Testing Strategy

6.1 Test Categories

6.2 Critical Test Cases

Pattern “ana” in “banana” -> offsets 1, 3.
Pattern “aaaa” in “aaaaaa” -> overlaps.
Pattern not found -> no matches.

6.3 Test Data

text: "abababab"
pattern: "abab"
expected offsets: 0,2,4

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

7.2 Debugging Strategies

Log mismatch positions and shifts on small inputs.
Compare match offsets to naive output.

8. Extensions & Challenges

8.1 Beginner Extensions

Add case-insensitive mode.
Print number of character comparisons.

8.2 Intermediate Extensions

Support Unicode with codepoint tables.
Add heuristics to fall back to naive for short patterns.

8.3 Advanced Extensions

Implement Galil optimization.
Add SIMD prefiltering before Boyer-Moore.

9. Real-World Connections

9.1 Industry Applications

grep-style tools for literal search
antivirus and signature scanning

GNU grep (Boyer-Moore core for literals)
ripgrep (uses prefilters for literals)

9.3 Interview Relevance

Explaining sublinear behavior and preprocessing trade-offs

10. Resources

10.1 Essential Reading

Boyer & Moore original paper
CLRS string matching chapter

10.2 Tools & Documentation

hyperfine for benchmarking
perf for profiling

Previous: P01-naive-string-search-baseline
Next: P03-aho-corasick-multi-pattern-matching

11. Self-Assessment Checklist

11.1 Understanding

I can explain both BM heuristics.
I can reason about safe shifts.

11.2 Implementation

Matches equal naive baseline for all fixtures.
Benchmark output is deterministic.

11.3 Growth

I can describe when BM is not ideal.

12. Submission / Completion Criteria

Minimum Viable Completion:

Correct BM matcher with bad-character
Passes baseline tests

Full Completion:

Good-suffix heuristic implemented
Benchmark mode with speedup

Excellence (Going Above & Beyond):

Galil optimization
Adaptive fallback to naive for short patterns