Project 2: Boyer-Moore String Search

Build a Boyer-Moore matcher with bad-character and good-suffix heuristics, then benchmark it against your naive baseline.

Quick Reference

Attribute Value
Difficulty Level 3: Advanced
Time Estimate 1 week
Main Programming Language C (Alternatives: Rust, Zig)
Alternative Programming Languages Rust, Zig
Coolness Level Level 3: Clever
Business Potential Level 2: Niche Tooling
Prerequisites P01, arrays, pointers, basic complexity
Key Topics Boyer-Moore, bad-character table, good-suffix table, sublinear search

1. Learning Objectives

By completing this project, you will:

  1. Implement Boyer-Moore with correct preprocessing tables.
  2. Explain why and when Boyer-Moore is sublinear in practice.
  3. Handle overlapping matches and correct shifts.
  4. Compare performance against naive search on real files.
  5. Design deterministic benchmarks with fixed inputs.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Boyer-Moore Heuristics

Fundamentals

Boyer-Moore matches the pattern from right to left and uses information from mismatches to skip ahead by multiple positions. Two heuristics drive those skips: the bad-character rule and the good-suffix rule. The bad-character rule says: if a mismatch occurs at pattern index j with text character c, you can shift the pattern so that the last occurrence of c in the pattern aligns with the mismatch position. If c is not in the pattern, you can shift past the mismatch entirely. The good-suffix rule says: if a suffix of the pattern matched before the mismatch, shift so that another occurrence of that suffix aligns, or align the longest proper prefix with the matched suffix.

These rules let Boyer-Moore skip more than one position at a time, often making it sublinear for random text. It shines when the pattern is longer and the alphabet is reasonably large.

Deep Dive into the concept

The bad-character table is typically a 256-entry table for bytes. For each byte value, it stores the last index where that byte appears in the pattern. On mismatch at index j, you compute shift = max(1, j - last_occurrence[c]). This is simple and fast.

The good-suffix table is more subtle. Suppose the pattern suffix P[j+1..m-1] matched the text, but P[j] mismatched. You want to shift so that a different occurrence of that suffix in the pattern aligns with the text, or align the longest prefix of P that matches the suffix. This requires preprocessing the pattern to compute border positions (prefixes that are also suffixes). Many implementations compute two arrays: suffix (length of the longest suffix ending at position i) and shift (the amount to shift for each mismatch position). The preprocessing cost is O(m), and the search is O(n) in the worst case for Boyer-Moore with good-suffix and bad-character heuristics, though there are pathological cases; in practice it is very fast.

You must also be careful with overlapping matches. Boyer-Moore can skip past overlaps if you always shift by the full pattern length after a match. The correct behavior is to compute the shift after a match using the good-suffix table for j = -1 (full match), which typically aligns the longest proper prefix with the matched suffix to allow overlaps.

How this fits on projects

You will use this heuristic engine as a core literal matcher in later projects (P06 literal extraction, P15 ripgrep-style integration) and for benchmarking against SIMD-based methods.

Definitions & key terms

  • bad-character rule -> shift based on last occurrence of mismatched character
  • good-suffix rule -> shift based on matched suffix alignment
  • border -> a prefix that is also a suffix
  • sublinear -> average comparisons less than n

Mental model diagram (ASCII)

pattern:    a b c d a b
text:           a b x d a b
match from right -> mismatch at 'c' vs 'x'
shift using bad-character / good-suffix

How it works (step-by-step)

  1. Precompute bad-character table for all byte values.
  2. Precompute good-suffix shift table for each pattern index.
  3. Align pattern at position i and compare from right to left.
  4. On mismatch at j, compute shift = max(bad_char, good_suffix).
  5. On full match, record it and shift using good-suffix table to allow overlaps.

Minimal concrete example

int last[256];
for (int i = 0; i < 256; i++) last[i] = -1;
for (int i = 0; i < m; i++) last[(unsigned char)pat[i]] = i;

Common misconceptions

  • Misconception: Boyer-Moore is always faster. Correction: For very short patterns or tiny alphabets, it can be slower.
  • Misconception: After match, shift by m. Correction: Use good-suffix shift to allow overlaps.

Check-your-understanding questions

  1. Why do we match from right to left in Boyer-Moore?
  2. What does the bad-character table store for each byte?
  3. How does the good-suffix rule prevent skipping valid matches?

Check-your-understanding answers

  1. So that mismatches give information about the rightmost aligned char.
  2. The last index where the byte occurs in the pattern (or -1).
  3. It aligns a matching suffix with another occurrence or a prefix.

Real-world applications

  • GNU grep and other tuned literal searchers
  • Antivirus signature scanning for fixed patterns

Where you’ll apply it

References

  • Boyer & Moore (1977), “A Fast String Searching Algorithm”
  • CLRS, String Matching chapter

Key insights

Boyer-Moore gains speed by skipping alignments that cannot possibly match.

Summary

Boyer-Moore trades preprocessing for large skips, often beating naive search by an order of magnitude on real data.

Homework/Exercises to practice the concept

  1. Implement only bad-character and measure speedup.
  2. Add good-suffix and compare again.

Solutions to the homework/exercises

  1. The bad-character rule alone can already yield major gains for large alphabets.
  2. Good-suffix improves worst cases and enables overlap correctness.

3. Project Specification

3.1 What You Will Build

A CLI tool bm-search that implements Boyer-Moore for literal search with both bad-character and good-suffix heuristics, reports matches, and provides a benchmark mode against the naive baseline.

3.2 Functional Requirements

  1. Pattern preprocessing: build bad-character and good-suffix tables.
  2. Search: report all matches with overlaps.
  3. Benchmark: --bench compares with naive search on the same input.
  4. Output: text and JSON modes like P01.
  5. Deterministic mode: fixed timers for tests (--fixed-ts).

3.3 Non-Functional Requirements

  • Performance: beat naive search by 2x+ on 10MB ASCII text.
  • Reliability: correct on all overlap cases.
  • Usability: clear stats and shift counts in bench mode.

3.4 Example Usage / Output

$ ./bm-search "needle" big.txt
big.txt:10234: needle

$ ./bm-search --bench "needle" big.txt
naive: 42ms, bm: 8ms, speedup: 5.25x

3.5 Data Formats / Schemas / Protocols

Same as P01 for match output; add benchmark line:

bench: naive_ms=<n> bm_ms=<n> speedup=<n>

3.6 Edge Cases

  • Repeated patterns like “aaaaaa”
  • Very small patterns (length 1 or 2)
  • Pattern not found

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

make
./bm-search "TODO" repo.txt
./bm-search --bench "TODO" repo.txt

3.7.2 Golden Path Demo (Deterministic)

Use fixtures/english_1mb.txt and fixed timers to produce stable benchmark output.

3.7.3 CLI Transcript (Success + Failure)

$ ./bm-search --fixed-ts "ana" fixtures/banana.txt
fixtures/banana.txt:1:2: banana
fixtures/banana.txt:1:4: banana
exit code: 0

$ ./bm-search --bench
error: missing PATTERN and FILE
exit code: 2

4. Solution Architecture

4.1 High-Level Design

+-----------+   +------------------+   +----------------+
| CLI Parse |-->| Preprocess Tables|-->| Search Loop    |
+-----------+   +------------------+   +-------+--------+
                                                |
                                                v
                                       +--------+-------+
                                       | Formatter      |
                                       +----------------+

4.2 Key Components

| Component | Responsibility | Key Decisions | |———–|—————-|—————| | Preprocessor | bad-char and good-suffix tables | O(m) preprocessing | | Search loop | right-to-left compare | compute max(bad, good) shift | | Bench | timing and stats | fixed inputs for determinism |

4.3 Data Structures (No Full Code)

int bad_char[256];
int good_suffix[m + 1];

4.4 Algorithm Overview

  1. Build bad-character table.
  2. Build good-suffix table.
  3. Align pattern at i, compare from right to left.
  4. On mismatch, compute shift and advance i.
  5. On match, record and shift for overlaps.

Complexity Analysis

  • Preprocessing: O(m)
  • Search: typically sublinear, worst-case O(n*m)

5. Implementation Guide

5.1 Development Environment Setup

cc -O2 -Wall -Wextra -o bm-search src/main.c

5.2 Project Structure

bm-search/
├── src/
│   ├── main.c
│   ├── preprocess.c
│   ├── bm_search.c
│   └── formatter.c
├── fixtures/
│   └── english_1mb.txt
├── tests/
└── Makefile

5.3 The Core Question You’re Answering

“How can a string matcher use mismatches to skip ahead without missing valid matches?”

5.4 Concepts You Must Understand First

  1. Bad-character table and how to compute it.
  2. Good-suffix table and border computation.
  3. Overlap handling after a full match.

5.5 Questions to Guide Your Design

  1. What shift is safe after a mismatch at index j?
  2. How do you handle patterns of length 1?
  3. How will you validate correctness vs the naive baseline?

5.6 Thinking Exercise

Manually compute bad-character shifts for pattern “ABCDABD” and text “ABC ABCDAB ABCDABCDABDE”.

5.7 The Interview Questions They’ll Ask

  1. Why is Boyer-Moore sublinear in practice?
  2. What are the bad-character and good-suffix heuristics?

5.8 Hints in Layers

Hint 1: Implement bad-character only first. Hint 2: Add good-suffix; verify with overlap tests. Hint 3: Compare with naive on the same fixture files.

5.9 Books That Will Help

| Topic | Book | Chapter | |——-|——|———| | Boyer-Moore | CLRS | Ch. 32 | | String algorithms | “Algorithms” (Sedgewick) | String Processing |

5.10 Implementation Phases

Phase 1: Preprocessing (2 days)

  • Build bad-character table and tests.
  • Checkpoint: table matches expected values.

Phase 2: Search Loop (2 days)

  • Implement right-to-left comparisons.
  • Checkpoint: matches equal naive baseline.

Phase 3: Benchmarking (1 day)

  • Add --bench mode and deterministic fixtures.
  • Checkpoint: speedup reported consistently.

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale | |———-|———|—————-|———–| | Table size | 256 bytes vs Unicode | 256 bytes | start with byte search | | Shift on match | m vs good-suffix | good-suffix | preserves overlaps |


6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples | |———-|———|———-| | Unit | preprocessing tables | known patterns | | Integration | matcher vs naive | fixtures | | Edge | repeated chars | “aaaaa” patterns |

6.2 Critical Test Cases

  1. Pattern “ana” in “banana” -> offsets 1, 3.
  2. Pattern “aaaa” in “aaaaaa” -> overlaps.
  3. Pattern not found -> no matches.

6.3 Test Data

text: "abababab"
pattern: "abab"
expected offsets: 0,2,4

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution | |———|———|———-| | Wrong good-suffix table | missing matches | test against known cases | | Shift < 1 | infinite loop | enforce minimum shift of 1 | | Overlap skipped | fewer results | use match shift table |

7.2 Debugging Strategies

  • Log mismatch positions and shifts on small inputs.
  • Compare match offsets to naive output.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add case-insensitive mode.
  • Print number of character comparisons.

8.2 Intermediate Extensions

  • Support Unicode with codepoint tables.
  • Add heuristics to fall back to naive for short patterns.

8.3 Advanced Extensions

  • Implement Galil optimization.
  • Add SIMD prefiltering before Boyer-Moore.

9. Real-World Connections

9.1 Industry Applications

  • grep-style tools for literal search
  • antivirus and signature scanning
  • GNU grep (Boyer-Moore core for literals)
  • ripgrep (uses prefilters for literals)

9.3 Interview Relevance

  • Explaining sublinear behavior and preprocessing trade-offs

10. Resources

10.1 Essential Reading

  • Boyer & Moore original paper
  • CLRS string matching chapter

10.2 Tools & Documentation

  • hyperfine for benchmarking
  • perf for profiling

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain both BM heuristics.
  • I can reason about safe shifts.

11.2 Implementation

  • Matches equal naive baseline for all fixtures.
  • Benchmark output is deterministic.

11.3 Growth

  • I can describe when BM is not ideal.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Correct BM matcher with bad-character
  • Passes baseline tests

Full Completion:

  • Good-suffix heuristic implemented
  • Benchmark mode with speedup

Excellence (Going Above & Beyond):

  • Galil optimization
  • Adaptive fallback to naive for short patterns