Project 8: SIMD Substring Search (memmem)

Build a SIMD-accelerated substring search that finds a multi-byte needle in a large haystack efficiently.

Quick Reference

Attribute	Value
Difficulty	Level 4: Expert
Time Estimate	1-2 weeks
Main Programming Language	C (Alternatives: Rust)
Alternative Programming Languages	Rust
Coolness Level	Level 4: Hardcore
Business Potential	Level 2: Systems libs
Prerequisites	P07 SIMD memchr, basic string algorithms
Key Topics	SIMD prefiltering, candidate selection, memmem strategies

1. Learning Objectives

By completing this project, you will:

Implement a fast substring search using SIMD prefilters.
Combine SIMD candidate detection with scalar verification.
Handle small needles with specialized fast paths.
Compare performance against naive and Boyer-Moore.
Explain correctness and bounds for SIMD scanning.

2. All Theory Needed (Per-Concept Breakdown)

2.1 SIMD-Accelerated Substring Search

Fundamentals

Substring search (memmem) finds a multi-byte needle in a byte array. A common optimization is to use a prefilter: search for a single anchor byte (often the first or last needle byte) with SIMD, and only verify candidates when that anchor matches. This drastically reduces the number of full comparisons. For example, for needle “ABCD”, you can search for ‘A’ or ‘D’ and then check the surrounding bytes.

The SIMD part handles candidate detection; the scalar part verifies full equality. This hybrid approach is fast and simple, and forms the basis of many optimized memmem implementations.

Deep Dive into the concept

Choosing the anchor byte matters. If the needle has a rare byte, use that as the anchor to reduce candidate checks. If all bytes are common, you can use two anchors (first and last) and require both to match within the SIMD block, which reduces false positives further. Another approach is to use a small rolling hash or a two-byte compare. But the simplest approach is often fastest: memchr for the first byte, then verify.

For needles of length 1 or 2, specialized code paths are much faster than the general case. For very small needles, vectorized compare of multiple positions at once can be used. For longer needles, Boyer-Moore or Two-Way algorithms can be competitive, but SIMD prefiltering still wins on modern CPUs.

Ensuring correctness requires careful boundary handling. When you find a candidate anchor at position i, you must check that i + needle_len <= haystack_len before comparing the full needle. SIMD loops should stop at n - needle_len to avoid out-of-bounds reads.

How this fits on projects

This is the literal matcher that many regex engines will use as a prefilter (P06) and is a key building block for Teddy (P09) and the final search tool (P15).

Definitions & key terms

memmem -> substring search for a byte sequence
anchor byte -> selected needle byte used for fast candidate detection
prefilter -> fast check before full verification
verification -> full needle comparison at candidate position

Mental model diagram (ASCII)

haystack:  ... A . . . A B C D ...
needle:           A B C D
anchor search finds 'A' positions, then verify full needle

How it works (step-by-step)

Choose anchor byte (e.g., needle[0]).
Use SIMD memchr to find anchor occurrences.
For each candidate, verify full needle matches.
Return the first match or all matches as needed.

Minimal concrete example

size_t pos = simd_memchr(hay, n, needle[0]);
if (pos != NOT_FOUND && pos + m <= n && memcmp(hay+pos, needle, m) == 0) return pos;

Common misconceptions

Misconception: SIMD can compare the whole needle at once. Correction: SIMD helps find candidate positions; full comparison is still needed.
Misconception: Always use the first byte as anchor. Correction: A rare byte can be a better anchor.

Check-your-understanding questions

Why is a rare anchor byte better?
What happens if you forget to bound-check candidates?
Why do we keep scalar verification?

Check-your-understanding answers

It yields fewer candidate checks.
You might read past the buffer and crash.
SIMD prefilter only checks one byte, not the full needle.

Real-world applications

libc memmem implementations
search tools for literal patterns

Where you’ll apply it

This project: Section 4.4 Algorithm Overview and Section 6.2 Tests.
Also used in: P09-teddy-algorithm-multi-literal-simd, P15-build-your-own-ripgrep.

References

0x80 SIMD substring search article
memchr crate docs

Key insights

A good anchor plus SIMD turns substring search into a fast candidate filter.

Summary

SIMD memmem blends a fast anchor search with precise verification for speed and correctness.

Homework/Exercises to practice the concept

Compare using first vs last byte as anchor on random text.
Implement a two-byte anchor prefilter and benchmark.

Solutions to the homework/exercises

Measure candidate counts; last byte may be rarer.
Use SIMD to compare two anchors and filter candidates.

3. Project Specification

3.1 What You Will Build

A CLI tool simd-memmem that searches for a literal substring in a file using SIMD anchor prefiltering and scalar verification. It should expose --anchor to choose first/last byte or auto-select.

3.2 Functional Requirements

Anchor selection: first, last, or auto (rarest byte).
SIMD search: use SIMD memchr for anchors.
Verification: full memcmp for each candidate.
Output: first match offset or all matches (--all).
Bench: compare against naive and BM.

3.3 Non-Functional Requirements

Performance: significant speedup on large files.
Reliability: never read out of bounds.
Usability: clear output and stats.

3.4 Example Usage / Output

$ ./simd-memmem --anchor auto "needle" big.txt
first match at offset 1048576 (anchor='d')

3.5 Data Formats / Schemas / Protocols

Simple output plus optional JSON stats.

3.6 Edge Cases

Needle length 1 or 2
Needle longer than haystack
No match

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

make
./simd-memmem --anchor auto "TODO" fixtures/large.txt

3.7.2 Golden Path Demo (Deterministic)

Use fixed fixtures with known offsets and --all to verify ordering.

3.7.3 CLI Transcript (Success + Failure)

$ ./simd-memmem --anchor first "aba" fixtures/abab.txt
match at offset 0
match at offset 2
exit code: 0

$ ./simd-memmem "toolong" fixtures/short.txt
not found
exit code: 0

4. Solution Architecture

4.1 High-Level Design

+-----------+   +-----------+   +-----------+   +-----------+
| CLI Parse |-->| Anchor Sel|-->| SIMD Scan |-->| Verify     |
+-----------+   +-----------+   +-----------+   +-----------+

4.2 Key Components

4.3 Data Structures (No Full Code)

struct Match { size_t offset; };

4.4 Algorithm Overview

Choose anchor byte.
SIMD scan for anchors.
For each anchor, verify full needle.
Emit matches in order.

Complexity Analysis

Prefilter: O(n/vec_width)
Verification: O(k*m) where k is candidates

5. Implementation Guide

5.1 Development Environment Setup

cc -O3 -march=native -o simd-memmem src/main.c

5.2 Project Structure

simd-memmem/
├── src/
│   ├── main.c
│   ├── anchor.c
│   ├── simd_scan.c
│   └── verify.c
├── fixtures/
└── Makefile

5.3 The Core Question You’re Answering

“How can I reduce full needle comparisons by quickly filtering candidate positions?”

5.4 Concepts You Must Understand First

SIMD memchr from P07.
Anchor selection and selectivity.
Overlap handling in substring search.

5.5 Questions to Guide Your Design

How will you choose a rare anchor byte?
How will you ensure the candidate is within bounds?
How will you report all matches and overlaps?

5.6 Thinking Exercise

Given needle “abcab”, which anchor would you choose and why?

5.7 The Interview Questions They’ll Ask

Why is anchor selection important?
How do you balance SIMD and scalar verification?

5.8 Hints in Layers

Hint 1: Start with anchor = first byte. Hint 2: Add last-byte anchor option. Hint 3: Add auto-select by frequency.

5.9 Books That Will Help

5.10 Implementation Phases

Phase 1: Anchor + Verification (2 days)

Use scalar memchr as anchor.
Checkpoint: correct matches.

Phase 2: SIMD Prefilter (2 days)

Integrate SIMD memchr.
Checkpoint: matches equal baseline.

Phase 3: Heuristics + Bench (2 days)

Auto anchor selection and benchmark.
Checkpoint: speedup on fixtures.

5.11 Key Implementation Decisions

6. Testing Strategy

6.1 Test Categories

6.2 Critical Test Cases

Needle length 1 uses memchr.
Needle length 2 uses specialized path.
Overlaps are reported correctly.

6.3 Test Data

text: "ababab"
needle: "aba"
expected: offsets 0,2

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

7.2 Debugging Strategies

Compare with naive matcher for random inputs.
Print candidate positions to verify anchor selection.

8. Extensions & Challenges

8.1 Beginner Extensions

Add case-insensitive mode.
Add reverse search.

8.2 Intermediate Extensions

Two-byte anchor prefilter.
Support UTF-8 aware matching.

8.3 Advanced Extensions

Integrate Boyer-Moore fallback for long needles.
Add adaptive strategy based on needle length.

9. Real-World Connections

9.1 Industry Applications

Search engines for literal patterns
Fast parsers and tokenizers

memchr crate
glibc memmem

9.3 Interview Relevance

Discussing SIMD prefilters and cache effects

10. Resources

10.1 Essential Reading

SIMD substring search articles (0x80)
memchr crate docs

10.2 Tools & Documentation

hyperfine for benchmarks
perf for profiling

Previous: P07-simd-memchr-implementation
Next: P09-teddy-algorithm-multi-literal-simd

11. Self-Assessment Checklist

11.1 Understanding

I can explain anchor-based prefiltering.
I can reason about overlap handling.

11.2 Implementation

SIMD results match scalar baseline.
Auto anchor selection works.

11.3 Growth

I can describe when BM might beat SIMD.

12. Submission / Completion Criteria

Minimum Viable Completion:

SIMD anchor prefilter + scalar verification

Full Completion:

Auto anchor selection + benchmarks

Excellence (Going Above & Beyond):

Two-byte anchor prefilter
Adaptive strategy based on needle length