Project 8: SIMD Substring Search (memmem)

Build a SIMD-accelerated substring search that finds a multi-byte needle in a large haystack efficiently.

Quick Reference

Attribute Value
Difficulty Level 4: Expert
Time Estimate 1-2 weeks
Main Programming Language C (Alternatives: Rust)
Alternative Programming Languages Rust
Coolness Level Level 4: Hardcore
Business Potential Level 2: Systems libs
Prerequisites P07 SIMD memchr, basic string algorithms
Key Topics SIMD prefiltering, candidate selection, memmem strategies

1. Learning Objectives

By completing this project, you will:

  1. Implement a fast substring search using SIMD prefilters.
  2. Combine SIMD candidate detection with scalar verification.
  3. Handle small needles with specialized fast paths.
  4. Compare performance against naive and Boyer-Moore.
  5. Explain correctness and bounds for SIMD scanning.

2. All Theory Needed (Per-Concept Breakdown)

Fundamentals

Substring search (memmem) finds a multi-byte needle in a byte array. A common optimization is to use a prefilter: search for a single anchor byte (often the first or last needle byte) with SIMD, and only verify candidates when that anchor matches. This drastically reduces the number of full comparisons. For example, for needle “ABCD”, you can search for ‘A’ or ‘D’ and then check the surrounding bytes.

The SIMD part handles candidate detection; the scalar part verifies full equality. This hybrid approach is fast and simple, and forms the basis of many optimized memmem implementations.

Deep Dive into the concept

Choosing the anchor byte matters. If the needle has a rare byte, use that as the anchor to reduce candidate checks. If all bytes are common, you can use two anchors (first and last) and require both to match within the SIMD block, which reduces false positives further. Another approach is to use a small rolling hash or a two-byte compare. But the simplest approach is often fastest: memchr for the first byte, then verify.

For needles of length 1 or 2, specialized code paths are much faster than the general case. For very small needles, vectorized compare of multiple positions at once can be used. For longer needles, Boyer-Moore or Two-Way algorithms can be competitive, but SIMD prefiltering still wins on modern CPUs.

Ensuring correctness requires careful boundary handling. When you find a candidate anchor at position i, you must check that i + needle_len <= haystack_len before comparing the full needle. SIMD loops should stop at n - needle_len to avoid out-of-bounds reads.

How this fits on projects

This is the literal matcher that many regex engines will use as a prefilter (P06) and is a key building block for Teddy (P09) and the final search tool (P15).

Definitions & key terms

  • memmem -> substring search for a byte sequence
  • anchor byte -> selected needle byte used for fast candidate detection
  • prefilter -> fast check before full verification
  • verification -> full needle comparison at candidate position

Mental model diagram (ASCII)

haystack:  ... A . . . A B C D ...
needle:           A B C D
anchor search finds 'A' positions, then verify full needle

How it works (step-by-step)

  1. Choose anchor byte (e.g., needle[0]).
  2. Use SIMD memchr to find anchor occurrences.
  3. For each candidate, verify full needle matches.
  4. Return the first match or all matches as needed.

Minimal concrete example

size_t pos = simd_memchr(hay, n, needle[0]);
if (pos != NOT_FOUND && pos + m <= n && memcmp(hay+pos, needle, m) == 0) return pos;

Common misconceptions

  • Misconception: SIMD can compare the whole needle at once. Correction: SIMD helps find candidate positions; full comparison is still needed.
  • Misconception: Always use the first byte as anchor. Correction: A rare byte can be a better anchor.

Check-your-understanding questions

  1. Why is a rare anchor byte better?
  2. What happens if you forget to bound-check candidates?
  3. Why do we keep scalar verification?

Check-your-understanding answers

  1. It yields fewer candidate checks.
  2. You might read past the buffer and crash.
  3. SIMD prefilter only checks one byte, not the full needle.

Real-world applications

  • libc memmem implementations
  • search tools for literal patterns

Where you’ll apply it

References

  • 0x80 SIMD substring search article
  • memchr crate docs

Key insights

A good anchor plus SIMD turns substring search into a fast candidate filter.

Summary

SIMD memmem blends a fast anchor search with precise verification for speed and correctness.

Homework/Exercises to practice the concept

  1. Compare using first vs last byte as anchor on random text.
  2. Implement a two-byte anchor prefilter and benchmark.

Solutions to the homework/exercises

  1. Measure candidate counts; last byte may be rarer.
  2. Use SIMD to compare two anchors and filter candidates.

3. Project Specification

3.1 What You Will Build

A CLI tool simd-memmem that searches for a literal substring in a file using SIMD anchor prefiltering and scalar verification. It should expose --anchor to choose first/last byte or auto-select.

3.2 Functional Requirements

  1. Anchor selection: first, last, or auto (rarest byte).
  2. SIMD search: use SIMD memchr for anchors.
  3. Verification: full memcmp for each candidate.
  4. Output: first match offset or all matches (--all).
  5. Bench: compare against naive and BM.

3.3 Non-Functional Requirements

  • Performance: significant speedup on large files.
  • Reliability: never read out of bounds.
  • Usability: clear output and stats.

3.4 Example Usage / Output

$ ./simd-memmem --anchor auto "needle" big.txt
first match at offset 1048576 (anchor='d')

3.5 Data Formats / Schemas / Protocols

Simple output plus optional JSON stats.

3.6 Edge Cases

  • Needle length 1 or 2
  • Needle longer than haystack
  • No match

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

make
./simd-memmem --anchor auto "TODO" fixtures/large.txt

3.7.2 Golden Path Demo (Deterministic)

Use fixed fixtures with known offsets and --all to verify ordering.

3.7.3 CLI Transcript (Success + Failure)

$ ./simd-memmem --anchor first "aba" fixtures/abab.txt
match at offset 0
match at offset 2
exit code: 0

$ ./simd-memmem "toolong" fixtures/short.txt
not found
exit code: 0

4. Solution Architecture

4.1 High-Level Design

+-----------+   +-----------+   +-----------+   +-----------+
| CLI Parse |-->| Anchor Sel|-->| SIMD Scan |-->| Verify     |
+-----------+   +-----------+   +-----------+   +-----------+

4.2 Key Components

| Component | Responsibility | Key Decisions | |———–|—————-|—————| | Anchor selector | choose byte | auto-select by frequency | | SIMD scan | find anchor positions | reuse P07 memchr | | Verifier | full compare | memcmp or manual loop |

4.3 Data Structures (No Full Code)

struct Match { size_t offset; };

4.4 Algorithm Overview

  1. Choose anchor byte.
  2. SIMD scan for anchors.
  3. For each anchor, verify full needle.
  4. Emit matches in order.

Complexity Analysis

  • Prefilter: O(n/vec_width)
  • Verification: O(k*m) where k is candidates

5. Implementation Guide

5.1 Development Environment Setup

cc -O3 -march=native -o simd-memmem src/main.c

5.2 Project Structure

simd-memmem/
├── src/
│   ├── main.c
│   ├── anchor.c
│   ├── simd_scan.c
│   └── verify.c
├── fixtures/
└── Makefile

5.3 The Core Question You’re Answering

“How can I reduce full needle comparisons by quickly filtering candidate positions?”

5.4 Concepts You Must Understand First

  1. SIMD memchr from P07.
  2. Anchor selection and selectivity.
  3. Overlap handling in substring search.

5.5 Questions to Guide Your Design

  1. How will you choose a rare anchor byte?
  2. How will you ensure the candidate is within bounds?
  3. How will you report all matches and overlaps?

5.6 Thinking Exercise

Given needle “abcab”, which anchor would you choose and why?

5.7 The Interview Questions They’ll Ask

  1. Why is anchor selection important?
  2. How do you balance SIMD and scalar verification?

5.8 Hints in Layers

Hint 1: Start with anchor = first byte. Hint 2: Add last-byte anchor option. Hint 3: Add auto-select by frequency.

5.9 Books That Will Help

| Topic | Book | Chapter | |——-|——|———| | String search | CLRS | String matching | | SIMD | Intel Intrinsics Guide | SSE2/AVX2 |

5.10 Implementation Phases

Phase 1: Anchor + Verification (2 days)

  • Use scalar memchr as anchor.
  • Checkpoint: correct matches.

Phase 2: SIMD Prefilter (2 days)

  • Integrate SIMD memchr.
  • Checkpoint: matches equal baseline.

Phase 3: Heuristics + Bench (2 days)

  • Auto anchor selection and benchmark.
  • Checkpoint: speedup on fixtures.

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale | |———-|———|—————-|———–| | Anchor | first vs last vs auto | auto | best selectivity | | Verification | memcmp vs manual | memcmp | simplicity |


6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples | |———-|———|———-| | Unit | anchor selection | rare byte choice | | Integration | matches vs baseline | fixtures | | Edge | short needles | length 1,2 |

6.2 Critical Test Cases

  1. Needle length 1 uses memchr.
  2. Needle length 2 uses specialized path.
  3. Overlaps are reported correctly.

6.3 Test Data

text: "ababab"
needle: "aba"
expected: offsets 0,2

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution | |———|———|———-| | Anchor out of bounds | crash | bound-check candidate | | Skipped overlaps | missing matches | advance by 1 after match | | Bad anchor heuristic | low speedup | pick rare byte |

7.2 Debugging Strategies

  • Compare with naive matcher for random inputs.
  • Print candidate positions to verify anchor selection.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add case-insensitive mode.
  • Add reverse search.

8.2 Intermediate Extensions

  • Two-byte anchor prefilter.
  • Support UTF-8 aware matching.

8.3 Advanced Extensions

  • Integrate Boyer-Moore fallback for long needles.
  • Add adaptive strategy based on needle length.

9. Real-World Connections

9.1 Industry Applications

  • Search engines for literal patterns
  • Fast parsers and tokenizers
  • memchr crate
  • glibc memmem

9.3 Interview Relevance

  • Discussing SIMD prefilters and cache effects

10. Resources

10.1 Essential Reading

  • SIMD substring search articles (0x80)
  • memchr crate docs

10.2 Tools & Documentation

  • hyperfine for benchmarks
  • perf for profiling

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain anchor-based prefiltering.
  • I can reason about overlap handling.

11.2 Implementation

  • SIMD results match scalar baseline.
  • Auto anchor selection works.

11.3 Growth

  • I can describe when BM might beat SIMD.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • SIMD anchor prefilter + scalar verification

Full Completion:

  • Auto anchor selection + benchmarks

Excellence (Going Above & Beyond):

  • Two-byte anchor prefilter
  • Adaptive strategy based on needle length