Project 8: SIMD Substring Search (memmem)
Build a SIMD-accelerated substring search that finds a multi-byte needle in a large haystack efficiently.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 4: Expert |
| Time Estimate | 1-2 weeks |
| Main Programming Language | C (Alternatives: Rust) |
| Alternative Programming Languages | Rust |
| Coolness Level | Level 4: Hardcore |
| Business Potential | Level 2: Systems libs |
| Prerequisites | P07 SIMD memchr, basic string algorithms |
| Key Topics | SIMD prefiltering, candidate selection, memmem strategies |
1. Learning Objectives
By completing this project, you will:
- Implement a fast substring search using SIMD prefilters.
- Combine SIMD candidate detection with scalar verification.
- Handle small needles with specialized fast paths.
- Compare performance against naive and Boyer-Moore.
- Explain correctness and bounds for SIMD scanning.
2. All Theory Needed (Per-Concept Breakdown)
2.1 SIMD-Accelerated Substring Search
Fundamentals
Substring search (memmem) finds a multi-byte needle in a byte array. A common optimization is to use a prefilter: search for a single anchor byte (often the first or last needle byte) with SIMD, and only verify candidates when that anchor matches. This drastically reduces the number of full comparisons. For example, for needle “ABCD”, you can search for ‘A’ or ‘D’ and then check the surrounding bytes.
The SIMD part handles candidate detection; the scalar part verifies full equality. This hybrid approach is fast and simple, and forms the basis of many optimized memmem implementations.
Deep Dive into the concept
Choosing the anchor byte matters. If the needle has a rare byte, use that as the anchor to reduce candidate checks. If all bytes are common, you can use two anchors (first and last) and require both to match within the SIMD block, which reduces false positives further. Another approach is to use a small rolling hash or a two-byte compare. But the simplest approach is often fastest: memchr for the first byte, then verify.
For needles of length 1 or 2, specialized code paths are much faster than the general case. For very small needles, vectorized compare of multiple positions at once can be used. For longer needles, Boyer-Moore or Two-Way algorithms can be competitive, but SIMD prefiltering still wins on modern CPUs.
Ensuring correctness requires careful boundary handling. When you find a candidate anchor at position i, you must check that i + needle_len <= haystack_len before comparing the full needle. SIMD loops should stop at n - needle_len to avoid out-of-bounds reads.
How this fits on projects
This is the literal matcher that many regex engines will use as a prefilter (P06) and is a key building block for Teddy (P09) and the final search tool (P15).
Definitions & key terms
- memmem -> substring search for a byte sequence
- anchor byte -> selected needle byte used for fast candidate detection
- prefilter -> fast check before full verification
- verification -> full needle comparison at candidate position
Mental model diagram (ASCII)
haystack: ... A . . . A B C D ...
needle: A B C D
anchor search finds 'A' positions, then verify full needle
How it works (step-by-step)
- Choose anchor byte (e.g., needle[0]).
- Use SIMD memchr to find anchor occurrences.
- For each candidate, verify full needle matches.
- Return the first match or all matches as needed.
Minimal concrete example
size_t pos = simd_memchr(hay, n, needle[0]);
if (pos != NOT_FOUND && pos + m <= n && memcmp(hay+pos, needle, m) == 0) return pos;
Common misconceptions
- Misconception: SIMD can compare the whole needle at once. Correction: SIMD helps find candidate positions; full comparison is still needed.
- Misconception: Always use the first byte as anchor. Correction: A rare byte can be a better anchor.
Check-your-understanding questions
- Why is a rare anchor byte better?
- What happens if you forget to bound-check candidates?
- Why do we keep scalar verification?
Check-your-understanding answers
- It yields fewer candidate checks.
- You might read past the buffer and crash.
- SIMD prefilter only checks one byte, not the full needle.
Real-world applications
- libc memmem implementations
- search tools for literal patterns
Where you’ll apply it
- This project: Section 4.4 Algorithm Overview and Section 6.2 Tests.
- Also used in: P09-teddy-algorithm-multi-literal-simd, P15-build-your-own-ripgrep.
References
- 0x80 SIMD substring search article
- memchr crate docs
Key insights
A good anchor plus SIMD turns substring search into a fast candidate filter.
Summary
SIMD memmem blends a fast anchor search with precise verification for speed and correctness.
Homework/Exercises to practice the concept
- Compare using first vs last byte as anchor on random text.
- Implement a two-byte anchor prefilter and benchmark.
Solutions to the homework/exercises
- Measure candidate counts; last byte may be rarer.
- Use SIMD to compare two anchors and filter candidates.
3. Project Specification
3.1 What You Will Build
A CLI tool simd-memmem that searches for a literal substring in a file using SIMD anchor prefiltering and scalar verification. It should expose --anchor to choose first/last byte or auto-select.
3.2 Functional Requirements
- Anchor selection: first, last, or auto (rarest byte).
- SIMD search: use SIMD memchr for anchors.
- Verification: full memcmp for each candidate.
- Output: first match offset or all matches (
--all). - Bench: compare against naive and BM.
3.3 Non-Functional Requirements
- Performance: significant speedup on large files.
- Reliability: never read out of bounds.
- Usability: clear output and stats.
3.4 Example Usage / Output
$ ./simd-memmem --anchor auto "needle" big.txt
first match at offset 1048576 (anchor='d')
3.5 Data Formats / Schemas / Protocols
Simple output plus optional JSON stats.
3.6 Edge Cases
- Needle length 1 or 2
- Needle longer than haystack
- No match
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
make
./simd-memmem --anchor auto "TODO" fixtures/large.txt
3.7.2 Golden Path Demo (Deterministic)
Use fixed fixtures with known offsets and --all to verify ordering.
3.7.3 CLI Transcript (Success + Failure)
$ ./simd-memmem --anchor first "aba" fixtures/abab.txt
match at offset 0
match at offset 2
exit code: 0
$ ./simd-memmem "toolong" fixtures/short.txt
not found
exit code: 0
4. Solution Architecture
4.1 High-Level Design
+-----------+ +-----------+ +-----------+ +-----------+
| CLI Parse |-->| Anchor Sel|-->| SIMD Scan |-->| Verify |
+-----------+ +-----------+ +-----------+ +-----------+
4.2 Key Components
| Component | Responsibility | Key Decisions | |———–|—————-|—————| | Anchor selector | choose byte | auto-select by frequency | | SIMD scan | find anchor positions | reuse P07 memchr | | Verifier | full compare | memcmp or manual loop |
4.3 Data Structures (No Full Code)
struct Match { size_t offset; };
4.4 Algorithm Overview
- Choose anchor byte.
- SIMD scan for anchors.
- For each anchor, verify full needle.
- Emit matches in order.
Complexity Analysis
- Prefilter: O(n/vec_width)
- Verification: O(k*m) where k is candidates
5. Implementation Guide
5.1 Development Environment Setup
cc -O3 -march=native -o simd-memmem src/main.c
5.2 Project Structure
simd-memmem/
├── src/
│ ├── main.c
│ ├── anchor.c
│ ├── simd_scan.c
│ └── verify.c
├── fixtures/
└── Makefile
5.3 The Core Question You’re Answering
“How can I reduce full needle comparisons by quickly filtering candidate positions?”
5.4 Concepts You Must Understand First
- SIMD memchr from P07.
- Anchor selection and selectivity.
- Overlap handling in substring search.
5.5 Questions to Guide Your Design
- How will you choose a rare anchor byte?
- How will you ensure the candidate is within bounds?
- How will you report all matches and overlaps?
5.6 Thinking Exercise
Given needle “abcab”, which anchor would you choose and why?
5.7 The Interview Questions They’ll Ask
- Why is anchor selection important?
- How do you balance SIMD and scalar verification?
5.8 Hints in Layers
Hint 1: Start with anchor = first byte. Hint 2: Add last-byte anchor option. Hint 3: Add auto-select by frequency.
5.9 Books That Will Help
| Topic | Book | Chapter | |——-|——|———| | String search | CLRS | String matching | | SIMD | Intel Intrinsics Guide | SSE2/AVX2 |
5.10 Implementation Phases
Phase 1: Anchor + Verification (2 days)
- Use scalar memchr as anchor.
- Checkpoint: correct matches.
Phase 2: SIMD Prefilter (2 days)
- Integrate SIMD memchr.
- Checkpoint: matches equal baseline.
Phase 3: Heuristics + Bench (2 days)
- Auto anchor selection and benchmark.
- Checkpoint: speedup on fixtures.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale | |———-|———|—————-|———–| | Anchor | first vs last vs auto | auto | best selectivity | | Verification | memcmp vs manual | memcmp | simplicity |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples | |———-|———|———-| | Unit | anchor selection | rare byte choice | | Integration | matches vs baseline | fixtures | | Edge | short needles | length 1,2 |
6.2 Critical Test Cases
- Needle length 1 uses memchr.
- Needle length 2 uses specialized path.
- Overlaps are reported correctly.
6.3 Test Data
text: "ababab"
needle: "aba"
expected: offsets 0,2
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution | |———|———|———-| | Anchor out of bounds | crash | bound-check candidate | | Skipped overlaps | missing matches | advance by 1 after match | | Bad anchor heuristic | low speedup | pick rare byte |
7.2 Debugging Strategies
- Compare with naive matcher for random inputs.
- Print candidate positions to verify anchor selection.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add case-insensitive mode.
- Add reverse search.
8.2 Intermediate Extensions
- Two-byte anchor prefilter.
- Support UTF-8 aware matching.
8.3 Advanced Extensions
- Integrate Boyer-Moore fallback for long needles.
- Add adaptive strategy based on needle length.
9. Real-World Connections
9.1 Industry Applications
- Search engines for literal patterns
- Fast parsers and tokenizers
9.2 Related Open Source Projects
- memchr crate
- glibc memmem
9.3 Interview Relevance
- Discussing SIMD prefilters and cache effects
10. Resources
10.1 Essential Reading
- SIMD substring search articles (0x80)
- memchr crate docs
10.2 Tools & Documentation
hyperfinefor benchmarksperffor profiling
10.3 Related Projects in This Series
- Previous: P07-simd-memchr-implementation
- Next: P09-teddy-algorithm-multi-literal-simd
11. Self-Assessment Checklist
11.1 Understanding
- I can explain anchor-based prefiltering.
- I can reason about overlap handling.
11.2 Implementation
- SIMD results match scalar baseline.
- Auto anchor selection works.
11.3 Growth
- I can describe when BM might beat SIMD.
12. Submission / Completion Criteria
Minimum Viable Completion:
- SIMD anchor prefilter + scalar verification
Full Completion:
- Auto anchor selection + benchmarks
Excellence (Going Above & Beyond):
- Two-byte anchor prefilter
- Adaptive strategy based on needle length