Project 8: SIMD Substring Search (memmem)
A SIMD-accelerated substring search using the “two-byte trick” or similar techniques. Find candidates with SIMD, verify with memcmp.
Quick Reference
| Attribute | Value |
|---|---|
| Primary Language | C (with intrinsics) or Rust |
| Alternative Languages | C++, Assembly |
| Difficulty | Level 4: Expert |
| Time Estimate | 1-2 weeks |
| Knowledge Area | SIMD / String Algorithms |
| Tooling | SSE2/AVX2 Intrinsics |
| Prerequisites | Project 7 (SIMD memchr) |
What You Will Build
A SIMD-accelerated substring search using the “two-byte trick” or similar techniques. Find candidates with SIMD, verify with memcmp.
Why It Matters
This project builds core skills that appear repeatedly in real-world systems and tooling.
Core Challenges
- Two-byte candidate finding → maps to reducing false positives
- Vectorized candidate verification → maps to batched memcmp
- Handling variable needle lengths → maps to algorithm selection
- Combining with Boyer-Moore for long needles → maps to hybrid approaches
Key Concepts
- Two-Way Algorithm: Used by glibc memmem
- SIMD Candidate Finding: Search for first+last byte simultaneously
- False Positive Rate: Why two bytes is much better than one byte
- Hybrid Strategies: Different algorithms for different needle lengths
Real-World Outcome
$ ./simd_memmem "function" source_code.txt
Searching for "function" in 100MB file...
Algorithm selection:
Needle length: 8 bytes
Strategy: SIMD two-byte + verify
Benchmark:
Naive strstr: 250ms
glibc memmem: 45ms
SIMD two-byte: 12ms
Boyer-Moore: 18ms (better for longer needles)
False positive analysis:
First byte 'f' matches: 2,345,678 times
First two bytes "fu" match: 12,345 times (190x reduction!)
Full "function" matches: 8,234 times
Disassembly insight:
Searching for 'f' AND 'n' (first and last byte) simultaneously
reduces candidates by 99.5%
Implementation Guide
- Reproduce the simplest happy-path scenario.
- Build the smallest working version of the core feature.
- Add input validation and error handling.
- Add instrumentation/logging to confirm behavior.
- Refactor into clean modules with tests.
Milestones
- Milestone 1: Minimal working program that runs end-to-end.
- Milestone 2: Correct outputs for typical inputs.
- Milestone 3: Robust handling of edge cases.
- Milestone 4: Clean structure and documented usage.
Validation Checklist
- Output matches the real-world outcome example
- Handles invalid inputs safely
- Provides clear errors and exit codes
- Repeatable results across runs
References
- Main guide:
TEXT_SEARCH_TOOLS_DEEP_DIVE.md - “Intel Intrinsics Guide” + Hyperscan papers