Project 8: SIMD Substring Search (memmem)

A SIMD-accelerated substring search using the “two-byte trick” or similar techniques. Find candidates with SIMD, verify with memcmp.

Quick Reference

Attribute Value
Primary Language C (with intrinsics) or Rust
Alternative Languages C++, Assembly
Difficulty Level 4: Expert
Time Estimate 1-2 weeks
Knowledge Area SIMD / String Algorithms
Tooling SSE2/AVX2 Intrinsics
Prerequisites Project 7 (SIMD memchr)

What You Will Build

A SIMD-accelerated substring search using the “two-byte trick” or similar techniques. Find candidates with SIMD, verify with memcmp.

Why It Matters

This project builds core skills that appear repeatedly in real-world systems and tooling.

Core Challenges

  • Two-byte candidate finding → maps to reducing false positives
  • Vectorized candidate verification → maps to batched memcmp
  • Handling variable needle lengths → maps to algorithm selection
  • Combining with Boyer-Moore for long needles → maps to hybrid approaches

Key Concepts

  • Two-Way Algorithm: Used by glibc memmem
  • SIMD Candidate Finding: Search for first+last byte simultaneously
  • False Positive Rate: Why two bytes is much better than one byte
  • Hybrid Strategies: Different algorithms for different needle lengths

Real-World Outcome

$ ./simd_memmem "function" source_code.txt
Searching for "function" in 100MB file...

Algorithm selection:
  Needle length: 8 bytes
  Strategy: SIMD two-byte + verify

Benchmark:
  Naive strstr:      250ms
  glibc memmem:       45ms
  SIMD two-byte:      12ms
  Boyer-Moore:        18ms (better for longer needles)

False positive analysis:
  First byte 'f' matches: 2,345,678 times
  First two bytes "fu" match: 12,345 times (190x reduction!)
  Full "function" matches: 8,234 times

Disassembly insight:
  Searching for 'f' AND 'n' (first and last byte) simultaneously
  reduces candidates by 99.5%

Implementation Guide

  1. Reproduce the simplest happy-path scenario.
  2. Build the smallest working version of the core feature.
  3. Add input validation and error handling.
  4. Add instrumentation/logging to confirm behavior.
  5. Refactor into clean modules with tests.

Milestones

  • Milestone 1: Minimal working program that runs end-to-end.
  • Milestone 2: Correct outputs for typical inputs.
  • Milestone 3: Robust handling of edge cases.
  • Milestone 4: Clean structure and documented usage.

Validation Checklist

  • Output matches the real-world outcome example
  • Handles invalid inputs safely
  • Provides clear errors and exit codes
  • Repeatable results across runs

References

  • Main guide: TEXT_SEARCH_TOOLS_DEEP_DIVE.md
  • “Intel Intrinsics Guide” + Hyperscan papers