Project 10: SIMD-Powered String Search

An implementation of the strstr function (find a substring in a larger string) that is accelerated using SIMD (Single Instruction, Multiple Data) intrinsics. You will compare its performance against the standard library’s strstr and a naive C implementation.

Quick Reference

Attribute Value
Primary Language C
Alternative Languages C++, Rust
Difficulty Level 4: Expert
Time Estimate 2-3 weeks
Knowledge Area SIMD / Vectorization / Low-Level Optimization
Tooling GCC/Clang with Intrinsics (<immintrin.h>)
Prerequisites Strong C and pointer skills, basic assembly knowledge is helpful.

What You Will Build

An implementation of the strstr function (find a substring in a larger string) that is accelerated using SIMD (Single Instruction, Multiple Data) intrinsics. You will compare its performance against the standard library’s strstr and a naive C implementation.

Why It Matters

This project builds core skills that appear repeatedly in real-world systems and tooling.

Core Challenges

  • Understanding SIMD concepts → maps to vector registers (XMM/YMM/ZMM), packed data, and lanes
  • Using compiler intrinsics → maps to writing C code that maps directly to specific assembly instructions (e.g., _mm256_load_si256)
  • Designing a SIMD-friendly algorithm → maps to rethinking a simple search into a parallel operation
  • Handling edge cases → maps to unaligned data and leftover bytes at the end of a string

Key Concepts

  • SIMD (Single Instruction, Multiple Data): Intel Intrinsics Guide (web search).
  • AVX/SSE Intrinsics: GCC documentation on x86 intrinsics.
  • Data Alignment for Vectorization: Many blog posts cover why aligned loads are faster.

Real-World Outcome

$ ./simd_strstr
Searching for a 16-byte needle in a 1GB haystack...

  Naive C strstr: 1.2 seconds
  Standard library strstr: 0.3 seconds
  SIMD-accelerated strstr: 0.08 seconds

  -> SIMD is 15x faster than naive C and 3.75x faster than the optimized standard library.

Implementation Guide

  1. Reproduce the simplest happy-path scenario.
  2. Build the smallest working version of the core feature.
  3. Add input validation and error handling.
  4. Add instrumentation/logging to confirm behavior.
  5. Refactor into clean modules with tests.

Milestones

  • Milestone 1: Minimal working program that runs end-to-end.
  • Milestone 2: Correct outputs for typical inputs.
  • Milestone 3: Robust handling of edge cases.
  • Milestone 4: Clean structure and documented usage.

Validation Checklist

  • Output matches the real-world outcome example
  • Handles invalid inputs safely
  • Provides clear errors and exit codes
  • Repeatable results across runs

References

  • Main guide: LEARN_C_PERFORMANCE_DEEP_DIVE.md
  • “Modern X86 Assembly Language Programming” by Daniel Kusswurm (for concepts)