Project 10: SIMD-Powered String Search

An implementation of the strstr function (find a substring in a larger string) that is accelerated using SIMD (Single Instruction, Multiple Data) intrinsics. You will compare its performance against the standard library’s strstr and a naive C implementation.

Quick Reference

Attribute	Value
Primary Language	C
Alternative Languages	C++, Rust
Difficulty	Level 4: Expert
Time Estimate	2-3 weeks
Knowledge Area	SIMD / Vectorization / Low-Level Optimization
Tooling	GCC/Clang with Intrinsics (`<immintrin.h>`)
Prerequisites	Strong C and pointer skills, basic assembly knowledge is helpful.

What You Will Build

An implementation of the strstr function (find a substring in a larger string) that is accelerated using SIMD (Single Instruction, Multiple Data) intrinsics. You will compare its performance against the standard library’s strstr and a naive C implementation.

Why It Matters

This project builds core skills that appear repeatedly in real-world systems and tooling.

Core Challenges

Understanding SIMD concepts → maps to vector registers (XMM/YMM/ZMM), packed data, and lanes
Using compiler intrinsics → maps to writing C code that maps directly to specific assembly instructions (e.g., _mm256_load_si256)
Designing a SIMD-friendly algorithm → maps to rethinking a simple search into a parallel operation
Handling edge cases → maps to unaligned data and leftover bytes at the end of a string

Key Concepts

SIMD (Single Instruction, Multiple Data): Intel Intrinsics Guide (web search).
AVX/SSE Intrinsics: GCC documentation on x86 intrinsics.
Data Alignment for Vectorization: Many blog posts cover why aligned loads are faster.

Real-World Outcome

$ ./simd_strstr
Searching for a 16-byte needle in a 1GB haystack...

  Naive C strstr: 1.2 seconds
  Standard library strstr: 0.3 seconds
  SIMD-accelerated strstr: 0.08 seconds

  -> SIMD is 15x faster than naive C and 3.75x faster than the optimized standard library.

Implementation Guide

Reproduce the simplest happy-path scenario.
Build the smallest working version of the core feature.
Add input validation and error handling.
Add instrumentation/logging to confirm behavior.
Refactor into clean modules with tests.

Milestones

Milestone 1: Minimal working program that runs end-to-end.
Milestone 2: Correct outputs for typical inputs.
Milestone 3: Robust handling of edge cases.
Milestone 4: Clean structure and documented usage.

Validation Checklist

Output matches the real-world outcome example
Handles invalid inputs safely
Provides clear errors and exit codes
Repeatable results across runs

References

Main guide: LEARN_C_PERFORMANCE_DEEP_DIVE.md
“Modern X86 Assembly Language Programming” by Daniel Kusswurm (for concepts)