Project 7: SIMD memchr Implementation
A SIMD-accelerated memchr that finds a byte in a buffer at 15+ bytes per CPU cycle using SSE2/AVX2.
Quick Reference
| Attribute | Value |
|---|---|
| Primary Language | C (with intrinsics) |
| Alternative Languages | Rust, Assembly |
| Difficulty | Level 4: Expert |
| Time Estimate | 1 week |
| Knowledge Area | SIMD / Low-Level Optimization |
| Tooling | SSE2/AVX2 Intrinsics |
| Prerequisites | Understanding of pointers, basic assembly concepts |
What You Will Build
A SIMD-accelerated memchr that finds a byte in a buffer at 15+ bytes per CPU cycle using SSE2/AVX2.
Why It Matters
This project builds core skills that appear repeatedly in real-world systems and tooling.
Core Challenges
- SIMD vector loading (_mm_loadu_si128) → maps to loading 16 bytes at once
- Parallel comparison (_mm_cmpeq_epi8) → maps to comparing 16 bytes simultaneously
- Mask extraction (_mm_movemask_epi8) → maps to converting results to bitmask
- Bit scanning (__builtin_ctz) → maps to finding first match in mask
- Alignment handling → maps to memory alignment requirements
Key Concepts
- SSE2 Intrinsics: Intel Intrinsics Guide (intel.com)
- Vector Comparison: Compare 16/32 bytes in a single instruction
- Movemask: Convert comparison result to 16/32-bit integer
- Bit Manipulation: Find first set bit in mask
Real-World Outcome
$ ./simd_memchr '\n' large_file.txt
Searching for newlines in 1GB file...
Implementation comparison:
Naive (byte-by-byte): 1200ms (0.83 GB/s)
Standard memchr: 150ms (6.67 GB/s)
SSE2 memchr: 80ms (12.5 GB/s)
AVX2 memchr: 45ms (22.2 GB/s)
Throughput: 22.2 GB/s = ~9 bytes per cycle on 2.5GHz CPU
Disassembly of hot loop:
vmovdqu ymm0, [rdi] ; Load 32 bytes
vpcmpeqb ymm1, ymm0, ymm2 ; Compare with search byte
vpmovmskb eax, ymm1 ; Extract comparison mask
test eax, eax ; Any matches?
jnz found ; If yes, handle match
add rdi, 32 ; Advance pointer
jmp loop ; Continue
Implementation Guide
- Reproduce the simplest happy-path scenario.
- Build the smallest working version of the core feature.
- Add input validation and error handling.
- Add instrumentation/logging to confirm behavior.
- Refactor into clean modules with tests.
Milestones
- Milestone 1: Minimal working program that runs end-to-end.
- Milestone 2: Correct outputs for typical inputs.
- Milestone 3: Robust handling of edge cases.
- Milestone 4: Clean structure and documented usage.
Validation Checklist
- Output matches the real-world outcome example
- Handles invalid inputs safely
- Provides clear errors and exit codes
- Repeatable results across runs
References
- Main guide:
TEXT_SEARCH_TOOLS_DEEP_DIVE.md - “Intel Intrinsics Guide” (online reference)