Project 7: SIMD memchr Implementation

A SIMD-accelerated memchr that finds a byte in a buffer at 15+ bytes per CPU cycle using SSE2/AVX2.

Quick Reference

Attribute Value
Primary Language C (with intrinsics)
Alternative Languages Rust, Assembly
Difficulty Level 4: Expert
Time Estimate 1 week
Knowledge Area SIMD / Low-Level Optimization
Tooling SSE2/AVX2 Intrinsics
Prerequisites Understanding of pointers, basic assembly concepts

What You Will Build

A SIMD-accelerated memchr that finds a byte in a buffer at 15+ bytes per CPU cycle using SSE2/AVX2.

Why It Matters

This project builds core skills that appear repeatedly in real-world systems and tooling.

Core Challenges

  • SIMD vector loading (_mm_loadu_si128) → maps to loading 16 bytes at once
  • Parallel comparison (_mm_cmpeq_epi8) → maps to comparing 16 bytes simultaneously
  • Mask extraction (_mm_movemask_epi8) → maps to converting results to bitmask
  • Bit scanning (__builtin_ctz) → maps to finding first match in mask
  • Alignment handling → maps to memory alignment requirements

Key Concepts

  • SSE2 Intrinsics: Intel Intrinsics Guide (intel.com)
  • Vector Comparison: Compare 16/32 bytes in a single instruction
  • Movemask: Convert comparison result to 16/32-bit integer
  • Bit Manipulation: Find first set bit in mask

Real-World Outcome

$ ./simd_memchr '\n' large_file.txt
Searching for newlines in 1GB file...

Implementation comparison:
  Naive (byte-by-byte): 1200ms (0.83 GB/s)
  Standard memchr:       150ms (6.67 GB/s)
  SSE2 memchr:            80ms (12.5 GB/s)
  AVX2 memchr:            45ms (22.2 GB/s)

Throughput: 22.2 GB/s = ~9 bytes per cycle on 2.5GHz CPU

Disassembly of hot loop:
  vmovdqu  ymm0, [rdi]        ; Load 32 bytes
  vpcmpeqb ymm1, ymm0, ymm2   ; Compare with search byte
  vpmovmskb eax, ymm1         ; Extract comparison mask
  test     eax, eax           ; Any matches?
  jnz      found              ; If yes, handle match
  add      rdi, 32            ; Advance pointer
  jmp      loop               ; Continue

Implementation Guide

  1. Reproduce the simplest happy-path scenario.
  2. Build the smallest working version of the core feature.
  3. Add input validation and error handling.
  4. Add instrumentation/logging to confirm behavior.
  5. Refactor into clean modules with tests.

Milestones

  • Milestone 1: Minimal working program that runs end-to-end.
  • Milestone 2: Correct outputs for typical inputs.
  • Milestone 3: Robust handling of edge cases.
  • Milestone 4: Clean structure and documented usage.

Validation Checklist

  • Output matches the real-world outcome example
  • Handles invalid inputs safely
  • Provides clear errors and exit codes
  • Repeatable results across runs

References

  • Main guide: TEXT_SEARCH_TOOLS_DEEP_DIVE.md
  • “Intel Intrinsics Guide” (online reference)