Project 7: SIMD memchr Implementation

Build a SIMD-accelerated byte search (memchr) that scans 16-32 bytes per instruction and outperforms naive byte loops.

Quick Reference

| Attribute | Value |
|-----------|-------|
| Difficulty | Level 4: Expert |
| Time Estimate | 1 week |
| Main Programming Language | C |
| Alternative Programming Languages | Rust |
| Coolness Level | Level 4: Hardcore |
| Business Potential | Level 2: Systems libs |
| Prerequisites | P01, basic SIMD concepts, CPU intrinsics |
| Key Topics | SIMD, vector compares, alignment, memchr algorithm |

1. Learning Objectives

By completing this project, you will:

  1. Implement SIMD byte comparison using SSE2/AVX2 intrinsics.
  2. Handle alignment, unaligned loads, and tail processing safely.
  3. Produce a memchr that beats naive loops on large buffers.
  4. Understand how to extract match masks from SIMD vectors.
  5. Benchmark with fixed inputs for deterministic results.

2. All Theory Needed (Per-Concept Breakdown)

Fundamentals

SIMD (Single Instruction, Multiple Data) allows a CPU to perform the same operation on multiple data elements in parallel. For memchr, this means comparing a block of 16 bytes (SSE2) or 32 bytes (AVX2) against a target byte in a single instruction. The result is a mask indicating which byte positions matched. You then find the first set bit to get the earliest match in the block.

The basic algorithm is: broadcast the target byte into a vector, load a block of bytes from the buffer, compare for equality, produce a bitmask, and if any bit is set, find the index of the first match. Otherwise, move to the next block.

Deep Dive into the concept

The main challenges are alignment and tail handling. Unaligned loads are usually safe on modern x86, but you must ensure you never read beyond the allocated buffer. A common approach is to process the head up to an aligned boundary with a scalar loop, then use SIMD for the middle, and handle the tail with a scalar loop again. Another approach is to allow unaligned loads but stop at n - vector_width and process the tail safely.

Extracting the mask uses intrinsics such as _mm_movemask_epi8 for SSE2 and _mm256_movemask_epi8 for AVX2. The resulting integer has one bit per byte of the vector. You can then use __builtin_ctz to find the least significant set bit, which is fast and branch-light; note that __builtin_ctz is undefined for a zero argument, so only call it after checking that the mask is nonzero.

Performance is often limited by memory bandwidth rather than compute. That means SIMD can make a huge difference on large buffers but will not help much on very small buffers. A hybrid strategy that falls back to scalar for short lengths is best.

How this fits into other projects

This memchr is the building block for fast literal searches (P08), Teddy multi-literal prefilters (P09), and line counting (P12).

Definitions & key terms

  • SIMD -> parallel operations on vector registers
  • movemask -> extract comparison results into a bitmask
  • alignment -> address boundary for efficient loads
  • tail -> leftover bytes after SIMD blocks

Mental model diagram (ASCII)

bytes:  [b0 b1 b2 ... b15]
vector: [ t  t  t ...  t]
compare -> mask: 0010000000000100   (one bit per byte; LSB = byte 0)
ctz(mask) -> first match index

How it works (step-by-step)

  1. Broadcast target byte into a SIMD register.
  2. Load a vector of bytes from the buffer.
  3. Compare equality, produce a mask.
  4. If mask != 0, find first set bit and return offset.
  5. Otherwise, advance by vector width and repeat.

Minimal concrete example

/* requires <emmintrin.h> */
__m128i needle = _mm_set1_epi8((char)target);           /* broadcast target byte */
__m128i block  = _mm_loadu_si128((const __m128i *)ptr); /* unaligned 16-byte load */
__m128i eq     = _mm_cmpeq_epi8(block, needle);         /* 0xFF where bytes match */
int mask       = _mm_movemask_epi8(eq);                 /* one bit per byte */

Common misconceptions

  • Misconception: SIMD always faster. Correction: For tiny buffers, scalar can be faster.
  • Misconception: Unaligned loads are unsafe. Correction: They are safe if you stay within bounds.

Check-your-understanding questions

  1. What does movemask return for a comparison?
  2. Why do you need tail handling?
  3. When does SIMD stop helping?

Check-your-understanding answers

  1. A bitmask with one bit per byte indicating equality.
  2. SIMD works in fixed-width blocks; leftovers remain.
  3. On very small buffers or when memory bandwidth dominates.

Real-world applications

  • libc memchr implementations
  • high-speed search tools and parsers

Where you’ll apply it

This memchr is the inner loop of the literal search in P08, the Teddy prefilter in P09, and the line counter in P12.

References

  • Intel Intrinsics Guide (SSE2/AVX2)
  • memchr crate documentation

Key insights

SIMD memchr is essentially a vectorized equality test plus a fast bit scan.

Summary

SIMD turns byte-by-byte search into block-by-block search and unlocks large speedups on big inputs.

Homework/Exercises to practice the concept

  1. Implement scalar memchr and count comparisons.
  2. Add SSE2 and measure speedups on 1MB buffers.

Solutions to the homework/exercises

  1. A simple loop comparing each byte.
  2. Use hyperfine and compare SSE2 vs scalar timings.

3. Project Specification

3.1 What You Will Build

A CLI tool simd-memchr that searches for a single byte in a buffer or file using SIMD. It should support SSE2 and optionally AVX2, with automatic fallback to scalar if the CPU lacks SIMD support.

3.2 Functional Requirements

  1. SIMD search: SSE2 implementation required; AVX2 optional.
  2. Scalar fallback: for short buffers or no SIMD.
  3. Benchmark mode: compare scalar vs SIMD.
  4. Output: first match offset or “not found”.

3.3 Non-Functional Requirements

  • Performance: 5x+ speedup on large buffers.
  • Reliability: never read out of bounds.
  • Usability: clear CPU feature detection output.

3.4 Example Usage / Output

$ ./simd-memchr --byte 0x0a big.txt
first match at offset 14523

3.5 Data Formats / Schemas / Protocols

Simple text output; optional JSON:

{"byte":10,"offset":14523,"algorithm":"sse2"}

3.6 Edge Cases

  • Empty input buffer
  • Byte not present
  • Very short buffers (< vector width)

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

make
./simd-memchr --byte 0x0a fixtures/large.txt

3.7.2 Golden Path Demo (Deterministic)

Use fixed fixtures and report deterministic offsets.

3.7.3 CLI Transcript (Success + Failure)

$ ./simd-memchr --byte 0x0a fixtures/large.txt
first match at offset 1024
exit code: 0

$ ./simd-memchr --byte 0xff fixtures/empty.txt
not found
exit code: 1

4. Solution Architecture

4.1 High-Level Design

+-----------+   +---------+   +-----------+   +-----------+
| CLI Parse |-->| Loader  |-->| SIMD Loop |-->| Formatter |
+-----------+   +---------+   +-----------+   +-----------+

4.2 Key Components

| Component | Responsibility | Key Decisions |
|-----------|----------------|---------------|
| Feature detect | SSE2/AVX2 support | CPUID check |
| SIMD search | vector compare | SSE2 baseline |
| Scalar fallback | short buffers | simple loop |

4.3 Data Structures (No Full Code)

struct Result { size_t offset; int found; };

4.4 Algorithm Overview

  1. If len < vector width, use scalar.
  2. Otherwise loop in vector chunks:
    • load, compare, movemask
    • if mask != 0, find first set bit
  3. Handle tail bytes.

Complexity Analysis

  • Time: O(n/vec_width) comparisons
  • Space: O(1)

5. Implementation Guide

5.1 Development Environment Setup

cc -O3 -march=native -o simd-memchr src/main.c

5.2 Project Structure

simd-memchr/
├── src/
│   ├── main.c
│   ├── simd.c
│   └── scalar.c
├── fixtures/
└── Makefile

5.3 The Core Question You’re Answering

“How can I compare 16 or 32 bytes at once safely and find the first match?”

5.4 Concepts You Must Understand First

  1. SIMD registers and intrinsics.
  2. movemask extraction.
  3. Tail and alignment handling.

5.5 Questions to Guide Your Design

  1. How will you avoid reading past the buffer end?
  2. How will you detect CPU SIMD support?
  3. When do you fall back to scalar?

5.6 Thinking Exercise

Given a mask 0b00101000, which byte index is the first match?

5.7 The Interview Questions They’ll Ask

  1. How does movemask work and why is it useful?
  2. Why do SIMD searches still need scalar tails?

5.8 Hints in Layers

Hint 1: Start with SSE2 and unaligned loads.
Hint 2: Use __builtin_ctz on the mask to find the first match.
Hint 3: Add AVX2 and compare performance.

5.9 Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| SIMD basics | “Programming Massively Parallel Processors” | vector intro |
| x86 intrinsics | Intel Intrinsics Guide | SSE2/AVX2 |

5.10 Implementation Phases

Phase 1: Scalar Baseline (1 day)

  • Implement scalar memchr.
  • Checkpoint: correctness on fixtures.

Phase 2: SIMD Core (2 days)

  • Implement SSE2 memchr.
  • Checkpoint: matches scalar results.

Phase 3: Benchmarking (2 days)

  • Add --bench mode and AVX2 optional path.
  • Checkpoint: speedup on large buffers.

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
|----------|---------|----------------|-----------|
| Loads | aligned vs unaligned | unaligned | simpler, still fast |
| Tail handling | scalar loop | scalar | safe and simple |


6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
|----------|---------|----------|
| Unit | movemask and bit scan | synthetic masks |
| Integration | match correctness | fixtures |
| Performance | speedup checks | 1MB+ buffers |

6.2 Critical Test Cases

  1. Byte present in first block.
  2. Byte only in tail.
  3. Byte not present.

6.3 Test Data

text: "abcd..." (1MB)
byte: 'z'
expected: not found

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution |
|---------|---------|----------|
| Out-of-bounds load | crash | stop at n - vec_width |
| Wrong mask handling | incorrect offset | use ctz on mask |
| AVX2 mismatch | wrong result | align mask width to vector width |

7.2 Debugging Strategies

  • Compare SIMD results to scalar for random buffers.
  • Use sanitizers to catch OOB reads.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add reverse search (last occurrence).
  • Support searching multiple bytes (any-of set).

8.2 Intermediate Extensions

  • Add AVX2 path and runtime dispatch.
  • Implement memrchr with SIMD.

8.3 Advanced Extensions

  • Implement a two-byte SIMD prefilter for memmem.
  • Add ARM NEON version.

9. Real-World Connections

9.1 Industry Applications

  • glibc and other libc memchr implementations
  • high-speed search tools and parsers
  • the Rust memchr crate

9.2 Interview Relevance

  • Explaining SIMD speedups and memory bandwidth limits

10. Resources

10.1 Essential Reading

  • Intel Intrinsics Guide
  • memchr crate docs

10.2 Tools & Documentation

  • perf for profiling
  • objdump -d to inspect generated instructions

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain movemask and bit scanning.
  • I can explain tail handling.

11.2 Implementation

  • SIMD results match scalar baseline.
  • Benchmark shows clear speedup.

11.3 Growth

  • I can describe the limits of SIMD speedups.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • SSE2 memchr with correct results

Full Completion:

  • Benchmark mode + scalar fallback

Excellence (Going Above & Beyond):

  • AVX2 and NEON implementations
  • Reverse search support