Project 7: SIMD memchr Implementation

Build a SIMD-accelerated byte search (memchr) that scans 16-32 bytes per instruction and outperforms naive byte loops.

Quick Reference

| Attribute | Value |
|-----------|-------|
| Difficulty | Level 4: Expert |
| Time Estimate | 1 week |
| Main Programming Language | C |
| Alternative Programming Languages | Rust |
| Coolness Level | Level 4: Hardcore |
| Business Potential | Level 2: Systems libs |
| Prerequisites | P01, basic SIMD concepts, CPU intrinsics |
| Key Topics | SIMD, vector compares, alignment, memchr algorithm |

1. Learning Objectives

By completing this project, you will:

  1. Implement SIMD byte comparison using SSE2/AVX2 intrinsics.
  2. Handle alignment, unaligned loads, and tail processing safely.
  3. Produce a memchr that beats naive loops on large buffers.
  4. Understand how to extract match masks from SIMD vectors.
  5. Benchmark with fixed inputs for deterministic results.

2. All Theory Needed (Per-Concept Breakdown)

Fundamentals

SIMD (Single Instruction, Multiple Data) allows a CPU to perform the same operation on multiple data elements in parallel. For memchr, this means comparing a block of 16 bytes (SSE2) or 32 bytes (AVX2) against a target byte in a single instruction. The result is a mask indicating which byte positions matched. You then find the first set bit to get the earliest match in the block.

The basic algorithm is: broadcast the target byte into a vector, load a block of bytes from the buffer, compare for equality, produce a bitmask, and if any bit is set, find the index of the first match. Otherwise, move to the next block.

Deep Dive into the concept

The main challenges are alignment and tail handling. Unaligned loads are usually safe on modern x86, but you must ensure you never read beyond the allocated buffer. A common approach is to process the head up to an aligned boundary with a scalar loop, then use SIMD for the middle, and handle the tail with a scalar loop again. Another approach is to allow unaligned loads but stop at n - vector_width and process the tail safely.

Extracting the mask uses intrinsics such as _mm_movemask_epi8 for SSE2 and _mm256_movemask_epi8 for AVX2. The resulting integer has one bit per byte of the vector. You can then use __builtin_ctz to find the least significant set bit, which is fast and branch-light; note that __builtin_ctz is undefined for a zero argument, so only call it after checking that the mask is nonzero.

Performance is often limited by memory bandwidth rather than compute. That means SIMD can make a huge difference on large buffers but will not help much on very small buffers. A hybrid strategy that falls back to scalar for short lengths is best.

How this fits into other projects

This memchr is the building block for fast literal searches (P08), Teddy multi-literal prefilters (P09), and line counting (P12).

Definitions & key terms

  • SIMD -> parallel operations on vector registers
  • movemask -> extract comparison results into a bitmask
  • alignment -> address boundary for efficient loads
  • tail -> leftover bytes after SIMD blocks

Mental model diagram (ASCII)

bytes:  [b0 b1 b2 ... b15]
vector: [ t  t  t ...  t]
compare -> mask: 0010000000000100   (one bit per byte; LSB = byte 0)
ctz(mask) -> first match index

How it works (step-by-step)

  1. Broadcast target byte into a SIMD register.
  2. Load a vector of bytes from the buffer.
  3. Compare equality, produce a mask.
  4. If mask != 0, find first set bit and return offset.
  5. Otherwise, advance by vector width and repeat.

Minimal concrete example

/* requires <emmintrin.h> */
__m128i needle = _mm_set1_epi8((char)target);           /* broadcast target byte */
__m128i block  = _mm_loadu_si128((const __m128i *)ptr); /* unaligned 16-byte load */
__m128i eq     = _mm_cmpeq_epi8(block, needle);         /* 0xFF where bytes match */
int mask       = _mm_movemask_epi8(eq);                 /* one bit per byte */

Common misconceptions

  • Misconception: SIMD always faster. Correction: For tiny buffers, scalar can be faster.
  • Misconception: Unaligned loads are unsafe. Correction: They are safe if you stay within bounds.

Check-your-understanding questions

  1. What does movemask return for a comparison?
  2. Why do you need tail handling?
  3. When does SIMD stop helping?

Check-your-understanding answers

  1. A bitmask with one bit per byte indicating equality.
  2. SIMD works in fixed-width blocks; leftovers remain.
  3. On very small buffers or when memory bandwidth dominates.

Real-world applications

  • libc memchr implementations
  • high-speed search tools and parsers

Where you’ll apply it

This memchr is the inner loop of the literal search in P08, the Teddy prefilter in P09, and the line counter in P12.

References

  • Intel Intrinsics Guide (SSE2/AVX2)
  • memchr crate documentation

Key insights

SIMD memchr is essentially a vectorized equality test plus a fast bit scan.

Summary

SIMD turns byte-by-byte search into block-by-block search and unlocks large speedups on big inputs.

Homework/Exercises to practice the concept

  1. Implement scalar memchr and count comparisons.
  2. Add SSE2 and measure speedups on 1MB buffers.

Solutions to the homework/exercises

  1. A simple loop comparing each byte.
  2. Use hyperfine and compare SSE2 vs scalar timings.

3. Project Specification

3.1 What You Will Build

A CLI tool simd-memchr that searches for a single byte in a buffer or file using SIMD. It should support SSE2 and optionally AVX2, with automatic fallback to scalar if the CPU lacks SIMD support.

3.2 Functional Requirements

  1. SIMD search: SSE2 implementation required; AVX2 optional.
  2. Scalar fallback: for short buffers or no SIMD.
  3. Benchmark mode: compare scalar vs SIMD.
  4. Output: first match offset or “not found”.

3.3 Non-Functional Requirements

  • Performance: 5x+ speedup on large buffers.
  • Reliability: never read out of bounds.
  • Usability: clear CPU feature detection output.

3.4 Example Usage / Output

$ ./simd-memchr --byte 0x0a big.txt
first match at offset 14523

3.5 Data Formats / Schemas / Protocols

Simple text output; optional JSON:

{"byte":10,"offset":14523,"algorithm":"sse2"}

3.6 Edge Cases

  • Empty input buffer
  • Byte not present
  • Very short buffers (< vector width)

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

make
./simd-memchr --byte 0x0a fixtures/large.txt

3.7.2 Golden Path Demo (Deterministic)

Use fixed fixtures and report deterministic offsets.

3.7.3 CLI Transcript (Success + Failure)

$ ./simd-memchr --byte 0x0a fixtures/large.txt
first match at offset 1024
exit code: 0

$ ./simd-memchr --byte 0xff fixtures/empty.txt
not found
exit code: 1

4. Solution Architecture

4.1 High-Level Design

+-----------+   +---------+   +-----------+   +-----------+
| CLI Parse |-->| Loader  |-->| SIMD Loop |-->| Formatter |
+-----------+   +---------+   +-----------+   +-----------+

4.2 Key Components

| Component | Responsibility | Key Decisions |
|-----------|----------------|---------------|
| Feature detect | SSE2/AVX2 support | CPUID check |
| SIMD search | vector compare | SSE2 baseline |
| Scalar fallback | short buffers | simple loop |

4.3 Data Structures (No Full Code)

struct Result { size_t offset; int found; };

4.4 Algorithm Overview

  1. If len < vector width, use scalar.
  2. Otherwise loop in vector chunks:
    • load, compare, movemask
    • if mask != 0, find first set bit
  3. Handle tail bytes.

Complexity Analysis

  • Time: O(n/vec_width) comparisons
  • Space: O(1)

5. Implementation Guide

5.1 Development Environment Setup

cc -O3 -march=native -o simd-memchr src/main.c

5.2 Project Structure

simd-memchr/
├── src/
│   ├── main.c
│   ├── simd.c
│   └── scalar.c
├── fixtures/
└── Makefile

5.3 The Core Question You’re Answering

“How can I compare 16 or 32 bytes at once safely and find the first match?”

5.4 Concepts You Must Understand First

  1. SIMD registers and intrinsics.
  2. movemask extraction.
  3. Tail and alignment handling.

5.5 Questions to Guide Your Design

  1. How will you avoid reading past the buffer end?
  2. How will you detect CPU SIMD support?
  3. When do you fall back to scalar?

5.6 Thinking Exercise

Given a mask 0b00101000, which byte index is the first match?

5.7 The Interview Questions They’ll Ask

  1. How does movemask work and why is it useful?
  2. Why do SIMD searches still need scalar tails?

5.8 Hints in Layers

Hint 1: Start with SSE2 and unaligned loads.
Hint 2: Use __builtin_ctz on the mask to find the first match.
Hint 3: Add AVX2 and compare performance.

5.9 Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| SIMD basics | “Programming Massively Parallel Processors” | vector intro |
| x86 intrinsics | Intel Intrinsics Guide | SSE2/AVX2 |

5.10 Implementation Phases

Phase 1: Scalar Baseline (1 day)

  • Implement scalar memchr.
  • Checkpoint: correctness on fixtures.

Phase 2: SIMD Core (2 days)

  • Implement SSE2 memchr.
  • Checkpoint: matches scalar results.

Phase 3: Benchmarking (2 days)

  • Add --bench mode and AVX2 optional path.
  • Checkpoint: speedup on large buffers.

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
|----------|---------|----------------|-----------|
| Loads | aligned vs unaligned | unaligned | simpler, still fast |
| Tail handling | scalar loop | scalar | safe and simple |


6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
|----------|---------|----------|
| Unit | movemask and bit scan | synthetic masks |
| Integration | match correctness | fixtures |
| Performance | speedup checks | 1MB+ buffers |

6.2 Critical Test Cases

  1. Byte present in first block.
  2. Byte only in tail.
  3. Byte not present.

6.3 Test Data

text: "abcd..." (1MB)
byte: 'z'
expected: not found

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution |
|---------|---------|----------|
| Out-of-bounds load | crash | stop at n - vec_width |
| Wrong mask handling | incorrect offset | use ctz on mask |
| AVX2 mismatch | wrong result | align mask width to vector width |

7.2 Debugging Strategies

  • Compare SIMD results to scalar for random buffers.
  • Use sanitizers to catch OOB reads.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add reverse search (last occurrence).
  • Support searching multiple bytes (any-of set).

8.2 Intermediate Extensions

  • Add AVX2 path and runtime dispatch.
  • Implement memrchr with SIMD.

8.3 Advanced Extensions

  • Implement a two-byte SIMD prefilter for memmem.
  • Add ARM NEON version.

9. Real-World Connections

9.1 Industry Applications

  • glibc and other libc memchr implementations
  • high-speed search tools and parsers
  • the Rust memchr crate

9.2 Interview Relevance

  • Explaining SIMD speedups and memory bandwidth limits

10. Resources

10.1 Essential Reading

  • Intel Intrinsics Guide
  • memchr crate docs

10.2 Tools & Documentation

  • perf for profiling
  • objdump -d to inspect generated instructions

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain movemask and bit scanning.
  • I can explain tail handling.

11.2 Implementation

  • SIMD results match scalar baseline.
  • Benchmark shows clear speedup.

11.3 Growth

  • I can describe the limits of SIMD speedups.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • SSE2 memchr with correct results

Full Completion:

  • Benchmark mode + scalar fallback

Excellence (Going Above & Beyond):

  • AVX2 and NEON implementations
  • Reverse search support