P05: High-Performance String Search Library

Build a SIMD-accelerated string search library that rivals ripgrep’s core

Attribute	Value
Main Language	C with SIMD Intrinsics
Difficulty	Level 4: Expert
Coolness Level	Level 4: Hardcore Tech Flex
Knowledge Area	Algorithms, Low-Level Optimization
Key Tools	SSE/AVX intrinsics, perf, Compiler Explorer
Main Book	“Modern X86 Assembly Language Programming” - Daniel Kusswurm

Learning Objectives

By completing this project, you will:

Master SIMD programming — Write code that processes 32 bytes in a single instruction
Understand CPU architecture — Learn about vector registers, cache lines, and instruction pipelining
Implement real-world algorithms — Go beyond textbook string search to what production tools use
Profile systematically — Use perf to find bottlenecks and flame graphs to visualize them
Handle encoding correctly — Deal with UTF-8 without breaking multi-byte characters
Build runtime dispatch — Detect CPU features and select optimal code paths

The Core Question

“Why are tools like ripgrep so much faster than grep, and what makes string search fast at the hardware level?”

Every CS student learns Boyer-Moore and KMP, but these aren’t what make modern search tools fast. The real questions:

What is SIMD? How can you search 32 bytes in the time it takes to search 1?
Why does memory matter more than Big-O? A cache-friendly O(n) beats a cache-hostile O(log n)
How do you write code for different CPUs? SSE, AVX2, AVX-512, NEON—how do you support them all?

After this project, you’ll understand that performance isn’t about clever algorithms—it’s about feeding data to the CPU the way it wants.

Deep Theoretical Foundation

1. SIMD Fundamentals (Single Instruction, Multiple Data)

The Basic Idea

Instead of processing one byte at a time:

// Scalar: 32 iterations
for (int i = 0; i < 32; i++) {
    if (haystack[i] == 'x') found = i;
}

Process 32 bytes simultaneously:

// SIMD: 1 "iteration"
__m256i needle = _mm256_set1_epi8('x');      // 32 copies of 'x'
__m256i block = _mm256_loadu_si256(haystack); // Load 32 bytes
__m256i cmp = _mm256_cmpeq_epi8(block, needle); // Compare all 32
int mask = _mm256_movemask_epi8(cmp);         // Get bitmask of matches
// mask has 1 bit for each matching position

Vector Register Sizes

Technology	Register Size	Registers	Introduced
SSE	128 bits (16 bytes)	XMM0-XMM15	Pentium III (1999)
AVX	256 bits (32 bytes)	YMM0-YMM15	Sandy Bridge (2011)
AVX-512	512 bits (64 bytes)	ZMM0-ZMM31	Skylake-X (2017)

Visual Model

256-bit YMM register (AVX2):
┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐
│byte│byte│byte│byte│byte│byte│byte│byte│byte│byte│byte│byte│byte│byte│byte│byte│
│ 0  │ 1  │ 2  │ 3  │ 4  │ 5  │ 6  │ 7  │ 8  │ 9  │ 10 │ 11 │ 12 │ 13 │ 14 │ 15 │
├────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┤
│byte│byte│byte│byte│byte│byte│byte│byte│byte│byte│byte│byte│byte│byte│byte│byte│
│ 16 │ 17 │ 18 │ 19 │ 20 │ 21 │ 22 │ 23 │ 24 │ 25 │ 26 │ 27 │ 28 │ 29 │ 30 │ 31 │
└────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘

One instruction compares ALL 32 bytes against a target character!

2. Key SIMD Intrinsics for String Search

Loading Data

#include <immintrin.h>

// Load 32 bytes from memory (unaligned is fine on modern CPUs)
__m256i data = _mm256_loadu_si256((__m256i*)ptr);

// Load 32 bytes (aligned to 32-byte boundary, slightly faster)
__m256i data = _mm256_load_si256((__m256i*)aligned_ptr);

Broadcasting a Single Byte

// Create vector with 32 copies of a character
__m256i needle = _mm256_set1_epi8('x');
// Result: [x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x,
//          x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x]

Comparing Bytes

// Compare each byte position, result is 0xFF for match, 0x00 for mismatch
__m256i result = _mm256_cmpeq_epi8(data, needle);
// If data[5] == 'x', then result[5] = 0xFF
// If data[6] != 'x', then result[6] = 0x00

Extracting Match Positions

// Convert comparison result to bitmask (one bit per byte)
int mask = _mm256_movemask_epi8(result);
// If bytes 5 and 17 matched: mask = 0b00000000000000100000000000100000

// Find position of first match
int pos = __builtin_ctz(mask);  // Count trailing zeros = 5

3. The SIMD String Search Algorithm

Basic Approach: First-Character Filter

For pattern "cat" in haystack:

Broadcast 'c' (first char) to all 32 lanes
Load 32 bytes of haystack
Compare all 32 bytes against 'c'
For each match position:
   - Verify full pattern "cat" with memcmp
Advance 32 bytes, repeat

Code

const char* simd_search(const char* haystack, size_t h_len,
                        const char* needle, size_t n_len) {
    if (n_len == 0) return haystack;
    if (n_len > h_len) return NULL;

    __m256i first = _mm256_set1_epi8(needle[0]);

    size_t i = 0;
    while (i + 32 <= h_len) {
        __m256i block = _mm256_loadu_si256((__m256i*)(haystack + i));
        __m256i cmp = _mm256_cmpeq_epi8(block, first);
        uint32_t mask = _mm256_movemask_epi8(cmp);

        while (mask != 0) {
            int pos = __builtin_ctz(mask);  // Find next set bit

            // Check if full pattern matches
            if (i + pos + n_len <= h_len &&
                memcmp(haystack + i + pos, needle, n_len) == 0) {
                return haystack + i + pos;
            }

            mask &= mask - 1;  // Clear lowest set bit
        }

        i += 32;
    }

    // Handle remaining bytes with scalar code
    return scalar_search(haystack + i, h_len - i, needle, n_len);
}

4. Why This Works Better Than Textbook Algorithms

Boyer-Moore and KMP

These algorithms have great theoretical complexity (O(n+m)) but:

Poor cache behavior: Jump around in the pattern, causing cache misses
Branch mispredictions: Decision logic on each character
No parallelism: Process one character at a time

SIMD Approach

Sequential memory access: Perfect for prefetching
Minimal branches: Just one branch per 32 bytes (on average)
Massive parallelism: 32 comparisons per instruction

When Boyer-Moore Wins

For very long patterns (50+ characters), the skip distance becomes significant enough that Boyer-Moore might win. But for typical patterns (1-20 chars), SIMD dominates.

5. Cache-Aware Programming

Cache Line Basics

A cache line is typically 64 bytes
Loading one byte loads the entire 64-byte line
Sequential access is fast (prefetcher loads next lines)
Random access is slow (each access might be a cache miss)

String Search is Perfectly Sequential

Memory: [haystack byte 0][byte 1][byte 2]...[byte 63][byte 64]...
                        ▲
                        L1 cache fetches 64 bytes at a time

With SIMD processing 32 bytes:
- First iteration: Bytes 0-31 (already in cache from prefetch)
- Second iteration: Bytes 32-63 (already in cache)
- Third iteration: Bytes 64-95 (prefetcher loaded it)

Measuring Cache Behavior

$ perf stat -e cache-misses,cache-references ./search_bench

Performance counter stats for './search_bench':
        12,345 cache-misses   #  0.12% of all cache refs
    10,000,000 cache-references

# Low cache-miss rate = good cache usage

6. CPU Feature Detection

The Problem

Not all CPUs support AVX2. Your code must:

Detect available features at runtime
Select the best implementation
Fall back gracefully on older CPUs

Detection with CPUID

#include <cpuid.h>

typedef struct {
    bool sse42;
    bool avx2;
    bool avx512f;
} cpu_features_t;

void detect_features(cpu_features_t* f) {
    unsigned int eax, ebx, ecx, edx;

    __cpuid(1, eax, ebx, ecx, edx);
    f->sse42 = (ecx >> 20) & 1;

    __cpuid_count(7, 0, eax, ebx, ecx, edx);
    f->avx2 = (ebx >> 5) & 1;
    f->avx512f = (ebx >> 16) & 1;
}

GCC/Clang Helper

void select_implementation(void) {
    #if defined(__x86_64__) || defined(_M_X64)
    if (__builtin_cpu_supports("avx512f")) {
        search_fn = search_avx512;
    } else if (__builtin_cpu_supports("avx2")) {
        search_fn = search_avx2;
    } else if (__builtin_cpu_supports("sse4.2")) {
        search_fn = search_sse42;
    } else {
        search_fn = search_scalar;
    }
    #elif defined(__aarch64__)
    search_fn = search_neon;  // ARM NEON is always available on AArch64
    #else
    search_fn = search_scalar;
    #endif
}

7. UTF-8 Correctness

The Problem

UTF-8 uses variable-length encoding:

ASCII (0-127): 1 byte
Most European: 2 bytes
Asian languages: 3 bytes
Emoji: 4 bytes

If you find a byte match, you might be in the middle of a multi-byte character!

UTF-8 for "café": c  a  f  0xC3 0xA9
                  │  │  │   └──┴── 'é' (2 bytes)
                  └──┴──┴── ASCII bytes

If searching for byte 0xA9, you'd match inside 'é' - invalid!

Solutions

Byte-level search + validation: Find byte matches, then verify you’re at a character boundary
Search for complete codepoints: Expand pattern to UTF-8 bytes, ensure match boundaries are valid
Use ripgrep’s approach: Search bytes, report byte positions, let caller handle boundaries

// Check if byte is a UTF-8 continuation byte
bool is_continuation_byte(unsigned char b) {
    return (b & 0xC0) == 0x80;  // 10xxxxxx
}

// Validate match is at character boundary
bool valid_utf8_boundary(const char* haystack, size_t pos) {
    return !is_continuation_byte((unsigned char)haystack[pos]);
}

8. Profiling with perf

Basic Profiling

# Record CPU cycles
$ perf record -g ./fastsearch "pattern" largefile.txt

# View report
$ perf report

# Samples: 5K of event 'cycles'
# Overhead  Symbol
    45.23%  search_avx2
    12.45%  __memcmp_avx2
     8.92%  [kernel.kallsyms]
     ...

Detailed Statistics

$ perf stat -e cycles,instructions,cache-misses,branch-misses ./search_bench

Performance counter stats:
    10,234,567,890 cycles
    25,678,901,234 instructions  # 2.51 IPC (good!)
           12,345 cache-misses   # Very low (good!)
          345,678 branch-misses  # Low (good for SIMD)

Flame Graphs

$ perf record -g ./fastsearch ...
$ perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg

Project Specification

What You’ll Build

A high-performance string search library with:

// search.h

// Initialize (detect CPU features, select implementation)
void search_init(void);

// Basic search
const char* search_find(const char* haystack, size_t h_len,
                        const char* needle, size_t n_len);

// Search with options
typedef struct {
    bool case_insensitive;
    bool utf8_mode;
    bool use_simd;  // Auto-detect if not specified
} search_opts_t;

const char* search_find_opts(const char* haystack, size_t h_len,
                              const char* needle, size_t n_len,
                              const search_opts_t* opts);

// Get detected features
const char* search_get_features(void);

// Benchmark helper
double search_benchmark(const char* haystack, size_t h_len,
                        const char* needle, size_t n_len,
                        int iterations);

Deliverables

Core library: search.c, search_avx2.c, search_sse42.c, search_scalar.c
Header file: search.h with public API
CLI tool: fastsearch pattern file command-line utility
Benchmark suite: Comparison against glibc, memmem, and naive search
Test suite: Correctness tests including UTF-8 edge cases
Documentation: Explaining algorithm choices and when to use each

Success Criteria

# Build
$ make
gcc -O3 -mavx2 -o libsearch.so search.c search_avx2.c -shared -fPIC
gcc -O3 -mavx2 -o fastsearch main.c -L. -lsearch

# Verify correctness
$ ./test_suite
All 500 tests passed

# Performance benchmark
$ ./bench_search
Throughput on 100MB file:

Algorithm               Throughput     vs naive
-------------------------------------------
Naive                   142 MB/s       1.0x
glibc strstr            385 MB/s       2.7x
memmem                  521 MB/s       3.7x
SSE4.2 (this lib)       1,234 MB/s     8.7x
AVX2 (this lib)         2,847 MB/s     20.0x

# Real-world comparison
$ time ./fastsearch "pattern" large_log.txt
0.143s

$ time grep "pattern" large_log.txt
1.521s

Solution Architecture

Component Overview

┌─────────────────────────────────────────────────────────────────────┐
│                         Public API (search.h)                        │
│  search_init() / search_find() / search_find_opts()                 │
└────────────────────────────────┬────────────────────────────────────┘
                                 │
┌────────────────────────────────▼────────────────────────────────────┐
│                      Feature Dispatcher                              │
│  Detects CPU features at init, stores function pointer              │
│                                                                       │
│  search_fn = detect_best_implementation();                           │
└────────────────────────────────┬────────────────────────────────────┘
                                 │
        ┌────────────────────────┼────────────────────────────────┐
        │                        │                                │
        ▼                        ▼                                ▼
┌───────────────┐      ┌───────────────┐              ┌───────────────┐
│ search_avx512 │      │ search_avx2   │              │ search_scalar │
│ (512-bit)     │      │ (256-bit)     │              │ (fallback)    │
│               │      │               │              │               │
│ 64 bytes/iter │      │ 32 bytes/iter │              │ 1 byte/iter   │
└───────────────┘      └───────────────┘              └───────────────┘
        │                        │                                │
        └────────────────────────┼────────────────────────────────┘
                                 │
                                 ▼
                    ┌───────────────────────┐
                    │   Tail Handler        │
                    │   (remaining bytes)   │
                    └───────────────────────┘

Data Structures

Search Context

typedef struct {
    // Configuration
    bool case_insensitive;
    bool utf8_mode;

    // Precomputed data
    __m256i first_char_vec;  // Broadcast of needle[0]
    __m256i last_char_vec;   // Broadcast of needle[n-1] (for two-byte filter)

    // Needle info
    const char* needle;
    size_t needle_len;

    // Statistics
    uint64_t bytes_searched;
    uint64_t matches_found;
} search_ctx_t;

Function Pointer Dispatch

typedef const char* (*search_fn_t)(
    const char* haystack, size_t h_len,
    const char* needle, size_t n_len
);

static search_fn_t search_impl = NULL;

void search_init(void) {
    #if defined(__x86_64__)
    if (__builtin_cpu_supports("avx512f")) {
        search_impl = search_avx512;
    } else if (__builtin_cpu_supports("avx2")) {
        search_impl = search_avx2;
    } else if (__builtin_cpu_supports("sse4.2")) {
        search_impl = search_sse42;
    } else {
        search_impl = search_scalar;
    }
    #else
    search_impl = search_scalar;
    #endif
}

Key Algorithms

Two-Character Filter (Improved)

Instead of just checking the first character, check first AND last:

const char* search_avx2_two_char(const char* hay, size_t h_len,
                                  const char* needle, size_t n_len) {
    __m256i first = _mm256_set1_epi8(needle[0]);
    __m256i last = _mm256_set1_epi8(needle[n_len - 1]);

    size_t i = 0;
    while (i + n_len + 31 <= h_len) {
        __m256i block_first = _mm256_loadu_si256((__m256i*)(hay + i));
        __m256i block_last = _mm256_loadu_si256((__m256i*)(hay + i + n_len - 1));

        __m256i eq_first = _mm256_cmpeq_epi8(block_first, first);
        __m256i eq_last = _mm256_cmpeq_epi8(block_last, last);
        __m256i match = _mm256_and_si256(eq_first, eq_last);  // Both must match

        uint32_t mask = _mm256_movemask_epi8(match);

        while (mask != 0) {
            int pos = __builtin_ctz(mask);
            if (memcmp(hay + i + pos, needle, n_len) == 0) {
                return hay + i + pos;
            }
            mask &= mask - 1;
        }

        i += 32;
    }

    return search_scalar(hay + i, h_len - i, needle, n_len);
}

This dramatically reduces false positives, especially for longer patterns.

Phased Implementation Guide

Phase 1: Scalar Baseline (Days 1-2)

Goal: Working search with benchmarking infrastructure.

Steps:

Implement naive byte-by-byte search
Create benchmark harness with timing
Compare against strstr() and memmem()
Set up test suite with edge cases

Test: All tests pass, benchmark shows baseline performance.

Phase 2: SSE4.2 Implementation (Days 2-3)

Goal: First SIMD version using 128-bit vectors.

Steps:

Learn SSE4.2 intrinsics (_mm_* functions)
Implement first-character filter with SSE
Handle alignment and tail bytes
Benchmark against scalar

Test: 2-4x speedup over scalar.

Phase 3: AVX2 Implementation (Days 3-4)

Goal: Full-speed implementation with 256-bit vectors.

Steps:

Port SSE code to AVX2 (_mm256_* functions)
Implement two-character filter for better pruning
Optimize hot path (minimize branches)
Profile with perf to find remaining bottlenecks

Test: 8-20x speedup over scalar.

Phase 4: CPU Feature Detection (Days 4-5)

Goal: Automatic selection of best implementation.

Steps:

Implement CPUID-based feature detection
Create function pointer dispatch
Add runtime fallback chain (AVX2 → SSE → scalar)
Test on different CPU types (if available)

Test: Works on both old and new CPUs.

Phase 5: UTF-8 and Case-Insensitive (Days 5-7)

Goal: Handle real-world text correctly.

Steps:

Implement UTF-8 boundary validation
Add case-insensitive search option
Handle edge cases (empty patterns, pattern longer than haystack)
Comprehensive test suite

Test: All UTF-8 edge cases pass.

Phase 6: CLI and Polish (Days 8-10)

Goal: Usable command-line tool.

Steps:

Create fastsearch CLI
Add file handling and memory mapping
Documentation and examples
Final benchmarks and comparisons

Testing Strategy

Unit Tests

// Basic functionality
void test_basic_search(void) {
    const char* hay = "hello world";
    assert(search_find(hay, 11, "world", 5) == hay + 6);
    assert(search_find(hay, 11, "xyz", 3) == NULL);
    assert(search_find(hay, 11, "", 0) == hay);
}

// Edge cases
void test_edge_cases(void) {
    // Pattern at start
    assert(search_find("abcdef", 6, "abc", 3) != NULL);

    // Pattern at end
    assert(search_find("abcdef", 6, "def", 3) != NULL);

    // Pattern longer than haystack
    assert(search_find("abc", 3, "abcdef", 6) == NULL);

    // Repeated pattern
    const char* hay = "abcabcabc";
    const char* first = search_find(hay, 9, "abc", 3);
    assert(first == hay);
}

// UTF-8 correctness
void test_utf8(void) {
    const char* hay = "hello café world";

    // Should find "café"
    assert(search_find(hay, strlen(hay), "café", 5) != NULL);

    // Should NOT find partial codepoint
    // (implementation-specific behavior)
}

Stress Tests

void test_large_file(void) {
    // Generate 100MB of random text
    char* haystack = generate_random_text(100 * 1024 * 1024);

    // Insert known pattern at random position
    memcpy(haystack + 50000000, "UNIQUE_PATTERN", 14);

    const char* found = search_find(haystack, 100*1024*1024,
                                     "UNIQUE_PATTERN", 14);
    assert(found == haystack + 50000000);

    free(haystack);
}

void test_worst_case(void) {
    // Pattern that causes many false positives
    char* haystack = malloc(1000000);
    memset(haystack, 'a', 1000000);  // All 'a's

    // Search for "aaaaaab" - many partial matches
    const char* result = search_find(haystack, 1000000, "aaaaaab", 7);
    assert(result == NULL);  // Not found

    free(haystack);
}

Benchmark Suite

void run_benchmarks(void) {
    const size_t sizes[] = {1024, 1024*1024, 100*1024*1024};
    const char* patterns[] = {"the", "pattern", "superlongpattern"};

    for (size_t s = 0; s < ARRAY_SIZE(sizes); s++) {
        char* haystack = generate_test_data(sizes[s]);

        for (size_t p = 0; p < ARRAY_SIZE(patterns); p++) {
            double scalar_time = benchmark(search_scalar, haystack, sizes[s],
                                           patterns[p], strlen(patterns[p]));
            double simd_time = benchmark(search_find, haystack, sizes[s],
                                         patterns[p], strlen(patterns[p]));

            printf("Size: %zuMB, Pattern: '%s'\n",
                   sizes[s] / (1024*1024), patterns[p]);
            printf("  Scalar: %.2f MB/s\n", sizes[s] / scalar_time / 1e6);
            printf("  SIMD:   %.2f MB/s (%.1fx faster)\n",
                   sizes[s] / simd_time / 1e6, scalar_time / simd_time);
        }

        free(haystack);
    }
}

Common Pitfalls and Debugging

Pitfall 1: Unaligned Access Crashes

Symptom: SIGBUS or SIGSEGV on some systems.

Cause: Using aligned load (_mm256_load_si256) on unaligned data.

Fix: Use unaligned load (negligible performance difference on modern CPUs):

// Wrong (requires 32-byte alignment)
__m256i data = _mm256_load_si256((__m256i*)ptr);

// Right (works with any address)
__m256i data = _mm256_loadu_si256((__m256i*)ptr);

Pitfall 2: Reading Past Buffer End

Symptom: Occasional crashes or wrong results.

Cause: Loading 32 bytes when less than 32 remain.

Fix: Check bounds carefully:

// Wrong
while (i < h_len) {
    __m256i block = _mm256_loadu_si256((__m256i*)(hay + i));
    // Reads past end if h_len - i < 32!
}

// Right
while (i + 32 <= h_len) {
    __m256i block = _mm256_loadu_si256((__m256i*)(hay + i));
    // ...
    i += 32;
}
// Handle remaining i to h_len with scalar

Pitfall 3: Feature Detection at Compile Time

Symptom: Crashes on older CPUs, or suboptimal performance on newer ones.

Cause: Using -mavx2 flag makes compiler use AVX2 everywhere.

Fix: Compile different implementations separately:

# Build each implementation with appropriate flags
search_avx2.o: search_avx2.c
    $(CC) -mavx2 -c $< -o $@

search_sse42.o: search_sse42.c
    $(CC) -msse4.2 -c $< -o $@

search_scalar.o: search_scalar.c
    $(CC) -c $< -o $@

# Link all together
libsearch.so: search.o search_avx2.o search_sse42.o search_scalar.o
    $(CC) -shared $^ -o $@

Pitfall 4: Benchmark Variability

Symptom: Results vary wildly between runs.

Cause: CPU frequency scaling, other processes, cache state.

Fix:

# Disable frequency scaling
$ sudo cpupower frequency-set --governor performance

# Run multiple iterations
$ for i in {1..10}; do ./benchmark; done | awk '{sum+=$1} END {print sum/NR}'

# Pin to a specific CPU
$ taskset -c 0 ./benchmark

Extensions and Challenges

Extension 1: Multiple Pattern Search (Aho-Corasick + SIMD)

Search for many patterns simultaneously:

search_multi_t* sm = search_multi_create();
search_multi_add(sm, "error");
search_multi_add(sm, "warning");
search_multi_add(sm, "critical");
search_multi_find_all(sm, haystack, h_len, callback, user_data);

Extension 2: Regular Expression Support

Simple regex patterns with SIMD acceleration:

// Character classes
search_find_regex(hay, h_len, "[0-9]+");  // SIMD for digit detection

Extension 3: Parallel Search

Use multiple threads for very large files:

search_parallel_find(haystack, 1GB, needle, n_len, 8);  // 8 threads

Extension 4: Memory-Mapped I/O Integration

Search files without loading entirely into memory:

int fd = open(filename, O_RDONLY);
void* mapped = mmap(NULL, file_size, PROT_READ, MAP_PRIVATE, fd, 0);
madvise(mapped, file_size, MADV_SEQUENTIAL);
search_find(mapped, file_size, needle, n_len);

Real-World Connections

Tools Using SIMD String Search

Tool	Key Technique
ripgrep	memchr crate with SIMD, Teddy algorithm for multiple literals
GNU grep	Boyer-Moore with SIMD for long patterns
Hyperscan	Intel library, SIMD-based regex engine
simdjson	SIMD for JSON parsing (similar principles)

Code to Study

memchr crate — Rust SIMD string search (ripgrep uses this)
Hyperscan — Intel’s SIMD regex library
simdjson — SIMD JSON parsing (excellent learning resource)

Resources

Essential Reading

“Modern X86 Assembly Language Programming” by Daniel Kusswurm — SIMD chapters
Intel Intrinsics Guide — Reference for all SIMD instructions
“What Every Programmer Should Know About Memory” by Ulrich Drepper

Online Resources

Intel Intrinsics Guide
Compiler Explorer — See generated assembly
BurntSushi’s Blog — ripgrep author’s posts on SIMD

Papers

“Hyperscan: A Fast Multi-pattern Regex Matcher” (Intel)
“Parsing Gigabytes of JSON per Second” (simdjson paper)

Self-Assessment Checklist

Before considering this project complete, verify:

Understanding

I can explain what SIMD is and why it speeds up string search
I can describe the difference between aligned and unaligned loads
I can explain why cache-friendly algorithms beat clever algorithms
I can describe how to detect CPU features at runtime
I can explain the UTF-8 challenges in byte-level search

Implementation

My library correctly finds all test patterns
I have AVX2 and fallback implementations
CPU feature detection works correctly
UTF-8 edge cases are handled
My benchmarks show 10x+ improvement over naive search

Verification

I’ve tested on files of various sizes (1KB to 1GB)
I’ve tested with patterns of various lengths
I’ve profiled with perf and understand the hot paths
I’ve compared against glibc and ripgrep’s memchr

SIMD string search is where algorithms meet hardware. After this project, you’ll understand why the fastest code often looks nothing like the textbook version.