P05: High-Performance String Search Library

P05: High-Performance String Search Library

Build a SIMD-accelerated string search library that rivals ripgrepโ€™s core

Attribute Value
Main Language C with SIMD Intrinsics
Difficulty Level 4: Expert
Coolness Level Level 4: Hardcore Tech Flex
Knowledge Area Algorithms, Low-Level Optimization
Key Tools SSE/AVX intrinsics, perf, Compiler Explorer
Main Book โ€œModern X86 Assembly Language Programmingโ€ - Daniel Kusswurm

Learning Objectives

By completing this project, you will:

  1. Master SIMD programming โ€” Write code that processes 32 bytes in a single instruction
  2. Understand CPU architecture โ€” Learn about vector registers, cache lines, and instruction pipelining
  3. Implement real-world algorithms โ€” Go beyond textbook string search to what production tools use
  4. Profile systematically โ€” Use perf to find bottlenecks and flame graphs to visualize them
  5. Handle encoding correctly โ€” Deal with UTF-8 without breaking multi-byte characters
  6. Build runtime dispatch โ€” Detect CPU features and select optimal code paths

The Core Question

โ€œWhy are tools like ripgrep so much faster than grep, and what makes string search fast at the hardware level?โ€

Every CS student learns Boyer-Moore and KMP, but these arenโ€™t what make modern search tools fast. The real questions:

  • What is SIMD? How can you search 32 bytes in the time it takes to search 1?
  • Why does memory matter more than Big-O? A cache-friendly O(n) beats a cache-hostile O(log n)
  • How do you write code for different CPUs? SSE, AVX2, AVX-512, NEONโ€”how do you support them all?

After this project, youโ€™ll understand that performance isnโ€™t about clever algorithmsโ€”itโ€™s about feeding data to the CPU the way it wants.


Deep Theoretical Foundation

1. SIMD Fundamentals (Single Instruction, Multiple Data)

The Basic Idea

Instead of processing one byte at a time:

// Scalar: 32 iterations
for (int i = 0; i < 32; i++) {
    if (haystack[i] == 'x') found = i;
}

Process 32 bytes simultaneously:

// SIMD: 1 "iteration"
__m256i needle = _mm256_set1_epi8('x');      // 32 copies of 'x'
__m256i block = _mm256_loadu_si256(haystack); // Load 32 bytes
__m256i cmp = _mm256_cmpeq_epi8(block, needle); // Compare all 32
int mask = _mm256_movemask_epi8(cmp);         // Get bitmask of matches
// mask has 1 bit for each matching position

Vector Register Sizes

Technology Register Size Registers Introduced
SSE 128 bits (16 bytes) XMM0-XMM15 Pentium III (1999)
AVX 256 bits (32 bytes) YMM0-YMM15 Sandy Bridge (2011)
AVX-512 512 bits (64 bytes) ZMM0-ZMM31 Skylake-X (2017)

Visual Model

256-bit YMM register (AVX2):
โ”Œโ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”
โ”‚byteโ”‚byteโ”‚byteโ”‚byteโ”‚byteโ”‚byteโ”‚byteโ”‚byteโ”‚byteโ”‚byteโ”‚byteโ”‚byteโ”‚byteโ”‚byteโ”‚byteโ”‚byteโ”‚
โ”‚ 0  โ”‚ 1  โ”‚ 2  โ”‚ 3  โ”‚ 4  โ”‚ 5  โ”‚ 6  โ”‚ 7  โ”‚ 8  โ”‚ 9  โ”‚ 10 โ”‚ 11 โ”‚ 12 โ”‚ 13 โ”‚ 14 โ”‚ 15 โ”‚
โ”œโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”ค
โ”‚byteโ”‚byteโ”‚byteโ”‚byteโ”‚byteโ”‚byteโ”‚byteโ”‚byteโ”‚byteโ”‚byteโ”‚byteโ”‚byteโ”‚byteโ”‚byteโ”‚byteโ”‚byteโ”‚
โ”‚ 16 โ”‚ 17 โ”‚ 18 โ”‚ 19 โ”‚ 20 โ”‚ 21 โ”‚ 22 โ”‚ 23 โ”‚ 24 โ”‚ 25 โ”‚ 26 โ”‚ 27 โ”‚ 28 โ”‚ 29 โ”‚ 30 โ”‚ 31 โ”‚
โ””โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”˜

One instruction compares ALL 32 bytes against a target character!

Loading Data

#include <immintrin.h>

// Load 32 bytes from memory (unaligned is fine on modern CPUs)
__m256i data = _mm256_loadu_si256((__m256i*)ptr);

// Load 32 bytes (aligned to 32-byte boundary, slightly faster)
__m256i data = _mm256_load_si256((__m256i*)aligned_ptr);

Broadcasting a Single Byte

// Create vector with 32 copies of a character
__m256i needle = _mm256_set1_epi8('x');
// Result: [x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x,
//          x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x]

Comparing Bytes

// Compare each byte position, result is 0xFF for match, 0x00 for mismatch
__m256i result = _mm256_cmpeq_epi8(data, needle);
// If data[5] == 'x', then result[5] = 0xFF
// If data[6] != 'x', then result[6] = 0x00

Extracting Match Positions

// Convert comparison result to bitmask (one bit per byte)
int mask = _mm256_movemask_epi8(result);
// If bytes 5 and 17 matched: mask = 0b00000000000000100000000000100000

// Find position of first match
int pos = __builtin_ctz(mask);  // Count trailing zeros = 5

3. The SIMD String Search Algorithm

Basic Approach: First-Character Filter

For pattern "cat" in haystack:

1. Broadcast 'c' (first char) to all 32 lanes
2. Load 32 bytes of haystack
3. Compare all 32 bytes against 'c'
4. For each match position:
   - Verify full pattern "cat" with memcmp
5. Advance 32 bytes, repeat

Code

const char* simd_search(const char* haystack, size_t h_len,
                        const char* needle, size_t n_len) {
    if (n_len == 0) return haystack;
    if (n_len > h_len) return NULL;

    __m256i first = _mm256_set1_epi8(needle[0]);

    size_t i = 0;
    while (i + 32 <= h_len) {
        __m256i block = _mm256_loadu_si256((__m256i*)(haystack + i));
        __m256i cmp = _mm256_cmpeq_epi8(block, first);
        uint32_t mask = _mm256_movemask_epi8(cmp);

        while (mask != 0) {
            int pos = __builtin_ctz(mask);  // Find next set bit

            // Check if full pattern matches
            if (i + pos + n_len <= h_len &&
                memcmp(haystack + i + pos, needle, n_len) == 0) {
                return haystack + i + pos;
            }

            mask &= mask - 1;  // Clear lowest set bit
        }

        i += 32;
    }

    // Handle remaining bytes with scalar code
    return scalar_search(haystack + i, h_len - i, needle, n_len);
}

4. Why This Works Better Than Textbook Algorithms

Boyer-Moore and KMP

These algorithms have great theoretical complexity (O(n+m)) but:

  • Poor cache behavior: Jump around in the pattern, causing cache misses
  • Branch mispredictions: Decision logic on each character
  • No parallelism: Process one character at a time

SIMD Approach

  • Sequential memory access: Perfect for prefetching
  • Minimal branches: Just one branch per 32 bytes (on average)
  • Massive parallelism: 32 comparisons per instruction

When Boyer-Moore Wins

For very long patterns (50+ characters), the skip distance becomes significant enough that Boyer-Moore might win. But for typical patterns (1-20 chars), SIMD dominates.

5. Cache-Aware Programming

Cache Line Basics

  • A cache line is typically 64 bytes
  • Loading one byte loads the entire 64-byte line
  • Sequential access is fast (prefetcher loads next lines)
  • Random access is slow (each access might be a cache miss)

String Search is Perfectly Sequential

Memory: [haystack byte 0][byte 1][byte 2]...[byte 63][byte 64]...
                        โ–ฒ
                        L1 cache fetches 64 bytes at a time

With SIMD processing 32 bytes:
- First iteration: Bytes 0-31 (already in cache from prefetch)
- Second iteration: Bytes 32-63 (already in cache)
- Third iteration: Bytes 64-95 (prefetcher loaded it)

Measuring Cache Behavior

$ perf stat -e cache-misses,cache-references ./search_bench

Performance counter stats for './search_bench':
        12,345 cache-misses   #  0.12% of all cache refs
    10,000,000 cache-references

# Low cache-miss rate = good cache usage

6. CPU Feature Detection

The Problem

Not all CPUs support AVX2. Your code must:

  1. Detect available features at runtime
  2. Select the best implementation
  3. Fall back gracefully on older CPUs

Detection with CPUID

#include <cpuid.h>

typedef struct {
    bool sse42;
    bool avx2;
    bool avx512f;
} cpu_features_t;

void detect_features(cpu_features_t* f) {
    unsigned int eax, ebx, ecx, edx;

    __cpuid(1, eax, ebx, ecx, edx);
    f->sse42 = (ecx >> 20) & 1;

    __cpuid_count(7, 0, eax, ebx, ecx, edx);
    f->avx2 = (ebx >> 5) & 1;
    f->avx512f = (ebx >> 16) & 1;
}

GCC/Clang Helper

void select_implementation(void) {
    #if defined(__x86_64__) || defined(_M_X64)
    if (__builtin_cpu_supports("avx512f")) {
        search_fn = search_avx512;
    } else if (__builtin_cpu_supports("avx2")) {
        search_fn = search_avx2;
    } else if (__builtin_cpu_supports("sse4.2")) {
        search_fn = search_sse42;
    } else {
        search_fn = search_scalar;
    }
    #elif defined(__aarch64__)
    search_fn = search_neon;  // ARM NEON is always available on AArch64
    #else
    search_fn = search_scalar;
    #endif
}

7. UTF-8 Correctness

The Problem

UTF-8 uses variable-length encoding:

  • ASCII (0-127): 1 byte
  • Most European: 2 bytes
  • Asian languages: 3 bytes
  • Emoji: 4 bytes

If you find a byte match, you might be in the middle of a multi-byte character!

UTF-8 for "cafรฉ": c  a  f  0xC3 0xA9
                  โ”‚  โ”‚  โ”‚   โ””โ”€โ”€โ”ดโ”€โ”€ 'รฉ' (2 bytes)
                  โ””โ”€โ”€โ”ดโ”€โ”€โ”ดโ”€โ”€ ASCII bytes

If searching for byte 0xA9, you'd match inside 'รฉ' - invalid!

Solutions

  1. Byte-level search + validation: Find byte matches, then verify youโ€™re at a character boundary
  2. Search for complete codepoints: Expand pattern to UTF-8 bytes, ensure match boundaries are valid
  3. Use ripgrepโ€™s approach: Search bytes, report byte positions, let caller handle boundaries
// Check if byte is a UTF-8 continuation byte
bool is_continuation_byte(unsigned char b) {
    return (b & 0xC0) == 0x80;  // 10xxxxxx
}

// Validate match is at character boundary
bool valid_utf8_boundary(const char* haystack, size_t pos) {
    return !is_continuation_byte((unsigned char)haystack[pos]);
}

8. Profiling with perf

Basic Profiling

# Record CPU cycles
$ perf record -g ./fastsearch "pattern" largefile.txt

# View report
$ perf report

# Samples: 5K of event 'cycles'
# Overhead  Symbol
    45.23%  search_avx2
    12.45%  __memcmp_avx2
     8.92%  [kernel.kallsyms]
     ...

Detailed Statistics

$ perf stat -e cycles,instructions,cache-misses,branch-misses ./search_bench

Performance counter stats:
    10,234,567,890 cycles
    25,678,901,234 instructions  # 2.51 IPC (good!)
           12,345 cache-misses   # Very low (good!)
          345,678 branch-misses  # Low (good for SIMD)

Flame Graphs

$ perf record -g ./fastsearch ...
$ perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg

Project Specification

What Youโ€™ll Build

A high-performance string search library with:

// search.h

// Initialize (detect CPU features, select implementation)
void search_init(void);

// Basic search
const char* search_find(const char* haystack, size_t h_len,
                        const char* needle, size_t n_len);

// Search with options
typedef struct {
    bool case_insensitive;
    bool utf8_mode;
    bool use_simd;  // Auto-detect if not specified
} search_opts_t;

const char* search_find_opts(const char* haystack, size_t h_len,
                              const char* needle, size_t n_len,
                              const search_opts_t* opts);

// Get detected features
const char* search_get_features(void);

// Benchmark helper
double search_benchmark(const char* haystack, size_t h_len,
                        const char* needle, size_t n_len,
                        int iterations);

Deliverables

  1. Core library: search.c, search_avx2.c, search_sse42.c, search_scalar.c
  2. Header file: search.h with public API
  3. CLI tool: fastsearch pattern file command-line utility
  4. Benchmark suite: Comparison against glibc, memmem, and naive search
  5. Test suite: Correctness tests including UTF-8 edge cases
  6. Documentation: Explaining algorithm choices and when to use each

Success Criteria

# Build
$ make
gcc -O3 -mavx2 -o libsearch.so search.c search_avx2.c -shared -fPIC
gcc -O3 -mavx2 -o fastsearch main.c -L. -lsearch

# Verify correctness
$ ./test_suite
All 500 tests passed

# Performance benchmark
$ ./bench_search
Throughput on 100MB file:

Algorithm               Throughput     vs naive
-------------------------------------------
Naive                   142 MB/s       1.0x
glibc strstr            385 MB/s       2.7x
memmem                  521 MB/s       3.7x
SSE4.2 (this lib)       1,234 MB/s     8.7x
AVX2 (this lib)         2,847 MB/s     20.0x

# Real-world comparison
$ time ./fastsearch "pattern" large_log.txt
0.143s

$ time grep "pattern" large_log.txt
1.521s

Solution Architecture

Component Overview

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                         Public API (search.h)                        โ”‚
โ”‚  search_init() / search_find() / search_find_opts()                 โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                                 โ”‚
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                      Feature Dispatcher                              โ”‚
โ”‚  Detects CPU features at init, stores function pointer              โ”‚
โ”‚                                                                       โ”‚
โ”‚  search_fn = detect_best_implementation();                           โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                                 โ”‚
        โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
        โ”‚                        โ”‚                                โ”‚
        โ–ผ                        โ–ผ                                โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”      โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”              โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ search_avx512 โ”‚      โ”‚ search_avx2   โ”‚              โ”‚ search_scalar โ”‚
โ”‚ (512-bit)     โ”‚      โ”‚ (256-bit)     โ”‚              โ”‚ (fallback)    โ”‚
โ”‚               โ”‚      โ”‚               โ”‚              โ”‚               โ”‚
โ”‚ 64 bytes/iter โ”‚      โ”‚ 32 bytes/iter โ”‚              โ”‚ 1 byte/iter   โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜      โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜              โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
        โ”‚                        โ”‚                                โ”‚
        โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                                 โ”‚
                                 โ–ผ
                    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
                    โ”‚   Tail Handler        โ”‚
                    โ”‚   (remaining bytes)   โ”‚
                    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Data Structures

Search Context

typedef struct {
    // Configuration
    bool case_insensitive;
    bool utf8_mode;

    // Precomputed data
    __m256i first_char_vec;  // Broadcast of needle[0]
    __m256i last_char_vec;   // Broadcast of needle[n-1] (for two-byte filter)

    // Needle info
    const char* needle;
    size_t needle_len;

    // Statistics
    uint64_t bytes_searched;
    uint64_t matches_found;
} search_ctx_t;

Function Pointer Dispatch

typedef const char* (*search_fn_t)(
    const char* haystack, size_t h_len,
    const char* needle, size_t n_len
);

static search_fn_t search_impl = NULL;

void search_init(void) {
    #if defined(__x86_64__)
    if (__builtin_cpu_supports("avx512f")) {
        search_impl = search_avx512;
    } else if (__builtin_cpu_supports("avx2")) {
        search_impl = search_avx2;
    } else if (__builtin_cpu_supports("sse4.2")) {
        search_impl = search_sse42;
    } else {
        search_impl = search_scalar;
    }
    #else
    search_impl = search_scalar;
    #endif
}

Key Algorithms

Two-Character Filter (Improved)

Instead of just checking the first character, check first AND last:

const char* search_avx2_two_char(const char* hay, size_t h_len,
                                  const char* needle, size_t n_len) {
    __m256i first = _mm256_set1_epi8(needle[0]);
    __m256i last = _mm256_set1_epi8(needle[n_len - 1]);

    size_t i = 0;
    while (i + n_len + 31 <= h_len) {
        __m256i block_first = _mm256_loadu_si256((__m256i*)(hay + i));
        __m256i block_last = _mm256_loadu_si256((__m256i*)(hay + i + n_len - 1));

        __m256i eq_first = _mm256_cmpeq_epi8(block_first, first);
        __m256i eq_last = _mm256_cmpeq_epi8(block_last, last);
        __m256i match = _mm256_and_si256(eq_first, eq_last);  // Both must match

        uint32_t mask = _mm256_movemask_epi8(match);

        while (mask != 0) {
            int pos = __builtin_ctz(mask);
            if (memcmp(hay + i + pos, needle, n_len) == 0) {
                return hay + i + pos;
            }
            mask &= mask - 1;
        }

        i += 32;
    }

    return search_scalar(hay + i, h_len - i, needle, n_len);
}

This dramatically reduces false positives, especially for longer patterns.


Phased Implementation Guide

Phase 1: Scalar Baseline (Days 1-2)

Goal: Working search with benchmarking infrastructure.

Steps:

  1. Implement naive byte-by-byte search
  2. Create benchmark harness with timing
  3. Compare against strstr() and memmem()
  4. Set up test suite with edge cases

Test: All tests pass, benchmark shows baseline performance.

Phase 2: SSE4.2 Implementation (Days 2-3)

Goal: First SIMD version using 128-bit vectors.

Steps:

  1. Learn SSE4.2 intrinsics (_mm_* functions)
  2. Implement first-character filter with SSE
  3. Handle alignment and tail bytes
  4. Benchmark against scalar

Test: 2-4x speedup over scalar.

Phase 3: AVX2 Implementation (Days 3-4)

Goal: Full-speed implementation with 256-bit vectors.

Steps:

  1. Port SSE code to AVX2 (_mm256_* functions)
  2. Implement two-character filter for better pruning
  3. Optimize hot path (minimize branches)
  4. Profile with perf to find remaining bottlenecks

Test: 8-20x speedup over scalar.

Phase 4: CPU Feature Detection (Days 4-5)

Goal: Automatic selection of best implementation.

Steps:

  1. Implement CPUID-based feature detection
  2. Create function pointer dispatch
  3. Add runtime fallback chain (AVX2 โ†’ SSE โ†’ scalar)
  4. Test on different CPU types (if available)

Test: Works on both old and new CPUs.

Phase 5: UTF-8 and Case-Insensitive (Days 5-7)

Goal: Handle real-world text correctly.

Steps:

  1. Implement UTF-8 boundary validation
  2. Add case-insensitive search option
  3. Handle edge cases (empty patterns, pattern longer than haystack)
  4. Comprehensive test suite

Test: All UTF-8 edge cases pass.

Phase 6: CLI and Polish (Days 8-10)

Goal: Usable command-line tool.

Steps:

  1. Create fastsearch CLI
  2. Add file handling and memory mapping
  3. Documentation and examples
  4. Final benchmarks and comparisons

Testing Strategy

Unit Tests

// Basic functionality
void test_basic_search(void) {
    const char* hay = "hello world";
    assert(search_find(hay, 11, "world", 5) == hay + 6);
    assert(search_find(hay, 11, "xyz", 3) == NULL);
    assert(search_find(hay, 11, "", 0) == hay);
}

// Edge cases
void test_edge_cases(void) {
    // Pattern at start
    assert(search_find("abcdef", 6, "abc", 3) != NULL);

    // Pattern at end
    assert(search_find("abcdef", 6, "def", 3) != NULL);

    // Pattern longer than haystack
    assert(search_find("abc", 3, "abcdef", 6) == NULL);

    // Repeated pattern
    const char* hay = "abcabcabc";
    const char* first = search_find(hay, 9, "abc", 3);
    assert(first == hay);
}

// UTF-8 correctness
void test_utf8(void) {
    const char* hay = "hello cafรฉ world";

    // Should find "cafรฉ"
    assert(search_find(hay, strlen(hay), "cafรฉ", 5) != NULL);

    // Should NOT find partial codepoint
    // (implementation-specific behavior)
}

Stress Tests

void test_large_file(void) {
    // Generate 100MB of random text
    char* haystack = generate_random_text(100 * 1024 * 1024);

    // Insert known pattern at random position
    memcpy(haystack + 50000000, "UNIQUE_PATTERN", 14);

    const char* found = search_find(haystack, 100*1024*1024,
                                     "UNIQUE_PATTERN", 14);
    assert(found == haystack + 50000000);

    free(haystack);
}

void test_worst_case(void) {
    // Pattern that causes many false positives
    char* haystack = malloc(1000000);
    memset(haystack, 'a', 1000000);  // All 'a's

    // Search for "aaaaaab" - many partial matches
    const char* result = search_find(haystack, 1000000, "aaaaaab", 7);
    assert(result == NULL);  // Not found

    free(haystack);
}

Benchmark Suite

void run_benchmarks(void) {
    const size_t sizes[] = {1024, 1024*1024, 100*1024*1024};
    const char* patterns[] = {"the", "pattern", "superlongpattern"};

    for (size_t s = 0; s < ARRAY_SIZE(sizes); s++) {
        char* haystack = generate_test_data(sizes[s]);

        for (size_t p = 0; p < ARRAY_SIZE(patterns); p++) {
            double scalar_time = benchmark(search_scalar, haystack, sizes[s],
                                           patterns[p], strlen(patterns[p]));
            double simd_time = benchmark(search_find, haystack, sizes[s],
                                         patterns[p], strlen(patterns[p]));

            printf("Size: %zuMB, Pattern: '%s'\n",
                   sizes[s] / (1024*1024), patterns[p]);
            printf("  Scalar: %.2f MB/s\n", sizes[s] / scalar_time / 1e6);
            printf("  SIMD:   %.2f MB/s (%.1fx faster)\n",
                   sizes[s] / simd_time / 1e6, scalar_time / simd_time);
        }

        free(haystack);
    }
}

Common Pitfalls and Debugging

Pitfall 1: Unaligned Access Crashes

Symptom: SIGBUS or SIGSEGV on some systems.

Cause: Using aligned load (_mm256_load_si256) on unaligned data.

Fix: Use unaligned load (negligible performance difference on modern CPUs):

// Wrong (requires 32-byte alignment)
__m256i data = _mm256_load_si256((__m256i*)ptr);

// Right (works with any address)
__m256i data = _mm256_loadu_si256((__m256i*)ptr);

Pitfall 2: Reading Past Buffer End

Symptom: Occasional crashes or wrong results.

Cause: Loading 32 bytes when less than 32 remain.

Fix: Check bounds carefully:

// Wrong
while (i < h_len) {
    __m256i block = _mm256_loadu_si256((__m256i*)(hay + i));
    // Reads past end if h_len - i < 32!
}

// Right
while (i + 32 <= h_len) {
    __m256i block = _mm256_loadu_si256((__m256i*)(hay + i));
    // ...
    i += 32;
}
// Handle remaining i to h_len with scalar

Pitfall 3: Feature Detection at Compile Time

Symptom: Crashes on older CPUs, or suboptimal performance on newer ones.

Cause: Using -mavx2 flag makes compiler use AVX2 everywhere.

Fix: Compile different implementations separately:

# Build each implementation with appropriate flags
search_avx2.o: search_avx2.c
    $(CC) -mavx2 -c $< -o $@

search_sse42.o: search_sse42.c
    $(CC) -msse4.2 -c $< -o $@

search_scalar.o: search_scalar.c
    $(CC) -c $< -o $@

# Link all together
libsearch.so: search.o search_avx2.o search_sse42.o search_scalar.o
    $(CC) -shared $^ -o $@

Pitfall 4: Benchmark Variability

Symptom: Results vary wildly between runs.

Cause: CPU frequency scaling, other processes, cache state.

Fix:

# Disable frequency scaling
$ sudo cpupower frequency-set --governor performance

# Run multiple iterations
$ for i in {1..10}; do ./benchmark; done | awk '{sum+=$1} END {print sum/NR}'

# Pin to a specific CPU
$ taskset -c 0 ./benchmark

Extensions and Challenges

Extension 1: Multiple Pattern Search (Aho-Corasick + SIMD)

Search for many patterns simultaneously:

search_multi_t* sm = search_multi_create();
search_multi_add(sm, "error");
search_multi_add(sm, "warning");
search_multi_add(sm, "critical");
search_multi_find_all(sm, haystack, h_len, callback, user_data);

Extension 2: Regular Expression Support

Simple regex patterns with SIMD acceleration:

// Character classes
search_find_regex(hay, h_len, "[0-9]+");  // SIMD for digit detection

Use multiple threads for very large files:

search_parallel_find(haystack, 1GB, needle, n_len, 8);  // 8 threads

Extension 4: Memory-Mapped I/O Integration

Search files without loading entirely into memory:

int fd = open(filename, O_RDONLY);
void* mapped = mmap(NULL, file_size, PROT_READ, MAP_PRIVATE, fd, 0);
madvise(mapped, file_size, MADV_SEQUENTIAL);
search_find(mapped, file_size, needle, n_len);

Real-World Connections

Tool Key Technique
ripgrep memchr crate with SIMD, Teddy algorithm for multiple literals
GNU grep Boyer-Moore with SIMD for long patterns
Hyperscan Intel library, SIMD-based regex engine
simdjson SIMD for JSON parsing (similar principles)

Code to Study

  • memchr crate โ€” Rust SIMD string search (ripgrep uses this)
  • Hyperscan โ€” Intelโ€™s SIMD regex library
  • simdjson โ€” SIMD JSON parsing (excellent learning resource)

Resources

Essential Reading

  1. โ€œModern X86 Assembly Language Programmingโ€ by Daniel Kusswurm โ€” SIMD chapters
  2. Intel Intrinsics Guide โ€” Reference for all SIMD instructions
  3. โ€œWhat Every Programmer Should Know About Memoryโ€ by Ulrich Drepper

Online Resources

Papers

  • โ€œHyperscan: A Fast Multi-pattern Regex Matcherโ€ (Intel)
  • โ€œParsing Gigabytes of JSON per Secondโ€ (simdjson paper)

Self-Assessment Checklist

Before considering this project complete, verify:

Understanding

  • I can explain what SIMD is and why it speeds up string search
  • I can describe the difference between aligned and unaligned loads
  • I can explain why cache-friendly algorithms beat clever algorithms
  • I can describe how to detect CPU features at runtime
  • I can explain the UTF-8 challenges in byte-level search

Implementation

  • My library correctly finds all test patterns
  • I have AVX2 and fallback implementations
  • CPU feature detection works correctly
  • UTF-8 edge cases are handled
  • My benchmarks show 10x+ improvement over naive search

Verification

  • Iโ€™ve tested on files of various sizes (1KB to 1GB)
  • Iโ€™ve tested with patterns of various lengths
  • Iโ€™ve profiled with perf and understand the hot paths
  • Iโ€™ve compared against glibc and ripgrepโ€™s memchr

SIMD string search is where algorithms meet hardware. After this project, youโ€™ll understand why the fastest code often looks nothing like the textbook version.