P05: High-Performance String Search Library
P05: High-Performance String Search Library
Build a SIMD-accelerated string search library that rivals ripgrepโs core
| Attribute | Value |
|---|---|
| Main Language | C with SIMD Intrinsics |
| Difficulty | Level 4: Expert |
| Coolness Level | Level 4: Hardcore Tech Flex |
| Knowledge Area | Algorithms, Low-Level Optimization |
| Key Tools | SSE/AVX intrinsics, perf, Compiler Explorer |
| Main Book | โModern X86 Assembly Language Programmingโ - Daniel Kusswurm |
Learning Objectives
By completing this project, you will:
- Master SIMD programming โ Write code that processes 32 bytes in a single instruction
- Understand CPU architecture โ Learn about vector registers, cache lines, and instruction pipelining
- Implement real-world algorithms โ Go beyond textbook string search to what production tools use
- Profile systematically โ Use perf to find bottlenecks and flame graphs to visualize them
- Handle encoding correctly โ Deal with UTF-8 without breaking multi-byte characters
- Build runtime dispatch โ Detect CPU features and select optimal code paths
The Core Question
โWhy are tools like ripgrep so much faster than grep, and what makes string search fast at the hardware level?โ
Every CS student learns Boyer-Moore and KMP, but these arenโt what make modern search tools fast. The real questions:
- What is SIMD? How can you search 32 bytes in the time it takes to search 1?
- Why does memory matter more than Big-O? A cache-friendly O(n) beats a cache-hostile O(log n)
- How do you write code for different CPUs? SSE, AVX2, AVX-512, NEONโhow do you support them all?
After this project, youโll understand that performance isnโt about clever algorithmsโitโs about feeding data to the CPU the way it wants.
Deep Theoretical Foundation
1. SIMD Fundamentals (Single Instruction, Multiple Data)
The Basic Idea
Instead of processing one byte at a time:
// Scalar: 32 iterations
for (int i = 0; i < 32; i++) {
if (haystack[i] == 'x') found = i;
}
Process 32 bytes simultaneously:
// SIMD: 1 "iteration"
__m256i needle = _mm256_set1_epi8('x'); // 32 copies of 'x'
__m256i block = _mm256_loadu_si256(haystack); // Load 32 bytes
__m256i cmp = _mm256_cmpeq_epi8(block, needle); // Compare all 32
int mask = _mm256_movemask_epi8(cmp); // Get bitmask of matches
// mask has 1 bit for each matching position
Vector Register Sizes
| Technology | Register Size | Registers | Introduced |
|---|---|---|---|
| SSE | 128 bits (16 bytes) | XMM0-XMM15 | Pentium III (1999) |
| AVX | 256 bits (32 bytes) | YMM0-YMM15 | Sandy Bridge (2011) |
| AVX-512 | 512 bits (64 bytes) | ZMM0-ZMM31 | Skylake-X (2017) |
Visual Model
256-bit YMM register (AVX2):
โโโโโโฌโโโโโฌโโโโโฌโโโโโฌโโโโโฌโโโโโฌโโโโโฌโโโโโฌโโโโโฌโโโโโฌโโโโโฌโโโโโฌโโโโโฌโโโโโฌโโโโโฌโโโโโ
โbyteโbyteโbyteโbyteโbyteโbyteโbyteโbyteโbyteโbyteโbyteโbyteโbyteโbyteโbyteโbyteโ
โ 0 โ 1 โ 2 โ 3 โ 4 โ 5 โ 6 โ 7 โ 8 โ 9 โ 10 โ 11 โ 12 โ 13 โ 14 โ 15 โ
โโโโโโดโโโโโดโโโโโดโโโโโดโโโโโดโโโโโดโโโโโดโโโโโดโโโโโดโโโโโดโโโโโดโโโโโดโโโโโดโโโโโดโโโโโดโโโโโค
โbyteโbyteโbyteโbyteโbyteโbyteโbyteโbyteโbyteโbyteโbyteโbyteโbyteโbyteโbyteโbyteโ
โ 16 โ 17 โ 18 โ 19 โ 20 โ 21 โ 22 โ 23 โ 24 โ 25 โ 26 โ 27 โ 28 โ 29 โ 30 โ 31 โ
โโโโโโดโโโโโดโโโโโดโโโโโดโโโโโดโโโโโดโโโโโดโโโโโดโโโโโดโโโโโดโโโโโดโโโโโดโโโโโดโโโโโดโโโโโดโโโโโ
One instruction compares ALL 32 bytes against a target character!
2. Key SIMD Intrinsics for String Search
Loading Data
#include <immintrin.h>
// Load 32 bytes from memory (unaligned is fine on modern CPUs)
__m256i data = _mm256_loadu_si256((__m256i*)ptr);
// Load 32 bytes (aligned to 32-byte boundary, slightly faster)
__m256i data = _mm256_load_si256((__m256i*)aligned_ptr);
Broadcasting a Single Byte
// Create vector with 32 copies of a character
__m256i needle = _mm256_set1_epi8('x');
// Result: [x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x,
// x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x]
Comparing Bytes
// Compare each byte position, result is 0xFF for match, 0x00 for mismatch
__m256i result = _mm256_cmpeq_epi8(data, needle);
// If data[5] == 'x', then result[5] = 0xFF
// If data[6] != 'x', then result[6] = 0x00
Extracting Match Positions
// Convert comparison result to bitmask (one bit per byte)
int mask = _mm256_movemask_epi8(result);
// If bytes 5 and 17 matched: mask = 0b00000000000000100000000000100000
// Find position of first match
int pos = __builtin_ctz(mask); // Count trailing zeros = 5
3. The SIMD String Search Algorithm
Basic Approach: First-Character Filter
For pattern "cat" in haystack:
1. Broadcast 'c' (first char) to all 32 lanes
2. Load 32 bytes of haystack
3. Compare all 32 bytes against 'c'
4. For each match position:
- Verify full pattern "cat" with memcmp
5. Advance 32 bytes, repeat
Code
const char* simd_search(const char* haystack, size_t h_len,
const char* needle, size_t n_len) {
if (n_len == 0) return haystack;
if (n_len > h_len) return NULL;
__m256i first = _mm256_set1_epi8(needle[0]);
size_t i = 0;
while (i + 32 <= h_len) {
__m256i block = _mm256_loadu_si256((__m256i*)(haystack + i));
__m256i cmp = _mm256_cmpeq_epi8(block, first);
uint32_t mask = _mm256_movemask_epi8(cmp);
while (mask != 0) {
int pos = __builtin_ctz(mask); // Find next set bit
// Check if full pattern matches
if (i + pos + n_len <= h_len &&
memcmp(haystack + i + pos, needle, n_len) == 0) {
return haystack + i + pos;
}
mask &= mask - 1; // Clear lowest set bit
}
i += 32;
}
// Handle remaining bytes with scalar code
return scalar_search(haystack + i, h_len - i, needle, n_len);
}
4. Why This Works Better Than Textbook Algorithms
Boyer-Moore and KMP
These algorithms have great theoretical complexity (O(n+m)) but:
- Poor cache behavior: Jump around in the pattern, causing cache misses
- Branch mispredictions: Decision logic on each character
- No parallelism: Process one character at a time
SIMD Approach
- Sequential memory access: Perfect for prefetching
- Minimal branches: Just one branch per 32 bytes (on average)
- Massive parallelism: 32 comparisons per instruction
When Boyer-Moore Wins
For very long patterns (50+ characters), the skip distance becomes significant enough that Boyer-Moore might win. But for typical patterns (1-20 chars), SIMD dominates.
5. Cache-Aware Programming
Cache Line Basics
- A cache line is typically 64 bytes
- Loading one byte loads the entire 64-byte line
- Sequential access is fast (prefetcher loads next lines)
- Random access is slow (each access might be a cache miss)
String Search is Perfectly Sequential
Memory: [haystack byte 0][byte 1][byte 2]...[byte 63][byte 64]...
โฒ
L1 cache fetches 64 bytes at a time
With SIMD processing 32 bytes:
- First iteration: Bytes 0-31 (already in cache from prefetch)
- Second iteration: Bytes 32-63 (already in cache)
- Third iteration: Bytes 64-95 (prefetcher loaded it)
Measuring Cache Behavior
$ perf stat -e cache-misses,cache-references ./search_bench
Performance counter stats for './search_bench':
12,345 cache-misses # 0.12% of all cache refs
10,000,000 cache-references
# Low cache-miss rate = good cache usage
6. CPU Feature Detection
The Problem
Not all CPUs support AVX2. Your code must:
- Detect available features at runtime
- Select the best implementation
- Fall back gracefully on older CPUs
Detection with CPUID
#include <cpuid.h>
typedef struct {
bool sse42;
bool avx2;
bool avx512f;
} cpu_features_t;
void detect_features(cpu_features_t* f) {
unsigned int eax, ebx, ecx, edx;
__cpuid(1, eax, ebx, ecx, edx);
f->sse42 = (ecx >> 20) & 1;
__cpuid_count(7, 0, eax, ebx, ecx, edx);
f->avx2 = (ebx >> 5) & 1;
f->avx512f = (ebx >> 16) & 1;
}
GCC/Clang Helper
void select_implementation(void) {
#if defined(__x86_64__) || defined(_M_X64)
if (__builtin_cpu_supports("avx512f")) {
search_fn = search_avx512;
} else if (__builtin_cpu_supports("avx2")) {
search_fn = search_avx2;
} else if (__builtin_cpu_supports("sse4.2")) {
search_fn = search_sse42;
} else {
search_fn = search_scalar;
}
#elif defined(__aarch64__)
search_fn = search_neon; // ARM NEON is always available on AArch64
#else
search_fn = search_scalar;
#endif
}
7. UTF-8 Correctness
The Problem
UTF-8 uses variable-length encoding:
- ASCII (0-127): 1 byte
- Most European: 2 bytes
- Asian languages: 3 bytes
- Emoji: 4 bytes
If you find a byte match, you might be in the middle of a multi-byte character!
UTF-8 for "cafรฉ": c a f 0xC3 0xA9
โ โ โ โโโโดโโ 'รฉ' (2 bytes)
โโโโดโโโดโโ ASCII bytes
If searching for byte 0xA9, you'd match inside 'รฉ' - invalid!
Solutions
- Byte-level search + validation: Find byte matches, then verify youโre at a character boundary
- Search for complete codepoints: Expand pattern to UTF-8 bytes, ensure match boundaries are valid
- Use ripgrepโs approach: Search bytes, report byte positions, let caller handle boundaries
// Check if byte is a UTF-8 continuation byte
bool is_continuation_byte(unsigned char b) {
return (b & 0xC0) == 0x80; // 10xxxxxx
}
// Validate match is at character boundary
bool valid_utf8_boundary(const char* haystack, size_t pos) {
return !is_continuation_byte((unsigned char)haystack[pos]);
}
8. Profiling with perf
Basic Profiling
# Record CPU cycles
$ perf record -g ./fastsearch "pattern" largefile.txt
# View report
$ perf report
# Samples: 5K of event 'cycles'
# Overhead Symbol
45.23% search_avx2
12.45% __memcmp_avx2
8.92% [kernel.kallsyms]
...
Detailed Statistics
$ perf stat -e cycles,instructions,cache-misses,branch-misses ./search_bench
Performance counter stats:
10,234,567,890 cycles
25,678,901,234 instructions # 2.51 IPC (good!)
12,345 cache-misses # Very low (good!)
345,678 branch-misses # Low (good for SIMD)
Flame Graphs
$ perf record -g ./fastsearch ...
$ perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg
Project Specification
What Youโll Build
A high-performance string search library with:
// search.h
// Initialize (detect CPU features, select implementation)
void search_init(void);
// Basic search
const char* search_find(const char* haystack, size_t h_len,
const char* needle, size_t n_len);
// Search with options
typedef struct {
bool case_insensitive;
bool utf8_mode;
bool use_simd; // Auto-detect if not specified
} search_opts_t;
const char* search_find_opts(const char* haystack, size_t h_len,
const char* needle, size_t n_len,
const search_opts_t* opts);
// Get detected features
const char* search_get_features(void);
// Benchmark helper
double search_benchmark(const char* haystack, size_t h_len,
const char* needle, size_t n_len,
int iterations);
Deliverables
- Core library:
search.c,search_avx2.c,search_sse42.c,search_scalar.c - Header file:
search.hwith public API - CLI tool:
fastsearch pattern filecommand-line utility - Benchmark suite: Comparison against glibc, memmem, and naive search
- Test suite: Correctness tests including UTF-8 edge cases
- Documentation: Explaining algorithm choices and when to use each
Success Criteria
# Build
$ make
gcc -O3 -mavx2 -o libsearch.so search.c search_avx2.c -shared -fPIC
gcc -O3 -mavx2 -o fastsearch main.c -L. -lsearch
# Verify correctness
$ ./test_suite
All 500 tests passed
# Performance benchmark
$ ./bench_search
Throughput on 100MB file:
Algorithm Throughput vs naive
-------------------------------------------
Naive 142 MB/s 1.0x
glibc strstr 385 MB/s 2.7x
memmem 521 MB/s 3.7x
SSE4.2 (this lib) 1,234 MB/s 8.7x
AVX2 (this lib) 2,847 MB/s 20.0x
# Real-world comparison
$ time ./fastsearch "pattern" large_log.txt
0.143s
$ time grep "pattern" large_log.txt
1.521s
Solution Architecture
Component Overview
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Public API (search.h) โ
โ search_init() / search_find() / search_find_opts() โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Feature Dispatcher โ
โ Detects CPU features at init, stores function pointer โ
โ โ
โ search_fn = detect_best_implementation(); โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ
โโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ โ โ
โผ โผ โผ
โโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโ
โ search_avx512 โ โ search_avx2 โ โ search_scalar โ
โ (512-bit) โ โ (256-bit) โ โ (fallback) โ
โ โ โ โ โ โ
โ 64 bytes/iter โ โ 32 bytes/iter โ โ 1 byte/iter โ
โโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโ
โ โ โ
โโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ
โผ
โโโโโโโโโโโโโโโโโโโโโโโโโ
โ Tail Handler โ
โ (remaining bytes) โ
โโโโโโโโโโโโโโโโโโโโโโโโโ
Data Structures
Search Context
typedef struct {
// Configuration
bool case_insensitive;
bool utf8_mode;
// Precomputed data
__m256i first_char_vec; // Broadcast of needle[0]
__m256i last_char_vec; // Broadcast of needle[n-1] (for two-byte filter)
// Needle info
const char* needle;
size_t needle_len;
// Statistics
uint64_t bytes_searched;
uint64_t matches_found;
} search_ctx_t;
Function Pointer Dispatch
typedef const char* (*search_fn_t)(
const char* haystack, size_t h_len,
const char* needle, size_t n_len
);
static search_fn_t search_impl = NULL;
void search_init(void) {
#if defined(__x86_64__)
if (__builtin_cpu_supports("avx512f")) {
search_impl = search_avx512;
} else if (__builtin_cpu_supports("avx2")) {
search_impl = search_avx2;
} else if (__builtin_cpu_supports("sse4.2")) {
search_impl = search_sse42;
} else {
search_impl = search_scalar;
}
#else
search_impl = search_scalar;
#endif
}
Key Algorithms
Two-Character Filter (Improved)
Instead of just checking the first character, check first AND last:
const char* search_avx2_two_char(const char* hay, size_t h_len,
const char* needle, size_t n_len) {
__m256i first = _mm256_set1_epi8(needle[0]);
__m256i last = _mm256_set1_epi8(needle[n_len - 1]);
size_t i = 0;
while (i + n_len + 31 <= h_len) {
__m256i block_first = _mm256_loadu_si256((__m256i*)(hay + i));
__m256i block_last = _mm256_loadu_si256((__m256i*)(hay + i + n_len - 1));
__m256i eq_first = _mm256_cmpeq_epi8(block_first, first);
__m256i eq_last = _mm256_cmpeq_epi8(block_last, last);
__m256i match = _mm256_and_si256(eq_first, eq_last); // Both must match
uint32_t mask = _mm256_movemask_epi8(match);
while (mask != 0) {
int pos = __builtin_ctz(mask);
if (memcmp(hay + i + pos, needle, n_len) == 0) {
return hay + i + pos;
}
mask &= mask - 1;
}
i += 32;
}
return search_scalar(hay + i, h_len - i, needle, n_len);
}
This dramatically reduces false positives, especially for longer patterns.
Phased Implementation Guide
Phase 1: Scalar Baseline (Days 1-2)
Goal: Working search with benchmarking infrastructure.
Steps:
- Implement naive byte-by-byte search
- Create benchmark harness with timing
- Compare against
strstr()andmemmem() - Set up test suite with edge cases
Test: All tests pass, benchmark shows baseline performance.
Phase 2: SSE4.2 Implementation (Days 2-3)
Goal: First SIMD version using 128-bit vectors.
Steps:
- Learn SSE4.2 intrinsics (
_mm_*functions) - Implement first-character filter with SSE
- Handle alignment and tail bytes
- Benchmark against scalar
Test: 2-4x speedup over scalar.
Phase 3: AVX2 Implementation (Days 3-4)
Goal: Full-speed implementation with 256-bit vectors.
Steps:
- Port SSE code to AVX2 (
_mm256_*functions) - Implement two-character filter for better pruning
- Optimize hot path (minimize branches)
- Profile with
perfto find remaining bottlenecks
Test: 8-20x speedup over scalar.
Phase 4: CPU Feature Detection (Days 4-5)
Goal: Automatic selection of best implementation.
Steps:
- Implement CPUID-based feature detection
- Create function pointer dispatch
- Add runtime fallback chain (AVX2 โ SSE โ scalar)
- Test on different CPU types (if available)
Test: Works on both old and new CPUs.
Phase 5: UTF-8 and Case-Insensitive (Days 5-7)
Goal: Handle real-world text correctly.
Steps:
- Implement UTF-8 boundary validation
- Add case-insensitive search option
- Handle edge cases (empty patterns, pattern longer than haystack)
- Comprehensive test suite
Test: All UTF-8 edge cases pass.
Phase 6: CLI and Polish (Days 8-10)
Goal: Usable command-line tool.
Steps:
- Create
fastsearchCLI - Add file handling and memory mapping
- Documentation and examples
- Final benchmarks and comparisons
Testing Strategy
Unit Tests
// Basic functionality
void test_basic_search(void) {
const char* hay = "hello world";
assert(search_find(hay, 11, "world", 5) == hay + 6);
assert(search_find(hay, 11, "xyz", 3) == NULL);
assert(search_find(hay, 11, "", 0) == hay);
}
// Edge cases
void test_edge_cases(void) {
// Pattern at start
assert(search_find("abcdef", 6, "abc", 3) != NULL);
// Pattern at end
assert(search_find("abcdef", 6, "def", 3) != NULL);
// Pattern longer than haystack
assert(search_find("abc", 3, "abcdef", 6) == NULL);
// Repeated pattern
const char* hay = "abcabcabc";
const char* first = search_find(hay, 9, "abc", 3);
assert(first == hay);
}
// UTF-8 correctness
void test_utf8(void) {
const char* hay = "hello cafรฉ world";
// Should find "cafรฉ"
assert(search_find(hay, strlen(hay), "cafรฉ", 5) != NULL);
// Should NOT find partial codepoint
// (implementation-specific behavior)
}
Stress Tests
void test_large_file(void) {
// Generate 100MB of random text
char* haystack = generate_random_text(100 * 1024 * 1024);
// Insert known pattern at random position
memcpy(haystack + 50000000, "UNIQUE_PATTERN", 14);
const char* found = search_find(haystack, 100*1024*1024,
"UNIQUE_PATTERN", 14);
assert(found == haystack + 50000000);
free(haystack);
}
void test_worst_case(void) {
// Pattern that causes many false positives
char* haystack = malloc(1000000);
memset(haystack, 'a', 1000000); // All 'a's
// Search for "aaaaaab" - many partial matches
const char* result = search_find(haystack, 1000000, "aaaaaab", 7);
assert(result == NULL); // Not found
free(haystack);
}
Benchmark Suite
void run_benchmarks(void) {
const size_t sizes[] = {1024, 1024*1024, 100*1024*1024};
const char* patterns[] = {"the", "pattern", "superlongpattern"};
for (size_t s = 0; s < ARRAY_SIZE(sizes); s++) {
char* haystack = generate_test_data(sizes[s]);
for (size_t p = 0; p < ARRAY_SIZE(patterns); p++) {
double scalar_time = benchmark(search_scalar, haystack, sizes[s],
patterns[p], strlen(patterns[p]));
double simd_time = benchmark(search_find, haystack, sizes[s],
patterns[p], strlen(patterns[p]));
printf("Size: %zuMB, Pattern: '%s'\n",
sizes[s] / (1024*1024), patterns[p]);
printf(" Scalar: %.2f MB/s\n", sizes[s] / scalar_time / 1e6);
printf(" SIMD: %.2f MB/s (%.1fx faster)\n",
sizes[s] / simd_time / 1e6, scalar_time / simd_time);
}
free(haystack);
}
}
Common Pitfalls and Debugging
Pitfall 1: Unaligned Access Crashes
Symptom: SIGBUS or SIGSEGV on some systems.
Cause: Using aligned load (_mm256_load_si256) on unaligned data.
Fix: Use unaligned load (negligible performance difference on modern CPUs):
// Wrong (requires 32-byte alignment)
__m256i data = _mm256_load_si256((__m256i*)ptr);
// Right (works with any address)
__m256i data = _mm256_loadu_si256((__m256i*)ptr);
Pitfall 2: Reading Past Buffer End
Symptom: Occasional crashes or wrong results.
Cause: Loading 32 bytes when less than 32 remain.
Fix: Check bounds carefully:
// Wrong
while (i < h_len) {
__m256i block = _mm256_loadu_si256((__m256i*)(hay + i));
// Reads past end if h_len - i < 32!
}
// Right
while (i + 32 <= h_len) {
__m256i block = _mm256_loadu_si256((__m256i*)(hay + i));
// ...
i += 32;
}
// Handle remaining i to h_len with scalar
Pitfall 3: Feature Detection at Compile Time
Symptom: Crashes on older CPUs, or suboptimal performance on newer ones.
Cause: Using -mavx2 flag makes compiler use AVX2 everywhere.
Fix: Compile different implementations separately:
# Build each implementation with appropriate flags
search_avx2.o: search_avx2.c
$(CC) -mavx2 -c $< -o $@
search_sse42.o: search_sse42.c
$(CC) -msse4.2 -c $< -o $@
search_scalar.o: search_scalar.c
$(CC) -c $< -o $@
# Link all together
libsearch.so: search.o search_avx2.o search_sse42.o search_scalar.o
$(CC) -shared $^ -o $@
Pitfall 4: Benchmark Variability
Symptom: Results vary wildly between runs.
Cause: CPU frequency scaling, other processes, cache state.
Fix:
# Disable frequency scaling
$ sudo cpupower frequency-set --governor performance
# Run multiple iterations
$ for i in {1..10}; do ./benchmark; done | awk '{sum+=$1} END {print sum/NR}'
# Pin to a specific CPU
$ taskset -c 0 ./benchmark
Extensions and Challenges
Extension 1: Multiple Pattern Search (Aho-Corasick + SIMD)
Search for many patterns simultaneously:
search_multi_t* sm = search_multi_create();
search_multi_add(sm, "error");
search_multi_add(sm, "warning");
search_multi_add(sm, "critical");
search_multi_find_all(sm, haystack, h_len, callback, user_data);
Extension 2: Regular Expression Support
Simple regex patterns with SIMD acceleration:
// Character classes
search_find_regex(hay, h_len, "[0-9]+"); // SIMD for digit detection
Extension 3: Parallel Search
Use multiple threads for very large files:
search_parallel_find(haystack, 1GB, needle, n_len, 8); // 8 threads
Extension 4: Memory-Mapped I/O Integration
Search files without loading entirely into memory:
int fd = open(filename, O_RDONLY);
void* mapped = mmap(NULL, file_size, PROT_READ, MAP_PRIVATE, fd, 0);
madvise(mapped, file_size, MADV_SEQUENTIAL);
search_find(mapped, file_size, needle, n_len);
Real-World Connections
Tools Using SIMD String Search
| Tool | Key Technique |
|---|---|
| ripgrep | memchr crate with SIMD, Teddy algorithm for multiple literals |
| GNU grep | Boyer-Moore with SIMD for long patterns |
| Hyperscan | Intel library, SIMD-based regex engine |
| simdjson | SIMD for JSON parsing (similar principles) |
Code to Study
- memchr crate โ Rust SIMD string search (ripgrep uses this)
- Hyperscan โ Intelโs SIMD regex library
- simdjson โ SIMD JSON parsing (excellent learning resource)
Resources
Essential Reading
- โModern X86 Assembly Language Programmingโ by Daniel Kusswurm โ SIMD chapters
- Intel Intrinsics Guide โ Reference for all SIMD instructions
- โWhat Every Programmer Should Know About Memoryโ by Ulrich Drepper
Online Resources
- Intel Intrinsics Guide
- Compiler Explorer โ See generated assembly
- BurntSushiโs Blog โ ripgrep authorโs posts on SIMD
Papers
- โHyperscan: A Fast Multi-pattern Regex Matcherโ (Intel)
- โParsing Gigabytes of JSON per Secondโ (simdjson paper)
Self-Assessment Checklist
Before considering this project complete, verify:
Understanding
- I can explain what SIMD is and why it speeds up string search
- I can describe the difference between aligned and unaligned loads
- I can explain why cache-friendly algorithms beat clever algorithms
- I can describe how to detect CPU features at runtime
- I can explain the UTF-8 challenges in byte-level search
Implementation
- My library correctly finds all test patterns
- I have AVX2 and fallback implementations
- CPU feature detection works correctly
- UTF-8 edge cases are handled
- My benchmarks show 10x+ improvement over naive search
Verification
- Iโve tested on files of various sizes (1KB to 1GB)
- Iโve tested with patterns of various lengths
- Iโve profiled with perf and understand the hot paths
- Iโve compared against glibc and ripgrepโs memchr
SIMD string search is where algorithms meet hardware. After this project, youโll understand why the fastest code often looks nothing like the textbook version.