Project 5: High-Performance String Search Library
Build a SIMD-accelerated substring search library with a CLI benchmark and UTF-8 safety.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 4: Expert |
| Time Estimate | 2 to 4 weeks |
| Main Programming Language | C (C17 with SIMD intrinsics) |
| Alternative Programming Languages | C++, Rust |
| Coolness Level | Level 4: Hardcore Tech Flex |
| Business Potential | 4 - Open Core Infrastructure |
| Prerequisites | C pointers, CPU caches, basic algorithms |
| Key Topics | string search algorithms, SIMD, alignment, cache behavior, benchmarking |
1. Learning Objectives
By completing this project, you will:
- Implement multiple string search algorithms and select between them.
- Use SIMD intrinsics safely without undefined behavior.
- Understand how cache and branch prediction dominate real-world performance.
- Build CPU feature detection and dispatch for SSE4.2 and AVX2.
- Produce deterministic benchmarks that compare against `strstr` and `memmem`.
2. All Theory Needed (Per-Concept Breakdown)
Concept 1: String Search Algorithms and UTF-8 Boundaries
Fundamentals
Substring search has classic algorithms like naive search, Boyer-Moore, and Two-Way. Big-O complexity is useful, but real performance depends on constants and hardware. A practical library uses different algorithms depending on needle length. For short needles, a simple scan with memchr or a vectorized first-byte search can be fastest. For longer needles, Boyer-Moore or Two-Way reduces comparisons by skipping ahead. If your input is UTF-8, you must also decide whether to treat it as raw bytes or to avoid matches that cut across multibyte sequences. Byte-level search is faster but can return matches inside UTF-8 continuation bytes. A good library exposes a mode: byte search or UTF-8 safe search.
Deep Dive into the concept
The naive algorithm compares the needle at every position. It is O(n*m) but has excellent cache behavior and low overhead for short needles. Boyer-Moore improves average-case performance by using two heuristics: the bad character rule and the good suffix rule. These allow the algorithm to skip ahead when a mismatch occurs. Two-Way is a linear-time algorithm used by glibc for memmem because it has good worst-case guarantees and is faster in practice for many patterns. A high-performance library often implements a hybrid strategy: use a fast first-byte scan to find candidate positions, then validate with a full compare, and fall back to a different algorithm for long needles.
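The bad character rule described above can be sketched as a Horspool-style search (a simplified Boyer-Moore variant that uses only the bad character heuristic; `bm_bad_char_find` is an illustrative name, not part of the library API):

```c
#include <stddef.h>
#include <string.h>

/* Horspool variant of Boyer-Moore: only the bad-character rule.
   A full Boyer-Moore implementation would add the good-suffix rule. */
static size_t bm_bad_char_find(const unsigned char *h, size_t hlen,
                               const unsigned char *n, size_t nlen) {
    size_t skip[256];
    if (nlen == 0) return 0;
    if (nlen > hlen) return (size_t)-1;
    /* Bytes absent from the needle allow a skip of the full needle length. */
    for (size_t i = 0; i < 256; i++) skip[i] = nlen;
    /* Bytes in the needle (except the last) skip based on last occurrence. */
    for (size_t i = 0; i + 1 < nlen; i++) skip[n[i]] = nlen - 1 - i;
    size_t pos = 0;
    while (pos + nlen <= hlen) {
        if (memcmp(h + pos, n, nlen) == 0) return pos;
        pos += skip[h[pos + nlen - 1]];  /* shift by the aligned last byte */
    }
    return (size_t)-1;
}
```

On a mismatch the window shifts so the last aligned haystack byte lines up with its last occurrence inside the needle, which is exactly the "skip ahead" behavior described above.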
Choosing the right algorithm depends on needle length and alphabet size. For example, for a 2 or 3 byte needle, precomputing a full Boyer-Moore table may be slower than scanning with SIMD and using memcmp on candidates. For long needles, Boyer-Moore or Two-Way can skip large portions of the haystack, especially when the alphabet is large and mismatches are common. This is why libraries like ripgrep use specialized strategies for short patterns.
UTF-8 adds another layer. A UTF-8 string is a sequence of bytes where multibyte characters have specific bit patterns. The continuation bytes always have the two high bits 10. If you perform a byte-level search, you can find a match that starts at a continuation byte, which is not a valid character boundary. Some applications are fine with byte-level search, but others require that matches align to character boundaries. To handle this, you can add a validation step: when a candidate match is found, check that the starting byte is not a continuation byte. If the needle itself can contain multibyte sequences, you may also need to ensure that the match does not split a character. This can be done by validating the bytes around the match or by running a UTF-8 decoder to determine boundaries.
The complexity of UTF-8 validation can be reduced by precomputing a mask of continuation bytes. For example, you can scan the haystack and mark all bytes that are continuation bytes, then ensure matches start only at non-continuation positions. SIMD can help here too: you can compare the high bits of 16 or 32 bytes at once to identify continuation bytes. The trade-off is additional work per scan. Your library should therefore expose a mode flag so the user can choose between raw byte speed and UTF-8 correctness.
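The boundary check itself is cheap. A minimal sketch of the continuation-byte test and the validation step applied to a candidate match (helper names are illustrative):

```c
#include <stddef.h>

/* A UTF-8 continuation byte has its two high bits set to 10. */
static int is_continuation(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

/* A candidate match at `pos` of length `nlen` is boundary-safe only if it
   does not start on a continuation byte and does not end mid-character
   (the byte just past the match must not be a continuation byte). */
static int match_on_boundary(const unsigned char *hay, size_t hlen,
                             size_t pos, size_t nlen) {
    if (is_continuation(hay[pos])) return 0;            /* starts mid-char */
    if (pos + nlen < hlen && is_continuation(hay[pos + nlen]))
        return 0;                                       /* splits a char   */
    return 1;
}
```

In UTF-8 mode the search loop simply filters candidates through this check before reporting them; in raw byte mode the check is skipped entirely.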
Finally, algorithm selection should be data-driven. Your benchmark should measure crossover points: at what needle lengths does SIMD scanning beat Boyer-Moore? At what sizes does Two-Way become faster? This is hardware-dependent. The library should encapsulate selection logic in one place and allow overrides for testing. That selection logic becomes part of the library’s performance contract and should be documented.
How This Fits on Projects
This concept defines the algorithm selection in Section 3.2, the data structures in Section 4.4, and the benchmarking in Section 3.4 and Section 5.10 Phase 2. It also connects to string parsing in P03-mini-async-runtime.md and to data format validation in P06-embedded-key-value-database.md.
Definitions & Key Terms
- Needle: The pattern being searched for.
- Haystack: The text being searched.
- Boyer-Moore: Algorithm using skip heuristics for fast search.
- Two-Way: Linear-time algorithm used in glibc `memmem`.
- UTF-8 continuation byte: A byte with high bits `10`.
Mental Model Diagram (ASCII)
```text
Haystack: [....A....A....A....]
Needle:   [A B C]
Scan -> candidate -> verify -> match
```
How It Works (Step-by-Step)
- Choose algorithm based on needle length.
- Scan for candidate positions (often by first byte).
- Verify full needle match at candidate.
- If UTF-8 mode, validate boundary.
- Return match offsets.
Invariants:
- All returned matches are valid for the chosen mode.
- Algorithm selection is deterministic for a given input.
Failure modes:
- Incorrect boundary handling in UTF-8 mode.
- Slowdowns due to poor algorithm selection.
Minimal Concrete Example
```c
// Naive scan: h = haystack of length hlen, needle of length n.
// Returns the first match offset; the caller falls through to "not found".
for (size_t i = 0; i + n <= hlen; i++) {
    if (memcmp(h + i, needle, n) == 0) return i;
}
```
Common Misconceptions
- “Big-O always predicts performance” -> Constants and cache matter more.
- “UTF-8 is just bytes” -> Byte-level matches can break character boundaries.
- “One algorithm is enough” -> Real-world inputs need hybrid strategies.
Check-Your-Understanding Questions
- Why can a naive algorithm beat Boyer-Moore for short needles?
- What is a UTF-8 continuation byte and why does it matter?
- Why is algorithm selection hardware-dependent?
Check-Your-Understanding Answers
- The naive algorithm has low overhead and good cache locality.
- Continuation bytes indicate the middle of a multi-byte character; matches there are invalid in UTF-8 mode.
- SIMD width, cache size, and branch prediction vary across CPUs.
Real-World Applications
- ripgrep uses specialized fast paths for short patterns.
- glibc uses Two-Way for `memmem`.
Where You’ll Apply It
- This project: Section 3.2 Functional Requirements, Section 4.4 Algorithm Overview, Section 5.10 Phase 2.
- Also used in: P06-embedded-key-value-database.md for key scanning and P03-mini-async-runtime.md for protocol parsing.
References
- “Algorithms” by Sedgewick and Wayne - string search chapters.
- “The C Programming Language” - string functions and pitfalls.
Key Insights
Fast string search is a hybrid strategy, not a single algorithm.
Summary
Algorithm choice dominates performance. A practical library blends naive scans, SIMD scanning, and skip-based algorithms while respecting UTF-8 boundaries when required.
Homework/Exercises to Practice the Concept
- Implement naive search and measure it on short needles.
- Implement a Boyer-Moore bad character table.
- Add a UTF-8 boundary check and compare performance.
Solutions to the Homework/Exercises
- Naive search often wins for needles under 4 bytes.
- The bad character rule allows large skips on mismatches.
- UTF-8 checks reduce false matches but add overhead.
Concept 2: SIMD Intrinsics, Alignment, and Undefined Behavior
Fundamentals
SIMD allows you to compare multiple bytes at once, often 16 or 32 bytes per instruction. This can accelerate the first-byte scan and candidate validation. However, SIMD code is easy to get wrong: unaligned loads or reads past the end of a buffer can cause undefined behavior. Even if the CPU allows unaligned loads, the C standard does not allow reading beyond the end of an object. You must ensure that loads are guarded or use masked loads for the tail. You also need to handle different instruction sets (SSE4.2 vs AVX2) with feature detection and fallback paths. Correct SIMD is therefore about safety first, performance second.
Deep Dive into the concept
SIMD intrinsics expose vector instructions to C code. For example, _mm_loadu_si128 loads 16 bytes from memory into an SSE register. Unaligned loads are allowed on modern x86 but still require that the memory is within a valid object. If you load beyond the end of a buffer, you have undefined behavior even if the memory happens to be mapped. The safe solution is to handle the tail with a scalar loop or to use masked loads on AVX2 or AVX-512. Another technique is to allocate a padded buffer so you can safely load a full vector at the end. For a library, you should not assume the caller has padded the input, so your code must be safe.
Alignment is another constraint. Some instructions require aligned loads, and even those that allow unaligned loads can be slower when crossing cache lines. A practical strategy is to align the scanning pointer to a vector boundary and then process aligned blocks in the main loop. For the first few bytes before alignment, use scalar code. This hybrid approach keeps correctness while maximizing throughput. You also need to consider the alignment of the needle, particularly if you precompute vector constants or compare multiple bytes at a fixed offset.
SIMD string search often uses a two-stage approach: first, find candidate positions where the first and last bytes of the needle match using vector compares. Then, verify the candidate with a scalar compare. This avoids loading the entire needle for every position. For example, if the needle starts with ‘E’, you can use pcmpeqb to compare 16 bytes against ‘E’ and then use a bitmask to locate candidate positions. AVX2 doubles the width to 32 bytes and can improve throughput, but only if the data is large enough to amortize the overhead.
CPU feature detection is required because not all machines support SSE4.2 or AVX2. You should implement a dispatch mechanism that selects the best available implementation at runtime. On x86, you can use the CPUID instruction to detect features. The dispatch should be cached so you do not run CPUID on every call. If AVX2 is available, you can use it, but you must also check OS support for saving AVX registers. This is usually done by checking the OSXSAVE bit and XCR0. The fallback path should be a correct scalar implementation.
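A minimal dispatch sketch, assuming GCC or Clang on x86, where `__builtin_cpu_supports` wraps the CPUID and OSXSAVE/XCR0 checks described above. The vector kernels are stubbed with the scalar path here; all names are illustrative:

```c
#include <stddef.h>
#include <string.h>

typedef long (*find_fn)(const char *, size_t, const char *, size_t);

/* Scalar fallback: always correct, used when no SIMD is available. */
static long find_scalar(const char *h, size_t hlen, const char *n, size_t nlen) {
    if (nlen == 0) return 0;
    if (nlen > hlen) return -1;
    for (size_t i = 0; i + nlen <= hlen; i++)
        if (memcmp(h + i, n, nlen) == 0) return (long)i;
    return -1;
}

/* Stand-ins for the real vectorized kernels (hypothetical in this sketch). */
static long find_sse42(const char *h, size_t hl, const char *n, size_t nl) { return find_scalar(h, hl, n, nl); }
static long find_avx2 (const char *h, size_t hl, const char *n, size_t nl) { return find_scalar(h, hl, n, nl); }

/* __builtin_cpu_supports handles both CPUID feature bits and the
   OSXSAVE/XCR0 check required before touching AVX registers. */
static find_fn resolve(void) {
#if defined(__x86_64__) || defined(__i386__)
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))   return find_avx2;
    if (__builtin_cpu_supports("sse4.2")) return find_sse42;
#endif
    return find_scalar;
}

long fast_find(const char *h, size_t hlen, const char *n, size_t nlen) {
    static find_fn cached;          /* cached: detection runs only once */
    if (!cached) cached = resolve();
    return cached(h, hlen, n, nlen);
}
```

Caching the resolved function pointer means CPUID runs once per process, not once per call, which matches the dispatch-caching requirement above.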
The biggest risk in SIMD code is undefined behavior. Reading past the end of the buffer, misaligned pointer casting, or violating strict aliasing rules can cause crashes or incorrect results. A safe design uses memcpy into a vector type or uses intrinsics that take a pointer to void. It also ensures that any pointer arithmetic stays within the bounds of the haystack. Your tests must include boundary cases: needles at the very end of the buffer, empty needles, and very small haystacks. These tests catch out-of-bounds loads that may not crash under normal conditions.
How This Fits on Projects
This concept determines the SIMD implementation in Section 4.4 and the CPU dispatch in Section 5.10 Phase 2. It also connects to alignment rules in P01-custom-memory-allocator.md and to cache-aware scheduling in P02-work-stealing-thread-pool.md.
Definitions & Key Terms
- SIMD: Single Instruction Multiple Data operations on vectors.
- Intrinsic: Compiler-provided function that maps to a CPU instruction.
- Alignment: Address boundary required for safe or fast loads.
- CPUID: Instruction to query CPU features on x86.
- OSXSAVE: CPU/OS feature for saving extended registers.
Mental Model Diagram (ASCII)
```text
[bytes] -> SIMD compare -> bitmask -> candidate positions -> verify
```
How It Works (Step-by-Step)
- Detect CPU features and choose SIMD width.
- Align scanning pointer to vector boundary.
- Load vector and compare against needle first byte.
- Use bitmask to find candidate positions.
- Verify candidate with scalar compare.
Invariants:
- Vector loads never read past the end of the buffer.
- Fallback path produces identical results.
Failure modes:
- Crashes due to out-of-bounds loads.
- Incorrect matches due to misaligned comparisons.
Minimal Concrete Example
```c
__m128i chunk  = _mm_loadu_si128((const __m128i *)ptr);
__m128i needle = _mm_set1_epi8((char)first_byte);
int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle));
```
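A fuller sketch of the first-byte scan with a guarded scalar tail, so no vector load ever reads past the end of the haystack. It assumes SSE2 (baseline on x86-64) and the GCC/Clang `__builtin_ctz` builtin; `scan_first_byte` is an illustrative name:

```c
#include <emmintrin.h>   /* SSE2 intrinsics, baseline on x86-64 */
#include <stddef.h>
#include <string.h>

static long scan_first_byte(const unsigned char *h, size_t hlen,
                            const unsigned char *n, size_t nlen) {
    if (nlen == 0) return 0;
    if (nlen > hlen) return -1;
    const __m128i first = _mm_set1_epi8((char)n[0]);
    size_t i = 0, last = hlen - nlen;   /* last valid start position */
    /* Main loop: iterate only while a full 16-byte load stays in bounds. */
    while (i + 16 <= hlen) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(h + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, first));
        while (mask) {
            int bit = __builtin_ctz(mask);          /* lowest candidate */
            size_t pos = i + (size_t)bit;
            if (pos <= last && memcmp(h + pos, n, nlen) == 0)
                return (long)pos;
            mask &= mask - 1;                       /* clear that bit */
        }
        i += 16;
    }
    /* Scalar tail: the remaining < 16 positions are checked byte by byte. */
    for (; i <= last; i++)
        if (h[i] == n[0] && memcmp(h + i, n, nlen) == 0)
            return (long)i;
    return -1;
}
```

The verification `memcmp` is also bounds-safe: `pos <= last` guarantees the needle comparison stays inside the haystack.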
Common Misconceptions
- “Unaligned loads are always safe” -> They are safe only within object bounds.
- “SIMD always wins” -> Overhead can outweigh gains for small inputs.
- “AVX2 is always faster” -> It can be slower if data is small or not aligned.
Check-Your-Understanding Questions
- Why is reading past the end of a buffer undefined behavior?
- Why do you need CPU feature detection for SIMD?
- Why do SIMD implementations often need a scalar tail loop?
Check-Your-Understanding Answers
- The C standard forbids accessing memory outside the object even if mapped.
- Not all CPUs support the same instructions.
- The tail is smaller than the vector width and must be handled safely.
Real-World Applications
- Hyperscan uses SIMD for fast pattern matching.
- ripgrep uses SIMD for literal searches.
Where You’ll Apply It
- This project: Section 4.4 Data Structures, Section 5.10 Phase 2, Section 6.2 Critical Test Cases.
- Also used in: P01-custom-memory-allocator.md for alignment guarantees.
References
- “Modern X86 Assembly Language Programming” - SIMD chapters.
- “Computer Systems: A Programmer’s Perspective” - alignment and memory.
Key Insights
SIMD is powerful but unforgiving; correctness and bounds safety come first.
Summary
SIMD can accelerate string search dramatically, but it demands careful handling of alignment, bounds, and CPU feature detection. A correct fallback path is mandatory.
Homework/Exercises to Practice the Concept
- Implement a SIMD first-byte scan and compare to scalar.
- Add a tail loop that handles the last bytes safely.
- Implement CPUID detection for SSE4.2 and AVX2.
Solutions to the Homework/Exercises
- SIMD scan finds candidate positions faster than scalar.
- Tail loop prevents out-of-bounds access.
- CPUID detection selects the fastest safe path.
Concept 3: Cache Behavior, Branch Prediction, and Benchmarking
Fundamentals
String search performance is dominated by memory bandwidth and cache behavior. Sequential scans are fast because they stream through cache lines. Random access or frequent branch mispredictions can slow down the CPU drastically. A good search library minimizes unpredictable branches and uses sequential access patterns. Benchmarking must be deterministic and avoid I/O bottlenecks. The benchmark should use in-memory buffers, fixed seeds, and warm-up iterations. It should measure not only throughput but also latency and CPU usage. Without careful benchmarking, you may optimize the wrong thing.
Deep Dive into the concept
Modern CPUs can process multiple instructions per cycle but are often stalled by memory. A cache line fetch may take dozens of cycles if it misses L1 and L2. String search that scans sequentially benefits from hardware prefetching, which loads the next cache lines automatically. Algorithms that jump around in the haystack can defeat prefetching and become slower despite fewer comparisons. This is why naive or Two-Way scans often outperform more complex algorithms on real data.
Branch prediction also matters. A branch that is hard to predict can cost 10 to 20 cycles. In string search, comparing each byte and branching on mismatch can create unpredictable patterns. SIMD helps by replacing many branches with vector comparisons and bitmask operations. Another technique is to use memchr or memcmp, which are highly optimized and often use SIMD internally. Leveraging these library functions can be faster than writing custom code.
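A hedged sketch of that approach, using only `memchr` and `memcmp` from libc; `memchr_find` is an illustrative name:

```c
#include <stddef.h>
#include <string.h>

/* Let libc do the heavy lifting: memchr is typically vectorized
   internally, so a memchr-then-verify loop is branch-light without
   any hand-written intrinsics. */
static long memchr_find(const char *h, size_t hlen,
                        const char *n, size_t nlen) {
    if (nlen == 0) return 0;
    if (nlen > hlen) return -1;
    const char *p = h;
    const char *end = h + hlen - nlen + 1;   /* one past last valid start */
    while ((p = memchr(p, n[0], (size_t)(end - p))) != NULL) {
        if (memcmp(p, n, nlen) == 0) return (long)(p - h);
        p++;                                 /* skip this candidate */
    }
    return -1;
}
```

This is often a strong baseline to beat: any custom SIMD path that loses to this loop is not worth shipping.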
Benchmarking must isolate the search algorithm from I/O and OS noise. Use a fixed seed to generate the haystack and needle. Keep the haystack in memory and run multiple iterations to warm caches. Measure with a high-resolution monotonic clock and report p50 and p99 latencies. Also measure the effect of needle length and position. For example, if the needle appears near the beginning, all algorithms look fast; if it appears near the end or not at all, the differences emerge. A good benchmark should include both cases.
Determinism is critical. If the benchmark uses random data without a fixed seed, each run will differ and regressions will be hard to detect. You should report the seed and include it in output. You should also pin CPU frequency or at least disable turbo for consistent results if possible. In this project, you can keep it simple: fix the seed, fix the input sizes, and run multiple iterations, then report the median.
Finally, dispatch overhead should be considered. If you choose an algorithm per call, that logic should be fast. You can precompute a strategy when the needle is prepared, then reuse it across searches. This is a common design: a `fastsearch_prepare` function creates a plan, and `fastsearch_find` uses it. This reduces overhead and makes performance more stable. It also enables better benchmarking because you can isolate preprocessing cost from search cost.
How This Fits on Projects
This concept guides the benchmark design in Section 3.4 and the algorithm selection in Section 5.10 Phase 3. It also connects to performance measurement in P02-work-stealing-thread-pool.md and allocator benchmarking in P01-custom-memory-allocator.md.
Definitions & Key Terms
- Cache line: 64-byte block moved between memory and cache.
- Prefetch: Hardware loading of future cache lines.
- Branch prediction: CPU guessing of branch outcomes.
- Throughput: Operations per second.
- Tail latency: Slowest percentile performance.
Mental Model Diagram (ASCII)
```text
Sequential scan -> prefetch -> high throughput
Random jumps   -> cache misses -> stalls
```
How It Works (Step-by-Step)
- Generate deterministic haystack and needle.
- Warm the cache with a pre-run.
- Run multiple iterations and measure time.
- Report p50 (median) and p99 latency.
Invariants:
- Benchmark inputs are fixed by seed.
- Measurements exclude file I/O.
Failure modes:
- Non-deterministic inputs hide regressions.
- I/O dominates timing and skews results.
Minimal Concrete Example
```c
// Time many iterations with a monotonic clock. Use the result (or a
// volatile sink) so the compiler cannot elide the calls.
uint64_t start = now_ns();
for (int i = 0; i < 1000; i++) find(haystack, needle);
uint64_t end = now_ns();
```
Common Misconceptions
- “More comparisons means slower” -> Cache effects often dominate.
- “Average is enough” -> Tail latency shows worst cases.
- “Disk data is fine” -> I/O hides algorithm cost.
Check-Your-Understanding Questions
- Why does sequential scanning often beat random access algorithms?
- Why should benchmarks use a fixed seed?
- What is the difference between throughput and tail latency?
Check-Your-Understanding Answers
- Sequential scans leverage prefetching and cache locality.
- Fixed seeds make results reproducible and comparable.
- Throughput is average rate; tail latency shows worst-case performance.
Real-World Applications
- Search tools like ripgrep rely on cache-friendly scans.
- Database engines benchmark string comparisons on realistic data.
Where You’ll Apply It
- This project: Section 3.4 Example Output, Section 5.10 Phase 3, Section 6.2 Critical Test Cases.
- Also used in: P01-custom-memory-allocator.md for benchmark design.
References
- “Computer Systems: A Programmer’s Perspective” - caching chapters.
- “Systems Performance” by Brendan Gregg - performance analysis.
Key Insights
Performance is dominated by cache behavior and branches, not just algorithmic complexity.
Summary
Benchmarking and cache awareness are essential for meaningful performance improvements. A deterministic benchmark and cache-friendly algorithms reveal true gains.
Homework/Exercises to Practice the Concept
- Benchmark naive search with random vs sequential haystacks.
- Measure p50 and p99 latency for a fixed input.
- Compare results with and without warm-up.
Solutions to the Homework/Exercises
- Sequential haystacks show higher throughput due to prefetching.
- Tail latency highlights worst-case patterns.
- Warm-up reduces variance and improves repeatability.
3. Project Specification
3.1 What You Will Build
A C library called fastsearch that provides fast substring search over byte buffers, with optional UTF-8 safety. It includes a CLI tool for benchmarking and a CPU feature detection layer that selects between scalar, SSE4.2, and AVX2 implementations. The library exposes a prepare/find API to separate preprocessing from search.
3.2 Functional Requirements
- Search API: `prepare`, `find`, and `find_all` functions.
- Algorithm selection: Choose algorithm based on needle length.
- SIMD acceleration: Use SSE4.2 and AVX2 when available.
- UTF-8 mode: Optional boundary-safe matching.
- Benchmark CLI: Deterministic benchmarks and JSON output.
- CPU feature detection: Runtime dispatch with caching.
3.3 Non-Functional Requirements
- Performance: Outperform `strstr` on large inputs.
- Reliability: SIMD path matches scalar results exactly.
- Usability: Clear API and stable behavior.
3.4 Example Usage / Output
```text
$ ./fastsearch --bench --seed 777 --size 1048576 --needle ERROR
fastsearch 1.0
algo=simd needle=5 size=1048576
throughput=1.2 GB/s p99=0.8 ms
```
3.5 Data Formats / Schemas / Protocols
Benchmark JSON output:
```json
{
  "needle": "ERROR",
  "size": 1048576,
  "seed": 777,
  "algo": "simd",
  "throughput_bytes_per_sec": 1200000000,
  "p99_ms": 0.8
}
```
3.6 Edge Cases
- Empty needle returns match at position 0.
- Needle longer than haystack returns no match.
- UTF-8 mode rejects matches inside continuation bytes.
3.7 Real World Outcome
A library and CLI that outperform standard strstr on large inputs with deterministic benchmarks.
3.7.1 How to Run (Copy/Paste)
```sh
make
./fastsearch --bench --seed 777 --size 1048576 --needle ERROR
```
3.7.2 Golden Path Demo (Deterministic)
- Use seed 777 and fixed buffer size.
- Expect stable throughput across runs.
3.7.3 If CLI: Exact Terminal Transcript
```text
$ ./fastsearch --bench --seed 777 --size 1048576 --needle ERROR
fastsearch 1.0
algo=simd needle=5 size=1048576
throughput=1.2 GB/s p99=0.8 ms
exit_code=0
$ ./fastsearch --bench --size -1
fastsearch: invalid size
usage: fastsearch --bench --size BYTES --needle STR
exit_code=2
```
3.7.4 If Web App
Not applicable.
3.7.5 If API
Not applicable.
3.7.6 If Library
Install/import
```c
#include "fastsearch.h"
```
Minimal usage
```c
fs_ctx_t* ctx = fs_prepare("ERROR", 5, FS_MODE_UTF8);
int pos = fs_find(ctx, haystack, hay_len);
fs_destroy(ctx);
```
Error handling
```c
fs_ctx_t* ctx = fs_prepare(needle, n, FS_MODE_UTF8);
if (!ctx) { fprintf(stderr, "prepare failed\n"); }
```
3.7.7 If GUI / Desktop / Mobile
Not applicable.
3.7.8 If TUI
Not applicable.
4. Solution Architecture
4.1 High-Level Design
```text
+--------------+    +-------------------+    +------------------+
|  API Layer   | -> | Algorithm Planner | -> | SIMD/Scalar Impl |
+--------------+    +-------------------+    +------------------+
```
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Planner | Choose algorithm | Needle length thresholds |
| SIMD impl | Fast scanning | Safe alignment and tail handling |
| UTF-8 filter | Boundary checks | Optional mode |
| Benchmark CLI | Deterministic tests | Fixed seed |
4.3 Data Structures (No Full Code)
```c
typedef struct fs_ctx {
    int algo;
    int utf8_mode;
    uint8_t needle[256];
    size_t needle_len;
} fs_ctx_t;
```
4.4 Algorithm Overview
Key Algorithm: SIMD First-Byte Scan
- Load vector block.
- Compare to first needle byte.
- Extract mask and test candidates.
- Verify candidate with scalar compare.
Complexity Analysis:
- Time: O(n) with vectorized constants.
- Space: O(1) extra per search.
5. Implementation Guide
5.1 Development Environment Setup
```sh
sudo apt-get install build-essential
make
```
5.2 Project Structure
```text
fastsearch/
|-- src/
|   |-- fs_scalar.c
|   |-- fs_simd.c
|   |-- fs_dispatch.c
|   `-- fs_bench.c
|-- include/
|   `-- fastsearch.h
`-- Makefile
```
5.3 The Core Question You’re Answering
“How can a real-world search tool beat naive algorithms by an order of magnitude?”
5.4 Concepts You Must Understand First
- String search algorithms and UTF-8 boundaries
- SIMD and alignment safety
- Cache behavior and benchmarking
5.5 Questions to Guide Your Design
- At what needle-length thresholds should each algorithm be chosen?
- How do you handle the tail safely in SIMD?
- How do you avoid false matches in UTF-8 mode?
5.6 Thinking Exercise
A match occurs at the last 3 bytes of a buffer and SIMD loads 16 bytes.
- How do you avoid reading out of bounds?
- What tail strategy will you use?
5.7 The Interview Questions They’ll Ask
- Why can SIMD beat Boyer-Moore in practice?
- How do you avoid UB in vectorized code?
- How do you detect CPU features at runtime?
5.8 Hints in Layers
Hint 1: Build scalar baseline
Start with naive search and validate correctness.
Hint 2: Add SIMD first-byte scan
Use SIMD to find candidate positions.
Hint 3: Add UTF-8 mode
Validate boundary bytes on matches.
Hint 4: Add CPU dispatch
Select SSE4.2 or AVX2 at runtime.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| SIMD intrinsics | “Modern X86 Assembly Language Programming” | SIMD chapters |
| Algorithms | “Algorithms” by Sedgewick and Wayne | String search |
| C memory | “Computer Systems: A Programmer’s Perspective” | Memory and cache |
5.10 Implementation Phases
Phase 1: Scalar Baseline (1 week)
Goals:
- Correct naive search and Two-Way or Boyer-Moore baseline
Tasks:
- Implement scalar search and tests.
- Add algorithm selection based on needle length.
Checkpoint: Scalar path passes all tests.
Phase 2: SIMD Acceleration (1 week)
Goals:
- Add SIMD scanning and CPU dispatch
Tasks:
- Implement SSE4.2 or AVX2 scan.
- Add CPUID-based dispatch.
Checkpoint: SIMD and scalar results match.
Phase 3: Benchmark and UTF-8 (1 week)
Goals:
- Add UTF-8 mode and deterministic benchmark
Tasks:
- Implement UTF-8 boundary checks.
- Build benchmark CLI with fixed seed.
Checkpoint: Benchmark shows speedup over strstr.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Algorithm | Two-Way vs Boyer-Moore | Two-Way for long needles | Good worst-case |
| SIMD width | SSE4.2 vs AVX2 | AVX2 if available | Wider vectors |
| UTF-8 handling | validate vs raw bytes | optional mode | User choice |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Algorithm correctness | matching, no match |
| Integration Tests | SIMD vs scalar parity | random buffers |
| Edge Case Tests | boundaries | needle at end |
6.2 Critical Test Cases
- SIMD and scalar return identical results for random inputs.
- UTF-8 mode rejects matches starting at continuation bytes.
- Needle longer than haystack returns no match.
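The SIMD-versus-scalar parity requirement is naturally checked with a differential test. In this self-contained sketch a `memchr`-based fast path stands in for the SIMD path and is compared against a naive reference on fixed-seed random buffers; all names are illustrative:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Naive reference implementation: trivially correct. */
static long ref_find(const unsigned char *h, size_t hlen,
                     const unsigned char *n, size_t nlen) {
    if (nlen == 0) return 0;
    for (size_t i = 0; i + nlen <= hlen; i++)
        if (memcmp(h + i, n, nlen) == 0) return (long)i;
    return -1;
}

/* Fast path under test; in the real library this would be the SIMD path. */
static long fast_find(const unsigned char *h, size_t hlen,
                      const unsigned char *n, size_t nlen) {
    if (nlen == 0) return 0;
    if (nlen > hlen) return -1;
    const unsigned char *p = h, *end = h + hlen - nlen + 1;
    while ((p = memchr(p, n[0], (size_t)(end - p))) != NULL) {
        if (memcmp(p, n, nlen) == 0) return (long)(p - h);
        p++;
    }
    return -1;
}

/* One differential check: same fixed-seed random input to both paths.
   Assumes hlen <= 4096 and nlen <= 64. A small alphabet maximizes
   near-matches, which stresses the verification step. */
static void parity_check(unsigned seed, size_t hlen, size_t nlen) {
    unsigned char hay[4096], needle[64];
    srand(seed);
    for (size_t i = 0; i < hlen; i++) hay[i] = (unsigned char)('a' + rand() % 4);
    for (size_t i = 0; i < nlen; i++) needle[i] = (unsigned char)('a' + rand() % 4);
    assert(ref_find(hay, hlen, needle, nlen) ==
           fast_find(hay, hlen, needle, nlen));
}
```

Running this over many seeds, sizes, and needle lengths is what catches boundary bugs (needle at the very end, tiny haystacks) that rarely crash on their own.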
6.3 Test Data
```text
seed: 777
sizes: 64, 1024, 1048576
needles: 2, 5, 16, 64
```
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| OOB loads | Crashes or UB | Tail handling or masked loads |
| Wrong matches | UTF-8 boundary errors | Validate continuation bytes |
| No speedup | Benchmark too small | Use large in-memory buffers |
7.2 Debugging Strategies
- ASAN: catch out-of-bounds loads.
- Binary comparison: compare SIMD and scalar outputs on random data.
7.3 Performance Traps
- Recomputing algorithm selection on every call.
- Using I/O-bound benchmarks.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add `find_all` with vectorized scanning.
- Add a simple CLI that prints match offsets.
8.2 Intermediate Extensions
- Add precompiled needle tables for faster long-needle search.
- Add UTF-16 mode.
8.3 Advanced Extensions
- Implement AVX-512 masked loads.
- Add multi-pattern search (Aho-Corasick style).
9. Real-World Connections
9.1 Industry Applications
- ripgrep: Literal search optimized with SIMD.
- Hyperscan: High-performance regex engine.
9.2 Related Open Source Projects
- stringzilla: Fast SIMD string operations.
- Hyperscan: SIMD-optimized pattern matching.
9.3 Interview Relevance
- Explain why SIMD improves performance for scanning.
- Describe how to avoid UB in vectorized code.
10. Resources
10.1 Essential Reading
- “Modern X86 Assembly Language Programming” - SIMD.
- “Algorithms” by Sedgewick and Wayne - string search.
10.2 Video Resources
- Talks on SIMD optimization and string search.
10.3 Tools & Documentation
- `perf` for profiling.
- `objdump` and `clang -S` for inspecting generated SIMD code.
10.4 Related Projects in This Series
- P01-custom-memory-allocator.md: alignment and UB.
- P02-work-stealing-thread-pool.md: performance benchmarking.
11. Self-Assessment Checklist
11.1 Understanding
- I can explain algorithm selection trade-offs.
- I can explain SIMD alignment requirements.
- I can explain why cache behavior matters.
11.2 Implementation
- SIMD and scalar paths match exactly.
- UTF-8 mode passes test corpus.
- Benchmark shows speedup over `strstr`.
11.3 Growth
- I can describe SIMD design decisions in an interview.
- I can explain how to extend to multiple patterns.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Scalar search with deterministic benchmark.
- Correct handling of empty and missing needles.
Full Completion:
- SIMD path implemented with CPU dispatch.
- UTF-8 mode validated.
Excellence (Going Above & Beyond):
- Multi-pattern search or AVX-512 support.
- Detailed performance report with cache analysis.