Project 12: Line-Oriented Search with SIMD Line Counting
A SIMD-accelerated newline counter and line boundary finder that can identify line starts/ends at gigabytes per second.
Quick Reference
| Attribute | Value |
|---|---|
| Primary Language | C or Rust |
| Alternative Languages | C++, Zig |
| Difficulty | Level 3: Advanced |
| Time Estimate | 1 week |
| Knowledge Area | SIMD / Text Processing |
| Tooling | Custom Implementation |
| Prerequisites | Project 7 (SIMD memchr) |
What You Will Build
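A SIMD-accelerated newline counter and line boundary finder that identifies line starts and ends at gigabytes per second, builds an index of line offsets for fast lookup, and maps a match's byte offset back to its line number and line content.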
Why It Matters
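Line-oriented search tools report matches by line, so they have to count newlines and locate line boundaries across the whole input. On multi-gigabyte files that bookkeeping can dominate run time unless it is vectorized. The compare, mask, and popcount pattern practiced here also applies to other byte-scanning tasks, such as counting delimiters.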
Core Challenges
- SIMD newline counting → maps to POPCNT on comparison masks
- Line boundary tracking → maps to maintaining byte offsets
- Interleaving with search → maps to fusion of operations
- Handling \r\n vs \n → maps to Windows compatibility
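The first challenge can be sketched as follows: compare 16 bytes at a time against `\n`, collapse the comparison result into a bitmask, and popcount the mask. This is a minimal sketch, not the main guide's implementation. The names `count_newlines` and `count_newlines_scalar` are illustrative, and SSE2 is an assumption chosen because it is available on every x86_64 CPU; other targets fall back to the scalar loop.

```rust
/// Scalar reference: count `\n` bytes one at a time.
pub fn count_newlines_scalar(data: &[u8]) -> usize {
    data.iter().filter(|&&b| b == b'\n').count()
}

/// SSE2 sketch: compare 16 bytes at once, extract a 16-bit mask with one
/// bit per matching byte, and popcount that mask.
#[cfg(target_arch = "x86_64")]
pub fn count_newlines(data: &[u8]) -> usize {
    use std::arch::x86_64::{
        __m128i, _mm_cmpeq_epi8, _mm_loadu_si128, _mm_movemask_epi8, _mm_set1_epi8,
    };

    let mut chunks = data.chunks_exact(16);
    let mut total = 0usize;

    // SAFETY: SSE2 is part of the x86_64 baseline, every chunk is exactly
    // 16 readable bytes, and `_mm_loadu_si128` has no alignment requirement.
    unsafe {
        let needle = _mm_set1_epi8(b'\n' as i8);
        for chunk in &mut chunks {
            let v = _mm_loadu_si128(chunk.as_ptr() as *const __m128i);
            let eq = _mm_cmpeq_epi8(v, needle); // 0xFF where byte == '\n'
            let mask = _mm_movemask_epi8(eq) as u32; // one bit per byte
            total += mask.count_ones() as usize; // POPCNT on the mask
        }
    }

    // The last `len % 16` bytes do not fill a vector; count them with the
    // scalar loop so no tail bytes are dropped.
    total + count_newlines_scalar(chunks.remainder())
}

/// Portable fallback for targets without the x86_64 intrinsics.
#[cfg(not(target_arch = "x86_64"))]
pub fn count_newlines(data: &[u8]) -> usize {
    count_newlines_scalar(data)
}
```

Counting only `\n` also gives the right line count for `\r\n` files, because each Windows line ending contains exactly one `\n`; the `\r` only matters when line content is extracted. Wider vectors (for example AVX2 with 32-byte loads) follow the same shape with a 32-bit mask.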
Key Concepts
- SIMD POPCNT: Count the set bits in the bitmask produced by a vector byte comparison
- Prefix Sum: Compute line numbers from cumulative newline counts
- Line Caching: Store line boundaries for random access
- Streaming Processing: Process the file in chunks without loading it fully into memory
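One way to combine the prefix-sum and caching ideas is a sparse index: store the cumulative newline count at every block boundary, then answer a line-number query with one table lookup plus a rescan of at most one block. The sketch below is hypothetical: `BlockIndex`, `build`, `total_newlines`, and `line_number` are names invented for illustration, and the scalar `count_newlines` is a stand-in for whatever SIMD counter the project ends up with.

```rust
/// Stand-in newline counter; a SIMD version would slot in here.
fn count_newlines(data: &[u8]) -> u64 {
    data.iter().filter(|&&b| b == b'\n').count() as u64
}

/// Sparse line index: cumulative newline counts at fixed block boundaries.
pub struct BlockIndex {
    block_size: usize,
    /// prefix[k] = number of `\n` bytes in data[..k * block_size]
    /// (the final entry covers the whole input).
    prefix: Vec<u64>,
}

impl BlockIndex {
    /// Build the prefix sums one block at a time. Only the current block
    /// is inspected at each step, so the same loop could be fed from a
    /// chunked file reader instead of a slice.
    pub fn build(data: &[u8], block_size: usize) -> Self {
        assert!(block_size > 0, "block_size must be nonzero");
        let mut prefix = Vec::with_capacity(data.len() / block_size + 2);
        prefix.push(0u64);
        let mut running = 0u64;
        for block in data.chunks(block_size) {
            running += count_newlines(block);
            prefix.push(running);
        }
        BlockIndex { block_size, prefix }
    }

    /// Total number of `\n` bytes in the indexed input.
    pub fn total_newlines(&self) -> u64 {
        // `prefix` always holds at least the initial 0.
        *self.prefix.last().unwrap()
    }

    /// 1-based line number of the byte at `offset` (offset <= data.len()).
    /// A newline byte is counted as part of the line it terminates.
    pub fn line_number(&self, data: &[u8], offset: usize) -> u64 {
        assert!(offset <= data.len(), "offset out of range");
        let k = offset / self.block_size;
        let block_start = k * self.block_size;
        // Newlines before the block come from the table; newlines inside
        // the block up to `offset` are recounted on demand.
        self.prefix[k] + count_newlines(&data[block_start..offset]) + 1
    }
}
```

The block size trades memory for lookup cost: larger blocks mean fewer table entries but a longer rescan per query. In a streaming setup, answering a query would need that one block re-read from the file.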
Real-World Outcome
$ ./line_counter huge_log.txt
File size: 5.2 GB
Lines: 124,567,890
Benchmark:
Naive (byte loop): 12.5 seconds
wc -l: 3.2 seconds
SIMD line count: 0.45 seconds (11.6 GB/s)
With line boundary tracking:
Line 50000000 starts at byte offset 2,345,678,901
Lookup time: 0.001ms (from precomputed index)
Integration with search:
Pattern found at byte 123456789
→ Line number 567890 (computed in 0.01ms)
→ Line content: "Error: connection timeout..."
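The lookups shown in the sample output (line start offset, match offset to line number, line content) can be served from a precomputed table of line starts. The following is a sketch under assumptions: `LineIndex` and its methods are hypothetical names, the index is built with a plain byte loop rather than SIMD to keep the example short, and the same buffer used to build the index must be passed to `line_content`.

```rust
/// Dense line index: the byte offset at which every line starts.
pub struct LineIndex {
    /// starts[i] is the offset of the first byte of line i + 1.
    /// If the input ends with `\n`, the last entry equals the input length
    /// and denotes an empty final line.
    starts: Vec<usize>,
}

impl LineIndex {
    /// Record the offset following every `\n` byte.
    pub fn new(data: &[u8]) -> Self {
        let mut starts = vec![0usize];
        for (i, &b) in data.iter().enumerate() {
            if b == b'\n' {
                starts.push(i + 1);
            }
        }
        LineIndex { starts }
    }

    /// Byte offset where the given 1-based line starts.
    pub fn line_start(&self, line: usize) -> Option<usize> {
        self.starts.get(line.checked_sub(1)?).copied()
    }

    /// 1-based number of the line containing `offset`, found by binary
    /// search: it equals the number of line starts that are <= offset.
    pub fn line_of_offset(&self, offset: usize) -> usize {
        self.starts.partition_point(|&s| s <= offset)
    }

    /// Content of the given 1-based line without its terminator.
    /// A trailing `\r` is stripped so `\r\n` files yield clean lines.
    pub fn line_content<'a>(&self, data: &'a [u8], line: usize) -> Option<&'a [u8]> {
        let start = *self.starts.get(line.checked_sub(1)?)?;
        // The line ends just before the `\n` that precedes the next start,
        // or at the end of the input for the last line.
        let end = self
            .starts
            .get(line)
            .map(|&next| next - 1)
            .unwrap_or(data.len());
        let mut content = &data[start..end];
        if content.last() == Some(&b'\r') {
            content = &content[..content.len() - 1];
        }
        Some(content)
    }
}
```

A search routine that reports a match at byte `m` would call `line_of_offset(m)` and then `line_content` with the result. The dense table costs one `usize` per line, which is why a sparser index may be preferable for very large files.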
Implementation Guide
- Reproduce the simplest happy-path scenario.
- Build the smallest working version of the core feature.
- Add input validation and error handling.
- Add instrumentation/logging to confirm behavior.
- Refactor into clean modules with tests.
Milestones
- Milestone 1: Minimal working program that runs end-to-end.
- Milestone 2: Correct outputs for typical inputs.
- Milestone 3: Robust handling of edge cases.
- Milestone 4: Clean structure and documented usage.
Validation Checklist
- Output matches the real-world outcome example
- Handles invalid inputs safely
- Provides clear errors and exit codes
- Repeatable results across runs
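One possible way to cover the correctness and repeatability items is a differential check: run any candidate counter against a naive byte loop on every sub-slice of a deterministically generated buffer. The helper below is an illustrative sketch; `check_counter`, `naive_count`, and the `XorShift64` generator are hypothetical names, and the buffer size and byte mix are arbitrary choices.

```rust
/// Tiny deterministic PRNG (xorshift64) so a failing case reproduces
/// exactly on every run with the same seed.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Trusted reference: one byte at a time.
pub fn naive_count(data: &[u8]) -> usize {
    data.iter().filter(|&&b| b == b'\n').count()
}

/// Compare `candidate` with `naive_count` on many slices of one buffer.
/// Varying both the start and the end exercises every length modulo the
/// vector width and every alignment, which is where tail and head bugs
/// in SIMD loops tend to hide.
pub fn check_counter<F: Fn(&[u8]) -> usize>(candidate: F, seed: u64) -> Result<(), String> {
    let mut rng = XorShift64(seed | 1); // xorshift state must be nonzero
    let mut buf = vec![0u8; 256];
    for b in buf.iter_mut() {
        *b = match rng.next_u64() % 8 {
            0 => b'\n',
            1 => b'\r',
            _ => b'x',
        };
    }
    // Plant newlines at both edges so boundary positions are always tested.
    buf[0] = b'\n';
    buf[255] = b'\n';

    for start in 0..32 {
        for end in start..=buf.len() {
            let slice = &buf[start..end];
            let expected = naive_count(slice);
            let got = candidate(slice);
            if got != expected {
                return Err(format!(
                    "mismatch for buf[{}..{}]: expected {}, got {}",
                    start, end, expected, got
                ));
            }
        }
    }
    Ok(())
}
```

Passing the project's SIMD counter as `candidate` should return `Ok(())`; a counter that forgets the bytes left over after the last full vector is reported with the exact slice bounds that exposed it.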
References
- Main guide: TEXT_SEARCH_TOOLS_DEEP_DIVE.md - “Intel Intrinsics Guide”