Project 12: Line-Oriented Search with SIMD Line Counting

A SIMD-accelerated newline counter and line boundary finder that can identify line starts/ends at gigabytes per second.

Quick Reference

Attribute              Value
Primary Language       C or Rust
Alternative Languages  C++, Zig
Difficulty             Level 3: Advanced
Time Estimate          1 week
Knowledge Area         SIMD / Text Processing
Tooling                Custom Implementation
Prerequisites          Project 7 (SIMD memchr)

What You Will Build

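A newline counter and line boundary finder built on SIMD byte comparisons, able to identify line starts and ends at gigabytes per second. On top of it sits a precomputed line index that maps a byte offset (for example, a search hit) to its line number and line content.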

Why It Matters

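Search tools report matches by line, so they must turn byte offsets into line numbers and line contents without slowing down the scan. Counting newlines and locating line boundaries at gigabytes per second is the building block for that, and the same compare, mask, and popcount pattern recurs throughout SIMD text processing.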

Core Challenges

  • SIMD newline counting → maps to POPCNT on comparison masks
  • Line boundary tracking → maps to maintaining byte offsets
  • Interleaving with search → maps to fusion of operations
  • Handling \r\n vs \n → maps to Windows compatibility
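The first challenge can be sketched as follows. This is a minimal sketch, assuming an x86 target with SSE2 and a GCC- or Clang-compatible compiler (for `__builtin_popcount`); the function name `count_newlines` is hypothetical. On targets without SSE2, the scalar loop handles the whole buffer.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2 intrinsics */
#endif

/* Count '\n' bytes in buf[0..len). Hypothetical helper name. */
static size_t count_newlines(const unsigned char *buf, size_t len)
{
    size_t count = 0;
    size_t i = 0;
#if defined(__SSE2__)
    const __m128i nl = _mm_set1_epi8('\n');
    for (; i + 16 <= len; i += 16) {
        /* Unaligned load of 16 bytes. */
        __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
        /* 0xFF in every lane equal to '\n', 0x00 elsewhere. */
        __m128i eq = _mm_cmpeq_epi8(chunk, nl);
        /* Collapse to one bit per lane: bit k is set if byte k matched. */
        unsigned mask = (unsigned)_mm_movemask_epi8(eq);
        /* The number of set bits is the number of newlines in this chunk. */
        count += (size_t)__builtin_popcount(mask);
    }
#endif
    /* Scalar tail; also the full fallback when SSE2 is unavailable. */
    for (; i < len; i++)
        count += (buf[i] == '\n');
    return count;
}
```

Because every `\r\n` terminator still ends in `\n`, this count is the same for Unix and Windows line endings; the `\r` only matters when extracting line content. The same pattern should extend to 32-byte AVX2 vectors with `_mm256_cmpeq_epi8` and `_mm256_movemask_epi8`. Without a flag such as `-mpopcnt`, the compiler may lower `__builtin_popcount` to a software routine instead of the POPCNT instruction.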

Key Concepts

  • POPCNT on SIMD masks: A vector compare plus movemask yields one bit per matching byte; POPCNT counts the set bits
  • Prefix Sum: A running total of newline counts gives the line number at any byte offset
  • Line Caching: Store line boundaries for random access
  • Streaming Processing: Process file in chunks without full materialization
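The line caching idea above can be illustrated with a dense index of line start offsets. This is a sketch under stated assumptions, not a definitive design: the type `line_index` and the functions `line_index_build`, `line_of_offset`, and `line_length` are hypothetical names, the whole buffer is assumed to be in memory, and `memchr` stands in for the SIMD scanner.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical index: starts[k] is the byte offset where line k+1 begins. */
typedef struct {
    size_t *starts;
    size_t count;
} line_index;

/* Build the index in one pass. Returns 0 on success, -1 on allocation failure.
 * An empty buffer is treated as a single empty line. */
static int line_index_build(line_index *ix, const char *buf, size_t len)
{
    size_t cap = 16;
    ix->starts = malloc(cap * sizeof *ix->starts);
    if (!ix->starts) return -1;
    ix->count = 0;
    ix->starts[ix->count++] = 0; /* line 1 starts at offset 0 */

    const char *p = buf, *end = buf + len;
    while (p < end && (p = memchr(p, '\n', (size_t)(end - p))) != NULL) {
        p++;                 /* the next line begins right after the newline */
        if (p == end) break; /* a trailing newline does not open a new line */
        if (ix->count == cap) {
            size_t *grown = realloc(ix->starts, (cap *= 2) * sizeof *grown);
            if (!grown) { free(ix->starts); ix->starts = NULL; return -1; }
            ix->starts = grown;
        }
        ix->starts[ix->count++] = (size_t)(p - buf);
    }
    return 0;
}

/* 1-based number of the line containing byte offset off (binary search). */
static size_t line_of_offset(const line_index *ix, size_t off)
{
    size_t lo = 0, hi = ix->count; /* invariant: starts[lo] <= off */
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (ix->starts[mid] <= off) lo = mid; else hi = mid;
    }
    return lo + 1;
}

/* Length of 1-based line n, excluding its "\n" or "\r\n" terminator. */
static size_t line_length(const line_index *ix, const char *buf, size_t len, size_t n)
{
    size_t start = ix->starts[n - 1];
    size_t end = (n < ix->count) ? ix->starts[n] : len;
    if (end > start && buf[end - 1] == '\n') {
        end--;
        if (end > start && buf[end - 1] == '\r') end--; /* Windows line ending */
    }
    return end - start;
}
```

A dense index costs one `size_t` per line, so a file with about 124 million lines would need roughly 1 GB on a 64-bit system. A sparser variant that stores only a running newline count per fixed-size block (the prefix sum idea) and scans within the block on lookup trades memory for a short extra scan.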

Real-World Outcome

$ ./line_counter huge_log.txt
File size: 5.2 GB
Lines: 124,567,890

Benchmark:
  Naive (byte loop):  12.5 seconds
  wc -l:               3.2 seconds
  SIMD line count:     0.45 seconds (11.6 GB/s)

With line boundary tracking:
  Line 50000000 starts at byte offset 2,345,678,901
  Lookup time: 0.001ms (from precomputed index)

Integration with search:
  Pattern found at byte 123456789
  → Line number 567890 (computed in 0.01ms)
  → Line content: "Error: connection timeout..."
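A run like the one above does not need the whole file in memory. The following sketch shows one way the file could be counted in fixed-size chunks; `count_lines_stream` is a hypothetical name, the 64 KiB chunk size is an arbitrary choice, and `memchr` again stands in for the SIMD kernel.

```c
#include <stdio.h>
#include <string.h>

/* Count '\n' bytes in a stream using a fixed-size buffer.
 * Returns 0 on success, -1 on a read error. Hypothetical helper name. */
static int count_lines_stream(FILE *fp,
                              unsigned long long *out_lines,
                              unsigned long long *out_bytes)
{
    unsigned char buf[1 << 16]; /* 64 KiB chunk, reused for every read */
    unsigned long long lines = 0, bytes = 0;
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
        const unsigned char *p = buf, *end = buf + n;
        /* Hop from newline to newline within this chunk. */
        while ((p = memchr(p, '\n', (size_t)(end - p))) != NULL) {
            lines++;
            p++;
        }
        bytes += n;
    }
    if (ferror(fp)) return -1;

    *out_lines = lines;
    *out_bytes = bytes;
    return 0;
}
```

Counting a single byte value is safe across chunk boundaries because no match can straddle two chunks. Detecting the two-byte `\r\n` sequence is different: a `\r` at the end of one chunk and a `\n` at the start of the next would require carrying one byte of state between reads.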

Implementation Guide

  1. Reproduce the simplest happy-path scenario.
  2. Build the smallest working version of the core feature.
  3. Add input validation and error handling.
  4. Add instrumentation/logging to confirm behavior.
  5. Refactor into clean modules with tests.

Milestones

  • Milestone 1: Minimal working program that runs end-to-end.
  • Milestone 2: Correct outputs for typical inputs.
  • Milestone 3: Robust handling of edge cases.
  • Milestone 4: Clean structure and documented usage.

Validation Checklist

  • Line counts agree with wc -l on the same input, and output follows the format of the real-world outcome example (timings will vary by machine)
  • Handles invalid inputs safely
  • Provides clear errors and exit codes
  • Repeatable results across runs

References

  • Main guide: TEXT_SEARCH_TOOLS_DEEP_DIVE.md
  • “Intel Intrinsics Guide”