Project 12: Line-Oriented Search with SIMD Line Counting

Build a line-aware search pipeline that counts newlines with SIMD and maps byte offsets to line/column quickly.

Quick Reference

Attribute	Value
Difficulty	Level 3: Advanced
Time Estimate	1 week
Main Programming Language	C (Alternatives: Rust)
Alternative Programming Languages	Rust
Coolness Level	Level 3: Clever
Business Potential	Level 2: Tooling
Prerequisites	P07 SIMD memchr, basic indexing
Key Topics	newline counting, line index, SIMD reduction

1. Learning Objectives

By completing this project, you will:

Implement SIMD newline counting for large buffers.
Build a line index to map byte offsets to line/column quickly.
Integrate line counting with a search pipeline.
Measure speedup over scalar line counting.
Handle edge cases like missing final newline.

2. All Theory Needed (Per-Concept Breakdown)

2.1 SIMD Line Counting and Line Indexing

Fundamentals

Line-oriented search tools report matches by line. To do this efficiently, you must map a byte offset to line and column. A simple approach is to scan from the start and count newlines, but this is too slow for large files. Instead, you can precompute a line index by scanning the file once and recording newline offsets. SIMD can accelerate this scan by counting newlines in blocks.

The idea is similar to SIMD memchr: broadcast the newline byte ‘\n’, compare with a vector of bytes, extract a mask, and count set bits. This yields the number of newlines in each block. You can accumulate counts and optionally store newline positions to build a line index.

Deep Dive into the concept

There are two common strategies: (1) build a list of newline offsets and then use binary search to map byte offsets to line numbers; (2) build a block index (e.g., every 4KB store newline count) and compute line numbers by scanning only a small region. The first is simpler, the second is more memory efficient for huge files.

SIMD counting uses movemask and popcount. For each block, you compare with ‘\n’, obtain a mask, then popcount to get number of newlines. For exact line positions, you can iterate through mask bits and record offsets. This is still fast because most blocks have few newlines relative to bytes.

Line indexing is performance critical in grep-style tools because formatting line numbers and context lines can dominate total time when matches are frequent. Efficient line counting is a major part of why ripgrep is fast.

How this fits on projects

This line indexing module is used in P15 and improves output formatting performance for P01-P08 when reporting lines.

Definitions & key terms

line index -> mapping from line number to byte offset
newline count -> number of ‘\n’ bytes encountered
popcount -> count of set bits in a mask
block index -> per-block summary of newlines

Mental model diagram (ASCII)

buffer: "a\nfoo\nbar\n"
mask:    010010010
popcount -> 3 newlines
line index -> [1, 5, 9]

How it works (step-by-step)

Broadcast ‘\n’ into SIMD register.
Compare with block, produce mask.
popcount mask to count newlines.
Optionally record positions for each set bit.
Build a line index for fast mapping.

Minimal concrete example

__m128i nl = _mm_set1_epi8('\n');
__m128i blk = _mm_loadu_si128(ptr);
int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(blk, nl));
int count = __builtin_popcount(mask);

Common misconceptions

Misconception: line numbers can be computed on the fly cheaply. Correction: repeated scans are expensive on large files.
Misconception: SIMD is only for search, not counting. Correction: SIMD speeds up any byte classification task.

Check-your-understanding questions

Why does line indexing improve performance?
What does popcount do in this context?
How do you map an offset to line number with a newline offset list?

Check-your-understanding answers

It avoids rescanning the file for each match.
It counts newline bytes in a SIMD block.
Binary search the newline offset list.

Real-world applications

grep/rg output formatting
log processing tools

Where you’ll apply it

This project: Section 4.4 Algorithm Overview and Section 6.2 Tests.
Also used in: P15-build-your-own-ripgrep.

References

ripgrep output formatting internals
SIMD line counting articles

Key insights

Fast searching is wasted if line mapping is slow; SIMD line counting fixes that bottleneck.

Summary

SIMD line counting makes line-oriented output fast and scalable on large files.

Homework/Exercises to practice the concept

Build a newline offset list for a small buffer by hand.
Implement SIMD line count and compare to scalar count.

Solutions to the homework/exercises

Record each ‘\n’ offset in order.
Use movemask + popcount for SIMD blocks.

3. Project Specification

3.1 What You Will Build

A library + CLI line-index that builds a SIMD-accelerated line index for a file and exposes functions to map byte offsets to line/column. It integrates with a simple search tool to display line numbers.

3.2 Functional Requirements

SIMD newline count: fast scanning of file buffer.
Line index: store newline offsets or block counts.
Mapping API: given byte offset, return line and column.
Integration: use in a search tool to print matches.

3.3 Non-Functional Requirements

Performance: 5x speedup over scalar on large files.
Reliability: correct line/col for all offsets.
Usability: clear output and API.

3.4 Example Usage / Output

$ ./line-index fixtures/large.txt --offset 1024
line=12 col=5

3.5 Data Formats / Schemas / Protocols

JSON output example:

{"offset":1024,"line":12,"col":5}

3.6 Edge Cases

File without trailing newline
Very long lines
Offset at end of file

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

make
./line-index fixtures/large.txt --offset 1024

3.7.2 Golden Path Demo (Deterministic)

Use fixed fixture with known newline positions.

3.7.3 CLI Transcript (Success + Failure)

$ ./line-index fixtures/sample.txt --offset 4
line=2 col=1
exit code: 0

$ ./line-index fixtures/sample.txt --offset 99999
error: offset out of range
exit code: 2

4. Solution Architecture

4.1 High-Level Design

+-----------+   +-----------+   +------------+   +-----------+
| Loader    |-->| SIMD Scan |-->| Line Index |-->| Mapper    |
+-----------+   +-----------+   +------------+   +-----------+

4.2 Key Components

4.3 Data Structures (No Full Code)

struct LineIndex { size_t* nl_offsets; size_t count; };

4.4 Algorithm Overview

SIMD scan buffer for ‘\n’.
Record offsets in vector.
Map offsets via binary search.

Complexity Analysis

Build: O(n/vec_width)
Query: O(log L) where L is number of lines

5. Implementation Guide

5.1 Development Environment Setup

cc -O3 -march=native -o line-index src/main.c

5.2 Project Structure

line-index/
├── src/
│   ├── main.c
│   ├── scan.c
│   └── index.c
└── fixtures/

5.3 The Core Question You’re Answering

“How do I map byte offsets to line/column quickly without rescanning the file?”

5.4 Concepts You Must Understand First

SIMD newline counting.
Binary search on newline offsets.
Line and column calculations.

5.5 Questions to Guide Your Design

How will you store newline offsets for huge files?
How will you handle files without trailing newline?
How will you compute column from previous newline?

5.6 Thinking Exercise

If offset is exactly on a ‘\n’, what line/col should you report?

5.7 The Interview Questions They’ll Ask

Why is line indexing a performance bottleneck?
How does SIMD improve newline counting?

5.8 Hints in Layers

Hint 1: Start with scalar scan + offset list. Hint 2: Add SIMD counting to accelerate scanning. Hint 3: Add block-level index for huge files.

5.9 Books That Will Help

5.10 Implementation Phases

Phase 1: Scalar Index (2 days)

Build newline offset list.
Checkpoint: correct line/col mapping.

Phase 2: SIMD Scan (2 days)

Replace scalar scan with SIMD.
Checkpoint: identical results.

Phase 3: Integration (2 days)

Integrate with search tool output.
Checkpoint: match output equals baseline.

5.11 Key Implementation Decisions

6. Testing Strategy

6.1 Test Categories

6.2 Critical Test Cases

Offset 0 -> line 1, col 1.
Offset at newline -> previous line end.
Offset at EOF -> last line.

6.3 Test Data

text: "a\nfoo\nbar"
offset 4 -> line 2 col 2

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

7.2 Debugging Strategies

Compare line index built by SIMD and scalar.
Print newline offsets for small fixtures.

8. Extensions & Challenges

8.1 Beginner Extensions

Add --range to output line range for offsets.
Add --context to output surrounding lines.

8.2 Intermediate Extensions

Block-level index to reduce memory.
Memory-mapped line index for huge files.

8.3 Advanced Extensions

SIMD prefix-sum for newline counts.
Parallel line index construction.

9. Real-World Connections

9.1 Industry Applications

grep/rg line number formatting
log processing and highlighting

ripgrep line counting code
SIMD newline counting libraries

9.3 Interview Relevance

Explaining performance bottlenecks in output formatting

10. Resources

10.1 Essential Reading

Intel Intrinsics Guide
ripgrep output formatting notes

10.2 Tools & Documentation

perf for profiling
hyperfine for benchmarks

Previous: P11-mmap-vs-read-strategy-selector
Next: P13-fuzzy-string-matcher-fzf-style

11. Self-Assessment Checklist

11.1 Understanding

I can explain SIMD newline counting.
I can map offsets to line/col correctly.

11.2 Implementation

SIMD index matches scalar baseline.
Output formatting uses line index.

11.3 Growth

I can describe memory trade-offs for line indexing.

12. Submission / Completion Criteria

Minimum Viable Completion:

SIMD line count and offset mapping

Full Completion:

Integrated with search output

Excellence (Going Above & Beyond):

Block-level index and parallel build