Project 12: Line-Oriented Search with SIMD Line Counting
Build a line-aware search pipeline that counts newlines with SIMD and maps byte offsets to line/column quickly.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3: Advanced |
| Time Estimate | 1 week |
| Main Programming Language | C (Alternatives: Rust) |
| Alternative Programming Languages | Rust |
| Coolness Level | Level 3: Clever |
| Business Potential | Level 2: Tooling |
| Prerequisites | P07 SIMD memchr, basic indexing |
| Key Topics | newline counting, line index, SIMD reduction |
1. Learning Objectives
By completing this project, you will:
- Implement SIMD newline counting for large buffers.
- Build a line index to map byte offsets to line/column quickly.
- Integrate line counting with a search pipeline.
- Measure speedup over scalar line counting.
- Handle edge cases like missing final newline.
2. All Theory Needed (Per-Concept Breakdown)
2.1 SIMD Line Counting and Line Indexing
Fundamentals
Line-oriented search tools report matches by line. To do this efficiently, you must map a byte offset to line and column. A simple approach is to scan from the start and count newlines, but this is too slow for large files. Instead, you can precompute a line index by scanning the file once and recording newline offsets. SIMD can accelerate this scan by counting newlines in blocks.
The idea is similar to SIMD memchr: broadcast the newline byte ‘\n’, compare with a vector of bytes, extract a mask, and count set bits. This yields the number of newlines in each block. You can accumulate counts and optionally store newline positions to build a line index.
Deep Dive into the concept
There are two common strategies: (1) build a list of newline offsets and then use binary search to map byte offsets to line numbers; (2) build a block index (e.g., every 4KB store newline count) and compute line numbers by scanning only a small region. The first is simpler, the second is more memory efficient for huge files.
SIMD counting uses movemask and popcount. For each block, you compare with ‘\n’, obtain a mask, then popcount to get number of newlines. For exact line positions, you can iterate through mask bits and record offsets. This is still fast because most blocks have few newlines relative to bytes.
Line indexing is performance critical in grep-style tools because formatting line numbers and context lines can dominate total time when matches are frequent. Efficient line counting is a major part of why ripgrep is fast.
How this fits on projects
This line indexing module is used in P15 and improves output formatting performance for P01-P08 when reporting lines.
Definitions & key terms
- line index -> mapping from line number to byte offset
- newline count -> number of ‘\n’ bytes encountered
- popcount -> count of set bits in a mask
- block index -> per-block summary of newlines
Mental model diagram (ASCII)
buffer: "a\nfoo\nbar\n"
mask: 010010010
popcount -> 3 newlines
line index -> [1, 5, 9]
How it works (step-by-step)
- Broadcast ‘\n’ into SIMD register.
- Compare with block, produce mask.
- popcount mask to count newlines.
- Optionally record positions for each set bit.
- Build a line index for fast mapping.
Minimal concrete example
__m128i nl = _mm_set1_epi8('\n');
__m128i blk = _mm_loadu_si128(ptr);
int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(blk, nl));
int count = __builtin_popcount(mask);
Common misconceptions
- Misconception: line numbers can be computed on the fly cheaply. Correction: repeated scans are expensive on large files.
- Misconception: SIMD is only for search, not counting. Correction: SIMD speeds up any byte classification task.
Check-your-understanding questions
- Why does line indexing improve performance?
- What does popcount do in this context?
- How do you map an offset to line number with a newline offset list?
Check-your-understanding answers
- It avoids rescanning the file for each match.
- It counts newline bytes in a SIMD block.
- Binary search the newline offset list.
Real-world applications
- grep/rg output formatting
- log processing tools
Where you’ll apply it
- This project: Section 4.4 Algorithm Overview and Section 6.2 Tests.
- Also used in: P15-build-your-own-ripgrep.
References
- ripgrep output formatting internals
- SIMD line counting articles
Key insights
Fast searching is wasted if line mapping is slow; SIMD line counting fixes that bottleneck.
Summary
SIMD line counting makes line-oriented output fast and scalable on large files.
Homework/Exercises to practice the concept
- Build a newline offset list for a small buffer by hand.
- Implement SIMD line count and compare to scalar count.
Solutions to the homework/exercises
- Record each ‘\n’ offset in order.
- Use movemask + popcount for SIMD blocks.
3. Project Specification
3.1 What You Will Build
A library + CLI line-index that builds a SIMD-accelerated line index for a file and exposes functions to map byte offsets to line/column. It integrates with a simple search tool to display line numbers.
3.2 Functional Requirements
- SIMD newline count: fast scanning of file buffer.
- Line index: store newline offsets or block counts.
- Mapping API: given byte offset, return line and column.
- Integration: use in a search tool to print matches.
3.3 Non-Functional Requirements
- Performance: 5x speedup over scalar on large files.
- Reliability: correct line/col for all offsets.
- Usability: clear output and API.
3.4 Example Usage / Output
$ ./line-index fixtures/large.txt --offset 1024
line=12 col=5
3.5 Data Formats / Schemas / Protocols
JSON output example:
{"offset":1024,"line":12,"col":5}
3.6 Edge Cases
- File without trailing newline
- Very long lines
- Offset at end of file
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
make
./line-index fixtures/large.txt --offset 1024
3.7.2 Golden Path Demo (Deterministic)
Use fixed fixture with known newline positions.
3.7.3 CLI Transcript (Success + Failure)
$ ./line-index fixtures/sample.txt --offset 4
line=2 col=1
exit code: 0
$ ./line-index fixtures/sample.txt --offset 99999
error: offset out of range
exit code: 2
4. Solution Architecture
4.1 High-Level Design
+-----------+ +-----------+ +------------+ +-----------+
| Loader |-->| SIMD Scan |-->| Line Index |-->| Mapper |
+-----------+ +-----------+ +------------+ +-----------+
4.2 Key Components
| Component | Responsibility | Key Decisions | |———–|—————-|—————| | Scanner | count newlines | SIMD movemask | | Index | store offsets | vector of offsets | | Mapper | offset -> line/col | binary search |
4.3 Data Structures (No Full Code)
struct LineIndex { size_t* nl_offsets; size_t count; };
4.4 Algorithm Overview
- SIMD scan buffer for ‘\n’.
- Record offsets in vector.
- Map offsets via binary search.
Complexity Analysis
- Build: O(n/vec_width)
- Query: O(log L) where L is number of lines
5. Implementation Guide
5.1 Development Environment Setup
cc -O3 -march=native -o line-index src/main.c
5.2 Project Structure
line-index/
├── src/
│ ├── main.c
│ ├── scan.c
│ └── index.c
└── fixtures/
5.3 The Core Question You’re Answering
“How do I map byte offsets to line/column quickly without rescanning the file?”
5.4 Concepts You Must Understand First
- SIMD newline counting.
- Binary search on newline offsets.
- Line and column calculations.
5.5 Questions to Guide Your Design
- How will you store newline offsets for huge files?
- How will you handle files without trailing newline?
- How will you compute column from previous newline?
5.6 Thinking Exercise
If offset is exactly on a ‘\n’, what line/col should you report?
5.7 The Interview Questions They’ll Ask
- Why is line indexing a performance bottleneck?
- How does SIMD improve newline counting?
5.8 Hints in Layers
Hint 1: Start with scalar scan + offset list. Hint 2: Add SIMD counting to accelerate scanning. Hint 3: Add block-level index for huge files.
5.9 Books That Will Help
| Topic | Book | Chapter | |——-|——|———| | SIMD basics | Intel Intrinsics Guide | SSE2 | | Data structures | “Algorithms” (Sedgewick) | binary search |
5.10 Implementation Phases
Phase 1: Scalar Index (2 days)
- Build newline offset list.
- Checkpoint: correct line/col mapping.
Phase 2: SIMD Scan (2 days)
- Replace scalar scan with SIMD.
- Checkpoint: identical results.
Phase 3: Integration (2 days)
- Integrate with search tool output.
- Checkpoint: match output equals baseline.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale | |———-|———|—————-|———–| | Index type | full offsets vs block counts | full offsets (v1) | simpler | | Offset mapping | binary search | yes | O(log L) |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples | |———-|———|———-| | Unit | newline scan | small buffers | | Integration | line mapping | fixtures | | Edge | offsets at boundaries | start/end of file |
6.2 Critical Test Cases
- Offset 0 -> line 1, col 1.
- Offset at newline -> previous line end.
- Offset at EOF -> last line.
6.3 Test Data
text: "a\nfoo\nbar"
offset 4 -> line 2 col 2
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution | |———|———|———-| | Off-by-one line numbers | wrong output | define exact semantics | | Missing last line | missing output | handle no trailing newline | | SIMD mask errors | wrong counts | test against scalar |
7.2 Debugging Strategies
- Compare line index built by SIMD and scalar.
- Print newline offsets for small fixtures.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add
--rangeto output line range for offsets. - Add
--contextto output surrounding lines.
8.2 Intermediate Extensions
- Block-level index to reduce memory.
- Memory-mapped line index for huge files.
8.3 Advanced Extensions
- SIMD prefix-sum for newline counts.
- Parallel line index construction.
9. Real-World Connections
9.1 Industry Applications
- grep/rg line number formatting
- log processing and highlighting
9.2 Related Open Source Projects
- ripgrep line counting code
- SIMD newline counting libraries
9.3 Interview Relevance
- Explaining performance bottlenecks in output formatting
10. Resources
10.1 Essential Reading
- Intel Intrinsics Guide
- ripgrep output formatting notes
10.2 Tools & Documentation
perffor profilinghyperfinefor benchmarks
10.3 Related Projects in This Series
- Previous: P11-mmap-vs-read-strategy-selector
- Next: P13-fuzzy-string-matcher-fzf-style
11. Self-Assessment Checklist
11.1 Understanding
- I can explain SIMD newline counting.
- I can map offsets to line/col correctly.
11.2 Implementation
- SIMD index matches scalar baseline.
- Output formatting uses line index.
11.3 Growth
- I can describe memory trade-offs for line indexing.
12. Submission / Completion Criteria
Minimum Viable Completion:
- SIMD line count and offset mapping
Full Completion:
- Integrated with search output
Excellence (Going Above & Beyond):
- Block-level index and parallel build