Project 11: mmap vs read Strategy Selector
Build an adaptive I/O layer that chooses between mmap and buffered reads based on file size and access patterns.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3: Advanced |
| Time Estimate | 1 week |
| Main Programming Language | C |
| Alternative Programming Languages | Rust |
| Coolness Level | Level 3: Clever |
| Business Potential | Level 2: Systems tooling |
| Prerequisites | file I/O, mmap basics |
| Key Topics | mmap, read buffers, page faults, adaptive I/O |
1. Learning Objectives
By completing this project, you will:
- Implement both mmap and buffered read I/O paths.
- Measure and compare performance on small vs large files.
- Build heuristics to choose I/O strategy automatically.
- Handle errors and edge cases for memory mapping.
- Explain page faults and OS caching behavior.
2. All Theory Needed (Per-Concept Breakdown)
2.1 mmap vs read I/O Strategies
Fundamentals
read() with a buffer is the traditional way to load file data. You allocate a buffer, call read repeatedly, and process bytes as they arrive. mmap() maps a file into your address space and lets you access it like a byte array. The OS loads pages on demand, which can be faster for random access but may incur page faults.
The right strategy depends on file size, access pattern, and OS behavior. For many small files, read buffers are often faster due to lower setup overhead. For very large files or random access patterns, mmap can be more efficient because it avoids extra copies and leverages page caching.
Deep Dive into the concept
With mmap, the file is mapped into virtual memory. Touching a page that is not resident triggers a page fault, and the OS loads it from disk. This can be very fast on SSDs, but for purely sequential scans the kernel read-ahead that read() already benefits from can be just as good or better, so mmap gains little.
Buffered read uses system calls repeatedly and copies data into user-space buffers. This is often faster for streaming workloads because you can control buffer size and prefetch. It also avoids issues with sparse files and memory mapping failures on huge files. mmap can also be tricky on 32-bit systems due to address space limits.
Modern tools such as ripgrep use heuristics: if the file is large and regular, mmap may be used; if the file is small, or you are reading many small files, buffered reads tend to win. The best choice depends on real-world benchmarks and hardware.
How this fits in the project series
This adaptive I/O layer feeds your search engine in P15 and affects line counting performance in P12.
Definitions & key terms
- mmap -> map file into memory
- page fault -> OS loads a page on access
- buffered read -> read data into user-space buffer
- prefetch -> OS reads ahead for sequential access
Mental model diagram (ASCII)
read(): file -> kernel buffer -> user buffer -> process
mmap(): file -> page cache -> process address space
How it works (step-by-step)
- Inspect file size and type.
- If large and regular, consider mmap.
- If small or special, use read with buffer.
- Process bytes via unified interface.
Minimal concrete example
void* p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
if (p == MAP_FAILED) { /* report error or fall back to the read path */ }
Common misconceptions
- Misconception: mmap is always faster. Correction: For many small files, mmap overhead can dominate.
- Misconception: read() always pays for a disk read plus a double copy. Correction: The page cache serves repeated reads from memory, so often only the kernel-to-user copy remains.
Check-your-understanding questions
- Why might mmap be worse for small files?
- What triggers a page fault?
- Why is sequential read often fast?
Check-your-understanding answers
- Mapping overhead and page faults outweigh benefits.
- Accessing a page not present in memory.
- OS read-ahead prefetches sequential blocks.
Real-world applications
- ripgrep and other search tools
- database engines and file scanners
Where you’ll apply it
- This project: Section 4.4 Algorithm Overview and Section 6.2 Tests.
- Also used in: P12-line-oriented-search-with-simd-line-counting, P15-build-your-own-ripgrep.
References
- “The Linux Programming Interface” (mmap chapters)
- ripgrep I/O strategy docs
Key insights
There is no universally best I/O strategy; choose based on size and access pattern.
Summary
Adaptive I/O balances mmap and read to maximize throughput on real workloads.
Homework/Exercises to practice the concept
- Benchmark mmap vs read on a 1MB and a 1GB file.
- Plot the crossover point where mmap becomes faster.
Solutions to the homework/exercises
- Use `hyperfine` to compare implementations.
- Record timings and identify the size where mmap wins.
3. Project Specification
3.1 What You Will Build
A library + CLI io-select that loads files using either mmap or buffered reads and exposes a --strategy flag (auto, mmap, read). It reports timing stats and selects strategy automatically when auto is chosen.
3.2 Functional Requirements
- mmap path: map file and expose byte slice.
- read path: buffered read into reusable buffer.
- auto selection: choose based on file size threshold.
- stats: output timing and bytes processed.
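A minimal sketch of the mmap path, assuming a helper name `map_file` (not part of any required API). It returns NULL on failure so the caller can fall back to the read path, which is how the auto strategy stays reliable on edge cases:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map an entire file read-only. Returns NULL on failure so the
 * caller can fall back to buffered reads. A zero-length mapping is
 * rejected up front: mmap() with length 0 fails with EINVAL, so the
 * zero-length-file edge case (Section 3.6) must take the read path. */
static const unsigned char *map_file(int fd, size_t len) {
    if (len == 0)
        return NULL;
    void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return NULL;
    }
    return p; /* caller must munmap((void *)p, len) when done */
}
```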
3.3 Non-Functional Requirements
- Performance: auto selection should be within 10% of the best fixed strategy.
- Reliability: handles errors gracefully (mmap failures, short reads).
- Usability: clear output and flags.
3.4 Example Usage / Output
$ ./io-select --strategy auto fixtures/large.txt
strategy: mmap
bytes: 104857600
ms: 120
3.5 Data Formats / Schemas / Protocols
Text output with optional JSON stats.
3.6 Edge Cases
- Non-regular files (pipes)
- Zero-length files
- Very large files on 32-bit systems
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
make
./io-select --strategy auto fixtures/large.txt
3.7.2 Golden Path Demo (Deterministic)
Use fixed fixtures and emit consistent strategy decision.
3.7.3 CLI Transcript (Success + Failure)
$ ./io-select --strategy mmap fixtures/large.txt
strategy: mmap
bytes: 104857600
exit code: 0
$ ./io-select --strategy mmap fixtures/pipe
error: mmap not supported for this file type
exit code: 1
4. Solution Architecture
4.1 High-Level Design
+-----------+ +-----------+ +-----------+ +-----------+
| CLI Parse |-->| Strategy |-->| I/O Layer |-->| Processor |
+-----------+ +-----------+ +-----------+ +-----------+
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Strategy selector | choose read vs mmap | size threshold + file type |
| mmap reader | map and unmap file | MAP_PRIVATE |
| read reader | buffered read loop | reuse buffer |
4.3 Data Structures (No Full Code)
struct IoResult { size_t bytes; double ms; const char* strategy; };
4.4 Algorithm Overview
- Stat file and detect type.
- If strategy=auto, choose based on size threshold.
- Run read or mmap path and measure time.
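For the timing step, one sketch is a single monotonic-clock helper used by both paths, so the `ms` field in `IoResult` is comparable across strategies. `elapsed_ms` is an illustrative name, not a required interface:

```c
#include <assert.h>
#include <time.h>

/* Milliseconds between two CLOCK_MONOTONIC timestamps. Using the
 * monotonic clock avoids jumps from NTP or wall-clock changes. */
static double elapsed_ms(const struct timespec *start,
                         const struct timespec *end) {
    return (end->tv_sec - start->tv_sec) * 1e3 +
           (end->tv_nsec - start->tv_nsec) / 1e6;
}
```

Usage: call `clock_gettime(CLOCK_MONOTONIC, &t0)` before the chosen path, `clock_gettime(CLOCK_MONOTONIC, &t1)` after it, and store `elapsed_ms(&t0, &t1)` in the result.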
Complexity Analysis
- Time: O(n) for both paths
- Space: O(1) extra heap for the read path; mmap reserves O(n) virtual address space, but pages are backed by the shared page cache
5. Implementation Guide
5.1 Development Environment Setup
cc -O2 -Wall -Wextra -o io-select src/main.c
5.2 Project Structure
io-select/
├── src/
│ ├── main.c
│ ├── strategy.c
│ ├── mmap_reader.c
│ └── read_reader.c
└── fixtures/
5.3 The Core Question You’re Answering
“When is mmap better than read, and how can a tool decide automatically?”
5.4 Concepts You Must Understand First
- Page cache and page faults.
- File types and stat metadata.
- Buffer sizing for read loops.
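The read loop is where short reads and EINTR bite. A hedged sketch of a robust fill loop, with `read_fully` as an assumed helper name:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Fill buf from fd, tolerating short reads and EINTR. Returns the
 * total bytes read (less than cap only at EOF) or -1 on error.
 * The real read path would call this once per reused buffer chunk. */
static ssize_t read_fully(int fd, unsigned char *buf, size_t cap) {
    size_t total = 0;
    while (total < cap) {
        ssize_t n = read(fd, buf + total, cap - total);
        if (n == 0)
            break; /* EOF */
        if (n < 0) {
            if (errno == EINTR)
                continue; /* interrupted by a signal: retry */
            return -1;
        }
        total += (size_t)n;
    }
    return (ssize_t)total;
}
```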
5.5 Questions to Guide Your Design
- What threshold should you choose and how will you tune it?
- How will you handle pipes or special files?
- How will you measure timing consistently?
5.6 Thinking Exercise
Predict which strategy wins for a 4KB file and explain why.
5.7 The Interview Questions They’ll Ask
- What are the trade-offs of mmap vs read?
- How does the OS page cache affect performance?
5.8 Hints in Layers
- Hint 1: Start with a fixed threshold (e.g., 1MB).
- Hint 2: Add file type checks to avoid mmap on pipes.
- Hint 3: Add benchmarking to tune the threshold.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| I/O internals | “The Linux Programming Interface” | file I/O chapters |
| OS caching | “Operating Systems: Three Easy Pieces” | VM chapters |
5.10 Implementation Phases
Phase 1: Baseline Readers (2 days)
- Implement read and mmap paths.
- Checkpoint: same byte counts.
Phase 2: Strategy Selector (2 days)
- Add auto strategy based on size.
- Checkpoint: correct decisions on fixtures.
Phase 3: Benchmarking (2 days)
- Measure crossover points.
- Checkpoint: threshold tuned for system.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Threshold | fixed vs adaptive | fixed (v1) | simple and testable |
| Buffer size | 64KB vs 1MB | 256KB | good general-purpose size |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit | strategy selection | file size cases |
| Integration | reader correctness | fixtures |
| Edge | special files | pipes, directories |
6.2 Critical Test Cases
- 0-byte file uses read and returns 0 bytes.
- Pipe input rejects mmap with clear error.
- Large file uses mmap in auto mode.
6.3 Test Data
file sizes: 0B, 4KB, 1MB, 100MB
expected: read, read, threshold-dependent, mmap
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| mmap on non-regular files | error | check file type first |
| Timing noise | inconsistent results | run multiple trials |
| Buffer reuse bugs | corrupted data | reset buffer length each read |
7.2 Debugging Strategies
- Use `strace` to inspect syscalls.
- Compare byte counts between strategies.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add a `--buffer-size` flag.
- Add CSV output for benchmarks.
8.2 Intermediate Extensions
- Adaptive strategy based on past timings.
- Use `posix_fadvise` hints.
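A minimal sketch of the fadvise extension, assuming a Linux/POSIX target. The hint is purely advisory, so ignoring a failure in the I/O layer is safe; `advise_sequential` is a hypothetical helper name:

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>

/* Tell the kernel we will scan the file sequentially so read-ahead
 * can be aggressive. Returns 0 on success or an errno value; since
 * the hint is advisory, callers may ignore the result. */
static int advise_sequential(int fd) {
    return posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
}
```

`POSIX_FADV_DONTNEED` after a one-shot scan is a natural follow-up experiment, since it releases cached pages you will not reuse.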
8.3 Advanced Extensions
- Hybrid: mmap for large files, read for small.
- Integrate with async I/O APIs.
9. Real-World Connections
9.1 Industry Applications
- Search tools and log processors
- Databases and file scanners
9.2 Related Open Source Projects
- ripgrep I/O strategy
- fd and ag implementations
9.3 Interview Relevance
- Discussing memory mapping and OS caches
10. Resources
10.1 Essential Reading
- TLPI file I/O chapters
- OS VM chapters (OSTEP)
10.2 Tools & Documentation
- `strace` for syscall tracing
- `hyperfine` for benchmarks
10.3 Related Projects in This Series
- P12-line-oriented-search-with-simd-line-counting
- P15-build-your-own-ripgrep
11. Self-Assessment Checklist
11.1 Understanding
- I can explain mmap vs read trade-offs.
- I can explain page faults and caching.
11.2 Implementation
- Auto strategy chooses the expected path.
- Both strategies return identical results.
11.3 Growth
- I can tune thresholds based on benchmarks.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Read + mmap implementations with strategy flag
Full Completion:
- Auto strategy with benchmark stats
Excellence (Going Above & Beyond):
- Adaptive strategy based on real-time measurements
- posix_fadvise integration