Project 11: mmap vs read Strategy Selector
Build an adaptive I/O layer that chooses between mmap and buffered reads based on file size and access patterns.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3: Advanced |
| Time Estimate | 1 week |
| Main Programming Language | C |
| Alternative Programming Languages | Rust |
| Coolness Level | Level 3: Clever |
| Business Potential | Level 2: Systems tooling |
| Prerequisites | file I/O, mmap basics |
| Key Topics | mmap, read buffers, page faults, adaptive I/O |
1. Learning Objectives
By completing this project, you will:
- Implement both mmap and buffered read I/O paths.
- Measure and compare performance on small vs large files.
- Build heuristics to choose I/O strategy automatically.
- Handle errors and edge cases for memory mapping.
- Explain page faults and OS caching behavior.
2. All Theory Needed (Per-Concept Breakdown)
2.1 mmap vs read I/O Strategies
Fundamentals
read() with a buffer is the traditional way to load file data. You allocate a buffer, call read repeatedly, and process bytes as they arrive. mmap() maps a file into your address space and lets you access it like a byte array. The OS loads pages on demand, which can be faster for random access but may incur page faults.
The right strategy depends on file size, access pattern, and OS behavior. For many small files, read buffers are often faster due to lower setup overhead. For very large files or random access patterns, mmap can be more efficient because it avoids extra copies and leverages page caching.
Deep Dive into the concept
With mmap, the file is mapped into virtual memory. Touching a page that is not resident triggers a page fault, and the OS loads it from disk. This can be very fast on SSDs, but for purely sequential scans the kernel read-ahead that read() already benefits from can be just as good or better, so mmap gains little.
Buffered read uses system calls repeatedly and copies data into user-space buffers. This is often faster for streaming workloads because you can control buffer size and prefetch. It also avoids issues with sparse files and memory mapping failures on huge files. mmap can also be tricky on 32-bit systems due to address space limits.
Modern tools such as ripgrep use heuristics: if the file is large and regular, mmap may be used; if the file is small, or you are reading many small files, buffered reads tend to win. The best choice depends on real-world benchmarks and hardware.
How this fits in the project series
This adaptive I/O layer feeds your search engine in P15 and affects line counting performance in P12.
Definitions & key terms
- mmap -> map file into memory
- page fault -> OS loads a page on access
- buffered read -> read data into user-space buffer
- prefetch -> OS reads ahead for sequential access
Mental model diagram (ASCII)
read(): file -> kernel buffer -> user buffer -> process
mmap(): file -> page cache -> process address space
How it works (step-by-step)
- Inspect file size and type.
- If large and regular, consider mmap.
- If small or special, use read with buffer.
- Process bytes via unified interface.
Minimal concrete example
void* p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
if (p == MAP_FAILED) { /* report error or fall back to the read path */ }
Common misconceptions
- Misconception: mmap is always faster. Correction: For many small files, mmap overhead can dominate.
- Misconception: read() always pays for a disk read plus a double copy. Correction: The page cache serves repeated reads from memory, so often only the kernel-to-user copy remains.
Check-your-understanding questions
- Why might mmap be worse for small files?
- What triggers a page fault?
- Why is sequential read often fast?
Check-your-understanding answers
- Mapping overhead and page faults outweigh benefits.
- Accessing a page not present in memory.
- OS read-ahead prefetches sequential blocks.
Real-world applications
- ripgrep and other search tools
- database engines and file scanners
Where you’ll apply it
- This project: Section 4.4 Algorithm Overview and Section 6.2 Tests.
- Also used in: P12-line-oriented-search-with-simd-line-counting, P15-build-your-own-ripgrep.
References
- “The Linux Programming Interface” (mmap chapters)
- ripgrep I/O strategy docs
Key insights
There is no universally best I/O strategy; choose based on size and access pattern.
Summary
Adaptive I/O balances mmap and read to maximize throughput on real workloads.
Homework/Exercises to practice the concept
- Benchmark mmap vs read on a 1MB and a 1GB file.
- Plot the crossover point where mmap becomes faster.
Solutions to the homework/exercises
- Use `hyperfine` to compare implementations.
- Record timings and identify the size where mmap wins.
3. Project Specification
3.1 What You Will Build
A library + CLI io-select that loads files using either mmap or buffered reads and exposes a --strategy flag (auto, mmap, read). It reports timing stats and selects strategy automatically when auto is chosen.
3.2 Functional Requirements
- mmap path: map file and expose byte slice.
- read path: buffered read into reusable buffer.
- auto selection: choose based on file size threshold.
- stats: output timing and bytes processed.
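A minimal sketch of the mmap path, assuming a helper name `map_file` (not part of any required API). It returns NULL on failure so the caller can fall back to the read path, which is how the auto strategy stays reliable on edge cases:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map an entire file read-only. Returns NULL on failure so the
 * caller can fall back to buffered reads. A zero-length mapping is
 * rejected up front: mmap() with length 0 fails with EINVAL, so the
 * zero-length-file edge case (Section 3.6) must take the read path. */
static const unsigned char *map_file(int fd, size_t len) {
    if (len == 0)
        return NULL;
    void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return NULL;
    }
    return p; /* caller must munmap((void *)p, len) when done */
}
```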
3.3 Non-Functional Requirements
- Performance: auto selection should be within 10% of the best fixed strategy.
- Reliability: handles errors gracefully (mmap failures, short reads).
- Usability: clear output and flags.
3.4 Example Usage / Output
$ ./io-select --strategy auto fixtures/large.txt
strategy: mmap
bytes: 104857600
ms: 120
3.5 Data Formats / Schemas / Protocols
Text output with optional JSON stats.
3.6 Edge Cases
- Non-regular files (pipes)
- Zero-length files
- Very large files on 32-bit systems
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
make
./io-select --strategy auto fixtures/large.txt
3.7.2 Golden Path Demo (Deterministic)
Use fixed fixtures and emit consistent strategy decision.
3.7.3 CLI Transcript (Success + Failure)
$ ./io-select --strategy mmap fixtures/large.txt
strategy: mmap
bytes: 104857600
exit code: 0
$ ./io-select --strategy mmap fixtures/pipe
error: mmap not supported for this file type
exit code: 1
4. Solution Architecture
4.1 High-Level Design
+-----------+ +-----------+ +-----------+ +-----------+
| CLI Parse |-->| Strategy |-->| I/O Layer |-->| Processor |
+-----------+ +-----------+ +-----------+ +-----------+
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Strategy selector | choose read vs mmap | size threshold + file type |
| mmap reader | map and unmap file | MAP_PRIVATE |
| read reader | buffered read loop | reuse buffer |
4.3 Data Structures (No Full Code)
struct IoResult { size_t bytes; double ms; const char* strategy; };
4.4 Algorithm Overview
- Stat file and detect type.
- If strategy=auto, choose based on size threshold.
- Run read or mmap path and measure time.
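For the timing step, one sketch is a single monotonic-clock helper used by both paths, so the `ms` field in `IoResult` is comparable across strategies. `elapsed_ms` is an illustrative name, not a required interface:

```c
#include <assert.h>
#include <time.h>

/* Milliseconds between two CLOCK_MONOTONIC timestamps. Using the
 * monotonic clock avoids jumps from NTP or wall-clock changes. */
static double elapsed_ms(const struct timespec *start,
                         const struct timespec *end) {
    return (end->tv_sec - start->tv_sec) * 1e3 +
           (end->tv_nsec - start->tv_nsec) / 1e6;
}
```

Usage: call `clock_gettime(CLOCK_MONOTONIC, &t0)` before the chosen path, `clock_gettime(CLOCK_MONOTONIC, &t1)` after it, and store `elapsed_ms(&t0, &t1)` in the result.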
Complexity Analysis
- Time: O(n) for both paths
- Space: O(1) extra heap for the read path; mmap reserves O(n) virtual address space, but pages are backed by the shared page cache
5. Implementation Guide
5.1 Development Environment Setup
cc -O2 -Wall -Wextra -o io-select src/main.c
5.2 Project Structure
io-select/
├── src/
│ ├── main.c
│ ├── strategy.c
│ ├── mmap_reader.c
│ └── read_reader.c
└── fixtures/
5.3 The Core Question You’re Answering
“When is mmap better than read, and how can a tool decide automatically?”
5.4 Concepts You Must Understand First
- Page cache and page faults.
- File types and stat metadata.
- Buffer sizing for read loops.
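The read loop is where short reads and EINTR bite. A hedged sketch of a robust fill loop, with `read_fully` as an assumed helper name:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Fill buf from fd, tolerating short reads and EINTR. Returns the
 * total bytes read (less than cap only at EOF) or -1 on error.
 * The real read path would call this once per reused buffer chunk. */
static ssize_t read_fully(int fd, unsigned char *buf, size_t cap) {
    size_t total = 0;
    while (total < cap) {
        ssize_t n = read(fd, buf + total, cap - total);
        if (n == 0)
            break; /* EOF */
        if (n < 0) {
            if (errno == EINTR)
                continue; /* interrupted by a signal: retry */
            return -1;
        }
        total += (size_t)n;
    }
    return (ssize_t)total;
}
```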
5.5 Questions to Guide Your Design
- What threshold should you choose and how will you tune it?
- How will you handle pipes or special files?
- How will you measure timing consistently?
5.6 Thinking Exercise
Predict which strategy wins for a 4KB file and explain why.
5.7 The Interview Questions They’ll Ask
- What are the trade-offs of mmap vs read?
- How does the OS page cache affect performance?
5.8 Hints in Layers
- Hint 1: Start with a fixed threshold (e.g., 1MB).
- Hint 2: Add file type checks to avoid mmap on pipes.
- Hint 3: Add benchmarking to tune the threshold.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| I/O internals | “The Linux Programming Interface” | file I/O chapters |
| OS caching | “Operating Systems: Three Easy Pieces” | VM chapters |
5.10 Implementation Phases
Phase 1: Baseline Readers (2 days)
- Implement read and mmap paths.
- Checkpoint: same byte counts.
Phase 2: Strategy Selector (2 days)
- Add auto strategy based on size.
- Checkpoint: correct decisions on fixtures.
Phase 3: Benchmarking (2 days)
- Measure crossover points.
- Checkpoint: threshold tuned for system.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Threshold | fixed vs adaptive | fixed (v1) | simple and testable |
| Buffer size | 64KB vs 1MB | 256KB | good general-purpose size |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit | strategy selection | file size cases |
| Integration | reader correctness | fixtures |
| Edge | special files | pipes, directories |
6.2 Critical Test Cases
- 0-byte file uses read and returns 0 bytes.
- Pipe input rejects mmap with clear error.
- Large file uses mmap in auto mode.
6.3 Test Data
file sizes: 0B, 4KB, 1MB, 100MB
expected: read, read, threshold-dependent, mmap
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| mmap on non-regular files | error | check file type first |
| Timing noise | inconsistent results | run multiple trials |
| Buffer reuse bugs | corrupted data | reset buffer length each read |
7.2 Debugging Strategies
- Use `strace` to inspect syscalls.
- Compare byte counts between strategies.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add a `--buffer-size` flag.
- Add CSV output for benchmarks.
8.2 Intermediate Extensions
- Adaptive strategy based on past timings.
- Use `posix_fadvise` hints.
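A minimal sketch of the fadvise extension, assuming a Linux/POSIX target. The hint is purely advisory, so ignoring a failure in the I/O layer is safe; `advise_sequential` is a hypothetical helper name:

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>

/* Tell the kernel we will scan the file sequentially so read-ahead
 * can be aggressive. Returns 0 on success or an errno value; since
 * the hint is advisory, callers may ignore the result. */
static int advise_sequential(int fd) {
    return posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
}
```

`POSIX_FADV_DONTNEED` after a one-shot scan is a natural follow-up experiment, since it releases cached pages you will not reuse.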
8.3 Advanced Extensions
- Hybrid: mmap for large files, read for small.
- Integrate with async I/O APIs.
9. Real-World Connections
9.1 Industry Applications
- Search tools and log processors
- Databases and file scanners
9.2 Related Open Source Projects
- ripgrep I/O strategy
- fd and ag implementations
9.3 Interview Relevance
- Discussing memory mapping and OS caches
10. Resources
10.1 Essential Reading
- TLPI file I/O chapters
- OS VM chapters (OSTEP)
10.2 Tools & Documentation
- `strace` for syscall tracing
- `hyperfine` for benchmarks
10.3 Related Projects in This Series
- P12-line-oriented-search-with-simd-line-counting
- P15-build-your-own-ripgrep
11. Self-Assessment Checklist
11.1 Understanding
- I can explain mmap vs read trade-offs.
- I can explain page faults and caching.
11.2 Implementation
- Auto strategy chooses the expected path.
- Both strategies return identical results.
11.3 Growth
- I can tune thresholds based on benchmarks.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Read + mmap implementations with strategy flag
Full Completion:
- Auto strategy with benchmark stats
Excellence (Going Above & Beyond):
- Adaptive strategy based on real-time measurements
- posix_fadvise integration