Project 11: mmap vs read Strategy Selector

A file reading abstraction that automatically chooses between mmap and buffered read based on file size, access pattern, and system characteristics.

Quick Reference

Attribute               Value
Primary Language        C
Alternative Languages   Rust, C++
Difficulty              Level 3: Advanced
Time Estimate           1 week
Knowledge Area          Operating Systems / I/O
Tooling                 Custom Implementation
Prerequisites           Basic systems programming, understanding of virtual memory

What You Will Build

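A command-line tool, io_selector, built around a file reading abstraction that chooses between mmap and buffered read for each workload. A benchmark mode times both strategies across file sizes and file counts and derives a selection heuristic; a search mode then applies that heuristic automatically based on file size, access pattern, and system characteristics.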

Why It Matters

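Any tool that scans many files has to decide how to get bytes from the page cache into its address space. mmap avoids copying data into a user buffer, but each mapping costs page table setup, page faults on first touch, and teardown. read copies the data, but its per-file overhead is small and a large buffer amortizes the syscall cost. Measuring where each strategy wins, instead of assuming, is a skill that carries over to text search tools and other file-heavy systems.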

Core Challenges

  • mmap overhead measurement → maps to page table setup cost
  • read buffer sizing → maps to syscall amortization
  • Page fault handling → maps to understanding mmap tradeoffs
  • File size heuristics → maps to when to switch strategies
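
The first two challenges come down to running the same work through two read paths. The following is a minimal sketch, assuming a POSIX system; the function names `sum_with_read` and `sum_with_mmap` are hypothetical, and summing bytes only stands in for whatever real work the tool does per byte. The 64KB buffer matches the size labelled in the benchmark output.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define READ_BUF_SIZE (64 * 1024) /* 64KB buffer, one read() syscall per chunk */

/* Sum every byte of the file using read(). Returns 0 on success, -1 on error. */
int sum_with_read(const char *path, uint64_t *out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    unsigned char *buf = malloc(READ_BUF_SIZE);
    if (!buf) { close(fd); return -1; }

    uint64_t sum = 0;
    ssize_t n;
    while ((n = read(fd, buf, READ_BUF_SIZE)) > 0) {
        for (ssize_t i = 0; i < n; i++) sum += buf[i];
    }
    free(buf);
    close(fd);
    if (n < 0) return -1; /* read() failed partway through */
    *out = sum;
    return 0;
}

/* Sum every byte of the file through a read-only private mapping. */
int sum_with_mmap(const char *path, uint64_t *out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return -1; }

    uint64_t sum = 0;
    if (st.st_size > 0) { /* a zero-length mapping is an error, so skip it */
        size_t len = (size_t)st.st_size;
        unsigned char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { close(fd); return -1; }
        for (size_t i = 0; i < len; i++) sum += p[i];
        munmap(p, len);
    }
    close(fd);
    *out = sum;
    return 0;
}
```

Both functions should return the same sum for the same file, which gives the benchmark a cheap correctness check before any timing is trusted.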

Key Concepts

  • mmap Mechanics: “Linux System Programming” Chapter 9 - Robert Love
  • Page Faults: Demand paging and its costs
  • Buffered I/O: “Advanced Programming in the UNIX Environment” Chapter 5 - Stevens & Rago
  • Linus on mmap: Understanding the “virtual memory mapping is very expensive in itself” quote
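
Demand paging can be observed directly. The sketch below, assuming a POSIX system with `getrusage`, touches one byte per page of a mapped file and reports how many page faults the process took while doing so. The struct and function names are hypothetical. The exact counts depend on the kernel (for example, it may map several neighbouring pages on one fault), so treat the numbers as indicative only.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/stat.h>
#include <unistd.h>

struct fault_report {
    long minor_faults;    /* resolved without disk I/O (page was already cached) */
    long major_faults;    /* required disk I/O */
    size_t pages_touched; /* number of pages the loop read from */
    uint64_t checksum;    /* sum of the touched bytes, keeps the loads observable */
};

/* Map a file, read one byte from each page, and record the fault counts. */
int touch_mapped_file(const char *path, struct fault_report *rep)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) { close(fd); return -1; }

    size_t len = (size_t)st.st_size;
    unsigned char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping remains valid after the descriptor is closed */
    if (p == MAP_FAILED) return -1;

    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    struct rusage before, after;
    getrusage(RUSAGE_SELF, &before);

    uint64_t sum = 0;
    size_t touched = 0;
    for (size_t off = 0; off < len; off += page) {
        sum += p[off]; /* first access to a page may trigger a fault */
        touched++;
    }

    getrusage(RUSAGE_SELF, &after);
    munmap(p, len);

    rep->minor_faults = after.ru_minflt - before.ru_minflt;
    rep->major_faults = after.ru_majflt - before.ru_majflt;
    rep->pages_touched = touched;
    rep->checksum = sum;
    return 0;
}
```

Comparing `minor_faults` against `pages_touched` for a warm file, and `major_faults` for a cold one, is one way to put numbers on the cost that the mmap path pays and the read path does not.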

Real-World Outcome

$ ./io_selector --benchmark
Benchmarking I/O strategies...

Single 1GB file:
  mmap:          120ms (8.33 GB/s)
  read (64KB):   145ms (6.90 GB/s)
  Winner: mmap (17% faster)

1000 files, 1MB each:
  mmap:          850ms (1.18 GB/s)
  read (64KB):   320ms (3.12 GB/s)
  Winner: read (62% faster)

10000 files, 10KB each:
  mmap:          2100ms (0.05 GB/s)
  read (64KB):   180ms (0.56 GB/s)
  Winner: read (~12x faster)

Heuristic derived:
  Use mmap when: file_size > 1MB AND file_count < 10
  Use read otherwise
  Crossover point: ~500KB per file

Auto-selection in action:
  ./io_selector search "pattern" ./small_files/  → using read
  ./io_selector search "pattern" huge_file.log   → using mmap
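
The derived heuristic can be written as a small pure function. This is a sketch only: the names `choose_strategy` and `io_strategy` are hypothetical, 1MB is taken to mean 1024 * 1024 bytes, and the thresholds are copied from the sample output above. A real run of the benchmark would produce its own values.

```c
#include <stddef.h>
#include <stdint.h>

enum io_strategy { IO_STRATEGY_READ, IO_STRATEGY_MMAP };

/* Thresholds copied from the sample "Heuristic derived" output. */
#define MMAP_MIN_FILE_SIZE  (1024u * 1024u) /* mmap only when file_size > 1MB */
#define MMAP_MAX_FILE_COUNT 10u             /* mmap only when file_count < 10 */

/* Pick a strategy for one file that is part of a batch of file_count files. */
enum io_strategy choose_strategy(uint64_t file_size, size_t file_count)
{
    if (file_size > MMAP_MIN_FILE_SIZE && file_count < MMAP_MAX_FILE_COUNT)
        return IO_STRATEGY_MMAP;
    return IO_STRATEGY_READ;
}

/* Human-readable name, e.g. for the "using read" / "using mmap" log line. */
const char *strategy_name(enum io_strategy s)
{
    return s == IO_STRATEGY_MMAP ? "mmap" : "read";
}
```

Keeping the decision in a function with no I/O makes it easy to unit test and to swap in thresholds measured on the target machine.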

Implementation Guide

  1. Reproduce the simplest happy-path scenario.
  2. Build the smallest working version of the core feature.
  3. Add input validation and error handling.
  4. Add instrumentation/logging to confirm behavior.
  5. Refactor into clean modules with tests.
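
For step 3, one edge case worth handling early is that mmap cannot map a zero-length file, and it can fail for other reasons as well. The sketch below, assuming a POSIX system, hides both strategies behind one hypothetical `file_view` type and falls back to read whenever mapping is not possible. None of these names come from the project itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct file_view {
    unsigned char *data; /* file contents, either mapped or heap-allocated */
    size_t len;
    int is_mapped;       /* 1 if data came from mmap, 0 if from read */
};

/* Read up to len bytes into a heap buffer, retrying on EINTR. */
static int read_fully(int fd, struct file_view *v, size_t len)
{
    unsigned char *buf = malloc(len ? len : 1);
    if (!buf) return -1;
    size_t got = 0;
    while (got < len) {
        ssize_t n = read(fd, buf + got, len - got);
        if (n < 0) {
            if (errno == EINTR) continue;
            free(buf);
            return -1;
        }
        if (n == 0) break; /* file is shorter than fstat reported */
        got += (size_t)n;
    }
    v->data = buf;
    v->len = got;
    v->is_mapped = 0;
    return 0;
}

/* Open a regular file; try mmap if requested, otherwise or on failure use read. */
int file_view_open(const char *path, int prefer_mmap, struct file_view *v)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) < 0 || !S_ISREG(st.st_mode)) { close(fd); return -1; }

    size_t len = (size_t)st.st_size;
    int rc = -1;
    if (prefer_mmap && len > 0) {
        void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            v->data = p;
            v->len = len;
            v->is_mapped = 1;
            rc = 0;
        }
    }
    if (rc != 0) rc = read_fully(fd, v, len); /* fallback path */
    close(fd);
    return rc;
}

/* Release the view with the call that matches how it was created. */
void file_view_close(struct file_view *v)
{
    if (v->is_mapped) munmap(v->data, v->len);
    else free(v->data);
    v->data = NULL;
    v->len = 0;
}
```

Callers only see `data` and `len`, so the selection logic can change without touching the search code that consumes the bytes.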

Milestones

  • Milestone 1: Minimal working program that runs end-to-end.
  • Milestone 2: Correct outputs for typical inputs.
  • Milestone 3: Robust handling of edge cases.
  • Milestone 4: Clean structure and documented usage.

Validation Checklist

  • Output follows the format of the real-world outcome example (absolute timings will differ by hardware and page cache state)
  • Handles invalid inputs safely
  • Provides clear errors and exit codes
  • Repeatable results across runs
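
Repeatability usually means timing each strategy several times and reporting a statistic that resists outliers. The following is a sketch under assumptions: it uses `clock_gettime` with `CLOCK_MONOTONIC`, takes the median of the samples, and reports throughput in decimal gigabytes (1 GB = 1e9 bytes). The helper names and the `count_call` workload are hypothetical placeholders for the real read and mmap paths.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <time.h>

/* Current monotonic time in milliseconds. */
static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1000.0 + (double)ts.tv_nsec / 1e6;
}

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Run fn(arg) `runs` times and return the median wall-clock time in ms,
 * or -1.0 if runs is not positive or allocation fails. */
double median_ms(void (*fn)(void *), void *arg, int runs)
{
    if (runs <= 0) return -1.0;
    double *samples = malloc((size_t)runs * sizeof *samples);
    if (!samples) return -1.0;

    for (int i = 0; i < runs; i++) {
        double start = now_ms();
        fn(arg);
        samples[i] = now_ms() - start;
    }
    qsort(samples, (size_t)runs, sizeof *samples, cmp_double);

    double median = (runs % 2)
        ? samples[runs / 2]
        : (samples[runs / 2 - 1] + samples[runs / 2]) / 2.0;
    free(samples);
    return median;
}

/* Convert bytes processed and elapsed milliseconds to GB/s (1 GB = 1e9 bytes). */
double throughput_gb_per_s(double bytes, double ms)
{
    return (bytes / 1e9) / (ms / 1000.0);
}

/* Placeholder workload: counts how many times it was invoked. */
void count_call(void *arg)
{
    unsigned long *counter = arg;
    (*counter)++;
}
```

With these definitions, 1e9 bytes processed in 120 ms works out to about 8.33 GB/s, which agrees with the first line of the sample benchmark output. Whether the page cache is warm or cold should be held constant between runs, since it can dominate the result.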

References

  • Main guide: TEXT_SEARCH_TOOLS_DEEP_DIVE.md
  • “Linux System Programming” by Robert Love