Project 7: Page Cache and mmap Lab

Build a repeatable experiment that measures cold vs warm file reads and explains page fault behavior.

Quick Reference

Attribute Value
Difficulty Intermediate
Time Estimate Weekend
Main Programming Language C or Python
Alternative Programming Languages Rust, Go
Coolness Level See REFERENCE.md (Level 2)
Business Potential See REFERENCE.md (Level 2)
Prerequisites Basic file I/O, timing, /proc basics
Key Topics Page cache, page faults, mmap behavior

1. Learning Objectives

By completing this project, you will:

  1. Explain why warm reads are faster than cold reads.
  2. Measure page faults deterministically.
  3. Compare read() and mmap() access patterns.
  4. Produce a reproducible performance report.

2. All Theory Needed (Per-Concept Breakdown)

Virtual Memory, Page Cache, and mmap

Fundamentals Virtual memory gives each process a private address space, and the kernel maps those addresses to physical memory. The page cache stores recently accessed file data in memory so future reads can be served without disk I/O. Memory-mapped files use the same cache, but data is accessed via page faults rather than explicit read calls. Understanding these mechanisms explains why the first read of a file is slow, why subsequent reads are fast, and why performance can vary depending on caching behavior.

Deep Dive When a process reads from a file, the kernel does not necessarily fetch exactly the requested bytes from disk. Instead, it typically reads data in page-sized chunks into the page cache. The read() syscall then copies data from the page cache into user space buffers. If the data is already cached, the read can be satisfied without disk access. This is why repeated reads are significantly faster. The cache is shared across processes, which means one program can warm the cache for another.

Memory-mapped files provide an alternative interface. When you map a file into memory, the kernel sets up virtual memory mappings that correspond to file-backed pages. The program then accesses file data by reading memory addresses. The first access to a page triggers a page fault, and the kernel loads the page into memory, populating the page cache. Subsequent accesses to that page are served from memory. This means mmap and read share the same underlying caching mechanism, but the access path is different: mmap relies on page faults and virtual memory, while read relies on explicit syscalls and data copies.

Page faults come in two types: minor and major. A minor fault happens when the page is already in memory but not mapped into the process address space. A major fault happens when the data must be read from disk. Major faults are expensive and correlate strongly with I/O latency. When you measure file performance, you should record both fault types to understand the underlying cause of slow reads.

Caching is not purely beneficial; it can obscure the true cost of disk I/O and lead to misleading benchmarks. A naive benchmark may report extremely fast performance simply because the data is cached. This is why it is important to distinguish cold cache measurements (after dropping caches or using unique data) from warm cache measurements. It also explains why production systems can show very different latency profiles after restarts or under memory pressure.

The page cache is also involved in writes. When you write to a file, the kernel typically writes into cached pages and marks them dirty, deferring the actual disk write. This improves performance but means that a successful write does not guarantee data is on disk. Applications that need durability use explicit sync operations. Understanding this behavior is essential for reasoning about data integrity and write latency.

Memory pressure can evict cached pages. The kernel uses heuristics to determine which pages to evict, balancing file cache with anonymous memory used by processes. Under heavy pressure, the system may swap, which introduces additional latency. This is one reason why performance can degrade sharply when memory is tight.

How this fits into the project You will apply this concept in §3.1 for experiment design, in §4.2 for measurement components, and in §6.2 for test cases. It also supports P08-mirror-filesystem-fuse.md, where cache behavior matters.

Definitions & key terms

  • Page cache: Kernel cache of file-backed pages.
  • Page fault: Trap when a page is not mapped or present.
  • Minor fault: Page present but not mapped.
  • Major fault: Page must be read from disk.
  • mmap: Map file into virtual address space.

Mental model diagram

File on disk -> Page cache -> User read buffer
                       |
                       v
                  mmap view

How it works

  1. Read or map a file.
  2. First access triggers page faults.
  3. Pages are loaded into cache.
  4. Subsequent reads are served from memory.

Minimal concrete example

Cold read: 1.8s, major faults: 45
Warm read: 0.06s, major faults: 0

Common misconceptions

  • “A read always hits disk.” Cached data may be served from RAM.
  • “mmap avoids the page cache.” It uses the page cache.
  • “Write completes means data is safe.” It may still be in cache.

Check-your-understanding questions

  1. Why are warm reads faster than cold reads?
  2. What is the difference between major and minor page faults?
  3. How does mmap access differ from read()?
  4. Why can benchmarks be misleading without cache control?

Check-your-understanding answers

  1. Warm reads hit the page cache instead of disk.
  2. Major faults require disk I/O; minor faults do not.
  3. mmap uses page faults and memory access; read uses syscalls and copy.
  4. Cached data hides true disk cost.

Real-world applications

  • Database performance tuning.
  • Diagnosing slow I/O in production.
  • Building caching-aware benchmarks.

References

  • “Operating Systems: Three Easy Pieces” - memory chapters
  • “The Linux Programming Interface” - memory mapping

Key insights Page cache behavior is the hidden variable in file I/O performance.

Summary Once you understand page faults and cache, I/O latency becomes predictable.

Homework/Exercises to practice the concept

  1. Measure the time difference between first and second reads of a file.
  2. Observe page fault counters during mmap access.

Solutions to the homework/exercises

  1. The second read should be much faster.
  2. Major faults appear on first access, minor faults afterward.

3. Project Specification

3.1 What You Will Build

A command-line experiment runner that reads a large file using both read() and mmap() styles, records timing, and prints page fault counts before and after each run.

3.2 Functional Requirements

  1. Cold vs warm read: measure timings for both cases.
  2. Fault counters: record major and minor faults.
  3. mmap comparison: compare read and mmap paths.

3.3 Non-Functional Requirements

  • Performance: runs in under 1 minute.
  • Reliability: results are reproducible on the same machine.
  • Usability: clear report output.

3.4 Example Usage / Output

$ ./cachelab sample.dat
cold read: 1.82s
warm read: 0.06s
major faults: +45
minor faults: +1200

3.5 Data Formats / Schemas / Protocols

  • Report format: key: value per line.
  • Use fixed units (seconds, KB).

3.6 Edge Cases

  • File too small to show difference.
  • Background I/O polluting cache.
  • Insufficient permissions to drop caches.

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

  • Build from project-root, e.g. cc -O2 -o cachelab src/cachelab.c src/measure.c.
  • Run ./cachelab sample.dat.

3.7.2 Golden Path Demo (Deterministic)

Use a fixed file size and fixed run order: cold read then warm read.

3.7.3 If CLI: Exact terminal transcript

$ ./cachelab sample.dat
cold read: 1.82s
warm read: 0.06s
major faults: +45
minor faults: +1200
# exit code: 0

Failure demo (deterministic):

$ ./cachelab missing.dat
error: file not found
# exit code: 2

Exit codes:

  • 0 success
  • 2 missing file

4. Solution Architecture

4.1 High-Level Design

measure -> read -> record -> read again -> report

4.2 Key Components

Component Responsibility Key Decisions
Timer Measure elapsed time Use monotonic clock
Reader Read file via read() Fixed block size
Mapper Read file via mmap Sequential access

4.3 Data Structures (No Full Code)

  • Run record: method, duration, faults before/after.
  • Report: list of run records.

4.4 Algorithm Overview

Key Algorithm: measurement loop

  1. Record fault counters.
  2. Read file with chosen method.
  3. Record counters again.
  4. Compute deltas and print report.

Complexity Analysis:

  • Time: O(n) for file size n.
  • Space: O(1) extra memory.

5. Implementation Guide

5.1 Development Environment Setup

# Debian/Ubuntu example: install a C toolchain (any gcc/clang setup works)
sudo apt-get install -y build-essential
# Python alternative: any Python 3 interpreter is sufficient

5.2 Project Structure

project-root/
├── src/
│   ├── cachelab.c
│   └── measure.c
├── data/
│   └── sample.dat
└── README.md

5.3 The Core Question You’re Answering

“Why does the second read of a file feel instant?”

5.4 Concepts You Must Understand First

  1. Page cache
    • What is cached and when?
    • Book Reference: “Operating Systems: Three Easy Pieces” - memory chapters
  2. Page faults
    • Major vs minor faults.
    • Book Reference: “The Linux Programming Interface” - memory mapping

5.5 Questions to Guide Your Design

  1. How will you ensure a cold cache measurement?
  2. How will you compute fault deltas?

5.6 Thinking Exercise

Predict the Curve

Sketch expected read times across repeated runs.

5.7 The Interview Questions They’ll Ask

  1. “What is the page cache?”
  2. “What is the difference between major and minor faults?”
  3. “How does mmap change I/O?”
  4. “Why can benchmarks lie?”

5.8 Hints in Layers

Hint 1: Fixed file Use a fixed file size so runs are comparable.

Hint 2: Warm vs cold Run the read twice without changing the file.

Hint 3: Fault counters Read fault counters before and after each run.

Hint 4: Debugging Repeat runs and average results.

5.9 Books That Will Help

Topic Book Chapter
Memory “Operating Systems: Three Easy Pieces” Memory chapters
mmap “The Linux Programming Interface” Memory mapping

5.10 Implementation Phases

Phase 1: Foundation (1 day)

Goals:

  • Measure read time for a file.

Tasks:

  1. Implement a basic file reader.
  2. Add timing output.

Checkpoint: cold read time printed.

Phase 2: Core Functionality (1 day)

Goals:

  • Add fault counter collection.

Tasks:

  1. Read fault counters before/after.
  2. Compute deltas.

Checkpoint: major/minor fault counts printed.

Phase 3: Polish & Edge Cases (1 day)

Goals:

  • Add mmap comparison and error handling.

Tasks:

  1. Implement mmap path.
  2. Validate missing file behavior.

Checkpoint: both methods produce reports.

5.11 Key Implementation Decisions

Decision Options Recommendation Rationale
Timing wall clock vs monotonic monotonic avoid clock jumps
Method order read first vs mmap first fixed order deterministic results

6. Testing Strategy

6.1 Test Categories

Category Purpose Examples
Unit Tests Timer correctness verify monotonic timing
Integration Tests Full run sample.dat report
Edge Case Tests Missing file error output

6.2 Critical Test Cases

  1. Cold read: higher time, major faults > 0.
  2. Warm read: lower time, major faults ~0.
  3. Missing file: error message and exit code 2.

6.3 Test Data

sample.dat size: 100 MB

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall Symptom Solution
Uncontrolled cache inconsistent results fixed order and file
Clock jumps negative times use monotonic clock
Small files no difference use large file

7.2 Debugging Strategies

  • Repeat runs: average results.
  • Check /proc/meminfo: confirm cache changes.

7.3 Performance Traps

Background I/O can hide the cold/warm difference; run in a quiet environment.


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add CSV output.
  • Add a progress indicator.

8.2 Intermediate Extensions

  • Compare sequential vs random access patterns.
  • Add fsync timing for writes.

8.3 Advanced Extensions

  • Build a cache simulator using the measurements.
  • Visualize results in a plot.

9. Real-World Connections

9.1 Industry Applications

  • Database tuning: cache warm-up and buffering.
  • Performance engineering: isolating disk vs memory latency.

9.2 Tools

  • fio: https://fio.readthedocs.io/ - flexible I/O benchmark tool.
  • perf: https://perf.wiki.kernel.org/ - performance profiling.

9.3 Interview Relevance

Page cache behavior is a common systems performance topic.


10. Resources

10.1 Essential Reading

  • “Operating Systems: Three Easy Pieces” - memory chapters
  • “The Linux Programming Interface” - memory mapping

10.2 Video Resources

  • “Page cache explained” - talks (search title)

10.3 Tools & Documentation

  • procfs: https://docs.kernel.org/filesystems/proc.html
  • perf: https://perf.wiki.kernel.org/

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain page cache behavior
  • I can distinguish major vs minor faults
  • I understand mmap vs read()

11.2 Implementation

  • All functional requirements are met
  • Results are reproducible
  • Error handling is correct

11.3 Growth

  • I can explain this lab in an interview
  • I documented lessons learned
  • I can propose an extension

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Cold vs warm read comparison
  • Fault counts reported
  • Deterministic output format

Full Completion:

  • All minimum criteria plus:
  • mmap comparison
  • Failure demo with exit code

Excellence (Going Above & Beyond):

  • Visualizations and automated reports
  • Multi-run statistical analysis