Project 7: Page Cache and mmap Lab

Build a repeatable experiment that measures cold vs warm file reads and explains page fault behavior.

Quick Reference

Attribute Value
Difficulty Intermediate
Time Estimate Weekend
Main Programming Language C or Python
Alternative Programming Languages Rust, Go
Coolness Level See REFERENCE.md (Level 2)
Business Potential See REFERENCE.md (Level 2)
Prerequisites Basic file I/O, timing, /proc basics
Key Topics Page cache, page faults, mmap behavior

1. Learning Objectives

By completing this project, you will:

  1. Explain why warm reads are faster than cold reads.
  2. Measure page faults deterministically.
  3. Compare read() and mmap() access patterns.
  4. Produce a reproducible performance report.

2. All Theory Needed (Per-Concept Breakdown)

Virtual Memory, Page Cache, and mmap

Fundamentals Virtual memory gives each process a private address space, and the kernel maps those addresses to physical memory. The page cache stores recently accessed file data in memory so future reads can be served without disk I/O. Memory-mapped files use the same cache, but data is accessed via page faults rather than explicit read calls. Understanding these mechanisms explains why the first read of a file is slow, why subsequent reads are fast, and why performance can vary depending on caching behavior.

Deep Dive When a process reads from a file, the kernel does not necessarily fetch exactly the requested bytes from disk. Instead, it typically reads data in page-sized chunks into the page cache. The read() syscall then copies data from the page cache into user space buffers. If the data is already cached, the read can be satisfied without disk access. This is why repeated reads are significantly faster. The cache is shared across processes, which means one program can warm the cache for another.

Memory-mapped files provide an alternative interface. When you map a file into memory, the kernel sets up virtual memory mappings that correspond to file-backed pages. The program then accesses file data by reading memory addresses. The first access to a page triggers a page fault, and the kernel loads the page into memory, populating the page cache. Subsequent accesses to that page are served from memory. This means mmap and read share the same underlying caching mechanism, but the access path is different: mmap relies on page faults and virtual memory, while read relies on explicit syscalls and data copies.

Page faults come in two types: minor and major. A minor fault happens when the page is already in memory but not mapped into the process address space. A major fault happens when the data must be read from disk. Major faults are expensive and correlate strongly with I/O latency. When you measure file performance, you should record both fault types to understand the underlying cause of slow reads.

Caching is not purely beneficial; it can obscure the true cost of disk I/O and lead to misleading benchmarks. A naive benchmark may report extremely fast performance simply because the data is cached. This is why it is important to distinguish cold cache measurements (after dropping caches or using unique data) from warm cache measurements. It also explains why production systems can show very different latency profiles after restarts or under memory pressure.

The page cache is also involved in writes. When you write to a file, the kernel typically writes into cached pages and marks them dirty, deferring the actual disk write. This improves performance but means that a successful write does not guarantee data is on disk. Applications that need durability use explicit sync operations. Understanding this behavior is essential for reasoning about data integrity and write latency.

Memory pressure can evict cached pages. The kernel uses heuristics to determine which pages to evict, balancing file cache with anonymous memory used by processes. Under heavy pressure, the system may swap, which introduces additional latency. This is one reason why performance can degrade sharply when memory is tight.

How this fits into the project You will apply this concept in §3.1 for experiment design, in §4.2 for measurement components, and in §6.2 for test cases. It also supports P08-mirror-filesystem-fuse.md, where cache behavior matters.

Definitions & key terms

  • Page cache: Kernel cache of file-backed pages.
  • Page fault: Trap when a page is not mapped or present.
  • Minor fault: Page present but not mapped.
  • Major fault: Page must be read from disk.
  • mmap: Map file into virtual address space.

Mental model diagram

File on disk -> Page cache -> User read buffer
                       |
                       v
                  mmap view

How it works

  1. Read or map a file.
  2. First access triggers page faults.
  3. Pages are loaded into cache.
  4. Subsequent reads are served from memory.

Minimal concrete example

Cold read: 1.8s, major faults: 45
Warm read: 0.06s, major faults: 0

Common misconceptions

  • “A read always hits disk.” Cached data may be served from RAM.
  • “mmap avoids the page cache.” It uses the page cache.
  • “Write completes means data is safe.” It may still be in cache.

Check-your-understanding questions

  1. Why are warm reads faster than cold reads?
  2. What is the difference between major and minor page faults?
  3. How does mmap access differ from read()?
  4. Why can benchmarks be misleading without cache control?

Check-your-understanding answers

  1. Warm reads hit the page cache instead of disk.
  2. Major faults require disk I/O; minor faults do not.
  3. mmap uses page faults and memory access; read uses syscalls and copy.
  4. Cached data hides true disk cost.

Real-world applications

  • Database performance tuning.
  • Diagnosing slow I/O in production.
  • Building caching-aware benchmarks.

References

  • “Operating Systems: Three Easy Pieces” - memory chapters
  • “The Linux Programming Interface” - memory mapping

Key insights Page cache behavior is the hidden variable in file I/O performance.

Summary Once you understand page faults and cache, I/O latency becomes predictable.

Homework/Exercises to practice the concept

  1. Measure the time difference between first and second reads of a file.
  2. Observe page fault counters during mmap access.

Solutions to the homework/exercises

  1. The second read should be much faster.
  2. Major faults appear on first access, minor faults afterward.

3. Project Specification

3.1 What You Will Build

A command-line experiment runner that reads a large file using both read() and mmap() styles, records timing, and prints page fault counts before and after each run.

3.2 Functional Requirements

  1. Cold vs warm read: measure timings for both cases.
  2. Fault counters: record major and minor faults.
  3. mmap comparison: compare read and mmap paths.

3.3 Non-Functional Requirements

  • Performance: runs in under 1 minute.
  • Reliability: results are reproducible on the same machine.
  • Usability: clear report output.

3.4 Example Usage / Output

$ ./cachelab sample.dat
cold read: 1.82s
warm read: 0.06s
major faults: +45
minor faults: +1200

3.5 Data Formats / Schemas / Protocols

  • Report format: key: value per line.
  • Use fixed units (seconds, KB).

3.6 Edge Cases

  • File too small to show difference.
  • Background I/O polluting cache.
  • Insufficient permissions to drop caches.

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

  • Build from project-root, e.g. cc -O2 -o cachelab src/cachelab.c src/measure.c.
  • Run ./cachelab sample.dat.

3.7.2 Golden Path Demo (Deterministic)

Use a fixed file size and fixed run order: cold read then warm read.

3.7.3 If CLI: Exact terminal transcript

$ ./cachelab sample.dat
cold read: 1.82s
warm read: 0.06s
major faults: +45
minor faults: +1200
# exit code: 0

Failure demo (deterministic):

$ ./cachelab missing.dat
error: file not found
# exit code: 2

Exit codes:

  • 0 success
  • 2 missing file

4. Solution Architecture

4.1 High-Level Design

measure -> read -> record -> read again -> report

4.2 Key Components

Component Responsibility Key Decisions
Timer Measure elapsed time Use monotonic clock
Reader Read file via read() Fixed block size
Mapper Read file via mmap Sequential access

4.3 Data Structures (No Full Code)

  • Run record: method, duration, faults before/after.
  • Report: list of run records.

4.4 Algorithm Overview

Key Algorithm: measurement loop

  1. Record fault counters.
  2. Read file with chosen method.
  3. Record counters again.
  4. Compute deltas and print report.

Complexity Analysis:

  • Time: O(n) for file size n.
  • Space: O(1) extra memory.

5. Implementation Guide

5.1 Development Environment Setup

# Debian/Ubuntu example: install a C toolchain (any gcc/clang setup works)
sudo apt-get install -y build-essential
# Python alternative: any Python 3 interpreter is sufficient

5.2 Project Structure

project-root/
├── src/
│   ├── cachelab.c
│   └── measure.c
├── data/
│   └── sample.dat
└── README.md

5.3 The Core Question You’re Answering

“Why does the second read of a file feel instant?”

5.4 Concepts You Must Understand First

  1. Page cache
    • What is cached and when?
    • Book Reference: “Operating Systems: Three Easy Pieces” - memory chapters
  2. Page faults
    • Major vs minor faults.
    • Book Reference: “The Linux Programming Interface” - memory mapping

5.5 Questions to Guide Your Design

  1. How will you ensure a cold cache measurement?
  2. How will you compute fault deltas?

5.6 Thinking Exercise

Predict the Curve

Sketch expected read times across repeated runs.

5.7 The Interview Questions They’ll Ask

  1. “What is the page cache?”
  2. “What is the difference between major and minor faults?”
  3. “How does mmap change I/O?”
  4. “Why can benchmarks lie?”

5.8 Hints in Layers

Hint 1: Fixed file Use a fixed file size so runs are comparable.

Hint 2: Warm vs cold Run the read twice without changing the file.

Hint 3: Fault counters Read fault counters before and after each run.

Hint 4: Debugging Repeat runs and average results.

5.9 Books That Will Help

Topic Book Chapter
Memory “Operating Systems: Three Easy Pieces” Memory chapters
mmap “The Linux Programming Interface” Memory mapping

5.10 Implementation Phases

Phase 1: Foundation (1 day)

Goals:

  • Measure read time for a file.

Tasks:

  1. Implement a basic file reader.
  2. Add timing output.

Checkpoint: cold read time printed.

Phase 2: Core Functionality (1 day)

Goals:

  • Add fault counter collection.

Tasks:

  1. Read fault counters before/after.
  2. Compute deltas.

Checkpoint: major/minor fault counts printed.

Phase 3: Polish & Edge Cases (1 day)

Goals:

  • Add mmap comparison and error handling.

Tasks:

  1. Implement mmap path.
  2. Validate missing file behavior.

Checkpoint: both methods produce reports.

5.11 Key Implementation Decisions

Decision Options Recommendation Rationale
Timing wall clock vs monotonic monotonic avoid clock jumps
Method order read first vs mmap first fixed order deterministic results

6. Testing Strategy

6.1 Test Categories

Category Purpose Examples
Unit Tests Timer correctness verify monotonic timing
Integration Tests Full run sample.dat report
Edge Case Tests Missing file error output

6.2 Critical Test Cases

  1. Cold read: higher time, major faults > 0.
  2. Warm read: lower time, major faults ~0.
  3. Missing file: error message and exit code 2.

6.3 Test Data

sample.dat size: 100 MB

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall Symptom Solution
Uncontrolled cache inconsistent results fixed order and file
Clock jumps negative times use monotonic clock
Small files no difference use large file

7.2 Debugging Strategies

  • Repeat runs: average results.
  • Check /proc/meminfo: confirm cache changes.

7.3 Performance Traps

Background I/O can hide the cold/warm difference; run in a quiet environment.


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add CSV output.
  • Add a progress indicator.

8.2 Intermediate Extensions

  • Compare sequential vs random access patterns.
  • Add fsync timing for writes.

8.3 Advanced Extensions

  • Build a cache simulator using the measurements.
  • Visualize results in a plot.

9. Real-World Connections

9.1 Industry Applications

  • Database tuning: cache warm-up and buffering.
  • Performance engineering: isolating disk vs memory latency.

9.2 Tools

  • fio: https://fio.readthedocs.io/ - flexible I/O benchmark tool.
  • perf: https://perf.wiki.kernel.org/ - performance profiling.

9.3 Interview Relevance

Page cache behavior is a common systems performance topic.


10. Resources

10.1 Essential Reading

  • “Operating Systems: Three Easy Pieces” - memory chapters
  • “The Linux Programming Interface” - memory mapping

10.2 Video Resources

  • “Page cache explained” - talks (search title)

10.3 Tools & Documentation

  • procfs: https://docs.kernel.org/filesystems/proc.html
  • perf: https://perf.wiki.kernel.org/

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain page cache behavior
  • I can distinguish major vs minor faults
  • I understand mmap vs read()

11.2 Implementation

  • All functional requirements are met
  • Results are reproducible
  • Error handling is correct

11.3 Growth

  • I can explain this lab in an interview
  • I documented lessons learned
  • I can propose an extension

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Cold vs warm read comparison
  • Fault counts reported
  • Deterministic output format

Full Completion:

  • All minimum criteria plus:
  • mmap comparison
  • Failure demo with exit code

Excellence (Going Above & Beyond):

  • Visualizations and automated reports
  • Multi-run statistical analysis