Project 4: Page-Based Storage Engine with Buffer Pool
Build a buffer-managed storage layer that enforces pin/dirty invariants and exposes measurable cache behavior.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Advanced |
| Time Estimate | 1-2 weeks |
| Main Programming Language | C |
| Alternative Programming Languages | Rust, Go |
| Coolness Level | Level 4: Hardcore Tech Flex |
| Business Potential | Level 4: Open Core Infrastructure |
| Prerequisites | Project 1 and Project 3 concepts |
| Key Topics | Pin/unpin discipline, eviction, dirty flush lifecycle |
1. Learning Objectives
- Implement a page manager that supports fixed-size slotted pages.
- Build a buffer pool with pin counts, dirty tracking, and CLOCK eviction.
- Enforce correctness invariants around frame reuse and flush behavior.
- Instrument hit rate, eviction, and flush latency metrics.
2. All Theory Needed (Per-Concept Breakdown)
Buffer Pool Correctness Model
Fundamentals
A buffer pool is not only a cache. It is a correctness boundary where logical page access must respect frame lifecycle rules: fetch, pin, mutate/read, unpin, and optional flush. Any violation can corrupt data or crash under load.
Deep Dive into the concept
Every frame is a stateful object with ownership semantics. Pin count indicates active use; dirty flag indicates persistence debt. Eviction chooses only unpinned frames; dirty victims require safe flush before reuse. This sequence seems straightforward but fails easily in edge paths: error returns that forget unpin, retries that double-unpin, or flush failures ignored during victim selection.
CLOCK policy offers practical balance between complexity and behavior. It approximates recency with low overhead by rotating a hand and clearing reference bits. Compared to exact LRU, CLOCK reduces metadata churn and lock contention. Regardless of policy, correctness first: no pinned eviction, no dirty reuse without successful writeback.
Observability should be built-in: per-frame pin lifetime, eviction retries, dirty percentage, and flush queue depth. These metrics explain throughput cliffs better than hit rate alone. For example, a high hit rate can hide severe tail latency caused by synchronous dirty flushes on critical paths.
How this fits into the projects
- Primary in this project.
- Required by WAL and SQL projects for stable runtime behavior.
Definitions & key terms
- Frame: In-memory container for one page.
- Pin count: Number of active holders of a frame.
- Dirty page: A page modified in memory but not yet persisted to disk.
Mental model diagram
Request page 42 -> hit? yes -> pin++ -> use -> unpin--
                   miss -> choose victim (unpinned only)
                        -> if dirty, flush
                        -> read page 42 into frame
How it works (step-by-step, with invariants and failure modes)
- Lookup page ID in frame table.
- Hit: increment pin and return frame.
- Miss: select unpinned victim.
- Flush victim if dirty.
- Load requested page and pin.
Failure modes:
- Pin leaks causing eviction starvation.
- Dirty frame reused after failed flush.
Minimal concrete example
trace: fetch(10), fetch(11), unpin(10 dirty), fetch(12)
victim selection must not evict page 11 if pin>0
Common misconceptions
- “Hit rate alone is enough.” -> Tail flush stalls still matter.
- “Pinned pages can be force-evicted.” -> Unsafe.
Check-your-understanding questions
- Why is unpin discipline critical?
- Why can sequential scans hurt naive policies?
- Which metric detects pin leaks?
Check-your-understanding answers
- Prevents frame reuse while still in use.
- Scan pages displace hot working set.
- Long-lived pin counts and victim retry loops.
Real-world applications
- PostgreSQL shared buffers.
- InnoDB buffer pool.
Where you’ll apply it
- This project and Projects 5-7.
References
Key insights
Buffer pools are correctness-critical execution infrastructure.
Summary
Strong frame lifecycle contracts prevent subtle runtime corruption.
Homework/Exercises to practice the concept
- Define frame state transition table.
- Build workload that causes LRU scan pollution.
- Choose metrics for dashboard and justify each.
Solutions to the homework/exercises
- Include pinned/dirty/flushing transitions with illegal edges blocked.
- Alternate large scans with small hot set.
- Add hit rate, dirty %, pin lifetime, flush latency, eviction retries.
3. Project Specification
3.1 What You Will Build
Storage subsystem with page API and memory-resident buffer pool supporting configurable eviction policy.
3.2 Functional Requirements
- Fetch and pin page by ID.
- Unpin page with optional dirty mark.
- Evict unpinned victim using policy.
- Flush dirty pages safely.
- Report runtime metrics.
3.3 Non-Functional Requirements
- Performance: Stable latency under mixed workloads.
- Reliability: No data corruption under flush failures.
- Usability: Metrics and frame inspection available.
3.4 Example Usage / Output
run mixed workload -> inspect hit rate and flush stats
3.5 Data Formats / Schemas / Protocols
Frame table schema with page ID map, pin count, dirty flag, reference bit.
3.6 Edge Cases
All frames pinned, repeated page faults, flush error, cold-start scans.
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
make storage_engine_demo
./storage_engine_demo workload --profile mixed --ops 500000 --seed 11
3.7.2 Golden Path Demo (Deterministic)
Fixed workload profile and seed.
3.7.3 If CLI: exact terminal transcript
$ ./storage_engine_demo workload --profile mixed --ops 500000 --seed 11
hit_rate=0.934 evictions=27122 dirty_flushes=9180
$ ./storage_engine_demo inspect-frame 87
frame=87 page=14829 pin=0 dirty=1 refbit=0
4. Solution Architecture
4.1 High-Level Design
Query/Storage Calls -> Buffer Pool -> Page File I/O
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Frame table | page->frame map | hash lookup |
| Eviction policy | victim choice | CLOCK default |
| Flush manager | durable writes | batch-friendly |
| Metrics collector | observability | low-overhead counters |
4.3 Data Structures (No Full Code)
Frame { page_id, pin_count, dirty, refbit, last_access }
BufferPool { frames[], clock_hand, page_map }
4.4 Algorithm Overview
Fetch path and victim-selection path with pin/dirty guards.
5. Implementation Guide
5.1 Development Environment Setup
make storage_engine_demo
5.2 Project Structure
project/
├── src/buffer_pool/
├── src/page/
├── tests/
└── tools/
5.3 The Core Question You’re Answering
How do I shape page access so performance stays high without breaking durability and correctness?
5.4 Concepts You Must Understand First
- Frame lifecycle
- Eviction policy behavior
- Dirty writeback contracts
5.5 Questions to Guide Your Design
- What happens when all frames are pinned?
- How do you surface flush failures to callers?
5.6 Thinking Exercise
Draw state machine for one frame from free to dirty flush and reuse.
5.7 The Interview Questions They’ll Ask
- Why can LRU fail under scans?
- Why are pin counts non-negotiable?
- What metrics reveal cache health?
- How does dirty flush interact with WAL?
5.8 Hints in Layers
- Build deterministic trace tests first.
- Add assertion for illegal state transitions.
- Defer optimization until metrics are stable.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Cache behavior | Computer Systems: A Programmer’s Perspective | Ch. 6 |
| OS caching concepts | Operating Systems: Three Easy Pieces | cache/file chapters |
5.10 Implementation Phases
- Phase 1: fetch/unpin + frame table
- Phase 2: CLOCK eviction
- Phase 3: dirty flush + metrics
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Policy | LRU / CLOCK | CLOCK | simpler + robust |
| Flush mode | sync per-evict / background assist | hybrid | reduces stalls |
6. Testing Strategy
6.1 Test Categories
- Unit: state transitions
- Integration: mixed workload behavior
- Fault injection: flush/write errors
6.2 Critical Test Cases
- No pinned eviction under stress.
- Dirty flush failure does not reuse frame.
- Metrics remain consistent with operation trace.
6.3 Test Data
Seeded request traces: random, scan-heavy, mixed.
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Forgotten unpin | system stalls | pin-leak detector |
| Ignored flush errors | stale or lost updates | fail frame reuse on write failure |
7.2 Debugging Strategies
- Frame dump command for live diagnostics.
- Log clock-hand movement and victim decisions.
7.3 Performance Traps
Global buffer lock can become contention bottleneck.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add per-page access heatmap output.
8.2 Intermediate Extensions
- Add scan-resistance policy (2Q-like behavior).
8.3 Advanced Extensions
- Add partitioned buffer pool for reduced lock contention.
9. Real-World Connections
9.1 Industry Applications
Buffer pool quality often determines practical DB throughput.
9.2 Related Open Source Projects
- PostgreSQL and InnoDB buffer manager behavior documentation.
9.3 Interview Relevance
Buffer manager design is common in systems and DB interviews.
10. Resources
10.1 Essential Reading
- OS/file-cache chapters and database architecture docs.
10.2 Video Resources
- Talks on DB buffer manager internals.
10.3 Tools & Documentation
perf, strace, sanitizer builds, custom trace visualizers.
10.4 Related Projects in This Series
11. Self-Assessment Checklist
11.1 Understanding
- I can explain frame lifecycle and illegal transitions.
- I can explain policy tradeoffs with workload examples.
11.2 Implementation
- No pinned evictions in stress tests.
- Dirty flush behavior is safe under injected failures.
11.3 Growth
- I can propose next policy improvement with benchmark evidence.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Fetch/unpin, basic eviction, dirty tracking
Full Completion:
- Plus robust metrics and failure injection tests
Excellence:
- Plus adaptive policy and contention reduction improvements