Project 4: Page-Based Storage Engine with Buffer Pool

Build a buffer-managed storage layer that enforces pin/dirty invariants and exposes measurable cache behavior.

Quick Reference

Attribute Value
Difficulty Advanced
Time Estimate 1-2 weeks
Main Programming Language C
Alternative Programming Languages Rust, Go
Coolness Level Level 4: Hardcore Tech Flex
Business Potential Level 4: Open Core Infrastructure
Prerequisites Project 1 and Project 3 concepts
Key Topics Pin/unpin discipline, eviction, dirty flush lifecycle

1. Learning Objectives

  1. Implement a page manager that supports fixed-size slotted pages.
  2. Build a buffer pool with pin counts, dirty tracking, and CLOCK eviction.
  3. Enforce correctness invariants around frame reuse and flush behavior.
  4. Instrument hit rate, eviction, and flush latency metrics.

2. All Theory Needed (Per-Concept Breakdown)

Buffer Pool Correctness Model

Fundamentals

A buffer pool is not only a cache. It is a correctness boundary where logical page access must respect frame lifecycle rules: fetch, pin, mutate/read, unpin, and optional flush. Any violation can corrupt data or crash under load.

Deep Dive into the concept

Every frame is a stateful object with ownership semantics. Pin count indicates active use; dirty flag indicates persistence debt. Eviction chooses only unpinned frames; dirty victims require safe flush before reuse. This sequence seems straightforward but fails easily in edge paths: error returns that forget unpin, retries that double-unpin, or flush failures ignored during victim selection.
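The error-path failure mode above (an early return that forgets to unpin) can be guarded structurally. A minimal sketch in C, assuming hypothetical `fetch_page`/`unpin_page` helpers and a one-frame stand-in for the pool:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative names only, not a fixed API. */
typedef struct { int page_id; int pin_count; bool dirty; } Frame;

Frame the_frame;                     /* single-frame stand-in for a real pool */

Frame *fetch_page(int page_id) {
    the_frame.page_id = page_id;
    the_frame.pin_count++;
    return &the_frame;
}

void unpin_page(Frame *f, bool dirty) {
    assert(f->pin_count > 0);        /* double-unpin is an invariant violation */
    f->pin_count--;
    if (dirty) f->dirty = true;
}

int do_work(Frame *f) { (void)f; return 0; }  /* may fail in a real system */

/* Single-exit discipline: once fetch succeeds, every path reaches exactly
 * one unpin, so an early error return cannot leak the pin. */
int update_page(int page_id) {
    Frame *f = fetch_page(page_id);
    if (f == NULL)
        return -1;                   /* nothing pinned yet: plain return is safe */
    int rc = do_work(f);             /* even on failure we fall through to unpin */
    unpin_page(f, /*dirty=*/rc == 0);
    return rc;
}
```

The same shape works with `goto cleanup` when several pinned resources are in flight.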

CLOCK policy offers practical balance between complexity and behavior. It approximates recency with low overhead by rotating a hand and clearing reference bits. Compared to exact LRU, CLOCK reduces metadata churn and lock contention. Regardless of policy, correctness first: no pinned eviction, no dirty reuse without successful writeback.
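A minimal sketch of CLOCK victim selection, assuming an illustrative `ClockFrame` layout (pin count plus reference bit); a real pool would also hold page data, the page map, and locking:

```c
#include <assert.h>
#include <stdbool.h>

#define NFRAMES 4

/* Per-frame metadata relevant to CLOCK; field names are assumptions. */
typedef struct { int pin_count; bool refbit; } ClockFrame;

typedef struct {
    ClockFrame frames[NFRAMES];
    int hand;                       /* next frame the sweep inspects */
} Clock;

/* Sweep the hand: skip pinned frames, clear set refbits (second chance),
 * and stop at the first unpinned frame whose refbit is already clear.
 * Returns the victim index, or -1 if every frame stayed pinned for two laps. */
int clock_pick_victim(Clock *c) {
    for (int steps = 0; steps < 2 * NFRAMES; steps++) {
        ClockFrame *f = &c->frames[c->hand];
        int idx = c->hand;
        c->hand = (c->hand + 1) % NFRAMES;
        if (f->pin_count > 0)
            continue;               /* correctness: never evict a pinned frame */
        if (f->refbit) {
            f->refbit = false;      /* give the frame a second chance */
            continue;
        }
        return idx;
    }
    return -1;                      /* all pinned: caller must wait or fail */
}
```

The two-lap bound matters: the first lap may only clear refbits, so a single pass is not enough to prove no victim exists.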

Observability should be built-in: per-frame pin lifetime, eviction retries, dirty percentage, and flush queue depth. These metrics explain throughput cliffs better than hit rate alone. For example, a high hit rate can hide severe tail latency caused by synchronous dirty flushes on critical paths.

How this fits into the projects

  • Primary in this project.
  • Required by WAL and SQL projects for stable runtime behavior.

Definitions & key terms

  • Frame: In-memory container for one page.
  • Pin count: Number of active holders of a frame.
  • Dirty page: Modified but not durable page.

Mental model diagram

Request page 42 -> hit?  yes -> pin++ -> use -> unpin (pin--)
                         no  -> choose victim (unpinned only)
                             -> if dirty, flush victim first
                             -> read page 42 into frame -> pin++

How it works (step-by-step, with invariants and failure modes)

  1. Lookup page ID in frame table.
  2. Hit: increment pin and return frame.
  3. Miss: select unpinned victim.
  4. Flush victim if dirty.
  5. Load requested page and pin.
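The five steps above can be sketched end-to-end. This toy pool uses assumed names, a linear-scan lookup instead of a hash table, and an in-memory array standing in for the page file, so the pin and dirty guards stay visible:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NFRAMES 2
#define NPAGES  8
#define PAGE_SZ 16

typedef struct {
    int  page_id;                  /* -1 while the frame is free */
    int  pin_count;
    bool dirty;
    char data[PAGE_SZ];
} Frame;

typedef struct {
    Frame frames[NFRAMES];
    char  disk[NPAGES][PAGE_SZ];   /* in-memory stand-in for the page file */
} Pool;

void pool_init(Pool *bp) {
    memset(bp, 0, sizeof *bp);
    for (int i = 0; i < NFRAMES; i++) bp->frames[i].page_id = -1;
}

/* Step 1: frame-table lookup (linear scan stands in for a hash map). */
Frame *lookup(Pool *bp, int page_id) {
    for (int i = 0; i < NFRAMES; i++)
        if (bp->frames[i].page_id == page_id) return &bp->frames[i];
    return NULL;
}

/* Step 3: any unpinned frame qualifies (policy-agnostic for brevity). */
int pick_victim(Pool *bp) {
    for (int i = 0; i < NFRAMES; i++)
        if (bp->frames[i].pin_count == 0) return i;
    return -1;
}

Frame *fetch(Pool *bp, int page_id) {
    Frame *f = lookup(bp, page_id);
    if (f) { f->pin_count++; return f; }            /* step 2: hit */

    int v = pick_victim(bp);
    if (v < 0) return NULL;                         /* all frames pinned */
    f = &bp->frames[v];

    if (f->dirty) {                                 /* step 4: flush victim */
        memcpy(bp->disk[f->page_id], f->data, PAGE_SZ);
        f->dirty = false;
    }
    memcpy(f->data, bp->disk[page_id], PAGE_SZ);    /* step 5: load and pin */
    f->page_id = page_id;
    f->pin_count = 1;
    return f;
}

void unpin(Frame *f, bool dirty) {
    assert(f->pin_count > 0);
    f->pin_count--;
    if (dirty) f->dirty = true;
}
```

A real implementation adds latching, a flush-failure path that refuses frame reuse, and metrics hooks around each step.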

Failure modes:

  • Pin leaks causing eviction starvation.
  • Dirty frame reused after failed flush.

Minimal concrete example

trace: fetch(10), fetch(11), unpin(10 dirty), fetch(12)
victim selection must not evict page 11 if pin>0

Common misconceptions

  • “Hit rate alone is enough.” -> Tail flush stalls still matter.
  • “Pinned pages can be force-evicted.” -> Unsafe.

Check-your-understanding questions

  1. Why is unpin discipline critical?
  2. Why can sequential scans hurt naive policies?
  3. Which metric detects pin leaks?

Check-your-understanding answers

  1. It prevents a frame from being reused or evicted while a caller still holds it.
  2. Scan pages displace the hot working set, evicting pages that will soon be needed again.
  3. Long-lived pin counts and repeated victim-selection retries.

Real-world applications

  • PostgreSQL shared buffers.
  • InnoDB buffer pool.

Where you’ll apply it

  • This project and Projects 5-7.

Key insights

Buffer pools are correctness-critical execution infrastructure.

Summary

Strong frame lifecycle contracts prevent subtle runtime corruption.

Homework/Exercises to practice the concept

  1. Define frame state transition table.
  2. Build workload that causes LRU scan pollution.
  3. Choose metrics for dashboard and justify each.

Solutions to the homework/exercises

  1. Include pinned/dirty/flushing transitions with illegal edges blocked.
  2. Alternate large scans with small hot set.
  3. Add hit rate, dirty %, pin lifetime, flush latency, eviction retries.
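For exercise 1, one way to encode the transition table is a 2-D array with illegal edges marked. A sketch in C with illustrative state and event names, using a single-holder pin model for brevity:

```c
#include <assert.h>

/* Frame lifecycle states from the exercise; names are illustrative. */
typedef enum { FREE, CLEAN_PINNED, CLEAN_UNPINNED,
               DIRTY_PINNED, DIRTY_UNPINNED, FLUSHING, N_STATES } FState;

typedef enum { EV_LOAD, EV_PIN, EV_UNPIN_CLEAN, EV_UNPIN_DIRTY,
               EV_FLUSH_START, EV_FLUSH_OK, EV_EVICT, N_EVENTS } FEvent;

/* transition[state][event] = next state, or -1 for an illegal edge.
 * Notable illegal edges: evicting while pinned, evicting while dirty. */
int transition[N_STATES][N_EVENTS] = {
  /*                 LOAD          PIN           UNPIN_CLN       UNPIN_DRT       FLUSH_ST  FLUSH_OK        EVICT */
  /* FREE        */ {CLEAN_PINNED, -1,           -1,             -1,             -1,       -1,             -1},
  /* CLEAN_PIN   */ {-1,           -1,           CLEAN_UNPINNED, DIRTY_UNPINNED, -1,       -1,             -1},
  /* CLEAN_UNPIN */ {-1,           CLEAN_PINNED, -1,             -1,             -1,       -1,             FREE},
  /* DIRTY_PIN   */ {-1,           -1,           DIRTY_UNPINNED, DIRTY_UNPINNED, -1,       -1,             -1},
  /* DIRTY_UNPIN */ {-1,           DIRTY_PINNED, -1,             -1,             FLUSHING, -1,             -1},
  /* FLUSHING    */ {-1,           -1,           -1,             -1,             -1,       CLEAN_UNPINNED, -1},
};

/* step() returns the next state, or -1 if the edge is illegal. */
int step(int s, FEvent e) { return transition[s][e]; }
```

A flush-failure event would add an edge back from FLUSHING to DIRTY_UNPINNED, never to FREE.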

3. Project Specification

3.1 What You Will Build

A storage subsystem exposing a page API, backed by a memory-resident buffer pool with a configurable eviction policy.

3.2 Functional Requirements

  1. Fetch and pin page by ID.
  2. Unpin page with optional dirty mark.
  3. Evict unpinned victim using policy.
  4. Flush dirty pages safely.
  5. Report runtime metrics.

3.3 Non-Functional Requirements

  • Performance: Stable latency under mixed workloads.
  • Reliability: No data corruption under flush failures.
  • Usability: Metrics and frame inspection available.

3.4 Example Usage / Output

run mixed workload -> inspect hit rate and flush stats

3.5 Data Formats / Schemas / Protocols

Frame table schema with page ID map, pin count, dirty flag, reference bit.

3.6 Edge Cases

All frames pinned, repeated page faults, flush error, cold-start scans.

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

make storage_engine_demo
./storage_engine_demo workload --profile mixed --ops 500000 --seed 11

3.7.2 Golden Path Demo (Deterministic)

Fixed workload profile and seed.

3.7.3 If CLI: exact terminal transcript

$ ./storage_engine_demo workload --profile mixed --ops 500000 --seed 11
hit_rate=0.934 evictions=27122 dirty_flushes=9180
$ ./storage_engine_demo inspect-frame 87
frame=87 page=14829 pin=0 dirty=1 refbit=0

4. Solution Architecture

4.1 High-Level Design

Query/Storage Calls -> Buffer Pool -> Page File I/O

4.2 Key Components

Component Responsibility Key Decisions
Frame table page->frame map hash lookup
Eviction policy victim choice CLOCK default
Flush manager durable writes batch-friendly
Metrics collector observability low-overhead counters

4.3 Data Structures (No Full Code)

Frame { page_id, pin_count, dirty, refbit, last_access }
BufferPool { frames[], clock_hand, page_map }
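One possible C rendering of these sketches. Field types, the fixed-size `page_map`, and `PAGE_SIZE` are assumptions; a real pool would use a dynamic hash table for the page map:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096

typedef struct {
    int32_t  page_id;      /* -1 when the frame holds no page */
    int32_t  pin_count;    /* >0 blocks eviction */
    bool     dirty;        /* true means writeback is owed before reuse */
    bool     refbit;       /* CLOCK second-chance bit */
    uint64_t last_access;  /* feeds metrics such as pin-lifetime tracking */
    uint8_t  data[PAGE_SIZE];
} Frame;

typedef struct {
    Frame   *frames;       /* frames[], allocated at pool size */
    size_t   nframes;
    size_t   clock_hand;   /* next frame the CLOCK sweep inspects */
    /* page_map: page_id -> frame index; a hash table in practice.
       A fixed array keeps this sketch dependency-free. */
    int32_t  page_map[1024];
} BufferPool;
```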

4.4 Algorithm Overview

Fetch path and victim-selection path with pin/dirty guards.


5. Implementation Guide

5.1 Development Environment Setup

make storage_engine_demo

5.2 Project Structure

project/
├── src/buffer_pool/
├── src/page/
├── tests/
└── tools/

5.3 The Core Question You’re Answering

How do I shape page access so performance stays high without breaking durability and correctness?

5.4 Concepts You Must Understand First

  • Frame lifecycle
  • Eviction policy behavior
  • Dirty writeback contracts

5.5 Questions to Guide Your Design

  • What happens when all frames are pinned?
  • How do you surface flush failures to callers?

5.6 Thinking Exercise

Draw state machine for one frame from free to dirty flush and reuse.

5.7 The Interview Questions They’ll Ask

  1. Why can LRU fail under scans?
  2. Why are pin counts non-negotiable?
  3. What metrics reveal cache health?
  4. How does dirty flush interact with WAL?

5.8 Hints in Layers

  • Build deterministic trace tests first.
  • Add assertion for illegal state transitions.
  • Defer optimization until metrics are stable.

5.9 Books That Will Help

Topic Book Chapter
Cache behavior Computer Systems: A Programmer’s Perspective Ch. 6
OS caching concepts Operating Systems: Three Easy Pieces cache/file chapters

5.10 Implementation Phases

  • Phase 1: fetch/unpin + frame table
  • Phase 2: CLOCK eviction
  • Phase 3: dirty flush + metrics

5.11 Key Implementation Decisions

Decision Options Recommendation Rationale
Policy LRU / CLOCK CLOCK simpler + robust
Flush mode sync per-evict / background assist hybrid reduces stalls

6. Testing Strategy

6.1 Test Categories

  • Unit: state transitions
  • Integration: mixed workload behavior
  • Fault injection: flush/write errors

6.2 Critical Test Cases

  1. No pinned eviction under stress.
  2. Dirty flush failure does not reuse frame.
  3. Metrics remain consistent with operation trace.

6.3 Test Data

Seeded request traces: random, scan-heavy, mixed.
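A seeded trace generator keeps runs replayable across machines. A sketch using xorshift (so results do not depend on the platform's `rand()`), with an assumed 80/20 shape for the mixed profile:

```c
#include <assert.h>
#include <stdint.h>

/* Deterministic trace generator: same seed, same request stream,
 * so a failing run replays exactly. */
typedef struct { uint64_t state; } Rng;   /* seed must be nonzero */

uint64_t rng_next(Rng *r) {
    r->state ^= r->state << 13;
    r->state ^= r->state >> 7;
    r->state ^= r->state << 17;
    return r->state;
}

/* Mixed profile (assumed shape): 80% of requests hit a small hot set,
 * 20% are uniform over the full page range. */
int next_page(Rng *r, int hot_pages, int total_pages) {
    if (rng_next(r) % 100 < 80)
        return (int)(rng_next(r) % (uint64_t)hot_pages);
    return (int)(rng_next(r) % (uint64_t)total_pages);
}
```

Scan-heavy traces fall out of the same generator by emitting long runs of sequential page IDs between hot-set bursts.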


7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall Symptom Solution
Forgotten unpin system stalls pin-leak detector
Ignored flush errors stale or lost updates fail frame reuse on write failure
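The pin-leak detector from the table can be as simple as stamping each first pin with a logical clock and scanning for overdue holders. A sketch with assumed names and thresholds:

```c
#include <assert.h>
#include <stdint.h>

/* Per-frame metadata the detector needs; names are illustrative. */
typedef struct {
    int      pin_count;
    uint64_t pinned_since;   /* logical time of the oldest outstanding pin */
} FrameMeta;

void note_pin(FrameMeta *f, uint64_t now) {
    if (f->pin_count++ == 0)
        f->pinned_since = now;   /* only the first holder starts the clock */
}

void note_unpin(FrameMeta *f) {
    assert(f->pin_count > 0);    /* double-unpin is itself a bug to surface */
    f->pin_count--;
}

/* Returns how many frames look leaked: pinned for longer than max_age. */
int count_suspect_pins(FrameMeta *fs, int n, uint64_t now, uint64_t max_age) {
    int suspects = 0;
    for (int i = 0; i < n; i++)
        if (fs[i].pin_count > 0 && now - fs[i].pinned_since > max_age)
            suspects++;
    return suspects;
}
```

Running this scan from the metrics collector turns "system stalls" into a named frame and call site long before eviction starvation hits.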

7.2 Debugging Strategies

  • Frame dump command for live diagnostics.
  • Log clock-hand movement and victim decisions.

7.3 Performance Traps

Global buffer lock can become contention bottleneck.


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add per-page access heatmap output.

8.2 Intermediate Extensions

  • Add scan-resistance policy (2Q-like behavior).

8.3 Advanced Extensions

  • Add partitioned buffer pool for reduced lock contention.

9. Real-World Connections

9.1 Industry Applications

Buffer pool quality often determines practical DB throughput.

  • PostgreSQL and InnoDB buffer manager behavior documentation.

9.2 Interview Relevance

Buffer manager design is common in systems and DB interviews.


10. Resources

10.1 Essential Reading

  • OS/file-cache chapters and database architecture docs.

10.2 Video Resources

  • Talks on DB buffer manager internals.

10.3 Tools & Documentation

  • perf, strace, sanitizer builds, custom trace visualizers.

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain frame lifecycle and illegal transitions.
  • I can explain policy tradeoffs with workload examples.

11.2 Implementation

  • No pinned evictions in stress tests.
  • Dirty flush behavior is safe under injected failures.

11.3 Growth

  • I can propose next policy improvement with benchmark evidence.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Fetch/unpin, basic eviction, dirty tracking

Full Completion:

  • Plus robust metrics and failure injection tests

Excellence:

  • Plus adaptive policy and contention reduction improvements