Project 10: NUMA-Aware Database Buffer Pool

Build a NUMA-aware buffer pool manager that places pages on the node where they are most accessed and schedules queries near those pages.

Quick Reference

| Attribute | Value |
|---|---|
| Difficulty | Level 4: Expert |
| Time Estimate | 4-6 weeks |
| Main Programming Language | C++ |
| Alternative Programming Languages | C, Rust |
| Coolness Level | Level 5: Pure Magic |
| Business Potential | Level 4: "Open Core" Infrastructure |
| Prerequisites | Concurrency, caching, NUMA placement, basic DB concepts |
| Key Topics | buffer pools, page replacement, NUMA placement, query scheduling |

1. Learning Objectives

By completing this project, you will:

  1. Explain why databases rely on buffer pools for performance.
  2. Implement a partitioned buffer pool with per-node caches.
  3. Track page access statistics and choose placement policies.
  4. Migrate pages between nodes when access patterns shift.
  5. Schedule queries to run near the data they access.
  6. Measure local hit rates vs remote hit rates.
  7. Provide deterministic benchmarks that mimic real workloads.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Buffer Pool Management and Page Replacement

Fundamentals

A buffer pool is the in-memory cache of database pages. It reduces disk I/O by keeping frequently accessed pages in memory. When the buffer pool is full, a replacement policy decides which page to evict. Common policies include LRU, CLOCK, and ARC. A buffer pool must track page metadata: pin counts, dirty flags, and access timestamps. Without careful management, queries stall waiting for I/O or contend on global locks. A NUMA-aware buffer pool partitions the cache by node to improve locality and scalability.

Deep Dive into the Concept

Databases store data on disk in fixed-size pages. Accessing disk is orders of magnitude slower than memory, so databases use a buffer pool to cache hot pages. Each page in the buffer pool has metadata: a page ID, a pointer to the memory region, a dirty flag (modified but not flushed), and a pin count indicating active users. When a query needs a page, the buffer pool checks if it is already in memory (hit) or must be fetched from disk (miss). On a miss, the buffer pool chooses a victim page to evict, possibly flushing it to disk if it is dirty.

Replacement policies are critical. LRU evicts the least recently used page but can be expensive to maintain. CLOCK approximates LRU with a circular buffer and a reference bit. ARC uses two lists to balance recency and frequency. For this project, you can start with CLOCK or LRU because they are easier to implement and understand. The goal is not to build a production DBMS but to model the core behaviors.
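As a concrete starting point, CLOCK victim selection can be sketched as below. This is a standalone toy: frame indices stand in for real frames, and the `referenced`/`pinned` bits would live in frame metadata in a real pool.

```cpp
#include <cstddef>
#include <vector>

// Toy CLOCK replacer: one reference bit per frame, a sweeping hand,
// and pinned frames that are skipped outright.
struct ClockReplacer {
    std::vector<bool> referenced;
    std::vector<bool> pinned;
    std::size_t hand = 0;

    explicit ClockReplacer(std::size_t n) : referenced(n, false), pinned(n, false) {}

    void on_access(std::size_t frame) { referenced[frame] = true; }

    // Caller must guarantee at least one unpinned frame exists.
    std::size_t pick_victim() {
        for (;;) {
            if (pinned[hand]) {
                hand = (hand + 1) % referenced.size();
            } else if (referenced[hand]) {
                referenced[hand] = false;  // second chance consumed
                hand = (hand + 1) % referenced.size();
            } else {
                std::size_t victim = hand;
                hand = (hand + 1) % referenced.size();
                return victim;
            }
        }
    }
};
```

Note that a recently accessed frame survives one full sweep of the hand; that single reference bit is what makes CLOCK a cheap approximation of LRU.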

Concurrency is another challenge. Multiple threads may access the buffer pool simultaneously, so you need latches (locks) for page frames and for global structures like the page table. Many systems use per-page latches and partitioned hash tables to reduce contention. In a NUMA-aware design, partitioning also aligns with node locality: each node has its own buffer pool partition and page table. This reduces cross-node lock contention and keeps hot pages local.
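A partitioned (sharded) page table of the kind described here might look like the sketch below; the shard count and plain `std::mutex` latches are illustrative choices, not tuned values.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Sharded page table: each shard has its own latch, so lookups that hash
// to different shards never contend on the same lock.
class ShardedPageTable {
    static constexpr std::size_t kShards = 16;  // illustrative shard count
    struct Shard {
        std::mutex latch;
        std::unordered_map<uint64_t, std::size_t> map;  // page id -> frame index
    };
    std::array<Shard, kShards> shards_;
    Shard& shard_for(uint64_t pid) { return shards_[pid % kShards]; }

public:
    void insert(uint64_t pid, std::size_t frame) {
        Shard& s = shard_for(pid);
        std::lock_guard<std::mutex> g(s.latch);
        s.map[pid] = frame;
    }
    bool lookup(uint64_t pid, std::size_t& frame_out) {
        Shard& s = shard_for(pid);
        std::lock_guard<std::mutex> g(s.latch);
        auto it = s.map.find(pid);
        if (it == s.map.end()) return false;
        frame_out = it->second;
        return true;
    }
};
```

In the NUMA-aware version, you would go one step further and give each node its own table instance so shards also align with node locality.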

Dirty page handling is also key. Evicting a dirty page requires a flush to disk, which is expensive. Many systems use background flushers to write dirty pages asynchronously. For this project, you can simulate I/O by adding a fixed delay for page fetch/flush, which makes the performance impact visible in your benchmarks.
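One way to simulate I/O cost as suggested above; the delays are arbitrary placeholders, not measured device latencies.

```cpp
#include <chrono>
#include <thread>

// Fixed artificial delays standing in for disk reads and flushes, so that
// miss and flush costs show up in benchmark timings.
inline void simulated_disk_read()  { std::this_thread::sleep_for(std::chrono::microseconds(100)); }
inline void simulated_disk_flush() { std::this_thread::sleep_for(std::chrono::microseconds(150)); }
```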

The buffer pool design is a microcosm of systems engineering: you must balance locality, concurrency, and correctness. The NUMA-aware extension adds another dimension by aligning partitions with nodes and tracking which node accesses which pages most frequently.

How This Fits Into the Project

This concept provides the basic buffer pool design that your NUMA-aware system builds on. Without correct page management, placement optimizations are meaningless.

Definitions & Key Terms

  • Buffer pool -> In-memory cache of database pages.
  • Page -> Fixed-size unit of storage in a database.
  • Pin count -> Number of active users of a page.
  • Dirty page -> Page modified in memory but not written to disk.
  • Replacement policy -> Algorithm for choosing eviction victims.

Mental Model Diagram (ASCII)

Disk Pages -> Buffer Pool Frames
[PageID] [Frame] [Dirty?] [PinCount]

Replacement policy chooses victim when full

How It Works (Step-by-Step)

  1. Query requests page X.
  2. Buffer pool lookup: hit or miss.
  3. If miss, select victim and evict (flush if dirty).
  4. Load page X into a frame and update metadata.
  5. Increment pin count while in use.
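The steps above can be sketched end-to-end. This toy pool uses LRU eviction and counts simulated flushes instead of doing real I/O; all names are illustrative.

```cpp
#include <cstddef>
#include <cstdint>
#include <list>
#include <stdexcept>
#include <unordered_map>
#include <vector>

using PageId = uint64_t;

struct Frame {
    PageId id = 0;
    bool dirty = false;
    int pin_count = 0;
};

// Toy buffer pool showing the hit/miss/evict path with LRU replacement.
class MiniPool {
public:
    explicit MiniPool(std::size_t capacity) : cap_(capacity) {}

    // Steps 1-5 above: look up, evict on miss, load, pin. Returns true on a hit.
    bool fetch(PageId pid) {
        auto it = table_.find(pid);
        if (it != table_.end()) {               // hit
            touch(it->second);
            frames_[it->second].pin_count++;
            hits++;
            return true;
        }
        misses++;                               // miss
        std::size_t slot;
        if (frames_.size() < cap_) {            // free frame still available
            slot = frames_.size();
            frames_.push_back({});
            lru_.push_front(slot);
        } else {
            slot = evict();                     // invariant: never a pinned frame
        }
        frames_[slot] = {pid, false, 1};        // "load" the page and pin it
        table_[pid] = slot;
        return false;
    }

    void unpin(PageId pid, bool dirty) {
        std::size_t slot = table_.at(pid);
        frames_[slot].pin_count--;
        frames_[slot].dirty = frames_[slot].dirty || dirty;
    }

    int hits = 0, misses = 0, flushes = 0;

private:
    void touch(std::size_t slot) {              // move-to-front on access
        lru_.remove(slot);
        lru_.push_front(slot);
    }
    std::size_t evict() {
        // Scan from the least-recently-used end, skipping pinned frames.
        for (auto it = lru_.rbegin(); it != lru_.rend(); ++it) {
            std::size_t slot = *it;
            if (frames_[slot].pin_count == 0) {
                if (frames_[slot].dirty) flushes++;   // simulated flush
                table_.erase(frames_[slot].id);
                touch(slot);                          // the new page will be MRU
                return slot;
            }
        }
        throw std::runtime_error("all frames pinned");
    }
    std::size_t cap_;
    std::vector<Frame> frames_;
    std::list<std::size_t> lru_;
    std::unordered_map<PageId, std::size_t> table_;
};
```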

Invariants: Pinned pages cannot be evicted; dirty pages must be flushed before eviction.

Failure modes: deadlocks on latches, eviction of pinned pages, unbounded dirty pages.

Minimal Concrete Example

struct PageFrame {
    PageId   id;           // which disk page this frame holds
    bool     dirty;        // modified in memory, not yet flushed
    int      pin_count;    // active users; must be zero before eviction
    uint64_t last_access;  // timestamp used by the replacement policy
};

Common Misconceptions

  • “Buffer pool just caches pages” -> it also manages concurrency and eviction.
  • “LRU is always best” -> workload patterns can make LRU suboptimal.
  • “Dirty pages are cheap” -> flushing can dominate performance.

Check-Your-Understanding Questions

  1. Why can’t a pinned page be evicted?
  2. What is the difference between a hit and a miss?
  3. Why do databases use replacement policies like CLOCK?

Check-Your-Understanding Answers

  1. It is in active use and evicting would break correctness.
  2. A hit finds the page in memory; a miss requires disk I/O.
  3. CLOCK approximates LRU with lower overhead.

Real-World Applications

  • Every database system (PostgreSQL, MySQL, SQL Server).
  • Storage engines in embedded databases.

Where You’ll Apply It

Phase 1 of the implementation guide (Section 5.10): the base buffer pool, page table, and replacement policy.

References

  • “Database Internals” (Petrov) – Ch. 5
  • “Operating Systems: Three Easy Pieces” – Ch. 13

Key Insights

A buffer pool is not just a cache; it is the concurrency and eviction engine of a database.

Summary

Buffer pool management combines caching, eviction, and concurrency control. The design must handle hits, misses, dirty pages, and eviction while supporting concurrent queries. This foundation enables NUMA-aware placement strategies.

Homework/Exercises to Practice the Concept

  1. Implement an LRU list for page frames.
  2. Simulate a workload with a 90% hit rate and measure latency.
  3. Compare LRU and CLOCK on a sequential scan workload.

Solutions to the Homework/Exercises

  1. Use a doubly linked list with move-to-front on access.
  2. Even at a 90% hit rate, average latency is often dominated by misses, because a miss costs orders of magnitude more than a hit.
  3. On a scan larger than the pool, both policies evict every page (sequential flooding); CLOCK merely matches LRU behavior at lower bookkeeping cost. Scan-resistant policies such as ARC handle this case better.

2.2 NUMA-Aware Placement, Migration, and Query Scheduling

Fundamentals

In a NUMA-aware buffer pool, pages should be stored on the node where they are most frequently accessed. Queries should run on the same node to maximize local hits. This requires tracking per-page access statistics, placing pages in per-node partitions, and migrating pages when access patterns shift. Scheduling queries near data reduces remote access and coherence traffic. The system must balance locality with load, and avoid thrashing pages between nodes.

Deep Dive into the Concept

NUMA-aware placement extends buffer pool design with per-node partitions. Each node has its own pool of frames, allocated with a NUMA-aware allocator. When a page is first loaded, the system chooses a node based on the accessing thread or a placement policy (e.g., first-touch or frequency-based). This is similar to sharding but at the page level.

The key challenge is dynamic workloads. A page that is hot on node 0 today might be hot on node 1 tomorrow. If the page remains on node 0, queries on node 1 incur remote access. Therefore, the buffer pool should collect access statistics per page per node. A simple policy is to track counters and migrate a page when another node’s access count exceeds a threshold. Migration can be done with move_pages or by allocating a new frame on the target node and copying the page. The latter is often easier in user-space.

Query scheduling also matters. If a query touches many pages stored on node 0, it should be scheduled on node 0. You can implement a simple scheduler that looks at the dominant node of the pages a query needs. For the project, you can model queries as a list of page IDs and choose the node with the highest count. This approximation is sufficient to demonstrate the effect of locality-aware scheduling.
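The dominant-node heuristic described here fits in a few lines; the page-to-owner map is assumed to come from the buffer pool's metadata.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Choose an execution node for a query, modeled as the list of page IDs it
// touches: each page votes for its owning node, and the node with the most
// votes wins.
int pick_query_node(const std::vector<uint64_t>& pages,
                    const std::unordered_map<uint64_t, int>& page_owner,
                    int num_nodes) {
    std::vector<int> votes(num_nodes, 0);
    for (uint64_t pid : pages) {
        auto it = page_owner.find(pid);
        if (it != page_owner.end()) votes[it->second]++;  // unknown pages abstain
    }
    int best = 0;
    for (int n = 1; n < num_nodes; ++n)
        if (votes[n] > votes[best]) best = n;
    return best;
}
```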

However, migration has costs. Moving pages can be expensive and can cause thrashing if access patterns fluctuate rapidly. You need hysteresis: require a significant imbalance before moving a page, and enforce a cooldown period between migrations. You should also track migration statistics and report them to the user.
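A hysteresis-plus-cooldown policy along these lines might look as follows; the 2x factor and 1000-tick cooldown are illustrative values, and two nodes are assumed for brevity.

```cpp
#include <cstdint>

// Per-page migration decision with hysteresis (2x imbalance required)
// and a cooldown between migrations to prevent thrashing.
struct PagePlacement {
    int owner_node = 0;
    uint64_t access_count[2] = {0, 0};      // two NUMA nodes assumed
    uint64_t last_migration_tick = 0;
    int migrations = 0;

    // Called on every access from `node` at logical time `tick`.
    void on_access(int node, uint64_t tick) {
        access_count[node]++;
        if (node == owner_node) return;
        const uint64_t kCooldown = 1000;    // min ticks between migrations
        const uint64_t kFactor = 2;         // remote node must be 2x hotter
        if (tick - last_migration_tick < kCooldown) return;
        if (access_count[node] > kFactor * access_count[owner_node]) {
            owner_node = node;              // a real pool would copy the frame here
            last_migration_tick = tick;
            migrations++;
            access_count[0] = access_count[1] = 0;  // restart the statistics
        }
    }
};
```

Resetting the counters after a migration keeps stale history from masking a later shift back; reporting `migrations` gives you the churn statistic mentioned above.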

Finally, you must balance load. If all hot pages cluster on one node, that node’s buffer pool may become saturated while others are underutilized. A smart policy may choose to keep some pages remote to balance memory usage. This is why you should report both local hit rate and overall hit rate, and allow the user to tune the policy.

How This Fits Into the Project

This concept defines the NUMA-aware extensions: per-node partitions, page placement decisions, migration thresholds, and query scheduling. It is the key differentiator of this project.

Definitions & Key Terms

  • Local hit -> Page hit on the same node as the executing thread.
  • Remote hit -> Page hit on a different node.
  • Migration threshold -> Condition for moving a page to another node.
  • Hysteresis -> Preventing frequent oscillation by requiring a large change.
  • Query scheduling -> Choosing where to run a query based on data locality.

Mental Model Diagram (ASCII)

Node 0 Buffer Pool: Pages {1,2,3}
Node 1 Buffer Pool: Pages {4,5,6}

Query touches {1,2,3} -> run on Node 0
If page 4 becomes hot on Node 0 -> migrate to Node 0

How It Works (Step-by-Step)

  1. On page access, increment per-node access counters.
  2. If access imbalance exceeds threshold, migrate page.
  3. Assign queries to node with most of their pages.
  4. Track local vs remote hit rates.

Invariants: Migrations are throttled; pages have a clear owning node.

Failure modes: thrashing migrations, uneven pool sizes, cross-node contention.

Minimal Concrete Example

// Migrate only when another node's accesses exceed the owner's by 2x;
// the factor provides hysteresis against oscillation.
if (access_count[node] > 2 * access_count[owner]) {
    migrate_page_to(node);
}

Common Misconceptions

  • “Always migrate to hottest node” -> can cause thrashing.
  • “Scheduling is enough” -> data placement still matters.
  • “Remote hits are always bad” -> sometimes necessary for balance.

Check-Your-Understanding Questions

  1. Why do you need hysteresis in migration?
  2. How can you decide where to run a query?
  3. What metrics indicate NUMA-aware success?

Check-Your-Understanding Answers

  1. To prevent pages from bouncing between nodes.
  2. Choose the node with the most page accesses for that query.
  3. High local hit rate and low migration churn.

Real-World Applications

  • NUMA-aware buffer pools in modern database engines.
  • In-memory caches and analytics systems.

Where You’ll Apply It

Phases 2 and 3 of the implementation guide (Section 5.10): per-node partitioning, page migration, and locality-aware scheduling.

References

  • “Database Internals” – Ch. 5
  • “Computer Architecture” – Ch. 2

Key Insights

Locality is a moving target; NUMA-aware systems must adapt without thrashing.

Summary

NUMA-aware placement and scheduling improve locality but require adaptive policies. By tracking per-node access patterns, migrating pages with hysteresis, and scheduling queries near data, you can significantly improve performance on NUMA systems.

Homework/Exercises to Practice the Concept

  1. Simulate access patterns and decide when to migrate pages.
  2. Implement a simple query scheduler based on page locality.
  3. Measure local hit rate before and after migration.

Solutions to the Homework/Exercises

  1. Migrate when one node’s accesses exceed another by a chosen threshold.
  2. Count pages per node and choose the max.
  3. Local hit rate should increase after migration, with some migration cost.

3. Project Specification

3.1 What You Will Build

A NUMA-aware buffer pool manager that partitions pages per node, tracks access statistics, migrates hot pages, and schedules queries near their data.

Included: per-node pools, page table, replacement policy, migration, scheduler. Excluded: full SQL engine or durability logging.

3.2 Functional Requirements

  1. Implement per-node buffer pool partitions.
  2. Provide page lookup, pin, and unpin APIs.
  3. Implement a replacement policy (LRU or CLOCK).
  4. Track per-page access counts per node.
  5. Migrate pages when access patterns change.
  6. Schedule queries to nodes based on page locality.
  7. Report local vs remote hit rates and migration stats.

3.3 Non-Functional Requirements

  • Performance: local hit rate >95% on node-local workloads.
  • Reliability: correct pin/unpin and eviction behavior.
  • Usability: clear stats reporting.

3.4 Example Usage / Output

Buffer Pool Stats:
Node 0: 1.2M pages, 98.9% local hits
Node 1: 1.1M pages, 98.3% local hits
Remote hits: 1.2%
Page migrations: 842

3.5 Data Formats / Schemas / Protocols

JSON stats output:

{
  "node0": {"pages": 1200000, "local_hits_pct": 98.9},
  "node1": {"pages": 1100000, "local_hits_pct": 98.3},
  "remote_hits_pct": 1.2,
  "migrations": 842
}

3.6 Edge Cases

  • Workload shifts rapidly, causing migration thrashing.
  • Pool imbalance across nodes.
  • Queries that touch pages evenly across nodes.

3.7 Real World Outcome

You can run benchmark workloads and observe higher local hit rates, fewer remote accesses, and adaptive page placement over time.

3.7.1 How to Run (Copy/Paste)

cmake -S . -B build && cmake --build build
./build/buffer_pool_demo --pages=2000000 --workload=zipf --seed=42

3.7.2 Golden Path Demo (Deterministic)

$ ./buffer_pool_demo --pages=2000000 --workload=zipf --seed=42
Node 0: 98.9% local hits
Node 1: 98.3% local hits
Remote hits: 1.2%
Page migrations: 842

3.7.3 If CLI: Exact Terminal Transcript

$ ./buffer_pool_demo --pages=2000000 --workload=uniform --seed=42
Node 0: 92.1% local hits
Node 1: 91.7% local hits
Remote hits: 8.2%
Page migrations: 120

3.7.4 Failure Demo (Deterministic)

$ ./buffer_pool_demo --pages=-1
ERROR: invalid page count
EXIT: 1

Exit Codes:

  • 0 success
  • 1 invalid arguments
  • 2 initialization failure

4. Solution Architecture

4.1 High-Level Design

+--------------+   +--------------+   +------------------+
| Page Table   |-->| Per-Node Pool|-->| Scheduler/Stats  |
+--------------+   +--------------+   +------------------+

4.2 Key Components

| Component | Responsibility | Key Decisions |
|---|---|---|
| Page Table | Map page IDs to frames | sharded by node |
| Pool Manager | Manage frames and eviction | LRU vs CLOCK |
| Migration Engine | Move hot pages | threshold + hysteresis |
| Scheduler | Assign queries to nodes | locality heuristic |
| Stats | Track hit rates | per-node counters |

4.3 Data Structures (No Full Code)

struct PageFrame {
    PageId   id;                        // page held in this frame
    bool     dirty;                     // needs flush before eviction
    int      pin_count;                 // active users
    int      owner_node;                // node whose pool holds the frame
    uint64_t access_count[MAX_NODES];   // per-node access statistics
};

4.4 Algorithm Overview

Key Algorithm: Page Access

  1. Lookup page in node-local pool.
  2. If hit, increment local counter.
  3. If miss, fetch page and place in node-local pool.
  4. If access imbalance exceeds threshold, migrate page.

Complexity Analysis:

  • Time: O(1) average for page lookup
  • Space: O(total pages)

5. Implementation Guide

5.1 Development Environment Setup

sudo apt-get install -y build-essential cmake libnuma-dev

5.2 Project Structure

buffer_pool/
|-- src/
|   |-- buffer_pool.cpp
|   |-- page_table.cpp
|   |-- replacement.cpp
|   |-- migration.cpp
|   +-- scheduler.cpp
|-- include/
|   +-- buffer_pool.h
|-- tests/
|   +-- workloads.cpp
|-- CMakeLists.txt
+-- README.md

5.3 The Core Question You’re Answering

“How do real systems like databases exploit NUMA locality at scale?”

5.4 Concepts You Must Understand First

  1. Buffer pool design and replacement policies.
  2. NUMA placement and migration.
  3. Query scheduling based on data locality.

5.5 Questions to Guide Your Design

  1. What replacement policy will you use?
  2. How will you decide when to migrate pages?
  3. How will you schedule queries to nodes?

5.6 Thinking Exercise

Suppose a table is hot on node 0 during the day and node 1 at night. How should your buffer pool adapt?

5.7 The Interview Questions They’ll Ask

  1. “Why is NUMA important for databases?”
  2. “How do you choose which node holds a page?”
  3. “What is the cost of page migration?”

5.8 Hints in Layers

Hint 1: Start with a single-node buffer pool.

Hint 2: Add per-node partitions.

Hint 3: Add access counters and migration threshold.

Hint 4: Add a scheduler that chooses the dominant node.

5.9 Books That Will Help

| Topic | Book | Chapter |
|---|---|---|
| Buffer pools | “Database Internals” | Ch. 5 |
| NUMA placement | “Computer Architecture” | Ch. 2 |
| Concurrency | “C++ Concurrency in Action” | Ch. 2-4 |

5.10 Implementation Phases

Phase 1: Base Buffer Pool (2 weeks)

Goals: implement basic pool, page table, and replacement.

Tasks:

  1. Implement page lookup and pin/unpin.
  2. Add LRU or CLOCK replacement.

Checkpoint: workload runs with expected hit rate.

Phase 2: NUMA Partitioning (1 week)

Goals: split pool by node and allocate frames locally.

Tasks:

  1. Add per-node pools.
  2. Use NUMA-aware allocator.

Checkpoint: local hit rate improves under node-local workload.

Phase 3: Migration + Scheduling (1-2 weeks)

Goals: adapt to shifting workloads.

Tasks:

  1. Track access counts per node.
  2. Migrate pages with hysteresis.
  3. Schedule queries near data.

Checkpoint: migration improves locality without thrashing.

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Replacement | LRU vs CLOCK | CLOCK | lower overhead |
| Migration | aggressive vs conservative | conservative | avoid thrashing |
| Scheduling | static vs dynamic | dynamic | adapt to workload |


6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Page table correctness | lookup and pin/unpin |
| Integration Tests | Locality metrics | local vs remote hit rate |
| Edge Case Tests | Migration thrash | detect oscillations |

6.2 Critical Test Cases

  1. Local hit rate >95% on node-local workload.
  2. Migration rate remains bounded under shifting workloads.
  3. Invalid page counts exit with code 1.

6.3 Test Data

workload=zipf(0.9), pages=2M, seed=42
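The deterministic Zipf workload above can be generated with an inverse-CDF sketch like the one below (theta and seed per the spec; precomputing the CDF once is fine at this page count).

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

// Deterministic Zipf(theta) page-ID generator: precompute the CDF once,
// then binary-search it per sample. The same seed yields the same trace.
class ZipfGen {
    std::vector<double> cdf_;
    std::mt19937_64 rng_;
public:
    ZipfGen(uint64_t num_pages, double theta, uint64_t seed) : rng_(seed) {
        double sum = 0.0;
        for (uint64_t i = 1; i <= num_pages; ++i)
            sum += 1.0 / std::pow(double(i), theta);
        double acc = 0.0;
        cdf_.reserve(num_pages);
        for (uint64_t i = 1; i <= num_pages; ++i) {
            acc += (1.0 / std::pow(double(i), theta)) / sum;
            cdf_.push_back(acc);
        }
    }
    uint64_t next() {   // returns a page id in [0, num_pages)
        double r = std::uniform_real_distribution<double>(0.0, 1.0)(rng_);
        auto it = std::lower_bound(cdf_.begin(), cdf_.end(), r);
        if (it == cdf_.end()) --it;   // guard float rounding at the tail
        return uint64_t(it - cdf_.begin());
    }
};
```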

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution |
|---|---|---|
| Thrashing migrations | high migration count | add hysteresis |
| Global locks | poor scaling | shard locks per node |
| Imbalanced pools | node out of memory | allow spillover |

7.2 Debugging Strategies

  • Track per-page migration history.
  • Log per-node hit rates over time.

7.3 Performance Traps

  • Overly aggressive migration can reduce throughput.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add a CLI flag to choose replacement policy.
  • Add CSV export of hit/miss stats.

8.2 Intermediate Extensions

  • Add adaptive migration thresholds.
  • Support read-only replica pages.

8.3 Advanced Extensions

  • Integrate with real storage engine traces.
  • Add predictive scheduling based on query history.

9. Real-World Connections

9.1 Industry Applications

  • NUMA-aware databases (Oracle, SQL Server, modern engines).
  • In-memory analytics systems.
  • PostgreSQL – buffer pool implementation for study.
  • RocksDB – block cache design.

9.2 Interview Relevance

  • Discussing buffer pools and NUMA locality in systems interviews.

10. Resources

10.1 Essential Reading

  • “Database Internals” – Ch. 5
  • “Computer Architecture” – Ch. 2

10.2 Video Resources

  • Database buffer pool talks.

10.3 Tools & Documentation

  • numactl – placement policies.

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain buffer pool hit/miss behavior.
  • I can explain NUMA-aware page placement.

11.2 Implementation

  • Buffer pool functions correctly under load.
  • Local hit rate improves with NUMA-aware placement.

11.3 Growth

  • I can explain this design in an interview.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Single-node buffer pool with replacement policy.
  • Basic benchmark output.

Full Completion:

  • NUMA partitioning and locality stats.
  • Migration and scheduling policies implemented.

Excellence (Going Above & Beyond):

  • Adaptive migration policies and workload-aware scheduling.