Project 10: NUMA-Aware Database Buffer Pool

Build a NUMA-aware buffer pool manager that places pages on the node where they are most accessed and schedules queries near those pages.

Quick Reference

| Attribute | Value |
|---|---|
| Difficulty | Level 4: Expert |
| Time Estimate | 4-6 weeks |
| Main Programming Language | C++ |
| Alternative Programming Languages | C, Rust |
| Coolness Level | Level 5: Pure Magic |
| Business Potential | Level 4: "Open Core" Infrastructure |
| Prerequisites | Concurrency, caching, NUMA placement, basic DB concepts |
| Key Topics | buffer pools, page replacement, NUMA placement, query scheduling |

1. Learning Objectives

By completing this project, you will:

  1. Explain why databases rely on buffer pools for performance.
  2. Implement a partitioned buffer pool with per-node caches.
  3. Track page access statistics and choose placement policies.
  4. Migrate pages between nodes when access patterns shift.
  5. Schedule queries to run near the data they access.
  6. Measure local hit rates vs remote hit rates.
  7. Provide deterministic benchmarks that mimic real workloads.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Buffer Pool Management and Page Replacement

Fundamentals

A buffer pool is the in-memory cache of database pages. It reduces disk I/O by keeping frequently accessed pages in memory. When the buffer pool is full, a replacement policy decides which page to evict. Common policies include LRU, CLOCK, and ARC. A buffer pool must track page metadata: pin counts, dirty flags, and access timestamps. Without careful management, queries stall waiting for I/O or contend on global locks. A NUMA-aware buffer pool partitions the cache by node to improve locality and scalability.

Deep Dive into the Concept

Databases store data on disk in fixed-size pages. Accessing disk is orders of magnitude slower than memory, so databases use a buffer pool to cache hot pages. Each page in the buffer pool has metadata: a page ID, a pointer to the memory region, a dirty flag (modified but not flushed), and a pin count indicating active users. When a query needs a page, the buffer pool checks if it is already in memory (hit) or must be fetched from disk (miss). On a miss, the buffer pool chooses a victim page to evict, possibly flushing it to disk if it is dirty.

Replacement policies are critical. LRU evicts the least recently used page but can be expensive to maintain. CLOCK approximates LRU with a circular buffer and a reference bit. ARC uses two lists to balance recency and frequency. For this project, you can start with CLOCK or LRU because they are easier to implement and understand. The goal is not to build a production DBMS but to model the core behaviors.
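As a concrete starting point, CLOCK victim selection can be sketched as below. This is a standalone toy: frame indices stand in for real frames, and the `referenced`/`pinned` bits would live in frame metadata in a real pool.

```cpp
#include <cstddef>
#include <vector>

// Toy CLOCK replacer: one reference bit per frame, a sweeping hand,
// and pinned frames that are skipped outright.
struct ClockReplacer {
    std::vector<bool> referenced;
    std::vector<bool> pinned;
    std::size_t hand = 0;

    explicit ClockReplacer(std::size_t n) : referenced(n, false), pinned(n, false) {}

    void on_access(std::size_t frame) { referenced[frame] = true; }

    // Caller must guarantee at least one unpinned frame exists.
    std::size_t pick_victim() {
        for (;;) {
            if (pinned[hand]) {
                hand = (hand + 1) % referenced.size();
            } else if (referenced[hand]) {
                referenced[hand] = false;  // second chance consumed
                hand = (hand + 1) % referenced.size();
            } else {
                std::size_t victim = hand;
                hand = (hand + 1) % referenced.size();
                return victim;
            }
        }
    }
};
```

Note that a recently accessed frame survives one full sweep of the hand; that single reference bit is what makes CLOCK a cheap approximation of LRU.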

Concurrency is another challenge. Multiple threads may access the buffer pool simultaneously, so you need latches (locks) for page frames and for global structures like the page table. Many systems use per-page latches and partitioned hash tables to reduce contention. In a NUMA-aware design, partitioning also aligns with node locality: each node has its own buffer pool partition and page table. This reduces cross-node lock contention and keeps hot pages local.
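A partitioned (sharded) page table of the kind described here might look like the sketch below; the shard count and plain `std::mutex` latches are illustrative choices, not tuned values.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Sharded page table: each shard has its own latch, so lookups that hash
// to different shards never contend on the same lock.
class ShardedPageTable {
    static constexpr std::size_t kShards = 16;  // illustrative shard count
    struct Shard {
        std::mutex latch;
        std::unordered_map<uint64_t, std::size_t> map;  // page id -> frame index
    };
    std::array<Shard, kShards> shards_;
    Shard& shard_for(uint64_t pid) { return shards_[pid % kShards]; }

public:
    void insert(uint64_t pid, std::size_t frame) {
        Shard& s = shard_for(pid);
        std::lock_guard<std::mutex> g(s.latch);
        s.map[pid] = frame;
    }
    bool lookup(uint64_t pid, std::size_t& frame_out) {
        Shard& s = shard_for(pid);
        std::lock_guard<std::mutex> g(s.latch);
        auto it = s.map.find(pid);
        if (it == s.map.end()) return false;
        frame_out = it->second;
        return true;
    }
};
```

In the NUMA-aware version, you would go one step further and give each node its own table instance so shards also align with node locality.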

Dirty page handling is also key. Evicting a dirty page requires a flush to disk, which is expensive. Many systems use background flushers to write dirty pages asynchronously. For this project, you can simulate I/O by adding a fixed delay for page fetch/flush, which makes the performance impact visible in your benchmarks.
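One way to simulate I/O cost as suggested above; the delays are arbitrary placeholders, not measured device latencies.

```cpp
#include <chrono>
#include <thread>

// Fixed artificial delays standing in for disk reads and flushes, so that
// miss and flush costs show up in benchmark timings.
inline void simulated_disk_read()  { std::this_thread::sleep_for(std::chrono::microseconds(100)); }
inline void simulated_disk_flush() { std::this_thread::sleep_for(std::chrono::microseconds(150)); }
```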

The buffer pool design is a microcosm of systems engineering: you must balance locality, concurrency, and correctness. The NUMA-aware extension adds another dimension by aligning partitions with nodes and tracking which node accesses which pages most frequently.

How This Fits Into the Project

This concept provides the basic buffer pool design that your NUMA-aware system builds on. Without correct page management, placement optimizations are meaningless.

Definitions & Key Terms

  • Buffer pool -> In-memory cache of database pages.
  • Page -> Fixed-size unit of storage in a database.
  • Pin count -> Number of active users of a page.
  • Dirty page -> Page modified in memory but not written to disk.
  • Replacement policy -> Algorithm for choosing eviction victims.

Mental Model Diagram (ASCII)

Disk Pages -> Buffer Pool Frames
[PageID] [Frame] [Dirty?] [PinCount]

Replacement policy chooses victim when full

How It Works (Step-by-Step)

  1. Query requests page X.
  2. Buffer pool lookup: hit or miss.
  3. If miss, select victim and evict (flush if dirty).
  4. Load page X into a frame and update metadata.
  5. Increment pin count while in use.
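The steps above can be sketched end-to-end. This toy pool uses LRU eviction and counts simulated flushes instead of doing real I/O; all names are illustrative.

```cpp
#include <cstddef>
#include <cstdint>
#include <list>
#include <stdexcept>
#include <unordered_map>
#include <vector>

using PageId = uint64_t;

struct Frame {
    PageId id = 0;
    bool dirty = false;
    int pin_count = 0;
};

// Toy buffer pool showing the hit/miss/evict path with LRU replacement.
class MiniPool {
public:
    explicit MiniPool(std::size_t capacity) : cap_(capacity) {}

    // Steps 1-5 above: look up, evict on miss, load, pin. Returns true on a hit.
    bool fetch(PageId pid) {
        auto it = table_.find(pid);
        if (it != table_.end()) {               // hit
            touch(it->second);
            frames_[it->second].pin_count++;
            hits++;
            return true;
        }
        misses++;                               // miss
        std::size_t slot;
        if (frames_.size() < cap_) {            // free frame still available
            slot = frames_.size();
            frames_.push_back({});
            lru_.push_front(slot);
        } else {
            slot = evict();                     // invariant: never a pinned frame
        }
        frames_[slot] = {pid, false, 1};        // "load" the page and pin it
        table_[pid] = slot;
        return false;
    }

    void unpin(PageId pid, bool dirty) {
        std::size_t slot = table_.at(pid);
        frames_[slot].pin_count--;
        frames_[slot].dirty = frames_[slot].dirty || dirty;
    }

    int hits = 0, misses = 0, flushes = 0;

private:
    void touch(std::size_t slot) {              // move-to-front on access
        lru_.remove(slot);
        lru_.push_front(slot);
    }
    std::size_t evict() {
        // Scan from the least-recently-used end, skipping pinned frames.
        for (auto it = lru_.rbegin(); it != lru_.rend(); ++it) {
            std::size_t slot = *it;
            if (frames_[slot].pin_count == 0) {
                if (frames_[slot].dirty) flushes++;   // simulated flush
                table_.erase(frames_[slot].id);
                touch(slot);                          // the new page will be MRU
                return slot;
            }
        }
        throw std::runtime_error("all frames pinned");
    }
    std::size_t cap_;
    std::vector<Frame> frames_;
    std::list<std::size_t> lru_;
    std::unordered_map<PageId, std::size_t> table_;
};
```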

Invariants: Pinned pages cannot be evicted; dirty pages must be flushed before eviction.

Failure modes: deadlocks on latches, eviction of pinned pages, unbounded dirty pages.

Minimal Concrete Example

struct PageFrame {
    PageId   id;           // which disk page this frame holds
    bool     dirty;        // modified in memory, not yet flushed
    int      pin_count;    // active users; must be zero before eviction
    uint64_t last_access;  // timestamp used by the replacement policy
};

Common Misconceptions

  • “Buffer pool just caches pages” -> it also manages concurrency and eviction.
  • “LRU is always best” -> workload patterns can make LRU suboptimal.
  • “Dirty pages are cheap” -> flushing can dominate performance.

Check-Your-Understanding Questions

  1. Why can’t a pinned page be evicted?
  2. What is the difference between a hit and a miss?
  3. Why do databases use replacement policies like CLOCK?

Check-Your-Understanding Answers

  1. It is in active use and evicting would break correctness.
  2. A hit finds the page in memory; a miss requires disk I/O.
  3. CLOCK approximates LRU with lower overhead.

Real-World Applications

  • Every database system (PostgreSQL, MySQL, SQL Server).
  • Storage engines in embedded databases.

Where You’ll Apply It

Phase 1 of the implementation guide (Section 5.10): the base buffer pool, page table, and replacement policy.

References

  • “Database Internals” (Petrov) – Ch. 5
  • “Operating Systems: Three Easy Pieces” – Ch. 13

Key Insights

A buffer pool is not just a cache; it is the concurrency and eviction engine of a database.

Summary

Buffer pool management combines caching, eviction, and concurrency control. The design must handle hits, misses, dirty pages, and eviction while supporting concurrent queries. This foundation enables NUMA-aware placement strategies.

Homework/Exercises to Practice the Concept

  1. Implement an LRU list for page frames.
  2. Simulate a workload with a 90% hit rate and measure latency.
  3. Compare LRU and CLOCK on a sequential scan workload.

Solutions to the Homework/Exercises

  1. Use a doubly linked list with move-to-front on access.
  2. Even at a 90% hit rate, average latency is often dominated by misses, because a miss costs orders of magnitude more than a hit.
  3. On a scan larger than the pool, both policies evict every page (sequential flooding); CLOCK merely matches LRU behavior at lower bookkeeping cost. Scan-resistant policies such as ARC handle this case better.

2.2 NUMA-Aware Placement, Migration, and Query Scheduling

Fundamentals

In a NUMA-aware buffer pool, pages should be stored on the node where they are most frequently accessed. Queries should run on the same node to maximize local hits. This requires tracking per-page access statistics, placing pages in per-node partitions, and migrating pages when access patterns shift. Scheduling queries near data reduces remote access and coherence traffic. The system must balance locality with load, and avoid thrashing pages between nodes.

Deep Dive into the Concept

NUMA-aware placement extends buffer pool design with per-node partitions. Each node has its own pool of frames, allocated with a NUMA-aware allocator. When a page is first loaded, the system chooses a node based on the accessing thread or a placement policy (e.g., first-touch or frequency-based). This is similar to sharding but at the page level.

The key challenge is dynamic workloads. A page that is hot on node 0 today might be hot on node 1 tomorrow. If the page remains on node 0, queries on node 1 incur remote access. Therefore, the buffer pool should collect access statistics per page per node. A simple policy is to track counters and migrate a page when another node’s access count exceeds a threshold. Migration can be done with move_pages or by allocating a new frame on the target node and copying the page. The latter is often easier in user-space.

Query scheduling also matters. If a query touches many pages stored on node 0, it should be scheduled on node 0. You can implement a simple scheduler that looks at the dominant node of the pages a query needs. For the project, you can model queries as a list of page IDs and choose the node with the highest count. This approximation is sufficient to demonstrate the effect of locality-aware scheduling.
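The dominant-node heuristic described here fits in a few lines; the page-to-owner map is assumed to come from the buffer pool's metadata.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Choose an execution node for a query, modeled as the list of page IDs it
// touches: each page votes for its owning node, and the node with the most
// votes wins.
int pick_query_node(const std::vector<uint64_t>& pages,
                    const std::unordered_map<uint64_t, int>& page_owner,
                    int num_nodes) {
    std::vector<int> votes(num_nodes, 0);
    for (uint64_t pid : pages) {
        auto it = page_owner.find(pid);
        if (it != page_owner.end()) votes[it->second]++;  // unknown pages abstain
    }
    int best = 0;
    for (int n = 1; n < num_nodes; ++n)
        if (votes[n] > votes[best]) best = n;
    return best;
}
```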

However, migration has costs. Moving pages can be expensive and can cause thrashing if access patterns fluctuate rapidly. You need hysteresis: require a significant imbalance before moving a page, and enforce a cooldown period between migrations. You should also track migration statistics and report them to the user.
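A hysteresis-plus-cooldown policy along these lines might look as follows; the 2x factor and 1000-tick cooldown are illustrative values, and two nodes are assumed for brevity.

```cpp
#include <cstdint>

// Per-page migration decision with hysteresis (2x imbalance required)
// and a cooldown between migrations to prevent thrashing.
struct PagePlacement {
    int owner_node = 0;
    uint64_t access_count[2] = {0, 0};      // two NUMA nodes assumed
    uint64_t last_migration_tick = 0;
    int migrations = 0;

    // Called on every access from `node` at logical time `tick`.
    void on_access(int node, uint64_t tick) {
        access_count[node]++;
        if (node == owner_node) return;
        const uint64_t kCooldown = 1000;    // min ticks between migrations
        const uint64_t kFactor = 2;         // remote node must be 2x hotter
        if (tick - last_migration_tick < kCooldown) return;
        if (access_count[node] > kFactor * access_count[owner_node]) {
            owner_node = node;              // a real pool would copy the frame here
            last_migration_tick = tick;
            migrations++;
            access_count[0] = access_count[1] = 0;  // restart the statistics
        }
    }
};
```

Resetting the counters after a migration keeps stale history from masking a later shift back; reporting `migrations` gives you the churn statistic mentioned above.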

Finally, you must balance load. If all hot pages cluster on one node, that node’s buffer pool may become saturated while others are underutilized. A smart policy may choose to keep some pages remote to balance memory usage. This is why you should report both local hit rate and overall hit rate, and allow the user to tune the policy.

How This Fits Into the Project

This concept defines the NUMA-aware extensions: per-node partitions, page placement decisions, migration thresholds, and query scheduling. It is the key differentiator of this project.

Definitions & Key Terms

  • Local hit -> Page hit on the same node as the executing thread.
  • Remote hit -> Page hit on a different node.
  • Migration threshold -> Condition for moving a page to another node.
  • Hysteresis -> Preventing frequent oscillation by requiring a large change.
  • Query scheduling -> Choosing where to run a query based on data locality.

Mental Model Diagram (ASCII)

Node 0 Buffer Pool: Pages {1,2,3}
Node 1 Buffer Pool: Pages {4,5,6}

Query touches {1,2,3} -> run on Node 0
If page 4 becomes hot on Node 0 -> migrate to Node 0

How It Works (Step-by-Step)

  1. On page access, increment per-node access counters.
  2. If access imbalance exceeds threshold, migrate page.
  3. Assign queries to node with most of their pages.
  4. Track local vs remote hit rates.

Invariants: Migrations are throttled; pages have a clear owning node.

Failure modes: thrashing migrations, uneven pool sizes, cross-node contention.

Minimal Concrete Example

// Migrate only when another node's accesses exceed the owner's by 2x;
// the factor provides hysteresis against oscillation.
if (access_count[node] > 2 * access_count[owner]) {
    migrate_page_to(node);
}

Common Misconceptions

  • “Always migrate to hottest node” -> can cause thrashing.
  • “Scheduling is enough” -> data placement still matters.
  • “Remote hits are always bad” -> sometimes necessary for balance.

Check-Your-Understanding Questions

  1. Why do you need hysteresis in migration?
  2. How can you decide where to run a query?
  3. What metrics indicate NUMA-aware success?

Check-Your-Understanding Answers

  1. To prevent pages from bouncing between nodes.
  2. Choose the node with the most page accesses for that query.
  3. High local hit rate and low migration churn.

Real-World Applications

  • NUMA-aware buffer pools in modern database engines.
  • In-memory caches and analytics systems.

Where You’ll Apply It

Phases 2 and 3 of the implementation guide (Section 5.10): per-node partitioning, page migration, and locality-aware scheduling.

References

  • “Database Internals” – Ch. 5
  • “Computer Architecture” – Ch. 2

Key Insights

Locality is a moving target; NUMA-aware systems must adapt without thrashing.

Summary

NUMA-aware placement and scheduling improve locality but require adaptive policies. By tracking per-node access patterns, migrating pages with hysteresis, and scheduling queries near data, you can significantly improve performance on NUMA systems.

Homework/Exercises to Practice the Concept

  1. Simulate access patterns and decide when to migrate pages.
  2. Implement a simple query scheduler based on page locality.
  3. Measure local hit rate before and after migration.

Solutions to the Homework/Exercises

  1. Migrate when one node’s accesses exceed another by a chosen threshold.
  2. Count pages per node and choose the max.
  3. Local hit rate should increase after migration, with some migration cost.

3. Project Specification

3.1 What You Will Build

A NUMA-aware buffer pool manager that partitions pages per node, tracks access statistics, migrates hot pages, and schedules queries near their data.

Included: per-node pools, page table, replacement policy, migration, scheduler. Excluded: full SQL engine or durability logging.

3.2 Functional Requirements

  1. Implement per-node buffer pool partitions.
  2. Provide page lookup, pin, and unpin APIs.
  3. Implement a replacement policy (LRU or CLOCK).
  4. Track per-page access counts per node.
  5. Migrate pages when access patterns change.
  6. Schedule queries to nodes based on page locality.
  7. Report local vs remote hit rates and migration stats.

3.3 Non-Functional Requirements

  • Performance: local hit rate >95% on node-local workloads.
  • Reliability: correct pin/unpin and eviction behavior.
  • Usability: clear stats reporting.

3.4 Example Usage / Output

Buffer Pool Stats:
Node 0: 1.2M pages, 98.9% local hits
Node 1: 1.1M pages, 98.3% local hits
Remote hits: 1.2%
Page migrations: 842

3.5 Data Formats / Schemas / Protocols

JSON stats output:

{
  "node0": {"pages": 1200000, "local_hits_pct": 98.9},
  "node1": {"pages": 1100000, "local_hits_pct": 98.3},
  "remote_hits_pct": 1.2,
  "migrations": 842
}

3.6 Edge Cases

  • Workload shifts rapidly, causing migration thrashing.
  • Pool imbalance across nodes.
  • Queries that touch pages evenly across nodes.

3.7 Real World Outcome

You can run benchmark workloads and observe higher local hit rates, fewer remote accesses, and adaptive page placement over time.

3.7.1 How to Run (Copy/Paste)

cmake -S . -B build && cmake --build build
./build/buffer_pool_demo --pages=2000000 --workload=zipf --seed=42

3.7.2 Golden Path Demo (Deterministic)

$ ./buffer_pool_demo --pages=2000000 --workload=zipf --seed=42
Node 0: 98.9% local hits
Node 1: 98.3% local hits
Remote hits: 1.2%
Page migrations: 842

3.7.3 If CLI: Exact Terminal Transcript

$ ./buffer_pool_demo --pages=2000000 --workload=uniform --seed=42
Node 0: 92.1% local hits
Node 1: 91.7% local hits
Remote hits: 8.2%
Page migrations: 120

3.7.4 Failure Demo (Deterministic)

$ ./buffer_pool_demo --pages=-1
ERROR: invalid page count
EXIT: 1

Exit Codes:

  • 0 success
  • 1 invalid arguments
  • 2 initialization failure

4. Solution Architecture

4.1 High-Level Design

+--------------+   +--------------+   +------------------+
| Page Table   |-->| Per-Node Pool|-->| Scheduler/Stats  |
+--------------+   +--------------+   +------------------+

4.2 Key Components

| Component | Responsibility | Key Decisions |
|---|---|---|
| Page Table | Map page IDs to frames | sharded by node |
| Pool Manager | Manage frames and eviction | LRU vs CLOCK |
| Migration Engine | Move hot pages | threshold + hysteresis |
| Scheduler | Assign queries to nodes | locality heuristic |
| Stats | Track hit rates | per-node counters |

4.3 Data Structures (No Full Code)

struct PageFrame {
    PageId   id;                        // page held in this frame
    bool     dirty;                     // needs flush before eviction
    int      pin_count;                 // active users
    int      owner_node;                // node whose pool holds the frame
    uint64_t access_count[MAX_NODES];   // per-node access statistics
};

4.4 Algorithm Overview

Key Algorithm: Page Access

  1. Lookup page in node-local pool.
  2. If hit, increment local counter.
  3. If miss, fetch page and place in node-local pool.
  4. If access imbalance exceeds threshold, migrate page.

Complexity Analysis:

  • Time: O(1) average for page lookup
  • Space: O(total pages)

5. Implementation Guide

5.1 Development Environment Setup

sudo apt-get install -y build-essential cmake libnuma-dev

5.2 Project Structure

buffer_pool/
|-- src/
|   |-- buffer_pool.cpp
|   |-- page_table.cpp
|   |-- replacement.cpp
|   |-- migration.cpp
|   +-- scheduler.cpp
|-- include/
|   +-- buffer_pool.h
|-- tests/
|   +-- workloads.cpp
|-- CMakeLists.txt
+-- README.md

5.3 The Core Question You’re Answering

“How do real systems like databases exploit NUMA locality at scale?”

5.4 Concepts You Must Understand First

  1. Buffer pool design and replacement policies.
  2. NUMA placement and migration.
  3. Query scheduling based on data locality.

5.5 Questions to Guide Your Design

  1. What replacement policy will you use?
  2. How will you decide when to migrate pages?
  3. How will you schedule queries to nodes?

5.6 Thinking Exercise

Suppose a table is hot on node 0 during the day and node 1 at night. How should your buffer pool adapt?

5.7 The Interview Questions They’ll Ask

  1. “Why is NUMA important for databases?”
  2. “How do you choose which node holds a page?”
  3. “What is the cost of page migration?”

5.8 Hints in Layers

Hint 1: Start with a single-node buffer pool.

Hint 2: Add per-node partitions.

Hint 3: Add access counters and migration threshold.

Hint 4: Add a scheduler that chooses the dominant node.

5.9 Books That Will Help

| Topic | Book | Chapter |
|---|---|---|
| Buffer pools | “Database Internals” | Ch. 5 |
| NUMA placement | “Computer Architecture” | Ch. 2 |
| Concurrency | “C++ Concurrency in Action” | Ch. 2-4 |

5.10 Implementation Phases

Phase 1: Base Buffer Pool (2 weeks)

Goals: implement basic pool, page table, and replacement.

Tasks:

  1. Implement page lookup and pin/unpin.
  2. Add LRU or CLOCK replacement.

Checkpoint: workload runs with expected hit rate.

Phase 2: NUMA Partitioning (1 week)

Goals: split pool by node and allocate frames locally.

Tasks:

  1. Add per-node pools.
  2. Use NUMA-aware allocator.

Checkpoint: local hit rate improves under node-local workload.

Phase 3: Migration + Scheduling (1-2 weeks)

Goals: adapt to shifting workloads.

Tasks:

  1. Track access counts per node.
  2. Migrate pages with hysteresis.
  3. Schedule queries near data.

Checkpoint: migration improves locality without thrashing.

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Replacement | LRU vs CLOCK | CLOCK | lower overhead |
| Migration | aggressive vs conservative | conservative | avoid thrashing |
| Scheduling | static vs dynamic | dynamic | adapt to workload |


6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Page table correctness | lookup and pin/unpin |
| Integration Tests | Locality metrics | local vs remote hit rate |
| Edge Case Tests | Migration thrash | detect oscillations |

6.2 Critical Test Cases

  1. Local hit rate >95% on node-local workload.
  2. Migration rate remains bounded under shifting workloads.
  3. Invalid page counts exit with code 1.

6.3 Test Data

workload=zipf(0.9), pages=2M, seed=42
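The deterministic Zipf workload above can be generated with an inverse-CDF sketch like the one below (theta and seed per the spec; precomputing the CDF once is fine at this page count).

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

// Deterministic Zipf(theta) page-ID generator: precompute the CDF once,
// then binary-search it per sample. The same seed yields the same trace.
class ZipfGen {
    std::vector<double> cdf_;
    std::mt19937_64 rng_;
public:
    ZipfGen(uint64_t num_pages, double theta, uint64_t seed) : rng_(seed) {
        double sum = 0.0;
        for (uint64_t i = 1; i <= num_pages; ++i)
            sum += 1.0 / std::pow(double(i), theta);
        double acc = 0.0;
        cdf_.reserve(num_pages);
        for (uint64_t i = 1; i <= num_pages; ++i) {
            acc += (1.0 / std::pow(double(i), theta)) / sum;
            cdf_.push_back(acc);
        }
    }
    uint64_t next() {   // returns a page id in [0, num_pages)
        double r = std::uniform_real_distribution<double>(0.0, 1.0)(rng_);
        auto it = std::lower_bound(cdf_.begin(), cdf_.end(), r);
        if (it == cdf_.end()) --it;   // guard float rounding at the tail
        return uint64_t(it - cdf_.begin());
    }
};
```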

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution |
|---|---|---|
| Thrashing migrations | high migration count | add hysteresis |
| Global locks | poor scaling | shard locks per node |
| Imbalanced pools | node out of memory | allow spillover |

7.2 Debugging Strategies

  • Track per-page migration history.
  • Log per-node hit rates over time.

7.3 Performance Traps

  • Overly aggressive migration can reduce throughput.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add a CLI flag to choose replacement policy.
  • Add CSV export of hit/miss stats.

8.2 Intermediate Extensions

  • Add adaptive migration thresholds.
  • Support read-only replica pages.

8.3 Advanced Extensions

  • Integrate with real storage engine traces.
  • Add predictive scheduling based on query history.

9. Real-World Connections

9.1 Industry Applications

  • NUMA-aware databases (Oracle, SQL Server, modern engines).
  • In-memory analytics systems.
  • PostgreSQL – buffer pool implementation for study.
  • RocksDB – block cache design.

9.2 Interview Relevance

  • Discussing buffer pools and NUMA locality in systems interviews.

10. Resources

10.1 Essential Reading

  • “Database Internals” – Ch. 5
  • “Computer Architecture” – Ch. 2

10.2 Video Resources

  • Database buffer pool talks.

10.3 Tools & Documentation

  • numactl – placement policies.

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain buffer pool hit/miss behavior.
  • I can explain NUMA-aware page placement.

11.2 Implementation

  • Buffer pool functions correctly under load.
  • Local hit rate improves with NUMA-aware placement.

11.3 Growth

  • I can explain this design in an interview.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Single-node buffer pool with replacement policy.
  • Basic benchmark output.

Full Completion:

  • NUMA partitioning and locality stats.
  • Migration and scheduling policies implemented.

Excellence (Going Above & Beyond):

  • Adaptive migration policies and workload-aware scheduling.