Project 8: Simple Database Engine

Build a relational database engine with B-tree indexing, page-based storage, SQL parsing, and ACID transactions to master systems programming.

Quick Reference

Attribute Value
Difficulty Expert
Time Estimate 1-2 months
Language C++
Prerequisites B-tree data structures, file I/O, SQL basics, transaction concepts
Key Topics Page storage, buffer pool, B-tree, SQL parsing, WAL, ACID

1. Learning Objectives

By completing this project, you will:

  • Implement page-based storage: Manage fixed-size pages on disk
  • Build a buffer pool: Cache pages in memory with eviction policies
  • Implement B-tree indexing: Insert, search, and split operations
  • Parse SQL statements: Tokenize and parse SELECT, INSERT, CREATE TABLE
  • Understand ACID properties: Atomicity, Consistency, Isolation, Durability
  • Implement Write-Ahead Logging: Ensure durability through logging
  • Design storage formats: Row layout, page format, metadata storage

2. Theoretical Foundation

2.1 Core Concepts

Why Databases Use Pages:

Traditional File I/O:                Database Page I/O:
┌─────────────────────┐             ┌─────────────────────┐
│ Read 50 bytes here  │             │ Read 4KB page       │
│ Read 30 bytes there │             │ (contains many rows)│
│ Read 100 bytes...   │             │                     │
│ Many small I/Os     │             │ One I/O, use cache  │
└─────────────────────┘             └─────────────────────┘

Disk seeks are ~10ms. Reading 4KB takes ~0.05ms.
Once you seek, read more! Pages amortize seek cost.
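The amortization claim is easy to sanity-check with the numbers above (a sketch; real latencies vary widely by device):

```cpp
#include <cassert>

// Rough I/O arithmetic behind the diagram, using the latencies quoted above
// (~10ms per seek, ~0.05ms to read 4KB). Values are illustrative.
constexpr double SEEK_MS = 10.0;
constexpr double READ_4KB_MS = 0.05;

// Fetching 100 rows with one random read each, vs. all 100 on one page:
constexpr double scattered_ms = 100 * (SEEK_MS + READ_4KB_MS);  // ~1005 ms
constexpr double one_page_ms = SEEK_MS + READ_4KB_MS;           // ~10 ms
```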

Page Layout (Slotted Page):

┌─────────────────────────────────────────────────────────┐
│  Page Header (16 bytes)                                  │
│  ┌─────────────────────────────────────────────────────┐ │
│  │ page_id │ num_slots │ free_space_ptr │ page_type   │ │
│  └─────────────────────────────────────────────────────┘ │
├─────────────────────────────────────────────────────────┤
│  Slot Directory (grows down)                             │
│  ┌───────┬───────┬───────┬───────┬─────────────────────┐ │
│  │Slot 0 │Slot 1 │Slot 2 │Slot 3 │  ...                │ │
│  │off,len│off,len│off,len│off,len│                     │ │
│  └───────┴───────┴───────┴───────┴─────────────────────┘ │
├─────────────────────────────────────────────────────────┤
│                                                          │
│              Free Space (grows both ways)                │
│                                                          │
├─────────────────────────────────────────────────────────┤
│  Record Data (grows up from bottom)                      │
│  ┌─────────────────────────────────────────────────────┐ │
│  │ Record 3 │ Record 2 │ Record 1 │ Record 0 │         │ │
│  └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
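The layout above can be sketched in a few lines of C++. This is a simplified illustration, not the project's final format: a real page keeps the slot directory inside the 4KB buffer and checks free space against both growing regions.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <optional>

// Minimal slotted-page sketch of the layout above (illustrative names).
constexpr uint16_t PAGE_SIZE = 4096;

struct Slot { uint16_t offset; uint16_t length; };

struct SlottedPage {
    uint16_t num_slots = 0;
    uint16_t free_end = PAGE_SIZE;   // record data grows down from the page end
    Slot slots[128];                 // slot directory (simplified: fixed array)
    uint8_t data[PAGE_SIZE];

    // Copy a record into free space; returns its slot id, or nullopt if full.
    std::optional<uint16_t> insert(const void* rec, uint16_t len) {
        if (num_slots >= 128 || len > free_end) return std::nullopt;
        free_end -= len;
        std::memcpy(data + free_end, rec, len);
        slots[num_slots] = Slot{free_end, len};
        return num_slots++;
    }

    const uint8_t* get(uint16_t slot_id) const { return data + slots[slot_id].offset; }
};
```

Because readers go through the slot directory, records can later be moved or compacted within the page without changing their slot ids.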

B-Tree Structure:

                          ┌─────────────────────┐
                          │   Root (Internal)   │
                          │   [30] [60]        │
                          └──────────┬──────────┘
                    ┌────────────────┼────────────────┐
                    ▼                ▼                ▼
           ┌───────────────┐ ┌───────────────┐ ┌───────────────┐
           │  Internal     │ │  Internal     │ │  Internal     │
           │  [10][20]     │ │  [40][50]     │ │  [70][80]     │
           └───────┬───────┘ └───────┬───────┘ └───────┬───────┘
          ┌────┬────┐       ┌────┬────┐       ┌────┬────┐
          ▼    ▼    ▼       ▼    ▼    ▼       ▼    ▼    ▼
        ┌───┐┌───┐┌───┐   ┌───┐┌───┐┌───┐   ┌───┐┌───┐┌───┐
        │5,8││12,││25,│   │35,││45,││55,│   │65,││75,││85,│
        │   ││18 ││28 │   │38 ││48 ││58 │   │68 ││78 ││90 │
        └───┘└───┘└───┘   └───┘└───┘└───┘   └───┘└───┘└───┘
           Leaf Nodes (contain actual row pointers or data)

B-Tree Properties:
- All leaves at same depth
- Nodes are at least half full
- Search: O(log n)
- Insert: O(log n) + possible splits
- Perfect for disk: wide and shallow

Buffer Pool Architecture:

┌─────────────────────────────────────────────────────────────────┐
│                        Buffer Pool                               │
│                                                                  │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                    Page Frames (Memory)                    │  │
│  │  ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐   │  │
│  │  │Page 12 │ │Page 45 │ │Page 3  │ │Page 89 │ │ Empty  │   │  │
│  │  │dirty=1 │ │dirty=0 │ │dirty=1 │ │dirty=0 │ │        │   │  │
│  │  │pin=2   │ │pin=0   │ │pin=1   │ │pin=0   │ │        │   │  │
│  │  └────────┘ └────────┘ └────────┘ └────────┘ └────────┘   │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                  │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │  Page Table: page_id -> frame_index                      │    │
│  │  { 12 -> 0, 45 -> 1, 3 -> 2, 89 -> 3 }                  │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                  │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │  LRU List (for eviction): [45, 89, 3, 12]               │    │
│  │  (45 is least recently used, evict first)               │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
          │                                    ▲
          │ Evict/Flush                        │ Fetch
          ▼                                    │
┌─────────────────────────────────────────────────────────────────┐
│                      Disk (data.db file)                         │
│  Page 0  │ Page 1  │ Page 2  │ Page 3  │ ... │ Page N  │        │
└─────────────────────────────────────────────────────────────────┘
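The LRU list in the diagram is the heart of eviction. A minimal sketch of that bookkeeping, assuming a list-plus-map design (pinning, dirty-page flushing, and the frame table are omitted here):

```cpp
#include <cassert>
#include <list>
#include <unordered_map>

// LRU bookkeeping for the eviction list in the diagram above.
class LruList {
    std::list<int> order_;  // front = least recently used
    std::unordered_map<int, std::list<int>::iterator> pos_;
public:
    void touch(int page_id) {  // call on every page access
        auto it = pos_.find(page_id);
        if (it != pos_.end()) order_.erase(it->second);
        order_.push_back(page_id);
        pos_[page_id] = std::prev(order_.end());
    }
    int evict() {  // pop the LRU victim (assumes the list is non-empty)
        int victim = order_.front();
        order_.pop_front();
        pos_.erase(victim);
        return victim;
    }
};
```

The map makes `touch` O(1); a real buffer pool would also skip victims whose pin count is nonzero.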

Write-Ahead Logging (WAL):

                    Transaction Flow
                          │
       ┌──────────────────┼──────────────────┐
       │                  │                  │
       ▼                  ▼                  ▼
  ┌─────────┐       ┌─────────┐       ┌─────────┐
  │ BEGIN   │       │ UPDATE  │       │ COMMIT  │
  └────┬────┘       └────┬────┘       └────┬────┘
       │                 │                  │
       │                 ▼                  │
       │    ┌────────────────────────┐      │
       │    │ 1. Write to WAL first  │      │
       │    │    (log record)        │      │
       │    │ 2. Then modify page    │      │
       │    │    (in buffer pool)    │      │
       │    └────────────────────────┘      │
       │                                    │
       │                 ┌──────────────────┘
       │                 ▼
       │    ┌────────────────────────┐
       │    │ 3. Flush WAL to disk   │
       │    │ 4. Return success      │
       │    │ (pages written later)  │
       │    └────────────────────────┘
       │
       ▼
┌─────────────────────────────────────────────────────┐
│  WAL File (sequential writes - fast!)                │
│                                                      │
│  [LSN 1][LSN 2][LSN 3][LSN 4][LSN 5][LSN 6]...      │
│  BEGIN  UPDATE UPDATE UPDATE COMMIT  BEGIN          │
│  txn1   txn1   txn2   txn1   txn1    txn3           │
└─────────────────────────────────────────────────────┘

Recovery: Replay WAL to reconstruct committed transactions
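The recovery decision can be sketched as a single analysis pass over the WAL pictured above: one scan classifies transactions as committed (redo) or not (undo/discard). Record and field names here are illustrative, not a fixed format.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Sketch of the recovery analysis pass: scan the log once, remember who
// committed. Every other transaction that appears must be undone/discarded.
enum class RecType { BEGIN, UPDATE, COMMIT };
struct Rec { uint64_t lsn; uint32_t txn; RecType type; };

std::set<uint32_t> committed_txns(const std::vector<Rec>& wal) {
    std::set<uint32_t> committed;
    for (const Rec& r : wal)
        if (r.type == RecType::COMMIT) committed.insert(r.txn);
    return committed;
}
```

Applied to the six records in the diagram, only txn1 has a COMMIT record, so only its changes are replayed.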

2.2 Why This Matters

Databases are the foundation of almost every application:

  • Web applications: PostgreSQL, MySQL
  • Mobile apps: SQLite
  • Enterprise: Oracle, SQL Server
  • Analytics: ClickHouse, DuckDB

Building a database teaches:

  • How data is organized on disk
  • Why indexing matters (and when it doesn’t)
  • How transactions maintain consistency
  • Why databases seem magical but aren’t

2.3 Historical Context

Database systems evolved from file systems in the 1960s:

  • 1970: E.F. Codd proposes relational model
  • 1974: System R (IBM) implements SQL
  • 1979: Oracle first commercial RDBMS
  • 1986: POSTGRES (predecessor of PostgreSQL) begins at Berkeley
  • 1995: MySQL released
  • 2000: SQLite released (embedded)
  • 2017: CockroachDB, TiDB (distributed SQL)

Modern databases still use core concepts from the 1970s:

  • B-trees (invented 1972)
  • Write-ahead logging (1980s)
  • MVCC (formalized in 1981; used by PostgreSQL, InnoDB)

2.4 Common Misconceptions

Misconception 1: “Databases keep everything in memory” Reality: Databases use a buffer pool to cache pages. On memory pressure, pages are evicted to disk. This allows working with datasets larger than RAM.

Misconception 2: “B-trees are just binary trees” Reality: B-trees have high fanout (many children per node). A typical node might have 100-500 keys. This reduces tree height and disk I/O.

Misconception 3: “Indexes always make queries faster” Reality: Indexes speed up reads but slow down writes. Each INSERT must update every index. Too many indexes hurt performance.

Misconception 4: “COMMIT is when data is written to disk” Reality: COMMIT ensures the WAL is flushed to disk. Data pages may be written later (checkpoint). WAL enables fast commits with durability.


3. Project Specification

3.1 What You Will Build

A simple relational database that:

  • Stores data in a page-based file format
  • Uses a buffer pool for caching
  • Supports B-tree indexes for fast lookup
  • Parses and executes basic SQL
  • Provides ACID transactions with WAL

3.2 Functional Requirements

Requirement Description
FR-1 CREATE TABLE with INT, TEXT column types
FR-2 INSERT INTO with values
FR-3 SELECT * and SELECT columns with WHERE on primary key
FR-4 Primary key constraint with B-tree index
FR-5 BEGIN, COMMIT, ROLLBACK transactions
FR-6 Persist data across restarts
FR-7 Recover from crash using WAL

3.3 Non-Functional Requirements

Requirement Description
NFR-1 Handle tables with 1M+ rows
NFR-2 Primary key lookup in O(log n)
NFR-3 Buffer pool size configurable
NFR-4 Corruption detection (checksums)
NFR-5 Clean separation of components

3.4 Example Usage / Output

$ ./mydb test.db

mydb> CREATE TABLE users (
    id INT PRIMARY KEY,
    name TEXT,
    email TEXT
);
Table 'users' created.

mydb> INSERT INTO users VALUES (1, 'Alice', 'alice@example.com');
1 row inserted.

mydb> INSERT INTO users VALUES (2, 'Bob', 'bob@example.com');
1 row inserted.

mydb> INSERT INTO users VALUES (3, 'Charlie', 'charlie@example.com');
1 row inserted.

mydb> SELECT * FROM users;
+----+---------+---------------------+
| id | name    | email               |
+----+---------+---------------------+
|  1 | Alice   | alice@example.com   |
|  2 | Bob     | bob@example.com     |
|  3 | Charlie | charlie@example.com |
+----+---------+---------------------+
3 rows returned.

mydb> SELECT name, email FROM users WHERE id = 2;
+------+-----------------+
| name | email           |
+------+-----------------+
| Bob  | bob@example.com |
+------+-----------------+
1 row returned (index scan).

mydb> BEGIN;
Transaction started.

mydb> INSERT INTO users VALUES (4, 'David', 'david@example.com');
1 row inserted.

mydb> SELECT * FROM users WHERE id = 4;
+----+-------+-------------------+
| id | name  | email             |
+----+-------+-------------------+
|  4 | David | david@example.com |
+----+-------+-------------------+
1 row returned.

mydb> ROLLBACK;
Transaction rolled back.

mydb> SELECT * FROM users WHERE id = 4;
0 rows returned.  -- David is gone!

mydb> .stats
Buffer pool: 128 pages (512 KB)
  Hits: 1,234
  Misses: 45
  Hit ratio: 96.5%
Tables: 1
  users: 3 rows, 1 page

mydb> .exit
Goodbye!

3.5 Real World Outcome

$ ./mydb test.db

# Load 1 million rows
mydb> .load users_1m.csv
Loaded 1,000,000 rows in 45 seconds.

# Point query (uses B-tree index)
mydb> SELECT * FROM users WHERE id = 500000;
1 row returned in 0.2ms (index scan, 4 page reads).

# Range query (requires table scan for now)
mydb> SELECT COUNT(*) FROM users WHERE name LIKE 'A%';
38,421 rows matched in 2.3 seconds (sequential scan).

# Transaction test
mydb> BEGIN;
mydb> UPDATE users SET email = 'test@test.com' WHERE id = 1;
mydb> -- Simulate crash: kill -9 $PID

$ ./mydb test.db
Recovering from WAL...
Replayed 0 committed transactions.
Discarded 1 uncommitted transaction.
Recovery complete.

mydb> SELECT * FROM users WHERE id = 1;
-- Original email, not 'test@test.com'!

4. Solution Architecture

4.1 High-Level Design

┌─────────────────────────────────────────────────────────────────────┐
│                        Database Engine                               │
│                                                                      │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                      SQL Layer                               │    │
│  │                                                              │    │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐    │    │
│  │  │  Parser  │─►│ Analyzer │─►│ Planner  │─►│ Executor │    │    │
│  │  └──────────┘  └──────────┘  └──────────┘  └──────────┘    │    │
│  │                                                              │    │
│  └──────────────────────────────┬───────────────────────────────┘    │
│                                 │                                    │
│  ┌──────────────────────────────▼───────────────────────────────┐    │
│  │                    Access Methods                             │    │
│  │                                                               │    │
│  │  ┌─────────────────┐     ┌─────────────────────────────────┐ │    │
│  │  │ Table (Heap)    │     │ Index (B-Tree)                  │ │    │
│  │  │ - Insert row    │     │ - Insert key                    │ │    │
│  │  │ - Scan all rows │     │ - Search key                    │ │    │
│  │  └────────┬────────┘     └────────┬────────────────────────┘ │    │
│  │           │                       │                          │    │
│  └───────────┼───────────────────────┼──────────────────────────┘    │
│              │                       │                               │
│  ┌───────────▼───────────────────────▼──────────────────────────┐    │
│  │                     Buffer Pool                               │    │
│  │                                                               │    │
│  │  getPage(page_id) -> Page*                                   │    │
│  │  markDirty(page_id)                                          │    │
│  │  unpin(page_id)                                              │    │
│  │  flushPage(page_id)                                          │    │
│  └───────────────────────────────┬───────────────────────────────┘    │
│                                  │                                    │
│  ┌───────────────────────────────▼───────────────────────────────┐    │
│  │                     Disk Manager                              │    │
│  │                                                               │    │
│  │  readPage(page_id, data)                                     │    │
│  │  writePage(page_id, data)                                    │    │
│  │  allocatePage() -> page_id                                   │    │
│  └───────────────────────────────┬───────────────────────────────┘    │
│                                  │                                    │
└──────────────────────────────────┼────────────────────────────────────┘
                                   ▼
                             ┌──────────┐
                             │ data.db  │
                             │ wal.log  │
                             └──────────┘

4.2 Key Components

Component Responsibility
Parser Tokenize SQL, build AST
Analyzer Validate table/column names, resolve types
Planner Decide access method (index vs scan)
Executor Execute the plan, return results
Table Heap file operations (insert, scan)
Index B-tree operations (insert, search)
Buffer Pool Page caching with LRU eviction
Disk Manager Raw page I/O
WAL Write-ahead log for durability

4.3 Data Structures

Page Header:

struct PageHeader {
    uint32_t page_id;
    uint16_t num_slots;
    uint16_t free_space_offset;
    uint16_t free_space_end;
    PageType page_type;  // HEAP_PAGE, BTREE_INTERNAL, BTREE_LEAF
    uint32_t checksum;
};

B-Tree Node:

struct BTreeNode {
    bool is_leaf;
    uint16_t num_keys;
    uint32_t keys[MAX_KEYS];        // For internal: separator keys
    uint32_t children[MAX_KEYS+1];  // For internal: child page ids
    uint32_t values[MAX_KEYS];      // For leaf: row ids / data pointers
    uint32_t next_leaf;             // For leaf: sibling pointer
};
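The node above implies a fanout. A quick back-of-envelope check, assuming 4-byte keys, 4-byte page ids, and an 8-byte header (all illustrative constants), shows why B-trees stay "wide and shallow":

```cpp
#include <cassert>
#include <cstddef>

// How many keys fit in one 4KB page, and how tall is the tree for 1M keys?
constexpr std::size_t PAGE_SIZE = 4096;
constexpr std::size_t NODE_HEADER = 8;  // is_leaf + num_keys + padding
// Each key costs 8 bytes (key + child/value pointer), plus one extra child.
constexpr std::size_t MAX_KEYS = (PAGE_SIZE - NODE_HEADER - 4) / 8;

constexpr std::size_t tree_height(std::size_t n) {
    std::size_t h = 1, capacity = MAX_KEYS;               // one full leaf
    while (capacity < n) { capacity *= MAX_KEYS + 1; ++h; }
    return h;  // levels needed when every node is full
}
```

With ~510 keys per node, a million keys need only three levels, so a point lookup touches three or four pages.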

Buffer Pool Frame:

struct Frame {
    Page data;
    page_id_t page_id;
    bool is_dirty;
    int pin_count;
};

4.4 Algorithm Overview

B-Tree Search:

search(node, key):
    if node.is_leaf:
        for i in 0 to node.num_keys:
            if node.keys[i] == key:
                return node.values[i]
        return NOT_FOUND

    // Find child to descend
    i = 0
    while i < node.num_keys and key >= node.keys[i]:
        i++

    child = buffer_pool.getPage(node.children[i])
    return search(child, key)
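The same logic as a runnable in-memory sketch: pages are replaced by a vector of nodes and MAX_KEYS is shrunk so the example stays small (names and layout are illustrative).

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

constexpr int MAX_KEYS = 4;
struct Node {
    bool is_leaf;
    int num_keys;
    uint32_t keys[MAX_KEYS];
    uint32_t children[MAX_KEYS + 1];  // internal: indexes into the node vector
    uint32_t values[MAX_KEYS];        // leaf: row ids
};

std::optional<uint32_t> search(const std::vector<Node>& nodes,
                               int node_idx, uint32_t key) {
    const Node& n = nodes[node_idx];
    if (n.is_leaf) {
        for (int i = 0; i < n.num_keys; i++)
            if (n.keys[i] == key) return n.values[i];
        return std::nullopt;
    }
    int i = 0;
    while (i < n.num_keys && key >= n.keys[i]) i++;  // pick the subtree
    return search(nodes, n.children[i], key);
}
```

Note the `key >= keys[i]` convention: keys equal to a separator live in the right subtree, which must match the split logic.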

B-Tree Insert:

insert(key, value):
    if root is null:
        root = create_leaf_node()
        root.insert(key, value)
        return

    result = insert_recursive(root, key, value)

    if result.split_occurred:
        // Root split, create new root
        new_root = create_internal_node()
        new_root.keys[0] = result.split_key
        new_root.children[0] = root
        new_root.children[1] = result.new_node
        root = new_root

insert_recursive(node, key, value):
    if node.is_leaf:
        if node.has_space():
            node.insert_sorted(key, value)
            return NO_SPLIT
        else:
            // Split leaf: redistribute existing entries plus the new one
            new_leaf = create_leaf_node()
            (split_key, entries_for_new) = node.split_at_middle(key, value)
            new_leaf.insert_all(entries_for_new)
            new_leaf.next_leaf = node.next_leaf   // keep the sibling chain intact
            node.next_leaf = new_leaf
            return SPLIT(split_key, new_leaf)

    // Internal node
    child_index = find_child(node, key)
    child = getPage(node.children[child_index])
    result = insert_recursive(child, key, value)

    if result.split_occurred:
        if node.has_space():
            node.insert_child(child_index, result.split_key, result.new_node)
            return NO_SPLIT
        else:
            // Split internal node: redistribute existing children plus the
            // new (split_key, new_node) entry; the middle key is pushed up
            new_internal = create_internal_node()
            (push_up_key, entries_for_new) = node.split_at_middle(result.split_key, result.new_node)
            new_internal.insert_all(entries_for_new)
            return SPLIT(push_up_key, new_internal)

    return NO_SPLIT

5. Implementation Guide

5.1 Development Environment Setup

Required:

  • C++17 or later
  • CMake 3.14+
  • File system with fsync support

Recommended:

# Install development tools
sudo apt install cmake g++ valgrind

# For visualization (optional)
sudo apt install graphviz  # Provides 'dot' to render B-tree diagrams

5.2 Project Structure

database_engine/
├── include/
│   ├── common/
│   │   ├── types.hpp        # page_id_t, slot_id_t, etc.
│   │   └── config.hpp       # PAGE_SIZE, BUFFER_POOL_SIZE
│   ├── storage/
│   │   ├── disk_manager.hpp
│   │   ├── page.hpp
│   │   └── buffer_pool.hpp
│   ├── access/
│   │   ├── table_heap.hpp
│   │   └── btree_index.hpp
│   ├── catalog/
│   │   ├── schema.hpp
│   │   └── catalog.hpp
│   ├── sql/
│   │   ├── lexer.hpp
│   │   ├── parser.hpp
│   │   └── executor.hpp
│   └── recovery/
│       └── wal.hpp
├── src/
│   ├── storage/
│   ├── access/
│   ├── catalog/
│   ├── sql/
│   ├── recovery/
│   └── main.cpp
├── tests/
│   ├── test_buffer_pool.cpp
│   ├── test_btree.cpp
│   ├── test_sql.cpp
│   └── test_recovery.cpp
├── CMakeLists.txt
└── README.md

5.3 The Core Question You’re Answering

“How do I reliably store, index, and query structured data at scale?”

This is the fundamental question of data management. Understanding database internals helps you:

  • Choose the right database for your use case
  • Design efficient schemas
  • Debug performance issues
  • Build data systems that don’t lose data

5.4 Concepts You Must Understand First

  1. What is a B-tree and why is it preferred over binary trees for databases?
    • High fanout reduces tree depth
    • Matches disk page size
    • Reference: “Introduction to Algorithms” Ch. 18
  2. What are the ACID properties?
    • Atomicity: All or nothing
    • Consistency: Rules always satisfied
    • Isolation: Transactions don’t interfere
    • Durability: Committed data survives crashes
  3. What is Write-Ahead Logging and why is it necessary?
    • Write log before data pages
    • Enables recovery after crash
    • Reference: “Database Internals” Ch. 10
  4. What is a buffer pool and why is it needed?
    • Pages cached in memory
    • Eviction policy (LRU)
    • Dirty page tracking
  5. What is the difference between a heap file and an indexed file?
    • Heap: Unordered, append-only, O(n) search
    • Indexed: Ordered by key, O(log n) search

5.5 Questions to Guide Your Design

Storage Format:

  • Fixed-size or variable-size records?
  • How to handle NULL values?
  • How to store strings (length-prefixed, null-terminated)?
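One possible answer to the string question is length-prefixed TEXT. This sketch serializes an (INT, TEXT) row as [4-byte id][2-byte length][bytes]; the layout and helper names are illustrative, not the project's required format.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Length-prefixed row encoding: fixed-width fields first, then variable data.
std::vector<uint8_t> serialize_row(int32_t id, const std::string& name) {
    std::vector<uint8_t> buf(4 + 2 + name.size());
    uint16_t len = static_cast<uint16_t>(name.size());
    std::memcpy(buf.data(), &id, 4);       // 4-byte INT
    std::memcpy(buf.data() + 4, &len, 2);  // 2-byte length prefix
    std::memcpy(buf.data() + 6, name.data(), len);
    return buf;
}

void deserialize_row(const uint8_t* buf, int32_t& id, std::string& name) {
    uint16_t len;
    std::memcpy(&id, buf, 4);
    std::memcpy(&len, buf + 4, 2);
    name.assign(reinterpret_cast<const char*>(buf + 6), len);
}
```

Unlike null-termination, a length prefix allows embedded zero bytes and lets a scan skip over a field without reading it byte by byte.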

Buffer Pool:

  • What eviction policy (LRU, Clock, LRU-K)?
  • How to handle concurrent access?
  • When to flush dirty pages?

B-Tree:

  • What is the fanout (keys per node)?
  • How to handle duplicate keys?
  • Clustered or non-clustered index?

Transactions:

  • What isolation level to support?
  • How to detect conflicts?
  • When to acquire/release locks?

5.6 Thinking Exercise

Design the page layout for a table with this schema:

CREATE TABLE products (
    id INT PRIMARY KEY,
    name TEXT,       -- variable length, max 100 chars
    price REAL,      -- 8 bytes
    in_stock BOOL    -- 1 byte
);

Questions:

  1. How many bytes per row (fixed fields + variable fields)?
  2. How many rows fit in a 4KB page?
  3. Where do you store the variable-length name?
  4. How does the slot directory help with variable-length data?

5.7 Hints in Layers

Hint 1 - Starting Point (Conceptual): Build bottom-up: DiskManager -> BufferPool -> HeapFile -> BTree -> SQL Parser -> Executor. Test each layer before building the next.

Hint 2 - Next Level (More Specific):

DiskManager (simplest layer):

class DiskManager {
    std::fstream file_;
    std::atomic<page_id_t> next_page_id_;

public:
    void readPage(page_id_t id, char* data) {
        // Cast before multiplying so large page ids don't overflow 32-bit math
        file_.seekg(static_cast<std::streamoff>(id) * PAGE_SIZE);
        file_.read(data, PAGE_SIZE);
    }

    void writePage(page_id_t id, const char* data) {
        file_.seekp(static_cast<std::streamoff>(id) * PAGE_SIZE);
        file_.write(data, PAGE_SIZE);
        file_.flush();  // flush() empties stream buffers only; fsync() the fd for durability
    }

    page_id_t allocatePage() {
        return next_page_id_++;
    }
};

Hint 3 - B-Tree Split:

// When a leaf node is full and we need to insert; returns the
// split key that the caller must insert into the parent node
int BTreeIndex::splitLeaf(BTreeNode* leaf, int key, RowId value) {
    // Create new leaf
    page_id_t new_page_id = buffer_pool_.allocatePage();
    BTreeNode* new_leaf = reinterpret_cast<BTreeNode*>(
        buffer_pool_.getPage(new_page_id)->data()
    );
    new_leaf->is_leaf = true;
    new_leaf->num_keys = 0;

    // Find split point (middle)
    int total = leaf->num_keys + 1;
    int split_idx = total / 2;

    // Collect all entries + new entry, sort
    std::vector<std::pair<int, RowId>> entries;
    for (int i = 0; i < leaf->num_keys; i++) {
        entries.push_back({leaf->keys[i], leaf->values[i]});
    }
    entries.push_back({key, value});
    std::sort(entries.begin(), entries.end());

    // First half stays in old leaf
    leaf->num_keys = 0;
    for (int i = 0; i < split_idx; i++) {
        leaf->keys[leaf->num_keys] = entries[i].first;
        leaf->values[leaf->num_keys] = entries[i].second;
        leaf->num_keys++;
    }

    // Second half goes to new leaf
    for (int i = split_idx; i < (int)entries.size(); i++) {
        new_leaf->keys[new_leaf->num_keys] = entries[i].first;
        new_leaf->values[new_leaf->num_keys] = entries[i].second;
        new_leaf->num_keys++;
    }

    // Link leaves
    new_leaf->next_leaf = leaf->next_leaf;
    leaf->next_leaf = new_page_id;

    // Return split key for parent
    return new_leaf->keys[0];
}

Hint 4 - WAL Record Format:

struct WALRecord {
    uint64_t lsn;           // Log Sequence Number
    uint32_t txn_id;
    WALRecordType type;     // BEGIN, INSERT, UPDATE, DELETE, COMMIT, ABORT
    page_id_t page_id;
    uint16_t offset;
    uint16_t length;
    char old_data[MAX_DATA];  // For UNDO
    char new_data[MAX_DATA];  // For REDO
};

// Recovery algorithm (simplified ARIES):
void recover() {
    // 1. Read WAL from beginning
    // 2. Build table of active transactions
    // 3. REDO: Replay all changes
    // 4. UNDO: Rollback uncommitted transactions
}

5.8 The Interview Questions They’ll Ask

  1. “Why use a B-tree instead of a hash index?”
    • B-tree supports range queries, hash doesn’t
    • B-tree sorted traversal, hash is random
    • B-tree handles growing data gracefully
  2. “Explain the purpose of Write-Ahead Logging”
    • Ensure durability without flushing data pages on every commit
    • Enable crash recovery by replaying log
    • Sequential writes are faster than random writes
  3. “What happens during a checkpoint?”
    • Flush all dirty pages to disk
    • Write checkpoint record to WAL
    • Can truncate old WAL entries
    • Reduces recovery time
  4. “How does MVCC work?”
    • Multiple versions of each row
    • Each transaction sees consistent snapshot
    • No read locks needed
    • Reference: PostgreSQL, MySQL InnoDB
  5. “What is the difference between clustered and non-clustered indexes?”
    • Clustered: Table data stored in index order (only one per table)
    • Non-clustered: Index contains pointers to heap

5.9 Books That Will Help

Topic Book & Chapter
B-Trees “Introduction to Algorithms” Ch. 18 - Cormen et al.
Database Storage “Database Internals” Ch. 1-4 - Alex Petrov
Buffer Management “Database Internals” Ch. 5 - Alex Petrov
Write-Ahead Logging “Database Internals” Ch. 10 - Alex Petrov
SQL Processing “Database System Concepts” Ch. 12-15 - Silberschatz
Transaction Processing “Transaction Processing: Concepts and Techniques” - Gray & Reuter

5.10 Implementation Phases

Phase 1: Storage Layer (Week 1)

  • Implement DiskManager
  • Implement Page class with header
  • Implement BufferPool with LRU eviction
  • Test: Read/write pages, eviction works

Phase 2: Table Access (Week 2)

  • Implement HeapFile (table storage)
  • Implement row serialization/deserialization
  • Implement table scan
  • Test: Insert rows, scan all rows

Phase 3: B-Tree Index (Week 3-4)

  • Implement B-tree search
  • Implement B-tree insert (without splits)
  • Implement node splitting
  • Test: Insert 10,000 keys, search all

Phase 4: SQL Parsing (Week 5)

  • Implement SQL tokenizer
  • Implement parser for CREATE, INSERT, SELECT
  • Build AST representation
  • Test: Parse various SQL statements

Phase 5: Execution (Week 6)

  • Implement CREATE TABLE execution
  • Implement INSERT execution
  • Implement SELECT with WHERE
  • Connect parser to executor

Phase 6: Transactions & WAL (Week 7-8)

  • Implement WAL writer
  • Implement BEGIN/COMMIT/ROLLBACK
  • Implement crash recovery
  • Test: Kill process, verify recovery

5.11 Key Implementation Decisions

Decision Recommended Choice Reasoning
Page size 4KB Matches OS page size, good for SSDs
B-tree order ~100 keys/node Fits in one page, low tree height
Buffer pool size 128 pages (512KB) Good for testing, configurable
Serialization Fixed-width + offset for variable Simple, efficient
WAL format Binary records Compact, fast to write
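The table's choices can live in one header. A config.hpp sketch (values illustrative, meant to be tunable):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

namespace config {
constexpr std::size_t PAGE_SIZE        = 4096;  // matches OS page size
constexpr std::size_t BUFFER_POOL_SIZE = 128;   // frames -> 512 KB resident
constexpr std::size_t BTREE_MAX_KEYS   = 100;   // keeps tree height low
using page_id_t = uint32_t;
using slot_id_t = uint16_t;
}
```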

6. Testing Strategy

Unit Tests:

TEST(BufferPool, EvictsLRU) {
    BufferPool pool(2);  // Only 2 frames

    auto p1 = pool.getPage(1);
    auto p2 = pool.getPage(2);

    pool.unpin(1);
    pool.unpin(2);

    // Access page 1 again (makes it recently used)
    p1 = pool.getPage(1);
    pool.unpin(1);

    // Get page 3 - should evict page 2 (LRU)
    auto p3 = pool.getPage(3);

    EXPECT_FALSE(pool.contains(2));  // Page 2 evicted
    EXPECT_TRUE(pool.contains(1));   // Page 1 still there
}

TEST(BTree, SearchAfterSplits) {
    BTreeIndex index;

    // Insert enough to cause multiple splits
    for (int i = 0; i < 10000; i++) {
        index.insert(i, RowId{0, i});
    }

    // Verify all keys findable
    for (int i = 0; i < 10000; i++) {
        auto result = index.search(i);
        ASSERT_TRUE(result.has_value());
        EXPECT_EQ(result->slot_id, i);
    }
}

Integration Tests:

TEST(Database, CrashRecovery) {
    {
        Database db("test.db");
        db.execute("CREATE TABLE t (id INT PRIMARY KEY)");
        db.execute("BEGIN");
        db.execute("INSERT INTO t VALUES (1)");
        db.execute("COMMIT");

        db.execute("BEGIN");
        db.execute("INSERT INTO t VALUES (2)");
        // No commit - simulate crash
    }

    // Reopen and recover
    Database db("test.db");

    auto result = db.execute("SELECT * FROM t");
    EXPECT_EQ(result.rows.size(), 1);  // Only committed row
    EXPECT_EQ(result.rows[0][0], 1);   // id = 1
}

7. Common Pitfalls & Debugging

Problem Symptom Root Cause Fix
Data corruption Wrong values read Page not flushed Call fsync() after write
B-tree invariant violation Search fails Split not propagated correctly Verify parent updated
Buffer pool deadlock Program hangs Page pinned but not unpinned Always unpin after use
WAL not durable Lost data on crash WAL not fsynced before commit fsync WAL on commit
Memory leak Growing memory Pages not evicted Check pin counts

Debugging Tools:

# Hex dump pages
xxd -g 4 data.db | head -100

# Visualize B-tree (custom script)
./btree_dump data.db | dot -Tpng > btree.png

# Check file sync
strace -e fsync ./mydb test.db

8. Extensions & Challenges

  1. Secondary Indexes: Non-primary key indexes
  2. JOIN Operations: Nested loop, hash join
  3. Query Optimizer: Cost-based optimization
  4. MVCC: Multi-version concurrency control
  5. Compression: Page-level compression
  6. Replication: Primary-replica setup

9. Real-World Connections

  • SQLite: Excellent codebase to study, similar architecture
  • LevelDB/RocksDB: LSM-tree based, different tradeoffs
  • PostgreSQL: Production RDBMS, open source
  • BoltDB: Simple B-tree database in Go

10. Resources


11. Self-Assessment Checklist

  • DiskManager reads/writes pages correctly
  • BufferPool evicts LRU pages
  • HeapFile inserts and scans rows
  • B-tree insert handles splits
  • B-tree search finds all inserted keys
  • SQL parser handles CREATE, INSERT, SELECT
  • Executor produces correct results
  • Transactions can be committed and rolled back
  • Data survives restart (persisted)
  • Crash recovery works (WAL replay)

12. Submission / Completion Criteria

Your implementation is complete when:

  1. Core SQL works: CREATE TABLE, INSERT, SELECT with WHERE
  2. Index used: Primary key lookups use B-tree
  3. ACID compliant: Transactions work correctly
  4. Durable: Data survives process restart
  5. Recoverable: Uncommitted transactions rolled back after crash
  6. Scalable: Handles 100K+ rows efficiently
  7. Can explain: You can whiteboard the architecture in an interview