Project 8: Simple Database Engine

Build a relational database engine with B-tree indexing, page-based storage, SQL parsing, and ACID transactions to master systems programming.

Quick Reference

Attribute Value
Difficulty Expert
Time Estimate 1-2 months
Language C++
Prerequisites B-tree data structures, file I/O, SQL basics, transaction concepts
Key Topics Page storage, buffer pool, B-tree, SQL parsing, WAL, ACID

1. Learning Objectives

By completing this project, you will:

  • Implement page-based storage: Manage fixed-size pages on disk
  • Build a buffer pool: Cache pages in memory with eviction policies
  • Implement B-tree indexing: Insert, search, and split operations
  • Parse SQL statements: Tokenize and parse SELECT, INSERT, CREATE TABLE
  • Understand ACID properties: Atomicity, Consistency, Isolation, Durability
  • Implement Write-Ahead Logging: Ensure durability through logging
  • Design storage formats: Row layout, page format, metadata storage

2. Theoretical Foundation

2.1 Core Concepts

Why Databases Use Pages:

Traditional File I/O:                Database Page I/O:
┌─────────────────────┐             ┌─────────────────────┐
│ Read 50 bytes here  │             │ Read 4KB page       │
│ Read 30 bytes there │             │ (contains many rows)│
│ Read 100 bytes...   │             │                     │
│ Many small I/Os     │             │ One I/O, use cache  │
└─────────────────────┘             └─────────────────────┘

Disk seeks are ~10ms. Reading 4KB takes ~0.05ms.
Once you seek, read more! Pages amortize seek cost.
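The amortization claim is easy to sanity-check with the numbers above (a sketch; real latencies vary widely by device):

```cpp
#include <cassert>

// Rough I/O arithmetic behind the diagram, using the latencies quoted above
// (~10ms per seek, ~0.05ms to read 4KB). Values are illustrative.
constexpr double SEEK_MS = 10.0;
constexpr double READ_4KB_MS = 0.05;

// Fetching 100 rows with one random read each, vs. all 100 on one page:
constexpr double scattered_ms = 100 * (SEEK_MS + READ_4KB_MS);  // ~1005 ms
constexpr double one_page_ms = SEEK_MS + READ_4KB_MS;           // ~10 ms
```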

Page Layout (Slotted Page):

┌─────────────────────────────────────────────────────────┐
│  Page Header (16 bytes)                                  │
│  ┌─────────────────────────────────────────────────────┐ │
│  │ page_id │ num_slots │ free_space_ptr │ page_type   │ │
│  └─────────────────────────────────────────────────────┘ │
├─────────────────────────────────────────────────────────┤
│  Slot Directory (grows down)                             │
│  ┌───────┬───────┬───────┬───────┬─────────────────────┐ │
│  │Slot 0 │Slot 1 │Slot 2 │Slot 3 │  ...                │ │
│  │off,len│off,len│off,len│off,len│                     │ │
│  └───────┴───────┴───────┴───────┴─────────────────────┘ │
├─────────────────────────────────────────────────────────┤
│                                                          │
│              Free Space (grows both ways)                │
│                                                          │
├─────────────────────────────────────────────────────────┤
│  Record Data (grows up from bottom)                      │
│  ┌─────────────────────────────────────────────────────┐ │
│  │ Record 3 │ Record 2 │ Record 1 │ Record 0 │         │ │
│  └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
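The layout above can be sketched in a few lines of C++. This is a simplified illustration, not the project's final format: a real page keeps the slot directory inside the 4KB buffer and checks free space against both growing regions.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <optional>

// Minimal slotted-page sketch of the layout above (illustrative names).
constexpr uint16_t PAGE_SIZE = 4096;

struct Slot { uint16_t offset; uint16_t length; };

struct SlottedPage {
    uint16_t num_slots = 0;
    uint16_t free_end = PAGE_SIZE;   // record data grows down from the page end
    Slot slots[128];                 // slot directory (simplified: fixed array)
    uint8_t data[PAGE_SIZE];

    // Copy a record into free space; returns its slot id, or nullopt if full.
    std::optional<uint16_t> insert(const void* rec, uint16_t len) {
        if (num_slots >= 128 || len > free_end) return std::nullopt;
        free_end -= len;
        std::memcpy(data + free_end, rec, len);
        slots[num_slots] = Slot{free_end, len};
        return num_slots++;
    }

    const uint8_t* get(uint16_t slot_id) const { return data + slots[slot_id].offset; }
};
```

Because readers go through the slot directory, records can later be moved or compacted within the page without changing their slot ids.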

B-Tree Structure:

                          ┌─────────────────────┐
                          │   Root (Internal)   │
                          │   [30] [60]        │
                          └──────────┬──────────┘
                    ┌────────────────┼────────────────┐
                    ▼                ▼                ▼
           ┌───────────────┐ ┌───────────────┐ ┌───────────────┐
           │  Internal     │ │  Internal     │ │  Internal     │
           │  [10][20]     │ │  [40][50]     │ │  [70][80]     │
           └───────┬───────┘ └───────┬───────┘ └───────┬───────┘
          ┌────┬────┐       ┌────┬────┐       ┌────┬────┐
          ▼    ▼    ▼       ▼    ▼    ▼       ▼    ▼    ▼
        ┌───┐┌───┐┌───┐   ┌───┐┌───┐┌───┐   ┌───┐┌───┐┌───┐
        │5,8││12,││25,│   │35,││45,││55,│   │65,││75,││85,│
        │   ││18 ││28 │   │38 ││48 ││58 │   │68 ││78 ││90 │
        └───┘└───┘└───┘   └───┘└───┘└───┘   └───┘└───┘└───┘
           Leaf Nodes (contain actual row pointers or data)

B-Tree Properties:
- All leaves at same depth
- Nodes are at least half full
- Search: O(log n)
- Insert: O(log n) + possible splits
- Perfect for disk: wide and shallow

Buffer Pool Architecture:

┌─────────────────────────────────────────────────────────────────┐
│                        Buffer Pool                               │
│                                                                  │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                    Page Frames (Memory)                    │  │
│  │  ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐   │  │
│  │  │Page 12 │ │Page 45 │ │Page 3  │ │Page 89 │ │ Empty  │   │  │
│  │  │dirty=1 │ │dirty=0 │ │dirty=1 │ │dirty=0 │ │        │   │  │
│  │  │pin=2   │ │pin=0   │ │pin=1   │ │pin=0   │ │        │   │  │
│  │  └────────┘ └────────┘ └────────┘ └────────┘ └────────┘   │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                  │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │  Page Table: page_id -> frame_index                      │    │
│  │  { 12 -> 0, 45 -> 1, 3 -> 2, 89 -> 3 }                  │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                  │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │  LRU List (for eviction): [45, 89, 3, 12]               │    │
│  │  (45 is least recently used, evict first)               │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
          │                                    ▲
          │ Evict/Flush                        │ Fetch
          ▼                                    │
┌─────────────────────────────────────────────────────────────────┐
│                      Disk (data.db file)                         │
│  Page 0  │ Page 1  │ Page 2  │ Page 3  │ ... │ Page N  │        │
└─────────────────────────────────────────────────────────────────┘
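The LRU list in the diagram is the heart of eviction. A minimal sketch of that bookkeeping, assuming a list-plus-map design (pinning, dirty-page flushing, and the frame table are omitted here):

```cpp
#include <cassert>
#include <list>
#include <unordered_map>

// LRU bookkeeping for the eviction list in the diagram above.
class LruList {
    std::list<int> order_;  // front = least recently used
    std::unordered_map<int, std::list<int>::iterator> pos_;
public:
    void touch(int page_id) {  // call on every page access
        auto it = pos_.find(page_id);
        if (it != pos_.end()) order_.erase(it->second);
        order_.push_back(page_id);
        pos_[page_id] = std::prev(order_.end());
    }
    int evict() {  // pop the LRU victim (assumes the list is non-empty)
        int victim = order_.front();
        order_.pop_front();
        pos_.erase(victim);
        return victim;
    }
};
```

The map makes `touch` O(1); a real buffer pool would also skip victims whose pin count is nonzero.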

Write-Ahead Logging (WAL):

                    Transaction Flow
                          │
       ┌──────────────────┼──────────────────┐
       │                  │                  │
       ▼                  ▼                  ▼
  ┌─────────┐       ┌─────────┐       ┌─────────┐
  │ BEGIN   │       │ UPDATE  │       │ COMMIT  │
  └────┬────┘       └────┬────┘       └────┬────┘
       │                 │                  │
       │                 ▼                  │
       │    ┌────────────────────────┐      │
       │    │ 1. Write to WAL first  │      │
       │    │    (log record)        │      │
       │    │ 2. Then modify page    │      │
       │    │    (in buffer pool)    │      │
       │    └────────────────────────┘      │
       │                                    │
       │                 ┌──────────────────┘
       │                 ▼
       │    ┌────────────────────────┐
       │    │ 3. Flush WAL to disk   │
       │    │ 4. Return success      │
       │    │ (pages written later)  │
       │    └────────────────────────┘
       │
       ▼
┌─────────────────────────────────────────────────────┐
│  WAL File (sequential writes - fast!)                │
│                                                      │
│  [LSN 1][LSN 2][LSN 3][LSN 4][LSN 5][LSN 6]...      │
│  BEGIN  UPDATE UPDATE UPDATE COMMIT  BEGIN          │
│  txn1   txn1   txn2   txn1   txn1    txn3           │
└─────────────────────────────────────────────────────┘

Recovery: Replay WAL to reconstruct committed transactions
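The recovery decision can be sketched as a single analysis pass over the WAL pictured above: one scan classifies transactions as committed (redo) or not (undo/discard). Record and field names here are illustrative, not a fixed format.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Sketch of the recovery analysis pass: scan the log once, remember who
// committed. Every other transaction that appears must be undone/discarded.
enum class RecType { BEGIN, UPDATE, COMMIT };
struct Rec { uint64_t lsn; uint32_t txn; RecType type; };

std::set<uint32_t> committed_txns(const std::vector<Rec>& wal) {
    std::set<uint32_t> committed;
    for (const Rec& r : wal)
        if (r.type == RecType::COMMIT) committed.insert(r.txn);
    return committed;
}
```

Applied to the six records in the diagram, only txn1 has a COMMIT record, so only its changes are replayed.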

2.2 Why This Matters

Databases are the foundation of almost every application:

  • Web applications: PostgreSQL, MySQL
  • Mobile apps: SQLite
  • Enterprise: Oracle, SQL Server
  • Analytics: ClickHouse, DuckDB

Building a database teaches:

  • How data is organized on disk
  • Why indexing matters (and when it doesn’t)
  • How transactions maintain consistency
  • Why databases seem magical but aren’t

2.3 Historical Context

Database systems evolved from file systems in the 1960s:

  • 1970: E.F. Codd proposes relational model
  • 1974: System R (IBM) implements SQL
  • 1979: Oracle first commercial RDBMS
  • 1986: POSTGRES (predecessor of PostgreSQL) begins at Berkeley
  • 1995: MySQL released
  • 2000: SQLite released (embedded)
  • 2017: CockroachDB, TiDB (distributed SQL)

Modern databases still use core concepts from the 1970s:

  • B-trees (invented 1972)
  • Write-ahead logging (1980s)
  • MVCC (formalized in 1981; used by PostgreSQL, InnoDB)

2.4 Common Misconceptions

Misconception 1: “Databases keep everything in memory” Reality: Databases use a buffer pool to cache pages. On memory pressure, pages are evicted to disk. This allows working with datasets larger than RAM.

Misconception 2: “B-trees are just binary trees” Reality: B-trees have high fanout (many children per node). A typical node might have 100-500 keys. This reduces tree height and disk I/O.

Misconception 3: “Indexes always make queries faster” Reality: Indexes speed up reads but slow down writes. Each INSERT must update every index. Too many indexes hurt performance.

Misconception 4: “COMMIT is when data is written to disk” Reality: COMMIT ensures the WAL is flushed to disk. Data pages may be written later (checkpoint). WAL enables fast commits with durability.


3. Project Specification

3.1 What You Will Build

A simple relational database that:

  • Stores data in a page-based file format
  • Uses a buffer pool for caching
  • Supports B-tree indexes for fast lookup
  • Parses and executes basic SQL
  • Provides ACID transactions with WAL

3.2 Functional Requirements

Requirement Description
FR-1 CREATE TABLE with INT, TEXT column types
FR-2 INSERT INTO with values
FR-3 SELECT * and SELECT columns with WHERE on primary key
FR-4 Primary key constraint with B-tree index
FR-5 BEGIN, COMMIT, ROLLBACK transactions
FR-6 Persist data across restarts
FR-7 Recover from crash using WAL

3.3 Non-Functional Requirements

Requirement Description
NFR-1 Handle tables with 1M+ rows
NFR-2 Primary key lookup in O(log n)
NFR-3 Buffer pool size configurable
NFR-4 Corruption detection (checksums)
NFR-5 Clean separation of components

3.4 Example Usage / Output

$ ./mydb test.db

mydb> CREATE TABLE users (
    id INT PRIMARY KEY,
    name TEXT,
    email TEXT
);
Table 'users' created.

mydb> INSERT INTO users VALUES (1, 'Alice', 'alice@example.com');
1 row inserted.

mydb> INSERT INTO users VALUES (2, 'Bob', 'bob@example.com');
1 row inserted.

mydb> INSERT INTO users VALUES (3, 'Charlie', 'charlie@example.com');
1 row inserted.

mydb> SELECT * FROM users;
+----+---------+---------------------+
| id | name    | email               |
+----+---------+---------------------+
|  1 | Alice   | alice@example.com   |
|  2 | Bob     | bob@example.com     |
|  3 | Charlie | charlie@example.com |
+----+---------+---------------------+
3 rows returned.

mydb> SELECT name, email FROM users WHERE id = 2;
+------+-----------------+
| name | email           |
+------+-----------------+
| Bob  | bob@example.com |
+------+-----------------+
1 row returned (index scan).

mydb> BEGIN;
Transaction started.

mydb> INSERT INTO users VALUES (4, 'David', 'david@example.com');
1 row inserted.

mydb> SELECT * FROM users WHERE id = 4;
+----+-------+-------------------+
| id | name  | email             |
+----+-------+-------------------+
|  4 | David | david@example.com |
+----+-------+-------------------+
1 row returned.

mydb> ROLLBACK;
Transaction rolled back.

mydb> SELECT * FROM users WHERE id = 4;
0 rows returned.  -- David is gone!

mydb> .stats
Buffer pool: 128 pages (512 KB)
  Hits: 1,234
  Misses: 45
  Hit ratio: 96.5%
Tables: 1
  users: 3 rows, 1 page

mydb> .exit
Goodbye!

3.5 Real World Outcome

$ ./mydb test.db

# Load 1 million rows
mydb> .load users_1m.csv
Loaded 1,000,000 rows in 45 seconds.

# Point query (uses B-tree index)
mydb> SELECT * FROM users WHERE id = 500000;
1 row returned in 0.2ms (index scan, 4 page reads).

# Range query (requires table scan for now)
mydb> SELECT COUNT(*) FROM users WHERE name LIKE 'A%';
38,421 rows matched in 2.3 seconds (sequential scan).

# Transaction test
mydb> BEGIN;
mydb> UPDATE users SET email = 'test@test.com' WHERE id = 1;
mydb> -- Simulate crash: kill -9 $PID

$ ./mydb test.db
Recovering from WAL...
Replayed 0 committed transactions.
Discarded 1 uncommitted transaction.
Recovery complete.

mydb> SELECT * FROM users WHERE id = 1;
-- Original email, not 'test@test.com'!

4. Solution Architecture

4.1 High-Level Design

┌─────────────────────────────────────────────────────────────────────┐
│                        Database Engine                               │
│                                                                      │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                      SQL Layer                               │    │
│  │                                                              │    │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐    │    │
│  │  │  Parser  │─►│ Analyzer │─►│ Planner  │─►│ Executor │    │    │
│  │  └──────────┘  └──────────┘  └──────────┘  └──────────┘    │    │
│  │                                                              │    │
│  └──────────────────────────────┬───────────────────────────────┘    │
│                                 │                                    │
│  ┌──────────────────────────────▼───────────────────────────────┐    │
│  │                    Access Methods                             │    │
│  │                                                               │    │
│  │  ┌─────────────────┐     ┌─────────────────────────────────┐ │    │
│  │  │ Table (Heap)    │     │ Index (B-Tree)                  │ │    │
│  │  │ - Insert row    │     │ - Insert key                    │ │    │
│  │  │ - Scan all rows │     │ - Search key                    │ │    │
│  │  └────────┬────────┘     └────────┬────────────────────────┘ │    │
│  │           │                       │                          │    │
│  └───────────┼───────────────────────┼──────────────────────────┘    │
│              │                       │                               │
│  ┌───────────▼───────────────────────▼──────────────────────────┐    │
│  │                     Buffer Pool                               │    │
│  │                                                               │    │
│  │  getPage(page_id) -> Page*                                   │    │
│  │  markDirty(page_id)                                          │    │
│  │  unpin(page_id)                                              │    │
│  │  flushPage(page_id)                                          │    │
│  └───────────────────────────────┬───────────────────────────────┘    │
│                                  │                                    │
│  ┌───────────────────────────────▼───────────────────────────────┐    │
│  │                     Disk Manager                              │    │
│  │                                                               │    │
│  │  readPage(page_id, data)                                     │    │
│  │  writePage(page_id, data)                                    │    │
│  │  allocatePage() -> page_id                                   │    │
│  └───────────────────────────────┬───────────────────────────────┘    │
│                                  │                                    │
└──────────────────────────────────┼────────────────────────────────────┘
                                   ▼
                             ┌──────────┐
                             │ data.db  │
                             │ wal.log  │
                             └──────────┘

4.2 Key Components

Component Responsibility
Parser Tokenize SQL, build AST
Analyzer Validate table/column names, resolve types
Planner Decide access method (index vs scan)
Executor Execute the plan, return results
Table Heap file operations (insert, scan)
Index B-tree operations (insert, search)
Buffer Pool Page caching with LRU eviction
Disk Manager Raw page I/O
WAL Write-ahead log for durability

4.3 Data Structures

Page Header:

struct PageHeader {
    uint32_t page_id;
    uint16_t num_slots;
    uint16_t free_space_offset;
    uint16_t free_space_end;
    PageType page_type;  // HEAP_PAGE, BTREE_INTERNAL, BTREE_LEAF
    uint32_t checksum;
};

B-Tree Node:

struct BTreeNode {
    bool is_leaf;
    uint16_t num_keys;
    uint32_t keys[MAX_KEYS];        // For internal: separator keys
    uint32_t children[MAX_KEYS+1];  // For internal: child page ids
    uint32_t values[MAX_KEYS];      // For leaf: row ids / data pointers
    uint32_t next_leaf;             // For leaf: sibling pointer
};
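The node above implies a fanout. A quick back-of-envelope check, assuming 4-byte keys, 4-byte page ids, and an 8-byte header (all illustrative constants), shows why B-trees stay "wide and shallow":

```cpp
#include <cassert>
#include <cstddef>

// How many keys fit in one 4KB page, and how tall is the tree for 1M keys?
constexpr std::size_t PAGE_SIZE = 4096;
constexpr std::size_t NODE_HEADER = 8;  // is_leaf + num_keys + padding
// Each key costs 8 bytes (key + child/value pointer), plus one extra child.
constexpr std::size_t MAX_KEYS = (PAGE_SIZE - NODE_HEADER - 4) / 8;

constexpr std::size_t tree_height(std::size_t n) {
    std::size_t h = 1, capacity = MAX_KEYS;               // one full leaf
    while (capacity < n) { capacity *= MAX_KEYS + 1; ++h; }
    return h;  // levels needed when every node is full
}
```

With ~510 keys per node, a million keys need only three levels, so a point lookup touches three or four pages.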

Buffer Pool Frame:

struct Frame {
    Page data;
    page_id_t page_id;
    bool is_dirty;
    int pin_count;
};

4.4 Algorithm Overview

B-Tree Search:

search(node, key):
    if node.is_leaf:
        for i in 0 to node.num_keys:
            if node.keys[i] == key:
                return node.values[i]
        return NOT_FOUND

    // Find child to descend
    i = 0
    while i < node.num_keys and key >= node.keys[i]:
        i++

    child = buffer_pool.getPage(node.children[i])
    return search(child, key)
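The same logic as a runnable in-memory sketch: pages are replaced by a vector of nodes and MAX_KEYS is shrunk so the example stays small (names and layout are illustrative).

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

constexpr int MAX_KEYS = 4;
struct Node {
    bool is_leaf;
    int num_keys;
    uint32_t keys[MAX_KEYS];
    uint32_t children[MAX_KEYS + 1];  // internal: indexes into the node vector
    uint32_t values[MAX_KEYS];        // leaf: row ids
};

std::optional<uint32_t> search(const std::vector<Node>& nodes,
                               int node_idx, uint32_t key) {
    const Node& n = nodes[node_idx];
    if (n.is_leaf) {
        for (int i = 0; i < n.num_keys; i++)
            if (n.keys[i] == key) return n.values[i];
        return std::nullopt;
    }
    int i = 0;
    while (i < n.num_keys && key >= n.keys[i]) i++;  // pick the subtree
    return search(nodes, n.children[i], key);
}
```

Note the `key >= keys[i]` convention: keys equal to a separator live in the right subtree, which must match the split logic.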

B-Tree Insert:

insert(key, value):
    if root is null:
        root = create_leaf_node()
        root.insert(key, value)
        return

    result = insert_recursive(root, key, value)

    if result.split_occurred:
        // Root split, create new root
        new_root = create_internal_node()
        new_root.keys[0] = result.split_key
        new_root.children[0] = root
        new_root.children[1] = result.new_node
        root = new_root

insert_recursive(node, key, value):
    if node.is_leaf:
        if node.has_space():
            node.insert_sorted(key, value)
            return NO_SPLIT
        else:
            // Split leaf: redistribute existing entries plus the new one
            new_leaf = create_leaf_node()
            (split_key, entries_for_new) = node.split_at_middle(key, value)
            new_leaf.insert_all(entries_for_new)
            new_leaf.next_leaf = node.next_leaf   // keep the sibling chain intact
            node.next_leaf = new_leaf
            return SPLIT(split_key, new_leaf)

    // Internal node
    child_index = find_child(node, key)
    child = getPage(node.children[child_index])
    result = insert_recursive(child, key, value)

    if result.split_occurred:
        if node.has_space():
            node.insert_child(child_index, result.split_key, result.new_node)
            return NO_SPLIT
        else:
            // Split internal node: redistribute existing children plus the
            // new (split_key, new_node) entry; the middle key is pushed up
            new_internal = create_internal_node()
            (push_up_key, entries_for_new) = node.split_at_middle(result.split_key, result.new_node)
            new_internal.insert_all(entries_for_new)
            return SPLIT(push_up_key, new_internal)

    return NO_SPLIT

5. Implementation Guide

5.1 Development Environment Setup

Required:

  • C++17 or later
  • CMake 3.14+
  • File system with fsync support

Recommended:

# Install development tools
sudo apt install cmake g++ valgrind

# For visualization (optional)
sudo apt install graphviz  # Provides 'dot' to render B-tree diagrams

5.2 Project Structure

database_engine/
├── include/
│   ├── common/
│   │   ├── types.hpp        # page_id_t, slot_id_t, etc.
│   │   └── config.hpp       # PAGE_SIZE, BUFFER_POOL_SIZE
│   ├── storage/
│   │   ├── disk_manager.hpp
│   │   ├── page.hpp
│   │   └── buffer_pool.hpp
│   ├── access/
│   │   ├── table_heap.hpp
│   │   └── btree_index.hpp
│   ├── catalog/
│   │   ├── schema.hpp
│   │   └── catalog.hpp
│   ├── sql/
│   │   ├── lexer.hpp
│   │   ├── parser.hpp
│   │   └── executor.hpp
│   └── recovery/
│       └── wal.hpp
├── src/
│   ├── storage/
│   ├── access/
│   ├── catalog/
│   ├── sql/
│   ├── recovery/
│   └── main.cpp
├── tests/
│   ├── test_buffer_pool.cpp
│   ├── test_btree.cpp
│   ├── test_sql.cpp
│   └── test_recovery.cpp
├── CMakeLists.txt
└── README.md

5.3 The Core Question You’re Answering

“How do I reliably store, index, and query structured data at scale?”

This is the fundamental question of data management. Understanding database internals helps you:

  • Choose the right database for your use case
  • Design efficient schemas
  • Debug performance issues
  • Build data systems that don’t lose data

5.4 Concepts You Must Understand First

  1. What is a B-tree and why is it preferred over binary trees for databases?
    • High fanout reduces tree depth
    • Matches disk page size
    • Reference: “Introduction to Algorithms” Ch. 18
  2. What are the ACID properties?
    • Atomicity: All or nothing
    • Consistency: Rules always satisfied
    • Isolation: Transactions don’t interfere
    • Durability: Committed data survives crashes
  3. What is Write-Ahead Logging and why is it necessary?
    • Write log before data pages
    • Enables recovery after crash
    • Reference: “Database Internals” Ch. 10
  4. What is a buffer pool and why is it needed?
    • Pages cached in memory
    • Eviction policy (LRU)
    • Dirty page tracking
  5. What is the difference between a heap file and an indexed file?
    • Heap: Unordered, append-only, O(n) search
    • Indexed: Ordered by key, O(log n) search

5.5 Questions to Guide Your Design

Storage Format:

  • Fixed-size or variable-size records?
  • How to handle NULL values?
  • How to store strings (length-prefixed, null-terminated)?
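One possible answer to the string question is length-prefixed TEXT. This sketch serializes an (INT, TEXT) row as [4-byte id][2-byte length][bytes]; the layout and helper names are illustrative, not the project's required format.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Length-prefixed row encoding: fixed-width fields first, then variable data.
std::vector<uint8_t> serialize_row(int32_t id, const std::string& name) {
    std::vector<uint8_t> buf(4 + 2 + name.size());
    uint16_t len = static_cast<uint16_t>(name.size());
    std::memcpy(buf.data(), &id, 4);       // 4-byte INT
    std::memcpy(buf.data() + 4, &len, 2);  // 2-byte length prefix
    std::memcpy(buf.data() + 6, name.data(), len);
    return buf;
}

void deserialize_row(const uint8_t* buf, int32_t& id, std::string& name) {
    uint16_t len;
    std::memcpy(&id, buf, 4);
    std::memcpy(&len, buf + 4, 2);
    name.assign(reinterpret_cast<const char*>(buf + 6), len);
}
```

Unlike null-termination, a length prefix allows embedded zero bytes and lets a scan skip over a field without reading it byte by byte.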

Buffer Pool:

  • What eviction policy (LRU, Clock, LRU-K)?
  • How to handle concurrent access?
  • When to flush dirty pages?

B-Tree:

  • What is the fanout (keys per node)?
  • How to handle duplicate keys?
  • Clustered or non-clustered index?

Transactions:

  • What isolation level to support?
  • How to detect conflicts?
  • When to acquire/release locks?

5.6 Thinking Exercise

Design the page layout for a table with this schema:

CREATE TABLE products (
    id INT PRIMARY KEY,
    name TEXT,       -- variable length, max 100 chars
    price REAL,      -- 8 bytes
    in_stock BOOL    -- 1 byte
);

Questions:

  1. How many bytes per row (fixed fields + variable fields)?
  2. How many rows fit in a 4KB page?
  3. Where do you store the variable-length name?
  4. How does the slot directory help with variable-length data?

5.7 Hints in Layers

Hint 1 - Starting Point (Conceptual): Build bottom-up: DiskManager -> BufferPool -> HeapFile -> BTree -> SQL Parser -> Executor. Test each layer before building the next.

Hint 2 - Next Level (More Specific):

DiskManager (simplest layer):

class DiskManager {
    std::fstream file_;
    std::atomic<page_id_t> next_page_id_;

public:
    void readPage(page_id_t id, char* data) {
        // Cast before multiplying so large page ids don't overflow 32-bit math
        file_.seekg(static_cast<std::streamoff>(id) * PAGE_SIZE);
        file_.read(data, PAGE_SIZE);
    }

    void writePage(page_id_t id, const char* data) {
        file_.seekp(static_cast<std::streamoff>(id) * PAGE_SIZE);
        file_.write(data, PAGE_SIZE);
        file_.flush();  // flush() empties stream buffers only; fsync() the fd for durability
    }

    page_id_t allocatePage() {
        return next_page_id_++;
    }
};

Hint 3 - B-Tree Split:

// When a leaf node is full and we need to insert; returns the
// split key that the caller must insert into the parent node
int BTreeIndex::splitLeaf(BTreeNode* leaf, int key, RowId value) {
    // Create new leaf
    page_id_t new_page_id = buffer_pool_.allocatePage();
    BTreeNode* new_leaf = reinterpret_cast<BTreeNode*>(
        buffer_pool_.getPage(new_page_id)->data()
    );
    new_leaf->is_leaf = true;
    new_leaf->num_keys = 0;

    // Find split point (middle)
    int total = leaf->num_keys + 1;
    int split_idx = total / 2;

    // Collect all entries + new entry, sort
    std::vector<std::pair<int, RowId>> entries;
    for (int i = 0; i < leaf->num_keys; i++) {
        entries.push_back({leaf->keys[i], leaf->values[i]});
    }
    entries.push_back({key, value});
    std::sort(entries.begin(), entries.end());

    // First half stays in old leaf
    leaf->num_keys = 0;
    for (int i = 0; i < split_idx; i++) {
        leaf->keys[leaf->num_keys] = entries[i].first;
        leaf->values[leaf->num_keys] = entries[i].second;
        leaf->num_keys++;
    }

    // Second half goes to new leaf
    for (int i = split_idx; i < (int)entries.size(); i++) {
        new_leaf->keys[new_leaf->num_keys] = entries[i].first;
        new_leaf->values[new_leaf->num_keys] = entries[i].second;
        new_leaf->num_keys++;
    }

    // Link leaves
    new_leaf->next_leaf = leaf->next_leaf;
    leaf->next_leaf = new_page_id;

    // Return split key for parent
    return new_leaf->keys[0];
}

Hint 4 - WAL Record Format:

struct WALRecord {
    uint64_t lsn;           // Log Sequence Number
    uint32_t txn_id;
    WALRecordType type;     // BEGIN, INSERT, UPDATE, DELETE, COMMIT, ABORT
    page_id_t page_id;
    uint16_t offset;
    uint16_t length;
    char old_data[MAX_DATA];  // For UNDO
    char new_data[MAX_DATA];  // For REDO
};

// Recovery algorithm (simplified ARIES):
void recover() {
    // 1. Read WAL from beginning
    // 2. Build table of active transactions
    // 3. REDO: Replay all changes
    // 4. UNDO: Rollback uncommitted transactions
}

5.8 The Interview Questions They’ll Ask

  1. “Why use a B-tree instead of a hash index?”
    • B-tree supports range queries, hash doesn’t
    • B-tree sorted traversal, hash is random
    • B-tree handles growing data gracefully
  2. “Explain the purpose of Write-Ahead Logging”
    • Ensure durability without flushing data pages on every commit
    • Enable crash recovery by replaying log
    • Sequential writes are faster than random writes
  3. “What happens during a checkpoint?”
    • Flush all dirty pages to disk
    • Write checkpoint record to WAL
    • Can truncate old WAL entries
    • Reduces recovery time
  4. “How does MVCC work?”
    • Multiple versions of each row
    • Each transaction sees consistent snapshot
    • No read locks needed
    • Reference: PostgreSQL, MySQL InnoDB
  5. “What is the difference between clustered and non-clustered indexes?”
    • Clustered: Table data stored in index order (only one per table)
    • Non-clustered: Index contains pointers to heap

5.9 Books That Will Help

Topic Book & Chapter
B-Trees “Introduction to Algorithms” Ch. 18 - Cormen et al.
Database Storage “Database Internals” Ch. 1-4 - Alex Petrov
Buffer Management “Database Internals” Ch. 5 - Alex Petrov
Write-Ahead Logging “Database Internals” Ch. 10 - Alex Petrov
SQL Processing “Database System Concepts” Ch. 12-15 - Silberschatz
Transaction Processing “Transaction Processing: Concepts and Techniques” - Gray & Reuter

5.10 Implementation Phases

Phase 1: Storage Layer (Week 1)

  • Implement DiskManager
  • Implement Page class with header
  • Implement BufferPool with LRU eviction
  • Test: Read/write pages, eviction works

Phase 2: Table Access (Week 2)

  • Implement HeapFile (table storage)
  • Implement row serialization/deserialization
  • Implement table scan
  • Test: Insert rows, scan all rows

Phase 3: B-Tree Index (Week 3-4)

  • Implement B-tree search
  • Implement B-tree insert (without splits)
  • Implement node splitting
  • Test: Insert 10,000 keys, search all

Phase 4: SQL Parsing (Week 5)

  • Implement SQL tokenizer
  • Implement parser for CREATE, INSERT, SELECT
  • Build AST representation
  • Test: Parse various SQL statements

Phase 5: Execution (Week 6)

  • Implement CREATE TABLE execution
  • Implement INSERT execution
  • Implement SELECT with WHERE
  • Connect parser to executor

Phase 6: Transactions & WAL (Week 7-8)

  • Implement WAL writer
  • Implement BEGIN/COMMIT/ROLLBACK
  • Implement crash recovery
  • Test: Kill process, verify recovery

5.11 Key Implementation Decisions

Decision Recommended Choice Reasoning
Page size 4KB Matches OS page size, good for SSDs
B-tree order ~100 keys/node Fits in one page, low tree height
Buffer pool size 128 pages (512KB) Good for testing, configurable
Serialization Fixed-width + offset for variable Simple, efficient
WAL format Binary records Compact, fast to write
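The table's choices can live in one header. A config.hpp sketch (values illustrative, meant to be tunable):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

namespace config {
constexpr std::size_t PAGE_SIZE        = 4096;  // matches OS page size
constexpr std::size_t BUFFER_POOL_SIZE = 128;   // frames -> 512 KB resident
constexpr std::size_t BTREE_MAX_KEYS   = 100;   // keeps tree height low
using page_id_t = uint32_t;
using slot_id_t = uint16_t;
}
```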

6. Testing Strategy

Unit Tests:

TEST(BufferPool, EvictsLRU) {
    BufferPool pool(2);  // Only 2 frames

    auto p1 = pool.getPage(1);
    auto p2 = pool.getPage(2);

    pool.unpin(1);
    pool.unpin(2);

    // Access page 1 again (makes it recently used)
    p1 = pool.getPage(1);
    pool.unpin(1);

    // Get page 3 - should evict page 2 (LRU)
    auto p3 = pool.getPage(3);

    EXPECT_FALSE(pool.contains(2));  // Page 2 evicted
    EXPECT_TRUE(pool.contains(1));   // Page 1 still there
}

TEST(BTree, SearchAfterSplits) {
    BTreeIndex index;

    // Insert enough to cause multiple splits
    for (int i = 0; i < 10000; i++) {
        index.insert(i, RowId{0, i});
    }

    // Verify all keys findable
    for (int i = 0; i < 10000; i++) {
        auto result = index.search(i);
        ASSERT_TRUE(result.has_value());
        EXPECT_EQ(result->slot_id, i);
    }
}

Integration Tests:

TEST(Database, CrashRecovery) {
    {
        Database db("test.db");
        db.execute("CREATE TABLE t (id INT PRIMARY KEY)");
        db.execute("BEGIN");
        db.execute("INSERT INTO t VALUES (1)");
        db.execute("COMMIT");

        db.execute("BEGIN");
        db.execute("INSERT INTO t VALUES (2)");
        // No commit - simulate crash
    }

    // Reopen and recover
    Database db("test.db");

    auto result = db.execute("SELECT * FROM t");
    EXPECT_EQ(result.rows.size(), 1);  // Only committed row
    EXPECT_EQ(result.rows[0][0], 1);   // id = 1
}

7. Common Pitfalls & Debugging

Problem Symptom Root Cause Fix
Data corruption Wrong values read Page not flushed Call fsync() after write
B-tree invariant violation Search fails Split not propagated correctly Verify parent updated
Buffer pool deadlock Program hangs Page pinned but not unpinned Always unpin after use
WAL not durable Lost data on crash WAL not fsynced before commit fsync WAL on commit
Memory leak Growing memory Pages not evicted Check pin counts

Debugging Tools:

# Hex dump pages
xxd -g 4 data.db | head -100

# Visualize B-tree (custom script)
./btree_dump data.db | dot -Tpng > btree.png

# Check file sync
strace -e fsync ./mydb test.db

8. Extensions & Challenges

  1. Secondary Indexes: Non-primary key indexes
  2. JOIN Operations: Nested loop, hash join
  3. Query Optimizer: Cost-based optimization
  4. MVCC: Multi-version concurrency control
  5. Compression: Page-level compression
  6. Replication: Primary-replica setup

9. Real-World Connections

  • SQLite: Excellent codebase to study, similar architecture
  • LevelDB/RocksDB: LSM-tree based, different tradeoffs
  • PostgreSQL: Production RDBMS, open source
  • BoltDB: Simple B-tree database in Go

10. Resources


11. Self-Assessment Checklist

  • DiskManager reads/writes pages correctly
  • BufferPool evicts LRU pages
  • HeapFile inserts and scans rows
  • B-tree insert handles splits
  • B-tree search finds all inserted keys
  • SQL parser handles CREATE, INSERT, SELECT
  • Executor produces correct results
  • Transactions can be committed and rolled back
  • Data survives restart (persisted)
  • Crash recovery works (WAL replay)

12. Submission / Completion Criteria

Your implementation is complete when:

  1. Core SQL works: CREATE TABLE, INSERT, SELECT with WHERE
  2. Index used: Primary key lookups use B-tree
  3. ACID compliant: Transactions work correctly
  4. Durable: Data survives process restart
  5. Recoverable: Uncommitted transactions rolled back after crash
  6. Scalable: Handles 100K+ rows efficiently
  7. Can explain: You can whiteboard the architecture in an interview