Project 8: Simple Database Engine
Build a relational database engine with B-tree indexing, page-based storage, SQL parsing, and ACID transactions to master systems programming.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Expert |
| Time Estimate | 1-2 months |
| Language | C++ |
| Prerequisites | B-tree data structures, file I/O, SQL basics, transaction concepts |
| Key Topics | Page storage, buffer pool, B-tree, SQL parsing, WAL, ACID |
1. Learning Objectives
By completing this project, you will:
- Implement page-based storage: Manage fixed-size pages on disk
- Build a buffer pool: Cache pages in memory with eviction policies
- Implement B-tree indexing: Insert, search, and split operations
- Parse SQL statements: Tokenize and parse SELECT, INSERT, CREATE TABLE
- Understand ACID properties: Atomicity, Consistency, Isolation, Durability
- Implement Write-Ahead Logging: Ensure durability through logging
- Design storage formats: Row layout, page format, metadata storage
2. Theoretical Foundation
2.1 Core Concepts
Why Databases Use Pages:
Traditional File I/O: Database Page I/O:
┌─────────────────────┐ ┌─────────────────────┐
│ Read 50 bytes here │ │ Read 4KB page │
│ Read 30 bytes there │ │ (contains many rows)│
│ Read 100 bytes... │ │ │
│ Many small I/Os │ │ One I/O, use cache │
└─────────────────────┘ └─────────────────────┘
Disk seeks are ~10ms. Reading 4KB takes ~0.05ms.
Once you seek, read more! Pages amortize seek cost.
Page Layout (Slotted Page):
┌─────────────────────────────────────────────────────────┐
│ Page Header (16 bytes) │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ page_id │ num_slots │ free_space_ptr │ page_type │ │
│ └─────────────────────────────────────────────────────┘ │
├─────────────────────────────────────────────────────────┤
│ Slot Directory (grows down) │
│ ┌───────┬───────┬───────┬───────┬─────────────────────┐ │
│ │Slot 0 │Slot 1 │Slot 2 │Slot 3 │ ... │ │
│ │off,len│off,len│off,len│off,len│ │ │
│ └───────┴───────┴───────┴───────┴─────────────────────┘ │
├─────────────────────────────────────────────────────────┤
│ │
│ Free Space (grows both ways) │
│ │
├─────────────────────────────────────────────────────────┤
│ Record Data (grows up from bottom) │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Record 3 │ Record 2 │ Record 1 │ Record 0 │ │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
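The slotted layout above can be sketched in code. This is a minimal, illustrative version (names like `insertRecord` and the exact header fields are assumptions, not the project's required API): the slot directory grows down from the header while record data grows up from the end of the page.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <new>

constexpr size_t PAGE_SIZE = 4096;

// Toy slotted page: header, slot directory growing down, records growing up.
struct SlottedPage {
    struct Header {
        uint16_t num_slots = 0;
        uint16_t free_end = static_cast<uint16_t>(PAGE_SIZE);  // bottom of free space
    };
    struct Slot { uint16_t offset; uint16_t length; };

    alignas(8) char buf[PAGE_SIZE];

    SlottedPage() { new (buf) Header{}; }
    Header* header() { return reinterpret_cast<Header*>(buf); }
    Slot* slots() { return reinterpret_cast<Slot*>(buf + sizeof(Header)); }

    // Returns the new slot id, or -1 if directory and data would collide.
    int insertRecord(const void* data, uint16_t len) {
        Header* h = header();
        size_t dir_end = sizeof(Header) + (h->num_slots + 1) * sizeof(Slot);
        if (dir_end + len > h->free_end) return -1;   // page full
        h->free_end -= len;                           // records grow up from the bottom
        std::memcpy(buf + h->free_end, data, len);
        slots()[h->num_slots] = Slot{h->free_end, len};  // directory grows down
        return h->num_slots++;
    }

    const char* getRecord(int slot_id, uint16_t* len_out) {
        Slot s = slots()[slot_id];
        *len_out = s.length;
        return buf + s.offset;
    }
};
```

Because records are addressed through slots, a record can later be moved within the page (e.g., during compaction) without invalidating its slot id.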
B-Tree Structure:
┌─────────────────────┐
│ Root (Internal) │
│ [30] [60] │
└──────────┬──────────┘
┌────────────────┼────────────────┐
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Internal │ │ Internal │ │ Internal │
│ [10][20] │ │ [40][50] │ │ [70][80] │
└───────┬───────┘ └───────┬───────┘ └───────┬───────┘
┌────┬────┐ ┌────┬────┐ ┌────┬────┐
▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼
┌───┐┌───┐┌───┐ ┌───┐┌───┐┌───┐ ┌───┐┌───┐┌───┐
│5,8││12,││25,│ │35,││45,││55,│ │65,││75,││85,│
│ ││18 ││28 │ │38 ││48 ││58 │ │68 ││78 ││90 │
└───┘└───┘└───┘ └───┘└───┘└───┘ └───┘└───┘└───┘
Leaf Nodes (contain actual row pointers or data)
B-Tree Properties:
- All leaves at same depth
- Nodes are at least half full
- Search: O(log n)
- Insert: O(log n) + possible splits
- Perfect for disk: wide and shallow
Buffer Pool Architecture:
┌─────────────────────────────────────────────────────────────────┐
│ Buffer Pool │
│ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Page Frames (Memory) │ │
│ │ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │ │
│ │ │Page 12 │ │Page 45 │ │Page 3 │ │Page 89 │ │ Empty │ │ │
│ │ │dirty=1 │ │dirty=0 │ │dirty=1 │ │dirty=0 │ │ │ │ │
│ │ │pin=2 │ │pin=0 │ │pin=1 │ │pin=0 │ │ │ │ │
│ │ └────────┘ └────────┘ └────────┘ └────────┘ └────────┘ │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Page Table: page_id -> frame_index │ │
│ │ { 12 -> 0, 45 -> 1, 3 -> 2, 89 -> 3 } │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ LRU List (for eviction): [45, 89, 3, 12] │ │
│ │ (45 is least recently used, evict first) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
│ ▲
│ Evict/Flush │ Fetch
▼ │
┌─────────────────────────────────────────────────────────────────┐
│ Disk (data.db file) │
│ Page 0 │ Page 1 │ Page 2 │ Page 3 │ ... │ Page N │ │
└─────────────────────────────────────────────────────────────────┘
Write-Ahead Logging (WAL):
Transaction Flow
│
┌──────────────────┼──────────────────┐
│ │ │
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ BEGIN │ │ UPDATE │ │ COMMIT │
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
│ ▼ │
│ ┌────────────────────────┐ │
│ │ 1. Write to WAL first │ │
│ │ (log record) │ │
│ │ 2. Then modify page │ │
│ │ (in buffer pool) │ │
│ └────────────────────────┘ │
│ │
│ ┌──────────────────┘
│ ▼
│ ┌────────────────────────┐
│ │ 3. Flush WAL to disk │
│ │ 4. Return success │
│ │ (pages written later) │
│ └────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ WAL File (sequential writes - fast!) │
│ │
│ [LSN 1][LSN 2][LSN 3][LSN 4][LSN 5][LSN 6]... │
│ BEGIN UPDATE UPDATE UPDATE COMMIT BEGIN │
│ txn1 txn1 txn2 txn1 txn1 txn3 │
└─────────────────────────────────────────────────────┘
Recovery: Replay WAL to reconstruct committed transactions
2.2 Why This Matters
Databases are the foundation of almost every application:
- Web applications: PostgreSQL, MySQL
- Mobile apps: SQLite
- Enterprise: Oracle, SQL Server
- Analytics: ClickHouse, DuckDB
Building a database teaches:
- How data is organized on disk
- Why indexing matters (and when it doesn’t)
- How transactions maintain consistency
- Why databases seem magical but aren’t
2.3 Historical Context
Database systems evolved from file systems in the 1960s:
- 1970: E.F. Codd proposes relational model
- 1974: IBM's System R project begins, introducing SQL
- 1979: Oracle ships the first commercial SQL RDBMS
- 1986: POSTGRES project begins at UC Berkeley
- 1995: MySQL released
- 2000: SQLite released (embedded)
- 2017: CockroachDB, TiDB (distributed SQL)
Modern databases still use core concepts from the 1970s:
- B-trees (invented 1972)
- Write-ahead logging (1980s)
- MVCC (formalized in the early 1980s; used by PostgreSQL and InnoDB)
2.4 Common Misconceptions
Misconception 1: “Databases keep everything in memory” Reality: Databases use a buffer pool to cache pages. On memory pressure, pages are evicted to disk. This allows working with datasets larger than RAM.
Misconception 2: “B-trees are just binary trees” Reality: B-trees have high fanout (many children per node). A typical node might have 100-500 keys. This reduces tree height and disk I/O.
Misconception 3: “Indexes always make queries faster” Reality: Indexes speed up reads but slow down writes. Each INSERT must update every index. Too many indexes hurt performance.
Misconception 4: “COMMIT is when data is written to disk” Reality: COMMIT ensures the WAL is flushed to disk. Data pages may be written later (checkpoint). WAL enables fast commits with durability.
3. Project Specification
3.1 What You Will Build
A simple relational database that:
- Stores data in a page-based file format
- Uses a buffer pool for caching
- Supports B-tree indexes for fast lookup
- Parses and executes basic SQL
- Provides ACID transactions with WAL
3.2 Functional Requirements
| Requirement | Description |
|---|---|
| FR-1 | CREATE TABLE with INT, TEXT column types |
| FR-2 | INSERT INTO with values |
| FR-3 | SELECT * and SELECT columns with WHERE on primary key |
| FR-4 | Primary key constraint with B-tree index |
| FR-5 | BEGIN, COMMIT, ROLLBACK transactions |
| FR-6 | Persist data across restarts |
| FR-7 | Recover from crash using WAL |
3.3 Non-Functional Requirements
| Requirement | Description |
|---|---|
| NFR-1 | Handle tables with 1M+ rows |
| NFR-2 | Primary key lookup in O(log n) |
| NFR-3 | Buffer pool size configurable |
| NFR-4 | Corruption detection (checksums) |
| NFR-5 | Clean separation of components |
3.4 Example Usage / Output
$ ./mydb test.db
mydb> CREATE TABLE users (
id INT PRIMARY KEY,
name TEXT,
email TEXT
);
Table 'users' created.
mydb> INSERT INTO users VALUES (1, 'Alice', 'alice@example.com');
1 row inserted.
mydb> INSERT INTO users VALUES (2, 'Bob', 'bob@example.com');
1 row inserted.
mydb> INSERT INTO users VALUES (3, 'Charlie', 'charlie@example.com');
1 row inserted.
mydb> SELECT * FROM users;
+----+---------+---------------------+
| id | name | email |
+----+---------+---------------------+
| 1 | Alice | alice@example.com |
| 2 | Bob | bob@example.com |
| 3 | Charlie | charlie@example.com |
+----+---------+---------------------+
3 rows returned.
mydb> SELECT name, email FROM users WHERE id = 2;
+------+-----------------+
| name | email |
+------+-----------------+
| Bob | bob@example.com |
+------+-----------------+
1 row returned (index scan).
mydb> BEGIN;
Transaction started.
mydb> INSERT INTO users VALUES (4, 'David', 'david@example.com');
1 row inserted.
mydb> SELECT * FROM users WHERE id = 4;
+----+-------+-------------------+
| id | name | email |
+----+-------+-------------------+
| 4 | David | david@example.com |
+----+-------+-------------------+
1 row returned.
mydb> ROLLBACK;
Transaction rolled back.
mydb> SELECT * FROM users WHERE id = 4;
0 rows returned. -- David is gone!
mydb> .stats
Buffer pool: 128 pages (512 KB)
Hits: 1,234
Misses: 45
Hit ratio: 96.5%
Tables: 1
users: 3 rows, 1 page
mydb> .exit
Goodbye!
3.5 Real World Outcome
$ ./mydb test.db
# Load 1 million rows
mydb> .load users_1m.csv
Loaded 1,000,000 rows in 45 seconds.
# Point query (uses B-tree index)
mydb> SELECT * FROM users WHERE id = 500000;
1 row returned in 0.2ms (index scan, 4 page reads).
# Range query (requires table scan for now)
mydb> SELECT COUNT(*) FROM users WHERE name LIKE 'A%';
38,421 rows matched in 2.3 seconds (sequential scan).
# Transaction test
mydb> BEGIN;
mydb> UPDATE users SET email = 'test@test.com' WHERE id = 1;
mydb> -- Simulate crash: kill -9 $PID
$ ./mydb test.db
Recovering from WAL...
Replayed 0 committed transactions.
Discarded 1 uncommitted transaction.
Recovery complete.
mydb> SELECT * FROM users WHERE id = 1;
-- Original email, not 'test@test.com'!
4. Solution Architecture
4.1 High-Level Design
┌─────────────────────────────────────────────────────────────────────┐
│ Database Engine │
│ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ SQL Layer │ │
│ │ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ Parser │─►│ Analyzer │─►│ Planner │─►│ Executor │ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │ │
│ │ │ │
│ └──────────────────────────────┬───────────────────────────────┘ │
│ │ │
│ ┌──────────────────────────────▼───────────────────────────────┐ │
│ │ Access Methods │ │
│ │ │ │
│ │ ┌─────────────────┐ ┌─────────────────────────────────┐ │ │
│ │ │ Table (Heap) │ │ Index (B-Tree) │ │ │
│ │ │ - Insert row │ │ - Insert key │ │ │
│ │ │ - Scan all rows │ │ - Search key │ │ │
│ │ └────────┬────────┘ └────────┬────────────────────────┘ │ │
│ │ │ │ │ │
│ └───────────┼───────────────────────┼──────────────────────────┘ │
│ │ │ │
│ ┌───────────▼───────────────────────▼──────────────────────────┐ │
│ │ Buffer Pool │ │
│ │ │ │
│ │ getPage(page_id) -> Page* │ │
│ │ markDirty(page_id) │ │
│ │ unpin(page_id) │ │
│ │ flushPage(page_id) │ │
│ └───────────────────────────────┬───────────────────────────────┘ │
│ │ │
│ ┌───────────────────────────────▼───────────────────────────────┐ │
│ │ Disk Manager │ │
│ │ │ │
│ │ readPage(page_id, data) │ │
│ │ writePage(page_id, data) │ │
│ │ allocatePage() -> page_id │ │
│ └───────────────────────────────┬───────────────────────────────┘ │
│ │ │
└──────────────────────────────────┼────────────────────────────────────┘
▼
┌──────────┐
│ data.db │
│ wal.log │
└──────────┘
4.2 Key Components
| Component | Responsibility |
|---|---|
| Parser | Tokenize SQL, build AST |
| Analyzer | Validate table/column names, resolve types |
| Planner | Decide access method (index vs scan) |
| Executor | Execute the plan, return results |
| Table | Heap file operations (insert, scan) |
| Index | B-tree operations (insert, search) |
| Buffer Pool | Page caching with LRU eviction |
| Disk Manager | Raw page I/O |
| WAL | Write-ahead log for durability |
4.3 Data Structures
Page Header:
struct PageHeader {
uint32_t page_id;
uint16_t num_slots;
uint16_t free_space_offset;
uint16_t free_space_end;
PageType page_type; // HEAP_PAGE, BTREE_INTERNAL, BTREE_LEAF
uint32_t checksum;
};
B-Tree Node:
struct BTreeNode {
bool is_leaf;
uint16_t num_keys;
uint32_t keys[MAX_KEYS]; // For internal: separator keys
uint32_t children[MAX_KEYS+1]; // For internal: child page ids
uint32_t values[MAX_KEYS]; // For leaf: row ids / data pointers
uint32_t next_leaf; // For leaf: sibling pointer
};
Buffer Pool Frame:
struct Frame {
Page data;
page_id_t page_id;
bool is_dirty;
int pin_count;
};
4.4 Algorithm Overview
B-Tree Search:
search(node, key):
if node.is_leaf:
    for i in 0 .. node.num_keys - 1:
if node.keys[i] == key:
return node.values[i]
return NOT_FOUND
// Find child to descend
i = 0
while i < node.num_keys and key >= node.keys[i]:
i++
child = buffer_pool.getPage(node.children[i])
return search(child, key)
B-Tree Insert:
insert(key, value):
if root is null:
root = create_leaf_node()
root.insert(key, value)
return
result = insert_recursive(root, key, value)
if result.split_occurred:
// Root split, create new root
new_root = create_internal_node()
new_root.keys[0] = result.split_key
new_root.children[0] = root
new_root.children[1] = result.new_node
root = new_root
insert_recursive(node, key, value):
if node.is_leaf:
if node.has_space():
node.insert_sorted(key, value)
return NO_SPLIT
    else:
      // Split leaf: distribute the existing entries plus the new (key, value)
      new_leaf = create_leaf_node()
      (split_key, entries_for_new) = node.split_at_middle(key, value)
      new_leaf.insert_all(entries_for_new)
      new_leaf.next_leaf = node.next_leaf   // preserve the sibling chain
      node.next_leaf = new_leaf
      return SPLIT(split_key, new_leaf)
// Internal node
child_index = find_child(node, key)
child = getPage(node.children[child_index])
result = insert_recursive(child, key, value)
if result.split_occurred:
if node.has_space():
node.insert_child(child_index, result.split_key, result.new_node)
return NO_SPLIT
else:
      // Split internal node: include the new separator and child, then split
      new_internal = create_internal_node()
      (push_up_key, entries_for_new) =
          node.split_at_middle(result.split_key, result.new_node)
      new_internal.insert_all(entries_for_new)
      return SPLIT(push_up_key, new_internal)
return NO_SPLIT
5. Implementation Guide
5.1 Development Environment Setup
Required:
- C++17 or later
- CMake 3.14+
- File system with fsync support
Recommended:
# Install development tools
sudo apt install cmake g++ valgrind
# For visualization (optional)
sudo apt install graphviz # Provides 'dot' to render B-tree dumps (see Section 7)
5.2 Project Structure
database_engine/
├── include/
│ ├── common/
│ │ ├── types.hpp # page_id_t, slot_id_t, etc.
│ │ └── config.hpp # PAGE_SIZE, BUFFER_POOL_SIZE
│ ├── storage/
│ │ ├── disk_manager.hpp
│ │ ├── page.hpp
│ │ └── buffer_pool.hpp
│ ├── access/
│ │ ├── table_heap.hpp
│ │ └── btree_index.hpp
│ ├── catalog/
│ │ ├── schema.hpp
│ │ └── catalog.hpp
│ ├── sql/
│ │ ├── lexer.hpp
│ │ ├── parser.hpp
│ │ └── executor.hpp
│ └── recovery/
│ └── wal.hpp
├── src/
│ ├── storage/
│ ├── access/
│ ├── catalog/
│ ├── sql/
│ ├── recovery/
│ └── main.cpp
├── tests/
│ ├── test_buffer_pool.cpp
│ ├── test_btree.cpp
│ ├── test_sql.cpp
│ └── test_recovery.cpp
├── CMakeLists.txt
└── README.md
5.3 The Core Question You’re Answering
“How do I reliably store, index, and query structured data at scale?”
This is the fundamental question of data management. Understanding database internals helps you:
- Choose the right database for your use case
- Design efficient schemas
- Debug performance issues
- Build data systems that don’t lose data
5.4 Concepts You Must Understand First
- What is a B-tree and why is it preferred over binary trees for databases?
- High fanout reduces tree depth
- Matches disk page size
- Reference: “Introduction to Algorithms” Ch. 18
- What are the ACID properties?
- Atomicity: All or nothing
- Consistency: Rules always satisfied
- Isolation: Transactions don’t interfere
- Durability: Committed data survives crashes
- What is Write-Ahead Logging and why is it necessary?
- Write log before data pages
- Enables recovery after crash
- Reference: “Database Internals” Ch. 10
- What is a buffer pool and why is it needed?
- Pages cached in memory
- Eviction policy (LRU)
- Dirty page tracking
- What is the difference between a heap file and an indexed file?
- Heap: Unordered, append-only, O(n) search
- Indexed: Ordered by key, O(log n) search
5.5 Questions to Guide Your Design
Storage Format:
- Fixed-size or variable-size records?
- How to handle NULL values?
- How to store strings (length-prefixed, null-terminated)?
Buffer Pool:
- What eviction policy (LRU, Clock, LRU-K)?
- How to handle concurrent access?
- When to flush dirty pages?
B-Tree:
- What is the fanout (keys per node)?
- How to handle duplicate keys?
- Clustered or non-clustered index?
Transactions:
- What isolation level to support?
- How to detect conflicts?
- When to acquire/release locks?
5.6 Thinking Exercise
Design the page layout for a table with this schema:
CREATE TABLE products (
id INT PRIMARY KEY,
name TEXT, -- variable length, max 100 chars
price REAL, -- 8 bytes
in_stock BOOL -- 1 byte
);
Questions:
- How many bytes per row (fixed fields + variable fields)?
- How many rows fit in a 4KB page?
- Where do you store the variable-length name?
- How does the slot directory help with variable-length data?
5.7 Hints in Layers
Hint 1 - Starting Point (Conceptual): Build bottom-up: DiskManager -> BufferPool -> HeapFile -> BTree -> SQL Parser -> Executor. Test each layer before building the next.
Hint 2 - Next Level (More Specific):
DiskManager (simplest layer):
class DiskManager {
std::fstream file_;
std::atomic<page_id_t> next_page_id_;
public:
void readPage(page_id_t id, char* data) {
file_.seekg(id * PAGE_SIZE);
file_.read(data, PAGE_SIZE);
}
void writePage(page_id_t id, const char* data) {
file_.seekp(id * PAGE_SIZE);
file_.write(data, PAGE_SIZE);
        file_.flush(); // flush() only hands data to the OS; use fsync() for true durability
}
page_id_t allocatePage() {
return next_page_id_++;
}
};
Hint 3 - B-Tree Split:
// When a leaf node is full and we need to insert
int BTreeIndex::splitLeaf(BTreeNode* leaf, int key, RowId value) { // returns the split key
// Create new leaf
page_id_t new_page_id = buffer_pool_.allocatePage();
BTreeNode* new_leaf = reinterpret_cast<BTreeNode*>(
buffer_pool_.getPage(new_page_id)->data()
);
new_leaf->is_leaf = true;
new_leaf->num_keys = 0;
// Find split point (middle)
int total = leaf->num_keys + 1;
int split_idx = total / 2;
// Collect all entries + new entry, sort
std::vector<std::pair<int, RowId>> entries;
for (int i = 0; i < leaf->num_keys; i++) {
entries.push_back({leaf->keys[i], leaf->values[i]});
}
entries.push_back({key, value});
std::sort(entries.begin(), entries.end());
// First half stays in old leaf
leaf->num_keys = 0;
for (int i = 0; i < split_idx; i++) {
leaf->keys[leaf->num_keys] = entries[i].first;
leaf->values[leaf->num_keys] = entries[i].second;
leaf->num_keys++;
}
// Second half goes to new leaf
  for (size_t i = split_idx; i < entries.size(); i++) {
new_leaf->keys[new_leaf->num_keys] = entries[i].first;
new_leaf->values[new_leaf->num_keys] = entries[i].second;
new_leaf->num_keys++;
}
// Link leaves
new_leaf->next_leaf = leaf->next_leaf;
leaf->next_leaf = new_page_id;
// Return split key for parent
return new_leaf->keys[0];
}
Hint 4 - WAL Record Format:
struct WALRecord {
uint64_t lsn; // Log Sequence Number
uint32_t txn_id;
WALRecordType type; // BEGIN, INSERT, UPDATE, DELETE, COMMIT, ABORT
page_id_t page_id;
uint16_t offset;
uint16_t length;
char old_data[MAX_DATA]; // For UNDO
char new_data[MAX_DATA]; // For REDO
};
// Recovery algorithm (simplified ARIES):
void recover() {
// 1. Read WAL from beginning
// 2. Build table of active transactions
// 3. REDO: Replay all changes
// 4. UNDO: Rollback uncommitted transactions
}
5.8 The Interview Questions They’ll Ask
- “Why use a B-tree instead of a hash index?”
- B-tree supports range queries, hash doesn’t
- B-tree sorted traversal, hash is random
- B-tree handles growing data gracefully
- “Explain the purpose of Write-Ahead Logging”
- Ensure durability without flushing data pages on every commit
- Enable crash recovery by replaying log
- Sequential writes are faster than random writes
- “What happens during a checkpoint?”
- Flush all dirty pages to disk
- Write checkpoint record to WAL
- Can truncate old WAL entries
- Reduces recovery time
- “How does MVCC work?”
- Multiple versions of each row
- Each transaction sees consistent snapshot
- No read locks needed
- Reference: PostgreSQL, MySQL InnoDB
- “What is the difference between clustered and non-clustered indexes?”
- Clustered: Table data stored in index order (only one per table)
- Non-clustered: Index contains pointers to heap
5.9 Books That Will Help
| Topic | Book & Chapter |
|---|---|
| B-Trees | “Introduction to Algorithms” Ch. 18 - Cormen et al. |
| Database Storage | “Database Internals” Ch. 1-4 - Alex Petrov |
| Buffer Management | “Database Internals” Ch. 5 - Alex Petrov |
| Write-Ahead Logging | “Database Internals” Ch. 10 - Alex Petrov |
| SQL Processing | “Database System Concepts” Ch. 12-15 - Silberschatz |
| Transaction Processing | “Transaction Processing” - Jim Gray |
5.10 Implementation Phases
Phase 1: Storage Layer (Week 1)
- Implement DiskManager
- Implement Page class with header
- Implement BufferPool with LRU eviction
- Test: Read/write pages, eviction works
Phase 2: Table Access (Week 2)
- Implement HeapFile (table storage)
- Implement row serialization/deserialization
- Implement table scan
- Test: Insert rows, scan all rows
Phase 3: B-Tree Index (Week 3-4)
- Implement B-tree search
- Implement B-tree insert (without splits)
- Implement node splitting
- Test: Insert 10,000 keys, search all
Phase 4: SQL Parsing (Week 5)
- Implement SQL tokenizer
- Implement parser for CREATE, INSERT, SELECT
- Build AST representation
- Test: Parse various SQL statements
Phase 5: Execution (Week 6)
- Implement CREATE TABLE execution
- Implement INSERT execution
- Implement SELECT with WHERE
- Connect parser to executor
Phase 6: Transactions & WAL (Week 7-8)
- Implement WAL writer
- Implement BEGIN/COMMIT/ROLLBACK
- Implement crash recovery
- Test: Kill process, verify recovery
5.11 Key Implementation Decisions
| Decision | Recommended Choice | Reasoning |
|---|---|---|
| Page size | 4KB | Matches OS page size, good for SSDs |
| B-tree order | ~100 keys/node | Fits in one page, low tree height |
| Buffer pool size | 128 pages (512KB) | Good for testing, configurable |
| Serialization | Fixed-width + offset for variable | Simple, efficient |
| WAL format | Binary records | Compact, fast to write |
6. Testing Strategy
Unit Tests:
TEST(BufferPool, EvictsLRU) {
BufferPool pool(2); // Only 2 frames
auto p1 = pool.getPage(1);
auto p2 = pool.getPage(2);
pool.unpin(1);
pool.unpin(2);
// Access page 1 again (makes it recently used)
p1 = pool.getPage(1);
pool.unpin(1);
// Get page 3 - should evict page 2 (LRU)
auto p3 = pool.getPage(3);
EXPECT_FALSE(pool.contains(2)); // Page 2 evicted
EXPECT_TRUE(pool.contains(1)); // Page 1 still there
}
TEST(BTree, SearchAfterSplits) {
BTreeIndex index;
// Insert enough to cause multiple splits
for (int i = 0; i < 10000; i++) {
index.insert(i, RowId{0, i});
}
// Verify all keys findable
for (int i = 0; i < 10000; i++) {
auto result = index.search(i);
ASSERT_TRUE(result.has_value());
EXPECT_EQ(result->slot_id, i);
}
}
Integration Tests:
TEST(Database, CrashRecovery) {
{
Database db("test.db");
db.execute("CREATE TABLE t (id INT PRIMARY KEY)");
db.execute("BEGIN");
db.execute("INSERT INTO t VALUES (1)");
db.execute("COMMIT");
db.execute("BEGIN");
db.execute("INSERT INTO t VALUES (2)");
// No commit - simulate crash
}
// Reopen and recover
Database db("test.db");
auto result = db.execute("SELECT * FROM t");
EXPECT_EQ(result.rows.size(), 1); // Only committed row
EXPECT_EQ(result.rows[0][0], 1); // id = 1
}
7. Common Pitfalls & Debugging
| Problem | Symptom | Root Cause | Fix |
|---|---|---|---|
| Data corruption | Wrong values read | Page not flushed | Call fsync() after write |
| B-tree invariant violation | Search fails | Split not propagated correctly | Verify parent updated |
| Buffer pool deadlock | Program hangs | Page pinned but not unpinned | Always unpin after use |
| WAL not durable | Lost data on crash | WAL not fsynced before commit | fsync WAL on commit |
| Memory leak | Growing memory | Pages not evicted | Check pin counts |
Debugging Tools:
# Hex dump pages
xxd -g 4 data.db | head -100
# Visualize B-tree (custom script)
./btree_dump data.db | dot -Tpng > btree.png
# Check file sync
strace -e fsync ./mydb test.db
8. Extensions & Challenges
- Secondary Indexes: Non-primary key indexes
- JOIN Operations: Nested loop, hash join
- Query Optimizer: Cost-based optimization
- MVCC: Multi-version concurrency control
- Compression: Page-level compression
- Replication: Primary-replica setup
9. Real-World Connections
- SQLite: Excellent codebase to study, similar architecture
- LevelDB/RocksDB: LSM-tree based, different tradeoffs
- PostgreSQL: Production RDBMS, open source
- BoltDB: Simple B-tree database in Go
10. Resources
11. Self-Assessment Checklist
- DiskManager reads/writes pages correctly
- BufferPool evicts LRU pages
- HeapFile inserts and scans rows
- B-tree insert handles splits
- B-tree search finds all inserted keys
- SQL parser handles CREATE, INSERT, SELECT
- Executor produces correct results
- Transactions can be committed and rolled back
- Data survives restart (persisted)
- Crash recovery works (WAL replay)
12. Submission / Completion Criteria
Your implementation is complete when:
- Core SQL works: CREATE TABLE, INSERT, SELECT with WHERE
- Index used: Primary key lookups use B-tree
- ACID compliant: Transactions work correctly
- Durable: Data survives process restart
- Recoverable: Uncommitted transactions rolled back after crash
- Scalable: Handles 100K+ rows efficiently
- Can explain: You can whiteboard the architecture in an interview