Project 6: Capstone - Simple Database Engine
Build a minimal SQL database with WAL, B-tree primary index, LSM secondary index, and buffer pool.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Expert |
| Time Estimate | 40-80 hours |
| Main Programming Language | C |
| Alternative Programming Languages | Rust |
| Coolness Level | Level 5 |
| Business Potential | Infrastructure product |
| Prerequisites | All prior sprint projects |
| Key Topics | WAL, buffer pool, B-tree, LSM, SQL parsing |
1. Learning Objectives
By completing this project, you will:
- Integrate WAL-based durability with in-memory and on-disk indexes.
- Build a buffer pool with LRU eviction and dirty page management.
- Implement a B-tree primary index and an LSM-based secondary index.
- Parse and execute a minimal SQL subset with deterministic output.
- Implement crash recovery that restores consistent state.
2. All Theory Needed (Per-Concept Breakdown)
Write-Ahead Logging and Recovery in a Database
Fundamentals
Write-ahead logging (WAL) ensures that every change to the database is recorded on disk before it is applied to data pages. This allows the database to recover from crashes by replaying the log. The WAL stores logical or physical changes along with a sequence number. On startup, the database reads the log and replays committed updates to bring the database to a consistent state. WAL is the core durability mechanism used in production databases.
From a systems perspective, you should be able to explain the invariants this concept maintains, the resources it consumes (CPU, memory, IO), and the typical failure modes when it is misused. This basic framing will help you reason about correctness before you optimize or extend the design.
Deep Dive into the concept
In a database engine, WAL serves two goals: durability and atomicity. Durability means once a transaction is acknowledged, its changes survive crashes. Atomicity means partial changes are not visible. The WAL protocol is simple: before modifying a data page, write a log record describing the change, then flush the log to disk. Only after the log is durable can you apply the change to the page. This is the write-ahead rule.
A minimal WAL record can include: record type (INSERT/DELETE), table id, key, value, and a sequence number. The sequence number (LSN) is monotonically increasing and provides ordering. When a data page is modified, it stores the latest LSN applied. On recovery, you read the log and apply records with LSN greater than the page LSN. This prevents reapplying changes that are already on disk. This is the essence of redo logging.
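As a concrete illustration, here is a minimal sketch of a fixed-size WAL record and its append path in C. The struct fields and the `wal_append()` helper are illustrative choices, not a prescribed format; the only rule the protocol imposes is that the record must reach disk before the statement is acknowledged.

```c
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>   /* fsync(); fileno() is POSIX */

enum wal_type { WAL_INSERT = 1, WAL_DELETE = 2 };

typedef struct {
    uint64_t lsn;        /* monotonically increasing sequence number */
    uint32_t type;       /* enum wal_type */
    uint32_t table_id;   /* which table the change belongs to */
    int64_t  key;        /* primary key */
    char     value[64];  /* fixed-size payload keeps the format simple */
} wal_record_t;

/* Append one record; only acknowledge the statement after this succeeds. */
static int wal_append(FILE *log, const wal_record_t *rec, int do_fsync)
{
    if (fwrite(rec, sizeof *rec, 1, log) != 1)
        return -1;
    if (fflush(log) != 0)
        return -1;
    if (do_fsync && fsync(fileno(log)) != 0)
        return -1;       /* not durable: the caller must not report OK */
    return 0;
}
```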
Undo logging is more complex and is not required for this project. We will focus on single-statement atomicity: each statement is logged and considered committed once the log is flushed. This means there are no multi-statement transactions. However, you should still handle crashes during a statement. For example, if you write the log but crash before updating the data pages, recovery will replay the log and apply the update. If you update pages but crash before log flush, the update should not be considered committed. This is why the log must be flushed before the data page write is acknowledged.
Recovery requires a clear ordering. On startup, you load the catalog (table definitions), then build the buffer pool, then replay the WAL. The WAL replay must apply updates to the B-tree and LSM structures in the same order they were logged. This ensures that the indexes stay consistent. You should design the log so that each record can be applied idempotently. If recovery replays a record twice, it should not break correctness. One way is to use sequence numbers and ignore records with LSN <= page LSN.
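A sketch of that idempotence check during redo, assuming each page tracks the LSN of the last change applied to it. `page_for()` and `redo_apply()` are hypothetical helpers standing in for your own page lookup and index-update code.

```c
#include <stdint.h>

typedef struct { uint64_t page_lsn; /* ... */ } page_t;   /* trimmed to what replay needs */

/* Hypothetical helpers assumed by this sketch: */
page_t *page_for(uint32_t table_id, int64_t key);                     /* locate the target page */
void    redo_apply(uint32_t table_id, int64_t key, const char *value); /* reapply the logged change */

/* Redo one WAL record idempotently: records at or below the page LSN were
 * already reflected on disk before the crash and are skipped. */
static void redo_record(uint64_t rec_lsn, uint32_t table_id,
                        int64_t key, const char *value)
{
    page_t *page = page_for(table_id, key);
    if (page != NULL && rec_lsn <= page->page_lsn)
        return;                       /* already on disk: skip */
    redo_apply(table_id, key, value);
    if (page != NULL)
        page->page_lsn = rec_lsn;     /* record progress so a second replay skips it */
}
```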
The WAL also interacts with checkpoints. A checkpoint is a moment when you flush all dirty pages to disk and record a checkpoint in the log. After a checkpoint, you only need to replay log records after that point. This reduces recovery time. For this project, implement a simple checkpoint command that flushes the buffer pool and truncates the WAL. This gives you practical insight into how databases manage log growth.
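A minimal sketch of such a checkpoint, assuming a hypothetical `bp_flush_all()` that writes and fsyncs every dirty page; only after that is it safe to truncate the log.

```c
#include <stdio.h>
#include <unistd.h>   /* ftruncate(), fileno() */

/* Hypothetical helper assumed by this sketch: */
int bp_flush_all(void);   /* write and fsync every dirty buffer-pool page */

/* Flush all dirty pages, then truncate the log so recovery starts empty. */
static int checkpoint(FILE *wal)
{
    if (bp_flush_all() != 0)
        return -1;                    /* data files must be durable first */
    if (fflush(wal) != 0 || ftruncate(fileno(wal), 0) != 0)
        return -1;
    rewind(wal);                      /* the next append starts a fresh log */
    return 0;
}
```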
Finally, WAL is closely tied to write performance. Writing to a sequential log is fast, but fsync is expensive. You can provide a --fsync option to control durability. For deterministic demos, always fsync. For performance experiments, allow batching. This makes the tradeoff explicit.
In practice, you should also think about log truncation. Without truncation, the WAL grows indefinitely and recovery becomes slow. A simple checkpoint that flushes dirty pages and records the last applied LSN lets you safely truncate older log records. Even if you implement this in a minimal form, it forces you to reason about what state is already reflected on disk and what still lives only in the log. This is a core durability concept that separates toy systems from real ones.
How this fits into the projects
WAL is central to §3.2, §3.7, and §5.10 Phase 1. It is tested in §6.2 crash recovery. It is also used in P02-log-structured-kv-store.md.
Definitions & key terms
- WAL -> write-ahead log
- LSN -> log sequence number
- redo logging -> reapply changes after crash
- checkpoint -> point after which older log records can be discarded
Mental model diagram
client write -> WAL append -> fsync -> apply to pages
crash -> replay WAL -> restore pages
How it works (step-by-step, with invariants and failure modes)
- Serialize change into WAL record with LSN.
- Append to WAL and fsync.
- Apply change to indexes and mark pages dirty.
- On crash, replay WAL records in order.
Invariants: log is append-only; page LSN <= latest WAL LSN. Failure modes: missing fsync or non-idempotent replay.
Minimal concrete example
LSN=10 INSERT users id=1 name=alice
Common misconceptions
- “WAL is optional” -> without it, crashes corrupt the database.
- “Log flush can be skipped” -> then durability is not guaranteed.
- “Recovery is just replay” -> you must avoid reapplying changes incorrectly.
Check-your-understanding questions
- Why must the log be flushed before page updates are acknowledged?
- What is a checkpoint and why is it useful?
- How does LSN prevent double-apply during recovery?
Check-your-understanding answers
- Otherwise a crash could lose the log but keep page updates, violating durability.
- It limits how much log must be replayed on startup.
- Pages store the last applied LSN; records below that are skipped.
Real-world applications
- PostgreSQL and MySQL WAL (redo logs).
- SQLite journal mode.
Where you’ll apply it
- This project: §3.2, §3.7, §5.10 Phase 1, §6.2.
- Also used in: P02-log-structured-kv-store.md.
References
- Database Internals, recovery chapters
- Designing Data-Intensive Applications, Ch. 3
Key insights
WAL is the contract between memory and disk: log first, then data.
Summary
Write-ahead logging is the durability backbone of the database. It enables crash recovery, checkpoints, and consistent index updates.
Homework/Exercises to practice the concept
- Implement a WAL with LSN and a replay tool.
- Simulate a crash by killing the process and verify recovery output.
Solutions to the homework/exercises
- Append records with sequence numbers and replay them in order.
- Restart and verify that all committed inserts appear.
Buffer Pool and LRU Page Management
Fundamentals
A buffer pool caches database pages in memory to reduce disk IO. Each page can be clean or dirty. When a page is modified, it is marked dirty and written back later. The buffer pool uses an eviction policy such as LRU to decide which page to evict when memory is full. This is similar to a cache but with the added responsibility of writing back dirty pages.
From a systems perspective, you should be able to explain the invariants this concept maintains, the resources it consumes (CPU, memory, IO), and the typical failure modes when it is misused. This basic framing will help you reason about correctness before you optimize or extend the design.
Deep Dive into the concept
The buffer pool is the core performance layer of a database engine. Without it, every query would trigger disk reads, which is too slow. The buffer pool stores a fixed number of pages in memory. Each page has metadata: page id, pin count (how many users are holding it), dirty flag, and last access for LRU. When a request asks for a page, the buffer pool checks if it is in cache. If so, it returns it and updates LRU. If not, it loads the page from disk, possibly evicting another page. Eviction is only allowed for unpinned pages. If the evicted page is dirty, it must be flushed to disk first.
A key design choice is the LRU representation. You can use an intrusive list as in Project 1. Each cached page is a node in the list. On access, move it to the head. Eviction removes from the tail. This is O(1) and deterministic. The buffer pool must also integrate with WAL: before flushing a dirty page, ensure that the WAL is flushed up to the page LSN. This is the write-ahead rule at the page level. This rule prevents a dirty page from reaching disk before the corresponding log records, which would break recovery.
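The sketch below shows one way to lay out frame metadata and the eviction rule, assuming an intrusive LRU list as in Project 1. `wal_flush_to()` and `write_page()` are hypothetical helpers for the log and the data file.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096

typedef struct frame {
    uint64_t      page_id;
    uint64_t      page_lsn;     /* LSN of the last change applied to this page */
    int           pin_count;    /* > 0 means the page may not be evicted */
    bool          dirty;
    struct frame *lru_prev, *lru_next;   /* intrusive LRU list */
    unsigned char data[PAGE_SIZE];
} frame_t;

/* Hypothetical helpers assumed by this sketch: */
int wal_flush_to(uint64_t lsn);      /* make the log durable up to lsn */
int write_page(const frame_t *f);    /* write the page image back to data.db */

/* Evict one frame from the LRU tail: skip pinned frames and obey the
 * write-ahead rule before writing a dirty page back. */
static int evict_frame(frame_t *victim)
{
    if (victim->pin_count > 0)
        return -1;                   /* caller tries the next LRU candidate */
    if (victim->dirty) {
        if (wal_flush_to(victim->page_lsn) != 0)
            return -1;               /* log must reach disk before the page */
        if (write_page(victim) != 0)
            return -1;
        victim->dirty = false;
    }
    return 0;                        /* frame can now be reused */
}
```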
Pinning is important. When a page is fetched, it is pinned to prevent eviction. When the caller is done, it unpins. If a page is pinned, eviction must skip it. If all pages are pinned, the buffer pool cannot load new pages and must return an error. In a single-threaded engine, pinning can be simplified, but you should still implement it to model real behavior. This also helps you avoid bugs where a page is evicted while still in use.
The buffer pool interacts with concurrency. In a real engine, each page would have a latch to protect it from concurrent access. For this project, you can omit latches, but design the API to make it possible later. This means separating “fetch” (pin) from “unpin” and ensuring the buffer pool does not expose internal structures.
You should track buffer pool stats: hit rate, miss rate, dirty writes, and evictions. These are key performance metrics. For deterministic tests, use fixed page access sequences and verify expected hit/miss counts.
If you want to make the buffer pool more realistic, add a small "flush scheduler" that periodically writes dirty pages in the background (or on a manual FLUSH command). This helps you observe how write-back interacts with WAL ordering. You can also add prefetching for sequential scans: when a query scans a leaf page, prefetch the next leaf into the cache. This improves scan throughput and demonstrates that caches can be proactive. These small enhancements deepen your understanding of how a buffer pool supports both random access and sequential workloads.
One more operational detail is sizing. A buffer pool that is too small will thrash, evicting pages before they can be reused. A pool that is too large may waste memory and reduce cache locality in the rest of the process. You can expose a --buffer-mb flag and run the same workload at different sizes to see how hit rate changes. This reinforces the idea that caching is a quantitative problem: you must choose sizes based on workload, not intuition.
How this fits into the projects
Buffer pool is required in §3.2 and §3.7. It is implemented in §5.10 Phase 2 and tested in §6.2. It reuses LRU logic from P01-in-memory-cache-lru.md.
Definitions & key terms
- buffer pool -> in-memory cache of pages
- dirty page -> modified page not yet flushed
- pin -> mark page as in use
- LRU -> least recently used eviction policy
Mental model diagram
page request -> cache hit? -> yes -> move to LRU head
no -> load -> evict tail (if needed)
How it works (step-by-step, with invariants and failure modes)
- Fetch page id from buffer pool.
- If cached, increment pin and move to LRU head.
- If not cached, evict tail page (if unpinned), flush if dirty.
- Load page from disk, insert into cache, pin.
- Unpin when done.
Invariants: pinned pages are not evicted; dirty pages are flushed with WAL ordering. Failure modes: eviction of pinned pages or lost writes.
Minimal concrete example
page_t *p = bp_fetch(id);
// use page
bp_unpin(p, dirty);
Common misconceptions
- “Buffer pool is just a cache” -> it must also manage dirty pages and WAL ordering.
- “LRU is perfect” -> it is a heuristic, but adequate for this project.
- “Pinning is optional” -> without it, pages can be evicted while in use.
Check-your-understanding questions
- Why must dirty pages be flushed after WAL is flushed?
- What happens if all pages are pinned?
- Why does LRU help performance?
Check-your-understanding answers
- To preserve the write-ahead rule and ensure recovery correctness.
- The buffer pool cannot load new pages and must return an error.
- It keeps recently used pages in memory, reducing disk IO.
Real-world applications
- Database buffer pools (PostgreSQL, MySQL).
- Filesystem page caches.
Where you’ll apply it
- This project: §3.2, §3.7, §5.10 Phase 2.
- Also used in: P01-in-memory-cache-lru.md.
References
- Database Internals, buffer pool chapter
- OSTEP, caching chapters
Key insights
A buffer pool is a cache with durability responsibilities; it must respect WAL ordering.
Summary
The buffer pool minimizes disk IO and enforces safe write-back. It uses LRU, pinning, and dirty tracking to manage pages.
Homework/Exercises to practice the concept
- Implement an LRU cache for fixed-size pages.
- Add dirty tracking and an explicit flush operation.
Solutions to the homework/exercises
- Use a hash map and a doubly linked list for O(1) LRU.
- Mark dirty pages and write them back on flush or eviction.
B-Tree Primary Index
Fundamentals
The primary index maps the primary key to the full row location. A B-tree is ideal because it provides O(log n) lookups and efficient range scans. The primary index is stored on disk in pages and is managed by the buffer pool. Insertions and lookups must maintain B-tree invariants, and updates must be crash-safe using WAL.
From a systems perspective, you should be able to explain the invariants this concept maintains, the resources it consumes (CPU, memory, IO), and the typical failure modes when it is misused. This basic framing will help you reason about correctness before you optimize or extend the design.
Deep Dive into the concept
The B-tree primary index is the main access path for SELECT queries with a primary key. Each leaf stores key and row pointer (or the row itself in a heap file). For this project, you can store the full row in the leaf value to keep the design simple. The B-tree pages are loaded into the buffer pool when accessed. Inserts require searching to the leaf, inserting the key, and splitting if necessary. Each page modification should be logged in the WAL before it is written back to disk.
A key design choice is the row format. You can store fixed-length rows for simplicity: an integer id and a fixed-length name. This makes serialization straightforward. The B-tree leaf stores key and value in fixed-size slots, similar to the Project 5 node layout. The primary index should be unique; inserting a duplicate key should replace the existing row or return an error. Choose one behavior and make it consistent.
The primary index is also used for range queries. If you support SELECT * FROM users WHERE id BETWEEN 10 AND 20, you can perform a range scan starting at key 10 and follow linked leaves. This reuses the range scan concept from the B-tree project. The primary index also interacts with secondary indexes: when a secondary index lookup finds a primary key, it uses the primary index to fetch the full row.
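A sketch of such a range scan over linked leaves, assuming fixed-size rows and a hypothetical `btree_find_leaf()` that descends to the leaf containing the start key. The leaf layout here is illustrative.

```c
#include <stdint.h>
#include <stdio.h>

#define NAME_LEN   64
#define LEAF_SLOTS 32

typedef struct {
    int64_t id;
    char    name[NAME_LEN];
} row_t;

typedef struct leaf {
    int          nkeys;
    row_t        rows[LEAF_SLOTS];   /* sorted by id */
    struct leaf *next;               /* right-sibling link used by scans */
} leaf_t;

/* Hypothetical helper assumed by this sketch: */
leaf_t *btree_find_leaf(int64_t key);   /* descend to the leaf that would hold key */

static void range_scan(int64_t lo, int64_t hi)
{
    for (leaf_t *leaf = btree_find_leaf(lo); leaf != NULL; leaf = leaf->next) {
        for (int i = 0; i < leaf->nkeys; i++) {
            if (leaf->rows[i].id > hi)
                return;                          /* keys are sorted: scan is done */
            if (leaf->rows[i].id >= lo)
                printf("%s\n", leaf->rows[i].name);
        }
    }
}
```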
Crash safety is crucial. If you insert a row, the WAL must record the insert before the page is dirtied. On recovery, you replay inserts to restore the index. A simple approach is to log the logical insert and reapply it during recovery rather than logging physical page changes. This is called logical logging. It is easier to implement but can be slower during recovery. For this project, logical logging is fine.
You should also think about how the primary index stores rows. If you store rows directly in the leaf, updates require rewriting the leaf page. This is fine for fixed-size rows, but for variable-length rows you would need a slotted page layout. A slotted page stores a header with free space and a slot array that points to variable-length records. That is more complex than this project needs, but understanding the idea helps you see why many databases separate the index from the heap storage. In a clustered index (like InnoDB), the primary index stores the row in the leaf. In a heap-based design, the primary index stores a pointer to a row stored elsewhere. For this project, storing fixed-size rows in leaves keeps the design simple and deterministic.
Uniqueness enforcement is another practical detail. A primary key must be unique, so your insert path should check for an existing key before inserting. This requires a lookup during insert. If the key exists, return ERR duplicate_key and do not modify the tree. This seems obvious, but it matters for correctness: if you log the insert before checking duplicates, recovery could replay a failed insert. Therefore, you should check for duplicates first, then log and apply the insert. This ordering keeps WAL semantics correct and demonstrates that logical constraints affect the logging protocol.
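The ordering above can be captured in a short sketch: check the constraint, then log, then apply. `btree_lookup()`, `wal_append_insert()`, `btree_insert()`, and `lsm_put()` are hypothetical helpers, and the error constants are illustrative.

```c
#include <stdint.h>

typedef struct { int64_t id; char name[64]; } row_t;

/* Hypothetical helpers assumed by this sketch: */
const row_t *btree_lookup(int64_t id);                         /* primary index search */
int          wal_append_insert(int64_t id, const char *name);  /* log the insert + fsync */
void         btree_insert(int64_t id, const char *name);       /* apply to the B-tree */
void         lsm_put(const char *name, int64_t id);            /* secondary index entry */

enum { OK = 0, ERR_DUPLICATE_KEY = -1, ERR_IO = -2 };

static int exec_insert(int64_t id, const char *name)
{
    if (btree_lookup(id) != NULL)
        return ERR_DUPLICATE_KEY;   /* reject before anything reaches the log */
    if (wal_append_insert(id, name) != 0)
        return ERR_IO;              /* not durable: leave both indexes untouched */
    btree_insert(id, name);         /* primary index (may dirty pages) */
    lsm_put(name, id);              /* secondary index: name -> primary key */
    return OK;
}
```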
How this fits into the projects
The B-tree primary index is the core of §3.2 and §4.2. It is implemented in §5.10 Phase 3 and tested in §6.2. It builds directly on P05-btree-file-index.md.
Definitions & key terms
- primary index -> unique index on primary key
- row pointer -> reference to row storage location
- logical logging -> WAL records describe logical operations
Mental model diagram
primary key -> B-tree -> leaf -> row
How it works (step-by-step, with invariants and failure modes)
- Parse INSERT statement and extract key/value.
- Append WAL record and fsync.
- Insert into B-tree leaf; split if needed.
- Mark pages dirty in buffer pool.
Invariants: keys sorted; primary key unique. Failure modes: duplicate keys or lost updates after crash.
Minimal concrete example
INSERT users (id=1, name=alice) -> WAL -> B-tree insert
Common misconceptions
- “Primary index is optional” -> it is required for fast key lookups.
- “B-tree updates are atomic” -> they require WAL for crash safety.
- “Rows must be separate” -> for this project, storing rows in leaves is fine.
Check-your-understanding questions
- Why is a B-tree good for primary indexes?
- What is the difference between logical and physical logging?
- What happens if you insert a duplicate key?
Check-your-understanding answers
- It minimizes disk IO with high fanout and supports range scans.
- Logical logging stores operations; physical logging stores page changes.
- It should either update existing row or return an error consistently.
Real-world applications
- MySQL InnoDB clustered primary index.
- SQLite B-tree table storage.
Where you’ll apply it
- This project: §3.2, §4.2, §5.10 Phase 3.
- Also used in: P05-btree-file-index.md.
References
- Database Internals, index chapters
- Algorithms, Fourth Edition, B-tree sections
Key insights
The primary index is the main access path; its correctness and durability define the database.
Summary
A B-tree primary index provides fast lookups and ordered scans. It must integrate with WAL and the buffer pool for durability.
Homework/Exercises to practice the concept
- Implement a small in-memory B-tree with inserts and range scans.
- Add logical logging for inserts and replay them.
Solutions to the homework/exercises
- Use fixed-size nodes and split on overflow.
- Write insert records to log and replay on startup.
LSM-Based Secondary Index
Fundamentals
A secondary index maps non-primary columns (like name) to primary keys. Building a B-tree for each secondary index can be expensive. An LSM-based secondary index uses a memtable and SSTables to optimize writes. Queries on the secondary index return a list of primary keys, which are then looked up in the primary index. This design trades write performance for read cost, which is often acceptable for secondary indexes.
From a systems perspective, you should be able to explain the invariants this concept maintains, the resources it consumes (CPU, memory, IO), and the typical failure modes when it is misused. This basic framing will help you reason about correctness before you optimize or extend the design.
Deep Dive into the concept
Secondary indexes are optional in a minimal database, but they are essential for real workloads. An LSM-based secondary index leverages the log-structured design from Project 2. Each secondary index has its own memtable and SSTables. When a row is inserted, you insert a mapping from secondary key to primary key into the secondary index. If multiple rows share the same secondary key, you store a list or a separate entry per primary key. For simplicity, you can store duplicate entries and return all matches.
Using LSM for secondary indexes has two advantages: fast writes and easy batching. Because you already implement a WAL for the primary data, you can reuse it or create a separate log for the secondary index. A single WAL that records both primary and secondary updates ensures consistency. On recovery, you replay the WAL to rebuild both indexes. This is important because the secondary index must reflect the primary data.
Reads from the secondary index require searching the memtable and SSTables, just like in an LSM store. This can be slower than a B-tree, but you can use Bloom filters to reduce disk reads. Once you obtain primary keys, you perform primary index lookups. This is a two-step process and highlights the tradeoff: secondary index queries are more expensive but writes are faster. This is common in real systems; for example, Cassandra uses LSM for base storage and maintains secondary indexes separately.
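A sketch of the two-step read path, with hypothetical `lsm_get_all()` and `btree_lookup()` helpers standing in for the secondary and primary index code.

```c
#include <stdint.h>
#include <stdio.h>

typedef struct { int64_t id; char name[64]; } row_t;

/* Hypothetical helpers assumed by this sketch: */
int          lsm_get_all(const char *name, int64_t *out, int max); /* search memtable + SSTables */
const row_t *btree_lookup(int64_t id);                             /* primary index fetch */

static void select_by_name(const char *name)
{
    int64_t keys[128];
    int n = lsm_get_all(name, keys, 128);         /* step 1: secondary index lookup */
    for (int i = 0; i < n; i++) {
        const row_t *row = btree_lookup(keys[i]); /* step 2: fetch the full row */
        if (row != NULL)
            printf("%lld %s\n", (long long)row->id, row->name);
    }
}
```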
Compaction is also important. If you never compact the secondary index, reads will slow down. You can reuse the compaction logic from Project 2. Since secondary indexes are typically smaller than primary data, compaction may be cheaper. You should still track write amplification and surface it in stats.
A subtle consistency issue arises when updating or deleting rows. If a row’s secondary key changes, you must delete the old mapping and insert the new one. This is handled by tombstones in the LSM index. For this project, you can keep it simple by making data immutable (no updates), or by supporting deletes with tombstones. Make the choice explicit. The key is to keep the secondary index consistent with primary data.
Another design choice is how to store multiple primary keys for the same secondary key. You can either store one entry per (secondary, primary) pair, which makes insertion simple and makes compaction a matter of keeping the newest entries, or you can store a single entry whose value is a list of primary keys. The list approach reduces key duplication but complicates updates and deletions. For this project, the one-entry-per-pair approach is simpler and aligns with LSM semantics. It also keeps the memtable and SSTables uniform: each entry is a single key/value record with fixed encoding. If you later add uniqueness constraints on the secondary key, you can enforce it by checking for existing entries before insert, similar to the primary index.
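One way to realize the one-entry-per-pair approach is a composite key that concatenates the secondary key with the primary key, as in this sketch; the fixed widths and big-endian encoding are illustrative choices that keep `memcmp()` ordering correct (first by name, then by primary key).

```c
#include <stdint.h>
#include <string.h>

#define SEC_KEY_LEN 64

typedef struct {
    char     sec[SEC_KEY_LEN];   /* e.g. name, NUL-padded to a fixed width */
    uint64_t pk_be;              /* primary key stored big-endian */
} sec_entry_t;

static void sec_entry_make(sec_entry_t *e, const char *name, uint64_t pk)
{
    memset(e, 0, sizeof *e);
    strncpy(e->sec, name, SEC_KEY_LEN - 1);
    /* Store the primary key big-endian so memcmp() over the whole struct
     * orders entries by (name, pk). */
    for (int i = 0; i < 8; i++)
        ((unsigned char *)&e->pk_be)[i] = (unsigned char)(pk >> (56 - 8 * i));
}
```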
You should also consider normalization of secondary keys. For example, if your database treats names as case-insensitive, you must normalize keys (e.g., lowercase) before inserting into the index. Otherwise, Alice and alice would be treated as different entries. This may sound like an application detail, but it affects index correctness. By choosing and documenting a normalization rule in this project, you make the index semantics explicit and avoid subtle bugs during queries.
How this fits into the projects
The LSM secondary index is required in §3.2 and implemented in §5.10 Phase 4. It is tested in §6.2 and builds directly on P02-log-structured-kv-store.md.
Definitions & key terms
- secondary index -> index on non-primary column
- memtable -> in-memory sorted structure
- SSTable -> immutable sorted file
- tombstone -> delete marker for index entries
Mental model diagram
name -> LSM index -> primary keys -> B-tree -> rows
How it works (step-by-step, with invariants and failure modes)
- On insert, add secondary key -> primary key entry in memtable.
- Flush memtable to SSTable when full.
- On query, search memtable then SSTables for matches.
- For each primary key, lookup in primary B-tree.
Invariants: secondary index reflects all primary rows; tombstones override old entries. Failure modes: stale entries if WAL replay incomplete.
Minimal concrete example
INSERT id=1 name=alice -> sec_index("alice") = [1]
SELECT * WHERE name=alice -> [1] -> primary lookup
Common misconceptions
- “Secondary index must be B-tree” -> LSM is often better for writes.
- “Secondary queries are cheap” -> they can be two-stage and slower.
- “Compaction is optional” -> it is required for read performance.
Check-your-understanding questions
- Why use LSM for secondary indexes?
- How do tombstones work in a secondary index?
- Why do secondary queries require primary lookups?
Check-your-understanding answers
- It converts random writes into sequential runs, improving write throughput.
- Tombstones mark deleted mappings and override older entries.
- The secondary index maps to primary keys, not full rows.
Real-world applications
- Cassandra secondary indexes.
- RocksDB-based systems with secondary indexes.
Where you’ll apply it
- This project: §3.2, §5.10 Phase 4, §6.2.
- Also used in: P02-log-structured-kv-store.md.
References
- Designing Data-Intensive Applications, Ch. 3
- Database Internals, index chapters
Key insights
An LSM secondary index trades read cost for fast writes and simple batching.
Summary
The secondary index uses the LSM structure to optimize writes. Queries are two-step: index lookup then primary fetch.
Homework/Exercises to practice the concept
- Build a small LSM index that maps strings to integers.
- Add a query that returns all primary keys for a given secondary key.
Solutions to the homework/exercises
- Use a sorted memtable and flush to an SSTable file.
- Merge results from memtable and SSTables, then query primary index.
SQL Parsing and Execution Pipeline
Fundamentals
A database engine must parse SQL, plan execution, and run operations against its indexes. For this project, you will support a minimal SQL subset: CREATE TABLE, INSERT, and SELECT with simple WHERE predicates. Parsing converts text into an abstract syntax tree (AST). Execution uses the AST to perform index operations. This pipeline is the interface between user queries and internal data structures.
From a systems perspective, you should be able to explain the invariants this concept maintains, the resources it consumes (CPU, memory, IO), and the typical failure modes when it is misused. This basic framing will help you reason about correctness before you optimize or extend the design.
Deep Dive into the concept
SQL parsing begins with tokenization: breaking the input into keywords, identifiers, numbers, and strings. A simple recursive descent parser can handle a small grammar. For example, CREATE TABLE users (id INT, name TEXT); can be parsed into a table definition with column names and types. INSERT INTO users VALUES (1, 'alice'); becomes an insert statement with values. SELECT name FROM users WHERE id = 2; becomes a select statement with a projection list and a predicate.
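A sketch of a hand-written tokenizer for this subset; the token names are illustrative, keyword detection is left to a lookup table, and escape handling inside strings is omitted for brevity.

```c
#include <ctype.h>
#include <stddef.h>

typedef enum { T_KEYWORD, T_IDENT, T_NUMBER, T_STRING, T_SYMBOL, T_END } tok_type_t;

typedef struct {
    tok_type_t type;
    char       text[64];
} token_t;

/* Scan one token starting at *p and advance *p past it. */
static token_t next_token(const char **p)
{
    token_t t = { T_END, "" };
    size_t n = 0;

    while (isspace((unsigned char)**p)) (*p)++;
    if (**p == '\0')
        return t;

    if (isalpha((unsigned char)**p)) {
        while ((isalnum((unsigned char)**p) || **p == '_') && n < sizeof t.text - 1)
            t.text[n++] = *(*p)++;
        t.type = T_IDENT;             /* caller checks a keyword table for CREATE, INSERT, ... */
    } else if (isdigit((unsigned char)**p)) {
        while (isdigit((unsigned char)**p) && n < sizeof t.text - 1)
            t.text[n++] = *(*p)++;
        t.type = T_NUMBER;
    } else if (**p == '\'') {
        (*p)++;                       /* opening quote */
        while (**p && **p != '\'' && n < sizeof t.text - 1)
            t.text[n++] = *(*p)++;    /* escape handling omitted */
        if (**p == '\'') (*p)++;      /* closing quote */
        t.type = T_STRING;
    } else {
        t.text[n++] = *(*p)++;        /* single-character symbol: ( ) , ; = * */
        t.type = T_SYMBOL;
    }
    t.text[n] = '\0';
    return t;
}
```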
Execution planning can be simple. For this project, you can implement a direct execution path without optimization. For a SELECT with WHERE id = X, use the primary index to lookup the row. For WHERE name = X, use the secondary index to find primary keys, then fetch each row. For SELECT * without WHERE, perform a full scan of the primary index using range scan across leaves. Each execution path should be deterministic and produce clear output.
Parsing must handle errors gracefully. If the input is invalid, return an error without crashing. A good approach is to provide explicit error codes and error messages. The parser should not accept ambiguous or partial input. You can keep the grammar small and strict to simplify implementation.
The execution engine must also integrate with transactions or at least a per-statement durability model. For this project, each statement is auto-committed: it is logged, applied, and then returns. This simplifies concurrency. You should still ensure that the response is only returned after WAL flush if durability is enabled.
Finally, deterministic output is important. The database shell should produce the same results for the same input every time. Avoid timestamps in output or make them deterministic. This makes testing and grading possible.
Parsing string literals and identifiers is another subtle area. SQL allows quoted strings with spaces and escape sequences, and identifiers can be case-insensitive depending on the dialect. For this project, define a small, explicit subset: identifiers are lowercase ASCII, strings are single-quoted with \\' as an escape, and whitespace is insignificant outside strings. This keeps the parser simple and makes it clear which inputs are valid. It also makes error handling deterministic: if an input violates the grammar, return a single error code and do not attempt recovery. This design mirrors the "fail fast" approach in many embedded databases.
It is also worth designing the AST with explicit node types and fields, even if the grammar is small. For example, represent a SELECT as a node with table, columns, and an optional predicate structure. This makes execution straightforward and avoids ad-hoc string parsing. A structured AST also enables a simple .plan output: you can print which index will be used based on the predicate fields. Finally, keep execution errors distinct from parse errors. If the SQL is syntactically correct but refers to an unknown table, return ERR no_table rather than ERR parse. This distinction makes the shell feel professional and helps users debug issues quickly.
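A sketch of a structured SELECT node and the plan choice it enables; the field names are illustrative, and the `.plan` output simply inspects the predicate kind.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { PRED_NONE, PRED_BY_ID, PRED_BY_NAME } pred_kind_t;

typedef struct {
    char        table[64];
    bool        select_all;      /* SELECT * vs a projection list */
    char        column[64];      /* projected column when !select_all */
    pred_kind_t pred;
    int64_t     id_value;        /* used when pred == PRED_BY_ID */
    char        name_value[64];  /* used when pred == PRED_BY_NAME */
} select_stmt_t;

/* .plan output derived directly from the AST fields. */
static const char *plan_for(const select_stmt_t *s)
{
    switch (s->pred) {
    case PRED_BY_ID:   return "plan: primary index lookup";
    case PRED_BY_NAME: return "plan: secondary index lookup";
    default:           return "plan: full scan of primary index";
    }
}
```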
For testing, keep a small corpus of valid and invalid queries and run them on every change to the parser. This ensures regressions are caught early and makes the behavior predictable.
How this fits into the projects
SQL parsing is required in §3.2 and implemented in §5.10 Phase 5. It is tested in §6.2. It is the integration point for all previous data structures.
Definitions & key terms
- tokenizer -> splits input into tokens
- parser -> builds AST from tokens
- AST -> abstract syntax tree
- execution plan -> steps to execute a query
Mental model diagram
SQL text -> tokens -> AST -> execution -> indexes -> result
How it works (step-by-step, with invariants and failure modes)
- Read a line from the shell.
- Tokenize into keywords and literals.
- Parse tokens into AST.
- Execute AST using indexes.
- Return results or error.
Invariants: parsing is deterministic; execution uses correct index path. Failure modes: malformed input, wrong index selection.
Minimal concrete example
SELECT name FROM users WHERE id = 2;
-> primary index lookup -> output name
Common misconceptions
- “Parsing is easy” -> errors and edge cases matter.
- “Planner is required” -> for this project, direct execution is enough.
- “SELECT without WHERE is free” -> it requires a full scan.
Check-your-understanding questions
- How does the parser know when a statement ends?
- Why use the primary index for WHERE id = X?
- How do you handle invalid SQL?
Check-your-understanding answers
- Use a semicolon as a terminator.
- It provides the fastest lookup path for primary key.
- Return an error and do not execute.
Real-world applications
- SQLite parser and virtual machine.
- PostgreSQL parser and planner.
Where you’ll apply it
- This project: §3.2, §5.10 Phase 5, §6.2.
References
- SQLite “Parsing” documentation
- Compilers: Principles, Techniques, and Tools (for parsing)
Key insights
SQL parsing turns text into deterministic operations on your indexes.
Summary
A minimal SQL pipeline connects user commands to your storage engine. Keep the grammar small and the execution paths explicit.
Homework/Exercises to practice the concept
- Implement a tokenizer for SQL keywords and literals.
- Write a parser for CREATE and INSERT.
Solutions to the homework/exercises
- Use a state machine to scan identifiers and numbers.
- Implement recursive descent parsing with explicit grammar rules.
3. Project Specification
3.1 What You Will Build
A minimal SQL database engine that supports CREATE TABLE, INSERT, and SELECT with simple WHERE predicates. Data is stored in a B-tree primary index, and a secondary index is maintained using an LSM structure. A WAL provides durability and crash recovery. A buffer pool caches pages and supports LRU eviction. The engine is accessed through an interactive shell.
3.2 Functional Requirements
- Support `CREATE TABLE name (id INT, name TEXT)`.
- Support `INSERT INTO name VALUES (id, 'value')`.
- Support `SELECT * FROM name WHERE id = X`.
- Support `SELECT name FROM name WHERE name = 'x'` using the secondary index.
- Provide `.plan` to show which index was used.
- Implement WAL and recovery.
3.3 Non-Functional Requirements
- Performance: primary lookups O(log n), secondary lookups O(log n) + fetch.
- Reliability: crash recovery restores consistent state.
- Usability: deterministic shell output with clear errors.
3.4 Example Usage / Output
$ ./mini_db
> CREATE TABLE users (id INT, name TEXT);
OK
> INSERT INTO users VALUES (1, 'alice');
OK
> SELECT name FROM users WHERE id = 1;
alice
> .plan SELECT name FROM users WHERE id = 1;
plan: primary index lookup
3.5 Data Formats / Schemas / Protocols
Database files:
- `data.db` - primary B-tree pages
- `sec.idx` - secondary LSM files
- `wal.log` - write-ahead log
3.6 Edge Cases
- Duplicate primary key returns error.
- Unknown table returns `ERR no_table`.
- Malformed SQL returns `ERR parse`.
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
cc -O2 -Wall -Wextra -o mini_db src/main.c src/parser.c src/wal.c src/btree.c src/lsm.c src/buffer.c
./mini_db --fsync 1
3.7.2 Golden Path Demo (Deterministic)
$ ./mini_db
> CREATE TABLE users (id INT, name TEXT);
OK
> INSERT INTO users VALUES (1, 'alice');
OK
> INSERT INTO users VALUES (2, 'bob');
OK
> SELECT name FROM users WHERE id = 2;
bob
> .plan SELECT name FROM users WHERE name = 'alice';
plan: secondary index lookup
3.7.3 Failure Demo (Deterministic)
> SELECT name FROM missing WHERE id = 1;
ERR no_table
> INSERT INTO users VALUES (1, 'alice');
ERR duplicate_key
Exit codes:
- 0 on normal exit
- 4 on WAL corruption or recovery failure
4. Solution Architecture
4.1 High-Level Design
SQL shell -> parser -> execution engine
|-> WAL
|-> primary B-tree (buffer pool)
|-> secondary LSM
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| WAL | durability | logical logging |
| Buffer pool | page cache | LRU eviction |
| Primary index | B-tree | fixed-size rows |
| Secondary index | LSM | memtable + SSTables |
| Parser | SQL front-end | small grammar |
4.3 Data Structures (No Full Code)
struct row {
int64_t id;
char name[64];
};
4.4 Algorithm Overview
Key Algorithm: INSERT
- Parse SQL and build AST.
- Append WAL record and fsync.
- Insert into primary B-tree.
- Insert into secondary LSM.
Complexity Analysis:
- Time: O(log n) for primary insert + LSM write
- Space: O(n)
5. Implementation Guide
5.1 Development Environment Setup
cc --version
5.2 Project Structure
mini_db/
├── src/
│ ├── main.c
│ ├── parser.c
│ ├── wal.c
│ ├── btree.c
│ ├── lsm.c
│ └── buffer.c
└── README.md
5.3 The Core Question You’re Answering
“How do the major systems data structures integrate into a working database?”
5.4 Concepts You Must Understand First
- WAL and recovery
- Buffer pool and LRU
- B-tree primary index
- LSM secondary index
- SQL parsing and execution
5.5 Questions to Guide Your Design
- How will you ensure WAL and page writes follow the write-ahead rule?
- What is your page size and row format?
- How will you keep secondary index consistent with primary?
5.6 Thinking Exercise
If the system crashes after WAL append but before B-tree insert, what should recovery do?
5.7 The Interview Questions They’ll Ask
- Why does a database use a WAL?
- How does a buffer pool reduce IO?
- Why might a database use both B-trees and LSM trees?
5.8 Hints in Layers
- Hint 1: Build the WAL and B-tree insert first.
- Hint 2: Add the buffer pool and page caching.
- Hint 3: Add the secondary index and SQL parsing last.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Recovery | Database Internals | Ch. 6 |
| Indexes | Algorithms, Fourth Edition | Ch. 6.1 |
| LSM | Designing Data-Intensive Applications | Ch. 3 |
5.10 Implementation Phases
Phase 1: Foundation (10-15 hours)
- Goals: WAL, simple storage, replay
- Tasks: WAL format, replay, simple table
- Checkpoint: crash/recovery test passes
Phase 2: Core Functionality (12-20 hours)
- Goals: B-tree and buffer pool
- Tasks: page layout, insert/search, cache
- Checkpoint: primary key queries work after restart
Phase 3: Integration (12-20 hours)
- Goals: secondary index and SQL parsing
- Tasks: LSM index, parser, query routing
- Checkpoint: SELECT works for primary and secondary keys
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Logging | logical, physical | logical | simpler recovery |
| Secondary index | B-tree, LSM | LSM | write optimized |
| Parser | hand-written, generated | hand-written | small grammar |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | parser correctness | parse errors |
| Integration Tests | WAL recovery | crash and replay |
| End-to-End Tests | SQL shell | deterministic transcript |
6.2 Critical Test Cases
- Insert, crash, restart, and verify data persists.
- Secondary index query returns correct rows.
- Duplicate primary key returns error.
6.3 Test Data
CREATE TABLE users (id INT, name TEXT)
INSERT INTO users VALUES (1, 'alice')
SELECT name FROM users WHERE id = 1
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| WAL not flushed | missing rows after crash | fsync before ack |
| Buffer pool bug | stale reads | flush dirty pages |
| Secondary index stale | wrong results | log index updates |
7.2 Debugging Strategies
- Add verbose WAL replay logs.
- Implement a `CHECK` command to validate index consistency.
7.3 Performance Traps
- Small buffer pool leads to thrashing.
- No compaction on secondary index slows queries.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add `DELETE` support.
- Add `SELECT *` without WHERE.
8.2 Intermediate Extensions
- Add simple transactions (BEGIN/COMMIT).
- Add index stats and query planner hints.
8.3 Advanced Extensions
- Implement MVCC for concurrent reads.
- Add background compaction thread.
9. Real-World Connections
9.1 Industry Applications
- SQLite-style embedded databases.
- Storage engines in larger systems.
9.2 Related Open Source Projects
- SQLite (B-tree + WAL)
- RocksDB (LSM-based storage)
9.3 Interview Relevance
- WAL and recovery design
- Index tradeoffs and query planning
10. Resources
10.1 Essential Reading
- Database Internals, recovery and indexes
- Designing Data-Intensive Applications, LSM trees
10.2 Video Resources
- “Database Internals” talks
10.3 Tools & Documentation
- `sqlite3` for reference behavior
10.4 Related Projects in This Series
11. Self-Assessment Checklist
11.1 Understanding
- I can explain WAL and recovery.
- I can explain buffer pool design.
- I can explain B-tree vs LSM tradeoffs.
11.2 Implementation
- All functional requirements are met.
- Recovery works after crash.
- SQL shell outputs deterministic results.
11.3 Growth
- I can explain database architecture in an interview.
12. Submission / Completion Criteria
Minimum Viable Completion:
- WAL + primary index + basic SQL shell
Full Completion:
- Buffer pool + secondary index
- Crash recovery with deterministic demo
Excellence (Going Above & Beyond):
- Query planner hints
- Background compaction and checkpoints