Learning Database Internals with C

Goal: Build a working mental model of database internals by implementing the core subsystems in C—how bytes on disk become pages, how pages become indexes, how SQL becomes executable operators, and how crashes don’t corrupt committed data.

Databases are among the most rewarding systems to understand deeply, and C is a natural language for the job: SQLite and PostgreSQL are written in C, and MySQL in a mix of C and C++. You’ll see exactly how bytes become queries.

Storage Engine Fundamentals: Pages, Records, and File Formats

Databases don’t store “rows” directly; they store pages (fixed-size blocks, commonly 4KB or 8KB) and pack records into them with headers, slot directories, and free-space tracking. This makes reads/writes predictable and enables efficient caching.

Page (4KB)
┌───────────────────────────────────────────┐
│ Page Header (size, LSN, flags)            │
├───────────────────────────────────────────┤
│ Slot Directory (offsets to records)       │
│  [0] → 0x0FC0                             │
│  [1] → 0x0F80                             │
├───────────────────────────────────────────┤
│ Free Space (shrinks as both ends grow)    │
├───────────────────────────────────────────┤
│ Records (variable-length, grow upward)    │
└───────────────────────────────────────────┘

Key ideas you will apply:

Binary serialization: structs → bytes → structs, with explicit endianness and alignment.
Record layouts: fixed vs variable length, slot directories for in-page indirection.
Free-space management: fragmentation inside a page, compaction strategies.
Page IDs: logical page numbers mapped to byte offsets in a single file.
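
The slotted layout in the diagram can be sketched in a few functions. This is a minimal illustration, not a finished page format: the names page_header, page_slot, page_init, page_insert, and page_get are hypothetical, and the header is copied with memcpy in native byte order, which a real on-disk format would replace with explicit endianness.

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Hypothetical layout: header, then a slot array growing toward higher
 * offsets, then free space, then records packed from the end of the page. */
typedef struct {
    uint16_t slot_count;   /* number of slots in the directory */
    uint16_t free_end;     /* offset where the lowest record begins */
} page_header;

typedef struct {
    uint16_t offset;       /* record start within the page */
    uint16_t length;       /* record length in bytes */
} page_slot;

void page_init(uint8_t *page) {
    page_header h = { 0, PAGE_SIZE };
    memset(page, 0, PAGE_SIZE);
    memcpy(page, &h, sizeof h);
}

/* Returns the new slot index, or -1 if the record does not fit. */
int page_insert(uint8_t *page, const void *rec, uint16_t len) {
    page_header h;
    memcpy(&h, page, sizeof h);
    size_t dir_end = sizeof h + (size_t)(h.slot_count + 1) * sizeof(page_slot);
    if (dir_end + len > h.free_end) return -1;   /* not enough free space */

    h.free_end = (uint16_t)(h.free_end - len);   /* records grow toward the header */
    memcpy(page + h.free_end, rec, len);

    page_slot s = { h.free_end, len };
    memcpy(page + sizeof h + (size_t)h.slot_count * sizeof s, &s, sizeof s);
    h.slot_count++;
    memcpy(page, &h, sizeof h);
    return h.slot_count - 1;
}

/* Returns a pointer into the page and stores the length,
 * or NULL if the slot index is out of range. */
const uint8_t *page_get(const uint8_t *page, int slot, uint16_t *len) {
    page_header h;
    memcpy(&h, page, sizeof h);
    if (slot < 0 || slot >= h.slot_count) return NULL;
    page_slot s;
    memcpy(&s, page + sizeof h + (size_t)slot * sizeof s, sizeof s);
    *len = s.length;
    return page + s.offset;
}
```

Callers refer to a record by (page ID, slot index) rather than by byte offset, so compaction can move records inside the page and only the slot entries need updating.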

Indexing with B-Trees and B+ Trees

B-trees minimize disk I/O by keeping high fan-out nodes, which shrinks tree height. A B+ tree stores all values in leaves and keeps internal nodes as routing tables—great for range scans.

             [ 30 | 60 ]
            /     |     \
      [10|20]  [40|50]  [70|90]
        |         |         |
      leaves    leaves    leaves

Key ideas you will apply:

Node layout on disk: keys + child pointers packed into a page.
Split/merge: keeping nodes within capacity while preserving order.
Search path: a lookup reads one page per level, so cost tracks tree height (logarithmic in the number of keys), not total data size.
Range scans: leaf-level linked list for sequential access.
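
Routing through an internal node comes down to one binary search per level. The sketch below assumes a convention where a key equal to a separator descends to the right child; internal_node and child_index are hypothetical names, and MAX_KEYS is kept tiny for readability.

```c
#include <stdint.h>

#define MAX_KEYS 4   /* tiny for illustration; real nodes hold hundreds of keys */

typedef struct {
    int      nkeys;
    int32_t  keys[MAX_KEYS];          /* sorted separator keys */
    uint32_t children[MAX_KEYS + 1];  /* page IDs of child nodes */
} internal_node;

/* Binary search for the first separator strictly greater than key.
 * That position is also the index of the child to descend into. */
int child_index(const internal_node *n, int32_t key) {
    int lo = 0, hi = n->nkeys;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (n->keys[mid] <= key) lo = mid + 1;
        else hi = mid;
    }
    return lo;
}
```

For the root [ 30 | 60 ] in the diagram, a search for 45 yields child index 1, the middle subtree.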

Buffer Pool and Disk I/O

Disk is slow. The buffer pool is the database’s in-memory cache for pages. It decides what stays hot and what gets evicted, and it makes sure dirty pages are safely persisted.

Disk Pages            Buffer Pool (RAM)
┌────────┐            ┌────────┐
│ P42    │ ───────▶   │ frame3 │ dirty
│ P43    │            │ frame4 │ clean
│ P44    │ ───────▶   │ frame5 │ pinned
└────────┘            └────────┘

Key ideas you will apply:

Pin/unpin: prevent eviction while in use.
LRU/clock: eviction policy design and tradeoffs.
Dirty tracking: flush only modified pages.
Page IO: reads aligned to page size, writes batched.
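
Pinning, dirty tracking, and LRU eviction fit into a small sketch. This version tracks only metadata and leaves out the actual disk reads and writes; buffer_pool, bp_pin, and bp_unpin are hypothetical names, and the three-frame pool size is chosen only to make eviction easy to observe.

```c
#include <stdint.h>
#include <string.h>

#define NFRAMES 3
#define NO_PAGE UINT32_MAX

typedef struct {
    uint32_t page_id;    /* NO_PAGE when the frame is empty */
    int      pin_count;  /* > 0 means the frame may not be evicted */
    int      dirty;      /* set by writers; must be flushed before reuse */
    uint64_t last_used;  /* logical clock for LRU */
} frame;

typedef struct {
    frame    frames[NFRAMES];
    uint64_t tick;
    unsigned hits, misses, flushes;
} buffer_pool;

void bp_init(buffer_pool *bp) {
    memset(bp, 0, sizeof *bp);
    for (int i = 0; i < NFRAMES; i++) bp->frames[i].page_id = NO_PAGE;
}

/* Pins the page and returns its frame index, or -1 if every frame is pinned.
 * Disk I/O is omitted: a real pool would read the page on a miss and write
 * the victim back first when it is dirty. */
int bp_pin(buffer_pool *bp, uint32_t page_id) {
    int victim = -1;
    for (int i = 0; i < NFRAMES; i++) {
        frame *f = &bp->frames[i];
        if (f->page_id == page_id) {             /* hit */
            f->pin_count++;
            f->last_used = ++bp->tick;
            bp->hits++;
            return i;
        }
        if (f->pin_count == 0 &&
            (victim < 0 || f->last_used < bp->frames[victim].last_used))
            victim = i;                          /* least recently used so far */
    }
    if (victim < 0) return -1;                   /* nothing evictable */

    frame *v = &bp->frames[victim];
    if (v->page_id != NO_PAGE && v->dirty) bp->flushes++;  /* write-back point */
    v->page_id = page_id;
    v->pin_count = 1;
    v->dirty = 0;
    v->last_used = ++bp->tick;
    bp->misses++;
    return victim;
}

void bp_unpin(buffer_pool *bp, int frame_idx, int dirtied) {
    frame *f = &bp->frames[frame_idx];
    if (dirtied) f->dirty = 1;
    if (f->pin_count > 0) f->pin_count--;
}
```

The linear scan is fine for a handful of frames; a larger pool would typically add a hash table from page ID to frame and a proper LRU list or clock hand.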

Query Processing Pipeline (Parse → Plan → Execute)

SQL is just text until it becomes an executable plan. Databases transform a query into an AST, then an execution plan, then a chain of iterators/operators.

SQL text
   ↓ tokenize
Tokens
   ↓ parse
AST
   ↓ plan/optimize
Plan Tree
   ↓ execute
Operators (scan → filter → join → aggregate)

Key ideas you will apply:

Lexer/parser: turn grammar into a structured AST.
Logical plan: relational algebra operations.
Physical plan: pick algorithms (hash join vs nested loop).
Iterator model: next() pulls tuples through a pipeline.
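
The iterator model is easiest to see with two operators wired together. The sketch below is one possible shape in C, using a function pointer as the shared next() interface; the row type, scan_op, filter_op, and the hard-coded age predicate are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct { const char *name; int age; } row;

/* Every operator exposes the same pull interface. */
typedef struct op {
    bool (*next)(struct op *self, row *out);  /* false when exhausted */
} op;

/* base is the first member, so an op* can be cast back to its operator. */
typedef struct { op base; const row *rows; size_t n, pos; } scan_op;
typedef struct { op base; op *child; int min_age; } filter_op;

static bool scan_next(op *self, row *out) {
    scan_op *s = (scan_op *)self;
    if (s->pos >= s->n) return false;
    *out = s->rows[s->pos++];
    return true;
}

static bool filter_next(op *self, row *out) {
    filter_op *f = (filter_op *)self;
    while (f->child->next(f->child, out))     /* pull from the child */
        if (out->age > f->min_age) return true;
    return false;
}

scan_op make_scan(const row *rows, size_t n) {
    scan_op s = { { scan_next }, rows, n, 0 };
    return s;
}

filter_op make_filter(op *child, int min_age) {
    filter_op f = { { filter_next }, child, min_age };
    return f;
}
```

Calling next() on the filter drives the scan one tuple at a time, so no intermediate result is ever materialized. A join or aggregate operator plugs into the same interface by holding one or two child pointers.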

Transactions, Concurrency, and Isolation

Multiple clients must see consistent results while changing shared data. Transactions provide ACID guarantees, and isolation prevents interleavings from corrupting logic.

Key ideas you will apply:

Locks vs MVCC: blocking vs versioned reads.
Isolation levels: read committed vs repeatable read vs serializable.
Two-phase locking: serializability at the cost of contention and possible deadlocks.
Write skew/lost updates: anomalies you must prevent.
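
To make "versioned reads" concrete, here is a minimal sketch of an MVCC visibility check. It assumes every version carries commit timestamps and that only committed versions appear in the chain; the version struct and both function names are hypothetical, and real engines also have to handle in-flight transactions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t begin_ts;  /* commit timestamp of the transaction that created it */
    uint64_t end_ts;    /* commit timestamp of the one that replaced it, 0 if live */
    int      value;
} version;

/* A version is visible to a snapshot taken at snap_ts if it was committed
 * at or before the snapshot and had not been superseded by then. */
bool version_visible(const version *v, uint64_t snap_ts) {
    return v->begin_ts <= snap_ts && (v->end_ts == 0 || v->end_ts > snap_ts);
}

/* Scans a version chain and returns the visible version, or NULL. */
const version *read_at(const version *chain, size_t n, uint64_t snap_ts) {
    for (size_t i = 0; i < n; i++)
        if (version_visible(&chain[i], snap_ts)) return &chain[i];
    return NULL;
}
```

A reader with snapshot 15 keeps seeing the version committed at 10 even after a writer commits a replacement at 20, which is why readers do not block writers under this scheme.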

Write-Ahead Logging (WAL) and Crash Recovery

Databases are judged by how they recover. WAL ensures changes are logged before data pages are written. Recovery replays logs to rebuild a consistent state.

Transaction writes
  1) append log record
  2) fsync log
  3) write data page (any time after the log is durable)

Crash → replay log → consistent state

Key ideas you will apply:

Log record format: redo/undo payloads.
Durability: fsync ordering and group commit.
Checkpoints: shorten recovery time.
Idempotent replay: safe repeated log application.
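
One common way to get idempotent replay is to stamp each page with the LSN of the last change applied to it and skip any log record that is not newer. The sketch below assumes physical redo records that carry the new bytes for a byte range; the page and redo_record layouts and the redo_apply name are hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define PAGE_DATA 64

typedef struct {
    uint64_t page_lsn;          /* LSN of the last change applied to this page */
    uint8_t  data[PAGE_DATA];
} page;

typedef struct {
    uint64_t lsn;               /* position of this record in the log */
    uint16_t offset, length;    /* byte range within page.data */
    uint8_t  after[16];         /* redo payload: the new bytes */
} redo_record;

/* Applies the record only if the page has not already seen it.
 * Returns 1 if applied, 0 if skipped, -1 if the record is malformed. */
int redo_apply(page *p, const redo_record *r) {
    if (r->length > sizeof r->after ||
        (size_t)r->offset + r->length > PAGE_DATA) return -1;
    if (r->lsn <= p->page_lsn) return 0;          /* already reflected */
    memcpy(p->data + r->offset, r->after, r->length);
    p->page_lsn = r->lsn;
    return 1;
}
```

Running recovery twice, or crashing in the middle of recovery and starting over, leaves the page in the same state because the second pass skips every record the page already reflects.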

Client Boundaries and Protocol Design (C API Discipline)

Even a tiny database needs a clean boundary between client code and internal machinery. In C, the boundary is your .h file. The caller must know exactly what they own and what they must free.

Key ideas you will apply:

Opaque handles: hide internal structs.
Memory ownership: returned buffers must have explicit lifetime rules.
Protocol framing: length-prefix, delimiter, or RESP-style serialization.
Error contracts: avoid leaking internal errno or state.
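
Of the framing options listed, length-prefix is the simplest to sketch. The example assumes a 4-byte big-endian length followed by the payload; frame_encode and frame_decode are hypothetical names, and the header width is a design choice rather than a standard.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Writes a 4-byte big-endian length followed by the payload.
 * Returns bytes written, or 0 if the output buffer is too small. */
size_t frame_encode(uint8_t *out, size_t cap, const void *payload, uint32_t len) {
    if (cap < 4 || cap - 4 < len) return 0;
    out[0] = (uint8_t)(len >> 24);
    out[1] = (uint8_t)(len >> 16);
    out[2] = (uint8_t)(len >> 8);
    out[3] = (uint8_t)len;
    memcpy(out + 4, payload, len);
    return 4 + (size_t)len;
}

/* Tries to extract one frame from a receive buffer.
 * Returns bytes consumed, or 0 if the buffer holds only a partial frame. */
size_t frame_decode(const uint8_t *in, size_t avail,
                    const uint8_t **payload, uint32_t *len) {
    if (avail < 4) return 0;
    uint32_t n = ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
                 ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
    if (avail - 4 < n) return 0;   /* wait for more bytes */
    *payload = in + 4;
    *len = n;
    return 4 + (size_t)n;
}
```

The zero return from frame_decode matters on a TCP socket, where a single read can deliver half a frame or several frames at once. The caller keeps appending to its buffer and retries until a full frame is available.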

Concept Summary Table

Concept Cluster	What You Need to Internalize
Page-based storage	Pages are the fundamental unit; records are packed with headers and slot directories.
Binary encoding	Serialization is explicit and byte-level; endianness and alignment matter.
Indexing	B-trees/B+ trees use high fan-out to keep the tree shallow and disk reads few; splits/merges preserve order and balance.
Buffer pool	Page caching, pinning, dirty tracking, and eviction policies dominate performance.
Query pipeline	SQL becomes tokens → AST → plan → operators; execution is a pull-based pipeline.
Transactions	ACID is enforced by locks or MVCC with clear isolation semantics.
Recovery	WAL guarantees durability after crashes; checkpoints bound how much log recovery must replay.
API boundaries	Opaque handles and ownership rules prevent misuse and memory bugs.

Deep Dive Reading by Concept

Page-Based Storage and Binary Encoding

Concept	Book & Chapter
File I/O and byte-level persistence	The C Programming Language — Ch. 7: “Input and Output”
Disk-based data layouts	Database System Concepts — Ch. 13: “Data Storage Structures”
Encoding and evolution	Designing Data-Intensive Applications — Ch. 4: “Encoding and Evolution”

Indexing and B-Trees

Concept	Book & Chapter
Balanced search trees	Algorithms, Fourth Edition — §3.3: “Balanced Search Trees”
Storage engine indexes	Database Internals — Ch. 2-4: “B-Tree Basics”, “File Formats”, “Implementing B-Trees”

Buffer Pool and Caching

Concept	Book & Chapter
Cache behavior	Computer Systems: A Programmer’s Perspective — Ch. 6.4: “Cache Memories”
Buffer management	Database System Concepts — Ch. 13: “Data Storage Structures”

Query Processing

Concept	Book & Chapter
Query processing and optimization	Database System Concepts — Ch. 15-16: “Query Processing and Optimization”
Execution models	Database System Concepts — Ch. 15: “Query Processing” (evaluation of expressions, pipelining)

Transactions and Recovery

Concept	Book & Chapter
Transactions and isolation	Designing Data-Intensive Applications — Ch. 7: “Transactions”
Recovery system	Database System Concepts — Ch. 19: “Recovery System”
Crash consistency	Operating Systems: Three Easy Pieces — Ch. 42: “Crash Consistency”

API Boundaries in C

Concept	Book & Chapter
Interfaces and encapsulation	C Interfaces and Implementations — Ch. 2: “Interfaces and Implementations”
Memory ownership discipline	Effective C, 2nd Edition — Ch. 6: “Memory Management”

Project 1: Persistent Key-Value Store

File: persistent_key_value_store.md
Main Programming Language: C
Alternative Programming Languages: Rust, Go
Coolness Level: Level 3: Genuinely Clever
Business Potential: Level 1: The “Resume Gold”
Difficulty: Level 1: Beginner (The Tinkerer)
Knowledge Area: Database Internals, File I/O
Software or Tool: File System, Binary Serialization
Main Book: “The C Programming Language” by Kernighan & Ritchie

What you’ll build: A simple key-value database that stores data to disk, supports GET/PUT/DELETE operations, and survives program restarts.

Why it teaches databases: This is the absolute foundation. Before B-trees, SQL, or transactions, a database must solve one problem: how do you store structured data on disk and read it back efficiently? You’ll confront serialization, file formats, and the gap between memory and persistent storage.

Core challenges you’ll face

Serialization: How do you convert C structs to bytes and back? (maps to data encoding)
File I/O: When do you fopen, fread, fwrite, fflush, and fsync? (maps to durability)
Simple indexing: How do you find a key without scanning everything? (maps to indexing concepts)
Memory management: Managing buffers, avoiding leaks (maps to buffer management)
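
For the serialization challenge, one workable approach is to write each field byte by byte instead of dumping a struct, so the file does not depend on the compiler's padding or the machine's byte order. The record layout below (two little-endian 32-bit lengths followed by the key and value bytes) is an assumption for illustration, as are the function names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static void put_u32_le(uint8_t *p, uint32_t v) {
    p[0] = (uint8_t)v;         p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16); p[3] = (uint8_t)(v >> 24);
}

static uint32_t get_u32_le(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical layout: [key_len u32][val_len u32][key bytes][val bytes].
 * Returns bytes written, or 0 if the record does not fit. */
size_t record_encode(uint8_t *out, size_t cap, const char *key, const char *val) {
    size_t klen = strlen(key), vlen = strlen(val);
    if (klen > UINT32_MAX || vlen > UINT32_MAX) return 0;
    if (cap < 8 || cap - 8 < klen || cap - 8 - klen < vlen) return 0;
    put_u32_le(out, (uint32_t)klen);
    put_u32_le(out + 4, (uint32_t)vlen);
    memcpy(out + 8, key, klen);
    memcpy(out + 8 + klen, val, vlen);
    return 8 + klen + vlen;
}

/* Decodes in place: key and val point into the buffer and are NOT
 * NUL-terminated. Returns 0 on success, -1 on a truncated or corrupt record. */
int record_decode(const uint8_t *in, size_t avail,
                  const uint8_t **key, uint32_t *klen,
                  const uint8_t **val, uint32_t *vlen) {
    if (avail < 8) return -1;
    uint32_t k = get_u32_le(in), v = get_u32_le(in + 4);
    if (avail - 8 < k || avail - 8 - k < v) return -1;
    *key = in + 8;      *klen = k;
    *val = in + 8 + k;  *vlen = v;
    return 0;
}
```

With this layout, the pair name/Douglas encodes to 19 bytes, and the first eight bytes of a hexdump read 04 00 00 00 07 00 00 00.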

Key Concepts

Concept	Resource
File I/O in C	“The C Programming Language” Ch. 7 (Input and Output) - Kernighan & Ritchie
Serialization formats	“Designing Data-Intensive Applications” Ch. 4 (Encoding and Evolution) - Martin Kleppmann
Basic data structures	“C Interfaces and Implementations” Ch. 2 (Interfaces and Implementations) - David R. Hanson

Project Details

Attribute	Value
Difficulty	Beginner
Time estimate	Weekend
Prerequisites	Basic C (pointers, structs, file I/O)

Real World Outcome

A working CLI tool where you can type ./kvstore put name "Douglas" and ./kvstore get name returns Douglas
Data persists after you close and reopen the program
You can hexdump your data file and understand every byte

Deliverables

kvstore CLI with get, put, del, list commands
On-disk file format doc (page header + record layout)
tests/ covering CRUD, persistence, and corruption handling
bench/ for basic throughput and latency measurements

Success Criteria

100% pass rate on CRUD tests across restarts
hexdump matches documented layout for at least 10 inserted keys
Handles missing keys and malformed records without crashing
Achieves predictable O(n) scan cost and logs timing

Learning Milestones

First milestone: Store a single key-value pair and retrieve it → you understand basic file writes
Second milestone: Handle multiple keys with a simple linear scan → you see why this is slow (O(n))
Final milestone: Add a simple hash-based index file → you understand why databases need indexes
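
The final milestone can start from an in-memory table that maps each key to the byte offset of its record. The sketch below uses FNV-1a hashing with linear probing; the fixed table size, the 31-character key limit, and the function names are assumptions, and deletion (which needs tombstones) is left out. Writing the slot array to disk gives a first version of the index file.

```c
#include <stdint.h>
#include <string.h>

#define INDEX_SLOTS 64        /* power of two, fixed for the sketch */
#define KEY_MAX     32

typedef struct {
    char    key[KEY_MAX];     /* empty string marks a free slot */
    int64_t offset;           /* byte offset of the record in the data file */
} index_slot;

typedef struct { index_slot slots[INDEX_SLOTS]; } hash_index;

/* FNV-1a, 32-bit */
static uint32_t fnv1a(const char *s) {
    uint32_t h = 2166136261u;
    while (*s) { h ^= (uint8_t)*s++; h *= 16777619u; }
    return h;
}

/* Inserts or updates. Returns 0 on success, -1 if the key is empty,
 * too long, or the table is full. */
int index_put(hash_index *ix, const char *key, int64_t offset) {
    size_t len = strlen(key);
    if (len == 0 || len >= KEY_MAX) return -1;
    uint32_t i = fnv1a(key) & (INDEX_SLOTS - 1);
    for (int probes = 0; probes < INDEX_SLOTS; probes++) {
        index_slot *s = &ix->slots[i];
        if (s->key[0] == '\0' || strcmp(s->key, key) == 0) {
            memcpy(s->key, key, len + 1);
            s->offset = offset;
            return 0;
        }
        i = (i + 1) & (INDEX_SLOTS - 1);   /* linear probing */
    }
    return -1;
}

/* Returns the stored offset, or -1 if the key is absent. */
int64_t index_get(const hash_index *ix, const char *key) {
    if (key[0] == '\0') return -1;
    uint32_t i = fnv1a(key) & (INDEX_SLOTS - 1);
    for (int probes = 0; probes < INDEX_SLOTS; probes++) {
        const index_slot *s = &ix->slots[i];
        if (s->key[0] == '\0') return -1;  /* hit a free slot: key not present */
        if (strcmp(s->key, key) == 0) return s->offset;
        i = (i + 1) & (INDEX_SLOTS - 1);
    }
    return -1;
}
```

A GET then becomes one index probe plus one seek to the stored offset instead of a scan over every record.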

Project 2: Key-Value Store Client Library

File: SPRINT_4_BOUNDARIES_INTERFACES_PROJECTS.md
Programming Language: C
Coolness Level: Level 2: Practical but Forgettable
Business Potential: Level 3: The “Service & Support” Model
Difficulty: Level 2: Intermediate
Knowledge Area: API Design / Networking
Software or Tool: Redis Client Protocol
Main Book: “Effective C, 2nd Edition” by Robert C. Seacord

What you’ll build: A C client library (like libkvclient) that connects to Redis or your own simple TCP key-value server. The library exposes kv_connect(), kv_set(), kv_get(), kv_disconnect() and handles all protocol details internally.

Why it teaches Boundaries & Interfaces: Client libraries are the purest form of “boundary as contract.” Your users see only your .h file. They never see your parsing logic, socket handling, or internal buffers. You must communicate ownership clearly: who owns the returned string from kv_get()? The caller? The library? This ambiguity has caused thousands of real-world bugs.

Core challenges you’ll face

Opaque handle design: Designing kv_handle as an opaque type (encapsulation, information hiding)
Memory ownership: Deciding who owns memory returned by kv_get() and making it obvious (ownership across boundaries)
Const correctness: Using const char* correctly for keys vs mutable buffers for values
Protocol abstraction: Hiding protocol details while allowing configuration (internal vs external invariants)
State management: Handling connection state without exposing it (avoiding global state)
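
The opaque-handle and ownership challenges can be tried out before any networking exists. The sketch below puts the "header" and the "implementation" in one listing and backs the handle with a single in-memory entry; kv_open_mock is a hypothetical stand-in for kv_connect(), and the rule that kv_get() returns a fresh copy owned by the caller is one possible answer to the ownership question, not the only one.

```c
#include <stdlib.h>
#include <string.h>

/* ---- what kvclient.h would expose ---- */
typedef struct kv_handle kv_handle;           /* opaque: layout is private */
kv_handle *kv_open_mock(void);                /* hypothetical stand-in for kv_connect() */
int   kv_set(kv_handle *h, const char *key, const char *value);
char *kv_get(kv_handle *h, const char *key);  /* caller owns the result;
                                                 release with kv_free_string() */
void  kv_free_string(char *s);
void  kv_disconnect(kv_handle *h);

/* ---- what stays inside the .c file ---- */
struct kv_handle { char key[64]; char *value; };  /* single-entry store for the sketch */

static char *dup_string(const char *s) {
    size_t n = strlen(s) + 1;
    char *p = malloc(n);
    if (p) memcpy(p, s, n);
    return p;
}

kv_handle *kv_open_mock(void) { return calloc(1, sizeof(kv_handle)); }

int kv_set(kv_handle *h, const char *key, const char *value) {
    if (!h || strlen(key) >= sizeof h->key) return -1;
    char *copy = dup_string(value);
    if (!copy) return -1;
    free(h->value);
    h->value = copy;
    strcpy(h->key, key);
    return 0;
}

char *kv_get(kv_handle *h, const char *key) {
    if (!h || !h->value || strcmp(h->key, key) != 0) return NULL;
    return dup_string(h->value);   /* a fresh copy: the library keeps its own */
}

void kv_free_string(char *s) { free(s); }

void kv_disconnect(kv_handle *h) {
    if (h) { free(h->value); free(h); }
}
```

Because callers only ever see the forward declaration, they cannot allocate a kv_handle on the stack or touch its fields, and the struct can change without breaking them. Routing the release through kv_free_string() also keeps the allocator an internal detail.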

Key Concepts

Concept	Resource
Opaque pointers pattern	“C Interfaces and Implementations” Ch. 2 - David R. Hanson
Memory ownership in APIs	“Effective C, 2nd Edition” Ch. 6 (Memory Management) - Robert C. Seacord
Const correctness	“C Programming: A Modern Approach” Ch. 17 - K. N. King
Socket programming	“The Linux Programming Interface” Ch. 56-59 - Michael Kerrisk
Defensive coding	“Code Complete, 2nd Edition” Ch. 8 (Defensive Programming) - Steve McConnell

Project Details

Attribute	Value
Difficulty	Intermediate
Time estimate	1-2 weeks
Prerequisites	Project 1, basic socket programming

Real World Outcome

kv_handle *db = kv_connect("localhost", 6379);
kv_set(db, "user:1:name", "Alice");
char *name = kv_get(db, "user:1:name");  // Ownership of the string passes to the caller here
printf("Name: %s\n", name);
kv_free_string(name);  // Caller releases it through the library, not free()
kv_disconnect(db);

Your library connects to a real Redis server (or your mock), and the API makes ownership unmistakable.

Deliverables

libkvclient.a and kvclient.h with documented ownership rules
Protocol parser and encoder (RESP or your own framing)
Example program that sets and gets keys
Error model with explicit error codes and messages

Success Criteria

API usage requires no knowledge of sockets or protocol framing
kv_get() ownership is unambiguous and enforced by kv_free_string()
Handles reconnects and timeouts without leaking memory
No public header exposes internal structs

Learning Milestones

First milestone: Implement kv_connect() returning an opaque handle → understand why hiding struct internals prevents misuse
Second milestone: Design “who frees the returned string” → you’ll never design an ambiguous ownership API again
Final milestone: Add error handling that doesn’t expose internal errno → understand the boundary between internal and external invariants

Project 3: B-Tree Library

File: DATABASE_INTERNALS_C_LEARNING_PROJECTS.md
Programming Language: C
Coolness Level: Level 3: Genuinely Clever
Business Potential: Level 4: The “Open Core” Infrastructure
Difficulty: Level 3: Advanced
Knowledge Area: Data Structures / Databases
Software or Tool: B-Trees
Main Book: “Database Internals” by Alex Petrov

What you’ll build: A disk-backed B-tree implementation that can store millions of keys with O(log n) lookups, insertions, and deletions.

Why it teaches databases: The B-tree is the data structure of databases. SQLite stores both tables and indexes in B-tree variants, and PostgreSQL’s default index type is a B-tree. Understanding B-trees means understanding why indexed lookups are fast. When you implement node splits and merges yourself, you’ll never forget how indexing works.

Core challenges you’ll face

Node structure: Designing the on-disk format for internal and leaf nodes (maps to page layout)
Tree traversal: Following child pointers through pages (maps to random vs sequential I/O)
Node splitting: What happens when a node is full? (maps to tree balancing)
Page management: Each node = one disk page (maps to buffer pool concepts)
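
Node splitting is easier to reason about on a leaf first. The sketch below assumes the B+ tree convention in which the separator pushed to the parent is a copy of the right leaf's first key; the leaf struct, its tiny capacity, and the function names are hypothetical, and the parent update is left out.

```c
#include <string.h>

#define LEAF_CAP 4   /* tiny for illustration */

typedef struct {
    int nkeys;
    int keys[LEAF_CAP];
} leaf;

/* Sorted insert into a leaf that still has room. */
void leaf_insert(leaf *l, int key) {
    int i = l->nkeys;
    while (i > 0 && l->keys[i - 1] > key) {   /* shift larger keys right */
        l->keys[i] = l->keys[i - 1];
        i--;
    }
    l->keys[i] = key;
    l->nkeys++;
}

/* Moves the upper half of a full leaf into an empty right sibling.
 * Returns the separator to insert into the parent: the first key of the
 * right leaf, which also stays in the leaf. */
int leaf_split(leaf *left, leaf *right) {
    int keep = left->nkeys / 2;
    right->nkeys = left->nkeys - keep;
    memcpy(right->keys, left->keys + keep, (size_t)right->nkeys * sizeof(int));
    left->nkeys = keep;
    return right->keys[0];
}
```

After the split, the pending key goes into the left leaf if it is smaller than the separator and into the right leaf otherwise. If the parent is full as well, the same split repeats one level up, which is how the tree grows in height from the root.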

Resources for key challenges

“Let’s Build a Simple Database” Part 7-13 by cstack - Excellent walkthrough of B-tree implementation in a database context
Implementation of B-Tree in C - GeeksforGeeks reference implementation

Key Concepts

Concept	Resource
B-tree fundamentals	“Algorithms, Fourth Edition” §3.3 (Balanced Search Trees) - Sedgewick & Wayne
Disk-based tree structures	“Database Internals” Ch. 2-4 - Alex Petrov
Page layout	“Computer Systems: A Programmer’s Perspective” Ch. 6 (Memory Hierarchy) - Bryant & O’Hallaron

Project Details

Attribute	Value
Difficulty	Intermediate
Time estimate	1-2 weeks
Prerequisites	Project 1, understanding of tree data structures

Real World Outcome

A library where you can insert 1 million keys and retrieve any key in <10 disk reads
Visual output showing tree structure: ./btree_demo --visualize prints the tree levels
Benchmark comparison showing O(log n) vs O(n) performance
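
The "fewer than 10 disk reads" target follows from simple arithmetic on fan-out. The helper below is a rough model that assumes every node is full and that one page is read per level; btree_levels is a hypothetical name.

```c
#include <stdint.h>

/* Smallest number of levels h such that fanout^h >= nkeys: an estimate of
 * page reads per lookup when every node is full. Returns -1 if fanout < 2. */
int btree_levels(uint64_t nkeys, uint64_t fanout) {
    if (fanout < 2) return -1;
    int levels = 1;
    uint64_t reach = fanout;          /* keys reachable with `levels` levels */
    while (reach < nkeys) {
        if (reach > UINT64_MAX / fanout) return levels + 1;  /* next level covers everything */
        reach *= fanout;
        levels++;
    }
    return levels;
}
```

With a fan-out of 100, one million keys need 3 levels (100 × 100 × 100 = 1,000,000), while a binary tree with fan-out 2 needs 20. Half-full nodes behave like a smaller fan-out, so a fan-out of 50 gives 4 levels, still well under the target.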

Deliverables

Disk-backed B-tree with insert, search, delete
btree_verify() validator for invariants (sorted keys, fanout, depth)
Visualization tool for tree levels and node occupancy
Benchmarks comparing lookups vs linear scan

Success Criteria

All invariants pass after randomized insert/delete sequences
Lookup depth scales logarithmically with dataset size
Node split/merge logic handles worst-case sequences
Tree remains consistent after restart and reload

Learning Milestones

First milestone: Implement search through a pre-built tree → understand tree traversal
Second milestone: Implement insertion with node splitting → understand how trees grow
Final milestone: Implement deletion with merging → understand the full complexity of balanced trees

Project 4: Page-Based Storage Engine

File: DATABASE_INTERNALS_C_LEARNING_PROJECTS.md
Programming Language: C
Coolness Level: Level 4: Hardcore Tech Flex
Business Potential: Level 4: The “Open Core” Infrastructure
Difficulty: Level 3: Advanced
Knowledge Area: Operating Systems / Databases
Software or Tool: Buffer Pool
Main Book: “Database System Concepts” by Silberschatz et al.

What you’ll build: A storage engine with fixed-size pages, a page directory, a buffer pool with LRU eviction, and support for variable-length records.

Why it teaches databases: Real databases don’t just dump data to disk—they organize it into pages (typically 4KB or 8KB). This project teaches you how databases actually manage disk space, cache hot pages in memory, and handle records that don’t fit neatly into fixed slots.

Core challenges you’ll face

Page format design: Headers, slots, free space management (maps to physical storage)
Buffer pool management: Which pages to keep in memory? (maps to caching)
LRU eviction: Implementing an eviction policy (maps to cache replacement)
Slot directory: Finding records within a page (maps to tuple storage)

Resources for key challenges

CMU 15-445 Project #1 - Buffer Pool Manager - The gold standard assignment for this concept
SQLite Internals: Pages & B-trees - How SQLite actually does it

Key Concepts

Concept	Resource
Buffer management	“Database System Concepts” Ch. 13 (Data Storage Structures) - Silberschatz et al.
Page layout	“Operating Systems: Three Easy Pieces” Ch. 39-40 (Files and Directories) - Arpaci-Dusseau
Cache replacement policies	“Computer Systems: A Programmer’s Perspective” Ch. 6.4 (Cache Memories) - Bryant & O’Hallaron

Project Details

Attribute	Value
Difficulty	Intermediate
Time estimate	1-2 weeks
Prerequisites	Project 3

Real World Outcome

Run ./storage_demo and see buffer pool statistics: “Buffer pool: 64 pages, 58 hits, 6 misses, 91% hit rate”
Insert 100,000 records and watch the buffer pool manage memory efficiently
Monitor which pages are hot vs cold with a debug view

Deliverables

Page file format spec with header, slots, and free-space rules
Buffer pool manager with pin/unpin and LRU or clock eviction
Record API for insert/update/delete with variable-length fields
Stats output for hit/miss, dirty flushes, and eviction count

Success Criteria

Read/write API works with both fixed and variable-length records
Buffer pool never evicts pinned pages
Dirty pages are flushed on shutdown and survive restart
Page utilization stays above 70% after mixed workload

Learning Milestones

First milestone: Fixed-size record storage in pages → understand page structure
Second milestone: Implement buffer pool with pinning → understand memory management
Final milestone: Add LRU eviction and variable-length records → understand production storage engines

Project 5: Write-Ahead Log (WAL) and Crash Recovery

File: DATABASE_INTERNALS_C_LEARNING_PROJECTS.md
Programming Language: C
Coolness Level: Level 4: Hardcore Tech Flex
Business Potential: Level 3: The “Service & Support” Model
Difficulty: Level 4: Expert
Knowledge Area: Databases / Crash Consistency
Software or Tool: WAL
Main Book: “Database Internals” by Alex Petrov

What you’ll build: A transaction log that guarantees durability—your database survives crashes without losing committed data.

Why it teaches databases: WAL is what makes databases reliable. In most database engines, the “D” in ACID (Durability) is implemented with write-ahead logging. You’ll understand why databases write to a log first, how they recover from crashes, and why fsync() matters so much.

Core challenges you’ll face

Log record format: Designing undo/redo log entries (maps to recovery)
Log-write protocol: Ensuring logs hit disk before data (maps to durability)
Crash recovery: Replaying the log to restore consistency (maps to REDO/UNDO)
Checkpointing: Limiting recovery time (maps to performance)
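
A crash can leave a half-written record at the end of the log, so the record format needs a way to detect a torn tail. The sketch below frames each record with a length and a checksum and works on an in-memory buffer; in the real project the append would be a write() followed by fsync() on the log file. The frame layout and function names are assumptions, and FNV-1a stands in for the CRC a production log would more likely use.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record frame: [len u32 LE][checksum u32 LE][payload] */
#define LOG_HDR 8

static uint32_t checksum32(const uint8_t *p, size_t n) {  /* FNV-1a stand-in for a CRC */
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < n; i++) { h ^= p[i]; h *= 16777619u; }
    return h;
}

static void put_u32(uint8_t *p, uint32_t v) {
    p[0] = (uint8_t)v;         p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16); p[3] = (uint8_t)(v >> 24);
}

static uint32_t get_u32(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Appends one record at *end. Returns 0 on success, -1 if the buffer is full. */
int log_append(uint8_t *log, size_t cap, size_t *end,
               const void *payload, uint32_t len) {
    if (cap - *end < LOG_HDR || cap - *end - LOG_HDR < len) return -1;
    uint8_t *rec = log + *end;
    put_u32(rec, len);
    put_u32(rec + 4, checksum32(payload, len));
    memcpy(rec + LOG_HDR, payload, len);
    *end += LOG_HDR + len;
    return 0;
}

/* Counts valid records from the start of the log, stopping at the first
 * truncated or corrupt one. *valid_bytes is where the good prefix ends. */
size_t log_scan(const uint8_t *log, size_t size, size_t *valid_bytes) {
    size_t pos = 0, count = 0;
    while (size - pos >= LOG_HDR) {
        uint32_t len = get_u32(log + pos);
        if (size - pos - LOG_HDR < len) break;                /* truncated */
        if (checksum32(log + pos + LOG_HDR, len) != get_u32(log + pos + 4))
            break;                                            /* corrupt */
        pos += LOG_HDR + len;
        count++;
    }
    *valid_bytes = pos;
    return count;
}
```

Recovery replays only the records inside the valid prefix and truncates the file there, so a torn record is treated as if it had never been written. That is safe as long as a transaction is acknowledged as committed only after its commit record has been fsynced.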

Resources for key challenges

Write-Ahead Logging and ARIES by Kevin Sookocheff - Clear explanation of WAL and ARIES recovery
PostgreSQL WAL Documentation - How a production database does it
Caltech CS122 WAL Assignment - Hands-on implementation guide

Key Concepts

Concept	Resource
ACID guarantees	“Designing Data-Intensive Applications” Ch. 7 (Transactions) - Martin Kleppmann
Recovery algorithms	“Database System Concepts” Ch. 19 (Recovery System) - Silberschatz et al.
File system semantics	“Operating Systems: Three Easy Pieces” Ch. 42 (Crash Consistency) - Arpaci-Dusseau

Project Details

Attribute	Value
Difficulty	Advanced
Time estimate	1-2 weeks
Prerequisites	Projects 1-4

Real World Outcome

Run ./wal_demo --crash-test: the program crashes mid-transaction, restarts, and recovers to a consistent state
See log replay in action: “Replaying 47 log records… Recovery complete. 3 transactions rolled back.”
Kill the process with kill -9 at any point and verify no data corruption

Deliverables

Log file format with redo/undo records and LSNs
Recovery tool that replays logs and verifies consistency
Checkpoint mechanism to bound recovery time
Crash-test harness that simulates power loss

Success Criteria

Committed transactions persist after forced crash
Uncommitted transactions are fully rolled back
Recovery time depends on the log written since the last checkpoint, not on the total log size
Log replay is idempotent across repeated runs

Learning Milestones

First milestone: Append-only log writing → understand sequential durability
Second milestone: REDO-only recovery → understand forward recovery
Final milestone: Full UNDO/REDO with checkpoints → understand complete crash recovery

Project 6: SQL Query Engine

File: sql_query_engine.md
Main Programming Language: C
Alternative Programming Languages: Rust, Go, C++
Coolness Level: Level 5: Pure Magic (Super Cool)
Business Potential: Level 4: The “Open Core” Infrastructure
Difficulty: Level 4: Expert (The Systems Architect)
Knowledge Area: Compilers, Query Processing
Software or Tool: Lexer/Parser, SQLite Architecture
Main Book: “Database System Concepts” by Silberschatz et al.

What you’ll build: A SQL parser and query executor that can handle SELECT, INSERT, UPDATE, DELETE with WHERE clauses, joins, and basic aggregations.

Why it teaches databases: This is where bytes become tables. You’ll understand how SELECT * FROM users WHERE age > 21 becomes a series of operations: parse → plan → execute. Building a query engine demystifies SQL completely.

Core challenges you’ll face

Lexing/Parsing: Turning SQL text into an AST (maps to query compilation)
Query planning: Deciding which indexes to use (maps to optimization)
Execution engine: Iterator (Volcano) model vs materialization or vectorized batches (maps to runtime)
Join algorithms: Nested loop, hash join basics (maps to query processing)
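
The lexing half of the first challenge fits in one function. The sketch below recognizes only identifiers, unsigned integers, and single-character symbols, which is enough for a query like SELECT * FROM users WHERE age > 21; string literals and multi-character operators such as >= are left out, and all names are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef enum { TOK_IDENT, TOK_NUMBER, TOK_SYMBOL, TOK_END } tok_kind;

typedef struct {
    tok_kind    kind;
    const char *start;   /* points into the input; not NUL-terminated */
    size_t      len;
} token;

/* Reads one token starting at *cursor and advances the cursor past it. */
token next_token(const char **cursor) {
    const char *p = *cursor;
    while (isspace((unsigned char)*p)) p++;

    token t = { TOK_END, p, 0 };
    if (*p == '\0') { *cursor = p; return t; }

    if (isalpha((unsigned char)*p) || *p == '_') {
        t.kind = TOK_IDENT;            /* keywords are identifiers at this stage */
        while (isalnum((unsigned char)*p) || *p == '_') p++;
    } else if (isdigit((unsigned char)*p)) {
        t.kind = TOK_NUMBER;
        while (isdigit((unsigned char)*p)) p++;
    } else {
        t.kind = TOK_SYMBOL;           /* single-character symbols: * , > ; ( ) */
        p++;
    }
    t.len = (size_t)(p - t.start);
    *cursor = p;
    return t;
}

/* Case-insensitive comparison of a token against a keyword. */
int token_is(const token *t, const char *word) {
    if (strlen(word) != t->len) return 0;
    for (size_t i = 0; i < t->len; i++)
        if (toupper((unsigned char)t->start[i]) != toupper((unsigned char)word[i]))
            return 0;
    return 1;
}
```

Tokens point into the original string instead of copying it, so the lexer allocates nothing. A recursive-descent parser can then call next_token() in a loop and use token_is() to tell keywords from ordinary identifiers.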

Resources for key challenges

Architecture of SQLite - How SQLite’s query pipeline works
Deep Dive into SQLite’s Internal Architecture - Tokenizer → Parser → Code Generator → VDBE

Key Concepts

Concept	Resource
Parsing	“Compilers: Principles and Practice” Ch. 3-4 (Lexical/Syntax Analysis) - Dave & Dave
Query processing	“Database System Concepts” Ch. 15-16 (Query Processing & Optimization) - Silberschatz et al.
Execution models	“Database System Concepts” Ch. 15 (Query Processing: evaluation of expressions, pipelining) - Silberschatz et al.

Project Details

Attribute	Value
Difficulty	Advanced
Time estimate	2-4 weeks
Prerequisites	Projects 1-5 (or at minimum 1-3)

Real World Outcome

A working REPL where you type SQL and get results:

minidb> SELECT name, age FROM users WHERE age > 21;
+--------+-----+
| name   | age |
+--------+-----+
| Alice  | 25  |
| Bob    | 30  |
+--------+-----+
2 rows returned (0.003 sec)

Run EXPLAIN SELECT... and see the query plan.

Deliverables

SQL lexer/parser producing an AST
Logical planner that builds operator trees
Physical executor with scans, filters, joins, aggregates
EXPLAIN output for operator pipeline

Success Criteria

Parses and executes a defined SQL subset without crashes
Joins return correct results for at least two join types
EXPLAIN output matches actual operator execution order
Basic indexes are used when available

Learning Milestones

First milestone: Parse and execute simple SELECT → understand the pipeline
Second milestone: Add WHERE clause filtering → understand predicate evaluation
Final milestone: Implement JOIN and basic aggregates → understand relational algebra

Project Comparison Table

Project	Difficulty	Time	Depth of Understanding	Fun Factor
1. Key-Value Store	Beginner	Weekend	⭐⭐ Foundation	⭐⭐⭐⭐ Quick wins
2. KV Client Library	Intermediate	1-2 weeks	⭐⭐⭐ API Design	⭐⭐⭐ Practical
3. B-Tree Library	Intermediate	1-2 weeks	⭐⭐⭐⭐⭐ Core concept	⭐⭐⭐⭐ Satisfying
4. Storage Engine	Intermediate	1-2 weeks	⭐⭐⭐⭐ Production-like	⭐⭐⭐ Technical
5. WAL & Recovery	Advanced	1-2 weeks	⭐⭐⭐⭐⭐ ACID mastery	⭐⭐⭐ Challenging
6. SQL Query Engine	Advanced	2-4 weeks	⭐⭐⭐⭐ Full picture	⭐⭐⭐⭐⭐ “I built SQL!”

Recommended Learning Path

Based on the goal of deeply understanding databases:

Start with Project 3 (B-Tree Library) if you’re comfortable with C file I/O. The B-tree is the heart of database indexing, and everything else builds on top of understanding how data is organized for fast retrieval.

Start with Project 1 (Key-Value Store) if you want to warm up first. It’s a weekend project that builds confidence and sets up patterns you’ll use throughout.

Project 2 (KV Client Library) can be done in parallel with Project 3 or anytime after Project 1. It’s a great way to practice API design and learn socket programming, and the concepts will help you design cleaner interfaces for your database components.

Suggested Timeline

Week	Project	Focus
Week 1	Key-Value Store	Warm-up, file I/O fundamentals
Week 2	KV Client Library	API design, ownership semantics (can be parallel with Week 3)
Weeks 3-4	B-Tree Library	The core insight of database indexing
Weeks 5-6	Storage Engine	Production-level understanding
Weeks 7-8	WAL & Recovery	Reliability mastery
Weeks 9-11	SQL Query Engine	Tie it all together

Final Capstone Project: SQLite Clone (“TinyDB”)

What you’ll build: A complete, embedded SQL database that stores data in a single file, supports transactions, and handles concurrent readers—essentially a simplified SQLite.

Why this is the ultimate project: This integrates everything: storage engine, B-tree indexes, buffer pool, WAL, SQL parsing, query execution, and transactions. When you finish, you’ll have built a real database from scratch that you can actually use in other projects.

Core challenges you’ll face

Integration: Making all components work together seamlessly
File format design: Single-file database with header, pages, freelist
Transaction isolation: Basic locking or MVCC
Concurrency: Multiple readers, single writer

Resources for this capstone

Let’s Build a Simple Database by cstack - The definitive SQLite clone tutorial
How SQLite Works - Official documentation
SQLite Source Code - ~150K lines of beautifully documented C

Key Concepts

Concept	Resource
Complete database architecture	“Database Internals” by Alex Petrov - The best modern book on storage engines
SQLite specifics	The Architecture of SQLite - Official architecture overview
Production considerations	“Designing Data-Intensive Applications” Ch. 3 (Storage and Retrieval) - Kleppmann

Project Details

Attribute	Value
Difficulty	Advanced
Time estimate	1-2 months
Prerequisites	All previous projects

Real World Outcome

A single tinydb binary and a libtinydb.a library you can link into other C programs
Run ./tinydb myapp.db and get a full SQL REPL
Use your database as the backend for a simple web app or CLI tool
Show someone: “I built this database from scratch in C”

Deliverables

Single-file database format with page header, freelist, and schema catalog
Integrated storage engine, B-tree indexes, WAL, and SQL executor
Concurrency model (single-writer/multi-reader or basic MVCC)
Test suite that covers recovery, queries, and corruption cases

Success Criteria

Database survives crash mid-transaction with no corruption
Concurrent readers never observe partial writes
Schema and data written under an older file-format version remain readable after an upgrade
Performance baseline within 5-10x of SQLite for simple workloads

Learning Milestones

Weeks 1-2: Integrate storage engine + B-tree with file format
Weeks 3-4: Add SQL parser and basic query execution
Weeks 5-6: Implement WAL and crash recovery
Weeks 7-8: Add transactions and basic concurrency
Final: Polish, optimize, write tests, benchmark against SQLite

Additional Resources

Books (from your collection)

Book	Relevance
“Computer Systems: A Programmer’s Perspective” - Bryant & O’Hallaron	Memory hierarchy, caching, file I/O
“Operating Systems: Three Easy Pieces” - Arpaci-Dusseau	File systems, crash consistency
“The C Programming Language” - Kernighan & Ritchie	C fundamentals
“C Interfaces and Implementations” - David R. Hanson	Clean C design patterns
“Algorithms, Fourth Edition” - Sedgewick & Wayne	B-trees, data structures

Online Resources

Let’s Build a Simple Database - cstack’s SQLite clone tutorial
Architecture of SQLite - Official SQLite documentation
CMU 15-445 Database Systems - Buffer Pool Manager project
Write-Ahead Logging and ARIES - WAL deep dive
SQLite Internals: Pages & B-trees - Fly.io blog
Implementation of B-Tree in C - GeeksforGeeks
Deep Dive into SQLite’s Internal Architecture - Dev.to article
PostgreSQL WAL Documentation - Official PostgreSQL docs

Recommended Books (not in your collection)

Book	Why It’s Essential
“Database Internals” by Alex Petrov	The best modern book specifically about storage engine internals
“Database System Concepts” by Silberschatz et al.	Comprehensive academic textbook covering all database topics
“Designing Data-Intensive Applications” by Martin Kleppmann	Modern systems thinking about data storage and processing

Quick Start Checklist

Set up a C development environment (gcc/clang, make, gdb)
Create a project directory structure
Start with Project 1 or 3 based on your comfort level
Read the cstack tutorial alongside your implementation
Use hexdump and gdb liberally to understand what’s happening
Benchmark your implementations to see O(n) vs O(log n) in practice
Keep notes on what you learn—database concepts are interconnected

Happy building! There’s nothing quite like the moment when you realize you understand how databases actually work.