DATABASE INTERNALS C LEARNING PROJECTS
Learning Database Internals with C
A hands-on learning path to deeply understand how databases work by building one from scratch in C.
Databases are one of the most fascinating systems to understand deeply, and C is the perfect language for this—SQLite, PostgreSQL, and MySQL are all written in C. You’ll see exactly how bytes become queries.
Core Concept Analysis
Understanding “how databases work” breaks down into these fundamental building blocks:
| Layer | What It Does | Key Concepts |
|---|---|---|
| Storage | Persists data to disk | Pages, blocks, file formats, serialization |
| Indexing | Fast data lookup | B-trees, B+ trees, hash indexes |
| Buffer Pool | Manages memory/disk I/O | Page caching, eviction policies (LRU) |
| Query Processing | Interprets SQL | Parsing, planning, execution |
| Transactions | Ensures ACID | Locking, MVCC, isolation levels |
| Recovery | Survives crashes | Write-Ahead Logging (WAL), checkpoints |
Project 1: Persistent Key-Value Store
- File: persistent_key_value_store.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: Level 1: The “Resume Gold”
- Difficulty: Level 1: Beginner (The Tinkerer)
- Knowledge Area: Database Internals, File I/O
- Software or Tool: File System, Binary Serialization
- Main Book: “The C Programming Language” by Kernighan & Ritchie
What you’ll build: A simple key-value database that stores data to disk, supports GET/PUT/DELETE operations, and survives program restarts.
Why it teaches databases: This is the absolute foundation. Before B-trees, SQL, or transactions, a database must solve one problem: how do you store structured data on disk and read it back efficiently? You’ll confront serialization, file formats, and the gap between memory and persistent storage.
Core challenges you’ll face
- Serialization: How do you convert C structs to bytes and back? (maps to data encoding)
- File I/O: When do you
fopen,fread,fwrite,fsync? (maps to durability) - Simple indexing: How do you find a key without scanning everything? (maps to indexing concepts)
- Memory management: Managing buffers, avoiding leaks (maps to buffer management)
Key Concepts
| Concept | Resource |
|---|---|
| File I/O in C | “The C Programming Language” Ch. 7 (Input and Output) - Kernighan & Ritchie |
| Serialization formats | “Designing Data-Intensive Applications” Ch. 4 (Encoding and Evolution) - Martin Kleppmann |
| Basic data structures | “C Interfaces and Implementations” Ch. 2 (Interfaces and Implementations) - David R. Hanson |
Project Details
| Attribute | Value |
|---|---|
| Difficulty | Beginner |
| Time estimate | Weekend |
| Prerequisites | Basic C (pointers, structs, file I/O) |
Real World Outcome
- A working CLI tool where you can type
./kvstore put name "Douglas"and./kvstore get namereturnsDouglas - Data persists after you close and reopen the program
- You can
hexdumpyour data file and understand every byte
Learning Milestones
- First milestone: Store a single key-value pair and retrieve it → you understand basic file writes
- Second milestone: Handle multiple keys with a simple linear scan → you see why this is slow (O(n))
- Final milestone: Add a simple hash-based index file → you understand why databases need indexes
Project 2: Key-Value Store Client Library
- File: SPRINT_4_BOUNDARIES_INTERFACES_PROJECTS.md
- Programming Language: C
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 2: Intermediate
- Knowledge Area: API Design / Networking
- Software or Tool: Redis Client Protocol
- Main Book: “Effective C, 2nd Edition” by Robert C. Seacord
What you’ll build: A C client library (like libkvclient) that connects to Redis or your own simple TCP key-value server. The library exposes kv_connect(), kv_set(), kv_get(), kv_disconnect() and handles all protocol details internally.
Why it teaches Boundaries & Interfaces: Client libraries are the purest form of “boundary as contract.” Your users see only your .h file. They never see your parsing logic, socket handling, or internal buffers. You must communicate ownership clearly: who owns the returned string from kv_get()? The caller? The library? This ambiguity has caused thousands of real-world bugs.
Core challenges you’ll face
- Opaque handle design: Designing
kv_handleas an opaque type (encapsulation, information hiding) - Memory ownership: Deciding who owns memory returned by
kv_get()and making it obvious (ownership across boundaries) - Const correctness: Using
const char*correctly for keys vs mutable buffers for values - Protocol abstraction: Hiding protocol details while allowing configuration (internal vs external invariants)
- State management: Handling connection state without exposing it (avoiding global state)
Key Concepts
| Concept | Resource |
|---|---|
| Opaque pointers pattern | “C Interfaces and Implementations” Ch. 2 - David R. Hanson |
| Memory ownership in APIs | “Effective C, 2nd Edition” Ch. 6 (Memory Management) - Robert C. Seacord |
| Const correctness | “C Programming: A Modern Approach” Ch. 17 - K. N. King |
| Socket programming | “The Linux Programming Interface” Ch. 56-59 - Michael Kerrisk |
| Defensive coding | “Code Complete, 2nd Edition” Ch. 8 (Defensive Programming) - Steve McConnell |
Project Details
| Attribute | Value |
|---|---|
| Difficulty | Intermediate |
| Time estimate | 1-2 weeks |
| Prerequisites | Project 1, basic socket programming |
Real World Outcome
kv_handle *db = kv_connect("localhost", 6379);
kv_set(db, "user:1:name", "Alice");
char *name = kv_get(db, "user:1:name"); // Clear who owns this!
printf("Name: %s\n", name);
kv_free_string(name); // Explicit ownership transfer
kv_disconnect(db);
Your library connects to a real Redis server (or your mock), and the API makes ownership unmistakable.
Learning Milestones
- First milestone: Implement
kv_connect()returning an opaque handle → understand why hiding struct internals prevents misuse - Second milestone: Design “who frees the returned string” → you’ll never design an ambiguous ownership API again
- Final milestone: Add error handling that doesn’t expose internal errno → understand the boundary between internal and external invariants
Project 3: B-Tree Library
- File: DATABASE_INTERNALS_C_LEARNING_PROJECTS.md
- Programming Language: C
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 3: Advanced
- Knowledge Area: Data Structures / Databases
- Software or Tool: B-Trees
- Main Book: “Database Internals” by Alex Petrov
What you’ll build: A disk-backed B-tree implementation that can store millions of keys with O(log n) lookups, insertions, and deletions.
Why it teaches databases: The B-tree is the data structure of databases. SQLite uses B-trees for tables and indexes. PostgreSQL uses B+ trees. Understanding B-trees means understanding why databases are fast. When you implement node splits and merges yourself, you’ll never forget how indexing works.
Core challenges you’ll face
- Node structure: Designing the on-disk format for internal and leaf nodes (maps to page layout)
- Tree traversal: Following child pointers through pages (maps to random vs sequential I/O)
- Node splitting: What happens when a node is full? (maps to tree balancing)
- Page management: Each node = one disk page (maps to buffer pool concepts)
Resources for key challenges
- “Let’s Build a Simple Database” Part 7-13 by cstack - Excellent walkthrough of B-tree implementation in a database context
- Implementation of B-Tree in C - GeeksforGeeks reference implementation
Key Concepts
| Concept | Resource |
|---|---|
| B-tree fundamentals | “Algorithms, Fourth Edition” §3.3 (Balanced Search Trees) - Sedgewick & Wayne |
| Disk-based tree structures | “Database Internals” Ch. 2-4 - Alex Petrov |
| Page layout | “Computer Systems: A Programmer’s Perspective” Ch. 6 (Memory Hierarchy) - Bryant & O’Hallaron |
Project Details
| Attribute | Value |
|---|---|
| Difficulty | Intermediate |
| Time estimate | 1-2 weeks |
| Prerequisites | Project 1, understanding of tree data structures |
Real World Outcome
- A library where you can insert 1 million keys and retrieve any key in <10 disk reads
- Visual output showing tree structure:
./btree_demo --visualizeprints the tree levels - Benchmark comparison showing O(log n) vs O(n) performance
Learning Milestones
- First milestone: Implement search through a pre-built tree → understand tree traversal
- Second milestone: Implement insertion with node splitting → understand how trees grow
- Final milestone: Implement deletion with merging → understand the full complexity of balanced trees
Project 4: Page-Based Storage Engine
- File: DATABASE_INTERNALS_C_LEARNING_PROJECTS.md
- Programming Language: C
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 3: Advanced
- Knowledge Area: Operating Systems / Databases
- Software or Tool: Buffer Pool
- Main Book: “Database System Concepts” by Silberschatz et al.
What you’ll build: A storage engine with fixed-size pages, a page directory, a buffer pool with LRU eviction, and support for variable-length records.
Why it teaches databases: Real databases don’t just dump data to disk—they organize it into pages (typically 4KB or 8KB). This project teaches you how databases actually manage disk space, cache hot pages in memory, and handle records that don’t fit neatly into fixed slots.
Core challenges you’ll face
- Page format design: Headers, slots, free space management (maps to physical storage)
- Buffer pool management: Which pages to keep in memory? (maps to caching)
- LRU eviction: Implementing an eviction policy (maps to cache replacement)
- Slot directory: Finding records within a page (maps to tuple storage)
Resources for key challenges
- CMU 15-445 Project #1 - Buffer Pool Manager - The gold standard assignment for this concept
- SQLite Internals: Pages & B-trees - How SQLite actually does it
Key Concepts
| Concept | Resource |
|---|---|
| Buffer management | “Database System Concepts” Ch. 13 (Data Storage Structures) - Silberschatz et al. |
| Page layout | “Operating Systems: Three Easy Pieces” Ch. 39-40 (Files and Directories) - Arpaci-Dusseau |
| Cache replacement policies | “Computer Systems: A Programmer’s Perspective” Ch. 6.4 (Cache Memories) - Bryant & O’Hallaron |
Project Details
| Attribute | Value |
|---|---|
| Difficulty | Intermediate |
| Time estimate | 1-2 weeks |
| Prerequisites | Project 3 |
Real World Outcome
- Run
./storage_demoand see buffer pool statistics: “Buffer pool: 64 pages, 58 hits, 6 misses, 91% hit rate” - Insert 100,000 records and watch the buffer pool manage memory efficiently
- Monitor which pages are hot vs cold with a debug view
Learning Milestones
- First milestone: Fixed-size record storage in pages → understand page structure
- Second milestone: Implement buffer pool with pinning → understand memory management
- Final milestone: Add LRU eviction and variable-length records → understand production storage engines
Project 5: Write-Ahead Log (WAL) and Crash Recovery
- File: DATABASE_INTERNALS_C_LEARNING_PROJECTS.md
- Programming Language: C
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 4: Expert
- Knowledge Area: Databases / Crash Consistency
- Software or Tool: WAL
- Main Book: “Database Internals” by Alex Petrov
What you’ll build: A transaction log that guarantees durability—your database survives crashes without losing committed data.
Why it teaches databases: WAL is what makes databases reliable. The “D” in ACID (Durability) comes from write-ahead logging. You’ll understand why databases write to a log first, how they recover from crashes, and why fsync() matters so much.
Core challenges you’ll face
- Log record format: Designing undo/redo log entries (maps to recovery)
- Log-write protocol: Ensuring logs hit disk before data (maps to durability)
- Crash recovery: Replaying the log to restore consistency (maps to REDO/UNDO)
- Checkpointing: Limiting recovery time (maps to performance)
Resources for key challenges
- Write-Ahead Logging and ARIES by Kevin Sookocheff - Clear explanation of WAL and ARIES recovery
- PostgreSQL WAL Documentation - How a production database does it
- Caltech CS122 WAL Assignment - Hands-on implementation guide
Key Concepts
| Concept | Resource |
|---|---|
| ACID guarantees | “Designing Data-Intensive Applications” Ch. 7 (Transactions) - Martin Kleppmann |
| Recovery algorithms | “Database System Concepts” Ch. 19 (Recovery System) - Silberschatz et al. |
| File system semantics | “Operating Systems: Three Easy Pieces” Ch. 42 (Crash Consistency) - Arpaci-Dusseau |
Project Details
| Attribute | Value |
|---|---|
| Difficulty | Advanced |
| Time estimate | 1-2 weeks |
| Prerequisites | Projects 1-4 |
Real World Outcome
- Run
./wal_demo --crash-test: the program crashes mid-transaction, restarts, and recovers to a consistent state - See log replay in action: “Replaying 47 log records… Recovery complete. 3 transactions rolled back.”
- Kill the process with
kill -9at any point and verify no data corruption
Learning Milestones
- First milestone: Append-only log writing → understand sequential durability
- Second milestone: REDO-only recovery → understand forward recovery
- Final milestone: Full UNDO/REDO with checkpoints → understand complete crash recovery
Project 6: SQL Query Engine
- File: sql_query_engine.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go, C++
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: Level 4: The “Open Core” Infrastructure
- Difficulty: Level 4: Expert (The Systems Architect)
- Knowledge Area: Compilers, Query Processing
- Software or Tool: Lexer/Parser, SQLite Architecture
- Main Book: “Database System Concepts” by Silberschatz et al.
What you’ll build: A SQL parser and query executor that can handle SELECT, INSERT, UPDATE, DELETE with WHERE clauses, joins, and basic aggregations.
Why it teaches databases: This is where bytes become tables. You’ll understand how SELECT * FROM users WHERE age > 21 becomes a series of operations: parse → plan → execute. Building a query engine demystifies SQL completely.
Core challenges you’ll face
- Lexing/Parsing: Turning SQL text into an AST (maps to query compilation)
- Query planning: Deciding which indexes to use (maps to optimization)
- Execution engine: Iterator model vs volcano model (maps to runtime)
- Join algorithms: Nested loop, hash join basics (maps to query processing)
Resources for key challenges
- Architecture of SQLite - How SQLite’s query pipeline works
- Deep Dive into SQLite’s Internal Architecture - Tokenizer → Parser → Code Generator → VDBE
Key Concepts
| Concept | Resource |
|---|---|
| Parsing | “Compilers: Principles and Practice” Ch. 3-4 (Lexical/Syntax Analysis) - Dave & Dave |
| Query processing | “Database System Concepts” Ch. 15-16 (Query Processing & Optimization) - Silberschatz et al. |
| Execution models | “Database Internals” Ch. 11 - Alex Petrov |
Project Details
| Attribute | Value |
|---|---|
| Difficulty | Advanced |
| Time estimate | 2-4 weeks |
| Prerequisites | Projects 1-5 (or at minimum 1-3) |
Real World Outcome
A working REPL where you type SQL and get results:
minidb> SELECT name, age FROM users WHERE age > 21;
+--------+-----+
| name | age |
+--------+-----+
| Alice | 25 |
| Bob | 30 |
+--------+-----+
2 rows returned (0.003 sec)
Run EXPLAIN SELECT... and see the query plan.
Learning Milestones
- First milestone: Parse and execute simple SELECT → understand the pipeline
- Second milestone: Add WHERE clause filtering → understand predicate evaluation
- Final milestone: Implement JOIN and basic aggregates → understand relational algebra
Project Comparison Table
| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| 1. Key-Value Store | Beginner | Weekend | ⭐⭐ Foundation | ⭐⭐⭐⭐ Quick wins |
| 2. KV Client Library | Intermediate | 1-2 weeks | ⭐⭐⭐ API Design | ⭐⭐⭐ Practical |
| 3. B-Tree Library | Intermediate | 1-2 weeks | ⭐⭐⭐⭐⭐ Core concept | ⭐⭐⭐⭐ Satisfying |
| 4. Storage Engine | Intermediate | 1-2 weeks | ⭐⭐⭐⭐ Production-like | ⭐⭐⭐ Technical |
| 5. WAL & Recovery | Advanced | 1-2 weeks | ⭐⭐⭐⭐⭐ ACID mastery | ⭐⭐⭐ Challenging |
| 6. SQL Query Engine | Advanced | 2-4 weeks | ⭐⭐⭐⭐ Full picture | ⭐⭐⭐⭐⭐ “I built SQL!” |
Recommended Learning Path
Based on the goal of deeply understanding databases:
Start with Project 3 (B-Tree Library) if you’re comfortable with C file I/O. The B-tree is the heart of database indexing, and everything else builds on top of understanding how data is organized for fast retrieval.
Start with Project 1 (Key-Value Store) if you want to warm up first. It’s a weekend project that builds confidence and sets up patterns you’ll use throughout.
Project 2 (KV Client Library) can be done in parallel with Project 3 or anytime after Project 1. It’s a great way to practice API design and learn socket programming, and the concepts will help you design cleaner interfaces for your database components.
Suggested Timeline
| Week | Project | Focus |
|---|---|---|
| Week 1 | Key-Value Store | Warm-up, file I/O fundamentals |
| Week 2 | KV Client Library | API design, ownership semantics (can be parallel with Week 3) |
| Weeks 3-4 | B-Tree Library | The core insight of database indexing |
| Weeks 5-6 | Storage Engine | Production-level understanding |
| Weeks 7-8 | WAL & Recovery | Reliability mastery |
| Weeks 9-11 | SQL Query Engine | Tie it all together |
Final Capstone Project: SQLite Clone (“TinyDB”)
What you’ll build: A complete, embedded SQL database that stores data in a single file, supports transactions, and handles concurrent readers—essentially a simplified SQLite.
Why this is the ultimate project: This integrates everything: storage engine, B-tree indexes, buffer pool, WAL, SQL parsing, query execution, and transactions. When you finish, you’ll have built a real database from scratch that you can actually use in other projects.
Core challenges you’ll face
- Integration: Making all components work together seamlessly
- File format design: Single-file database with header, pages, freelist
- Transaction isolation: Basic locking or MVCC
- Concurrency: Multiple readers, single writer
Resources for this capstone
- Let’s Build a Simple Database by cstack - The definitive SQLite clone tutorial
- How SQLite Works - Official documentation
- SQLite Source Code - ~150K lines of beautifully documented C
Key Concepts
| Concept | Resource |
|---|---|
| Complete database architecture | “Database Internals” by Alex Petrov - The best modern book on storage engines |
| SQLite specifics | The Architecture of SQLite - Official architecture overview |
| Production considerations | “Designing Data-Intensive Applications” Ch. 3 (Storage and Retrieval) - Kleppmann |
Project Details
| Attribute | Value |
|---|---|
| Difficulty | Advanced |
| Time estimate | 1-2 months |
| Prerequisites | All previous projects |
Real World Outcome
- A single
tinydbbinary and alibtinydb.alibrary you can link into other C programs - Run
./tinydb myapp.dband get a full SQL REPL - Use your database as the backend for a simple web app or CLI tool
- Show someone: “I built this database from scratch in C”
Learning Milestones
- Weeks 1-2: Integrate storage engine + B-tree with file format
- Weeks 3-4: Add SQL parser and basic query execution
- Weeks 5-6: Implement WAL and crash recovery
- Weeks 7-8: Add transactions and basic concurrency
- Final: Polish, optimize, write tests, benchmark against SQLite
Additional Resources
Books (from your collection)
| Book | Relevance |
|---|---|
| “Computer Systems: A Programmer’s Perspective” - Bryant & O’Hallaron | Memory hierarchy, caching, file I/O |
| “Operating Systems: Three Easy Pieces” - Arpaci-Dusseau | File systems, crash consistency |
| “The C Programming Language” - Kernighan & Ritchie | C fundamentals |
| “C Interfaces and Implementations” - David R. Hanson | Clean C design patterns |
| “Algorithms, Fourth Edition” - Sedgewick & Wayne | B-trees, data structures |
Online Resources
- Let’s Build a Simple Database - cstack’s SQLite clone tutorial
- Architecture of SQLite - Official SQLite documentation
- CMU 15-445 Database Systems - Buffer Pool Manager project
- Write-Ahead Logging and ARIES - WAL deep dive
- SQLite Internals: Pages & B-trees - Fly.io blog
- Implementation of B-Tree in C - GeeksforGeeks
- Deep Dive into SQLite’s Internal Architecture - Dev.to article
- PostgreSQL WAL Documentation - Official PostgreSQL docs
Recommended Books (not in your collection)
| Book | Why It’s Essential |
|---|---|
| “Database Internals” by Alex Petrov | The best modern book specifically about storage engine internals |
| “Database System Concepts” by Silberschatz et al. | Comprehensive academic textbook covering all database topics |
| “Designing Data-Intensive Applications” by Martin Kleppmann | Modern systems thinking about data storage and processing |
Quick Start Checklist
- Set up a C development environment (gcc/clang, make, gdb)
- Create a project directory structure
- Start with Project 1 or 3 based on your comfort level
- Read the cstack tutorial alongside your implementation
- Use
hexdumpandgdbliberally to understand what’s happening - Benchmark your implementations to see O(n) vs O(log n) in practice
- Keep notes on what you learn—database concepts are interconnected
Happy building! There’s nothing quite like the moment when you realize you understand how databases actually work.