← Back to all projects

DATABASE INTERNALS C LEARNING PROJECTS

Learning Database Internals with C

A hands-on learning path to deeply understand how databases work by building one from scratch in C.

Databases are one of the most fascinating systems to understand deeply, and C is the perfect language for this—SQLite, PostgreSQL, and MySQL are all written in C. You’ll see exactly how bytes become queries.


Core Concept Analysis

Understanding “how databases work” breaks down into these fundamental building blocks:

Layer What It Does Key Concepts
Storage Persists data to disk Pages, blocks, file formats, serialization
Indexing Fast data lookup B-trees, B+ trees, hash indexes
Buffer Pool Manages memory/disk I/O Page caching, eviction policies (LRU)
Query Processing Interprets SQL Parsing, planning, execution
Transactions Ensures ACID Locking, MVCC, isolation levels
Recovery Survives crashes Write-Ahead Logging (WAL), checkpoints

Project 1: Persistent Key-Value Store

  • File: persistent_key_value_store.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: Level 1: The “Resume Gold”
  • Difficulty: Level 1: Beginner (The Tinkerer)
  • Knowledge Area: Database Internals, File I/O
  • Software or Tool: File System, Binary Serialization
  • Main Book: “The C Programming Language” by Kernighan & Ritchie

What you’ll build: A simple key-value database that stores data to disk, supports GET/PUT/DELETE operations, and survives program restarts.

Why it teaches databases: This is the absolute foundation. Before B-trees, SQL, or transactions, a database must solve one problem: how do you store structured data on disk and read it back efficiently? You’ll confront serialization, file formats, and the gap between memory and persistent storage.

Core challenges you’ll face

  • Serialization: How do you convert C structs to bytes and back? (maps to data encoding)
  • File I/O: When do you fopen, fread, fwrite, fsync? (maps to durability)
  • Simple indexing: How do you find a key without scanning everything? (maps to indexing concepts)
  • Memory management: Managing buffers, avoiding leaks (maps to buffer management)

Key Concepts

Concept Resource
File I/O in C “The C Programming Language” Ch. 7 (Input and Output) - Kernighan & Ritchie
Serialization formats “Designing Data-Intensive Applications” Ch. 4 (Encoding and Evolution) - Martin Kleppmann
Basic data structures “C Interfaces and Implementations” Ch. 2 (Interfaces and Implementations) - David R. Hanson

Project Details

Attribute Value
Difficulty Beginner
Time estimate Weekend
Prerequisites Basic C (pointers, structs, file I/O)

Real World Outcome

  • A working CLI tool where you can type ./kvstore put name "Douglas" and ./kvstore get name returns Douglas
  • Data persists after you close and reopen the program
  • You can hexdump your data file and understand every byte

Learning Milestones

  1. First milestone: Store a single key-value pair and retrieve it → you understand basic file writes
  2. Second milestone: Handle multiple keys with a simple linear scan → you see why this is slow (O(n))
  3. Final milestone: Add a simple hash-based index file → you understand why databases need indexes

Project 2: Key-Value Store Client Library

  • File: SPRINT_4_BOUNDARIES_INTERFACES_PROJECTS.md
  • Programming Language: C
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: API Design / Networking
  • Software or Tool: Redis Client Protocol
  • Main Book: “Effective C, 2nd Edition” by Robert C. Seacord

What you’ll build: A C client library (like libkvclient) that connects to Redis or your own simple TCP key-value server. The library exposes kv_connect(), kv_set(), kv_get(), kv_disconnect() and handles all protocol details internally.

Why it teaches Boundaries & Interfaces: Client libraries are the purest form of “boundary as contract.” Your users see only your .h file. They never see your parsing logic, socket handling, or internal buffers. You must communicate ownership clearly: who owns the returned string from kv_get()? The caller? The library? This ambiguity has caused thousands of real-world bugs.

Core challenges you’ll face

  • Opaque handle design: Designing kv_handle as an opaque type (encapsulation, information hiding)
  • Memory ownership: Deciding who owns memory returned by kv_get() and making it obvious (ownership across boundaries)
  • Const correctness: Using const char* correctly for keys vs mutable buffers for values
  • Protocol abstraction: Hiding protocol details while allowing configuration (internal vs external invariants)
  • State management: Handling connection state without exposing it (avoiding global state)

Key Concepts

Concept Resource
Opaque pointers pattern “C Interfaces and Implementations” Ch. 2 - David R. Hanson
Memory ownership in APIs “Effective C, 2nd Edition” Ch. 6 (Memory Management) - Robert C. Seacord
Const correctness “C Programming: A Modern Approach” Ch. 17 - K. N. King
Socket programming “The Linux Programming Interface” Ch. 56-59 - Michael Kerrisk
Defensive coding “Code Complete, 2nd Edition” Ch. 8 (Defensive Programming) - Steve McConnell

Project Details

Attribute Value
Difficulty Intermediate
Time estimate 1-2 weeks
Prerequisites Project 1, basic socket programming

Real World Outcome

kv_handle *db = kv_connect("localhost", 6379);
kv_set(db, "user:1:name", "Alice");
char *name = kv_get(db, "user:1:name");  // Clear who owns this!
printf("Name: %s\n", name);
kv_free_string(name);  // Explicit ownership transfer
kv_disconnect(db);

Your library connects to a real Redis server (or your mock), and the API makes ownership unmistakable.

Learning Milestones

  1. First milestone: Implement kv_connect() returning an opaque handle → understand why hiding struct internals prevents misuse
  2. Second milestone: Design “who frees the returned string” → you’ll never design an ambiguous ownership API again
  3. Final milestone: Add error handling that doesn’t expose internal errno → understand the boundary between internal and external invariants

Project 3: B-Tree Library

  • File: DATABASE_INTERNALS_C_LEARNING_PROJECTS.md
  • Programming Language: C
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Data Structures / Databases
  • Software or Tool: B-Trees
  • Main Book: “Database Internals” by Alex Petrov

What you’ll build: A disk-backed B-tree implementation that can store millions of keys with O(log n) lookups, insertions, and deletions.

Why it teaches databases: The B-tree is the data structure of databases. SQLite uses B-trees for tables and indexes. PostgreSQL uses B+ trees. Understanding B-trees means understanding why databases are fast. When you implement node splits and merges yourself, you’ll never forget how indexing works.

Core challenges you’ll face

  • Node structure: Designing the on-disk format for internal and leaf nodes (maps to page layout)
  • Tree traversal: Following child pointers through pages (maps to random vs sequential I/O)
  • Node splitting: What happens when a node is full? (maps to tree balancing)
  • Page management: Each node = one disk page (maps to buffer pool concepts)

Resources for key challenges

Key Concepts

Concept Resource
B-tree fundamentals “Algorithms, Fourth Edition” §3.3 (Balanced Search Trees) - Sedgewick & Wayne
Disk-based tree structures “Database Internals” Ch. 2-4 - Alex Petrov
Page layout “Computer Systems: A Programmer’s Perspective” Ch. 6 (Memory Hierarchy) - Bryant & O’Hallaron

Project Details

Attribute Value
Difficulty Intermediate
Time estimate 1-2 weeks
Prerequisites Project 1, understanding of tree data structures

Real World Outcome

  • A library where you can insert 1 million keys and retrieve any key in <10 disk reads
  • Visual output showing tree structure: ./btree_demo --visualize prints the tree levels
  • Benchmark comparison showing O(log n) vs O(n) performance

Learning Milestones

  1. First milestone: Implement search through a pre-built tree → understand tree traversal
  2. Second milestone: Implement insertion with node splitting → understand how trees grow
  3. Final milestone: Implement deletion with merging → understand the full complexity of balanced trees

Project 4: Page-Based Storage Engine

  • File: DATABASE_INTERNALS_C_LEARNING_PROJECTS.md
  • Programming Language: C
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Operating Systems / Databases
  • Software or Tool: Buffer Pool
  • Main Book: “Database System Concepts” by Silberschatz et al.

What you’ll build: A storage engine with fixed-size pages, a page directory, a buffer pool with LRU eviction, and support for variable-length records.

Why it teaches databases: Real databases don’t just dump data to disk—they organize it into pages (typically 4KB or 8KB). This project teaches you how databases actually manage disk space, cache hot pages in memory, and handle records that don’t fit neatly into fixed slots.

Core challenges you’ll face

  • Page format design: Headers, slots, free space management (maps to physical storage)
  • Buffer pool management: Which pages to keep in memory? (maps to caching)
  • LRU eviction: Implementing an eviction policy (maps to cache replacement)
  • Slot directory: Finding records within a page (maps to tuple storage)

Resources for key challenges

Key Concepts

Concept Resource
Buffer management “Database System Concepts” Ch. 13 (Data Storage Structures) - Silberschatz et al.
Page layout “Operating Systems: Three Easy Pieces” Ch. 39-40 (Files and Directories) - Arpaci-Dusseau
Cache replacement policies “Computer Systems: A Programmer’s Perspective” Ch. 6.4 (Cache Memories) - Bryant & O’Hallaron

Project Details

Attribute Value
Difficulty Intermediate
Time estimate 1-2 weeks
Prerequisites Project 3

Real World Outcome

  • Run ./storage_demo and see buffer pool statistics: “Buffer pool: 64 pages, 58 hits, 6 misses, 91% hit rate”
  • Insert 100,000 records and watch the buffer pool manage memory efficiently
  • Monitor which pages are hot vs cold with a debug view

Learning Milestones

  1. First milestone: Fixed-size record storage in pages → understand page structure
  2. Second milestone: Implement buffer pool with pinning → understand memory management
  3. Final milestone: Add LRU eviction and variable-length records → understand production storage engines

Project 5: Write-Ahead Log (WAL) and Crash Recovery

  • File: DATABASE_INTERNALS_C_LEARNING_PROJECTS.md
  • Programming Language: C
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 4: Expert
  • Knowledge Area: Databases / Crash Consistency
  • Software or Tool: WAL
  • Main Book: “Database Internals” by Alex Petrov

What you’ll build: A transaction log that guarantees durability—your database survives crashes without losing committed data.

Why it teaches databases: WAL is what makes databases reliable. The “D” in ACID (Durability) comes from write-ahead logging. You’ll understand why databases write to a log first, how they recover from crashes, and why fsync() matters so much.

Core challenges you’ll face

  • Log record format: Designing undo/redo log entries (maps to recovery)
  • Log-write protocol: Ensuring logs hit disk before data (maps to durability)
  • Crash recovery: Replaying the log to restore consistency (maps to REDO/UNDO)
  • Checkpointing: Limiting recovery time (maps to performance)

Resources for key challenges

Key Concepts

Concept Resource
ACID guarantees “Designing Data-Intensive Applications” Ch. 7 (Transactions) - Martin Kleppmann
Recovery algorithms “Database System Concepts” Ch. 19 (Recovery System) - Silberschatz et al.
File system semantics “Operating Systems: Three Easy Pieces” Ch. 42 (Crash Consistency) - Arpaci-Dusseau

Project Details

Attribute Value
Difficulty Advanced
Time estimate 1-2 weeks
Prerequisites Projects 1-4

Real World Outcome

  • Run ./wal_demo --crash-test: the program crashes mid-transaction, restarts, and recovers to a consistent state
  • See log replay in action: “Replaying 47 log records… Recovery complete. 3 transactions rolled back.”
  • Kill the process with kill -9 at any point and verify no data corruption

Learning Milestones

  1. First milestone: Append-only log writing → understand sequential durability
  2. Second milestone: REDO-only recovery → understand forward recovery
  3. Final milestone: Full UNDO/REDO with checkpoints → understand complete crash recovery

Project 6: SQL Query Engine

  • File: sql_query_engine.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go, C++
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: Level 4: The “Open Core” Infrastructure
  • Difficulty: Level 4: Expert (The Systems Architect)
  • Knowledge Area: Compilers, Query Processing
  • Software or Tool: Lexer/Parser, SQLite Architecture
  • Main Book: “Database System Concepts” by Silberschatz et al.

What you’ll build: A SQL parser and query executor that can handle SELECT, INSERT, UPDATE, DELETE with WHERE clauses, joins, and basic aggregations.

Why it teaches databases: This is where bytes become tables. You’ll understand how SELECT * FROM users WHERE age > 21 becomes a series of operations: parse → plan → execute. Building a query engine demystifies SQL completely.

Core challenges you’ll face

  • Lexing/Parsing: Turning SQL text into an AST (maps to query compilation)
  • Query planning: Deciding which indexes to use (maps to optimization)
  • Execution engine: Iterator model vs volcano model (maps to runtime)
  • Join algorithms: Nested loop, hash join basics (maps to query processing)

Resources for key challenges

Key Concepts

Concept Resource
Parsing “Compilers: Principles and Practice” Ch. 3-4 (Lexical/Syntax Analysis) - Dave & Dave
Query processing “Database System Concepts” Ch. 15-16 (Query Processing & Optimization) - Silberschatz et al.
Execution models “Database Internals” Ch. 11 - Alex Petrov

Project Details

Attribute Value
Difficulty Advanced
Time estimate 2-4 weeks
Prerequisites Projects 1-5 (or at minimum 1-3)

Real World Outcome

A working REPL where you type SQL and get results:

minidb> SELECT name, age FROM users WHERE age > 21;
+--------+-----+
| name   | age |
+--------+-----+
| Alice  | 25  |
| Bob    | 30  |
+--------+-----+
2 rows returned (0.003 sec)

Run EXPLAIN SELECT... and see the query plan.

Learning Milestones

  1. First milestone: Parse and execute simple SELECT → understand the pipeline
  2. Second milestone: Add WHERE clause filtering → understand predicate evaluation
  3. Final milestone: Implement JOIN and basic aggregates → understand relational algebra

Project Comparison Table

Project Difficulty Time Depth of Understanding Fun Factor
1. Key-Value Store Beginner Weekend ⭐⭐ Foundation ⭐⭐⭐⭐ Quick wins
2. KV Client Library Intermediate 1-2 weeks ⭐⭐⭐ API Design ⭐⭐⭐ Practical
3. B-Tree Library Intermediate 1-2 weeks ⭐⭐⭐⭐⭐ Core concept ⭐⭐⭐⭐ Satisfying
4. Storage Engine Intermediate 1-2 weeks ⭐⭐⭐⭐ Production-like ⭐⭐⭐ Technical
5. WAL & Recovery Advanced 1-2 weeks ⭐⭐⭐⭐⭐ ACID mastery ⭐⭐⭐ Challenging
6. SQL Query Engine Advanced 2-4 weeks ⭐⭐⭐⭐ Full picture ⭐⭐⭐⭐⭐ “I built SQL!”

Based on the goal of deeply understanding databases:

Start with Project 3 (B-Tree Library) if you’re comfortable with C file I/O. The B-tree is the heart of database indexing, and everything else builds on top of understanding how data is organized for fast retrieval.

Start with Project 1 (Key-Value Store) if you want to warm up first. It’s a weekend project that builds confidence and sets up patterns you’ll use throughout.

Project 2 (KV Client Library) can be done in parallel with Project 3 or anytime after Project 1. It’s a great way to practice API design and learn socket programming, and the concepts will help you design cleaner interfaces for your database components.

Suggested Timeline

Week Project Focus
Week 1 Key-Value Store Warm-up, file I/O fundamentals
Week 2 KV Client Library API design, ownership semantics (can be parallel with Week 3)
Weeks 3-4 B-Tree Library The core insight of database indexing
Weeks 5-6 Storage Engine Production-level understanding
Weeks 7-8 WAL & Recovery Reliability mastery
Weeks 9-11 SQL Query Engine Tie it all together

Final Capstone Project: SQLite Clone (“TinyDB”)

What you’ll build: A complete, embedded SQL database that stores data in a single file, supports transactions, and handles concurrent readers—essentially a simplified SQLite.

Why this is the ultimate project: This integrates everything: storage engine, B-tree indexes, buffer pool, WAL, SQL parsing, query execution, and transactions. When you finish, you’ll have built a real database from scratch that you can actually use in other projects.

Core challenges you’ll face

  • Integration: Making all components work together seamlessly
  • File format design: Single-file database with header, pages, freelist
  • Transaction isolation: Basic locking or MVCC
  • Concurrency: Multiple readers, single writer

Resources for this capstone

Key Concepts

Concept Resource
Complete database architecture “Database Internals” by Alex Petrov - The best modern book on storage engines
SQLite specifics The Architecture of SQLite - Official architecture overview
Production considerations “Designing Data-Intensive Applications” Ch. 3 (Storage and Retrieval) - Kleppmann

Project Details

Attribute Value
Difficulty Advanced
Time estimate 1-2 months
Prerequisites All previous projects

Real World Outcome

  • A single tinydb binary and a libtinydb.a library you can link into other C programs
  • Run ./tinydb myapp.db and get a full SQL REPL
  • Use your database as the backend for a simple web app or CLI tool
  • Show someone: “I built this database from scratch in C”

Learning Milestones

  1. Weeks 1-2: Integrate storage engine + B-tree with file format
  2. Weeks 3-4: Add SQL parser and basic query execution
  3. Weeks 5-6: Implement WAL and crash recovery
  4. Weeks 7-8: Add transactions and basic concurrency
  5. Final: Polish, optimize, write tests, benchmark against SQLite

Additional Resources

Books (from your collection)

Book Relevance
“Computer Systems: A Programmer’s Perspective” - Bryant & O’Hallaron Memory hierarchy, caching, file I/O
“Operating Systems: Three Easy Pieces” - Arpaci-Dusseau File systems, crash consistency
“The C Programming Language” - Kernighan & Ritchie C fundamentals
“C Interfaces and Implementations” - David R. Hanson Clean C design patterns
“Algorithms, Fourth Edition” - Sedgewick & Wayne B-trees, data structures

Online Resources

Book Why It’s Essential
“Database Internals” by Alex Petrov The best modern book specifically about storage engine internals
“Database System Concepts” by Silberschatz et al. Comprehensive academic textbook covering all database topics
“Designing Data-Intensive Applications” by Martin Kleppmann Modern systems thinking about data storage and processing

Quick Start Checklist

  • Set up a C development environment (gcc/clang, make, gdb)
  • Create a project directory structure
  • Start with Project 1 or 3 based on your comfort level
  • Read the cstack tutorial alongside your implementation
  • Use hexdump and gdb liberally to understand what’s happening
  • Benchmark your implementations to see O(n) vs O(log n) in practice
  • Keep notes on what you learn—database concepts are interconnected

Happy building! There’s nothing quite like the moment when you realize you understand how databases actually work.