Project 11: Database Engine

Build a simple on-disk key-value store with paging, indexing, and a query interface.

Quick Reference

Attribute	Value
Difficulty	Expert
Time Estimate	3-4 weeks
Language	C
Prerequisites	Files, data structures, hashing or trees
Key Topics	Storage, paging, indexing
Main Programming Language	C
Alternative Programming Languages	Rust, Go, Zig
Coolness Level	Level 5: Pure Magic (Super Cool)
Business Potential	4. The “Open Core” Infrastructure
Knowledge Area	Databases / File I/O / Data Structures
Software or Tool	hexdump, strace, filesystem
Main Book	Database Internals by Alex Petrov

What You Will Build

Build a simple on-disk key-value store with paging, indexing, and a query interface.

This project emphasizes Databases / File I/O / Data Structures using C with production-style constraints.

1. Learning Objectives

By completing this project, you will:

Design an on-disk file format with pages.
Implement a simple index (hash or B-tree).
Support insert, get, delete operations.
Manage durability with flushes and checkpoints.

2. Theoretical Foundation

2.1 Core Concepts

Pages: Fixed-size blocks (e.g., 4 KB) for disk I/O.
Indexing: Map keys to locations quickly.
Durability: Data must survive crashes.

2.2 Why This Matters

Databases are systems programming in action. Building one teaches storage formats, performance trade-offs, and correctness under failure.

2.3 Historical Context / Background

Early databases relied on B-trees and page-based storage to minimize disk seeks. The same concepts remain in modern systems.

2.4 Common Misconceptions

“Disk I/O is random”: It is expensive and should be minimized.
“A file is just bytes”: Structure is essential for performance.

3. Project Specification

3.1 What You Will Build

A minimal key-value database with:

db_open, db_close
db_put, db_get, db_delete
An on-disk file format
A simple CLI for queries

3.2 Functional Requirements

Persist records to disk.
Retrieve records by key efficiently.
Handle updates and deletes.
Load existing data on startup.

3.3 Non-Functional Requirements

Reliability: Data survives process restart.
Performance: Use buffered I/O and paging.
Usability: Provide clear CLI feedback.

3.4 Example Usage / Output

$ ./db
> put user:1 "Ada"
OK
> get user:1
Ada
> delete user:1
OK

3.5 Real World Outcome

You have a small but real database with a file format you designed. You can explain how keys map to disk pages and how updates work.

4. Solution Architecture

4.1 High-Level Design

CLI -> parser -> storage engine -> page cache -> disk file

4.2 Key Components

Component	Responsibility	Key Decisions
Page cache	Buffer pages in memory	LRU or simple array
Index	Map keys to offsets	Hash index first
Storage	Read/write pages	Fixed-size blocks

4.3 Data Structures

typedef struct {
    uint32_t key_len;
    uint32_t val_len;
    // followed by key bytes and value bytes
} Record;

4.4 Algorithm Overview

Key Algorithm: Put

Hash key to find page bucket.
Load page into cache.
Insert or update record.
Mark page dirty and flush on close.

Complexity Analysis:

Put/Get average: O(1) with hash index
Disk I/O: O(1) per page

5. Implementation Guide

5.1 Development Environment Setup

cc -Wall -Wextra -O2 -g -o db db.c storage.c index.c

5.2 Project Structure

db/
├── src/
│   ├── db.c
│   ├── storage.c
│   └── index.c
├── tests/
│   └── test_db.sh
└── README.md

5.3 The Core Question You’re Answering

“How do I store structured data on disk so it can be retrieved quickly later?”

5.4 Concepts You Must Understand First

Stop and research these before coding:

Paging
- Why fixed-size pages are used for disk I/O.
Indexing
- Hash vs B-tree trade-offs.
Durability
- What happens if the process crashes mid-write?

5.5 Questions to Guide Your Design

Before implementing, think through these:

Will you allow variable-length keys/values?
How will you handle deleted records (tombstones)?
When will you flush dirty pages?

5.6 Thinking Exercise

Page Layout

Sketch a 4 KB page layout that stores multiple records and tracks free space.

5.7 The Interview Questions They’ll Ask

Prepare to answer these:

“Why do databases use paging?”
“What are the trade-offs between hashing and B-trees?”
“How would you handle crashes during writes?”

5.8 Hints in Layers

Hint 1: Start with an append-only file Store records sequentially before adding an index.

Hint 2: Add a simple hash index Use in-memory index first.

Hint 3: Add page cache Read/write in fixed-size blocks.

5.9 Books That Will Help

Topic	Book	Chapter
Storage engines	“Designing Data-Intensive Applications”	Ch. 3
B-trees	“Database System Concepts”	Indexing chapter

5.10 Implementation Phases

Phase 1: Foundation (5-7 days)

Goals:

Basic file format and append-only writes

Tasks:

Write records sequentially.
Implement db_get by scanning.

Checkpoint: Data persists across runs.

Phase 2: Core Functionality (7-10 days)

Goals:

Add index and page cache

Tasks:

Build hash index in memory.
Add fixed-size page reads.

Checkpoint: Lookups are fast.

Phase 3: Polish & Edge Cases (5-7 days)

Goals:

Deletes and compaction

Tasks:

Add tombstones for deletes.
Add compaction command.

Checkpoint: Deleted keys no longer returned.

5.11 Key Implementation Decisions

Decision	Options	Recommendation	Rationale
Index type	Hash vs B-tree	Hash first	Simpler to implement
Write strategy	In-place vs append-only	Append-only	Easier durability

6. Testing Strategy

6.1 Test Categories

Category	Purpose	Examples
Unit Tests	Storage format	Record encode/decode
Integration Tests	CLI behavior	Put/get/delete
Durability Tests	Restart persistence	Reopen DB

6.2 Critical Test Cases

Overwrite key: Value updates.
Restart: Data persists.
Delete: Key no longer returned.

6.3 Test Data

user:1 -> "Ada"
user:2 -> "Linus"

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall	Symptom	Solution
Not flushing data	Data loss	`fsync` on close
Corrupt record lengths	Crash on load	Validate sizes
Index desync	Wrong values	Rebuild index on load

7.2 Debugging Strategies

Add a db_dump tool to inspect pages.
Validate record sizes when reading.

7.3 Performance Traps

Full-file scans for every get are slow. Use an index as early as possible.

8. Extensions & Challenges

8.1 Beginner Extensions

Add db_list to iterate keys.
Add statistics (page counts, free space).

8.2 Intermediate Extensions

Implement B-tree index.
Add basic transactions.

8.3 Advanced Extensions

Implement WAL (write-ahead log).
Add concurrency control.

9. Real-World Connections

9.1 Industry Applications

Embedded DBs: SQLite, LevelDB.
Caching layers: Simple key-value stores.

SQLite: A full SQL database in C.

9.3 Interview Relevance

Storage engine internals and indexing are strong signals for systems roles.

10. Resources

10.1 Essential Reading

“Designing Data-Intensive Applications” - Ch. 3
“Database System Concepts” - Indexing chapter

10.2 Video Resources

Storage engine lectures

10.3 Tools & Documentation

man 2 fsync: Durability guarantees

Hash Table: Indexing support.
HTTP Server: Expose DB via API.

11. Self-Assessment Checklist

11.1 Understanding

I can explain paging.
I can describe hash vs B-tree indexing.
I understand durability trade-offs.

11.2 Implementation

Data persists across runs.
Put/get/delete work correctly.
Index is consistent with storage.

11.3 Growth

I can add WAL or transactions.
I can explain this project in an interview.

Definition of Done

Minimum Viable Completion:

Append-only file with put/get.

Full Completion:

Index and delete support.

Excellence (Going Above & Beyond):

WAL and compaction.

This guide was generated from C_PROGRAMMING_COMPLETE_MASTERY.md. For the complete learning path, see the parent directory.

Project 11: Database Engine

Quick Reference

What You Will Build

1. Learning Objectives

2. Theoretical Foundation

2.1 Core Concepts

2.2 Why This Matters

2.3 Historical Context / Background

2.4 Common Misconceptions

3. Project Specification

3.1 What You Will Build

3.2 Functional Requirements

3.3 Non-Functional Requirements

3.4 Example Usage / Output

3.5 Real World Outcome

4. Solution Architecture

4.1 High-Level Design

4.2 Key Components

4.3 Data Structures

4.4 Algorithm Overview

5. Implementation Guide

5.1 Development Environment Setup

5.2 Project Structure

5.3 The Core Question You’re Answering

5.4 Concepts You Must Understand First

5.5 Questions to Guide Your Design

5.6 Thinking Exercise

Page Layout

5.7 The Interview Questions They’ll Ask

5.8 Hints in Layers

5.9 Books That Will Help

5.10 Implementation Phases

Phase 1: Foundation (5-7 days)

Phase 2: Core Functionality (7-10 days)

Phase 3: Polish & Edge Cases (5-7 days)

5.11 Key Implementation Decisions

6. Testing Strategy

6.1 Test Categories

6.2 Critical Test Cases

6.3 Test Data

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

7.2 Debugging Strategies

7.3 Performance Traps

8. Extensions & Challenges

8.1 Beginner Extensions

8.2 Intermediate Extensions

8.3 Advanced Extensions

9. Real-World Connections

9.1 Industry Applications

9.2 Related Open Source Projects

9.3 Interview Relevance

10. Resources

10.1 Essential Reading

10.2 Video Resources

10.3 Tools & Documentation

10.4 Related Projects in This Series

11. Self-Assessment Checklist

11.1 Understanding

11.2 Implementation

11.3 Growth

Definition of Done