Project 11: Database Engine
Build a simple on-disk key-value store with paging, indexing, and a query interface.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Expert |
| Time Estimate | 3-4 weeks |
| Language | C |
| Prerequisites | Files, data structures, hashing or trees |
| Key Topics | Storage, paging, indexing |
1. Learning Objectives
By completing this project, you will:
- Design an on-disk file format with pages.
- Implement a simple index (hash or B-tree).
- Support insert, get, delete operations.
- Manage durability with flushes and checkpoints.
2. Theoretical Foundation
2.1 Core Concepts
- Pages: Fixed-size blocks (e.g., 4 KB) for disk I/O.
- Indexing: Map keys to locations quickly.
- Durability: Data must survive crashes.
2.2 Why This Matters
Databases are systems programming in action. Building one teaches storage formats, performance trade-offs, and correctness under failure.
2.3 Historical Context / Background
Early databases relied on B-trees and page-based storage to minimize disk seeks. The same concepts remain in modern systems.
2.4 Common Misconceptions
- “Disk I/O is random”: It is expensive and should be minimized.
- “A file is just bytes”: Structure is essential for performance.
3. Project Specification
3.1 What You Will Build
A minimal key-value database with:
db_open,db_closedb_put,db_get,db_delete- An on-disk file format
- A simple CLI for queries
3.2 Functional Requirements
- Persist records to disk.
- Retrieve records by key efficiently.
- Handle updates and deletes.
- Load existing data on startup.
3.3 Non-Functional Requirements
- Reliability: Data survives process restart.
- Performance: Use buffered I/O and paging.
- Usability: Provide clear CLI feedback.
3.4 Example Usage / Output
$ ./db
> put user:1 "Ada"
OK
> get user:1
Ada
> delete user:1
OK
3.5 Real World Outcome
You have a small but real database with a file format you designed. You can explain how keys map to disk pages and how updates work.
4. Solution Architecture
4.1 High-Level Design
CLI -> parser -> storage engine -> page cache -> disk file
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Page cache | Buffer pages in memory | LRU or simple array |
| Index | Map keys to offsets | Hash index first |
| Storage | Read/write pages | Fixed-size blocks |
4.3 Data Structures
typedef struct {
uint32_t key_len;
uint32_t val_len;
// followed by key bytes and value bytes
} Record;
4.4 Algorithm Overview
Key Algorithm: Put
- Hash key to find page bucket.
- Load page into cache.
- Insert or update record.
- Mark page dirty and flush on close.
Complexity Analysis:
- Put/Get average: O(1) with hash index
- Disk I/O: O(1) per page
5. Implementation Guide
5.1 Development Environment Setup
cc -Wall -Wextra -O2 -g -o db db.c storage.c index.c
5.2 Project Structure
db/
├── src/
│ ├── db.c
│ ├── storage.c
│ └── index.c
├── tests/
│ └── test_db.sh
└── README.md
5.3 The Core Question You’re Answering
“How do I store structured data on disk so it can be retrieved quickly later?”
5.4 Concepts You Must Understand First
Stop and research these before coding:
- Paging
- Why fixed-size pages are used for disk I/O.
- Indexing
- Hash vs B-tree trade-offs.
- Durability
- What happens if the process crashes mid-write?
5.5 Questions to Guide Your Design
Before implementing, think through these:
- Will you allow variable-length keys/values?
- How will you handle deleted records (tombstones)?
- When will you flush dirty pages?
5.6 Thinking Exercise
Page Layout
Sketch a 4 KB page layout that stores multiple records and tracks free space.
5.7 The Interview Questions They’ll Ask
Prepare to answer these:
- “Why do databases use paging?”
- “What are the trade-offs between hashing and B-trees?”
- “How would you handle crashes during writes?”
5.8 Hints in Layers
Hint 1: Start with an append-only file Store records sequentially before adding an index.
Hint 2: Add a simple hash index Use in-memory index first.
Hint 3: Add page cache Read/write in fixed-size blocks.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Storage engines | “Designing Data-Intensive Applications” | Ch. 3 |
| B-trees | “Database System Concepts” | Indexing chapter |
5.10 Implementation Phases
Phase 1: Foundation (5-7 days)
Goals:
- Basic file format and append-only writes
Tasks:
- Write records sequentially.
- Implement
db_getby scanning.
Checkpoint: Data persists across runs.
Phase 2: Core Functionality (7-10 days)
Goals:
- Add index and page cache
Tasks:
- Build hash index in memory.
- Add fixed-size page reads.
Checkpoint: Lookups are fast.
Phase 3: Polish & Edge Cases (5-7 days)
Goals:
- Deletes and compaction
Tasks:
- Add tombstones for deletes.
- Add compaction command.
Checkpoint: Deleted keys no longer returned.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Index type | Hash vs B-tree | Hash first | Simpler to implement |
| Write strategy | In-place vs append-only | Append-only | Easier durability |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Storage format | Record encode/decode |
| Integration Tests | CLI behavior | Put/get/delete |
| Durability Tests | Restart persistence | Reopen DB |
6.2 Critical Test Cases
- Overwrite key: Value updates.
- Restart: Data persists.
- Delete: Key no longer returned.
6.3 Test Data
user:1 -> "Ada"
user:2 -> "Linus"
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Not flushing data | Data loss | fsync on close |
| Corrupt record lengths | Crash on load | Validate sizes |
| Index desync | Wrong values | Rebuild index on load |
7.2 Debugging Strategies
- Add a
db_dumptool to inspect pages. - Validate record sizes when reading.
7.3 Performance Traps
Full-file scans for every get are slow. Use an index as early as possible.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add
db_listto iterate keys. - Add statistics (page counts, free space).
8.2 Intermediate Extensions
- Implement B-tree index.
- Add basic transactions.
8.3 Advanced Extensions
- Implement WAL (write-ahead log).
- Add concurrency control.
9. Real-World Connections
9.1 Industry Applications
- Embedded DBs: SQLite, LevelDB.
- Caching layers: Simple key-value stores.
9.2 Related Open Source Projects
- SQLite: A full SQL database in C.
9.3 Interview Relevance
Storage engine internals and indexing are strong signals for systems roles.
10. Resources
10.1 Essential Reading
- “Designing Data-Intensive Applications” - Ch. 3
- “Database System Concepts” - Indexing chapter
10.2 Video Resources
- Storage engine lectures
10.3 Tools & Documentation
man 2 fsync: Durability guarantees
10.4 Related Projects in This Series
- Hash Table: Indexing support.
- HTTP Server: Expose DB via API.
11. Self-Assessment Checklist
11.1 Understanding
- I can explain paging.
- I can describe hash vs B-tree indexing.
- I understand durability trade-offs.
11.2 Implementation
- Data persists across runs.
- Put/get/delete work correctly.
- Index is consistent with storage.
11.3 Growth
- I can add WAL or transactions.
- I can explain this project in an interview.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Append-only file with put/get.
Full Completion:
- Index and delete support.
Excellence (Going Above & Beyond):
- WAL and compaction.
This guide was generated from C_PROGRAMMING_COMPLETE_MASTERY.md. For the complete learning path, see the parent directory.