Project 11: Database Engine

Build a simple on-disk key-value store with paging, indexing, and a query interface.

Quick Reference

Attribute Value
Difficulty Expert
Time Estimate 3-4 weeks
Language C
Prerequisites Files, data structures, hashing or trees
Key Topics Storage, paging, indexing

1. Learning Objectives

By completing this project, you will:

  1. Design an on-disk file format with pages.
  2. Implement a simple index (hash or B-tree).
  3. Support insert, get, delete operations.
  4. Manage durability with flushes and checkpoints.

2. Theoretical Foundation

2.1 Core Concepts

  • Pages: Fixed-size blocks (e.g., 4 KB) for disk I/O.
  • Indexing: Map keys to locations quickly.
  • Durability: Data must survive crashes.

2.2 Why This Matters

Databases are systems programming in action. Building one teaches storage formats, performance trade-offs, and correctness under failure.

2.3 Historical Context / Background

Early databases relied on B-trees and page-based storage to minimize disk seeks. The same concepts remain in modern systems.

2.4 Common Misconceptions

  • “Disk I/O is random”: It is expensive and should be minimized.
  • “A file is just bytes”: Structure is essential for performance.

3. Project Specification

3.1 What You Will Build

A minimal key-value database with:

  • db_open, db_close
  • db_put, db_get, db_delete
  • An on-disk file format
  • A simple CLI for queries

3.2 Functional Requirements

  1. Persist records to disk.
  2. Retrieve records by key efficiently.
  3. Handle updates and deletes.
  4. Load existing data on startup.

3.3 Non-Functional Requirements

  • Reliability: Data survives process restart.
  • Performance: Use buffered I/O and paging.
  • Usability: Provide clear CLI feedback.

3.4 Example Usage / Output

$ ./db
> put user:1 "Ada"
OK
> get user:1
Ada
> delete user:1
OK

3.5 Real World Outcome

You have a small but real database with a file format you designed. You can explain how keys map to disk pages and how updates work.


4. Solution Architecture

4.1 High-Level Design

CLI -> parser -> storage engine -> page cache -> disk file

4.2 Key Components

Component Responsibility Key Decisions
Page cache Buffer pages in memory LRU or simple array
Index Map keys to offsets Hash index first
Storage Read/write pages Fixed-size blocks

4.3 Data Structures

typedef struct {
    uint32_t key_len;
    uint32_t val_len;
    // followed by key bytes and value bytes
} Record;

4.4 Algorithm Overview

Key Algorithm: Put

  1. Hash key to find page bucket.
  2. Load page into cache.
  3. Insert or update record.
  4. Mark page dirty and flush on close.

Complexity Analysis:

  • Put/Get average: O(1) with hash index
  • Disk I/O: O(1) per page

5. Implementation Guide

5.1 Development Environment Setup

cc -Wall -Wextra -O2 -g -o db db.c storage.c index.c

5.2 Project Structure

db/
├── src/
│   ├── db.c
│   ├── storage.c
│   └── index.c
├── tests/
│   └── test_db.sh
└── README.md

5.3 The Core Question You’re Answering

“How do I store structured data on disk so it can be retrieved quickly later?”

5.4 Concepts You Must Understand First

Stop and research these before coding:

  1. Paging
    • Why fixed-size pages are used for disk I/O.
  2. Indexing
    • Hash vs B-tree trade-offs.
  3. Durability
    • What happens if the process crashes mid-write?

5.5 Questions to Guide Your Design

Before implementing, think through these:

  1. Will you allow variable-length keys/values?
  2. How will you handle deleted records (tombstones)?
  3. When will you flush dirty pages?

5.6 Thinking Exercise

Page Layout

Sketch a 4 KB page layout that stores multiple records and tracks free space.

5.7 The Interview Questions They’ll Ask

Prepare to answer these:

  1. “Why do databases use paging?”
  2. “What are the trade-offs between hashing and B-trees?”
  3. “How would you handle crashes during writes?”

5.8 Hints in Layers

Hint 1: Start with an append-only file Store records sequentially before adding an index.

Hint 2: Add a simple hash index Use in-memory index first.

Hint 3: Add page cache Read/write in fixed-size blocks.

5.9 Books That Will Help

Topic Book Chapter
Storage engines “Designing Data-Intensive Applications” Ch. 3
B-trees “Database System Concepts” Indexing chapter

5.10 Implementation Phases

Phase 1: Foundation (5-7 days)

Goals:

  • Basic file format and append-only writes

Tasks:

  1. Write records sequentially.
  2. Implement db_get by scanning.

Checkpoint: Data persists across runs.

Phase 2: Core Functionality (7-10 days)

Goals:

  • Add index and page cache

Tasks:

  1. Build hash index in memory.
  2. Add fixed-size page reads.

Checkpoint: Lookups are fast.

Phase 3: Polish & Edge Cases (5-7 days)

Goals:

  • Deletes and compaction

Tasks:

  1. Add tombstones for deletes.
  2. Add compaction command.

Checkpoint: Deleted keys no longer returned.

5.11 Key Implementation Decisions

Decision Options Recommendation Rationale
Index type Hash vs B-tree Hash first Simpler to implement
Write strategy In-place vs append-only Append-only Easier durability

6. Testing Strategy

6.1 Test Categories

Category Purpose Examples
Unit Tests Storage format Record encode/decode
Integration Tests CLI behavior Put/get/delete
Durability Tests Restart persistence Reopen DB

6.2 Critical Test Cases

  1. Overwrite key: Value updates.
  2. Restart: Data persists.
  3. Delete: Key no longer returned.

6.3 Test Data

user:1 -> "Ada"
user:2 -> "Linus"

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall Symptom Solution
Not flushing data Data loss fsync on close
Corrupt record lengths Crash on load Validate sizes
Index desync Wrong values Rebuild index on load

7.2 Debugging Strategies

  • Add a db_dump tool to inspect pages.
  • Validate record sizes when reading.

7.3 Performance Traps

Full-file scans for every get are slow. Use an index as early as possible.


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add db_list to iterate keys.
  • Add statistics (page counts, free space).

8.2 Intermediate Extensions

  • Implement B-tree index.
  • Add basic transactions.

8.3 Advanced Extensions

  • Implement WAL (write-ahead log).
  • Add concurrency control.

9. Real-World Connections

9.1 Industry Applications

  • Embedded DBs: SQLite, LevelDB.
  • Caching layers: Simple key-value stores.
  • SQLite: A full SQL database in C.

9.3 Interview Relevance

Storage engine internals and indexing are strong signals for systems roles.


10. Resources

10.1 Essential Reading

  • “Designing Data-Intensive Applications” - Ch. 3
  • “Database System Concepts” - Indexing chapter

10.2 Video Resources

  • Storage engine lectures

10.3 Tools & Documentation

  • man 2 fsync: Durability guarantees
  • Hash Table: Indexing support.
  • HTTP Server: Expose DB via API.

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain paging.
  • I can describe hash vs B-tree indexing.
  • I understand durability trade-offs.

11.2 Implementation

  • Data persists across runs.
  • Put/get/delete work correctly.
  • Index is consistent with storage.

11.3 Growth

  • I can add WAL or transactions.
  • I can explain this project in an interview.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Append-only file with put/get.

Full Completion:

  • Index and delete support.

Excellence (Going Above & Beyond):

  • WAL and compaction.

This guide was generated from C_PROGRAMMING_COMPLETE_MASTERY.md. For the complete learning path, see the parent directory.