Project 3: Persistent Key-Value Store

Build a durable key-value store with a write-ahead log, memtable, and on-disk SSTables.

Quick Reference

| Attribute | Value |
|-----------|-------|
| Difficulty | Advanced |
| Time Estimate | 2-4 weeks |
| Main Programming Language | C or Rust |
| Alternative Programming Languages | Go, C++ |
| Coolness Level | Level 4: Hardcore Tech Flex |
| Business Potential | Storage and data services |
| Prerequisites | File I/O, data structures, hashing |
| Key Topics | WAL, SSTables, compaction, indexing, checksums |

1. Learning Objectives

By completing this project, you will:

  1. Implement a durable key-value store with crash recovery.
  2. Explain log-structured storage and why it improves write throughput.
  3. Build immutable on-disk tables with indexes and bloom filters.
  4. Implement compaction and tombstones for deletes.
  5. Analyze read/write amplification and trade-offs.
  6. Provide a CLI or TCP API to exercise the store.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Write-Ahead Log (WAL)

Description / Expanded Explanation

A WAL records every mutation before it is applied to in-memory structures. If the process crashes, the log can be replayed to restore state. This ensures durability and atomicity.

Definitions & Key Terms
  • WAL -> append-only log of operations
  • durability -> data survives crashes
  • replay -> reapply log entries to rebuild state
  • fsync -> force data to disk
Mental Model Diagram (ASCII)
PUT k1=v1
append to WAL -> fsync -> apply to memtable
Mental Model Diagram (Image): WAL Flow

How It Works (Step-by-Step)
  1. Client issues PUT/DELETE.
  2. Append record to WAL with checksum.
  3. Flush to disk (fsync).
  4. Apply mutation to memtable.
  5. On crash, replay WAL to rebuild memtable.
Minimal Concrete Example
WAL record: [len][crc][op][key_len][key][val_len][value]
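
A minimal append-path sketch in C, assuming zlib's crc32() for checksums; wal_append, OP_PUT, and OP_DEL are illustrative names, not a prescribed API:

```c
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <zlib.h>   /* crc32() */

enum { OP_PUT = 1, OP_DEL = 2 };

/* Serialize one mutation, checksum it, append it, and fsync. Only after
 * this returns 0 may the memtable be updated. */
int wal_append(int fd, uint8_t op, const void *key, uint32_t klen,
               const void *val, uint32_t vlen) {
    uint8_t payload[4096];
    uint32_t plen = 1 + 4 + klen + 4 + vlen;  /* op + key_len + key + val_len + value */
    if (plen > sizeof(payload)) return -1;

    uint8_t *p = payload;
    *p++ = op;
    memcpy(p, &klen, 4); p += 4;
    memcpy(p, key, klen); p += klen;
    memcpy(p, &vlen, 4); p += 4;
    memcpy(p, val, vlen);

    uint32_t crc = crc32(0L, payload, plen);

    /* Length first, so replay knows how many payload bytes to expect. */
    if (write(fd, &plen, 4) != 4) return -1;
    if (write(fd, &crc, 4) != 4) return -1;
    if (write(fd, payload, plen) != (ssize_t)plen) return -1;

    return fsync(fd);  /* durability point */
}
```

Batching several records before a single fsync (group commit) keeps the same layout; only the flush policy changes.
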
Common Misconceptions
  • “WAL is optional” -> without it you lose recent writes on crash.
  • “fsync every write is too slow” -> batching and group commit reduce cost.
Check-Your-Understanding Questions
  1. Why must the WAL be written before updating the memtable?
  2. What happens if fsync is skipped?
  3. How can you batch fsync calls?
Where You’ll Apply It
  • See 3.5 for WAL record format.
  • See 5.10 Phase 1 for WAL implementation.
  • Also used in: P04 Raft Consensus

2.2 Memtable and SSTables

Description / Expanded Explanation

The memtable is an in-memory sorted structure. When it reaches a size threshold, it is flushed to disk as an immutable SSTable. This design keeps writes fast and sequential.

Definitions & Key Terms
  • memtable -> in-memory sorted map
  • SSTable -> sorted string table stored on disk
  • flush -> write memtable to SSTable
  • immutable -> SSTables are never modified in-place
Mental Model Diagram (ASCII)
writes -> memtable -> flush -> SSTable file
Mental Model Diagram (Image): Memtable and SSTable

How It Works (Step-by-Step)
  1. Insert into memtable (balanced tree or skiplist).
  2. When memtable is full, freeze it.
  3. Write sorted entries to SSTable file.
  4. Create index and bloom filter for the SSTable.
Minimal Concrete Example
memtable: {a:1, b:2, z:9}
SSTable data: a=1, b=2, z=9
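
A minimal flush sketch in C, assuming the frozen memtable has already been drained into a sorted array; struct entry and sstable_flush are illustrative names:

```c
#include <stdint.h>
#include <stdio.h>

struct entry { const char *key; const char *val; uint32_t klen, vlen; };

/* Write entries as length-prefixed pairs, in key order. Because the
 * memtable is sorted, the data block is sorted by construction. */
int sstable_flush(FILE *f, const struct entry *es, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (fwrite(&es[i].klen, 4, 1, f) != 1) return -1;
        if (fwrite(es[i].key, 1, es[i].klen, f) != es[i].klen) return -1;
        if (fwrite(&es[i].vlen, 4, 1, f) != 1) return -1;
        if (fwrite(es[i].val, 1, es[i].vlen, f) != es[i].vlen) return -1;
    }
    return fflush(f);  /* the caller should also fsync(fileno(f)) before trusting the file */
}
```
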
Common Misconceptions
  • “SSTable is like a database table” -> it is just a sorted file.
  • “You can update SSTables” -> updates become new files.
Check-Your-Understanding Questions
  1. Why is the memtable sorted?
  2. What is the advantage of immutable SSTables?
  3. What happens when memtable fills up during heavy writes?
Where You’ll Apply It
  • See 3.5 for the SSTable file layout.
  • See 5.10 Phase 2 for the flush implementation.

2.3 Indexes and Bloom Filters

Description / Expanded Explanation

Indexes locate keys inside SSTables without scanning the entire file. Bloom filters quickly rule out tables that do not contain a key, reducing disk reads.

Definitions & Key Terms
  • index -> mapping from key to offset
  • bloom filter -> probabilistic membership structure
  • false positive -> bloom filter says “maybe” when key not present
  • false negative -> reporting “absent” for a present key; a correct bloom filter never produces one
Mental Model Diagram (ASCII)
lookup key -> bloom filter -> maybe? -> index -> offset -> value
Mental Model Diagram (Image): Bloom Filter

How It Works (Step-by-Step)
  1. Check bloom filter; if negative, skip SSTable.
  2. If positive, binary search index.
  3. Seek to offset in data file and read value.
Minimal Concrete Example
index: [a->0, b->32, z->64]
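
A self-contained bloom filter sketch in C using double hashing over an FNV-1a base; the bit count and hash count are illustrative, not tuned for a target false-positive rate:

```c
#include <stdint.h>
#include <string.h>

#define BLOOM_BITS   8192   /* a power of two, so masking replaces modulo */
#define BLOOM_HASHES 4

static uint64_t fnv1a(const char *s, size_t n, uint64_t seed) {
    uint64_t h = 1469598103934665603ULL ^ seed;
    for (size_t i = 0; i < n; i++) { h ^= (uint8_t)s[i]; h *= 1099511628211ULL; }
    return h;
}

void bloom_add(uint8_t *bits, const char *key, size_t n) {
    uint64_t h1 = fnv1a(key, n, 0), h2 = fnv1a(key, n, h1);
    for (int i = 0; i < BLOOM_HASHES; i++) {
        uint64_t b = (h1 + (uint64_t)i * h2) & (BLOOM_BITS - 1);
        bits[b >> 3] |= (uint8_t)(1 << (b & 7));
    }
}

/* Returns 0 = definitely absent (skip this SSTable), 1 = maybe present. */
int bloom_maybe(const uint8_t *bits, const char *key, size_t n) {
    uint64_t h1 = fnv1a(key, n, 0), h2 = fnv1a(key, n, h1);
    for (int i = 0; i < BLOOM_HASHES; i++) {
        uint64_t b = (h1 + (uint64_t)i * h2) & (BLOOM_BITS - 1);
        if (!(bits[b >> 3] & (1 << (b & 7)))) return 0;
    }
    return 1;
}
```
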
Common Misconceptions
  • “Bloom filters are exact” -> they can have false positives.
  • “Index must store every key” -> sparse index works with block offsets.
Check-Your-Understanding Questions
  1. Why are false positives acceptable for bloom filters?
  2. How do you choose bloom filter size?
  3. What is the trade-off of a sparse index?
Where You’ll Apply It
  • See 3.5 for index/bloom format.
  • See 4.4 for read path complexity.
  • Also used in: P01 Memory Allocator for metadata layout

2.4 Compaction and Tombstones

Description / Expanded Explanation

Compaction merges multiple SSTables and removes obsolete entries. Deletions are represented as tombstones that invalidate older values during compaction.

Definitions & Key Terms
  • compaction -> merge sorted SSTables
  • tombstone -> special marker for delete
  • read amplification -> number of files read for a query
  • write amplification -> extra writes caused by compaction
Mental Model Diagram (ASCII)
SSTable1: a=1, b=2
SSTable2: b=DEL, c=3
Compaction -> a=1, b=DEL, c=3
Mental Model Diagram (Image): Compaction

How It Works (Step-by-Step)
  1. Select SSTables to compact.
  2. Merge-sort entries by key.
  3. Drop older versions and expired tombstones.
  4. Write new SSTable and delete old ones.
Minimal Concrete Example
merge(a=1,b=2) + (b=DEL,c=3) -> a=1,b=DEL,c=3
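
A two-way merge sketch in C: both inputs are sorted by key and the newer table wins ties; the arrays stand in for real SSTable iterators:

```c
#include <stdio.h>
#include <string.h>

struct kv { const char *key; const char *val; };  /* val == NULL marks a tombstone */

static void emit(const struct kv *e) {
    printf("%s=%s\n", e->key, e->val ? e->val : "DEL");  /* real code writes a new SSTable */
}

void compact_merge(const struct kv *newer, size_t nn,
                   const struct kv *older, size_t no) {
    size_t i = 0, j = 0;
    while (i < nn || j < no) {
        int cmp;
        if (i >= nn)      cmp = 1;    /* only older entries remain */
        else if (j >= no) cmp = -1;   /* only newer entries remain */
        else              cmp = strcmp(newer[i].key, older[j].key);

        if (cmp < 0)      emit(&newer[i++]);
        else if (cmp > 0) emit(&older[j++]);
        else { emit(&newer[i++]); j++; }  /* equal keys: keep newer, drop older version */
    }
}
```

Note that tombstones are still emitted here; they may only be dropped when the merge output is the oldest level (see the misconceptions below).
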
Common Misconceptions
  • “Compaction is optional” -> without it reads slow down and disk fills.
  • “Tombstones can be removed immediately” -> a tombstone may only be dropped once no older table can still contain the key.
Check-Your-Understanding Questions
  1. Why does compaction increase write amplification?
  2. When is it safe to drop tombstones?
  3. How does compaction improve read performance?
Where You’ll Apply It
  • See 5.10 Phase 3 for compaction scheduling.
  • See 7.3 for performance traps.
  • Also used in: P04 Raft Consensus

2.5 Checksums and Crash Recovery

Description / Expanded Explanation

Checksums detect corruption in WAL and SSTables. Recovery replays logs and validates files. Without integrity checks, silent corruption can return wrong data.

Definitions & Key Terms
  • checksum -> hash for verifying data integrity
  • corruption -> data changed by disk or bugs
  • recovery -> process of restoring state after crash
  • snapshot -> compact representation of current state
Mental Model Diagram (ASCII)
record -> crc -> disk
read -> crc -> compare -> ok or error
Mental Model Diagram (Image): Checksums

How It Works (Step-by-Step)
  1. Compute checksum for each record.
  2. On read, recompute and compare.
  3. If mismatch, mark file corrupted and rebuild.
  4. On startup, replay WAL to rebuild memtable.
Minimal Concrete Example
crc32(record_bytes) -> 0xA1B2C3D4
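
The matching read-side check, sketched in C with zlib's crc32(); a short read at the tail is treated as a torn final record (replay stops cleanly), while a mismatch on a complete record signals corruption:

```c
#include <stdint.h>
#include <stdio.h>
#include <zlib.h>

/* Returns 1 = valid record, 0 = clean end of log (stop replay), -1 = corruption. */
int wal_read_record(FILE *f, uint8_t *payload, uint32_t cap, uint32_t *plen) {
    uint32_t crc;
    if (fread(plen, 4, 1, f) != 1) return 0;             /* EOF at a record boundary */
    if (fread(&crc, 4, 1, f) != 1) return 0;             /* torn header: treat as end */
    if (*plen > cap) return -1;                          /* implausible length */
    if (fread(payload, 1, *plen, f) != *plen) return 0;  /* torn tail: end of replay */
    return crc32(0L, payload, *plen) == crc ? 1 : -1;    /* mismatch = corruption */
}
```
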
Common Misconceptions
  • “Disks never corrupt” -> bit flips and partial writes happen.
  • “Checksums are too slow” -> they are cheap compared to I/O.
Check-Your-Understanding Questions
  1. Why is checksum validation needed after crash recovery?
  2. What should your system do when corruption is detected?
  3. Why do you need length prefixes in WAL records?
Where You’ll Apply It
  • See 3.5 for the checksum field in the WAL record.
  • See 7.1 for corruption-related pitfalls.

3. Project Specification

3.1 What You Will Build

A persistent key-value store that supports GET, PUT, and DELETE. It uses a WAL for durability, a memtable for fast writes, and immutable SSTables on disk. It provides a CLI and optional TCP server.

3.2 Functional Requirements

  1. PUT key value inserts or overwrites.
  2. GET key retrieves latest value or NOT_FOUND.
  3. DELETE key writes tombstone.
  4. Durability: data survives process crash.
  5. Recovery: WAL replay rebuilds state.
  6. Compaction: merges SSTables and removes obsolete data.
  7. Metrics: stats on read/write/compaction.

3.3 Non-Functional Requirements

  • Performance: sequential writes, avoid random writes.
  • Reliability: detect corruption with checksums.
  • Usability: deterministic demo dataset.

3.4 Example Usage / Output

$ kv put user:1 alice
OK
$ kv get user:1
alice
$ kv del user:1
OK
$ kv get user:1
NOT_FOUND

3.5 Data Formats / Schemas / Protocols

WAL record:

[len][crc32][op][key_len][key][val_len][value]

SSTable file:

[data blocks][index block][bloom filter][footer]
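
One way to make the footer concrete is a fixed-size struct at the end of the file; the exact fields and magic value here are assumptions for illustration:

```c
#include <stdint.h>

/* Written last, read first: a reader seeks to the end of the file,
 * loads the footer, and uses the offsets to locate the other blocks. */
struct sst_footer {
    uint64_t index_off;    /* file offset of the index block */
    uint64_t bloom_off;    /* file offset of the bloom filter */
    uint32_t num_entries;  /* entry count, useful for stats */
    uint32_t magic;        /* e.g. 0x53535431 ("SST1"), sanity check on open */
};
```
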

3.6 Edge Cases

  • Overwrite same key multiple times across tables.
  • Delete key that does not exist.
  • Crash during flush.
  • Corrupted WAL record.

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

make
./kv init ./data
./kv put user:1 alice
./kv get user:1

3.7.2 Golden Path Demo (Deterministic)

Use a fixed seed for compaction scheduling and deterministic record ordering.

3.7.3 CLI Transcript (Success)

$ ./kv init ./data
$ ./kv put user:1 alice
OK
$ ./kv put user:2 bob
OK
$ ./kv get user:1
alice
$ ./kv stats
{"memtable":2,"sstables":1,"wal_bytes":128}
$ echo $?
0

3.7.4 CLI Transcript (Failure)

$ ./kv get missing:key
NOT_FOUND
$ echo $?
1

3.7.5 API (If Implemented)

TCP protocol (optional):

PUT key value\n
GET key\n
DEL key\n

Error JSON for TCP mode:

{"error":"not_found","message":"key missing"}

3.7.6 Exit Codes

  • 0 success
  • 1 key not found
  • 2 corruption detected
  • 3 disk I/O error

4. Solution Architecture

4.1 High-Level Design

client -> API -> WAL -> memtable -> flush -> SSTable -> compaction

4.2 Key Components

| Component | Responsibility | Key Decisions |
|-----------|----------------|---------------|
| WAL | durable log | append-only with checksum |
| Memtable | in-memory sorted map | skiplist or tree |
| SSTables | on-disk sorted files | immutable, indexed |
| Compactor | merge files | size-tiered strategy |
| Recovery | replay WAL | validate checksums |

4.3 Data Structures (No Full Code)

struct wal_record {
    uint32_t len;  /* payload length: op + key_len + key + val_len + value */
    uint32_t crc;  /* crc32 over the payload that follows */
    uint8_t  op;   /* PUT or DELETE */
    /* followed on disk by: [key_len][key][val_len][value], per 3.5 */
};

4.4 Algorithm Overview

Write Path: append to WAL, fsync, insert into memtable, flush if full.

Read Path: check memtable, then SSTables newest to oldest, using bloom filters (sketched below).

Complexity Analysis

  • Writes: O(log n) in memtable
  • Reads: O(log n) with filters, worst-case O(k log n) across k tables
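
A read-path sketch in C tying these steps together; the db struct and all helpers are hypothetical stand-ins for your own components:

```c
#include <stddef.h>

struct memtable; struct sstable;
struct db { struct memtable *mem; struct sstable **tables; int ntables; };

/* Hypothetical helpers implemented elsewhere in the project: */
const char *memtable_get(struct memtable *, const char *key, size_t klen);
int         sstable_bloom_maybe(const struct sstable *, const char *key, size_t klen);
const char *sstable_get(struct sstable *, const char *key, size_t klen);

const char *kv_get(struct db *db, const char *key, size_t klen) {
    const char *v = memtable_get(db->mem, key, klen);
    if (v) return v;  /* may be a tombstone sentinel; caller maps it to NOT_FOUND */

    for (int i = db->ntables - 1; i >= 0; i--) {          /* newest table first */
        if (!sstable_bloom_maybe(db->tables[i], key, klen))
            continue;                                     /* definite miss: no disk I/O */
        v = sstable_get(db->tables[i], key, klen);        /* index seek + block read */
        if (v) return v;
    }
    return NULL;  /* NOT_FOUND */
}
```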

5. Implementation Guide

5.1 Development Environment Setup

make

5.2 Project Structure

project-root/
├── src/wal/
├── src/memtable/
├── src/sstable/
├── src/compaction/
└── tests/

5.3 The Core Question You’re Answering

“How can I store key-value data so it survives crashes and stays fast as it grows?”

5.4 Concepts You Must Understand First

  1. Append-only logs and fsync
  2. Sorted maps and range scans
  3. Immutable file formats
  4. Merge compaction

5.5 Questions to Guide Your Design

  1. What is the WAL flush policy?
  2. How large should memtables be before flush?
  3. How do you schedule compaction without blocking writes?

5.6 Thinking Exercise

Trace writes: PUT a, PUT b, DEL a, PUT a, then crash. What values should be recovered?

5.7 The Interview Questions They’ll Ask

  1. Why are LSM trees good for writes?
  2. Explain read amplification and write amplification.
  3. How does a bloom filter help?

5.8 Hints in Layers

Hint 1: Start with WAL + in-memory map.
Hint 2: Add flush to a single SSTable.
Hint 3: Add compaction later.

5.9 Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| Storage engines | Designing Data-Intensive Applications | Ch. 3 |
| File systems | Operating Systems: Three Easy Pieces | Ch. 40 |

5.10 Implementation Phases

Phase 1: Foundation (5-7 days)

WAL + memtable.

Phase 2: Core Functionality (7-10 days)

SSTables, indexing, bloom filters.

Phase 3: Polish and Edge Cases (5-7 days)

Compaction, crash recovery, metrics.

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
|----------|---------|----------------|-----------|
| Memtable | tree vs skiplist | skiplist | fast inserts |
| Compaction | size-tiered vs leveled | size-tiered | simpler |
| WAL fsync | per write vs batching | batching | performance |


6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
|----------|---------|----------|
| Unit Tests | WAL parsing | checksum validation |
| Integration Tests | recovery after crash | kill process mid-write |
| Load Tests | write throughput | 1M inserts |

6.2 Critical Test Cases

  1. Crash after WAL write but before memtable insert -> recovery replays.
  2. Tombstone hides older value after compaction.
  3. Bloom filter avoids unnecessary file reads.

6.3 Test Data

keys: user:1..user:1000
values: random strings

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution |
|---------|---------|----------|
| Missing fsync | data loss on crash | fsync the WAL before acknowledging |
| No compaction | disk grows forever | background compaction |
| Corrupt index | wrong reads | validate checksums |

7.2 Debugging Strategies

  • Dump SSTable contents to inspect ordering (see the sketch below).
  • Use deterministic seeds for compaction tests.
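
For the first bullet, a throwaway dump tool is enough; this sketch assumes the length-prefixed data block layout from the flush sketch in 2.2:

```c
#include <stdint.h>
#include <stdio.h>

/* Print the keys of an SSTable data block in file order; if the file is
 * correct, the output is sorted. */
int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <sstable>\n", argv[0]); return 1; }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 3; }  /* 3 = disk I/O error, per 3.7.6 */

    uint32_t klen, vlen;
    char key[1024];
    while (fread(&klen, 4, 1, f) == 1 && klen < sizeof(key)) {
        if (fread(key, 1, klen, f) != klen) break;
        if (fread(&vlen, 4, 1, f) != 1) break;
        fseek(f, vlen, SEEK_CUR);            /* skip the value bytes */
        printf("%.*s\n", (int)klen, key);
    }
    fclose(f);
    return 0;
}
```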

7.3 Performance Traps

  • Large bloom filters cause memory bloat.
  • Overly frequent compaction increases write amplification.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add range scan support.
  • Add simple metrics.

8.2 Intermediate Extensions

  • Add background compaction thread.
  • Add snapshots.

8.3 Advanced Extensions

  • Add distributed replication.
  • Add MVCC for consistent reads.

9. Real-World Connections

9.1 Industry Applications

  • LSM-based databases (LevelDB, RocksDB).
  • Log-structured storage systems.

9.2 Open-Source Examples

  • LevelDB
  • RocksDB
  • Badger

9.3 Interview Relevance

  • Demonstrates storage engine fundamentals and durability.

10. Resources

10.1 Essential Reading

  • Designing Data-Intensive Applications (Chapter 3)
  • The Log-Structured Merge-Tree paper

10.2 Video Resources

  • Storage engine lectures

10.3 Tools & Documentation

  • fio, iostat, perf

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain why WAL comes before memtable.
  • I can describe compaction trade-offs.
  • I can explain bloom filter false positives.

11.2 Implementation

  • Recovery works after crash.
  • Compaction removes obsolete values.
  • Checksums detect corruption.

11.3 Growth

  • I can compare this to LevelDB in an interview.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • PUT/GET/DEL work and survive restarts.
  • WAL is written before memtable changes.

Full Completion:

  • SSTables and compaction implemented.
  • Metrics and stats available.

Excellence (Going Above & Beyond):

  • Range scans and MVCC.
  • Background compaction tuning.