Project 3: Persistent Key-Value Store
Build a durable key-value store with a write-ahead log, memtable, and on-disk SSTables.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Advanced |
| Time Estimate | 2-4 weeks |
| Main Programming Language | C or Rust |
| Alternative Programming Languages | Go, C++ |
| Coolness Level | Level 4: Hardcore Tech Flex |
| Business Potential | Storage and data services |
| Prerequisites | File I/O, data structures, hashing |
| Key Topics | WAL, SSTables, compaction, indexing, checksums |
1. Learning Objectives
By completing this project, you will:
- Implement a durable key-value store with crash recovery.
- Explain log-structured storage and why it improves write throughput.
- Build immutable on-disk tables with indexes and bloom filters.
- Implement compaction and tombstones for deletes.
- Analyze read/write amplification and trade-offs.
- Provide a CLI or TCP API to exercise the store.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Write-Ahead Log (WAL)
Description / Expanded Explanation
A WAL records every mutation before it is applied to in-memory structures. If the process crashes, the log can be replayed to restore state. This ensures durability and atomicity.
Definitions & Key Terms
- WAL -> append-only log of operations
- durability -> data survives crashes
- replay -> reapply log entries to rebuild state
- fsync -> force data to disk
Mental Model Diagram (ASCII)
PUT k1=v1
append to WAL -> fsync -> apply to memtable
How It Works (Step-by-Step)
- Client issues PUT/DELETE.
- Append record to WAL with checksum.
- Flush to disk (fsync).
- Apply mutation to memtable.
- On crash, replay WAL to rebuild memtable.
Minimal Concrete Example
WAL record: [len][crc32][op=PUT][key_len][key][val_len][value]
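Below is a minimal C sketch of the append-then-fsync sequence for a PUT, assuming a POSIX file descriptor and an external crc32_of() helper (the helper name is illustrative); the memtable should only be updated after fsync succeeds.

```c
#include <stdint.h>
#include <string.h>
#include <unistd.h>

extern uint32_t crc32_of(const void *data, size_t n);   /* assumed checksum helper */

/* Append one PUT record as [len][crc32][op][key_len][key][val_len][value],
 * then fsync so the record is durable before the memtable is touched. */
int wal_append_put(int fd, const char *key, const char *val) {
    uint32_t klen = (uint32_t)strlen(key), vlen = (uint32_t)strlen(val);
    uint8_t  op   = 1;                       /* 1 = PUT, 2 = DELETE */
    uint8_t  buf[4096];
    size_t   n = 0;

    if (1 + 4 + (size_t)klen + 4 + (size_t)vlen > sizeof(buf)) return -1;

    /* serialize the payload: [op][key_len][key][val_len][value] */
    buf[n++] = op;
    memcpy(buf + n, &klen, 4); n += 4;
    memcpy(buf + n, key, klen); n += klen;
    memcpy(buf + n, &vlen, 4); n += 4;
    memcpy(buf + n, val, vlen); n += vlen;

    uint32_t crc = crc32_of(buf, n);
    uint32_t len = (uint32_t)n;

    /* header [len][crc32] followed by the payload, then force to disk */
    if (write(fd, &len, 4) != 4)           return -1;
    if (write(fd, &crc, 4) != 4)           return -1;
    if (write(fd, buf, n) != (ssize_t)n)   return -1;
    return fsync(fd);                      /* durable only once fsync succeeds */
}
```

Group commit is a natural refinement: buffer several records and call fsync once per batch, trading a little latency for much higher throughput.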
Common Misconceptions
- “WAL is optional” -> without it you lose recent writes on crash.
- “fsync every write is too slow” -> batching and group commit reduce cost.
Check-Your-Understanding Questions
- Why must the WAL be written before updating the memtable?
- What happens if fsync is skipped?
- How can you batch fsync calls?
Where You’ll Apply It
- See 3.5 for WAL record format.
- See 5.10 Phase 1 for WAL implementation.
- Also used in: P04 Raft Consensus
2.2 Memtable and SSTables
Description / Expanded Explanation
The memtable is an in-memory sorted structure. When it reaches a size threshold, it is flushed to disk as an immutable SSTable. This design keeps writes fast and sequential.
Definitions & Key Terms
- memtable -> in-memory sorted map
- SSTable -> sorted string table stored on disk
- flush -> write memtable to SSTable
- immutable -> SSTables are never modified in-place
Mental Model Diagram (ASCII)
writes -> memtable -> flush -> SSTable file
How It Works (Step-by-Step)
- Insert into memtable (balanced tree or skiplist).
- When memtable is full, freeze it.
- Write sorted entries to SSTable file.
- Create index and bloom filter for the SSTable.
Minimal Concrete Example
memtable: {a:1, b:2, z:9}
SSTable data: a=1, b=2, z=9
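A minimal flush sketch in C, assuming the frozen memtable has already been exported as a sorted array (the kv_entry struct and flush_to_sstable name are illustrative, not a required API):

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct kv_entry { const char *key; const char *val; };

/* Write a frozen, sorted memtable out as the data-block section of a new
 * SSTable. Index and bloom-filter construction are sketched as comments. */
int flush_to_sstable(const char *path, const struct kv_entry *sorted, size_t n) {
    FILE *f = fopen(path, "wb");
    if (!f) return -1;

    for (size_t i = 0; i < n; i++) {
        long offset = ftell(f);             /* offset of this entry, for the index block */
        uint32_t klen = (uint32_t)strlen(sorted[i].key);
        uint32_t vlen = (uint32_t)strlen(sorted[i].val);

        fwrite(&klen, 4, 1, f);
        fwrite(sorted[i].key, 1, klen, f);
        fwrite(&vlen, 4, 1, f);
        fwrite(sorted[i].val, 1, vlen, f);

        /* here a real flush would record (key, offset) in the index
           and add the key to the table's bloom filter */
        (void)offset;
    }
    /* index block, bloom filter, and footer would be appended after the data */
    return fclose(f);
}
```

Because entries are written in key order, the index and bloom filter can be built in the same pass and appended after the data blocks.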
Common Misconceptions
- “SSTable is like a database table” -> it is just a sorted file.
- “You can update SSTables” -> updates become new files.
Check-Your-Understanding Questions
- Why is the memtable sorted?
- What is the advantage of immutable SSTables?
- What happens when memtable fills up during heavy writes?
Where You’ll Apply It
- See 4.4 for read path algorithms.
- See 5.10 Phase 2 for flush and file layout.
- Also used in: P10 Mini Git Object Store
2.3 Indexes and Bloom Filters
Description / Expanded Explanation
Indexes locate keys inside SSTables without scanning the entire file. Bloom filters quickly rule out tables that do not contain a key, reducing disk reads.
Definitions & Key Terms
- index -> mapping from key to offset
- bloom filter -> probabilistic membership structure
- false positive -> bloom filter says “maybe” when key not present
- false negative -> impossible for a correct bloom filter; it never reports “absent” for a key that is present
Mental Model Diagram (ASCII)
lookup key -> bloom filter -> maybe? -> index -> offset -> value
How It Works (Step-by-Step)
- Check bloom filter; if negative, skip SSTable.
- If positive, binary search index.
- Seek to offset in data file and read value.
Minimal Concrete Example
index: [a->0, b->32, z->64]
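A small bloom filter sketch in C using double hashing; the bit-array size, hash functions, and constants are illustrative. A common rule of thumb is roughly 10 bits per key with about 7 probes for a ~1% false-positive rate.

```c
#include <stdint.h>

#define BLOOM_BITS 8192u   /* caller supplies a zeroed BLOOM_BITS/8 = 1024-byte array */
#define BLOOM_K    7u      /* number of probe positions per key */

/* 64-bit FNV-1a, seeded so two independent hash streams can be derived. */
static uint64_t fnv1a(const char *s, uint64_t seed) {
    uint64_t h = 0xcbf29ce484222325ULL ^ seed;
    while (*s) { h ^= (uint8_t)*s++; h *= 0x100000001b3ULL; }
    return h;
}

void bloom_add(uint8_t *bits, const char *key) {
    uint64_t h1 = fnv1a(key, 0), h2 = fnv1a(key, 0x9e3779b97f4a7c15ULL);
    for (uint32_t i = 0; i < BLOOM_K; i++) {
        uint64_t bit = (h1 + i * h2) % BLOOM_BITS;
        bits[bit / 8] |= (uint8_t)(1u << (bit % 8));
    }
}

int bloom_maybe_contains(const uint8_t *bits, const char *key) {
    uint64_t h1 = fnv1a(key, 0), h2 = fnv1a(key, 0x9e3779b97f4a7c15ULL);
    for (uint32_t i = 0; i < BLOOM_K; i++) {
        uint64_t bit = (h1 + i * h2) % BLOOM_BITS;
        if (!(bits[bit / 8] & (1u << (bit % 8)))) return 0;   /* definitely absent */
    }
    return 1;                                                 /* maybe present */
}
```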
Common Misconceptions
- “Bloom filters are exact” -> they can have false positives.
- “Index must store every key” -> sparse index works with block offsets.
Check-Your-Understanding Questions
- Why are false positives acceptable for bloom filters?
- How do you choose bloom filter size?
- What is the trade-off of a sparse index?
Where You’ll Apply It
- See 3.5 for index/bloom format.
- See 4.4 for read path complexity.
- Also used in: P01 Memory Allocator for metadata layout
2.4 Compaction and Tombstones
Description / Expanded Explanation
Compaction merges multiple SSTables and removes obsolete entries. Deletions are represented as tombstones that invalidate older values during compaction.
Definitions & Key Terms
- compaction -> merge sorted SSTables
- tombstone -> special marker for delete
- read amplification -> number of files read for a query
- write amplification -> extra writes caused by compaction
Mental Model Diagram (ASCII)
SSTable1: a=1, b=2
SSTable2: b=DEL, c=3
Compaction -> a=1, b=DEL, c=3
How It Works (Step-by-Step)
- Select SSTables to compact.
- Merge-sort entries by key.
- Drop older versions and expired tombstones.
- Write new SSTable and delete old ones.
Minimal Concrete Example
merge(a=1,b=2) + (b=DEL,c=3) -> a=1,b=DEL,c=3
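A merge sketch in C for compacting two sorted runs, where the newer run shadows the older one on equal keys. The kv_entry struct and emit callback are illustrative names; tombstones are preserved here because dropping them is only safe under the condition noted in the misconceptions below.

```c
#include <stddef.h>
#include <string.h>

struct kv_entry { const char *key; const char *val; int is_tombstone; };

/* Merge two sorted runs into one output stream via the emit callback.
 * On equal keys the entry from 'newer' wins and the older version is dropped;
 * tombstones are emitted unchanged. */
void compact_merge(const struct kv_entry *newer, size_t n_new,
                   const struct kv_entry *older, size_t n_old,
                   void (*emit)(const struct kv_entry *)) {
    size_t i = 0, j = 0;
    while (i < n_new || j < n_old) {
        if (j >= n_old) { emit(&newer[i++]); continue; }
        if (i >= n_new) { emit(&older[j++]); continue; }

        int cmp = strcmp(newer[i].key, older[j].key);
        if (cmp < 0)      emit(&newer[i++]);
        else if (cmp > 0) emit(&older[j++]);
        else {            /* same key: newest version wins */
            emit(&newer[i++]);
            j++;          /* obsolete older version is discarded */
        }
    }
}
```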
Common Misconceptions
- “Compaction is optional” -> without it reads slow down and disk fills.
- “Tombstones can be removed immediately” -> a tombstone can only be dropped once no older SSTable can still contain the deleted key.
Check-Your-Understanding Questions
- Why does compaction increase write amplification?
- When is it safe to drop tombstones?
- How does compaction improve read performance?
Where You’ll Apply It
- See 5.10 Phase 3 for compaction scheduling.
- See 7.3 for performance traps.
- Also used in: P04 Raft Consensus
2.5 Checksums and Crash Recovery
Description / Expanded Explanation
Checksums detect corruption in WAL and SSTables. Recovery replays logs and validates files. Without integrity checks, silent corruption can return wrong data.
Definitions & Key Terms
- checksum -> hash for verifying data integrity
- corruption -> data changed by disk or bugs
- recovery -> process of restoring state after crash
- snapshot -> compact representation of current state
Mental Model Diagram (ASCII)
record -> crc -> disk
read -> crc -> compare -> ok or error
How It Works (Step-by-Step)
- Compute checksum for each record.
- On read, recompute and compare.
- If mismatch, mark file corrupted and rebuild.
- On startup, replay WAL to rebuild memtable.
Minimal Concrete Example
crc32(record_bytes) -> 0xA1B2C3D4
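A verification sketch in C: a compact bit-at-a-time CRC32 (the reflected polynomial 0xEDB88320, as used by zlib) plus a check that the stored checksum matches a recomputation over the payload bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-at-a-time CRC32 (reflected polynomial 0xEDB88320); a table-driven
 * version is faster, but this is enough to verify records. */
uint32_t crc32_of(const void *data, size_t n) {
    const uint8_t *p = (const uint8_t *)data;
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < n; i++) {
        crc ^= p[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Returns 1 if the stored checksum matches the payload, 0 if the record is corrupt. */
int record_is_valid(const uint8_t *payload, size_t len, uint32_t stored_crc) {
    return crc32_of(payload, len) == stored_crc;
}
```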
Common Misconceptions
- “Disks never corrupt” -> bit flips and partial writes happen.
- “Checksums are too slow” -> they are cheap compared to I/O.
Check-Your-Understanding Questions
- Why is checksum validation needed after crash recovery?
- What should your system do when corruption is detected?
- Why do you need length prefixes in WAL records?
Where You’ll Apply It
- See 3.5 for WAL and SSTable record format.
- See 6.2 for corruption tests.
- Also used in: P10 Mini Git Object Store
3. Project Specification
3.1 What You Will Build
A persistent key-value store that supports GET, PUT, and DELETE. It uses a WAL for durability, a memtable for fast writes, and immutable SSTables on disk. It provides a CLI and optional TCP server.
3.2 Functional Requirements
- PUT key value inserts or overwrites.
- GET key retrieves latest value or NOT_FOUND.
- DELETE key writes tombstone.
- Durability: data survives process crash.
- Recovery: WAL replay rebuilds state.
- Compaction: merges SSTables and removes obsolete data.
- Metrics: stats on read/write/compaction.
3.3 Non-Functional Requirements
- Performance: sequential writes, avoid random writes.
- Reliability: detect corruption with checksums.
- Usability: deterministic demo dataset.
3.4 Example Usage / Output
$ kv put user:1 alice
OK
$ kv get user:1
alice
$ kv del user:1
OK
$ kv get user:1
NOT_FOUND
3.5 Data Formats / Schemas / Protocols
WAL record:
[len][crc32][op][key_len][key][val_len][value]
SSTable file:
[data blocks][index block][bloom filter][footer]
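One way to make the index and bloom filter findable is a fixed-size footer written last, so a reader can seek to (file size - footer size) and work backwards. The layout below is an illustrative sketch, not a required format.

```c
#include <stdint.h>

#define SSTABLE_MAGIC 0x53535431u   /* "SST1", an arbitrary format marker */

/* Written at the very end of the file; a reader seeks to
 * (file_size - sizeof(struct sstable_footer)) to find it. */
struct sstable_footer {
    uint64_t index_offset;   /* byte offset of the index block */
    uint64_t index_len;      /* length of the index block in bytes */
    uint64_t bloom_offset;   /* byte offset of the bloom filter */
    uint64_t bloom_len;      /* length of the bloom filter in bytes */
    uint32_t footer_crc;     /* checksum over the fields above */
    uint32_t magic;          /* sanity check that this is an SSTable */
};
```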
3.6 Edge Cases
- Overwrite same key multiple times across tables.
- Delete key that does not exist.
- Crash during flush.
- Corrupted WAL record.
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
make
./kv init ./data
./kv put user:1 alice
./kv get user:1
3.7.2 Golden Path Demo (Deterministic)
Use a fixed seed for compaction scheduling and deterministic record ordering.
3.7.3 CLI Transcript (Success)
$ ./kv init ./data
$ ./kv put user:1 alice
OK
$ ./kv put user:2 bob
OK
$ ./kv get user:1
alice
$ ./kv stats
{"memtable":2,"sstables":1,"wal_bytes":128}
$ echo $?
0
3.7.4 CLI Transcript (Failure)
$ ./kv get missing:key
NOT_FOUND
$ echo $?
1
3.7.5 API (If Implemented)
TCP protocol (optional):
PUT key value\n
GET key\n
DEL key\n
Error JSON for TCP mode:
{"error":"not_found","message":"key missing"}
3.7.6 Exit Codes
| Exit Code | Meaning |
|---|---|
| 0 | success |
| 1 | key not found |
| 2 | corruption detected |
| 3 | disk I/O error |
4. Solution Architecture
4.1 High-Level Design
client -> API -> WAL -> memtable -> flush -> SSTable -> compaction
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| WAL | durable log | append-only with checksum |
| Memtable | in-memory sorted map | skiplist or tree |
| SSTables | on-disk sorted files | immutable, indexed |
| Compactor | merge files | size-tiered strategy |
| Recovery | replay WAL | validate checksums |
4.3 Data Structures (No Full Code)
struct wal_record {
    uint32_t len;   /* total payload length in bytes */
    uint32_t crc;   /* CRC32 over the payload */
    uint8_t  op;    /* PUT or DELETE */
    /* payload follows on disk: [key_len][key][val_len][value] */
};
4.4 Algorithm Overview
Write Path: append to the WAL, fsync, insert into the memtable, flush when the memtable reaches its size threshold.
Read Path: check the memtable first, then search SSTables from newest to oldest, consulting bloom filters to skip files that cannot contain the key (a sketch follows the complexity analysis).
Complexity Analysis
- Writes: O(log n) in memtable
- Reads: O(log n) with filters, worst-case O(k log n) across k tables
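A read-path sketch in C under assumed interfaces: memtable_get, bloom_maybe_contains, sstable_get, and db_tables_newest_first are illustrative names standing in for your own routines. Scanning newest to oldest guarantees the most recent version (or tombstone) wins.

```c
#include <stddef.h>

struct db;        /* opaque handles; real definitions live in your store */
struct sstable;

extern int memtable_get(struct db *db, const char *key, const char **val_out);
extern int bloom_maybe_contains(const struct sstable *t, const char *key);
extern int sstable_get(const struct sstable *t, const char *key, const char **val_out);
extern struct sstable **db_tables_newest_first(struct db *db, size_t *count);

/* Returns 1 and sets *val_out on a hit, 0 for NOT_FOUND. A complete version
 * would also stop at the first tombstone it encounters and report NOT_FOUND. */
int kv_get(struct db *db, const char *key, const char **val_out) {
    if (memtable_get(db, key, val_out)) return 1;     /* freshest data first */

    size_t count = 0;
    struct sstable **tables = db_tables_newest_first(db, &count);
    for (size_t i = 0; i < count; i++) {
        if (!bloom_maybe_contains(tables[i], key))
            continue;                                 /* definitely not in this table */
        if (sstable_get(tables[i], key, val_out))
            return 1;                                 /* newest match wins */
    }
    return 0;
}
```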
5. Implementation Guide
5.1 Development Environment Setup
make
5.2 Project Structure
project-root/
├── src/wal/
├── src/memtable/
├── src/sstable/
├── src/compaction/
└── tests/
5.3 The Core Question You’re Answering
“How can I store key-value data so it survives crashes and stays fast as it grows?”
5.4 Concepts You Must Understand First
- Append-only logs and fsync
- Sorted maps and range scans
- Immutable file formats
- Merge compaction
5.5 Questions to Guide Your Design
- What is the WAL flush policy?
- How large should memtables be before flush?
- How do you schedule compaction without blocking writes?
5.6 Thinking Exercise
Trace writes: PUT a, PUT b, DEL a, PUT a, then crash. What values should be recovered?
5.7 The Interview Questions They’ll Ask
- Why are LSM trees good for writes?
- Explain read amplification and write amplification.
- How does a bloom filter help?
5.8 Hints in Layers
- Hint 1: Start with WAL + in-memory map.
- Hint 2: Add flush to a single SSTable.
- Hint 3: Add compaction later.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Storage engines | Designing Data-Intensive Applications | Ch. 3 |
| File systems | Operating Systems: Three Easy Pieces | Ch. 40 |
5.10 Implementation Phases
Phase 1: Foundation (5-7 days)
WAL + memtable.
Phase 2: Core Functionality (7-10 days)
SSTables, indexing, bloom filters.
Phase 3: Polish and Edge Cases (5-7 days)
Compaction, crash recovery, metrics.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Memtable | tree vs skiplist | skiplist | fast inserts |
| Compaction | size-tiered vs leveled | size-tiered | simpler |
| WAL fsync | per write vs batching | batching | performance |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | WAL parsing | checksum validation |
| Integration Tests | recovery after crash | kill process mid-write |
| Load Tests | write throughput | 1M inserts |
6.2 Critical Test Cases
- Crash after WAL write but before memtable insert -> recovery replays.
- Tombstone hides older value after compaction.
- Bloom filter avoids unnecessary file reads.
6.3 Test Data
keys: user:1..user:1000
values: random strings
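A sketch of deterministic test-data generation in C, using a fixed-seed xorshift PRNG so load and recovery runs are reproducible; the generator and value format are illustrative choices.

```c
#include <stdint.h>
#include <stdio.h>

static uint64_t rng_state = 42;          /* fixed seed keeps runs reproducible */

/* xorshift64: tiny deterministic PRNG (state must be nonzero). */
static uint64_t xorshift64(void) {
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 7;
    rng_state ^= rng_state << 17;
    return rng_state;
}

/* Fill key/value buffers for test entry i, e.g. key "user:42", value "val-...". */
void make_test_pair(int i, char *key, size_t key_sz, char *val, size_t val_sz) {
    snprintf(key, key_sz, "user:%d", i);
    snprintf(val, val_sz, "val-%016llx", (unsigned long long)xorshift64());
}
```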
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Missing fsync | data loss on crash | fsync WAL |
| No compaction | disk grows forever | background compaction |
| Corrupt index | wrong reads | validate checksum |
7.2 Debugging Strategies
- Dump SSTable contents to inspect ordering.
- Use deterministic seeds for compaction tests.
7.3 Performance Traps
- Large bloom filters cause memory bloat.
- Overly frequent compaction increases write amplification.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add range scan support.
- Add simple metrics.
8.2 Intermediate Extensions
- Add background compaction thread.
- Add snapshots.
8.3 Advanced Extensions
- Add distributed replication.
- Add MVCC for consistent reads.
9. Real-World Connections
9.1 Industry Applications
- LSM-based databases (LevelDB, RocksDB).
- Log-structured storage systems.
9.2 Related Open Source Projects
- LevelDB
- RocksDB
- Badger
9.3 Interview Relevance
- Demonstrates storage engine fundamentals and durability.
10. Resources
10.1 Essential Reading
- Designing Data-Intensive Applications (Chapter 3)
- The Log-Structured Merge-Tree paper
10.2 Video Resources
- Storage engine lectures
10.3 Tools & Documentation
- fio, iostat, perf
10.4 Related Projects in This Series
- P01 Memory Allocator
- P04 Raft Consensus
- P10 Mini Git Object Store
11. Self-Assessment Checklist
11.1 Understanding
- I can explain why WAL comes before memtable.
- I can describe compaction trade-offs.
- I can explain bloom filter false positives.
11.2 Implementation
- Recovery works after crash.
- Compaction removes obsolete values.
- Checksums detect corruption.
11.3 Growth
- I can compare this to LevelDB in an interview.
12. Submission / Completion Criteria
Minimum Viable Completion:
- PUT/GET/DEL work and survive restarts.
- WAL is written before memtable changes.
Full Completion:
- SSTables and compaction implemented.
- Metrics and stats available.
Excellence (Going Above & Beyond):
- Range scans and MVCC.
- Background compaction tuning.