Project 3: Persistent Key-Value Store

Build a durable key-value store with a write-ahead log, memtable, and on-disk SSTables.

Quick Reference

| Attribute | Value |
|-----------|-------|
| Difficulty | Advanced |
| Time Estimate | 2-4 weeks |
| Main Programming Language | C or Rust |
| Alternative Programming Languages | Go, C++ |
| Coolness Level | Level 4: Hardcore Tech Flex |
| Business Potential | Storage and data services |
| Prerequisites | File I/O, data structures, hashing |
| Key Topics | WAL, SSTables, compaction, indexing, checksums |

1. Learning Objectives

By completing this project, you will:

  1. Implement a durable key-value store with crash recovery.
  2. Explain log-structured storage and why it improves write throughput.
  3. Build immutable on-disk tables with indexes and bloom filters.
  4. Implement compaction and tombstones for deletes.
  5. Analyze read/write amplification and trade-offs.
  6. Provide a CLI or TCP API to exercise the store.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Write-Ahead Log (WAL)

Description / Expanded Explanation

A WAL records every mutation before it is applied to in-memory structures. If the process crashes, the log can be replayed to restore state. This ensures durability and atomicity.

Definitions & Key Terms
  • WAL -> append-only log of operations
  • durability -> data survives crashes
  • replay -> reapply log entries to rebuild state
  • fsync -> force data to disk
Mental Model Diagram (ASCII)
PUT k1=v1
append to WAL -> fsync -> apply to memtable
Mental Model Diagram (Image): WAL Flow

How It Works (Step-by-Step)
  1. Client issues PUT/DELETE.
  2. Append record to WAL with checksum.
  3. Flush to disk (fsync).
  4. Apply mutation to memtable.
  5. On crash, replay WAL to rebuild memtable.
Minimal Concrete Example
WAL record: [len][crc][op][key_len][key][val_len][value]
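
A minimal append-path sketch in C, assuming zlib's crc32() for checksums; wal_append, OP_PUT, and OP_DEL are illustrative names, not a prescribed API:

```c
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <zlib.h>   /* crc32() */

enum { OP_PUT = 1, OP_DEL = 2 };

/* Serialize one mutation, checksum it, append it, and fsync. Only after
 * this returns 0 may the memtable be updated. */
int wal_append(int fd, uint8_t op, const void *key, uint32_t klen,
               const void *val, uint32_t vlen) {
    uint8_t payload[4096];
    uint32_t plen = 1 + 4 + klen + 4 + vlen;  /* op + key_len + key + val_len + value */
    if (plen > sizeof(payload)) return -1;

    uint8_t *p = payload;
    *p++ = op;
    memcpy(p, &klen, 4); p += 4;
    memcpy(p, key, klen); p += klen;
    memcpy(p, &vlen, 4); p += 4;
    memcpy(p, val, vlen);

    uint32_t crc = crc32(0L, payload, plen);

    /* Length first, so replay knows how many payload bytes to expect. */
    if (write(fd, &plen, 4) != 4) return -1;
    if (write(fd, &crc, 4) != 4) return -1;
    if (write(fd, payload, plen) != (ssize_t)plen) return -1;

    return fsync(fd);  /* durability point */
}
```

Batching several records before a single fsync (group commit) keeps the same layout; only the flush policy changes.
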
Common Misconceptions
  • “WAL is optional” -> without it you lose recent writes on crash.
  • “fsync every write is too slow” -> batching and group commit reduce cost.
Check-Your-Understanding Questions
  1. Why must the WAL be written before updating the memtable?
  2. What happens if fsync is skipped?
  3. How can you batch fsync calls?
Where You’ll Apply It
  • See 3.5 for WAL record format.
  • See 5.10 Phase 1 for WAL implementation.
  • Also used in: P04 Raft Consensus

2.2 Memtable and SSTables

Description / Expanded Explanation

The memtable is an in-memory sorted structure. When it reaches a size threshold, it is flushed to disk as an immutable SSTable. This design keeps writes fast and sequential.

Definitions & Key Terms
  • memtable -> in-memory sorted map
  • SSTable -> sorted string table stored on disk
  • flush -> write memtable to SSTable
  • immutable -> SSTables are never modified in-place
Mental Model Diagram (ASCII)
writes -> memtable -> flush -> SSTable file
Mental Model Diagram (Image): Memtable and SSTable

How It Works (Step-by-Step)
  1. Insert into memtable (balanced tree or skiplist).
  2. When memtable is full, freeze it.
  3. Write sorted entries to SSTable file.
  4. Create index and bloom filter for the SSTable.
Minimal Concrete Example
memtable: {a:1, b:2, z:9}
SSTable data: a=1, b=2, z=9
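
A minimal flush sketch in C, assuming the frozen memtable has already been drained into a sorted array; struct entry and sstable_flush are illustrative names:

```c
#include <stdint.h>
#include <stdio.h>

struct entry { const char *key; const char *val; uint32_t klen, vlen; };

/* Write entries as length-prefixed pairs, in key order. Because the
 * memtable is sorted, the data block is sorted by construction. */
int sstable_flush(FILE *f, const struct entry *es, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (fwrite(&es[i].klen, 4, 1, f) != 1) return -1;
        if (fwrite(es[i].key, 1, es[i].klen, f) != es[i].klen) return -1;
        if (fwrite(&es[i].vlen, 4, 1, f) != 1) return -1;
        if (fwrite(es[i].val, 1, es[i].vlen, f) != es[i].vlen) return -1;
    }
    return fflush(f);  /* the caller should also fsync(fileno(f)) before trusting the file */
}
```
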
Common Misconceptions
  • “SSTable is like a database table” -> it is just a sorted file.
  • “You can update SSTables” -> updates become new files.
Check-Your-Understanding Questions
  1. Why is the memtable sorted?
  2. What is the advantage of immutable SSTables?
  3. What happens when memtable fills up during heavy writes?
Where You’ll Apply It
  • See 3.5 for the SSTable file layout.
  • See 5.10 Phase 2 for the flush implementation.

2.3 Indexes and Bloom Filters

Description / Expanded Explanation

Indexes locate keys inside SSTables without scanning the entire file. Bloom filters quickly rule out tables that do not contain a key, reducing disk reads.

Definitions & Key Terms
  • index -> mapping from key to offset
  • bloom filter -> probabilistic membership structure
  • false positive -> bloom filter says “maybe” when key not present
  • false negative -> reporting “absent” for a present key; a correct bloom filter never produces one
Mental Model Diagram (ASCII)
lookup key -> bloom filter -> maybe? -> index -> offset -> value
Mental Model Diagram (Image): Bloom Filter

How It Works (Step-by-Step)
  1. Check bloom filter; if negative, skip SSTable.
  2. If positive, binary search index.
  3. Seek to offset in data file and read value.
Minimal Concrete Example
index: [a->0, b->32, z->64]
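
A self-contained bloom filter sketch in C using double hashing over an FNV-1a base; the bit count and hash count are illustrative, not tuned for a target false-positive rate:

```c
#include <stdint.h>
#include <string.h>

#define BLOOM_BITS   8192   /* a power of two, so masking replaces modulo */
#define BLOOM_HASHES 4

static uint64_t fnv1a(const char *s, size_t n, uint64_t seed) {
    uint64_t h = 1469598103934665603ULL ^ seed;
    for (size_t i = 0; i < n; i++) { h ^= (uint8_t)s[i]; h *= 1099511628211ULL; }
    return h;
}

void bloom_add(uint8_t *bits, const char *key, size_t n) {
    uint64_t h1 = fnv1a(key, n, 0), h2 = fnv1a(key, n, h1);
    for (int i = 0; i < BLOOM_HASHES; i++) {
        uint64_t b = (h1 + (uint64_t)i * h2) & (BLOOM_BITS - 1);
        bits[b >> 3] |= (uint8_t)(1 << (b & 7));
    }
}

/* Returns 0 = definitely absent (skip this SSTable), 1 = maybe present. */
int bloom_maybe(const uint8_t *bits, const char *key, size_t n) {
    uint64_t h1 = fnv1a(key, n, 0), h2 = fnv1a(key, n, h1);
    for (int i = 0; i < BLOOM_HASHES; i++) {
        uint64_t b = (h1 + (uint64_t)i * h2) & (BLOOM_BITS - 1);
        if (!(bits[b >> 3] & (1 << (b & 7)))) return 0;
    }
    return 1;
}
```
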
Common Misconceptions
  • “Bloom filters are exact” -> they can have false positives.
  • “Index must store every key” -> sparse index works with block offsets.
Check-Your-Understanding Questions
  1. Why are false positives acceptable for bloom filters?
  2. How do you choose bloom filter size?
  3. What is the trade-off of a sparse index?
Where You’ll Apply It
  • See 3.5 for index/bloom format.
  • See 4.4 for read path complexity.
  • Also used in: P01 Memory Allocator for metadata layout

2.4 Compaction and Tombstones

Description / Expanded Explanation

Compaction merges multiple SSTables and removes obsolete entries. Deletions are represented as tombstones that invalidate older values during compaction.

Definitions & Key Terms
  • compaction -> merge sorted SSTables
  • tombstone -> special marker for delete
  • read amplification -> number of files read for a query
  • write amplification -> extra writes caused by compaction
Mental Model Diagram (ASCII)
SSTable1: a=1, b=2
SSTable2: b=DEL, c=3
Compaction -> a=1, b=DEL, c=3
Mental Model Diagram (Image): Compaction

How It Works (Step-by-Step)
  1. Select SSTables to compact.
  2. Merge-sort entries by key.
  3. Drop older versions and expired tombstones.
  4. Write new SSTable and delete old ones.
Minimal Concrete Example
merge(a=1,b=2) + (b=DEL,c=3) -> a=1,b=DEL,c=3
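
A two-way merge sketch in C: both inputs are sorted by key and the newer table wins ties; the arrays stand in for real SSTable iterators:

```c
#include <stdio.h>
#include <string.h>

struct kv { const char *key; const char *val; };  /* val == NULL marks a tombstone */

static void emit(const struct kv *e) {
    printf("%s=%s\n", e->key, e->val ? e->val : "DEL");  /* real code writes a new SSTable */
}

void compact_merge(const struct kv *newer, size_t nn,
                   const struct kv *older, size_t no) {
    size_t i = 0, j = 0;
    while (i < nn || j < no) {
        int cmp;
        if (i >= nn)      cmp = 1;    /* only older entries remain */
        else if (j >= no) cmp = -1;   /* only newer entries remain */
        else              cmp = strcmp(newer[i].key, older[j].key);

        if (cmp < 0)      emit(&newer[i++]);
        else if (cmp > 0) emit(&older[j++]);
        else { emit(&newer[i++]); j++; }  /* equal keys: keep newer, drop older version */
    }
}
```

Note that tombstones are still emitted here; they may only be dropped when the merge output is the oldest level (see the misconceptions below).
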
Common Misconceptions
  • “Compaction is optional” -> without it reads slow down and disk fills.
  • “Tombstones can be removed immediately” -> a tombstone may only be dropped once no older table can still contain the key.
Check-Your-Understanding Questions
  1. Why does compaction increase write amplification?
  2. When is it safe to drop tombstones?
  3. How does compaction improve read performance?
Where You’ll Apply It
  • See 5.10 Phase 3 for compaction scheduling.
  • See 7.3 for performance traps.
  • Also used in: P04 Raft Consensus

2.5 Checksums and Crash Recovery

Description / Expanded Explanation

Checksums detect corruption in WAL and SSTables. Recovery replays logs and validates files. Without integrity checks, silent corruption can return wrong data.

Definitions & Key Terms
  • checksum -> hash for verifying data integrity
  • corruption -> data changed by disk or bugs
  • recovery -> process of restoring state after crash
  • snapshot -> compact representation of current state
Mental Model Diagram (ASCII)
record -> crc -> disk
read -> crc -> compare -> ok or error
Mental Model Diagram (Image): Checksums

How It Works (Step-by-Step)
  1. Compute checksum for each record.
  2. On read, recompute and compare.
  3. If mismatch, mark file corrupted and rebuild.
  4. On startup, replay WAL to rebuild memtable.
Minimal Concrete Example
crc32(record_bytes) -> 0xA1B2C3D4
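
The matching read-side check, sketched in C with zlib's crc32(); a short read at the tail is treated as a torn final record (replay stops cleanly), while a mismatch on a complete record signals corruption:

```c
#include <stdint.h>
#include <stdio.h>
#include <zlib.h>

/* Returns 1 = valid record, 0 = clean end of log (stop replay), -1 = corruption. */
int wal_read_record(FILE *f, uint8_t *payload, uint32_t cap, uint32_t *plen) {
    uint32_t crc;
    if (fread(plen, 4, 1, f) != 1) return 0;             /* EOF at a record boundary */
    if (fread(&crc, 4, 1, f) != 1) return 0;             /* torn header: treat as end */
    if (*plen > cap) return -1;                          /* implausible length */
    if (fread(payload, 1, *plen, f) != *plen) return 0;  /* torn tail: end of replay */
    return crc32(0L, payload, *plen) == crc ? 1 : -1;    /* mismatch = corruption */
}
```
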
Common Misconceptions
  • “Disks never corrupt” -> bit flips and partial writes happen.
  • “Checksums are too slow” -> they are cheap compared to I/O.
Check-Your-Understanding Questions
  1. Why is checksum validation needed after crash recovery?
  2. What should your system do when corruption is detected?
  3. Why do you need length prefixes in WAL records?
Where You’ll Apply It
  • See 3.5 for the checksum field in the WAL record.
  • See 7.1 for corruption-related pitfalls.

3. Project Specification

3.1 What You Will Build

A persistent key-value store that supports GET, PUT, and DELETE. It uses a WAL for durability, a memtable for fast writes, and immutable SSTables on disk. It provides a CLI and optional TCP server.

3.2 Functional Requirements

  1. PUT key value inserts or overwrites.
  2. GET key retrieves latest value or NOT_FOUND.
  3. DELETE key writes tombstone.
  4. Durability: data survives process crash.
  5. Recovery: WAL replay rebuilds state.
  6. Compaction: merges SSTables and removes obsolete data.
  7. Metrics: stats on read/write/compaction.

3.3 Non-Functional Requirements

  • Performance: sequential writes, avoid random writes.
  • Reliability: detect corruption with checksums.
  • Usability: deterministic demo dataset.

3.4 Example Usage / Output

$ kv put user:1 alice
OK
$ kv get user:1
alice
$ kv del user:1
OK
$ kv get user:1
NOT_FOUND

3.5 Data Formats / Schemas / Protocols

WAL record:

[len][crc32][op][key_len][key][val_len][value]

SSTable file:

[data blocks][index block][bloom filter][footer]
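
One way to make the footer concrete is a fixed-size struct at the end of the file; the exact fields and magic value here are assumptions for illustration:

```c
#include <stdint.h>

/* Written last, read first: a reader seeks to the end of the file,
 * loads the footer, and uses the offsets to locate the other blocks. */
struct sst_footer {
    uint64_t index_off;    /* file offset of the index block */
    uint64_t bloom_off;    /* file offset of the bloom filter */
    uint32_t num_entries;  /* entry count, useful for stats */
    uint32_t magic;        /* e.g. 0x53535431 ("SST1"), sanity check on open */
};
```
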

3.6 Edge Cases

  • Overwrite same key multiple times across tables.
  • Delete key that does not exist.
  • Crash during flush.
  • Corrupted WAL record.

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

make
./kv init ./data
./kv put user:1 alice
./kv get user:1

3.7.2 Golden Path Demo (Deterministic)

Use a fixed seed for compaction scheduling and deterministic record ordering.

3.7.3 CLI Transcript (Success)

$ ./kv init ./data
$ ./kv put user:1 alice
OK
$ ./kv put user:2 bob
OK
$ ./kv get user:1
alice
$ ./kv stats
{"memtable":2,"sstables":1,"wal_bytes":128}
$ echo $?
0

3.7.4 CLI Transcript (Failure)

$ ./kv get missing:key
NOT_FOUND
$ echo $?
1

3.7.5 API (If Implemented)

TCP protocol (optional):

PUT key value\n
GET key\n
DEL key\n

Error JSON for TCP mode:

{"error":"not_found","message":"key missing"}

3.7.6 Exit Codes

  • 0 success
  • 1 key not found
  • 2 corruption detected
  • 3 disk I/O error

4. Solution Architecture

4.1 High-Level Design

client -> API -> WAL -> memtable -> flush -> SSTable -> compaction

4.2 Key Components

| Component | Responsibility | Key Decisions |
|-----------|----------------|---------------|
| WAL | durable log | append-only with checksum |
| Memtable | in-memory sorted map | skiplist or tree |
| SSTables | on-disk sorted files | immutable, indexed |
| Compactor | merge files | size-tiered strategy |
| Recovery | replay WAL | validate checksums |

4.3 Data Structures (No Full Code)

struct wal_record {
    uint32_t len;  /* payload length: op + key_len + key + val_len + value */
    uint32_t crc;  /* crc32 over the payload that follows */
    uint8_t  op;   /* PUT or DELETE */
    /* followed on disk by: [key_len][key][val_len][value], per 3.5 */
};

4.4 Algorithm Overview

Write Path: append to WAL, fsync, insert into memtable, flush if full.

Read Path: check memtable, then SSTables newest to oldest, using bloom filters (sketched below).

Complexity Analysis

  • Writes: O(log n) in memtable
  • Reads: O(log n) with filters, worst-case O(k log n) across k tables
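
A read-path sketch in C tying these steps together; the db struct and all helpers are hypothetical stand-ins for your own components:

```c
#include <stddef.h>

struct memtable; struct sstable;
struct db { struct memtable *mem; struct sstable **tables; int ntables; };

/* Hypothetical helpers implemented elsewhere in the project: */
const char *memtable_get(struct memtable *, const char *key, size_t klen);
int         sstable_bloom_maybe(const struct sstable *, const char *key, size_t klen);
const char *sstable_get(struct sstable *, const char *key, size_t klen);

const char *kv_get(struct db *db, const char *key, size_t klen) {
    const char *v = memtable_get(db->mem, key, klen);
    if (v) return v;  /* may be a tombstone sentinel; caller maps it to NOT_FOUND */

    for (int i = db->ntables - 1; i >= 0; i--) {          /* newest table first */
        if (!sstable_bloom_maybe(db->tables[i], key, klen))
            continue;                                     /* definite miss: no disk I/O */
        v = sstable_get(db->tables[i], key, klen);        /* index seek + block read */
        if (v) return v;
    }
    return NULL;  /* NOT_FOUND */
}
```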

5. Implementation Guide

5.1 Development Environment Setup

make

5.2 Project Structure

project-root/
├── src/wal/
├── src/memtable/
├── src/sstable/
├── src/compaction/
└── tests/

5.3 The Core Question You’re Answering

“How can I store key-value data so it survives crashes and stays fast as it grows?”

5.4 Concepts You Must Understand First

  1. Append-only logs and fsync
  2. Sorted maps and range scans
  3. Immutable file formats
  4. Merge compaction

5.5 Questions to Guide Your Design

  1. What is the WAL flush policy?
  2. How large should memtables be before flush?
  3. How do you schedule compaction without blocking writes?

5.6 Thinking Exercise

Trace writes: PUT a, PUT b, DEL a, PUT a, then crash. What values should be recovered?

5.7 The Interview Questions They’ll Ask

  1. Why are LSM trees good for writes?
  2. Explain read amplification and write amplification.
  3. How does a bloom filter help?

5.8 Hints in Layers

Hint 1: Start with WAL + in-memory map.
Hint 2: Add flush to a single SSTable.
Hint 3: Add compaction later.

5.9 Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| Storage engines | Designing Data-Intensive Applications | Ch. 3 |
| File systems | Operating Systems: Three Easy Pieces | Ch. 40 |

5.10 Implementation Phases

Phase 1: Foundation (5-7 days)

WAL + memtable.

Phase 2: Core Functionality (7-10 days)

SSTables, indexing, bloom filters.

Phase 3: Polish and Edge Cases (5-7 days)

Compaction, crash recovery, metrics.

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
|----------|---------|----------------|-----------|
| Memtable | tree vs skiplist | skiplist | fast inserts |
| Compaction | size-tiered vs leveled | size-tiered | simpler |
| WAL fsync | per write vs batching | batching | performance |


6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
|----------|---------|----------|
| Unit Tests | WAL parsing | checksum validation |
| Integration Tests | recovery after crash | kill process mid-write |
| Load Tests | write throughput | 1M inserts |

6.2 Critical Test Cases

  1. Crash after WAL write but before memtable insert -> recovery replays.
  2. Tombstone hides older value after compaction.
  3. Bloom filter avoids unnecessary file reads.

6.3 Test Data

keys: user:1..user:1000
values: random strings

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution |
|---------|---------|----------|
| Missing fsync | data loss on crash | fsync the WAL before acknowledging |
| No compaction | disk grows forever | background compaction |
| Corrupt index | wrong reads | validate checksums |

7.2 Debugging Strategies

  • Dump SSTable contents to inspect ordering (see the sketch below).
  • Use deterministic seeds for compaction tests.
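
For the first bullet, a throwaway dump tool is enough; this sketch assumes the length-prefixed data block layout from the flush sketch in 2.2:

```c
#include <stdint.h>
#include <stdio.h>

/* Print the keys of an SSTable data block in file order; if the file is
 * correct, the output is sorted. */
int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <sstable>\n", argv[0]); return 1; }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 3; }  /* 3 = disk I/O error, per 3.7.6 */

    uint32_t klen, vlen;
    char key[1024];
    while (fread(&klen, 4, 1, f) == 1 && klen < sizeof(key)) {
        if (fread(key, 1, klen, f) != klen) break;
        if (fread(&vlen, 4, 1, f) != 1) break;
        fseek(f, vlen, SEEK_CUR);            /* skip the value bytes */
        printf("%.*s\n", (int)klen, key);
    }
    fclose(f);
    return 0;
}
```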

7.3 Performance Traps

  • Large bloom filters cause memory bloat.
  • Overly frequent compaction increases write amplification.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add range scan support.
  • Add simple metrics.

8.2 Intermediate Extensions

  • Add background compaction thread.
  • Add snapshots.

8.3 Advanced Extensions

  • Add distributed replication.
  • Add MVCC for consistent reads.

9. Real-World Connections

9.1 Industry Applications

  • LSM-based databases (LevelDB, RocksDB).
  • Log-structured storage systems.

9.2 Open-Source Examples

  • LevelDB
  • RocksDB
  • Badger

9.3 Interview Relevance

  • Demonstrates storage engine fundamentals and durability.

10. Resources

10.1 Essential Reading

  • Designing Data-Intensive Applications (Chapter 3)
  • The Log-Structured Merge-Tree paper

10.2 Video Resources

  • Storage engine lectures

10.3 Tools & Documentation

  • fio, iostat, perf

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain why WAL comes before memtable.
  • I can describe compaction trade-offs.
  • I can explain bloom filter false positives.

11.2 Implementation

  • Recovery works after crash.
  • Compaction removes obsolete values.
  • Checksums detect corruption.

11.3 Growth

  • I can compare this to LevelDB in an interview.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • PUT/GET/DEL work and survive restarts.
  • WAL is written before memtable changes.

Full Completion:

  • SSTables and compaction implemented.
  • Metrics and stats available.

Excellence (Going Above & Beyond):

  • Range scans and MVCC.
  • Background compaction tuning.