Project 6: Embedded Key-Value Database (WAL + LSM)

Build a crash-safe embedded key-value store with WAL, memtable, SSTables, and compaction.

Quick Reference

  • Difficulty: Level 4: Expert
  • Time Estimate: 4 to 8 weeks
  • Main Programming Language: C (C17)
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3 - Infrastructure Core
  • Prerequisites: File I/O, concurrency basics, data structures
  • Key Topics: WAL, LSM trees, compaction, checksums, bloom filters, crash recovery

1. Learning Objectives

By completing this project, you will:

  1. Design a durable write-ahead log with checksums and recovery.
  2. Implement an LSM-based storage engine with memtable and SSTables.
  3. Build compaction and metadata management for immutable files.
  4. Add indexing and bloom filters for fast reads.
  5. Validate crash safety with deterministic crash simulations.

2. All Theory Needed (Per-Concept Breakdown)

Concept 1: Write-Ahead Logging and Crash Consistency

Fundamentals

Write-ahead logging (WAL) ensures durability by recording changes before they are applied to the main data structure. A WAL entry is appended to disk, and only after the WAL is flushed does the system acknowledge the write. If the process crashes, the WAL is replayed to rebuild the memtable. Crash consistency is about guaranteeing that partial writes or torn pages do not leave the database in an inconsistent state. This requires checksums, sequence numbers, and a clear recovery protocol. The WAL is the first line of defense against data loss.

Deep Dive into the concept

The WAL is an append-only log. Each write operation generates a record with a header and payload. The header typically contains a sequence number, payload length, and checksum. The checksum protects against torn writes or partial writes caused by power loss. When the database starts, it scans the WAL and replays all valid records in order. The last valid record is determined by checksum validation; any trailing incomplete record is ignored. This ensures that the database state is consistent even if the last write was interrupted.

Durability depends on fsync semantics. On POSIX, a file write can be buffered by the OS and not actually written to disk. To guarantee durability, you must call fsync or fdatasync after writing the WAL record. The ordering is critical: write record, flush, then acknowledge. If you acknowledge before fsync, a crash can lose acknowledged writes. The cost is latency; to reduce it, many systems batch fsync calls, acknowledging multiple writes per sync. This introduces a durability window, which you should make explicit as a configuration parameter. In this project, you can implement a sync_every parameter to control batching.
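
As a concrete reference, the append path with a configurable sync_every policy might look like the sketch below. The wal_t type, its fields, and crc32_of are placeholders for code you will write, and the header here is deliberately simplified (length, checksum, sequence number only) compared to the full record format in Section 3.5.

#include <stdint.h>
#include <string.h>
#include <unistd.h>

uint32_t crc32_of(const void *data, size_t len);   /* assumed checksum routine */

/* Hypothetical WAL handle: file descriptor plus batching state. */
typedef struct {
    int      fd;          /* WAL file, opened for appending */
    uint64_t next_seq;    /* next sequence number to assign */
    int      sync_every;  /* fsync after this many records (1 = sync on every write) */
    int      pending;     /* records appended since the last fsync */
} wal_t;

/* Append one record. With sync_every == 1 the record is durable before this returns. */
int wal_append(wal_t *w, const void *payload, uint32_t len) {
    uint8_t hdr[16];
    uint64_t seq = w->next_seq++;
    uint32_t crc = crc32_of(payload, len);
    memcpy(hdr + 0, &len, 4);
    memcpy(hdr + 4, &crc, 4);
    memcpy(hdr + 8, &seq, 8);
    if (write(w->fd, hdr, sizeof hdr) != (ssize_t)sizeof hdr) return -1;
    if (write(w->fd, payload, len) != (ssize_t)len) return -1;
    if (++w->pending >= w->sync_every) {
        if (fdatasync(w->fd) != 0) return -1;
        w->pending = 0;
    }
    return 0;   /* with sync_every > 1, recent records sit in the durability window described above */
}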

Crash consistency also depends on atomicity of file operations. Appending to a file is not always atomic at the block level. A WAL record may be partially written if the process crashes mid-write. Checksums detect this. Another technique is record framing with length prefixes and suffixes, allowing the reader to detect incomplete records. You should define a WAL format with a fixed header and a checksum that covers the payload. Use a simple checksum like CRC32 or a fast hash. The key is determinism and correctness, not cryptographic security.

The WAL is also part of the performance story. Frequent fsync calls can be slow, especially on HDDs. A storage engine often offers different durability modes: synchronous (fsync on every write), batched (fsync every N writes), or asynchronous (fsync in background). For this project, implement at least synchronous and batched modes. Make the trade-off explicit in your documentation and tests.

Recovery must be deterministic and idempotent. If the WAL is replayed twice, the resulting state should be the same. This is achieved by ordering entries by sequence number and replaying them once. You may also include a marker for transaction boundaries. Even in a simple key-value store, you can treat each write as a transaction of one record. The recovery algorithm should scan the WAL, validate checksums, and apply entries in order to a fresh memtable. If it encounters corruption, it should stop at the last valid entry and report a warning. This is how production systems maintain data integrity after crashes.
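
A sketch of the matching replay loop, using the same simplified header as the append sketch above; memtable_apply and crc32_of are placeholders. The important property is that replay stops at the first record whose header or checksum fails to validate.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

uint32_t crc32_of(const void *data, size_t len);               /* assumed */
void memtable_apply(void *mt, const uint8_t *rec, uint32_t n); /* assumed */

/* Replay a WAL file into a fresh memtable; stop at the first invalid record. */
void wal_replay(FILE *f, void *memtable) {
    uint8_t hdr[16];
    uint64_t last_seq = 0;
    for (;;) {
        if (fread(hdr, 1, sizeof hdr, f) != sizeof hdr) break;   /* clean EOF or torn header */
        uint32_t len, crc; uint64_t seq;
        memcpy(&len, hdr + 0, 4);
        memcpy(&crc, hdr + 4, 4);
        memcpy(&seq, hdr + 8, 8);
        uint8_t *payload = malloc(len ? len : 1);
        if (!payload || fread(payload, 1, len, f) != len) { free(payload); break; }
        if (crc32_of(payload, len) != crc) { free(payload); break; }   /* torn tail: ignore and stop */
        memtable_apply(memtable, payload, len);   /* applied once, in sequence order */
        last_seq = seq;
        free(payload);
    }
    fprintf(stderr, "wal replay complete, last seq = %llu\n", (unsigned long long)last_seq);
}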

How This Fits Across Projects

This concept drives the WAL format in Section 3.5, the recovery logic in Section 4.4, and the crash tests in Section 6.2. It also connects to file I/O abstractions in P04-cross-platform-syscall-abstraction.md and backpressure in P03-mini-async-runtime.md.

Definitions & Key Terms

  • WAL (Write-Ahead Log): Append-only log of changes written before applying them.
  • fsync: OS call to flush buffered writes to durable storage.
  • Checksum: Value used to detect corruption or torn writes.
  • Torn write: Partial write due to crash or power loss.
  • Replay: Applying WAL records to rebuild state after a crash.

Mental Model Diagram (ASCII)

Client write -> WAL append + fsync -> ack -> memtable
Crash -> replay WAL -> rebuild memtable

How It Works (Step-by-Step)

  1. Serialize record with header and checksum.
  2. Append to WAL file.
  3. Call fsync according to durability policy.
  4. Apply record to memtable.
  5. On restart, scan WAL and replay valid records.

Invariants:

  • Writes are acknowledged only after WAL durability.
  • WAL replay is deterministic and ordered.

Failure modes:

  • Missing fsync loses acknowledged writes.
  • Corrupted WAL record without checksum detection.

Minimal Concrete Example

#include <stdint.h>

struct wal_header {
    uint32_t len;   /* payload length in bytes */
    uint32_t crc;   /* checksum over the payload */
    uint64_t seq;   /* monotonically increasing sequence number */
};
/* The serialized key and value bytes follow the header on disk. */

Common Misconceptions

  • “Write completed after write()” -> It is not durable until fsync.
  • “WAL is optional” -> It is the core of crash safety.
  • “Checksums are expensive” -> They are cheap compared to data loss.

Check-Your-Understanding Questions

  1. Why must the WAL be fsynced before acknowledging writes?
  2. How do checksums detect torn writes?
  3. What happens if WAL replay is not idempotent?

Check-Your-Understanding Answers

  1. Otherwise a crash can lose acknowledged data.
  2. A partial record will not match the checksum.
  3. The database state can diverge or duplicate writes.

Real-World Applications

  • SQLite uses a WAL for durability.
  • RocksDB uses a WAL plus memtable.

Where You’ll Apply It

  • This project: Section 3.5 Data Formats (WAL record), Section 4.4 Algorithm Overview (recovery), Section 6.2 Critical Test Cases.
  • Also used in: P04-cross-platform-syscall-abstraction.md for file I/O abstractions.

References

  • “Operating Systems: Three Easy Pieces” - crash consistency.
  • “Database Internals” by Alex Petrov - WAL design.

Key Insights

Durability is a protocol: write, flush, then acknowledge.

Summary

WAL is the durability backbone of the database. A clear record format and fsync discipline are non-negotiable for crash safety.

Homework/Exercises to Practice the Concept

  1. Write a simple append-only log with CRC32 checksums.
  2. Simulate a crash by truncating the log and test recovery.
  3. Measure latency with fsync on every write vs batched fsync.

Solutions to the Homework/Exercises

  1. The checksum detects partial writes at the end of the log.
  2. Recovery stops at the last valid record.
  3. Batched fsync improves throughput but widens the durability window.

Concept 2: LSM Trees, Memtables, and SSTables

Fundamentals

An LSM tree stores recent writes in memory and periodically flushes them to immutable on-disk tables. The memtable is usually a sorted data structure like a skip list or balanced tree. When it fills, it is frozen and written to an SSTable on disk. Reads must search the memtable first, then SSTables, usually from newest to oldest. Over time, many SSTables accumulate, so compaction merges them into fewer, larger tables. This design accepts write amplification in exchange for high write throughput and sequential disk I/O.

Deep Dive into the concept

The memtable is the write buffer of an LSM tree. It must support fast inserts and ordered iteration. A skip list is a common choice because it offers O(log n) insertion and ordered traversal with simple implementation. When the memtable reaches a size threshold, it is rotated into an immutable memtable and flushed to disk in the background. The flush writes key-value entries in sorted order into an SSTable file. Because SSTables are immutable, reads can rely on their ordering and do not need in-place updates.
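
The rotation step can be sketched as follows, assuming a memtable type with an insert routine and an approximate size counter; every name here is a placeholder for your own implementation.

#include <stddef.h>

typedef struct memtable memtable_t;                 /* sorted structure, e.g. a skip list */
memtable_t *memtable_new(void);
size_t      memtable_bytes(const memtable_t *mt);
void        memtable_put(memtable_t *mt, const char *key, const char *val);
void        flush_to_sstable_async(memtable_t *frozen);   /* background flush */

typedef struct {
    memtable_t *active;           /* receives new writes */
    memtable_t *immutable;        /* frozen, being flushed; NULL when none */
    size_t      flush_threshold;  /* rotation trigger, in bytes */
} db_memtables_t;

/* Insert into the active memtable, rotating it once it exceeds the threshold. */
void db_put_memtable(db_memtables_t *db, const char *key, const char *val) {
    memtable_put(db->active, key, val);
    if (memtable_bytes(db->active) >= db->flush_threshold && db->immutable == NULL) {
        db->immutable = db->active;             /* freeze: no further writes go here */
        db->active = memtable_new();            /* writers continue into a fresh table */
        flush_to_sstable_async(db->immutable);  /* flushed to an SSTable in the background */
    }
}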

SSTables typically store data in blocks. Each block contains a set of key-value pairs and a block checksum. At the end of the file, an index maps key ranges to block offsets. This allows the reader to seek directly to the block that might contain a key. A manifest file tracks the list of SSTables and their levels. This metadata must be updated atomically, usually by writing a new manifest record and fsyncing it.
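
One possible on-disk layout for blocks and index entries, written as C structs for orientation; the exact field set and serialization (endianness, varints, prefix compression) are design decisions left to you.

#include <stdint.h>

/* One data block as written to the SSTable: header, then packed entries. */
typedef struct {
    uint32_t num_entries;   /* entries packed in this block */
    uint32_t block_crc;     /* checksum over the packed entries */
    /* [key_len][val_len][key][value] repeated num_entries times */
} sst_block_hdr_t;

/* One sparse index entry, kept near the end of the file and loaded at open time. */
typedef struct {
    uint32_t first_key_len;  /* length of the block's first key */
    uint64_t block_offset;   /* file offset where the block starts */
    /* first_key bytes follow when serialized */
} sst_index_entry_t;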

Compaction is the process of merging multiple SSTables to reduce read amplification. In a leveled compaction strategy, each level has a size limit and overlapping key ranges are avoided. When a level exceeds its limit, its files are merged into the next level. In a tiered strategy, files are merged in batches but levels may overlap, trading read performance for write efficiency. For a project of this size, you can implement a simple leveled compaction: merge the oldest tables into a new table and delete the old ones once the new one is safely written.
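
The core merge step over two sorted runs, with the newer run shadowing the older one, can be sketched like this; entry_t is a simplified entry type and emit stands in for your SSTable writer.

#include <stddef.h>
#include <string.h>

typedef struct { const char *key; const char *value; int tombstone; } entry_t;
void emit(const entry_t *e);   /* assumed: append to the output SSTable */

/* Merge two runs sorted by key; on equal keys the newer entry wins. */
void merge_runs(const entry_t *newer, size_t n, const entry_t *older, size_t m) {
    size_t i = 0, j = 0;
    while (i < n || j < m) {
        int cmp;
        if (i >= n)      cmp = 1;                                  /* only older entries left */
        else if (j >= m) cmp = -1;                                 /* only newer entries left */
        else             cmp = strcmp(newer[i].key, older[j].key);
        if (cmp < 0)      emit(&newer[i++]);
        else if (cmp > 0) emit(&older[j++]);
        else { emit(&newer[i++]); j++; }   /* same key: newer wins, older duplicate is dropped */
    }
    /* A full compactor also drops tombstones once no older level can still contain the key. */
}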

Compaction must handle tombstones (deletes). When a key is deleted, a tombstone record is written to the memtable and flushed to SSTables. Compaction must drop entries that are older than a tombstone. This is the mechanism by which deletes are eventually purged. You must also ensure that compaction is crash-safe: the new SSTable should be fully written and fsynced before the old ones are removed. A common pattern is to write to a temporary file, fsync, then atomically rename it into place. Only then can the old files be deleted.
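
The write-then-rename pattern can be sketched like this; the paths and the write_contents callback are placeholders, and on POSIX the containing directory is also fsynced so the rename itself survives a crash.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write the new SSTable to a temporary path, then publish it atomically. */
int publish_sstable(const char *tmp_path, const char *final_path, const char *dir_path,
                    int (*write_contents)(int fd)) {
    int fd = open(tmp_path, O_CREAT | O_TRUNC | O_WRONLY, 0644);
    if (fd < 0) return -1;
    if (write_contents(fd) != 0 || fsync(fd) != 0) { close(fd); return -1; }
    close(fd);
    if (rename(tmp_path, final_path) != 0) return -1;   /* atomic replacement of the name */
    int dfd = open(dir_path, O_RDONLY);                 /* persist the directory entry too */
    if (dfd < 0) return -1;
    int rc = fsync(dfd);
    close(dfd);
    return rc;   /* only after this returns 0 may the old SSTables be deleted */
}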

Read performance depends on the number of SSTables and the index. The memtable and most recent SSTables are searched first. Bloom filters (discussed next) can reduce unnecessary disk reads. A realistic implementation also uses a block cache to keep hot blocks in memory. These components together determine read latency.

How This Fits Across Projects

This concept defines the on-disk format in Section 3.5, the compaction algorithm in Section 4.4, and the background worker in Section 5.10 Phase 3. It also connects to memory allocation in P01-custom-memory-allocator.md and to backpressure in P03-mini-async-runtime.md.

Definitions & Key Terms

  • LSM tree: Log-structured merge tree, a write-optimized storage structure.
  • Memtable: In-memory sorted structure for recent writes.
  • SSTable: Immutable sorted table stored on disk.
  • Compaction: Merge of SSTables to reduce read amplification.
  • Tombstone: Record representing a delete.

Mental Model Diagram (ASCII)

Writes -> memtable -> flush -> SSTables -> compaction -> fewer SSTables

How It Works (Step-by-Step)

  1. Insert key into memtable.
  2. When memtable is full, freeze and flush to SSTable.
  3. Add SSTable to manifest.
  4. Background compaction merges SSTables.
  5. Reads search memtable then SSTables newest to oldest.

Invariants:

  • SSTables are immutable and sorted.
  • Manifest reflects only fully written tables.

Failure modes:

  • Crash during compaction leaves orphaned files.
  • Missing manifest updates cause data loss.

Minimal Concrete Example

// SSTable entry format
// [key_len][val_len][key][value]

Common Misconceptions

  • “Compaction is optional” -> Reads slow down without it.
  • “SSTables can be updated in place” -> They are immutable by design.
  • “Memtable alone is enough” -> It loses data on crash without WAL.

Check-Your-Understanding Questions

  1. Why are SSTables immutable?
  2. What is write amplification and why does it occur in LSM trees?
  3. How does a tombstone delete work?

Check-Your-Understanding Answers

  1. Immutability allows sequential writes and simpler recovery.
  2. Data is rewritten during compaction, increasing total I/O.
  3. A tombstone records a delete and is later purged during compaction.

Real-World Applications

  • RocksDB and LevelDB are LSM-based storage engines.
  • Cassandra uses LSM trees for distributed storage.

Where You’ll Apply It

  • This project: Section 3.5 Data Formats, Section 4.4 Algorithm Overview, Section 5.10 Phase 2.
  • Also used in: P01-custom-memory-allocator.md for memory management strategies.

References

  • “Designing Data-Intensive Applications” - LSM trees.
  • “Database Internals” - storage engine design.

Key Insights

LSM trees trade read amplification for write throughput, and compaction is the price.

Summary

Memtables and SSTables provide fast writes and immutable disk structures. Compaction manages read performance and storage growth.

Homework/Exercises to Practice the Concept

  1. Implement a sorted memtable with a skip list.
  2. Write a simple SSTable file and build an index.
  3. Merge two SSTables and resolve duplicates.

Solutions to the Homework/Exercises

  1. The skip list gives ordered iteration and O(log n) inserts.
  2. The index maps key ranges to file offsets.
  3. The newer entry wins and older duplicates are dropped.

Concept 3: Bloom Filters, Indexing, and Block Caching

Fundamentals

Reading from an LSM tree can require checking multiple SSTables. Bloom filters are probabilistic data structures that quickly tell you if a key is definitely not in a table. They reduce disk reads for missing keys. Each SSTable typically has a bloom filter and a sparse index. The index maps key ranges to block offsets, allowing the reader to seek directly to a block. A block cache stores hot blocks in memory, reducing disk I/O for frequently accessed keys. These techniques are essential for acceptable read performance.

Deep Dive into the concept

A bloom filter is a bit array plus multiple hash functions. When a key is inserted, each hash sets a bit. When querying, the same hashes are checked; if any bit is zero, the key is definitely not present. If all bits are one, the key might be present. This false positive rate is tunable by the size of the bit array and the number of hash functions. In an LSM tree, each SSTable has its own bloom filter. A read first checks the bloom filter; if it says “not present,” you skip the SSTable. This can save many disk reads, especially for negative lookups.
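
A minimal bloom filter sketch, deriving k probe positions from two base hashes (double hashing); the FNV-1a hash used here is one reasonable choice, not a requirement.

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint8_t *bits;   /* bit array of m bits */
    uint64_t m;      /* number of bits */
    int      k;      /* number of probes per key */
} bloom_t;

static uint64_t fnv1a(const char *s, uint64_t seed) {
    uint64_t h = 1469598103934665603ULL ^ seed;
    for (; *s; s++) { h ^= (uint8_t)*s; h *= 1099511628211ULL; }
    return h;
}

void bloom_add(bloom_t *b, const char *key) {
    uint64_t h1 = fnv1a(key, 0), h2 = fnv1a(key, 0x9E3779B97F4A7C15ULL);
    for (int i = 0; i < b->k; i++) {
        uint64_t bit = (h1 + (uint64_t)i * h2) % b->m;
        b->bits[bit / 8] |= (uint8_t)(1u << (bit % 8));
    }
}

/* false => key is definitely absent; true => key may be present */
bool bloom_maybe_contains(const bloom_t *b, const char *key) {
    uint64_t h1 = fnv1a(key, 0), h2 = fnv1a(key, 0x9E3779B97F4A7C15ULL);
    for (int i = 0; i < b->k; i++) {
        uint64_t bit = (h1 + (uint64_t)i * h2) % b->m;
        if (!(b->bits[bit / 8] & (1u << (bit % 8)))) return false;
    }
    return true;
}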

Indexes in SSTables are usually sparse. Instead of indexing every key, you index every Nth key or every block. The index entry contains the first key in the block and the block offset. To look up a key, you binary search the index to find the block that could contain it, then read that block and scan within it. This reduces memory usage while still enabling fast lookup. You should choose a block size and index interval that balance memory usage and I/O. Typical block sizes are 4KB to 16KB, aligning with filesystem page size and I/O granularity.
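
The lookup into a sparse index is a binary search for the last entry whose first key is less than or equal to the search key. A sketch, assuming the index has already been loaded into memory and has at least one entry:

#include <stdint.h>
#include <string.h>

/* In-memory form of a sparse index entry. */
typedef struct {
    const char *first_key;   /* first key stored in the block */
    uint64_t    offset;      /* file offset of the block */
} index_entry_t;

/* Return the offset of the only block that could contain `key`. Entries are sorted. */
uint64_t index_find_block(const index_entry_t *idx, size_t n, const char *key) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (strcmp(idx[mid].first_key, key) <= 0)
            lo = mid + 1;    /* candidate block is mid or later */
        else
            hi = mid;
    }
    /* lo - 1 is the last entry with first_key <= key; if key sorts before all
       first keys, scanning block 0 simply finds nothing. */
    return lo == 0 ? idx[0].offset : idx[lo - 1].offset;
}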

Block caching further improves performance. A block cache keeps recently accessed blocks in memory, often using an LRU policy. When a block is requested, the cache returns it if available; otherwise it reads from disk and inserts it. The cache must handle concurrency and memory limits. If the cache is too small, read latency increases. If it is too large, it competes with the OS page cache. In an embedded database, you can expose a configuration parameter for cache size.
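
A deliberately tiny block cache sketch: a fixed array of slots evicted by least-recent logical tick. A realistic cache would use a hash map plus a linked list and handle locking; the capacity, key encoding, and read_block callback here are placeholders.

#include <stdint.h>
#include <stdlib.h>

#define CACHE_SLOTS 64

typedef struct {
    uint64_t block_id;    /* (table id, block offset) folded into one key for brevity */
    void    *data;        /* cached, immutable block contents */
    uint64_t last_used;   /* tick of most recent access */
    int      in_use;
} cache_slot_t;

typedef struct {
    cache_slot_t slots[CACHE_SLOTS];   /* zero-initialized */
    uint64_t     tick;
} block_cache_t;

/* Return the block, reading it from disk and evicting the LRU slot on a miss. */
void *cache_get(block_cache_t *c, uint64_t block_id, void *(*read_block)(uint64_t)) {
    cache_slot_t *victim = &c->slots[0];
    c->tick++;
    for (int i = 0; i < CACHE_SLOTS; i++) {
        cache_slot_t *s = &c->slots[i];
        if (s->in_use && s->block_id == block_id) { s->last_used = c->tick; return s->data; }
        if (!s->in_use || s->last_used < victim->last_used) victim = s;   /* track LRU / empty slot */
    }
    free(victim->data);                 /* miss: evict and refill */
    victim->data = read_block(block_id);
    victim->block_id = block_id;
    victim->last_used = c->tick;
    victim->in_use = 1;
    return victim->data;
}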

Bloom filters and indexes must be stored and loaded efficiently. The bloom filter can be stored at the end of the SSTable, along with a footer that points to the index. The footer should include offsets and checksums. When opening an SSTable, you read the footer, load the index and bloom filter into memory, and then use them for queries. This design keeps data blocks on disk and metadata in memory. The metadata size should be small enough to load for all tables.

Finally, be aware of false positives. A bloom filter can say “maybe” when the key is not present, causing an unnecessary disk read. You can tune the false positive rate by adjusting the number of bits per key. In a project, you can pick a default like 10 bits per key and 7 hash functions, then test its performance. The key is to keep the implementation deterministic and correct, not to chase the perfect rate.

How This Fits Across Projects

This concept defines the SSTable metadata in Section 3.5, the read path in Section 4.4, and the performance tests in Section 6.2. It also connects to caching patterns in P01-custom-memory-allocator.md and performance measurement in P05-high-performance-string-search.md.

Definitions & Key Terms

  • Bloom filter: Probabilistic set membership structure.
  • False positive: Bloom filter says present when it is not.
  • Sparse index: Index of blocks rather than every key.
  • Block cache: Memory cache for SSTable blocks.
  • LRU: Least Recently Used eviction policy.

Mental Model Diagram (ASCII)

Lookup -> bloom filter -> index -> block read -> key scan

How It Works (Step-by-Step)

  1. Check bloom filter for key.
  2. If maybe present, binary search index for block.
  3. Read block from disk or cache.
  4. Scan block for key and return value.

Invariants:

  • Bloom filter and index correspond to the same SSTable version.
  • Block cache entries are immutable.

Failure modes:

  • Incorrect bloom filter bits cause missed keys.
  • Corrupted index offsets lead to reads of the wrong block.

Minimal Concrete Example

// bloom filter check: bits is an array of m bits; get_bit is a helper you define
bool maybe = get_bit(bits, h1 % m) && get_bit(bits, h2 % m) && get_bit(bits, h3 % m);

Common Misconceptions

  • “Bloom filters guarantee presence” -> They only guarantee absence.
  • “Index every key” -> Sparse indexes are enough and cheaper.
  • “Cache is optional” -> Reads can be too slow without it.

Check-Your-Understanding Questions

  1. Why can bloom filters improve read latency?
  2. What trade-off does a larger bloom filter make?
  3. Why use a sparse index instead of a full index?

Check-Your-Understanding Answers

  1. They avoid disk reads for keys not present.
  2. More memory for fewer false positives.
  3. It reduces memory usage while still allowing fast lookup.

Real-World Applications

  • RocksDB uses bloom filters and block caches heavily.
  • Cassandra relies on bloom filters for negative lookups.

Where You’ll Apply It

  • This project: Section 3.5 Data Formats (SSTable footer), Section 4.4 Algorithm Overview (read path), Section 6.2 Critical Test Cases.
  • Also used in: P01-custom-memory-allocator.md for caching patterns.

References

  • “Database Internals” - indexing and bloom filters.
  • “Designing Data-Intensive Applications” - storage engine details.

Key Insights

Read performance in LSM trees depends on metadata and caching as much as on disk speed.

Summary

Bloom filters, indexes, and caches make reads viable in an LSM tree. Without them, read latency explodes as tables accumulate.

Homework/Exercises to Practice the Concept

  1. Implement a bloom filter with two hash functions.
  2. Build a sparse index for a sorted file.
  3. Add an LRU cache for blocks and measure hit rate.

Solutions to the Homework/Exercises

  1. A correct bloom filter never returns false negatives; they appear only if the bit array or hashing is implemented incorrectly.
  2. The index reduces lookup to one block read.
  3. The cache improves repeated reads and reduces I/O.

Concept 4: Concurrency, Recovery, and Background Work

Fundamentals

A storage engine must handle concurrent reads and writes while maintaining durability. Even in an embedded database, multiple threads may access the database. The simplest approach is a global mutex, but that limits concurrency. A better approach is to allow concurrent reads while writes acquire a lock. Background work such as memtable flush and compaction must not block reads for long. Recovery must be safe even if a crash occurs during compaction. These constraints shape the concurrency model and the synchronization primitives you choose.

Deep Dive into the concept

Concurrency in an LSM engine often uses a read-write lock: multiple readers can proceed, but writes require exclusive access. This works because the data being read is mostly immutable: SSTables and immutable memtables are read-only. The only mutable structure is the active memtable. A common design is to use a mutex for the memtable and allow readers to snapshot it. When the memtable is rotated to immutable, writers create a new memtable and continue, while readers can still access the old one. This allows flush to disk without blocking writers. The flush can happen in a background thread, but the flush thread must coordinate with the manifest update to ensure crash safety.
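
The read path under this model can be sketched as follows; kvdb_t, memtable_get, and search_sstables are placeholders for this project's own code.

#include <pthread.h>

typedef struct kvdb kvdb_t;
extern pthread_rwlock_t db_lock;                            /* shared with writers */
const char *memtable_get(kvdb_t *db, const char *key);      /* assumed */
const char *search_sstables(kvdb_t *db, const char *key);   /* assumed: newest table first */

const char *kvdb_get_locked(kvdb_t *db, const char *key) {
    pthread_rwlock_rdlock(&db_lock);          /* many readers can hold this at once */
    const char *v = memtable_get(db, key);    /* active and immutable memtables */
    if (v == NULL)
        v = search_sstables(db, key);         /* immutable on-disk tables */
    pthread_rwlock_unlock(&db_lock);
    return v;   /* value lifetime and copying are deliberately left out of the sketch */
}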

Compaction is the most complex concurrent operation. It merges multiple SSTables into a new file and then swaps it into the active set. This must be atomic: readers should either see the old set or the new set, but not a mixture. A common technique is to write the new table, fsync it, then update the manifest in a single record. The manifest update acts as the atomic switch. After that, old tables can be deleted. If a crash happens before the manifest update, the new table is ignored. If the crash happens after, the new table is used and old ones are deleted on recovery.
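
A sketch of the manifest append that serves as the atomic switch; the record encoding is illustrative and crc32_of stands in for whatever checksum routine you already use for the WAL.

#include <stdint.h>
#include <string.h>
#include <unistd.h>

uint32_t crc32_of(const void *data, size_t len);   /* assumed */

/* Append one checksummed manifest record describing the new table set. A torn tail
 * fails its checksum on recovery and is ignored, so the old table set stays active. */
int manifest_append(int fd, const void *record, uint32_t len) {
    uint32_t crc = crc32_of(record, len);
    uint8_t hdr[8];
    memcpy(hdr + 0, &len, 4);
    memcpy(hdr + 4, &crc, 4);
    if (write(fd, hdr, sizeof hdr) != (ssize_t)sizeof hdr) return -1;
    if (write(fd, record, len) != (ssize_t)len) return -1;
    return fsync(fd);   /* once this succeeds, the switch is durable; old files may be deleted */
}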

Recovery must rebuild the in-memory state from persistent files. On startup, the engine reads the manifest to discover active SSTables, loads their metadata (indexes and bloom filters), then replays the WAL into a new memtable. This should be deterministic. If there are orphaned SSTables (written but not in the manifest), they should be ignored or cleaned up. If there is a partially written manifest record, it should be detected and skipped. This requires checksums in the manifest as well.

Background work introduces backpressure. If compaction falls behind, read amplification increases and write throughput degrades. You can monitor the number of SSTables and trigger compaction when thresholds are exceeded. You can also throttle writes when the system is overloaded. This is a trade-off: throttling reduces write throughput but stabilizes latency. For this project, implement a simple trigger: if the number of SSTables exceeds a limit, pause writes until compaction catches up. This demonstrates backpressure in a storage engine.
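
A minimal backpressure sketch using a mutex and condition variable; the struct, field names, and threshold are placeholders.

#include <pthread.h>

typedef struct {
    pthread_mutex_t mu;
    pthread_cond_t  can_write;
    int             sstable_count;   /* maintained by flush and compaction */
    int             max_sstables;    /* threshold above which writers pause */
} backpressure_t;

/* Called by writers before appending; blocks while the engine is over the limit. */
void writer_wait_for_room(backpressure_t *bp) {
    pthread_mutex_lock(&bp->mu);
    while (bp->sstable_count > bp->max_sstables)
        pthread_cond_wait(&bp->can_write, &bp->mu);
    pthread_mutex_unlock(&bp->mu);
}

/* Called by the compaction thread after it removes tables. */
void compaction_done(backpressure_t *bp, int tables_removed) {
    pthread_mutex_lock(&bp->mu);
    bp->sstable_count -= tables_removed;
    pthread_cond_broadcast(&bp->can_write);   /* wake any paused writers */
    pthread_mutex_unlock(&bp->mu);
}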

Finally, tests for concurrency are essential. Use multiple threads to perform reads and writes while compaction runs. Use a deterministic seed so results are reproducible. Use crash tests that kill the process during flush and compaction. After recovery, verify that all acknowledged writes are present and that no unacknowledged writes appear.

How This Fits Across Projects

This concept defines the locking strategy in Section 4.2, the background worker in Section 5.10 Phase 3, and the crash tests in Section 6.2. It also connects to work stealing in P02-work-stealing-thread-pool.md and async scheduling in P03-mini-async-runtime.md.

Definitions & Key Terms

  • Read-write lock: Lock allowing multiple readers or one writer.
  • Manifest: Metadata file describing active SSTables.
  • Background compaction: Merging SSTables while serving reads.
  • Snapshot: A consistent view of the database at a point in time.
  • Throttling: Slowing writes to allow background work to catch up.

Mental Model Diagram (ASCII)

writes -> memtable -> flush -> SSTables
            |             |
         readers       compaction

How It Works (Step-by-Step)

  1. Writers acquire lock to update memtable and WAL.
  2. Readers read memtable and SSTables without blocking.
  3. Background thread flushes immutable memtable.
  4. Compaction merges SSTables and updates manifest.
  5. Recovery rebuilds state from manifest and WAL.

Invariants:

  • Manifest update is the atomic switch for SSTables.
  • Readers always see a consistent set of tables.

Failure modes:

  • Crash during compaction leaves orphaned tables.
  • Deadlocks if locks are held during I/O.

Minimal Concrete Example

#include <pthread.h>

pthread_rwlock_t db_lock = PTHREAD_RWLOCK_INITIALIZER;
// readers: pthread_rwlock_rdlock, writers: pthread_rwlock_wrlock

Common Misconceptions

  • “Global mutex is fine” -> It kills read concurrency.
  • “Compaction can delete old tables immediately” -> Only after manifest update.
  • “Crash tests are optional” -> They are the only proof of durability.

Check-Your-Understanding Questions

  1. Why must manifest updates be atomic?
  2. How do readers stay consistent during compaction?
  3. Why should locks not be held during disk I/O?

Check-Your-Understanding Answers

  1. They define which tables are active after a crash.
  2. Readers use immutable SSTables and a snapshot of active tables.
  3. Disk I/O is slow and can block other operations, causing deadlocks.

Real-World Applications

  • RocksDB uses background compaction threads.
  • LevelDB uses snapshots for consistent reads.

Where You’ll Apply It

  • This project: Section 4.2 Key Components, Section 5.10 Phase 3, Section 6.2 Crash Tests.
  • Also used in: P02-work-stealing-thread-pool.md for background scheduling.

References

  • “Designing Data-Intensive Applications” - concurrency and compaction.
  • “Database Internals” - recovery and locking strategies.

Key Insights

Concurrency and recovery are inseparable in a storage engine; the manifest is the source of truth.

Summary

An embedded database must coordinate reads, writes, and compaction without losing data. Locks, snapshots, and manifest updates are the core tools.

Homework/Exercises to Practice the Concept

  1. Implement a simple read-write lock around a shared map.
  2. Simulate compaction by merging sorted arrays while readers read.
  3. Write a crash test that kills the process during flush.

Solutions to the Homework/Exercises

  1. Readers proceed concurrently, writers block them.
  2. Readers still see consistent data if they use snapshots.
  3. Recovery should replay WAL and ignore partial files.

3. Project Specification

3.1 What You Will Build

A crash-safe embedded key-value store with WAL, memtable, SSTables, bloom filters, and background compaction. It includes a CLI tool (kvdb) for put/get/delete operations and a deterministic crash test harness. The database is embedded: it runs as a library linked into applications, not as a server.

3.2 Functional Requirements

  1. Put/Get/Delete API: Basic key-value operations.
  2. WAL: Append-only log with checksum and fsync policy.
  3. Memtable: In-memory sorted structure.
  4. SSTables: Immutable sorted files with index and bloom filter.
  5. Compaction: Merge SSTables and remove tombstones.
  6. Recovery: Replay WAL and rebuild state on startup.
  7. Crash tests: Deterministic crash simulation.

3.3 Non-Functional Requirements

  • Durability: Acknowledged writes survive crash.
  • Performance: Sequential writes with stable throughput.
  • Reliability: Recovery handles partial writes gracefully.

3.4 Example Usage / Output

$ ./kvdb put user:1 Alice
OK
$ ./kvdb get user:1
Alice

3.5 Data Formats / Schemas / Protocols

WAL record format:

[magic:4][len:4][seq:8][crc:4][key_len:4][val_len:4][key][value]

SSTable footer format:

[index_offset:8][filter_offset:8][footer_crc:4]
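
The same formats expressed as C structs, for orientation only; writing structs directly to disk works for a first version, but an explicit little-endian serializer is more robust across platforms, and exactly which bytes len and the checksums cover is a format decision you should document.

#include <stdint.h>

typedef struct {
    uint32_t magic;     /* fixed constant identifying a WAL record */
    uint32_t len;       /* length of the record body (define precisely in your format docs) */
    uint64_t seq;       /* monotonically increasing sequence number */
    uint32_t crc;       /* checksum over key_len, val_len, key, value */
    uint32_t key_len;
    uint32_t val_len;
    /* key bytes, then value bytes */
} wal_record_hdr_t;

typedef struct {
    uint64_t index_offset;   /* where the index region begins */
    uint64_t filter_offset;  /* where the bloom filter region begins */
    uint32_t footer_crc;     /* checksum over the two offsets */
} sst_footer_t;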

3.6 Edge Cases

  • Duplicate key overwrites previous value.
  • Delete writes a tombstone and removes key on compaction.
  • Crash during compaction leaves old tables intact.

3.7 Real World Outcome

A persistent embedded database that survives simulated crashes and provides consistent reads.

3.7.1 How to Run (Copy/Paste)

make
./kvdb put user:1 Alice
./kvdb get user:1
./kvdb crash-test --seed 999 --iterations 50

3.7.2 Golden Path Demo (Deterministic)

  • Run crash-test with seed 999 and fixed iterations.
  • Expect 50/50 successful recoveries with no lost acknowledged writes.

3.7.3 If CLI: Exact Terminal Transcript

$ ./kvdb put user:1 Alice
OK
exit_code=0

$ ./kvdb get user:1
Alice
exit_code=0

$ ./kvdb get missing
NOT_FOUND
exit_code=1

3.7.4 If Web App

Not applicable.

3.7.5 If API

Not applicable.

3.7.6 If Library

Install/import

#include "kvdb.h"

Minimal usage

kvdb_error_t err;
kvdb_t* db = kvdb_open("./data", &err);
kvdb_put(db, "user:1", "Alice");
kvdb_close(db);

Error handling

kvdb_error_t err;
kvdb_t* db = kvdb_open("./data", &err);
if (!db) { fprintf(stderr, "open failed: %d\n", err); }

3.7.7 If GUI / Desktop / Mobile

Not applicable.

3.7.8 If TUI

Not applicable.


4. Solution Architecture

4.1 High-Level Design

+---------+    +----------+    +-----------+    +------------+
| WAL     | -> | Memtable | -> | SSTables  | -> | Compaction |
+---------+    +----------+    +-----------+    +------------+
     |                              |
     v                              v
  Recovery                     Bloom/Index

4.2 Key Components

Component  | Responsibility      | Key Decisions
WAL        | Durable logging     | Record format and fsync policy
Memtable   | In-memory writes    | Skip list vs tree
SSTable    | Immutable storage   | Block size and index format
Compaction | Merge tables        | Leveled strategy
Manifest   | Track active tables | Atomic updates

4.3 Data Structures (No Full Code)

typedef struct kv_entry {
    char* key;
    char* value;
    uint64_t seq;
    int tombstone;
} kv_entry_t;

4.4 Algorithm Overview

Key Algorithm: Recovery

  1. Read manifest to load SSTables.
  2. Replay WAL entries in order.
  3. Rebuild memtable and sequence counter.
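
A sketch of the startup sequence that ties these steps together; every name here is a placeholder for code you will write in this project.

#include <stddef.h>

typedef struct sstable sstable_t;
typedef struct kvdb {
    sstable_t **tables;       /* active tables listed in the manifest */
    size_t      num_tables;
    /* memtable, WAL handle, sequence counter, ... */
} kvdb_t;

kvdb_t *kvdb_alloc(const char *dir);
void    manifest_load(kvdb_t *db);             /* step 1: read manifest, skip torn records */
void    sstable_load_metadata(sstable_t *t);   /* footer, index, bloom filter */
void    wal_replay_into_memtable(kvdb_t *db);  /* steps 2-3: rebuild memtable and sequence counter */

kvdb_t *kvdb_open_and_recover(const char *dir) {
    kvdb_t *db = kvdb_alloc(dir);
    manifest_load(db);
    for (size_t i = 0; i < db->num_tables; i++)
        sstable_load_metadata(db->tables[i]);
    wal_replay_into_memtable(db);
    return db;
}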

Complexity Analysis:

  • Time: O(N) for WAL replay.
  • Space: O(M) for memtable size.

5. Implementation Guide

5.1 Development Environment Setup

sudo apt-get install build-essential
make

5.2 Project Structure

kvdb/
|-- src/
|   |-- wal.c
|   |-- memtable.c
|   |-- sstable.c
|   |-- compaction.c
|   `-- db.c
|-- include/
|   `-- kvdb.h
|-- tools/
|   `-- kvdb_cli.c
`-- Makefile

5.3 The Core Question You’re Answering

“How do you guarantee durability and correctness even when the process crashes mid-write?”

5.4 Concepts You Must Understand First

  1. Write-ahead logging and fsync
  2. LSM tree structure and compaction
  3. Bloom filters and indexes
  4. Concurrency and recovery rules

5.5 Questions to Guide Your Design

  1. What is your WAL record format and checksum?
  2. When do you flush the memtable?
  3. How do you trigger compaction?

5.6 Thinking Exercise

A crash happens after WAL append but before memtable update.

  • What state should recovery produce?
  • How do you detect the last valid WAL record?

5.7 The Interview Questions They’ll Ask

  1. Why does WAL guarantee durability?
  2. What is write amplification in LSM trees?
  3. How do bloom filters improve reads?

5.8 Hints in Layers

Hint 1: WAL first

Implement WAL append and replay before memtable flush.

Hint 2: Add memtable

Use a sorted structure and replay WAL into it.

Hint 3: Add SSTables

Write memtable to disk and add index.

Hint 4: Add compaction

Merge tables and remove tombstones.

5.9 Books That Will Help

Topic             | Book                                    | Chapter
Crash consistency | “Operating Systems: Three Easy Pieces”  | Ch. 40-42
LSM trees         | “Designing Data-Intensive Applications” | Storage engine chapter
Data structures   | “Algorithms” by Sedgewick               | Balanced trees

5.10 Implementation Phases

Phase 1: WAL + Memtable (2 weeks)

Goals:

  • Durable WAL with replay
  • In-memory memtable

Tasks:

  1. Implement WAL append and checksum validation.
  2. Replay WAL on startup into memtable.

Checkpoint: Crash/restart preserves writes.

Phase 2: SSTables (1 to 2 weeks)

Goals:

  • Flush memtable to immutable SSTables
  • Build index and bloom filter

Tasks:

  1. Implement SSTable writer and reader.
  2. Load index and bloom filter on startup.

Checkpoint: Reads work after restart without WAL replay.

Phase 3: Compaction + Crash Tests (1 to 2 weeks)

Goals:

  • Merge SSTables safely
  • Deterministic crash tests

Tasks:

  1. Implement leveled compaction with atomic manifest updates.
  2. Add crash simulation harness with fixed seed.

Checkpoint: 50 crash tests pass with no lost writes.

5.11 Key Implementation Decisions

Decision    | Options              | Recommendation         | Rationale
Memtable    | Skip list vs tree    | Skip list              | Simple and ordered
Compaction  | Leveled vs tiered    | Leveled                | Better read performance
Sync policy | Per-write vs batched | Batched (configurable) | Balance latency and throughput

6. Testing Strategy

6.1 Test Categories

Category          | Purpose                 | Examples
Unit Tests        | WAL and SSTable formats | Checksum validation, index offsets
Integration Tests | Put/get/delete          | Persistence across restart
Crash Tests       | Simulated failures      | Kill during flush

6.2 Critical Test Cases

  1. WAL replay restores all acknowledged writes.
  2. Crash during compaction leaves data consistent.
  3. Bloom filters never cause false negatives.

6.3 Test Data

seed: 999
writes: 5000
keys: user:1..user:5000

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall                     | Symptom                 | Solution
Missing fsync               | Lost writes after crash | Fsync WAL before ack
Bad checksum logic          | Recovery stops early    | Validate checksum implementation
Deleting old SSTables early | Data loss               | Delete only after manifest update

7.2 Debugging Strategies

  • Crash simulation: kill process at random points with fixed seed.
  • Verbose logging: log WAL sequence numbers and compaction steps.

7.3 Performance Traps

  • Too frequent fsync calls reducing throughput.
  • Compaction running too often due to low thresholds.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add batch write API.
  • Add basic stats output (SSTable count, memtable size).

8.2 Intermediate Extensions

  • Add block cache with LRU eviction.
  • Add prefix compression in SSTables.

8.3 Advanced Extensions

  • Add MVCC snapshots.
  • Add background compaction scheduling with priorities.

9. Real-World Connections

9.1 Industry Applications

  • RocksDB: Production embedded LSM-based key-value store.
  • LevelDB: Similar architecture with WAL and SSTables.
  • SQLite: WAL-based embedded database.

9.2 Interview Relevance

  • Explain WAL durability and recovery.
  • Describe LSM compaction trade-offs.

10. Resources

10.1 Essential Reading

  • “Designing Data-Intensive Applications” - storage engines.
  • “Database Internals” by Alex Petrov - WAL, compaction.

10.2 Video Resources

  • Talks on LSM trees and storage engine design.

10.3 Tools & Documentation

  • strace and perf for I/O profiling.
  • fsync and file I/O man pages.

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain why WAL is required for durability.
  • I can explain compaction trade-offs.
  • I can explain how bloom filters reduce reads.

11.2 Implementation

  • Crash tests pass deterministically.
  • Compaction does not lose data.
  • Recovery is idempotent.

11.3 Growth

  • I can explain my storage engine design in an interview.
  • I can extend the engine with MVCC or caching.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • WAL + memtable with crash recovery.
  • Basic put/get/delete CLI.

Full Completion:

  • SSTable flush and compaction implemented.
  • Bloom filters and index used for reads.

Excellence (Going Above & Beyond):

  • MVCC snapshots and block cache.
  • Detailed durability benchmark report.