Project 2: Log-Structured Key-Value Store (Mini LSM-Tree)
Build a persistent key-value store with WAL, memtable, SSTables, and background compaction.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Advanced |
| Time Estimate | 25-40 hours |
| Main Programming Language | C |
| Alternative Programming Languages | Rust, Go |
| Coolness Level | Level 4 |
| Business Potential | Open-core infrastructure |
| Prerequisites | File IO, sorting, hash tables, basic OS concepts |
| Key Topics | WAL, SSTables, compaction, bloom filters, tombstones |
1. Learning Objectives
By completing this project, you will:
- Implement a write-ahead log (WAL) for crash-safe durability.
- Build a memtable that flushes into immutable SSTables.
- Implement multi-way merge compaction and tombstone handling.
- Add Bloom filters to reduce unnecessary disk reads.
- Measure write amplification and read latency under different compaction policies.
2. All Theory Needed (Per-Concept Breakdown)
Write-Ahead Logging (WAL) and Crash Recovery
Fundamentals
A write-ahead log is a sequential file that records every change before it is applied to in-memory structures. The central idea is that the log is the authoritative history of updates, so after a crash you can replay it to restore the last known good state. WAL works because sequential writes are fast and because you can ensure durability by calling fsync after appending a record. In an LSM-style store, the WAL protects the memtable: you write the update to the log, then apply it to the memtable. If the process crashes, you rebuild the memtable by reading the log from the last checkpoint. WAL records must be self-describing and checksummed so that you can detect partial writes or corruption. Understanding WAL is essential because it is the difference between a fast cache and a durable database.
Deep Dive into the concept
WAL is the core durability mechanism for many storage engines. It follows a simple rule: the log must reach stable storage before the in-memory state is considered committed. This rule ensures that after a crash, the system can replay the log and recover all committed updates. Implementing WAL in a mini LSM tree involves defining a record format, writing records sequentially, and ensuring correct fsync behavior. A typical record format includes a length prefix, a record type (PUT/DEL), the key length, value length, the key bytes, and the value bytes. You must also include a checksum so you can detect a torn write. A torn write occurs when a crash happens mid-write and only part of the record is persisted. On recovery, you read records sequentially, validate the checksum and length, and stop when you detect an invalid record. This approach ensures that the log is always a prefix of valid records.
The performance of WAL is tied to fsync. Calling fsync on every write ensures durability but can be slow. Real systems batch fsync calls or use group commit. For this project, you can implement a simple policy: fsync after each write for correctness, and optionally add a --fsync-every option to batch. The correctness model must be explicit: if you batch fsync, you accept that the last few updates may be lost in a crash. This is still useful for demonstrating tradeoffs. The log is append-only, which means write performance is high and largely sequential. The OS page cache will buffer the writes, but fsync is the signal to flush to disk.
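The append-plus-fsync policy can be sketched as a small helper. This is an illustrative sketch, not the project's required API: the wal_t struct, its field names, and the fsync_interval knob are assumptions made here.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical WAL handle; the fsync policy knob is an assumption. */
typedef struct {
    int fd;
    uint64_t writes_since_sync;
    uint64_t fsync_interval;   /* 1 = fsync every record (safest) */
} wal_t;

/* Append a pre-serialized record, then fsync according to policy. */
int wal_append(wal_t *w, const void *rec, size_t len) {
    if (write(w->fd, rec, len) != (ssize_t)len)
        return -1;                      /* short write: treat as failure */
    if (++w->writes_since_sync >= w->fsync_interval) {
        if (fsync(w->fd) != 0) return -1;
        w->writes_since_sync = 0;       /* record is now durable */
    }
    return 0;
}
```

With fsync_interval greater than 1, the last few appends can be lost in a crash, which is exactly the tradeoff the batching policy accepts.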
WAL also interacts with compaction. When you flush a memtable to an SSTable, you can truncate the WAL up to the flush point, because those updates are now persisted in the SSTable. This requires a checkpoint mechanism. A simple approach is to rotate WAL segments: once a memtable flush completes, you close the current log file and start a new one. On recovery, you replay only the most recent log segments that are not yet included in SSTables. This is a form of log recycling. Your system should store a manifest file that records which log segments are active and which SSTables exist. The manifest can be updated atomically by writing a new file and renaming it, which is an atomic operation on most filesystems.
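The write-temp-then-rename pattern for the manifest might look like the following sketch. The /tmp paths and single-string contents are illustrative; a real store would build paths from its data directory and also fsync the directory so the rename itself is durable.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Atomic manifest update: write a temp file, fsync it, then rename it
 * over the old manifest. Paths here are illustrative placeholders. */
int manifest_write(const char *contents) {
    const char *tmp = "/tmp/lsm_manifest.tmp";
    int fd = open(tmp, O_CREAT | O_TRUNC | O_WRONLY, 0644);
    if (fd < 0) return -1;
    size_t len = strlen(contents);
    if (write(fd, contents, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    close(fd);
    /* rename() is atomic within one filesystem: readers observe either
     * the old manifest or the new one, never a partially written file.
     * A production system would also fsync the containing directory. */
    return rename(tmp, "/tmp/lsm_manifest");
}
```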
Recovery is straightforward but must be deterministic. On startup, you read the manifest, open the WAL segments, replay them into a new memtable, and then open SSTables. For each key, the most recent update wins. This means your WAL records should include a monotonically increasing sequence number or at least preserve order. Sequence numbers are important when you later merge SSTables, because you need to choose the newest version of a key among duplicates. In a simple system, you can rely on log order and per-SSTable sequence ranges, but a global sequence number is easier to reason about. Each write increments a counter and stores the sequence number in the log and the memtable entry.
Failure modes include partial writes, missing fsync, and incorrect recovery ordering. If you forget to fsync, you may lose updates that you believed were durable. If you ignore checksums, you may replay corrupted data and return wrong values. If you fail to stop at the first corrupted record, you may read garbage and crash. The WAL is also a potential source of write amplification if you do not recycle it. If you keep appending forever, log replay becomes slow. This is why checkpoints and log rotation are essential. By designing WAL carefully in this project, you will understand the core durability mechanism used by production databases and storage systems.
How this fits into the projects
WAL is central to §3.2 Functional Requirements and §3.7 Real World Outcome. It is implemented in §5.10 Phase 1 and tested in §6.2 Crash Recovery cases. It is also used in P06-capstone-simple-database-engine.md.
Definitions & key terms
- WAL -> write-ahead log, append-only record of updates
- fsync -> system call to flush file data to disk
- torn write -> partial record written before crash
- checkpoint -> point where WAL can be truncated
- sequence number -> monotonic id to order updates
Mental model diagram
Client PUT -> append WAL -> fsync -> apply to memtable
Crash -> replay WAL -> rebuild memtable -> serve reads
How it works (step-by-step, with invariants and failure modes)
- Serialize PUT/DEL into a WAL record with checksum.
- Append to WAL file and fsync.
- Apply update to memtable with sequence number.
- On restart, read WAL sequentially and validate checksums.
- Stop at first invalid record and build memtable from valid records.
Invariants: WAL records are append-only; memtable contains all durable updates since last flush. Failure modes: missing fsync, corrupted record, incorrect replay ordering.
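The replay steps above can be sketched over an in-memory buffer. The record layout ([u32 len][payload][u32 checksum]) and the FNV-1a checksum are deliberate simplifications standing in for the real record format and CRC32:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* FNV-1a, used here as a placeholder for CRC32. */
static uint32_t cksum(const uint8_t *p, size_t n) {
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < n; i++) { h ^= p[i]; h *= 16777619u; }
    return h;
}

/* Scan records laid out as [u32 len][payload][u32 cksum]; return the
 * number of valid records, stopping at the first torn or corrupt one.
 * This enforces the "log is a prefix of valid records" invariant. */
size_t wal_replay_count(const uint8_t *buf, size_t n) {
    size_t off = 0, valid = 0;
    while (off + 8 <= n) {                 /* need room for len + cksum */
        uint32_t len; memcpy(&len, buf + off, 4);
        if (off + 4 + len + 4 > n) break;  /* truncated record: stop */
        uint32_t want; memcpy(&want, buf + off + 4 + len, 4);
        if (cksum(buf + off + 4, len) != want) break;  /* corruption */
        valid++;
        off += 4 + len + 4;
    }
    return valid;
}
```

A real replay loop would apply each valid record to the memtable instead of just counting, but the stop condition is the same.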
Minimal concrete example
[record_len][type=PUT][seq=42][klen=5][vlen=3][key=alpha][val=one][crc]
Common misconceptions
- “WAL is optional” -> without it you lose data on crash.
- “fsync is cheap” -> it is expensive but necessary for durability.
- “You can ignore partial records” -> you must stop at first invalid record.
Check-your-understanding questions
- Why must the WAL be written before updating the memtable?
- What happens if you do not checksum WAL records?
- How does WAL rotation reduce recovery time?
Check-your-understanding answers
- If you crash after memtable update but before WAL, the update is lost.
- You might replay corrupted bytes and produce incorrect key/value pairs.
- You only replay recent logs rather than the full history.
Real-world applications
- PostgreSQL WAL and crash recovery.
- RocksDB log files for memtable durability.
Where you’ll apply it
- This project: §3.2, §3.7, §5.10 Phase 1, §6.2.
- Also used in: P06-capstone-simple-database-engine.md.
References
- Designing Data-Intensive Applications, Ch. 3
- Database Internals, Ch. 5
- The Linux Programming Interface, file IO chapters
Key insights
WAL turns volatile memory updates into durable history, enabling fast writes without sacrificing crash safety.
Summary
A write-ahead log is the durability backbone of the store. It is append-only, checksummed, and replayable. It enables fast sequential writes while guaranteeing recovery.
Homework/Exercises to practice the concept
- Implement a WAL writer with checksummed records and a replay tool.
- Simulate a crash by truncating the WAL mid-record and verify recovery stops safely.
Solutions to the homework/exercises
- Serialize records with a length prefix and CRC, then replay by scanning and validating.
- When CRC or length fails, stop replay and ignore trailing bytes.
Memtable and SSTable (Sorted Run) Design
Fundamentals
The memtable is an in-memory structure that stores recent writes in sorted order. When it reaches a size threshold, it is flushed to disk as an SSTable (Sorted String Table). SSTables are immutable, sorted files that enable efficient range scans and binary search. The memtable is often implemented as a skip list or a balanced tree, but for this project you can use a sorted array or a binary search tree. The key property is that it can output entries in sorted order when flushing. SSTables are immutable to keep writes sequential and simplify crash safety. Reads consult the memtable first, then the newest SSTables, and use the newest key version. This separation of mutable memtable and immutable SSTables is the defining feature of an LSM tree.
Deep Dive into the concept
Memtable design is a tradeoff between write speed, memory usage, and flush cost. A skip list provides O(log n) inserts and maintains sorted order, which is useful for range queries and for producing a sorted run during flush. A balanced tree offers similar properties. An array-based memtable can be faster for small sizes but requires sorting on flush. The flush cost is acceptable if you batch enough writes, because sorting is O(n log n) and the amortized cost per write can still be low. For this project, a skip list or sorted array is sufficient, as the main goal is to understand the flow from memory to disk.
An SSTable is an immutable file containing sorted key/value pairs, usually in blocks. A simple format is: a sequence of entries (key length, value length, key bytes, value bytes) plus a block index at the end. The block index maps key ranges to file offsets, enabling efficient binary search and range scans. A minimal implementation can store entries sequentially and use a sparse index: every Nth key and its offset. During lookup, you binary search the sparse index to find the block that could contain the key, then scan within that block. This keeps memory overhead small and lookup times acceptable.
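A minimal sketch of the sparse-index seek, assuming the index is an in-memory array of every Nth key with its file offset (all names are illustrative):

```c
#include <stddef.h>
#include <string.h>

/* Sparse index entry: every Nth key and the offset of its block. */
typedef struct { const char *key; long offset; } index_entry;

/* Binary search for the last indexed key <= target; the caller then
 * scans the block starting at the returned offset. Returns -1 when the
 * target sorts before the first indexed key. */
long sparse_index_seek(const index_entry *idx, size_t n, const char *target) {
    if (n == 0 || strcmp(target, idx[0].key) < 0)
        return -1;
    size_t lo = 0, hi = n;          /* invariant: idx[lo].key <= target */
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (strcmp(idx[mid].key, target) <= 0) lo = mid;
        else hi = mid;
    }
    return idx[lo].offset;
}
```

The caller scans forward from the returned offset until it finds the key or passes where it would sort, so lookup cost is a binary search plus at most one block scan.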
Immutability is critical. Because SSTables are never modified in place, writes are append-only and crash-safe. Compaction later merges SSTables to remove duplicates and tombstones. The memtable acts as a buffer that absorbs writes and provides fast reads for hot data. When the memtable flushes, you write a new SSTable in a single sequential pass, which is efficient on disk. This is why LSM trees are write-optimized: they convert random writes into sequential runs.
Reading requires a merge view across multiple sorted sources: the memtable and multiple SSTables. The latest version of a key is determined by sequence number, so each entry should include its sequence number. For a read, you search the memtable, then SSTables in order of recency, stopping at the first match or tombstone. This can be expensive if many SSTables exist, which is why compaction is necessary. Each SSTable can also carry a Bloom filter to quickly reject missing keys, but the key concept here is the sorted run itself.
Tombstones are delete markers. When you delete a key, you insert a tombstone record in the memtable. When flushed, the tombstone is written to an SSTable. During reads, the tombstone indicates the key is deleted, even if older SSTables contain it. During compaction, tombstones allow the system to drop older versions and remove the key entirely after a certain age or when it is safe.
The overall flow is: client writes -> WAL -> memtable; memtable flush -> SSTable; reads -> check memtable then SSTables. This design allows high write throughput because each write is O(log n) in memory plus an append to WAL. The cost is read amplification: you may need to check multiple SSTables. Bloom filters and compaction mitigate this. This is the heart of LSM tree design and is critical to understand before building the project.
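The newest-source-first read path can be sketched with toy in-memory runs. A NULL value stands in for a tombstone, and linear scans stand in for the real sorted lookups, so this only illustrates the precedence rule:

```c
#include <stddef.h>
#include <string.h>

/* Toy entry: val == NULL encodes a tombstone in this sketch. */
typedef struct { const char *key; const char *val; } entry;
typedef struct { const entry *entries; size_t n; } run;

/* Look up a key across sources ordered newest-first (runs[0] is the
 * memtable). The first hit wins; a tombstone hides older values.
 * Note the toy limitation: "deleted" and "never existed" both return
 * NULL, which a real store would distinguish. */
const char *lsm_get(const run *runs, size_t nruns, const char *key) {
    for (size_t r = 0; r < nruns; r++)
        for (size_t i = 0; i < runs[r].n; i++)
            if (strcmp(runs[r].entries[i].key, key) == 0)
                return runs[r].entries[i].val;  /* NULL => tombstone */
    return NULL;  /* not found in any source */
}
```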
How this fits into the projects
Memtable and SSTables are the core storage path in §3.2, §3.5, and §4.4. They are implemented in §5.10 Phase 2 and evaluated in §6.2. Concepts reappear in P05-btree-file-index.md as a contrast to B-tree storage.
Definitions & key terms
- memtable -> in-memory sorted structure for recent writes
- SSTable -> immutable sorted run on disk
- sorted run -> sequence of key/value pairs in sorted order
- tombstone -> delete marker stored like a value
- sparse index -> index of every Nth key for faster search
Mental model diagram
WAL -> memtable (sorted) -> flush -> SSTable 0001
                                 \-> SSTable 0002
Reads: memtable -> newest SSTable -> older SSTables
How it works (step-by-step, with invariants and failure modes)
- Insert into memtable (sorted by key) with sequence number.
- When size threshold reached, freeze memtable.
- Write frozen memtable to new SSTable sequentially.
- Add SSTable metadata to manifest.
- Replace memtable with a new empty one.
Invariants: memtable entries are sorted; SSTables are immutable and sorted. Failure modes: partial SSTable writes, missing manifest updates, or incorrect tombstone handling.
Minimal concrete example
SSTable entries:
[key_len][val_len][seq][key][val]
...
Index: [key=foo offset=0], [key=qux offset=4096]
Common misconceptions
- “SSTables are random access” -> they are sequential with sparse indexing.
- “Deletes remove keys immediately” -> deletes create tombstones.
- “Memtable can be unsorted” -> then flush would be expensive and complex.
Check-your-understanding questions
- Why does the memtable need to be sorted?
- How do tombstones prevent old values from resurfacing?
- Why are SSTables immutable?
Check-your-understanding answers
- Sorting allows efficient range scans and sequential flush to SSTables.
- The newest tombstone overrides older values during reads and compaction.
- Immutability keeps writes sequential and simplifies crash safety.
Real-world applications
- RocksDB and LevelDB store data in SSTables.
- Cassandra and HBase use memtables and SSTables.
Where you’ll apply it
- This project: §3.2, §3.5, §4.4, §5.10 Phase 2.
- Also used in: P05-btree-file-index.md for comparison.
References
- Designing Data-Intensive Applications, Ch. 3
- Database Internals, Ch. 4
- LevelDB implementation notes
Key insights
Memtables and SSTables transform random writes into sequential runs, trading read amplification for write throughput.
Summary
The memtable buffers writes and flushes to immutable sorted files. Reads merge across these sorted sources, and tombstones carry delete semantics. This is the LSM core.
Homework/Exercises to practice the concept
- Implement a sorted array memtable and a flush routine that writes to a file.
- Add a sparse index and implement binary search for lookup.
Solutions to the homework/exercises
- Collect entries, sort by key, and write sequentially with lengths.
- Store every Nth key and offset, binary search the index, then scan within the block.
Compaction and Multi-Way Merge
Fundamentals
Compaction is the process of merging multiple SSTables into fewer, larger SSTables to reduce read amplification and remove obsolete data. Because SSTables are immutable, updates produce new entries rather than overwriting old ones. Compaction scans multiple sorted files, merges them by key, and writes out a new sorted run that keeps only the newest version of each key and drops tombstones when safe. This reduces the number of files and improves lookup time. However, it rewrites data, which creates write amplification. Choosing when and how to compact is a fundamental tradeoff in LSM design.
From a systems perspective, you should be able to explain the invariants this concept maintains, the resources it consumes (CPU, memory, IO), and the typical failure modes when it is misused. This basic framing will help you reason about correctness before you optimize or extend the design.
Deep Dive into the concept
Compaction is where the LSM tree pays for its fast writes. Each new SSTable adds another file to check on reads, so you periodically merge files to keep read costs bounded. The simplest policy is tiered compaction: when you have N files of similar size, merge them into one larger file. This produces high write amplification because data can be rewritten multiple times as it moves to larger tiers, but it preserves write throughput. Leveled compaction maintains size ratios between levels and merges data into a fixed level, reducing read amplification but increasing compaction work. For this project, implement a simple tiered compaction policy: when the number of SSTables exceeds a threshold (e.g., 4), merge the oldest ones into a new file.
The merge algorithm is a multi-way merge of sorted lists. You can implement this with a min-heap keyed by the current key of each SSTable iterator. Each iterator reads entries sequentially from its file. The heap always yields the smallest key among the current entries. If multiple entries have the same key, you select the one with the highest sequence number (newest) and discard the others. Tombstones are treated as entries; if the newest entry is a tombstone, you output it or drop it depending on whether you are compacting at the last level. In this project, you can keep tombstones to preserve correctness, or you can drop them when you are sure no older SSTables contain the key. A simple rule: only drop tombstones during a full compaction that includes all SSTables.
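The duplicate-resolution rule (smallest key next, highest sequence number wins) can be sketched as below. For clarity the minimum is found by a linear scan over the iterators instead of a min-heap; the two are interchangeable for small k, and the types are illustrative:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct { const char *key; const char *val; uint64_t seq; } kv;
typedef struct { const kv *items; size_t n, pos; } iter;  /* one sorted run */

/* Merge k sorted iterators into out[]; among duplicates of the same key,
 * the entry with the highest sequence number (the newest) survives. */
size_t merge_runs(iter *its, size_t k, kv *out) {
    size_t written = 0;
    for (;;) {
        const char *min = NULL;
        for (size_t i = 0; i < k; i++)          /* find smallest live key */
            if (its[i].pos < its[i].n) {
                const char *c = its[i].items[its[i].pos].key;
                if (!min || strcmp(c, min) < 0) min = c;
            }
        if (!min) return written;               /* all iterators drained */
        kv best = { NULL, NULL, 0 };
        for (size_t i = 0; i < k; i++)          /* consume every duplicate */
            if (its[i].pos < its[i].n &&
                strcmp(its[i].items[its[i].pos].key, min) == 0) {
                const kv *c = &its[i].items[its[i].pos++];
                if (!best.key || c->seq > best.seq) best = *c;
            }
        out[written++] = best;                  /* newest version only */
    }
}
```

A heap version replaces the first inner loop with a pop and the advance with a push, dropping the per-step cost from O(k) to O(log k).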
Compaction must be crash safe. Because you are writing new SSTables, you should write to a temporary file and only rename it into place once the write is complete and fsync has succeeded. Update the manifest after the new file is durable, and only then delete the old files. This ensures that a crash in the middle of compaction does not lose data. The manifest should list the active SSTables in order of recency. During compaction, you build a new file, then replace the old set atomically by writing a new manifest file and renaming it over the old one.
Write amplification is the number of bytes written to disk per byte of user data. Compaction increases it because the same data is rewritten multiple times. You should measure this by tracking bytes read and written during compaction. For example, if you compact two 100MB files into one 100MB file, you read 200MB and write 100MB: 300MB of total IO to produce 100MB of output, or 3x relative to the surviving data. The ratio of total bytes written to user bytes written is the write amplification. Your stats command should report this so the learner sees the cost of compaction decisions.
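Tracking the ratio is a few lines of accounting; the struct and field names are illustrative. User bytes grow on every PUT/DELETE, while disk bytes grow on WAL appends, flushes, and compaction rewrites:

```c
#include <stdint.h>

/* Cumulative IO counters for the stats command (names are illustrative). */
typedef struct {
    uint64_t user_bytes;   /* bytes of key+value submitted by clients */
    uint64_t disk_bytes;   /* bytes written: WAL + flushes + compaction */
} io_stats;

/* write_amp = total bytes written to disk / bytes of user data written.
 * Reported as 1.0 before any writes to avoid dividing by zero. */
double write_amp(const io_stats *s) {
    return s->user_bytes
        ? (double)s->disk_bytes / (double)s->user_bytes
        : 1.0;
}
```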
Compaction also affects space amplification, which is the ratio of disk used to live data. Without compaction, many old versions and tombstones accumulate, increasing space usage. Compaction reduces space amplification but costs IO. This is the systems tradeoff: write amplification vs read amplification vs space amplification. By implementing compaction and measuring its cost, you will gain intuition that is critical for designing storage engines.
How this fits into the projects
Compaction is required in §3.2 and §3.7. It is implemented in §5.10 Phase 3 and tested in §6.2. It is also a key concept in P06-capstone-simple-database-engine.md.
Definitions & key terms
- compaction -> merging SSTables into fewer files
- multi-way merge -> merge sorted lists from multiple sources
- write amplification -> bytes written per byte of user data
- space amplification -> disk used vs live data
- leveled/tiered -> compaction strategies
Mental model diagram
SSTable A + SSTable B + SSTable C
        \       |       /
         multi-way merge
                |
        merged SSTable D
How it works (step-by-step, with invariants and failure modes)
- Select a set of SSTables to compact.
- Open iterators for each file and push current key into heap.
- Repeatedly pop smallest key, choose newest version by sequence.
- Write chosen entry to new SSTable.
- When done, fsync new file and update manifest.
- Delete old files.
Invariants: new SSTable is sorted; newest version wins; manifest reflects only active files. Failure modes: duplicate keys, lost data if manifest updated early, or torn output file.
Minimal concrete example
Input:
SST1: (a->1 seq=1), (c->3 seq=2)
SST2: (a->2 seq=3), (b->5 seq=4)
Output:
(a->2), (b->5), (c->3)
Common misconceptions
- “Compaction is optional” -> without it, reads slow down and space grows.
- “Compaction always helps” -> it increases write amplification.
- “Tombstones can be dropped anytime” -> only safe when older data is fully compacted.
Check-your-understanding questions
- Why does compaction reduce read amplification?
- What does write amplification measure?
- When is it safe to drop tombstones?
Check-your-understanding answers
- Fewer SSTables means fewer files to check on reads.
- The ratio of bytes written to disk to user-written bytes.
- When all older SSTables that could contain the key are included in compaction.
Real-world applications
- LevelDB and RocksDB compaction scheduling.
- Cassandra background compaction services.
Where you’ll apply it
- This project: §3.2, §3.7, §5.10 Phase 3, §7.3.
- Also used in: P06-capstone-simple-database-engine.md.
References
- Designing Data-Intensive Applications, Ch. 3
- Database Internals, Ch. 4-5
- LevelDB compaction design notes
Key insights
Compaction is the price of fast writes: it trades IO for reduced read and space amplification.
Summary
By merging sorted runs and keeping only the newest keys, compaction keeps read performance manageable and removes obsolete data, at the cost of extra writes.
Homework/Exercises to practice the concept
- Implement a two-way merge of sorted key/value files.
- Extend to a heap-based N-way merge and measure time.
Solutions to the homework/exercises
- Use two pointers and compare keys, writing the smallest each step.
- Use a min-heap keyed by current key from each iterator.
Bloom Filters for Read Avoidance
Fundamentals
A Bloom filter is a probabilistic data structure that answers set membership queries with possible false positives but no false negatives. In an LSM tree, each SSTable can have a Bloom filter that says whether a key might be present in that file. If the filter says “no”, you can skip reading the file. This reduces read amplification and disk IO. A Bloom filter is implemented as a bit array and several hash functions. To insert a key, you set bits at multiple hash positions. To query, you check those bits; if any bit is 0, the key is definitely not present. The false positive rate depends on the number of bits per key and the number of hash functions.
Deep Dive into the concept
Bloom filters are a performance hack that leverage probability to save IO. In LSM systems, reads are often the bottleneck because you may need to check multiple SSTables. A Bloom filter reduces this by quickly answering “not present” in memory, which is far faster than a disk read. The design tradeoff is memory vs accuracy: more bits per key means fewer false positives. A common target is 10 bits per key, giving a false positive rate of around 1%. You can compute the optimal number of hash functions as k = (m/n) * ln(2) where m is the number of bits and n is the number of keys.
To implement a Bloom filter, you store a bit array of size m and choose k hash functions. In practice, you can use one strong hash function and derive multiple hashes by mixing the result with different seeds. For example, compute a 64-bit hash and then derive k positions by adding a fixed delta or using the lower and upper bits in a double hashing scheme: h_i = h1 + i*h2. This avoids the cost of multiple independent hashes. When building an SSTable, you insert each key into the filter. Store the filter alongside the SSTable, either as a sidecar file or as a trailer section in the SSTable file. On lookup, load the filter into memory and check it before scanning the SSTable.
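A minimal filter using the double-hashing scheme above (h_i = h1 + i*h2) might look like this sketch. FNV-1a is used as the base hash for brevity; a production filter would use a stronger mixer, and all names are illustrative:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bloom filter over an m-bit array with k derived hash positions. */
typedef struct { uint8_t *bits; uint64_t m; int k; } bloom;

/* 64-bit FNV-1a; seeding gives the two base hashes for double hashing. */
static uint64_t fnv1a(const void *p, size_t n, uint64_t seed) {
    const uint8_t *b = p;
    uint64_t h = seed ^ 14695981039346656037ULL;
    for (size_t i = 0; i < n; i++) { h ^= b[i]; h *= 1099511628211ULL; }
    return h;
}

void bloom_add(bloom *f, const void *key, size_t n) {
    uint64_t h1 = fnv1a(key, n, 0), h2 = fnv1a(key, n, 0x9e3779b9);
    for (int i = 0; i < f->k; i++) {
        uint64_t pos = (h1 + (uint64_t)i * h2) % f->m;  /* h_i = h1 + i*h2 */
        f->bits[pos / 8] |= (uint8_t)(1u << (pos % 8));
    }
}

int bloom_maybe_contains(const bloom *f, const void *key, size_t n) {
    uint64_t h1 = fnv1a(key, n, 0), h2 = fnv1a(key, n, 0x9e3779b9);
    for (int i = 0; i < f->k; i++) {
        uint64_t pos = (h1 + (uint64_t)i * h2) % f->m;
        if (!(f->bits[pos / 8] & (1u << (pos % 8))))
            return 0;   /* a zero bit: definitely not present */
    }
    return 1;           /* all bits set: maybe present */
}
```

Note that hashing includes the key length, so variable-length keys with shared prefixes or embedded null bytes hash distinctly, as the pitfalls below require.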
Bloom filters have no false negatives, which is critical for correctness. If the filter says “not present”, you can safely skip the file. If it says “maybe”, you must check the SSTable. False positives increase read amplification but do not affect correctness. This means Bloom filters can be tuned without changing system semantics. You should expose a configuration option for bits per key and measure the impact on memory usage and read performance. In a small system, you can precompute the number of bits based on the number of keys during SSTable creation and write the filter size into the file header so it can be loaded on startup.
There are pitfalls. If you miscompute the number of keys or the hash function distribution is poor, the false positive rate can be much higher than expected. If you store the filter on disk and load it on demand, you might reintroduce IO overhead. For this project, load all filters at startup into memory. The memory overhead is small compared to the performance benefit. Another pitfall is using variable-length keys without consistent hashing; you must hash the exact bytes and length to avoid collisions due to null bytes or encoding differences.
Bloom filters also highlight the systems mindset: you trade a bit of memory and occasional false positives for a massive reduction in expensive disk seeks. This is the same principle as caching, but applied within the storage engine. Implementing Bloom filters in this project gives you a concrete understanding of probabilistic data structures and their real-world impact.
One more practical detail is how to size and persist the filter. You can compute the bit array length from an estimated key count at SSTable creation, then store the filter size and hash parameters in the SSTable header. If the estimate is wrong, your false positive rate changes, but correctness is still intact. This is why many systems store the actual key count and rebuild filters during compaction when the count is known precisely. You should also decide how to load filters on startup. Loading all filters into memory gives the fastest reads but uses more RAM. Lazy-loading filters saves memory but adds an extra read on the first query for each file. For this project, load all filters and expose the memory cost in stats so the tradeoff is visible.
How this fits into the projects
Bloom filters are required in §3.2 and §3.7. They are implemented in §5.10 Phase 3 and tested in §6.2. They also appear in P06-capstone-simple-database-engine.md for secondary indexes.
Definitions & key terms
- Bloom filter -> probabilistic membership structure
- false positive -> filter says yes when key not present
- bits per key -> memory parameter that controls accuracy
- double hashing -> deriving k hashes from two base hashes
Mental model diagram
key -> hash1, hash2 -> set bits -> [bit array]
query -> check bits -> maybe / no
How it works (step-by-step, with invariants and failure modes)
- Choose bit array size and number of hash functions.
- For each key in SSTable, set k bit positions.
- On lookup, compute k positions and test bits.
- If any bit is 0, skip the SSTable.
- If all bits are 1, read SSTable and verify key.
Invariants: no false negatives; consistent hash over key bytes. Failure modes: wrong hash inputs, corrupted filter, or incorrect size metadata.
Minimal concrete example
bit array size = 64 bits
key "foo" -> positions 3, 17, 41 -> set bits
lookup "bar" -> bit 17 = 0 -> definitely not present
Common misconceptions
- “Bloom filters guarantee presence” -> they only say “maybe”.
- “False positives are errors” -> they only cause extra IO, not incorrect results.
- “Hash count doesn’t matter” -> it strongly impacts false positive rate.
Check-your-understanding questions
- Why are Bloom filters safe to use for skipping SSTables?
- What happens if you reduce bits per key too much?
- How does double hashing reduce CPU cost?
Check-your-understanding answers
- They never return false negatives, so skipping is safe.
- False positives rise, causing more disk reads.
- You compute two hashes and derive many, instead of computing many hashes.
Real-world applications
- RocksDB per-file Bloom filters.
- Web caches using Bloom filters to avoid disk lookups.
Where you’ll apply it
- This project: §3.2, §3.7, §5.10 Phase 3.
- Also used in: P06-capstone-simple-database-engine.md.
References
- Algorithms, Fourth Edition, hashing chapters
- Bloom’s original 1970 paper
- Database Internals, filter discussions
Key insights
Bloom filters convert expensive disk reads into cheap memory checks with controllable error rates.
Summary
Bloom filters reduce read amplification by quickly ruling out SSTables that cannot contain a key, trading memory for IO savings.
Homework/Exercises to practice the concept
- Implement a Bloom filter with configurable bits per key.
- Measure false positive rate for random keys.
Solutions to the homework/exercises
- Use a bit array and k hash functions derived from a 64-bit hash.
- Insert N keys, test M random keys, and measure the false positive ratio.
3. Project Specification
3.1 What You Will Build
A persistent key-value store with a WAL, memtable, immutable SSTables, and background compaction. The system supports PUT, GET, DELETE, and STATS in a CLI shell. It stores data in a directory containing log segments and SSTable files. It uses Bloom filters to minimize disk reads and tracks write amplification and compaction statistics.
3.2 Functional Requirements
- PUT key value appends to WAL and updates memtable.
- GET key returns the newest value or (nil).
- DELETE key writes a tombstone and returns OK.
- STATS reports memtable size, SSTable count, and write amplification.
- COMPACT triggers manual compaction.
- System must recover correctly after crash by replaying WAL.
3.3 Non-Functional Requirements
- Performance: sequential write path; compaction in background or on demand.
- Reliability: crash-safe WAL; no data loss after restart.
- Usability: clear CLI shell with deterministic outputs.
3.4 Example Usage / Output
$ ./lsm_kv --dir ./data
lsm> put user:1 alice
OK (seq=1)
lsm> put user:2 bob
OK (seq=2)
lsm> get user:1
alice
lsm> stats
segments=1 memtable=64KB sstables=0 write_amp=1.0x
lsm> compact
compacted 1 file -> 1 file (write_amp=1.5x)
3.5 Data Formats / Schemas / Protocols
CLI line protocol (single command per line). Data directory layout:
- wal-000001.log: WAL segment
- sst-000001.sst: SSTable file
- MANIFEST: active SSTables and sequence number
SSTable entry format:
[record_len][key_len][val_len][seq][key][val]
3.6 Edge Cases
- GET for a deleted key returns (nil) even if older SSTables contain a value.
- Crash during compaction: old files are still referenced by the manifest.
- WAL record truncated: recovery stops at last valid record.
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
cc -O2 -Wall -Wextra -o lsm_kv src/main.c src/wal.c src/memtable.c src/sstable.c src/compaction.c
./lsm_kv --dir ./data --memtable-mb 8 --fsync-every 1
3.7.2 Golden Path Demo (Deterministic)
$ ./lsm_kv --dir ./data
lsm> put a 1
OK (seq=1)
lsm> put b 2
OK (seq=2)
lsm> get a
1
lsm> stats
segments=1 memtable=2KB sstables=0 write_amp=1.0x
3.7.3 Failure Demo (Deterministic)
lsm> get missing
(nil)
lsm> put big <too_large>
ERR too_large
lsm> load /missing/file
ERR no_such_file
Exit codes:
- 0 on normal exit
- 3 on WAL corruption
4. Solution Architecture
4.1 High-Level Design
Client -> WAL append -> Memtable -> flush -> SSTables -> compaction
  ^                        |                    |
  +------- reads ----------+--------------------+
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| WAL | durability | checksums, fsync policy |
| Memtable | fast writes | sorted array or skip list |
| SSTable | persistent sorted runs | sparse index, immutable |
| Compaction | merge files | tiered policy, heap merge |
| Bloom filter | read avoidance | bits per key tuning |
4.3 Data Structures (No Full Code)
struct wal_record {
uint32_t len;
uint8_t type; // PUT/DEL
uint64_t seq;
uint32_t klen;
uint32_t vlen;
// key bytes, value bytes
uint32_t crc; // checksum over all preceding bytes
};
4.4 Algorithm Overview
Key Algorithm: Multi-way merge compaction
- Open iterators for each SSTable.
- Push current key into min-heap.
- Pop smallest key, keep newest version.
- Write to new SSTable.
Complexity Analysis:
- Time: O(n log k) for merge (k = number of SSTables)
- Space: O(k) for heap and iterators
5. Implementation Guide
5.1 Development Environment Setup
cc --version
make --version
5.2 Project Structure
lsm_kv/
├── src/
│ ├── main.c
│ ├── wal.c
│ ├── memtable.c
│ ├── sstable.c
│ └── compaction.c
├── tests/
│ └── test_recovery.c
└── README.md
5.3 The Core Question You’re Answering
“How do you trade read simplicity for massive write throughput?”
5.4 Concepts You Must Understand First
- WAL and fsync
- Sorted memtable + immutable SSTables
- Compaction and write amplification
- Bloom filters
5.5 Questions to Guide Your Design
- What memtable size will you flush at?
- What compaction threshold balances read vs write amplification?
- How will you detect torn WAL records?
5.6 Thinking Exercise
If you compact three 50MB SSTables into one 120MB SSTable, what is the write amplification for that compaction step?
5.7 The Interview Questions They’ll Ask
- Why are LSM trees write-optimized?
- How does a Bloom filter improve read performance?
- What is a tombstone and why do we need it?
5.8 Hints in Layers
- Hint 1: Start with WAL + memtable only, no SSTables.
- Hint 2: Add SSTable flush and read logic.
- Hint 3: Add compaction and Bloom filters last.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| LSM trees | Designing Data-Intensive Applications | Ch. 3 |
| Storage design | Database Internals | Ch. 4 |
| File IO | The Linux Programming Interface | Ch. 13-14 |
5.10 Implementation Phases
Phase 1: Foundation (6-10 hours)
- Goals: WAL, basic memtable, replay on startup
- Tasks: implement WAL writer/reader, simple in-memory map
- Checkpoint: crash and recovery test passes
Phase 2: Core Functionality (8-12 hours)
- Goals: SSTable flush and lookup
- Tasks: write sorted run, build sparse index, read path
- Checkpoint: GET works after flush and restart
Phase 3: Polish & Edge Cases (8-12 hours)
- Goals: compaction and Bloom filters
- Tasks: multi-way merge, stats, write amplification tracking
- Checkpoint: compaction reduces file count and stats update
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Memtable structure | skip list, sorted array | sorted array | simpler and adequate for project |
| Compaction policy | tiered, leveled | tiered | easier to implement |
| Fsync policy | every write, batched | every write (default) | correctness-first |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | record encoding | checksum validation |
| Integration Tests | WAL recovery | insert, crash, restart |
| Edge Case Tests | tombstones | delete then read |
6.2 Critical Test Cases
- Insert keys, flush, restart, verify values.
- Insert, delete, compact, verify delete persists.
- Truncate WAL mid-record and verify recovery stops safely.
6.3 Test Data
put a 1
put b 2
del a
get a
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| WAL corruption | crash on startup | validate length + checksum |
| Tombstones ignored | deleted keys reappear | apply newest version wins |
| Compaction bugs | missing keys | multi-way merge with seq numbers |
7.2 Debugging Strategies
- Add a WAL dump tool for inspecting records.
- Log compaction merge decisions with key and sequence number.
7.3 Performance Traps
- Compaction triggered too often increases write amplification.
- Too many SSTables increases read latency.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add a `scan` command for range queries.
- Add `--fsync-every N` batching.
8.2 Intermediate Extensions
- Implement leveled compaction and compare stats.
- Add block compression.
8.3 Advanced Extensions
- Build a background compaction thread.
- Implement manifest versioning and atomic swaps.
9. Real-World Connections
9.1 Industry Applications
- RocksDB: LSM tree with compaction and Bloom filters.
- Cassandra: memtable + SSTable storage.
9.2 Related Open Source Projects
- LevelDB: reference LSM implementation.
- RocksDB: production LSM engine.
9.3 Interview Relevance
- LSM vs B-tree tradeoffs
- WAL and crash recovery
10. Resources
10.1 Essential Reading
- Designing Data-Intensive Applications, Ch. 3
- Database Internals, Ch. 4-5
10.2 Video Resources
- “LSM Trees Explained” talks
10.3 Tools & Documentation
- `fio` or `hyperfine` for IO benchmarking
10.4 Related Projects in This Series
11. Self-Assessment Checklist
11.1 Understanding
- I can explain WAL and crash recovery.
- I can explain compaction and write amplification.
- I can explain Bloom filter tradeoffs.
11.2 Implementation
- All functional requirements are met.
- Recovery works after crash.
- Compaction produces correct results.
11.3 Growth
- I can discuss LSM vs B-tree in an interview.
12. Submission / Completion Criteria
Minimum Viable Completion:
- WAL + memtable + basic GET/PUT/DEL
- Recovery after crash
Full Completion:
- SSTables, compaction, Bloom filters
- Stats with write amplification
Excellence (Going Above & Beyond):
- Background compaction
- Configurable policies and benchmarking