Project 2: Log-Structured Key-Value Store (Mini LSM-Tree)
Build a persistent key-value store with WAL, memtable, SSTables, and background compaction.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Advanced |
| Time Estimate | 25-40 hours |
| Main Programming Language | C |
| Alternative Programming Languages | Rust, Go |
| Coolness Level | Level 4 |
| Business Potential | Open-core infrastructure |
| Prerequisites | File IO, sorting, hash tables, basic OS concepts |
| Key Topics | WAL, SSTables, compaction, bloom filters, tombstones |
1. Learning Objectives
By completing this project, you will:
- Implement a write-ahead log (WAL) for crash-safe durability.
- Build a memtable that flushes into immutable SSTables.
- Implement multi-way merge compaction and tombstone handling.
- Add Bloom filters to reduce unnecessary disk reads.
- Measure write amplification and read latency under different compaction policies.
2. All Theory Needed (Per-Concept Breakdown)
Write-Ahead Logging (WAL) and Crash Recovery
Fundamentals
A write-ahead log is a sequential file that records every change before it is applied to in-memory structures. The central idea is that the log is the authoritative history of updates, so after a crash you can replay it to restore the last known good state. WAL works because sequential writes are fast and because you can ensure durability by calling fsync after appending a record. In an LSM-style store, the WAL protects the memtable: you write the update to the log, then apply it to the memtable. If the process crashes, you rebuild the memtable by reading the log from the last checkpoint. WAL records must be self-describing and checksummed so that you can detect partial writes or corruption. Understanding WAL is essential because it is the difference between a fast cache and a durable database.
Deep Dive into the concept
WAL is the core durability mechanism for many storage engines. It follows a simple rule: the log must reach stable storage before the in-memory state is considered committed. This rule ensures that after a crash, the system can replay the log and recover all committed updates. Implementing WAL in a mini LSM tree involves defining a record format, writing records sequentially, and ensuring correct fsync behavior. A typical record format includes a length prefix, a record type (PUT/DEL), the key length, value length, the key bytes, and the value bytes. You must also include a checksum so you can detect a torn write. A torn write occurs when a crash happens mid-write and only part of the record is persisted. On recovery, you read records sequentially, validate the checksum and length, and stop when you detect an invalid record. This approach ensures that the log is always a prefix of valid records.
The performance of WAL is tied to fsync. Calling fsync on every write ensures durability but can be slow. Real systems batch fsync calls or use group commit. For this project, you can implement a simple policy: fsync after each write for correctness, and optionally add a --fsync-every option to batch. The correctness model must be explicit: if you batch fsync, you accept that the last few updates may be lost in a crash. This is still useful for demonstrating tradeoffs. The log is append-only, which means write performance is high and largely sequential. The OS page cache will buffer the writes, but fsync is the signal to flush to disk.
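The append-plus-fsync policy can be sketched as a small helper. This is an illustrative sketch, not the project's required API: the wal_t struct, its field names, and the fsync_interval knob are assumptions made here.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical WAL handle; the fsync policy knob is an assumption. */
typedef struct {
    int fd;
    uint64_t writes_since_sync;
    uint64_t fsync_interval;   /* 1 = fsync every record (safest) */
} wal_t;

/* Append a pre-serialized record, then fsync according to policy. */
int wal_append(wal_t *w, const void *rec, size_t len) {
    if (write(w->fd, rec, len) != (ssize_t)len)
        return -1;                      /* short write: treat as failure */
    if (++w->writes_since_sync >= w->fsync_interval) {
        if (fsync(w->fd) != 0) return -1;
        w->writes_since_sync = 0;       /* record is now durable */
    }
    return 0;
}
```

With fsync_interval greater than 1, the last few appends can be lost in a crash, which is exactly the tradeoff the batching policy accepts.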
WAL also interacts with compaction. When you flush a memtable to an SSTable, you can truncate the WAL up to the flush point, because those updates are now persisted in the SSTable. This requires a checkpoint mechanism. A simple approach is to rotate WAL segments: once a memtable flush completes, you close the current log file and start a new one. On recovery, you replay only the most recent log segments that are not yet included in SSTables. This is a form of log recycling. Your system should store a manifest file that records which log segments are active and which SSTables exist. The manifest can be updated atomically by writing a new file and renaming it, which is an atomic operation on most filesystems.
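The write-temp-then-rename pattern for the manifest might look like the following sketch. The /tmp paths and single-string contents are illustrative; a real store would build paths from its data directory and also fsync the directory so the rename itself is durable.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Atomic manifest update: write a temp file, fsync it, then rename it
 * over the old manifest. Paths here are illustrative placeholders. */
int manifest_write(const char *contents) {
    const char *tmp = "/tmp/lsm_manifest.tmp";
    int fd = open(tmp, O_CREAT | O_TRUNC | O_WRONLY, 0644);
    if (fd < 0) return -1;
    size_t len = strlen(contents);
    if (write(fd, contents, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    close(fd);
    /* rename() is atomic within one filesystem: readers observe either
     * the old manifest or the new one, never a partially written file.
     * A production system would also fsync the containing directory. */
    return rename(tmp, "/tmp/lsm_manifest");
}
```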
Recovery is straightforward but must be deterministic. On startup, you read the manifest, open the WAL segments, replay them into a new memtable, and then open SSTables. For each key, the most recent update wins. This means your WAL records should include a monotonically increasing sequence number or at least preserve order. Sequence numbers are important when you later merge SSTables, because you need to choose the newest version of a key among duplicates. In a simple system, you can rely on log order and per-SSTable sequence ranges, but a global sequence number is easier to reason about. Each write increments a counter and stores the sequence number in the log and the memtable entry.
Failure modes include partial writes, missing fsync, and incorrect recovery ordering. If you forget to fsync, you may lose updates that you believed were durable. If you ignore checksums, you may replay corrupted data and return wrong values. If you fail to stop at the first corrupted record, you may read garbage and crash. The WAL is also a potential source of write amplification if you do not recycle it. If you keep appending forever, log replay becomes slow. This is why checkpoints and log rotation are essential. By designing WAL carefully in this project, you will understand the core durability mechanism used by production databases and storage systems.
How this fits into the projects
WAL is central to §3.2 Functional Requirements and §3.7 Real World Outcome. It is implemented in §5.10 Phase 1 and tested in §6.2 Crash Recovery cases. It is also used in P06-capstone-simple-database-engine.md.
Definitions & key terms
- WAL -> write-ahead log, append-only record of updates
- fsync -> system call to flush file data to disk
- torn write -> partial record written before crash
- checkpoint -> point where WAL can be truncated
- sequence number -> monotonic id to order updates
Mental model diagram
Client PUT -> append WAL -> fsync -> apply to memtable
Crash -> replay WAL -> rebuild memtable -> serve reads
How it works (step-by-step, with invariants and failure modes)
- Serialize PUT/DEL into a WAL record with checksum.
- Append to WAL file and fsync.
- Apply update to memtable with sequence number.
- On restart, read WAL sequentially and validate checksums.
- Stop at first invalid record and build memtable from valid records.
Invariants: WAL records are append-only; memtable contains all durable updates since last flush. Failure modes: missing fsync, corrupted record, incorrect replay ordering.
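The replay steps above can be sketched over an in-memory buffer. The record layout ([u32 len][payload][u32 checksum]) and the FNV-1a checksum are deliberate simplifications standing in for the real record format and CRC32:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* FNV-1a, used here as a placeholder for CRC32. */
static uint32_t cksum(const uint8_t *p, size_t n) {
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < n; i++) { h ^= p[i]; h *= 16777619u; }
    return h;
}

/* Scan records laid out as [u32 len][payload][u32 cksum]; return the
 * number of valid records, stopping at the first torn or corrupt one.
 * This enforces the "log is a prefix of valid records" invariant. */
size_t wal_replay_count(const uint8_t *buf, size_t n) {
    size_t off = 0, valid = 0;
    while (off + 8 <= n) {                 /* need room for len + cksum */
        uint32_t len; memcpy(&len, buf + off, 4);
        if (off + 4 + len + 4 > n) break;  /* truncated record: stop */
        uint32_t want; memcpy(&want, buf + off + 4 + len, 4);
        if (cksum(buf + off + 4, len) != want) break;  /* corruption */
        valid++;
        off += 4 + len + 4;
    }
    return valid;
}
```

A real replay loop would apply each valid record to the memtable instead of just counting, but the stop condition is the same.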
Minimal concrete example
[record_len][type=PUT][seq=42][klen=5][vlen=3][key=alpha][val=one][crc]
Common misconceptions
- “WAL is optional” -> without it you lose data on crash.
- “fsync is cheap” -> it is expensive but necessary for durability.
- “You can ignore partial records” -> you must stop at first invalid record.
Check-your-understanding questions
- Why must the WAL be written before updating the memtable?
- What happens if you do not checksum WAL records?
- How does WAL rotation reduce recovery time?
Check-your-understanding answers
- If you crash after memtable update but before WAL, the update is lost.
- You might replay corrupted bytes and produce incorrect key/value pairs.
- You only replay recent logs rather than the full history.
Real-world applications
- PostgreSQL WAL and crash recovery.
- RocksDB log files for memtable durability.
Where you’ll apply it
- This project: §3.2, §3.7, §5.10 Phase 1, §6.2.
- Also used in: P06-capstone-simple-database-engine.md.
References
- Designing Data-Intensive Applications, Ch. 3
- Database Internals, Ch. 5
- The Linux Programming Interface, file IO chapters
Key insights
WAL turns volatile memory updates into durable history, enabling fast writes without sacrificing crash safety.
Summary
A write-ahead log is the durability backbone of the store. It is append-only, checksummed, and replayable. It enables fast sequential writes while guaranteeing recovery.
Homework/Exercises to practice the concept
- Implement a WAL writer with checksummed records and a replay tool.
- Simulate a crash by truncating the WAL mid-record and verify recovery stops safely.
Solutions to the homework/exercises
- Serialize records with a length prefix and CRC, then replay by scanning and validating.
- When CRC or length fails, stop replay and ignore trailing bytes.
Memtable and SSTable (Sorted Run) Design
Fundamentals
The memtable is an in-memory structure that stores recent writes in sorted order. When it reaches a size threshold, it is flushed to disk as an SSTable (Sorted String Table). SSTables are immutable, sorted files that enable efficient range scans and binary search. The memtable is often implemented as a skip list or a balanced tree, but for this project you can use a sorted array or a binary search tree. The key property is that it can output entries in sorted order when flushing. SSTables are immutable to keep writes sequential and simplify crash safety. Reads consult the memtable first, then the newest SSTables, and use the newest key version. This separation of mutable memtable and immutable SSTables is the defining feature of an LSM tree.
Deep Dive into the concept
Memtable design is a tradeoff between write speed, memory usage, and flush cost. A skip list provides O(log n) inserts and maintains sorted order, which is useful for range queries and for producing a sorted run during flush. A balanced tree offers similar properties. An array-based memtable can be faster for small sizes but requires sorting on flush. The flush cost is acceptable if you batch enough writes, because sorting is O(n log n) and the amortized cost per write can still be low. For this project, a skip list or sorted array is sufficient, as the main goal is to understand the flow from memory to disk.
An SSTable is an immutable file containing sorted key/value pairs, usually in blocks. A simple format is: a sequence of entries (key length, value length, key bytes, value bytes) plus a block index at the end. The block index maps key ranges to file offsets, enabling efficient binary search and range scans. A minimal implementation can store entries sequentially and use a sparse index: every Nth key and its offset. During lookup, you binary search the sparse index to find the block that could contain the key, then scan within that block. This keeps memory overhead small and lookup times acceptable.
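A minimal sketch of the sparse-index seek, assuming the index is an in-memory array of every Nth key with its file offset (all names are illustrative):

```c
#include <stddef.h>
#include <string.h>

/* Sparse index entry: every Nth key and the offset of its block. */
typedef struct { const char *key; long offset; } index_entry;

/* Binary search for the last indexed key <= target; the caller then
 * scans the block starting at the returned offset. Returns -1 when the
 * target sorts before the first indexed key. */
long sparse_index_seek(const index_entry *idx, size_t n, const char *target) {
    if (n == 0 || strcmp(target, idx[0].key) < 0)
        return -1;
    size_t lo = 0, hi = n;          /* invariant: idx[lo].key <= target */
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (strcmp(idx[mid].key, target) <= 0) lo = mid;
        else hi = mid;
    }
    return idx[lo].offset;
}
```

The caller scans forward from the returned offset until it finds the key or passes where it would sort, so lookup cost is a binary search plus at most one block scan.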
Immutability is critical. Because SSTables are never modified in place, writes are append-only and crash-safe. Compaction later merges SSTables to remove duplicates and tombstones. The memtable acts as a buffer that absorbs writes and provides fast reads for hot data. When the memtable flushes, you write a new SSTable in a single sequential pass, which is efficient on disk. This is why LSM trees are write-optimized: they convert random writes into sequential runs.
Reading requires a merge view across multiple sorted sources: the memtable and multiple SSTables. The latest version of a key is determined by sequence number, so each entry should include its sequence number. For a read, you search the memtable, then SSTables in order of recency, stopping at the first match or tombstone. This can be expensive if many SSTables exist, which is why compaction is necessary. Each SSTable can also carry a Bloom filter to quickly reject missing keys, but the key concept here is the sorted run itself.
Tombstones are delete markers. When you delete a key, you insert a tombstone record in the memtable. When flushed, the tombstone is written to an SSTable. During reads, the tombstone indicates the key is deleted, even if older SSTables contain it. During compaction, tombstones allow the system to drop older versions and remove the key entirely after a certain age or when it is safe.
The overall flow is: client writes -> WAL -> memtable; memtable flush -> SSTable; reads -> check memtable then SSTables. This design allows high write throughput because each write is O(log n) in memory plus an append to WAL. The cost is read amplification: you may need to check multiple SSTables. Bloom filters and compaction mitigate this. This is the heart of LSM tree design and is critical to understand before building the project.
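The newest-source-first read path can be sketched with toy in-memory runs. A NULL value stands in for a tombstone, and linear scans stand in for the real sorted lookups, so this only illustrates the precedence rule:

```c
#include <stddef.h>
#include <string.h>

/* Toy entry: val == NULL encodes a tombstone in this sketch. */
typedef struct { const char *key; const char *val; } entry;
typedef struct { const entry *entries; size_t n; } run;

/* Look up a key across sources ordered newest-first (runs[0] is the
 * memtable). The first hit wins; a tombstone hides older values.
 * Note the toy limitation: "deleted" and "never existed" both return
 * NULL, which a real store would distinguish. */
const char *lsm_get(const run *runs, size_t nruns, const char *key) {
    for (size_t r = 0; r < nruns; r++)
        for (size_t i = 0; i < runs[r].n; i++)
            if (strcmp(runs[r].entries[i].key, key) == 0)
                return runs[r].entries[i].val;  /* NULL => tombstone */
    return NULL;  /* not found in any source */
}
```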
How this fits into the projects
Memtable and SSTables are the core storage path in §3.2, §3.5, and §4.4. They are implemented in §5.10 Phase 2 and evaluated in §6.2. Concepts reappear in P05-btree-file-index.md as a contrast to B-tree storage.
Definitions & key terms
- memtable -> in-memory sorted structure for recent writes
- SSTable -> immutable sorted run on disk
- sorted run -> sequence of key/value pairs in sorted order
- tombstone -> delete marker stored like a value
- sparse index -> index of every Nth key for faster search
Mental model diagram
WAL -> memtable (sorted) -> flush -> SSTable 0001
                                 \-> SSTable 0002
Reads: memtable -> newest SSTable -> older SSTables
How it works (step-by-step, with invariants and failure modes)
- Insert into memtable (sorted by key) with sequence number.
- When size threshold reached, freeze memtable.
- Write frozen memtable to new SSTable sequentially.
- Add SSTable metadata to manifest.
- Replace memtable with a new empty one.
Invariants: memtable entries are sorted; SSTables are immutable and sorted. Failure modes: partial SSTable writes, missing manifest updates, or incorrect tombstone handling.
Minimal concrete example
SSTable entries:
[key_len][val_len][seq][key][val]
...
Index: [key=foo offset=0], [key=qux offset=4096]
Common misconceptions
- “SSTables are random access” -> they are sequential with sparse indexing.
- “Deletes remove keys immediately” -> deletes create tombstones.
- “Memtable can be unsorted” -> then flush would be expensive and complex.
Check-your-understanding questions
- Why does the memtable need to be sorted?
- How do tombstones prevent old values from resurfacing?
- Why are SSTables immutable?
Check-your-understanding answers
- Sorting allows efficient range scans and sequential flush to SSTables.
- The newest tombstone overrides older values during reads and compaction.
- Immutability keeps writes sequential and simplifies crash safety.
Real-world applications
- RocksDB and LevelDB store data in SSTables.
- Cassandra and HBase use memtables and SSTables.
Where you’ll apply it
- This project: §3.2, §3.5, §4.4, §5.10 Phase 2.
- Also used in: P05-btree-file-index.md for comparison.
References
- Designing Data-Intensive Applications, Ch. 3
- Database Internals, Ch. 4
- LevelDB implementation notes
Key insights
Memtables and SSTables transform random writes into sequential runs, trading read amplification for write throughput.
Summary
The memtable buffers writes and flushes to immutable sorted files. Reads merge across these sorted sources, and tombstones carry delete semantics. This is the LSM core.
Homework/Exercises to practice the concept
- Implement a sorted array memtable and a flush routine that writes to a file.
- Add a sparse index and implement binary search for lookup.
Solutions to the homework/exercises
- Collect entries, sort by key, and write sequentially with lengths.
- Store every Nth key and offset, binary search the index, then scan within the block.
Compaction and Multi-Way Merge
Fundamentals
Compaction is the process of merging multiple SSTables into fewer, larger SSTables to reduce read amplification and remove obsolete data. Because SSTables are immutable, updates produce new entries rather than overwriting old ones. Compaction scans multiple sorted files, merges them by key, and writes out a new sorted run that keeps only the newest version of each key and drops tombstones when safe. This reduces the number of files and improves lookup time. However, it rewrites data, which creates write amplification. Choosing when and how to compact is a fundamental tradeoff in LSM design.
From a systems perspective, you should be able to explain the invariants this concept maintains, the resources it consumes (CPU, memory, IO), and the typical failure modes when it is misused. This basic framing will help you reason about correctness before you optimize or extend the design.
Deep Dive into the concept
Compaction is where the LSM tree pays for its fast writes. Each new SSTable adds another file to check on reads, so you periodically merge files to keep read costs bounded. The simplest policy is tiered compaction: when you have N files of similar size, merge them into one larger file. This produces high write amplification because data can be rewritten multiple times as it moves to larger tiers, but it preserves write throughput. Leveled compaction maintains size ratios between levels and merges data into a fixed level, reducing read amplification but increasing compaction work. For this project, implement a simple tiered compaction policy: when the number of SSTables exceeds a threshold (e.g., 4), merge the oldest ones into a new file.
The merge algorithm is a multi-way merge of sorted lists. You can implement this with a min-heap keyed by the current key of each SSTable iterator. Each iterator reads entries sequentially from its file. The heap always yields the smallest key among the current entries. If multiple entries have the same key, you select the one with the highest sequence number (newest) and discard the others. Tombstones are treated as entries; if the newest entry is a tombstone, you output it or drop it depending on whether you are compacting at the last level. In this project, you can keep tombstones to preserve correctness, or you can drop them when you are sure no older SSTables contain the key. A simple rule: only drop tombstones during a full compaction that includes all SSTables.
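The duplicate-resolution rule (smallest key next, highest sequence number wins) can be sketched as below. For clarity the minimum is found by a linear scan over the iterators instead of a min-heap; the two are interchangeable for small k, and the types are illustrative:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct { const char *key; const char *val; uint64_t seq; } kv;
typedef struct { const kv *items; size_t n, pos; } iter;  /* one sorted run */

/* Merge k sorted iterators into out[]; among duplicates of the same key,
 * the entry with the highest sequence number (the newest) survives. */
size_t merge_runs(iter *its, size_t k, kv *out) {
    size_t written = 0;
    for (;;) {
        const char *min = NULL;
        for (size_t i = 0; i < k; i++)          /* find smallest live key */
            if (its[i].pos < its[i].n) {
                const char *c = its[i].items[its[i].pos].key;
                if (!min || strcmp(c, min) < 0) min = c;
            }
        if (!min) return written;               /* all iterators drained */
        kv best = { NULL, NULL, 0 };
        for (size_t i = 0; i < k; i++)          /* consume every duplicate */
            if (its[i].pos < its[i].n &&
                strcmp(its[i].items[its[i].pos].key, min) == 0) {
                const kv *c = &its[i].items[its[i].pos++];
                if (!best.key || c->seq > best.seq) best = *c;
            }
        out[written++] = best;                  /* newest version only */
    }
}
```

A heap version replaces the first inner loop with a pop and the advance with a push, dropping the per-step cost from O(k) to O(log k).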
Compaction must be crash safe. Because you are writing new SSTables, you should write to a temporary file and only rename it into place once the write is complete and fsync has succeeded. Update the manifest after the new file is durable, and only then delete the old files. This ensures that a crash in the middle of compaction does not lose data. The manifest should list the active SSTables in order of recency. During compaction, you build a new file, then replace the old set atomically by writing a new manifest file and renaming it over the old one.
Write amplification is the number of bytes written to disk per byte of user data. Compaction increases it because the same data is rewritten multiple times. You should measure this by tracking bytes read and written during compaction. For example, if you compact two 100MB files into one 100MB file, you read 200MB and write 100MB: 300MB of total IO to produce 100MB of output, or 3x relative to the surviving data. The ratio of total bytes written to user bytes written is the write amplification. Your stats command should report this so the learner sees the cost of compaction decisions.
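Tracking the ratio is a few lines of accounting; the struct and field names are illustrative. User bytes grow on every PUT/DELETE, while disk bytes grow on WAL appends, flushes, and compaction rewrites:

```c
#include <stdint.h>

/* Cumulative IO counters for the stats command (names are illustrative). */
typedef struct {
    uint64_t user_bytes;   /* bytes of key+value submitted by clients */
    uint64_t disk_bytes;   /* bytes written: WAL + flushes + compaction */
} io_stats;

/* write_amp = total bytes written to disk / bytes of user data written.
 * Reported as 1.0 before any writes to avoid dividing by zero. */
double write_amp(const io_stats *s) {
    return s->user_bytes
        ? (double)s->disk_bytes / (double)s->user_bytes
        : 1.0;
}
```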
Compaction also affects space amplification, which is the ratio of disk used to live data. Without compaction, many old versions and tombstones accumulate, increasing space usage. Compaction reduces space amplification but costs IO. This is the systems tradeoff: write amplification vs read amplification vs space amplification. By implementing compaction and measuring its cost, you will gain intuition that is critical for designing storage engines.
How this fits into the projects
Compaction is required in §3.2 and §3.7. It is implemented in §5.10 Phase 3 and tested in §6.2. It is also a key concept in P06-capstone-simple-database-engine.md.
Definitions & key terms
- compaction -> merging SSTables into fewer files
- multi-way merge -> merge sorted lists from multiple sources
- write amplification -> bytes written per byte of user data
- space amplification -> disk used vs live data
- leveled/tiered -> compaction strategies
Mental model diagram
SSTable A + SSTable B + SSTable C
        \       |       /
         multi-way merge
                |
        merged SSTable D
How it works (step-by-step, with invariants and failure modes)
- Select a set of SSTables to compact.
- Open iterators for each file and push current key into heap.
- Repeatedly pop smallest key, choose newest version by sequence.
- Write chosen entry to new SSTable.
- When done, fsync new file and update manifest.
- Delete old files.
Invariants: new SSTable is sorted; newest version wins; manifest reflects only active files. Failure modes: duplicate keys, lost data if manifest updated early, or torn output file.
Minimal concrete example
Input:
SST1: (a->1 seq=1), (c->3 seq=2)
SST2: (a->2 seq=3), (b->5 seq=4)
Output:
(a->2), (b->5), (c->3)
Common misconceptions
- “Compaction is optional” -> without it, reads slow down and space grows.
- “Compaction always helps” -> it increases write amplification.
- “Tombstones can be dropped anytime” -> only safe when older data is fully compacted.
Check-your-understanding questions
- Why does compaction reduce read amplification?
- What does write amplification measure?
- When is it safe to drop tombstones?
Check-your-understanding answers
- Fewer SSTables means fewer files to check on reads.
- The ratio of bytes written to disk to user-written bytes.
- When all older SSTables that could contain the key are included in compaction.
Real-world applications
- LevelDB and RocksDB compaction scheduling.
- Cassandra background compaction services.
Where you’ll apply it
- This project: §3.2, §3.7, §5.10 Phase 3, §7.3.
- Also used in: P06-capstone-simple-database-engine.md.
References
- Designing Data-Intensive Applications, Ch. 3
- Database Internals, Ch. 4-5
- LevelDB compaction design notes
Key insights
Compaction is the price of fast writes: it trades IO for reduced read and space amplification.
Summary
By merging sorted runs and keeping only the newest keys, compaction keeps read performance manageable and removes obsolete data, at the cost of extra writes.
Homework/Exercises to practice the concept
- Implement a two-way merge of sorted key/value files.
- Extend to a heap-based N-way merge and measure time.
Solutions to the homework/exercises
- Use two pointers and compare keys, writing the smallest each step.
- Use a min-heap keyed by current key from each iterator.
Bloom Filters for Read Avoidance
Fundamentals
A Bloom filter is a probabilistic data structure that answers set membership queries with possible false positives but no false negatives. In an LSM tree, each SSTable can have a Bloom filter that says whether a key might be present in that file. If the filter says “no”, you can skip reading the file. This reduces read amplification and disk IO. A Bloom filter is implemented as a bit array and several hash functions. To insert a key, you set bits at multiple hash positions. To query, you check those bits; if any bit is 0, the key is definitely not present. The false positive rate depends on the number of bits per key and the number of hash functions.
Deep Dive into the concept
Bloom filters are a performance hack that leverage probability to save IO. In LSM systems, reads are often the bottleneck because you may need to check multiple SSTables. A Bloom filter reduces this by quickly answering “not present” in memory, which is far faster than a disk read. The design tradeoff is memory vs accuracy: more bits per key means fewer false positives. A common target is 10 bits per key, giving a false positive rate of around 1%. You can compute the optimal number of hash functions as k = (m/n) * ln(2) where m is the number of bits and n is the number of keys.
To implement a Bloom filter, you store a bit array of size m and choose k hash functions. In practice, you can use one strong hash function and derive multiple hashes by mixing the result with different seeds. For example, compute a 64-bit hash and then derive k positions by adding a fixed delta or using the lower and upper bits in a double hashing scheme: h_i = h1 + i*h2. This avoids the cost of multiple independent hashes. When building an SSTable, you insert each key into the filter. Store the filter alongside the SSTable, either as a sidecar file or as a trailer section in the SSTable file. On lookup, load the filter into memory and check it before scanning the SSTable.
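A minimal filter using the double-hashing scheme above (h_i = h1 + i*h2) might look like this sketch. FNV-1a is used as the base hash for brevity; a production filter would use a stronger mixer, and all names are illustrative:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bloom filter over an m-bit array with k derived hash positions. */
typedef struct { uint8_t *bits; uint64_t m; int k; } bloom;

/* 64-bit FNV-1a; seeding gives the two base hashes for double hashing. */
static uint64_t fnv1a(const void *p, size_t n, uint64_t seed) {
    const uint8_t *b = p;
    uint64_t h = seed ^ 14695981039346656037ULL;
    for (size_t i = 0; i < n; i++) { h ^= b[i]; h *= 1099511628211ULL; }
    return h;
}

void bloom_add(bloom *f, const void *key, size_t n) {
    uint64_t h1 = fnv1a(key, n, 0), h2 = fnv1a(key, n, 0x9e3779b9);
    for (int i = 0; i < f->k; i++) {
        uint64_t pos = (h1 + (uint64_t)i * h2) % f->m;  /* h_i = h1 + i*h2 */
        f->bits[pos / 8] |= (uint8_t)(1u << (pos % 8));
    }
}

int bloom_maybe_contains(const bloom *f, const void *key, size_t n) {
    uint64_t h1 = fnv1a(key, n, 0), h2 = fnv1a(key, n, 0x9e3779b9);
    for (int i = 0; i < f->k; i++) {
        uint64_t pos = (h1 + (uint64_t)i * h2) % f->m;
        if (!(f->bits[pos / 8] & (1u << (pos % 8))))
            return 0;   /* a zero bit: definitely not present */
    }
    return 1;           /* all bits set: maybe present */
}
```

Note that hashing includes the key length, so variable-length keys with shared prefixes or embedded null bytes hash distinctly, as the pitfalls below require.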
Bloom filters have no false negatives, which is critical for correctness. If the filter says “not present”, you can safely skip the file. If it says “maybe”, you must check the SSTable. False positives increase read amplification but do not affect correctness. This means Bloom filters can be tuned without changing system semantics. You should expose a configuration option for bits per key and measure the impact on memory usage and read performance. In a small system, you can precompute the number of bits based on the number of keys during SSTable creation and write the filter size into the file header so it can be loaded on startup.
There are pitfalls. If you miscompute the number of keys or the hash function distribution is poor, the false positive rate can be much higher than expected. If you store the filter on disk and load it on demand, you might reintroduce IO overhead. For this project, load all filters at startup into memory. The memory overhead is small compared to the performance benefit. Another pitfall is using variable-length keys without consistent hashing; you must hash the exact bytes and length to avoid collisions due to null bytes or encoding differences.
Bloom filters also highlight the systems mindset: you trade a bit of memory and occasional false positives for a massive reduction in expensive disk seeks. This is the same principle as caching, but applied within the storage engine. Implementing Bloom filters in this project gives you a concrete understanding of probabilistic data structures and their real-world impact.
One more practical detail is how to size and persist the filter. You can compute the bit array length from an estimated key count at SSTable creation, then store the filter size and hash parameters in the SSTable header. If the estimate is wrong, your false positive rate changes, but correctness is still intact. This is why many systems store the actual key count and rebuild filters during compaction when the count is known precisely. You should also decide how to load filters on startup. Loading all filters into memory gives the fastest reads but uses more RAM. Lazy-loading filters saves memory but adds an extra read on the first query for each file. For this project, load all filters and expose the memory cost in stats so the tradeoff is visible.
How this fits into the projects
Bloom filters are required in §3.2 and §3.7. They are implemented in §5.10 Phase 3 and tested in §6.2. They also appear in P06-capstone-simple-database-engine.md for secondary indexes.
Definitions & key terms
- Bloom filter -> probabilistic membership structure
- false positive -> filter says yes when key not present
- bits per key -> memory parameter that controls accuracy
- double hashing -> deriving k hashes from two base hashes
Mental model diagram
key -> hash1, hash2 -> set bits -> [bit array]
query -> check bits -> maybe / no
How it works (step-by-step, with invariants and failure modes)
- Choose bit array size and number of hash functions.
- For each key in SSTable, set k bit positions.
- On lookup, compute k positions and test bits.
- If any bit is 0, skip the SSTable.
- If all bits are 1, read SSTable and verify key.
Invariants: no false negatives; consistent hash over key bytes. Failure modes: wrong hash inputs, corrupted filter, or incorrect size metadata.
Minimal concrete example
bit array size = 64 bits
key "foo" -> positions 3, 17, 41 -> set bits
lookup "bar" -> bit 17 = 0 -> definitely not present
Common misconceptions
- “Bloom filters guarantee presence” -> they only say “maybe”.
- “False positives are errors” -> they only cause extra IO, not incorrect results.
- “Hash count doesn’t matter” -> it strongly impacts false positive rate.
Check-your-understanding questions
- Why are Bloom filters safe to use for skipping SSTables?
- What happens if you reduce bits per key too much?
- How does double hashing reduce CPU cost?
Check-your-understanding answers
- They never return false negatives, so skipping is safe.
- False positives rise, causing more disk reads.
- You compute two hashes and derive many, instead of computing many hashes.
Real-world applications
- RocksDB per-file Bloom filters.
- Web caches using Bloom filters to avoid disk lookups.
Where you’ll apply it
- This project: §3.2, §3.7, §5.10 Phase 3.
- Also used in: P06-capstone-simple-database-engine.md.
References
- Algorithms, Fourth Edition, hashing chapters
- Bloom’s original 1970 paper
- Database Internals, filter discussions
Key insights
Bloom filters convert expensive disk reads into cheap memory checks with controllable error rates.
Summary
Bloom filters reduce read amplification by quickly ruling out SSTables that cannot contain a key, trading memory for IO savings.
Homework/Exercises to practice the concept
- Implement a Bloom filter with configurable bits per key.
- Measure false positive rate for random keys.
Solutions to the homework/exercises
- Use a bit array and k hash functions derived from a 64-bit hash.
- Insert N keys, test M random keys, and measure the false positive ratio.
3. Project Specification
3.1 What You Will Build
A persistent key-value store with a WAL, memtable, immutable SSTables, and background compaction. The system supports PUT, GET, DELETE, and STATS in a CLI shell. It stores data in a directory containing log segments and SSTable files. It uses Bloom filters to minimize disk reads and tracks write amplification and compaction statistics.
3.2 Functional Requirements
- PUT key value appends to WAL and updates memtable.
- GET key returns the newest value or (nil).
- DELETE key writes a tombstone and returns OK.
- STATS reports memtable size, SSTable count, and write amplification.
- COMPACT triggers manual compaction.
- System must recover correctly after crash by replaying WAL.
3.3 Non-Functional Requirements
- Performance: sequential write path; compaction in background or on demand.
- Reliability: crash-safe WAL; no data loss after restart.
- Usability: clear CLI shell with deterministic outputs.
3.4 Example Usage / Output
$ ./lsm_kv --dir ./data
lsm> put user:1 alice
OK (seq=1)
lsm> put user:2 bob
OK (seq=2)
lsm> get user:1
alice
lsm> stats
segments=1 memtable=64KB sstables=0 write_amp=1.0x
lsm> compact
compacted 1 file -> 1 file (write_amp=1.5x)
3.5 Data Formats / Schemas / Protocols
CLI line protocol (single command per line). Data directory layout:
- wal-000001.log: WAL segment
- sst-000001.sst: SSTable file
- MANIFEST: active SSTables and sequence number
SSTable entry format:
[record_len][key_len][val_len][seq][key][val]
3.6 Edge Cases
- GET for a deleted key returns (nil) even if older SSTables contain a value.
- Crash during compaction: old files are still referenced by the manifest.
- WAL record truncated: recovery stops at last valid record.
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
cc -O2 -Wall -Wextra -o lsm_kv src/main.c src/wal.c src/memtable.c src/sstable.c src/compaction.c
./lsm_kv --dir ./data --memtable-mb 8 --fsync-every 1
3.7.2 Golden Path Demo (Deterministic)
$ ./lsm_kv --dir ./data
lsm> put a 1
OK (seq=1)
lsm> put b 2
OK (seq=2)
lsm> get a
1
lsm> stats
segments=1 memtable=2KB sstables=0 write_amp=1.0x
3.7.3 Failure Demo (Deterministic)
lsm> get missing
(nil)
lsm> put big <too_large>
ERR too_large
lsm> load /missing/file
ERR no_such_file
Exit codes:
- 0 on normal exit
- 3 on WAL corruption
4. Solution Architecture
4.1 High-Level Design
Client -> WAL append -> Memtable -> flush -> SSTables -> compaction
  ^                        |                    |
  +------- reads ----------+--------------------+
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| WAL | durability | checksums, fsync policy |
| Memtable | fast writes | sorted array or skip list |
| SSTable | persistent sorted runs | sparse index, immutable |
| Compaction | merge files | tiered policy, heap merge |
| Bloom filter | read avoidance | bits per key tuning |
4.3 Data Structures (No Full Code)
struct wal_record {
uint32_t len;
uint8_t type; // PUT/DEL
uint64_t seq;
uint32_t klen;
uint32_t vlen;
// key bytes, value bytes
uint32_t crc; // checksum over all preceding bytes
};
4.4 Algorithm Overview
Key Algorithm: Multi-way merge compaction
- Open iterators for each SSTable.
- Push current key into min-heap.
- Pop smallest key, keep newest version.
- Write to new SSTable.
Complexity Analysis:
- Time: O(n log k) for merge (k = number of SSTables)
- Space: O(k) for heap and iterators
5. Implementation Guide
5.1 Development Environment Setup
cc --version
make --version
5.2 Project Structure
lsm_kv/
├── src/
│ ├── main.c
│ ├── wal.c
│ ├── memtable.c
│ ├── sstable.c
│ └── compaction.c
├── tests/
│ └── test_recovery.c
└── README.md
5.3 The Core Question You’re Answering
“How do you trade read simplicity for massive write throughput?”
5.4 Concepts You Must Understand First
- WAL and fsync
- Sorted memtable + immutable SSTables
- Compaction and write amplification
- Bloom filters
5.5 Questions to Guide Your Design
- What memtable size will you flush at?
- What compaction threshold balances read vs write amplification?
- How will you detect torn WAL records?
5.6 Thinking Exercise
If you compact three 50MB SSTables into one 120MB SSTable, what is the write amplification for that compaction step?
5.7 The Interview Questions They’ll Ask
- Why are LSM trees write-optimized?
- How does a Bloom filter improve read performance?
- What is a tombstone and why do we need it?
5.8 Hints in Layers
- Hint 1: Start with WAL + memtable only, no SSTables.
- Hint 2: Add SSTable flush and read logic.
- Hint 3: Add compaction and Bloom filters last.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| LSM trees | Designing Data-Intensive Applications | Ch. 3 |
| Storage design | Database Internals | Ch. 4 |
| File IO | The Linux Programming Interface | Ch. 13-14 |
5.10 Implementation Phases
Phase 1: Foundation (6-10 hours)
- Goals: WAL, basic memtable, replay on startup
- Tasks: implement WAL writer/reader, simple in-memory map
- Checkpoint: crash and recovery test passes
Phase 2: Core Functionality (8-12 hours)
- Goals: SSTable flush and lookup
- Tasks: write sorted run, build sparse index, read path
- Checkpoint: GET works after flush and restart
Phase 3: Polish & Edge Cases (8-12 hours)
- Goals: compaction and Bloom filters
- Tasks: multi-way merge, stats, write amplification tracking
- Checkpoint: compaction reduces file count and stats update
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Memtable structure | skip list, sorted array | sorted array | simpler and adequate for project |
| Compaction policy | tiered, leveled | tiered | easier to implement |
| Fsync policy | every write, batched | every write (default) | correctness-first |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | record encoding | checksum validation |
| Integration Tests | WAL recovery | insert, crash, restart |
| Edge Case Tests | tombstones | delete then read |
6.2 Critical Test Cases
- Insert keys, flush, restart, verify values.
- Insert, delete, compact, verify delete persists.
- Truncate WAL mid-record and verify recovery stops safely.
6.3 Test Data
put a 1
put b 2
del a
get a
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| WAL corruption | crash on startup | validate length + checksum |
| Tombstones ignored | deleted keys reappear | apply newest version wins |
| Compaction bugs | missing keys | multi-way merge with seq numbers |
7.2 Debugging Strategies
- Add a WAL dump tool for inspecting records.
- Log compaction merge decisions with key and sequence number.
7.3 Performance Traps
- Compaction triggered too often increases write amplification.
- Too many SSTables increases read latency.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add a `scan` command for range queries.
- Add `--fsync-every N` batching.
8.2 Intermediate Extensions
- Implement leveled compaction and compare stats.
- Add block compression.
8.3 Advanced Extensions
- Build a background compaction thread.
- Implement manifest versioning and atomic swaps.
9. Real-World Connections
9.1 Industry Applications
- RocksDB: LSM tree with compaction and Bloom filters.
- Cassandra: memtable + SSTable storage.
9.2 Related Open Source Projects
- LevelDB: reference LSM implementation.
- RocksDB: production LSM engine.
9.3 Interview Relevance
- LSM vs B-tree tradeoffs
- WAL and crash recovery
10. Resources
10.1 Essential Reading
- Designing Data-Intensive Applications, Ch. 3
- Database Internals, Ch. 4-5
10.2 Video Resources
- “LSM Trees Explained” talks
10.3 Tools & Documentation
- `fio` or `hyperfine` for IO benchmarking
10.4 Related Projects in This Series
11. Self-Assessment Checklist
11.1 Understanding
- I can explain WAL and crash recovery.
- I can explain compaction and write amplification.
- I can explain Bloom filter tradeoffs.
11.2 Implementation
- All functional requirements are met.
- Recovery works after crash.
- Compaction produces correct results.
11.3 Growth
- I can discuss LSM vs B-tree in an interview.
12. Submission / Completion Criteria
Minimum Viable Completion:
- WAL + memtable + basic GET/PUT/DEL
- Recovery after crash
Full Completion:
- SSTables, compaction, Bloom filters
- Stats with write amplification
Excellence (Going Above & Beyond):
- Background compaction
- Configurable policies and benchmarking