Project 12: High-Performance KV Store (Custom Everything)
“Databases are the cathedrals of the software world - built to last centuries, serving billions, and embodying our deepest understanding of data persistence.” - Andy Pavlo
Project Metadata
- Main Programming Language: Rust
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Difficulty: Level 5: Master
- Knowledge Area: Databases / Systems Engineering
- Time Estimate: 1-2 months
- Prerequisites:
- Completed Projects 3 (Arena Allocator), 6 (Lock-Free Queue), and 7 (Zero-Copy Parser)
- Strong understanding of unsafe Rust and raw pointers
- Familiarity with file I/O and memory mapping
- Basic understanding of concurrent programming and atomics
What You Will Build
A production-grade Key-Value store that uses a custom Arena Allocator for the index, Zero-Copy parsing for values, and Atomics for thread-safe access. This database engine will support millions of operations per second on a single core, rivaling the performance of Redis and RocksDB for specific workloads.
This is the “Final Boss” of the Advanced Rust learning path. It requires you to integrate almost every advanced concept: pinning, custom allocation, atomics, memory mapping, and lifetime management.
Learning Objectives
By the end of this project, you will be able to:
- Design a complete storage engine architecture - Understand the trade-offs between different data structures (LSM Trees, B+ Trees, Skip Lists) and choose appropriately for your workload
- Implement an append-only log with Write-Ahead Logging (WAL) - Build a crash-safe persistence layer that can recover from unexpected shutdowns without data loss
- Use memory-mapped files (mmap) for zero-copy I/O - Leverage the operating system’s virtual memory system to read data without copying it to user space
- Build a high-performance concurrent index - Implement a lock-free or fine-grained locking data structure that scales with CPU cores
- Apply arena allocation in a real system - Use your custom allocator knowledge to minimize allocation overhead in the hot path
- Implement compaction and garbage collection - Reclaim space from deleted or updated keys without blocking reads
- Achieve sub-microsecond latency on the hot path - Understand and eliminate latency sources, keeping the common case blazingly fast
Deep Theoretical Foundation
Before writing any code, you must understand the fundamental concepts that underpin all modern key-value stores. This section provides the theoretical foundation you’ll need.
Part 1: The Anatomy of a Storage Engine
Every key-value store, from Redis to RocksDB to DynamoDB, shares a common architectural pattern:
KEY-VALUE STORE ARCHITECTURE
====================================================================
CLIENT REQUEST
|
v
+-------------------------------------------------------------------+
| API LAYER |
| GET(key) -> Option<Value> |
| PUT(key, value) -> Result<()> |
| DELETE(key) -> Result<()> |
+-------------------------------------------------------------------+
|
v
+-------------------------------------------------------------------+
| CONCURRENCY LAYER |
| - Lock management (mutex, rwlock, or lock-free) |
| - Transaction isolation (if supported) |
| - MVCC (Multi-Version Concurrency Control) |
+-------------------------------------------------------------------+
|
v
+-------------------------------------------------------------------+
| INDEX LAYER |
| Maps: Key -> (File Offset, Value Length) |
| Data Structures: B-Tree, Skip List, Hash Table |
| Lives in: RAM (fast) or Disk (persistent) |
+-------------------------------------------------------------------+
|
v
+-------------------------------------------------------------------+
| STORAGE LAYER |
| - Write-Ahead Log (WAL) for durability |
| - Data files (append-only log or sorted segments) |
| - Compaction and garbage collection |
+-------------------------------------------------------------------+
|
v
+-------------------------------------------------------------------+
| FILE SYSTEM / OS |
| - fsync for durability guarantees |
| - mmap for zero-copy reads |
| - Page cache for read performance |
+-------------------------------------------------------------------+
|
v
DISK / SSD
Book Reference: “Designing Data-Intensive Applications” by Martin Kleppmann, Chapter 3: Storage and Retrieval - The definitive resource on storage engine design.
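To make the API layer concrete, here is a minimal sketch of it as a Rust trait. The trait name and signatures are illustrative only; the KvStore built later in this project exposes essentially the same three operations as inherent methods.
/// Minimal key-value API mirroring the GET/PUT/DELETE layer above.
pub trait KeyValueStore {
    /// Returns the stored value, or None if the key is absent or deleted.
    fn get(&self, key: &[u8]) -> Option<Vec<u8>>;
    /// Stores `value` under `key`, overwriting any previous value.
    fn put(&mut self, key: &[u8], value: &[u8]) -> std::io::Result<()>;
    /// Removes `key`; succeeds even if the key was never present.
    fn delete(&mut self, key: &[u8]) -> std::io::Result<()>;
}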
Part 2: LSM Trees vs B+ Trees - The Fundamental Trade-off
The choice between Log-Structured Merge Trees (LSM) and B+ Trees represents the most important architectural decision in storage engine design.
LSM TREE ARCHITECTURE (Write-Optimized)
====================================================================
WRITES flow through multiple levels, from memory to disk:
+-------------------+
| MEMTABLE | <- All writes go here first (in RAM)
| (Sorted Map) | Fast random writes, sorted by key
+-------------------+
|
| (Flush when full)
v
+-------------------+
| IMMUTABLE | <- Previous memtable being flushed
| MEMTABLE |
+-------------------+
|
| (Write to disk)
v
+-------------------+ +-------------------+
| LEVEL 0 |----->| LEVEL 0 | <- Unsorted SSTable files
| (SSTable) | | (SSTable) | May have overlapping key ranges
+-------------------+ +-------------------+
|
| (Compaction merges and sorts)
v
+-------------------+ +-------------------+ +-------------------+
| LEVEL 1 |----->| LEVEL 1 |----->| LEVEL 1 |
| (SSTable) | | (SSTable) | | (SSTable) |
+-------------------+ +-------------------+ +-------------------+
| Non-overlapping key ranges per level
| (Further compaction, each level 10x larger)
v
+-------------------+ +-------------------+ +-------------------+
| LEVEL 2 |----->| LEVEL 2 |----->| LEVEL 2 | ...
| (SSTable) | | (SSTable) | | (SSTable) |
+-------------------+ +-------------------+ +-------------------+
SSTable (Sorted String Table) Format:
+------------------------------------------------------------------------+
| DATA BLOCK | DATA BLOCK | DATA BLOCK | INDEX BLOCK | BLOOM FILTER|
+------------------------------------------------------------------------+
| | | | |
v v v v v
Sorted Sorted Sorted Points to Quick "might
key-value key-value key-value data blocks contain" check
pairs pairs pairs by key range
READ PATH in LSM Tree:
1. Check Memtable (RAM) - O(log n)
2. Check Immutable Memtable (RAM) - O(log n)
3. Check Level 0 SSTables (may check ALL files) - O(k * log n)
4. Check Level 1 SSTables (only one file per key) - O(log n)
5. Check Level 2 SSTables (only one file per key) - O(log n)
...
WORST CASE: Read amplification = O(levels * log n)
B+ TREE ARCHITECTURE (Read-Optimized)
====================================================================
All data lives in a single tree structure on disk:
+------------------+
| ROOT NODE |
| [50 | 100 | 150] |
+--------+---------+
|
+----------------------+----------------------+
| | |
v v v
+-------------+ +-------------+ +-------------+
| INTERNAL | | INTERNAL | | INTERNAL |
| [10|20|30] | | [60|70|80] | | [110|120] |
+------+------+ +------+------+ +------+------+
| | |
+-----+-----+ +-----+-----+ +-----+-----+
v v v v v v v v v
+---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+
|1-9| |11 | |21 | |51 | |61 | |71 | |101| |111| |121|
| | |to | |to | |to | |to | |to | |to | |to | |to |
| | |19 | |29 | |59 | |69 | |79 | |109| |119| |129|
+---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+
\ \ \ | | | / / /
\ \ \________|_____|_____|________/ / /
\ \_________ | | | _________/ /
\___________ \ | | | / __________/
\ \ | | | / /
v v v v v v v
LEAF NODES LINKED LIST (for range scans)
+---+<->+---+<->+---+<->+---+<->+---+
| 1 | | 11| | 21| | 51| | 61| ...
|...| |...| |...| |...| |...|
+---+ +---+ +---+ +---+ +---+
B+ Tree Node Format:
+------------------------------------------------------------------------+
| HEADER | KEY | PTR | KEY | PTR | KEY | PTR | KEY | PTR | ... | KEY | PTR |
+------------------------------------------------------------------------+
| | | | |
| | | | +-- Pointer to child (or value in leaf)
| | | +-- Separator key (all keys in right subtree >= this)
| | +-- Pointer to child (or value in leaf)
| +-- Separator key
+-- Node metadata (is_leaf, key_count, etc.)
READ PATH in B+ Tree:
1. Load root node from disk (usually cached) - 1 I/O
2. Binary search to find child pointer - O(log b)
3. Load next node - 1 I/O
4. Repeat until leaf - O(log_b n) I/Os total
...
BEST CASE: Read amplification = O(log_b n) where b = branching factor (~100-1000)
LSM vs B+ TREE COMPARISON
====================================================================
+-------------------+------------------+------------------+
| Metric | LSM Tree | B+ Tree |
+-------------------+------------------+------------------+
| Write throughput | EXCELLENT | GOOD |
| | (Sequential I/O) | (Random I/O) |
+-------------------+------------------+------------------+
| Read latency | VARIABLE | CONSISTENT |
| | (May check many | (Tree depth |
| | levels) | is predictable) |
+-------------------+------------------+------------------+
| Space efficiency | WORSE | BETTER |
| | (Write amplif.) | (In-place update)|
+-------------------+------------------+------------------+
| Write amplification| HIGH | MODERATE |
| | (Compaction) | (Page splits) |
+-------------------+------------------+------------------+
| Range scans | MODERATE | EXCELLENT |
| | (Merge iterators)| (Linked leaves) |
+-------------------+------------------+------------------+
| Point lookups | VARIABLE | EXCELLENT |
| | (Bloom filters | (Direct path) |
| | help a lot) | |
+-------------------+------------------+------------------+
USE LSM TREES WHEN:
- Write-heavy workload (logging, time-series, analytics)
- SSD storage (large sequential writes still reduce write amplification and flash wear, even though the raw sequential-vs-random gap is smaller than on HDDs)
- Can tolerate variable read latency
- Examples: RocksDB, LevelDB, Cassandra, ScyllaDB
USE B+ TREES WHEN:
- Read-heavy workload (OLTP databases)
- Need consistent read latency
- Range queries are common
- Examples: PostgreSQL, MySQL, SQLite, LMDB
Book Reference: “Database Internals” by Alex Petrov, Chapters 2-7 cover B-Trees and LSM Trees in extraordinary detail.
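To see why the comparison above says bloom filters “help a lot” for LSM point lookups: a few hashed bits per key let a reader skip any SSTable that definitely does not contain the key, turning most negative lookups into pure RAM work. The sketch below is a toy filter using only the standard library; production engines size the bit array from a target false-positive rate and use faster hash functions.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy bloom filter: `hashes` hash functions simulated by seeding one hasher.
struct BloomFilter {
    bits: Vec<bool>,
    hashes: u64,
}

impl BloomFilter {
    fn new(num_bits: usize, hashes: u64) -> Self {
        BloomFilter { bits: vec![false; num_bits], hashes }
    }

    fn bit_index(&self, key: &[u8], seed: u64) -> usize {
        let mut h = DefaultHasher::new();
        seed.hash(&mut h);
        key.hash(&mut h);
        (h.finish() as usize) % self.bits.len()
    }

    fn insert(&mut self, key: &[u8]) {
        for seed in 0..self.hashes {
            let i = self.bit_index(key, seed);
            self.bits[i] = true;
        }
    }

    /// false = definitely absent (skip this SSTable); true = might be present.
    fn might_contain(&self, key: &[u8]) -> bool {
        (0..self.hashes).all(|seed| self.bits[self.bit_index(key, seed)])
    }
}
An SSTable reader consults might_contain before touching its index block; a false answer avoids the disk read entirely.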
Part 3: Write-Ahead Logging (WAL) for Crash Recovery
Every durable storage engine uses Write-Ahead Logging: before any modification is applied to the main data structure, it is first written to a sequential log file and flushed to disk.
WRITE-AHEAD LOGGING (WAL) PRINCIPLE
====================================================================
The golden rule: "Never modify data until the log is on disk."
WITHOUT WAL (Dangerous!)
========================
1. User: PUT(key="name", value="Alice")
2. Engine: Update index in RAM
3. Engine: Write to data file
4. <<< POWER FAILURE HERE >>>
5. On restart: Index says key exists, but data file is corrupted/incomplete
6. RESULT: DATA CORRUPTION OR LOSS
WITH WAL (Safe!)
=================
1. User: PUT(key="name", value="Alice")
2. Engine: Write to WAL file
3. Engine: fsync() WAL file <-- This MUST complete before step 4
4. Engine: Acknowledge to user "Write successful"
5. Engine: Update index in RAM (can be lazy)
6. Engine: Write to data file (can be batched)
7. <<< POWER FAILURE HERE >>>
8. On restart:
a. Read WAL from beginning
b. Replay all operations that weren't applied to data file
c. Data integrity GUARANTEED
WAL FILE FORMAT
====================================================================
+--------+--------+--------+--------+--------+--------+
| Record | Record | Record | Record | Record | Record | ...
+--------+--------+--------+--------+--------+--------+
Each Record:
+----------+----------+----------+----------+------------------+----------+
| Length | CRC32 | Sequence | Type | Key/Value Data | Padding |
| (4 bytes)| (4 bytes)| (8 bytes)| (1 byte) | (variable) | (0-7 B) |
+----------+----------+----------+----------+------------------+----------+
| | | | | |
| | | | | +-- Align to 8 bytes
| | | | +-- Actual key+value
| | | +-- PUT=1, DELETE=2, COMMIT=3
| | +-- Monotonically increasing sequence number
| +-- Checksum for corruption detection
+-- Total record length (for skipping corrupted records)
WAL RECOVERY ALGORITHM
====================================================================
fn recover_from_wal(wal_path: &Path, index: &mut Index, data: &mut DataFile) {
    let mut reader = WalReader::open(wal_path);
    let last_valid_seq = data.get_last_sequence();
    let mut replayed_count = 0u64;

    while let Some(record) = reader.next_record() {
        // Skip records already applied
        if record.sequence <= last_valid_seq {
            continue;
        }
        // Verify checksum
        if record.crc32 != calculate_crc32(&record.data) {
            // Corruption detected! Truncate WAL here.
            reader.truncate_at_current_position();
            break;
        }
        // Replay the operation
        match record.record_type {
            RecordType::Put => {
                let (key, value) = parse_put_record(&record.data);
                let offset = data.append(key, value);
                index.insert(key, offset);
            }
            RecordType::Delete => {
                let key = parse_delete_record(&record.data);
                index.remove(key);
                data.append_tombstone(key);
            }
            RecordType::Commit => {
                // Batch commit marker
                data.sync();
            }
        }
        replayed_count += 1;
    }

    // Reset (truncate) the WAL after successful recovery
    reader.reset();
    println!("Recovery complete! Replayed {} operations.", replayed_count);
}
Key Insight: The WAL provides durability by ensuring that every committed write exists in a sequential log before acknowledging success. Recovery is simply replaying this log.
Book Reference: “Designing Data-Intensive Applications” Chapter 7: Transactions, covers WAL in the context of database transactions.
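The write side of the golden rule fits in a few lines. A minimal sketch using only the standard library; record framing (CRC, sequence number) is assumed to have been done by the caller, as described in the WAL format above.
use std::fs::OpenOptions;
use std::io::Write;
use std::path::Path;

/// Durable PUT ordering: append to the WAL, fsync, and only then acknowledge.
fn durable_put(wal_path: &Path, encoded_record: &[u8]) -> std::io::Result<()> {
    let mut wal = OpenOptions::new().create(true).append(true).open(wal_path)?;
    wal.write_all(encoded_record)?; // 1. Append the encoded record
    wal.sync_all()?;                // 2. fsync: the record is now on stable storage
    // 3. Only now may the caller acknowledge the write; the in-memory index
    //    and the data file can be updated lazily afterwards.
    Ok(())
}
A real engine keeps the WAL file open and amortizes the fsync across a batch of writes (group commit), which is exactly what the sync_every knob in Phase 5 does.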
Part 4: Memory Mapping (mmap) for Zero-Copy I/O
Memory mapping allows you to treat a file as if it were an array in memory. The operating system handles paging data in and out transparently.
MEMORY MAPPING (mmap) EXPLAINED
====================================================================
TRADITIONAL READ (with copying):
================================
User Space Kernel Space Disk
+---------------+ +---------------+ +------+
| Application | | Page Cache | | File |
| Buffer | | (RAM) | +------+
+---------------+ +---------------+ |
^ ^ |
| | |
+--- 3. Copy data to user space +--- 2. Read from disk -----+
| |
+--- 1. syscall read() -------------+
Total copies: 2 (disk -> page cache, page cache -> user buffer)
System calls: 1 (read)
MEMORY-MAPPED READ (zero-copy):
===============================
User Space Kernel Space Disk
+---------------+ +---------------+ +------+
| Application | | Page Cache | | File |
| (Virtual |------------------>| (RAM) |<-------------| Data |
| Memory) | Same physical | | Page fault +------+
+---------------+ pages! +---------------+ loads data
Total copies: 1 (disk -> page cache, then directly accessed)
System calls: 0 for reads (page faults handled by kernel)
HOW MMAP WORKS
====================================================================
1. SETUP: Request a memory mapping
let file = File::open("data.db")?;
let mmap = unsafe { Mmap::map(&file)? };
Virtual Address Space:
+---------------+---------------+---------------+
| Stack | ... | mmap region |
| (growing) | | (file backed) |
+---------------+---------------+---------------+
^
|
mmap.as_ptr()
2. FIRST ACCESS: Page fault triggers load
let value = mmap[offset..offset+len]; // First access to this page
CPU: "Address 0x7fff... is not in RAM!"
|
v
Kernel: "That's in the mmap region for data.db"
|
v
Kernel: Reads 4KB page from disk into page cache
|
v
Kernel: Maps page cache page to virtual address
|
v
CPU: Continues execution, data is now in RAM
3. SUBSEQUENT ACCESS: Cache hit
let value2 = mmap[offset+4096..]; // Same or nearby page
CPU: Page is in RAM, direct memory access!
NO system call, NO page fault, FASTEST possible!
MMAP ADVANTAGES FOR KV STORES
====================================================================
1. ZERO-COPY READS
- Data flows from disk to page cache ONCE
- Application reads directly from page cache
- No memcpy() to user buffers
2. OS-MANAGED CACHING
- Kernel's page cache acts as your cache
- LRU eviction handled automatically
- Shared between processes
3. LAZY LOADING
- Only pages that are accessed are loaded
- 100GB file can be mmap'd with 1GB RAM
- Perfect for sparse access patterns
4. SIMPLIFIED CODE
- Treat file as &[u8] slice
- Indexing, slicing, pattern matching all work
- No explicit read() calls or buffer management
MMAP PITFALLS
====================================================================
1. PAGE FAULTS ARE BLOCKING
- First access to a page may block for disk I/O
- Unpredictable latency spikes
- Solution: Use madvise() to pre-fault pages
2. WRITE VISIBILITY
- Writes may not be immediately visible on disk
- Must call msync() for durability
- mmap writes + crash = potential data loss
3. FILE GROWTH
- Cannot extend file size through mmap
- Must truncate/fallocate then remap
- Complex for growing files (use append-only)
4. 32-BIT ADDRESS SPACE LIMITS
- Only 4GB virtual address space on 32-bit
- Large files need chunked mapping
- Not an issue on 64-bit systems
MMAP FOR KV STORE READS (Zero-Copy Pattern)
====================================================================
struct DataFile {
mmap: Mmap, // Memory-mapped file
file: File, // For writes (append)
write_offset: AtomicU64,
}
impl DataFile {
fn get(&self, offset: u64, len: usize) -> &[u8] {
// ZERO COPY! Returns slice directly into mmap'd memory
&self.mmap[offset as usize..offset as usize + len]
}
fn put(&self, key: &[u8], value: &[u8]) -> u64 {
// Writes go through normal file I/O (append)
// Then the file is remapped to include new data
let offset = self.write_offset.fetch_add(
(key.len() + value.len() + HEADER_SIZE) as u64,
Ordering::SeqCst
);
// Write to file (buffered or direct)
self.file.write_at(&encode(key, value), offset); // write_at takes the buffer first, then the offset
offset
}
}
Book Reference: “The Linux Programming Interface” by Michael Kerrisk, Chapter 49: Memory Mappings, is the definitive guide to mmap.
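A minimal end-to-end example of the zero-copy read pattern, using the memmap2 crate that the snippets above already assume. The function name and signature are illustrative.
use memmap2::Mmap;
use std::fs::File;
use std::path::Path;

/// Reads `len` bytes at `offset` without a read() syscall: the slice points
/// straight into the kernel page cache. Safe here only because the data file
/// is append-only and never truncated while mapped.
fn read_value(path: &Path, offset: usize, len: usize) -> std::io::Result<Vec<u8>> {
    let file = File::open(path)?;
    let mmap = unsafe { Mmap::map(&file)? };
    let slice = &mmap[offset..offset + len]; // zero-copy view into the mapping
    Ok(slice.to_vec())                       // copy only at the API boundary
}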
Part 5: Append-Only Log Structure
Instead of modifying data in place (which requires seeking and risks partial writes), we append all changes to the end of a file. This is the foundation of both LSM trees and simpler bitcask-style engines.
APPEND-ONLY LOG FORMAT
====================================================================
ENTRY FORMAT:
+----------+----------+----------+----------+----------+----------+
| CRC32 | Tstamp | Key Len | Val Len | Key | Value |
| (4 bytes)| (8 bytes)| (4 bytes)| (4 bytes)| (var) | (var) |
+----------+----------+----------+----------+----------+----------+
| | | | | |
| | | | | +-- Value bytes
| | | | +-- Key bytes
| | | +-- Length of value (0 = tombstone)
| | +-- Length of key
| +-- Unix timestamp (for expiry, compaction ordering)
+-- CRC32 checksum of entire entry (for corruption detection)
EXAMPLE LOG FILE (data.log):
====================================================================
Offset Content
------ -------
0x0000 [CRC|ts|4|5|"name"|"Alice"] <- PUT name=Alice
0x001A [CRC|ts|4|3|"name"|"Bob"] <- PUT name=Bob (UPDATE!)
0x0030 [CRC|ts|3|6|"age"|"twenty"] <- PUT age=twenty
0x0048 [CRC|ts|4|0|"name"|] <- DELETE name (tombstone)
0x0058 [CRC|ts|5|12|"email"|"b@test.com"]<- PUT email=b@test.com
IN-MEMORY INDEX (HashMap<Key, IndexEntry>):
====================================================================
After processing the log above:
+--------+----------------------------------+
| Key | IndexEntry |
+--------+----------------------------------+
| "name" | DELETED (tombstone at 0x0048) |
| "age" | { offset: 0x0030, len: 6 } |
| "email"| { offset: 0x0058, len: 12 } |
+--------+----------------------------------+
Note: "name" has 3 entries in the log, but only the LAST one matters.
This is why compaction is needed - to reclaim space from old versions.
READ PATH:
====================================================================
fn get(&self, key: &[u8]) -> Option<&[u8]> {
// 1. Look up in index (O(1) hash lookup)
let entry = self.index.get(key)?;
// 2. Check if tombstone
if entry.is_tombstone() {
return None;
}
// 3. Read value from mmap'd file (zero-copy!)
let value = &self.mmap[entry.offset..entry.offset + entry.len];
Some(value)
}
WRITE PATH:
====================================================================
fn put(&mut self, key: &[u8], value: &[u8]) -> io::Result<()> {
// 1. Build entry
let entry = Entry::new(key, value, timestamp());
let encoded = entry.encode(); // With CRC32
// 2. Write to WAL (if using separate WAL)
self.wal.append(&encoded)?;
self.wal.sync()?; // CRITICAL: fsync before proceeding
// 3. Append to data file
let offset = self.data_file.append(&encoded)?;
// 4. Update index
self.index.insert(key.to_vec(), IndexEntry {
offset,
len: value.len(),
timestamp: entry.timestamp,
});
Ok(())
}
STARTUP (Index Reconstruction):
====================================================================
fn load_index_from_log(data_path: &Path) -> HashMap<Vec<u8>, IndexEntry> {
let mut index = HashMap::new();
let file = File::open(data_path)?;
let mmap = unsafe { Mmap::map(&file)? };
let mut offset = 0;
while offset < mmap.len() {
// Parse entry at current offset
let entry = Entry::decode(&mmap[offset..])?;
// Validate CRC
if entry.crc != calculate_crc(&entry) {
// Corruption! Stop here.
break;
}
// Update index (newer entries overwrite older)
if entry.is_tombstone() {
index.remove(&entry.key);
} else {
index.insert(entry.key.clone(), IndexEntry {
offset: offset + HEADER_SIZE,
len: entry.value.len(),
timestamp: entry.timestamp,
});
}
offset += entry.total_size();
}
index
}
Why Append-Only?
- Sequential writes are fast: SSDs and HDDs both perform much better with sequential I/O
- Crash safety: Partial writes only affect the end of the file, never existing data
- Simplicity: No need to manage free space within the file
- Concurrency: Readers never see partial writes; they read up to a known offset
Part 6: Compaction and Garbage Collection
As data is updated and deleted, the log accumulates “garbage” - old versions of keys that are no longer needed. Compaction reclaims this space.
COMPACTION PROCESS
====================================================================
BEFORE COMPACTION (data.log has 1000 entries, 500 are obsolete):
data.log (10 MB):
+----------+----------+----------+----------+----------+----------+
| name= | age=30 | name= | email= | name= | email= |
| Alice | (LIVE) | Bob | a@b.com | Charlie | c@d.com |
| (DEAD) | | (DEAD) | (DEAD) | (LIVE) | (LIVE) |
+----------+----------+----------+----------+----------+----------+
^ ^ ^
| | |
Only these 3 entries are still valid! ------+----------+
COMPACTION ALGORITHM:
1. Create new data file (data.log.compact):
+----------+----------+----------+
| age=30 | name= | email= |
| (LIVE) | Charlie | c@d.com |
+----------+----------+----------+
Only write entries that are in the current index!
2. Build new index pointing to new file:
+--------+----------------------------------+
| Key | IndexEntry (new offsets) |
+--------+----------------------------------+
| "age" | { offset: 0x0000, len: 2 } |
| "name" | { offset: 0x0014, len: 7 } |
| "email"| { offset: 0x002C, len: 7 } |
+--------+----------------------------------+
3. Atomically swap files:
- Rename data.log.compact -> data.log
- Update mmap to new file
- Delete old data.log
RESULT: 10 MB -> 3 MB (70% space savings!)
COMPACTION STRATEGIES
====================================================================
1. FULL COMPACTION (Simple, used in Bitcask):
+-------------------+ +-------------------+
| OLD FILE | --> | NEW FILE |
| (all data) | | (only live data) |
+-------------------+ +-------------------+
- Rewrites entire dataset
- Simple to implement
- High write amplification for large datasets
- Good for small-medium datasets
2. LEVELED COMPACTION (Used in LevelDB/RocksDB):
Level 0: [SST][SST][SST][SST] <- Raw memtable dumps
\ | /
\ | /
v v v
Level 1: [SST][SST][SST] <- Compacted, non-overlapping
|
v
Level 2: [SST][SST][SST][SST][SST][SST] <- 10x larger
- Each level is 10x the size of the previous
- Only overlapping SSTables are compacted together
- Bounds write amplification to O(levels)
3. SIZE-TIERED COMPACTION (Used in Cassandra):
Tier 0: [1MB][1MB][1MB][1MB] <- 4 similar-sized files
\ | | /
\ | | /
vvvv
Tier 1: [4MB][4MB][4MB][4MB] <- Merge when 4 similar-sized
|
v
Tier 2: [16MB][16MB][16MB]
- Merge files of similar size
- Better write amplification than leveled
- Worse space amplification
CONCURRENT COMPACTION (No Read Blocking)
====================================================================
The key insight: Old files remain readable until compaction completes.
Timeline:
---------
T0: Reader starts reading from old file at offset 100
T1: Compaction starts, creates new file
T2: Reader reads value from old file (still valid!)
T3: Compaction writes new file
T4: Reader finishes, releases reference to old file
T5: Compaction atomically swaps to new file
T6: Old file deleted (all readers finished)
Implementation using reference counting:
----------------------------------------
struct DataFile {
mmap: Mmap,
ref_count: AtomicUsize, // Incremented for each reader
}
impl DataFile {
fn get_reader(&self) -> DataFileReader {
self.ref_count.fetch_add(1, Ordering::AcqRel);
DataFileReader { file: self }
}
}
impl Drop for DataFileReader {
fn drop(&mut self) {
let prev = self.file.ref_count.fetch_sub(1, Ordering::AcqRel);
if prev == 1 {
// Last reader, safe to delete file
}
}
}
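In Rust, the hand-rolled reference count above is exactly what Arc already provides: readers clone an Arc<DataFile>, the compactor swaps a new one into place, and the old file is dropped (and can be deleted) only when the last reader finishes. A minimal sketch of that idea; the types and fields are illustrative.
use std::sync::{Arc, RwLock};

struct DataFile { /* mmap, file handle, ... */ }

struct Store {
    // The currently active data file. Readers clone the Arc and keep using
    // the old file even after the compactor swaps in a new one.
    active: RwLock<Arc<DataFile>>,
}

impl Store {
    fn reader(&self) -> Arc<DataFile> {
        Arc::clone(&self.active.read().unwrap())
    }

    fn swap_after_compaction(&self, new_file: DataFile) {
        // Readers holding the old Arc keep it alive; once the last clone is
        // dropped, the old DataFile (and its file handle) is released.
        *self.active.write().unwrap() = Arc::new(new_file);
    }
}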
Part 7: Concurrent Index Data Structures
The index maps keys to their location in the data file. For high concurrency, you need a thread-safe data structure.
INDEX DATA STRUCTURE OPTIONS
====================================================================
1. RWLOCK<HASHMAP> (Simple, Moderate Concurrency)
+------------------+
| RwLock |
| +------------+ |
| | HashMap | |
| | key -> off | |
| +------------+ |
+------------------+
Pros: Simple, uses std library
Cons: Writers block all readers
Throughput: ~500K ops/sec
Use when: Prototyping, low-medium load
2. DASHMAP (Sharded HashMap, Good Concurrency)
+--------+--------+--------+--------+
| Shard0 | Shard1 | Shard2 | Shard3 |
| RwLock | RwLock | RwLock | RwLock |
| Hash | Hash | Hash | Hash |
+--------+--------+--------+--------+
Key hash determines shard: shard = hash(key) % num_shards
Pros: Writers only block one shard
Cons: Still uses locks
Throughput: ~2-5M ops/sec
Use when: Good default choice
3. SKIP LIST (Lock-Free, Excellent Concurrency)
Level 3: [HEAD] ------------------> [50] ----------------------> [NIL]
| |
Level 2: [HEAD] --------> [25] ----> [50] --------> [75] ------> [NIL]
| | | |
Level 1: [HEAD] -> [10] -> [25] -> [35] -> [50] -> [60] -> [75] -> [NIL]
| | | | | | |
Level 0: [HEAD]->[10]->[20]->[25]->[30]->[35]->[50]->[55]->[60]->[70]->[75]->[NIL]
Pros: Lock-free reads AND writes (with CAS)
Supports range queries
Cons: Complex implementation
Memory overhead (pointers)
Throughput: ~5-10M ops/sec
Use when: High concurrency + range queries
4. CONCURRENT B-TREE (Lock Coupling)
+-------+
| root | <- Lock root
+---+---+
|
+-----------+-----------+
| |
+---+---+ +---+---+
| node1 | | node2 |
+---+---+ +---+---+
| |
v v
Release root lock once child is locked
("Lock coupling" or "crabbing")
Pros: Good for range queries
Cache-friendly
Cons: Complex implementation
Lock coupling overhead
Throughput: ~3-8M ops/sec
Use when: Range queries are primary
SKIP LIST IMPLEMENTATION SKETCH
====================================================================
const MAX_LEVEL: usize = 32;
struct SkipList<K, V> {
head: AtomicPtr<Node<K, V>>,
level: AtomicUsize,
}
struct Node<K, V> {
key: K,
value: AtomicPtr<V>, // Allows lock-free update
next: [AtomicPtr<Node<K, V>>; MAX_LEVEL],
level: usize,
}
impl<K: Ord, V> SkipList<K, V> {
fn get(&self, key: &K) -> Option<&V> {
let mut current = self.head.load(Ordering::Acquire);
// Start from top level, go down
for level in (0..=self.level.load(Ordering::Relaxed)).rev() {
// Go right while next key is less than target
loop {
let next = unsafe { (*current).next[level].load(Ordering::Acquire) };
if next.is_null() {
break;
}
let next_key = unsafe { &(*next).key };
if next_key >= key {
break;
}
current = next;
}
}
// Check if we found it
let next = unsafe { (*current).next[0].load(Ordering::Acquire) };
if !next.is_null() && unsafe { &(*next).key } == key {
let value_ptr = unsafe { (*next).value.load(Ordering::Acquire) };
Some(unsafe { &*value_ptr })
} else {
None
}
}
fn insert(&self, key: K, value: V) -> Option<V> {
// Complex CAS-based insertion omitted from this sketch;
// see crossbeam-skiplist for a production implementation.
let _ = (key, value);
unimplemented!()
}
}
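As the comment in insert() hints, the correct lock-free insertion is the hard part; the crossbeam-skiplist crate provides a well-tested implementation of exactly this structure. A usage sketch (API as of recent crossbeam-skiplist releases; check the crate docs for your version):
use crossbeam_skiplist::SkipMap;
use std::sync::Arc;
use std::thread;

fn main() {
    // Concurrent ordered map: lock-free reads and writes, supports range scans.
    let index: Arc<SkipMap<Vec<u8>, u64>> = Arc::new(SkipMap::new());

    let writer = {
        let index = Arc::clone(&index);
        thread::spawn(move || {
            index.insert(b"name".to_vec(), 0x1234); // key -> file offset
        })
    };
    writer.join().unwrap();

    if let Some(entry) = index.get(&b"name".to_vec()) {
        println!("offset = {:#x}", entry.value());
    }
}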
Part 8: Cache-Friendly Data Structures
Modern CPUs are much faster than memory. Cache-friendly data structures can be 10-100x faster.
CPU CACHE HIERARCHY
====================================================================
+------------------------------------------------------------------+
| CPU |
| +----------------------------------------------------------+ |
| | Core 0 Core 1 | |
| | +--------+ +--------+ | |
| | | L1d | 32KB, 4 cycles | L1d | | |
| | +--------+ +--------+ | |
| | +--------+ +--------+ | |
| | | L1i | 32KB, 4 cycles | L1i | | |
| | +--------+ +--------+ | |
| | +--------+ +--------+ | |
| | | L2 | 256KB, 12 cycles | L2 | | |
| | +--------+ +--------+ | |
| +----------------------------------------------------------+ |
| | |
| +----+----+ |
| | L3 | 8MB shared, 40 cycles |
| +---------+ |
+------------------------------------------------------------------+
|
| 100+ cycles
v
+----------+
| RAM | 32GB, 100-300 cycles
+----------+
CACHE LINE EFFECTS
====================================================================
Cache line = 64 bytes (on most modern CPUs)
GOOD: Sequential access (cache prefetcher happy)
for item in array.iter() {
// Each cache line brings 64 bytes
// If items are 8 bytes, get 8 items per cache load
process(item);
}
Access pattern: [----64 bytes----][----64 bytes----][----64 bytes----]
Cache loads: ^ ^ ^
1 2 3
BAD: Pointer chasing (linked list)
let mut node = head;
while let Some(n) = node {
// Each node is in a DIFFERENT cache line!
// Every access is a cache miss
process(n);
node = n.next;
}
Access pattern: [node1]......[node2]......[node3]......
Cache loads: ^ ^ ^
1 2 3
RESULT: 10-100x slower!
ARENA ALLOCATION FOR CACHE LOCALITY
====================================================================
HEAP ALLOCATION (scattered):
+------+ +------+ +------+ +------+
| node | | node | | node | | node |
| 0x100| |0x8000| |0x2000| |0x9000| <- Random addresses!
+------+ +------+ +------+ +------+
Each node access = likely cache miss
ARENA ALLOCATION (contiguous):
+------+------+------+------+------+------+------+------+
| node | node | node | node | node | node | node | node |
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+------+------+------+------+------+------+------+------+
^----- All in same or adjacent cache lines! -----^
Sequential access = CPU prefetcher loads ahead
Result: MUCH faster iteration
INDEX ENTRY LAYOUT FOR CACHE EFFICIENCY
====================================================================
BAD: Large entries, few per cache line
struct IndexEntryBad {
key: String, // 24 bytes (ptr, len, cap)
offset: u64, // 8 bytes
length: u32, // 4 bytes
timestamp: u64, // 8 bytes
flags: u32, // 4 bytes
_padding: [u8; 16], // 16 bytes
}
// Total: 64 bytes = 1 entry per cache line
GOOD: Compact entries, many per cache line
struct IndexEntryGood {
key_hash: u64, // 8 bytes (look up full key only on match)
offset: u64, // 8 bytes
length: u32, // 4 bytes
// timestamp and flags moved to data file
}
// Total: 20 bytes of fields (24 with alignment padding) = 2-3 entries per cache line
For 1 million keys:
Bad: 64 MB index, 1M cache lines touched
Good: 24 MB index, ~375K cache lines touched
Result: roughly 2.7x better cache utilization!
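These layout numbers are easy to verify with std::mem::size_of. Note that the compiler pads a struct to the alignment of its largest field, so the compact entry occupies 24 bytes unless you opt into #[repr(packed)] (usually not worth the unaligned-access cost):
use std::mem::size_of;

#[allow(dead_code)]
struct IndexEntryBad {
    key: String,        // 24 bytes (ptr, len, cap)
    offset: u64,
    length: u32,
    timestamp: u64,
    flags: u32,
    _padding: [u8; 16],
}

#[allow(dead_code)]
struct IndexEntryGood {
    key_hash: u64,      // full key compared only on a hash match
    offset: u64,
    length: u32,
}

fn main() {
    println!("bad  = {} bytes", size_of::<IndexEntryBad>());  // 64
    println!("good = {} bytes", size_of::<IndexEntryGood>()); // 24 (20 bytes of fields + padding)
}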
Part 9: Tail Latency and the Hot Path
In a high-performance system, average latency matters less than tail latency (p99, p999). A single slow operation can cascade into user-visible delays.
LATENCY DISTRIBUTION
====================================================================
Good System Bad System
Latency (us) (tight distribution) (long tail)
|
1000| *
| *
500| *
| * * *
200| * * * * *
| * * * * * * * * * *
100| * * * * * * * * * * * * *
| * * * * * * * * * * * * * * * * *
50| * * * * * * * * * * * * * * * * * * * * * * * *
|* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
10|* * * * * * * * * * * * *
+----------------------------------------------------------
p50 p90 p99 p999 p50 p90 p99 p999
Average: 45us Average: 85us
p99: 55us p99: 500us <- 10x WORSE!
p999: 60us p999: 1000us
The "hot path" = the code executed for every operation
Any slowdown on the hot path affects ALL requests.
COMMON TAIL LATENCY KILLERS
====================================================================
1. LOCK CONTENTION
Thread 1: acquire_lock() -> 10us operation -> release_lock()
Thread 2: acquire_lock() -> BLOCKED!!! -> 10us operation
Thread 3: acquire_lock() -> BLOCKED!!! -> BLOCKED!!! -> 10us
Solution: Lock-free data structures, sharding, read-write locks
2. MEMORY ALLOCATION
Hot path: Box::new(value) <- May call mmap(), take global lock
Solution: Arena allocation, object pooling
fn handle_request(&self, req: Request) {
let arena = Arena::new(4096); // One alloc per request
let parsed = parse_into_arena(&arena, req);
let result = process(parsed);
// Arena dropped, all memory freed at once
}
3. DISK I/O ON READ PATH
If index doesn't fit in RAM, reads may hit disk.
Solution:
- Keep index in RAM (or mmap it)
- Use bloom filters to avoid disk reads for missing keys
- Warm cache on startup
4. COMPACTION BLOCKING READS
Bad: Lock index during compaction (seconds of latency)
Good: Copy-on-write index, atomic swap after compaction
5. GC PAUSES (in other languages)
Rust advantage: No GC! Deterministic memory management.
HOT PATH OPTIMIZATION CHECKLIST
====================================================================
[ ] No allocations (use arena or pre-allocated buffers)
[ ] No locks (or use reader-writer locks, never block readers)
[ ] No syscalls (use mmap for reads)
[ ] No I/O (all index in RAM)
[ ] No unbounded loops (O(1) or O(log n) only)
[ ] Cache-aligned data structures (avoid false sharing)
[ ] Branch prediction friendly (likely paths first)
MEASURING LATENCY IN RUST
====================================================================
use std::time::Instant;
use hdrhistogram::Histogram;
fn benchmark_operations(kvstore: &KvStore, ops: usize) {
let mut histogram = Histogram::<u64>::new(3).unwrap();
for i in 0..ops {
let key = format!("key{}", i);
let start = Instant::now();
let _ = kvstore.get(&key);
let elapsed = start.elapsed().as_nanos() as u64;
histogram.record(elapsed).unwrap();
}
println!("Latency Distribution:");
println!(" p50: {} ns", histogram.value_at_percentile(50.0));
println!(" p90: {} ns", histogram.value_at_percentile(90.0));
println!(" p99: {} ns", histogram.value_at_percentile(99.0));
println!(" p999: {} ns", histogram.value_at_percentile(99.9));
println!(" max: {} ns", histogram.max());
}
Part 10: Real-World KV Store Architectures
Let’s examine how production systems implement these concepts.
REDIS ARCHITECTURE (In-Memory)
====================================================================
+------------------------------------------------------------------+
| REDIS |
+------------------------------------------------------------------+
| |
| SINGLE-THREADED EVENT LOOP (no locks needed!) |
| +-------------------------------------------------------------+ |
| | while true { | |
| | events = epoll_wait(fd, ...) | |
| | for event in events { | |
| | handle_client_request(event) | |
| | } | |
| | } | |
| +-------------------------------------------------------------+ |
| |
| IN-MEMORY DATA STRUCTURES |
| +------------------+ +------------------+ +------------------+ |
| | Dict (HashMap) | | Sorted Set | | List | |
| | for Strings | | (Skip List) | | (Linked List) | |
| +------------------+ +------------------+ +------------------+ |
| |
| PERSISTENCE (Optional) |
| +------------------+ +------------------+ |
| | RDB (Snapshot) | | AOF (Append Log) | |
| | Point-in-time | | Every write | |
| +------------------+ +------------------+ |
+------------------------------------------------------------------+
Performance: 100K+ ops/sec single-threaded
Limitation: Data must fit in RAM
ROCKSDB ARCHITECTURE (LSM-based)
====================================================================
+------------------------------------------------------------------+
| ROCKSDB |
+------------------------------------------------------------------+
| |
| WRITE PATH |
| +-------------------------------------------------------------+ |
| | 1. Write to WAL (fsync) | |
| | 2. Insert into MemTable (SkipList in RAM) | |
| | 3. When MemTable full -> Flush to Level 0 SSTable | |
| +-------------------------------------------------------------+ |
| |
| +------------------+ |
| | MemTable | <- Mutable, in RAM |
| | (Active) | |
| +------------------+ |
| +------------------+ |
| | MemTable | <- Immutable, being flushed |
| | (Frozen) | |
| +------------------+ |
| | |
| v |
| +------------------+------------------+------------------+ |
| | L0 SST | L0 SST | L0 SST | L0 SST | |
| +------------------+------------------+------------------+ |
| | |
| v (Compaction) |
| +------------------+------------------+------------------+ |
| | L1 SST | L1 SST | L1 SST | |
| +------------------+------------------+------------------+ |
| | |
| v (Compaction) |
| +------------------+------------------+------------------+ ... |
| | L2 SST | L2 SST | L2 SST | L2 SST | L2 SST | L2 SST | |
| +------------------+------------------+------------------+ ... |
| |
+------------------------------------------------------------------+
Performance: 500K+ random writes/sec, 100K+ random reads/sec
Strength: Handles datasets larger than RAM
BITCASK ARCHITECTURE (Append-Only + In-Memory Index)
====================================================================
+------------------------------------------------------------------+
| BITCASK |
+------------------------------------------------------------------+
| |
| WRITE PATH (Simple!) |
| +-------------------------------------------------------------+ |
| | 1. Append entry to data file | |
| | 2. Update in-memory hash index | |
| +-------------------------------------------------------------+ |
| |
| IN-MEMORY INDEX |
| +------------------+ |
| | HashMap | |
| | key -> { | |
| | file_id, | |
| | offset, | |
| | size, | |
| | timestamp | |
| | } | |
| +------------------+ |
| | |
| | Points to |
| v |
| DATA FILES (append-only) |
| +------------------+------------------+------------------+ |
| | data.0 (old) | data.1 | data.2 (active) | |
| +------------------+------------------+------------------+ |
| |
| COMPACTION |
| +-------------------------------------------------------------+ |
| | Merge old files, keep only latest value per key | |
| | Atomic file swap when complete | |
| +-------------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
Performance: Very fast writes, single-seek reads
Limitation: All keys must fit in RAM (values on disk)
This is the architecture we'll implement!
Book Reference: “Database Internals” by Alex Petrov provides detailed coverage of RocksDB, LevelDB, and B-Tree based systems.
Real World Outcome
When complete, you will have built a production-quality key-value store with impressive performance characteristics:
$ ./my_kv_store --bench
╔══════════════════════════════════════════════════════════════════╗
║ High-Performance KV Store Benchmark ║
╠══════════════════════════════════════════════════════════════════╣
║ System: Apple M2 Pro, 32GB RAM, APFS SSD ║
║ Build: Release mode, LTO enabled ║
╚══════════════════════════════════════════════════════════════════╝
[Phase 1: Write Performance]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Writing 1,000,000 keys (16-byte key, 100-byte value)...
Progress: [████████████████████████████████████████] 100%
Total time: 0.453 seconds
Throughput: 2.21 million writes/sec
Avg latency: 452 ns/write
p99 latency: 1.2 us/write
Data written: 116 MB
Write speed: 256 MB/s
WAL writes: 1,000,000
WAL fsync: 1,000 (every 1000 writes)
Index updates: 1,000,000
[Phase 2: Read Performance]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Reading 1,000,000 random keys...
Progress: [████████████████████████████████████████] 100%
Total time: 0.312 seconds
Throughput: 3.21 million reads/sec
Avg latency: 312 ns/read
p99 latency: 850 ns/read
Index lookups: 1,000,000
Cache hits: 99.2% (mmap page cache)
Zero-copy: 100% (no memcpy in hot path)
[Phase 3: Mixed Workload (80% read, 20% write)]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Running 1,000,000 mixed operations...
Progress: [████████████████████████████████████████] 100%
Total time: 0.372 seconds
Throughput: 2.69 million ops/sec
Read latency: 310 ns avg
Write latency: 498 ns avg
[Phase 4: Concurrent Access (8 threads)]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Running 8,000,000 operations across 8 threads...
Progress: [████████████████████████████████████████] 100%
Total time: 0.891 seconds
Throughput: 8.98 million ops/sec (8 threads combined)
Per-thread: 1.12 million ops/sec
Lock contention: 0.3% (lock-free reads)
[Phase 5: Crash Recovery Test]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Simulating crash after 500,000 writes...
Kill signal sent: SIGKILL
Restarting and recovering...
WAL replay: 500,000 entries
Index reconstruction: 0.234 seconds
Verifying data integrity...
Keys recovered: 500,000 / 500,000 (100%)
Data corruption: 0 entries
✓ RECOVERY SUCCESSFUL
[Phase 6: Compaction Test]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Writing 1,000,000 keys, then updating 500,000 of them...
Before compaction:
Data file size: 232 MB
Live data: 116 MB (50%)
Dead data: 116 MB (50%)
Running compaction...
Compaction time: 1.234 seconds
After compaction:
Data file size: 116 MB
Space reclaimed: 116 MB (50%)
Reads during compaction: NO BLOCKING
Writes during compaction: NO BLOCKING
═══════════════════════════════════════════════════════════════════
PERFORMANCE SUMMARY
╔═══════════════════════════════════════════════════════════════╗
║ Metric │ Result │ Target ║
╠═══════════════════════════════════════════════════════════════╣
║ Write throughput │ 2.21M ops/sec │ 1M ops/sec ✓ ║
║ Read throughput │ 3.21M ops/sec │ 2M ops/sec ✓ ║
║ Mixed (8 threads) │ 8.98M ops/sec │ 5M ops/sec ✓ ║
║ Write latency p99 │ 1.2 us │ < 5 us ✓ ║
║ Read latency p99 │ 850 ns │ < 2 us ✓ ║
║ Crash recovery │ 100% success │ 100% ✓ ║
║ Compaction blocking │ None │ None ✓ ║
╚═══════════════════════════════════════════════════════════════╝
Memory Usage:
Index (1M keys): 48 MB
Arena allocator: 8 MB (reused per request)
mmap (data file): Virtual only (116 MB mapped, ~4 MB resident)
Total RSS: 62 MB
$ ./my_kv_store interactive
╔══════════════════════════════════════════════════════════════════╗
║ RustKV Interactive Shell ║
╠══════════════════════════════════════════════════════════════════╣
║ Commands: GET <key>, SET <key> <value>, DEL <key>, STATS, QUIT ║
╚══════════════════════════════════════════════════════════════════╝
rustkv> SET user:1 {"name":"Alice","age":30}
OK (312 ns)
rustkv> GET user:1
{"name":"Alice","age":30} (198 ns)
rustkv> SET user:2 {"name":"Bob","age":25}
OK (287 ns)
rustkv> STATS
╔═══════════════════════════════════════════════════════════════╗
║ Database Statistics ║
╠═══════════════════════════════════════════════════════════════╣
║ Keys: 2 ║
║ Data file size: 156 bytes ║
║ Index memory: 128 bytes ║
║ Operations: 4 (2 writes, 2 reads) ║
║ Uptime: 12.3 seconds ║
╚═══════════════════════════════════════════════════════════════╝
rustkv> QUIT
Goodbye! (Data synced to disk)
Complete Project Specification
You will build a Bitcask-style key-value store with the following components:
Core Features
- API Operations
  - GET(key: &[u8]) -> Option<Vec<u8>> - Retrieve a value
  - PUT(key: &[u8], value: &[u8]) -> Result<()> - Store a key-value pair
  - DELETE(key: &[u8]) -> Result<()> - Remove a key
  - SCAN(start: &[u8], end: &[u8]) -> Iterator<Item=(Key, Value)> - Range query (extension)
- Durability
- Write-ahead logging with a configurable fsync policy (see the SyncPolicy sketch after this list)
- Crash recovery by replaying WAL on startup
- Atomic compaction with no data loss
- Performance
- Arena allocator for index nodes
- Memory-mapped data file for zero-copy reads
- Lock-free or fine-grained locking for concurrent access
- Operations
- Compaction to reclaim space from deleted/updated keys
- Statistics and metrics reporting
- Configurable memory limits
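One way to express the “configurable fsync policy” listed under Durability is a small enum that the WAL consults on every append; Phase 5 below implements the EveryN flavour via its sync_every counter. This is a sketch of one possible design, not a fixed part of the spec.
/// When the write-ahead log calls fsync. Stronger settings trade throughput
/// for a smaller window of acknowledged-but-unsynced writes.
pub enum SyncPolicy {
    /// fsync after every write: no acknowledged write is ever lost.
    Always,
    /// fsync after every N writes: bounded loss window, much higher throughput.
    EveryN(usize),
    /// Leave it to the OS's periodic page-cache writeback: fastest, weakest.
    OsManaged,
}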
Solution Architecture
PROJECT STRUCTURE
====================================================================
rustkv/
├── Cargo.toml
├── src/
│ ├── lib.rs # Public API and KvStore struct
│ ├── index/
│ │ ├── mod.rs # Index trait and implementations
│ │ ├── hashmap.rs # Simple RwLock<HashMap> index
│ │ └── skiplist.rs # Lock-free skip list (extension)
│ ├── storage/
│ │ ├── mod.rs # Storage trait
│ │ ├── data_file.rs # Append-only data file with mmap
│ │ ├── wal.rs # Write-ahead log
│ │ └── compaction.rs # Background compaction
│ ├── allocator/
│ │ └── arena.rs # Arena allocator for index
│ ├── format/
│ │ ├── mod.rs # Entry encoding/decoding
│ │ └── crc.rs # CRC32 checksum
│ └── cli.rs # Command-line interface
├── benches/
│ └── throughput.rs # Criterion benchmarks
└── tests/
├── basic_ops.rs # Unit tests
├── crash_recovery.rs # Crash simulation tests
└── concurrent.rs # Multi-threaded tests
COMPONENT DIAGRAM
====================================================================
┌─────────────────────────────────────────────────────────────────┐
│ KvStore │
│ │
│ ┌─────────────────────────────────────────────────────────────┐│
│ │ API Layer ││
│ │ pub fn get(&self, key: &[u8]) -> Option<Vec<u8>> ││
│ │ pub fn put(&mut self, key: &[u8], val: &[u8]) -> Result ││
│ │ pub fn delete(&mut self, key: &[u8]) -> Result ││
│ └─────────────────────────────────────────────────────────────┘│
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐│
│ │ Index Layer ││
│ │ ┌─────────────┐ ││
│ │ │ Index │ HashMap<Key, IndexEntry> ││
│ │ │ (in RAM) │ or SkipList<Key, IndexEntry> ││
│ │ │ │ ││
│ │ │ Allocated │ ← Uses Arena Allocator ││
│ │ │ in Arena │ for cache locality ││
│ │ └─────────────┘ ││
│ └─────────────────────────────────────────────────────────────┘│
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐│
│ │ Storage Layer ││
│ │ ││
│ │ ┌──────────┐ ┌──────────────┐ ┌──────────────┐ ││
│ │ │ WAL │ │ Data File │ │ Compaction │ ││
│ │ │ │ │ │ │ Thread │ ││
│ │ │ Durability│ │ mmap'd for │ │ │ ││
│ │ │ guarantee │ │ zero-copy │ │ Reclaims │ ││
│ │ │ │ │ reads │ │ space │ ││
│ │ └──────────┘ └──────────────┘ └──────────────┘ ││
│ └─────────────────────────────────────────────────────────────┘│
│ │ │
│ ▼ │
│ File System │
└─────────────────────────────────────────────────────────────────┘
DATA FLOW: PUT OPERATION
====================================================================
User: kvstore.put(b"name", b"Alice")
│
▼
┌───────────────────────────────────────────────────────────────┐
│ 1. Encode Entry │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ [CRC32][Timestamp][KeyLen][ValLen][Key][Value] │ │
│ │ [4B] [8B] [4B] [4B] [4B] [5B] │ │
│ └──────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ 2. Write to WAL (with fsync) │
│ wal.log ────────────────────────────────────────────── │
│ │ │
│ ▼ │
│ 3. Append to Data File │
│ data.log ──────────────────────────────────────[ENTRY]── │
│ Returns offset = 1234 │
│ │ │
│ ▼ │
│ 4. Update Index │
│ index.insert("name", IndexEntry { offset: 1234, len: 5 }) │
│ │ │
│ ▼ │
│ 5. Return Ok(()) │
└───────────────────────────────────────────────────────────────┘
DATA FLOW: GET OPERATION
====================================================================
User: kvstore.get(b"name")
│
▼
┌───────────────────────────────────────────────────────────────┐
│ 1. Index Lookup (O(1) hash or O(log n) skip list) │
│ let entry = index.get("name")?; // IndexEntry │
│ entry = { offset: 1234, len: 5 } │
│ │ │
│ ▼ │
│ 2. Read from mmap (ZERO COPY!) │
│ let value = &mmap[1234..1234+5]; // Slice into mmap │
│ // No memcpy! Direct pointer to kernel page cache │
│ │ │
│ ▼ │
│ 3. Return Some(value.to_vec()) │
│ // Only copy when returning to user │
│ // Could return &[u8] for even more zero-copy │
└───────────────────────────────────────────────────────────────┘
Phased Implementation Guide
Phase 1: Simple In-Memory Hash Map with Persistence
Goal: Get a working key-value store that persists to disk.
Duration: 2-3 days
// src/lib.rs
use crate::format::Entry; // Entry encode/decode is defined in Phase 2
use std::collections::HashMap;
use std::fs::{File, OpenOptions};
use std::io::{BufReader, BufWriter, Write};
use std::path::{Path, PathBuf};
pub struct KvStore {
index: HashMap<Vec<u8>, Vec<u8>>,
data_file: BufWriter<File>,
data_path: PathBuf,
}
impl KvStore {
pub fn open(path: &Path) -> std::io::Result<Self> {
let data_path = path.join("data.log");
// Load existing data
let mut index = HashMap::new();
if data_path.exists() {
let file = File::open(&data_path)?;
let mut reader = BufReader::new(file);
// Parse entries and build index
while let Ok(entry) = Entry::decode(&mut reader) {
if entry.is_tombstone() {
index.remove(&entry.key);
} else {
index.insert(entry.key, entry.value);
}
}
}
// Open for appending
let file = OpenOptions::new()
.create(true)
.append(true)
.open(&data_path)?;
Ok(KvStore {
index,
data_file: BufWriter::new(file),
data_path,
})
}
pub fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
self.index.get(key).cloned()
}
pub fn put(&mut self, key: &[u8], value: &[u8]) -> std::io::Result<()> {
// Write to file
let entry = Entry::new(key, value);
entry.encode(&mut self.data_file)?;
self.data_file.flush()?;
// Update index
self.index.insert(key.to_vec(), value.to_vec());
Ok(())
}
pub fn delete(&mut self, key: &[u8]) -> std::io::Result<()> {
let entry = Entry::tombstone(key);
entry.encode(&mut self.data_file)?;
self.data_file.flush()?;
self.index.remove(key);
Ok(())
}
}
Checkpoint: Run basic tests for GET, PUT, DELETE. Verify persistence across restarts.
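A test for this checkpoint might look like the following sketch. It assumes the crate is named rustkv (as in the project layout above) and that tempfile is a dev-dependency; adjust names to your setup.
// tests/basic_ops.rs (sketch)
use rustkv::KvStore;

#[test]
fn values_survive_reopen() {
    let dir = tempfile::tempdir().unwrap();

    {
        let mut store = KvStore::open(dir.path()).unwrap();
        store.put(b"name", b"Alice").unwrap();
        store.delete(b"name").unwrap();
        store.put(b"age", b"30").unwrap();
    } // store dropped: buffers flushed, file closed

    let store = KvStore::open(dir.path()).unwrap();
    assert_eq!(store.get(b"age"), Some(b"30".to_vec()));
    assert_eq!(store.get(b"name"), None); // tombstone respected after restart
}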
Phase 2: Append-Only Log File Format
Goal: Define a robust, self-describing log format with checksums.
Duration: 2-3 days
// src/format/mod.rs
use crc32fast::Hasher;
const HEADER_SIZE: usize = 20; // CRC(4) + Timestamp(8) + KeyLen(4) + ValLen(4)
#[derive(Debug, Clone)] // Clone is needed by the WAL record in Phase 5
pub struct Entry {
pub crc: u32,
pub timestamp: u64,
pub key: Vec<u8>,
pub value: Vec<u8>, // Empty for tombstone
}
impl Entry {
pub fn new(key: &[u8], value: &[u8]) -> Self {
let timestamp = std::time::SystemTime::now()
.duration_since(std::time::UNIX_EPOCH)
.unwrap()
.as_millis() as u64;
let mut entry = Entry {
crc: 0,
timestamp,
key: key.to_vec(),
value: value.to_vec(),
};
entry.crc = entry.calculate_crc();
entry
}
pub fn tombstone(key: &[u8]) -> Self {
Entry::new(key, &[])
}
pub fn is_tombstone(&self) -> bool {
self.value.is_empty()
}
pub fn calculate_crc(&self) -> u32 {
let mut hasher = Hasher::new();
hasher.update(&self.timestamp.to_le_bytes());
hasher.update(&(self.key.len() as u32).to_le_bytes());
hasher.update(&(self.value.len() as u32).to_le_bytes());
hasher.update(&self.key);
hasher.update(&self.value);
hasher.finalize()
}
pub fn verify(&self) -> bool {
self.crc == self.calculate_crc()
}
pub fn encode<W: Write>(&self, writer: &mut W) -> std::io::Result<usize> {
let mut written = 0;
writer.write_all(&self.crc.to_le_bytes())?;
written += 4;
writer.write_all(&self.timestamp.to_le_bytes())?;
written += 8;
writer.write_all(&(self.key.len() as u32).to_le_bytes())?;
written += 4;
writer.write_all(&(self.value.len() as u32).to_le_bytes())?;
written += 4;
writer.write_all(&self.key)?;
written += self.key.len();
writer.write_all(&self.value)?;
written += self.value.len();
Ok(written)
}
pub fn decode<R: Read>(reader: &mut R) -> std::io::Result<Self> {
let mut header = [0u8; HEADER_SIZE];
reader.read_exact(&mut header)?;
let crc = u32::from_le_bytes(header[0..4].try_into().unwrap());
let timestamp = u64::from_le_bytes(header[4..12].try_into().unwrap());
let key_len = u32::from_le_bytes(header[12..16].try_into().unwrap()) as usize;
let val_len = u32::from_le_bytes(header[16..20].try_into().unwrap()) as usize;
let mut key = vec![0u8; key_len];
reader.read_exact(&mut key)?;
let mut value = vec![0u8; val_len];
reader.read_exact(&mut value)?;
let entry = Entry { crc, timestamp, key, value };
if !entry.verify() {
return Err(std::io::Error::new(
std::io::ErrorKind::InvalidData,
"CRC mismatch"
));
}
Ok(entry)
}
pub fn total_size(&self) -> usize {
HEADER_SIZE + self.key.len() + self.value.len()
}
}
Checkpoint: Write entries, verify CRC on read, detect corruption.
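A corruption test for this checkpoint can encode an entry, flip one byte, and assert that decode refuses it. A sketch, written as a unit test next to Entry:
// src/format/mod.rs (appended) -- corruption detection test
#[cfg(test)]
mod tests {
    use super::Entry;

    #[test]
    fn decode_rejects_flipped_byte() {
        let entry = Entry::new(b"name", b"Alice");

        // Encode into a buffer, then corrupt one byte of the value.
        let mut buf = Vec::new();
        entry.encode(&mut buf).unwrap();
        let last = buf.len() - 1;
        buf[last] ^= 0xFF;

        // Decode must fail the CRC check rather than return bad data.
        let mut slice: &[u8] = &buf;
        assert!(Entry::decode(&mut slice).is_err());
    }
}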
Phase 3: Memory-Mapped Reads
Goal: Use mmap for zero-copy reads from the data file.
Duration: 2-3 days
// src/storage/data_file.rs
use memmap2::{Mmap, MmapOptions};
use std::fs::{File, OpenOptions};
use std::path::{Path, PathBuf};
use std::sync::atomic::{AtomicU64, Ordering};
pub struct DataFile {
file: File,
mmap: Mmap,
write_offset: AtomicU64,
path: PathBuf,
}
#[derive(Clone, Copy, Debug)]
pub struct IndexEntry {
pub offset: u64,
pub len: u32,
pub timestamp: u64,
}
impl DataFile {
pub fn open(path: &Path) -> std::io::Result<Self> {
let file = OpenOptions::new()
.read(true)
.write(true)
.create(true)
.open(path)?;
let metadata = file.metadata()?;
let len = metadata.len();
// Ensure file is at least 1 byte for mmap
if len == 0 {
file.set_len(4096)?; // Start with 4KB
}
let mmap = unsafe { MmapOptions::new().map(&file)? };
Ok(DataFile {
file,
mmap,
write_offset: AtomicU64::new(len),
path: path.to_path_buf(),
})
}
pub fn get(&self, entry: &IndexEntry) -> &[u8] {
let start = entry.offset as usize;
let end = start + entry.len as usize;
&self.mmap[start..end]
}
pub fn append(&self, data: &[u8]) -> std::io::Result<u64> {
let offset = self.write_offset.fetch_add(data.len() as u64, Ordering::SeqCst);
// Ensure file is large enough
let new_end = offset + data.len() as u64;
let current_len = self.file.metadata()?.len();
if new_end > current_len {
// Double file size (or minimum needed)
let new_size = std::cmp::max(current_len * 2, new_end);
self.file.set_len(new_size)?;
}
// Write using pwrite (doesn't change file offset)
#[cfg(unix)]
{
use std::os::unix::fs::FileExt;
self.file.write_at(data, offset)?;
}
Ok(offset)
}
pub fn sync(&self) -> std::io::Result<()> {
self.file.sync_all()
}
pub fn remap(&mut self) -> std::io::Result<()> {
self.mmap = unsafe { MmapOptions::new().map(&self.file)? };
Ok(())
}
}
Checkpoint: Writes append to file, reads use mmap, verify zero-copy in benchmarks.
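For the “verify zero-copy in benchmarks” part of this checkpoint, a Criterion benchmark along these lines measures steady-state read latency. It assumes the crate is named rustkv, that criterion and tempfile are dev-dependencies, and that the Phase 1 API is still in place; adjust to your actual types.
// benches/throughput.rs (sketch)
use criterion::{criterion_group, criterion_main, Criterion};
use rustkv::KvStore;

fn bench_reads(c: &mut Criterion) {
    let dir = tempfile::tempdir().unwrap();
    let mut store = KvStore::open(dir.path()).unwrap();
    store.put(b"key123", &[42u8; 100]).unwrap();

    c.bench_function("get_existing_key", |b| {
        b.iter(|| std::hint::black_box(store.get(b"key123")))
    });
}

criterion_group!(benches, bench_reads);
criterion_main!(benches);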
Phase 4: Arena Allocator for Index
Goal: Use arena allocation for index entries to improve cache locality.
Duration: 3-4 days
// src/allocator/arena.rs
use std::alloc::{alloc, dealloc, Layout};
use std::cell::{Cell, RefCell};
use std::ptr::NonNull;

pub struct Arena {
    // Each chunk is recorded (pointer + layout) so reset() can reuse the
    // first one and Drop can free them all.
    chunks: RefCell<Vec<(NonNull<u8>, Layout)>>,
    current: Cell<*mut u8>,
    end: Cell<*mut u8>,
    chunk_size: usize,
}

impl Arena {
    pub fn new(chunk_size: usize) -> Self {
        Arena {
            chunks: RefCell::new(Vec::new()),
            current: Cell::new(std::ptr::null_mut()),
            end: Cell::new(std::ptr::null_mut()),
            chunk_size,
        }
    }

    pub fn alloc<T>(&self, value: T) -> &mut T {
        let layout = Layout::new::<T>();
        let ptr = self.alloc_raw(layout);
        unsafe {
            std::ptr::write(ptr as *mut T, value);
            &mut *(ptr as *mut T)
        }
    }

    fn alloc_raw(&self, layout: Layout) -> *mut u8 {
        // Bump allocation: align the current pointer, then advance it
        let current = self.current.get() as usize;
        let aligned = (current + layout.align() - 1) & !(layout.align() - 1);
        let new_ptr = aligned + layout.size();
        if !self.current.get().is_null() && new_ptr <= self.end.get() as usize {
            self.current.set(new_ptr as *mut u8);
            return aligned as *mut u8;
        }
        // Not enough room in the current chunk
        self.grow(layout)
    }

    fn grow(&self, layout: Layout) -> *mut u8 {
        let size = std::cmp::max(self.chunk_size, layout.size());
        let chunk_layout = Layout::from_size_align(size, layout.align()).unwrap();
        let ptr = unsafe { alloc(chunk_layout) };
        if ptr.is_null() {
            panic!("Arena allocation failed");
        }
        // Remember the chunk so it can be reused on reset and freed on drop
        self.chunks
            .borrow_mut()
            .push((NonNull::new(ptr).unwrap(), chunk_layout));
        self.current.set(ptr);
        self.end.set(unsafe { ptr.add(size) });
        self.alloc_raw(layout)
    }

    pub fn reset(&mut self) {
        // &mut self guarantees no outstanding references into the arena.
        // Rewind to the first chunk; later chunks stay allocated until drop.
        if let Some(&(first, layout)) = self.chunks.borrow().first() {
            self.current.set(first.as_ptr());
            self.end.set(unsafe { first.as_ptr().add(layout.size()) });
        }
    }
}

impl Drop for Arena {
    fn drop(&mut self) {
        for &(chunk, layout) in self.chunks.get_mut().iter() {
            unsafe { dealloc(chunk.as_ptr(), layout) };
        }
    }
}
Checkpoint: Index entries allocated in arena, benchmark shows improved cache performance.
Phase 5: Write-Ahead Logging
Goal: Add WAL for crash recovery guarantees.
Duration: 3-4 days
// src/storage/wal.rs
use crate::format::Entry;
use std::fs::{File, OpenOptions};
use std::io::{BufReader, BufWriter, Read, Write};
use std::path::{Path, PathBuf};
pub struct WriteAheadLog {
file: BufWriter<File>,
path: PathBuf, // kept so replay() can reopen the log for reading
sequence: u64,
sync_every: usize,
unflushed: usize,
}
impl WriteAheadLog {
pub fn open(path: &Path) -> std::io::Result<Self> {
let file = OpenOptions::new()
.create(true)
.append(true)
.open(path)?;
Ok(WriteAheadLog {
file: BufWriter::new(file),
sequence: 0,
sync_every: 1000, // Configurable
unflushed: 0,
})
}
pub fn append(&mut self, entry: &Entry) -> std::io::Result<u64> {
// Write WAL record
let record = WalRecord {
sequence: self.sequence,
entry: entry.clone(),
};
record.encode(&mut self.file)?;
self.sequence += 1;
self.unflushed += 1;
// Periodic sync
if self.unflushed >= self.sync_every {
self.sync()?;
}
Ok(self.sequence - 1)
}
pub fn sync(&mut self) -> std::io::Result<()> {
self.file.flush()?;
self.file.get_ref().sync_all()?;
self.unflushed = 0;
Ok(())
}
pub fn replay<F>(&self, mut callback: F) -> std::io::Result<u64>
where
F: FnMut(&Entry) -> std::io::Result<()>,
{
let mut reader = BufReader::new(File::open(&self.path)?);
let mut count = 0;
while let Ok(record) = WalRecord::decode(&mut reader) {
callback(&record.entry)?;
count += 1;
}
Ok(count)
}
}
struct WalRecord {
sequence: u64,
entry: Entry,
}
impl WalRecord {
fn encode<W: Write>(&self, writer: &mut W) -> std::io::Result<()> {
writer.write_all(&self.sequence.to_le_bytes())?;
self.entry.encode(writer)?;
Ok(())
}
fn decode<R: Read>(reader: &mut R) -> std::io::Result<Self> {
let mut seq_bytes = [0u8; 8];
reader.read_exact(&mut seq_bytes)?;
let sequence = u64::from_le_bytes(seq_bytes);
let entry = Entry::decode(reader)?;
Ok(WalRecord { sequence, entry })
}
}
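One note on how this ties into startup: because replay stops at the first record that fails to decode, a torn write at the tail of the log is discarded automatically. A hedged sketch of the recovery call follows; the commented-out apply step stands in for whatever mutation your KvStore performs per entry.
// Hypothetical recovery path: re-apply every logged entry to in-memory state
// before serving requests.
fn recover(wal: &WriteAheadLog) -> std::io::Result<u64> {
    let mut applied: u64 = 0;
    let replayed = wal.replay(|_entry| {
        // Placeholder: re-insert the key (or tombstone) into the index here.
        applied += 1;
        Ok(())
    })?;
    debug_assert_eq!(applied, replayed);
    Ok(replayed)
}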
Checkpoint: Crash simulation test passes - data recovered from WAL.
Phase 6: Compaction and Garbage Collection
Goal: Reclaim space from old versions and deleted keys.
Duration: 4-5 days
// src/storage/compaction.rs
use std::collections::HashMap;
use std::fs::File;
use std::sync::Arc;
pub struct Compactor {
store: Arc<KvStore>,
}
impl Compactor {
pub fn compact(&self) -> std::io::Result<CompactionStats> {
let mut stats = CompactionStats::default();
// Capture the current file size before rewriting (used for space accounting)
stats.old_size = self.store.data_file.len();
// 1. Create new data file
let new_path = self.store.data_path.with_extension("compact");
let mut new_file = File::create(&new_path)?;
let mut new_index: HashMap<Vec<u8>, IndexEntry> = HashMap::new();
// 2. Iterate through current index, write live entries to new file
let index = self.store.index.read().unwrap();
let mut offset = 0u64;
for (key, entry) in index.iter() {
// Skip tombstones
if entry.is_tombstone {
stats.deleted_keys += 1;
continue;
}
// Read value from old file
let value = self.store.data_file.get(entry);
// Write to new file
let new_entry = Entry::new(key, value);
let size = new_entry.encode(&mut new_file)?;
// Update new index
new_index.insert(key.clone(), IndexEntry {
offset,
len: value.len() as u32,
timestamp: entry.timestamp,
});
offset += size as u64;
stats.live_keys += 1;
}
drop(index);
// 3. Sync new file
new_file.sync_all()?;
// 4. Atomic swap
// Note: This is simplified. Real implementation needs careful locking.
let old_path = self.store.data_path.clone();
let backup_path = old_path.with_extension("old");
std::fs::rename(&old_path, &backup_path)?;
std::fs::rename(&new_path, &old_path)?;
// 5. Update store's mmap and index
{
let mut index = self.store.index.write().unwrap();
*index = new_index;
}
// remap() needs mutable access to the DataFile; behind an Arc this goes
// through a lock or interior mutability (simplified here)
self.store.data_file.remap()?;
// 6. Delete old file
std::fs::remove_file(&backup_path)?;
stats.new_size = offset;
stats.space_reclaimed = stats.old_size.saturating_sub(stats.new_size);
Ok(stats)
}
}
#[derive(Default)]
pub struct CompactionStats {
pub live_keys: usize,
pub deleted_keys: usize,
pub old_size: u64,
pub new_size: u64,
pub space_reclaimed: u64,
}
Checkpoint: Compaction reduces file size, no data loss, concurrent reads work.
Phase 7: Concurrent Access with Atomics
Goal: Allow multiple readers and writers without blocking.
Duration: 4-5 days
// src/index/concurrent.rs
use dashmap::DashMap;
use std::collections::HashMap;
use std::sync::atomic::{AtomicPtr, AtomicU64, Ordering};
// Option 1: DashMap (sharded HashMap)
pub struct ConcurrentIndex {
map: DashMap<Vec<u8>, IndexEntry>,
version: AtomicU64,
}
impl ConcurrentIndex {
pub fn new() -> Self {
ConcurrentIndex {
map: DashMap::new(),
version: AtomicU64::new(0),
}
}
pub fn get(&self, key: &[u8]) -> Option<IndexEntry> {
self.map.get(key).map(|e| *e)
}
pub fn insert(&self, key: Vec<u8>, entry: IndexEntry) {
self.map.insert(key, entry);
self.version.fetch_add(1, Ordering::Release);
}
pub fn remove(&self, key: &[u8]) -> Option<IndexEntry> {
self.map.remove(key).map(|(_, v)| v)
}
}
// Option 2: Lock-free reads with copy-on-write for writes
pub struct CowIndex {
current: AtomicPtr<HashMap<Vec<u8>, IndexEntry>>,
}
impl CowIndex {
pub fn get(&self, key: &[u8]) -> Option<IndexEntry> {
let map = unsafe { &*self.current.load(Ordering::Acquire) };
map.get(key).copied()
}
pub fn update<F>(&self, f: F)
where
F: FnOnce(&mut HashMap<Vec<u8>, IndexEntry>),
{
loop {
let old_ptr = self.current.load(Ordering::Acquire);
let old = unsafe { &*old_ptr };
let mut new = old.clone();
f(&mut new);
let new_ptr = Box::into_raw(Box::new(new));
if self.current.compare_exchange(
old_ptr,
new_ptr,
Ordering::AcqRel,
Ordering::Acquire,
).is_ok() {
// Successfully swapped. Schedule old for deletion.
// Use epoch-based reclamation or hazard pointers.
break;
} else {
// CAS failed, retry
unsafe { drop(Box::from_raw(new_ptr)); }
}
}
}
}
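The "schedule old for deletion" step above is the hard part of Option 2: without epochs or hazard pointers, freeing the old map can race with readers still dereferencing it. A third approach, shown below as a hedged sketch rather than the project's prescribed design, sidesteps manual reclamation by letting Arc do the reference counting. Reads take a brief read-lock just to clone the snapshot, so this is not strictly lock-free, but it is simple and safe.
// Option 3 (sketch): copy-on-write snapshots reclaimed by Arc instead of
// epochs or hazard pointers.
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

pub struct SnapshotIndex {
    current: RwLock<Arc<HashMap<Vec<u8>, IndexEntry>>>,
}

impl SnapshotIndex {
    pub fn new() -> Self {
        SnapshotIndex { current: RwLock::new(Arc::new(HashMap::new())) }
    }

    pub fn get(&self, key: &[u8]) -> Option<IndexEntry> {
        // Hold the read-lock only long enough to clone the Arc; the lookup
        // itself runs against an immutable snapshot.
        let snapshot = {
            let guard = self.current.read().unwrap();
            Arc::clone(&*guard)
        };
        snapshot.get(key).copied()
    }

    pub fn update<F>(&self, f: F)
    where
        F: FnOnce(&mut HashMap<Vec<u8>, IndexEntry>),
    {
        let mut guard = self.current.write().unwrap();
        let mut new_map = (**guard).clone(); // copy-on-write
        f(&mut new_map);
        *guard = Arc::new(new_map);
        // The previous snapshot is freed automatically once the last reader
        // drops its Arc -- no epoch or hazard-pointer machinery required.
    }
}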
Checkpoint: Concurrent benchmark shows linear scaling with threads.
Phase 8: CLI Interface
Goal: Create a user-friendly command-line interface.
Duration: 2-3 days
// src/cli.rs
use clap::{Parser, Subcommand};
use std::path::PathBuf;
#[derive(Parser)]
#[command(name = "rustkv")]
#[command(about = "A high-performance key-value store")]
struct Cli {
#[command(subcommand)]
command: Commands,
#[arg(long, default_value = "./data")]
data_dir: PathBuf,
}
#[derive(Subcommand)]
enum Commands {
Get { key: String },
Set { key: String, value: String },
Del { key: String },
Bench {
#[arg(long, default_value_t = 1000000)]
ops: usize,
#[arg(long, default_value_t = 1)]
threads: usize,
},
Compact,
Stats,
Interactive,
}
// `Result` is assumed to be a crate-level alias (e.g. anyhow::Result)
fn main() -> Result<()> {
let cli = Cli::parse();
let store = KvStore::open(&cli.data_dir)?;
match cli.command {
Commands::Get { key } => {
match store.get(key.as_bytes()) {
Some(value) => println!("{}", String::from_utf8_lossy(&value)),
None => println!("(nil)"),
}
}
Commands::Set { key, value } => {
store.put(key.as_bytes(), value.as_bytes())?;
println!("OK");
}
Commands::Del { key } => {
store.delete(key.as_bytes())?;
println!("OK");
}
Commands::Bench { ops, threads } => {
run_benchmark(&store, ops, threads)?;
}
Commands::Compact => {
let stats = store.compact()?;
println!("Compaction complete:");
println!(" Live keys: {}", stats.live_keys);
println!(" Deleted keys: {}", stats.deleted_keys);
println!(" Space reclaimed: {} bytes", stats.space_reclaimed);
}
Commands::Stats => {
print_stats(&store);
}
Commands::Interactive => {
run_interactive(&store)?;
}
}
Ok(())
}
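The Bench arm calls run_benchmark, which is not defined above. A minimal sketch follows, assuming KvStore::put/get take &self, KvStore is Sync, and Result is the same crate-level alias used by main; the key format and value payload are arbitrary choices.
// Hypothetical benchmark helper: spread `ops` put+get pairs across `threads`
// scoped worker threads and report aggregate throughput.
fn run_benchmark(store: &KvStore, ops: usize, threads: usize) -> Result<()> {
    use std::time::Instant;
    let per_thread = ops / threads.max(1);
    let start = Instant::now();
    std::thread::scope(|s| {
        for t in 0..threads {
            s.spawn(move || {
                for i in 0..per_thread {
                    let key = format!("bench-{}-{}", t, i);
                    store.put(key.as_bytes(), b"benchmark-value").unwrap();
                    let _ = store.get(key.as_bytes());
                }
            });
        }
    });
    let elapsed = start.elapsed();
    let total_ops = per_thread * threads * 2; // one put + one get per iteration
    println!(
        "{} ops in {:.2?} ({:.0} ops/sec)",
        total_ops,
        elapsed,
        total_ops as f64 / elapsed.as_secs_f64()
    );
    Ok(())
}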
Checkpoint: All CLI commands work, benchmark shows target performance.
Testing Strategy
Unit Tests
#[cfg(test)]
mod tests {
use super::*;
use tempfile::tempdir;
#[test]
fn test_basic_put_get() {
let dir = tempdir().unwrap();
let store = KvStore::open(dir.path()).unwrap();
store.put(b"hello", b"world").unwrap();
assert_eq!(store.get(b"hello"), Some(b"world".to_vec()));
}
#[test]
fn test_delete() {
let dir = tempdir().unwrap();
let store = KvStore::open(dir.path()).unwrap();
store.put(b"hello", b"world").unwrap();
store.delete(b"hello").unwrap();
assert_eq!(store.get(b"hello"), None);
}
#[test]
fn test_update() {
let dir = tempdir().unwrap();
let store = KvStore::open(dir.path()).unwrap();
store.put(b"key", b"value1").unwrap();
store.put(b"key", b"value2").unwrap();
assert_eq!(store.get(b"key"), Some(b"value2".to_vec()));
}
#[test]
fn test_persistence() {
let dir = tempdir().unwrap();
// Write data
{
let store = KvStore::open(dir.path()).unwrap();
store.put(b"key", b"value").unwrap();
}
// Read after reopen
{
let store = KvStore::open(dir.path()).unwrap();
assert_eq!(store.get(b"key"), Some(b"value".to_vec()));
}
}
#[test]
fn test_crc_validation() {
let entry = Entry::new(b"key", b"value");
assert!(entry.verify());
// Corrupt the entry
let mut corrupted = entry;
corrupted.crc ^= 1; // Flip a bit
assert!(!corrupted.verify());
}
}
Crash Recovery Tests
#[test]
fn test_crash_recovery() {
let dir = tempdir().unwrap();
// Simulate partial write followed by crash
{
let store = KvStore::open(dir.path()).unwrap();
// Write some complete entries
for i in 0..1000 {
store.put(format!("key{}", i).as_bytes(), b"value").unwrap();
}
// Force sync to ensure WAL is on disk
store.sync().unwrap();
// Don't call drop - simulate crash
std::mem::forget(store);
}
// Recover and verify
{
let store = KvStore::open(dir.path()).unwrap();
for i in 0..1000 {
assert!(store.get(format!("key{}", i).as_bytes()).is_some());
}
}
}
Concurrent Stress Tests
#[test]
fn test_concurrent_access() {
use std::sync::Arc;
use std::thread;
let dir = tempdir().unwrap();
let store = Arc::new(KvStore::open(dir.path()).unwrap());
let num_threads = 8;
let ops_per_thread = 10000;
let handles: Vec<_> = (0..num_threads).map(|t| {
let store = Arc::clone(&store);
thread::spawn(move || {
for i in 0..ops_per_thread {
let key = format!("thread{}key{}", t, i);
let value = format!("value{}", i);
store.put(key.as_bytes(), value.as_bytes()).unwrap();
let read = store.get(key.as_bytes());
assert!(read.is_some());
}
})
}).collect();
for handle in handles {
handle.join().unwrap();
}
// Verify all keys exist
for t in 0..num_threads {
for i in 0..ops_per_thread {
let key = format!("thread{}key{}", t, i);
assert!(store.get(key.as_bytes()).is_some());
}
}
}
Common Pitfalls and Debugging
Pitfall 1: fsync Timing for Durability
// WRONG: Data may be lost on crash
fn put(&mut self, key: &[u8], value: &[u8]) -> Result<()> {
self.data_file.write(&entry.encode())?;
self.index.insert(key, entry);
// No fsync! Crash here = data loss
Ok(())
}
// CORRECT: Durability guaranteed
fn put(&mut self, key: &[u8], value: &[u8]) -> Result<()> {
// 1. Write to WAL first
self.wal.append(&entry)?;
self.wal.sync()?; // fsync WAL
// 2. Now safe to update in-memory state
let offset = self.data_file.append(&entry.encode())?;
self.index.insert(key, IndexEntry { offset, .. });
Ok(())
}
Pitfall 2: mmap Edge Cases
// WRONG: File grows but mmap doesn't see new data
fn append(&mut self, data: &[u8]) -> Result<u64> {
let offset = self.file.seek(SeekFrom::End(0))?;
self.file.write_all(data)?;
// Readers using old mmap won't see this!
Ok(offset)
}
// CORRECT: Remap after file growth
fn append(&mut self, data: &[u8]) -> Result<u64> {
let offset = self.file.seek(SeekFrom::End(0))?;
self.file.write_all(data)?;
// Remap to see new data
self.mmap = unsafe { MmapOptions::new().map(&self.file)? };
Ok(offset)
}
// BEST: Use separate file handle for writes, periodic remap
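That last variant deserves a sketch: appends go through a plain File handle (positional writes, as in Phase 3), while readers share a read-only mapping that is rebuilt only when it no longer covers the requested offset. The memmap2 MmapOptions API is assumed here, and the copy into a Vec keeps the example simple at the cost of zero-copy.
// Sketch: separate write handle + lazily refreshed mmap for readers.
use memmap2::{Mmap, MmapOptions};
use std::fs::File;
use std::sync::RwLock;

struct MappedLog {
    write_file: File,   // all appends go through this handle (write_all_at)
    mmap: RwLock<Mmap>, // read-only view, replaced when the file outgrows it
}

impl MappedLog {
    fn read(&self, offset: u64, len: usize) -> std::io::Result<Vec<u8>> {
        let end = offset as usize + len;
        {
            // Fast path: the current mapping already covers the range.
            let map = self.mmap.read().unwrap();
            if end <= map.len() {
                return Ok(map[offset as usize..end].to_vec());
            }
        }
        // Slow path: the file has grown past the mapping -- remap, then read.
        let new_map = unsafe { MmapOptions::new().map(&self.write_file)? };
        let mut guard = self.mmap.write().unwrap();
        *guard = new_map;
        if end > guard.len() {
            return Err(std::io::Error::new(
                std::io::ErrorKind::UnexpectedEof,
                "offset beyond end of data file",
            ));
        }
        Ok(guard[offset as usize..end].to_vec())
    }
}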
Pitfall 3: Memory Leaks in Arena Index
// WRONG: Arena never frees memory
struct Index {
arena: Arena,
map: HashMap<Vec<u8>, &'static IndexEntry>, // Lifetime issue!
}
// CORRECT: Use proper lifetime bounds
struct Index<'a> {
arena: &'a Arena,
map: HashMap<&'a [u8], &'a IndexEntry>,
}
// OR: Copy keys/entries to arena
struct Index {
arena: Arena,
map: HashMap<ArenaString, ArenaBox<IndexEntry>>,
}
Pitfall 4: Lock Contention in Hot Path
// WRONG: Global lock for every operation
struct KvStore {
data: Mutex<KvStoreInner>,
}
fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
let guard = self.data.lock().unwrap(); // BLOCKS all readers!
guard.index.get(key).map(|e| self.read_value(e))
}
// CORRECT: Read-write lock or lock-free
struct KvStore {
index: RwLock<HashMap<..>>,
data_file: DataFile, // mmap is thread-safe for reads
}
fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
let guard = self.index.read().unwrap(); // Many readers allowed!
guard.get(key).map(|e| self.data_file.get(e))
}
// BEST: Lock-free index (DashMap, skip list)
Extensions and Advanced Features
Extension 1: Range Queries
Add support for scanning a range of keys:
pub fn scan(&self, start: &[u8], end: &[u8]) -> impl Iterator<Item=(Vec<u8>, Vec<u8>)> {
// Requires sorted index (B-Tree or Skip List instead of HashMap)
self.index.range(start..end)
.map(|(k, e)| (k.clone(), self.data_file.get(e).to_vec()))
}
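Concretely, with std's BTreeMap as the sorted index, the scan can look like the sketch below. The owned bounds and to_vec copies keep the borrow rules simple and are not requirements; start must be <= end.
// Sketch: range scan over a sorted (BTreeMap) index. The data_file.get(entry)
// call is the same mmap read used by point lookups.
use std::collections::BTreeMap;

pub struct SortedIndex {
    map: BTreeMap<Vec<u8>, IndexEntry>,
}

impl SortedIndex {
    pub fn scan<'a>(
        &'a self,
        start: &[u8],
        end: &[u8],
        data_file: &'a DataFile,
    ) -> impl Iterator<Item = (Vec<u8>, Vec<u8>)> + 'a {
        self.map
            .range(start.to_vec()..end.to_vec())
            .map(move |(k, entry)| (k.clone(), data_file.get(entry).to_vec()))
    }
}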
Extension 2: Transactions
Add multi-key atomic operations:
pub fn transaction<F, R>(&self, f: F) -> Result<R>
where
F: FnOnce(&mut Transaction) -> Result<R>
{
let mut txn = Transaction::new(self);
let result = f(&mut txn)?;
txn.commit()?;
Ok(result)
}
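Transaction::new and commit carry the real semantics. The buffer-based sketch below gives all-or-nothing behavior only if commit runs under an exclusive section (or is logged as a single WAL record), which is noted inline; the method signatures mirror the KvStore API used elsewhere, and Result is the same crate alias.
// Sketch: a write-buffering transaction. Reads see buffered writes first,
// and commit applies everything in one pass.
pub struct Transaction<'a> {
    store: &'a KvStore,
    // key -> Some(value) for puts, None for deletes
    writes: Vec<(Vec<u8>, Option<Vec<u8>>)>,
}

impl<'a> Transaction<'a> {
    pub fn new(store: &'a KvStore) -> Self {
        Transaction { store, writes: Vec::new() }
    }

    pub fn put(&mut self, key: &[u8], value: &[u8]) {
        self.writes.push((key.to_vec(), Some(value.to_vec())));
    }

    pub fn delete(&mut self, key: &[u8]) {
        self.writes.push((key.to_vec(), None));
    }

    pub fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        // Most recent buffered write wins; otherwise fall back to the store.
        for (k, v) in self.writes.iter().rev() {
            if k == key {
                return v.clone();
            }
        }
        self.store.get(key)
    }

    pub fn commit(self) -> Result<()> {
        // For real atomicity this loop must run under an exclusive lock (or
        // be written to the WAL as one record) so readers never observe a
        // partially applied transaction.
        for (key, value) in &self.writes {
            match value {
                Some(v) => self.store.put(key, v)?,
                None => self.store.delete(key)?,
            }
        }
        Ok(())
    }
}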
Extension 3: Replication
Add primary-replica replication:
pub fn replicate_to(&self, replica: &str) -> Result<()> {
// Stream WAL to replica
let wal_reader = self.wal.reader_from(replica_last_seq)?;
while let Some(record) = wal_reader.next()? {
send_to_replica(replica, &record)?;
}
Ok(())
}
Extension 4: Bloom Filters
Add bloom filters to avoid unnecessary disk reads:
struct SstableBloomFilter {
bits: BitVec,
hash_count: usize,
}
impl SstableBloomFilter {
fn might_contain(&self, key: &[u8]) -> bool {
// Check all hash positions
for i in 0..self.hash_count {
let pos = hash_with_seed(key, i) % self.bits.len();
if !self.bits[pos] {
return false; // Definitely not present
}
}
true // Might be present (check disk)
}
}
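might_contain relies on an insert that sets the same bit positions and on a seeded hash, neither of which is shown above. Here is a self-contained sketch using std's DefaultHasher and a plain Vec<bool> in place of BitVec; sizes and hash counts are illustrative.
// Sketch of the missing pieces: a seeded hash plus an insert that sets the
// same bit positions might_contain checks.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn hash_with_seed(key: &[u8], seed: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    seed.hash(&mut hasher);
    key.hash(&mut hasher);
    hasher.finish() as usize
}

struct BloomFilter {
    bits: Vec<bool>,
    hash_count: usize,
}

impl BloomFilter {
    fn new(bit_count: usize, hash_count: usize) -> Self {
        BloomFilter { bits: vec![false; bit_count], hash_count }
    }

    fn insert(&mut self, key: &[u8]) {
        for i in 0..self.hash_count {
            let pos = hash_with_seed(key, i) % self.bits.len();
            self.bits[pos] = true;
        }
    }

    fn might_contain(&self, key: &[u8]) -> bool {
        for i in 0..self.hash_count {
            let pos = hash_with_seed(key, i) % self.bits.len();
            if !self.bits[pos] {
                return false; // definitely not present
            }
        }
        true // possibly present -- caller still checks disk
    }
}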
The Interview Questions They Will Ask
- “What is the difference between an LSM Tree and a B+ Tree?”
  Answer: LSM Trees buffer writes in memory (memtable), then flush to immutable sorted files (SSTables) that are periodically merged. This makes writes sequential and fast, but reads may need to check multiple levels. B+ Trees update data in place with a balanced tree structure, providing consistent read latency but requiring random I/O for writes. LSM Trees favor write-heavy workloads; B+ Trees favor read-heavy or mixed workloads.
- “Why is mmap faster than read/write for some workloads?”
  Answer: mmap eliminates the copy from kernel page cache to user space buffers. When you mmap a file, you access the page cache directly through virtual memory. For read-heavy workloads with good locality, this saves one memcpy per read. However, mmap can be slower for sequential access (read() uses readahead) and introduces unpredictable latency from page faults.
- “How do you handle database recovery after a crash?”
  Answer: Use Write-Ahead Logging (WAL). Before any modification, write the change to a sequential log file and fsync it to disk. On recovery, replay the WAL from the last checkpoint to restore the database to a consistent state. The key is that the WAL contains enough information to reconstruct any uncommitted changes.
- “What is Write Amplification?”
  Answer: Write amplification is the ratio of actual bytes written to storage versus the logical bytes written by the application. In LSM Trees, data may be written multiple times (memtable flush, L0 compaction, L1 compaction, etc.), leading to write amplification of 10-30x. This reduces SSD lifespan and limits write throughput.
- “How do you achieve lock-free reads in a concurrent key-value store?”
  Answer: Use an immutable index with atomic pointer swap for updates (copy-on-write), or use lock-free data structures like skip lists with CAS operations. For the data file, mmap is inherently safe for concurrent reads. Epoch-based reclamation or hazard pointers handle memory cleanup.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Storage Engine Design | “Designing Data-Intensive Applications” by Kleppmann | Ch. 3: Storage and Retrieval |
| Memory Mapping | “The Linux Programming Interface” by Kerrisk | Ch. 49: Memory Mappings |
| Concurrent Data Structures | “Rust Atomics and Locks” by Bos | Ch. 9: Building a Lock-Free Stack |
| Database Internals | “Database Internals” by Petrov | Entire book (B-Trees, LSM Trees) |
| Systems Programming | “Computer Systems: A Programmer’s Perspective” | Ch. 9: Virtual Memory |
Success Criteria
You have completed this project when:
- Your KV store can write 1 million keys in under 0.5 seconds
- Your KV store can read 1 million random keys in under 0.35 seconds
- Concurrent benchmark shows 3+ million operations per second
- Crash recovery test passes with 100% data integrity
- Compaction reduces file size without blocking readers
- Memory usage stays bounded (index fits in configured limit)
- You can explain every layer of the architecture from disk to API
Summary
This project represents the culmination of your Advanced Rust journey. You have built a production-grade storage engine that demonstrates:
- Systems Engineering: Understanding of I/O patterns, memory hierarchy, and performance optimization
- Rust Mastery: Unsafe code for performance, atomics for concurrency, lifetimes for safety
- Database Knowledge: WAL for durability, compaction for space management, indexes for fast lookups
- Performance Engineering: Benchmarking, profiling, and optimization of hot paths
You now have the skills to understand, modify, and build real database systems. Whether you continue into distributed systems, embedded databases, or other systems programming domains, the foundations from this project will serve you well.
Welcome to the elite club of engineers who truly understand how data is stored and retrieved. The machine is now yours.
This project is part of the Advanced Rust Ecosystem Deep Dive learning path.