Project 12: High-Performance KV Store (Custom Everything)
“Databases are the cathedrals of the software world - built to last centuries, serving billions, and embodying our deepest understanding of data persistence.” - Andy Pavlo
Project Metadata
- Main Programming Language: Rust
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Difficulty: Level 5: Master
- Knowledge Area: Databases / Systems Engineering
- Time Estimate: 1-2 months
- Prerequisites:
- Completed Projects 3 (Arena Allocator), 6 (Lock-Free Queue), and 7 (Zero-Copy Parser)
- Strong understanding of unsafe Rust and raw pointers
- Familiarity with file I/O and memory mapping
- Basic understanding of concurrent programming and atomics
What You Will Build
A production-grade Key-Value store that uses a custom Arena Allocator for the index, Zero-Copy parsing for values, and Atomics for thread-safe access. This database engine will support millions of operations per second on a single core, rivaling the performance of Redis and RocksDB for specific workloads.
This is the “Final Boss” of the Advanced Rust learning path. It requires you to integrate almost every advanced concept: pinning, custom allocation, atomics, memory mapping, and lifetime management.
Learning Objectives
By the end of this project, you will be able to:
- Design a complete storage engine architecture - Understand the trade-offs between different data structures (LSM Trees, B+ Trees, Skip Lists) and choose appropriately for your workload
- Implement an append-only log with Write-Ahead Logging (WAL) - Build a crash-safe persistence layer that can recover from unexpected shutdowns without data loss
- Use memory-mapped files (mmap) for zero-copy I/O - Leverage the operating system’s virtual memory system to read data without copying it to user space
- Build a high-performance concurrent index - Implement a lock-free or fine-grained locking data structure that scales with CPU cores
- Apply arena allocation in a real system - Use your custom allocator knowledge to minimize allocation overhead in the hot path
- Implement compaction and garbage collection - Reclaim space from deleted or updated keys without blocking reads
- Achieve sub-microsecond latency on the hot path - Understand and eliminate latency sources, keeping the common case blazingly fast
Deep Theoretical Foundation
Before writing any code, you must understand the fundamental concepts that underpin all modern key-value stores. This section provides the theoretical foundation you’ll need.
Part 1: The Anatomy of a Storage Engine
Every key-value store, from Redis to RocksDB to DynamoDB, shares a common architectural pattern:
KEY-VALUE STORE ARCHITECTURE
====================================================================
CLIENT REQUEST
|
v
+-------------------------------------------------------------------+
| API LAYER |
| GET(key) -> Option<Value> |
| PUT(key, value) -> Result<()> |
| DELETE(key) -> Result<()> |
+-------------------------------------------------------------------+
|
v
+-------------------------------------------------------------------+
| CONCURRENCY LAYER |
| - Lock management (mutex, rwlock, or lock-free) |
| - Transaction isolation (if supported) |
| - MVCC (Multi-Version Concurrency Control) |
+-------------------------------------------------------------------+
|
v
+-------------------------------------------------------------------+
| INDEX LAYER |
| Maps: Key -> (File Offset, Value Length) |
| Data Structures: B-Tree, Skip List, Hash Table |
| Lives in: RAM (fast) or Disk (persistent) |
+-------------------------------------------------------------------+
|
v
+-------------------------------------------------------------------+
| STORAGE LAYER |
| - Write-Ahead Log (WAL) for durability |
| - Data files (append-only log or sorted segments) |
| - Compaction and garbage collection |
+-------------------------------------------------------------------+
|
v
+-------------------------------------------------------------------+
| FILE SYSTEM / OS |
| - fsync for durability guarantees |
| - mmap for zero-copy reads |
| - Page cache for read performance |
+-------------------------------------------------------------------+
|
v
DISK / SSD
Book Reference: “Designing Data-Intensive Applications” by Martin Kleppmann, Chapter 3: Storage and Retrieval - The definitive resource on storage engine design.
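To make the API layer concrete, here is a minimal sketch of it as a Rust trait. The trait name and signatures are illustrative only; the KvStore built later in this project exposes essentially the same three operations as inherent methods.
/// Minimal key-value API mirroring the GET/PUT/DELETE layer above.
pub trait KeyValueStore {
    /// Returns the stored value, or None if the key is absent or deleted.
    fn get(&self, key: &[u8]) -> Option<Vec<u8>>;
    /// Stores `value` under `key`, overwriting any previous value.
    fn put(&mut self, key: &[u8], value: &[u8]) -> std::io::Result<()>;
    /// Removes `key`; succeeds even if the key was never present.
    fn delete(&mut self, key: &[u8]) -> std::io::Result<()>;
}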
Part 2: LSM Trees vs B+ Trees - The Fundamental Trade-off
The choice between Log-Structured Merge Trees (LSM) and B+ Trees represents the most important architectural decision in storage engine design.
LSM TREE ARCHITECTURE (Write-Optimized)
====================================================================
WRITES flow through multiple levels, from memory to disk:
+-------------------+
| MEMTABLE | <- All writes go here first (in RAM)
| (Sorted Map) | Fast random writes, sorted by key
+-------------------+
|
| (Flush when full)
v
+-------------------+
| IMMUTABLE | <- Previous memtable being flushed
| MEMTABLE |
+-------------------+
|
| (Write to disk)
v
+-------------------+ +-------------------+
| LEVEL 0 |----->| LEVEL 0 | <- Unsorted SSTable files
| (SSTable) | | (SSTable) | May have overlapping key ranges
+-------------------+ +-------------------+
|
| (Compaction merges and sorts)
v
+-------------------+ +-------------------+ +-------------------+
| LEVEL 1 |----->| LEVEL 1 |----->| LEVEL 1 |
| (SSTable) | | (SSTable) | | (SSTable) |
+-------------------+ +-------------------+ +-------------------+
| Non-overlapping key ranges per level
| (Further compaction, each level 10x larger)
v
+-------------------+ +-------------------+ +-------------------+
| LEVEL 2 |----->| LEVEL 2 |----->| LEVEL 2 | ...
| (SSTable) | | (SSTable) | | (SSTable) |
+-------------------+ +-------------------+ +-------------------+
SSTable (Sorted String Table) Format:
+------------------------------------------------------------------------+
| DATA BLOCK | DATA BLOCK | DATA BLOCK | INDEX BLOCK | BLOOM FILTER|
+------------------------------------------------------------------------+
| | | | |
v v v v v
Sorted Sorted Sorted Points to Quick "might
key-value key-value key-value data blocks contain" check
pairs pairs pairs by key range
READ PATH in LSM Tree:
1. Check Memtable (RAM) - O(log n)
2. Check Immutable Memtable (RAM) - O(log n)
3. Check Level 0 SSTables (may check ALL files) - O(k * log n)
4. Check Level 1 SSTables (only one file per key) - O(log n)
5. Check Level 2 SSTables (only one file per key) - O(log n)
...
WORST CASE: Read amplification = O(levels * log n)
B+ TREE ARCHITECTURE (Read-Optimized)
====================================================================
All data lives in a single tree structure on disk:
+------------------+
| ROOT NODE |
| [50 | 100 | 150] |
+--------+---------+
|
+----------------------+----------------------+
| | |
v v v
+-------------+ +-------------+ +-------------+
| INTERNAL | | INTERNAL | | INTERNAL |
| [10|20|30] | | [60|70|80] | | [110|120] |
+------+------+ +------+------+ +------+------+
| | |
+-----+-----+ +-----+-----+ +-----+-----+
v v v v v v v v v
+---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+
|1-9| |11 | |21 | |51 | |61 | |71 | |101| |111| |121|
| | |to | |to | |to | |to | |to | |to | |to | |to |
| | |19 | |29 | |59 | |69 | |79 | |109| |119| |129|
+---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+
\ \ \ | | | / / /
\ \ \________|_____|_____|________/ / /
\ \_________ | | | _________/ /
\___________ \ | | | / __________/
\ \ | | | / /
v v v v v v v
LEAF NODES LINKED LIST (for range scans)
+---+<->+---+<->+---+<->+---+<->+---+
| 1 | | 11| | 21| | 51| | 61| ...
|...| |...| |...| |...| |...|
+---+ +---+ +---+ +---+ +---+
B+ Tree Node Format:
+------------------------------------------------------------------------+
| HEADER | KEY | PTR | KEY | PTR | KEY | PTR | KEY | PTR | ... | KEY | PTR |
+------------------------------------------------------------------------+
| | | | |
| | | | +-- Pointer to child (or value in leaf)
| | | +-- Separator key (all keys in right subtree >= this)
| | +-- Pointer to child (or value in leaf)
| +-- Separator key
+-- Node metadata (is_leaf, key_count, etc.)
READ PATH in B+ Tree:
1. Load root node from disk (usually cached) - 1 I/O
2. Binary search to find child pointer - O(log b)
3. Load next node - 1 I/O
4. Repeat until leaf - O(log_b n) I/Os total
...
BEST CASE: Read amplification = O(log_b n) where b = branching factor (~100-1000)
LSM vs B+ TREE COMPARISON
====================================================================
+-------------------+------------------+------------------+
| Metric | LSM Tree | B+ Tree |
+-------------------+------------------+------------------+
| Write throughput | EXCELLENT | GOOD |
| | (Sequential I/O) | (Random I/O) |
+-------------------+------------------+------------------+
| Read latency | VARIABLE | CONSISTENT |
| | (May check many | (Tree depth |
| | levels) | is predictable) |
+-------------------+------------------+------------------+
| Space efficiency | WORSE | BETTER |
| | (Write amplif.) | (In-place update)|
+-------------------+------------------+------------------+
| Write amplification| HIGH | MODERATE |
| | (Compaction) | (Page splits) |
+-------------------+------------------+------------------+
| Range scans | MODERATE | EXCELLENT |
| | (Merge iterators)| (Linked leaves) |
+-------------------+------------------+------------------+
| Point lookups | VARIABLE | EXCELLENT |
| | (Bloom filters | (Direct path) |
| | help a lot) | |
+-------------------+------------------+------------------+
USE LSM TREES WHEN:
- Write-heavy workload (logging, time-series, analytics)
- SSD storage (large sequential writes still reduce write amplification and flash wear, even though the raw sequential-vs-random gap is smaller than on HDDs)
- Can tolerate variable read latency
- Examples: RocksDB, LevelDB, Cassandra, ScyllaDB
USE B+ TREES WHEN:
- Read-heavy workload (OLTP databases)
- Need consistent read latency
- Range queries are common
- Examples: PostgreSQL, MySQL, SQLite, LMDB
Book Reference: “Database Internals” by Alex Petrov, Chapters 2-7 cover B-Trees and LSM Trees in extraordinary detail.
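To see why the comparison above says bloom filters “help a lot” for LSM point lookups: a few hashed bits per key let a reader skip any SSTable that definitely does not contain the key, turning most negative lookups into pure RAM work. The sketch below is a toy filter using only the standard library; production engines size the bit array from a target false-positive rate and use faster hash functions.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy bloom filter: `hashes` hash functions simulated by seeding one hasher.
struct BloomFilter {
    bits: Vec<bool>,
    hashes: u64,
}

impl BloomFilter {
    fn new(num_bits: usize, hashes: u64) -> Self {
        BloomFilter { bits: vec![false; num_bits], hashes }
    }

    fn bit_index(&self, key: &[u8], seed: u64) -> usize {
        let mut h = DefaultHasher::new();
        seed.hash(&mut h);
        key.hash(&mut h);
        (h.finish() as usize) % self.bits.len()
    }

    fn insert(&mut self, key: &[u8]) {
        for seed in 0..self.hashes {
            let i = self.bit_index(key, seed);
            self.bits[i] = true;
        }
    }

    /// false = definitely absent (skip this SSTable); true = might be present.
    fn might_contain(&self, key: &[u8]) -> bool {
        (0..self.hashes).all(|seed| self.bits[self.bit_index(key, seed)])
    }
}
An SSTable reader consults might_contain before touching its index block; a false answer avoids the disk read entirely.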
Part 3: Write-Ahead Logging (WAL) for Crash Recovery
Every durable storage engine uses Write-Ahead Logging: before any modification is applied to the main data structure, it is first written to a sequential log file and flushed to disk.
WRITE-AHEAD LOGGING (WAL) PRINCIPLE
====================================================================
The golden rule: "Never modify data until the log is on disk."
WITHOUT WAL (Dangerous!)
========================
1. User: PUT(key="name", value="Alice")
2. Engine: Update index in RAM
3. Engine: Write to data file
4. <<< POWER FAILURE HERE >>>
5. On restart: Index says key exists, but data file is corrupted/incomplete
6. RESULT: DATA CORRUPTION OR LOSS
WITH WAL (Safe!)
=================
1. User: PUT(key="name", value="Alice")
2. Engine: Write to WAL file
3. Engine: fsync() WAL file <-- This MUST complete before step 4
4. Engine: Acknowledge to user "Write successful"
5. Engine: Update index in RAM (can be lazy)
6. Engine: Write to data file (can be batched)
7. <<< POWER FAILURE HERE >>>
8. On restart:
a. Read WAL from beginning
b. Replay all operations that weren't applied to data file
c. Data integrity GUARANTEED
WAL FILE FORMAT
====================================================================
+--------+--------+--------+--------+--------+--------+
| Record | Record | Record | Record | Record | Record | ...
+--------+--------+--------+--------+--------+--------+
Each Record:
+----------+----------+----------+----------+------------------+----------+
| Length | CRC32 | Sequence | Type | Key/Value Data | Padding |
| (4 bytes)| (4 bytes)| (8 bytes)| (1 byte) | (variable) | (0-7 B) |
+----------+----------+----------+----------+------------------+----------+
| | | | | |
| | | | | +-- Align to 8 bytes
| | | | +-- Actual key+value
| | | +-- PUT=1, DELETE=2, COMMIT=3
| | +-- Monotonically increasing sequence number
| +-- Checksum for corruption detection
+-- Total record length (for skipping corrupted records)
WAL RECOVERY ALGORITHM
====================================================================
fn recover_from_wal(wal_path: &Path, index: &mut Index, data: &mut DataFile) {
    let mut reader = WalReader::open(wal_path);
    let last_valid_seq = data.get_last_sequence();
    let mut replayed_count = 0u64;

    while let Some(record) = reader.next_record() {
        // Skip records already applied
        if record.sequence <= last_valid_seq {
            continue;
        }
        // Verify checksum
        if record.crc32 != calculate_crc32(&record.data) {
            // Corruption detected! Truncate WAL here.
            reader.truncate_at_current_position();
            break;
        }
        // Replay the operation
        match record.record_type {
            RecordType::Put => {
                let (key, value) = parse_put_record(&record.data);
                let offset = data.append(key, value);
                index.insert(key, offset);
            }
            RecordType::Delete => {
                let key = parse_delete_record(&record.data);
                index.remove(key);
                data.append_tombstone(key);
            }
            RecordType::Commit => {
                // Batch commit marker
                data.sync();
            }
        }
        replayed_count += 1;
    }

    // Reset (truncate) the WAL after successful recovery
    reader.reset();
    println!("Recovery complete! Replayed {} operations.", replayed_count);
}
Key Insight: The WAL provides durability by ensuring that every committed write exists in a sequential log before acknowledging success. Recovery is simply replaying this log.
Book Reference: “Designing Data-Intensive Applications” Chapter 7: Transactions, covers WAL in the context of database transactions.
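The write side of the golden rule fits in a few lines. A minimal sketch using only the standard library; record framing (CRC, sequence number) is assumed to have been done by the caller, as described in the WAL format above.
use std::fs::OpenOptions;
use std::io::Write;
use std::path::Path;

/// Durable PUT ordering: append to the WAL, fsync, and only then acknowledge.
fn durable_put(wal_path: &Path, encoded_record: &[u8]) -> std::io::Result<()> {
    let mut wal = OpenOptions::new().create(true).append(true).open(wal_path)?;
    wal.write_all(encoded_record)?; // 1. Append the encoded record
    wal.sync_all()?;                // 2. fsync: the record is now on stable storage
    // 3. Only now may the caller acknowledge the write; the in-memory index
    //    and the data file can be updated lazily afterwards.
    Ok(())
}
A real engine keeps the WAL file open and amortizes the fsync across a batch of writes (group commit), which is exactly what the sync_every knob in Phase 5 does.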
Part 4: Memory Mapping (mmap) for Zero-Copy I/O
Memory mapping allows you to treat a file as if it were an array in memory. The operating system handles paging data in and out transparently.
MEMORY MAPPING (mmap) EXPLAINED
====================================================================
TRADITIONAL READ (with copying):
================================
User Space Kernel Space Disk
+---------------+ +---------------+ +------+
| Application | | Page Cache | | File |
| Buffer | | (RAM) | +------+
+---------------+ +---------------+ |
^ ^ |
| | |
+--- 3. Copy data to user space +--- 2. Read from disk -----+
| |
+--- 1. syscall read() -------------+
Total copies: 2 (disk -> page cache, page cache -> user buffer)
System calls: 1 (read)
MEMORY-MAPPED READ (zero-copy):
===============================
User Space Kernel Space Disk
+---------------+ +---------------+ +------+
| Application | | Page Cache | | File |
| (Virtual |------------------>| (RAM) |<-------------| Data |
| Memory) | Same physical | | Page fault +------+
+---------------+ pages! +---------------+ loads data
Total copies: 1 (disk -> page cache, then directly accessed)
System calls: 0 for reads (page faults handled by kernel)
HOW MMAP WORKS
====================================================================
1. SETUP: Request a memory mapping
let file = File::open("data.db")?;
let mmap = unsafe { Mmap::map(&file)? };
Virtual Address Space:
+---------------+---------------+---------------+
| Stack | ... | mmap region |
| (growing) | | (file backed) |
+---------------+---------------+---------------+
^
|
mmap.as_ptr()
2. FIRST ACCESS: Page fault triggers load
let value = mmap[offset..offset+len]; // First access to this page
CPU: "Address 0x7fff... is not in RAM!"
|
v
Kernel: "That's in the mmap region for data.db"
|
v
Kernel: Reads 4KB page from disk into page cache
|
v
Kernel: Maps page cache page to virtual address
|
v
CPU: Continues execution, data is now in RAM
3. SUBSEQUENT ACCESS: Cache hit
let value2 = mmap[offset+4096..]; // Same or nearby page
CPU: Page is in RAM, direct memory access!
NO system call, NO page fault, FASTEST possible!
MMAP ADVANTAGES FOR KV STORES
====================================================================
1. ZERO-COPY READS
- Data flows from disk to page cache ONCE
- Application reads directly from page cache
- No memcpy() to user buffers
2. OS-MANAGED CACHING
- Kernel's page cache acts as your cache
- LRU eviction handled automatically
- Shared between processes
3. LAZY LOADING
- Only pages that are accessed are loaded
- 100GB file can be mmap'd with 1GB RAM
- Perfect for sparse access patterns
4. SIMPLIFIED CODE
- Treat file as &[u8] slice
- Indexing, slicing, pattern matching all work
- No explicit read() calls or buffer management
MMAP PITFALLS
====================================================================
1. PAGE FAULTS ARE BLOCKING
- First access to a page may block for disk I/O
- Unpredictable latency spikes
- Solution: Use madvise() to pre-fault pages
2. WRITE VISIBILITY
- Writes may not be immediately visible on disk
- Must call msync() for durability
- mmap writes + crash = potential data loss
3. FILE GROWTH
- Cannot extend file size through mmap
- Must truncate/fallocate then remap
- Complex for growing files (use append-only)
4. 32-BIT ADDRESS SPACE LIMITS
- Only 4GB virtual address space on 32-bit
- Large files need chunked mapping
- Not an issue on 64-bit systems
MMAP FOR KV STORE READS (Zero-Copy Pattern)
====================================================================
struct DataFile {
mmap: Mmap, // Memory-mapped file
file: File, // For writes (append)
write_offset: AtomicU64,
}
impl DataFile {
fn get(&self, offset: u64, len: usize) -> &[u8] {
// ZERO COPY! Returns slice directly into mmap'd memory
&self.mmap[offset as usize..offset as usize + len]
}
fn put(&self, key: &[u8], value: &[u8]) -> u64 {
// Writes go through normal file I/O (append)
// Then the file is remapped to include new data
let offset = self.write_offset.fetch_add(
(key.len() + value.len() + HEADER_SIZE) as u64,
Ordering::SeqCst
);
// Write to file (buffered or direct)
self.file.write_at(&encode(key, value), offset); // write_at takes the buffer first, then the offset
offset
}
}
Book Reference: “The Linux Programming Interface” by Michael Kerrisk, Chapter 49: Memory Mappings, is the definitive guide to mmap.
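A minimal end-to-end example of the zero-copy read pattern, using the memmap2 crate that the snippets above already assume. The function name and signature are illustrative.
use memmap2::Mmap;
use std::fs::File;
use std::path::Path;

/// Reads `len` bytes at `offset` without a read() syscall: the slice points
/// straight into the kernel page cache. Safe here only because the data file
/// is append-only and never truncated while mapped.
fn read_value(path: &Path, offset: usize, len: usize) -> std::io::Result<Vec<u8>> {
    let file = File::open(path)?;
    let mmap = unsafe { Mmap::map(&file)? };
    let slice = &mmap[offset..offset + len]; // zero-copy view into the mapping
    Ok(slice.to_vec())                       // copy only at the API boundary
}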
Part 5: Append-Only Log Structure
Instead of modifying data in place (which requires seeking and risks partial writes), we append all changes to the end of a file. This is the foundation of both LSM trees and simpler bitcask-style engines.
APPEND-ONLY LOG FORMAT
====================================================================
ENTRY FORMAT:
+----------+----------+----------+----------+----------+----------+
| CRC32 | Tstamp | Key Len | Val Len | Key | Value |
| (4 bytes)| (8 bytes)| (4 bytes)| (4 bytes)| (var) | (var) |
+----------+----------+----------+----------+----------+----------+
| | | | | |
| | | | | +-- Value bytes
| | | | +-- Key bytes
| | | +-- Length of value (0 = tombstone)
| | +-- Length of key
| +-- Unix timestamp (for expiry, compaction ordering)
+-- CRC32 checksum of entire entry (for corruption detection)
EXAMPLE LOG FILE (data.log):
====================================================================
Offset Content
------ -------
0x0000 [CRC|ts|4|5|"name"|"Alice"] <- PUT name=Alice
0x001A [CRC|ts|4|3|"name"|"Bob"] <- PUT name=Bob (UPDATE!)
0x0030 [CRC|ts|3|6|"age"|"twenty"] <- PUT age=twenty
0x0048 [CRC|ts|4|0|"name"|] <- DELETE name (tombstone)
0x0058 [CRC|ts|5|12|"email"|"b@test.com"]<- PUT email=b@test.com
IN-MEMORY INDEX (HashMap<Key, IndexEntry>):
====================================================================
After processing the log above:
+--------+----------------------------------+
| Key | IndexEntry |
+--------+----------------------------------+
| "name" | DELETED (tombstone at 0x0048) |
| "age" | { offset: 0x0030, len: 6 } |
| "email"| { offset: 0x0058, len: 12 } |
+--------+----------------------------------+
Note: "name" has 3 entries in the log, but only the LAST one matters.
This is why compaction is needed - to reclaim space from old versions.
READ PATH:
====================================================================
fn get(&self, key: &[u8]) -> Option<&[u8]> {
// 1. Look up in index (O(1) hash lookup)
let entry = self.index.get(key)?;
// 2. Check if tombstone
if entry.is_tombstone() {
return None;
}
// 3. Read value from mmap'd file (zero-copy!)
let value = &self.mmap[entry.offset..entry.offset + entry.len];
Some(value)
}
WRITE PATH:
====================================================================
fn put(&mut self, key: &[u8], value: &[u8]) -> io::Result<()> {
// 1. Build entry
let entry = Entry::new(key, value, timestamp());
let encoded = entry.encode(); // With CRC32
// 2. Write to WAL (if using separate WAL)
self.wal.append(&encoded)?;
self.wal.sync()?; // CRITICAL: fsync before proceeding
// 3. Append to data file
let offset = self.data_file.append(&encoded)?;
// 4. Update index
self.index.insert(key.to_vec(), IndexEntry {
offset,
len: value.len(),
timestamp: entry.timestamp,
});
Ok(())
}
STARTUP (Index Reconstruction):
====================================================================
fn load_index_from_log(data_path: &Path) -> HashMap<Vec<u8>, IndexEntry> {
let mut index = HashMap::new();
let file = File::open(data_path)?;
let mmap = unsafe { Mmap::map(&file)? };
let mut offset = 0;
while offset < mmap.len() {
// Parse entry at current offset
let entry = Entry::decode(&mmap[offset..])?;
// Validate CRC
if entry.crc != calculate_crc(&entry) {
// Corruption! Stop here.
break;
}
// Update index (newer entries overwrite older)
if entry.is_tombstone() {
index.remove(&entry.key);
} else {
index.insert(entry.key.clone(), IndexEntry {
offset: offset + HEADER_SIZE,
len: entry.value.len(),
timestamp: entry.timestamp,
});
}
offset += entry.total_size();
}
index
}
Why Append-Only?
- Sequential writes are fast: SSDs and HDDs both perform much better with sequential I/O
- Crash safety: Partial writes only affect the end of the file, never existing data
- Simplicity: No need to manage free space within the file
- Concurrency: Readers never see partial writes; they read up to a known offset
Part 6: Compaction and Garbage Collection
As data is updated and deleted, the log accumulates “garbage” - old versions of keys that are no longer needed. Compaction reclaims this space.
COMPACTION PROCESS
====================================================================
BEFORE COMPACTION (data.log has 1000 entries, 500 are obsolete):
data.log (10 MB):
+----------+----------+----------+----------+----------+----------+
| name= | age=30 | name= | email= | name= | email= |
| Alice | (LIVE) | Bob | a@b.com | Charlie | c@d.com |
| (DEAD) | | (DEAD) | (DEAD) | (LIVE) | (LIVE) |
+----------+----------+----------+----------+----------+----------+
^ ^ ^
| | |
Only these 3 entries are still valid! ------+----------+
COMPACTION ALGORITHM:
1. Create new data file (data.log.compact):
+----------+----------+----------+
| age=30 | name= | email= |
| (LIVE) | Charlie | c@d.com |
+----------+----------+----------+
Only write entries that are in the current index!
2. Build new index pointing to new file:
+--------+----------------------------------+
| Key | IndexEntry (new offsets) |
+--------+----------------------------------+
| "age" | { offset: 0x0000, len: 2 } |
| "name" | { offset: 0x0014, len: 7 } |
| "email"| { offset: 0x002C, len: 7 } |
+--------+----------------------------------+
3. Atomically swap files:
- Rename data.log.compact -> data.log
- Update mmap to new file
- Delete old data.log
RESULT: 10 MB -> 3 MB (70% space savings!)
COMPACTION STRATEGIES
====================================================================
1. FULL COMPACTION (Simple, used in Bitcask):
+-------------------+ +-------------------+
| OLD FILE | --> | NEW FILE |
| (all data) | | (only live data) |
+-------------------+ +-------------------+
- Rewrites entire dataset
- Simple to implement
- High write amplification for large datasets
- Good for small-medium datasets
2. LEVELED COMPACTION (Used in LevelDB/RocksDB):
Level 0: [SST][SST][SST][SST] <- Raw memtable dumps
\ | /
\ | /
v v v
Level 1: [SST][SST][SST] <- Compacted, non-overlapping
|
v
Level 2: [SST][SST][SST][SST][SST][SST] <- 10x larger
- Each level is 10x the size of the previous
- Only overlapping SSTables are compacted together
- Bounds write amplification to O(levels)
3. SIZE-TIERED COMPACTION (Used in Cassandra):
Tier 0: [1MB][1MB][1MB][1MB] <- 4 similar-sized files
\ | | /
\ | | /
vvvv
Tier 1: [4MB][4MB][4MB][4MB] <- Merge when 4 similar-sized
|
v
Tier 2: [16MB][16MB][16MB]
- Merge files of similar size
- Better write amplification than leveled
- Worse space amplification
CONCURRENT COMPACTION (No Read Blocking)
====================================================================
The key insight: Old files remain readable until compaction completes.
Timeline:
---------
T0: Reader starts reading from old file at offset 100
T1: Compaction starts, creates new file
T2: Reader reads value from old file (still valid!)
T3: Compaction writes new file
T4: Reader finishes, releases reference to old file
T5: Compaction atomically swaps to new file
T6: Old file deleted (all readers finished)
Implementation using reference counting:
----------------------------------------
struct DataFile {
mmap: Mmap,
ref_count: AtomicUsize, // Incremented for each reader
}
impl DataFile {
fn get_reader(&self) -> DataFileReader {
self.ref_count.fetch_add(1, Ordering::AcqRel);
DataFileReader { file: self }
}
}
impl Drop for DataFileReader {
fn drop(&mut self) {
let prev = self.file.ref_count.fetch_sub(1, Ordering::AcqRel);
if prev == 1 {
// Last reader, safe to delete file
}
}
}
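In Rust, the hand-rolled reference count above is exactly what Arc already provides: readers clone an Arc<DataFile>, the compactor swaps a new one into place, and the old file is dropped (and can be deleted) only when the last reader finishes. A minimal sketch of that idea; the types and fields are illustrative.
use std::sync::{Arc, RwLock};

struct DataFile { /* mmap, file handle, ... */ }

struct Store {
    // The currently active data file. Readers clone the Arc and keep using
    // the old file even after the compactor swaps in a new one.
    active: RwLock<Arc<DataFile>>,
}

impl Store {
    fn reader(&self) -> Arc<DataFile> {
        Arc::clone(&self.active.read().unwrap())
    }

    fn swap_after_compaction(&self, new_file: DataFile) {
        // Readers holding the old Arc keep it alive; once the last clone is
        // dropped, the old DataFile (and its file handle) is released.
        *self.active.write().unwrap() = Arc::new(new_file);
    }
}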
Part 7: Concurrent Index Data Structures
The index maps keys to their location in the data file. For high concurrency, you need a thread-safe data structure.
INDEX DATA STRUCTURE OPTIONS
====================================================================
1. RWLOCK<HASHMAP> (Simple, Moderate Concurrency)
+------------------+
| RwLock |
| +------------+ |
| | HashMap | |
| | key -> off | |
| +------------+ |
+------------------+
Pros: Simple, uses std library
Cons: Writers block all readers
Throughput: ~500K ops/sec
Use when: Prototyping, low-medium load
2. DASHMAP (Sharded HashMap, Good Concurrency)
+--------+--------+--------+--------+
| Shard0 | Shard1 | Shard2 | Shard3 |
| RwLock | RwLock | RwLock | RwLock |
| Hash | Hash | Hash | Hash |
+--------+--------+--------+--------+
Key hash determines shard: shard = hash(key) % num_shards
Pros: Writers only block one shard
Cons: Still uses locks
Throughput: ~2-5M ops/sec
Use when: Good default choice
3. SKIP LIST (Lock-Free, Excellent Concurrency)
Level 3: [HEAD] ------------------> [50] ----------------------> [NIL]
| |
Level 2: [HEAD] --------> [25] ----> [50] --------> [75] ------> [NIL]
| | | |
Level 1: [HEAD] -> [10] -> [25] -> [35] -> [50] -> [60] -> [75] -> [NIL]
| | | | | | |
Level 0: [HEAD]->[10]->[20]->[25]->[30]->[35]->[50]->[55]->[60]->[70]->[75]->[NIL]
Pros: Lock-free reads AND writes (with CAS)
Supports range queries
Cons: Complex implementation
Memory overhead (pointers)
Throughput: ~5-10M ops/sec
Use when: High concurrency + range queries
4. CONCURRENT B-TREE (Lock Coupling)
+-------+
| root | <- Lock root
+---+---+
|
+-----------+-----------+
| |
+---+---+ +---+---+
| node1 | | node2 |
+---+---+ +---+---+
| |
v v
Release root lock once child is locked
("Lock coupling" or "crabbing")
Pros: Good for range queries
Cache-friendly
Cons: Complex implementation
Lock coupling overhead
Throughput: ~3-8M ops/sec
Use when: Range queries are primary
SKIP LIST IMPLEMENTATION SKETCH
====================================================================
const MAX_LEVEL: usize = 32;
struct SkipList<K, V> {
head: AtomicPtr<Node<K, V>>,
level: AtomicUsize,
}
struct Node<K, V> {
key: K,
value: AtomicPtr<V>, // Allows lock-free update
next: [AtomicPtr<Node<K, V>>; MAX_LEVEL],
level: usize,
}
impl<K: Ord, V> SkipList<K, V> {
fn get(&self, key: &K) -> Option<&V> {
let mut current = self.head.load(Ordering::Acquire);
// Start from top level, go down
for level in (0..=self.level.load(Ordering::Relaxed)).rev() {
// Go right while next key is less than target
loop {
let next = unsafe { (*current).next[level].load(Ordering::Acquire) };
if next.is_null() {
break;
}
let next_key = unsafe { &(*next).key };
if next_key >= key {
break;
}
current = next;
}
}
// Check if we found it
let next = unsafe { (*current).next[0].load(Ordering::Acquire) };
if !next.is_null() && unsafe { &(*next).key } == key {
let value_ptr = unsafe { (*next).value.load(Ordering::Acquire) };
Some(unsafe { &*value_ptr })
} else {
None
}
}
fn insert(&self, key: K, value: V) -> Option<V> {
// Complex CAS-based insertion omitted from this sketch;
// see crossbeam-skiplist for a production implementation.
let _ = (key, value);
unimplemented!()
}
}
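As the comment in insert() hints, the correct lock-free insertion is the hard part; the crossbeam-skiplist crate provides a well-tested implementation of exactly this structure. A usage sketch (API as of recent crossbeam-skiplist releases; check the crate docs for your version):
use crossbeam_skiplist::SkipMap;
use std::sync::Arc;
use std::thread;

fn main() {
    // Concurrent ordered map: lock-free reads and writes, supports range scans.
    let index: Arc<SkipMap<Vec<u8>, u64>> = Arc::new(SkipMap::new());

    let writer = {
        let index = Arc::clone(&index);
        thread::spawn(move || {
            index.insert(b"name".to_vec(), 0x1234); // key -> file offset
        })
    };
    writer.join().unwrap();

    if let Some(entry) = index.get(&b"name".to_vec()) {
        println!("offset = {:#x}", entry.value());
    }
}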
Part 8: Cache-Friendly Data Structures
Modern CPUs are much faster than memory. Cache-friendly data structures can be 10-100x faster.
CPU CACHE HIERARCHY
====================================================================
+------------------------------------------------------------------+
| CPU |
| +----------------------------------------------------------+ |
| | Core 0 Core 1 | |
| | +--------+ +--------+ | |
| | | L1d | 32KB, 4 cycles | L1d | | |
| | +--------+ +--------+ | |
| | +--------+ +--------+ | |
| | | L1i | 32KB, 4 cycles | L1i | | |
| | +--------+ +--------+ | |
| | +--------+ +--------+ | |
| | | L2 | 256KB, 12 cycles | L2 | | |
| | +--------+ +--------+ | |
| +----------------------------------------------------------+ |
| | |
| +----+----+ |
| | L3 | 8MB shared, 40 cycles |
| +---------+ |
+------------------------------------------------------------------+
|
| 100+ cycles
v
+----------+
| RAM | 32GB, 100-300 cycles
+----------+
CACHE LINE EFFECTS
====================================================================
Cache line = 64 bytes (on most modern CPUs)
GOOD: Sequential access (cache prefetcher happy)
for item in array.iter() {
// Each cache line brings 64 bytes
// If items are 8 bytes, get 8 items per cache load
process(item);
}
Access pattern: [----64 bytes----][----64 bytes----][----64 bytes----]
Cache loads: ^ ^ ^
1 2 3
BAD: Pointer chasing (linked list)
let mut node = head;
while let Some(n) = node {
// Each node is in a DIFFERENT cache line!
// Every access is a cache miss
process(n);
node = n.next;
}
Access pattern: [node1]......[node2]......[node3]......
Cache loads: ^ ^ ^
1 2 3
RESULT: 10-100x slower!
ARENA ALLOCATION FOR CACHE LOCALITY
====================================================================
HEAP ALLOCATION (scattered):
+------+ +------+ +------+ +------+
| node | | node | | node | | node |
| 0x100| |0x8000| |0x2000| |0x9000| <- Random addresses!
+------+ +------+ +------+ +------+
Each node access = likely cache miss
ARENA ALLOCATION (contiguous):
+------+------+------+------+------+------+------+------+
| node | node | node | node | node | node | node | node |
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+------+------+------+------+------+------+------+------+
^----- All in same or adjacent cache lines! -----^
Sequential access = CPU prefetcher loads ahead
Result: MUCH faster iteration
INDEX ENTRY LAYOUT FOR CACHE EFFICIENCY
====================================================================
BAD: Large entries, few per cache line
struct IndexEntryBad {
key: String, // 24 bytes (ptr, len, cap)
offset: u64, // 8 bytes
length: u32, // 4 bytes
timestamp: u64, // 8 bytes
flags: u32, // 4 bytes
_padding: [u8; 16], // 16 bytes
}
// Total: 64 bytes = 1 entry per cache line
GOOD: Compact entries, many per cache line
struct IndexEntryGood {
key_hash: u64, // 8 bytes (look up full key only on match)
offset: u64, // 8 bytes
length: u32, // 4 bytes
// timestamp and flags moved to data file
}
// Total: 20 bytes of fields (24 with alignment padding) = 2-3 entries per cache line
For 1 million keys:
Bad: 64 MB index, 1M cache lines touched
Good: 24 MB index, ~375K cache lines touched
Result: roughly 2.7x better cache utilization!
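These layout numbers are easy to verify with std::mem::size_of. Note that the compiler pads a struct to the alignment of its largest field, so the compact entry occupies 24 bytes unless you opt into #[repr(packed)] (usually not worth the unaligned-access cost):
use std::mem::size_of;

#[allow(dead_code)]
struct IndexEntryBad {
    key: String,        // 24 bytes (ptr, len, cap)
    offset: u64,
    length: u32,
    timestamp: u64,
    flags: u32,
    _padding: [u8; 16],
}

#[allow(dead_code)]
struct IndexEntryGood {
    key_hash: u64,      // full key compared only on a hash match
    offset: u64,
    length: u32,
}

fn main() {
    println!("bad  = {} bytes", size_of::<IndexEntryBad>());  // 64
    println!("good = {} bytes", size_of::<IndexEntryGood>()); // 24 (20 bytes of fields + padding)
}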
Part 9: Tail Latency and the Hot Path
In a high-performance system, average latency matters less than tail latency (p99, p999). A single slow operation can cascade into user-visible delays.
LATENCY DISTRIBUTION
====================================================================
Good System Bad System
Latency (us) (tight distribution) (long tail)
|
1000| *
| *
500| *
| * * *
200| * * * * *
| * * * * * * * * * *
100| * * * * * * * * * * * * *
| * * * * * * * * * * * * * * * * *
50| * * * * * * * * * * * * * * * * * * * * * * * *
|* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
10|* * * * * * * * * * * * *
+----------------------------------------------------------
p50 p90 p99 p999 p50 p90 p99 p999
Average: 45us Average: 85us
p99: 55us p99: 500us <- 10x WORSE!
p999: 60us p999: 1000us
The "hot path" = the code executed for every operation
Any slowdown on the hot path affects ALL requests.
COMMON TAIL LATENCY KILLERS
====================================================================
1. LOCK CONTENTION
Thread 1: acquire_lock() -> 10us operation -> release_lock()
Thread 2: acquire_lock() -> BLOCKED!!! -> 10us operation
Thread 3: acquire_lock() -> BLOCKED!!! -> BLOCKED!!! -> 10us
Solution: Lock-free data structures, sharding, read-write locks
2. MEMORY ALLOCATION
Hot path: Box::new(value) <- May call mmap(), take global lock
Solution: Arena allocation, object pooling
fn handle_request(&self, req: Request) {
let arena = Arena::new(4096); // One alloc per request
let parsed = parse_into_arena(&arena, req);
let result = process(parsed);
// Arena dropped, all memory freed at once
}
3. DISK I/O ON READ PATH
If index doesn't fit in RAM, reads may hit disk.
Solution:
- Keep index in RAM (or mmap it)
- Use bloom filters to avoid disk reads for missing keys
- Warm cache on startup
4. COMPACTION BLOCKING READS
Bad: Lock index during compaction (seconds of latency)
Good: Copy-on-write index, atomic swap after compaction
5. GC PAUSES (in other languages)
Rust advantage: No GC! Deterministic memory management.
HOT PATH OPTIMIZATION CHECKLIST
====================================================================
[ ] No allocations (use arena or pre-allocated buffers)
[ ] No locks (or use reader-writer locks, never block readers)
[ ] No syscalls (use mmap for reads)
[ ] No I/O (all index in RAM)
[ ] No unbounded loops (O(1) or O(log n) only)
[ ] Cache-aligned data structures (avoid false sharing)
[ ] Branch prediction friendly (likely paths first)
MEASURING LATENCY IN RUST
====================================================================
use std::time::Instant;
use hdrhistogram::Histogram;
fn benchmark_operations(kvstore: &KvStore, ops: usize) {
let mut histogram = Histogram::<u64>::new(3).unwrap();
for i in 0..ops {
let key = format!("key{}", i);
let start = Instant::now();
let _ = kvstore.get(&key);
let elapsed = start.elapsed().as_nanos() as u64;
histogram.record(elapsed).unwrap();
}
println!("Latency Distribution:");
println!(" p50: {} ns", histogram.value_at_percentile(50.0));
println!(" p90: {} ns", histogram.value_at_percentile(90.0));
println!(" p99: {} ns", histogram.value_at_percentile(99.0));
println!(" p999: {} ns", histogram.value_at_percentile(99.9));
println!(" max: {} ns", histogram.max());
}
Part 10: Real-World KV Store Architectures
Let’s examine how production systems implement these concepts.
REDIS ARCHITECTURE (In-Memory)
====================================================================
+------------------------------------------------------------------+
| REDIS |
+------------------------------------------------------------------+
| |
| SINGLE-THREADED EVENT LOOP (no locks needed!) |
| +-------------------------------------------------------------+ |
| | while true { | |
| | events = epoll_wait(fd, ...) | |
| | for event in events { | |
| | handle_client_request(event) | |
| | } | |
| | } | |
| +-------------------------------------------------------------+ |
| |
| IN-MEMORY DATA STRUCTURES |
| +------------------+ +------------------+ +------------------+ |
| | Dict (HashMap) | | Sorted Set | | List | |
| | for Strings | | (Skip List) | | (Linked List) | |
| +------------------+ +------------------+ +------------------+ |
| |
| PERSISTENCE (Optional) |
| +------------------+ +------------------+ |
| | RDB (Snapshot) | | AOF (Append Log) | |
| | Point-in-time | | Every write | |
| +------------------+ +------------------+ |
+------------------------------------------------------------------+
Performance: 100K+ ops/sec single-threaded
Limitation: Data must fit in RAM
ROCKSDB ARCHITECTURE (LSM-based)
====================================================================
+------------------------------------------------------------------+
| ROCKSDB |
+------------------------------------------------------------------+
| |
| WRITE PATH |
| +-------------------------------------------------------------+ |
| | 1. Write to WAL (fsync) | |
| | 2. Insert into MemTable (SkipList in RAM) | |
| | 3. When MemTable full -> Flush to Level 0 SSTable | |
| +-------------------------------------------------------------+ |
| |
| +------------------+ |
| | MemTable | <- Mutable, in RAM |
| | (Active) | |
| +------------------+ |
| +------------------+ |
| | MemTable | <- Immutable, being flushed |
| | (Frozen) | |
| +------------------+ |
| | |
| v |
| +------------------+------------------+------------------+ |
| | L0 SST | L0 SST | L0 SST | L0 SST | |
| +------------------+------------------+------------------+ |
| | |
| v (Compaction) |
| +------------------+------------------+------------------+ |
| | L1 SST | L1 SST | L1 SST | |
| +------------------+------------------+------------------+ |
| | |
| v (Compaction) |
| +------------------+------------------+------------------+ ... |
| | L2 SST | L2 SST | L2 SST | L2 SST | L2 SST | L2 SST | |
| +------------------+------------------+------------------+ ... |
| |
+------------------------------------------------------------------+
Performance: 500K+ random writes/sec, 100K+ random reads/sec
Strength: Handles datasets larger than RAM
BITCASK ARCHITECTURE (Append-Only + In-Memory Index)
====================================================================
+------------------------------------------------------------------+
| BITCASK |
+------------------------------------------------------------------+
| |
| WRITE PATH (Simple!) |
| +-------------------------------------------------------------+ |
| | 1. Append entry to data file | |
| | 2. Update in-memory hash index | |
| +-------------------------------------------------------------+ |
| |
| IN-MEMORY INDEX |
| +------------------+ |
| | HashMap | |
| | key -> { | |
| | file_id, | |
| | offset, | |
| | size, | |
| | timestamp | |
| | } | |
| +------------------+ |
| | |
| | Points to |
| v |
| DATA FILES (append-only) |
| +------------------+------------------+------------------+ |
| | data.0 (old) | data.1 | data.2 (active) | |
| +------------------+------------------+------------------+ |
| |
| COMPACTION |
| +-------------------------------------------------------------+ |
| | Merge old files, keep only latest value per key | |
| | Atomic file swap when complete | |
| +-------------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
Performance: Very fast writes, single-seek reads
Limitation: All keys must fit in RAM (values on disk)
This is the architecture we'll implement!
Book Reference: “Database Internals” by Alex Petrov provides detailed coverage of RocksDB, LevelDB, and B-Tree based systems.
Real World Outcome
When complete, you will have built a production-quality key-value store with impressive performance characteristics:
$ ./my_kv_store --bench
╔══════════════════════════════════════════════════════════════════╗
║ High-Performance KV Store Benchmark ║
╠══════════════════════════════════════════════════════════════════╣
║ System: Apple M2 Pro, 32GB RAM, APFS SSD ║
║ Build: Release mode, LTO enabled ║
╚══════════════════════════════════════════════════════════════════╝
[Phase 1: Write Performance]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Writing 1,000,000 keys (16-byte key, 100-byte value)...
Progress: [████████████████████████████████████████] 100%
Total time: 0.453 seconds
Throughput: 2.21 million writes/sec
Avg latency: 452 ns/write
p99 latency: 1.2 us/write
Data written: 116 MB
Write speed: 256 MB/s
WAL writes: 1,000,000
WAL fsync: 1,000 (every 1000 writes)
Index updates: 1,000,000
[Phase 2: Read Performance]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Reading 1,000,000 random keys...
Progress: [████████████████████████████████████████] 100%
Total time: 0.312 seconds
Throughput: 3.21 million reads/sec
Avg latency: 312 ns/read
p99 latency: 850 ns/read
Index lookups: 1,000,000
Cache hits: 99.2% (mmap page cache)
Zero-copy: 100% (no memcpy in hot path)
[Phase 3: Mixed Workload (80% read, 20% write)]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Running 1,000,000 mixed operations...
Progress: [████████████████████████████████████████] 100%
Total time: 0.372 seconds
Throughput: 2.69 million ops/sec
Read latency: 310 ns avg
Write latency: 498 ns avg
[Phase 4: Concurrent Access (8 threads)]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Running 8,000,000 operations across 8 threads...
Progress: [████████████████████████████████████████] 100%
Total time: 0.891 seconds
Throughput: 8.98 million ops/sec (8 threads combined)
Per-thread: 1.12 million ops/sec
Lock contention: 0.3% (lock-free reads)
[Phase 5: Crash Recovery Test]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Simulating crash after 500,000 writes...
Kill signal sent: SIGKILL
Restarting and recovering...
WAL replay: 500,000 entries
Index reconstruction: 0.234 seconds
Verifying data integrity...
Keys recovered: 500,000 / 500,000 (100%)
Data corruption: 0 entries
✓ RECOVERY SUCCESSFUL
[Phase 6: Compaction Test]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Writing 1,000,000 keys, then updating 500,000 of them...
Before compaction:
Data file size: 232 MB
Live data: 116 MB (50%)
Dead data: 116 MB (50%)
Running compaction...
Compaction time: 1.234 seconds
After compaction:
Data file size: 116 MB
Space reclaimed: 116 MB (50%)
Reads during compaction: NO BLOCKING
Writes during compaction: NO BLOCKING
═══════════════════════════════════════════════════════════════════
PERFORMANCE SUMMARY
╔═══════════════════════════════════════════════════════════════╗
║ Metric │ Result │ Target ║
╠═══════════════════════════════════════════════════════════════╣
║ Write throughput │ 2.21M ops/sec │ 1M ops/sec ✓ ║
║ Read throughput │ 3.21M ops/sec │ 2M ops/sec ✓ ║
║ Mixed (8 threads) │ 8.98M ops/sec │ 5M ops/sec ✓ ║
║ Write latency p99 │ 1.2 us │ < 5 us ✓ ║
║ Read latency p99 │ 850 ns │ < 2 us ✓ ║
║ Crash recovery │ 100% success │ 100% ✓ ║
║ Compaction blocking │ None │ None ✓ ║
╚═══════════════════════════════════════════════════════════════╝
Memory Usage:
Index (1M keys): 48 MB
Arena allocator: 8 MB (reused per request)
mmap (data file): Virtual only (116 MB mapped, ~4 MB resident)
Total RSS: 62 MB
$ ./my_kv_store interactive
╔══════════════════════════════════════════════════════════════════╗
║ RustKV Interactive Shell ║
╠══════════════════════════════════════════════════════════════════╣
║ Commands: GET <key>, SET <key> <value>, DEL <key>, STATS, QUIT ║
╚══════════════════════════════════════════════════════════════════╝
rustkv> SET user:1 {"name":"Alice","age":30}
OK (312 ns)
rustkv> GET user:1
{"name":"Alice","age":30} (198 ns)
rustkv> SET user:2 {"name":"Bob","age":25}
OK (287 ns)
rustkv> STATS
╔═══════════════════════════════════════════════════════════════╗
║ Database Statistics ║
╠═══════════════════════════════════════════════════════════════╣
║ Keys: 2 ║
║ Data file size: 156 bytes ║
║ Index memory: 128 bytes ║
║ Operations: 4 (2 writes, 2 reads) ║
║ Uptime: 12.3 seconds ║
╚═══════════════════════════════════════════════════════════════╝
rustkv> QUIT
Goodbye! (Data synced to disk)
Complete Project Specification
You will build a Bitcask-style key-value store with the following components:
Core Features
- API Operations
  - GET(key: &[u8]) -> Option<Vec<u8>> - Retrieve a value
  - PUT(key: &[u8], value: &[u8]) -> Result<()> - Store a key-value pair
  - DELETE(key: &[u8]) -> Result<()> - Remove a key
  - SCAN(start: &[u8], end: &[u8]) -> Iterator<Item=(Key, Value)> - Range query (extension)
- Durability
- Write-ahead logging with a configurable fsync policy (see the SyncPolicy sketch after this list)
- Crash recovery by replaying WAL on startup
- Atomic compaction with no data loss
- Performance
- Arena allocator for index nodes
- Memory-mapped data file for zero-copy reads
- Lock-free or fine-grained locking for concurrent access
- Operations
- Compaction to reclaim space from deleted/updated keys
- Statistics and metrics reporting
- Configurable memory limits
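One way to express the “configurable fsync policy” listed under Durability is a small enum that the WAL consults on every append; Phase 5 below implements the EveryN flavour via its sync_every counter. This is a sketch of one possible design, not a fixed part of the spec.
/// When the write-ahead log calls fsync. Stronger settings trade throughput
/// for a smaller window of acknowledged-but-unsynced writes.
pub enum SyncPolicy {
    /// fsync after every write: no acknowledged write is ever lost.
    Always,
    /// fsync after every N writes: bounded loss window, much higher throughput.
    EveryN(usize),
    /// Leave it to the OS's periodic page-cache writeback: fastest, weakest.
    OsManaged,
}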
Solution Architecture
PROJECT STRUCTURE
====================================================================
rustkv/
├── Cargo.toml
├── src/
│ ├── lib.rs # Public API and KvStore struct
│ ├── index/
│ │ ├── mod.rs # Index trait and implementations
│ │ ├── hashmap.rs # Simple RwLock<HashMap> index
│ │ └── skiplist.rs # Lock-free skip list (extension)
│ ├── storage/
│ │ ├── mod.rs # Storage trait
│ │ ├── data_file.rs # Append-only data file with mmap
│ │ ├── wal.rs # Write-ahead log
│ │ └── compaction.rs # Background compaction
│ ├── allocator/
│ │ └── arena.rs # Arena allocator for index
│ ├── format/
│ │ ├── mod.rs # Entry encoding/decoding
│ │ └── crc.rs # CRC32 checksum
│ └── cli.rs # Command-line interface
├── benches/
│ └── throughput.rs # Criterion benchmarks
└── tests/
├── basic_ops.rs # Unit tests
├── crash_recovery.rs # Crash simulation tests
└── concurrent.rs # Multi-threaded tests
COMPONENT DIAGRAM
====================================================================
┌─────────────────────────────────────────────────────────────────┐
│ KvStore │
│ │
│ ┌─────────────────────────────────────────────────────────────┐│
│ │ API Layer ││
│ │ pub fn get(&self, key: &[u8]) -> Option<Vec<u8>> ││
│ │ pub fn put(&mut self, key: &[u8], val: &[u8]) -> Result ││
│ │ pub fn delete(&mut self, key: &[u8]) -> Result ││
│ └─────────────────────────────────────────────────────────────┘│
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐│
│ │ Index Layer ││
│ │ ┌─────────────┐ ││
│ │ │ Index │ HashMap<Key, IndexEntry> ││
│ │ │ (in RAM) │ or SkipList<Key, IndexEntry> ││
│ │ │ │ ││
│ │ │ Allocated │ ← Uses Arena Allocator ││
│ │ │ in Arena │ for cache locality ││
│ │ └─────────────┘ ││
│ └─────────────────────────────────────────────────────────────┘│
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐│
│ │ Storage Layer ││
│ │ ││
│ │ ┌──────────┐ ┌──────────────┐ ┌──────────────┐ ││
│ │ │ WAL │ │ Data File │ │ Compaction │ ││
│ │ │ │ │ │ │ Thread │ ││
│ │ │ Durability│ │ mmap'd for │ │ │ ││
│ │ │ guarantee │ │ zero-copy │ │ Reclaims │ ││
│ │ │ │ │ reads │ │ space │ ││
│ │ └──────────┘ └──────────────┘ └──────────────┘ ││
│ └─────────────────────────────────────────────────────────────┘│
│ │ │
│ ▼ │
│ File System │
└─────────────────────────────────────────────────────────────────┘
DATA FLOW: PUT OPERATION
====================================================================
User: kvstore.put(b"name", b"Alice")
│
▼
┌───────────────────────────────────────────────────────────────┐
│ 1. Encode Entry │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ [CRC32][Timestamp][KeyLen][ValLen][Key][Value] │ │
│ │ [4B] [8B] [4B] [4B] [4B] [5B] │ │
│ └──────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ 2. Write to WAL (with fsync) │
│ wal.log ────────────────────────────────────────────── │
│ │ │
│ ▼ │
│ 3. Append to Data File │
│ data.log ──────────────────────────────────────[ENTRY]── │
│ Returns offset = 1234 │
│ │ │
│ ▼ │
│ 4. Update Index │
│ index.insert("name", IndexEntry { offset: 1234, len: 5 }) │
│ │ │
│ ▼ │
│ 5. Return Ok(()) │
└───────────────────────────────────────────────────────────────┘
DATA FLOW: GET OPERATION
====================================================================
User: kvstore.get(b"name")
│
▼
┌───────────────────────────────────────────────────────────────┐
│ 1. Index Lookup (O(1) hash or O(log n) skip list) │
│ let entry = index.get("name")?; // IndexEntry │
│ entry = { offset: 1234, len: 5 } │
│ │ │
│ ▼ │
│ 2. Read from mmap (ZERO COPY!) │
│ let value = &mmap[1234..1234+5]; // Slice into mmap │
│ // No memcpy! Direct pointer to kernel page cache │
│ │ │
│ ▼ │
│ 3. Return Some(value.to_vec()) │
│ // Only copy when returning to user │
│ // Could return &[u8] for even more zero-copy │
└───────────────────────────────────────────────────────────────┘
Phased Implementation Guide
Phase 1: Simple In-Memory Hash Map with Persistence
Goal: Get a working key-value store that persists to disk.
Duration: 2-3 days
// src/lib.rs
use crate::format::Entry; // Entry encode/decode is defined in Phase 2
use std::collections::HashMap;
use std::fs::{File, OpenOptions};
use std::io::{BufReader, BufWriter, Write};
use std::path::{Path, PathBuf};
pub struct KvStore {
index: HashMap<Vec<u8>, Vec<u8>>,
data_file: BufWriter<File>,
data_path: PathBuf,
}
impl KvStore {
pub fn open(path: &Path) -> std::io::Result<Self> {
let data_path = path.join("data.log");
// Load existing data
let mut index = HashMap::new();
if data_path.exists() {
let file = File::open(&data_path)?;
let mut reader = BufReader::new(file);
// Parse entries and build index
while let Ok(entry) = Entry::decode(&mut reader) {
if entry.is_tombstone() {
index.remove(&entry.key);
} else {
index.insert(entry.key, entry.value);
}
}
}
// Open for appending
let file = OpenOptions::new()
.create(true)
.append(true)
.open(&data_path)?;
Ok(KvStore {
index,
data_file: BufWriter::new(file),
data_path,
})
}
pub fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
self.index.get(key).cloned()
}
pub fn put(&mut self, key: &[u8], value: &[u8]) -> std::io::Result<()> {
// Write to file
let entry = Entry::new(key, value);
entry.encode(&mut self.data_file)?;
self.data_file.flush()?;
// Update index
self.index.insert(key.to_vec(), value.to_vec());
Ok(())
}
pub fn delete(&mut self, key: &[u8]) -> std::io::Result<()> {
let entry = Entry::tombstone(key);
entry.encode(&mut self.data_file)?;
self.data_file.flush()?;
self.index.remove(key);
Ok(())
}
}
Checkpoint: Run basic tests for GET, PUT, DELETE. Verify persistence across restarts.
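A test for this checkpoint might look like the following sketch. It assumes the crate is named rustkv (as in the project layout above) and that tempfile is a dev-dependency; adjust names to your setup.
// tests/basic_ops.rs (sketch)
use rustkv::KvStore;

#[test]
fn values_survive_reopen() {
    let dir = tempfile::tempdir().unwrap();

    {
        let mut store = KvStore::open(dir.path()).unwrap();
        store.put(b"name", b"Alice").unwrap();
        store.delete(b"name").unwrap();
        store.put(b"age", b"30").unwrap();
    } // store dropped: buffers flushed, file closed

    let store = KvStore::open(dir.path()).unwrap();
    assert_eq!(store.get(b"age"), Some(b"30".to_vec()));
    assert_eq!(store.get(b"name"), None); // tombstone respected after restart
}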
Phase 2: Append-Only Log File Format
Goal: Define a robust, self-describing log format with checksums.
Duration: 2-3 days
// src/format/mod.rs
use crc32fast::Hasher;
const HEADER_SIZE: usize = 20; // CRC(4) + Timestamp(8) + KeyLen(4) + ValLen(4)
#[derive(Debug, Clone)] // Clone is needed by the WAL record in Phase 5
pub struct Entry {
pub crc: u32,
pub timestamp: u64,
pub key: Vec<u8>,
pub value: Vec<u8>, // Empty for tombstone
}
impl Entry {
pub fn new(key: &[u8], value: &[u8]) -> Self {
let timestamp = std::time::SystemTime::now()
.duration_since(std::time::UNIX_EPOCH)
.unwrap()
.as_millis() as u64;
let mut entry = Entry {
crc: 0,
timestamp,
key: key.to_vec(),
value: value.to_vec(),
};
entry.crc = entry.calculate_crc();
entry
}
pub fn tombstone(key: &[u8]) -> Self {
Entry::new(key, &[])
}
pub fn is_tombstone(&self) -> bool {
self.value.is_empty()
}
pub fn calculate_crc(&self) -> u32 {
let mut hasher = Hasher::new();
hasher.update(&self.timestamp.to_le_bytes());
hasher.update(&(self.key.len() as u32).to_le_bytes());
hasher.update(&(self.value.len() as u32).to_le_bytes());
hasher.update(&self.key);
hasher.update(&self.value);
hasher.finalize()
}
pub fn verify(&self) -> bool {
self.crc == self.calculate_crc()
}
pub fn encode<W: Write>(&self, writer: &mut W) -> std::io::Result<usize> {
let mut written = 0;
writer.write_all(&self.crc.to_le_bytes())?;
written += 4;
writer.write_all(&self.timestamp.to_le_bytes())?;
written += 8;
writer.write_all(&(self.key.len() as u32).to_le_bytes())?;
written += 4;
writer.write_all(&(self.value.len() as u32).to_le_bytes())?;
written += 4;
writer.write_all(&self.key)?;
written += self.key.len();
writer.write_all(&self.value)?;
written += self.value.len();
Ok(written)
}
pub fn decode<R: Read>(reader: &mut R) -> std::io::Result<Self> {
let mut header = [0u8; HEADER_SIZE];
reader.read_exact(&mut header)?;
let crc = u32::from_le_bytes(header[0..4].try_into().unwrap());
let timestamp = u64::from_le_bytes(header[4..12].try_into().unwrap());
let key_len = u32::from_le_bytes(header[12..16].try_into().unwrap()) as usize;
let val_len = u32::from_le_bytes(header[16..20].try_into().unwrap()) as usize;
let mut key = vec![0u8; key_len];
reader.read_exact(&mut key)?;
let mut value = vec![0u8; val_len];
reader.read_exact(&mut value)?;
let entry = Entry { crc, timestamp, key, value };
if !entry.verify() {
return Err(std::io::Error::new(
std::io::ErrorKind::InvalidData,
"CRC mismatch"
));
}
Ok(entry)
}
pub fn total_size(&self) -> usize {
HEADER_SIZE + self.key.len() + self.value.len()
}
}
Checkpoint: Write entries, verify CRC on read, detect corruption.
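A corruption test for this checkpoint can encode an entry, flip one byte, and assert that decode refuses it. A sketch, written as a unit test next to Entry:
// src/format/mod.rs (appended) -- corruption detection test
#[cfg(test)]
mod tests {
    use super::Entry;

    #[test]
    fn decode_rejects_flipped_byte() {
        let entry = Entry::new(b"name", b"Alice");

        // Encode into a buffer, then corrupt one byte of the value.
        let mut buf = Vec::new();
        entry.encode(&mut buf).unwrap();
        let last = buf.len() - 1;
        buf[last] ^= 0xFF;

        // Decode must fail the CRC check rather than return bad data.
        let mut slice: &[u8] = &buf;
        assert!(Entry::decode(&mut slice).is_err());
    }
}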
Phase 3: Memory-Mapped Reads
Goal: Use mmap for zero-copy reads from the data file.
Duration: 2-3 days
// src/storage/data_file.rs
use memmap2::{Mmap, MmapOptions};
use std::fs::{File, OpenOptions};
use std::path::{Path, PathBuf};
use std::sync::atomic::{AtomicU64, Ordering};
pub struct DataFile {
file: File,
mmap: Mmap,
write_offset: AtomicU64,
path: PathBuf,
}
#[derive(Clone, Copy, Debug)]
pub struct IndexEntry {
pub offset: u64,
pub len: u32,
pub timestamp: u64,
}
impl DataFile {
pub fn open(path: &Path) -> std::io::Result<Self> {
let file = OpenOptions::new()
.read(true)
.write(true)
.create(true)
.open(path)?;
let metadata = file.metadata()?;
let len = metadata.len();
// Ensure file is at least 1 byte for mmap
if len == 0 {
file.set_len(4096)?; // Start with 4KB
}
let mmap = unsafe { MmapOptions::new().map(&file)? };
Ok(DataFile {
file,
mmap,
write_offset: AtomicU64::new(len),
path: path.to_path_buf(),
})
}
pub fn get(&self, entry: &IndexEntry) -> &[u8] {
let start = entry.offset as usize;
let end = start + entry.len as usize;
&self.mmap[start..end]
}
pub fn append(&self, data: &[u8]) -> std::io::Result<u64> {
let offset = self.write_offset.fetch_add(data.len() as u64, Ordering::SeqCst);
// Ensure file is large enough
let new_end = offset + data.len() as u64;
let current_len = self.file.metadata()?.len();
if new_end > current_len {
// Double file size (or minimum needed)
let new_size = std::cmp::max(current_len * 2, new_end);
self.file.set_len(new_size)?;
}
// Write using pwrite (doesn't change file offset)
#[cfg(unix)]
{
use std::os::unix::fs::FileExt;
self.file.write_at(data, offset)?;
}
Ok(offset)
}
pub fn sync(&self) -> std::io::Result<()> {
self.file.sync_all()
}
pub fn remap(&mut self) -> std::io::Result<()> {
self.mmap = unsafe { MmapOptions::new().map(&self.file)? };
Ok(())
}
}
Checkpoint: Writes append to file, reads use mmap, verify zero-copy in benchmarks.
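For the “verify zero-copy in benchmarks” part of this checkpoint, a Criterion benchmark along these lines measures steady-state read latency. It assumes the crate is named rustkv, that criterion and tempfile are dev-dependencies, and that the Phase 1 API is still in place; adjust to your actual types.
// benches/throughput.rs (sketch)
use criterion::{criterion_group, criterion_main, Criterion};
use rustkv::KvStore;

fn bench_reads(c: &mut Criterion) {
    let dir = tempfile::tempdir().unwrap();
    let mut store = KvStore::open(dir.path()).unwrap();
    store.put(b"key123", &[42u8; 100]).unwrap();

    c.bench_function("get_existing_key", |b| {
        b.iter(|| std::hint::black_box(store.get(b"key123")))
    });
}

criterion_group!(benches, bench_reads);
criterion_main!(benches);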
Phase 4: Arena Allocator for Index
Goal: Use arena allocation for index entries to improve cache locality.
Duration: 3-4 days
// src/allocator/arena.rs
use std::alloc::{alloc, dealloc, Layout};
use std::cell::{Cell, RefCell};
use std::ptr::NonNull;

pub struct Arena {
    // Each chunk is recorded (pointer + layout) so reset() can reuse the
    // first one and Drop can free them all.
    chunks: RefCell<Vec<(NonNull<u8>, Layout)>>,
    current: Cell<*mut u8>,
    end: Cell<*mut u8>,
    chunk_size: usize,
}

impl Arena {
    pub fn new(chunk_size: usize) -> Self {
        Arena {
            chunks: RefCell::new(Vec::new()),
            current: Cell::new(std::ptr::null_mut()),
            end: Cell::new(std::ptr::null_mut()),
            chunk_size,
        }
    }

    pub fn alloc<T>(&self, value: T) -> &mut T {
        let layout = Layout::new::<T>();
        let ptr = self.alloc_raw(layout);
        unsafe {
            std::ptr::write(ptr as *mut T, value);
            &mut *(ptr as *mut T)
        }
    }

    fn alloc_raw(&self, layout: Layout) -> *mut u8 {
        // Bump allocation: align the current pointer, then advance it
        let current = self.current.get() as usize;
        let aligned = (current + layout.align() - 1) & !(layout.align() - 1);
        let new_ptr = aligned + layout.size();
        if !self.current.get().is_null() && new_ptr <= self.end.get() as usize {
            self.current.set(new_ptr as *mut u8);
            return aligned as *mut u8;
        }
        // Not enough room in the current chunk
        self.grow(layout)
    }

    fn grow(&self, layout: Layout) -> *mut u8 {
        let size = std::cmp::max(self.chunk_size, layout.size());
        let chunk_layout = Layout::from_size_align(size, layout.align()).unwrap();
        let ptr = unsafe { alloc(chunk_layout) };
        if ptr.is_null() {
            panic!("Arena allocation failed");
        }
        // Remember the chunk so it can be reused on reset and freed on drop
        self.chunks
            .borrow_mut()
            .push((NonNull::new(ptr).unwrap(), chunk_layout));
        self.current.set(ptr);
        self.end.set(unsafe { ptr.add(size) });
        self.alloc_raw(layout)
    }

    pub fn reset(&mut self) {
        // &mut self guarantees no outstanding references into the arena.
        // Rewind to the first chunk; later chunks stay allocated until drop.
        if let Some(&(first, layout)) = self.chunks.borrow().first() {
            self.current.set(first.as_ptr());
            self.end.set(unsafe { first.as_ptr().add(layout.size()) });
        }
    }
}

impl Drop for Arena {
    fn drop(&mut self) {
        for &(chunk, layout) in self.chunks.get_mut().iter() {
            unsafe { dealloc(chunk.as_ptr(), layout) };
        }
    }
}
Checkpoint: Index entries allocated in arena, benchmark shows improved cache performance.
Phase 5: Write-Ahead Logging
Goal: Add WAL for crash recovery guarantees.
Duration: 3-4 days
// src/storage/wal.rs
use crate::format::Entry;
use std::fs::{File, OpenOptions};
use std::io::{BufReader, BufWriter, Read, Write};
use std::path::{Path, PathBuf};
pub struct WriteAheadLog {
file: BufWriter<File>,
path: PathBuf, // kept so replay() can reopen the log for reading
sequence: u64,
sync_every: usize,
unflushed: usize,
}
impl WriteAheadLog {
pub fn open(path: &Path) -> std::io::Result<Self> {
let file = OpenOptions::new()
.create(true)
.append(true)
.open(path)?;
Ok(WriteAheadLog {
file: BufWriter::new(file),
sequence: 0,
sync_every: 1000, // Configurable
unflushed: 0,
})
}
pub fn append(&mut self, entry: &Entry) -> std::io::Result<u64> {
// Write WAL record
let record = WalRecord {
sequence: self.sequence,
entry: entry.clone(),
};
record.encode(&mut self.file)?;
self.sequence += 1;
self.unflushed += 1;
// Periodic sync
if self.unflushed >= self.sync_every {
self.sync()?;
}
Ok(self.sequence - 1)
}
pub fn sync(&mut self) -> std::io::Result<()> {
self.file.flush()?;
self.file.get_ref().sync_all()?;
self.unflushed = 0;
Ok(())
}
pub fn replay<F>(&self, mut callback: F) -> std::io::Result<u64>
where
F: FnMut(&Entry) -> std::io::Result<()>,
{
let mut reader = BufReader::new(File::open(&self.path)?);
let mut count = 0;
while let Ok(record) = WalRecord::decode(&mut reader) {
callback(&record.entry)?;
count += 1;
}
Ok(count)
}
}
struct WalRecord {
sequence: u64,
entry: Entry,
}
impl WalRecord {
fn encode<W: Write>(&self, writer: &mut W) -> std::io::Result<()> {
writer.write_all(&self.sequence.to_le_bytes())?;
self.entry.encode(writer)?;
Ok(())
}
fn decode<R: Read>(reader: &mut R) -> std::io::Result<Self> {
let mut seq_bytes = [0u8; 8];
reader.read_exact(&mut seq_bytes)?;
let sequence = u64::from_le_bytes(seq_bytes);
let entry = Entry::decode(reader)?;
Ok(WalRecord { sequence, entry })
}
}
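One note on how this ties into startup: because replay stops at the first record that fails to decode, a torn write at the tail of the log is discarded automatically. A hedged sketch of the recovery call follows; the commented-out apply step stands in for whatever mutation your KvStore performs per entry.
// Hypothetical recovery path: re-apply every logged entry to in-memory state
// before serving requests.
fn recover(wal: &WriteAheadLog) -> std::io::Result<u64> {
    let mut applied: u64 = 0;
    let replayed = wal.replay(|_entry| {
        // Placeholder: re-insert the key (or tombstone) into the index here.
        applied += 1;
        Ok(())
    })?;
    debug_assert_eq!(applied, replayed);
    Ok(replayed)
}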
Checkpoint: Crash simulation test passes - data recovered from WAL.
Phase 6: Compaction and Garbage Collection
Goal: Reclaim space from old versions and deleted keys.
Duration: 4-5 days
// src/storage/compaction.rs
use std::collections::HashMap;
use std::fs::File;
use std::sync::Arc;
pub struct Compactor {
store: Arc<KvStore>,
}
impl Compactor {
pub fn compact(&self) -> std::io::Result<CompactionStats> {
let mut stats = CompactionStats::default();
// Capture the current file size before rewriting (used for space accounting)
stats.old_size = self.store.data_file.len();
// 1. Create new data file
let new_path = self.store.data_path.with_extension("compact");
let mut new_file = File::create(&new_path)?;
let mut new_index: HashMap<Vec<u8>, IndexEntry> = HashMap::new();
// 2. Iterate through current index, write live entries to new file
let index = self.store.index.read().unwrap();
let mut offset = 0u64;
for (key, entry) in index.iter() {
// Skip tombstones
if entry.is_tombstone {
stats.deleted_keys += 1;
continue;
}
// Read value from old file
let value = self.store.data_file.get(entry);
// Write to new file
let new_entry = Entry::new(key, value);
let size = new_entry.encode(&mut new_file)?;
// Update new index
new_index.insert(key.clone(), IndexEntry {
offset,
len: value.len() as u32,
timestamp: entry.timestamp,
});
offset += size as u64;
stats.live_keys += 1;
}
drop(index);
// 3. Sync new file
new_file.sync_all()?;
// 4. Atomic swap
// Note: This is simplified. Real implementation needs careful locking.
let old_path = self.store.data_path.clone();
let backup_path = old_path.with_extension("old");
std::fs::rename(&old_path, &backup_path)?;
std::fs::rename(&new_path, &old_path)?;
// 5. Update store's mmap and index
{
let mut index = self.store.index.write().unwrap();
*index = new_index;
}
// remap() needs mutable access to the DataFile; behind an Arc this goes
// through a lock or interior mutability (simplified here)
self.store.data_file.remap()?;
// 6. Delete old file
std::fs::remove_file(&backup_path)?;
stats.new_size = offset;
stats.space_reclaimed = stats.old_size.saturating_sub(stats.new_size);
Ok(stats)
}
}
#[derive(Default)]
pub struct CompactionStats {
pub live_keys: usize,
pub deleted_keys: usize,
pub old_size: u64,
pub new_size: u64,
pub space_reclaimed: u64,
}
Checkpoint: Compaction reduces file size, no data loss, concurrent reads work.
Phase 7: Concurrent Access with Atomics
Goal: Allow multiple readers and writers without blocking.
Duration: 4-5 days
// src/index/concurrent.rs
use dashmap::DashMap;
use std::collections::HashMap;
use std::sync::atomic::{AtomicPtr, AtomicU64, Ordering};
// Option 1: DashMap (sharded HashMap)
pub struct ConcurrentIndex {
map: DashMap<Vec<u8>, IndexEntry>,
version: AtomicU64,
}
impl ConcurrentIndex {
pub fn new() -> Self {
ConcurrentIndex {
map: DashMap::new(),
version: AtomicU64::new(0),
}
}
pub fn get(&self, key: &[u8]) -> Option<IndexEntry> {
self.map.get(key).map(|e| *e)
}
pub fn insert(&self, key: Vec<u8>, entry: IndexEntry) {
self.map.insert(key, entry);
self.version.fetch_add(1, Ordering::Release);
}
pub fn remove(&self, key: &[u8]) -> Option<IndexEntry> {
self.map.remove(key).map(|(_, v)| v)
}
}
// Option 2: Lock-free reads with copy-on-write for writes
pub struct CowIndex {
current: AtomicPtr<HashMap<Vec<u8>, IndexEntry>>,
}
impl CowIndex {
pub fn get(&self, key: &[u8]) -> Option<IndexEntry> {
let map = unsafe { &*self.current.load(Ordering::Acquire) };
map.get(key).copied()
}
pub fn update<F>(&self, f: F)
where
F: FnOnce(&mut HashMap<Vec<u8>, IndexEntry>),
{
loop {
let old_ptr = self.current.load(Ordering::Acquire);
let old = unsafe { &*old_ptr };
let mut new = old.clone();
f(&mut new);
let new_ptr = Box::into_raw(Box::new(new));
if self.current.compare_exchange(
old_ptr,
new_ptr,
Ordering::AcqRel,
Ordering::Acquire,
).is_ok() {
// Successfully swapped. Schedule old for deletion.
// Use epoch-based reclamation or hazard pointers.
break;
} else {
// CAS failed, retry
unsafe { drop(Box::from_raw(new_ptr)); }
}
}
}
}
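The "schedule old for deletion" step above is the hard part of Option 2: without epochs or hazard pointers, freeing the old map can race with readers still dereferencing it. A third approach, shown below as a hedged sketch rather than the project's prescribed design, sidesteps manual reclamation by letting Arc do the reference counting. Reads take a brief read-lock just to clone the snapshot, so this is not strictly lock-free, but it is simple and safe.
// Option 3 (sketch): copy-on-write snapshots reclaimed by Arc instead of
// epochs or hazard pointers.
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

pub struct SnapshotIndex {
    current: RwLock<Arc<HashMap<Vec<u8>, IndexEntry>>>,
}

impl SnapshotIndex {
    pub fn new() -> Self {
        SnapshotIndex { current: RwLock::new(Arc::new(HashMap::new())) }
    }

    pub fn get(&self, key: &[u8]) -> Option<IndexEntry> {
        // Hold the read-lock only long enough to clone the Arc; the lookup
        // itself runs against an immutable snapshot.
        let snapshot = {
            let guard = self.current.read().unwrap();
            Arc::clone(&*guard)
        };
        snapshot.get(key).copied()
    }

    pub fn update<F>(&self, f: F)
    where
        F: FnOnce(&mut HashMap<Vec<u8>, IndexEntry>),
    {
        let mut guard = self.current.write().unwrap();
        let mut new_map = (**guard).clone(); // copy-on-write
        f(&mut new_map);
        *guard = Arc::new(new_map);
        // The previous snapshot is freed automatically once the last reader
        // drops its Arc -- no epoch or hazard-pointer machinery required.
    }
}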
Checkpoint: Concurrent benchmark shows linear scaling with threads.
Phase 8: CLI Interface
Goal: Create a user-friendly command-line interface.
Duration: 2-3 days
// src/cli.rs
use clap::{Parser, Subcommand};
use std::path::PathBuf;
#[derive(Parser)]
#[command(name = "rustkv")]
#[command(about = "A high-performance key-value store")]
struct Cli {
#[command(subcommand)]
command: Commands,
#[arg(long, default_value = "./data")]
data_dir: PathBuf,
}
#[derive(Subcommand)]
enum Commands {
Get { key: String },
Set { key: String, value: String },
Del { key: String },
Bench {
#[arg(long, default_value_t = 1000000)]
ops: usize,
#[arg(long, default_value_t = 1)]
threads: usize,
},
Compact,
Stats,
Interactive,
}
// `Result` is assumed to be a crate-level alias (e.g. anyhow::Result)
fn main() -> Result<()> {
let cli = Cli::parse();
let store = KvStore::open(&cli.data_dir)?;
match cli.command {
Commands::Get { key } => {
match store.get(key.as_bytes()) {
Some(value) => println!("{}", String::from_utf8_lossy(&value)),
None => println!("(nil)"),
}
}
Commands::Set { key, value } => {
store.put(key.as_bytes(), value.as_bytes())?;
println!("OK");
}
Commands::Del { key } => {
store.delete(key.as_bytes())?;
println!("OK");
}
Commands::Bench { ops, threads } => {
run_benchmark(&store, ops, threads)?;
}
Commands::Compact => {
let stats = store.compact()?;
println!("Compaction complete:");
println!(" Live keys: {}", stats.live_keys);
println!(" Deleted keys: {}", stats.deleted_keys);
println!(" Space reclaimed: {} bytes", stats.space_reclaimed);
}
Commands::Stats => {
print_stats(&store);
}
Commands::Interactive => {
run_interactive(&store)?;
}
}
Ok(())
}
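The Bench arm calls run_benchmark, which is not defined above. A minimal sketch follows, assuming KvStore::put/get take &self, KvStore is Sync, and Result is the same crate-level alias used by main; the key format and value payload are arbitrary choices.
// Hypothetical benchmark helper: spread `ops` put+get pairs across `threads`
// scoped worker threads and report aggregate throughput.
fn run_benchmark(store: &KvStore, ops: usize, threads: usize) -> Result<()> {
    use std::time::Instant;
    let per_thread = ops / threads.max(1);
    let start = Instant::now();
    std::thread::scope(|s| {
        for t in 0..threads {
            s.spawn(move || {
                for i in 0..per_thread {
                    let key = format!("bench-{}-{}", t, i);
                    store.put(key.as_bytes(), b"benchmark-value").unwrap();
                    let _ = store.get(key.as_bytes());
                }
            });
        }
    });
    let elapsed = start.elapsed();
    let total_ops = per_thread * threads * 2; // one put + one get per iteration
    println!(
        "{} ops in {:.2?} ({:.0} ops/sec)",
        total_ops,
        elapsed,
        total_ops as f64 / elapsed.as_secs_f64()
    );
    Ok(())
}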
Checkpoint: All CLI commands work, benchmark shows target performance.
Testing Strategy
Unit Tests
#[cfg(test)]
mod tests {
use super::*;
use tempfile::tempdir;
#[test]
fn test_basic_put_get() {
let dir = tempdir().unwrap();
let store = KvStore::open(dir.path()).unwrap();
store.put(b"hello", b"world").unwrap();
assert_eq!(store.get(b"hello"), Some(b"world".to_vec()));
}
#[test]
fn test_delete() {
let dir = tempdir().unwrap();
let store = KvStore::open(dir.path()).unwrap();
store.put(b"hello", b"world").unwrap();
store.delete(b"hello").unwrap();
assert_eq!(store.get(b"hello"), None);
}
#[test]
fn test_update() {
let dir = tempdir().unwrap();
let store = KvStore::open(dir.path()).unwrap();
store.put(b"key", b"value1").unwrap();
store.put(b"key", b"value2").unwrap();
assert_eq!(store.get(b"key"), Some(b"value2".to_vec()));
}
#[test]
fn test_persistence() {
let dir = tempdir().unwrap();
// Write data
{
let store = KvStore::open(dir.path()).unwrap();
store.put(b"key", b"value").unwrap();
}
// Read after reopen
{
let store = KvStore::open(dir.path()).unwrap();
assert_eq!(store.get(b"key"), Some(b"value".to_vec()));
}
}
#[test]
fn test_crc_validation() {
let entry = Entry::new(b"key", b"value");
assert!(entry.verify());
// Corrupt the entry
let mut corrupted = entry;
corrupted.crc ^= 1; // Flip a bit
assert!(!corrupted.verify());
}
}
Crash Recovery Tests
#[test]
fn test_crash_recovery() {
let dir = tempdir().unwrap();
// Simulate partial write followed by crash
{
let store = KvStore::open(dir.path()).unwrap();
// Write some complete entries
for i in 0..1000 {
store.put(format!("key{}", i).as_bytes(), b"value").unwrap();
}
// Force sync to ensure WAL is on disk
store.sync().unwrap();
// Don't call drop - simulate crash
std::mem::forget(store);
}
// Recover and verify
{
let store = KvStore::open(dir.path()).unwrap();
for i in 0..1000 {
assert!(store.get(format!("key{}", i).as_bytes()).is_some());
}
}
}
Concurrent Stress Tests
#[test]
fn test_concurrent_access() {
use std::sync::Arc;
use std::thread;
let dir = tempdir().unwrap();
let store = Arc::new(KvStore::open(dir.path()).unwrap());
let num_threads = 8;
let ops_per_thread = 10000;
let handles: Vec<_> = (0..num_threads).map(|t| {
let store = Arc::clone(&store);
thread::spawn(move || {
for i in 0..ops_per_thread {
let key = format!("thread{}key{}", t, i);
let value = format!("value{}", i);
store.put(key.as_bytes(), value.as_bytes()).unwrap();
let read = store.get(key.as_bytes());
assert!(read.is_some());
}
})
}).collect();
for handle in handles {
handle.join().unwrap();
}
// Verify all keys exist
for t in 0..num_threads {
for i in 0..ops_per_thread {
let key = format!("thread{}key{}", t, i);
assert!(store.get(key.as_bytes()).is_some());
}
}
}
Common Pitfalls and Debugging
Pitfall 1: fsync Timing for Durability
// WRONG: Data may be lost on crash
fn put(&mut self, key: &[u8], value: &[u8]) -> Result<()> {
self.data_file.write(&entry.encode())?;
self.index.insert(key, entry);
// No fsync! Crash here = data loss
Ok(())
}
// CORRECT: Durability guaranteed
fn put(&mut self, key: &[u8], value: &[u8]) -> Result<()> {
// 1. Write to WAL first
self.wal.append(&entry)?;
self.wal.sync()?; // fsync WAL
// 2. Now safe to update in-memory state
let offset = self.data_file.append(&entry.encode())?;
self.index.insert(key, IndexEntry { offset, .. });
Ok(())
}
Pitfall 2: mmap Edge Cases
// WRONG: File grows but mmap doesn't see new data
fn append(&mut self, data: &[u8]) -> Result<u64> {
let offset = self.file.seek(SeekFrom::End(0))?;
self.file.write_all(data)?;
// Readers using old mmap won't see this!
Ok(offset)
}
// CORRECT: Remap after file growth
fn append(&mut self, data: &[u8]) -> Result<u64> {
let offset = self.file.seek(SeekFrom::End(0))?;
self.file.write_all(data)?;
// Remap to see new data
self.mmap = unsafe { MmapOptions::new().map(&self.file)? };
Ok(offset)
}
// BEST: Use separate file handle for writes, periodic remap
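That last variant deserves a sketch: appends go through a plain File handle (positional writes, as in Phase 3), while readers share a read-only mapping that is rebuilt only when it no longer covers the requested offset. The memmap2 MmapOptions API is assumed here, and the copy into a Vec keeps the example simple at the cost of zero-copy.
// Sketch: separate write handle + lazily refreshed mmap for readers.
use memmap2::{Mmap, MmapOptions};
use std::fs::File;
use std::sync::RwLock;

struct MappedLog {
    write_file: File,   // all appends go through this handle (write_all_at)
    mmap: RwLock<Mmap>, // read-only view, replaced when the file outgrows it
}

impl MappedLog {
    fn read(&self, offset: u64, len: usize) -> std::io::Result<Vec<u8>> {
        let end = offset as usize + len;
        {
            // Fast path: the current mapping already covers the range.
            let map = self.mmap.read().unwrap();
            if end <= map.len() {
                return Ok(map[offset as usize..end].to_vec());
            }
        }
        // Slow path: the file has grown past the mapping -- remap, then read.
        let new_map = unsafe { MmapOptions::new().map(&self.write_file)? };
        let mut guard = self.mmap.write().unwrap();
        *guard = new_map;
        if end > guard.len() {
            return Err(std::io::Error::new(
                std::io::ErrorKind::UnexpectedEof,
                "offset beyond end of data file",
            ));
        }
        Ok(guard[offset as usize..end].to_vec())
    }
}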
Pitfall 3: Memory Leaks in Arena Index
// WRONG: Arena never frees memory
struct Index {
arena: Arena,
map: HashMap<Vec<u8>, &'static IndexEntry>, // Lifetime issue!
}
// CORRECT: Use proper lifetime bounds
struct Index<'a> {
arena: &'a Arena,
map: HashMap<&'a [u8], &'a IndexEntry>,
}
// OR: Copy keys/entries to arena
struct Index {
arena: Arena,
map: HashMap<ArenaString, ArenaBox<IndexEntry>>,
}
Pitfall 4: Lock Contention in Hot Path
// WRONG: Global lock for every operation
struct KvStore {
data: Mutex<KvStoreInner>,
}
fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
let guard = self.data.lock().unwrap(); // BLOCKS all readers!
guard.index.get(key).map(|e| self.read_value(e))
}
// CORRECT: Read-write lock or lock-free
struct KvStore {
index: RwLock<HashMap<..>>,
data_file: DataFile, // mmap is thread-safe for reads
}
fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
let guard = self.index.read().unwrap(); // Many readers allowed!
guard.get(key).map(|e| self.data_file.get(e))
}
// BEST: Lock-free index (DashMap, skip list)
Extensions and Advanced Features
Extension 1: Range Queries
Add support for scanning a range of keys:
pub fn scan(&self, start: &[u8], end: &[u8]) -> impl Iterator<Item=(Vec<u8>, Vec<u8>)> {
// Requires sorted index (B-Tree or Skip List instead of HashMap)
self.index.range(start..end)
.map(|(k, e)| (k.clone(), self.data_file.get(e).to_vec()))
}
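Concretely, with std's BTreeMap as the sorted index, the scan can look like the sketch below. The owned bounds and to_vec copies keep the borrow rules simple and are not requirements; start must be <= end.
// Sketch: range scan over a sorted (BTreeMap) index. The data_file.get(entry)
// call is the same mmap read used by point lookups.
use std::collections::BTreeMap;

pub struct SortedIndex {
    map: BTreeMap<Vec<u8>, IndexEntry>,
}

impl SortedIndex {
    pub fn scan<'a>(
        &'a self,
        start: &[u8],
        end: &[u8],
        data_file: &'a DataFile,
    ) -> impl Iterator<Item = (Vec<u8>, Vec<u8>)> + 'a {
        self.map
            .range(start.to_vec()..end.to_vec())
            .map(move |(k, entry)| (k.clone(), data_file.get(entry).to_vec()))
    }
}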
Extension 2: Transactions
Add multi-key atomic operations:
pub fn transaction<F, R>(&self, f: F) -> Result<R>
where
F: FnOnce(&mut Transaction) -> Result<R>
{
let mut txn = Transaction::new(self);
let result = f(&mut txn)?;
txn.commit()?;
Ok(result)
}
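Transaction::new and commit carry the real semantics. The buffer-based sketch below gives all-or-nothing behavior only if commit runs under an exclusive section (or is logged as a single WAL record), which is noted inline; the method signatures mirror the KvStore API used elsewhere, and Result is the same crate alias.
// Sketch: a write-buffering transaction. Reads see buffered writes first,
// and commit applies everything in one pass.
pub struct Transaction<'a> {
    store: &'a KvStore,
    // key -> Some(value) for puts, None for deletes
    writes: Vec<(Vec<u8>, Option<Vec<u8>>)>,
}

impl<'a> Transaction<'a> {
    pub fn new(store: &'a KvStore) -> Self {
        Transaction { store, writes: Vec::new() }
    }

    pub fn put(&mut self, key: &[u8], value: &[u8]) {
        self.writes.push((key.to_vec(), Some(value.to_vec())));
    }

    pub fn delete(&mut self, key: &[u8]) {
        self.writes.push((key.to_vec(), None));
    }

    pub fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        // Most recent buffered write wins; otherwise fall back to the store.
        for (k, v) in self.writes.iter().rev() {
            if k == key {
                return v.clone();
            }
        }
        self.store.get(key)
    }

    pub fn commit(self) -> Result<()> {
        // For real atomicity this loop must run under an exclusive lock (or
        // be written to the WAL as one record) so readers never observe a
        // partially applied transaction.
        for (key, value) in &self.writes {
            match value {
                Some(v) => self.store.put(key, v)?,
                None => self.store.delete(key)?,
            }
        }
        Ok(())
    }
}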
Extension 3: Replication
Add primary-replica replication:
pub fn replicate_to(&self, replica: &str) -> Result<()> {
// Stream WAL to replica
let wal_reader = self.wal.reader_from(replica_last_seq)?;
while let Some(record) = wal_reader.next()? {
send_to_replica(replica, &record)?;
}
Ok(())
}
Extension 4: Bloom Filters
Add bloom filters to avoid unnecessary disk reads:
struct SstableBloomFilter {
bits: BitVec,
hash_count: usize,
}
impl SstableBloomFilter {
fn might_contain(&self, key: &[u8]) -> bool {
// Check all hash positions
for i in 0..self.hash_count {
let pos = hash_with_seed(key, i) % self.bits.len();
if !self.bits[pos] {
return false; // Definitely not present
}
}
true // Might be present (check disk)
}
}
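might_contain relies on an insert that sets the same bit positions and on a seeded hash, neither of which is shown above. Here is a self-contained sketch using std's DefaultHasher and a plain Vec<bool> in place of BitVec; sizes and hash counts are illustrative.
// Sketch of the missing pieces: a seeded hash plus an insert that sets the
// same bit positions might_contain checks.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn hash_with_seed(key: &[u8], seed: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    seed.hash(&mut hasher);
    key.hash(&mut hasher);
    hasher.finish() as usize
}

struct BloomFilter {
    bits: Vec<bool>,
    hash_count: usize,
}

impl BloomFilter {
    fn new(bit_count: usize, hash_count: usize) -> Self {
        BloomFilter { bits: vec![false; bit_count], hash_count }
    }

    fn insert(&mut self, key: &[u8]) {
        for i in 0..self.hash_count {
            let pos = hash_with_seed(key, i) % self.bits.len();
            self.bits[pos] = true;
        }
    }

    fn might_contain(&self, key: &[u8]) -> bool {
        for i in 0..self.hash_count {
            let pos = hash_with_seed(key, i) % self.bits.len();
            if !self.bits[pos] {
                return false; // definitely not present
            }
        }
        true // possibly present -- caller still checks disk
    }
}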
The Interview Questions They Will Ask
- “What is the difference between an LSM Tree and a B+ Tree?”
  Answer: LSM Trees buffer writes in memory (memtable), then flush to immutable sorted files (SSTables) that are periodically merged. This makes writes sequential and fast, but reads may need to check multiple levels. B+ Trees update data in place with a balanced tree structure, providing consistent read latency but requiring random I/O for writes. LSM Trees favor write-heavy workloads; B+ Trees favor read-heavy or mixed workloads.
- “Why is mmap faster than read/write for some workloads?”
  Answer: mmap eliminates the copy from kernel page cache to user space buffers. When you mmap a file, you access the page cache directly through virtual memory. For read-heavy workloads with good locality, this saves one memcpy per read. However, mmap can be slower for sequential access (read() uses readahead) and introduces unpredictable latency from page faults.
- “How do you handle database recovery after a crash?”
  Answer: Use Write-Ahead Logging (WAL). Before any modification, write the change to a sequential log file and fsync it to disk. On recovery, replay the WAL from the last checkpoint to restore the database to a consistent state. The key is that the WAL contains enough information to reconstruct any uncommitted changes.
- “What is Write Amplification?”
  Answer: Write amplification is the ratio of actual bytes written to storage versus the logical bytes written by the application. In LSM Trees, data may be written multiple times (memtable flush, L0 compaction, L1 compaction, etc.), leading to write amplification of 10-30x. This reduces SSD lifespan and limits write throughput.
- “How do you achieve lock-free reads in a concurrent key-value store?”
  Answer: Use an immutable index with atomic pointer swap for updates (copy-on-write), or use lock-free data structures like skip lists with CAS operations. For the data file, mmap is inherently safe for concurrent reads. Epoch-based reclamation or hazard pointers handle memory cleanup.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Storage Engine Design | “Designing Data-Intensive Applications” by Kleppmann | Ch. 3: Storage and Retrieval |
| Memory Mapping | “The Linux Programming Interface” by Kerrisk | Ch. 49: Memory Mappings |
| Concurrent Data Structures | “Rust Atomics and Locks” by Bos | Ch. 9: Building a Lock-Free Stack |
| Database Internals | “Database Internals” by Petrov | Entire book (B-Trees, LSM Trees) |
| Systems Programming | “Computer Systems: A Programmer’s Perspective” | Ch. 9: Virtual Memory |
Success Criteria
You have completed this project when:
- Your KV store can write 1 million keys in under 0.5 seconds
- Your KV store can read 1 million random keys in under 0.35 seconds
- Concurrent benchmark shows 3+ million operations per second
- Crash recovery test passes with 100% data integrity
- Compaction reduces file size without blocking readers
- Memory usage stays bounded (index fits in configured limit)
- You can explain every layer of the architecture from disk to API
Summary
This project represents the culmination of your Advanced Rust journey. You have built a production-grade storage engine that demonstrates:
- Systems Engineering: Understanding of I/O patterns, memory hierarchy, and performance optimization
- Rust Mastery: Unsafe code for performance, atomics for concurrency, lifetimes for safety
- Database Knowledge: WAL for durability, compaction for space management, indexes for fast lookups
- Performance Engineering: Benchmarking, profiling, and optimization of hot paths
You now have the skills to understand, modify, and build real database systems. Whether you continue into distributed systems, embedded databases, or other systems programming domains, the foundations from this project will serve you well.
Welcome to the elite club of engineers who truly understand how data is stored and retrieved. The machine is now yours.
This project is part of the Advanced Rust Ecosystem Deep Dive learning path.