
Learn NoSQL Databases: From Zero to “I Can Build One”

Goal: Deeply understand how NoSQL databases work—by building the essential components yourself: storage engines, indexes, compaction, partitioning, replication, consistency controls, and operational tooling.


Why NoSQL Exists (and what you’re really learning)

Relational databases are optimized around schema + joins + ACID transactions. NoSQL systems exist because many real workloads need different trade-offs:

  • Scale-out across many machines (partitioning/sharding)
  • High write throughput with predictable latencies (log-structured storage, append-first designs)
  • High availability under failures and partitions (replication + tunable consistency)
  • Flexible data models (document, key-value, wide-column, graph)
  • Operational primitives (online rebalancing, rolling upgrades, multi-DC)

After these projects, you should be able to:

  • Explain a NoSQL system’s read path and write path at the “bytes on disk + messages on wire” level
  • Reason precisely about consistency (quorum, leader-based, eventual, causal)
  • Diagnose performance problems like write amplification, read amplification, and hot partitions
  • Build and operate a small but real distributed NoSQL store

Core Concept Analysis

1) Data models and query models

  • Key-Value: primary-key access, minimal query surface
  • Document: JSON-like documents, secondary indexes, partial updates
  • Wide-Column (Bigtable-style): rows + column families + timestamps
  • Graph: adjacency, traversals, index-free patterns
  • Time-series: append-heavy, compression, downsampling

2) Storage engine fundamentals (single-node)

  • Write path: WAL → in-memory structure (memtable) → immutable runs (SSTables)
  • Read path: memtable → block index → bloom filter → SSTable scans → merge results
  • Compaction: turning many runs into fewer runs (and deleting tombstones)
  • Recovery: replay WAL / restore last checkpoint
  • Indexes: primary + secondary; trade-offs between write cost and query power

3) Distribution and resilience

  • Partitioning: hashing, consistent hashing rings, range partitioning
  • Replication: leader/follower vs multi-leader vs leaderless
  • Consistency: quorum reads/writes, causal consistency, read-your-writes, linearizability
  • Membership: failure detection, gossip, rebalancing
  • Repair: anti-entropy, Merkle trees, hinted handoff

4) Operations (what makes it “real”)

  • Observability: metrics, tracing, compaction dashboards
  • Backups/snapshots
  • Schema evolution & migrations (even in “schemaless” systems)
  • SLOs and tail latency, overload control, admission control

Project List

All projects below are written into: LEARN_NOSQL_DEEP_DIVE.md


Project 1: WAL + Memtable Key-Value Store (The Write Path)

  • File: LEARN_NOSQL_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, Java, C++
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Storage Engines / Durability
  • Software or Tool: A write-ahead log (WAL) of your own, fsync semantics
  • Main Book: Designing Data-Intensive Applications by Martin Kleppmann

What you’ll build: A local key-value database that supports put/get/delete, persists via a write-ahead log, and recovers correctly after crashes.

Why it teaches NoSQL: Most NoSQL systems start with an append-first durability story. WAL + in-memory state is the core of nearly every storage engine.

Core challenges you’ll face:

  • Defining a record format (checksums, length-prefixing) → maps to binary protocols & corruption detection
  • Durability boundaries (fsync rules) → maps to what “committed” really means
  • Crash recovery (replay log into memtable) → maps to idempotency and ordering
  • Tombstones (delete markers) → maps to deletion in append-only stores

Key Concepts

  • WAL & durability: DDIA (Storage & Retrieval)
  • Crash recovery invariants: Operating Systems: Three Easy Pieces (Persistence / FS)
  • Checksums & torn writes: Computer Systems: A Programmer’s Perspective (I/O + representation)

Difficulty: Intermediate
Time estimate: 1-2 weeks
Prerequisites: Basic file I/O, basic data structures, comfort with unit tests

Real world outcome

  • You can demonstrate: “write 10k keys → force-kill process → restart → all committed keys are present; uncommitted keys are absent.”
  • You can show a human-readable “log dump” that replays operations and highlights corruption.

Implementation Hints

  • Treat the WAL as the source of truth; the memtable is a cache.
  • Define a strict ordering: write bytes → flush → (optionally) fsync → then acknowledge.
  • Include a checksum per record; on recovery, stop at first invalid record.
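
To make these hints concrete, here is a minimal sketch of a checksummed, length-prefixed WAL append in Go; the package name, the `Append` helper, and the record layout are illustrative choices for this sketch, not a prescribed format:

```go
// Minimal sketch of a WAL record: [length uint32][crc32 uint32][payload].
package wal

import (
	"encoding/binary"
	"hash/crc32"
	"os"
)

// Append writes one record and fsyncs before returning, so the caller can
// treat a nil error as "durable". Acknowledging before Sync weakens durability.
func Append(f *os.File, payload []byte) error {
	header := make([]byte, 8)
	binary.LittleEndian.PutUint32(header[0:4], uint32(len(payload)))
	binary.LittleEndian.PutUint32(header[4:8], crc32.ChecksumIEEE(payload))
	if _, err := f.Write(header); err != nil {
		return err
	}
	if _, err := f.Write(payload); err != nil {
		return err
	}
	return f.Sync() // the durability boundary: only acknowledge after this
}
```

On recovery, you read records in order, recompute each CRC, and stop at the first mismatch or truncated record, treating everything before it as committed history.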

Learning milestones

  1. Log replay restores state → You understand durability as “replayable history”
  2. Crashes don’t corrupt state → You understand atomicity boundaries
  3. Deletes work via tombstones → You understand why compaction is needed later

Project 2: SSTable Builder + Immutable Sorted Runs (The Read Path Starts)

  • File: LEARN_NOSQL_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, Java, C++
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Storage Engines / On-disk Layout
  • Software or Tool: SSTable format (your own)
  • Main Book: Designing Data-Intensive Applications by Martin Kleppmann

What you’ll build: Convert a memtable into an immutable, sorted on-disk file (SSTable) with a sparse index and block structure, then support reads from SSTables.

Why it teaches NoSQL: LSM-based NoSQL engines store data as immutable sorted runs; everything else (bloom filters, compaction, cache) builds on this.

Core challenges you’ll face:

  • Sorted encoding (key order, prefix compression) → maps to disk-efficient representations
  • Block index design (sparse vs dense) → maps to IO vs CPU trade-offs
  • Merging sources (memtable + multiple SSTables) → maps to multi-way merge logic
  • Tombstone semantics → maps to visibility rules across runs

Key Concepts

  • SSTables and runs: DDIA (LSM trees)
  • Binary search & block indexes: Algorithms, 4th Edition (searching)
  • Compression trade-offs: MongoDB/WiredTiger “compression + checkpoints” concepts (see Project 5 resources)

Difficulty: Advanced
Time estimate: 1-2 weeks
Prerequisites: Project 1; familiarity with sorting and file formats

Real world outcome

  • You can generate SSTables and prove reads work without loading full datasets into RAM.
  • You can show “read amplification” by printing which blocks/files were touched per query.

Implementation Hints

  • Use fixed-size blocks; each block contains a sequence of key/value entries.
  • A sparse index maps “first key in block” → file offset.
  • Reads: seek to candidate block(s), then scan inside a block.
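
A minimal sketch of the sparse-index lookup described above, assuming an in-memory slice of (first key, offset) entries sorted by key; the names are illustrative:

```go
// Minimal sketch of a sparse index lookup: find the block whose first key is
// the greatest key <= the search key, then scan inside that block on disk.
package sstable

import "sort"

type indexEntry struct {
	firstKey string // first key stored in the block
	offset   int64  // file offset where the block starts
}

// candidateBlock returns the offset of the only block that can contain key,
// or -1 if key sorts before the first block in the file.
func candidateBlock(index []indexEntry, key string) int64 {
	// index is sorted by firstKey; find the first entry with firstKey > key.
	i := sort.Search(len(index), func(i int) bool { return index[i].firstKey > key })
	if i == 0 {
		return -1
	}
	return index[i-1].offset
}
```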

Learning milestones

  1. Immutable files enable simple recovery → You see why append + immutability is powerful
  2. Sparse index works → You understand storage layouts for fast lookups
  3. Multi-run reads work → You internalize the merge-based read path

Project 3: LSM Compaction Simulator (Make Write Amplification Visible)

  • File: LEARN_NOSQL_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, Rust, Java
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Storage Engines / Compaction Economics
  • Software or Tool: RocksDB compaction mental model
  • Main Book: Designing Data-Intensive Applications by Martin Kleppmann

What you’ll build: A simulator that models leveled vs tiered compaction, producing graphs/tables for write amplification, space amplification, and read amplification.

Why it teaches NoSQL: Compaction is where many NoSQL performance mysteries live: stalls, SSD wear, tail latency spikes, and “why is my disk 10× busier than my workload?”

Core challenges you’ll face:

  • Modeling levels and fanout → maps to how LSM trees scale
  • Modeling tombstones/TTL → maps to deletes and time-based data
  • Compaction scheduling → maps to background work vs foreground latency
  • Hot partitions and skew → maps to real-world workload behavior

Resources for key challenges:

  • RocksDB compaction concepts (leveled vs tiered, write amplification) — RocksDB Wiki (Compaction)

Key Concepts

  • Leveled vs tiered compaction: RocksDB documentation
  • Write amplification mechanics: LSM tree paper lineage
  • Operational tuning mindset: DDIA (storage engines in practice)

Difficulty: Advanced
Time estimate: Weekend
Prerequisites: Comfort with basic probability and plotting; Projects 1–2 helpful

Real world outcome

  • You can answer: “Given workload X and configuration Y, why does the system rewrite Z bytes?”
  • You can produce a “compaction budget” report that predicts stall risk.

Implementation Hints

  • Model each run with size, key-range overlap, and tombstone fraction.
  • Track bytes written per compaction and divide by bytes of user writes.
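
A minimal sketch of the core bookkeeping, written in Go for consistency with the other examples; the counter names are illustrative. Every simulated flush or compaction adds to the engine-side totals, and write amplification falls out as a ratio:

```go
// Minimal sketch of write-amplification accounting for a compaction simulator:
// WA = bytes the engine writes (flushes + compactions) / bytes of user writes.
package sim

type Counters struct {
	UserBytes       int64 // logical bytes written by clients
	FlushBytes      int64 // bytes written when memtables become runs
	CompactionBytes int64 // bytes rewritten by compaction
}

func (c Counters) WriteAmplification() float64 {
	if c.UserBytes == 0 {
		return 0
	}
	return float64(c.FlushBytes+c.CompactionBytes) / float64(c.UserBytes)
}
```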

Learning milestones

  1. You can compute write amplification → You understand why LSMs trade write cost for throughput
  2. You can explain stalls → You understand background debt as a first-class system constraint
  3. You can reason about tuning → You can predict outcomes of fanout/level changes

Project 4: Bloom Filters + Block Cache (Make Reads Predictable)

  • File: LEARN_NOSQL_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, Java, C++
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Indexing / Performance Engineering
  • Software or Tool: Bloom filters, LRU cache
  • Main Book: Designing Data-Intensive Applications by Martin Kleppmann

What you’ll build: Add bloom filters to SSTables and implement a block cache with instrumentation (hit rate, bytes saved, latency percentiles).

Why it teaches NoSQL: Modern NoSQL engines are defined as much by performance structures (filters, caches) as by correctness.

Core challenges you’ll face:

  • False positives → maps to probabilistic data structures
  • Cache eviction policy → maps to tail latency control
  • Read amplification visibility → maps to measuring what your system does
  • Interplay with compaction → maps to why files move and caches churn

Key Concepts

  • Bloom filters: canonical probability trade-off (false positive tuning)
  • Caching strategies: DDIA (cache + storage)
  • Latency measurement: “percentiles over averages” mindset

Difficulty: Advanced
Time estimate: 1-2 weeks
Prerequisites: Projects 1–2 (or any SSTable reader)

Real world outcome

  • You can show: “same dataset, same queries; bloom filters cut disk reads by N%.”
  • You can demonstrate “cache thrash” scenarios and mitigation.

Implementation Hints

  • Keep bloom filter per SSTable (or per block) and record filter checks per query.
  • Measure: blocks touched, bytes read from disk, and cache hit ratio.
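
One way to wire this instrumentation together, as a hedged sketch: check the bloom filter before any IO, then the cache, then disk, and count each outcome. The `SSTable` interface and `Stats` fields are illustrative, and the cache is keyed per key rather than per block to keep the sketch short:

```go
// Minimal sketch of an instrumented read path: filter -> cache -> disk.
package readpath

type SSTable interface {
	MightContain(key string) bool // bloom filter check; may false-positive
	ReadBlockFromDisk(key string) ([]byte, bool)
}

type Stats struct {
	FilterSkips, CacheHits, DiskReads int64
}

// Get searches runs newest to oldest and records what each lookup cost.
func Get(tables []SSTable, cache map[string][]byte, key string, st *Stats) ([]byte, bool) {
	for _, t := range tables {
		if !t.MightContain(key) {
			st.FilterSkips++ // definitely absent in this run: no IO needed
			continue
		}
		if v, ok := cache[key]; ok {
			st.CacheHits++
			return v, true
		}
		st.DiskReads++
		if v, ok := t.ReadBlockFromDisk(key); ok {
			return v, true
		}
	}
	return nil, false
}
```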

Learning milestones

  1. Reads get faster without correctness risk → You understand probabilistic acceleration
  2. You can explain p95 spikes → You understand cache miss storms
  3. You can design dashboards → You think like a database operator

Project 5: MVCC Snapshots (Point-in-Time Reads Without Locks)

  • File: LEARN_NOSQL_DEEP_DIVE.md
  • Main Programming Language: Rust
  • Alternative Programming Languages: Go, Java, C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Concurrency Control / Storage Semantics
  • Software or Tool: MVCC + snapshot reads
  • Main Book: Designing Data-Intensive Applications by Martin Kleppmann

What you’ll build: Add MVCC so concurrent operations can see consistent snapshots, enabling “read without blocking writes” with versioned values and garbage collection.

Why it teaches NoSQL: Many NoSQL/document stores rely on snapshot semantics (and it’s the conceptual foundation for consistent reads, change streams, and long-running queries).

Core challenges you’ll face:

  • Version chains → maps to data visibility rules
  • Snapshot timestamps → maps to logical time vs wall time
  • Garbage collection → maps to compaction + retention
  • Atomic compare-and-set semantics → maps to concurrency primitives

Resources for key challenges:

  • WiredTiger snapshot + checkpoint descriptions (MongoDB documentation)

Key Concepts

  • Snapshot isolation vs linearizability: DDIA
  • MVCC mechanics: WiredTiger concepts
  • GC and retention windows: operational trade-offs in MVCC

Difficulty: Expert
Time estimate: 1 month+
Prerequisites: Comfortable with concurrency; Projects 1–2 help significantly

Real world outcome

  • You can run a workload where writers run continuously while readers get repeatable snapshots.
  • You can show “old version accumulation” and explain how GC/compaction resolves it.

Implementation Hints

  • Assign a monotonically increasing logical version to writes.
  • Readers pick a snapshot version and only read versions ≤ snapshot.
  • Track the oldest active snapshot to know which versions can be reclaimed.
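
A minimal sketch of the visibility rule, written in Go rather than Rust for consistency with the other examples; the `Version` struct and newest-first ordering are assumptions of this sketch:

```go
// Minimal sketch of MVCC visibility: each key holds a version chain, and a
// reader sees the newest version whose commit version is <= its snapshot.
package mvcc

type Version struct {
	CommitVersion uint64
	Value         []byte
	Tombstone     bool
}

// readAtSnapshot assumes versions are ordered newest-first.
func readAtSnapshot(versions []Version, snapshot uint64) ([]byte, bool) {
	for _, v := range versions {
		if v.CommitVersion <= snapshot {
			if v.Tombstone {
				return nil, false // deleted as of this snapshot
			}
			return v.Value, true
		}
	}
	return nil, false // key did not exist yet at this snapshot
}
```

Garbage collection is then the mirror image: any version older than the newest version visible to the oldest active snapshot can be reclaimed.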

Learning milestones

  1. Snapshot reads are correct → You understand isolation as a visibility rule
  2. Writers don’t block readers → You understand why MVCC is popular
  3. GC is safe and bounded → You understand how correctness meets operations

Project 6: Secondary Indexes (The Hidden Cost of “Flexible Queries”)

  • File: LEARN_NOSQL_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, Java, C++
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Indexing / Querying
  • Software or Tool: Inverted index / B-tree vs LSM index
  • Main Book: Designing Data-Intensive Applications by Martin Kleppmann

What you’ll build: Implement secondary indexes (exact-match and range) and quantify their write amplification and consistency challenges.

Why it teaches NoSQL: “NoSQL is flexible” ends the moment you need indexes at scale. This project reveals why many systems restrict query patterns.

Core challenges you’ll face:

  • Index maintenance on updates → maps to write fan-out
  • Consistency between primary and index → maps to atomicity without transactions
  • Backfilling indexes → maps to online migrations
  • Multi-valued fields → maps to document modeling trade-offs

Key Concepts

  • Index maintenance vs query power: DDIA
  • Backfill strategies: operational database thinking
  • Idempotent index updates: recovery + retries

Difficulty: Advanced
Time estimate: 1-2 weeks
Prerequisites: Projects 1–2 or any persistent KV store

Real world outcome

  • You can show: “query by field X returns correct results,” and “index build can run online.”
  • You can produce a report: “write cost increased by Y% due to index maintenance.”

Implementation Hints

  • Treat indexes as separate keyspaces: (index_key → primary_key list).
  • Define explicit update rules for modifications and deletes.
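
A minimal sketch of those update rules, treating the index as its own keyspace; the key layout, the `KV` interface, and `updateIndexed` are illustrative, and a real system must also make the retries idempotent:

```go
// Minimal sketch of secondary-index maintenance as "separate keyspace" writes.
package index

import "fmt"

// indexKey builds a composite key such as "idx:email:alice@example.com:user42"
// so that all entries for one indexed value are contiguous in key order.
func indexKey(field, value, primaryKey string) string {
	return fmt.Sprintf("idx:%s:%s:%s", field, value, primaryKey)
}

type KV interface {
	Put(key string, value []byte) error
	Delete(key string) error
}

// updateIndexed writes the primary record and keeps the index in step: the
// extra Delete/Put pair is exactly the write fan-out you will be measuring.
func updateIndexed(kv KV, primaryKey, field, oldVal, newVal string, doc []byte) error {
	if oldVal != "" && oldVal != newVal {
		if err := kv.Delete(indexKey(field, oldVal, primaryKey)); err != nil {
			return err
		}
	}
	if err := kv.Put(indexKey(field, newVal, primaryKey), nil); err != nil {
		return err
	}
	return kv.Put(primaryKey, doc)
}
```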

Learning milestones

  1. Indexes work → You understand query acceleration structures
  2. You measure write fan-out → You understand why indexing is expensive
  3. Online backfill works → You understand operational reality

Project 7: Document Store Layer (JSON Documents on Top of KV)

  • File: LEARN_NOSQL_DEEP_DIVE.md
  • Main Programming Language: TypeScript
  • Alternative Programming Languages: Go, Python, Java
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Document Model / API Design
  • Software or Tool: JSON Patch semantics, schema-less pitfalls
  • Main Book: Designing Data-Intensive Applications by Martin Kleppmann

What you’ll build: A document database API: insert/update/partial-update, basic validation, and optional secondary indexes for top-level fields.

Why it teaches NoSQL: The “document” model is mostly an API and modeling discipline layered over a KV engine. You’ll see the gap between “schemaless” and “unstructured chaos.”

Core challenges you’ll face:

  • Document identity → maps to primary key strategy
  • Partial updates → maps to write amplification at document granularity
  • Validation & schema evolution → maps to keeping systems operable
  • Indexing specific fields → maps to real query needs

Key Concepts

  • Document vs relational trade-offs: DDIA
  • Schema-on-write vs schema-on-read: modeling discipline
  • Update granularity: why big documents hurt writes

Difficulty: Intermediate
Time estimate: 1-2 weeks
Prerequisites: JSON familiarity; basic API design

Real world outcome

  • A small REST or CLI interface where you can store and retrieve documents and run a few indexed queries.
  • Demonstrable “migration” of a field from one shape to another without breaking reads.

Implementation Hints

  • Store document blobs as values; keep indexes separate.
  • Add “document version” metadata for conflict detection later.
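
A minimal sketch of an upsert that stores the blob and bumps a version counter, written in Go for consistency with the other examples; the `Document` shape and `KV` interface are assumptions of this sketch:

```go
// Minimal sketch of a document layer over a KV store: the document blob is
// the value, plus a version counter for later conflict detection.
package doclayer

import "encoding/json"

type Document struct {
	ID      string                 `json:"_id"`
	Version uint64                 `json:"_version"`
	Fields  map[string]interface{} `json:"fields"`
}

type KV interface {
	Get(key string) ([]byte, bool)
	Put(key string, value []byte) error
}

// upsert bumps the version on every write so concurrent writers can be
// detected later (compare-and-set on Version is the obvious next step).
func upsert(kv KV, doc Document) error {
	if raw, ok := kv.Get(doc.ID); ok {
		var existing Document
		if err := json.Unmarshal(raw, &existing); err != nil {
			return err
		}
		doc.Version = existing.Version + 1
	} else {
		doc.Version = 1
	}
	out, err := json.Marshal(doc)
	if err != nil {
		return err
	}
	return kv.Put(doc.ID, out)
}
```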

Learning milestones

  1. CRUD works reliably → You understand the document layer
  2. Partial updates are correct → You understand update semantics and their cost
  3. Schema evolution is managed → You think like an operator, not a tutorial

Project 8: Wide-Column Mini-Bigtable (Rows, Column Families, Timestamps)

  • File: LEARN_NOSQL_DEEP_DIVE.md
  • Main Programming Language: Java
  • Alternative Programming Languages: Go, Rust, C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Data Modeling / Storage Layout
  • Software or Tool: Bigtable mental model
  • Main Book: Designing Data-Intensive Applications by Martin Kleppmann

What you’ll build: A wide-column store that supports (row key, column family, column qualifier, timestamp) as the addressing model, with range scans by row key.

Why it teaches NoSQL: Bigtable-style systems show how “NoSQL” can still be structured: sorted by row key, versioned by timestamp, grouped into families for IO locality.

Core challenges you’ll face:

  • Row-key ordering and range scans → maps to sorted storage & hot-spot risks
  • Column-family locality → maps to IO patterns and data layout
  • Time-versioned cells → maps to retention policies and GC
  • Sparse data → maps to why wide-column exists

Resources for key challenges:

  • “Bigtable: A Distributed Storage System for Structured Data” (OSDI 2006)

Key Concepts

  • Wide-column model: Bigtable paper
  • Sorted tablets and partitioning: Bigtable paper
  • Version retention: storage GC patterns

Difficulty: Expert
Time estimate: 1 month+
Prerequisites: Storage fundamentals (Projects 1–2 recommended)

Real world outcome

  • You can store time-series-like values per row and retrieve “latest N versions.”
  • You can run range scans and show how row-key design affects hotspots.

Implementation Hints

  • Physically cluster by row key first, then family, then qualifier.
  • Implement a retention policy: “keep last K versions” or “keep last T time.”
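
A minimal sketch of a cell-key encoding that produces that physical clustering, written in Go for consistency with the other examples; the delimiter and inverted-timestamp trick are illustrative choices:

```go
// Minimal sketch of a wide-column cell key, encoded so that lexicographic
// order clusters by row key, then family, then qualifier, with newer
// timestamps sorting first.
package widecolumn

import "fmt"

// cellKey inverts the timestamp so that a forward scan over one cell returns
// the latest versions first; "keep last K versions" becomes "take K in scan order".
func cellKey(row, family, qualifier string, ts uint64) string {
	inverted := ^uint64(0) - ts
	return fmt.Sprintf("%s\x00%s\x00%s\x00%020d", row, family, qualifier, inverted)
}
```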

Learning milestones

  1. Range scans are efficient → You understand why sorted order matters
  2. Versioning is correct → You understand time-based storage semantics
  3. You can explain hotspotting → You think like a Bigtable user

Project 9: Consistent Hashing Sharder (From Single Node to Many)

  • File: LEARN_NOSQL_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, Java, Python
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Distributed Systems / Partitioning
  • Software or Tool: Consistent hashing ring
  • Main Book: Designing Data-Intensive Applications by Martin Kleppmann

What you’ll build: A sharding layer using consistent hashing with virtual nodes, rebalancing, and a “shard map” tool that explains where each key lives.

Why it teaches NoSQL: Many NoSQL systems partition data via consistent hashing to enable incremental scaling and rebalancing. Dynamo popularized this approach for highly available key-value systems.

Core challenges you’ll face:

  • Virtual nodes → maps to load distribution vs operational complexity
  • Rebalancing while serving traffic → maps to moving data safely
  • Hot key detection → maps to real workload skew
  • Key design constraints → maps to partition keys as product decisions

Resources for key challenges:

  • “Dynamo: Amazon’s Highly Available Key-value Store” (SOSP 2007)

Key Concepts

  • Consistent hashing and replication: Dynamo paper
  • Operational rebalancing: DDIA (partitioning chapter)
  • Hot partitions: practical performance thinking

Difficulty: Advanced
Time estimate: 1-2 weeks
Prerequisites: Networking basics; Projects 1–2 helpful

Real world outcome

  • A mini cluster can add/remove nodes and show how many keys moved.
  • A “trace” view shows for a key: hash → ring position → node set.

Implementation Hints

  • Keep membership info separate from data storage.
  • Make rebalancing observable: moved keys, bytes moved, and time.
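
A minimal sketch of a ring with virtual nodes; the hash function, `Ring` type, and `Lookup` method are illustrative, and replication plus membership handling are deliberately left out:

```go
// Minimal sketch of a consistent hashing ring with virtual nodes.
package ring

import (
	"fmt"
	"hash/fnv"
	"sort"
)

type Ring struct {
	points []uint32          // sorted virtual-node positions on the ring
	owner  map[uint32]string // position -> physical node
}

func hash(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// New places vnodes virtual nodes per physical node on the ring.
func New(nodes []string, vnodes int) *Ring {
	r := &Ring{owner: map[uint32]string{}}
	for _, n := range nodes {
		for i := 0; i < vnodes; i++ {
			p := hash(fmt.Sprintf("%s#%d", n, i))
			r.points = append(r.points, p)
			r.owner[p] = n
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// Lookup walks clockwise from the key's position to the next virtual node.
func (r *Ring) Lookup(key string) string {
	h := hash(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}
```

Adding or removing a node only changes ownership of the ranges adjacent to its virtual nodes, which is exactly the “only part of the keyspace moves” property you will measure.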

Learning milestones

  1. Keys distribute evenly → You understand hashing trade-offs
  2. Scaling changes only part of the keyspace → You understand why consistent hashing matters
  3. You can debug hotspots → You understand production pain points

Project 10: Replication + Tunable Consistency (Quorums You Can Feel)

  • File: LEARN_NOSQL_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, Java, Erlang/Elixir
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Replication / Consistency
  • Software or Tool: Quorum reads/writes
  • Main Book: Designing Data-Intensive Applications by Martin Kleppmann

What you’ll build: Replicate each shard across N nodes and support per-operation consistency levels (e.g., “ONE”, “QUORUM”, “ALL”), including read repair.

Why it teaches NoSQL: This is the heart of Dynamo-style and Cassandra-style systems: you tune consistency vs latency/availability by changing how many replicas must respond.

Core challenges you’ll face:

  • Coordinator logic → maps to routing, timeouts, partial failures
  • Conflict detection → maps to versioning (vector clocks or logical versions)
  • Read repair → maps to eventual convergence mechanics
  • Timeout semantics → maps to what “success” means under failure

Resources for key challenges:

  • Dynamo paper (quorum-like technique, versioning)
  • Cassandra docs on tunable consistency

Key Concepts

  • Quorum intersection reasoning: Dynamo/Cassandra approach
  • Eventual consistency mechanics: DDIA
  • Failure as a normal case: distributed systems mindset

Difficulty: Expert
Time estimate: 1 month+
Prerequisites: Project 9 (sharding), strong comfort with networking/timeouts

Real world outcome

  • You can simulate partitions and show exactly when stale reads happen (and when they don’t).
  • You can expose a “consistency report” per operation: who answered, who didn’t, what got repaired.

Implementation Hints

  • Treat timeouts as first-class: they determine observed consistency.
  • Implement read repair: if a read sees divergent versions, fix replicas in the background.
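
A hedged sketch of the coordinator’s quorum counting, modeling each replica as a callable and treating the timeout as a first-class outcome; the `reply` shape and `read` helper are illustrative:

```go
// Minimal sketch of coordinator-side quorum reads: fan out to all replicas,
// succeed once `required` replies arrive, and return the newest version seen.
package quorum

import (
	"errors"
	"time"
)

type reply struct {
	value   []byte
	version uint64
	err     error
}

func read(replicas []func() reply, required int, timeout time.Duration) ([]byte, error) {
	ch := make(chan reply, len(replicas)) // buffered: late replies never block
	for _, call := range replicas {
		go func(call func() reply) { ch <- call() }(call)
	}
	var best reply
	acks := 0
	deadline := time.After(timeout)
	for acks < required {
		select {
		case r := <-ch:
			if r.err != nil {
				continue // a failed replica is simply a non-answer
			}
			acks++
			if r.version >= best.version {
				best = r // keep the newest version among responders
			}
		case <-deadline:
			return nil, errors.New("quorum not reached before timeout")
		}
	}
	return best.value, nil
}
```

Read repair then uses the late and divergent replies: any replica that answered with an older version gets the winning version written back in the background.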

Learning milestones

  1. Replication works → You understand redundancy mechanics
  2. Consistency levels behave predictably → You understand trade-offs precisely
  3. Repair converges → You understand “eventual” as an algorithm, not a slogan

Project 11: Anti-Entropy Repair with Merkle Trees (Convergence at Scale)

  • File: LEARN_NOSQL_DEEP_DIVE.md
  • Main Programming Language: Rust
  • Alternative Programming Languages: Go, Java, C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 4: Expert
  • Knowledge Area: Data Repair / Distributed Consistency
  • Software or Tool: Merkle trees, anti-entropy
  • Main Book: Designing Data-Intensive Applications by Martin Kleppmann

What you’ll build: A background repair process that detects divergence between replicas using Merkle trees and syncs only the differing ranges.

Why it teaches NoSQL: Once you stop assuming perfect networks, you need systematic convergence. Dynamo uses a decentralized synchronization approach and describes Merkle trees for reconciliation.

Core challenges you’ll face:

  • Tree construction per range → maps to efficient summaries
  • Comparing trees to locate diffs → maps to logarithmic divergence detection
  • Bandwidth control → maps to repair storms and SLO risk
  • Repair correctness → maps to idempotent, retryable sync

Key Concepts

  • Merkle tree reconciliation: Dynamo paper
  • Background maintenance debt: compaction mindset applied to repair
  • Operational guardrails: rate limiting and scheduling

Difficulty: Expert
Time estimate: 1-2 weeks
Prerequisites: Project 10 recommended

Real world outcome

  • You can demonstrate: “replicas diverge under partition; repair restores convergence with bounded network use.”
  • You can show “bytes transferred vs naive full sync.”

Implementation Hints

  • Build trees over key ranges (or partitions).
  • Compare roots first; drill down only when hashes differ.
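
A minimal sketch of the compare-and-drill-down step, written in Go rather than Rust for consistency with the other examples; the `Node` shape and the assumption that both trees have identical structure are simplifications:

```go
// Minimal sketch of Merkle-tree reconciliation: compare roots, recurse only
// into subtrees whose hashes differ, and return the leaf ranges to repair.
package merkle

import "crypto/sha256"

type Node struct {
	Hash        [32]byte
	Left, Right *Node
	Range       string // key range covered by this node (leaf: a bucket of keys)
}

// parent combines two children into an interior node.
func parent(l, r *Node) *Node {
	h := sha256.Sum256(append(l.Hash[:], r.Hash[:]...))
	return &Node{Hash: h, Left: l, Right: r}
}

// diff returns the key ranges that diverge between two trees of the same shape.
func diff(a, b *Node) []string {
	if a.Hash == b.Hash {
		return nil // whole subtree matches: skip it entirely
	}
	if a.Left == nil {
		return []string{a.Range} // leaf: this range must be synced
	}
	return append(diff(a.Left, b.Left), diff(a.Right, b.Right)...)
}
```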

Learning milestones

  1. Repair detects divergence → You understand summary-based synchronization
  2. Repair is bandwidth-efficient → You understand why Merkle trees matter
  3. Repair is safe under retries → You understand distributed idempotency

Project 12: CAP and Failure Lab (Make Trade-offs Observable)

  • File: LEARN_NOSQL_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, Rust, Java
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Distributed Systems / Correctness
  • Software or Tool: Fault injection harness
  • Main Book: Designing Data-Intensive Applications by Martin Kleppmann

What you’ll build: A harness that injects partitions, delay, drops, and node crashes into your Project 10 cluster; it outputs a timeline of anomalies (stale reads, write loss, unavailability).

Why it teaches NoSQL: CAP is not a slogan; it’s a set of failure-driven constraints. The formal result (Gilbert & Lynch) shows you cannot guarantee consistency and availability under partitions in an asynchronous network model.

Core challenges you’ll face:

  • Defining consistency checks → maps to linearizability vs eventual
  • Modeling availability → maps to what counts as a successful response
  • Reproducible chaos → maps to deterministic test traces
  • Interpreting anomalies → maps to debugging distributed behavior

Key Concepts

  • CAP impossibility proof framing: Gilbert & Lynch 2002
  • Consistency definitions: DDIA (models of consistency)
  • Fault injection as a method: systems testing discipline

Difficulty: Advanced
Time estimate: Weekend
Prerequisites: Any distributed prototype (Project 10 ideal)

Real world outcome

  • A report that says: “Under partition X, QUORUM reads avoid stale reads but writes become unavailable,” etc.
  • A library of “failure scenarios” you can replay.

Implementation Hints

  • Insert a controllable proxy between nodes; the proxy enforces drop/delay rules.
  • Build invariants: monotonic read checks, read-your-writes checks, and “no lost acknowledged writes.”
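
As one example of such an invariant, here is a hedged sketch of a stale-read check over a recorded trace, written in Go for consistency with the other examples; the `Event` shape and per-key version numbers are assumptions, and the check deliberately ignores reads that overlap in time with the write they miss:

```go
// Minimal sketch of a failure-lab invariant: a read that returns a version
// older than one already acknowledged for the same key is flagged as stale.
package chaoslab

type Event struct {
	Kind    string // "write_ack" or "read", in trace (time) order
	Key     string
	Version uint64 // monotonically increasing per key in this sketch
}

func staleReads(trace []Event) []Event {
	latestAcked := map[string]uint64{}
	var stale []Event
	for _, e := range trace {
		switch e.Kind {
		case "write_ack":
			if e.Version > latestAcked[e.Key] {
				latestAcked[e.Key] = e.Version
			}
		case "read":
			if e.Version < latestAcked[e.Key] {
				stale = append(stale, e)
			}
		}
	}
	return stale
}
```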

Learning milestones

  1. You can reproduce anomalies → You stop treating distributed bugs as “random”
  2. You can explain CAP trade-offs concretely → You understand theory through behavior
  3. You can choose a design intentionally → You can justify NoSQL trade-offs

Project 13: Raft-backed Metadata Service (When You Need Strong Consistency)

  • File: LEARN_NOSQL_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, Java
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Consensus / Cluster Coordination
  • Software or Tool: Raft consensus
  • Main Book: Designing Data-Intensive Applications by Martin Kleppmann

What you’ll build: A small strongly-consistent metadata service for your cluster: membership, shard assignments, configuration versions—replicated using Raft.

Why it teaches NoSQL: Many “eventually consistent” stores still need a CP core for coordination (who owns what, what config is current). Raft decomposes consensus into leader election, log replication, and safety properties.

Core challenges you’ll face:

  • Leader election → maps to availability vs split brain
  • Replicated log → maps to state machine replication
  • Membership changes → maps to reconfiguration safety
  • Client semantics → maps to linearizable reads vs stale reads

Resources for key challenges:

  • “In Search of an Understandable Consensus Algorithm (Raft)” (Ongaro & Ousterhout)

Key Concepts

  • Replicated log: Raft paper
  • Leadership model: Raft paper
  • Why consensus is needed: DDIA (coordination chapter)

Difficulty: Expert
Time estimate: 1 month+
Prerequisites: Solid networking, comfort with concurrency, testing discipline

Real world outcome

  • You can kill leaders and show the cluster continues with a new leader and no divergent metadata.
  • You can show that shard maps are consistent across nodes at all times.

Implementation Hints

  • Implement Raft as a replicated log; apply committed entries to a deterministic state machine.
  • Separate transport reliability (retries, timeouts) from Raft logic.
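
A minimal sketch of the second hint: applying committed entries to a deterministic state machine that holds the shard map. The `Entry` and `ShardMap` types and the JSON command format are illustrative; the Raft log and commit index are assumed to exist elsewhere:

```go
// Minimal sketch of the apply side of a Raft-backed metadata service.
package metadata

import "encoding/json"

type Entry struct {
	Index   uint64
	Command []byte // e.g. {"op":"assign_shard","shard":3,"node":"n2"}
}

type ShardMap struct {
	Assignments map[int]string // shard -> owning node
	lastApplied uint64
}

// Apply must be deterministic: the same committed entries in the same order
// must produce the same shard map on every node.
func (s *ShardMap) Apply(committed []Entry) error {
	if s.Assignments == nil {
		s.Assignments = map[int]string{}
	}
	for _, e := range committed {
		if e.Index <= s.lastApplied {
			continue // already applied; replays after restart are harmless
		}
		var cmd struct {
			Op    string `json:"op"`
			Shard int    `json:"shard"`
			Node  string `json:"node"`
		}
		if err := json.Unmarshal(e.Command, &cmd); err != nil {
			return err
		}
		if cmd.Op == "assign_shard" {
			s.Assignments[cmd.Shard] = cmd.Node
		}
		s.lastApplied = e.Index
	}
	return nil
}
```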

Learning milestones

  1. Leader election stabilizes → You understand coordination under failure
  2. Log replication is correct → You understand state machine replication
  3. Reconfiguration is safe → You can operate distributed systems without hand-waving

Project 14: Storage Engine Introspection CLI (Explain Your Database to Yourself)

  • File: LEARN_NOSQL_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, Rust, Java
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Observability / Debugging
  • Software or Tool: Internal metrics + “explain” tooling
  • Main Book: Designing Data-Intensive Applications by Martin Kleppmann

What you’ll build: A CLI (or small web dashboard) that can introspect your DB: list SSTables, levels, compaction backlog, bloom filter stats, cache hit rates, replication lag, and repair status.

Why it teaches NoSQL: Real databases are operated, not just implemented. Observability turns “it’s slow” into “compaction debt is causing tail latency.”

Core challenges you’ll face:

  • Defining meaningful metrics → maps to operability and SLOs
  • Tracing read/write paths → maps to finding amplification sources
  • Surfacing invariants → maps to detecting corruption or divergence early
  • Usable UX → maps to making systems debuggable

Key Concepts

  • Operational visibility: DDIA (operational concerns)
  • Amplification metrics: compaction and caching economics
  • Failure signals: replication lag, repair backlog

Difficulty: Intermediate
Time estimate: Weekend
Prerequisites: Any prior project (works best after 1–4 and 10)

Real world outcome

  • You can answer: “Why is this query slow?” using your own “explain” output.
  • You can watch compaction debt rise and fall during load tests.

Implementation Hints

  • Treat internal state as a “system catalog” you can query.
  • Include human-first summaries (top 3 reasons for latency) plus raw numbers.
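
A minimal sketch of that catalog idea, written in Go for consistency with the other examples; the snapshot fields, thresholds, and `headline` helper are illustrative:

```go
// Minimal sketch of treating internal state as a queryable "system catalog":
// one snapshot struct the CLI can render as raw tables or human-first summaries.
package introspect

type EngineSnapshot struct {
	SSTablesPerLevel   map[int]int
	CompactionBacklog  int64   // bytes waiting to be compacted
	BloomFalsePositive float64 // measured rate, not the configured target
	BlockCacheHitRate  float64
	ReplicationLagMs   map[string]int64 // per peer
	RepairBacklogKeys  int64
}

// headline picks the single most pressing issue for the human-first summary;
// the thresholds here are placeholders you would tune for your own engine.
func headline(s EngineSnapshot) string {
	switch {
	case s.CompactionBacklog > 1<<30:
		return "compaction debt exceeds 1 GiB: expect rising read amplification"
	case s.BlockCacheHitRate < 0.5:
		return "block cache hit rate below 50%: reads are going to disk"
	default:
		return "no dominant bottleneck detected"
	}
}
```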

Learning milestones

  1. Metrics are accurate → You trust your system’s introspection
  2. You can predict incidents → You understand leading indicators
  3. You can tune with confidence → You operate, not guess

Project Comparison Table

Project | Difficulty | Time | Depth of Understanding | Fun Factor
WAL + Memtable KV | Intermediate | 1-2 weeks | High | Medium
SSTable Builder | Advanced | 1-2 weeks | Very High | Medium
Compaction Simulator | Advanced | Weekend | High | High
Bloom + Cache | Advanced | 1-2 weeks | High | Medium
MVCC Snapshots | Expert | 1 month+ | Very High | Medium
Secondary Indexes | Advanced | 1-2 weeks | High | Medium
Document Layer | Intermediate | 1-2 weeks | Medium | Medium
Mini-Bigtable | Expert | 1 month+ | Very High | High
Consistent Hash Sharder | Advanced | 1-2 weeks | High | High
Replication + Quorums | Expert | 1 month+ | Very High | High
Merkle Repair | Expert | 1-2 weeks | High | Medium
CAP Failure Lab | Advanced | Weekend | High | High
Raft Metadata Service | Expert | 1 month+ | Very High | High
Introspection CLI | Intermediate | Weekend | Medium | Medium

Recommendation: Where to Start (and why)

If your goal is to understand how NoSQL works internally (not just “use Mongo/Cassandra”), start with:

1) Project 1 (WAL + memtable) → because it teaches durability and crash semantics first.
2) Project 2 (SSTables) → because it forces you to confront on-disk layout and the read path.
3) Project 3 (Compaction simulator) → because it makes the main LSM trade-offs measurable.
4) Then choose a branch:

  • Distributed track: Projects 9 → 10 → 11 → 12
  • Data-model track: Projects 7 → 6 → 8
  • Strong-consistency core: Project 13 (only after you’ve felt failure modes)

Final Overall Capstone Project: “Mini-Dynamo + LSM” (A Real NoSQL Database)

  • File: LEARN_NOSQL_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, Java
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 5: Master
  • Knowledge Area: Distributed Storage Systems
  • Software or Tool: LSM storage engine + replication + observability
  • Main Book: Designing Data-Intensive Applications by Martin Kleppmann

What you’ll build: A distributed NoSQL database with an LSM-tree storage engine, sharding via consistent hashing, replication with tunable consistency, background repair, and an operator-friendly introspection console.

Why it teaches NoSQL: This is the “minimum complete” set of mechanisms behind Dynamo/Cassandra-like systems (partitioning + replication + reconciliation) combined with modern LSM storage foundations.

Core challenges you’ll face:

  • End-to-end correctness (ack semantics, crash safety) → maps to WAL + recovery
  • Operational stability (compaction/repair debt) → maps to background work management
  • Consistency controls (client-visible guarantees) → maps to quorum design
  • Reconfiguration (adding nodes safely) → maps to metadata coordination (optional Raft)

Key Concepts

  • Highly available KV patterns: Dynamo paper
  • Compaction economics: RocksDB compaction model
  • Consistency trade-offs under partitions: Gilbert & Lynch (CAP)

Difficulty: Master
Time estimate: 1 month+ (expect multiple iterations)
Prerequisites: Projects 1–4 plus 9–12 (Project 13 optional but valuable)

Real world outcome

  • A 3–5 node cluster that can:
    • survive node crashes,
    • rebalance keys when nodes join,
    • demonstrate consistency levels under partitions,
    • and show internal health via a self-hosted dashboard.

Implementation Hints

  • Make every subsystem observable: compaction backlog, repair backlog, and replication lag.
  • Define the API surface narrowly (KV or document-lite). Query power can come later.
  • Use chaos scenarios (Project 12) as your “definition of done.”

Learning milestones

  1. Single-node engine is durable and fast → You understand storage engines deeply
  2. Cluster is correct under failures → You understand replication and partitions
  3. You can operate it with confidence → You’ve internalized NoSQL as an operational system

Summary

# | Project Name | Main Programming Language
1 | WAL + Memtable Key-Value Store | Go
2 | SSTable Builder + Immutable Runs | Go
3 | LSM Compaction Simulator | Python
4 | Bloom Filters + Block Cache | Go
5 | MVCC Snapshots | Rust
6 | Secondary Indexes | Go
7 | Document Store Layer | TypeScript
8 | Wide-Column Mini-Bigtable | Java
9 | Consistent Hashing Sharder | Go
10 | Replication + Tunable Consistency | Go
11 | Anti-Entropy Repair with Merkle Trees | Rust
12 | CAP and Failure Lab | Python
13 | Raft-backed Metadata Service | Go
14 | Storage Engine Introspection CLI | Python
15 | Capstone: Mini-Dynamo + LSM | Go