Learn NoSQL Databases: From Zero to “I Can Build One”

Goal: Deeply understand how NoSQL databases work—by building the essential components yourself: storage engines, indexes, compaction, partitioning, replication, consistency controls, and operational tooling.


Why NoSQL Exists (and what you’re really learning)

Relational databases are optimized around schemas, joins, and ACID transactions. NoSQL systems exist because many real workloads need different trade-offs:

  • Scale-out across many machines (partitioning/sharding)
  • High write throughput with predictable latencies (log-structured storage, append-first designs)
  • High availability under failures and partitions (replication + tunable consistency)
  • Flexible data models (document, key-value, wide-column, graph)
  • Operational primitives (online rebalancing, rolling upgrades, multi-DC)

After these projects, you should be able to:

  • Explain a NoSQL system’s read path and write path at the “bytes on disk + messages on wire” level
  • Reason precisely about consistency (quorum, leader-based, eventual, causal)
  • Diagnose performance problems like write amplification, read amplification, and hot partitions
  • Build and operate a small but real distributed NoSQL store

Core Concept Analysis

1) Data models and query models

  • Key-Value: primary-key access, minimal query surface
  • Document: JSON-like documents, secondary indexes, partial updates
  • Wide-Column (Bigtable-style): rows + column families + timestamps
  • Graph: adjacency, traversals, index-free patterns
  • Time-series: append-heavy, compression, downsampling

2) Storage engine fundamentals (single-node)

  • Write path: WAL → in-memory structure (memtable) → immutable runs (SSTables); a minimal sketch follows this list
  • Read path: memtable → bloom filters → block indexes → SSTable block reads → merge results
  • Compaction: turning many runs into fewer runs (and deleting tombstones)
  • Recovery: replay WAL / restore last checkpoint
  • Indexes: primary + secondary; trade-offs between write cost and query power
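
A minimal sketch of that write path, assuming a toy engine with a line-based WAL and a map-backed memtable (all names here are illustrative, not a reference implementation): durability comes from the log, and the in-memory structure is rebuilt from it on restart.

```go
// Toy LSM write path: append to the WAL, fsync, then update the memtable.
package main

import (
	"bufio"
	"fmt"
	"os"
)

type DB struct {
	wal      *os.File
	walBuf   *bufio.Writer
	memtable map[string]string // a real engine would use a sorted structure
}

func Open(path string) (*DB, error) {
	// A real Open would also replay an existing log to rebuild the memtable.
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		return nil, err
	}
	return &DB{wal: f, walBuf: bufio.NewWriter(f), memtable: map[string]string{}}, nil
}

// Put is durable only after the WAL record is flushed and fsynced.
func (db *DB) Put(key, value string) error {
	if _, err := fmt.Fprintf(db.walBuf, "PUT\t%s\t%s\n", key, value); err != nil {
		return err
	}
	if err := db.walBuf.Flush(); err != nil {
		return err
	}
	if err := db.wal.Sync(); err != nil { // fsync: the durability boundary
		return err
	}
	db.memtable[key] = value // apply to memory only after the log is durable
	return nil
}

func (db *DB) Get(key string) (string, bool) {
	v, ok := db.memtable[key]
	return v, ok // a full engine would fall back to SSTables on a miss
}

func main() {
	db, err := Open("demo.wal")
	if err != nil {
		panic(err)
	}
	_ = db.Put("user:1", "alice")
	fmt.Println(db.Get("user:1"))
}
```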

3) Distribution and resilience

  • Partitioning: hashing, consistent hashing rings, range partitioning
  • Replication: leader/follower vs multi-leader vs leaderless
  • Consistency: quorum reads/writes, causal consistency, read-your-writes, linearizability
  • Membership: failure detection, gossip, rebalancing
  • Repair: anti-entropy, Merkle trees, hinted handoff

4) Operations (what makes it “real”)

  • Observability: metrics, tracing, compaction dashboards
  • Backups/snapshots
  • Schema evolution & migrations (even in “schemaless” systems)
  • SLOs and tail latency, overload control, admission control

Concept Summary Table

| Concept Cluster | What You Need to Internalize |
|---|---|
| Data models | Key-value, document, wide-column, graph, and time-series differ in access patterns. |
| Write path | WAL + memtable + SSTables define durability and throughput. |
| Read path | Bloom filters and indexes reduce read amplification. |
| Compaction | Background merges trade write cost for read performance. |
| Replication | Leaderless and quorum models enable availability. |
| Consistency | Quorums, causal ordering, and linearizability are explicit choices. |
| Operations | Monitoring, backups, and rebalancing make systems reliable. |

Deep Dive Reading by Concept

Storage Engines

| Concept | Book & Chapter |
|---|---|
| LSM trees | Designing Data-Intensive Applications — Ch. 3: “Storage and Retrieval” |
| Indexing basics | Database Internals — Ch. 2-3: “B-Tree Basics” |

Distribution and Consistency

| Concept | Book & Chapter |
|---|---|
| Replication models | Designing Data-Intensive Applications — Ch. 5: “Replication” |
| Partitioning | Designing Data-Intensive Applications — Ch. 6: “Partitioning” |
| Consensus | Designing Data-Intensive Applications — Ch. 9: “Consistency and Consensus” |

Operations

| Concept | Book & Chapter |
|---|---|
| Transaction durability | Database System Concepts — Ch. 19: “Recovery System” |

Project List

All projects below are written into: LEARN_NOSQL_DEEP_DIVE.md


Project 1: “WAL + Memtable Key-Value Store” — The Write Path

| Attribute | Details |
|---|---|
| File | LEARN_NOSQL_DEEP_DIVE.md |
| Main Programming Language | Go |
| Alternative Programming Languages | Rust, Java, C++ |
| Coolness Level | Level 3: Genuinely Clever |
| Business Potential | 1. The “Resume Gold” |
| Difficulty | Level 2: Intermediate |
| Knowledge Area | Storage Engines / Durability |
| Software or Tool | A WAL of your own, fsync semantics |
| Main Book | Designing Data-Intensive Applications by Martin Kleppmann |

What you’ll build: A local key-value database that supports put/get/delete, persists via a write-ahead log, and recovers correctly after crashes.

Why it teaches NoSQL: Most NoSQL systems start with an append-first durability story. WAL + in-memory state is the core of nearly every storage engine.

Core challenges you’ll face:

  • Defining a record format (checksums, length-prefixing) → maps to binary protocols & corruption detection
  • Durability boundaries (fsync rules) → maps to what “committed” really means
  • Crash recovery (replay log into memtable) → maps to idempotency and ordering
  • Tombstones (delete markers) → maps to deletion in append-only stores

Key Concepts

  • WAL & durability: DDIA (Storage & Retrieval)
  • Crash recovery invariants: Operating Systems: Three Easy Pieces (Persistence / FS)
  • Checksums & torn writes: Computer Systems: A Programmer’s Perspective (I/O + representation)

Difficulty: Intermediate
Time estimate: 1-2 weeks
Prerequisites: Basic file I/O, basic data structures, comfort with unit tests

Real world outcome

  • You can demonstrate: “write 10k keys → force-kill process → restart → all committed keys are present; non-committed keys are absent.”
  • You can show a human-readable “log dump” that replays operations and highlights corruption.

Implementation Hints

  • Treat the WAL as the source of truth; the memtable is a cache.
  • Define a strict ordering: write bytes → flush → (optionally) fsync → then acknowledge.
  • Include a checksum per record; on recovery, stop at the first invalid record (one possible record framing is sketched below).
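
One possible framing for those records, assuming a `[length][crc32][payload]` layout; the field order and sizes are an assumption of this sketch, not a fixed format.

```go
// Hypothetical WAL framing: 4-byte big-endian payload length, 4-byte CRC32
// of the payload, then the payload itself. Recovery stops at the first
// record whose checksum does not match (e.g. a torn final write).
package wal

import (
	"encoding/binary"
	"errors"
	"hash/crc32"
	"io"
)

func WriteRecord(w io.Writer, payload []byte) error {
	var hdr [8]byte
	binary.BigEndian.PutUint32(hdr[0:4], uint32(len(payload)))
	binary.BigEndian.PutUint32(hdr[4:8], crc32.ChecksumIEEE(payload))
	if _, err := w.Write(hdr[:]); err != nil {
		return err
	}
	_, err := w.Write(payload)
	return err // the caller decides when to flush/fsync (the durability boundary)
}

// ErrCorrupt means "stop replaying here": the tail of the log is unusable.
var ErrCorrupt = errors.New("wal: corrupt record")

// ReadRecord returns io.EOF at a clean end of log and ErrCorrupt on a
// truncated or checksum-failing record.
func ReadRecord(r io.Reader) ([]byte, error) {
	var hdr [8]byte
	if _, err := io.ReadFull(r, hdr[:]); err != nil {
		return nil, err
	}
	n := binary.BigEndian.Uint32(hdr[0:4])
	sum := binary.BigEndian.Uint32(hdr[4:8])
	payload := make([]byte, n)
	if _, err := io.ReadFull(r, payload); err != nil {
		return nil, ErrCorrupt // truncated payload: treat as a torn write
	}
	if crc32.ChecksumIEEE(payload) != sum {
		return nil, ErrCorrupt
	}
	return payload, nil
}
```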

Learning milestones

  1. Log replay restores state → You understand durability as “replayable history”
  2. Crashes don’t corrupt state → You understand atomicity boundaries
  3. Deletes work via tombstones → You understand why compaction is needed later

Project 2: “SSTable Builder + Immutable Sorted Runs” — The Read Path Starts

| Attribute | Details |
|---|---|
| File | LEARN_NOSQL_DEEP_DIVE.md |
| Main Programming Language | Go |
| Alternative Programming Languages | Rust, Java, C++ |
| Coolness Level | Level 3: Genuinely Clever |
| Business Potential | 1. The “Resume Gold” |
| Difficulty | Level 3: Advanced |
| Knowledge Area | Storage Engines / On-disk Layout |
| Software or Tool | SSTable format (your own) |
| Main Book | Designing Data-Intensive Applications by Martin Kleppmann |

What you’ll build: Convert a memtable into an immutable, sorted on-disk file (SSTable) with a sparse index and block structure, then support reads from SSTables.

Why it teaches NoSQL: LSM-based NoSQL engines store data as immutable sorted runs; everything else (bloom filters, compaction, cache) builds on this.

Core challenges you’ll face:

  • Sorted encoding (key order, prefix compression) → maps to disk-efficient representations
  • Block index design (sparse vs dense) → maps to IO vs CPU trade-offs
  • Merging sources (memtable + multiple SSTables) → maps to multi-way merge logic
  • Tombstone semantics → maps to visibility rules across runs

Key Concepts

  • SSTables and runs: DDIA (LSM trees)
  • Binary search & block indexes: Algorithms, 4th Edition (searching)
  • Compression trade-offs: MongoDB/WiredTiger “compression + checkpoints” concepts (see Project 5 resources)

Difficulty: Advanced
Time estimate: 1-2 weeks
Prerequisites: Project 1; familiarity with sorting and file formats

Real world outcome

  • You can generate SSTables and prove reads work without loading full datasets into RAM.
  • You can show “read amplification” by printing which blocks/files were touched per query.

Implementation Hints

  • Use fixed-size blocks; each block contains a sequence of key/value entries.
  • A sparse index maps “first key in block” → file offset.
  • Reads: seek to the candidate block(s), then scan inside a block (see the index sketch below).
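
A minimal version of that sparse index, assuming each block is summarized by its first key and file offset (the names are illustrative):

```go
// Sparse block index: one entry per block, mapping the block's first key to
// its file offset. Lookup picks the last block whose first key is <= the
// search key, then the caller scans inside that block.
package sstable

import "sort"

type indexEntry struct {
	firstKey string
	offset   int64
}

type sparseIndex struct {
	entries []indexEntry // sorted by firstKey, one per block
}

// candidateBlock returns the offset of the single block that could contain
// key, and false if key sorts before the first block in the file.
func (idx *sparseIndex) candidateBlock(key string) (int64, bool) {
	// Index of the first entry whose firstKey is strictly greater than key.
	i := sort.Search(len(idx.entries), func(i int) bool {
		return idx.entries[i].firstKey > key
	})
	if i == 0 {
		return 0, false // key < smallest key in the file
	}
	return idx.entries[i-1].offset, true
}
```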

Learning milestones

  1. Immutable files enable simple recovery → You see why append + immutability is powerful
  2. Sparse index works → You understand storage layouts for fast lookups
  3. Multi-run reads work → You internalize the merge-based read path

Project 3: “LSM Compaction Simulator” — Make Write Amplification Visible

| Attribute | Details |
|---|---|
| File | LEARN_NOSQL_DEEP_DIVE.md |
| Main Programming Language | Python |
| Alternative Programming Languages | Go, Rust, Java |
| Coolness Level | Level 4: Hardcore Tech Flex |
| Business Potential | 2. The “Micro-SaaS / Pro Tool” |
| Difficulty | Level 3: Advanced |
| Knowledge Area | Storage Engines / Compaction Economics |
| Software or Tool | RocksDB compaction mental model |
| Main Book | Designing Data-Intensive Applications by Martin Kleppmann |

What you’ll build: A simulator that models leveled vs tiered compaction, producing graphs/tables for write amplification, space amplification, and read amplification.

Why it teaches NoSQL: Compaction is where many NoSQL performance mysteries live: stalls, SSD wear, tail latency spikes, and “why is my disk 10× busier than my workload?”

Core challenges you’ll face:

  • Modeling levels and fanout → maps to how LSM trees scale
  • Modeling tombstones/TTL → maps to deletes and time-based data
  • Compaction scheduling → maps to background work vs foreground latency
  • Hot partitions and skew → maps to real-world workload behavior

Resources for key challenges:

  • RocksDB compaction concepts (leveled vs tiered, write amplification) — RocksDB Wiki (Compaction)

Key Concepts

  • Leveled vs tiered compaction: RocksDB documentation
  • Write amplification mechanics: LSM tree paper lineage
  • Operational tuning mindset: DDIA (storage engines in practice)

Difficulty: Advanced
Time estimate: Weekend
Prerequisites: Comfort with basic probability and plotting; Projects 1–2 helpful

Real world outcome

  • You can answer: “Given workload X and configuration Y, why does the system rewrite Z bytes?”
  • You can produce a “compaction budget” report that predicts stall risk.

Implementation Hints

  • Model each run with size, key-range overlap, and tombstone fraction.
  • Track bytes written per compaction and divide by bytes of user writes (see the sketch below).
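
A sketch of that bookkeeping, shown in Go even though the project suggests Python; the structure and the rough leveled-compaction estimate in the comments are assumptions meant to be checked against your simulator.

```go
// Illustrative bookkeeping for the simulator: write amplification is total
// bytes physically written (flushes plus every compaction rewrite) divided
// by the logical bytes the user wrote.
package sim

type Stats struct {
	UserBytes    int64 // logical bytes accepted from client writes
	FlushBytes   int64 // memtable -> L0 SSTable bytes
	CompactBytes int64 // bytes rewritten by all compactions
}

func (s *Stats) RecordUserWrite(n int64)          { s.UserBytes += n }
func (s *Stats) RecordFlush(n int64)              { s.FlushBytes += n }
func (s *Stats) RecordCompaction(rewritten int64) { s.CompactBytes += rewritten }

// WriteAmplification is >= 1 once data starts flushing. For leveled
// compaction with fanout F and L levels, a common rough estimate is that
// each byte gets rewritten on the order of F times per level; the simulator
// exists to check estimates like that empirically.
func (s *Stats) WriteAmplification() float64 {
	if s.UserBytes == 0 {
		return 0
	}
	return float64(s.FlushBytes+s.CompactBytes) / float64(s.UserBytes)
}
```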

Learning milestones

  1. You can compute write amplification → You understand why LSMs trade write cost for throughput
  2. You can explain stalls → You understand background debt as a first-class system constraint
  3. You can reason about tuning → You can predict outcomes of fanout/level changes

Project 4: “Bloom Filters + Block Cache” — Make Reads Predictable

| Attribute | Details |
|---|---|
| File | LEARN_NOSQL_DEEP_DIVE.md |
| Main Programming Language | Go |
| Alternative Programming Languages | Rust, Java, C++ |
| Coolness Level | Level 3: Genuinely Clever |
| Business Potential | 2. The “Micro-SaaS / Pro Tool” |
| Difficulty | Level 3: Advanced |
| Knowledge Area | Indexing / Performance Engineering |
| Software or Tool | Bloom filters, LRU cache |
| Main Book | Designing Data-Intensive Applications by Martin Kleppmann |

What you’ll build: Add bloom filters to SSTables and implement a block cache with instrumentation (hit rate, bytes saved, latency percentiles).

Why it teaches NoSQL: Modern NoSQL engines are defined as much by performance structures (filters, caches) as by correctness.

Core challenges you’ll face:

  • False positives → maps to probabilistic data structures
  • Cache eviction policy → maps to tail latency control
  • Read amplification visibility → maps to measuring what your system does
  • Interplay with compaction → maps to why files move and caches churn

Key Concepts

  • Bloom filters: canonical probability trade-off (false positive tuning)
  • Caching strategies: DDIA (cache + storage)
  • Latency measurement: “percentiles over averages” mindset

Difficulty: Advanced
Time estimate: 1-2 weeks
Prerequisites: Projects 1–2 (or any SSTable reader)

Real world outcome

  • You can show: “same dataset, same queries; bloom filters cut disk reads by N%.”
  • You can demonstrate “cache thrash” scenarios and mitigation.

Implementation Hints

  • Keep a bloom filter per SSTable (or per block) and record filter checks per query; a minimal filter is sketched after this list.
  • Measure: blocks touched, bytes read from disk, and cache hit ratio.
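
A minimal Bloom filter to attach per SSTable or per block, using the double-hashing trick over two FNV hashes; the sizing parameters are placeholders you would tune for a target false-positive rate.

```go
// Minimal Bloom filter sketch: k probe positions derived from two FNV-based
// hashes. Bits (m) and probes (k) are the false-positive tuning knobs.
package bloom

import "hash/fnv"

type Filter struct {
	bits []uint64
	m    uint64 // number of bits
	k    uint64 // number of probes per key
}

func New(mBits, k uint64) *Filter {
	return &Filter{bits: make([]uint64, (mBits+63)/64), m: mBits, k: k}
}

func hashes(key []byte) (uint64, uint64) {
	h1 := fnv.New64a()
	h1.Write(key)
	h2 := fnv.New64()
	h2.Write(key)
	return h1.Sum64(), h2.Sum64() | 1 // keep the second hash odd so probes spread
}

func (f *Filter) Add(key []byte) {
	a, b := hashes(key)
	for i := uint64(0); i < f.k; i++ {
		pos := (a + i*b) % f.m
		f.bits[pos/64] |= 1 << (pos % 64)
	}
}

// MightContain never returns a false negative; false positives happen at a
// rate set by m/n and k, which is exactly what this project measures.
func (f *Filter) MightContain(key []byte) bool {
	a, b := hashes(key)
	for i := uint64(0); i < f.k; i++ {
		pos := (a + i*b) % f.m
		if f.bits[pos/64]&(1<<(pos%64)) == 0 {
			return false
		}
	}
	return true
}
```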

Learning milestones

  1. Reads get faster without correctness risk → You understand probabilistic acceleration
  2. You can explain p95 spikes → You understand cache miss storms
  3. You can design dashboards → You think like a database operator

Project 5: “MVCC Snapshots” — Point-in-Time Reads Without Locks

| Attribute | Details |
|---|---|
| File | LEARN_NOSQL_DEEP_DIVE.md |
| Main Programming Language | Rust |
| Alternative Programming Languages | Go, Java, C++ |
| Coolness Level | Level 4: Hardcore Tech Flex |
| Business Potential | 1. The “Resume Gold” |
| Difficulty | Level 4: Expert |
| Knowledge Area | Concurrency Control / Storage Semantics |
| Software or Tool | MVCC + snapshot reads |
| Main Book | Designing Data-Intensive Applications by Martin Kleppmann |

What you’ll build: Add MVCC so concurrent operations can see consistent snapshots, enabling “read without blocking writes” with versioned values and garbage collection.

Why it teaches NoSQL: Many NoSQL/document stores rely on snapshot semantics (and it’s the conceptual foundation for consistent reads, change streams, and long-running queries).

Core challenges you’ll face:

  • Version chains → maps to data visibility rules
  • Snapshot timestamps → maps to logical time vs wall time
  • Garbage collection → maps to compaction + retention
  • Atomic compare-and-set semantics → maps to concurrency primitives

Resources for key challenges:

  • WiredTiger snapshot + checkpoint descriptions (MongoDB documentation)

Key Concepts

  • Snapshot isolation vs linearizability: DDIA
  • MVCC mechanics: WiredTiger concepts
  • GC and retention windows: operational trade-offs in MVCC

Difficulty: Expert
Time estimate: 1 month+
Prerequisites: Comfortable with concurrency; Projects 1–2 help significantly

Real world outcome

  • You can run a workload where writers run continuously while readers get repeatable snapshots.
  • You can show “old version accumulation” and explain how GC/compaction resolves it.

Implementation Hints

  • Assign a monotonically increasing logical version to writes.
  • Readers pick a snapshot version and only read versions ≤ snapshot.
  • Track the oldest active snapshot to know which versions can be reclaimed (see the sketch below).
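
A sketch of those visibility and reclamation rules over a newest-first version chain; the types and field names are assumptions of this sketch, shown in Go rather than the project's Rust.

```go
// Version-chain visibility: a reader at snapshot S sees the newest version
// whose commit version is <= S. Tombstones are versions with deleted=true.
package mvcc

type version struct {
	commit  uint64 // monotonically increasing logical version
	value   []byte
	deleted bool
}

// chain holds all versions of one key, newest first.
type chain []version

// readAt returns the visible value at snapshot s, or (nil, false) if the key
// does not exist (or was deleted) as of that snapshot.
func (c chain) readAt(s uint64) ([]byte, bool) {
	for _, v := range c {
		if v.commit <= s {
			if v.deleted {
				return nil, false
			}
			return v.value, true
		}
	}
	return nil, false
}

// gcBefore drops versions no active snapshot can see: everything older than
// the newest version whose commit is still <= oldestActiveSnapshot.
func (c chain) gcBefore(oldestActiveSnapshot uint64) chain {
	for i, v := range c {
		if v.commit <= oldestActiveSnapshot {
			return c[:i+1] // keep this one; everything older is unreachable
		}
	}
	return c
}
```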

Learning milestones

  1. Snapshot reads are correct → You understand isolation as a visibility rule
  2. Writers don’t block readers → You understand why MVCC is popular
  3. GC is safe and bounded → You understand how correctness meets operations

Project 6: “Secondary Indexes” — The Hidden Cost of “Flexible Queries”

| Attribute | Details |
|---|---|
| File | LEARN_NOSQL_DEEP_DIVE.md |
| Main Programming Language | Go |
| Alternative Programming Languages | Rust, Java, C++ |
| Coolness Level | Level 3: Genuinely Clever |
| Business Potential | 3. The “Service & Support” Model |
| Difficulty | Level 3: Advanced |
| Knowledge Area | Indexing / Querying |
| Software or Tool | Inverted index / B-tree vs LSM index |
| Main Book | Designing Data-Intensive Applications by Martin Kleppmann |

What you’ll build: Implement secondary indexes (exact-match and range) and quantify their write amplification and consistency challenges.

Why it teaches NoSQL: “NoSQL is flexible” ends the moment you need indexes at scale. This project reveals why many systems restrict query patterns.

Core challenges you’ll face:

  • Index maintenance on updates → maps to write fan-out
  • Consistency between primary and index → maps to atomicity without transactions
  • Backfilling indexes → maps to online migrations
  • Multi-valued fields → maps to document modeling trade-offs

Key Concepts

  • Index maintenance vs query power: DDIA
  • Backfill strategies: operational database thinking
  • Idempotent index updates: recovery + retries

Difficulty: Advanced
Time estimate: 1-2 weeks
Prerequisites: Projects 1–2 or any persistent KV store

Real world outcome

  • You can show: “query by field X returns correct results,” and “index build can run online.”
  • You can produce a report: “write cost increased by Y% due to index maintenance.”

Implementation Hints

  • Treat indexes as separate keyspaces: (index_key → primary_key list).
  • Define explicit update rules for modifications and deletes (see the sketch below).
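
One way to realize "indexes as separate keyspaces", assuming a composite key of the form `idx/<field>/<value>/<primaryKey>`; the separators and lack of escaping are simplifications of this sketch.

```go
// Secondary index as a separate keyspace: each index entry is a composite
// key with an empty value, so exact-match and range queries become prefix
// scans over the index keyspace.
package index

import (
	"fmt"
	"strings"
)

// entryKey builds the index entry written alongside the primary row.
func entryKey(field, value, primaryKey string) string {
	return fmt.Sprintf("idx/%s/%s/%s", field, value, primaryKey)
}

// maintain returns the index mutations for an update: delete the old entry,
// insert the new one. Applying these idempotently is what keeps retries safe.
func maintain(field, oldValue, newValue, primaryKey string) (toDelete, toPut string) {
	if oldValue != "" {
		toDelete = entryKey(field, oldValue, primaryKey)
	}
	if newValue != "" {
		toPut = entryKey(field, newValue, primaryKey)
	}
	return toDelete, toPut
}

// primaryKeysForValue extracts primary keys from a prefix scan of
// "idx/<field>/<value>/".
func primaryKeysForValue(scannedKeys []string, field, value string) []string {
	prefix := fmt.Sprintf("idx/%s/%s/", field, value)
	var out []string
	for _, k := range scannedKeys {
		if strings.HasPrefix(k, prefix) {
			out = append(out, strings.TrimPrefix(k, prefix))
		}
	}
	return out
}
```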

Learning milestones

  1. Indexes work → You understand query acceleration structures
  2. You measure write fan-out → You understand why indexing is expensive
  3. Online backfill works → You understand operational reality

Project 7: “Document Store Layer” — JSON Documents on Top of KV

| Attribute | Details |
|---|---|
| File | LEARN_NOSQL_DEEP_DIVE.md |
| Main Programming Language | TypeScript |
| Alternative Programming Languages | Go, Python, Java |
| Coolness Level | Level 2: Practical but Forgettable |
| Business Potential | 2. The “Micro-SaaS / Pro Tool” |
| Difficulty | Level 2: Intermediate |
| Knowledge Area | Document Model / API Design |
| Software or Tool | JSON Patch semantics, schema-less pitfalls |
| Main Book | Designing Data-Intensive Applications by Martin Kleppmann |

What you’ll build: A document database API: insert/update/partial-update, basic validation, and optional secondary indexes for top-level fields.

Why it teaches NoSQL: The “document” model is mostly an API and modeling discipline layered over a KV engine. You’ll see the gap between “schemaless” and “unstructured chaos.”

Core challenges you’ll face:

  • Document identity → maps to primary key strategy
  • Partial updates → maps to write amplification at document granularity
  • Validation & schema evolution → maps to keeping systems operable
  • Indexing specific fields → maps to real query needs

Key Concepts

  • Document vs relational trade-offs: DDIA
  • Schema-on-write vs schema-on-read: modeling discipline
  • Update granularity: why big documents hurt writes

Difficulty: Intermediate
Time estimate: 1-2 weeks
Prerequisites: JSON familiarity; basic API design

Real world outcome

  • A small REST or CLI interface where you can store and retrieve documents and run a few indexed queries.
  • Demonstrable “migration” of a field from one shape to another without breaking reads.

Implementation Hints

  • Store document blobs as values; keep indexes separate.
  • Add “document version” metadata for conflict detection later (see the sketch below).
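
A small sketch of that version metadata, shown in Go rather than the project's TypeScript: the stored value wraps the JSON blob with a version counter, and updates use optimistic compare-and-set. The field names and API are assumptions, not a prescribed design.

```go
// Optimistic concurrency for documents: writes must present the version they
// read; a mismatch means someone else wrote first and the caller retries.
package docstore

import "errors"

type StoredDoc struct {
	Version uint64 // bumped on every successful write
	Body    []byte // raw JSON document
}

var ErrVersionConflict = errors.New("docstore: version conflict")

// replace applies newBody only if expectedVersion still matches the stored doc.
func replace(current StoredDoc, expectedVersion uint64, newBody []byte) (StoredDoc, error) {
	if current.Version != expectedVersion {
		return current, ErrVersionConflict
	}
	return StoredDoc{Version: current.Version + 1, Body: newBody}, nil
}
```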

Learning milestones

  1. CRUD works reliably → You understand the document layer
  2. Partial updates are correct → You understand update semantics and their cost
  3. Schema evolution is managed → You think like an operator, not a tutorial

Project 8: “Wide-Column Mini-Bigtable” — Rows, Column Families, Timestamps

| Attribute | Details |
|---|---|
| File | LEARN_NOSQL_DEEP_DIVE.md |
| Main Programming Language | Java |
| Alternative Programming Languages | Go, Rust, C++ |
| Coolness Level | Level 4: Hardcore Tech Flex |
| Business Potential | 1. The “Resume Gold” |
| Difficulty | Level 4: Expert |
| Knowledge Area | Data Modeling / Storage Layout |
| Software or Tool | Bigtable mental model |
| Main Book | Designing Data-Intensive Applications by Martin Kleppmann |

What you’ll build: A wide-column store that supports (row key, column family, column qualifier, timestamp) as the addressing model, with range scans by row key.

Why it teaches NoSQL: Bigtable-style systems show how “NoSQL” can still be structured: sorted by row key, versioned by timestamp, grouped into families for IO locality.

Core challenges you’ll face:

  • Row-key ordering and range scans → maps to sorted storage & hot-spot risks
  • Column-family locality → maps to IO patterns and data layout
  • Time-versioned cells → maps to retention policies and GC
  • Sparse data → maps to why wide-column exists

Resources for key challenges:

  • “Bigtable: A Distributed Storage System for Structured Data” (OSDI 2006)

Key Concepts

  • Wide-column model: Bigtable paper
  • Sorted tablets and partitioning: Bigtable paper
  • Version retention: storage GC patterns

Difficulty: Expert
Time estimate: 1 month+
Prerequisites: Storage fundamentals (Projects 1–2 recommended)

Real world outcome

  • You can store time-series-like values per row and retrieve “latest N versions.”
  • You can run range scans and show how row-key design affects hotspots.

Implementation Hints

  • Physically cluster by row key first, then family, then qualifier (one possible key encoding is sketched after this list).
  • Implement a retention policy: “keep last K versions” or “keep last T time.”
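
One possible cell-key encoding for that clustering order, shown in Go rather than the project's Java, with the timestamp inverted so newer versions sort first; the separators and lack of escaping are simplifying assumptions.

```go
// Illustrative cell-key encoding: row key, column family, and qualifier are
// concatenated with a separator, and the timestamp is stored as
// (MaxUint64 - ts) so a lexicographic scan yields newest versions first,
// making "latest N versions" a short forward scan.
package widecolumn

import (
	"encoding/binary"
	"math"
)

func cellKey(row, family, qualifier string, ts uint64) []byte {
	var k []byte
	k = append(k, row...)
	k = append(k, 0x00) // separator; a real encoding must escape it in user data
	k = append(k, family...)
	k = append(k, 0x00)
	k = append(k, qualifier...)
	k = append(k, 0x00)
	var inv [8]byte
	binary.BigEndian.PutUint64(inv[:], math.MaxUint64-ts)
	return append(k, inv[:]...)
}

// prefixForColumn is the scan prefix that returns all versions of one cell,
// newest first.
func prefixForColumn(row, family, qualifier string) []byte {
	k := cellKey(row, family, qualifier, math.MaxUint64)
	return k[:len(k)-8] // drop the timestamp suffix
}
```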

Learning milestones

  1. Range scans are efficient → You understand why sorted order matters
  2. Versioning is correct → You understand time-based storage semantics
  3. You can explain hotspotting → You think like a Bigtable user

Project 9: “Consistent Hashing Sharder” — From Single Node to Many

| Attribute | Details |
|---|---|
| File | LEARN_NOSQL_DEEP_DIVE.md |
| Main Programming Language | Go |
| Alternative Programming Languages | Rust, Java, Python |
| Coolness Level | Level 3: Genuinely Clever |
| Business Potential | 3. The “Service & Support” Model |
| Difficulty | Level 3: Advanced |
| Knowledge Area | Distributed Systems / Partitioning |
| Software or Tool | Consistent hashing ring |
| Main Book | Designing Data-Intensive Applications by Martin Kleppmann |

What you’ll build: A sharding layer using consistent hashing with virtual nodes, rebalancing, and a “shard map” tool that explains where each key lives.

Why it teaches NoSQL: Many NoSQL systems partition data via consistent hashing to enable incremental scaling and rebalancing. Dynamo popularized this approach for highly available key-value systems.

Core challenges you’ll face:

  • Virtual nodes → maps to load distribution vs operational complexity
  • Rebalancing while serving traffic → maps to moving data safely
  • Hot key detection → maps to real workload skew
  • Key design constraints → maps to partition keys as product decisions

Resources for key challenges:

  • “Dynamo: Amazon’s Highly Available Key-value Store” (SOSP 2007)

Key Concepts

  • Consistent hashing and replication: Dynamo paper
  • Operational rebalancing: DDIA (partitioning chapter)
  • Hot partitions: practical performance thinking

Difficulty: Advanced
Time estimate: 1-2 weeks
Prerequisites: Networking basics; Projects 1–2 helpful

Real world outcome

  • A mini cluster can add/remove nodes and show how many keys moved.
  • A “trace” view shows for a key: hash → ring position → node set.

Implementation Hints

  • Keep membership info separate from data storage.
  • Make rebalancing observable: moved keys, bytes moved, and time taken (a minimal ring with virtual nodes is sketched below).
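
A minimal ring with virtual nodes that the "shard map" tool could query to explain key placement; the hash function, vnode naming, and counts are placeholders of this sketch.

```go
// Consistent-hashing ring with virtual nodes: each physical node is hashed
// onto the ring many times; a key is owned by the first vnode at or after
// its hash, wrapping around. Replica sets walk the ring to the next distinct
// physical nodes.
package ring

import (
	"fmt"
	"hash/fnv"
	"sort"
)

type vnode struct {
	hash uint64
	node string
}

type Ring struct {
	vnodes []vnode
}

func hashKey(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	return h.Sum64()
}

func New(nodes []string, vnodesPerNode int) *Ring {
	r := &Ring{}
	for _, n := range nodes {
		for i := 0; i < vnodesPerNode; i++ {
			r.vnodes = append(r.vnodes, vnode{hash: hashKey(fmt.Sprintf("%s#%d", n, i)), node: n})
		}
	}
	sort.Slice(r.vnodes, func(i, j int) bool { return r.vnodes[i].hash < r.vnodes[j].hash })
	return r
}

// PreferenceList returns the first `replicas` distinct physical nodes
// clockwise from the key's ring position.
func (r *Ring) PreferenceList(key string, replicas int) []string {
	h := hashKey(key)
	start := sort.Search(len(r.vnodes), func(i int) bool { return r.vnodes[i].hash >= h })
	seen := map[string]bool{}
	var out []string
	for i := 0; len(out) < replicas && i < len(r.vnodes); i++ {
		n := r.vnodes[(start+i)%len(r.vnodes)].node
		if !seen[n] {
			seen[n] = true
			out = append(out, n)
		}
	}
	return out
}
```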

Learning milestones

  1. Keys distribute evenly → You understand hashing trade-offs
  2. Scaling changes only part of the keyspace → You understand why consistent hashing matters
  3. You can debug hotspots → You understand production pain points

Project 10: “Replication + Tunable Consistency” — Quorums You Can Feel

| Attribute | Details |
|---|---|
| File | LEARN_NOSQL_DEEP_DIVE.md |
| Main Programming Language | Go |
| Alternative Programming Languages | Rust, Java, Erlang/Elixir |
| Coolness Level | Level 4: Hardcore Tech Flex |
| Business Potential | 1. The “Resume Gold” |
| Difficulty | Level 4: Expert |
| Knowledge Area | Replication / Consistency |
| Software or Tool | Quorum reads/writes |
| Main Book | Designing Data-Intensive Applications by Martin Kleppmann |

What you’ll build: Replicate each shard across N nodes and support per-operation consistency levels (e.g., “ONE”, “QUORUM”, “ALL”), including read repair.

Why it teaches NoSQL: This is the heart of Dynamo-style and Cassandra-style systems: you tune consistency vs latency/availability by changing how many replicas must respond.

Core challenges you’ll face:

  • Coordinator logic → maps to routing, timeouts, partial failures
  • Conflict detection → maps to versioning (vector clocks or logical versions)
  • Read repair → maps to eventual convergence mechanics
  • Timeout semantics → maps to what “success” means under failure

Resources for key challenges:

  • Dynamo paper (quorum-like technique, versioning)
  • Cassandra docs on tunable consistency

Key Concepts

  • Quorum intersection reasoning: Dynamo/Cassandra approach
  • Eventual consistency mechanics: DDIA
  • Failure as a normal case: distributed systems mindset

Difficulty: Expert
Time estimate: 1 month+
Prerequisites: Project 9 (sharding), strong comfort with networking/timeouts

Real world outcome

  • You can simulate partitions and show exactly when stale reads happen (and when they don’t).
  • You can expose a “consistency report” per operation: who answered, who didn’t, what got repaired.

Implementation Hints

  • Treat timeouts as first-class: they determine observed consistency.
  • Implement read repair: if a read sees divergent versions, fix stale replicas in the background (see the sketch below).
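
Coordinator-side bookkeeping for those hints, as a sketch: a consistency level maps to a required ack count, and a read returns the freshest version among responders while queueing repairs for replicas that answered with older versions. With W + R > N, read and write quorums intersect, which is what lets QUORUM reads observe the latest acknowledged write. The types and field names here are illustrative.

```go
// Quorum coordinator bookkeeping: required acks per consistency level, plus
// merging of replica responses with read-repair targets.
package quorum

type Level int

const (
	One Level = iota
	Quorum
	All
)

// required maps a consistency level to the number of replica acks needed
// out of n replicas (used for both reads and writes).
func required(level Level, n int) int {
	switch level {
	case One:
		return 1
	case All:
		return n
	default:
		return n/2 + 1
	}
}

type ReplicaRead struct {
	Replica string
	Version uint64
	Value   []byte
	OK      bool // false on timeout or error: timeouts shape observed consistency
}

// MergeReads returns the freshest value among successful responses, the
// replicas that returned stale versions (read-repair targets), and the ack
// count; the caller checks acks >= required(level, n).
func MergeReads(resps []ReplicaRead) (newest ReplicaRead, stale []string, acks int) {
	for _, r := range resps {
		if !r.OK {
			continue
		}
		acks++
		if !newest.OK || r.Version > newest.Version {
			newest = r
		}
	}
	for _, r := range resps {
		if r.OK && r.Version < newest.Version {
			stale = append(stale, r.Replica) // repair these in the background
		}
	}
	return newest, stale, acks
}
```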

Learning milestones

  1. Replication works → You understand redundancy mechanics
  2. Consistency levels behave predictably → You understand trade-offs precisely
  3. Repair converges → You understand “eventual” as an algorithm, not a slogan

Project 11: “Anti-Entropy Repair with Merkle Trees” — Convergence at Scale

| Attribute | Details |
|---|---|
| File | LEARN_NOSQL_DEEP_DIVE.md |
| Main Programming Language | Rust |
| Alternative Programming Languages | Go, Java, C++ |
| Coolness Level | Level 4: Hardcore Tech Flex |
| Business Potential | 3. The “Service & Support” Model |
| Difficulty | Level 4: Expert |
| Knowledge Area | Data Repair / Distributed Consistency |
| Software or Tool | Merkle trees, anti-entropy |
| Main Book | Designing Data-Intensive Applications by Martin Kleppmann |

What you’ll build: A background repair process that detects divergence between replicas using Merkle trees and syncs only the differing ranges.

Why it teaches NoSQL: Once you stop assuming perfect networks, you need systematic convergence. Dynamo uses a decentralized synchronization approach and describes Merkle trees for reconciliation.

Core challenges you’ll face:

  • Tree construction per range → maps to efficient summaries
  • Comparing trees to locate diffs → maps to logarithmic divergence detection
  • Bandwidth control → maps to repair storms and SLO risk
  • Repair correctness → maps to idempotent, retryable sync

Key Concepts

  • Merkle tree reconciliation: Dynamo paper
  • Background maintenance debt: compaction mindset applied to repair
  • Operational guardrails: rate limiting and scheduling

Difficulty: Expert
Time estimate: 1-2 weeks
Prerequisites: Project 10 recommended

Real world outcome

  • You can demonstrate: “replicas diverge under partition; repair restores convergence with bounded network use.”
  • You can show “bytes transferred vs naive full sync.”

Implementation Hints

  • Build trees over key ranges (or partitions).
  • Compare roots first; drill down only when hashes differ (see the sketch below).
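
A sketch of that top-down comparison, shown in Go rather than the project's Rust, assuming both replicas build trees of the same shape over the same key ranges:

```go
// Merkle comparison: hashes are compared top-down, and recursion only
// descends into subtrees whose hashes differ; matching subtrees are skipped
// entirely, which is where the bandwidth savings come from.
package repair

import "bytes"

type Node struct {
	Hash     []byte
	KeyRange [2]string // [start, end) of the keys summarized by this node
	Left     *Node     // nil for leaves
	Right    *Node
}

// DiffRanges returns the key ranges where two replicas' trees disagree;
// only those ranges need to be streamed during repair.
func DiffRanges(a, b *Node) [][2]string {
	if a == nil || b == nil {
		if a == nil && b == nil {
			return nil
		}
		if a != nil {
			return [][2]string{a.KeyRange} // one side is missing this subtree
		}
		return [][2]string{b.KeyRange}
	}
	if bytes.Equal(a.Hash, b.Hash) {
		return nil // identical subtrees: nothing to transfer
	}
	if a.Left == nil && a.Right == nil {
		return [][2]string{a.KeyRange} // differing leaf: sync just this range
	}
	return append(DiffRanges(a.Left, b.Left), DiffRanges(a.Right, b.Right)...)
}
```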

Learning milestones

  1. Repair detects divergence → You understand summary-based synchronization
  2. Repair is bandwidth-efficient → You understand why Merkle trees matter
  3. Repair is safe under retries → You understand distributed idempotency

Project 12: “CAP and Failure Lab” — Make Trade-offs Observable

| Attribute | Details |
|---|---|
| File | LEARN_NOSQL_DEEP_DIVE.md |
| Main Programming Language | Python |
| Alternative Programming Languages | Go, Rust, Java |
| Coolness Level | Level 3: Genuinely Clever |
| Business Potential | 1. The “Resume Gold” |
| Difficulty | Level 3: Advanced |
| Knowledge Area | Distributed Systems / Correctness |
| Software or Tool | Fault injection harness |
| Main Book | Designing Data-Intensive Applications by Martin Kleppmann |

What you’ll build: A harness that injects partitions, delay, drops, and node crashes into your Project 10 cluster; it outputs a timeline of anomalies (stale reads, write loss, unavailability).

Why it teaches NoSQL: CAP is not a slogan; it’s a set of failure-driven constraints. The formal result (Gilbert & Lynch) shows you cannot guarantee consistency and availability under partitions in an asynchronous network model.

Core challenges you’ll face:

  • Defining consistency checks → maps to linearizability vs eventual
  • Modeling availability → maps to what counts as a successful response
  • Reproducible chaos → maps to deterministic test traces
  • Interpreting anomalies → maps to debugging distributed behavior

Key Concepts

  • CAP impossibility proof framing: Gilbert & Lynch 2002
  • Consistency definitions: DDIA (models of consistency)
  • Fault injection as a method: systems testing discipline

Difficulty: Advanced
Time estimate: Weekend
Prerequisites: Any distributed prototype (Project 10 ideal)

Real world outcome

  • A report that says: “Under partition X, QUORUM reads avoid stale reads but writes become unavailable,” etc.
  • A library of “failure scenarios” you can replay.

Implementation Hints

  • Insert a controllable proxy between nodes; the proxy enforces drop/delay rules.
  • Build invariants: monotonic-read checks, read-your-writes checks, and “no lost acknowledged writes” (one such checker is sketched below).
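
One of those invariants as a concrete checker, "no lost acknowledged writes": the harness records every write the cluster acknowledged during the chaos run, then reads each key back at the end and flags missing values. Shown in Go rather than the project's Python; the history format is an assumption of this sketch.

```go
// Post-run invariant check: the newest acknowledged write per key must still
// be readable once the cluster has settled.
package chaos

type AckedWrite struct {
	Key     string
	Value   string
	Version uint64 // logical version returned in the acknowledgement
}

type Violation struct {
	Write AckedWrite
	Found string // value observed at the end of the run ("" if absent)
}

// CheckNoLostWrites compares acknowledged writes against the final read of
// each key. Only the newest acknowledged write per key is required to survive.
func CheckNoLostWrites(acked []AckedWrite, final map[string]string) []Violation {
	newest := map[string]AckedWrite{}
	for _, w := range acked {
		if cur, ok := newest[w.Key]; !ok || w.Version > cur.Version {
			newest[w.Key] = w
		}
	}
	var out []Violation
	for key, w := range newest {
		if got, ok := final[key]; !ok || got != w.Value {
			out = append(out, Violation{Write: w, Found: final[key]})
		}
	}
	return out
}
```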

Learning milestones

  1. You can reproduce anomalies → You stop treating distributed bugs as “random”
  2. You can explain CAP trade-offs concretely → You understand theory through behavior
  3. You can choose a design intentionally → You can justify NoSQL trade-offs

Project 13: “Raft-backed Metadata Service” — When You Need Strong Consistency

| Attribute | Details |
|---|---|
| File | LEARN_NOSQL_DEEP_DIVE.md |
| Main Programming Language | Go |
| Alternative Programming Languages | Rust, Java |
| Coolness Level | Level 5: Pure Magic (Super Cool) |
| Business Potential | 1. The “Resume Gold” |
| Difficulty | Level 4: Expert |
| Knowledge Area | Consensus / Cluster Coordination |
| Software or Tool | Raft consensus |
| Main Book | Designing Data-Intensive Applications by Martin Kleppmann |

What you’ll build: A small strongly-consistent metadata service for your cluster: membership, shard assignments, configuration versions—replicated using Raft.

Why it teaches NoSQL: Many “eventually consistent” stores still need a CP core for coordination (who owns what, what config is current). Raft decomposes consensus into leader election, log replication, and safety properties.

Core challenges you’ll face:

  • Leader election → maps to availability vs split brain
  • Replicated log → maps to state machine replication
  • Membership changes → maps to reconfiguration safety
  • Client semantics → maps to linearizable reads vs stale reads

Resources for key challenges:

  • “In Search of an Understandable Consensus Algorithm (Raft)” (Ongaro & Ousterhout)

Key Concepts

  • Replicated log: Raft paper
  • Leadership model: Raft paper
  • Why consensus is needed: DDIA (coordination chapter)

Difficulty: Expert
Time estimate: 1 month+
Prerequisites: Solid networking, comfort with concurrency, testing discipline

Real world outcome

  • You can kill leaders and show the cluster continues with a new leader and no divergent metadata.
  • You can show that shard maps are consistent across nodes at all times.

Implementation Hints

  • Implement Raft as a replicated log; apply committed entries to a deterministic state machine (the apply loop is sketched below).
  • Separate transport reliability (retries, timeouts) from Raft logic.
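
A sketch of just the apply side of that first hint: committed entries are applied in index order to a deterministic state machine, never past the commit index and never twice. Election and log replication are deliberately out of scope here, and the Entry and command formats are placeholders, not the Raft paper's wire format.

```go
// Applying committed Raft log entries to a deterministic state machine.
package metadata

type Entry struct {
	Index   uint64
	Term    uint64
	Command []byte // e.g. an encoded "assign shard 7 to node C" operation
}

type StateMachine interface {
	Apply(cmd []byte) // must be deterministic: same log => same state on every replica
}

type applier struct {
	lastApplied uint64
	sm          StateMachine
}

// applyUpTo applies every entry with lastApplied < Index <= commitIndex.
// Entries are expected in index order; a gap indicates a bug upstream.
func (a *applier) applyUpTo(commitIndex uint64, entries []Entry) {
	for _, e := range entries {
		if e.Index <= a.lastApplied || e.Index > commitIndex {
			continue // already applied, or not yet committed
		}
		a.sm.Apply(e.Command)
		a.lastApplied = e.Index
	}
}
```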

Learning milestones

  1. Leader election stabilizes → You understand coordination under failure
  2. Log replication is correct → You understand state machine replication
  3. Reconfiguration is safe → You can operate distributed systems without hand-waving

Project 14: “Storage Engine Introspection CLI” — Explain Your Database to Yourself

| Attribute | Details |
|---|---|
| File | LEARN_NOSQL_DEEP_DIVE.md |
| Main Programming Language | Python |
| Alternative Programming Languages | Go, Rust, Java |
| Coolness Level | Level 3: Genuinely Clever |
| Business Potential | 2. The “Micro-SaaS / Pro Tool” |
| Difficulty | Level 2: Intermediate |
| Knowledge Area | Observability / Debugging |
| Software or Tool | Internal metrics + “explain” tooling |
| Main Book | Designing Data-Intensive Applications by Martin Kleppmann |

What you’ll build: A CLI (or small web dashboard) that can introspect your DB: list SSTables, levels, compaction backlog, bloom filter stats, cache hit rates, replication lag, and repair status.

Why it teaches NoSQL: Real databases are operated, not just implemented. Observability turns “it’s slow” into “compaction debt is causing tail latency.”

Core challenges you’ll face:

  • Defining meaningful metrics → maps to operability and SLOs
  • Tracing read/write paths → maps to finding amplification sources
  • Surfacing invariants → maps to detecting corruption or divergence early
  • Usable UX → maps to making systems debuggable

Key Concepts

  • Operational visibility: DDIA (operational concerns)
  • Amplification metrics: compaction and caching economics
  • Failure signals: replication lag, repair backlog

Difficulty: Intermediate
Time estimate: Weekend
Prerequisites: Any prior project (works best after 1–4 and 10)

Real world outcome

  • You can answer: “Why is this query slow?” using your own “explain” output.
  • You can watch compaction debt rise and fall during load tests.

Implementation Hints

  • Treat internal state as a “system catalog” you can query.
  • Include human-first summaries (top 3 reasons for latency) plus raw numbers.

Learning milestones

  1. Metrics are accurate → You trust your system’s introspection
  2. You can predict incidents → You understand leading indicators
  3. You can tune with confidence → You operate, not guess

Project Comparison Table

| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| WAL + Memtable KV | Intermediate | 1-2 weeks | High | Medium |
| SSTable Builder | Advanced | 1-2 weeks | Very High | Medium |
| Compaction Simulator | Advanced | Weekend | High | High |
| Bloom + Cache | Advanced | 1-2 weeks | High | Medium |
| MVCC Snapshots | Expert | 1 month+ | Very High | Medium |
| Secondary Indexes | Advanced | 1-2 weeks | High | Medium |
| Document Layer | Intermediate | 1-2 weeks | Medium | Medium |
| Mini-Bigtable | Expert | 1 month+ | Very High | High |
| Consistent Hash Sharder | Advanced | 1-2 weeks | High | High |
| Replication + Quorums | Expert | 1 month+ | Very High | High |
| Merkle Repair | Expert | 1-2 weeks | High | Medium |
| CAP Failure Lab | Advanced | Weekend | High | High |
| Raft Metadata Service | Expert | 1 month+ | Very High | High |
| Introspection CLI | Intermediate | Weekend | Medium | Medium |

Recommendation: Where to Start (and why)

If your goal is to understand how NoSQL works internally (not just “use Mongo/Cassandra”), start with:

1) Project 1 (WAL + memtable) → because it teaches durability and crash semantics first.
2) Project 2 (SSTables) → because it forces you to confront on-disk layout and the read path.
3) Project 3 (Compaction simulator) → because it makes the main LSM trade-offs measurable.
4) Then choose a branch:

  • Distributed track: Projects 9 → 10 → 11 → 12
  • Data-model track: Projects 7 → 6 → 8
  • Strong-consistency core: Project 13 (only after you’ve felt failure modes)

Project 15: “Mini-Dynamo + LSM” — A Real NoSQL Database (Capstone)

| Attribute | Details |
|---|---|
| File | LEARN_NOSQL_DEEP_DIVE.md |
| Main Programming Language | Go |
| Alternative Programming Languages | Rust, Java |
| Coolness Level | Level 5: Pure Magic (Super Cool) |
| Business Potential | 4. The “Open Core” Infrastructure |
| Difficulty | Level 5: Master |
| Knowledge Area | Distributed Storage Systems |
| Software or Tool | LSM storage engine + replication + observability |
| Main Book | Designing Data-Intensive Applications by Martin Kleppmann |

What you’ll build: A distributed NoSQL database with an LSM-tree storage engine, sharding via consistent hashing, replication with tunable consistency, background repair, and an operator-friendly introspection console.

Why it teaches NoSQL: This is the “minimum complete” set of mechanisms behind Dynamo/Cassandra-like systems (partitioning + replication + reconciliation) combined with modern LSM storage foundations.

Core challenges you’ll face:

  • End-to-end correctness (ack semantics, crash safety) → maps to WAL + recovery
  • Operational stability (compaction/repair debt) → maps to background work management
  • Consistency controls (client-visible guarantees) → maps to quorum design
  • Reconfiguration (adding nodes safely) → maps to metadata coordination (optional Raft)

Key Concepts

  • Highly available KV patterns: Dynamo paper
  • Compaction economics: RocksDB compaction model
  • Consistency trade-offs under partitions: Gilbert & Lynch (CAP)

Difficulty: Master
Time estimate: 1 month+ (expect multiple iterations)
Prerequisites: Projects 1–4 plus 9–12 (Project 13 optional but valuable)

Real world outcome

  • A 3–5 node cluster that can:
    • survive node crashes,
    • rebalance keys when nodes join,
    • demonstrate consistency levels under partitions,
    • and show internal health via a self-hosted dashboard.

Implementation Hints

  • Make every subsystem observable: compaction backlog, repair backlog, and replication lag.
  • Define the API surface narrowly (KV or document-lite). Query power can come later.
  • Use chaos scenarios (Project 12) as your “definition of done.”

Learning milestones

  1. Single-node engine is durable and fast → You understand storage engines deeply
  2. Cluster is correct under failures → You understand replication and partitions
  3. You can operate it with confidence → You’ve internalized NoSQL as an operational system

Summary

| # | Project Name | Main Programming Language |
|---|---|---|
| 1 | WAL + Memtable Key-Value Store | Go |
| 2 | SSTable Builder + Immutable Runs | Go |
| 3 | LSM Compaction Simulator | Python |
| 4 | Bloom Filters + Block Cache | Go |
| 5 | MVCC Snapshots | Rust |
| 6 | Secondary Indexes | Go |
| 7 | Document Store Layer | TypeScript |
| 8 | Wide-Column Mini-Bigtable | Java |
| 9 | Consistent Hashing Sharder | Go |
| 10 | Replication + Tunable Consistency | Go |
| 11 | Anti-Entropy Repair with Merkle Trees | Rust |
| 12 | CAP and Failure Lab | Python |
| 13 | Raft-backed Metadata Service | Go |
| 14 | Storage Engine Introspection CLI | Python |
| 15 | Capstone: Mini-Dynamo + LSM | Go |