LEARN NOSQL DEEP DIVE
Learn NoSQL Databases: From Zero to “I Can Build One”
Goal: Deeply understand how NoSQL databases work—by building the essential components yourself: storage engines, indexes, compaction, partitioning, replication, consistency controls, and operational tooling.
Why NoSQL Exists (and what you’re really learning)
Relational databases are optimized around schemas, joins, and ACID transactions. NoSQL systems exist because many real workloads need different trade-offs:
- Scale-out across many machines (partitioning/sharding)
- High write throughput with predictable latencies (log-structured storage, append-first designs)
- High availability under failures and partitions (replication + tunable consistency)
- Flexible data models (document, key-value, wide-column, graph)
- Operational primitives (online rebalancing, rolling upgrades, multi-DC)
After these projects, you should be able to:
- Explain a NoSQL system’s read path and write path at the “bytes on disk + messages on wire” level
- Reason precisely about consistency (quorum, leader-based, eventual, causal)
- Diagnose performance problems like write amplification, read amplification, and hot partitions
- Build and operate a small but real distributed NoSQL store
Core Concept Analysis
1) Data models and query models
- Key-Value: primary-key access, minimal query surface
- Document: JSON-like documents, secondary indexes, partial updates
- Wide-Column (Bigtable-style): rows + column families + timestamps
- Graph: nodes and edges, traversals, index-free adjacency
- Time-series: append-heavy, compression, downsampling
2) Storage engine fundamentals (single-node)
- Write path: WAL → in-memory structure (memtable) → immutable runs (SSTables)
- Read path: memtable → block index → bloom filter → SSTable scans → merge results
- Compaction: turning many runs into fewer runs (and deleting tombstones)
- Recovery: replay WAL / restore last checkpoint
- Indexes: primary + secondary; trade-offs between write cost and query power
3) Distribution and resilience
- Partitioning: hashing, consistent hashing rings, range partitioning
- Replication: leader/follower vs multi-leader vs leaderless
- Consistency: quorum reads/writes, causal consistency, read-your-writes, linearizability
- Membership: failure detection, gossip, rebalancing
- Repair: anti-entropy, Merkle trees, hinted handoff
4) Operations (what makes it “real”)
- Observability: metrics, tracing, compaction dashboards
- Backups/snapshots
- Schema evolution & migrations (even in “schemaless” systems)
- SLOs and tail latency, overload control, admission control
Project List
All projects below are written into: LEARN_NOSQL_DEEP_DIVE.md
Project 1: WAL + Memtable Key-Value Store (The Write Path)
- File: LEARN_NOSQL_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, Java, C++
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Storage Engines / Durability
- Software or Tool: Your own write-ahead log (WAL), fsync semantics
- Main Book: Designing Data-Intensive Applications by Martin Kleppmann
What you’ll build: A local key-value database that supports put/get/delete, persists via a write-ahead log, and recovers correctly after crashes.
Why it teaches NoSQL: Most NoSQL systems start with an append-first durability story. WAL + in-memory state is the core of nearly every storage engine.
Core challenges you’ll face:
- Defining a record format (checksums, length-prefixing) → maps to binary protocols & corruption detection
- Durability boundaries (fsync rules) → maps to what “committed” really means
- Crash recovery (replay log into memtable) → maps to idempotency and ordering
- Tombstones (delete markers) → maps to deletion in append-only stores
Key Concepts
- WAL & durability: DDIA (Storage & Retrieval)
- Crash recovery invariants: Operating Systems: Three Easy Pieces (Persistence / FS)
- Checksums & torn writes: Computer Systems: A Programmer’s Perspective (I/O + representation)
Difficulty: Intermediate
Time estimate: 1-2 weeks
Prerequisites: Basic file I/O, basic data structures, comfort with unit tests
Real world outcome
- You can demonstrate: “write 10k keys → force-kill process → restart → all committed keys are present; uncommitted keys are absent.”
- You can show a human-readable “log dump” that replays operations and highlights corruption.
Implementation Hints
- Treat the WAL as the source of truth; the memtable is a cache.
- Define a strict ordering: write bytes → flush → (optionally) fsync → then acknowledge.
- Include a checksum per record; on recovery, stop at first invalid record.
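A minimal Go sketch of the hints above, assuming a [length][crc32][payload] record layout; the names `AppendRecord` and `ReadRecords` are illustrative, not a prescribed API.

```go
// Package wal: a sketch of length-prefixed, checksummed WAL records.
package wal

import (
	"bufio"
	"encoding/binary"
	"errors"
	"hash/crc32"
	"io"
)

// AppendRecord writes one record as [length uint32][crc32 uint32][payload].
// The caller decides when to flush and fsync before acknowledging the write.
func AppendRecord(w io.Writer, payload []byte) error {
	var hdr [8]byte
	binary.LittleEndian.PutUint32(hdr[0:4], uint32(len(payload)))
	binary.LittleEndian.PutUint32(hdr[4:8], crc32.ChecksumIEEE(payload))
	if _, err := w.Write(hdr[:]); err != nil {
		return err
	}
	_, err := w.Write(payload)
	return err
}

// ReadRecords replays records into apply(), stopping at EOF or at the first
// torn or corrupt record, which is treated as the end of the log.
func ReadRecords(r io.Reader, apply func(payload []byte)) error {
	br := bufio.NewReader(r)
	for {
		var hdr [8]byte
		if _, err := io.ReadFull(br, hdr[:]); err != nil {
			if errors.Is(err, io.EOF) || errors.Is(err, io.ErrUnexpectedEOF) {
				return nil // clean end or torn header: stop replay here
			}
			return err
		}
		n := binary.LittleEndian.Uint32(hdr[0:4])
		want := binary.LittleEndian.Uint32(hdr[4:8])
		payload := make([]byte, n)
		if _, err := io.ReadFull(br, payload); err != nil {
			return nil // torn record at the tail
		}
		if crc32.ChecksumIEEE(payload) != want {
			return nil // checksum mismatch: stop at first invalid record
		}
		apply(payload)
	}
}
```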
Learning milestones
- Log replay restores state → You understand durability as “replayable history”
- Crashes don’t corrupt state → You understand atomicity boundaries
- Deletes work via tombstones → You understand why compaction is needed later
Project 2: SSTable Builder + Immutable Sorted Runs (The Read Path Starts)
- File: LEARN_NOSQL_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, Java, C++
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Storage Engines / On-disk Layout
- Software or Tool: SSTable format (your own)
- Main Book: Designing Data-Intensive Applications by Martin Kleppmann
What you’ll build: Convert a memtable into an immutable, sorted on-disk file (SSTable) with a sparse index and block structure, then support reads from SSTables.
Why it teaches NoSQL: LSM-based NoSQL engines store data as immutable sorted runs; everything else (bloom filters, compaction, cache) builds on this.
Core challenges you’ll face:
- Sorted encoding (key order, prefix compression) → maps to disk-efficient representations
- Block index design (sparse vs dense) → maps to IO vs CPU trade-offs
- Merging sources (memtable + multiple SSTables) → maps to multi-way merge logic
- Tombstone semantics → maps to visibility rules across runs
Key Concepts
- SSTables and runs: DDIA (LSM trees)
- Binary search & block indexes: Algorithms, 4th Edition (searching)
- Compression trade-offs: MongoDB/WiredTiger “compression + checkpoints” concepts (see Project 5 resources)
Difficulty: Advanced
Time estimate: 1-2 weeks
Prerequisites: Project 1; familiarity with sorting and file formats
Real world outcome
- You can generate SSTables and prove reads work without loading full datasets into RAM.
- You can show “read amplification” by printing which blocks/files were touched per query.
Implementation Hints
- Use fixed-size blocks; each block contains a sequence of key/value entries.
- A sparse index maps “first key in block” → file offset.
- Reads: seek to candidate block(s), then scan inside a block.
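A minimal Go sketch of the sparse index described above; `IndexEntry` and `CandidateBlock` are illustrative names, and the index is assumed to be sorted because SSTables are written in key order.

```go
// Package sstable: a sketch of a sparse block index for an SSTable reader.
package sstable

import "sort"

// IndexEntry maps the first key stored in a block to that block's file offset.
type IndexEntry struct {
	FirstKey string
	Offset   int64
}

// SparseIndex holds one entry per block, sorted by FirstKey.
type SparseIndex []IndexEntry

// CandidateBlock returns the offset of the only block that could contain key,
// or false if key sorts before the table's first block.
func (idx SparseIndex) CandidateBlock(key string) (int64, bool) {
	// Find the first block whose FirstKey is greater than key, then step back:
	// the previous block covers the range [FirstKey, nextFirstKey).
	i := sort.Search(len(idx), func(i int) bool { return idx[i].FirstKey > key })
	if i == 0 {
		return 0, false
	}
	return idx[i-1].Offset, true
}
```

The candidate block still has to be read and scanned; that residual work is exactly the read amplification the project asks you to print per query.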
Learning milestones
- Immutable files enable simple recovery → You see why append + immutability is powerful
- Sparse index works → You understand storage layouts for fast lookups
- Multi-run reads work → You internalize the merge-based read path
Project 3: LSM Compaction Simulator (Make Write Amplification Visible)
- File: LEARN_NOSQL_DEEP_DIVE.md
- Main Programming Language: Python
- Alternative Programming Languages: Go, Rust, Java
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: Storage Engines / Compaction Economics
- Software or Tool: RocksDB compaction mental model
- Main Book: Designing Data-Intensive Applications by Martin Kleppmann
What you’ll build: A simulator that models leveled vs tiered compaction, producing graphs/tables for write amplification, space amplification, and read amplification.
Why it teaches NoSQL: Compaction is where many NoSQL performance mysteries live: stalls, SSD wear, tail latency spikes, and “why is my disk 10× busier than my workload?”
Core challenges you’ll face:
- Modeling levels and fanout → maps to how LSM trees scale
- Modeling tombstones/TTL → maps to deletes and time-based data
- Compaction scheduling → maps to background work vs foreground latency
- Hot partitions and skew → maps to real-world workload behavior
Resources for key challenges:
- RocksDB compaction concepts (leveled vs tiered, write amplification) — RocksDB Wiki (Compaction)
Key Concepts
- Leveled vs tiered compaction: RocksDB documentation
- Write amplification mechanics: LSM tree paper lineage
- Operational tuning mindset: DDIA (storage engines in practice)
Difficulty: Advanced
Time estimate: Weekend
Prerequisites: Comfort with basic probability and plotting; Projects 1–2 helpful
Real world outcome
- You can answer: “Given workload X and configuration Y, why does the system rewrite Z bytes?”
- You can produce a “compaction budget” report that predicts stall risk.
Implementation Hints
- Model each run with size, key-range overlap, and tombstone fraction.
- Track bytes written per compaction and divide by bytes of user writes.
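A minimal sketch of that bookkeeping, in Go for consistency with the other examples even though the project suggests Python; the struct and method names are assumptions.

```go
// Package compactionsim: track user bytes vs compaction bytes in a simulator.
package compactionsim

// Stats accumulates bytes written as the simulated LSM runs.
type Stats struct {
	UserBytes       int64 // bytes of user writes flushed from the memtable
	CompactionBytes int64 // bytes rewritten by background compactions
}

// RecordFlush counts a memtable flush: user data reaching disk the first time.
func (s *Stats) RecordFlush(bytes int64) { s.UserBytes += bytes }

// RecordCompaction counts every byte a compaction job writes back out.
func (s *Stats) RecordCompaction(bytesOut int64) { s.CompactionBytes += bytesOut }

// WriteAmplification is total bytes written to disk divided by user bytes.
func (s *Stats) WriteAmplification() float64 {
	if s.UserBytes == 0 {
		return 0
	}
	return float64(s.UserBytes+s.CompactionBytes) / float64(s.UserBytes)
}
```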
Learning milestones
- You can compute write amplification → You understand why LSMs trade write cost for throughput
- You can explain stalls → You understand background debt as a first-class system constraint
- You can reason about tuning → You can predict outcomes of fanout/level changes
Project 4: Bloom Filters + Block Cache (Make Reads Predictable)
- File: LEARN_NOSQL_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, Java, C++
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: Indexing / Performance Engineering
- Software or Tool: Bloom filters, LRU cache
- Main Book: Designing Data-Intensive Applications by Martin Kleppmann
What you’ll build: Add bloom filters to SSTables and implement a block cache with instrumentation (hit rate, bytes saved, latency percentiles).
Why it teaches NoSQL: Modern NoSQL engines are defined as much by performance structures (filters, caches) as by correctness.
Core challenges you’ll face:
- False positives → maps to probabilistic data structures
- Cache eviction policy → maps to tail latency control
- Read amplification visibility → maps to measuring what your system does
- Interplay with compaction → maps to why files move and caches churn
Key Concepts
- Bloom filters: canonical probability trade-off (false positive tuning)
- Caching strategies: DDIA (cache + storage)
- Latency measurement: “percentiles over averages” mindset
Difficulty: Advanced
Time estimate: 1-2 weeks
Prerequisites: Projects 1–2 (or any SSTable reader)
Real world outcome
- You can show: “same dataset, same queries; bloom filters cut disk reads by N%.”
- You can demonstrate “cache thrash” scenarios and mitigation.
Implementation Hints
- Keep bloom filter per SSTable (or per block) and record filter checks per query.
- Measure: blocks touched, bytes read from disk, and cache hit ratio.
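A minimal Go sketch of a per-SSTable bloom filter using double hashing over FNV-1a; the bit count and hash count are illustrative, not tuned values.

```go
// Package bloom: a sketch of a bloom filter for SSTable reads.
package bloom

import "hash/fnv"

type Filter struct {
	bits []uint64
	m    uint64 // number of bits
	k    uint64 // number of probes per key
}

func New(mBits, k uint64) *Filter {
	return &Filter{bits: make([]uint64, (mBits+63)/64), m: mBits, k: k}
}

// hashes derives two base hashes from one FNV-1a pass (double hashing).
func hashes(key []byte) (uint64, uint64) {
	h := fnv.New64a()
	h.Write(key)
	h1 := h.Sum64()
	h2 := h1>>33 | h1<<31
	return h1, h2 | 1 // make h2 odd so probes spread over all bits
}

// Add sets k bits for the key.
func (f *Filter) Add(key []byte) {
	h1, h2 := hashes(key)
	for i := uint64(0); i < f.k; i++ {
		bit := (h1 + i*h2) % f.m
		f.bits[bit/64] |= 1 << (bit % 64)
	}
}

// MayContain returns false only if the key was definitely never added; a true
// result can be a false positive, which is why correctness never depends on it.
func (f *Filter) MayContain(key []byte) bool {
	h1, h2 := hashes(key)
	for i := uint64(0); i < f.k; i++ {
		bit := (h1 + i*h2) % f.m
		if f.bits[bit/64]&(1<<(bit%64)) == 0 {
			return false
		}
	}
	return true
}
```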
Learning milestones
- Reads get faster without correctness risk → You understand probabilistic acceleration
- You can explain p95 spikes → You understand cache miss storms
- You can design dashboards → You think like a database operator
Project 5: MVCC Snapshots (Point-in-Time Reads Without Locks)
- File: LEARN_NOSQL_DEEP_DIVE.md
- Main Programming Language: Rust
- Alternative Programming Languages: Go, Java, C++
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 4: Expert
- Knowledge Area: Concurrency Control / Storage Semantics
- Software or Tool: MVCC + snapshot reads
- Main Book: Designing Data-Intensive Applications by Martin Kleppmann
What you’ll build: Add MVCC so concurrent operations can see consistent snapshots, enabling “read without blocking writes” with versioned values and garbage collection.
Why it teaches NoSQL: Many NoSQL/document stores rely on snapshot semantics (and it’s the conceptual foundation for consistent reads, change streams, and long-running queries).
Core challenges you’ll face:
- Version chains → maps to data visibility rules
- Snapshot timestamps → maps to logical time vs wall time
- Garbage collection → maps to compaction + retention
- Atomic compare-and-set semantics → maps to concurrency primitives
Resources for key challenges:
- WiredTiger snapshot + checkpoint descriptions (MongoDB documentation)
Key Concepts
- Snapshot isolation vs linearizability: DDIA
- MVCC mechanics: WiredTiger concepts
- GC and retention windows: operational trade-offs in MVCC
Difficulty: Expert
Time estimate: 1 month+
Prerequisites: Comfortable with concurrency; Projects 1–2 help significantly
Real world outcome
- You can run a workload where writers run continuously while readers get repeatable snapshots.
- You can show “old version accumulation” and explain how GC/compaction resolves it.
Implementation Hints
- Assign a monotonically increasing logical version to writes.
- Readers pick a snapshot version and only read versions ≤ snapshot.
- Track the oldest active snapshot to know which versions can be reclaimed.
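A minimal sketch of those visibility rules, in Go rather than the project's suggested Rust, and without any locking; a real engine needs concurrency control and GC around this core. All names are illustrative.

```go
// Package mvcc: version chains with snapshot visibility.
package mvcc

type version struct {
	seq       uint64 // logical version assigned at write time
	value     []byte
	tombstone bool
}

// Store keeps, per key, versions ordered oldest to newest.
type Store struct {
	data    map[string][]version
	nextSeq uint64
}

func New() *Store { return &Store{data: map[string][]version{}, nextSeq: 1} }

// Put appends a new version and returns its sequence number.
func (s *Store) Put(key string, value []byte) uint64 {
	seq := s.nextSeq
	s.nextSeq++
	s.data[key] = append(s.data[key], version{seq: seq, value: value})
	return seq
}

// Delete appends a tombstone version instead of removing data in place.
func (s *Store) Delete(key string) uint64 {
	seq := s.nextSeq
	s.nextSeq++
	s.data[key] = append(s.data[key], version{seq: seq, tombstone: true})
	return seq
}

// Snapshot returns the current high-water mark; reads at this snapshot stay
// repeatable no matter what later writes do.
func (s *Store) Snapshot() uint64 { return s.nextSeq - 1 }

// GetAt returns the newest version with seq <= snapshot, unless it is a tombstone.
func (s *Store) GetAt(key string, snapshot uint64) ([]byte, bool) {
	chain := s.data[key]
	for i := len(chain) - 1; i >= 0; i-- { // newest first
		if chain[i].seq <= snapshot {
			if chain[i].tombstone {
				return nil, false
			}
			return chain[i].value, true
		}
	}
	return nil, false
}
```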
Learning milestones
- Snapshot reads are correct → You understand isolation as a visibility rule
- Writers don’t block readers → You understand why MVCC is popular
- GC is safe and bounded → You understand how correctness meets operations
Project 6: Secondary Indexes (The Hidden Cost of “Flexible Queries”)
- File: LEARN_NOSQL_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, Java, C++
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Indexing / Querying
- Software or Tool: Inverted index / B-tree vs LSM index
- Main Book: Designing Data-Intensive Applications by Martin Kleppmann
What you’ll build: Implement secondary indexes (exact-match and range) and quantify their write amplification and consistency challenges.
Why it teaches NoSQL: “NoSQL is flexible” ends the moment you need indexes at scale. This project reveals why many systems restrict query patterns.
Core challenges you’ll face:
- Index maintenance on updates → maps to write fan-out
- Consistency between primary and index → maps to atomicity without transactions
- Backfilling indexes → maps to online migrations
- Multi-valued fields → maps to document modeling trade-offs
Key Concepts
- Index maintenance vs query power: DDIA
- Backfill strategies: operational database thinking
- Idempotent index updates: recovery + retries
Difficulty: Advanced
Time estimate: 1-2 weeks
Prerequisites: Projects 1–2 or any persistent KV store
Real world outcome
- You can show: “query by field X returns correct results,” and “index build can run online.”
- You can produce a report: “write cost increased by Y% due to index maintenance.”
Implementation Hints
- Treat indexes as separate keyspaces: (index_key → primary_key list).
- Define explicit update rules for modifications and deletes.
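A minimal Go sketch of the "index as a separate keyspace" idea, using a composite key of field, value, and primary key; the in-memory map stands in for a sorted keyspace (which is what actually makes range scans cheap), and every name is illustrative.

```go
// Package secidx: secondary index entries stored as composite keys.
package secidx

import "strings"

const sep = "\x00" // separator that sorts before any printable key byte

// IndexKey builds the key written into the index keyspace.
func IndexKey(field, value, primaryKey string) string {
	return field + "=" + value + sep + primaryKey
}

// Update removes the stale entry (if the field changed) and writes the new
// one. Both steps must be idempotent so retries after a crash are safe.
func Update(index map[string]struct{}, field, oldValue, newValue, pk string) {
	if oldValue != "" {
		delete(index, IndexKey(field, oldValue, pk))
	}
	if newValue != "" {
		index[IndexKey(field, newValue, pk)] = struct{}{}
	}
}

// Lookup returns the primary keys of all rows where field == value. A real
// store would do a prefix range scan instead of iterating everything.
func Lookup(index map[string]struct{}, field, value string) []string {
	prefix := field + "=" + value + sep
	var pks []string
	for k := range index {
		if strings.HasPrefix(k, prefix) {
			pks = append(pks, strings.TrimPrefix(k, prefix))
		}
	}
	return pks
}
```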
Learning milestones
- Indexes work → You understand query acceleration structures
- You measure write fan-out → You understand why indexing is expensive
- Online backfill works → You understand operational reality
Project 7: Document Store Layer (JSON Documents on Top of KV)
- File: LEARN_NOSQL_DEEP_DIVE.md
- Main Programming Language: TypeScript
- Alternative Programming Languages: Go, Python, Java
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Document Model / API Design
- Software or Tool: JSON Patch semantics, schema-less pitfalls
- Main Book: Designing Data-Intensive Applications by Martin Kleppmann
What you’ll build: A document database API: insert/update/partial-update, basic validation, and optional secondary indexes for top-level fields.
Why it teaches NoSQL: The “document” model is mostly an API and modeling discipline layered over a KV engine. You’ll see the gap between “schemaless” and “unstructured chaos.”
Core challenges you’ll face:
- Document identity → maps to primary key strategy
- Partial updates → maps to write amplification at document granularity
- Validation & schema evolution → maps to keeping systems operable
- Indexing specific fields → maps to real query needs
Key Concepts
- Document vs relational trade-offs: DDIA
- Schema-on-write vs schema-on-read: modeling discipline
- Update granularity: why big documents hurt writes
Difficulty: Intermediate
Time estimate: 1-2 weeks
Prerequisites: JSON familiarity; basic API design
Real world outcome
- A small REST or CLI interface where you can store and retrieve documents and run a few indexed queries.
- Demonstrable “migration” of a field from one shape to another without breaking reads.
Implementation Hints
- Store document blobs as values; keep indexes separate.
- Add “document version” metadata for conflict detection later.
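A minimal sketch of that layout, in Go rather than the project's suggested TypeScript: documents are JSON blobs keyed by _id and carry a version counter for later conflict detection. `DocStore` and its methods are assumed names over an in-memory stand-in for your KV engine.

```go
// Package docstore: JSON documents layered over a key-value store.
package docstore

import "encoding/json"

type Document struct {
	ID      string                 `json:"_id"`
	Version uint64                 `json:"_version"`
	Fields  map[string]interface{} `json:"fields"`
}

// DocStore keeps encoded documents in a map standing in for the KV engine.
type DocStore struct{ kv map[string][]byte }

func New() *DocStore { return &DocStore{kv: map[string][]byte{}} }

// Get decodes the stored blob back into a Document.
func (d *DocStore) Get(id string) (Document, bool) {
	blob, ok := d.kv[id]
	if !ok {
		return Document{}, false
	}
	var doc Document
	if err := json.Unmarshal(blob, &doc); err != nil {
		return Document{}, false
	}
	return doc, true
}

// Put stores the document, bumping the version past whatever is already stored.
func (d *DocStore) Put(doc Document) error {
	if prev, ok := d.Get(doc.ID); ok {
		doc.Version = prev.Version + 1
	} else {
		doc.Version = 1
	}
	blob, err := json.Marshal(doc)
	if err != nil {
		return err
	}
	d.kv[doc.ID] = blob
	return nil
}
```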
Learning milestones
- CRUD works reliably → You understand the document layer
- Partial updates are correct → You understand update semantics and their cost
- Schema evolution is managed → You think like an operator, not a tutorial
Project 8: Wide-Column Mini-Bigtable (Rows, Column Families, Timestamps)
- File: LEARN_NOSQL_DEEP_DIVE.md
- Main Programming Language: Java
- Alternative Programming Languages: Go, Rust, C++
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 4: Expert
- Knowledge Area: Data Modeling / Storage Layout
- Software or Tool: Bigtable mental model
- Main Book: Designing Data-Intensive Applications by Martin Kleppmann
What you’ll build: A wide-column store that supports (row key, column family, column qualifier, timestamp) as the addressing model, with range scans by row key.
Why it teaches NoSQL: Bigtable-style systems show how “NoSQL” can still be structured: sorted by row key, versioned by timestamp, grouped into families for IO locality.
Core challenges you’ll face:
- Row-key ordering and range scans → maps to sorted storage & hot-spot risks
- Column-family locality → maps to IO patterns and data layout
- Time-versioned cells → maps to retention policies and GC
- Sparse data → maps to why wide-column exists
Resources for key challenges:
- “Bigtable: A Distributed Storage System for Structured Data” (OSDI 2006)
Key Concepts
- Wide-column model: Bigtable paper
- Sorted tablets and partitioning: Bigtable paper
- Version retention: storage GC patterns
Difficulty: Expert
Time estimate: 1 month+
Prerequisites: Storage fundamentals (Projects 1–2 recommended)
Real world outcome
- You can store time-series-like values per row and retrieve “latest N versions.”
- You can run range scans and show how row-key design affects hotspots.
Implementation Hints
- Physically cluster by row key first, then family, then qualifier.
- Implement a retention policy: “keep last K versions” or “keep last T time.”
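A minimal sketch of the addressing model, in Go rather than the project's suggested Java: cells are keyed by (row, family, qualifier), versions are kept newest-first, and retention trims old versions. Names and the in-memory layout are illustrative.

```go
// Package widecolumn: (row, family, qualifier, timestamp) cell storage.
package widecolumn

import "sort"

type Cell struct {
	Timestamp int64
	Value     []byte
}

type cellKey struct{ row, family, qualifier string }

type Table struct{ cells map[cellKey][]Cell }

func New() *Table { return &Table{cells: map[cellKey][]Cell{}} }

// Put inserts a cell version and keeps the chain sorted newest-first.
func (t *Table) Put(row, family, qualifier string, ts int64, value []byte) {
	k := cellKey{row, family, qualifier}
	versions := append(t.cells[k], Cell{Timestamp: ts, Value: value})
	sort.Slice(versions, func(i, j int) bool {
		return versions[i].Timestamp > versions[j].Timestamp
	})
	t.cells[k] = versions
}

// Latest returns up to n of the most recent versions of one cell.
func (t *Table) Latest(row, family, qualifier string, n int) []Cell {
	versions := t.cells[cellKey{row, family, qualifier}]
	if len(versions) > n {
		versions = versions[:n]
	}
	return versions
}

// Retain enforces a "keep last K versions" policy, the GC side of versioning.
func (t *Table) Retain(k int) {
	for key, versions := range t.cells {
		if len(versions) > k {
			t.cells[key] = versions[:k]
		}
	}
}
```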
Learning milestones
- Range scans are efficient → You understand why sorted order matters
- Versioning is correct → You understand time-based storage semantics
- You can explain hotspotting → You think like a Bigtable user
Project 9: Consistent Hashing Sharder (From Single Node to Many)
- File: LEARN_NOSQL_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, Java, Python
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Distributed Systems / Partitioning
- Software or Tool: Consistent hashing ring
- Main Book: Designing Data-Intensive Applications by Martin Kleppmann
What you’ll build: A sharding layer using consistent hashing with virtual nodes, rebalancing, and a “shard map” tool that explains where each key lives.
Why it teaches NoSQL: Many NoSQL systems partition data via consistent hashing to enable incremental scaling and rebalancing. Dynamo popularized this approach for highly available key-value systems.
Core challenges you’ll face:
- Virtual nodes → maps to load distribution vs operational complexity
- Rebalancing while serving traffic → maps to moving data safely
- Hot key detection → maps to real workload skew
- Key design constraints → maps to partition keys as product decisions
Resources for key challenges:
- “Dynamo: Amazon’s Highly Available Key-value Store” (SOSP 2007)
Key Concepts
- Consistent hashing and replication: Dynamo paper
- Operational rebalancing: DDIA (partitioning chapter)
- Hot partitions: practical performance thinking
Difficulty: Advanced
Time estimate: 1-2 weeks
Prerequisites: Networking basics; Projects 1–2 helpful
Real world outcome
- A mini cluster can add/remove nodes and show how many keys moved.
- A “trace” view shows for a key: hash → ring position → node set.
Implementation Hints
- Keep membership info separate from data storage.
- Make rebalancing observable: moved keys, bytes moved, and time.
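A minimal Go sketch of a ring with virtual nodes; the hash function, vnode count, and names (`Ring`, `AddNode`, `Lookup`) are illustrative choices, not requirements.

```go
// Package ring: consistent hashing with virtual nodes.
package ring

import (
	"fmt"
	"hash/fnv"
	"sort"
)

type Ring struct {
	points []uint64          // sorted vnode positions on the ring
	owner  map[uint64]string // vnode position -> physical node
	vnodes int               // virtual nodes per physical node
}

func New(vnodes int) *Ring {
	return &Ring{owner: map[uint64]string{}, vnodes: vnodes}
}

func hashKey(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	return h.Sum64()
}

// AddNode places vnodes replicas of the node around the ring.
func (r *Ring) AddNode(node string) {
	for i := 0; i < r.vnodes; i++ {
		p := hashKey(fmt.Sprintf("%s#%d", node, i))
		r.owner[p] = node
		r.points = append(r.points, p)
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
}

// Lookup returns the node owning a key: the first vnode clockwise from its hash.
func (r *Ring) Lookup(key string) string {
	if len(r.points) == 0 {
		return ""
	}
	h := hashKey(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}
```

For a shard-map tool, the trace for a key is exactly this path: hash the key, find the clockwise vnode, then walk further clockwise to collect the replica set.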
Learning milestones
- Keys distribute evenly → You understand hashing trade-offs
- Scaling changes only part of the keyspace → You understand why consistent hashing matters
- You can debug hotspots → You understand production pain points
Project 10: Replication + Tunable Consistency (Quorums You Can Feel)
- File: LEARN_NOSQL_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, Java, Erlang/Elixir
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 4: Expert
- Knowledge Area: Replication / Consistency
- Software or Tool: Quorum reads/writes
- Main Book: Designing Data-Intensive Applications by Martin Kleppmann
What you’ll build: Replicate each shard across N nodes and support per-operation consistency levels (e.g., “ONE”, “QUORUM”, “ALL”), including read repair.
Why it teaches NoSQL: This is the heart of Dynamo-style and Cassandra-style systems: you tune consistency vs latency/availability by changing how many replicas must respond.
Core challenges you’ll face:
- Coordinator logic → maps to routing, timeouts, partial failures
- Conflict detection → maps to versioning (vector clocks or logical versions)
- Read repair → maps to eventual convergence mechanics
- Timeout semantics → maps to what “success” means under failure
Resources for key challenges:
- Dynamo paper (quorum-like technique, versioning)
- Cassandra docs on tunable consistency
Key Concepts
- Quorum intersection reasoning: Dynamo/Cassandra approach
- Eventual consistency mechanics: DDIA
- Failure as a normal case: distributed systems mindset
Difficulty: Expert
Time estimate: 1 month+
Prerequisites: Project 9 (sharding), strong comfort with networking/timeouts
Real world outcome
- You can simulate partitions and show exactly when stale reads happen (and when they don’t).
- You can expose a “consistency report” per operation: who answered, who didn’t, what got repaired.
Implementation Hints
- Treat timeouts as first-class: they determine observed consistency.
- Implement read repair: if a read sees divergent versions, fix replicas in the background.
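A minimal Go sketch of the coordinator's write path: fan out to all replicas, acknowledge the client once W respond, and let stragglers finish in the background. The transport is a placeholder function type, and all names are assumptions.

```go
// Package quorum: a coordinator that waits for W of N replica acknowledgements.
package quorum

import (
	"context"
	"errors"
	"time"
)

// WriteFn sends one write to one replica and returns an error on failure.
type WriteFn func(ctx context.Context, replica, key, value string) error

// QuorumWrite returns nil once w replicas acknowledge, or an error when the
// timeout expires first; the timeout is what makes "success" a careful claim.
func QuorumWrite(replicas []string, w int, timeout time.Duration,
	send WriteFn, key, value string) error {

	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	acks := make(chan error, len(replicas))
	for _, r := range replicas {
		go func(replica string) { acks <- send(ctx, replica, key, value) }(r)
	}

	got := 0
	for range replicas {
		select {
		case err := <-acks:
			if err == nil {
				got++
				if got >= w {
					return nil // quorum reached; remaining replies are ignored
				}
			}
		case <-ctx.Done():
			return errors.New("quorum not reached before timeout")
		}
	}
	return errors.New("quorum not reached: too many replica failures")
}
```

The read side mirrors this shape: gather R responses, compare versions, and if they diverge, trigger read repair on the stale replicas.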
Learning milestones
- Replication works → You understand redundancy mechanics
- Consistency levels behave predictably → You understand trade-offs precisely
- Repair converges → You understand “eventual” as an algorithm, not a slogan
Project 11: Anti-Entropy Repair with Merkle Trees (Convergence at Scale)
- File: LEARN_NOSQL_DEEP_DIVE.md
- Main Programming Language: Rust
- Alternative Programming Languages: Go, Java, C++
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 4: Expert
- Knowledge Area: Data Repair / Distributed Consistency
- Software or Tool: Merkle trees, anti-entropy
- Main Book: Designing Data-Intensive Applications by Martin Kleppmann
What you’ll build: A background repair process that detects divergence between replicas using Merkle trees and syncs only the differing ranges.
Why it teaches NoSQL: Once you stop assuming perfect networks, you need systematic convergence. Dynamo uses a decentralized synchronization approach and describes Merkle trees for reconciliation.
Core challenges you’ll face:
- Tree construction per range → maps to efficient summaries
- Comparing trees to locate diffs → maps to logarithmic divergence detection
- Bandwidth control → maps to repair storms and SLO risk
- Repair correctness → maps to idempotent, retryable sync
Key Concepts
- Merkle tree reconciliation: Dynamo paper
- Background maintenance debt: compaction mindset applied to repair
- Operational guardrails: rate limiting and scheduling
Difficulty: Expert
Time estimate: 1-2 weeks
Prerequisites: Project 10 recommended
Real world outcome
- You can demonstrate: “replicas diverge under partition; repair restores convergence with bounded network use.”
- You can show “bytes transferred vs naive full sync.”
Implementation Hints
- Build trees over key ranges (or partitions).
- Compare roots first; drill down only when hashes differ.
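A minimal sketch of divergence detection, in Go rather than the project's suggested Rust, and flattened to one level of fixed key buckets; a real implementation builds a hierarchical tree so it can compare roots first and drill down. Bucket count, hashing, and names are illustrative.

```go
// Package repair: bucket-hash comparison, the leaf level of a Merkle exchange.
package repair

import (
	"crypto/sha256"
	"hash/fnv"
	"sort"
)

const buckets = 64

// Summarize hashes every key/value pair into one digest per key bucket.
// Keys are sorted first so identical data always produces identical digests.
func Summarize(data map[string]string) [buckets][32]byte {
	keys := make([]string, 0, len(data))
	for k := range data {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var payload [buckets][]byte
	for _, k := range keys {
		h := fnv.New32a()
		h.Write([]byte(k))
		b := h.Sum32() % buckets
		payload[b] = append(payload[b], []byte(k+"="+data[k]+";")...)
	}

	var out [buckets][32]byte
	for i := range payload {
		out[i] = sha256.Sum256(payload[i])
	}
	return out
}

// Diff returns the bucket indexes whose digests disagree; only those key
// ranges need to be exchanged between replicas.
func Diff(a, b [buckets][32]byte) []int {
	var diverged []int
	for i := range a {
		if a[i] != b[i] {
			diverged = append(diverged, i)
		}
	}
	return diverged
}
```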
Learning milestones
- Repair detects divergence → You understand summary-based synchronization
- Repair is bandwidth-efficient → You understand why Merkle trees matter
- Repair is safe under retries → You understand distributed idempotency
Project 12: CAP and Failure Lab (Make Trade-offs Observable)
- File: LEARN_NOSQL_DEEP_DIVE.md
- Main Programming Language: Python
- Alternative Programming Languages: Go, Rust, Java
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Distributed Systems / Correctness
- Software or Tool: Fault injection harness
- Main Book: Designing Data-Intensive Applications by Martin Kleppmann
What you’ll build: A harness that injects partitions, delay, drops, and node crashes into your Project 10 cluster; it outputs a timeline of anomalies (stale reads, write loss, unavailability).
Why it teaches NoSQL: CAP is not a slogan; it’s a set of failure-driven constraints. The formal result (Gilbert & Lynch) shows you cannot guarantee consistency and availability under partitions in an asynchronous network model.
Core challenges you’ll face:
- Defining consistency checks → maps to linearizability vs eventual
- Modeling availability → maps to what counts as a successful response
- Reproducible chaos → maps to deterministic test traces
- Interpreting anomalies → maps to debugging distributed behavior
Key Concepts
- CAP impossibility proof framing: Gilbert & Lynch 2002
- Consistency definitions: DDIA (models of consistency)
- Fault injection as a method: systems testing discipline
Difficulty: Advanced
Time estimate: Weekend
Prerequisites: Any distributed prototype (Project 10 ideal)
Real world outcome
- A report that says: “Under partition X, QUORUM reads avoid stale reads but writes become unavailable,” etc.
- A library of “failure scenarios” you can replay.
Implementation Hints
- Insert a controllable proxy between nodes; the proxy enforces drop/delay rules.
- Build invariants: monotonic read checks, read-your-writes checks, and “no lost acknowledged writes.”
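A minimal sketch of one invariant checker, in Go rather than the project's suggested Python: it replays a trace and flags reads that miss the most recently acknowledged write for a key. This single-client approximation deliberately conflates "stale" and "lost"; a real checker must account for concurrent operations. The event shape and names are assumptions.

```go
// Package chaoscheck: replay a failure-lab trace and flag suspect reads.
package chaoscheck

type Event struct {
	Kind  string // "write_ack" or "read"
	Key   string
	Value string
}

// SuspectReads returns a description of every read that did not observe the
// last acknowledged write for its key.
func SuspectReads(trace []Event) []string {
	lastAcked := map[string]string{}
	var violations []string
	for _, e := range trace {
		switch e.Kind {
		case "write_ack":
			lastAcked[e.Key] = e.Value
		case "read":
			if want, ok := lastAcked[e.Key]; ok && e.Value != want {
				violations = append(violations,
					"key "+e.Key+": read "+e.Value+", last acknowledged write was "+want)
			}
		}
	}
	return violations
}
```

At consistency level ONE some of these flags are expected behavior rather than bugs; the point of the lab is to make them visible and tie them to specific injected faults.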
Learning milestones
- You can reproduce anomalies → You stop treating distributed bugs as “random”
- You can explain CAP trade-offs concretely → You understand theory through behavior
- You can choose a design intentionally → You can justify NoSQL trade-offs
Project 13: Raft-backed Metadata Service (When You Need Strong Consistency)
- File: LEARN_NOSQL_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, Java
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 4: Expert
- Knowledge Area: Consensus / Cluster Coordination
- Software or Tool: Raft consensus
- Main Book: Designing Data-Intensive Applications by Martin Kleppmann
What you’ll build: A small strongly-consistent metadata service for your cluster: membership, shard assignments, configuration versions—replicated using Raft.
Why it teaches NoSQL: Many “eventually consistent” stores still need a CP core for coordination (who owns what, what config is current). Raft decomposes consensus into leader election, log replication, and safety properties.
Core challenges you’ll face:
- Leader election → maps to availability vs split brain
- Replicated log → maps to state machine replication
- Membership changes → maps to reconfiguration safety
- Client semantics → maps to linearizable reads vs stale reads
Resources for key challenges:
- “In Search of an Understandable Consensus Algorithm (Raft)” (Ongaro & Ousterhout)
Key Concepts
- Replicated log: Raft paper
- Leadership model: Raft paper
- Why consensus is needed: DDIA (coordination chapter)
Difficulty: Expert
Time estimate: 1 month+
Prerequisites: Solid networking, comfort with concurrency, testing discipline
Real world outcome
- You can kill leaders and show the cluster continues with a new leader and no divergent metadata.
- You can show that shard maps are consistent across nodes at all times.
Implementation Hints
- Implement Raft as a replicated log; apply committed entries to a deterministic state machine.
- Separate transport reliability (retries, timeouts) from Raft logic.
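A minimal Go sketch of the first hint's second half: the deterministic state machine that committed entries are applied to, in order and at most once per index. Elections and log replication, the hard parts of the Raft paper, are deliberately absent, and all names are illustrative.

```go
// Package raftfsm: apply committed Raft log entries to a deterministic KV map.
package raftfsm

type Command struct {
	Op, Key, Value string // Op is "put" or "delete"
}

type Entry struct {
	Index   uint64
	Command Command
}

// StateMachine is the map every node converges on by applying the same log.
type StateMachine struct {
	kv        map[string]string
	lastIndex uint64 // highest log index applied so far
}

func New() *StateMachine { return &StateMachine{kv: map[string]string{}} }

// Apply consumes committed entries in log order; re-applying an already seen
// index is a no-op, which keeps the apply path safe under retries and restarts.
func (s *StateMachine) Apply(entries []Entry) {
	for _, e := range entries {
		if e.Index <= s.lastIndex {
			continue // already applied
		}
		switch e.Command.Op {
		case "put":
			s.kv[e.Command.Key] = e.Command.Value
		case "delete":
			delete(s.kv, e.Command.Key)
		}
		s.lastIndex = e.Index
	}
}

// Get reads the local copy; whether that read is linearizable depends on how
// the server confirms it is still leader, which is a client-semantics decision.
func (s *StateMachine) Get(key string) (string, bool) {
	v, ok := s.kv[key]
	return v, ok
}
```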
Learning milestones
- Leader election stabilizes → You understand coordination under failure
- Log replication is correct → You understand state machine replication
- Reconfiguration is safe → You can operate distributed systems without hand-waving
Project 14: Storage Engine Introspection CLI (Explain Your Database to Yourself)
- File: LEARN_NOSQL_DEEP_DIVE.md
- Main Programming Language: Python
- Alternative Programming Languages: Go, Rust, Java
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Observability / Debugging
- Software or Tool: Internal metrics + “explain” tooling
- Main Book: Designing Data-Intensive Applications by Martin Kleppmann
What you’ll build: A CLI (or small web dashboard) that can introspect your DB: list SSTables, levels, compaction backlog, bloom filter stats, cache hit rates, replication lag, and repair status.
Why it teaches NoSQL: Real databases are operated, not just implemented. Observability turns “it’s slow” into “compaction debt is causing tail latency.”
Core challenges you’ll face:
- Defining meaningful metrics → maps to operability and SLOs
- Tracing read/write paths → maps to finding amplification sources
- Surfacing invariants → maps to detecting corruption or divergence early
- Usable UX → maps to making systems debuggable
Key Concepts
- Operational visibility: DDIA (operational concerns)
- Amplification metrics: compaction and caching economics
- Failure signals: replication lag, repair backlog
Difficulty: Intermediate
Time estimate: Weekend
Prerequisites: Any prior project (works best after 1–4 and 10)
Real world outcome
- You can answer: “Why is this query slow?” using your own “explain” output.
- You can watch compaction debt rise and fall during load tests.
Implementation Hints
- Treat internal state as a “system catalog” you can query.
- Include human-first summaries (top 3 reasons for latency) plus raw numbers.
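A minimal sketch of the "system catalog plus human-first summary" idea, in Go rather than the project's suggested Python; every metric name and threshold here is an assumption about what your engine exposes.

```go
// Package introspect: a snapshot of internal state with an operator summary.
package introspect

import "fmt"

// Snapshot is the raw "system catalog" view of one node.
type Snapshot struct {
	SSTables          int
	CompactionBacklog int     // pending compaction jobs
	CacheHitRate      float64 // 0.0 to 1.0
	ReplicationLagMS  int64
}

// Summary turns raw numbers into the top reasons an operator should look at first.
func Summary(s Snapshot) []string {
	var reasons []string
	if s.CompactionBacklog > 10 {
		reasons = append(reasons, fmt.Sprintf(
			"compaction debt is high (%d pending jobs); expect read amplification and stall risk",
			s.CompactionBacklog))
	}
	if s.CacheHitRate < 0.80 {
		reasons = append(reasons, fmt.Sprintf(
			"block cache hit rate is %.0f%%; reads are going to disk", s.CacheHitRate*100))
	}
	if s.ReplicationLagMS > 1000 {
		reasons = append(reasons, fmt.Sprintf(
			"replication lag is %dms; low consistency levels may return stale reads",
			s.ReplicationLagMS))
	}
	if len(reasons) == 0 {
		reasons = append(reasons, "no obvious bottleneck; check per-query traces")
	}
	return reasons
}
```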
Learning milestones
- Metrics are accurate → You trust your system’s introspection
- You can predict incidents → You understand leading indicators
- You can tune with confidence → You operate, not guess
Project Comparison Table
| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| WAL + Memtable KV | Intermediate | 1-2 weeks | High | Medium |
| SSTable Builder | Advanced | 1-2 weeks | Very High | Medium |
| Compaction Simulator | Advanced | Weekend | High | High |
| Bloom + Cache | Advanced | 1-2 weeks | High | Medium |
| MVCC Snapshots | Expert | 1 month+ | Very High | Medium |
| Secondary Indexes | Advanced | 1-2 weeks | High | Medium |
| Document Layer | Intermediate | 1-2 weeks | Medium | Medium |
| Mini-Bigtable | Expert | 1 month+ | Very High | High |
| Consistent Hash Sharder | Advanced | 1-2 weeks | High | High |
| Replication + Quorums | Expert | 1 month+ | Very High | High |
| Merkle Repair | Expert | 1-2 weeks | High | Medium |
| CAP Failure Lab | Advanced | Weekend | High | High |
| Raft Metadata Service | Expert | 1 month+ | Very High | High |
| Introspection CLI | Intermediate | Weekend | Medium | Medium |
Recommendation: Where to Start (and why)
If your goal is to understand how NoSQL works internally (not just “use Mongo/Cassandra”), start with:
1) Project 1 (WAL + memtable) → because it teaches durability and crash semantics first.
2) Project 2 (SSTables) → because it forces you to confront on-disk layout and the read path.
3) Project 3 (Compaction simulator) → because it makes the main LSM trade-offs measurable.
4) Then choose a branch:
- Distributed track: Projects 9 → 10 → 11 → 12
- Data-model track: Projects 7 → 6 → 8
- Strong-consistency core: Project 13 (only after you’ve felt failure modes)
Final Overall Capstone Project: “Mini-Dynamo + LSM” (A Real NoSQL Database)
- File: LEARN_NOSQL_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, Java
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 5: Master
- Knowledge Area: Distributed Storage Systems
- Software or Tool: LSM storage engine + replication + observability
- Main Book: Designing Data-Intensive Applications by Martin Kleppmann
What you’ll build: A distributed NoSQL database with an LSM-tree storage engine, sharding via consistent hashing, replication with tunable consistency, background repair, and an operator-friendly introspection console.
Why it teaches NoSQL: This is the “minimum complete” set of mechanisms behind Dynamo/Cassandra-like systems (partitioning + replication + reconciliation) combined with modern LSM storage foundations.
Core challenges you’ll face:
- End-to-end correctness (ack semantics, crash safety) → maps to WAL + recovery
- Operational stability (compaction/repair debt) → maps to background work management
- Consistency controls (client-visible guarantees) → maps to quorum design
- Reconfiguration (adding nodes safely) → maps to metadata coordination (optional Raft)
Key Concepts
- Highly available KV patterns: Dynamo paper
- Compaction economics: RocksDB compaction model
- Consistency trade-offs under partitions: Gilbert & Lynch (CAP)
Difficulty: Master
Time estimate: 1 month+ (expect multiple iterations)
Prerequisites: Projects 1–4 plus 9–12 (Project 13 optional but valuable)
Real world outcome
- A 3–5 node cluster that can:
  - survive node crashes,
  - rebalance keys when nodes join,
  - demonstrate consistency levels under partitions,
  - and show internal health via a self-hosted dashboard.
Implementation Hints
- Make every subsystem observable: compaction backlog, repair backlog, and replication lag.
- Define the API surface narrowly (KV or document-lite). Query power can come later.
- Use chaos scenarios (Project 12) as your “definition of done.”
Learning milestones
- Single-node engine is durable and fast → You understand storage engines deeply
- Cluster is correct under failures → You understand replication and partitions
- You can operate it with confidence → You’ve internalized NoSQL as an operational system
Summary
| # | Project Name | Main Programming Language |
|---|---|---|
| 1 | WAL + Memtable Key-Value Store | Go |
| 2 | SSTable Builder + Immutable Runs | Go |
| 3 | LSM Compaction Simulator | Python |
| 4 | Bloom Filters + Block Cache | Go |
| 5 | MVCC Snapshots | Rust |
| 6 | Secondary Indexes | Go |
| 7 | Document Store Layer | TypeScript |
| 8 | Wide-Column Mini-Bigtable | Java |
| 9 | Consistent Hashing Sharder | Go |
| 10 | Replication + Tunable Consistency | Go |
| 11 | Anti-Entropy Repair with Merkle Trees | Rust |
| 12 | CAP and Failure Lab | Python |
| 13 | Raft-backed Metadata Service | Go |
| 14 | Storage Engine Introspection CLI | Python |
| 15 | Capstone: Mini-Dynamo + LSM | Go |