LEARN NOSQL DEEP DIVE
Learn NoSQL Databases: From Zero to “I Can Build One”
Goal: Deeply understand how NoSQL databases work—by building the essential components yourself: storage engines, indexes, compaction, partitioning, replication, consistency controls, and operational tooling.
Why NoSQL Exists (and what you’re really learning)
Relational databases are optimized around schemas, joins, and ACID transactions. NoSQL systems exist because many real workloads need different trade-offs:
- Scale-out across many machines (partitioning/sharding)
- High write throughput with predictable latencies (log-structured storage, append-first designs)
- High availability under failures and partitions (replication + tunable consistency)
- Flexible data models (document, key-value, wide-column, graph)
- Operational primitives (online rebalancing, rolling upgrades, multi-DC)
After these projects, you should be able to:
- Explain a NoSQL system’s read path and write path at the “bytes on disk + messages on wire” level
- Reason precisely about consistency (quorum, leader-based, eventual, causal)
- Diagnose performance problems like write amplification, read amplification, and hot partitions
- Build and operate a small but real distributed NoSQL store
Core Concept Analysis
1) Data models and query models
- Key-Value: primary-key access, minimal query surface
- Document: JSON-like documents, secondary indexes, partial updates
- Wide-Column (Bigtable-style): rows + column families + timestamps
- Graph: nodes and edges, traversals, index-free adjacency
- Time-series: append-heavy, compression, downsampling
2) Storage engine fundamentals (single-node)
- Write path: WAL → in-memory structure (memtable) → immutable runs (SSTables)
- Read path: memtable → block index → bloom filter → SSTable scans → merge results
- Compaction: turning many runs into fewer runs (and deleting tombstones)
- Recovery: replay WAL / restore last checkpoint
- Indexes: primary + secondary; trade-offs between write cost and query power
3) Distribution and resilience
- Partitioning: hashing, consistent hashing rings, range partitioning
- Replication: leader/follower vs multi-leader vs leaderless
- Consistency: quorum reads/writes, causal consistency, read-your-writes, linearizability
- Membership: failure detection, gossip, rebalancing
- Repair: anti-entropy, Merkle trees, hinted handoff
4) Operations (what makes it “real”)
- Observability: metrics, tracing, compaction dashboards
- Backups/snapshots
- Schema evolution & migrations (even in “schemaless” systems)
- SLOs and tail latency, overload control, admission control
Project List
All projects below are written into: LEARN_NOSQL_DEEP_DIVE.md
Project 1: WAL + Memtable Key-Value Store (The Write Path)
- File: LEARN_NOSQL_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, Java, C++
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Storage Engines / Durability
- Software or Tool: Your own write-ahead log (WAL), fsync semantics
- Main Book: Designing Data-Intensive Applications by Martin Kleppmann
What you’ll build: A local key-value database that supports put/get/delete, persists via a write-ahead log, and recovers correctly after crashes.
Why it teaches NoSQL: Most NoSQL systems start with an append-first durability story. WAL + in-memory state is the core of nearly every storage engine.
Core challenges you’ll face:
- Defining a record format (checksums, length-prefixing) → maps to binary protocols & corruption detection
- Durability boundaries (fsync rules) → maps to what “committed” really means
- Crash recovery (replay log into memtable) → maps to idempotency and ordering
- Tombstones (delete markers) → maps to deletion in append-only stores
Key Concepts
- WAL & durability: DDIA (Storage & Retrieval)
- Crash recovery invariants: Operating Systems: Three Easy Pieces (Persistence / FS)
- Checksums & torn writes: Computer Systems: A Programmer’s Perspective (I/O + representation)
Difficulty: Intermediate
Time estimate: 1-2 weeks
Prerequisites: Basic file I/O, basic data structures, comfort with unit tests
Real world outcome
- You can demonstrate: “write 10k keys → force-kill process → restart → all committed keys are present; uncommitted keys are absent.”
- You can show a human-readable “log dump” that replays operations and highlights corruption.
Implementation Hints
- Treat the WAL as the source of truth; the memtable is a cache.
- Define a strict ordering: write bytes → flush → (optionally) fsync → then acknowledge.
- Include a checksum per record; on recovery, stop at first invalid record.
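A minimal Go sketch of the hints above, assuming a [length][crc32][payload] record layout; the names `AppendRecord` and `ReadRecords` are illustrative, not a prescribed API.

```go
// Package wal: a sketch of length-prefixed, checksummed WAL records.
package wal

import (
	"bufio"
	"encoding/binary"
	"errors"
	"hash/crc32"
	"io"
)

// AppendRecord writes one record as [length uint32][crc32 uint32][payload].
// The caller decides when to flush and fsync before acknowledging the write.
func AppendRecord(w io.Writer, payload []byte) error {
	var hdr [8]byte
	binary.LittleEndian.PutUint32(hdr[0:4], uint32(len(payload)))
	binary.LittleEndian.PutUint32(hdr[4:8], crc32.ChecksumIEEE(payload))
	if _, err := w.Write(hdr[:]); err != nil {
		return err
	}
	_, err := w.Write(payload)
	return err
}

// ReadRecords replays records into apply(), stopping at EOF or at the first
// torn or corrupt record, which is treated as the end of the log.
func ReadRecords(r io.Reader, apply func(payload []byte)) error {
	br := bufio.NewReader(r)
	for {
		var hdr [8]byte
		if _, err := io.ReadFull(br, hdr[:]); err != nil {
			if errors.Is(err, io.EOF) || errors.Is(err, io.ErrUnexpectedEOF) {
				return nil // clean end or torn header: stop replay here
			}
			return err
		}
		n := binary.LittleEndian.Uint32(hdr[0:4])
		want := binary.LittleEndian.Uint32(hdr[4:8])
		payload := make([]byte, n)
		if _, err := io.ReadFull(br, payload); err != nil {
			return nil // torn record at the tail
		}
		if crc32.ChecksumIEEE(payload) != want {
			return nil // checksum mismatch: stop at first invalid record
		}
		apply(payload)
	}
}
```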
Learning milestones
- Log replay restores state → You understand durability as “replayable history”
- Crashes don’t corrupt state → You understand atomicity boundaries
- Deletes work via tombstones → You understand why compaction is needed later
Project 2: SSTable Builder + Immutable Sorted Runs (The Read Path Starts)
- File: LEARN_NOSQL_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, Java, C++
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Storage Engines / On-disk Layout
- Software or Tool: SSTable format (your own)
- Main Book: Designing Data-Intensive Applications by Martin Kleppmann
What you’ll build: Convert a memtable into an immutable, sorted on-disk file (SSTable) with a sparse index and block structure, then support reads from SSTables.
Why it teaches NoSQL: LSM-based NoSQL engines store data as immutable sorted runs; everything else (bloom filters, compaction, cache) builds on this.
Core challenges you’ll face:
- Sorted encoding (key order, prefix compression) → maps to disk-efficient representations
- Block index design (sparse vs dense) → maps to IO vs CPU trade-offs
- Merging sources (memtable + multiple SSTables) → maps to multi-way merge logic
- Tombstone semantics → maps to visibility rules across runs
Key Concepts
- SSTables and runs: DDIA (LSM trees)
- Binary search & block indexes: Algorithms, 4th Edition (searching)
- Compression trade-offs: MongoDB/WiredTiger “compression + checkpoints” concepts (see Project 5 resources)
Difficulty: Advanced
Time estimate: 1-2 weeks
Prerequisites: Project 1; familiarity with sorting and file formats
Real world outcome
- You can generate SSTables and prove reads work without loading full datasets into RAM.
- You can show “read amplification” by printing which blocks/files were touched per query.
Implementation Hints
- Use fixed-size blocks; each block contains a sequence of key/value entries.
- A sparse index maps “first key in block” → file offset.
- Reads: seek to candidate block(s), then scan inside a block.
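A minimal Go sketch of the sparse index described above; `IndexEntry` and `CandidateBlock` are illustrative names, and the index is assumed to be sorted because SSTables are written in key order.

```go
// Package sstable: a sketch of a sparse block index for an SSTable reader.
package sstable

import "sort"

// IndexEntry maps the first key stored in a block to that block's file offset.
type IndexEntry struct {
	FirstKey string
	Offset   int64
}

// SparseIndex holds one entry per block, sorted by FirstKey.
type SparseIndex []IndexEntry

// CandidateBlock returns the offset of the only block that could contain key,
// or false if key sorts before the table's first block.
func (idx SparseIndex) CandidateBlock(key string) (int64, bool) {
	// Find the first block whose FirstKey is greater than key, then step back:
	// the previous block covers the range [FirstKey, nextFirstKey).
	i := sort.Search(len(idx), func(i int) bool { return idx[i].FirstKey > key })
	if i == 0 {
		return 0, false
	}
	return idx[i-1].Offset, true
}
```

The candidate block still has to be read and scanned; that residual work is exactly the read amplification the project asks you to print per query.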
Learning milestones
- Immutable files enable simple recovery → You see why append + immutability is powerful
- Sparse index works → You understand storage layouts for fast lookups
- Multi-run reads work → You internalize the merge-based read path
Project 3: LSM Compaction Simulator (Make Write Amplification Visible)
- File: LEARN_NOSQL_DEEP_DIVE.md
- Main Programming Language: Python
- Alternative Programming Languages: Go, Rust, Java
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: Storage Engines / Compaction Economics
- Software or Tool: RocksDB compaction mental model
- Main Book: Designing Data-Intensive Applications by Martin Kleppmann
What you’ll build: A simulator that models leveled vs tiered compaction, producing graphs/tables for write amplification, space amplification, and read amplification.
Why it teaches NoSQL: Compaction is where many NoSQL performance mysteries live: stalls, SSD wear, tail latency spikes, and “why is my disk 10× busier than my workload?”
Core challenges you’ll face:
- Modeling levels and fanout → maps to how LSM trees scale
- Modeling tombstones/TTL → maps to deletes and time-based data
- Compaction scheduling → maps to background work vs foreground latency
- Hot partitions and skew → maps to real-world workload behavior
Resources for key challenges:
- RocksDB compaction concepts (leveled vs tiered, write amplification) — RocksDB Wiki (Compaction)
Key Concepts
- Leveled vs tiered compaction: RocksDB documentation
- Write amplification mechanics: LSM tree paper lineage
- Operational tuning mindset: DDIA (storage engines in practice)
Difficulty: Advanced
Time estimate: Weekend
Prerequisites: Comfort with basic probability and plotting; Projects 1–2 helpful
Real world outcome
- You can answer: “Given workload X and configuration Y, why does the system rewrite Z bytes?”
- You can produce a “compaction budget” report that predicts stall risk.
Implementation Hints
- Model each run with size, key-range overlap, and tombstone fraction.
- Track bytes written per compaction and divide by bytes of user writes.
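A minimal sketch of that bookkeeping, in Go for consistency with the other examples even though the project suggests Python; the struct and method names are assumptions.

```go
// Package compactionsim: track user bytes vs compaction bytes in a simulator.
package compactionsim

// Stats accumulates bytes written as the simulated LSM runs.
type Stats struct {
	UserBytes       int64 // bytes of user writes flushed from the memtable
	CompactionBytes int64 // bytes rewritten by background compactions
}

// RecordFlush counts a memtable flush: user data reaching disk the first time.
func (s *Stats) RecordFlush(bytes int64) { s.UserBytes += bytes }

// RecordCompaction counts every byte a compaction job writes back out.
func (s *Stats) RecordCompaction(bytesOut int64) { s.CompactionBytes += bytesOut }

// WriteAmplification is total bytes written to disk divided by user bytes.
func (s *Stats) WriteAmplification() float64 {
	if s.UserBytes == 0 {
		return 0
	}
	return float64(s.UserBytes+s.CompactionBytes) / float64(s.UserBytes)
}
```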
Learning milestones
- You can compute write amplification → You understand why LSMs trade write cost for throughput
- You can explain stalls → You understand background debt as a first-class system constraint
- You can reason about tuning → You can predict outcomes of fanout/level changes
Project 4: Bloom Filters + Block Cache (Make Reads Predictable)
- File: LEARN_NOSQL_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, Java, C++
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: Indexing / Performance Engineering
- Software or Tool: Bloom filters, LRU cache
- Main Book: Designing Data-Intensive Applications by Martin Kleppmann
What you’ll build: Add bloom filters to SSTables and implement a block cache with instrumentation (hit rate, bytes saved, latency percentiles).
Why it teaches NoSQL: Modern NoSQL engines are defined as much by performance structures (filters, caches) as by correctness.
Core challenges you’ll face:
- False positives → maps to probabilistic data structures
- Cache eviction policy → maps to tail latency control
- Read amplification visibility → maps to measuring what your system does
- Interplay with compaction → maps to why files move and caches churn
Key Concepts
- Bloom filters: canonical probability trade-off (false positive tuning)
- Caching strategies: DDIA (cache + storage)
- Latency measurement: “percentiles over averages” mindset
Difficulty: Advanced
Time estimate: 1-2 weeks
Prerequisites: Projects 1–2 (or any SSTable reader)
Real world outcome
- You can show: “same dataset, same queries; bloom filters cut disk reads by N%.”
- You can demonstrate “cache thrash” scenarios and mitigation.
Implementation Hints
- Keep bloom filter per SSTable (or per block) and record filter checks per query.
- Measure: blocks touched, bytes read from disk, and cache hit ratio.
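A minimal Go sketch of a per-SSTable bloom filter using double hashing over FNV-1a; the bit count and hash count are illustrative, not tuned values.

```go
// Package bloom: a sketch of a bloom filter for SSTable reads.
package bloom

import "hash/fnv"

type Filter struct {
	bits []uint64
	m    uint64 // number of bits
	k    uint64 // number of probes per key
}

func New(mBits, k uint64) *Filter {
	return &Filter{bits: make([]uint64, (mBits+63)/64), m: mBits, k: k}
}

// hashes derives two base hashes from one FNV-1a pass (double hashing).
func hashes(key []byte) (uint64, uint64) {
	h := fnv.New64a()
	h.Write(key)
	h1 := h.Sum64()
	h2 := h1>>33 | h1<<31
	return h1, h2 | 1 // make h2 odd so probes spread over all bits
}

// Add sets k bits for the key.
func (f *Filter) Add(key []byte) {
	h1, h2 := hashes(key)
	for i := uint64(0); i < f.k; i++ {
		bit := (h1 + i*h2) % f.m
		f.bits[bit/64] |= 1 << (bit % 64)
	}
}

// MayContain returns false only if the key was definitely never added; a true
// result can be a false positive, which is why correctness never depends on it.
func (f *Filter) MayContain(key []byte) bool {
	h1, h2 := hashes(key)
	for i := uint64(0); i < f.k; i++ {
		bit := (h1 + i*h2) % f.m
		if f.bits[bit/64]&(1<<(bit%64)) == 0 {
			return false
		}
	}
	return true
}
```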
Learning milestones
- Reads get faster without correctness risk → You understand probabilistic acceleration
- You can explain p95 spikes → You understand cache miss storms
- You can design dashboards → You think like a database operator
Project 5: MVCC Snapshots (Point-in-Time Reads Without Locks)
- File: LEARN_NOSQL_DEEP_DIVE.md
- Main Programming Language: Rust
- Alternative Programming Languages: Go, Java, C++
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 4: Expert
- Knowledge Area: Concurrency Control / Storage Semantics
- Software or Tool: MVCC + snapshot reads
- Main Book: Designing Data-Intensive Applications by Martin Kleppmann
What you’ll build: Add MVCC so concurrent operations can see consistent snapshots, enabling “read without blocking writes” with versioned values and garbage collection.
Why it teaches NoSQL: Many NoSQL/document stores rely on snapshot semantics (and it’s the conceptual foundation for consistent reads, change streams, and long-running queries).
Core challenges you’ll face:
- Version chains → maps to data visibility rules
- Snapshot timestamps → maps to logical time vs wall time
- Garbage collection → maps to compaction + retention
- Atomic compare-and-set semantics → maps to concurrency primitives
Resources for key challenges:
- WiredTiger snapshot + checkpoint descriptions (MongoDB documentation)
Key Concepts
- Snapshot isolation vs linearizability: DDIA
- MVCC mechanics: WiredTiger concepts
- GC and retention windows: operational trade-offs in MVCC
Difficulty: Expert
Time estimate: 1 month+
Prerequisites: Comfortable with concurrency; Projects 1–2 help significantly
Real world outcome
- You can run a workload where writers run continuously while readers get repeatable snapshots.
- You can show “old version accumulation” and explain how GC/compaction resolves it.
Implementation Hints
- Assign a monotonically increasing logical version to writes.
- Readers pick a snapshot version and only read versions ≤ snapshot.
- Track the oldest active snapshot to know which versions can be reclaimed.
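A minimal sketch of those visibility rules, in Go rather than the project's suggested Rust, and without any locking; a real engine needs concurrency control and GC around this core. All names are illustrative.

```go
// Package mvcc: version chains with snapshot visibility.
package mvcc

type version struct {
	seq       uint64 // logical version assigned at write time
	value     []byte
	tombstone bool
}

// Store keeps, per key, versions ordered oldest to newest.
type Store struct {
	data    map[string][]version
	nextSeq uint64
}

func New() *Store { return &Store{data: map[string][]version{}, nextSeq: 1} }

// Put appends a new version and returns its sequence number.
func (s *Store) Put(key string, value []byte) uint64 {
	seq := s.nextSeq
	s.nextSeq++
	s.data[key] = append(s.data[key], version{seq: seq, value: value})
	return seq
}

// Delete appends a tombstone version instead of removing data in place.
func (s *Store) Delete(key string) uint64 {
	seq := s.nextSeq
	s.nextSeq++
	s.data[key] = append(s.data[key], version{seq: seq, tombstone: true})
	return seq
}

// Snapshot returns the current high-water mark; reads at this snapshot stay
// repeatable no matter what later writes do.
func (s *Store) Snapshot() uint64 { return s.nextSeq - 1 }

// GetAt returns the newest version with seq <= snapshot, unless it is a tombstone.
func (s *Store) GetAt(key string, snapshot uint64) ([]byte, bool) {
	chain := s.data[key]
	for i := len(chain) - 1; i >= 0; i-- { // newest first
		if chain[i].seq <= snapshot {
			if chain[i].tombstone {
				return nil, false
			}
			return chain[i].value, true
		}
	}
	return nil, false
}
```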
Learning milestones
- Snapshot reads are correct → You understand isolation as a visibility rule
- Writers don’t block readers → You understand why MVCC is popular
- GC is safe and bounded → You understand how correctness meets operations
Project 6: Secondary Indexes (The Hidden Cost of “Flexible Queries”)
- File: LEARN_NOSQL_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, Java, C++
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Indexing / Querying
- Software or Tool: Inverted index / B-tree vs LSM index
- Main Book: Designing Data-Intensive Applications by Martin Kleppmann
What you’ll build: Implement secondary indexes (exact-match and range) and quantify their write amplification and consistency challenges.
Why it teaches NoSQL: “NoSQL is flexible” ends the moment you need indexes at scale. This project reveals why many systems restrict query patterns.
Core challenges you’ll face:
- Index maintenance on updates → maps to write fan-out
- Consistency between primary and index → maps to atomicity without transactions
- Backfilling indexes → maps to online migrations
- Multi-valued fields → maps to document modeling trade-offs
Key Concepts
- Index maintenance vs query power: DDIA
- Backfill strategies: operational database thinking
- Idempotent index updates: recovery + retries
Difficulty: Advanced
Time estimate: 1-2 weeks
Prerequisites: Projects 1–2 or any persistent KV store
Real world outcome
- You can show: “query by field X returns correct results,” and “index build can run online.”
- You can produce a report: “write cost increased by Y% due to index maintenance.”
Implementation Hints
- Treat indexes as separate keyspaces: (index_key → primary_key list).
- Define explicit update rules for modifications and deletes.
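A minimal Go sketch of the "index as a separate keyspace" idea, using a composite key of field, value, and primary key; the in-memory map stands in for a sorted keyspace (which is what actually makes range scans cheap), and every name is illustrative.

```go
// Package secidx: secondary index entries stored as composite keys.
package secidx

import "strings"

const sep = "\x00" // separator that sorts before any printable key byte

// IndexKey builds the key written into the index keyspace.
func IndexKey(field, value, primaryKey string) string {
	return field + "=" + value + sep + primaryKey
}

// Update removes the stale entry (if the field changed) and writes the new
// one. Both steps must be idempotent so retries after a crash are safe.
func Update(index map[string]struct{}, field, oldValue, newValue, pk string) {
	if oldValue != "" {
		delete(index, IndexKey(field, oldValue, pk))
	}
	if newValue != "" {
		index[IndexKey(field, newValue, pk)] = struct{}{}
	}
}

// Lookup returns the primary keys of all rows where field == value. A real
// store would do a prefix range scan instead of iterating everything.
func Lookup(index map[string]struct{}, field, value string) []string {
	prefix := field + "=" + value + sep
	var pks []string
	for k := range index {
		if strings.HasPrefix(k, prefix) {
			pks = append(pks, strings.TrimPrefix(k, prefix))
		}
	}
	return pks
}
```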
Learning milestones
- Indexes work → You understand query acceleration structures
- You measure write fan-out → You understand why indexing is expensive
- Online backfill works → You understand operational reality
Project 7: Document Store Layer (JSON Documents on Top of KV)
- File: LEARN_NOSQL_DEEP_DIVE.md
- Main Programming Language: TypeScript
- Alternative Programming Languages: Go, Python, Java
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Document Model / API Design
- Software or Tool: JSON Patch semantics, schema-less pitfalls
- Main Book: Designing Data-Intensive Applications by Martin Kleppmann
What you’ll build: A document database API: insert/update/partial-update, basic validation, and optional secondary indexes for top-level fields.
Why it teaches NoSQL: The “document” model is mostly an API and modeling discipline layered over a KV engine. You’ll see the gap between “schemaless” and “unstructured chaos.”
Core challenges you’ll face:
- Document identity → maps to primary key strategy
- Partial updates → maps to write amplification at document granularity
- Validation & schema evolution → maps to keeping systems operable
- Indexing specific fields → maps to real query needs
Key Concepts
- Document vs relational trade-offs: DDIA
- Schema-on-write vs schema-on-read: modeling discipline
- Update granularity: why big documents hurt writes
Difficulty: Intermediate
Time estimate: 1-2 weeks
Prerequisites: JSON familiarity; basic API design
Real world outcome
- A small REST or CLI interface where you can store and retrieve documents and run a few indexed queries.
- Demonstrable “migration” of a field from one shape to another without breaking reads.
Implementation Hints
- Store document blobs as values; keep indexes separate.
- Add “document version” metadata for conflict detection later.
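A minimal sketch of that layout, in Go rather than the project's suggested TypeScript: documents are JSON blobs keyed by _id and carry a version counter for later conflict detection. `DocStore` and its methods are assumed names over an in-memory stand-in for your KV engine.

```go
// Package docstore: JSON documents layered over a key-value store.
package docstore

import "encoding/json"

type Document struct {
	ID      string                 `json:"_id"`
	Version uint64                 `json:"_version"`
	Fields  map[string]interface{} `json:"fields"`
}

// DocStore keeps encoded documents in a map standing in for the KV engine.
type DocStore struct{ kv map[string][]byte }

func New() *DocStore { return &DocStore{kv: map[string][]byte{}} }

// Get decodes the stored blob back into a Document.
func (d *DocStore) Get(id string) (Document, bool) {
	blob, ok := d.kv[id]
	if !ok {
		return Document{}, false
	}
	var doc Document
	if err := json.Unmarshal(blob, &doc); err != nil {
		return Document{}, false
	}
	return doc, true
}

// Put stores the document, bumping the version past whatever is already stored.
func (d *DocStore) Put(doc Document) error {
	if prev, ok := d.Get(doc.ID); ok {
		doc.Version = prev.Version + 1
	} else {
		doc.Version = 1
	}
	blob, err := json.Marshal(doc)
	if err != nil {
		return err
	}
	d.kv[doc.ID] = blob
	return nil
}
```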
Learning milestones
- CRUD works reliably → You understand the document layer
- Partial updates are correct → You understand update semantics and their cost
- Schema evolution is managed → You think like an operator, not a tutorial
Project 8: Wide-Column Mini-Bigtable (Rows, Column Families, Timestamps)
- File: LEARN_NOSQL_DEEP_DIVE.md
- Main Programming Language: Java
- Alternative Programming Languages: Go, Rust, C++
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 4: Expert
- Knowledge Area: Data Modeling / Storage Layout
- Software or Tool: Bigtable mental model
- Main Book: Designing Data-Intensive Applications by Martin Kleppmann
What you’ll build: A wide-column store that supports (row key, column family, column qualifier, timestamp) as the addressing model, with range scans by row key.
Why it teaches NoSQL: Bigtable-style systems show how “NoSQL” can still be structured: sorted by row key, versioned by timestamp, grouped into families for IO locality.
Core challenges you’ll face:
- Row-key ordering and range scans → maps to sorted storage & hot-spot risks
- Column-family locality → maps to IO patterns and data layout
- Time-versioned cells → maps to retention policies and GC
- Sparse data → maps to why wide-column exists
Resources for key challenges:
- “Bigtable: A Distributed Storage System for Structured Data” (OSDI 2006)
Key Concepts
- Wide-column model: Bigtable paper
- Sorted tablets and partitioning: Bigtable paper
- Version retention: storage GC patterns
Difficulty: Expert
Time estimate: 1 month+
Prerequisites: Storage fundamentals (Projects 1–2 recommended)
Real world outcome
- You can store time-series-like values per row and retrieve “latest N versions.”
- You can run range scans and show how row-key design affects hotspots.
Implementation Hints
- Physically cluster by row key first, then family, then qualifier.
- Implement a retention policy: “keep last K versions” or “keep last T time.”
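A minimal sketch of the addressing model, in Go rather than the project's suggested Java: cells are keyed by (row, family, qualifier), versions are kept newest-first, and retention trims old versions. Names and the in-memory layout are illustrative.

```go
// Package widecolumn: (row, family, qualifier, timestamp) cell storage.
package widecolumn

import "sort"

type Cell struct {
	Timestamp int64
	Value     []byte
}

type cellKey struct{ row, family, qualifier string }

type Table struct{ cells map[cellKey][]Cell }

func New() *Table { return &Table{cells: map[cellKey][]Cell{}} }

// Put inserts a cell version and keeps the chain sorted newest-first.
func (t *Table) Put(row, family, qualifier string, ts int64, value []byte) {
	k := cellKey{row, family, qualifier}
	versions := append(t.cells[k], Cell{Timestamp: ts, Value: value})
	sort.Slice(versions, func(i, j int) bool {
		return versions[i].Timestamp > versions[j].Timestamp
	})
	t.cells[k] = versions
}

// Latest returns up to n of the most recent versions of one cell.
func (t *Table) Latest(row, family, qualifier string, n int) []Cell {
	versions := t.cells[cellKey{row, family, qualifier}]
	if len(versions) > n {
		versions = versions[:n]
	}
	return versions
}

// Retain enforces a "keep last K versions" policy, the GC side of versioning.
func (t *Table) Retain(k int) {
	for key, versions := range t.cells {
		if len(versions) > k {
			t.cells[key] = versions[:k]
		}
	}
}
```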
Learning milestones
- Range scans are efficient → You understand why sorted order matters
- Versioning is correct → You understand time-based storage semantics
- You can explain hotspotting → You think like a Bigtable user
Project 9: Consistent Hashing Sharder (From Single Node to Many)
- File: LEARN_NOSQL_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, Java, Python
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Distributed Systems / Partitioning
- Software or Tool: Consistent hashing ring
- Main Book: Designing Data-Intensive Applications by Martin Kleppmann
What you’ll build: A sharding layer using consistent hashing with virtual nodes, rebalancing, and a “shard map” tool that explains where each key lives.
Why it teaches NoSQL: Many NoSQL systems partition data via consistent hashing to enable incremental scaling and rebalancing. Dynamo popularized this approach for highly available key-value systems.
Core challenges you’ll face:
- Virtual nodes → maps to load distribution vs operational complexity
- Rebalancing while serving traffic → maps to moving data safely
- Hot key detection → maps to real workload skew
- Key design constraints → maps to partition keys as product decisions
Resources for key challenges:
- “Dynamo: Amazon’s Highly Available Key-value Store” (SOSP 2007)
Key Concepts
- Consistent hashing and replication: Dynamo paper
- Operational rebalancing: DDIA (partitioning chapter)
- Hot partitions: practical performance thinking
Difficulty: Advanced
Time estimate: 1-2 weeks
Prerequisites: Networking basics; Projects 1–2 helpful
Real world outcome
- A mini cluster can add/remove nodes and show how many keys moved.
- A “trace” view shows for a key: hash → ring position → node set.
Implementation Hints
- Keep membership info separate from data storage.
- Make rebalancing observable: moved keys, bytes moved, and time.
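A minimal Go sketch of a ring with virtual nodes; the hash function, vnode count, and names (`Ring`, `AddNode`, `Lookup`) are illustrative choices, not requirements.

```go
// Package ring: consistent hashing with virtual nodes.
package ring

import (
	"fmt"
	"hash/fnv"
	"sort"
)

type Ring struct {
	points []uint64          // sorted vnode positions on the ring
	owner  map[uint64]string // vnode position -> physical node
	vnodes int               // virtual nodes per physical node
}

func New(vnodes int) *Ring {
	return &Ring{owner: map[uint64]string{}, vnodes: vnodes}
}

func hashKey(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	return h.Sum64()
}

// AddNode places vnodes replicas of the node around the ring.
func (r *Ring) AddNode(node string) {
	for i := 0; i < r.vnodes; i++ {
		p := hashKey(fmt.Sprintf("%s#%d", node, i))
		r.owner[p] = node
		r.points = append(r.points, p)
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
}

// Lookup returns the node owning a key: the first vnode clockwise from its hash.
func (r *Ring) Lookup(key string) string {
	if len(r.points) == 0 {
		return ""
	}
	h := hashKey(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}
```

For a shard-map tool, the trace for a key is exactly this path: hash the key, find the clockwise vnode, then walk further clockwise to collect the replica set.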
Learning milestones
- Keys distribute evenly → You understand hashing trade-offs
- Scaling changes only part of the keyspace → You understand why consistent hashing matters
- You can debug hotspots → You understand production pain points
Project 10: Replication + Tunable Consistency (Quorums You Can Feel)
- File: LEARN_NOSQL_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, Java, Erlang/Elixir
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 4: Expert
- Knowledge Area: Replication / Consistency
- Software or Tool: Quorum reads/writes
- Main Book: Designing Data-Intensive Applications by Martin Kleppmann
What you’ll build: Replicate each shard across N nodes and support per-operation consistency levels (e.g., “ONE”, “QUORUM”, “ALL”), including read repair.
Why it teaches NoSQL: This is the heart of Dynamo-style and Cassandra-style systems: you tune consistency vs latency/availability by changing how many replicas must respond.
Core challenges you’ll face:
- Coordinator logic → maps to routing, timeouts, partial failures
- Conflict detection → maps to versioning (vector clocks or logical versions)
- Read repair → maps to eventual convergence mechanics
- Timeout semantics → maps to what “success” means under failure
Resources for key challenges:
- Dynamo paper (quorum-like technique, versioning)
- Cassandra docs on tunable consistency
Key Concepts
- Quorum intersection reasoning: Dynamo/Cassandra approach
- Eventual consistency mechanics: DDIA
- Failure as a normal case: distributed systems mindset
Difficulty: Expert
Time estimate: 1 month+
Prerequisites: Project 9 (sharding), strong comfort with networking/timeouts
Real world outcome
- You can simulate partitions and show exactly when stale reads happen (and when they don’t).
- You can expose a “consistency report” per operation: who answered, who didn’t, what got repaired.
Implementation Hints
- Treat timeouts as first-class: they determine observed consistency.
- Implement read repair: if a read sees divergent versions, fix replicas in the background.
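A minimal Go sketch of the coordinator's write path: fan out to all replicas, acknowledge the client once W respond, and let stragglers finish in the background. The transport is a placeholder function type, and all names are assumptions.

```go
// Package quorum: a coordinator that waits for W of N replica acknowledgements.
package quorum

import (
	"context"
	"errors"
	"time"
)

// WriteFn sends one write to one replica and returns an error on failure.
type WriteFn func(ctx context.Context, replica, key, value string) error

// QuorumWrite returns nil once w replicas acknowledge, or an error when the
// timeout expires first; the timeout is what makes "success" a careful claim.
func QuorumWrite(replicas []string, w int, timeout time.Duration,
	send WriteFn, key, value string) error {

	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	acks := make(chan error, len(replicas))
	for _, r := range replicas {
		go func(replica string) { acks <- send(ctx, replica, key, value) }(r)
	}

	got := 0
	for range replicas {
		select {
		case err := <-acks:
			if err == nil {
				got++
				if got >= w {
					return nil // quorum reached; remaining replies are ignored
				}
			}
		case <-ctx.Done():
			return errors.New("quorum not reached before timeout")
		}
	}
	return errors.New("quorum not reached: too many replica failures")
}
```

The read side mirrors this shape: gather R responses, compare versions, and if they diverge, trigger read repair on the stale replicas.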
Learning milestones
- Replication works → You understand redundancy mechanics
- Consistency levels behave predictably → You understand trade-offs precisely
- Repair converges → You understand “eventual” as an algorithm, not a slogan
Project 11: Anti-Entropy Repair with Merkle Trees (Convergence at Scale)
- File: LEARN_NOSQL_DEEP_DIVE.md
- Main Programming Language: Rust
- Alternative Programming Languages: Go, Java, C++
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 4: Expert
- Knowledge Area: Data Repair / Distributed Consistency
- Software or Tool: Merkle trees, anti-entropy
- Main Book: Designing Data-Intensive Applications by Martin Kleppmann
What you’ll build: A background repair process that detects divergence between replicas using Merkle trees and syncs only the differing ranges.
Why it teaches NoSQL: Once you stop assuming perfect networks, you need systematic convergence. Dynamo uses a decentralized synchronization approach and describes Merkle trees for reconciliation.
Core challenges you’ll face:
- Tree construction per range → maps to efficient summaries
- Comparing trees to locate diffs → maps to logarithmic divergence detection
- Bandwidth control → maps to repair storms and SLO risk
- Repair correctness → maps to idempotent, retryable sync
Key Concepts
- Merkle tree reconciliation: Dynamo paper
- Background maintenance debt: compaction mindset applied to repair
- Operational guardrails: rate limiting and scheduling
Difficulty: Expert
Time estimate: 1-2 weeks
Prerequisites: Project 10 recommended
Real world outcome
- You can demonstrate: “replicas diverge under partition; repair restores convergence with bounded network use.”
- You can show “bytes transferred vs naive full sync.”
Implementation Hints
- Build trees over key ranges (or partitions).
- Compare roots first; drill down only when hashes differ.
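A minimal sketch of divergence detection, in Go rather than the project's suggested Rust, and flattened to one level of fixed key buckets; a real implementation builds a hierarchical tree so it can compare roots first and drill down. Bucket count, hashing, and names are illustrative.

```go
// Package repair: bucket-hash comparison, the leaf level of a Merkle exchange.
package repair

import (
	"crypto/sha256"
	"hash/fnv"
	"sort"
)

const buckets = 64

// Summarize hashes every key/value pair into one digest per key bucket.
// Keys are sorted first so identical data always produces identical digests.
func Summarize(data map[string]string) [buckets][32]byte {
	keys := make([]string, 0, len(data))
	for k := range data {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var payload [buckets][]byte
	for _, k := range keys {
		h := fnv.New32a()
		h.Write([]byte(k))
		b := h.Sum32() % buckets
		payload[b] = append(payload[b], []byte(k+"="+data[k]+";")...)
	}

	var out [buckets][32]byte
	for i := range payload {
		out[i] = sha256.Sum256(payload[i])
	}
	return out
}

// Diff returns the bucket indexes whose digests disagree; only those key
// ranges need to be exchanged between replicas.
func Diff(a, b [buckets][32]byte) []int {
	var diverged []int
	for i := range a {
		if a[i] != b[i] {
			diverged = append(diverged, i)
		}
	}
	return diverged
}
```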
Learning milestones
- Repair detects divergence → You understand summary-based synchronization
- Repair is bandwidth-efficient → You understand why Merkle trees matter
- Repair is safe under retries → You understand distributed idempotency
Project 12: CAP and Failure Lab (Make Trade-offs Observable)
- File: LEARN_NOSQL_DEEP_DIVE.md
- Main Programming Language: Python
- Alternative Programming Languages: Go, Rust, Java
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Distributed Systems / Correctness
- Software or Tool: Fault injection harness
- Main Book: Designing Data-Intensive Applications by Martin Kleppmann
What you’ll build: A harness that injects partitions, delay, drops, and node crashes into your Project 10 cluster; it outputs a timeline of anomalies (stale reads, write loss, unavailability).
Why it teaches NoSQL: CAP is not a slogan; it’s a set of failure-driven constraints. The formal result (Gilbert & Lynch) shows you cannot guarantee consistency and availability under partitions in an asynchronous network model.
Core challenges you’ll face:
- Defining consistency checks → maps to linearizability vs eventual
- Modeling availability → maps to what counts as a successful response
- Reproducible chaos → maps to deterministic test traces
- Interpreting anomalies → maps to debugging distributed behavior
Key Concepts
- CAP impossibility proof framing: Gilbert & Lynch 2002
- Consistency definitions: DDIA (models of consistency)
- Fault injection as a method: systems testing discipline
Difficulty: Advanced
Time estimate: Weekend
Prerequisites: Any distributed prototype (Project 10 ideal)
Real world outcome
- A report that says: “Under partition X, QUORUM reads avoid stale reads but writes become unavailable,” etc.
- A library of “failure scenarios” you can replay.
Implementation Hints
- Insert a controllable proxy between nodes; the proxy enforces drop/delay rules.
- Build invariants: monotonic read checks, read-your-writes checks, and “no lost acknowledged writes.”
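A minimal sketch of one invariant checker, in Go rather than the project's suggested Python: it replays a trace and flags reads that miss the most recently acknowledged write for a key. This single-client approximation deliberately conflates "stale" and "lost"; a real checker must account for concurrent operations. The event shape and names are assumptions.

```go
// Package chaoscheck: replay a failure-lab trace and flag suspect reads.
package chaoscheck

type Event struct {
	Kind  string // "write_ack" or "read"
	Key   string
	Value string
}

// SuspectReads returns a description of every read that did not observe the
// last acknowledged write for its key.
func SuspectReads(trace []Event) []string {
	lastAcked := map[string]string{}
	var violations []string
	for _, e := range trace {
		switch e.Kind {
		case "write_ack":
			lastAcked[e.Key] = e.Value
		case "read":
			if want, ok := lastAcked[e.Key]; ok && e.Value != want {
				violations = append(violations,
					"key "+e.Key+": read "+e.Value+", last acknowledged write was "+want)
			}
		}
	}
	return violations
}
```

At consistency level ONE some of these flags are expected behavior rather than bugs; the point of the lab is to make them visible and tie them to specific injected faults.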
Learning milestones
- You can reproduce anomalies → You stop treating distributed bugs as “random”
- You can explain CAP trade-offs concretely → You understand theory through behavior
- You can choose a design intentionally → You can justify NoSQL trade-offs
Project 13: Raft-backed Metadata Service (When You Need Strong Consistency)
- File: LEARN_NOSQL_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, Java
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 4: Expert
- Knowledge Area: Consensus / Cluster Coordination
- Software or Tool: Raft consensus
- Main Book: Designing Data-Intensive Applications by Martin Kleppmann
What you’ll build: A small strongly-consistent metadata service for your cluster: membership, shard assignments, configuration versions—replicated using Raft.
Why it teaches NoSQL: Many “eventually consistent” stores still need a CP core for coordination (who owns what, what config is current). Raft decomposes consensus into leader election, log replication, and safety properties.
Core challenges you’ll face:
- Leader election → maps to availability vs split brain
- Replicated log → maps to state machine replication
- Membership changes → maps to reconfiguration safety
- Client semantics → maps to linearizable reads vs stale reads
Resources for key challenges:
- “In Search of an Understandable Consensus Algorithm (Raft)” (Ongaro & Ousterhout)
Key Concepts
- Replicated log: Raft paper
- Leadership model: Raft paper
- Why consensus is needed: DDIA (coordination chapter)
Difficulty: Expert
Time estimate: 1 month+
Prerequisites: Solid networking, comfort with concurrency, testing discipline
Real world outcome
- You can kill leaders and show the cluster continues with a new leader and no divergent metadata.
- You can show that shard maps are consistent across nodes at all times.
Implementation Hints
- Implement Raft as a replicated log; apply committed entries to a deterministic state machine.
- Separate transport reliability (retries, timeouts) from Raft logic.
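A minimal Go sketch of the first hint's second half: the deterministic state machine that committed entries are applied to, in order and at most once per index. Elections and log replication, the hard parts of the Raft paper, are deliberately absent, and all names are illustrative.

```go
// Package raftfsm: apply committed Raft log entries to a deterministic KV map.
package raftfsm

type Command struct {
	Op, Key, Value string // Op is "put" or "delete"
}

type Entry struct {
	Index   uint64
	Command Command
}

// StateMachine is the map every node converges on by applying the same log.
type StateMachine struct {
	kv        map[string]string
	lastIndex uint64 // highest log index applied so far
}

func New() *StateMachine { return &StateMachine{kv: map[string]string{}} }

// Apply consumes committed entries in log order; re-applying an already seen
// index is a no-op, which keeps the apply path safe under retries and restarts.
func (s *StateMachine) Apply(entries []Entry) {
	for _, e := range entries {
		if e.Index <= s.lastIndex {
			continue // already applied
		}
		switch e.Command.Op {
		case "put":
			s.kv[e.Command.Key] = e.Command.Value
		case "delete":
			delete(s.kv, e.Command.Key)
		}
		s.lastIndex = e.Index
	}
}

// Get reads the local copy; whether that read is linearizable depends on how
// the server confirms it is still leader, which is a client-semantics decision.
func (s *StateMachine) Get(key string) (string, bool) {
	v, ok := s.kv[key]
	return v, ok
}
```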
Learning milestones
- Leader election stabilizes → You understand coordination under failure
- Log replication is correct → You understand state machine replication
- Reconfiguration is safe → You can operate distributed systems without hand-waving
Project 14: Storage Engine Introspection CLI (Explain Your Database to Yourself)
- File: LEARN_NOSQL_DEEP_DIVE.md
- Main Programming Language: Python
- Alternative Programming Languages: Go, Rust, Java
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Observability / Debugging
- Software or Tool: Internal metrics + “explain” tooling
- Main Book: Designing Data-Intensive Applications by Martin Kleppmann
What you’ll build: A CLI (or small web dashboard) that can introspect your DB: list SSTables, levels, compaction backlog, bloom filter stats, cache hit rates, replication lag, and repair status.
Why it teaches NoSQL: Real databases are operated, not just implemented. Observability turns “it’s slow” into “compaction debt is causing tail latency.”
Core challenges you’ll face:
- Defining meaningful metrics → maps to operability and SLOs
- Tracing read/write paths → maps to finding amplification sources
- Surfacing invariants → maps to detecting corruption or divergence early
- Usable UX → maps to making systems debuggable
Key Concepts
- Operational visibility: DDIA (operational concerns)
- Amplification metrics: compaction and caching economics
- Failure signals: replication lag, repair backlog
Difficulty: Intermediate
Time estimate: Weekend
Prerequisites: Any prior project (works best after 1–4 and 10)
Real world outcome
- You can answer: “Why is this query slow?” using your own “explain” output.
- You can watch compaction debt rise and fall during load tests.
Implementation Hints
- Treat internal state as a “system catalog” you can query.
- Include human-first summaries (top 3 reasons for latency) plus raw numbers.
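A minimal sketch of the "system catalog plus human-first summary" idea, in Go rather than the project's suggested Python; every metric name and threshold here is an assumption about what your engine exposes.

```go
// Package introspect: a snapshot of internal state with an operator summary.
package introspect

import "fmt"

// Snapshot is the raw "system catalog" view of one node.
type Snapshot struct {
	SSTables          int
	CompactionBacklog int     // pending compaction jobs
	CacheHitRate      float64 // 0.0 to 1.0
	ReplicationLagMS  int64
}

// Summary turns raw numbers into the top reasons an operator should look at first.
func Summary(s Snapshot) []string {
	var reasons []string
	if s.CompactionBacklog > 10 {
		reasons = append(reasons, fmt.Sprintf(
			"compaction debt is high (%d pending jobs); expect read amplification and stall risk",
			s.CompactionBacklog))
	}
	if s.CacheHitRate < 0.80 {
		reasons = append(reasons, fmt.Sprintf(
			"block cache hit rate is %.0f%%; reads are going to disk", s.CacheHitRate*100))
	}
	if s.ReplicationLagMS > 1000 {
		reasons = append(reasons, fmt.Sprintf(
			"replication lag is %dms; low consistency levels may return stale reads",
			s.ReplicationLagMS))
	}
	if len(reasons) == 0 {
		reasons = append(reasons, "no obvious bottleneck; check per-query traces")
	}
	return reasons
}
```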
Learning milestones
- Metrics are accurate → You trust your system’s introspection
- You can predict incidents → You understand leading indicators
- You can tune with confidence → You operate, not guess
Project Comparison Table
| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| WAL + Memtable KV | Intermediate | 1-2 weeks | High | Medium |
| SSTable Builder | Advanced | 1-2 weeks | Very High | Medium |
| Compaction Simulator | Advanced | Weekend | High | High |
| Bloom + Cache | Advanced | 1-2 weeks | High | Medium |
| MVCC Snapshots | Expert | 1 month+ | Very High | Medium |
| Secondary Indexes | Advanced | 1-2 weeks | High | Medium |
| Document Layer | Intermediate | 1-2 weeks | Medium | Medium |
| Mini-Bigtable | Expert | 1 month+ | Very High | High |
| Consistent Hash Sharder | Advanced | 1-2 weeks | High | High |
| Replication + Quorums | Expert | 1 month+ | Very High | High |
| Merkle Repair | Expert | 1-2 weeks | High | Medium |
| CAP Failure Lab | Advanced | Weekend | High | High |
| Raft Metadata Service | Expert | 1 month+ | Very High | High |
| Introspection CLI | Intermediate | Weekend | Medium | Medium |
Recommendation: Where to Start (and why)
If your goal is to understand how NoSQL works internally (not just “use Mongo/Cassandra”), start with:
1) Project 1 (WAL + memtable) → because it teaches durability and crash semantics first.
2) Project 2 (SSTables) → because it forces you to confront on-disk layout and the read path.
3) Project 3 (Compaction simulator) → because it makes the main LSM trade-offs measurable.
4) Then choose a branch:
- Distributed track: Projects 9 → 10 → 11 → 12
- Data-model track: Projects 7 → 6 → 8
- Strong-consistency core: Project 13 (only after you’ve felt failure modes)
Final Overall Capstone Project: “Mini-Dynamo + LSM” (A Real NoSQL Database)
- File: LEARN_NOSQL_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, Java
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 5: Master
- Knowledge Area: Distributed Storage Systems
- Software or Tool: LSM storage engine + replication + observability
- Main Book: Designing Data-Intensive Applications by Martin Kleppmann
What you’ll build: A distributed NoSQL database with an LSM-tree storage engine, sharding via consistent hashing, replication with tunable consistency, background repair, and an operator-friendly introspection console.
Why it teaches NoSQL: This is the “minimum complete” set of mechanisms behind Dynamo/Cassandra-like systems (partitioning + replication + reconciliation) combined with modern LSM storage foundations.
Core challenges you’ll face:
- End-to-end correctness (ack semantics, crash safety) → maps to WAL + recovery
- Operational stability (compaction/repair debt) → maps to background work management
- Consistency controls (client-visible guarantees) → maps to quorum design
- Reconfiguration (adding nodes safely) → maps to metadata coordination (optional Raft)
Key Concepts
- Highly available KV patterns: Dynamo paper
- Compaction economics: RocksDB compaction model
- Consistency trade-offs under partitions: Gilbert & Lynch (CAP)
Difficulty: Master
Time estimate: 1 month+ (expect multiple iterations)
Prerequisites: Projects 1–4 plus 9–12 (Project 13 optional but valuable)
Real world outcome
- A 3–5 node cluster that can:
  - survive node crashes,
  - rebalance keys when nodes join,
  - demonstrate consistency levels under partitions,
  - and show internal health via a self-hosted dashboard.
Implementation Hints
- Make every subsystem observable: compaction backlog, repair backlog, and replication lag.
- Define the API surface narrowly (KV or document-lite). Query power can come later.
- Use chaos scenarios (Project 12) as your “definition of done.”
Learning milestones
- Single-node engine is durable and fast → You understand storage engines deeply
- Cluster is correct under failures → You understand replication and partitions
- You can operate it with confidence → You’ve internalized NoSQL as an operational system
Summary
| # | Project Name | Main Programming Language |
|---|---|---|
| 1 | WAL + Memtable Key-Value Store | Go |
| 2 | SSTable Builder + Immutable Runs | Go |
| 3 | LSM Compaction Simulator | Python |
| 4 | Bloom Filters + Block Cache | Go |
| 5 | MVCC Snapshots | Rust |
| 6 | Secondary Indexes | Go |
| 7 | Document Store Layer | TypeScript |
| 8 | Wide-Column Mini-Bigtable | Java |
| 9 | Consistent Hashing Sharder | Go |
| 10 | Replication + Tunable Consistency | Go |
| 11 | Anti-Entropy Repair with Merkle Trees | Rust |
| 12 | CAP and Failure Lab | Python |
| 13 | Raft-backed Metadata Service | Go |
| 14 | Storage Engine Introspection CLI | Python |
| 15 | Capstone: Mini-Dynamo + LSM | Go |