LEARN COUCHDB DEEP DIVE

Learn CouchDB: From Zero to Document Database Master

Goal: Deeply understand Apache CouchDB—from JSON document storage to building a complete document-oriented database with append-only B-trees, RESTful HTTP API, MapReduce views, MVCC, and multi-master replication from scratch in C.


Why CouchDB Matters

CouchDB pioneered several revolutionary database concepts that are now industry standards:

  • “Relax”: A database file that never becomes corrupted, even during crashes or power failures
  • HTTP all the way down: Every operation is a REST API call
  • Eventual consistency: Multi-master replication without coordination
  • Append-only storage: Never overwrites data, enabling crash-proof durability
  • MapReduce views: Pre-computed indexes using JavaScript functions

After completing these projects, you will:

  • Understand the append-only B-tree design that makes CouchDB crash-proof
  • Know how MVCC enables concurrent reads without locks
  • Build a RESTful HTTP API that serves JSON documents
  • Implement MapReduce views with incremental updates
  • Master conflict-free replication between database instances
  • Have built a working document database from scratch in C

Core Concept Analysis

CouchDB Architecture Overview

┌─────────────────────────────────────────────────────────────────────┐
│                         HTTP REST API                               │
│    GET /db/doc    PUT /db/doc    DELETE /db/doc    POST /db/_find   │
└─────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────┐
│                      Document Layer                                 │
│         JSON Parsing │ Validation │ Revision Management             │
└─────────────────────────────────────────────────────────────────────┘
                                  │
              ┌───────────────────┼───────────────────┐
              ▼                   ▼                   ▼
┌─────────────────────┐ ┌─────────────────┐ ┌─────────────────────────┐
│    By-ID Index      │ │  By-Seq Index   │ │    MapReduce Views      │
│   (doc_id → doc)    │ │ (seq → doc_id)  │ │  (user-defined indexes) │
└─────────────────────┘ └─────────────────┘ └─────────────────────────┘
              │                   │                   │
              └───────────────────┼───────────────────┘
                                  ▼
┌─────────────────────────────────────────────────────────────────────┐
│                   Append-Only B-Tree Engine                         │
│          New nodes written to end │ Never overwrites                │
└─────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────┐
│                     Database File (.couch)                          │
│   [Header][Data...][B-tree nodes...][Header][Data...][New Header]   │
│                              ↑                                      │
│                         Always grows →                              │
└─────────────────────────────────────────────────────────────────────┘

Fundamental Concepts

  1. Document Model
    • Documents are JSON objects with _id and _rev fields
    • _id: Unique identifier (user-provided or auto-generated UUID)
    • _rev: Revision string (e.g., "1-abc123") for conflict detection
    • Schema-less: Any valid JSON structure is allowed
  2. Append-Only B-Tree
    • Modified nodes are written to the end of the file, never in-place
    • Old versions remain on disk until compaction
    • Root pointer stored in file footer (last 4KB)
    • Crash recovery: Scan backward to find last valid header
  3. Multi-Version Concurrency Control (MVCC)
    • Each update creates a new revision
    • Readers see a consistent snapshot (point-in-time)
    • Writers never block readers
    • Conflicts detected via _rev field during updates
  4. Revision Tree
    Document: "user_123"
    
    1-abc  (initial)
       │
    2-def  (update)
       │
       ├── 3-ghi  (latest winning revision)
       └── 3-xyz  (conflict branch from replica)
    
  5. MapReduce Views
    • Map function: Emits key-value pairs for each document
    • Reduce function: Aggregates values by key
    • Stored in separate B-tree, incrementally updated
    • Only processes documents changed since last query
  6. Changes Feed
    • Sequence number increments with every change
    • _changes API returns all changes since a sequence
    • Enables efficient incremental replication
  7. Replication Protocol
    • Compare revision trees between source and target
    • Transfer only missing revisions
    • Deterministic conflict resolution (same winner everywhere)
    • Bidirectional: push + pull = full sync

Project List

Projects are ordered from fundamental understanding to advanced implementations. Each project builds on the previous ones.


Project 1: JSON Parser and Document Model

  • File: LEARN_COUCHDB_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go, Zig
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Parsing / Data Structures / JSON
  • Software or Tool: Building: JSON Library
  • Main Book: “Writing a C Compiler” by Nora Sandler

What you’ll build: A complete JSON parser in C that can parse, modify, and serialize JSON documents—the foundation of any document database.

Why it teaches CouchDB: CouchDB stores everything as JSON. Before you can store documents, you need to parse them. This project teaches you tokenization, recursive descent parsing, and in-memory document representation—skills you’ll use throughout.

Core challenges you’ll face:

  • Tokenizing JSON input → maps to lexical analysis
  • Handling nested objects and arrays → maps to recursive parsing
  • Memory management for dynamic structures → maps to ownership and lifetime
  • Unicode and escape sequence handling → maps to string encoding

Key Concepts:

  • Recursive Descent Parsing: “Writing a C Compiler” Chapter 1 - Nora Sandler
  • Memory Management in C: “C Programming: A Modern Approach” Chapter 17 - K. N. King
  • JSON Specification: RFC 8259 - The JavaScript Object Notation (JSON) Data Interchange Format
  • Tokenizer Design: jsmn - Minimal JSON Tokenizer

Difficulty: Intermediate
Time estimate: 1 week
Prerequisites:

  • Basic C programming (pointers, structs, dynamic memory)
  • Understanding of string manipulation
  • No prior parsing experience required

Real world outcome:

$ ./json_parser
> {"name": "Alice", "age": 30, "hobbies": ["reading", "coding"]}

Parsed successfully!
Type: Object
Keys: 3

Document structure:
{
  "name": "Alice" (string)
  "age": 30 (number)
  "hobbies": [
    "reading" (string)
    "coding" (string)
  ] (array, 2 elements)
}

> json_get(doc, "hobbies[1]")
"coding"

> json_set(doc, "age", 31)
> json_serialize(doc)
{"name":"Alice","age":31,"hobbies":["reading","coding"]}

Implementation Hints:

JSON value types to support:

typedef enum {
    JSON_NULL,
    JSON_BOOL,
    JSON_NUMBER,
    JSON_STRING,
    JSON_ARRAY,
    JSON_OBJECT
} JsonType;

typedef struct JsonValue {
    JsonType type;
    union {
        bool boolean;
        double number;
        char* string;
        struct { struct JsonValue* items; size_t count; } array;
        struct { char** keys; struct JsonValue* values; size_t count; } object;
    };
} JsonValue;

Tokenizer produces tokens:

  • {, }, [, ], :, ,
  • Strings: "hello", "with \"escapes\""
  • Numbers: 123, -45.67, 1.2e10
  • Keywords: true, false, null
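
For concreteness, a minimal C sketch of these token kinds and a next-token loop (the names TokenType and next_token are illustrative, and string/number scanning is left as an exercise):

#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef enum {
    TOK_LBRACE, TOK_RBRACE, TOK_LBRACKET, TOK_RBRACKET,
    TOK_COLON, TOK_COMMA, TOK_STRING, TOK_NUMBER,
    TOK_TRUE, TOK_FALSE, TOK_NULL, TOK_EOF, TOK_ERROR
} TokenType;

typedef struct {
    TokenType type;
    const char* start;   /* points into the input buffer */
    size_t length;
} Token;

/* Return the next token and advance *p past it. */
Token next_token(const char** p) {
    while (isspace((unsigned char)**p)) (*p)++;
    Token t = { TOK_ERROR, *p, 1 };
    switch (**p) {
    case '\0': t.type = TOK_EOF;      t.length = 0; return t;
    case '{':  t.type = TOK_LBRACE;   (*p)++; return t;
    case '}':  t.type = TOK_RBRACE;   (*p)++; return t;
    case '[':  t.type = TOK_LBRACKET; (*p)++; return t;
    case ']':  t.type = TOK_RBRACKET; (*p)++; return t;
    case ':':  t.type = TOK_COLON;    (*p)++; return t;
    case ',':  t.type = TOK_COMMA;    (*p)++; return t;
    }
    if (strncmp(*p, "true", 4) == 0)  { t.type = TOK_TRUE;  t.length = 4; *p += 4; return t; }
    if (strncmp(*p, "false", 5) == 0) { t.type = TOK_FALSE; t.length = 5; *p += 5; return t; }
    if (strncmp(*p, "null", 4) == 0)  { t.type = TOK_NULL;  t.length = 4; *p += 4; return t; }
    /* TODO: scan a quoted string or a number and set t.length accordingly */
    return t;
}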

Parsing approach:

parse_value():
    switch (current_token):
        case '{': return parse_object()
        case '[': return parse_array()
        case STRING: return parse_string()
        case NUMBER: return parse_number()
        case TRUE/FALSE: return parse_bool()
        case NULL: return parse_null()

parse_object():
    expect '{'
    while not '}':
        key = parse_string()
        expect ':'
        value = parse_value()
        add_to_object(key, value)
        if not ',': break
    expect '}'

Think about:

  • How do you handle "\u0041" (Unicode escape)?
  • How do you differentiate 123 from "123"?
  • What’s your memory ownership model? Who frees what?

Learning milestones:

  1. Tokenizer produces correct tokens → You understand lexical analysis
  2. Nested structures parse correctly → You understand recursive parsing
  3. Round-trip works (parse → serialize → parse) → Your parser is correct
  4. No memory leaks → You understand C memory management

Project 2: In-Memory Document Store with Revisions

  • File: LEARN_COUCHDB_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go, C++
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Data Structures / MVCC / Versioning
  • Software or Tool: Building: Document Store
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: An in-memory document store that tracks revisions (_rev), detects conflicts, and maintains document history—the core of CouchDB’s data model.

Why it teaches CouchDB: The _rev field is CouchDB’s magic. It enables optimistic concurrency control, conflict detection, and replication. Understanding revision trees is essential to understanding CouchDB.

Core challenges you’ll face:

  • Generating revision IDs → maps to content-addressable hashing
  • Revision tree management → maps to tree data structures
  • Conflict detection on update → maps to optimistic locking
  • Efficient document lookup → maps to hash table design

Key Concepts:

  • MVCC Fundamentals: “Designing Data-Intensive Applications” Chapter 7 - Martin Kleppmann
  • Hash Tables in C: “Mastering Algorithms with C” Chapter 8 - Kyle Loudon
  • Content Hashing: “Designing Data-Intensive Applications” Chapter 5 - Martin Kleppmann
  • Revision Trees: CouchDB Replication and Conflict Model

Difficulty: Intermediate
Time estimate: 1 week
Prerequisites:

  • Project 1 completed (JSON parser)
  • Understanding of hash tables
  • Basic tree data structures

Real world outcome:

$ ./docstore

docstore> PUT user_123 {"name": "Alice", "age": 30}
{
  "_id": "user_123",
  "_rev": "1-abc123def456",
  "name": "Alice",
  "age": 30
}
Created.

docstore> GET user_123
{
  "_id": "user_123",
  "_rev": "1-abc123def456",
  "name": "Alice",
  "age": 30
}

docstore> PUT user_123 {"_rev": "1-abc123def456", "name": "Alice", "age": 31}
{
  "_id": "user_123",
  "_rev": "2-789ghi012jkl",
  "name": "Alice",
  "age": 31
}
Updated.

docstore> PUT user_123 {"_rev": "1-abc123def456", "name": "Alicia", "age": 30}
Error: Conflict. Document has been modified.
Current revision: 2-789ghi012jkl
Your revision: 1-abc123def456

docstore> REVS user_123
Revision tree for user_123:
  1-abc123def456 (superseded)
  └── 2-789ghi012jkl (current)

Implementation Hints:

Revision ID format: {revision_number}-{md5_of_content}

Generate revision:
1. Serialize document to JSON (without _id, _rev)
2. Compute MD5 hash of JSON string
3. Increment revision number
4. Format: "{rev_num}-{first_16_chars_of_md5}"

Example: "2-a1b2c3d4e5f60718"
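
A sketch of this recipe in C, assuming OpenSSL is available for MD5 (link with -lcrypto); the helper name make_rev_id is illustrative:

#include <openssl/md5.h>
#include <stdio.h>
#include <string.h>

/* rev_out must hold at least 32 bytes: "<number>-<16 hex chars>" plus NUL. */
void make_rev_id(unsigned rev_num, const char* body_json, char* rev_out) {
    unsigned char digest[MD5_DIGEST_LENGTH];
    MD5((const unsigned char*)body_json, strlen(body_json), digest);

    int n = sprintf(rev_out, "%u-", rev_num);
    for (int i = 0; i < 8; i++)              /* 8 bytes -> first 16 hex characters */
        n += sprintf(rev_out + n, "%02x", digest[i]);
}

/* make_rev_id(2, "{\"name\":\"Alice\",\"age\":31}", buf) yields something like "2-a1b2c3d4e5f60718". */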

Document structure:

typedef struct Revision {
    char* rev_id;           // "1-abc123"
    JsonValue* doc;         // The document content
    struct Revision* parent; // Previous revision
    struct Revision** children; // Branches (conflicts)
    size_t child_count;
    bool deleted;           // Tombstone flag
} Revision;

typedef struct Document {
    char* id;               // _id
    Revision* rev_tree;     // Root of revision tree
    Revision* winner;       // Current winning revision
} Document;

Conflict detection:

on_update(doc_id, new_content, provided_rev):
    current = get_document(doc_id)
    if current.winner.rev_id != provided_rev:
        return CONFLICT_ERROR

    new_rev = create_revision(new_content, current.winner)
    current.winner = new_rev
    return SUCCESS

Think about:

  • What makes a revision “win” when there are conflicts?
  • How do you handle DELETE? (Hint: tombstone with _deleted: true)
  • What if someone provides no _rev for an update?

Learning milestones:

  1. Documents have _id and _rev → You understand the document model
  2. Updates require correct _rev → You understand optimistic locking
  3. Revision history is preserved → You understand MVCC
  4. Conflicts are detected → You’re ready for replication

Project 3: Append-Only B-Tree Storage Engine

  • File: LEARN_COUCHDB_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 4: Expert
  • Knowledge Area: Storage Engines / B-Trees / Durability
  • Software or Tool: Building: Append-Only B-Tree
  • Main Book: “Database Internals” by Alex Petrov

What you’ll build: An append-only B-tree that never overwrites data—the crash-proof storage engine that makes CouchDB reliable.

Why it teaches CouchDB: This is CouchDB’s secret weapon. By never overwriting data, the database file is always consistent. Crashes during writes simply leave incomplete data at the end, which is ignored on recovery. This project teaches you how to build truly crash-proof storage.

Core challenges you’ll face:

  • Copy-on-write B-tree updates → maps to immutable data structures
  • Root pointer management in file footer → maps to atomic commits
  • Crash recovery by scanning backward → maps to durability guarantees
  • Space efficiency with node reuse → maps to storage optimization

Difficulty: Expert
Time estimate: 2-3 weeks
Prerequisites:

  • Understanding of B-trees
  • File I/O in C
  • Binary data handling

Real world outcome:

$ ./appendonly_btree test.db

btree> INSERT "user_001" "Alice"
Inserted at offset 4096
New root at offset 4096

btree> INSERT "user_002" "Bob"
Inserted at offset 8192
New root at offset 8192 (copied, old root still at 4096)

btree> GET "user_001"
"Alice"

# Simulate crash during write
btree> INSERT "user_003" "Carol"
Writing... [CRASH SIMULATED]

$ ./appendonly_btree test.db
Recovering...
Scanning backward for valid header...
Found valid header at offset 8192
Recovery complete. 2 keys in database.

btree> GET "user_003"
Not found (incomplete write was discarded)

btree> GET "user_001"
"Alice" (previous data intact!)

Implementation Hints:

File structure:

┌─────────────────────────────────────────────────────────────────┐
│ Offset 0:    [File Header - 4KB]                                │
│              Magic number, version, etc.                        │
├─────────────────────────────────────────────────────────────────┤
│ Offset 4096: [B-tree Node 1] [B-tree Node 2] [Node 3] ...      │
│              (nodes appended sequentially)                      │
├─────────────────────────────────────────────────────────────────┤
│ End-4KB:     [Active Header - 4KB]                              │
│              Root pointer, doc count, sequence number           │
│              Written twice (2KB + 2KB) for redundancy           │
└─────────────────────────────────────────────────────────────────┘

Copy-on-write update:

To update key K with value V:
1. Read path from root to leaf containing K
2. Create NEW leaf node with updated K→V
3. Create NEW internal nodes pointing to new leaf
4. Create NEW root pointing to new internal nodes
5. Append all new nodes to end of file
6. Write new header with new root pointer
7. fsync()

Old nodes remain on disk (for old snapshots and crash safety)
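
A minimal sketch of the append step in C; the returned offset becomes the child pointer recorded in the new parent node (append_node and the caller's node serialization are assumptions):

#include <sys/types.h>
#include <unistd.h>

/* Append a serialized node at the end of the file and return its offset,
 * or (off_t)-1 on error. Nothing already on disk is ever touched. */
off_t append_node(int fd, const void* node_buf, size_t node_len) {
    off_t offset = lseek(fd, 0, SEEK_END);     /* nodes only ever go at the end */
    if (offset == (off_t)-1) return (off_t)-1;
    if (write(fd, node_buf, node_len) != (ssize_t)node_len) return (off_t)-1;
    return offset;
}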

Header commit (atomic):

1. Write all data and new B-tree nodes
2. fsync() data
3. Write header copy 1 (first 2KB of footer)
4. Write header copy 2 (second 2KB of footer)
5. fsync() footer

On recovery:
- Read both header copies
- If they match and checksum valid: use it
- If only first valid: use first (crash during step 4)
- If neither valid: scan backward for previous header
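
A sketch of the footer commit under the assumptions above: a 4KB footer made of two 2KB header copies, each protected by a zlib crc32 checksum (FooterHeader, commit_header, and the field layout are illustrative; link with -lz):

#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <zlib.h>

#define HEADER_COPY_SIZE 2048

typedef struct {
    uint64_t magic;                  /* marks a valid header block        */
    uint64_t root_offset;            /* B-tree root this commit points at */
    uint64_t doc_count;
    uint64_t update_seq;
    uint32_t checksum;               /* crc32 over the fields above       */
} FooterHeader;

static uint32_t header_crc(const FooterHeader* h) {
    return (uint32_t)crc32(0L, (const unsigned char*)h,
                           (uInt)offsetof(FooterHeader, checksum));
}

/* Write both 2KB copies, then fsync so the commit is durable. */
int commit_header(int fd, FooterHeader h, off_t footer_offset) {
    unsigned char block[HEADER_COPY_SIZE] = {0};
    h.checksum = header_crc(&h);
    memcpy(block, &h, sizeof h);

    if (pwrite(fd, block, sizeof block, footer_offset) != (ssize_t)sizeof block) return -1;
    if (pwrite(fd, block, sizeof block, footer_offset + HEADER_COPY_SIZE) != (ssize_t)sizeof block) return -1;
    return fsync(fd);
}

/* Recovery reads both copies, prefers the first whose checksum verifies, and
 * otherwise scans backward through older footers until one is valid. */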

Think about:

  • Why write the header twice?
  • How do you find the previous valid header after a crash?
  • What’s the space overhead of copy-on-write?

Learning milestones:

  1. B-tree operations work → You understand B-tree algorithms
  2. Updates append, never overwrite → You understand copy-on-write
  3. Database survives kill -9 → You’ve achieved crash safety
  4. Recovery finds valid state → You understand header scanning

Project 4: HTTP REST API Server

  • File: LEARN_COUCHDB_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go, Zig
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Networking / HTTP / REST APIs
  • Software or Tool: Building: HTTP Server
  • Main Book: “The Linux Programming Interface” by Michael Kerrisk

What you’ll build: An HTTP/1.1 server that exposes your document store via REST API—just like CouchDB’s native interface.

Why it teaches CouchDB: CouchDB is “HTTP all the way down.” Every operation—creating databases, inserting documents, querying views—is an HTTP request. Building this teaches you network programming and API design.

Core challenges you’ll face:

  • TCP socket programming → maps to network fundamentals
  • HTTP request parsing → maps to protocol implementation
  • Routing requests to handlers → maps to API design
  • Concurrent connection handling → maps to server architecture

Key Concepts:

  • Socket Programming: “The Linux Programming Interface” Chapter 56-61 - Michael Kerrisk
  • HTTP/1.1 Protocol: RFC 7230-7235
  • REST API Design: CouchDB API Reference
  • Beej’s Guide to Network Programming: Essential socket programming reference

Difficulty: Advanced
Time estimate: 2 weeks
Prerequisites:

  • Projects 1-2 completed
  • Basic understanding of TCP/IP
  • Familiarity with HTTP

Real world outcome:

$ ./couchdb_server --port 5984
CouchDB-like server starting on http://localhost:5984
Ready to accept connections...

# In another terminal:
$ curl http://localhost:5984/
{"couchdb":"Welcome","version":"0.1.0"}

$ curl -X PUT http://localhost:5984/mydb
{"ok":true}

$ curl -X PUT http://localhost:5984/mydb/doc1 \
       -H "Content-Type: application/json" \
       -d '{"name":"Alice","age":30}'
{"ok":true,"id":"doc1","rev":"1-abc123"}

$ curl http://localhost:5984/mydb/doc1
{"_id":"doc1","_rev":"1-abc123","name":"Alice","age":30}

$ curl -X PUT http://localhost:5984/mydb/doc1 \
       -H "Content-Type: application/json" \
       -d '{"_rev":"wrong","name":"Alice","age":31}'
{"error":"conflict","reason":"Document update conflict."}

$ curl -X DELETE http://localhost:5984/mydb/doc1?rev=1-abc123
{"ok":true,"id":"doc1","rev":"2-def456"}

Implementation Hints:

CouchDB REST API endpoints:

Server:
  GET  /                    → Server info
  GET  /_all_dbs            → List all databases

Database:
  PUT    /{db}              → Create database
  GET    /{db}              → Database info
  DELETE /{db}              → Delete database

Documents:
  POST   /{db}              → Create document (auto-ID)
  PUT    /{db}/{docid}      → Create/update document
  GET    /{db}/{docid}      → Read document
  DELETE /{db}/{docid}?rev= → Delete document

Special:
  GET    /{db}/_all_docs    → List all document IDs
  GET    /{db}/_changes     → Changes feed
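
A sketch of routing by method and path segments in C (parse_path and the handler comments are placeholders for your own functions):

#include <string.h>

/* Split "/mydb/doc1" into up to three segments ("mydb", "doc1", ...). */
static int parse_path(char* path, char* seg[3]) {
    int n = 0;
    for (char* tok = strtok(path, "/"); tok && n < 3; tok = strtok(NULL, "/"))
        seg[n++] = tok;
    return n;
}

void route(const char* method, char* path) {
    char* seg[3] = {0};
    int n = parse_path(path, seg);

    if      (n == 0 && strcmp(method, "GET") == 0)    { /* server info              */ }
    else if (n == 1 && strcmp(method, "PUT") == 0)    { /* create database seg[0]   */ }
    else if (n == 2 && strcmp(method, "GET") == 0) {
        if (strcmp(seg[1], "_changes") == 0)          { /* changes feed             */ }
        else                                          { /* read document seg[1]     */ }
    }
    else if (n == 2 && strcmp(method, "PUT") == 0)    { /* create/update document   */ }
    else if (n == 2 && strcmp(method, "DELETE") == 0) { /* delete document (?rev=)  */ }
    else                                              { /* 404 Not Found / 405      */ }
}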

HTTP request structure:

GET /mydb/doc1 HTTP/1.1
Host: localhost:5984
Accept: application/json

Response:
HTTP/1.1 200 OK
Content-Type: application/json
ETag: "1-abc123"

{"_id":"doc1","_rev":"1-abc123","name":"Alice"}

Server architecture:

main():
    socket = create_tcp_socket(5984)
    while true:
        client = accept(socket)
        fork() or thread:  // Handle concurrently
            request = parse_http_request(client)
            response = route_and_handle(request)
            send_http_response(client, response)
            close(client)
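
A minimal POSIX accept loop matching this outline, with one detached thread per connection (handle_client stands in for your parse/route/respond code; link with -lpthread):

#include <netinet/in.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

static void* handle_client(void* arg) {
    int client = *(int*)arg;
    free(arg);
    /* read request, parse_http_request(), route_and_handle(), send response */
    close(client);
    return NULL;
}

int main(void) {
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(listener, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one);

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(5984);
    if (bind(listener, (struct sockaddr*)&addr, sizeof addr) < 0) return 1;
    listen(listener, 128);

    for (;;) {
        int* client = malloc(sizeof *client);
        *client = accept(listener, NULL, NULL);
        if (*client < 0) { free(client); continue; }
        pthread_t tid;
        pthread_create(&tid, NULL, handle_client, client);
        pthread_detach(tid);              /* each connection handled concurrently */
    }
}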

Think about:

  • How do you parse the URL path (/mydb/doc1)?
  • How do you handle ?rev= query parameters?
  • What HTTP status codes should you return? (200, 201, 404, 409, etc.)
  • How do you handle large request bodies?

Learning milestones:

  1. Server accepts connections → You understand sockets
  2. HTTP parsing works → You understand the HTTP protocol
  3. CRUD operations via curl work → You’ve built a REST API
  4. Multiple clients work concurrently → You understand server architecture

Project 5: Sequence Index and Changes Feed

  • File: LEARN_COUCHDB_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Event Sourcing / Change Tracking
  • Software or Tool: Building: Changes Feed
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A sequence-based index that tracks every change, plus a _changes API for real-time and historical change streaming.

Why it teaches CouchDB: The changes feed is the foundation of CouchDB replication. Every modification gets a sequence number. Clients can ask “give me all changes since sequence X” to sync efficiently. This is event sourcing at the database level.

Core challenges you’ll face:

  • Monotonic sequence number generation → maps to event ordering
  • By-sequence index maintenance → maps to secondary indexing
  • Long-polling for real-time changes → maps to push notifications
  • Efficient “since” queries → maps to incremental sync

Difficulty: Advanced
Time estimate: 1 week
Prerequisites:

  • Project 3 completed (storage engine)
  • Project 4 completed (HTTP server)
  • Understanding of secondary indexes

Real world outcome:

$ curl "http://localhost:5984/mydb/_changes"
{
  "results": [
    {"seq":1,"id":"doc1","changes":[{"rev":"1-abc"}]},
    {"seq":2,"id":"doc2","changes":[{"rev":"1-def"}]},
    {"seq":3,"id":"doc1","changes":[{"rev":"2-ghi"}]}
  ],
  "last_seq": 3
}

$ curl "http://localhost:5984/mydb/_changes?since=2"
{
  "results": [
    {"seq":3,"id":"doc1","changes":[{"rev":"2-ghi"}]}
  ],
  "last_seq": 3
}

# Long-polling (waits for new changes)
$ curl "http://localhost:5984/mydb/_changes?feed=longpoll&since=3"
# ... request hangs until a change occurs ...
# In another terminal:
$ curl -X PUT http://localhost:5984/mydb/doc3 -d '{"x":1}'
# First terminal receives:
{
  "results": [
    {"seq":4,"id":"doc3","changes":[{"rev":"1-xyz"}]}
  ],
  "last_seq": 4
}

# Continuous feed (streaming)
$ curl "http://localhost:5984/mydb/_changes?feed=continuous&since=0"
{"seq":1,"id":"doc1","changes":[{"rev":"1-abc"}]}
{"seq":2,"id":"doc2","changes":[{"rev":"1-def"}]}
{"seq":3,"id":"doc1","changes":[{"rev":"2-ghi"}]}
{"seq":4,"id":"doc3","changes":[{"rev":"1-xyz"}]}
# ... stays open, new lines appear as changes happen ...

Implementation Hints:

Data structures:

Global sequence counter (per database):
  uint64_t current_seq = 0;

On every document change:
  current_seq++;
  record = { seq: current_seq, doc_id, rev_id }
  by_seq_index.insert(current_seq, record)

By-sequence index (separate B-tree):
  key: sequence number
  value: { doc_id, rev_id, deleted }
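
A sketch of the per-change bookkeeping in C (ChangeRecord, record_change, and the commented index calls are illustrative):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t seq;            /* position in the database's change history */
    char     doc_id[256];
    char     rev_id[64];
    bool     deleted;        /* tombstones appear in the feed too */
} ChangeRecord;

/* Called from the document-update path after the write succeeds. */
uint64_t record_change(uint64_t* current_seq, const char* doc_id,
                       const char* rev_id, bool deleted) {
    ChangeRecord rec;
    rec.seq = ++(*current_seq);                     /* monotonic, per database */
    snprintf(rec.doc_id, sizeof rec.doc_id, "%s", doc_id);
    snprintf(rec.rev_id, sizeof rec.rev_id, "%s", rev_id);
    rec.deleted = deleted;

    /* by_seq_insert(rec.seq, &rec);        insert into the by-seq B-tree        */
    /* by_seq_delete(old_seq_of(doc_id));   keep one feed entry per document     */
    return rec.seq;
}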

Changes feed modes:

Normal (feed=normal, default):
  - Return all changes since `since` parameter
  - Return immediately, even if empty

Long-poll (feed=longpoll):
  - If changes exist since `since`, return immediately
  - If no changes, wait (block) until one occurs
  - Return after first change (or timeout)

Continuous (feed=continuous):
  - Stream changes as newline-delimited JSON
  - Never close connection
  - Send heartbeat every N seconds to keep alive

Implementation:

handle_changes(since, feed_type):
    if feed_type == "normal":
        return get_all_changes_since(since)

    if feed_type == "longpoll":
        changes = get_all_changes_since(since)
        if changes:
            return changes
        wait_for_change(timeout=60)  // Block here
        return get_all_changes_since(since)

    if feed_type == "continuous":
        send_all_changes_since(since)
        while true:
            wait_for_change()
            send_new_change()
            send_heartbeat()  // Empty line every 10s
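
One way to implement the "wait" is a condition variable that the write path signals after bumping the sequence number; this sketch assumes a per-database ChangesNotifier (all names illustrative):

#include <pthread.h>
#include <stdint.h>
#include <time.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  changed;
    uint64_t        current_seq;
} ChangesNotifier;

/* Writer side: call after committing a change. */
void notify_change(ChangesNotifier* n, uint64_t new_seq) {
    pthread_mutex_lock(&n->lock);
    n->current_seq = new_seq;
    pthread_cond_broadcast(&n->changed);        /* wake every waiting feed */
    pthread_mutex_unlock(&n->lock);
}

/* Long-poll side: block until the sequence moves past `since` or the timeout
 * expires; returns the sequence observed (equal to `since` on timeout). */
uint64_t wait_for_change(ChangesNotifier* n, uint64_t since, int timeout_secs) {
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_secs;

    pthread_mutex_lock(&n->lock);
    while (n->current_seq <= since) {
        if (pthread_cond_timedwait(&n->changed, &n->lock, &deadline) != 0)
            break;                              /* timed out: return what we have */
    }
    uint64_t seq = n->current_seq;
    pthread_mutex_unlock(&n->lock);
    return seq;
}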

Think about:

  • What happens if two changes happen while client is disconnected?
  • How do you implement the “wait” efficiently? (Condition variable, poll)
  • How do you handle the include_docs=true parameter?

Learning milestones:

  1. Sequence numbers increment correctly → You understand change tracking
  2. By-seq index works → You understand secondary indexes
  3. Normal feed returns history → You understand incremental queries
  4. Long-poll waits and notifies → You understand server push

Project 6: MapReduce View Engine

  • File: LEARN_COUCHDB_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 4: Expert
  • Knowledge Area: MapReduce / Indexing / Query Processing
  • Software or Tool: Building: View Engine
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A MapReduce view engine that runs user-defined JavaScript map/reduce functions, maintains B-tree indexes, and updates incrementally.

Why it teaches CouchDB: Views are how you query CouchDB beyond simple ID lookups. The map function transforms documents into key-value pairs; the reduce aggregates them. Understanding incremental index updates is key to CouchDB’s performance.

Core challenges you’ll face:

  • Embedding a JavaScript engine → maps to language integration
  • Incremental view updates → maps to efficient indexing
  • B-tree storage for view results → maps to secondary indexes
  • Reduce tree computation → maps to aggregation strategies

Resources for key challenges:

  • Duktape - Embeddable JavaScript engine in C
  • QuickJS - Small, fast JS engine

Difficulty: Expert
Time estimate: 3 weeks
Prerequisites:

  • Project 3 completed (B-tree storage)
  • Basic understanding of MapReduce
  • Familiarity with JavaScript (for writing map functions)

Real world outcome:

# Create a design document with views
$ curl -X PUT http://localhost:5984/mydb/_design/app \
  -H "Content-Type: application/json" \
  -d '{
    "views": {
      "by_age": {
        "map": "function(doc) { if(doc.age) emit(doc.age, doc.name); }"
      },
      "age_stats": {
        "map": "function(doc) { if(doc.age) emit(null, doc.age); }",
        "reduce": "_stats"
      }
    }
  }'
{"ok":true,"id":"_design/app","rev":"1-abc"}

# Query the view
$ curl "http://localhost:5984/mydb/_design/app/_view/by_age"
{
  "total_rows": 3,
  "offset": 0,
  "rows": [
    {"id":"user2","key":25,"value":"Bob"},
    {"id":"user1","key":30,"value":"Alice"},
    {"id":"user3","key":35,"value":"Carol"}
  ]
}

# Query with range
$ curl "http://localhost:5984/mydb/_design/app/_view/by_age?startkey=28&endkey=32"
{
  "rows": [
    {"id":"user1","key":30,"value":"Alice"}
  ]
}

# Query with reduce
$ curl "http://localhost:5984/mydb/_design/app/_view/age_stats?reduce=true"
{
  "rows": [
    {"key":null,"value":{"sum":90,"count":3,"min":25,"max":35,"avg":30}}
  ]
}

Implementation Hints:

View architecture:

Design Document (_design/app):
{
  "views": {
    "view_name": {
      "map": "function(doc) { emit(key, value); }",
      "reduce": "_sum" // or "_count", "_stats", or custom function
    }
  }
}

View B-tree (separate file per view):
  key: [emitted_key, doc_id]  // Compound key for sorting
  value: emitted_value

Map function execution:

For each document:
    result = js_engine.call("map", doc)
    for each emit(key, value) in result:
        view_btree.insert([key, doc.id], value)
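
A minimal sketch of the map call using Duktape (one of the engines listed above); run_map and native_emit are illustrative, and the emit handler just prints instead of inserting into the view B-tree:

#include <stdio.h>
#include "duktape.h"

/* Native emit(key, value): in real code, insert into the view B-tree. */
static duk_ret_t native_emit(duk_context* ctx) {
    const char* key = duk_json_encode(ctx, 0);
    const char* val = duk_json_encode(ctx, 1);
    printf("emit key=%s value=%s\n", key, val);
    return 0;
}

/* Run a user-supplied map function source against one document (JSON text). */
int run_map(const char* map_src, const char* doc_json) {
    duk_context* ctx = duk_create_heap_default();
    if (!ctx) return -1;

    duk_push_c_function(ctx, native_emit, 2);
    duk_put_global_string(ctx, "emit");          /* expose emit() to the map fn  */

    duk_push_sprintf(ctx, "(%s)", map_src);      /* "(function(doc){ ... })"     */
    if (duk_peval(ctx) != 0) goto fail;          /* compile error in the map fn  */

    duk_push_string(ctx, doc_json);
    duk_json_decode(ctx, -1);                    /* JSON text -> JS object       */
    if (duk_pcall(ctx, 1) != 0) goto fail;       /* map(doc); emits via emit()   */

    duk_destroy_heap(ctx);
    return 0;
fail:
    fprintf(stderr, "map error: %s\n", duk_safe_to_string(ctx, -1));
    duk_destroy_heap(ctx);
    return -1;
}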

Incremental updates:

view_update(view, last_seq):
    changes = db.changes_since(view.last_indexed_seq)
    for change in changes:
        old_emits = view.get_by_docid(change.doc_id)
        view.delete(old_emits)  // Remove old entries

        if not change.deleted:
            doc = db.get(change.doc_id)
            new_emits = run_map_function(doc)
            for emit in new_emits:
                view.insert([emit.key, doc.id], emit.value)

    view.last_indexed_seq = changes.last_seq

Built-in reduce functions:

_count: Return count of values
_sum: Return sum of numeric values
_stats: Return {sum, count, min, max, avg}
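
As a reference point, the _stats fold is just a running aggregate; a C sketch (the rereduce pass would fold previously computed Stats records the same way):

typedef struct {
    double sum, min, max;
    long   count;
} Stats;

/* Fold the numeric values emitted by the map step into one Stats record. */
Stats stats_reduce(const double* values, long n) {
    Stats s = { 0.0, 0.0, 0.0, 0 };
    if (n == 0) return s;
    s.min = s.max = values[0];
    for (long i = 0; i < n; i++) {
        s.sum += values[i];
        if (values[i] < s.min) s.min = values[i];
        if (values[i] > s.max) s.max = values[i];
        s.count++;
    }
    return s;        /* avg = sum / count, computed when the row is returned */
}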

Custom reduce:
function(keys, values, rereduce) {
    if (rereduce) {
        // values are previous reduce results
        return values.reduce((a, b) => a + b, 0);
    } else {
        // values are from map function
        return values.length;
    }
}

Think about:

  • How do you isolate the JavaScript sandbox?
  • What if a map function throws an error?
  • How do you handle the rereduce flag in reduce functions?
  • When do you update the view? (On query, or background?)

Learning milestones:

  1. Map function emits work → You understand JS integration
  2. Views are queryable by key → You understand view B-trees
  3. Incremental updates work → You understand efficient indexing
  4. Reduce aggregates correctly → You understand reduce trees

Project 7: Database Compaction

  • File: LEARN_COUCHDB_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Storage Management / Garbage Collection
  • Software or Tool: Building: Compactor
  • Main Book: “Database Internals” by Alex Petrov

What you’ll build: A compaction system that reclaims space from the append-only file by copying live data to a new file and discarding obsolete revisions.

Why it teaches CouchDB: Append-only storage is great for durability but terrible for space. Without compaction, your database file grows forever. CouchDB’s compaction copies only current revisions to a new file, then swaps it in atomically.

Core challenges you’ll face:

  • Identifying live vs. dead data → maps to garbage identification
  • Online compaction (while serving requests) → maps to concurrent operations
  • Atomic file swap → maps to safe transitions
  • Space estimation and scheduling → maps to resource management

Key Concepts:

  • Log-Structured Storage Compaction: “Designing Data-Intensive Applications” Chapter 3 - Martin Kleppmann
  • Online Compaction: Understanding Data Compaction
  • Garbage Collection: “Database Internals” Chapter 5 - Alex Petrov
  • Atomic File Operations: rename() syscall semantics

Difficulty: Advanced
Time estimate: 1-2 weeks
Prerequisites:

  • Project 3 completed (append-only storage)
  • Understanding of file system operations
  • Concurrent programming basics

Real world outcome:

$ ls -lh mydb.couch
-rw-r--r--  1 user  staff  1.2G  Dec 21 10:00 mydb.couch

$ curl -X POST http://localhost:5984/mydb/_compact
{"ok":true}

$ curl http://localhost:5984/mydb
{
  "db_name": "mydb",
  "doc_count": 10000,
  "disk_size": 1288490188,
  "compact_running": true,
  ...
}

# Wait for compaction to complete...
$ curl http://localhost:5984/mydb
{
  "db_name": "mydb",
  "doc_count": 10000,
  "disk_size": 52428800,
  "compact_running": false,
  ...
}

$ ls -lh mydb.couch
-rw-r--r--  1 user  staff   50M  Dec 21 10:05 mydb.couch

# Space reclaimed! 1.2GB → 50MB

Implementation Hints:

Compaction algorithm:

compact(db_file):
    new_file = create_temp_file("db.compact")

    for doc_id in db.all_doc_ids():
        doc = db.get(doc_id)  // Gets winning revision only
        if not doc.deleted:
            new_file.insert(doc_id, doc)

    // Copy views too (only current index entries)
    for view in db.views:
        for entry in view:
            new_file.view_insert(entry)

    // Atomic swap
    rename(new_file, db_file)  // Atomic on POSIX

    // Old file is now unlinked
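
A sketch of that final swap in C, assuming the compacted copy lives in the same directory: fsync the new file, rename it over the old path, then fsync the directory so the rename itself is durable (swap_in_compacted is an illustrative name):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int swap_in_compacted(const char* tmp_path, const char* db_path, const char* dir_path) {
    int fd = open(tmp_path, O_RDONLY);
    if (fd < 0) return -1;
    if (fsync(fd) != 0) { close(fd); return -1; }  /* compacted data is durable  */
    close(fd);

    if (rename(tmp_path, db_path) != 0) return -1; /* atomic replacement (POSIX) */

    int dirfd = open(dir_path, O_RDONLY | O_DIRECTORY);
    if (dirfd < 0) return -1;
    int rc = fsync(dirfd);                         /* persist the rename itself  */
    close(dirfd);
    return rc;
}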

Online compaction (while serving reads):

During compaction:
  - Reads go to OLD file (consistent snapshot)
  - Writes go to the OLD file and are also tracked, so they can be replayed into the new file

After compaction:
  - Apply pending writes to new file
  - Atomic swap
  - New reads go to new file

What to keep vs. discard:

KEEP:
  - Current winning revision of each document
  - Conflict revisions (still unresolved)
  - Local documents (_local/*)

DISCARD:
  - Old revisions (superseded)
  - Deleted document tombstones older than threshold
  - Orphaned B-tree nodes

Think about:

  • What if a write happens during compaction?
  • How do you handle views during compaction?
  • What’s the disk space requirement? (2x temporarily)
  • How do you schedule compaction? (Fragmentation threshold)

Learning milestones:

  1. Compaction creates smaller file → You understand garbage collection
  2. No data is lost → You understand live data identification
  3. Swap is atomic → You understand safe transitions
  4. Works while serving requests → You understand online compaction

Project 8: Mango Query Engine (JSON Queries)

  • File: LEARN_COUCHDB_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Query Processing / Indexing
  • Software or Tool: Building: Query Engine
  • Main Book: “Database Internals” by Alex Petrov

What you’ll build: A Mango-style query engine that accepts JSON selector queries (like MongoDB) and executes them efficiently using indexes.

Why it teaches CouchDB: MapReduce views are powerful but require writing JavaScript. Mango queries let you query with JSON: {"age": {"$gt": 30}}. This project teaches query parsing, index selection, and execution planning.

Core challenges you’ll face:

  • Query syntax parsing → maps to DSL implementation
  • Index selection → maps to query optimization
  • Selector evaluation → maps to predicate matching
  • Combining results → maps to query execution

Key Concepts:

  • Mango Queries: CouchDB Find API
  • Query Selectors: MongoDB-style operators ($eq, $gt, $in, etc.)
  • Index Selection: “Database Internals” Chapter 10 - Alex Petrov
  • Query Planning: Choosing between index scan vs. full scan

Difficulty: Advanced
Time estimate: 2 weeks
Prerequisites:

  • Project 6 completed (views for indexing)
  • Project 1 completed (JSON parsing)
  • Understanding of query execution

Real world outcome:

# Create an index
$ curl -X POST http://localhost:5984/mydb/_index \
  -H "Content-Type: application/json" \
  -d '{"index": {"fields": ["age"]}, "name": "age-index"}'
{"result":"created","id":"_design/age-index","name":"age-index"}

# Query with selector
$ curl -X POST http://localhost:5984/mydb/_find \
  -H "Content-Type: application/json" \
  -d '{
    "selector": {
      "age": {"$gt": 25},
      "status": "active"
    },
    "fields": ["_id", "name", "age"],
    "sort": [{"age": "asc"}],
    "limit": 10
  }'
{
  "docs": [
    {"_id":"user2","name":"Bob","age":28},
    {"_id":"user1","name":"Alice","age":30},
    {"_id":"user3","name":"Carol","age":35}
  ],
  "bookmark": "g1AAAA...",
  "warning": "No matching index found, create an index for status"
}

# Explain query plan
$ curl -X POST http://localhost:5984/mydb/_explain \
  -H "Content-Type: application/json" \
  -d '{"selector": {"age": {"$gt": 25}}}'
{
  "index": {
    "ddoc": "_design/age-index",
    "name": "age-index",
    "type": "json",
    "fields": [{"age":"asc"}]
  },
  "selector": {"age":{"$gt":25}},
  "range": {
    "start_key": [25],
    "end_key": [{}]
  }
}

Implementation Hints:

Selector operators:

Comparison:
  $eq   - Equal
  $ne   - Not equal
  $gt   - Greater than
  $gte  - Greater than or equal
  $lt   - Less than
  $lte  - Less than or equal

Logical:
  $and  - All must match
  $or   - At least one must match
  $not  - Negation

Array:
  $in   - Value in array
  $nin  - Value not in array

Existence:
  $exists - Field exists

Query execution:

execute_find(selector, fields, sort, limit):
    // 1. Parse selector
    parsed = parse_selector(selector)

    // 2. Find usable index
    index = find_best_index(parsed, sort)

    // 3. Execute
    if index:
        candidates = index.range_scan(parsed.key_range)
    else:
        candidates = full_table_scan()  // Slow!
        emit_warning("No matching index")

    // 4. Filter (for predicates not covered by index)
    results = []
    for doc in candidates:
        if matches_selector(doc, parsed):
            results.append(project(doc, fields))
            if index_covers_sort and len(results) >= limit:
                break  // Early exit is safe only when index order matches the sort

    // 5. Sort (if not already sorted by index), then apply the limit
    if sort and not index_covers_sort:
        results.sort(sort)

    return results[:limit]

Selector evaluation:

matches_selector(doc, selector):
    for field, condition in selector:
        value = doc.get(field)

        if is_operator(condition):
            if not evaluate_operator(value, condition):
                return false
        else:
            if value != condition:  // Implicit $eq
                return false

    return true

evaluate_operator(value, {$gt: 25}):
    return value > 25
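
A C sketch of operator evaluation against the JsonValue type from Project 1 (numeric comparisons only; other types and operators follow the same pattern):

#include <stdbool.h>
#include <string.h>

/* Evaluate one {"$op": operand} condition against a document field value. */
bool evaluate_operator(const JsonValue* value, const char* op, const JsonValue* operand) {
    if (!value || value->type != JSON_NUMBER || operand->type != JSON_NUMBER)
        return false;                      /* a missing field never matches */

    double v = value->number, o = operand->number;
    if (strcmp(op, "$eq")  == 0) return v == o;
    if (strcmp(op, "$ne")  == 0) return v != o;
    if (strcmp(op, "$gt")  == 0) return v >  o;
    if (strcmp(op, "$gte") == 0) return v >= o;
    if (strcmp(op, "$lt")  == 0) return v <  o;
    if (strcmp(op, "$lte") == 0) return v <= o;
    return false;                          /* unknown operator */
}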

Think about:

  • How do you handle nested fields? ("address.city": "NYC")
  • What if no index exists? (Full scan with warning)
  • How do you handle $or with indexes?
  • What’s the bookmark for pagination?

Learning milestones:

  1. Basic selectors work → You understand predicate matching
  2. Indexes are used when available → You understand query optimization
  3. Complex queries with $and/$or work → You understand logical operators
  4. _explain shows the plan → You understand query planning

Project 9: Document Replication

  • File: LEARN_COUCHDB_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: Distributed Systems / Replication
  • Software or Tool: Building: Replicator
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A replication engine that syncs documents between CouchDB instances, handling conflicts deterministically—the heart of CouchDB’s distributed nature.

Why it teaches CouchDB: Replication is CouchDB’s killer feature. Two databases can be offline, both receive updates, and later sync without coordination. Conflicts are detected and the same winner is chosen everywhere. This project teaches distributed systems concepts.

Core challenges you’ll face:

  • Comparing revision trees → maps to diff algorithms
  • Transferring missing revisions → maps to efficient sync
  • Deterministic conflict resolution → maps to consensus-free consistency
  • Checkpoint management → maps to sync resume

Difficulty: Expert
Time estimate: 3 weeks
Prerequisites:

  • Project 4-5 completed (HTTP + changes feed)
  • Project 2 completed (revision trees)
  • Understanding of distributed systems basics

Real world outcome:

# Start two CouchDB instances
$ ./couchdb_server --port 5984 --data ./db1 &
$ ./couchdb_server --port 5985 --data ./db2 &

# Create same database on both
$ curl -X PUT http://localhost:5984/shared
$ curl -X PUT http://localhost:5985/shared

# Add document to first instance
$ curl -X PUT http://localhost:5984/shared/doc1 -d '{"value":"from db1"}'
{"ok":true,"id":"doc1","rev":"1-abc"}

# Replicate from db1 to db2
$ curl -X POST http://localhost:5984/_replicate \
  -H "Content-Type: application/json" \
  -d '{"source":"http://localhost:5984/shared","target":"http://localhost:5985/shared"}'
{
  "ok":true,
  "docs_read":1,
  "docs_written":1,
  "missing_checked":1,
  "missing_found":1
}

# Document now exists on db2
$ curl http://localhost:5985/shared/doc1
{"_id":"doc1","_rev":"1-abc","value":"from db1"}

# Create conflict: Update on both while "offline"
$ curl -X PUT http://localhost:5984/shared/doc1 -d '{"_rev":"1-abc","value":"updated on db1"}'
$ curl -X PUT http://localhost:5985/shared/doc1 -d '{"_rev":"1-abc","value":"updated on db2"}'

# Replicate again - conflict detected
$ curl -X POST http://localhost:5984/_replicate \
  -d '{"source":"http://localhost:5985/shared","target":"http://localhost:5984/shared"}'

# Check for conflicts
$ curl "http://localhost:5984/shared/doc1?conflicts=true"
{
  "_id":"doc1",
  "_rev":"2-xyz",
  "value":"updated on db2",
  "_conflicts":["2-abc"]
}
# Same winner (2-xyz) chosen deterministically on both!

Implementation Hints:

Replication algorithm:

replicate(source, target, since=0):
    // 1. Get changes from source
    changes = source.get_changes(since=since)

    // 2. Find what target is missing
    revs_to_check = [(c.id, c.rev) for c in changes]
    missing = target.revs_diff(revs_to_check)

    // 3. Fetch missing docs with revision history
    for doc_id, missing_revs in missing:
        doc = source.get(doc_id, revs=True, open_revs=missing_revs)
        target.bulk_docs([doc], new_edits=False)

    // 4. Save checkpoint
    save_checkpoint(target, changes.last_seq)

Revision diff (_revs_diff):

Input: {"doc1": ["2-abc", "3-def"], "doc2": ["1-xyz"]}
Output: {"doc1": {"missing": ["3-def"]}}  // Has 2-abc, needs 3-def

Deterministic conflict resolution:

When conflicts exist, the winner is determined by:
1. Longest revision path wins (more edits = more recent)
2. If tie, lexicographically highest revision ID wins

This is deterministic: all replicas pick the same winner
without any coordination!

Example:
  Revisions: "2-abc", "2-xyz"
  Same length (2), compare strings: "xyz" > "abc"
  Winner: "2-xyz"
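
The comparison itself is a few lines of C (compare_revs is illustrative): parse the leading generation number, compare numerically, and break ties with a plain string comparison.

#include <stdlib.h>
#include <string.h>

/* Return >0 if rev_a wins, <0 if rev_b wins, 0 if identical.
 * Revision ids look like "2-xyz": generation number, dash, hash suffix. */
int compare_revs(const char* rev_a, const char* rev_b) {
    long gen_a = strtol(rev_a, NULL, 10);
    long gen_b = strtol(rev_b, NULL, 10);
    if (gen_a != gen_b)
        return (gen_a > gen_b) ? 1 : -1;   /* longer edit history wins          */
    return strcmp(rev_a, rev_b);           /* tie: lexicographically higher id  */
}

/* compare_revs("2-xyz", "2-abc") > 0, so every replica picks "2-xyz". */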

Bulk docs with new_edits=false:

Normal insert: CouchDB generates new revision
With new_edits=false: Use provided revision (for replication)

This allows inserting historical revisions from another database.

Think about:

  • How do you handle network failures during replication?
  • What if source is updated during replication?
  • How do you detect and resolve conflicts programmatically?
  • What’s continuous replication vs. one-shot?

Learning milestones:

  1. One-shot replication works → You understand sync protocol
  2. Missing revisions are identified → You understand revs_diff
  3. Conflicts are detected → You understand conflict creation
  4. Same winner everywhere → You understand deterministic resolution

Project 10: Authentication and Security

  • File: LEARN_COUCHDB_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Security / Authentication / Authorization
  • Software or Tool: Building: Auth System
  • Main Book: “Foundations of Information Security” by Jason Andress

What you’ll build: An authentication system with users, roles, and per-database access control—making your database production-ready.

Why it teaches CouchDB: A database without auth is useless in production. CouchDB has a flexible system with admin users, regular users, database-level permissions, and design document validation. This project teaches security fundamentals.

Core challenges you’ll face:

  • Password hashing → maps to secure credential storage
  • Session management → maps to authentication state
  • Role-based access control → maps to authorization
  • Validate functions → maps to data integrity

Key Concepts:

  • Password Hashing: bcrypt or PBKDF2
  • Cookie Sessions: CouchDB Authentication
  • Authorization: Per-database security objects
  • Validation Functions: Design document validate_doc_update

Difficulty: Advanced
Time estimate: 1-2 weeks
Prerequisites:

  • Project 4 completed (HTTP server)
  • Basic security knowledge
  • Understanding of HTTP cookies

Real world outcome:

# Create admin user (first user becomes admin)
$ curl -X PUT http://localhost:5984/_users/org.couchdb.user:admin \
  -H "Content-Type: application/json" \
  -d '{"name":"admin","password":"secret123","roles":["_admin"],"type":"user"}'

# Try to access without auth
$ curl http://localhost:5984/mydb/doc1
{"error":"unauthorized","reason":"You are not authorized."}

# Login (cookie auth)
$ curl -X POST http://localhost:5984/_session \
  -H "Content-Type: application/json" \
  -d '{"name":"admin","password":"secret123"}' \
  -c cookies.txt
{"ok":true,"name":"admin","roles":["_admin"]}

# Access with session cookie
$ curl -b cookies.txt http://localhost:5984/mydb/doc1
{"_id":"doc1","_rev":"1-abc","name":"Alice"}

# Set database security
$ curl -X PUT http://localhost:5984/mydb/_security \
  -b cookies.txt \
  -H "Content-Type: application/json" \
  -d '{"admins":{"names":["admin"]},"members":{"roles":["users"]}}'
{"ok":true}

# Create validate function (in design doc)
$ curl -X PUT http://localhost:5984/mydb/_design/validation \
  -b cookies.txt \
  -d '{
    "validate_doc_update": "function(newDoc, oldDoc, userCtx) {
      if(!newDoc.name) throw({forbidden:\"name required\"});
    }"
  }'

# Try to insert invalid document
$ curl -X PUT http://localhost:5984/mydb/invalid \
  -b cookies.txt \
  -d '{"age":30}'
{"error":"forbidden","reason":"name required"}

Implementation Hints:

User document structure:

{
  "_id": "org.couchdb.user:alice",
  "type": "user",
  "name": "alice",
  "roles": ["editor", "viewer"],
  "password_scheme": "pbkdf2",
  "derived_key": "abc123...",
  "salt": "randomsalt",
  "iterations": 10000
}

Password verification:

verify_password(provided, user_doc):
    derived = pbkdf2(
        provided,
        user_doc.salt,
        user_doc.iterations
    )
    return constant_time_compare(derived, user_doc.derived_key)
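
A sketch of the same check in C, assuming OpenSSL for PBKDF2-HMAC-SHA1 and its constant-time comparison (the 20-byte key length matches SHA-1 output and is only an example; link with -lcrypto):

#include <openssl/crypto.h>   /* CRYPTO_memcmp */
#include <openssl/evp.h>      /* PKCS5_PBKDF2_HMAC, EVP_sha1 */
#include <string.h>

#define DERIVED_KEY_LEN 20

int verify_password(const char* provided,
                    const unsigned char* salt, int salt_len,
                    int iterations,
                    const unsigned char* stored_key) {
    unsigned char derived[DERIVED_KEY_LEN];

    if (PKCS5_PBKDF2_HMAC(provided, (int)strlen(provided),
                          salt, salt_len, iterations,
                          EVP_sha1(), DERIVED_KEY_LEN, derived) != 1)
        return 0;

    /* constant-time compare: does not leak how many leading bytes matched */
    return CRYPTO_memcmp(derived, stored_key, DERIVED_KEY_LEN) == 0;
}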

Database security object:

{
  "admins": {
    "names": ["admin"],      // Specific users
    "roles": ["db_admins"]   // Users with these roles
  },
  "members": {
    "names": [],
    "roles": ["users"]       // Who can read
  }
}

Access rules:
- Admins can do everything
- Members can read and write
- Non-members get 401/403
- Empty security = public database

Authentication flow:

Every request:
1. Check for Authorization header (Basic auth)
2. Check for session cookie
3. If authenticated, set userCtx
4. Check database security against userCtx
5. If write, run validate_doc_update functions

Think about:

  • How do you handle cookie expiration?
  • What’s the difference between 401 and 403?
  • How do validate functions get user context?
  • How do you protect against timing attacks?

Learning milestones:

  1. Password hashing works → You understand secure credential storage
  2. Sessions persist across requests → You understand cookie auth
  3. Database-level permissions work → You understand authorization
  4. Validate functions reject bad data → You understand data validation

Project 11: Clustering and Sharding

  • File: LEARN_COUCHDB_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go, Erlang
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 5. The “Industry Disruptor”
  • Difficulty: Level 5: Master
  • Knowledge Area: Distributed Systems / Clustering
  • Software or Tool: Building: Clustered Database
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A clustered CouchDB that shards data across multiple nodes, with quorum reads/writes and automatic failover.

Why it teaches CouchDB: CouchDB 2.0+ supports clustering. Data is split into shards, replicated across nodes. Reads and writes require a quorum. This is serious distributed systems engineering—coordination, consistency, and failure handling.

Core challenges you’ll face:

  • Consistent hashing for shard placement → maps to data distribution
  • Quorum reads and writes → maps to consistency guarantees
  • Node membership and failure detection → maps to cluster management
  • Request routing → maps to distributed query execution

Key Concepts:

  • Consistent Hashing: “Designing Data-Intensive Applications” Chapter 6 - Martin Kleppmann
  • Quorum Consensus: “Designing Data-Intensive Applications” Chapter 5 - Martin Kleppmann
  • Cluster Membership: Gossip protocols, failure detection
  • Shard Routing: Document ID to shard mapping

Difficulty: Master
Time estimate: 1 month+
Prerequisites:

  • All previous projects completed
  • Strong distributed systems knowledge
  • Network programming experience

Real world outcome:

# Start 3-node cluster
$ ./couchdb_server --port 5984 --node node1 --cluster "node1,node2,node3"
$ ./couchdb_server --port 5985 --node node2 --cluster "node1,node2,node3"
$ ./couchdb_server --port 5986 --node node3 --cluster "node1,node2,node3"

# Check cluster status
$ curl http://localhost:5984/_membership
{
  "cluster_nodes": ["node1", "node2", "node3"],
  "all_nodes": ["node1", "node2", "node3"]
}

# Create sharded database
$ curl -X PUT "http://localhost:5984/mydb?q=4&n=2"
{"ok":true}
# q=4 shards, n=2 replicas per shard

# Insert document (goes to appropriate shard)
$ curl -X PUT http://localhost:5984/mydb/user_123 -d '{"name":"Alice"}'
{"ok":true,"id":"user_123","rev":"1-abc"}

# Read works from any node (routes to correct shard)
$ curl http://localhost:5985/mydb/user_123
{"_id":"user_123","_rev":"1-abc","name":"Alice"}

# Kill a node
$ kill -9 $(pgrep -f "node2")

# Writes still succeed as long as enough replicas of the target shard survive
# to meet the write quorum; node2 syncs the missed updates when it rejoins
$ curl -X PUT http://localhost:5984/mydb/user_456 -d '{"name":"Bob"}'
{"ok":true}  # Success! Write quorum reached on the surviving nodes

# Node rejoins and syncs
$ ./couchdb_server --port 5985 --node node2 --cluster "node1,node2,node3"
# Automatic sync of missed writes...

Implementation Hints:

Sharding scheme:

Database "mydb" with q=4 shards:
  Shard 0: hash range 0x00000000-0x3FFFFFFF
  Shard 1: hash range 0x40000000-0x7FFFFFFF
  Shard 2: hash range 0x80000000-0xBFFFFFFF
  Shard 3: hash range 0xC0000000-0xFFFFFFFF

Document placement:
  shard_num = crc32(doc_id) % q
  nodes = get_nodes_for_shard(shard_num)  // n replicas
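
A sketch of the placement function, assuming zlib's crc32 as the hash and simple modulo placement over q shards (mapping the hash into the ranges above works the same way; link with -lz):

#include <string.h>
#include <zlib.h>

/* Map a document id to one of q shards; every node computes the same answer. */
unsigned shard_for_doc(const char* doc_id, unsigned q) {
    uLong hash = crc32(0L, (const unsigned char*)doc_id, (uInt)strlen(doc_id));
    return (unsigned)(hash % q);
}

/* shard_for_doc("user_123", 4) is the same on node1, node2, and node3,
 * so any node can route the request without coordination. */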

Quorum configuration:

n = number of replicas
r = read quorum (how many replicas must respond for read)
w = write quorum (how many replicas must acknowledge write)

Default: n=3, r=2, w=2
Rule: r + w > n guarantees overlap (see latest write)

Write path:

write_document(doc):
    shard = get_shard(doc.id)
    nodes = get_nodes_for_shard(shard)

    responses = parallel_write(nodes, doc)
    success_count = count_successes(responses)

    if success_count >= w:
        return success
    else:
        return error("quorum not reached")

Read path:

read_document(doc_id):
    shard = get_shard(doc_id)
    nodes = get_nodes_for_shard(shard)

    responses = parallel_read(nodes, doc_id)

    // Wait for r responses
    docs = collect_until(responses, count=r)

    // Return latest revision
    return pick_winner(docs)

Think about:

  • How do you handle split-brain scenarios?
  • What if a write succeeds on < w nodes?
  • How do you rebalance when adding/removing nodes?
  • How do you route changes feed queries to all shards?

Learning milestones:

  1. Documents are sharded correctly → You understand consistent hashing
  2. Quorum reads/writes work → You understand distributed consistency
  3. Node failure is handled → You understand fault tolerance
  4. Cluster rebalances → You understand distributed operations

Project 12: Attachments and Binary Storage

  • File: LEARN_COUCHDB_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Binary Data / HTTP Multipart
  • Software or Tool: Building: Attachment Storage
  • Main Book: “HTTP: The Definitive Guide” by David Gourley

What you’ll build: Support for binary attachments (images, PDFs, etc.) stored alongside JSON documents with content-type handling and streaming.

Why it teaches CouchDB: CouchDB can store files as attachments to documents. This is useful for CouchApps (web apps served from CouchDB) and any application that needs to associate files with data. It teaches binary storage and HTTP multipart handling.

Core challenges you’ll face:

  • Binary data in append-only storage → maps to blob storage
  • HTTP multipart requests → maps to complex request parsing
  • Content-Type handling → maps to MIME types
  • Streaming large files → maps to efficient I/O

Key Concepts:

  • HTTP Multipart: RFC 2046 - Multipart content types
  • MIME Types: Mapping file extensions to content types
  • Streaming I/O: Chunked transfer encoding
  • Content-Addressable Storage: Storing attachments by hash

Difficulty: Intermediate
Time estimate: 1 week
Prerequisites:

  • Project 4 completed (HTTP server)
  • Project 3 completed (storage engine)
  • Understanding of binary file handling

Real world outcome:

# Upload attachment
$ curl -X PUT http://localhost:5984/mydb/doc1/photo.jpg \
  -H "Content-Type: image/jpeg" \
  --data-binary @photo.jpg
{"ok":true,"id":"doc1","rev":"2-xyz"}

# Document now has _attachments
$ curl http://localhost:5984/mydb/doc1
{
  "_id": "doc1",
  "_rev": "2-xyz",
  "name": "Alice",
  "_attachments": {
    "photo.jpg": {
      "content_type": "image/jpeg",
      "length": 45123,
      "digest": "md5-abc123...",
      "stub": true
    }
  }
}

# Download attachment
$ curl http://localhost:5984/mydb/doc1/photo.jpg > downloaded.jpg
# Returns binary data with Content-Type: image/jpeg

# Inline attachment (base64 in document)
$ curl -X PUT http://localhost:5984/mydb/doc2 \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Bob",
    "_attachments": {
      "note.txt": {
        "content_type": "text/plain",
        "data": "SGVsbG8gV29ybGQh"
      }
    }
  }'

Implementation Hints:

Attachment storage:

Option 1: Inline in document B-tree
  - Good for small attachments
  - Bad for large files (copied on every doc update)

Option 2: Separate attachment store (like CouchDB)
  - Content-addressable: key = md5(content)
  - Document stores reference to attachment
  - Attachment deduplicated automatically

Document with attachments:

{
  "_id": "doc1",
  "_rev": "2-xyz",
  "_attachments": {
    "photo.jpg": {
      "content_type": "image/jpeg",
      "length": 45123,
      "digest": "md5-abc123def456",
      "stub": true,        // Indicates attachment not inline
      "revpos": 2          // Revision when added
    }
  }
}

Upload handling:

PUT /{db}/{doc_id}/{attachment_name}
Content-Type: image/jpeg

[binary data]

Steps:
1. Read binary body
2. Compute MD5 digest
3. Store in attachment store (if not exists)
4. Update document with attachment metadata
5. Increment document revision

Download handling:

GET /{db}/{doc_id}/{attachment_name}

Steps:
1. Get document
2. Find attachment metadata
3. Look up in attachment store by digest
4. Stream binary data with correct Content-Type

Think about:

  • What if the same file is attached to multiple documents?
  • How do you handle very large files (1GB+)?
  • What about range requests for partial downloads?
  • How do attachments interact with replication?

Learning milestones:

  1. Attachments upload and download → You understand binary storage
  2. Content-Type is correct → You understand MIME handling
  3. Large files stream efficiently → You understand I/O optimization
  4. Attachments replicate → You understand binary replication

Project 13: Full-Text Search Integration

  • File: LEARN_COUCHDB_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Information Retrieval / Search
  • Software or Tool: Building: Search Index
  • Main Book: “Introduction to Information Retrieval” by Manning et al.

What you’ll build: A full-text search engine integrated with your document database, enabling queries like “find documents containing ‘machine learning’”.

Why it teaches CouchDB: CouchDB integrates with Lucene for full-text search. While MapReduce views handle structured queries, text search requires inverted indexes and relevance scoring. This completes your database’s query capabilities.

Core challenges you’ll face:

  • Text tokenization and stemming → maps to text processing
  • Inverted index construction → maps to search data structures
  • TF-IDF scoring → maps to relevance ranking
  • Incremental index updates → maps to real-time search

Key Concepts:

  • Inverted Indexes: “Introduction to Information Retrieval” Chapter 1 - Manning et al.
  • TF-IDF Scoring: “Introduction to Information Retrieval” Chapter 6 - Manning et al.
  • Tokenization: Splitting text into searchable terms
  • Search Integration: CouchDB Search

Difficulty: Advanced Time estimate: 2 weeks Prerequisites:

  • Project 3 completed (B-tree storage)
  • Project 5 completed (changes feed for updates)
  • Basic text processing knowledge

Real world outcome:

# Create search index
$ curl -X PUT http://localhost:5984/mydb/_design/search \
  -H "Content-Type: application/json" \
  -d '{
    "indexes": {
      "content": {
        "analyzer": "standard",
        "index": "function(doc) { if(doc.body) index(\"body\", doc.body); if(doc.title) index(\"title\", doc.title, {boost: 2.0}); }"
      }
    }
  }'

# Search
$ curl "http://localhost:5984/mydb/_design/search/_search/content?q=machine+learning"
{
  "total_rows": 15,
  "rows": [
    {"id":"doc42","order":[1.234],"fields":{}},
    {"id":"doc17","order":[0.987],"fields":{}},
    ...
  ]
}

# Search with highlighting
$ curl "http://localhost:5984/mydb/_design/search/_search/content?q=title:database&include_docs=true&highlights=3"
{
  "rows": [
    {
      "id": "doc5",
      "doc": {"_id":"doc5","title":"Database Systems","body":"..."},
      "highlights": {"title":["<em>Database</em> Systems"]}
    }
  ]
}

# Faceted search
$ curl "http://localhost:5984/mydb/_design/search/_search/content?q=*:*&counts=[\"category\"]"
{
  "total_rows": 1000,
  "counts": {
    "category": {
      "tech": 450,
      "science": 320,
      "arts": 230
    }
  }
}

Implementation Hints:

Inverted index structure:

term -> [(doc_id, positions, tf), ...]

Example for "database":
"database" -> [
  (doc5, [0, 45], 2),    // appears at positions 0 and 45, 2 times
  (doc12, [23], 1),
  ...
]
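
One way to lay this out in C (a sketch, not CouchDB's or Lucene's actual on-disk format): each term owns a dynamic array of postings, kept sorted by doc_id so multi-term queries can merge posting lists cheaply.

/* Posting: one document's occurrences of one term. */
typedef struct {
    char *doc_id;        /* e.g. "doc5"                           */
    int  *positions;     /* word offsets where the term appears   */
    int   num_positions;
    int   tf;            /* term frequency within this document   */
} posting_t;

/* Posting list: everything the index knows about one term. */
typedef struct {
    char      *term;         /* e.g. "database"                   */
    posting_t *postings;     /* sorted by doc_id for fast merging */
    int        num_postings;
} posting_list_t;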

Index function execution:

For each document:
    run index function
    for each index(field, value, options) call:
        tokens = tokenize(value)
        for token in tokens:
            inverted_index[token].append((doc.id, position, options))
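
The tokenize() step above might look like this minimal sketch: lowercase the text and split on non-alphanumeric characters, invoking a callback per token (real analyzers also stem and drop stop words, which this skips).

/* Call emit(token, position, arg) for every token in text. */
#include <ctype.h>

void tokenize(const char *text,
              void (*emit)(const char *token, int position, void *arg),
              void *arg) {
    char token[64];
    int len = 0, pos = 0;
    for (const char *p = text; ; p++) {
        if (*p && isalnum((unsigned char)*p)) {
            if (len < (int)sizeof token - 1)
                token[len++] = (char)tolower((unsigned char)*p);
        } else if (len > 0) {
            token[len] = '\0';       /* end of a word: emit it */
            emit(token, pos++, arg);
            len = 0;
        }
        if (!*p)
            break;
    }
}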

TF-IDF scoring:

TF(term, doc) = frequency of term in document
IDF(term) = log(total_docs / docs_containing_term)
Score = TF * IDF

Higher score = term appears often in this doc, rarely overall
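
A tiny worked example using the natural log (numbers are illustrative only): with 1000 documents, a term appearing 3 times in a document and in 10 documents overall scores 3 * log(1000/10) ≈ 13.8, while a term found in 900 documents scores 3 * log(1000/900) ≈ 0.3. A direct translation of the formulas, with parameter names of our choosing:

/* TF-IDF score for one (term, document) pair. */
#include <math.h>

double tfidf(int tf_in_doc, int docs_containing_term, int total_docs) {
    if (docs_containing_term == 0)
        return 0.0;
    double idf = log((double)total_docs / (double)docs_containing_term);
    return (double)tf_in_doc * idf;
}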

Query processing:

search(query):
    terms = parse_and_tokenize(query)

    for term in terms:
        postings = inverted_index[term]
        for (doc_id, positions, tf) in postings:
            score = calculate_tfidf(term, doc_id)
            accumulator[doc_id] += score

    return sorted(accumulator.items(), by=score, descending)

Think about:

  • How do you handle phrases like “machine learning”?
  • How do you update the index incrementally?
  • What’s the storage overhead for positions?
  • How do you handle boolean queries (AND, OR, NOT)?

Learning milestones:

  1. Basic term search works → You understand inverted indexes
  2. Relevance scoring is meaningful → You understand TF-IDF
  3. Index updates incrementally → You understand real-time search
  4. Advanced queries work → You understand query parsing

Project 14: Complete CouchDB Clone

  • File: LEARN_COUCHDB_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 5. The “Industry Disruptor”
  • Difficulty: Level 5: Master
  • Knowledge Area: Complete Database Systems
  • Software or Tool: Building: Document Database
  • Main Book: “Database Internals” by Alex Petrov + “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A complete CouchDB-compatible database integrating all previous projects: append-only storage, HTTP API, MVCC, views, replication, auth, and more.

Why this is the ultimate goal: This is the capstone. You’ll integrate every component into a cohesive, production-quality system. The challenges of making components work together teach you as much as building them individually.

Core challenges you’ll face:

  • Component integration → maps to systems architecture
  • API compatibility → maps to protocol implementation
  • Performance tuning → maps to system optimization
  • Error handling across layers → maps to reliability engineering

Key Concepts:

  • All concepts from previous projects
  • Systems Integration: Making components work together
  • API Compatibility: Matching CouchDB’s behavior
  • Testing: Ensuring correctness and performance

Difficulty: Master Time estimate: 1 month+ Prerequisites:

  • All previous projects completed (or most of them)
  • Strong systems programming skills
  • Patience and determination

Real world outcome:

$ ./minicouch --data-dir /var/minicouch --port 5984
MiniCouch v1.0 - CouchDB-compatible Document Database
  Storage: Append-only B-tree
  API: HTTP/1.1 REST
  Auth: Cookie + Basic
  Features: MVCC, Views, Replication

Server ready at http://localhost:5984

# Verify compatibility with CouchDB tools
$ npm install -g pouchdb-server
$ pouchdb-server --port 5985

# Replicate from MiniCouch to PouchDB
$ curl -X POST http://localhost:5984/_replicate \
  -H "Content-Type: application/json" \
  -d '{"source":"http://localhost:5984/mydb","target":"http://localhost:5985/mydb"}'
{"ok":true,"docs_written":1000}

# Run CouchDB test suite
$ ./run_couchdb_tests http://localhost:5984
Running 150 compatibility tests...
  ✓ Server info
  ✓ Database creation
  ✓ Document CRUD
  ✓ Revision handling
  ✓ Views and MapReduce
  ✓ Changes feed
  ✓ Replication
  ✓ Authentication
  ...
148/150 tests passed (98.7% compatible)

Implementation Hints:

System architecture:

┌─────────────────────────────────────────────────────────────────┐
│                        HTTP Server                              │
│              (Routes, Auth, Request/Response)                   │
└─────────────────────────────────────────────────────────────────┘
                              │
          ┌───────────────────┼───────────────────┐
          ▼                   ▼                   ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────────────┐
│ Document Store  │ │  View Engine    │ │    Replicator           │
│ (MVCC, Revs)    │ │ (MapReduce)     │ │  (Sync Protocol)        │
└─────────────────┘ └─────────────────┘ └─────────────────────────┘
          │                   │                   │
          └───────────────────┼───────────────────┘
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                    Storage Engine                               │
│    (Append-only B-tree, Compaction, Crash Recovery)             │
└─────────────────────────────────────────────────────────────────┘

Compatibility checklist:

  • GET / returns CouchDB-like welcome (see the sketch after this list)
  • PUT /{db} creates database
  • Document CRUD with _id and _rev
  • Conflicts detected and stored
  • MapReduce views work
  • Changes feed with all modes
  • Replication protocol compatible
  • Authentication (cookie + basic)
  • Mango queries
  • Attachments
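
For the first item in the checklist, one shape the "GET /" body might take; the version and vendor strings are just the MiniCouch values used elsewhere in this guide, while CouchDB itself returns a similar {"couchdb":"Welcome", ...} object.

/* Welcome body for "GET /". Replication clients commonly probe this
 * endpoint first, so the shape matters more than the exact values. */
static const char WELCOME_JSON[] =
    "{\"couchdb\":\"Welcome\","
    "\"version\":\"1.0\","
    "\"vendor\":{\"name\":\"MiniCouch\"}}";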

Testing strategy:

1. Unit tests for each component
2. Integration tests for component interaction
3. Compatibility tests against CouchDB spec
4. Stress tests for performance
5. Chaos tests for reliability (kill -9, network partition)

Learning milestones:

  1. All components integrate → You understand systems architecture
  2. CouchDB tools work with it → You’ve achieved compatibility
  3. It survives crashes → You’ve achieved durability
  4. It handles concurrent load → You’ve achieved performance
  5. You can explain every decision → You’ve mastered the domain

Project Comparison Table

Project                            | Difficulty   | Time      | Depth of Understanding | Fun Factor
1. JSON Parser                     | Intermediate | 1 week    | ⭐⭐⭐                 | ⭐⭐⭐
2. Document Store with Revisions   | Intermediate | 1 week    | ⭐⭐⭐⭐               | ⭐⭐⭐
3. Append-Only B-Tree              | Expert       | 2-3 weeks | ⭐⭐⭐⭐⭐             | ⭐⭐⭐⭐
4. HTTP REST API                   | Advanced     | 2 weeks   | ⭐⭐⭐⭐               | ⭐⭐⭐⭐
5. Changes Feed                    | Advanced     | 1 week    | ⭐⭐⭐⭐               | ⭐⭐⭐
6. MapReduce Views                 | Expert       | 3 weeks   | ⭐⭐⭐⭐⭐             | ⭐⭐⭐⭐⭐
7. Compaction                      | Advanced     | 1-2 weeks | ⭐⭐⭐⭐               | ⭐⭐⭐
8. Mango Queries                   | Advanced     | 2 weeks   | ⭐⭐⭐⭐               | ⭐⭐⭐⭐
9. Replication                     | Expert       | 3 weeks   | ⭐⭐⭐⭐⭐             | ⭐⭐⭐⭐⭐
10. Authentication                 | Advanced     | 1-2 weeks | ⭐⭐⭐                 | ⭐⭐⭐
11. Clustering                     | Master       | 1 month+  | ⭐⭐⭐⭐⭐             | ⭐⭐⭐⭐⭐
12. Attachments                    | Intermediate | 1 week    | ⭐⭐⭐                 | ⭐⭐⭐
13. Full-Text Search               | Advanced     | 2 weeks   | ⭐⭐⭐⭐               | ⭐⭐⭐⭐
14. Complete Clone                 | Master       | 1 month+  | ⭐⭐⭐⭐⭐             | ⭐⭐⭐⭐⭐

The recommended learning path below is based on building a CouchDB clone from scratch in C:

Phase 1: Foundations (Weeks 1-2)

Start here to understand the basics.

  1. Project 1: JSON Parser - The foundation of document storage
  2. Project 2: Document Store with Revisions - Understand MVCC

Phase 2: Storage Engine (Weeks 3-5)

Build the crash-proof core.

  1. Project 3: Append-Only B-Tree - CouchDB’s secret weapon
  2. Project 7: Compaction - Reclaim space

Phase 3: API Layer (Weeks 6-8)

Make it accessible.

  1. Project 4: HTTP REST API - CouchDB’s interface
  2. Project 5: Changes Feed - Enable sync

Phase 4: Querying (Weeks 9-12)

Make data queryable.

  1. Project 6: MapReduce Views - Powerful indexing
  2. Project 8: Mango Queries - JSON-style queries

Phase 5: Distribution (Weeks 13-16)

Go distributed.

  1. Project 9: Replication - Multi-master sync
  2. Project 10: Authentication - Production security

Phase 6: Advanced (Weeks 17+)

Complete the picture.

  1. Project 11: Clustering - Horizontal scaling
  2. Project 12: Attachments - Binary storage
  3. Project 13: Full-Text Search - Text queries
  4. Project 14: Complete Clone - Integration

Summary

#  | Project Name                             | Main Programming Language
1  | JSON Parser and Document Model           | C
2  | In-Memory Document Store with Revisions  | C
3  | Append-Only B-Tree Storage Engine        | C
4  | HTTP REST API Server                     | C
5  | Sequence Index and Changes Feed          | C
6  | MapReduce View Engine                    | C
7  | Database Compaction                      | C
8  | Mango Query Engine                       | C
9  | Document Replication                     | C
10 | Authentication and Security              | C
11 | Clustering and Sharding                  | C
12 | Attachments and Binary Storage           | C
13 | Full-Text Search Integration             | C
14 | Complete CouchDB Clone                   | C

Key Resources

Books

  • “Designing Data-Intensive Applications” by Martin Kleppmann - Distributed data fundamentals
  • “Database Internals” by Alex Petrov - How databases work under the hood
  • “CouchDB: The Definitive Guide” by J. Anderson, J. Lehnardt, N. Slater - The original CouchDB book
  • “C Programming: A Modern Approach” by K. N. King - Essential C reference
  • “The Linux Programming Interface” by Michael Kerrisk - Systems programming

Online Resources

  • PouchDB - JavaScript CouchDB implementation
  • cbt - MVCC append-only B-tree in Erlang
  • Couchstore - Couchbase’s storage engine

You’re ready to build a CouchDB clone from scratch. Start with Project 1 and work your way through. By the end, you’ll understand document databases at the level of the engineers who built CouchDB, PouchDB, and MongoDB.