LEARN COUCHDB DEEP DIVE
Learn CouchDB: From Zero to Document Database Master
Goal: Deeply understand Apache CouchDB—from JSON document storage to building a complete document-oriented database with append-only B-trees, RESTful HTTP API, MapReduce views, MVCC, and multi-master replication from scratch in C.
Why CouchDB Matters
CouchDB pioneered several revolutionary database concepts that are now industry standards:
- “Relax”: A database that never corrupts, even during crashes or power failures
- HTTP all the way down: Every operation is a REST API call
- Eventual consistency: Multi-master replication without coordination
- Append-only storage: Never overwrites data, enabling crash-proof durability
- MapReduce views: Pre-computed indexes using JavaScript functions
After completing these projects, you will:
- Understand the append-only B-tree design that makes CouchDB crash-proof
- Know how MVCC enables concurrent reads without locks
- Build a RESTful HTTP API that serves JSON documents
- Implement MapReduce views with incremental updates
- Master conflict-free replication between database instances
- Have built a working document database from scratch in C
Core Concept Analysis
CouchDB Architecture Overview
┌─────────────────────────────────────────────────────────────────────┐
│ HTTP REST API │
│ GET /db/doc PUT /db/doc DELETE /db/doc POST /db/_find │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Document Layer │
│ JSON Parsing │ Validation │ Revision Management │
└─────────────────────────────────────────────────────────────────────┘
│
┌───────────────────┼───────────────────┐
▼ ▼ ▼
┌─────────────────────┐ ┌─────────────────┐ ┌─────────────────────────┐
│ By-ID Index │ │ By-Seq Index │ │ MapReduce Views │
│ (doc_id → doc) │ │ (seq → doc_id) │ │ (user-defined indexes) │
└─────────────────────┘ └─────────────────┘ └─────────────────────────┘
│ │ │
└───────────────────┼───────────────────┘
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Append-Only B-Tree Engine │
│ New nodes written to end │ Never overwrites │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Database File (.couch) │
│ [Header][Data...][B-tree nodes...][Header][Data...][New Header] │
│ ↑ │
│ Always grows → │
└─────────────────────────────────────────────────────────────────────┘
Fundamental Concepts
- Document Model
- Documents are JSON objects with _id and _rev fields
- _id: Unique identifier (user-provided or auto-generated UUID)
- _rev: Revision string (e.g., "1-abc123") for conflict detection
- Schema-less: Any valid JSON structure is allowed
- Append-Only B-Tree
- Modified nodes are written to the end of the file, never in-place
- Old versions remain on disk until compaction
- Root pointer stored in file footer (last 4KB)
- Crash recovery: Scan backward to find last valid header
- Multi-Version Concurrency Control (MVCC)
- Each update creates a new revision
- Readers see a consistent snapshot (point-in-time)
- Writers never block readers
- Conflicts detected via the _rev field during updates
- Revision Tree
  Document: "user_123"
  1-abc (initial)
    │
  2-def (update)
    │
    ├── 3-ghi (latest winning revision)
    └── 3-xyz (conflict branch from replica)
- MapReduce Views
- Map function: Emits key-value pairs for each document
- Reduce function: Aggregates values by key
- Stored in separate B-tree, incrementally updated
- Only processes documents changed since last query
- Changes Feed
- Sequence number increments with every change
- The _changes API returns all changes since a given sequence
- Enables efficient incremental replication
- Replication Protocol
- Compare revision trees between source and target
- Transfer only missing revisions
- Deterministic conflict resolution (same winner everywhere)
- Bidirectional: push + pull = full sync
Project List
Projects are ordered from fundamental understanding to advanced implementations. Each project builds on the previous ones.
Project 1: JSON Parser and Document Model
- File: LEARN_COUCHDB_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go, Zig
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Parsing / Data Structures / JSON
- Software or Tool: Building: JSON Library
- Main Book: “Writing a C Compiler” by Nora Sandler
What you’ll build: A complete JSON parser in C that can parse, modify, and serialize JSON documents—the foundation of any document database.
Why it teaches CouchDB: CouchDB stores everything as JSON. Before you can store documents, you need to parse them. This project teaches you tokenization, recursive descent parsing, and in-memory document representation—skills you’ll use throughout.
Core challenges you’ll face:
- Tokenizing JSON input → maps to lexical analysis
- Handling nested objects and arrays → maps to recursive parsing
- Memory management for dynamic structures → maps to ownership and lifetime
- Unicode and escape sequence handling → maps to string encoding
Key Concepts:
- Recursive Descent Parsing: “Writing a C Compiler” Chapter 1 - Nora Sandler
- Memory Management in C: “C Programming: A Modern Approach” Chapter 17 - K. N. King
- JSON Specification: RFC 8259 - The JavaScript Object Notation (JSON) Data Interchange Format
- Tokenizer Design: jsmn - Minimal JSON Tokenizer
Resources for key challenges:
- A practical approach to write a simple JSON parser - Step-by-step implementation
Difficulty: Intermediate
Time estimate: 1 week
Prerequisites:
- Basic C programming (pointers, structs, dynamic memory)
- Understanding of string manipulation
- No prior parsing experience required
Real world outcome:
$ ./json_parser
> {"name": "Alice", "age": 30, "hobbies": ["reading", "coding"]}
Parsed successfully!
Type: Object
Keys: 3
Document structure:
{
"name": "Alice" (string)
"age": 30 (number)
"hobbies": [
"reading" (string)
"coding" (string)
] (array, 2 elements)
}
> json_get(doc, "hobbies[1]")
"coding"
> json_set(doc, "age", 31)
> json_serialize(doc)
{"name":"Alice","age":31,"hobbies":["reading","coding"]}
Implementation Hints:
JSON value types to support:
typedef enum {
JSON_NULL,
JSON_BOOL,
JSON_NUMBER,
JSON_STRING,
JSON_ARRAY,
JSON_OBJECT
} JsonType;
typedef struct JsonValue {
JsonType type;
union {
bool boolean;
double number;
char* string;
struct { struct JsonValue* items; size_t count; } array;
struct { char** keys; struct JsonValue* values; size_t count; } object;
};
} JsonValue;
Tokenizer produces tokens:
- Structural: { } [ ] : ,
- Strings: "hello", "with \"escapes\""
- Numbers: 123, -45.67, 1.2e10
- Keywords: true, false, null
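As a concrete sketch, the token kinds above could be modeled in C like this (the names are illustrative, not a required design):

#include <stddef.h>

typedef enum {
    TOK_LBRACE, TOK_RBRACE,      /* { } */
    TOK_LBRACKET, TOK_RBRACKET,  /* [ ] */
    TOK_COLON, TOK_COMMA,        /* : , */
    TOK_STRING, TOK_NUMBER,
    TOK_TRUE, TOK_FALSE, TOK_NULL,
    TOK_EOF, TOK_ERROR
} TokenType;

typedef struct {
    TokenType   type;
    const char *start;   /* points into the input buffer */
    size_t      length;  /* lexeme length; no copy needed while scanning */
} Token;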
Parsing approach:
parse_value():
switch (current_token):
case '{': return parse_object()
case '[': return parse_array()
case STRING: return parse_string()
case NUMBER: return parse_number()
case TRUE/FALSE: return parse_bool()
case NULL: return parse_null()
parse_object():
expect '{'
while not '}':
key = parse_string()
expect ':'
value = parse_value()
add_to_object(key, value)
if not ',': break
expect '}'
Think about:
- How do you handle "\u0041" (Unicode escape)?
- How do you differentiate 123 from "123"?
- What’s your memory ownership model? Who frees what?
Learning milestones:
- Tokenizer produces correct tokens → You understand lexical analysis
- Nested structures parse correctly → You understand recursive parsing
- Round-trip works (parse → serialize → parse) → Your parser is correct
- No memory leaks → You understand C memory management
Project 2: In-Memory Document Store with Revisions
- File: LEARN_COUCHDB_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go, C++
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Data Structures / MVCC / Versioning
- Software or Tool: Building: Document Store
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: An in-memory document store that tracks revisions (_rev), detects conflicts, and maintains document history—the core of CouchDB’s data model.
Why it teaches CouchDB: The _rev field is CouchDB’s magic. It enables optimistic concurrency control, conflict detection, and replication. Understanding revision trees is essential to understanding CouchDB.
Core challenges you’ll face:
- Generating revision IDs → maps to content-addressable hashing
- Revision tree management → maps to tree data structures
- Conflict detection on update → maps to optimistic locking
- Efficient document lookup → maps to hash table design
Key Concepts:
- MVCC Fundamentals: “Designing Data-Intensive Applications” Chapter 7 - Martin Kleppmann
- Hash Tables in C: “Mastering Algorithms with C” Chapter 8 - Kyle Loudon
- Content Hashing: “Designing Data-Intensive Applications” Chapter 5 - Martin Kleppmann
- Revision Trees: CouchDB Replication and Conflict Model
Difficulty: Intermediate
Time estimate: 1 week
Prerequisites:
- Project 1 completed (JSON parser)
- Understanding of hash tables
- Basic tree data structures
Real world outcome:
$ ./docstore
docstore> PUT user_123 {"name": "Alice", "age": 30}
{
"_id": "user_123",
"_rev": "1-abc123def456",
"name": "Alice",
"age": 30
}
Created.
docstore> GET user_123
{
"_id": "user_123",
"_rev": "1-abc123def456",
"name": "Alice",
"age": 30
}
docstore> PUT user_123 {"_rev": "1-abc123def456", "name": "Alice", "age": 31}
{
"_id": "user_123",
"_rev": "2-789ghi012jkl",
"name": "Alice",
"age": 31
}
Updated.
docstore> PUT user_123 {"_rev": "1-abc123def456", "name": "Alicia", "age": 30}
Error: Conflict. Document has been modified.
Current revision: 2-789ghi012jkl
Your revision: 1-abc123def456
docstore> REVS user_123
Revision tree for user_123:
1-abc123def456 (superseded)
└── 2-789ghi012jkl (current)
Implementation Hints:
Revision ID format: {revision_number}-{md5_of_content}
Generate revision:
1. Serialize document to JSON (without _id, _rev)
2. Compute MD5 hash of JSON string
3. Increment revision number
4. Format: "{rev_num}-{first_16_chars_of_md5}"
Example: "2-a1b2c3d4e5f6a7b8"
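A minimal C sketch of this formatting step, assuming OpenSSL is available for MD5 (the function name make_rev_id is hypothetical):

#include <stdio.h>
#include <string.h>
#include <openssl/md5.h>   /* assumes OpenSSL's legacy MD5() is available */

/* Sketch: build "{rev_num}-{first 16 hex chars of md5(body)}".
   Names are illustrative, not CouchDB's exact scheme. */
void make_rev_id(unsigned rev_num, const char *body, char out[64])
{
    unsigned char digest[MD5_DIGEST_LENGTH];
    char hex[2 * MD5_DIGEST_LENGTH + 1];

    MD5((const unsigned char *)body, strlen(body), digest);
    for (int i = 0; i < MD5_DIGEST_LENGTH; i++)
        sprintf(hex + 2 * i, "%02x", digest[i]);

    snprintf(out, 64, "%u-%.16s", rev_num, hex);   /* e.g. "2-a1b2c3d4e5f6a7b8" */
}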
Document structure:
typedef struct Revision {
char* rev_id; // "1-abc123"
JsonValue* doc; // The document content
struct Revision* parent; // Previous revision
struct Revision** children; // Branches (conflicts)
size_t child_count;
bool deleted; // Tombstone flag
} Revision;
typedef struct Document {
char* id; // _id
Revision* rev_tree; // Root of revision tree
Revision* winner; // Current winning revision
} Document;
Conflict detection:
on_update(doc_id, new_content, provided_rev):
current = get_document(doc_id)
if current.winner.rev_id != provided_rev:
return CONFLICT_ERROR
new_rev = create_revision(new_content, current.winner)
current.winner = new_rev
return SUCCESS
Think about:
- What makes a revision “win” when there are conflicts?
- How do you handle DELETE? (Hint: tombstone with _deleted: true)
- What if someone provides no _rev for an update?
Learning milestones:
- Documents have _id and _rev → You understand the document model
- Updates require correct _rev → You understand optimistic locking
- Revision history is preserved → You understand MVCC
- Conflicts are detected → You’re ready for replication
Project 3: Append-Only B-Tree Storage Engine
- File: LEARN_COUCHDB_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 4: Expert
- Knowledge Area: Storage Engines / B-Trees / Durability
- Software or Tool: Building: Append-Only B-Tree
- Main Book: “Database Internals” by Alex Petrov
What you’ll build: An append-only B-tree that never overwrites data—the crash-proof storage engine that makes CouchDB reliable.
Why it teaches CouchDB: This is CouchDB’s secret weapon. By never overwriting data, the database file is always consistent. Crashes during writes simply leave incomplete data at the end, which is ignored on recovery. This project teaches you how to build truly crash-proof storage.
Core challenges you’ll face:
- Copy-on-write B-tree updates → maps to immutable data structures
- Root pointer management in file footer → maps to atomic commits
- Crash recovery by scanning backward → maps to durability guarantees
- Space efficiency with node reuse → maps to storage optimization
Key Concepts:
- Append-Only B-Trees: CouchDB - The Power of B-trees
- Copy-on-Write: “Database Internals” Chapter 4 - Alex Petrov
- File Footer Design: Couchstore File Format
- B+Tree Structure: “Database Internals” Chapter 2 - Alex Petrov
Resources for key challenges:
- cbt - MVCC Append-Only B-Tree - Reference implementation
Difficulty: Expert
Time estimate: 2-3 weeks
Prerequisites:
- Understanding of B-trees
- File I/O in C
- Binary data handling
Real world outcome:
$ ./appendonly_btree test.db
btree> INSERT "user_001" "Alice"
Inserted at offset 4096
New root at offset 4096
btree> INSERT "user_002" "Bob"
Inserted at offset 8192
New root at offset 8192 (copied, old root still at 4096)
btree> GET "user_001"
"Alice"
# Simulate crash during write
btree> INSERT "user_003" "Carol"
Writing... [CRASH SIMULATED]
$ ./appendonly_btree test.db
Recovering...
Scanning backward for valid header...
Found valid header at offset 8192
Recovery complete. 2 keys in database.
btree> GET "user_003"
Not found (incomplete write was discarded)
btree> GET "user_001"
"Alice" (previous data intact!)
Implementation Hints:
File structure:
┌─────────────────────────────────────────────────────────────────┐
│ Offset 0: [File Header - 4KB] │
│ Magic number, version, etc. │
├─────────────────────────────────────────────────────────────────┤
│ Offset 4096: [B-tree Node 1] [B-tree Node 2] [Node 3] ... │
│ (nodes appended sequentially) │
├─────────────────────────────────────────────────────────────────┤
│ End-4KB: [Active Header - 4KB] │
│ Root pointer, doc count, sequence number │
│ Written twice (2KB + 2KB) for redundancy │
└─────────────────────────────────────────────────────────────────┘
Copy-on-write update:
To update key K with value V:
1. Read path from root to leaf containing K
2. Create NEW leaf node with updated K→V
3. Create NEW internal nodes pointing to new leaf
4. Create NEW root pointing to new internal nodes
5. Append all new nodes to end of file
6. Write new header with new root pointer
7. fsync()
Old nodes remain on disk (for old snapshots and crash safety)
Header commit (atomic):
1. Write all data and new B-tree nodes
2. fsync() data
3. Write header copy 1 (first 2KB of footer)
4. Write header copy 2 (second 2KB of footer)
5. fsync() footer
On recovery:
- Read both header copies
- If they match and checksum valid: use it
- If only first valid: use first (crash during step 4)
- If neither valid: scan backward for previous header
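A sketch of what the header record and the double-write commit might look like in C; the field names and the 2KB offset of the second copy are assumptions matching the layout described above:

#include <stdint.h>
#include <unistd.h>

/* Sketch of an on-disk header record; checksum computation is omitted. */
typedef struct {
    uint32_t magic;        /* marks a valid header block */
    uint32_t version;
    uint64_t root_offset;  /* file offset of the current B-tree root */
    uint64_t doc_count;
    uint64_t update_seq;   /* last committed sequence number */
    uint32_t checksum;     /* CRC over the fields above */
} DbHeader;

/* Write the header twice into the 4KB footer, then fsync. */
int commit_header(int fd, const DbHeader *h, off_t footer_off)
{
    if (pwrite(fd, h, sizeof *h, footer_off) != (ssize_t)sizeof *h) return -1;
    if (pwrite(fd, h, sizeof *h, footer_off + 2048) != (ssize_t)sizeof *h) return -1;
    return fsync(fd);
}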
Think about:
- Why write the header twice?
- How do you find the previous valid header after a crash?
- What’s the space overhead of copy-on-write?
Learning milestones:
- B-tree operations work → You understand B-tree algorithms
- Updates append, never overwrite → You understand copy-on-write
- Database survives kill -9 → You’ve achieved crash safety
- Recovery finds valid state → You understand header scanning
Project 4: HTTP REST API Server
- File: LEARN_COUCHDB_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go, Zig
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: Networking / HTTP / REST APIs
- Software or Tool: Building: HTTP Server
- Main Book: “The Linux Programming Interface” by Michael Kerrisk
What you’ll build: An HTTP/1.1 server that exposes your document store via REST API—just like CouchDB’s native interface.
Why it teaches CouchDB: CouchDB is “HTTP all the way down.” Every operation—creating databases, inserting documents, querying views—is an HTTP request. Building this teaches you network programming and API design.
Core challenges you’ll face:
- TCP socket programming → maps to network fundamentals
- HTTP request parsing → maps to protocol implementation
- Routing requests to handlers → maps to API design
- Concurrent connection handling → maps to server architecture
Key Concepts:
- Socket Programming: “The Linux Programming Interface” Chapter 56-61 - Michael Kerrisk
- HTTP/1.1 Protocol: RFC 7230-7235
- REST API Design: CouchDB API Reference
- Beej’s Guide to Network Programming: Essential socket programming reference
Resources for key challenges:
Difficulty: Advanced
Time estimate: 2 weeks
Prerequisites:
- Projects 1-2 completed
- Basic understanding of TCP/IP
- Familiarity with HTTP
Real world outcome:
$ ./couchdb_server --port 5984
CouchDB-like server starting on http://localhost:5984
Ready to accept connections...
# In another terminal:
$ curl http://localhost:5984/
{"couchdb":"Welcome","version":"0.1.0"}
$ curl -X PUT http://localhost:5984/mydb
{"ok":true}
$ curl -X PUT http://localhost:5984/mydb/doc1 \
-H "Content-Type: application/json" \
-d '{"name":"Alice","age":30}'
{"ok":true,"id":"doc1","rev":"1-abc123"}
$ curl http://localhost:5984/mydb/doc1
{"_id":"doc1","_rev":"1-abc123","name":"Alice","age":30}
$ curl -X PUT http://localhost:5984/mydb/doc1 \
-H "Content-Type: application/json" \
-d '{"_rev":"wrong","name":"Alice","age":31}'
{"error":"conflict","reason":"Document update conflict."}
$ curl -X DELETE http://localhost:5984/mydb/doc1?rev=1-abc123
{"ok":true,"id":"doc1","rev":"2-def456"}
Implementation Hints:
CouchDB REST API endpoints:
Server:
GET / → Server info
GET /_all_dbs → List all databases
Database:
PUT /{db} → Create database
GET /{db} → Database info
DELETE /{db} → Delete database
Documents:
POST /{db} → Create document (auto-ID)
PUT /{db}/{docid} → Create/update document
GET /{db}/{docid} → Read document
DELETE /{db}/{docid}?rev= → Delete document
Special:
GET /{db}/_all_docs → List all document IDs
GET /{db}/_changes → Changes feed
HTTP request structure:
GET /mydb/doc1 HTTP/1.1
Host: localhost:5984
Accept: application/json
Response:
HTTP/1.1 200 OK
Content-Type: application/json
ETag: "1-abc123"
{"_id":"doc1","_rev":"1-abc123","name":"Alice"}
Server architecture:
main():
socket = create_tcp_socket(5984)
while true:
client = accept(socket)
fork() or thread: // Handle concurrently
request = parse_http_request(client)
response = route_and_handle(request)
send_http_response(client, response)
close(client)
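A minimal POSIX sketch of the socket setup used by the loop above (error handling abbreviated):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a TCP listening socket bound to the given port. */
int listen_on(uint16_t port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int on = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof on);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) != 0) return -1;
    if (listen(fd, 128) != 0) return -1;
    return fd;   /* pass to accept() in the main loop */
}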
Think about:
- How do you parse the URL path (/mydb/doc1)?
- How do you handle ?rev= query parameters?
- What HTTP status codes should you return? (200, 201, 404, 409, etc.)
- How do you handle large request bodies?
Learning milestones:
- Server accepts connections → You understand sockets
- HTTP parsing works → You understand the HTTP protocol
- CRUD operations via curl work → You’ve built a REST API
- Multiple clients work concurrently → You understand server architecture
Project 5: Sequence Index and Changes Feed
- File: LEARN_COUCHDB_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Event Sourcing / Change Tracking
- Software or Tool: Building: Changes Feed
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A sequence-based index that tracks every change, plus a _changes API for real-time and historical change streaming.
Why it teaches CouchDB: The changes feed is the foundation of CouchDB replication. Every modification gets a sequence number. Clients can ask “give me all changes since sequence X” to sync efficiently. This is event sourcing at the database level.
Core challenges you’ll face:
- Monotonic sequence number generation → maps to event ordering
- By-sequence index maintenance → maps to secondary indexing
- Long-polling for real-time changes → maps to push notifications
- Efficient “since” queries → maps to incremental sync
Key Concepts:
- Changes Feed: CouchDB Changes API
- Event Sourcing: “Designing Data-Intensive Applications” Chapter 11 - Martin Kleppmann
- Long Polling: HTTP technique for server push
- Sequence Indexes: Anatomy of the CouchDB Changes Feed
Difficulty: Advanced
Time estimate: 1 week
Prerequisites:
- Project 3 completed (storage engine)
- Project 4 completed (HTTP server)
- Understanding of secondary indexes
Real world outcome:
$ curl "http://localhost:5984/mydb/_changes"
{
"results": [
{"seq":1,"id":"doc1","changes":[{"rev":"1-abc"}]},
{"seq":2,"id":"doc2","changes":[{"rev":"1-def"}]},
{"seq":3,"id":"doc1","changes":[{"rev":"2-ghi"}]}
],
"last_seq": 3
}
$ curl "http://localhost:5984/mydb/_changes?since=2"
{
"results": [
{"seq":3,"id":"doc1","changes":[{"rev":"2-ghi"}]}
],
"last_seq": 3
}
# Long-polling (waits for new changes)
$ curl "http://localhost:5984/mydb/_changes?feed=longpoll&since=3"
# ... request hangs until a change occurs ...
# In another terminal:
$ curl -X PUT http://localhost:5984/mydb/doc3 -d '{"x":1}'
# First terminal receives:
{
"results": [
{"seq":4,"id":"doc3","changes":[{"rev":"1-xyz"}]}
],
"last_seq": 4
}
# Continuous feed (streaming)
$ curl "http://localhost:5984/mydb/_changes?feed=continuous&since=0"
{"seq":1,"id":"doc1","changes":[{"rev":"1-abc"}]}
{"seq":2,"id":"doc2","changes":[{"rev":"1-def"}]}
{"seq":3,"id":"doc1","changes":[{"rev":"2-ghi"}]}
{"seq":4,"id":"doc3","changes":[{"rev":"1-xyz"}]}
# ... stays open, new lines appear as changes happen ...
Implementation Hints:
Data structures:
Global sequence counter (per database):
uint64_t current_seq = 0;
On every document change:
current_seq++;
record = { seq: current_seq, doc_id, rev_id }
by_seq_index.insert(current_seq, record)
By-sequence index (separate B-tree):
key: sequence number
value: { doc_id, rev_id, deleted }
Changes feed modes:
Normal (feed=normal, default):
- Return all changes since `since` parameter
- Return immediately, even if empty
Long-poll (feed=longpoll):
- If changes exist since `since`, return immediately
- If no changes, wait (block) until one occurs
- Return after first change (or timeout)
Continuous (feed=continuous):
- Stream changes as newline-delimited JSON
- Never close connection
- Send heartbeat every N seconds to keep alive
Implementation:
handle_changes(since, feed_type):
if feed_type == "normal":
return get_all_changes_since(since)
if feed_type == "longpoll":
changes = get_all_changes_since(since)
if changes:
return changes
wait_for_change(timeout=60) // Block here
return get_all_changes_since(since)
if feed_type == "continuous":
send_all_changes_since(since)
while true:
wait_for_change()
send_new_change()
send_heartbeat() // Empty line every 10s
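One way to implement the blocking wait is a condition variable shared between writers and long-poll handlers. A sketch assuming POSIX threads (the ChangeSignal type and names are illustrative):

#include <pthread.h>
#include <stdint.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  changed;
    uint64_t        current_seq;
} ChangeSignal;

/* Called by the write path after bumping the database sequence. */
void notify_change(ChangeSignal *cs, uint64_t new_seq)
{
    pthread_mutex_lock(&cs->lock);
    cs->current_seq = new_seq;
    pthread_cond_broadcast(&cs->changed);   /* wake all long-poll waiters */
    pthread_mutex_unlock(&cs->lock);
}

/* Block until a change newer than `since` exists (timeout not shown). */
uint64_t wait_for_change(ChangeSignal *cs, uint64_t since)
{
    pthread_mutex_lock(&cs->lock);
    while (cs->current_seq <= since)
        pthread_cond_wait(&cs->changed, &cs->lock);
    uint64_t seq = cs->current_seq;
    pthread_mutex_unlock(&cs->lock);
    return seq;
}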
Think about:
- What happens if two changes happen while client is disconnected?
- How do you implement the “wait” efficiently? (Condition variable, poll)
- How do you handle the include_docs=true parameter?
Learning milestones:
- Sequence numbers increment correctly → You understand change tracking
- By-seq index works → You understand secondary indexes
- Normal feed returns history → You understand incremental queries
- Long-poll waits and notifies → You understand server push
Project 6: MapReduce View Engine
- File: LEARN_COUCHDB_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 4: Expert
- Knowledge Area: MapReduce / Indexing / Query Processing
- Software or Tool: Building: View Engine
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A MapReduce view engine that runs user-defined JavaScript map/reduce functions, maintains B-tree indexes, and updates incrementally.
Why it teaches CouchDB: Views are how you query CouchDB beyond simple ID lookups. The map function transforms documents into key-value pairs; the reduce aggregates them. Understanding incremental index updates is key to CouchDB’s performance.
Core challenges you’ll face:
- Embedding a JavaScript engine → maps to language integration
- Incremental view updates → maps to efficient indexing
- B-tree storage for view results → maps to secondary indexes
- Reduce tree computation → maps to aggregation strategies
Key Concepts:
- MapReduce Views: CouchDB Introduction to Views
- Incremental Indexing: Finding Your Data with Views
- Reduce Trees: CouchDB Reduce Functions
- JavaScript in C: Libraries like Duktape, QuickJS, or MuJS
Resources for key challenges:
Difficulty: Expert
Time estimate: 3 weeks
Prerequisites:
- Project 3 completed (B-tree storage)
- Basic understanding of MapReduce
- Familiarity with JavaScript (for writing map functions)
Real world outcome:
# Create a design document with views
$ curl -X PUT http://localhost:5984/mydb/_design/app \
-H "Content-Type: application/json" \
-d '{
"views": {
"by_age": {
"map": "function(doc) { if(doc.age) emit(doc.age, doc.name); }"
},
"age_stats": {
"map": "function(doc) { if(doc.age) emit(null, doc.age); }",
"reduce": "_stats"
}
}
}'
{"ok":true,"id":"_design/app","rev":"1-abc"}
# Query the view
$ curl "http://localhost:5984/mydb/_design/app/_view/by_age"
{
"total_rows": 3,
"offset": 0,
"rows": [
{"id":"user2","key":25,"value":"Bob"},
{"id":"user1","key":30,"value":"Alice"},
{"id":"user3","key":35,"value":"Carol"}
]
}
# Query with range
$ curl "http://localhost:5984/mydb/_design/app/_view/by_age?startkey=28&endkey=32"
{
"rows": [
{"id":"user1","key":30,"value":"Alice"}
]
}
# Query with reduce
$ curl "http://localhost:5984/mydb/_design/app/_view/age_stats?reduce=true"
{
"rows": [
{"key":null,"value":{"sum":90,"count":3,"min":25,"max":35,"avg":30}}
]
}
Implementation Hints:
View architecture:
Design Document (_design/app):
{
"views": {
"view_name": {
"map": "function(doc) { emit(key, value); }",
"reduce": "_sum" // or "_count", "_stats", or custom function
}
}
}
View B-tree (separate file per view):
key: [emitted_key, doc_id] // Compound key for sorting
value: emitted_value
Map function execution:
For each document:
result = js_engine.call("map", doc)
for each emit(key, value) in result:
view_btree.insert([key, doc.id], value)
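Running the map function requires an embedded JavaScript engine. A minimal sketch using Duktape, one of the engines listed above; the wrapper script and variable names are illustrative assumptions, not CouchDB's actual view-server protocol:

#include <stdio.h>
#include "duktape.h"

int main(void)
{
    duk_context *ctx = duk_create_heap_default();

    /* Define emit(), install the user's map function, run it on one document,
       and return the emitted key-value pairs as a JSON string. */
    const char *script =
        "var emitted = [];"
        "function emit(k, v) { emitted.push([k, v]); }"
        "var map = function(doc) { if (doc.age) emit(doc.age, doc.name); };"
        "map(JSON.parse('{\"name\":\"Alice\",\"age\":30}'));"
        "JSON.stringify(emitted);";

    duk_eval_string(ctx, script);
    printf("emitted: %s\n", duk_get_string(ctx, -1));   /* [[30,"Alice"]] */

    duk_destroy_heap(ctx);
    return 0;
}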
Incremental updates:
view_update(view, last_seq):
changes = db.changes_since(view.last_indexed_seq)
for change in changes:
old_emits = view.get_by_docid(change.doc_id)
view.delete(old_emits) // Remove old entries
if not change.deleted:
doc = db.get(change.doc_id)
new_emits = run_map_function(doc)
for emit in new_emits:
view.insert([emit.key, doc.id], emit.value)
view.last_indexed_seq = changes.last_seq
Built-in reduce functions:
_count: Return count of values
_sum: Return sum of numeric values
_stats: Return {sum, count, min, max, avg}
Custom reduce:
function(keys, values, rereduce) {
if (rereduce) {
// values are previous reduce results
return values.reduce((a, b) => a + b, 0);
} else {
// values are from map function
return values.length;
}
}
Think about:
- How do you isolate the JavaScript sandbox?
- What if a map function throws an error?
- How do you handle the rereduce flag in reduce functions?
- When do you update the view? (On query, or in the background?)
Learning milestones:
- Map function emits work → You understand JS integration
- Views are queryable by key → You understand view B-trees
- Incremental updates work → You understand efficient indexing
- Reduce aggregates correctly → You understand reduce trees
Project 7: Database Compaction
- File: LEARN_COUCHDB_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Storage Management / Garbage Collection
- Software or Tool: Building: Compactor
- Main Book: “Database Internals” by Alex Petrov
What you’ll build: A compaction system that reclaims space from the append-only file by copying live data to a new file and discarding obsolete revisions.
Why it teaches CouchDB: Append-only storage is great for durability but terrible for space. Without compaction, your database file grows forever. CouchDB’s compaction copies only current revisions to a new file, then swaps it in atomically.
Core challenges you’ll face:
- Identifying live vs. dead data → maps to garbage identification
- Online compaction (while serving requests) → maps to concurrent operations
- Atomic file swap → maps to safe transitions
- Space estimation and scheduling → maps to resource management
Key Concepts:
- Log-Structured Storage Compaction: “Designing Data-Intensive Applications” Chapter 3 - Martin Kleppmann
- Online Compaction: Understanding Data Compaction
- Garbage Collection: “Database Internals” Chapter 5 - Alex Petrov
- Atomic File Operations: rename() syscall semantics
Difficulty: Advanced
Time estimate: 1-2 weeks
Prerequisites:
- Project 3 completed (append-only storage)
- Understanding of file system operations
- Concurrent programming basics
Real world outcome:
$ ls -lh mydb.couch
-rw-r--r-- 1 user staff 1.2G Dec 21 10:00 mydb.couch
$ curl -X POST http://localhost:5984/mydb/_compact
{"ok":true}
$ curl http://localhost:5984/mydb
{
"db_name": "mydb",
"doc_count": 10000,
"disk_size": 1288490188,
"compact_running": true,
...
}
# Wait for compaction to complete...
$ curl http://localhost:5984/mydb
{
"db_name": "mydb",
"doc_count": 10000,
"disk_size": 52428800,
"compact_running": false,
...
}
$ ls -lh mydb.couch
-rw-r--r-- 1 user staff 50M Dec 21 10:05 mydb.couch
# Space reclaimed! 1.2GB → 50MB
Implementation Hints:
Compaction algorithm:
compact(db_file):
new_file = create_temp_file("db.compact")
for doc_id in db.all_doc_ids():
doc = db.get(doc_id) // Gets winning revision only
if not doc.deleted:
new_file.insert(doc_id, doc)
// Copy views too (only current index entries)
for view in db.views:
for entry in view:
new_file.view_insert(entry)
// Atomic swap
rename(new_file, db_file) // Atomic on POSIX
// Old file is now unlinked
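The swap at the end relies on POSIX rename() being atomic. A small C sketch, with the directory fsync that makes the rename itself durable (the paths are illustrative parameters):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Atomically replace the live database file with the compacted one. */
int swap_in_compacted(const char *compact_path, const char *db_path,
                      const char *dir_path)
{
    if (rename(compact_path, db_path) != 0) return -1;   /* atomic replace */

    int dfd = open(dir_path, O_RDONLY | O_DIRECTORY);
    if (dfd < 0) return -1;
    int rc = fsync(dfd);   /* persist the directory entry change */
    close(dfd);
    return rc;
}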
Online compaction (while serving reads):
During compaction:
- Reads go to OLD file (consistent snapshot)
- Writes go to BOTH old file AND track in memory
After compaction:
- Apply pending writes to new file
- Atomic swap
- New reads go to new file
What to keep vs. discard:
KEEP:
- Current winning revision of each document
- Conflict revisions (still unresolved)
- Local documents (_local/*)
DISCARD:
- Old revisions (superseded)
- Deleted document tombstones older than threshold
- Orphaned B-tree nodes
Think about:
- What if a write happens during compaction?
- How do you handle views during compaction?
- What’s the disk space requirement? (2x temporarily)
- How do you schedule compaction? (Fragmentation threshold)
Learning milestones:
- Compaction creates smaller file → You understand garbage collection
- No data is lost → You understand live data identification
- Swap is atomic → You understand safe transitions
- Works while serving requests → You understand online compaction
Project 8: Mango Query Engine (JSON Queries)
- File: LEARN_COUCHDB_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: Query Processing / Indexing
- Software or Tool: Building: Query Engine
- Main Book: “Database Internals” by Alex Petrov
What you’ll build: A Mango-style query engine that accepts JSON selector queries (like MongoDB) and executes them efficiently using indexes.
Why it teaches CouchDB: MapReduce views are powerful but require writing JavaScript. Mango queries let you query with JSON: {"age": {"$gt": 30}}. This project teaches query parsing, index selection, and execution planning.
Core challenges you’ll face:
- Query syntax parsing → maps to DSL implementation
- Index selection → maps to query optimization
- Selector evaluation → maps to predicate matching
- Combining results → maps to query execution
Key Concepts:
- Mango Queries: CouchDB Find API
- Query Selectors: MongoDB-style operators ($eq, $gt, $in, etc.)
- Index Selection: “Database Internals” Chapter 10 - Alex Petrov
- Query Planning: Choosing between index scan vs. full scan
Difficulty: Advanced
Time estimate: 2 weeks
Prerequisites:
- Project 6 completed (views for indexing)
- Project 1 completed (JSON parsing)
- Understanding of query execution
Real world outcome:
# Create an index
$ curl -X POST http://localhost:5984/mydb/_index \
-H "Content-Type: application/json" \
-d '{"index": {"fields": ["age"]}, "name": "age-index"}'
{"result":"created","id":"_design/age-index","name":"age-index"}
# Query with selector
$ curl -X POST http://localhost:5984/mydb/_find \
-H "Content-Type: application/json" \
-d '{
"selector": {
"age": {"$gt": 25},
"status": "active"
},
"fields": ["_id", "name", "age"],
"sort": [{"age": "asc"}],
"limit": 10
}'
{
"docs": [
{"_id":"user2","name":"Bob","age":28},
{"_id":"user1","name":"Alice","age":30},
{"_id":"user3","name":"Carol","age":35}
],
"bookmark": "g1AAAA...",
"warning": "No matching index found, create an index for status"
}
# Explain query plan
$ curl -X POST http://localhost:5984/mydb/_explain \
-H "Content-Type: application/json" \
-d '{"selector": {"age": {"$gt": 25}}}'
{
"index": {
"ddoc": "_design/age-index",
"name": "age-index",
"type": "json",
"fields": [{"age":"asc"}]
},
"selector": {"age":{"$gt":25}},
"range": {
"start_key": [25],
"end_key": [{}]
}
}
Implementation Hints:
Selector operators:
Comparison:
$eq - Equal
$ne - Not equal
$gt - Greater than
$gte - Greater than or equal
$lt - Less than
$lte - Less than or equal
Logical:
$and - All must match
$or - At least one must match
$not - Negation
Array:
$in - Value in array
$nin - Value not in array
Existence:
$exists - Field exists
Query execution:
execute_find(selector, fields, sort, limit):
// 1. Parse selector
parsed = parse_selector(selector)
// 2. Find usable index
index = find_best_index(parsed, sort)
// 3. Execute
if index:
candidates = index.range_scan(parsed.key_range)
else:
candidates = full_table_scan() // Slow!
emit_warning("No matching index")
// 4. Filter (for predicates not covered by index)
results = []
for doc in candidates:
if matches_selector(doc, parsed):
results.append(project(doc, fields))
if len(results) >= limit:
break
// 5. Sort (if not already sorted by index)
if sort and not index_covers_sort:
results.sort(sort)
return results
Selector evaluation:
matches_selector(doc, selector):
for field, condition in selector:
value = doc.get(field)
if is_operator(condition):
if not evaluate_operator(value, condition):
return false
else:
if value != condition: // Implicit $eq
return false
return true
evaluate_operator(value, {$gt: 25}):
return value > 25
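A C sketch of operator evaluation, reusing the JsonValue type from Project 1 (its anonymous union needs C11); only numbers and strings are handled here, and CouchDB's full collation order is left out:

#include <stdbool.h>
#include <string.h>

bool eval_compare(const JsonValue *value, const char *op, const JsonValue *arg)
{
    if (value == NULL) return false;              /* field missing from doc */
    if (value->type != arg->type) return false;   /* simplification */

    double cmp = 0;
    if (value->type == JSON_NUMBER)
        cmp = value->number - arg->number;
    else if (value->type == JSON_STRING)
        cmp = strcmp(value->string, arg->string);
    else
        return false;

    if (strcmp(op, "$eq") == 0)  return cmp == 0;
    if (strcmp(op, "$ne") == 0)  return cmp != 0;
    if (strcmp(op, "$gt") == 0)  return cmp > 0;
    if (strcmp(op, "$gte") == 0) return cmp >= 0;
    if (strcmp(op, "$lt") == 0)  return cmp < 0;
    if (strcmp(op, "$lte") == 0) return cmp <= 0;
    return false;
}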
Think about:
- How do you handle nested fields? ("address.city": "NYC")
- What if no index exists? (Full scan with warning)
- How do you handle $or with indexes?
- What’s the bookmark for pagination?
Learning milestones:
- Basic selectors work → You understand predicate matching
- Indexes are used when available → You understand query optimization
- Complex queries with $and/$or work → You understand logical operators
- _explain shows the plan → You understand query planning
Project 9: Document Replication
- File: LEARN_COUCHDB_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 5: Pure Magic
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Distributed Systems / Replication
- Software or Tool: Building: Replicator
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A replication engine that syncs documents between CouchDB instances, handling conflicts deterministically—the heart of CouchDB’s distributed nature.
Why it teaches CouchDB: Replication is CouchDB’s killer feature. Two databases can be offline, both receive updates, and later sync without coordination. Conflicts are detected and the same winner is chosen everywhere. This project teaches distributed systems concepts.
Core challenges you’ll face:
- Comparing revision trees → maps to diff algorithms
- Transferring missing revisions → maps to efficient sync
- Deterministic conflict resolution → maps to consensus-free consistency
- Checkpoint management → maps to sync resume
Key Concepts:
- CouchDB Replication Protocol: Official Protocol Spec
- Conflict Resolution: Replication and Conflict Model
- Eventual Consistency: “Designing Data-Intensive Applications” Chapter 5 - Martin Kleppmann
- CRDTs and Conflict-Free Design: Conflict Management
Resources for key challenges:
- CouchDB Replication Protocol - Data Protocols spec
Difficulty: Expert
Time estimate: 3 weeks
Prerequisites:
- Project 4-5 completed (HTTP + changes feed)
- Project 2 completed (revision trees)
- Understanding of distributed systems basics
Real world outcome:
# Start two CouchDB instances
$ ./couchdb_server --port 5984 --data ./db1 &
$ ./couchdb_server --port 5985 --data ./db2 &
# Create same database on both
$ curl -X PUT http://localhost:5984/shared
$ curl -X PUT http://localhost:5985/shared
# Add document to first instance
$ curl -X PUT http://localhost:5984/shared/doc1 -d '{"value":"from db1"}'
{"ok":true,"id":"doc1","rev":"1-abc"}
# Replicate from db1 to db2
$ curl -X POST http://localhost:5984/_replicate \
-H "Content-Type: application/json" \
-d '{"source":"http://localhost:5984/shared","target":"http://localhost:5985/shared"}'
{
"ok":true,
"docs_read":1,
"docs_written":1,
"missing_checked":1,
"missing_found":1
}
# Document now exists on db2
$ curl http://localhost:5985/shared/doc1
{"_id":"doc1","_rev":"1-abc","value":"from db1"}
# Create conflict: Update on both while "offline"
$ curl -X PUT http://localhost:5984/shared/doc1 -d '{"_rev":"1-abc","value":"updated on db1"}'
$ curl -X PUT http://localhost:5985/shared/doc1 -d '{"_rev":"1-abc","value":"updated on db2"}'
# Replicate again - conflict detected
$ curl -X POST http://localhost:5984/_replicate \
-d '{"source":"http://localhost:5985/shared","target":"http://localhost:5984/shared"}'
# Check for conflicts
$ curl "http://localhost:5984/shared/doc1?conflicts=true"
{
"_id":"doc1",
"_rev":"2-xyz",
"value":"updated on db2",
"_conflicts":["2-abc"]
}
# Same winner (2-xyz) chosen deterministically on both!
Implementation Hints:
Replication algorithm:
replicate(source, target, since=0):
// 1. Get changes from source
changes = source.get_changes(since=since)
// 2. Find what target is missing
revs_to_check = [(c.id, c.rev) for c in changes]
missing = target.revs_diff(revs_to_check)
// 3. Fetch missing docs with revision history
for doc_id, missing_revs in missing:
doc = source.get(doc_id, revs=True, open_revs=missing_revs)
target.bulk_docs([doc], new_edits=False)
// 4. Save checkpoint
save_checkpoint(target, changes.last_seq)
Revision diff (_revs_diff):
Input: {"doc1": ["2-abc", "3-def"], "doc2": ["1-xyz"]}
Output: {"doc1": {"missing": ["3-def"]}} // Has 2-abc, needs 3-def
Deterministic conflict resolution:
When conflicts exist, the winner is determined by:
1. Longest revision path wins (more edits = more recent)
2. If tie, lexicographically highest revision ID wins
This is deterministic: all replicas pick the same winner
without any coordination!
Example:
Revisions: "2-abc", "2-xyz"
Same length (2), compare strings: "xyz" > "abc"
Winner: "2-xyz"
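A C sketch of this winner selection for two revision IDs of the form "{num}-{hash}" (the function name pick_winner is illustrative):

#include <stdlib.h>
#include <string.h>

/* Higher revision number wins; ties are broken lexicographically. */
const char *pick_winner(const char *rev_a, const char *rev_b)
{
    long num_a = strtol(rev_a, NULL, 10);
    long num_b = strtol(rev_b, NULL, 10);
    if (num_a != num_b)
        return num_a > num_b ? rev_a : rev_b;
    return strcmp(rev_a, rev_b) > 0 ? rev_a : rev_b;   /* deterministic tie-break */
}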
Bulk docs with new_edits=false:
Normal insert: CouchDB generates new revision
With new_edits=false: Use provided revision (for replication)
This allows inserting historical revisions from another database.
Think about:
- How do you handle network failures during replication?
- What if source is updated during replication?
- How do you detect and resolve conflicts programmatically?
- What’s continuous replication vs. one-shot?
Learning milestones:
- One-shot replication works → You understand sync protocol
- Missing revisions are identified → You understand revs_diff
- Conflicts are detected → You understand conflict creation
- Same winner everywhere → You understand deterministic resolution
Project 10: Authentication and Security
- File: LEARN_COUCHDB_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Security / Authentication / Authorization
- Software or Tool: Building: Auth System
- Main Book: “Foundations of Information Security” by Jason Andress
What you’ll build: An authentication system with users, roles, and per-database access control—making your database production-ready.
Why it teaches CouchDB: A database without auth is useless in production. CouchDB has a flexible system with admin users, regular users, database-level permissions, and design document validation. This project teaches security fundamentals.
Core challenges you’ll face:
- Password hashing → maps to secure credential storage
- Session management → maps to authentication state
- Role-based access control → maps to authorization
- Validate functions → maps to data integrity
Key Concepts:
- Password Hashing: bcrypt or PBKDF2
- Cookie Sessions: CouchDB Authentication
- Authorization: Per-database security objects
- Validation Functions: Design document validate_doc_update
Difficulty: Advanced
Time estimate: 1-2 weeks
Prerequisites:
- Project 4 completed (HTTP server)
- Basic security knowledge
- Understanding of HTTP cookies
Real world outcome:
# Create admin user (first user becomes admin)
$ curl -X PUT http://localhost:5984/_users/org.couchdb.user:admin \
-H "Content-Type: application/json" \
-d '{"name":"admin","password":"secret123","roles":["_admin"],"type":"user"}'
# Try to access without auth
$ curl http://localhost:5984/mydb/doc1
{"error":"unauthorized","reason":"You are not authorized."}
# Login (cookie auth)
$ curl -X POST http://localhost:5984/_session \
-H "Content-Type: application/json" \
-d '{"name":"admin","password":"secret123"}' \
-c cookies.txt
{"ok":true,"name":"admin","roles":["_admin"]}
# Access with session cookie
$ curl -b cookies.txt http://localhost:5984/mydb/doc1
{"_id":"doc1","_rev":"1-abc","name":"Alice"}
# Set database security
$ curl -X PUT http://localhost:5984/mydb/_security \
-b cookies.txt \
-H "Content-Type: application/json" \
-d '{"admins":{"names":["admin"]},"members":{"roles":["users"]}}'
{"ok":true}
# Create validate function (in design doc)
$ curl -X PUT http://localhost:5984/mydb/_design/validation \
-b cookies.txt \
-d '{
"validate_doc_update": "function(newDoc, oldDoc, userCtx) {
if(!newDoc.name) throw({forbidden:\"name required\"});
}"
}'
# Try to insert invalid document
$ curl -X PUT http://localhost:5984/mydb/invalid \
-b cookies.txt \
-d '{"age":30}'
{"error":"forbidden","reason":"name required"}
Implementation Hints:
User document structure:
{
"_id": "org.couchdb.user:alice",
"type": "user",
"name": "alice",
"roles": ["editor", "viewer"],
"password_scheme": "pbkdf2",
"derived_key": "abc123...",
"salt": "randomsalt",
"iterations": 10000
}
Password verification:
verify_password(provided, user_doc):
derived = pbkdf2(
provided,
user_doc.salt,
user_doc.iterations
)
return constant_time_compare(derived, user_doc.derived_key)
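The constant-time comparison matters: a byte-by-byte compare that returns early leaks how many leading bytes matched. A C sketch:

#include <stddef.h>

/* Compare two equal-length byte buffers without early exit. */
int constant_time_compare(const unsigned char *a, const unsigned char *b,
                          size_t len)
{
    unsigned char diff = 0;
    for (size_t i = 0; i < len; i++)
        diff |= a[i] ^ b[i];   /* accumulate differences without branching */
    return diff == 0;
}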
Database security object:
{
"admins": {
"names": ["admin"], // Specific users
"roles": ["db_admins"] // Users with these roles
},
"members": {
"names": [],
"roles": ["users"] // Who can read
}
}
Access rules:
- Admins can do everything
- Members can read and write
- Non-members get 401/403
- Empty security = public database
Authentication flow:
Every request:
1. Check for Authorization header (Basic auth)
2. Check for session cookie
3. If authenticated, set userCtx
4. Check database security against userCtx
5. If write, run validate_doc_update functions
Think about:
- How do you handle cookie expiration?
- What’s the difference between 401 and 403?
- How do validate functions get user context?
- How do you protect against timing attacks?
Learning milestones:
- Password hashing works → You understand secure credential storage
- Sessions persist across requests → You understand cookie auth
- Database-level permissions work → You understand authorization
- Validate functions reject bad data → You understand data validation
Project 11: Clustering and Sharding
- File: LEARN_COUCHDB_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go, Erlang
- Coolness Level: Level 5: Pure Magic
- Business Potential: 5. The “Industry Disruptor”
- Difficulty: Level 5: Master
- Knowledge Area: Distributed Systems / Clustering
- Software or Tool: Building: Clustered Database
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A clustered CouchDB that shards data across multiple nodes, with quorum reads/writes and automatic failover.
Why it teaches CouchDB: CouchDB 2.0+ supports clustering. Data is split into shards, replicated across nodes. Reads and writes require a quorum. This is serious distributed systems engineering—coordination, consistency, and failure handling.
Core challenges you’ll face:
- Consistent hashing for shard placement → maps to data distribution
- Quorum reads and writes → maps to consistency guarantees
- Node membership and failure detection → maps to cluster management
- Request routing → maps to distributed query execution
Key Concepts:
- Consistent Hashing: “Designing Data-Intensive Applications” Chapter 6 - Martin Kleppmann
- Quorum Consensus: “Designing Data-Intensive Applications” Chapter 5 - Martin Kleppmann
- Cluster Membership: Gossip protocols, failure detection
- Shard Routing: Document ID to shard mapping
Difficulty: Master
Time estimate: 1 month+
Prerequisites:
- All previous projects completed
- Strong distributed systems knowledge
- Network programming experience
Real world outcome:
# Start 3-node cluster
$ ./couchdb_server --port 5984 --node node1 --cluster "node1,node2,node3"
$ ./couchdb_server --port 5985 --node node2 --cluster "node1,node2,node3"
$ ./couchdb_server --port 5986 --node node3 --cluster "node1,node2,node3"
# Check cluster status
$ curl http://localhost:5984/_membership
{
"cluster_nodes": ["node1", "node2", "node3"],
"all_nodes": ["node1", "node2", "node3"]
}
# Create sharded database
$ curl -X PUT "http://localhost:5984/mydb?q=4&n=2"
{"ok":true}
# q=4 shards, n=2 replicas per shard
# Insert document (goes to appropriate shard)
$ curl -X PUT http://localhost:5984/mydb/user_123 -d '{"name":"Alice"}'
{"ok":true,"id":"user_123","rev":"1-abc"}
# Read works from any node (routes to correct shard)
$ curl http://localhost:5985/mydb/user_123
{"_id":"user_123","_rev":"1-abc","name":"Alice"}
# Kill a node
$ kill -9 $(pgrep -f "node2")
# Writes still work when the target shard's replicas are on surviving nodes (n=2, w=2)
$ curl -X PUT http://localhost:5984/mydb/user_456 -d '{"name":"Bob"}'
{"ok":true} # Success! Quorum achieved with remaining nodes
# Node rejoins and syncs
$ ./couchdb_server --port 5985 --node node2 --cluster "node1,node2,node3"
# Automatic sync of missed writes...
Implementation Hints:
Sharding scheme:
Database "mydb" with q=4 shards:
Shard 0: hash range 0x00000000-0x3FFFFFFF
Shard 1: hash range 0x40000000-0x7FFFFFFF
Shard 2: hash range 0x80000000-0xBFFFFFFF
Shard 3: hash range 0xC0000000-0xFFFFFFFF
Document placement:
shard_num = crc32(doc_id) % q
nodes = get_nodes_for_shard(shard_num) // n replicas
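A C sketch of shard and replica-node selection, using zlib's crc32() as the hash and a simple round-robin layout (real clusters usually store an explicit shard map):

#include <stdint.h>
#include <string.h>
#include <zlib.h>   /* crc32() as a stand-in hash */

uint32_t shard_for_doc(const char *doc_id, uint32_t q)
{
    uLong h = crc32(0L, (const Bytef *)doc_id, (uInt)strlen(doc_id));
    return (uint32_t)(h % q);
}

/* Fill out_nodes with the n node indexes holding this shard's replicas. */
void nodes_for_shard(uint32_t shard, uint32_t n, uint32_t node_count,
                     uint32_t *out_nodes)
{
    for (uint32_t i = 0; i < n; i++)
        out_nodes[i] = (shard + i) % node_count;   /* round-robin placement */
}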
Quorum configuration:
n = number of replicas
r = read quorum (how many replicas must respond for read)
w = write quorum (how many replicas must acknowledge write)
Default: n=3, r=2, w=2
Rule: r + w > n guarantees overlap (see latest write)
Write path:
write_document(doc):
shard = get_shard(doc.id)
nodes = get_nodes_for_shard(shard)
responses = parallel_write(nodes, doc)
success_count = count_successes(responses)
if success_count >= w:
return success
else:
return error("quorum not reached")
Read path:
read_document(doc_id):
shard = get_shard(doc_id)
nodes = get_nodes_for_shard(shard)
responses = parallel_read(nodes, doc_id)
// Wait for r responses
docs = collect_until(responses, count=r)
// Return latest revision
return pick_winner(docs)
Think about:
- How do you handle split-brain scenarios?
- What if a write succeeds on < w nodes?
- How do you rebalance when adding/removing nodes?
- How do you route changes feed queries to all shards?
Learning milestones:
- Documents are sharded correctly → You understand consistent hashing
- Quorum reads/writes work → You understand distributed consistency
- Node failure is handled → You understand fault tolerance
- Cluster rebalances → You understand distributed operations
Project 12: Attachments and Binary Storage
- File: LEARN_COUCHDB_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Binary Data / HTTP Multipart
- Software or Tool: Building: Attachment Storage
- Main Book: “HTTP: The Definitive Guide” by David Gourley
What you’ll build: Support for binary attachments (images, PDFs, etc.) stored alongside JSON documents with content-type handling and streaming.
Why it teaches CouchDB: CouchDB can store files as attachments to documents. This is useful for CouchApps (web apps served from CouchDB) and any application that needs to associate files with data. It teaches binary storage and HTTP multipart handling.
Core challenges you’ll face:
- Binary data in append-only storage → maps to blob storage
- HTTP multipart requests → maps to complex request parsing
- Content-Type handling → maps to MIME types
- Streaming large files → maps to efficient I/O
Key Concepts:
- HTTP Multipart: RFC 2046 - Multipart content types
- MIME Types: Mapping file extensions to content types
- Streaming I/O: Chunked transfer encoding
- Content-Addressable Storage: Storing attachments by hash
Difficulty: Intermediate
Time estimate: 1 week
Prerequisites:
- Project 4 completed (HTTP server)
- Project 3 completed (storage engine)
- Understanding of binary file handling
Real world outcome:
# Upload attachment
$ curl -X PUT http://localhost:5984/mydb/doc1/photo.jpg \
-H "Content-Type: image/jpeg" \
--data-binary @photo.jpg
{"ok":true,"id":"doc1","rev":"2-xyz"}
# Document now has _attachments
$ curl http://localhost:5984/mydb/doc1
{
"_id": "doc1",
"_rev": "2-xyz",
"name": "Alice",
"_attachments": {
"photo.jpg": {
"content_type": "image/jpeg",
"length": 45123,
"digest": "md5-abc123...",
"stub": true
}
}
}
# Download attachment
$ curl http://localhost:5984/mydb/doc1/photo.jpg > downloaded.jpg
# Returns binary data with Content-Type: image/jpeg
# Inline attachment (base64 in document)
$ curl -X PUT http://localhost:5984/mydb/doc2 \
-H "Content-Type: application/json" \
-d '{
"name": "Bob",
"_attachments": {
"note.txt": {
"content_type": "text/plain",
"data": "SGVsbG8gV29ybGQh"
}
}
}'
Implementation Hints:
Attachment storage:
Option 1: Inline in document B-tree
- Good for small attachments
- Bad for large files (copied on every doc update)
Option 2: Separate attachment store (like CouchDB)
- Content-addressable: key = md5(content)
- Document stores reference to attachment
- Attachment deduplicated automatically
Document with attachments:
{
"_id": "doc1",
"_rev": "2-xyz",
"_attachments": {
"photo.jpg": {
"content_type": "image/jpeg",
"length": 45123,
"digest": "md5-abc123def456",
"stub": true, // Indicates attachment not inline
"revpos": 2 // Revision when added
}
}
}
Upload handling:
PUT /{db}/{doc_id}/{attachment_name}
Content-Type: image/jpeg
[binary data]
Steps:
1. Read binary body
2. Compute MD5 digest
3. Store in attachment store (if not exists)
4. Update document with attachment metadata
5. Increment document revision
Download handling:
GET /{db}/{doc_id}/{attachment_name}
Steps:
1. Get document
2. Find attachment metadata
3. Look up in attachment store by digest
4. Stream binary data with correct Content-Type
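For large attachments, the download path should stream in fixed-size chunks rather than buffering the whole file. A C sketch, assuming the HTTP headers have already been sent on client_fd:

#include <stdio.h>
#include <unistd.h>

int stream_attachment(FILE *src, int client_fd)
{
    char buf[16 * 1024];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, src)) > 0) {
        if (write(client_fd, buf, n) != (ssize_t)n)
            return -1;   /* client went away */
    }
    return ferror(src) ? -1 : 0;
}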
Think about:
- What if the same file is attached to multiple documents?
- How do you handle very large files (1GB+)?
- What about range requests for partial downloads?
- How do attachments interact with replication?
Learning milestones:
- Attachments upload and download → You understand binary storage
- Content-Type is correct → You understand MIME handling
- Large files stream efficiently → You understand I/O optimization
- Attachments replicate → You understand binary replication
Project 13: Full-Text Search Integration
- File: LEARN_COUCHDB_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Information Retrieval / Search
- Software or Tool: Building: Search Index
- Main Book: “Introduction to Information Retrieval” by Manning et al.
What you’ll build: A full-text search engine integrated with your document database, enabling queries like “find documents containing ‘machine learning’”.
Why it teaches CouchDB: CouchDB integrates with Lucene for full-text search. While MapReduce views handle structured queries, text search requires inverted indexes and relevance scoring. This completes your database’s query capabilities.
Core challenges you’ll face:
- Text tokenization and stemming → maps to text processing
- Inverted index construction → maps to search data structures
- TF-IDF scoring → maps to relevance ranking
- Incremental index updates → maps to real-time search
Key Concepts:
- Inverted Indexes: “Introduction to Information Retrieval” Chapter 1 - Manning et al.
- TF-IDF Scoring: “Introduction to Information Retrieval” Chapter 6 - Manning et al.
- Tokenization: Splitting text into searchable terms
- Search Integration: CouchDB Search
Difficulty: Advanced
Time estimate: 2 weeks
Prerequisites:
- Project 3 completed (B-tree storage)
- Project 5 completed (changes feed for updates)
- Basic text processing knowledge
Real world outcome:
# Create search index
$ curl -X PUT http://localhost:5984/mydb/_design/search \
-H "Content-Type: application/json" \
-d '{
"indexes": {
"content": {
"analyzer": "standard",
"index": "function(doc) {
if(doc.body) index(\"body\", doc.body);
if(doc.title) index(\"title\", doc.title, {boost: 2.0});
}"
}
}
}'
# Search
$ curl "http://localhost:5984/mydb/_design/search/_search/content?q=machine+learning"
{
"total_rows": 15,
"rows": [
{"id":"doc42","order":[1.234],"fields":{}},
{"id":"doc17","order":[0.987],"fields":{}},
...
]
}
# Search with highlighting
$ curl "http://localhost:5984/mydb/_design/search/_search/content?q=title:database&include_docs=true&highlights=3"
{
"rows": [
{
"id": "doc5",
"doc": {"_id":"doc5","title":"Database Systems","body":"..."},
"highlights": {"title":["<em>Database</em> Systems"]}
}
]
}
# Faceted search
$ curl "http://localhost:5984/mydb/_design/search/_search/content?q=*:*&counts=[\"category\"]"
{
"total_rows": 1000,
"counts": {
"category": {
"tech": 450,
"science": 320,
"arts": 230
}
}
}
Implementation Hints:
Inverted index structure:
term -> [(doc_id, positions, tf), ...]
Example for "database":
"database" -> [
(doc5, [0, 45], 2), // appears at positions 0 and 45, 2 times
(doc12, [23], 1),
...
]
Index function execution:
For each document:
run index function
for each index(field, value, options) call:
tokens = tokenize(value)
for token in tokens:
inverted_index[token].append((doc.id, position, options))
TF-IDF scoring:
TF(term, doc) = frequency of term in document
IDF(term) = log(total_docs / docs_containing_term)
Score = TF * IDF
Higher score = term appears often in this doc, rarely overall
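The scoring formula translates directly into C (smoothing and document-length normalization are left out for clarity):

#include <math.h>

/* Classic TF-IDF weight for one term in one document. */
double tf_idf(double term_freq_in_doc, double total_docs,
              double docs_containing_term)
{
    if (docs_containing_term == 0) return 0.0;
    double idf = log(total_docs / docs_containing_term);
    return term_freq_in_doc * idf;
}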
Query processing:
search(query):
terms = parse_and_tokenize(query)
for term in terms:
postings = inverted_index[term]
for (doc_id, positions, tf) in postings:
score = calculate_tfidf(term, doc_id)
accumulator[doc_id] += score
return sorted(accumulator.items(), by=score, descending)
Think about:
- How do you handle phrases like “machine learning”?
- How do you update the index incrementally?
- What’s the storage overhead for positions?
- How do you handle boolean queries (AND, OR, NOT)?
Learning milestones:
- Basic term search works → You understand inverted indexes
- Relevance scoring is meaningful → You understand TF-IDF
- Index updates incrementally → You understand real-time search
- Advanced queries work → You understand query parsing
Project 14: Complete CouchDB Clone
- File: LEARN_COUCHDB_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++
- Coolness Level: Level 5: Pure Magic
- Business Potential: 5. The “Industry Disruptor”
- Difficulty: Level 5: Master
- Knowledge Area: Complete Database Systems
- Software or Tool: Building: Document Database
- Main Book: “Database Internals” by Alex Petrov + “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A complete CouchDB-compatible database integrating all previous projects: append-only storage, HTTP API, MVCC, views, replication, auth, and more.
Why this is the ultimate goal: This is the capstone. You’ll integrate every component into a cohesive, production-quality system. The challenges of making components work together teach you as much as building them individually.
Core challenges you’ll face:
- Component integration → maps to systems architecture
- API compatibility → maps to protocol implementation
- Performance tuning → maps to system optimization
- Error handling across layers → maps to reliability engineering
Key Concepts:
- All concepts from previous projects
- Systems Integration: Making components work together
- API Compatibility: Matching CouchDB’s behavior
- Testing: Ensuring correctness and performance
Difficulty: Master Time estimate: 1 month+ Prerequisites:
- All previous projects completed (or most of them)
- Strong systems programming skills
- Patience and determination
Real world outcome:
$ ./minicouch --data-dir /var/minicouch --port 5984
MiniCouch v1.0 - CouchDB-compatible Document Database
Storage: Append-only B-tree
API: HTTP/1.1 REST
Auth: Cookie + Basic
Features: MVCC, Views, Replication
Server ready at http://localhost:5984
# Verify compatibility with CouchDB tools
$ npm install -g pouchdb-server
$ pouchdb-server --port 5985
# Replicate from MiniCouch to PouchDB
$ curl -X POST http://localhost:5984/_replicate \
-d '{"source":"http://localhost:5984/mydb","target":"http://localhost:5985/mydb"}'
{"ok":true,"docs_written":1000}
# Run CouchDB test suite
$ ./run_couchdb_tests http://localhost:5984
Running 150 compatibility tests...
✓ Server info
✓ Database creation
✓ Document CRUD
✓ Revision handling
✓ Views and MapReduce
✓ Changes feed
✓ Replication
✓ Authentication
...
148/150 tests passed (98.7% compatible)
Implementation Hints:
System architecture:
┌─────────────────────────────────────────────────────────────────┐
│ HTTP Server │
│ (Routes, Auth, Request/Response) │
└─────────────────────────────────────────────────────────────────┘
│
┌───────────────────┼───────────────────┐
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────────────┐
│ Document Store │ │ View Engine │ │ Replicator │
│ (MVCC, Revs) │ │ (MapReduce) │ │ (Sync Protocol) │
└─────────────────┘ └─────────────────┘ └─────────────────────────┘
│ │ │
└───────────────────┼───────────────────┘
▼
┌─────────────────────────────────────────────────────────────────┐
│ Storage Engine │
│ (Append-only B-tree, Compaction, Crash Recovery) │
└─────────────────────────────────────────────────────────────────┘
Compatibility checklist:
- GET / returns CouchDB-like welcome
- PUT /{db} creates database
- Document CRUD with _id and _rev
- Conflicts detected and stored
- MapReduce views work
- Changes feed with all modes
- Replication protocol compatible
- Authentication (cookie + basic)
- Mango queries
- Attachments
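For the first checklist item, real CouchDB answers GET / with a JSON object containing "couchdb": "Welcome" plus a "version" field. A minimal sketch of producing that body; everything beyond those two fields (the vendor name, the version string) is an assumption for MiniCouch:
#include <stdio.h>

/* Body for GET / — a CouchDB-like welcome. The server would send this with
 * "HTTP/1.1 200 OK" and "Content-Type: application/json". */
static int welcome_json(char *buf, size_t buflen)
{
    return snprintf(buf, buflen,
        "{\"couchdb\":\"Welcome\","
        "\"version\":\"1.0.0\","
        "\"vendor\":{\"name\":\"MiniCouch\"}}");
}

int main(void)
{
    char body[256];
    welcome_json(body, sizeof body);
    printf("%s\n", body);
    return 0;
}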
Testing strategy:
1. Unit tests for each component
2. Integration tests for component interaction
3. Compatibility tests against CouchDB spec
4. Stress tests for performance
5. Chaos tests for reliability (kill -9, network partition)
Learning milestones:
- All components integrate → You understand systems architecture
- CouchDB tools work with it → You’ve achieved compatibility
- It survives crashes → You’ve achieved durability
- It handles concurrent load → You’ve achieved performance
- You can explain every decision → You’ve mastered the domain
Project Comparison Table
| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| 1. JSON Parser | Intermediate | 1 week | ⭐⭐⭐ | ⭐⭐⭐ |
| 2. Document Store with Revisions | Intermediate | 1 week | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| 3. Append-Only B-Tree | Expert | 2-3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 4. HTTP REST API | Advanced | 2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 5. Changes Feed | Advanced | 1 week | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| 6. MapReduce Views | Expert | 3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 7. Compaction | Advanced | 1-2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| 8. Mango Queries | Advanced | 2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 9. Replication | Expert | 3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 10. Authentication | Advanced | 1-2 weeks | ⭐⭐⭐ | ⭐⭐⭐ |
| 11. Clustering | Master | 1 month+ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 12. Attachments | Intermediate | 1 week | ⭐⭐⭐ | ⭐⭐⭐ |
| 13. Full-Text Search | Advanced | 2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 14. Complete Clone | Master | 1 month+ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
Recommended Learning Path
Based on building a CouchDB clone from scratch in C:
Phase 1: Foundations (Weeks 1-2)
Start here to understand the basics.
- Project 1: JSON Parser - The foundation of document storage
- Project 2: Document Store with Revisions - Understand MVCC
Phase 2: Storage Engine (Weeks 3-5)
Build the crash-proof core.
- Project 3: Append-Only B-Tree - CouchDB’s secret weapon
- Project 7: Compaction - Reclaim space
Phase 3: API Layer (Weeks 6-8)
Make it accessible.
- Project 4: HTTP REST API - CouchDB’s interface
- Project 5: Changes Feed - Enable sync
Phase 4: Querying (Weeks 9-12)
Make data queryable.
- Project 6: MapReduce Views - Powerful indexing
- Project 8: Mango Queries - JSON-style queries
Phase 5: Distribution (Weeks 13-16)
Go distributed.
- Project 9: Replication - Multi-master sync
- Project 10: Authentication - Production security
Phase 6: Advanced (Weeks 17+)
Complete the picture.
- Project 11: Clustering - Horizontal scaling
- Project 12: Attachments - Binary storage
- Project 13: Full-Text Search - Text queries
- Project 14: Complete Clone - Integration
Summary
| # | Project Name | Main Programming Language |
|---|---|---|
| 1 | JSON Parser and Document Model | C |
| 2 | In-Memory Document Store with Revisions | C |
| 3 | Append-Only B-Tree Storage Engine | C |
| 4 | HTTP REST API Server | C |
| 5 | Sequence Index and Changes Feed | C |
| 6 | MapReduce View Engine | C |
| 7 | Database Compaction | C |
| 8 | Mango Query Engine | C |
| 9 | Document Replication | C |
| 10 | Authentication and Security | C |
| 11 | Clustering and Sharding | C |
| 12 | Attachments and Binary Storage | C |
| 13 | Full-Text Search Integration | C |
| 14 | Complete CouchDB Clone | C |
Key Resources
Books
- “Designing Data-Intensive Applications” by Martin Kleppmann - Distributed data fundamentals
- “Database Internals” by Alex Petrov - How databases work under the hood
- “CouchDB: The Definitive Guide” by J. Chris Anderson, Jan Lehnardt, and Noah Slater - The original CouchDB book
- “C Programming: A Modern Approach” by K. N. King - Essential C reference
- “The Linux Programming Interface” by Michael Kerrisk - Systems programming
Online Resources
- CouchDB Official Documentation - Complete API reference
- CouchDB Guide - The “Relax” book online
- The Power of B-trees - CouchDB’s B-tree explained
- CouchDB Replication Protocol - Sync protocol spec
- jsmn JSON Parser - Minimal C JSON library
Related Projects
- PouchDB - JavaScript CouchDB implementation
- cbt - MVCC append-only B-tree in Erlang
- Couchstore - Couchbase’s storage engine
You’re ready to build a CouchDB clone from scratch. Start with Project 1 and work your way through. By the end, you’ll understand document databases at the level of the engineers who built CouchDB, PouchDB, and MongoDB.