LEARN COUCHDB DEEP DIVE
Learn CouchDB: From Zero to Document Database Master
Goal: Deeply understand Apache CouchDB—from JSON document storage to building a complete document-oriented database with append-only B-trees, RESTful HTTP API, MapReduce views, MVCC, and multi-master replication from scratch in C.
Why CouchDB Matters
CouchDB pioneered several revolutionary database concepts that are now industry standards:
- “Relax”: A database that never corrupts, even during crashes or power failures
- HTTP all the way down: Every operation is a REST API call
- Eventual consistency: Multi-master replication without coordination
- Append-only storage: Never overwrites data, enabling crash-proof durability
- MapReduce views: Pre-computed indexes using JavaScript functions
After completing these projects, you will:
- Understand the append-only B-tree design that makes CouchDB crash-proof
- Know how MVCC enables concurrent reads without locks
- Build a RESTful HTTP API that serves JSON documents
- Implement MapReduce views with incremental updates
- Master conflict-free replication between database instances
- Have built a working document database from scratch in C
Core Concept Analysis
CouchDB Architecture Overview
┌─────────────────────────────────────────────────────────────────────┐
│ HTTP REST API │
│ GET /db/doc PUT /db/doc DELETE /db/doc POST /db/_find │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Document Layer │
│ JSON Parsing │ Validation │ Revision Management │
└─────────────────────────────────────────────────────────────────────┘
│
┌───────────────────┼───────────────────┐
▼ ▼ ▼
┌─────────────────────┐ ┌─────────────────┐ ┌─────────────────────────┐
│ By-ID Index │ │ By-Seq Index │ │ MapReduce Views │
│ (doc_id → doc) │ │ (seq → doc_id) │ │ (user-defined indexes) │
└─────────────────────┘ └─────────────────┘ └─────────────────────────┘
│ │ │
└───────────────────┼───────────────────┘
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Append-Only B-Tree Engine │
│ New nodes written to end │ Never overwrites │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Database File (.couch) │
│ [Header][Data...][B-tree nodes...][Header][Data...][New Header] │
│ ↑ │
│ Always grows → │
└─────────────────────────────────────────────────────────────────────┘
Fundamental Concepts
- Document Model
- Documents are JSON objects with _id and _rev fields
- _id: Unique identifier (user-provided or auto-generated UUID)
- _rev: Revision string (e.g., "1-abc123") for conflict detection
- Schema-less: Any valid JSON structure is allowed
- Append-Only B-Tree
- Modified nodes are written to the end of the file, never in-place
- Old versions remain on disk until compaction
- Root pointer stored in file footer (last 4KB)
- Crash recovery: Scan backward to find last valid header
- Multi-Version Concurrency Control (MVCC)
- Each update creates a new revision
- Readers see a consistent snapshot (point-in-time)
- Writers never block readers
- Conflicts detected via the _rev field during updates
- Revision Tree
  Document: "user_123"
  1-abc (initial)
    │
  2-def (update)
    │
    ├── 3-ghi (latest winning revision)
    └── 3-xyz (conflict branch from replica)
- MapReduce Views
- Map function: Emits key-value pairs for each document
- Reduce function: Aggregates values by key
- Stored in separate B-tree, incrementally updated
- Only processes documents changed since last query
- Changes Feed
- Sequence number increments with every change
- The _changes API returns all changes since a given sequence
- Enables efficient incremental replication
- Replication Protocol
- Compare revision trees between source and target
- Transfer only missing revisions
- Deterministic conflict resolution (same winner everywhere)
- Bidirectional: push + pull = full sync
Project List
Projects are ordered from fundamental understanding to advanced implementations. Each project builds on the previous ones.
Project 1: JSON Parser and Document Model
- File: LEARN_COUCHDB_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go, Zig
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Parsing / Data Structures / JSON
- Software or Tool: Building: JSON Library
- Main Book: “Writing a C Compiler” by Nora Sandler
What you’ll build: A complete JSON parser in C that can parse, modify, and serialize JSON documents—the foundation of any document database.
Why it teaches CouchDB: CouchDB stores everything as JSON. Before you can store documents, you need to parse them. This project teaches you tokenization, recursive descent parsing, and in-memory document representation—skills you’ll use throughout.
Core challenges you’ll face:
- Tokenizing JSON input → maps to lexical analysis
- Handling nested objects and arrays → maps to recursive parsing
- Memory management for dynamic structures → maps to ownership and lifetime
- Unicode and escape sequence handling → maps to string encoding
Key Concepts:
- Recursive Descent Parsing: “Writing a C Compiler” Chapter 1 - Nora Sandler
- Memory Management in C: “C Programming: A Modern Approach” Chapter 17 - K. N. King
- JSON Specification: RFC 8259 - The JavaScript Object Notation (JSON) Data Interchange Format
- Tokenizer Design: jsmn - Minimal JSON Tokenizer
Resources for key challenges:
- A practical approach to write a simple JSON parser - Step-by-step implementation
Difficulty: Intermediate
Time estimate: 1 week
Prerequisites:
- Basic C programming (pointers, structs, dynamic memory)
- Understanding of string manipulation
- No prior parsing experience required
Real world outcome:
$ ./json_parser
> {"name": "Alice", "age": 30, "hobbies": ["reading", "coding"]}
Parsed successfully!
Type: Object
Keys: 3
Document structure:
{
"name": "Alice" (string)
"age": 30 (number)
"hobbies": [
"reading" (string)
"coding" (string)
] (array, 2 elements)
}
> json_get(doc, "hobbies[1]")
"coding"
> json_set(doc, "age", 31)
> json_serialize(doc)
{"name":"Alice","age":31,"hobbies":["reading","coding"]}
Implementation Hints:
JSON value types to support:
typedef enum {
JSON_NULL,
JSON_BOOL,
JSON_NUMBER,
JSON_STRING,
JSON_ARRAY,
JSON_OBJECT
} JsonType;
typedef struct JsonValue {
JsonType type;
union {
bool boolean;
double number;
char* string;
struct { struct JsonValue* items; size_t count; } array;
struct { char** keys; struct JsonValue* values; size_t count; } object;
};
} JsonValue;
Tokenizer produces tokens:
- Structural: { } [ ] : ,
- Strings: "hello", "with \"escapes\""
- Numbers: 123, -45.67, 1.2e10
- Keywords: true, false, null
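As a concrete sketch, the token kinds above could be modeled in C like this (the names are illustrative, not a required design):

#include <stddef.h>

typedef enum {
    TOK_LBRACE, TOK_RBRACE,      /* { } */
    TOK_LBRACKET, TOK_RBRACKET,  /* [ ] */
    TOK_COLON, TOK_COMMA,        /* : , */
    TOK_STRING, TOK_NUMBER,
    TOK_TRUE, TOK_FALSE, TOK_NULL,
    TOK_EOF, TOK_ERROR
} TokenType;

typedef struct {
    TokenType   type;
    const char *start;   /* points into the input buffer */
    size_t      length;  /* lexeme length; no copy needed while scanning */
} Token;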
Parsing approach:
parse_value():
switch (current_token):
case '{': return parse_object()
case '[': return parse_array()
case STRING: return parse_string()
case NUMBER: return parse_number()
case TRUE/FALSE: return parse_bool()
case NULL: return parse_null()
parse_object():
expect '{'
while not '}':
key = parse_string()
expect ':'
value = parse_value()
add_to_object(key, value)
if not ',': break
expect '}'
Think about:
- How do you handle "\u0041" (Unicode escape)?
- How do you differentiate 123 from "123"?
- What’s your memory ownership model? Who frees what?
Learning milestones:
- Tokenizer produces correct tokens → You understand lexical analysis
- Nested structures parse correctly → You understand recursive parsing
- Round-trip works (parse → serialize → parse) → Your parser is correct
- No memory leaks → You understand C memory management
Project 2: In-Memory Document Store with Revisions
- File: LEARN_COUCHDB_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go, C++
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Data Structures / MVCC / Versioning
- Software or Tool: Building: Document Store
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: An in-memory document store that tracks revisions (_rev), detects conflicts, and maintains document history—the core of CouchDB’s data model.
Why it teaches CouchDB: The _rev field is CouchDB’s magic. It enables optimistic concurrency control, conflict detection, and replication. Understanding revision trees is essential to understanding CouchDB.
Core challenges you’ll face:
- Generating revision IDs → maps to content-addressable hashing
- Revision tree management → maps to tree data structures
- Conflict detection on update → maps to optimistic locking
- Efficient document lookup → maps to hash table design
Key Concepts:
- MVCC Fundamentals: “Designing Data-Intensive Applications” Chapter 7 - Martin Kleppmann
- Hash Tables in C: “Mastering Algorithms with C” Chapter 8 - Kyle Loudon
- Content Hashing: “Designing Data-Intensive Applications” Chapter 5 - Martin Kleppmann
- Revision Trees: CouchDB Replication and Conflict Model
Difficulty: Intermediate
Time estimate: 1 week
Prerequisites:
- Project 1 completed (JSON parser)
- Understanding of hash tables
- Basic tree data structures
Real world outcome:
$ ./docstore
docstore> PUT user_123 {"name": "Alice", "age": 30}
{
"_id": "user_123",
"_rev": "1-abc123def456",
"name": "Alice",
"age": 30
}
Created.
docstore> GET user_123
{
"_id": "user_123",
"_rev": "1-abc123def456",
"name": "Alice",
"age": 30
}
docstore> PUT user_123 {"_rev": "1-abc123def456", "name": "Alice", "age": 31}
{
"_id": "user_123",
"_rev": "2-789ghi012jkl",
"name": "Alice",
"age": 31
}
Updated.
docstore> PUT user_123 {"_rev": "1-abc123def456", "name": "Alicia", "age": 30}
Error: Conflict. Document has been modified.
Current revision: 2-789ghi012jkl
Your revision: 1-abc123def456
docstore> REVS user_123
Revision tree for user_123:
1-abc123def456 (superseded)
└── 2-789ghi012jkl (current)
Implementation Hints:
Revision ID format: {revision_number}-{md5_of_content}
Generate revision:
1. Serialize document to JSON (without _id, _rev)
2. Compute MD5 hash of JSON string
3. Increment revision number
4. Format: "{rev_num}-{first_16_chars_of_md5}"
Example: "2-a1b2c3d4e5f6a7b8"
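A minimal C sketch of this formatting step, assuming OpenSSL is available for MD5 (the function name make_rev_id is hypothetical):

#include <stdio.h>
#include <string.h>
#include <openssl/md5.h>   /* assumes OpenSSL's legacy MD5() is available */

/* Sketch: build "{rev_num}-{first 16 hex chars of md5(body)}".
   Names are illustrative, not CouchDB's exact scheme. */
void make_rev_id(unsigned rev_num, const char *body, char out[64])
{
    unsigned char digest[MD5_DIGEST_LENGTH];
    char hex[2 * MD5_DIGEST_LENGTH + 1];

    MD5((const unsigned char *)body, strlen(body), digest);
    for (int i = 0; i < MD5_DIGEST_LENGTH; i++)
        sprintf(hex + 2 * i, "%02x", digest[i]);

    snprintf(out, 64, "%u-%.16s", rev_num, hex);   /* e.g. "2-a1b2c3d4e5f6a7b8" */
}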
Document structure:
typedef struct Revision {
char* rev_id; // "1-abc123"
JsonValue* doc; // The document content
struct Revision* parent; // Previous revision
struct Revision** children; // Branches (conflicts)
size_t child_count;
bool deleted; // Tombstone flag
} Revision;
typedef struct Document {
char* id; // _id
Revision* rev_tree; // Root of revision tree
Revision* winner; // Current winning revision
} Document;
Conflict detection:
on_update(doc_id, new_content, provided_rev):
current = get_document(doc_id)
if current.winner.rev_id != provided_rev:
return CONFLICT_ERROR
new_rev = create_revision(new_content, current.winner)
current.winner = new_rev
return SUCCESS
Think about:
- What makes a revision “win” when there are conflicts?
- How do you handle DELETE? (Hint: tombstone with _deleted: true)
- What if someone provides no _rev for an update?
Learning milestones:
- Documents have _id and _rev → You understand the document model
- Updates require correct _rev → You understand optimistic locking
- Revision history is preserved → You understand MVCC
- Conflicts are detected → You’re ready for replication
Project 3: Append-Only B-Tree Storage Engine
- File: LEARN_COUCHDB_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 4: Expert
- Knowledge Area: Storage Engines / B-Trees / Durability
- Software or Tool: Building: Append-Only B-Tree
- Main Book: “Database Internals” by Alex Petrov
What you’ll build: An append-only B-tree that never overwrites data—the crash-proof storage engine that makes CouchDB reliable.
Why it teaches CouchDB: This is CouchDB’s secret weapon. By never overwriting data, the database file is always consistent. Crashes during writes simply leave incomplete data at the end, which is ignored on recovery. This project teaches you how to build truly crash-proof storage.
Core challenges you’ll face:
- Copy-on-write B-tree updates → maps to immutable data structures
- Root pointer management in file footer → maps to atomic commits
- Crash recovery by scanning backward → maps to durability guarantees
- Space efficiency with node reuse → maps to storage optimization
Key Concepts:
- Append-Only B-Trees: CouchDB - The Power of B-trees
- Copy-on-Write: “Database Internals” Chapter 4 - Alex Petrov
- File Footer Design: Couchstore File Format
- B+Tree Structure: “Database Internals” Chapter 2 - Alex Petrov
Resources for key challenges:
- cbt - MVCC Append-Only B-Tree - Reference implementation
Difficulty: Expert
Time estimate: 2-3 weeks
Prerequisites:
- Understanding of B-trees
- File I/O in C
- Binary data handling
Real world outcome:
$ ./appendonly_btree test.db
btree> INSERT "user_001" "Alice"
Inserted at offset 4096
New root at offset 4096
btree> INSERT "user_002" "Bob"
Inserted at offset 8192
New root at offset 8192 (copied, old root still at 4096)
btree> GET "user_001"
"Alice"
# Simulate crash during write
btree> INSERT "user_003" "Carol"
Writing... [CRASH SIMULATED]
$ ./appendonly_btree test.db
Recovering...
Scanning backward for valid header...
Found valid header at offset 8192
Recovery complete. 2 keys in database.
btree> GET "user_003"
Not found (incomplete write was discarded)
btree> GET "user_001"
"Alice" (previous data intact!)
Implementation Hints:
File structure:
┌─────────────────────────────────────────────────────────────────┐
│ Offset 0: [File Header - 4KB] │
│ Magic number, version, etc. │
├─────────────────────────────────────────────────────────────────┤
│ Offset 4096: [B-tree Node 1] [B-tree Node 2] [Node 3] ... │
│ (nodes appended sequentially) │
├─────────────────────────────────────────────────────────────────┤
│ End-4KB: [Active Header - 4KB] │
│ Root pointer, doc count, sequence number │
│ Written twice (2KB + 2KB) for redundancy │
└─────────────────────────────────────────────────────────────────┘
Copy-on-write update:
To update key K with value V:
1. Read path from root to leaf containing K
2. Create NEW leaf node with updated K→V
3. Create NEW internal nodes pointing to new leaf
4. Create NEW root pointing to new internal nodes
5. Append all new nodes to end of file
6. Write new header with new root pointer
7. fsync()
Old nodes remain on disk (for old snapshots and crash safety)
Header commit (atomic):
1. Write all data and new B-tree nodes
2. fsync() data
3. Write header copy 1 (first 2KB of footer)
4. Write header copy 2 (second 2KB of footer)
5. fsync() footer
On recovery:
- Read both header copies
- If they match and checksum valid: use it
- If only first valid: use first (crash during step 4)
- If neither valid: scan backward for previous header
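A sketch of what the header record and the double-write commit might look like in C; the field names and the 2KB offset of the second copy are assumptions matching the layout described above:

#include <stdint.h>
#include <unistd.h>

/* Sketch of an on-disk header record; checksum computation is omitted. */
typedef struct {
    uint32_t magic;        /* marks a valid header block */
    uint32_t version;
    uint64_t root_offset;  /* file offset of the current B-tree root */
    uint64_t doc_count;
    uint64_t update_seq;   /* last committed sequence number */
    uint32_t checksum;     /* CRC over the fields above */
} DbHeader;

/* Write the header twice into the 4KB footer, then fsync. */
int commit_header(int fd, const DbHeader *h, off_t footer_off)
{
    if (pwrite(fd, h, sizeof *h, footer_off) != (ssize_t)sizeof *h) return -1;
    if (pwrite(fd, h, sizeof *h, footer_off + 2048) != (ssize_t)sizeof *h) return -1;
    return fsync(fd);
}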
Think about:
- Why write the header twice?
- How do you find the previous valid header after a crash?
- What’s the space overhead of copy-on-write?
Learning milestones:
- B-tree operations work → You understand B-tree algorithms
- Updates append, never overwrite → You understand copy-on-write
- Database survives kill -9 → You’ve achieved crash safety
- Recovery finds valid state → You understand header scanning
Project 4: HTTP REST API Server
- File: LEARN_COUCHDB_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go, Zig
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: Networking / HTTP / REST APIs
- Software or Tool: Building: HTTP Server
- Main Book: “The Linux Programming Interface” by Michael Kerrisk
What you’ll build: An HTTP/1.1 server that exposes your document store via REST API—just like CouchDB’s native interface.
Why it teaches CouchDB: CouchDB is “HTTP all the way down.” Every operation—creating databases, inserting documents, querying views—is an HTTP request. Building this teaches you network programming and API design.
Core challenges you’ll face:
- TCP socket programming → maps to network fundamentals
- HTTP request parsing → maps to protocol implementation
- Routing requests to handlers → maps to API design
- Concurrent connection handling → maps to server architecture
Key Concepts:
- Socket Programming: “The Linux Programming Interface” Chapter 56-61 - Michael Kerrisk
- HTTP/1.1 Protocol: RFC 7230-7235
- REST API Design: CouchDB API Reference
- Beej’s Guide to Network Programming: Essential socket programming reference
Resources for key challenges:
Difficulty: Advanced
Time estimate: 2 weeks
Prerequisites:
- Projects 1-2 completed
- Basic understanding of TCP/IP
- Familiarity with HTTP
Real world outcome:
$ ./couchdb_server --port 5984
CouchDB-like server starting on http://localhost:5984
Ready to accept connections...
# In another terminal:
$ curl http://localhost:5984/
{"couchdb":"Welcome","version":"0.1.0"}
$ curl -X PUT http://localhost:5984/mydb
{"ok":true}
$ curl -X PUT http://localhost:5984/mydb/doc1 \
-H "Content-Type: application/json" \
-d '{"name":"Alice","age":30}'
{"ok":true,"id":"doc1","rev":"1-abc123"}
$ curl http://localhost:5984/mydb/doc1
{"_id":"doc1","_rev":"1-abc123","name":"Alice","age":30}
$ curl -X PUT http://localhost:5984/mydb/doc1 \
-H "Content-Type: application/json" \
-d '{"_rev":"wrong","name":"Alice","age":31}'
{"error":"conflict","reason":"Document update conflict."}
$ curl -X DELETE http://localhost:5984/mydb/doc1?rev=1-abc123
{"ok":true,"id":"doc1","rev":"2-def456"}
Implementation Hints:
CouchDB REST API endpoints:
Server:
GET / → Server info
GET /_all_dbs → List all databases
Database:
PUT /{db} → Create database
GET /{db} → Database info
DELETE /{db} → Delete database
Documents:
POST /{db} → Create document (auto-ID)
PUT /{db}/{docid} → Create/update document
GET /{db}/{docid} → Read document
DELETE /{db}/{docid}?rev= → Delete document
Special:
GET /{db}/_all_docs → List all document IDs
GET /{db}/_changes → Changes feed
HTTP request structure:
GET /mydb/doc1 HTTP/1.1
Host: localhost:5984
Accept: application/json
Response:
HTTP/1.1 200 OK
Content-Type: application/json
ETag: "1-abc123"
{"_id":"doc1","_rev":"1-abc123","name":"Alice"}
Server architecture:
main():
socket = create_tcp_socket(5984)
while true:
client = accept(socket)
fork() or thread: // Handle concurrently
request = parse_http_request(client)
response = route_and_handle(request)
send_http_response(client, response)
close(client)
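A minimal POSIX sketch of the socket setup used by the loop above (error handling abbreviated):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a TCP listening socket bound to the given port. */
int listen_on(uint16_t port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int on = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof on);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) != 0) return -1;
    if (listen(fd, 128) != 0) return -1;
    return fd;   /* pass to accept() in the main loop */
}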
Think about:
- How do you parse the URL path (/mydb/doc1)?
- How do you handle ?rev= query parameters?
- What HTTP status codes should you return? (200, 201, 404, 409, etc.)
- How do you handle large request bodies?
Learning milestones:
- Server accepts connections → You understand sockets
- HTTP parsing works → You understand the HTTP protocol
- CRUD operations via curl work → You’ve built a REST API
- Multiple clients work concurrently → You understand server architecture
Project 5: Sequence Index and Changes Feed
- File: LEARN_COUCHDB_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Event Sourcing / Change Tracking
- Software or Tool: Building: Changes Feed
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A sequence-based index that tracks every change, plus a _changes API for real-time and historical change streaming.
Why it teaches CouchDB: The changes feed is the foundation of CouchDB replication. Every modification gets a sequence number. Clients can ask “give me all changes since sequence X” to sync efficiently. This is event sourcing at the database level.
Core challenges you’ll face:
- Monotonic sequence number generation → maps to event ordering
- By-sequence index maintenance → maps to secondary indexing
- Long-polling for real-time changes → maps to push notifications
- Efficient “since” queries → maps to incremental sync
Key Concepts:
- Changes Feed: CouchDB Changes API
- Event Sourcing: “Designing Data-Intensive Applications” Chapter 11 - Martin Kleppmann
- Long Polling: HTTP technique for server push
- Sequence Indexes: Anatomy of the CouchDB Changes Feed
Difficulty: Advanced
Time estimate: 1 week
Prerequisites:
- Project 3 completed (storage engine)
- Project 4 completed (HTTP server)
- Understanding of secondary indexes
Real world outcome:
$ curl "http://localhost:5984/mydb/_changes"
{
"results": [
{"seq":1,"id":"doc1","changes":[{"rev":"1-abc"}]},
{"seq":2,"id":"doc2","changes":[{"rev":"1-def"}]},
{"seq":3,"id":"doc1","changes":[{"rev":"2-ghi"}]}
],
"last_seq": 3
}
$ curl "http://localhost:5984/mydb/_changes?since=2"
{
"results": [
{"seq":3,"id":"doc1","changes":[{"rev":"2-ghi"}]}
],
"last_seq": 3
}
# Long-polling (waits for new changes)
$ curl "http://localhost:5984/mydb/_changes?feed=longpoll&since=3"
# ... request hangs until a change occurs ...
# In another terminal:
$ curl -X PUT http://localhost:5984/mydb/doc3 -d '{"x":1}'
# First terminal receives:
{
"results": [
{"seq":4,"id":"doc3","changes":[{"rev":"1-xyz"}]}
],
"last_seq": 4
}
# Continuous feed (streaming)
$ curl "http://localhost:5984/mydb/_changes?feed=continuous&since=0"
{"seq":1,"id":"doc1","changes":[{"rev":"1-abc"}]}
{"seq":2,"id":"doc2","changes":[{"rev":"1-def"}]}
{"seq":3,"id":"doc1","changes":[{"rev":"2-ghi"}]}
{"seq":4,"id":"doc3","changes":[{"rev":"1-xyz"}]}
# ... stays open, new lines appear as changes happen ...
Implementation Hints:
Data structures:
Global sequence counter (per database):
uint64_t current_seq = 0;
On every document change:
current_seq++;
record = { seq: current_seq, doc_id, rev_id }
by_seq_index.insert(current_seq, record)
By-sequence index (separate B-tree):
key: sequence number
value: { doc_id, rev_id, deleted }
Changes feed modes:
Normal (feed=normal, default):
- Return all changes since `since` parameter
- Return immediately, even if empty
Long-poll (feed=longpoll):
- If changes exist since `since`, return immediately
- If no changes, wait (block) until one occurs
- Return after first change (or timeout)
Continuous (feed=continuous):
- Stream changes as newline-delimited JSON
- Never close connection
- Send heartbeat every N seconds to keep alive
Implementation:
handle_changes(since, feed_type):
if feed_type == "normal":
return get_all_changes_since(since)
if feed_type == "longpoll":
changes = get_all_changes_since(since)
if changes:
return changes
wait_for_change(timeout=60) // Block here
return get_all_changes_since(since)
if feed_type == "continuous":
send_all_changes_since(since)
while true:
wait_for_change()
send_new_change()
send_heartbeat() // Empty line every 10s
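One way to implement the blocking wait is a condition variable shared between writers and long-poll handlers. A sketch assuming POSIX threads (the ChangeSignal type and names are illustrative):

#include <pthread.h>
#include <stdint.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  changed;
    uint64_t        current_seq;
} ChangeSignal;

/* Called by the write path after bumping the database sequence. */
void notify_change(ChangeSignal *cs, uint64_t new_seq)
{
    pthread_mutex_lock(&cs->lock);
    cs->current_seq = new_seq;
    pthread_cond_broadcast(&cs->changed);   /* wake all long-poll waiters */
    pthread_mutex_unlock(&cs->lock);
}

/* Block until a change newer than `since` exists (timeout not shown). */
uint64_t wait_for_change(ChangeSignal *cs, uint64_t since)
{
    pthread_mutex_lock(&cs->lock);
    while (cs->current_seq <= since)
        pthread_cond_wait(&cs->changed, &cs->lock);
    uint64_t seq = cs->current_seq;
    pthread_mutex_unlock(&cs->lock);
    return seq;
}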
Think about:
- What happens if two changes happen while client is disconnected?
- How do you implement the “wait” efficiently? (Condition variable, poll)
- How do you handle the include_docs=true parameter?
Learning milestones:
- Sequence numbers increment correctly → You understand change tracking
- By-seq index works → You understand secondary indexes
- Normal feed returns history → You understand incremental queries
- Long-poll waits and notifies → You understand server push
Project 6: MapReduce View Engine
- File: LEARN_COUCHDB_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 4: Expert
- Knowledge Area: MapReduce / Indexing / Query Processing
- Software or Tool: Building: View Engine
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A MapReduce view engine that runs user-defined JavaScript map/reduce functions, maintains B-tree indexes, and updates incrementally.
Why it teaches CouchDB: Views are how you query CouchDB beyond simple ID lookups. The map function transforms documents into key-value pairs; the reduce aggregates them. Understanding incremental index updates is key to CouchDB’s performance.
Core challenges you’ll face:
- Embedding a JavaScript engine → maps to language integration
- Incremental view updates → maps to efficient indexing
- B-tree storage for view results → maps to secondary indexes
- Reduce tree computation → maps to aggregation strategies
Key Concepts:
- MapReduce Views: CouchDB Introduction to Views
- Incremental Indexing: Finding Your Data with Views
- Reduce Trees: CouchDB Reduce Functions
- JavaScript in C: Libraries like Duktape, QuickJS, or MuJS
Resources for key challenges:
Difficulty: Expert
Time estimate: 3 weeks
Prerequisites:
- Project 3 completed (B-tree storage)
- Basic understanding of MapReduce
- Familiarity with JavaScript (for writing map functions)
Real world outcome:
# Create a design document with views
$ curl -X PUT http://localhost:5984/mydb/_design/app \
-H "Content-Type: application/json" \
-d '{
"views": {
"by_age": {
"map": "function(doc) { if(doc.age) emit(doc.age, doc.name); }"
},
"age_stats": {
"map": "function(doc) { if(doc.age) emit(null, doc.age); }",
"reduce": "_stats"
}
}
}'
{"ok":true,"id":"_design/app","rev":"1-abc"}
# Query the view
$ curl "http://localhost:5984/mydb/_design/app/_view/by_age"
{
"total_rows": 3,
"offset": 0,
"rows": [
{"id":"user2","key":25,"value":"Bob"},
{"id":"user1","key":30,"value":"Alice"},
{"id":"user3","key":35,"value":"Carol"}
]
}
# Query with range
$ curl "http://localhost:5984/mydb/_design/app/_view/by_age?startkey=28&endkey=32"
{
"rows": [
{"id":"user1","key":30,"value":"Alice"}
]
}
# Query with reduce
$ curl "http://localhost:5984/mydb/_design/app/_view/age_stats?reduce=true"
{
"rows": [
{"key":null,"value":{"sum":90,"count":3,"min":25,"max":35,"avg":30}}
]
}
Implementation Hints:
View architecture:
Design Document (_design/app):
{
"views": {
"view_name": {
"map": "function(doc) { emit(key, value); }",
"reduce": "_sum" // or "_count", "_stats", or custom function
}
}
}
View B-tree (separate file per view):
key: [emitted_key, doc_id] // Compound key for sorting
value: emitted_value
Map function execution:
For each document:
result = js_engine.call("map", doc)
for each emit(key, value) in result:
view_btree.insert([key, doc.id], value)
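Running the map function requires an embedded JavaScript engine. A minimal sketch using Duktape, one of the engines listed above; the wrapper script and variable names are illustrative assumptions, not CouchDB's actual view-server protocol:

#include <stdio.h>
#include "duktape.h"

int main(void)
{
    duk_context *ctx = duk_create_heap_default();

    /* Define emit(), install the user's map function, run it on one document,
       and return the emitted key-value pairs as a JSON string. */
    const char *script =
        "var emitted = [];"
        "function emit(k, v) { emitted.push([k, v]); }"
        "var map = function(doc) { if (doc.age) emit(doc.age, doc.name); };"
        "map(JSON.parse('{\"name\":\"Alice\",\"age\":30}'));"
        "JSON.stringify(emitted);";

    duk_eval_string(ctx, script);
    printf("emitted: %s\n", duk_get_string(ctx, -1));   /* [[30,"Alice"]] */

    duk_destroy_heap(ctx);
    return 0;
}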
Incremental updates:
view_update(view, last_seq):
changes = db.changes_since(view.last_indexed_seq)
for change in changes:
old_emits = view.get_by_docid(change.doc_id)
view.delete(old_emits) // Remove old entries
if not change.deleted:
doc = db.get(change.doc_id)
new_emits = run_map_function(doc)
for emit in new_emits:
view.insert([emit.key, doc.id], emit.value)
view.last_indexed_seq = changes.last_seq
Built-in reduce functions:
_count: Return count of values
_sum: Return sum of numeric values
_stats: Return {sum, count, min, max, avg}
Custom reduce:
function(keys, values, rereduce) {
if (rereduce) {
// values are previous reduce results
return values.reduce((a, b) => a + b, 0);
} else {
// values are from map function
return values.length;
}
}
Think about:
- How do you isolate the JavaScript sandbox?
- What if a map function throws an error?
- How do you handle the rereduce flag in reduce functions?
- When do you update the view? (On query, or in the background?)
Learning milestones:
- Map function emits work → You understand JS integration
- Views are queryable by key → You understand view B-trees
- Incremental updates work → You understand efficient indexing
- Reduce aggregates correctly → You understand reduce trees
Project 7: Database Compaction
- File: LEARN_COUCHDB_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Storage Management / Garbage Collection
- Software or Tool: Building: Compactor
- Main Book: “Database Internals” by Alex Petrov
What you’ll build: A compaction system that reclaims space from the append-only file by copying live data to a new file and discarding obsolete revisions.
Why it teaches CouchDB: Append-only storage is great for durability but terrible for space. Without compaction, your database file grows forever. CouchDB’s compaction copies only current revisions to a new file, then swaps it in atomically.
Core challenges you’ll face:
- Identifying live vs. dead data → maps to garbage identification
- Online compaction (while serving requests) → maps to concurrent operations
- Atomic file swap → maps to safe transitions
- Space estimation and scheduling → maps to resource management
Key Concepts:
- Log-Structured Storage Compaction: “Designing Data-Intensive Applications” Chapter 3 - Martin Kleppmann
- Online Compaction: Understanding Data Compaction
- Garbage Collection: “Database Internals” Chapter 5 - Alex Petrov
- Atomic File Operations: rename() syscall semantics
Difficulty: Advanced
Time estimate: 1-2 weeks
Prerequisites:
- Project 3 completed (append-only storage)
- Understanding of file system operations
- Concurrent programming basics
Real world outcome:
$ ls -lh mydb.couch
-rw-r--r-- 1 user staff 1.2G Dec 21 10:00 mydb.couch
$ curl -X POST http://localhost:5984/mydb/_compact
{"ok":true}
$ curl http://localhost:5984/mydb
{
"db_name": "mydb",
"doc_count": 10000,
"disk_size": 1288490188,
"compact_running": true,
...
}
# Wait for compaction to complete...
$ curl http://localhost:5984/mydb
{
"db_name": "mydb",
"doc_count": 10000,
"disk_size": 52428800,
"compact_running": false,
...
}
$ ls -lh mydb.couch
-rw-r--r-- 1 user staff 50M Dec 21 10:05 mydb.couch
# Space reclaimed! 1.2GB → 50MB
Implementation Hints:
Compaction algorithm:
compact(db_file):
new_file = create_temp_file("db.compact")
for doc_id in db.all_doc_ids():
doc = db.get(doc_id) // Gets winning revision only
if not doc.deleted:
new_file.insert(doc_id, doc)
// Copy views too (only current index entries)
for view in db.views:
for entry in view:
new_file.view_insert(entry)
// Atomic swap
rename(new_file, db_file) // Atomic on POSIX
// Old file is now unlinked
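The swap at the end relies on POSIX rename() being atomic. A small C sketch, with the directory fsync that makes the rename itself durable (the paths are illustrative parameters):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Atomically replace the live database file with the compacted one. */
int swap_in_compacted(const char *compact_path, const char *db_path,
                      const char *dir_path)
{
    if (rename(compact_path, db_path) != 0) return -1;   /* atomic replace */

    int dfd = open(dir_path, O_RDONLY | O_DIRECTORY);
    if (dfd < 0) return -1;
    int rc = fsync(dfd);   /* persist the directory entry change */
    close(dfd);
    return rc;
}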
Online compaction (while serving reads):
During compaction:
- Reads go to OLD file (consistent snapshot)
- Writes go to BOTH old file AND track in memory
After compaction:
- Apply pending writes to new file
- Atomic swap
- New reads go to new file
What to keep vs. discard:
KEEP:
- Current winning revision of each document
- Conflict revisions (still unresolved)
- Local documents (_local/*)
DISCARD:
- Old revisions (superseded)
- Deleted document tombstones older than threshold
- Orphaned B-tree nodes
Think about:
- What if a write happens during compaction?
- How do you handle views during compaction?
- What’s the disk space requirement? (2x temporarily)
- How do you schedule compaction? (Fragmentation threshold)
Learning milestones:
- Compaction creates smaller file → You understand garbage collection
- No data is lost → You understand live data identification
- Swap is atomic → You understand safe transitions
- Works while serving requests → You understand online compaction
Project 8: Mango Query Engine (JSON Queries)
- File: LEARN_COUCHDB_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: Query Processing / Indexing
- Software or Tool: Building: Query Engine
- Main Book: “Database Internals” by Alex Petrov
What you’ll build: A Mango-style query engine that accepts JSON selector queries (like MongoDB) and executes them efficiently using indexes.
Why it teaches CouchDB: MapReduce views are powerful but require writing JavaScript. Mango queries let you query with JSON: {"age": {"$gt": 30}}. This project teaches query parsing, index selection, and execution planning.
Core challenges you’ll face:
- Query syntax parsing → maps to DSL implementation
- Index selection → maps to query optimization
- Selector evaluation → maps to predicate matching
- Combining results → maps to query execution
Key Concepts:
- Mango Queries: CouchDB Find API
- Query Selectors: MongoDB-style operators ($eq, $gt, $in, etc.)
- Index Selection: “Database Internals” Chapter 10 - Alex Petrov
- Query Planning: Choosing between index scan vs. full scan
Difficulty: Advanced
Time estimate: 2 weeks
Prerequisites:
- Project 6 completed (views for indexing)
- Project 1 completed (JSON parsing)
- Understanding of query execution
Real world outcome:
# Create an index
$ curl -X POST http://localhost:5984/mydb/_index \
-H "Content-Type: application/json" \
-d '{"index": {"fields": ["age"]}, "name": "age-index"}'
{"result":"created","id":"_design/age-index","name":"age-index"}
# Query with selector
$ curl -X POST http://localhost:5984/mydb/_find \
-H "Content-Type: application/json" \
-d '{
"selector": {
"age": {"$gt": 25},
"status": "active"
},
"fields": ["_id", "name", "age"],
"sort": [{"age": "asc"}],
"limit": 10
}'
{
"docs": [
{"_id":"user2","name":"Bob","age":28},
{"_id":"user1","name":"Alice","age":30},
{"_id":"user3","name":"Carol","age":35}
],
"bookmark": "g1AAAA...",
"warning": "No matching index found, create an index for status"
}
# Explain query plan
$ curl -X POST http://localhost:5984/mydb/_explain \
-H "Content-Type: application/json" \
-d '{"selector": {"age": {"$gt": 25}}}'
{
"index": {
"ddoc": "_design/age-index",
"name": "age-index",
"type": "json",
"fields": [{"age":"asc"}]
},
"selector": {"age":{"$gt":25}},
"range": {
"start_key": [25],
"end_key": [{}]
}
}
Implementation Hints:
Selector operators:
Comparison:
$eq - Equal
$ne - Not equal
$gt - Greater than
$gte - Greater than or equal
$lt - Less than
$lte - Less than or equal
Logical:
$and - All must match
$or - At least one must match
$not - Negation
Array:
$in - Value in array
$nin - Value not in array
Existence:
$exists - Field exists
Query execution:
execute_find(selector, fields, sort, limit):
// 1. Parse selector
parsed = parse_selector(selector)
// 2. Find usable index
index = find_best_index(parsed, sort)
// 3. Execute
if index:
candidates = index.range_scan(parsed.key_range)
else:
candidates = full_table_scan() // Slow!
emit_warning("No matching index")
// 4. Filter (for predicates not covered by index)
results = []
for doc in candidates:
if matches_selector(doc, parsed):
results.append(project(doc, fields))
if len(results) >= limit:
break
// 5. Sort (if not already sorted by index)
if sort and not index_covers_sort:
results.sort(sort)
return results
Selector evaluation:
matches_selector(doc, selector):
for field, condition in selector:
value = doc.get(field)
if is_operator(condition):
if not evaluate_operator(value, condition):
return false
else:
if value != condition: // Implicit $eq
return false
return true
evaluate_operator(value, {$gt: 25}):
return value > 25
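A C sketch of operator evaluation, reusing the JsonValue type from Project 1 (its anonymous union needs C11); only numbers and strings are handled here, and CouchDB's full collation order is left out:

#include <stdbool.h>
#include <string.h>

bool eval_compare(const JsonValue *value, const char *op, const JsonValue *arg)
{
    if (value == NULL) return false;              /* field missing from doc */
    if (value->type != arg->type) return false;   /* simplification */

    double cmp = 0;
    if (value->type == JSON_NUMBER)
        cmp = value->number - arg->number;
    else if (value->type == JSON_STRING)
        cmp = strcmp(value->string, arg->string);
    else
        return false;

    if (strcmp(op, "$eq") == 0)  return cmp == 0;
    if (strcmp(op, "$ne") == 0)  return cmp != 0;
    if (strcmp(op, "$gt") == 0)  return cmp > 0;
    if (strcmp(op, "$gte") == 0) return cmp >= 0;
    if (strcmp(op, "$lt") == 0)  return cmp < 0;
    if (strcmp(op, "$lte") == 0) return cmp <= 0;
    return false;
}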
Think about:
- How do you handle nested fields? ("address.city": "NYC")
- What if no index exists? (Full scan with warning)
- How do you handle $or with indexes?
- What’s the bookmark for pagination?
Learning milestones:
- Basic selectors work → You understand predicate matching
- Indexes are used when available → You understand query optimization
- Complex queries with $and/$or work → You understand logical operators
- _explain shows the plan → You understand query planning
Project 9: Document Replication
- File: LEARN_COUCHDB_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 5: Pure Magic
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Distributed Systems / Replication
- Software or Tool: Building: Replicator
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A replication engine that syncs documents between CouchDB instances, handling conflicts deterministically—the heart of CouchDB’s distributed nature.
Why it teaches CouchDB: Replication is CouchDB’s killer feature. Two databases can be offline, both receive updates, and later sync without coordination. Conflicts are detected and the same winner is chosen everywhere. This project teaches distributed systems concepts.
Core challenges you’ll face:
- Comparing revision trees → maps to diff algorithms
- Transferring missing revisions → maps to efficient sync
- Deterministic conflict resolution → maps to consensus-free consistency
- Checkpoint management → maps to sync resume
Key Concepts:
- CouchDB Replication Protocol: Official Protocol Spec
- Conflict Resolution: Replication and Conflict Model
- Eventual Consistency: “Designing Data-Intensive Applications” Chapter 5 - Martin Kleppmann
- CRDTs and Conflict-Free Design: Conflict Management
Resources for key challenges:
- CouchDB Replication Protocol - Data Protocols spec
Difficulty: Expert
Time estimate: 3 weeks
Prerequisites:
- Project 4-5 completed (HTTP + changes feed)
- Project 2 completed (revision trees)
- Understanding of distributed systems basics
Real world outcome:
# Start two CouchDB instances
$ ./couchdb_server --port 5984 --data ./db1 &
$ ./couchdb_server --port 5985 --data ./db2 &
# Create same database on both
$ curl -X PUT http://localhost:5984/shared
$ curl -X PUT http://localhost:5985/shared
# Add document to first instance
$ curl -X PUT http://localhost:5984/shared/doc1 -d '{"value":"from db1"}'
{"ok":true,"id":"doc1","rev":"1-abc"}
# Replicate from db1 to db2
$ curl -X POST http://localhost:5984/_replicate \
-H "Content-Type: application/json" \
-d '{"source":"http://localhost:5984/shared","target":"http://localhost:5985/shared"}'
{
"ok":true,
"docs_read":1,
"docs_written":1,
"missing_checked":1,
"missing_found":1
}
# Document now exists on db2
$ curl http://localhost:5985/shared/doc1
{"_id":"doc1","_rev":"1-abc","value":"from db1"}
# Create conflict: Update on both while "offline"
$ curl -X PUT http://localhost:5984/shared/doc1 -d '{"_rev":"1-abc","value":"updated on db1"}'
$ curl -X PUT http://localhost:5985/shared/doc1 -d '{"_rev":"1-abc","value":"updated on db2"}'
# Replicate again - conflict detected
$ curl -X POST http://localhost:5984/_replicate \
-d '{"source":"http://localhost:5985/shared","target":"http://localhost:5984/shared"}'
# Check for conflicts
$ curl "http://localhost:5984/shared/doc1?conflicts=true"
{
"_id":"doc1",
"_rev":"2-xyz",
"value":"updated on db2",
"_conflicts":["2-abc"]
}
# Same winner (2-xyz) chosen deterministically on both!
Implementation Hints:
Replication algorithm:
replicate(source, target, since=0):
// 1. Get changes from source
changes = source.get_changes(since=since)
// 2. Find what target is missing
revs_to_check = [(c.id, c.rev) for c in changes]
missing = target.revs_diff(revs_to_check)
// 3. Fetch missing docs with revision history
for doc_id, missing_revs in missing:
doc = source.get(doc_id, revs=True, open_revs=missing_revs)
target.bulk_docs([doc], new_edits=False)
// 4. Save checkpoint
save_checkpoint(target, changes.last_seq)
Revision diff (_revs_diff):
Input: {"doc1": ["2-abc", "3-def"], "doc2": ["1-xyz"]}
Output: {"doc1": {"missing": ["3-def"]}} // Has 2-abc, needs 3-def
Deterministic conflict resolution:
When conflicts exist, the winner is determined by:
1. Longest revision path wins (more edits = more recent)
2. If tie, lexicographically highest revision ID wins
This is deterministic: all replicas pick the same winner
without any coordination!
Example:
Revisions: "2-abc", "2-xyz"
Same length (2), compare strings: "xyz" > "abc"
Winner: "2-xyz"
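A C sketch of this winner selection for two revision IDs of the form "{num}-{hash}" (the function name pick_winner is illustrative):

#include <stdlib.h>
#include <string.h>

/* Higher revision number wins; ties are broken lexicographically. */
const char *pick_winner(const char *rev_a, const char *rev_b)
{
    long num_a = strtol(rev_a, NULL, 10);
    long num_b = strtol(rev_b, NULL, 10);
    if (num_a != num_b)
        return num_a > num_b ? rev_a : rev_b;
    return strcmp(rev_a, rev_b) > 0 ? rev_a : rev_b;   /* deterministic tie-break */
}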
Bulk docs with new_edits=false:
Normal insert: CouchDB generates new revision
With new_edits=false: Use provided revision (for replication)
This allows inserting historical revisions from another database.
Think about:
- How do you handle network failures during replication?
- What if source is updated during replication?
- How do you detect and resolve conflicts programmatically?
- What’s continuous replication vs. one-shot?
Learning milestones:
- One-shot replication works → You understand sync protocol
- Missing revisions are identified → You understand revs_diff
- Conflicts are detected → You understand conflict creation
- Same winner everywhere → You understand deterministic resolution
Project 10: Authentication and Security
- File: LEARN_COUCHDB_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Security / Authentication / Authorization
- Software or Tool: Building: Auth System
- Main Book: “Foundations of Information Security” by Jason Andress
What you’ll build: An authentication system with users, roles, and per-database access control—making your database production-ready.
Why it teaches CouchDB: A database without auth is useless in production. CouchDB has a flexible system with admin users, regular users, database-level permissions, and design document validation. This project teaches security fundamentals.
Core challenges you’ll face:
- Password hashing → maps to secure credential storage
- Session management → maps to authentication state
- Role-based access control → maps to authorization
- Validate functions → maps to data integrity
Key Concepts:
- Password Hashing: bcrypt or PBKDF2
- Cookie Sessions: CouchDB Authentication
- Authorization: Per-database security objects
- Validation Functions: Design document validate_doc_update
Difficulty: Advanced
Time estimate: 1-2 weeks
Prerequisites:
- Project 4 completed (HTTP server)
- Basic security knowledge
- Understanding of HTTP cookies
Real world outcome:
# Create admin user (first user becomes admin)
$ curl -X PUT http://localhost:5984/_users/org.couchdb.user:admin \
-H "Content-Type: application/json" \
-d '{"name":"admin","password":"secret123","roles":["_admin"],"type":"user"}'
# Try to access without auth
$ curl http://localhost:5984/mydb/doc1
{"error":"unauthorized","reason":"You are not authorized."}
# Login (cookie auth)
$ curl -X POST http://localhost:5984/_session \
-H "Content-Type: application/json" \
-d '{"name":"admin","password":"secret123"}' \
-c cookies.txt
{"ok":true,"name":"admin","roles":["_admin"]}
# Access with session cookie
$ curl -b cookies.txt http://localhost:5984/mydb/doc1
{"_id":"doc1","_rev":"1-abc","name":"Alice"}
# Set database security
$ curl -X PUT http://localhost:5984/mydb/_security \
-b cookies.txt \
-H "Content-Type: application/json" \
-d '{"admins":{"names":["admin"]},"members":{"roles":["users"]}}'
{"ok":true}
# Create validate function (in design doc)
$ curl -X PUT http://localhost:5984/mydb/_design/validation \
-b cookies.txt \
-d '{
"validate_doc_update": "function(newDoc, oldDoc, userCtx) {
if(!newDoc.name) throw({forbidden:\"name required\"});
}"
}'
# Try to insert invalid document
$ curl -X PUT http://localhost:5984/mydb/invalid \
-b cookies.txt \
-d '{"age":30}'
{"error":"forbidden","reason":"name required"}
Implementation Hints:
User document structure:
{
"_id": "org.couchdb.user:alice",
"type": "user",
"name": "alice",
"roles": ["editor", "viewer"],
"password_scheme": "pbkdf2",
"derived_key": "abc123...",
"salt": "randomsalt",
"iterations": 10000
}
Password verification:
verify_password(provided, user_doc):
derived = pbkdf2(
provided,
user_doc.salt,
user_doc.iterations
)
return constant_time_compare(derived, user_doc.derived_key)
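The constant-time comparison matters: a byte-by-byte compare that returns early leaks how many leading bytes matched. A C sketch:

#include <stddef.h>

/* Compare two equal-length byte buffers without early exit. */
int constant_time_compare(const unsigned char *a, const unsigned char *b,
                          size_t len)
{
    unsigned char diff = 0;
    for (size_t i = 0; i < len; i++)
        diff |= a[i] ^ b[i];   /* accumulate differences without branching */
    return diff == 0;
}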
Database security object:
{
"admins": {
"names": ["admin"], // Specific users
"roles": ["db_admins"] // Users with these roles
},
"members": {
"names": [],
"roles": ["users"] // Who can read
}
}
Access rules:
- Admins can do everything
- Members can read and write
- Non-members get 401/403
- Empty security = public database
Authentication flow:
Every request:
1. Check for Authorization header (Basic auth)
2. Check for session cookie
3. If authenticated, set userCtx
4. Check database security against userCtx
5. If write, run validate_doc_update functions
Think about:
- How do you handle cookie expiration?
- What’s the difference between 401 and 403?
- How do validate functions get user context?
- How do you protect against timing attacks?
Learning milestones:
- Password hashing works → You understand secure credential storage
- Sessions persist across requests → You understand cookie auth
- Database-level permissions work → You understand authorization
- Validate functions reject bad data → You understand data validation
Project 11: Clustering and Sharding
- File: LEARN_COUCHDB_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go, Erlang
- Coolness Level: Level 5: Pure Magic
- Business Potential: 5. The “Industry Disruptor”
- Difficulty: Level 5: Master
- Knowledge Area: Distributed Systems / Clustering
- Software or Tool: Building: Clustered Database
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A clustered CouchDB that shards data across multiple nodes, with quorum reads/writes and automatic failover.
Why it teaches CouchDB: CouchDB 2.0+ supports clustering. Data is split into shards, replicated across nodes. Reads and writes require a quorum. This is serious distributed systems engineering—coordination, consistency, and failure handling.
Core challenges you’ll face:
- Consistent hashing for shard placement → maps to data distribution
- Quorum reads and writes → maps to consistency guarantees
- Node membership and failure detection → maps to cluster management
- Request routing → maps to distributed query execution
Key Concepts:
- Consistent Hashing: “Designing Data-Intensive Applications” Chapter 6 - Martin Kleppmann
- Quorum Consensus: “Designing Data-Intensive Applications” Chapter 5 - Martin Kleppmann
- Cluster Membership: Gossip protocols, failure detection
- Shard Routing: Document ID to shard mapping
Difficulty: Master
Time estimate: 1 month+
Prerequisites:
- All previous projects completed
- Strong distributed systems knowledge
- Network programming experience
Real world outcome:
# Start 3-node cluster
$ ./couchdb_server --port 5984 --node node1 --cluster "node1,node2,node3"
$ ./couchdb_server --port 5985 --node node2 --cluster "node1,node2,node3"
$ ./couchdb_server --port 5986 --node node3 --cluster "node1,node2,node3"
# Check cluster status
$ curl http://localhost:5984/_membership
{
"cluster_nodes": ["node1", "node2", "node3"],
"all_nodes": ["node1", "node2", "node3"]
}
# Create sharded database
$ curl -X PUT "http://localhost:5984/mydb?q=4&n=2"
{"ok":true}
# q=4 shards, n=2 replicas per shard
# Insert document (goes to appropriate shard)
$ curl -X PUT http://localhost:5984/mydb/user_123 -d '{"name":"Alice"}'
{"ok":true,"id":"user_123","rev":"1-abc"}
# Read works from any node (routes to correct shard)
$ curl http://localhost:5985/mydb/user_123
{"_id":"user_123","_rev":"1-abc","name":"Alice"}
# Kill a node
$ kill -9 $(pgrep -f "node2")
# Writes still work when the target shard's replicas are on surviving nodes (n=2, w=2)
$ curl -X PUT http://localhost:5984/mydb/user_456 -d '{"name":"Bob"}'
{"ok":true} # Success! Quorum achieved with remaining nodes
# Node rejoins and syncs
$ ./couchdb_server --port 5985 --node node2 --cluster "node1,node2,node3"
# Automatic sync of missed writes...
Implementation Hints:
Sharding scheme:
Database "mydb" with q=4 shards:
Shard 0: hash range 0x00000000-0x3FFFFFFF
Shard 1: hash range 0x40000000-0x7FFFFFFF
Shard 2: hash range 0x80000000-0xBFFFFFFF
Shard 3: hash range 0xC0000000-0xFFFFFFFF
Document placement:
shard_num = crc32(doc_id) % q
nodes = get_nodes_for_shard(shard_num) // n replicas
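A C sketch of shard and replica-node selection, using zlib's crc32() as the hash and a simple round-robin layout (real clusters usually store an explicit shard map):

#include <stdint.h>
#include <string.h>
#include <zlib.h>   /* crc32() as a stand-in hash */

uint32_t shard_for_doc(const char *doc_id, uint32_t q)
{
    uLong h = crc32(0L, (const Bytef *)doc_id, (uInt)strlen(doc_id));
    return (uint32_t)(h % q);
}

/* Fill out_nodes with the n node indexes holding this shard's replicas. */
void nodes_for_shard(uint32_t shard, uint32_t n, uint32_t node_count,
                     uint32_t *out_nodes)
{
    for (uint32_t i = 0; i < n; i++)
        out_nodes[i] = (shard + i) % node_count;   /* round-robin placement */
}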
Quorum configuration:
n = number of replicas
r = read quorum (how many replicas must respond for read)
w = write quorum (how many replicas must acknowledge write)
Default: n=3, r=2, w=2
Rule: r + w > n guarantees overlap (see latest write)
Write path:
write_document(doc):
shard = get_shard(doc.id)
nodes = get_nodes_for_shard(shard)
responses = parallel_write(nodes, doc)
success_count = count_successes(responses)
if success_count >= w:
return success
else:
return error("quorum not reached")
Read path:
read_document(doc_id):
shard = get_shard(doc_id)
nodes = get_nodes_for_shard(shard)
responses = parallel_read(nodes, doc_id)
// Wait for r responses
docs = collect_until(responses, count=r)
// Return latest revision
return pick_winner(docs)
Think about:
- How do you handle split-brain scenarios?
- What if a write succeeds on < w nodes?
- How do you rebalance when adding/removing nodes?
- How do you route changes feed queries to all shards?
Learning milestones:
- Documents are sharded correctly → You understand consistent hashing
- Quorum reads/writes work → You understand distributed consistency
- Node failure is handled → You understand fault tolerance
- Cluster rebalances → You understand distributed operations
Project 12: Attachments and Binary Storage
- File: LEARN_COUCHDB_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Binary Data / HTTP Multipart
- Software or Tool: Building: Attachment Storage
- Main Book: “HTTP: The Definitive Guide” by David Gourley
What you’ll build: Support for binary attachments (images, PDFs, etc.) stored alongside JSON documents with content-type handling and streaming.
Why it teaches CouchDB: CouchDB can store files as attachments to documents. This is useful for CouchApps (web apps served from CouchDB) and any application that needs to associate files with data. It teaches binary storage and HTTP multipart handling.
Core challenges you’ll face:
- Binary data in append-only storage → maps to blob storage
- HTTP multipart requests → maps to complex request parsing
- Content-Type handling → maps to MIME types
- Streaming large files → maps to efficient I/O
Key Concepts:
- HTTP Multipart: RFC 2046 - Multipart content types
- MIME Types: Mapping file extensions to content types
- Streaming I/O: Chunked transfer encoding
- Content-Addressable Storage: Storing attachments by hash
Difficulty: Intermediate
Time estimate: 1 week
Prerequisites:
- Project 4 completed (HTTP server)
- Project 3 completed (storage engine)
- Understanding of binary file handling
Real world outcome:
# Upload attachment
$ curl -X PUT http://localhost:5984/mydb/doc1/photo.jpg \
-H "Content-Type: image/jpeg" \
--data-binary @photo.jpg
{"ok":true,"id":"doc1","rev":"2-xyz"}
# Document now has _attachments
$ curl http://localhost:5984/mydb/doc1
{
"_id": "doc1",
"_rev": "2-xyz",
"name": "Alice",
"_attachments": {
"photo.jpg": {
"content_type": "image/jpeg",
"length": 45123,
"digest": "md5-abc123...",
"stub": true
}
}
}
# Download attachment
$ curl http://localhost:5984/mydb/doc1/photo.jpg > downloaded.jpg
# Returns binary data with Content-Type: image/jpeg
# Inline attachment (base64 in document)
$ curl -X PUT http://localhost:5984/mydb/doc2 \
-H "Content-Type: application/json" \
-d '{
"name": "Bob",
"_attachments": {
"note.txt": {
"content_type": "text/plain",
"data": "SGVsbG8gV29ybGQh"
}
}
}'
Implementation Hints:
Attachment storage:
Option 1: Inline in document B-tree
- Good for small attachments
- Bad for large files (copied on every doc update)
Option 2: Separate attachment store (like CouchDB)
- Content-addressable: key = md5(content)
- Document stores reference to attachment
- Attachment deduplicated automatically
Document with attachments:
{
"_id": "doc1",
"_rev": "2-xyz",
"_attachments": {
"photo.jpg": {
"content_type": "image/jpeg",
"length": 45123,
"digest": "md5-abc123def456",
"stub": true, // Indicates attachment not inline
"revpos": 2 // Revision when added
}
}
}
Upload handling:
PUT /{db}/{doc_id}/{attachment_name}
Content-Type: image/jpeg
[binary data]
Steps:
1. Read binary body
2. Compute MD5 digest
3. Store in attachment store (if not exists)
4. Update document with attachment metadata
5. Increment document revision
Download handling:
GET /{db}/{doc_id}/{attachment_name}
Steps:
1. Get document
2. Find attachment metadata
3. Look up in attachment store by digest
4. Stream binary data with correct Content-Type
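For large attachments, the download path should stream in fixed-size chunks rather than buffering the whole file. A C sketch, assuming the HTTP headers have already been sent on client_fd:

#include <stdio.h>
#include <unistd.h>

int stream_attachment(FILE *src, int client_fd)
{
    char buf[16 * 1024];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, src)) > 0) {
        if (write(client_fd, buf, n) != (ssize_t)n)
            return -1;   /* client went away */
    }
    return ferror(src) ? -1 : 0;
}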
Think about:
- What if the same file is attached to multiple documents?
- How do you handle very large files (1GB+)?
- What about range requests for partial downloads?
- How do attachments interact with replication?
Learning milestones:
- Attachments upload and download → You understand binary storage
- Content-Type is correct → You understand MIME handling
- Large files stream efficiently → You understand I/O optimization
- Attachments replicate → You understand binary replication
Project 13: Full-Text Search Integration
- File: LEARN_COUCHDB_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Information Retrieval / Search
- Software or Tool: Building: Search Index
- Main Book: “Introduction to Information Retrieval” by Manning et al.
What you’ll build: A full-text search engine integrated with your document database, enabling queries like “find documents containing ‘machine learning’”.
Why it teaches CouchDB: CouchDB integrates with Lucene for full-text search. While MapReduce views handle structured queries, text search requires inverted indexes and relevance scoring. This completes your database’s query capabilities.
Core challenges you’ll face:
- Text tokenization and stemming → maps to text processing
- Inverted index construction → maps to search data structures
- TF-IDF scoring → maps to relevance ranking
- Incremental index updates → maps to real-time search
Key Concepts:
- Inverted Indexes: “Introduction to Information Retrieval” Chapter 1 - Manning et al.
- TF-IDF Scoring: “Introduction to Information Retrieval” Chapter 6 - Manning et al.
- Tokenization: Splitting text into searchable terms
- Search Integration: CouchDB Search
Difficulty: Advanced
Time estimate: 2 weeks
Prerequisites:
- Project 3 completed (B-tree storage)
- Project 5 completed (changes feed for updates)
- Basic text processing knowledge
Real world outcome:
# Create search index
$ curl -X PUT http://localhost:5984/mydb/_design/search \
-H "Content-Type: application/json" \
-d '{
"indexes": {
"content": {
"analyzer": "standard",
"index": "function(doc) {
if(doc.body) index(\"body\", doc.body);
if(doc.title) index(\"title\", doc.title, {boost: 2.0});
}"
}
}
}'
# Search
$ curl "http://localhost:5984/mydb/_design/search/_search/content?q=machine+learning"
{
"total_rows": 15,
"rows": [
{"id":"doc42","order":[1.234],"fields":{}},
{"id":"doc17","order":[0.987],"fields":{}},
...
]
}
# Search with highlighting
$ curl "http://localhost:5984/mydb/_design/search/_search/content?q=title:database&include_docs=true&highlights=3"
{
"rows": [
{
"id": "doc5",
"doc": {"_id":"doc5","title":"Database Systems","body":"..."},
"highlights": {"title":["<em>Database</em> Systems"]}
}
]
}
# Faceted search
$ curl "http://localhost:5984/mydb/_design/search/_search/content?q=*:*&counts=[\"category\"]"
{
"total_rows": 1000,
"counts": {
"category": {
"tech": 450,
"science": 320,
"arts": 230
}
}
}
Implementation Hints:
Inverted index structure:
term -> [(doc_id, positions, tf), ...]
Example for "database":
"database" -> [
(doc5, [0, 45], 2), // appears at positions 0 and 45, 2 times
(doc12, [23], 1),
...
]
Index function execution:
For each document:
run index function
for each index(field, value, options) call:
tokens = tokenize(value)
for token in tokens:
inverted_index[token].append((doc.id, position, options))
TF-IDF scoring:
TF(term, doc) = frequency of term in document
IDF(term) = log(total_docs / docs_containing_term)
Score = TF * IDF
Higher score = term appears often in this doc, rarely overall
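The scoring formula translates directly into C (smoothing and document-length normalization are left out for clarity):

#include <math.h>

/* Classic TF-IDF weight for one term in one document. */
double tf_idf(double term_freq_in_doc, double total_docs,
              double docs_containing_term)
{
    if (docs_containing_term == 0) return 0.0;
    double idf = log(total_docs / docs_containing_term);
    return term_freq_in_doc * idf;
}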
Query processing:
search(query):
terms = parse_and_tokenize(query)
for term in terms:
postings = inverted_index[term]
for (doc_id, positions, tf) in postings:
score = calculate_tfidf(term, doc_id)
accumulator[doc_id] += score
return sorted(accumulator.items(), by=score, descending)
Think about:
- How do you handle phrases like “machine learning”?
- How do you update the index incrementally?
- What’s the storage overhead for positions?
- How do you handle boolean queries (AND, OR, NOT)?
Learning milestones:
- Basic term search works → You understand inverted indexes
- Relevance scoring is meaningful → You understand TF-IDF
- Index updates incrementally → You understand real-time search
- Advanced queries work → You understand query parsing
Project 14: Complete CouchDB Clone
- File: LEARN_COUCHDB_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++
- Coolness Level: Level 5: Pure Magic
- Business Potential: 5. The “Industry Disruptor”
- Difficulty: Level 5: Master
- Knowledge Area: Complete Database Systems
- Software or Tool: Building: Document Database
- Main Book: “Database Internals” by Alex Petrov + “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A complete CouchDB-compatible database integrating all previous projects: append-only storage, HTTP API, MVCC, views, replication, auth, and more.
Why this is the ultimate goal: This is the capstone. You’ll integrate every component into a cohesive, production-quality system. The challenges of making components work together teach you as much as building them individually.
Core challenges you’ll face:
- Component integration → maps to systems architecture
- API compatibility → maps to protocol implementation
- Performance tuning → maps to system optimization
- Error handling across layers → maps to reliability engineering
Key Concepts:
- All concepts from previous projects
- Systems Integration: Making components work together
- API Compatibility: Matching CouchDB’s behavior
- Testing: Ensuring correctness and performance
Difficulty: Master Time estimate: 1 month+ Prerequisites:
- All previous projects completed (or most of them)
- Strong systems programming skills
- Patience and determination
Real world outcome:
$ ./minicouch --data-dir /var/minicouch --port 5984
MiniCouch v1.0 - CouchDB-compatible Document Database
Storage: Append-only B-tree
API: HTTP/1.1 REST
Auth: Cookie + Basic
Features: MVCC, Views, Replication
Server ready at http://localhost:5984
# Verify compatibility with CouchDB tools
$ npm install -g pouchdb-server
$ pouchdb-server --port 5985
# Replicate from MiniCouch to PouchDB
$ curl -X POST http://localhost:5984/_replicate \
-d '{"source":"http://localhost:5984/mydb","target":"http://localhost:5985/mydb"}'
{"ok":true,"docs_written":1000}
# Run CouchDB test suite
$ ./run_couchdb_tests http://localhost:5984
Running 150 compatibility tests...
✓ Server info
✓ Database creation
✓ Document CRUD
✓ Revision handling
✓ Views and MapReduce
✓ Changes feed
✓ Replication
✓ Authentication
...
148/150 tests passed (98.7% compatible)
Implementation Hints:
System architecture:
┌─────────────────────────────────────────────────────────────────┐
│ HTTP Server │
│ (Routes, Auth, Request/Response) │
└─────────────────────────────────────────────────────────────────┘
│
┌───────────────────┼───────────────────┐
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────────────┐
│ Document Store │ │ View Engine │ │ Replicator │
│ (MVCC, Revs) │ │ (MapReduce) │ │ (Sync Protocol) │
└─────────────────┘ └─────────────────┘ └─────────────────────────┘
│ │ │
└───────────────────┼───────────────────┘
▼
┌─────────────────────────────────────────────────────────────────┐
│ Storage Engine │
│ (Append-only B-tree, Compaction, Crash Recovery) │
└─────────────────────────────────────────────────────────────────┘
Compatibility checklist:
- GET / returns CouchDB-like welcome
- PUT /{db} creates database
- Document CRUD with _id and _rev
- Conflicts detected and stored
- MapReduce views work
- Changes feed with all modes
- Replication protocol compatible
- Authentication (cookie + basic)
- Mango queries
- Attachments
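For the first checklist item, real CouchDB answers GET / with a JSON object containing "couchdb": "Welcome" plus a "version" field. A minimal sketch of producing that body; everything beyond those two fields (the vendor name, the version string) is an assumption for MiniCouch:
#include <stdio.h>

/* Body for GET / — a CouchDB-like welcome. The server would send this with
 * "HTTP/1.1 200 OK" and "Content-Type: application/json". */
static int welcome_json(char *buf, size_t buflen)
{
    return snprintf(buf, buflen,
        "{\"couchdb\":\"Welcome\","
        "\"version\":\"1.0.0\","
        "\"vendor\":{\"name\":\"MiniCouch\"}}");
}

int main(void)
{
    char body[256];
    welcome_json(body, sizeof body);
    printf("%s\n", body);
    return 0;
}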
Testing strategy:
1. Unit tests for each component
2. Integration tests for component interaction
3. Compatibility tests against CouchDB spec
4. Stress tests for performance
5. Chaos tests for reliability (kill -9, network partition)
Learning milestones:
- All components integrate → You understand systems architecture
- CouchDB tools work with it → You’ve achieved compatibility
- It survives crashes → You’ve achieved durability
- It handles concurrent load → You’ve achieved performance
- You can explain every decision → You’ve mastered the domain
Project Comparison Table
| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| 1. JSON Parser | Intermediate | 1 week | ⭐⭐⭐ | ⭐⭐⭐ |
| 2. Document Store with Revisions | Intermediate | 1 week | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| 3. Append-Only B-Tree | Expert | 2-3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 4. HTTP REST API | Advanced | 2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 5. Changes Feed | Advanced | 1 week | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| 6. MapReduce Views | Expert | 3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 7. Compaction | Advanced | 1-2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| 8. Mango Queries | Advanced | 2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 9. Replication | Expert | 3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 10. Authentication | Advanced | 1-2 weeks | ⭐⭐⭐ | ⭐⭐⭐ |
| 11. Clustering | Master | 1 month+ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 12. Attachments | Intermediate | 1 week | ⭐⭐⭐ | ⭐⭐⭐ |
| 13. Full-Text Search | Advanced | 2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 14. Complete Clone | Master | 1 month+ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
Recommended Learning Path
Based on building a CouchDB clone from scratch in C:
Phase 1: Foundations (Weeks 1-2)
Start here to understand the basics.
- Project 1: JSON Parser - The foundation of document storage
- Project 2: Document Store with Revisions - Understand MVCC
Phase 2: Storage Engine (Weeks 3-5)
Build the crash-proof core.
- Project 3: Append-Only B-Tree - CouchDB’s secret weapon
- Project 7: Compaction - Reclaim space
Phase 3: API Layer (Weeks 6-8)
Make it accessible.
- Project 4: HTTP REST API - CouchDB’s interface
- Project 5: Changes Feed - Enable sync
Phase 4: Querying (Weeks 9-12)
Make data queryable.
- Project 6: MapReduce Views - Powerful indexing
- Project 8: Mango Queries - JSON-style queries
Phase 5: Distribution (Weeks 13-16)
Go distributed.
- Project 9: Replication - Multi-master sync
- Project 10: Authentication - Production security
Phase 6: Advanced (Weeks 17+)
Complete the picture.
- Project 11: Clustering - Horizontal scaling
- Project 12: Attachments - Binary storage
- Project 13: Full-Text Search - Text queries
- Project 14: Complete Clone - Integration
Summary
| # | Project Name | Main Programming Language |
|---|---|---|
| 1 | JSON Parser and Document Model | C |
| 2 | In-Memory Document Store with Revisions | C |
| 3 | Append-Only B-Tree Storage Engine | C |
| 4 | HTTP REST API Server | C |
| 5 | Sequence Index and Changes Feed | C |
| 6 | MapReduce View Engine | C |
| 7 | Database Compaction | C |
| 8 | Mango Query Engine | C |
| 9 | Document Replication | C |
| 10 | Authentication and Security | C |
| 11 | Clustering and Sharding | C |
| 12 | Attachments and Binary Storage | C |
| 13 | Full-Text Search Integration | C |
| 14 | Complete CouchDB Clone | C |
Key Resources
Books
- “Designing Data-Intensive Applications” by Martin Kleppmann - Distributed data fundamentals
- “Database Internals” by Alex Petrov - How databases work under the hood
- “CouchDB: The Definitive Guide” by J. Chris Anderson, Jan Lehnardt, and Noah Slater - The original CouchDB book
- “C Programming: A Modern Approach” by K. N. King - Essential C reference
- “The Linux Programming Interface” by Michael Kerrisk - Systems programming
Online Resources
- CouchDB Official Documentation - Complete API reference
- CouchDB Guide - The “Relax” book online
- The Power of B-trees - CouchDB’s B-tree explained
- CouchDB Replication Protocol - Sync protocol spec
- jsmn JSON Parser - Minimal C JSON library
Related Projects
- PouchDB - JavaScript CouchDB implementation
- cbt - MVCC append-only B-tree in Erlang
- Couchstore - Couchbase’s storage engine
You’re ready to build a CouchDB clone from scratch. Start with Project 1 and work your way through. By the end, you’ll understand document databases at the level of the engineers who built CouchDB, PouchDB, and MongoDB.