Learn Data-Intensive Applications Design: From Zero to Systems Architect

Goal: Deeply understand how to design, build, and scale data-intensive applications. Master the principles of scalability, reliability, consistency, and efficiency that underpin modern distributed systems.


Why Learn Data-Intensive Applications Design?

Every modern application is data-intensive. Whether it’s a social media platform handling billions of posts, an e-commerce site processing millions of transactions, or a real-time analytics system ingesting terabytes of events, understanding how to design these systems is essential.

After completing these projects, you will:

  • Design scalable systems that handle millions of users and petabytes of data
  • Understand trade-offs between consistency, availability, and partition tolerance (CAP theorem)
  • Implement replication, partitioning, and consensus algorithms
  • Build reliable systems that handle failures gracefully
  • Design efficient data models and storage engines
  • Create real-time and batch processing pipelines
  • Make informed decisions about database selection and architecture

Core Concept Analysis

The Data-Intensive Application Stack

┌─────────────────────────────────────────────────────────────┐
│                    APPLICATION LAYER                          │
│         (Business Logic, APIs, User Interface)               │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                  CACHING LAYER                               │
│              (Redis, Memcached, CDN)                         │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│              DATA PROCESSING LAYER                            │
│    (Batch: Spark, Hadoop | Stream: Kafka, Flink)            │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                  DATABASE LAYER                               │
│  (Relational: PostgreSQL | NoSQL: MongoDB, Cassandra)          │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                  STORAGE LAYER                                │
│         (B-Trees, LSM-Trees, Object Storage)                  │
└─────────────────────────────────────────────────────────────┘

Fundamental Concepts

  1. Reliability: System continues working correctly even when things go wrong
    • Hardware faults, software errors, human errors
    • Fault tolerance vs fault prevention
  2. Scalability: System can handle growth in load
    • Vertical scaling (bigger machines) vs horizontal scaling (more machines)
    • Load parameters: requests/sec, read/write ratio, data volume
  3. Maintainability: System is easy to operate and modify
    • Operability, simplicity, evolvability
  4. Data Models: How data is represented
    • Relational (SQL), Document (JSON), Graph, Key-Value
  5. Storage Engines: How data is stored and retrieved
    • B-Trees (read-optimized), LSM-Trees (write-optimized)
  6. Replication: Keeping copies of data on multiple nodes
    • Leader-follower, multi-leader, leaderless
    • Consistency models: eventual, strong, causal
  7. Partitioning: Splitting data across multiple nodes
    • Range partitioning, hash partitioning
    • Rebalancing strategies
  8. Transactions: Grouping operations that must succeed or fail together
    • ACID properties
    • Isolation levels
  9. Consensus: Getting multiple nodes to agree on something
    • Two-phase commit, Raft, Paxos
  10. Batch Processing: Processing large volumes of data periodically
    • MapReduce, Spark
  11. Stream Processing: Processing data continuously as it arrives
    • Event sourcing, CQRS, Kafka

Project List

The following 18 projects will teach you data-intensive application design from fundamentals to advanced distributed systems.


Project 1: Key-Value Store with B-Tree Index

  • File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go, Python
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Storage Engines / Indexing
  • Software or Tool: File I/O, B-Tree implementation
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A persistent key-value store that uses a B-Tree index for efficient lookups. It should support insert, get, delete, and range queries with durability guarantees.

Why it teaches data-intensive design: B-Trees are the foundation of most relational databases. Building one teaches you how databases organize data on disk, handle page I/O, and maintain indexes efficiently.

Core challenges you’ll face:

  • Implementing B-Tree structure → maps to understanding balanced tree algorithms
  • Handling page I/O → maps to disk vs memory trade-offs
  • Managing splits and merges → maps to maintaining tree balance
  • Ensuring durability → maps to write-ahead logging or fsync

Key Concepts:

  • B-Tree Structure: “Designing Data-Intensive Applications” Ch. 3 - Kleppmann
  • Page-Oriented Storage: “Database Systems: The Complete Book” Ch. 13
  • Durability: “Designing Data-Intensive Applications” Ch. 7

Difficulty: Advanced Time estimate: 3-4 weeks Prerequisites: C programming, understanding of trees and file I/O

Real world outcome:

$ ./kvstore create mydb
Database created: mydb

$ ./kvstore set mydb user:123 '{"name":"Alice","email":"alice@example.com"}'
OK

$ ./kvstore get mydb user:123
{"name":"Alice","email":"alice@example.com"}

$ ./kvstore range mydb user:100 user:200
user:123: {"name":"Alice","email":"alice@example.com"}
user:145: {"name":"Bob","email":"bob@example.com"}

$ ./kvstore delete mydb user:123
OK

Implementation Hints:

B-Tree structure:

  • Each node contains multiple keys and pointers
  • Internal nodes have keys and child pointers
  • Leaf nodes have keys and values
  • All leaves at same depth (balanced)
  • Typical node size: 4KB (one disk page)
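
To make the node layout concrete, here is a minimal in-memory sketch (Python, for brevity) of a node and the in-node binary search; it is illustrative only, since a real implementation serializes each node into a fixed-size on-disk page:

import bisect
from dataclasses import dataclass, field

@dataclass
class BTreeNode:
    is_leaf: bool = True
    keys: list = field(default_factory=list)      # sorted keys
    values: list = field(default_factory=list)    # leaf only: values aligned with keys
    children: list = field(default_factory=list)  # internal only: len(keys) + 1 children

def search(node: BTreeNode, key):
    """Descend from the root, using binary search inside each node."""
    i = bisect.bisect_left(node.keys, key)
    if node.is_leaf:
        if i < len(node.keys) and node.keys[i] == key:
            return node.values[i]
        return None
    # In an internal node, keys act as separators between children.
    if i < len(node.keys) and node.keys[i] == key:
        i += 1  # in this sketch, keys equal to the separator live in the right subtree
    return search(node.children[i], key)

# Tiny usage example with a hand-built two-level tree.
leaf1 = BTreeNode(keys=["a", "c"], values=[1, 3])
leaf2 = BTreeNode(keys=["m", "z"], values=[13, 26])
root = BTreeNode(is_leaf=False, keys=["m"], children=[leaf1, leaf2])
assert search(root, "c") == 3 and search(root, "q") is None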

Key design decisions:

  1. How many keys per node? (branching factor)
  2. How to handle variable-length values?
  3. When to split nodes? (typically when full)
  4. How to ensure atomic writes? (WAL or copy-on-write)

Questions to guide implementation:

  • How do you find a key efficiently? (binary search within node)
  • How do you handle concurrent access? (locking or MVCC)
  • How do you recover from crashes? (WAL replay)

Learning milestones:

  1. Basic B-Tree operations → Insert, search, delete work correctly
  2. Persistence → Data survives restarts
  3. Range queries → Efficiently scan key ranges
  4. Concurrent access → Multiple readers/writers

Project 2: Log-Structured Merge Tree (LSM-Tree) Storage Engine

  • File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Storage Engines / Write-Optimized Structures
  • Software or Tool: File I/O, Bloom filters
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A key-value store using LSM-Tree architecture (like LevelDB/RocksDB). It should have an in-memory memtable, write-ahead log, and multiple sorted string tables (SSTables) that get compacted.

Why it teaches data-intensive design: LSM-Trees are write-optimized, making them perfect for high-write workloads. Understanding them teaches you trade-offs between read and write performance.

Core challenges you’ll face:

  • Memtable management → maps to in-memory sorted structure
  • SSTable creation → maps to sorted file format
  • Compaction strategies → maps to merging sorted files
  • Bloom filters → maps to probabilistic data structures for fast lookups

Key Concepts:

  • LSM-Tree Architecture: “Designing Data-Intensive Applications” Ch. 3
  • SSTable Format: LevelDB documentation
  • Compaction: RocksDB tuning guide

Difficulty: Advanced Time estimate: 3-4 weeks Prerequisites: Project 1, understanding of sorting algorithms

Real world outcome:

$ ./lsmstore create mydb
LSM-Tree database created

$ ./lsmstore set mydb key1 value1
OK

# High write throughput
$ for i in {1..100000}; do
    ./lsmstore set mydb "key$i" "value$i"
done
# Completes in seconds (much faster than B-Tree for writes)

$ ./lsmstore get mydb key50000
value50000

# Compaction happens automatically
$ ./lsmstore stats mydb
Level 0: 5 SSTables
Level 1: 2 SSTables (compacted)
Total keys: 100000

Implementation Hints:

LSM-Tree structure:

Write Path:
  Write → Memtable (in-memory sorted tree)
  When memtable full → Flush to SSTable (Level 0)
  Background compaction merges SSTables

Read Path:
  Check memtable → Check Level 0 SSTables → Check Level 1+ SSTables
  Use Bloom filter to skip SSTables that don't contain key
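
A minimal sketch of that read path, assuming a plain dict as the memtable and SSTables modeled as in-memory objects with a set standing in for the Bloom filter (real SSTables are immutable sorted files on disk):

TOMBSTONE = object()  # marker written on delete

class SSTable:
    def __init__(self, items):
        self.data = dict(items)          # stand-in for a sorted on-disk file
        self.bloom = set(self.data)      # stand-in for a Bloom filter (no false negatives)

    def maybe_contains(self, key):
        return key in self.bloom

    def get(self, key):
        return self.data.get(key)

def lsm_get(key, memtable, sstables_newest_first):
    """Check the memtable first, then SSTables from newest to oldest."""
    if key in memtable:
        value = memtable[key]
        return None if value is TOMBSTONE else value
    for sst in sstables_newest_first:
        if not sst.maybe_contains(key):
            continue                      # skip files that cannot hold the key
        value = sst.get(key)
        if value is not None:
            return None if value is TOMBSTONE else value
    return None

# Usage: newer data shadows older data.
memtable = {"k1": "v1-new"}
sstables = [SSTable([("k2", "v2")]), SSTable([("k1", "v1-old"), ("k3", "v3")])]
assert lsm_get("k1", memtable, sstables) == "v1-new"
assert lsm_get("k3", memtable, sstables) == "v3"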

Compaction strategies:

  • Size-tiered: Merge SSTables of similar size into larger ones
  • Leveled: Each level has a fixed size budget; overflowing data is merged into the next level
  • Hybrids: Many engines mix the two, e.g., size-tiered behavior at level 0 and leveled compaction above it

Key questions:

  • How big should the memtable be before flushing?
  • How many levels should you have?
  • When to trigger compaction? (based on size or ratio)

Learning milestones:

  1. Write path works → Writes go to memtable and WAL
  2. Read path works → Can retrieve values correctly
  3. Compaction works → SSTables merge correctly
  4. Performance → Write throughput exceeds B-Tree

Project 3: Relational Database Query Engine

  • File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
  • Main Programming Language: C++
  • Alternative Programming Languages: Rust, Java
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Query Processing / Relational Algebra
  • Software or Tool: SQL parser, query optimizer
  • Main Book: “Database Systems: The Complete Book” by Garcia-Molina et al.

What you’ll build: A query engine that parses SQL, builds an execution plan, and executes queries with operators like scan, filter, join, sort, and aggregate.

Why it teaches data-intensive design: Understanding how databases execute queries is essential for writing efficient SQL and designing schemas.

Core challenges you’ll face:

  • SQL parsing → maps to converting text to relational algebra
  • Query optimization → maps to choosing efficient execution plans
  • Join algorithms → maps to nested loop, hash join, sort-merge join
  • Operator pipelining → maps to streaming vs materialization

Key Concepts:

  • Relational Algebra: “Database Systems” Ch. 2
  • Query Optimization: “Database Systems” Ch. 16
  • Join Algorithms: “Database Systems” Ch. 15

Difficulty: Expert Time estimate: 4-6 weeks Prerequisites: Understanding of SQL, algorithms and data structures

Real world outcome:

-- Create table
CREATE TABLE users (id INT, name VARCHAR(100), email VARCHAR(100));
CREATE TABLE orders (id INT, user_id INT, amount DECIMAL(10,2));

-- Insert data
INSERT INTO users VALUES (1, 'Alice', 'alice@example.com');
INSERT INTO orders VALUES (1, 1, 99.99);

-- Query
SELECT u.name, SUM(o.amount) as total
FROM users u
JOIN orders o ON u.id = o.user_id
GROUP BY u.id, u.name
HAVING SUM(o.amount) > 50;

-- Output:
-- name  | total
-- Alice | 99.99

Implementation Hints:

Query execution pipeline:

SQL → Parser → AST → Optimizer → Execution Plan → Executor → Results

Key operators to implement:

  1. TableScan: Read all rows from table
  2. Filter: Apply WHERE conditions
  3. Project: Select specific columns
  4. Join: Combine two tables (nested loop, hash, sort-merge)
  5. Sort: Order rows
  6. Aggregate: GROUP BY operations
  7. Limit: Top N results
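
One common way to wire these operators together is the iterator ("Volcano") model, where each operator pulls rows from its child. A sketch using Python generators, with a hash join and tables shaped like the example above (names and row layouts are illustrative):

def table_scan(rows):
    """Leaf operator: yield every row of an in-memory 'table'."""
    yield from rows

def filter_op(child, predicate):
    """Apply a WHERE-style predicate to rows pulled from the child."""
    for row in child:
        if predicate(row):
            yield row

def project(child, columns):
    """Keep only the requested columns."""
    for row in child:
        yield {c: row[c] for c in columns}

def hash_join(left, right, left_key, right_key):
    """Build a hash table on the left input, then probe it with the right input."""
    build = {}
    for row in left:
        build.setdefault(row[left_key], []).append(row)
    for row in right:
        for match in build.get(row[right_key], []):
            yield {**match, **row}

# SELECT u.name, o.amount FROM users u JOIN orders o ON u.id = o.user_id WHERE o.amount > 50
users = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
orders = [{"order_id": 1, "user_id": 1, "amount": 99.99},
          {"order_id": 2, "user_id": 2, "amount": 10.0}]
plan = project(
    filter_op(hash_join(table_scan(users), table_scan(orders), "id", "user_id"),
              lambda r: r["amount"] > 50),
    ["name", "amount"])
print(list(plan))  # [{'name': 'Alice', 'amount': 99.99}]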

Optimization techniques:

  • Push predicates down (filter early)
  • Choose join order (smallest table first)
  • Use indexes when available
  • Estimate costs (I/O, CPU)

Learning milestones:

  1. Parse SQL → Convert to execution plan
  2. Execute simple queries → SELECT, WHERE work
  3. Implement joins → Multiple algorithms work
  4. Optimize queries → Choose efficient plans

Project 4: Leader-Follower Replication System

  • File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Python, Java, Rust
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Replication / High Availability
  • Software or Tool: Network programming, consensus
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A replicated key-value store with one leader and multiple followers. Writes go to the leader and are replicated to followers. Handles leader failure and automatic failover.

Why it teaches data-intensive design: Replication is fundamental for availability and read scalability. Understanding leader-follower replication teaches you consistency models and failure handling.

Core challenges you’ll face:

  • Synchronous vs asynchronous replication → maps to consistency vs latency trade-offs
  • Replication lag → maps to eventual consistency issues
  • Leader election → maps to distributed consensus
  • Split-brain prevention → maps to quorum-based decisions

Key Concepts:

  • Replication Models: “Designing Data-Intensive Applications” Ch. 5
  • Consistency Guarantees: “Designing Data-Intensive Applications” Ch. 9
  • Leader Election: Raft paper

Difficulty: Advanced Time estimate: 3-4 weeks Prerequisites: Network programming, understanding of distributed systems basics

Real world outcome:

# Start leader
$ ./replicated-kv --node-id=1 --port=8001 --leader
Leader started on port 8001

# Start followers
$ ./replicated-kv --node-id=2 --port=8002 --follower --leader-addr=localhost:8001
Follower 2 connected to leader

$ ./replicated-kv --node-id=3 --port=8003 --follower --leader-addr=localhost:8001
Follower 3 connected to leader

# Write to leader
$ curl "http://localhost:8001/set?key=user:1&value=Alice"
OK

# Read from any node
$ curl http://localhost:8002/get?key=user:1
Alice

# Leader fails
$ kill <leader-pid>

# Automatic failover
[Follower 2] Leader failed, starting election...
[Follower 2] Elected as new leader
[Follower 3] Following new leader: node-2

Implementation Hints:

Replication protocol:

  1. Client sends write to leader
  2. Leader writes to local log
  3. Leader sends log entries to followers
  4. Followers acknowledge receipt
  5. Leader commits when majority acknowledge
  6. Leader sends commit message to followers
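
A sketch of the leader side of this protocol, assuming a hypothetical send_append(follower, index, entry) helper that returns True when a follower acknowledges; a real implementation would issue these RPCs concurrently with timeouts:

def replicate_write(entry, log, followers, send_append):
    """Append locally, replicate to followers, commit once a majority acknowledges."""
    index = len(log)
    log.append(entry)                       # steps 1-2: the write lands in the leader's log
    acks = 1                                # the leader counts itself
    for follower in followers:              # steps 3-4: ship the entry, collect acks
        if send_append(follower, index, entry):
            acks += 1
    majority = (len(followers) + 1) // 2 + 1
    if acks >= majority:                    # steps 5-6: commit once a majority has the entry
        return {"committed": True, "index": index}
    return {"committed": False, "index": index}

# Usage with a fake transport: followers "f1" and "f2" both acknowledge.
log = []
result = replicate_write({"op": "SET", "key": "user:1", "value": "Alice"},
                         log, ["f1", "f2"], lambda f, i, e: True)
print(result)  # {'committed': True, 'index': 0}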

Failure scenarios:

  • Leader fails: Followers detect via heartbeat timeout, elect new leader
  • Follower fails: Leader continues, follower catches up on reconnect
  • Network partition: Split-brain prevention via quorum

Consistency models:

  • Synchronous: Wait for all followers (strong consistency, high latency)
  • Asynchronous: Don’t wait (low latency, eventual consistency)
  • Semi-synchronous: Wait for majority (balance)

Learning milestones:

  1. Basic replication → Writes replicate to followers
  2. Read from followers → Can read from any node
  3. Leader failure → Automatic failover works
  4. Consistency → Understand trade-offs

Project 5: Consistent Hashing and Data Partitioning

  • File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, Java
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Partitioning / Sharding
  • Software or Tool: Hash functions, distributed systems
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A distributed key-value store that partitions data across multiple nodes using consistent hashing. Handles node additions and removals with minimal data movement.

Why it teaches data-intensive design: Partitioning is essential for scaling beyond single-machine limits. Consistent hashing minimizes data movement when nodes join/leave.

Core challenges you’ll face:

  • Hash function selection → maps to uniform distribution
  • Virtual nodes → maps to load balancing
  • Rebalancing → maps to minimizing data movement
  • Handling node failures → maps to replication across partitions

Key Concepts:

  • Partitioning Strategies: “Designing Data-Intensive Applications” Ch. 6
  • Consistent Hashing: “Designing Data-Intensive Applications” Ch. 6
  • Rebalancing: DynamoDB partitioning documentation

Difficulty: Intermediate Time estimate: 2-3 weeks Prerequisites: Understanding of hashing, basic distributed systems

Real world outcome:

# Start cluster with 3 nodes
$ ./partitioned-kv --nodes=node1,node2,node3
Cluster started with 3 nodes

# Add data
$ ./partitioned-kv set user:1 "Alice"
Stored on node2 (hash: 0x5a7f...)

$ ./partitioned-kv set user:2 "Bob"
Stored on node3 (hash: 0x8c2d...)

# Add new node
$ ./partitioned-kv add-node node4
Node added. Rebalancing...
Moved 25% of keys to node4

# Remove node
$ ./partitioned-kv remove-node node2
Node removed. Rebalancing...
Moved keys to node1 and node3

# Query still works
$ ./partitioned-kv get user:1
Alice

Implementation Hints:

Consistent hashing ring:

    0x0000
        │
        ├─ node1 (0x4000)
        │
        ├─ node2 (0x8000)
        │
        ├─ node3 (0xC000)
        │
    0xFFFF

Key "user:1" hashes to 0x5a7f → belongs to node2

Virtual nodes:

  • Each physical node has multiple virtual nodes on ring
  • Improves load distribution
  • Example: node1 → [vnode1-1, vnode1-2, vnode1-3]

Rebalancing:

  • When node added: only move keys from immediate neighbors
  • When node removed: redistribute its keys to neighbors
  • Use replication factor of 3: store key on node and next 2 nodes
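
A compact sketch of the ring with virtual nodes, using bisect to find the first virtual node clockwise of a key's hash (MD5 is used here only because it is built in and spreads keys evenly):

import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, vnodes=3):
        self.ring = []                       # sorted list of (hash, physical node)
        for node in nodes:
            for i in range(vnodes):          # several ring positions per physical node
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        """Walk clockwise: the first virtual node at or after the key's hash owns it."""
        h = self._hash(key)
        hashes = [pos for pos, _ in self.ring]           # recomputed per call; fine for a sketch
        i = bisect.bisect_left(hashes, h) % len(self.ring)  # wrap around the ring
        return self.ring[i][1]

ring = HashRing(["node1", "node2", "node3"])
print(ring.node_for("user:1"))   # one of node1/node2/node3, stable across calls
print(ring.node_for("user:2"))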

Learning milestones:

  1. Basic partitioning → Keys distributed across nodes
  2. Consistent hashing → Ring structure works
  3. Node addition → Minimal data movement
  4. Load balancing → Even distribution with virtual nodes

Project 6: Two-Phase Commit Protocol

  • File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Java, Python
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Distributed Transactions / Consensus
  • Software or Tool: Network programming, state machines
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A distributed transaction coordinator that implements two-phase commit (2PC) to ensure atomicity across multiple nodes.

Why it teaches data-intensive design: 2PC is fundamental for distributed transactions. Understanding it teaches you the challenges of achieving consensus in distributed systems.

Core challenges you’ll face:

  • Coordinator failure → maps to blocking problem
  • Participant failure → maps to recovery procedures
  • Network partitions → maps to availability vs consistency
  • Timeout handling → maps to detecting failures

Key Concepts:

  • Two-Phase Commit: “Designing Data-Intensive Applications” Ch. 9
  • Distributed Transactions: “Database Systems” Ch. 19
  • Consensus Algorithms: Raft paper

Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Understanding of transactions, network programming

Real world outcome:

# Start coordinator
$ ./txn-coordinator --port=9000
Coordinator started

# Start participants (databases)
$ ./participant --id=db1 --coordinator=localhost:9000
$ ./participant --id=db2 --coordinator=localhost:9000
$ ./participant --id=db3 --coordinator=localhost:9000

# Execute distributed transaction
$ ./client --coordinator=localhost:9000 \
    --txn="UPDATE db1.users SET balance=balance-100 WHERE id=1; \
           UPDATE db2.accounts SET balance=balance+100 WHERE id=2; \
           UPDATE db3.logs SET action='transfer' WHERE id=3;"

[Coordinator] Phase 1: Prepare
[db1] Prepared (vote: YES)
[db2] Prepared (vote: YES)
[db3] Prepared (vote: YES)
[Coordinator] Phase 2: Commit
[db1] Committed
[db2] Committed
[db3] Committed
Transaction completed successfully

Implementation Hints:

2PC Protocol:

Phase 1 (Prepare):
  1. Coordinator sends PREPARE to all participants
  2. Each participant votes YES (ready) or NO (abort)
  3. Coordinator collects votes

Phase 2 (Commit/Abort):
  4. If all YES: Coordinator sends COMMIT
  5. If any NO: Coordinator sends ABORT
  6. Participants commit or abort and acknowledge
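
A minimal coordinator-side sketch of the two phases, assuming each participant exposes hypothetical prepare(), commit(), and abort() calls; a real coordinator also writes its decision to a durable log before phase 2 so it can recover:

def two_phase_commit(participants):
    """Phase 1: collect votes. Phase 2: commit only if every vote was YES."""
    votes = []
    for p in participants:
        try:
            votes.append(p.prepare())      # participant persists its intent and votes
        except Exception:
            votes.append(False)            # a crash or timeout counts as a NO vote
    decision = all(votes)
    # A real coordinator durably logs `decision` here, before telling anyone.
    for p in participants:
        if decision:
            p.commit()
        else:
            p.abort()
    return "COMMITTED" if decision else "ABORTED"

class FakeParticipant:
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit, self.state = name, can_commit, "idle"
    def prepare(self):
        self.state = "prepared"
        return self.can_commit
    def commit(self):
        self.state = "committed"
    def abort(self):
        self.state = "aborted"

dbs = [FakeParticipant("db1"), FakeParticipant("db2"), FakeParticipant("db3", can_commit=False)]
print(two_phase_commit(dbs))               # ABORTED, because db3 voted NO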

Failure scenarios:

  • Participant fails before voting: Coordinator times out waiting for the vote and aborts
  • Participant fails after voting YES: On recovery it is in doubt; it must ask the coordinator for the outcome and then commit or abort accordingly
  • Coordinator fails after prepare: Participants that voted YES cannot decide on their own and block until the coordinator recovers (the blocking problem)

Improvements:

  • Three-phase commit: Reduces blocking but more complex
  • Saga pattern: Alternative for long-running transactions

Learning milestones:

  1. Basic 2PC → Simple transactions commit/abort correctly
  2. Failure handling → Handles participant failures
  3. Coordinator failure → Understand blocking problem
  4. Recovery → Participants recover correctly

Project 7: Raft Consensus Algorithm Implementation

  • File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, Java
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Consensus / Distributed Systems
  • Software or Tool: Network programming, state machines
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A Raft consensus implementation that maintains a replicated log across multiple nodes. Handles leader election, log replication, and safety guarantees.

Why it teaches data-intensive design: Raft is used in production systems (etcd, Consul). Understanding it teaches you how distributed systems achieve consensus and maintain consistency.

Core challenges you’ll face:

  • Leader election → maps to majority voting, term numbers
  • Log replication → maps to matching logs, commit index
  • Safety properties → maps to election restriction, log matching
  • Network partitions → maps to split-brain prevention

Key Concepts:

  • Raft Algorithm: “In Search of an Understandable Consensus Algorithm” - Diego Ongaro
  • Consensus: “Designing Data-Intensive Applications” Ch. 9
  • Distributed Logs: Raft paper

Difficulty: Expert Time estimate: 4-6 weeks Prerequisites: Strong understanding of distributed systems, network programming

Real world outcome:

# Start 5-node Raft cluster
$ ./raft-node --id=1 --cluster=1,2,3,4,5 --port=8001
$ ./raft-node --id=2 --cluster=1,2,3,4,5 --port=8002
$ ./raft-node --id=3 --cluster=1,2,3,4,5 --port=8003
$ ./raft-node --id=4 --cluster=1,2,3,4,5 --port=8004
$ ./raft-node --id=5 --cluster=1,2,3,4,5 --port=8005

# Cluster elects leader
[Node 3] Elected leader for term 1

# Append entry to log
$ curl http://localhost:8003/append -d '{"command":"SET key1 value1"}'
{"success":true,"index":1}

# Entry replicated to majority
[Node 1] Log entry 1 committed
[Node 2] Log entry 1 committed
[Node 3] Log entry 1 committed

# Leader fails
$ kill <node-3-pid>

# New leader elected
[Node 1] Elected leader for term 2

# Cluster continues operating
$ curl http://localhost:8001/append -d '{"command":"SET key2 value2"}'
{"success":true,"index":2}

Implementation Hints:

Raft components:

  1. Leader Election: Nodes vote for leader, need majority
  2. Log Replication: Leader replicates log entries to followers
  3. Safety: Election restriction ensures leader has all committed entries

State machine:

Follower → Candidate → Leader
   ↑         ↓          ↓
   └─────────┴──────────┘

  • Follower → Candidate: election timeout without hearing from a leader
  • Candidate → Leader: receives votes from a majority of the cluster
  • Candidate → Follower: discovers the current leader or a higher term
  • Leader → Follower: discovers a higher term

Key invariants:

  • Election Safety: At most one leader can be elected in a given term
  • Leader Append-Only: A leader never overwrites or deletes entries in its own log; it only appends
  • Log Matching: If two logs contain an entry with the same index and term, the logs are identical in all entries up through that index
  • Leader Completeness: If an entry is committed in a given term, it is present in the logs of the leaders of all later terms
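
The election restriction behind these invariants comes down to one comparison in the vote handler: a node grants its vote only if the candidate's log is at least as up to date as its own. A sketch of that rule (state variables follow the paper; a real node would also step down and update its own term when it sees a higher one):

def grant_vote(my_term, my_voted_for, my_last_log_term, my_last_log_index,
               cand_id, cand_term, cand_last_log_term, cand_last_log_index):
    """Return True if this node should vote for the candidate in cand_term."""
    if cand_term < my_term:
        return False                        # stale candidate
    if cand_term == my_term and my_voted_for not in (None, cand_id):
        return False                        # already voted for someone else this term
    # Election restriction: the candidate's log must be at least as up to date.
    return (cand_last_log_term > my_last_log_term or
            (cand_last_log_term == my_last_log_term and
             cand_last_log_index >= my_last_log_index))

# A candidate with a log from an older term is rejected, even in a newer election term.
print(grant_vote(my_term=3, my_voted_for=None, my_last_log_term=3, my_last_log_index=7,
                 cand_id="n2", cand_term=4, cand_last_log_term=2, cand_last_log_index=9))  # False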

Learning milestones:

  1. Leader election → Cluster elects leader correctly
  2. Log replication → Entries replicate to followers
  3. Failure handling → Cluster handles node failures
  4. Safety → All safety properties maintained

Project 8: Event Sourcing System

  • File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Java, Go, TypeScript
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Event Sourcing / CQRS
  • Software or Tool: Message queue, event store
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: An event-sourced system where all changes are stored as a sequence of events. Support event replay, snapshots, and multiple read models.

Why it teaches data-intensive design: Event sourcing provides audit trails, time travel, and decouples write and read models. It’s a powerful pattern for complex domains.

Core challenges you’ll face:

  • Event storage → maps to append-only log, event schema evolution
  • Event replay → maps to rebuilding state from events
  • Snapshots → maps to optimizing replay performance
  • Read models → maps to CQRS pattern, eventual consistency

Key Concepts:

  • Event Sourcing: “Designing Data-Intensive Applications” Ch. 11
  • CQRS: Martin Fowler’s blog on CQRS
  • Event Store: EventStore documentation

Difficulty: Advanced Time estimate: 3-4 weeks Prerequisites: Understanding of domain modeling, message queues

Real world outcome:

# Create event store
$ ./eventstore create myapp
Event store created

# Write events
$ ./eventstore append myapp user:1 \
    '{"type":"UserCreated","name":"Alice","email":"alice@example.com"}' \
    '{"type":"EmailChanged","email":"alice.new@example.com"}' \
    '{"type":"UserDeleted"}'

# Replay events to rebuild state
$ ./eventstore replay myapp user:1
Event 1: UserCreated -> State: {name: Alice, email: alice@example.com}
Event 2: EmailChanged -> State: {name: Alice, email: alice.new@example.com}
Event 3: UserDeleted -> State: {deleted: true}

# Query at specific time
$ ./eventstore query myapp user:1 --at="2024-01-15T10:00:00Z"
{name: Alice, email: alice@example.com}

# Create read model
$ ./eventstore create-read-model myapp users-by-email
Read model created, processing events...

Implementation Hints:

Event store structure:

Stream: user:1
  Event 1: UserCreated (timestamp: T1)
  Event 2: EmailChanged (timestamp: T2)
  Event 3: UserDeleted (timestamp: T3)

Replay algorithm:

def replay(stream_id, up_to=None):
    """Rebuild an aggregate's state by folding its events in order, optionally only up to a point in time."""
    state = {}
    # load_events returns the stream's events in append order;
    # apply_event folds a single event into the state. Both are domain-specific.
    events = load_events(stream_id, up_to)
    for event in events:
        state = apply_event(state, event)
    return state

Snapshots:

  • Periodically save state (e.g., every 100 events)
  • Replay from snapshot instead of beginning
  • Reduces replay time
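
A sketch of snapshot-aware replay, assuming hypothetical load_latest_snapshot and load_events_after helpers alongside the apply_event fold shown earlier:

def replay_with_snapshot(stream_id, load_latest_snapshot, load_events_after, apply_event):
    """Start from the newest snapshot (if any), then fold only the later events."""
    snapshot = load_latest_snapshot(stream_id)       # e.g. taken every 100 events
    if snapshot is None:
        state, last_seq = {}, 0
    else:
        state, last_seq = snapshot["state"], snapshot["seq"]
    for seq, event in load_events_after(stream_id, last_seq):
        state = apply_event(state, event)
        last_seq = seq
    return state, last_seq

# Usage with in-memory stand-ins for the event store.
events = [(1, {"type": "UserCreated", "name": "Alice"}),
          (2, {"type": "EmailChanged", "email": "a@new.example"})]
snap = {"state": {"name": "Alice"}, "seq": 1}
state, seq = replay_with_snapshot(
    "user:1",
    load_latest_snapshot=lambda sid: snap,
    load_events_after=lambda sid, after: [(s, e) for s, e in events if s > after],
    apply_event=lambda st, ev: {**st, **{k: v for k, v in ev.items() if k != "type"}})
print(state, seq)   # {'name': 'Alice', 'email': 'a@new.example'} 2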

Read models:

  • Project events into denormalized views
  • Update asynchronously
  • Optimized for queries

Learning milestones:

  1. Event storage → Events stored and retrieved
  2. Event replay → State rebuilt from events
  3. Snapshots → Replay optimized with snapshots
  4. Read models → Multiple views of same data

Project 9: Message Queue with At-Least-Once Delivery

  • File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Java, Rust, Python
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Message Queues / Asynchronous Processing
  • Software or Tool: Network programming, persistence
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A message queue that guarantees at-least-once delivery. Supports topics, consumer groups, and message acknowledgments.

Why it teaches data-intensive design: Message queues are essential for decoupling services and handling asynchronous processing. Understanding delivery guarantees is crucial.

Core challenges you’ll face:

  • Message persistence → maps to durability guarantees
  • Consumer groups → maps to load balancing, parallel processing
  • Acknowledgments → maps to at-least-once vs exactly-once
  • Message ordering → maps to per-partition ordering

Key Concepts:

  • Message Brokers: “Designing Data-Intensive Applications” Ch. 11
  • Delivery Guarantees: “Designing Data-Intensive Applications” Ch. 11
  • Kafka Architecture: Kafka documentation

Difficulty: Advanced Time estimate: 3-4 weeks Prerequisites: Network programming, understanding of queues

Real world outcome:

# Start message queue
$ ./mq-server --port=9092
Message queue started

# Create topic
$ ./mq-cli create-topic orders --partitions=3 --replication=2
Topic 'orders' created

# Produce messages
$ ./mq-cli produce orders '{"order_id":1,"amount":99.99}'
Message produced: offset=0, partition=0

$ ./mq-cli produce orders '{"order_id":2,"amount":149.99}'
Message produced: offset=0, partition=1

# Consume messages
$ ./mq-cli consume orders --group=processors --from-beginning
Message 1: {"order_id":1,"amount":99.99} (offset=0, partition=0)
Message 2: {"order_id":2,"amount":149.99} (offset=0, partition=1)

# Acknowledge message
$ ./mq-cli ack orders --group=processors --offset=0 --partition=0
Acknowledged

# Consumer group rebalancing
$ ./mq-cli consume orders --group=processors --consumer-id=consumer-2
Assigned partitions: [1, 2]

Implementation Hints:

Message queue structure:

Topic: orders
  Partition 0: [msg0, msg1, msg2, ...]
  Partition 1: [msg0, msg1, msg2, ...]
  Partition 2: [msg0, msg1, msg2, ...]

Consumer groups:

  • Multiple consumers in same group share partitions
  • Each partition consumed by one consumer in group
  • Rebalancing when consumers join/leave

Delivery guarantees:

  • At-least-once: Message delivered at least once (may duplicate)
  • At-most-once: Message delivered at most once (may lose)
  • Exactly-once: Message delivered exactly once (hardest)

Acknowledgments:

  • Consumer processes message
  • Sends ACK to broker
  • Broker removes message (or marks as processed)
  • If ACK not received, redeliver
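
Because redelivery is always possible under at-least-once semantics, consumers are usually written to be idempotent. A sketch of a consumer loop that deduplicates by (partition, offset) before acknowledging; fetch_message, process, and ack are stand-ins for whatever client API the broker exposes:

def consume_loop(fetch_message, process, ack, processed_offsets):
    """At-least-once consumption: process, then ack; duplicate redeliveries are skipped."""
    while True:
        msg = fetch_message()               # returns (partition, offset, payload) or None
        if msg is None:
            break
        partition, offset, payload = msg
        if (partition, offset) in processed_offsets:
            ack(partition, offset)          # duplicate redelivery: acknowledge and move on
            continue
        process(payload)                    # side effects should themselves be idempotent
        processed_offsets.add((partition, offset))
        ack(partition, offset)              # a crash before this line means redelivery later

# Usage with an in-memory "broker" that redelivers one message.
queue = [(0, 0, '{"order_id":1}'), (0, 0, '{"order_id":1}'), (0, 1, '{"order_id":2}')]
seen, acked = set(), []
consume_loop(lambda: queue.pop(0) if queue else None,
             process=lambda p: print("processing", p),
             ack=lambda part, off: acked.append((part, off)),
             processed_offsets=seen)
print(acked)   # [(0, 0), (0, 0), (0, 1)]; the second (0, 0) was a duplicate and was skipped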

Learning milestones:

  1. Basic queue → Produce and consume messages
  2. Persistence → Messages survive restarts
  3. Consumer groups → Load balancing works
  4. At-least-once → Guarantee maintained

Project 10: Distributed Cache with Cache-Aside Pattern

  • File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Python, Java, Rust
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Caching / Performance Optimization
  • Software or Tool: Network programming, LRU eviction
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A distributed cache (like Redis) with cache-aside pattern. Supports TTL, eviction policies, and cluster mode.

Why it teaches data-intensive design: Caching is crucial for performance. Understanding cache patterns and eviction strategies teaches you trade-offs in system design.

Core challenges you’ll face:

  • Eviction policies → maps to LRU, LFU, TTL-based
  • Cache invalidation → maps to when to invalidate, cache-aside vs write-through
  • Distributed caching → maps to consistent hashing, replication
  • Cache stampede → maps to preventing thundering herd

Key Concepts:

  • Caching Patterns: “Designing Data-Intensive Applications” Ch. 3
  • Cache-Aside: Martin Fowler’s blog
  • Eviction Policies: Redis documentation

Difficulty: Intermediate Time estimate: 2-3 weeks Prerequisites: Understanding of hashing, basic distributed systems

Real world outcome:

# Start cache cluster
$ ./cache-server --port=6379 --cluster
Cache cluster started with 3 nodes

# Set with TTL
$ ./cache-cli set user:1 "Alice" --ttl=3600
OK

# Get
$ ./cache-cli get user:1
Alice

# Cache-aside pattern
$ ./cache-cli get user:999
(nil)

# Application fetches from database, then caches
$ ./cache-cli set user:999 "Bob" --ttl=3600
OK

# Eviction when full (LRU)
$ ./cache-cli set key:1000 "value"
OK (evicted key:1)

# Cluster mode
$ ./cache-cli set key:1 "value" --cluster
Stored on node2 (hash: 0x3a7f...)

Implementation Hints:

Cache-aside pattern:

1. Application checks cache
2. If miss: fetch from database
3. Store in cache for future requests
4. On write: update database, invalidate cache
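
These four steps translate almost directly into code. A sketch of the read and write halves of cache-aside, with a dict standing in for the cache and db_load / db_save as placeholder database calls:

import time

cache = {}   # key -> (value, expires_at)

def cache_get(key, db_load, ttl=3600):
    """Read path: check the cache, fall back to the database, then populate the cache."""
    entry = cache.get(key)
    if entry and entry[1] > time.time():
        return entry[0]                          # cache hit
    value = db_load(key)                         # cache miss: go to the database
    if value is not None:
        cache[key] = (value, time.time() + ttl)  # store for future reads
    return value

def cache_write(key, value, db_save):
    """Write path: update the database, then invalidate (not update) the cache."""
    db_save(key, value)
    cache.pop(key, None)

# Usage with a dict pretending to be the database.
db = {"user:1": "Alice"}
print(cache_get("user:1", db.get))      # miss: loads from db, caches
print(cache_get("user:1", db.get))      # hit
cache_write("user:1", "Alicia", db.__setitem__)
print(cache_get("user:1", db.get))      # miss again after invalidation: "Alicia"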

Eviction policies:

  • LRU: Evict least recently used
  • LFU: Evict least frequently used
  • TTL: Evict expired entries
  • Random: Simple but less effective

Cache stampede prevention:

  • Lock on cache miss (only one fetches from DB)
  • Background refresh before expiration
  • Probabilistic early expiration

Learning milestones:

  1. Basic cache → Get/set operations work
  2. Eviction → LRU eviction works correctly
  3. TTL → Expiration works
  4. Cache-aside → Pattern implemented correctly

Project 11: MapReduce Implementation

  • File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, Java
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Batch Processing / Distributed Computing
  • Software or Tool: Distributed systems, file I/O
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A MapReduce implementation that processes large datasets in parallel across multiple workers. Handles task distribution, failure recovery, and result aggregation.

Why it teaches data-intensive design: MapReduce is the foundation of batch processing. Understanding it teaches you how to process large datasets efficiently.

Core challenges you’ll face:

  • Task distribution → maps to scheduling, load balancing
  • Failure handling → maps to retry, speculative execution
  • Data locality → maps to processing data where it’s stored
  • Shuffle phase → maps to sorting and grouping intermediate results

Key Concepts:

  • MapReduce: “Designing Data-Intensive Applications” Ch. 10
  • Batch Processing: “Designing Data-Intensive Applications” Ch. 10
  • Distributed Computing: Google MapReduce paper

Difficulty: Advanced Time estimate: 3-4 weeks Prerequisites: Understanding of distributed systems, file I/O

Real world outcome:

# Start MapReduce cluster
$ ./mapreduce-master --port=8080
Master started

$ ./mapreduce-worker --master=localhost:8080 --port=8081
Worker 1 started

$ ./mapreduce-worker --master=localhost:8080 --port=8082
Worker 2 started

# Submit job: word count
$ ./mapreduce-cli submit \
    --input=/data/books/*.txt \
    --output=/results/wordcount \
    --mapper=word_count_map.py \
    --reducer=word_count_reduce.py

Job submitted: job-12345

# Monitor progress
$ ./mapreduce-cli status job-12345
Job: job-12345
Status: RUNNING
Map tasks: 10/10 complete
Reduce tasks: 1/3 complete
Progress: 65%

# Job completes
$ ./mapreduce-cli status job-12345
Status: COMPLETED
Output: /results/wordcount/part-00000
        /results/wordcount/part-00001
        /results/wordcount/part-00002

# View results
$ cat /results/wordcount/part-00000
the 15234
and 8921
of 6543
...

Implementation Hints:

MapReduce phases:

1. Map Phase:
   - Master splits input into chunks
   - Assigns chunks to workers
   - Workers process chunks, emit (key, value) pairs

2. Shuffle Phase:
   - Sort and group by key
   - Partition to reducers

3. Reduce Phase:
   - Reducers process grouped values
   - Write final output
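
To make the three phases concrete, here is a single-process word-count sketch in which map, shuffle, and reduce are just three functions over in-memory data; the distributed version runs the same steps partitioned across workers:

from collections import defaultdict

def map_phase(chunk):
    """Map: emit (word, 1) for every word in the input chunk."""
    return [(word, 1) for word in chunk.split()]

def shuffle_phase(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: sum the counts for one word."""
    return key, sum(values)

chunks = ["the quick brown fox", "the lazy dog the end"]
intermediate = [pair for chunk in chunks for pair in map_phase(chunk)]
grouped = shuffle_phase(intermediate)
result = dict(reduce_phase(k, v) for k, v in grouped.items())
print(result["the"])   # 3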

Failure handling:

  • Worker failure: Reassign tasks to other workers
  • Task failure: Retry with exponential backoff
  • Speculative execution: Run slow tasks on multiple workers

Data locality:

  • Prefer workers that have input data locally
  • Reduces network transfer

Learning milestones:

  1. Basic MapReduce → Simple jobs complete
  2. Failure handling → Handles worker failures
  3. Shuffle → Intermediate data sorted correctly
  4. Scalability → Handles large datasets

Project 12: Stream Processing with Windowing

  • File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Java, Scala
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Stream Processing / Real-Time Analytics
  • Software or Tool: Message queue, time-based processing
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A stream processing engine that processes continuous data streams with windowing (tumbling, sliding, session windows) and aggregations.

Why it teaches data-intensive design: Stream processing enables real-time analytics. Understanding windowing and watermarks teaches you how to handle out-of-order events.

Core challenges you’ll face:

  • Event time vs processing time → maps to handling late events
  • Windowing → maps to tumbling, sliding, session windows
  • Watermarks → maps to determining when window is complete
  • State management → maps to maintaining window state

Key Concepts:

  • Stream Processing: “Designing Data-Intensive Applications” Ch. 11
  • Windowing: Flink documentation on windows
  • Watermarks: “Streaming Systems” by Akidau et al.

Difficulty: Advanced Time estimate: 3-4 weeks Prerequisites: Understanding of message queues, time-based processing

Real world outcome:

# Start stream processor
$ ./stream-processor --port=8080
Stream processor started

# Create stream
$ ./stream-cli create-stream clicks --source=kafka://localhost:9092/topic:clicks
Stream created

# Define processing: count clicks per minute
$ ./stream-cli query clicks \
    --window=tumbling,1min \
    --aggregate=count \
    --group-by=user_id

Query started: query-123

# Process events
Event: {user_id: 1, timestamp: 10:00:15, page: /home}
Event: {user_id: 2, timestamp: 10:00:23, page: /products}
Event: {user_id: 1, timestamp: 10:00:45, page: /cart}
Window [10:00:00-10:01:00]: user_1=2, user_2=1

# Sliding window: count clicks in last 5 minutes, every minute
$ ./stream-cli query clicks \
    --window=sliding,5min,1min \
    --aggregate=count

Implementation Hints:

Window types:

  • Tumbling: Fixed-size, non-overlapping (e.g., every 1 minute)
  • Sliding: Fixed-size, overlapping (e.g., last 5 minutes, every 1 minute)
  • Session: Dynamic based on gaps (e.g., 10-minute inactivity)

Watermarks:

  • Indicate when events are “complete”
  • Allow processing of late events (within allowed lateness)
  • Example: watermark = max(event_time) - 1 minute
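
A sketch of a tumbling-window counter keyed by user, with a simple "max event time minus allowed lateness" watermark deciding when a window may be emitted (event shapes follow the example above; timestamps are plain seconds for brevity):

from collections import defaultdict

WINDOW = 60          # tumbling window size in seconds
LATENESS = 60        # allowed lateness used for the watermark

def window_start(ts):
    return ts - (ts % WINDOW)

def run(events):
    """Count events per (window, user); emit a window once the watermark passes it."""
    counts = defaultdict(lambda: defaultdict(int))   # window_start -> user -> count
    max_event_time = 0
    emitted = []
    for ev in events:                                # events may arrive out of order
        counts[window_start(ev["timestamp"])][ev["user_id"]] += 1
        max_event_time = max(max_event_time, ev["timestamp"])
        watermark = max_event_time - LATENESS
        for ws in [w for w in counts if w + WINDOW <= watermark]:
            emitted.append((ws, dict(counts.pop(ws))))  # window complete: emit its result
    return emitted, counts

events = [{"user_id": 1, "timestamp": 15}, {"user_id": 2, "timestamp": 23},
          {"user_id": 1, "timestamp": 45}, {"user_id": 1, "timestamp": 150}]
done, pending = run(events)
print(done)      # [(0, {1: 2, 2: 1})]; the [0, 60) window closed once the watermark passed it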

State management:

  • Store window state in memory or external store
  • Handle state recovery on failures

Learning milestones:

  1. Basic streaming → Process events continuously
  2. Windowing → Tumbling windows work
  3. Watermarks → Handle late events correctly
  4. Aggregations → Compute windowed aggregates

Project 13: Database Connection Pool

  • File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Java, Python
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Connection Management / Resource Pooling
  • Software or Tool: Database connections, concurrency
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A connection pool that manages database connections efficiently, handling connection lifecycle, health checks, and load balancing.

Why it teaches data-intensive design: Connection pooling is essential for performance. Understanding it teaches you resource management and concurrency patterns.

Core challenges you’ll face:

  • Connection lifecycle → maps to create, reuse, destroy
  • Health checks → maps to detecting dead connections
  • Load balancing → maps to distributing connections across servers
  • Concurrency → maps to thread-safe access

Key Concepts:

  • Connection Pooling: Database connection pool patterns
  • Resource Management: “Designing Data-Intensive Applications” Ch. 3
  • Concurrency: Go concurrency patterns

Difficulty: Intermediate Time estimate: 1-2 weeks Prerequisites: Understanding of concurrency, database connections

Real world outcome:

# Start connection pool
$ ./pool-manager \
    --max-connections=100 \
    --min-connections=10 \
    --max-idle-time=300s \
    --health-check-interval=30s \
    --servers=db1:5432,db2:5432,db3:5432

Pool started with 3 servers

# Get connection
$ ./pool-client get-connection
Connection acquired: conn-123 (server: db1)

# Execute query
$ ./pool-client execute conn-123 "SELECT * FROM users LIMIT 10"
[Results...]

# Return connection
$ ./pool-client return-connection conn-123
Connection returned to pool

# Pool statistics
$ ./pool-client stats
Active connections: 45
Idle connections: 55
Total connections: 100
Server distribution:
  db1: 33 connections
  db2: 33 connections
  db3: 34 connections

Implementation Hints:

Connection pool structure:

Pool:
  Available: [conn1, conn2, conn3, ...]
  In-use: {request-id: conn}
  Servers: [db1, db2, db3]

Operations:

  • Acquire: Get connection from pool (create if needed)
  • Release: Return connection to pool
  • Health check: Periodically test connections
  • Evict: Remove idle connections after timeout
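
A thread-safe sketch of acquire/release built on a bounded queue.Queue, with a hypothetical make_conn factory standing in for the real database driver (health checks and idle eviction are omitted):

import queue
import threading

class ConnectionPool:
    def __init__(self, make_conn, max_size=10):
        self._make_conn = make_conn
        self._available = queue.Queue(maxsize=max_size)
        self._created = 0
        self._lock = threading.Lock()
        self._max_size = max_size

    def acquire(self, timeout=5.0):
        """Reuse an idle connection, or create one if we are under the limit."""
        try:
            return self._available.get_nowait()
        except queue.Empty:
            pass
        with self._lock:
            if self._created < self._max_size:
                self._created += 1
                return self._make_conn()
        return self._available.get(timeout=timeout)   # otherwise wait for a release

    def release(self, conn):
        """Return a connection for reuse (a real pool would health-check it first)."""
        self._available.put(conn)

# Usage with a fake connection factory.
pool = ConnectionPool(make_conn=lambda: object(), max_size=2)
c1, c2 = pool.acquire(), pool.acquire()
pool.release(c1)
print(pool.acquire() is c1)   # True: the released connection is reused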

Load balancing:

  • Round-robin across servers
  • Least connections
  • Weighted distribution

Learning milestones:

  1. Basic pooling → Connections reused correctly
  2. Health checks → Dead connections detected
  3. Load balancing → Connections distributed evenly
  4. Concurrency → Thread-safe operations

Project 14: Change Data Capture (CDC) System

  • File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Java, Go
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Data Integration / ETL
  • Software or Tool: Database replication logs, message queue
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A CDC system that captures database changes (inserts, updates, deletes) and streams them to downstream systems (data warehouse, search index, cache).

Why it teaches data-intensive design: CDC enables real-time data integration. Understanding it teaches you how to keep multiple systems in sync.

Core challenges you’ll face:

  • Reading replication logs → maps to parsing binary logs, WAL
  • Change ordering → maps to maintaining order across transactions
  • Schema evolution → maps to handling schema changes
  • Downstream delivery → maps to reliable delivery, idempotency

Key Concepts:

  • Change Data Capture: “Designing Data-Intensive Applications” Ch. 11
  • Database Replication: “Designing Data-Intensive Applications” Ch. 5
  • Event Streaming: Kafka Connect documentation

Difficulty: Advanced Time estimate: 3-4 weeks Prerequisites: Understanding of database internals, message queues

Real world outcome:

# Start CDC connector
$ ./cdc-connector \
    --source=postgresql://localhost/db \
    --sink=kafka://localhost:9092/topic:changes \
    --tables=users,orders,products

CDC connector started

# Monitor changes
$ ./cdc-cli tail changes
{"table":"users","op":"INSERT","id":1,"data":{"name":"Alice","email":"alice@example.com"}}
{"table":"users","op":"UPDATE","id":1,"data":{"name":"Alice","email":"alice.new@example.com"}}
{"table":"orders","op":"INSERT","id":100,"data":{"user_id":1,"amount":99.99}}
{"table":"users","op":"DELETE","id":1}

# Sync to Elasticsearch
$ ./cdc-cli sync changes --target=elasticsearch://localhost:9200
Syncing changes to Elasticsearch...
Indexed 1000 documents

# Sync to data warehouse
$ ./cdc-cli sync changes --target=warehouse://s3://bucket/changes
Syncing changes to warehouse...

Implementation Hints:

CDC approaches:

  1. Replication logs: Read database’s replication log (MySQL binlog, PostgreSQL WAL)
  2. Triggers: Database triggers on changes
  3. Polling: Periodically query for changes (less efficient)

Change format:

{
  "table": "users",
  "operation": "INSERT|UPDATE|DELETE",
  "timestamp": "2024-01-15T10:00:00Z",
  "before": {...},  // For UPDATE/DELETE
  "after": {...}   // For INSERT/UPDATE
}

Ordering:

  • Use transaction ID or LSN (Log Sequence Number)
  • Process changes in order within transaction
  • Handle out-of-order events
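
A sketch of that ordering rule on the consumer side: apply changes in LSN order and make the apply idempotent, so a replay after a failure cannot corrupt the sink. The change dicts follow the format above, with illustrative id and lsn fields added; sink is an in-memory stand-in for a search index or warehouse table:

def apply_changes(changes, sink, applied_lsn=0):
    """Apply CDC events to a key/value sink, skipping anything at or below applied_lsn."""
    for change in sorted(changes, key=lambda c: c["lsn"]):   # enforce log order
        if change["lsn"] <= applied_lsn:
            continue                                          # already applied: idempotent replay
        key = (change["table"], change["id"])
        if change["operation"] == "DELETE":
            sink.pop(key, None)
        else:                                                 # INSERT or UPDATE
            sink[key] = change["after"]
        applied_lsn = change["lsn"]
    return applied_lsn

sink = {}
changes = [
    {"lsn": 2, "table": "users", "id": 1, "operation": "UPDATE",
     "after": {"name": "Alice", "email": "alice.new@example.com"}},
    {"lsn": 1, "table": "users", "id": 1, "operation": "INSERT",
     "after": {"name": "Alice", "email": "alice@example.com"}},
]
last = apply_changes(changes, sink)
print(sink[("users", 1)]["email"], last)   # alice.new@example.com 2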

Learning milestones:

  1. Capture changes → Read replication logs correctly
  2. Stream changes → Publish to message queue
  3. Ordering → Maintain correct order
  4. Downstream sync → Update multiple systems

Project 15: Distributed Lock Service

  • File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Java, Python
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Distributed Coordination / Locks
  • Software or Tool: Consensus algorithm, TTL
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A distributed lock service (like etcd or ZooKeeper) that provides mutual exclusion across multiple processes/nodes.

Why it teaches data-intensive design: Distributed locks are essential for coordination. Understanding them teaches you lease management, failure handling, and consensus.

Core challenges you’ll face:

  • Lease management → maps to TTL, renewal, expiration
  • Failure handling → maps to detecting dead nodes, lock release
  • Fairness → maps to FIFO ordering, preventing starvation
  • Consensus → maps to agreement on lock ownership

Key Concepts:

  • Distributed Locks: “Designing Data-Intensive Applications” Ch. 9
  • Leases: etcd documentation
  • Coordination: ZooKeeper documentation

Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Understanding of consensus, TTL mechanisms

Real world outcome:

# Start lock service
$ ./lock-service --port=2379
Lock service started

# Acquire lock
$ ./lock-client acquire my-lock --ttl=30s
Lock acquired: lock-id-12345
TTL: 30s

# Renew lock
$ ./lock-client renew lock-id-12345 --ttl=30s
Lock renewed, new TTL: 30s

# Another client tries to acquire
$ ./lock-client acquire my-lock --ttl=30s
Lock unavailable, waiting...

# First client releases
$ ./lock-client release lock-id-12345
Lock released

# Second client acquires
Lock acquired: lock-id-67890

# Lock expires if not renewed
[After 30s]
Lock expired: lock-id-67890

Implementation Hints:

Lock service structure:

Lock: "my-lock"
  Owner: client-12345
  TTL: 30s
  Created: 2024-01-15T10:00:00Z
  Expires: 2024-01-15T10:00:30Z

Operations:

  • Acquire: Create lock if not exists, set TTL
  • Renew: Extend TTL before expiration
  • Release: Delete lock
  • Watch: Get notified when lock released

Failure handling:

  • If client crashes, lock expires automatically
  • Use heartbeats to detect client failures
  • Implement fencing tokens for safety
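
Fencing tokens in a nutshell: the lock service hands out a monotonically increasing number with each grant, and the protected resource rejects any request carrying a token older than one it has already seen. A minimal sketch:

class LockService:
    def __init__(self):
        self._token = 0
    def acquire(self, name):
        """Grant the lock and return a monotonically increasing fencing token."""
        self._token += 1
        return self._token

class ProtectedStore:
    def __init__(self):
        self._highest_token = 0
        self.data = {}
    def write(self, token, key, value):
        """Reject writes that carry a stale fencing token."""
        if token < self._highest_token:
            raise PermissionError(f"stale token {token}")
        self._highest_token = token
        self.data[key] = value

locks, store = LockService(), ProtectedStore()
t1 = locks.acquire("my-lock")        # client A
t2 = locks.acquire("my-lock")        # client B, after A's lease expired
store.write(t2, "file", "from B")    # accepted
try:
    store.write(t1, "file", "from A")    # A wakes up late with an expired lease
except PermissionError as e:
    print(e)                             # stale token 1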

Learning milestones:

  1. Basic locking → Acquire/release works
  2. TTL → Locks expire correctly
  3. Renewal → TTL extended on renewal
  4. Failure handling → Crashed clients release locks

Project 16: Time-Series Database

  • File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Time-Series Data / Specialized Databases
  • Software or Tool: Compression, columnar storage
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A time-series database optimized for storing and querying time-stamped data (metrics, events). Supports compression, downsampling, and efficient range queries.

Why it teaches data-intensive design: Time-series data has unique characteristics. Understanding specialized storage teaches you how to optimize for specific access patterns.

Core challenges you’ll face:

  • Compression → maps to delta encoding, run-length encoding
  • Downsampling → maps to aggregating data over time
  • Retention policies → maps to data lifecycle management
  • Query optimization → maps to time-range queries, aggregation

Key Concepts:

  • Time-Series Databases: InfluxDB documentation
  • Compression: “Designing Data-Intensive Applications” Ch. 3
  • Columnar Storage: “Designing Data-Intensive Applications” Ch. 3

Difficulty: Advanced Time estimate: 3-4 weeks Prerequisites: Understanding of compression, database internals

Real world outcome:

# Start time-series database
$ ./tsdb --data-dir=/data/tsdb
Time-series database started

# Create database
$ ./tsdb-cli create-db metrics
Database created

# Write metrics
$ ./tsdb-cli write metrics \
    cpu.usage,host=server1 value=45.2 1642234567 \
    cpu.usage,host=server1 value=46.1 1642234577 \
    cpu.usage,host=server1 value=44.8 1642234587

# Query: average CPU usage last hour
$ ./tsdb-cli query metrics \
    "SELECT mean(value) FROM cpu.usage WHERE host='server1' AND time > now() - 1h"
mean(value)
45.37

# Downsampling: create 1-minute averages
$ ./tsdb-cli create-downsample metrics cpu.usage --interval=1m --function=mean
Downsample created

# Retention: delete data older than 30 days
$ ./tsdb-cli set-retention metrics --duration=30d
Retention policy set

Implementation Hints:

Time-series storage:

Metric: cpu.usage,host=server1
  Timestamps: [1642234567, 1642234577, 1642234587, ...]
  Values: [45.2, 46.1, 44.8, ...]

Compression:

  • Delta encoding: Store differences between timestamps
  • Run-length encoding: Compress repeated values
  • Gorilla compression: Efficient for floating-point
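
Delta encoding is the easiest of these to see in code: regularly spaced timestamps collapse into a start value plus a run of small, repetitive deltas that compress well. A sketch (Gorilla goes further, encoding deltas of deltas and XORed float bits):

def delta_encode(timestamps):
    """Store the first timestamp plus the difference to each following one."""
    if not timestamps:
        return []
    encoded = [timestamps[0]]
    for prev, curr in zip(timestamps, timestamps[1:]):
        encoded.append(curr - prev)          # for 10-second scrapes this is always 10
    return encoded

def delta_decode(encoded):
    values = [encoded[0]]
    for delta in encoded[1:]:
        values.append(values[-1] + delta)
    return values

ts = [1642234567, 1642234577, 1642234587, 1642234597]
enc = delta_encode(ts)
print(enc)                        # [1642234567, 10, 10, 10]: small, repetitive, easy to compress
assert delta_decode(enc) == ts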

Downsampling:

  • Pre-aggregate data at different resolutions
  • 1-second → 1-minute → 1-hour → 1-day
  • Reduces storage and query time

Learning milestones:

  1. Basic storage → Store and retrieve time-series data
  2. Compression → Data compressed efficiently
  3. Downsampling → Aggregations computed correctly
  4. Query optimization → Range queries fast

Project 17: Graph Database with Gremlin Query Language

  • File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Java, Go
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Graph Databases / Graph Algorithms
  • Software or Tool: Graph algorithms, query language
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A graph database that stores nodes and edges, with a query language (Gremlin-like) for traversing relationships and finding paths.

Why it teaches data-intensive design: Graph databases excel at relationship queries. Understanding them teaches you when to use graph vs relational models.

Core challenges you’ll face:

  • Graph storage → maps to adjacency lists, edge properties
  • Traversal algorithms → maps to BFS, DFS, shortest path
  • Query language → maps to Gremlin-style syntax
  • Indexing → maps to indexing nodes and edges

Key Concepts:

  • Graph Databases: “Designing Data-Intensive Applications” Ch. 2
  • Graph Algorithms: “Introduction to Algorithms” Ch. 22
  • Gremlin: Apache TinkerPop documentation

Difficulty: Expert Time estimate: 4-6 weeks Prerequisites: Understanding of graphs, algorithms, query parsing

Real world outcome:

# Start graph database
$ ./graphdb --port=8182
Graph database started

# Create graph
$ ./gremlin-cli
gremlin> graph = Graph.open()
gremlin> g = graph.traversal()

# Add vertices
gremlin> alice = g.addV('person').property('name', 'Alice').next()
gremlin> bob = g.addV('person').property('name', 'Bob').next()
gremlin> company = g.addV('company').property('name', 'Acme').next()

# Add edges
gremlin> g.addE('knows').from(alice).to(bob).property('since', 2020).next()
gremlin> g.addE('worksFor').from(alice).to(company).next()

# Query: find Alice's friends
gremlin> g.V().has('name', 'Alice').out('knows').values('name')
==>Bob

# Query: find people who work at same company as Alice
gremlin> g.V().has('name', 'Alice').out('worksFor').in('worksFor').values('name')
==>Alice

# Shortest path
gremlin> g.V().has('name', 'Alice').repeat(out()).until(has('name', 'Bob')).path()
==>[v[1], v[2]]

Implementation Hints:

Graph storage:

Vertices:
  v1: {label: 'person', properties: {name: 'Alice'}}
  v2: {label: 'person', properties: {name: 'Bob'}}

Edges:
  e1: {from: v1, to: v2, label: 'knows', properties: {since: 2020}}

Traversal:

  • Start at vertex
  • Follow edges (out/in/both)
  • Filter by properties
  • Return results
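
A sketch of the traversal core: vertices and edges kept in dicts, an adjacency list for fast out() hops, and a BFS that doubles as an unweighted shortest-path search (data follows the Alice/Bob example above):

from collections import deque

vertices = {1: {"label": "person", "name": "Alice"},
            2: {"label": "person", "name": "Bob"},
            3: {"label": "company", "name": "Acme"}}
edges = [{"from": 1, "to": 2, "label": "knows", "since": 2020},
         {"from": 1, "to": 3, "label": "worksFor"}]

# Adjacency list: vertex id -> list of (edge label, target vertex id).
out_edges = {}
for e in edges:
    out_edges.setdefault(e["from"], []).append((e["label"], e["to"]))

def out(vertex_id, label=None):
    """One traversal step: follow outgoing edges, optionally filtered by label."""
    return [t for lbl, t in out_edges.get(vertex_id, []) if label is None or lbl == label]

def shortest_path(src, dst):
    """BFS gives the shortest path in an unweighted graph."""
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in out(path[-1]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print([vertices[v]["name"] for v in out(1, "knows")])   # ['Bob']
print(shortest_path(1, 2))                              # [1, 2]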

Algorithms:

  • BFS: Level-order traversal
  • DFS: Depth-first traversal
  • Shortest path: Dijkstra’s algorithm
  • PageRank: Centrality measure

Learning milestones:

  1. Basic graph → Store vertices and edges
  2. Traversal → Navigate relationships
  3. Query language → Gremlin-like queries work
  4. Algorithms → Shortest path, etc. work

Project 18: Complete Data-Intensive Application

  • File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
  • Main Programming Language: Python/Go
  • Alternative Programming Languages: Java, TypeScript
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 5: Master
  • Knowledge Area: Complete System Design / Integration
  • Software or Tool: All previous projects
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A complete data-intensive application (e.g., social media platform, e-commerce site, analytics platform) that integrates all concepts: replication, partitioning, caching, message queues, batch and stream processing.

Why it teaches data-intensive design: Building a complete system integrates all concepts. You’ll make real-world trade-offs and see how components interact.

Core challenges you’ll face:

  • System architecture → maps to component selection, data flow
  • Scalability → maps to handling growth, performance optimization
  • Reliability → maps to failure handling, monitoring
  • Consistency → maps to choosing appropriate consistency models

Difficulty: Master Time estimate: 2-3 months Prerequisites: All previous projects

Real world outcome:

# Complete social media platform

# Architecture:
# - PostgreSQL (primary database) with read replicas
# - Redis (caching layer)
# - Kafka (event streaming)
# - Elasticsearch (search)
# - Spark (batch analytics)
# - Flink (real-time analytics)

# Features:
# - User posts, comments, likes
# - News feed (personalized, real-time)
# - Search
# - Analytics dashboard
# - Recommendations

# Deploy
$ ./deploy.sh
Starting services...
PostgreSQL: Running (1 primary, 3 replicas)
Redis: Running (cluster mode, 6 nodes)
Kafka: Running (3 brokers)
Elasticsearch: Running (5 nodes)
Spark: Running
Flink: Running

# Load test
$ ./load-test --users=1000000 --posts-per-user=100
Generating load...
Throughput: 50,000 requests/sec
P95 latency: 45ms
P99 latency: 120ms

# Monitor
$ ./monitor dashboard
[Real-time metrics dashboard showing all components]

Implementation Hints:

System components:

  1. API Layer: REST/GraphQL APIs
  2. Database: PostgreSQL with replication
  3. Cache: Redis for hot data
  4. Search: Elasticsearch for full-text search
  5. Events: Kafka for event streaming
  6. Batch: Spark for analytics
  7. Stream: Flink for real-time processing
  8. Monitoring: Metrics, logs, tracing

Design decisions:

  • When to use cache vs database?
  • When to use eventual consistency?
  • How to handle failures?
  • How to scale each component?

Learning milestones:

  1. Basic system → Core features work
  2. Scalability → Handles load
  3. Reliability → Handles failures
  4. Monitoring → Observability in place

Project Comparison Table

| # | Project | Difficulty | Time | Key Skill | Fun |
|---|---------|------------|------|-----------|-----|
| 1 | B-Tree KV Store | ⭐⭐⭐ | 3-4 weeks | Storage Engines | ⭐⭐⭐⭐ |
| 2 | LSM-Tree Storage | ⭐⭐⭐ | 3-4 weeks | Write Optimization | ⭐⭐⭐⭐ |
| 3 | Query Engine | ⭐⭐⭐⭐ | 4-6 weeks | Query Processing | ⭐⭐⭐⭐ |
| 4 | Leader-Follower Replication | ⭐⭐⭐ | 3-4 weeks | Replication | ⭐⭐⭐ |
| 5 | Consistent Hashing | ⭐⭐ | 2-3 weeks | Partitioning | ⭐⭐⭐ |
| 6 | Two-Phase Commit | ⭐⭐⭐ | 2-3 weeks | Distributed Transactions | ⭐⭐⭐ |
| 7 | Raft Consensus | ⭐⭐⭐⭐ | 4-6 weeks | Consensus | ⭐⭐⭐⭐⭐ |
| 8 | Event Sourcing | ⭐⭐⭐ | 3-4 weeks | Event-Driven | ⭐⭐⭐⭐ |
| 9 | Message Queue | ⭐⭐⭐ | 3-4 weeks | Async Processing | ⭐⭐⭐ |
| 10 | Distributed Cache | ⭐⭐ | 2-3 weeks | Caching | ⭐⭐⭐ |
| 11 | MapReduce | ⭐⭐⭐ | 3-4 weeks | Batch Processing | ⭐⭐⭐⭐ |
| 12 | Stream Processing | ⭐⭐⭐ | 3-4 weeks | Real-Time | ⭐⭐⭐⭐ |
| 13 | Connection Pool | ⭐⭐ | 1-2 weeks | Resource Management | ⭐⭐ |
| 14 | Change Data Capture | ⭐⭐⭐ | 3-4 weeks | Data Integration | ⭐⭐⭐ |
| 15 | Distributed Locks | ⭐⭐⭐ | 2-3 weeks | Coordination | ⭐⭐⭐ |
| 16 | Time-Series DB | ⭐⭐⭐ | 3-4 weeks | Specialized Storage | ⭐⭐⭐⭐ |
| 17 | Graph Database | ⭐⭐⭐⭐ | 4-6 weeks | Graph Algorithms | ⭐⭐⭐⭐ |
| 18 | Complete Application | ⭐⭐⭐⭐⭐ | 2-3 months | System Design | ⭐⭐⭐⭐⭐ |

Phase 1: Storage Foundations (6-8 weeks)

Understand how data is stored and retrieved:

  1. Project 1: B-Tree KV Store - Read-optimized storage
  2. Project 2: LSM-Tree Storage - Write-optimized storage
  3. Project 13: Connection Pool - Resource management

Phase 2: Replication and Partitioning (6-8 weeks)

Learn how to scale beyond single machines:

  1. Project 4: Leader-Follower Replication - High availability
  2. Project 5: Consistent Hashing - Horizontal scaling
  3. Project 7: Raft Consensus - Distributed consensus

Phase 3: Transactions and Consistency (4-6 weeks)

Understand consistency models:

  1. Project 6: Two-Phase Commit - Distributed transactions
  2. Project 15: Distributed Locks - Coordination

Phase 4: Processing Patterns (8-10 weeks)

Learn batch and stream processing:

  1. Project 8: Event Sourcing - Event-driven architecture
  2. Project 9: Message Queue - Asynchronous processing
  3. Project 11: MapReduce - Batch processing
  4. Project 12: Stream Processing - Real-time processing

Phase 5: Specialized Systems (6-8 weeks)

Explore specialized databases:

  1. Project 3: Query Engine - SQL processing
  2. Project 14: Change Data Capture - Data integration
  3. Project 16: Time-Series DB - Time-series optimization
  4. Project 17: Graph Database - Graph algorithms

Phase 6: Integration (2-3 months)

Build complete systems:

  1. Project 10: Distributed Cache - Performance optimization
  2. Project 18: Complete Application - Full system integration

Summary

| # | Project | Main Language |
|---|---------|---------------|
| 1 | B-Tree KV Store | C |
| 2 | LSM-Tree Storage | C |
| 3 | Query Engine | C++ |
| 4 | Leader-Follower Replication | Go |
| 5 | Consistent Hashing | Python |
| 6 | Two-Phase Commit | Go |
| 7 | Raft Consensus | Go |
| 8 | Event Sourcing | Python |
| 9 | Message Queue | Go |
| 10 | Distributed Cache | Go |
| 11 | MapReduce | Python |
| 12 | Stream Processing | Python |
| 13 | Connection Pool | Go |
| 14 | Change Data Capture | Python |
| 15 | Distributed Locks | Go |
| 16 | Time-Series DB | Go |
| 17 | Graph Database | Python |
| 18 | Complete Application | Python/Go |

Resources

Essential Books

  • “Designing Data-Intensive Applications” by Martin Kleppmann - The definitive guide
  • “Database Systems: The Complete Book” by Garcia-Molina et al. - Deep database internals
  • “Streaming Systems” by Tyler Akidau et al. - Stream processing
  • “Building Microservices” by Sam Newman - System design

Key Papers

  • “In Search of an Understandable Consensus Algorithm” (Raft) - Diego Ongaro and John Ousterhout
  • “MapReduce: Simplified Data Processing on Large Clusters” - Jeffrey Dean and Sanjay Ghemawat (Google)
  • “Dynamo: Amazon’s Highly Available Key-value Store” - Amazon

Tools and Technologies

  • PostgreSQL: Relational database
  • Redis: In-memory cache
  • Kafka: Message queue
  • Spark: Batch processing
  • Flink: Stream processing
  • etcd: Distributed coordination

Total Estimated Time: 12-18 months of dedicated study

After completion: You’ll be able to design, build, and scale data-intensive applications that handle millions of users and petabytes of data. These skills are essential for backend engineering, data engineering, platform engineering, and systems architecture roles at top tech companies.