Learn Data-Intensive Applications Design: From Zero to Systems Architect

Goal: Deeply understand how to design, build, and scale data-intensive applications. Master the principles of scalability, reliability, consistency, and efficiency that underpin modern distributed systems.


Why Learn Data-Intensive Applications Design?

Every modern application is data-intensive. Whether it’s a social media platform handling billions of posts, an e-commerce site processing millions of transactions, or a real-time analytics system ingesting terabytes of events, understanding how to design these systems is essential.

After completing these projects, you will:

  • Design scalable systems that handle millions of users and petabytes of data
  • Understand trade-offs between consistency, availability, and partition tolerance (CAP theorem)
  • Implement replication, partitioning, and consensus algorithms
  • Build reliable systems that handle failures gracefully
  • Design efficient data models and storage engines
  • Create real-time and batch processing pipelines
  • Make informed decisions about database selection and architecture

Core Concept Analysis

The Data-Intensive Application Stack

┌─────────────────────────────────────────────────────────────┐
│                    APPLICATION LAYER                          │
│         (Business Logic, APIs, User Interface)               │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                  CACHING LAYER                               │
│              (Redis, Memcached, CDN)                         │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│              DATA PROCESSING LAYER                            │
│    (Batch: Spark, Hadoop | Stream: Kafka, Flink)            │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                  DATABASE LAYER                               │
│  (Relational: PostgreSQL | NoSQL: MongoDB, Cassandra)          │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                  STORAGE LAYER                                │
│         (B-Trees, LSM-Trees, Object Storage)                  │
└─────────────────────────────────────────────────────────────┘

Fundamental Concepts

  1. Reliability: System continues working correctly even when things go wrong
    • Hardware faults, software errors, human errors
    • Fault tolerance vs fault prevention
  2. Scalability: System can handle growth in load
    • Vertical scaling (bigger machines) vs horizontal scaling (more machines)
    • Load parameters: requests/sec, read/write ratio, data volume
  3. Maintainability: System is easy to operate and modify
    • Operability, simplicity, evolvability
  4. Data Models: How data is represented
    • Relational (SQL), Document (JSON), Graph, Key-Value
  5. Storage Engines: How data is stored and retrieved
    • B-Trees (read-optimized), LSM-Trees (write-optimized)
  6. Replication: Keeping copies of data on multiple nodes
    • Leader-follower, multi-leader, leaderless
    • Consistency models: eventual, strong, causal
  7. Partitioning: Splitting data across multiple nodes
    • Range partitioning, hash partitioning
    • Rebalancing strategies
  8. Transactions: Grouping operations that must succeed or fail together
    • ACID properties
    • Isolation levels
  9. Consensus: Getting multiple nodes to agree on something
    • Two-phase commit, Raft, Paxos
  10. Batch Processing: Processing large volumes of data periodically
    • MapReduce, Spark
  11. Stream Processing: Processing data continuously as it arrives
    • Event sourcing, CQRS, Kafka

Project List

The following 18 projects will teach you data-intensive application design from fundamentals to advanced distributed systems.


Project 1: Key-Value Store with B-Tree Index

  • File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go, Python
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Storage Engines / Indexing
  • Software or Tool: File I/O, B-Tree implementation
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A persistent key-value store that uses a B-Tree index for efficient lookups. It should support insert, get, delete, and range queries with durability guarantees.

Why it teaches data-intensive design: B-Trees are the foundation of most relational databases. Building one teaches you how databases organize data on disk, handle page I/O, and maintain indexes efficiently.

Core challenges you’ll face:

  • Implementing B-Tree structure → maps to understanding balanced tree algorithms
  • Handling page I/O → maps to disk vs memory trade-offs
  • Managing splits and merges → maps to maintaining tree balance
  • Ensuring durability → maps to write-ahead logging or fsync

Key Concepts:

  • B-Tree Structure: “Designing Data-Intensive Applications” Ch. 3 - Kleppmann
  • Page-Oriented Storage: “Database Systems: The Complete Book” Ch. 13
  • Durability: “Designing Data-Intensive Applications” Ch. 7

Difficulty: Advanced Time estimate: 3-4 weeks Prerequisites: C programming, understanding of trees and file I/O

Real world outcome:

$ ./kvstore create mydb
Database created: mydb

$ ./kvstore set mydb user:123 '{"name":"Alice","email":"alice@example.com"}'
OK

$ ./kvstore get mydb user:123
{"name":"Alice","email":"alice@example.com"}

$ ./kvstore range mydb user:100 user:200
user:123: {"name":"Alice","email":"alice@example.com"}
user:145: {"name":"Bob","email":"bob@example.com"}

$ ./kvstore delete mydb user:123
OK

Implementation Hints:

B-Tree structure:

  • Each node contains multiple keys and pointers
  • Internal nodes have keys and child pointers
  • Leaf nodes have keys and values
  • All leaves at same depth (balanced)
  • Typical node size: 4KB (one disk page)
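
To make the node layout concrete, here is a minimal in-memory sketch (Python, for brevity) of a node and the in-node binary search; it is illustrative only, since a real implementation serializes each node into a fixed-size on-disk page:

import bisect
from dataclasses import dataclass, field

@dataclass
class BTreeNode:
    is_leaf: bool = True
    keys: list = field(default_factory=list)      # sorted keys
    values: list = field(default_factory=list)    # leaf only: values aligned with keys
    children: list = field(default_factory=list)  # internal only: len(keys) + 1 children

def search(node: BTreeNode, key):
    """Descend from the root, using binary search inside each node."""
    i = bisect.bisect_left(node.keys, key)
    if node.is_leaf:
        if i < len(node.keys) and node.keys[i] == key:
            return node.values[i]
        return None
    # In an internal node, keys act as separators between children.
    if i < len(node.keys) and node.keys[i] == key:
        i += 1  # in this sketch, keys equal to the separator live in the right subtree
    return search(node.children[i], key)

# Tiny usage example with a hand-built two-level tree.
leaf1 = BTreeNode(keys=["a", "c"], values=[1, 3])
leaf2 = BTreeNode(keys=["m", "z"], values=[13, 26])
root = BTreeNode(is_leaf=False, keys=["m"], children=[leaf1, leaf2])
assert search(root, "c") == 3 and search(root, "q") is None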

Key design decisions:

  1. How many keys per node? (branching factor)
  2. How to handle variable-length values?
  3. When to split nodes? (typically when full)
  4. How to ensure atomic writes? (WAL or copy-on-write)

Questions to guide implementation:

  • How do you find a key efficiently? (binary search within node)
  • How do you handle concurrent access? (locking or MVCC)
  • How do you recover from crashes? (WAL replay)

Learning milestones:

  1. Basic B-Tree operations → Insert, search, delete work correctly
  2. Persistence → Data survives restarts
  3. Range queries → Efficiently scan key ranges
  4. Concurrent access → Multiple readers/writers

Project 2: Log-Structured Merge Tree (LSM-Tree) Storage Engine

  • File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Storage Engines / Write-Optimized Structures
  • Software or Tool: File I/O, Bloom filters
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A key-value store using LSM-Tree architecture (like LevelDB/RocksDB). It should have an in-memory memtable, write-ahead log, and multiple sorted string tables (SSTables) that get compacted.

Why it teaches data-intensive design: LSM-Trees are write-optimized, making them perfect for high-write workloads. Understanding them teaches you trade-offs between read and write performance.

Core challenges you’ll face:

  • Memtable management → maps to in-memory sorted structure
  • SSTable creation → maps to sorted file format
  • Compaction strategies → maps to merging sorted files
  • Bloom filters → maps to probabilistic data structures for fast lookups

Key Concepts:

  • LSM-Tree Architecture: “Designing Data-Intensive Applications” Ch. 3
  • SSTable Format: LevelDB documentation
  • Compaction: RocksDB tuning guide

Difficulty: Advanced Time estimate: 3-4 weeks Prerequisites: Project 1, understanding of sorting algorithms

Real world outcome:

$ ./lsmstore create mydb
LSM-Tree database created

$ ./lsmstore set mydb key1 value1
OK

# High write throughput
$ for i in {1..100000}; do
    ./lsmstore set mydb "key$i" "value$i"
done
# Completes in seconds (much faster than B-Tree for writes)

$ ./lsmstore get mydb key50000
value50000

# Compaction happens automatically
$ ./lsmstore stats mydb
Level 0: 5 SSTables
Level 1: 2 SSTables (compacted)
Total keys: 100000

Implementation Hints:

LSM-Tree structure:

Write Path:
  Write → Memtable (in-memory sorted tree)
  When memtable full → Flush to SSTable (Level 0)
  Background compaction merges SSTables

Read Path:
  Check memtable → Check Level 0 SSTables → Check Level 1+ SSTables
  Use Bloom filter to skip SSTables that don't contain key
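
A minimal sketch of that read path, assuming a plain dict as the memtable and SSTables modeled as in-memory objects with a set standing in for the Bloom filter (real SSTables are immutable sorted files on disk):

TOMBSTONE = object()  # marker written on delete

class SSTable:
    def __init__(self, items):
        self.data = dict(items)          # stand-in for a sorted on-disk file
        self.bloom = set(self.data)      # stand-in for a Bloom filter (no false negatives)

    def maybe_contains(self, key):
        return key in self.bloom

    def get(self, key):
        return self.data.get(key)

def lsm_get(key, memtable, sstables_newest_first):
    """Check the memtable first, then SSTables from newest to oldest."""
    if key in memtable:
        value = memtable[key]
        return None if value is TOMBSTONE else value
    for sst in sstables_newest_first:
        if not sst.maybe_contains(key):
            continue                      # skip files that cannot hold the key
        value = sst.get(key)
        if value is not None:
            return None if value is TOMBSTONE else value
    return None

# Usage: newer data shadows older data.
memtable = {"k1": "v1-new"}
sstables = [SSTable([("k2", "v2")]), SSTable([("k1", "v1-old"), ("k3", "v3")])]
assert lsm_get("k1", memtable, sstables) == "v1-new"
assert lsm_get("k3", memtable, sstables) == "v3"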

Compaction strategies:

  • Size-tiered: Merge SSTables of similar size into larger ones
  • Leveled: Each level has a fixed size budget; overflowing data is merged into the next level
  • Hybrids: Many engines mix the two, e.g., size-tiered behavior at level 0 and leveled compaction above it

Key questions:

  • How big should the memtable be before flushing?
  • How many levels should you have?
  • When to trigger compaction? (based on size or ratio)

Learning milestones:

  1. Write path works → Writes go to memtable and WAL
  2. Read path works → Can retrieve values correctly
  3. Compaction works → SSTables merge correctly
  4. Performance → Write throughput exceeds B-Tree

Project 3: Relational Database Query Engine

  • File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
  • Main Programming Language: C++
  • Alternative Programming Languages: Rust, Java
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Query Processing / Relational Algebra
  • Software or Tool: SQL parser, query optimizer
  • Main Book: “Database Systems: The Complete Book” by Garcia-Molina et al.

What you’ll build: A query engine that parses SQL, builds an execution plan, and executes queries with operators like scan, filter, join, sort, and aggregate.

Why it teaches data-intensive design: Understanding how databases execute queries is essential for writing efficient SQL and designing schemas.

Core challenges you’ll face:

  • SQL parsing → maps to converting text to relational algebra
  • Query optimization → maps to choosing efficient execution plans
  • Join algorithms → maps to nested loop, hash join, sort-merge join
  • Operator pipelining → maps to streaming vs materialization

Key Concepts:

  • Relational Algebra: “Database Systems” Ch. 2
  • Query Optimization: “Database Systems” Ch. 16
  • Join Algorithms: “Database Systems” Ch. 15

Difficulty: Expert Time estimate: 4-6 weeks Prerequisites: Understanding of SQL, algorithms and data structures

Real world outcome:

-- Create table
CREATE TABLE users (id INT, name VARCHAR(100), email VARCHAR(100));
CREATE TABLE orders (id INT, user_id INT, amount DECIMAL(10,2));

-- Insert data
INSERT INTO users VALUES (1, 'Alice', 'alice@example.com');
INSERT INTO orders VALUES (1, 1, 99.99);

-- Query
SELECT u.name, SUM(o.amount) as total
FROM users u
JOIN orders o ON u.id = o.user_id
GROUP BY u.id, u.name
HAVING SUM(o.amount) > 50;

-- Output:
-- name  | total
-- Alice | 99.99

Implementation Hints:

Query execution pipeline:

SQL → Parser → AST → Optimizer → Execution Plan → Executor → Results

Key operators to implement:

  1. TableScan: Read all rows from table
  2. Filter: Apply WHERE conditions
  3. Project: Select specific columns
  4. Join: Combine two tables (nested loop, hash, sort-merge)
  5. Sort: Order rows
  6. Aggregate: GROUP BY operations
  7. Limit: Top N results
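
One common way to wire these operators together is the iterator ("Volcano") model, where each operator pulls rows from its child. A sketch using Python generators, with a hash join and tables shaped like the example above (names and row layouts are illustrative):

def table_scan(rows):
    """Leaf operator: yield every row of an in-memory 'table'."""
    yield from rows

def filter_op(child, predicate):
    """Apply a WHERE-style predicate to rows pulled from the child."""
    for row in child:
        if predicate(row):
            yield row

def project(child, columns):
    """Keep only the requested columns."""
    for row in child:
        yield {c: row[c] for c in columns}

def hash_join(left, right, left_key, right_key):
    """Build a hash table on the left input, then probe it with the right input."""
    build = {}
    for row in left:
        build.setdefault(row[left_key], []).append(row)
    for row in right:
        for match in build.get(row[right_key], []):
            yield {**match, **row}

# SELECT u.name, o.amount FROM users u JOIN orders o ON u.id = o.user_id WHERE o.amount > 50
users = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
orders = [{"order_id": 1, "user_id": 1, "amount": 99.99},
          {"order_id": 2, "user_id": 2, "amount": 10.0}]
plan = project(
    filter_op(hash_join(table_scan(users), table_scan(orders), "id", "user_id"),
              lambda r: r["amount"] > 50),
    ["name", "amount"])
print(list(plan))  # [{'name': 'Alice', 'amount': 99.99}]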

Optimization techniques:

  • Push predicates down (filter early)
  • Choose join order (smallest table first)
  • Use indexes when available
  • Estimate costs (I/O, CPU)

Learning milestones:

  1. Parse SQL → Convert to execution plan
  2. Execute simple queries → SELECT, WHERE work
  3. Implement joins → Multiple algorithms work
  4. Optimize queries → Choose efficient plans

Project 4: Leader-Follower Replication System

  • File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Python, Java, Rust
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Replication / High Availability
  • Software or Tool: Network programming, consensus
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A replicated key-value store with one leader and multiple followers. Writes go to the leader and are replicated to followers. Handles leader failure and automatic failover.

Why it teaches data-intensive design: Replication is fundamental for availability and read scalability. Understanding leader-follower replication teaches you consistency models and failure handling.

Core challenges you’ll face:

  • Synchronous vs asynchronous replication → maps to consistency vs latency trade-offs
  • Replication lag → maps to eventual consistency issues
  • Leader election → maps to distributed consensus
  • Split-brain prevention → maps to quorum-based decisions

Key Concepts:

  • Replication Models: “Designing Data-Intensive Applications” Ch. 5
  • Consistency Guarantees: “Designing Data-Intensive Applications” Ch. 9
  • Leader Election: Raft paper

Difficulty: Advanced Time estimate: 3-4 weeks Prerequisites: Network programming, understanding of distributed systems basics

Real world outcome:

# Start leader
$ ./replicated-kv --node-id=1 --port=8001 --leader
Leader started on port 8001

# Start followers
$ ./replicated-kv --node-id=2 --port=8002 --follower --leader-addr=localhost:8001
Follower 2 connected to leader

$ ./replicated-kv --node-id=3 --port=8003 --follower --leader-addr=localhost:8001
Follower 3 connected to leader

# Write to leader
$ curl "http://localhost:8001/set?key=user:1&value=Alice"
OK

# Read from any node
$ curl http://localhost:8002/get?key=user:1
Alice

# Leader fails
$ kill <leader-pid>

# Automatic failover
[Follower 2] Leader failed, starting election...
[Follower 2] Elected as new leader
[Follower 3] Following new leader: node-2

Implementation Hints:

Replication protocol:

  1. Client sends write to leader
  2. Leader writes to local log
  3. Leader sends log entries to followers
  4. Followers acknowledge receipt
  5. Leader commits when majority acknowledge
  6. Leader sends commit message to followers
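
A sketch of the leader side of this protocol, assuming a hypothetical send_append(follower, index, entry) helper that returns True when a follower acknowledges; a real implementation would issue these RPCs concurrently with timeouts:

def replicate_write(entry, log, followers, send_append):
    """Append locally, replicate to followers, commit once a majority acknowledges."""
    index = len(log)
    log.append(entry)                       # steps 1-2: the write lands in the leader's log
    acks = 1                                # the leader counts itself
    for follower in followers:              # steps 3-4: ship the entry, collect acks
        if send_append(follower, index, entry):
            acks += 1
    majority = (len(followers) + 1) // 2 + 1
    if acks >= majority:                    # steps 5-6: commit once a majority has the entry
        return {"committed": True, "index": index}
    return {"committed": False, "index": index}

# Usage with a fake transport: followers "f1" and "f2" both acknowledge.
log = []
result = replicate_write({"op": "SET", "key": "user:1", "value": "Alice"},
                         log, ["f1", "f2"], lambda f, i, e: True)
print(result)  # {'committed': True, 'index': 0}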

Failure scenarios:

  • Leader fails: Followers detect via heartbeat timeout, elect new leader
  • Follower fails: Leader continues, follower catches up on reconnect
  • Network partition: Split-brain prevention via quorum

Consistency models:

  • Synchronous: Wait for all followers (strong consistency, high latency)
  • Asynchronous: Don’t wait (low latency, eventual consistency)
  • Semi-synchronous: Wait for majority (balance)

Learning milestones:

  1. Basic replication → Writes replicate to followers
  2. Read from followers → Can read from any node
  3. Leader failure → Automatic failover works
  4. Consistency → Understand trade-offs

Project 5: Consistent Hashing and Data Partitioning

  • File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, Java
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Partitioning / Sharding
  • Software or Tool: Hash functions, distributed systems
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A distributed key-value store that partitions data across multiple nodes using consistent hashing. Handles node additions and removals with minimal data movement.

Why it teaches data-intensive design: Partitioning is essential for scaling beyond single-machine limits. Consistent hashing minimizes data movement when nodes join/leave.

Core challenges you’ll face:

  • Hash function selection → maps to uniform distribution
  • Virtual nodes → maps to load balancing
  • Rebalancing → maps to minimizing data movement
  • Handling node failures → maps to replication across partitions

Key Concepts:

  • Partitioning Strategies: “Designing Data-Intensive Applications” Ch. 6
  • Consistent Hashing: “Designing Data-Intensive Applications” Ch. 6
  • Rebalancing: DynamoDB partitioning documentation

Difficulty: Intermediate Time estimate: 2-3 weeks Prerequisites: Understanding of hashing, basic distributed systems

Real world outcome:

# Start cluster with 3 nodes
$ ./partitioned-kv --nodes=node1,node2,node3
Cluster started with 3 nodes

# Add data
$ ./partitioned-kv set user:1 "Alice"
Stored on node2 (hash: 0x5a7f...)

$ ./partitioned-kv set user:2 "Bob"
Stored on node3 (hash: 0x8c2d...)

# Add new node
$ ./partitioned-kv add-node node4
Node added. Rebalancing...
Moved 25% of keys to node4

# Remove node
$ ./partitioned-kv remove-node node2
Node removed. Rebalancing...
Moved keys to node1 and node3

# Query still works
$ ./partitioned-kv get user:1
Alice

Implementation Hints:

Consistent hashing ring:

    0x0000
        │
        ├─ node1 (0x4000)
        │
        ├─ node2 (0x8000)
        │
        ├─ node3 (0xC000)
        │
    0xFFFF

Key "user:1" hashes to 0x5a7f → belongs to node2

Virtual nodes:

  • Each physical node has multiple virtual nodes on ring
  • Improves load distribution
  • Example: node1 → [vnode1-1, vnode1-2, vnode1-3]

Rebalancing:

  • When node added: only move keys from immediate neighbors
  • When node removed: redistribute its keys to neighbors
  • Use replication factor of 3: store key on node and next 2 nodes
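
A compact sketch of the ring with virtual nodes, using bisect to find the first virtual node clockwise of a key's hash (MD5 is used here only because it is built in and spreads keys evenly):

import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, vnodes=3):
        self.ring = []                       # sorted list of (hash, physical node)
        for node in nodes:
            for i in range(vnodes):          # several ring positions per physical node
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        """Walk clockwise: the first virtual node at or after the key's hash owns it."""
        h = self._hash(key)
        hashes = [pos for pos, _ in self.ring]           # recomputed per call; fine for a sketch
        i = bisect.bisect_left(hashes, h) % len(self.ring)  # wrap around the ring
        return self.ring[i][1]

ring = HashRing(["node1", "node2", "node3"])
print(ring.node_for("user:1"))   # one of node1/node2/node3, stable across calls
print(ring.node_for("user:2"))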

Learning milestones:

  1. Basic partitioning → Keys distributed across nodes
  2. Consistent hashing → Ring structure works
  3. Node addition → Minimal data movement
  4. Load balancing → Even distribution with virtual nodes

Project 6: Two-Phase Commit Protocol

  • File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Java, Python
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Distributed Transactions / Consensus
  • Software or Tool: Network programming, state machines
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A distributed transaction coordinator that implements two-phase commit (2PC) to ensure atomicity across multiple nodes.

Why it teaches data-intensive design: 2PC is fundamental for distributed transactions. Understanding it teaches you the challenges of achieving consensus in distributed systems.

Core challenges you’ll face:

  • Coordinator failure → maps to blocking problem
  • Participant failure → maps to recovery procedures
  • Network partitions → maps to availability vs consistency
  • Timeout handling → maps to detecting failures

Key Concepts:

  • Two-Phase Commit: “Designing Data-Intensive Applications” Ch. 9
  • Distributed Transactions: “Database Systems” Ch. 19
  • Consensus Algorithms: Raft paper

Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Understanding of transactions, network programming

Real world outcome:

# Start coordinator
$ ./txn-coordinator --port=9000
Coordinator started

# Start participants (databases)
$ ./participant --id=db1 --coordinator=localhost:9000
$ ./participant --id=db2 --coordinator=localhost:9000
$ ./participant --id=db3 --coordinator=localhost:9000

# Execute distributed transaction
$ ./client --coordinator=localhost:9000 \
    --txn="UPDATE db1.users SET balance=balance-100 WHERE id=1; \
           UPDATE db2.accounts SET balance=balance+100 WHERE id=2; \
           UPDATE db3.logs SET action='transfer' WHERE id=3;"

[Coordinator] Phase 1: Prepare
[db1] Prepared (vote: YES)
[db2] Prepared (vote: YES)
[db3] Prepared (vote: YES)
[Coordinator] Phase 2: Commit
[db1] Committed
[db2] Committed
[db3] Committed
Transaction completed successfully

Implementation Hints:

2PC Protocol:

Phase 1 (Prepare):
  1. Coordinator sends PREPARE to all participants
  2. Each participant votes YES (ready) or NO (abort)
  3. Coordinator collects votes

Phase 2 (Commit/Abort):
  4. If all YES: Coordinator sends COMMIT
  5. If any NO: Coordinator sends ABORT
  6. Participants commit or abort and acknowledge
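
A minimal coordinator-side sketch of the two phases, assuming each participant exposes hypothetical prepare(), commit(), and abort() calls; a real coordinator also writes its decision to a durable log before phase 2 so it can recover:

def two_phase_commit(participants):
    """Phase 1: collect votes. Phase 2: commit only if every vote was YES."""
    votes = []
    for p in participants:
        try:
            votes.append(p.prepare())      # participant persists its intent and votes
        except Exception:
            votes.append(False)            # a crash or timeout counts as a NO vote
    decision = all(votes)
    # A real coordinator durably logs `decision` here, before telling anyone.
    for p in participants:
        if decision:
            p.commit()
        else:
            p.abort()
    return "COMMITTED" if decision else "ABORTED"

class FakeParticipant:
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit, self.state = name, can_commit, "idle"
    def prepare(self):
        self.state = "prepared"
        return self.can_commit
    def commit(self):
        self.state = "committed"
    def abort(self):
        self.state = "aborted"

dbs = [FakeParticipant("db1"), FakeParticipant("db2"), FakeParticipant("db3", can_commit=False)]
print(two_phase_commit(dbs))               # ABORTED, because db3 voted NO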

Failure scenarios:

  • Participant fails before voting: Coordinator times out waiting for the vote and aborts
  • Participant fails after voting YES: On recovery it is in doubt; it must ask the coordinator for the outcome and then commit or abort accordingly
  • Coordinator fails after prepare: Participants that voted YES cannot decide on their own and block until the coordinator recovers (the blocking problem)

Improvements:

  • Three-phase commit: Reduces blocking but more complex
  • Saga pattern: Alternative for long-running transactions

Learning milestones:

  1. Basic 2PC → Simple transactions commit/abort correctly
  2. Failure handling → Handles participant failures
  3. Coordinator failure → Understand blocking problem
  4. Recovery → Participants recover correctly

Project 7: Raft Consensus Algorithm Implementation

  • File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, Java
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Consensus / Distributed Systems
  • Software or Tool: Network programming, state machines
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A Raft consensus implementation that maintains a replicated log across multiple nodes. Handles leader election, log replication, and safety guarantees.

Why it teaches data-intensive design: Raft is used in production systems (etcd, Consul). Understanding it teaches you how distributed systems achieve consensus and maintain consistency.

Core challenges you’ll face:

  • Leader election → maps to majority voting, term numbers
  • Log replication → maps to matching logs, commit index
  • Safety properties → maps to election restriction, log matching
  • Network partitions → maps to split-brain prevention

Key Concepts:

  • Raft Algorithm: “In Search of an Understandable Consensus Algorithm” - Diego Ongaro
  • Consensus: “Designing Data-Intensive Applications” Ch. 9
  • Distributed Logs: Raft paper

Difficulty: Expert Time estimate: 4-6 weeks Prerequisites: Strong understanding of distributed systems, network programming

Real world outcome:

# Start 5-node Raft cluster
$ ./raft-node --id=1 --cluster=1,2,3,4,5 --port=8001
$ ./raft-node --id=2 --cluster=1,2,3,4,5 --port=8002
$ ./raft-node --id=3 --cluster=1,2,3,4,5 --port=8003
$ ./raft-node --id=4 --cluster=1,2,3,4,5 --port=8004
$ ./raft-node --id=5 --cluster=1,2,3,4,5 --port=8005

# Cluster elects leader
[Node 3] Elected leader for term 1

# Append entry to log
$ curl http://localhost:8003/append -d '{"command":"SET key1 value1"}'
{"success":true,"index":1}

# Entry replicated to majority
[Node 1] Log entry 1 committed
[Node 2] Log entry 1 committed
[Node 3] Log entry 1 committed

# Leader fails
$ kill <node-3-pid>

# New leader elected
[Node 1] Elected leader for term 2

# Cluster continues operating
$ curl http://localhost:8001/append -d '{"command":"SET key2 value2"}'
{"success":true,"index":2}

Implementation Hints:

Raft components:

  1. Leader Election: Nodes vote for leader, need majority
  2. Log Replication: Leader replicates log entries to followers
  3. Safety: Election restriction ensures leader has all committed entries

State machine:

Follower → Candidate → Leader
   ↑         ↓          ↓
   └─────────┴──────────┘

  • Follower → Candidate: election timeout without hearing from a leader
  • Candidate → Leader: receives votes from a majority of the cluster
  • Candidate → Follower: discovers the current leader or a higher term
  • Leader → Follower: discovers a higher term

Key invariants:

  • Election Safety: At most one leader can be elected in a given term
  • Leader Append-Only: A leader never overwrites or deletes entries in its own log; it only appends
  • Log Matching: If two logs contain an entry with the same index and term, the logs are identical in all entries up through that index
  • Leader Completeness: If an entry is committed in a given term, it is present in the logs of the leaders of all later terms
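
The election restriction behind these invariants comes down to one comparison in the vote handler: a node grants its vote only if the candidate's log is at least as up to date as its own. A sketch of that rule (state variables follow the paper; a real node would also step down and update its own term when it sees a higher one):

def grant_vote(my_term, my_voted_for, my_last_log_term, my_last_log_index,
               cand_id, cand_term, cand_last_log_term, cand_last_log_index):
    """Return True if this node should vote for the candidate in cand_term."""
    if cand_term < my_term:
        return False                        # stale candidate
    if cand_term == my_term and my_voted_for not in (None, cand_id):
        return False                        # already voted for someone else this term
    # Election restriction: the candidate's log must be at least as up to date.
    return (cand_last_log_term > my_last_log_term or
            (cand_last_log_term == my_last_log_term and
             cand_last_log_index >= my_last_log_index))

# A candidate with a log from an older term is rejected, even in a newer election term.
print(grant_vote(my_term=3, my_voted_for=None, my_last_log_term=3, my_last_log_index=7,
                 cand_id="n2", cand_term=4, cand_last_log_term=2, cand_last_log_index=9))  # False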

Learning milestones:

  1. Leader election → Cluster elects leader correctly
  2. Log replication → Entries replicate to followers
  3. Failure handling → Cluster handles node failures
  4. Safety → All safety properties maintained

Project 8: Event Sourcing System

  • File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Java, Go, TypeScript
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Event Sourcing / CQRS
  • Software or Tool: Message queue, event store
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: An event-sourced system where all changes are stored as a sequence of events. Support event replay, snapshots, and multiple read models.

Why it teaches data-intensive design: Event sourcing provides audit trails, time travel, and decouples write and read models. It’s a powerful pattern for complex domains.

Core challenges you’ll face:

  • Event storage → maps to append-only log, event schema evolution
  • Event replay → maps to rebuilding state from events
  • Snapshots → maps to optimizing replay performance
  • Read models → maps to CQRS pattern, eventual consistency

Key Concepts:

  • Event Sourcing: “Designing Data-Intensive Applications” Ch. 11
  • CQRS: Martin Fowler’s blog on CQRS
  • Event Store: EventStore documentation

Difficulty: Advanced Time estimate: 3-4 weeks Prerequisites: Understanding of domain modeling, message queues

Real world outcome:

# Create event store
$ ./eventstore create myapp
Event store created

# Write events
$ ./eventstore append myapp user:1 \
    '{"type":"UserCreated","name":"Alice","email":"alice@example.com"}' \
    '{"type":"EmailChanged","email":"alice.new@example.com"}' \
    '{"type":"UserDeleted"}'

# Replay events to rebuild state
$ ./eventstore replay myapp user:1
Event 1: UserCreated -> State: {name: Alice, email: alice@example.com}
Event 2: EmailChanged -> State: {name: Alice, email: alice.new@example.com}
Event 3: UserDeleted -> State: {deleted: true}

# Query at specific time
$ ./eventstore query myapp user:1 --at="2024-01-15T10:00:00Z"
{name: Alice, email: alice@example.com}

# Create read model
$ ./eventstore create-read-model myapp users-by-email
Read model created, processing events...

Implementation Hints:

Event store structure:

Stream: user:1
  Event 1: UserCreated (timestamp: T1)
  Event 2: EmailChanged (timestamp: T2)
  Event 3: UserDeleted (timestamp: T3)

Replay algorithm:

def replay(stream_id, up_to=None):
    """Rebuild an aggregate's state by folding its events in order, optionally only up to a point in time."""
    state = {}
    # load_events returns the stream's events in append order;
    # apply_event folds a single event into the state. Both are domain-specific.
    events = load_events(stream_id, up_to)
    for event in events:
        state = apply_event(state, event)
    return state

Snapshots:

  • Periodically save state (e.g., every 100 events)
  • Replay from snapshot instead of beginning
  • Reduces replay time
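
A sketch of snapshot-aware replay, assuming hypothetical load_latest_snapshot and load_events_after helpers alongside the apply_event fold shown earlier:

def replay_with_snapshot(stream_id, load_latest_snapshot, load_events_after, apply_event):
    """Start from the newest snapshot (if any), then fold only the later events."""
    snapshot = load_latest_snapshot(stream_id)       # e.g. taken every 100 events
    if snapshot is None:
        state, last_seq = {}, 0
    else:
        state, last_seq = snapshot["state"], snapshot["seq"]
    for seq, event in load_events_after(stream_id, last_seq):
        state = apply_event(state, event)
        last_seq = seq
    return state, last_seq

# Usage with in-memory stand-ins for the event store.
events = [(1, {"type": "UserCreated", "name": "Alice"}),
          (2, {"type": "EmailChanged", "email": "a@new.example"})]
snap = {"state": {"name": "Alice"}, "seq": 1}
state, seq = replay_with_snapshot(
    "user:1",
    load_latest_snapshot=lambda sid: snap,
    load_events_after=lambda sid, after: [(s, e) for s, e in events if s > after],
    apply_event=lambda st, ev: {**st, **{k: v for k, v in ev.items() if k != "type"}})
print(state, seq)   # {'name': 'Alice', 'email': 'a@new.example'} 2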

Read models:

  • Project events into denormalized views
  • Update asynchronously
  • Optimized for queries

Learning milestones:

  1. Event storage → Events stored and retrieved
  2. Event replay → State rebuilt from events
  3. Snapshots → Replay optimized with snapshots
  4. Read models → Multiple views of same data

Project 9: Message Queue with At-Least-Once Delivery

  • File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Java, Rust, Python
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Message Queues / Asynchronous Processing
  • Software or Tool: Network programming, persistence
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A message queue that guarantees at-least-once delivery. Supports topics, consumer groups, and message acknowledgments.

Why it teaches data-intensive design: Message queues are essential for decoupling services and handling asynchronous processing. Understanding delivery guarantees is crucial.

Core challenges you’ll face:

  • Message persistence → maps to durability guarantees
  • Consumer groups → maps to load balancing, parallel processing
  • Acknowledgments → maps to at-least-once vs exactly-once
  • Message ordering → maps to per-partition ordering

Key Concepts:

  • Message Brokers: “Designing Data-Intensive Applications” Ch. 11
  • Delivery Guarantees: “Designing Data-Intensive Applications” Ch. 11
  • Kafka Architecture: Kafka documentation

Difficulty: Advanced Time estimate: 3-4 weeks Prerequisites: Network programming, understanding of queues

Real world outcome:

# Start message queue
$ ./mq-server --port=9092
Message queue started

# Create topic
$ ./mq-cli create-topic orders --partitions=3 --replication=2
Topic 'orders' created

# Produce messages
$ ./mq-cli produce orders '{"order_id":1,"amount":99.99}'
Message produced: offset=0, partition=0

$ ./mq-cli produce orders '{"order_id":2,"amount":149.99}'
Message produced: offset=0, partition=1

# Consume messages
$ ./mq-cli consume orders --group=processors --from-beginning
Message 1: {"order_id":1,"amount":99.99} (offset=0, partition=0)
Message 2: {"order_id":2,"amount":149.99} (offset=0, partition=1)

# Acknowledge message
$ ./mq-cli ack orders --group=processors --offset=0 --partition=0
Acknowledged

# Consumer group rebalancing
$ ./mq-cli consume orders --group=processors --consumer-id=consumer-2
Assigned partitions: [1, 2]

Implementation Hints:

Message queue structure:

Topic: orders
  Partition 0: [msg0, msg1, msg2, ...]
  Partition 1: [msg0, msg1, msg2, ...]
  Partition 2: [msg0, msg1, msg2, ...]

Consumer groups:

  • Multiple consumers in same group share partitions
  • Each partition consumed by one consumer in group
  • Rebalancing when consumers join/leave

Delivery guarantees:

  • At-least-once: Message delivered at least once (may duplicate)
  • At-most-once: Message delivered at most once (may lose)
  • Exactly-once: Message delivered exactly once (hardest)

Acknowledgments:

  • Consumer processes message
  • Sends ACK to broker
  • Broker removes message (or marks as processed)
  • If ACK not received, redeliver
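
Because redelivery is always possible under at-least-once semantics, consumers are usually written to be idempotent. A sketch of a consumer loop that deduplicates by (partition, offset) before acknowledging; fetch_message, process, and ack are stand-ins for whatever client API the broker exposes:

def consume_loop(fetch_message, process, ack, processed_offsets):
    """At-least-once consumption: process, then ack; duplicate redeliveries are skipped."""
    while True:
        msg = fetch_message()               # returns (partition, offset, payload) or None
        if msg is None:
            break
        partition, offset, payload = msg
        if (partition, offset) in processed_offsets:
            ack(partition, offset)          # duplicate redelivery: acknowledge and move on
            continue
        process(payload)                    # side effects should themselves be idempotent
        processed_offsets.add((partition, offset))
        ack(partition, offset)              # a crash before this line means redelivery later

# Usage with an in-memory "broker" that redelivers one message.
queue = [(0, 0, '{"order_id":1}'), (0, 0, '{"order_id":1}'), (0, 1, '{"order_id":2}')]
seen, acked = set(), []
consume_loop(lambda: queue.pop(0) if queue else None,
             process=lambda p: print("processing", p),
             ack=lambda part, off: acked.append((part, off)),
             processed_offsets=seen)
print(acked)   # [(0, 0), (0, 0), (0, 1)]; the second (0, 0) was a duplicate and was skipped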

Learning milestones:

  1. Basic queue → Produce and consume messages
  2. Persistence → Messages survive restarts
  3. Consumer groups → Load balancing works
  4. At-least-once → Guarantee maintained

Project 10: Distributed Cache with Cache-Aside Pattern

  • File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Python, Java, Rust
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Caching / Performance Optimization
  • Software or Tool: Network programming, LRU eviction
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A distributed cache (like Redis) with cache-aside pattern. Supports TTL, eviction policies, and cluster mode.

Why it teaches data-intensive design: Caching is crucial for performance. Understanding cache patterns and eviction strategies teaches you trade-offs in system design.

Core challenges you’ll face:

  • Eviction policies → maps to LRU, LFU, TTL-based
  • Cache invalidation → maps to when to invalidate, cache-aside vs write-through
  • Distributed caching → maps to consistent hashing, replication
  • Cache stampede → maps to preventing thundering herd

Key Concepts:

  • Caching Patterns: “Designing Data-Intensive Applications” Ch. 3
  • Cache-Aside: Martin Fowler’s blog
  • Eviction Policies: Redis documentation

Difficulty: Intermediate Time estimate: 2-3 weeks Prerequisites: Understanding of hashing, basic distributed systems

Real world outcome:

# Start cache cluster
$ ./cache-server --port=6379 --cluster
Cache cluster started with 3 nodes

# Set with TTL
$ ./cache-cli set user:1 "Alice" --ttl=3600
OK

# Get
$ ./cache-cli get user:1
Alice

# Cache-aside pattern
$ ./cache-cli get user:999
(nil)

# Application fetches from database, then caches
$ ./cache-cli set user:999 "Bob" --ttl=3600
OK

# Eviction when full (LRU)
$ ./cache-cli set key:1000 "value"
OK (evicted key:1)

# Cluster mode
$ ./cache-cli set key:1 "value" --cluster
Stored on node2 (hash: 0x3a7f...)

Implementation Hints:

Cache-aside pattern:

1. Application checks cache
2. If miss: fetch from database
3. Store in cache for future requests
4. On write: update database, invalidate cache
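
These four steps translate almost directly into code. A sketch of the read and write halves of cache-aside, with a dict standing in for the cache and db_load / db_save as placeholder database calls:

import time

cache = {}   # key -> (value, expires_at)

def cache_get(key, db_load, ttl=3600):
    """Read path: check the cache, fall back to the database, then populate the cache."""
    entry = cache.get(key)
    if entry and entry[1] > time.time():
        return entry[0]                          # cache hit
    value = db_load(key)                         # cache miss: go to the database
    if value is not None:
        cache[key] = (value, time.time() + ttl)  # store for future reads
    return value

def cache_write(key, value, db_save):
    """Write path: update the database, then invalidate (not update) the cache."""
    db_save(key, value)
    cache.pop(key, None)

# Usage with a dict pretending to be the database.
db = {"user:1": "Alice"}
print(cache_get("user:1", db.get))      # miss: loads from db, caches
print(cache_get("user:1", db.get))      # hit
cache_write("user:1", "Alicia", db.__setitem__)
print(cache_get("user:1", db.get))      # miss again after invalidation: "Alicia"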

Eviction policies:

  • LRU: Evict least recently used
  • LFU: Evict least frequently used
  • TTL: Evict expired entries
  • Random: Simple but less effective

Cache stampede prevention:

  • Lock on cache miss (only one fetches from DB)
  • Background refresh before expiration
  • Probabilistic early expiration

Learning milestones:

  1. Basic cache → Get/set operations work
  2. Eviction → LRU eviction works correctly
  3. TTL → Expiration works
  4. Cache-aside → Pattern implemented correctly

Project 11: MapReduce Implementation

  • File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, Java
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Batch Processing / Distributed Computing
  • Software or Tool: Distributed systems, file I/O
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A MapReduce implementation that processes large datasets in parallel across multiple workers. Handles task distribution, failure recovery, and result aggregation.

Why it teaches data-intensive design: MapReduce is the foundation of batch processing. Understanding it teaches you how to process large datasets efficiently.

Core challenges you’ll face:

  • Task distribution → maps to scheduling, load balancing
  • Failure handling → maps to retry, speculative execution
  • Data locality → maps to processing data where it’s stored
  • Shuffle phase → maps to sorting and grouping intermediate results

Key Concepts:

  • MapReduce: “Designing Data-Intensive Applications” Ch. 10
  • Batch Processing: “Designing Data-Intensive Applications” Ch. 10
  • Distributed Computing: Google MapReduce paper

Difficulty: Advanced Time estimate: 3-4 weeks Prerequisites: Understanding of distributed systems, file I/O

Real world outcome:

# Start MapReduce cluster
$ ./mapreduce-master --port=8080
Master started

$ ./mapreduce-worker --master=localhost:8080 --port=8081
Worker 1 started

$ ./mapreduce-worker --master=localhost:8080 --port=8082
Worker 2 started

# Submit job: word count
$ ./mapreduce-cli submit \
    --input=/data/books/*.txt \
    --output=/results/wordcount \
    --mapper=word_count_map.py \
    --reducer=word_count_reduce.py

Job submitted: job-12345

# Monitor progress
$ ./mapreduce-cli status job-12345
Job: job-12345
Status: RUNNING
Map tasks: 10/10 complete
Reduce tasks: 1/3 complete
Progress: 65%

# Job completes
$ ./mapreduce-cli status job-12345
Status: COMPLETED
Output: /results/wordcount/part-00000
        /results/wordcount/part-00001
        /results/wordcount/part-00002

# View results
$ cat /results/wordcount/part-00000
the 15234
and 8921
of 6543
...

Implementation Hints:

MapReduce phases:

1. Map Phase:
   - Master splits input into chunks
   - Assigns chunks to workers
   - Workers process chunks, emit (key, value) pairs

2. Shuffle Phase:
   - Sort and group by key
   - Partition to reducers

3. Reduce Phase:
   - Reducers process grouped values
   - Write final output
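
To make the three phases concrete, here is a single-process word-count sketch in which map, shuffle, and reduce are just three functions over in-memory data; the distributed version runs the same steps partitioned across workers:

from collections import defaultdict

def map_phase(chunk):
    """Map: emit (word, 1) for every word in the input chunk."""
    return [(word, 1) for word in chunk.split()]

def shuffle_phase(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: sum the counts for one word."""
    return key, sum(values)

chunks = ["the quick brown fox", "the lazy dog the end"]
intermediate = [pair for chunk in chunks for pair in map_phase(chunk)]
grouped = shuffle_phase(intermediate)
result = dict(reduce_phase(k, v) for k, v in grouped.items())
print(result["the"])   # 3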

Failure handling:

  • Worker failure: Reassign tasks to other workers
  • Task failure: Retry with exponential backoff
  • Speculative execution: Run slow tasks on multiple workers

Data locality:

  • Prefer workers that have input data locally
  • Reduces network transfer

Learning milestones:

  1. Basic MapReduce → Simple jobs complete
  2. Failure handling → Handles worker failures
  3. Shuffle → Intermediate data sorted correctly
  4. Scalability → Handles large datasets

Project 12: Stream Processing with Windowing

  • File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Java, Scala
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Stream Processing / Real-Time Analytics
  • Software or Tool: Message queue, time-based processing
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A stream processing engine that processes continuous data streams with windowing (tumbling, sliding, session windows) and aggregations.

Why it teaches data-intensive design: Stream processing enables real-time analytics. Understanding windowing and watermarks teaches you how to handle out-of-order events.

Core challenges you’ll face:

  • Event time vs processing time → maps to handling late events
  • Windowing → maps to tumbling, sliding, session windows
  • Watermarks → maps to determining when window is complete
  • State management → maps to maintaining window state

Key Concepts:

  • Stream Processing: “Designing Data-Intensive Applications” Ch. 11
  • Windowing: Flink documentation on windows
  • Watermarks: “Streaming Systems” by Akidau et al.

Difficulty: Advanced Time estimate: 3-4 weeks Prerequisites: Understanding of message queues, time-based processing

Real world outcome:

# Start stream processor
$ ./stream-processor --port=8080
Stream processor started

# Create stream
$ ./stream-cli create-stream clicks --source=kafka://localhost:9092/topic:clicks
Stream created

# Define processing: count clicks per minute
$ ./stream-cli query clicks \
    --window=tumbling,1min \
    --aggregate=count \
    --group-by=user_id

Query started: query-123

# Process events
Event: {user_id: 1, timestamp: 10:00:15, page: /home}
Event: {user_id: 2, timestamp: 10:00:23, page: /products}
Event: {user_id: 1, timestamp: 10:00:45, page: /cart}
Window [10:00:00-10:01:00]: user_1=2, user_2=1

# Sliding window: count clicks in last 5 minutes, every minute
$ ./stream-cli query clicks \
    --window=sliding,5min,1min \
    --aggregate=count

Implementation Hints:

Window types:

  • Tumbling: Fixed-size, non-overlapping (e.g., every 1 minute)
  • Sliding: Fixed-size, overlapping (e.g., last 5 minutes, every 1 minute)
  • Session: Dynamic based on gaps (e.g., 10-minute inactivity)

Watermarks:

  • Indicate when events are “complete”
  • Allow processing of late events (within allowed lateness)
  • Example: watermark = max(event_time) - 1 minute
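
A sketch of a tumbling-window counter keyed by user, with a simple "max event time minus allowed lateness" watermark deciding when a window may be emitted (event shapes follow the example above; timestamps are plain seconds for brevity):

from collections import defaultdict

WINDOW = 60          # tumbling window size in seconds
LATENESS = 60        # allowed lateness used for the watermark

def window_start(ts):
    return ts - (ts % WINDOW)

def run(events):
    """Count events per (window, user); emit a window once the watermark passes it."""
    counts = defaultdict(lambda: defaultdict(int))   # window_start -> user -> count
    max_event_time = 0
    emitted = []
    for ev in events:                                # events may arrive out of order
        counts[window_start(ev["timestamp"])][ev["user_id"]] += 1
        max_event_time = max(max_event_time, ev["timestamp"])
        watermark = max_event_time - LATENESS
        for ws in [w for w in counts if w + WINDOW <= watermark]:
            emitted.append((ws, dict(counts.pop(ws))))  # window complete: emit its result
    return emitted, counts

events = [{"user_id": 1, "timestamp": 15}, {"user_id": 2, "timestamp": 23},
          {"user_id": 1, "timestamp": 45}, {"user_id": 1, "timestamp": 150}]
done, pending = run(events)
print(done)      # [(0, {1: 2, 2: 1})]; the [0, 60) window closed once the watermark passed it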

State management:

  • Store window state in memory or external store
  • Handle state recovery on failures

Learning milestones:

  1. Basic streaming → Process events continuously
  2. Windowing → Tumbling windows work
  3. Watermarks → Handle late events correctly
  4. Aggregations → Compute windowed aggregates

Project 13: Database Connection Pool

  • File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Java, Python
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Connection Management / Resource Pooling
  • Software or Tool: Database connections, concurrency
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A connection pool that manages database connections efficiently, handling connection lifecycle, health checks, and load balancing.

Why it teaches data-intensive design: Connection pooling is essential for performance. Understanding it teaches you resource management and concurrency patterns.

Core challenges you’ll face:

  • Connection lifecycle → maps to create, reuse, destroy
  • Health checks → maps to detecting dead connections
  • Load balancing → maps to distributing connections across servers
  • Concurrency → maps to thread-safe access

Key Concepts:

  • Connection Pooling: Database connection pool patterns
  • Resource Management: “Designing Data-Intensive Applications” Ch. 3
  • Concurrency: Go concurrency patterns

Difficulty: Intermediate Time estimate: 1-2 weeks Prerequisites: Understanding of concurrency, database connections

Real world outcome:

# Start connection pool
$ ./pool-manager \
    --max-connections=100 \
    --min-connections=10 \
    --max-idle-time=300s \
    --health-check-interval=30s \
    --servers=db1:5432,db2:5432,db3:5432

Pool started with 3 servers

# Get connection
$ ./pool-client get-connection
Connection acquired: conn-123 (server: db1)

# Execute query
$ ./pool-client execute conn-123 "SELECT * FROM users LIMIT 10"
[Results...]

# Return connection
$ ./pool-client return-connection conn-123
Connection returned to pool

# Pool statistics
$ ./pool-client stats
Active connections: 45
Idle connections: 55
Total connections: 100
Server distribution:
  db1: 33 connections
  db2: 33 connections
  db3: 34 connections

Implementation Hints:

Connection pool structure:

Pool:
  Available: [conn1, conn2, conn3, ...]
  In-use: {request-id: conn}
  Servers: [db1, db2, db3]

Operations:

  • Acquire: Get connection from pool (create if needed)
  • Release: Return connection to pool
  • Health check: Periodically test connections
  • Evict: Remove idle connections after timeout
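
A thread-safe sketch of acquire/release built on a bounded queue.Queue, with a hypothetical make_conn factory standing in for the real database driver (health checks and idle eviction are omitted):

import queue
import threading

class ConnectionPool:
    def __init__(self, make_conn, max_size=10):
        self._make_conn = make_conn
        self._available = queue.Queue(maxsize=max_size)
        self._created = 0
        self._lock = threading.Lock()
        self._max_size = max_size

    def acquire(self, timeout=5.0):
        """Reuse an idle connection, or create one if we are under the limit."""
        try:
            return self._available.get_nowait()
        except queue.Empty:
            pass
        with self._lock:
            if self._created < self._max_size:
                self._created += 1
                return self._make_conn()
        return self._available.get(timeout=timeout)   # otherwise wait for a release

    def release(self, conn):
        """Return a connection for reuse (a real pool would health-check it first)."""
        self._available.put(conn)

# Usage with a fake connection factory.
pool = ConnectionPool(make_conn=lambda: object(), max_size=2)
c1, c2 = pool.acquire(), pool.acquire()
pool.release(c1)
print(pool.acquire() is c1)   # True: the released connection is reused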

Load balancing:

  • Round-robin across servers
  • Least connections
  • Weighted distribution

Learning milestones:

  1. Basic pooling → Connections reused correctly
  2. Health checks → Dead connections detected
  3. Load balancing → Connections distributed evenly
  4. Concurrency → Thread-safe operations

Project 14: Change Data Capture (CDC) System

  • File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Java, Go
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Data Integration / ETL
  • Software or Tool: Database replication logs, message queue
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A CDC system that captures database changes (inserts, updates, deletes) and streams them to downstream systems (data warehouse, search index, cache).

Why it teaches data-intensive design: CDC enables real-time data integration. Understanding it teaches you how to keep multiple systems in sync.

Core challenges you’ll face:

  • Reading replication logs → maps to parsing binary logs, WAL
  • Change ordering → maps to maintaining order across transactions
  • Schema evolution → maps to handling schema changes
  • Downstream delivery → maps to reliable delivery, idempotency

Key Concepts:

  • Change Data Capture: “Designing Data-Intensive Applications” Ch. 11
  • Database Replication: “Designing Data-Intensive Applications” Ch. 5
  • Event Streaming: Kafka Connect documentation

Difficulty: Advanced Time estimate: 3-4 weeks Prerequisites: Understanding of database internals, message queues

Real world outcome:

# Start CDC connector
$ ./cdc-connector \
    --source=postgresql://localhost/db \
    --sink=kafka://localhost:9092/topic:changes \
    --tables=users,orders,products

CDC connector started

# Monitor changes
$ ./cdc-cli tail changes
{"table":"users","op":"INSERT","id":1,"data":{"name":"Alice","email":"alice@example.com"}}
{"table":"users","op":"UPDATE","id":1,"data":{"name":"Alice","email":"alice.new@example.com"}}
{"table":"orders","op":"INSERT","id":100,"data":{"user_id":1,"amount":99.99}}
{"table":"users","op":"DELETE","id":1}

# Sync to Elasticsearch
$ ./cdc-cli sync changes --target=elasticsearch://localhost:9200
Syncing changes to Elasticsearch...
Indexed 1000 documents

# Sync to data warehouse
$ ./cdc-cli sync changes --target=warehouse://s3://bucket/changes
Syncing changes to warehouse...

Implementation Hints:

CDC approaches:

  1. Replication logs: Read database’s replication log (MySQL binlog, PostgreSQL WAL)
  2. Triggers: Database triggers on changes
  3. Polling: Periodically query for changes (less efficient)

Change format:

{
  "table": "users",
  "operation": "INSERT|UPDATE|DELETE",
  "timestamp": "2024-01-15T10:00:00Z",
  "before": {...},  // For UPDATE/DELETE
  "after": {...}   // For INSERT/UPDATE
}

Ordering:

  • Use transaction ID or LSN (Log Sequence Number)
  • Process changes in order within transaction
  • Handle out-of-order events
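
A sketch of that ordering rule on the consumer side: apply changes in LSN order and make the apply idempotent, so a replay after a failure cannot corrupt the sink. The change dicts follow the format above, with illustrative id and lsn fields added; sink is an in-memory stand-in for a search index or warehouse table:

def apply_changes(changes, sink, applied_lsn=0):
    """Apply CDC events to a key/value sink, skipping anything at or below applied_lsn."""
    for change in sorted(changes, key=lambda c: c["lsn"]):   # enforce log order
        if change["lsn"] <= applied_lsn:
            continue                                          # already applied: idempotent replay
        key = (change["table"], change["id"])
        if change["operation"] == "DELETE":
            sink.pop(key, None)
        else:                                                 # INSERT or UPDATE
            sink[key] = change["after"]
        applied_lsn = change["lsn"]
    return applied_lsn

sink = {}
changes = [
    {"lsn": 2, "table": "users", "id": 1, "operation": "UPDATE",
     "after": {"name": "Alice", "email": "alice.new@example.com"}},
    {"lsn": 1, "table": "users", "id": 1, "operation": "INSERT",
     "after": {"name": "Alice", "email": "alice@example.com"}},
]
last = apply_changes(changes, sink)
print(sink[("users", 1)]["email"], last)   # alice.new@example.com 2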

Learning milestones:

  1. Capture changes → Read replication logs correctly
  2. Stream changes → Publish to message queue
  3. Ordering → Maintain correct order
  4. Downstream sync → Update multiple systems

Project 15: Distributed Lock Service

  • File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Java, Python
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Distributed Coordination / Locks
  • Software or Tool: Consensus algorithm, TTL
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A distributed lock service (like etcd or ZooKeeper) that provides mutual exclusion across multiple processes/nodes.

Why it teaches data-intensive design: Distributed locks are essential for coordination. Understanding them teaches you lease management, failure handling, and consensus.

Core challenges you’ll face:

  • Lease management → maps to TTL, renewal, expiration
  • Failure handling → maps to detecting dead nodes, lock release
  • Fairness → maps to FIFO ordering, preventing starvation
  • Consensus → maps to agreement on lock ownership

Key Concepts:

  • Distributed Locks: “Designing Data-Intensive Applications” Ch. 9
  • Leases: etcd documentation
  • Coordination: ZooKeeper documentation

Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Understanding of consensus, TTL mechanisms

Real world outcome:

# Start lock service
$ ./lock-service --port=2379
Lock service started

# Acquire lock
$ ./lock-client acquire my-lock --ttl=30s
Lock acquired: lock-id-12345
TTL: 30s

# Renew lock
$ ./lock-client renew lock-id-12345 --ttl=30s
Lock renewed, new TTL: 30s

# Another client tries to acquire
$ ./lock-client acquire my-lock --ttl=30s
Lock unavailable, waiting...

# First client releases
$ ./lock-client release lock-id-12345
Lock released

# Second client acquires
Lock acquired: lock-id-67890

# Lock expires if not renewed
[After 30s]
Lock expired: lock-id-67890

Implementation Hints:

Lock service structure:

Lock: "my-lock"
  Owner: client-12345
  TTL: 30s
  Created: 2024-01-15T10:00:00Z
  Expires: 2024-01-15T10:00:30Z

Operations:

  • Acquire: Create lock if not exists, set TTL
  • Renew: Extend TTL before expiration
  • Release: Delete lock
  • Watch: Get notified when lock released

Failure handling:

  • If client crashes, lock expires automatically
  • Use heartbeats to detect client failures
  • Implement fencing tokens for safety
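
Fencing tokens in a nutshell: the lock service hands out a monotonically increasing number with each grant, and the protected resource rejects any request carrying a token older than one it has already seen. A minimal sketch:

class LockService:
    def __init__(self):
        self._token = 0
    def acquire(self, name):
        """Grant the lock and return a monotonically increasing fencing token."""
        self._token += 1
        return self._token

class ProtectedStore:
    def __init__(self):
        self._highest_token = 0
        self.data = {}
    def write(self, token, key, value):
        """Reject writes that carry a stale fencing token."""
        if token < self._highest_token:
            raise PermissionError(f"stale token {token}")
        self._highest_token = token
        self.data[key] = value

locks, store = LockService(), ProtectedStore()
t1 = locks.acquire("my-lock")        # client A
t2 = locks.acquire("my-lock")        # client B, after A's lease expired
store.write(t2, "file", "from B")    # accepted
try:
    store.write(t1, "file", "from A")    # A wakes up late with an expired lease
except PermissionError as e:
    print(e)                             # stale token 1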

Learning milestones:

  1. Basic locking → Acquire/release works
  2. TTL → Locks expire correctly
  3. Renewal → TTL extended on renewal
  4. Failure handling → Crashed clients release locks

Project 16: Time-Series Database

  • File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Time-Series Data / Specialized Databases
  • Software or Tool: Compression, columnar storage
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A time-series database optimized for storing and querying time-stamped data (metrics, events). Supports compression, downsampling, and efficient range queries.

Why it teaches data-intensive design: Time-series data has unique characteristics. Understanding specialized storage teaches you how to optimize for specific access patterns.

Core challenges you’ll face:

  • Compression → maps to delta encoding, run-length encoding
  • Downsampling → maps to aggregating data over time
  • Retention policies → maps to data lifecycle management
  • Query optimization → maps to time-range queries, aggregation

Key Concepts:

  • Time-Series Databases: InfluxDB documentation
  • Compression: “Designing Data-Intensive Applications” Ch. 3
  • Columnar Storage: “Designing Data-Intensive Applications” Ch. 3

Difficulty: Advanced Time estimate: 3-4 weeks Prerequisites: Understanding of compression, database internals

Real world outcome:

# Start time-series database
$ ./tsdb --data-dir=/data/tsdb
Time-series database started

# Create database
$ ./tsdb-cli create-db metrics
Database created

# Write metrics
$ ./tsdb-cli write metrics \
    cpu.usage,host=server1 value=45.2 1642234567 \
    cpu.usage,host=server1 value=46.1 1642234577 \
    cpu.usage,host=server1 value=44.8 1642234587

# Query: average CPU usage last hour
$ ./tsdb-cli query metrics \
    "SELECT mean(value) FROM cpu.usage WHERE host='server1' AND time > now() - 1h"
mean(value)
45.37

# Downsampling: create 1-minute averages
$ ./tsdb-cli create-downsample metrics cpu.usage --interval=1m --function=mean
Downsample created

# Retention: delete data older than 30 days
$ ./tsdb-cli set-retention metrics --duration=30d
Retention policy set

Implementation Hints:

Time-series storage:

Metric: cpu.usage,host=server1
  Timestamps: [1642234567, 1642234577, 1642234587, ...]
  Values: [45.2, 46.1, 44.8, ...]

Compression:

  • Delta encoding: Store differences between timestamps
  • Run-length encoding: Compress repeated values
  • Gorilla compression: Efficient for floating-point
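
Delta encoding is the easiest of these to see in code: regularly spaced timestamps collapse into a start value plus a run of small, repetitive deltas that compress well. A sketch (Gorilla goes further, encoding deltas of deltas and XORed float bits):

def delta_encode(timestamps):
    """Store the first timestamp plus the difference to each following one."""
    if not timestamps:
        return []
    encoded = [timestamps[0]]
    for prev, curr in zip(timestamps, timestamps[1:]):
        encoded.append(curr - prev)          # for 10-second scrapes this is always 10
    return encoded

def delta_decode(encoded):
    values = [encoded[0]]
    for delta in encoded[1:]:
        values.append(values[-1] + delta)
    return values

ts = [1642234567, 1642234577, 1642234587, 1642234597]
enc = delta_encode(ts)
print(enc)                        # [1642234567, 10, 10, 10]: small, repetitive, easy to compress
assert delta_decode(enc) == ts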

Downsampling:

  • Pre-aggregate data at different resolutions
  • 1-second → 1-minute → 1-hour → 1-day
  • Reduces storage and query time

Learning milestones:

  1. Basic storage → Store and retrieve time-series data
  2. Compression → Data compressed efficiently
  3. Downsampling → Aggregations computed correctly
  4. Query optimization → Range queries fast

Project 17: Graph Database with Gremlin Query Language

  • File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Java, Go
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Graph Databases / Graph Algorithms
  • Software or Tool: Graph algorithms, query language
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A graph database that stores nodes and edges, with a query language (Gremlin-like) for traversing relationships and finding paths.

Why it teaches data-intensive design: Graph databases excel at relationship queries. Understanding them teaches you when to use graph vs relational models.

Core challenges you’ll face:

  • Graph storage → maps to adjacency lists, edge properties
  • Traversal algorithms → maps to BFS, DFS, shortest path
  • Query language → maps to Gremlin-style syntax
  • Indexing → maps to indexing nodes and edges

Key Concepts:

  • Graph Databases: “Designing Data-Intensive Applications” Ch. 2
  • Graph Algorithms: “Introduction to Algorithms” Ch. 22
  • Gremlin: Apache TinkerPop documentation

Difficulty: Expert Time estimate: 4-6 weeks Prerequisites: Understanding of graphs, algorithms, query parsing

Real world outcome:

# Start graph database
$ ./graphdb --port=8182
Graph database started

# Create graph
$ ./gremlin-cli
gremlin> graph = Graph.open()
gremlin> g = graph.traversal()

# Add vertices
gremlin> alice = g.addV('person').property('name', 'Alice').next()
gremlin> bob = g.addV('person').property('name', 'Bob').next()
gremlin> company = g.addV('company').property('name', 'Acme').next()

# Add edges
gremlin> g.addE('knows').from(alice).to(bob).property('since', 2020).next()
gremlin> g.addE('worksFor').from(alice).to(company).next()

# Query: find Alice's friends
gremlin> g.V().has('name', 'Alice').out('knows').values('name')
==>Bob

# Query: find people who work at same company as Alice
gremlin> g.V().has('name', 'Alice').out('worksFor').in('worksFor').values('name')
==>Alice

# Shortest path
gremlin> g.V().has('name', 'Alice').repeat(out()).until(has('name', 'Bob')).path()
==>[v[1], v[2]]

Implementation Hints:

Graph storage:

Vertices:
  v1: {label: 'person', properties: {name: 'Alice'}}
  v2: {label: 'person', properties: {name: 'Bob'}}

Edges:
  e1: {from: v1, to: v2, label: 'knows', properties: {since: 2020}}

Traversal:

  • Start at vertex
  • Follow edges (out/in/both)
  • Filter by properties
  • Return results
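
A sketch of the traversal core: vertices and edges kept in dicts, an adjacency list for fast out() hops, and a BFS that doubles as an unweighted shortest-path search (data follows the Alice/Bob example above):

from collections import deque

vertices = {1: {"label": "person", "name": "Alice"},
            2: {"label": "person", "name": "Bob"},
            3: {"label": "company", "name": "Acme"}}
edges = [{"from": 1, "to": 2, "label": "knows", "since": 2020},
         {"from": 1, "to": 3, "label": "worksFor"}]

# Adjacency list: vertex id -> list of (edge label, target vertex id).
out_edges = {}
for e in edges:
    out_edges.setdefault(e["from"], []).append((e["label"], e["to"]))

def out(vertex_id, label=None):
    """One traversal step: follow outgoing edges, optionally filtered by label."""
    return [t for lbl, t in out_edges.get(vertex_id, []) if label is None or lbl == label]

def shortest_path(src, dst):
    """BFS gives the shortest path in an unweighted graph."""
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in out(path[-1]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print([vertices[v]["name"] for v in out(1, "knows")])   # ['Bob']
print(shortest_path(1, 2))                              # [1, 2]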

Algorithms:

  • BFS: Level-order traversal
  • DFS: Depth-first traversal
  • Shortest path: Dijkstra’s algorithm
  • PageRank: Centrality measure

Learning milestones:

  1. Basic graph → Store vertices and edges
  2. Traversal → Navigate relationships
  3. Query language → Gremlin-like queries work
  4. Algorithms → Shortest path, etc. work

Project 18: Complete Data-Intensive Application

  • File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
  • Main Programming Language: Python/Go
  • Alternative Programming Languages: Java, TypeScript
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 5: Master
  • Knowledge Area: Complete System Design / Integration
  • Software or Tool: All previous projects
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A complete data-intensive application (e.g., social media platform, e-commerce site, analytics platform) that integrates all concepts: replication, partitioning, caching, message queues, batch and stream processing.

Why it teaches data-intensive design: Building a complete system integrates all concepts. You’ll make real-world trade-offs and see how components interact.

Core challenges you’ll face:

  • System architecture → maps to component selection, data flow
  • Scalability → maps to handling growth, performance optimization
  • Reliability → maps to failure handling, monitoring
  • Consistency → maps to choosing appropriate consistency models

Difficulty: Master Time estimate: 2-3 months Prerequisites: All previous projects

Real world outcome:

# Complete social media platform

# Architecture:
# - PostgreSQL (primary database) with read replicas
# - Redis (caching layer)
# - Kafka (event streaming)
# - Elasticsearch (search)
# - Spark (batch analytics)
# - Flink (real-time analytics)

# Features:
# - User posts, comments, likes
# - News feed (personalized, real-time)
# - Search
# - Analytics dashboard
# - Recommendations

# Deploy
$ ./deploy.sh
Starting services...
PostgreSQL: Running (1 primary, 3 replicas)
Redis: Running (cluster mode, 6 nodes)
Kafka: Running (3 brokers)
Elasticsearch: Running (5 nodes)
Spark: Running
Flink: Running

# Load test
$ ./load-test --users=1000000 --posts-per-user=100
Generating load...
Throughput: 50,000 requests/sec
P95 latency: 45ms
P99 latency: 120ms

# Monitor
$ ./monitor dashboard
[Real-time metrics dashboard showing all components]

Implementation Hints:

System components:

  1. API Layer: REST/GraphQL APIs
  2. Database: PostgreSQL with replication
  3. Cache: Redis for hot data
  4. Search: Elasticsearch for full-text search
  5. Events: Kafka for event streaming
  6. Batch: Spark for analytics
  7. Stream: Flink for real-time processing
  8. Monitoring: Metrics, logs, tracing

Design decisions:

  • When to use cache vs database?
  • When to use eventual consistency?
  • How to handle failures?
  • How to scale each component?

Learning milestones:

  1. Basic system → Core features work
  2. Scalability → Handles load
  3. Reliability → Handles failures
  4. Monitoring → Observability in place

Project Comparison Table

| # | Project | Difficulty | Time | Key Skill | Fun |
|---|---------|------------|------|-----------|-----|
| 1 | B-Tree KV Store | ⭐⭐⭐ | 3-4 weeks | Storage Engines | ⭐⭐⭐⭐ |
| 2 | LSM-Tree Storage | ⭐⭐⭐ | 3-4 weeks | Write Optimization | ⭐⭐⭐⭐ |
| 3 | Query Engine | ⭐⭐⭐⭐ | 4-6 weeks | Query Processing | ⭐⭐⭐⭐ |
| 4 | Leader-Follower Replication | ⭐⭐⭐ | 3-4 weeks | Replication | ⭐⭐⭐ |
| 5 | Consistent Hashing | ⭐⭐ | 2-3 weeks | Partitioning | ⭐⭐⭐ |
| 6 | Two-Phase Commit | ⭐⭐⭐ | 2-3 weeks | Distributed Transactions | ⭐⭐⭐ |
| 7 | Raft Consensus | ⭐⭐⭐⭐ | 4-6 weeks | Consensus | ⭐⭐⭐⭐⭐ |
| 8 | Event Sourcing | ⭐⭐⭐ | 3-4 weeks | Event-Driven | ⭐⭐⭐⭐ |
| 9 | Message Queue | ⭐⭐⭐ | 3-4 weeks | Async Processing | ⭐⭐⭐ |
| 10 | Distributed Cache | ⭐⭐ | 2-3 weeks | Caching | ⭐⭐⭐ |
| 11 | MapReduce | ⭐⭐⭐ | 3-4 weeks | Batch Processing | ⭐⭐⭐⭐ |
| 12 | Stream Processing | ⭐⭐⭐ | 3-4 weeks | Real-Time | ⭐⭐⭐⭐ |
| 13 | Connection Pool | ⭐⭐ | 1-2 weeks | Resource Management | ⭐⭐ |
| 14 | Change Data Capture | ⭐⭐⭐ | 3-4 weeks | Data Integration | ⭐⭐⭐ |
| 15 | Distributed Locks | ⭐⭐⭐ | 2-3 weeks | Coordination | ⭐⭐⭐ |
| 16 | Time-Series DB | ⭐⭐⭐ | 3-4 weeks | Specialized Storage | ⭐⭐⭐⭐ |
| 17 | Graph Database | ⭐⭐⭐⭐ | 4-6 weeks | Graph Algorithms | ⭐⭐⭐⭐ |
| 18 | Complete Application | ⭐⭐⭐⭐⭐ | 2-3 months | System Design | ⭐⭐⭐⭐⭐ |

Phase 1: Storage Foundations (6-8 weeks)

Understand how data is stored and retrieved:

  1. Project 1: B-Tree KV Store - Read-optimized storage
  2. Project 2: LSM-Tree Storage - Write-optimized storage
  3. Project 13: Connection Pool - Resource management

Phase 2: Replication and Partitioning (6-8 weeks)

Learn how to scale beyond single machines:

  1. Project 4: Leader-Follower Replication - High availability
  2. Project 5: Consistent Hashing - Horizontal scaling
  3. Project 7: Raft Consensus - Distributed consensus

Phase 3: Transactions and Consistency (4-6 weeks)

Understand consistency models:

  1. Project 6: Two-Phase Commit - Distributed transactions
  2. Project 15: Distributed Locks - Coordination

Phase 4: Processing Patterns (8-10 weeks)

Learn batch and stream processing:

  1. Project 8: Event Sourcing - Event-driven architecture
  2. Project 9: Message Queue - Asynchronous processing
  3. Project 11: MapReduce - Batch processing
  4. Project 12: Stream Processing - Real-time processing

Phase 5: Specialized Systems (6-8 weeks)

Explore specialized databases:

  1. Project 3: Query Engine - SQL processing
  2. Project 14: Change Data Capture - Data integration
  3. Project 16: Time-Series DB - Time-series optimization
  4. Project 17: Graph Database - Graph algorithms

Phase 6: Integration (2-3 months)

Build complete systems:

  1. Project 10: Distributed Cache - Performance optimization
  2. Project 18: Complete Application - Full system integration

Summary

| # | Project | Main Language |
|---|---------|---------------|
| 1 | B-Tree KV Store | C |
| 2 | LSM-Tree Storage | C |
| 3 | Query Engine | C++ |
| 4 | Leader-Follower Replication | Go |
| 5 | Consistent Hashing | Python |
| 6 | Two-Phase Commit | Go |
| 7 | Raft Consensus | Go |
| 8 | Event Sourcing | Python |
| 9 | Message Queue | Go |
| 10 | Distributed Cache | Go |
| 11 | MapReduce | Python |
| 12 | Stream Processing | Python |
| 13 | Connection Pool | Go |
| 14 | Change Data Capture | Python |
| 15 | Distributed Locks | Go |
| 16 | Time-Series DB | Go |
| 17 | Graph Database | Python |
| 18 | Complete Application | Python/Go |

Resources

Essential Books

  • “Designing Data-Intensive Applications” by Martin Kleppmann - The definitive guide
  • “Database Systems: The Complete Book” by Garcia-Molina et al. - Deep database internals
  • “Streaming Systems” by Tyler Akidau et al. - Stream processing
  • “Building Microservices” by Sam Newman - System design

Key Papers

  • “In Search of an Understandable Consensus Algorithm” (Raft) - Diego Ongaro and John Ousterhout
  • “MapReduce: Simplified Data Processing on Large Clusters” - Jeffrey Dean and Sanjay Ghemawat (Google)
  • “Dynamo: Amazon’s Highly Available Key-value Store” - Amazon

Tools and Technologies

  • PostgreSQL: Relational database
  • Redis: In-memory cache
  • Kafka: Message queue
  • Spark: Batch processing
  • Flink: Stream processing
  • etcd: Distributed coordination

Total Estimated Time: 12-18 months of dedicated study

After completion: You’ll be able to design, build, and scale data-intensive applications that handle millions of users and petabytes of data. These skills are essential for backend engineering, data engineering, platform engineering, and systems architecture roles at top tech companies.