Learn Data-Intensive Applications Design: From Zero to Systems Architect
Goal: Deeply understand how to design, build, and scale data-intensive applications. Master the principles of scalability, reliability, consistency, and efficiency that underpin modern distributed systems.
Why Learn Data-Intensive Applications Design?
Every modern application is data-intensive. Whether it’s a social media platform handling billions of posts, an e-commerce site processing millions of transactions, or a real-time analytics system ingesting terabytes of events, understanding how to design these systems is essential.
After completing these projects, you will:
- Design scalable systems that handle millions of users and petabytes of data
- Understand trade-offs between consistency, availability, and partition tolerance (CAP theorem)
- Implement replication, partitioning, and consensus algorithms
- Build reliable systems that handle failures gracefully
- Design efficient data models and storage engines
- Create real-time and batch processing pipelines
- Make informed decisions about database selection and architecture
Core Concept Analysis
The Data-Intensive Application Stack
┌────────────────────────────────────────────────────────┐
│                   APPLICATION LAYER                    │
│         (Business Logic, APIs, User Interface)         │
└────────────────────────────────────────────────────────┘
                            │
                            ▼
┌────────────────────────────────────────────────────────┐
│                     CACHING LAYER                      │
│                (Redis, Memcached, CDN)                 │
└────────────────────────────────────────────────────────┘
                            │
                            ▼
┌────────────────────────────────────────────────────────┐
│                 DATA PROCESSING LAYER                  │
│     (Batch: Spark, Hadoop | Stream: Kafka, Flink)      │
└────────────────────────────────────────────────────────┘
                            │
                            ▼
┌────────────────────────────────────────────────────────┐
│                     DATABASE LAYER                     │
│  (Relational: PostgreSQL | NoSQL: MongoDB, Cassandra)  │
└────────────────────────────────────────────────────────┘
                            │
                            ▼
┌────────────────────────────────────────────────────────┐
│                     STORAGE LAYER                      │
│          (B-Trees, LSM-Trees, Object Storage)          │
└────────────────────────────────────────────────────────┘
Fundamental Concepts
- Reliability: System continues working correctly even when things go wrong
  - Hardware faults, software errors, human errors
  - Fault tolerance vs fault prevention
- Scalability: System can handle growth in load
  - Vertical scaling (bigger machines) vs horizontal scaling (more machines)
  - Load parameters: requests/sec, read/write ratio, data volume
- Maintainability: System is easy to operate and modify
  - Operability, simplicity, evolvability
- Data Models: How data is represented
  - Relational (SQL), Document (JSON), Graph, Key-Value
- Storage Engines: How data is stored and retrieved
  - B-Trees (read-optimized), LSM-Trees (write-optimized)
- Replication: Keeping copies of data on multiple nodes
  - Leader-follower, multi-leader, leaderless
  - Consistency models: eventual, strong, causal
- Partitioning: Splitting data across multiple nodes
  - Range partitioning, hash partitioning
  - Rebalancing strategies
- Transactions: Grouping operations that must succeed or fail together
  - ACID properties
  - Isolation levels
- Consensus: Getting multiple nodes to agree on something
  - Two-phase commit, Raft, Paxos
- Batch Processing: Processing large volumes of data periodically
  - MapReduce, Spark
- Stream Processing: Processing data continuously as it arrives
  - Event sourcing, CQRS, Kafka
Project List
The following 18 projects will teach you data-intensive application design from fundamentals to advanced distributed systems.
Project 1: Key-Value Store with B-Tree Index
- File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go, Python
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Storage Engines / Indexing
- Software or Tool: File I/O, B-Tree implementation
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A persistent key-value store that uses a B-Tree index for efficient lookups. It should support insert, get, delete, and range queries with durability guarantees.
Why it teaches data-intensive design: B-Trees are the foundation of most relational databases. Building one teaches you how databases organize data on disk, handle page I/O, and maintain indexes efficiently.
Core challenges you’ll face:
- Implementing B-Tree structure → maps to understanding balanced tree algorithms
- Handling page I/O → maps to disk vs memory trade-offs
- Managing splits and merges → maps to maintaining tree balance
- Ensuring durability → maps to write-ahead logging or fsync
Key Concepts:
- B-Tree Structure: “Designing Data-Intensive Applications” Ch. 3 - Kleppmann
- Page-Oriented Storage: “Database Systems: The Complete Book” Ch. 13
- Durability: “Designing Data-Intensive Applications” Ch. 7
Difficulty: Advanced. Time estimate: 3-4 weeks. Prerequisites: C programming, understanding of trees and file I/O.
Real world outcome:
$ ./kvstore create mydb
Database created: mydb
$ ./kvstore set mydb user:123 '{"name":"Alice","email":"alice@example.com"}'
OK
$ ./kvstore get mydb user:123
{"name":"Alice","email":"alice@example.com"}
$ ./kvstore range mydb user:100 user:200
user:123: {"name":"Alice","email":"alice@example.com"}
user:145: {"name":"Bob","email":"bob@example.com"}
$ ./kvstore delete mydb user:123
OK
Implementation Hints:
B-Tree structure:
- Each node contains multiple keys and pointers
- Internal nodes have keys and child pointers
- Leaf nodes have keys and values
- All leaves at same depth (balanced)
- Typical node size: 4KB (one disk page)
Key design decisions:
- How many keys per node? (branching factor)
- How to handle variable-length values?
- When to split nodes? (typically when full)
- How to ensure atomic writes? (WAL or copy-on-write)
Questions to guide implementation:
- How do you find a key efficiently? (binary search within node)
- How do you handle concurrent access? (locking or MVCC)
- How do you recover from crashes? (WAL replay)
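To make the hints above concrete, here is a minimal in-memory sketch (in Python, for brevity) of a node layout with binary search inside the node and a simple leaf split. Paging, persistence, deletion, and parent updates are left out, and all names are illustrative:

```python
import bisect

ORDER = 4  # max keys per node; real engines size nodes to fill a 4KB disk page

class BTreeNode:
    def __init__(self, leaf=True):
        self.leaf = leaf
        self.keys = []        # sorted keys
        self.values = []      # leaf nodes only: values parallel to keys
        self.children = []    # internal nodes only: len(children) == len(keys) + 1

def search(node, key):
    """Binary search within each node, descending until a leaf is reached."""
    i = bisect.bisect_left(node.keys, key)
    if node.leaf:
        return node.values[i] if i < len(node.keys) and node.keys[i] == key else None
    if i < len(node.keys) and node.keys[i] == key:
        i += 1                # keys equal to a separator live in the right subtree here
    return search(node.children[i], key)

def insert_into_leaf(leaf, key, value):
    """Insert into a leaf; return (new_right_sibling, separator_key) if it splits."""
    i = bisect.bisect_left(leaf.keys, key)
    leaf.keys.insert(i, key)
    leaf.values.insert(i, value)
    if len(leaf.keys) <= ORDER:
        return None
    mid = len(leaf.keys) // 2
    right = BTreeNode(leaf=True)
    right.keys, leaf.keys = leaf.keys[mid:], leaf.keys[:mid]
    right.values, leaf.values = leaf.values[mid:], leaf.values[:mid]
    return right, right.keys[0]   # the parent must insert this separator key
```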
Learning milestones:
- Basic B-Tree operations → Insert, search, delete work correctly
- Persistence → Data survives restarts
- Range queries → Efficiently scan key ranges
- Concurrent access → Multiple readers/writers
Project 2: Log-Structured Merge Tree (LSM-Tree) Storage Engine
- File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Storage Engines / Write-Optimized Structures
- Software or Tool: File I/O, Bloom filters
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A key-value store using LSM-Tree architecture (like LevelDB/RocksDB). It should have an in-memory memtable, write-ahead log, and multiple sorted string tables (SSTables) that get compacted.
Why it teaches data-intensive design: LSM-Trees are write-optimized, making them perfect for high-write workloads. Understanding them teaches you trade-offs between read and write performance.
Core challenges you’ll face:
- Memtable management → maps to in-memory sorted structure
- SSTable creation → maps to sorted file format
- Compaction strategies → maps to merging sorted files
- Bloom filters → maps to probabilistic data structures for fast lookups
Key Concepts:
- LSM-Tree Architecture: “Designing Data-Intensive Applications” Ch. 3
- SSTable Format: LevelDB documentation
- Compaction: RocksDB tuning guide
Difficulty: Advanced. Time estimate: 3-4 weeks. Prerequisites: Project 1, understanding of sorting algorithms.
Real world outcome:
$ ./lsmstore create mydb
LSM-Tree database created
$ ./lsmstore set mydb key1 value1
OK
# High write throughput
$ for i in {1..100000}; do
./lsmstore set mydb "key$i" "value$i"
done
# Completes in seconds (much faster than B-Tree for writes)
$ ./lsmstore get mydb key50000
value50000
# Compaction happens automatically
$ ./lsmstore stats mydb
Level 0: 5 SSTables
Level 1: 2 SSTables (compacted)
Total keys: 100000
Implementation Hints:
LSM-Tree structure:
Write Path:
Write → Memtable (in-memory sorted tree)
When memtable full → Flush to SSTable (Level 0)
Background compaction merges SSTables
Read Path:
Check memtable → Check Level 0 SSTables → Check Level 1+ SSTables
Use Bloom filter to skip SSTables that don't contain key
Compaction strategies:
- Size-tiered: Merge SSTables of similar size
- Leveled: Each level has fixed size, merge into next level
- Universal (tiered): the size-tiered variant used by RocksDB, with a bound on the number of sorted runs
Key questions:
- How big should the memtable be before flushing?
- How many levels should you have?
- When to trigger compaction? (based on size or ratio)
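A toy single-process sketch (Python for brevity, illustrative names) of the write and read paths above: a dict-based memtable flushed into sorted runs, a newest-first read path, and a full-merge compaction. The WAL, tombstones, Bloom filters, and leveling are omitted:

```python
MEMTABLE_LIMIT = 4  # flush threshold; real engines use megabytes

class TinyLSM:
    def __init__(self):
        self.memtable = {}     # in-memory write buffer
        self.sstables = []     # newest first; each is a sorted list of (key, value)

    def set(self, key, value):
        self.memtable[key] = value           # a real engine appends to a WAL first
        if len(self.memtable) >= MEMTABLE_LIMIT:
            self._flush()

    def _flush(self):
        sstable = sorted(self.memtable.items())   # SSTable = immutable sorted run
        self.sstables.insert(0, sstable)
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:                  # check the memtable first
            return self.memtable[key]
        for table in self.sstables:               # then SSTables, newest to oldest
            for k, v in table:                    # real code: binary search + Bloom filter
                if k == key:
                    return v
        return None

    def compact(self):
        """Merge all runs into one, keeping only the newest value per key."""
        merged = {}
        for table in reversed(self.sstables):     # oldest first, so newer overwrites
            merged.update(table)
        self.sstables = [sorted(merged.items())]
```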
Learning milestones:
- Write path works → Writes go to memtable and WAL
- Read path works → Can retrieve values correctly
- Compaction works → SSTables merge correctly
- Performance → Write throughput exceeds B-Tree
Project 3: Relational Database Query Engine
- File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
- Main Programming Language: C++
- Alternative Programming Languages: Rust, Java
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 4: Expert
- Knowledge Area: Query Processing / Relational Algebra
- Software or Tool: SQL parser, query optimizer
- Main Book: “Database Systems: The Complete Book” by Garcia-Molina et al.
What you’ll build: A query engine that parses SQL, builds an execution plan, and executes queries with operators like scan, filter, join, sort, and aggregate.
Why it teaches data-intensive design: Understanding how databases execute queries is essential for writing efficient SQL and designing schemas.
Core challenges you’ll face:
- SQL parsing → maps to converting text to relational algebra
- Query optimization → maps to choosing efficient execution plans
- Join algorithms → maps to nested loop, hash join, sort-merge join
- Operator pipelining → maps to streaming vs materialization
Key Concepts:
- Relational Algebra: “Database Systems” Ch. 2
- Query Optimization: “Database Systems” Ch. 16
- Join Algorithms: “Database Systems” Ch. 15
Difficulty: Expert. Time estimate: 4-6 weeks. Prerequisites: Understanding of SQL, algorithms and data structures.
Real world outcome:
-- Create table
CREATE TABLE users (id INT, name VARCHAR(100), email VARCHAR(100));
CREATE TABLE orders (id INT, user_id INT, amount DECIMAL(10,2));
-- Insert data
INSERT INTO users VALUES (1, 'Alice', 'alice@example.com');
INSERT INTO orders VALUES (1, 1, 99.99);
-- Query
SELECT u.name, SUM(o.amount) as total
FROM users u
JOIN orders o ON u.id = o.user_id
GROUP BY u.id, u.name
HAVING SUM(o.amount) > 50;
-- Output:
-- name | total
-- Alice | 99.99
Implementation Hints:
Query execution pipeline:
SQL → Parser → AST → Optimizer → Execution Plan → Executor → Results
Key operators to implement:
- TableScan: Read all rows from table
- Filter: Apply WHERE conditions
- Project: Select specific columns
- Join: Combine two tables (nested loop, hash, sort-merge)
- Sort: Order rows
- Aggregate: GROUP BY operations
- Limit: Top N results
Optimization techniques:
- Push predicates down (filter early)
- Choose join order (smallest table first)
- Use indexes when available
- Estimate costs (I/O, CPU)
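A sketch of the pipeline above in the iterator ("Volcano") style, where each operator pulls rows from its child; Python generators stand in for real operators and rows are plain dicts (illustrative names only):

```python
def table_scan(rows):
    """Leaf operator: yield every row of an in-memory 'table'."""
    for row in rows:
        yield row

def filter_op(child, predicate):
    """WHERE: pass through only rows matching the predicate."""
    for row in child:
        if predicate(row):
            yield row

def project(child, columns):
    """SELECT list: keep only the requested columns."""
    for row in child:
        yield {c: row[c] for c in columns}

def hash_join(left, right, left_key, right_key):
    """Hash join: build a hash table on the left input, probe with the right."""
    table = {}
    for row in left:
        table.setdefault(row[left_key], []).append(row)
    for row in right:
        for match in table.get(row[right_key], []):
            yield {**match, **row}

users = [{"id": 1, "name": "Alice"}]
orders = [{"order_id": 1, "user_id": 1, "amount": 99.99}]

plan = project(
    hash_join(table_scan(users), table_scan(orders), "id", "user_id"),
    ["name", "amount"],
)
print(list(plan))   # [{'name': 'Alice', 'amount': 99.99}]
```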
Learning milestones:
- Parse SQL → Convert to execution plan
- Execute simple queries → SELECT, WHERE work
- Implement joins → Multiple algorithms work
- Optimize queries → Choose efficient plans
Project 4: Leader-Follower Replication System
- File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
- Main Programming Language: Go
- Alternative Programming Languages: Python, Java, Rust
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: Replication / High Availability
- Software or Tool: Network programming, consensus
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A replicated key-value store with one leader and multiple followers. Writes go to the leader and are replicated to followers. Handles leader failure and automatic failover.
Why it teaches data-intensive design: Replication is fundamental for availability and read scalability. Understanding leader-follower replication teaches you consistency models and failure handling.
Core challenges you’ll face:
- Synchronous vs asynchronous replication → maps to consistency vs latency trade-offs
- Replication lag → maps to eventual consistency issues
- Leader election → maps to distributed consensus
- Split-brain prevention → maps to quorum-based decisions
Key Concepts:
- Replication Models: “Designing Data-Intensive Applications” Ch. 5
- Consistency Guarantees: “Designing Data-Intensive Applications” Ch. 9
- Leader Election: Raft paper
Difficulty: Advanced. Time estimate: 3-4 weeks. Prerequisites: Network programming, understanding of distributed systems basics.
Real world outcome:
# Start leader
$ ./replicated-kv --node-id=1 --port=8001 --leader
Leader started on port 8001
# Start followers
$ ./replicated-kv --node-id=2 --port=8002 --follower --leader-addr=localhost:8001
Follower 2 connected to leader
$ ./replicated-kv --node-id=3 --port=8003 --follower --leader-addr=localhost:8001
Follower 3 connected to leader
# Write to leader
$ curl "http://localhost:8001/set?key=user:1&value=Alice"
OK
# Read from any node
$ curl http://localhost:8002/get?key=user:1
Alice
# Leader fails
$ kill <leader-pid>
# Automatic failover
[Follower 2] Leader failed, starting election...
[Follower 2] Elected as new leader
[Follower 3] Following new leader: node-2
Implementation Hints:
Replication protocol:
- Client sends write to leader
- Leader writes to local log
- Leader sends log entries to followers
- Followers acknowledge receipt
- Leader commits when majority acknowledge
- Leader sends commit message to followers
Failure scenarios:
- Leader fails: Followers detect via heartbeat timeout, elect new leader
- Follower fails: Leader continues, follower catches up on reconnect
- Network partition: Split-brain prevention via quorum
Consistency models:
- Synchronous: Wait for all followers (strong consistency, high latency)
- Asynchronous: Don’t wait (low latency, eventual consistency)
- Semi-synchronous: Wait for majority (balance)
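A single-process simulation (Python, illustrative names) of the semi-synchronous option above: the leader appends to its log, pushes the entry to followers, and reports the write committed once a majority of the replica set has it. Networking, follower catch-up, and failover are not modeled:

```python
class Follower:
    def __init__(self, name):
        self.name, self.log, self.alive = name, [], True

    def append_entries(self, entry):
        if not self.alive:
            return False          # simulate an unreachable follower
        self.log.append(entry)
        return True               # acknowledgment

class Leader:
    def __init__(self, followers):
        self.log, self.followers, self.commit_index = [], followers, -1

    def write(self, key, value):
        entry = (key, value)
        self.log.append(entry)
        acks = 1 + sum(f.append_entries(entry) for f in self.followers)
        if acks > (len(self.followers) + 1) // 2:   # majority, counting the leader
            self.commit_index = len(self.log) - 1
            return "committed"
        return "pending"          # not enough replicas reachable

followers = [Follower("f1"), Follower("f2")]
leader = Leader(followers)
followers[1].alive = False
print(leader.write("user:1", "Alice"))   # committed: 2 of 3 nodes have the entry
```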
Learning milestones:
- Basic replication → Writes replicate to followers
- Read from followers → Can read from any node
- Leader failure → Automatic failover works
- Consistency → Understand trade-offs
Project 5: Consistent Hashing and Data Partitioning
- File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
- Main Programming Language: Python
- Alternative Programming Languages: Go, Java
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Partitioning / Sharding
- Software or Tool: Hash functions, distributed systems
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A distributed key-value store that partitions data across multiple nodes using consistent hashing. Handles node additions and removals with minimal data movement.
Why it teaches data-intensive design: Partitioning is essential for scaling beyond single-machine limits. Consistent hashing minimizes data movement when nodes join/leave.
Core challenges you’ll face:
- Hash function selection → maps to uniform distribution
- Virtual nodes → maps to load balancing
- Rebalancing → maps to minimizing data movement
- Handling node failures → maps to replication across partitions
Key Concepts:
- Partitioning Strategies: “Designing Data-Intensive Applications” Ch. 6
- Consistent Hashing: “Designing Data-Intensive Applications” Ch. 6
- Rebalancing: DynamoDB partitioning documentation
Difficulty: Intermediate. Time estimate: 2-3 weeks. Prerequisites: Understanding of hashing, basic distributed systems.
Real world outcome:
# Start cluster with 3 nodes
$ ./partitioned-kv --nodes=node1,node2,node3
Cluster started with 3 nodes
# Add data
$ ./partitioned-kv set user:1 "Alice"
Stored on node2 (hash: 0x3a7f...)
$ ./partitioned-kv set user:2 "Bob"
Stored on node1 (hash: 0x8c2d...)
# Add new node
$ ./partitioned-kv add-node node4
Node added. Rebalancing...
Moved 25% of keys to node4
# Remove node
$ ./partitioned-kv remove-node node2
Node removed. Rebalancing...
Moved keys to node1 and node3
# Query still works
$ ./partitioned-kv get user:1
Alice
Implementation Hints:
Consistent hashing ring:
0x0000
│
├─ node1 (0x4000)
│
├─ node2 (0x8000)
│
├─ node3 (0xC000)
│
0xFFFF
Key "user:1" hashes to 0x5a7f → belongs to node2
Virtual nodes:
- Each physical node has multiple virtual nodes on ring
- Improves load distribution
- Example: node1 → [vnode1-1, vnode1-2, vnode1-3]
Rebalancing:
- When node added: only move keys from immediate neighbors
- When node removed: redistribute its keys to neighbors
- Use replication factor of 3: store key on node and next 2 nodes
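A compact sketch of the ring with virtual nodes, using hashlib and bisect (Python, illustrative class name):

```python
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self.ring = []            # sorted list of (hash, node)
        for node in nodes:
            self.add_node(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.vnodes):                 # virtual nodes smooth the load
            self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    def remove_node(self, node):
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def lookup(self, key):
        """Walk clockwise: the first vnode at or after the key's hash owns it."""
        h = self._hash(key)
        i = bisect.bisect(self.ring, (h, ""))
        return self.ring[i % len(self.ring)][1]      # wrap around the ring

ring = HashRing(["node1", "node2", "node3"])
print(ring.lookup("user:1"))      # e.g. node2
ring.add_node("node4")            # only keys falling before node4's vnodes move
```

Adding a node inserts only its virtual nodes into the sorted ring, which is why only a small fraction of keys changes owner.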
Learning milestones:
- Basic partitioning → Keys distributed across nodes
- Consistent hashing → Ring structure works
- Node addition → Minimal data movement
- Load balancing → Even distribution with virtual nodes
Project 6: Two-Phase Commit Protocol
- File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
- Main Programming Language: Go
- Alternative Programming Languages: Java, Python
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Distributed Transactions / Consensus
- Software or Tool: Network programming, state machines
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A distributed transaction coordinator that implements two-phase commit (2PC) to ensure atomicity across multiple nodes.
Why it teaches data-intensive design: 2PC is fundamental for distributed transactions. Understanding it teaches you the challenges of achieving consensus in distributed systems.
Core challenges you’ll face:
- Coordinator failure → maps to blocking problem
- Participant failure → maps to recovery procedures
- Network partitions → maps to availability vs consistency
- Timeout handling → maps to detecting failures
Key Concepts:
- Two-Phase Commit: “Designing Data-Intensive Applications” Ch. 9
- Distributed Transactions: “Database Systems” Ch. 19
- Consensus Algorithms: Raft paper
Difficulty: Advanced. Time estimate: 2-3 weeks. Prerequisites: Understanding of transactions, network programming.
Real world outcome:
# Start coordinator
$ ./txn-coordinator --port=9000
Coordinator started
# Start participants (databases)
$ ./participant --id=db1 --coordinator=localhost:9000
$ ./participant --id=db2 --coordinator=localhost:9000
$ ./participant --id=db3 --coordinator=localhost:9000
# Execute distributed transaction
$ ./client --coordinator=localhost:9000 \
--txn="UPDATE db1.users SET balance=balance-100 WHERE id=1; \
UPDATE db2.accounts SET balance=balance+100 WHERE id=2; \
UPDATE db3.logs SET action='transfer' WHERE id=3;"
[Coordinator] Phase 1: Prepare
[db1] Prepared (vote: YES)
[db2] Prepared (vote: YES)
[db3] Prepared (vote: YES)
[Coordinator] Phase 2: Commit
[db1] Committed
[db2] Committed
[db3] Committed
Transaction completed successfully
Implementation Hints:
2PC Protocol:
Phase 1 (Prepare):
1. Coordinator sends PREPARE to all participants
2. Each participant votes YES (ready) or NO (abort)
3. Coordinator collects votes
Phase 2 (Commit/Abort):
4. If all YES: Coordinator sends COMMIT
5. If any NO: Coordinator sends ABORT
6. Participants commit or abort and acknowledge
Failure scenarios:
- Participant fails before voting: Coordinator times out, aborts
- Participant fails after voting YES: On recovery it asks the coordinator for the outcome, then commits or aborts accordingly
- Coordinator fails: Participants block (blocking problem)
Improvements:
- Three-phase commit: Reduces blocking (assuming no network partitions) but is more complex
- Saga pattern: Alternative for long-running transactions
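A single-process simulation of the two phases above (Python, illustrative names); timeouts, crash recovery, and durable participant logs are reduced to comments:

```python
class Participant:
    def __init__(self, name, will_vote_yes=True):
        self.name, self.will_vote_yes, self.state = name, will_vote_yes, "INIT"

    def prepare(self, txn):
        # A real participant durably logs the pending change and its vote here.
        self.state = "PREPARED" if self.will_vote_yes else "ABORTED"
        return self.will_vote_yes

    def commit(self):
        self.state = "COMMITTED"

    def abort(self):
        self.state = "ABORTED"

def two_phase_commit(coordinator_log, participants, txn):
    votes = [p.prepare(txn) for p in participants]        # phase 1: prepare
    decision = "COMMIT" if all(votes) else "ABORT"
    coordinator_log.append(decision)                      # the decision must be durable
    for p in participants:                                # phase 2: commit or abort
        p.commit() if decision == "COMMIT" else p.abort()
    return decision

log = []
dbs = [Participant("db1"), Participant("db2"), Participant("db3", will_vote_yes=False)]
print(two_phase_commit(log, dbs, "transfer 100"))   # ABORT, because db3 voted NO
```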
Learning milestones:
- Basic 2PC → Simple transactions commit/abort correctly
- Failure handling → Handles participant failures
- Coordinator failure → Understand blocking problem
- Recovery → Participants recover correctly
Project 7: Raft Consensus Algorithm Implementation
- File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, Java
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 4: Expert
- Knowledge Area: Consensus / Distributed Systems
- Software or Tool: Network programming, state machines
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A Raft consensus implementation that maintains a replicated log across multiple nodes. Handles leader election, log replication, and safety guarantees.
Why it teaches data-intensive design: Raft is used in production systems (etcd, Consul). Understanding it teaches you how distributed systems achieve consensus and maintain consistency.
Core challenges you’ll face:
- Leader election → maps to majority voting, term numbers
- Log replication → maps to matching logs, commit index
- Safety properties → maps to election restriction, log matching
- Network partitions → maps to split-brain prevention
Key Concepts:
- Raft Algorithm: “In Search of an Understandable Consensus Algorithm” - Diego Ongaro
- Consensus: “Designing Data-Intensive Applications” Ch. 9
- Distributed Logs: Raft paper
Difficulty: Expert. Time estimate: 4-6 weeks. Prerequisites: Strong understanding of distributed systems, network programming.
Real world outcome:
# Start 5-node Raft cluster
$ ./raft-node --id=1 --cluster=1,2,3,4,5 --port=8001
$ ./raft-node --id=2 --cluster=1,2,3,4,5 --port=8002
$ ./raft-node --id=3 --cluster=1,2,3,4,5 --port=8003
$ ./raft-node --id=4 --cluster=1,2,3,4,5 --port=8004
$ ./raft-node --id=5 --cluster=1,2,3,4,5 --port=8005
# Cluster elects leader
[Node 3] Elected leader for term 1
# Append entry to log
$ curl http://localhost:8003/append -d '{"command":"SET key1 value1"}'
{"success":true,"index":1}
# Entry replicated to majority
[Node 1] Log entry 1 committed
[Node 2] Log entry 1 committed
[Node 3] Log entry 1 committed
# Leader fails
$ kill <node-3-pid>
# New leader elected
[Node 1] Elected leader for term 2
# Cluster continues operating
$ curl http://localhost:8001/append -d '{"command":"SET key2 value2"}'
{"success":true,"index":2}
Implementation Hints:
Raft components:
- Leader Election: Nodes vote for leader, need majority
- Log Replication: Leader replicates log entries to followers
- Safety: Election restriction ensures leader has all committed entries
State machine:
Follower ──→ Candidate ──→ Leader
    ↑            ↓            ↓
    └────────────┴────────────┘
Key invariants:
- Election Safety: At most one leader per term
- Leader Append-Only: Leader never overwrites entries
- Log Matching: If two logs contain an entry with the same index and term, the logs are identical up through that index
- Leader Completeness: An entry committed in a given term is present in the logs of the leaders of all later terms
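A sketch of one election round under the rules above (Python, illustrative names): RequestVote grants at most one vote per term and only to a candidate whose log is at least as up to date, and the candidate needs a majority of the full cluster. RPCs, timeouts, and log replication are omitted:

```python
class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.current_term = 0
        self.voted_for = None
        self.last_log_index = 0
        self.last_log_term = 0

    def request_vote(self, term, candidate_id, last_log_index, last_log_term):
        """Grant at most one vote per term, and only to an up-to-date candidate."""
        if term < self.current_term:
            return False
        if term > self.current_term:
            self.current_term, self.voted_for = term, None
        up_to_date = (last_log_term, last_log_index) >= (self.last_log_term, self.last_log_index)
        if self.voted_for in (None, candidate_id) and up_to_date:
            self.voted_for = candidate_id
            return True
        return False

def run_election(candidate, peers):
    candidate.current_term += 1
    candidate.voted_for = candidate.node_id          # vote for itself
    votes = 1 + sum(
        p.request_vote(candidate.current_term, candidate.node_id,
                       candidate.last_log_index, candidate.last_log_term)
        for p in peers
    )
    return votes > (len(peers) + 1) // 2             # majority of the full cluster

nodes = [Node(i) for i in range(5)]
print(run_election(nodes[0], nodes[1:]))             # True: wins the election for term 1
```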
Learning milestones:
- Leader election → Cluster elects leader correctly
- Log replication → Entries replicate to followers
- Failure handling → Cluster handles node failures
- Safety → All safety properties maintained
Project 8: Event Sourcing System
- File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
- Main Programming Language: Python
- Alternative Programming Languages: Java, Go, TypeScript
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: Event Sourcing / CQRS
- Software or Tool: Message queue, event store
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: An event-sourced system where all changes are stored as a sequence of events. Support event replay, snapshots, and multiple read models.
Why it teaches data-intensive design: Event sourcing provides audit trails, time travel, and decouples write and read models. It’s a powerful pattern for complex domains.
Core challenges you’ll face:
- Event storage → maps to append-only log, event schema evolution
- Event replay → maps to rebuilding state from events
- Snapshots → maps to optimizing replay performance
- Read models → maps to CQRS pattern, eventual consistency
Key Concepts:
- Event Sourcing: “Designing Data-Intensive Applications” Ch. 11
- CQRS: Martin Fowler’s blog on CQRS
- Event Store: EventStore documentation
Difficulty: Advanced. Time estimate: 3-4 weeks. Prerequisites: Understanding of domain modeling, message queues.
Real world outcome:
# Create event store
$ ./eventstore create myapp
Event store created
# Write events
$ ./eventstore append myapp user:1 \
'{"type":"UserCreated","name":"Alice","email":"alice@example.com"}' \
'{"type":"EmailChanged","email":"alice.new@example.com"}' \
'{"type":"UserDeleted"}'
# Replay events to rebuild state
$ ./eventstore replay myapp user:1
Event 1: UserCreated -> State: {name: Alice, email: alice@example.com}
Event 2: EmailChanged -> State: {name: Alice, email: alice.new@example.com}
Event 3: UserDeleted -> State: {deleted: true}
# Query at specific time
$ ./eventstore query myapp user:1 --at="2024-01-15T10:00:00Z"
{name: Alice, email: alice@example.com}
# Create read model
$ ./eventstore create-read-model myapp users-by-email
Read model created, processing events...
Implementation Hints:
Event store structure:
Stream: user:1
Event 1: UserCreated (timestamp: T1)
Event 2: EmailChanged (timestamp: T2)
Event 3: UserDeleted (timestamp: T3)
Replay algorithm:
def replay(stream_id, up_to=None):
    state = {}
    events = load_events(stream_id, up_to)
    for event in events:
        state = apply_event(state, event)
    return state
Snapshots:
- Periodically save state (e.g., every 100 events)
- Replay from snapshot instead of beginning
- Reduces replay time
Read models:
- Project events into denormalized views
- Update asynchronously
- Optimized for queries
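A minimal in-memory sketch (Python, illustrative names) tying replay and snapshots together: the store snapshots a stream every N events, and replay starts from the latest snapshot instead of the beginning:

```python
class EventStore:
    """Append-only streams with periodic snapshots (in-memory sketch)."""
    SNAPSHOT_EVERY = 100

    def __init__(self):
        self.streams = {}      # stream_id -> list of events
        self.snapshots = {}    # stream_id -> (event_count, state)

    def append(self, stream_id, event):
        events = self.streams.setdefault(stream_id, [])
        events.append(event)
        if len(events) % self.SNAPSHOT_EVERY == 0:
            self.snapshots[stream_id] = (len(events), self.replay(stream_id))

    def replay(self, stream_id):
        start, state = self.snapshots.get(stream_id, (0, {}))
        for event in self.streams.get(stream_id, [])[start:]:
            state = apply_event(dict(state), event)   # apply events after the snapshot
        return state

def apply_event(state, event):
    if event["type"] == "UserCreated":
        state.update(name=event["name"], email=event["email"])
    elif event["type"] == "EmailChanged":
        state["email"] = event["email"]
    elif event["type"] == "UserDeleted":
        state = {"deleted": True}
    return state

store = EventStore()
store.append("user:1", {"type": "UserCreated", "name": "Alice", "email": "alice@example.com"})
store.append("user:1", {"type": "EmailChanged", "email": "alice.new@example.com"})
print(store.replay("user:1"))   # {'name': 'Alice', 'email': 'alice.new@example.com'}
```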
Learning milestones:
- Event storage → Events stored and retrieved
- Event replay → State rebuilt from events
- Snapshots → Replay optimized with snapshots
- Read models → Multiple views of same data
Project 9: Message Queue with At-Least-Once Delivery
- File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
- Main Programming Language: Go
- Alternative Programming Languages: Java, Rust, Python
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: Message Queues / Asynchronous Processing
- Software or Tool: Network programming, persistence
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A message queue that guarantees at-least-once delivery. Supports topics, consumer groups, and message acknowledgments.
Why it teaches data-intensive design: Message queues are essential for decoupling services and handling asynchronous processing. Understanding delivery guarantees is crucial.
Core challenges you’ll face:
- Message persistence → maps to durability guarantees
- Consumer groups → maps to load balancing, parallel processing
- Acknowledgments → maps to at-least-once vs exactly-once
- Message ordering → maps to per-partition ordering
Key Concepts:
- Message Brokers: “Designing Data-Intensive Applications” Ch. 11
- Delivery Guarantees: “Designing Data-Intensive Applications” Ch. 11
- Kafka Architecture: Kafka documentation
Difficulty: Advanced. Time estimate: 3-4 weeks. Prerequisites: Network programming, understanding of queues.
Real world outcome:
# Start message queue
$ ./mq-server --port=9092
Message queue started
# Create topic
$ ./mq-cli create-topic orders --partitions=3 --replication=2
Topic 'orders' created
# Produce messages
$ ./mq-cli produce orders '{"order_id":1,"amount":99.99}'
Message produced: offset=0, partition=0
$ ./mq-cli produce orders '{"order_id":2,"amount":149.99}'
Message produced: offset=0, partition=1
# Consume messages
$ ./mq-cli consume orders --group=processors --from-beginning
Message 1: {"order_id":1,"amount":99.99} (offset=0, partition=0)
Message 2: {"order_id":2,"amount":149.99} (offset=0, partition=1)
# Acknowledge message
$ ./mq-cli ack orders --group=processors --offset=0 --partition=0
Acknowledged
# Consumer group rebalancing
$ ./mq-cli consume orders --group=processors --consumer-id=consumer-2
Assigned partitions: [1, 2]
Implementation Hints:
Message queue structure:
Topic: orders
Partition 0: [msg0, msg1, msg2, ...]
Partition 1: [msg0, msg1, msg2, ...]
Partition 2: [msg0, msg1, msg2, ...]
Consumer groups:
- Multiple consumers in same group share partitions
- Each partition consumed by one consumer in group
- Rebalancing when consumers join/leave
Delivery guarantees:
- At-least-once: Message delivered at least once (may duplicate)
- At-most-once: Message delivered at most once (may lose)
- Exactly-once: Message delivered exactly once (hardest)
Acknowledgments:
- Consumer processes message
- Sends ACK to broker
- Broker removes message (or marks as processed)
- If ACK not received, redeliver
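A toy single-partition broker (Python, illustrative names) showing the loop that makes delivery at-least-once: a message stays in flight until it is acknowledged, and an expired ack timeout causes redelivery:

```python
import time

class Broker:
    ACK_TIMEOUT = 5.0   # seconds before an unacknowledged message is redelivered

    def __init__(self):
        self.messages = []        # append-only log for one partition
        self.next_offset = 0      # next offset handed to a consumer
        self.in_flight = {}       # offset -> time of last delivery

    def produce(self, payload):
        self.messages.append(payload)
        return len(self.messages) - 1          # the message's offset

    def consume(self):
        now = time.time()
        # Redeliver anything whose ack timed out; this is what makes it at-least-once.
        for offset, sent_at in list(self.in_flight.items()):
            if now - sent_at > self.ACK_TIMEOUT:
                self.in_flight[offset] = now
                return offset, self.messages[offset]
        if self.next_offset < len(self.messages):
            offset = self.next_offset
            self.next_offset += 1
            self.in_flight[offset] = now
            return offset, self.messages[offset]
        return None

    def ack(self, offset):
        self.in_flight.pop(offset, None)       # only now is the message considered done

broker = Broker()
broker.produce('{"order_id": 1}')
offset, msg = broker.consume()
broker.ack(offset)                             # skip this and the message comes back
```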
Learning milestones:
- Basic queue → Produce and consume messages
- Persistence → Messages survive restarts
- Consumer groups → Load balancing works
- At-least-once → Guarantee maintained
Project 10: Distributed Cache with Cache-Aside Pattern
- File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
- Main Programming Language: Go
- Alternative Programming Languages: Python, Java, Rust
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Caching / Performance Optimization
- Software or Tool: Network programming, LRU eviction
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A distributed cache (like Redis) with cache-aside pattern. Supports TTL, eviction policies, and cluster mode.
Why it teaches data-intensive design: Caching is crucial for performance. Understanding cache patterns and eviction strategies teaches you trade-offs in system design.
Core challenges you’ll face:
- Eviction policies → maps to LRU, LFU, TTL-based
- Cache invalidation → maps to when to invalidate, cache-aside vs write-through
- Distributed caching → maps to consistent hashing, replication
- Cache stampede → maps to preventing thundering herd
Key Concepts:
- Caching Patterns: “Designing Data-Intensive Applications” Ch. 3
- Cache-Aside: Martin Fowler’s blog
- Eviction Policies: Redis documentation
Difficulty: Intermediate. Time estimate: 2-3 weeks. Prerequisites: Understanding of hashing, basic distributed systems.
Real world outcome:
# Start cache cluster
$ ./cache-server --port=6379 --cluster
Cache cluster started with 3 nodes
# Set with TTL
$ ./cache-cli set user:1 "Alice" --ttl=3600
OK
# Get
$ ./cache-cli get user:1
Alice
# Cache-aside pattern
$ ./cache-cli get user:999
(nil)
# Application fetches from database, then caches
$ ./cache-cli set user:999 "Bob" --ttl=3600
OK
# Eviction when full (LRU)
$ ./cache-cli set key:1000 "value"
OK (evicted key:1)
# Cluster mode
$ ./cache-cli set key:1 "value" --cluster
Stored on node2 (hash: 0x3a7f...)
Implementation Hints:
Cache-aside pattern:
1. Application checks cache
2. If miss: fetch from database
3. Store in cache for future requests
4. On write: update database, invalidate cache
Eviction policies:
- LRU: Evict least recently used
- LFU: Evict least frequently used
- TTL: Evict expired entries
- Random: Simple but less effective
Cache stampede prevention:
- Lock on cache miss (only one fetches from DB)
- Background refresh before expiration
- Probabilistic early expiration
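A sketch of an LRU cache with TTL plus a cache-aside get_or_load that serializes misses per key to avoid a stampede (Python, illustrative names; load_fn stands in for the database query):

```python
import threading
import time
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.data = OrderedDict()      # key -> (value, expires_at)
        self.locks = {}                # key -> lock, to serialize misses per key
        self.guard = threading.Lock()

    def get(self, key):
        item = self.data.get(key)
        if item is None or item[1] < time.time():
            return None                          # miss, or expired by TTL
        self.data.move_to_end(key)               # mark as most recently used
        return item[0]

    def set(self, key, value, ttl=3600):
        self.data[key] = (value, time.time() + ttl)
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)        # evict the least recently used entry

    def get_or_load(self, key, load_fn, ttl=3600):
        value = self.get(key)
        if value is not None:
            return value
        with self.guard:
            lock = self.locks.setdefault(key, threading.Lock())
        with lock:                               # only one caller hits the database
            value = self.get(key)                # re-check after acquiring the lock
            if value is None:
                value = load_fn(key)             # cache-aside: fetch, then populate
                self.set(key, value, ttl)
        return value

cache = LRUCache(capacity=2)
print(cache.get_or_load("user:1", lambda k: "Alice"))   # loaded once, then served from cache
```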
Learning milestones:
- Basic cache → Get/set operations work
- Eviction → LRU eviction works correctly
- TTL → Expiration works
- Cache-aside → Pattern implemented correctly
Project 11: MapReduce Implementation
- File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
- Main Programming Language: Python
- Alternative Programming Languages: Go, Java
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Batch Processing / Distributed Computing
- Software or Tool: Distributed systems, file I/O
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A MapReduce implementation that processes large datasets in parallel across multiple workers. Handles task distribution, failure recovery, and result aggregation.
Why it teaches data-intensive design: MapReduce is the foundation of batch processing. Understanding it teaches you how to process large datasets efficiently.
Core challenges you’ll face:
- Task distribution → maps to scheduling, load balancing
- Failure handling → maps to retry, speculative execution
- Data locality → maps to processing data where it’s stored
- Shuffle phase → maps to sorting and grouping intermediate results
Key Concepts:
- MapReduce: “Designing Data-Intensive Applications” Ch. 10
- Batch Processing: “Designing Data-Intensive Applications” Ch. 10
- Distributed Computing: Google MapReduce paper
Difficulty: Advanced. Time estimate: 3-4 weeks. Prerequisites: Understanding of distributed systems, file I/O.
Real world outcome:
# Start MapReduce cluster
$ ./mapreduce-master --port=8080
Master started
$ ./mapreduce-worker --master=localhost:8080 --port=8081
Worker 1 started
$ ./mapreduce-worker --master=localhost:8080 --port=8082
Worker 2 started
# Submit job: word count
$ ./mapreduce-cli submit \
--input=/data/books/*.txt \
--output=/results/wordcount \
--mapper=word_count_map.py \
--reducer=word_count_reduce.py
Job submitted: job-12345
# Monitor progress
$ ./mapreduce-cli status job-12345
Job: job-12345
Status: RUNNING
Map tasks: 10/10 complete
Reduce tasks: 3/8 complete
Progress: 65%
# Job completes
$ ./mapreduce-cli status job-12345
Status: COMPLETED
Output: /results/wordcount/part-00000
/results/wordcount/part-00001
/results/wordcount/part-00002
# View results
$ cat /results/wordcount/part-00000
the 15234
and 8921
of 6543
...
Implementation Hints:
MapReduce phases:
1. Map Phase:
- Master splits input into chunks
- Assigns chunks to workers
- Workers process chunks, emit (key, value) pairs
2. Shuffle Phase:
- Sort and group by key
- Partition to reducers
3. Reduce Phase:
- Reducers process grouped values
- Write final output
Failure handling:
- Worker failure: Reassign tasks to other workers
- Task failure: Retry with exponential backoff
- Speculative execution: Run slow tasks on multiple workers
Data locality:
- Prefer workers that have input data locally
- Reduces network transfer
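A single-process word count showing the three phases above; a real framework distributes the chunks and partitions the shuffle, but the data flow is the same (Python, illustrative names):

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit (word, 1) for every word in an input chunk."""
    for word in chunk.split():
        yield word.lower(), 1

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the counts for one word."""
    return key, sum(values)

chunks = ["the quick brown fox", "the lazy dog and the fox"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
result = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(result["the"], result["fox"])   # 3 2
```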
Learning milestones:
- Basic MapReduce → Simple jobs complete
- Failure handling → Handles worker failures
- Shuffle → Intermediate data sorted correctly
- Scalability → Handles large datasets
Project 12: Stream Processing with Windowing
- File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
- Main Programming Language: Python
- Alternative Programming Languages: Java, Scala
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: Stream Processing / Real-Time Analytics
- Software or Tool: Message queue, time-based processing
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A stream processing engine that processes continuous data streams with windowing (tumbling, sliding, session windows) and aggregations.
Why it teaches data-intensive design: Stream processing enables real-time analytics. Understanding windowing and watermarks teaches you how to handle out-of-order events.
Core challenges you’ll face:
- Event time vs processing time → maps to handling late events
- Windowing → maps to tumbling, sliding, session windows
- Watermarks → maps to determining when window is complete
- State management → maps to maintaining window state
Key Concepts:
- Stream Processing: “Designing Data-Intensive Applications” Ch. 11
- Windowing: Flink documentation on windows
- Watermarks: “Streaming Systems” by Akidau et al.
Difficulty: Advanced. Time estimate: 3-4 weeks. Prerequisites: Understanding of message queues, time-based processing.
Real world outcome:
# Start stream processor
$ ./stream-processor --port=8080
Stream processor started
# Create stream
$ ./stream-cli create-stream clicks --source=kafka://localhost:9092/topic:clicks
Stream created
# Define processing: count clicks per minute
$ ./stream-cli query clicks \
--window=tumbling,1min \
--aggregate=count \
--group-by=user_id
Query started: query-123
# Process events
Event: {user_id: 1, timestamp: 10:00:15, page: /home}
Event: {user_id: 2, timestamp: 10:00:23, page: /products}
Event: {user_id: 1, timestamp: 10:00:45, page: /cart}
Window [10:00:00-10:01:00]: user_1=2, user_2=1
# Sliding window: count clicks in last 5 minutes, every minute
$ ./stream-cli query clicks \
--window=sliding,5min,1min \
--aggregate=count
Implementation Hints:
Window types:
- Tumbling: Fixed-size, non-overlapping (e.g., every 1 minute)
- Sliding: Fixed-size, overlapping (e.g., last 5 minutes, every 1 minute)
- Session: Dynamic based on gaps (e.g., 10-minute inactivity)
Watermarks:
- Indicate when events are “complete”
- Allow processing of late events (within allowed lateness)
- Example: watermark = max(event_time) - 1 minute
State management:
- Store window state in memory or external store
- Handle state recovery on failures
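A sketch of event-time tumbling-window counting with a watermark that trails the maximum event time by a fixed allowed lateness (Python, illustrative names and constants):

```python
from collections import defaultdict

WINDOW = 60          # tumbling window size in seconds
LATENESS = 60        # watermark = max event time seen - LATENESS

class TumblingCounter:
    def __init__(self):
        self.windows = defaultdict(lambda: defaultdict(int))  # window_start -> user -> count
        self.max_event_time = 0

    def on_event(self, user_id, event_time):
        window_start = event_time - (event_time % WINDOW)
        watermark = self.max_event_time - LATENESS
        if window_start + WINDOW <= watermark:
            return {}     # window already emitted; a real engine would side-output late data
        self.windows[window_start][user_id] += 1
        self.max_event_time = max(self.max_event_time, event_time)
        return self.flush_closed_windows()

    def flush_closed_windows(self):
        """Emit windows ending at or before the watermark: no on-time events can still arrive."""
        watermark = self.max_event_time - LATENESS
        closed = [w for w in self.windows if w + WINDOW <= watermark]
        return {w: dict(self.windows.pop(w)) for w in closed}

counter = TumblingCounter()
counter.on_event(user_id=1, event_time=15)
counter.on_event(user_id=1, event_time=45)
print(counter.on_event(user_id=2, event_time=130))  # {0: {1: 2}} once the watermark passes 60
```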
Learning milestones:
- Basic streaming → Process events continuously
- Windowing → Tumbling windows work
- Watermarks → Handle late events correctly
- Aggregations → Compute windowed aggregates
Project 13: Database Connection Pool
- File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
- Main Programming Language: Go
- Alternative Programming Languages: Java, Python
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Connection Management / Resource Pooling
- Software or Tool: Database connections, concurrency
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A connection pool that manages database connections efficiently, handling connection lifecycle, health checks, and load balancing.
Why it teaches data-intensive design: Connection pooling is essential for performance. Understanding it teaches you resource management and concurrency patterns.
Core challenges you’ll face:
- Connection lifecycle → maps to create, reuse, destroy
- Health checks → maps to detecting dead connections
- Load balancing → maps to distributing connections across servers
- Concurrency → maps to thread-safe access
Key Concepts:
- Connection Pooling: Database connection pool patterns
- Resource Management: “Designing Data-Intensive Applications” Ch. 3
- Concurrency: Go concurrency patterns
Difficulty: Intermediate. Time estimate: 1-2 weeks. Prerequisites: Understanding of concurrency, database connections.
Real world outcome:
# Start connection pool
$ ./pool-manager \
--max-connections=100 \
--min-connections=10 \
--max-idle-time=300s \
--health-check-interval=30s \
--servers=db1:5432,db2:5432,db3:5432
Pool started with 3 servers
# Get connection
$ ./pool-client get-connection
Connection acquired: conn-123 (server: db1)
# Execute query
$ ./pool-client execute conn-123 "SELECT * FROM users LIMIT 10"
[Results...]
# Return connection
$ ./pool-client return-connection conn-123
Connection returned to pool
# Pool statistics
$ ./pool-client stats
Active connections: 45
Idle connections: 55
Total connections: 100
Server distribution:
db1: 33 connections
db2: 33 connections
db3: 34 connections
Implementation Hints:
Connection pool structure:
Pool:
Available: [conn1, conn2, conn3, ...]
In-use: {request-id: conn}
Servers: [db1, db2, db3]
Operations:
- Acquire: Get connection from pool (create if needed)
- Release: Return connection to pool
- Health check: Periodically test connections
- Evict: Remove idle connections after timeout
Load balancing:
- Round-robin across servers
- Least connections
- Weighted distribution
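A thread-safe pool sketch built on queue.Queue (Python, illustrative names; connect_fn stands in for opening a real database connection). Health checks, idle eviction, and per-server balancing are omitted:

```python
import queue
import threading

class ConnectionPool:
    def __init__(self, connect_fn, max_connections=10):
        self.connect_fn = connect_fn
        self.idle = queue.Queue()              # idle connections ready for reuse
        self.size = 0                          # connections created so far
        self.max_connections = max_connections
        self.lock = threading.Lock()

    def acquire(self, timeout=5.0):
        try:
            return self.idle.get_nowait()      # reuse an idle connection if possible
        except queue.Empty:
            pass
        with self.lock:
            if self.size < self.max_connections:
                self.size += 1
                return self.connect_fn()       # grow the pool up to the limit
        return self.idle.get(timeout=timeout)  # otherwise wait for a release (raises on timeout)

    def release(self, conn, healthy=True):
        if healthy:
            self.idle.put(conn)                # back into the pool for reuse
        else:
            with self.lock:
                self.size -= 1                 # drop dead connections; recreate lazily

pool = ConnectionPool(connect_fn=lambda: object(), max_connections=2)
conn = pool.acquire()
pool.release(conn)
```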
Learning milestones:
- Basic pooling → Connections reused correctly
- Health checks → Dead connections detected
- Load balancing → Connections distributed evenly
- Concurrency → Thread-safe operations
Project 14: Change Data Capture (CDC) System
- File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
- Main Programming Language: Python
- Alternative Programming Languages: Java, Go
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: Data Integration / ETL
- Software or Tool: Database replication logs, message queue
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A CDC system that captures database changes (inserts, updates, deletes) and streams them to downstream systems (data warehouse, search index, cache).
Why it teaches data-intensive design: CDC enables real-time data integration. Understanding it teaches you how to keep multiple systems in sync.
Core challenges you’ll face:
- Reading replication logs → maps to parsing binary logs, WAL
- Change ordering → maps to maintaining order across transactions
- Schema evolution → maps to handling schema changes
- Downstream delivery → maps to reliable delivery, idempotency
Key Concepts:
- Change Data Capture: “Designing Data-Intensive Applications” Ch. 11
- Database Replication: “Designing Data-Intensive Applications” Ch. 5
- Event Streaming: Kafka Connect documentation
Difficulty: Advanced. Time estimate: 3-4 weeks. Prerequisites: Understanding of database internals, message queues.
Real world outcome:
# Start CDC connector
$ ./cdc-connector \
--source=postgresql://localhost/db \
--sink=kafka://localhost:9092/topic:changes \
--tables=users,orders,products
CDC connector started
# Monitor changes
$ ./cdc-cli tail changes
{"table":"users","op":"INSERT","id":1,"data":{"name":"Alice","email":"alice@example.com"}}
{"table":"users","op":"UPDATE","id":1,"data":{"name":"Alice","email":"alice.new@example.com"}}
{"table":"orders","op":"INSERT","id":100,"data":{"user_id":1,"amount":99.99}}
{"table":"users","op":"DELETE","id":1}
# Sync to Elasticsearch
$ ./cdc-cli sync changes --target=elasticsearch://localhost:9200
Syncing changes to Elasticsearch...
Indexed 1000 documents
# Sync to data warehouse
$ ./cdc-cli sync changes --target=warehouse://s3://bucket/changes
Syncing changes to warehouse...
Implementation Hints:
CDC approaches:
- Replication logs: Read database’s replication log (MySQL binlog, PostgreSQL WAL)
- Triggers: Database triggers on changes
- Polling: Periodically query for changes (less efficient)
Change format:
{
  "table": "users",
  "operation": "INSERT|UPDATE|DELETE",
  "timestamp": "2024-01-15T10:00:00Z",
  "before": {...},   // for UPDATE/DELETE
  "after": {...}     // for INSERT/UPDATE
}
Ordering:
- Use transaction ID or LSN (Log Sequence Number)
- Process changes in order within transaction
- Handle out-of-order events
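Log-based capture is engine-specific, so here is the simpler polling approach from the list above as a runnable sketch (Python with sqlite3, illustrative schema): changes are read in order of a monotonically increasing change id standing in for the LSN, and the caller persists its position only after downstream delivery:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE changes (lsn INTEGER PRIMARY KEY, tbl TEXT, op TEXT, data TEXT)")
conn.execute("INSERT INTO changes (tbl, op, data) VALUES (?, ?, ?)",
             ("users", "INSERT", json.dumps({"id": 1, "name": "Alice"})))
conn.execute("INSERT INTO changes (tbl, op, data) VALUES (?, ?, ?)",
             ("users", "UPDATE", json.dumps({"id": 1, "email": "alice.new@example.com"})))

def poll_changes(last_lsn):
    """Return change events after last_lsn, ordered by the LSN-like change id."""
    rows = conn.execute(
        "SELECT lsn, tbl, op, data FROM changes WHERE lsn > ? ORDER BY lsn", (last_lsn,)
    ).fetchall()
    events = [{"lsn": lsn, "table": tbl, "operation": op, "after": json.loads(data)}
              for lsn, tbl, op, data in rows]
    return events, (rows[-1][0] if rows else last_lsn)

events, position = poll_changes(last_lsn=0)
for event in events:
    print(event)        # deliver downstream, then durably store `position`
```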
Learning milestones:
- Capture changes → Read replication logs correctly
- Stream changes → Publish to message queue
- Ordering → Maintain correct order
- Downstream sync → Update multiple systems
Project 15: Distributed Lock Service
- File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
- Main Programming Language: Go
- Alternative Programming Languages: Java, Python
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: Distributed Coordination / Locks
- Software or Tool: Consensus algorithm, TTL
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A distributed lock service (like etcd or ZooKeeper) that provides mutual exclusion across multiple processes/nodes.
Why it teaches data-intensive design: Distributed locks are essential for coordination. Understanding them teaches you lease management, failure handling, and consensus.
Core challenges you’ll face:
- Lease management → maps to TTL, renewal, expiration
- Failure handling → maps to detecting dead nodes, lock release
- Fairness → maps to FIFO ordering, preventing starvation
- Consensus → maps to agreement on lock ownership
Key Concepts:
- Distributed Locks: “Designing Data-Intensive Applications” Ch. 9
- Leases: etcd documentation
- Coordination: ZooKeeper documentation
Difficulty: Advanced. Time estimate: 2-3 weeks. Prerequisites: Understanding of consensus, TTL mechanisms.
Real world outcome:
# Start lock service
$ ./lock-service --port=2379
Lock service started
# Acquire lock
$ ./lock-client acquire my-lock --ttl=30s
Lock acquired: lock-id-12345
TTL: 30s
# Renew lock
$ ./lock-client renew lock-id-12345 --ttl=30s
Lock renewed, new TTL: 30s
# Another client tries to acquire
$ ./lock-client acquire my-lock --ttl=30s
Lock unavailable, waiting...
# First client releases
$ ./lock-client release lock-id-12345
Lock released
# Second client acquires
Lock acquired: lock-id-67890
# Lock expires if not renewed
[After 30s]
Lock expired: lock-id-67890
Implementation Hints:
Lock service structure:
Lock: "my-lock"
Owner: client-12345
TTL: 30s
Created: 2024-01-15T10:00:00Z
Expires: 2024-01-15T10:00:30Z
Operations:
- Acquire: Create lock if not exists, set TTL
- Renew: Extend TTL before expiration
- Release: Delete lock
- Watch: Get notified when lock released
Failure handling:
- If client crashes, lock expires automatically
- Use heartbeats to detect client failures
- Implement fencing tokens for safety
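An in-memory sketch of the lease logic above (Python, illustrative names): acquire sets an expiry, renew extends it only while the lease is still live, and every successful acquire hands out a monotonically increasing fencing token. A real service replicates this state via consensus:

```python
import time

class LockService:
    def __init__(self):
        self.locks = {}          # name -> {"owner", "expires_at", "token"}
        self.next_token = 0      # fencing token: strictly increasing across acquisitions

    def acquire(self, name, owner, ttl=30):
        lock = self.locks.get(name)
        now = time.time()
        if lock and lock["expires_at"] > now:
            return None                          # held by someone else and not yet expired
        self.next_token += 1
        self.locks[name] = {"owner": owner, "expires_at": now + ttl, "token": self.next_token}
        return self.next_token                   # pass this token to the protected resource

    def renew(self, name, owner, ttl=30):
        lock = self.locks.get(name)
        if lock and lock["owner"] == owner and lock["expires_at"] > time.time():
            lock["expires_at"] = time.time() + ttl
            return True
        return False                             # too late: the lease already expired

    def release(self, name, owner):
        lock = self.locks.get(name)
        if lock and lock["owner"] == owner:
            del self.locks[name]

svc = LockService()
token = svc.acquire("my-lock", owner="client-1", ttl=30)
print(token)          # 1; a second acquire returns None until the lease expires
```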
Learning milestones:
- Basic locking → Acquire/release works
- TTL → Locks expire correctly
- Renewal → TTL extended on renewal
- Failure handling → Crashed clients release locks
Project 16: Time-Series Database
- File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, C++
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: Time-Series Data / Specialized Databases
- Software or Tool: Compression, columnar storage
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A time-series database optimized for storing and querying time-stamped data (metrics, events). Supports compression, downsampling, and efficient range queries.
Why it teaches data-intensive design: Time-series data has unique characteristics. Understanding specialized storage teaches you how to optimize for specific access patterns.
Core challenges you’ll face:
- Compression → maps to delta encoding, run-length encoding
- Downsampling → maps to aggregating data over time
- Retention policies → maps to data lifecycle management
- Query optimization → maps to time-range queries, aggregation
Key Concepts:
- Time-Series Databases: InfluxDB documentation
- Compression: “Designing Data-Intensive Applications” Ch. 3
- Columnar Storage: “Designing Data-Intensive Applications” Ch. 3
Difficulty: Advanced. Time estimate: 3-4 weeks. Prerequisites: Understanding of compression, database internals.
Real world outcome:
# Start time-series database
$ ./tsdb --data-dir=/data/tsdb
Time-series database started
# Create database
$ ./tsdb-cli create-db metrics
Database created
# Write metrics
$ ./tsdb-cli write metrics \
cpu.usage,host=server1 value=45.2 1642234567 \
cpu.usage,host=server1 value=46.1 1642234577 \
cpu.usage,host=server1 value=44.8 1642234587
# Query: average CPU usage last hour
$ ./tsdb-cli query metrics \
"SELECT mean(value) FROM cpu.usage WHERE host='server1' AND time > now() - 1h"
mean(value)
45.37
# Downsampling: create 1-minute averages
$ ./tsdb-cli create-downsample metrics cpu.usage --interval=1m --function=mean
Downsample created
# Retention: delete data older than 30 days
$ ./tsdb-cli set-retention metrics --duration=30d
Retention policy set
Implementation Hints:
Time-series storage:
Metric: cpu.usage,host=server1
Timestamps: [1642234567, 1642234577, 1642234587, ...]
Values: [45.2, 46.1, 44.8, ...]
Compression:
- Delta encoding: Store differences between timestamps
- Run-length encoding: Compress repeated values
- Gorilla compression: Efficient for floating-point
Downsampling:
- Pre-aggregate data at different resolutions
- 1-second → 1-minute → 1-hour → 1-day
- Reduces storage and query time
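A sketch of two of the ideas above (Python, illustrative names): delta-encoding timestamps into small, highly compressible integers, and downsampling raw points into per-interval means:

```python
def delta_encode(timestamps):
    """Store the first timestamp plus successive differences (small, compressible ints)."""
    deltas = [timestamps[0]]
    for prev, cur in zip(timestamps, timestamps[1:]):
        deltas.append(cur - prev)
    return deltas

def delta_decode(deltas):
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

def downsample_mean(points, interval=60):
    """Aggregate (timestamp, value) points into per-interval means."""
    buckets = {}
    for ts, value in points:
        start = ts - (ts % interval)
        total, count = buckets.get(start, (0.0, 0))
        buckets[start] = (total + value, count + 1)
    return {start: total / count for start, (total, count) in sorted(buckets.items())}

timestamps = [1642234567, 1642234577, 1642234587]
print(delta_encode(timestamps))                    # [1642234567, 10, 10]
points = list(zip(timestamps, [45.2, 46.1, 44.8]))
print(downsample_mean(points, interval=60))        # one bucket whose mean is about 45.37
```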
Learning milestones:
- Basic storage → Store and retrieve time-series data
- Compression → Data compressed efficiently
- Downsampling → Aggregations computed correctly
- Query optimization → Range queries fast
Project 17: Graph Database with Gremlin Query Language
- File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
- Main Programming Language: Python
- Alternative Programming Languages: Java, Go
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 4: Expert
- Knowledge Area: Graph Databases / Graph Algorithms
- Software or Tool: Graph algorithms, query language
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A graph database that stores nodes and edges, with a query language (Gremlin-like) for traversing relationships and finding paths.
Why it teaches data-intensive design: Graph databases excel at relationship queries. Understanding them teaches you when to use graph vs relational models.
Core challenges you’ll face:
- Graph storage → maps to adjacency lists, edge properties
- Traversal algorithms → maps to BFS, DFS, shortest path
- Query language → maps to Gremlin-style syntax
- Indexing → maps to indexing nodes and edges
Key Concepts:
- Graph Databases: “Designing Data-Intensive Applications” Ch. 2
- Graph Algorithms: “Introduction to Algorithms” Ch. 22
- Gremlin: Apache TinkerPop documentation
Difficulty: Expert. Time estimate: 4-6 weeks. Prerequisites: Understanding of graphs, algorithms, query parsing.
Real world outcome:
# Start graph database
$ ./graphdb --port=8182
Graph database started
# Create graph
$ ./gremlin-cli
gremlin> graph = Graph.open()
gremlin> g = graph.traversal()
# Add vertices
gremlin> alice = g.addV('person').property('name', 'Alice').next()
gremlin> bob = g.addV('person').property('name', 'Bob').next()
gremlin> company = g.addV('company').property('name', 'Acme').next()
# Add edges
gremlin> g.addE('knows').from(alice).to(bob).property('since', 2020).next()
gremlin> g.addE('worksFor').from(alice).to(company).next()
# Query: find Alice's friends
gremlin> g.V().has('name', 'Alice').out('knows').values('name')
==>Bob
# Query: find people who work at same company as Alice
gremlin> g.V().has('name', 'Alice').out('worksFor').in('worksFor').values('name')
==>Alice
# Shortest path
gremlin> g.V().has('name', 'Alice').repeat(out()).until(has('name', 'Bob')).path()
==>[v[1], v[2]]
Implementation Hints:
Graph storage:
Vertices:
v1: {label: 'person', properties: {name: 'Alice'}}
v2: {label: 'person', properties: {name: 'Bob'}}
Edges:
e1: {from: v1, to: v2, label: 'knows', properties: {since: 2020}}
Traversal:
- Start at vertex
- Follow edges (out/in/both)
- Filter by properties
- Return results
Algorithms:
- BFS: Level-order traversal
- DFS: Depth-first traversal
- Shortest path: Dijkstra’s algorithm
- PageRank: Centrality measure
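An adjacency-list sketch with a BFS shortest path over outgoing edges, roughly what the repeat(out()).until(...) traversal above does (Python, illustrative names):

```python
from collections import deque

class Graph:
    def __init__(self):
        self.vertices = {}      # id -> properties
        self.out_edges = {}     # id -> list of (label, target id)

    def add_vertex(self, vid, **props):
        self.vertices[vid] = props
        self.out_edges.setdefault(vid, [])

    def add_edge(self, src, label, dst):
        self.out_edges[src].append((label, dst))

    def shortest_path(self, start, goal):
        """BFS over outgoing edges; returns the vertex ids on a shortest path."""
        queue = deque([[start]])
        seen = {start}
        while queue:
            path = queue.popleft()
            if path[-1] == goal:
                return path
            for _label, nxt in self.out_edges.get(path[-1], []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(path + [nxt])
        return None

g = Graph()
g.add_vertex("v1", label="person", name="Alice")
g.add_vertex("v2", label="person", name="Bob")
g.add_edge("v1", "knows", "v2")
print(g.shortest_path("v1", "v2"))   # ['v1', 'v2']
```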
Learning milestones:
- Basic graph → Store vertices and edges
- Traversal → Navigate relationships
- Query language → Gremlin-like queries work
- Algorithms → Shortest path, etc. work
Project 18: Complete Data-Intensive Application
- File: LEARN_DATA_INTENSIVE_APPLICATIONS_DESIGN.md
- Main Programming Language: Python/Go
- Alternative Programming Languages: Java, TypeScript
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 5: Master
- Knowledge Area: Complete System Design / Integration
- Software or Tool: All previous projects
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A complete data-intensive application (e.g., social media platform, e-commerce site, analytics platform) that integrates all concepts: replication, partitioning, caching, message queues, batch and stream processing.
Why it teaches data-intensive design: Building a complete system integrates all concepts. You’ll make real-world trade-offs and see how components interact.
Core challenges you’ll face:
- System architecture → maps to component selection, data flow
- Scalability → maps to handling growth, performance optimization
- Reliability → maps to failure handling, monitoring
- Consistency → maps to choosing appropriate consistency models
Difficulty: Master. Time estimate: 2-3 months. Prerequisites: All previous projects.
Real world outcome:
# Complete social media platform
# Architecture:
# - PostgreSQL (primary database) with read replicas
# - Redis (caching layer)
# - Kafka (event streaming)
# - Elasticsearch (search)
# - Spark (batch analytics)
# - Flink (real-time analytics)
# Features:
# - User posts, comments, likes
# - News feed (personalized, real-time)
# - Search
# - Analytics dashboard
# - Recommendations
# Deploy
$ ./deploy.sh
Starting services...
PostgreSQL: Running (1 primary, 3 replicas)
Redis: Running (cluster mode, 6 nodes)
Kafka: Running (3 brokers)
Elasticsearch: Running (5 nodes)
Spark: Running
Flink: Running
# Load test
$ ./load-test --users=1000000 --posts-per-user=100
Generating load...
Throughput: 50,000 requests/sec
P95 latency: 45ms
P99 latency: 120ms
# Monitor
$ ./monitor dashboard
[Real-time metrics dashboard showing all components]
Implementation Hints:
System components:
- API Layer: REST/GraphQL APIs
- Database: PostgreSQL with replication
- Cache: Redis for hot data
- Search: Elasticsearch for full-text search
- Events: Kafka for event streaming
- Batch: Spark for analytics
- Stream: Flink for real-time processing
- Monitoring: Metrics, logs, tracing
Design decisions:
- When to use cache vs database?
- When to use eventual consistency?
- How to handle failures?
- How to scale each component?
Learning milestones:
- Basic system → Core features work
- Scalability → Handles load
- Reliability → Handles failures
- Monitoring → Observability in place
Project Comparison Table
| # | Project | Difficulty | Time | Key Skill | Fun |
|---|---|---|---|---|---|
| 1 | B-Tree KV Store | ⭐⭐⭐ | 3-4 weeks | Storage Engines | ⭐⭐⭐⭐ |
| 2 | LSM-Tree Storage | ⭐⭐⭐ | 3-4 weeks | Write Optimization | ⭐⭐⭐⭐ |
| 3 | Query Engine | ⭐⭐⭐⭐ | 4-6 weeks | Query Processing | ⭐⭐⭐⭐ |
| 4 | Leader-Follower Replication | ⭐⭐⭐ | 3-4 weeks | Replication | ⭐⭐⭐ |
| 5 | Consistent Hashing | ⭐⭐ | 2-3 weeks | Partitioning | ⭐⭐⭐ |
| 6 | Two-Phase Commit | ⭐⭐⭐ | 2-3 weeks | Distributed Transactions | ⭐⭐⭐ |
| 7 | Raft Consensus | ⭐⭐⭐⭐ | 4-6 weeks | Consensus | ⭐⭐⭐⭐⭐ |
| 8 | Event Sourcing | ⭐⭐⭐ | 3-4 weeks | Event-Driven | ⭐⭐⭐⭐ |
| 9 | Message Queue | ⭐⭐⭐ | 3-4 weeks | Async Processing | ⭐⭐⭐ |
| 10 | Distributed Cache | ⭐⭐ | 2-3 weeks | Caching | ⭐⭐⭐ |
| 11 | MapReduce | ⭐⭐⭐ | 3-4 weeks | Batch Processing | ⭐⭐⭐⭐ |
| 12 | Stream Processing | ⭐⭐⭐ | 3-4 weeks | Real-Time | ⭐⭐⭐⭐ |
| 13 | Connection Pool | ⭐⭐ | 1-2 weeks | Resource Management | ⭐⭐ |
| 14 | Change Data Capture | ⭐⭐⭐ | 3-4 weeks | Data Integration | ⭐⭐⭐ |
| 15 | Distributed Locks | ⭐⭐⭐ | 2-3 weeks | Coordination | ⭐⭐⭐ |
| 16 | Time-Series DB | ⭐⭐⭐ | 3-4 weeks | Specialized Storage | ⭐⭐⭐⭐ |
| 17 | Graph Database | ⭐⭐⭐⭐ | 4-6 weeks | Graph Algorithms | ⭐⭐⭐⭐ |
| 18 | Complete Application | ⭐⭐⭐⭐⭐ | 2-3 months | System Design | ⭐⭐⭐⭐⭐ |
Recommended Learning Path
Phase 1: Storage Foundations (6-8 weeks)
Understand how data is stored and retrieved:
- Project 1: B-Tree KV Store - Read-optimized storage
- Project 2: LSM-Tree Storage - Write-optimized storage
- Project 13: Connection Pool - Resource management
Phase 2: Replication and Partitioning (6-8 weeks)
Learn how to scale beyond single machines:
- Project 4: Leader-Follower Replication - High availability
- Project 5: Consistent Hashing - Horizontal scaling
- Project 7: Raft Consensus - Distributed consensus
Phase 3: Transactions and Consistency (4-6 weeks)
Understand consistency models:
- Project 6: Two-Phase Commit - Distributed transactions
- Project 15: Distributed Locks - Coordination
Phase 4: Processing Patterns (8-10 weeks)
Learn batch and stream processing:
- Project 8: Event Sourcing - Event-driven architecture
- Project 9: Message Queue - Asynchronous processing
- Project 11: MapReduce - Batch processing
- Project 12: Stream Processing - Real-time processing
Phase 5: Specialized Systems (6-8 weeks)
Explore specialized databases:
- Project 3: Query Engine - SQL processing
- Project 14: Change Data Capture - Data integration
- Project 16: Time-Series DB - Time-series optimization
- Project 17: Graph Database - Graph algorithms
Phase 6: Integration (2-3 months)
Build complete systems:
- Project 10: Distributed Cache - Performance optimization
- Project 18: Complete Application - Full system integration
Summary
| # | Project | Main Language |
|---|---|---|
| 1 | B-Tree KV Store | C |
| 2 | LSM-Tree Storage | C |
| 3 | Query Engine | C++ |
| 4 | Leader-Follower Replication | Go |
| 5 | Consistent Hashing | Python |
| 6 | Two-Phase Commit | Go |
| 7 | Raft Consensus | Go |
| 8 | Event Sourcing | Python |
| 9 | Message Queue | Go |
| 10 | Distributed Cache | Go |
| 11 | MapReduce | Python |
| 12 | Stream Processing | Python |
| 13 | Connection Pool | Go |
| 14 | Change Data Capture | Python |
| 15 | Distributed Locks | Go |
| 16 | Time-Series DB | Go |
| 17 | Graph Database | Python |
| 18 | Complete Application | Python/Go |
Resources
Essential Books
- “Designing Data-Intensive Applications” by Martin Kleppmann - The definitive guide
- “Database Systems: The Complete Book” by Garcia-Molina et al. - Deep database internals
- “Streaming Systems” by Tyler Akidau et al. - Stream processing
- “Building Microservices” by Sam Newman - System design
Key Papers
- “In Search of an Understandable Consensus Algorithm” (Raft) - Diego Ongaro
- “MapReduce: Simplified Data Processing on Large Clusters” - Jeffrey Dean and Sanjay Ghemawat (Google)
- “Dynamo: Amazon’s Highly Available Key-value Store” - Amazon
Tools and Technologies
- PostgreSQL: Relational database
- Redis: In-memory cache
- Kafka: Message queue
- Spark: Batch processing
- Flink: Stream processing
- etcd: Distributed coordination
Total Estimated Time: 12-18 months of dedicated study
After completion: You’ll be able to design, build, and scale data-intensive applications that handle millions of users and petabytes of data. These skills are essential for backend engineering, data engineering, platform engineering, and systems architecture roles at top tech companies.