LEARN REALTIME ANALYTICS DATABASES DEEP DIVE

Learn Real-Time Analytics Databases: From Zero to OLAP Master

Goal: Deeply understand real-time analytics databases—from column-oriented storage fundamentals to building distributed OLAP systems that process billions of rows per second.


Why Real-Time Analytics Databases Matter

Every time Netflix shows you “Trending Now,” Uber calculates surge pricing, or LinkedIn suggests “People You May Know,” a real-time analytics database is crunching massive datasets in milliseconds. These systems represent the pinnacle of database engineering—combining storage optimization, query processing, distributed systems, and streaming data ingestion into unified platforms.

After completing these projects, you will:

  • Understand why columnar storage can be orders of magnitude faster than row-based storage for analytical queries
  • Build storage engines that compress data to 1/10th its original size
  • Implement streaming ingestion pipelines that handle millions of events per second
  • Design query engines that leverage vectorized execution and SIMD
  • Master distributed consensus and replication for fault-tolerant systems
  • Build your own mini-ClickHouse or mini-Druid from scratch

Core Concept Analysis

The OLAP vs OLTP Divide

OLTP (Row-Oriented)                     OLAP (Column-Oriented)
┌────┬───────┬───────┐                  ┌───────┬───────┬───────┐
│ ID │ Name  │ Sales │                  │  ID   │  ID   │  ID   │
├────┼───────┼───────┤                  │   1   │   2   │   3   │
│ 1  │ Alice │ 100   │                  ├───────┼───────┼───────┤
│ 2  │ Bob   │ 200   │                  │ Name  │ Name  │ Name  │
│ 3  │ Carol │ 150   │                  │ Alice │  Bob  │ Carol │
└────┴───────┴───────┘                  ├───────┼───────┼───────┤
                                        │ Sales │ Sales │ Sales │
Fast: Get all data for ID=2             │  100  │  200  │  150  │
Slow: SUM(Sales) across millions        └───────┴───────┴───────┘

                                        Fast: SUM(Sales) - only read one column
                                        Slow: Get all data for ID=2

Fundamental Concepts

  1. Columnar Storage: Data stored by columns, not rows
    • Enables massive compression (similar values grouped together)
    • Reads only needed columns (skip irrelevant data)
    • Cache-friendly access patterns
  2. Vectorized Execution: Process data in batches (vectors) of thousands of values
    • Amortizes function call overhead
    • Leverages CPU SIMD instructions
    • Modern CPUs can process 8-16 values simultaneously
  3. Time-Series Optimization: Specialized handling for timestamped data
    • Delta-of-delta encoding for timestamps
    • Gorilla compression for floating-point metrics
    • Time-based partitioning and retention
  4. Streaming Ingestion: Real-time data consumption
    • Kafka/Kinesis integration
    • Micro-batching vs true streaming
    • Write-ahead logs for durability
  5. Distributed Query Processing: Scale across machines
    • Data partitioning (sharding)
    • Query routing and aggregation
    • Consensus protocols (Raft/Paxos)
  6. Compression Algorithms: Shrink data dramatically
    • Run-Length Encoding (RLE)
    • Dictionary Encoding
    • Bit-Packing
    • LZ4/ZSTD general compression

The Real-Time Analytics Stack

┌─────────────────────────────────────────────────────────────┐
│                    Query Layer (SQL)                        │
├─────────────────────────────────────────────────────────────┤
│              Query Optimizer / Planner                      │
├─────────────────────────────────────────────────────────────┤
│         Vectorized Execution Engine (SIMD)                  │
├─────────────────────────────────────────────────────────────┤
│    ┌─────────────────┐  ┌─────────────────────────────┐    │
│    │  In-Memory      │  │     On-Disk Storage         │    │
│    │  (Hot Data)     │  │  (Columnar + Compressed)    │    │
│    └─────────────────┘  └─────────────────────────────┘    │
├─────────────────────────────────────────────────────────────┤
│              Streaming Ingestion (Kafka)                    │
├─────────────────────────────────────────────────────────────┤
│         Distributed Coordination (Raft/ZooKeeper)           │
└─────────────────────────────────────────────────────────────┘

Project List

Projects are ordered from fundamental understanding to advanced implementations. Each builds on concepts from previous projects.


Project 1: Row vs Column Storage Visualizer

  • File: LEARN_REALTIME_ANALYTICS_DATABASES_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, Rust, C
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 1: Beginner
  • Knowledge Area: Storage Engines / Data Layout
  • Software or Tool: Custom Storage Engine
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A tool that stores the same dataset in both row-oriented and column-oriented formats, then visually demonstrates the I/O difference when running analytical queries (like SUM, AVG, COUNT) vs transactional queries (fetch by ID).

Why it teaches real-time analytics: Before diving into complex systems, you must viscerally understand why columnar storage exists. This project makes the abstract concrete—you’ll literally see how many bytes are read for each query type.

Core challenges you’ll face:

  • Designing the storage layout → maps to understanding how data is physically arranged on disk
  • Measuring I/O accurately → maps to understanding what “scanning” actually means
  • Implementing basic aggregations → maps to how queries traverse data differently
  • Visualizing access patterns → maps to understanding cache efficiency

Key Concepts:

  • Row vs Column Storage: “Designing Data-Intensive Applications” Chapter 3 - Martin Kleppmann
  • Data Locality: “Computer Systems: A Programmer’s Perspective” Chapter 6 - Bryant & O’Hallaron
  • File I/O Fundamentals: “The Linux Programming Interface” Chapters 4-5 - Michael Kerrisk

Difficulty: Beginner Time estimate: Weekend Prerequisites: Basic file I/O in any language, understanding of arrays/lists

Real world outcome:

$ ./storage_visualizer --dataset sales_1M.csv

=== Storage Comparison ===
Dataset: 1,000,000 rows × 10 columns
Row-oriented size: 85.2 MB
Column-oriented size: 85.2 MB (same uncompressed)

=== Query: SELECT SUM(revenue) FROM sales ===
Row-oriented:
  Bytes read: 85.2 MB (entire dataset)
  Time: 342ms
  [████████████████████████████████████████] 100% of data

Column-oriented:
  Bytes read: 8.5 MB (only revenue column)
  Time: 38ms
  [████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░] 10% of data

Speedup: 9.0x faster!

=== Query: SELECT * FROM sales WHERE id = 500000 ===
Row-oriented:
  Bytes read: 42.6 MB (scan to row)
  Time: 156ms

Column-oriented:
  Bytes read: 85.2 MB (reassemble from all columns)
  Time: 412ms

Row-oriented wins for point lookups!

Implementation Hints:

The core insight is simple: row storage keeps all fields of a record together, column storage keeps all values of a field together.

Questions to guide you:

  • How do you represent “reading X bytes from disk” in your program? (Hint: seek and read operations)
  • What data structure holds your column data? (Hint: one array per column)
  • How do you “reconstruct” a row from columnar data? (Hint: index into each column array)
  • How can you visually show which bytes were accessed? (Hint: track byte ranges)

Architecture:

RowStore:
  File: [row1_all_cols][row2_all_cols][row3_all_cols]...

ColumnStore:
  File: [col1_all_rows][col2_all_rows][col3_all_rows]...
  Metadata: {col1: offset=0, len=X}, {col2: offset=X, len=Y}...

For visualization, consider using ANSI colors to highlight which portions of your data file were accessed during each query.
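
A minimal Python sketch of the two layouts, using a two-column schema (id as int64, revenue as float64) and counting bytes read by hand. The file names and the in-memory metadata dict are illustrative choices, not a required design:

import struct

ROW_FMT = "<qd"              # one row: id (int64), revenue (float64)
ROW_SIZE = struct.calcsize(ROW_FMT)

def write_row_store(path, rows):
    with open(path, "wb") as f:
        for rid, rev in rows:
            f.write(struct.pack(ROW_FMT, rid, rev))

def write_column_store(path, rows):
    ids = b"".join(struct.pack("<q", rid) for rid, _ in rows)
    revs = b"".join(struct.pack("<d", rev) for _, rev in rows)
    with open(path, "wb") as f:
        f.write(ids + revs)
    # metadata kept in memory for simplicity: column -> (offset, length)
    return {"id": (0, len(ids)), "revenue": (len(ids), len(revs))}

def sum_revenue_row_store(path):
    bytes_read, total = 0, 0.0
    with open(path, "rb") as f:
        while chunk := f.read(ROW_SIZE):            # must read every full row
            bytes_read += len(chunk)
            _, rev = struct.unpack(ROW_FMT, chunk)
            total += rev
    return total, bytes_read

def sum_revenue_column_store(path, meta):
    offset, length = meta["revenue"]
    with open(path, "rb") as f:
        f.seek(offset)                              # skip the id column entirely
        data = f.read(length)
    total = sum(struct.unpack(f"<{len(data) // 8}d", data))
    return total, len(data)

rows = [(i, float(i % 100)) for i in range(100_000)]
write_row_store("rows.bin", rows)
meta = write_column_store("cols.bin", rows)
print(sum_revenue_row_store("rows.bin"))            # reads 16 bytes per row
print(sum_revenue_column_store("cols.bin", meta))   # reads only 8 bytes per row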

Learning milestones:

  1. Both storage formats work correctly → You understand physical data layout
  2. Query results match between formats → You’ve verified correctness
  3. I/O measurements show expected patterns → You understand the performance implications
  4. Visualization clearly shows the difference → You can explain this to others

Project 2: Compression Algorithm Playground

  • File: LEARN_REALTIME_ANALYTICS_DATABASES_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go, Python
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Data Compression / Encoding
  • Software or Tool: Custom Compression Library
  • Main Book: “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron

What you’ll build: A suite of compression algorithms specifically designed for analytics workloads: Run-Length Encoding (RLE), Dictionary Encoding, Delta Encoding, Delta-of-Delta, Bit-Packing, and XOR-based compression for floats. You’ll benchmark each against real datasets.

Why it teaches real-time analytics: Compression is the secret weapon of OLAP databases. ClickHouse achieves 10-20x compression ratios. Understanding these algorithms reveals why columnar storage enables such dramatic compression—similar values cluster together.

Core challenges you’ll face:

  • Implementing RLE efficiently → maps to detecting and encoding runs
  • Building a dictionary encoder → maps to trading memory for compression
  • Delta-of-delta for timestamps → maps to exploiting sequential patterns
  • Bit-packing integers → maps to minimizing storage for small values
  • XOR compression for floats → maps to IEEE 754 exploitation

Key Concepts:

  • Run-Length Encoding: “Introduction to Data Compression” Chapter 2 - Khalid Sayood
  • Dictionary Encoding: Facebook’s Gorilla Paper Section 4
  • Delta Encoding: “Time-series compression algorithms, explained” - TigerData Blog
  • Bit Manipulation: “Computer Systems: A Programmer’s Perspective” Chapter 2 - Bryant & O’Hallaron

Difficulty: Intermediate Time estimate: 1-2 weeks Prerequisites: Project 1, comfortable with binary/bitwise operations, basic understanding of floating-point representation

Real world outcome:

$ ./compression_playground --input sensor_data.csv

=== Compression Benchmark Results ===
Original size: 245.0 MB (10M rows, 4 columns)

Column: timestamp (int64, monotonic)
  Raw:           80.0 MB
  Delta:         12.4 MB (6.5x compression)
  Delta-of-Delta: 1.2 MB (66.7x compression!)
  Winner: Delta-of-Delta

Column: sensor_id (int32, low cardinality)
  Raw:           40.0 MB
  Dictionary:     2.1 MB (19.0x compression)
  RLE:            0.8 MB (50.0x compression)
  Winner: RLE (sorted data)

Column: temperature (float64)
  Raw:           80.0 MB
  XOR (Gorilla):  8.9 MB (9.0x compression)
  Winner: XOR

Column: status (string, 5 unique values)
  Raw:           45.0 MB
  Dictionary:     1.2 MB (37.5x compression)
  Winner: Dictionary

Total compressed: 12.1 MB (20.2x overall compression)

=== Decompression Speed ===
Delta-of-Delta: 1.2 GB/s
Dictionary:     2.8 GB/s
RLE:            3.4 GB/s
XOR:            0.9 GB/s

Implementation Hints:

Run-Length Encoding: Transform [A, A, A, B, B, C, C, C, C] → [(A,3), (B,2), (C,4)]. This works brilliantly on sorted columnar data where values repeat consecutively.

Dictionary Encoding:

Original: ["USA", "USA", "Germany", "USA", "France"]
Dictionary: {0: "USA", 1: "Germany", 2: "France"}
Encoded: [0, 0, 1, 0, 2]  // Much smaller!

Delta-of-Delta for timestamps:

Timestamps: [1000, 1060, 1120, 1180, 1240]
Deltas:     [1000, 60, 60, 60, 60]      // First value + differences
Delta-of-D: [1000, 60, 0, 0, 0]         // Most become 0!

Facebook found 96% of timestamp delta-of-deltas are 0!

XOR for floats (Gorilla algorithm): Consecutive float values often share many bits. XOR reveals only the differing bits, which can be encoded very compactly.
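
A minimal Python sketch of the two simplest encoders, RLE and delta-of-delta, operating on plain lists (a real implementation would emit bit-packed output rather than Python objects):

def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]

def delta_of_delta(timestamps):
    """Keep the first value, then encode second differences; regular intervals become 0."""
    out = [timestamps[0]]
    prev, prev_delta = timestamps[0], None
    for t in timestamps[1:]:
        delta = t - prev
        out.append(delta if prev_delta is None else delta - prev_delta)
        prev, prev_delta = t, delta
    return out

print(rle_encode(["A", "A", "A", "B", "B", "C", "C", "C", "C"]))
# [('A', 3), ('B', 2), ('C', 4)]
print(delta_of_delta([1000, 1060, 1120, 1180, 1240]))
# [1000, 60, 0, 0, 0]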

Questions to guide you:

  • How do you detect when to stop a run in RLE?
  • What happens when dictionary size exceeds the savings?
  • How do you handle the first value in delta encoding?
  • What’s the bit layout of a float64? (Sign, exponent, mantissa)

Learning milestones:

  1. RLE works on sorted data → You understand run detection
  2. Dictionary encoding handles arbitrary strings → You understand the space-time tradeoff
  3. Delta-of-delta achieves >50x on timestamps → You’ve exploited temporal patterns
  4. XOR compression works on real sensor data → You understand IEEE 754 structure

Project 3: Build a Mini-Parquet File Reader/Writer

  • File: LEARN_REALTIME_ANALYTICS_DATABASES_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, C++, Python
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: File Formats / Columnar Storage
  • Software or Tool: Apache Parquet
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A simplified implementation of the Parquet file format—the industry standard for columnar data storage. You’ll implement row groups, column chunks, page structure, and basic compression, then verify your files work with the official Parquet tools.

Why it teaches real-time analytics: Parquet is everywhere: Spark, Presto, Snowflake, BigQuery, and more. Understanding its internals reveals how production systems balance compression, random access, and query efficiency. You’ll learn why it’s organized the way it is.

Core challenges you’ll face:

  • Understanding the Parquet file structure → maps to footer, row groups, column chunks, pages
  • Implementing nested schema encoding → maps to definition and repetition levels
  • Adding compression per column → maps to codec selection and page compression
  • Writing valid metadata → maps to Thrift serialization and schema storage

Key Concepts:

Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Project 1 and 2, understanding of binary file formats, comfortable with nested data structures

Real world outcome:

$ ./mini_parquet write --input users.json --output users.parquet

Writing Parquet file...
Schema: {id: INT64, name: STRING, emails: LIST<STRING>, address: STRUCT}
Row groups: 2 (50,000 rows each)
Compression: SNAPPY

Column statistics:
  id:      min=1, max=100000, nulls=0
  name:    min="Aaron", max="Zoe", nulls=12
  emails:  nested (def/rep levels used)
  address: nested struct

File size: 4.2 MB (vs 45 MB JSON = 10.7x compression)
Written successfully!

$ ./mini_parquet read --input users.parquet --query "SELECT id, name LIMIT 5"

┌────────┬─────────┐
│   id   │  name   │
├────────┼─────────┤
│      1 │ Alice   │
│      2 │ Bob     │
│      3 │ Carol   │
│      4 │ David   │
│      5 │ Eve     │
└────────┴─────────┘

$ parquet-tools schema users.parquet  # Official tool validates our file!
message users {
  required int64 id;
  optional binary name (STRING);
  repeated binary emails (STRING);
  optional group address { ... }
}
✓ Valid Parquet file!

Implementation Hints:

Parquet file structure:

┌──────────────────────────────────┐
│         Row Group 1              │
│  ┌─────────────────────────────┐ │
│  │ Column Chunk: id            │ │
│  │   Page 1, Page 2, ...       │ │
│  ├─────────────────────────────┤ │
│  │ Column Chunk: name          │ │
│  │   Page 1, Page 2, ...       │ │
│  └─────────────────────────────┘ │
├──────────────────────────────────┤
│         Row Group 2              │
│         ...                      │
├──────────────────────────────────┤
│       Footer (Metadata)          │
│  - Schema                        │
│  - Row group locations           │
│  - Column statistics             │
├──────────────────────────────────┤
│    Footer Length (4 bytes)       │
│    Magic: "PAR1" (4 bytes)       │
└──────────────────────────────────┘

Questions to guide you:

  • Why does Parquet put metadata at the end, not the beginning?
  • What are row groups for? (Hint: parallel processing, memory management)
  • Why compress at page level, not column level?
  • What problem do definition/repetition levels solve for nested data?

Start simple: flat schemas (no nesting), single row group, no compression. Add features incrementally.
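
Before touching Thrift or real page encodings, it can help to prototype just the footer-at-the-end layout. The sketch below is a toy format, not valid Parquet: it only borrows the idea of writing column chunks first, then metadata, then the metadata length and magic bytes at the very end, so a reader can fetch a single column with two seeks:

import json, struct

MAGIC = b"PAR1"   # real Parquet uses this magic; everything else here is a toy stand-in

def write_toy_columnar(path, columns):
    """Write column chunks first, then a JSON 'footer' with offsets, then its length."""
    with open(path, "wb") as f:
        f.write(MAGIC)
        meta = {}
        for name, values in columns.items():
            data = json.dumps(values).encode()      # stand-in for encoded/compressed pages
            meta[name] = {"offset": f.tell(), "length": len(data)}
            f.write(data)
        footer = json.dumps(meta).encode()
        f.write(footer)
        f.write(struct.pack("<I", len(footer)))     # footer length sits just before trailing magic
        f.write(MAGIC)

def read_column(path, name):
    """Seek to the end, read the footer, then fetch only the requested column chunk."""
    with open(path, "rb") as f:
        f.seek(-8, 2)                               # 4-byte footer length + 4-byte magic
        footer_len = struct.unpack("<I", f.read(4))[0]
        f.seek(-(8 + footer_len), 2)
        meta = json.loads(f.read(footer_len))
        entry = meta[name]
        f.seek(entry["offset"])
        return json.loads(f.read(entry["length"]))

write_toy_columnar("toy.col", {"id": [1, 2, 3], "name": ["Alice", "Bob", "Carol"]})
print(read_column("toy.col", "name"))               # only the 'name' chunk is read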

Learning milestones:

  1. Write valid Parquet files (verified by parquet-tools) → You understand the file structure
  2. Column pruning works (only read needed columns) → You understand columnar access
  3. Compression reduces file size significantly → You understand page-level compression
  4. Nested data with def/rep levels works → You understand Dremel encoding

Project 4: LSM-Tree Storage Engine for Analytics

  • File: LEARN_REALTIME_ANALYTICS_DATABASES_DEEP_DIVE.md
  • Main Programming Language: Rust
  • Alternative Programming Languages: C++, Go, C
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Storage Engines / Write Optimization
  • Software or Tool: RocksDB-style engine
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: An LSM-Tree (Log-Structured Merge-Tree) storage engine optimized for analytics write patterns. You’ll implement a MemTable (in-memory buffer), SSTable (Sorted String Table) files, compaction strategies, and bloom filters for efficient lookups.

Why it teaches real-time analytics: Real-time analytics databases must ingest massive amounts of data continuously. LSM-trees provide the foundation—they convert random writes into sequential writes, enabling sustained ingestion rates of millions of events per second.

Core challenges you’ll face:

  • Implementing an efficient MemTable → maps to in-memory sorted structures (skip list/red-black tree)
  • Flushing to SSTable format → maps to sorted, immutable, compressed files
  • Compaction strategies → maps to size-tiered vs leveled compaction tradeoffs
  • Bloom filters for reads → maps to probabilistic data structures
  • Write-ahead log (WAL) → maps to durability guarantees

Key Concepts:

Difficulty: Advanced Time estimate: 3-4 weeks Prerequisites: Projects 1-2, understanding of tree data structures, file I/O

Real world outcome:

$ ./lsm_engine bench --writes 10000000 --threads 4

=== LSM-Tree Benchmark ===
Configuration:
  MemTable size: 64 MB
  Compaction: Leveled
  Bloom filter: 10 bits/key

Write Performance:
  Total writes: 10,000,000
  Time: 12.4 seconds
  Throughput: 806,451 writes/sec
  Avg latency: 4.96 µs

Storage:
  Data written: 1.2 GB
  On-disk size: 890 MB (compression: 1.35x)
  SSTable files:
    L0: 2 files (32 MB each)
    L1: 4 files (64 MB each)
    L2: 9 files (64 MB each)

$ ./lsm_engine query --key "sensor_12345_2024"
Key found in: L1/sst_00042.db
Value: {"temp": 23.5, "humidity": 65}
Bloom filter checks: 3 (2 negative, 1 positive)
Total read time: 89 µs

$ ./lsm_engine compact --force
Compacting L0 → L1...
  Merged 3 files into 2 files
  Reclaimed 45 MB of space
  Removed 125,000 tombstones
Compaction complete in 2.3s

Implementation Hints:

LSM-Tree structure:

┌────────────────────────────────────┐
│           MemTable                 │  In-Memory (Skip List)
│   (sorted, accepts writes)         │
└────────────────────────────────────┘
              │ flush when full
              ▼
┌────────────────────────────────────┐
│   Level 0 SSTables (unsorted)      │  On-Disk
│   [sst1] [sst2] [sst3]             │
└────────────────────────────────────┘
              │ compaction
              ▼
┌────────────────────────────────────┐
│   Level 1 SSTables (sorted)        │
│   [sst1] [sst2] ... [sst10]        │  Non-overlapping key ranges
└────────────────────────────────────┘
              │ compaction
              ▼
        ... more levels ...
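
A minimal Python sketch of the write path, assuming a plain dict as the MemTable and sorted in-memory lists standing in for on-disk SSTables (no WAL, bloom filters, or compaction yet; class and field names are illustrative):

import bisect

class MiniLSM:
    """Toy LSM: dict MemTable, immutable sorted runs as 'SSTables'."""
    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.sstables = []                      # newest first; each run is a sorted list of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        run = sorted(self.memtable.items())     # sorted, immutable run == SSTable
        self.sstables.insert(0, run)
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:                # newest data wins
            return self.memtable[key]
        for run in self.sstables:               # then search runs newest → oldest
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

db = MiniLSM()
for i in range(10):
    db.put(f"sensor_{i}", {"temp": 20 + i})
print(db.get("sensor_3"), len(db.sstables), "runs on 'disk'")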

Questions to guide you:

  • Why use a skip list instead of a balanced tree for the MemTable?
  • What happens if the machine crashes before MemTable is flushed? (WAL)
  • Why are L0 SSTables allowed to overlap but L1+ cannot?
  • How does the bloom filter prevent unnecessary disk reads?
  • What’s the tradeoff between size-tiered and leveled compaction?

Learning milestones:

  1. MemTable + WAL provides durability → You understand crash recovery
  2. SSTables flush correctly and are queryable → You understand the immutable file format
  3. Compaction merges files and removes tombstones → You understand space reclamation
  4. Bloom filters reduce read amplification → You understand probabilistic optimization

Project 5: Real-Time Metrics Aggregator (In-Memory OLAP)

  • File: LEARN_REALTIME_ANALYTICS_DATABASES_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, C++, Java
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: In-Memory Analytics / Aggregation
  • Software or Tool: Mini-Druid/Pinot
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: An in-memory analytics engine that can ingest millions of events per second and answer aggregation queries (COUNT, SUM, AVG, MIN, MAX, GROUP BY) in milliseconds. Think of it as a mini-Druid or mini-Pinot for metrics.

Why it teaches real-time analytics: This is the core of real-time analytics—maintaining pre-aggregated data structures that enable instant queries. You’ll learn how systems achieve sub-second query latency on billions of events.

Core challenges you’ll face:

  • Designing efficient in-memory data structures → maps to columnar arrays with indexes
  • Supporting GROUP BY efficiently → maps to hash-based aggregation
  • Implementing rollups and pre-aggregations → maps to trading storage for query speed
  • Handling high-velocity ingestion → maps to lock-free/concurrent data structures
  • Time-based windowing → maps to sliding windows, tumbling windows

Key Concepts:

Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Projects 1-2, understanding of hash tables, basic concurrency

Real world outcome:

$ ./metrics_aggregator start --port 8080

Metrics Aggregator v1.0
  In-memory storage: 4 GB allocated
  Ingestion port: 8080/tcp
  Query port: 8081/tcp
  Rollup: 1-minute granularity

$ ./metrics_aggregator ingest --file events.json --rate 500000
Ingesting at 500,000 events/sec...
  [████████████████████] 10,000,000 events ingested
  Time: 20.1 seconds
  Memory used: 892 MB

$ curl "http://localhost:8081/query" -d '{
  "aggregations": ["SUM(revenue)", "COUNT(*)"],
  "groupBy": ["country", "product_category"],
  "timeRange": {"start": "2024-01-01", "end": "2024-01-31"},
  "filter": {"country": ["US", "UK", "DE"]}
}'

{
  "queryTime": "12ms",
  "rowsScanned": 8500000,
  "results": [
    {"country": "US", "product_category": "Electronics", "SUM(revenue)": 4521000, "COUNT(*)": 125000},
    {"country": "US", "product_category": "Clothing", "SUM(revenue)": 2103000, "COUNT(*)": 89000},
    {"country": "UK", "product_category": "Electronics", "SUM(revenue)": 1892000, "COUNT(*)": 67000},
    ...
  ]
}

$ ./metrics_aggregator stats
Current state:
  Total events: 10,000,000
  Unique dimensions: {country: 45, product_category: 12, store_id: 2300}
  Memory: 892 MB
  Rollup compression: 8.5x (raw would be 7.6 GB)
  Avg query latency: 15ms

Implementation Hints:

Core data structure:

Segment {
  timestamp_column: Vec<i64>        // Sorted timestamps
  dimension_columns: {
    "country": DictionaryColumn,    // Low-cardinality → dictionary encoded
    "product_id": Vec<u32>,         // Medium cardinality
  }
  metric_columns: {
    "revenue": Vec<f64>,
    "quantity": Vec<i32>,
  }
  indexes: {
    "country": BitmapIndex,         // Fast filtering
    "timestamp": RangeIndex,        // Time range queries
  }
}
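
A minimal Python sketch of hash-based GROUP BY aggregation over such a segment, with plain lists standing in for typed columnar arrays and a hard-coded filter on country (names and shapes are illustrative):

from collections import defaultdict

# Columnar segment: one list per column
segment = {
    "country":  ["US", "US", "UK", "US", "UK", "DE"],
    "category": ["Electronics", "Clothing", "Electronics", "Electronics", "Clothing", "Electronics"],
    "revenue":  [120.0, 40.0, 75.0, 210.0, 55.0, 90.0],
}

def group_by_sum_count(seg, keys, metric, allowed_countries=None):
    """Hash aggregation: one pass over the columns, accumulating SUM and COUNT per key tuple."""
    acc = defaultdict(lambda: [0.0, 0])                      # key tuple -> [sum, count]
    key_cols = [seg[k] for k in keys]
    metric_col = seg[metric]
    for i in range(len(metric_col)):
        if allowed_countries and seg["country"][i] not in allowed_countries:
            continue                                         # toy filter, applied before aggregation
        key = tuple(col[i] for col in key_cols)
        acc[key][0] += metric_col[i]
        acc[key][1] += 1
    return {k: {"SUM": s, "COUNT": c} for k, (s, c) in acc.items()}

print(group_by_sum_count(segment, ["country", "category"], "revenue", allowed_countries={"US", "UK"}))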

Questions to guide you:

  • How do you aggregate across multiple segments concurrently?
  • What’s the memory overhead of a bitmap index per dimension value?
  • When is pre-aggregation (rollup) worth the storage cost?
  • How do you handle late-arriving data in a time-windowed system?

Learning milestones:

  1. Ingest 100K+ events/sec → You understand high-throughput ingestion
  2. GROUP BY queries return in <100ms → You understand hash aggregation
  3. Bitmap indexes accelerate filtered queries → You understand inverted indexes
  4. Time windows roll up correctly → You understand temporal aggregation

Project 6: Streaming Ingestion Pipeline (Kafka to Analytics)

  • File: LEARN_REALTIME_ANALYTICS_DATABASES_DEEP_DIVE.md
  • Main Programming Language: Java
  • Alternative Programming Languages: Go, Python, Scala
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Streaming / Event Processing
  • Software or Tool: Apache Kafka
  • Main Book: “Kafka: The Definitive Guide” by Gwen Shapira et al.

What you’ll build: A streaming ingestion pipeline that consumes events from Kafka, transforms them, and loads them into your analytics engine (from Project 5). You’ll implement exactly-once semantics, handle backpressure, and build a dead-letter queue for failed events.

Why it teaches real-time analytics: Real-time analytics starts with real-time data. Understanding Kafka’s architecture—partitions, consumer groups, offsets, and exactly-once semantics—is essential. This is how ClickHouse, Druid, and Pinot actually get their data.

Core challenges you’ll face:

  • Understanding Kafka partitions and consumers → maps to parallelism and ordering
  • Implementing exactly-once semantics → maps to idempotency and transactions
  • Handling backpressure → maps to when the sink is slower than the source
  • Schema evolution → maps to handling changing data formats
  • Building a DLQ → maps to error handling without blocking

Key Concepts:

Difficulty: Intermediate Time estimate: 1-2 weeks Prerequisites: Basic understanding of message queues, Project 5 helps but not required

Real world outcome:

$ ./kafka_ingester start \
    --bootstrap-servers localhost:9092 \
    --topic events \
    --consumer-group analytics-ingest \
    --sink http://localhost:8080/ingest

Kafka Ingester Started
  Topic: events (12 partitions)
  Consumer group: analytics-ingest
  Parallelism: 12 consumers
  Batch size: 1000 events
  Exactly-once: enabled

[Consumer-0] Partition 0: offset 1,234,567
[Consumer-1] Partition 1: offset 2,345,678
...

=== Stats (last 60 seconds) ===
Events consumed:     1,245,000
Events transformed:  1,244,850
Events loaded:       1,244,850
Events to DLQ:           150 (parse errors)
Throughput:          20,750 events/sec
Avg latency:         45ms (kafka → analytics)
Consumer lag:        2,340 events

=== Backpressure Event ===
[WARN] Sink slow, applying backpressure...
  Pausing consumers for 500ms
  Resuming...
  Lag recovered.

$ ./kafka_ingester dlq --show
Dead Letter Queue (150 events):
  - Offset 1234599: Invalid JSON at position 45
  - Offset 1234612: Missing required field 'timestamp'
  - Offset 1234689: Schema mismatch (expected int, got string)

Implementation Hints:

Consumer group coordination:

Topic: events (12 partitions)
Consumer Group: analytics-ingest

┌─────────┐   ┌─────────┐   ┌─────────┐
│Consumer1│   │Consumer2│   │Consumer3│
│ P0, P1  │   │ P2, P3  │   │ P4, P5  │  ...
│ P6, P7  │   │ P8, P9  │   │P10, P11 │
└─────────┘   └─────────┘   └─────────┘

If Consumer2 dies → P2,P3,P8,P9 rebalanced to 1 & 3
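
A minimal consumer-loop sketch, assuming the confluent-kafka Python client; topic, group, and batch-size values are illustrative, and the sink is a placeholder. With manual commits after a successful load this gives at-least-once delivery; exactly-once additionally needs an idempotent sink or Kafka transactions:

import json
from confluent_kafka import Consumer   # assumes the confluent-kafka package is installed

conf = {
    "bootstrap.servers": "localhost:9092",
    "group.id": "analytics-ingest",
    "enable.auto.commit": False,        # commit only after the batch is safely loaded
    "auto.offset.reset": "earliest",
}
consumer = Consumer(conf)
consumer.subscribe(["events"])
dead_letter = []                        # stand-in for a real DLQ topic

def load_into_analytics(batch):
    """Placeholder sink; a real pipeline would POST to the Project 5 ingest endpoint."""
    pass

try:
    batch = []
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            continue                    # real code would distinguish fatal vs transient errors
        try:
            batch.append(json.loads(msg.value()))
        except (ValueError, TypeError):
            dead_letter.append((msg.partition(), msg.offset()))   # park poison messages
        if len(batch) >= 1000:
            load_into_analytics(batch)
            consumer.commit(asynchronous=False)   # offsets advance only after a successful load
            batch = []
finally:
    consumer.close()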

Questions to guide you:

  • Why does Kafka guarantee ordering only within a partition?
  • How do you commit offsets to avoid reprocessing after restart?
  • What happens if your sink throws an error mid-batch?
  • How do you detect and handle poison messages?
  • What’s the difference between at-least-once, at-most-once, and exactly-once?

Learning milestones:

  1. Basic consumer processes messages → You understand Kafka consumption
  2. Consumer group rebalances correctly → You understand partition assignment
  3. Exactly-once: no duplicates after restart → You understand offset management
  4. DLQ captures bad messages without blocking → You understand error handling

Project 7: Vectorized Query Execution Engine

  • File: LEARN_REALTIME_ANALYTICS_DATABASES_DEEP_DIVE.md
  • Main Programming Language: Rust
  • Alternative Programming Languages: C++, C, Go
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Query Execution / CPU Optimization
  • Software or Tool: ClickHouse-style engine
  • Main Book: “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron

What you’ll build: A vectorized query execution engine that processes data in batches (vectors) of 1024+ values, leveraging SIMD (Single Instruction, Multiple Data) instructions to achieve billions of rows/second throughput.

Why it teaches real-time analytics: This is the performance secret of modern OLAP databases. ClickHouse processes billions of rows per second because of vectorized execution. You’ll understand why batch processing is fundamentally faster than row-at-a-time.

Core challenges you’ll face:

  • Batch processing architecture → maps to volcano model vs vectorized execution
  • SIMD operations → maps to processing 8-16 values per CPU instruction
  • Memory alignment → maps to cache-line efficiency and prefetching
  • Predicate pushdown → maps to filtering before aggregation
  • Pipeline fusion → maps to reducing intermediate materialization

Key Concepts:

  • Vectorized Execution: MonetDB/X100 Paper - Sections 1-4
  • SIMD Programming: “Computer Systems: A Programmer’s Perspective” Chapter 5 - Bryant & O’Hallaron
  • Cache Optimization: “What Every Programmer Should Know About Memory” - Ulrich Drepper
  • Query Execution: CMU 15-721 Advanced Database Systems - Vectorization lectures

Difficulty: Expert Time estimate: 3-4 weeks Prerequisites: Projects 1-5, comfortable with systems programming, understanding of CPU architecture, SIMD concepts

Real world outcome:

$ ./vectorized_engine bench --rows 1000000000 --query "SELECT SUM(value) WHERE flag = 1"

=== Vectorized Query Execution Benchmark ===
Dataset: 1,000,000,000 rows (8 GB)

Query: SELECT SUM(value) WHERE flag = 1

Execution Plan:
  → Scan(value, flag)
  → Filter(flag = 1)         [SIMD: 16 values/instruction]
  → Aggregate(SUM(value))    [SIMD: 8 values/instruction]

=== Row-at-a-time (baseline) ===
Time: 45.2 seconds
Throughput: 22.1 million rows/sec
CPU utilization: 12%

=== Vectorized (no SIMD) ===
Time: 4.8 seconds
Throughput: 208 million rows/sec
Speedup: 9.4x
CPU utilization: 45%

=== Vectorized + SIMD (AVX2) ===
Time: 0.89 seconds
Throughput: 1.12 billion rows/sec
Speedup: 50.8x
CPU utilization: 98%

=== Detailed Stats ===
Vector size: 1024 values
Vectors processed: 976,563
L1 cache misses: 0.2%
L2 cache misses: 0.01%
SIMD lanes utilized: 15.7 / 16 (98.1%)

Implementation Hints:

Vectorized vs row-at-a-time:

Row-at-a-time (Volcano model):
for each row:
    value = get_value(row)
    flag = get_flag(row)
    if flag == 1:          # Branch for EVERY row
        sum += value

Vectorized:
value_vector = get_values(batch_of_1024)
flag_vector = get_flags(batch_of_1024)
mask = simd_compare_eq(flag_vector, 1)  # 16 comparisons at once!
sum += simd_masked_sum(value_vector, mask)
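
Python can't issue SIMD intrinsics directly, but NumPy's array operations run tight native loops (often SIMD-accelerated), so a quick way to feel the row-at-a-time vs. batch gap before dropping to Rust or C is a sketch like this (sizes are illustrative):

import time
import numpy as np

n = 2_000_000
values = np.random.rand(n)
flags = np.random.randint(0, 2, size=n, dtype=np.int8)

# Row-at-a-time: one branch and at least one interpreter-level call per element
start = time.perf_counter()
total = 0.0
for i in range(n):
    if flags[i] == 1:
        total += values[i]
row_time = time.perf_counter() - start

# Batch-at-a-time: build a selection mask for the whole array, then a masked sum (no per-row branch)
start = time.perf_counter()
mask = flags == 1
vec_total = values[mask].sum()
vec_time = time.perf_counter() - start

print(f"row-at-a-time: {row_time:.2f}s  vectorized: {vec_time:.4f}s  "
      f"speedup: {row_time / vec_time:.0f}x  results match: {np.isclose(total, vec_total)}")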

Questions to guide you:

  • Why is branching expensive in tight loops? (Branch misprediction)
  • How does SIMD allow “branchless” filtering? (Masks)
  • Why does vector size matter? (Too small = overhead, too large = cache misses)
  • How do you handle the tail when row count isn’t divisible by vector size?

SIMD intrinsics to explore:

  • _mm256_loadu_pd - Load 4 doubles
  • _mm256_cmp_pd - Compare 4 doubles
  • _mm256_add_pd - Add 4 doubles
  • _mm256_maskstore_pd - Conditional store

Learning milestones:

  1. Batch processing shows measurable speedup → You understand loop overhead reduction
  2. SIMD code runs 4-8x faster than scalar → You understand data parallelism
  3. Cache utilization approaches 100% → You understand memory hierarchy
  4. Complex queries (filter + aggregate) vectorize → You understand operator fusion

Project 8: SQL Parser and Query Planner

  • File: LEARN_REALTIME_ANALYTICS_DATABASES_DEEP_DIVE.md
  • Main Programming Language: Rust
  • Alternative Programming Languages: Go, Python, Java
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Query Processing / Compilers
  • Software or Tool: SQL Engine
  • Main Book: “Language Implementation Patterns” by Terence Parr

What you’ll build: A SQL parser that converts SQL strings into an Abstract Syntax Tree (AST), and a query planner that transforms the AST into an optimized execution plan. Support SELECT, WHERE, GROUP BY, ORDER BY, and JOIN.

Why it teaches real-time analytics: SQL is the universal language of analytics. Understanding how SELECT SUM(x) FROM t WHERE y > 10 GROUP BY z becomes an execution plan reveals the translation layer between human intent and machine execution.

Core challenges you’ll face:

  • Lexical analysis (tokenizing) → maps to breaking SQL into tokens
  • Parsing (AST generation) → maps to recursive descent or parser generators
  • Semantic analysis → maps to type checking, name resolution
  • Query planning → maps to choosing join order, predicate pushdown
  • Plan optimization → maps to cost-based decisions

Key Concepts:

  • Recursive Descent Parsing: “Language Implementation Patterns” Chapters 2-4 - Terence Parr
  • Query Planning: “Database Internals” Chapter 10 - Alex Petrov
  • Cost-Based Optimization: CMU 15-445 Query Planning - Lecture notes
  • AST Design: “Engineering a Compiler” Chapter 4 - Cooper & Torczon

Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Basic understanding of compilers/parsers, Project 7 helps for execution

Real world outcome:

$ ./sql_engine parse "SELECT country, SUM(revenue) FROM sales WHERE year = 2024 GROUP BY country ORDER BY SUM(revenue) DESC LIMIT 10"

=== Tokens ===
SELECT, IDENTIFIER(country), COMMA, FUNCTION(SUM), LPAREN, IDENTIFIER(revenue), RPAREN, FROM, IDENTIFIER(sales), WHERE, IDENTIFIER(year), EQUALS, INTEGER(2024), GROUP, BY, IDENTIFIER(country), ORDER, BY, FUNCTION(SUM), LPAREN, IDENTIFIER(revenue), RPAREN, DESC, LIMIT, INTEGER(10)

=== Abstract Syntax Tree ===
Select {
  columns: [
    ColumnRef(country),
    FunctionCall(SUM, [ColumnRef(revenue)])
  ],
  from: TableRef(sales),
  where: BinaryOp(
    Equals,
    ColumnRef(year),
    Literal(2024)
  ),
  group_by: [ColumnRef(country)],
  order_by: [
    OrderTerm(FunctionCall(SUM, [ColumnRef(revenue)]), DESC)
  ],
  limit: 10
}

=== Logical Plan ===
Limit(10)
  └── Sort(SUM(revenue) DESC)
        └── Aggregate(GROUP BY country, SUM(revenue))
              └── Filter(year = 2024)
                    └── Scan(sales)

=== Optimized Physical Plan ===
Limit(10)
  └── TopN(10, SUM(revenue) DESC)      // Optimization: Sort→Limit becomes TopN
        └── HashAggregate(country, SUM(revenue))
              └── Filter(year = 2024)   // Predicate pushdown
                    └── ColumnScan(sales, [country, revenue, year])  // Column pruning

Optimizations applied:
  ✓ Column pruning: reading 3/15 columns
  ✓ Predicate pushdown: filter before aggregate
  ✓ Sort elimination: TopN instead of full sort

Implementation Hints:

SQL grammar (simplified):

query       → SELECT columns FROM table [WHERE expr] [GROUP BY cols] [ORDER BY cols] [LIMIT n]
columns     → column ("," column)*
column      → expression [AS alias]
expression  → term ((+ | -) term)*
term        → factor ((* | /) factor)*
factor      → literal | identifier | function_call | "(" expression ")"
function_call → identifier "(" arguments ")"
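
A minimal recursive-descent sketch in Python for just the expression/term/factor part of the grammar above; token names and the tuple-based AST are illustrative choices:

import re

TOKEN_RE = re.compile(r"\s*(?:(\d+)|([A-Za-z_]\w*)|(.))")

def tokenize(text):
    tokens = []
    for num, ident, op in TOKEN_RE.findall(text):
        if num:
            tokens.append(("INT", int(num)))
        elif ident:
            tokens.append(("IDENT", ident))
        elif op.strip():
            tokens.append(("OP", op))
    return tokens

class Parser:
    """Recursive descent over:  expression → term ((+|-) term)* ;  term → factor ((*|/) factor)*"""
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else (None, None)

    def next(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expression(self):
        node = self.term()
        while self.peek() in (("OP", "+"), ("OP", "-")):
            op = self.next()[1]
            node = ("BinaryOp", op, node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.peek() in (("OP", "*"), ("OP", "/")):
            op = self.next()[1]
            node = ("BinaryOp", op, node, self.factor())
        return node

    def factor(self):
        kind, value = self.next()
        if kind == "INT":
            return ("Literal", value)
        if kind == "IDENT":
            return ("ColumnRef", value)
        if (kind, value) == ("OP", "("):
            node = self.expression()
            self.next()                      # consume ")"
            return node
        raise SyntaxError(f"unexpected token {value!r}")

print(Parser(tokenize("revenue + 2 * (price - cost)")).expression())
# ('BinaryOp', '+', ('ColumnRef', 'revenue'),
#   ('BinaryOp', '*', ('Literal', 2), ('BinaryOp', '-', ('ColumnRef', 'price'), ('ColumnRef', 'cost'))))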

Questions to guide you:

  • How do you handle operator precedence (e.g., 1 + 2 * 3)?
  • What’s the difference between logical and physical plans?
  • When should predicate pushdown NOT be applied?
  • How do you represent aggregate functions in the AST?

Learning milestones:

  1. Tokenizer handles all SQL keywords → You understand lexical analysis
  2. Parser builds correct AST for complex queries → You understand grammars
  3. Planner produces executable plans → You understand query semantics
  4. Optimizer improves plan quality → You understand cost-based decisions

Project 9: Distributed Query Coordinator

  • File: LEARN_REALTIME_ANALYTICS_DATABASES_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, Java, C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: Distributed Systems / Query Routing
  • Software or Tool: Distributed Analytics Cluster
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A distributed query coordinator that routes queries to multiple analytics nodes, collects partial results, and merges them into final results. Support partition pruning, parallel execution, and failure handling.

Why it teaches real-time analytics: Production OLAP systems are distributed. ClickHouse, Druid, and Pinot all have coordinators that orchestrate queries across shards. Understanding this layer reveals how analytics scales horizontally.

Core challenges you’ll face:

  • Query routing → maps to which shards have relevant data
  • Partition pruning → maps to skipping irrelevant partitions
  • Parallel aggregation → maps to partial vs final aggregates
  • Failure handling → maps to retries, timeouts, partial results
  • Result merging → maps to ORDER BY, LIMIT across shards

Key Concepts:

  • Distributed Query Processing: “Designing Data-Intensive Applications” Chapter 6 - Martin Kleppmann
  • Two-Phase Aggregation: Druid Query Execution
  • Partition Strategies: “Database Internals” Chapter 12 - Alex Petrov
  • RPC Frameworks: gRPC Documentation

Difficulty: Expert Time estimate: 3-4 weeks Prerequisites: Projects 5-8, understanding of distributed systems concepts, RPC/networking

Real world outcome:

$ ./query_coordinator start --config cluster.yaml

Query Coordinator Started
  Nodes: 5 (node1:8080, node2:8080, node3:8080, node4:8080, node5:8080)
  Partitioning: hash(user_id) % 5
  Replication factor: 2

$ ./query_coordinator query "SELECT country, SUM(revenue) FROM events WHERE date = '2024-01-15' GROUP BY country ORDER BY SUM(revenue) DESC LIMIT 10"

=== Query Execution Plan ===
Scatter phase:
  → node1: partition 0 (50M rows, date partition match)
  → node2: partition 1 (48M rows, date partition match)
  → node3: partition 2 (52M rows, date partition match)
  → node4: partition 3 (49M rows, date partition match)
  → node5: partition 4 (51M rows, date partition match)

Partition pruning: Skipped 30 date partitions (95% data skipped)

Gather phase:
  Collect partial aggregates from 5 nodes
  Merge SUM(revenue) per country
  Global sort by revenue
  Apply LIMIT 10

=== Execution Stats ===
Scatter time: 12ms (parallel)
Node execution times: [45ms, 52ms, 48ms, 51ms, 47ms]
Gather time: 3ms
Total query time: 67ms

Results:
┌─────────────┬──────────────┐
│  country    │ SUM(revenue) │
├─────────────┼──────────────┤
│ USA         │   12,453,000 │
│ Germany     │    8,234,000 │
│ UK          │    7,891,000 │
│ France      │    5,432,000 │
│ Japan       │    4,321,000 │
...
└─────────────┴──────────────┘

$ ./query_coordinator status
Cluster Health:
  node1: healthy (response: 2ms)
  node2: healthy (response: 3ms)
  node3: degraded (response: 150ms)
  node4: healthy (response: 2ms)
  node5: unreachable (last seen: 5 minutes ago)

Failover: Queries to partition 4 routed to node3 (replica)

Implementation Hints:

Two-phase aggregation:

Phase 1 (Scatter): Each node computes partial aggregates
  node1: {USA: 2000, Germany: 1500}
  node2: {USA: 3000, UK: 2000}
  node3: {Germany: 1000, France: 1500}

Phase 2 (Gather): Coordinator merges
  USA: 2000 + 3000 = 5000
  Germany: 1500 + 1000 = 2500
  UK: 2000
  France: 1500
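
A minimal Python sketch of the gather phase for this example, merging per-shard partial sums and then applying the global ORDER BY ... LIMIT (the partials and the limit are illustrative):

import heapq
from collections import defaultdict

# Partial SUM(revenue) per country, as returned by each shard in the scatter phase
partials = [
    {"USA": 2000, "Germany": 1500},
    {"USA": 3000, "UK": 2000},
    {"Germany": 1000, "France": 1500},
]

def gather(partials, limit):
    """Gather phase: merge partial sums, then apply the global ORDER BY ... LIMIT."""
    merged = defaultdict(int)
    for shard_result in partials:
        for country, subtotal in shard_result.items():
            merged[country] += subtotal           # SUM is additive, so partials simply add up
    # Exact only because every shard returned *all* of its groups, not just its local top N
    return heapq.nlargest(limit, merged.items(), key=lambda kv: kv[1])

print(gather(partials, limit=3))
# [('USA', 5000), ('Germany', 2500), ('UK', 2000)]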

Questions to guide you:

  • How do you merge ORDER BY + LIMIT across shards efficiently?
  • What if one node times out? Return partial results or fail?
  • How do you handle COUNT(DISTINCT) across shards? (Hint: it’s hard!)
  • When should you push predicates to shards vs filter at coordinator?

Learning milestones:

  1. Queries scatter to multiple nodes → You understand query distribution
  2. Partition pruning skips irrelevant shards → You understand metadata-based optimization
  3. Aggregates merge correctly → You understand two-phase aggregation
  4. Failover works transparently → You understand fault tolerance

Project 10: Raft Consensus for Data Replication

  • File: LEARN_REALTIME_ANALYTICS_DATABASES_DEEP_DIVE.md
  • Main Programming Language: Rust
  • Alternative Programming Languages: Go, C++, Java
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 5: Master
  • Knowledge Area: Distributed Consensus / Replication
  • Software or Tool: Raft Implementation
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A complete Raft consensus implementation for replicating analytics data across nodes. Implement leader election, log replication, and snapshotting to ensure data survives node failures.

Why it teaches real-time analytics: High availability is non-negotiable for production systems. Understanding Raft reveals how ClickHouse, CockroachDB, and TiDB maintain consistency across replicas. This is the foundation of reliable distributed databases.

Core challenges you’ll face:

  • Leader election → maps to terms, votes, and timeouts
  • Log replication → maps to AppendEntries RPC, commit index
  • Safety guarantees → maps to election restriction, log matching
  • Snapshotting → maps to compacting logs, state transfer
  • Cluster membership changes → maps to joint consensus

Key Concepts:

Difficulty: Master Time estimate: 4-6 weeks Prerequisites: Strong understanding of distributed systems, Projects 4 and 9, comfort with concurrent programming

Real world outcome:

$ ./raft_cluster start --nodes 5

=== Raft Cluster Started ===
Nodes: node1:5001, node2:5002, node3:5003, node4:5004, node5:5005

[node1] State: Follower, Term: 0
[node2] State: Follower, Term: 0
[node3] State: Follower, Term: 0
[node4] State: Follower, Term: 0
[node5] State: Follower, Term: 0

[node3] Election timeout, becoming Candidate
[node3] RequestVote sent to all nodes
[node3] Received votes: 3/5 (majority)
[node3] State: Leader, Term: 1

$ ./raft_cluster write "INSERT INTO events VALUES (1, 'click', 1234567890)"

[Leader:node3] Appending entry: {index: 1, term: 1, cmd: "INSERT..."}
[Leader:node3] Replicating to followers...
  → node1: AppendEntries(prevIdx=0, entries=[1])
  → node2: AppendEntries(prevIdx=0, entries=[1])
  → node4: AppendEntries(prevIdx=0, entries=[1])
  → node5: AppendEntries(prevIdx=0, entries=[1])
[Leader:node3] Entry 1 committed (replicated to 5/5 nodes)
[Leader:node3] Applied to state machine: INSERT...

$ ./raft_cluster kill node3  # Kill the leader

[node3] TERMINATED

[node1] Leader heartbeat timeout, becoming Candidate
[node1] Term: 2, RequestVote sent
[node1] Received votes: 3/4 reachable nodes (still a majority of the full 5-node cluster)
[node1] State: Leader, Term: 2

$ ./raft_cluster status
Cluster Status:
  Leader: node1 (term 2)
  Followers: node2, node4, node5
  Unavailable: node3

  Log replication:
    node1: [1, 2, 3, 4, 5] commit=5 applied=5
    node2: [1, 2, 3, 4, 5] commit=5 applied=5
    node4: [1, 2, 3, 4, 5] commit=5 applied=5
    node5: [1, 2, 3, 4] commit=4 applied=4  # Catching up

$ ./raft_cluster restart node3

[node3] Rejoining cluster...
[node3] Discovered leader: node1 (term 2)
[node3] Log behind, receiving entries...
[node3] Caught up to commit index 5
[node3] State: Follower (healthy)

Implementation Hints:

Raft state machine:

           timeout          receive higher term
Follower ────────→ Candidate ────────────────→ Follower
    ↑                  │
    │                  │ majority votes
    │                  ▼
    └───────────── Leader
       (on higher term)

Log structure:

Index:   1      2      3      4      5
Term:    1      1      1      2      2
Command: SET    SET    SET    SET    SET
         x=1    y=2    z=3    x=5    y=6
                        ↑
                   Commit Index
         (applied to state machine)
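
A minimal Python sketch of the follower-side AppendEntries handling described in the Raft paper, with a list of dicts standing in for the persisted log (RPC plumbing, term checks on the receiver, and persistence are omitted):

def append_entries(log, prev_index, prev_term, entries, leader_commit, commit_index):
    """Follower-side AppendEntries (indices are 1-based; log[i-1] holds entry i).
    Returns (success, new_commit_index)."""
    # Log matching check: the follower must already hold the entry the leader says precedes these
    if prev_index > 0:
        if prev_index > len(log) or log[prev_index - 1]["term"] != prev_term:
            return False, commit_index          # leader will retry with an earlier prev_index
    # Delete any conflicting suffix, then append the leader's entries
    for i, entry in enumerate(entries):
        idx = prev_index + i + 1
        if idx <= len(log) and log[idx - 1]["term"] != entry["term"]:
            del log[idx - 1:]
        if idx > len(log):
            log.append(entry)
    # Advance the commit index, but never past what this follower actually has
    if leader_commit > commit_index:
        commit_index = min(leader_commit, len(log))
    return True, commit_index

log = [{"term": 1, "cmd": "SET x=1"}]
ok, commit = append_entries(log, prev_index=1, prev_term=1,
                            entries=[{"term": 2, "cmd": "SET y=2"}],
                            leader_commit=2, commit_index=1)
print(ok, commit, [e["cmd"] for e in log])      # True 2 ['SET x=1', 'SET y=2']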

Questions to guide you:

  • Why can’t a node vote for multiple candidates in the same term?
  • What happens if a leader dies before an entry is committed?
  • Why must leaders have all committed entries? (Election restriction)
  • How does snapshotting prevent unbounded log growth?

Learning milestones:

  1. Leader election works with timeouts → You understand term-based voting
  2. Log replication achieves consensus → You understand AppendEntries
  3. System survives leader failure → You understand fault tolerance
  4. Snapshotting compacts logs → You understand state machine snapshots

Project 11: Time-Series Compression Engine (Gorilla)

  • File: LEARN_REALTIME_ANALYTICS_DATABASES_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++, Go
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Time-Series / Compression
  • Software or Tool: Facebook Gorilla-style Engine
  • Main Book: “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron

What you’ll build: Implement Facebook’s Gorilla compression algorithm—the technique that achieves 12x compression on time-series data by exploiting timestamp regularity and value similarity. Your engine will compress and decompress time-series data in streaming fashion.

Why it teaches real-time analytics: Time-series is the dominant data type in observability (metrics, logs, traces). Gorilla compression enabled Facebook to store 10x more data in memory, dramatically improving query performance. This technique is used by TimescaleDB, InfluxDB, Prometheus, and many others.

Core challenges you’ll face:

  • Delta-of-delta timestamp encoding → maps to exploiting regular intervals
  • XOR-based value compression → maps to exploiting IEEE 754 float structure
  • Bit-level I/O → maps to reading/writing individual bits
  • Streaming compression → maps to compress without seeing full dataset
  • Random access within blocks → maps to block-based organization

Key Concepts:

Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Project 2, strong understanding of binary representation, bit manipulation

Real world outcome:

$ ./gorilla_engine compress --input metrics.csv --output metrics.gorilla

=== Gorilla Compression Results ===
Input: metrics.csv
  Rows: 10,000,000
  Columns: timestamp (int64), value (float64)
  Raw size: 160 MB (16 bytes/point)

Timestamp compression:
  Delta-of-delta distribution:
    0 (same delta): 96.2%    → 1 bit each
    Small delta: 3.1%         → 9 bits each
    Large delta: 0.7%         → 36 bits each
  Timestamp data: 1.9 MB (1.5 bits/timestamp avg)

Value compression:
  XOR distribution:
    Same value: 23.4%        → 1 bit each
    Same leading/trailing: 68.2% → ~12 bits each
    Different: 8.4%          → ~72 bits each
  Value data: 18.1 MB (14.5 bits/value avg)

Output: metrics.gorilla
  Compressed size: 20.0 MB
  Compression ratio: 8.0x
  Bits per point: 16.0 (vs 128 raw)

$ ./gorilla_engine query --file metrics.gorilla --range "2024-01-01 00:00:00" "2024-01-01 01:00:00"

Time range: 1 hour
Points returned: 3,600 (one point per second)
Decompression time: 0.23 ms
Throughput: 15.6 million points/sec

Sample output:
  2024-01-01 00:00:00 | 23.45
  2024-01-01 00:00:01 | 23.47
  2024-01-01 00:00:02 | 23.46
  ...

Implementation Hints:

Delta-of-delta encoding:

Timestamps: [1000, 1060, 1120, 1180, 1240, 1241]
Deltas:     [1000,   60,   60,   60,   60,    1]
D-of-D:     [1000,   60,    0,    0,    0,  -59]

Encoding:
  - D-of-D = 0: Write '0' (1 bit)
  - D-of-D in [-63, 64]: Write '10' + 7 bits
  - D-of-D in [-255, 256]: Write '110' + 9 bits
  - D-of-D in [-2047, 2048]: Write '1110' + 12 bits
  - Otherwise: Write '1111' + 32 bits

XOR value compression:

Values: [23.45, 23.47, 23.46]
XOR with previous:
  v1: 23.45 (store full 64 bits for first value)
  v2 XOR v1: 0x0000000000002000 (only a few bits differ!)
  v3 XOR v2: 0x0000000000001000

Encoding:
  - XOR = 0: Write '0' (1 bit)
  - Same leading zeros as previous: Write '10' + meaningful bits
  - Different: Write '11' + leading zero count + length + bits
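
A minimal Python sketch of the XOR step, reinterpreting float64 values as raw IEEE 754 bit patterns and reporting the leading-zero count and meaningful bits a Gorilla-style encoder would write (the sample values are illustrative):

import struct

def float_bits(x):
    """Reinterpret a float64 as its raw 64-bit pattern (IEEE 754)."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def xor_block(prev, curr):
    """Return (xor, leading_zeros, meaningful_bits) as a Gorilla-style value encoder sees them."""
    x = float_bits(prev) ^ float_bits(curr)
    if x == 0:
        return 0, 64, 0                          # identical value → a single '0' bit on the wire
    leading = 64 - x.bit_length()                # zero bits before the first differing bit
    trailing = (x & -x).bit_length() - 1         # zero bits after the last differing bit
    return x, leading, 64 - leading - trailing

for prev, curr in [(23.5, 23.5), (23.5, 23.75), (23.75, 23.625)]:
    x, lead, meaningful = xor_block(prev, curr)
    print(f"{prev} -> {curr}: xor={x:016x} leading_zeros={lead} meaningful_bits={meaningful}")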

Questions to guide you:

  • Why does delta-of-delta work so well on regular timestamps?
  • How do you handle the first value in each block?
  • What’s the worst case for Gorilla compression?
  • How do you implement efficient bit-level reading/writing?

Learning milestones:

  1. Delta-of-delta achieves >10x on timestamps → You understand temporal patterns
  2. XOR compression works on floats → You understand IEEE 754 exploitation
  3. Streaming compression maintains state → You understand block-based encoding
  4. Decompression is fast → You understand the decode path

Project 12: Materialized View Engine

  • File: LEARN_REALTIME_ANALYTICS_DATABASES_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, Java, Python
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Query Optimization / Caching
  • Software or Tool: Materialized View System
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A materialized view engine that automatically maintains pre-computed aggregations as new data arrives. Support incremental refresh (only process new data) and view composition (views based on views).

Why it teaches real-time analytics: Materialized views are how real-time dashboards achieve sub-second latency. Instead of computing aggregations at query time, they’re maintained continuously. ClickHouse, Druid, and Pinot all rely heavily on this technique.

Core challenges you’ll face:

  • Incremental maintenance → maps to computing deltas, not full refreshes
  • Aggregation functions → maps to which functions can be incrementally maintained
  • View dependencies → maps to cascading updates
  • Consistency guarantees → maps to exactly-once view updates
  • Memory management → maps to bounding view size

Key Concepts:

  • Materialized Views: “Designing Data-Intensive Applications” Chapter 3 - Martin Kleppmann
  • Incremental View Maintenance: Materialize IVM Documentation
  • Aggregation Theory: “Database Internals” Chapter 11 - Alex Petrov
  • OLAP Cubes: Apache Druid Rollup

Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Projects 5-6, understanding of aggregation functions

Real world outcome:

$ ./matview_engine start

Materialized View Engine Started
  Storage: RocksDB backend
  Refresh: Incremental (streaming)

$ ./matview_engine create-view "
  CREATE MATERIALIZED VIEW hourly_revenue AS
  SELECT
    date_trunc('hour', timestamp) AS hour,
    country,
    SUM(revenue) AS total_revenue,
    COUNT(*) AS order_count
  FROM orders
  GROUP BY 1, 2
"

View 'hourly_revenue' created
  Dimensions: 2 (hour, country)
  Metrics: 2 (SUM(revenue), COUNT(*))
  Estimated memory: 50 MB

$ ./matview_engine ingest --table orders --file orders_batch.json
Ingested 100,000 rows to 'orders'
View 'hourly_revenue' incrementally updated:
  New rows processed: 100,000
  View rows updated: 45 (dimensions touched)
  Update time: 12ms

$ ./matview_engine query "SELECT * FROM hourly_revenue WHERE country = 'US' ORDER BY hour DESC LIMIT 5"

Query executed against materialized view (not base table!)
Time: 0.8ms

┌─────────────────────┬─────────┬───────────────┬─────────────┐
│        hour         │ country │ total_revenue │ order_count │
├─────────────────────┼─────────┼───────────────┼─────────────┤
│ 2024-01-15 14:00:00 │ US      │     125,430   │      1,245  │
│ 2024-01-15 13:00:00 │ US      │     118,920   │      1,189  │
│ 2024-01-15 12:00:00 │ US      │     142,100   │      1,421  │
│ 2024-01-15 11:00:00 │ US      │     134,560   │      1,345  │
│ 2024-01-15 10:00:00 │ US      │      98,230   │        982  │
└─────────────────────┴─────────┴───────────────┴─────────────┘

$ ./matview_engine stats hourly_revenue
View: hourly_revenue
  Rows: 12,450
  Memory: 2.3 MB
  Last refresh: 2024-01-15 14:32:01
  Total updates: 1,245,000
  Avg update latency: 0.012ms

Implementation Hints:

Incremental aggregation:

Existing view:
  (hour=14:00, country=US) → {revenue: 100000, count: 1000}

New batch:
  (timestamp=14:30, country=US, revenue=500)
  (timestamp=14:31, country=US, revenue=600)

Incremental update:
  (hour=14:00, country=US) → {revenue: 100000+500+600, count: 1000+2}
                           = {revenue: 101100, count: 1002}
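
A minimal Python sketch of that incremental update, assuming a dict keyed by (hour, country) as the view state and SUM/COUNT as the only maintained aggregates (field names are illustrative):

from collections import defaultdict
from datetime import datetime

# View state: (hour, country) -> aggregates that can be maintained incrementally
view = defaultdict(lambda: {"revenue": 0.0, "count": 0})
view[(datetime(2024, 1, 15, 14, 0), "US")] = {"revenue": 100000.0, "count": 1000}

def apply_batch(view, rows):
    """Incremental refresh: fold only the new rows into the view rows they touch."""
    touched = set()
    for row in rows:
        hour = row["timestamp"].replace(minute=0, second=0, microsecond=0)  # date_trunc('hour', ...)
        key = (hour, row["country"])
        view[key]["revenue"] += row["revenue"]    # SUM: add the delta
        view[key]["count"] += 1                   # COUNT: add the number of new rows
        touched.add(key)
    return touched

batch = [
    {"timestamp": datetime(2024, 1, 15, 14, 30), "country": "US", "revenue": 500.0},
    {"timestamp": datetime(2024, 1, 15, 14, 31), "country": "US", "revenue": 600.0},
]
print(apply_batch(view, batch), dict(view))
# one view row touched: (2024-01-15 14:00, 'US') now holds revenue=101100.0, count=1002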

Which aggregations can be incrementally maintained?

  • ✅ SUM: add new values
  • ✅ COUNT: add count of new rows
  • ✅ MIN/MAX: compare with new values
  • ❌ MEDIAN: requires full dataset
  • ⚠️ AVG: maintain SUM and COUNT separately

Questions to guide you:

  • How do you handle deletes in incremental maintenance?
  • What if a dimension has unbounded cardinality?
  • How do you route incoming data to the right view rows?
  • When should you trigger a full refresh vs incremental?

Learning milestones:

  1. SUM/COUNT views update incrementally → You understand additive aggregations
  2. Views refresh in milliseconds → You understand incremental vs full refresh
  3. Queries hit views instead of base tables → You understand query rewriting
  4. Memory is bounded for high-cardinality dimensions → You understand eviction

Project 13: Approximate Query Processing (HyperLogLog & Quantiles)

  • File: LEARN_REALTIME_ANALYTICS_DATABASES_DEEP_DIVE.md
  • Main Programming Language: C++
  • Alternative Programming Languages: Rust, Go, Python
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Probabilistic Data Structures / Approximation
  • Software or Tool: Approximate Analytics Engine
  • Main Book: “Algorithms” by Sedgewick & Wayne

What you’ll build: Implement probabilistic data structures for approximate analytics: HyperLogLog for COUNT(DISTINCT), t-digest for quantiles/percentiles, and Count-Min Sketch for frequency estimation. Achieve 1000x memory savings with <1% error.

Why it teaches real-time analytics: Exact COUNT(DISTINCT) on a billion values requires storing all billion values. HyperLogLog does it in 12KB. These algorithms enable analytics at scale that would otherwise be impossible. ClickHouse, Redis, and BigQuery all use these techniques.

Core challenges you’ll face:

  • HyperLogLog implementation → maps to hashing and counting leading zeros
  • t-digest for quantiles → maps to clustering and merging
  • Count-Min Sketch → maps to hash tables with collision tolerance
  • Error bounds → maps to understanding accuracy vs memory tradeoffs
  • Mergeability → maps to combining sketches from distributed nodes

Key Concepts:

Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Understanding of hash functions, basic probability/statistics

Real world outcome:

$ ./approx_engine bench --data user_events.parquet --rows 1000000000

=== Approximate Query Processing Benchmark ===
Dataset: 1 billion user events
Columns: user_id (1M unique), event_type (100 unique), value (float)

=== COUNT(DISTINCT user_id) ===

Exact:
  Memory: ~64 MB (hash set of 1M 64-bit IDs, plus overhead)
  Time: 45 seconds
  Result: 1,000,000

HyperLogLog (precision 14):
  Memory: 16 KB (16,384 registers)
  Time: 3.2 seconds
  Result: 1,003,421
  Error: 0.34%
  Memory savings: ~4,000x

=== P99 Latency (99th percentile of value) ===

Exact (sort-based):
  Memory: 8 GB (all values)
  Time: 120 seconds
  Result: 4523.45ms

t-digest (compression 100):
  Memory: 24 KB
  Time: 8.5 seconds
  Result: 4518.92ms
  Error: 0.10%
  Memory savings: 333,000x

=== Top-K Event Types ===

Exact (hash map counting):
  Memory: 100 KB
  Time: 25 seconds
  Top 5: [(click, 452M), (view, 321M), (purchase, 89M), ...]

Count-Min Sketch (width=10000, depth=5):
  Memory: 400 KB
  Time: 4.2 seconds
  Top 5: [(click, ~451M), (view, ~323M), (purchase, ~90M), ...]
  Max overcount: 2.1%

=== Merging Distributed Sketches ===
Merging HyperLogLog from 10 nodes: 160 KB → 16 KB (merged)
Result: 1,002,892 unique users (0.29% error)
Merge time: 0.03ms

Implementation Hints:

HyperLogLog algorithm:

For each value x:
  1. h = hash(x)                        // 64-bit hash
  2. bucket = first 14 bits of h        // 16,384 buckets
  3. rest = the remaining 50 bits of h
  4. rho = 1 + number of leading zeros in rest (i.e., 1-indexed position of the first 1-bit)
  5. registers[bucket] = max(registers[bucket], rho)

Estimate:
  Z = 1 / sum(2^(-registers[j]))        // sum over all m registers
  estimate = alpha * m^2 * Z
  // alpha (≈0.7213 for large m) corrects for bias, m = number of buckets
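
To make the update and estimate steps concrete, here is a minimal single-file sketch in Go (names like hll, Add, and Estimate are illustrative, not from any reference implementation). It uses one byte per register, skips the small-range and large-range corrections from the original paper, and includes the element-wise-max Merge that makes distributed sketches combinable.

package main

import (
    "fmt"
    "hash/fnv"
    "math"
    "math/bits"
)

const p = 14     // precision: top 14 bits of the hash pick the bucket
const m = 1 << p // 16,384 registers

type hll struct{ reg [m]uint8 }

// Add hashes the value, picks a bucket from the top p bits, and records the
// 1-indexed position of the first 1-bit in the remaining bits.
func (h *hll) Add(value string) {
    f := fnv.New64a()
    f.Write([]byte(value))
    x := f.Sum64()
    bucket := x >> (64 - p)       // first p bits
    rest := x<<p | (1 << (p - 1)) // remaining bits, with a sentinel so rho stays bounded
    rho := uint8(bits.LeadingZeros64(rest)) + 1
    if rho > h.reg[bucket] {
        h.reg[bucket] = rho
    }
}

// Estimate applies the alpha-corrected estimator (no small/large-range correction).
func (h *hll) Estimate() float64 {
    sum := 0.0
    for _, r := range h.reg {
        sum += math.Pow(2, -float64(r))
    }
    alpha := 0.7213 / (1 + 1.079/float64(m)) // bias correction for large m
    return alpha * float64(m) * float64(m) / sum
}

// Merge takes the element-wise max of registers; this is why sketches built
// on different nodes can be combined without losing accuracy.
func (h *hll) Merge(other *hll) {
    for i := range h.reg {
        if other.reg[i] > h.reg[i] {
            h.reg[i] = other.reg[i]
        }
    }
}

func main() {
    var h hll
    for i := 0; i < 100000; i++ {
        h.Add(fmt.Sprintf("user-%d", i))
    }
    fmt.Printf("estimate: %.0f (true: 100000)\n", h.Estimate())
}

With m = 16,384 registers the expected standard error is roughly 1.04/√m ≈ 0.8%, so precision 14 comfortably meets the <1% target; halving the error costs roughly 4x the memory.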

t-digest concept:

Maintain clusters of values with centroids and counts.
Clusters near median can be larger (more values).
Clusters near tails (p1, p99) must be smaller (higher precision).

To compute P99:
  Walk the clusters in ascending order until the cumulative count reaches 99% of the total
  Interpolate within the cluster that crosses that threshold
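
Once the centroids are built and sorted by mean, the quantile query itself is short. Below is a rough Go sketch of just that step (Centroid and quantile are made-up names, and the clustering/compression half of t-digest, which is the genuinely hard part, is omitted): it walks clusters in ascending order and interpolates between neighbouring centroid means once the target rank falls inside a cluster.

package main

import "fmt"

// Centroid is one t-digest cluster: a mean value and how many points it absorbed.
type Centroid struct {
    Mean  float64
    Count float64
}

// quantile estimates the q-th quantile (0..1) from centroids sorted by Mean.
// Each centroid's mean is treated as sitting at the midpoint of its weight.
func quantile(cs []Centroid, q float64) float64 {
    total := 0.0
    for _, c := range cs {
        total += c.Count
    }
    target := q * total
    cum := 0.0
    for i, c := range cs {
        mid := cum + c.Count/2 // cumulative weight at this centroid's "middle"
        if target <= mid {
            if i == 0 {
                return c.Mean
            }
            prevMid := cum - cs[i-1].Count/2
            frac := (target - prevMid) / (mid - prevMid)
            return cs[i-1].Mean + frac*(c.Mean-cs[i-1].Mean)
        }
        cum += c.Count
    }
    return cs[len(cs)-1].Mean
}

func main() {
    // toy digest with smaller clusters at the tail, as t-digest produces
    cs := []Centroid{
        {Mean: 12, Count: 500}, {Mean: 30, Count: 300}, {Mean: 80, Count: 150},
        {Mean: 200, Count: 30}, {Mean: 900, Count: 12}, {Mean: 2500, Count: 5}, {Mean: 4500, Count: 3},
    }
    fmt.Printf("approx P99: %.1f\n", quantile(cs, 0.99))
}

Real implementations handle the first and last clusters more carefully, but walk-and-interpolate is the whole query path; all the cleverness lives in how the clusters are sized.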

Questions to guide you:

  • Why does counting leading zeros estimate cardinality?
  • How does HyperLogLog handle hash collisions?
  • Why does t-digest need more precision at the tails?
  • How do you merge Count-Min Sketches from different nodes?

Learning milestones:

  1. HyperLogLog estimates with <2% error → You understand cardinality estimation
  2. t-digest computes P50/P99 accurately → You understand quantile approximation
  3. Count-Min Sketch finds frequent items → You understand frequency estimation
  4. Sketches merge correctly → You understand distributed approximation

Project 14: Real-Time Dashboard Backend

  • File: LEARN_REALTIME_ANALYTICS_DATABASES_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, TypeScript, Python
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: API Design / Real-Time Updates
  • Software or Tool: Dashboard API Server
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A real-time dashboard backend that serves pre-aggregated metrics via REST/WebSocket, supports drill-down queries, handles time-range selections, and pushes live updates to connected clients.

Why it teaches real-time analytics: This is where analytics meets users. Understanding how to serve sub-second queries to dashboards, handle concurrent users, and push real-time updates ties together everything you’ve learned. This is what makes analytics “real-time” from the user’s perspective.

Core challenges you’ll face:

  • Query caching → maps to when to cache vs recompute
  • WebSocket updates → maps to pushing changes to clients
  • Time-range handling → maps to relative vs absolute times, timezones
  • Drill-down queries → maps to adding filters incrementally
  • Rate limiting → maps to protecting the analytics backend

Key Concepts:

  • API Design: “Design and Build Great Web APIs” - Mike Amundsen
  • WebSockets: RFC 6455
  • Caching Strategies: “Designing Data-Intensive Applications” - Martin Kleppmann
  • Real-Time Systems: Pusher Real-Time Guide

Difficulty: Intermediate. Time estimate: 1-2 weeks. Prerequisites: basic REST API development and familiarity with WebSockets; Project 5 or 12 helps.

Real world outcome:

$ ./dashboard_server start --port 8080 --analytics-backend localhost:9000

Dashboard Server Started
  REST API: http://localhost:8080/api
  WebSocket: ws://localhost:8080/ws
  Analytics backend: localhost:9000
  Cache TTL: 10 seconds

=== REST API Demo ===

$ curl "http://localhost:8080/api/metrics/revenue?groupBy=country&timeRange=last_24h"
{
  "query_time_ms": 12,
  "cache_hit": false,
  "time_range": {"start": "2024-01-14T14:00:00Z", "end": "2024-01-15T14:00:00Z"},
  "data": [
    {"country": "US", "revenue": 1245000, "orders": 12450},
    {"country": "UK", "revenue": 892000, "orders": 8920},
    {"country": "DE", "revenue": 756000, "orders": 7560}
  ]
}

# Same query again (cached)
$ curl "http://localhost:8080/api/metrics/revenue?groupBy=country&timeRange=last_24h"
{
  "query_time_ms": 0.3,
  "cache_hit": true,
  ...
}

# Drill-down
$ curl "http://localhost:8080/api/metrics/revenue?groupBy=city&filter=country:US&timeRange=last_24h"
{
  "query_time_ms": 8,
  "data": [
    {"city": "New York", "revenue": 423000},
    {"city": "Los Angeles", "revenue": 312000},
    ...
  ]
}

=== WebSocket Real-Time Updates ===

$ wscat -c ws://localhost:8080/ws

> {"subscribe": "revenue.by_country", "interval": "1m"}
< {"type": "snapshot", "data": [{"country": "US", "revenue": 1245000}, ...]}
< {"type": "update", "data": [{"country": "US", "revenue": 1245500}]}  // +500 revenue
< {"type": "update", "data": [{"country": "US", "revenue": 1246200}]}  // +700 revenue
...

=== Server Stats ===
Active connections: 47
Queries/sec: 234
Cache hit rate: 89%
Avg response time: 15ms (cached: 0.5ms, uncached: 45ms)

Implementation Hints:

Caching strategy:

Cache key = hash(query_params + floor(time / TTL))

For "last_24h" queries:
  - Recompute current hour every minute
  - Cache older hours until they "roll off" the 24h window

For absolute ranges:
  - Cache indefinitely (data won't change)
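
As a concrete version of the bucketing rule above, here is a small Go sketch of a cache key whose time component only changes once per TTL window (cacheKey and its parameters are illustrative): two identical queries arriving within the same window produce the same key, so the analytics backend is hit at most once per window per distinct query.

package main

import (
    "crypto/sha256"
    "fmt"
    "time"
)

// cacheKey buckets "now" into TTL-sized windows so identical queries within
// one window hash to the same key and can share a cached result.
func cacheKey(queryParams string, ttl time.Duration, now time.Time) string {
    bucket := now.Unix() / int64(ttl.Seconds()) // floor(time / TTL)
    sum := sha256.Sum256([]byte(fmt.Sprintf("%s|%d", queryParams, bucket)))
    return fmt.Sprintf("%x", sum[:8]) // a short prefix is enough for a cache map key
}

func main() {
    ttl := 10 * time.Second
    now := time.Now()
    k1 := cacheKey("metric=revenue&groupBy=country&timeRange=last_24h", ttl, now)
    k2 := cacheKey("metric=revenue&groupBy=country&timeRange=last_24h", ttl, now.Add(3*time.Second))
    fmt.Println(k1 == k2) // true whenever both calls land in the same 10s window
}

For absolute time ranges you can drop the bucket term entirely, which is what makes those entries safe to cache indefinitely.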

WebSocket update flow:

1. Client subscribes to metric
2. Server stores subscription
3. When new data ingested:
   - Compute affected metrics
   - Push delta to subscribed clients
4. Client applies delta to local state
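
Most of this flow is bookkeeping around a subscription registry. The Go sketch below shows one possible shape for that registry (Hub, Update, and the buffered channels are all illustrative, and the actual WebSocket transport, for example a library like gorilla/websocket, is assumed rather than shown): ingestion calls Publish, and each connection's goroutine drains its channel and forwards frames to the client.

package main

import (
    "fmt"
    "sync"
)

// Update is a delta for one metric, e.g. {"country": "US", "revenue": +500}.
type Update struct {
    Metric string
    Delta  map[string]float64
}

// Hub tracks which subscriber channels want which metric.
type Hub struct {
    mu   sync.RWMutex
    subs map[string][]chan Update // metric name -> subscriber channels
}

func NewHub() *Hub { return &Hub{subs: make(map[string][]chan Update)} }

// Subscribe registers interest in a metric; the caller forwards whatever
// arrives on the returned channel over its WebSocket connection.
func (h *Hub) Subscribe(metric string) chan Update {
    ch := make(chan Update, 16)
    h.mu.Lock()
    h.subs[metric] = append(h.subs[metric], ch)
    h.mu.Unlock()
    return ch
}

// Publish fans a delta out to every subscriber of the metric. A full buffer
// drops the update rather than blocking ingestion (clients re-sync via snapshot).
func (h *Hub) Publish(u Update) {
    h.mu.RLock()
    defer h.mu.RUnlock()
    for _, ch := range h.subs[u.Metric] {
        select {
        case ch <- u:
        default:
        }
    }
}

func main() {
    hub := NewHub()
    ch := hub.Subscribe("revenue.by_country")
    hub.Publish(Update{Metric: "revenue.by_country", Delta: map[string]float64{"US": 500}})
    fmt.Println(<-ch) // the connection goroutine would push this as an "update" frame
}

A real server also needs an unsubscribe path tied to connection close and a periodic snapshot for clients whose buffers overflowed, which is exactly what the disconnect question below is getting at.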

Questions to guide you:

  • How do you handle timezone differences for “last 24 hours”?
  • When should you invalidate cache vs let it expire?
  • How do you prevent slow queries from blocking fast ones?
  • How do you handle client disconnects gracefully?

Learning milestones:

  1. REST API serves queries with caching → You understand query serving
  2. WebSocket pushes real-time updates → You understand push notifications
  3. Drill-down queries work efficiently → You understand filter composition
  4. System handles 100+ concurrent clients → You understand connection management

Project 15: Mini-ClickHouse (Putting It All Together)

  • File: LEARN_REALTIME_ANALYTICS_DATABASES_DEEP_DIVE.md
  • Main Programming Language: Rust
  • Alternative Programming Languages: C++, Go
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 5. The “Industry Disruptor”
  • Difficulty: Level 5: Master
  • Knowledge Area: Full OLAP Database
  • Software or Tool: Complete OLAP System
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A complete, working OLAP database that combines: columnar storage (Projects 1 and 3), compression (Projects 2 and 11), LSM-tree ingestion (Project 4), vectorized execution (Project 7), SQL parsing (Project 8), and materialized views (Project 12). Query billions of rows in seconds.

Why it teaches real-time analytics: This is the capstone. Building a complete OLAP database forces you to understand how all the pieces fit together—the tradeoffs, the optimizations, and the engineering decisions that make systems like ClickHouse process billions of rows per second.

Core challenges you’ll face:

  • Architecture integration → maps to how components interact
  • Memory management → maps to buffer pools, memory limits
  • Concurrent query execution → maps to isolation, resource sharing
  • Storage management → maps to partitioning, compaction
  • Performance tuning → maps to identifying and fixing bottlenecks

Key Concepts:

Difficulty: Master. Time estimate: 2-3 months. Prerequisites: all previous projects (especially 3, 4, 7, and 8) and strong systems programming skills.

Real world outcome:

$ ./mini-clickhouse server start

╔══════════════════════════════════════════════════════════════╗
║                     Mini-ClickHouse v1.0                      ║
║           A Learning Project - Not for Production!            ║
╚══════════════════════════════════════════════════════════════╝

Storage engine: LSM-Tree + Columnar
Compression: Gorilla (timestamps), LZ4 (general)
Execution: Vectorized (AVX2)
Memory limit: 8 GB

Listening on:
  Native protocol: localhost:9000
  HTTP interface: localhost:8123

$ ./mini-clickhouse client

mini-clickhouse :) CREATE TABLE events (
    timestamp DateTime,
    user_id UInt64,
    event_type String,
    revenue Float64
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (timestamp, user_id);

CREATE TABLE events
Ok.
0 rows in set. Elapsed: 0.012 sec.

mini-clickhouse :) INSERT INTO events SELECT * FROM file('events_1B.csv');

INSERT INTO events
1,000,000,000 rows in set. Elapsed: 145.23 sec.
Processed 6.89 million rows/s, 1.10 GB/s.

Storage size: 12.4 GB (92 GB raw → 7.4x compression)

mini-clickhouse :) SELECT
    event_type,
    count() AS events,
    sum(revenue) AS total_revenue,
    avg(revenue) AS avg_revenue
FROM events
WHERE timestamp >= '2024-01-01' AND timestamp < '2024-02-01'
GROUP BY event_type
ORDER BY total_revenue DESC
LIMIT 10;

┌─event_type──┬────events─┬─total_revenue─┬─avg_revenue─┐
│ purchase    │ 89234521  │  892,345,210  │   10.00     │
│ subscription│ 12453678  │  248,756,120  │   19.97     │
│ refund      │   892345  │  -17,846,900  │  -20.00     │
│ ...         │   ...     │     ...       │   ...       │
└─────────────┴───────────┴───────────────┴─────────────┘

10 rows in set. Elapsed: 0.892 sec.
Processed 312.45 million rows, 2.50 GB (350.28 million rows/s, 2.80 GB/s).

Execution plan:
  PartitionPrune(2024-01 only): 12 → 1 partitions
  ColumnScan: [event_type, revenue]
  VectorizedFilter: timestamp range
  HashAggregate: GROUP BY event_type
  TopNSort: ORDER BY total_revenue DESC LIMIT 10

mini-clickhouse :) EXPLAIN PIPELINE
SELECT count(DISTINCT user_id) FROM events WHERE event_type = 'purchase';

┌─explain───────────────────────────────────────────┐
│ (Aggregating)                                     │
│   AggregatingTransform (HyperLogLog)              │
│     (Filter)                                      │
│       FilterTransform (SIMD comparison)           │
│         (Scan)                                    │
│           MergeTreeThread × 4                     │
│             Decompress (LZ4) → Vectorized Scan    │
└───────────────────────────────────────────────────┘

Implementation Hints:

Architecture overview:

┌───────────────────────────────────────────────────────────────┐
│                        SQL Interface                          │
│                    (Parser → Planner → Optimizer)             │
├───────────────────────────────────────────────────────────────┤
│                    Query Execution Engine                     │
│              (Vectorized, SIMD, Parallel Pipelines)           │
├───────────────────────────────────────────────────────────────┤
│     Aggregation Engine        │    Materialized Views         │
│  (Hash, HLL, t-digest, etc.)  │   (Incremental Refresh)       │
├───────────────────────────────┼───────────────────────────────┤
│              Storage Engine (MergeTree)                       │
│  ┌─────────────────┐  ┌─────────────────────────────────┐    │
│  │ In-Memory       │  │ On-Disk                         │    │
│  │ (WAL + MemTable)│  │ (Columnar Parts + Compression)  │    │
│  └─────────────────┘  └─────────────────────────────────┘    │
├───────────────────────────────────────────────────────────────┤
│                    Partition Manager                          │
│          (Partition Pruning, Compaction, TTL)                 │
└───────────────────────────────────────────────────────────────┘

Integration questions:

  • How does the query planner decide when to use HyperLogLog vs exact count?
  • How do you balance memory between concurrent queries?
  • When should compaction run vs serving queries?
  • How do you handle schema changes on existing data?

Start with a single-threaded, single-table system. Add features incrementally. Profile constantly—real databases spend years on optimization.
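
One way to keep the integration tractable is to pin down the seams between layers as narrow interfaces before writing much code. The Go sketch below is purely illustrative (the project itself targets Rust, and every name here is invented), but it mirrors the diagram above: SQL flows down into a plan, the plan pulls column batches from storage, and batches flow back up through vectorized operators.

package main

// Batch is a chunk of columnar data: one slice per referenced column.
// A real engine uses typed, SIMD-friendly buffers instead of []any.
type Batch struct {
    Columns map[string][]any
    Rows    int
}

// Storage hands out column batches for a table, already pruned to the
// partitions that can possibly match the query's time range.
type Storage interface {
    Scan(table string, columns []string, partitionFilter func(part string) bool) (BatchIterator, error)
}

type BatchIterator interface {
    Next() (*Batch, bool) // false when the scan is exhausted
}

// Operator is one stage of a vectorized pipeline (filter, aggregate, sort...).
// Each stage consumes batches from its child and produces transformed batches.
type Operator interface {
    Next() (*Batch, bool)
}

// Planner turns a parsed statement into a tree of operators rooted at the one
// the client drains; partition pruning, materialized-view rewriting, and the
// exact-vs-HyperLogLog decision all live here.
type Planner interface {
    Plan(sql string, storage Storage) (Operator, error)
}

func main() {} // interfaces only; the point is the shape of the seams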

Learning milestones:

  1. Basic queries work end-to-end → You understand the full stack
  2. Performance matches expectations → You understand bottlenecks
  3. System handles concurrent queries → You understand resource management
  4. Compression + vectorization work together → You understand the synergies

Project Comparison Table

Project                         | Difficulty   | Time       | Depth of Understanding | Fun Factor
1. Row vs Column Visualizer     | Beginner     | Weekend    | ★★★☆☆                  | ★★★★☆
2. Compression Playground       | Intermediate | 1-2 weeks  | ★★★★☆                  | ★★★☆☆
3. Mini-Parquet Reader/Writer   | Advanced     | 2-3 weeks  | ★★★★★                  | ★★★★☆
4. LSM-Tree Storage Engine      | Advanced     | 3-4 weeks  | ★★★★★                  | ★★★★☆
5. Metrics Aggregator           | Advanced     | 2-3 weeks  | ★★★★☆                  | ★★★★★
6. Kafka Ingestion Pipeline     | Intermediate | 1-2 weeks  | ★★★☆☆                  | ★★★☆☆
7. Vectorized Execution Engine  | Expert       | 3-4 weeks  | ★★★★★                  | ★★★★★
8. SQL Parser & Planner         | Advanced     | 2-3 weeks  | ★★★★☆                  | ★★★☆☆
9. Distributed Coordinator      | Expert       | 3-4 weeks  | ★★★★★                  | ★★★★☆
10. Raft Consensus              | Master       | 4-6 weeks  | ★★★★★                  | ★★★★★
11. Gorilla Compression         | Advanced     | 2-3 weeks  | ★★★★☆                  | ★★★★☆
12. Materialized Views          | Advanced     | 2-3 weeks  | ★★★★☆                  | ★★★☆☆
13. Approximate Queries (HLL)   | Advanced     | 2-3 weeks  | ★★★★★                  | ★★★★☆
14. Dashboard Backend           | Intermediate | 1-2 weeks  | ★★★☆☆                  | ★★★★★
15. Mini-ClickHouse             | Master       | 2-3 months | ★★★★★                  | ★★★★★

If you have 1 month (quick foundation):

  1. Project 1: Row vs Column Visualizer (weekend)
  2. Project 2: Compression Playground (1 week)
  3. Project 5: Metrics Aggregator (2 weeks)

If you have 3 months (solid understanding):

  1. Projects 1-2 (2 weeks)
  2. Project 3: Mini-Parquet (2 weeks)
  3. Project 5: Metrics Aggregator (2 weeks)
  4. Project 6: Kafka Ingestion (1 week)
  5. Project 7: Vectorized Execution (3 weeks)
  6. Project 8: SQL Parser (2 weeks)

If you have 6+ months (deep mastery):

Complete all projects in order. Each builds on the previous.

Start with Project 1 (Row vs Column Visualizer). It’s the fastest way to viscerally understand why columnar databases exist. From there, choose based on your interests:

  • Performance-focused: Projects 2 → 7 → 11
  • Distributed systems-focused: Projects 4 → 9 → 10
  • Full-stack analytics: Projects 5 → 6 → 12 → 14

Final Capstone: Production-Ready Analytics Platform

After completing the individual projects, build a complete analytics platform:

  • File: LEARN_REALTIME_ANALYTICS_DATABASES_DEEP_DIVE.md
  • Main Programming Language: Rust + Go + TypeScript
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 5. The “Industry Disruptor”
  • Difficulty: Level 5: Master
  • Knowledge Area: Full Analytics Stack
  • Software or Tool: Complete Analytics Platform

What you’ll build: Combine Projects 3-15 into a unified platform: columnar storage with compression, streaming ingestion from Kafka, SQL query interface, distributed query execution, materialized views, approximate aggregations, and a real-time dashboard. Deploy it on 3+ nodes with replication.

Why it matters: This is a portfolio-defining project. You’ll have built, from scratch, a system that competes conceptually with ClickHouse, Druid, or Pinot. Even if it’s slower or has fewer features, the fact that you built it demonstrates mastery of real-time analytics.

Real world outcome:

  • Ingest 100K+ events/second
  • Query billions of rows in <1 second
  • Survive node failures without data loss
  • Serve real-time dashboards with <100ms latency
  • Support standard SQL queries

This is not a weekend project. Budget 3-6 months for a complete implementation. But when you’re done, you’ll understand real-time analytics databases at a level few engineers ever achieve.


Summary

#     | Project                               | Main Language
1     | Row vs Column Storage Visualizer      | Python
2     | Compression Algorithm Playground      | C
3     | Mini-Parquet File Reader/Writer       | Go
4     | LSM-Tree Storage Engine               | Rust
5     | Real-Time Metrics Aggregator          | Go
6     | Streaming Ingestion Pipeline (Kafka)  | Java
7     | Vectorized Query Execution Engine     | Rust
8     | SQL Parser and Query Planner          | Rust
9     | Distributed Query Coordinator         | Go
10    | Raft Consensus for Replication        | Rust
11    | Time-Series Compression (Gorilla)     | C
12    | Materialized View Engine              | Go
13    | Approximate Query Processing (HLL)    | C++
14    | Real-Time Dashboard Backend           | Go
15    | Mini-ClickHouse (Capstone)            | Rust
Final | Production Analytics Platform         | Rust + Go + TypeScript

Key Resources

Books

  • “Designing Data-Intensive Applications” by Martin Kleppmann - The bible for understanding modern data systems
  • “Database Internals” by Alex Petrov - Deep dive into storage engines and distributed databases
  • “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron - Essential for understanding performance

Sources used in this guide: