LEARN REALTIME ANALYTICS DATABASES DEEP DIVE

Learn Real-Time Analytics Databases: From Zero to OLAP Master

Goal: Deeply understand real-time analytics databases—from column-oriented storage fundamentals to building distributed OLAP systems that process billions of rows per second.


Why Real-Time Analytics Databases Matter

Every time Netflix shows you “Trending Now,” Uber calculates surge pricing, or LinkedIn suggests “People You May Know,” a real-time analytics database is crunching massive datasets in milliseconds. These systems represent the pinnacle of database engineering—combining storage optimization, query processing, distributed systems, and streaming data ingestion into unified platforms.

After completing these projects, you will:

  • Understand why columnar storage can be orders of magnitude faster than row-based storage for analytical queries
  • Build storage engines that compress data to 1/10th its original size
  • Implement streaming ingestion pipelines that handle millions of events per second
  • Design query engines that leverage vectorized execution and SIMD
  • Master distributed consensus and replication for fault-tolerant systems
  • Build your own mini-ClickHouse or mini-Druid from scratch

Core Concept Analysis

The OLAP vs OLTP Divide

OLTP (Row-Oriented)                     OLAP (Column-Oriented)
┌────┬───────┬───────┐                  ┌───────┬───────┬───────┐
│ ID │ Name  │ Sales │                  │  ID   │  ID   │  ID   │
├────┼───────┼───────┤                  │   1   │   2   │   3   │
│ 1  │ Alice │ 100   │                  ├───────┼───────┼───────┤
│ 2  │ Bob   │ 200   │                  │ Name  │ Name  │ Name  │
│ 3  │ Carol │ 150   │                  │ Alice │  Bob  │ Carol │
└────┴───────┴───────┘                  ├───────┼───────┼───────┤
                                        │ Sales │ Sales │ Sales │
Fast: Get all data for ID=2             │  100  │  200  │  150  │
Slow: SUM(Sales) across millions        └───────┴───────┴───────┘

                                        Fast: SUM(Sales) - only read one column
                                        Slow: Get all data for ID=2

Fundamental Concepts

  1. Columnar Storage: Data stored by columns, not rows
    • Enables massive compression (similar values grouped together)
    • Reads only needed columns (skip irrelevant data)
    • Cache-friendly access patterns
  2. Vectorized Execution: Process data in batches (vectors) of thousands of values
    • Amortizes function call overhead
    • Leverages CPU SIMD instructions
    • Modern CPUs can process 8-16 values simultaneously
  3. Time-Series Optimization: Specialized handling for timestamped data
    • Delta-of-delta encoding for timestamps
    • Gorilla compression for floating-point metrics
    • Time-based partitioning and retention
  4. Streaming Ingestion: Real-time data consumption
    • Kafka/Kinesis integration
    • Micro-batching vs true streaming
    • Write-ahead logs for durability
  5. Distributed Query Processing: Scale across machines
    • Data partitioning (sharding)
    • Query routing and aggregation
    • Consensus protocols (Raft/Paxos)
  6. Compression Algorithms: Shrink data dramatically
    • Run-Length Encoding (RLE)
    • Dictionary Encoding
    • Bit-Packing
    • LZ4/ZSTD general compression

The Real-Time Analytics Stack

┌─────────────────────────────────────────────────────────────┐
│                    Query Layer (SQL)                        │
├─────────────────────────────────────────────────────────────┤
│              Query Optimizer / Planner                      │
├─────────────────────────────────────────────────────────────┤
│         Vectorized Execution Engine (SIMD)                  │
├─────────────────────────────────────────────────────────────┤
│    ┌─────────────────┐  ┌─────────────────────────────┐    │
│    │  In-Memory      │  │     On-Disk Storage         │    │
│    │  (Hot Data)     │  │  (Columnar + Compressed)    │    │
│    └─────────────────┘  └─────────────────────────────┘    │
├─────────────────────────────────────────────────────────────┤
│              Streaming Ingestion (Kafka)                    │
├─────────────────────────────────────────────────────────────┤
│         Distributed Coordination (Raft/ZooKeeper)           │
└─────────────────────────────────────────────────────────────┘

Project List

Projects are ordered from fundamental understanding to advanced implementations. Each builds on concepts from previous projects.


Project 1: Row vs Column Storage Visualizer

  • File: LEARN_REALTIME_ANALYTICS_DATABASES_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, Rust, C
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 1: Beginner
  • Knowledge Area: Storage Engines / Data Layout
  • Software or Tool: Custom Storage Engine
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A tool that stores the same dataset in both row-oriented and column-oriented formats, then visually demonstrates the I/O difference when running analytical queries (like SUM, AVG, COUNT) vs transactional queries (fetch by ID).

Why it teaches real-time analytics: Before diving into complex systems, you must viscerally understand why columnar storage exists. This project makes the abstract concrete—you’ll literally see how many bytes are read for each query type.

Core challenges you’ll face:

  • Designing the storage layout → maps to understanding how data is physically arranged on disk
  • Measuring I/O accurately → maps to understanding what “scanning” actually means
  • Implementing basic aggregations → maps to how queries traverse data differently
  • Visualizing access patterns → maps to understanding cache efficiency

Key Concepts:

  • Row vs Column Storage: “Designing Data-Intensive Applications” Chapter 3 - Martin Kleppmann
  • Data Locality: “Computer Systems: A Programmer’s Perspective” Chapter 6 - Bryant & O’Hallaron
  • File I/O Fundamentals: “The Linux Programming Interface” Chapters 4-5 - Michael Kerrisk

Difficulty: Beginner Time estimate: Weekend Prerequisites: Basic file I/O in any language, understanding of arrays/lists

Real world outcome:

$ ./storage_visualizer --dataset sales_1M.csv

=== Storage Comparison ===
Dataset: 1,000,000 rows × 10 columns
Row-oriented size: 85.2 MB
Column-oriented size: 85.2 MB (same uncompressed)

=== Query: SELECT SUM(revenue) FROM sales ===
Row-oriented:
  Bytes read: 85.2 MB (entire dataset)
  Time: 342ms
  [████████████████████████████████████████] 100% of data

Column-oriented:
  Bytes read: 8.5 MB (only revenue column)
  Time: 38ms
  [████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░] 10% of data

Speedup: 9.0x faster!

=== Query: SELECT * FROM sales WHERE id = 500000 ===
Row-oriented:
  Bytes read: 42.6 MB (scan to row)
  Time: 156ms

Column-oriented:
  Bytes read: 85.2 MB (reassemble from all columns)
  Time: 412ms

Row-oriented wins for point lookups!

Implementation Hints:

The core insight is simple: row storage keeps all fields of a record together, column storage keeps all values of a field together.

Questions to guide you:

  • How do you represent “reading X bytes from disk” in your program? (Hint: seek and read operations)
  • What data structure holds your column data? (Hint: one array per column)
  • How do you “reconstruct” a row from columnar data? (Hint: index into each column array)
  • How can you visually show which bytes were accessed? (Hint: track byte ranges)

Architecture:

RowStore:
  File: [row1_all_cols][row2_all_cols][row3_all_cols]...

ColumnStore:
  File: [col1_all_rows][col2_all_rows][col3_all_rows]...
  Metadata: {col1: offset=0, len=X}, {col2: offset=X, len=Y}...

For visualization, consider using ANSI colors to highlight which portions of your data file were accessed during each query.
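
A minimal Python sketch of the two layouts, using a two-column schema (id as int64, revenue as float64) and counting bytes read by hand. The file names and the in-memory metadata dict are illustrative choices, not a required design:

import struct

ROW_FMT = "<qd"              # one row: id (int64), revenue (float64)
ROW_SIZE = struct.calcsize(ROW_FMT)

def write_row_store(path, rows):
    with open(path, "wb") as f:
        for rid, rev in rows:
            f.write(struct.pack(ROW_FMT, rid, rev))

def write_column_store(path, rows):
    ids = b"".join(struct.pack("<q", rid) for rid, _ in rows)
    revs = b"".join(struct.pack("<d", rev) for _, rev in rows)
    with open(path, "wb") as f:
        f.write(ids + revs)
    # metadata kept in memory for simplicity: column -> (offset, length)
    return {"id": (0, len(ids)), "revenue": (len(ids), len(revs))}

def sum_revenue_row_store(path):
    bytes_read, total = 0, 0.0
    with open(path, "rb") as f:
        while chunk := f.read(ROW_SIZE):            # must read every full row
            bytes_read += len(chunk)
            _, rev = struct.unpack(ROW_FMT, chunk)
            total += rev
    return total, bytes_read

def sum_revenue_column_store(path, meta):
    offset, length = meta["revenue"]
    with open(path, "rb") as f:
        f.seek(offset)                              # skip the id column entirely
        data = f.read(length)
    total = sum(struct.unpack(f"<{len(data) // 8}d", data))
    return total, len(data)

rows = [(i, float(i % 100)) for i in range(100_000)]
write_row_store("rows.bin", rows)
meta = write_column_store("cols.bin", rows)
print(sum_revenue_row_store("rows.bin"))            # reads 16 bytes per row
print(sum_revenue_column_store("cols.bin", meta))   # reads only 8 bytes per row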

Learning milestones:

  1. Both storage formats work correctly → You understand physical data layout
  2. Query results match between formats → You’ve verified correctness
  3. I/O measurements show expected patterns → You understand the performance implications
  4. Visualization clearly shows the difference → You can explain this to others

Project 2: Compression Algorithm Playground

  • File: LEARN_REALTIME_ANALYTICS_DATABASES_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go, Python
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Data Compression / Encoding
  • Software or Tool: Custom Compression Library
  • Main Book: “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron

What you’ll build: A suite of compression algorithms specifically designed for analytics workloads: Run-Length Encoding (RLE), Dictionary Encoding, Delta Encoding, Delta-of-Delta, Bit-Packing, and XOR-based compression for floats. You’ll benchmark each against real datasets.

Why it teaches real-time analytics: Compression is the secret weapon of OLAP databases. ClickHouse achieves 10-20x compression ratios. Understanding these algorithms reveals why columnar storage enables such dramatic compression—similar values cluster together.

Core challenges you’ll face:

  • Implementing RLE efficiently → maps to detecting and encoding runs
  • Building a dictionary encoder → maps to trading memory for compression
  • Delta-of-delta for timestamps → maps to exploiting sequential patterns
  • Bit-packing integers → maps to minimizing storage for small values
  • XOR compression for floats → maps to IEEE 754 exploitation

Key Concepts:

  • Run-Length Encoding: “Introduction to Data Compression” Chapter 2 - Khalid Sayood
  • Dictionary Encoding: Facebook’s Gorilla Paper Section 4
  • Delta Encoding: “Time-series compression algorithms, explained” - TigerData Blog
  • Bit Manipulation: “Computer Systems: A Programmer’s Perspective” Chapter 2 - Bryant & O’Hallaron

Difficulty: Intermediate Time estimate: 1-2 weeks Prerequisites: Project 1, comfortable with binary/bitwise operations, basic understanding of floating-point representation

Real world outcome:

$ ./compression_playground --input sensor_data.csv

=== Compression Benchmark Results ===
Original size: 245.0 MB (10M rows, 4 columns)

Column: timestamp (int64, monotonic)
  Raw:           80.0 MB
  Delta:         12.4 MB (6.5x compression)
  Delta-of-Delta: 1.2 MB (66.7x compression!)
  Winner: Delta-of-Delta

Column: sensor_id (int32, low cardinality)
  Raw:           40.0 MB
  Dictionary:     2.1 MB (19.0x compression)
  RLE:            0.8 MB (50.0x compression)
  Winner: RLE (sorted data)

Column: temperature (float64)
  Raw:           80.0 MB
  XOR (Gorilla):  8.9 MB (9.0x compression)
  Winner: XOR

Column: status (string, 5 unique values)
  Raw:           45.0 MB
  Dictionary:     1.2 MB (37.5x compression)
  Winner: Dictionary

Total compressed: 12.1 MB (20.2x overall compression)

=== Decompression Speed ===
Delta-of-Delta: 1.2 GB/s
Dictionary:     2.8 GB/s
RLE:            3.4 GB/s
XOR:            0.9 GB/s

Implementation Hints:

Run-Length Encoding: Transform [A, A, A, B, B, C, C, C, C] → [(A,3), (B,2), (C,4)]. This works brilliantly on sorted columnar data where values repeat consecutively.

Dictionary Encoding:

Original: ["USA", "USA", "Germany", "USA", "France"]
Dictionary: {0: "USA", 1: "Germany", 2: "France"}
Encoded: [0, 0, 1, 0, 2]  // Much smaller!

Delta-of-Delta for timestamps:

Timestamps: [1000, 1060, 1120, 1180, 1240]
Deltas:     [1000, 60, 60, 60, 60]      // First value + differences
Delta-of-D: [1000, 60, 0, 0, 0]         // Most become 0!

Facebook found 96% of timestamp delta-of-deltas are 0!

XOR for floats (Gorilla algorithm): Consecutive float values often share many bits. XOR reveals only the differing bits, which can be encoded very compactly.
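
A minimal Python sketch of the two simplest encoders, RLE and delta-of-delta, operating on plain lists (a real implementation would emit bit-packed output rather than Python objects):

def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]

def delta_of_delta(timestamps):
    """Keep the first value, then encode second differences; regular intervals become 0."""
    out = [timestamps[0]]
    prev, prev_delta = timestamps[0], None
    for t in timestamps[1:]:
        delta = t - prev
        out.append(delta if prev_delta is None else delta - prev_delta)
        prev, prev_delta = t, delta
    return out

print(rle_encode(["A", "A", "A", "B", "B", "C", "C", "C", "C"]))
# [('A', 3), ('B', 2), ('C', 4)]
print(delta_of_delta([1000, 1060, 1120, 1180, 1240]))
# [1000, 60, 0, 0, 0]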

Questions to guide you:

  • How do you detect when to stop a run in RLE?
  • What happens when dictionary size exceeds the savings?
  • How do you handle the first value in delta encoding?
  • What’s the bit layout of a float64? (Sign, exponent, mantissa)

Learning milestones:

  1. RLE works on sorted data → You understand run detection
  2. Dictionary encoding handles arbitrary strings → You understand the space-time tradeoff
  3. Delta-of-delta achieves >50x on timestamps → You’ve exploited temporal patterns
  4. XOR compression works on real sensor data → You understand IEEE 754 structure

Project 3: Build a Mini-Parquet File Reader/Writer

  • File: LEARN_REALTIME_ANALYTICS_DATABASES_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, C++, Python
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: File Formats / Columnar Storage
  • Software or Tool: Apache Parquet
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A simplified implementation of the Parquet file format—the industry standard for columnar data storage. You’ll implement row groups, column chunks, page structure, and basic compression, then verify your files work with the official Parquet tools.

Why it teaches real-time analytics: Parquet is everywhere: Spark, Presto, Snowflake, BigQuery, and more. Understanding its internals reveals how production systems balance compression, random access, and query efficiency. You’ll learn why it’s organized the way it is.

Core challenges you’ll face:

  • Understanding the Parquet file structure → maps to footer, row groups, column chunks, pages
  • Implementing nested schema encoding → maps to definition and repetition levels
  • Adding compression per column → maps to codec selection and page compression
  • Writing valid metadata → maps to Thrift serialization and schema storage

Key Concepts:

Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Project 1 and 2, understanding of binary file formats, comfortable with nested data structures

Real world outcome:

$ ./mini_parquet write --input users.json --output users.parquet

Writing Parquet file...
Schema: {id: INT64, name: STRING, emails: LIST<STRING>, address: STRUCT}
Row groups: 2 (50,000 rows each)
Compression: SNAPPY

Column statistics:
  id:      min=1, max=100000, nulls=0
  name:    min="Aaron", max="Zoe", nulls=12
  emails:  nested (def/rep levels used)
  address: nested struct

File size: 4.2 MB (vs 45 MB JSON = 10.7x compression)
Written successfully!

$ ./mini_parquet read --input users.parquet --query "SELECT id, name LIMIT 5"

┌────────┬─────────┐
│   id   │  name   │
├────────┼─────────┤
│      1 │ Alice   │
│      2 │ Bob     │
│      3 │ Carol   │
│      4 │ David   │
│      5 │ Eve     │
└────────┴─────────┘

$ parquet-tools schema users.parquet  # Official tool validates our file!
message users {
  required int64 id;
  optional binary name (STRING);
  repeated binary emails (STRING);
  optional group address { ... }
}
✓ Valid Parquet file!

Implementation Hints:

Parquet file structure:

┌──────────────────────────────────┐
│         Row Group 1              │
│  ┌─────────────────────────────┐ │
│  │ Column Chunk: id            │ │
│  │   Page 1, Page 2, ...       │ │
│  ├─────────────────────────────┤ │
│  │ Column Chunk: name          │ │
│  │   Page 1, Page 2, ...       │ │
│  └─────────────────────────────┘ │
├──────────────────────────────────┤
│         Row Group 2              │
│         ...                      │
├──────────────────────────────────┤
│       Footer (Metadata)          │
│  - Schema                        │
│  - Row group locations           │
│  - Column statistics             │
├──────────────────────────────────┤
│    Footer Length (4 bytes)       │
│    Magic: "PAR1" (4 bytes)       │
└──────────────────────────────────┘

Questions to guide you:

  • Why does Parquet put metadata at the end, not the beginning?
  • What are row groups for? (Hint: parallel processing, memory management)
  • Why compress at page level, not column level?
  • What problem do definition/repetition levels solve for nested data?

Start simple: flat schemas (no nesting), single row group, no compression. Add features incrementally.
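
Before touching Thrift or real page encodings, it can help to prototype just the footer-at-the-end layout. The sketch below is a toy format, not valid Parquet: it only borrows the idea of writing column chunks first, then metadata, then the metadata length and magic bytes at the very end, so a reader can fetch a single column with two seeks:

import json, struct

MAGIC = b"PAR1"   # real Parquet uses this magic; everything else here is a toy stand-in

def write_toy_columnar(path, columns):
    """Write column chunks first, then a JSON 'footer' with offsets, then its length."""
    with open(path, "wb") as f:
        f.write(MAGIC)
        meta = {}
        for name, values in columns.items():
            data = json.dumps(values).encode()      # stand-in for encoded/compressed pages
            meta[name] = {"offset": f.tell(), "length": len(data)}
            f.write(data)
        footer = json.dumps(meta).encode()
        f.write(footer)
        f.write(struct.pack("<I", len(footer)))     # footer length sits just before trailing magic
        f.write(MAGIC)

def read_column(path, name):
    """Seek to the end, read the footer, then fetch only the requested column chunk."""
    with open(path, "rb") as f:
        f.seek(-8, 2)                               # 4-byte footer length + 4-byte magic
        footer_len = struct.unpack("<I", f.read(4))[0]
        f.seek(-(8 + footer_len), 2)
        meta = json.loads(f.read(footer_len))
        entry = meta[name]
        f.seek(entry["offset"])
        return json.loads(f.read(entry["length"]))

write_toy_columnar("toy.col", {"id": [1, 2, 3], "name": ["Alice", "Bob", "Carol"]})
print(read_column("toy.col", "name"))               # only the 'name' chunk is read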

Learning milestones:

  1. Write valid Parquet files (verified by parquet-tools) → You understand the file structure
  2. Column pruning works (only read needed columns) → You understand columnar access
  3. Compression reduces file size significantly → You understand page-level compression
  4. Nested data with def/rep levels works → You understand Dremel encoding

Project 4: LSM-Tree Storage Engine for Analytics

  • File: LEARN_REALTIME_ANALYTICS_DATABASES_DEEP_DIVE.md
  • Main Programming Language: Rust
  • Alternative Programming Languages: C++, Go, C
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Storage Engines / Write Optimization
  • Software or Tool: RocksDB-style engine
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: An LSM-Tree (Log-Structured Merge-Tree) storage engine optimized for analytics write patterns. You’ll implement a MemTable (in-memory buffer), SSTable (Sorted String Table) files, compaction strategies, and bloom filters for efficient lookups.

Why it teaches real-time analytics: Real-time analytics databases must ingest massive amounts of data continuously. LSM-trees provide the foundation—they convert random writes into sequential writes, enabling sustained ingestion rates of millions of events per second.

Core challenges you’ll face:

  • Implementing an efficient MemTable → maps to in-memory sorted structures (skip list/red-black tree)
  • Flushing to SSTable format → maps to sorted, immutable, compressed files
  • Compaction strategies → maps to size-tiered vs leveled compaction tradeoffs
  • Bloom filters for reads → maps to probabilistic data structures
  • Write-ahead log (WAL) → maps to durability guarantees

Key Concepts:

Difficulty: Advanced Time estimate: 3-4 weeks Prerequisites: Projects 1-2, understanding of tree data structures, file I/O

Real world outcome:

$ ./lsm_engine bench --writes 10000000 --threads 4

=== LSM-Tree Benchmark ===
Configuration:
  MemTable size: 64 MB
  Compaction: Leveled
  Bloom filter: 10 bits/key

Write Performance:
  Total writes: 10,000,000
  Time: 12.4 seconds
  Throughput: 806,451 writes/sec
  Avg latency: 4.96 µs

Storage:
  Data written: 1.2 GB
  On-disk size: 890 MB (compression: 1.35x)
  SSTable files:
    L0: 2 files (32 MB each)
    L1: 4 files (64 MB each)
    L2: 9 files (64 MB each)

$ ./lsm_engine query --key "sensor_12345_2024"
Key found in: L1/sst_00042.db
Value: {"temp": 23.5, "humidity": 65}
Bloom filter checks: 3 (2 negative, 1 positive)
Total read time: 89 µs

$ ./lsm_engine compact --force
Compacting L0 → L1...
  Merged 3 files into 2 files
  Reclaimed 45 MB of space
  Removed 125,000 tombstones
Compaction complete in 2.3s

Implementation Hints:

LSM-Tree structure:

┌────────────────────────────────────┐
│           MemTable                 │  In-Memory (Skip List)
│   (sorted, accepts writes)         │
└────────────────────────────────────┘
              │ flush when full
              ▼
┌────────────────────────────────────┐
│   Level 0 SSTables (unsorted)      │  On-Disk
│   [sst1] [sst2] [sst3]             │
└────────────────────────────────────┘
              │ compaction
              ▼
┌────────────────────────────────────┐
│   Level 1 SSTables (sorted)        │
│   [sst1] [sst2] ... [sst10]        │  Non-overlapping key ranges
└────────────────────────────────────┘
              │ compaction
              ▼
        ... more levels ...
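
A minimal Python sketch of the write path, assuming a plain dict as the MemTable and sorted in-memory lists standing in for on-disk SSTables (no WAL, bloom filters, or compaction yet; class and field names are illustrative):

import bisect

class MiniLSM:
    """Toy LSM: dict MemTable, immutable sorted runs as 'SSTables'."""
    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.sstables = []                      # newest first; each run is a sorted list of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        run = sorted(self.memtable.items())     # sorted, immutable run == SSTable
        self.sstables.insert(0, run)
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:                # newest data wins
            return self.memtable[key]
        for run in self.sstables:               # then search runs newest → oldest
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

db = MiniLSM()
for i in range(10):
    db.put(f"sensor_{i}", {"temp": 20 + i})
print(db.get("sensor_3"), len(db.sstables), "runs on 'disk'")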

Questions to guide you:

  • Why use a skip list instead of a balanced tree for the MemTable?
  • What happens if the machine crashes before MemTable is flushed? (WAL)
  • Why are L0 SSTables allowed to overlap but L1+ cannot?
  • How does the bloom filter prevent unnecessary disk reads?
  • What’s the tradeoff between size-tiered and leveled compaction?

Learning milestones:

  1. MemTable + WAL provides durability → You understand crash recovery
  2. SSTables flush correctly and are queryable → You understand the immutable file format
  3. Compaction merges files and removes tombstones → You understand space reclamation
  4. Bloom filters reduce read amplification → You understand probabilistic optimization

Project 5: Real-Time Metrics Aggregator (In-Memory OLAP)

  • File: LEARN_REALTIME_ANALYTICS_DATABASES_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, C++, Java
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: In-Memory Analytics / Aggregation
  • Software or Tool: Mini-Druid/Pinot
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: An in-memory analytics engine that can ingest millions of events per second and answer aggregation queries (COUNT, SUM, AVG, MIN, MAX, GROUP BY) in milliseconds. Think of it as a mini-Druid or mini-Pinot for metrics.

Why it teaches real-time analytics: This is the core of real-time analytics—maintaining pre-aggregated data structures that enable instant queries. You’ll learn how systems achieve sub-second query latency on billions of events.

Core challenges you’ll face:

  • Designing efficient in-memory data structures → maps to columnar arrays with indexes
  • Supporting GROUP BY efficiently → maps to hash-based aggregation
  • Implementing rollups and pre-aggregations → maps to trading storage for query speed
  • Handling high-velocity ingestion → maps to lock-free/concurrent data structures
  • Time-based windowing → maps to sliding windows, tumbling windows

Key Concepts:

Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Projects 1-2, understanding of hash tables, basic concurrency

Real world outcome:

$ ./metrics_aggregator start --port 8080

Metrics Aggregator v1.0
  In-memory storage: 4 GB allocated
  Ingestion port: 8080/tcp
  Query port: 8081/tcp
  Rollup: 1-minute granularity

$ ./metrics_aggregator ingest --file events.json --rate 500000
Ingesting at 500,000 events/sec...
  [████████████████████] 10,000,000 events ingested
  Time: 20.1 seconds
  Memory used: 892 MB

$ curl "http://localhost:8081/query" -d '{
  "aggregations": ["SUM(revenue)", "COUNT(*)"],
  "groupBy": ["country", "product_category"],
  "timeRange": {"start": "2024-01-01", "end": "2024-01-31"},
  "filter": {"country": ["US", "UK", "DE"]}
}'

{
  "queryTime": "12ms",
  "rowsScanned": 8500000,
  "results": [
    {"country": "US", "product_category": "Electronics", "SUM(revenue)": 4521000, "COUNT(*)": 125000},
    {"country": "US", "product_category": "Clothing", "SUM(revenue)": 2103000, "COUNT(*)": 89000},
    {"country": "UK", "product_category": "Electronics", "SUM(revenue)": 1892000, "COUNT(*)": 67000},
    ...
  ]
}

$ ./metrics_aggregator stats
Current state:
  Total events: 10,000,000
  Unique dimensions: {country: 45, product_category: 12, store_id: 2300}
  Memory: 892 MB
  Rollup compression: 8.5x (raw would be 7.6 GB)
  Avg query latency: 15ms

Implementation Hints:

Core data structure:

Segment {
  timestamp_column: Vec<i64>        // Sorted timestamps
  dimension_columns: {
    "country": DictionaryColumn,    // Low-cardinality → dictionary encoded
    "product_id": Vec<u32>,         // Medium cardinality
  }
  metric_columns: {
    "revenue": Vec<f64>,
    "quantity": Vec<i32>,
  }
  indexes: {
    "country": BitmapIndex,         // Fast filtering
    "timestamp": RangeIndex,        // Time range queries
  }
}
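
A minimal Python sketch of hash-based GROUP BY aggregation over such a segment, with plain lists standing in for typed columnar arrays and a hard-coded filter on country (names and shapes are illustrative):

from collections import defaultdict

# Columnar segment: one list per column
segment = {
    "country":  ["US", "US", "UK", "US", "UK", "DE"],
    "category": ["Electronics", "Clothing", "Electronics", "Electronics", "Clothing", "Electronics"],
    "revenue":  [120.0, 40.0, 75.0, 210.0, 55.0, 90.0],
}

def group_by_sum_count(seg, keys, metric, allowed_countries=None):
    """Hash aggregation: one pass over the columns, accumulating SUM and COUNT per key tuple."""
    acc = defaultdict(lambda: [0.0, 0])                      # key tuple -> [sum, count]
    key_cols = [seg[k] for k in keys]
    metric_col = seg[metric]
    for i in range(len(metric_col)):
        if allowed_countries and seg["country"][i] not in allowed_countries:
            continue                                         # toy filter, applied before aggregation
        key = tuple(col[i] for col in key_cols)
        acc[key][0] += metric_col[i]
        acc[key][1] += 1
    return {k: {"SUM": s, "COUNT": c} for k, (s, c) in acc.items()}

print(group_by_sum_count(segment, ["country", "category"], "revenue", allowed_countries={"US", "UK"}))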

Questions to guide you:

  • How do you aggregate across multiple segments concurrently?
  • What’s the memory overhead of a bitmap index per dimension value?
  • When is pre-aggregation (rollup) worth the storage cost?
  • How do you handle late-arriving data in a time-windowed system?

Learning milestones:

  1. Ingest 100K+ events/sec → You understand high-throughput ingestion
  2. GROUP BY queries return in <100ms → You understand hash aggregation
  3. Bitmap indexes accelerate filtered queries → You understand inverted indexes
  4. Time windows roll up correctly → You understand temporal aggregation

Project 6: Streaming Ingestion Pipeline (Kafka to Analytics)

  • File: LEARN_REALTIME_ANALYTICS_DATABASES_DEEP_DIVE.md
  • Main Programming Language: Java
  • Alternative Programming Languages: Go, Python, Scala
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Streaming / Event Processing
  • Software or Tool: Apache Kafka
  • Main Book: “Kafka: The Definitive Guide” by Gwen Shapira et al.

What you’ll build: A streaming ingestion pipeline that consumes events from Kafka, transforms them, and loads them into your analytics engine (from Project 5). You’ll implement exactly-once semantics, handle backpressure, and build a dead-letter queue for failed events.

Why it teaches real-time analytics: Real-time analytics starts with real-time data. Understanding Kafka’s architecture—partitions, consumer groups, offsets, and exactly-once semantics—is essential. This is how ClickHouse, Druid, and Pinot actually get their data.

Core challenges you’ll face:

  • Understanding Kafka partitions and consumers → maps to parallelism and ordering
  • Implementing exactly-once semantics → maps to idempotency and transactions
  • Handling backpressure → maps to when the sink is slower than the source
  • Schema evolution → maps to handling changing data formats
  • Building a DLQ → maps to error handling without blocking

Key Concepts:

Difficulty: Intermediate Time estimate: 1-2 weeks Prerequisites: Basic understanding of message queues, Project 5 helps but not required

Real world outcome:

$ ./kafka_ingester start \
    --bootstrap-servers localhost:9092 \
    --topic events \
    --consumer-group analytics-ingest \
    --sink http://localhost:8080/ingest

Kafka Ingester Started
  Topic: events (12 partitions)
  Consumer group: analytics-ingest
  Parallelism: 12 consumers
  Batch size: 1000 events
  Exactly-once: enabled

[Consumer-0] Partition 0: offset 1,234,567
[Consumer-1] Partition 1: offset 2,345,678
...

=== Stats (last 60 seconds) ===
Events consumed:     1,245,000
Events transformed:  1,244,850
Events loaded:       1,244,850
Events to DLQ:           150 (parse errors)
Throughput:          20,750 events/sec
Avg latency:         45ms (kafka → analytics)
Consumer lag:        2,340 events

=== Backpressure Event ===
[WARN] Sink slow, applying backpressure...
  Pausing consumers for 500ms
  Resuming...
  Lag recovered.

$ ./kafka_ingester dlq --show
Dead Letter Queue (150 events):
  - Offset 1234599: Invalid JSON at position 45
  - Offset 1234612: Missing required field 'timestamp'
  - Offset 1234689: Schema mismatch (expected int, got string)

Implementation Hints:

Consumer group coordination:

Topic: events (12 partitions)
Consumer Group: analytics-ingest

┌─────────┐   ┌─────────┐   ┌─────────┐
│Consumer1│   │Consumer2│   │Consumer3│
│ P0, P1  │   │ P2, P3  │   │ P4, P5  │  ...
│ P6, P7  │   │ P8, P9  │   │P10, P11 │
└─────────┘   └─────────┘   └─────────┘

If Consumer2 dies → P2,P3,P8,P9 rebalanced to 1 & 3
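
A minimal consumer-loop sketch, assuming the confluent-kafka Python client; topic, group, and batch-size values are illustrative, and the sink is a placeholder. With manual commits after a successful load this gives at-least-once delivery; exactly-once additionally needs an idempotent sink or Kafka transactions:

import json
from confluent_kafka import Consumer   # assumes the confluent-kafka package is installed

conf = {
    "bootstrap.servers": "localhost:9092",
    "group.id": "analytics-ingest",
    "enable.auto.commit": False,        # commit only after the batch is safely loaded
    "auto.offset.reset": "earliest",
}
consumer = Consumer(conf)
consumer.subscribe(["events"])
dead_letter = []                        # stand-in for a real DLQ topic

def load_into_analytics(batch):
    """Placeholder sink; a real pipeline would POST to the Project 5 ingest endpoint."""
    pass

try:
    batch = []
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            continue                    # real code would distinguish fatal vs transient errors
        try:
            batch.append(json.loads(msg.value()))
        except (ValueError, TypeError):
            dead_letter.append((msg.partition(), msg.offset()))   # park poison messages
        if len(batch) >= 1000:
            load_into_analytics(batch)
            consumer.commit(asynchronous=False)   # offsets advance only after a successful load
            batch = []
finally:
    consumer.close()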

Questions to guide you:

  • Why does Kafka guarantee ordering only within a partition?
  • How do you commit offsets to avoid reprocessing after restart?
  • What happens if your sink throws an error mid-batch?
  • How do you detect and handle poison messages?
  • What’s the difference between at-least-once, at-most-once, and exactly-once?

Learning milestones:

  1. Basic consumer processes messages → You understand Kafka consumption
  2. Consumer group rebalances correctly → You understand partition assignment
  3. Exactly-once: no duplicates after restart → You understand offset management
  4. DLQ captures bad messages without blocking → You understand error handling

Project 7: Vectorized Query Execution Engine

  • File: LEARN_REALTIME_ANALYTICS_DATABASES_DEEP_DIVE.md
  • Main Programming Language: Rust
  • Alternative Programming Languages: C++, C, Go
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Query Execution / CPU Optimization
  • Software or Tool: ClickHouse-style engine
  • Main Book: “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron

What you’ll build: A vectorized query execution engine that processes data in batches (vectors) of 1024+ values, leveraging SIMD (Single Instruction, Multiple Data) instructions to achieve billions of rows/second throughput.

Why it teaches real-time analytics: This is the performance secret of modern OLAP databases. ClickHouse processes billions of rows per second because of vectorized execution. You’ll understand why batch processing is fundamentally faster than row-at-a-time.

Core challenges you’ll face:

  • Batch processing architecture → maps to volcano model vs vectorized execution
  • SIMD operations → maps to processing 8-16 values per CPU instruction
  • Memory alignment → maps to cache-line efficiency and prefetching
  • Predicate pushdown → maps to filtering before aggregation
  • Pipeline fusion → maps to reducing intermediate materialization

Key Concepts:

  • Vectorized Execution: MonetDB/X100 Paper - Sections 1-4
  • SIMD Programming: “Computer Systems: A Programmer’s Perspective” Chapter 5 - Bryant & O’Hallaron
  • Cache Optimization: “What Every Programmer Should Know About Memory” - Ulrich Drepper
  • Query Execution: CMU 15-721 Advanced Database Systems - Vectorization lectures

Difficulty: Expert Time estimate: 3-4 weeks Prerequisites: Projects 1-5, comfortable with systems programming, understanding of CPU architecture, SIMD concepts

Real world outcome:

$ ./vectorized_engine bench --rows 1000000000 --query "SELECT SUM(value) WHERE flag = 1"

=== Vectorized Query Execution Benchmark ===
Dataset: 1,000,000,000 rows (8 GB)

Query: SELECT SUM(value) WHERE flag = 1

Execution Plan:
  → Scan(value, flag)
  → Filter(flag = 1)         [SIMD: 16 values/instruction]
  → Aggregate(SUM(value))    [SIMD: 8 values/instruction]

=== Row-at-a-time (baseline) ===
Time: 45.2 seconds
Throughput: 22.1 million rows/sec
CPU utilization: 12%

=== Vectorized (no SIMD) ===
Time: 4.8 seconds
Throughput: 208 million rows/sec
Speedup: 9.4x
CPU utilization: 45%

=== Vectorized + SIMD (AVX2) ===
Time: 0.89 seconds
Throughput: 1.12 billion rows/sec
Speedup: 50.8x
CPU utilization: 98%

=== Detailed Stats ===
Vector size: 1024 values
Vectors processed: 976,563
L1 cache misses: 0.2%
L2 cache misses: 0.01%
SIMD lanes utilized: 15.7 / 16 (98.1%)

Implementation Hints:

Vectorized vs row-at-a-time:

Row-at-a-time (Volcano model):
for each row:
    value = get_value(row)
    flag = get_flag(row)
    if flag == 1:          # Branch for EVERY row
        sum += value

Vectorized:
value_vector = get_values(batch_of_1024)
flag_vector = get_flags(batch_of_1024)
mask = simd_compare_eq(flag_vector, 1)  # 16 comparisons at once!
sum += simd_masked_sum(value_vector, mask)
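
Python can't issue SIMD intrinsics directly, but NumPy's array operations run tight native loops (often SIMD-accelerated), so a quick way to feel the row-at-a-time vs. batch gap before dropping to Rust or C is a sketch like this (sizes are illustrative):

import time
import numpy as np

n = 2_000_000
values = np.random.rand(n)
flags = np.random.randint(0, 2, size=n, dtype=np.int8)

# Row-at-a-time: one branch and at least one interpreter-level call per element
start = time.perf_counter()
total = 0.0
for i in range(n):
    if flags[i] == 1:
        total += values[i]
row_time = time.perf_counter() - start

# Batch-at-a-time: build a selection mask for the whole array, then a masked sum (no per-row branch)
start = time.perf_counter()
mask = flags == 1
vec_total = values[mask].sum()
vec_time = time.perf_counter() - start

print(f"row-at-a-time: {row_time:.2f}s  vectorized: {vec_time:.4f}s  "
      f"speedup: {row_time / vec_time:.0f}x  results match: {np.isclose(total, vec_total)}")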

Questions to guide you:

  • Why is branching expensive in tight loops? (Branch misprediction)
  • How does SIMD allow “branchless” filtering? (Masks)
  • Why does vector size matter? (Too small = overhead, too large = cache misses)
  • How do you handle the tail when row count isn’t divisible by vector size?

SIMD intrinsics to explore:

  • _mm256_loadu_pd - Load 4 doubles
  • _mm256_cmp_pd - Compare 4 doubles
  • _mm256_add_pd - Add 4 doubles
  • _mm256_maskstore_pd - Conditional store

Learning milestones:

  1. Batch processing shows measurable speedup → You understand loop overhead reduction
  2. SIMD code runs 4-8x faster than scalar → You understand data parallelism
  3. Cache utilization approaches 100% → You understand memory hierarchy
  4. Complex queries (filter + aggregate) vectorize → You understand operator fusion

Project 8: SQL Parser and Query Planner

  • File: LEARN_REALTIME_ANALYTICS_DATABASES_DEEP_DIVE.md
  • Main Programming Language: Rust
  • Alternative Programming Languages: Go, Python, Java
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Query Processing / Compilers
  • Software or Tool: SQL Engine
  • Main Book: “Language Implementation Patterns” by Terence Parr

What you’ll build: A SQL parser that converts SQL strings into an Abstract Syntax Tree (AST), and a query planner that transforms the AST into an optimized execution plan. Support SELECT, WHERE, GROUP BY, ORDER BY, and JOIN.

Why it teaches real-time analytics: SQL is the universal language of analytics. Understanding how SELECT SUM(x) FROM t WHERE y > 10 GROUP BY z becomes an execution plan reveals the translation layer between human intent and machine execution.

Core challenges you’ll face:

  • Lexical analysis (tokenizing) → maps to breaking SQL into tokens
  • Parsing (AST generation) → maps to recursive descent or parser generators
  • Semantic analysis → maps to type checking, name resolution
  • Query planning → maps to choosing join order, predicate pushdown
  • Plan optimization → maps to cost-based decisions

Key Concepts:

  • Recursive Descent Parsing: “Language Implementation Patterns” Chapters 2-4 - Terence Parr
  • Query Planning: “Database Internals” Chapter 10 - Alex Petrov
  • Cost-Based Optimization: CMU 15-445 Query Planning - Lecture notes
  • AST Design: “Engineering a Compiler” Chapter 4 - Cooper & Torczon

Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Basic understanding of compilers/parsers, Project 7 helps for execution

Real world outcome:

$ ./sql_engine parse "SELECT country, SUM(revenue) FROM sales WHERE year = 2024 GROUP BY country ORDER BY SUM(revenue) DESC LIMIT 10"

=== Tokens ===
SELECT, IDENTIFIER(country), COMMA, FUNCTION(SUM), LPAREN, IDENTIFIER(revenue), RPAREN, FROM, IDENTIFIER(sales), WHERE, IDENTIFIER(year), EQUALS, INTEGER(2024), GROUP, BY, IDENTIFIER(country), ORDER, BY, FUNCTION(SUM), LPAREN, IDENTIFIER(revenue), RPAREN, DESC, LIMIT, INTEGER(10)

=== Abstract Syntax Tree ===
Select {
  columns: [
    ColumnRef(country),
    FunctionCall(SUM, [ColumnRef(revenue)])
  ],
  from: TableRef(sales),
  where: BinaryOp(
    Equals,
    ColumnRef(year),
    Literal(2024)
  ),
  group_by: [ColumnRef(country)],
  order_by: [
    OrderTerm(FunctionCall(SUM, [ColumnRef(revenue)]), DESC)
  ],
  limit: 10
}

=== Logical Plan ===
Limit(10)
  └── Sort(SUM(revenue) DESC)
        └── Aggregate(GROUP BY country, SUM(revenue))
              └── Filter(year = 2024)
                    └── Scan(sales)

=== Optimized Physical Plan ===
Limit(10)
  └── TopN(10, SUM(revenue) DESC)      // Optimization: Sort→Limit becomes TopN
        └── HashAggregate(country, SUM(revenue))
              └── Filter(year = 2024)   // Predicate pushdown
                    └── ColumnScan(sales, [country, revenue, year])  // Column pruning

Optimizations applied:
  ✓ Column pruning: reading 3/15 columns
  ✓ Predicate pushdown: filter before aggregate
  ✓ Sort elimination: TopN instead of full sort

Implementation Hints:

SQL grammar (simplified):

query       → SELECT columns FROM table [WHERE expr] [GROUP BY cols] [ORDER BY cols] [LIMIT n]
columns     → column ("," column)*
column      → expression [AS alias]
expression  → term ((+ | -) term)*
term        → factor ((* | /) factor)*
factor      → literal | identifier | function_call | "(" expression ")"
function_call → identifier "(" arguments ")"
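
A minimal recursive-descent sketch in Python for just the expression/term/factor part of the grammar above; token names and the tuple-based AST are illustrative choices:

import re

TOKEN_RE = re.compile(r"\s*(?:(\d+)|([A-Za-z_]\w*)|(.))")

def tokenize(text):
    tokens = []
    for num, ident, op in TOKEN_RE.findall(text):
        if num:
            tokens.append(("INT", int(num)))
        elif ident:
            tokens.append(("IDENT", ident))
        elif op.strip():
            tokens.append(("OP", op))
    return tokens

class Parser:
    """Recursive descent over:  expression → term ((+|-) term)* ;  term → factor ((*|/) factor)*"""
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else (None, None)

    def next(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expression(self):
        node = self.term()
        while self.peek() in (("OP", "+"), ("OP", "-")):
            op = self.next()[1]
            node = ("BinaryOp", op, node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.peek() in (("OP", "*"), ("OP", "/")):
            op = self.next()[1]
            node = ("BinaryOp", op, node, self.factor())
        return node

    def factor(self):
        kind, value = self.next()
        if kind == "INT":
            return ("Literal", value)
        if kind == "IDENT":
            return ("ColumnRef", value)
        if (kind, value) == ("OP", "("):
            node = self.expression()
            self.next()                      # consume ")"
            return node
        raise SyntaxError(f"unexpected token {value!r}")

print(Parser(tokenize("revenue + 2 * (price - cost)")).expression())
# ('BinaryOp', '+', ('ColumnRef', 'revenue'),
#   ('BinaryOp', '*', ('Literal', 2), ('BinaryOp', '-', ('ColumnRef', 'price'), ('ColumnRef', 'cost'))))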

Questions to guide you:

  • How do you handle operator precedence (e.g., 1 + 2 * 3)?
  • What’s the difference between logical and physical plans?
  • When should predicate pushdown NOT be applied?
  • How do you represent aggregate functions in the AST?

Learning milestones:

  1. Tokenizer handles all SQL keywords → You understand lexical analysis
  2. Parser builds correct AST for complex queries → You understand grammars
  3. Planner produces executable plans → You understand query semantics
  4. Optimizer improves plan quality → You understand cost-based decisions

Project 9: Distributed Query Coordinator

  • File: LEARN_REALTIME_ANALYTICS_DATABASES_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, Java, C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: Distributed Systems / Query Routing
  • Software or Tool: Distributed Analytics Cluster
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A distributed query coordinator that routes queries to multiple analytics nodes, collects partial results, and merges them into final results. Support partition pruning, parallel execution, and failure handling.

Why it teaches real-time analytics: Production OLAP systems are distributed. ClickHouse, Druid, and Pinot all have coordinators that orchestrate queries across shards. Understanding this layer reveals how analytics scales horizontally.

Core challenges you’ll face:

  • Query routing → maps to which shards have relevant data
  • Partition pruning → maps to skipping irrelevant partitions
  • Parallel aggregation → maps to partial vs final aggregates
  • Failure handling → maps to retries, timeouts, partial results
  • Result merging → maps to ORDER BY, LIMIT across shards

Key Concepts:

  • Distributed Query Processing: “Designing Data-Intensive Applications” Chapter 6 - Martin Kleppmann
  • Two-Phase Aggregation: Druid Query Execution
  • Partition Strategies: “Database Internals” Chapter 12 - Alex Petrov
  • RPC Frameworks: gRPC Documentation

Difficulty: Expert Time estimate: 3-4 weeks Prerequisites: Projects 5-8, understanding of distributed systems concepts, RPC/networking

Real world outcome:

$ ./query_coordinator start --config cluster.yaml

Query Coordinator Started
  Nodes: 5 (node1:8080, node2:8080, node3:8080, node4:8080, node5:8080)
  Partitioning: hash(user_id) % 5
  Replication factor: 2

$ ./query_coordinator query "SELECT country, SUM(revenue) FROM events WHERE date = '2024-01-15' GROUP BY country ORDER BY SUM(revenue) DESC LIMIT 10"

=== Query Execution Plan ===
Scatter phase:
  → node1: partition 0 (50M rows, date partition match)
  → node2: partition 1 (48M rows, date partition match)
  → node3: partition 2 (52M rows, date partition match)
  → node4: partition 3 (49M rows, date partition match)
  → node5: partition 4 (51M rows, date partition match)

Partition pruning: Skipped 30 date partitions (95% data skipped)

Gather phase:
  Collect partial aggregates from 5 nodes
  Merge SUM(revenue) per country
  Global sort by revenue
  Apply LIMIT 10

=== Execution Stats ===
Scatter time: 12ms (parallel)
Node execution times: [45ms, 52ms, 48ms, 51ms, 47ms]
Gather time: 3ms
Total query time: 67ms

Results:
┌─────────────┬──────────────┐
│  country    │ SUM(revenue) │
├─────────────┼──────────────┤
│ USA         │   12,453,000 │
│ Germany     │    8,234,000 │
│ UK          │    7,891,000 │
│ France      │    5,432,000 │
│ Japan       │    4,321,000 │
...
└─────────────┴──────────────┘

$ ./query_coordinator status
Cluster Health:
  node1: healthy (response: 2ms)
  node2: healthy (response: 3ms)
  node3: degraded (response: 150ms)
  node4: healthy (response: 2ms)
  node5: unreachable (last seen: 5 minutes ago)

Failover: Queries to partition 4 routed to node3 (replica)

Implementation Hints:

Two-phase aggregation:

Phase 1 (Scatter): Each node computes partial aggregates
  node1: {USA: 2000, Germany: 1500}
  node2: {USA: 3000, UK: 2000}
  node3: {Germany: 1000, France: 1500}

Phase 2 (Gather): Coordinator merges
  USA: 2000 + 3000 = 5000
  Germany: 1500 + 1000 = 2500
  UK: 2000
  France: 1500
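
A minimal Python sketch of the gather phase for this example, merging per-shard partial sums and then applying the global ORDER BY ... LIMIT (the partials and the limit are illustrative):

import heapq
from collections import defaultdict

# Partial SUM(revenue) per country, as returned by each shard in the scatter phase
partials = [
    {"USA": 2000, "Germany": 1500},
    {"USA": 3000, "UK": 2000},
    {"Germany": 1000, "France": 1500},
]

def gather(partials, limit):
    """Gather phase: merge partial sums, then apply the global ORDER BY ... LIMIT."""
    merged = defaultdict(int)
    for shard_result in partials:
        for country, subtotal in shard_result.items():
            merged[country] += subtotal           # SUM is additive, so partials simply add up
    # Exact only because every shard returned *all* of its groups, not just its local top N
    return heapq.nlargest(limit, merged.items(), key=lambda kv: kv[1])

print(gather(partials, limit=3))
# [('USA', 5000), ('Germany', 2500), ('UK', 2000)]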

Questions to guide you:

  • How do you merge ORDER BY + LIMIT across shards efficiently?
  • What if one node times out? Return partial results or fail?
  • How do you handle COUNT(DISTINCT) across shards? (Hint: it’s hard!)
  • When should you push predicates to shards vs filter at coordinator?

Learning milestones:

  1. Queries scatter to multiple nodes → You understand query distribution
  2. Partition pruning skips irrelevant shards → You understand metadata-based optimization
  3. Aggregates merge correctly → You understand two-phase aggregation
  4. Failover works transparently → You understand fault tolerance

Project 10: Raft Consensus for Data Replication

  • File: LEARN_REALTIME_ANALYTICS_DATABASES_DEEP_DIVE.md
  • Main Programming Language: Rust
  • Alternative Programming Languages: Go, C++, Java
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 5: Master
  • Knowledge Area: Distributed Consensus / Replication
  • Software or Tool: Raft Implementation
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A complete Raft consensus implementation for replicating analytics data across nodes. Implement leader election, log replication, and snapshotting to ensure data survives node failures.

Why it teaches real-time analytics: High availability is non-negotiable for production systems. Understanding Raft reveals how ClickHouse, CockroachDB, and TiDB maintain consistency across replicas. This is the foundation of reliable distributed databases.

Core challenges you’ll face:

  • Leader election → maps to terms, votes, and timeouts
  • Log replication → maps to AppendEntries RPC, commit index
  • Safety guarantees → maps to election restriction, log matching
  • Snapshotting → maps to compacting logs, state transfer
  • Cluster membership changes → maps to joint consensus

Key Concepts:

Difficulty: Master Time estimate: 4-6 weeks Prerequisites: Strong understanding of distributed systems, Projects 4 and 9, comfort with concurrent programming

Real world outcome:

$ ./raft_cluster start --nodes 5

=== Raft Cluster Started ===
Nodes: node1:5001, node2:5002, node3:5003, node4:5004, node5:5005

[node1] State: Follower, Term: 0
[node2] State: Follower, Term: 0
[node3] State: Follower, Term: 0
[node4] State: Follower, Term: 0
[node5] State: Follower, Term: 0

[node3] Election timeout, becoming Candidate
[node3] RequestVote sent to all nodes
[node3] Received votes: 3/5 (majority)
[node3] State: Leader, Term: 1

$ ./raft_cluster write "INSERT INTO events VALUES (1, 'click', 1234567890)"

[Leader:node3] Appending entry: {index: 1, term: 1, cmd: "INSERT..."}
[Leader:node3] Replicating to followers...
  → node1: AppendEntries(prevIdx=0, entries=[1])
  → node2: AppendEntries(prevIdx=0, entries=[1])
  → node4: AppendEntries(prevIdx=0, entries=[1])
  → node5: AppendEntries(prevIdx=0, entries=[1])
[Leader:node3] Entry 1 committed (replicated to 5/5 nodes)
[Leader:node3] Applied to state machine: INSERT...

$ ./raft_cluster kill node3  # Kill the leader

[node3] TERMINATED

[node1] Leader heartbeat timeout, becoming Candidate
[node1] Term: 2, RequestVote sent
[node1] Received votes: 3/4 reachable nodes (still a majority of the full 5-node cluster)
[node1] State: Leader, Term: 2

$ ./raft_cluster status
Cluster Status:
  Leader: node1 (term 2)
  Followers: node2, node4, node5
  Unavailable: node3

  Log replication:
    node1: [1, 2, 3, 4, 5] commit=5 applied=5
    node2: [1, 2, 3, 4, 5] commit=5 applied=5
    node4: [1, 2, 3, 4, 5] commit=5 applied=5
    node5: [1, 2, 3, 4] commit=4 applied=4  # Catching up

$ ./raft_cluster restart node3

[node3] Rejoining cluster...
[node3] Discovered leader: node1 (term 2)
[node3] Log behind, receiving entries...
[node3] Caught up to commit index 5
[node3] State: Follower (healthy)

Implementation Hints:

Raft state machine:

           timeout          receive higher term
Follower ────────→ Candidate ────────────────→ Follower
    ↑                  │
    │                  │ majority votes
    │                  ▼
    └───────────── Leader
       (on higher term)

Log structure:

Index:   1      2      3      4      5
Term:    1      1      1      2      2
Command: SET    SET    SET    SET    SET
         x=1    y=2    z=3    x=5    y=6
                        ↑
                   Commit Index
         (applied to state machine)
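
A minimal Python sketch of the follower-side AppendEntries handling described in the Raft paper, with a list of dicts standing in for the persisted log (RPC plumbing, term checks on the receiver, and persistence are omitted):

def append_entries(log, prev_index, prev_term, entries, leader_commit, commit_index):
    """Follower-side AppendEntries (indices are 1-based; log[i-1] holds entry i).
    Returns (success, new_commit_index)."""
    # Log matching check: the follower must already hold the entry the leader says precedes these
    if prev_index > 0:
        if prev_index > len(log) or log[prev_index - 1]["term"] != prev_term:
            return False, commit_index          # leader will retry with an earlier prev_index
    # Delete any conflicting suffix, then append the leader's entries
    for i, entry in enumerate(entries):
        idx = prev_index + i + 1
        if idx <= len(log) and log[idx - 1]["term"] != entry["term"]:
            del log[idx - 1:]
        if idx > len(log):
            log.append(entry)
    # Advance the commit index, but never past what this follower actually has
    if leader_commit > commit_index:
        commit_index = min(leader_commit, len(log))
    return True, commit_index

log = [{"term": 1, "cmd": "SET x=1"}]
ok, commit = append_entries(log, prev_index=1, prev_term=1,
                            entries=[{"term": 2, "cmd": "SET y=2"}],
                            leader_commit=2, commit_index=1)
print(ok, commit, [e["cmd"] for e in log])      # True 2 ['SET x=1', 'SET y=2']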

Questions to guide you:

  • Why can’t a node vote for multiple candidates in the same term?
  • What happens if a leader dies before an entry is committed?
  • Why must leaders have all committed entries? (Election restriction)
  • How does snapshotting prevent unbounded log growth?

Learning milestones:

  1. Leader election works with timeouts → You understand term-based voting
  2. Log replication achieves consensus → You understand AppendEntries
  3. System survives leader failure → You understand fault tolerance
  4. Snapshotting compacts logs → You understand state machine snapshots

Project 11: Time-Series Compression Engine (Gorilla)

  • File: LEARN_REALTIME_ANALYTICS_DATABASES_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++, Go
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Time-Series / Compression
  • Software or Tool: Facebook Gorilla-style Engine
  • Main Book: “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron

What you’ll build: Implement Facebook’s Gorilla compression algorithm—the technique that achieves 12x compression on time-series data by exploiting timestamp regularity and value similarity. Your engine will compress and decompress time-series data in streaming fashion.

Why it teaches real-time analytics: Time-series is the dominant data type in observability (metrics, logs, traces). Gorilla compression enabled Facebook to store 10x more data in memory, dramatically improving query performance. This technique is used by TimescaleDB, InfluxDB, Prometheus, and many others.

Core challenges you’ll face:

  • Delta-of-delta timestamp encoding → maps to exploiting regular intervals
  • XOR-based value compression → maps to exploiting IEEE 754 float structure
  • Bit-level I/O → maps to reading/writing individual bits
  • Streaming compression → maps to compress without seeing full dataset
  • Random access within blocks → maps to block-based organization

Key Concepts:

Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Project 2, strong understanding of binary representation, bit manipulation

Real world outcome:

$ ./gorilla_engine compress --input metrics.csv --output metrics.gorilla

=== Gorilla Compression Results ===
Input: metrics.csv
  Rows: 10,000,000
  Columns: timestamp (int64), value (float64)
  Raw size: 160 MB (16 bytes/point)

Timestamp compression:
  Delta-of-delta distribution:
    0 (same delta): 96.2%    → 1 bit each
    Small delta: 3.1%         → 9 bits each
    Large delta: 0.7%         → 36 bits each
  Timestamp data: 1.9 MB (1.5 bits/timestamp avg)

Value compression:
  XOR distribution:
    Same value: 23.4%        → 1 bit each
    Same leading/trailing: 68.2% → ~12 bits each
    Different: 8.4%          → ~72 bits each
  Value data: 18.1 MB (14.5 bits/value avg)

Output: metrics.gorilla
  Compressed size: 20.0 MB
  Compression ratio: 8.0x
  Bits per point: 16.0 (vs 128 raw)

$ ./gorilla_engine query --file metrics.gorilla --range "2024-01-01 00:00:00" "2024-01-01 01:00:00"

Time range: 1 hour
Points returned: 3,600 (one point per second)
Decompression time: 0.23 ms
Throughput: 15.6 million points/sec

Sample output:
  2024-01-01 00:00:00 | 23.45
  2024-01-01 00:00:01 | 23.47
  2024-01-01 00:00:02 | 23.46
  ...

Implementation Hints:

Delta-of-delta encoding:

Timestamps: [1000, 1060, 1120, 1180, 1240, 1241]
Deltas:     [1000,   60,   60,   60,   60,    1]
D-of-D:     [1000,   60,    0,    0,    0,  -59]

Encoding:
  - D-of-D = 0: Write '0' (1 bit)
  - D-of-D in [-63, 64]: Write '10' + 7 bits
  - D-of-D in [-255, 256]: Write '110' + 9 bits
  - D-of-D in [-2047, 2048]: Write '1110' + 12 bits
  - Otherwise: Write '1111' + 32 bits

XOR value compression:

Values: [23.45, 23.47, 23.46]
XOR with previous:
  v1: 23.45 (store full 64 bits for first value)
  v2 XOR v1: 0x0000000000002000 (only a few bits differ!)
  v3 XOR v2: 0x0000000000001000

Encoding:
  - XOR = 0: Write '0' (1 bit)
  - Same leading zeros as previous: Write '10' + meaningful bits
  - Different: Write '11' + leading zero count + length + bits
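
A minimal Python sketch of the XOR step, reinterpreting float64 values as raw IEEE 754 bit patterns and reporting the leading-zero count and meaningful bits a Gorilla-style encoder would write (the sample values are illustrative):

import struct

def float_bits(x):
    """Reinterpret a float64 as its raw 64-bit pattern (IEEE 754)."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def xor_block(prev, curr):
    """Return (xor, leading_zeros, meaningful_bits) as a Gorilla-style value encoder sees them."""
    x = float_bits(prev) ^ float_bits(curr)
    if x == 0:
        return 0, 64, 0                          # identical value → a single '0' bit on the wire
    leading = 64 - x.bit_length()                # zero bits before the first differing bit
    trailing = (x & -x).bit_length() - 1         # zero bits after the last differing bit
    return x, leading, 64 - leading - trailing

for prev, curr in [(23.5, 23.5), (23.5, 23.75), (23.75, 23.625)]:
    x, lead, meaningful = xor_block(prev, curr)
    print(f"{prev} -> {curr}: xor={x:016x} leading_zeros={lead} meaningful_bits={meaningful}")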

Questions to guide you:

  • Why does delta-of-delta work so well on regular timestamps?
  • How do you handle the first value in each block?
  • What’s the worst case for Gorilla compression?
  • How do you implement efficient bit-level reading/writing?

Learning milestones:

  1. Delta-of-delta achieves >10x on timestamps → You understand temporal patterns
  2. XOR compression works on floats → You understand IEEE 754 exploitation
  3. Streaming compression maintains state → You understand block-based encoding
  4. Decompression is fast → You understand the decode path

Project 12: Materialized View Engine

  • File: LEARN_REALTIME_ANALYTICS_DATABASES_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, Java, Python
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Query Optimization / Caching
  • Software or Tool: Materialized View System
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A materialized view engine that automatically maintains pre-computed aggregations as new data arrives. Support incremental refresh (only process new data) and view composition (views based on views).

Why it teaches real-time analytics: Materialized views are how real-time dashboards achieve sub-second latency. Instead of computing aggregations at query time, they’re maintained continuously. ClickHouse, Druid, and Pinot all rely heavily on this technique.

Core challenges you’ll face:

  • Incremental maintenance → maps to computing deltas, not full refreshes
  • Aggregation functions → maps to which functions can be incrementally maintained
  • View dependencies → maps to cascading updates
  • Consistency guarantees → maps to exactly-once view updates
  • Memory management → maps to bounding view size

Key Concepts:

  • Materialized Views: “Designing Data-Intensive Applications” Chapter 3 - Martin Kleppmann
  • Incremental View Maintenance: Materialize IVM Documentation
  • Aggregation Theory: “Database Internals” Chapter 11 - Alex Petrov
  • OLAP Cubes: Apache Druid Rollup

Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Projects 5-6, understanding of aggregation functions

Real world outcome:

$ ./matview_engine start

Materialized View Engine Started
  Storage: RocksDB backend
  Refresh: Incremental (streaming)

$ ./matview_engine create-view "
  CREATE MATERIALIZED VIEW hourly_revenue AS
  SELECT
    date_trunc('hour', timestamp) AS hour,
    country,
    SUM(revenue) AS total_revenue,
    COUNT(*) AS order_count
  FROM orders
  GROUP BY 1, 2
"

View 'hourly_revenue' created
  Dimensions: 2 (hour, country)
  Metrics: 2 (SUM(revenue), COUNT(*))
  Estimated memory: 50 MB

$ ./matview_engine ingest --table orders --file orders_batch.json
Ingested 100,000 rows to 'orders'
View 'hourly_revenue' incrementally updated:
  New rows processed: 100,000
  View rows updated: 45 (dimensions touched)
  Update time: 12ms

$ ./matview_engine query "SELECT * FROM hourly_revenue WHERE country = 'US' ORDER BY hour DESC LIMIT 5"

Query executed against materialized view (not base table!)
Time: 0.8ms

┌─────────────────────┬─────────┬───────────────┬─────────────┐
│        hour         │ country │ total_revenue │ order_count │
├─────────────────────┼─────────┼───────────────┼─────────────┤
│ 2024-01-15 14:00:00 │ US      │     125,430   │      1,245  │
│ 2024-01-15 13:00:00 │ US      │     118,920   │      1,189  │
│ 2024-01-15 12:00:00 │ US      │     142,100   │      1,421  │
│ 2024-01-15 11:00:00 │ US      │     134,560   │      1,345  │
│ 2024-01-15 10:00:00 │ US      │      98,230   │        982  │
└─────────────────────┴─────────┴───────────────┴─────────────┘

$ ./matview_engine stats hourly_revenue
View: hourly_revenue
  Rows: 12,450
  Memory: 2.3 MB
  Last refresh: 2024-01-15 14:32:01
  Total updates: 1,245,000
  Avg update latency: 0.012ms

Implementation Hints:

Incremental aggregation:

Existing view:
  (hour=14:00, country=US) → {revenue: 100000, count: 1000}

New batch:
  (timestamp=14:30, country=US, revenue=500)
  (timestamp=14:31, country=US, revenue=600)

Incremental update:
  (hour=14:00, country=US) → {revenue: 100000+500+600, count: 1000+2}
                           = {revenue: 101100, count: 1002}
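
A minimal Python sketch of that incremental update, assuming a dict keyed by (hour, country) as the view state and SUM/COUNT as the only maintained aggregates (field names are illustrative):

from collections import defaultdict
from datetime import datetime

# View state: (hour, country) -> aggregates that can be maintained incrementally
view = defaultdict(lambda: {"revenue": 0.0, "count": 0})
view[(datetime(2024, 1, 15, 14, 0), "US")] = {"revenue": 100000.0, "count": 1000}

def apply_batch(view, rows):
    """Incremental refresh: fold only the new rows into the view rows they touch."""
    touched = set()
    for row in rows:
        hour = row["timestamp"].replace(minute=0, second=0, microsecond=0)  # date_trunc('hour', ...)
        key = (hour, row["country"])
        view[key]["revenue"] += row["revenue"]    # SUM: add the delta
        view[key]["count"] += 1                   # COUNT: add the number of new rows
        touched.add(key)
    return touched

batch = [
    {"timestamp": datetime(2024, 1, 15, 14, 30), "country": "US", "revenue": 500.0},
    {"timestamp": datetime(2024, 1, 15, 14, 31), "country": "US", "revenue": 600.0},
]
print(apply_batch(view, batch), dict(view))
# one view row touched: (2024-01-15 14:00, 'US') now holds revenue=101100.0, count=1002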

Which aggregations can be incrementally maintained?

  • ✅ SUM: add new values
  • ✅ COUNT: add count of new rows
  • ✅ MIN/MAX: compare with new values
  • ❌ MEDIAN: requires full dataset
  • ⚠️ AVG: maintain SUM and COUNT separately

Questions to guide you:

  • How do you handle deletes in incremental maintenance?
  • What if a dimension has unbounded cardinality?
  • How do you route incoming data to the right view rows?
  • When should you trigger a full refresh vs incremental?

Learning milestones:

  1. SUM/COUNT views update incrementally → You understand additive aggregations
  2. Views refresh in milliseconds → You understand incremental vs full refresh
  3. Queries hit views instead of base tables → You understand query rewriting
  4. Memory is bounded for high-cardinality dimensions → You understand eviction

Project 13: Approximate Query Processing (HyperLogLog & Quantiles)

  • File: LEARN_REALTIME_ANALYTICS_DATABASES_DEEP_DIVE.md
  • Main Programming Language: C++
  • Alternative Programming Languages: Rust, Go, Python
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Probabilistic Data Structures / Approximation
  • Software or Tool: Approximate Analytics Engine
  • Main Book: “Algorithms” by Sedgewick & Wayne

What you’ll build: Implement probabilistic data structures for approximate analytics: HyperLogLog for COUNT(DISTINCT), t-digest for quantiles/percentiles, and Count-Min Sketch for frequency estimation. Achieve 1000x memory savings with <1% error.

Why it teaches real-time analytics: Exact COUNT(DISTINCT) on a billion values requires storing all billion values. HyperLogLog does it in 12KB. These algorithms enable analytics at scale that would otherwise be impossible. ClickHouse, Redis, and BigQuery all use these techniques.

Core challenges you’ll face:

  • HyperLogLog implementation → maps to hashing and counting leading zeros
  • t-digest for quantiles → maps to clustering and merging
  • Count-Min Sketch → maps to hash tables with collision tolerance
  • Error bounds → maps to understanding accuracy vs memory tradeoffs
  • Mergeability → maps to combining sketches from distributed nodes

Key Concepts:

Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Understanding of hash functions, basic probability/statistics

Real world outcome:

$ ./approx_engine bench --data user_events.parquet --rows 1000000000

=== Approximate Query Processing Benchmark ===
Dataset: 1 billion user events
Columns: user_id (1M unique), event_type (100 unique), value (float)

=== COUNT(DISTINCT user_id) ===

Exact:
  Memory: ~64 MB (hash set of 1M 64-bit IDs, plus overhead)
  Time: 45 seconds
  Result: 1,000,000

HyperLogLog (precision 14):
  Memory: 16 KB (16,384 registers)
  Time: 3.2 seconds
  Result: 1,003,421
  Error: 0.34%
  Memory savings: ~4,000x

=== P99 Latency (99th percentile of value) ===

Exact (sort-based):
  Memory: 8 GB (all values)
  Time: 120 seconds
  Result: 4523.45ms

t-digest (compression 100):
  Memory: 24 KB
  Time: 8.5 seconds
  Result: 4518.92ms
  Error: 0.10%
  Memory savings: 333,000x

=== Top-K Event Types ===

Exact (hash map counting):
  Memory: 100 KB
  Time: 25 seconds
  Top 5: [(click, 452M), (view, 321M), (purchase, 89M), ...]

Count-Min Sketch (width=10000, depth=5):
  Memory: 400 KB
  Time: 4.2 seconds
  Top 5: [(click, ~451M), (view, ~323M), (purchase, ~90M), ...]
  Max overcount: 2.1%

=== Merging Distributed Sketches ===
Merging HyperLogLog from 10 nodes: 160 KB → 16 KB (merged)
Result: 1,002,892 unique users (0.29% error)
Merge time: 0.03ms

Implementation Hints:

HyperLogLog algorithm:

For each value x:
  1. h = hash(x)                        // 64-bit hash
  2. bucket = first 14 bits of h        // 16,384 buckets
  3. rest = the remaining 50 bits of h
  4. rho = 1 + number of leading zeros in rest (i.e., 1-indexed position of the first 1-bit)
  5. registers[bucket] = max(registers[bucket], rho)

Estimate:
  Z = 1 / sum(2^(-registers[j]))        // sum over all m registers
  estimate = alpha * m^2 * Z
  // alpha (≈0.7213 for large m) corrects for bias, m = number of buckets
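
To make the update and estimate steps concrete, here is a minimal single-file sketch in Go (names like hll, Add, and Estimate are illustrative, not from any reference implementation). It uses one byte per register, skips the small-range and large-range corrections from the original paper, and includes the element-wise-max Merge that makes distributed sketches combinable.

package main

import (
    "fmt"
    "hash/fnv"
    "math"
    "math/bits"
)

const p = 14     // precision: top 14 bits of the hash pick the bucket
const m = 1 << p // 16,384 registers

type hll struct{ reg [m]uint8 }

// Add hashes the value, picks a bucket from the top p bits, and records the
// 1-indexed position of the first 1-bit in the remaining bits.
func (h *hll) Add(value string) {
    f := fnv.New64a()
    f.Write([]byte(value))
    x := f.Sum64()
    bucket := x >> (64 - p)       // first p bits
    rest := x<<p | (1 << (p - 1)) // remaining bits, with a sentinel so rho stays bounded
    rho := uint8(bits.LeadingZeros64(rest)) + 1
    if rho > h.reg[bucket] {
        h.reg[bucket] = rho
    }
}

// Estimate applies the alpha-corrected estimator (no small/large-range correction).
func (h *hll) Estimate() float64 {
    sum := 0.0
    for _, r := range h.reg {
        sum += math.Pow(2, -float64(r))
    }
    alpha := 0.7213 / (1 + 1.079/float64(m)) // bias correction for large m
    return alpha * float64(m) * float64(m) / sum
}

// Merge takes the element-wise max of registers; this is why sketches built
// on different nodes can be combined without losing accuracy.
func (h *hll) Merge(other *hll) {
    for i := range h.reg {
        if other.reg[i] > h.reg[i] {
            h.reg[i] = other.reg[i]
        }
    }
}

func main() {
    var h hll
    for i := 0; i < 100000; i++ {
        h.Add(fmt.Sprintf("user-%d", i))
    }
    fmt.Printf("estimate: %.0f (true: 100000)\n", h.Estimate())
}

With m = 16,384 registers the expected standard error is roughly 1.04/√m ≈ 0.8%, so precision 14 comfortably meets the <1% target; halving the error costs roughly 4x the memory.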

t-digest concept:

Maintain clusters of values with centroids and counts.
Clusters near median can be larger (more values).
Clusters near tails (p1, p99) must be smaller (higher precision).

To compute P99:
  Walk the clusters in ascending order until the cumulative count reaches 99% of the total
  Interpolate within the cluster that crosses that threshold
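
Once the centroids are built and sorted by mean, the quantile query itself is short. Below is a rough Go sketch of just that step (Centroid and quantile are made-up names, and the clustering/compression half of t-digest, which is the genuinely hard part, is omitted): it walks clusters in ascending order and interpolates between neighbouring centroid means once the target rank falls inside a cluster.

package main

import "fmt"

// Centroid is one t-digest cluster: a mean value and how many points it absorbed.
type Centroid struct {
    Mean  float64
    Count float64
}

// quantile estimates the q-th quantile (0..1) from centroids sorted by Mean.
// Each centroid's mean is treated as sitting at the midpoint of its weight.
func quantile(cs []Centroid, q float64) float64 {
    total := 0.0
    for _, c := range cs {
        total += c.Count
    }
    target := q * total
    cum := 0.0
    for i, c := range cs {
        mid := cum + c.Count/2 // cumulative weight at this centroid's "middle"
        if target <= mid {
            if i == 0 {
                return c.Mean
            }
            prevMid := cum - cs[i-1].Count/2
            frac := (target - prevMid) / (mid - prevMid)
            return cs[i-1].Mean + frac*(c.Mean-cs[i-1].Mean)
        }
        cum += c.Count
    }
    return cs[len(cs)-1].Mean
}

func main() {
    // toy digest with smaller clusters at the tail, as t-digest produces
    cs := []Centroid{
        {Mean: 12, Count: 500}, {Mean: 30, Count: 300}, {Mean: 80, Count: 150},
        {Mean: 200, Count: 30}, {Mean: 900, Count: 12}, {Mean: 2500, Count: 5}, {Mean: 4500, Count: 3},
    }
    fmt.Printf("approx P99: %.1f\n", quantile(cs, 0.99))
}

Real implementations handle the first and last clusters more carefully, but walk-and-interpolate is the whole query path; all the cleverness lives in how the clusters are sized.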

Questions to guide you:

  • Why does counting leading zeros estimate cardinality?
  • How does HyperLogLog handle hash collisions?
  • Why does t-digest need more precision at the tails?
  • How do you merge Count-Min Sketches from different nodes?

Learning milestones:

  1. HyperLogLog estimates with <2% error → You understand cardinality estimation
  2. t-digest computes P50/P99 accurately → You understand quantile approximation
  3. Count-Min Sketch finds frequent items → You understand frequency estimation
  4. Sketches merge correctly → You understand distributed approximation

Project 14: Real-Time Dashboard Backend

  • File: LEARN_REALTIME_ANALYTICS_DATABASES_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, TypeScript, Python
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: API Design / Real-Time Updates
  • Software or Tool: Dashboard API Server
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A real-time dashboard backend that serves pre-aggregated metrics via REST/WebSocket, supports drill-down queries, handles time-range selections, and pushes live updates to connected clients.

Why it teaches real-time analytics: This is where analytics meets users. Understanding how to serve sub-second queries to dashboards, handle concurrent users, and push real-time updates ties together everything you’ve learned. This is what makes analytics “real-time” from the user’s perspective.

Core challenges you’ll face:

  • Query caching → maps to when to cache vs recompute
  • WebSocket updates → maps to pushing changes to clients
  • Time-range handling → maps to relative vs absolute times, timezones
  • Drill-down queries → maps to adding filters incrementally
  • Rate limiting → maps to protecting the analytics backend

Key Concepts:

  • API Design: “Design and Build Great Web APIs” - Mike Amundsen
  • WebSockets: RFC 6455
  • Caching Strategies: “Designing Data-Intensive Applications” - Martin Kleppmann
  • Real-Time Systems: Pusher Real-Time Guide

Difficulty: Intermediate. Time estimate: 1-2 weeks. Prerequisites: basic REST API development and familiarity with WebSockets; Project 5 or 12 helps.

Real world outcome:

$ ./dashboard_server start --port 8080 --analytics-backend localhost:9000

Dashboard Server Started
  REST API: http://localhost:8080/api
  WebSocket: ws://localhost:8080/ws
  Analytics backend: localhost:9000
  Cache TTL: 10 seconds

=== REST API Demo ===

$ curl "http://localhost:8080/api/metrics/revenue?groupBy=country&timeRange=last_24h"
{
  "query_time_ms": 12,
  "cache_hit": false,
  "time_range": {"start": "2024-01-14T14:00:00Z", "end": "2024-01-15T14:00:00Z"},
  "data": [
    {"country": "US", "revenue": 1245000, "orders": 12450},
    {"country": "UK", "revenue": 892000, "orders": 8920},
    {"country": "DE", "revenue": 756000, "orders": 7560}
  ]
}

# Same query again (cached)
$ curl "http://localhost:8080/api/metrics/revenue?groupBy=country&timeRange=last_24h"
{
  "query_time_ms": 0.3,
  "cache_hit": true,
  ...
}

# Drill-down
$ curl "http://localhost:8080/api/metrics/revenue?groupBy=city&filter=country:US&timeRange=last_24h"
{
  "query_time_ms": 8,
  "data": [
    {"city": "New York", "revenue": 423000},
    {"city": "Los Angeles", "revenue": 312000},
    ...
  ]
}

=== WebSocket Real-Time Updates ===

$ wscat -c ws://localhost:8080/ws

> {"subscribe": "revenue.by_country", "interval": "1m"}
< {"type": "snapshot", "data": [{"country": "US", "revenue": 1245000}, ...]}
< {"type": "update", "data": [{"country": "US", "revenue": 1245500}]}  // +500 revenue
< {"type": "update", "data": [{"country": "US", "revenue": 1246200}]}  // +700 revenue
...

=== Server Stats ===
Active connections: 47
Queries/sec: 234
Cache hit rate: 89%
Avg response time: 15ms (cached: 0.5ms, uncached: 45ms)

Implementation Hints:

Caching strategy:

Cache key = hash(query_params + floor(time / TTL))

For "last_24h" queries:
  - Recompute current hour every minute
  - Cache older hours until they "roll off" the 24h window

For absolute ranges:
  - Cache indefinitely (data won't change)
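
As a concrete version of the bucketing rule above, here is a small Go sketch of a cache key whose time component only changes once per TTL window (cacheKey and its parameters are illustrative): two identical queries arriving within the same window produce the same key, so the analytics backend is hit at most once per window per distinct query.

package main

import (
    "crypto/sha256"
    "fmt"
    "time"
)

// cacheKey buckets "now" into TTL-sized windows so identical queries within
// one window hash to the same key and can share a cached result.
func cacheKey(queryParams string, ttl time.Duration, now time.Time) string {
    bucket := now.Unix() / int64(ttl.Seconds()) // floor(time / TTL)
    sum := sha256.Sum256([]byte(fmt.Sprintf("%s|%d", queryParams, bucket)))
    return fmt.Sprintf("%x", sum[:8]) // a short prefix is enough for a cache map key
}

func main() {
    ttl := 10 * time.Second
    now := time.Now()
    k1 := cacheKey("metric=revenue&groupBy=country&timeRange=last_24h", ttl, now)
    k2 := cacheKey("metric=revenue&groupBy=country&timeRange=last_24h", ttl, now.Add(3*time.Second))
    fmt.Println(k1 == k2) // true whenever both calls land in the same 10s window
}

For absolute time ranges you can drop the bucket term entirely, which is what makes those entries safe to cache indefinitely.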

WebSocket update flow:

1. Client subscribes to metric
2. Server stores subscription
3. When new data ingested:
   - Compute affected metrics
   - Push delta to subscribed clients
4. Client applies delta to local state
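
Most of this flow is bookkeeping around a subscription registry. The Go sketch below shows one possible shape for that registry (Hub, Update, and the buffered channels are all illustrative, and the actual WebSocket transport, for example a library like gorilla/websocket, is assumed rather than shown): ingestion calls Publish, and each connection's goroutine drains its channel and forwards frames to the client.

package main

import (
    "fmt"
    "sync"
)

// Update is a delta for one metric, e.g. {"country": "US", "revenue": +500}.
type Update struct {
    Metric string
    Delta  map[string]float64
}

// Hub tracks which subscriber channels want which metric.
type Hub struct {
    mu   sync.RWMutex
    subs map[string][]chan Update // metric name -> subscriber channels
}

func NewHub() *Hub { return &Hub{subs: make(map[string][]chan Update)} }

// Subscribe registers interest in a metric; the caller forwards whatever
// arrives on the returned channel over its WebSocket connection.
func (h *Hub) Subscribe(metric string) chan Update {
    ch := make(chan Update, 16)
    h.mu.Lock()
    h.subs[metric] = append(h.subs[metric], ch)
    h.mu.Unlock()
    return ch
}

// Publish fans a delta out to every subscriber of the metric. A full buffer
// drops the update rather than blocking ingestion (clients re-sync via snapshot).
func (h *Hub) Publish(u Update) {
    h.mu.RLock()
    defer h.mu.RUnlock()
    for _, ch := range h.subs[u.Metric] {
        select {
        case ch <- u:
        default:
        }
    }
}

func main() {
    hub := NewHub()
    ch := hub.Subscribe("revenue.by_country")
    hub.Publish(Update{Metric: "revenue.by_country", Delta: map[string]float64{"US": 500}})
    fmt.Println(<-ch) // the connection goroutine would push this as an "update" frame
}

A real server also needs an unsubscribe path tied to connection close and a periodic snapshot for clients whose buffers overflowed, which is exactly what the disconnect question below is getting at.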

Questions to guide you:

  • How do you handle timezone differences for “last 24 hours”?
  • When should you invalidate cache vs let it expire?
  • How do you prevent slow queries from blocking fast ones?
  • How do you handle client disconnects gracefully?

Learning milestones:

  1. REST API serves queries with caching → You understand query serving
  2. WebSocket pushes real-time updates → You understand push notifications
  3. Drill-down queries work efficiently → You understand filter composition
  4. System handles 100+ concurrent clients → You understand connection management

Project 15: Mini-ClickHouse (Putting It All Together)

  • File: LEARN_REALTIME_ANALYTICS_DATABASES_DEEP_DIVE.md
  • Main Programming Language: Rust
  • Alternative Programming Languages: C++, Go
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 5. The “Industry Disruptor”
  • Difficulty: Level 5: Master
  • Knowledge Area: Full OLAP Database
  • Software or Tool: Complete OLAP System
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A complete, working OLAP database that combines: columnar storage (Projects 1 and 3), compression (Projects 2 and 11), LSM-tree ingestion (Project 4), vectorized execution (Project 7), SQL parsing (Project 8), and materialized views (Project 12). Query billions of rows in seconds.

Why it teaches real-time analytics: This is the capstone. Building a complete OLAP database forces you to understand how all the pieces fit together—the tradeoffs, the optimizations, and the engineering decisions that make systems like ClickHouse process billions of rows per second.

Core challenges you’ll face:

  • Architecture integration → maps to how components interact
  • Memory management → maps to buffer pools, memory limits
  • Concurrent query execution → maps to isolation, resource sharing
  • Storage management → maps to partitioning, compaction
  • Performance tuning → maps to identifying and fixing bottlenecks

Key Concepts:

Difficulty: Master. Time estimate: 2-3 months. Prerequisites: all previous projects (especially 3, 4, 7, and 8) and strong systems programming skills.

Real world outcome:

$ ./mini-clickhouse server start

╔══════════════════════════════════════════════════════════════╗
║                     Mini-ClickHouse v1.0                      ║
║           A Learning Project - Not for Production!            ║
╚══════════════════════════════════════════════════════════════╝

Storage engine: LSM-Tree + Columnar
Compression: Gorilla (timestamps), LZ4 (general)
Execution: Vectorized (AVX2)
Memory limit: 8 GB

Listening on:
  Native protocol: localhost:9000
  HTTP interface: localhost:8123

$ ./mini-clickhouse client

mini-clickhouse :) CREATE TABLE events (
    timestamp DateTime,
    user_id UInt64,
    event_type String,
    revenue Float64
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (timestamp, user_id);

CREATE TABLE events
Ok.
0 rows in set. Elapsed: 0.012 sec.

mini-clickhouse :) INSERT INTO events SELECT * FROM file('events_1B.csv');

INSERT INTO events
1,000,000,000 rows in set. Elapsed: 145.23 sec.
Processed 6.89 million rows/s, 1.10 GB/s.

Storage size: 12.4 GB (92 GB raw → 7.4x compression)

mini-clickhouse :) SELECT
    event_type,
    count() AS events,
    sum(revenue) AS total_revenue,
    avg(revenue) AS avg_revenue
FROM events
WHERE timestamp >= '2024-01-01' AND timestamp < '2024-02-01'
GROUP BY event_type
ORDER BY total_revenue DESC
LIMIT 10;

┌─event_type──┬────events─┬─total_revenue─┬─avg_revenue─┐
│ purchase    │ 89234521  │  892,345,210  │   10.00     │
│ subscription│ 12453678  │  248,756,120  │   19.97     │
│ refund      │   892345  │  -17,846,900  │  -20.00     │
│ ...         │   ...     │     ...       │   ...       │
└─────────────┴───────────┴───────────────┴─────────────┘

10 rows in set. Elapsed: 0.892 sec.
Processed 312.45 million rows, 2.50 GB (350.28 million rows/s, 2.80 GB/s).

Execution plan:
  PartitionPrune(2024-01 only): 12 → 1 partitions
  ColumnScan: [event_type, revenue]
  VectorizedFilter: timestamp range
  HashAggregate: GROUP BY event_type
  TopNSort: ORDER BY total_revenue DESC LIMIT 10

mini-clickhouse :) EXPLAIN PIPELINE
SELECT count(DISTINCT user_id) FROM events WHERE event_type = 'purchase';

┌─explain───────────────────────────────────────────┐
│ (Aggregating)                                     │
│   AggregatingTransform (HyperLogLog)              │
│     (Filter)                                      │
│       FilterTransform (SIMD comparison)           │
│         (Scan)                                    │
│           MergeTreeThread × 4                     │
│             Decompress (LZ4) → Vectorized Scan    │
└───────────────────────────────────────────────────┘

Implementation Hints:

Architecture overview:

┌───────────────────────────────────────────────────────────────┐
│                        SQL Interface                          │
│                    (Parser → Planner → Optimizer)             │
├───────────────────────────────────────────────────────────────┤
│                    Query Execution Engine                     │
│              (Vectorized, SIMD, Parallel Pipelines)           │
├───────────────────────────────────────────────────────────────┤
│     Aggregation Engine        │    Materialized Views         │
│  (Hash, HLL, t-digest, etc.)  │   (Incremental Refresh)       │
├───────────────────────────────┼───────────────────────────────┤
│              Storage Engine (MergeTree)                       │
│  ┌─────────────────┐  ┌─────────────────────────────────┐    │
│  │ In-Memory       │  │ On-Disk                         │    │
│  │ (WAL + MemTable)│  │ (Columnar Parts + Compression)  │    │
│  └─────────────────┘  └─────────────────────────────────┘    │
├───────────────────────────────────────────────────────────────┤
│                    Partition Manager                          │
│          (Partition Pruning, Compaction, TTL)                 │
└───────────────────────────────────────────────────────────────┘

Integration questions:

  • How does the query planner decide when to use HyperLogLog vs exact count?
  • How do you balance memory between concurrent queries?
  • When should compaction run vs serving queries?
  • How do you handle schema changes on existing data?

Start with a single-threaded, single-table system. Add features incrementally. Profile constantly—real databases spend years on optimization.
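
One way to keep the integration tractable is to pin down the seams between layers as narrow interfaces before writing much code. The Go sketch below is purely illustrative (the project itself targets Rust, and every name here is invented), but it mirrors the diagram above: SQL flows down into a plan, the plan pulls column batches from storage, and batches flow back up through vectorized operators.

package main

// Batch is a chunk of columnar data: one slice per referenced column.
// A real engine uses typed, SIMD-friendly buffers instead of []any.
type Batch struct {
    Columns map[string][]any
    Rows    int
}

// Storage hands out column batches for a table, already pruned to the
// partitions that can possibly match the query's time range.
type Storage interface {
    Scan(table string, columns []string, partitionFilter func(part string) bool) (BatchIterator, error)
}

type BatchIterator interface {
    Next() (*Batch, bool) // false when the scan is exhausted
}

// Operator is one stage of a vectorized pipeline (filter, aggregate, sort...).
// Each stage consumes batches from its child and produces transformed batches.
type Operator interface {
    Next() (*Batch, bool)
}

// Planner turns a parsed statement into a tree of operators rooted at the one
// the client drains; partition pruning, materialized-view rewriting, and the
// exact-vs-HyperLogLog decision all live here.
type Planner interface {
    Plan(sql string, storage Storage) (Operator, error)
}

func main() {} // interfaces only; the point is the shape of the seams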

Learning milestones:

  1. Basic queries work end-to-end → You understand the full stack
  2. Performance matches expectations → You understand bottlenecks
  3. System handles concurrent queries → You understand resource management
  4. Compression + vectorization work together → You understand the synergies

Project Comparison Table

Project                         | Difficulty   | Time       | Depth of Understanding | Fun Factor
1. Row vs Column Visualizer     | Beginner     | Weekend    | ★★★☆☆                  | ★★★★☆
2. Compression Playground       | Intermediate | 1-2 weeks  | ★★★★☆                  | ★★★☆☆
3. Mini-Parquet Reader/Writer   | Advanced     | 2-3 weeks  | ★★★★★                  | ★★★★☆
4. LSM-Tree Storage Engine      | Advanced     | 3-4 weeks  | ★★★★★                  | ★★★★☆
5. Metrics Aggregator           | Advanced     | 2-3 weeks  | ★★★★☆                  | ★★★★★
6. Kafka Ingestion Pipeline     | Intermediate | 1-2 weeks  | ★★★☆☆                  | ★★★☆☆
7. Vectorized Execution Engine  | Expert       | 3-4 weeks  | ★★★★★                  | ★★★★★
8. SQL Parser & Planner         | Advanced     | 2-3 weeks  | ★★★★☆                  | ★★★☆☆
9. Distributed Coordinator      | Expert       | 3-4 weeks  | ★★★★★                  | ★★★★☆
10. Raft Consensus              | Master       | 4-6 weeks  | ★★★★★                  | ★★★★★
11. Gorilla Compression         | Advanced     | 2-3 weeks  | ★★★★☆                  | ★★★★☆
12. Materialized Views          | Advanced     | 2-3 weeks  | ★★★★☆                  | ★★★☆☆
13. Approximate Queries (HLL)   | Advanced     | 2-3 weeks  | ★★★★★                  | ★★★★☆
14. Dashboard Backend           | Intermediate | 1-2 weeks  | ★★★☆☆                  | ★★★★★
15. Mini-ClickHouse             | Master       | 2-3 months | ★★★★★                  | ★★★★★

If you have 1 month (quick foundation):

  1. Project 1: Row vs Column Visualizer (weekend)
  2. Project 2: Compression Playground (1 week)
  3. Project 5: Metrics Aggregator (2 weeks)

If you have 3 months (solid understanding):

  1. Projects 1-2 (2 weeks)
  2. Project 3: Mini-Parquet (2 weeks)
  3. Project 5: Metrics Aggregator (2 weeks)
  4. Project 6: Kafka Ingestion (1 week)
  5. Project 7: Vectorized Execution (3 weeks)
  6. Project 8: SQL Parser (2 weeks)

If you have 6+ months (deep mastery):

Complete all projects in order. Each builds on the previous.

Start with Project 1 (Row vs Column Visualizer). It’s the fastest way to viscerally understand why columnar databases exist. From there, choose based on your interests:

  • Performance-focused: Projects 2 → 7 → 11
  • Distributed systems-focused: Projects 4 → 9 → 10
  • Full-stack analytics: Projects 5 → 6 → 12 → 14

Final Capstone: Production-Ready Analytics Platform

After completing the individual projects, build a complete analytics platform:

  • File: LEARN_REALTIME_ANALYTICS_DATABASES_DEEP_DIVE.md
  • Main Programming Language: Rust + Go + TypeScript
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 5. The “Industry Disruptor”
  • Difficulty: Level 5: Master
  • Knowledge Area: Full Analytics Stack
  • Software or Tool: Complete Analytics Platform

What you’ll build: Combine Projects 3-15 into a unified platform: columnar storage with compression, streaming ingestion from Kafka, SQL query interface, distributed query execution, materialized views, approximate aggregations, and a real-time dashboard. Deploy it on 3+ nodes with replication.

Why it matters: This is a portfolio-defining project. You’ll have built, from scratch, a system that competes conceptually with ClickHouse, Druid, or Pinot. Even if it’s slower or has fewer features, the fact that you built it demonstrates mastery of real-time analytics.

Real world outcome:

  • Ingest 100K+ events/second
  • Query billions of rows in <1 second
  • Survive node failures without data loss
  • Serve real-time dashboards with <100ms latency
  • Support standard SQL queries

This is not a weekend project. Budget 3-6 months for a complete implementation. But when you’re done, you’ll understand real-time analytics databases at a level few engineers ever achieve.


Summary

#     | Project                               | Main Language
1     | Row vs Column Storage Visualizer      | Python
2     | Compression Algorithm Playground      | C
3     | Mini-Parquet File Reader/Writer       | Go
4     | LSM-Tree Storage Engine               | Rust
5     | Real-Time Metrics Aggregator          | Go
6     | Streaming Ingestion Pipeline (Kafka)  | Java
7     | Vectorized Query Execution Engine     | Rust
8     | SQL Parser and Query Planner          | Rust
9     | Distributed Query Coordinator         | Go
10    | Raft Consensus for Replication        | Rust
11    | Time-Series Compression (Gorilla)     | C
12    | Materialized View Engine              | Go
13    | Approximate Query Processing (HLL)    | C++
14    | Real-Time Dashboard Backend           | Go
15    | Mini-ClickHouse (Capstone)            | Rust
Final | Production Analytics Platform         | Rust + Go + TypeScript

Key Resources

Books

  • “Designing Data-Intensive Applications” by Martin Kleppmann - The bible for understanding modern data systems
  • “Database Internals” by Alex Petrov - Deep dive into storage engines and distributed databases
  • “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron - Essential for understanding performance

Sources used in this guide: