
LEARN SPARK DEEP DIVE

Learn Apache Spark: From First Principles to a Distributed Computing Engine

Goal: To deeply understand distributed data processing by dissecting Apache Spark’s core concepts. You will learn what problems Spark solves, how it works under the hood, and ultimately build your own simplified distributed computing engine from scratch in C to grasp the fundamentals that high-level tools abstract away.


Why Learn Distributed Computing from Scratch?

Apache Spark is the industry standard for big data processing. It allows you to analyze datasets that are terabytes or even petabytes in size, far too large for a single machine. While most developers use its high-level APIs in Python, Scala, or SQL, the true power and complexity lie in how it distributes work, handles failures, and shuffles data across a cluster.

By building a simplified version in C, you will bypass the magic of high-level APIs and confront the raw challenges of network communication, process management, fault tolerance, and data locality. This bottom-up approach provides an unparalleled understanding that makes you a more effective and insightful data engineer.

After completing these projects, you will:

  • Understand what problems distributed systems like Spark, Flink, and Hadoop were created to solve.
  • Read and understand the architecture of a distributed computing job.
  • Grasp the critical importance of data shuffling and its performance implications.
  • Implement core concepts like MapReduce, lazy evaluation, and fault tolerance.
  • Write a distributed computing engine that can execute jobs across multiple machines.

Core Concept Analysis

The Big Picture: What Spark Does

At its core, Spark is a system for parallel processing of large datasets across a cluster of computers. Imagine you have a 1TB log file and need to find all error messages. A single computer would take hours. Spark splits that file into thousands of chunks, sends each chunk to a different computer (worker), has each worker search its tiny chunk simultaneously, and then aggregates the results.

┌───────────────────────────┐
│     Driver Program        │
│ (Your Code, e.g., Python) │
│  - Creates SparkContext   │
│  - Defines RDDs/DataFrames│
│  - Calls Actions          │
└────────────┬──────────────┘
             │ 1. Submits Job
             ▼
┌───────────────────────────┐
│      Cluster Manager      │
│ (e.g., YARN, Standalone)  │
│ - Allocates resources     │
└────────────┬──────────────┘
             │ 2. Launches Executors
             ▼
┌────────────┴────────────┐
│      Worker Nodes       │
│ ┌─────────────────────┐ │  ┌─────────────────────┐
│ │ Executor Process    │ │  │ Executor Process    │
│ │ - Caches data       │ │  │ - Caches data       │
│ │ - Executes tasks    │ │  │ - Executes tasks    │
│ └─────────────────────┘ │  └─────────────────────┘
└─────────────────────────┘

Key Concepts Explained

  1. Driver, Executor, and Worker:
    • Driver: The process running your main() function. It creates the SparkContext, builds the DAG, and tells the cluster manager what to do.
    • Worker Node: A machine in the cluster.
    • Executor: A process launched on a worker node that runs tasks and keeps data in memory.
  2. RDDs, DataFrames, and the DAG (Directed Acyclic Graph):
    • RDD (Resilient Distributed Dataset): The original and most fundamental abstraction. It’s an immutable, partitioned collection of records that can be operated on in parallel. If a partition is lost, Spark can replay the operations (its lineage) to recreate it.
    • DataFrame/Dataset: A higher-level, structured API that looks like a table with columns. Because the schema is known, Spark’s Catalyst optimizer can aggressively optimize these operations.
    • Transformations (Lazy): Operations like map(), filter(), join(). They don’t execute immediately; they build up a plan of execution called a DAG.
    • Actions (Eager): Operations like count(), collect(), save(). They trigger the execution of the DAG. This lazy evaluation is a key optimization.
  3. Shuffle:
    • The most expensive operation in a distributed system. It’s the process of redistributing data across partitions.
    • Any operation that needs to see all data for a given key (e.g., reduceByKey, groupBy, join) will trigger a shuffle. This involves massive network and disk I/O as workers exchange data. Understanding and minimizing shuffles is critical for performance.
    MAP PHASE (on each worker)        SHUFFLE PHASE (network)          REDUCE PHASE (on each worker)
    ┌──────────┐                      ┌──────────┐                     ┌──────────┐
    │ (A, 1)   │ ─────────┐           │ (A, 1)   │                     │ Reducer 1│
    │ (B, 1)   │ ─┐       │           │ (A, 1)   │ ───────────►        │ (A, 2)   │
    └──────────┘  │       ├─────────► └──────────┘                     └──────────┘
                  │       │
    ┌──────────┐  │       │           ┌──────────┐                     ┌──────────┐
    │ (C, 1)   │ ─┼───────┘           │ (B, 1)   │                     │ Reducer 2│
    │ (A, 1)   │ ─┘                   │ (B, 1)   │ ───────────►        │ (B, 2)   │
    └──────────┘                      └──────────┘                     └──────────┘
                                      ┌──────────┐                     ┌──────────┐
                                      │ (C, 1)   │                     │ Reducer 3│
                                      │ (C, 1)   │ ───────────►        │ (C, 2)   │
                                      └──────────┘                     └──────────┘
    
  4. Competitors & Modern Tools:
    • Apache Flink: Sometimes positioned as a successor to Spark; in practice it is a strong alternative, especially for real-time stream processing with low latency.
    • Apache Beam: A programming model for both batch and streaming. You write your code once, and it can run on different “runners” like Spark, Flink, or Google Cloud Dataflow.
    • Dask: A parallel computing library for Python that scales from single machines to clusters. It integrates tightly with the Python data science ecosystem (Pandas, NumPy).
    • Cloud Warehouses (Snowflake, BigQuery, Redshift): These are fully managed, SQL-first platforms that handle the distributed computing for you, but offer less programmatic control than Spark. They are competitors for SQL-based ETL and analytics workloads.

Project List

The following 8 projects will guide you from single-core processing concepts to building your own distributed computing framework in C.


Project 1: Single-Threaded Map and Reduce

  • File: LEARN_SPARK_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Python, Go, Rust
  • Coolness Level: Level 1: Pure Corporate Snoozefest
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 1: Beginner
  • Knowledge Area: File I/O / String Processing
  • Software or Tool: Standard C Library
  • Main Book: “The C Programming Language” by Kernighan & Ritchie (K&R)

What you’ll build: A command-line tool that mimics grep | wc -l. It will read a file line by line (map over lines), filter lines containing a specific word, and then count the matching lines (reduce the count).

Why it teaches the core concept: This isolates the conceptual flow of MapReduce without any of the complexity of parallelism. You learn that data processing is a pipeline of transformations (map, filter) followed by an aggregation (reduce).

Core challenges you’ll face:

  • Efficient file reading → maps to using fgets and managing buffers
  • String searching → maps to implementing or using strstr
  • Stateful reduction → maps to accumulating a count across many map operations

Key Concepts:

  • File I/O: “The C Programming Language” (K&R), Chapter 7
  • String Manipulation: “The C Programming Language” (K&R), Chapter 5

Difficulty: Beginner Time estimate: A few hours Prerequisites: Basic C programming knowledge.

Real world outcome:

$ cat story.txt
it was the best of times
it was the worst of times
it was the age of wisdom
it was the age of foolishness

$ ./mapreduce_single story.txt "wisdom"
Found 1 matching lines.

$ ./mapreduce_single story.txt "it was"
Found 4 matching lines.

Implementation Hints: Your main function should:

  1. Open the file provided as an argument.
  2. Initialize a counter variable to zero.
  3. Loop through each line of the file.
  4. Inside the loop (the “map” phase), check if the line contains the search term.
  5. If it does, increment the counter (the “reduce” action).
  6. After the loop, print the final count. This simple structure is the seed for everything that follows.
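
A minimal sketch of that structure, assuming the tool is invoked as ./mapreduce_single <file> <term> and that no line exceeds the fixed buffer size:

#include <stdio.h>
#include <string.h>

int main(int argc, char **argv) {
    if (argc != 3) {
        fprintf(stderr, "usage: %s <file> <search term>\n", argv[0]);
        return 1;
    }

    FILE *fp = fopen(argv[1], "r");
    if (fp == NULL) {
        perror("fopen");
        return 1;
    }

    char line[4096];
    long count = 0;                            /* the "reduce" accumulator */
    while (fgets(line, sizeof(line), fp)) {
        if (strstr(line, argv[2]) != NULL)     /* the "map"/filter step */
            count++;
    }
    fclose(fp);

    printf("Found %ld matching lines.\n", count);
    return 0;
}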

Learning milestones:

  1. Read a file and print its contents → You understand basic file handling.
  2. Filter lines based on a search term → You’ve implemented a map operation.
  3. Correctly count the filtered lines → You’ve implemented a reduce operation.

Project 2: Multi-Process Word Count (Single Machine)

  • File: LEARN_SPARK_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Go, Rust
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Process Management / Inter-Process Communication (IPC)
  • Software or Tool: fork(), pipe()
  • Main Book: “The Linux Programming Interface” by Michael Kerrisk

What you’ll build: A tool that splits a large text file into N chunks and spawns N child processes to count words in each chunk simultaneously. The main (parent) process will collect the partial counts from each child and sum them up.

Why it teaches the core concept: This is your first taste of parallel computing! It forces you to deal with data partitioning (how to split the file?), process management (fork), and collecting results from multiple workers (pipe or temporary files). This is a microcosm of what Spark does on a single node.

Core challenges you’ll face:

  • Data Partitioning: How do you split a file into N chunks without breaking words in the middle? → maps to finding newline boundaries.
  • Process Creation: Spawning child processes with fork(). → maps to understanding process lifecycle.
  • Result Aggregation: Getting the partial counts back from each child process to the parent. → maps to using pipes for IPC.
  • Synchronization: Waiting for all child processes to finish before calculating the final sum. → maps to using waitpid().
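
The newline-boundary adjustment from the first challenge above can live in a small helper. This is only a sketch (the function name is illustrative); the matching rule must also be applied to each chunk's start offset so no line is counted twice:

#include <sys/types.h>
#include <unistd.h>

/* Move a nominal end offset forward to just past the next '\n', so a chunk
   never splits a line. The caller applies the mirror rule to each chunk's
   start offset (skip the partial line that belongs to the previous chunk). */
off_t adjust_chunk_end(int fd, off_t nominal_end, off_t file_size) {
    if (nominal_end >= file_size)
        return file_size;

    char c;
    lseek(fd, nominal_end, SEEK_SET);
    while (read(fd, &c, 1) == 1) {
        nominal_end++;              /* consume bytes up to and including '\n' */
        if (c == '\n')
            break;
    }
    return nominal_end;
}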

Key Concepts:

  • Process Creation: “The Linux Programming Interface”, Chapter 24 (fork)
  • Pipes and FIFOs: “The Linux Programming Interface”, Chapter 44
  • Waiting for Child Processes: “The Linux Programming Interface”, Chapter 26 (wait)

Difficulty: Intermediate Time estimate: 1-2 days Prerequisites: Project 1, comfort with pointers and system calls.

Real world outcome:

# A 1GB text file
$ ls -lh large_file.txt
-rw-r--r-- 1 user user 1.0G Dec 21 14:30 large_file.txt

# Run with 4 parallel processes
$ ./parallel_wc large_file.txt 4
File split into 4 chunks.
Spawning 4 worker processes...
Worker 1 finished, partial count: 45,102,301
Worker 2 finished, partial count: 44,998,102
Worker 3 finished, partial count: 45,001,888
Worker 4 finished, partial count: 45,050,009
All workers finished.
Total words: 180,152,300

Implementation Hints:

  1. The parent process should first determine the file size.
  2. Calculate the approximate chunk size (file_size / N).
  3. The parent creates N pipes, one for each child it will spawn.
  4. The parent loops N times; in each iteration:
    • It fork()s.
    • Child process:
      • Calculates its start and end byte offsets in the file. It must adjust the end offset to the next newline to avoid splitting a line.
      • lseek()s to its start position.
      • Performs the word count on its chunk.
      • Writes the final count to its designated pipe.
      • Exits.
    • Parent process:
      • Closes the write-end of the child’s pipe.
      • Stores the child’s PID.
  5. After the loop, the parent waits for all children to exit.
  6. It then reads the partial count from each pipe and sums them up.
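
A sketch of that whole parent/child flow as a single program. For brevity it splits at raw byte offsets (so a word straddling a chunk boundary may be miscounted) rather than applying the newline adjustment sketched earlier, and it prints only the final total rather than the per-worker lines shown above:

#include <ctype.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

/* Count whitespace-separated words in bytes [start, end) of the file. */
static long count_words_in_range(const char *path, long start, long end) {
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return 0;
    lseek(fd, start, SEEK_SET);

    char buf[65536];
    long count = 0, remaining = end - start;
    int in_word = 0;
    while (remaining > 0) {
        long want = remaining < (long)sizeof(buf) ? remaining : (long)sizeof(buf);
        ssize_t n = read(fd, buf, (size_t)want);
        if (n <= 0)
            break;
        for (ssize_t i = 0; i < n; i++) {
            if (isspace((unsigned char)buf[i]))
                in_word = 0;
            else if (!in_word) { in_word = 1; count++; }
        }
        remaining -= n;
    }
    close(fd);
    return count;
}

int main(int argc, char **argv) {
    if (argc != 3) {
        fprintf(stderr, "usage: %s <file> <num workers>\n", argv[0]);
        return 1;
    }
    struct stat st;
    if (stat(argv[1], &st) != 0) { perror("stat"); return 1; }

    int n_workers = atoi(argv[2]);
    if (n_workers <= 0) { fprintf(stderr, "num workers must be > 0\n"); return 1; }

    long file_size = st.st_size, chunk = file_size / n_workers, total = 0;
    int pipes[n_workers][2];

    for (int i = 0; i < n_workers; i++) {
        if (pipe(pipes[i]) < 0) { perror("pipe"); return 1; }
        pid_t pid = fork();
        if (pid < 0) { perror("fork"); return 1; }
        if (pid == 0) {                              /* child: "map" over its chunk */
            close(pipes[i][0]);
            long start = i * chunk;
            long end = (i == n_workers - 1) ? file_size : (i + 1) * chunk;
            long count = count_words_in_range(argv[1], start, end);
            write(pipes[i][1], &count, sizeof(count));
            _exit(0);
        }
        close(pipes[i][1]);                          /* parent keeps the read end */
    }

    while (wait(NULL) > 0)                           /* wait for all children */
        ;

    for (int i = 0; i < n_workers; i++) {            /* "reduce": sum the partials */
        long partial = 0;
        if (read(pipes[i][0], &partial, sizeof(partial)) == (ssize_t)sizeof(partial))
            total += partial;
        close(pipes[i][0]);
    }
    printf("Total words: %ld\n", total);
    return 0;
}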

Learning milestones:

  1. A child process can count words in a specific byte range of a file. → You understand file partitioning.
  2. The parent can spawn multiple children and wait for them. → You understand process management.
  3. The parent correctly aggregates results from all children via pipes. → You understand IPC and parallel aggregation.

Project 3: Distributed Word Count via Sockets

  • File: LEARN_SPARK_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Go, Java
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Network Programming / Distributed Systems
  • Software or Tool: TCP Sockets (socket, bind, listen, accept, connect)
  • Main Book: “TCP/IP Sockets in C” by Donahoo & Calvert

What you’ll build: A true distributed system. A “driver” program reads a file, splits it, and sends the chunks over the network to multiple “worker” programs running on different machines (or on localhost). Each worker counts the words in its chunk and sends the result back to the driver.

Why it teaches the core concept: This is the leap from parallel to distributed computing. You are forced to confront network unreliability, data serialization (how to send data over a wire), and a client-server architecture. This is the fundamental structure of Spark, Hadoop, and nearly all distributed computing frameworks.

Core challenges you’ll face:

  • Socket Programming: Writing the boilerplate for TCP communication. → maps to low-level networking.
  • Data Serialization: You can’t just send a pointer over the network. You must send the actual data. → maps to designing a simple wire protocol.
  • Coordination: The driver needs to know which workers are available and assign them work. → maps to basic service discovery.
  • Error Handling: What if a worker disconnects? What if the network is slow? → maps to building robust network applications.

Key Concepts:

  • TCP Sockets: “TCP/IP Sockets in C”, Chapters 4-6
  • Network Byte Order: htons(), ntohl() functions.
  • Application-Layer Protocol Design: “Designing Data-Intensive Applications”, Chapter 4

Difficulty: Advanced Time estimate: 1-2 weeks Prerequisites: Project 2, basic understanding of TCP/IP.

Real world outcome:

You’ll run two different programs.

# On machine worker-1:
$ ./worker 9090
Worker listening on port 9090...

# On machine worker-2:
$ ./worker 9090
Worker listening on port 9090...

# On the driver machine:
$ ./driver large_file.txt worker-1:9090 worker-2:9090
Connecting to 2 workers...
Sending chunk 1 to worker-1:9090...
Sending chunk 2 to worker-2:9090...
Received count 80,152,300 from worker-1:9090.
Received count 100,000,000 from worker-2:9090.
Total words: 180,152,300

Implementation Hints:

  • Worker:
    1. Starts up and listens for connections on a given port.
    2. When a driver connects, it enters a loop:
      • Reads data from the socket until the driver signals the end of a chunk.
      • Performs the word count on the received data.
      • Sends the integer count back to the driver.
      • Closes the connection and waits for a new one.
  • Driver:
    1. Parses worker addresses from the command line.
    2. For each chunk of the input file:
      • Connects to an available worker.
      • Sends the chunk of data. You’ll need a simple protocol, e.g., send 4 bytes for the length, then send the data (see the sketch below).
      • Waits for the worker to send back its partial count.
      • Adds it to the total.
  • Start by running the workers and driver on localhost before trying multiple machines.
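
A sketch of the 4-byte-length-then-data protocol suggested above; the helper names (send_chunk, recv_chunk) are illustrative, not a required API:

#include <arpa/inet.h>   /* htonl, ntohl */
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Write exactly `len` bytes, retrying on short writes. */
static int write_all(int fd, const void *buf, size_t len) {
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Read exactly `len` bytes, retrying on short reads. */
static int read_all(int fd, void *buf, size_t len) {
    char *p = buf;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Driver side: send a chunk as [4-byte length, network byte order][payload]. */
int send_chunk(int sock, const char *data, uint32_t len) {
    uint32_t netlen = htonl(len);
    if (write_all(sock, &netlen, sizeof(netlen)) < 0)
        return -1;
    return write_all(sock, data, len);
}

/* Worker side: read the length prefix, then the payload (caller frees it). */
char *recv_chunk(int sock, uint32_t *out_len) {
    uint32_t netlen;
    if (read_all(sock, &netlen, sizeof(netlen)) < 0)
        return NULL;
    *out_len = ntohl(netlen);
    char *data = malloc(*out_len + 1);
    if (data == NULL || read_all(sock, data, *out_len) < 0) {
        free(data);
        return NULL;
    }
    data[*out_len] = '\0';   /* NUL-terminate so the worker can treat it as text */
    return data;
}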

Learning milestones:

  1. You can send a chunk of a file from one program to another on the same machine. → You understand basic socket communication.
  2. A worker can receive data, process it, and send a result back. → You’ve built a request/response loop.
  3. The driver can manage multiple workers and correctly aggregate results. → You’ve built a true distributed application.

Project 4: Generic MapReduce Framework

  • File: LEARN_SPARK_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Software Architecture / Function Pointers
  • Software or Tool: Your code from Project 3
  • Main Book: “Expert C Programming” by Peter van der Linden

What you’ll build: Refactor your distributed word count program into a generic framework. The driver will be configured with function pointers for map and reduce logic. This framework will handle the distribution and execution, allowing you to plug in different processing logic without changing the core engine.

Why it teaches the core concept: This is the essence of Spark’s API. Spark itself doesn’t know what a “word count” is. It just knows how to execute user-defined functions (UDFs) in a distributed fashion. By using function pointers, you are forced to separate the what (your logic) from the how (the distributed execution).

Core challenges you’ll face:

  • Abstracting Logic: Designing a generic map function signature. For word count, map takes a line of text and should emit key-value pairs (e.g., (word, 1)). → maps to data structures and API design.
  • Serialization of Logic: You can’t send a function pointer over the network. The workers must already have the map and reduce functions available. → maps to shared code and dynamic loading (advanced). For this project, you can compile the user functions into the worker binary.
  • Generic Data Structures: Creating structures to hold intermediate key-value pairs. → maps to building a small library.

Key Concepts:

  • Function Pointers: “Expert C Programming”, Chapter 3
  • API Design: “The Practice of Programming” by Kernighan and Pike, Chapter 4
  • Static vs. Dynamic Linking: Understanding that workers need the code compiled-in.

Difficulty: Advanced Time estimate: 1-2 weeks Prerequisites: Project 3, strong C skills.

Real world outcome: You’ll have a framework library and a separate user program.

// In user_logic.c:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "mapreduce.h"

// User defines their map and reduce functions
void word_count_map(char *line, emitter_t *emitter) {
    // Split the line into whitespace-separated words and emit (word, "1")
    char *saveptr;
    for (char *word = strtok_r(line, " \t\r\n", &saveptr);
         word != NULL;
         word = strtok_r(NULL, " \t\r\n", &saveptr)) {
        emit(emitter, word, "1");
    }
}

void sum_reduce(char *key, char **values, int num_values, emitter_t *emitter) {
    int sum = 0;
    for (int i = 0; i < num_values; i++) {
        sum += atoi(values[i]);
    }
    char result[16];
    snprintf(result, sizeof(result), "%d", sum);
    emit(emitter, key, result);
}

// In main.c:
int main(int argc, char **argv) {
    job_t *job = create_job();
    job_set_map(job, word_count_map);
    job_set_reduce(job, sum_reduce);
    job_set_input(job, "input.txt");
    job_add_workers(job, ...);

    execute_job(job);
    return 0;
}

The framework handles the rest.

Implementation Hints:

  1. Define the function signatures for your map and reduce functions. A map function might take a char* (a line or chunk) and a pointer to an “emitter” object. The emitter is a struct that has a function pointer emit(key, value).
  2. The workers will need a way to know WHICH map/reduce function to run. You can start by hardcoding this, e.g., the driver sends a simple integer ID (1 for word_count, 2 for log_analyzer) and the worker has a switch statement.
  3. The intermediate data from the map phase ((key, value) pairs) needs to be stored before being sent to reducers. This is the beginning of the shuffle.
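
A sketch of what the framework header (the mapreduce.h included in the example above) might declare; every name and signature here is illustrative, not a fixed API:

/* mapreduce.h -- an illustrative framework interface */
#ifndef MAPREDUCE_H
#define MAPREDUCE_H

typedef struct emitter emitter_t;   /* opaque; owned by the framework */
typedef struct job job_t;

/* The framework calls back into user code through these signatures. */
typedef void (*map_fn)(char *input, emitter_t *emitter);
typedef void (*reduce_fn)(char *key, char **values, int num_values,
                          emitter_t *emitter);

/* User code hands (key, value) pairs back to the framework. */
void emit(emitter_t *emitter, const char *key, const char *value);

/* Job configuration: the user describes WHAT to run, the framework
   decides HOW to distribute and execute it. */
job_t *create_job(void);
void job_set_map(job_t *job, map_fn map);
void job_set_reduce(job_t *job, reduce_fn reduce);
void job_set_input(job_t *job, const char *path);
void job_add_workers(job_t *job, const char **worker_addrs, int num_workers);
int  execute_job(job_t *job);

/* Note: function pointers cannot cross the network (hint 2). In the simplest
   version the driver sends an integer function ID, and the worker binary,
   which has the same user functions compiled in, maps the ID back to these
   pointers with a switch statement. */

#endif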

Learning milestones:

  1. You can define and use function pointers to represent map/reduce logic. → You understand abstraction.
  2. The driver can configure a job with specific functions. → You’ve created a configurable API.
  3. The framework can successfully execute a word count job defined in user code. → You’ve separated the engine from the logic.

Project 5: Implement a Distributed Shuffle

  • File: LEARN_SPARK_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Go, Rust
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: Distributed Algorithms / Networking
  • Software or Tool: Hashing algorithms (for partitioning)
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: The most critical and complex part of any distributed data engine: the shuffle. After your map workers generate key-value pairs, you will implement the logic to send all values for the same key to a single reduce worker.

Why it teaches the core concept: This project demystifies the magic of reduceByKey, groupByKey, and join. You’ll learn firsthand why shuffles are so expensive (all-to-all network communication) and why controlling data partitioning is the key to performance in Spark.

Core challenges you’ll face:

  • Partitioning Strategy: How do you decide which reducer gets which key? → maps to implementing a hash partitioner (hash(key) % num_reducers).
  • All-to-All Communication: Each map worker might need to send data to every reduce worker. → maps to managing many simultaneous network connections.
  • Buffering and Memory Management: Map workers need to buffer output data for each reducer before sending it. → maps to balancing memory usage and network efficiency.
  • Reducer-side Sort/Merge: Reducers receive data from all mappers and must group values for the same key before calling the user’s reduce function. → maps to external sorting and merging algorithms.

Key Concepts:

  • Hash Partitioning: A standard technique to distribute keys.
  • Sort-Based Shuffle: “Designing Data-Intensive Applications”, Chapter 10
  • External Sorting: When data doesn’t fit in memory.

Difficulty: Expert Time estimate: 2-3 weeks Prerequisites: Project 4, solid understanding of data structures and networking.

Real world outcome: Your framework can now correctly execute a word count on a large, multi-gigabyte file where the intermediate keys are spread across the cluster.

# Driver output
Starting job...
Map phase started on 4 workers.
Map phase complete.
Shuffle phase started.
  - Worker 1 is sending 3 partitions...
  - Worker 2 is sending 3 partitions...
  - ...
Shuffle phase complete.
Reduce phase started on 3 workers.
Reduce phase complete.
Job finished. Output in 'output_dir'.

# Check output
$ cat output_dir/part-00000
(a, 560123)
(and, 890123)
...

Implementation Hints:

  1. Map Side:
    • For each (key, value) pair emitted by the user’s map function:
    • Calculate the target reducer: partition = hash(key) % num_reducers.
    • Append the pair to an in-memory buffer for that partition.
    • When a buffer is full (or map is done), the map worker opens a connection to the target reducer worker and sends the entire buffer.
  2. Reduce Side:
    • Each reducer worker listens for connections from all map workers.
    • It receives streams of (key, value) pairs.
    • It writes these pairs to local disk, sorted by key.
    • Once all mappers are done, each reducer performs a multi-way merge of the sorted files it received.
    • This merge process naturally groups all values for the same key together. The reducer can then call the user’s reduce function for each key.
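
A sketch of the map-side half of hint 1, using the well-known djb2 string hash; the buffer layout and names are illustrative, and the tab-separated record format is just one simple choice:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* djb2 string hash -- a simple, widely used choice for partitioning. */
static unsigned long hash_key(const char *key) {
    unsigned long h = 5381;
    for (const unsigned char *p = (const unsigned char *)key; *p; p++)
        h = h * 33 + *p;
    return h;
}

/* One in-memory output buffer per reducer partition. */
typedef struct {
    char  *data;                 /* "key\tvalue\n" records */
    size_t used, cap;
} partition_buf_t;

/* Map-side shuffle write: pick the target reducer with hash(key) % R and
   append the pair to that reducer's buffer. When a buffer fills up (or the
   map task ends), it is flushed over the network to its reducer. */
void shuffle_emit(partition_buf_t *bufs, int num_reducers,
                  const char *key, const char *value) {
    int p = (int)(hash_key(key) % (unsigned long)num_reducers);
    partition_buf_t *b = &bufs[p];

    size_t need = strlen(key) + strlen(value) + 3;   /* '\t', '\n', '\0' */
    if (b->used + need > b->cap) {
        size_t new_cap = b->cap ? b->cap * 2 : 4096;
        while (new_cap < b->used + need)
            new_cap *= 2;
        char *grown = realloc(b->data, new_cap);
        if (grown == NULL)
            return;                                  /* sketch: real code handles OOM */
        b->data = grown;
        b->cap = new_cap;
    }
    b->used += (size_t)sprintf(b->data + b->used, "%s\t%s\n", key, value);
}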

Learning milestones:

  1. You can partition keys to specific reducers using a hash function. → You understand data distribution.
  2. Map workers can send their output to the correct reduce workers. → You’ve implemented the network part of the shuffle.
  3. Reduce workers can sort and group incoming data to prepare for the reduce function. → You’ve implemented the sort/merge part of the shuffle.
  4. The entire map -> shuffle -> reduce pipeline works correctly. → You have built the core of a MapReduce engine.

Project 6: Implement Fault Tolerance

  • File: LEARN_SPARK_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Go, Rust
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 4: Expert
  • Knowledge Area: Distributed Systems / Fault Tolerance
  • Software or Tool: Sockets, process monitoring
  • Main Book: “Operating Systems: Three Easy Pieces” by Arpaci-Dusseau

What you’ll build: A mechanism to make your framework resilient to worker failures. The driver program will monitor worker health. If a worker running a map task fails, the driver will re-assign that task to another available worker.

Why it teaches the core concept: This is the “R” in RDD: Resilient. Spark’s killer feature over the original Hadoop MapReduce was its ability to recover from failures quickly by recomputing lost partitions from its lineage (the DAG). This project teaches you how to think about failure as a normal occurrence, not an exception.

Core challenges you’ll face:

  • Health Checks: How does the driver know a worker has failed? → maps to implementing a heartbeat mechanism or detecting broken TCP connections.
  • State Management: The driver must keep track of which tasks are assigned to which workers and which are completed. → maps to building a master state machine.
  • Task Re-scheduling: If a task fails, the driver needs to find a new worker and send it the same work. → maps to job scheduling logic.
  • Idempotency: Re-running a task should not corrupt the final result. → maps to designing tasks to be stateless.

Key Concepts:

  • Heartbeating: A common pattern for failure detection.
  • Master-Worker Architecture: “Designing Data-Intensive Applications”, Chapter 6
  • Task Lineage: The concept that any piece of data can be re-computed from the original source and a series of transformations.

Difficulty: Expert Time estimate: 1-2 weeks Prerequisites: Project 5.

Real world outcome: You can manually kill a worker process mid-job, and the driver will detect it and successfully complete the job.

$ ./driver large_file.txt worker-1:9090 worker-2:9090 worker-3:9090
... 
Map task 1 assigned to worker-1
Map task 2 assigned to worker-2
Map task 3 assigned to worker-3
... 
[IN ANOTHER TERMINAL] $ pkill -f "worker-2"

# Driver output continues
Connection to worker-2 lost! Re-scheduling map task 2.
Map task 2 assigned to worker-1.
... 
Map phase complete.
... 
Job finished successfully.

Implementation Hints:

  1. Driver State: The driver needs a data structure (e.g., an array of structs) to track the state of each task (e.g., PENDING, IN_PROGRESS, COMPLETED, FAILED). It should also track which worker is assigned to which IN_PROGRESS task.
  2. Heartbeating:
    • A simple way is for the driver to maintain a persistent connection to each worker. If read() or write() returns an error (like EPIPE), the connection is broken and the worker is presumed dead.
    • A more advanced way is for workers to periodically send a small “I’m alive” message to the driver. If the driver doesn’t hear from a worker for a certain timeout, it marks it as dead.
  3. Recovery Loop: The driver should have a main loop that periodically checks the status of tasks. If it finds a task that was IN_PROGRESS on a now-dead worker, it should move that task back to the PENDING state to be re-assigned.
  4. Output Committer: For this to work correctly, workers shouldn’t finalize their output until the driver confirms the entire stage is complete. This prevents partial results from a failed task from corrupting the final result. This is a very advanced topic (see Hadoop’s OutputCommitter), but you can simulate it by having workers write to temporary directories and only renaming them to the final location on a signal from the driver.
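
A sketch of the driver-side bookkeeping described in hints 1-3; the struct layout and names are illustrative:

#include <stdbool.h>

typedef enum { TASK_PENDING, TASK_IN_PROGRESS, TASK_COMPLETED } task_state_t;

typedef struct {
    int          task_id;      /* which chunk / partition this task covers */
    task_state_t state;
    int          worker_id;    /* meaningful only while IN_PROGRESS */
} task_t;

typedef struct {
    int  worker_id;
    int  sock;                 /* driver's connection to this worker */
    bool alive;                /* flipped to false when read()/write() fails */
} worker_t;

/* Recovery pass: any task that was running on a dead worker goes back to
   PENDING so the scheduler's main loop will hand it to a live worker. */
void reschedule_lost_tasks(task_t *tasks, int num_tasks,
                           const worker_t *workers, int num_workers) {
    for (int t = 0; t < num_tasks; t++) {
        if (tasks[t].state != TASK_IN_PROGRESS)
            continue;
        for (int w = 0; w < num_workers; w++) {
            if (workers[w].worker_id == tasks[t].worker_id && !workers[w].alive) {
                tasks[t].state = TASK_PENDING;   /* will be re-assigned */
                tasks[t].worker_id = -1;
                break;
            }
        }
    }
}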

Learning milestones:

  1. The driver can detect a lost worker connection. → You’ve implemented failure detection.
  2. The driver maintains the state of all tasks in the job. → You’ve built a scheduler.
  3. The driver can re-assign a failed map task to a new worker. → You’ve implemented basic fault tolerance.

Project 7: Lazy Evaluation and a DAG Scheduler

  • File: LEARN_SPARK_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: Compilers / Graph Algorithms
  • Software or Tool: Graph data structures
  • Main Book: “Language Implementation Patterns” by Terence Parr

What you’ll build: Modify your framework’s API to be lazy. Calls to map() and reduceByKey() will not execute immediately. Instead, they will build up a graph (a DAG) of dependencies. Only when an “action” like collect() or save() is called will the framework analyze the graph, optimize it, and submit it for execution.

Why it teaches the core concept: This is the secret sauce of Spark’s performance. Lazy evaluation allows the optimizer to see the entire workflow before it runs, enabling it to fuse stages together, re-order operations, and make intelligent decisions. It separates the logical plan (what you want to do) from the physical plan (how the cluster will do it).

Core challenges you’ll face:

  • API Redesign: Your API calls now need to return a “future” or a “dataset” object that represents the result of the operation, rather than the result itself.
  • Graph Representation: You need to implement a graph data structure to represent the DAG of operations (nodes are datasets, edges are transformations).
  • DAG Traversal: When an action is called, you need to traverse the graph backwards from the final dataset to find the source data and all the steps required.
  • Stage Scheduling: Analyze the graph to find “shuffle boundaries”. Operations between shuffles can be pipelined together into a single “stage”. Your scheduler will submit stages, not individual operations.

Key Concepts:

  • Directed Acyclic Graphs (DAGs): “Grokking Algorithms”, Chapter 6
  • Lazy Evaluation: A core functional programming concept.
  • Compiler Optimizations: The process of transforming a computation into a more efficient one.

Difficulty: Expert Time estimate: 2-3 weeks Prerequisites: Project 6, good knowledge of data structures.

Real world outcome: Your API will feel much more like modern Spark.

// New lazy API
int main() {
    // These calls just build a graph, they don't run anything
    dataset_t *lines = framework_read_file("input.txt");
    dataset_t *words = framework_flat_map(lines, extract_words_func);
    dataset_t *pairs = framework_map(words, make_kv_pair_func);
    dataset_t *counts = framework_reduce_by_key(pairs, sum_func);

    printf("Job defined. The cluster is idle.\n");
    
    // This is the action that triggers the computation
    framework_save(counts, "output_dir"); 
    
    printf("Action called! Now executing the job...\n");
    // Driver now analyzes the graph and submits stages
    return 0;
}

Implementation Hints:

  1. Define a struct dataset that contains information about the operation that created it (e.g., MAP, READ_FILE) and pointers to its parent dataset(s).
  2. Each API call (map, reduceByKey) will malloc a new dataset struct, link it to its parents, and return a pointer to it.
  3. The save() or collect() action function takes a dataset pointer as input.
  4. Inside the action, start from the target dataset and walk up the parent pointers to the source(s). This traces the lineage.
  5. As you walk, look for “wide” dependencies (reduceByKey) vs. “narrow” dependencies (map, filter). A sequence of narrow dependencies can be fused into a single stage. A wide dependency marks the end of one stage and the beginning of another (a shuffle boundary).
  6. Your scheduler can then submit Stage 1, wait for it to complete, run the shuffle, then submit Stage 2, and so on.
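
A sketch of hints 1, 2, and 5: lazy dataset nodes that only record lineage, plus the wide-vs-narrow test used to find shuffle boundaries; all names are illustrative:

#include <stdlib.h>

typedef enum { OP_READ_FILE, OP_MAP, OP_FLAT_MAP, OP_FILTER, OP_REDUCE_BY_KEY } op_t;

/* A node in the DAG: it holds no data, only "how to compute me from my parent". */
typedef struct dataset {
    op_t            op;
    struct dataset *parent;    /* NULL for OP_READ_FILE */
    void           *user_fn;   /* user callback; real code would keep a typed pointer */
    const char     *path;      /* only used by OP_READ_FILE */
} dataset_t;

/* Transformations are lazy: they just allocate a node and link it to its parent. */
dataset_t *framework_map(dataset_t *parent, void *fn) {
    dataset_t *d = calloc(1, sizeof(*d));
    if (d == NULL)
        return NULL;
    d->op = OP_MAP;
    d->parent = parent;
    d->user_fn = fn;
    return d;
}

/* A wide dependency needs all values for a key, so it forces a shuffle. */
static int is_wide(op_t op) { return op == OP_REDUCE_BY_KEY; }

/* Walking from the action back to the source traces the lineage; every wide
   dependency closes one stage and opens another. */
int count_stages(const dataset_t *final_ds) {
    int stages = 1;
    for (const dataset_t *d = final_ds; d != NULL; d = d->parent)
        if (is_wide(d->op))
            stages++;
    return stages;
}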

Learning milestones:

  1. API calls build a graph structure in memory without executing. → You’ve implemented lazy evaluation.
  2. You can traverse the graph from the final dataset back to the source. → You’ve implemented lineage tracking.
  3. You can identify shuffle boundaries and group operations into stages. → You’ve built a basic DAG scheduler.

Project 8: Build a Distributed SQL Query Engine

  • File: LEARN_SPARK_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 5: Master
  • Knowledge Area: Compilers / Databases
  • Software or Tool: A parser generator like Bison/YACC (optional)
  • Main Book: “Compilers: Principles, Techniques, and Tools” (The Dragon Book)

What you’ll build: A tool that accepts a simple SQL query, parses it, and translates it into a job for your distributed MapReduce framework. This will bridge the gap between a high-level declarative language (SQL) and your low-level execution engine.

Why it teaches the core concept: This project demystifies Spark SQL and its Catalyst optimizer. You’ll learn how a declarative query is transformed into a logical plan, then an optimized physical plan (your DAG), and finally executed. It’s the ultimate demonstration of building layers of abstraction.

Core challenges you’ll face:

  • SQL Parsing: Converting a SQL string into a structured representation (an Abstract Syntax Tree or AST). → maps to using a parser generator or writing a recursive descent parser.
  • Logical Planning: Converting the AST into a sequence of logical operations (e.g., Scan -> Filter -> Aggregate). → maps to database query planning.
  • Physical Planning: Translating the logical plan into the operations of your MapReduce framework. → maps to query optimization.
  • Execution: Submitting the generated MapReduce job to your framework.

Key Concepts:

  • Parsing and ASTs: “The Dragon Book”, Chapter 4
  • Query Planning: “Designing Data-Intensive Applications”, Chapter 2
  • Relational Algebra: The formal foundation for SQL operations (SELECT, PROJECT, JOIN).

Difficulty: Master Time estimate: 1 month+ Prerequisites: Project 7, willingness to learn about parsing and compilers.

Real world outcome: Your engine can execute a SQL query across the cluster.

$ ./sql_engine 'SELECT country, COUNT(*) FROM users.csv WHERE age > 30 GROUP BY country'
Parsing SQL query...
Generating logical plan...
  Aggregate(country, COUNT)
    └─ Filter(age > 30)
         └─ Scan(users.csv)
Generating physical plan (MapReduce)...
  - Map: Read line, if age > 30, emit (country, 1)
  - Reduce: For each country, sum the 1s
Submitting job to framework...
Job complete. Output:
(USA, 5403)
(India, 3102)
(Germany, 1204)
...

Implementation Hints:

  • Start simple. Don’t try to implement the whole SQL standard. Start with SELECT COUNT(*) FROM file.
    • This is a pure map (read line, emit ("count", 1)) and reduce (sum the 1s).
  • Add a WHERE clause.
    • This becomes a filter operation inside your map function. The map function now reads a line, applies the filter, and then emits the pair.
  • Add GROUP BY.
    • This is the classic reduceByKey operation. The map function emits (group_key, value), and the shuffle/reduce phase handles the aggregation.
  • Parsing: For simple queries, you can get away with strtok and strstr. For anything more complex, it’s worth learning how to write a basic recursive descent parser. You don’t need YACC/Bison unless you are aiming for a very robust implementation.
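
A sketch of the logical-plan representation that the incremental path above builds toward (a Scan -> Filter -> Aggregate chain); all struct and field names are illustrative, and the comments indicate how each node maps onto your MapReduce framework:

/* Logical plan nodes for the tiny SQL subset:
   SELECT <group_col>, COUNT(*) FROM <file> [WHERE <col> <op> <value>] [GROUP BY <group_col>] */

typedef enum { PLAN_SCAN, PLAN_FILTER, PLAN_AGGREGATE } plan_op_t;

typedef struct plan_node {
    plan_op_t         op;
    struct plan_node *child;      /* NULL for PLAN_SCAN */

    /* PLAN_SCAN */
    const char *path;             /* e.g. "users.csv" */

    /* PLAN_FILTER: a single "column <op> literal" predicate */
    const char *filter_col;
    char        filter_cmp;       /* '<', '>' or '=' */
    double      filter_value;

    /* PLAN_AGGREGATE: COUNT(*) grouped by one column ("" means a global count) */
    const char *group_col;
} plan_node_t;

/* Physical planning then maps this tree onto the MapReduce framework:
   - SCAN + FILTER become the map function (parse a CSV line, apply the
     predicate, emit (group value, "1")),
   - AGGREGATE becomes the reduce function (sum the 1s per key),
   and the shuffle from Project 5 routes each group key to one reducer. */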

Learning milestones:

  1. You can parse a simple SELECT-FROM query. → You understand basic parsing.
  2. A WHERE clause translates to a filter operation in your map phase. → You’ve connected the logical and physical plan.
  3. A GROUP BY clause translates to a full Map-Shuffle-Reduce job. → You’ve implemented distributed aggregation from SQL.
  4. The entire SQL-to-execution pipeline works. → You’ve built a distributed SQL engine!

Summary of Projects

Project                                                  Main Language
Project 1: Single-Threaded Map and Reduce                C
Project 2: Multi-Process Word Count (Single Machine)     C
Project 3: Distributed Word Count via Sockets            C
Project 4: Generic MapReduce Framework                   C
Project 5: Implement a Distributed Shuffle               C
Project 6: Implement Fault Tolerance                     C
Project 7: Lazy Evaluation and a DAG Scheduler           C
Project 8: Build a Distributed SQL Query Engine          C
