Learning Concurrency & Parallel Programming Through Real-World Projects

Goal: Build a deep, practical understanding of how concurrent programs actually behave on real machines, from thread creation and synchronization all the way down to kernel wakeups and cache effects. You will learn to reason about correctness (no races, no deadlocks) and performance (scaling, contention, memory bandwidth), and you will be able to design, implement, and debug multi-threaded systems that hold up under load. You’ll also develop a mental model that connects user-space primitives (mutexes, condition variables, atomics) to the kernel mechanisms that back them.


Why Concurrency & Parallel Programming Matters

Modern performance gains largely come from parallelism rather than single-core clock speed improvements. If you want your software to keep getting faster on mainstream hardware, you must parallelize it explicitly and correctly.

Amdahl’s Law explains why even small serial sections can cap speedups as core counts rise. This is why you must design for contention, eliminate unnecessary serialization, and choose the right concurrency patterns early.
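
In symbols: if a fraction p of the work can run in parallel on n cores, the best-case speedup is

  S(n) = 1 / ((1 − p) + p / n)

so even a 95%-parallel program (p = 0.95) can never exceed 20× speedup, no matter how many cores you add.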

Threads let a single process execute multiple flows of control, sharing the same address space but maintaining separate stacks and registers. This makes sharing data easy, but it also makes races and synchronization bugs easy to create.

Single Thread                      Multi-Threaded
┌───────────────┐                 ┌───────────────┐
│  Code/Heap    │                 │  Code/Heap    │
│   One Stack   │                 │Stack A Stack B│
└───────────────┘                 └───────────────┘
     |                                    |
     v                                    v
  Simple state                       Shared state +
                                   more interleavings

Prerequisites & Background Knowledge

Essential Prerequisites (Must Have)

  • Solid C programming (pointers, structs, memory layout)
  • Basic OS concepts (process vs thread, syscalls, scheduling)
  • Comfort reading and modifying multi-file C projects

Helpful But Not Required

  • Intro networking (sockets, HTTP)
  • Basic performance profiling (time, CPU utilization)
  • Familiarity with Makefiles and gdb

Self-Assessment Questions

  • Can you explain what a race condition is and why it is nondeterministic?
  • Can you write a small pthread_create example from memory?
  • Can you describe the difference between a mutex and a semaphore?
  • Do you understand how shared memory differs from message passing?

Development Environment Setup

  • Linux or WSL2 (Linux recommended for tooling and thread behavior parity)
  • gcc or clang, make, gdb
  • perf, strace, and ltrace for performance and syscall tracing

Time Investment (Realistic)

  • Project 1–2: 1–2 weeks each
  • Project 3–5: 2–3 weeks each
  • Final project: 1–2 months part-time

Important Reality Check

Concurrency bugs are timing-dependent and can appear or disappear based on scheduling. Expect to spend more time observing than coding. You will use logs, tracing, and stress tests to catch issues that “vanish” in debug mode.


How to Use This Guide

  1. Read the Core Concept Analysis to build your mental model first.
  2. Pick a project that matches your current comfort level.
  3. Before coding, answer the Core Question and Thinking Exercise.
  4. Use the Hints in Layers only when truly stuck.
  5. Complete the Definition of Done checklist to lock in mastery.

Quick Start (First 48 Hours)

Day 1

  • Read OSTEP “Concurrency: An Introduction” and “Thread API”.
  • Write a tiny program that creates 4 threads and increments a shared counter incorrectly (a minimal sketch follows this list).
  • Add a mutex and watch the counter become correct.
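
A minimal sketch of that Day 1 exercise, assuming gcc or clang with -pthread; comment out the lock/unlock pair to watch the final count come up short:

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define ITERS    1000000

static long counter = 0;  /* shared by all threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&lock);    /* remove these two lines to */
        counter++;                    /* reintroduce the race      */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("expected %d, got %ld\n", NTHREADS * ITERS, counter);
    return 0;
}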

Day 2

  • Read OSTEP “Locks” and “Condition Variables”.
  • Implement a bounded queue with one producer and one consumer.
  • Add a second producer and fix the new race.

Path A: Systems Programming Track
  1) Download Accelerator → 2) Image Pipeline → 3) Game Server → 4) Lock-Free Queue → Final Scheduler

Path B: Performance/Graphics Track
  1) Ray Tracer → 2) Image Pipeline → 3) Lock-Free Queue → Final Scheduler

Path C: Distributed Systems Track
  1) Download Accelerator → 2) Game Server → Final Scheduler


Big Picture / Mental Model

Workload
   |
   v
Decompose into independent tasks
   |
   v
Choose pattern: fork-join | pipeline | producer-consumer | event loop
   |
   v
Pick primitives: mutex | condvar | semaphore | atomics
   |
   v
Runtime: pthreads / OpenMP / custom thread pool
   |
   v
Kernel: futex + scheduler + wakeups

Threads, synchronization primitives, and the kernel mechanisms that back them are tightly coupled; understanding those linkages is how you debug hard bugs.


Core Concept Analysis

Concept Cluster What You Need to Internalize
Threading Fundamentals Thread creation, lifecycle, stacks, join/detach, thread IDs.
Synchronization Primitives Mutexes, condition variables, semaphores, and how blocking/unblocking works.
Concurrency Hazards Race conditions, deadlocks, livelocks, starvation, missed signals, priority inversion.
Atomic Operations Compare-and-swap, atomic read-modify-write, memory orderings.
Parallel Patterns Fork-join, pipeline, bounded buffers, event loops.
Memory Models Visibility, happens-before, cache effects, false sharing.
I/O Concurrency Event-driven I/O (epoll), backpressure, readiness vs completion.

Concept Summary Table

Concept You Know It When You Can…
Threads predict ordering changes under different schedulers
Mutexes eliminate races without destroying parallelism
Condvars design correct producer-consumer queues
Semaphores express both mutual exclusion and ordering constraints
Atomics implement a lock-free counter that scales under contention
Memory Model explain why “it worked in debug” is meaningless
Parallel Patterns choose the correct pattern for a workload

Deep Dive Reading by Concept

Concept Book + Chapters
Threads & Thread API OSTEP: “Concurrency: An Introduction” + “Thread API” (threads-intro.pdf, threads-api.pdf).
Locks & Condition Variables OSTEP: “Locks” + “Condition Variables”.
Semaphores OSTEP: “Semaphores”.
Concurrency Bugs OSTEP: “Common Concurrency Problems”.
POSIX Threads in C The Linux Programming Interface: Ch. 29–33 (Threads series).
Atomics & Memory Ordering Rust Atomics and Locks: Ch. 2–3, 7.
OS Primitives (futex, pthreads) Rust Atomics and Locks: Ch. 8.

Success Metrics

  • You can explain a deadlock in terms of lock acquisition order.
  • You can use perf, strace, and basic tracing to catch timing issues.
  • You can quantify speedup and explain why it plateaus.
  • You can build a concurrency tool that runs correctly under stress.

Core Subsystem Walkthrough (User Space → Syscall → Kernel)

  1. A user-space thread calls pthread_mutex_lock() or waits on a condition variable.
  2. If the lock is contended, the thread blocks and the runtime may call into the kernel via futex.
  3. The kernel uses futex to sleep and wake threads waiting on a shared memory address (a minimal wrapper sketch follows the diagram below).
  4. The scheduler chooses the next runnable thread; wakeups are handled through kernel scheduling and wait queues.
Userspace (pthread) -> futex syscall -> kernel wait queue -> scheduler -> wakeup
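
glibc provides no futex() wrapper, so user code invokes it through syscall(2). A minimal sketch of the two core operations (the helper names are mine, not a library API):

#define _GNU_SOURCE           /* for syscall() */
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

/* sleep only if *addr still holds `expected`; the check-and-sleep is
   atomic inside the kernel, which is what prevents lost wakeups */
static long futex_wait(atomic_int *addr, int expected) {
    return syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
}

/* wake up to `n` threads currently sleeping on addr */
static long futex_wake(atomic_int *addr, int n) {
    return syscall(SYS_futex, addr, FUTEX_WAKE, n, NULL, NULL, 0);
}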

Kernel Navigation Cheatsheet

Use these directories and rg patterns if you want to trace how user-space primitives map to kernel code:

  • include/linux/sched.h — struct task_struct, scheduler interfaces.
  • kernel/futex/ — futex core logic.
  • kernel/locking/mutex.c — mutex implementation.
  • kernel/locking/rwsem.c — reader-writer semaphores.

Example search patterns:

  • rg "futex" kernel/futex
  • rg "mutex_lock" kernel/locking/mutex.c
  • rg "rwsem" kernel/locking/rwsem.c

Debugging Toolkit (Concurrency-Focused)

  • ftrace: kernel function tracing via tracefs.
  • Tracepoints: low-overhead kernel probes for performance/debugging.
  • perf_event_open: programmatic access to hardware/software counters.
  • lockdep: runtime lock dependency validator in the kernel.
  • gdb/rr/strace: user-space debugging and syscall tracing
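
For the user-space projects below, ThreadSanitizer is often the quickest way to surface data races; it is enabled with a standard gcc/clang flag (shown here on a hypothetical stress_test.c):

  gcc -g -O1 -fsanitize=thread -pthread stress_test.c -o stress_test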

Kernel Data Structure Atlas

Structure What It Represents Where to Look
task_struct Kernel task/thread descriptor include/linux/sched.h
Futex hash buckets Wait-queue buckets for futexes kernel/futex/core.c
struct mutex implementation Kernel mutex behavior kernel/locking/mutex.c
rw_semaphore implementation Reader/writer lock behavior kernel/locking/rwsem.c

Kernel Contribution Checklist

  • Run scripts/checkpatch.pl on your changes.
  • Use scripts/get_maintainer.pl to find the right mailing list.
  • Follow kernel commit message conventions and keep patches small.

Glossary of Concurrency Terms

  • Thread: An execution context within a process; threads share an address space but have separate stacks.
  • Mutex: Mutual exclusion lock; a thread blocks if another thread owns it.
  • Condition Variable: A wait/notify primitive that atomically unlocks a mutex when waiting and re-locks before returning.
  • Semaphore: Counter-based synchronization primitive with wait/post operations.
  • Futex: Kernel facility to sleep/wake on a shared memory word.
  • Atomic Operation: A read-modify-write that appears indivisible to other threads.
  • Deadlock: A cycle of threads each waiting on locks held by others.

Common Failure Signatures

  • “Program hangs under load” → likely deadlock or missed wakeup.
  • “Works with 1 thread, fails with 2” → data race or non-atomic update.
  • “Performance gets worse with more threads” → contention or false sharing.
  • “Random crashes in optimized builds” → memory ordering or lifetime bugs.

Kernel Reading Plan (10 Days)

  1. OSTEP threads-intro + threads-api.
  2. OSTEP locks + locks-usage.
  3. OSTEP condition variables + semaphores.
  4. OSTEP common concurrency problems.
  5. TLPI Ch. 29–33 (threads).
  6. Rust Atomics & Locks Ch. 2–3 (atomics & memory ordering).
  7. Rust Atomics & Locks Ch. 8–9 (OS primitives, locks).
  8. Read futex(2) and pthread mutex/condvar man pages.
  9. Read epoll(7) and tracepoints docs.
  10. Skim kernel lockdep docs.

Project 1: “Multi-Threaded Download Accelerator” — Parallel Chunk Downloader

Attribute Value
Programming Language C
Coolness Level Level 3: Genuinely Clever
Business Potential Level 2: The “Micro-SaaS / Pro Tool”
Difficulty Level 2: Intermediate
Knowledge Area Concurrency / Networking
Software or Tool Threads / HTTP
Main Book “The Linux Programming Interface” by Michael Kerrisk

What you’ll build: A CLI tool that downloads large files by splitting them into chunks and downloading them in parallel using HTTP Range requests, then merging them into one file.

Why it teaches concurrency: You coordinate multiple worker threads, handle partial failures, implement progress aggregation, and ensure correct writes to a shared output file without races.

Core challenges you’ll face

  • Chunk coordination (thread synchronization, work distribution)
  • Progress aggregation (shared state, atomic counters)
  • Handling partial failures (cancellation, cleanup)
  • Rate limiting / connection pooling (semaphores, resource caps)
  • Resume capability (persistent metadata + concurrency)

Key Concepts

Concept Resource
Threads & Thread Creation TLPI Ch. 29–30
Mutexes & Condition Variables OSTEP “Locks” + “Condition Variables”
Atomics GCC atomic builtins, LLVM atomics overview
HTTP Range Requests MDN Range header reference

Project Details

Attribute Value
Difficulty Intermediate
Time estimate 1–2 weeks
Prerequisites Basic networking (HTTP), file I/O, C toolchain

Real World Outcome

$ ./downloader https://example.com/large.iso -c 8 -o large.iso
[init] file_size=4.7GB  chunks=8  chunk_size=587MB
[thread-3] range=1761MB..2348MB  22%  48.2 MB/s
[thread-7] range=4112MB..4700MB  61%  52.9 MB/s
[merge] 8/8 chunks complete
[verify] sha256: OK
[done] total=93.2s  avg=50.4 MB/s

You will also generate a JSON progress file that allows resuming after a crash.

The Core Question You’re Answering

How do you safely coordinate multiple threads writing to a shared output without corrupting data or stalling progress?

Concepts You Must Understand First

  • Thread lifecycle and join semantics (OSTEP threads-intro, threads-api).
  • Mutex locking behavior (pthread_mutex_lock).
  • Condition variables and signaling (pthread_cond_wait).
  • HTTP Range semantics (206/416 behavior).

Questions to Guide Your Design

  • Where should each thread write to avoid overlapping file regions?
  • What shared state must be atomic vs mutex-protected?
  • How will you detect a stalled thread and retry its range?
  • How will you ensure the output file is valid if the program is interrupted?

Thinking Exercise

Sketch the shared data you need: a global “download map” plus per-thread stats. Label which fields require atomics, which require a mutex, and which can be thread-local.
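
One possible split, sketched under the assumption of a fixed chunk table (all names are illustrative, not the only correct answer):

#include <pthread.h>
#include <stdatomic.h>
#include <sys/types.h>

struct chunk {
    off_t offset, length;       /* immutable after init: safe to read unlocked */
    _Atomic int state;          /* PENDING / ACTIVE / DONE: single-word updates */
};

struct download {
    struct chunk *chunks;       /* array sized once at startup */
    int nchunks;                /* immutable */
    _Atomic unsigned long long bytes_done;  /* hot counter: atomic, not mutex */
    pthread_mutex_t meta_lock;  /* guards multi-field resume-file updates */
};

/* per-thread stats (speed, retry counts) can stay thread-local and be
   aggregated by the UI thread on demand */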

The Interview Questions They’ll Ask

  • How do you avoid data races when multiple threads write to the same file?
  • When should you use atomics vs a mutex for shared counters?
  • How do you ensure partial downloads can resume safely?
  • What happens if the server ignores Range requests?
  • How do you validate correctness under concurrency?

Hints in Layers

Hint 1: Use pwrite() to write each chunk to its own file offset.

Hint 2: Track progress with an atomic byte counter and compute percent in the UI thread.

Hint 3: Store chunk metadata in a small .json file; mark each chunk as done only after fsync.

Hint 4 (code sketch):

// atomic progress counter (from <stdatomic.h>), shared by all workers
_Atomic unsigned long long bytes_done = 0;

// worker thread: each chunk writes at its own offset, so writes never overlap
ssize_t n = pwrite(out_fd, buf, len, offset);
if (n > 0)
    atomic_fetch_add(&bytes_done, (unsigned long long)n);

Books That Will Help

Book Chapters Why This Matters
OSTEP Threads, Locks, Condition Variables Core synchronization patterns.
TLPI Ch. 29–33 POSIX threads in C.

Common Pitfalls & Debugging

Problem: “Merged file is corrupt”

  • Why: overlapping writes or wrong offsets
  • Fix: ensure each chunk writes to a unique offset
  • Quick test: compare hashes of single-threaded vs parallel outputs

Problem: “Progress freezes at 99%”

  • Why: a worker died without marking its chunk complete
  • Fix: add a watchdog and requeue stalled chunks
  • Quick test: kill a worker thread and verify recovery

Definition of Done

  • 1-thread mode matches curl output hash
  • 4+ threads speed up downloads on a Range-enabled server
  • Resume works after forced crash
  • Progress UI updates without flicker or negative numbers
  • No data corruption under stress tests

Project 2: “Concurrent Image Processing Pipeline” — Multi-Stage Parallel Processor

Attribute Value
Programming Language C
Coolness Level Level 3: Genuinely Clever
Business Potential Level 4: The “Open Core” Infrastructure
Difficulty Level 2: Intermediate
Knowledge Area Concurrency
Software or Tool Thread Pools
Main Book “Structured Parallel Programming” by McCool et al.

What you’ll build: A pipeline that processes thousands of images through stages (resize → filter → watermark → compress) with bounded queues between stages.

Why it teaches concurrency: You implement producer-consumer queues, backpressure, and stage-level parallelism while preventing memory blowups.

Core challenges you’ll face

  • Pipeline stage coordination (producer-consumer, bounded queues)
  • Backpressure (condition variables + queue limits)
  • Work distribution (thread pools, fairness)
  • Memory pressure (ownership + staging buffers)
  • Graceful shutdown (draining pipelines)

Key Concepts

Concept Resource
Producer-Consumer OSTEP “Condition Variables”
Lock Usage Patterns OSTEP “Locks Usage”
Semaphores OSTEP “Semaphores”

Project Details

Attribute Value
Difficulty Intermediate
Time estimate 1–2 weeks
Prerequisites Image basics, queues

Real World Outcome

$ ./imgpipe ./photos --resize 800x600 --watermark logo.png --workers 4
[stage:resize]   in=10000 out=10000  42.3 img/s
[stage:filter]   in=10000 out=10000  39.8 img/s
[stage:watermark] in=10000 out=10000  28.7 img/s (bottleneck)
[stage:compress] in=10000 out=10000  31.1 img/s
[done] total=354s

The Core Question You’re Answering

How do you keep a multi-stage pipeline busy without letting any one stage overwhelm the rest?

Concepts You Must Understand First

  • Condition variables for blocking queues.
  • Mutex usage patterns for shared queues.
  • Semaphore-based bounding.

Questions to Guide Your Design

  • How big should each queue be, and what happens when it is full?
  • How do you stop all threads cleanly when input is done?
  • How will you detect which stage is the bottleneck?

Thinking Exercise

Draw the pipeline as boxes with bounded queues between them. Mark where threads sleep and wake.

The Interview Questions They’ll Ask

  • How do you implement backpressure in a pipeline?
  • How do you avoid deadlocks when shutting down the pipeline?
  • When would you choose semaphores over condition variables?
  • How do you measure throughput per stage?

Hints in Layers

Hint 1: Use one queue per stage; each queue has push() and pop() with a mutex + condvar.

Hint 2: Each stage has N workers; a sentinel item signals shutdown.

Hint 3: Track per-stage counters with atomics to compute throughput.

Hint 4 (code sketch):

// stage worker loop: queue_pop blocks until an item is available and
// returns false once it sees the shutdown signal
while (queue_pop(&q, &item)) {
    process(item);
    queue_push(&next_q, item);  // blocks while the next stage is full (backpressure)
}
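
A minimal sketch of such a bounded queue; a closed flag plays the role of Hint 2’s sentinel, and everything beyond the names used in Hint 4 is illustrative:

#include <pthread.h>
#include <stdbool.h>

#define QCAP 64

struct queue {
    void *items[QCAP];
    int head, tail, count;
    bool closed;                                  // set once input is exhausted
    pthread_mutex_t mu;
    pthread_cond_t not_full, not_empty;
};

void queue_push(struct queue *q, void *item) {
    pthread_mutex_lock(&q->mu);
    while (q->count == QCAP)                      // backpressure: producer blocks
        pthread_cond_wait(&q->not_full, &q->mu);
    q->items[q->tail] = item;
    q->tail = (q->tail + 1) % QCAP;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->mu);
}

bool queue_pop(struct queue *q, void **out) {
    pthread_mutex_lock(&q->mu);
    while (q->count == 0 && !q->closed)
        pthread_cond_wait(&q->not_empty, &q->mu);
    if (q->count == 0) {                          // closed and fully drained
        pthread_mutex_unlock(&q->mu);
        return false;
    }
    *out = q->items[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->mu);
    return true;
}

void queue_close(struct queue *q) {
    pthread_mutex_lock(&q->mu);
    q->closed = true;
    pthread_cond_broadcast(&q->not_empty);        // wake every blocked consumer
    pthread_mutex_unlock(&q->mu);
}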

Books That Will Help

Book Chapters Why This Matters
OSTEP Condition Variables, Semaphores Producer/consumer patterns.
TLPI Ch. 29–33 Thread synchronization APIs.

Common Pitfalls & Debugging

Problem: “Memory usage grows without bound”

  • Why: unbounded queues
  • Fix: enforce queue capacity + blocking push
  • Quick test: log queue sizes over time

Problem: “Pipeline deadlocks on shutdown”

  • Why: workers waiting without termination signal
  • Fix: push sentinel items or broadcast on shutdown
  • Quick test: run with 1 image and ensure clean exit

Definition of Done

  • Pipeline processes 10k images without OOM
  • Each stage shows stable throughput metrics
  • Shutdown completes without hanging threads
  • Queue size never exceeds configured cap

Project 3: “Real-Time Multiplayer Game Server” — Chat + Game State Synchronization

Attribute Value
Main Programming Language C
Alternative Programming Languages Rust, Go, C++
Coolness Level Level 4: Hardcore Tech Flex
Business Potential Level 4: The “Open Core” Infrastructure
Difficulty Level 3: Advanced
Knowledge Area Network Programming, Concurrency
Software or Tool epoll/select, WebSockets
Main Book “The Linux Programming Interface”

What you’ll build: A server that handles 100+ simultaneous clients, synchronizes game state, and broadcasts updates at a fixed tick rate.

Why it teaches concurrency: You must guard shared state, handle high fan-out broadcast, and ensure real-time tick consistency without blocking network I/O.

Core challenges you’ll face

  • Concurrent connections (event-driven I/O with epoll)
  • Shared game state (read/write lock patterns)
  • Broadcast fan-out (contention control)
  • Tick loop synchronization (barriers or time-based scheduling)
  • Disconnect cleanup (safe removal while iterating)

Key Concepts

Concept Resource
Event-driven I/O epoll(7)
Mutexes & Condvars pthread man pages + OSTEP
Concurrency Bugs OSTEP “Common Concurrency Problems”

Project Details

Attribute Value
Difficulty Intermediate–Advanced
Time estimate 2–3 weeks
Prerequisites Socket programming, client/server basics

Real World Outcome

$ ./gameserver --port 8080 --tick-rate 20
[tick 1547] players=52 tick_ms=3.1 outgoing=1240 msgs
[tick 1548] players=53 tick_ms=3.4 outgoing=1288 msgs
[chat] player42: gg!

The Core Question You’re Answering

How do you keep shared state consistent while simultaneously serving hundreds of network events per second?

Concepts You Must Understand First

  • epoll readiness and event loops.
  • Mutex/condvar usage and pitfalls.
  • Common concurrency bugs (races, deadlocks).

Questions to Guide Your Design

  • How do you partition game state to minimize lock contention?
  • Will you use one thread per connection or a single event loop?
  • How do you ensure tick updates are consistent with event processing?

Thinking Exercise

Sketch two designs: (A) event loop with a single thread, (B) event loop + worker pool. Mark where locks are required in each.

The Interview Questions They’ll Ask

  • How does epoll differ from select/poll?
  • How do you prevent one slow client from stalling the server?
  • How do you reduce lock contention in shared state?
  • How do you avoid waking too many threads at once when work arrives?

Hints in Layers

Hint 1: Separate network I/O from game-state updates.

Hint 2: Use a read-write lock or state snapshots for broadcasts.

Hint 3: Process input in batches per tick to avoid starvation.

Hint 4 (code sketch):

// one tick: wait at most tick_ms for network events, drain them,
// then advance the simulation and broadcast the new state
while (running) {
    int n = epoll_wait(epfd, events, MAX_EVENTS, tick_ms);
    for (int i = 0; i < n; i++)
        handle_event(&events[i]);  // nonblocking reads/writes only
    update_world();                // no other thread touches world state here
    broadcast_state();
}
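
New connections are registered with the same epoll instance; a sketch with error handling omitted (client_fd and epfd as above):

#include <fcntl.h>
#include <sys/epoll.h>

// make the socket nonblocking so one slow client can't stall the loop
int flags = fcntl(client_fd, F_GETFL, 0);
fcntl(client_fd, F_SETFL, flags | O_NONBLOCK);

// register for readable events with the loop's epoll instance
struct epoll_event ev = { .events = EPOLLIN, .data.fd = client_fd };
epoll_ctl(epfd, EPOLL_CTL_ADD, client_fd, &ev);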

Books That Will Help

Book Chapters Why This Matters
TLPI Ch. 56–63 (Sockets, I/O Models) Event-driven networking.
OSTEP Concurrency + Bugs Avoid deadlocks and races.

Common Pitfalls & Debugging

Problem: “Server stutters at 20 clients”

  • Why: coarse-grained locks or blocking I/O
  • Fix: move to nonblocking sockets + epoll loop
  • Quick test: simulate 100 clients with dummy traffic

Problem: “Players disappear randomly”

  • Why: concurrent modification of shared state
  • Fix: copy-on-write or lock around state updates
  • Quick test: enable debug asserts on state invariants

Definition of Done

  • 100 simulated clients connect without crashes
  • Tick rate stays stable under load
  • No data races under ThreadSanitizer (if available)
  • All disconnects are handled without leaks

Project 4: “Lock-Free Concurrent Queue” — High-Performance Data Structure

Attribute Value
Main Programming Language C
Alternative Programming Languages Rust, C++
Coolness Level Level 5: Pure Magic
Business Potential Level 1: Resume Gold
Difficulty Level 4: Expert
Knowledge Area Lock-Free Programming, Atomics
Software or Tool Atomic Operations, Memory Barriers
Main Book “Rust Atomics and Locks”

What you’ll build: A multi-producer, multi-consumer queue using CAS loops and atomic pointers, with a correctness proof and stress tests.

Why it teaches concurrency: You will confront memory ordering, the ABA problem, and the reality that lock-free != wait-free.

Core challenges you’ll face

  • CAS loops (atomic compare-exchange semantics)
  • ABA problem (safe memory reclamation)
  • Memory ordering (acquire/release vs relaxed)
  • Linearizability (defining correctness)
  • Stress testing (finding rare races)

Key Concepts

Concept Resource
Atomic RMW & CAS GCC atomic builtins
Memory Ordering LLVM Atomics Guide
OS Primitives Rust Atomics & Locks Ch. 8–9

Project Details

Attribute Value
Difficulty Advanced
Time estimate 2–3 weeks
Prerequisites Solid C pointers, atomics, memory models

Real World Outcome

$ ./lfqueue_bench --threads 16 --ops 10000000
[run] lock_free_queue ops=10,000,000 threads=16
[result] throughput=85.2 Mops/s  retries=3.1%
[result] mutex_queue throughput=24.7 Mops/s

The Core Question You’re Answering

How do you provide correct FIFO semantics without ever taking a lock?

Concepts You Must Understand First

  • CAS semantics and failure modes.
  • Memory orderings and visibility.
  • Thread behavior and concurrency bugs.

Questions to Guide Your Design

  • Where is the linearization point for enqueue/dequeue?
  • How will you reclaim nodes safely (hazard pointers or epochs)?
  • Which memory order do you need at each atomic operation?

Thinking Exercise

Write down a sequential queue spec, then annotate where each concurrent operation must appear atomic.

The Interview Questions They’ll Ask

  • What is the ABA problem and how do you handle it?
  • How do acquire/release differ from sequential consistency?
  • Why can lock-free algorithms still starve?
  • How do you test lock-free code for correctness?

Hints in Layers

Hint 1: Start with a single-producer/single-consumer queue.

Hint 2: Add CAS on the tail for multi-producer; then on head for multi-consumer.

Hint 3: Keep a dummy node as the initial head to simplify logic.

Hint 4 (code sketch):

// enqueue step (Michael–Scott): first link the new node after the observed
// tail, then try to swing the tail pointer forward
struct node *expected = NULL;
if (atomic_compare_exchange_weak(&old_tail->next, &expected, new_node)) {
    atomic_compare_exchange_weak(&tail, &old_tail, new_node);  // ok if this fails
}
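
For context, the full Michael–Scott enqueue path looks roughly like this; it relies on C11’s sequentially consistent defaults for simplicity, and safe memory reclamation (the ABA defense) is deliberately omitted:

#include <stdatomic.h>
#include <stdlib.h>

struct node {
    void *value;
    _Atomic(struct node *) next;
};

static _Atomic(struct node *) head, tail;  // init: both point at one dummy node

void enqueue(void *value) {
    struct node *n = malloc(sizeof *n);
    n->value = value;
    atomic_store(&n->next, NULL);
    for (;;) {
        struct node *t = atomic_load(&tail);
        struct node *next = atomic_load(&t->next);
        if (t != atomic_load(&tail))
            continue;                        // tail moved under us: retry
        if (next != NULL) {                  // tail is lagging: help advance it
            atomic_compare_exchange_weak(&tail, &t, next);
            continue;
        }
        struct node *expected = NULL;
        if (atomic_compare_exchange_weak(&t->next, &expected, n)) {
            atomic_compare_exchange_weak(&tail, &t, n);  // fine if this fails
            return;
        }
    }
}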

Books That Will Help

Book Chapters Why This Matters
Rust Atomics and Locks Ch. 2–4, 7, 9 Atomics + lock building.
OSTEP Locks, Common Concurrency Problems Failure patterns + reasoning.

Common Pitfalls & Debugging

Problem: “Queue occasionally drops items”

  • Why: incorrect CAS retry or memory ordering bug
  • Fix: enforce acquire/release on pointer swaps
  • Quick test: run 1e7 ops with randomized thread scheduling

Problem: “Segfault under load”

  • Why: freed node reused (ABA)
  • Fix: hazard pointers or epoch GC
  • Quick test: enable ASAN + stress test

Definition of Done

  • Passes linearizability tests under randomized schedules
  • No crashes under 10M+ ops stress test
  • Benchmarked against a mutex queue
  • ABA protection is documented and tested

Project 5: “Parallel Ray Tracer” — Multi-Core Photorealistic Renderer

Attribute Value
Main Programming Language C
Alternative Programming Languages Rust, C++, Go
Coolness Level Level 5: Pure Magic
Business Potential Level 1: Resume Gold
Difficulty Level 2: Intermediate
Knowledge Area Parallel Computing, Graphics
Software or Tool OpenMP, Thread Pools
Main Book “Ray Tracing in One Weekend”

What you’ll build: A CPU ray tracer that distributes pixel work across cores to render scenes faster.

Why it teaches concurrency: You learn how to partition work, avoid false sharing, and aggregate results efficiently.

Core challenges you’ll face

  • Work distribution (static vs dynamic scheduling)
  • False sharing (cache line contention)
  • Load balancing (adaptive chunking)
  • Result aggregation (parallel reduction)
  • Progress reporting (atomic counters)

Key Concepts

Concept Resource
OpenMP parallel directives OpenMP spec (pragma omp parallel/for)
Memory visibility LLVM Atomics Guide

Project Details

Attribute Value
Difficulty Intermediate
Time estimate 2 weeks
Prerequisites Basic 3D math, recursion

Real World Outcome

$ ./raytracer scene.json --threads 8 --out render.png
[render] 3840x2160 samples=50
[progress] 12%  3.9s
[progress] 67%  18.2s
[done] 28.7s  output=render.png

The Core Question You’re Answering

How do you divide a perfectly parallel workload so that all cores stay busy and memory stays hot?

Concepts You Must Understand First

  • OpenMP parallel regions and work sharing.
  • Memory effects and false sharing.

Questions to Guide Your Design

  • Should you split by rows, tiles, or adaptive task queues?
  • How do you avoid contention on the image buffer?
  • What is the smallest chunk size before scheduling overhead dominates?

Thinking Exercise

Compute expected speedup with Amdahl’s Law for a 95% parallel workload at 8 cores.

The Interview Questions They’ll Ask

  • Why isn’t the speedup linear with core count?
  • What is false sharing and how do you avoid it?
  • When should you use OpenMP vs a custom thread pool?

Hints in Layers

Hint 1: Use per-thread pixel buffers and merge at the end.

Hint 2: Use dynamic scheduling for uneven scenes.

Hint 3: Avoid writing to adjacent pixels in different threads.

Hint 4 (code sketch):

// dynamic scheduling lets threads grab rows as they finish, balancing
// scenes where some rows trace far more geometry than others
#pragma omp parallel for schedule(dynamic)
for (int y = 0; y < height; y++) {
    render_row(y);
}
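
One concrete defense against false sharing is padding per-thread counters onto separate cache lines; this sketch assumes 64-byte lines and illustrative names:

#include <stdatomic.h>

#define CACHE_LINE  64
#define MAX_THREADS 64

struct padded_counter {
    _Atomic unsigned long done;
    char pad[CACHE_LINE - sizeof(_Atomic unsigned long)];
};

static struct padded_counter progress[MAX_THREADS];  // one cache line per thread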

Books That Will Help

Book Chapters Why This Matters
OSTEP Concurrency intro Parallelism mental model.
Rust Atomics and Locks Ch. 2–3 Atomic counters for progress.

Common Pitfalls & Debugging

Problem: “Image has stripes or artifacts”

  • Why: shared buffer writes without isolation
  • Fix: thread-local tiles + merge
  • Quick test: render with 1 thread vs 8 and compare diffs

Problem: “8 cores only give 2x speedup”

  • Why: poor scheduling or false sharing
  • Fix: use dynamic scheduling + chunk tuning
  • Quick test: plot time per chunk size

Definition of Done

  • Image output matches single-thread render
  • 4+ cores show clear speedup
  • Progress reporting is stable
  • Chunk size tuning is documented

Final Comprehensive Project: Distributed Task Scheduler

What you’ll build: A mini-Celery/Temporal-like system with an HTTP API, a coordinator, and worker processes that execute tasks with retries, timeouts, and dependency DAGs.

Why it teaches everything: You combine thread pools, queues, concurrency-safe state, distributed coordination, and monitoring.

Core challenges you’ll face

  • Concurrent task queues (priority + fairness)
  • Worker heartbeats (failure detection)
  • DAG execution (parallelism with dependencies)
  • Exactly-once semantics (idempotency + locks)
  • Live dashboard (concurrent reads + writes)

Key Concepts

Concept Resource
Thread pools & queues OSTEP + TLPI threads chapters
Locks & condition variables pthreads + OSTEP
Concurrency bugs OSTEP “Common Concurrency Problems”

Real World Outcome

# submit a task
$ curl -X POST http://localhost:8000/submit \
  -d '{"task":"transcode","input":"video.mp4"}'
{"id":"task-42","status":"queued"}

# dashboard view
[workers] active=6 idle=2
[queue] pending=18 running=6 complete=120

The Core Question You’re Answering

How do you coordinate many concurrent workers safely while preserving correctness and visibility?

Concepts You Must Understand First

  • Producer/consumer queues and backpressure.
  • Mutex + condition variable semantics.
  • Failure patterns in concurrent systems.

Questions to Guide Your Design

  • What data must be strongly consistent vs eventually consistent?
  • How do you avoid duplicate execution after failures?
  • How will you expose progress without blocking workers?

Thinking Exercise

Write a timeline of a task that is retried twice and still succeeds. Identify where state must be atomic.

The Interview Questions They’ll Ask

  • How do you implement task retries without duplicates?
  • How do you detect and recover from worker failures?
  • How do you keep the dashboard consistent under load?

Hints in Layers

Hint 1: Keep task state in a single concurrency-safe map.

Hint 2: Make task execution idempotent where possible.

Hint 3: Use a separate thread for heartbeats and monitoring.
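
A sketch of a guarded state transition over that map (all names are illustrative): refusing a transition whose prior state does not match is what blocks duplicate completions after retries.

#include <pthread.h>
#include <stdbool.h>

enum task_state { QUEUED, RUNNING, DONE, FAILED };

struct task {
    enum task_state state;
    int attempts;
};

/* returns true only if the caller won the transition; a retried task
   finishing twice takes the DONE transition exactly once */
bool task_transition(struct task *t, pthread_mutex_t *map_lock,
                     enum task_state from, enum task_state to) {
    bool won;
    pthread_mutex_lock(map_lock);
    won = (t->state == from);
    if (won)
        t->state = to;
    pthread_mutex_unlock(map_lock);
    return won;
}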

Books That Will Help

Book Chapters Why This Matters
TLPI Ch. 29–33 Thread primitives in C.
OSTEP Concurrency + Bugs Correctness patterns.

Common Pitfalls & Debugging

Problem: “Tasks stuck in RUNNING forever”

  • Why: worker crashed without cleanup
  • Fix: heartbeat timeouts + requeue
  • Quick test: kill a worker mid-task

Problem: “Dashboard freezes under load”

  • Why: long locks held by workers
  • Fix: read-only snapshots or lock-free reads
  • Quick test: stress with 1000 tasks

Definition of Done

  • Multiple workers execute tasks without duplicates
  • Failed workers are detected and tasks are requeued
  • DAG dependencies enforce correct ordering
  • Dashboard updates under load without blocking workers

You’ll Know You’ve Truly Learned Concurrency When…

  • You can spot race conditions by inspection
  • You design with lock order before writing code
  • You can explain why speedup plateaued using Amdahl’s Law
  • You can debug a deadlock using traces and lock graphs