Learning Concurrency & Parallel Programming Through Real-World Projects
Goal: Build a deep, practical understanding of how concurrent programs actually behave on real machines, from thread creation and synchronization all the way down to kernel wakeups and cache effects. You will learn to reason about correctness (no races, no deadlocks) and performance (scaling, contention, memory bandwidth), and you will be able to design, implement, and debug multi-threaded systems that hold up under load. You’ll also develop a mental model that connects user-space primitives (mutexes, condition variables, atomics) to the kernel mechanisms that back them.
Why Concurrency & Parallel Programming Matters
Modern performance gains largely come from parallelism rather than single-core clock speed improvements. If you want your software to keep getting faster on mainstream hardware, you must parallelize it explicitly and correctly.
Amdahl’s Law explains why even small serial sections can cap speedups as core counts rise. This is why you must design for contention, eliminate unnecessary serialization, and choose the right concurrency patterns early.
Threads let a single process execute multiple flows of control, sharing the same address space but maintaining separate stacks and registers. This makes sharing data easy, but it also makes races and synchronization bugs easy to create.
Single Thread Multi-Threaded
┌───────────────┐ ┌───────────────┐
│ Code/Heap │ │ Code/Heap │
│ One Stack │ │ Stack A Stack B│
└───────────────┘ └───────────────┘
| |
v v
Simple state Shared state +
more interleavings
Prerequisites & Background Knowledge
Essential Prerequisites (Must Have)
- Solid C programming (pointers, structs, memory layout)
- Basic OS concepts (process vs thread, syscalls, scheduling)
- Comfort reading and modifying multi-file C projects
Helpful But Not Required
- Intro networking (sockets, HTTP)
- Basic performance profiling (time, CPU utilization)
- Familiarity with Makefiles and gdb
Self-Assessment Questions
- Can you explain what a race condition is and why it is nondeterministic?
- Can you write a small `pthread_create` example from memory?
- Can you describe the difference between a mutex and a semaphore?
- Do you understand how shared memory differs from message passing?
Development Environment Setup
- Linux or WSL2 (Linux recommended for tooling and thread behavior parity)
- `gcc` or `clang`, `make`, `gdb`; `perf`, `strace`, and `ltrace` for performance and syscall tracing
Time Investment (Realistic)
- Project 1–2: 1–2 weeks each
- Project 3–5: 2–3 weeks each
- Final project: 1–2 months part-time
Important Reality Check
Concurrency bugs are timing-dependent and can appear or disappear based on scheduling. Expect to spend more time observing than coding. You will use logs, tracing, and stress tests to catch issues that “vanish” in debug mode.
How to Use This Guide
- Read the Core Concept Analysis to build your mental model first.
- Pick a project that matches your current comfort level.
- Before coding, answer the Core Question and Thinking Exercise.
- Use the Hints in Layers only when truly stuck.
- Complete the Definition of Done checklist to lock in mastery.
Quick Start (First 48 Hours)
Day 1
- Read OSTEP “Concurrency: An Introduction” and “Thread API”.
- Write a tiny program that creates 4 threads and increments a shared counter incorrectly.
- Add a mutex and watch the counter become correct.
Day 2
- Read OSTEP “Locks” and “Condition Variables”.
- Implement a bounded queue with one producer and one consumer.
- Add a second producer and fix the new race.
Recommended Learning Paths
Path A: Systems Programming Track 1) Download Accelerator → 2) Image Pipeline → 3) Game Server → 4) Lock-Free Queue → Final Scheduler
Path B: Performance/Graphics Track 1) Ray Tracer → 2) Image Pipeline → 3) Lock-Free Queue → Final Scheduler
Path C: Distributed Systems Track 1) Download Accelerator → 2) Game Server → Final Scheduler
Big Picture / Mental Model
Workload
|
v
Decompose into independent tasks
|
v
Choose pattern: fork-join | pipeline | producer-consumer | event loop
|
v
Pick primitives: mutex | condvar | semaphore | atomics
|
v
Runtime: pthreads / OpenMP / custom thread pool
|
v
Kernel: futex + scheduler + wakeups
Threads, synchronization primitives, and the kernel mechanisms that back them are tightly coupled; understanding those linkages is how you debug hard bugs.
Core Concept Analysis
| Concept Cluster | What You Need to Internalize |
|---|---|
| Threading Fundamentals | Thread creation, lifecycle, stacks, join/detach, thread IDs. |
| Synchronization Primitives | Mutexes, condition variables, semaphores, and how blocking/unblocking works. |
| Concurrency Hazards | Race conditions, deadlocks, livelocks, starvation, missed signals, priority inversion. |
| Atomic Operations | Compare-and-swap, atomic read-modify-write, memory orderings. |
| Parallel Patterns | Fork-join, pipeline, bounded buffers, event loops. |
| Memory Models | Visibility, happens-before, cache effects, false sharing. |
| I/O Concurrency | Event-driven I/O (epoll), backpressure, readiness vs completion. |
Concept Summary Table
| Concept | You Know It When You Can… |
|---|---|
| Threads | predict ordering changes under different schedulers |
| Mutexes | eliminate races without destroying parallelism |
| Condvars | design correct producer-consumer queues |
| Semaphores | express both mutual exclusion and ordering constraints |
| Atomics | implement a lock-free counter that scales under contention |
| Memory Model | explain why “it worked in debug” is meaningless |
| Parallel Patterns | choose the correct pattern for a workload |
Deep Dive Reading by Concept
| Concept | Book + Chapters |
|---|---|
| Threads & Thread API | OSTEP: “Concurrency: An Introduction” + “Thread API” (threads-intro.pdf, threads-api.pdf). |
| Locks & Condition Variables | OSTEP: “Locks” + “Condition Variables”. |
| Semaphores | OSTEP: “Semaphores”. |
| Concurrency Bugs | OSTEP: “Common Concurrency Problems”. |
| POSIX Threads in C | The Linux Programming Interface: Ch. 29–33 (Threads series). |
| Atomics & Memory Ordering | Rust Atomics and Locks: Ch. 2–3, 7. |
| OS Primitives (futex, pthreads) | Rust Atomics and Locks: Ch. 8. |
Success Metrics
- You can explain a deadlock in terms of lock acquisition order.
- You can use `perf`, `strace`, and basic tracing to catch timing issues.
- You can quantify speedup and explain why it plateaus.
- You can build a concurrency tool that runs correctly under stress.
Core Subsystem Walkthrough (User Space → Syscall → Kernel)
- A user-space thread calls `pthread_mutex_lock()` or waits on a condition variable.
- If contended, it blocks, and the runtime may call into the kernel via futex.
- The kernel uses futex to sleep/wake threads waiting on a shared memory address.
- The scheduler chooses the next runnable thread; wakeups are handled through kernel scheduling and wait queues.
Userspace (pthread) -> futex syscall -> kernel wait queue -> scheduler -> wakeup
Kernel Navigation Cheatsheet
Use these directories and rg patterns if you want to trace how user-space primitives map to kernel code:
- `include/linux/sched.h` — `struct task_struct`, scheduler interfaces
- `kernel/futex/` — futex core logic
- `kernel/locking/mutex.c` — mutex implementation
- `kernel/locking/rwsem.c` — reader-writer semaphores
Example search patterns:
rg "futex" kernel/futex
rg "mutex_lock" kernel/locking/mutex.c
rg "rwsem" kernel/locking/rwsem.c
Debugging Toolkit (Concurrency-Focused)
- ftrace: kernel function tracing via tracefs.
- Tracepoints: low-overhead kernel probes for performance/debugging.
- perf_event_open: programmatic access to hardware/software counters.
- lockdep: runtime lock dependency validator in the kernel.
- gdb/rr/strace: user-space debugging and syscall tracing
Kernel Data Structure Atlas
| Structure | What It Represents | Where to Look |
|---|---|---|
| `task_struct` | Kernel task/thread descriptor | `include/linux/sched.h` |
| Futex hash buckets | Wait-queue buckets for futexes | `kernel/futex/core.c` |
| `struct mutex` | Kernel mutex behavior | `kernel/locking/mutex.c` |
| `rw_semaphore` | Reader/writer lock behavior | `kernel/locking/rwsem.c` |
Kernel Contribution Checklist
- Run `scripts/checkpatch.pl` on your changes.
- Use `scripts/get_maintainer.pl` to find the right mailing list.
- Follow kernel commit message conventions and keep patches small.
Glossary of Concurrency Terms
- Thread: An execution context within a process; threads share an address space but have separate stacks.
- Mutex: Mutual exclusion lock; a thread blocks if another thread owns it.
- Condition Variable: A wait/notify primitive that atomically unlocks a mutex when waiting and re-locks before returning.
- Semaphore: Counter-based synchronization primitive with wait/post operations.
- Futex: Kernel facility to sleep/wake on a shared memory word.
- Atomic Operation: A read-modify-write that appears indivisible to other threads.
- Deadlock: A cycle of threads each waiting on locks held by others.
Common Failure Signatures
- “Program hangs under load” → likely deadlock or missed wakeup.
- “Works with 1 thread, fails with 2” → data race or non-atomic update.
- “Performance gets worse with more threads” → contention or false sharing.
- “Random crashes in optimized builds” → memory ordering or lifetime bugs.
Kernel Reading Plan (10 Days)
- OSTEP threads-intro + threads-api.
- OSTEP locks + locks-usage.
- OSTEP condition variables + semaphores.
- OSTEP common concurrency problems.
- TLPI Ch. 29–33 (threads).
- Rust Atomics & Locks Ch. 2–3 (atomics & memory ordering).
- Rust Atomics & Locks Ch. 8–9 (OS primitives, locks).
- Read futex(2) and pthread mutex/condvar man pages.
- Read epoll(7) and tracepoints docs.
- Skim kernel lockdep docs.
Project 1: “Multi-Threaded Download Accelerator” — Parallel Chunk Downloader
| Attribute | Value |
|---|---|
| Programming Language | C |
| Coolness Level | Level 3: Genuinely Clever |
| Business Potential | 2. The “Micro-SaaS / Pro Tool” |
| Difficulty | Level 2: Intermediate |
| Knowledge Area | Concurrency / Networking |
| Software or Tool | Threads / HTTP |
| Main Book | “The Linux Programming Interface” by Michael Kerrisk |
What you’ll build: A CLI tool that downloads large files by splitting them into chunks and downloading them in parallel using HTTP Range requests, then merging them into one file.
Why it teaches concurrency: You coordinate multiple worker threads, handle partial failures, implement progress aggregation, and ensure correct writes to a shared output file without races.
Core challenges you’ll face
- Chunk coordination (thread synchronization, work distribution)
- Progress aggregation (shared state, atomic counters)
- Handling partial failures (cancellation, cleanup)
- Rate limiting / connection pooling (semaphores, resource caps)
- Resume capability (persistent metadata + concurrency)
Key Concepts
| Concept | Resource |
|---|---|
| Threads & Thread Creation | TLPI Ch. 29–30 |
| Mutexes & Condition Variables | OSTEP “Locks” + “Condition Variables” |
| Atomics | GCC atomic builtins, LLVM atomics overview |
| HTTP Range Requests | MDN Range header reference |
Project Details
| Attribute | Value |
|---|---|
| Difficulty | Intermediate |
| Time estimate | 1–2 weeks |
| Prerequisites | Basic networking (HTTP), file I/O, C toolchain |
Real World Outcome
$ ./downloader https://example.com/large.iso -c 8 -o large.iso
[init] file_size=4.7GB chunks=8 chunk_size=587MB
[thread-3] range=1761MB..2348MB 22% 48.2 MB/s
[thread-7] range=4112MB..4700MB 61% 52.9 MB/s
[merge] 8/8 chunks complete
[verify] sha256: OK
[done] total=93.2s avg=50.4 MB/s
You will also generate a JSON progress file that allows resuming after a crash.
The Core Question You’re Answering
How do you safely coordinate multiple threads writing to a shared output without corrupting data or stalling progress?
Concepts You Must Understand First
- Thread lifecycle and join semantics (OSTEP threads-intro, threads-api).
- Mutex locking behavior (pthread_mutex_lock).
- Condition variables and signaling (pthread_cond_wait).
- HTTP Range semantics (206/416 behavior).
Questions to Guide Your Design
- Where should each thread write to avoid overlapping file regions?
- What shared state must be atomic vs mutex-protected?
- How will you detect a stalled thread and retry its range?
- How will you ensure the output file is valid if the program is interrupted?
Thinking Exercise
Sketch the shared data you need: a global “download map” plus per-thread stats. Label which fields require atomics, which require a mutex, and which can be thread-local.
The Interview Questions They’ll Ask
- How do you avoid data races when multiple threads write to the same file?
- When should you use atomics vs a mutex for shared counters?
- How do you ensure partial downloads can resume safely?
- What happens if the server ignores Range requests?
- How do you validate correctness under concurrency?
Hints in Layers
Hint 1: Use pwrite() to write each chunk to its own file offset.
Hint 2: Track progress with an atomic byte counter and compute percent in the UI thread.
Hint 3: Store chunk metadata in a small .json file; mark each chunk as done only after fsync.
Hint 4 (code sketch):
// atomic progress
_Atomic unsigned long long bytes_done = 0;
// worker thread
pwrite(out_fd, buf, len, offset);
atomic_fetch_add(&bytes_done, len);
Books That Will Help
| Book | Chapters | Why This Matters |
|---|---|---|
| OSTEP | Threads, Locks, Condition Variables | Core synchronization patterns. |
| TLPI | Ch. 29–33 | POSIX threads in C. |
Common Pitfalls & Debugging
Problem: “Merged file is corrupt”
- Why: overlapping writes or wrong offsets
- Fix: ensure each chunk writes to a unique offset
- Quick test: compare hashes of single-threaded vs parallel outputs
Problem: “Progress freezes at 99%”
- Why: a worker died without marking its chunk complete
- Fix: add a watchdog and requeue stalled chunks
- Quick test: kill a worker thread and verify recovery
Definition of Done
- 1-thread mode matches `curl` output hash
- 4+ threads speed up downloads on a Range-enabled server
- Resume works after forced crash
- Progress UI updates without flicker or negative numbers
- No data corruption under stress tests
Project 2: “Concurrent Image Processing Pipeline” — Multi-Stage Parallel Processor
| Attribute | Value |
|---|---|
| Programming Language | C |
| Coolness Level | Level 3: Genuinely Clever |
| Business Potential | 4. The “Open Core” Infrastructure |
| Difficulty | Level 2: Intermediate |
| Knowledge Area | Concurrency |
| Software or Tool | Thread Pools |
| Main Book | “Structured Parallel Programming” by McCool et al. |
What you’ll build: A pipeline that processes thousands of images through stages (resize → filter → watermark → compress) with bounded queues between stages.
Why it teaches concurrency: You implement producer-consumer queues, backpressure, and stage-level parallelism while preventing memory blowups.
Core challenges you’ll face
- Pipeline stage coordination (producer-consumer, bounded queues)
- Backpressure (condition variables + queue limits)
- Work distribution (thread pools, fairness)
- Memory pressure (ownership + staging buffers)
- Graceful shutdown (draining pipelines)
Key Concepts
| Concept | Resource |
|---|---|
| Producer-Consumer | OSTEP “Condition Variables” |
| Lock Usage Patterns | OSTEP “Locks Usage” |
| Semaphores | OSTEP “Semaphores” |
Project Details
| Attribute | Value |
|---|---|
| Difficulty | Intermediate |
| Time estimate | 1–2 weeks |
| Prerequisites | Image basics, queues |
Real World Outcome
$ ./imgpipe ./photos --resize 800x600 --watermark logo.png --workers 4
[stage:resize] in=10000 out=10000 42.3 img/s
[stage:filter] in=10000 out=10000 39.8 img/s
[stage:watermark] in=10000 out=10000 28.7 img/s (bottleneck)
[stage:compress] in=10000 out=10000 31.1 img/s
[done] total=354s
The Core Question You’re Answering
How do you keep a multi-stage pipeline busy without letting any one stage overwhelm the rest?
Concepts You Must Understand First
- Condition variables for blocking queues.
- Mutex usage patterns for shared queues.
- Semaphore-based bounding.
Questions to Guide Your Design
- How big should each queue be, and what happens when it is full?
- How do you stop all threads cleanly when input is done?
- How will you detect which stage is the bottleneck?
Thinking Exercise
Draw the pipeline as boxes with bounded queues between them. Mark where threads sleep and wake.
The Interview Questions They’ll Ask
- How do you implement backpressure in a pipeline?
- How do you avoid deadlocks when shutting down the pipeline?
- When would you choose semaphores over condition variables?
- How do you measure throughput per stage?
Hints in Layers
Hint 1: Use one queue per stage; each queue has push() and pop() with a mutex + condvar.
Hint 2: Each stage has N workers; a sentinel item signals shutdown.
Hint 3: Track per-stage counters with atomics to compute throughput.
Hint 4 (code sketch):
while (queue_pop(&q, &item)) {
process(item);
queue_push(&next_q, item);
}
Books That Will Help
| Book | Chapters | Why This Matters |
|---|---|---|
| OSTEP | Condition Variables, Semaphores | Producer/consumer patterns. |
| TLPI | Ch. 29–33 | Thread synchronization APIs. |
Common Pitfalls & Debugging
Problem: “Memory usage grows without bound”
- Why: unbounded queues
- Fix: enforce queue capacity + blocking push
- Quick test: log queue sizes over time
Problem: “Pipeline deadlocks on shutdown”
- Why: workers waiting without termination signal
- Fix: push sentinel items or broadcast on shutdown
- Quick test: run with 1 image and ensure clean exit
Definition of Done
- Pipeline processes 10k images without OOM
- Each stage shows stable throughput metrics
- Shutdown completes without hanging threads
- Queue size never exceeds configured cap
Project 3: “Real-Time Multiplayer Game Server” — Chat + Game State Synchronization
| Attribute | Value |
|---|---|
| Main Programming Language | C |
| Alternative Programming Languages | Rust, Go, C++ |
| Coolness Level | Level 4: Hardcore Tech Flex |
| Business Potential | Level 4: The “Open Core” Infrastructure |
| Difficulty | Level 3: Advanced |
| Knowledge Area | Network Programming, Concurrency |
| Software or Tool | epoll/select, WebSockets |
| Main Book | “The Linux Programming Interface” |
What you’ll build: A server that handles 100+ simultaneous clients, synchronizes game state, and broadcasts updates at a fixed tick rate.
Why it teaches concurrency: You must guard shared state, handle high fan-out broadcast, and ensure real-time tick consistency without blocking network I/O.
Core challenges you’ll face
- Concurrent connections (event-driven I/O with epoll)
- Shared game state (read/write lock patterns)
- Broadcast fan-out (contention control)
- Tick loop synchronization (barriers or time-based scheduling)
- Disconnect cleanup (safe removal while iterating)
Key Concepts
| Concept | Resource |
|---|---|
| Event-driven I/O | epoll(7) |
| Mutexes & Condvars | pthread man pages + OSTEP |
| Concurrency Bugs | OSTEP “Common Concurrency Problems” |
Project Details
| Attribute | Value |
|---|---|
| Difficulty | Intermediate–Advanced |
| Time estimate | 2–3 weeks |
| Prerequisites | Socket programming, client/server basics |
Real World Outcome
$ ./gameserver --port 8080 --tick-rate 20
[tick 1547] players=52 tick_ms=3.1 outgoing=1240 msgs
[tick 1548] players=53 tick_ms=3.4 outgoing=1288 msgs
[chat] player42: gg!
The Core Question You’re Answering
How do you keep shared state consistent while simultaneously serving hundreds of network events per second?
Concepts You Must Understand First
- epoll readiness and event loops.
- Mutex/condvar usage and pitfalls.
- Common concurrency bugs (races, deadlocks).
Questions to Guide Your Design
- How do you partition game state to minimize lock contention?
- Will you use one thread per connection or a single event loop?
- How do you ensure tick updates are consistent with event processing?
Thinking Exercise
Sketch two designs: (A) event loop with a single thread, (B) event loop + worker pool. Mark where locks are required in each.
The Interview Questions They’ll Ask
- How does epoll differ from select/poll?
- How do you prevent one slow client from stalling the server?
- How do you reduce lock contention in shared state?
- How do you avoid waking too many threads at once when work arrives?
Hints in Layers
Hint 1: Separate network I/O from game-state updates.
Hint 2: Use a read-write lock or state snapshots for broadcasts.
Hint 3: Process input in batches per tick to avoid starvation.
Hint 4 (code sketch):
while (running) {
n = epoll_wait(epfd, events, MAX, tick_ms);
for (i = 0; i < n; i++) handle_event(events[i]);
update_world();
broadcast_state();
}
Books That Will Help
| Book | Chapters | Why This Matters |
|---|---|---|
| TLPI | Ch. 56–63 (Sockets, I/O Models) | Event-driven networking. |
| OSTEP | Concurrency + Bugs | Avoid deadlocks and races. |
Common Pitfalls & Debugging
Problem: “Server stutters at 20 clients”
- Why: coarse-grained locks or blocking I/O
- Fix: move to nonblocking sockets + epoll loop
- Quick test: simulate 100 clients with dummy traffic
Problem: “Players disappear randomly”
- Why: concurrent modification of shared state
- Fix: copy-on-write or lock around state updates
- Quick test: enable debug asserts on state invariants
Definition of Done
- 100 simulated clients connect without crashes
- Tick rate stays stable under load
- No data races under ThreadSanitizer (if available)
- All disconnects are handled without leaks
Project 4: “Lock-Free Concurrent Queue” — High-Performance Data Structure
| Attribute | Value |
|---|---|
| Main Programming Language | C |
| Alternative Programming Languages | Rust, C++ |
| Coolness Level | Level 5: Pure Magic |
| Business Potential | Level 1: Resume Gold |
| Difficulty | Level 4: Expert |
| Knowledge Area | Lock-Free Programming, Atomics |
| Software or Tool | Atomic Operations, Memory Barriers |
| Main Book | “Rust Atomics and Locks” |
What you’ll build: A multi-producer, multi-consumer queue using CAS loops and atomic pointers, with a correctness proof and stress tests.
Why it teaches concurrency: You will confront memory ordering, the ABA problem, and the reality that lock-free != wait-free.
Core challenges you’ll face
- CAS loops (atomic compare-exchange semantics)
- ABA problem (safe memory reclamation)
- Memory ordering (acquire/release vs relaxed)
- Linearizability (defining correctness)
- Stress testing (finding rare races)
Key Concepts
| Concept | Resource |
|---|---|
| Atomic RMW & CAS | GCC atomic builtins |
| Memory Ordering | LLVM Atomics Guide |
| OS Primitives | Rust Atomics & Locks Ch. 8–9 |
Project Details
| Attribute | Value |
|---|---|
| Difficulty | Advanced |
| Time estimate | 2–3 weeks |
| Prerequisites | Solid C pointers, atomics, memory models |
Real World Outcome
$ ./lfqueue_bench --threads 16 --ops 10000000
[run] lock_free_queue ops=10,000,000 threads=16
[result] throughput=85.2 Mops/s retries=3.1%
[result] mutex_queue throughput=24.7 Mops/s
The Core Question You’re Answering
How do you provide correct FIFO semantics without ever taking a lock?
Concepts You Must Understand First
- CAS semantics and failure modes.
- Memory orderings and visibility.
- Thread behavior and concurrency bugs.
Questions to Guide Your Design
- Where is the linearization point for enqueue/dequeue?
- How will you reclaim nodes safely (hazard pointers or epochs)?
- Which memory order do you need at each atomic operation?
Thinking Exercise
Write down a sequential queue spec, then annotate where each concurrent operation must appear atomic.
The Interview Questions They’ll Ask
- What is the ABA problem and how do you handle it?
- How do acquire/release differ from sequential consistency?
- Why can lock-free algorithms still starve?
- How do you test lock-free code for correctness?
Hints in Layers
Hint 1: Start with a single-producer/single-consumer queue.
Hint 2: Add CAS on the tail for multi-producer; then on head for multi-consumer.
Hint 3: Keep a dummy node as the initial head to simplify logic.
Hint 4 (code sketch):
if (atomic_compare_exchange_weak(&tail->next, &expected, new)) {
atomic_compare_exchange_weak(&tail, &old_tail, new);
}
Books That Will Help
| Book | Chapters | Why This Matters |
|---|---|---|
| Rust Atomics and Locks | Ch. 2–4, 7, 9 | Atomics + lock building. |
| OSTEP | Locks, Common Concurrency Problems | Failure patterns + reasoning. |
Common Pitfalls & Debugging
Problem: “Queue occasionally drops items”
- Why: incorrect CAS retry or memory ordering bug
- Fix: enforce acquire/release on pointer swaps
- Quick test: run 1e7 ops with randomized thread scheduling
Problem: “Segfault under load”
- Why: freed node reused (ABA)
- Fix: hazard pointers or epoch GC
- Quick test: enable ASAN + stress test
Definition of Done
- Passes linearizability tests under randomized schedules
- No crashes under 10M+ ops stress test
- Benchmarked against a mutex queue
- ABA protection is documented and tested
Project 5: “Parallel Ray Tracer” — Multi-Core Photorealistic Renderer
| Attribute | Value |
|---|---|
| Main Programming Language | C |
| Alternative Programming Languages | Rust, C++, Go |
| Coolness Level | Level 5: Pure Magic |
| Business Potential | Level 1: Resume Gold |
| Difficulty | Level 2: Intermediate |
| Knowledge Area | Parallel Computing, Graphics |
| Software or Tool | OpenMP, Thread Pools |
| Main Book | “Ray Tracing in One Weekend” |
What you’ll build: A CPU ray tracer that distributes pixel work across cores to render scenes faster.
Why it teaches concurrency: You learn how to partition work, avoid false sharing, and aggregate results efficiently.
Core challenges you’ll face
- Work distribution (static vs dynamic scheduling)
- False sharing (cache line contention)
- Load balancing (adaptive chunking)
- Result aggregation (parallel reduction)
- Progress reporting (atomic counters)
Key Concepts
| Concept | Resource |
|---|---|
| OpenMP parallel directives | OpenMP spec (pragma omp parallel/for) |
| Memory visibility | LLVM Atomics Guide |
Project Details
| Attribute | Value |
|---|---|
| Difficulty | Intermediate |
| Time estimate | 2 weeks |
| Prerequisites | Basic 3D math, recursion |
Real World Outcome
$ ./raytracer scene.json --threads 8 --out render.png
[render] 3840x2160 samples=50
[progress] 12% 3.9s
[progress] 67% 18.2s
[done] 28.7s output=render.png
The Core Question You’re Answering
How do you divide a perfectly parallel workload so that all cores stay busy and memory stays hot?
Concepts You Must Understand First
- OpenMP parallel regions and work sharing.
- Memory effects and false sharing.
Questions to Guide Your Design
- Should you split by rows, tiles, or adaptive task queues?
- How do you avoid contention on the image buffer?
- What is the smallest chunk size before scheduling overhead dominates?
Thinking Exercise
Compute expected speedup with Amdahl’s Law for a 95% parallel workload at 8 cores.
The Interview Questions They’ll Ask
- Why isn’t the speedup linear with core count?
- What is false sharing and how do you avoid it?
- When should you use OpenMP vs a custom thread pool?
Hints in Layers
Hint 1: Use per-thread pixel buffers and merge at the end.
Hint 2: Use dynamic scheduling for uneven scenes.
Hint 3: Avoid writing to adjacent pixels in different threads.
Hint 4 (code sketch):
#pragma omp parallel for schedule(dynamic)
for (int y = 0; y < height; y++) {
render_row(y);
}
Books That Will Help
| Book | Chapters | Why This Matters |
|---|---|---|
| OSTEP | Concurrency intro | Parallelism mental model. |
| Rust Atomics and Locks | Ch. 2–3 | Atomic counters for progress. |
Common Pitfalls & Debugging
Problem: “Image has stripes or artifacts”
- Why: shared buffer writes without isolation
- Fix: thread-local tiles + merge
- Quick test: render with 1 thread vs 8 and compare diffs
Problem: “8 cores only give 2x speedup”
- Why: poor scheduling or false sharing
- Fix: use dynamic scheduling + chunk tuning
- Quick test: plot time per chunk size
Definition of Done
- Image output matches single-thread render
- 4+ cores show clear speedup
- Progress reporting is stable
- Chunk size tuning is documented
Final Comprehensive Project: Distributed Task Scheduler
What you’ll build: A mini-Celery/Temporal-like system with an HTTP API, a coordinator, and worker processes that execute tasks with retries, timeouts, and dependency DAGs.
Why it teaches everything: You combine thread pools, queues, concurrency-safe state, distributed coordination, and monitoring.
Core challenges you’ll face
- Concurrent task queues (priority + fairness)
- Worker heartbeats (failure detection)
- DAG execution (parallelism with dependencies)
- Exactly-once semantics (idempotency + locks)
- Live dashboard (concurrent reads + writes)
Key Concepts
| Concept | Resource |
|---|---|
| Thread pools & queues | OSTEP + TLPI threads chapters |
| Locks & condition variables | pthreads + OSTEP |
| Concurrency bugs | OSTEP “Common Concurrency Problems” |
Real World Outcome
# submit a task
$ curl -X POST http://localhost:8000/submit \
-d '{"task":"transcode","input":"video.mp4"}'
{"id":"task-42","status":"queued"}
# dashboard view
[workers] active=6 idle=2
[queue] pending=18 running=6 complete=120
The Core Question You’re Answering
How do you coordinate many concurrent workers safely while preserving correctness and visibility?
Concepts You Must Understand First
- Producer/consumer queues and backpressure.
- Mutex + condition variable semantics.
- Failure patterns in concurrent systems.
Questions to Guide Your Design
- What data must be strongly consistent vs eventually consistent?
- How do you avoid duplicate execution after failures?
- How will you expose progress without blocking workers?
Thinking Exercise
Write a timeline of a task that is retried twice and still succeeds. Identify where state must be atomic.
The Interview Questions They’ll Ask
- How do you implement task retries without duplicates?
- How do you detect and recover from worker failures?
- How do you keep the dashboard consistent under load?
Hints in Layers
Hint 1: Keep task state in a single concurrency-safe map.
Hint 2: Make task execution idempotent where possible.
Hint 3: Use a separate thread for heartbeats and monitoring.
Books That Will Help
| Book | Chapters | Why This Matters |
|---|---|---|
| TLPI | Ch. 29–33 | Thread primitives in C. |
| OSTEP | Concurrency + Bugs | Correctness patterns. |
Common Pitfalls & Debugging
Problem: “Tasks stuck in RUNNING forever”
- Why: worker crashed without cleanup
- Fix: heartbeat timeouts + requeue
- Quick test: kill a worker mid-task
Problem: “Dashboard freezes under load”
- Why: long locks held by workers
- Fix: read-only snapshots or lock-free reads
- Quick test: stress with 1000 tasks
Definition of Done
- Multiple workers execute tasks without duplicates
- Failed workers are detected and tasks are requeued
- DAG dependencies enforce correct ordering
- Dashboard updates under load without blocking workers
You’ll Know You’ve Truly Learned Concurrency When…
- You can spot race conditions by inspection
- You design with lock order before writing code
- You can explain why speedup plateaued using Amdahl’s Law
- You can debug a deadlock using traces and lock graphs