Learning Concurrency, Parallelism, Threads & Synchronization
Goal: Deeply understand how threads, processes, semaphores, and all concurrency primitives work at the operating system level through hands-on projects.
Why Concurrency Matters (Big Picture)
Concurrency is how modern software stays responsive, handles load, and uses hardware efficiently. From web servers handling thousands of clients to audio engines processing streams in real time, concurrency is the difference between a system that scales and one that stalls.
The trade-off: shared memory makes communication fast, but it also makes bugs subtle. The same performance features that make concurrency powerful (multiple threads touching the same data) are exactly what cause race conditions, deadlocks, and heisenbugs.
Prerequisites & Background Knowledge
Essential Prerequisites (Must Have)
- Solid C fundamentals (pointers, structs, function pointers)
- Familiarity with the Unix command line
- Basic mental model of how a program is compiled and run
Helpful but Not Required
- Prior exposure to operating systems concepts
- Basic understanding of CPU caches and memory hierarchy
Self-Assessment Questions
Before starting, can you answer these?
- What is the difference between a process and a thread?
- What does it mean for memory to be shared?
- Why can `counter++` be unsafe across threads?
- When should you use a mutex vs a semaphore?
If any of these are unclear, skim OSTEP Ch. 5 and 26 first.
Development Environment Setup
- Compiler: `gcc` or `clang` with `-pthread`
- OS: Linux or macOS (pthreads support)
- Debugging tools: `gdb`/`lldb`, `strace` (Linux), `perf` (optional)
- Race detection: ThreadSanitizer (`-fsanitize=thread`) when available
Core Concept Overview
Before diving into projects, here’s the mental map of what you’re learning:
| Concept | What It Is | Why It’s Confusing |
|---|---|---|
| Process | Independent program with its own memory space | People conflate it with “program” or “thread” |
| Thread | Execution unit within a process, shares memory | There are multiple types of threads |
| Kernel Thread | Thread managed by the OS kernel (1:1 model) | The OS does the scheduling |
| Green Thread/User Thread | Thread managed in user-space (N:1 model) | Your code does the scheduling |
| Concurrency | Dealing with multiple things at once (structure) | Often confused with parallelism |
| Parallelism | Doing multiple things at once (execution) | Requires multiple CPU cores |
| Semaphore | Counter that controls access to resources | It’s not just a mutex with a count |
| Mutex | Lock that ensures only one thread enters a critical section | Often mistaken for a binary semaphore, but a mutex has an owner |
The key insight: Threads share memory within a process. This sharing is both the power (fast communication) and the danger (race conditions, deadlocks).
Core Concept Analysis (Mental Models)
1) Race Conditions = Non-Deterministic Interleavings
Thread A: read counter (0) Thread B: read counter (0)
Thread A: add 1 -> 1 Thread B: add 1 -> 1
Thread A: write counter = 1 Thread B: write counter = 1
Expected: 2, Actual: 1 (lost update)
Mental model: Threads run in tiny slices; the OS can switch between them at any instruction. If two threads touch the same data without coordination, the ordering becomes unpredictable.
2) Mutex vs Semaphore (What They Really Do)
Mutex (binary):
lock() -> if locked, wait
unlock()-> release, wake one waiter
Semaphore (counting):
wait() -> if count == 0, wait
post() -> count++, wake one waiter
Mental model: A mutex protects a critical section. A semaphore controls resource availability.
3) Deadlocks Are Cycles in Resource Graphs
T1 holds A, waits for B
T2 holds B, waits for A
T1 -> B
^ |
| v
A <- T2 (cycle = deadlock)
Mental model: If you can draw a cycle in the “who holds what” graph, you can freeze the system.
4) Condition Variables = “Sleep Until State Changes”
Threads should sleep while waiting for a condition instead of spinning. pthread_cond_wait() atomically releases a mutex and sleeps, then re-locks on wake.
Reference: https://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_cond_wait.html
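A minimal sketch of the canonical pattern, assuming a `ready` flag guarded by a mutex (the waiter loops on the predicate; the signaler changes state before signaling):
#include <pthread.h>
#include <stdbool.h>
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
bool ready = false;
void wait_until_ready(void) {
    pthread_mutex_lock(&lock);
    while (!ready) {                      // loop: wakeups can be spurious
        pthread_cond_wait(&cond, &lock);  // atomically unlocks, sleeps, re-locks
    }
    pthread_mutex_unlock(&lock);
}
void mark_ready(void) {
    pthread_mutex_lock(&lock);
    ready = true;                         // change state under the lock...
    pthread_cond_signal(&cond);           // ...then wake one waiter
    pthread_mutex_unlock(&lock);
}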
5) Atomics & Memory Ordering Matter
Atomic operations (C11 `stdatomic.h`) guarantee visibility across cores and let you specify the memory ordering.
Reference: https://en.cppreference.com/w/c/atomic
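For example, a race-free counter needs nothing more than an atomic read-modify-write; a minimal sketch:
#include <stdatomic.h>
atomic_long counter = 0;
void safe_increment(void) {
    // One indivisible read-modify-write: no lost updates, and the
    // result is visible to every core that loads the counter.
    atomic_fetch_add(&counter, 1);
}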
Concept Summary Table (What You Must Internalize)
| Concept | Why It Matters | Where It Shows Up |
|---|---|---|
| Race conditions | Explain “impossible” bugs | Projects 2, 7 |
| Mutex vs semaphore | Know when to block vs serialize | Projects 2, 3, 5 |
| Condition variables | Build efficient producer/consumer | Projects 2, 6 |
| Deadlock conditions | Prevent system freeze | Projects 5, 8 |
| Atomics & memory model | Correct lock-free code | Projects 3, 6 |
| Scheduling | Understand fairness/latency | Projects 4, 8 |
Deep Dive Reading by Concept
| Topic | Book & Chapter |
|---|---|
| Threads & scheduling | Operating Systems: Three Easy Pieces Ch. 26 |
| Concurrency in C | The Linux Programming Interface Ch. 29-30 |
| Memory model & atomics | Rust Atomics and Locks (concepts apply to C) |
| Synchronization primitives | C++ Concurrency in Action Ch. 3-5 |
| OS scheduling | Operating Systems: Three Easy Pieces Ch. 7-9 |
Quick Start Guide (First 48 Hours)
Hour 1-4: Build the mental model
- Read OSTEP Ch. 26 (Concurrency).
- Run Project 1 (process fork demo).
Hour 5-10: See a race condition
- Build Project 2 without locks.
- Watch the counter break.
- Add a mutex and see it stabilize.
Hour 11-24: Learn primitives
- Skim pthreads API docs: `pthread_create`, `pthread_mutex_lock`, `pthread_cond_wait`.
- Read about semaphores and `sem_init`.
References:
https://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_create.html
https://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_mutex_lock.html
https://pubs.opengroup.org/onlinepubs/9699919799/functions/sem_init.html
Hour 25-48: Build a simple thread pool
- Use Project 6’s skeleton.
- Measure throughput with and without work stealing.
Learning Path
Phase 1: Foundations (2-3 weeks)
- Project 1: Process Forker & IPC Visualizer
- Project 2: Multi-threaded Task Executor
Phase 2: Deep Dive (3-4 weeks)
- Project 3: Build Your Own Mutex & Semaphore
- Project 5: Dining Philosophers Visualizer
Phase 3: Mastery (4-6 weeks)
- Project 4: Green Thread Library
- Project 6: Thread Pool with Work Stealing
- Project 7: Concurrent Hash Map
Phase 4: Capstone
- Project 8: Mini Operating System Scheduler
Project Comparison Table
| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| Process Forker & IPC | ⭐⭐ | Weekend | Foundational | ⭐⭐⭐ |
| Multi-threaded Executor | ⭐⭐⭐ | 1 week | Core threading | ⭐⭐⭐⭐ |
| Build Mutex & Semaphore | ⭐⭐⭐⭐ | 1-2 weeks | Deep primitives | ⭐⭐⭐⭐⭐ |
| Green Thread Library | ⭐⭐⭐⭐⭐ | 2-3 weeks | Expert-level | ⭐⭐⭐⭐⭐ |
| Dining Philosophers Viz | ⭐⭐⭐ | 1 week | Deadlock mastery | ⭐⭐⭐⭐⭐ |
| Thread Pool + Work Stealing | ⭐⭐⭐⭐ | 2-3 weeks | Production patterns | ⭐⭐⭐⭐ |
| Concurrent Hash Map | ⭐⭐⭐⭐ | 2-3 weeks | Data structure concurrency | ⭐⭐⭐⭐ |
| Mini OS Scheduler | ⭐⭐⭐⭐⭐ | 1-2 months | Complete mastery | ⭐⭐⭐⭐⭐ |
Project 1: “Process Forker & IPC Visualizer” — Unix Process Memory Isolation Demo
| Attribute | Value |
|---|---|
| Main Programming Language | C |
| Alternative Programming Languages | Rust, Go, Python |
| Coolness Level | Level 3: Genuinely Clever |
| Business Potential | 1. The “Resume Gold” |
| Difficulty | Level 2: Intermediate |
| Knowledge Area | Operating Systems / Process Management |
| Software or Tool | Unix fork/exec syscalls |
| Main Book | Operating Systems: Three Easy Pieces by Remzi H. Arpaci-Dusseau |
What you’ll build
A visual tool that forks child processes, displays their PIDs, shows memory isolation (same variable address, different values), and demonstrates IPC via pipes and shared memory.
Why it teaches processes
You can’t understand threads until you understand what they’re not. A process has its own memory space—when you fork, the child gets a copy. By printing the same variable’s address and value in parent/child, you’ll see “same address, different value” which clicks the concept of virtual memory isolation.
Core challenges you’ll face
- Fork semantics (child gets copy of memory, returns different PID) → maps to process isolation
- Pipe communication (why can’t parent just write to child’s variable?) → maps to IPC necessity
- Shared memory setup (`shmget`/`mmap`) → maps to explicit memory sharing
- Zombie processes (forgetting to `wait()`) → maps to process lifecycle
Key Concepts
| Concept | Resource |
|---|---|
| Process Creation | Operating Systems: Three Easy Pieces Chapter 5 - OSTEP |
| Virtual Memory Basics | Computer Systems: A Programmer’s Perspective Chapter 9 - Bryant & O’Hallaron |
| IPC Mechanisms | The Linux Programming Interface Chapters 44-49 - Michael Kerrisk |
Difficulty & Time
- Difficulty: Beginner-Intermediate
- Time estimate: Weekend
- Prerequisites: Basic C, comfort with command line
Real World Outcome
$ ./process_viz
[Parent PID: 1234] Variable 'counter' at 0x7fff5678 = 0
[Forking...]
[Child PID: 1235] Variable 'counter' at 0x7fff5678 = 0 # Same address!
[Child] Incrementing counter...
[Child PID: 1235] Variable 'counter' at 0x7fff5678 = 100
[Parent PID: 1234] Variable 'counter' at 0x7fff5678 = 0 # Still 0! Isolated!
[Demo: Pipe IPC]
Parent sends: "Hello from parent"
Child receives: "Hello from parent"
[Demo: Shared Memory]
Parent writes 42 to shared memory
Child reads 42 from shared memory
Parent sees child wrote 99 back
The Core Question You’re Answering
How does a process get a private memory space, and what exactly is shared (or not) after `fork()`?
Concepts You Must Understand First
- `fork()` vs `exec()` vs `wait()` semantics
- Copy-on-write and virtual memory isolation
- Pipes and shared memory (`mmap`)
- File descriptor inheritance
Questions to Guide Your Design
- How will you prove two processes see the same virtual address but different values?
- How will you ensure parent/child output order is deterministic?
- When should you use pipes vs shared memory?
Thinking Exercise
Draw the parent and child memory layout after fork(). Mark which pages are shared (copy-on-write) and which are unique after the child writes to a variable.
The Interview Questions They’ll Ask
- Why does `fork()` return twice?
- What is the difference between `fork()` and `exec()`?
- Why do you need `wait()`? What happens if you omit it?
- How does a pipe differ from shared memory?
Hints in Layers
- Fork once and print PID + variable address/value.
- Increment in child and print again.
- Add a pipe for a parent→child message.
- Add `mmap` shared memory and demonstrate both directions.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Processes & fork | Operating Systems: Three Easy Pieces | Ch. 5 |
| Virtual memory | Computer Systems: A Programmer’s Perspective | Ch. 9 |
| IPC | The Linux Programming Interface | Ch. 44-49 |
Common Pitfalls & Debugging
Problem: “Parent output appears twice”
- Why: The stdout buffer is duplicated across `fork()`.
- Fix: `fflush(stdout)` before forking.
Problem: “Child becomes zombie”
- Why: Parent never calls `wait()`.
- Fix: Add `waitpid()` in parent.
Learning Milestones
- Fork works, child has different PID → You understand process creation
- Same address, different values after fork → You understand memory isolation
- Pipes working → You understand why IPC is necessary
- Shared memory without race conditions → You’re ready for threads
Implementation Hints
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>
#include <sys/mman.h>
int main() {
int counter = 0;
printf("[Parent PID: %d] Variable 'counter' at %p = %d\n",
getpid(), (void*)&counter, counter);
pid_t pid = fork();
if (pid == 0) {
// Child process
printf("[Child PID: %d] Variable 'counter' at %p = %d\n",
getpid(), (void*)&counter, counter);
counter = 100; // write to the child's copy; the parent's copy is untouched
printf("[Child] After write: counter = %d\n", counter);
exit(0);
} else {
// Parent process
wait(NULL); // Wait for child to finish
printf("[Parent PID: %d] Variable 'counter' at %p = %d\n",
getpid(), (void*)&counter, counter);
// counter is still 0! Memory is isolated!
}
return 0;
}
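For the pipe and shared-memory demos in the expected output, a minimal sketch along the same lines (Linux/macOS; error handling omitted):
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
int main(void) {
    // Map shared memory BEFORE fork so both processes inherit the mapping
    int* shared = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    int fds[2];
    pipe(fds);  // fds[0] = read end, fds[1] = write end
    if (fork() == 0) {  // child
        char buf[64];
        read(fds[0], buf, sizeof(buf));
        printf("Child receives: \"%s\"\n", buf);
        *shared = 99;   // visible to the parent, unlike a plain variable
        _exit(0);
    }
    const char* msg = "Hello from parent";
    write(fds[1], msg, strlen(msg) + 1);
    wait(NULL);
    printf("Parent sees child wrote %d back\n", *shared);
    return 0;
}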
Key System Calls to Master
| System Call | Purpose |
|---|---|
| `fork()` | Create a child process (copy of parent) |
| `exec*()` | Replace the current process image with a new program |
| `wait()` / `waitpid()` | Wait for a child to terminate |
| `pipe()` | Create a unidirectional communication channel |
| `mmap()` | Create a shared memory region |
| `shmget()` / `shmat()` | System V shared memory |
Project 2: “Multi-threaded Task Executor” — pthreads Race Condition Explorer
| Attribute | Value |
|---|---|
| Main Programming Language | C |
| Alternative Programming Languages | Rust, C++, Go |
| Coolness Level | Level 3: Genuinely Clever |
| Business Potential | 2. The “Micro-SaaS / Pro Tool” |
| Difficulty | Level 3: Advanced |
| Knowledge Area | Concurrency / Thread Management |
| Software or Tool | POSIX Threads (pthreads) |
| Main Book | The Linux Programming Interface by Michael Kerrisk |
What you’ll build
A command-line task executor that spawns N threads, each pulling tasks from a shared queue, with visible race conditions when you forget to lock, and proper synchronization when you add mutexes.
Why it teaches kernel threads
This is where threads become real. You’ll see that threads share memory by default (unlike processes). You’ll intentionally create race conditions, watch the chaos, then fix them. This visceral experience of “broken → fixed” is the best teacher.
Core challenges you’ll face
- Race conditions (counter increments “lose” updates) → maps to critical sections
- Mutex usage (lock before access, unlock after) → maps to mutual exclusion
- Thread joining (waiting for all threads to finish) → maps to thread lifecycle
- Shared queue access (producer-consumer pattern) → maps to concurrent data structures
Resources for Key Challenges
- POSIX thread creation: https://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_create.html
- POSIX mutex lock/unlock: https://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_mutex_lock.html
Key Concepts
| Concept | Resource |
|---|---|
| Thread Creation & Joining | The Linux Programming Interface Chapter 29 - Kerrisk |
| Race Conditions | Operating Systems: Three Easy Pieces Chapter 26 - OSTEP |
| Mutexes | C++ Concurrency in Action Chapter 3 - Anthony Williams |
Difficulty & Time
- Difficulty: Intermediate
- Time estimate: 1 week
- Prerequisites: Project 1 completed, basic C
Real World Outcome
# First run: NO LOCKS (intentional race condition)
$ ./task_executor --threads 4 --tasks 10000 --no-lock
Starting 4 threads to process 10000 tasks...
Thread 0: processing task 0
Thread 1: processing task 0 # OOPS! Same task!
Thread 2: processing task 1
...
Expected completed: 10000
Actual completed: 9847 # RACE CONDITION! Lost updates!
# Second run: WITH LOCKS
$ ./task_executor --threads 4 --tasks 10000
Starting 4 threads to process 10000 tasks...
Thread 0: processing task 0
Thread 1: processing task 1
Thread 2: processing task 2
...
Expected completed: 10000
Actual completed: 10000 # CORRECT!
Total time: 0.23s (parallel speedup!)
The Core Question You’re Answering
How do you coordinate multiple threads safely around shared state without destroying performance?
Concepts You Must Understand First
- POSIX thread lifecycle (`pthread_create`, `pthread_join`)
- Mutex locking discipline (`pthread_mutex_lock`)
- Condition variables for blocking on an empty queue
- False sharing and cache effects
Questions to Guide Your Design
- How will you structure the task queue to avoid races?
- When is it safe to release the lock?
- How will you measure speedup vs overhead?
Thinking Exercise
Assume 4 threads increment a shared counter 1,000,000 times each. What are the best and worst possible outcomes without a mutex?
The Interview Questions They’ll Ask
- What is a data race?
- Why is `counter++` not atomic?
- What is the difference between a mutex and a spinlock?
- When do condition variables outperform busy-waiting?
Hints in Layers
- Start with a shared queue and no locks to observe failure.
- Add a mutex around enqueue/dequeue.
- Add a condition variable to sleep when the queue is empty.
- Measure throughput with different thread counts.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Threads | The Linux Programming Interface | Ch. 29 |
| Concurrency | Operating Systems: Three Easy Pieces | Ch. 26 |
| Mutexes | C++ Concurrency in Action | Ch. 3 |
Common Pitfalls & Debugging
Problem: “Throughput gets worse with more threads”
- Why: Lock contention or false sharing.
- Fix: Reduce critical section size, pad shared counters.
Problem: “Queue is empty but threads spin”
- Why: Missing condition variable or signal.
- Fix: Use `pthread_cond_wait`/`pthread_cond_signal`.
Learning Milestones
- Threads share memory (see same variable) → You understand thread vs process
- Race condition breaks count → You understand critical sections
- Mutex fixes the race → You understand synchronization
- Parallel speedup measurable → You understand concurrency vs parallelism
Implementation Hints
#include <stdio.h>
#include <stdlib.h>
#include <string.h> // for strcmp() on argv
#include <pthread.h>
#define NUM_THREADS 4
#define NUM_INCREMENTS 1000000
long shared_counter = 0;
pthread_mutex_t counter_mutex = PTHREAD_MUTEX_INITIALIZER;
int use_mutex = 1;
void* increment_counter(void* arg) {
int thread_id = *(int*)arg;
for (int i = 0; i < NUM_INCREMENTS; i++) {
if (use_mutex) {
pthread_mutex_lock(&counter_mutex);
shared_counter++;
pthread_mutex_unlock(&counter_mutex);
} else {
shared_counter++; // Race condition!
}
}
printf("Thread %d finished\n", thread_id);
return NULL;
}
int main(int argc, char* argv[]) {
if (argc > 1 && strcmp(argv[1], "--no-lock") == 0) {
use_mutex = 0;
printf("Running WITHOUT mutex (expect race condition)\n");
} else {
printf("Running WITH mutex\n");
}
pthread_t threads[NUM_THREADS];
int thread_ids[NUM_THREADS];
for (int i = 0; i < NUM_THREADS; i++) {
thread_ids[i] = i;
pthread_create(&threads[i], NULL, increment_counter, &thread_ids[i]);
}
for (int i = 0; i < NUM_THREADS; i++) {
pthread_join(threads[i], NULL);
}
long expected = (long)NUM_THREADS * NUM_INCREMENTS;
printf("Expected: %ld\n", expected);
printf("Actual: %ld\n", shared_counter);
printf("Lost: %ld\n", expected - shared_counter);
return 0;
}
Compile with: gcc -pthread -o task_executor task_executor.c
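The hints above mention a condition variable for the empty-queue case; here is a minimal sketch of that pattern (fixed-size ring buffer; `task_t` is a hypothetical task type, and the full-queue case is ignored for brevity):
#include <pthread.h>
#define QUEUE_CAP 128
typedef struct { void (*run)(void*); void* arg; } task_t;  // hypothetical
typedef struct {
    task_t items[QUEUE_CAP];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty;
} task_queue_t;
void queue_push(task_queue_t* q, task_t t) {
    pthread_mutex_lock(&q->lock);
    q->items[q->tail] = t;               // assumes the queue is not full
    q->tail = (q->tail + 1) % QUEUE_CAP;
    q->count++;
    pthread_cond_signal(&q->not_empty);  // wake one sleeping worker
    pthread_mutex_unlock(&q->lock);
}
task_t queue_pop(task_queue_t* q) {
    pthread_mutex_lock(&q->lock);
    while (q->count == 0) {              // loop: wakeups can be spurious
        pthread_cond_wait(&q->not_empty, &q->lock);
    }
    task_t t = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    pthread_mutex_unlock(&q->lock);
    return t;
}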
Key pthread Functions to Master
| Function | Purpose |
|---|---|
| `pthread_create()` | Create a new thread |
| `pthread_join()` | Wait for a thread to terminate |
| `pthread_mutex_init()` | Initialize a mutex |
| `pthread_mutex_lock()` | Acquire the lock (blocks if held) |
| `pthread_mutex_unlock()` | Release the lock |
| `pthread_mutex_destroy()` | Clean up mutex resources |
Project 3: “Build Your Own Mutex & Semaphore” — Atomic CAS and Futex Implementation
| Attribute | Value |
|---|---|
| Main Programming Language | C |
| Alternative Programming Languages | Rust, C++ |
| Coolness Level | Level 4: Hardcore Tech Flex |
| Business Potential | 1. The “Resume Gold” |
| Difficulty | Level 4: Expert |
| Knowledge Area | Synchronization Primitives / Atomics |
| Software or Tool | Atomic operations, futex |
| Main Book | The Linux Programming Interface by Michael Kerrisk |
What you’ll build
Your own mutex and semaphore from scratch using atomic compare-and-swap (CAS) operations, spinlocks, and optionally the Linux futex syscall for sleeping.
Why it teaches synchronization deeply
Most people use pthread_mutex_lock() as a black box. By building your own, you’ll understand:
- What is a spinlock?
- Why does spinning waste CPU?
- What is a futex and why does it exist?
- What’s the difference between a mutex and a semaphore at the implementation level?
Core challenges you’ll face
- Atomic operations (why a normal `counter++` isn’t thread-safe) → maps to memory model
- Spinlock implementation (busy-wait CAS loop) → maps to lock-free programming
- Futex for sleeping (avoid wasting CPU spinning) → maps to kernel integration
- Semaphore counting (allow N concurrent accessors) → maps to resource pools
Resources for Key Challenges
- POSIX semaphores (`sem_init`): https://man7.org/linux/man-pages/man3/sem_init.3.html
- Linux futex syscall: https://man7.org/linux/man-pages/man2/futex.2.html
- C11 atomics overview: https://en.cppreference.com/w/c/atomic
Key Concepts
| Concept | Resource |
|---|---|
| Atomic Operations | C++ Concurrency in Action Chapter 5 - Anthony Williams |
| Futex Deep Dive | The Linux Programming Interface Chapter 30 - Kerrisk |
| Memory Barriers | Rust Atomics and Locks by Mara Bos (excellent even for C programmers) |
Difficulty & Time
- Difficulty: Advanced
- Time estimate: 1-2 weeks
- Prerequisites: Projects 1-2 completed, understanding of assembly basics
Real World Outcome
$ ./my_sync_test
=== Testing MY_MUTEX (spinlock-based) ===
Spawning 8 threads, each incrementing counter 100000 times
No mutex: counter = 634982 (WRONG - race condition)
pthread_mutex: counter = 800000 (correct)
my_mutex: counter = 800000 (correct!) ✓
=== Testing MY_SEMAPHORE ===
Semaphore initialized with count = 3
Thread 0: acquired (active: 1)
Thread 1: acquired (active: 2)
Thread 2: acquired (active: 3)
Thread 3: waiting... (sem count = 0)
Thread 0: released
Thread 3: acquired (active: 3)
...
All threads completed, max concurrent = 3 ✓
=== Spinlock vs Futex comparison ===
Spinlock (high contention): 2.3s, CPU 780%
Futex-based (high contention): 0.4s, CPU 95%
Conclusion: Futex avoids wasting CPU when waiting!
The Core Question You’re Answering
How do you build synchronization primitives from atomic instructions and avoid wasting CPU under contention?
Concepts You Must Understand First
- Compare-and-swap (CAS) and atomic memory ordering
- Spinlocks vs blocking locks
- Futex wait/wake model (user space fast path, kernel slow path)
- Counting semaphore semantics
Questions to Guide Your Design
- When should a lock spin vs sleep?
- How do you prevent starvation?
- What invariants must a semaphore maintain?
Thinking Exercise
Estimate how much CPU time a spinlock wastes if 8 threads compete on a 4-core CPU and each holds the lock for 2 ms.
The Interview Questions They’ll Ask
- What is a futex and why does it exist?
- How does a semaphore differ from a mutex?
- What is the ABA problem and how can it appear in lock-free code?
- Why do memory barriers matter for CAS loops?
Hints in Layers
- Implement a basic spinlock with C11 atomics.
- Add a contention benchmark and measure CPU usage.
- Replace spin-wait with futex-based sleeping.
- Implement a counting semaphore and test max concurrency.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Atomics | Rust Atomics and Locks | Ch. 1-4 |
| Concurrency primitives | The Linux Programming Interface | Ch. 30 |
| Memory model | Effective C | Ch. 8 |
Common Pitfalls & Debugging
Problem: “Lock sometimes lets two threads in”
- Why: Missing atomic compare-and-swap.
- Fix: Use `atomic_compare_exchange_*` with the correct memory order.
Problem: “Semaphore count goes negative”
- Why: Non-atomic decrement or missing wake logic.
- Fix: Protect count with atomic ops and proper wait queue.
Learning Milestones
- Spinlock works → You understand CAS and atomic operations
- You see ~800% CPU usage with a spinlock → You understand why sleeping matters
- Futex reduces CPU usage → You understand kernel/user-space cooperation
- Semaphore limits concurrency to N → You understand counting semaphores
Implementation Hints
Spinlock Mutex
#include <stdatomic.h>
#include <stdbool.h>
typedef struct {
atomic_flag locked;
} my_spinlock_t;
void my_spinlock_init(my_spinlock_t* lock) {
atomic_flag_clear(&lock->locked);
}
void my_spinlock_lock(my_spinlock_t* lock) {
// Spin until we acquire the lock
while (atomic_flag_test_and_set(&lock->locked)) {
// Busy-wait (wastes CPU!)
// Could add: __builtin_ia32_pause() for x86
}
}
void my_spinlock_unlock(my_spinlock_t* lock) {
atomic_flag_clear(&lock->locked);
}
Futex-based Mutex (Linux)
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h> // for syscall()
#include <stdatomic.h>
typedef struct {
atomic_int state; // 0 = unlocked, 1 = locked
} my_futex_mutex_t;
void my_futex_lock(my_futex_mutex_t* m) {
int expected = 0;
while (!atomic_compare_exchange_weak(&m->state, &expected, 1)) {
// Failed to acquire - sleep until state changes
syscall(SYS_futex, &m->state, FUTEX_WAIT, 1, NULL, NULL, 0);
expected = 0; // Reset for next try
}
}
void my_futex_unlock(my_futex_mutex_t* m) {
atomic_store(&m->state, 0);
// Wake one waiting thread. (A production mutex tracks whether anyone is
// waiting so the uncontended unlock can skip this syscall.)
syscall(SYS_futex, &m->state, FUTEX_WAKE, 1, NULL, NULL, 0);
}
Counting Semaphore
typedef struct {
    int count;               // protected by mutex, so a plain int suffices
    pthread_mutex_t mutex;
    pthread_cond_t cond;
} my_semaphore_t;
void my_sem_init(my_semaphore_t* sem, int initial) {
    sem->count = initial;
    pthread_mutex_init(&sem->mutex, NULL);
    pthread_cond_init(&sem->cond, NULL);
}
void my_sem_wait(my_semaphore_t* sem) {
    pthread_mutex_lock(&sem->mutex);
    while (sem->count <= 0) {
        pthread_cond_wait(&sem->cond, &sem->mutex);
    }
    sem->count--;
    pthread_mutex_unlock(&sem->mutex);
}
void my_sem_post(my_semaphore_t* sem) {
    pthread_mutex_lock(&sem->mutex);
    sem->count++;
    pthread_cond_signal(&sem->cond);
    pthread_mutex_unlock(&sem->mutex);
}
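A quick usage sketch matching the demo output above (N threads, at most 3 inside at once; `do_limited_work` is a hypothetical placeholder):
my_semaphore_t sem;  // in main: my_sem_init(&sem, 3);
void* worker(void* arg) {
    my_sem_wait(&sem);   // blocks once 3 threads are already inside
    do_limited_work();   // at most 3 threads execute this concurrently
    my_sem_post(&sem);   // let the next waiter in
    return NULL;
}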
Key Atomic Operations to Master
| Operation | Purpose |
|---|---|
| `atomic_flag_test_and_set()` | Atomically set the flag and return its old value |
| `atomic_compare_exchange_weak()` | CAS - the building block of lock-free code |
| `atomic_fetch_add()` | Atomically add and return the old value |
| `atomic_load()` / `atomic_store()` | Thread-safe read/write |
Project 4: “Green Thread Library” — User-Space Scheduler with Context Switching
| Attribute | Value |
|---|---|
| Main Programming Language | C |
| Alternative Programming Languages | Rust, Zig |
| Coolness Level | Level 5: Pure Magic (Super Cool) |
| Business Potential | 1. The “Resume Gold” |
| Difficulty | Level 5: Master |
| Knowledge Area | Runtime Systems / Cooperative Scheduling |
| Software or Tool | ucontext.h, setjmp/longjmp, or raw assembly |
| Main Book | Operating Systems: Three Easy Pieces by Remzi H. Arpaci-Dusseau |
What you’ll build
A complete green thread (user-space thread) library that manages its own stack allocations, implements cooperative scheduling (yield), and multiplexes many green threads onto a single OS thread.
Why it teaches threading deeply
This is where you become the operating system. You’ll allocate stacks, save/restore registers, and implement yield(). After this project, you’ll understand exactly what Go’s goroutines and Rust’s async/await are doing under the hood.
Core challenges you’ll face
- Stack allocation (each green thread needs its own stack) → maps to memory layout
- Context switching (save registers, swap stack pointer) → maps to CPU state
- Scheduler implementation (round-robin, run queue) → maps to scheduling algorithms
- Cooperative vs preemptive (explicit yield vs timer interrupt) → maps to scheduling models
Resources for Key Challenges
- Green threads explained (c9x.me) - Hands-on implementation guide for x86-64
- Wikipedia: Green thread - Conceptual overview
Key Concepts
| Concept | Resource |
|---|---|
| Context Switching | Operating Systems: Three Easy Pieces Chapter 6 - OSTEP |
| Stack Management | Computer Systems: A Programmer’s Perspective Chapter 3 - Bryant & O’Hallaron |
| ucontext API | man pages + GNU C Library docs |
Difficulty & Time
- Difficulty: Advanced-Expert
- Time estimate: 2-3 weeks
- Prerequisites: Projects 1-3 completed, some assembly knowledge
Real World Outcome
$ ./green_threads_demo
=== Green Thread Library Demo ===
Creating 1000 green threads on 1 OS thread...
Green thread 0: Hello! (stack at 0x7f1234560000)
Green thread 0: Yielding...
Green thread 1: Hello! (stack at 0x7f1234570000)
Green thread 1: Yielding...
Green thread 2: Hello! (stack at 0x7f1234580000)
...
Green thread 0: Resumed!
Green thread 0: Done!
...
Stats:
- Total green threads created: 1000
- OS threads used: 1
- Context switches performed: 5000
- Average switch time: 45 nanoseconds
For comparison:
- 1000 OS threads would use ~8GB stack memory
- 1000 green threads used ~8MB stack memory
The Core Question You’re Answering
How do you schedule thousands of user-space tasks without relying on the kernel for every context switch?
Concepts You Must Understand First
- Stack allocation per green thread
- Context switching (`setjmp`/`longjmp` or `ucontext`)
- Ready/blocked queues
- Cooperative vs preemptive scheduling
Questions to Guide Your Design
- How will a thread yield control?
- What happens when a green thread makes a blocking syscall?
- How will you detect stack overflow?
Thinking Exercise
Design a tiny state machine for a thread: NEW → READY → RUNNING → BLOCKED → READY → DONE. Where do you save registers?
The Interview Questions They’ll Ask
- What is the difference between green threads and kernel threads?
- Why can green threads be faster but less safe?
- What happens if a green thread blocks in a syscall?
- How would you add a timer-based preemption?
Hints in Layers
- Use `ucontext` to switch stacks.
- Implement round-robin scheduling across a run queue.
- Add `yield()` and `sleep()` APIs.
- Add per-thread stacks and guard pages.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Context switching | Computer Systems: A Programmer’s Perspective | Ch. 8 |
| Scheduling | Operating Systems: Three Easy Pieces | Ch. 7-9 |
| Low-level C | Bare Metal C | Ch. 3-4 |
Common Pitfalls & Debugging
Problem: “Threads crash randomly”
- Why: Stack overflow or misaligned stack.
- Fix: Add guard pages and align stacks to 16 bytes.
Problem: “All threads freeze”
- Why: One blocking syscall stalls the whole scheduler.
- Fix: Replace blocking calls with non-blocking I/O or run in a kernel thread.
Learning Milestones
- Single green thread runs to completion → You understand stack allocation
- Two threads yield to each other → You understand context switching
- Round-robin scheduler works → You understand scheduling
- 1000 threads work smoothly → You understand why green threads scale
Implementation Hints
Using ucontext.h (simpler approach)
#include <ucontext.h>
#include <stdio.h>
#define STACK_SIZE 16384
#define MAX_THREADS 64
typedef struct {
    ucontext_t context;
    char stack[STACK_SIZE];
    int id;
    int finished;
} green_thread_t;
green_thread_t threads[MAX_THREADS];
int current_thread = 0;
int num_threads = 0;
ucontext_t main_context;
void gt_yield() {
    int prev = current_thread;
    // Find the next runnable thread (round-robin)
    do {
        current_thread = (current_thread + 1) % num_threads;
    } while (threads[current_thread].finished && current_thread != prev);
    if (current_thread != prev) {
        swapcontext(&threads[prev].context, &threads[current_thread].context);
    }
}
// makecontext passes int-sized arguments, so each thread receives its small
// integer id (passing a pointer through the varargs is not portable).
void gt_create(void (*func)(int)) {
    green_thread_t* t = &threads[num_threads];
    t->id = num_threads;
    t->finished = 0;
    getcontext(&t->context);
    t->context.uc_stack.ss_sp = t->stack;
    t->context.uc_stack.ss_size = STACK_SIZE;
    t->context.uc_link = &main_context;  // return to main when func returns
    makecontext(&t->context, (void (*)(void))func, 1, t->id);
    num_threads++;
}
void thread_func(int id) {
    for (int i = 0; i < 3; i++) {
        printf("Thread %d: iteration %d\n", id, i);
        gt_yield();
    }
    printf("Thread %d: done!\n", id);
    threads[id].finished = 1;
}
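A minimal driver for the skeleton above: create a few threads, then keep resuming unfinished ones from main until all are done.
int main(void) {
    for (int i = 0; i < 3; i++) {
        gt_create(thread_func);
    }
    int remaining = num_threads;
    while (remaining > 0) {
        remaining = 0;
        for (int i = 0; i < num_threads; i++) {
            if (!threads[i].finished) {
                remaining++;
                current_thread = i;
                // Resume thread i. Control returns to main when some thread
                // in the ensuing yield chain finishes (its uc_link points
                // at main_context).
                swapcontext(&main_context, &threads[i].context);
            }
        }
    }
    printf("All green threads finished\n");
    return 0;
}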
Raw Assembly Context Switch (x86-64)
// Context: saved registers
typedef struct {
uint64_t rsp;
uint64_t r15;
uint64_t r14;
uint64_t r13;
uint64_t r12;
uint64_t rbx;
uint64_t rbp;
} context_t;
// Assembly function to switch contexts
// Saves current registers, restores target's registers
extern void context_switch(context_t* old, context_t* new);
; context_switch.s (NASM syntax)
global context_switch
context_switch:
; Save old context (rdi = old)
mov [rdi + 0], rsp
mov [rdi + 8], r15
mov [rdi + 16], r14
mov [rdi + 24], r13
mov [rdi + 32], r12
mov [rdi + 40], rbx
mov [rdi + 48], rbp
; Restore new context (rsi = new)
mov rsp, [rsi + 0]
mov r15, [rsi + 8]
mov r14, [rsi + 16]
mov r13, [rsi + 24]
mov r12, [rsi + 32]
mov rbx, [rsi + 40]
mov rbp, [rsi + 48]
ret
Green Thread vs OS Thread Comparison
| Aspect | OS Thread (kernel) | Green Thread (user) |
|---|---|---|
| Created by | `pthread_create()` | Your library |
| Scheduled by | OS kernel | Your scheduler |
| Context switch | ~1-10 μs (syscall) | ~50-100 ns (function call) |
| Stack size | 1-8 MB default | You decide (8KB typical) |
| 10,000 threads | 10-80 GB memory | 80 MB memory |
| Blocking syscall | Only that thread blocks | ALL green threads block! |
Project 5: “Dining Philosophers Visualizer” — Deadlock Detection and Prevention
| Attribute | Value |
|---|---|
| Main Programming Language | C |
| Alternative Programming Languages | Rust, Python, Go |
| Coolness Level | Level 3: Genuinely Clever |
| Business Potential | 1. The “Resume Gold” |
| Difficulty | Level 3: Advanced |
| Knowledge Area | Deadlock / Synchronization |
| Software or Tool | ncurses or terminal graphics |
| Main Book | Operating Systems: Three Easy Pieces by Remzi H. Arpaci-Dusseau |
What you’ll build
An animated terminal visualization of the Dining Philosophers problem, where you can toggle between naive (deadlock-prone), ordered locking, and semaphore-based solutions, watching deadlocks happen in real-time.
Why it teaches deadlocks
Deadlock is abstract until you see it. Watching 5 philosophers all grab their left fork and freeze forever makes the concept visceral. Then implementing fixes (resource ordering, arbitrator semaphore) teaches how to prevent them.
Core challenges you’ll face
- Deadlock creation (circular wait) → maps to Coffman conditions
- Deadlock prevention (resource ordering) → maps to deadlock avoidance
- Livelock (philosophers keep retrying) → maps to liveness properties
- Starvation (one philosopher never eats) → maps to fairness
Key Concepts
| Concept | Resource |
|---|---|
| Dining Philosophers Problem | Operating Systems: Three Easy Pieces Chapter 31 - OSTEP |
| Deadlock Conditions | Operating System Concepts Chapter 7 - Silberschatz |
| Semaphore API | POSIX sem_init — https://man7.org/linux/man-pages/man3/sem_init.3.html |
Difficulty & Time
- Difficulty: Intermediate
- Time estimate: 1 week
- Prerequisites: Project 2-3 completed
Real World Outcome
$ ./dining_philosophers --mode naive
🍝 Dining Philosophers 🍝
P0 P1
🧠 🧠
🍴 🍴 🍴 🍴
\ /
\ /
P4 🧠--🍽️--🧠 P2
/ \
/ \
🍴 🍴
🧠
P3
Status: [Mode: NAIVE - Deadlock prone!]
P0: 🍴 HOLDING left fork, WAITING for right
P1: 🍴 HOLDING left fork, WAITING for right
P2: 🍴 HOLDING left fork, WAITING for right
P3: 🍴 HOLDING left fork, WAITING for right
P4: 🍴 HOLDING left fork, WAITING for right
⚠️ DEADLOCK DETECTED! All philosophers stuck!
Press 's' for semaphore solution, 'o' for ordered locking...
The Core Question You’re Answering
Why do deadlocks happen, and which design rules eliminate them without destroying concurrency?
Concepts You Must Understand First
- Coffman deadlock conditions
- Resource ordering vs arbitrator semaphore
- Starvation vs fairness
- Detecting cycles in a resource graph
Questions to Guide Your Design
- Which condition will you break to prevent deadlock?
- How will you prove that no circular wait exists?
- How will you measure starvation?
Thinking Exercise
If you force philosophers to pick up the lower-numbered fork first, which Coffman condition is removed?
The Interview Questions They’ll Ask
- What are the four Coffman conditions?
- What is the difference between deadlock and livelock?
- How can you detect deadlock at runtime?
- What trade-offs do you make when enforcing ordering?
Hints in Layers
- Implement the naive solution and observe deadlock.
- Add ordered locking (lower fork first).
- Add an arbitrator semaphore (max 4 eaters).
- Add a deadlock detector that spots cycles.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Deadlocks | Operating Systems: Three Easy Pieces | Ch. 31 |
| Synchronization | Operating System Concepts | Ch. 7 |
| Semaphores | The Linux Programming Interface | Ch. 30 |
Common Pitfalls & Debugging
Problem: “No deadlock but starvation exists”
- Why: One philosopher always loses the race.
- Fix: Add fairness (queue, randomized backoff).
Problem: “Visualizer freezes”
- Why: Rendering holds a lock too long.
- Fix: Render from a snapshot instead of holding locks.
Learning Milestones
- Deadlock happens visually → You understand circular wait
- Resource ordering prevents deadlock → You understand prevention
- Semaphore (max 4 eating) works → You understand limiting concurrency
- You can explain all 4 Coffman conditions → Mastery
The Four Coffman Conditions for Deadlock
All four must be present for deadlock to occur:
- Mutual Exclusion: Resources cannot be shared (fork can only be held by one philosopher)
- Hold and Wait: Process holds one resource while waiting for another
- No Preemption: Resources cannot be forcibly taken away
- Circular Wait: P0 waits for P1, P1 waits for P2, … P4 waits for P0
Implementation Hints
Naive Solution (causes deadlock)
void philosopher_naive(int id) {
int left = id;
int right = (id + 1) % NUM_PHILOSOPHERS;
while (1) {
think(id);
pthread_mutex_lock(&forks[left]); // Grab left fork
printf("P%d: got left fork\n", id);
usleep(1000); // Small delay makes deadlock more likely
pthread_mutex_lock(&forks[right]); // Grab right fork (DEADLOCK!)
printf("P%d: got right fork\n", id);
eat(id);
pthread_mutex_unlock(&forks[right]);
pthread_mutex_unlock(&forks[left]);
}
}
Solution 1: Resource Ordering
void philosopher_ordered(int id) {
int left = id;
int right = (id + 1) % NUM_PHILOSOPHERS;
// Always lock lower-numbered fork first
int first = (left < right) ? left : right;
int second = (left < right) ? right : left;
while (1) {
think(id);
pthread_mutex_lock(&forks[first]);
pthread_mutex_lock(&forks[second]);
eat(id);
pthread_mutex_unlock(&forks[second]);
pthread_mutex_unlock(&forks[first]);
}
}
Solution 2: Semaphore (limit to N-1)
sem_t table_seats; // Initialized to NUM_PHILOSOPHERS - 1
void philosopher_semaphore(int id) {
int left = id;
int right = (id + 1) % NUM_PHILOSOPHERS;
while (1) {
think(id);
sem_wait(&table_seats); // Only N-1 can try to eat at once
pthread_mutex_lock(&forks[left]);
pthread_mutex_lock(&forks[right]);
eat(id);
pthread_mutex_unlock(&forks[right]);
pthread_mutex_unlock(&forks[left]);
sem_post(&table_seats);
}
}
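A fourth variant worth trying, tied to the livelock challenge above: take the left fork, but only try the right one and back off on failure. A minimal sketch assuming the same `forks` array and helpers (needs `stdlib.h`/`unistd.h` for `rand`/`usleep`):
void philosopher_trylock(int id) {
    int left = id;
    int right = (id + 1) % NUM_PHILOSOPHERS;
    while (1) {
        think(id);
        pthread_mutex_lock(&forks[left]);
        if (pthread_mutex_trylock(&forks[right]) != 0) {
            // Right fork busy: put the left one down and retry later.
            // The randomized sleep breaks the symmetry that causes livelock.
            pthread_mutex_unlock(&forks[left]);
            usleep(rand() % 1000);
            continue;
        }
        eat(id);
        pthread_mutex_unlock(&forks[right]);
        pthread_mutex_unlock(&forks[left]);
    }
}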
Project 6: “Thread Pool with Work Stealing” — Lock-Free Deque Load Balancer
| Attribute | Value |
|---|---|
| Main Programming Language | C |
| Alternative Programming Languages | Rust, C++, Go |
| Coolness Level | Level 4: Hardcore Tech Flex |
| Business Potential | 3. The “Service & Support” Model |
| Difficulty | Level 4: Expert |
| Knowledge Area | Parallel Computing / Load Balancing |
| Software or Tool | Lock-free queues, work stealing deques |
| Main Book | C++ Concurrency in Action by Anthony Williams |
What you’ll build
A production-quality thread pool where worker threads can “steal” tasks from each other’s queues when idle, maximizing CPU utilization and reducing latency.
Why it teaches advanced concurrency
Real systems use thread pools, not raw threads. Work stealing is how Go, Tokio (Rust), and Java’s ForkJoin work. Building this teaches you lock-free data structures, cache locality, and real-world parallel performance.
Core challenges you’ll face
- Thread-safe queue (multiple threads push/pop) → maps to concurrent data structures
- Work stealing deque (steal from back, pop from front) → maps to load balancing
- Lock-free implementation (CAS-based operations) → maps to lock-free programming
- Cache-friendly design (avoid false sharing) → maps to performance optimization
Key Concepts
| Concept | Resource |
|---|---|
| Thread Pools | C++ Concurrency in Action Chapter 9 - Anthony Williams |
| Work Stealing | “Scheduling Multithreaded Computations by Work Stealing” - Blumofe & Leiserson (paper) |
| Lock-Free Queues | Rust Atomics and Locks Chapter 6 - Mara Bos |
Difficulty & Time
- Difficulty: Expert
- Time estimate: 2-3 weeks
- Prerequisites: Projects 1-4 completed
Real World Outcome
$ ./thread_pool_bench --workers 8 --tasks 100000
=== Thread Pool Benchmark ===
[Naive approach: spawn thread per task]
Time: 45.2s
Threads created: 100000
Context switches: 2,340,000
[Thread pool with single queue]
Time: 2.1s
Threads created: 8
Context switches: 12,000
Queue contention: HIGH (single lock)
[Thread pool with work stealing]
Time: 0.8s
Threads created: 8
Context switches: 9,000
Load balance: 12.5% per worker (perfect!)
Steals performed: 3,421
Work stealing is 56x faster than naive, 2.6x faster than single queue!
The Core Question You’re Answering
How do you build a high-throughput worker system that balances load without global contention?
Concepts You Must Understand First
- Thread pool lifecycle and graceful shutdown
- Condition variables for work notification
- Work-stealing deques and ABA hazards
- Cache locality and false sharing
Questions to Guide Your Design
- When should a worker sleep vs spin?
- How will you prevent idle threads from burning CPU?
- How will you measure imbalance across workers?
Thinking Exercise
You have 8 workers and 1,000 tasks with uneven durations. Sketch how a single queue vs work stealing impacts total runtime.
The Interview Questions They’ll Ask
- Why is a thread pool better than spawning a thread per task?
- What is work stealing and why does it help?
- How do you avoid contention on the global queue?
- What is the cost of false sharing?
Hints in Layers
- Build a single-queue pool with mutex + condition variable.
- Add per-worker queues and local pops.
- Add stealing from the tail of other deques.
- Add metrics: steals, idle time, throughput.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Thread pools | C++ Concurrency in Action | Ch. 9 |
| Atomics | Rust Atomics and Locks | Ch. 6 |
| Performance | Computer Systems: A Programmer’s Perspective | Ch. 5 |
Common Pitfalls & Debugging
Problem: “Workers sleep forever”
- Why: Missed condition variable signal.
- Fix: Signal/broadcast after enqueue.
Problem: “Steals return NULL too often”
- Why: Incorrect memory ordering or ABA race.
- Fix: Add acquire/release fences and version counters.
Learning Milestones
- Basic thread pool works → You understand pooling
- Per-worker queues reduce contention → You understand lock contention
- Work stealing balances load → You understand dynamic scheduling
- Benchmark shows real speedup → You understand parallel efficiency
Architecture
┌─────────────────────────────────────────┐
│ Thread Pool │
└─────────────────────────────────────────┘
│
┌───────────────────────┼───────────────────────┐
│ │ │
┌─────▼─────┐ ┌─────▼─────┐ ┌─────▼─────┐
│ Worker 0 │ │ Worker 1 │ │ Worker 2 │
│ │ │ │ │ │
│ ┌────────┐ │ │ ┌────────┐ │ │ ┌────────┐ │
│ │ Deque │ │◄─STEAL───│ │ Deque │ │───STEAL─►│ │ Deque │ │
│ │ │ │ │ │ T T T │ │ │ │ │ │
│ └────────┘ │ │ └────────┘ │ │ └────────┘ │
└────────────┘ └────────────┘ └────────────┘
│ │ │
│ Pop from front │ Steal from back │
▼ ▼ ▼
[Execute] [Execute] [Execute]
Implementation Hints
Work-Stealing Deque (Chase-Lev)
#include <stdatomic.h>
#define DEQUE_SIZE 1024
typedef struct {
atomic_size_t top; // Thieves steal from here
atomic_size_t bottom; // Owner pushes/pops here
void* buffer[DEQUE_SIZE];
} work_stealing_deque_t;
// Owner: push task (only owner calls this)
void deque_push(work_stealing_deque_t* d, void* task) {
size_t b = atomic_load(&d->bottom);
d->buffer[b % DEQUE_SIZE] = task;
atomic_thread_fence(memory_order_release);
atomic_store(&d->bottom, b + 1);
}
// Owner: pop task (only owner calls this)
void* deque_pop(work_stealing_deque_t* d) {
size_t b = atomic_load(&d->bottom) - 1;
atomic_store(&d->bottom, b);
atomic_thread_fence(memory_order_seq_cst);
size_t t = atomic_load(&d->top);
if (t <= b) {
void* task = d->buffer[b % DEQUE_SIZE];
if (t == b) {
// Last element - race with stealers
if (!atomic_compare_exchange_strong(&d->top, &t, t + 1)) {
task = NULL; // Lost race
}
atomic_store(&d->bottom, b + 1);
}
return task;
}
atomic_store(&d->bottom, b + 1);
return NULL;
}
// Thief: steal task (other workers call this)
void* deque_steal(work_stealing_deque_t* d) {
size_t t = atomic_load(&d->top);
atomic_thread_fence(memory_order_seq_cst);
size_t b = atomic_load(&d->bottom);
if (t < b) {
void* task = d->buffer[t % DEQUE_SIZE];
if (!atomic_compare_exchange_strong(&d->top, &t, t + 1)) {
return NULL; // Lost race with another stealer
}
return task;
}
return NULL;
}
Thread Pool Structure
typedef struct {
pthread_t* threads;
work_stealing_deque_t* deques; // One per worker
int num_workers;
atomic_bool shutdown;
} thread_pool_t;
// Per-worker argument passed to worker_loop
typedef struct {
thread_pool_t* pool;
int id; // this worker's index into deques[]
} worker_context_t;
void* worker_loop(void* arg) {
worker_context_t* ctx = arg;
while (!atomic_load(&ctx->pool->shutdown)) {
// Try to get task from own deque
void* task = deque_pop(&ctx->pool->deques[ctx->id]);
if (!task) {
// Own deque empty - try to steal from others
for (int i = 0; i < ctx->pool->num_workers; i++) {
if (i != ctx->id) {
task = deque_steal(&ctx->pool->deques[i]);
if (task) break;
}
}
}
if (task) {
execute_task(task);
} else {
// No work anywhere - yield CPU
sched_yield();
}
}
return NULL;
}
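Submission can stay simple at first: a sketch of a hypothetical `pool_submit` that round-robins tasks across worker deques. One caveat: a Chase-Lev deque is single-producer, so pushing from an arbitrary submitter thread is only safe before the workers start (or from the owning worker itself); a production design adds a multi-producer inbox per worker.
static atomic_int next_worker = 0;
void pool_submit(thread_pool_t* pool, void* task) {
    // Round-robin placement spreads the initial load evenly
    int w = atomic_fetch_add(&next_worker, 1) % pool->num_workers;
    deque_push(&pool->deques[w], task);
}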
Project 7: “Concurrent Hash Map” — Lock Striping and RCU Implementation
| Attribute | Value |
|---|---|
| Main Programming Language | C |
| Alternative Programming Languages | Rust, C++, Java |
| Coolness Level | Level 4: Hardcore Tech Flex |
| Business Potential | 4. The “Open Core” Infrastructure |
| Difficulty | Level 4: Expert |
| Knowledge Area | Concurrent Data Structures |
| Software or Tool | Lock striping, RCU |
| Main Book | C++ Concurrency in Action by Anthony Williams |
What you’ll build
A high-performance concurrent hash map that supports concurrent reads and writes from multiple threads, using techniques like lock striping, read-write locks, and optionally RCU (Read-Copy-Update).
Why it teaches real-world concurrency
Every serious system needs concurrent data structures. Building a concurrent hash map teaches you the trade-offs between correctness, performance, and complexity. You’ll understand why Java’s ConcurrentHashMap and Go’s sync.Map exist.
Core challenges you’ll face
- Lock striping (lock per bucket, not whole table) → maps to fine-grained locking
- Read-write locks (many readers, one writer) → maps to reader/writer problem
- Resize under contention (grow table without blocking readers) → maps to online operations
- RCU pattern (readers never block) → maps to lock-free reads
Key Concepts
| Concept | Resource |
|---|---|
| Lock Striping | C++ Concurrency in Action Chapter 6 - Anthony Williams |
| Read-Write Locks | The Linux Programming Interface Chapter 30 - Kerrisk |
| RCU Basics | Linux kernel documentation — https://docs.kernel.org/RCU/whatisRCU.html |
Difficulty & Time
- Difficulty: Expert
- Time estimate: 2-3 weeks
- Prerequisites: Projects 1-3 and basic hash table implementation
Real World Outcome
$ ./concurrent_map_bench --readers 6 --writers 2 --ops 1000000
=== Concurrent Hash Map Benchmark ===
[Single global lock]
Throughput: 234,000 ops/sec
Reader wait time: 89% (readers blocked by writers)
Writer wait time: 45%
[Lock striping (256 buckets)]
Throughput: 1,890,000 ops/sec (8x faster!)
Reader wait time: 12%
Writer wait time: 8%
[RW-lock per bucket]
Throughput: 2,340,000 ops/sec
Reader wait time: 3%
Writer wait time: 15%
[RCU-style (readers never block)]
Throughput: 4,120,000 ops/sec
Reader wait time: 0% (readers NEVER blocked!)
Writer wait time: 5%
Memory overhead: +15% (deferred reclamation)
Correctness check: All 1,000,000 operations verified ✓
The Core Question You’re Answering
How do you scale a shared map so reads stay fast as writer count grows?
Concepts You Must Understand First
- Lock striping and contention hotspots
- Read-write locks for read-heavy workloads
- RCU read-side critical sections
- Safe memory reclamation
Questions to Guide Your Design
- How will you resize without blocking readers?
- How will you prevent deadlocks when multiple locks are needed?
- What is your reclamation strategy for old buckets?
Thinking Exercise
If you have 256 buckets and 16 locks, how many buckets share a lock? How does that affect contention?
The Interview Questions They’ll Ask
- Why does lock striping improve throughput?
- What trade-offs do read-write locks introduce?
- What is RCU and why does it help read-heavy systems?
- How do you safely free old nodes in RCU?
Hints in Layers
- Start with a global mutex map.
- Add lock striping per bucket group.
- Add per-bucket RW locks.
- Prototype a simple RCU-style read path.
- Benchmark each version.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Concurrency | C++ Concurrency in Action | Ch. 6 |
| RW locks | The Linux Programming Interface | Ch. 30 |
| Data structures | Mastering Algorithms with C | Ch. 8 |
Common Pitfalls & Debugging
Problem: “Reads crash after resize”
- Why: Freed buckets still referenced by readers.
- Fix: Use epoch-based reclamation or grace periods.
Problem: “Deadlock during resize”
- Why: Locks acquired in inconsistent order.
- Fix: Impose a strict lock ordering.
Learning Milestones
- Single lock works but is slow → You understand coarse-grained locking
- Lock striping dramatically improves throughput → You understand fine-grained locking
- Read-write locks help read-heavy workloads → You understand reader/writer trade-offs
- RCU gives maximum read performance → You understand lock-free reads
Implementation Hints
Lock Striping
#define NUM_BUCKETS 256
#define NUM_LOCKS 16 // Fewer locks than buckets
typedef struct node {
char* key;
void* value;
struct node* next;
} node_t;
typedef struct {
node_t* buckets[NUM_BUCKETS];
pthread_mutex_t locks[NUM_LOCKS];
} concurrent_map_t;
static inline int get_lock_index(int bucket) {
return bucket % NUM_LOCKS; // Multiple buckets share a lock
}
void* map_get(concurrent_map_t* map, const char* key) {
int bucket = hash(key) % NUM_BUCKETS;
int lock_idx = get_lock_index(bucket);
pthread_mutex_lock(&map->locks[lock_idx]);
node_t* node = map->buckets[bucket];
while (node) {
if (strcmp(node->key, key) == 0) {
void* value = node->value;
pthread_mutex_unlock(&map->locks[lock_idx]);
return value;
}
node = node->next;
}
pthread_mutex_unlock(&map->locks[lock_idx]);
return NULL;
}
void map_put(concurrent_map_t* map, const char* key, void* value) {
int bucket = hash(key) % NUM_BUCKETS;
int lock_idx = get_lock_index(bucket);
pthread_mutex_lock(&map->locks[lock_idx]);
// Check if key exists
node_t* node = map->buckets[bucket];
while (node) {
if (strcmp(node->key, key) == 0) {
node->value = value; // Update
pthread_mutex_unlock(&map->locks[lock_idx]);
return;
}
node = node->next;
}
// Insert new node
node_t* new_node = malloc(sizeof(node_t));
new_node->key = strdup(key);
new_node->value = value;
new_node->next = map->buckets[bucket];
map->buckets[bucket] = new_node;
pthread_mutex_unlock(&map->locks[lock_idx]);
}
Read-Write Lock Version
typedef struct {
node_t* buckets[NUM_BUCKETS];
pthread_rwlock_t locks[NUM_LOCKS];
} rw_concurrent_map_t;
void* map_get(rw_concurrent_map_t* map, const char* key) {
int bucket = hash(key) % NUM_BUCKETS;
int lock_idx = get_lock_index(bucket);
pthread_rwlock_rdlock(&map->locks[lock_idx]); // Read lock
node_t* node = map->buckets[bucket];
void* value = NULL;
while (node) {
if (strcmp(node->key, key) == 0) {
value = node->value;
break;
}
node = node->next;
}
pthread_rwlock_unlock(&map->locks[lock_idx]);
return value;
}
void map_put(rw_concurrent_map_t* map, const char* key, void* value) {
int bucket = hash(key) % NUM_BUCKETS;
int lock_idx = get_lock_index(bucket);
pthread_rwlock_wrlock(&map->locks[lock_idx]); // Write lock
// ... insertion logic ...
pthread_rwlock_unlock(&map->locks[lock_idx]);
}
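For the RCU-style step, the core idea fits in a few lines: readers follow an atomic pointer without locking; a writer builds a new version, publishes it with one atomic store, and defers freeing the old version until no reader can still hold it. A minimal sketch assuming a whole-table swap, with `clone_table_with_update`, `synchronize_readers`, and `free_table` as hypothetical helpers:
#include <stdatomic.h>
typedef struct {
    node_t* buckets[NUM_BUCKETS];
} rcu_table_t;
_Atomic(rcu_table_t*) current_table;
// Reader: never blocks; just an acquire-load of the current version
void* rcu_get(const char* key) {
    rcu_table_t* t = atomic_load_explicit(&current_table, memory_order_acquire);
    node_t* node = t->buckets[hash(key) % NUM_BUCKETS];
    while (node) {
        if (strcmp(node->key, key) == 0) return node->value;
        node = node->next;
    }
    return NULL;
}
// Writer: copy, modify, publish, then reclaim after a grace period
void rcu_put(const char* key, void* value) {
    rcu_table_t* old = atomic_load(&current_table);
    rcu_table_t* fresh = clone_table_with_update(old, key, value);  // hypothetical
    atomic_store_explicit(&current_table, fresh, memory_order_release);
    synchronize_readers();  // hypothetical: wait until no reader holds 'old'
    free_table(old);        // hypothetical
}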
Concurrency Trade-offs
| Approach | Read Performance | Write Performance | Complexity | Memory |
|---|---|---|---|---|
| Single lock | Poor | Poor | Low | Low |
| Lock striping | Good | Good | Medium | Low |
| RW-lock per bucket | Excellent | Moderate | Medium | Low |
| RCU | Perfect (never blocks) | Moderate | High | Higher |
Project 8: “Mini Operating System Scheduler” — Capstone Bare-Metal Kernel
| Attribute | Value |
|---|---|
| Main Programming Language | C + Assembly |
| Alternative Programming Languages | Rust, Zig |
| Coolness Level | Level 5: Pure Magic (Super Cool) |
| Business Potential | 1. The “Resume Gold” |
| Difficulty | Level 5: Master |
| Knowledge Area | Operating Systems / Kernel Development |
| Software or Tool | QEMU, bare-metal x86 |
| Main Book | Operating Systems: Three Easy Pieces by Remzi H. Arpaci-Dusseau |
What you’ll build
A minimal bare-metal kernel (runs in QEMU) that implements preemptive multitasking with a timer interrupt, process/thread scheduling, and basic synchronization primitives—all running without Linux.
Why this is the ultimate test
Everything you learned culminates here. You ARE the operating system. There’s no pthread_create()—you implement context switching, timer interrupts, and scheduling from scratch on real (emulated) hardware.
Core challenges you’ll face
- Bootloader (get code running without an OS) → maps to bare-metal programming
- Timer interrupt (preemptive scheduling) → maps to interrupt handling
- Task switching (save/restore full CPU state) → maps to context switch
- Priority scheduler (multiple scheduling algorithms) → maps to scheduling policies
- Kernel-mode synchronization (spinlocks in kernel) → maps to OS internals
Key Concepts
| Concept | Resource |
|---|---|
| Bootloaders | Operating Systems: Three Easy Pieces Appendix - OSTEP |
| Interrupts | Computer Systems: A Programmer’s Perspective Chapter 8 - Bryant & O’Hallaron |
| Context Switching | Linux Kernel Development Chapter 4 - Robert Love |
| Scheduling | Operating Systems: Three Easy Pieces Chapters 7-9 - OSTEP |
Difficulty & Time
- Difficulty: Master
- Time estimate: 1-2 months
- Prerequisites: All previous projects completed, assembly knowledge
Real World Outcome
Booting mini-OS in QEMU...
[KERNEL] Timer interrupt handler installed
[KERNEL] Scheduler initialized (Round Robin)
[KERNEL] Creating Task 1: counter
[KERNEL] Creating Task 2: printer
[KERNEL] Creating Task 3: beeper
[Task 1] Counter: 0
[Task 2] Printing: Hello from Task 2!
[SCHEDULER] Timer tick! Switching 1 -> 2
[Task 2] Printing: Hello from Task 2!
[Task 3] BEEP!
[SCHEDULER] Timer tick! Switching 2 -> 3
[Task 3] BEEP!
[Task 1] Counter: 1
[SCHEDULER] Timer tick! Switching 3 -> 1
...
Press 'p' to show process table:
PID STATE PRIORITY TICKS NAME
1 RUNNING 5 42 counter
2 READY 5 38 printer
3 BLOCKED 5 25 beeper (waiting on semaphore)
Press 'd' to trigger deadlock demo...
The Core Question You’re Answering
How does a kernel preempt tasks, switch context, and enforce synchronization without help from user space?
Concepts You Must Understand First
- Timer interrupts and IDT/PIC setup
- Saving/restoring registers and stack
- Run queue and scheduler policy
- Spinlocks and interrupt masking
Questions to Guide Your Design
- When exactly do you save CPU state?
- How do you choose the next task to run?
- How do you prevent races inside the kernel?
Thinking Exercise
Assume a timer tick every 10 ms. If each task gets a 20 ms time slice, how many ticks before a context switch? Where does the counter live?
The Interview Questions They’ll Ask
- What is preemption and why is it necessary?
- How does a context switch save/restore state?
- Why must you disable interrupts around certain structures?
- What’s the difference between round-robin and priority scheduling?
Hints in Layers
- Boot into protected mode and print to VGA.
- Install timer interrupt and note tick rate.
- Implement a run queue and round-robin switching.
- Add spinlocks and semaphores in kernel space.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Scheduling | Operating Systems: Three Easy Pieces | Ch. 7-9 |
| Kernel internals | Linux Kernel Development | Ch. 4 |
| Interrupts | Computer Systems: A Programmer’s Perspective | Ch. 8 |
Common Pitfalls & Debugging
Problem: “Kernel crashes during switch”
- Why: Incorrect stack pointer or missing registers.
- Fix: Save/restore full CPU context and validate stack.
Problem: “Race in scheduler”
- Why: Timer interrupt fires mid-queue update.
- Fix: Disable interrupts during critical scheduler operations.
Learning Milestones
- Kernel boots and prints → You understand bare-metal
- Timer interrupt fires → You understand hardware interrupts
- Two tasks switch preemptively → You ARE the scheduler
- Semaphores block/wake tasks → You understand kernel synchronization
- Priority scheduling works → Operating systems are demystified
Architecture Overview
```
┌─────────────────────────────────────────────────────────────────┐
│                         Your Mini-OS                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   ┌──────────────┐  ┌──────────────┐  ┌──────────────┐          │
│   │   Task 1     │  │   Task 2     │  │   Task 3     │          │
│   │  (counter)   │  │  (printer)   │  │  (beeper)    │          │
│   └──────┬───────┘  └──────┬───────┘  └──────┬───────┘          │
│          │                 │                 │                  │
│          └────────────────┼─────────────────┘                   │
│                           │                                     │
│                     ┌─────▼─────┐                               │
│                     │ Scheduler │ ◄─── Timer Interrupt          │
│                     └─────┬─────┘                               │
│                           │                                     │
│          ┌────────────────┼────────────────┐                    │
│          │                │                │                    │
│   ┌──────▼──────┐  ┌──────▼──────┐  ┌──────▼──────┐             │
│   │    Mutex    │  │  Semaphore  │  │  Spinlock   │             │
│   └─────────────┘  └─────────────┘  └─────────────┘             │
│                                                                 │
├─────────────────────────────────────────────────────────────────┤
│  Interrupt Handlers │ Memory Management │ I/O (VGA, Keyboard)   │
├─────────────────────────────────────────────────────────────────┤
│                     Hardware (via QEMU)                         │
│        x86 CPU │ PIC │ PIT (Timer) │ VGA │ Keyboard             │
└─────────────────────────────────────────────────────────────────┘
```
Implementation Roadmap
Phase 1: Bootloader & Basic Kernel
```asm
; boot.asm - Bootloader (loaded by the BIOS at 0x7C00)
[BITS 16]
[ORG 0x7C00]

start:
    ; Set up segments
    xor ax, ax
    mov ds, ax
    mov es, ax
    mov ss, ax
    mov sp, 0x7C00

    ; Load kernel from disk (DL still holds the BIOS boot-drive number)
    mov ah, 0x02        ; BIOS read sectors
    mov al, 10          ; 10 sectors
    mov ch, 0           ; Cylinder 0
    mov cl, 2           ; Sector 2 (1-indexed; sector 1 is this bootloader)
    mov dh, 0           ; Head 0
    mov bx, 0x1000      ; Load at 0000:1000
    int 0x13            ; (error handling omitted: CF is set on failure)

    ; Switch to protected mode
    ; NOTE: enabling the A20 line is omitted here; QEMU usually boots with
    ; it usable, but real hardware may require enabling it first.
    cli
    lgdt [gdt_descriptor]
    mov eax, cr0
    or eax, 1
    mov cr0, eax
    jmp 0x08:protected_mode

[BITS 32]
protected_mode:
    ; Set up segment registers with the data selector
    mov ax, 0x10
    mov ds, ax
    mov es, ax
    mov fs, ax
    mov gs, ax
    mov ss, ax
    mov esp, 0x90000

    ; Jump to kernel
    jmp 0x1000

; GDT: null, flat 4 GB code (0x9A), flat 4 GB data (0x92)
gdt_start:
    dq 0                ; Null descriptor
gdt_code:
    dw 0xFFFF, 0x0000
    db 0x00, 0x9A, 0xCF, 0x00
gdt_data:
    dw 0xFFFF, 0x0000
    db 0x00, 0x92, 0xCF, 0x00
gdt_end:

gdt_descriptor:
    dw gdt_end - gdt_start - 1
    dd gdt_start

times 510-($-$$) db 0
dw 0xAA55               ; Boot signature
```
Phase 2: Timer Interrupt
```c
// interrupts.c
#include <stdint.h>

#define PIC1_COMMAND 0x20
#define PIC1_DATA    0x21
#define PIT_COMMAND  0x43
#define PIT_DATA     0x40

// Port I/O helper, assumed defined elsewhere via inline asm (out %al, %dx)
extern void outb(uint16_t port, uint8_t value);
extern void schedule(void);

// Assembly entry point for IRQ 0: it must save all registers into the
// current task's regs, call timer_handler(), restore from the (possibly
// different) current task, and end with iret. A plain C function cannot
// be installed in the IDT directly.
extern void timer_stub(void);

void timer_handler(void) {
    // Acknowledge the interrupt (EOI) so the PIC will deliver the next one
    outb(PIC1_COMMAND, 0x20);
    // Call scheduler; the stub performs the actual switch on return
    schedule();
}

void setup_timer(uint32_t frequency) {
    // The PIT's input clock is ~1.19318 MHz; the divisor sets the tick rate
    uint32_t divisor = 1193180 / frequency;
    outb(PIT_COMMAND, 0x36);            // channel 0, lo/hi byte, square wave
    outb(PIT_DATA, divisor & 0xFF);
    outb(PIT_DATA, (divisor >> 8) & 0xFF);
}

// IDT entry for timer (IRQ 0 -> INT 32, after remapping the PIC)
void setup_idt(void) {
    idt[32].handler_low  = (uint32_t)timer_stub & 0xFFFF;
    idt[32].handler_high = (uint32_t)timer_stub >> 16;
    idt[32].selector     = 0x08;        // kernel code segment
    idt[32].flags        = 0x8E;        // present, ring 0, 32-bit interrupt gate
    lidt(&idt_descriptor);
}
```
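`idt`, `idt_descriptor`, and `lidt` above are assumed to exist elsewhere. A minimal sketch of that plumbing, with field names chosen to match `setup_idt()` (all names illustrative):

```c
// idt.c - minimal IDT plumbing assumed by setup_idt() above.
#include <stdint.h>

typedef struct {
    uint16_t handler_low;   // handler address bits 0-15
    uint16_t selector;      // code segment selector
    uint8_t  zero;          // always 0
    uint8_t  flags;         // gate type and attributes (0x8E = interrupt gate)
    uint16_t handler_high;  // handler address bits 16-31
} __attribute__((packed)) idt_entry_t;

typedef struct {
    uint16_t limit;         // size of the IDT minus 1
    uint32_t base;          // linear address of the IDT
} __attribute__((packed)) idt_descriptor_t;

idt_entry_t idt[256];
idt_descriptor_t idt_descriptor;

static inline void lidt(idt_descriptor_t* d) {
    __asm__ volatile("lidt %0" : : "m"(*d));
}

void idt_init(void) {
    idt_descriptor.limit = sizeof(idt) - 1;
    idt_descriptor.base  = (uint32_t)idt;
}
```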
Phase 3: Task Structure & Context Switch
```c
// task.h
#include <stdint.h>

// Saved CPU context for a task
typedef struct {
    uint32_t eax, ebx, ecx, edx;
    uint32_t esi, edi, ebp, esp;
    uint32_t eip, eflags;
    uint32_t cs, ds, es, fs, gs, ss;
} registers_t;

typedef enum {
    TASK_READY,
    TASK_RUNNING,
    TASK_BLOCKED,
    TASK_TERMINATED
} task_state_t;

typedef struct task {
    int pid;
    char name[32];
    task_state_t state;
    int priority;
    registers_t regs;        // saved context while not running
    uint32_t* stack;         // base of this task's kernel stack
    struct task* next;       // run-queue / wait-queue link
} task_t;

// Implemented in assembly: saves the current context into *old,
// loads *new, and resumes execution there.
extern void context_switch(registers_t* old, registers_t* new);
```
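Nothing above shows how a task gets its first context. One plausible `task_create()`, a sketch in which `kmalloc()` and the 4 KB stack size are assumptions, not part of the guide's spec:

```c
// task_create.c - sketch: build a runnable task_t so the first context
// switch "returns" into its entry function.
#include <stdint.h>
#include "task.h"                   // the header sketched above

#define STACK_WORDS 1024            // 4 KB kernel stack
extern void* kmalloc(uint32_t size);
extern void enqueue(task_t** queue, task_t* task);
extern task_t* ready_queue;

task_t* task_create(int pid, const char* name, void (*entry)(void)) {
    task_t* t   = kmalloc(sizeof(task_t));
    t->pid      = pid;
    t->state    = TASK_READY;
    t->priority = 5;
    t->next     = NULL;
    t->stack    = kmalloc(STACK_WORDS * sizeof(uint32_t));

    // Copy the name by hand (freestanding kernel: no strncpy)
    int i = 0;
    while (i < (int)sizeof(t->name) - 1 && name[i]) {
        t->name[i] = name[i];
        i++;
    }
    t->name[i] = '\0';

    // Arrange the saved context so the first switch lands in entry()
    t->regs        = (registers_t){0};
    t->regs.eip    = (uint32_t)entry;
    t->regs.esp    = (uint32_t)&t->stack[STACK_WORDS];   // stacks grow down
    t->regs.eflags = 0x202;                              // IF=1: interrupts on
    t->regs.cs     = 0x08;                               // kernel code segment
    t->regs.ds = t->regs.es = t->regs.fs = t->regs.gs = t->regs.ss = 0x10;

    enqueue(&ready_queue, t);
    return t;
}
```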
Phase 4: Scheduler
```c
// scheduler.c
#include "task.h"

extern void enqueue(task_t** queue, task_t* task);   // append to tail
extern task_t* dequeue(task_t** queue);              // pop head (NULL if empty)
extern task_t* idle_task;                            // halt-loop, always runnable

task_t* current_task = NULL;
task_t* ready_queue  = NULL;

void schedule(void) {
    if (!current_task) return;

    // Current task's registers were already saved by the interrupt stub.
    // Move it to the end of the queue if it is still runnable.
    if (current_task->state == TASK_RUNNING) {
        current_task->state = TASK_READY;
        enqueue(&ready_queue, current_task);
    }

    // Get next task
    task_t* next = dequeue(&ready_queue);
    if (!next) {
        next = idle_task;   // Never block completely
    }

    next->state  = TASK_RUNNING;
    current_task = next;
    // The actual context switch happens on return from the interrupt
}
```
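The `enqueue`/`dequeue` helpers are a plain singly linked FIFO; a sketch consistent with the `next` pointer in `task_t`:

```c
// queue.c - singly linked FIFO used for the ready and wait queues.
#include <stddef.h>
#include "task.h"                 // the header sketched in Phase 3

void enqueue(task_t** queue, task_t* task) {
    task->next = NULL;
    if (!*queue) {                // empty queue: task becomes the head
        *queue = task;
        return;
    }
    task_t* tail = *queue;        // walk to the tail (fine for small queues;
    while (tail->next)            //  a real kernel would keep a tail pointer)
        tail = tail->next;
    tail->next = task;
}

task_t* dequeue(task_t** queue) {
    task_t* head = *queue;
    if (head)
        *queue = head->next;      // advance the head
    return head;                  // NULL if the queue was empty
}
```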
Phase 5: Kernel Semaphore
```c
// sync.c
#include "task.h"
// spinlock_t / spinlock_acquire / spinlock_release as sketched earlier

extern task_t* current_task;
extern task_t* ready_queue;
extern void enqueue(task_t** queue, task_t* task);
extern task_t* dequeue(task_t** queue);
extern void schedule(void);

typedef struct {
    int count;
    task_t* wait_queue;
    spinlock_t lock;        // protects count and wait_queue
} ksemaphore_t;

void ksem_wait(ksemaphore_t* sem) {
    spinlock_acquire(&sem->lock);
    while (sem->count <= 0) {
        // Block current task
        current_task->state = TASK_BLOCKED;
        enqueue(&sem->wait_queue, current_task);
        // Release before switching, so we never sleep holding the lock
        spinlock_release(&sem->lock);
        schedule();         // Switch away; we resume here when woken
        spinlock_acquire(&sem->lock);
        // Re-check count: another woken task may have taken the resource
    }
    sem->count--;
    spinlock_release(&sem->lock);
}

void ksem_signal(ksemaphore_t* sem) {
    spinlock_acquire(&sem->lock);
    sem->count++;
    if (sem->wait_queue) {
        task_t* task = dequeue(&sem->wait_queue);
        task->state = TASK_READY;
        enqueue(&ready_queue, task);   // woken task will re-check the count
    }
    spinlock_release(&sem->lock);
}
```
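A usage sketch tying this back to the demo output: Task 3 blocks on a semaphore (as the process table shows) until Task 2 signals it. `vga_print` is the earlier sketch; all names are illustrative.

```c
// demo.c - the "beeper" waits on a semaphore; the "printer" signals it.
extern void ksem_wait(ksemaphore_t* sem);      // from sync.c above
extern void ksem_signal(ksemaphore_t* sem);
extern void vga_print(const char* s);          // from the VGA sketch

ksemaphore_t beep_sem = { .count = 0, .wait_queue = NULL, .lock = {0} };

void beeper_task(void) {            // Task 3: BLOCKED until signaled
    for (;;) {
        ksem_wait(&beep_sem);       // sleeps while count == 0
        vga_print("[Task 3] BEEP!\n");
    }
}

void printer_task(void) {           // Task 2: produces one beep per loop
    for (;;) {
        vga_print("[Task 2] Printing: Hello from Task 2!\n");
        ksem_signal(&beep_sem);     // wakes the beeper
    }
}
```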
Build & Run
```make
# Makefile (recipe lines must be indented with tabs)
CC   = i686-elf-gcc
AS   = nasm
LD   = i686-elf-ld
QEMU = qemu-system-i386

CFLAGS  = -ffreestanding -O2 -Wall -Wextra
# linker.ld must place the kernel at 0x1000 and emit a flat binary
# (OUTPUT_FORMAT("binary")), since the bootloader jumps straight to 0x1000
LDFLAGS = -T linker.ld -nostdlib

all: os.img

boot.bin: boot.asm
	$(AS) -f bin $< -o $@

kernel.o: kernel.c
	$(CC) $(CFLAGS) -c $< -o $@

# Add interrupts.o, scheduler.o, sync.o here as you build later phases
kernel.bin: kernel.o
	$(LD) $(LDFLAGS) $^ -o $@

os.img: boot.bin kernel.bin
	cat $^ > $@
	truncate -s 1440k $@   # pad to a 1.44 MB floppy image

run: os.img
	$(QEMU) -fda $< -monitor stdio

clean:
	rm -f *.bin *.o *.img
```
Essential Resources Summary
Books (Priority Order)
- Operating Systems: Three Easy Pieces - FREE online, the best OS book
- Computer Systems: A Programmer’s Perspective by Bryant & O’Hallaron - Memory and low-level understanding
- The Linux Programming Interface by Michael Kerrisk - The bible for Linux system programming
- Rust Atomics and Locks by Mara Bos - Excellent even for C programmers on atomics/concurrency
- C++ Concurrency in Action by Anthony Williams - Best book on practical concurrency patterns
- Linux Kernel Development by Robert Love - When you’re ready for real kernel code
Online Tutorials
- Green threads explained (c9x.me) - For Project 4
- Threads, Mutexes and Concurrent Programming in C - For Project 2
- Barr Group: Mutexes and Semaphores Demystified - Historical context
Papers (Advanced)
- “Scheduling Multithreaded Computations by Work Stealing” - Blumofe & Leiserson
- “What is RCU, Fundamentally?” - Paul McKenney (LWN article series)
Key Conceptual Distinctions
| Question | Answer |
|---|---|
| Thread vs Process? | Threads share memory, processes are isolated |
| Concurrency vs Parallelism? | Concurrency = structure (dealing with many things), Parallelism = execution (doing many things simultaneously) |
| Mutex vs Semaphore? | Mutex = locking (ownership), Semaphore = signaling (counting) |
| Kernel vs User thread? | Kernel thread = OS schedules, User thread = your code schedules |
| Spinlock vs Mutex? | Spinlock = busy-waits (burns CPU, good for very short waits), Mutex = sleeps (no CPU wasted while blocked) |
| Green thread vs Coroutine? | Green thread = scheduled by runtime, Coroutine = explicit yield by programmer |
| Race condition vs Deadlock? | Race = wrong order causes wrong result, Deadlock = circular wait causes freeze |
| Preemptive vs Cooperative? | Preemptive = OS forces switch (timer), Cooperative = task chooses when to yield |
Appendix: Difficulty & Coolness Scales
Difficulty Levels
| Level | Name | Description |
|---|---|---|
| 1 | Beginner | Basic syntax, straightforward logic, single-purpose tools |
| 2 | Intermediate | State management, async operations, third-party libraries |
| 3 | Advanced | Concurrency, memory management, complex architectures |
| 4 | Expert | Kernel-level, assembly, complex mathematical models |
| 5 | Master | Self-referential systems, cryptography, bootstrapping |
Coolness Levels
| Level | Vibe | Description |
|---|---|---|
| 1 | “This looks like work” | Vital for career but zero flair |
| 2 | “Neat, I guess” | Standard hobbyist projects |
| 3 | “Hey, that’s pretty smart” | Mini versions of famous tools |
| 4 | “Woah, you did that yourself?” | Assembly, kernel modules, bare-metal |
| 5 | “Shut up and take my upvote!” | Visually stunning or technically impossible-looking |
Business Potential
| Level | Model | Description |
|---|---|---|
| 1 | Resume Gold | Gets you a $200k+ job |
| 2 | Micro-SaaS | $5-10/month subscriptions |
| 3 | Service & Support | $500-2000/month retainers |
| 4 | Open Core | Free engine, paid control plane |
| 5 | Industry Disruptor | VC-backable platform play |