Systems Libraries and Runtimes - Phase 2 Track B

Goal: Build the mental models and practical skills required to design the low-level libraries that everything else depends on: allocators, schedulers, async runtimes, ABI-stable interfaces, and performance-critical primitives. You will learn why these layers look the way they do, how to reason about their failure modes, and how to validate correctness under concurrency and load. By the end, you will be able to implement production-grade components (not toy demos) and explain the trade-offs behind every design choice. You will also have a portfolio of concrete artifacts you can benchmark, integrate, and defend in interviews.


Introduction: What This Guide Covers

Systems libraries and runtimes are the invisible layers that sit between your application and the operating system. They provide allocation, threading, async I/O, ABI stability, and performance primitives that everything else builds on. In practice, they let you build fast, safe, and portable software without depending on undefined behavior or incidental, platform-specific kernel details.

What you will build (by the end of this guide):

  • A drop-in malloc replacement with arenas, size classes, and benchmarks
  • A work-stealing thread pool scheduler
  • A mini async runtime with an event loop, timers, and non-blocking I/O
  • A cross-platform syscall abstraction layer (POSIX + Windows)
  • A SIMD-accelerated string search library
  • A crash-safe embedded key-value store with WAL + LSM compaction

Scope (what is included):

  • Memory allocation, concurrency primitives, async I/O, ABI boundaries, and performance tuning
  • Realistic library design: public APIs, error handling, portability, benchmarking, and tests

Out of scope (for this guide):

  • Building a full OS kernel
  • Managed runtimes (JVM, CLR) beyond the low-level runtime concepts

The Big Picture (Mental Model)

+------------------------------+        +-----------------------------+
|          Applications         |        |        Dev Tools            |
|  CLI tools, servers, games   |        |  profilers, sanitizers      |
+---------------+--------------+        +--------------+--------------+
                |                              |
                v                              v
+--------------------------------------------------------------+
|     Systems Libraries and Runtimes (what you build here)      |
|  allocators | thread pools | async runtime | ABI | perf libs   |
+-------------------------------+------------------------------+
                                |
                                v
+--------------------------------------------------------------+
|                 OS Interfaces and Syscalls                    |
|        POSIX, Windows APIs, epoll/kqueue/io_uring             |
+-------------------------------+------------------------------+
                                |
                                v
+--------------------------------------------------------------+
|                     Hardware and Memory                       |
|             CPU caches, NUMA, page tables, I/O                |
+--------------------------------------------------------------+

Key Terms You Will See Everywhere

  • Allocator: Component that manages dynamic memory (malloc/free) and its performance trade-offs.
  • Arena: A region of memory reserved for allocations, often per-thread to reduce contention.
  • Work stealing: Scheduling strategy where idle threads steal tasks from busy threads.
  • Reactor: Async pattern where readiness events drive callbacks/futures.
  • ABI: Binary contract defining calling conventions, struct layout, and symbol visibility.
  • Undefined behavior (UB): Program behavior not defined by the language standard, often leading to silent bugs.
  • Backpressure: Mechanism to avoid overload by slowing producers when consumers fall behind.

How to Use This Guide

  1. Read the Theory Primer first. It is the mini-book that explains the mental models behind every project.
  2. Pick a learning path that matches your background and time budget.
  3. Build each project in order if you are new to systems libraries. Later projects assume earlier concepts.
  4. Treat every project like a product: write a README, benchmarks, tests, and a demo script.
  5. Re-run benchmarks after each change to understand performance trade-offs and regressions.

If you get stuck, use the “Hints in Layers” sections and then return to the Theory Primer to reconnect with the core model.


Prerequisites & Background Knowledge

Before starting these projects, you should have foundational understanding in these areas:

Essential Prerequisites (Must Have)

Programming Skills:

  • Proficiency in C (pointers, structs, manual memory management)
  • Comfort with the command line and build tools (Make, clang/gcc)
  • Ability to read compiler errors and trace crashes with a debugger

Systems Fundamentals:

  • Virtual memory basics (pages, address spaces, mmap vs heap)
  • Threads and synchronization (mutexes, condition variables)
  • Basic UNIX process model (file descriptors, fork/exec)
  • Recommended Reading: “Computer Systems: A Programmer’s Perspective” by Bryant and O’Hallaron - Ch. 9 (Virtual Memory), Ch. 12 (Concurrent Programming)

C Interfaces and ABI Basics:

  • Struct layout and alignment rules
  • Function calling conventions at a high level
  • Dynamic linking concepts (shared libraries, symbol lookup)
  • Recommended Reading: “Computer Systems: A Programmer’s Perspective” - Ch. 3 (Machine-Level Representation), Ch. 7 (Linking)

Helpful But Not Required

Advanced Topics:

  • SIMD intrinsics (SSE/AVX) - You will learn during Project 5
  • Async I/O internals (epoll/kqueue/io_uring) - You will learn during Project 3
  • Durability and crash consistency (WAL, fsync) - You will learn during Project 6

Self-Assessment Questions

Before starting, ask yourself:

  1. Can I explain the difference between stack memory and heap memory?
  2. Do I know how to find a race condition with a thread sanitizer or debugger?
  3. Have I used strace, ltrace, or perf to debug a Linux program?
  4. Can I explain what an ABI is and why it matters for shared libraries?
  5. Can I read an assembly-level backtrace when a C program crashes?

If you answered “no” to questions 1-3: spend 1-2 weeks with CS:APP Ch. 1-12 and APUE Ch. 3, 11, 14 before starting.

Development Environment Setup

Required Tools:

  • A Linux machine (Ubuntu 22.04+, Debian 12+, or Fedora 39+)
  • GCC or Clang (C17 support)
  • Make or Ninja
  • git

Recommended Tools:

  • gdb or lldb (debugging)
  • valgrind or heaptrack (allocator debugging)
  • perf and flamegraph tools (profiling)
  • strace and ltrace (syscall and ABI tracing)
  • rr (record/replay debugging, optional)

Testing Your Setup:

$ gcc --version
gcc (Ubuntu 12.3.0) 12.3.0

$ make --version
GNU Make 4.3

$ uname -a
Linux devbox 6.6.12 #1 SMP PREEMPT_DYNAMIC x86_64 GNU/Linux

Time Investment

  • Simple projects (Project 5): 1-2 weekends (8-16 hours each)
  • Moderate projects (Projects 2, 3, 4): 2-3 weeks each (20-40 hours)
  • Complex projects (Projects 1, 6): 3-6 weeks each (40-80 hours)
  • Total sprint: 3-5 months if done sequentially

Important Reality Check

These are production-grade concepts. Expect to iterate:

  1. First pass: Get it working, even if ugly
  2. Second pass: Understand the correctness model and invariants
  3. Third pass: Optimize the slowest 10 percent
  4. Fourth pass: Make it portable and testable

Systems mastery is a marathon, not a sprint.


Big Picture / Mental Model

Think of this track as building a “mini standard library” from the bottom up:

[Hardware]
   |
   v
[OS syscalls] -> [Allocator] -> [Threading/Async] -> [Library APIs] -> [Applications]
                      ^               ^
                      |               |
              [Performance]     [Portability/ABI]

Every project in this track strengthens one or more of those layers.


Theory Primer (Read This Before Coding)

This section is the mini-book for the projects. Each chapter gives you the mental model, implementation details, and real-world context you need before writing code.

Chapter 1: Memory Allocation, Fragmentation, and Undefined Behavior

Fundamentals

Dynamic memory allocation exists because programs need memory whose size and lifetime cannot be predicted at compile time. An allocator maps requests like malloc(37) into chunks of virtual memory that are aligned, trackable, and safe to free later. The allocator must balance speed (fast allocations), space efficiency (low fragmentation), and correctness (avoid double frees, use-after-free, misalignment). The OS only hands out memory in page-sized chunks, so an allocator is essentially a sub-allocator: it requests pages and divides them into smaller blocks for your program. When it makes the wrong choice, the consequences appear as latency spikes, memory bloat, or crashes that are extremely hard to reproduce. This is also where undefined behavior (UB) lives: pointer arithmetic beyond object bounds, misaligned accesses, strict aliasing violations, and integer overflow can silently corrupt allocator metadata or the program’s own data. A systems programmer treats allocation as a design problem, not a magical library call.

Deep Dive into the Concept

At the lowest level, a classic allocator keeps metadata for each allocated block: size, in-use flag, and links for free lists. The simplest design stores metadata right before the user pointer (“header”), so free(ptr) can move backward and find its own bookkeeping. That is easy but fragile: any buffer overflow overwrites allocator metadata, which can later corrupt the heap. Modern allocators reduce this risk by segregating metadata (like size classes, bitmaps, or side tables) and by adding integrity checks, canaries, or cookies.

Most real allocators are segregated: they use different strategies for small and large objects. Small allocations are handled by size classes, such as 16, 32, 64, 128 bytes. The allocator keeps a free list per size class so allocation is O(1) without searching. This buys speed but introduces internal fragmentation: a 17-byte request might consume a 32-byte block. Large allocations are typically served directly from the OS using mmap, not from the heap. This avoids polluting small-object bins and reduces external fragmentation, where free memory exists but is split into unusable pieces.
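
Below is a minimal sketch of that fast path. The names (smalloc, sfree, class_sizes) are illustrative, the code is single-threaded, and it assumes the caller passes the block size to sfree; a real allocator recovers the size from its own metadata and serves the slow path by carving new runs from an arena.

#include <stddef.h>

#define NUM_CLASSES 7
static const size_t class_sizes[NUM_CLASSES] = {16, 32, 64, 128, 256, 512, 1024};

typedef struct free_node { struct free_node* next; } free_node_t;
static free_node_t* free_lists[NUM_CLASSES];     // one intrusive free list per size class

static int size_to_class(size_t n) {
    for (int i = 0; i < NUM_CLASSES; i++)
        if (n <= class_sizes[i]) return i;
    return -1;                                   // too big: a real allocator would mmap it
}

void* smalloc(size_t n) {
    int c = size_to_class(n);
    if (c < 0) return NULL;                      // large-object path not shown
    free_node_t* node = free_lists[c];
    if (node == NULL) return NULL;               // slow path (carve a new run from an arena) not shown
    free_lists[c] = node->next;                  // O(1) pop: no searching, no size lookup
    return node;
}

void sfree(void* p, size_t n) {
    int c = size_to_class(n);
    if (c < 0) return;                           // large-object path not shown
    free_node_t* node = (free_node_t*)p;         // the freed block itself stores the link
    node->next = free_lists[c];
    free_lists[c] = node;
}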

Fragmentation is the enemy of long-running systems. External fragmentation arises when free blocks are interleaved with allocated blocks; the allocator cannot satisfy a large request even though total free memory is sufficient. Coalescing adjacent free blocks reduces fragmentation but can be expensive if done eagerly. Many allocators choose lazy coalescing: they coalesce only when needed or at specific times. This is a trade-off between latency (slow frees) and memory usage (fragmentation). Another trade-off is returning memory to the OS. free() does not necessarily reduce RSS because the allocator may keep freed blocks for reuse. Returning memory requires munmap or madvise, which can be expensive and may hurt performance if the freed memory is needed again soon.

Concurrency complicates everything. A single global heap with a single lock causes contention and tail latency spikes. To scale, allocators use per-thread arenas or caches: each thread allocates from its own pool without locking. The cost is higher memory usage and more fragmentation across arenas. To mitigate this, allocators periodically rebalance or purge arenas. This is why jemalloc exposes statistics like “dirty pages” and “muzzy pages” and why tuning options matter for real workloads.

Alignment is another deep, often misunderstood constraint. The allocator must return pointers aligned to at least alignof(max_align_t). If it does not, code may crash or silently misbehave on certain architectures. Alignment also interacts with SIMD: vector instructions often require 16- or 32-byte alignment, so a high-performance library might over-align allocations for speed. This is where UB creeps in: if you write past the end of an allocation to “round up” alignment, you may trample metadata or another allocation. Similarly, pointer arithmetic is only defined within the bounds of a single object (plus one past the end). Many allocators use pointer tricks (like storing metadata in the low bits) that rely on alignment, but the language standard gives few guarantees if you violate those rules.

Finally, the allocator is a protocol between the runtime and the program. The program promises to call free() on the exact pointer returned by malloc() and never use it afterward; the allocator promises to return distinct, properly aligned blocks and to honor size requests. Breaking this protocol results in undefined behavior, which can be exploitable in security contexts. Understanding allocation means understanding this protocol, the OS-level memory model, and the performance consequences of allocator design.

How This Fits Into the Projects

  • Project 1 builds a real allocator with size classes, coalescing, and per-thread caches.
  • Project 5 uses alignment and pointer rules to avoid UB in SIMD string scanning.
  • Project 6 uses custom allocators for memtables, bloom filters, and block caches.

Definitions and Key Terms

  • Internal fragmentation: wasted space inside allocated blocks due to rounding or alignment.
  • External fragmentation: free memory split into small pieces, preventing large allocations.
  • Arena: a pool of memory managed by an allocator, often per-thread.
  • Bump allocator: allocator that only moves a pointer forward and frees everything at once.
  • Coalescing: merging adjacent free blocks into a larger block.
  • UB (Undefined Behavior): behavior not defined by the language standard, often catastrophic.

Mental Model Diagram

        OS (pages)                     Allocator                     Program
+----------------------+     +----------------------------+     +-------------------+
| mmap/brk gives pages | --> | arenas + size classes       | --> | malloc/free calls |
| 4KB, 2MB, 1GB chunks |     | metadata + free lists       |     | objects + buffers |
+----------------------+     +----------------------------+     +-------------------+
             ^                        |
             |                        v
             +----------- coalesce / purge / reuse ----------+

How It Works (Step-by-Step)

  1. Request arrives: malloc(n) rounds n up to the nearest size class.
  2. Fast path: if the size class free list has a block, pop and return it.
  3. Slow path: if empty, carve a new block from an arena or request pages from the OS.
  4. Metadata update: store size and state; maintain free list pointers.
  5. Free path: free(ptr) finds metadata, marks free, and optionally coalesces.
  6. Reclaim: if a whole page or region is free, return it to the OS (optional).

Invariants:

  • Every allocated block is aligned to at least alignof(max_align_t).
  • No two live allocations overlap.
  • Free list nodes always point to valid free blocks.

Failure modes:

  • Double free corrupts free lists.
  • Buffer overflow corrupts metadata.
  • Misalignment breaks SIMD or strict aliasing rules.

Minimal Concrete Example

// Bump allocator: fast, simple, frees everything at once
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint8_t* base;
    size_t   size;
    size_t   offset;
} arena_t;

void* arena_alloc(arena_t* a, size_t n, size_t align) {
    size_t mask = align - 1;                      // align must be a power of two
    size_t aligned = (a->offset + mask) & ~mask;  // round offset up to the next multiple of align
    if (aligned + n > a->size) return NULL;       // out of space
    void* ptr = a->base + aligned;
    a->offset = aligned + n;
    return ptr;
}

void arena_reset(arena_t* a) {
    a->offset = 0; // bulk free
}

Common Misconceptions

  • “free() always returns memory to the OS.” -> Most allocators keep freed blocks for reuse.
  • “Alignment only matters for speed.” -> Misalignment can crash or cause UB.
  • “Any pointer arithmetic is fine in C.” -> It is only defined within one object.
  • “Fragmentation only happens in naive allocators.” -> It happens in all allocators; you manage it.

Check-Your-Understanding Questions

  1. Why do most allocators use size classes for small objects?
  2. What is the difference between internal and external fragmentation?
  3. Why might a high-performance allocator keep memory instead of returning it to the OS?
  4. How can undefined behavior corrupt allocator metadata?

Check-Your-Understanding Answers

  1. Size classes make small allocations O(1) by eliminating search time.
  2. Internal fragmentation is wasted space inside blocks; external is wasted space between blocks.
  3. Returning memory to the OS is expensive and may hurt performance if the memory is needed again.
  4. Buffer overflows, misalignment, or out-of-bounds writes can overwrite metadata and break invariants.

Real-World Applications

  • Game engines use arena allocators to avoid frame-time spikes.
  • Databases use custom allocators to control memory growth.
  • Browsers isolate allocators per component to reduce heap corruption blast radius.

Where You’ll Apply It

  • Project 1: Custom memory allocator
  • Project 5: SIMD string search (alignment and UB avoidance)
  • Project 6: Embedded key-value store (memtable and block caches)

References

  • “C Interfaces and Implementations” by David Hanson - Ch. 5 (Arena), Ch. 6 (Mem)
  • “Computer Systems: A Programmer’s Perspective” - Ch. 9 (Virtual Memory), Ch. 9.9 (Dynamic Memory)
  • “The C Programming Language” - Ch. 8.7 (A Storage Allocator)
  • jemalloc background and allocator design notes (NetBSD man page for jemalloc) https://man.netbsd.org/jemalloc.3

Key Insights

Fast allocators are not magic; they are carefully engineered compromises between speed, memory overhead, and concurrency.

Summary

Allocators are a performance-critical subsystem that must respect strict invariants: alignment, metadata integrity, and concurrency safety. Fragmentation is unavoidable but manageable, and the decision to return memory to the OS is a trade-off between throughput and footprint. Understanding allocators also means understanding undefined behavior, because allocator bugs are rarely caught by the compiler and often surface as random crashes.

Homework/Exercises

  1. Implement a bump allocator and measure its speed vs malloc for tiny objects.
  2. Write a small program that intentionally fragments the heap and observe RSS growth.
  3. Add guard bytes before and after allocations and detect buffer overruns.

Solutions

  1. Bump allocators are typically 5-20x faster for fixed-size objects but cannot free individual blocks.
  2. RSS grows even when total freed bytes are large because external fragmentation prevents reuse.
  3. Guard bytes allow you to detect overwrites at free() time, trading speed for safety.

Chapter 2: Concurrency Primitives and Memory Ordering

Fundamentals

Concurrency primitives exist to coordinate access to shared state. Without them, threads can interleave in unpredictable ways, producing race conditions that disappear under a debugger but crash in production. A mutex provides mutual exclusion, a condition variable allows one thread to sleep until another signals it, and an atomic variable allows lock-free updates to a single value. The subtlety is that CPUs and compilers reorder instructions for performance, so your code must declare which orderings are required. The C11 memory model defines how operations on atomics establish “happens-before” relationships that make writes visible to other threads. Understanding these primitives is essential because every project in this track either spawns threads or runs async tasks on a shared runtime.

Deep Dive into the Concept

The core problem of concurrency is visibility: one thread writes a value, another thread reads it. Modern CPUs use caches and store buffers, meaning a write can be temporarily invisible to other cores. Compilers also reorder operations to improve performance. The memory model gives you tools to control this: mutexes imply full ordering (acquire on lock, release on unlock), while atomics let you pick the minimal ordering you need.

At the hardware level, each core has private caches and a coherency protocol ensures that caches eventually agree. However, “eventually” is not enough for correctness. A lock enforces that only one thread at a time can mutate a critical section, but it also introduces contention. When that contention is high, latency spikes. Condition variables solve a different problem: they allow one thread to wait until a condition becomes true without busy-waiting. The correct pattern is always: lock the mutex, check the condition, wait in a loop, and re-check after waking. Skipping the loop leads to lost wakeups.
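
Here is a minimal sketch of that wait-in-a-loop pattern using POSIX threads. queue_t and its fields are illustrative names, and initialization of the mutex and condition variable is omitted.

#include <pthread.h>

typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  not_empty;
    int             count;                        // number of items currently queued
} queue_t;

void wait_for_item(queue_t* q) {
    pthread_mutex_lock(&q->mutex);
    while (q->count == 0)                         // loop guards against spurious wakeups
        pthread_cond_wait(&q->not_empty, &q->mutex);
    q->count--;                                   // condition re-checked while holding the lock
    pthread_mutex_unlock(&q->mutex);
}

void publish_item(queue_t* q) {
    pthread_mutex_lock(&q->mutex);
    q->count++;                                   // change the state first...
    pthread_cond_signal(&q->not_empty);           // ...then wake one waiter
    pthread_mutex_unlock(&q->mutex);
}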

Atomics enable lock-free data structures, but they are not magic. A lock-free algorithm typically relies on compare-and-swap (CAS) to update a pointer or value only if it has not changed. This introduces the ABA problem, where a value changes from A to B and back to A, fooling CAS into thinking nothing changed. Solutions include tagging pointers with version counters or using hazard pointers for safe reclamation. When you use atomics, you must choose memory ordering: relaxed for counters, acquire/release for producer-consumer, and seq_cst for global ordering when you cannot reason about weaker orderings. Overusing seq_cst can reduce performance; underusing it leads to heisenbugs.
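
The following sketch shows the CAS retry loop at the heart of a lock-free stack push. It is a simplified illustration: the matching pop is where the ABA problem described above appears, and the version-counter or hazard-pointer fix is not shown.

#include <stdatomic.h>

typedef struct node { struct node* next; int value; } node_t;
static _Atomic(node_t*) head = NULL;

void push(node_t* n) {
    node_t* old = atomic_load_explicit(&head, memory_order_relaxed);
    do {
        n->next = old;                            // link to the head we last observed
    } while (!atomic_compare_exchange_weak_explicit(
                 &head, &old, n,
                 memory_order_release,            // publish n->next before the new head is visible
                 memory_order_relaxed));          // on failure, old now holds the current head; retry
}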

False sharing is another subtle source of performance regressions. If two threads modify separate variables that sit on the same cache line, the cache line bounces between cores and throughput collapses. This is common in thread pools, allocators, and queues. The fix is to pad or align data to cache-line boundaries and to place hot fields together but isolated from unrelated hot fields.
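
A minimal sketch of the usual fix, assuming a 64-byte cache line: each per-thread counter is aligned and padded so that no two counters share a line.

#include <stdalign.h>
#include <stdatomic.h>

#define CACHE_LINE 64                             // assumed line size; verify for your target CPU

typedef struct {
    alignas(CACHE_LINE) atomic_long count;        // hot field owned by one thread
    char pad[CACHE_LINE - sizeof(atomic_long)];   // keep the next counter on its own line
} padded_counter_t;

static padded_counter_t counters[8];              // counters[i] is only written by worker i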

Concurrency primitives are not only about correctness; they are also about progress guarantees. Mutexes are blocking and can cause priority inversion or deadlock. Lock-free algorithms guarantee that at least one thread makes progress, but can still starve. Wait-free algorithms guarantee that every thread makes progress in a bounded number of steps but are complex and rarely used in general libraries. Knowing which guarantee you need is the first design decision: if the system is I/O bound, a mutex is fine; if latency is critical and contention high, lock-free techniques may be required.

Finally, concurrency interacts with the language and ABI. Thread-local storage (TLS) is part of the ABI. Calling conventions must preserve registers across thread boundaries. The same data structure can behave differently depending on memory ordering guarantees of the platform (x86 is strong, ARM is weaker). Portable libraries must use the C11 atomic API or platform intrinsics carefully to remain correct on all architectures.

How This Fits Into the Projects

  • Project 2 uses mutexes, condition variables, and atomics for a work-stealing scheduler.
  • Project 3 uses atomics for task state and wakeup mechanisms.
  • Project 1 and Project 6 rely on per-thread caches and lock contention avoidance.

Definitions and Key Terms

  • Mutex: mutual exclusion lock for protecting a critical section.
  • Condition variable: synchronization primitive for waiting on state changes.
  • Atomic: operation that is indivisible and visible across threads.
  • Happens-before: ordering relation that guarantees visibility.
  • False sharing: cache-line contention caused by unrelated variables sharing a line.
  • ABA problem: CAS sees the same value but misses intermediate changes.

Mental Model Diagram

Thread A                     Shared State                     Thread B
---------                    ------------                    ---------
write X=1   ---(release)---> [X=1, lock free] ---(acquire)---> read X

Locks and atomics define the arrows that make writes visible.

How It Works (Step-by-Step)

  1. Thread A acquires a mutex (acquire semantics).
  2. Thread A modifies shared data.
  3. Thread A releases the mutex (release semantics).
  4. Thread B acquires the mutex and sees the changes.
  5. If using atomics, Thread A performs a release store and Thread B performs an acquire load.

Invariants:

  • Every shared object is either protected by a lock or accessed atomically.
  • All threads follow the same locking order to avoid deadlocks.

Failure modes:

  • Data races lead to undefined behavior.
  • Deadlocks occur from inconsistent lock ordering.
  • Livelocks occur when threads retry without progress.

Minimal Concrete Example

#include <stdatomic.h>
#include <stdio.h>
#include <pthread.h>

atomic_int ready = 0;
int data = 0;

void* producer(void* _) {
    data = 42;                      // regular write
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

void* consumer(void* _) {
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0) { }
    // data is now visible because of acquire/release
    printf("data=%d\n", data);
    return NULL;
}

Common Misconceptions

  • “Volatile makes code thread-safe.” -> It does not; use atomics or locks.
  • “Locks always kill performance.” -> Only when contention is high.
  • “x86 is strongly ordered so I can ignore memory ordering.” -> Portability breaks on ARM.
  • “Lock-free means no bugs.” -> Lock-free code is often more subtle and error-prone.

Check-Your-Understanding Questions

  1. What does acquire/release guarantee that relaxed ordering does not?
  2. Why does a condition variable require a loop around pthread_cond_wait?
  3. How can false sharing destroy performance even when there are no races?

Check-Your-Understanding Answers

  1. Acquire/release establishes a happens-before relationship for visibility.
  2. Spurious wakeups and missed signals require re-checking the condition.
  3. Cache-line bouncing causes coherence traffic and stalls unrelated updates.

Real-World Applications

  • Thread pools in web servers and build systems
  • Lock-free queues in high-performance messaging systems
  • Per-thread arenas in allocators

Where You’ll Apply It

  • Project 2: Work-stealing thread pool
  • Project 3: Mini async runtime
  • Project 1: Thread-safe allocator
  • Project 6: Concurrent readers in storage engine

References

  • “Computer Systems: A Programmer’s Perspective” - Ch. 12 (Concurrent Programming)
  • “Rust Atomics and Locks” by Mara Bos - Ch. 1-4
  • “Advanced Programming in the UNIX Environment” - Ch. 11-12 (Threads and synchronization)

Key Insights

Concurrency is not just about locks; it is about visibility, ordering, and performance under contention.

Summary

To write correct concurrent code, you must define clear ownership rules, use locks or atomics consistently, and understand memory ordering. The same algorithm can be safe on one architecture and broken on another if you rely on undefined ordering. A systems library must be portable and deliberate about these choices.

Homework/Exercises

  1. Write a bounded queue protected by a mutex and condition variables.
  2. Convert the queue to use a lock-free ring buffer with atomics.
  3. Measure throughput and latency differences on 1, 2, 4, and 8 threads.

Solutions

  1. Mutex + condvar provides correctness but scales poorly at high contention.
  2. Lock-free ring buffers improve throughput but require careful memory ordering.
  3. Expect latency to spike with contention in the mutex version and stabilize with lock-free techniques.

Chapter 3: Scheduling and Work-Stealing Thread Pools

Fundamentals

A thread pool exists to amortize thread creation costs and to keep CPUs busy. Instead of creating a thread per task, you create a fixed number of worker threads and feed them a queue of tasks. The core challenge is load balancing: some tasks are short, some are long, and some spawn additional tasks. Work stealing is a scheduling strategy where each worker has its own deque (double-ended queue). The worker pushes and pops from the bottom (LIFO for locality), while idle workers steal from the top (FIFO for fairness). This design reduces contention and balances load dynamically. A correct thread pool must also handle shutdown, task cancellation, and backpressure so it does not overwhelm the system.

Deep Dive into the Concept

The simplest thread pool uses a single global queue protected by a mutex. It is easy to implement but quickly becomes a bottleneck: every worker contends for the same lock, and cache lines bounce between cores. Work stealing avoids this by giving each worker its own deque. The owning worker performs fast push/pop operations without locking (or with minimal atomic operations), while thieves use a slower, synchronized path to steal from the top. The classic implementation is the Chase-Lev deque, which uses atomic indices and a carefully ordered sequence of reads and writes to ensure that only one thread can successfully steal a task at a time.

The scheduling policy matters. The owner pops from the bottom (LIFO) so that it continues working on the most recently spawned tasks, which likely have cache locality. Thieves take from the top (FIFO) to prevent starvation and to spread work across workers. This also reduces the probability of two threads repeatedly fighting over the same hot task. Work stealing is designed for divide-and-conquer workloads: tasks recursively spawn smaller tasks. In such workloads, work stealing provides provably good bounds on execution time relative to the optimal scheduler.

However, work stealing is not free. The deque must be lock-free or low-lock, which requires careful memory ordering. The owner thread can often operate without full fences, but thieves require stronger ordering. The algorithm must handle races between the owner popping the last element and a thief stealing it. If not handled correctly, tasks can be lost or executed twice. Furthermore, the granularity of tasks matters: if tasks are too small, overhead dominates; if tasks are too large, load imbalance increases and idle threads waste time. Good thread pool design includes task batching, work chunking, and heuristic thresholds.

Another subtlety is blocking tasks. If a worker thread blocks on I/O, it stops executing tasks and reduces throughput. This is why many runtimes split workers into “CPU-bound” pools and separate “blocking” pools, or use async I/O to avoid blocking altogether. For a simple work-stealing pool, you should at least detect long-running tasks or expose a spawn_blocking() API so blocking work does not stall the pool.

Finally, thread pools require robust shutdown semantics. A naive design might set a global shutdown flag and join threads, but if workers are blocked waiting on an empty queue, they need a condition variable or a wakeup mechanism to notice shutdown. If tasks can spawn additional tasks, you must ensure that the pool does not stop prematurely while spawned tasks are still running. Many pools use a work counter: increment when scheduling a task, decrement when finished, and only allow shutdown when the count is zero.
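
A minimal sketch of that work-counter idea, using assumed names; a production pool would integrate this with its shutdown flag and worker wakeups.

#include <pthread.h>

typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  idle;
    long            outstanding;                  // tasks scheduled but not yet finished
} work_tracker_t;

void task_scheduled(work_tracker_t* w) {
    pthread_mutex_lock(&w->mutex);
    w->outstanding++;
    pthread_mutex_unlock(&w->mutex);
}

void task_finished(work_tracker_t* w) {
    pthread_mutex_lock(&w->mutex);
    if (--w->outstanding == 0)
        pthread_cond_broadcast(&w->idle);         // wake anyone waiting to shut down
    pthread_mutex_unlock(&w->mutex);
}

void wait_until_idle(work_tracker_t* w) {
    pthread_mutex_lock(&w->mutex);
    while (w->outstanding > 0)                    // shutdown may proceed only at zero
        pthread_cond_wait(&w->idle, &w->mutex);
    pthread_mutex_unlock(&w->mutex);
}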

How This Fits Into the Projects

  • Project 2 is a full work-stealing scheduler with per-thread deques.
  • Project 3 reuses thread pool concepts for async task scheduling.

Definitions and Key Terms

  • Thread pool: a fixed set of worker threads that execute tasks.
  • Work stealing: idle workers steal tasks from others to balance load.
  • Deque: double-ended queue supporting push/pop at both ends.
  • Task granularity: size or duration of work units.
  • Work counter: atomic count of outstanding tasks.

Mental Model Diagram

Worker 1 deque          Worker 2 deque          Worker 3 deque
[bottom ... top]        [bottom ... top]        [bottom ... top]
   ^    pop/push            ^    pop/push           ^   pop/push
   |                         |                       |
   +--- thief steals <-------+------- thief steals <--+

How It Works (Step-by-Step)

  1. Pool starts N worker threads, each with a local deque.
  2. A task is pushed onto the local deque of the submitting thread.
  3. The owning worker pops tasks from the bottom (LIFO).
  4. Idle workers attempt to steal from the top of other deques.
  5. When shutdown begins, workers stop accepting new tasks and drain queues.

Invariants:

  • Every task is executed exactly once.
  • Workers only steal from the top; owners only pop from the bottom.
  • The work counter never goes negative.

Failure modes:

  • Lost tasks due to race conditions in the deque.
  • Starvation when tasks are too coarse or stealing is too slow.
  • Deadlock during shutdown if workers sleep forever.

Minimal Concrete Example

// Pseudocode of a Chase-Lev style deque (memory fences and buffer growth omitted)
typedef struct { atomic_int top; atomic_int bottom; task_t* buf[N]; } deque;

void push_bottom(deque* d, task_t* t) {
    int b = atomic_load(&d->bottom);
    d->buf[b % N] = t;
    atomic_store(&d->bottom, b + 1);          // publish the task to thieves
}

task_t* pop_bottom(deque* d) {
    int b = atomic_load(&d->bottom) - 1;
    atomic_store(&d->bottom, b);              // claim the slot before reading top
    int t = atomic_load(&d->top);
    if (t < b) return d->buf[b % N];          // more than one task left: no race possible
    task_t* task = NULL;
    if (t == b) {                             // last task: race with a thief
        if (atomic_compare_exchange_strong(&d->top, &t, t + 1))
            task = d->buf[b % N];             // we won the race
    }
    atomic_store(&d->bottom, b + 1);          // deque is now empty; restore bottom
    return task;
}

Common Misconceptions

  • “A global queue is good enough.” -> It collapses under contention.
  • “Work stealing only helps HPC workloads.” -> It helps any irregular workload.
  • “Tasks can be arbitrarily small.” -> Overhead can dominate if tasks are too fine-grained.

Check-Your-Understanding Questions

  1. Why do owners pop from the bottom while thieves steal from the top?
  2. What happens if tasks block on I/O inside a CPU-bound thread pool?
  3. How does the work counter prevent premature shutdown?

Check-Your-Understanding Answers

  1. Bottom pops preserve locality; top steals preserve fairness and reduce contention.
  2. The pool loses throughput because a worker is blocked and cannot execute other tasks.
  3. The pool only shuts down when the count reaches zero, ensuring no tasks are left.

Real-World Applications

  • Cilk and Intel TBB schedulers
  • Build systems (Bazel, Ninja)
  • Parallel data processing (map/reduce frameworks)

Where You’ll Apply It

  • Project 2: Work-stealing thread pool
  • Project 3: Task scheduling in async runtime

References

  • “Computer Systems: A Programmer’s Perspective” - Ch. 12 (Concurrency overview)
  • “Advanced Programming in the UNIX Environment” - Ch. 11 (Threads)
  • “Rust Atomics and Locks” - Ch. 4 (Lock-free data structures)

Key Insights

Work stealing is a practical way to scale parallelism without a global bottleneck.

Summary

A work-stealing pool balances load by giving each worker its own deque and letting idle threads steal tasks. This design improves locality and reduces contention, but it demands careful synchronization and an understanding of task granularity. A correct pool is as much about shutdown and backpressure as it is about fast scheduling.

Homework/Exercises

  1. Implement a fixed-size thread pool with a global queue and measure its scalability.
  2. Replace the global queue with per-thread deques and add stealing.
  3. Build a benchmark that spawns recursively (Fibonacci or quicksort) and compare throughput.

Solutions

  1. The global queue version will show heavy lock contention beyond 2-4 threads.
  2. Work stealing reduces contention and improves scalability on multicore systems.
  3. Recursive workloads show the biggest improvement because they benefit from locality.

Chapter 4: Async Runtimes and I/O Multiplexing

Fundamentals

Async runtimes exist to handle many concurrent I/O operations without spawning one thread per connection. Instead of blocking on read() or accept(), an async runtime uses an event loop that waits for readiness notifications (reactor pattern) or completion notifications (proactor pattern). Tasks are represented as state machines (futures) that yield when they would block and resume when the I/O is ready. This model scales because a single thread can drive thousands of sockets by multiplexing events. A minimal runtime needs three pieces: a poller (epoll/kqueue/IOCP/io_uring), a scheduler (run queue for tasks), and a timer system (sleep and timeouts). Understanding the boundary between blocking and non-blocking work is the core skill here.

Deep Dive into the Concept

The event loop is built around a kernel interface that reports I/O events. On Linux, epoll provides readiness notification: it tells you a file descriptor is ready to read or write, but you still must perform the read() or write() yourself. On BSD/macOS, kqueue provides similar readiness semantics. On Windows, IOCP is completion-based: you submit an operation and the OS notifies you when it is finished. Linux io_uring adds completion semantics with shared ring buffers: user space posts operations into a submission queue and the kernel writes results into a completion queue, reducing syscall overhead.

The runtime builds a task model on top of this. A task is a state machine that can be paused when it would block. When the task yields, it returns control to the scheduler and registers a “waker” with the poller. When the poller reports readiness, the runtime pushes the task back onto the run queue. This is why async code is often written with await: it hides the state machine from the user. In a minimal runtime you will implement the state machine manually or with explicit callbacks.

The scheduler must decide which tasks to run and in what order. A single-threaded runtime is simple but limited by CPU speed. A multithreaded runtime combines the async model with a thread pool: each worker thread runs an event loop and can steal tasks from others. This hybrid design is what many production runtimes do because it handles both I/O-bound and CPU-bound work. The tricky part is making sure that blocking operations do not stall the event loop. A common solution is to split work into two pools: one for async I/O and another for blocking operations.

Timers are subtle. The runtime must wake tasks when a deadline expires, even if no I/O events occur. This requires a timer wheel or min-heap of deadlines and a way to integrate timeouts into the poller. Many runtimes use timerfd on Linux or kqueue timers on BSD. If no timers are due and no I/O is ready, the event loop should sleep until the next deadline to avoid burning CPU.
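
A minimal sketch of how the poller timeout can be derived from the earliest deadline. The caller is assumed to pass the earliest deadline from its timer heap (or -1 if none) and the current time, both in milliseconds; the result is passed as the timeout argument of epoll_wait or kevent.

// Returns the timeout (in ms) for the poller: -1 blocks until I/O arrives,
// 0 returns immediately because a timer is already due.
int compute_poll_timeout(long long next_deadline_ms, long long now_ms) {
    if (next_deadline_ms < 0) return -1;          // no pending timers: sleep until I/O arrives
    long long delta = next_deadline_ms - now_ms;
    if (delta <= 0) return 0;                     // a deadline already expired: do not sleep
    if (delta > 60 * 1000) delta = 60 * 1000;     // clamp so the loop still wakes up periodically
    return (int)delta;
}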

Backpressure is another crucial concept. If producers create tasks faster than the runtime can execute them, memory usage grows without bound. A robust runtime exposes bounded queues, applies flow control (like TCP windowing), or yields when the system is overloaded. Without backpressure, async code can be worse than synchronous code because the runtime hides the overload until the process crashes.

Finally, correctness in async systems depends on state management. A task can be polled many times, so it must track partial progress and handle spurious readiness. For example, a socket may be reported writable, but a write() still returns EAGAIN if the kernel buffer fills between readiness and the actual write. Your state machine must be prepared for this and re-register interest. This is why many async runtimes include “edge-triggered” vs “level-triggered” modes and complex logic for read/write loops. Small errors in state machines lead to 100 percent CPU spins or deadlocks.
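
A minimal sketch of the read loop that edge-triggered readiness requires: keep reading until EAGAIN, treat EOF and real errors separately, and only then go back to the poller. The on_data callback is an assumed placeholder for whatever the task does with the bytes.

#include <errno.h>
#include <unistd.h>

// Returns 0 to keep the fd registered, -1 when the peer closed or a real error occurred.
int drain_socket(int fd, void (*on_data)(const char*, ssize_t)) {
    char buf[4096];
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) { on_data(buf, n); continue; } // got data, keep draining
        if (n == 0) return -1;                    // EOF: peer closed the connection
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return 0;                             // kernel buffer empty: wait for the next readiness event
        if (errno == EINTR) continue;             // interrupted by a signal: retry
        return -1;                                // real error
    }
}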

How This Fits Into the Projects

  • Project 3 is a mini async runtime that integrates pollers, timers, and a scheduler.
  • Project 4 uses non-blocking I/O and portability concerns across POSIX and Windows.

Definitions and Key Terms

  • Reactor: readiness-based async design (epoll, kqueue).
  • Proactor: completion-based async design (IOCP, io_uring completion).
  • Event loop: loop that waits for I/O events and schedules tasks.
  • Waker: callback or handle used to re-schedule a task.
  • Backpressure: limiting inputs to match system capacity.

Mental Model Diagram

             +------------------+
             |  Task Scheduler  |
             +---------+--------+
                       |
                       v
+---------+     +--------------+     +------------------+
| Tasks   | --> | Event Loop   | --> | OS Poller        |
| Futures |     | (run queue)  |     | epoll/kqueue/... |
+---------+     +--------------+     +------------------+
                       ^
                       |
               +-------+-------+
               | Timers/Clock  |
               +---------------+

How It Works (Step-by-Step)

  1. Initialize poller and run queue.
  2. Register non-blocking file descriptors with the poller.
  3. Pop a task from the run queue and poll it.
  4. If the task needs I/O, register a waker and yield.
  5. Poller wakes tasks when I/O is ready or timers expire.
  6. Repeat until shutdown.

Invariants:

  • A task is only in one of three states: running, ready, or waiting.
  • All I/O is non-blocking or offloaded to a blocking pool.

Failure modes:

  • Busy loops if readiness is mis-handled.
  • Memory blow-up if no backpressure exists.
  • Deadlocks if tasks await each other without a wakeup path.

Minimal Concrete Example

// Pseudo event loop
for (;;) {
    int n = poller_wait(timeout_ms);
    for (int i = 0; i < n; i++) {
        task_t* t = poller_events[i].task;
        enqueue(runq, t);
    }

    while (!runq_empty(runq)) {
        task_t* t = dequeue(runq);
        if (t->poll(t) == PENDING) {
            // task registered a waker and yielded
        }
    }
}

Common Misconceptions

  • “Async means faster.” -> It improves scalability, not necessarily single-request latency.
  • “Non-blocking I/O removes all blocking.” -> CPU work can still block the runtime.
  • “Readiness means the operation will succeed.” -> It can still return EAGAIN.

Check-Your-Understanding Questions

  1. What is the difference between readiness and completion notifications?
  2. Why do async runtimes need a timer subsystem?
  3. How can backpressure prevent memory blow-up?

Check-Your-Understanding Answers

  1. Readiness tells you that I/O is possible; completion tells you it is done.
  2. Tasks must wake up when deadlines expire even without I/O events.
  3. It limits the number of outstanding tasks so the system stays bounded.

Real-World Applications

  • Network servers handling thousands of sockets
  • Event-driven databases and caches
  • GUI event loops and game engines

Where You’ll Apply It

  • Project 3: Mini async runtime
  • Project 4: Cross-platform I/O abstractions

References

  • “Advanced Programming in the UNIX Environment” - Ch. 14 (Advanced I/O)
  • “Computer Systems: A Programmer’s Perspective” - Ch. 10 (System-Level I/O)
  • io_uring manual (Linux) https://man.archlinux.org/man/io_uring.7
  • epoll manual (Linux) https://man7.org/linux/man-pages/man7/epoll.7.html

Key Insights

Async runtimes are schedulers layered on top of kernel I/O APIs; correctness comes from careful state management and backpressure.

Summary

A minimal async runtime combines an OS poller, a scheduler, and timers. The hardest parts are not the syscalls but the state machine logic: handling spurious readiness, balancing work, and avoiding blocking. Once you understand that, you can reason about more advanced runtimes like Tokio or libuv.

Homework/Exercises

  1. Implement a tiny event loop using epoll that handles multiple TCP connections.
  2. Add a timer wheel or heap to support sleep(ms) for tasks.
  3. Simulate backpressure by limiting the number of in-flight tasks and measure memory usage.

Solutions

  1. A correct loop registers sockets with EPOLLIN, accepts connections, and reads until EAGAIN.
  2. The timer structure should wake tasks when the next deadline is reached, even without I/O.
  3. With backpressure, memory usage stays stable and latency increases gradually instead of crashing.

Chapter 5: ABI, FFI, and Cross-Platform Syscall Boundaries

Fundamentals

An ABI (Application Binary Interface) is the contract that lets separately compiled code work together. It defines how functions pass arguments (calling convention), how structs are laid out in memory, how symbols are named and exported, and which registers must be preserved across calls. When you build a systems library, your ABI is the product: it must stay stable so users can upgrade without recompiling their entire application. Cross-platform libraries add another layer: Windows uses different system calls, different data structures, and different error codes than POSIX. You must decide what your library exposes, how errors map, and how to keep behavior consistent across platforms without hiding critical differences.

Deep Dive into the Concept

At the machine level, the ABI defines stack layout, register usage, and alignment constraints. On x86-64 System V (Linux/macOS), the first arguments are passed in registers (RDI, RSI, RDX, RCX, R8, R9) and the stack is 16-byte aligned at call boundaries. On Windows x64, the calling convention uses RCX, RDX, R8, R9 for arguments and reserves “shadow space” on the stack. These differences mean that inline assembly or FFI must be explicitly aware of the platform. If you mis-handle alignment or argument order, the program may crash in ways that look like random memory corruption.

Struct layout is another ABI boundary. The C ABI defines padding rules based on alignment; compilers insert padding between fields to satisfy alignment constraints. If you expose a struct in a public header, its layout becomes part of your ABI. Changing the order of fields or types can break binary compatibility. This is why many libraries hide their internal structs behind opaque pointers and expose accessor functions. It lets you change internal representation without breaking users.
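
A small illustrative program showing why layout matters: reordering the same three fields changes padding, offsets, and total size. The byte counts in the comments assume a typical x86-64 ABI.

#include <stdio.h>
#include <stddef.h>

struct v1 { char tag; double value; int id; };   // 1 + 7 pad + 8 + 4 + 4 pad = 24 bytes (typical x86-64)
struct v2 { double value; int id; char tag; };   // 8 + 4 + 1 + 3 pad = 16 bytes (typical x86-64)

int main(void) {
    printf("v1: size=%zu offset(value)=%zu\n", sizeof(struct v1), offsetof(struct v1, value));
    printf("v2: size=%zu offset(value)=%zu\n", sizeof(struct v2), offsetof(struct v2, value));
    return 0;
}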

Dynamic linking adds another layer. A shared library exports symbols; the dynamic linker resolves them at load time. Symbol visibility controls which functions are public, and versioning (SONAME, symbol versions) allows multiple versions to coexist. If you ship a library used by many programs, you must treat compatibility as a first-class requirement. Small changes like modifying a function signature, changing struct size, or altering enum values can silently break ABI. Good design uses versioned symbols, feature detection, and defensive programming to keep old binaries working.
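
A minimal sketch of explicit symbol visibility with the GCC/Clang visibility attribute; the macro and function names are hypothetical. Combined with -fvisibility=hidden, only the marked entry points become part of the shared library's exported ABI.

#define MYLIB_API __attribute__((visibility("default")))

MYLIB_API int mylib_do_work(int x);              // exported: part of the stable ABI

__attribute__((visibility("hidden")))
int mylib_internal_helper(int x);                // not exported: free to change or remove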

Portability is not just about function names. The semantics of syscalls can differ. For example, POSIX read returns 0 on EOF, while Windows APIs often return FALSE and use GetLastError() for error details. File descriptors on POSIX are small integers; Windows uses opaque HANDLEs. Even when functions look similar, their guarantees may differ (e.g., rename semantics on Windows vs POSIX). A cross-platform syscall abstraction must define its own semantics and translate platform specifics into that contract.

Feature detection is another ABI concern. A library might want to use io_uring on Linux, but only if the kernel supports it. This requires runtime detection and fallback paths (epoll or poll). Similarly, CPU features (SSE, AVX) must be detected before using SIMD intrinsics, or the program will crash on older CPUs. The ABI should either expose a capability query API or select the appropriate implementation internally.

Finally, the ABI is a security boundary. A mismatch between caller and callee can corrupt the stack, leak memory, or allow arbitrary code execution. This is why stable ABIs are conservative and why system libraries evolve slowly. The safest approach is to separate API (source-level contract) from ABI (binary-level contract) and to treat any ABI change as a major version bump.

How This Fits Into the Projects

  • Project 4 is entirely about ABI stability and cross-platform syscalls.
  • Project 5 and Project 1 rely on alignment rules and calling conventions for correctness.

Definitions and Key Terms

  • ABI: Binary contract defining calling convention, struct layout, and symbol names.
  • API: Source-level contract describing functions and types.
  • Opaque struct: A type whose layout is hidden from users to preserve ABI.
  • Symbol visibility: Whether a function is exported from a shared library.
  • SONAME: Shared object name used for versioning in ELF binaries.

Mental Model Diagram

App code  -->  Library API  -->  ABI boundary  -->  OS syscalls
  (C)            (headers)      (binary layout)     (platform-specific)

How It Works (Step-by-Step)

  1. You design a stable public header (API) with opaque types.
  2. The compiler produces binary objects using the platform ABI.
  3. The dynamic linker resolves symbols at runtime.
  4. The library translates API calls into platform-specific syscalls.
  5. Errors are normalized into a consistent error model.

Invariants:

  • Public structs never change size without a major version bump.
  • Symbol names and calling conventions remain stable across releases.

Failure modes:

  • ABI mismatch causes stack corruption or crashes.
  • Struct layout changes break binary compatibility.
  • Platform-specific semantics leak through the abstraction.

Minimal Concrete Example

// Public header (stable ABI)
typedef struct xplat_file xplat_file_t; // opaque

xplat_file_t* xplat_open(const char* path, int flags, int* err);
int xplat_read(xplat_file_t* f, void* buf, size_t n, int* err);
int xplat_close(xplat_file_t* f, int* err);

// Implementation hides platform-specific structs and handles

Common Misconceptions

  • “If it compiles, the ABI is stable.” -> ABI stability is about binary compatibility, not compilation.
  • “Struct layout is always the same.” -> It varies by compiler, flags, and architecture.
  • “Windows and POSIX errors map cleanly.” -> Many error codes do not have 1:1 mappings.

Check-Your-Understanding Questions

  1. Why are opaque structs useful for ABI stability?
  2. What happens if you reorder fields in a public struct?
  3. How do you design a portability layer for syscalls with different semantics?

Check-Your-Understanding Answers

  1. They allow internal changes without changing the public binary layout.
  2. The binary layout changes, breaking existing compiled code.
  3. Define a consistent API contract and translate platform specifics behind it.

Real-World Applications

  • libc and libstdc++ ABI guarantees
  • Cross-platform libraries like libuv or SDL
  • Plugin architectures where modules are compiled separately

Where You’ll Apply It

  • Project 4: Cross-platform syscall abstraction
  • Project 1: allocator as a shared library (symbol visibility)
  • Project 5: SIMD feature detection and ABI-safe APIs

References

  • “Computer Systems: A Programmer’s Perspective” - Ch. 3 (Machine-Level), Ch. 7 (Linking)
  • “Advanced Programming in the UNIX Environment” - Ch. 2-3 (Standardization and file I/O)
  • “C Interfaces and Implementations” - Design philosophy for stable interfaces

Key Insights

The ABI is a contract you must treat as permanent; breaking it breaks every user.

Summary

ABI design sits at the intersection of machine-level calling conventions, compiler layout rules, and platform-specific system semantics. A robust library hides internal details, exposes stable APIs, and translates platform quirks into consistent behavior. Treat ABI compatibility as a first-class feature, not an afterthought.

Homework/Exercises

  1. Create a shared library that exports one function and verify its symbol table.
  2. Break ABI compatibility by changing a struct layout and observe the crash.
  3. Design a small API that hides its internal structs behind opaque pointers.

Solutions

  1. Use nm -D to inspect exported symbols and verify visibility.
  2. The program will compile but crash or misbehave due to layout mismatch.
  3. Opaque pointers preserve ABI while letting you evolve the internal data structures.

Chapter 6: Performance Engineering and SIMD String Search

Fundamentals

Performance engineering is the discipline of matching algorithm design to hardware reality. Big-O complexity matters, but real systems are often dominated by memory bandwidth, cache misses, and branch mispredictions. A fast system library is usually a set of careful trade-offs: minimize cache misses, avoid unpredictable branches, and use SIMD to process multiple bytes at once. String search is a perfect case study because the naive algorithm is O(n*m) but often fast enough for short patterns, while advanced algorithms like Boyer-Moore or Two-Way can skip work at the cost of more setup. SIMD intrinsics let you compare 16, 32, or 64 bytes in a single instruction, which can turn a bottleneck into a non-issue if you avoid undefined behavior and alignment pitfalls.

Deep Dive into the Concept

Modern CPUs are wide and deep. They execute multiple instructions per cycle, speculate on branches, and rely on caches to keep data close. If your data is not in cache, a single memory access can take hundreds of cycles. This is why memory access patterns often dominate performance. An algorithm that touches memory sequentially (streaming) can be much faster than one that jumps around, even if the latter does fewer comparisons. This insight is fundamental to high-performance string search.

Traditional string search algorithms include:

  • Naive: compare the pattern at every position; simple but branch-heavy.
  • Boyer-Moore: uses bad-character and good-suffix heuristics to skip ahead; excellent for long patterns but has preprocessing cost.
  • Two-Way algorithm: used by glibc memmem for many cases; combines forward and backward scans with a critical factorization.
  • KMP: linear-time but often slower in practice due to extra branches and memory accesses.

In real tools like ripgrep, the fastest path often starts with a SIMD scan for the first byte of the needle. SIMD allows you to compare 16 or 32 bytes at a time, generating a bitmask of candidate positions. Only when a candidate appears do you perform a full comparison. This combines the best of both worlds: streaming memory access and minimal branch misprediction. However, SIMD introduces constraints: you must handle unaligned loads safely (_mm_loadu_si128), avoid reading past the end of the buffer (or use masked loads on newer instruction sets), and ensure that your pointer arithmetic respects the C standard. Reading beyond the buffer, even if you do not use the bytes, is undefined behavior and can crash if it crosses a page boundary.
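
A minimal sketch of that scan-then-verify structure in portable C, with memchr standing in for the SIMD first-byte scan (the vectorized scan itself appears in the Minimal Concrete Example below).

#include <string.h>

const char* find_substring(const char* hay, size_t hay_len,
                           const char* needle, size_t needle_len) {
    if (needle_len == 0) return hay;                              // empty needle matches at the start
    if (needle_len > hay_len) return NULL;
    const char* p = hay;
    const char* end = hay + hay_len - needle_len + 1;             // one past the last valid start position
    while (p < end) {
        const char* cand = memchr(p, needle[0], (size_t)(end - p)); // fast candidate scan for the first byte
        if (!cand) return NULL;
        if (memcmp(cand, needle, needle_len) == 0) return cand;     // rare full verification
        p = cand + 1;
    }
    return NULL;
}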

Performance engineering also requires measurement discipline. Microbenchmarks can mislead because they fit in cache and do not represent real workloads. You need benchmarks on large files, random patterns, and realistic encodings (UTF-8). You also need profiling tools like perf and flame graphs to see where time is spent. Sometimes the best optimization is to change the API: for example, accepting a precomputed needle structure can amortize preprocessing across many searches.

Branch prediction and vectorization interact. A branch-heavy algorithm can be slower than a vectorized one even if it does fewer comparisons because the CPU mispredicts and flushes the pipeline. This is why branchless techniques (bitwise comparisons, SIMD) often win. Yet SIMD also introduces portability concerns: AVX2 may not be available on all machines. A robust library must detect CPU features at runtime, select the correct implementation, and provide a safe fallback.
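
A minimal sketch of runtime dispatch, assuming AVX2/SSE2/scalar implementations exist elsewhere: detect CPU features once at startup, then call through a function pointer so the hot path pays no per-call feature checks.

#include <stddef.h>

typedef int (*find_fn)(const char* hay, size_t len, char c);

int find_first_byte_scalar(const char* hay, size_t len, char c);  // assumed to exist
int find_first_byte_sse2(const char* hay, size_t len, char c);    // assumed to exist
int find_first_byte_avx2(const char* hay, size_t len, char c);    // assumed to exist

static find_fn find_impl;                         // selected once, used everywhere

void search_init(void) {
    find_impl = find_first_byte_scalar;           // safe default
#if defined(__x86_64__)
    if (__builtin_cpu_supports("avx2"))       find_impl = find_first_byte_avx2;
    else if (__builtin_cpu_supports("sse2"))  find_impl = find_first_byte_sse2;
#endif
}

int find_first_byte_fast(const char* hay, size_t len, char c) {
    return find_impl(hay, len, c);                // one indirect call, no per-call feature test
}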

Finally, performance is not just speed. It is also memory usage, latency distribution, and predictability. A 10x speedup in the common case but a 100x slowdown in a corner case may be unacceptable in production. A good library defines its performance envelope and tests for regressions. This is why you will build benchmarks and include them as part of the definition of done.

How This Fits Into the Projects

  • Project 5 is the core performance project: SIMD string search with real benchmarks.
  • Project 1 uses cache-aware design in its allocator fast path.
  • Project 3 uses backpressure to avoid latency collapse under load.

Definitions and Key Terms

  • Cache line: smallest unit of cache transfer, usually 64 bytes.
  • Branch misprediction: CPU guesses wrong branch, flushing the pipeline.
  • SIMD: single-instruction multiple-data, vectorized operations.
  • Throughput: amount of data processed per unit time (MB/s).
  • Latency: time taken for a single operation.

Mental Model Diagram

Data stream -> [SIMD scan for first byte] -> [candidate positions] -> [full compare]
             (fast, linear)                 (few branches)         (rare)

How It Works (Step-by-Step)

  1. Detect CPU features (SSE4.2, AVX2, AVX-512).
  2. Choose the fastest available search implementation.
  3. Scan the haystack in vector-sized chunks for the first byte.
  4. For each candidate position, verify the full needle.
  5. Fall back to a scalar algorithm for edge cases or short buffers.

Invariants:

  • Never read beyond the end of the buffer without guard handling.
  • Always produce correct results, even for UTF-8 data.

Failure modes:

  • Undefined behavior from out-of-bounds SIMD loads.
  • Incorrect matches due to encoding boundaries.
  • Speed regressions due to branchy fallbacks.

Minimal Concrete Example

#include <immintrin.h>
#include <stddef.h>

// Find the first occurrence of byte c using SSE2 (16 bytes at a time).
// Returns its index, or -1 if absent. Assumes len fits in an int.
int find_first_byte(const char* hay, size_t len, char c) {
    __m128i needle = _mm_set1_epi8(c);
    size_t i = 0;
    for (; i + 16 <= len; i += 16) {
        // Unaligned load; the loop bound guarantees we never read past hay + len.
        __m128i block = _mm_loadu_si128((const __m128i*)(hay + i));
        __m128i cmp   = _mm_cmpeq_epi8(block, needle);
        int mask = _mm_movemask_epi8(cmp);
        if (mask != 0) return (int)(i + (size_t)__builtin_ctz(mask));
    }
    for (; i < len; i++) {                   // scalar tail for the last < 16 bytes
        if (hay[i] == c) return (int)i;
    }
    return -1;
}

Common Misconceptions

  • “Big-O is all that matters.” -> Constant factors and memory access dominate in practice.
  • “SIMD always helps.” -> It can hurt for very small inputs or unaligned data.
  • “Branchless is always faster.” -> It depends on data distribution and CPU.

Check-Your-Understanding Questions

  1. Why can a linear-time algorithm be slower than a sublinear one in practice?
  2. How does SIMD reduce branch misprediction?
  3. Why is reading past the end of a buffer undefined behavior?

Check-Your-Understanding Answers

  1. Linear-time algorithms may have worse cache behavior and more branches.
  2. SIMD compares many bytes at once and uses bitmasks instead of per-byte branches.
  3. The C standard does not allow out-of-bounds reads; it can cross into unmapped pages.

Real-World Applications

  • ripgrep and other code search tools
  • Malware scanners and IDS systems
  • High-performance log processing pipelines

Where You’ll Apply It

  • Project 5: High-performance string search library
  • Project 1: allocator fast path and cache alignment

References

  • “Computer Systems: A Programmer’s Perspective” - Ch. 6 (Memory Hierarchy)
  • “Algorithms, 4th Edition” by Sedgewick and Wayne - Ch. 5 (String Algorithms)
  • “Modern X86 Assembly Language Programming” by Daniel Kusswurm - SIMD chapters

Key Insights

Performance is a hardware-aware discipline; algorithms succeed when they respect caches, branches, and vector width.

Summary

High-performance string search is about combining good algorithms with the realities of CPU pipelines and caches. SIMD can deliver massive gains, but only if you handle alignment, bounds, and feature detection correctly. The result is a library that is fast, predictable, and robust.

Homework/Exercises

  1. Benchmark naive, Boyer-Moore, and Two-Way search on different pattern sizes.
  2. Add an SSE2 path and measure speedup on large files.
  3. Implement a runtime dispatch table for AVX2 vs SSE2 vs scalar.

Solutions

  1. Naive is often fastest for very short patterns; Two-Way wins for medium sizes.
  2. SIMD should deliver 3-10x speedups depending on memory bandwidth.
  3. A function pointer table selected at startup keeps the hot path fast.

Chapter 7: Durable Storage Engines (WAL + LSM)

Fundamentals

Durable storage is about making data survive crashes, power loss, and restarts. A key-value store that loses data is worse than no store at all. The foundational technique is write-ahead logging (WAL): write every mutation to an append-only log, sync it to disk, then update in-memory structures. On restart, the log is replayed to restore state. LSM trees (Log-Structured Merge trees) take this idea further by making writes sequential and deferring expensive sorting or merging to background compaction. This design excels at write-heavy workloads and uses immutable on-disk tables (SSTables). The core challenge is designing an on-disk format, compaction strategy, and recovery process that are correct and performant.

Deep Dive into the Concept

Durability requires understanding the difference between data persistence and acknowledgement. When you return success to the user, you must be sure the data is on stable storage. The OS buffers writes, so you must call fsync() (or fdatasync()) to flush data to disk. For a database, a single missing fsync() can cause silent data loss after a crash. But fsync() is expensive, so you must balance durability with throughput. Most systems batch writes: they append multiple operations to a WAL, then call fsync() once per batch. This provides an acceptable durability window (e.g., 10ms) while maintaining high throughput.

The WAL itself needs a robust format. A typical entry includes a length prefix, the key/value bytes, and a checksum (CRC32 or XXH3). The checksum detects torn writes and partial entries after a crash. During recovery, you replay entries in order until the first invalid checksum. This restores the in-memory state to a consistent point. You also need to handle idempotency: replaying the same entry twice should not corrupt the state. Most engines treat the WAL as an append-only log of operations, so a full replay applies the same operations in the same order and reproduces the same state.
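
A minimal recovery sketch, assuming the [klen][vlen][key][value][crc32] layout described above; wal_crc32() and memtable_put() are placeholders for your own checksum and memtable insert. It returns the offset of the last valid entry so the caller can truncate a torn tail.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

uint32_t wal_crc32(const void* key, uint32_t klen, const void* val, uint32_t vlen);
void memtable_put(const void* key, uint32_t klen, const void* val, uint32_t vlen);

long wal_replay(FILE* f) {
    long valid_end = 0;
    for (;;) {
        uint32_t klen, vlen, crc;
        if (fread(&klen, 4, 1, f) != 1 || fread(&vlen, 4, 1, f) != 1) break;
        char* key = malloc(klen + 1u);
        char* val = malloc(vlen + 1u);
        if (!key || !val ||
            fread(key, 1, klen, f) != klen || fread(val, 1, vlen, f) != vlen ||
            fread(&crc, 4, 1, f) != 1 || crc != wal_crc32(key, klen, val, vlen)) {
            free(key); free(val);
            break;                           // torn or corrupt tail: stop replaying
        }
        memtable_put(key, klen, val, vlen);
        free(key); free(val);
        valid_end = ftell(f);                // end of the last valid entry
    }
    return valid_end;
}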

LSM trees build on the WAL by introducing a memtable (an in-memory sorted structure such as a skip list or tree). Writes go to the WAL and then to the memtable. When the memtable reaches a size threshold, it is frozen and flushed to disk as an immutable SSTable. SSTables are sorted by key, enabling binary search. Because they are immutable, they can be safely read concurrently without locks. The downside is that reads may need to check multiple SSTables, which is why bloom filters are used to quickly test if a key might be present.
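
The read path described above can be sketched as follows; the types and lookup helpers are assumptions for illustration, not a fixed API. Tables are checked newest to oldest, and the bloom filter lets most tables be skipped without touching disk.

#include <stdbool.h>
#include <stddef.h>

typedef struct memtable memtable_t;
typedef struct sstable  sstable_t;

bool memtable_get(const memtable_t* mt, const char* key, size_t klen, char** val);
bool bloom_may_contain(const sstable_t* t, const char* key, size_t klen);
bool sstable_get(const sstable_t* t, const char* key, size_t klen, char** val);

bool kv_get(const memtable_t* mt, sstable_t* const* tables, size_t ntables,
            const char* key, size_t klen, char** val) {
    if (memtable_get(mt, key, klen, val)) return true;         // freshest data wins
    for (size_t i = 0; i < ntables; i++) {                     // newest first
        if (!bloom_may_contain(tables[i], key, klen)) continue; // cheap negative test
        if (sstable_get(tables[i], key, klen, val)) return true;
    }
    return false;
}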

Compaction is the heart of LSM performance. Over time, SSTables accumulate and overlap in key ranges. Compaction merges tables into larger, non-overlapping tables, discarding obsolete entries. This improves read performance but introduces write amplification: a single logical write may be rewritten many times as tables merge. The compaction strategy (leveled vs tiered) determines the trade-off between write amplification, read amplification, and space amplification. Leveled compaction keeps tables small and reduces read amplification but increases write cost. Tiered compaction batches writes and reduces write amplification but increases read cost. A real engine must choose based on workload.

Crash consistency extends beyond the WAL. You must ensure directory entries are persisted when new files are created (often by calling fsync() on the directory). You must also ensure that partially written SSTables are never used; the usual approach is to write to a temp file and then publish it with an atomic rename(). On restart, you scan the directory for valid tables, verify checksums, and reconstruct metadata. If your process crashes during compaction, you must be able to detect incomplete output and safely discard it.
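
A minimal publish sketch of that temp-file-plus-rename protocol on POSIX; write_sstable_contents() is a placeholder for serializing the table.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int write_sstable_contents(int fd);

int sstable_publish(const char* dir, const char* tmp_path, const char* final_path) {
    int fd = open(tmp_path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) return -1;
    if (write_sstable_contents(fd) != 0 || fsync(fd) != 0) { close(fd); return -1; }
    close(fd);
    if (rename(tmp_path, final_path) != 0) return -1;  // atomic replacement
    int dfd = open(dir, O_RDONLY);                     // now persist the new name
    if (dfd < 0) return -1;
    int rc = fsync(dfd);
    close(dfd);
    return rc;
}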

Memory management and concurrency matter too. The memtable might be accessed by readers while a writer is inserting new entries. A common solution is to use a single-writer lock with many readers, or to use immutable memtables for readers while a new memtable becomes the active writer. The WAL can be written by a single thread to avoid interleaving. The background compaction thread must coordinate with readers so that SSTables are not deleted while in use. Reference counting or epoch-based reclamation is often used to manage this safely.

Finally, performance and durability must be validated. You need crash tests: kill the process at random points, restart, and verify the data. You also need benchmarks that measure write throughput, read latency, and compaction cost. Without these tests, you will not know whether your design is correct or whether it silently loses data.

How This Fits on Projects

  • Project 6 implements a WAL + LSM-based embedded key-value store.
  • Project 1 provides allocator strategies for memtables and caches.

Definitions and Key Terms

  • WAL (Write-Ahead Log): append-only log written before applying changes.
  • SSTable: immutable sorted table stored on disk.
  • Memtable: in-memory sorted structure for recent writes.
  • Compaction: merging SSTables to reduce overlap and garbage.
  • Write amplification: extra I/O caused by rewriting data during compaction.

Mental Model Diagram

Write path:
 client -> WAL (append+fsync) -> Memtable -> flush -> SSTable

Read path:
 client -> Memtable -> SSTable L0 -> SSTable L1 -> ...

How It Works (Step-by-Step)

  1. Append operation to WAL and fsync() if required.
  2. Insert into memtable (in-memory sorted structure).
  3. When memtable is full, freeze it and create a new memtable.
  4. Flush frozen memtable to disk as an SSTable.
  5. Run background compaction to merge overlapping SSTables.
  6. On startup, replay WAL and rebuild memtable state.

Invariants:

  • A write is durable only after WAL sync.
  • SSTables are immutable once written.
  • Compaction never deletes data that is newer than the output table.

Failure modes:

  • Missing fsync() causes acknowledged writes to be lost.
  • Corrupted WAL entries break recovery without checksums.
  • Compaction race can delete still-in-use files.

Minimal Concrete Example

#include <stdint.h>
#include <unistd.h>

// WAL entry: [klen][vlen][key][value][crc32]
// wal_crc32() stands in for your checksum (CRC32 or XXH3) over key+value bytes.
uint32_t wal_crc32(const void* key, uint32_t klen, const void* val, uint32_t vlen);

static int wal_write(int fd, const void* p, size_t n) {
    return write(fd, p, n) == (ssize_t)n ? 0 : -1;   // treat short writes as errors
}

int wal_append(int fd, const void* key, uint32_t klen,
               const void* val, uint32_t vlen) {
    uint32_t crc = wal_crc32(key, klen, val, vlen);
    if (wal_write(fd, &klen, 4) || wal_write(fd, &vlen, 4) ||
        wal_write(fd, key, klen) || wal_write(fd, val, vlen) ||
        wal_write(fd, &crc, 4)) return -1;
    return fsync(fd); // durability boundary: only now acknowledge the write
}

Common Misconceptions

  • “If I write a file, it is durable.” -> Not until fsync() completes.
  • “Compaction is just cleanup.” -> It is the core performance trade-off.
  • “Checksums are optional.” -> Without them, you cannot detect torn writes.

Check-Your-Understanding Questions

  1. Why must the WAL be synced before acknowledging a write?
  2. What is the role of compaction in an LSM tree?
  3. How do bloom filters reduce read amplification?

Check-Your-Understanding Answers

  1. Without a synced WAL, a crash can lose acknowledged writes.
  2. Compaction reduces overlapping SSTables, improving reads at the cost of extra writes.
  3. Bloom filters quickly rule out SSTables that definitely do not contain the key.

Real-World Applications

  • RocksDB, LevelDB, and other embedded key-value stores
  • Time-series databases and logging systems
  • Streaming storage layers in analytics pipelines

Where You’ll Apply It

  • Project 6: Embedded key-value store with WAL + LSM

References

  • “Operating Systems: Three Easy Pieces” - Ch. 40-42 (File Systems and Crash Consistency)
  • “Computer Systems: A Programmer’s Perspective” - Ch. 10 (System-Level I/O)
  • PostgreSQL WAL overview (official docs) https://www.postgresql.org/docs/current/wal-intro.html

Key Insights

Durability is an explicit protocol: log first, sync, then apply. Everything else builds on that guarantee.

Summary

A durable storage engine is defined by its crash-recovery model. The WAL provides durability, the memtable provides speed, and the SSTable/compaction pipeline provides scalable storage. Correctness depends on strict ordering, checksums, and careful file management. Once you can reason about that pipeline, you can build real storage systems.

Homework/Exercises

  1. Build a WAL-only key-value store and test crash recovery.
  2. Add a memtable and flush to immutable SSTables.
  3. Implement a simple compaction pass and measure write amplification.

Solutions

  1. WAL-only stores are durable but slow for reads.
  2. SSTables provide fast reads with a binary search index.
  3. Compaction improves reads but increases total bytes written; measure the ratio.

Glossary (High-Signal)

  • ABI: Binary contract that defines calling conventions, struct layout, and symbol names.
  • Arena: Memory pool used for fast allocations and bulk frees.
  • Backpressure: Mechanism to prevent producers from overwhelming consumers.
  • Bloom filter: Probabilistic set membership filter with false positives but no false negatives.
  • Compaction: Merging immutable files to reduce overlap and improve read performance.
  • Deque: Double-ended queue, used in work-stealing schedulers.
  • Fsync: System call that forces buffered writes to stable storage.
  • Memtable: In-memory sorted structure for recent database writes.
  • SSTable: Immutable on-disk sorted table in LSM storage engines.
  • Thread-local storage (TLS): Per-thread data storage used to avoid contention.
  • Undefined behavior (UB): Behavior not defined by the language standard, often exploitable.

Why Systems Libraries and Runtimes Matter

The Modern Problem It Solves

Modern software stacks depend on libraries that are invisible but decisive: the allocator that controls memory fragmentation, the thread pool that prevents CPU starvation, the runtime that scales I/O to thousands of connections, and the ABI boundary that keeps your system stable across upgrades. When these layers are wrong, the application fails even if the business logic is correct. When they are right, the application feels fast, reliable, and secure.

Real-world impact (recent data):

  • Chromium memory safety: Chromium reports that around 70% of high-severity security bugs are memory safety issues, and about half are use-after-free bugs (analysis of 912 high/critical bugs since 2015). Source: https://www.chromium.org/Home/chromium-security/memory-safety
  • Zero-day exploitation: Google estimates that 75% of CVEs used in zero-day exploits are memory safety vulnerabilities (October 2024). Source: https://security.googleblog.com/2024/10/
  • Android security shift: Google reported the share of Android vulnerabilities caused by memory safety dropped from 76% in 2019 to 24% in 2024 as new code moved to memory-safe languages. Source: https://security.googleblog.com/2024/09/eliminating-memory-safety-vulnerabilities-Android.html

What this implies for systems programmers: memory safety, performance, and concurrency are not optional features. They are the difference between secure, scalable systems and systems that silently fail under load.

OLD APPROACH                            NEW APPROACH
+----------------------+                +----------------------+
| App-level patching   |                | Library-level design |
| (fix bugs after)     |                | (prevent classes)    |
+----------+-----------+                +----------+-----------+
           |                                     |
           v                                     v
  recurring incidents                  predictable systems
  slow hot paths                        stable ABI + fast I/O

Context and Evolution (Optional)

  • jemalloc became the default allocator in FreeBSD and is widely used in performance-critical systems.
  • Work-stealing schedulers rose with multicore CPUs to keep cores busy without global queues.
  • Async runtimes evolved from select/poll to epoll/kqueue and now io_uring for lower overhead.

Concept Summary Table

This section provides a map of the mental models you will build during these projects.

  • Memory Allocation and UB: How allocators manage metadata, fragmentation, alignment, and how UB breaks invariants.
  • Concurrency Primitives: How mutexes, atomics, and memory ordering define correctness and performance.
  • Work-Stealing Scheduling: How deques and stealing balance workloads and avoid global contention.
  • Async Runtimes and I/O: How event loops, pollers, timers, and backpressure scale I/O.
  • ABI and Portability: How calling conventions, struct layout, and error mapping keep binaries stable.
  • Performance and SIMD: How caches, branch prediction, and vectorization drive real performance.
  • Durable Storage Engines: How WAL, memtables, SSTables, and compaction provide crash safety.

Project-to-Concept Map

  • Project 1: Custom Memory Allocator -> A drop-in allocator with arenas and size classes. Primer chapters: Ch. 1, Ch. 2, Ch. 5, Ch. 6
  • Project 2: Work-Stealing Thread Pool -> A scheduler with per-thread deques and stealing. Primer chapters: Ch. 2, Ch. 3, Ch. 6
  • Project 3: Mini Async Runtime -> Event loop, poller, timers, task scheduler. Primer chapters: Ch. 2, Ch. 3, Ch. 4
  • Project 4: Cross-Platform Syscall Abstraction -> Portable syscall API and ABI-safe interfaces. Primer chapters: Ch. 4, Ch. 5
  • Project 5: High-Performance String Search -> SIMD-accelerated search with benchmarks. Primer chapters: Ch. 1, Ch. 6
  • Project 6: Embedded Key-Value Store -> WAL + LSM storage engine. Primer chapters: Ch. 1, Ch. 2, Ch. 7

Deep Dive Reading by Concept

Memory and Allocation

  • Virtual memory and allocators: “Computer Systems: A Programmer’s Perspective” - Ch. 9 (OS-level foundation for malloc behavior)
  • Arena and allocator design: “C Interfaces and Implementations” - Ch. 5-6 (practical allocator patterns used in systems libraries)
  • Storage allocator example: “The C Programming Language” - Ch. 8.7 (minimal allocator implementation for intuition)

Concurrency and Scheduling

  • Threads and synchronization: “Advanced Programming in the UNIX Environment” - Ch. 11-12 (core POSIX threading primitives and patterns)
  • Memory ordering basics: “Rust Atomics and Locks” - Ch. 1-3 (clear explanations of atomics and ordering)
  • Concurrent programming overview: “Computer Systems: A Programmer’s Perspective” - Ch. 12 (high-level concurrency model and pitfalls)

Async I/O and Runtimes

  • System-level I/O: “Computer Systems: A Programmer’s Perspective” - Ch. 10 (buffering, descriptors, and I/O semantics)
  • Advanced I/O patterns: “Advanced Programming in the UNIX Environment” - Ch. 14 (select, poll, and event-driven I/O patterns)

ABI and Portability

  • Calling conventions and stack layout: “Computer Systems: A Programmer’s Perspective” - Ch. 3 (ABI foundations for correct FFI)
  • Linking and shared libraries: “Computer Systems: A Programmer’s Perspective” - Ch. 7 (symbol resolution and shared library mechanics)

Performance and SIMD

  • Memory hierarchy: “Computer Systems: A Programmer’s Perspective” - Ch. 6 (cache-aware optimization)
  • String algorithms: “Algorithms, 4th Edition” - Ch. 5 (foundational string search algorithms)
  • SIMD fundamentals: “Modern X86 Assembly Language Programming” - SIMD chapters (practical vectorization for real speedups)

Storage and Durability

  • Crash consistency: “Operating Systems: Three Easy Pieces” - Ch. 40-42 (file system crash models and durability)
  • System-level I/O: “Computer Systems: A Programmer’s Perspective” - Ch. 10 (understanding fsync and buffering)

Quick Start: Your First 48 Hours

Feeling overwhelmed? Start here instead of reading everything.

Day 1 (4 hours):

  1. Read Chapter 1 (Memory Allocation) and Chapter 2 (Concurrency).
  2. Skim the Project Overview Table and pick Project 1 or Project 2.
  3. Build the minimal allocator or a basic thread pool (Hint 1 for the project).
  4. Do not optimize yet; just get a demo working.

Day 2 (4 hours):

  1. Add basic benchmarks (allocs/sec or tasks/sec).
  2. Run perf or time to measure a baseline.
  3. Read the “Core Question” of your chosen project and write a 3-sentence answer.
  4. Commit your progress and write a short README.

End of weekend: You can explain why a naive allocator or thread pool fails under load. That is 80% of the mental model for the rest of the track.


Path 1: The Foundations Builder

Best for: C developers new to runtime internals

  1. Project 1 (Allocator) - foundational memory model
  2. Project 2 (Thread Pool) - concurrency and scheduling
  3. Project 3 (Async Runtime) - I/O scaling
  4. Project 4 (Syscall Abstraction) - portability and ABI
  5. Project 5 (String Search) - performance tuning
  6. Project 6 (Key-Value Store) - durable storage

Path 2: The Performance Hacker

Best for: Developers focused on speed and low-level tuning

  1. Project 5 (String Search)
  2. Project 1 (Allocator)
  3. Project 2 (Thread Pool)
  4. Project 3 (Async Runtime)
  5. Project 4 (Syscall Abstraction)
  6. Project 6 (Key-Value Store)

Path 3: The Runtime Engineer

Best for: Developers building frameworks, servers, or SDKs

  1. Project 2 (Thread Pool)
  2. Project 3 (Async Runtime)
  3. Project 4 (Syscall Abstraction)
  4. Project 1 (Allocator)
  5. Project 6 (Key-Value Store)
  6. Project 5 (String Search)

Path 4: The Completionist

Best for: Building a complete systems lab

Phase 1 (Weeks 1-4): Projects 1 and 2
Phase 2 (Weeks 5-8): Projects 3 and 4
Phase 3 (Weeks 9-12): Projects 5 and 6


Success Metrics

  • You can explain the allocator fast path and fragmentation trade-offs from memory.
  • You can diagnose a data race and fix it using a lock or atomic ordering.
  • Your async runtime can handle 5,000+ concurrent connections in a demo environment.
  • Your syscall abstraction builds on Linux and Windows with identical tests.
  • Your string search library beats memmem and strstr on large files.
  • Your key-value store survives 100 random crash tests without data loss.

Appendix: Tooling and Debugging Cheatsheet

Profiling: perf, perf stat, perf record, flame graphs

Memory debugging: valgrind, heaptrack, asan (AddressSanitizer)

Concurrency debugging: tsan (ThreadSanitizer), rr replay

Syscall tracing: strace, ltrace, dtruss (macOS)

Binary inspection: nm, objdump, readelf, otool (macOS)


Project Overview Table

  • Project 1: Custom Memory Allocator -> Drop-in malloc replacement with benchmarks (Advanced, 3-6 weeks)
  • Project 2: Work-Stealing Thread Pool -> Scheduler with per-thread deques (Advanced, 2-3 weeks)
  • Project 3: Mini Async Runtime -> Event loop + timers + non-blocking I/O (Advanced, 2-4 weeks)
  • Project 4: Cross-Platform Syscall Abstraction -> Portable I/O + process + time APIs (Advanced, 2-3 weeks)
  • Project 5: High-Performance String Search -> SIMD-accelerated search library (Expert, 2-3 weeks)
  • Project 6: Embedded Key-Value Store -> WAL + LSM-based storage engine (Expert, 4-6 weeks)

Project List

Project 1: Custom Memory Allocator

  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust, Zig
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold” (Educational/Personal Brand)
  • Difficulty: Level 3: Advanced (The Engineer)
  • Knowledge Area: Memory Management, Systems Programming
  • Software or Tool: glibc, jemalloc, valgrind, perf
  • Main Book: “C Interfaces and Implementations” by David Hanson

What you’ll build: A production-quality allocator that can be preloaded into real programs and benchmarked against glibc and jemalloc. It will support size classes, coalescing, and per-thread caches.

Why it teaches systems libraries: Nothing exposes hidden system complexity faster than writing malloc. You will face fragmentation, alignment, concurrency, metadata design, and undefined behavior head-on.

Core challenges you’ll face:

  • Designing free lists, size classes, and metadata layouts
  • Handling fragmentation and coalescing efficiently
  • Making allocation fast in multi-threaded workloads
  • Avoiding UB and memory corruption under stress

Real World Outcome

You will have a shared library that can replace the system allocator in any program using LD_PRELOAD.

Command Line Outcome Example:

# Build your allocator
$ make
cc -O2 -fPIC -shared -o libmyalloc.so myalloc.c -lpthread

# Preload into a real program
$ LD_PRELOAD=./libmyalloc.so ls
Desktop  Documents  Downloads  Music  Pictures  Videos

# Enable allocator statistics
$ MALLOC_STATS=1 LD_PRELOAD=./libmyalloc.so ls -la
[myalloc] alloc=1280 bytes free=0 bytes arenas=4
[myalloc] bins: 16B=120 32B=64 64B=32 128B=12 256B=8

# Run benchmarks
$ ./bench_allocator --threads 4 --sizes 16,32,64,256
allocs/sec: 12,500,000 (your allocator)
allocs/sec:  6,800,000 (glibc)
allocs/sec: 11,900,000 (jemalloc)

The Core Question You’re Answering

“What does it actually take to make malloc fast, safe, and scalable under real load?”

Concepts You Must Understand First

  1. Allocator metadata and size classes
    • Why do size classes exist?
    • What is the cost of metadata placement?
    • Book Reference: “C Interfaces and Implementations” Ch. 5-6
  2. Fragmentation and coalescing
    • How do internal and external fragmentation differ?
    • When should you coalesce free blocks?
    • Book Reference: CS:APP Ch. 9.9
  3. Thread-local caches and false sharing
    • Why do per-thread arenas improve scalability?
    • How does false sharing appear in allocator data?
    • Book Reference: CS:APP Ch. 6 (cache behavior)
  4. Undefined behavior in C
    • What happens if you read past an allocation?
    • How does alignment interact with SIMD?
    • Book Reference: “Effective C” Ch. 4-5

Questions to Guide Your Design

  1. Allocation strategy
    • Which size classes will you support?
    • When do you fall back to mmap?
    • How do you handle large allocations?
  2. Metadata placement
    • Inline headers or side tables?
    • How will you detect corruption?
    • What alignment guarantees will you provide?
  3. Concurrency
    • One global heap or per-thread arenas?
    • How will you avoid lock contention?
    • Do you need a global cache for large objects?

Thinking Exercise

The Fragmentation Thought Experiment

Suppose you have 1 MB of memory. You allocate 64-byte blocks, then free every other block. You now have 512 KB free, but no contiguous 512 KB block.

  • How many bytes can you allocate in a single request now?
  • What happens if you attempt malloc(128 KB)?
  • How would coalescing change the answer?

The Interview Questions They’ll Ask

  1. “Why is malloc sometimes slow even for small allocations?”
  2. “What trade-off exists between fragmentation and allocation speed?”
  3. “How do per-thread arenas improve performance?”
  4. “What is the difference between sbrk and mmap in allocator design?”
  5. “How would you detect heap corruption in your allocator?”

Hints in Layers

Hint 1: Start with a bump allocator

Implement a fast arena allocator to learn the mechanics.

#include <stddef.h>
typedef struct { char* base; size_t size; size_t offset; } arena_t;

// Bump allocation: round up for alignment, then hand out the next n bytes.
void* arena_alloc(arena_t* a, size_t n) {
    size_t off = (a->offset + 15) & ~(size_t)15;          // 16-byte alignment
    if (off > a->size || n > a->size - off) return NULL;  // out of space
    void* p = a->base + off;
    a->offset = off + n;
    return p;
}

Hint 2: Add a free list

Use the freed block itself to store the next pointer to avoid extra memory.
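
A minimal sketch of that intrusive free list (one size class, no locking): the freed block's first bytes hold the "next" pointer, so blocks must be at least sizeof(void*) large.

#include <stddef.h>

typedef struct free_node { struct free_node* next; } free_node_t;
static free_node_t* free_list;       // head of the list for one size class

void freelist_push(void* block) {
    free_node_t* node = block;
    node->next = free_list;
    free_list = node;
}

void* freelist_pop(void) {
    free_node_t* node = free_list;
    if (node) free_list = node->next;
    return node;                     // NULL means the class is empty
}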

Hint 3: Add size classes

Start with 16, 32, 64, 128, 256, 512 bytes. Each class has its own free list.

Hint 4: Add thread-local caches

Use __thread or pthread_key_t to give each thread its own cache and reduce locking.
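
A minimal per-thread cache sketch using __thread (GCC/Clang; C11 spells it _Thread_local). The slot count and overflow policy are illustrative.

#define TCACHE_SLOTS 64
static __thread void* tcache[TCACHE_SLOTS];
static __thread int tcache_count;

void* tcache_pop(void) { return tcache_count ? tcache[--tcache_count] : NULL; }

int tcache_push(void* p) {
    if (tcache_count == TCACHE_SLOTS) return 0;  // full: caller frees to the shared list
    tcache[tcache_count++] = p;
    return 1;
}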

Books That Will Help

  • Allocator design: “C Interfaces and Implementations” by Hanson - Ch. 5-6
  • Virtual memory: “Computer Systems: A Programmer’s Perspective” - Ch. 9
  • C memory pitfalls: “Effective C” by Seacord - Ch. 4-5
  • Cache behavior: “Computer Systems: A Programmer’s Perspective” - Ch. 6

Common Pitfalls & Debugging

Problem 1: “Random crashes after a few minutes”

  • Why: Metadata corruption or double free
  • Fix: Add canaries and verify on free()
  • Quick test: Run under valgrind --tool=memcheck

Problem 2: “Allocator is slower than glibc”

  • Why: Too much locking or too many cache misses
  • Fix: Add per-thread caches and reduce critical sections
  • Quick test: perf stat ./bench_allocator

Problem 3: “Memory usage keeps growing”

  • Why: Memory never returned to OS or fragmentation is high
  • Fix: Implement page-level purging or coalescing thresholds
  • Verification: Track RSS over time with ps or /proc

Definition of Done

  • Allocator supports malloc/free/calloc/realloc
  • Alignment guarantees meet alignof(max_align_t)
  • Works under LD_PRELOAD with real programs
  • Benchmarks show competitive performance vs glibc
  • Thread-safe under stress tests
  • Leak-free under valgrind

Project 2: Work-Stealing Thread Pool

  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 3: Serious Systems Builder
  • Business Potential: 2. The “Infrastructure Core” (Reusable library)
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Concurrency, Scheduling
  • Software or Tool: pthreads, perf, flamegraph
  • Main Book: “Advanced Programming in the UNIX Environment” by Stevens and Rago

What you’ll build: A scalable work-stealing thread pool with per-thread deques and a clean API for spawning tasks and waiting for completion.

Why it teaches systems libraries: Scheduling is the hidden engine of every runtime. You will learn how contention, cache locality, and work granularity shape real performance.

Core challenges you’ll face:

  • Implementing a correct work-stealing deque
  • Avoiding false sharing and lock contention
  • Handling blocking tasks and shutdown semantics

Real World Outcome

You will run benchmarks that show your pool scaling across CPU cores and outperforming a naive global-queue pool.

$ ./pool_bench --workers 8 --tasks 200000 --task "fib(20)"

=== Work-Stealing Thread Pool ===
Workers: 8
Tasks:   200000

Global queue pool:  1.87s (106,951 tasks/sec)
Work stealing pool: 0.74s (270,270 tasks/sec)  [2.5x faster]

Steals performed:  12,483
Average task time: 3.4 us
CPU utilization:  92%

The Core Question You’re Answering

“How do you keep all CPU cores busy without turning scheduling into a bottleneck?”

Concepts You Must Understand First

  1. Concurrency primitives and memory ordering
    • How do atomics and locks work together?
    • Book Reference: “Rust Atomics and Locks” Ch. 1-3
  2. Work-stealing deques
    • Why does the owner pop from the bottom and thieves steal from the top?
    • Book Reference: CS:APP Ch. 12 (Concurrency overview)
  3. False sharing and cache lines
    • How can two unrelated counters slow each other down?
    • Book Reference: CS:APP Ch. 6 (Memory hierarchy)

Questions to Guide Your Design

  1. Queue design
    • Will you use a Chase-Lev deque or a simpler locked deque?
    • How do you handle the last element race?
  2. Task model
    • Are tasks one-shot or can they spawn new tasks?
    • How will you track outstanding tasks for shutdown?
  3. Blocking behavior
    • Do you allow blocking tasks?
    • Will you provide a separate blocking pool?

Thinking Exercise

The Steal Scenario

Two workers both see one task left in a victim’s deque. The owner is also about to pop it.

  • What ordering of operations can cause the task to be lost?
  • How does a CAS on the top index prevent this?
  • What happens if the task is stolen while the owner pops it?

The Interview Questions They’ll Ask

  1. “Why is work stealing better than a global queue under contention?”
  2. “What is the last-element race and how do you solve it?”
  3. “How do you prevent false sharing in your worker structures?”
  4. “How do you handle blocking tasks in a CPU-bound pool?”
  5. “What is the difference between lock-free and wait-free?”

Hints in Layers

Hint 1: Start with a global queue

Implement correctness first, then add per-thread deques.

Hint 2: Add per-thread deques

Give each worker its own deque and push tasks locally.

Hint 3: Implement stealing

Only idle workers attempt to steal. Use CAS to avoid races.
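
A simplified sketch of the steal path using C11 atomics; a production Chase-Lev deque also needs a carefully ordered owner-side pop, so treat this as the thief's half only.

#include <stdatomic.h>
#include <stddef.h>

typedef struct {
    _Atomic long top;        // thieves steal here
    _Atomic long bottom;     // owner pushes/pops here
    void** tasks;            // ring buffer of task pointers
    size_t mask;             // capacity - 1 (capacity is a power of two)
} deque_t;

void* deque_steal(deque_t* q) {
    long t = atomic_load_explicit(&q->top, memory_order_acquire);
    long b = atomic_load_explicit(&q->bottom, memory_order_acquire);
    if (t >= b) return NULL;                           // looks empty
    void* task = q->tasks[(size_t)t & q->mask];
    if (!atomic_compare_exchange_strong(&q->top, &t, t + 1))
        return NULL;                                   // lost the race: retry later
    return task;
}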

Hint 4: Add a work counter

Increment on spawn, decrement on completion, and only shutdown at zero.

Books That Will Help

  • Threads and synchronization: “Advanced Programming in the UNIX Environment” - Ch. 11-12
  • Memory ordering: “Rust Atomics and Locks” - Ch. 1-3
  • Concurrency overview: “Computer Systems: A Programmer’s Perspective” - Ch. 12

Common Pitfalls & Debugging

Problem 1: “Tasks disappear”

  • Why: Race condition in deque indices
  • Fix: Use atomic CAS when stealing and validate bounds
  • Quick test: Run with thread sanitizer (-fsanitize=thread)

Problem 2: “CPU usage is low”

  • Why: Tasks are too coarse or stealing is too slow
  • Fix: Reduce task granularity, increase stealing frequency
  • Quick test: Profile with perf and measure steals/sec

Problem 3: “Shutdown hangs”

  • Why: Workers are sleeping on empty queues
  • Fix: Broadcast condition variable or push sentinel tasks
  • Verification: Add log output for worker exit

Definition of Done

  • Pool supports task submission and waiting for completion
  • Work stealing balances load across threads
  • Benchmarks show scaling on 4+ cores
  • No data races under thread sanitizer
  • Clean shutdown with no leaks

Project 3: Mini Async Runtime

  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Zig
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Runtime Core” (Framework enabler)
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Async I/O, Scheduling
  • Software or Tool: epoll/kqueue/io_uring, timers, sockets
  • Main Book: “Computer Systems: A Programmer’s Perspective”

What you’ll build: A minimal async runtime with an event loop, task scheduler, and timers. It will run a non-blocking TCP server and a timer-driven job queue.

Why it teaches systems libraries: Async runtimes are the backbone of modern servers. You will learn the mechanics behind futures, wakers, and I/O multiplexing.

Core challenges you’ll face:

  • Correct state machines for non-blocking I/O
  • Integrating timers into the event loop
  • Avoiding busy loops and ensuring fairness

Real World Outcome

You will run a small HTTP server that handles thousands of concurrent connections on a single thread, with timers and task scheduling.

$ ./minirt --serve 8080
[minirt] poller=epoll, timers=heap, workers=1
[minirt] listening on 0.0.0.0:8080

# In another shell
$ ab -n 10000 -c 500 http://127.0.0.1:8080/
Requests per second:  32,500 [#/sec] (mean)
Time per request:     15.4 ms (mean, across all concurrent requests)

# Runtime stats
[minirt] active conns=512 ready=128 inflight=384
[minirt] timers=3 next_wakeup=12ms

The Core Question You’re Answering

“How do you serve thousands of connections without spawning thousands of threads?”

Concepts You Must Understand First

  1. Non-blocking I/O and readiness
    • What does EAGAIN mean?
    • Book Reference: CS:APP Ch. 10
  2. Event loop mechanics
    • How does epoll_wait drive the scheduler?
    • Book Reference: APUE Ch. 14
  3. Timers and scheduling
    • Why do timeouts require a separate data structure?
    • Book Reference: CS:APP Ch. 10

Questions to Guide Your Design

  1. Poller selection
    • epoll only, or add kqueue/IOCP later?
    • Edge-triggered or level-triggered?
  2. Task model
    • How will you represent task state machines?
    • Where do you store the waker callback?
  3. Backpressure
    • How many tasks can be in-flight at once?
    • What happens when the queue is full?

Thinking Exercise

The EAGAIN Loop

A socket reports writable. You attempt write() and get EAGAIN.

  • Why can this happen?
  • What state should you store to retry?
  • How do you avoid spinning at 100 percent CPU?

The Interview Questions They’ll Ask

  1. “What is the difference between readiness and completion?”
  2. “How do you prevent busy loops in an event loop?”
  3. “Why do async runtimes need timers?”
  4. “How would you integrate blocking tasks into an async runtime?”
  5. “What is backpressure and how do you implement it?”

Hints in Layers

Hint 1: Start with a poller-only loop

Use epoll to accept and read connections without tasks.

Hint 2: Add tasks as state machines

Represent each connection as a struct with a state enum and a buffer.
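
A minimal sketch of such a per-connection state machine; the field names are illustrative. The event loop advances the state whenever the poller reports the fd ready.

#include <stddef.h>

typedef enum { CONN_READING, CONN_WRITING, CONN_CLOSING } conn_state_t;

typedef struct {
    int          fd;
    conn_state_t state;
    char         buf[4096];
    size_t       len;    // bytes currently buffered
    size_t       sent;   // bytes already written (resume point after EAGAIN)
} conn_t;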

Hint 3: Add timers

Use a min-heap of deadlines and set poller timeout accordingly.
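
A small sketch of deriving the poller timeout from the timer heap; timer_heap_peek_deadline() and now_ms() are assumed helpers, and -1 tells epoll_wait() to block until I/O.

#include <stdint.h>

int64_t timer_heap_peek_deadline(void);   // INT64_MAX if the heap is empty
int64_t now_ms(void);

int next_poll_timeout_ms(void) {
    int64_t deadline = timer_heap_peek_deadline();
    if (deadline == INT64_MAX) return -1;          // no timers: block until I/O
    int64_t delta = deadline - now_ms();
    if (delta < 0) return 0;                       // already due: do not block
    return delta > 60000 ? 60000 : (int)delta;     // clamp to a sane maximum
}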

Hint 4: Add backpressure

Stop accepting new connections when the run queue exceeds a limit.

Books That Will Help

  • System-level I/O: “Computer Systems: A Programmer’s Perspective” - Ch. 10
  • Advanced I/O: “Advanced Programming in the UNIX Environment” - Ch. 14
  • Concurrency overview: “Computer Systems: A Programmer’s Perspective” - Ch. 12

Common Pitfalls & Debugging

Problem 1: “CPU usage at 100% when idle”

  • Why: You are spinning without blocking on the poller
  • Fix: Ensure epoll_wait uses a proper timeout
  • Quick test: Run top and confirm idle CPU usage < 5%

Problem 2: “Connections stall”

  • Why: You are not re-registering interest after partial writes
  • Fix: On EAGAIN, re-arm the fd and store partial state
  • Quick test: Add logging for state transitions

Problem 3: “Timer events never fire”

  • Why: Timer heap not integrated with poller timeout
  • Fix: Compute next deadline and pass timeout to epoll_wait
  • Verification: Unit test with a 50ms timer

Definition of Done

  • Event loop handles 1,000+ concurrent sockets
  • Timers and timeouts work correctly
  • No busy loops when idle
  • Backpressure prevents unbounded memory growth
  • Clean shutdown and resource cleanup

Project 4: Cross-Platform Syscall Abstraction Library

  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 3: Serious Systems Builder
  • Business Potential: 2. The “Infrastructure Core”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: ABI, Portability, System APIs
  • Software or Tool: POSIX, Windows API, libuv
  • Main Book: “Computer Systems: A Programmer’s Perspective”

What you’ll build: A portable C library that wraps core OS functionality (files, processes, time, sockets) and hides platform-specific details behind a stable ABI.

Why it teaches systems libraries: Every real system must run across platforms. This project forces you to design stable APIs, handle platform differences, and respect ABI constraints.

Core challenges you’ll face:

  • Mapping POSIX and Windows APIs into a single contract
  • Designing ABI-stable types and error handling
  • Detecting features at build and runtime

Real World Outcome

You will compile the same codebase on Linux and Windows and get identical behavior from your public API.

# Linux build
$ make
$ ./xplat_demo
[xplat] platform=linux
[xplat] open("data.txt") -> ok
[xplat] stat: size=1024 mtime=1703721820
[xplat] read: 128 bytes
[xplat] sleep(50ms) -> ok

# Windows build (PowerShell)
> cmake --build .
> .\xplat_demo.exe
[xplat] platform=windows
[xplat] open("data.txt") -> ok
[xplat] stat: size=1024 mtime=1703721820
[xplat] read: 128 bytes
[xplat] sleep(50ms) -> ok

The Core Question You’re Answering

“How do you provide a clean, stable API when the underlying OS semantics are different?”

Concepts You Must Understand First

  1. ABI stability and opaque types
    • Why should public structs be opaque?
    • Book Reference: CS:APP Ch. 7
  2. POSIX vs Windows I/O
    • What is a file descriptor vs a HANDLE?
    • Book Reference: APUE Ch. 3 (File I/O)
  3. Error handling and mapping
    • How do you normalize error codes?
    • Book Reference: CS:APP Ch. 10

Questions to Guide Your Design

  1. API surface
    • Which operations are included in v1?
    • Which platform-specific features are deliberately excluded?
  2. Type design
    • Which types are opaque?
    • How do you expose size and alignment safely?
  3. Portability strategy
    • Build-time detection (CMake) vs runtime detection?
    • How will you structure platform-specific code?

Thinking Exercise

The Error Mapping Problem

POSIX open() returns -1 and sets errno. Windows CreateFile returns INVALID_HANDLE_VALUE and GetLastError().

  • What error values will your API expose?
  • How do you preserve enough information for debugging?
  • How will you handle non-overlapping error codes?

The Interview Questions They’ll Ask

  1. “What is the difference between API and ABI?”
  2. “Why do you use opaque structs in public headers?”
  3. “How do you map Windows errors to POSIX-style errors?”
  4. “What breaks binary compatibility?”
  5. “How do you manage platform-specific code without #ifdef chaos?”

Hints in Layers

Hint 1: Define a small, stable API

Start with file open/read/write/close and sleep.

Hint 2: Use opaque handles

Expose typedef struct xplat_file xplat_file_t; and keep fields private.
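
A minimal header sketch of the opaque-handle approach; the xplat_* names are illustrative, not a fixed API. Because callers only ever hold a pointer, the private struct layout can change without breaking the ABI.

// xplat_file.h -- public header (sketch)
#include <stddef.h>
#include <stdint.h>

typedef struct xplat_file xplat_file_t;    // fields live only in the platform .c files

int     xplat_open(const char* path, int flags, xplat_file_t** out);
int64_t xplat_read(xplat_file_t* f, void* buf, size_t n);
int     xplat_close(xplat_file_t* f);
// POSIX defines struct xplat_file around an int fd; Windows wraps a HANDLE.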

Hint 3: Centralize error mapping

Convert OS errors into a small set of library error codes.
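
A minimal sketch of centralized error mapping on the POSIX side; the enum values are illustrative. A sibling function would translate GetLastError() values into the same enum on Windows.

#include <errno.h>

typedef enum { XPLAT_OK = 0, XPLAT_ENOENT, XPLAT_EACCES, XPLAT_EIO } xplat_err_t;

xplat_err_t xplat_err_from_errno(int e) {
    switch (e) {
    case 0:      return XPLAT_OK;
    case ENOENT: return XPLAT_ENOENT;
    case EACCES: return XPLAT_EACCES;
    default:     return XPLAT_EIO;   // keep the raw code around for debugging
    }
}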

Hint 4: Add feature detection

Use compile-time checks (CMake) and runtime probes for optional features.

Books That Will Help

  • Linking and ABI: “Computer Systems: A Programmer’s Perspective” - Ch. 7
  • File I/O: “Advanced Programming in the UNIX Environment” - Ch. 3
  • Portability: “21st Century C” by Klemens - Ch. 2

Common Pitfalls & Debugging

Problem 1: “Works on Linux, fails on Windows”

  • Why: Assumed POSIX semantics (e.g., fork, errno)
  • Fix: Add platform-specific implementations and tests
  • Quick test: Run CI on Windows and Linux

Problem 2: “ABI breaks after refactor”

  • Why: Public struct layout changed
  • Fix: Use opaque pointers and accessor functions
  • Verification: ABI check with abi-compliance-checker

Problem 3: “Different error codes”

  • Why: Windows errors not mapped to your library errors
  • Fix: Build a translation table with documentation
  • Quick test: Force an error and verify error codes

Definition of Done

  • Builds and runs on Linux and Windows
  • Public ABI stable across minor versions
  • Error handling consistent across platforms
  • Test suite passes on both platforms
  • Minimal API documented with examples

Project 5: High-Performance String Search Library

  • Main Programming Language: C (with SIMD intrinsics)
  • Alternative Programming Languages: Rust, C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: Algorithms, Performance, SIMD
  • Software or Tool: SSE4.2, AVX2, perf
  • Main Book: “Modern X86 Assembly Language Programming” by Daniel Kusswurm

What you’ll build: A fast substring search library with SIMD acceleration, similar to what powers ripgrep, grep, or Hyperscan-style literal matching.

Why it teaches systems libraries: This is where algorithm theory meets hardware reality. You will learn how cache behavior and vectorization beat naive Big-O analysis in practice.

Core challenges you’ll face:

  • Choosing the right algorithm for pattern length
  • Implementing safe SIMD scans without UB
  • Handling UTF-8 boundaries correctly

Real World Outcome

You will produce a command-line tool and a library API that outperforms strstr on large inputs.

$ ./fastsearch --version
fastsearch 1.0 (SIMD enabled)
Detected CPU: AVX2, SSE4.2

$ time ./fastsearch "ERROR" /var/log/huge.log
[Line 1247] ERROR: Connection timeout
[Line 5893] ERROR: Database unreachable
... (147 matches)

real    0m0.14s

$ time grep "ERROR" /var/log/huge.log
real    0m1.52s

# 10x faster on this corpus

The Core Question You’re Answering

“How do real-world search tools beat naive algorithms by an order of magnitude?”

Concepts You Must Understand First

  1. String search algorithms
    • When does Boyer-Moore beat Two-Way?
    • Book Reference: “Algorithms” Ch. 5
  2. SIMD intrinsics and alignment
    • How do vector loads work safely?
    • Book Reference: “Modern X86 Assembly Language Programming” SIMD chapters
  3. Cache behavior
    • Why does sequential access outperform random access?
    • Book Reference: CS:APP Ch. 6

Questions to Guide Your Design

  1. Algorithm selection
    • What is the crossover point for SIMD vs scalar?
    • How do you handle very short needles?
  2. UTF-8 correctness
    • Will you treat UTF-8 as bytes or decode boundaries?
    • How do you avoid false matches inside multibyte sequences?
  3. API ergonomics
    • Will you expose a simple find() or a precompiled needle?
    • How do you report matches efficiently?

Thinking Exercise

The Cache Line Scan

You have a 64-byte cache line and a 3-byte needle.

  • How many comparisons can you do per line with SIMD?
  • What happens if the match crosses a cache line boundary?
  • How do you avoid reading past the buffer end?

The Interview Questions They’ll Ask

  1. “Why can a SIMD scan be faster than Boyer-Moore?”
  2. “How do you avoid undefined behavior in vectorized code?”
  3. “How do you handle CPU feature detection?”
  4. “What is the Two-Way algorithm and why does glibc use it?”
  5. “How do you benchmark string search fairly?”

Hints in Layers

Hint 1: Implement a scalar baseline

Start with a correct naive search to validate results.

Hint 2: Add a fast first-byte scan

Use memchr or SIMD to find candidate positions.

Hint 3: Add SIMD dispatch

Detect AVX2/SSE4.2 and select the best implementation.

Hint 4: Add UTF-8 validation

Avoid matching inside multibyte sequences if utf8_mode is enabled.

Books That Will Help

  • String algorithms: “Algorithms” by Sedgewick and Wayne - Ch. 5
  • SIMD basics: “Modern X86 Assembly Language Programming” - SIMD chapters
  • Cache behavior: “Computer Systems: A Programmer’s Perspective” - Ch. 6

Common Pitfalls & Debugging

Problem 1: “SIMD path crashes”

  • Why: Unaligned or out-of-bounds loads
  • Fix: Use _mm_loadu_* or masked loads; guard tail
  • Quick test: Run under ASAN

Problem 2: “Wrong matches on UTF-8”

  • Why: Byte-level search matches inside multi-byte sequences
  • Fix: Validate UTF-8 and skip continuation bytes
  • Verification: Run UTF-8 test corpus

Problem 3: “No speedup”

  • Why: Benchmark too small or dominated by I/O
  • Fix: Use large in-memory buffers and warm caches
  • Quick test: perf stat to see cache misses

Definition of Done

  • SIMD and scalar paths return identical results
  • UTF-8 mode passes correctness tests
  • Benchmarks show speedup over strstr and memmem
  • CPU feature detection works on multiple machines
  • API documented with examples

Project 6: Embedded Key-Value Store with WAL + LSM

  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Infrastructure Core”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Storage Engines, Durability, Concurrency
  • Software or Tool: fsync, mmap, bloom filters
  • Main Book: “Operating Systems: Three Easy Pieces”

What you’ll build: A crash-safe embedded key-value store with a write-ahead log, memtable, immutable SSTables, and background compaction.

Why it teaches systems libraries: Storage engines are runtime components that must be correct under failure. This project forces you to design a file format, recovery protocol, and concurrency model.

Core challenges you’ll face:

  • Designing a WAL format with checksums
  • Implementing memtables and SSTable flush
  • Building crash tests and recovery logic

Real World Outcome

You will run a CLI that can survive power-loss simulations and still return consistent data.

$ ./kvdb put user:123 "Alice"
OK

$ ./kvdb get user:123
Alice

$ ./kvdb bench --writes 100000 --sync-every 1000
Wrote 100000 keys in 1.42s
WAL syncs: 100

# Crash test
$ ./kvdb crash-test --iterations 100
[1] kill during WAL append -> recovery OK
[2] kill during memtable flush -> recovery OK
[3] kill during compaction -> recovery OK
Crash safety: PASSED (100/100)

The Core Question You’re Answering

“How do you guarantee durability and correctness even when the process crashes mid-write?”

Concepts You Must Understand First

  1. Write-ahead logging
    • Why must the WAL be synced before acknowledging writes?
    • Book Reference: OSTEP Ch. 40-42
  2. LSM trees and compaction
    • Why do LSM trees trade write amplification for throughput?
    • Book Reference: “Designing Data-Intensive Applications” Ch. 3 (external)
  3. File system semantics
    • What does fsync() guarantee?
    • Book Reference: CS:APP Ch. 10

Questions to Guide Your Design

  1. WAL format
    • What is the entry layout and checksum?
    • How do you detect torn writes?
  2. Memtable implementation
    • Skip list or red-black tree?
    • How do you handle concurrent reads?
  3. Compaction
    • Leveled or tiered strategy?
    • When do you trigger compaction?

Thinking Exercise

Crash Timeline

A write arrives at t0. The WAL entry is written at t1. The memtable is updated at t2. The process crashes at t3.

  • What state do you expect on disk?
  • What does recovery need to replay?
  • What if the crash occurs during SSTable flush?

The Interview Questions They’ll Ask

  1. “Why does WAL guarantee durability?”
  2. “What is write amplification and why is it a trade-off?”
  3. “How do you recover after a crash?”
  4. “Why are SSTables immutable?”
  5. “How do bloom filters speed up reads?”

Hints in Layers

Hint 1: Build WAL-only first

Store all writes in a log and replay it on startup.

Hint 2: Add a memtable

Use a sorted data structure and keep it in memory for fast reads.

Hint 3: Flush to SSTables

Write sorted runs to disk and keep a manifest of tables.

Hint 4: Add compaction and bloom filters

Merge tables and add a probabilistic filter for fast negative lookups.

Books That Will Help

  • Crash consistency: “Operating Systems: Three Easy Pieces” - Ch. 40-42
  • System I/O: “Computer Systems: A Programmer’s Perspective” - Ch. 10
  • LSM tree concepts: “Designing Data-Intensive Applications” - Ch. 3 (external)
  • Data structures: “Algorithms” by Sedgewick and Wayne - Ch. 3, 5

Common Pitfalls & Debugging

Problem 1: “Data lost after crash”

  • Why: WAL not synced or replayed correctly
  • Fix: Verify fsync() boundaries and checksum validation
  • Quick test: Kill process during writes and restart

Problem 2: “Reads are slow”

  • Why: Too many SSTables or missing bloom filters
  • Fix: Implement compaction and filters
  • Verification: Benchmark read latency before/after

Problem 3: “Compaction corrupts data”

  • Why: Old tables deleted too early
  • Fix: Use atomic rename and reference counting
  • Quick test: Inject crashes during compaction

Definition of Done

  • WAL replay recovers consistent state
  • SSTables are immutable and validated with checksums
  • Compaction reduces number of SSTables
  • Crash tests pass 100 iterations
  • Benchmarks show stable read/write throughput