CS:APP (3rd Edition) — Deep Learning via Buildable Projects

Goal: Transform from a programmer who writes code that “happens to work” into a systems programmer who understands exactly what the machine does with every instruction. By building 17 core projects (+ 9 optional advanced extensions), you will internalize the complete journey from source code to running process—mastering data representation, machine-level execution, memory hierarchy, operating system abstractions, and concurrent programming. When you finish, you will debug crashes by reading registers, optimize code by reasoning about cache lines, and build robust systems that handle real-world failure modes gracefully.


Introduction: What This Guide Covers

CS:APP (Computer Systems: A Programmer’s Perspective) teaches you to treat the compiler, the OS, and the CPU as inspectable systems instead of magic. This guide turns the book into a build-first sprint: you’ll repeatedly ask “what did the machine actually do?” and then prove it with artifacts (ELF headers, disassembly, traces, perf counters, crash dumps).

What you will build (by the end of the core sprint):

  • A pipeline explorer that makes preprocessing → compilation → assembly → linking → loading observable
  • A byte/bit inspector for integers, floats, and endianness
  • Debugging workflows for x86-64 crashes, reverse engineering, and classic memory bugs
  • Simulators/visualizers for CPU pipelines and caches
  • OS-flavored tooling: process/signal sandbox, a job-control shell, VM map visualizer, and a malloc implementation
  • A robust I/O toolkit and a concurrency workbench
  • A secure, observable, high-performance proxy server capstone

Optional advanced extensions (if you keep going):

  • An ELF linker/loader, VM simulator, tiny HTTP server, thread pool, signal-safe I/O, profiler, leak detector, ptrace debugger, and an OS-kernel capstone

The Big Picture (Mental Model)

You write C
  │
  ├─► Toolchain (cpp/cc/as/ld) ──► ELF binary ──► Loader ──► Process
  │                                              │
  │                                              ├─► Syscalls (kernel boundary)
  │                                              ├─► Signals (async interruptions)
  │                                              ├─► VM (pages, TLB, faults)
  │                                              └─► Caches (lines, locality)
  │
  └─► You debug/measure by reading the evidence:
      source ⇄ asm ⇄ symbols ⇄ traces ⇄ counters ⇄ dumps

How to Use This Guide

This guide is designed to be worked through systematically, not skimmed. Follow this workflow:

Reading the Primer (First 2-3 Days)

  1. Read each Theory Primer chapter completely before starting related projects
  2. Work through the “Check Your Understanding” questions without looking at answers
  3. Draw the ASCII diagrams by hand to internalize the mental models
  4. Complete the homework exercises at the end of each chapter

Working Through Projects (Ongoing)

  1. Read the project’s “Core Question” and “Concepts You Must Understand First”
  2. Attempt the “Thinking Exercise” before writing any code
  3. Use “Hints in Layers” only when stuck—start with Hint 1, wait 30 minutes before checking the next
  4. Validate against the “Definition of Done” checklist before moving on
  5. Reflect on “Interview Questions” to solidify understanding

Parallel Reference

  • Keep CS:APP open alongside this guide
  • Use the “Deep Dive Reading by Concept” table to find relevant book sections
  • Refer to the Glossary when terms are unclear

Prerequisites & Background Knowledge

Essential Prerequisites (Must Have)

Programming Skills

  • Proficiency in C (pointers, structs, memory management, compilation)
  • Basic command-line fluency (navigation, file operations, pipes)
  • Familiarity with a text editor (Vim/Emacs/VSCode)
  • Recommended Reading: “C Programming: A Modern Approach” by K.N. King - Chapters 1-20

Computer Science Fundamentals

  • Binary and hexadecimal number systems
  • Basic data structures (arrays, linked lists, stacks)
  • Understanding of what a compiler does (high-level)
  • Recommended Reading: “Code” by Charles Petzold - Chapters 1-15

Helpful But Not Required

These topics are taught through the projects but prior exposure accelerates learning:

  • Assembly language basics (learn during Projects 3-6)
  • Operating system concepts (learn during Projects 11-14)
  • Networking fundamentals (learn during Projects 15, 17, 20)

Self-Assessment Questions

Before starting, you should be able to answer “yes” to these:

  1. Can you explain what a pointer is and how * and & operators work in C?
  2. Can you compile a C program with gcc and run it from the command line?
  3. Do you know the difference between stack and heap memory allocation?
  4. Can you read simple Makefiles and understand targets/dependencies?
  5. Are you comfortable spending 4+ hours debugging a single issue?

If you answered “no” to any question, spend time on the recommended readings first.

Development Environment Setup

Required Tools

Tool       Version   Purpose
GCC        9.0+      C compiler with debugging symbols
GDB        8.0+      GNU Debugger for x86-64
Make       4.0+      Build automation
Git        2.20+     Version control
Valgrind   3.15+     Memory error detection

Recommended Tools

Tool      Purpose
GEF       GDB enhancement for reverse engineering
objdump   Binary inspection
readelf   ELF file analysis
strace    System call tracing
perf      Performance profiling

Testing Your Setup

$ gcc --version
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

$ gdb --version
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1

$ cat > test.c << 'EOF'
#include <stdio.h>
int main() { printf("Setup OK\n"); return 0; }
EOF
$ gcc -g -Wall test.c -o test && ./test
Setup OK

Operating System

  • Linux x86-64 is required (Ubuntu 20.04+ or Debian 11+ recommended)
  • macOS works for some projects but lacks full ptrace support
  • Windows users should use WSL2 with Ubuntu

Time Investment

Project Type   Time Per Project   Examples
Foundation     4-8 hours          P1, P2, P3
Intermediate   10-20 hours        P4-P10
Advanced       20-40 hours        P11-P17
Extension      30-60 hours        P18-P26

Total Sprint Estimate:

  • Core projects (P1-P17): 3-5 months at 10-15 hours/week
  • With extensions (P18-P26): 6-9 months total

Important Reality Check

Systems programming is genuinely difficult. You will:

  • Spend hours staring at hex dumps and register values
  • Debug problems that seem “impossible” until they suddenly click
  • Feel frustrated when your code crashes with no obvious cause
  • Eventually develop an intuition that makes this feel natural

The struggle is the learning. If projects feel easy, you’re not going deep enough.


Why CS:APP Matters

Modern Motivation: Why Learn This in 2025?

In an era of high-level frameworks, AI code generation, and cloud abstractions, systems programming skills are more valuable than ever—precisely because they’re rarer.

The “Power Programmer” Gap

According to Teach Yourself CS, CS:APP creates “power programmers” who:

  • Debug crashes by reading registers instead of adding print statements
  • Optimize performance by reasoning about cache behavior
  • Understand security vulnerabilities at the instruction level
  • Build infrastructure that AI systems run on

Career Impact (2024-2025)

The job market has shifted toward specialists. According to IEEE Spectrum and Pragmatic Engineer:

  • 87% of tech leaders report difficulty finding skilled systems talent
  • AI/ML infrastructure roles require deep systems knowledge
  • Senior engineers with systems debugging skills command 30-50% salary premiums
  • The share of AI/ML jobs increased from 10% to 50% between 2023-2025—someone must build the infrastructure

Where These Skills Apply

┌─────────────────────────────────────────────────────────────────┐
│                    SYSTEMS PROGRAMMING CAREERS                   │
├─────────────────────────────────────────────────────────────────┤
│  Infrastructure    │  Security         │  Performance           │
│  ─────────────    │  ────────        │  ───────────          │
│  • Cloud platforms │  • Vulnerability  │  • Game engines        │
│  • Database cores  │    research       │  • Trading systems     │
│  • Compiler dev    │  • Malware        │  • Embedded systems    │
│  • Container       │    analysis       │  • Graphics drivers    │
│    runtimes        │  • OS hardening   │  • ML infrastructure   │
└─────────────────────────────────────────────────────────────────┘

Context & Evolution

1960s-70s: The Birth of C and Unix. Dennis Ritchie created C at Bell Labs in 1972 to rewrite Unix. His design philosophy (direct hardware access with minimal runtime overhead) defined systems programming for the next 50 years.

1980s-90s: x86 Dominance. Intel’s x86 architecture became the standard for personal computing, and understanding x86 assembly became essential for performance-critical code.

2000s: The 64-bit Transition. AMD64 (x86-64) extended the architecture with more registers and a flat memory model. CS:APP 3rd Edition (2015) was updated to focus on this architecture.

2020s: The Resurgence. With Moore’s Law slowing and cloud computing creating massive scale challenges, deep systems understanding has become the differentiator between “good enough” and “excellent” engineers.

Real-World Impact Statistics

  • Linux runs on 96.3% of the top 1 million web servers (2024, W3Techs)
  • Firefox’s memory usage dropped 30% after systems-level optimizations
  • Google’s TCMalloc handles billions of allocations per second across their fleet
  • A single TLB shootdown can cost 1,000-5,000 CPU cycles—understanding this prevents performance cliffs

Theory Primer (Mini-Book)

This section is your textbook. It covers the nine pillars of systems programming in detail. You must understand these mental models to complete the projects.

Chapter 1: Translation, Linking, and Execution

Fundamentals

The “Build” button in your IDE is a comfortable lie. It obscures a complex, four-stage pipeline that transforms human-readable text into machine-executable binary signals. Systems programmers must look behind this curtain because that is where “undefined reference” errors, segmentation faults during startup, and weird library versioning bugs live. The translation process is not magic; it is a deterministic sequence of transformations: Preprocessing (text substitution), Compilation (C to Assembly), Assembly (Assembly to Machine Code), and Linking (stitching object files together). Understanding this pipeline transforms you from someone who blindly tries command-line flags until it compiles, into someone who architects build systems and debugs binary compatibility issues with precision.

Deep Dive into Translation & Linking

The journey from hello.c to a running process involves four distinct tools, often orchestrated by a driver like gcc.

  1. Preprocessing (cpp): This is a text-processing step. It handles directives starting with #. It strips comments, expands macros, and physically pastes the contents of header files (#include) into your source file. The output is a “Translation Unit” (often .i). Crucially, the preprocessor knows nothing about C syntax; it just manipulates text. This is why missing a semicolon in a header file can cause an error 500 lines later in your .c file.
  2. Compilation (cc1): This is the heavy lifter. It parses the Translation Unit, checks for syntax and semantic errors (type safety), performs optimizations (like dead code elimination or loop unrolling), and generates Assembly code (.s). Assembly is a human-readable text representation of the CPU’s machine code. This is where the abstraction of “C” ends and the reality of the hardware begins.
  3. Assembly (as): The assembler translates the .s file into a Relocatable Object File (.o). This file contains binary machine code, but it is not executable yet. Addresses of global variables and functions are “relocatable” (usually set to zero) because the assembler doesn’t know where this code will live in memory or where other functions (like printf) are located. It produces a Symbol Table listing the functions it defines and the ones it needs.
  4. Linking (ld): The linker solves the puzzle. It takes multiple .o files and libraries (.a archives or .so shared objects) and combines them into one Executable (ELF file). Its two main jobs are:
    • Symbol Resolution: Matching every reference to a symbol (e.g., a call to foo) with exactly one definition. If zero definitions are found, you get “undefined reference”. If two are found, you get “multiple definition”.
    • Relocation: Once all code size is known, the linker assigns final runtime memory addresses to every instruction and variable. It then modifies the code instructions, patching the dummy addresses with the real ones.

Static vs. Dynamic Linking

Static linking copies the library code into your executable, creating a large, standalone binary. Dynamic linking keeps the library separate; the executable contains a “promise” to find the library at runtime. The Loader (part of the OS) fulfills this promise when you run the program, using the Dynamic Linker (ld-linux.so) to map the library into memory and resolve symbols lazily.
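
To make the runtime side concrete, here is a minimal C sketch of dynamic loading with dlopen/dlsym. It assumes glibc’s usual library name libm.so.6; everything else is standard dlfcn API.

#include <stdio.h>
#include <dlfcn.h>

int main(void) {
    /* Ask the dynamic linker to map libm into this process right now. */
    void *handle = dlopen("libm.so.6", RTLD_LAZY);
    if (!handle) { fprintf(stderr, "dlopen: %s\n", dlerror()); return 1; }

    /* Resolve the symbol "cos" inside the freshly mapped library. */
    double (*cosine)(double) = (double (*)(double))dlsym(handle, "cos");
    if (!cosine) { fprintf(stderr, "dlsym: %s\n", dlerror()); return 1; }

    printf("cos(0.0) = %f\n", cosine(0.0));   /* 1.000000 */
    dlclose(handle);
    return 0;
}

Build with gcc demo.c -ldl (the -ldl is unnecessary on modern glibc but harmless). This is exactly the “promise fulfilled at runtime” described above, made explicit.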

How this fits into the projects

You will explore this pipeline in Project 1, inspect ELF files in Project 10, and build your own linker in Project 18.

Definitions & Key Terms

  • Translation Unit: The output of the preprocessor; a single source file with all headers included.
  • Symbol Table: Metadata in an object file that maps names (functions/globals) to their locations.
  • Relocation Entry: A specific instruction in the object file that the linker must patch with a final address.
  • ABI (Application Binary Interface): The binary-level contract (calling convention, alignment, type sizes) that allows modules to interoperate.

Mental Model Diagram

         [Text]         [Text]           [Text]            [Binary]         [Binary]
Source (.c) ──> Preproc ──> Trans Unit ──> Compiler ──> ASM (.s) ──> Assembler ──> Object (.o)
                                                                                         │
                                                                       Libraries (.a) ───┤
                                                                                         ▼
                                                                                      Linker
                                                                                         │
                                                                                         ▼
                                                                                  Executable (ELF)

How it Works (Step-by-Step)

  1. Driver: User runs gcc main.c.
  2. CPP: main.c -> main.i. All #defines replaced.
  3. CC1: main.i -> main.s. C logic becomes x86-64 logic.
  4. AS: main.s -> main.o. Machine code generated. Offsets for functions are 0x0.
  5. LD: main.o + libc.so references -> a.out. Addresses assigned (e.g., main starts at 0x401000).
  6. Loader: User runs ./a.out. OS creates process, maps file, jumps to _start.

Common Misconceptions

  • Misconception: “The compiler handles libraries.” -> Truth: The compiler only checks headers. The linker handles libraries.
  • Misconception: “Compilation is one step.” -> Truth: It is four, and you can stop at any of them to debug.
  • Misconception: “Header files are compiled.” -> Truth: No, they are text-pasted into .c files. You can’t compile a .h file directly to an object.

Check-Your-Understanding

  1. If you delete the definition of puts but keep the declaration, which stage fails?
  2. If you have a syntax error in a macro, which stage reports it?
  3. Does a relocatable object file (.o) contain absolute or relative addresses?

Answers

  1. The Linker. The compiler is happy as long as the declaration exists.
  2. The Compiler (usually), though the error originated in the preprocessor’s expansion.
  3. Relative (or zero-based). It assumes it starts at 0.

Real-World Applications

  • Hot Patching: Modifying binary code in memory to fix bugs without restarting.
  • Plugin Systems: Using dlopen / dlsym to load code at runtime (dynamic linking).
  • Build Systems: CMake and Make exist entirely to orchestrate this pipeline efficiently.

Key Insight

The “undefined reference” error is just a database lookup failure in the Symbol Table.

Summary

Translation is a pipeline of lowering abstractions. Preprocessing handles text, compilation handles logic, assembly handles machine encoding, and linking handles memory layout and dependencies.

Homework

Run gcc -v hello.c. Capture the output and identify the path to cc1, as, and collect2 (the linker driver).

Homework Solution

$ gcc -v hello.c 2>&1 | grep -E "(cc1|as|collect2)"
 /usr/lib/gcc/x86_64-linux-gnu/11/cc1 -quiet -v -imultiarch x86_64-linux-gnu hello.c ...
 as -v --64 -o /tmp/ccXXXXXX.o /tmp/ccXXXXXX.s
 /usr/lib/gcc/x86_64-linux-gnu/11/collect2 -plugin /usr/lib/gcc/x86_64-linux-gnu/11/liblto_plugin.so ...

Note: collect2 is GCC’s wrapper around the system linker (ld). The actual paths vary by system.

References

  • CS:APP 3e - Chapters 1, 7 (primary source)
  • Practical Binary Analysis by Dennis Andriesse - Ch. 1-2 (ELF format deep dive)
  • Low-Level Programming by Igor Zhirkov - Ch. 3-4 (compilation pipeline)
  • Linkers and Loaders by John Levine - Classic reference on linking
  • GNU Binutils documentation: https://sourceware.org/binutils/docs/

Chapter 2: Data Representation

Fundamentals

To a computer, “type” is an illusion. Memory is just a vast ocean of bits. Whether a sequence of 32 bits represents the integer 1,065,353,216, the floating-point number 1.0, or two assembly instructions depends entirely on context—specifically, how the CPU is instructed to read it. Systems programmers must master these low-level encodings because hardware does not protect you from misinterpreting them. Understanding bitwise operations, integer overflow, and floating-point inexactness is mandatory for writing secure and reliable software.

Deep Dive into Data Representation

We focus on three primary encodings: Unsigned Integers, Signed Integers (Two’s Complement), and Floating Point (IEEE 754).

  • Unsigned Integers: These are straightforward binary place-value systems. A $w$-bit unsigned number maps bits $\vec{x}$ to $\sum_{i=0}^{w-1} x_i 2^i$. They wrap around modulo $2^w$.
  • Signed Integers (Two’s Complement): This is the standard for representing negative numbers. The most significant bit (MSB) has a negative weight. A $w$-bit number maps to $-x_{w-1}2^{w-1} + \sum_{i=0}^{w-2} x_i 2^i$. This genius design allows the same hardware adder circuit to handle both signed and unsigned addition. However, it introduces asymmetry: TMin (e.g., -128) has no positive counterpart (max is +127). This leads to the famous abs(INT_MIN) bug.
  • Floating Point (IEEE 754): Floats are NOT real numbers. They are a finite, discrete subset of reals. A float is encoded as $V = (-1)^s \times M \times 2^E$.
    • Sign (s): 1 bit.
    • Exponent (E): Encoded with a bias. Controls the range.
    • Mantissa (M): The fractional part. Controls the precision.
    • Floats have gaps. As numbers get larger, the gap between representable numbers increases. $0.1 + 0.2 \ne 0.3$ because 0.1 cannot be exactly represented in binary (see the sketch after this list).
  • Endianness: Multi-byte objects (like int, long) must be stored in memory. Little Endian (x86) stores the least significant byte at the lowest address. Big Endian (Network order) stores the most significant byte first.
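
The float gaps mentioned above can be checked in a few lines; a minimal sketch (the printed values assume IEEE 754 doubles, which x86-64 uses):

#include <stdio.h>

int main(void) {
    double a = 0.1 + 0.2;
    printf("%.17g\n", a);              /* 0.30000000000000004 */
    printf("%d\n", a == 0.3);          /* 0: the sum is not exactly 0.3 */

    /* Gaps grow with magnitude: above 2^53, doubles cannot represent
       every integer, so adding 1.0 rounds back to the same value. */
    double big = 9007199254740992.0;   /* 2^53 */
    printf("%d\n", big + 1.0 == big);  /* 1 */
    return 0;
}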

How this fits into the projects

You will build a tool to visualize this in Project 2 and enforce bitwise constraints in Project 3.

Definitions & Key Terms

  • Word Size: The natural size of a pointer/machine word (64 bits on modern machines; note that C’s int usually remains 32 bits).
  • Two’s Complement: The dominant encoding for signed integers.
  • NaN (Not a Number): A special float bit pattern resulting from invalid operations (e.g., sqrt(-1)).
  • Normalization: The standard form of floats where the leading binary digit is implicitly 1.

Mental Model Diagram

Memory Address:   0x100  0x101  0x102  0x103
Value (0x12345678):
  Big Endian:      [12]   [34]   [56]   [78]  (Reads left-to-right)
  Little Endian:   [78]   [56]   [34]   [12]  (Reads "backwards")

How it Works (Integer Casting)

Casting between signed and unsigned in C does not change the bits; it only changes how the compiler interprets them.

  short s = -1;                          // Bits: 1111 1111 1111 1111
  unsigned short u = (unsigned short)s;  // Same bits; value: 65535
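
A runnable version of this experiment (it assumes the usual 16-bit short of x86-64 Linux):

#include <stdio.h>

int main(void) {
    short s = -1;                          /* bits: 0xFFFF */
    unsigned short u = (unsigned short)s;  /* same bits, new lens */
    printf("s = %d, u = %u\n", s, u);      /* s = -1, u = 65535 */

    /* The comparison trap: in -1 > 0U, the -1 converts to unsigned
       (UINT_MAX) under the usual arithmetic conversions. */
    if (-1 > 0U)
        printf("-1 > 0U is true\n");       /* this branch is taken */
    return 0;
}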

Common Misconceptions

  • Misconception: “Signed overflow throws an exception.” -> Truth: In C, signed overflow is Undefined Behavior (UB). The compiler might assume it never happens and optimize away your safety checks.
  • Misconception: “Floats have constant precision.” -> Truth: Precision decreases as magnitude increases.

Check-Your-Understanding

  1. Why does (int)(unsigned)-1 print -1?
  2. Is x < y always equivalent to x - y < 0?
  3. What happens if you cast a double to an int?

Answers

  1. (unsigned)-1 is UINT_MAX (all ones). Casting back to int interprets “all ones” as -1 in Two’s Complement. The bits never changed.
  2. No! If x is INT_MIN and y is 1, x - y overflows to a large positive number.
  3. Truncation (rounding toward zero). If the double is too large, behavior is undefined.

Real-World Applications

  • Crypto: Relies heavily on precise bitwise operations and large integer arithmetic.
  • Networking: Protocols require converting host endianness to network endianness (htonl, ntohl).
  • Graphics: Floating point precision issues (“z-fighting”) cause flickering textures in games.

Key Insight

Types are just lenses for viewing bits. Casting changes the lens, not the scene (usually).

Summary

Information = Bits + Context. You must understand the limits of integer and float encodings to write safe code.

Homework

Write a function is_little_endian() that returns 1 if the machine is Little Endian, using a union or pointer casting.

Homework Solution

// Method 1: Union (exploits how unions share memory)
int is_little_endian_v1() {
    union { int i; char c; } u;
    u.i = 1;
    return u.c == 1;  // If little endian, LSB is at lowest address
}

// Method 2: Pointer casting
int is_little_endian_v2() {
    int x = 1;
    return *((char*)&x) == 1;  // Cast int* to char*, read first byte
}

On x86-64 (little endian), both return 1. The value 0x00000001 is stored as [01][00][00][00] with 01 at the lowest address.

References

  • CS:APP 3e - Chapter 2 (primary source)
  • Write Great Code Vol. 1 by Randall Hyde - Ch. 4-6 (data representation)
  • Effective C by Robert Seacord - Ch. 5 (integers and overflow)
  • IEEE 754-2019 Standard: https://ieeexplore.ieee.org/document/8766229
  • “What Every Computer Scientist Should Know About Floating-Point Arithmetic” by David Goldberg

Chapter 3: Machine-Level Programming

Fundamentals

The CPU does not know C. It does not understand while loops, structs, or objects. It understands a linear stream of simple instructions: move data, do math, jump to address. Machine-level programming (Assembly) is the study of how high-level constructs are mapped onto this rigid hardware reality. Mastering this allows you to reverse-engineer binaries, debug “impossible” crashes, and understand exactly how function calls and the stack work.

Deep Dive into x86-64

We focus on x86-64, a Complex Instruction Set Computer (CISC) architecture.

  • Register File: The CPU has 16 general-purpose 64-bit registers (%rax, %rbx, %rcx, %rdx, %rsi, %rdi, %rbp, %rsp, %r8-%r15). These are the fastest memory in the system.
  • Operand Forms: Instructions can operate on:
    • Immediates: Constants ($0x4).
    • Registers: Values in registers (%rax).
    • Memory: Values at an address ((%rax)).
  • The Stack: A region of memory used for function calls. It grows downward (from high addresses to low). %rsp (Stack Pointer) points to the top element. push decrements %rsp and writes; pop reads and increments.
  • Calling Convention (System V AMD64): This is the “law” of function calls.
    • Args: First 6 arguments go in %rdi, %rsi, %rdx, %rcx, %r8, %r9. Remaining go on the stack.
    • Return: Value goes in %rax.
    • Saved Registers: The “Callee” must preserve %rbx, %rbp, %r12-%r15. The “Caller” must save others if it needs them.
  • Control Flow: if, while, and for are implemented using cmp (compare) and jX (jump if condition) instructions. A switch statement is often compiled into a Jump Table (an array of code addresses) for O(1) performance.

How this fits into the projects

You will debug crashes in Project 4, reverse engineer binaries in Project 5, and perform buffer overflow attacks in Project 6.

Definitions & Key Terms

  • PC (Program Counter): The register (%rip in x86-64) holding the address of the next instruction.
  • Stack Frame: The slice of the stack allocated for a single function call. Contains local vars, saved registers, and the return address.
  • LEA (Load Effective Address): An instruction meant for address math, but often used for fast arithmetic (e.g., lea (%rdi,%rdi,2), %rax computes x*3).

Mental Model Diagram

      Stack (High Address)
      ┌──────────────────┐
      │ Return Address   │  <- Pushed by 'call'
      ├──────────────────┤
      │ Saved %rbp       │  <- Pushed by function prologue
      ├──────────────────┤
      │ Local Variables  │  <- Space made by 'sub $N, %rsp'
      ├──────────────────┤
      │ ...              │
      └──────────────────┘  <- %rsp (Stack Pointer)
      (Low Address)

How it Works (The Function Call)

  1. Caller: Puts args in registers. Executes call func.
    • call pushes %rip (return address) to stack and jumps to func.
  2. Callee (Prologue): push %rbp; mov %rsp, %rbp. (Sets up frame).
  3. Callee (Body): Does work. Uses stack for locals if registers aren’t enough.
  4. Callee (Epilogue): leave (restores stack/rbp); ret.
    • ret pops the return address into %rip, resuming the caller (see the sketch after this list).
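
You can watch the address pushed in step 1 from inside C. This is a sketch using a GCC/Clang builtin; the noinline attribute keeps the call (and thus the frame) real even under optimization:

#include <stdio.h>

/* __builtin_return_address(0) fetches the return address that 'call'
   pushed onto the stack for the current frame. */
__attribute__((noinline)) static void where_am_i(void) {
    printf("will return to %p\n", __builtin_return_address(0));
}

int main(void) {
    where_am_i();                          /* prints an address inside main */
    printf("main starts at %p\n", (void *)main);
    return 0;
}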

Common Misconceptions

  • Misconception: “Local variables are initialized to zero.” -> Truth: They contain whatever stack garbage was there before.
  • Misconception: “The stack is secure.” -> Truth: It is just writable memory. If you write past an array, you overwrite the Return Address (Buffer Overflow), hijacking control flow.

Check-Your-Understanding

  1. Where is the return address stored?
  2. If function A calls function B, which registers must B save before using?
  3. What does sub $16, %rsp do?

Answers

  1. On the stack, directly above the saved base pointer.
  2. The Callee-Saved registers: %rbx, %rbp, %r12-%r15.
  3. It allocates 16 bytes of space on the stack (grows downward).

Real-World Applications

  • Debugging: Reading stack traces requires understanding frames.
  • Performance: Passing structs by value is slow (copies to stack); passing by pointer is fast (register).
  • Security: ROP (Return Oriented Programming) exploits rely on chaining assembly gadgets.

Key Insight

C is just a thin wrapper around assembly. The “Stack” is a convention, not a hardware law.

Summary

Machine code is the ground truth. Understanding the stack discipline and register conventions is the key to debugging and security.

Homework

Compile a C function long plus(long a, long b) { return a+b; } with gcc -S and identify the registers used.

Homework Solution

$ echo 'long plus(long a, long b) { return a+b; }' > plus.c
$ gcc -S -O1 plus.c -o plus.s
$ cat plus.s

Expected output (key lines):

plus:
    leaq    (%rdi,%rsi), %rax   # a is in %rdi, b is in %rsi
    ret                         # result returned in %rax

Registers used: %rdi (first arg), %rsi (second arg), %rax (return value). Note: the compiler emits lea here because it acts as a three-operand add, computing %rdi + %rsi directly into %rax without clobbering either argument or the condition codes.

References

  • CS:APP 3e - Chapter 3 (primary source)
  • Hacking: The Art of Exploitation by Jon Erickson - Ch. 2 (assembly basics)
  • Debugging x86-64 Assembly with GDB - Practical debugging guide
  • GEF (GDB Enhanced Features) - Essential GDB extension
  • System V AMD64 ABI: https://refspecs.linuxbase.org/elf/x86_64-abi-0.99.pdf
  • Intel x86-64 Manual Vol. 1: https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html

Chapter 4: Processor Architecture & Optimization

Fundamentals

We treat the CPU as a black box that runs instructions sequentially. In reality, modern CPUs are massively parallel, pipelined, superscalar beasts. They execute many instructions at once, guess the future (speculation), and execute instructions out of order. Understanding this microarchitecture is essential for writing high-performance code.

Deep Dive into Architecture

  • Pipelining: Like a car assembly line. Instead of doing one instruction start-to-finish, the CPU breaks it into stages: Fetch, Decode, Execute, Memory, Writeback. Ideally, if there are 5 stages, you get 5x throughput.
  • Hazards: Things that break the pipeline.
    • Data Hazard: Instruction B needs a result from A, but A isn’t done. The pipeline must Stall (wait) or use Forwarding (pass data internally).
    • Control Hazard: The CPU doesn’t know if a conditional jump will be taken.
  • Branch Prediction: To solve control hazards, the CPU guesses. “I bet this if is true.” It executes speculatively. If wrong, it must Flush the pipeline (throw away work), costing 15-40 cycles.
  • Superscalar: Having multiple pipelines. Can issue add and mul in the same cycle.
  • Optimization Blockers:
    • Aliasing: Two pointers might point to the same memory. The compiler can’t optimize effectively because writing to A might change B (see the restrict sketch after this list).
    • Procedure Calls: The compiler treats functions as black boxes (mostly).
    • Unpredictable Branches: Sorting data makes code faster because branches become predictable.
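
A minimal sketch of the aliasing blocker. The function names are illustrative; compare the gcc -O2 -S output of the two versions:

/* Without restrict, the compiler must assume dest might point into
   src[], so every '*dest += ...' forces a store plus a reload. */
void sum_to(long *dest, const long *src, int n) {
    *dest = 0;
    for (int i = 0; i < n; i++)
        *dest += src[i];
}

/* restrict promises that dest and src never overlap, so the compiler
   is free to keep the running sum in a register and store it once. */
void sum_to_fast(long *restrict dest, const long *restrict src, int n) {
    *dest = 0;
    for (int i = 0; i < n; i++)
        *dest += src[i];
}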

How this fits into the projects

You will build a CPU simulator in Project 7 and optimize code in Project 8.

Definitions & Key Terms

  • Latency: Time to complete one instruction.
  • Throughput: Number of instructions completed per second.
  • CPE (Cycles Per Element): Metric for measuring loop performance.
  • Loop Unrolling: Manually writing out loop bodies to reduce overhead and increase parallelism.

Mental Model Diagram (Pipeline)

Clock Cycle:   1   2   3   4   5
Instr 1:      [F] [D] [E] [M] [W]
Instr 2:          [F] [D] [E] [M]
Instr 3:              [F] [D] [E]  <-- Parallel Execution!

How it Works (Branch Prediction)

If you have if (x < 0), the branch predictor guesses an outcome, say false, and speculatively executes the else code. If x turns out to be -1, the CPU detects the misprediction, wipes the pipeline, and fetches the if path instead. This penalty is why branchless programming (using bitwise math instead of if) can be faster.
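
A classic branchless example, sketched under the assumption of arithmetic right shift for signed ints (what gcc/clang do on x86-64):

#include <stdio.h>

/* Branchy: the predictor must guess the sign of x. */
int abs_branchy(int x) {
    if (x < 0) return -x;
    return x;
}

/* Branchless: mask is all ones when x is negative, all zeros otherwise,
   so (x + mask) ^ mask negates only negative inputs. No jump to mispredict.
   Caveats: right-shifting a negative int is implementation-defined, and
   INT_MIN still has no positive counterpart (the abs bug from Chapter 2). */
int abs_branchless(int x) {
    int mask = x >> 31;
    return (x + mask) ^ mask;
}

int main(void) {
    printf("%d %d\n", abs_branchy(-7), abs_branchless(-7));  /* 7 7 */
    return 0;
}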

Common Misconceptions

  • Misconception: “Fewer lines of C code = Faster.” -> Truth: Not necessarily. Complex one-liners might generate terrible assembly.
  • Misconception: “The CPU executes in order.” -> Truth: It fetches in order, but executes out of order (OoO) to fill idle units.

Check-Your-Understanding

  1. Why is a linked list traversal often slower than an array traversal (ignoring cache)?
  2. What is the cost of a branch misprediction?
  3. Why does restrict keyword help optimization?

Answers

  1. Data dependency. You can’t load node B until you read node A’s next pointer. This kills instruction-level parallelism.
  2. Pipeline flush. Usually 15-40 cycles of lost work.
  3. It tells the compiler that pointers don’t alias, allowing it to cache values in registers.

Real-World Applications

  • HFT (High-Frequency Trading): Optimizing to the cycle level to beat competitors.
  • Game Engines: Organizing data to maximize parallel processing (Data-Oriented Design).

Key Insight

To make code fast, feed the pipeline. Avoid dependencies and unpredictable branches.

Summary

The CPU is a factory. Your job is to keep the conveyor belts full.

Homework

Write a loop that sums an array. Compare performance of standard loop vs. unrolled loop (4x).

Homework Solution

// Standard loop
long sum_std(long *arr, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++) sum += arr[i];
    return sum;
}

// 4x unrolled loop
long sum_unrolled(long *arr, int n) {
    long sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
    int i;
    for (i = 0; i < n - 3; i += 4) {
        sum0 += arr[i];   sum1 += arr[i+1];
        sum2 += arr[i+2]; sum3 += arr[i+3];
    }
    for (; i < n; i++) sum0 += arr[i];  // Handle remainder
    return sum0 + sum1 + sum2 + sum3;
}

Benchmark with time or perf stat. The unrolled version is typically 20-40% faster due to reduced loop overhead and increased ILP (4 independent additions per iteration).

References

  • CS:APP 3e - Chapters 4-5 (primary source)
  • Computer Organization and Design by Patterson & Hennessy - Ch. 4 (pipelining)
  • Agner Fog’s Optimization Manuals: https://www.agner.org/optimize/
  • Performance Engineering of Software Systems (MIT OCW) - Excellent performance course

Chapter 5: The Memory Hierarchy

Fundamentals

Programmers want memory that is infinite, fast, and cheap. Physics says you can pick two. The solution is the Memory Hierarchy: a pyramid of storage technologies. Small, fast memory at the top (Registers/L1 Cache); huge, slow memory at the bottom (Disk). The system works because of Locality: programs tend to access the same data, or nearby data, repeatedly.

Deep Dive into Caches

  • Hierarchy Levels:
    • Registers: 0 cycles.
    • L1 Cache: ~4 cycles. (SRAM)
    • L2/L3 Cache: ~10-50 cycles. (SRAM)
    • Main Memory (DRAM): ~100-200 cycles.
    • Disk (SSD/HDD): Millions of cycles.
  • Cache Structure: Caches are divided into Sets, containing Lines (Blocks). A memory address is split into Tag, Set Index, and Block Offset.
  • Miss Types:
    • Compulsory: First access (Cold).
    • Conflict: Multiple addresses map to the same set, evicting each other.
    • Capacity: The Working Set is larger than the cache.
  • Locality:
    • Temporal: Using i in a loop (same address, many times).
    • Spatial: Iterating an array a[i] then a[i+1] (nearby addresses).

How this fits into the projects

You will simulate a cache in Project 9 and write a malloc allocator in Project 14.

Definitions & Key Terms

  • Cache Line: The unit of transfer (usually 64 bytes). You never fetch just a byte; you fetch a line.
  • Hit Rate: % of accesses found in cache.
  • Eviction: Throwing out a line to make room for new data (policies: LRU, Random).
  • Write-Back: Only writing to RAM when a modified line is evicted (faster).

Mental Model Diagram

Address: [ Tag  |  Index  |  Offset ]
                    │
                    ▼
           [ Select Set ]
           /      |      \
      Line 1    Line 2   Line 3 ...
      (Tag?)    (Tag?)   (Tag?)
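
A small sketch of the address split shown above; the 6/6 bit geometry below (64-byte lines, 64 sets) is illustrative, not queried from real hardware:

#include <stdio.h>
#include <stdint.h>

/* Illustrative cache geometry: 64B lines -> 6 offset bits,
   64 sets -> 6 index bits. */
#define OFFSET_BITS 6
#define INDEX_BITS  6

int main(void) {
    uintptr_t addr = 0x7ffc1234abcd;
    uintptr_t offset = addr & ((1UL << OFFSET_BITS) - 1);
    uintptr_t index  = (addr >> OFFSET_BITS) & ((1UL << INDEX_BITS) - 1);
    uintptr_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    printf("addr=0x%jx  tag=0x%jx  index=%ju  offset=%ju\n",
           (uintmax_t)addr, (uintmax_t)tag,
           (uintmax_t)index, (uintmax_t)offset);
    return 0;
}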

How it Works (The Stride Issue)

Iterating a 2D array int a[1024][1024] row-by-row is fast (Stride-1, Spatial Locality). Iterating column-by-column is slow (Stride-1024). You jump 4KB every step, missing the cache line you just fetched.
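
A minimal side-by-side you can time yourself (illustrative; absolute times vary by machine, so compile with gcc -O1 and compare the ratio):

#include <stdio.h>
#include <time.h>

#define N 1024
static int a[N][N];   /* 4 MB: larger than L1/L2 on most machines */

/* Row-major: consecutive addresses, one miss per 16 ints (64B line). */
long sum_rows(void) {
    long s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-major: each access jumps N * 4 = 4096 bytes, landing on a
   new cache line (and a new page) every step. */
long sum_cols(void) {
    long s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void) {
    clock_t t0 = clock();
    long r = sum_rows();
    clock_t t1 = clock();
    long c = sum_cols();
    clock_t t2 = clock();
    printf("rows: %.2f ms   cols: %.2f ms   (sums: %ld %ld)\n",
           (t1 - t0) * 1000.0 / CLOCKS_PER_SEC,
           (t2 - t1) * 1000.0 / CLOCKS_PER_SEC, r, c);
    return 0;
}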

Common Misconceptions

  • Misconception: “Memory access is O(1).” -> Truth: It varies by 100x depending on cache hits.
  • Misconception: “Linked lists are fine.” -> Truth: They scramble memory patterns, killing spatial locality (Pointer Chasing).

Check-Your-Understanding

  1. Why is a cache block size of 1 byte a bad idea?
  2. What happens to the cache when you context switch?
  3. If index bits = 4, how many sets are there?

Answers

  1. No spatial locality benefit. Metadata overhead would be huge.
  2. It gets “polluted” by the new process. When you switch back, you suffer cold misses.
  3. $2^4 = 16$ sets.

Real-World Applications

  • Matrix Multiplication: Tiled algorithms exist solely to fit blocks into L1 cache.
  • Databases: Designed to minimize disk seeks and maximize page cache hits.

Key Insight

Locality is King. Algorithms with the same Big-O can differ in speed by 10x-100x due to cache behavior.

Summary

The cache is a hardware hash table that you cannot control directly, but you can influence by accessing memory predictably.

Homework

Write a program that accesses an array with stride 1, 2, 4, … 1024. Measure the time per access. Plot it. You will see the cache cliffs.

Homework Solution

#include <stdio.h>
#include <time.h>
#define SIZE (1024 * 1024 * 64)  // 64M longs = 512MB

volatile long sink;  // defeats dead-code elimination of the sum loop

int main() {
    static long arr[SIZE];
    for (int stride = 1; stride <= 1024; stride *= 2) {
        clock_t start = clock();
        long sum = 0;
        for (int i = 0; i < SIZE; i += stride) sum += arr[i];
        clock_t end = clock();
        sink = sum;
        double time_per = (double)(end - start) / CLOCKS_PER_SEC / (SIZE / stride);
        printf("Stride %4d: %.2f ns/access\n", stride, time_per * 1e9);
    }
}

Expected output pattern:

Stride    1: 1.2 ns/access   <- L1 cache hit
Stride    2: 1.3 ns/access
Stride    4: 1.5 ns/access
Stride    8: 2.0 ns/access   <- every access starts a new 64B line (8 longs per line)
Stride   16: 3.5 ns/access   <- L2 miss starts
Stride   32: 5.0 ns/access
...
Stride 1024: 80+ ns/access   <- RAM access (TLB misses too)

The “cliff” around stride 8 appears once each access lands on a new cache line (64-byte line / 8-byte longs = 8 longs per line).

References

  • CS:APP 3e - Chapter 6 (primary source)
  • OSTEP - Chapters 18-22 (paging and caching)
  • Intel Optimization Manual - Section on memory subsystem
  • “What Every Programmer Should Know About Memory” by Ulrich Drepper (2007, still relevant)

Chapter 6: Virtual Memory (VM)

Fundamentals

Virtual Memory is the great illusionist. It gives every process the belief that it has a massive, contiguous chunk of private memory starting at address 0. In reality, physical RAM is fragmented, shared, and possibly full. VM adds a layer of indirection: the CPU translates Virtual Addresses to Physical Addresses for every single instruction.

Deep Dive into VM

  • Paging: Memory is divided into fixed-size chunks called Pages (usually 4KB).
  • Page Tables: A data structure in RAM (managed by the OS) that maps Virtual Page Numbers (VPN) to Physical Page Numbers (PPN). Each process has its own page table.
  • MMU (Memory Management Unit): Hardware that performs the translation.
  • TLB (Translation Lookaside Buffer): A cache inside the MMU that stores recent translations. Without it, every memory access would need extra RAM accesses to walk the page table (up to four levels on x86-64), at least doubling memory traffic.
  • Page Faults:
    • Hard Fault: Page is on disk (Swap). OS pauses process, fetches from disk.
    • Soft Fault: Page is in RAM but not marked present (e.g., Copy-on-Write).
    • Segfault: Accessing unmapped/forbidden memory.
  • Features Enabled by VM:
    • Protection: Read-only bits prevent code overwrites.
    • Sharing: libc is loaded once in RAM, mapped into all processes.
    • Copy-on-Write: fork() is fast because it shares pages initially.

How this fits into the projects

You will visualize this in Project 13 and simulate it in Project 19.

Definitions & Key Terms

  • Swap: Using disk space to extend RAM. Slow.
  • Thrashing: When the working set > RAM, and the OS spends all time swapping pages in and out.
  • VIPT (Virtually Indexed, Physically Tagged): How L1 caches interact with VM for speed.

Mental Model Diagram

Virtual Addr:  [ VPN  | Offset ]
                  │       │
             [Page Table] └───┐
                  │           │
Physical Addr: [ PPN  | Offset ]

How it Works (Fork)

When fork() is called, the OS duplicates the page table, not the RAM, and marks writable pages as read-only in both processes. If the child then writes, the MMU raises a protection fault. The OS catches it, copies that specific page to a new physical frame, updates the child’s page table, and resumes the write.
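
A short sketch you can run to see both the shared virtual address and the private copy:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int shared_value = 42;   /* lives on a page shared at fork time */

int main(void) {
    if (fork() == 0) {               /* child */
        shared_value = 99;           /* write -> fault -> private copy */
        printf("child:  addr=%p value=%d\n",
               (void *)&shared_value, shared_value);
        exit(0);
    }
    wait(NULL);                      /* let the child print first */
    /* Same virtual address, original value: the child's write landed
       on its own physical frame, not ours. */
    printf("parent: addr=%p value=%d\n",
           (void *)&shared_value, shared_value);
    return 0;
}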

Common Misconceptions

  • Misconception: “malloc allocates RAM.” -> Truth: It allocates virtual address space. RAM is only assigned when you touch the pages (Demand Paging). (See the sketch after this list.)
  • Misconception: “Different pointers mean different physical memory.” -> Truth: Different virtual addresses can map to the same physical frame (Aliasing/Shared Memory).
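
The following Linux-specific sketch makes the demand-paging point visible by reading the resident-set size from /proc/self/statm (field 2, in pages). Whether malloc uses an untouched mmap region for this size is allocator-dependent, but glibc typically does:

#include <stdio.h>
#include <stdlib.h>

/* Fields of /proc/self/statm: size resident shared text lib data dt. */
static long resident_pages(void) {
    long size, resident;
    FILE *f = fopen("/proc/self/statm", "r");
    if (!f || fscanf(f, "%ld %ld", &size, &resident) != 2) exit(1);
    fclose(f);
    return resident;
}

int main(void) {
    size_t len = 64 * 1024 * 1024;            /* 64 MB of address space */
    printf("before malloc: %ld resident pages\n", resident_pages());

    char *p = malloc(len);                    /* reserves addresses only */
    if (!p) return 1;
    printf("after malloc:  %ld resident pages\n", resident_pages());

    for (size_t i = 0; i < len; i += 4096)    /* touch each 4KB page */
        p[i] = 1;
    printf("after touch:   %ld resident pages\n", resident_pages());

    free(p);
    return 0;
}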

Check-Your-Understanding

  1. What causes a Segmentation Fault?
  2. Why is the TLB critical for performance?
  3. If a page size is 4KB, how many bits are the offset?

Answers

  1. Hardware raises a fault when translation fails or permissions are violated. OS sends SIGSEGV.
  2. It avoids reading the Page Table from slow RAM for every instruction.
  3. 12 bits ($2^{12} = 4096$).

Real-World Applications

  • Sandboxing: Isolating processes so a crash in one doesn’t kill the OS.
  • Memory Mapped Files: mmap allows treating files like arrays.

Key Insight

All pointers are lies. They are handles to a lookup table.

Summary

VM provides Isolation, Efficiency, and Protection by decoupling software view from hardware reality.

Homework

Use pmap <pid> to look at the memory map of a running process. Identify the heap, stack, and shared libraries.

Homework Solution

$ sleep 1000 &
[1] 12345
$ pmap 12345
12345:   sleep 1000
0000555555554000     32K r-x-- sleep          <- Code (.text)
000055555575b000      4K r---- sleep          <- Read-only data
000055555575c000      4K rw--- sleep          <- Writable data (.data, .bss)
0000555555580000    132K rw---   [ anon ]     <- HEAP
00007ffff7c00000   2048K r-x-- libc.so.6      <- libc code (shared)
00007ffff7e00000    256K r---- libc.so.6      <- libc rodata
00007ffff7e40000    128K rw--- libc.so.6      <- libc writable
...
00007ffffffde000    132K rw---   [ stack ]    <- STACK (grows down)

Key observations:

  • Heap starts after the program’s data segments (low addresses)
  • Stack is at very high addresses (~0x7fffff…)
  • Shared libraries (.so) are mapped between heap and stack
  • Each region has permissions: r (read), w (write), x (execute)

References

  • CS:APP 3e - Chapter 9 (primary source)
  • The Linux Programming Interface by Michael Kerrisk - Ch. 48-50 (VM)
  • LWN.net: Memory part 3: Virtual Memory
  • Linux Kernel Documentation on page tables: https://www.kernel.org/doc/gorman/html/understand/understand006.html
  • OSTEP - Chapters 13-23 (free online)

Chapter 7: Exceptional Control Flow (ECF)

Fundamentals

Code does not execute in a vacuum. It must react to the outside world: network packets arriving, disks finishing writes, user pressing Ctrl+C, or division by zero. ECF is the mechanism for abrupt changes in the control flow, handling events at the Hardware, OS, and User levels.

Deep Dive into ECF

  • Exceptions (Hardware Level):
    • Interrupts: Async, from external hardware (Timer, I/O). Handler returns to next instruction.
    • Traps: Sync, intentional (Syscalls). Handler returns to next instruction.
    • Faults: Sync, errors (Page Fault). Handler retries current instruction.
    • Aborts: Sync, fatal errors (Hardware fail). Process dies.
  • Context Switch (OS Level): The OS uses a timer interrupt to slice time. It saves the register state of Process A, restores Process B, and updates the PC. This creates the illusion of multitasking.
  • Signals (User Level): The OS notifies a process of an event (e.g., SIGINT, SIGSEGV). A signal handler is a function that interrupts the main flow. This is user-space concurrency!
    • Async-Signal-Safety: You cannot call printf or malloc in a handler because they use locks. If the main thread held the lock when interrupted, the handler will deadlock waiting for it.

How this fits into the projects

You will master signals in Project 11 and build a shell in Project 12.

Definitions & Key Terms

  • Kernel Mode: CPU state allowing privileged instructions (I/O, MMU management).
  • Zombie Process: A child that finished but hasn’t been reaped (wait) by its parent. It consumes a PID.
  • Orphan Process: A child whose parent died. init (PID 1) adopts it.

Mental Model Diagram (Signal)

Main Thread           OS Kernel            Signal Handler
     │                    │                      │
     │ <--- Interrupt ----│                      │
     │                    │                      │
(Context Saved)           │----(Modify Stack)--->│
     │                    │                      │ (Runs)
     │                    │<----(Sigreturn)------│
     │                    │                      │
(Context Restored)        │                      │
     │                    │                      │
     ▼

How it Works (Syscall)

User calls read(). Wrapper puts syscall ID in %rax. Executes syscall instruction. CPU switches to Kernel Mode, jumps to trap handler. Kernel performs operation. Returns to User Mode.
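
You can issue the trap yourself through libc’s syscall(2) wrapper; a minimal sketch (run it under strace to watch the write appear at the kernel boundary):

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

int main(void) {
    /* Bypass the libc write() wrapper: syscall() puts SYS_write in
       %rax and the arguments in %rdi/%rsi/%rdx, then executes the
       syscall instruction. 31 is the length of the string. */
    syscall(SYS_write, 1, "hello from the kernel boundary\n", 31);
    return 0;
}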

Common Misconceptions

  • Misconception: “Signals are instant.” -> Truth: They are delivered when the OS schedules the process.
  • Misconception: “Ctrl+C kills the program.” -> Truth: It sends SIGINT. The default action is terminate, but you can catch it.

Check-Your-Understanding

  1. What happens if you don’t reap child processes?
  2. Why can’t you use printf in a signal handler?
  3. What is the difference between an Interrupt and a Trap?

Answers

  1. They become Zombies, leaking OS resources.
  2. It is not reentrant (uses global locks). Deadlock risk.
  3. Interrupts are external/async (hardware). Traps are internal/sync (instruction).

Real-World Applications

  • Shells: Managing foreground/background jobs (fg, bg).
  • Web Servers: Handling SIGPIPE when clients disconnect.

Key Insight

ECF is the glue between your code and the chaotic physical world.

Summary

Exceptions allow the system to react. Processes provide isolation. Signals provide notification.

Homework

Write a program that catches SIGINT, prints “Don’t touch me!”, and continues.

Homework Solution

#include <stdio.h>
#include <signal.h>
#include <unistd.h>

volatile sig_atomic_t got_signal = 0;

void handler(int sig) {
    // Note: write() is async-signal-safe, printf is NOT!
    const char msg[] = "Don't touch me!\n";
    write(STDOUT_FILENO, msg, sizeof(msg) - 1);
    got_signal = 1;
}

int main() {
    struct sigaction sa = {0};
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;  // Restart interrupted syscalls

    sigaction(SIGINT, &sa, NULL);

    printf("Try pressing Ctrl+C...\n");
    while (1) {
        pause();  // Wait for signals
        if (got_signal) {
            printf("Continuing after signal...\n");
            got_signal = 0;
        }
    }
    return 0;
}

Test: Press Ctrl+C multiple times. Program prints message but doesn’t exit.

References

  • CS:APP 3e - Chapter 8 (primary source)
  • The Linux Programming Interface by Michael Kerrisk - Ch. 20-22 (signals)
  • Advanced Programming in the UNIX Environment by Stevens - Ch. 10 (signals)
  • signal(7) man page: man 7 signal
  • List of async-signal-safe functions: signal-safety(7)

Chapter 8: System I/O & Networking

Fundamentals

The Unix Philosophy is “Everything is a File”. A file on disk, a pipe to another process, a network connection to a server in Tokyo—all are accessed via the same interface: read, write, open, close. Understanding this abstraction (and its leaks) is crucial for building robust servers.

Deep Dive into I/O

  • File Descriptors (FD): Small integers representing an open file. Standard: 0 (Stdin), 1 (Stdout), 2 (Stderr).
  • Robust I/O: read and write handle streams, not packets.
    • Short Counts: If you ask for 1000 bytes, read might return 500. You must loop until you get what you need. This happens constantly in networking.
  • Buffering:
    • User Buffering (FILE*): fprintf writes to a buffer in your heap. Flushes on newline or full.
    • Kernel Buffering: write copies data to the Page Cache. Disk write happens later (async).
  • Sockets: The file abstraction for networks.
    • Client: socket -> connect.
    • Server: socket -> bind (reserve port) -> listen -> accept.
  • Multiplexing (select/poll/epoll): How to handle 10,000 connections with one thread. “Wake me up when any of these 1000 FDs are ready.”

How this fits into the projects

You will build robust I/O tools in Project 15 and a web server in Project 17 and 20.

Definitions & Key Terms

  • Metadata: Information about a file (size, permissions) stored in the Inode.
  • Redirect: ls > file works by dup2(fd_file, STDOUT_FILENO). The program ls just writes to FD 1, unaware it’s a file.
  • Blocking I/O: The process sleeps until data arrives.

Mental Model Diagram (Echo Server)

Client                         Server
socket, connect  ───────────>  socket, bind, listen
     │                         accept (blocks)
     │ <─────────────────────> (Connection Est.)
write(req)       ───────────>  read(req)
read(resp)       <───────────  write(resp)

How it Works (The Short Count)

TCP breaks data into packets. If you send 1MB, it arrives in 1500-byte chunks. read() returns as soon as some data is available. It does not wait for the full 1MB. You must re-call read() in a loop.
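
Here is a robust read loop in the spirit of CS:APP’s RIO package; read_exactly is an illustrative name, not a standard function:

#include <unistd.h>
#include <errno.h>

/* Keep calling read() until we have n bytes, hit EOF, or hit a real
   error. Returns bytes read (< n only at EOF), or -1 on error. */
ssize_t read_exactly(int fd, void *buf, size_t n) {
    size_t left = n;
    char *p = buf;
    while (left > 0) {
        ssize_t r = read(fd, p, left);
        if (r < 0) {
            if (errno == EINTR) continue;  /* interrupted by a signal: retry */
            return -1;                     /* real error */
        }
        if (r == 0) break;                 /* EOF: peer closed */
        left -= r;
        p += r;
    }
    return n - left;
}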

Common Misconceptions

  • Misconception: “Write writes to disk.” -> Truth: It writes to kernel RAM. Power loss = Data loss (unless fsync).
  • Misconception: “TCP preserves message boundaries.” -> Truth: It is a stream of bytes. You need a protocol (newlines, length headers) to separate messages.

Check-Your-Understanding

  1. Why does printf not appear immediately?
  2. What does accept return?
  3. Why is epoll better than select?

Answers

  1. Stdout is line-buffered. Use fflush or \n.
  2. A new file descriptor for that specific connection. The listening socket stays open.
  3. select is O(N) (scans all FDs). epoll is O(1) (notified only of active FDs).

Real-World Applications

  • Nginx/Node.js: Event-driven I/O using epoll for massive scale.
  • Databases: Using O_DIRECT to bypass kernel cache for raw disk access.

Key Insight

The network is a file that fails, hangs, and delivers partial data. Coding for the “Happy Path” guarantees failure.

Summary

Robust I/O requires handling partials, errors, and buffering discipline.

Homework

Write a program that uses dup2 to redirect stdout to a file, then executes ls.

Homework Solution

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

int main() {
    // Open output file (create if doesn't exist, truncate if does)
    int fd = open("output.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); exit(1); }

    // Redirect stdout (fd 1) to our file
    if (dup2(fd, STDOUT_FILENO) < 0) { perror("dup2"); exit(1); }
    close(fd);  // No longer need the original fd

    // exec replaces this process with `ls`
    execlp("ls", "ls", "-la", NULL);
    perror("exec");  // Only reached if exec fails
    return 1;
}

After running: cat output.txt shows the ls -la output. The key insight is that dup2 makes fd 1 (stdout) point to our file, and exec preserves open file descriptors.

References

  • CS:APP 3e - Chapters 10-11 (primary source)
  • The Linux Programming Interface by Michael Kerrisk - Ch. 5-6 (file I/O)
  • Unix Network Programming Vol. 1 by Stevens - Sockets API
  • Beej’s Guide to Network Programming: https://beej.us/guide/bgnet/
  • Linux man pages: open(2), read(2), write(2), socket(7)

Chapter 9: Concurrency

Fundamentals

Concurrency is the art of doing multiple things at once (or appearing to). It is essential for utilizing multi-core CPUs and handling overlapping I/O. However, it introduces a new class of bugs: Race Conditions and Deadlocks, which are non-deterministic and terrifying to debug.

Deep Dive into Concurrency

  • Threads vs Processes:
    • Processes: Separate address spaces. Hard to share info (need IPC). Robust (crash doesn’t kill others). High overhead.
    • Threads: Shared address space (same heap/globals). Easy to share. Risky (one crash kills all). Low overhead.
  • Synchronization Primitives:
    • Mutex: Mutual Exclusion. “Only one thread can hold this lock.”
    • Semaphore: “Only N threads can pass.”
    • Condition Variable: “Sleep until X happens.”
  • The Race Condition: When the outcome depends on the timing of threads.
    • cnt++ is NOT atomic. It is Load -> Add -> Store. If two threads do this interleaved, one increment is lost.
  • Deadlock: Thread A holds Lock 1, wants Lock 2. Thread B holds Lock 2, wants Lock 1. Both wait forever.

How this fits into the projects

You will build a workbench in Project 16 and a Thread Pool in Project 21.

Definitions & Key Terms

  • Atomic Operation: An operation that cannot be interrupted. It happens entirely or not at all.
  • Critical Section: Code accessing shared resources that must be protected.
  • Thread Safety: A function is thread-safe if it works correctly when called by multiple threads simultaneously.

Mental Model Diagram (Race)

Thread 1:  Read(5) ---------> Add(6) -> Store(6)
Thread 2:      Read(5) -> Add(6) -> Store(6)
Result: 6 (Should be 7!)

How it Works (Mutex)

Hardware provides atomic instructions like CAS (Compare-And-Swap). A Mutex uses this to atomically check if a flag is 0. If yes, set to 1 (lock). If no, sleep.
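
A toy spinlock sketch using C11 atomics shows the idea (compile with -std=c11); real mutexes additionally sleep in the kernel (futex) instead of spinning:

#include <stdatomic.h>

/* Initialize with: spinlock_t l = { 0 }; */
typedef struct { atomic_int flag; } spinlock_t;

void spin_lock(spinlock_t *l) {
    int expected = 0;
    /* CAS: atomically "if flag == 0, set it to 1". On failure (someone
       else holds the lock) it writes the current value into 'expected',
       so reset it and retry. */
    while (!atomic_compare_exchange_weak(&l->flag, &expected, 1))
        expected = 0;
}

void spin_unlock(spinlock_t *l) {
    atomic_store(&l->flag, 0);   /* release the lock */
}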

Common Misconceptions

  • Misconception: “Volatile makes variables thread-safe.” -> Truth: No. It prevents compiler caching, but does not provide atomicity or memory barriers.
  • Misconception: “I tested it and it didn’t crash, so it’s safe.” -> Truth: Concurrency bugs are “Heisenbugs”. They hide during testing.

Check-Your-Understanding

  1. What are the 4 conditions for Deadlock?
  2. Why is a Semaphore initialized to 1 equivalent to a Mutex?
  3. What is the difference between join and detach?

Answers

  1. Mutual Exclusion, Hold & Wait, No Preemption, Circular Wait.
  2. It allows only 1 thread to enter (Binary Semaphore).
  3. join waits for a thread to finish. detach lets it run independently and cleans up automatically.

Real-World Applications

  • Game Engines: Physics, Rendering, and AI on separate threads.
  • Web Servers: Thread pools to handle requests.

Key Insight

Shared mutable state is the root of all evil. Minimize sharing.

Summary

Concurrency yields performance but demands rigorous discipline (locking protocols).

Homework

Write a program where two threads increment a shared counter 1 million times. Observe the result is < 2 million. Fix it with a mutex.

Homework Solution

#include <stdio.h>
#include <pthread.h>

#define ITERATIONS 1000000

long counter = 0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *increment_unsafe(void *arg) {
    for (int i = 0; i < ITERATIONS; i++) counter++;
    return NULL;
}

void *increment_safe(void *arg) {
    for (int i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main() {
    pthread_t t1, t2;

    // Unsafe version
    counter = 0;
    pthread_create(&t1, NULL, increment_unsafe, NULL);
    pthread_create(&t2, NULL, increment_unsafe, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("Unsafe: %ld (expected 2000000)\n", counter);  // Often ~1.5M!

    // Safe version
    counter = 0;
    pthread_create(&t1, NULL, increment_safe, NULL);
    pthread_create(&t2, NULL, increment_safe, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("Safe:   %ld (expected 2000000)\n", counter);  // Always 2000000

    return 0;
}

Compile: gcc -pthread race.c -o race The unsafe version loses increments because counter++ is actually 3 instructions: load, add, store.

References

  • CS:APP 3e - Chapter 12 (primary source)
  • OSTEP - Chapters 26-33 (concurrency, free online)
  • The Linux Programming Interface by Michael Kerrisk - Ch. 29-33 (pthreads)
  • C++ Concurrency in Action by Anthony Williams - Modern threading concepts
  • pthread(7) man page and POSIX Threads documentation

Glossary (High-Signal)

  • ABI: The binary contract for calls, registers, alignment, and data layout.
  • ASLR: Randomizes address space layout to make exploitation harder.
  • Async-signal-safe: Functions that are safe to call inside a signal handler (most are not).
  • Cache line: The unit of transfer between cache and memory (locality “lives” here).
  • Copy-on-write (COW): fork() shares pages until a write triggers a private copy.
  • ELF: Linux executable/object format (headers, sections, segments, symbols).
  • GOT/PLT: Indirection tables used for dynamic linking and late binding.
  • Heisenbug: A bug that disappears when you observe it (timing-sensitive).
  • Interposition: Overriding symbols at link/load time (e.g., LD_PRELOAD).
  • MMU: Hardware that translates virtual → physical addresses using page tables.
  • Page fault: CPU trap when a page is unmapped/forbidden; drives demand paging.
  • Race condition: Behavior depends on timing/interleavings (often a data race).
  • Relocation: Patching references to final addresses during linking/loading.
  • Short count: read/write returns fewer bytes than requested; you must loop.
  • Signal: Async notification delivered to a process (interrupt/exception-like).
  • Syscall: User → kernel boundary (read, write, mmap, fork, …).
  • TLB: Cache of page table translations (VM performance hinges on it).
  • Undefined behavior (UB): C behaviors the standard doesn’t define; compilers may “assume it never happens”.
  • Zombie: A dead child process not yet waited on (exit status retained).

Concept Summary Table

Concept Key Questions Danger Signs
Translation What does each stage produce? “Compiled but crashes”
Data Rep Why -1 > 0U? Silent corruption
Machine Code How does stack grow? Cannot debug crashes
Architecture What hazard is this? Random performance
Memory Cache hit rate? 100x slowdown
ECF/Processes What is a zombie? Hangs, orphans
I/O/Networking Short count? Data corruption
Concurrency Race condition? Heisenbugs

Deep Dive Reading By Concept

Concept CS:APP Supplementary
Translation Ch. 1, 7 Practical Binary Analysis, Low-Level Programming
Data Rep Ch. 2 Write Great Code Vol.1, Effective C
Machine Code Ch. 3 Hacking: Art of Exploitation
Architecture Ch. 4-5 Computer Organization and Design
Memory Ch. 6, 9 OSTEP
ECF/Processes Ch. 8 The Linux Programming Interface
I/O/Networking Ch. 10-11 Unix Network Programming
Concurrency Ch. 12 OSTEP, TLPI

Essential Book List: CS:APP, C Programming: A Modern Approach (King), Effective C (Seacord), OSTEP (free online), The Linux Programming Interface (Kerrisk), Low-Level Programming (Zhirkov)


Project-to-Concept Map

This table shows which Theory Primer chapters each project applies, helping you verify you’ve read the prerequisites before starting.

Project Primary Concepts CS:APP Chapters
P1: Build Pipeline Explorer Translation, Linking, ELF Ch. 1, 7
P2: Bitwise Data Inspector Data Representation, Encoding Ch. 2
P3: Data Lab Clone Bit Operations, Integer/Float Ch. 2
P4: Calling Convention Crash Cart Machine Code, Stack Frames Ch. 3
P5: Bomb Lab Workflow Machine Code, Control Flow Ch. 3
P6: Attack Lab Workflow Machine Code, Security Ch. 3
P7: Y86-64 CPU Simulator Processor Architecture Ch. 4-5
P8: Performance Clinic Optimization, Pipelining Ch. 5
P9: Cache Lab Simulator Memory Hierarchy, Locality Ch. 6
P10: ELF Link Map Toolkit Translation, Dynamic Linking Ch. 7
P11: Signals/Processes Sandbox ECF, Signals Ch. 8
P12: Unix Shell with Job Control ECF, Process Control Ch. 8
P13: VM Map Visualizer Virtual Memory Ch. 9
P14: Build Your Own Malloc Virtual Memory, Heap Ch. 9
P15: Robust Unix I/O Toolkit System I/O Ch. 10
P16: Concurrency Workbench Concurrency, Threads Ch. 12
P17: Capstone Proxy Server All Concepts Ch. 10-12
P18-P26: Extensions Deep Dives Various

Quick Start: Your First 48 Hours

If you’re eager to start building immediately, follow this accelerated path:

Day 1: Orientation (4-6 hours)

  1. Skim the Big Picture diagram and Introduction (15 min)
  2. Read Theory Primer Chapter 1: Translation (45 min)
  3. Start Project 1: Build Pipeline Explorer
  4. Get the first output: ./inspect-build hello.c showing preprocessor output

Day 2: First Complete Project (4-6 hours)

  1. Complete Project 1 with all stages visible
  2. Validate against the Definition of Done checklist
  3. Read Theory Primer Chapter 2: Data Representation
  4. Start Project 2: Bitwise Data Inspector—get it to show binary representation of integers

What You’ll Have After 48 Hours:

  • A working understanding of the GCC compilation pipeline
  • A tool that decodes how integers are stored in memory
  • Confidence that you can tackle the remaining projects

Common Day-1 Blockers:

  • “GCC flags are confusing” → Use gcc -v to see exactly what commands run
  • “I don’t understand ELF” → Just run readelf -h a.out and read the output
  • “Where does my program actually start?” → Hint: not main. It’s _start.

Learning Paths

Different backgrounds benefit from different project orderings:

Path 1: The CS Student (Following CS:APP Syllabus)

If you’re taking a course based on CS:APP or studying independently with the textbook:

P1 → P2 → P3 → P4 → P5 → P6 → P7 → P8 → P9 → P10 → P11 → P12 → P13 → P14 → P15 → P16 → P17

This mirrors the book’s chapter order and lab sequence.

Path 2: The Security Researcher

If your goal is vulnerability research, exploit development, or malware analysis:

P2 → P4 → P5 → P6 → P10 → P13 → P25 (Debugger) → P24 (Leak Detector)

Focus on memory layout, reverse engineering, and attack surfaces.

Path 3: The Performance Engineer

If you’re optimizing code for games, trading systems, or infrastructure:

P1 → P2 → P7 → P8 → P9 → P14 (Malloc) → P16 → P21 (Thread Pool) → P17

Emphasize architecture, caching, and concurrency.

Path 4: The Systems Developer

If you’re building operating systems, compilers, or low-level infrastructure:

P1 → P4 → P10 → P11 → P12 → P13 → P14 → P18 (Linker) → P19 (VM Sim) → P26 (OS Kernel)

Deep dive into OS internals and tooling.

Path 5: Weekend Warrior (Time-Limited)

If you have limited time and want maximum impact:

P1 → P2 → P4 → P9 → P14 → P17

6 projects covering the essential concepts. Complete in 2-3 months.


Success Metrics

How do you know you’ve internalized these concepts? Use these checkpoints:

Beginner Milestones (After Projects 1-6)

  • You can explain what happens during each compilation stage
  • You can predict integer overflow behavior in C
  • You can read x86-64 assembly and map it to C
  • You can debug a crash using only GDB and register values
  • You understand why buffer overflows are exploitable

Intermediate Milestones (After Projects 7-12)

  • You can explain pipeline hazards and their mitigations
  • You can predict cache miss rates for different access patterns
  • You can implement a job-control shell with proper signal handling
  • You know the difference between fork, exec, and clone
  • You can trace a system call from user space to kernel

Advanced Milestones (After Projects 13-17)

  • You can explain how virtual memory enables process isolation
  • You can implement a memory allocator with coalescing
  • You can write robust I/O code that handles partial reads/writes
  • You can debug race conditions using systematic techniques
  • You can build a concurrent server that doesn’t deadlock

Mastery Indicators

  • You read a crash dump and immediately suspect the root cause
  • You choose data structures based on cache locality, not just Big-O
  • You can explain why certain “obvious” optimizations don’t work
  • When code is slow, your first question is “what’s the memory access pattern?”

Project Overview Table

# Project Name Difficulty Time Key Concepts
1 Build Pipeline Explorer Level 1 4-8h Translation, ELF, Symbols
2 Bitwise Data Inspector Level 1 4-8h Encoding, Endianness
3 Data Lab Clone Level 2 10-15h Bit Ops, IEEE 754
4 Calling Convention Crash Cart Level 2 10-15h Stack, Registers, ABI
5 Bomb Lab Workflow Level 2 15-20h Reverse Engineering
6 Attack Lab Workflow Level 3 20-30h Exploits, ROP
7 Y86-64 CPU Simulator Level 3 30-40h Pipelining, Hazards
8 Performance Clinic Level 2 10-15h Optimization, ILP
9 Cache Lab Simulator Level 2 15-20h Cache, Locality
10 ELF Link Map Toolkit Level 2 10-15h Dynamic Linking, GOT/PLT
11 Signals/Processes Sandbox Level 2 10-15h Signals, fork/exec
12 Unix Shell with Job Control Level 3 25-35h Process Groups, TTY
13 VM Map Visualizer Level 2 10-15h Virtual Memory, mmap
14 Build Your Own Malloc Level 3 30-40h Heap, Fragmentation
15 Robust Unix I/O Toolkit Level 2 10-15h Buffering, Short Counts
16 Concurrency Workbench Level 3 20-30h Threads, Sync
17 Capstone Proxy Server Level 4 40-60h All Concepts
18 ELF Linker/Loader Level 4 40-60h Deep Linking
19 VM Simulator Level 3 25-35h Page Tables, TLB
20 HTTP Web Server Level 3 25-35h Sockets, CGI
21 Thread Pool Level 3 20-25h Producer-Consumer
22 Signal-Safe Printf Level 3 15-20h Reentrancy
23 Performance Profiler Level 3 20-25h Sampling, Signals
24 Memory Leak Detector Level 3 20-30h Interposition
25 ptrace Debugger Level 4 35-45h Breakpoints, Tracing
26 OS Kernel Capstone Level 5 60-100h Bare Metal

Project List

Projects P1–P17 are the core CS:APP sprint. Projects P18–P26 are optional extensions for going beyond the book.

Phase 1: Foundation (Start Here)


Project 1: “Hello, Toolchain” — Build Pipeline Explorer

Real World Outcome

You will build a tool inspect-build that reveals the hidden artifacts of compilation.

$ ./inspect-build hello.c
[PREPROCESSOR] Generated hello.i (312 lines). Macros expanded.
[COMPILER]     Generated hello.s (Assembly). 23 instructions.
               -> Found call to 'puts' (optimization of printf)
[ASSEMBLER]    Generated hello.o (Relocatable Object).
               -> Symbols: main (Global, Func), puts (Undefined)
[LINKER]       Generated hello (Executable).
               -> Entry point: 0x401060
               -> Linked with: /lib/x86_64-linux-gnu/libc.so.6

The Core Question You’re Answering

What physically happens to my code between hitting “save” and running the program?

Concepts You Must Understand First

  • The Translation Pipeline (Primer Ch 1) - cpp, cc1, as, ld.
  • ELF Sections (Primer Ch 1) - .text (code), .data (globals), .rodata (constants).
  • Symbols (Primer Ch 1) - How names map to addresses.

Questions to Guide Your Design

  1. How can you stop gcc after preprocessing? After compilation? (Hint: flags).
  2. How do you parse the output of readelf or nm to extract symbol counts?
  3. What is the difference between a “defined” symbol in a .o file and an “undefined” one?

Thinking Exercise

Run gcc -v hello.c. Look at the output. Can you identify the exact programs (cc1, as, collect2/ld) being called? Why are there so many flags?

The Interview Questions They’ll Ask

  1. “What is the difference between static and dynamic linking?”
  2. “What is a symbol table? Does the final executable need it to run?”
  3. “Why does a C program need a main function? Does the binary really start there?” (Hint: _start).

Hints in Layers

  1. Use gcc -E, -S, -c to stop at different stages.
  2. Use readelf -s to see symbols.
  3. Use ldd to see dynamic dependencies.
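
A minimal sketch of inspect-build built on those three hints (the stage file names are ours; a real version would check return codes and parse the nm/readelf output instead of just printing it):

#include <stdio.h>
#include <stdlib.h>

static void run(const char *cmd) {
    if (system(cmd) != 0)
        fprintf(stderr, "stage failed: %s\n", cmd);
}

int main(int argc, char **argv) {
    if (argc != 2) { fprintf(stderr, "usage: %s file.c\n", argv[0]); return 1; }
    char cmd[512];
    snprintf(cmd, sizeof cmd, "gcc -E %s -o stage.i && wc -l stage.i", argv[1]);
    run(cmd);                          /* [PREPROCESSOR] line count after expansion */
    snprintf(cmd, sizeof cmd, "gcc -S %s -o stage.s && wc -l stage.s", argv[1]);
    run(cmd);                          /* [COMPILER] generated assembly */
    snprintf(cmd, sizeof cmd, "gcc -c %s -o stage.o && nm stage.o", argv[1]);
    run(cmd);                          /* [ASSEMBLER] 'U' marks undefined symbols */
    snprintf(cmd, sizeof cmd, "gcc %s -o stage && readelf -h stage | grep Entry", argv[1]);
    run(cmd);                          /* [LINKER] entry point address */
    return 0;
}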

Common Pitfalls & Debugging

Problem 1: “gcc -E produces huge output”

  • Why: Headers like <stdio.h> include thousands of lines of library declarations
  • Fix: Focus on the lines at the end (your code) or count total lines for comparison
  • Quick test: gcc -E hello.c | tail -50 shows just your code

Problem 2: “readelf shows no symbols in executable”

  • Why: Release builds often strip symbols for size. Debug builds include them.
  • Fix: Compile with gcc -g to include debug symbols, or use objdump -t instead
  • Quick test: nm a.out vs nm -D a.out (dynamic symbols survive stripping)

Problem 3: “collect2 error: ld returned 1 exit status”

  • Why: This is a linker error—usually undefined reference or multiple definition
  • Fix: Look for “undefined reference to foo” in the error message. Check if foo is defined in linked libraries
  • Quick test: nm *.o | grep foo to find where symbols are defined

Problem 4: “My tool works for hello.c but crashes on other files”

  • Why: Edge cases: files with no symbols, assembly errors, or missing includes
  • Fix: Add error handling for each gcc stage; check return codes
  • Quick test: Test with an empty file, a file with only comments, a file with syntax errors

Books That Will Help

Topic Book Chapter
GCC Internals An Introduction to GCC by Gough Ch. 1-3
ELF Format Practical Binary Analysis by Andriesse Ch. 2
Linking Deep Dive Linkers and Loaders by Levine Ch. 3-4

Definition of Done

  • Tool runs on any simple C file.
  • Reports size/lines of Preprocessed, Assembly, and Object files.
  • Lists undefined symbols in the Object file.
  • Identifies the dynamic linker used by the final executable.

Project 2: Bitwise Data Inspector

Real World Outcome

A CLI tool bit-view that acts as a CT scan for data.

$ ./bit-view 10
Signed:   10
Unsigned: 10
Hex:      0x0000000A
Binary:   00000000 00000000 00000000 00001010
Float:    1.401298e-44 (Denormalized)

The Core Question You’re Answering

How are the same 32 bits interpreted differently depending on type context?

Concepts You Must Understand First

  • Two’s Complement (Primer Ch 2) - How negative numbers are encoded.
  • IEEE 754 Floating Point (Primer Ch 2) - Sign, exponent, mantissa structure.
  • Endianness (Primer Ch 2) - Byte ordering in multi-byte values.
  • Book Reference: “Write Great Code Vol. 1” by Randall Hyde - Ch. 4-5

Questions to Guide Your Design

  1. How do you extract individual bits from an integer? (Hint: bitwise AND with mask)
  2. How do you detect if a float is denormalized, NaN, or infinity?
  3. How do you show byte order without platform-specific code?

Thinking Exercise

Take the integer -1. Write out its 32-bit representation in binary. Now interpret those same bits as an unsigned integer. What value do you get? What about as a float?

The Interview Questions They’ll Ask

  1. “Why is signed integer overflow undefined behavior in C?”
  2. “Why does 0.1 + 0.2 != 0.3 in floating point?”
  3. “How can you tell if a system is big-endian or little-endian?”

Hints in Layers

  1. Use union { int i; float f; } to reinterpret bits without conversion.
  2. For binary output, use (value >> bit) & 1 in a loop from bit 31 to 0.
  3. To decode IEEE 754: Sign = bit 31, Exponent = bits 30-23 (subtract 127), Mantissa = bits 22-0.
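
Combining the three hints, a minimal core for bit-view might look like this (union punning is implementation-defined in standard C but is the conventional, GCC/Clang-supported approach):

#include <stdio.h>
#include <stdint.h>

int main(void) {
    union { int32_t i; uint32_t u; float f; } v;
    v.i = 10;                                  /* the same 32 bits, three readings */
    printf("Signed:   %d\n", v.i);
    printf("Unsigned: %u\n", v.u);
    printf("Hex:      0x%08X\n", v.u);
    printf("Binary:   ");
    for (int bit = 31; bit >= 0; bit--) {      /* MSB first (see Pitfall 1) */
        printf("%u", (v.u >> bit) & 1);
        if (bit % 8 == 0 && bit) printf(" ");
    }
    printf("\nFloat:    %e\n", v.f);
    /* IEEE 754 single precision: sign = bit 31, exponent = bits 30-23, mantissa = bits 22-0 */
    printf("Sign=%u  Exp=%u (biased by 127)  Mantissa=0x%06X\n",
           v.u >> 31, (v.u >> 23) & 0xFF, v.u & 0x7FFFFF);
    return 0;
}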

Common Pitfalls & Debugging

Problem 1: “My binary output is backwards”

  • Why: You’re printing LSB first instead of MSB first
  • Fix: Loop from bit (width-1) down to 0
  • Quick test: Input 1 should show ...0001 at the end

Problem 2: “Float interpretation shows garbage”

  • Why: You’re casting instead of reinterpreting bits
  • Fix: Use union or memcpy, not (float)intval
  • Quick test: Input 0x3F800000 should decode as float 1.0

Problem 3: “Negative numbers look wrong”

  • Why: Sign extension when casting between sizes
  • Fix: Use unsigned types consistently when doing bit manipulation
  • Quick test: Input -1 should show all 1 bits

Books That Will Help

Topic Book Chapter
Bit Manipulation Hacker’s Delight by Warren Ch. 1-3
Float Internals Write Great Code Vol. 1 by Hyde Ch. 4
Data Types Effective C by Seacord Ch. 5

Definition of Done

  • Correctly displays binary for positive/negative ints.
  • Decodes IEEE 754 float components (Sign, Exponent, Mantissa).
  • Visualizes Endianness (byte order).
  • Handles edge cases: 0, -1, INT_MIN, NaN, infinity.

Project 3: Data Lab Clone

View Detailed Guide

Real World Outcome

You will build a grading framework that enforces restricted operator sets and validates bitwise solutions.

$ ./dlc bits.c
bits.c:15:    absVal  (maxOps: 15, curOps: 14) Valid
bits.c:25:    addOK   (maxOps:  30, curOps: 28) Valid
bits.c:40:    bang    (maxOps:  12, curOps:  5) Valid
bits.c:52:    bitCount (maxOps:  40, curOps: 38) Valid

$ ./test-bits
Testing: absVal
Test [-1] got 1 (expected 1) PASS
Test [0] got 0 (expected 0) PASS
Test [2147483647] got 2147483647 (expected 2147483647) PASS
Test [-2147483648] ERROR: overflow expected

All functions: 47/50 tests passed (94%)

The Core Question You’re Answering

How can you implement standard operations (absolute value, min, max, sign detection) using only bitwise operators—no arithmetic operators, no conditionals?

Concepts You Must Understand First

  1. Two’s Complement Representation (Primer Ch 2)
    • How negative numbers are encoded as bit patterns
    • Why -x = ~x + 1
    • Book Reference: “Write Great Code Vol. 1” by Randall Hyde - Ch. 4
  2. Bitwise Operators and Masking (Primer Ch 2)
    • &, |, ^, ~, <<, >>
    • How to extract/set individual bits
    • Sign extension in right shifts
    • Book Reference: “Hacker’s Delight” by Warren - Ch. 1-2
  3. Integer Overflow & Underflow (Primer Ch 2)
    • When addition/subtraction overflow
    • How to detect overflow without arithmetic
    • Book Reference: “Effective C” by Seacord - Ch. 5

Questions to Guide Your Design

  1. How do you detect if a number is negative using only bitwise ops? (Hint: MSB)
  2. How do you compute -x using only bitwise ops?
  3. How do you check if x + y will overflow without computing x + y?
  4. Can you implement min(x, y) without using conditional operators?
  5. How would you count the number of 1-bits in an integer?

Thinking Exercise

The Absolute Value Problem

Take the integer -5 (in two’s complement: 11111111111111111111111111111011). Your goal is to compute its absolute value (5) using only:

  • Bitwise ops: &, |, ^, ~, <<, >>
  • No arithmetic operators (no +, -, *, /)
  • No conditionals (no if, ?:)

Questions to guide your thinking:

  • How do you detect that this number is negative?
  • Once you know it’s negative, how do you flip it to positive?
  • What about the special case INT_MIN? (Can’t negate -2147483648 in 32 bits!)

The Interview Questions They’ll Ask

  1. “How would you implement absolute value using only bitwise operations?”
  2. “How can you detect integer overflow in addition without performing the addition?”
  3. “Write code to count the number of set bits (1s) in an integer. Can you do it in O(# of set bits) time?”
  4. “Given two integers, how do you swap their values without using a temporary variable or arithmetic?”
  5. “How would you check if a number is a power of 2 using only bitwise operations?”
  6. “Explain the difference between arithmetic and logical right shift. Why does it matter?”

Hints in Layers

Hint 1: Start with Sign Detection

You need to know if a number is negative. The MSB (most significant bit) tells you this in two’s complement. How do you extract it?

Hint 2: Implement Conditional Logic with Bitwise Ops

If you can’t use if, you need a different way to choose between two values. Think: “If bit X is set, use value A, otherwise use value B.” This is a mask operation.

Hint 3: Use Masks to Implement Conditionals

// Pseudocode pattern:
int mask = (condition) ? -1 : 0;  // -1 is all 1s, 0 is all 0s
int result = (a & mask) | (b & ~mask);  // Pick a or b based on mask

But wait—you can’t use ?: either! So create the mask using bitwise ops.
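
Here is the mask trick spelled out (a sketch; it assumes arithmetic right shift on signed int, which gcc and clang guarantee on x86-64 even though the C standard leaves it implementation-defined):

/* pick: returns a if x is negative, b otherwise, with no if or ?: */
int pick(int x, int a, int b) {
    int mask = x >> 31;               /* all 1s if x < 0, all 0s if x >= 0 */
    return (a & mask) | (b & ~mask);  /* exactly the pattern from Hint 3 */
}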

Hint 4: Practice Decomposition

Break complex functions into smaller bitwise primitives:

  • sign(x): Extract the sign bit
  • isZero(x): Check if all bits are 0
  • isNegative(x): Check if MSB is 1
  • Then combine these to solve harder problems

Common Pitfalls & Debugging

Problem 1: “I used arithmetic operators by accident”

  • Why: The assignment seems to allow them, but the grader catches them
  • Fix: Audit your code. Replace every +, -, * with bitwise equivalents
  • Quick test: Run ./dlc bits.c to check operator counts

Problem 2: “My solution works for small numbers but fails for INT_MIN”

  • Why: Two’s complement has an asymmetry: -2147483648 cannot be negated in 32 bits
  • Fix: Handle the special case explicitly or accept the limitation
  • Quick test: Test with INT_MIN and INT_MAX

Problem 3: “Bit count is slow for large numbers”

  • Why: You’re checking every bit instead of just the set bits
  • Fix: Use Brian Kernighan’s trick: x & (x-1) clears the rightmost set bit
  • Quick test: Compare performance on 0xFFFFFFFF vs 0x00000001

Problem 4: “I’m exceeding the operator budget”

  • Why: Your implementation repeats work, so the operator count climbs
  • Fix: Every operator use counts toward the budget; compute shared subexpressions (like the sign mask x >> 31) once, store them in variables, and reuse them
  • Quick test: Re-run ./dlc bits.c after each refactor and watch the count drop

Books That Will Help

Topic Book Chapter
Bitwise Tricks Hacker’s Delight by Warren Ch. 1-3, 5
Bit Manipulation Competitive Programming by Halim & Halim Ch. 2.2
Integer Math Write Great Code Vol. 1 by Hyde Ch. 4
Two’s Complement CS:APP 3e by Bryant & O’Hallaron Ch. 2

Definition of Done

  • All required functions implemented with correct results
  • No arithmetic operators (+, -, *, /) used
  • No conditional operators (if, ?:) used
  • Operator counts below specified limits for each function
  • Edge cases (0, INT_MIN, INT_MAX, -1) handled correctly
  • ./test-bits passes all provided test cases
  • ./dlc bits.c shows “Valid” for all functions

Project 4: x86-64 Calling Convention Crash Cart

View Detailed Guide

Real World Outcome

You will write C programs that intentionally crash in predictable ways, producing crash dumps that teach register/stack analysis.

$ gcc -g caller.c -o caller && ./caller
Segmentation fault (core dumped)

$ gdb ./caller core
(gdb) bt
#0  0x0000555555554123 in leaf () from ./caller
#1  0x0000555555554145 in caller () from ./caller

(gdb) info registers
rax            0x0                 0
rbx            0x555555554000      93824992235520
rcx            0x7ffff7dd6aa4      140737351862948
rdx            0x0                 0
rsi            0x1                 1
rdi            0x2                 2
rbp            0x7ffffffde9e0      0x7ffffffde9e0
rsp            0x7ffffffde9c0      0x7ffffffde9c0

(gdb) x/20x $rsp
0x7ffffffde9c0: 0x55554145  0x00005555  0xdeadbeef  0xdeadbeef
0x7ffffffde9d0: 0xdeadbeef  0xdeadbeef  0xfffde9e0  0x00007fff
0x7ffffffde9e0: 0x55554167  0x00005555  0x00000000  0x00000000

# Stack frame structure visible! Return address (0x0000555555554145) at 0x7ffffffde9c0

The Core Question You’re Answering

How do I read a crash dump (registers, stack memory, return addresses) to understand what my program was doing when it died?

Concepts You Must Understand First

  1. x86-64 Register File and Calling Conventions (Primer Ch 3)
    • The 16 general-purpose registers
    • Which registers hold function arguments (%rdi, %rsi, %rdx, %rcx, %r8, %r9)
    • Which must be preserved by the callee (%rbx, %rbp, %r12-%r15)
    • Book Reference: “CS:APP 3e” by Bryant & O’Hallaron - Ch. 3
  2. Stack Frames and Memory Layout (Primer Ch 3)
    • How push/pop work
    • Stack grows downward (high → low addresses)
    • Frame pointer (%rbp) and stack pointer (%rsp)
    • Return address location
    • Book Reference: “Hacking: The Art of Exploitation” by Erickson - Ch. 2-3
  3. GDB and Crash Dump Analysis (Primer Ch 3)
    • info registers: See CPU state
    • x/Nx addr: Examine memory
    • bt (backtrace): Call stack
    • disassemble: Read assembly from memory
    • Book Reference: “The Art of Debugging with GDB, DDD, and Eclipse” by Matloff & Salzman

Questions to Guide Your Design

  1. Where is the return address stored relative to the current stack pointer?
  2. What does the stack look like immediately after a function prologue (push %rbp; mov %rsp, %rbp)?
  3. How would you identify caller-saved vs callee-saved registers from a crash dump?
  4. Given a crash address, how would you determine which function was executing?
  5. What would a buffer overflow look like in the stack frame?

Thinking Exercise

Analyzing a Crash Dump

You crash in function leaf(), called from middle(), called from main(). You see:

$rsp = 0x7ffd0000
$rbp = 0x7ffd0010
Memory at $rsp: [0x555540ab] (return address)

Questions to answer:

  • Where does leaf() store its saved %rbp?
  • How big is leaf()’s stack frame?
  • If you inspect 0x7ffd0018, what would you expect to find?
  • How would you find the return address for the call to middle()?

The Interview Questions They’ll Ask

  1. “Describe the x86-64 calling convention. Which registers hold function arguments?”
  2. “What is a stack frame? Draw one on a whiteboard.”
  3. “When a function crashes, how do you use gdb to see what arguments it received?”
  4. “What is the difference between %rsp and %rbp? When would you use each?”
  5. “How would you detect a buffer overflow from a crash dump?”
  6. “Explain how a return address gets onto the stack and how the CPU uses it to return from a function.”

Hints in Layers

Hint 1: Study the ABI Document

The System V AMD64 ABI defines the calling convention. Read the first few pages. Key facts:

  • First 6 integer args: %rdi, %rsi, %rdx, %rcx, %r8, %r9
  • Return value: %rax
  • Callee-saved: %rbx, %rbp, %r12-%r15

Hint 2: Write Simple Nested Calls

Create main() → func_a() → func_b(). Compile with -g. Run in GDB, break at each level, and inspect $rsp, $rbp, and the stack memory.

Hint 3: Match Assembly to C

Compile a simple function with -S. Count:

  • How many bytes the function allocates (sub $X, %rsp)
  • Which registers it saves
  • Where local variables live relative to %rsp or %rbp

Hint 4: Create Intentional Crashes

Write functions with:

  • Buffer overflows (overflow into return address)
  • Null pointer dereferences
  • Use-after-free (corrupt a pointer)

For each, predict what the crash dump will show before running it. Verify in GDB.
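
A minimal first crasher, matching the register dump shown above (compile with gcc -g -O0 crash.c -o crash):

#include <stddef.h>

void leaf(int a, int b) {
    int *p = NULL;
    *p = a + b;          /* SIGSEGV: write through a null pointer */
}

void caller(void) {
    leaf(2, 1);          /* at the call, %rdi = 2 and %rsi = 1 */
}

int main(void) {
    caller();
    return 0;
}

Before running it, write down what bt, info registers, and x/20x $rsp should show; then verify in GDB.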

Common Pitfalls & Debugging

Problem 1: “I don’t understand which registers are which”

  • Why: x86-64 has 16 registers, each with multiple names (e.g., %rax vs %eax vs %ax)
  • Fix: Create a reference card: RAX (return), RDI (arg1), RSI (arg2), etc.
  • Quick test: gcc -S simple.c and trace a function call by hand

Problem 2: “The stack grows in the ‘wrong’ direction”

  • Why: Stack grows downward (high → low addresses), which is counterintuitive
  • Fix: Visualize it: push decrements %rsp, pop increments it
  • Quick test: Print $rsp before and after a push instruction in GDB

Problem 3: “I can’t locate local variables”

  • Why: They’re accessed relative to %rsp or %rbp, with negative offsets
  • Fix: Use GDB: p &variable shows the address; info locals lists all locals
  • Quick test: Compile with -g and inspect a simple function

Problem 4: “Crash dumps show weird addresses”

  • Why: ASLR (Address Space Layout Randomization) randomizes addresses each run
  • Fix: Disable ASLR for reproducible testing: setarch $(uname -m) -R ./program
  • Quick test: Run the same program twice, compare crash addresses

Books That Will Help

Topic Book Chapter
x86-64 ABI System V AMD64 ABI Figures 3.4-3.5 (registers, stack layout)
Calling Convention CS:APP 3e by Bryant & O’Hallaron Ch. 3, Sec. 3.7
Debugging The Art of Debugging by Matloff & Salzman Ch. 1-4
Assembly Intro Hacking: The Art of Exploitation by Erickson Ch. 2
GDB Guide GDB Debugging Guide (official docs) Ch. 1-8

Definition of Done

  • Write at least 3 C programs demonstrating different call patterns
  • Each crashes with a predictable register/stack state
  • Use GDB to analyze each crash: inspect $rsp, $rbp, return address
  • Draw a stack frame diagram for each crash
  • Verify callee-saved register preservation
  • Document where each local variable lives relative to %rsp or %rbp

Project 5: Bomb Lab Workflow

View Detailed Guide

Real World Outcome

You will reverse-engineer a compiled binary (“bomb”) and defuse it by discovering the correct inputs to each phase.

$ ./bomb
Welcome to my little bomb.
Phase 1: 1 6 2 4 8 1 6
Phase 1 defused. How about the next one?
Phase 2: 2 10 20 30 40 50
That's number 2. Keep going!
Phase 3: 1 311 3
Halfway there!
Phase 4: 7 0
Phase 4 defused!
Phase 5: ionefg
Congratulations! You've defused the bomb!

The Core Question You’re Answering

Given only a compiled binary (no source code), how can you reverse-engineer it to understand what input will satisfy each phase?

Concepts You Must Understand First

  1. Reading and Understanding x86-64 Assembly (Primer Ch 3)
    • Instruction format and common patterns
    • Control flow (cmp, je, jne, jumps)
    • Loops and recursion
    • Book Reference: “CS:APP 3e” - Ch. 3, Sec. 3.4-3.8
  2. Using GDB for Interactive Debugging (Primer Ch 3)
    • Setting breakpoints
    • Stepping through code
    • Inspecting registers and memory
    • Disassembling functions
    • Book Reference: “The Art of Debugging” - Ch. 2-4
  3. Binary Tools and Analysis (Primer Ch 3)
    • objdump -d: Disassemble binary
    • strings: Find embedded strings
    • readelf -s: Extract symbol table
    • Using GDB’s disassemble and layout asm
    • Book Reference: “Practical Binary Analysis” - Ch. 3-4

Questions to Guide Your Design

  1. How would you identify the entry point of each phase?
  2. What pattern would a string comparison look like in assembly?
  3. How would you recognize a loop in assembly and determine its bounds?
  4. What assembly patterns indicate arithmetic operations (add, multiply)?
  5. How would you use breakpoints to inspect values at specific points?

Thinking Exercise

Reverse Engineering a Simple Phase

You disassemble Phase 1 and see:

mov    %edi,%edx        # edx = arg1
mov    $0x1,%ecx        # ecx = 1
cmp    %ecx,%edx        # compare edx to ecx
jne    0x401...         # jump if not equal
mov    $0x2,%ecx        # ecx = 2
cmp    %ecx,%edx        # compare again
...

Questions:

  • What values does edx get compared against?
  • What would happen if you input “1”?
  • What if you input “2”?
  • Is there a specific sequence required?

The Interview Questions They’ll Ask

  1. “How would you approach reverse-engineering an unfamiliar binary?”
  2. “What GDB commands are most useful for understanding control flow?”
  3. “How would you identify function calls and their arguments from assembly?”
  4. “Explain how you’d detect string comparison in assembly.”
  5. “What’s the difference between a conditional jump based on equality vs. less-than?”
  6. “How would you recognize that code is looping vs. just jumping around?”

Hints in Layers

Hint 1: Use objdump to Get Disassembly

Run objdump -d bomb | less to see the entire binary disassembled. Search for function names or strings.

Hint 2: Start with Strings

Run strings bomb | grep -E "[A-Z]" to find hints about expected input format.

Hint 3: Break at Phase Boundaries

In GDB:

(gdb) break explode_bomb
(gdb) break phase_1
(gdb) run
# Input a test string, watch where it fails

Hint 4: Trace Through Phase Logic

Use stepi (step one instruction) to single-step through assembly. After each instruction, print registers with info registers to see how values change.

Common Pitfalls & Debugging

Problem 1: “I don’t know how to read assembly”

  • Why: x86-64 syntax is unfamiliar
  • Fix: Create a cheat sheet: mov src, dst, cmp a, b, je label (jump if equal)
  • Quick test: Compile simple C to assembly with -S and compare

Problem 2: “Phase defuses but says it’s wrong”

  • Why: Some phases require exact formatting (spaces, number of args)
  • Fix: Look at how scanf is called; match the format string exactly
  • Quick test: Try with different spacing or order

Problem 3: “I can’t find where input is read”

  • Why: scanf or fgets might be called from a library function
  • Fix: Search for string pointers in data sections (.rodata), trace backwards to function
  • Quick test: strings bomb | grep '%' to find format strings

Problem 4: “Breakpoints don’t work where I expect”

  • Why: ASLR or compiler optimizations move code around
  • Fix: Use symbolic names: break phase_1 instead of memory addresses
  • Quick test: Check with info breakpoints

Books That Will Help

Topic Book Chapter
Assembly Reading CS:APP 3e Ch. 3
Reverse Engineering Practical Binary Analysis Ch. 1-5
Debugging Strategies The Art of Debugging Ch. 2-4
x86-64 Reference Intel 64 and IA-32 Architectures Manual Vol. 2A-2C

Definition of Done

  • Defuse all 6 phases (or as many as available)
  • For each phase, document the disassembled code and your analysis
  • Explain the logic of each phase (what condition does it check?)
  • Show successful input and expected output
  • Demonstrate using GDB to step through at least one phase

Project 6: Attack Lab—Exploiting Stack Vulnerabilities

View Detailed Guide

Real World Outcome

You will craft malicious inputs that exploit buffer overflows to hijack control flow using ROP (Return-Oriented Programming) gadgets.

$ cat solution1.txt | ./ctarget
Type string:Entering echo mode (^C to quit)
sdraw!
Normal return
Congratulations! You have successfully completed this warm-up stage 1.

$ cat solution2.txt | ./ctarget
Type string: AAAA...XXXX[gadget_addr_1][gadget_addr_2]...
Congratulations! Stage 2 complete.

$ cat solution5.txt | ./ctarget
# After crafting ROP chain with multiple gadgets
Congratulations! Stage 5 complete. All attacks succeeded!

The Core Question You’re Answering

How can you exploit a buffer overflow to change the control flow of a program and execute arbitrary code?

Concepts You Must Understand First

  1. Stack Buffer Overflows (Primer Ch 3)
    • How writing past a buffer’s boundary corrupts the stack
    • Overwriting the return address
    • Shellcode vs ROP gadgets
    • Book Reference: “CS:APP 3e” - Ch. 3, Sec. 3.12
  2. Return-Oriented Programming (ROP) (Primer Ch 3)
    • Finding gadgets (sequences ending in ret)
    • Chaining gadgets to perform computation
    • Using pop rdi; ret to set arguments
    • Book Reference: “Practical Binary Analysis” - Ch. 9
  3. Exploit Development Techniques (Primer Ch 3)
    • Disassembling to find gadgets
    • Calculating stack offsets
    • Bypassing ASLR (or using relative offsets)
    • Testing exploits safely
    • Book Reference: “Hacking: The Art of Exploitation” - Ch. 3

Questions to Guide Your Design

  1. How many bytes of overflow do you need before reaching the return address?
  2. How would you verify your return address is correct?
  3. Where would you find ROP gadgets in the binary?
  4. How do you craft a sequence of gadgets to pass arguments to a function?
  5. What’s the difference between Stage 1 (code injection) and Stages 4-5 (ROP)?

Thinking Exercise

Building a Simple ROP Chain

You want to call func(42). You find:

  • pop %rdi; ret at 0x401234
  • the address of func itself (after pop %rdi consumes the 42, the gadget’s ret transfers control to func)

Your stack should look like:

[Buffer][Padding][Return Address → gadget][42][Next Gadget]

Questions:

  • How many bytes is the buffer?
  • Where does 42 go on the stack?
  • What address do you write as the return address?
  • What happens after the pop %rdi?

The Interview Questions They’ll Ask

  1. “Explain how a buffer overflow can lead to code execution.”
  2. “What is ROP? Why is it more powerful than simple shellcode injection?”
  3. “How would you find gadgets in a binary?”
  4. “What role does ASLR play in exploit mitigation?”
  5. “Can you ROP without knowing the exact memory layout?”
  6. “Describe the stack state before, during, and after exploiting a buffer overflow.”

Hints in Layers

Hint 1: Start with Stage 1

Stage 1 is usually a warm-up: overwrite the return address to jump to a success function (no ROP needed). Calculate offset from buffer start to return address.

Hint 2: Use GDB to Verify Offsets

(gdb) run
(gdb) break *0x401234  # At a chosen gadget
(gdb) x/20x $rsp       # See what's on the stack

Hint 3: Build Gadgets from Tools

Use ropper or objdump to find gadgets:

ropper --file ctarget --search "pop rdi"
objdump -d ctarget | grep -A2 "pop.*rdi"

Hint 4: Test Incrementally

  • Stage 1: Overwrite return address
  • Stage 2: Inject simple code (if allowed)
  • Stage 3+: Chain multiple gadgets

Common Pitfalls & Debugging

Problem 1: “Exploit doesn’t crash or execute correctly”

  • Why: Return address is wrong or stack alignment is off
  • Fix: Use GDB to inspect $rsp at the crash point
  • Quick test: Disable ASLR and run again

Problem 2: “Gadget not found”

  • Why: Looking for exact instruction sequence that doesn’t exist
  • Fix: Look for equivalent sequences (e.g., mov %rdi, %rdi; ret)
  • Quick test: Use ropper --all or objdump -d | grep mov

Problem 3: “Stack alignment causes crash”

  • Why: Some instructions require 16-byte alignment
  • Fix: Insert extra ret as padding to adjust alignment
  • Quick test: Count to 16-byte boundary

Problem 4: “Can’t inject code (W^X protection)”

  • Why: Modern systems mark writable memory non-executable
  • Fix: Use ROP gadgets from existing code instead of injecting new code
  • Quick test: Check with checksec ./ctarget

Books That Will Help

Topic Book Chapter
Buffer Overflows Hacking: The Art of Exploitation Ch. 3
ROP Exploitation Practical Binary Analysis Ch. 9
Security Best Practices CS:APP 3e Ch. 3, Sec. 3.12
Gadget Finding ROPPER Documentation Official guide

Definition of Done

  • Complete Stage 1 (redirect control flow)
  • Complete at least one more stage using ROP gadgets
  • For each stage, document: buffer offset, gadgets used, stack layout
  • Explain why your exploit works despite mitigations
  • Test exploit multiple times to verify reliability

Project 7: Y86-64 CPU Simulator—Understanding Pipelining

View Detailed Guide

Real World Outcome

You will build a cycle-accurate CPU simulator that executes Y86-64 programs and visualizes the 5-stage pipeline.

$ ./ysim prog.yo

Y86-64 Simulator
Load program: 14 instructions
Starting execution (verbose mode)

Cycle 1: F(nop) D(nop) E(-) M(-) W(-)
Cycle 2: F(imm) D(nop) E(nop) M(-) W(-)
Cycle 3: F(add) D(imm) E(nop) M(nop) W(-)
Cycle 4: F(rrmov) D(add) E(imm) M(nop) W(nop)
[Stall: Data dependency detected. E needs %rax, D is computing it]
Cycle 5: F(rrmov) D(add) E(imm) M(add) W(nop)
Cycle 6: F(jne) D(rrmov) E(imm) M(add) W(add)

Final state:
%rax = 0x00000003
%rcx = 0x00000004
Condition codes: ZF=0 SF=0 OF=0
Memory[0x100:0x108] = {0x00, 0x01, 0x02, ...}

Performance: 6 instructions / 14 cycles = 0.43 IPC

The Core Question You’re Answering

How does a CPU execute instructions in parallel using a pipeline, and what causes stalls and forwarding?

Concepts You Must Understand First

  1. CPU Pipelining (Primer Ch 4)
    • 5 stages: Fetch, Decode, Execute, Memory, Writeback
    • Parallel execution of multiple instructions
    • Book Reference: “CS:APP 3e” - Ch. 4-5
  2. Pipeline Hazards (Primer Ch 4)
    • Data hazards (dependency on previous instruction’s result)
    • Control hazards (branches)
    • Structural hazards (resource conflicts)
    • Book Reference: “Computer Organization and Design” by Patterson & Hennessy - Ch. 4
  3. Hazard Resolution (Primer Ch 4)
    • Stalling (NOP injection)
    • Forwarding (passing data between stages)
    • Branch prediction
    • Book Reference: “CS:APP 3e” - Ch. 4, Sec. 4.5

Questions to Guide Your Design

  1. How many cycles does it take to execute 10 independent add instructions?
  2. What happens if instruction B depends on instruction A’s result?
  3. How does forwarding reduce stalls compared to stalling alone?
  4. What’s the longest possible dependency chain you can create?
  5. How would branch prediction affect CPI (cycles per instruction)?

Thinking Exercise

Analyzing a Dependency Chain

irmovq $1, %rax    # Cycle 1: rax = 1
addq %rax, %rbx    # Cycle 2: needs rax from Cycle 1
addq %rbx, %rcx    # Cycle 3: needs rbx from Cycle 2
addq %rcx, %rdx    # Cycle 4: needs rcx from Cycle 3

Questions:

  • Without forwarding, how many stalls occur?
  • With forwarding, how many stalls occur?
  • Draw the pipeline state at Cycle 2, 3, 4

The Interview Questions They’ll Ask

  1. “Explain the 5 stages of a CPU pipeline.”
  2. “What is a data dependency? Give an example.”
  3. “How does forwarding work? When does it NOT work?”
  4. “Compare stalling vs forwarding for resolving hazards.”
  5. “What causes a branch to stall the pipeline?”
  6. “Can a 10-stage pipeline achieve a CPI of 0.1?” (Why or why not?)

Hints in Layers

Hint 1: Implement the Basic Cycles

Start with Fetch → Decode → Execute → Memory → Writeback. No parallelism yet; just sequential execution.

Hint 2: Add Pipeline Registers

Between each stage, add a “pipeline register” that holds the state. This is where values flow between stages.

Hint 3: Implement Data Dependency Detection

Check: “Does the instruction in Decode need a value that Fetch just fetched?” If yes, stall.

Hint 4: Implement Forwarding

Instead of stalling, pass the result from Execute/Memory stage backward to the Decode stage.
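
A sketch of Hints 2-3 in C (all names are illustrative, not the official Y86-64 signal names):

#include <stdbool.h>
#include <stdint.h>

enum { RNONE = 0xF };          /* "no register" marker */

typedef struct {
    uint8_t icode;             /* instruction occupying this stage */
    uint8_t srcA, srcB;        /* registers the instruction reads */
    uint8_t dstE;              /* register it will write */
    bool    bubble;            /* true = stage carries a NOP */
} PipeReg;

PipeReg D, E, M;               /* decode, execute, memory pipeline registers */

/* Stall if the instruction in Decode reads a register that an older
   instruction still in Execute or Memory has not yet written back. */
bool must_stall(void) {
    uint8_t pending[2] = { E.bubble ? RNONE : E.dstE,
                           M.bubble ? RNONE : M.dstE };
    for (int i = 0; i < 2; i++)
        if (pending[i] != RNONE &&
            (pending[i] == D.srcA || pending[i] == D.srcB))
            return true;       /* forwarding (Hint 4) eliminates most of these */
    return false;
}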

Common Pitfalls & Debugging

Problem 1: “Pipeline doesn’t advance; everything stalls”

  • Why: Dependency detection is too conservative
  • Fix: Only stall if forwarding can’t solve it
  • Quick test: Run a program with no dependencies; verify it advances every cycle

Problem 2: “Results are incorrect; wrong register values”

  • Why: Forwarding path is missing or has a bug
  • Fix: Trace through a simple 2-instruction dependency by hand
  • Quick test: Add debug output showing forwarding path

Problem 3: “Branches cause wrong control flow”

  • Why: Branch prediction / fetching ahead is implemented incorrectly
  • Fix: Start with predict-not-taken. Flush pipeline on misprediction.
  • Quick test: Run a branch-heavy program; verify correct result

Problem 4: “Performance numbers don’t match textbook”

  • Why: Stall logic or forwarding is inefficient
  • Fix: Compare your CPI to expected values; audit the code
  • Quick test: Run a known test case; compare CPI

Books That Will Help

Topic Book Chapter
Pipelining CS:APP 3e Ch. 4-5
Pipeline Simulation Computer Organization and Design Ch. 4
Y86-64 Spec CS:APP 3e Appendix A
Performance Metrics Computer Architecture by Hennessy & Patterson Ch. 1

Definition of Done

  • Simulator correctly executes Y86-64 programs
  • Pipeline visualization shows all 5 stages each cycle
  • Data dependencies detected and stalls inserted
  • Forwarding implemented for common cases
  • Performance metrics (CPI) calculated and reported
  • Test on provided Y86-64 programs; match expected output

Project 8: Performance Optimization Clinic

View Detailed Guide

Real World Outcome

You will benchmark code variants and achieve measurable speedups through optimization, documenting the improvements.

$ ./perf-clinic

=== Loop Sum Benchmark ===
Variant 1 (Naive):       2.5s for 10M iterations
Variant 2 (Unrolled 4x): 1.8s for 10M iterations  [1.39x faster]
Variant 3 (SIMD):        0.6s for 10M iterations  [4.17x faster]
Variant 4 (Cache-aware): 0.4s for 10M iterations  [6.25x faster]

=== Memory Access Pattern ===
Row-major (Stride-1):    1.2 ns/access (L1 cache hit)
Column-major (Stride-1024): 45 ns/access (RAM access)
Random access:            89 ns/access (TLB miss)

=== Branch Prediction ===
Predictable branch:      0.8 ns per iteration
Unpredictable branch:    8.2 ns per iteration [10.25x slower]

Summary: 6.25x speedup achieved through loop unrolling + cache optimization

The Core Question You’re Answering

What techniques reliably improve performance, and how do you measure and prove the improvements?

Concepts You Must Understand First

  1. Loop Optimization (Primer Ch 4-5)
    • Loop unrolling for ILP (Instruction-Level Parallelism)
    • Reducing loop overhead
    • Eliminating dependencies
    • Book Reference: “CS:APP 3e” - Ch. 5, Sec. 5.7-5.9
  2. Memory Locality (Primer Ch 5)
    • Temporal vs spatial locality
    • Cache line behavior
    • Stride effects
    • Book Reference: “CS:APP 3e” - Ch. 6, Sec. 6.4
  3. Branch Prediction (Primer Ch 4)
    • Predictable vs random branches
    • Cost of misprediction
    • Branchless programming
    • Book Reference: “CS:APP 3e” - Ch. 4, Sec. 4.7

Questions to Guide Your Design

  1. What is ILP and how does unrolling increase it?
  2. Why does 4x unrolling often beat 8x?
  3. What’s the relationship between cache line size and array stride?
  4. How would you measure branch prediction penalty?
  5. Can you optimize code without changing the algorithm?

Thinking Exercise

Identifying Optimization Opportunities

Given this code:

long sum = 0;
for (int i = 0; i < n; i++)
    sum += arr[i];

Questions:

  • Is there a data dependency? What is it?
  • Can you unroll the loop? How?
  • Would SIMD help? Why?
  • What would the assembly look like?
  • How many cycles per element (CPE) would it take?
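
One concrete answer to the unrolling question above: 4x unrolling with two accumulators shortens the sum dependency chain, so independent additions can overlap in the pipeline (a sketch):

long sum_unrolled(const long *arr, int n) {
    long s0 = 0, s1 = 0;                 /* two chains instead of one */
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += arr[i]     + arr[i + 2];
        s1 += arr[i + 1] + arr[i + 3];
    }
    for (; i < n; i++)                   /* leftover elements */
        s0 += arr[i];
    return s0 + s1;
}

Measure its CPE against the naive loop; the gap is the ILP you recovered.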

The Interview Questions They’ll Ask

  1. “Why does loop unrolling improve performance?”
  2. “Explain the difference between ILP and pipelining.”
  3. “How does cache line size affect array access patterns?”
  4. “Why is sorting an array sometimes faster than processing it unsorted?”
  5. “What’s the cost of a branch misprediction?”
  6. “Can you optimize code to run faster without changing the algorithm?”

Hints in Layers

Hint 1: Choose Simple Benchmarks

Start with loops: sum, multiplication, or simple recursive operations. Avoid I/O, system calls, or complex logic.

Hint 2: Use perf or time for Measurement

# Use 'perf' for detailed metrics
perf stat -e cycles,instructions,cache-misses ./program

# Or simple timing
time ./program

Hint 3: Vary One Thing at a Time

  • Baseline version
  • 2x unrolled
  • 4x unrolled
  • 8x unrolled
  • +simd
  • +cache-aware data layout

Hint 4: Disable Optimizations for Fairness

Compile with -O0 or at most -O1 so your hand optimizations are visible. Otherwise, the compiler does it for you!

Common Pitfalls & Debugging

Problem 1: “Measurements are noisy”

  • Why: System noise, cache state, thermal throttling
  • Fix: Run multiple times, take average or median
  • Quick test: Run same binary 5 times; see variance

Problem 2: “Optimization didn’t help; maybe made it worse”

  • Why: Compiler is already doing it, or new code has worse cache behavior
  • Fix: Verify with -O0 compilation; check assembly with -S
  • Quick test: Compare hand-optimized vs compiler-optimized

Problem 3: “Can’t measure ILP directly”

  • Why: ILP is a hardware metric, not directly observable
  • Fix: Use perf with hardware events: perf stat -e instructions,cycles
  • Quick test: Measure CPE (cycles per element) instead

Problem 4: “Code runs too fast; measurement is 0ms”

  • Why: Problem size is too small or time resolution is coarse
  • Fix: Increase problem size or repeat inner loop
  • Quick test: Make sure loop takes at least 1 second

Books That Will Help

Topic Book Chapter
Loop Optimization CS:APP 3e Ch. 5, Sec. 5.7-5.9
Cache Optimization CS:APP 3e Ch. 6, Sec. 6.4-6.5
Performance Analysis Performance Engineering (MIT OCW) Lectures 1-5
perf Tool Brendan Gregg’s Blog “perf-tools” series

Definition of Done

  • Implement at least 4 variants of a kernel (baseline, unrolled, cache-aware, SIMD)
  • Measure and document speedup for each variant
  • Explain the optimization technique used in each
  • Achieve at least 2x speedup on at least one benchmark
  • Use perf to verify hardware metrics (IPC, cache misses)
  • Document results with graphs/tables showing improvements

Project 9: Cache Lab++ Simulator

  • File: CSAPP_CACHE_LAB_SIMULATOR.md
  • Main Programming Language: C
  • Alternative Programming Languages: Python, Rust, C++
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Cache Hierarchy / Memory System / Performance Analysis
  • Software or Tool: valgrind, perf, cache architecture manuals
  • Main Book: Computer Systems: A Programmer’s Perspective, 3rd Edition by Bryant & O’Hallaron

What you’ll build: A fully-featured cache simulator that accepts memory access traces and simulates cache behavior (hits, misses, evictions) with configurable cache geometry and replacement policies.

Why it teaches cache behavior: You’ll implement the exact logic that determines whether a memory access hits or misses, forcing you to understand address bit anatomy, the memory hierarchy, and how different workloads interact with cache geometry.

Core challenges you’ll face:

  • Address parsing (Tag/Index/Offset extraction) → maps to understanding cache address translation
  • Replacement policy implementation (LRU, FIFO, Random) → maps to understanding eviction strategies
  • Trace processing and statistics collection → maps to understanding cache performance metrics

Real World Outcome

You’ll produce a command-line tool csim that takes a memory trace and simulates cache behavior:

$ ./csim -v -S 4 -E 2 -b 4 -t traces/yi.trace
L 10,1
 cache [0][0] 0x0 miss
L 18,1
 cache [0][1] 0x8 miss
S 18,1
 cache [0][1] 0x8 hit
L 22,1
 cache [1][0] 0x2 miss evict
L 2d,1
 cache [1][1] 0xd hit
L 4,4
 cache [0][0] 0x0 miss
L 2c,4
 cache [1][1] 0xc miss evict
L 2d,1
 cache [1][0] 0xd hit

Summary:
hits:2 misses:4 evictions:2

The Core Question You’re Answering

“How do CPU cache lines, index bits, and tag bits determine whether a memory access hits or misses?”

Before coding, understand this completely. A cache miss isn’t random—it’s determined by arithmetic on address bits.

Concepts You Must Understand First

  1. Cache Geometry & Addressing
    • What does “S=4, E=2, b=4” mean for a cache?
    • How many bits for index, offset, and tag?
    • Book Reference: CS:APP 3e Ch. 6, Sec. 6.1-6.2
  2. Memory Access Traces
    • How to parse “L 10,1” (Load 1 byte at 0x10)?
    • Book Reference: CS:APP 3e Ch. 6, Sec. 6.6
  3. Replacement Policies
    • LRU: evict the least recently used line
    • How to track ordering efficiently?
    • Book Reference: CS:APP 3e Ch. 6, Sec. 6.2

Questions to Guide Your Design

  1. Address Decomposition: Given address 0x1A34 with S=16 sets and b=4 offset bits, what are index and tag bits?
  2. Hit vs Miss: How do you determine if an address is in the cache using only set index and tag?
  3. Eviction: How do you track which line is “least recently used” without expensive data structures?

Thinking Exercise

Cache Address Mapping

Given S=4 sets, E=2 lines/set, b=3 offset bits:

Address 0x6B5 (binary: 110 1011 0101)

  • Offset (bits 0-2): 101 = 5
  • Index (bits 3-4): 10 = 2
  • Tag (bits 5-10): 110101 = 53

So 0x6B5 maps to set 2 with tag 53.

Questions:

  • If 0x1AA arrives next, where does it map?
  • If 0x6B5 arrives again, does it hit the same line?
  • How many consecutive addresses map to the same set?

The Interview Questions They’ll Ask

  1. “Explain how address bits map to cache index and tag. What if two addresses have the same index but different tags?”
  2. “What’s the difference between a cold miss, capacity miss, and conflict miss?”
  3. “How would you implement LRU replacement efficiently without a linked list?”
  4. “If a cache has 16 sets, 4 lines/set, 64-byte lines, how many bytes total? What’s the address decomposition?”
  5. “Why might increasing cache size not always improve hit rate?”

Hints in Layers

Hint 1: Address Decomposition Address bits aren’t just numbers—they’re structured. Rightmost bits are byte offset within a line. Next bits are set index. Remaining bits are tag. Write a function extracting all three.

Hint 2: Cache Data Structure Use a 2D array: cache[S][E]. Each element stores: valid bit, tag, and timestamp for LRU tracking.

Hint 3: Hit/Miss Detection For each access: extract index and tag. Search the set for valid line with matching tag. If found, hit—update timestamp. If not, miss. If set is full, evict smallest timestamp (LRU). Otherwise insert.

Hint 4: Validate with Tiny Traces Create a trace manually:

L 0x0,1
L 0x40,1
L 0x0,1     # Should be a hit

Trace through after each operation and verify cache state.
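
The bit surgery from Hint 1 fits in a few lines (here s_bits = log2(S), so -S 4 means s_bits = 2):

#include <stdint.h>

typedef struct { uint64_t tag, set, offset; } Addr;

Addr decompose(uint64_t addr, int s_bits, int b_bits) {
    Addr a;
    a.offset = addr & ((1ULL << b_bits) - 1);             /* low b bits */
    a.set    = (addr >> b_bits) & ((1ULL << s_bits) - 1); /* next s bits */
    a.tag    = addr >> (b_bits + s_bits);                 /* everything else */
    return a;
}

As a check, decompose(0x6B5, 2, 3) should return offset 5, set 2, tag 53, matching the Thinking Exercise.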

Books That Will Help

Topic Book Chapter
Cache Organization CS:APP 3e Ch. 6, Sec. 6.1-6.3
Replacement Policies CS:APP 3e Ch. 6, Sec. 6.2
Understanding Traces CS:APP 3e Ch. 6, Sec. 6.6

Common Pitfalls and Debugging

Problem 1: “Hit rate is way off”

  • Why: Off-by-one error in address bit extraction
  • Fix: Print index, tag, offset for each address; verify manually
  • Quick test: Load same address 4 times; all should be hits

Problem 2: “Eviction isn’t working”

  • Why: Not properly tracking set occupancy
  • Fix: Assert cache[set].count <= E at all times
  • Quick test: Load E+1 addresses to same set; first should evict

Problem 3: “LRU evicts wrong line”

  • Why: Timestamp not updated on access, or evicting most recent instead
  • Fix: Update timestamp on every access (hits too)
  • Quick test: Load 3 addresses into E=3 set, access in order; verify eviction

Problem 4: “Program crashes on traces”

  • Why: Trying to process ‘I’ (instruction) lines as data
  • Fix: Skip ‘I’ lines; only process ‘L’, ‘S’, ‘M’
  • Quick test: Run on official CS:APP traces first

Definition of Done

  • Parse cache geometry from CLI (S, E, b)
  • Extract set index and tag from 64-bit addresses correctly
  • Distinguish hits from misses based on cache state
  • Implement LRU replacement when cache is full
  • Process memory traces and collect hits/misses/evictions
  • Output statistics matching reference implementation
  • Handle both direct-mapped and set-associative caches
  • Skip ‘I’ operations in valgrind traces

Project 10: ELF Link Map Toolkit

  • File: CSAPP_ELF_INTERPOSITION_TOOLKIT.md
  • Main Programming Language: C
  • Alternative Programming Languages: Python, Go, Rust
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Dynamic Linking / Symbol Resolution / ELF Format
  • Software or Tool: readelf, objdump, ldd, linker scripts
  • Main Book: Computer Systems: A Programmer’s Perspective, 3rd Edition by Bryant & O’Hallaron

What you’ll build: A suite of tools including: (1) linkmap to visualize symbol resolution and PLT/GOT dynamics, (2) interpose library that intercepts function calls at runtime via symbol interposition, and (3) reloc-viz to visualize relocation entries and their targets.

Why it teaches dynamic linking: You’ll understand what happens between compilation and execution—how symbols become addresses, how the PLT works, and how runtime interposition redirects calls.

Core challenges you’ll face:

  • ELF parsing (sections, symbols, relocations) → maps to understanding binary format and structure
  • Symbol resolution (strong vs weak, global vs local) → maps to understanding visibility and linking
  • PLT/GOT mechanics and lazy binding → maps to understanding runtime indirection
  • Interposition via LD_PRELOAD and symbol wrapping → maps to understanding link-time behavior modification

Real World Outcome

You’ll produce multiple command-line tools:

$ ./linkmap ./myprog
[DYNAMIC] Loaded libraries:
  libm.so.6 (0x7fce9c000000)
  libc.so.6 (0x7fce9be00000)

[SYMBOLS] Exported from myprog:
  0x400600 T main
  0x400500 T helper_func
  U printf              # Undefined (to be resolved)

[PLT/GOT] Entries:
  printf@plt (0x400450) -> printf@libc (0x7fce9c50000)

$ gcc -fPIC -shared -o interpose.so interpose.c
$ LD_PRELOAD=./interpose.so ./myprog
[INTERCEPT] malloc called! (size=1024)
[INTERCEPT] free called! (ptr=0x5555558)

The Core Question You’re Answering

“How does the dynamic linker resolve symbols, and how can we intercept those resolutions?”

Before coding, understand that linking happens in three phases: compile-time (symbols become relocation entries), link-time (entries become offsets into libraries), and load-time (libraries are mapped and symbols are resolved).

Concepts You Must Understand First

  1. ELF Format & Sections
    • .symtab (symbol table), .dynsym (dynamic symbol table)
    • .rel.* (relocation tables), .got / .plt (runtime indirection)
    • Book Reference: CS:APP 3e Ch. 7, Sec. 7.3-7.5
  2. Symbol Resolution
    • Global vs local scope, strong vs weak symbols
    • Interposition: overriding function definitions at link/load time
    • Book Reference: CS:APP 3e Ch. 7, Sec. 7.6
  3. Procedure Linkage Table (PLT) and Global Offset Table (GOT)
    • How indirect jumps work at runtime
    • Lazy binding: first call triggers resolution, subsequent calls use cached address
    • Book Reference: CS:APP 3e Ch. 7, Sec. 7.8
  4. Dynamic Linking & Relocation
    • Relocation entries tell the linker how to patch addresses
    • At load time, linker walks these entries and fixes them
    • Book Reference: CS:APP 3e Ch. 7, Sec. 7.7

Questions to Guide Your Design

  1. Symbol Tables: How do you extract all symbols from a binary? Which symbols are global vs local, defined vs undefined?
  2. Relocation Inspection: Given a relocation entry R_X86_64_GLOB_DAT, how do you know which symbol it refers to and where it gets patched?
  3. PLT/GOT Tracing: How does a call to printf@plt redirect through the GOT to the real printf in libc?
  4. Interposition: If you want to intercept malloc, how do you create a shared library with a malloc function that takes precedence?

Thinking Exercise

Symbol Resolution Chain

Program calls printf(...). Follow the chain:

  1. In source code: #include <stdio.h> declares printf
  2. After compilation: object file has relocation entry R_X86_64_PLT32 for printf
  3. After linking: executable has printf@plt (jump table entry) and printf@got.plt (cache for address)
  4. At runtime:
    • First call: jump to printf@plt → indirect jump through printf@got.plt
    • GOT initially points back to plt code that calls dynamic linker
    • Linker finds printf in libc, patches GOT
    • Subsequent calls: jump through GOT directly to libc’s printf

Questions:

  • Why does the PLT exist if the GOT can hold the address directly?
  • What happens if interposition changes which library printf comes from?

The Interview Questions They’ll Ask

  1. “Explain the difference between .symtab, .dynsym, and the GOT. When is each used?”
  2. “What is lazy binding, and why is it better than eager binding for large programs?”
  3. “How would you intercept function calls using LD_PRELOAD? What prevents you from intercepting non-exported functions?”
  4. “What’s the difference between weak and strong symbols? How does the linker resolve conflicts?”
  5. “If you remove a symbol from a library, but a binary still calls it, what happens at runtime?”

Hints in Layers

Hint 1: Read ELF Headers and Sections Start with readelf to see what information is available. Then parse the ELF binary yourself: read the ELF header, iterate section headers, extract symbol tables (both .symtab and .dynsym), and print all symbols with their addresses and binding types.
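
If you want a head start on the parsing itself, here is a minimal sketch using the system's <elf.h> (assumes a 64-bit little-endian binary; all error handling omitted):

#include <elf.h>
#include <stdio.h>
#include <stdlib.h>

/* Sketch: list every section's name and size for a 64-bit ELF file. */
int main(int argc, char **argv) {
    if (argc < 2) return 1;
    FILE *f = fopen(argv[1], "rb");
    if (!f) return 1;

    Elf64_Ehdr eh;
    fread(&eh, sizeof eh, 1, f);                      /* ELF file header */

    Elf64_Shdr *sh = malloc(eh.e_shnum * sizeof *sh);
    fseek(f, eh.e_shoff, SEEK_SET);                   /* section header table */
    fread(sh, sizeof *sh, eh.e_shnum, f);

    char *names = malloc(sh[eh.e_shstrndx].sh_size);  /* .shstrtab bytes */
    fseek(f, sh[eh.e_shstrndx].sh_offset, SEEK_SET);
    fread(names, 1, sh[eh.e_shstrndx].sh_size, f);

    for (int i = 0; i < eh.e_shnum; i++)
        printf("%-24s %8lu bytes\n", names + sh[i].sh_name,
               (unsigned long)sh[i].sh_size);
    fclose(f);
    return 0;
}

Cross-check its output against readelf -S before moving on to symbols.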

Hint 2: Parse Relocation Entries Relocation entries live in the .rel.* / .rela.* sections (x86-64 uses RELA, with explicit addends). Each entry contains: an offset (where to patch), a type (how to patch), and a symbol index. Print all relocations with their target symbols.

Hint 3: Understand PLT/GOT The .plt section contains jump stubs. The .got.plt section contains cached addresses. For each external function, find its PLT entry and its GOT entry. Trace how a call gets redirected.

Hint 4: Build Interposition Wrapper Create a shared library with a function that has the same name as a function you want to intercept. Use dlsym to get the address of the original function, then your wrapper can call the original after logging or modifying behavior. Compile with -fPIC -shared.
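
A minimal interposer along these lines (the logging format is our own; compile exactly as the hint shows). One real-world caveat: glibc's dlsym can itself allocate, so a production-grade malloc wrapper needs a re-entrancy guard or a static-buffer fallback.

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

static void *(*real_malloc)(size_t) = NULL;

void *malloc(size_t size) {
    if (!real_malloc)  /* resolve the "next" malloc in library search order */
        real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");

    void *p = real_malloc(size);

    char msg[64];
    int len = snprintf(msg, sizeof msg, "[INTERCEPT] malloc(%zu) = %p\n", size, p);
    write(STDERR_FILENO, msg, len);  /* write(), not printf(): printf may malloc */
    return p;
}

Run it with LD_PRELOAD=./interpose.so ls to watch every allocation go by.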

Books That Will Help

Topic Book Chapter
ELF Format & Sections CS:APP 3e Ch. 7, Sec. 7.3-7.5
Symbol Resolution CS:APP 3e Ch. 7, Sec. 7.6
PLT & GOT CS:APP 3e Ch. 7, Sec. 7.8
Relocation & Dynamic Linking CS:APP 3e Ch. 7, Sec. 7.7
Deep ELF Internals Linkers and Loaders Ch. 5-6 (Levine)

Common Pitfalls and Debugging

Problem 1: “I can’t find symbols in the binary”

  • Why: Symbols might be stripped (release builds), or you’re looking in .symtab instead of .dynsym
  • Fix: Use .dynsym for dynamic symbols; .symtab might not exist; check with nm or readelf -s
  • Quick test: readelf -s ./binary | grep printf

Problem 2: “PLT/GOT addresses don’t match what I expect”

  • Why: Address layout depends on relocation type and ASLR (Address Space Layout Randomization)
  • Fix: Disable ASLR for predictable addresses: setarch x86_64 -R ./program
  • Quick test: Compare output from objdump -R with runtime observation via gdb

Problem 3: “Interposition doesn’t work; original function still called”

  • Why: Library order matters; LD_PRELOAD works only for exported symbols, not static functions
  • Fix: Ensure your .so is in LD_PRELOAD and uses same symbol name; check with nm
  • Quick test: LD_PRELOAD=./interpose.so ldd ./program should show your .so first

Problem 4: “Segfault when calling original via dlsym”

  • Why: Function pointer dereference or incorrect calling convention
  • Fix: Verify function signature matches; test with a simple wrapper that just prints
  • Quick test: Have the wrapper verify dlsym returned a non-NULL pointer, then forward to the original with no extra logic; add logging only once that works

Definition of Done

  • Parse ELF headers and sections from a binary
  • Extract and display all symbols (both static and dynamic)
  • Parse and visualize relocation entries with their targets
  • Map PLT entries to GOT entries
  • Create an LD_PRELOAD interposition library that intercepts a function
  • Demonstrate lazy binding by observing GOT updates
  • Handle both 32-bit and 64-bit ELF binaries
  • Document symbol resolution chain with examples

Project 11: Signals + Processes Sandbox

  • File: CSAPP_SIGNALS_PROCESSES_SANDBOX.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go, C++
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Process Control / Signal Handling / Async I/O
  • Software or Tool: strace, gdb, kill, signal masks
  • Main Book: Computer Systems: A Programmer’s Perspective, 3rd Edition by Bryant & O’Hallaron

What you’ll build: A comprehensive test harness (sigsandbox) that safely executes untrusted code and tests signal behavior under controlled conditions. Includes: (1) signal tracing to log all signals, (2) zombie detector to find orphaned processes, (3) race condition fuzzer to test signal handler safety, and (4) test suite validating async-signal-safe operations.

Why it teaches signal handling: Signals are asynchronous—they interrupt program flow at unpredictable times. You’ll confront race conditions, deadlocks, and subtle bugs that only show up under load. This forces deep understanding of signal delivery, masking, and safe handler design.

Core challenges you’ll face:

  • Signal delivery and masking (blocking/unblocking) → maps to understanding atomic critical sections
  • Zombie processes and reaping → maps to understanding process lifecycle
  • Async-signal-safety and reentrancy → maps to understanding which functions are safe to call in handlers
  • Race conditions between signal arrival and handler registration → maps to understanding temporal ordering in concurrent systems

Real World Outcome

You’ll produce sigsandbox, a harness that captures signal behavior:

$ ./sigsandbox --test=signal_safety ./untrusted_program
[RUN] Executing: ./untrusted_program
[SIG] Caught SIGCHLD at time=0.001234s (from PID 12345)
[SIG] Caught SIGTERM at time=0.005678s (from PID 12346)

[ZOMBIE CHECK] Found 0 zombie processes
[SAFETY CHECK] Only async-safe functions called in handlers: PASS
[RACE CHECK] 1000 concurrent signals: No race detected

Summary:
  Signals caught: 2
  Zombies: 0
  Handler safety violations: 0
  Race conditions: 0

Plus additional tools:

  • signal_trace: Logs all signal activity to a file
  • zombie_detector: Identifies zombie processes
  • race_fuzzer: Stress-tests handlers with rapid signal delivery

The Core Question You’re Answering

“How do you write correct code that handles asynchronous interrupts, and how do you test that it works?”

Before coding, understand that signals interrupt code at random points. Any function that wasn’t running when the signal arrived might be unsafe to call in the handler. This creates subtle bugs that are nearly impossible to debug without a harness.

Concepts You Must Understand First

  1. Signal Delivery & Masking
    • How signals are blocked, pending, and delivered
    • sigprocmask, sigaction, signal sets
    • Book Reference: CS:APP 3e Ch. 8, Sec. 8.5-8.6
  2. Process Lifecycle & Reaping
    • Parent-child relationships, zombie processes
    • wait, waitpid, SIGCHLD handler
    • Book Reference: CS:APP 3e Ch. 8, Sec. 8.4
  3. Async-Signal-Safe Functions
    • Which functions are safe to call in handlers (minimal list)
    • Why malloc, printf, locks are dangerous
    • Book Reference: POSIX signal-safety(7) man page
  4. Race Conditions in Signal Handlers
    • Handler arrives while other code is running
    • Examples: signal arrives during malloc, during file I/O, during mutex lock
    • Book Reference: CS:APP 3e Ch. 8, Sec. 8.7

Questions to Guide Your Design

  1. Signal Masking: How do you temporarily block certain signals to protect a critical section? What happens to signals that arrive while blocked?
  2. Zombie Reaping: When a child process exits, when is it considered a zombie? How do you reap it correctly without losing signals?
  3. Handler Safety: You want to log a signal in the handler. printf is not async-safe. What can you use instead?
  4. Race Testing: How would you stress-test your signal handlers to make race conditions more likely to manifest?

Thinking Exercise

Signal Handler Race Condition

int counter = 0;

void handler(int sig) {
    counter++;  // Not atomic!
}

int main() {
    signal(SIGUSR1, handler);
    while (1) {
        printf("Counter: %d\n", counter);  // RACE: signal might arrive mid-printf!
    }
}

The bugs: counter is a plain int, so main may never observe the handler’s update (and the increment isn’t guaranteed atomic); worse, if the handler itself ever called printf, a signal arriving while main is inside printf (perhaps holding libc’s internal lock or calling malloc) could corrupt state or deadlock.

Questions:

  • What’s a safe way to update counter in the handler? (see the corrected sketch below)
  • What should main do instead of calling printf directly?
  • How would you detect this bug with your harness?
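
One corrected version, sketched (the single-digit output is deliberately crude so that every call in the loop stays async-signal-safe):

#include <signal.h>
#include <unistd.h>

volatile sig_atomic_t counter = 0;     /* atomic to read/write, async-safe */

void handler(int sig) {
    counter++;                         /* no library calls in the handler */
}

int main(void) {
    struct sigaction sa = {0};
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    sigaction(SIGUSR1, &sa, NULL);     /* sigaction, not signal() */

    for (;;) {
        pause();                       /* sleep until a signal arrives */
        sig_atomic_t snap = counter;   /* one atomic read in main */
        char digit = '0' + snap % 10;  /* crude single-digit output... */
        write(STDOUT_FILENO, &digit, 1);
        write(STDOUT_FILENO, "\n", 1); /* ...but every call is async-safe */
    }
}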

The Interview Questions They’ll Ask

  1. “What’s the difference between blocking a signal with sigprocmask and ignoring it with signal(sig, SIG_IGN)?”
  2. “You want to safely access a variable modified in a signal handler. How do you do it without race conditions?”
  3. “If a signal handler is installed but never uninstalled, what happens to signals when the process exits?”
  4. “Why might a zombie process exist even though the parent called wait?”
  5. “How would you design a signal-safe logging function for use in handlers?”

Hints in Layers

Hint 1: Build a Signal Tracer Create a handler for SIGCHLD that logs signal arrival time and sender PID. Use write (async-safe) to log to a file. Install the handler with sigaction (preferred over signal). Test by spawning child processes and verifying logs.

Hint 2: Implement Zombie Detection Walk the numeric entries under /proc and read each /proc/[pid]/status: a State: line beginning with Z marks a zombie, and PPid: gives its parent. Log zombies with their parent PID.

Hint 3: Create an Async-Safe Logger Build a function that only calls async-safe functions. Use a static buffer, write syscall, and integer-to-string conversion (no sprintf). Test by calling it from handlers.
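
A sketch of such a logger (the hand-rolled length loop avoids any doubt about which libc string functions are async-signal-safe):

#include <unistd.h>

/* Write "<msg><n>\n" using only write() and local buffers: handler-safe. */
static void safe_log(const char *msg, long n) {
    char num[24];
    int i = sizeof num;
    unsigned long v = (n < 0) ? -(unsigned long)n : (unsigned long)n;

    num[--i] = '\n';
    do { num[--i] = (char)('0' + v % 10); v /= 10; } while (v > 0);
    if (n < 0) num[--i] = '-';

    size_t len = 0;
    while (msg[len]) len++;            /* hand-rolled strlen: provably safe */
    write(STDERR_FILENO, msg, len);
    write(STDERR_FILENO, num + i, sizeof num - i);
}

From a handler you can then call, for example, safe_log("caught SIGCHLD from pid ", (long)pid).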

Hint 4: Stress-Test with Signal Storms Fork a child that sends the parent rapidly alternating SIGUSR1 and SIGUSR2. Your handlers should update per-signal counters. Remember that standard signals do not queue: while one is pending, duplicates coalesce, so the received count can legitimately trail the sent count. Compare sent vs received (or use sigqueue with real-time signals for exact delivery) and investigate only genuine anomalies, such as lost updates on a properly protected counter, as races.

Books That Will Help

Topic Book Chapter
Signal Handling Basics CS:APP 3e Ch. 8, Sec. 8.5
Signal Masking & Delivery CS:APP 3e Ch. 8, Sec. 8.6
Process Lifecycle CS:APP 3e Ch. 8, Sec. 8.4
Async-Signal-Safety POSIX signal-safety(7) Man page
Advanced Signal Programming APUE (Stevens) Ch. 10

Common Pitfalls and Debugging

Problem 1: “My zombie detector doesn’t find zombies”

  • Why: Zombies might be reaped between checks; parent might be handling SIGCHLD correctly
  • Fix: Create a test child that exits but doesn’t get reaped; confirm it appears as zombie
  • Quick test: cat /proc/[pid]/status | grep State should show Z (zombie)

Problem 2: “Signal handler is called but counter doesn’t increment”

  • Why: The update isn’t atomic, or the compiler cached the value because the variable isn’t declared volatile
  • Fix: Declare the counter as volatile sig_atomic_t; keep handler updates to a single increment
  • Quick test: Send 100 signals with small delays (standard signals coalesce when pending); the count should match

Problem 3: “Program deadlocks when signal arrives during printf”

  • Why: printf uses locks; handler calls printf; deadlock when signal arrives inside lock
  • Fix: Don’t use printf in handlers; use only async-safe functions like write
  • Quick test: Install handler, send signal while in printf; should not hang

Problem 4: “Parent misses SIGCHLD; children become zombies”

  • Why: Handler not installed, or waitpid called before SIGCHLD arrives (race)
  • Fix: Install the SIGCHLD handler with sigaction before the first fork; call waitpid in a loop with WNOHANG inside the handler
  • Quick test: Fork 100 children; all should be reaped before parent exits

Definition of Done

  • Implement signal tracer that logs all signals with timing and sender PID
  • Detect zombie processes by scanning /proc or process table
  • Create async-safe logging function using only write and integer conversion
  • Stress-test signal handlers with 1000+ rapid signals; verify no race conditions
  • Demonstrate safe vs unsafe signal handlers with documented examples
  • Show how sigprocmask prevents race conditions in critical sections
  • Handle SIGCHLD correctly without leaving zombies
  • Provide examples of common pitfalls and how to avoid them

Project 12: Unix Shell with Job Control

  • File: CSAPP_SHELL_JOB_CONTROL.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go, Python
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Process Control / Terminal I/O / Job Management
  • Software or Tool: readline, terminal attributes, process group management
  • Main Book: Computer Systems: A Programmer’s Perspective, 3rd Edition by Bryant & O’Hallaron

What you’ll build: A Unix shell (tsh) that supports background/foreground job control, including: (1) fg and bg commands to manage jobs, (2) signal handling for SIGCHLD, SIGINT, SIGTSTP, (3) correct process group management, and (4) proper terminal control.

Why it teaches process control: A shell is a process manager—it must coordinate multiple child processes, handle signals correctly, and manage terminal I/O. You’ll see how process groups, signals, and terminal control interact in a real system.

Core challenges you’ll face:

  • Process groups and session leaders → maps to understanding how terminal multiplexing works
  • Signal handling without race conditions → maps to atomic operations in concurrent systems
  • Job management (foreground vs background) → maps to process state tracking
  • Terminal control (SIGTSTP, SIGCONT) → maps to understanding kernel-terminal interaction

Real World Outcome

You’ll produce tsh, a shell that behaves like bash for job control:

$ ./tsh
tsh> sleep 100 &
[1] (12345) sleep 100
tsh> sleep 200
^Z
[2] (12346) Stopped sleep 200

tsh> jobs
[1] (12345) Running sleep 100
[2] (12346) Stopped sleep 200

tsh> bg 2
[2] (12346) sleep 200

tsh> fg 1
sleep 100
^C
tsh>

The shell correctly:

  • Launches jobs in foreground (blocking) or background (non-blocking)
  • Handles Ctrl-Z (SIGTSTP) to stop foreground job
  • Handles Ctrl-C (SIGINT) to kill foreground job
  • Maintains a job list with status (Running, Stopped, Done)
  • Moves jobs between foreground and background

The Core Question You’re Answering

“How does a shell manage multiple processes and handle terminal signals without losing synchronization?”

Before coding, understand that the shell is a traffic controller: it launches processes, waits for them, forwards signals, manages terminal control, and keeps track of job state—all while handling asynchronous signals.

Concepts You Must Understand First

  1. Process Groups & Sessions
    • Terminal sends signals to process groups, not individual processes
    • Foreground vs background process groups
    • setpgid, getpgrp, tcgetpgrp, tcsetpgrp
    • Book Reference: CS:APP 3e Ch. 8, Sec. 8.3-8.4
  2. Signal Handling for Job Control
    • SIGCHLD: child process exited or stopped
    • SIGINT, SIGTSTP: sent by terminal to foreground process group
    • Handlers must reap children and update job state
    • Book Reference: CS:APP 3e Ch. 8, Sec. 8.5-8.7
  3. Job Management Data Structure
    • Job list: track PID, process group, state, command line
    • Foreground job tracking: which job is in foreground?
    • State transitions: Running → Stopped → Done → reaped
    • Book Reference: CS:APP 3e Ch. 8 (Shell Lab specification)
  4. Terminal Control & Process Groups
    • How does Ctrl-C reach only the foreground job?
    • What is tcgetpgrp/tcsetpgrp for?
    • Book Reference: CS:APP 3e Ch. 8, Sec. 8.3

Questions to Guide Your Design

  1. Process Groups: When you run sleep 100 &, you create a new process. What process group should it belong to?
  2. Terminal Signals: When you press Ctrl-Z, why does it stop only the foreground job, not the shell?
  3. Job Tracking: How do you track which job is foreground? How do you move a job from background to foreground?
  4. Signal Safety: If SIGCHLD arrives while you’re updating the job list, what breaks?

Thinking Exercise

Terminal Signal Flow

You’re in shell with sleep 100 running in foreground. You press Ctrl-Z.

  1. Terminal driver sends SIGTSTP to entire foreground process group
  2. Sleep receives SIGTSTP and stops
  3. If the job still shares the shell’s process group, the shell receives SIGTSTP too!

Questions:

  • How does the shell prevent itself from stopping when Ctrl-Z is pressed?
  • Should the shell’s signal handler for SIGTSTP do anything?
  • What if the sleep process ignores SIGTSTP? What should the shell do?

The Interview Questions They’ll Ask

  1. “Explain the difference between a process group and a session. Why does the terminal need to know about process groups?”
  2. “When you press Ctrl-C in a shell with a background job, why doesn’t it kill the background job?”
  3. “What happens if the foreground process group stops? How does the shell regain control?”
  4. “How would you implement the bg command to resume a stopped background job?”
  5. “Why do you need both a SIGCHLD handler and the ability to block signals in critical sections?”

Hints in Layers

Hint 1: Build Job List Management Create a job struct: {pid, pgid, state, cmdline}. Implement functions to add, remove, find, and list jobs. Test with manual job creation before handling signals.

Hint 2: Signal Handlers Implement a SIGCHLD handler that calls waitpid in a loop with WNOHANG flag to reap any stopped or exited children. Update job state for each. Use sigprocmask to protect job list access.
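
The handler’s core loop looks roughly like this (job_set_stopped and job_delete are hypothetical job-list helpers from Hint 1):

#include <errno.h>
#include <sys/wait.h>

void job_set_stopped(pid_t pid);   /* hypothetical job-list helpers */
void job_delete(pid_t pid);

void sigchld_handler(int sig) {
    int olderrno = errno;          /* a handler must preserve errno */
    int status;
    pid_t pid;

    /* Reap every child that changed state, without blocking the shell. */
    while ((pid = waitpid(-1, &status, WNOHANG | WUNTRACED)) > 0) {
        if (WIFSTOPPED(status))
            job_set_stopped(pid);  /* Ctrl-Z or SIGSTOP */
        else
            job_delete(pid);       /* exited normally or killed by a signal */
    }
    errno = olderrno;
}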

Hint 3: Process Group Management When forking a child, the parent should call setpgid(child_pid, child_pid) to put the child in its own process group. If running in background, the shell shouldn’t wait. If foreground, use tcsetpgrp to give the child the terminal.

Hint 4: Shell Protection from Signals Before any critical operation (job list modification), block SIGCHLD, SIGINT, SIGTSTP with sigprocmask. Unblock after the operation completes. This prevents signal handlers from interrupting updates.
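
The blocking pattern, sketched as a wrapper (add_job is the hypothetical job-list function from Hint 1):

#include <signal.h>
#include <sys/types.h>

void add_job(pid_t pid, pid_t pgid, const char *cmdline);  /* hypothetical */

/* Wrap every job-list mutation in a block/unblock pair. */
void add_job_safely(pid_t pid, pid_t pgid, const char *cmdline) {
    sigset_t mask, prev;
    sigemptyset(&mask);
    sigaddset(&mask, SIGCHLD);
    sigaddset(&mask, SIGINT);
    sigaddset(&mask, SIGTSTP);

    sigprocmask(SIG_BLOCK, &mask, &prev);    /* handlers can't fire in here */
    add_job(pid, pgid, cmdline);             /* the critical section */
    sigprocmask(SIG_SETMASK, &prev, NULL);   /* restore the previous mask */
}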

Books That Will Help

Topic Book Chapter
Process Groups & Sessions CS:APP 3e Ch. 8, Sec. 8.3
Job Control & Signal Handling CS:APP 3e Ch. 8, Sec. 8.4-8.7
Shell Lab Specification CS:APP 3e Shell Lab handout (shlab)
Advanced Job Control APUE (Stevens) Ch. 9

Common Pitfalls and Debugging

Problem 1: “Ctrl-Z doesn’t stop the foreground job”

  • Why: Process not in correct process group, or terminal signal handling broken
  • Fix: Ensure child is in its own process group; use tcsetpgrp to give terminal to child
  • Quick test: Run sleep 100 in foreground; press Ctrl-Z; should print Stopped

Problem 2: “Reusing a job ID causes confusion”

  • Why: Job list has old job with same ID; not properly removed
  • Fix: When job is reaped, remove it from job list; use next free job number
  • Quick test: Run 10 jobs; kill them; run 10 more; job IDs should be 1-10 again

Problem 3: “SIGCHLD handler runs while updating job list; data corruption”

  • Why: Not blocking signals during critical sections
  • Fix: Use sigprocmask to block SIGCHLD before modifying job list; unblock after
  • Quick test: Stress test with 100 rapid job creations; should not crash

Problem 4: “bg command doesn’t work; job stays stopped”

  • Why: Not sending SIGCONT to stopped job’s process group
  • Fix: bg should call kill(-pgid, SIGCONT) to resume all processes in group
  • Quick test: Stop a job with Ctrl-Z; run bg; job should continue

Definition of Done

  • Parse shell commands with arguments (tokenization)
  • Create child processes and manage process groups
  • Implement jobs command listing all active jobs
  • Handle SIGCHLD correctly, reaping children and updating state
  • Protect job list with signal masking during updates
  • Implement fg command: bring job to foreground and wait for it
  • Implement bg command: resume stopped background job
  • Properly handle Ctrl-C and Ctrl-Z for foreground job only
  • Show job notifications when job stops or completes
  • Support & suffix for background jobs

Project 13: Virtual Memory Map Visualizer

  • File: CSAPP_VM_MAP_VISUALIZER.md
  • Main Programming Language: C
  • Alternative Programming Languages: Python, Rust, Go
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Virtual Memory / Process Address Space / Page Management
  • Software or Tool: /proc/[pid]/maps, /proc/[pid]/smaps, mmap
  • Main Book: Computer Systems: A Programmer’s Perspective, 3rd Edition by Bryant & O’Hallaron

What you’ll build: An interactive tool vmmap that visualizes a process’s virtual address space, including: (1) visual representation of memory regions (code, heap, stack, libraries), (2) parsing and display of /proc/[pid]/maps and /proc/[pid]/smaps, (3) symbol resolution to identify function boundaries, (4) dynamic visualization showing memory changes as the process runs.

Why it teaches virtual memory: Virtual memory is abstract—addresses look like addresses, but they’re really just virtual. This tool makes that abstraction concrete by showing exactly how the kernel maps virtual pages to physical frames, manages ASLR, and arranges segments.

Core challenges you’ll face:

  • Parsing /proc/[pid]/maps and /proc/[pid]/smaps → maps to understanding kernel’s view of address space
  • Symbol resolution to identify functions → maps to understanding debug symbols and relocation
  • Visualizing memory layout across time → maps to understanding dynamic memory allocation
  • Page permission flags and their meaning → maps to understanding virtual memory protection

Real World Outcome

You’ll produce vmmap, an interactive tool:

$ vmmap --pid=1234
Virtual Address Map for PID 1234 (/usr/bin/firefox)

Address Space:
┌─────────────────────────────────────────────────┐ 0xffffffffffffffff
│               Kernel Space                      │
└─────────────────────────────────────────────────┘ 0xffffffff80000000
                    (gap)
┌─────────────────────────────────────────────────┐ 0x7fffff000000
│  Shared Libraries                               │
│  libc.so.6   0x7f5a58c00000-0x7f5a58da2000 rw-p│
│  libm.so.6   0x7f5a5aa00000-0x7f5a5aa23000 r-xp│
└─────────────────────────────────────────────────┘
                    (gap)
┌─────────────────────────────────────────────────┐ 0x0000600000000000
│  Heap                                           │
│  Allocated: 256 MB | Free: 1.2 GB               │
└─────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────┐ 0x0000000600000000
│  BSS/Data Segment                               │
│  .data      0x0000000600000000-0x0000000600001000 rw-│
│  .bss       0x0000000600001000-0x0000000600002000 rw-│
└─────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────┐ 0x0000000400000000
│  Code Segment                                   │
│  .text      0x0000000400000000-0x0000000400010000 r-xp│
└─────────────────────────────────────────────────┘

Plus detailed metrics:

$ vmmap --pid=1234 --detailed
[stack] 0x7ffffffde000-0x7ffffffff000 rw-p Size: 132 KB (RSS: 64 KB)
[vvar]  0x7ffff7fbc000-0x7ffff7fbe000 r--p Size: 8 KB (RSS: 8 KB)
[heap]  0x5555557c1000-0x555555942000 rw-p Size: 1.5 MB (RSS: 320 KB)
[text]  0x555555554000-0x555555667000 r-xp Size: 1.1 MB (RSS: 1.1 MB)

The Core Question You’re Answering

“What does the virtual address space of a running process actually look like, and how does the kernel organize it?”

Before coding, understand that every process has a complete (64-bit) address space, but only parts are actually mapped to physical memory. The kernel’s page tables determine what’s valid.

Concepts You Must Understand First

  1. Virtual Address Space Layout
    • Stack (grows down), heap (grows up), code, data, libraries
    • Address Space Layout Randomization (ASLR): why are addresses different each run?
    • Book Reference: CS:APP 3e Ch. 9, Sec. 9.9-9.10
  2. Page Maps and Segments
    • What does each line in /proc/self/maps mean?
    • Permission flags: r/w/x, p (private) / s (shared)
    • Book Reference: man proc(5) - /proc/[pid]/maps
  3. Memory Regions and Mapping
    • Executable sections vs data sections vs BSS
    • mmap’d files and anonymous memory
    • Book Reference: CS:APP 3e Ch. 9, Sec. 9.2
  4. Symbol Information
    • Reading symbol tables from binaries to identify function boundaries
    • Debug symbols vs stripped binaries
    • Book Reference: CS:APP 3e Ch. 7, Sec. 7.3

Questions to Guide Your Design

  1. Address Space Parsing: What information does /proc/self/maps provide? How do you extract start address, end address, and permissions?
  2. Memory Regions: How do you distinguish code (.text) from data (.data) from heap from stack?
  3. Symbol Resolution: Given a virtual address, how do you find which function it belongs to?
  4. ASLR Impact: If you run the same program twice, why are the addresses different? How does ASLR work?

Thinking Exercise

Memory Region Identification

$ cat /proc/self/maps
555555554000-555555667000 r-xp ... /usr/bin/prog
555555867000-555555869000 rw-p ... /usr/bin/prog
7ffff7ddc000-7ffff7e02000 r-xp ... /lib64/ld-linux
7ffff7dd2000-7ffff7ddc000 rw-p ... /lib64/ld-linux
555555869000-55555588a000 rw-p ... [heap]
7ffffffde000-7ffffffff000 rw-p ... [stack]

Questions:

  • Which region is the code (.text) of the program?
  • Which region is the heap?
  • Why does /lib64/ld-linux appear twice, once r-xp and once rw-p?

The Interview Questions They’ll Ask

  1. “Explain the virtual address space layout of a 64-bit process. Where does code, stack, and heap live?”
  2. “What is ASLR, and how does it affect debugging and security?”
  3. “If you map a file with mmap, where does it appear in the address space?”
  4. “What does RSS (Resident Set Size) mean vs VSZ (Virtual Size) in /proc/[pid]/smaps?”
  5. “How do you find which shared library a function comes from given its address?”

Hints in Layers

Hint 1: Parse /proc/[pid]/maps Read the maps file for any process. Extract fields: start address, end address, permissions, offset, device, inode, pathname. Format as a table.
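
A parsing sketch to get you started (field widths are our own choices; the pathname column is optional in maps, which the format string tolerates):

#include <stdio.h>

/* Sketch: dump name, size, and permissions for each region of this
   process. Swap "self" for a PID to inspect another process. */
int main(void) {
    FILE *f = fopen("/proc/self/maps", "r");
    char line[512];

    while (f && fgets(line, sizeof line, f)) {
        unsigned long start, end, offset, inode;
        char perms[5], dev[12], path[256] = "";

        /* The leading space before %255[^\n] skips the column whitespace. */
        sscanf(line, "%lx-%lx %4s %lx %11s %lu %255[^\n]",
               &start, &end, perms, &offset, dev, &inode, path);
        printf("%-40s %9lu KB  %s\n",
               path[0] ? path : "[anonymous]", (end - start) / 1024, perms);
    }
    if (f) fclose(f);
    return 0;
}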

Hint 2: Categorize Memory Regions

  • If the pathname starts with /, it’s a file mapping (executable, library, etc.)
  • If pathname is [heap], [stack], [vvar], use those labels
  • If pathname is empty and private, it’s likely anonymous memory (stack or heap)

Hint 3: Visualize the Address Space Create a large ASCII diagram showing the virtual memory layout. Scale the sizes appropriately. Show permissions and memory usage (RSS) for each region.

Hint 4: Add Symbol Resolution Use dlopen/dlsym or ELF parsing to find symbols. Given an address, binary search through symbol tables to find the nearest symbol (function name).
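
For addresses inside shared libraries, dladdr(3) does the nearest-symbol lookup for you; a sketch (exported symbols only, so stripped or static symbols still need your own ELF parsing; link with -ldl on older glibc):

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

/* Sketch: map a code address to the nearest exported symbol. */
int main(void) {
    Dl_info info;
    void *addr = (void *)printf;      /* any address inside a shared object */

    if (dladdr(addr, &info) && info.dli_sname)
        printf("%p -> %s in %s\n", addr, info.dli_sname, info.dli_fname);
    return 0;
}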

Books That Will Help

Topic Book Chapter
Virtual Address Space CS:APP 3e Ch. 9, Sec. 9.9-9.10
Memory Mapping CS:APP 3e Ch. 9, Sec. 9.2
/proc filesystem man proc(5) /proc/[pid]/maps
ASLR & Security CS:APP 3e Ch. 3, Sec. 3.13

Common Pitfalls and Debugging

Problem 1: “Maps file shows different addresses each time I run”

  • Why: ASLR is enabled; kernel randomizes memory layout
  • Fix: This is normal behavior; disable ASLR only for debugging with setarch x86_64 -R
  • Quick test: Run without ASLR; addresses should be consistent

Problem 2: “Can’t identify which region is heap vs stack”

  • Why: Heap and stack might have same permissions; need to look at pathname
  • Fix: The stack carries the [stack] label and the main heap [heap]; unlabeled anonymous rw-p regions are usually mmap’d allocations, so infer from address range and growth direction
  • Quick test: Allocate memory, run vmmap again; see which region grew

Problem 3: “Symbol resolution shows wrong function names”

  • Why: Symbols might be stripped, or you’re matching against wrong binary version
  • Fix: Use readelf -s to inspect symbols; ensure you’re reading the correct binary file
  • Quick test: objdump -t /path/to/binary | grep address

Problem 4: “RSS size doesn’t match actual memory usage”

  • Why: RSS includes shared pages; memory shared with other processes counted in each
  • Fix: Use PSS (Proportional Set Size) for more accurate accounting (in /proc/[pid]/smaps)
  • Quick test: Check /proc/[pid]/smaps instead of /maps

Definition of Done

  • Parse /proc/[pid]/maps and extract all fields correctly
  • Categorize regions (code, data, heap, stack, mmap’d files)
  • Visualize address space as ASCII diagram with proper scaling
  • Show memory permissions and sizes for each region
  • Read /proc/[pid]/smaps for per-region memory usage (RSS)
  • Resolve symbols to identify function names at given addresses
  • Handle ASLR correctly (show that addresses vary between runs)
  • Support interactive commands: list regions, show details, follow heap growth, etc.

Project 14: Build Your Own Malloc

  • File: CSAPP_BUILD_MALLOC.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Heap Management / Memory Allocation / Data Structures
  • Software or Tool: sbrk, mmap, heap inspectors
  • Main Book: Computer Systems: A Programmer’s Perspective, 3rd Edition by Bryant & O’Hallaron

What you’ll build: A complete memory allocator (mymalloc) that implements malloc, realloc, and free with implicit free lists, explicit free lists, or segregated free lists. Includes fragmentation analysis and performance benchmarking.

Why it teaches memory management: Malloc is invisible to most programmers, but building one forces you to confront: how does the heap grow? How does free list fragmentation waste memory? Why does poor allocation strategy kill performance? This project connects abstract memory allocation to concrete data structures.

Core challenges you’ll face:

  • Free list management and coalescing → maps to understanding fragmentation and defragmentation
  • Alignment and padding → maps to understanding hardware constraints and compiler layouts
  • Splitting and merging blocks → maps to understanding data structure operations under constraints
  • Performance vs fragmentation tradeoffs → maps to understanding engineering tradeoffs

Real World Outcome

You’ll produce a malloc library that passes test cases:

$ ./mymalloc_test
[ALLOC] malloc(1024) -> 0x5555557c1000
[ALLOC] malloc(512) -> 0x5555557c1400
[ALLOC] malloc(256) -> 0x5555557c1600
[FREE] free(0x5555557c1400) -> OK
[REALLOC] realloc(0x5555557c1000, 2048) -> 0x5555557c2000
[STATS] Total allocated: 3840 bytes, Fragmentation: 12%, Peak memory: 5.2 KB

Performance Benchmark:
  Allocate 10000 random sizes: 2.3 ms
  Free all: 1.1 ms
  Allocate + Free 5000 pairs: 4.2 ms
  System malloc (reference): 5.1 ms

Plus detailed diagnostics:

$ ./heap_inspector --pid=1234
Heap Analysis for PID 1234:
┌─────────────────────────────────┐
│ Block 1: 1024 bytes [ALLOCATED] │ 0x1000000
│ Block 2:  512 bytes [FREE]      │ 0x1000400
│ Block 3:  256 bytes [ALLOCATED] │ 0x1000600
│ Block 4: 3200 bytes [FREE]      │ 0x1000700
└─────────────────────────────────┘
Fragmentation: 12% (largest free: 3200 bytes, total free: 3712 bytes)

The Core Question You’re Answering

“How does malloc manage the heap, and what happens when memory is fragmented?”

Before coding, understand that malloc is a data structure problem: you have a big region of memory, you need to track which parts are allocated, and you need to find free space for new allocations.

Concepts You Must Understand First

  1. Heap Organization & Free Lists
    • Implicit free lists (boundary tags), explicit free lists, segregated lists
    • Why free lists exist and how they’re traversed
    • Book Reference: CS:APP 3e Ch. 9, Sec. 9.9
  2. Block Coalescing & Splitting
    • Immediate vs deferred coalescing (immediate is simpler, deferred reduces overhead)
    • How to split a free block when allocation is smaller
    • Why coalescing matters for fragmentation
    • Book Reference: CS:APP 3e Ch. 9, Sec. 9.9
  3. Alignment and Padding
    • Why 8-byte or 16-byte alignment is required
    • How to calculate padding
    • Trade-off between alignment and fragmentation
    • Book Reference: CS:APP 3e Ch. 3, Sec. 3.9
  4. Fragmentation & Policies
    • External fragmentation: free space scattered, can’t be used
    • Internal fragmentation: allocated block bigger than request (padding)
    • First-fit, best-fit, worst-fit allocation strategies
    • Book Reference: CS:APP 3e Ch. 9, Sec. 9.9

Questions to Guide Your Design

  1. Block Representation: How do you store metadata (size, allocated/free, next block) without using extra space outside the heap?
  2. Finding Free Space: Given a request for N bytes, how do you find a suitable free block? (First-fit, best-fit?)
  3. Coalescing: When you free a block, how do you find adjacent free blocks to merge with?
  4. Alignment: If you allocate 13 bytes, how much space do you actually use? (Should be rounded to 16 or 24)

Thinking Exercise

Block Representation with Boundary Tags

Free block:  [header: size|alloc] [data...] [footer: size|alloc]
Alloc block: [header: size|alloc] [data...] (no footer, saves space)

Example: Allocate 256 bytes from a 1024-byte free block

Before:
[1024|free] [1016 bytes of data...] [1024|free]

After (if you allocate 256 bytes):
[256|alloc] [252 bytes of data...]
[768|free] [760 bytes of data...] [768|free]

Questions:

  • How do you know where the next block starts?
  • Why include a footer in free blocks but not allocated ones?
  • What size should the header be to store both size and alloc bit?

The Interview Questions They’ll Ask

  1. “What’s the difference between external and internal fragmentation? Can you eliminate one without increasing the other?”
  2. “In a boundary-tag allocator, why do free blocks need footers but allocated blocks don’t?”
  3. “What’s the trade-off between first-fit, best-fit, and worst-fit allocation policies?”
  4. “How does coalescing reduce fragmentation? What’s the difference between immediate and deferred coalescing?”
  5. “If your malloc is slower than system malloc, what optimizations would you try?”

Hints in Layers

Hint 1: Implement a Simple Implicit Free List Use boundary tags (headers with size and alloc bit). Store list sequentially in the heap. For allocation, walk the heap linearly to find a free block. For freeing, mark the block as free. Don’t worry about coalescing yet.
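
A sketch of the boundary-tag helpers and a first-fit search, in the spirit of the book’s macros (heap_listp and heap_end are assumed to come from your own heap-initialization code):

/* Boundary-tag helpers: a header stores size | alloc-bit in one word. */
#define WSIZE 8UL                               /* header/footer size in bytes */
#define PACK(size, alloc) ((size) | (alloc))    /* sizes are 8-aligned: low bits free */
#define GET(p)        (*(unsigned long *)(p))
#define PUT(p, v)     (*(unsigned long *)(p) = (v))
#define GET_SIZE(p)   (GET(p) & ~0x7UL)
#define GET_ALLOC(p)  (GET(p) & 0x1UL)
#define HDRP(bp)      ((char *)(bp) - WSIZE)    /* bp points at the payload */
#define NEXT_BLKP(bp) ((char *)(bp) + GET_SIZE(HDRP(bp)))

extern char *heap_listp, *heap_end;             /* assumed: set up by your init */

/* First-fit: walk the implicit list block by block via the size fields. */
static void *find_fit(size_t asize) {
    for (char *bp = heap_listp; bp < heap_end; bp = NEXT_BLKP(bp))
        if (!GET_ALLOC(HDRP(bp)) && GET_SIZE(HDRP(bp)) >= asize)
            return bp;
    return NULL;                                /* caller must extend the heap */
}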

Hint 2: Add Coalescing When freeing, check the previous and next blocks. If they’re free, merge with them. To find the previous block, either store a footer in all blocks (expensive) or maintain a doubly-linked list.

Hint 3: Optimize with Explicit Free Lists Instead of walking all blocks (allocated and free), maintain a list of only free blocks. This speeds up allocation. Use first-fit: find the first free block that fits.

Hint 4: Benchmark and Analyze Create test patterns: rapid allocate/free, pathological fragmentation patterns, realistic workloads. Measure throughput and memory usage. Identify bottlenecks.

Books That Will Help

Topic Book Chapter
Heap Allocation CS:APP 3e Ch. 9, Sec. 9.9
Memory Allocator Design The Craft of System Design Ch. 7
Fragmentation Analysis Research papers “Malloc Fragmentation”

Common Pitfalls and Debugging

Problem 1: “Fragmentation builds up; can’t allocate even though free space exists”

  • Why: Not coalescing adjacent free blocks properly
  • Fix: After free, immediately merge with adjacent free blocks
  • Quick test: Allocate/free in patterns; use heap_inspector to visualize fragmentation

Problem 2: “Alignment issues; crashes on 32-byte SIMD operations”

  • Why: Pointers not properly aligned to 8 or 16 bytes
  • Fix: Round all sizes up to nearest multiple of 8/16; ensure header doesn’t break alignment
  • Quick test: Allocate odd sizes (17, 23 bytes); check alignment of returned pointer

Problem 3: “Realloc doesn’t work; data is corrupted”

  • Why: Realloc tries to expand in-place without checking adjacent block; should move if can’t expand
  • Fix: Check if next block is free and large enough; otherwise allocate new, copy, free old
  • Quick test: Realloc with increasing sizes; verify data integrity

Problem 4: “Performance is terrible; 10x slower than system malloc”

  • Why: Linear search for free block, or excessive coalescing overhead
  • Fix: Use an explicit free list or segregated lists so searches skip allocated blocks; segregated fits give near-best-fit placement at near-first-fit cost
  • Quick test: Profile; identify hotspots; switch to more efficient data structure

Definition of Done

  • Implement malloc that allocates memory and tracks allocation state
  • Implement free that marks memory as free
  • Coalesce adjacent free blocks to reduce fragmentation
  • Implement realloc correctly (expand in-place or move)
  • Ensure correct alignment (8-byte or 16-byte)
  • Measure and document fragmentation rate
  • Benchmark against system malloc; be within 2x on throughput
  • Pass comprehensive test suite including stress tests and edge cases

Project 15: Robust Unix I/O Toolkit

  • File: CSAPP_ROBUST_IO_TOOLKIT.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: I/O Operations / System Programming / Error Handling
  • Software or Tool: read, write, select, poll, epoll
  • Main Book: Computer Systems: A Programmer’s Perspective, 3rd Edition by Bryant & O’Hallaron

What you’ll build: A robust I/O library with wrappers for read/write that handle short counts and signals correctly. Includes: (1) rio_readn/rio_writen for guaranteed I/O, (2) rio_readlineb for buffered reading, (3) signal-safe helpers, (4) multiplexing support with select/epoll.

Why it teaches I/O robustness: System I/O is unreliable—read might return fewer bytes than requested, signals can interrupt syscalls. Most code ignores this and crashes. You’ll learn to write code that actually works on real systems.

Core challenges you’ll face:

  • Partial I/O (short counts) → maps to understanding buffering and retries
  • EINTR and signal interruption → maps to understanding async system behavior
  • Buffered vs unbuffered I/O → maps to understanding performance tradeoffs
  • Multiplexing multiple connections → maps to understanding event-driven programming

Real World Outcome

You’ll create a library tested on real I/O patterns:

$ ./rio_test
[TEST] rio_readn: Read 1MB in 8-byte chunks
  - Expected: 131072 reads
  - Actual: 131072 (correct!)
  - Signals received: 5,847 (all handled)

[TEST] rio_writen: Write 1MB across pipe with limited buffer
  - Expected: 1048576 bytes written
  - Actual: 1048576 (correct!)
  - Write calls: 847 (averaged 1.2KB per syscall)

[TEST] Buffered reading with signals
  - Read 100 lines (8KB total)
  - SIGALRM sent during reads: 23 times
  - All data recovered: YES

Plus a full echo server using rio_readlineb:

$ ./echo_server 8000 &
$ telnet localhost 8000
Hello world
Hello world
[server echoed back]
^C

The Core Question You’re Answering

“How do you write I/O code that works reliably despite signals, short counts, and buffering?”

Before coding, understand: system I/O is not transactional. A read for 1KB might return 1 byte. A write for 1KB might write 512 bytes then return. Signals can interrupt any syscall. Ignoring these facts leads to data loss.

Concepts You Must Understand First

  1. Short Counts & Partial I/O
    • read returns fewer bytes than requested (not an error)
    • write might write only part of buffer (not an error)
    • Must retry in loop until done
    • Book Reference: CS:APP 3e Ch. 10, Sec. 10.2
  2. EINTR and Signal Interruption
    • Signals can interrupt syscalls with EINTR
    • Some syscalls auto-restart on EINTR, others don’t
    • Must handle both cases
    • Book Reference: CS:APP 3e Ch. 8, Sec. 8.6
  3. Buffering Strategies
    • Unbuffered: read/write every byte (slow)
    • Line-buffered: buffer until newline (useful for text)
    • Fully buffered: buffer in blocks (fastest)
    • Book Reference: CS:APP 3e Ch. 10, Sec. 10.1
  4. Multiplexing Connections
    • select/poll/epoll: monitor multiple file descriptors
    • When to use each (performance, portability)
    • Book Reference: CS:APP 3e Ch. 12, Sec. 12.2

Questions to Guide Your Design

  1. Short Counts: If read(fd, buf, 1000) returns 50, what should the caller do?
  2. EINTR Handling: If write returns -1 with errno == EINTR, how much data was written? (Answer: none; if some bytes had already been transferred, write would have returned that partial count instead of -1)
  3. Buffering: When reading lines, when should you flush? (When full, on newline, on error?)
  4. Multiplexing: How do you efficiently handle 10,000 simultaneous connections?

Thinking Exercise

Short-Count Error

// WRONG CODE
int n = read(fd, buf, 1000);
if (n < 0) perror("error");
// Now buf definitely has all data... WRONG!

What if read returns 50? The code thinks buf has 1000 bytes. Crash!

Questions:

  • How do you fix this? (Loop, accumulating into the buffer, until you have all the bytes you asked for, read returns 0, or a real error occurs)
  • What does it mean when read returns 0? (EOF)
  • How do you handle partial write?

The Interview Questions They’ll Ask

  1. “What’s a short count, and why does it happen? How do you handle it?”
  2. “Why does write return early, and what are the consequences for your code?”
  3. “What’s the difference between auto-restarting and non-restarting syscalls? Give examples.”
  4. “Design a line-reading function that buffers input efficiently. How do you handle EINTR?”
  5. “Why is select better than busy-polling 10,000 non-blocking file descriptors in a loop?”

Hints in Layers

Hint 1: Implement rio_readn Wrapper for read that loops until all N bytes are read or EOF is reached. Retry when errno == EINTR; on a short count, advance the buffer pointer and keep looping.
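
A from-memory sketch of what rio_readn looks like when both cases are handled:

#include <errno.h>
#include <unistd.h>

/* Read n bytes (or up to EOF), surviving short counts and EINTR. */
ssize_t rio_readn(int fd, void *usrbuf, size_t n) {
    size_t nleft = n;
    char *bufp = usrbuf;

    while (nleft > 0) {
        ssize_t nread = read(fd, bufp, nleft);
        if (nread < 0) {
            if (errno == EINTR)
                continue;          /* interrupted by a signal handler: retry */
            return -1;             /* genuine error */
        }
        if (nread == 0)
            break;                 /* EOF */
        nleft -= nread;
        bufp  += nread;
    }
    return n - nleft;              /* >= 0: bytes actually read */
}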

Hint 2: Implement rio_writen Wrapper for write that loops until all N bytes are written. Handle EINTR by retrying. Handle short counts by adjusting buffer pointer and length.

Hint 3: Add Buffered Line Reading Implement rio_readlineb with internal buffer. Read chunks with rio_readn, return one line at a time. Handle lines split across read boundaries.

Hint 4: Implement select-based Multiplexer Create a simple server that handles multiple clients with select. Demonstrate that one thread can serve many clients if they’re not all active simultaneously.
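
The core loop, sketched (listenfd is assumed already bound and listening; handle_client is a hypothetical per-connection handler that clears the descriptor from the set on EOF):

#include <sys/select.h>
#include <sys/socket.h>

void handle_client(int fd, fd_set *read_set);   /* hypothetical handler */

void serve(int listenfd) {
    fd_set read_set, ready_set;
    int maxfd = listenfd;

    FD_ZERO(&read_set);
    FD_SET(listenfd, &read_set);

    for (;;) {
        ready_set = read_set;                   /* select() overwrites its arg */
        select(maxfd + 1, &ready_set, NULL, NULL, NULL);

        if (FD_ISSET(listenfd, &ready_set)) {   /* new connection pending */
            int connfd = accept(listenfd, NULL, NULL);
            FD_SET(connfd, &read_set);
            if (connfd > maxfd) maxfd = connfd;
        }
        for (int fd = 0; fd <= maxfd; fd++)
            if (fd != listenfd && FD_ISSET(fd, &ready_set))
                handle_client(fd, &read_set);   /* one ready client at a time */
    }
}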

Books That Will Help

Topic Book Chapter
I/O Basics CS:APP 3e Ch. 10, Sec. 10.1-10.2
Network I/O CS:APP 3e Ch. 11, Sec. 11.4-11.5
Signal Safety CS:APP 3e Ch. 8, Sec. 8.6
Multiplexing CS:APP 3e Ch. 12, Sec. 12.2
Rio Implementation APUE Ch. 3

Common Pitfalls and Debugging

Problem 1: “Signals cause silent data loss”

  • Why: Short count from EINTR not handled; code assumes full read/write
  • Fix: Always loop; handle EINTR explicitly with while (n < 0 && errno == EINTR)
  • Quick test: Send SIGALRM during read; verify all data still arrives

Problem 2: “Buffered reader loses data between buffer fills”

  • Why: Newline at buffer boundary; incomplete handling of wrap-around
  • Fix: Properly track position in buffer; handle lines split across boundaries
  • Quick test: Send lines of exactly buffer size; send many tiny lines

Problem 3: “Server doesn’t detect client disconnect”

  • Why: EOF condition not recognized (read returns 0)
  • Fix: Check for n == 0; close connection; remove from multiplexer
  • Quick test: Telnet to server, kill client suddenly; verify server detects it

Problem 4: “epoll doesn’t work as expected on certain systems”

  • Why: epoll is Linux-specific, and edge-triggered mode is easy to misuse
  • Fix: Use level-triggered mode (the default) unless you drain each socket fully per event; on BSD/macOS substitute poll or kqueue
  • Quick test: Re-run the suite against your select-based server as a baseline; results should match

Definition of Done

  • Implement rio_readn that handles short counts and EINTR
  • Implement rio_writen that handles short counts and EINTR
  • Implement rio_readlineb with internal buffering
  • Handle all signal interruptions correctly
  • Create test suite with intentional EINTR injection
  • Demonstrate line-reading with signals firing during reads
  • Implement multiplexing server with select and/or epoll
  • Handle client disconnects correctly
  • Document all edge cases and test patterns

Project 16: Concurrency Workbench

  • File: CSAPP_CONCURRENCY_WORKBENCH.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust, Go
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: Level 2: Practical but Forgettable
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Concurrency / Threading / Synchronization
  • Software or Tool: Pthreads / Valgrind / Helgrind
  • Main Book: CS:APP 3e (Chapter 12)

What you’ll build: A benchmark framework that compares three concurrency models (thread-per-request, thread pool, and event-driven multiplexing) using identical workloads, measuring throughput, latency, and resource usage.

Why it teaches concurrency: Building three different concurrency models side-by-side forces you to understand the tradeoffs: threads are simple but expensive, thread pools are practical but complex, and multiplexing is efficient but non-blocking. You’ll see why different models win for different workloads.

Core challenges you’ll face:

  • Thread safety without deadlocks → maps to mutex ordering and condition variables
  • Correct shutdown and cleanup → maps to graceful termination and resource deallocation
  • Fair comparison between models → maps to equivalent load generation and measurement
  • Detecting race conditions → maps to Helgrind and ThreadSanitizer

Key Concepts:

  • Thread-Per-Request Model: APUE (Stevens & Rago) Ch. 11 (Threads)
  • Thread Pool Pattern: Effective Java (Item 81) - Joshua Bloch
  • Select/Poll/Epoll Multiplexing: CS:APP Ch. 12, Sec. 12.2
  • Synchronization Primitives: CS:APP Ch. 12, Sec. 12.3
  • Race Condition Detection: Valgrind Helgrind user manual

Difficulty: Advanced. Time estimate: 2-3 weeks. Prerequisites: You must understand basic threading, mutex/semaphore concepts, the fork/exec model, and select/epoll basics. Complete Projects 11-15 first to fully appreciate the contrasts.

Real World Outcome

You’ll produce a benchmarking suite that runs identical work across three concurrency models and prints a comparative report.

$ ./concurrency_bench --workload=http --clients=1000 --duration=10

====== CONCURRENCY WORKBENCH ======

[Workload] HTTP request handler
[Duration] 10 seconds
[Load] 1000 concurrent clients

--- Thread-Per-Request Model ---
[INIT] Spawning one thread per connection (1000 threads)...
[SUMMARY] Throughput: 45,230 req/sec
[SUMMARY] Avg latency: 22.1 ms
[SUMMARY] Max latency: 1,250 ms
[SUMMARY] Memory: 125 MB (each thread ~100KB stack)
[WARNING] Thread creation took 2.5 seconds

--- Thread Pool Model (16 workers) ---
[INIT] Starting 16 worker threads with queue...
[SUMMARY] Throughput: 89,450 req/sec
[SUMMARY] Avg latency: 11.2 ms
[SUMMARY] Max latency: 450 ms
[SUMMARY] Memory: 3.2 MB (fixed overhead)
[STATS] Queue depth: max 45, average 8

--- Event-Driven Model (epoll) ---
[INIT] Starting epoll reactor...
[SUMMARY] Throughput: 127,890 req/sec
[SUMMARY] Avg latency: 7.8 ms
[SUMMARY] Max latency: 320 ms
[SUMMARY] Memory: 1.5 MB (minimal per-connection)
[STATS] Epoll FD count: 1001

====== COMPARISON ======
Winner: Event-Driven (2.8x faster, 80x less memory)
Tradeoff: Epoll harder to program correctly

The Core Question You’re Answering

“When should I use threads vs thread pools vs select/epoll, and what does each model trade off?”

Before building three models, you realize most developers pick one and never see the tradeoffs. This project forces you to measure them directly. You’ll see that threads are simple, thread pools are practical, and multiplexing is king for I/O-bound work.

Concepts You Must Understand First

Stop and research these before coding:

  1. Pthreads API and Synchronization
    • What does pthread_create, pthread_join do?
    • Why do you need pthread_mutex_lock and pthread_cond_wait?
    • What is a condition variable and when do you use it?
    • Book Reference: APUE (Stevens & Rago) Ch. 11
  2. Thread Pool Pattern
    • How does a bounded work queue prevent unbounded thread growth?
    • What happens when the queue is full?
    • How do worker threads cleanly shut down?
    • Book Reference: “Java Concurrency in Practice” - Goetz et al. - Ch. 8
  3. Select/Poll/Epoll Multiplexing
    • What is the advantage of select over thread-per-connection?
    • Why is epoll better than select for thousands of connections?
    • What is edge-triggered vs level-triggered epoll?
    • Book Reference: CS:APP 3e - Ch. 12, Sec. 12.2
  4. Synchronization Pitfalls
    • What causes a deadlock? (lock ordering, hold-and-wait)
    • Why can spurious wakeups happen with condition variables?
    • How do you ensure all threads wake up at shutdown?
    • Book Reference: CS:APP 3e - Ch. 12, Sec. 12.3

Questions to Guide Your Design

Before implementing, think through these:

  1. Load Generation
    • How will you generate equivalent load for all three models? (same requests, same timing)
    • How will you measure “started” vs “completed” for each?
    • Should you use blocking or non-blocking requests?
  2. Thread-Per-Request Model
    • How will you handle 1000 threads on a typical system? (each needs ~2MB stack)
    • When do you stop creating threads?
    • How do you gracefully shut down all 1000 threads?
  3. Thread Pool Model
    • How big should the worker pool be? (cores? oversubscribed?)
    • What is the queue capacity? (bounded or unbounded?)
    • What happens when the queue fills? (block, reject, dynamic resize?)
  4. Event-Driven Model
    • How do you handle 1000 non-blocking sockets?
    • How do you detect client disconnect in epoll?
    • How do you ensure no work is starved?
  5. Measurement
    • How will you measure throughput without skewing results?
    • How will you get accurate latency percentiles (p50, p99)?
    • How will you measure memory usage (RSS, VSZ)?

Thinking Exercise

Draw Three Concurrency Models

Before coding, draw or write out:

Model 1: Thread-Per-Request
┌─────────────────────────────────────┐
│ Accepting Thread                    │
├─────────────────────────────────────┤
│ accept() -> spawn pthread_create()  │
│     ↓                               │
│ [Worker Thread 1] [Worker 2] [3]... │
│     ↓                  ↓             │
│  [Socket]         [Socket]          │
└─────────────────────────────────────┘

Model 2: Thread Pool
┌─────────────────────────────────────┐
│  [Worker 1] [Worker 2] ... [Worker N]
│       ↓          ↓            ↓      │
│   ┌──────────────────────┐           │
│   │   Work Queue         │           │
│   │  (Bounded, Locked)   │           │
│   └──────────────────────┘           │
│            ↑                         │
│   Acceptor Thread (feeds queue)      │
└─────────────────────────────────────┘

Model 3: Event-Driven
┌─────────────────────────────────────┐
│   Reactor Thread                    │
│   ┌──────────────────────────────┐  │
│   │ epoll_wait() with timeout    │  │
│   │                              │  │
│   │ for each ready socket:       │  │
│   │   - read/write (non-block)   │  │
│   │   - schedule callback        │  │
│   └──────────────────────────────┘  │
│   Handles 1000s of sockets          │
└─────────────────────────────────────┘

Questions while thinking:

  • In Model 1, if you create 1000 threads, what’s the total memory cost? (1000 × 2MB = 2GB!)
  • In Model 2, what happens if 1000 requests arrive but only 16 workers exist? (queue grows)
  • In Model 3, how does epoll know when a socket is ready without polling?

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “Why would you choose threads over multiplexing for a web server?”
  2. “What’s the memory overhead of creating 1000 threads vs a thread pool of 16?”
  3. “How do you handle client disconnects in an epoll reactor?”
  4. “What does ‘thundering herd’ mean, and when is it a problem?”
  5. “Explain the difference between select, poll, and epoll to someone who’s never used them.”
  6. “If you have 1000 concurrent connections but only 16 threads, how do you prevent starvation?”

Hints in Layers

Hint 1: Start with Thread-Per-Request Build the simplest model first: accept a connection, spawn a thread, let thread handle it. Measure how many threads crash the system. This is your baseline.

Hint 2: Implement Thread Pool with Bounded Queue Use a pthread_mutex_t to protect a linked list (or array) that acts as work queue. Worker threads call pthread_cond_wait() and wake when work arrives. Measure the throughput improvement.
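
A minimal bounded queue in that style (names like QCAP and task_t are our own; link with -pthread):

#include <pthread.h>

#define QCAP 64
typedef struct { void (*fn)(void *); void *arg; } task_t;

static task_t tasks[QCAP];
static int head, tail, count;
static pthread_mutex_t qlock     = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;

/* Producer: called by the acceptor thread. Blocks while the queue is full. */
void queue_push(task_t t) {
    pthread_mutex_lock(&qlock);
    while (count == QCAP)                 /* while, not if: spurious wakeups */
        pthread_cond_wait(&not_full, &qlock);
    tasks[tail] = t;
    tail = (tail + 1) % QCAP;
    count++;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&qlock);
}

/* Consumer: each worker thread loops on this. */
task_t queue_pop(void) {
    pthread_mutex_lock(&qlock);
    while (count == 0)
        pthread_cond_wait(&not_empty, &qlock);
    task_t t = tasks[head];
    head = (head + 1) % QCAP;
    count--;
    pthread_cond_signal(&not_full);
    pthread_mutex_unlock(&qlock);
    return t;
}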

Hint 3: Implement Epoll Reactor Single thread calls epoll_wait() in a loop. For each ready socket, perform one non-blocking read/write. If you want async work, queue it to a thread pool. Measure the throughput again.
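
The reactor skeleton, sketched with level-triggered epoll (listenfd assumed already set up; handle_ready is a hypothetical one-step non-blocking handler):

#include <sys/epoll.h>
#include <sys/socket.h>

void handle_ready(int epfd, int fd);   /* hypothetical: one non-blocking step */

void reactor(int listenfd) {
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listenfd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, listenfd, &ev);

    struct epoll_event ready[64];
    for (;;) {
        int n = epoll_wait(epfd, ready, 64, -1);        /* block until work */
        for (int i = 0; i < n; i++) {
            int fd = ready[i].data.fd;
            if (fd == listenfd) {                       /* new connection */
                int connfd = accept(listenfd, NULL, NULL);
                /* set O_NONBLOCK on connfd (fcntl) before registering it */
                struct epoll_event cev = { .events = EPOLLIN, .data.fd = connfd };
                epoll_ctl(epfd, EPOLL_CTL_ADD, connfd, &cev);
            } else {
                handle_ready(epfd, fd);
            }
        }
    }
}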

Hint 4: Add Measurement and Comparison For each model, track: requests/sec, latency (min/avg/max/p99), memory usage. Print a table side-by-side. Explain why the winner wins.

Books That Will Help

Topic Book Chapter
Pthreads Basics APUE (3e) Ch. 11
Mutex and Condition Variables APUE (3e) Ch. 11.6
Thread Pools Java Concurrency in Practice Ch. 8
Select/Poll/Epoll CS:APP 3e Ch. 12, Sec. 12.2
Synchronization Pitfalls CS:APP 3e Ch. 12, Sec. 12.3
Performance Analysis Systems Performance Ch. 1-3 - Brendan Gregg

Common Pitfalls and Debugging

Problem 1: “Threads created faster than they finish; memory explodes”

  • Why: Thread-per-request model with no limit; each thread takes 1-2MB stack
  • Fix: Add a cap on concurrent threads; reject connections if exceeded; measure memory growth
  • Quick test: Run with 10,000 clients; watch top to see RSS grow; kill program before OOM

Problem 2: “Thread pool workers deadlock or hang”

  • Why: Incorrect lock ordering or forgotten pthread_cond_signal() on shutdown
  • Fix: Use Helgrind to detect lock issues; add explicit shutdown signal to all workers
  • Quick test: Kill benchmark with Ctrl+C; run with Helgrind; check for “possible data race”

Problem 3: “Epoll doesn’t detect all ready connections”

  • Why: Incorrect understanding of edge-triggered vs level-triggered mode
  • Fix: Ensure using level-triggered (default); handle EAGAIN on non-blocking read
  • Quick test: Send data to socket; verify epoll fires; send more data while previous still buffered

Problem 4: “Latency measurements are wrong (missing high percentiles)”

  • Why: Using naive average instead of tracking percentiles; missing outliers
  • Fix: Store all latencies in array; sort and compute p99 = array[0.99 * n]
  • Quick test: Add 10 random 5-second delays; see if your avg hides them but p99 catches them

Definition of Done

  • Thread-Per-Request model runs and measures throughput/latency/memory
  • Thread Pool model with 16 workers handles same load
  • Event-driven epoll model runs and measures same metrics
  • Comparison table clearly shows throughput and memory tradeoffs
  • No race conditions detected by Helgrind on all three models
  • Graceful shutdown for all three models (no hanging threads)
  • Latency includes p50, p99, max (not just average)
  • Load generator creates equivalent work for all models
  • Document why each model wins/loses for your workload
  • Code compiles with -Wall -Wextra -Werror and passes AddressSanitizer

Project 17: CS:APP Capstone — Secure, Observable, High-Performance Proxy

  • File: CSAPP_CAPSTONE_PROXY.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Go, Rust
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 4: Expert
  • Knowledge Area: Networking / Concurrency / Caching / Debugging
  • Software or Tool: Sockets / Pthreads / Valgrind / GDB
  • Main Book: CS:APP 3e (Multiple chapters)

What you’ll build: A fully functional HTTP proxy that sits between clients and origin servers, forwarding requests, caching responses, and handling concurrent connections with zero data loss.

Why it teaches systems programming: This is the capstone because it forces you to integrate everything: sockets (Ch. 11), processes/threads (Ch. 8-12), memory management (Ch. 9), I/O robustness (Ch. 10), and debugging. You can’t hack your way through this; you need to understand every layer.

Core challenges you’ll face:

  • Robust I/O and EINTR handling → maps to short counts, signals, buffering
  • Thread-safe caching without deadlocks → maps to synchronization, lock ordering
  • HTTP protocol parsing → maps to protocol layering, state machines
  • Finding silent bugs with only logs → maps to debugging skills, Valgrind, GDB
  • Production concerns (graceful shutdown, resource limits, observability) → maps to real-world engineering

Key Concepts:

  • Socket Programming: CS:APP Ch. 11, Sec. 11.4-11.5
  • Thread Synchronization: CS:APP Ch. 12, Sec. 12.3
  • LRU Cache Implementation: Designing Data-Intensive Applications - Kleppmann
  • HTTP Request/Response Format: RFC 7230 (HTTP/1.1)
  • Parser State Machines: Crafting Interpreters - Nystrom (Ch. 4-5)
  • Production Observability: The Art of Monitoring - Turnbull
  • Debugging Real Systems: The Art of Debugging - Matloff & Salzman

Difficulty: Expert. Time estimate: 3-4 weeks. Prerequisites: You MUST have completed Projects 1-16. You need to understand sockets, threading, signals, memory management, and at minimum one full debugging workflow with gdb/valgrind. This is not for beginners.

Real World Outcome

You’ll have a proxy server that can handle real HTTP traffic, cache responses, and print detailed logs of everything happening inside.

$ ./proxy 8080 --cache-size=100MB --workers=8 --verbose

[STARTUP] Proxy listening on 127.0.0.1:8080
[STARTUP] Cache initialized (100 MB, LRU)
[STARTUP] Worker pool: 8 threads
[STARTUP] Ready to accept connections

$ curl -v http://localhost:8080/index.html -H "Host: example.com"

--- Client connects ---
[CONN] Accept from 127.0.0.1:54321
[PARSE] GET /index.html HTTP/1.1
[PARSE] Host: example.com
[PARSE] Connection: close
[CACHE] Check cache for example.com:index.html... MISS
[UPSTREAM] Connecting to example.com:80...
[UPSTREAM] Connected. Sending request...
[UPSTREAM] Reading response (200 OK, 15KB)
[CACHE] Storing in cache (TTL: 3600s)
[CLIENT] Sending 200 OK (15KB) back to client
[CONN] Client disconnected

$ curl -v http://localhost:8080/index.html -H "Host: example.com"   (2nd request, same URL)

[CONN] Accept from 127.0.0.1:54322
[PARSE] GET /index.html HTTP/1.1
[CACHE] Check cache for example.com:index.html... HIT
[CLIENT] Sending cached 200 OK (15KB) back to client
[STATS] Cache hit rate: 100%
[CONN] Client disconnected

--- Shutdown ---
$ kill -TERM $(pgrep proxy)
[SIGNAL] Received SIGTERM, graceful shutdown...
[SHUTDOWN] Waiting for 2 in-flight requests...
[SHUTDOWN] Closed all sockets
[SHUTDOWN] Cache flushed to disk
[SHUTDOWN] Exited cleanly

The Core Question You’re Answering

“Can I build a real system that handles concurrent requests, caches efficiently, parses HTTP correctly, and recovers gracefully from failures?”

This is the litmus test: can you apply everything from CS:APP to build something that actually works? You’ll discover that real systems need robust I/O, careful synchronization, observability, and persistence. You’ll also discover how hard it is to debug concurrency bugs in production.

Concepts You Must Understand First

Stop and research these before coding:

  1. HTTP Protocol Basics
    • What are the required headers in an HTTP/1.1 request? (Host, Connection, etc.)
    • How do you parse a request line “GET /path HTTP/1.1”?
    • What is the difference between “Connection: keep-alive” and “Connection: close”?
    • Book Reference: RFC 7230 (IETF standard)
  2. Socket Programming at Scale
    • How do you handle short reads/writes? (EINTR, small buffers)
    • How do you gracefully close both client and server connections?
    • How do you prevent resource exhaustion (too many open FDs)?
    • Book Reference: CS:APP 3e - Ch. 11, Sec. 11.4-11.5 + APUE - Ch. 16
  3. Thread-Safe Caching with LRU
    • How does an LRU cache decide which entry to evict?
    • How do you protect concurrent reads/writes without serializing all requests?
    • How do you handle cache invalidation?
    • Book Reference: Designing Data-Intensive Applications - Kleppmann - Ch. 3-4
  4. HTTP State Machine
    • What states does an HTTP request go through? (parsing, routing, sending)
    • How do you handle partial requests (request arrives in chunks)?
    • What is the correct order of reading headers, body, etc.?
    • Book Reference: Crafting Interpreters - Nystrom - Ch. 4-5
  5. Signal-Safe Shutdown
    • How do you gracefully shut down threads without data loss?
    • How do you catch SIGTERM and finish in-flight requests before exiting?
    • Book Reference: CS:APP 3e - Ch. 8, Sec. 8.6 + APUE - Ch. 10
  6. Debugging Concurrency with Valgrind
    • How do you use Helgrind to find race conditions?
    • How do you read Valgrind output to find the smoking gun?
    • Book Reference: Valgrind User Manual (Helgrind chapter)

Questions to Guide Your Design

Before implementing, think through these:

  1. Request Parsing
    • What’s the smallest valid HTTP request? (3 lines: request line + Host header + blank line)
    • How will you detect the end of headers? (a blank line, i.e. \r\n\r\n; see the header-reading sketch after this list)
    • What if a header spans multiple read() calls?
  2. Upstream Connection
    • Should you reuse upstream connections (keep-alive) or close after each response?
    • How do you handle upstream server slowness (buffering on your side)?
    • What if upstream closes unexpectedly mid-response?
  3. Caching Strategy
    • Which requests should you cache? (GET yes, POST no)
    • Which responses should you cache? (200 yes, 500 no)
    • How do you handle Expires/Cache-Control headers?
    • What’s your cache eviction policy? (LRU? LFU?)
  4. Concurrency Model
    • Thread pool or epoll reactor?
    • Should each thread handle one client or multiple?
    • How big should your thread pool be?
  5. Observability
    • What should you log? (all requests? just errors?)
    • Can you add per-request IDs to trace requests through logs?
    • How do you measure cache hit rate?
  6. Graceful Shutdown
    • How do you signal threads to stop accepting new work?
    • How do you wait for in-flight requests to finish?
    • How do you flush cache to disk?
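
A minimal header-reading sketch under blocking I/O: it accumulates bytes until the \r\n\r\n marker, surviving short reads and EINTR. The MAXHDR cap and fixed buffer are illustrative simplifications, and a real proxy must also keep any body bytes that arrive after the marker:

#include <string.h>
#include <unistd.h>
#include <errno.h>

#define MAXHDR 8192

ssize_t read_headers(int fd, char buf[MAXHDR]) {
    size_t used = 0;
    while (used < MAXHDR - 1) {
        ssize_t n = read(fd, buf + used, MAXHDR - 1 - used);
        if (n < 0 && errno == EINTR) continue;   /* interrupted: just retry */
        if (n <= 0) return -1;                   /* error or premature EOF */
        used += (size_t)n;
        buf[used] = '\0';
        if (strstr(buf, "\r\n\r\n"))             /* blank line: headers done */
            return (ssize_t)used;
    }
    return -1;                                   /* headers exceed the cap */
}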

Thinking Exercise

Design the Thread Model

Before coding, sketch out the flow:

┌─────────────────────────────────────┐
│  Acceptor Thread                    │
│  accept() → hand to worker pool     │
└──────────────┬──────────────────────┘
               │
               v
    ┌──────────────────────────┐
    │  Worker Thread (×8)      │
    │  1. Read HTTP request    │
    │  2. Parse request line   │
    │  3. Check cache          │
    │  4. If miss: upstream    │
    │  5. Cache response       │
    │  6. Send to client       │
    └──────────────────────────┘
               │
    ┌──────────▼──────────┐
    │  Shared LRU Cache   │
    │  (mutex protected)  │
    └─────────────────────┘

Questions while thinking:

  • Who creates/destroys upstream connections?
  • Who decides whether to cache (all threads, or a caching thread)?
  • What if two threads request the same URL simultaneously?

The Interview Questions They’ll Ask

Prepare to answer these (this is where you get hired):

  1. “Walk me through how your proxy handles a request from start to finish.”
  2. “What happens if a client closes the connection mid-request?”
  3. “How would you debug a race condition in your cache without stopping the system?”
  4. “What’s the memory usage of your proxy with 1000 concurrent connections?”
  5. “How do you handle HTTP keep-alive? Is it worth the complexity?”
  6. “If your cache is full, how do you decide what to evict?”
  7. “Your logs show intermittent ‘Connection reset by peer’ errors. How do you debug this?”
  8. “Explain how you handle EINTR from signals. What can go wrong?”

Hints in Layers

Hint 1: Start with a Minimal Proxy Parse a single request, forward it upstream, send response back. Hardcode one upstream server. Get this working with full EINTR handling before adding concurrency. Test with curl.

Hint 2: Add Thread Pool Spawn 4 worker threads. Each waits on a queue of accepted connections. Use pthread_cond_wait() to sleep when queue is empty. Verify no hanging threads on shutdown.
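
A sketch of that worker loop; q_lock, q_nonempty, shutdown_flag, and the queue helpers are assumed shared state, and the while (not if) around the wait is what absorbs spurious wakeups:

#include <pthread.h>

extern pthread_mutex_t q_lock;          /* assumed shared state */
extern pthread_cond_t  q_nonempty;
extern int shutdown_flag;
int  queue_empty(void);                 /* hypothetical queue helpers */
int  queue_pop(void);
void handle_connection(int connfd);

void *worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&q_lock);
        while (queue_empty() && !shutdown_flag)
            pthread_cond_wait(&q_nonempty, &q_lock);   /* sleep until work */
        if (shutdown_flag && queue_empty()) {
            pthread_mutex_unlock(&q_lock);
            return NULL;                               /* clean exit path */
        }
        int connfd = queue_pop();
        pthread_mutex_unlock(&q_lock);
        handle_connection(connfd);      /* do the slow work outside the lock */
    }
}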

Hint 3: Implement LRU Cache Use a linked list (or array with separate metadata) to track insertion order. Protect with a single mutex. Cache GET responses by (host, path) key. Test cache hits by requesting same URL twice and checking if upstream request is made.
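
A sketch of the LRU bookkeeping, assuming a doubly linked list with the most recent entry at the head; names are illustrative, and a real cache pairs this with a hash map for O(1) lookup:

#include <stddef.h>

typedef struct entry {
    char *key;                    /* e.g. "example.com:/index.html" */
    char *body; size_t len;       /* cached response bytes */
    struct entry *prev, *next;
} entry_t;

static entry_t *head, *tail;      /* both protected by one cache mutex */

static void move_to_front(entry_t *e) {   /* call on every cache hit */
    if (e == head) return;
    e->prev->next = e->next;              /* unlink (e != head, so prev exists) */
    if (e->next) e->next->prev = e->prev; else tail = e->prev;
    e->prev = NULL; e->next = head;       /* relink at head */
    head->prev = e; head = e;
}
/* Eviction under capacity pressure: free entries starting from `tail`. */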

Hint 4: Add Observability and Debugging Log every significant action: accept, parse, cache hit/miss, upstream connect, upstream recv, client send, disconnect. Run with Helgrind. Fix any race conditions detected. Use this to find bugs: “Oh, I see the log shows cache was checked but miss wasn’t counted!”

Hint 5: Handle Real Edge Cases

  • Partial reads (request arrives in 3 chunks)
  • Slow upstream (response arrives slowly)
  • Signals during read/write
  • Client disconnect during upstream read
  • Cache eviction under load

Hint 6: Graceful Shutdown On SIGTERM: stop accepting, wait for in-flight requests to finish, close upstream connections, flush cache, exit. Use a shared shutdown_flag and condition variable to wake all threads.
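
One sketch of that shutdown path: block SIGTERM in every thread (pthread_sigmask in main, before spawning workers), then let a dedicated thread sigwait for it. This sidesteps async-signal-safety entirely because no handler ever runs; the shared names match the worker-loop sketch above:

#include <signal.h>
#include <pthread.h>

extern pthread_mutex_t q_lock;           /* same shared state as the workers */
extern pthread_cond_t  q_nonempty;
extern int shutdown_flag;

void *signal_thread(void *arg) {
    (void)arg;
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGTERM);
    int sig;
    sigwait(&set, &sig);                 /* sleeps here until SIGTERM arrives */
    pthread_mutex_lock(&q_lock);
    shutdown_flag = 1;                   /* workers stop taking new work */
    pthread_cond_broadcast(&q_nonempty); /* wake every sleeping worker */
    pthread_mutex_unlock(&q_lock);
    return NULL;
}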

Books That Will Help

Topic Book Chapter
HTTP Protocol RFC 7230 (IETF Standard) Sections 3-4
Socket Programming CS:APP 3e Ch. 11, Sec. 11.4-11.5
Network I/O APUE (3e) Ch. 16
Threading CS:APP 3e Ch. 12, Sec. 12.1-12.3
Signal Handling CS:APP 3e Ch. 8, Sec. 8.6
Cache Design Designing Data-Intensive Apps Ch. 3
Parser Design Crafting Interpreters Ch. 4-5
Debugging Concurrency The Art of Debugging (Matloff & Salzman) Ch. 6-7
Production Practices The Art of Monitoring (Turnbull) Ch. 1-3

Common Pitfalls and Debugging

Problem 1: “Requests mysteriously lost or truncated”

  • Why: Short counts from read/write not handled; buffer boundary issues
  • Fix: Always loop until full read/write or EOF; handle EINTR explicitly
  • Quick test: Send 100-byte request in 10-byte chunks; verify all received
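
A sketch of the write-side loop; the read side is symmetric, and the book’s rio_writen embodies the same idea:

#include <unistd.h>
#include <errno.h>

ssize_t write_all(int fd, const char *buf, size_t n) {
    size_t left = n;
    while (left > 0) {
        ssize_t w = write(fd, buf, left);
        if (w < 0) {
            if (errno == EINTR) continue;   /* signal: retry, not an error */
            return -1;                      /* real failure; inspect errno */
        }
        left -= (size_t)w;                  /* short write: advance and loop */
        buf  += w;
    }
    return (ssize_t)n;
}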

Problem 2: “Cache causes random crashes or hangs”

  • Why: Lock ordering, use-after-free in cache eviction, or spurious wakeup
  • Fix: Run with Helgrind (valgrind --tool=helgrind); trace lock acquisitions
  • Quick test: Drive the proxy with 100 concurrent requests and kill it mid-flight; re-run under Helgrind; check for data races

Problem 3: “Upstream connection closes but proxy keeps trying to send”

  • Why: Connection close not detected; didn’t check for EPIPE or closed socket
  • Fix: Check for EPIPE when writing; handle closed sockets gracefully
  • Quick test: Kill upstream server; send requests through proxy; verify error handling
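
A sketch of both halves of that fix; send_or_drop is a hypothetical wrapper (compose it with the write_all loop from Problem 1 to also cover short writes):

#include <signal.h>
#include <unistd.h>
#include <errno.h>

void install_sigpipe_guard(void) {
    signal(SIGPIPE, SIG_IGN);   /* write() now fails with EPIPE instead of killing us */
}

int send_or_drop(int fd, const char *buf, size_t n) {
    if (write(fd, buf, n) < 0 && errno == EPIPE) {
        close(fd);              /* peer is gone: clean up, don't crash */
        return -1;
    }
    return 0;
}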

Problem 4: “Cascading failures under high load”

  • Why: Resource exhaustion (too many open files, thread pool saturated)
  • Fix: Add resource limits; monitor lsof output; add backpressure
  • Quick test: Send 10,000 concurrent requests; watch /proc/self/fd; ensure it doesn’t hit ulimit

Problem 5: “Cache hit rate is suspiciously low”

  • Why: Cache key collision or TTL set too short
  • Fix: Log cache operations; verify hit/miss statistics; check if requests actually match
  • Quick test: Request same URL 100 times; trace that it hits cache

Problem 6: “Helgrind reports race conditions but code looks correct”

  • Why: Subtle lock ordering issue or missed synchronization
  • Fix: Draw a lock dependency diagram; ensure no circular waits; check for unlocked reads
  • Quick test: Add explicit pthread_mutex_lock around suspected operations; re-run Helgrind

Definition of Done

  • Parses HTTP requests (request line + headers) correctly
  • Forwards requests to upstream server (HTTP/1.1)
  • Receives responses and sends back to client
  • Handles concurrent clients with thread pool (no crashes, no deadlocks)
  • LRU cache works: stores responses, serves from cache, evicts on capacity
  • Handles all EINTR and short read/write cases
  • Graceful shutdown: stops accepting, finishes in-flight, exits cleanly
  • No data loss: tested with intentional signal injection
  • No race conditions: Helgrind clean
  • Observability: logs show clear request flow with timestamps
  • Survives stress test: 1000 concurrent requests, measured latency/throughput
  • Code compiles with -Wall -Wextra -Werror -fsanitize=address,undefined
  • Documented: README explains design choices and tradeoffs

Phase 2: Advanced Extensions


Project 18: ELF Linker and Loader

Real World Outcome

You will build myld, a linker that takes two .o files and produces a working executable.

$ ./myld main.o utils.o -o myprog
[LINKER] Merging .text sections... (Size: 1024 bytes)
[LINKER] Merging .data sections... (Size: 64 bytes)
[LINKER] Resolving symbol 'helper_func'... Found in utils.o at offset 0x20
[LINKER] Relocating call at 0x400105 -> 0x400220
[LOADER] Output written to 'myprog'.
$ ./myprog
Hello from linked code!

The Core Question You’re Answering

How does “Undefined Reference” become a valid memory address?

Concepts You Must Understand First

  • ELF Headers (CS:APP 7.4) - Sections vs Segments.
  • Relocation Entries (CS:APP 7.7) - R_X86_64_PC32 vs R_X86_64_64.
  • Symbol Resolution (CS:APP 7.6) - Strong vs Weak symbols.

Questions to Guide Your Design

  1. How do you merge two .text sections into one? (Concatenation).
  2. If you move utils.o’s code to follow main.o’s code, what happens to the addresses of functions in utils.o?
  3. What is the math formula for a PC-relative relocation? (rel32 = target_addr - reloc_addr - 4, i.e. S + A - P with addend A = -4; see the sketch below.)
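
A sketch of applying that formula to one R_X86_64_PC32 relocation; buffer and variable names are illustrative:

#include <stdint.h>
#include <string.h>

/* text     = the merged .text buffer
   r_offset = where the 4-byte rel32 field sits inside .text
   sym_addr = resolved address of the target symbol (S)
   base     = runtime address of .text (so P = base + r_offset) */
void apply_pc32(uint8_t *text, uint64_t r_offset,
                uint64_t sym_addr, uint64_t base) {
    int64_t addend = -4;        /* A: account for the rel32 field itself */
    int32_t value = (int32_t)(sym_addr + addend - (base + r_offset));
    memcpy(text + r_offset, &value, sizeof(value));   /* patch in place */
}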

Thinking Exercise

Draw two object files. A.o calls B. B.o defines B. When you put them together, B is no longer at offset 0. How do you calculate the number to put in the call instruction in A?

Definition of Done

  • Can parse 64-bit ELF headers.
  • Can merge .text and .data from two input files.
  • Can process R_X86_64_PC32 relocations to fix function calls.
  • Produced binary runs on Linux.

Project 19: Virtual Memory Simulator

Real World Outcome

A CLI vmsim that replays memory traces and simulates page faults.

$ ./vmsim trace.txt --phys-mem=4 --algo=LRU
[ACCESS] R 0x1234 (VPN 1) -> MISS (Cold). Loaded to PPN 0.
[ACCESS] W 0x2345 (VPN 2) -> MISS (Cold). Loaded to PPN 1.
...
[ACCESS] R 0x9999 (VPN 9) -> MISS (Capacity). Evicting VPN 1 (LRU).
[STATS] Hits: 85, Misses: 15, Hit Rate: 85%

The Core Question You’re Answering

How does the OS choose which page to throw out when RAM is full, and how much does it matter?

Concepts You Must Understand First

  • Paging & Page Tables (Primer Ch 6).
  • Replacement Policies (CS:APP 9.6) - LRU, FIFO, Clock.
  • Locality (Primer Ch 5).

Questions to Guide Your Design

  1. How do you split an address into VPN (Virtual Page Number) and Offset? (See the sketch after these questions.)
  2. How do you track “Least Recently Used”? (Counter? Queue? Approx bit?).
  3. What is the “working set” of a trace?
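
A sketch of the address split, assuming 4 KiB pages (12 offset bits):

#include <stdint.h>

#define PAGE_BITS 12
#define PAGE_SIZE (1u << PAGE_BITS)               /* 4096 bytes */

uint64_t vpn(uint64_t addr)    { return addr >> PAGE_BITS; }
uint64_t offset(uint64_t addr) { return addr & (PAGE_SIZE - 1); }
/* 0x1234 -> VPN 1, offset 0x234; 0x9999 -> VPN 9, matching the trace above. */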

Definition of Done

  • Implements multi-level page table walk logic.
  • Implements FIFO and LRU replacement.
  • Correctly counts hits, compulsory misses, and capacity misses.

Project 20: HTTP Web Server

Real World Outcome

A web server tiny that serves files and runs scripts.

$ ./tiny 8080
[LISTEN] Waiting on port 8080...
[CONN] Accepted 127.0.0.1:54321
[REQ] GET /index.html
[RESP] 200 OK (150 bytes)
[CONN] Accepted 127.0.0.1:54322
[REQ] GET /cgi-bin/adder?a=1&b=2
[CGI] Forking child...
[RESP] 200 OK (Dynamic Content)

The Core Question You’re Answering

How do network packets become “web pages” via the socket API?

Concepts You Must Understand First

  • Sockets (CS:APP 11.4).
  • HTTP Protocol (CS:APP 11.5).
  • Process Control (CS:APP 8.4) - fork/exec.
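
A sketch of the CGI path, assuming the request is already parsed: filename and cgiargs are hypothetical outputs of your URI parser, and a real server writes the status line and headers before the exec (the book’s Tiny server in Sec. 11.6 shows the full version):

#include <unistd.h>
#include <sys/wait.h>
#include <stdlib.h>

extern char **environ;

void serve_cgi(int connfd, char *filename, char *cgiargs) {
    if (fork() == 0) {                 /* child */
        setenv("QUERY_STRING", cgiargs, 1);
        dup2(connfd, STDOUT_FILENO);   /* CGI output goes straight to client */
        char *argv[] = { filename, NULL };
        execve(filename, argv, environ);
        _exit(1);                      /* only reached if exec fails */
    }
    wait(NULL);                        /* parent reaps the child */
}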

Definition of Done

  • Serves static files (HTML, JPG).
  • Handles CGI (Dynamic content via child processes).
  • Handles errors (404, 500) gracefully.

Project 21: Thread Pool Implementation

(See expanded guide in repo for full code and details)

What you’ll build: A reusable thread pool library with a bounded work queue. Why it matters: Producer-consumer pattern mastery.

Real World Outcome

$ ./threadpool --workers=4
[INIT] 4 workers started.
[SUBMIT] Task 1..100
[WORKER-0] Processed Task 1
...
[SHUTDOWN] Graceful exit.

The Core Question You’re Answering

How do you safely coordinate multiple threads sharing a queue?

Definition of Done

  • No race conditions (Helgrind clean).
  • Clean shutdown (no zombie threads).
  • Handles spurious wakeups correctly.

Project 22: Signal-Safe Printf

(See expanded guide in repo for full code and details)

What you’ll build: A printf that works inside signal handlers. Why it matters: Understanding reentrancy and async-safety.

Real World Outcome

$ ./sio_test
[HANDLER] Caught signal 10. Safe output only!

The Core Question You’re Answering

Why does printf deadlock in a signal handler, and how do we fix it?
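
It deadlocks because printf takes an internal stdio lock and may call malloc; if the signal interrupted printf mid-call, re-entering it waits on a lock the interrupted code already holds. A sketch in the spirit of the book’s SIO package (CS:APP Sec. 8.5): only write, no locks, no malloc:

#include <unistd.h>
#include <stddef.h>

static size_t sio_strlen(const char *s) {   /* avoid library calls entirely */
    size_t n = 0;
    while (s[n]) n++;
    return n;
}

void sio_puts(const char *s) { write(STDOUT_FILENO, s, sio_strlen(s)); }

void sio_putl(long v) {                     /* decimal, handles negatives */
    char buf[24];
    int i = sizeof(buf);
    unsigned long u = (v < 0) ? -(unsigned long)v : (unsigned long)v;
    do { buf[--i] = '0' + (u % 10); u /= 10; } while (u);
    if (v < 0) buf[--i] = '-';
    write(STDOUT_FILENO, buf + i, sizeof(buf) - i);
}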

Definition of Done

  • Uses only write syscall.
  • No malloc, no global locks.
  • Formats integers and strings.

Project 23: Performance Profiler

(See expanded guide in repo for full code and details)

What you’ll build: A sampling profiler using signals. Why it matters: Understanding how gprof/perf actually work.

Real World Outcome

$ ./profiler ./myprog
[STATS] Total samples: 1000
[HOT] 40% - matrix_multiply
[HOT] 30% - vector_add

The Core Question You’re Answering

How can I measure code execution without modifying the code?

Definition of Done

  • Uses setitimer for periodic interrupts.
  • Captures rip (instruction pointer).
  • Resolves addresses to symbols (if possible).
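
A Linux/x86-64 sketch of the sampling core; REG_RIP requires _GNU_SOURCE, and record_sample is a hypothetical async-signal-safe recorder (e.g., bump a slot in a preallocated array):

#define _GNU_SOURCE
#include <signal.h>
#include <sys/time.h>
#include <ucontext.h>

void record_sample(void *pc);    /* assumed: async-signal-safe */

static void on_prof(int sig, siginfo_t *si, void *ucv) {
    (void)sig; (void)si;
    ucontext_t *uc = ucv;
    record_sample((void *)uc->uc_mcontext.gregs[REG_RIP]);  /* interrupted PC */
}

void start_sampling(void) {
    struct sigaction sa = {0};
    sa.sa_sigaction = on_prof;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGPROF, &sa, NULL);
    struct itimerval it = { .it_interval = {0, 1000}, .it_value = {0, 1000} };
    setitimer(ITIMER_PROF, &it, NULL);   /* fire every 1 ms of CPU time */
}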

Project 24: Memory Leak Detector

Real World Outcome

A shared library you inject into any C program to find leaks.

$ LD_PRELOAD=./libleakcheck.so ./buggy_app
[APP] App running...
[APP] App exiting...
================ LEAK REPORT =================
[LEAK] 1024 bytes at 0x7f... allocated by:
       main.c:25 (main)
       utils.c:10 (helper)
[SUMMARY] 1 leak, 1024 bytes lost.

The Core Question You’re Answering

How can I track every malloc/free without recompiling the target application?

Concepts You Must Understand First

  • Dynamic Interposition (CS:APP 7.13) - LD_PRELOAD.
  • Function Wrappers (dlsym, RTLD_NEXT).
  • Call Stacks (Primer Ch 3) - Walking the stack frames.

Hints in Layers

  1. Implement a malloc wrapper that just calls the real malloc (via dlsym; see the sketch after these hints).
  2. Add logging to a file.
  3. Use a hash map (or linked list) to store “live” pointers. malloc adds, free removes.
  4. On program exit (__attribute__((destructor))), print whatever is left in the map.
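
A sketch of hints 1-3 combined, built with something like gcc -shared -fPIC leak.c -o libleakcheck.so -ldl; the recursion guard matters because fprintf may itself allocate:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <stddef.h>

static void *(*real_malloc)(size_t);
static __thread int busy;               /* per-thread recursion guard */

void *malloc(size_t size) {
    if (!real_malloc)                   /* lazily find libc's malloc */
        real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");
    void *p = real_malloc(size);
    if (!busy) {
        busy = 1;                       /* block re-entry from fprintf */
        fprintf(stderr, "[malloc] %zu bytes -> %p\n", size, p);
        busy = 0;
    }
    return p;
}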

Definition of Done

  • Intercepts malloc, free, realloc, calloc.
  • Tracks allocation size and location (return address).
  • Reports memory leaked at exit.
  • Does not crash on recursive calls (e.g., when your own logging via printf calls malloc).

Project 25: Debugger (ptrace-based)

Real World Outcome

A minimalist debugger mydb that can stop code and inspect state.

$ ./mydb ./target
(mydb) break 0x401050
(mydb) run
[STOP] Breakpoint hit at 0x401050
(mydb) regs
rax: 0x0
rip: 0x401050
(mydb) step
[STOP] Stepped to 0x401055

The Core Question You’re Answering

How does a debugger control another process and transparently pause it?

Concepts You Must Understand First

  • ptrace syscall (CS:APP 8.5) - PTRACE_TRACEME, PTRACE_PEEKUSER, PTRACE_CONT.
  • Software Breakpoints - The INT 3 (opcode 0xCC) instruction.
  • Signals - SIGTRAP.

Thinking Exercise

To set a breakpoint:

  1. Read the byte at the target address.
  2. Save it.
  3. Write 0xCC (INT 3) there.

To resume:

  4. Write the saved byte back.
  5. Rewind %rip by 1.
  6. Single step.
  7. Write 0xCC back.
  8. Continue.
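
A sketch of steps 1-3, assuming an already-traced, stopped child:

#include <sys/ptrace.h>
#include <sys/types.h>
#include <stdint.h>

long set_breakpoint(pid_t pid, uint64_t addr) {
    long word = ptrace(PTRACE_PEEKTEXT, pid, (void *)addr, NULL); /* steps 1-2 */
    long patched = (word & ~0xffL) | 0xcc;                        /* step 3: INT 3 */
    ptrace(PTRACE_POKETEXT, pid, (void *)addr, (void *)patched);
    return word;            /* caller saves this word for the restore steps */
}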

Definition of Done

  • Can launch a target program.
  • Can set a breakpoint at an address.
  • Can continue and stop at breakpoint.
  • Can read register values.

Project 26: Operating System Kernel Capstone

Real World Outcome

A kernel.iso that you boot in QEMU, printing “Hello World” from your own OS.

The Core Question You’re Answering

What is the absolute minimum software required to manage the hardware?

Concepts You Must Understand First

  • Bare Metal Boot - Multiboot header.
  • VGA Text Mode - Writing bytes to 0xb8000 (see the sketch after this list).
  • GDT/IDT - Global Descriptor Table, Interrupt Descriptor Table.
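
A freestanding sketch of the VGA write; each cell is two bytes (character, then attribute), and 0x0F is white-on-black:

#include <stdint.h>

void vga_print(const char *s) {
    volatile uint16_t *vga = (volatile uint16_t *)0xB8000;
    for (int i = 0; s[i]; i++)
        vga[i] = (uint16_t)(0x0F00 | (uint8_t)s[i]);
}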

Definition of Done

  • Boots via GRUB/QEMU.
  • Sets up a stack.
  • Writes text to the screen.
  • Handles at least one interrupt (keyboard or timer).

(Note: Projects P3-P17 and P20-P23 are documented in the expanded guides linked in the project headers; P21, P22, and P23 in particular have detailed guides available in the full version.)


Project Comparison Table

Project Difficulty Time Depth Fun Factor
P1: Build Pipeline Explorer ★☆☆☆☆ 4-8h Medium ★★★☆☆
P2: Bitwise Data Inspector ★☆☆☆☆ 4-8h Medium ★★★☆☆
P3: Data Lab Clone ★★☆☆☆ 10-15h High ★★★★☆
P4: Crash Cart ★★☆☆☆ 10-15h High ★★★☆☆
P5: Bomb Lab Workflow ★★☆☆☆ 15-20h High ★★★★★
P6: Attack Lab Workflow ★★★☆☆ 20-30h Very High ★★★★★
P7: Y86-64 Simulator ★★★☆☆ 30-40h Very High ★★★★☆
P8: Performance Clinic ★★☆☆☆ 10-15h High ★★★☆☆
P9: Cache Simulator ★★☆☆☆ 15-20h Very High ★★★★☆
P10: ELF Link Map ★★☆☆☆ 10-15h High ★★★☆☆
P11: Signals Sandbox ★★☆☆☆ 10-15h High ★★★☆☆
P12: Unix Shell ★★★☆☆ 25-35h Very High ★★★★★
P13: VM Visualizer ★★☆☆☆ 10-15h Medium ★★★☆☆
P14: Malloc ★★★☆☆ 30-40h Very High ★★★★☆
P15: Robust I/O ★★☆☆☆ 10-15h High ★★★☆☆
P16: Concurrency Workbench ★★★☆☆ 20-30h Very High ★★★★☆
P17: Proxy Server ★★★★☆ 40-60h Comprehensive ★★★★★
P18-P26: Extensions ★★★★-★★★★★ Varies Expert ★★★★★

Recommendation

If you’re new to systems programming: Start with Project 1 (Build Pipeline Explorer) and Project 2 (Bitwise Inspector). These give you quick wins and essential mental models. Then progress through Projects 3-6 to build debugging intuition.

If you’re preparing for interviews: Focus on Project 5 (Bomb Lab) for reverse engineering practice, Project 14 (Malloc) for the classic memory allocator question, and Project 16 (Concurrency) for threading fundamentals. These three cover the most common systems interview topics.

If you’re building production systems: Prioritize Project 12 (Shell) for process management, Project 14 (Malloc) for memory understanding, Project 15 (Robust I/O) for network code, and Project 17 (Proxy) as the capstone that integrates everything.

If you have limited time (3-month sprint): Follow Path 5: P1 → P2 → P4 → P9 → P14 → P17. This covers the essential concepts while producing impressive projects for your portfolio.


Final Overall Project: Production-Grade Systems Toolkit

After completing the core projects, combine your work into a unified systems debugging toolkit:

The Goal

Build a single binary sysinspect that integrates:

  • Build analysis (from P1)
  • Memory inspection (from P2, P13)
  • Process/signal debugging (from P11, P12)
  • Performance profiling (from P8, P23)

Integration Steps

  1. Create a unified CLI with subcommands: sysinspect build, sysinspect memory, etc.
  2. Add a --live mode that attaches to running processes (from P25)
  3. Implement cross-referencing: show how a memory address maps to source code symbols
  4. Add export to JSON/HTML for reports

Success Criteria

  • Single make command builds the toolkit
  • Can analyze your own projects from the sprint
  • Documentation is good enough for another developer to use
  • Handles edge cases gracefully (processes that exit, permission errors, etc.)

From Learning to Production: What’s Next

Your Project Production Equivalent Gap to Fill
P1: Build Pipeline CMake/Bazel/Buck Build caching, dependency graphs
P5: Bomb Lab IDA Pro/Ghidra GUI, decompilation, scripting
P9: Cache Simulator Valgrind’s cachegrind Hardware integration
P12: Unix Shell Bash/Zsh Scripting language, history, completion
P14: Malloc glibc malloc/jemalloc Thread-local caches, arenas
P17: Proxy Nginx/HAProxy Configuration, TLS, load balancing
P24: Leak Detector Valgrind’s memcheck Shadow memory, origin tracking
P25: Debugger GDB/LLDB DWARF parsing, expression evaluation

Continuing Your Journey

  1. Read the Source: Study glibc, Linux kernel, and nginx source code
  2. Contribute to Open Source: Many systems projects welcome contributors
  3. Specialize: Pick one area (security, performance, infrastructure) and go deep
  4. Teach: Writing about systems programming solidifies understanding

Summary

This learning path covers Computer Systems: A Programmer’s Perspective through 26 hands-on projects (17 core + 9 extensions).

# Project Name Main Language Difficulty Time
1 Build Pipeline Explorer C/Shell Level 1 4-8h
2 Bitwise Data Inspector C Level 1 4-8h
3 Data Lab Clone C Level 2 10-15h
4 Calling Convention Crash Cart C/Assembly Level 2 10-15h
5 Bomb Lab Workflow GDB/Assembly Level 2 15-20h
6 Attack Lab Workflow C/Assembly Level 3 20-30h
7 Y86-64 CPU Simulator C Level 3 30-40h
8 Performance Clinic C Level 2 10-15h
9 Cache Lab Simulator C Level 2 15-20h
10 ELF Link Map Toolkit C Level 2 10-15h
11 Signals/Processes Sandbox C Level 2 10-15h
12 Unix Shell with Job Control C Level 3 25-35h
13 VM Map Visualizer C Level 2 10-15h
14 Build Your Own Malloc C Level 3 30-40h
15 Robust Unix I/O Toolkit C Level 2 10-15h
16 Concurrency Workbench C Level 3 20-30h
17 Capstone Proxy Server C Level 4 40-60h
18 ELF Linker/Loader C Level 4 40-60h
19 VM Simulator C Level 3 25-35h
20 HTTP Web Server C Level 3 25-35h
21 Thread Pool C Level 3 20-25h
22 Signal-Safe Printf C Level 3 15-20h
23 Performance Profiler C Level 3 20-25h
24 Memory Leak Detector C Level 3 20-30h
25 ptrace Debugger C Level 4 35-45h
26 OS Kernel Capstone C/Assembly Level 5 60-100h

Expected Outcomes

After completing these projects, you will:

  1. Debug any crash by reading registers, stack traces, and memory dumps
  2. Predict performance by reasoning about cache behavior and memory access patterns
  3. Write secure code by understanding buffer overflows, format strings, and memory safety
  4. Build production systems that handle signals, concurrency, and I/O correctly
  5. Pass systems interviews with deep understanding rather than memorization
  6. Read and contribute to low-level codebases (kernels, compilers, databases)

You’ll have built 26 working projects that demonstrate mastery of systems programming from first principles—from source code to running process and back again.


Additional Resources and References

Primary Textbook

  • Computer Systems: A Programmer’s Perspective, 3rd Edition by Bryant & O’Hallaron
  • Official website with labs: https://csapp.cs.cmu.edu/

Supplementary Books (by topic)

C Programming

  • C Programming: A Modern Approach by K.N. King
  • Effective C by Robert Seacord

Assembly & Low-Level

  • Low-Level Programming by Igor Zhirkov
  • Practical Binary Analysis by Dennis Andriesse
  • Hacking: The Art of Exploitation by Jon Erickson

Operating Systems

  • Operating Systems: Three Easy Pieces (OSTEP) - Free online
  • The Linux Programming Interface by Michael Kerrisk
  • Advanced Programming in the UNIX Environment by Stevens & Rago

Networking

  • Unix Network Programming by W. Richard Stevens
  • Beej’s Guide to Network Programming (free online)

Performance

  • Agner Fog’s Optimization Manuals (free online)
  • Computer Organization and Design by Patterson & Hennessy

Online Resources

Standards and Documentation

  • System V AMD64 ABI: https://refspecs.linuxbase.org/elf/x86_64-abi-0.99.pdf
  • Intel x86-64 Manuals: https://www.intel.com/sdm
  • POSIX Specification: https://pubs.opengroup.org/onlinepubs/9699919799/
  • Linux Man Pages: https://man7.org/linux/man-pages/