CSAPP 3E DEEP LEARNING PROJECTS
CS:APP (3rd Edition): Deep Learning via Buildable Projects
Goal: Transform from a programmer who writes code that "happens to work" into a systems programmer who understands exactly what the machine does with every instruction. By building 17 increasingly sophisticated projects, you will internalize the complete journey from source code to running process: mastering data representation, machine-level execution, memory hierarchy, operating system abstractions, and concurrent programming. When you finish, you will debug crashes by reading registers, optimize code by reasoning about cache lines, and build robust systems that handle real-world failure modes gracefully.
Why Systems Programming Matters
The Hidden Foundation
Every program you write eventually becomes electrons flowing through silicon. Between your high-level code and those electrons lies a vast machinery of translation, optimization, and abstraction that most programmers never see. This invisible infrastructure determines whether your program runs fast or slow, crashes mysteriously or fails gracefully, consumes megabytes or gigabytes of memory.
Consider this scenario: A financial trading system processes millions of transactions per day. One day, after a routine update, trades start failing silently, but only for certain customers, only during peak hours, and only when the system has been running for exactly 47 minutes. The logs show nothing. The unit tests pass. The code review found nothing suspicious.
A programmer without systems knowledge might spend weeks adding more logging, trying random fixes, or blaming the network. A systems programmer recognizes the symptoms immediately: this is a classic memory corruption bug, likely a buffer overflow that only manifests when a specific heap layout occurs after enough allocations. They fire up GDB, examine the heap metadata, trace the corruption back to a string copy that assumed null termination, and fix it in an hour.
The difference is not intelligence; it is knowledge. Systems programming knowledge.
Real-World Impact
WHY SYSTEMS KNOWLEDGE MATTERS

- DEBUGGING: Without systems knowledge, you guess. With it, you diagnose.
- PERFORMANCE: Without systems knowledge, you benchmark randomly. With it, you reason about cache lines and pipelines.
- SECURITY: Without systems knowledge, you follow checklists. With it, you understand attack surfaces.
- ARCHITECTURE: Without systems knowledge, you copy patterns. With it, you design for the machine you have.
The Heartbleed Bug (2014): A missing bounds check in OpenSSL allowed attackers to read arbitrary server memory, exposing passwords, private keys, and session tokens. The bug existed for two years. Understanding buffer management and memory layout would have caught it in code review.
The Mars Climate Orbiter (1999): A $327 million spacecraft was lost because one module used imperial units while another expected metric. Understanding data representation and interface contracts (exactly what Chapter 2 teaches) would have prevented this.
Spectre and Meltdown (2018): These CPU vulnerabilities exploited speculative execution and cache timing to leak privileged memory. Understanding cache behavior and CPU pipelines (Chapters 5 and 6) is essential for both exploiting and mitigating such attacks.
What This Journey Gives You
After completing these projects, you will be able to:
- Read a crash dump and explain exactly what happened: which instruction faulted, what the stack looked like, what memory was corrupted
- Profile code and explain why it is slow: whether it is memory-bound, compute-bound, or suffering from branch mispredictions
- Audit code for security vulnerabilities: recognizing buffer overflows, integer overflows, and use-after-free bugs from code inspection
- Design systems that handle failure gracefully: understanding partial I/O, signal races, and concurrency hazards
- Communicate with compilers, operating systems, and hardware: not as black boxes, but as partners whose behavior you can predict and influence
The Big Picture: How Programs Become Running Processes
Before diving into individual concepts, let us see the complete journey a program takes from source code to execution:
THE PROGRAM EXECUTION PIPELINE

SOURCE CODE: hello.c

    #include <stdio.h>

    int main() {
        printf("Hello\n");
        return 0;
    }

        |
        v
STAGE 1: PREPROCESSING (cpp)
    - Expands #include directives
    - Processes #define macros
    - Handles conditional compilation (#ifdef)
    - Output: hello.i (expanded C source)
        |
        v
STAGE 2: COMPILATION (cc1)
    - Lexical analysis -> tokens
    - Parsing -> AST (Abstract Syntax Tree)
    - Semantic analysis -> type checking
    - Optimization passes
    - Code generation
    - Output: hello.s (assembly source)
        |
        v
STAGE 3: ASSEMBLY (as)
    - Translates assembly to machine code
    - Creates relocatable object file
    - Records symbols and relocations
    - Output: hello.o (ELF object file)
        |
        |   LIBRARIES join here: libc.a / libc.so (printf, malloc, etc.)
        v
STAGE 4: LINKING (ld)
    - Symbol resolution (matches references to definitions)
    - Relocation (assigns final addresses)
    - Static: copies library code into the executable
    - Dynamic: records dependencies for runtime
    - Output: hello (executable ELF file)
        |
        v
STAGE 5: LOADING (execve + ld-linux.so)
    - Kernel reads ELF headers
    - Creates new process address space
    - Maps code and data segments
    - Dynamic linker resolves shared libraries
    - Sets up stack with argc/argv/envp
    - Transfers control to _start -> main()
        |
        v
STAGE 6: EXECUTION
    - CPU fetches, decodes, executes instructions
    - Memory accesses go through the cache hierarchy
    - Virtual addresses translated to physical
    - System calls trap to the kernel
    - Signals may interrupt execution
    - Process terminates, resources cleaned up
Every project in this curriculum touches some part of this pipeline. Project 1 makes the entire pipeline visible. Projects 2-6 focus on data representation and machine code. Projects 7-9 examine the CPU and cache. Projects 10-16 explore the operating system's role. Project 17 integrates everything.
Core Concept Analysis
Think of CS:APP as one story told through eight interconnected concept clusters. Each cluster builds on the previous ones, and mastery requires understanding both the individual concepts and their interactions.
A. Translation & Execution
Book Coverage: Chapters 1, 7
The Central Question: How does human-readable source code become a running process?
TRANSLATION PIPELINE DETAIL

SOURCE (.c)
    | cpp (C Preprocessor): text substitution, #include, #define, #ifdef
    v
PREPROCESSED (.i)
    | cc1 (C Compiler): Lexer -> Parser -> Type check -> Optimize -> Code gen
    v
ASSEMBLY (.s)
    | as (Assembler): machine code + metadata + relocations
    v
OBJECT FILE (.o)  [ELF: header, .text, .data, .bss, .symtab, .rel.*]
    | ld (Linker): symbol resolution + relocation
    v
EXECUTABLE
    | execve() + ld-linux.so (Loader)
    v
RUNNING PROCESS
Key Insights:
- Each stage produces a concrete artifact you can inspect
- Symbol resolution is where "undefined reference" errors occur
- Static linking copies code; dynamic linking defers to runtime
Mastery Test: Can you predict what changes when you switch from static to dynamic linking?
B. Data Representation
Book Coverage: Chapter 2
The Central Question: How does the machine represent information, and what happens at the boundaries?
INTEGER REPRESENTATION
  UNSIGNED (8-bit): 0000 0000 ... 1111 1111  =  0 ... 255
  SIGNED   (8-bit): 1000 0000 ... 0111 1111  =  -128 ... 127

  DANGER ZONES:
  - Signed overflow: undefined behavior!
  - Signed/unsigned comparison: -1 > 0U is TRUE!
FLOATING POINT (IEEE 754)
  32-bit layout: [S | Exponent(8) | Mantissa(23)]
  Value = (-1)^S x 1.Mantissa x 2^(Exp - 127)

  Special values: +/-0, +/-Infinity, NaN
  WARNING: 0.1 + 0.2 != 0.3
BYTE ORDERING
  0x01234567 in memory, lowest address first:
    Little-endian (x86):   67 45 23 01
    Big-endian (network):  01 23 45 67
Mastery Test: Can you predict the output of printf("%d", (int)(unsigned)-1)?
C. Machine-Level Programming
Book Coverage: Chapter 3
The Central Question: How does the compiler translate C into x86-64?
x86-64 REGISTER FILE
  Arguments:      %rdi, %rsi, %rdx, %rcx, %r8, %r9
  Return value:   %rax
  Callee-saved:   %rbx, %rbp, %r12-%r15
  Stack pointer:  %rsp
STACK FRAME LAYOUT
  High addresses: [Caller's frame] [Args 7+] [Return addr] [Saved regs] [Locals]
  Low addresses:  <- %rsp (the stack grows toward lower addresses)
Mastery Test: Given a crash address, can you walk the stack frames?
D. Architecture & Performance
Book Coverage: Chapters 4, 5
PIPELINED CPU
  5 stages: Fetch -> Decode -> Execute -> Memory -> Writeback
  Hazards: data (RAW), control (a misprediction costs ~15-20 cycles)
  Optimizations: ILP, loop unrolling, reducing dependency chains
E. Memory Hierarchy & Virtual Memory
Book Coverage: Chapters 6, 9
MEMORY HIERARCHY
  Registers (~1KB, ~0.25ns) -> L1 (32-64KB, ~1ns) -> L2 (256KB, ~4ns)
  -> L3 (8-32MB, ~12ns) -> DRAM (8-64GB, ~60ns) -> SSD/HDD

  Caches: address split into tag | index | offset; exploit temporal and spatial locality
  Virtual memory: virtual address -> page table -> physical address; the TLB caches translations
F. Exceptional Control Flow & Processes
Book Coverage: Chapter 8
EXCEPTIONS & PROCESSES
  Exception classes: interrupt (async), trap (syscall), fault, abort
  Process lifecycle: fork() -> exec() -> wait() -> exit()
  Signals: SIGINT, SIGTERM, SIGSEGV, SIGCHLD (handlers must be async-signal-safe!)
G. System I/O & Networking
Book Coverage: Chapters 10, 11
I/O & NETWORKING
  Unix I/O: "everything is a file" (FD 0 = stdin, 1 = stdout, 2 = stderr)
  Robust I/O: read()/write() may return short counts; always loop!
  Sockets: socket -> bind -> listen -> accept (server); socket -> connect (client)
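The "always loop" rule looks like this in practice; a short-count-safe write helper in the spirit of CS:APP's RIO package (a sketch, not the book's exact code):

```c
#include <errno.h>
#include <unistd.h>

// Write exactly n bytes or fail: write() may transfer fewer bytes than
// requested on pipes, sockets, and after signal interruptions.
ssize_t write_all(int fd, const void *buf, size_t n) {
    const char *p = buf;
    size_t left = n;
    while (left > 0) {
        ssize_t k = write(fd, p, left);
        if (k < 0) {
            if (errno == EINTR) continue;   // interrupted by a signal: retry
            return -1;                      // real error
        }
        p += k;                             // advance past the bytes written
        left -= k;
    }
    return (ssize_t)n;
}
```

The matching read-side loop is the same shape, with the extra wrinkle that a return of 0 means EOF rather than "try again".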
H. Concurrency
Book Coverage: Chapter 12
CONCURRENCY
  Models: processes (isolated) | threads (shared memory) | I/O multiplexing (single thread)
  Races: counter++ is NOT atomic (the load-add-store can interleave)
  Synchronization: mutexes, semaphores, condition variables
  Deadlock: circular wait; prevent it with a consistent lock ordering
Process Address Space Layout
PROCESS ADDRESS SPACE (Linux x86-64)
  High addresses: kernel | stack (grows down) | mmap region | heap (grows up)
  Low addresses:  .bss | .data | .rodata | .text | NULL guard page

  Permissions: .text = r-x, .rodata = r--, .data/.bss/heap/stack = rw-
Concept Summary Table
| Concept | Key Questions | Danger Signs |
|---|---|---|
| Translation | What does each stage produce? | "Compiled but crashes" |
| Data Rep | Why -1 > 0U? | Silent corruption |
| Machine Code | How does stack grow? | Cannot debug crashes |
| Architecture | What hazard is this? | Random performance |
| Memory | Cache hit rate? | 100x slowdown |
| ECF/Processes | What is a zombie? | Hangs, orphans |
| I/O/Networking | Short count? | Data corruption |
| Concurrency | Race condition? | Heisenbugs |
Deep Dive Reading By Concept
Primary: CS:APP 3rd Ed (Bryant & O'Hallaron)
| Concept | CS:APP | Supplementary |
|---|---|---|
| Translation | Ch. 1, 7 | Practical Binary Analysis, Low-Level Programming |
| Data Rep | Ch. 2 | Write Great Code Vol.1, Effective C |
| Machine Code | Ch. 3 | Hacking: Art of Exploitation |
| Architecture | Ch. 4-5 | Computer Organization and Design |
| Memory | Ch. 6, 9 | OSTEP |
| ECF/Processes | Ch. 8 | The Linux Programming Interface |
| I/O/Networking | Ch. 10-11 | Unix Network Programming |
| Concurrency | Ch. 12 | OSTEP, TLPI |
Essential Book List: CS:APP, C Programming: A Modern Approach (King), Effective C (Seacord), OSTEP (free online), The Linux Programming Interface (Kerrisk), Low-Level Programming (Zhirkov)
Table of Contents
- Overview
- Project Dependency Graph
- Progress Tracker
- Projects
- Project Comparison Table
- Skills Matrix
- Resources
Overview
The book's scope (12 chapters) spans:
| Domain | Topics |
|---|---|
| Translation & Execution | Preprocessing, compilation, assembly, linking, loading |
| Data Representation | Bits/bytes, integers, floating point, endianness |
| Machine-Level Code | x86-64, calling conventions, stack discipline |
| Architecture | CPU datapaths, pipelining (Y86-64) |
| Performance | Loop optimization, ILP, branch prediction |
| Memory Hierarchy | Caches, locality, virtual memory |
| Operating System | Processes, signals, exceptional control flow |
| I/O & Networking | File descriptors, sockets, robust I/O |
| Concurrency | Threads, synchronization, deadlock avoidance |
Project Dependency Graph
Read bottom-up: each tier builds on the tiers below it.

  Tier 7 (capstone):  P17 Secure Proxy Server
  Tier 6:             P16 Concurrency Workbench | P15 Unix I/O Toolkit | P14 Malloc Allocator
  Tier 5:             P12 Shell with Job Control | P11 Signals + Processes | P13 VM Visualizer | P9 Cache Simulator
  Tier 4 (hub):       P10 ELF Link Map
  Tier 3:             P8 Performance Clinic | P7 Y86-64 CPU Simulator | P6 Attack Lab Workflow (after P5 Bomb Lab Workflow)
  Tier 2:             P3 Data Lab Clone | P4 Calling Convention
  Tier 1:             P2 Bitwise Data Inspector
  START:              P1 Toolchain Explorer

Recommended Learning Paths:
| Path | Focus | Projects |
|---|---|---|
| Core | Essential CS:APP understanding | P1 -> P2 -> P4 -> P11 -> P12 |
| Security | Exploitation & defense | P1 -> P2 -> P4 -> P5 -> P6 |
| Architecture | CPU internals | P1 -> P2 -> P3 -> P7 |
| Performance | Optimization mastery | P1 -> P2 -> P8 -> P9 |
| Systems | Full systems programmer | P1 -> P2 -> P4 -> P11 -> P12 -> P15 -> P16 |
| Complete | Everything | P1 through P17 |
Progress Tracker
Use this checklist to track your journey:
Phase 1: Foundation
[ ] P1 - "Hello, Toolchain" Build Pipeline Explorer
[ ] P2 - Bitwise Data Inspector
Phase 2: Machine-Level Mastery
[ ] P3 - Data Lab Clone
[ ] P4 - x86-64 Calling Convention Crash Cart
[ ] P5 - Bomb Lab Workflow
[ ] P6 - Attack Lab Workflow
Phase 3: Architecture & Performance
[ ] P7 - Y86-64 CPU Simulator
[ ] P8 - Performance Clinic
[ ] P9 - Cache Lab++ Simulator
Phase 4: Systems Programming
[ ] P10 - ELF Link Map & Interposition
[ ] P11 - Signals + Processes Sandbox
[ ] P12 - Unix Shell with Job Control
[ ] P13 - Virtual Memory Map Visualizer
[ ] P14 - Build Your Own Malloc
[ ] P15 - Robust Unix I/O Toolkit
[ ] P16 - Concurrency Workbench
Phase 5: Capstone
[ ] P17 - CS:APP Capstone Proxy Platform
Phase 6: Beyond CS:APP (Advanced Extensions)
[ ] P18 - ELF Linker and Loader
[ ] P19 - Virtual Memory Simulator
[ ] P20 - HTTP Web Server
[ ] P21 - Thread Pool Implementation
[ ] P22 - Signal-Safe Printf
[ ] P23 - Performance Profiler
[ ] P24 - Memory Leak Detector
[ ] P25 - Debugger (ptrace-based)
[ ] P26 - Operating System Kernel Capstone
Projects
Phase 1: Foundation (Start Here)
Project 1: "Hello, Toolchain" - Build Pipeline Explorer
| Attribute | Value |
|---|---|
| Language | C (alt: Rust, Zig, C++) |
| Difficulty | Intermediate |
| Time | 1-2 weeks |
| Chapters | 1, 7 |
| Coolness | Genuinely Clever |
| Portfolio Value | Resume Gold |
What you'll build: A CLI "pipeline explainer" that takes one small C program and produces a structured report for each stage (preprocessed C, assembly, object metadata, linked binary metadata) plus runtime observations.
Why it matters: Chapter 1 is about seeing the system as a whole; this forces you to observe every transformation and artifact, not just "run gcc and hope."
Core challenges:
- Capturing each compilation artifact deterministically (translation stages)
- Explaining symbol tables/sections in human terms (executable structure)
- Relating runtime behavior to the produced binary (loading + process execution)
Key concepts to master:
- Translation system (Ch. 1)
- Object file anatomy: sections, symbols (Ch. 7)
- Error-handling discipline (Appendix)
Prerequisites: Basic C, comfort with build tools, basic debugging literacy.
Deliverable: A single report explaining "what the compiler produced, what the linker stitched, and what the process looks like at runtime."
Implementation hints:
- Treat this as a report generator, not a toy script
- Output must include: section list, symbol count by kind, stack/heap locations at runtime (via debugger)
Milestones:
- You can explain each pipeline stage using artifacts you produced
- You can predict changes between static vs dynamic linking
- You can map a crash address back to the right stage (source vs asm vs binary)
Real World Outcome
When complete, you will have a CLI tool that produces comprehensive build pipeline analysis:
$ ./pipeline-explorer hello.c --all
================================================================================
BUILD PIPELINE ANALYSIS: hello.c
================================================================================
[STAGE 1: PREPROCESSING]
--------------------------------------------------------------------------------
Input: hello.c (45 bytes, 5 lines)
Output: hello.i (18,432 bytes, 847 lines)
Time: 0.003s
Preprocessing transformations:
- #include <stdio.h> expanded: +842 lines from /usr/include/stdio.h
- Header chain: stdio.h -> stddef.h -> bits/types.h -> ...
- Macros defined: 127 (from system headers)
- Macros used in source: 0
- Conditional compilation: 23 #ifdef blocks evaluated
[STAGE 2: COMPILATION]
--------------------------------------------------------------------------------
Input: hello.i (18,432 bytes)
Output: hello.s (512 bytes, 28 lines)
Time: 0.012s
Assembly characteristics:
- Target: x86-64 (AT&T syntax)
- Functions generated: 1 (main)
- Instructions: 14
- String literals: 1 ("Hello, World!\n")
- Section directives: .text, .rodata, .note.GNU-stack
Code generation summary:
- Stack frame: 16 bytes (aligned)
- Callee-saved registers used: none
- External calls: puts@PLT
[STAGE 3: ASSEMBLY]
--------------------------------------------------------------------------------
Input: hello.s (512 bytes)
Output: hello.o (1,688 bytes)
Time: 0.002s
Object file analysis:
Section Size Type Flags
.text 26 PROGBITS AX (alloc, execute)
.rodata 15 PROGBITS A (alloc)
.comment 46 PROGBITS MS (merge, strings)
.note.GNU-s 0 NOBITS -
.eh_frame 56 PROGBITS A (alloc)
Symbol table (4 entries):
Symbol Type Bind Section Value
main FUNC GLOBAL .text 0x0
puts NOTYPE GLOBAL UND 0x0 (undefined - needs linking)
Relocations (2 entries):
Offset Type Symbol Addend
0x0a R_X86_64_PC32 .rodata -4
0x0f R_X86_64_PLT32 puts -4
[STAGE 4: LINKING]
--------------------------------------------------------------------------------
Input: hello.o + libc
Output: hello (16,696 bytes)
Time: 0.024s
Linking type: Dynamic
Interpreter: /lib64/ld-linux-x86-64.so.2
Linked binary analysis:
Section VMA Size Type
.interp 0x0000000000400318 28 interpreter path
.text 0x0000000000401040 147 executable code
.rodata 0x0000000000402000 19 read-only data
.dynamic 0x0000000000403e10 480 dynamic linking info
.got.plt 0x0000000000404000 32 GOT for PLT
.data 0x0000000000404020 0 initialized data
.bss 0x0000000000404020 0 uninitialized data
Symbol resolution:
- puts: resolved via PLT/GOT (lazy binding)
- __libc_start_main: resolved via PLT/GOT
- Dynamic libraries required: libc.so.6
Entry point: 0x401040 (_start, not main!)
[STAGE 5: RUNTIME OBSERVATION]
--------------------------------------------------------------------------------
Process memory map at main() entry:
Address Range Perms Size Mapping
0x00400000-0x00401000 r--p 4K hello (ELF header)
0x00401000-0x00402000 r-xp 4K hello (.text)
0x00402000-0x00403000 r--p 4K hello (.rodata)
0x00403000-0x00405000 rw-p 8K hello (.data, .bss, .got)
0x7ffff7c00000-0x7ffff7c28000 r--p 160K libc.so.6
0x7ffff7c28000-0x7ffff7dbd000 r-xp 1620K libc.so.6 (.text)
0x7ffff7fc3000-0x7ffff7fc7000 r--p 16K ld-linux-x86-64.so.2
0x7ffffffde000-0x7ffffffff000 rw-p 132K [stack]
Stack frame at main():
RSP: 0x7fffffffe3d0
RBP: 0x7fffffffe3e0
Return address: 0x7ffff7c29d90 (__libc_start_call_main+128)
argc: 1
argv[0]: "./hello"
================================================================================
PIPELINE SUMMARY
================================================================================
Total build time: 0.041s
Size amplification: 45 bytes (source) -> 16,696 bytes (binary) = 371x
Symbol resolution: 2 external symbols resolved dynamically
Recommendation: Use -static for deployment, dynamic for development
The tool can also produce focused reports:
$ ./pipeline-explorer hello.c --symbols
$ ./pipeline-explorer hello.c --relocations
$ ./pipeline-explorer hello.c --compare-linking # static vs dynamic comparison
$ ./pipeline-explorer hello.c --trace-symbol puts # full resolution chain for one symbol
The Core Question You're Answering
"What exactly happens between typing gcc hello.c and having a running process, and why does each transformation exist?"
This question forces you to confront the reality that compilation is not magic - it is a deterministic pipeline where each stage produces artifacts that the next stage consumes. Understanding this pipeline is the foundation for debugging linker errors, understanding security vulnerabilities, optimizing build times, and reasoning about what code actually executes.
Concepts You Must Understand First
- The Translation Pipeline (Preprocessing, Compilation, Assembly, Linking)
- What is the output of each stage and what format does it take?
- Why does preprocessing happen before compilation?
- What would break if you skipped the assembly stage and went directly from compiler output to object file?
- CS:APP Ch. 1.2, Ch. 7.1-7.2
- Object Files and ELF Format
- What are sections and why do .text, .data, .rodata, and .bss exist as separate concepts?
- What is a symbol table and why does it contain both defined and undefined symbols?
- What is a relocation entry and why can't the assembler resolve all addresses itself?
- CS:APP Ch. 7.3-7.4
- Symbol Resolution and Linking
- How does the linker decide which definition to use when multiple object files define the same symbol?
- What is the difference between strong and weak symbols?
- Why do static libraries and dynamic libraries resolve symbols differently?
- CS:APP Ch. 7.5-7.7
- Loading and Process Creation
- What does the loader do with the ELF file before main() runs?
- Where do the various segments end up in virtual memory?
- What is the role of the dynamic linker (ld-linux.so)?
- CS:APP Ch. 7.9, Ch. 8.2
- Compilation and Code Generation
- What decisions does the compiler make when translating C to assembly?
- How do optimization levels affect the generated code?
- What information is lost during compilation that cannot be recovered?
- CS:APP Ch. 1.2, Ch. 3.1-3.2
Questions to Guide Your Design
- How will you invoke each stage of the pipeline separately? (Hint: gcc -E, gcc -S, gcc -c, gcc)
- How will you parse the output of tools like objdump, readelf, and nm to extract structured information?
- What format will your report take - plain text, JSON, or both? How will you handle reports that need to show binary data?
- How will you capture runtime information? Will you use GDB scripting, ptrace, or /proc filesystem parsing?
- How will you handle error cases - what if compilation fails? What if the input is not valid C?
- How will you make the tool educational? Should it explain why each transformation happened, not just what changed?
- How will you compare static vs dynamic linking? What metrics are meaningful to show?
Thinking Exercise
Before writing any code, trace through this program by hand:
// main.c
extern int helper(int x);
int global_var = 42;
int main(void) {
return helper(global_var);
}
// helper.c
int helper(int x) {
return x + 1;
}
Answer these questions on paper:
- Preprocessing phase: What will main.i look like? Will it be different from main.c in any meaningful way for this example?
- Symbol table for main.o: List every symbol. For each one, state:
  - Name
  - Type (FUNC, OBJECT, NOTYPE)
  - Binding (LOCAL, GLOBAL)
  - Section (which section, or UND if undefined)
- Relocations in main.o: There will be at least two relocations. What are they and why?
- Linking main.o + helper.o: Draw the combined symbol table. Which symbols from main.o were undefined before linking but defined after?
- Memory layout after loading: If the .text section of the final binary starts at 0x401000, and main is at offset 0x20 within .text, what is the absolute address of main?
- Dynamic linking alternative: If helper() were in a shared library instead of helper.o, what would be different about:
  - The symbol table
  - The relocations
  - The PLT/GOT sections
  - The runtime behavior on the first call to helper()
The Interview Questions Theyโll Ask
- โWalk me through what happens when you run
gcc -o hello hello.cโ- They want: preprocessing expands includes/macros, compiler generates assembly, assembler creates object file with relocations, linker resolves symbols and creates executable
- Bonus: mention that ld.so loads dynamic dependencies at runtime
- โWhatโs the difference between a linker error and a compiler error?โ
- They want: compiler errors are syntax/type errors in a single translation unit; linker errors are symbol resolution failures across multiple object files
- Example: undefined reference vs undeclared identifier
- "Explain static vs dynamic linking and when you'd use each"
- They want: static bundles everything (larger binary, no dependencies, faster startup), dynamic shares libraries (smaller binary, security updates, slower first-call)
- Discuss: deployment scenarios, licensing implications (LGPL)
- "What is Position Independent Code (PIC) and why is it needed?"
- They want: code that works regardless of load address, required for shared libraries (ASLR), uses PC-relative addressing and GOT/PLT
- "How would you debug a 'symbol not found' error at runtime?"
- They want: ldd to check dependencies, LD_DEBUG=all to trace resolution, readelf/nm to inspect symbol tables, verify library paths
- "What's in an ELF file and how does the loader use it?"
- They want: ELF header, program headers (segments for loading), section headers (for linking/debugging), symbol/string tables, relocation entries
Hints in Layers
Hint 1 - Getting Started: Start by manually running each stage and saving the outputs:
gcc -E hello.c -o hello.i # Preprocess only
gcc -S hello.c -o hello.s # Compile to assembly
gcc -c hello.c -o hello.o # Assemble to object file
gcc hello.o -o hello # Link to executable
Look at each output file. What tools can parse them? (file, cat, objdump, readelf, nm)
Hint 2 - Extracting Object File Information: These commands give you structured output you can parse:
readelf -h hello.o # ELF header
readelf -S hello.o # Section headers
readelf -s hello.o # Symbol table
readelf -r hello.o # Relocations
objdump -d hello.o # Disassembly
Consider using readelf --wide for easier parsing.
Hint 3 - Capturing Runtime Information: For the runtime stage, you can use GDB non-interactively:
gdb -batch -ex "break main" -ex "run" -ex "info registers" -ex "x/20x \$rsp" ./hello
Or parse /proc/[pid]/maps from a wrapper program.
Hint 4 - Comparing Linking Strategies: Build both versions and compare:
gcc -o hello_dynamic hello.c
gcc -static -o hello_static hello.c
ls -l hello_dynamic hello_static
ldd hello_dynamic
readelf -d hello_dynamic | grep NEEDED
Hint 5 - Tool Architecture: Structure your code as:
struct stage_result {
char *stage_name;
char *input_file;
char *output_file;
size_t input_size;
size_t output_size;
double elapsed_time;
/* stage-specific data */
};
struct preprocess_result { int lines_added; int macros_expanded; ... };
struct compile_result { int instructions; int functions; ... };
struct assemble_result { struct section *sections; struct symbol *symbols; ... };
struct link_result { struct segment *segments; char *entry_point; ... };
Hint 6 - The Educational Value: Don't just report numbers - explain them:
The symbol 'puts' appears in hello.o with type NOTYPE and section UND (undefined).
This means the assembler encountered a call to puts() but has no idea where it is.
The relocation entry at offset 0x0f tells the linker: "When you find puts,
patch this location with the correct address."
After linking, puts is still not directly resolved - instead, the linker created
a PLT entry at 0x401030 and a GOT slot at 0x404018. The first call to puts()
will trigger the dynamic linker to fill in the GOT slot.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| The compilation pipeline overview | Computer Systems: A Programmer's Perspective | Ch. 1 (A Tour of Computer Systems) |
| Object files, symbols, and relocations | Computer Systems: A Programmer's Perspective | Ch. 7 (Linking) |
| ELF format deep dive | Practical Binary Analysis | Ch. 2 (ELF Format) |
| Static and dynamic linking | Computer Systems: A Programmer's Perspective | Ch. 7.6-7.7 |
| Position-independent code and GOT/PLT | Computer Systems: A Programmer's Perspective | Ch. 7.12 |
| The C compilation model | The C Programming Language (K&R) | Ch. 4 (Functions and Program Structure) |
| Separate compilation in C | C Programming: A Modern Approach | Ch. 15 (Writing Large Programs) |
| x86-64 assembly basics | Computer Systems: A Programmer's Perspective | Ch. 3.1-3.4 |
| Process loading and execution | Computer Systems: A Programmer's Perspective | Ch. 7.9, Ch. 8.2 |
| Low-level executable analysis | Low-Level Programming | Ch. 3-4 (Assembly and Linking) |
Project 2: Bitwise Data Inspector
| Attribute | Value |
|---|---|
| Language | C (alt: Rust, Zig, C++) |
| Difficulty | Intermediate |
| Time | Weekend–2 weeks |
| Chapters | 2, 3 |
| Coolness | ★★★★☆ Genuinely Clever |
| Portfolio Value | Micro-SaaS/Pro Tool |
What you'll build: A CLI that prints the byte-level representation of values (signed/unsigned integers and IEEE-754 floats), including inferred endianness and derived interpretations.
Why it matters: Chapter 2 becomes "muscle memory" only when you can see representations and predict overflow, truncation, and rounding.
Core challenges:
- Correct sign extension, shifts, and casts (two's complement)
- Float field extraction and classification (IEEE-754)
- Tests that catch edge-case mistakes (disciplined reasoning)
Key concepts to master:
- Integer representations and overflow (Ch. 2)
- Floating point, rounding, NaN/Inf (Ch. 2)
- Data sizes and alignment (Ch. 3)
Prerequisites: Basic C operators, binary/hex comfort.
Deliverable: Paste a number; get "what the machine stores" plus why comparisons/overflows surprise people.
Implementation hints:
- Separate parsing, bit extraction, and formatting as distinct modules
- Make the tool explain why a conversion changed value (range, rounding, NaN propagation)
Milestones:
- You can predict overflow and signed/unsigned comparison outcomes
- You can explain subnormals and NaN behavior with your own examples
- You start trusting bit evidence over intuition
Real World Outcome
When complete, you will have a CLI tool that reveals the hidden bit-level truth behind numbers:
$ ./bitwise-inspector 42
================================================================================
BITWISE DATA INSPECTION: 42
================================================================================
[INTEGER INTERPRETATIONS]
--------------------------------------------------------------------------------
Input parsed as: decimal integer
As unsigned integers:
uint8_t: 42 (0x2A) Binary: 00101010
uint16_t: 42 (0x002A) Binary: 00000000 00101010
uint32_t: 42 (0x0000002A)
uint64_t: 42 (0x000000000000002A)
As signed integers (two's complement):
int8_t: 42 (0x2A) Binary: 00101010
int16_t: 42 (0x002A) Sign bit: 0 (positive)
int32_t: 42 (0x0000002A)
int64_t: 42 (0x000000000000002A)
Memory layout (little-endian system):
Address: [0] [1] [2] [3]
uint32: 2A 00 00 00
[OVERFLOW ANALYSIS]
--------------------------------------------------------------------------------
42 + 200 as uint8_t = 242 (no overflow, still fits)
42 + 200 as int8_t = -14 (OVERFLOW! Wraps negative)
Binary: 00101010 + 11001000 = 11110010 = -14 (two's complement)
[COMPARISON TRAPS]
--------------------------------------------------------------------------------
Warning: Signed/unsigned comparison hazards:
(int8_t)-1 > (uint8_t)200 is FALSE (both promote to int: -1 < 200)
But: (int)-1 > (unsigned)200 might surprise you!
-1 converted to unsigned int = 4294967295 (0xFFFFFFFF)
Comparison: 4294967295 > 200 = TRUE (usual arithmetic conversions)
$ ./bitwise-inspector -f 0.1
================================================================================
BITWISE DATA INSPECTION: 0.1 (float)
================================================================================
[IEEE-754 SINGLE PRECISION (32-bit)]
--------------------------------------------------------------------------------
Hex representation: 0x3DCCCCCD
Binary: 0 01111011 10011001100110011001101
^ ^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^
| | |
| | +-- Mantissa (23 bits): 1.10011001100110011001101
| +-- Exponent (8 bits): 123 - 127 = -4
+-- Sign bit: 0 (positive)
Value computation:
(-1)^0 * 1.10011001100110011001101 * 2^(-4)
= 1 * 1.60000002384185791015625 * 0.0625
= 0.10000000149011611938476562
PRECISION LOSS: You asked for 0.1, but got 0.10000000149011612
Error: +1.49e-09 (relative error: 1.49e-08)
[IEEE-754 DOUBLE PRECISION (64-bit)]
--------------------------------------------------------------------------------
Hex representation: 0x3FB999999999999A
Binary: 0 01111111011 1001100110011001100110011001100110011001100110011010
Value: 0.10000000000000000555111512312578270211815834045410156250
PRECISION LOSS: Error from exact 0.1 = +5.55e-18
[WHY 0.1 CANNOT BE EXACT]
--------------------------------------------------------------------------------
0.1 in binary is a repeating fraction: 0.0001100110011001100110011...
Just like 1/3 = 0.333... in decimal, 1/10 is infinite in binary.
IEEE-754 truncates this, causing the representation error.
$ ./bitwise-inspector -f inf
[IEEE-754 SPECIAL VALUES]
--------------------------------------------------------------------------------
+Infinity (float): 0x7F800000 Binary: 0 11111111 00000000000000000000000
-Infinity (float): 0xFF800000 Binary: 1 11111111 00000000000000000000000
+NaN (quiet): 0x7FC00000 Binary: 0 11111111 10000000000000000000000
-0.0 (float): 0x80000000 Binary: 1 00000000 00000000000000000000000
NaN behavior:
NaN == NaN is FALSE (NaN is not equal to anything, including itself)
NaN != NaN is TRUE
isnan(NaN) is TRUE
$ ./bitwise-inspector --edge-cases
[CRITICAL EDGE CASES TO REMEMBER]
--------------------------------------------------------------------------------
INT_MIN negation trap:
-(-2147483648) = -2147483648 (NOT 2147483648!)
Because 2147483648 cannot fit in int32_t
Signed overflow is UNDEFINED BEHAVIOR in C:
INT_MAX + 1 = undefined (compiler may assume it never happens)
Unsigned overflow is well-defined: wraps to 0
Float comparison epsilon:
0.1 + 0.2 == 0.3 is FALSE
|0.1 + 0.2 - 0.3| < epsilon is the correct approach
The Core Question You're Answering
"How does the machine actually store and manipulate numbers, and why do programmers keep getting bitten by edge cases they thought they understood?"
This project forces you to move from "I know two's complement exists" to "I can predict exactly which bit pattern will result from any operation." This is the foundation for understanding buffer overflows, integer vulnerabilities, floating-point precision issues in financial software, and why certain optimizations are (un)safe.
Concepts You Must Understand First
- Two's Complement Integer Representation
- How do you convert a negative number to its two's complement representation?
- What is the range of an N-bit two's complement integer?
- Why is there one more negative number than positive?
- How does negation work in two's complement? When does it fail?
- CS:APP Ch. 2.2 (Integer Representations)
- Unsigned vs Signed Integer Operations
- What happens when you cast a negative signed integer to unsigned?
- What is "sign extension" and when does it occur?
- How does C handle mixed signed/unsigned comparisons?
- What is the difference between arithmetic and logical right shift?
- CS:APP Ch. 2.2-2.3
- Integer Overflow and Undefined Behavior
- What happens when signed overflow occurs in C? (Hint: undefined behavior)
- What happens when unsigned overflow occurs? (Hint: well-defined wraparound)
- How can compilers exploit undefined behavior for optimization?
- CS:APP Ch. 2.3, Effective C Ch. 5
- IEEE-754 Floating Point Format
- What are the three components of an IEEE-754 number (sign, exponent, mantissa)?
- What is the "bias" in the exponent field? Why is it needed?
- What is the implicit leading 1 in normalized numbers?
- What are denormalized (subnormal) numbers and when do they occur?
- CS:APP Ch. 2.4 (Floating Point)
- Special Floating Point Values
- How are infinity, negative infinity, and NaN represented?
- What operations produce NaN? What operations produce infinity?
- Why is NaN != NaN true? How do you test for NaN?
- What is negative zero and how does it differ from positive zero?
- CS:APP Ch. 2.4.3-2.4.6
- Endianness and Memory Layout
- What is big-endian vs little-endian?
- How do you determine the endianness of your system?
- How does endianness affect multi-byte integer storage?
- CS:APP Ch. 2.1.3
Questions to Guide Your Design
- How will you parse different input formats (decimal, hex, binary, float literals)?
- How will you handle type specification - should the user specify int32 vs int64, or infer it?
- How will you display bit patterns - raw binary, grouped bytes, or both?
- How will you demonstrate overflow - show the computation, or just the result?
- How will you extract IEEE-754 fields - bit masking, unions, or memcpy?
- How will you make the output educational - just facts, or explanations of why?
- How will you handle invalid input or edge cases like NaN input?
Thinking Exercise
Before writing any code, work through these by hand:
// Exercise 1: Integer representation
int8_t a = -1;
uint8_t b = a;
// Question: What is the value of b? Draw the bit pattern.
// Exercise 2: Sign extension
int8_t x = -5;
int32_t y = x;
// Question: What bit pattern is y? How many 1s are in its binary representation?
// Exercise 3: Overflow
int8_t m = 127;
int8_t n = m + 1;
// Question: What is n? Is this defined behavior?
uint8_t p = 255;
uint8_t q = p + 1;
// Question: What is q? Is this defined behavior?
// Exercise 4: Signed/unsigned comparison
int x = -1;
unsigned int y = 1;
if (x < y) printf("x < y\n");
else printf("x >= y\n");
// Question: What prints and why?
// Exercise 5: Float representation
// Convert 12.375 to IEEE-754 single precision by hand:
// Step 1: Convert to binary: 12.375 = ?
// Step 2: Normalize: 1.??? x 2^?
// Step 3: Calculate biased exponent: ? + 127 = ?
// Step 4: Write final bit pattern: ? ? ?
// Exercise 6: Float precision
float a = 0.1f;
float b = 0.2f;
float c = 0.3f;
// Question: Is (a + b == c) true or false? What are the actual bit patterns?
The Interview Questions They'll Ask
- "Explain two's complement and why we use it instead of sign-magnitude"
- They want: addition works the same for signed/unsigned, only one zero, simple negation (flip bits + 1), hardware efficiency
- Know: the asymmetry (-128 to 127 for int8_t)
- "What happens when you compare a signed and unsigned integer in C?"
- They want: signed is converted to unsigned, which can cause -1 > 1 to be true
- Bonus: explain that this is a common source of security vulnerabilities
- "Why can't 0.1 be represented exactly in floating point?"
- They want: 0.1 is a repeating fraction in binary, IEEE-754 has finite precision
- Know: never compare floats with ==, use epsilon comparison
- "What is undefined behavior and why does signed overflow cause it?"
- They want: compiler can assume UB never happens, enables optimizations
- Example: if (x + 1 > x) can be optimized to if (true) because signed overflow is UB
- "How would you detect if an integer addition will overflow before it happens?"
- They want: for signed, check if signs match and result sign differs; for unsigned, check if result < either operand
- Bonus: mention compiler built-ins like __builtin_add_overflow
- "Explain denormalized floating point numbers"
- They want: gradual underflow, implicit leading 0 instead of 1, fills gap between 0 and smallest normalized
- Know: they have reduced precision but prevent abrupt underflow to zero
Hints in Layers
Hint 1 - Getting Started: Start with integer display. Use a union or memcpy to view raw bytes:
void show_bytes(void *ptr, size_t len) {
unsigned char *bytes = (unsigned char *)ptr;
for (size_t i = 0; i < len; i++) {
printf("%02x ", bytes[i]);
}
printf("\n");
}
int x = -1;
show_bytes(&x, sizeof(x)); // ff ff ff ff on little-endian
Hint 2 - Extracting IEEE-754 Fields: Use bit manipulation to extract sign, exponent, and mantissa:
typedef union {
float f;
uint32_t u;
} float_bits;
void decompose_float(float f) {
float_bits fb = { .f = f };
uint32_t sign = (fb.u >> 31) & 1;
uint32_t exponent = (fb.u >> 23) & 0xFF;
uint32_t mantissa = fb.u & 0x7FFFFF;
int actual_exp = (int)exponent - 127; // Cast first so the bias subtraction is signed
printf("Sign: %u, Exp: %d (biased: %u), Mantissa: 0x%06X\n",
sign, actual_exp, exponent, mantissa);
}
Hint 3 - Detecting Overflow: For unsigned addition, overflow occurred if result < either operand:
int unsigned_add_overflows(unsigned a, unsigned b) {
return (a + b) < a;
}
// For signed, use compiler built-ins or check manually:
int signed_add_overflows(int a, int b) {
return __builtin_add_overflow(a, b, &(int){0});
}
Hint 4 - Printing Binary: Create a helper to print any integer as binary with grouping:
void print_binary(uint64_t val, int bits) {
for (int i = bits - 1; i >= 0; i--) {
printf("%c", (val >> i) & 1 ? '1' : '0');
if (i > 0 && i % 8 == 0) printf(" ");
}
printf("\n");
}
Hint 5 - Special Float Detection: Detect special values using the bit pattern:
int is_nan(float f) {
float_bits fb = { .f = f };
uint32_t exp = (fb.u >> 23) & 0xFF;
uint32_t mantissa = fb.u & 0x7FFFFF;
return exp == 255 && mantissa != 0;
}
int is_infinity(float f) {
float_bits fb = { .f = f };
uint32_t exp = (fb.u >> 23) & 0xFF;
uint32_t mantissa = fb.u & 0x7FFFFF;
return exp == 255 && mantissa == 0;
}
int is_denormalized(float f) {
float_bits fb = { .f = f };
uint32_t exp = (fb.u >> 23) & 0xFF;
return exp == 0 && f != 0.0f;
}
Hint 6 - Tool Structure: Organize your tool with clear separation:
// parser.c - parse input strings to values
// integer.c - integer analysis and display
// float.c - IEEE-754 analysis and display
// display.c - formatted output
struct inspection_result {
enum { INT_TYPE, FLOAT_TYPE } type;
union {
struct {
int64_t signed_val;
uint64_t unsigned_val;
int bits;
} integer;
struct {
double value;
int precision; // 32 or 64
} floating;
} data;
};
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Integer representations (two's complement) | Computer Systems: A Programmer's Perspective | Ch. 2.2 Integer Representations |
| Integer arithmetic and overflow | Computer Systems: A Programmer's Perspective | Ch. 2.3 Integer Arithmetic |
| IEEE-754 floating point | Computer Systems: A Programmer's Perspective | Ch. 2.4 Floating Point |
| Bit manipulation techniques | The C Programming Language (K&R) | Ch. 2.9 Bitwise Operators |
| Safe integer operations | Effective C | Ch. 5 Integer Security |
| Undefined behavior | Effective C | Ch. 2 Objects, Functions, Types |
| Data representation overview | Write Great Code Vol. 1 | Ch. 2-4 (Numeric Representation) |
| C type conversion rules | C Programming: A Modern Approach | Ch. 7 Basic Types |
| Pointer and integer relationships | Understanding and Using C Pointers | Ch. 4 Pointers and Arrays |
| Low-level data representation | Low-Level Programming | Ch. 2 Assembly Language and Computer Architecture |
Phase 2: Machine-Level Mastery
Project 3: Data Lab Clone
| Attribute | Value |
|---|---|
| Language | C (alt: Rust, Zig, C++) |
| Difficulty | Advanced |
| Time | 1–2 weeks |
| Chapters | 2 |
| Coolness | ★★★☆☆ Practical |
| Portfolio Value | Resume Gold |
What you'll build: A framework that enforces restricted operator sets for exercises (e.g., only bitwise ops), runs randomized tests, and produces a scoreboard.
Why it matters: The restriction forces hardware-style thinking; the harness forces correctness under edge cases.
Core challenges:
- Enforcing restrictions mechanically (operator semantics)
- Property-based/randomized testing for corner cases (representation edge behavior)
- Producing clear failure explanations (debugging discipline)
Key concepts to master:
- Bit-level operator reasoning (Ch. 2)
- Undefined/implementation-defined behavior awareness (Effective C reference)
- Test-oracle thinking (Appendix)
Prerequisites: Solid C, comfort writing tests.
Deliverable: A repeatable, automated way to prove you can do "bit-twiddling under constraints" correctly.
Implementation hints:
- Make restrictions mechanical (scan source for disallowed tokens)
- Include adversarial values (min/max, boundaries, NaNs) in tests
Milestones:
- You derive bit identities without trial-and-error
- You can explain every failing case without "mystery"
- Your constraints prevent cheating, not just discourage it
Real World Outcome
When complete, you will have a testing framework that enforces bit-manipulation constraints:
$ ./datalab-runner puzzles/bitAnd.c
================================================================================
DATA LAB CLONE - PUZZLE VALIDATOR
================================================================================
[PUZZLE: bitAnd]
--------------------------------------------------------------------------------
Task: Compute x & y using only ~ and |
Allowed operators: ~ |
Max operations: 8
Your solution uses: 4 operations
[RESTRICTION CHECK]
--------------------------------------------------------------------------------
Scanning source for disallowed operators...
Line 5: Found '~' - ALLOWED
Line 5: Found '|' - ALLOWED
PASS: No disallowed operators found
[CORRECTNESS TESTS]
--------------------------------------------------------------------------------
Running exhaustive test for 8-bit inputs (65536 combinations)...
65536/65536 tests passed
Running random 32-bit tests (10000 iterations)...
10000/10000 tests passed
[RESULT: PASS]
================================================================================
Score: 4/4 (4 ops used, max 8 allowed)
$ ./datalab-runner --scoreboard
================================================================================
DATA LAB SCOREBOARD
================================================================================
Puzzle Status Ops Used Max Ops Score
--------------------------------------------------------------------------------
bitAnd PASS 4 8 2.0
bitXor PASS 7 8 1.5
isZero PASS 2 2 2.0
addOK FAIL - 20 0.0
-> Failed: addOK(0x7FFFFFFF, 1) expected 0, got 1
Total Score: 5.5 / 8.0
The Core Question You're Answering
"Can you think like the hardware - expressing computation using only the primitive operations a CPU actually has, while guaranteeing correctness for every possible input?"
Concepts You Must Understand First
- Boolean Algebra and Logic Gates
- How can you express AND using only OR and NOT? (De Morgan's Laws)
- CS:APP Ch. 2.1.6-2.1.8
- Bitwise Operations in C
- What is the difference between & and &&? Between | and ||?
- What is the difference between arithmetic and logical right shift?
- CS:APP Ch. 2.1.6-2.1.8, K&R Ch. 2.9
- Two's Complement Arithmetic
- How can you detect overflow using only bitwise operations?
- What is the relationship between ~x and -x-1?
- CS:APP Ch. 2.2-2.3
- Bit Manipulation Patterns
- How do you create a mask with the lowest N bits set?
- How do you extract a field of bits from a value?
- Hacker's Delight Ch. 2
Questions to Guide Your Design
- How will you detect disallowed operators - regex, parsing, or AST analysis?
- What test strategy will you use - exhaustive for small inputs, random for large?
- How will you score solutions - just pass/fail, or reward minimal operator usage?
Thinking Exercise
Solve these puzzles by hand:
Puzzle 1: bitAnd(x, y) - Compute x & y using only ~ and |
De Morgan: a & b = ~(~a | ~b)
Puzzle 2: isNegative(x) - Return 1 if x < 0. Use only >> and &
Hint: x >> 31 for 32-bit integers (careful: an arithmetic shift yields -1 for negatives, so mask with & 1)
The Interview Questions They'll Ask
- "Implement XOR using only AND, OR, and NOT" - x ^ y = (x & ~y) | (~x & y)
- "How do you detect if adding two integers will overflow?" - Check if the operands' signs match but the result's sign differs
- "What is the fastest way to check if a number is a power of 2?" - x && !(x & (x - 1))
- "How do you compute absolute value without branching?" - int mask = x >> 31; return (x ^ mask) - mask;
Hints in Layers
Hint 1: Create reference implementations and test against them
Hint 2: Test edge cases: 0, 1, -1, INT_MAX, INT_MIN, 0x55555555, 0xAAAAAAAA
Hint 3: Use TRACE macros to debug intermediate values
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Bitwise operations | Computer Systems: A Programmer's Perspective | Ch. 2.1.6-2.1.8 |
| Two's complement | Computer Systems: A Programmer's Perspective | Ch. 2.2-2.3 |
| Bit manipulation tricks | Hacker's Delight | Ch. 2 |
| C bitwise operators | The C Programming Language (K&R) | Ch. 2.9 |
| Safe integer operations | Effective C | Ch. 5 |
Project 4: x86-64 Calling Convention Crash Cart
| Attribute | Value |
|---|---|
| Language | C (alt: Rust, Zig, C++) |
| Difficulty | Advanced |
| Time | 1–2 weeks |
| Chapters | 3 |
| Coolness | ★★★★☆ Genuinely Clever |
| Portfolio Value | Resume Gold |
What you'll build: Tiny programs plus a standardized post-mortem report format that explains how stack frames, saved registers, and return addresses caused a crash.
Why it matters: Chapter 3 becomes usable only when you can debug from registers/stack bytes back to the source-level defect.
Core challenges:
- Mapping assembly to C constructs (code generation)
- Explaining stack layout and argument passing (ABI)
- Handling arrays/structs in machine terms (data layout + addressing)
Key concepts to master:
- x86-64 instruction patterns (Ch. 3)
- Stack discipline and procedure calls (Ch. 3)
- Arrays/structs and pointer arithmetic (Ch. 3)
Prerequisites: Comfort using a debugger (GDB/LLDB).
Deliverable: Given a crash address and debugger snapshot, you can write a clean narrative of what happened.
Implementation hints:
- Standardize your report: registers, stack window, disassembly window, C-source mapping
- Intentionally create classic failures: invalid pointer, stack smash, use-after-free
Milestones:
- You can explain a crash without guessing
- You recognize compiler-generated patterns (switch tables, loops, calls)
- You identify vulnerability classes by assembly signature
Real World Outcome
When complete, you will have crash scenarios with detailed post-mortem analysis:
$ ./crash-cart analyze core.12345
================================================================================
x86-64 CRASH CART - POST-MORTEM ANALYSIS
================================================================================
[CRASH SUMMARY]
Signal: SIGSEGV | Fault: 0x0 | Location: main+54 (vulnerable.c:23)
Cause: NULL pointer dereference
[REGISTERS]
RAX: 0x0000000000000000 <- NULL!
RBP: 0x7fffffffdd70 | RSP: 0x7fffffffdd50 | RIP: 0x401156
[STACK TRACE]
#0 main at vulnerable.c:23
#1 __libc_start_call_main
[STACK FRAME]
RBP+8: Return address
RBP: Saved RBP
RBP-24: ptr = NULL
[DISASSEMBLY]
0x401156: mov (%rax),%eax ; CRASH: deref NULL!
[ROOT CAUSE]
get_data() returned NULL, dereferenced without checking.
The Core Question You're Answering
"Given a crashed program and a debugger, can you reconstruct what happened?"
Concepts You Must Understand First
- x86-64 Register Conventions - Caller/callee-saved, argument registers (CS:APP Ch. 3.7)
- Stack Frame Layout - Return address, saved regs, locals (CS:APP Ch. 3.7.1-3.7.4)
- Calling Convention - Arguments and returns (CS:APP Ch. 3.7)
- Memory Safety Violations - NULL deref, overflow, use-after-free (CS:APP Ch. 3.10)
Questions to Guide Your Design
- What crash scenarios will you create?
- How will you standardize your report format?
Thinking Exercise
void greet(char *name) {
char buffer[16];
strcpy(buffer, name); // No bounds check!
}
Run with 32 'A's. Where is the return address relative to the buffer?
The Interview Questions They'll Ask
- "Walk me through a function call in x86-64"
- "How would you debug a segfault with only a core dump?"
- "What is a stack buffer overflow?"
- "What protections exist against buffer overflows?"
Hints in Layers
Hint 1: GDB commands: info registers, x/32xg $rsp, bt full
Hint 2: Report: Summary, Registers, Stack, Disasm, Root Cause
Hint 3: Create NULL deref, stack overflow, use-after-free scenarios
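A possible end-to-end warm-up for Hint 3 (a sketch; core-file handling varies by distro, so the gdb line re-runs the crash in-process instead of loading a dump, and /tmp is an arbitrary location):

```shell
# Reproduce a classic NULL-dereference crash for the crash cart to analyze.
cd /tmp
cat > crashme.c <<'EOF'
int main(void) { int *p = 0; return *p; }
EOF
gcc -g -o crashme crashme.c
ulimit -c unlimited || true             # allow core dumps (may be restricted)
./crashme || echo "exit status: $?"     # typically 139 = 128 + SIGSEGV(11)
gdb -batch -ex run -ex bt -ex 'info registers rip rax' ./crashme
```

Check /proc/sys/kernel/core_pattern to find where (or whether) your distro writes core files.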
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| x86-64 procedures | CS:APP | Ch. 3.7 |
| Buffer overflows | CS:APP | Ch. 3.10 |
| GDB debugging | The Art of Debugging | Ch. 1-4 |
| Binary analysis | Practical Binary Analysis | Ch. 6 |
Project 5: Bomb Lab Workflow
| Attribute | Value |
|---|---|
| Language | C (alt: Rust, Zig, C++) |
| Difficulty | Advanced |
| Time | 1–2 weeks |
| Chapters | 3 |
| Coolness | ★★★★★ Hardcore Tech Flex |
| Portfolio Value | Resume Gold |
What you'll build: A repeatable binary-puzzle playbook and annotated solutions for at least one full bomb instance: inputs, reasoning, and the exact assembly facts used.
Why it matters: It forces fluent reading of compiler output and tool-driven reasoning under constraints.
Core challenges:
- Extracting constraints from assembly (control flow + data movement)
- Verifying hypotheses via debugging (disciplined experimentation)
- Handling indirect jumps and lookup tables (machine-level control)
Key concepts to master:
- Control flow at machine level (Ch. 3)
- Debugger-driven reasoning (Ch. 3)
- Defensive reading of compiled code (Ch. 3 security discussion)
Prerequisites: Project 4 (or equivalent).
Deliverable: A written "defusal dossier" that proves you can reverse engineer a real x86-64 binary methodically.
Implementation hints:
- Write down each constraint as a testable statement before trying any input
- Prefer "prove constraints" over "try strings"
Milestones:
- You solve phases without brute force
- You generalize patterns across different binaries
- You can justify each solution in assembly terms
Real World Outcome
When you complete this project, you will have a "Defusal Dossier" documenting your systematic reverse engineering of a binary bomb:
$ objdump -t bomb | grep phase
0000000000400ee0 g F .text 000000000000002a phase_1
0000000000400efc g F .text 0000000000000052 phase_2
0000000000400f43 g F .text 000000000000003c phase_3
$ gdb ./bomb
(gdb) break phase_1
(gdb) run
(gdb) disas
0x0000000000400ee4 <+4>: mov $0x402400,%esi
0x0000000000400ee9 <+9>: call 0x401338 <strings_not_equal>
0x0000000000400ef2 <+18>: call 0x40143a <explode_bomb>
(gdb) x/s 0x402400
0x402400: "Border relations with Canada have never been better."
$ ./bomb solutions.txt
Congratulations! You've defused the bomb!
Your dossier documents each phase:
PHASE 1: String Comparison
Constraint: input must equal string at 0x402400
Evidence: mov $0x402400,%esi before strings_not_equal call
Solution: "Border relations with Canada have never been better."
The Core Question You're Answering
"How do I systematically extract program constraints from compiled machine code without source access?"
Concepts You Must Understand First
- x86-64 Instruction Semantics (CS:APP Ch. 3.4-3.6)
- What do lea vs mov do? How do cmp and test set condition codes?
- What is the difference between je, jl, jg, ja, jb?
- Calling Conventions (CS:APP Ch. 3.7)
- Where are arguments? (%rdi, %rsi, %rdx, %rcx, %r8, %r9)
- Where is the return value? (%rax)
- Control Flow Patterns (CS:APP Ch. 3.6)
- How does a for loop look in assembly?
- How does a switch compile (jump tables)?
- Data Access Patterns (CS:APP Ch. 3.8-3.9)
- How is array[i] computed? How are struct fields accessed?
Questions to Guide Your Design
- What tools will you use first? How do you identify โinterestingโ functions?
- How do you identify the โexplodeโ condition and work backwards?
- How do you test hypotheses before committing an answer?
- How do you recognize loop and recursive patterns?
Thinking Exercise
Trace this by hand before using GDB:
phase_mystery:
mov $0x4025cf,%edi
call sscanf
cmp $0x2,%eax
jg .L1
call explode_bomb
.L1:
cmpl $0x7,0x8(%rsp)
ja .L_explode
mov 0x8(%rsp),%eax
jmp *0x402470(,%rax,8)
(gdb) x/s 0x4025cf
0x4025cf: "%d %d %d"
What format does sscanf expect? What is jmp *0x402470(,%rax,8) doing?
The Interview Questions They'll Ask
- "Walk me through reverse engineering an unknown binary."
- "How do you identify a switch statement in x86-64 assembly?"
- "What's the difference between test %eax,%eax and cmp $0,%eax?"
- "How would you find a hidden function in a binary?"
- "What tools would you use for binary analysis?"
Hints in Layers
Layer 1: strings bomb | less and nm bomb | grep phase_
Layer 2: (gdb) break explode_bomb is your safety net
Layer 3 - String Pattern:
mov $ADDR,%esi
call strings_not_equal
test %eax,%eax
je .Lsuccess
call explode_bomb
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| x86-64 instructions | Computer Systems: A Programmerโs Perspective | Ch. 3.4-3.6 |
| Reverse engineering | Hacking: The Art of Exploitation | Ch. 3 |
| GDB mastery | The Art of Debugging with GDB, DDD, and Eclipse | Ch. 1-4 |
| Binary formats | Practical Binary Analysis | Ch. 2, 4 |
| Assembly | Low-Level Programming by Igor Zhirkov | Part II |
Project 6: Attack Lab Workflow
| Attribute | Value |
|---|---|
| Language | C (alt: Rust, Zig, C++) |
| Difficulty | Expert |
| Time | 2–3 weeks |
| Chapters | 3 |
| Coolness | ★★★★★ Hardcore Tech Flex |
| Portfolio Value | Resume Gold |
What you'll build: A controlled "vulnerable target lab" environment plus an exploitation journal documenting (a) the bug class, (b) memory layout evidence, and (c) the exact control-flow hijack achieved: first via code injection, then via ROP.
Why it matters: It turns Chapter 3's security discussion into concrete mechanics: stack discipline, calling conventions, and why mitigations matter.
Core challenges:
- Proving the overwrite boundary and control-flow takeover (stack layout evidence)
- Understanding executable protections and how they change tactics (mitigations reasoning)
- Constructing ROP chains from existing code fragments (machine-level composition)
Key concepts to master:
- Buffer overflows and stack discipline (Ch. 3)
- Return addresses and control transfers (Ch. 3)
- Defensive implications and mitigations (Ch. 3)
Prerequisites: Projects 4 and 5.
Deliverable: Demonstrate (in a sandbox) a reliable hijack and explain exactly why it worked and which mitigation would block it.
Implementation hints:
- Treat this as "learn to defend by learning to break," not as an offensive toolkit
- Journal entries must include memory-map evidence, not just outcomes
Milestones:
- You can reason about stack frames as an attack surface
- You can explain why NX/ASLR changes the game
- You can "read gadgets" the way you read assembly
Real World Outcome
When you complete this project, you will have an "Exploitation Journal" documenting your control-flow hijacking techniques:
# Phase 1: Code Injection Attack
$ ./hex2raw < exploit1.txt | ./ctarget -q
Cookie: 0x59b997fa
Type string:Touch1!: You called touch1()
Valid solution for level 1 with target ctarget
PASS: Would have posted the following:
user id bovik
course 15213-f15
lab attacklab
result 1:PASS:...
# Phase 2: ROP Attack (with NX enabled)
$ ./hex2raw < exploit5.txt | ./rtarget -q
Cookie: 0x59b997fa
Type string:Touch2!: You called touch2(0x59b997fa)
Valid solution for level 2 with target rtarget
PASS: Would have posted the following:
user id bovik
course 15213-f15
lab attacklab
result 1:PASS:...
Your exploitation journal documents the attack methodology:
EXPLOIT 1: Code Injection - Touch1
==================================
Vulnerability: gets() has no bounds checking
Stack Layout:
0x5561dc78: buffer start (40 bytes)
0x5561dca0: saved %rbp
0x5561dca8: return address <- OVERWRITE TARGET
Attack Vector:
- 40 bytes padding + address of touch1 (0x4017c0)
- Little-endian: c0 17 40 00 00 00 00 00
Payload: [40 bytes junk] [0x4017c0]
EXPLOIT 5: ROP Chain - Touch2
=============================
Mitigation: Stack is non-executable (NX bit)
Strategy: Chain existing code "gadgets" to set %rdi = cookie
Gadget Chain:
0x4019cc: popq %rax; ret # Pop cookie into %rax
0x4019c5: movq %rax,%rdi; ret # Move to first argument
0x4017ec: touch2 # Call target
Payload: [40 bytes] [0x4019cc] [cookie] [0x4019c5] [0x4017ec]
The Core Question You're Answering
"How do memory-safety vulnerabilities enable control-flow hijacking, and how do modern mitigations change the exploitation landscape?"
Concepts You Must Understand First
- Stack Frame Layout (CS:APP Ch. 3.7)
- Where is the return address stored relative to local variables?
- What happens when you write past the end of a buffer?
- Control Flow Hijacking (CS:APP Ch. 3.10.3-3.10.4)
- How does overwriting a return address redirect execution?
- What is the difference between code injection and ROP?
- Modern Mitigations (CS:APP Ch. 3.10.4)
- What is stack canary protection? When does it detect attacks?
- What is ASLR? How does it complicate exploitation?
- What is NX (DEP)? Why does it require ROP?
- Gadget Identification (Hacking: Art of Exploitation)
- What makes a useful gadget? (ends in `ret`)
- How do you chain gadgets to achieve computation?
Questions to Guide Your Design
- How do you determine the exact offset from buffer start to return address?
- How do you construct shellcode that fits in limited space?
- How do you find useful gadgets in a binary?
- How do you chain gadgets to pass arguments to functions?
Thinking Exercise
Before crafting any exploit, analyze this vulnerable function:
void getbuf() {
char buf[BUFFER_SIZE];
Gets(buf);
return;
}
getbuf:
sub $0x28,%rsp # Allocate 40 bytes
mov %rsp,%rdi # buf = %rsp
call Gets # Gets(buf) - no bounds check!
add $0x28,%rsp
ret
Questions:
- Where is `buf` located relative to the saved return address?
- How many bytes do you need to write to reach the return address?
- If you want to call `touch1` at 0x4017c0, what bytes do you write?
- Why must addresses be in little-endian format?
The Interview Questions They'll Ask
- "Explain how a buffer overflow attack works."
- "What is Return-Oriented Programming and why is it necessary?"
- "How does ASLR protect against exploitation?"
- "What is a stack canary and how does it work?"
- "How would you defend a system against memory-safety attacks?"
Hints in Layers
Layer 1: Use GDB to find exact stack layout: (gdb) x/20gx $rsp
Layer 2: For code injection, your shellcode runs from the buffer location
Layer 3 - Finding Gadgets:
# Look for "pop; ret" patterns
objdump -d rtarget | grep -A1 "pop"
# Common useful gadgets:
# 58 c3 popq %rax; ret
# 5f c3 popq %rdi; ret
# 48 89 c7 c3 movq %rax,%rdi; ret
Layer 4 - ROP Chain Structure:
[padding to return address]
[gadget1 address] <- first ret goes here
[value for pop] <- popped by gadget1
[gadget2 address] <- gadget1's ret goes here
[target function] <- final destination
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Buffer overflows | Computer Systems: A Programmerโs Perspective | Ch. 3.10.3-3.10.4 |
| Stack discipline | Computer Systems: A Programmerโs Perspective | Ch. 3.7 |
| Exploitation techniques | Hacking: The Art of Exploitation | Ch. 3, 5 |
| ROP fundamentals | Practical Binary Analysis | Ch. 10 |
| Modern mitigations | Low-Level Programming by Igor Zhirkov | Ch. 8-9 |
Phase 3: Architecture & Performance
Project 7: Y86-64 CPU Simulator
| Attribute | Value |
|---|---|
| Language | C (alt: Rust, Zig, C++) |
| Difficulty | Expert |
| Time | 1 month+ |
| Chapters | 4, 5 |
| Coolness | ★★★★★ Pure Magic |
| Portfolio Value | Resume Gold |
What you'll build: A Y86-64 interpreter plus a pipelined model that can emit per-cycle traces (stage contents, hazards, stalls, bubbles).
Why it matters: Chapter 4 is execution mechanics. Modeling a pipeline forces understanding of hazards and control logic.
Core challenges:
- Implementing ISA semantics correctly (instruction execution)
- Modeling hazards and pipeline control (pipelining)
- Validating equivalence between sequential and pipelined execution (correctness)
Key concepts to master:
- Datapath and control (Ch. 4)
- Pipelining and hazards (Ch. 4)
- Correctness vs performance (Ch. 5)
Prerequisites: Strong C, state machine mindset, patience for verification.
Deliverable: Run Y86-64 programs and produce a cycle-by-cycle "why it stalled here" trace.
Implementation hints:
- Start with a "golden" sequential interpreter
- Add pipeline stages as explicit state; treat each cycle as a deterministic transition
Milestones:
- Sequential simulator passes a suite of programs
- Pipelined model matches sequential results
- You can explain every stall/bubble with a specific hazard rule
Real World Outcome
When you complete this project, you will have a Y86-64 simulator with cycle-accurate pipeline tracing:
$ ./y86sim -s prog.yo
Y86-64 Sequential Simulator
Loaded program: prog.yo (156 bytes, 23 instructions)
Cycle PC Instruction Registers Changed
1 0x000 irmovq $0x100, %rsp %rsp = 0x100
2 0x00a call main %rsp = 0x0f8
3 0x058 addq %rdi, %rax %rax = 0xa
Execution complete: 47 cycles, status = HLT
$ ./y86sim -p prog.yo -trace
Y86-64 Pipelined Simulator (5-stage)
Cycle 5:
Fetch: addq %rdi, %rax
Decode: irmovq $0x0, %rax
Execute: irmovq $0xa, %rdi
Memory: call main
Writeback: irmovq $0x100, %rsp
*** HAZARD: Load-use data hazard ***
Action: STALL Fetch+Decode, BUBBLE in Execute
Summary: 52 cycles, 3 data hazards (2 stalls, 1 forwarded)
The Core Question You're Answering
"How does a pipelined processor execute instructions, and what hazards must be detected and resolved to maintain correctness?"
Concepts You Must Understand First
- Y86-64 ISA (CS:APP Ch. 4.1) - Instruction formats and semantics
- Sequential Processor (CS:APP Ch. 4.3) - Fetch, Decode, Execute, Memory, Writeback
- Pipelining (CS:APP Ch. 4.4) - Why it improves throughput
- Hazards (CS:APP Ch. 4.5) - Data (RAW) and control hazards
Questions to Guide Your Design
- How will you represent pipeline registers between stages?
- How will you detect data hazards at decode time?
- How will you implement forwarding?
- How will you handle branch mispredictions?
Thinking Exercise
0x000: irmovq $10, %rax
0x00a: irmovq $3, %rbx
0x014: addq %rax, %rbx # Depends on both previous
When addq reaches Decode, where can it get %rax and %rbx from? Stall or forward?
The Interview Questions They'll Ask
- "Explain the five stages of a classic RISC pipeline."
- "What is a data hazard and how is it resolved?"
- "What is forwarding/bypassing?"
- "What happens on a branch misprediction?"
Hints in Layers
Layer 1: typedef struct { uint8_t icode:4; uint8_t ifun:4; ... } instruction_t;
Layer 2 - Hazard Detection: bool hazard = (D_srcA == E_dstE);
Layer 3: Load-use hazards require stalling, not just forwarding.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Y86-64 ISA | Computer Systems: A Programmerโs Perspective | Ch. 4.1 |
| Pipelining | Computer Systems: A Programmerโs Perspective | Ch. 4.4-4.5 |
| Hazards | Computer Organization and Design (Patterson) | Ch. 4.5-4.7 |
Project 8: Performance Clinic
| Attribute | Value |
|---|---|
| Language | C (alt: Rust, Zig, C++) |
| Difficulty | Advanced |
| Time | 1–2 weeks |
| Chapters | 5, 6, 1 |
| Coolness | ★★★★☆ Genuinely Clever |
| Portfolio Value | Micro-SaaS/Pro Tool |
What you'll build: A benchmark suite of small kernels plus a written optimization report explaining changes in terms of ILP, branch prediction, and locality.
Why it matters: Chapter 5 is about turning "fast" into measurable, explainable transformations.
Core challenges:
- Stable measurements (methodology)
- Transformations that improve ILP / reduce mispredicts (CPU behavior)
- Avoiding โfaster by accidentโ (experimental rigor)
Key concepts to master:
- Loop transformations and tuning (Ch. 5)
- Bottlenecks: compute vs memory (Ch. 5-6)
- Limits: Amdahl's Law intuition (Ch. 1)
Prerequisites: Project 1.
Deliverable: A portfolio-quality report with before/after results and a strong "why" narrative.
Implementation hints:
- Keep kernels tiny; control the environment; log everything needed to reproduce
Milestones:
- Measurements become stable and repeatable
- You can predict when an optimization backfires
- You explain improvements as architecture effects, not folklore
Real World Outcome
When you complete this project, you will have a benchmark suite with detailed performance analysis:
$ ./perfclinic --kernel=dotprod --optimize
Performance Clinic: Dot Product Kernel
=======================================
BASELINE (naive implementation):
for (i = 0; i < n; i++)
sum += a[i] * b[i];
Cycles: 4,892,341
CPE (Cycles Per Element): 4.89
Bottleneck: Loop-carried dependency on 'sum'
OPTIMIZATION 1: Loop Unrolling (4x)
for (i = 0; i < n; i += 4) {
sum += a[i]*b[i] + a[i+1]*b[i+1] +
a[i+2]*b[i+2] + a[i+3]*b[i+3];
}
Cycles: 2,456,782
CPE: 2.46 (1.99x speedup)
Why: Reduced loop overhead, but still serialized on 'sum'
OPTIMIZATION 2: Multiple Accumulators
for (i = 0; i < n; i += 4) {
sum0 += a[i]*b[i]; sum1 += a[i+1]*b[i+1];
sum2 += a[i+2]*b[i+2]; sum3 += a[i+3]*b[i+3];
}
sum = sum0 + sum1 + sum2 + sum3;
Cycles: 1,234,567
CPE: 1.23 (3.97x speedup over baseline)
Why: Breaks loop-carried dependency, enables ILP
Theoretical limit: CPE ~1.0 (FP latency = 4, throughput = 1)
OPTIMIZATION 3: SIMD (AVX)
__m256d sum_vec = _mm256_setzero_pd();
for (i = 0; i < n; i += 4) {
sum_vec = _mm256_fmadd_pd(
_mm256_load_pd(&a[i]),
_mm256_load_pd(&b[i]), sum_vec);
}
Cycles: 312,456
CPE: 0.31 (15.7x speedup over baseline)
Why: 4 elements per SIMD operation
Performance Profile (perf stat):
Instructions: 1,234,567
Cycles: 312,456
IPC: 3.95 (superscalar)
L1 cache misses: 0.02%
Branch mispred: 0.01%
The Core Question You're Answering
"How do I measure, explain, and improve program performance in terms of CPU microarchitecture effects?"
Concepts You Must Understand First
- Latency vs Throughput (CS:APP Ch. 5.7)
- What is the latency of a floating-point multiply?
- What is the throughput (operations per cycle)?
- Loop-Carried Dependencies (CS:APP Ch. 5.8)
- Why does a sequential sum limit performance?
- How do multiple accumulators help?
- Instruction-Level Parallelism (CS:APP Ch. 5.9)
- How many independent operations can execute per cycle?
- What limits ILP in practice?
- Branch Prediction (CS:APP Ch. 5.12)
- What patterns are predictable?
- How do mispredictions affect performance?
- Memory Hierarchy Effects (CS:APP Ch. 6)
- When is a kernel compute-bound vs memory-bound?
- How does cache locality affect performance?
Questions to Guide Your Design
- How will you ensure stable, reproducible measurements?
- How will you identify the bottleneck (compute, memory, branches)?
- How will you verify your optimization actually helps?
- How will you explain WHY the optimization works?
Thinking Exercise
Before optimizing, analyze this loop:
double poly(double a[], double x, int degree) {
double result = a[0];
double xpwr = x;
for (int i = 1; i <= degree; i++) {
result += a[i] * xpwr;
xpwr *= x;
}
return result;
}
Questions:
- What is the loop-carried dependency?
- What is the theoretical minimum CPE?
- How would Horner's method change the dependency pattern?
- Would loop unrolling help? Why or why not?
The Interview Questions They'll Ask
- "How do you identify a performance bottleneck?"
- "Explain instruction-level parallelism."
- "What is loop unrolling and when does it help?"
- "How do branch mispredictions affect performance?"
- "When is a program compute-bound vs memory-bound?"
Hints in Layers
Layer 1 - Stable Measurement:
# Disable turbo boost, set governor to performance
sudo cpupower frequency-set -g performance
# Run multiple trials, report median
for i in {1..10}; do ./bench; done | sort -n | head -5 | tail -1
Layer 2 - Profiling:
perf stat -e cycles,instructions,cache-misses ./bench
perf record ./bench && perf report
Layer 3 - Multiple Accumulators:
// Transform: sum += a[i] * b[i];
// Into:
double sum0=0, sum1=0, sum2=0, sum3=0;
for (i = 0; i < n; i += 4) {
sum0 += a[i]*b[i]; sum1 += a[i+1]*b[i+1];
sum2 += a[i+2]*b[i+2]; sum3 += a[i+3]*b[i+3];
}
return sum0 + sum1 + sum2 + sum3;
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Performance optimization | Computer Systems: A Programmerโs Perspective | Ch. 5 |
| Memory hierarchy | Computer Systems: A Programmerโs Perspective | Ch. 6 |
| Limits of parallelism | Computer Systems: A Programmerโs Perspective | Ch. 5.9-5.11 |
| CPU microarchitecture | Write Great Code Vol 1 | Ch. 3-4 |
| SIMD programming | Write Great Code Vol 2 | Ch. 12-14 |
| Profiling tools | Linux System Programming | Ch. 10 |
Project 9: Cache Lab++ – Cache Simulator + Locality Visualizer
| Attribute | Value |
|---|---|
| Language | C (alt: Rust, Zig, C++) |
| Difficulty | Advanced |
| Time | 2–3 weeks |
| Chapters | 6, 5 |
| Coolness | ★★★★★ Hardcore Tech Flex |
| Portfolio Value | Resume Gold |
What you'll build: A set-associative cache simulator plus an "ASCII locality visualizer" that shows hit/miss patterns for selected code paths.
Why it matters: Chapter 6 lands only when you can simulate misses and then change code to improve locality.
Core challenges:
- Tag/index/offset logic correctness (cache organization)
- Replacement policy and statistics (behavior)
- Improving a real kernel via locality (spatial/temporal locality)
Key concepts to master:
- Cache organization and locality (Ch. 6)
- Miss types and their causes (Ch. 6)
- Measurement discipline (Ch. 5)
Prerequisites: Projects 2 and 8 recommended.
Deliverable: Demonstrate a miss-rate reduction with a locality explanation.
Implementation hints:
- Produce both aggregate stats and per-access event logs
- Use deliberately-designed access patterns to isolate compulsory/conflict/capacity misses
Milestones:
- Simulator matches known traces
- You can explain each miss type with concrete scenarios
- You can design data layouts to target cache behavior
Real World Outcome
$ ./csim -v -s 4 -E 2 -b 4 -t traces/matrix_multiply.trace
Cache Configuration:
Sets: 16 (s=4), Lines per set: 2 (E=2), Block size: 16 bytes (b=4)
Total cache size: 512 bytes
Processing trace: traces/matrix_multiply.trace
---------------------------------------------------
L 0x00601040, 8 miss [Set 4: loaded block 0x00601040]
L 0x00601048, 8 hit [Set 4: block 0x00601040 still valid]
S 0x00602080, 8 miss [Set 8: loaded block 0x00602080]
L 0x00601050, 8 miss [Set 5: loaded block 0x00601050]
L 0x00601058, 8 hit [Set 5: block 0x00601050 still valid]
L 0x00601100, 8 miss [Set 0: loaded block 0x00601100]
L 0x00601108, 8 hit [Set 0: block 0x00601100 still valid]
S 0x00602088, 8 hit [Set 8: block 0x00602080 still valid]
L 0x00601060, 8 miss [Set 6: loaded block 0x00601060]
L 0x00601180, 8 miss [Set 8: evict LRU, loaded block 0x00601180]
...
Summary:
hits: 4,847 misses: 1,153 evictions: 641
hit rate: 80.8% miss rate: 19.2%
miss breakdown: compulsory=256 capacity=512 conflict=385
$ ./csim -locality traces/matrix_multiply.trace
Locality Visualization (temporal window=8 accesses)
===================================================
Address Heat Map (most accessed blocks):
Block 0x00601040: ████████████████████ 847 accesses (hot)
Block 0x00601100: ████████████████ 672 accesses
Block 0x00602080: ████████████ 501 accesses
Block 0x00601180: ████████ 334 accesses
...
Access Pattern Timeline (showing set utilization):
Time →  0     100   200   300   400   500
Set 0: █░█░█░█░█░█░█░█░█░█░█░█░█░█░█░ (strided)
Set 4: ██████████████████████████████ (temporal locality - HOT)
Set 8: █░░█░░█░░█░░█░░█░░█░░█░░█░░█░░ (interleaved - thrashing!)
Spatial Locality Score: 0.73 (good - sequential block access)
Temporal Locality Score: 0.45 (moderate - reuse distance varies)
Recommendation: Consider blocking/tiling to improve temporal locality
Current working set estimate: 2.3 KB
Cache capacity: 512 bytes
Suggested tile size: 8x8 elements (fits in cache)
The Core Question You're Answering
"How does the memory hierarchy create the illusion of fast, infinite memory, and how can I write code that exploits locality to make this illusion work in my favor?"
Concepts You Must Understand First
- Cache Organization (sets, lines, blocks)
- How do you compute which set an address maps to?
- What is the difference between direct-mapped, set-associative, and fully associative?
- Given address `0x12345678` and a 4-way set-associative cache with 64 sets and 32-byte blocks, which set does this address map to?
- Tag/Index/Offset Address Decomposition
- How do you split a 64-bit address into tag, index, and offset fields?
- What determines the size of each field?
- Why must block size be a power of 2?
- Book: CS:APP Chapter 6.4.1-6.4.3
- The Three Types of Cache Misses
- What causes compulsory (cold) misses? Can they be eliminated?
- What causes capacity misses? How do you detect them?
- What causes conflict misses? Why do they occur even when cache is not full?
- Book: CS:APP Chapter 6.4.4 (Issues with Writes) and 6.4.5 (Cache Performance)
- Replacement Policies
- How does LRU (Least Recently Used) work? What data structure tracks recency?
- What is the performance difference between LRU and random replacement?
- Book: CS:APP Chapter 6.4.2 (Set Associative Caches)
- Spatial and Temporal Locality
- What code patterns exhibit temporal locality? Spatial locality?
- Why does row-major vs column-major iteration matter for 2D arrays?
- How does stride length affect cache performance?
- Book: CS:APP Chapter 6.2 (Locality) and 6.5 (Writing Cache-Friendly Code)
- Working Set and Cache Thrashing
- What is a working set? How do you estimate it?
- When does thrashing occur? What are the symptoms?
- Book: CS:APP Chapter 6.3 (Memory Hierarchy) and OSTEP Chapter 22
Questions to Guide Your Design
- Data Structure Choice: How will you represent a cache line? What fields do you need (valid bit, tag, LRU counter, dirty bit)?
- Address Parsing: Will you use bit manipulation or arithmetic to extract tag/index/offset? Which is clearer?
- LRU Implementation: Will you use counters, a linked list, or bit manipulation for tracking LRU? What are the tradeoffs?
- Trace Format: How will you parse the Valgrind lackey trace format? What about other formats?
- Statistics Tracking: How will you distinguish compulsory from capacity from conflict misses?
- Visualization: How will you represent temporal patterns? Access heat? Set utilization?
Thinking Exercise
Consider this code and trace what happens in a direct-mapped cache with 4 sets, 16-byte blocks:
// Array A is at address 0x1000, Array B is at address 0x1100
// Each int is 4 bytes
int A[64], B[64]; // A at 0x1000, B at 0x1100
for (int i = 0; i < 64; i++) {
A[i] = B[i] + 1; // Load B[i], then store A[i]
}
Hand-trace questions:
- Address of `A[0]`? Of `B[0]`? What set does each map to?
- Address of `A[4]`? Of `B[4]`? (Hint: stride of 16 bytes)
- Do `A[0]` and `B[0]` map to the same set? What about `A[4]` and `B[4]`?
- On iteration `i=0`: What happens when loading `B[0]`? (miss/hit?)
- On iteration `i=0`: What happens when storing `A[0]`? Does it evict `B[0]`'s block?
- On iteration `i=4`: What happens? Do we reload `B[4]` or is it already cached?
- What is the miss rate for this loop? Can you predict it before simulating?
- How would you restructure this code to improve locality?
The Interview Questions They'll Ask
- "Explain how a CPU cache works and why it matters for performance."
- Expected: Set/line/block organization, locality exploitation, miss penalty discussion
- "You're seeing poor performance in your matrix multiplication. How would you diagnose if it's a cache issue?"
- Expected: Profiling tools (perf, cachegrind), miss rate analysis, working set estimation
- "What is cache thrashing and how would you fix it?"
- Expected: Conflict misses from aliasing, solutions include padding, blocking/tiling, changing data layout
- "Explain the difference between temporal and spatial locality. Give code examples of each."
- Expected: Temporal = reusing same data, Spatial = accessing nearby addresses, concrete loop examples
- "Why does iterating a 2D array row-by-row vs column-by-column have such different performance?"
- Expected: Memory layout (row-major in C), spatial locality, stride analysis
- "Design a cache-friendly algorithm for transposing a large matrix."
- Expected: Blocking/tiling to fit working set in cache, discussion of tile size selection
Hints in Layers
Layer 1 - Getting Started: Start by parsing the trace format and printing each access. Implement a direct-mapped cache first (E=1) before handling set-associativity.
// Trace line format: "L 0x00601040, 8" means Load address 0x601040, size 8
typedef struct {
char op; // 'L' load, 'S' store, 'M' modify (load+store)
uint64_t address;
int size;
} trace_entry_t;
Layer 2 - Address Decomposition: The key insight is that tag, index, and offset are just different bit ranges of the address:
// For a cache with s index bits, b offset bits:
uint64_t offset = address & ((1ULL << b) - 1);
uint64_t set_index = (address >> b) & ((1ULL << s) - 1);
uint64_t tag = address >> (s + b);
Layer 3 - Cache Line Structure: Think about what state you need per line:
typedef struct {
int valid; // Is this line holding data?
uint64_t tag; // Tag bits from the address
uint64_t lru_counter; // For LRU replacement (higher = more recent)
// Note: You don't need to store actual data for simulation!
} cache_line_t;
Layer 4 - Miss Classification: To classify misses, track additional state:
// Compulsory: First access to this block ever (track in a set of seen blocks)
// Conflict: Cache not full, but eviction occurred
// Capacity: Would miss even with fully-associative cache of same size
// Hint: Run two simulations - one with your cache, one with "infinite" associativity
Layer 5 - LRU Implementation: For small associativity (E <= 8), a simple counter approach works:
// On cache hit or fill, update LRU counters:
for (int i = 0; i < E; i++) {
if (set[i].lru_counter < accessed_line->lru_counter)
set[i].lru_counter++; // Age other lines
}
accessed_line->lru_counter = 0; // Most recently used = 0
// For eviction: find line with highest lru_counter
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Cache organization and design | Computer Systems: A Programmerโs Perspective | Ch. 6.4 Cache Memories |
| Writing cache-friendly code | Computer Systems: A Programmerโs Perspective | Ch. 6.5 Writing Cache-Friendly Code |
| Impact on matrix operations | Computer Systems: A Programmerโs Perspective | Ch. 6.6 Cache Performance |
| Memory hierarchy overview | Computer Systems: A Programmerโs Perspective | Ch. 6.1-6.3 |
| Virtual memory and caching | Operating Systems: Three Easy Pieces | Ch. 19-22 (Memory Virtualization) |
| Cache design tradeoffs | Computer Organization and Design (Patterson & Hennessy) | Ch. 5.3-5.4 |
| Practical cache analysis | Linux System Programming (Robert Love) | Ch. 9 Memory Management |
| Performance measurement | Computer Systems: A Programmerโs Perspective | Ch. 5 Optimizing Program Performance |
Phase 4: Systems Programming
Project 10: ELF Link Map & Interposition Toolkit
| Attribute | Value |
|---|---|
| Language | C (alt: Rust, Zig, C++) |
| Difficulty | Advanced |
| Time | 2โ3 weeks |
| Chapters | 7 |
| Coolness | โ โ โ โ โ Hardcore Tech Flex |
| Portfolio Value | Service & Support |
What you'll build: A tool that summarizes symbol/relocation info for ELF objects and demonstrates dynamic interposition (function call hooking) with evidence logs.
Why it matters: It makes symbols, relocation, and runtime resolution concrete.
Core challenges:
- Parsing ELF structures (object file format)
- Explaining relocation and binding (static + dynamic linking)
- Demonstrating interposition safely (loader behavior)
Key concepts to master:
- Relocation and symbol resolution (Ch. 7)
- Static vs dynamic linking tradeoffs (Ch. 7)
- Library interpositioning (Ch. 7)
Prerequisites: Linux environment (VM/container), basic binary tooling.
Deliverable: Prove and explain "why my program called that function from that library."
Implementation hints:
- Keep introspection read-only first
- Interposition logs must include caller, callee, and resolved address evidence
Milestones:
- You interpret link maps confidently
- You explain PLT/GOT behavior without hand-waving
- You use interposition to debug/profile real programs
Real World Outcome
$ ./elfmap /usr/bin/ls
ELF Analysis: /usr/bin/ls
================================================================================
Type: ELF64 Executable (dynamically linked)
Entry point: 0x00005850
Interpreter: /lib64/ld-linux-x86-64.so.2
Section Headers:
[Nr] Name Type Address Offset Size
[ 0] NULL 0000000000000000 00000000 0
[ 1] .interp PROGBITS 0000000000000318 00000318 28
[11] .plt PROGBITS 0000000000005020 00005020 1424
[12] .plt.got PROGBITS 0000000000005590 00005590 24
[13] .text PROGBITS 00000000000055b0 000055b0 73521
[24] .got PROGBITS 000000000021ff98 0001ff98 104
[25] .got.plt PROGBITS 0000000000220000 00020000 728
[26] .data PROGBITS 00000000002202e0 000202e0 616
[27] .bss NOBITS 0000000000220548 00020548 4824
Symbol Table (.dynsym) - 118 entries:
Type Bind Name Library
FUNC GLOBAL printf libc.so.6
FUNC GLOBAL malloc libc.so.6
FUNC GLOBAL __libc_start_main libc.so.6
FUNC GLOBAL strcmp libc.so.6
FUNC GLOBAL opendir libc.so.6
FUNC WEAK __gmon_start__ (undefined)
...
Relocation Entries (.rela.plt) - 89 entries:
Offset Info Type Symbol + Addend
0000000220018 000100000007 R_X86_64_JUMP_SLOT printf@GLIBC_2.2.5 + 0
0000000220020 000200000007 R_X86_64_JUMP_SLOT malloc@GLIBC_2.2.5 + 0
0000000220028 000300000007 R_X86_64_JUMP_SLOT __libc_start_main + 0
Dynamic Dependencies:
NEEDED: libselinux.so.1
NEEDED: libc.so.6
$ ./elfmap --plt-trace /usr/bin/ls
PLT/GOT Lazy Binding Trace:
=============================
Before first call to printf():
GOT[printf] @ 0x220018 = 0x5026 (points to PLT stub)
[CALL] printf@plt (first call)
-> PLT stub pushes reloc index, jumps to resolver
-> ld.so resolves printf to 0x7f3a2c4a5c40 (libc.so.6)
-> GOT[printf] updated: 0x5026 -> 0x7f3a2c4a5c40
After first call:
GOT[printf] @ 0x220018 = 0x7f3a2c4a5c40 (direct to libc)
[CALL] printf (second call)
-> Direct jump via GOT, no resolver needed
$ ./interpose malloc ./myprogram arg1 arg2
=== Interposition Library Loaded ===
Wrapping: malloc, free, calloc, realloc
[14:23:45.001] malloc(64) = 0x55a3b2c00010 [caller: 0x55a3b1a00a32 main+18]
[14:23:45.002] malloc(1024) = 0x55a3b2c00060 [caller: 0x55a3b1a00a58 main+56]
[14:23:45.003] malloc(256) = 0x55a3b2c00470 [caller: 0x55a3b1a00b12 process_data+22]
[14:23:45.004] free(0x55a3b2c00060) [caller: 0x55a3b1a00b98 process_data+158]
[14:23:45.005] realloc(0x55a3b2c00010, 128) = 0x55a3b2c00010 [caller: 0x55a3b1a00c04 resize_buffer+12]
=== Interposition Summary ===
Total allocations: 47
Total frees: 45
Current heap usage: 384 bytes
Peak heap usage: 8,192 bytes
Potential leaks: 2 blocks (384 bytes)
- 0x55a3b2c00470 (256 bytes) allocated at main+56
- 0x55a3b2c00590 (128 bytes) allocated at process_data+98
The Core Question You're Answering
"How does a collection of separately compiled object files become a running program, and how can I observe and modify the symbol resolution process at runtime?"
Concepts You Must Understand First
- ELF File Format Structure
- What are the major components of an ELF file (headers, sections, segments)?
- What is the difference between sections and segments? When is each used?
- What information does the ELF header contain?
- Book: CS:APP Chapter 7.4 (Relocatable Object Files) and The Linux Programming Interface Ch. 41
- Symbol Tables and Symbol Resolution
- What is a symbol? What types of symbols exist (global, local, weak)?
- How does the linker resolve duplicate symbol definitions?
- What happens with unresolved symbols?
- Book: CS:APP Chapter 7.5 (Symbols and Symbol Tables) and 7.6 (Symbol Resolution)
- Relocation Process
- Why is relocation necessary? What problem does it solve?
- What information is in a relocation entry?
- What are PC-relative vs absolute relocations?
- Book: CS:APP Chapter 7.7 (Relocation)
- Static vs Dynamic Linking
- What are the tradeoffs between static and dynamic linking?
- When is each appropriate?
- What is a shared library? How does it differ from a static archive?
- Book: CS:APP Chapter 7.10 (Dynamic Linking with Shared Libraries)
- PLT and GOT (Lazy Binding)
- What is the Procedure Linkage Table? The Global Offset Table?
- How does lazy binding work? What triggers resolution?
- Why does the first call to a library function take longer?
- Book: CS:APP Chapter 7.12 (Position-Independent Code) and Practical Binary Analysis Ch. 2
- Library Interposition
- What is function interposition? Why is it useful?
- What are compile-time, link-time, and runtime interposition?
- How does LD_PRELOAD work?
- Book: CS:APP Chapter 7.13 (Library Interpositioning)
Questions to Guide Your Design
- ELF Parsing Strategy: Will you parse the ELF manually, use libelf, or memory-map and cast to structures?
- Output Format: How will you present symbol tables and relocations in a human-readable way? What groupings help understanding?
- Cross-Reference: How will you show which relocations reference which symbols?
- Dynamic Analysis: How will you trace PLT/GOT behavior at runtime? ptrace? Interposition?
- Interposition Library: What functions will you interpose? How will you call the original function?
- Evidence Logging: What information must you capture to prove "this call went through this resolution path"?
Thinking Exercise
Consider this scenario with two object files being linked:
// main.c
extern int counter;
extern void increment(void);
int main(void) {
increment();
return counter;
}
// lib.c
int counter = 0;
void increment(void) {
counter++;
}
Compile to object files and examine:
gcc -c main.c -o main.o
gcc -c lib.c -o lib.o
Hand-trace questions:
- In `main.o`, what symbols are UNDEFINED? What symbols are defined?
- In `lib.o`, what symbols are defined? Are they global or local?
- What relocation entries does `main.o` have? What type are they?
- When the linker processes these files, how does it resolve the `counter` reference in main.o?
- If you add `static` to `counter` in lib.c, what error do you get? Why?
- If you add a second file with `int counter = 5;`, what happens? (Strong vs weak symbols)
- Now compile as a shared library: `gcc -fPIC -shared lib.c -o libmylib.so`. What changes in the relocation types?
- How would the GOT entry for `counter` get filled at runtime?
The Interview Questions They'll Ask
- "Explain the difference between static and dynamic linking. When would you choose each?"
- Expected: Tradeoffs (startup time, memory sharing, updates, deployment), concrete scenarios
- "What happens when you call printf() for the first time in a dynamically linked program?"
- Expected: PLT stub, GOT lookup, lazy binding, runtime linker resolution, GOT update
- "How would you intercept all malloc calls in a program without modifying its source?"
- Expected: LD_PRELOAD, dlsym with RTLD_NEXT, wrapper function pattern
- "What is Position-Independent Code and why is it needed for shared libraries?"
- Expected: Load address independence, PC-relative addressing, GOT for data references
- "You're debugging a program that crashes in a library function. How do you determine which library provided that function?"
- Expected: ldd, /proc/PID/maps, nm, readelf, examining PLT/GOT at crash time
- "Explain the One Definition Rule and how the linker handles multiple definitions."
- Expected: Strong vs weak symbols, resolution rules, static keyword effect
Hints in Layers
Layer 1 - Getting Started: Use existing tools to understand the format before parsing yourself:
# See all sections
readelf -S /usr/bin/ls
# See symbol table
readelf -s /usr/bin/ls
# See relocations
readelf -r /usr/bin/ls
# See dynamic dependencies
ldd /usr/bin/ls
Layer 2 - ELF Header Parsing: The ELF header is at offset 0 and tells you where everything else is:
#include <elf.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
// Memory-map the file
int fd = open(path, O_RDONLY);
struct stat st;
fstat(fd, &st);
void *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
// Cast to ELF header
Elf64_Ehdr *ehdr = (Elf64_Ehdr *)map;
// Find section headers
Elf64_Shdr *shdr = (Elf64_Shdr *)((char *)map + ehdr->e_shoff);
int num_sections = ehdr->e_shnum;
Layer 3 - Symbol Table Navigation: Symbol tables are in sections of type SHT_SYMTAB or SHT_DYNSYM:
// Find the string table for symbol names
Elf64_Shdr *strtab_section = &shdr[symtab_section->sh_link];
char *strtab = (char *)map + strtab_section->sh_offset;
// Iterate symbols
Elf64_Sym *symtab = (Elf64_Sym *)((char *)map + symtab_section->sh_offset);
int num_syms = symtab_section->sh_size / sizeof(Elf64_Sym);
for (int i = 0; i < num_syms; i++) {
char *name = strtab + symtab[i].st_name;
int type = ELF64_ST_TYPE(symtab[i].st_info);
int bind = ELF64_ST_BIND(symtab[i].st_info);
// ...
}
Layer 4 - Interposition Library: Create a shared library that wraps functions:
// malloc_wrapper.c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
// Function pointer to real malloc
static void *(*real_malloc)(size_t) = NULL;
void *malloc(size_t size) {
if (!real_malloc) {
real_malloc = dlsym(RTLD_NEXT, "malloc");
}
void *ptr = real_malloc(size);
fprintf(stderr, "malloc(%zu) = %p\n", size, ptr);
return ptr;
}
// Compile: gcc -fPIC -shared -o libmalloc_wrapper.so malloc_wrapper.c -ldl
// Use: LD_PRELOAD=./libmalloc_wrapper.so ./myprogram
Layer 5 - PLT/GOT Tracing: To observe lazy binding, examine the GOT before and after first call:
// Get GOT address from /proc/self/maps or by parsing ELF
// Read GOT entry before call (will point to PLT+6)
// Call function
// Read GOT entry after (will point to actual function in libc)
// Or use ptrace to single-step through PLT resolution
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Object files and linking | Computer Systems: A Programmer's Perspective | Ch. 7 Linking |
| ELF format details | The Linux Programming Interface | Ch. 41 Fundamentals of Shared Libraries |
| Shared library mechanics | The Linux Programming Interface | Ch. 42 Advanced Features of Shared Libraries |
| Symbol resolution rules | Computer Systems: A Programmer's Perspective | Ch. 7.6 Symbol Resolution |
| Position-independent code | Computer Systems: A Programmer's Perspective | Ch. 7.12 Position-Independent Code |
| Library interposition | Computer Systems: A Programmer's Perspective | Ch. 7.13 Library Interpositioning |
| ELF internals | Practical Binary Analysis | Ch. 2 The ELF Format |
| Dynamic linking internals | Linux System Programming (Robert Love) | Ch. 8 File and Directory Management |
| Linker scripts and details | Linkers and Loaders (John Levine) | Ch. 3-7 |
Project 11: Signals + Processes Sandbox
| Attribute | Value |
|---|---|
| Language | C (alt: Rust, Zig, C++) |
| Difficulty | Advanced |
| Time | 1–2 weeks |
| Chapters | 8 |
| Coolness | ★★★★☆ Genuinely Clever |
| Portfolio Value | Resume Gold |
What you'll build: A harness that runs child processes in controlled modes (normal exit, crash, stop/continue, timeout) and logs exactly which ECF events occurred and why.
Why it matters: Chapter 8 is about the realities of process control and signals; observation is mandatory.
Core challenges:
- Correct process creation/reaping (process lifecycle)
- Async-signal-safe handler design (safe signal handling)
- Avoiding zombies and race windows (correctness)
Key concepts to master:
- Process lifecycle (Ch. 8)
- Signals and handlers (Ch. 8)
- Nonlocal control (Ch. 8)
Prerequisites: Basic OS concepts.
Deliverable: Demonstrate zombies/orphans/signal races and explain how to prevent them.
Implementation hints:
- Treat each mode like a lab experiment; isolate one behavior per run
- Produce a timeline log: spawn → signal → status change → reap
Milestones:
- You can explain why zombies happen
- Your signal handlers are correct, not "sometimes works"
- You can reason about race windows without superstition
Real World Outcome
$ ./procsandbox --mode=lifecycle
Process Lifecycle Sandbox
==========================
Demonstrating: fork, exec, wait, exit
[Parent PID=1234] Forking child...
[14:30:01.001] fork() returned 1235 in parent
[14:30:01.001] fork() returned 0 in child (PID=1235)
[Child PID=1235] Executing /bin/echo "Hello from child"
[14:30:01.002] execve("/bin/echo", ["echo", "Hello from child"], envp)
Hello from child
[14:30:01.003] Child 1235 called exit(0)
[Parent PID=1234] waitpid() returned: child=1235, status=0x0000
-> WIFEXITED: true, exit code: 0
Process Timeline:
Parent [1234]: ----[fork]--------------------[wait/reap]----
Child [1235]: |----[exec]----[run]----[exit]---|
t=0 t=1ms t=2ms t=3ms
$ ./procsandbox --mode=zombie
Zombie Process Demonstration
=============================
[Parent PID=1234] Creating child without reaping...
[14:30:05.001] Child 1236 created
[14:30:05.002] Child 1236 exiting immediately
[14:30:05.003] Child 1236 is now a ZOMBIE (parent hasn't called wait)
Process Status (from /proc):
PID PPID STATE COMMAND
1234 1233 S procsandbox
1236 1234 Z [procsandbox] <defunct> <-- ZOMBIE!
[14:30:07.000] Parent now calling waitpid()...
[14:30:07.001] Zombie 1236 reaped, status=0x0000
$ ./procsandbox --mode=signals
Signal Handling Demonstration
==============================
[PID=1234] Installing handlers for SIGINT, SIGCHLD, SIGTSTP, SIGUSR1
[14:30:10.001] Forking child 1237 (will run for 5 seconds)...
[14:30:10.002] Child 1237 running: ./sleeper 5
--- Press Ctrl+C to send SIGINT ---
^C
[14:30:12.500] Received SIGINT (signal 2)
Handler context:
- Interrupted syscall: yes (was in read())
- errno preserved: yes (was EINTR, restored to 0)
- SA_RESTART set: no (syscall returns -1/EINTR)
[14:30:12.501] Forwarding SIGINT to child process group...
[14:30:12.502] Child 1237 terminated by signal 2 (SIGINT)
[14:30:12.503] SIGCHLD received
Handler actions (async-signal-safe only):
- Saved errno: 0
- Called waitpid(-1, &status, WNOHANG): returned 1237
- WIFSIGNALED(status): true, signal: 2
- Restored errno: 0
$ ./procsandbox --mode=race
Signal Race Condition Demonstration
====================================
INCORRECT PATTERN (race window exists):
```c
pid_t pid = fork();
if (pid == 0) {
    execve(...);  // Child runs
}
// RACE WINDOW: SIGCHLD might arrive HERE, before job added!
addjob(pid);  // Parent adds job
```

Running 1000 iterations with race-prone code… [Results] Failures: 47/1000 (child reaped before job added)

CORRECT PATTERN (block signals around critical section):

```c
sigprocmask(SIG_BLOCK, &mask_chld, &prev);  // Block SIGCHLD
pid_t pid = fork();
if (pid == 0) {
    sigprocmask(SIG_SETMASK, &prev, NULL);  // Unblock in child
    execve(...);
}
addjob(pid);  // Safe: SIGCHLD blocked
sigprocmask(SIG_SETMASK, &prev, NULL);  // Unblock, handler runs
```

Running 1000 iterations with correct code… [Results] Failures: 0/1000 (no races detected)
### The Core Question You're Answering
**"How does the operating system manage the lifecycle of processes, and how can programs respond to asynchronous events (signals) correctly and safely?"**
### Concepts You Must Understand First
1. **Process Creation and the fork() Model**
- What does fork() return in the parent? In the child?
- What is shared between parent and child after fork? What is copied?
- Why does fork return twice?
- Book: CS:APP Chapter 8.4.2 (Creating Processes) and TLPI Chapter 24
2. **The exec Family and Process Replacement**
- What happens to the calling process during exec?
- What is preserved across exec? What is not?
- When does exec return? What does it return?
- Book: CS:APP Chapter 8.4.5 (Loading and Running Programs) and TLPI Chapter 27
3. **Process Termination and Reaping**
- What is a zombie process? Why do they exist?
- What is the difference between wait() and waitpid()?
- What do WIFEXITED, WIFSIGNALED, and WIFSTOPPED tell you?
- Book: CS:APP Chapter 8.4.3 (Reaping Child Processes) and TLPI Chapter 26
4. **Signals: Asynchronous Events**
- What is a signal? What triggers signal delivery?
- What is the difference between generating, delivering, and handling a signal?
- What signals are sent by Ctrl+C, Ctrl+Z? What is their default behavior?
- Book: CS:APP Chapter 8.5 (Signals) and TLPI Chapters 20-22
5. **Signal Handlers and Async-Signal-Safety**
- Why can't you call printf() in a signal handler?
- What functions are async-signal-safe? Why does this matter?
- What is the volatile sig_atomic_t type for?
- Book: CS:APP Chapter 8.5.5 (Writing Signal Handlers) and TLPI Chapter 21.1
6. **Signal Blocking and Critical Sections**
- How do you block signals? Why would you want to?
- What is a signal mask? How does sigprocmask work?
- What happens to blocked signals? Are they queued?
- Book: CS:APP Chapter 8.5.6 (Synchronizing Flows) and TLPI Chapter 20.10
7. **Process Groups and Sessions**
- What is a process group? Why do shells use them?
- How does the kernel know which process to send SIGINT to when you press Ctrl+C?
- What is a controlling terminal?
- Book: CS:APP Chapter 8.5.2 (Sending Signals) and TLPI Chapter 34
### Questions to Guide Your Design
1. **Test Harness Structure**: How will you organize different demonstration modes (lifecycle, signals, races)?
2. **Observability**: How will you log events with precise timestamps? How will you show the timeline?
3. **Signal Handler Design**: How will you make handlers async-signal-safe while still logging useful information?
4. **Race Reproduction**: How will you reliably reproduce race conditions for educational purposes?
5. **Process State Inspection**: Will you use /proc filesystem? waitpid flags? Both?
6. **Error Handling**: How will you handle EINTR from interrupted system calls?
### Thinking Exercise
Consider this signal handler:
```c
volatile sig_atomic_t got_sigchld = 0;
int child_count = 0; // Number of children to reap
void sigchld_handler(int sig) {
int olderrno = errno;
pid_t pid;
int status;
// Reap ALL available children (might be multiple)
while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
child_count--; // BUG: not async-signal-safe!
// What if main code was in middle of reading child_count?
}
got_sigchld = 1;
errno = olderrno;
}
```
Hand-trace questions:
- Why do we save and restore errno? What could corrupt it?
- Why use WNOHANG? What would happen without it?
- Why the while loop instead of a single waitpid call?
- The code modifies `child_count` - why is this problematic?
- What if two SIGCHLD signals arrive "simultaneously"? Are both delivered?
- What is `volatile sig_atomic_t` and why is it needed for `got_sigchld`?
- How would you fix the `child_count` update to be safe?
- Write a main loop that correctly checks `got_sigchld` and processes reaped children.
The Interview Questions They'll Ask
- "Explain what happens when you type Ctrl+C in a terminal running a program."
- Expected: Terminal driver sends SIGINT to foreground process group, default handler terminates process
- "What is a zombie process and how do you prevent them?"
- Expected: Terminated child waiting to be reaped, parent must call wait/waitpid, SIGCHLD handler for async reaping
- "Why can't you call printf() from inside a signal handler?"
- Expected: printf not async-signal-safe, could deadlock on internal locks, use write() instead
- "How would you implement a timeout for a child process?"
- Expected: alarm() or setitimer(), SIGALRM handler, kill() to terminate child, waitpid() to reap
- "Describe a race condition involving fork() and signals, and how to prevent it."
- Expected: SIGCHLD arriving before job table updated, block signals around fork/addjob, unblock after
- "What is the difference between SIGTERM and SIGKILL?"
- Expected: SIGTERM can be caught/ignored (graceful shutdown), SIGKILL cannot be caught (forced termination)
Hints in Layers
Layer 1 - Basic Process Creation: Start with a simple fork/exec/wait cycle:
pid_t pid = fork();
if (pid == 0) {
// Child process
execve("/bin/echo", (char *[]){"echo", "hello", NULL}, environ);
perror("execve failed");
exit(1);
}
// Parent process
int status;
waitpid(pid, &status, 0);
if (WIFEXITED(status)) {
printf("Child exited with code %d\n", WEXITSTATUS(status));
}
Layer 2 - Signal Handler Installation: Use sigaction() instead of signal() for portable behavior:
struct sigaction sa;
sa.sa_handler = sigchld_handler;
sigemptyset(&sa.sa_mask);
sa.sa_flags = SA_RESTART; // Restart interrupted syscalls
if (sigaction(SIGCHLD, &sa, NULL) < 0) {
perror("sigaction");
exit(1);
}
Layer 3 - Async-Signal-Safe Logging: Write your own safe logging using write():
// Safe string output in signal handler
void safe_print(const char *s) {
write(STDERR_FILENO, s, strlen(s));
}
// Safe integer output (pre-convert to string)
void safe_print_int(int n) {
char buf[32];
int i = sizeof(buf) - 1;
buf[i] = '\0';
int neg = (n < 0);
if (neg) n = -n;
do {
buf[--i] = '0' + (n % 10);
n /= 10;
} while (n > 0);
if (neg) buf[--i] = '-';
write(STDERR_FILENO, &buf[i], sizeof(buf) - 1 - i);
}
Layer 4 - Blocking Signals for Critical Sections: Protect fork/job-add sequences from SIGCHLD:
sigset_t mask, prev;
sigemptyset(&mask);
sigaddset(&mask, SIGCHLD);
// Block SIGCHLD
sigprocmask(SIG_BLOCK, &mask, &prev);
pid_t pid = fork();
if (pid == 0) {
// Child: restore signal mask before exec
sigprocmask(SIG_SETMASK, &prev, NULL);
execve(argv[0], argv, environ);
exit(1);
}
// Parent: add to job list while SIGCHLD blocked
add_job(job_list, pid, RUNNING);
// Unblock SIGCHLD - pending signal delivered now
sigprocmask(SIG_SETMASK, &prev, NULL);
Layer 5 - Detecting Process State via /proc: Read process state for educational output:
void print_process_state(pid_t pid) {
    char path[64], buf[256];
    snprintf(path, sizeof(path), "/proc/%d/stat", pid);
    int fd = open(path, O_RDONLY);
    if (fd >= 0) {
        ssize_t n = read(fd, buf, sizeof(buf) - 1);
        if (n > 0) buf[n] = '\0';   // NUL-terminate before parsing
        // Parse: pid (comm) state ppid pgrp ...
        // state: R=running, S=sleeping, Z=zombie, T=stopped
        close(fd);
    }
}
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Process control fundamentals | Computer Systems: A Programmer's Perspective | Ch. 8.4 Process Control |
| Signal concepts and handling | Computer Systems: A Programmer's Perspective | Ch. 8.5 Signals |
| Comprehensive signal coverage | The Linux Programming Interface | Ch. 20-22 Signals |
| Process creation in depth | The Linux Programming Interface | Ch. 24-25 Process Creation |
| Process termination and waiting | The Linux Programming Interface | Ch. 26 Monitoring Child Processes |
| Signal safety and reentrancy | The Linux Programming Interface | Ch. 21.1 Designing Signal Handlers |
| Process groups and sessions | The Linux Programming Interface | Ch. 34 Process Groups, Sessions |
| Process lifecycle overview | Advanced Programming in the UNIX Environment | Ch. 8 Process Control |
| Signals in practice | Advanced Programming in the UNIX Environment | Ch. 10 Signals |
| Concurrency with processes | Operating Systems: Three Easy Pieces | Ch. 5-6 Process API |
Project 12: Unix Shell with Job Control
View Expanded Guide - Comprehensive implementation guide with signal flow diagrams, race condition patterns, and job state machines.
| Attribute | Value |
|---|---|
| Language | C (alt: Rust, Zig, C++) |
| Difficulty | Advanced |
| Time | 2–3 weeks |
| Chapters | 8, 12 |
| Coolness | ★★★★★ Hardcore Tech Flex |
| Portfolio Value | Resume Gold |
What you'll build: An interactive shell supporting foreground/background jobs, basic built-ins, and correct handling of interrupt/stop keystrokes.
Why it matters: It integrates processes, signals, and race avoidance into one user-facing system.
Core challenges:
- Process groups and terminal ownership (job control)
- Signal handling without races (ECF correctness)
- Consistent job state under async events (concurrency fundamentals)
Key concepts to master:
- Job control and signals (Ch. 8)
- Race avoidance patterns (Ch. 8 & 12)
- Robust error handling (Appendix)
Prerequisites: Project 11 recommended.
Deliverable: Use your shell to run real programs with correct fg/bg behavior.
Implementation hints:
- Define a minimal grammar first; add features only after correctness
- Design job-state transitions on paper before coding
Milestones:
- Basic commands run reliably
- Foreground/background switching works under stress
- No zombies; correct behavior under repeated interrupts/stops
Real World Outcome
$ ./mysh
mysh> echo hello world
hello world
mysh> /bin/ls -la
total 48
drwxr-xr-x 5 user user 4096 Dec 26 10:00 .
drwxr-xr-x 30 user user 4096 Dec 25 09:00 ..
-rwxr-xr-x 1 user user 8432 Dec 26 10:00 mysh
-rw-r--r-- 1 user user 2341 Dec 26 09:55 mysh.c
mysh> sleep 100 &
[1] (12345) sleep 100 &
mysh> sleep 200 &
[2] (12346) sleep 200 &
mysh> jobs
[1] (12345) Running sleep 100 &
[2] (12346) Running sleep 200 &
mysh> fg %1
sleep 100
^Z
Job [1] (12345) stopped by signal 20 (SIGTSTP)
mysh> jobs
[1] (12345) Stopped sleep 100
[2] (12346) Running sleep 200 &
mysh> bg %1
[1] (12345) sleep 100 &
mysh> jobs
[1] (12345) Running sleep 100 &
[2] (12346) Running sleep 200 &
mysh> fg %2
sleep 200
^C
Job [2] (12346) terminated by signal 2 (SIGINT)
mysh> jobs
[1] (12345) Running sleep 100 &
mysh> kill %1
Job [1] (12345) terminated by signal 15 (SIGTERM)
mysh> jobs
mysh>
--- Signal Handling Demo ---
mysh> ./long_running_process &
[1] (12350) ./long_running_process &
mysh> ./another_process
^C
[Ctrl+C sent SIGINT to foreground job only]
[Background job 12350 continues running]
Job ./another_process terminated by signal 2
mysh> jobs
[1] (12350) Running ./long_running_process &
--- Race Condition Prevention Demo (internal trace) ---
mysh> ./quick_exit & # Child exits immediately
[DEBUG] sigprocmask(SIG_BLOCK, {SIGCHLD})
[DEBUG] fork() = 12355
[DEBUG] addjob(12355, "./quick_exit")
[DEBUG] sigprocmask(SIG_UNBLOCK, {SIGCHLD})
[DEBUG] SIGCHLD handler: waitpid returned 12355
[DEBUG] deletejob(12355) - job found and removed
[1] (12355) ./quick_exit &
--- Proper Terminal Control ---
mysh> vim test.txt # Interactive program gets terminal control
[tcsetpgrp gives terminal to vim's process group]
[vim runs with full terminal control]
[After vim exits, shell reclaims terminal]
mysh>
The Core Question You're Answering
"How do shells provide the illusion of multiple concurrent programs sharing one terminal, and how do they coordinate process lifecycle, terminal control, and signal delivery without races or resource leaks?"
Concepts You Must Understand First
- Job Control Model
- What is a job? How does it differ from a process?
- What states can a job be in (foreground, background, stopped)?
- What triggers transitions between job states?
- Book: CS:APP Chapter 8.5 and TLPI Chapter 34
- Process Groups
- What is a process group? How is it different from a job?
- Why does the shell put each pipeline in its own process group?
- How does setpgid() work? Who can call it?
- Book: CS:APP Chapter 8.5.2 and TLPI Chapter 34.2
- Foreground Process Group and Terminal Control
- What is the foreground process group? How is it set?
- What is tcsetpgrp() and when must you call it?
- What happens if a background process tries to read from the terminal?
- Book: TLPI Chapter 34.4-34.5 and Advanced Programming in the UNIX Environment Chapter 9
- Signal Delivery to Process Groups
- When you press Ctrl+C, which processes receive SIGINT?
- How do you send a signal to an entire process group?
- What is the difference between kill(pid, sig) and kill(-pgid, sig)?
- Book: CS:APP Chapter 8.5.2 and TLPI Chapter 20.5
- Waiting for Stopped/Continued Children
- What is WUNTRACED? WCONTINUED? When do you need them?
- How do you distinguish a stopped child from a terminated child?
- What signals cause a process to stop? To continue?
- Book: CS:APP Chapter 8.4.3 and TLPI Chapter 26.1
- Race Conditions in Shell Implementation
- What race exists between fork() and adding a job to the table?
- What race exists between SIGCHLD and the main loop?
- Why must you block signals during critical sections?
- Book: CS:APP Chapter 8.5.6 and Chapter 12
- Built-in Commands vs External Commands
- Why must some commands be built-in (cd, exit, jobs, fg, bg)?
- How do you decide if a command is built-in?
- What happens if you try to exec a built-in?
- Book: TLPI Chapter 34.7 and APUE Chapter 9
Questions to Guide Your Design
- Job Table Structure: How will you represent jobs? What information do you need per job (pid, pgid, state, command line)?
- Command Parsing: How will you parse command lines? Will you handle pipes, redirects, or just simple commands first?
- Signal Handler Design: What will your SIGCHLD handler do? What must it NOT do?
- Terminal Control: When do you call tcsetpgrp()? What happens if you forget?
- Main Loop Architecture: How do you wait for foreground jobs? How do you handle asynchronous SIGCHLD for background jobs?
- Error Recovery: What happens if exec fails? If fork fails? If the command doesn't exist?
Thinking Exercise
Consider this shell main loop pseudocode:
while (1) {
char *cmdline = readline("mysh> ");
if (is_builtin(cmdline)) {
do_builtin(cmdline);
} else {
pid_t pid = fork();
if (pid == 0) {
// Child
setpgid(0, 0); // New process group
execve(argv[0], argv, environ);
exit(1);
}
// Parent
setpgid(pid, pid); // Also set pgid (race with child)
if (foreground) {
tcsetpgrp(STDIN_FILENO, pid); // Give terminal to child
waitpid(pid, &status, WUNTRACED); // Wait for fg job
tcsetpgrp(STDIN_FILENO, getpgrp()); // Reclaim terminal
} else {
printf("[%d] %d\n", jobnum, pid);
}
}
}
Hand-trace questions:
- Why do both parent AND child call setpgid()? What race does this solve?
- What happens if the child execs before the parent calls setpgid()?
- Why do we call tcsetpgrp() before waitpid() for foreground jobs?
- What happens if we forget to reclaim the terminal after the foreground job finishes?
- Where should SIGCHLD handling happen? Is it missing from this pseudocode?
- What happens if the user types Ctrl+C while a foreground job is running?
- What happens if the user types Ctrl+Z? What state does the job transition to?
- How would you modify this to properly add jobs to a job table and handle background jobs?
The Interview Questions They'll Ask
- "Walk me through what happens when you type 'ls | grep foo' in a shell and press Enter."
- Expected: Parsing, fork for each command, pipe creation, process group setup, exec, wait
- "How does job control work? What happens when you press Ctrl+Z?"
- Expected: SIGTSTP to foreground process group, process stops, shell reclaims terminal, job marked stopped
- "Why can't 'cd' be an external command?"
- Expected: chdir() affects calling process only, child process change doesn't affect parent shell
- "Describe a race condition in a naive shell implementation and how to fix it."
- Expected: Fork/SIGCHLD race, signal blocking around critical sections
- "What is a process group and why do shells use them?"
- Expected: Collection of related processes, signal delivery, terminal control, job abstraction
- "How would you implement the 'fg' built-in command?"
- Expected: Find job, send SIGCONT if stopped, give it terminal via tcsetpgrp(), waitpid with WUNTRACED
Hints in Layers
Layer 1 - Basic Command Execution: Start with a shell that can only run simple foreground commands:
int main(void) {
char cmdline[1024];
while (1) {
printf("mysh> ");
if (!fgets(cmdline, sizeof(cmdline), stdin)) break;
// Parse cmdline into argv (simple: split on whitespace)
char *argv[64];
parse_cmdline(cmdline, argv);
if (argv[0] == NULL) continue;
pid_t pid = fork();
if (pid == 0) {
execvp(argv[0], argv);
perror(argv[0]);
exit(127);
}
int status;
waitpid(pid, &status, 0);
}
return 0;
}
Layer 2 - Job Table Data Structure: Design your job table before adding background jobs:
#define MAXJOBS 16
typedef enum { UNDEF, FG, BG, ST } job_state_t;
typedef struct {
pid_t pid; // Process ID
pid_t pgid; // Process group ID
job_state_t state; // FG, BG, or ST (stopped)
int jid; // Job ID [1], [2], etc.
char cmdline[1024]; // Command line for display
} job_t;
job_t jobs[MAXJOBS];
// Operations: addjob, deletejob, getjobpid, getjobjid, pid2jid, listjobs
Layer 3 - SIGCHLD Handler: Handle child termination and stops asynchronously:
void sigchld_handler(int sig) {
int olderrno = errno;
pid_t pid;
int status;
// Reap ALL available children
while ((pid = waitpid(-1, &status, WNOHANG | WUNTRACED)) > 0) {
if (WIFEXITED(status) || WIFSIGNALED(status)) {
// Child terminated - delete from job table
deletejob(jobs, pid);
} else if (WIFSTOPPED(status)) {
// Child stopped - update job state
job_t *job = getjobpid(jobs, pid);
if (job) job->state = ST;
}
}
errno = olderrno;
}
Layer 4 - Proper Fork with Signal Blocking: Prevent races between fork and job table updates:
void eval(char *cmdline) {
sigset_t mask_all, mask_chld, prev_mask;
sigfillset(&mask_all);
sigemptyset(&mask_chld);
sigaddset(&mask_chld, SIGCHLD);
// Block SIGCHLD before fork
sigprocmask(SIG_BLOCK, &mask_chld, &prev_mask);
pid_t pid = fork();
if (pid == 0) {
// Child: unblock signals, set process group, exec
sigprocmask(SIG_SETMASK, &prev_mask, NULL);
setpgid(0, 0);
execve(argv[0], argv, environ);
exit(1);
}
// Parent: add job while SIGCHLD blocked
setpgid(pid, pid); // Also set in parent (race prevention)
sigprocmask(SIG_BLOCK, &mask_all, NULL); // Block all for job table
addjob(jobs, pid, pid, bg ? BG : FG, cmdline);
sigprocmask(SIG_SETMASK, &prev_mask, NULL); // Restore (unblock SIGCHLD)
if (!bg) {
waitfg(pid); // Wait for foreground job
}
}
Layer 5 - Foreground Wait with sigsuspend: Correctly wait for foreground jobs without busy-waiting:
void waitfg(pid_t pid) {
sigset_t mask;
sigemptyset(&mask);
// Wait until the job is no longer in foreground
// SIGCHLD handler will update job state
while (fgpid(jobs) == pid) {
sigsuspend(&mask); // Atomically unblock and wait
}
}
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Job control overview | Computer Systems: A Programmer's Perspective | Ch. 8.5 Signals (job control discussion) |
| Signal handling for shells | Computer Systems: A Programmer's Perspective | Ch. 8.5.5-8.5.7 |
| Process groups and sessions | The Linux Programming Interface | Ch. 34 Process Groups, Sessions, and Job Control |
| Terminal control | The Linux Programming Interface | Ch. 34.4-34.6 |
| Shell implementation details | Advanced Programming in the UNIX Environment | Ch. 9 Process Relationships |
| Job control signals | Advanced Programming in the UNIX Environment | Ch. 10.20 Job Control Signals |
| Race conditions | Computer Systems: A Programmer's Perspective | Ch. 8.5.6 Synchronizing Flows |
| Concurrent programming patterns | Computer Systems: A Programmer's Perspective | Ch. 12 Concurrent Programming |
| Process API | Operating Systems: Three Easy Pieces | Ch. 5 Process API |
| Shell history and design | The Unix Programming Environment (Kernighan & Pike) | Ch. 3 Using the Shell |
Project 13: Virtual Memory Map Visualizer
| Attribute | Value |
|---|---|
| Language | C (alt: Rust, Zig, C++) |
| Difficulty | Advanced |
| Time | 1–2 weeks |
| Chapters | 8, 9 |
| Coolness | ★★★★☆ Genuinely Clever |
| Portfolio Value | Micro-SaaS/Pro Tool |
What you'll build: A tool that reports a process's virtual memory layout (regions, permissions, growth) and demonstrates demand paging and protection faults with controlled experiments.
Why it matters: It turns VM into observable reality: mapping, protection, faults, and locality.
Core challenges:
- Presenting mapping info accurately (regions + permissions)
- Controlled page-fault demonstrations (demand paging)
- Explaining copy-on-write and sharing (fork + VM interaction)
Key concepts to master:
- Address translation and pages (Ch. 9)
- Memory protection and mapping (Ch. 9)
- Process/VM interaction (Ch. 8–9)
Prerequisites: Project 11 recommended.
Deliverable: Show an exact map of a process and explain why a specific access faults.
Implementation hints:
- Start with "regions with permissions," then refine to page-level reasoning
- Keep experiments minimal so the cause of faults is unambiguous
Milestones:
- You can distinguish heap/stack/mapped files by observation
- You can classify crashes as protection failures
- You reason about locality as VM + cache, not just "speed"
Real World Outcome
$ ./vmvis 12345
================================================================================
VIRTUAL MEMORY MAP VISUALIZER - Process: 12345 (myapp)
================================================================================
MEMORY REGIONS (from /proc/12345/maps):
--------------------------------------------------------------------------------
ADDRESS RANGE SIZE PERMS PATH
--------------------------------------------------------------------------------
0x00400000-0x00452000 328 KB r-xp /usr/bin/myapp
0x00651000-0x00652000 4 KB r--p /usr/bin/myapp
0x00652000-0x00653000 4 KB rw-p /usr/bin/myapp
0x00653000-0x00674000 132 KB rw-p [heap]
0x7f8a3c000000-0x7f8a3c1bc000 1776 KB r-xp /lib/libc-2.31.so
0x7ffc8a400000-0x7ffc8a421000 132 KB rw-p [stack]
0x7ffc8a5fe000-0x7ffc8a600000 8 KB r-xp [vdso]
REGION SUMMARY:
Code (r-x): 2104 KB | Read-only (r--): 20 KB | Read-write (rw-): 276 KB
$ ./vmvis 12345 --page-fault-demo
================================================================================
PAGE FAULT DEMONSTRATION
================================================================================
[1] Allocating 16 pages (65536 bytes) without touching...
Pages resident: 0 of 16
[2] Touching page 0 (writing 1 byte at 0x7f8a40000000)...
>>> PAGE FAULT TRIGGERED <<<
Fault type: MINOR (demand paging)
Pages resident: 1 of 16
[3] Triggering protection fault (writing to code segment)...
>>> SIGSEGV RECEIVED <<<
si_code: SEGV_ACCERR (invalid permissions for mapped object)
$ ./vmvis --cow-demo
================================================================================
COPY-ON-WRITE DEMONSTRATION
================================================================================
[Parent] Allocating 1 MB, RSS before fork: 5120 KB
[Fork] Child created
Parent RSS: 5120 KB | Child RSS: 5120 KB (pages SHARED!)
[Child writing to page 0...]
>>> COPY-ON-WRITE FAULT <<<
Child RSS: 5124 KB | Parent RSS: 5120 KB (unchanged)
The Core Question You're Answering
How does virtual memory create the illusion of a large, private, contiguous address space for each process, and what are the performance and correctness implications of this abstraction?
Concepts You Must Understand First
- Virtual vs Physical Addresses - What is an address space? How does the MMU translate addresses? CS:APP Ch. 9.3
- Pages and Page Tables - VPN/VPO division, PTE fields, TLB purpose. CS:APP Ch. 9.6
- Memory Mapping and Regions - Anonymous vs file-backed mappings, mmap(), fork behavior. CS:APP Ch. 9.8
- Page Faults - Minor vs major faults, demand paging. CS:APP Ch. 9.5
- Memory Protection - Permission bits (r/w/x), SIGSEGV types, ASLR. CS:APP Ch. 9.7
Questions to Guide Your Design
- Will you parse /proc/PID/maps directly? What edge cases exist ([heap], [vdso], deleted files)?
- How will you present 48-bit address space meaningfully?
- How will you observe page faults without being inside the target process?
- How do you safely trigger and catch protection faults?
Thinking Exercise
Trace these accesses with simplified page tables:
- VPN 0x00400: PPN 0x1A000, r-x, present
- VPN 0x00653: PPN 0x2B000, rw-, present
- VPN 0x7f8a3: not present, backed by libc.so
1. Instruction fetch from 0x00400ABC - success or fault?
2. Write to 0x00653100 - success or fault?
3. Read from 0x7f8a3000 - what happens? What changes?
4. Write to 0x00400000 - how does this differ from #1, and why?
The Interview Questions They'll Ask
- "Walk me through what happens when a process accesses memory that hasn't been touched since allocation." - Demand paging, minor fault, kernel allocates a page, updates the PTE, restarts the instruction
- "Why is fork() so fast even for gigabytes of memory?" - Copy-on-write
- "What's the difference between SIGSEGV from a null pointer vs writing to read-only memory?" - SEGV_MAPERR vs SEGV_ACCERR
- "How would you debug a memory leak that doesn't show in valgrind?" - /proc/PID/maps growth, RSS vs VSZ trends
Hints in Layers
Layer 1: Parse /proc/PID/maps with sscanf for start-end perms offset dev inode pathname
Layer 2: Read fault counters from /proc/PID/stat fields 10 and 12 (minflt, majflt)
Layer 3: Use sigsetjmp/siglongjmp with SIGSEGV handler to catch and recover from faults
Layer 4: Use mincore() to check page residency without faulting
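A minimal sketch of the Layer 1 parsing step, assuming Linux's /proc/PID/maps line format; the region_t type and function name are illustrative, and the pathname field is left empty for anonymous mappings:

```c
#include <stdio.h>
#include <string.h>

/* One parsed region from /proc/PID/maps. */
typedef struct {
    unsigned long start, end;
    char perms[5];    /* e.g. "r-xp" */
    char path[256];   /* "" for anonymous mappings */
} region_t;

/* Parse a single maps line such as:
 * "00400000-00452000 r-xp 00000000 08:01 1234 /usr/bin/myapp"
 * Returns 1 on success, 0 on parse failure. Offset, device, and
 * inode fields are skipped with suppressed conversions. */
int parse_maps_line(const char *line, region_t *r) {
    r->path[0] = '\0';
    int n = sscanf(line, "%lx-%lx %4s %*x %*s %*d %255s",
                   &r->start, &r->end, r->perms, r->path);
    return n >= 3;   /* pathname is optional */
}
```

In use, you would read /proc/PID/maps line by line with fgets and feed each line through this function; page residency for a region can then be checked with mincore() as Layer 4 suggests.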
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Virtual Memory Fundamentals | Computer Systems: A Programmer's Perspective | Ch. 9 |
| Page Tables and TLB | Operating Systems: Three Easy Pieces | Ch. 18-20 |
| Linux Memory Management | The Linux Programming Interface | Ch. 48-50 |
| mmap and Memory Mapping | Advanced Programming in the UNIX Environment | Ch. 14 |
Project 14: Build Your Own Malloc
| Attribute | Value |
|---|---|
| Language | C (alt: Rust, Zig, C++) |
| Difficulty | Expert |
| Time | 1 month+ |
| Chapters | 6, 9 |
| Coolness | ★★★★★ Pure Magic |
| Portfolio Value | Resume Gold |
What you'll build: A user-space allocator implementing malloc/free (optionally realloc) plus tooling for invariants, fragmentation, and throughput.
Why it matters: This is the payoff of data layout, locality, and VM: alignment, metadata design, coalescing, and policy trade-offs.
Core challenges:
- Block metadata and alignment design (layout + ABI alignment)
- Free-list policies, splitting/coalescing (fragmentation trade-offs)
- Heap checker and performance harness (correctness + optimization)
Key concepts to master:
- Heap layout and allocator concepts (Ch. 9)
- Locality and performance effects (Ch. 6)
- Invariants mindset (C Interfaces and Implementations reference)
Prerequisites: Projects 2, 9, and 13 recommended.
Deliverable: Allocate/free at scale without corruption; provide evidence on fragmentation and throughput.
Implementation hints:
- Lock down invariants first; instrument everything
- Every blockโs header must be explainable in a dump
Milestones:
- Allocator passes correctness tests
- Fragmentation becomes measurable and improvable by policy
- You can explain bugs as violated invariants, not "weird behavior"
Real World Outcome
$ ./mymalloc --test-suite
================================================================================
MALLOC IMPLEMENTATION TEST SUITE
================================================================================
Running correctness tests...
[PASS] Basic malloc/free cycle (1000 allocations)
[PASS] Alignment check (all pointers 16-byte aligned)
[PASS] Coalescing test (adjacent free blocks merged)
[PASS] Realloc in-place when possible
[PASS] Zero-size malloc returns NULL or unique pointer
[PASS] Double-free detection (caught and reported)
[PASS] Heap overflow detection (guard bytes intact)
Running stress tests...
[PASS] Random alloc/free pattern (100000 ops, no corruption)
[PASS] Worst-case fragmentation pattern (alternating sizes)
$ ./mymalloc --heap-dump
================================================================================
HEAP DUMP - Block Layout
================================================================================
Heap start: 0x555555756000 Heap end: 0x555555776000 Size: 131072 bytes
Block Address Size Status Prev Next (free list)
--------------------------------------------------------------------------------
[ 0] 0x555555756000 64 ALLOC - -
[ 1] 0x555555756040 128 FREE - [3]
[ 2] 0x5555557560c0 256 ALLOC - -
[ 3] 0x5555557561c0 512 FREE [1] [5]
[ 4] 0x5555557563c0 1024 ALLOC - -
[ 5] 0x5555557567c0 2048 FREE [3] -
Free list heads (segregated):
Class 0 (16-64): [1] -> NULL
Class 1 (65-256): NULL
Class 2 (257-1024): [3] -> NULL
Class 3 (1025+): [5] -> NULL
Heap utilization: 67.2% (internal fragmentation: 8.3%)
$ ./mymalloc --benchmark
================================================================================
ALLOCATOR PERFORMANCE BENCHMARK
================================================================================
Workload: Synthetic (mixed sizes 16-4096, 50% alloc / 50% free)
Operations: 1,000,000
Throughput Utilization Peak Memory
--------------------------------------------------------------------------------
System malloc 847,231 ops/s 89.2% 12.4 MB
My malloc (implicit) 234,567 ops/s 71.3% 18.2 MB
My malloc (explicit) 456,789 ops/s 78.4% 15.1 MB
My malloc (segregated) 678,901 ops/s 84.1% 13.8 MB
Fragmentation Analysis:
External fragmentation: 12.3% (free blocks too small for requests)
Internal fragmentation: 5.7% (wasted space within allocated blocks)
Coalescing efficiency: 94.2% (adjacent frees merged)
$ ./mymalloc --trace workload.trace
================================================================================
ALLOCATION TRACE ANALYSIS
================================================================================
Trace: workload.trace (real application: gcc compiling hello.c)
Operations: 47,832 malloc, 45,119 free, 2,713 realloc
Size Distribution:
0-32 bytes: ████████████████████ 41.2%
33-64 bytes: ████████████ 24.1%
65-128 bytes: ████████ 15.8%
129-256 bytes: ████ 9.3%
257-1024 bytes: ██ 6.1%
1025+ bytes: █ 3.5%
Peak heap usage: 2.34 MB
Average allocation lifetime: 847 operations
Longest-lived allocation: 47,831 operations (probably a global)
The Core Question You're Answering
How do you efficiently manage a contiguous region of memory to satisfy arbitrary allocation requests while minimizing fragmentation and maximizing throughput?
Concepts You Must Understand First
- Heap Organization - What is brk/sbrk? How does the heap grow? What is the relationship between heap and mmap? CS:APP Ch. 9.9
- Block Structure and Metadata - Header/footer design, boundary tags, alignment requirements. CS:APP Ch. 9.9.6
- Free List Management - Implicit vs explicit free lists, LIFO vs address-ordered, segregated fits. CS:APP Ch. 9.9.13
- Splitting and Coalescing - When to split blocks? Immediate vs deferred coalescing. CS:APP Ch. 9.9.10
- Placement Policies - First fit, next fit, best fit tradeoffs. CS:APP Ch. 9.9.7
- Alignment Constraints - Why 8- or 16-byte alignment? What does the ABI require? CS:APP Ch. 3.9.3
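The alignment arithmetic can be made concrete with a small sketch. The 8-byte header, 16-byte alignment, and 32-byte minimum block match the assumptions used in the thinking exercise below; the macro and function names are illustrative:

```c
#include <stddef.h>

/* Round sz up to the 16-byte ABI alignment (CS:APP Ch. 3.9.3). */
#define ALIGNMENT 16
#define ALIGN(sz) (((sz) + (ALIGNMENT - 1)) & ~(size_t)(ALIGNMENT - 1))

/* Total block size for a request: payload plus an 8-byte header,
 * rounded up, but never smaller than the minimum block (32 bytes,
 * which leaves room for free-list links in a freed block). */
#define HEADER_SIZE 8
#define MIN_BLOCK 32

size_t block_size(size_t request) {
    size_t sz = ALIGN(request + HEADER_SIZE);
    return sz < MIN_BLOCK ? MIN_BLOCK : sz;
}
```

So a 1-byte request still consumes a 32-byte block, and a 25-byte request consumes 48 bytes: that gap between requested and consumed bytes is exactly the internal fragmentation your benchmark reports.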
Questions to Guide Your Design
- Metadata size: How many bytes of overhead per block? Can you reduce it?
- Minimum block size: What is it? Why?
- Free list structure: Implicit, explicit, or segregated? Why?
- Coalescing strategy: Immediate or deferred? Tradeoffs?
- Heap checker: What invariants must always hold?
Thinking Exercise
Consider this allocation sequence:
p1 = malloc(32); // Request 32 bytes
p2 = malloc(64); // Request 64 bytes
p3 = malloc(32); // Request 32 bytes
free(p2); // Free middle block
p4 = malloc(48); // Request 48 bytes - where does it go?
free(p1); // Free first block
free(p3); // Free last block - what happens?
Draw the heap state after each operation. Assume:
- 8-byte header with size and allocated bit
- 16-byte alignment
- Minimum block size is 32 bytes (including header)
The Interview Questions They'll Ask
- "Explain how malloc works internally." - Free list, block headers, splitting/coalescing
- "What is memory fragmentation and how would you minimize it?" - Internal vs external, coalescing, placement policies
- "Why does malloc need to track block sizes?" - So free() knows how much to release, and for coalescing
- "How would you detect memory leaks or double-frees?" - Heap checker, guard bytes, tracking allocated blocks
- "What's the tradeoff between throughput and utilization?" - Faster policies (first fit) vs space-efficient ones (best fit)
Hints in Layers
Layer 1 - Block Header:
typedef struct {
size_t size; // Block size including header (low bit = allocated)
} block_header_t;
#define GET_SIZE(hp) ((hp)->size & ~0x7)
#define GET_ALLOC(hp) ((hp)->size & 0x1)
#define PACK(size, alloc) ((size) | (alloc))
Layer 2 - Heap Initialization:
static char *heap_start;
static char *heap_end;
int mm_init(void) {
heap_start = sbrk(INITIAL_HEAP_SIZE);
if (heap_start == (void *)-1) return -1;
heap_end = heap_start + INITIAL_HEAP_SIZE;
// Create initial free block spanning entire heap
return 0;
}
Layer 3 - Coalescing:
// With boundary tags, check neighbors:
// Previous block: look at footer just before current header
// Next block: look at header at (current + current_size)
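A compilable sketch of that boundary-tag check, reusing the Layer 1 packing convention. It assumes prologue/epilogue sentinel blocks bound the heap so both neighbors can always be inspected safely; the names coalesce and word_t are illustrative:

```c
#include <stdint.h>

/* 8-byte size word; low bit = allocated, as in the Layer 1 macros. */
typedef uint64_t word_t;
#define PACK(size, alloc) ((word_t)(size) | (alloc))
#define SIZE_OF(w)  ((w) & ~(word_t)0x7)
#define ALLOC_OF(w) ((w) & 0x1)

/* hdr points at a just-freed block's header. With boundary tags,
 * hdr[-1] is the previous block's footer and the next header sits
 * SIZE_OF(*hdr) bytes ahead. Merges with any free neighbor, writes
 * matching header/footer words, and returns the merged size. */
word_t coalesce(word_t *hdr) {
    word_t size = SIZE_OF(*hdr);
    word_t *prev_ftr = hdr - 1;
    word_t *next_hdr = (word_t *)((char *)hdr + size);

    if (!ALLOC_OF(*next_hdr))            /* merge with next block */
        size += SIZE_OF(*next_hdr);
    if (!ALLOC_OF(*prev_ftr)) {          /* merge with previous block */
        size += SIZE_OF(*prev_ftr);
        hdr = (word_t *)((char *)hdr - SIZE_OF(*prev_ftr));
    }
    *hdr = PACK(size, 0);                                 /* header */
    *(word_t *)((char *)hdr + size - 8) = PACK(size, 0);  /* footer */
    return size;
}
```

Note the order of the two checks does not matter, but the header pointer must be moved back before the final writes when the previous block is absorbed.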
Layer 4 - Heap Checker:
int mm_check(void) {
    // Invariants that must always hold:
    // - Every block in the free list is marked free
    // - No two contiguous free blocks exist (coalescing worked)
    // - Every free block actually appears in the free list
    // - Free-list pointers point to valid heap addresses
    // - No allocated blocks overlap
    return 1;  // 1 = heap consistent; return 0 (and report) on any violation
}
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Dynamic Memory Allocation | Computer Systems: A Programmer's Perspective | Ch. 9.9 |
| Allocator Design Patterns | C Interfaces and Implementations | Ch. 5-6 |
| Memory Management | Operating Systems: Three Easy Pieces | Ch. 17 |
| Real Allocator Analysis | The Linux Programming Interface | Ch. 7 |
Project 15: Robust Unix I/O Toolkit
| Attribute | Value |
|---|---|
| Language | C (alt: Rust, Zig, C++) |
| Difficulty | Intermediate |
| Time | 1–2 weeks |
| Chapters | 9, 10 |
| Coolness | ★★★☆☆ Practical |
| Portfolio Value | Service & Support |
What you'll build: A "Unix file toolbox" that copies/tees/transforms streams while producing clear evidence of buffering behavior and syscall counts.
Why it matters: Chapter 10 is about being fluent with descriptors and the realities of I/O: partial operations, buffering, metadata, and mapping.
Core challenges:
- Partial reads/writes and robustness (robust I/O)
- Buffered vs unbuffered trade-offs (performance + correctness)
- Safe memory-mapped file usage (VM + I/O interaction)
Key concepts to master:
- Unix I/O (Ch. 10)
- Robust I/O discipline (Ch. 10)
- Memory-mapped files (Ch. 9โ10)
Prerequisites: Basic C.
Deliverable: Handle large files, pipes, and redirects without hangs or silent truncation.
Implementation hints:
- Treat I/O as โmay return less than requested,โ always
- Provide a "trace mode" that logs your I/O loop decisions
Milestones:
- You stop assuming a single read/write is enough
- You can explain bufferingโs performance impact with evidence
- You treat mmap as "VM + file backing," not magic
Real World Outcome
$ ./rio_copy --trace input.bin output.bin
================================================================================
ROBUST I/O TOOLKIT - File Copy with Trace
================================================================================
Source: input.bin (104857600 bytes, 100 MB)
Destination: output.bin
Buffer size: 8192 bytes
I/O Trace (showing partial operations):
--------------------------------------------------------------------------------
[ 1] read(3, buf, 8192) = 8192 (complete)
[ 1] write(4, buf, 8192) = 8192 (complete)
[ 2] read(3, buf, 8192) = 8192 (complete)
[ 2] write(4, buf, 8192) = 4096 (PARTIAL - pipe buffer full)
[ 2] write(4, buf+4096, 4096) = 4096 (retry succeeded)
[ 3] read(3, buf, 8192) = 8192 (complete)
...
[12800] read(3, buf, 8192) = 4096 (PARTIAL - near EOF)
[12800] read(3, buf+4096, 4096) = 0 (EOF reached)
[12800] write(4, buf, 4096) = 4096 (final write)
Summary:
Total read syscalls: 12,801 (12,800 complete + 1 partial)
Total write syscalls: 12,847 (12,753 complete + 94 partial retries)
Bytes transferred: 104,857,600
Elapsed time: 0.847 seconds
Throughput: 118.2 MB/s
$ ./rio_copy --compare-buffering large_file.bin /dev/null
================================================================================
BUFFERING STRATEGY COMPARISON
================================================================================
File: large_file.bin (1073741824 bytes, 1 GB)
Strategy Buffer Syscalls Time Throughput
--------------------------------------------------------------------------------
Unbuffered 1 byte 1073741824 892.3s 1.1 MB/s
Small buffer 64 bytes 16777216 14.2s 72.1 MB/s
Default (8KB) 8192 131073 1.12s 914.3 MB/s
Large (64KB) 65536 16385 0.98s 1044.7 MB/s
Huge (1MB) 1048576 1025 0.91s 1124.9 MB/s
mmap N/A ~3 0.84s 1218.2 MB/s
Analysis:
Syscall overhead at 1-byte: ~830 ns/call (context switch dominated)
Optimal buffer size for this system: 64KB-1MB
mmap advantage: eliminates copy to user buffer
$ ./rio_tee input.txt output1.txt output2.txt --trace
================================================================================
TEE WITH SYSCALL TRACE
================================================================================
Reading from: input.txt (fd=3)
Writing to: output1.txt (fd=4), output2.txt (fd=5)
[strace-style output]
read(3, "Hello, World!\nThis is...", 8192) = 847
write(4, "Hello, World!\nThis is...", 847) = 847
write(5, "Hello, World!\nThis is...", 847) = 847
read(3, "", 8192) = 0 (EOF)
$ ./rio_cat --handle-signals file.txt
================================================================================
SIGNAL-SAFE I/O DEMONSTRATION
================================================================================
Reading file.txt with EINTR handling...
[Simulating signal interruption]
read(3, buf, 8192) = -1, errno=EINTR (signal received during read)
-> Automatically retrying...
read(3, buf, 8192) = 8192 (success after retry)
Signal-safe I/O pattern demonstrated:
Total EINTR occurrences: 3
All automatically handled by rio_readn()
The Core Question You're Answering
How do you build I/O routines that correctly handle the realities of Unix: partial operations, interrupted system calls, and the performance tradeoffs of buffering?
Concepts You Must Understand First
- File Descriptors - What are they? Relationship to the open file table and v-node table. CS:APP Ch. 10.1-10.2
- Short Counts - Why does read() return less than requested? When is this normal vs an error? CS:APP Ch. 10.4
- Buffered vs Unbuffered I/O - stdio vs Unix I/O, when to use each, the dangers of mixing them. CS:APP Ch. 10.9
- EINTR Handling - What causes interrupted syscalls? How to handle them correctly. TLPI Ch. 21.5
- Memory-Mapped I/O - mmap() for files, advantages and gotchas. CS:APP Ch. 9.8
Questions to Guide Your Design
- What is the contract of rio_readn()? What does it guarantee?
- When should you use unbuffered I/O vs buffered?
- How do you handle EINTR - retry or propagate?
- What happens if you mix printf() with write()? Why?
- When is mmap better than read/write?
Thinking Exercise
Consider this scenario:
int fd = open("data.bin", O_RDONLY);
char buf[1000];
int n = read(fd, buf, 1000); // n = 847 (short count!)
- Is this an error? How do you know?
- What could cause this? (List at least 4 scenarios)
- How would you modify the code to guarantee reading exactly 1000 bytes (or EOF)?
- What if fd is a socket instead of a file? Does your answer change?
The Interview Questions They'll Ask
- "What does it mean when read() returns less than you asked for?" - Short count; normal for pipes/sockets/signals; check errno for errors
- "Why is printf() not safe to use after fork() before exec()?" - Buffered I/O; the buffer may be copied, producing double output
- "How would you efficiently copy a large file?" - read/write with a large buffer, or mmap + memcpy, or sendfile()
- "Explain the difference between Unix I/O and Standard I/O." - Buffering, portability, performance tradeoffs
- "What happens when you read() from a pipe with no data?" - Blocks until data arrives or all writers close (EOF)
Hints in Layers
Layer 1 - Rio Readn:
ssize_t rio_readn(int fd, void *usrbuf, size_t n) {
size_t nleft = n;
char *bufp = usrbuf;
while (nleft > 0) {
ssize_t nread = read(fd, bufp, nleft);
if (nread < 0) {
if (errno == EINTR) continue; // Retry on interrupt
return -1; // Error
} else if (nread == 0) {
break; // EOF
}
nleft -= nread;
bufp += nread;
}
return n - nleft; // Bytes actually read
}
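The write side needs the same discipline, since pipes and sockets routinely accept fewer bytes than requested. This is essentially the rio_writen companion as it appears in CS:APP's RIO package:

```c
#include <errno.h>
#include <unistd.h>

/* Keep writing until all n bytes are out, retrying on EINTR and
 * on partial writes. Returns n on success, -1 on error. */
ssize_t rio_writen(int fd, const void *usrbuf, size_t n) {
    size_t nleft = n;
    const char *bufp = usrbuf;
    while (nleft > 0) {
        ssize_t nwritten = write(fd, bufp, nleft);
        if (nwritten <= 0) {
            if (errno == EINTR)
                nwritten = 0;   /* interrupted before writing: retry */
            else
                return -1;      /* real error */
        }
        nleft -= nwritten;
        bufp += nwritten;
    }
    return n;
}
```

Together with rio_readn, this pair is what makes the "write(4, buf+4096, 4096) = 4096 (retry succeeded)" line in the trace above automatic rather than a bug.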
Layer 2 - Buffered Reader Structure:
typedef struct {
int fd;
int cnt; // Unread bytes in buffer
char *bufptr; // Next unread byte
char buf[8192]; // Internal buffer
} rio_t;
Layer 3 - mmap for File I/O:
void *src = mmap(NULL, file_size, PROT_READ, MAP_PRIVATE, fd, 0);
// Now src points to file contents - no read() needed!
// But: must handle SIGBUS if file truncated while mapped
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Unix I/O Fundamentals | Computer Systems: A Programmer's Perspective | Ch. 10 |
| Robust I/O Wrappers | Computer Systems: A Programmer's Perspective | Ch. 10.5 |
| File I/O in Depth | The Linux Programming Interface | Ch. 4-5 |
| Memory Mapping | Advanced Programming in the UNIX Environment | Ch. 14.8 |
| Standard I/O Library | Advanced Programming in the UNIX Environment | Ch. 5 |
Project 16: Concurrency Workbench
| Attribute | Value |
|---|---|
| Language | C (alt: Rust, Zig, Go) |
| Difficulty | Expert |
| Time | 2–3 weeks |
| Chapters | 12 |
| Coolness | ★★★★★ Hardcore Tech Flex |
| Portfolio Value | Micro-SaaS/Pro Tool |
What you'll build: A server framework that can switch between concurrency models (iterative, process-per-request, thread-per-request, thread pool), with a bounded-buffer work queue and stress-test harness.
Why it matters: Chapter 12 is about choosing the right concurrency model and proving correctness under races and deadlocks.
Core challenges:
- Correct producer/consumer queue design (synchronization)
- Avoiding deadlocks and starvation (concurrency hazards)
- Designing stress tests that actually expose races (verification discipline)
Key concepts to master:
- Threads and synchronization (Ch. 12)
- Semaphores/condition-variable patterns (Ch. 12)
- Concurrency correctness discipline (OSTEP reference)
Prerequisites: Projects 11 and 15 recommended.
Deliverable: Demonstrate throughput gains by model, and explain every bug as a race/deadlock pattern.
Implementation hints:
- Require "debug mode" invariants: queue length bounds, lock ordering rules
- Log enough to prove "what happened" without relying on luck
Milestones:
- You can reproduce and fix at least one real race condition
- Your thread pool remains stable under stress (no deadlocks)
- You can justify which concurrency model fits which workload
Real World Outcome
$ ./concbench --mode=compare --requests=10000
================================================================================
CONCURRENCY MODEL COMPARISON
================================================================================
Workload: Echo server, 10000 requests, 100 concurrent clients
Request size: 1KB, think time: 0ms (stress test)
Model Throughput Latency(p50) Latency(p99) Memory
--------------------------------------------------------------------------------
Iterative 1,247 req/s 0.8 ms 12.3 ms 2 MB
Process-per-request 3,891 req/s 2.1 ms 45.2 ms 847 MB
Thread-per-request 12,456 req/s 0.4 ms 8.7 ms 124 MB
Thread pool (8) 34,567 req/s 0.2 ms 3.2 ms 18 MB
Thread pool (32) 31,234 req/s 0.3 ms 4.1 ms 34 MB
Event-driven (epoll) 45,678 req/s 0.1 ms 2.1 ms 8 MB
Analysis:
- Iterative: Simple but serializes all requests
- Process-per-request: Memory explosion from fork overhead
- Thread-per-request: Good throughput but thread creation overhead
- Thread pool: Best balance of throughput and resource usage
- Event-driven: Highest throughput, lowest memory, most complex
$ ./concbench --mode=race-demo
================================================================================
RACE CONDITION DEMONSTRATION
================================================================================
Running counter increment test (1000000 ops, 8 threads)...
WITHOUT synchronization:
Expected final count: 1000000
Actual final count: 847293 <-- RACE CONDITION!
Lost updates: 152707 (15.3%)
Race detected! Example interleaving:
Thread 1: load counter (value: 42)
Thread 2: load counter (value: 42)
Thread 1: increment -> 43
Thread 2: increment -> 43 <-- Uses stale value!
Thread 1: store counter (43)
Thread 2: store counter (43) <-- Overwrites!
WITH mutex:
Expected: 1000000, Actual: 1000000 (correct!)
Overhead: 2.3x slower than racy version
WITH atomic operations:
Expected: 1000000, Actual: 1000000 (correct!)
Overhead: 1.4x slower than racy version
$ ./concbench --mode=deadlock-demo
================================================================================
DEADLOCK DEMONSTRATION
================================================================================
Scenario: Transfer between two accounts (A and B)
Thread 1: transfer A -> B (locks A, then B)
Thread 2: transfer B -> A (locks B, then A)
WITHOUT lock ordering:
Running 10000 transfers...
[DEADLOCK DETECTED at iteration 47!]
Thread states:
Thread 1: holding lock_A, waiting for lock_B
Thread 2: holding lock_B, waiting for lock_A
Cycle detected: T1 -> lock_B -> T2 -> lock_A -> T1
WITH consistent lock ordering (always lock lower address first):
Running 10000 transfers...
Completed successfully! No deadlocks.
$ ./concbench --mode=producer-consumer --producers=4 --consumers=4 --queue-size=16
================================================================================
BOUNDED BUFFER PRODUCER/CONSUMER
================================================================================
Configuration: 4 producers, 4 consumers, queue capacity: 16
Items to produce: 100000
Running...
[Producer 0] produced 25000 items (blocked 1847 times on full queue)
[Producer 1] produced 25000 items (blocked 1923 times)
[Consumer 0] consumed 25000 items (blocked 2134 times on empty queue)
[Consumer 3] consumed 25000 items (blocked 2089 times)
Summary:
All 100000 items produced and consumed correctly
No items lost or duplicated
Queue utilization: 78.3% (good balance)
Average wait time: 0.12 ms
The Core Question You're Answering
How do you write concurrent programs that are both correct (no races, no deadlocks) and performant, and how do you choose the right concurrency model for your workload?
Concepts You Must Understand First
- Threads vs Processes - Shared-state implications, creation overhead, isolation tradeoffs. CS:APP Ch. 12.3
- Critical Sections and Mutual Exclusion - What needs protecting? What are the primitives? CS:APP Ch. 12.4
- Semaphores - Counting vs binary, wait/signal semantics, the producer-consumer pattern. CS:APP Ch. 12.5
- Deadlock - Four conditions (mutual exclusion, hold-and-wait, no preemption, circular wait) and prevention strategies. CS:APP Ch. 12.7.3
- Thread Safety - Reentrant functions, thread-local storage, what makes code unsafe. CS:APP Ch. 12.7
- Concurrency Models - Process-based, thread-based, event-driven, and their tradeoffs. CS:APP Ch. 12.1-12.2
Questions to Guide Your Design
- What shared state exists? How will you protect it?
- What is your lock ordering discipline? Document it!
- How will you detect deadlocks during development?
- What invariants must hold before/after each critical section?
- How will you stress test for races?
Thinking Exercise
Consider this producer-consumer scenario:
int buffer[N];
int count = 0; // Items in buffer
void producer() {
while (1) {
int item = produce();
while (count == N) ; // Spin while full
buffer[count++] = item; // BUG: multiple bugs here!
}
}
void consumer() {
while (1) {
while (count == 0) ; // Spin while empty
int item = buffer[--count]; // BUG!
consume(item);
}
}
- Identify at least 3 bugs in this code
- What happens if two producers run simultaneously?
- What happens if producer and consumer race on count?
- Rewrite using a mutex and condition variables
- Rewrite using semaphores (simpler!)
The Interview Questions They'll Ask
- "What is a race condition and how do you prevent it?" - Unordered access to shared state; use locks/atomics
- "Explain deadlock and how to avoid it." - Circular wait on locks; use lock ordering or try-lock
- "When would you use a thread pool instead of thread-per-request?" - Thread creation overhead, resource limits, predictable memory
- "What's the difference between a mutex and a semaphore?" - Mutex for mutual exclusion (one holder), semaphore for counting resources
- "How do you debug a race condition?" - Thread sanitizer, stress testing, code review for unprotected shared state
- "What makes a function thread-safe?" - No shared state, or proper synchronization; reentrancy
Hints in Layers
Layer 1 - Basic Mutex Usage:
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
void safe_increment(int *counter) {
pthread_mutex_lock(&lock);
(*counter)++;
pthread_mutex_unlock(&lock);
}
Layer 2 - Semaphore-based Producer/Consumer:
sem_t slots; // Empty slots (init to N)
sem_t items; // Items available (init to 0)
sem_t mutex; // Buffer access (init to 1)
void producer() {
sem_wait(&slots); // Wait for empty slot
sem_wait(&mutex);
// Add to buffer
sem_post(&mutex);
sem_post(&items); // Signal item available
}
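The consumer side mirrors the producer with the two semaphores swapped. A complete, runnable version of the pattern might look like the sketch below; the buffer size, item count, and function names are illustrative choices, not from the text, and it assumes POSIX unnamed semaphores (Linux):

```c
#include <pthread.h>
#include <semaphore.h>

#define NSLOTS 4
#define NITEMS 1000

static int buf[NSLOTS];
static int in_pos, out_pos;
static sem_t slots, items, mutex;
static long consumed_sum;   /* written only by the consumer thread */

static void *producer(void *arg) {
    (void)arg;
    for (int i = 1; i <= NITEMS; i++) {
        sem_wait(&slots);               /* wait for an empty slot */
        sem_wait(&mutex);
        buf[in_pos] = i;
        in_pos = (in_pos + 1) % NSLOTS;
        sem_post(&mutex);
        sem_post(&items);               /* signal: item available */
    }
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    for (int i = 0; i < NITEMS; i++) {
        sem_wait(&items);               /* wait for an item */
        sem_wait(&mutex);
        int v = buf[out_pos];
        out_pos = (out_pos + 1) % NSLOTS;
        sem_post(&mutex);
        sem_post(&slots);               /* signal: slot freed */
        consumed_sum += v;
    }
    return NULL;
}

/* Runs one producer and one consumer; returns the sum of all
 * consumed items, which should equal NITEMS*(NITEMS+1)/2. */
long run_bounded_buffer(void) {
    pthread_t p, c;
    sem_init(&slots, 0, NSLOTS);
    sem_init(&items, 0, 0);
    sem_init(&mutex, 0, 1);
    in_pos = out_pos = 0;
    consumed_sum = 0;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return consumed_sum;
}
```

Note the ordering: the producer waits on slots before mutex (and the consumer on items before mutex). Reversing it, grabbing the mutex first and then blocking on a full or empty buffer, deadlocks the whole system.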
Layer 3 - Thread Pool Pattern:
typedef struct {
pthread_t *threads;
task_queue_t queue;
pthread_mutex_t lock;
pthread_cond_t notify;
int shutdown;
} threadpool_t;
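The heart of the thread pool struct above is the worker dequeue loop: wait on the condition variable while the queue is empty, exit once shutdown is set and the queue is drained. A minimal sketch, with integer "tasks" standing in for function pointers and all names (submit, run_pool, worker) being illustrative:

```c
#include <pthread.h>

#define QCAP 8
static int q[QCAP];
static int qhead, qtail, qlen;
static int pool_shutdown;
static long processed;      /* sum of executed tasks, guarded by qlock */
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t qnotify = PTHREAD_COND_INITIALIZER;  /* task available */
static pthread_cond_t qspace  = PTHREAD_COND_INITIALIZER;  /* slot free */

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&qlock);
        while (qlen == 0 && !pool_shutdown)
            pthread_cond_wait(&qnotify, &qlock);  /* sleeps with lock released */
        if (qlen == 0 && pool_shutdown) {         /* drained and told to stop */
            pthread_mutex_unlock(&qlock);
            return NULL;
        }
        int task = q[qhead];
        qhead = (qhead + 1) % QCAP;
        qlen--;
        processed += task;      /* "execute" under the lock for simplicity */
        pthread_cond_signal(&qspace);
        pthread_mutex_unlock(&qlock);
    }
}

void submit(int task) {
    pthread_mutex_lock(&qlock);
    while (qlen == QCAP)                          /* bounded queue */
        pthread_cond_wait(&qspace, &qlock);
    q[qtail] = task;
    qtail = (qtail + 1) % QCAP;
    qlen++;
    pthread_cond_signal(&qnotify);
    pthread_mutex_unlock(&qlock);
}

/* Starts nworkers (<= 16), submits tasks 1..ntasks, shuts down
 * gracefully, and returns the sum of all executed tasks. */
long run_pool(int nworkers, int ntasks) {
    pthread_t tid[16];
    qhead = qtail = qlen = 0;
    pool_shutdown = 0;
    processed = 0;
    for (int i = 0; i < nworkers; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int t = 1; t <= ntasks; t++)
        submit(t);
    pthread_mutex_lock(&qlock);
    pool_shutdown = 1;
    pthread_cond_broadcast(&qnotify);  /* wake every sleeping worker */
    pthread_mutex_unlock(&qlock);
    for (int i = 0; i < nworkers; i++)
        pthread_join(tid[i], NULL);
    return processed;
}
```

The while (not if) around pthread_cond_wait matters: condition variables permit spurious wakeups, and another worker may have stolen the task between the signal and the wakeup.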
Layer 4 - Deadlock Prevention (Lock Ordering):
// ALWAYS acquire locks in consistent order (e.g., by address)
void transfer(account_t *from, account_t *to, int amount) {
account_t *first = (from < to) ? from : to;
account_t *second = (from < to) ? to : from;
pthread_mutex_lock(&first->lock);
pthread_mutex_lock(&second->lock);
// Transfer...
pthread_mutex_unlock(&second->lock);
pthread_mutex_unlock(&first->lock);
}
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Thread Programming | Computer Systems: A Programmer's Perspective | Ch. 12 |
| Synchronization Patterns | Operating Systems: Three Easy Pieces | Ch. 26-32 |
| POSIX Threads | The Linux Programming Interface | Ch. 29-33 |
| Advanced Threading | Advanced Programming in the UNIX Environment | Ch. 11-12 |
| Concurrency Patterns | Unix Network Programming Vol 1 | Ch. 26-30 |
Phase 5: Capstone
Project 17: CS:APP Capstone – Secure, Observable, High-Performance Proxy
| Attribute | Value |
|---|---|
| Language | C (alt: Rust, Zig, Go) |
| Difficulty | Expert |
| Time | 2–3 months |
| Chapters | All |
| Coolness | ★★★★★ Pure Magic |
| Portfolio Value | Open Core Infrastructure |
What you'll build: A production-minded proxy that includes caching, configurable concurrency, robust error handling, performance instrumentation, and security hardening against common memory-safety failures.
Why it matters: It forces you to use every major idea: representation, machine-level understanding, caching/locality, linking/loading, ECF, VM, Unix I/O, networking, and concurrency.
Core challenges:
- Correctness under partial I/O and malformed inputs (robust I/O + defensive parsing)
- High throughput without races/deadlocks (synchronization)
- Measurable performance wins via locality and reduced syscalls (Ch. 5โ6)
- Debuggability via symbols, interposition, and structured logs (Ch. 7)
- Hardening and post-mortems for memory errors (Ch. 3, 8, 9)
Key concepts to master:
- Robust systems programming discipline (Appendix)
- Concurrency design patterns (Ch. 12)
- Caching and locality (Ch. 6)
- VM and mapping (Ch. 9)
- Network programming (Ch. 11)
Prerequisites: Complete Projects 1, 2, 4, 12, 15, and 16 (or equivalents).
Deliverable: Route real browser traffic through your proxy, observe metrics, reproduce failures, and explain behavior/performance in CS:APP terms.
Implementation hints:
- Define "done" as a checklist: correctness, load test results, metrics present, and at least one documented post-mortem of a bug you introduced and fixed
Milestones:
- Correct proxying + robust I/O under adverse conditions
- Concurrency scales with evidence and no correctness regressions
- You debug performance and correctness using only system evidence (symbols, traces, logs, memory maps)
Real World Outcome
$ ./proxy --port=8080 --threads=8 --cache-size=64MB
================================================================================
CS:APP CAPSTONE PROXY - Production Mode
================================================================================
Configuration: port=8080, workers=8, cache=64MB (LRU)
[14:30:00] Proxy started, listening on port 8080
$ curl -x localhost:8080 http://example.com/page.html
[14:30:05] GET http://example.com/page.html -> Cache MISS -> 200 OK (847ms)
$ ./proxy --metrics-report
PERFORMANCE: 12,456 requests, 207.6 req/min, 1.23 GB transferred
CACHE: 67.3% hit rate, 52.3 MB used, 234 evictions
LATENCY: p50=23ms, p90=89ms, p99=234ms
CONCURRENCY: 47 active, 312 peak, 73.2% utilization
$ ./loadtest --target=localhost:8080 --concurrent=100 --duration=60s
Results: 45,678 requests, 761.3 req/s, 99.86% success
Latency: min=2ms, max=1247ms, mean=34ms
$ firefox --proxy=localhost:8080
[Browsing session through your proxy - real traffic!]
The Core Question You're Answering
How do you build a production-quality networked system integrating robust I/O, concurrency, caching, and security - debugging with systems-level tools?
Concepts You Must Understand First
This capstone integrates ALL CS:APP concepts:
- Ch. 2: Binary protocol parsing, endianness
- Ch. 3: Crash debugging, security
- Ch. 5-6: Performance, cache-friendly structures
- Ch. 7: Interposition for debugging
- Ch. 8: Signal handling, graceful shutdown
- Ch. 9: mmap for cache
- Ch. 10: Robust I/O
- Ch. 11: Sockets, HTTP, DNS
- Ch. 12: Thread pools, synchronization
Questions to Guide Your Design
- Thread pool vs event-driven vs hybrid?
- Cache data structure and eviction policy?
- Handling slow clients, timeouts, keep-alive?
- Malformed HTTP request handling?
- What happens when origin is down?
- Metrics without hurting performance?
- Buffer overflow prevention?
Thinking Exercise
Design the cache on paper:
- Data structure for storage? (Hash table with what key?)
- Concurrent access? (Reader-writer lock? Per-bucket?)
- LRU eviction implementation?
- What if an entry is read while it is being evicted?
- Content larger than memory?
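The LRU bookkeeping in this exercise is mostly doubly-linked-list surgery. Here is a minimal sketch in C; the names (CacheEntry, lru_touch, lru_insert) are mine, not from the book, and a real cache additionally needs a hash table for lookup plus locking around all of this:

```c
#include <stddef.h>

/* One cache entry in an intrusive doubly-linked LRU list. */
typedef struct CacheEntry {
    char key[64];
    struct CacheEntry *prev, *next;
} CacheEntry;

typedef struct {
    CacheEntry *head;      /* most recently used */
    CacheEntry *tail;      /* least recently used */
    int count, capacity;
} LruList;

/* Unlink an entry that is currently in the list. */
static void lru_unlink(LruList *l, CacheEntry *e) {
    if (e->prev) e->prev->next = e->next; else l->head = e->next;
    if (e->next) e->next->prev = e->prev; else l->tail = e->prev;
    e->prev = e->next = NULL;
}

/* Put an entry (not currently linked) at the MRU position. */
static void lru_push_front(LruList *l, CacheEntry *e) {
    e->prev = NULL;
    e->next = l->head;
    if (l->head) l->head->prev = e;
    l->head = e;
    if (!l->tail) l->tail = e;
}

/* On a cache hit: move the entry to the front. */
void lru_touch(LruList *l, CacheEntry *e) {
    lru_unlink(l, e);
    lru_push_front(l, e);
}

/* Insert a new entry, evicting the LRU tail if full.
 * Returns the evicted entry (caller frees) or NULL. */
CacheEntry *lru_insert(LruList *l, CacheEntry *e) {
    CacheEntry *victim = NULL;
    if (l->count == l->capacity && l->tail) {
        victim = l->tail;
        lru_unlink(l, victim);
        l->count--;
    }
    lru_push_front(l, e);
    l->count++;
    return victim;
}
```

The "entry read while being evicted" question above is exactly why production designs reference-count entries or copy data out under the lock before eviction.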
The Interview Questions They'll Ask
- "Walk through a request from accept() to close()"
- "How did you handle concurrent cache access?"
- "Hardest bug you encountered?"
- "How would you scale to 10x traffic?"
- "Security vulnerabilities considered?"
- "Explain your cache eviction policy"
Hints in Layers
Layer 1: Basic proxy loop with Accept/handle_request/Close
Layer 2: HTTP parsing with rio_readlineb and sscanf
Layer 3: Thread pool with worker dequeue pattern
Layer 4: Cache with pthread_rwlock_t and LRU doubly-linked list
Layer 5: Graceful shutdown via volatile sig_atomic_t flag
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Network Programming | CS:APP | Ch. 11 |
| Robust I/O | CS:APP | Ch. 10 |
| Concurrency | CS:APP | Ch. 12 |
| Sockets API | Unix Network Programming Vol 1 | Ch. 1-8 |
| High-Performance Servers | Unix Network Programming Vol 1 | Ch. 26-30 |
| TCP/IP | TCP/IP Illustrated Vol 1 | Ch. 12-24 |
| Systems Design | The Linux Programming Interface | Ch. 56-63 |
Phase 6: Beyond CS:APP (Advanced Extensions)
These projects extend beyond the core CS:APP curriculum, building on everything you've learned.
Project 18: ELF Linker and Loader
| Attribute | Value |
|---|---|
| Language | C (alt: Rust) |
| Difficulty | Expert |
| Time | 2โ3 weeks |
| Chapters | 7 |
What you'll build: A tiny static linker (myld) for a constrained subset of ELF64 that parses relocatable objects, resolves symbols, applies relocations, and emits a merged output.
Why it matters: "Undefined reference" stops being mysterious, and relocation becomes something you can explain byte for byte.
Core challenges:
- Parsing ELF64 headers, section tables, symbols, and relocations
- Implementing symbol resolution across multiple .o inputs (strong/weak rules)
- Implementing x86-64 relocation types end-to-end
Real World Outcome
When your linker works, you'll see output like this:
$ cat main.c
extern int global_counter;
extern void increment(void);
int main(void) {
increment();
return global_counter;
}
$ cat lib.c
int global_counter = 0;
void increment(void) {
global_counter++;
}
$ gcc -c main.c lib.c
$ ./myld main.o lib.o -o program
================================================================================
MYLD - Minimal ELF64 Static Linker
================================================================================
[PHASE 1] Reading input files...
main.o: 6 sections, 8 symbols, 2 relocations
.text: 40 bytes
.data: 0 bytes
.rodata: 0 bytes
lib.o: 5 sections, 4 symbols, 1 relocation
.text: 32 bytes
.data: 4 bytes
[PHASE 2] Symbol resolution...
Symbol Table (8 unique symbols):
+-----------------+--------+----------+---------+------------+
| Name | Type | Bind | Section | Resolution |
+-----------------+--------+----------+---------+------------+
| main | FUNC | GLOBAL | .text | DEFINED |
| increment | FUNC | GLOBAL | .text | lib.o |
| global_counter | OBJECT | GLOBAL | .data | lib.o |
| _start | FUNC | GLOBAL | (UNDEF) | libc.a |
+-----------------+--------+----------+---------+------------+
Undefined symbols resolved: 2
Strong symbol conflicts: 0
[PHASE 3] Section merging...
.text: 0x401000 (72 bytes from 2 objects)
.data: 0x402000 (4 bytes from 1 object)
.rodata: 0x403000 (0 bytes)
.bss: 0x404000 (0 bytes)
[PHASE 4] Relocation processing...
Applying 3 relocations:
+--------------------+------------+------------------+-------------+--------+
| Type | Offset | Symbol | Addend | Result |
+--------------------+------------+------------------+-------------+--------+
| R_X86_64_PLT32 | 0x401005 | increment | -4 | OK |
| R_X86_64_PC32 | 0x40100b | global_counter | -4 | OK |
| R_X86_64_32S | lib.o:0x08 | global_counter | 0 | OK |
+--------------------+------------+------------------+-------------+--------+
Relocation calculation for R_X86_64_PLT32:
S (symbol addr) = 0x401040 (increment)
P (patch site) = 0x401005 (the 4 displacement bytes of the call)
A (addend) = -4
Result: S + A - P = 0x401040 + (-4) - 0x401005 = 0x37
Bytes written: 37 00 00 00
[PHASE 5] Output generation...
Writing ELF header at offset 0
Writing program headers (3 segments)
Writing section data
Writing section headers
[OUTPUT] program: 8752 bytes written
Entry point: 0x401000
Segments: 3 (LOAD, LOAD, NOTE)
Sections: 7
$ ./program; echo "Exit: $?"
Exit: 1
$ readelf -h program | grep Entry
Entry point address: 0x401000
$ objdump -d program | head -20
program: file format elf64-x86-64
Disassembly of section .text:
0000000000401000 <main>:
401000: 55 push %rbp
401001: 48 89 e5 mov %rsp,%rbp
401004: e8 37 00 00 00 call 401040 <increment>
401009: 8b 05 f1 0f 00 00 mov 0xff1(%rip),%eax
40100f: 5d pop %rbp
401010: c3 ret
0000000000401040 <increment>:
401040: 55 push %rbp
401041: 48 89 e5 mov %rsp,%rbp
401044: 8b 05 b6 0f 00 00 mov 0xfb6(%rip),%eax
40104a: 83 c0 01 add $0x1,%eax
40104d: 89 05 ad 0f 00 00 mov %eax,0xfad(%rip)
401053: 5d pop %rbp
401054: c3 ret
The Core Question You're Answering
"How do separate compilation units become a single executable, and what exactly happens when you see 'undefined reference to foo'?"
The linker is the final stage that turns your mental model of separate .c files into reality. You'll understand why symbols have "linkage," why static functions can't be called from other files, and exactly which bytes get patched during relocation.
Concepts You Must Understand First
- ELF file format (CS:APP 7.4) - Headers, sections, segments
- Symbol tables and types (CS:APP 7.5) - Global/local, strong/weak, defined/undefined
- Relocation entries (CS:APP 7.7) - R_X86_64_PC32, R_X86_64_PLT32, R_X86_64_32S
- x86-64 addressing modes (CS:APP 3.4) - RIP-relative addressing
- Object file sections (CS:APP 7.4) - .text, .data, .bss, .rodata, .symtab, .rela.*
- Two-pass linking (CS:APP 7.6) - Symbol resolution then relocation
Questions to Guide Your Design
- How will you parse ELF headers? Read the structs from <elf.h> or define your own?
- What data structures hold symbol information? Hash table? Sorted array? How do you handle duplicates?
- How do you decide section layout? What addresses do merged sections get? How do you handle alignment?
- How will you track relocations? Each relocation needs: source object, offset, type, target symbol, addend
- What's your relocation calculation? For R_X86_64_PC32 it is S + A - P; where do S, A, and P come from?
- How do you handle weak vs strong symbols? What if two objects both define foo?
- What output format will you produce? Minimal ELF64 executable? How many program headers?
- How will you test correctness? Compare against ld output? Run the executable?
Thinking Exercise
Before writing any code, trace through this relocation by hand:
// In main.o at offset 0x15 in .text section:
// e8 00 00 00 00 call <helper> ; R_X86_64_PLT32, addend = -4
// Symbol 'helper' will be placed at address 0x401080
// The 'call' instruction is at address 0x401015 in final executable
// Relocation formula for R_X86_64_PLT32: S + A - P
// S = symbol address = 0x401080
// A = addend = -4
// P = patch location = 0x401015 + 1 = 0x401016 (byte after opcode)
// Calculate the 4-byte value to write:
// 0x401080 + (-4) - 0x401016 = 0x401080 - 4 - 0x401016 = 0x66
// The patched instruction becomes:
// e8 66 00 00 00 call 0x401080
// Verify: When CPU executes at 0x401015:
// - Reads opcode e8 (relative call)
// - Reads 4-byte displacement: 0x00000066
// - Calculates target: 0x40101a (next instruction) + 0x66 = 0x401080
Now trace this global variable access:
// In main.o at offset 0x20:
// 8b 05 00 00 00 00 mov 0x0(%rip),%eax ; R_X86_64_PC32 to 'counter'
// 'counter' is in .data at 0x402000
// This instruction lands at 0x401020 in final executable
// The relocation patches bytes at 0x401022 (after opcode + ModR/M)
// Calculate: S + A - P
// S = 0x402000, A = -4, P = 0x401022
// Result = 0x402000 - 4 - 0x401022 = 0xfda
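Both traces reduce to one helper that computes S + A - P and writes four little-endian bytes at the patch site. This is a sketch with an invented name (patch_pc32), not the book's code; it assumes P points at the displacement bytes, not the instruction start:

```c
#include <stdint.h>

/* Apply the S + A - P rule (R_X86_64_PC32, and R_X86_64_PLT32 when the
 * target is a defined local symbol) and patch a flat output image. */
int32_t patch_pc32(uint8_t *image, uint64_t image_base,
                   uint64_t S,   /* resolved symbol address      */
                   int64_t  A,   /* addend from the rela entry   */
                   uint64_t P)   /* address of the 4 disp. bytes */
{
    int32_t value = (int32_t)(S + (uint64_t)A - P);
    uint32_t u = (uint32_t)value;
    uint8_t *site = image + (P - image_base);
    site[0] = u & 0xff;            /* little-endian, regardless of host */
    site[1] = (u >> 8)  & 0xff;
    site[2] = (u >> 16) & 0xff;
    site[3] = (u >> 24) & 0xff;
    return value;
}
```

Writing the bytes explicitly rather than via memcpy keeps the output correct even on a big-endian host.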
The Interview Questions They'll Ask
- "What's the difference between a section and a segment?"
- Sections are the link-time view (.text, .data, etc.); segments are the load-time view (LOAD, DYNAMIC)
- The linker works with sections, the loader works with segments
- Multiple sections can be combined into one segment
- "Explain R_X86_64_PC32 vs R_X86_64_32S"
- PC32: PC-relative, 32-bit signed displacement, computed as S + A - P
- 32S: absolute address, sign-extended to 64 bits, computed as S + A
- PC32 is position-independent; 32S requires a known absolute address
- "Why does the linker need two passes?"
- Pass 1: collect all symbols, resolve references (need to know all symbols before patching)
- Pass 2: apply relocations (now we know every symbol's final address)
- A single pass would require backpatching or forward references
- "What happens with multiple definitions of a strong symbol?"
- Linker error: "multiple definition of 'foo'"
- Each strong symbol can only be defined once across all objects
- Weak symbols are overridden by strong ones
- "How does the linker handle common symbols (uninitialized globals)?"
- int foo; in multiple files creates COMMON symbols
- The linker merges them, taking the largest size
- The final location is in the .bss section
- "What's the purpose of the .rela.text section?"
- Contains relocation entries for the .text section
- Each entry: offset, type, symbol index, addend
- Tells the linker which bytes to patch and how
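The strong/weak rules above condense into a small decision function. This sketch (resolve_pair is my name) ignores COMMON symbols and size merging, which a fuller linker must also handle:

```c
#include <elf.h>

/* Decide which of two global symbols wins under the CS:APP
 * strong/weak rules. Returns 0 = keep existing, 1 = take candidate,
 * -1 = two strong definitions (the "multiple definition" link error). */
int resolve_pair(unsigned char existing_bind, int existing_defined,
                 unsigned char candidate_bind, int candidate_defined)
{
    if (!candidate_defined) return 0;   /* a mere reference changes nothing */
    if (!existing_defined)  return 1;   /* first definition wins for now */
    int existing_strong  = (existing_bind  == STB_GLOBAL);
    int candidate_strong = (candidate_bind == STB_GLOBAL);
    if (existing_strong && candidate_strong) return -1;
    if (candidate_strong) return 1;     /* strong overrides weak */
    return 0;                           /* weak loses; weak/weak keeps first */
}
```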
Hints in Layers
Layer 1 - ELF Parsing Foundation:
#include <elf.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
typedef struct {
const char *filename;
uint8_t *data;
size_t size;
Elf64_Ehdr *ehdr;
Elf64_Shdr *shdrs;
const char *shstrtab;
} ElfFile;
int load_elf(const char *path, ElfFile *ef) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return -1; }
    ef->filename = path;
    ef->data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);   /* the mapping stays valid after close */
    if (ef->data == MAP_FAILED) return -1;
    ef->size = st.st_size;
    ef->ehdr = (Elf64_Ehdr *)ef->data;
    ef->shdrs = (Elf64_Shdr *)(ef->data + ef->ehdr->e_shoff);
    ef->shstrtab = (char *)(ef->data + ef->shdrs[ef->ehdr->e_shstrndx].sh_offset);
    return 0;
}
Layer 2 - Symbol Table Extraction:
typedef struct {
char name[256];
uint64_t value;
uint8_t bind; // STB_LOCAL, STB_GLOBAL, STB_WEAK
uint8_t type; // STT_FUNC, STT_OBJECT
uint16_t shndx; // Section index or SHN_UNDEF
ElfFile *source;
} Symbol;
void extract_symbols(ElfFile *ef, Symbol **out, int *count) {
for (int i = 0; i < ef->ehdr->e_shnum; i++) {
if (ef->shdrs[i].sh_type != SHT_SYMTAB) continue;
Elf64_Sym *syms = (Elf64_Sym *)(ef->data + ef->shdrs[i].sh_offset);
int nsyms = ef->shdrs[i].sh_size / sizeof(Elf64_Sym);
// Process each symbol...
}
}
Layer 3 - Symbol Resolution:
int resolve_symbols(Symbol *all_syms, int nsyms, SymEntry *global_tab) {
for (int i = 0; i < nsyms; i++) {
Symbol *sym = &all_syms[i];
if (sym->bind == STB_LOCAL) continue;
SymEntry *existing = hash_lookup(global_tab, sym->name);
// Handle undefined, defined, strong/weak conflicts...
}
return 0;
}
Layer 4 - Relocation Application:
void apply_relocations(ElfFile *ef, uint8_t *output, uint64_t text_base) {
for (int i = 0; i < ef->ehdr->e_shnum; i++) {
if (ef->shdrs[i].sh_type != SHT_RELA) continue;
Elf64_Rela *relas = (Elf64_Rela *)(ef->data + ef->shdrs[i].sh_offset);
// For each relocation: calculate S + A - P and patch
}
}
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Linking Overview | CS:APP | Ch. 7 |
| ELF Format Details | Practical Binary Analysis | Ch. 2 |
| Relocation Types | CS:APP | 7.7 |
| Symbol Resolution | CS:APP | 7.6 |
| ELF Specification | Low-Level Programming | Ch. 4 |
| Dynamic Linking | CS:APP | 7.10-7.12 |
Common Pitfalls & Debugging
Problem 1: Relocation values are wrong
# Symptom: Segfault or jump to wrong address
# Fix: P = address of the 4-byte displacement, not instruction start
$ diff <(objdump -d reference) <(objdump -d program)
Problem 2: Section addresses overlap
# Fix: Page-align sections
text_vaddr = 0x401000;
data_vaddr = (text_vaddr + text_size + 0xfff) & ~0xfff;
Problem 3: Missing symbols from libc
# For minimal linker: restrict to self-contained code with syscalls
Problem 4: ELF validation failures
$ readelf -h program # Compare with readelf -h /bin/ls
Project 19: Virtual Memory Simulator
| Attribute | Value |
|---|---|
| Language | C (alt: C++, Rust) |
| Difficulty | Advanced |
| Time | ~2 weeks |
| Chapters | 9 |
What you'll build: A CLI (vmsim) that simulates page tables, TLB, and page replacement policies on real address traces.
Why it matters: Makes virtual memory visible: you'll see page faults happen, measure TLB hit rates, and compare replacement algorithms.
Core challenges:
- Translating virtual addresses using multi-level page tables
- Implementing FIFO/LRU/Clock replacement policies
- Quantifying TLB hit rates vs. page-table-walk costs
Real World Outcome
When your simulator works, you'll see output like this:
$ cat trace.txt
R 0x00007fff5fbff8a0
W 0x00007fff5fbff8a8
R 0x0000000000400540
R 0x0000000000400544
W 0x00007fff5fbff890
R 0x00007fff5fbff8a0
R 0x0000000000601040
W 0x0000000000601048
$ ./vmsim --trace trace.txt --frames 4 --policy lru --levels 4 --tlb-size 16
================================================================================
VMSIM - Virtual Memory Simulator
================================================================================
Configuration:
Address bits: 48 (virtual), 36 (physical)
Page size: 4 KB (12 offset bits)
Page table levels: 4 (9 + 9 + 9 + 9 bits)
Physical frames: 4
TLB entries: 16
Replacement: LRU
Processing 8 memory accesses...
Access #1: READ 0x00007fff5fbff8a0
VPN: 0x7fff5fbff Offset: 0x8a0
TLB: MISS
Page Walk: L4[0xff] -> L3[0x1fd] -> L2[0xfd] -> L1[0x1ff]
Page Table: MISS (page fault)
[PAGE FAULT] Loading VPN 0x7fff5fbff into frame 0
Physical: 0x0000008a0
+-------+-----------+-------+-------+-----+
| Frame | VPN       | Valid | Dirty | LRU |
+-------+-----------+-------+-------+-----+
| 0     | 7fff5fbff | 1     | 0     | 0   |
| 1     | -         | 0     | -     | -   |
| 2     | -         | 0     | -     | -   |
| 3     | -         | 0     | -     | -   |
+-------+-----------+-------+-------+-----+
Access #2: WRITE 0x00007fff5fbff8a8
VPN: 0x7fff5fbff Offset: 0x8a8
TLB: HIT (frame 0)
Physical: 0x0000008a8
[DIRTY] Marking frame 0 as dirty
Access #3: READ 0x0000000000400540
VPN: 0x000000400 Offset: 0x540
TLB: MISS
Page Table: MISS (page fault)
[PAGE FAULT] Loading VPN 0x000000400 into frame 1
Physical: 0x000001540
Access #4: READ 0x0000000000400544
VPN: 0x000000400 Offset: 0x544
TLB: HIT (frame 1)
Physical: 0x000001544
... (remaining accesses)
Access #7: READ 0x0000000000601040
VPN: 0x000000601 Offset: 0x040
TLB: MISS
Page Table: MISS (page fault)
[PAGE FAULT] Loading VPN 0x000000601 into frame 2
Physical: 0x000002040
================================================================================
SIMULATION SUMMARY
================================================================================
Memory Accesses: 8
Reads: 5
Writes: 3
TLB Statistics:
Hits: 5 (62.5%)
Misses: 3 (37.5%)
Hit rate: 62.50%
Page Table Statistics:
Hits: 0 (0.0% of TLB misses)
Faults: 3
Page fault rate: 37.50%
Page Replacement:
Evictions: 0
Dirty writebacks: 0
Clean evictions: 0
Performance Estimate:
TLB hit cost: 1 cycle
Page walk cost: ~100 cycles (4 levels * 25 cycles)
Page fault cost: ~10,000,000 cycles (disk access)
Estimated cycles: ~30,000,305
If all TLB hits: 8 cycles
Slowdown factor: ~3,750,038x
$ ./vmsim --trace trace.txt --frames 4 --policy fifo --compare
================================================================================
POLICY COMPARISON
================================================================================
| Policy | Page Faults | Evictions | Hit Rate | Dirty Writebacks |
|--------|-------------|-----------|----------|------------------|
| FIFO | 3 | 0 | 62.50% | 0 |
| LRU | 3 | 0 | 62.50% | 0 |
| Clock | 3 | 0 | 62.50% | 0 |
| OPT | 3 | 0 | 62.50% | 0 |
Belady's Anomaly Check (FIFO with varying frames):
2 frames: 3 faults
3 frames: 3 faults
4 frames: 3 faults
5 frames: 3 faults
No anomaly detected in this trace (only 3 distinct pages are touched, so every policy behaves the same; try longer traces to separate them).
The Core Question You're Answering
"What actually happens when the CPU accesses a virtual address, and why do some programs thrash while others run smoothly?"
Virtual memory is the foundation of process isolation and memory efficiency. By building a simulator, you'll understand exactly why 4 GB programs can run on 2 GB machines, what the kernel does during a page fault, and why locality of reference is the most important property of programs.
Concepts You Must Understand First
- Virtual vs physical addresses (CS:APP 9.1) - Address spaces and the MMU
- Page tables and PTEs (CS:APP 9.3) - Structure of multi-level page tables
- TLB operation (CS:APP 9.5) - Translation lookaside buffer as a cache
- Page faults (CS:APP 9.3.4) - What triggers them, kernel handling
- Replacement policies (OSTEP Ch. 21-22) - FIFO, LRU, Clock, OPT
- Working set and locality (CS:APP 9.9) - Why caching works
Questions to Guide Your Design
- How will you represent page table entries? What fields: present, dirty, accessed, frame number?
- How many levels of page tables? x86-64 uses 4 levels - will you simulate all 4?
- What's your TLB data structure? Fully associative? Set associative? What replacement policy?
- How will you track LRU order? Timestamp? Doubly-linked list? Counter bits?
- How do you implement the Clock algorithm? What's the "second chance" logic?
- How will you read trace files? What format: <R/W> <hex address>?
- What statistics will you collect? TLB hits, page faults, dirty writebacks?
- How will you validate correctness? Known traces with expected results?
Thinking Exercise
Before writing any code, trace through this address translation by hand:
// Virtual address: 0x00007fff5fbff8a0
// 48-bit address, 4KB pages, 4-level page table
// Break down the address (4KB = 12 offset bits):
// Binary: 0000 0000 0000 0000 0111 1111 1111 1111 0101 1111 1011 1111 1111 1000 1010 0000
//
// Bits [47:39] = L4 index = 0x0ff (255) - index into PML4
// Bits [38:30] = L3 index = 0x1fd (509) - index into PDPT
// Bits [29:21] = L2 index = 0x0fd (253) - index into PD
// Bits [20:12] = L1 index = 0x1ff (511) - index into PT
// Bits [11:0] = Offset = 0x8a0 (2208) - offset within page
// Page walk:
// 1. CR3 contains physical address of PML4
// 2. Read PML4[255] -> physical address of PDPT
// 3. Read PDPT[510] -> physical address of PD
// 4. Read PD[382] -> physical address of PT
// 5. Read PT[511] -> PTE with frame number (or page fault if not present)
// 6. Physical address = (frame_number << 12) | 0x8a0
// TLB caches: VPN -> (frame_number, permissions)
// VPN = upper 36 bits = 0x7fff5fbff
Now trace a replacement decision:
// 4 frames, LRU policy, current state:
// Frame 0: VPN 0xABC, accessed at time 5
// Frame 1: VPN 0xDEF, accessed at time 8 (most recent)
// Frame 2: VPN 0x123, accessed at time 2 (least recent - EVICT THIS)
// Frame 3: VPN 0x456, accessed at time 6
// New access to VPN 0x789 at time 9 (page fault):
// 1. Find LRU frame: Frame 2 (time 2)
// 2. If Frame 2 is dirty, write back to disk
// 3. Load VPN 0x789 into Frame 2
// 4. Update access time to 9
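The index arithmetic in the first trace is easy to get wrong by a bit or two, so it pays to check it mechanically with a few lines of C (split_vaddr is an illustrative name, mirroring Layer 1 below):

```c
#include <stdint.h>

/* Extract the four 9-bit page-table indices and the 12-bit page offset
 * from a 48-bit virtual address with 4 KB pages. */
void split_vaddr(uint64_t va, unsigned idx[4], unsigned *offset) {
    *offset = va & 0xfff;
    idx[0] = (va >> 39) & 0x1ff;   /* L4 (PML4) */
    idx[1] = (va >> 30) & 0x1ff;   /* L3 (PDPT) */
    idx[2] = (va >> 21) & 0x1ff;   /* L2 (PD)   */
    idx[3] = (va >> 12) & 0x1ff;   /* L1 (PT)   */
}
```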
The Interview Questions They'll Ask
- "Walk through what happens when a process accesses memory"
- CPU generates virtual address
- TLB lookup (fast path if hit)
- On TLB miss: walk page table (multiple memory accesses)
- If page not present: page fault, kernel loads from disk
- Update TLB, return physical address
- "Why do we use multi-level page tables instead of single-level?"
- Single level for 48-bit addresses would need 2^36 entries (a 512 GB table at 8 bytes per entry!)
- Multi-level allows sparse allocation
- Only allocate tables for used regions
- Trade-off: more memory accesses per translation
- "Explain the Clock page replacement algorithm"
- Approximation of LRU with lower overhead
- Reference bit set on access, cleared by clock hand
- Hand sweeps, evicting first page with ref=0
- Gives pages a "second chance" if recently used
- "What is thrashing and how do you detect it?"
- Working set exceeds physical memory
- Constant page faults, CPU mostly waiting for I/O
- Detect: page fault rate exceeds threshold
- Solution: reduce multiprogramming or add memory
- "What's Belady's anomaly and which algorithms are immune?"
- FIFO can have MORE faults with MORE frames
- Example: sequence 1,2,3,4,1,2,5,1,2,3,4,5 with 3 vs 4 frames
- Stack algorithms (LRU, OPT) are immune
- FIFO is not a stack algorithm
- "How does the TLB interact with context switches?"
- TLB entries are process-specific
- Context switch invalidates TLB (or uses ASID)
- Cold TLB after switch causes many page walks
- ASID allows TLB entries to persist across switches
Hints in Layers
Layer 1 - Address Parsing:
#define PAGE_SIZE 4096
#define PAGE_BITS 12
#define VPN_BITS 36
#define LEVELS 4
#define LEVEL_BITS 9
typedef struct {
uint64_t vpn; // Virtual page number
uint16_t offset; // Offset within page
uint16_t indices[4]; // Indices for each level
} ParsedAddress;
ParsedAddress parse_address(uint64_t vaddr) {
ParsedAddress pa;
pa.offset = vaddr & 0xFFF;
pa.vpn = vaddr >> PAGE_BITS;
uint64_t temp = pa.vpn;
for (int i = LEVELS - 1; i >= 0; i--) {
pa.indices[i] = temp & 0x1FF; // 9 bits per level
temp >>= LEVEL_BITS;
}
return pa;
}
Layer 2 - Page Table Structure:
typedef struct {
uint32_t frame; // Physical frame number
uint8_t present; // Is page in memory?
uint8_t dirty; // Has page been written?
uint8_t accessed; // For clock algorithm
} PTE;
typedef struct PageTable {
PTE entries[512]; // 2^9 entries per level
struct PageTable *children[512]; // Pointers to next level
} PageTable;
PageTable *root; // PML4
int walk_page_table(uint64_t vpn, uint32_t *frame_out) {
PageTable *current = root;
ParsedAddress pa = parse_address(vpn << PAGE_BITS);
for (int level = 0; level < LEVELS - 1; level++) {
int idx = pa.indices[level];
if (!current->children[idx]) return -1; // Not mapped
current = current->children[idx];
}
int final_idx = pa.indices[LEVELS - 1];
if (!current->entries[final_idx].present) return -1;
*frame_out = current->entries[final_idx].frame;
return 0;
}
Layer 3 - TLB Implementation:
typedef struct {
uint64_t vpn;
uint32_t frame;
uint8_t valid;
uint64_t last_access; // For LRU
} TLBEntry;
TLBEntry tlb[TLB_SIZE];
uint64_t access_counter = 0;
int tlb_lookup(uint64_t vpn, uint32_t *frame_out) {
for (int i = 0; i < TLB_SIZE; i++) {
if (tlb[i].valid && tlb[i].vpn == vpn) {
tlb[i].last_access = ++access_counter;
*frame_out = tlb[i].frame;
return 1; // Hit
}
}
return 0; // Miss
}
void tlb_insert(uint64_t vpn, uint32_t frame) {
// Find empty or LRU entry
int victim = 0;
uint64_t oldest = UINT64_MAX;
for (int i = 0; i < TLB_SIZE; i++) {
if (!tlb[i].valid) { victim = i; break; }
if (tlb[i].last_access < oldest) {
oldest = tlb[i].last_access;
victim = i;
}
}
tlb[victim] = (TLBEntry){vpn, frame, 1, ++access_counter};
}
Layer 4 - Page Replacement Policies:
typedef struct {
uint64_t vpn;
uint8_t valid;
uint8_t dirty;
uint8_t ref_bit; // For clock
uint64_t load_time; // For FIFO
uint64_t last_access; // For LRU
} Frame;
Frame frames[MAX_FRAMES];
int clock_hand = 0;
int find_victim_lru(void) {
int victim = -1;
uint64_t oldest = UINT64_MAX;
for (int i = 0; i < num_frames; i++) {
if (!frames[i].valid) return i;
if (frames[i].last_access < oldest) {
oldest = frames[i].last_access;
victim = i;
}
}
return victim;
}
int find_victim_clock(void) {
while (1) {
if (!frames[clock_hand].ref_bit) {
int victim = clock_hand;
clock_hand = (clock_hand + 1) % num_frames;
return victim;
}
frames[clock_hand].ref_bit = 0; // Second chance
clock_hand = (clock_hand + 1) % num_frames;
}
}
Layer 5 - Main Simulation Loop:
void simulate(const char *trace_file) {
FILE *f = fopen(trace_file, "r");
char op;
uint64_t addr;
while (fscanf(f, " %c %lx", &op, &addr) == 2) {
stats.total_accesses++;
uint64_t vpn = addr >> PAGE_BITS;
uint32_t frame;
// Try TLB
if (tlb_lookup(vpn, &frame)) {
stats.tlb_hits++;
} else {
stats.tlb_misses++;
// Walk page table
if (walk_page_table(vpn, &frame) < 0) {
stats.page_faults++;
frame = handle_page_fault(vpn);
}
tlb_insert(vpn, frame);
}
// Update access info
frames[frame].last_access = ++access_counter;
frames[frame].ref_bit = 1;
if (op == 'W') frames[frame].dirty = 1;
}
}
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Virtual Memory Overview | CS:APP | Ch. 9 |
| Address Translation | CS:APP | 9.3-9.5 |
| Page Replacement | OSTEP | Ch. 21-22 |
| TLBs | CS:APP | 9.5 |
| Working Sets | OSTEP | Ch. 22 |
| Linux VM Implementation | TLPI | Ch. 49-50 |
Common Pitfalls & Debugging
Problem 1: Address parsing is off by one level
# Symptom: All accesses go to wrong frame
# Check your bit extraction:
printf("VPN: 0x%lx\n", addr >> 12);
printf("L4: %d L3: %d L2: %d L1: %d\n",
(addr >> 39) & 0x1ff, (addr >> 30) & 0x1ff,
(addr >> 21) & 0x1ff, (addr >> 12) & 0x1ff);
Problem 2: LRU timestamps not updating
# Symptom: Same frame always evicted
# Fix: Update last_access on EVERY access, not just faults
frames[frame].last_access = ++global_counter;
Problem 3: Clock hand not wrapping
# Symptom: Array out of bounds or stuck
clock_hand = (clock_hand + 1) % num_frames;
Problem 4: Dirty bit not set on writes
# Symptom: Dirty writeback count always zero
if (op == 'W' || op == 'w') {
frames[frame].dirty = 1;
}
Project 20: HTTP Web Server
| Attribute | Value |
|---|---|
| Language | C (alt: Rust, Go) |
| Difficulty | Intermediate |
| Time | 1โ2 weeks |
| Chapters | 10, 11, 8 |
What you'll build: A small but real HTTP server (tiny) that parses requests, serves static files, and runs simple CGI-style dynamic handlers.
Why it matters: Connects sockets, HTTP parsing, and process control into a working networked application.
Core challenges:
- Implementing request/response loop with sockets API
- Parsing HTTP request lines and headers defensively
- Serving static files with correct MIME types
Real World Outcome
When your server works, you'll see output like this:
$ ./tiny 8080 ./www &
================================================================================
TINY - Minimal HTTP/1.1 Web Server
================================================================================
[INIT] Document root: ./www
[INIT] Listening on port 8080
[INIT] Server ready. Press Ctrl+C to shutdown.
$ curl -v http://localhost:8080/index.html
* Trying 127.0.0.1:8080...
* Connected to localhost (127.0.0.1) port 8080 (#0)
> GET /index.html HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/7.68.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Server: Tiny/1.0
< Date: Sat, 14 Dec 2024 15:30:42 GMT
< Content-Type: text/html
< Content-Length: 1234
< Connection: close
<
<!DOCTYPE html>
<html>...
* Closing connection 0
# Server log output:
[2024-12-14 15:30:42] 127.0.0.1:54321 "GET /index.html HTTP/1.1" 200 1234 0.002ms
$ curl -v 'http://localhost:8080/cgi-bin/adder?15&213'
* Trying 127.0.0.1:8080...
* Connected to localhost (127.0.0.1) port 8080 (#0)
> GET /cgi-bin/adder?15&213 HTTP/1.1
> Host: localhost:8080
>
< HTTP/1.1 200 OK
< Server: Tiny/1.0
< Content-Type: text/html
< Transfer-Encoding: chunked
<
<html>
<head><title>Adder Result</title></head>
<body>
<h1>Welcome to adder.cgi</h1>
<p>The sum of 15 and 213 is 228</p>
</body>
</html>
* Closing connection 0
[2024-12-14 15:31:15] 127.0.0.1:54322 "GET /cgi-bin/adder?15&213 HTTP/1.1" 200 - 15.3ms (CGI)
$ curl -I http://localhost:8080/images/logo.png
HTTP/1.1 200 OK
Server: Tiny/1.0
Date: Sat, 14 Dec 2024 15:32:00 GMT
Content-Type: image/png
Content-Length: 45678
Last-Modified: Fri, 13 Dec 2024 10:00:00 GMT
Connection: close
$ curl http://localhost:8080/nonexistent.html
<!DOCTYPE html>
<html>
<head><title>404 Not Found</title></head>
<body>
<h1>404 Not Found</h1>
<p>The requested URL /nonexistent.html was not found on this server.</p>
<hr><address>Tiny/1.0 Server</address>
</body>
</html>
[2024-12-14 15:32:30] 127.0.0.1:54323 "GET /nonexistent.html HTTP/1.1" 404 312 0.001ms
$ curl -i -X POST -d "name=test" http://localhost:8080/form
HTTP/1.1 501 Not Implemented
Server: Tiny/1.0
Content-Type: text/html
Content-Length: 156
<!DOCTYPE html>
<html><body>
<h1>501 Method Not Implemented</h1>
<p>Tiny does not support the POST method.</p>
</body></html>
# Load testing with wrk:
$ wrk -t4 -c100 -d30s http://localhost:8080/index.html
Running 30s test @ http://localhost:8080/index.html
4 threads and 100 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 2.34ms 1.12ms 45.21ms 89.32%
Req/Sec 10.87k 1.23k 14.56k 72.15%
1,301,245 requests in 30.01s, 1.52GB read
Requests/sec: 43,363.21
Transfer/sec: 51.89MB
The Core Question You're Answering
"How does a web server actually work at the systems level, from TCP socket to HTTP response?"
Every web developer uses web servers daily, but few understand what happens between socket() and the HTML appearing in the browser. This project demystifies the entire stack: connection handling, protocol parsing, file serving, and process-based CGI.
Concepts You Must Understand First
- TCP socket programming (CS:APP 11.4) - socket, bind, listen, accept, read, write
- HTTP protocol basics (RFC 2616) - Request/response structure, headers, status codes
- Unix I/O (CS:APP 10.1-10.4) - File descriptors, open, read, write, close
- Robust I/O (CS:APP 10.5) - rio_readlineb for buffered line reading
- Process control (CS:APP 8.2-8.4) - fork, exec, wait for CGI
- MIME types - Content-Type mapping from file extensions
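The MIME-type mapping can be as simple as a hard-coded table keyed on the file extension. A sketch (mime_type is my name) that falls back to a safe default:

```c
#include <string.h>

/* Map a file extension to a Content-Type value.
 * Real servers often consult /etc/mime.types instead. */
const char *mime_type(const char *path) {
    static const struct { const char *ext, *type; } table[] = {
        { ".html", "text/html" },       { ".css", "text/css" },
        { ".js",   "text/javascript" }, { ".png", "image/png" },
        { ".jpg",  "image/jpeg" },      { ".gif", "image/gif" },
        { ".txt",  "text/plain" },
    };
    const char *dot = strrchr(path, '.');   /* last '.' in the path */
    if (dot)
        for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
            if (strcmp(dot, table[i].ext) == 0)
                return table[i].type;
    return "application/octet-stream";      /* unknown: treat as bytes */
}
```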
Questions to Guide Your Design
- How will you structure the main server loop? Single process? Fork per connection? Thread pool?
- How do you parse HTTP requests robustly? What if request is malformed? Too long?
- How do you prevent directory traversal attacks? GET /../../../etc/passwd?
- How do you determine MIME types? Hard-coded table? /etc/mime.types?
- How do you implement CGI? Which environment variables? How to pass the query string?
- How do you handle partial reads/writes? TCP doesn't guarantee message boundaries
- How do you handle persistent connections? Connection: keep-alive vs close?
- How do you handle signals? SIGPIPE when client disconnects, SIGCHLD from CGI
Thinking Exercise
Before writing any code, trace through this HTTP transaction:
// Client sends (bytes over TCP):
"GET /images/cat.jpg HTTP/1.1\r\n"
"Host: localhost:8080\r\n"
"User-Agent: Mozilla/5.0\r\n"
"Accept: image/jpeg,image/*\r\n"
"\r\n"
// Server must:
// 1. Parse request line: method="GET", uri="/images/cat.jpg", version="HTTP/1.1"
// 2. Parse headers into key-value pairs
// 3. Validate: method supported? URI safe? Version OK?
// 4. Map URI to filesystem: "./www/images/cat.jpg"
// 5. Check file exists and is readable
// 6. Determine Content-Type from extension: "image/jpeg"
// 7. Get file size with stat()
// 8. Send response:
"HTTP/1.1 200 OK\r\n"
"Server: Tiny/1.0\r\n"
"Content-Type: image/jpeg\r\n"
"Content-Length: 45678\r\n"
"\r\n"
// <45678 bytes of JPEG data>
Now trace a CGI request:
// Client sends:
"GET /cgi-bin/adder?15&213 HTTP/1.1\r\n"
"Host: localhost:8080\r\n"
"\r\n"
// Server must:
// 1. Parse URI: path="/cgi-bin/adder", query_string="15&213"
// 2. Detect CGI (path starts with /cgi-bin/)
// 3. fork() child process
// 4. In child:
// - dup2() client socket to STDOUT
// - setenv("QUERY_STRING", "15&213")
// - setenv("REQUEST_METHOD", "GET")
// - execve("./www/cgi-bin/adder", ...)
// 5. In parent: wait() for child, then continue
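The CGI steps in the trace map almost one-to-one onto code. Here is a sketch with error handling trimmed (serve_cgi is an illustrative name; a real server must also validate the CGI path and check fork/exec failures):

```c
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

extern char **environ;

/* Run one CGI program: fork, point the child's stdout at the client
 * socket, export the CGI variables, and exec the program. */
void serve_cgi(int client_fd, const char *prog, const char *query) {
    pid_t pid = fork();
    if (pid == 0) {                        /* child */
        setenv("QUERY_STRING", query, 1);
        setenv("REQUEST_METHOD", "GET", 1);
        dup2(client_fd, STDOUT_FILENO);    /* CGI output goes to the client */
        char *argv[] = { (char *)prog, NULL };
        execve(prog, argv, environ);
        _exit(127);                        /* reached only if exec failed */
    }
    waitpid(pid, NULL, 0);                 /* parent reaps the child */
}
```

Note the child writes the Content-Type header itself; the server only emits the status line before handing the socket over, exactly as the trace above describes.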
The Interview Questions Theyโll Ask
- โExplain the socket API calls for a TCP serverโ
socket()- create endpointbind()- attach to portlisten()- mark as passive (accepting connections)accept()- block until client connects, return new fdread()/write()- exchange dataclose()- terminate connection
- "How do you handle slow clients?"
- Problem: read() blocks if client sends slowly
- Solutions: non-blocking I/O, select/poll/epoll, timeouts
- For this project: accept slowness (educational focus)
- Production: event-driven with timeout
- "What's a directory traversal attack and how do you prevent it?"
- Attack: GET /../../../etc/passwd
- Naive: prepend doc root, but .. escapes it
- Fix: resolve path with realpath(), check prefix matches doc root
- Also: reject paths containing .. directly
- "How does CGI work?"
- Server forks child, sets environment variables
- Redirects childโs stdout to client socket
- Executes CGI program
- Program writes HTTP response to stdout (goes to client)
- Server waits for child to finish
- "What happens if the client disconnects mid-transfer?"
- write() to closed socket generates SIGPIPE
- Default: terminate process
- Fix: signal(SIGPIPE, SIG_IGN) and check write() return value
- Or: use send() with MSG_NOSIGNAL flag
- "How would you add HTTPS support?"
- Use OpenSSL or similar TLS library
- SSL_accept() instead of just accept()
- SSL_read()/SSL_write() instead of read()/write()
- Handle certificate loading, cipher selection
Hints in Layers
Layer 1 - Socket Setup:
#include <sys/socket.h>
#include <netinet/in.h>
int open_listenfd(int port) {
int listenfd = socket(AF_INET, SOCK_STREAM, 0);
if (listenfd < 0) return -1;
// Allow port reuse (avoid "Address already in use")
int optval = 1;
setsockopt(listenfd, SOL_SOCKET, SO_REUSEADDR, &optval, sizeof(optval));
struct sockaddr_in addr = {
.sin_family = AF_INET,
.sin_port = htons(port),
.sin_addr.s_addr = htonl(INADDR_ANY)
};
if (bind(listenfd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
close(listenfd);
return -1;
}
if (listen(listenfd, 1024) < 0) {
close(listenfd);
return -1;
}
return listenfd;
}
Layer 2 - Robust I/O (from CS:APP):
typedef struct {
int fd;
int cnt; // Unread bytes in buffer
char *bufptr; // Next unread byte
char buf[8192];
} rio_t;
void rio_readinitb(rio_t *rp, int fd) {
rp->fd = fd;
rp->cnt = 0;
rp->bufptr = rp->buf;
}
// Read a line (handles partial reads)
ssize_t rio_readlineb(rio_t *rp, void *usrbuf, size_t maxlen) {
char *bufp = usrbuf;
for (int n = 1; n < maxlen; n++) {
char c;
int rc = rio_read(rp, &c, 1);
if (rc == 1) {
*bufp++ = c;
if (c == '\n') break;
} else if (rc == 0) {
if (n == 1) return 0; // EOF, no data
break;
} else {
return -1; // Error
}
}
*bufp = 0;
return bufp - (char *)usrbuf;
}
Layer 3 - HTTP Request Parsing:
typedef struct {
char method[16];
char uri[2048];
char version[16];
char headers[32][2][256]; // Up to 32 headers
int header_count;
} HttpRequest;
int parse_request(rio_t *rp, HttpRequest *req) {
char line[2048];
// Read request line: "GET /index.html HTTP/1.1"
rio_readlineb(rp, line, sizeof(line));
if (sscanf(line, "%15s %2047s %15s", req->method, req->uri, req->version) != 3)
return -1;
// Read headers until blank line
req->header_count = 0;
while (rio_readlineb(rp, line, sizeof(line)) > 0) {
if (strcmp(line, "\r\n") == 0 || strcmp(line, "\n") == 0)
break;
char *colon = strchr(line, ':');
if (colon && req->header_count < 32) {
*colon = '\0';
strncpy(req->headers[req->header_count][0], line, 255);
req->headers[req->header_count][0][255] = '\0';
// Skip ':' plus any spaces, then trim the newline
char *value = colon + 1;
while (*value == ' ' || *value == '\t') value++;
value[strcspn(value, "\r\n")] = '\0';
strncpy(req->headers[req->header_count][1], value, 255);
req->headers[req->header_count][1][255] = '\0';
req->header_count++;
}
}
return 0;
}
Layer 4 - Static File Serving:
const char *get_mime_type(const char *filename) {
const char *ext = strrchr(filename, '.');
if (!ext) return "application/octet-stream";
if (strcmp(ext, ".html") == 0) return "text/html";
if (strcmp(ext, ".css") == 0) return "text/css";
if (strcmp(ext, ".js") == 0) return "application/javascript";
if (strcmp(ext, ".jpg") == 0) return "image/jpeg";
if (strcmp(ext, ".png") == 0) return "image/png";
if (strcmp(ext, ".gif") == 0) return "image/gif";
return "application/octet-stream";
}
void serve_static(int fd, const char *filename) {
struct stat sbuf;
if (stat(filename, &sbuf) < 0 || !S_ISREG(sbuf.st_mode)) {
send_error(fd, 404, "Not Found");
return;
}
int srcfd = open(filename, O_RDONLY);
if (srcfd < 0) {
send_error(fd, 403, "Forbidden");
return;
}
char *srcp = mmap(0, sbuf.st_size, PROT_READ, MAP_PRIVATE, srcfd, 0);
close(srcfd);
if (srcp == MAP_FAILED) {
send_error(fd, 500, "Internal Server Error");
return;
}
// Send headers
char header[512];
snprintf(header, sizeof(header),
"HTTP/1.1 200 OK\r\n"
"Server: Tiny/1.0\r\n"
"Content-Type: %s\r\n"
"Content-Length: %ld\r\n"
"\r\n",
get_mime_type(filename), (long)sbuf.st_size);
write(fd, header, strlen(header));
// Send body
write(fd, srcp, sbuf.st_size);
munmap(srcp, sbuf.st_size);
}
Layer 5 - CGI Handler:
void serve_cgi(int fd, const char *program, const char *query_string) {
pid_t pid = fork();
if (pid < 0) { send_error(fd, 500, "Internal Server Error"); return; }
if (pid == 0) { // Child
// Set CGI environment variables
setenv("QUERY_STRING", query_string ? query_string : "", 1);
setenv("REQUEST_METHOD", "GET", 1);
setenv("GATEWAY_INTERFACE", "CGI/1.1", 1);
// Redirect stdout to client socket
dup2(fd, STDOUT_FILENO);
close(fd);
// Execute CGI program
execl(program, program, NULL);
exit(1); // If exec fails
} else { // Parent
int status;
waitpid(pid, &status, 0);
}
}
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Unix I/O | CS:APP | Ch. 10 |
| Network Programming | CS:APP | Ch. 11 |
| Process Control | CS:APP | Ch. 8 |
| Sockets Deep Dive | Unix Network Programming Vol 1 | Ch. 1-8 |
| HTTP Protocol | TCP/IP Illustrated Vol 1 | Ch. 14 |
| Advanced I/O | TLPI | Ch. 63 |
Common Pitfalls & Debugging
Problem 1: "Address already in use" on restart
// Fix: Set SO_REUSEADDR before bind()
int optval = 1;
setsockopt(listenfd, SOL_SOCKET, SO_REUSEADDR, &optval, sizeof(optval));
Problem 2: Server crashes on client disconnect
// Fix: Ignore SIGPIPE
signal(SIGPIPE, SIG_IGN);
// Then check write() return value
if (write(fd, data, len) < 0) {
// Client disconnected, clean up
}
Problem 3: Zombie CGI processes
// Fix: Handle SIGCHLD
void sigchld_handler(int sig) {
while (waitpid(-1, NULL, WNOHANG) > 0);
}
signal(SIGCHLD, sigchld_handler);
Problem 4: Directory traversal vulnerability
// WRONG:
char path[256];
snprintf(path, sizeof(path), "./www%s", uri); // uri="/../../../etc/passwd"
// FIX: realpath() returns an ABSOLUTE path, so compare against the
// canonicalized doc root, not the literal "./www"
char docroot[PATH_MAX], realpath_buf[PATH_MAX];
realpath("./www", docroot);
char *real = realpath(path, realpath_buf);
if (!real || strncmp(real, docroot, strlen(docroot)) != 0) {
send_error(fd, 403, "Forbidden");
return;
}
Project 21: Thread Pool Implementation
| Attribute | Value |
|---|---|
| Language | C (alt: Rust, Go, Java) |
| Difficulty | Advanced |
| Time | 1โ2 weeks |
| Chapters | 12 |
What you'll build: A reusable thread pool library with a bounded work queue, condition variables, and clean shutdown semantics.
Why it matters: The producer-consumer pattern is everywhere; this project forces you to get synchronization right.
Core challenges:
- Implementing correct producer-consumer queue with blocking
- Handling shutdown safely (no task loss, no deadlocks)
- Avoiding thundering herd and backpressure issues
Real World Outcome:
$ ./threadpool --workers=4 --demo
================================================================================
THREAD POOL DEMONSTRATION
================================================================================
[INIT] Creating pool with 4 worker threads
[INIT] Work queue capacity: 64 tasks
[WORKER-0] Started, waiting for work...
[WORKER-1] Started, waiting for work...
[WORKER-2] Started, waiting for work...
[WORKER-3] Started, waiting for work...
[SUBMIT] Task 1: compute_fibonacci(40)
[SUBMIT] Task 2: compute_fibonacci(35)
[SUBMIT] Task 3: compress_file("data.bin")
[SUBMIT] Task 4: compute_factorial(20)
[SUBMIT] Task 5: hash_password("secret")
[WORKER-0] Executing task 1: compute_fibonacci(40)
[WORKER-1] Executing task 2: compute_fibonacci(35)
[WORKER-2] Executing task 3: compress_file("data.bin")
[WORKER-3] Executing task 4: compute_factorial(20)
[WORKER-1] Completed task 2 in 89ms (result: 9227465)
[WORKER-1] Executing task 5: hash_password("secret")
[WORKER-3] Completed task 4 in 12ms (result: 2432902008176640000)
[WORKER-1] Completed task 5 in 156ms
[WORKER-2] Completed task 3 in 423ms (compressed 1.2MB -> 340KB)
[WORKER-0] Completed task 1 in 1247ms (result: 102334155)
================================================================================
POOL STATISTICS
================================================================================
Total tasks submitted: 5
Total tasks completed: 5
Average wait time: 34ms
Average execution time: 385ms
Queue high watermark: 4/64
$ ./threadpool --workers=2 --stress-test --tasks=10000
================================================================================
STRESS TEST MODE
================================================================================
[CONFIG] Workers: 2, Tasks: 10000, Queue size: 256
[PROGRESS] 1000/10000 tasks (10.0%) - 847 tasks/sec
[PROGRESS] 2000/10000 tasks (20.0%) - 892 tasks/sec
[PROGRESS] 5000/10000 tasks (50.0%) - 921 tasks/sec
[PROGRESS] 10000/10000 tasks (100.0%) - 934 tasks/sec
[RESULT] All 10000 tasks completed successfully
[RESULT] Total time: 10.71s
[RESULT] Throughput: 934 tasks/sec
[RESULT] No deadlocks detected
[RESULT] No tasks lost during shutdown
$ ./threadpool --workers=4 --graceful-shutdown-test
================================================================================
GRACEFUL SHUTDOWN TEST
================================================================================
[TEST] Submitting 100 long-running tasks...
[TEST] Requesting shutdown while 87 tasks pending...
[POOL] Shutdown requested - completing in-flight tasks
[POOL] Worker-0 finishing current task, then exiting
[POOL] Worker-1 finishing current task, then exiting
[POOL] Worker-2 finishing current task, then exiting
[POOL] Worker-3 finishing current task, then exiting
[POOL] Draining remaining 83 queued tasks...
[POOL] All workers joined
[POOL] Pool destroyed cleanly
[RESULT] PASS - All 100 tasks completed
[RESULT] PASS - No memory leaks (valgrind clean)
[RESULT] PASS - Shutdown completed in 2.3s
The Core Question You're Answering: How do you safely coordinate multiple threads that share a work queue, ensuring no race conditions, no lost work, and clean shutdown?
Concepts You Must Understand First:
- Mutex fundamentals (CS:APP 12.5.4) - Why mutual exclusion is necessary and how pthread_mutex_t provides atomicity
- Condition variables (CS:APP 12.5.5) - The "wait and signal" pattern for efficient blocking without busy-waiting
- Producer-consumer pattern (OSTEP Ch. 30) - The classic bounded buffer problem and its solution
- Thread lifecycle (TLPI Ch. 29) - Creation, termination, joining, and detaching threads
- Memory visibility (CS:APP 12.5.1) - Why compiler reordering and CPU caches can cause subtle bugs without proper synchronization
- Deadlock prevention (OSTEP Ch. 32) - The four conditions for deadlock and how to avoid them
Questions to Guide Your Design:
- When a worker thread finds an empty queue, should it spin-wait, sleep, or use a condition variable? What are the tradeoffs?
- What happens if a producer tries to submit work when the queue is full? Block, drop, or grow the queue?
- How do you signal shutdown to workers? A special "poison pill" task, a shared flag, or both?
- Should workers check the shutdown flag before or after dequeuing a task? What's the difference?
- If you use a circular buffer for the queue, how do you handle the wraparound correctly with concurrent access?
- How do you avoid the "thundering herd" problem when multiple workers wake up for one task?
- What happens if a worker thread crashes? Should the pool detect this and spawn a replacement?
- How do you make task submission return quickly even if all workers are busy?
Thinking Exercise:
Before writing code, trace through this scenario by hand:
// Initial state: queue is EMPTY, pool has 2 workers (both waiting)
// Thread A (producer):
pthread_mutex_lock(&pool->lock);
// A acquires lock
enqueue(&pool->queue, task1);
// A signals condition variable
pthread_cond_signal(&pool->not_empty);
pthread_mutex_unlock(&pool->lock);
// Meanwhile, Thread B (producer) tries to submit:
pthread_mutex_lock(&pool->lock); // <-- What happens here?
// Worker-0 (waiting on cond var):
// wakes up from pthread_cond_wait()
// <-- What must Worker-0 do before accessing the queue?
// Worker-1 (also waiting):
// <-- Should Worker-1 wake up? What does it do?
Draw a timeline showing which thread holds the mutex at each moment. What if you used pthread_cond_broadcast() instead of pthread_cond_signal()?
The Interview Questions They'll Ask:
- "What's the difference between a mutex and a semaphore? When would you use each?"
- Expected answer: Mutex is for mutual exclusion (one thread at a time), semaphore is for counting resources. Use mutex for protecting shared data, semaphore for limiting concurrent access to N resources. A binary semaphore is similar to a mutex but has different ownership semantics (any thread can signal, only the owner should unlock a mutex).
- "Explain why this thread pool implementation might deadlock." (They'll show buggy code)
- Expected answer: Look for: lock ordering violations, missing unlock on error paths, waiting on a condition while holding multiple locks, or joining a thread that's waiting on a lock you hold.
- "How would you implement work stealing between thread pool workers?"
- Expected answer: Each worker has its own deque. Workers push/pop from their own deque (LIFO for cache locality). When empty, steal from the tail of another worker's deque. Requires lock-free or fine-grained locking for the stealing operation.
- "What's the spurious wakeup problem and how do you handle it?"
- Expected answer: Condition variable wait can return even when no signal was sent. Always wrap pthread_cond_wait in a while loop that rechecks the condition, not an if statement.
- "How do you choose the optimal number of threads for a thread pool?"
- Expected answer: For CPU-bound work: number of cores. For I/O-bound work: higher (2x-10x cores) depending on I/O wait ratio. Little's Law can help: N = arrival_rate * average_service_time. In practice, benchmark and tune.
- "What's the ABA problem and can it affect this implementation?"
- Expected answer: ABA occurs in lock-free structures when a value changes A->B->A between read and CAS. With mutex-protected queues, ABA isn't an issue. But if you tried to make a lock-free queue, you'd need hazard pointers or epoch-based reclamation.
Hints in Layers:
Layer 1 - Core Data Structures:
typedef struct task {
void (*function)(void *arg);
void *arg;
struct task *next;
} task_t;
typedef struct threadpool {
pthread_mutex_t lock;
pthread_cond_t not_empty; // Signal when queue becomes non-empty
pthread_cond_t not_full; // Signal when queue has space (for bounded)
task_t *queue_head;
task_t *queue_tail;
int queue_size;
int queue_capacity;
pthread_t *workers;
int worker_count;
int shutdown; // 0 = running, 1 = graceful, 2 = immediate
} threadpool_t;
Layer 2 - Worker Thread Loop Pattern:
void *worker_thread(void *arg) {
threadpool_t *pool = (threadpool_t *)arg;
while (1) {
pthread_mutex_lock(&pool->lock);
// Wait while queue is empty AND not shutting down
while (pool->queue_size == 0 && !pool->shutdown) {
pthread_cond_wait(&pool->not_empty, &pool->lock);
}
// Check shutdown AFTER waking
if (pool->shutdown && pool->queue_size == 0) {
pthread_mutex_unlock(&pool->lock);
break;
}
task_t *task = dequeue(pool); // Remove from queue
pthread_mutex_unlock(&pool->lock);
// Execute OUTSIDE the lock!
task->function(task->arg);
free(task);
}
return NULL;
}
Layer 3 - Submit with Backpressure:
int threadpool_submit(threadpool_t *pool, void (*fn)(void*), void *arg) {
task_t *task = malloc(sizeof(task_t));
if (!task) return -1; // Allocation failure
task->function = fn;
task->arg = arg;
task->next = NULL;
pthread_mutex_lock(&pool->lock);
// Block if queue is full (backpressure)
while (pool->queue_size >= pool->queue_capacity && !pool->shutdown) {
pthread_cond_wait(&pool->not_full, &pool->lock);
}
if (pool->shutdown) {
pthread_mutex_unlock(&pool->lock);
free(task);
return -1; // Rejected
}
enqueue(pool, task);
pthread_cond_signal(&pool->not_empty); // Wake ONE worker
pthread_mutex_unlock(&pool->lock);
return 0;
}
Layer 4 - Graceful Shutdown:
void threadpool_shutdown(threadpool_t *pool, int graceful) {
pthread_mutex_lock(&pool->lock);
pool->shutdown = graceful ? 1 : 2;
pthread_cond_broadcast(&pool->not_empty); // Wake ALL workers
pthread_cond_broadcast(&pool->not_full); // Unblock any blocked submitters
pthread_mutex_unlock(&pool->lock);
// Join all workers
for (int i = 0; i < pool->worker_count; i++) {
pthread_join(pool->workers[i], NULL);
}
// If immediate shutdown, drain remaining tasks
if (!graceful) {
while (pool->queue_size > 0) {
task_t *t = dequeue(pool);
free(t); // Or call a cancellation callback
}
}
}
Layer 5 - Testing for Correctness:
// Test: No lost tasks under concurrent submit/shutdown
void stress_test() {
atomic_int completed = 0;
threadpool_t *pool = threadpool_create(4, 64);
// Submit from multiple producer threads simultaneously
pthread_t producers[8];
for (int i = 0; i < 8; i++) {
pthread_create(&producers[i], NULL, submit_1000_tasks, &completed);
}
// Wait a bit then request shutdown
usleep(100000);
threadpool_shutdown(pool, 1); // Graceful
for (int i = 0; i < 8; i++) {
pthread_join(producers[i], NULL);
}
// Verify: nothing was lost. Graceful shutdown may reject late submits
// (submit returns -1), so a production test must count acceptances;
// here we assume all 8000 tasks were accepted before shutdown.
assert(completed == 8000);
}
Books That Will Help:
| Book | Chapters | What You'll Learn |
|---|---|---|
| CS:APP 3e | 12.4-12.5 | Threads, mutexes, condition variables, thread safety |
| OSTEP | 26-32 | Locks, condition variables, semaphores, common concurrency bugs |
| TLPI | 29-33 | POSIX threads, mutexes, conditions, thread cancellation |
| C++ Concurrency in Action | 2-4 | Modern patterns (applicable to C with adaptation) |
| APUE 3e | 11-12 | Threads, thread control, thread synchronization |
Common Pitfalls & Debugging:
- Bug: Forgetting to recheck condition after waking from pthread_cond_wait
// WRONG - spurious wakeup breaks this
if (pool->queue_size == 0)
pthread_cond_wait(&pool->not_empty, &pool->lock);
task = dequeue(); // Might crash on empty queue!
// RIGHT - while loop handles spurious wakeups
while (pool->queue_size == 0 && !pool->shutdown)
pthread_cond_wait(&pool->not_empty, &pool->lock);
- Bug: Executing task while holding the lock
// WRONG - blocks all other workers during task execution!
pthread_mutex_lock(&pool->lock);
task = dequeue(pool);
task->function(task->arg); // Could take seconds!
pthread_mutex_unlock(&pool->lock);
// RIGHT - release lock before executing
pthread_mutex_lock(&pool->lock);
task = dequeue(pool);
pthread_mutex_unlock(&pool->lock);
task->function(task->arg); // Other workers can proceed
- Bug: Race condition during shutdown
// WRONG - worker might miss the shutdown signal
if (pool->shutdown) break; // Checked without lock!
pthread_cond_wait(...); // Might wait forever
// RIGHT - check with lock held, use broadcast for shutdown
pthread_mutex_lock(&pool->lock);
while (queue_empty && !pool->shutdown) {
pthread_cond_wait(...);
}
if (pool->shutdown && queue_empty) {
pthread_mutex_unlock(&pool->lock);
break;
}
- Bug: Memory leak on rejected tasks during shutdown
// WRONG - caller doesn't know task was rejected
if (pool->shutdown) {
pthread_mutex_unlock(&pool->lock);
return; // task memory leaked!
}
// RIGHT - return error code, let caller handle cleanup
if (pool->shutdown) {
pthread_mutex_unlock(&pool->lock);
free(task);
return -1; // ESHUTDOWN
}
Project 22: Signal-Safe Printf
| Attribute | Value |
|---|---|
| Language | C (alt: Rust) |
| Difficulty | Advanced |
| Time | Weekend |
| Chapters | 8, 12 |
What you'll build: A tiny printf-like facility (sio) that is safe to call from signal handlers using only async-signal-safe operations.
Why it matters: Forces you to understand why printf, malloc, and most libc functions are unsafe in handlers.
Core challenges:
- Avoiding all non-async-signal-safe functions
- Implementing integer/string formatting with only write(2)
- Testing under high-frequency signal delivery
Real World Outcome:
$ ./sio_demo
================================================================================
SIGNAL-SAFE I/O (SIO) DEMONSTRATION
================================================================================
[TEST 1] Basic output from main()
sio_puts: Hello from signal-safe I/O!
sio_putl: The answer is 42
sio_puthex: Address = 0x7fff5fbff8c0
[TEST 2] Signal handler output (SIGUSR1)
$ kill -USR1 $(pgrep sio_demo)
[HANDLER] Caught signal 10 (SIGUSR1)
[HANDLER] Handler invoked 1 time(s)
[HANDLER] Current errno preserved: 0
[TEST 3] Rapid signal delivery stress test
Sending 10000 SIGALRM signals at 10000 Hz...
[HANDLER] Signal count: 1000
[HANDLER] Signal count: 2000
[HANDLER] Signal count: 5000
[HANDLER] Signal count: 10000
[RESULT] All 10000 signals handled
[RESULT] No crashes, no corruption, no deadlocks
[RESULT] Printf equivalent calls in handler: 0 (verified safe)
$ ./sio_demo --compare-with-printf
================================================================================
SAFETY COMPARISON: SIO vs PRINTF
================================================================================
[SETUP] Installing SIGALRM handler that prints a message
[SETUP] Handler will fire every 100 microseconds
[TEST] Main thread calling malloc() in a loop...
--- Using printf() in handler (UNSAFE) ---
[MAIN] Iteration 1000...
[MAIN] Iteration 2000...
[DEADLOCK DETECTED] Program hung after 2847 iterations
[CAUSE] printf() called from handler while main held stdio lock
--- Using sio_puts() in handler (SAFE) ---
[MAIN] Iteration 1000...
[HANDLER] tick 50
[MAIN] Iteration 2000...
[HANDLER] tick 100
[MAIN] Iteration 10000...
[HANDLER] tick 500
[RESULT] Completed 10000 iterations with 500 handler invocations
[RESULT] No deadlocks with async-signal-safe sio functions
$ ./sio_demo --format-test
================================================================================
FORMAT SPECIFIER TESTS
================================================================================
Testing sio_printf() format specifiers:
sio_printf("Integer: %d\n", -42) -> Integer: -42
sio_printf("Unsigned: %u\n", 42) -> Unsigned: 42
sio_printf("Hex: 0x%x\n", 255) -> Hex: 0xff
sio_printf("Long: %ld\n", 1234567890123) -> Long: 1234567890123
sio_printf("String: %s\n", "hello") -> String: hello
sio_printf("Pointer: %p\n", ptr) -> Pointer: 0x7fff5fbff8c0
sio_printf("Percent: %%\n") -> Percent: %
sio_printf("Width: %10d\n", 42) -> Width: 42
sio_printf("Multiple: %s=%d\n", "x", 5) -> Multiple: x=5
[RESULT] All format specifiers working correctly
[RESULT] No malloc, no stdio, only write(2) syscalls
The Core Question You're Answering: Why can't you call printf() from a signal handler, and how do you build output functions that are safe to call from any context?
Concepts You Must Understand First:
- Async-signal-safety (CS:APP 8.5.5) - Which functions can be safely called from signal handlers and why most cannot
- Reentrancy (TLPI 21.1.2) - What happens when a function is interrupted and called again before completing
- Signal delivery semantics (CS:APP 8.5) - How signals interrupt execution at arbitrary points
- The write(2) syscall (TLPI 4.3) - The only safe way to output from a signal handler
- Errno preservation (TLPI 21.1.3) - Why handlers must save and restore errno
- Lock-free programming basics (TLPI 21.1.2) - Why mutexes in handlers cause deadlocks
Questions to Guide Your Design:
- Why is printf() not async-signal-safe? What specific resources does it use that cause problems?
- How do you convert an integer to a string without calling sprintf(), snprintf(), or any memory allocation?
- What buffer should you use for formatting? Stack-allocated? Static? What are the tradeoffs?
- How do you handle negative numbers in your integer-to-string conversion?
- Should sio functions buffer output or write immediately? What does buffering require that makes it unsafe?
- How do you implement hexadecimal output without using lookup tables that might not be in cache?
- What happens if write(2) is interrupted by another signal? How do you handle partial writes?
- How do you test that your implementation is truly async-signal-safe?
Thinking Exercise:
Before coding, analyze why this handler deadlocks:
pthread_mutex_t stdio_lock = PTHREAD_MUTEX_INITIALIZER;
char buffer[1024];
void safe_looking_print(const char *msg) {
pthread_mutex_lock(&stdio_lock);
strcpy(buffer, msg);
printf("%s\n", buffer);
pthread_mutex_unlock(&stdio_lock);
}
void handler(int sig) {
safe_looking_print("Signal received!"); // <-- Why does this deadlock?
}
int main() {
signal(SIGINT, handler);
while (1) {
safe_looking_print("Main loop iteration");
}
}
Trace through: What happens if SIGINT arrives while main() is between pthread_mutex_lock and pthread_mutex_unlock?
Now consider: Would making the mutex recursive solve the problem? (Hint: What about printf's internal locks?)
The Interview Questions They'll Ask:
- "What makes a function async-signal-safe? Give examples of safe and unsafe functions."
- Expected answer: A function is async-signal-safe if it can be safely called from a signal handler, even if the main program was interrupted in the middle of the same function. Safe: write(), _exit(), signal(). Unsafe: printf(), malloc(), any function using locks or global state. The key issue is reentrancy and internal locks.
- "Why is malloc() not async-signal-safe?"
- Expected answer: malloc() uses internal locks to protect the heap data structures. If a signal interrupts malloc() while it holds the lock, and the handler calls malloc(), you get deadlock. Also, malloc() may be in the middle of updating heap metadata, leaving it in an inconsistent state.
- "How would you implement a signal handler that needs to log messages?"
- Expected answer: Use only write(2) for output. Pre-format simple messages as string constants. For dynamic data, implement integer-to-string conversion without malloc. Consider using a pipe or signal-safe queue to defer complex logging to the main thread.
- "Explain the errno problem in signal handlers and how to solve it."
- Expected answer: Many async-signal-safe functions (like write()) can set errno. If the handler modifies errno and the main code was about to check errno from its own syscall, the result is corrupted. Solution: Save errno at handler entry, restore before return.
- "What's the difference between reentrant and thread-safe?"
- Expected answer: Thread-safe means safe when called concurrently from multiple threads (usually via locks). Reentrant means safe when interrupted and re-invoked before completing (no global/static state, no locks). All reentrant functions are thread-safe, but not vice versa. Async-signal-safe requires reentrancy.
- "How would you implement a printf-like format string parser that's async-signal-safe?"
- Expected answer: Parse the format string character by character. For each specifier, convert the value to a string using stack-local buffers and manual conversion (repeated division for integers). Accumulate output in a stack buffer, then call write() once. No dynamic allocation, no stdio.
Hints in Layers:
Layer 1 - Core Output Primitive:
// The ONLY function we can use for output in a signal handler
ssize_t sio_write(const char *s, size_t n) {
size_t remaining = n;
const char *p = s;
while (remaining > 0) {
ssize_t written = write(STDOUT_FILENO, p, remaining);
if (written < 0) {
if (errno == EINTR) continue; // Interrupted, retry
return -1; // Real error
}
remaining -= written;
p += written;
}
return n;
}
// Wrapper for null-terminated strings
ssize_t sio_puts(const char *s) {
return sio_write(s, strlen(s));
}
Layer 2 - Integer to String (No malloc!):
// Convert integer to string in caller-provided buffer
// Returns pointer to start of number within buffer
char *sio_itoa(long value, char *buf, size_t bufsize) {
char *p = buf + bufsize - 1;
*p = '\0';
int negative = (value < 0);
// Negate in unsigned arithmetic so LONG_MIN doesn't overflow
unsigned long uval = negative ? -(unsigned long)value : (unsigned long)value;
// Build string backwards
do {
*--p = '0' + (uval % 10);
uval /= 10;
} while (uval > 0 && p > buf);
if (negative && p > buf) {
*--p = '-';
}
return p; // Start of the number string
}
// Output a long integer
ssize_t sio_putl(long value) {
char buf[32]; // Stack allocated!
char *s = sio_itoa(value, buf, sizeof(buf));
return sio_puts(s);
}
Layer 3 - Hexadecimal Output:
ssize_t sio_puthex(unsigned long value) {
char buf[20];
char *p = buf + sizeof(buf) - 1;
*p = '\0';
if (value == 0) {
*--p = '0';
} else {
while (value > 0 && p > buf) {
int digit = value & 0xF;
*--p = (digit < 10) ? ('0' + digit) : ('a' + digit - 10);
value >>= 4;
}
}
// Add "0x" prefix
*--p = 'x';
*--p = '0';
return sio_puts(p);
}
Layer 4 - Signal Handler Pattern:
volatile sig_atomic_t signal_count = 0;
void handler(int sig) {
// CRITICAL: Save and restore errno
int saved_errno = errno;
signal_count++; // sig_atomic_t is safe to modify
// Safe output
sio_puts("[HANDLER] Signal ");
sio_putl(sig);
sio_puts(" received (count: ");
sio_putl(signal_count);
sio_puts(")\n");
errno = saved_errno; // Restore before return
}
Layer 5 - Simple Format String Parser:
// Minimal printf subset: %s, %d, %ld, %x, %p, %%
void sio_printf(const char *fmt, ...) {
va_list ap;
va_start(ap, fmt);
char buf[32];
const char *p = fmt;
while (*p) {
if (*p != '%') {
sio_write(p, 1);
p++;
continue;
}
p++; // Skip '%'
switch (*p) {
case 'd': {
int val = va_arg(ap, int);
sio_puts(sio_itoa(val, buf, sizeof(buf)));
break;
}
case 'l':
p++;
if (*p == 'd') {
long val = va_arg(ap, long);
sio_puts(sio_itoa(val, buf, sizeof(buf)));
}
break;
case 's': {
char *s = va_arg(ap, char*);
sio_puts(s ? s : "(null)");
break;
}
case 'x': {
unsigned val = va_arg(ap, unsigned);
sio_puthex(val);
break;
}
case 'p': {
void *ptr = va_arg(ap, void*);
sio_puthex((unsigned long)ptr);
break;
}
case '%':
sio_write("%", 1);
break;
}
p++;
}
va_end(ap);
}
Books That Will Help:
| Book | Chapters | What You'll Learn |
|---|---|---|
| CS:APP 3e | 8.5 | Signal concepts, async-signal-safety, handler design |
| TLPI | 21-22 | Signals, signal handlers, async-signal-safe functions (comprehensive list) |
| APUE 3e | 10 | Signals (POSIX perspective) |
| OSTEP | Ch. 5 (Process API) | Understanding how signals fit with process model |
| Secure Coding in C/C++ | Ch. 5 | Signal handling vulnerabilities |
Common Pitfalls & Debugging:
- Bug: Forgetting to save/restore errno
void handler(int sig) {
// WRONG - corrupts errno if main code is checking it
write(STDOUT_FILENO, "signal\n", 7); // write() might set errno
}
// RIGHT
void handler(int sig) {
int saved_errno = errno;
write(STDOUT_FILENO, "signal\n", 7);
errno = saved_errno;
}
- Bug: Using sprintf() "because it doesn't malloc"
// WRONG - sprintf uses stdio buffers, internal locks
void handler(int sig) {
char buf[64];
sprintf(buf, "Signal %d\n", sig); // NOT async-signal-safe!
write(STDOUT_FILENO, buf, strlen(buf));
}
// RIGHT - manual conversion
void handler(int sig) {
char buf[32];
char *p = sio_itoa(sig, buf, sizeof(buf));
sio_puts("Signal ");
sio_puts(p);
sio_puts("\n");
}
- Bug: Static buffers shared between handler and main
// WRONG - handler might corrupt buffer while main is using it
static char shared_buffer[256];
void handler(int sig) {
strcpy(shared_buffer, "interrupted!"); // Race condition!
}
// RIGHT - use stack-local buffers in handler
void handler(int sig) {
char local_buf[256]; // Each handler invocation gets its own
// ...
}
- Bug: Ignoring partial writes
// WRONG - write() might not write everything
void handler(int sig) {
char msg[] = "Very long message...";
write(STDOUT_FILENO, msg, sizeof(msg)); // Might only write part!
}
// RIGHT - loop until all bytes written
void sio_write_all(const char *buf, size_t n) {
while (n > 0) {
ssize_t written = write(STDOUT_FILENO, buf, n);
if (written <= 0) {
if (errno == EINTR) continue;
return; // Error
}
buf += written;
n -= written;
}
}
Project 23: Performance Profiler
| Attribute | Value |
|---|---|
| Language | C (alt: C++, Rust) |
| Difficulty | Advanced |
| Time | 1โ2 weeks |
| Chapters | 5, 8, 3 |
What you'll build: A sampling profiler that periodically interrupts a program, records where it is, and reports the hottest functions.
Why it matters: Understand what profilers can and can't tell you: bias, sampling error, and Heisenberg effects.
Core challenges:
- Implementing timer-based sampling (SIGPROF/ITIMER_PROF)
- Capturing instruction pointers and aggregating into reports
- Symbolizing addresses back to function names
Real World Outcome:
$ ./profiler --sample-rate=1000 -- ./target_program
================================================================================
SAMPLING PROFILER
================================================================================
[CONFIG] Sample rate: 1000 Hz (1ms interval)
[CONFIG] Using ITIMER_PROF (CPU time only)
[START] Profiling ./target_program (PID: 12847)
[PROGRESS] 1000 samples collected...
[PROGRESS] 5000 samples collected...
[PROGRESS] 10000 samples collected...
[END] Target exited with status 0
[STATS] Total samples: 14,293
[STATS] Unique instruction pointers: 847
[STATS] Profiling overhead: ~2.3%
================================================================================
FLAT PROFILE (Top 20 Functions)
================================================================================
%time samples function source:line
------- -------- -------------------------------- ----------------------
23.4% 3,345 matrix_multiply matrix.c:142
18.7% 2,673 vector_dot_product linalg.c:89
12.1% 1,730 quicksort_partition sort.c:67
8.9% 1,272 hash_table_lookup hash.c:234
6.2% 886 memcpy@plt (libc)
4.8% 686 strcmp@plt (libc)
3.7% 529 parse_json_object json.c:456
2.9% 414 allocate_buffer buffer.c:78
2.4% 343 compute_checksum crypto.c:123
2.1% 300 read_file_chunk io.c:89
1.8% 257 (unknown) 0x7f3a2b4c5d6e
...
13.0% 1,858 (other - 837 functions)
================================================================================
CALL GRAPH PROFILE
================================================================================
|--- vector_dot_product (18.7%)
matrix_multiply (23.4%) -|
|--- memcpy@plt (2.1% attributed)
|--- quicksort_partition (12.1%)
process_data (32.1%) -----|--- hash_table_lookup (8.9%)
|--- parse_json_object (3.7%)
$ ./profiler --flame-graph -- ./target_program > profile.svg
================================================================================
FLAME GRAPH GENERATION
================================================================================
[SAMPLING] Collecting call stacks at 997 Hz...
[STACKS] 8,234 unique stack traces captured
[RENDER] Generating SVG flame graph...
[OUTPUT] Flame graph written to: profile.svg (234 KB)
[TIP] Open in browser: firefox profile.svg
$ ./profiler --compare before.prof after.prof
================================================================================
DIFFERENTIAL PROFILE
================================================================================
Comparing: before.prof (14,293 samples) vs after.prof (13,892 samples)
Improved (faster):
function before after delta
---------------------------------------------------
matrix_multiply 23.4% 8.2% -15.2% (optimized!)
vector_dot_product 18.7% 12.1% -6.6%
Regressed (slower):
function before after delta
---------------------------------------------------
cache_lookup 1.2% 4.8% +3.6% (new bottleneck)
validate_input 0.8% 2.1% +1.3%
[SUMMARY] Overall improvement: 18.3% less CPU time in hot path
$ ./profiler --self-profile --overhead-test
================================================================================
PROFILER OVERHEAD ANALYSIS
================================================================================
[TEST] Running workload without profiling: 4.823s
[TEST] Running workload with profiling at 100 Hz: 4.831s (+0.17%)
[TEST] Running workload with profiling at 1000 Hz: 4.935s (+2.32%)
[TEST] Running workload with profiling at 10000 Hz: 5.647s (+17.09%)
[RECOMMENDATION] Use 1000 Hz for production profiling
[WARNING] Rates above 5000 Hz introduce significant overhead
The Core Question You're Answering: How do profilers like gprof, perf, and pprof measure where your program spends its time, and what are the limitations of statistical sampling?
Concepts You Must Understand First:
- Timer signals (SIGPROF/SIGVTALRM) (TLPI Ch. 23) - Different timers measure wall-clock, user CPU, or system CPU time
- Signal handlers and context (CS:APP 8.5) - How the interrupted context provides the instruction pointer
- Program counter / instruction pointer (CS:APP 3.4) - The CPU register that tells you where execution is
- Symbol tables and DWARF (CS:APP 7.5) - How to map addresses back to function names and line numbers
- Statistical sampling theory (CS:APP 5.14) - Why sampling works and its inherent error margins
- ASLR and PIE (CS:APP 7.12) - Address randomization affects address-to-symbol mapping
Questions to Guide Your Design:
- What's the difference between ITIMER_REAL, ITIMER_VIRTUAL, and ITIMER_PROF? Which should a CPU profiler use?
- How do you get the instruction pointer (RIP) from inside a signal handler? What's in the ucontext_t?
- If you sample at 1000 Hz and a function runs for 1ms, how many samples do you expect? What if it runs for 0.5ms?
- How do you aggregate samples efficiently? A hash table from IP to count? What about collisions?
- How do you convert an instruction pointer to a function name? What tools/libraries can help?
- What happens to profiling accuracy if a function is inlined? Can you still measure it?
- How do you capture call stacks, not just leaf functions? What are the challenges with frame pointers?
- Why might your profiler show different results on different runs? Is this a bug or expected behavior?
Thinking Exercise:
Before coding, analyze this sampling scenario:
Time (ms): 0 1 2 3 4 5 6 7 8 9 10
|----|----|----|----|----|----|----|----|----|----|
Function A: t=0-4ms   ################ (4ms, 40%)
Function B: t=4-6ms   ########         (2ms, 20%)
Function C: t=6-10ms  ################ (4ms, 40%)
Sampling at 1000 Hz (every 1ms):
Sample #: 1 2 3 4 5 6 7 8 9 10
Expected: A A A A B B C C C C
Now consider: What if the timer fires at t=0.5, 1.5, 2.5... (phase-shifted by 0.5ms)?
- Which functions would we sample?
- If function B always runs at exactly t=4.0 to t=6.0, and our timer fires at t=0.5, 1.5, 2.5, 3.5, 4.5, 5.5...
- How many samples of B would we get?
This illustrates aliasing - a real problem in sampling profilers!
The Interview Questions They'll Ask:
- "How does a sampling profiler work? What are its advantages over instrumentation?"
- Expected answer: Sampling profilers periodically interrupt the program (via timer signal) and record where it is (instruction pointer). Advantages: low overhead (constant regardless of call frequency), no code modification needed, works on release binaries. Disadvantages: statistical error, may miss short functions, can alias with periodic behavior.
- "Explain the difference between CPU time and wall-clock time profiling."
- Expected answer: CPU time (ITIMER_PROF) only counts time when the CPU is executing your code - excludes I/O waits, sleeps, context switches. Wall-clock time (ITIMER_REAL) measures real elapsed time including waits. For CPU-bound code, use CPU time. For I/O-bound or concurrent code, wall-clock may be more useful.
- "How do you symbolize an address back to a function name?"
- Expected answer: Use the symbol table in the ELF binary. Tools: dladdr() for runtime lookup, addr2line for static lookup, libbacktrace or libunwind for full support. Need to handle ASLR (read /proc/self/maps), stripped binaries (no symbols), and inlined functions (DWARF info).
- "What is the observer effect in profiling?"
- Expected answer: Profiling changes the behavior of the program being measured. Signal handlers take CPU time, cache lines get evicted, branches may become less predictable. High sample rates increase overhead. A good profiler minimizes overhead and measures its own impact.
- "How would you profile a multithreaded program?"
- Expected answer: ITIMER_PROF signals go to the thread that consumed the CPU time. Each thread needs its own sample aggregation (or lock-protected shared structure). Consider: should you sample all threads equally or proportionally to CPU usage? Thread IDs help attribute samples.
- "Why might a profiler miss a function entirely?"
- Expected answer: If a function runs for less than the sampling interval (e.g., 0.1ms with 1000Hz sampling), it may never be sampled. Also: inlined functions don't have separate addresses, leaf functions may be in registers, and short periodic functions may alias with the sample timer.
Hints in Layers:
Layer 1 - Basic Timer Signal Setup:
#define _GNU_SOURCE   // needed for REG_RIP
#include <sys/time.h>
#include <signal.h>
#include <ucontext.h>
volatile sig_atomic_t sample_count = 0;
static struct sample { void *ip; } samples[1000000];
void profile_handler(int sig, siginfo_t *si, void *context) {
ucontext_t *uc = (ucontext_t *)context;
// Get instruction pointer from interrupted context
// Linux x86-64:
void *ip = (void *)uc->uc_mcontext.gregs[REG_RIP];
// macOS x86-64:
// void *ip = (void *)uc->uc_mcontext->__ss.__rip;
// Store sample (async-signal-safe: just array write)
if (sample_count < 1000000) {
samples[sample_count++].ip = ip;
}
}
void start_profiling(int hz) {
// Install signal handler with SA_SIGINFO to get context
struct sigaction sa;
sa.sa_sigaction = profile_handler;
sa.sa_flags = SA_RESTART | SA_SIGINFO;
sigemptyset(&sa.sa_mask);
sigaction(SIGPROF, &sa, NULL);
// Set up interval timer (microseconds)
struct itimerval timer;
timer.it_interval.tv_sec = 0;
timer.it_interval.tv_usec = 1000000 / hz; // e.g., 1000 for 1ms
timer.it_value = timer.it_interval;
setitimer(ITIMER_PROF, &timer, NULL);
}
Layer 2 - Sample Aggregation:
#include <search.h> // For hash table (hsearch)
typedef struct {
void *ip;
unsigned long count;
char *symbol; // Resolved later
} profile_entry_t;
// Simple aggregation: sort and count
void aggregate_samples(void) {
// Sort samples by IP for counting
qsort(samples, sample_count, sizeof(struct sample), compare_ip);
// Count consecutive duplicates
void *current_ip = NULL;
int current_count = 0;
for (int i = 0; i < sample_count; i++) {
if (samples[i].ip == current_ip) {
current_count++;
} else {
if (current_ip != NULL) {
add_to_profile(current_ip, current_count);
}
current_ip = samples[i].ip;
current_count = 1;
}
}
if (current_ip != NULL) {
add_to_profile(current_ip, current_count);
}
}
Layer 3 - Address Symbolization:
#define _GNU_SOURCE
#include <dlfcn.h>
// Runtime symbolization using dladdr
const char *symbolize(void *addr) {
Dl_info info;
if (dladdr(addr, &info) && info.dli_sname) {
return info.dli_sname;
}
return "(unknown)";
}
// For better symbolization, use addr2line or libbacktrace
void symbolize_with_addr2line(void *addr, char *out, size_t outsize) {
char cmd[256];
snprintf(cmd, sizeof(cmd),
"addr2line -f -e /proc/self/exe %p 2>/dev/null", addr);
FILE *fp = popen(cmd, "r");
if (fp) {
if (fgets(out, outsize, fp) == NULL) {
snprintf(out, outsize, "0x%lx", (unsigned long)addr);
}
// Strip newline
out[strcspn(out, "\n")] = '\0';
pclose(fp);
}
}
Layer 4 - Stack Trace Capture:
#include <execinfo.h>
#define MAX_STACK_DEPTH 64
typedef struct {
void *stack[MAX_STACK_DEPTH];
int depth;
} stack_sample_t;
stack_sample_t stack_samples[100000];
volatile sig_atomic_t stack_sample_count = 0;
void profile_handler_with_stack(int sig, siginfo_t *si, void *context) {
if (stack_sample_count >= 100000) return;
// Capture call stack
// NOTE: backtrace() is not strictly async-signal-safe!
// For production, use frame pointer walking or libunwind
stack_sample_t *s = &stack_samples[stack_sample_count];
s->depth = backtrace(s->stack, MAX_STACK_DEPTH);
stack_sample_count++;
}
// Print stack traces (for debugging)
void print_stack(stack_sample_t *s) {
char **symbols = backtrace_symbols(s->stack, s->depth);
for (int i = 0; i < s->depth; i++) {
printf(" %s\n", symbols[i]);
}
free(symbols);
}
Layer 5 - Report Generation:
void print_flat_profile(void) {
// Sort entries by sample count (descending)
qsort(profile_entries, entry_count, sizeof(profile_entry_t),
compare_by_count_desc);
printf("================================================================================\n");
printf(" FLAT PROFILE\n");
printf("================================================================================\n");
printf(" %%time samples function\n");
printf(" ------- -------- --------------------------------\n");
for (int i = 0; i < entry_count && i < 20; i++) {
double pct = 100.0 * profile_entries[i].count / sample_count;
printf(" %5.1f%% %8lu %s\n",
pct,
profile_entries[i].count,
profile_entries[i].symbol);
}
}
// Flame graph output (folded stacks format for flamegraph.pl)
void output_folded_stacks(FILE *out) {
for (int i = 0; i < stack_sample_count; i++) {
stack_sample_t *s = &stack_samples[i];
// Print stack frames separated by semicolons (bottom to top)
for (int j = s->depth - 1; j >= 0; j--) {
if (j < s->depth - 1) fprintf(out, ";");
fprintf(out, "%s", symbolize(s->stack[j]));
}
fprintf(out, " 1\n"); // Weight of 1 per sample
}
}
Books That Will Help:
| Book | Chapters | What You'll Learn |
|---|---|---|
| CS:APP 3e | 5.14, 8.5 | Performance measurement, signals and handlers |
| TLPI | 23 | Timer signals (ITIMER_*), interval timers |
| Systems Performance (Gregg) | 5-6 | Profiling methodology, CPU analysis |
| APUE 3e | 10, 14 | Signals, interval timers |
| BPF Performance Tools | 13 | CPU profiling with modern tools |
Common Pitfalls & Debugging:
- Bug: Using wall-clock timer for CPU profiling
// WRONG - counts time sleeping, not CPU time
setitimer(ITIMER_REAL, &timer, NULL);
// If program sleeps 90% of the time, you sample sleeping!

// RIGHT - counts only user + system CPU time
setitimer(ITIMER_PROF, &timer, NULL);
- Bug: Forgetting to handle ASLR
// WRONG - addresses change each run!
printf("Hot function at: %p\n", ip);
// Next run, same function is at a different address

// RIGHT - subtract base address or use dladdr
Dl_info info;
if (dladdr(ip, &info)) {
    ptrdiff_t offset = (char*)ip - (char*)info.dli_fbase;
    printf("%s+0x%lx\n", info.dli_fname, (unsigned long)offset);
}
- Bug: Calling non-async-signal-safe functions in handler
// WRONG - printf, malloc, dladdr are NOT async-signal-safe
void handler(int sig, siginfo_t *si, void *ctx) {
    printf("Sample at %p\n", get_ip(ctx));  // May deadlock!
    char *sym = symbolize(get_ip(ctx));     // Calls malloc!
}
// RIGHT - only store data, process later
void handler(int sig, siginfo_t *si, void *ctx) {
    if (sample_count < MAX_SAMPLES) {
        samples[sample_count++] = get_ip(ctx);  // Just a write
    }
}
- Bug: High sample rate causing measurement distortion
// WRONG - 100,000 Hz sampling
timer.it_interval.tv_usec = 10;  // 10us interval
// Handler overhead dominates! Measuring the profiler, not the program.

// RIGHT - 100-1000 Hz is usually sufficient
timer.it_interval.tv_usec = 1000;  // 1ms interval (1000 Hz)
// Rule of thumb: if overhead > 5%, reduce sample rate
Project 24: Memory Leak Detector
| Attribute | Value |
|---|---|
| Language | C (alt: C++) |
| Difficulty | Advanced |
| Time | 1โ2 weeks |
| Chapters | 7, 9, 3 |
What you'll build: A shared library (libleakcheck.so) that interposes malloc/free at runtime, tracks allocations, and emits leak reports with stack traces.
Why it matters: Combines linking (interposition) and memory concepts into a practical debugging tool.
Core challenges:
- Using LD_PRELOAD to intercept allocation APIs
- Avoiding recursion pitfalls ("no malloc in malloc")
- Recording useful diagnostics (sizes, call stacks)
Real World Outcome
When complete, your leak detector will produce output like this:
$ gcc -g -o leaky_app leaky_app.c
$ gcc -shared -fPIC -o libleakcheck.so leakcheck.c -ldl -lunwind
$ LD_PRELOAD=./libleakcheck.so ./leaky_app
================================================================================
MEMORY LEAK DETECTOR - Runtime Analysis
================================================================================
[INIT] libleakcheck.so loaded, intercepting malloc/calloc/realloc/free
[INIT] Tracking allocations with stack trace depth: 8
[ALLOC] malloc(64) = 0x55a3b2c00010 [leaky_app.c:23 in main()]
[ALLOC] malloc(128) = 0x55a3b2c00060 [leaky_app.c:24 in main()]
[ALLOC] calloc(10, 32) = 0x55a3b2c000f0 [leaky_app.c:27 in process_data()]
[ALLOC] malloc(256) = 0x55a3b2c00200 [leaky_app.c:31 in process_data()]
[FREE] free(0x55a3b2c00010) [leaky_app.c:45 in cleanup()]
[FREE] free(0x55a3b2c000f0) [leaky_app.c:46 in cleanup()]
================================================================================
LEAK REPORT - Program Exit
================================================================================
2 blocks leaked (384 bytes total)
Block 1: 0x55a3b2c00060 (128 bytes)
Allocated at: leaky_app.c:24 in main()
Call stack:
#0 main() at leaky_app.c:24
#1 __libc_start_call_main at libc.so.6
#2 __libc_start_main at libc.so.6
#3 _start
Block 2: 0x55a3b2c00200 (256 bytes)
Allocated at: leaky_app.c:31 in process_data()
Call stack:
#0 process_data() at leaky_app.c:31
#1 main() at leaky_app.c:28
#2 __libc_start_call_main at libc.so.6
#3 _start
--------------------------------------------------------------------------------
Summary: 4 allocations, 2 frees, 2 leaks (384 bytes)
Peak memory usage: 480 bytes at timestamp 0.003s
================================================================================
The Core Question You're Answering
"How can we transparently intercept and track every memory allocation in a running program without modifying its source code, and use this to detect memory leaks with precise source location information?"
This project teaches you that the dynamic linker is programmable infrastructure. By understanding symbol resolution order and interposition, you can inject behavior into any dynamically-linked program. The same mechanism powers profilers, sanitizers, and debugging tools used in production systems.
Concepts You Must Understand First
Before writing code, ensure you can explain:
| Concept | Why It Matters | Reference |
|---|---|---|
| Dynamic Linking & Symbol Resolution | LD_PRELOAD exploits the linker's symbol search order to let your library "shadow" libc functions | CS:APP 7.12, TLPI Ch. 41 |
| Position-Independent Code (PIC) | Shared libraries must use PIC; understand GOT/PLT indirection | CS:APP 7.12 |
| dlsym and RTLD_NEXT | You need to call the real malloc after your wrapper; RTLD_NEXT finds the next symbol in search order | TLPI 42.1 |
| Stack Unwinding | Capturing call stacks requires walking the frame chain or using libunwind/backtrace() | CS:APP 3.7, libunwind docs |
| Thread Safety | Your tracking data structures must handle concurrent allocations | CS:APP Ch. 12 |
| Signal Safety | Some code paths (like atexit handlers) have restrictions on what functions you can call | CS:APP 8.5.5 |
Questions to Guide Your Design
Answer these before writing code:
- How will you store allocation metadata? (Hash table keyed by address? Linked list? What are the tradeoffs?)
- How do you get the "real" malloc? (When does dlsym(RTLD_NEXT, "malloc") get called? What if dlsym itself calls malloc?)
- What happens if your tracking code calls malloc? (Design a recursion guard. How do you detect and break the cycle?)
- How will you capture stack traces? (backtrace() vs libunwind vs manual frame walking. Which is signal-safe?)
- When do you emit the leak report? (atexit handler? Destructor function? What about abnormal termination?)
- How do you map addresses to source lines? (Runtime: addr2line/dladdr. Or embed DWARF parsing?)
- What about realloc? (It can move memory. How do you track the old/new relationship?)
- How do you handle calloc? (It might be implemented via malloc internally in some libcs.)
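The realloc question is mostly a bookkeeping problem, and the bookkeeping is small enough to sketch on its own. This is a minimal illustration with hypothetical names (`entry_t`, `track`, `find`, `track_realloc`) and a fixed-size table standing in for the hash table developed later; the real wrapper would call the real realloc first, then update the table behind its recursion guard.

```c
#include <stddef.h>

/* Hypothetical fixed-size tracking table (a real detector uses a hash table). */
typedef struct { void *ptr; size_t size; } entry_t;
static entry_t track[256];

static entry_t *find(void *p) {
    for (int i = 0; i < 256; i++)
        if (track[i].ptr == p) return &track[i];
    return NULL;
}

/* Bookkeeping for realloc(old_ptr, new_size) returning new_ptr:
 * - realloc(NULL, n) behaves like malloc(n)  -> insert a fresh entry
 * - a freeing realloc returns NULL           -> remove the old entry
 * - otherwise the block may have moved       -> re-key the old entry */
static void track_realloc(void *old_ptr, void *new_ptr, size_t new_size) {
    entry_t *e = old_ptr ? find(old_ptr) : NULL;
    if (new_ptr == NULL && new_size == 0) {    /* freed via realloc(p, 0) */
        if (e) e->ptr = NULL;
        return;
    }
    if (e) {                                   /* moved or resized in place */
        e->ptr = new_ptr;
        e->size = new_size;
    } else {                                   /* fresh allocation */
        entry_t *slot = find(NULL);            /* first empty slot */
        if (slot) { slot->ptr = new_ptr; slot->size = new_size; }
    }
}
```

Note the subtlety this encodes: if realloc moved the block, the old address must disappear from the live set atomically with the new one appearing, or the detector reports a phantom leak at the old address.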
Thinking Exercise: Trace This Interposition
Before implementing, trace through what happens when a program runs with your library:
// leaky.c - compile with: gcc -g -o leaky leaky.c
#include <stdlib.h>
#include <stdio.h>
void helper(void) {
char *buf = malloc(100); // Allocation A
// Oops, forgot to free!
}
int main(void) {
int *arr = malloc(40); // Allocation B
helper();
free(arr); // Free B
return 0;
}
Trace questions:
- When LD_PRELOAD=./libleakcheck.so ./leaky starts, in what order are constructors called?
- When main() calls malloc(40), trace the symbol resolution:
  - Where does the PLT jump go first?
  - How does your interposed malloc get called?
  - How does your wrapper call the real malloc?
- Why is Allocation A (in helper) a leak but Allocation B is not?
- If your leak report runs in an atexit handler, what memory is still "live"?
Draw the state of your allocation tracking table after each call:
After malloc(40): { 0x55...010: {size=40, caller=main:10} }
After malloc(100): { 0x55...010: {size=40, caller=main:10},
0x55...080: {size=100, caller=helper:6} }
After free(arr): { 0x55...080: {size=100, caller=helper:6} }
At exit: 1 leak detected!
The Interview Questions They'll Ask
Q1: Explain how LD_PRELOAD works and what security implications it has.
Expected answer: LD_PRELOAD tells the dynamic linker to load specified shared libraries before any others. When resolving symbols, the linker searches in order: LD_PRELOAD libraries, then the executable, then DT_NEEDED libraries. This allows "shadowing" symbols like malloc. Security implication: setuid/setgid binaries ignore LD_PRELOAD (AT_SECURE) to prevent privilege escalation. It's also why LD_PRELOAD can't intercept statically-linked binaries.
Q2: How would you avoid infinite recursion if your malloc wrapper needs to allocate memory?
Expected answer: Use a thread-local recursion guard. When entering the wrapper, check and set a flag. If already set, call the real malloc directly without tracking. Alternative: use a static buffer for internal allocations, or use mmap directly which doesnโt go through malloc. The guard must be thread-local (using __thread) for correctness in multi-threaded programs.
Q3: Whatโs the difference between using backtrace() and libunwind for stack traces?
Expected answer: backtrace() is simpler (part of glibc) but not signal-safe and may not work well with optimized code missing frame pointers. libunwind is more portable, can be configured for signal-safety, and handles various unwinding methods (DWARF, frame pointers, etc.). For a leak detector, either works, but libunwind is more robust for production use.
Q4: How would you extend this to detect double-frees and use-after-free?
Expected answer: For double-free: keep freed blocks in a "recently freed" quarantine list; if free() is called on an address in quarantine, report double-free. For use-after-free: more complex; could use guard pages (like AddressSanitizer) or probabilistic detection via canary values. Full detection requires memory poisoning and potentially page protection tricks.
Q5: Why might your leak detector report false positives?
Expected answer: (1) Intentional leaks at shutdown (global caches freed by OS). (2) Memory reachable through global pointers but not explicitly freed. (3) Custom allocators that batch-free at exit. (4) Memory still in use when atexit runs. Real leak detectors (Valgrind) do reachability analysis at exit to distinguish "definitely lost" from "still reachable."
Q6: How does Valgrindโs memcheck differ from LD_PRELOAD interposition?
Expected answer: Valgrind runs the program on a synthetic CPU, instrumenting every memory access. This catches more bugs (uninitialized reads, buffer overflows) but with 10-50x slowdown. LD_PRELOAD only intercepts explicit allocation calls, so it's faster (~1.1x overhead) but misses many bug classes. They're complementary: LD_PRELOAD for production monitoring, Valgrind for thorough testing.
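The quarantine idea from the double-free question is small enough to sketch. This is an illustrative fragment, not part of the hints below; `quarantine_check_and_add` is a hypothetical name, and the wrapper's free() would call it before handing the pointer to the real free.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical quarantine: remember the last N freed pointers and flag a
 * second free() of any of them as a likely double-free. */
#define QUARANTINE_SLOTS 64
static void *quarantine[QUARANTINE_SLOTS];
static size_t q_next = 0;   /* ring-buffer write position */

/* Returns true if ptr was already freed recently (likely double-free),
 * otherwise records this free in the ring buffer. */
static bool quarantine_check_and_add(void *ptr) {
    for (size_t i = 0; i < QUARANTINE_SLOTS; i++)
        if (quarantine[i] == ptr)
            return true;                 /* double-free detected */
    quarantine[q_next] = ptr;            /* remember this free */
    q_next = (q_next + 1) % QUARANTINE_SLOTS;
    return false;
}
```

One caveat a real detector must handle: when malloc hands the same address out again, the pointer has to be evicted from quarantine, or a perfectly legitimate free of the reused block will be reported as a double-free.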
Hints in Layers
Layer 1 - The Basic Structure
Start with the wrapper skeleton. The key is dlsym(RTLD_NEXT, ...):
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>
static void* (*real_malloc)(size_t) = NULL;
static void (*real_free)(void*) = NULL;
__attribute__((constructor))
static void init(void) {
real_malloc = dlsym(RTLD_NEXT, "malloc");
real_free = dlsym(RTLD_NEXT, "free");
if (!real_malloc || !real_free) {
fprintf(stderr, "Error: dlsym failed\n");
_exit(1);
}
}
void* malloc(size_t size) {
void* ptr = real_malloc(size);
fprintf(stderr, "[ALLOC] malloc(%zu) = %p\n", size, ptr);
return ptr;
}
void free(void* ptr) {
fprintf(stderr, "[FREE] free(%p)\n", ptr);
real_free(ptr);
}
Compile: gcc -shared -fPIC -o libleakcheck.so leakcheck.c -ldl
Layer 2 - The Recursion Problem
Your fprintf calls malloc internally! Add a guard:
static __thread int in_wrapper = 0;
void* malloc(size_t size) {
if (in_wrapper || !real_malloc) {
// Bootstrapping or recursive call - use real malloc directly
if (!real_malloc) {
real_malloc = dlsym(RTLD_NEXT, "malloc");
}
return real_malloc(size);
}
in_wrapper = 1;
void* ptr = real_malloc(size);
// Now safe to call fprintf, etc.
fprintf(stderr, "[ALLOC] malloc(%zu) = %p\n", size, ptr);
in_wrapper = 0;
return ptr;
}
Layer 3 - Tracking Allocations
Use a hash table to track live allocations:
#define HASH_SIZE 65536
typedef struct alloc_info {
void* ptr;
size_t size;
void* stack[8];
int stack_depth;
struct alloc_info* next;
} alloc_info_t;
static alloc_info_t* hash_table[HASH_SIZE];
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static size_t ptr_hash(void* ptr) {
return ((uintptr_t)ptr >> 4) % HASH_SIZE;
}
static void track_alloc(void* ptr, size_t size) {
alloc_info_t* info = real_malloc(sizeof(alloc_info_t));
info->ptr = ptr;
info->size = size;
info->stack_depth = backtrace(info->stack, 8);
size_t idx = ptr_hash(ptr);
pthread_mutex_lock(&lock);
info->next = hash_table[idx];
hash_table[idx] = info;
pthread_mutex_unlock(&lock);
}
Layer 4 - Stack Trace Symbolization
Convert addresses to function names and line numbers:
#include <execinfo.h>
#include <dlfcn.h>
static void print_stack_trace(void** stack, int depth) {
char** symbols = backtrace_symbols(stack, depth);
for (int i = 0; i < depth; i++) {
Dl_info info;
if (dladdr(stack[i], &info) && info.dli_sname) {
fprintf(stderr, " #%d %s + 0x%lx\n",
i, info.dli_sname,
(char*)stack[i] - (char*)info.dli_saddr);
} else {
fprintf(stderr, " #%d %s\n", i, symbols[i]);
}
}
free(symbols);
}
For source lines, shell out to addr2line or embed libdwarf.
Layer 5 - The Leak Report
Register an atexit handler to print remaining allocations:
__attribute__((destructor))
static void report_leaks(void) {
size_t total_leaked = 0;
size_t leak_count = 0;
fprintf(stderr, "\n=== LEAK REPORT ===\n");
for (int i = 0; i < HASH_SIZE; i++) {
for (alloc_info_t* info = hash_table[i]; info; info = info->next) {
leak_count++;
total_leaked += info->size;
fprintf(stderr, "Block %p (%zu bytes)\n", info->ptr, info->size);
print_stack_trace(info->stack, info->stack_depth);
}
}
fprintf(stderr, "\n%zu blocks leaked (%zu bytes total)\n",
leak_count, total_leaked);
}
Books That Will Help
| Book | Chapter(s) | What You'll Learn |
|---|---|---|
| CS:APP | 7.12 | Position-independent code, dynamic linking, symbol interposition |
| CS:APP | 9.9 | Dynamic memory allocation concepts |
| CS:APP | 3.7 | Stack structure, frame pointers for unwinding |
| TLPI | 41 | Fundamentals of shared libraries |
| TLPI | 42 | dlopen, dlsym, RTLD_NEXT, library interposition |
| Low-Level Programming | Ch. 13 | Shared libraries and dynamic linking on Linux |
| Effective C | Ch. 6 | Dynamic memory management best practices |
Common Pitfalls & Debugging
Problem 1: Infinite recursion / stack overflow on startup
Symptom: Program crashes immediately with SIGSEGV in dlsym or printf.
Cause: dlsym or stdio functions call malloc before your constructor runs.
Fix: Use a static buffer for early allocations, or check if real_malloc is NULL:
static char early_buffer[4096];
static size_t early_offset = 0;
void* malloc(size_t size) {
    if (!real_malloc) {
        // Before constructor: carve from the static buffer
        if (early_offset + size > sizeof(early_buffer))
            return NULL;  // early buffer exhausted
        void* ptr = &early_buffer[early_offset];
        early_offset += (size + 15) & ~(size_t)15;  // Align to 16
        return ptr;
    }
    // ... normal path
}
// NOTE: free() must ignore pointers that fall inside early_buffer
Problem 2: Deadlock in multi-threaded programs
Symptom: Program hangs when multiple threads allocate simultaneously.
Cause: Holding the lock while calling fprintf (which may call malloc).
Fix: Copy necessary data, release lock, then log:
void* malloc(size_t size) {
void* ptr = real_malloc(size);
pthread_mutex_lock(&lock);
// Quick insert into hash table
pthread_mutex_unlock(&lock);
// Log AFTER releasing lock
if (!in_wrapper) {
in_wrapper = 1;
fprintf(stderr, "[ALLOC] ...\n");
in_wrapper = 0;
}
return ptr;
}
Problem 3: Incorrect leak counts (missing frees or double-counting)
Symptom: Report shows leaks for memory you know was freed.
Cause: Hash table collision handling bug, or realloc not tracked correctly.
Debug: Add verbose logging showing every insert/remove:
$ LD_DEBUG=bindings LD_PRELOAD=./libleakcheck.so ./app 2>&1 | grep -E 'malloc|free'
Problem 4: Stack traces missing function names
Symptom: Stack trace shows only addresses like 0x55a3b2c00060.
Cause: Program compiled without debug symbols, or stripped binary.
Fix: Compile with -g and -rdynamic (exports symbols for backtrace). For release builds, use addr2line:
char cmd[256], line[256];
snprintf(cmd, sizeof(cmd), "addr2line -e /proc/self/exe %p", addr);
FILE *fp = popen(cmd, "r");  // capture output (system() would dump it unformatted)
if (fp) {
    if (fgets(line, sizeof(line), fp))
        fputs(line, stderr);
    pclose(fp);
}
Project 25: Debugger (ptrace-based)
| Attribute | Value |
|---|---|
| Language | C (alt: C++, Rust) |
| Difficulty | Master |
| Time | 1 month+ |
| Chapters | 3, 7, 8 |
What you'll build: A tiny debugger (mydb) that runs a child process under control, sets breakpoints, single-steps, and inspects registers/memory.
Why it matters: The ultimate test of understanding machine-level code, process control, and system calls.
Core challenges:
- Controlling the tracee with ptrace stop/resume semantics
- Implementing software breakpoints (patching with int3)
- Building a command loop (break, run, step, continue, regs, x)
Real World Outcome
When complete, your debugger will produce output like this:
$ ./mydb ./target_program
================================================================================
MYDB - Minimal x86-64 Debugger
================================================================================
[INFO] Loaded executable: ./target_program
[INFO] Entry point: 0x401000
[INFO] Text section: 0x401000 - 0x401fff
[INFO] Type 'help' for available commands
mydb> break main
[BREAK] Breakpoint 1 set at 0x401126 <main>
mydb> break 0x40113a
[BREAK] Breakpoint 2 set at 0x40113a <main+20>
mydb> run
[RUN] Starting program: ./target_program
[STOP] Hit breakpoint 1 at 0x401126 <main>
mydb> regs
================================================================================
REGISTER STATE
================================================================================
rax = 0x0000000000000000 rbx = 0x0000000000000000
rcx = 0x00007ffff7fa5040 rdx = 0x00007fffffffe0a8
rsi = 0x00007fffffffe098 rdi = 0x0000000000000001
rbp = 0x0000000000000000 rsp = 0x00007fffffffe088
r8 = 0x0000000000000000 r9 = 0x00007ffff7fc9040
r10 = 0x00007ffff7fc3908 r11 = 0x00007ffff7fe17c0
r12 = 0x0000000000401000 r13 = 0x00007fffffffe090
r14 = 0x0000000000000000 r15 = 0x0000000000000000
rip = 0x0000000000401126 eflags = 0x00000246 [PF ZF IF]
mydb> x/8x $rsp
0x7fffffffe088: 0x00007ffff7df1b6b 0x0000000000000001
0x7fffffffe098: 0x00007fffffffe3a8 0x0000000000000000
0x7fffffffe0a8: 0x00007fffffffe3c0 0x00007fffffffe3d5
0x7fffffffe0b8: 0x00007fffffffe3f2 0x00007fffffffe410
mydb> disas main
0x401126 <main+0>: push rbp
0x401127 <main+1>: mov rbp, rsp
0x40112a <main+4>: sub rsp, 0x20
0x40112e <main+8>: mov dword ptr [rbp-0x14], edi
0x401131 <main+11>: mov qword ptr [rbp-0x20], rsi
0x401135 <main+15>: mov dword ptr [rbp-0x4], 0x2a
0x40113c <main+22>: mov eax, dword ptr [rbp-0x4]
mydb> step
[STEP] Single-stepped to 0x401127 <main+1>
mydb> continue
[CONTINUE] Resuming execution...
[STOP] Hit breakpoint 2 at 0x40113a <main+20>
mydb> print $rax
$rax = 42 (0x2a)
mydb> continue
[CONTINUE] Resuming execution...
[EXIT] Program exited with status 0
The Core Question You're Answering
"How does a debugger gain control over another running process, stop it at arbitrary points, inspect its internal state, and resume execution - all without modifying the program's source code?"
This project demystifies the "magic" of debuggers. You'll discover that debuggers are just programs that use operating system facilities (ptrace) to become the "parent" of another process. Every debugger command maps to specific ptrace operations: breakpoints are instruction patches, single-stepping uses CPU trap flags, and register inspection reads from kernel-managed process state.
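The "become the parent of another process" handshake can be sketched in a few lines. This is a minimal Linux-only skeleton under the assumption that default ptrace permissions apply; `run_traced` is an illustrative name and error handling is trimmed for brevity.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Minimal "run a program under control" skeleton: the child asks to be
 * traced, the kernel stops it at the first exec, and the parent decides
 * when it runs. Returns the child's exit status. */
int run_traced(char *const argv[]) {
    pid_t pid = fork();
    if (pid == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);  /* child: let parent trace me */
        execvp(argv[0], argv);                  /* kernel stops child here */
        _exit(127);                             /* exec failed */
    }
    int status;
    waitpid(pid, &status, 0);                   /* child stopped at exec */
    /* ...this is where a debugger would plant breakpoints... */
    ptrace(PTRACE_CONT, pid, NULL, NULL);       /* resume the child */
    waitpid(pid, &status, 0);                   /* wait for it to finish */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Everything a debugger does happens in the gap between those two waitpid calls: while the tracee is stopped, the parent may read and write its registers and memory at will.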
Concepts You Must Understand First
Before writing code, ensure you can explain:
| Concept | Why It Matters | Reference |
|---|---|---|
| ptrace System Call | The fundamental mechanism for process tracing; allows reading/writing memory, registers, and controlling execution | TLPI Ch. 26, man ptrace |
| x86-64 Instruction Encoding | You need to understand how int3 (0xCC) works as a breakpoint trap | CS:APP 3.1, Intel SDM Vol. 2 |
| Process States & Signals | Traced processes stop on signals; SIGTRAP indicates breakpoint or single-step | CS:APP 8.5, TLPI Ch. 20-22 |
| ELF Format & Symbols | To set breakpoints by function name, you must parse the symbol table | CS:APP 7.4-7.5 |
| Memory Layout | Understanding text/data/stack segments and how addresses map to actual memory | CS:APP 7.9, 9.7 |
| Register Conventions | Knowing which registers hold arguments, return values, and the instruction pointer | CS:APP 3.4 |
Questions to Guide Your Design
Answer these before writing code:
1. How do you start a process under your control? (fork + ptrace(PTRACE_TRACEME) + exec? What happens if exec fails?)
2. What's the difference between PTRACE_CONT and PTRACE_SINGLESTEP? (How does the CPU know to stop after one instruction?)
3. How do breakpoints actually work? (What byte do you save? What byte do you write? What happens when the CPU executes it?)
4. After hitting a breakpoint, how do you continue? (Why can't you just PTRACE_CONT? What's the "step over breakpoint" dance?)
5. How do you distinguish breakpoint stops from other stops? (SIGTRAP can mean breakpoint, single-step, or syscall stop)
6. How do you read the tracee's memory? (PTRACE_PEEKTEXT returns one word at a time - how do you read larger regions?)
7. How do you map addresses to function names? (Parse ELF .symtab/.dynsym? Use libdwarf for source lines?)
8. What happens if the tracee forks? (Does your debugger follow the child? How do you handle multi-threaded programs?)
Thinking Exercise: Trace a Breakpoint Hit
Before implementing, trace through what happens when a breakpoint is hit:
State 1: Program loaded, breakpoint set at 0x401126
- Original instruction at 0x401126: 55 (push rbp)
- After setting breakpoint: CC (int3)
- Debugger waiting in waitpid()
State 2: Program runs, hits breakpoint
- CPU fetches instruction at 0x401126
- CPU executes 0xCC (int3)
- CPU raises #BP exception
- Kernel converts to SIGTRAP, stops tracee
- Kernel wakes debugger from waitpid()
- RIP = 0x401127 (past the int3)
State 3: Debugger inspects state
- ptrace(PTRACE_GETREGS, ...) reads all registers
- RIP needs adjustment: subtract 1 to point to breakpoint
- ptrace(PTRACE_PEEKTEXT, 0x401126) reads memory
State 4: User says "continue"
- Restore original byte: poke 0x55 at 0x401126
- Set RIP = 0x401126 (re-execute the instruction)
- ptrace(PTRACE_SINGLESTEP) - execute ONE instruction
- waitpid() - tracee stops after push rbp
- Restore breakpoint: poke 0xCC at 0x401126
- ptrace(PTRACE_CONT) - continue normally
Draw the instruction byte at 0x401126 through each state:
[LOAD] 0x401126: 55 (push rbp - original)
[BREAK] 0x401126: CC (int3 - breakpoint active)
[HIT] 0x401126: CC, RIP=0x401127 (stopped, RIP past int3)
[STEP] 0x401126: 55, RIP=0x401126 (restored, about to re-execute)
[AFTER] 0x401126: CC, RIP=0x401127 (breakpoint re-armed, executed push)
The Interview Questions They'll Ask
Q1: Explain how software breakpoints work at the CPU level.
Expected answer: A software breakpoint replaces the first byte of an instruction with int3 (0xCC), a single-byte instruction that triggers a breakpoint exception (#BP). When the CPU executes it, it raises the exception, the kernel translates this to SIGTRAP, and the debugger (as the tracer) is notified via waitpid(). The debugger saves the original byte and restores it when needed for continuation.
Q2: What's the "step over breakpoint" problem and how do you solve it?
Expected answer: After hitting a breakpoint, you can't just continue because the breakpoint instruction is still there. The solution: (1) restore the original instruction, (2) set RIP back to the breakpoint address, (3) single-step one instruction, (4) re-insert the breakpoint, (5) then continue normally. This ensures the original instruction executes before re-arming the breakpoint.
Q3: How does ptrace(PTRACE_SINGLESTEP) work?
Expected answer: PTRACE_SINGLESTEP sets the x86 Trap Flag (TF) in the EFLAGS register. This flag causes the CPU to generate a debug exception (#DB) after executing exactly one instruction. The kernel handles this exception, delivers SIGTRAP to the tracer, and clears TF. The debugger sees the process stopped after one instruction.
Q4: Why do debuggers need to parse ELF files?
Expected answer: To provide symbolic debugging. Without ELF parsing, you can only work with raw addresses. By reading .symtab (static symbols) and .dynsym (dynamic symbols), you can map addresses to function names and vice versa. For source-level debugging, you need DWARF debug info (.debug_* sections) to map addresses to source lines.
Q5: How would you implement conditional breakpoints?
Expected answer: A conditional breakpoint stops only when a condition is true. Implementation: (1) set a normal breakpoint, (2) when hit, evaluate the condition (parse expression, read registers/memory), (3) if false, do the step-over-breakpoint dance silently and continue, (4) if true, report the stop to the user. The overhead comes from stopping on every hit even when continuing.
Q6: What are the limitations of ptrace-based debugging?
Expected answer: (1) Only one tracer per process - can't run under two debuggers. (2) Performance overhead from context switches on every stop. (3) Can be detected by the tracee (via PTRACE_TRACEME failing or checking ppid). (4) Anti-debugging tricks can interfere (timing checks, self-modifying code). (5) Multi-threaded debugging is complex (need to stop all threads atomically).
Hints in Layers
Layer 1 - Basic Process Control
Start with launching and stopping a process:
#include <stdio.h>
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <sys/user.h>
#include <unistd.h>
int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s program [args...]\n", argv[0]);
        return 1;
    }
    pid_t child = fork();
    if (child == 0) {
        // Child: request to be traced, then exec
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        execvp(argv[1], &argv[1]);
        perror("exec failed");
        _exit(1);
    }
    // Parent: wait for the child's stop at exec (kernel delivers SIGTRAP)
    int status;
    waitpid(child, &status, 0);
    printf("Child stopped at entry point\n");
    // Continue child
    ptrace(PTRACE_CONT, child, NULL, NULL);
    waitpid(child, &status, 0);
    printf("Child exited with status %d\n", WEXITSTATUS(status));
    return 0;
}
Layer 2 - Reading Registers
Use PTRACE_GETREGS to read all registers at once:
#include <sys/user.h>
void show_registers(pid_t child) {
struct user_regs_struct regs;
ptrace(PTRACE_GETREGS, child, NULL, &regs);
printf("rip = 0x%llx\n", regs.rip);
printf("rsp = 0x%llx\n", regs.rsp);
printf("rax = 0x%llx\n", regs.rax);
// ... other registers
}
Layer 3 - Reading Memory
PTRACE_PEEKTEXT reads one word at a time:
#include <errno.h>
#include <string.h>
void read_memory(pid_t child, unsigned long addr, void *buf, size_t len) {
    unsigned char *dst = buf;
    for (size_t i = 0; i < len; i += sizeof(long)) {
        errno = 0;  // PEEKTEXT may legitimately return -1, so test errno instead
        long word = ptrace(PTRACE_PEEKTEXT, child, addr + i, NULL);
        if (errno) {
            perror("PEEKTEXT failed");
            return;
        }
        // Copy only the bytes that fit, so a partial last word can't overrun buf
        size_t n = (len - i < sizeof(long)) ? len - i : sizeof(long);
        memcpy(dst + i, &word, n);
    }
}
Layer 4 - Setting Breakpoints
Save the original byte, write 0xCC:
typedef struct {
unsigned long addr;
unsigned char saved_byte;
int enabled;
} breakpoint_t;
void set_breakpoint(pid_t child, breakpoint_t *bp, unsigned long addr) {
long word = ptrace(PTRACE_PEEKTEXT, child, addr, NULL);
bp->addr = addr;
bp->saved_byte = (unsigned char)(word & 0xff);
bp->enabled = 1;
// Replace first byte with int3 (0xCC)
long modified = (word & ~0xff) | 0xCC;
ptrace(PTRACE_POKETEXT, child, addr, modified);
}
void disable_breakpoint(pid_t child, breakpoint_t *bp) {
long word = ptrace(PTRACE_PEEKTEXT, child, bp->addr, NULL);
long restored = (word & ~0xff) | bp->saved_byte;
ptrace(PTRACE_POKETEXT, child, bp->addr, restored);
bp->enabled = 0;
}
Layer 5 - The Continue Dance
When continuing from a breakpoint:
void continue_from_breakpoint(pid_t child, breakpoint_t *bp) {
struct user_regs_struct regs;
ptrace(PTRACE_GETREGS, child, NULL, &regs);
// RIP points past int3; back it up
regs.rip = bp->addr;
ptrace(PTRACE_SETREGS, child, NULL, &regs);
// Restore original instruction
disable_breakpoint(child, bp);
// Single-step one instruction
ptrace(PTRACE_SINGLESTEP, child, NULL, NULL);
int status;
waitpid(child, &status, 0);
// Re-enable breakpoint
set_breakpoint(child, bp, bp->addr);
// Now continue normally
ptrace(PTRACE_CONT, child, NULL, NULL);
}
Books That Will Help
| Book | Chapter(s) | What You'll Learn |
|---|---|---|
| CS:APP | 3.1-3.4 | x86-64 instruction formats, registers, calling conventions |
| CS:APP | 7.4-7.5 | ELF format, symbol tables for address-to-name mapping |
| CS:APP | 8.4-8.5 | Process control, signals, and how SIGTRAP works |
| TLPI | 26 | Comprehensive ptrace coverage with examples |
| TLPI | 20-22 | Signals and signal handling |
| Low-Level Programming | Ch. 11 | Practical debugging and ptrace examples |
| Intel SDM Vol. 2 | INT instruction | How int3 generates #BP exception |
| How Debuggers Work (Rosenberg) | All | Classic book on debugger implementation |
Common Pitfalls & Debugging
Problem 1: Breakpoint doesnโt trigger
Symptom: Program runs past breakpoint address without stopping.
Cause: Wrong address (e.g., function entry vs. first instruction), or breakpoint set on non-executable memory.
Fix: Verify address with objdump -d. Ensure youโre setting breakpoint on actual code:
$ objdump -d target | grep -A5 "<main>:"
0000000000401126 <main>:
401126: 55 push %rbp
Problem 2: SIGTRAP but not from breakpoint
Symptom: Unexpected stops at random addresses.
Cause: Single-step trap, syscall stop (if PTRACE_SYSCALL was used), or clone/fork events.
Fix: Check stop reason carefully:
if (WIFSTOPPED(status)) {
int sig = WSTOPSIG(status);
if (sig == SIGTRAP) {
// Could be breakpoint, single-step, or syscall
siginfo_t info;
ptrace(PTRACE_GETSIGINFO, child, NULL, &info);
if (info.si_code == SI_KERNEL || info.si_code == TRAP_BRKPT) {
// Breakpoint
} else if (info.si_code == TRAP_TRACE) {
// Single-step
}
}
}
Problem 3: Registers show wrong values after breakpoint
Symptom: RIP points to wrong address after hitting breakpoint.
Cause: Forgetting that int3 advances RIP by 1. After hitting breakpoint at 0x401126, RIP = 0x401127.
Fix: Always subtract 1 from RIP when stopped at a breakpoint:
regs.rip--; // Point back to the int3/original instruction
Problem 4: Cannot continue after hitting breakpoint
Symptom: Program immediately hits same breakpoint again, or crashes.
Cause: Not doing the single-step dance - you continue with int3 still in place.
Fix: Always: restore original byte, step once, re-insert breakpoint, then continue.
Project 26: Operating System Kernel Capstone
| Attribute | Value |
|---|---|
| Language | C + x86-64 Assembly (alt: Rust) |
| Difficulty | Master+ |
| Time | 3–6 months |
| Chapters | All CS:APP + OSTEP |
What youโll build: A minimal x86-64 kernel that boots in QEMU, enables paging, handles interrupts, and runs simple user processes.
Why it matters: An OS kernel uses every concept from CS:APP - this is the ultimate capstone.
Core challenges:
- Booting to 64-bit long mode
- Implementing physical/virtual memory management
- Handling interrupts and context-switching between tasks
- Designing a minimal syscall boundary
Real World Outcome
When complete, your kernel will boot and run in QEMU like this:
$ make
AS boot.S
CC kernel.c
CC mm.c
CC interrupt.c
CC process.c
CC syscall.c
LD kernel.elf
OBJCOPY kernel.bin
$ make run
qemu-system-x86_64 -kernel kernel.bin -serial stdio -no-reboot
================================================================================
MINIX86 KERNEL v0.1 - x86-64 Operating System
================================================================================
[BOOT] Entered 64-bit long mode
[BOOT] Kernel loaded at 0xffffffff80000000
[BOOT] Stack at 0xffffffff80010000
[MM] Physical memory detected: 128 MB
[MM] Kernel: 0x100000 - 0x200000 (1 MB)
[MM] Free memory starts at: 0x200000
[MM] Initializing page frame allocator...
[MM] 32256 page frames available (126 MB)
[MM] Page tables initialized
[IDT] Loading Interrupt Descriptor Table...
[IDT] Exception handlers 0-31 installed
[IDT] IRQ handlers 32-47 installed
[IDT] Syscall handler at vector 0x80 installed
[PIC] 8259 PIC remapped (IRQ0 -> INT32)
[TIMER] PIT configured for 100 Hz tick
[PROC] Process subsystem initialized
[PROC] Creating init process (PID 1)...
[PROC] Loading /bin/init from initrd
[PROC] Entry point: 0x400000
[PROC] User stack: 0x7fffffffe000
================================================================================
SWITCHING TO USER MODE
================================================================================
[SYSCALL] init(1): write(1, "Hello from userspace!\n", 22)
Hello from userspace!
[SYSCALL] init(1): fork() = 2
[PROC] Created process 2 (parent: 1)
[SYSCALL] shell(2): write(1, "minix86> ", 9)
minix86> [SYSCALL] shell(2): read(0, buf, 256)
$ # Type commands at the kernel shell
$ echo hello
[SYSCALL] shell(2): fork() = 3
[SYSCALL] echo(3): execve("/bin/echo", ["echo", "hello"], envp)
[SYSCALL] echo(3): write(1, "hello\n", 6)
hello
[SYSCALL] echo(3): exit(0)
[PROC] Process 3 exited with status 0
[SYSCALL] shell(2): wait4(-1, &status, 0, NULL) = 3
minix86> ps
[SYSCALL] shell(2): fork() = 4
[SYSCALL] ps(4): open("/proc/self/status", O_RDONLY)
PID PPID STATE NAME
1 0 SLEEP init
2 1 RUNNING shell
4 2 RUNNING ps
[SYSCALL] ps(4): exit(0)
minix86> ^C
[SIGNAL] Sending SIGINT to process 2
[PROC] Shell caught SIGINT, continuing...
$ # Press Ctrl+A, X to exit QEMU
[SHUTDOWN] System halt requested
[SHUTDOWN] Syncing filesystems...
[SHUTDOWN] Goodbye!
The Core Question You're Answering
"How does a computer go from power-on to running user programs, and what does the kernel do to make this possible while keeping user programs isolated from each other and from the hardware?"
This project is the ultimate integration of everything in CS:APP. You'll build the software that sits between bare metal and applications. Every concept you've studied - memory layout, calling conventions, interrupts, virtual memory, process control - comes together here. When you understand how a kernel works, you understand how computers work.
Concepts You Must Understand First
Before writing code, ensure you can explain:
| Concept | Why It Matters | Reference |
|---|---|---|
| x86-64 Boot Process | Understanding real mode, protected mode, and long mode transitions | OSDev Wiki, Intel SDM Vol. 3 |
| Paging & Page Tables | 4-level page tables (PML4/PDPT/PD/PT), how virtual addresses translate to physical | CS:APP 9.6, OSTEP Ch. 18-20 |
| Interrupts & Exceptions | IDT setup, interrupt handlers, CPU privilege levels (rings 0-3) | CS:APP 8.1, Intel SDM Vol. 3 |
| Context Switching | Saving/restoring CPU state, switching between kernel and user stacks | OSTEP Ch. 6, CS:APP 8.2 |
| System Calls | The syscall/sysret mechanism, transitioning between user and kernel mode | CS:APP 8.2, OSTEP Ch. 6 |
| Memory Management | Physical frame allocation, virtual memory mapping, kernel/user space split | CS:APP Ch. 9, OSTEP Ch. 13-23 |
Questions to Guide Your Design
Answer these before writing code:
1. How do you get from BIOS/UEFI to your kernel? (Multiboot? Custom bootloader? UEFI stub?)
2. How do you transition from 32-bit protected mode to 64-bit long mode? (What CR registers must be set? What page tables are required?)
3. How do you organize physical memory? (Bitmap allocator? Free list? Buddy system?)
4. How do you set up kernel virtual memory? (Direct mapping? Higher-half kernel? What goes where?)
5. How do you handle interrupts? (IDT format in 64-bit? How do you save CPU state? What's the interrupt stack?)
6. How do you switch from kernel to user mode? (What registers change? How does iretq work?)
7. How do you implement system calls? (syscall/sysret vs int 0x80? What's the calling convention?)
8. How do you switch between processes? (When does it happen? What state must be saved/restored?)
Thinking Exercise: Trace a System Call
Before implementing, trace through what happens when a user program calls write(1, "hello", 5):
User Space (Ring 3)
-------------------
1. libc wrapper: write() function
- Put syscall number (1) in rax
- Put arguments in rdi=1, rsi=buf, rdx=5
- Execute 'syscall' instruction
CPU Transition (syscall instruction)
------------------------------------
2. CPU actions (automatic, hardware):
- Save rip to rcx
- Save rflags to r11
- Load rip from IA32_LSTAR MSR (your syscall entry point)
- Load CS from IA32_STAR MSR (kernel code segment)
- Load SS (kernel stack segment)
- Mask rflags with IA32_FMASK MSR
- Switch to Ring 0
- NOTE: rsp NOT changed - you must switch stacks!
Kernel Space (Ring 0)
---------------------
3. syscall_entry (assembly):
- swapgs (switch to kernel GS for per-CPU data)
- Save user rsp to per-CPU storage
- Load kernel rsp from per-CPU storage
- Push user context (for later iretq return)
- Call C syscall dispatcher
4. sys_write() handler:
- Validate fd (is 1 a valid file descriptor?)
- Validate buffer pointer (is it in user space? is it mapped?)
- Copy data from user space (carefully!)
- Perform the write to console/file
- Return bytes written
5. Return to user space:
- Pop saved context
- swapgs (restore user GS)
- sysretq (or iretq for more flexibility)
Back to User Space
------------------
6. After sysret:
- CPU restores rip from rcx, rflags from r11
- Switch back to Ring 3
- libc wrapper returns to caller with result in rax
Draw the stack contents at step 3:
Kernel Stack (after saving context):
+------------------+ <- kernel rsp (low)
| user ss |
| user rsp |
| user rflags |
| user cs |
| user rip (rcx) |
| error code (0) | <- interrupt frame
+------------------+
| rax (syscall #) |
| rbx |
| rcx |
| ... | <- general registers
+------------------+
The Interview Questions They'll Ask
Q1: Explain the difference between physical and virtual addresses, and why kernels use virtual memory.
Expected answer: Physical addresses refer to actual RAM locations. Virtual addresses are what the CPU uses; they're translated by the MMU via page tables. Kernels use virtual memory for: (1) isolation between processes - each has its own address space; (2) abstraction - programs don't need to know physical memory layout; (3) demand paging - not all memory needs to be physically present; (4) shared libraries - same physical pages mapped in multiple processes.
Q2: What happens when a page fault occurs?
Expected answer: The CPU raises exception #14 (page fault), pushing an error code with bits indicating: was it a read/write, user/kernel access, page present or not. The kernel's page fault handler examines the faulting address (in CR2) and error code to determine: (1) valid access to unmapped page -> allocate and map a frame; (2) copy-on-write -> copy the page and remap; (3) stack growth -> extend the stack; (4) invalid access -> kill the process with SIGSEGV.
Q3: How does the kernel protect itself from user programs?
Expected answer: Multiple mechanisms: (1) Privilege rings - kernel runs in Ring 0, users in Ring 3; Ring 3 can't execute privileged instructions. (2) Separate page tables - user pages marked as user-accessible, kernel pages as supervisor-only. (3) SMAP/SMEP on modern CPUs - prevent kernel from executing or even accessing user memory without explicit override. (4) System call interface - only way for user code to request kernel services.
Q4: Explain context switching between two processes.
Expected answer: When switching from process A to B: (1) Save A's register state to its kernel stack or PCB; (2) Switch page tables - load B's PML4 into CR3; (3) Switch kernel stacks - change rsp to B's kernel stack; (4) Restore B's register state; (5) Return to B's code. The trigger is usually a timer interrupt (preemption) or a blocking system call (voluntary switch). TLB is flushed on CR3 change unless using PCID.
Q5: What's the difference between exceptions, interrupts, and traps?
Expected answer: All are handled via the IDT but have different sources. Exceptions: synchronous, caused by CPU (divide by zero, page fault) - faults can be restarted, traps advance past the instruction. Hardware interrupts: asynchronous, from devices (keyboard, timer) via the APIC/PIC - the interrupted instruction completes. Software traps: synchronous, explicitly triggered (int, syscall) - used for system calls. All save state and transfer to a handler.
Q6: How would you add SMP (multiprocessor) support to your kernel?
Expected answer: Key challenges: (1) Per-CPU data structures - each CPU needs its own scheduler queue, current process pointer, kernel stack. Use GS segment for per-CPU access. (2) Lock all shared data - use spinlocks for short critical sections; the scheduler needs careful locking. (3) IPI (inter-processor interrupts) - to signal other CPUs for TLB shootdown, reschedule requests. (4) AP bootstrap - secondary CPUs start in real mode; need special boot code to bring them to long mode.
Hints in Layers
Layer 1 - Multiboot Header and Entry
Start with a minimal bootable kernel:
# boot.S - Multiboot2 header and entry point (GAS comments use '#'; ';' is a statement separator)
.section .multiboot
.align 8
multiboot_header:
    .long 0xE85250D6                # Magic
    .long 0                         # Architecture (i386 protected mode)
    .long multiboot_header_end - multiboot_header
    .long -(0xE85250D6 + 0 + (multiboot_header_end - multiboot_header))
    # End tag
    .word 0
    .word 0
    .long 8
multiboot_header_end:
.section .bss
.align 16
stack_bottom:
    .skip 16384                     # 16 KB stack
stack_top:
.section .text
.global _start
.code32
_start:
    mov $stack_top, %esp
    call check_multiboot
    call check_cpuid
    call check_long_mode
    call setup_page_tables
    call enable_paging
    lgdt gdt64_pointer
    ljmp $0x08, $long_mode_start    # far jump loads the 64-bit code segment
.code64
long_mode_start:
    mov $0x10, %ax
    mov %ax, %ds
    mov %ax, %es
    mov %ax, %ss
    call kernel_main
    hlt
Layer 2 - Transition to Long Mode
Set up identity-mapped page tables and enable paging:
setup_page_tables:
    # Map the first 2MB with a huge page: PML4 -> PDPT -> PD (2MB entry)
    mov $pml4, %edi
    mov $pdpt, %eax
    or $0x03, %eax                  # Present + Writable
    mov %eax, (%edi)
    mov $pdpt, %edi
    mov $pd, %eax
    or $0x03, %eax
    mov %eax, (%edi)
    mov $pd, %edi
    mov $0x83, %eax                 # Present + Writable + Huge (2MB)
    mov %eax, (%edi)
    ret
enable_paging:
    mov $pml4, %eax
    mov %eax, %cr3                  # Load page table root
    mov %cr4, %eax
    or $0x20, %eax                  # Enable PAE
    mov %eax, %cr4
    mov $0xC0000080, %ecx           # EFER MSR
    rdmsr
    or $0x100, %eax                 # Enable Long Mode (LME)
    wrmsr
    mov %cr0, %eax
    or $0x80000001, %eax            # Enable Paging + Protection
    mov %eax, %cr0
    ret
Layer 3 - Interrupt Descriptor Table
Set up exception and interrupt handlers:
// interrupt.c
#include <stdint.h>
struct idt_entry {
uint16_t offset_low;
uint16_t selector;
uint8_t ist;
uint8_t type_attr;
uint16_t offset_mid;
uint32_t offset_high;
uint32_t zero;
} __attribute__((packed));
struct idt_entry idt[256];
void set_idt_entry(int n, uint64_t handler, uint8_t type) {
idt[n].offset_low = handler & 0xFFFF;
idt[n].selector = 0x08; // Kernel code segment
idt[n].ist = 0;
idt[n].type_attr = type; // 0x8E = interrupt gate, 0x8F = trap gate
idt[n].offset_mid = (handler >> 16) & 0xFFFF;
idt[n].offset_high = handler >> 32;
idt[n].zero = 0;
}
extern void isr0(void); // Divide error
extern void isr14(void); // Page fault
extern void irq0(void); // Timer
void idt_init(void) {
set_idt_entry(0, (uint64_t)isr0, 0x8E);
set_idt_entry(14, (uint64_t)isr14, 0x8E);
set_idt_entry(32, (uint64_t)irq0, 0x8E);
// ... more handlers
struct { uint16_t size; uint64_t addr; } __attribute__((packed)) idtr;
idtr.size = sizeof(idt) - 1;
idtr.addr = (uint64_t)idt;
asm volatile("lidt %0" : : "m"(idtr));
}
Layer 4 - Physical Memory Allocator
Simple bitmap-based page frame allocator:
// mm.c
#include <stdint.h>
#include <string.h>   // memset
#define PAGE_SIZE 4096
#define MAX_FRAMES (128UL * 1024 * 1024 / PAGE_SIZE) // 128 MB of 4 KB frames
static uint8_t frame_bitmap[MAX_FRAMES / 8];
static uint64_t total_frames;
static uint64_t free_frames;
void pmm_init(uint64_t mem_size, uint64_t kernel_end) {
total_frames = mem_size / PAGE_SIZE;
free_frames = total_frames;
// Mark all as free initially
memset(frame_bitmap, 0, sizeof(frame_bitmap));
// Mark kernel memory as used
uint64_t kernel_frames = (kernel_end + PAGE_SIZE - 1) / PAGE_SIZE;
for (uint64_t i = 0; i < kernel_frames; i++) {
frame_bitmap[i / 8] |= (1 << (i % 8));
free_frames--;
}
}
uint64_t pmm_alloc_frame(void) {
for (uint64_t i = 0; i < total_frames; i++) {
if (!(frame_bitmap[i / 8] & (1 << (i % 8)))) {
frame_bitmap[i / 8] |= (1 << (i % 8));
free_frames--;
return i * PAGE_SIZE;
}
}
return 0; // Out of memory
}
void pmm_free_frame(uint64_t addr) {
uint64_t frame = addr / PAGE_SIZE;
frame_bitmap[frame / 8] &= ~(1 << (frame % 8));
free_frames++;
}
Layer 5 - Process and Context Switch
Basic process structure and switching:
// process.c
struct context {
uint64_t rsp;
uint64_t rbp;
uint64_t rbx;
uint64_t r12;
uint64_t r13;
uint64_t r14;
uint64_t r15;
uint64_t rip;
};
struct process {
int pid;
enum { RUNNING, READY, BLOCKED, ZOMBIE } state;
struct context context;
uint64_t *page_table;
uint64_t kernel_stack;
uint64_t user_stack;
};
struct process *current;
struct process processes[MAX_PROCESSES];
// Assembly context switch (in switch.S)
// void switch_context(struct context *old, struct context *new);
void schedule(void) {
struct process *next = find_next_runnable();
if (next == current) return;
struct process *prev = current;
current = next;
// Switch page tables
asm volatile("mov %0, %%cr3" : : "r"(next->page_table));
// Switch context
switch_context(&prev->context, &next->context);
}
Books That Will Help
| Book | Chapter(s) | What You'll Learn |
|---|---|---|
| CS:APP | Ch. 9 | Virtual memory fundamentals, page tables |
| CS:APP | Ch. 8 | Exceptions, interrupts, process control |
| CS:APP | Ch. 3 | x86-64 assembly for boot code and handlers |
| OSTEP | Ch. 4-6 | Process abstraction, scheduling, context switching |
| OSTEP | Ch. 13-23 | Virtual memory, paging, swapping |
| OSTEP | Ch. 26-32 | Concurrency, locks, condition variables |
| Intel SDM Vol. 3 | Ch. 2-6 | Protected mode, paging, interrupts |
| OSDev Wiki | Various | Practical tutorials for each component |
| xv6 Book | All | Complete teaching OS with clean code |
Common Pitfalls & Debugging
Problem 1: Triple fault on boot (QEMU resets immediately)
Symptom: QEMU restarts as soon as kernel loads, or immediately after enabling paging.
Cause: Invalid page tables, IDT not set up, or exception in exception handler.
Debug: Use QEMUโs debug options:
$ qemu-system-x86_64 -kernel kernel.bin -d int,cpu_reset -no-reboot
# Shows interrupts and why reset occurred
$ qemu-system-x86_64 -kernel kernel.bin -s -S
# Starts paused; attach GDB: target remote :1234
Problem 2: Page fault in kernel mode
Symptom: Page fault at unexpected address, usually during initialization.
Cause: Accessing unmapped memory, or page tables not set up correctly.
Fix: Verify your page table mappings:
// Debug: print page table entries
void debug_pagewalk(uint64_t addr) {
uint64_t *pml4 = (uint64_t *)read_cr3();
uint64_t pml4e = pml4[(addr >> 39) & 0x1FF];
printf("PML4[%lu] = 0x%lx\n", (unsigned long)((addr >> 39) & 0x1FF), (unsigned long)pml4e);
// ... continue for PDPT, PD, PT
}
Problem 3: Interrupts not working
Symptom: Timer interrupt never fires, keyboard doesnโt respond.
Cause: IDT not loaded, PIC not configured, or interrupts disabled (CLI).
Fix: Verify interrupt setup:
// Check if interrupts enabled
uint64_t flags;
asm volatile("pushfq; pop %0" : "=r"(flags));
if (!(flags & 0x200)) {
printf("Interrupts disabled (IF=0)!\n");
asm volatile("sti");
}
// Verify PIC is sending interrupts
outb(0x20, 0x0A); // Read IRR
printf("PIC IRR: 0x%x\n", inb(0x20));
Problem 4: User program crashes immediately
Symptom: General protection fault or page fault when switching to user mode.
Cause: User page tables wrong, wrong CS/SS for user mode, or stack not set up.
Fix: Verify the iretq frame is correct:
// Stack for iretq to user mode:
// [RSP+32] SS = 0x23 (user data | RPL=3)
// [RSP+24] RSP = user stack pointer
// [RSP+16] RFLAGS = 0x202 (IF set)
// [RSP+8] CS = 0x1B (user code | RPL=3)
// [RSP+0] RIP = user entry point
Legacy project list, re-numbered to match the expanded guides in CSAPP_3E_DEEP_LEARNING_PROJECTS/:
| P# | Legacy Project | Expanded guide |
|---|---|---|
| P02 + P03 | Bit Manipulation Puzzle Solver (Data Lab) | P02-bitwise-data-inspector.md, P03-data-lab-clone.md |
| P05 | Binary Bomb Defuser | P05-bomb-lab-workflow.md |
| P06 | Buffer Overflow Exploit Lab (Attack Lab) | P06-attack-lab-workflow.md |
| P07 | Y86-64 Processor Simulator | P07-y86-64-cpu-simulator.md |
| P09 | Cache Simulator | P09-cache-lab-simulator.md |
| P14 | Dynamic Memory Allocator (Malloc Lab) | P14-build-your-own-malloc.md |
| P11 + P12 | Unix Shell Implementation | P11-signals-processes-sandbox.md, P12-unix-shell-job-control.md |
| P18 | ELF Linker and Loader | P18-elf-linker-and-loader.md |
| P19 | Virtual Memory Simulator | P19-virtual-memory-simulator.md |
| P15 | Robust I/O Library (RIO) | P15-robust-unix-io-toolkit.md |
| P20 | HTTP Web Server | P20-http-web-server.md |
| P17 | Concurrent Web Proxy | P17-csapp-capstone-proxy.md |
| P21 | Thread Pool Implementation | P21-thread-pool-implementation.md |
| P22 | Signal-Safe Printf | P22-signal-safe-printf.md |
| P23 | Performance Profiler | P23-performance-profiler.md |
| P24 | Memory Leak Detector | P24-memory-leak-detector.md |
| P25 | Debugger (ptrace-based) | P25-debugger-ptrace.md |
| P26 | Operating System Kernel Capstone | P26-operating-system-kernel-capstone.md |
Project Comparison Table
| # | Project | Difficulty | Time | Understanding | Fun |
|---|---|---|---|---|---|
| 1 | Toolchain Explorer | Intermediate | 1–2 wk | ★★★★ | ★★★★ |
| 2 | Bitwise Data Inspector | Intermediate | 0.5–2 wk | ★★★★ | ★★★★ |
| 3 | Data Lab Clone | Advanced | 1–2 wk | ★★★★ | ★★★★ |
| 4 | Calling Convention Crash Cart | Advanced | 1–2 wk | ★★★★ | ★★★★ |
| 5 | Bomb Lab Workflow | Advanced | 1–2 wk | ★★★★ | ★★★★ |
| 6 | Attack Lab Workflow | Expert | 2–3 wk | ★★★★ | ★★★★ |
| 7 | Y86-64 CPU Simulator | Expert | 1 mo+ | ★★★★ | ★★★★ |
| 8 | Performance Clinic | Advanced | 1–2 wk | ★★★★ | ★★★★ |
| 9 | Cache Simulator + Visualizer | Advanced | 2–3 wk | ★★★★ | ★★★★ |
| 10 | ELF Link Map + Interposition | Advanced | 2–3 wk | ★★★★ | ★★★★ |
| 11 | Signals + Processes Sandbox | Advanced | 1–2 wk | ★★★★ | ★★★★ |
| 12 | Unix Shell with Job Control | Advanced | 2–3 wk | ★★★★ | ★★★★ |
| 13 | VM Map Visualizer | Advanced | 1–2 wk | ★★★★ | ★★★★ |
| 14 | Build Your Own Malloc | Expert | 1 mo+ | ★★★★ | ★★★★ |
| 15 | Robust Unix I/O Toolkit | Intermediate | 1–2 wk | ★★★★ | ★★★★ |
| 16 | Concurrency Workbench | Expert | 2–3 wk | ★★★★ | ★★★★ |
| 17 | Capstone Proxy | Expert | 2–3 mo | ★★★★ | ★★★★ |
| 18 | ELF Linker and Loader | Expert | 2–3 wk | ★★★★ | ★★★★ |
| 19 | Virtual Memory Simulator | Expert | 2–3 wk | ★★★★ | ★★★★ |
| 20 | HTTP Web Server | Advanced | 1–2 wk | ★★★★ | ★★★★ |
| 21 | Thread Pool Implementation | Advanced | 1–2 wk | ★★★★ | ★★★★ |
| 22 | Signal-Safe Printf | Advanced | Weekend | ★★★★ | ★★★★ |
| 23 | Performance Profiler | Advanced | 1–2 wk | ★★★★ | ★★★★ |
| 24 | Memory Leak Detector | Advanced | 1–2 wk | ★★★★ | ★★★★ |
| 25 | Debugger (ptrace-based) | Expert | 2–4 wk | ★★★★ | ★★★★ |
| 26 | OS Kernel Capstone | Expert | 2–3 mo | ★★★★ | ★★★★ |
Skills Matrix
| Project | Ch.1 | Ch.2 | Ch.3 | Ch.4 | Ch.5 | Ch.6 | Ch.7 | Ch.8 | Ch.9 | Ch.10 | Ch.11 | Ch.12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| P1: Toolchain | ●● |  |  |  |  |  | ●● |  |  |  |  |  |
| P2: Bitwise |  | ●●● | ● |  |  |  |  |  |  |  |  |  |
| P3: Data Lab |  | ●●● |  |  |  |  |  |  |  |  |  |  |
| P4: Crash Cart |  |  | ●●● |  |  |  |  |  |  |  |  |  |
| P5: Bomb Lab |  |  | ●●● |  |  |  |  |  |  |  |  |  |
| P6: Attack Lab |  |  | ●●● |  |  |  |  |  |  |  |  |  |
| P7: Y86-64 |  |  |  | ●●● | ● |  |  |  |  |  |  |  |
| P8: Perf Clinic | ● |  |  |  | ●●● | ●● |  |  |  |  |  |  |
| P9: Cache Lab |  |  |  |  | ●● | ●●● |  |  |  |  |  |  |
| P10: ELF Link |  |  |  |  |  |  | ●●● |  |  |  |  |  |
| P11: Signals |  |  |  |  |  |  |  | ●●● |  |  |  |  |
| P12: Shell |  |  |  |  |  |  |  | ●●● |  |  |  | ● |
| P13: VM Map |  |  |  |  |  |  |  | ●● | ●●● |  |  |  |
| P14: Malloc |  |  |  |  |  | ●● |  |  | ●●● |  |  |  |
| P15: Unix I/O |  |  |  |  |  |  |  |  | ●● | ●●● |  |  |
| P16: Concurrency |  |  |  |  |  |  |  |  |  |  |  | ●●● |
| P17: Capstone | ● | ● | ● |  | ● | ●● | ● | ● | ●● | ●● | ●● | ●● |
| P18: ELF Linker |  |  |  |  |  |  | ●●● | ●● |  |  |  |  |
| P19: VM Simulator |  |  |  |  |  | ●● |  |  | ●●● |  |  |  |
| P20: HTTP Server |  |  |  |  |  |  |  |  |  | ●●● | ●●● |  |
| P21: Thread Pool |  |  |  |  |  |  |  |  |  |  |  | ●●● |
| P22: Signal Printf |  |  |  |  |  |  |  | ●●● |  |  |  | ●● |
| P23: Profiler |  |  | ●● |  | ●●● |  |  | ●● |  |  |  |  |
| P24: Leak Detector |  |  |  |  |  |  | ●● |  | ●●● |  |  |  |
| P25: Debugger |  |  | ●●● |  |  |  |  | ●●● |  |  |  |  |
| P26: OS Kernel | ● | ● | ●● | ●● | ● | ●● | ●● | ●● | ●● | ●● |  | ●● |
| Legend: ●●● = Primary focus | ●● = Significant coverage | ● = Touches on |
Resources
Official CS:APP Materials
- Book: Computer Systems: A Programmer's Perspective, 3rd Edition – Bryant & O'Hallaron
- Lab Materials: csapp.cs.cmu.edu/3e/labs.html
- Student Site: csapp.cs.cmu.edu/3e/students.html
Supplementary Books
- Effective C, 2nd Edition – Robert C. Seacord (modern C practices)
- C Interfaces and Implementations – David R. Hanson (allocator design)
- Operating Systems: Three Easy Pieces – Arpaci-Dusseau (concurrency, VM)
- Computer Organization and Design – Patterson & Hennessy (architecture)
Tools
- Debuggers: GDB, LLDB
- Disassemblers: objdump, Ghidra, Binary Ninja
- Profilers: perf, Valgrind, cachegrind
- Build: Make, CMake, gcc/clang
Summary
| # | Project | Language |
|---|---|---|
| 1 | Hello, Toolchain – Build Pipeline Explorer | C |
| 2 | Bitwise Data Inspector | C |
| 3 | Data Lab Clone | C |
| 4 | x86-64 Calling Convention Crash Cart | C |
| 5 | Bomb Lab Workflow | C |
| 6 | Attack Lab Workflow | C |
| 7 | Y86-64 CPU Simulator | C |
| 8 | Performance Clinic | C |
| 9 | Cache Lab++ | C |
| 10 | ELF Link Map & Interposition Toolkit | C |
| 11 | Signals + Processes Sandbox | C |
| 12 | Unix Shell with Job Control | C |
| 13 | Virtual Memory Map Visualizer | C |
| 14 | Build Your Own Malloc | C |
| 15 | Robust Unix I/O Toolkit | C |
| 16 | Concurrency Workbench | C |
| 17 | CS:APP Capstone Proxy Platform | C |
| 18 | ELF Linker and Loader | C |
| 19 | Virtual Memory Simulator | C |
| 20 | HTTP Web Server | C |
| 21 | Thread Pool Implementation | C |
| 22 | Signal-Safe Printf | C |
| 23 | Performance Profiler | C |
| 24 | Memory Leak Detector | C |
| 25 | Debugger (ptrace-based) | C |
| 26 | Operating System Kernel Capstone | C |
Merged Additions (from LEARN_CSAPP_COMPUTER_SYSTEMS.md)
This file (CSAPP_3E_DEEP_LEARNING_PROJECTS.md) is the canonical "one main file + expanded project guides" path. LEARN_CSAPP_COMPUTER_SYSTEMS.md remains in the repo as a legacy snapshot, but its unique projects and learning-plan ideas are consolidated here so you have a single place to start.
The full legacy document is also included verbatim in Appendix A at the end of this file (collapsed by default).
Overlap Map (Project Equivalents)
| LEARN_CSAPP_COMPUTER_SYSTEMS.md | Closest match in this path | Notes |
|---|---|---|
| Project 1: Data Lab | P2 + P3 | This path splits "inspect representations" and "constraints-style bit puzzles". |
| Project 2: Bomb Lab | P5 | Same lab domain; this path emphasizes a repeatable workflow + writeups. |
| Project 3: Attack Lab | P6 | Same lab domain; this path emphasizes workflow + post-mortems. |
| Project 4: Y86-64 Simulator | P7 | Same core learning objective. |
| Project 5: Cache Simulator | P9 | This path adds locality visualization and "why it's slow" instrumentation. |
| Project 6: Malloc Lab | P14 | Same domain; this path pushes allocator design further. |
| Project 7: Unix Shell | P12 (+P11) | This path explicitly builds signal/process discipline first. |
| Project 10: Robust I/O | P15 | Same domain; this path frames it as a reusable toolkit. |
| Project 12: Concurrent Proxy | P17 (+P16) | This path makes the proxy the capstone and treats thread pools as a prerequisite skill. |
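For concreteness, the heart of the robust I/O toolkit (P15) is the discipline behind the book's `rio_readn`: wrap `read(2)` so that short counts are looped over and `EINTR` interruptions are retried instead of surfaced to the caller. A minimal sketch of that function (not the book's exact code):

```c
#include <errno.h>
#include <unistd.h>

/* Robustly read n bytes: loop over short counts, restart on EINTR.
 * Returns the number of bytes read (short only at EOF) or -1 on error. */
ssize_t rio_readn(int fd, void *usrbuf, size_t n) {
    size_t nleft = n;
    char *bufp = usrbuf;
    while (nleft > 0) {
        ssize_t nread = read(fd, bufp, nleft);
        if (nread < 0) {
            if (errno == EINTR)   /* interrupted by a signal: retry */
                continue;
            return -1;            /* real error */
        }
        if (nread == 0)           /* EOF */
            break;
        nleft -= (size_t)nread;
        bufp += nread;
    }
    return (ssize_t)(n - nleft);  /* bytes actually read */
}
```

The same retry-on-`EINTR` pattern applies to `write(2)`, which is why the path treats this as a reusable toolkit rather than a one-off lab exercise.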
Bonus Projects (Build More of the Stack)
These are additional projects from LEARN_CSAPP_COMPUTER_SYSTEMS.md that are valuable but not part of the core "P1–P17" dependency graph. Each one has an expanded guide in the same folder as the original 17 projects.
| # | Project | Expanded guide |
|---|---|---|
| 18 | ELF Linker and Loader | P18-elf-linker-and-loader.md |
| 19 | Virtual Memory Simulator | P19-virtual-memory-simulator.md |
| 20 | HTTP Web Server | P20-http-web-server.md |
| 21 | Thread Pool Implementation | P21-thread-pool-implementation.md |
| 22 | Signal-Safe Printf (Async-Signal-Safe Logging) | P22-signal-safe-printf.md |
| 23 | Performance Profiler | P23-performance-profiler.md |
| 24 | Memory Leak Detector | P24-memory-leak-detector.md |
| 25 | Debugger (ptrace-based) | P25-debugger-ptrace.md |
| 26 | Final Capstone: Operating System Kernel | P26-operating-system-kernel-capstone.md |
Alternate Time-Based Phases (Optional)
If you prefer a calendar-based plan (instead of dependency-based), the legacy file proposes these phases; they map cleanly onto this path:
- Foundation (4–6 weeks): P1–P4 (+P2/P3 depth as needed)
- Hardware Understanding (4–6 weeks): P7–P9
- System Software (6–8 weeks): P10–P14
- I/O and Networking (3–4 weeks): P15 + (P20 optional) + P17 basics
- Concurrency (4–6 weeks): P16 + P21 + P17 scaling
- Advanced Topics (4+ weeks): P18, P19, P22–P25
- Post-CS:APP Capstone (months): P26
Last updated: December 2025