CS:APP (3rd Edition) — Deep Learning via Buildable Projects
Goal: Transform from a programmer who writes code that “happens to work” into a systems programmer who understands exactly what the machine does with every instruction. By building 17 core projects (+ 9 optional advanced extensions), you will internalize the complete journey from source code to running process—mastering data representation, machine-level execution, memory hierarchy, operating system abstractions, and concurrent programming. When you finish, you will debug crashes by reading registers, optimize code by reasoning about cache lines, and build robust systems that handle real-world failure modes gracefully.
Introduction: What This Guide Covers
CS:APP (Computer Systems: A Programmer’s Perspective) teaches you to treat the compiler, the OS, and the CPU as inspectable systems instead of magic. This guide turns the book into a build-first sprint: you’ll repeatedly ask “what did the machine actually do?” and then prove it with artifacts (ELF headers, disassembly, traces, perf counters, crash dumps).
What you will build (by the end of the core sprint):
- A pipeline explorer that makes preprocessing → compilation → assembly → linking → loading observable
- A byte/bit inspector for integers, floats, and endianness
- Debugging workflows for x86-64 crashes, reverse engineering, and classic memory bugs
- Simulators/visualizers for CPU pipelines and caches
- OS-flavored tooling: process/signal sandbox, a job-control shell, VM map visualizer, and a malloc implementation
- A robust I/O toolkit and a concurrency workbench
- A secure, observable, high-performance proxy server capstone
Optional advanced extensions (if you keep going):
- An ELF linker/loader, VM simulator, tiny HTTP server, thread pool, signal-safe I/O, profiler, leak detector, ptrace debugger, and an OS-kernel capstone
The Big Picture (Mental Model)
You write C
│
├─► Toolchain (cpp/cc/as/ld) ──► ELF binary ──► Loader ──► Process
│                                                            │
│                                                            ├─► Syscalls (kernel boundary)
│                                                            ├─► Signals (async interruptions)
│                                                            ├─► VM (pages, TLB, faults)
│                                                            └─► Caches (lines, locality)
│
└─► You debug/measure by reading the evidence:
        source ⇄ asm ⇄ symbols ⇄ traces ⇄ counters ⇄ dumps
How to Use This Guide
This guide is designed to be worked through systematically, not skimmed. Follow this workflow:
Reading the Primer (First 2-3 Days)
- Read each Theory Primer chapter completely before starting related projects
- Work through the “Check Your Understanding” questions without looking at answers
- Draw the ASCII diagrams by hand to internalize the mental models
- Complete the homework exercises at the end of each chapter
Working Through Projects (Ongoing)
- Read the project’s “Core Question” and “Concepts You Must Understand First”
- Attempt the “Thinking Exercise” before writing any code
- Use “Hints in Layers” only when stuck—start with Hint 1, wait 30 minutes before checking the next
- Validate against the “Definition of Done” checklist before moving on
- Reflect on “Interview Questions” to solidify understanding
Parallel Reference
- Keep CS:APP open alongside this guide
- Use the “Deep Dive Reading by Concept” table to find relevant book sections
- Refer to the Glossary when terms are unclear
Prerequisites & Background Knowledge
Essential Prerequisites (Must Have)
Programming Skills
- Proficiency in C (pointers, structs, memory management, compilation)
- Basic command-line fluency (navigation, file operations, pipes)
- Familiarity with a text editor (Vim/Emacs/VSCode)
- Recommended Reading: “C Programming: A Modern Approach” by K.N. King - Chapters 1-20
Computer Science Fundamentals
- Binary and hexadecimal number systems
- Basic data structures (arrays, linked lists, stacks)
- Understanding of what a compiler does (high-level)
- Recommended Reading: “Code” by Charles Petzold - Chapters 1-15
Helpful But Not Required
These topics are taught through the projects but prior exposure accelerates learning:
- Assembly language basics (learn during Projects 3-6)
- Operating system concepts (learn during Projects 11-14)
- Networking fundamentals (learn during Projects 15, 17, 20)
Self-Assessment Questions
Before starting, you should be able to answer “yes” to these:
- Can you explain what a pointer is and how the `*` and `&` operators work in C?
- Can you compile a C program with `gcc` and run it from the command line?
- Do you know the difference between stack and heap memory allocation?
- Can you read simple Makefiles and understand targets/dependencies?
- Are you comfortable spending 4+ hours debugging a single issue?
If you answered “no” to any question, spend time on the recommended readings first.
Development Environment Setup
Required Tools
| Tool | Version | Purpose |
|---|---|---|
| GCC | 9.0+ | C compiler with debugging symbols |
| GDB | 8.0+ | GNU Debugger for x86-64 |
| Make | 4.0+ | Build automation |
| Git | 2.20+ | Version control |
| Valgrind | 3.15+ | Memory error detection |
Recommended Tools
| Tool | Purpose |
|---|---|
| GEF | GDB enhancement for reverse engineering |
| objdump | Binary inspection |
| readelf | ELF file analysis |
| strace | System call tracing |
| perf | Performance profiling |
Testing Your Setup
$ gcc --version
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
$ gdb --version
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
$ cat > test.c << 'EOF'
#include <stdio.h>
int main() { printf("Setup OK\n"); return 0; }
EOF
$ gcc -g -Wall test.c -o test && ./test
Setup OK
Operating System
- Linux x86-64 is required (Ubuntu 20.04+ or Debian 11+ recommended)
- macOS works for some projects but lacks full `ptrace` support
- Windows users should use WSL2 with Ubuntu
Time Investment
| Project Type | Time Per Project | Examples |
|---|---|---|
| Foundation | 4-8 hours | P1, P2, P3 |
| Intermediate | 10-20 hours | P4-P10 |
| Advanced | 20-40 hours | P11-P17 |
| Extension | 30-60 hours | P18-P26 |
Total Sprint Estimate:
- Core projects (P1-P17): 3-5 months at 10-15 hours/week
- With extensions (P18-P26): 6-9 months total
Important Reality Check
Systems programming is genuinely difficult. You will:
- Spend hours staring at hex dumps and register values
- Debug problems that seem “impossible” until they suddenly click
- Feel frustrated when your code crashes with no obvious cause
- Eventually develop an intuition that makes this feel natural
The struggle is the learning. If projects feel easy, you’re not going deep enough.
Why CS:APP Matters
Modern Motivation: Why Learn This in 2025?
In an era of high-level frameworks, AI code generation, and cloud abstractions, systems programming skills are more valuable than ever—precisely because they’re rarer.
The “Power Programmer” Gap
According to Teach Yourself CS, CS:APP creates “power programmers” who:
- Debug crashes by reading registers instead of adding print statements
- Optimize performance by reasoning about cache behavior
- Understand security vulnerabilities at the instruction level
- Build infrastructure that AI systems run on
Career Impact (2024-2025)
The job market has shifted toward specialists. According to IEEE Spectrum and Pragmatic Engineer:
- 87% of tech leaders report difficulty finding skilled systems talent
- AI/ML infrastructure roles require deep systems knowledge
- Senior engineers with systems debugging skills command 30-50% salary premiums
- The share of AI/ML jobs increased from 10% to 50% between 2023-2025—someone must build the infrastructure
Where These Skills Apply
┌─────────────────────────────────────────────────────────────────┐
│ SYSTEMS PROGRAMMING CAREERS │
├─────────────────────────────────────────────────────────────────┤
│ Infrastructure │ Security │ Performance │
│ ───────────── │ ──────── │ ─────────── │
│ • Cloud platforms │ • Vulnerability │ • Game engines │
│ • Database cores │ research │ • Trading systems │
│ • Compiler dev │ • Malware │ • Embedded systems │
│ • Container │ analysis │ • Graphics drivers │
│ runtimes │ • OS hardening │ • ML infrastructure │
└─────────────────────────────────────────────────────────────────┘
Context & Evolution
1960s-70s: The Birth of C and Unix. Dennis Ritchie created C at Bell Labs in 1972 to rewrite Unix. His design philosophy—giving programmers direct hardware access with minimal runtime overhead—defined systems programming for the next 50 years.
1980s-90s: x86 Dominance. Intel’s x86 architecture became the standard for personal computing. Understanding x86 assembly became essential for performance-critical code.
2000s: The 64-bit Transition. AMD64 (x86-64) extended the architecture with more registers and a flat memory model. CS:APP 3rd Edition (2015) updated its focus to this architecture.
2020s: The Resurgence. With Moore’s Law slowing and cloud computing creating massive scale challenges, deep systems understanding has become the differentiator between “good enough” and “excellent” engineers.
Real-World Impact Statistics
- Linux runs on 96.3% of the top 1 million web servers (2024, W3Techs)
- Firefox’s memory usage dropped 30% after systems-level optimizations
- Google’s TCMalloc handles billions of allocations per second across their fleet
- A single TLB shootdown can cost 1,000-5,000 CPU cycles—understanding this prevents performance cliffs
Theory Primer (Mini-Book)
This section is your textbook. It covers the nine pillars of systems programming in detail. You must understand these mental models to complete the projects.
Chapter 1: Translation, Linking, and Execution
Fundamentals
The “Build” button in your IDE is a comfortable lie. It obscures a complex, four-stage pipeline that transforms human-readable text into machine-executable binary signals. Systems programmers must look behind this curtain because that is where “undefined reference” errors, segmentation faults during startup, and weird library versioning bugs live. The translation process is not magic; it is a deterministic sequence of transformations: Preprocessing (text substitution), Compilation (C to Assembly), Assembly (Assembly to Machine Code), and Linking (stitching object files together). Understanding this pipeline transforms you from someone who blindly tries command-line flags until it compiles, into someone who architects build systems and debugs binary compatibility issues with precision.
Deep Dive into Translation & Linking
The journey from hello.c to a running process involves four distinct tools, often orchestrated by a driver like gcc.
- Preprocessing (`cpp`): This is a text-processing step. It handles directives starting with `#`. It strips comments, expands macros, and physically pastes the contents of header files (`#include`) into your source file. The output is a “Translation Unit” (often `.i`). Crucially, the preprocessor knows nothing about C syntax; it just manipulates text. This is why a missing semicolon in a header file can cause an error 500 lines later in your `.c` file.
- Compilation (`cc1`): This is the heavy lifter. It parses the Translation Unit, checks for syntax and semantic errors (type safety), performs optimizations (like dead-code elimination or loop unrolling), and generates assembly code (`.s`). Assembly is a human-readable text representation of the CPU’s machine code. This is where the abstraction of “C” ends and the reality of the hardware begins.
- Assembly (`as`): The assembler translates the `.s` file into a Relocatable Object File (`.o`). This file contains binary machine code, but it is not executable yet. Addresses of global variables and functions are “relocatable” (usually set to zero) because the assembler doesn’t know where this code will live in memory or where other functions (like `printf`) are located. It produces a Symbol Table listing the functions it defines and the ones it needs.
- Linking (`ld`): The linker solves the puzzle. It takes multiple `.o` files and libraries (`.a` archives or `.so` shared objects) and combines them into one executable (ELF file). Its two main jobs are:
  - Symbol Resolution: Matching every reference to a symbol (e.g., a call to `foo`) with exactly one definition. If zero definitions are found, you get “undefined reference”. If two are found, you get “multiple definition”.
  - Relocation: Once all code sizes are known, the linker assigns final runtime memory addresses to every instruction and variable. It then patches the dummy addresses in the code with the real ones.
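You can replay the four stages above by hand by stopping the gcc driver early. These flags are standard; the file names are just examples:
$ gcc -E hello.c -o hello.i   # stop after preprocessing (the Translation Unit)
$ gcc -S hello.i -o hello.s   # stop after compilation (assembly)
$ gcc -c hello.s -o hello.o   # stop after assembling (relocatable object)
$ gcc hello.o -o hello        # link into an ELF executable
Inspect each intermediate file as you go: `hello.i` and `hello.s` are plain text; `hello.o` needs `objdump -d` or `readelf`.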
Static vs. Dynamic Linking
Static linking copies the library code into your executable, creating a large, standalone binary. Dynamic linking keeps the library separate; the executable contains a “promise” to find the library at runtime. The Loader (part of the OS) fulfills this promise when you run the program, using the Dynamic Linker (ld-linux.so) to map the library into memory and resolve symbols lazily.
How this fits on projects
You will explore this pipeline in Project 1, inspect ELF files in Project 10, and build your own linker in Project 18.
Definitions & Key Terms
- Translation Unit: The output of the preprocessor; a single source file with all headers included.
- Symbol Table: Metadata in an object file that maps names (functions/globals) to their locations.
- Relocation Entry: A specific instruction in the object file that the linker must patch with a final address.
- ABI (Application Binary Interface): The binary-level contract (calling convention, alignment, type sizes) that allows modules to interoperate.
Mental Model Diagram
Source (.c)  [text]
   │  Preprocessor (cpp)
   ▼
Trans Unit (.i)  [text]
   │  Compiler (cc1)
   ▼
ASM (.s)  [text]
   │  Assembler (as)
   ▼
Object (.o)  [binary]
   │
   ├──◄── Libraries (.a / .so)
   ▼
Linker (ld)
   │
   ▼
Executable (ELF)  [binary]
How it Works (Step-by-Step)
- Driver: The user runs `gcc main.c`.
- CPP: `main.c` -> `main.i`. All `#define`s are replaced.
- CC1: `main.i` -> `main.s`. C logic becomes x86-64 logic.
- AS: `main.s` -> `main.o`. Machine code is generated. Function offsets are 0x0.
- LD: `main.o` + `libc.so` references -> `a.out`. Addresses are assigned (e.g., `main` starts at 0x401000).
- Loader: The user runs `./a.out`. The OS creates a process, maps the file, and jumps to `_start`.
Common Misconceptions
- Misconception: “The compiler handles libraries.” -> Truth: The compiler only checks headers. The linker handles libraries.
- Misconception: “Compilation is one step.” -> Truth: It is four, and you can stop at any of them to debug.
- Misconception: “Header files are compiled.” -> Truth: No, they are text-pasted into `.c` files. You can’t compile a `.h` file directly to an object.
Check-Your-Understanding
- If you delete the definition of `puts` but keep the declaration, which stage fails?
- If you have a syntax error in a macro, which stage reports it?
- Does a relocatable object file (`.o`) contain absolute or relative addresses?
Answers
- The Linker. The compiler is happy as long as the declaration exists.
- The Compiler (usually), though the error originated in the preprocessor’s expansion.
- Relative (or zero-based). It assumes it starts at 0.
Real-World Applications
- Hot Patching: Modifying binary code in memory to fix bugs without restarting.
- Plugin Systems: Using `dlopen`/`dlsym` to load code at runtime (dynamic linking).
- Build Systems: CMake and Make exist entirely to orchestrate this pipeline efficiently.
Key Insight
The “undefined reference” error is just a database lookup failure in the Symbol Table.
Summary
Translation is a pipeline of lowering abstractions. Preprocessing handles text, compilation handles logic, assembly handles machine encoding, and linking handles memory layout and dependencies.
Homework
Run gcc -v hello.c. Capture the output and identify the path to cc1, as, and collect2 (the linker driver).
Homework Solution
$ gcc -v hello.c 2>&1 | grep -E "(cc1|as|collect2)"
/usr/lib/gcc/x86_64-linux-gnu/11/cc1 -quiet -v -imultiarch x86_64-linux-gnu hello.c ...
as -v --64 -o /tmp/ccXXXXXX.o /tmp/ccXXXXXX.s
/usr/lib/gcc/x86_64-linux-gnu/11/collect2 -plugin /usr/lib/gcc/x86_64-linux-gnu/11/liblto_plugin.so ...
Note: collect2 is GCC’s wrapper around the system linker (ld). The actual paths vary by system.
References
- CS:APP 3e - Chapters 1, 7 (primary source)
- Practical Binary Analysis by Dennis Andriesse - Ch. 1-2 (ELF format deep dive)
- Low-Level Programming by Igor Zhirkov - Ch. 3-4 (compilation pipeline)
- Linkers and Loaders by John Levine - Classic reference on linking
- GNU Binutils documentation: https://sourceware.org/binutils/docs/
Chapter 2: Data Representation
Fundamentals
To a computer, “type” is an illusion. Memory is just a vast ocean of bits. Whether a sequence of 32 bits represents the integer 1,065,353,216, the floating-point number 1.0, or two assembly instructions depends entirely on context—specifically, how the CPU is instructed to read it. Systems programmers must master these low-level encodings because hardware does not protect you from misinterpreting them. Understanding bitwise operations, integer overflow, and floating-point inexactness is mandatory for writing secure and reliable software.
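A minimal demonstration of “bits + context”, assuming 32-bit `int` and `float` and IEEE 754 (true on x86-64): the very bit pattern mentioned above reads as 1,065,353,216 or as 1.0 depending on the type.
#include <stdio.h>
#include <string.h>

int main(void) {
    int i = 1065353216;        // bit pattern 0x3F800000
    float f;
    memcpy(&f, &i, sizeof f);  // reinterpret the bits; no numeric conversion
    printf("as int:   %d\n", i);
    printf("as float: %f\n", f);   // prints 1.000000
    return 0;
}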
Deep Dive into Data Representation
We focus on three primary encodings: Unsigned Integers, Signed Integers (Two’s Complement), and Floating Point (IEEE 754).
- Unsigned Integers: These are straightforward binary place-value systems. A $w$-bit unsigned number maps bits $\vec{x}$ to $\sum_{i=0}^{w-1} x_i 2^i$. They wrap around modulo $2^w$.
- Signed Integers (Two’s Complement): This is the standard for representing negative numbers. The most significant bit (MSB) has a negative weight. A $w$-bit number maps to $-x_{w-1}2^{w-1} + \sum_{i=0}^{w-2} x_i 2^i$. This genius design allows the same hardware adder circuit to handle both signed and unsigned addition. However, it introduces asymmetry: `TMin` (e.g., -128) has no positive counterpart (max is +127). This leads to the famous `abs(INT_MIN)` bug.
- Floating Point (IEEE 754): Floats are NOT real numbers. They are a finite, discrete subset of the reals. A float is encoded as $V = (-1)^s \times M \times 2^E$.
- Sign (s): 1 bit.
- Exponent (E): Encoded with a bias. Controls the range.
- Mantissa (M): The fractional part. Controls the precision.
- Floats have gaps. As numbers get larger, the gap between representable numbers increases. $0.1 + 0.2 \ne 0.3$ because 0.1 cannot be exactly represented in binary.
- Endianness: Multi-byte objects (like `int`, `long`) must be stored in memory. Little Endian (x86) stores the least significant byte at the lowest address. Big Endian (network order) stores the most significant byte first.
How this fits on projects
You will build a tool to visualize this in Project 2 and enforce bitwise constraints in Project 3.
Definitions & Key Terms
- Word Size: The natural size of pointers and integers (usually 64-bit on modern machines).
- Two’s Complement: The dominant encoding for signed integers.
- NaN (Not a Number): A special float bit pattern resulting from invalid operations (e.g., `sqrt(-1)`).
- Normalization: The standard form of floats where the leading binary digit is implicitly 1.
Mental Model Diagram
Memory Address: 0x100 0x101 0x102 0x103
Value (0x12345678):
Big Endian: [12] [34] [56] [78] (Reads left-to-right)
Little Endian: [78] [56] [34] [12] (Reads "backwards")
How it Works (Integer Casting)
Casting between signed and unsigned in C does not change the bits; it only changes how the compiler interprets them.
short s = -1;                          // bits: 1111 1111 1111 1111
unsigned short u = (unsigned short)s;  // same bits; value: 65535
Common Misconceptions
- Misconception: “Signed overflow throws an exception.” -> Truth: In C, signed overflow is Undefined Behavior (UB). The compiler might assume it never happens and optimize away your safety checks.
- Misconception: “Floats have constant precision.” -> Truth: Precision decreases as magnitude increases.
Check-Your-Understanding
- Why does `(int)(unsigned)-1` print `-1`?
- Is `x < y` always equivalent to `x - y < 0`?
- What happens if you cast a `double` to an `int`?
Answers
- `(unsigned)-1` is `UINT_MAX` (all ones). Casting back to `int` interprets “all ones” as -1 in Two’s Complement. The bits never changed.
- No! If `x` is `INT_MIN` and `y` is 1, `x - y` overflows to a large positive number.
- Truncation (rounding toward zero). If the double is too large, behavior is undefined.
Real-World Applications
- Crypto: Relies heavily on precise bitwise operations and large integer arithmetic.
- Networking: Protocols require converting host endianness to network endianness (`htonl`, `ntohl`).
Key Insight
Types are just lenses for viewing bits. Casting changes the lens, not the scene (usually).
Summary
Information = Bits + Context. You must understand the limits of integer and float encodings to write safe code.
Homework
Write a function is_little_endian() that returns 1 if the machine is Little Endian, using a union or pointer casting.
Homework Solution
// Method 1: Union (exploits how unions share memory)
int is_little_endian_v1() {
    union { int i; char c; } u;
    u.i = 1;
    return u.c == 1; // If little endian, LSB is at lowest address
}

// Method 2: Pointer casting
int is_little_endian_v2() {
    int x = 1;
    return *((char*)&x) == 1; // Cast int* to char*, read first byte
}
On x86-64 (little endian), both return 1. The value 0x00000001 is stored as [01][00][00][00] with 01 at the lowest address.
References
- CS:APP 3e - Chapter 2 (primary source)
- Write Great Code Vol. 1 by Randall Hyde - Ch. 4-6 (data representation)
- Effective C by Robert Seacord - Ch. 5 (integers and overflow)
- IEEE 754-2019 Standard: https://ieeexplore.ieee.org/document/8766229
- “What Every Computer Scientist Should Know About Floating-Point Arithmetic” by David Goldberg
Chapter 3: Machine-Level Programming
Fundamentals
The CPU does not know C. It does not understand while loops, structs, or objects. It understands a linear stream of simple instructions: move data, do math, jump to address. Machine-level programming (Assembly) is the study of how high-level constructs are mapped onto this rigid hardware reality. Mastering this allows you to reverse-engineer binaries, debug “impossible” crashes, and understand exactly how function calls and the stack work.
Deep Dive into x86-64
We focus on x86-64, a Complex Instruction Set Computer (CISC) architecture.
- Register File: The CPU has 16 general-purpose 64-bit registers (`%rax`, `%rbx`, `%rcx`, `%rdx`, `%rsi`, `%rdi`, `%rbp`, `%rsp`, `%r8`-`%r15`). These are the fastest memory in the system.
- Operand Forms: Instructions can operate on:
  - Immediates: Constants (`$0x4`).
  - Registers: Values in registers (`%rax`).
  - Memory: Values at an address (`(%rax)`).
- The Stack: A region of memory used for function calls. It grows downward (from high addresses to low). `%rsp` (Stack Pointer) points to the top element. `push` decrements `%rsp` and writes; `pop` reads and increments.
- Calling Convention (System V AMD64): This is the “law” of function calls.
  - Args: The first 6 arguments go in `%rdi`, `%rsi`, `%rdx`, `%rcx`, `%r8`, `%r9`. The rest go on the stack.
  - Return: The value goes in `%rax`.
  - Saved Registers: The “callee” must preserve `%rbx`, `%rbp`, `%r12`-`%r15`. The “caller” must save any other registers it needs across the call.
- Control Flow: `if`, `while`, and `for` are implemented using `cmp` (compare) and `jX` (jump if condition) instructions. A `switch` statement is often compiled into a Jump Table (an array of code addresses) for O(1) dispatch.
How this fits on projects
You will debug crashes in Project 4, reverse engineer binaries in Project 5, and perform buffer overflow attacks in Project 6.
Definitions & Key Terms
- PC (Program Counter): The register (`%rip` in x86-64) holding the address of the next instruction.
- Stack Frame: The slice of the stack allocated for a single function call. Contains local vars, saved registers, and the return address.
- LEA (Load Effective Address): An instruction meant for address math, but often used for fast arithmetic (e.g., `lea (%rdi,%rdi,2), %rax` computes `x*3`).
Mental Model Diagram
Stack (High Address)
┌──────────────────┐
│ Return Address │ <- Pushed by 'call'
├──────────────────┤
│ Saved %rbp │ <- Pushed by function prologue
├──────────────────┤
│ Local Variables │ <- Space made by 'sub $N, %rsp'
├──────────────────┤
│ ... │
└──────────────────┘ <- %rsp (Stack Pointer)
(Low Address)
How it Works (The Function Call)
- Caller: Puts args in registers. Executes `call func`. `call` pushes `%rip` (the return address) onto the stack and jumps to `func`.
- Callee (Prologue): `push %rbp`; `mov %rsp, %rbp` (sets up the frame).
- Callee (Body): Does the work. Uses the stack for locals if registers aren’t enough.
- Callee (Epilogue): `leave` (restores `%rsp`/`%rbp`); `ret`. `ret` pops the return address into `%rip`, resuming the caller.
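To make the prologue/epilogue concrete, here is an annotated sketch in AT&T syntax (the function name and the 16-byte local area are illustrative, not a real compiler dump):
func:
    push %rbp          # save the caller's frame pointer
    mov  %rsp, %rbp    # establish this frame's base
    sub  $16, %rsp     # reserve 16 bytes for locals
    ...                # body: first args arrive in %rdi, %rsi, ...
    leave              # mov %rbp, %rsp; pop %rbp
    ret                # pop the return address into %rip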
Common Misconceptions
- Misconception: “Local variables are initialized to zero.” -> Truth: They contain whatever stack garbage was there before.
- Misconception: “The stack is secure.” -> Truth: It is just writable memory. If you write past an array, you overwrite the Return Address (Buffer Overflow), hijacking control flow.
Check-Your-Understanding
- Where is the return address stored?
- If function A calls function B, which registers must B save before using?
- What does `sub $16, %rsp` do?
Answers
- On the stack, directly above the saved base pointer.
- The callee-saved registers: `%rbx`, `%rbp`, `%r12`-`%r15`.
- It allocates 16 bytes of space on the stack (which grows downward).
Real-World Applications
- Debugging: Reading stack traces requires understanding frames.
- Performance: Passing structs by value is slow (copies to stack); passing by pointer is fast (register).
- Security: ROP (Return Oriented Programming) exploits rely on chaining assembly gadgets.
Key Insight
C is just a thin wrapper around assembly. The “Stack” is a convention, not a hardware law.
Summary
Machine code is the ground truth. Understanding the stack discipline and register conventions is the key to debugging and security.
Homework
Compile a C function long plus(long a, long b) { return a+b; } with gcc -S and identify the registers used.
Homework Solution
$ echo 'long plus(long a, long b) { return a+b; }' > plus.c
$ gcc -S -O1 plus.c -o plus.s
$ cat plus.s
Expected output (key lines):
plus:
    leaq (%rdi,%rsi), %rax   # a is in %rdi, b is in %rsi
    ret                      # result returned in %rax
Registers used: %rdi (first arg), %rsi (second arg), %rax (return value). Note: the compiler picks `lea` for the addition because it can add two registers and write the result to a third in one instruction, leaving %rdi and %rsi untouched.
References
- CS:APP 3e - Chapter 3 (primary source)
- Hacking: The Art of Exploitation by Jon Erickson - Ch. 2 (assembly basics)
- Debugging x86-64 Assembly with GDB - Practical debugging guide
- GEF (GDB Enhanced Features) - Essential GDB extension
- System V AMD64 ABI: https://refspecs.linuxbase.org/elf/x86_64-abi-0.99.pdf
- Intel x86-64 Manual Vol. 1: https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html
Chapter 4: Processor Architecture & Optimization
Fundamentals
We treat the CPU as a black box that runs instructions sequentially. In reality, modern CPUs are massively parallel, pipelined, superscalar beasts. They execute many instructions at once, guess the future (speculation), and execute instructions out of order. Understanding this microarchitecture is essential for writing high-performance code.
Deep Dive into Architecture
- Pipelining: Like a car assembly line. Instead of doing one instruction start-to-finish, the CPU breaks it into stages: Fetch, Decode, Execute, Memory, Writeback. Ideally, if there are 5 stages, you get 5x throughput.
- Hazards: Things that break the pipeline.
- Data Hazard: Instruction B needs a result from A, but A isn’t done. The pipeline must Stall (wait) or use Forwarding (pass data internally).
- Control Hazard: The CPU doesn’t know if a conditional jump will be taken.
- Branch Prediction: To solve control hazards, the CPU guesses: “I bet this `if` is true.” It executes speculatively. If wrong, it must Flush the pipeline (throw away the work), costing 15-40 cycles.
- Superscalar: Having multiple pipelines. The CPU can issue `add` and `mul` in the same cycle.
- Aliasing: Two pointers might point to the same memory. The compiler can’t optimize effectively because writing to A might change B.
- Procedure Calls: The compiler treats functions as black boxes (mostly).
- Unpredictable Branches: Sorting data makes code faster because branches become predictable.
How this fits on projects
You will build a CPU simulator in Project 7 and optimize code in Project 8.
Definitions & Key Terms
- Latency: Time to complete one instruction.
- Throughput: Number of instructions completed per second.
- CPE (Cycles Per Element): Metric for measuring loop performance.
- Loop Unrolling: Manually writing out loop bodies to reduce overhead and increase parallelism.
Mental Model Diagram (Pipeline)
Clock Cycle: 1 2 3 4 5
Instr 1: [F] [D] [E] [M] [W]
Instr 2: [F] [D] [E] [M]
Instr 3: [F] [D] [E] <-- Parallel Execution!
How it Works (Branch Prediction)
If you have `if (x < 0)`, the CPU guesses an outcome (say, false) and starts loading the `else` code. If `x` turns out to be -1, the CPU realizes the error, wipes the pipeline, and loads the `if` code. This penalty is why branchless programming (using bitwise math instead of `if`) can be faster.
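As one classic illustration, here is a branchless absolute value (a sketch; it assumes the arithmetic right shift on signed ints that gcc/clang provide on x86-64, and, like `abs()` itself, is undefined for `INT_MIN`):
#include <stdio.h>

// mask is all-ones for negative x, all-zeros otherwise, so
// (x + mask) ^ mask negates x exactly when it was negative.
// No jump, nothing to mispredict.
int abs_branchless(int x) {
    int mask = x >> 31;       // 0x00000000 or 0xFFFFFFFF
    return (x + mask) ^ mask;
}

int main(void) {
    printf("%d %d\n", abs_branchless(-7), abs_branchless(7)); // 7 7
    return 0;
}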
Common Misconceptions
- Misconception: “Fewer lines of C code = Faster.” -> Truth: Not necessarily. Complex one-liners might generate terrible assembly.
- Misconception: “The CPU executes in order.” -> Truth: It fetches in order, but executes out of order (OoO) to fill idle units.
Check-Your-Understanding
- Why is a linked list traversal often slower than an array traversal (ignoring cache)?
- What is the cost of a branch misprediction?
- Why does the `restrict` keyword help optimization?
Answers
- Data dependency. You can’t load node B until you read node A’s `next` pointer. This kills instruction-level parallelism.
- Pipeline flush. Usually 15-40 cycles of lost work.
- It tells the compiler that pointers don’t alias, allowing it to cache values in registers.
Real-World Applications
- HFT (High-Freq Trading): Optimizing to the cycle level to beat competitors.
- Game Engines: Organizing data to maximize parallel processing (Data-Oriented Design).
Key Insight
To make code fast, feed the pipeline. Avoid dependencies and unpredictable branches.
Summary
The CPU is a factory. Your job is to keep the conveyor belts full.
Homework
Write a loop that sums an array. Compare performance of standard loop vs. unrolled loop (4x).
Homework Solution
// Standard loop
long sum_std(long *arr, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++) sum += arr[i];
    return sum;
}

// 4x unrolled loop
long sum_unrolled(long *arr, int n) {
    long sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
    int i;
    for (i = 0; i < n - 3; i += 4) {
        sum0 += arr[i]; sum1 += arr[i+1];
        sum2 += arr[i+2]; sum3 += arr[i+3];
    }
    for (; i < n; i++) sum0 += arr[i]; // Handle remainder
    return sum0 + sum1 + sum2 + sum3;
}
Benchmark with time or perf stat. The unrolled version is typically 20-40% faster due to reduced loop overhead and increased ILP (4 independent additions per iteration).
References
- CS:APP 3e - Chapters 4-5 (primary source)
- Computer Organization and Design by Patterson & Hennessy - Ch. 4 (pipelining)
- Agner Fog’s Optimization Manuals: https://www.agner.org/optimize/
- Performance Engineering of Software Systems (MIT OCW) - Excellent performance course
Chapter 5: The Memory Hierarchy
Fundamentals
Programmers want memory that is infinite, fast, and cheap. Physics says you can pick two. The solution is the Memory Hierarchy: a pyramid of storage technologies. Small, fast memory at the top (Registers/L1 Cache); huge, slow memory at the bottom (Disk). The system works because of Locality: programs tend to access the same data, or nearby data, repeatedly.
Deep Dive into Caches
- Hierarchy Levels:
- Registers: 0 cycles.
- L1 Cache: ~4 cycles. (SRAM)
- L2/L3 Cache: ~10-50 cycles. (SRAM)
- Main Memory (DRAM): ~100-200 cycles.
- Disk (SSD/HDD): Millions of cycles.
- Cache Structure: Caches are divided into Sets, containing Lines (Blocks). A memory address is split into Tag, Set Index, and Block Offset.
- Miss Types:
- Compulsory: First access (Cold).
- Conflict: Multiple addresses map to the same set, evicting each other.
- Capacity: The Working Set is larger than the cache.
- Locality:
  - Temporal: Using `i` in a loop (same address, many times).
  - Spatial: Iterating an array: `a[i]` then `a[i+1]` (nearby addresses).
How this fits on projects
You will simulate a cache in Project 9 and write a malloc allocator in Project 14.
Definitions & Key Terms
- Cache Line: The unit of transfer (usually 64 bytes). You never fetch just a byte; you fetch a line.
- Hit Rate: % of accesses found in cache.
- Eviction: Throwing out a line to make room for new data (policies: LRU, Random).
- Write-Back: Only writing to RAM when a modified line is evicted (faster).
Mental Model Diagram
Address: [ Tag | Index | Offset ]
                   │
                   ▼
             [ Select Set ]
             /     |     \
         Line 1  Line 2  Line 3 ...
         (Tag?)  (Tag?)  (Tag?)
How it Works (The Stride Issue)
Iterating a 2D array int a[1024][1024] row-by-row is fast (Stride-1, Spatial Locality). Iterating column-by-column is slow (Stride-1024). You jump 4KB every step, missing the cache line you just fetched.
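A sketch of the two traversal orders (the function names are illustrative; both compute the same sum, only the access pattern differs):
#include <stdio.h>

// Row-major C arrays make sum_rows stride-1 (good spatial locality),
// while sum_cols jumps n*sizeof(int) bytes on every access.
long sum_rows(int n, int a[n][n]) {
    long s = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            s += a[i][j];
    return s;
}

long sum_cols(int n, int a[n][n]) {
    long s = 0;
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            s += a[i][j];
    return s;
}

int main(void) {
    enum { N = 1024 };
    static int a[N][N];   // zero-initialized, 4MB
    printf("%ld %ld\n", sum_rows(N, a), sum_cols(N, a));
    return 0;
}
Time the two functions on a large matrix: same Big-O, very different wall-clock behavior.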
Common Misconceptions
- Misconception: “Memory access is O(1).” -> Truth: It varies by 100x depending on cache hits.
- Misconception: “Linked lists are fine.” -> Truth: They scramble memory patterns, killing spatial locality (Pointer Chasing).
Check-Your-Understanding
- Why is a cache block size of 1 byte a bad idea?
- What happens to the cache when you context switch?
- If there are 4 index bits, how many sets are there?
Answers
- No spatial locality benefit. Metadata overhead would be huge.
- It gets “polluted” by the new process. When you switch back, you suffer cold misses.
- $2^4 = 16$ sets.
Real-World Applications
- Matrix Multiplication: Tiled algorithms exist solely to fit blocks into L1 cache.
- Databases: Designed to minimize disk seeks and maximize page cache hits.
Key Insight
Locality is King. Algorithms with the same Big-O can differ in speed by 10x-100x due to cache behavior.
Summary
The cache is a hardware hash table that you cannot control directly, but you can influence by accessing memory predictably.
Homework
Write a program that accesses an array with stride 1, 2, 4, … 1024. Measure the time per access. Plot it. You will see the cache cliffs.
Homework Solution
#include <stdio.h>
#include <time.h>

#define SIZE (1024 * 1024 * 64) // 64M elements = 512MB

int main() {
    static long arr[SIZE];
    for (int stride = 1; stride <= 1024; stride *= 2) {
        clock_t start = clock();
        long sum = 0;
        for (int i = 0; i < SIZE; i += stride) sum += arr[i];
        clock_t end = clock();
        double time_per = (double)(end - start) / CLOCKS_PER_SEC / (SIZE / stride);
        // Print sum as well so the compiler cannot optimize the loop away.
        printf("Stride %4d: %.2f ns/access (sum=%ld)\n", stride, time_per * 1e9, sum);
    }
    return 0;
}
Expected output pattern:
Stride 1: 1.2 ns/access <- L1 cache hit
Stride 2: 1.3 ns/access
Stride 4: 1.5 ns/access
Stride 8: 2.0 ns/access <- L1 miss, L2 hit (64B line / 8B stride = 8)
Stride 16: 3.5 ns/access <- L2 miss starts
Stride 32: 5.0 ns/access
...
Stride 1024: 80+ ns/access <- RAM access (TLB misses too)
The “cliff” at stride 8 shows when you exceed the cache line size (64 bytes / 8 byte longs = 8 accesses per line).
References
- CS:APP 3e - Chapter 6 (primary source)
- OSTEP - Chapters 18-22 (paging and caching)
- Intel Optimization Manual - Section on memory subsystem
- “What Every Programmer Should Know About Memory” by Ulrich Drepper (2007, still relevant)
Chapter 6: Virtual Memory (VM)
Fundamentals
Virtual Memory is the great illusionist. It gives every process the belief that it has a massive, contiguous chunk of private memory starting at address 0. In reality, physical RAM is fragmented, shared, and possibly full. VM adds a layer of indirection: the CPU translates Virtual Addresses to Physical Addresses for every single instruction.
Deep Dive into VM
- Paging: Memory is divided into fixed-size chunks called Pages (usually 4KB).
- Page Tables: A data structure in RAM (managed by the OS) that maps Virtual Page Numbers (VPN) to Physical Page Numbers (PPN). Each process has its own page table.
- MMU (Memory Management Unit): Hardware that performs the translation.
- TLB (Translation Lookaside Buffer): A cache inside the MMU that stores recent translations. Without it, every memory access would require 2+ RAM accesses (the page-table walk plus the data itself), at least halving memory performance.
- Page Faults:
- Hard Fault: Page is on disk (Swap). OS pauses process, fetches from disk.
- Soft Fault: Page is in RAM but not marked present (e.g., Copy-on-Write).
- Segfault: Accessing unmapped/forbidden memory.
- Features Enabled by VM:
- Protection: Read-only bits prevent code overwrites.
  - Sharing: `libc` is loaded once into RAM and mapped into all processes.
  - Copy-on-Write: `fork()` is fast because it shares pages initially.
How this fits on projects
You will visualize this in Project 13 and simulate it in Project 19.
Definitions & Key Terms
- Swap: Using disk space to extend RAM. Slow.
- Thrashing: When the working set > RAM, and the OS spends all time swapping pages in and out.
- VIPT (Virtually Indexed, Physically Tagged): How L1 caches interact with VM for speed.
Mental Model Diagram
Virtual Addr:   [ VPN | Offset ]
                   │      │
             [Page Table] └────┐
                   │           │
                   ▼           ▼
Physical Addr:  [ PPN     | Offset ]
How it Works (Fork)
When fork() is called, the OS duplicates the Page Table, not the RAM. It marks all pages as “Read-Only”. If Child tries to write, MMU raises a protection fault. OS catches it, copies that specific page to a new physical frame, updates the Child’s page table, and resumes.
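A small demonstration of the effect (a sketch): parent and child share the page holding `x` until the child writes, at which point each sees its own private copy.
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int x = 42;
    pid_t pid = fork();
    if (pid == 0) {           // child
        x = 99;               // this write triggers the private page copy
        printf("child:  x = %d\n", x);
        return 0;
    }
    wait(NULL);               // reap the child (no zombie)
    printf("parent: x = %d\n", x);  // still 42
    return 0;
}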
Common Misconceptions
- Misconception: “malloc allocates RAM.” -> Truth: It allocates virtual address space. RAM is only assigned when you touch the pages (Demand Paging).
- Misconception: “Different pointers mean different physical memory.” -> Truth: Different virtual addresses can map to the same physical frame (Aliasing/Shared Memory).
Check-Your-Understanding
- What causes a Segmentation Fault?
- Why is the TLB critical for performance?
- If a page size is 4KB, how many bits are the offset?
Answers
- Hardware raises a fault when translation fails or permissions are violated. The OS sends `SIGSEGV`.
- 12 bits ($2^{12} = 4096$).
Real-World Applications
- Sandboxing: Isolating processes so a crash in one doesn’t kill the OS.
- Memory-Mapped Files: `mmap` allows treating files like arrays (see the sketch below).
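A minimal sketch of that idea: count newlines in a file by mapping it and indexing it like an array (error handling kept terse; assumes a non-empty regular file).
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    if (argc != 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }
    if (st.st_size == 0) return 0;          // mmap of length 0 would fail
    // Map the whole file; the kernel pages bytes in on demand.
    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }
    close(fd);                              // the mapping survives close()
    long nl = 0;
    for (off_t i = 0; i < st.st_size; i++)  // treat the file like an array
        if (data[i] == '\n') nl++;
    printf("%ld lines\n", nl);
    munmap(data, st.st_size);
    return 0;
}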
Key Insight
All pointers are lies. They are handles to a lookup table.
Summary
VM provides Isolation, Efficiency, and Protection by decoupling software view from hardware reality.
Homework
Use pmap <pid> to look at the memory map of a running process. Identify the heap, stack, and shared libraries.
Homework Solution
$ sleep 1000 &
[1] 12345
$ pmap 12345
12345: sleep 1000
0000555555554000 32K r-x-- sleep <- Code (.text)
000055555575b000 4K r---- sleep <- Read-only data
000055555575c000 4K rw--- sleep <- Writable data (.data, .bss)
0000555555580000 132K rw--- [ anon ] <- HEAP
00007ffff7c00000 2048K r-x-- libc.so.6 <- libc code (shared)
00007ffff7e00000 256K r---- libc.so.6 <- libc rodata
00007ffff7e40000 128K rw--- libc.so.6 <- libc writable
...
00007ffffffde000 132K rw--- [ stack ] <- STACK (grows down)
Key observations:
- Heap starts after the program’s data segments (low addresses)
- Stack is at very high addresses (~0x7fffff…)
- Shared libraries (.so) are mapped between heap and stack
- Each region has permissions: r (read), w (write), x (execute)
References
- CS:APP 3e - Chapter 9 (primary source)
- The Linux Programming Interface by Michael Kerrisk - Ch. 48-50 (VM)
- LWN.net: Memory part 3: Virtual Memory
- Linux Kernel Documentation on page tables: https://www.kernel.org/doc/gorman/html/understand/understand006.html
- OSTEP - Chapters 13-23 (free online)
Chapter 7: Exceptional Control Flow (ECF)
Fundamentals
Code does not execute in a vacuum. It must react to the outside world: network packets arriving, disks finishing writes, user pressing Ctrl+C, or division by zero. ECF is the mechanism for abrupt changes in the control flow, handling events at the Hardware, OS, and User levels.
Deep Dive into ECF
- Exceptions (Hardware Level):
- Interrupts: Async, from external hardware (Timer, I/O). Handler returns to next instruction.
- Traps: Sync, intentional (Syscalls). Handler returns to next instruction.
- Faults: Sync, errors (Page Fault). Handler retries current instruction.
- Aborts: Sync, fatal errors (Hardware fail). Process dies.
- Context Switch (OS Level): The OS uses a timer interrupt to slice time. It saves the register state of Process A, restores Process B, and updates the PC. This creates the illusion of multitasking.
- Signals (User Level): The OS notifies a process of an event (e.g., `SIGINT`, `SIGSEGV`). A signal handler is a function that interrupts the main flow. This is user-space concurrency!
  - Async-Signal-Safety: You cannot call `printf` or `malloc` in a handler because they use locks. If the main thread held the lock when interrupted, the handler will deadlock waiting for it.
How this fits on projects
You will master signals in Project 11 and build a shell in Project 12.
Definitions & Key Terms
- Kernel Mode: CPU state allowing privileged instructions (I/O, MMU management).
- Zombie Process: A child that finished but hasn’t been reaped (`wait`) by its parent. It consumes a PID.
- Orphan Process: A child whose parent died. `init` (PID 1) adopts it.
Mental Model Diagram (Signal)
Main Thread              OS Kernel              Signal Handler
    │                        │                        │
    │ <--- Interrupt --------│                        │
(Context Saved)              │----(Modify Stack)----> │
    │                        │                        │ (Runs)
    │                        │ <----(Sigreturn)------ │
(Context Restored)           │                        │
    │
    ▼
How it Works (Syscall)
User calls read(). Wrapper puts syscall ID in %rax. Executes syscall instruction. CPU switches to Kernel Mode, jumps to trap handler. Kernel performs operation. Returns to User Mode.
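You can make the trap visible yourself through libc’s syscall(2) wrapper; a minimal sketch:
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    // What the write() wrapper does for you: put the syscall number in
    // %rax (SYS_write), args in %rdi/%rsi/%rdx, then execute `syscall`.
    syscall(SYS_write, STDOUT_FILENO, "hello via trap\n", 15);
    return 0;
}
Run it under `strace ./a.out` and you will see exactly one `write(1, ...)` crossing the kernel boundary.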
Common Misconceptions
- Misconception: “Signals are instant.” -> Truth: They are delivered when the OS schedules the process.
- Misconception: “Ctrl+C kills the program.” -> Truth: It sends `SIGINT`. The default action is to terminate, but you can catch it.
Check-Your-Understanding
- What happens if you don’t reap child processes?
- Why can’t you use `printf` in a signal handler?
- What is the difference between an Interrupt and a Trap?
Answers
- They become Zombies, leaking OS resources.
- It is not reentrant (uses global locks). Deadlock risk.
- Interrupts are external/async (hardware). Traps are internal/sync (instruction).
Real-World Applications
- Shells: Managing foreground/background jobs (`fg`, `bg`).
- Web Servers: Handling `SIGPIPE` when clients disconnect.
Key Insight
ECF is the glue between your code and the chaotic physical world.
Summary
Exceptions allow the system to react. Processes provide isolation. Signals provide notification.
Homework
Write a program that catches SIGINT, prints “Don’t touch me!”, and continues.
Homework Solution
#include <stdio.h>
#include <signal.h>
#include <unistd.h>

volatile sig_atomic_t got_signal = 0;

void handler(int sig) {
    // Note: write() is async-signal-safe, printf is NOT!
    const char msg[] = "Don't touch me!\n";
    write(STDOUT_FILENO, msg, sizeof(msg) - 1);
    got_signal = 1;
}

int main() {
    struct sigaction sa = {0};
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART; // Restart interrupted syscalls
    sigaction(SIGINT, &sa, NULL);
    printf("Try pressing Ctrl+C...\n");
    while (1) {
        pause(); // Wait for signals
        if (got_signal) {
            printf("Continuing after signal...\n");
            got_signal = 0;
        }
    }
    return 0;
}
Test: Press Ctrl+C multiple times. Program prints message but doesn’t exit.
References
- CS:APP 3e - Chapter 8 (primary source)
- The Linux Programming Interface by Michael Kerrisk - Ch. 20-22 (signals)
- Advanced Programming in the UNIX Environment by Stevens - Ch. 10 (signals)
- signal(7) man page: `man 7 signal`
- List of async-signal-safe functions: signal-safety(7)
Chapter 8: System I/O & Networking
Fundamentals
The Unix Philosophy is “Everything is a File”. A file on disk, a pipe to another process, a network connection to a server in Tokyo—all are accessed via the same interface: read, write, open, close. Understanding this abstraction (and its leaks) is crucial for building robust servers.
Deep Dive into I/O
- File Descriptors (FD): Small integers representing an open file. Standard: 0 (Stdin), 1 (Stdout), 2 (Stderr).
- Robust I/O: `read` and `write` handle streams, not packets.
  - Short Counts: If you ask for 1000 bytes, `read` might return 500. You must loop until you get what you need. This happens constantly in networking.
- Buffering:
  - User Buffering (`FILE*`): `fprintf` writes to a buffer in your heap. It flushes on newline or when the buffer is full.
  - Kernel Buffering: `write` copies data to the Page Cache. The disk write happens later (asynchronously).
- Sockets: The file abstraction for networks.
  - Client: `socket` -> `connect`.
  - Server: `socket` -> `bind` (reserve a port) -> `listen` -> `accept`.
- Multiplexing (`select`/`poll`/`epoll`): How to handle 10,000 connections with one thread. “Wake me up when any of these 1000 FDs are ready.” (See the poll sketch below.)
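A minimal sketch of the multiplexing idea using poll(2) on a single descriptor; epoll scales the same pattern to thousands of FDs:
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    struct pollfd pfd = { .fd = STDIN_FILENO, .events = POLLIN };
    printf("type something...\n");
    // Sleep until stdin is readable or 5 seconds pass; no busy-looping.
    if (poll(&pfd, 1, 5000) > 0 && (pfd.revents & POLLIN)) {
        char buf[256];
        ssize_t n = read(STDIN_FILENO, buf, sizeof buf);
        printf("got %zd bytes\n", n);
    } else {
        printf("timed out or error\n");
    }
    return 0;
}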
How this fits on projects
You will build robust I/O tools in Project 15 and a web server in Project 17 and 20.
Definitions & Key Terms
- Metadata: Information about a file (size, permissions) stored in the Inode.
- Redirect: `ls > file` works via `dup2(fd_file, STDOUT_FILENO)`. The program `ls` just writes to FD 1, unaware it’s a file.
Mental Model Diagram (Echo Server)
Client                            Server
socket, connect ───────────────> socket, bind, listen
      │                          accept (blocks)
      │ <──────────────────────> (Connection Est.)
write(req)  ───────────────────> read(req)
read(resp)  <─────────────────── write(resp)
How it Works (The Short Count)
TCP breaks data into packets. If you send 1MB, it arrives in 1500-byte chunks. read() returns as soon as some data is available. It does not wait for the full 1MB. You must re-call read() in a loop.
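A rio-style sketch of the fix (the helper name read_n is ours, not a libc API): loop until all n bytes arrive, retrying on EINTR and stopping at EOF.
#include <errno.h>
#include <unistd.h>

// Read exactly n bytes, handling short counts.
// Returns bytes actually read (may be < n only at EOF), or -1 on error.
ssize_t read_n(int fd, void *buf, size_t n) {
    size_t left = n;
    char *p = buf;
    while (left > 0) {
        ssize_t r = read(fd, p, left);
        if (r < 0) {
            if (errno == EINTR) continue; // interrupted by a signal: retry
            return -1;                    // real error
        }
        if (r == 0) break;                // EOF: peer closed / file ended
        p += r;
        left -= r;
    }
    return (ssize_t)(n - left);
}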
Common Misconceptions
- Misconception: “Write writes to disk.” -> Truth: It writes to kernel RAM. Power loss = data loss (unless you `fsync`).
- Misconception: “TCP preserves message boundaries.” -> Truth: It is a stream of bytes. You need a protocol (newlines, length headers) to separate messages.
Check-Your-Understanding
- Why does `printf` output not appear immediately?
- What does `accept` return?
- Why is `epoll` better than `select`?
Answers
- Stdout is line-buffered (when attached to a terminal). Use `fflush` or `\n`.
- A new file descriptor for that specific connection. The listening socket stays open.
- `select` is O(N) (scans all FDs). `epoll` is O(1) (you are notified only of active FDs).
Real-World Applications
- Nginx/Node.js: Event-driven I/O using `epoll` for massive scale.
- Databases: Using `O_DIRECT` to bypass the kernel cache for raw disk access.
Key Insight
The network is a file that fails, hangs, and delivers partial data. Coding for the “Happy Path” guarantees failure.
Summary
Robust I/O requires handling partials, errors, and buffering discipline.
Homework
Write a program that uses dup2 to redirect stdout to a file, then executes ls.
Homework Solution
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

int main() {
    // Open output file (create if it doesn't exist, truncate if it does)
    int fd = open("output.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); exit(1); }
    // Redirect stdout (fd 1) to our file
    if (dup2(fd, STDOUT_FILENO) < 0) { perror("dup2"); exit(1); }
    close(fd); // No longer need the original fd
    // exec replaces this process with `ls`
    execlp("ls", "ls", "-la", NULL);
    perror("exec"); // Only reached if exec fails
    return 1;
}
After running: cat output.txt shows the ls -la output. The key insight is that dup2 makes fd 1 (stdout) point to our file, and exec preserves open file descriptors.
References
- CS:APP 3e - Chapters 10-11 (primary source)
- The Linux Programming Interface by Michael Kerrisk - Ch. 5-6 (file I/O)
- Unix Network Programming Vol. 1 by Stevens - Sockets API
- Beej’s Guide to Network Programming: https://beej.us/guide/bgnet/
- Linux man pages: open(2), read(2), write(2), socket(7)
Chapter 9: Concurrency
Fundamentals
Concurrency is the art of doing multiple things at once (or appearing to). It is essential for utilizing multi-core CPUs and handling overlapping I/O. However, it introduces a new class of bugs: Race Conditions and Deadlocks, which are non-deterministic and terrifying to debug.
Deep Dive into Concurrency
- Threads vs Processes:
- Processes: Separate address spaces. Hard to share info (need IPC). Robust (crash doesn’t kill others). High overhead.
- Threads: Shared address space (same heap/globals). Easy to share. Risky (one crash kills all). Low overhead.
- Synchronization Primitives:
- Mutex: Mutual Exclusion. “Only one thread can hold this lock.”
- Semaphore: “Only N threads can pass.”
- Condition Variable: “Sleep until X happens.”
- The Race Condition: When the outcome depends on the timing of threads. `cnt++` is NOT atomic: it is Load -> Add -> Store. If two threads interleave these steps, one increment is lost.
- Deadlock: Thread A holds Lock 1, wants Lock 2. Thread B holds Lock 2, wants Lock 1. Both wait forever.
How this fits on projects
You will build a workbench in Project 16 and a Thread Pool in Project 21.
Definitions & Key Terms
- Atomic Operation: An operation that cannot be interrupted. It happens entirely or not at all.
- Critical Section: Code accessing shared resources that must be protected.
- Thread Safety: A function is thread-safe if it works correctly when called by multiple threads simultaneously.
Mental Model Diagram (Race)
Thread 1: Read(5) ---------> Add(6) -> Store(6)
Thread 2: Read(5) -> Add(6) -> Store(6)
Result: 6 (Should be 7!)
How it Works (Mutex)
Hardware provides atomic instructions like CAS (Compare-And-Swap). A Mutex uses this to atomically check if a flag is 0. If yes, set to 1 (lock). If no, sleep.
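A minimal sketch of that idea using C11 atomics: a toy spinlock, not a replacement for `pthread_mutex_t` (which sleeps via futexes instead of spinning).
#include <stdatomic.h>
#include <sched.h>

typedef struct { atomic_int locked; } spinlock_t;  // 0 = free, 1 = held
// usage: static spinlock_t l = {0}; spin_lock(&l); ... spin_unlock(&l);

void spin_lock(spinlock_t *l) {
    int expected = 0;
    // Atomically: if locked == 0, set it to 1; otherwise retry.
    while (!atomic_compare_exchange_weak(&l->locked, &expected, 1)) {
        expected = 0;     // CAS overwrote it with the observed value
        sched_yield();    // be polite while waiting
    }
}

void spin_unlock(spinlock_t *l) {
    atomic_store(&l->locked, 0);
}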
Common Misconceptions
- Misconception: “Volatile makes variables thread-safe.” -> Truth: No. It prevents compiler caching, but does not provide atomicity or memory barriers.
- Misconception: “I tested it and it didn’t crash, so it’s safe.” -> Truth: Concurrency bugs are “Heisenbugs”. They hide during testing.
Check-Your-Understanding
- What are the 4 conditions for Deadlock?
- Why is a Semaphore initialized to 1 equivalent to a Mutex?
- What is the difference between `join` and `detach`?
Answers
- Mutual Exclusion, Hold & Wait, No Preemption, Circular Wait.
- It allows only 1 thread to enter (Binary Semaphore).
- `join` waits for a thread to finish. `detach` lets it run independently and cleans up automatically.
Real-World Applications
- Game Engines: Physics, Rendering, and AI on separate threads.
- Web Servers: Thread pools to handle requests.
Key Insight
Shared mutable state is the root of all evil. Minimize sharing.
Summary
Concurrency yields performance but demands rigorous discipline (locking protocols).
Homework
Write a program where two threads increment a shared counter 1 million times. Observe the result is < 2 million. Fix it with a mutex.
Homework Solution
#include <stdio.h>
#include <pthread.h>

#define ITERATIONS 1000000

long counter = 0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *increment_unsafe(void *arg) {
    for (int i = 0; i < ITERATIONS; i++) counter++;
    return NULL;
}

void *increment_safe(void *arg) {
    for (int i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main() {
    pthread_t t1, t2;

    // Unsafe version
    counter = 0;
    pthread_create(&t1, NULL, increment_unsafe, NULL);
    pthread_create(&t2, NULL, increment_unsafe, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("Unsafe: %ld (expected 2000000)\n", counter); // Often ~1.5M!

    // Safe version
    counter = 0;
    pthread_create(&t1, NULL, increment_safe, NULL);
    pthread_create(&t2, NULL, increment_safe, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("Safe: %ld (expected 2000000)\n", counter); // Always 2000000

    return 0;
}
Compile: gcc -pthread race.c -o race
The unsafe version loses increments because counter++ is actually 3 instructions: load, add, store.
References
- CS:APP 3e - Chapter 12 (primary source)
- OSTEP - Chapters 26-33 (concurrency, free online)
- The Linux Programming Interface by Michael Kerrisk - Ch. 29-33 (pthreads)
- C++ Concurrency in Action by Anthony Williams - Modern threading concepts
- pthread(7) man page and POSIX Threads documentation
Glossary (High-Signal)
| Term | Meaning |
|---|---|
| ABI | The binary contract for calls, registers, alignment, and data layout. |
| ASLR | Randomizes address space layout to make exploitation harder. |
| Async-signal-safe | Functions that are safe to call inside a signal handler (most are not). |
| Cache line | The unit of transfer between cache and memory (locality “lives” here). |
| Copy-on-write (COW) | fork() shares pages until a write triggers a private copy. |
| ELF | Linux executable/object format (headers, sections, segments, symbols). |
| GOT/PLT | Indirection tables used for dynamic linking and late binding. |
| Heisenbug | A bug that disappears when you observe it (timing-sensitive). |
| Interposition | Overriding symbols at link/load time (e.g., LD_PRELOAD). |
| MMU | Hardware that translates virtual → physical addresses using page tables. |
| Page fault | CPU trap when a page is unmapped/forbidden; drives demand paging. |
| Race condition | Behavior depends on timing/interleavings (often a data race). |
| Relocation | Patching references to final addresses during linking/loading. |
| Short count | read/write returns fewer bytes than requested; you must loop. |
| Signal | Async notification delivered to a process (interrupt/exception-like). |
| Syscall | User → kernel boundary (read, write, mmap, fork, …). |
| TLB | Cache of page table translations (VM performance hinges on it). |
| Undefined behavior (UB) | C behaviors the standard doesn’t define; compilers may “assume it never happens”. |
| Zombie | A dead child process not yet waited on (exit status retained). |
Concept Summary Table
| Concept | Key Questions | Danger Signs |
|---|---|---|
| Translation | What does each stage produce? | “Compiled but crashes” |
| Data Rep | Why -1 > 0U? | Silent corruption |
| Machine Code | How does stack grow? | Cannot debug crashes |
| Architecture | What hazard is this? | Random performance |
| Memory | Cache hit rate? | 100x slowdown |
| ECF/Processes | What is a zombie? | Hangs, orphans |
| I/O/Networking | Short count? | Data corruption |
| Concurrency | Race condition? | Heisenbugs |
Deep Dive Reading By Concept
| Concept | CS:APP | Supplementary |
|---|---|---|
| Translation | Ch. 1, 7 | Practical Binary Analysis, Low-Level Programming |
| Data Rep | Ch. 2 | Write Great Code Vol.1, Effective C |
| Machine Code | Ch. 3 | Hacking: Art of Exploitation |
| Architecture | Ch. 4-5 | Computer Organization and Design |
| Memory | Ch. 6, 9 | OSTEP |
| ECF/Processes | Ch. 8 | The Linux Programming Interface |
| I/O/Networking | Ch. 10-11 | Unix Network Programming |
| Concurrency | Ch. 12 | OSTEP, TLPI |
Essential Book List: CS:APP, C Programming: A Modern Approach (King), Effective C (Seacord), OSTEP (free online), The Linux Programming Interface (Kerrisk), Low-Level Programming (Zhirkov)
Project-to-Concept Map
This table shows which Theory Primer chapters each project draws on, so you can verify you’ve read the prerequisites before starting.
| Project | Primary Concepts | CS:APP Chapters |
|---|---|---|
| P1: Build Pipeline Explorer | Translation, Linking, ELF | Ch. 1, 7 |
| P2: Bitwise Data Inspector | Data Representation, Encoding | Ch. 2 |
| P3: Data Lab Clone | Bit Operations, Integer/Float | Ch. 2 |
| P4: Calling Convention Crash Cart | Machine Code, Stack Frames | Ch. 3 |
| P5: Bomb Lab Workflow | Machine Code, Control Flow | Ch. 3 |
| P6: Attack Lab Workflow | Machine Code, Security | Ch. 3 |
| P7: Y86-64 CPU Simulator | Processor Architecture | Ch. 4-5 |
| P8: Performance Clinic | Optimization, Pipelining | Ch. 5 |
| P9: Cache Lab Simulator | Memory Hierarchy, Locality | Ch. 6 |
| P10: ELF Link Map Toolkit | Translation, Dynamic Linking | Ch. 7 |
| P11: Signals/Processes Sandbox | ECF, Signals | Ch. 8 |
| P12: Unix Shell with Job Control | ECF, Process Control | Ch. 8 |
| P13: VM Map Visualizer | Virtual Memory | Ch. 9 |
| P14: Build Your Own Malloc | Virtual Memory, Heap | Ch. 9 |
| P15: Robust Unix I/O Toolkit | System I/O | Ch. 10 |
| P16: Concurrency Workbench | Concurrency, Threads | Ch. 12 |
| P17: Capstone Proxy Server | All Concepts | Ch. 10-12 |
| P18-P26: Extensions | Deep Dives | Various |
Quick Start: Your First 48 Hours
If you’re eager to start building immediately, follow this accelerated path:
Day 1: Orientation (4-6 hours)
- Skim the Big Picture diagram and Introduction (15 min)
- Read Theory Primer Chapter 1: Translation (45 min)
- Start Project 1: Build Pipeline Explorer
- Get the first output: `./inspect-build hello.c` showing preprocessor output
Day 2: First Complete Project (4-6 hours)
- Complete Project 1 with all stages visible
- Validate against the Definition of Done checklist
- Read Theory Primer Chapter 2: Data Representation
- Start Project 2: Bitwise Data Inspector—get it to show binary representation of integers
What You’ll Have After 48 Hours:
- A working understanding of the GCC compilation pipeline
- A tool that decodes how integers are stored in memory
- Confidence that you can tackle the remaining projects
Common Day-1 Blockers:
- “GCC flags are confusing” → Use `gcc -v` to see exactly what commands run
- “I don’t understand ELF” → Just run `readelf -h a.out` and read the output
- “Where does my program actually start?” → Hint: not `main`. It’s `_start`.
Recommended Learning Paths
Different backgrounds benefit from different project orderings:
Path 1: The CS Student (Following CS:APP Syllabus)
If you’re taking a course based on CS:APP or studying independently with the textbook:
P1 → P2 → P3 → P4 → P5 → P6 → P7 → P8 → P9 → P10 → P11 → P12 → P13 → P14 → P15 → P16 → P17
This mirrors the book’s chapter order and lab sequence.
Path 2: The Security Researcher
If your goal is vulnerability research, exploit development, or malware analysis:
P2 → P4 → P5 → P6 → P10 → P13 → P25 (Debugger) → P24 (Leak Detector)
Focus on memory layout, reverse engineering, and attack surfaces.
Path 3: The Performance Engineer
If you’re optimizing code for games, trading systems, or infrastructure:
P1 → P2 → P7 → P8 → P9 → P14 (Malloc) → P16 → P21 (Thread Pool) → P17
Emphasize architecture, caching, and concurrency.
Path 4: The Systems Developer
If you’re building operating systems, compilers, or low-level infrastructure:
P1 → P4 → P10 → P11 → P12 → P13 → P14 → P18 (Linker) → P19 (VM Sim) → P26 (OS Kernel)
Deep dive into OS internals and tooling.
Path 5: Weekend Warrior (Time-Limited)
If you have limited time and want maximum impact:
P1 → P2 → P4 → P9 → P14 → P17
6 projects covering the essential concepts. Complete in 2-3 months.
Success Metrics
How do you know you’ve internalized these concepts? Use these checkpoints:
Beginner Milestones (After Projects 1-6)
- You can explain what happens during each compilation stage
- You can predict integer overflow behavior in C
- You can read x86-64 assembly and map it to C
- You can debug a crash using only GDB and register values
- You understand why buffer overflows are exploitable
Intermediate Milestones (After Projects 7-12)
- You can explain pipeline hazards and their mitigations
- You can predict cache miss rates for different access patterns
- You can implement a job-control shell with proper signal handling
- You know the difference between `fork`, `exec`, and `clone`
- You can trace a system call from user space to kernel
Advanced Milestones (After Projects 13-17)
- You can explain how virtual memory enables process isolation
- You can implement a memory allocator with coalescing
- You can write robust I/O code that handles partial reads/writes
- You can debug race conditions using systematic techniques
- You can build a concurrent server that doesn’t deadlock
Mastery Indicators
- You read a crash dump and immediately suspect the root cause
- You choose data structures based on cache locality, not just Big-O
- You can explain why certain “obvious” optimizations don’t work
- When code is slow, your first question is “what’s the memory access pattern?”
Project Overview Table
| # | Project Name | Difficulty | Time | Key Concepts |
|---|---|---|---|---|
| 1 | Build Pipeline Explorer | Level 1 | 4-8h | Translation, ELF, Symbols |
| 2 | Bitwise Data Inspector | Level 1 | 4-8h | Encoding, Endianness |
| 3 | Data Lab Clone | Level 2 | 10-15h | Bit Ops, IEEE 754 |
| 4 | Calling Convention Crash Cart | Level 2 | 10-15h | Stack, Registers, ABI |
| 5 | Bomb Lab Workflow | Level 2 | 15-20h | Reverse Engineering |
| 6 | Attack Lab Workflow | Level 3 | 20-30h | Exploits, ROP |
| 7 | Y86-64 CPU Simulator | Level 3 | 30-40h | Pipelining, Hazards |
| 8 | Performance Clinic | Level 2 | 10-15h | Optimization, ILP |
| 9 | Cache Lab Simulator | Level 2 | 15-20h | Cache, Locality |
| 10 | ELF Link Map Toolkit | Level 2 | 10-15h | Dynamic Linking, GOT/PLT |
| 11 | Signals/Processes Sandbox | Level 2 | 10-15h | Signals, fork/exec |
| 12 | Unix Shell with Job Control | Level 3 | 25-35h | Process Groups, TTY |
| 13 | VM Map Visualizer | Level 2 | 10-15h | Virtual Memory, mmap |
| 14 | Build Your Own Malloc | Level 3 | 30-40h | Heap, Fragmentation |
| 15 | Robust Unix I/O Toolkit | Level 2 | 10-15h | Buffering, Short Counts |
| 16 | Concurrency Workbench | Level 3 | 20-30h | Threads, Sync |
| 17 | Capstone Proxy Server | Level 4 | 40-60h | All Concepts |
| 18 | ELF Linker/Loader | Level 4 | 40-60h | Deep Linking |
| 19 | VM Simulator | Level 3 | 25-35h | Page Tables, TLB |
| 20 | HTTP Web Server | Level 3 | 25-35h | Sockets, CGI |
| 21 | Thread Pool | Level 3 | 20-25h | Producer-Consumer |
| 22 | Signal-Safe Printf | Level 3 | 15-20h | Reentrancy |
| 23 | Performance Profiler | Level 3 | 20-25h | Sampling, Signals |
| 24 | Memory Leak Detector | Level 3 | 20-30h | Interposition |
| 25 | ptrace Debugger | Level 4 | 35-45h | Breakpoints, Tracing |
| 26 | OS Kernel Capstone | Level 5 | 60-100h | Bare Metal |
Project List
Projects P1–P17 are the core CS:APP sprint. Projects P18–P26 are optional extensions for going beyond the book.
Phase 1: Foundation (Start Here)
Project 1: “Hello, Toolchain” — Build Pipeline Explorer
Real World Outcome
You will build a tool inspect-build that reveals the hidden artifacts of compilation.
$ ./inspect-build hello.c
[PREPROCESSOR] Generated hello.i (312 lines). Macros expanded.
[COMPILER] Generated hello.s (Assembly). 23 instructions.
-> Found call to 'puts' (optimization of printf)
[ASSEMBLER] Generated hello.o (Relocatable Object).
-> Symbols: main (Global, Func), puts (Undefined)
[LINKER] Generated hello (Executable).
-> Entry point: 0x401060
-> Linked with: /lib/x86_64-linux-gnu/libc.so.6
The Core Question You’re Answering
What physically happens to my code between hitting “save” and running the program?
Concepts You Must Understand First
- The Translation Pipeline (Primer Ch 1) - `cpp`, `cc1`, `as`, `ld`.
- ELF Sections (Primer Ch 1) - `.text` (code), `.data` (globals), `.rodata` (constants).
- Symbols (Primer Ch 1) - How names map to addresses.
Questions to Guide Your Design
- How can you stop `gcc` after preprocessing? After compilation? (Hint: flags).
- How do you parse the output of `readelf` or `nm` to extract symbol counts?
- What is the difference between a “defined” symbol in a `.o` file and an “undefined” one?
Thinking Exercise
Run gcc -v hello.c. Look at the output. Can you identify the exact programs (cc1, as, collect2/ld) being called? Why are there so many flags?
The Interview Questions They’ll Ask
- “What is the difference between static and dynamic linking?”
- “What is a symbol table? Does the final executable need it to run?”
- “Why does a C program need a `main` function? Does the binary really start there?” (Hint: `_start`).
Hints in Layers
- Use `gcc -E`, `-S`, `-c` to stop at different stages.
- Use `readelf -s` to see symbols.
- Use `ldd` to see dynamic dependencies.
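If you want to script the stages rather than type them, here is a minimal driver sketch in C. The hard-coded `hello.c`, the fixed command strings, and the bare `system()` calls are illustrative stand-ins for real argument parsing and output scraping:

```c
/* A minimal stage driver for inspect-build (a sketch, not the finished
 * tool). Each gcc flag stops the pipeline one stage later. */
#include <stdio.h>
#include <stdlib.h>

static void run(const char *cmd) {
    printf("+ %s\n", cmd);
    if (system(cmd) != 0) {          /* any stage failing aborts the tool */
        fprintf(stderr, "stage failed: %s\n", cmd);
        exit(1);
    }
}

int main(void) {
    run("gcc -E hello.c -o hello.i");     /* preprocess only */
    run("gcc -S hello.i -o hello.s");     /* compile to assembly */
    run("gcc -c hello.s -o hello.o");     /* assemble to relocatable object */
    run("gcc hello.o -o hello");          /* link the executable */
    run("readelf -s hello.o | grep UND"); /* undefined symbols awaiting link */
    run("ldd hello");                     /* dynamic dependencies of the result */
    return 0;
}
```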
Common Pitfalls & Debugging
Problem 1: “gcc -E produces huge output”
- Why: Headers like `<stdio.h>` include thousands of lines of library declarations
- Fix: Focus on the lines at the end (your code) or count total lines for comparison
- Quick test: `gcc -E hello.c | tail -50` shows just your code
Problem 2: “readelf shows no symbols in executable”
- Why: Release builds often strip symbols for size. Debug builds include them.
- Fix: Compile with `gcc -g` to include debug symbols, or use `objdump -t` instead
- Quick test: `nm a.out` vs `nm -D a.out` (dynamic symbols survive stripping)
Problem 3: “collect2 error: ld returned 1 exit status”
- Why: This is a linker error—usually undefined reference or multiple definition
- Fix: Look for “undefined reference to `foo`” in the error message. Check if `foo` is defined in linked libraries
- Quick test: `nm *.o | grep foo` to find where symbols are defined
Problem 4: “My tool works for hello.c but crashes on other files”
- Why: Edge cases: files with no symbols, assembly errors, or missing includes
- Fix: Add error handling for each gcc stage; check return codes
- Quick test: Test with an empty file, a file with only comments, a file with syntax errors
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| GCC Internals | An Introduction to GCC by Gough | Ch. 1-3 |
| ELF Format | Practical Binary Analysis by Andriesse | Ch. 2 |
| Linking Deep Dive | Linkers and Loaders by Levine | Ch. 3-4 |
Definition of Done
- Tool runs on any simple C file.
- Reports size/lines of Preprocessed, Assembly, and Object files.
- Lists undefined symbols in the Object file.
- Identifies the dynamic linker used by the final executable.
Project 2: Bitwise Data Inspector
Real World Outcome
A CLI tool bit-view that acts as a CT scan for data.
$ ./bit-view 10
Signed: 10
Unsigned: 10
Hex: 0x0000000A
Binary: 00000000 00000000 00000000 00001010
Float: 1.401298e-44 (Denormalized)
The Core Question You’re Answering
How are the same 32 bits interpreted differently depending on type context?
Concepts You Must Understand First
- Two’s Complement (Primer Ch 2) - How negative numbers are encoded.
- IEEE 754 Floating Point (Primer Ch 2) - Sign, exponent, mantissa structure.
- Endianness (Primer Ch 2) - Byte ordering in multi-byte values.
- Book Reference: “Write Great Code Vol. 1” by Randall Hyde - Ch. 4-5
Questions to Guide Your Design
- How do you extract individual bits from an integer? (Hint: bitwise AND with mask)
- How do you detect if a float is denormalized, NaN, or infinity?
- How do you show byte order without platform-specific code?
Thinking Exercise
Take the integer -1. Write out its 32-bit representation in binary. Now interpret those same bits as an unsigned integer. What value do you get? What about as a float?
The Interview Questions They’ll Ask
- “Why is signed integer overflow undefined behavior in C?”
- “Why does 0.1 + 0.2 != 0.3 in floating point?”
- “How can you tell if a system is big-endian or little-endian?”
Hints in Layers
- Use `union { int i; float f; }` to reinterpret bits without conversion.
- For binary output, use `(value >> bit) & 1` in a loop from bit 31 to 0.
- To decode IEEE 754: Sign = bit 31, Exponent = bits 30-23 (subtract bias 127), Mantissa = bits 22-0.
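Putting those three hints together, here is a minimal sketch of `bit-view`’s core. It assumes a 32-bit `int`/`float` and a decimal `argv[1]`; hex parsing and the NaN/denormal classification are left for you:

```c
/* A minimal sketch of bit-view's core trick: one union, three views
 * of the same 32 bits. */
#include <stdio.h>
#include <stdlib.h>

union bits32 { int i; unsigned u; float f; };

int main(int argc, char **argv) {
    union bits32 v;
    v.i = (argc > 1) ? atoi(argv[1]) : 10;

    printf("Signed:   %d\n", v.i);
    printf("Unsigned: %u\n", v.u);
    printf("Hex:      0x%08X\n", v.u);

    printf("Binary:   ");
    for (int bit = 31; bit >= 0; bit--) {      /* MSB first, per Hint 2 */
        putchar(((v.u >> bit) & 1) ? '1' : '0');
        if (bit % 8 == 0 && bit) putchar(' ');
    }
    putchar('\n');

    /* IEEE 754 single: sign(1) | exponent(8, bias 127) | mantissa(23) */
    unsigned sign = v.u >> 31, exp = (v.u >> 23) & 0xFF, man = v.u & 0x7FFFFF;
    printf("Float:    %g (sign=%u exp=%u mantissa=0x%06X)\n",
           (double)v.f, sign, exp, man);
    return 0;
}
```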
Common Pitfalls & Debugging
Problem 1: “My binary output is backwards”
- Why: You’re printing LSB first instead of MSB first
- Fix: Loop from bit (width-1) down to 0
- Quick test: Input `1` should show `...0001` at the end
Problem 2: “Float interpretation shows garbage”
- Why: You’re casting instead of reinterpreting bits
- Fix: Use `union` or `memcpy`, not `(float)intval`
- Quick test: Input `0x3F800000` should decode as float `1.0`
Problem 3: “Negative numbers look wrong”
- Why: Sign extension when casting between sizes
- Fix: Use unsigned types consistently when doing bit manipulation
- Quick test: Input `-1` should show all 1 bits
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Bit Manipulation | Hacker’s Delight by Warren | Ch. 1-3 |
| Float Internals | Write Great Code Vol. 1 by Hyde | Ch. 4 |
| Data Types | Effective C by Seacord | Ch. 5 |
Definition of Done
- Correctly displays binary for positive/negative ints.
- Decodes IEEE 754 float components (Sign, Exponent, Mantissa).
- Visualizes Endianness (byte order).
- Handles edge cases: 0, -1, INT_MIN, NaN, infinity.
Project 3: Data Lab Clone
Real World Outcome
You will build a grading framework that enforces restricted operator sets and validates bitwise solutions.
$ ./dlc bits.c
bits.c:15: absVal (maxOps: 15, curOps: 14) Valid
bits.c:25: addOK (maxOps: 30, curOps: 28) Valid
bits.c:40: bang (maxOps: 12, curOps: 5) Valid
bits.c:52: bitCount (maxOps: 40, curOps: 38) Valid
$ ./test-bits
Testing: absVal
Test [-1] got 1 (expected 1) PASS
Test [0] got 0 (expected 0) PASS
Test [2147483647] got 2147483647 (expected 2147483647) PASS
Test [-2147483648] ERROR: overflow expected
All functions: 47/50 tests passed (94%)
The Core Question You’re Answering
How can you implement standard operations (absolute value, min, max, sign detection) using only bitwise operators—no arithmetic operators, no conditionals?
Concepts You Must Understand First
- Two’s Complement Representation (Primer Ch 2)
- How negative numbers are encoded as bit patterns
- Why `-x == ~x + 1`
- Book Reference: “Write Great Code Vol. 1” by Randall Hyde - Ch. 4
- Bitwise Operators and Masking (Primer Ch 2)
- `&`, `|`, `^`, `~`, `<<`, `>>`
- How to extract/set individual bits
- Sign extension in right shifts
- Book Reference: “Hacker’s Delight” by Warren - Ch. 1-2
- Integer Overflow & Underflow (Primer Ch 2)
- When addition/subtraction overflow
- How to detect overflow without arithmetic
- Book Reference: “Effective C” by Seacord - Ch. 5
Questions to Guide Your Design
- How do you detect if a number is negative using only bitwise ops? (Hint: MSB)
- How do you compute `-x` using only bitwise ops?
- How do you check if `x + y` will overflow without computing `x + y`?
- Can you implement `min(x, y)` without using conditional operators?
- How would you count the number of 1-bits in an integer?
Thinking Exercise
The Absolute Value Problem
Take the integer -5 (in two’s complement: 11111111111111111111111111111011). Your goal is to compute its absolute value (5) using only:
- Bitwise ops: `&`, `|`, `^`, `~`, `<<`, `>>`
- No arithmetic operators (no `+`, `-`, `*`, `/`)
- No conditionals (no `if`, `?:`)
Questions to guide your thinking:
- How do you detect that this number is negative?
- Once you know it’s negative, how do you flip it to positive?
- What about the special case `INT_MIN`? (Can’t negate `-2147483648` in 32 bits!)
The Interview Questions They’ll Ask
- “How would you implement absolute value using only bitwise operations?”
- “How can you detect integer overflow in addition without performing the addition?”
- “Write code to count the number of set bits (1s) in an integer. Can you do it in O(# of set bits) time?”
- “Given two integers, how do you swap their values without using a temporary variable or arithmetic?”
- “How would you check if a number is a power of 2 using only bitwise operations?”
- “Explain the difference between arithmetic and logical right shift. Why does it matter?”
Hints in Layers
Hint 1: Start with Sign Detection
You need to know if a number is negative. The MSB (most significant bit) tells you this in two’s complement. How do you extract it?
Hint 2: Implement Conditional Logic with Bitwise Ops
If you can’t use if, you need a different way to choose between two values. Think: “If bit X is set, use value A, otherwise use value B.” This is a mask operation.
Hint 3: Use Masks to Implement Conditionals
// Pseudocode pattern:
int mask = (condition) ? -1 : 0; // -1 is all 1s, 0 is all 0s
int result = (a & mask) | (b & ~mask); // Pick a or b based on mask
But wait—you can’t use ?: either! So create the mask using bitwise ops.
Hint 4: Practice Decomposition
Break complex functions into smaller bitwise primitives:
- `sign(x)`: Extract the sign bit
- `isZero(x)`: Check if all bits are 0
- `isNegative(x)`: Check if MSB is 1
- Then combine these to solve harder problems
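As one concrete instance of the Hint 3 mask pattern, the sketch below selects between two values with only bitwise operators. It assumes arithmetic right shift on signed `int` (what gcc/clang do on x86-64; the C standard leaves it implementation-defined):

```c
/* Branchless selection, bitwise-only: no +, -, if, or ?: anywhere. */
#include <stdio.h>

/* All 1s if x is negative, all 0s otherwise (arithmetic shift assumed). */
int sign_mask(int x) { return x >> 31; }

/* Yield a when mask is all 1s, b when mask is all 0s. */
int choose(int mask, int a, int b) { return (a & mask) | (b & ~mask); }

int main(void) {
    int x = -5, y = 7;
    int m = sign_mask(x);                 /* x < 0, so m = all 1s */
    /* "If x is negative, pick x, else pick y" -- with no conditional: */
    printf("%d\n", choose(m, x, y));      /* prints -5 */
    return 0;
}
```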
Common Pitfalls & Debugging
Problem 1: “I used arithmetic operators by accident”
- Why: The assignment seems to allow them, but the grader catches them
- Fix: Audit your code. Replace every `+`, `-`, `*` with bitwise equivalents
- Quick test: Run `./dlc bits.c` to check operator counts
Problem 2: “My solution works for small numbers but fails for INT_MIN”
- Why: Two’s complement has an asymmetry: `-2147483648` cannot be negated in 32 bits
- Fix: Handle the special case explicitly or accept the limitation
- Quick test: Test with `INT_MIN` and `INT_MAX`
Problem 3: “Bit count is slow for large numbers”
- Why: You’re checking every bit instead of just the set bits
- Fix: Use Brian Kernighan’s trick: `x & (x-1)` clears the rightmost set bit
- Quick test: Compare performance on `0xFFFFFFFF` vs `0x00000001`
Problem 4: “I’m exceeding the operator budget”
- Why: Your implementation is inefficient (too many steps)
- Fix: Look for shorter operator combinations: a single `~x` or `x & y` can replace a longer shift-and-mask sequence
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Bitwise Tricks | Hacker’s Delight by Warren | Ch. 1-3, 5 |
| Bit Manipulation | Competitive Programming by Halim & Halim | Ch. 2.2 |
| Integer Math | Write Great Code Vol. 1 by Hyde | Ch. 4 |
| Two’s Complement | CS:APP 3e by Bryant & O’Hallaron | Ch. 2 |
Definition of Done
- All required functions implemented with correct results
- No arithmetic operators (`+`, `-`, `*`, `/`) used
- No conditional operators (`if`, `?:`) used
- Operator counts below specified limits for each function
- Edge cases (0, INT_MIN, INT_MAX, -1) handled correctly
- `./test-bits` passes all provided test cases
- `./dlc bits.c` shows “Valid” for all functions
Project 4: x86-64 Calling Convention Crash Cart
Real World Outcome
You will write C programs that intentionally crash in predictable ways, producing crash dumps that teach register/stack analysis.
$ gcc -g caller.c -o caller && ./caller
Segmentation fault (core dumped)
$ gdb ./caller core
(gdb) bt
#0 0x0000555555554123 in leaf () from ./caller
#1 0x0000555555554145 in caller () from ./caller
(gdb) info registers
rax 0x0 0
rbx 0x555555554000 93824992235520
rcx 0x7ffff7dd6aa4 140737351862948
rdx 0x0 0
rsi 0x1 1
rdi 0x2 2
rbp 0x7ffffffde9e0 0x7ffffffde9e0
rsp 0x7ffffffde9c0 0x7ffffffde9c0
(gdb) x/20x $rsp
0x7ffffffde9c0: 0x55554145 0x00005555 0xdeadbeef 0xdeadbeef
0x7ffffffde9d0: 0xdeadbeef 0xdeadbeef 0x7fffde9e0 0x00007fff
0x7ffffffde9e0: 0x55554167 0x00005555 0x00000000 0x00000000
# Stack frame structure visible! Return address at 0x7ffffffde9c8
The Core Question You’re Answering
How do I read a crash dump (registers, stack memory, return addresses) to understand what my program was doing when it died?
Concepts You Must Understand First
- x86-64 Register File and Calling Conventions (Primer Ch 3)
- The 16 general-purpose registers
- Which registers hold function arguments (`%rdi`, `%rsi`, `%rdx`, `%rcx`, `%r8`, `%r9`)
- Which must be preserved by the callee (`%rbx`, `%rbp`, `%r12`-`%r15`)
- Stack Frames and Memory Layout (Primer Ch 3)
- How `push`/`pop` work
- Stack grows downward (high → low addresses)
- Frame pointer (`%rbp`) and stack pointer (`%rsp`)
- Return address location
- Book Reference: “Hacking: The Art of Exploitation” by Erickson - Ch. 2-3
- GDB and Crash Dump Analysis (Primer Ch 3)
- `info registers`: See CPU state
- `x/Nx addr`: Examine memory
- `bt` (backtrace): Call stack
- `disassemble`: Read assembly from memory
- Book Reference: “The Art of Debugging with GDB, DDD, and Eclipse” by Matloff & Salzman
Questions to Guide Your Design
- Where is the return address stored relative to the current stack pointer?
- What does the stack look like immediately after a function prologue (`push %rbp; mov %rsp, %rbp`)?
- Given a crash address, how would you determine which function was executing?
- What would a buffer overflow look like in the stack frame?
Thinking Exercise
Analyzing a Crash Dump
You crash in function leaf(), called from middle(), called from main(). You see:
$rsp = 0x7ffd0000
$rbp = 0x7ffd0010
Memory at $rsp: [0x555540ab] (return address)
Questions to answer:
- Where does `leaf()` store its saved `%rbp`?
- How big is `leaf()`’s stack frame?
- If you inspect `0x7ffd0018`, what would you expect to find?
- How would you find the return address for the call to `middle()`?
The Interview Questions They’ll Ask
- “Describe the x86-64 calling convention. Which registers hold function arguments?”
- “What is a stack frame? Draw one on a whiteboard.”
- “When a function crashes, how do you use `gdb` to see what arguments it received?”
- “What is the difference between `%rsp` and `%rbp`? When would you use each?”
- “Explain how a return address gets onto the stack and how the CPU uses it to return from a function.”
Hints in Layers
Hint 1: Study the ABI Document
The System V AMD64 ABI defines the calling convention. Read the first few pages. Key facts:
- First 6 integer args: `%rdi`, `%rsi`, `%rdx`, `%rcx`, `%r8`, `%r9`
- Return value: `%rax`
- Callee-saved: `%rbx`, `%rbp`, `%r12`-`%r15`
Hint 2: Write Simple Nested Calls
Create main() → func_a() → func_b(). Compile with -g. Run in GDB, break at each level, and inspect $rsp, $rbp, and the stack memory.
Hint 3: Match Assembly to C
Compile a simple function with -S. Count:
- How many bytes the function allocates (`sub $X, %rsp`)
- Which registers it saves
- Where local variables live relative to `%rsp` or `%rbp`
Hint 4: Create Intentional Crashes
Write functions with:
- Buffer overflows (overflow into return address)
- Null pointer dereferences
- Use-after-free (corrupt a pointer)
For each, predict what the crash dump will show before running it. Verify in GDB.
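One intentional crash of this kind, as a sketch (the file name and build flags are suggestions):

```c
/* A write through NULL inside a leaf call, so the core dump shows
 * main -> caller -> leaf. Build with `gcc -g crash.c -o crash`, enable
 * cores with `ulimit -c unlimited`, then compare GDB's registers and
 * stack memory against your prediction. */
#include <stddef.h>

void leaf(int *p) { *p = 42; }     /* crashes here: write through NULL */

void caller(void) { leaf(NULL); }  /* pushes a return address for leaf */

int main(void) {
    caller();
    return 0;
}
```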
Common Pitfalls & Debugging
Problem 1: “I don’t understand which registers are which”
- Why: x86-64 has 16 registers, each with multiple names (e.g., `%rax` vs `%eax` vs `%ax`)
- Fix: Create a reference card: RAX (return), RDI (arg1), RSI (arg2), etc.
- Quick test: `gcc -S simple.c` and trace a function call by hand
Problem 2: “The stack grows in the ‘wrong’ direction”
- Why: Stack grows downward (high → low addresses), which is counterintuitive
- Fix: Visualize it: `push` decrements `%rsp`, `pop` increments it
- Quick test: Print `$rsp` before and after a `push` instruction in GDB
Problem 3: “I can’t locate local variables”
- Why: They’re accessed relative to `%rsp` or `%rbp`, with negative offsets
- Fix: Use GDB: `p &variable` shows the address; `info locals` lists all locals
- Quick test: Compile with `-g` and inspect a simple function
Problem 4: “Crash dumps show weird addresses”
- Why: ASLR (Address Space Layout Randomization) randomizes addresses each run
- Fix: Disable ASLR for reproducible testing: `setarch $(uname -m) -R ./program`
- Quick test: Run the same program twice, compare crash addresses
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| x86-64 ABI | System V AMD64 ABI | Figures 3.4-3.5 (registers, stack layout) |
| Calling Convention | CS:APP 3e by Bryant & O’Hallaron | Ch. 3, Sec. 3.7 |
| Debugging | The Art of Debugging by Matloff & Salzman | Ch. 1-4 |
| Assembly Intro | Hacking: The Art of Exploitation by Erickson | Ch. 2 |
| GDB Guide | GDB Debugging Guide (official docs) | Ch. 1-8 |
Definition of Done
- Write at least 3 C programs demonstrating different call patterns
- Each crashes with a predictable register/stack state
- Use GDB to analyze each crash: inspect `$rsp`, `$rbp`, return address
- Draw a stack frame diagram for each crash
- Verify callee-saved register preservation
- Document where each local variable lives relative to `%rsp` or `%rbp`
Project 5: Bomb Lab Workflow
Real World Outcome
You will reverse-engineer a compiled binary (“bomb”) and defuse it by discovering the correct inputs to each phase.
$ ./bomb
Welcome to my little bomb.
Phase 1: 1 6 2 4 8 1 6
Phase 1 defused. How about the next one?
Phase 2: 2 10 20 30 40 50
That's number 2. Keep going!
Phase 3: 1 311 3
Halfway there!
Phase 4: 7 0
Phase 4 defused!
Phase 5: ionefg
Congratulations! You've defused the bomb!
The Core Question You’re Answering
Given only a compiled binary (no source code), how can you reverse-engineer it to understand what input will satisfy each phase?
Concepts You Must Understand First
- Reading and Understanding x86-64 Assembly (Primer Ch 3)
- Instruction format and common patterns
- Control flow (`cmp`, `je`, `jne`, jumps)
- Loops and recursion
- Book Reference: “CS:APP 3e” - Ch. 3, Sec. 3.4-3.8
- Using GDB for Interactive Debugging (Primer Ch 3)
- Setting breakpoints
- Stepping through code
- Inspecting registers and memory
- Disassembling functions
- Book Reference: “The Art of Debugging” - Ch. 2-4
- Binary Tools and Analysis (Primer Ch 3)
- `objdump -d`: Disassemble binary
- `strings`: Find embedded strings
- `readelf -s`: Extract symbol table
- Using GDB’s `disassemble` and `layout asm`
Questions to Guide Your Design
- How would you identify the entry point of each phase?
- What pattern would a string comparison look like in assembly?
- How would you recognize a loop in assembly and determine its bounds?
- What assembly patterns indicate arithmetic operations (add, multiply)?
- How would you use breakpoints to inspect values at specific points?
Thinking Exercise
Reverse Engineering a Simple Phase
You disassemble Phase 1 and see:
mov %edi,%edx # edx = arg1
mov $0x1,%ecx # ecx = 1
cmp %ecx,%edx # compare edx to ecx
jne 0x401... # jump if not equal
mov $0x2,%ecx # ecx = 2
cmp %ecx,%edx # compare again
...
Questions:
- What values does `edx` get compared against?
- What would happen if you input “1”?
- What if you input “2”?
- Is there a specific sequence required?
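To make that control flow concrete, here is one plausible C reconstruction of such a phase. This is hypothetical: real bomb phases differ, and the jump target in the listing above is truncated, so treat it as a reading exercise rather than the answer:

```c
/* Hypothetical C equivalent of the disassembly above: %edx holds the
 * current input value, %ecx the expected constant; any mismatch explodes. */
#include <stdio.h>
#include <stdlib.h>

void explode_bomb(void) { puts("BOOM!"); exit(1); }

void phase(const int *input, int n) {
    for (int i = 0; i < n; i++)      /* ecx = 1, 2, ... */
        if (input[i] != i + 1)       /* cmp %ecx,%edx ; jne -> explode_bomb */
            explode_bomb();
}

int main(void) {
    int guess[] = {1, 2};
    phase(guess, 2);                 /* only the exact sequence survives */
    puts("Phase defused!");
    return 0;
}
```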
The Interview Questions They’ll Ask
- “How would you approach reverse-engineering an unfamiliar binary?”
- “What GDB commands are most useful for understanding control flow?”
- “How would you identify function calls and their arguments from assembly?”
- “Explain how you’d detect string comparison in assembly.”
- “What’s the difference between a conditional jump based on equality vs. less-than?”
- “How would you recognize that code is looping vs. just jumping around?”
Hints in Layers
Hint 1: Use objdump to Get Disassembly
Run objdump -d bomb | less to see the entire binary disassembled. Search for function names or strings.
Hint 2: Start with Strings
Run strings bomb | grep -E "[A-Z]" to find hints about expected input format.
Hint 3: Break at Phase Boundaries
In GDB:
(gdb) break explode_bomb
(gdb) break phase_1
(gdb) run
# Input a test string, watch where it fails
Hint 4: Trace Through Phase Logic
Use stepi (step one instruction) to single-step through assembly. After each instruction, print registers with info registers to see how values change.
Common Pitfalls & Debugging
Problem 1: “I don’t know how to read assembly”
- Why: x86-64 syntax is unfamiliar
- Fix: Create a cheat sheet: `mov src, dst`, `cmp a, b`, `je label` (jump if equal)
- Quick test: Compile simple C to assembly with `-S` and compare
Problem 2: “Phase defuses but says it’s wrong”
- Why: Some phases require exact formatting (spaces, number of args)
- Fix: Look at how `scanf` is called; match the format string exactly
Problem 3: “I can’t find where input is read”
- Why: `scanf` or `fgets` might be called from a library function
- Fix: Search for string pointers in data sections (`.rodata`), trace backwards to the function
- Quick test: `strings bomb | grep '%'` to find format strings
Problem 4: “Breakpoints don’t work where I expect”
- Why: ASLR or compiler optimizations move code around
- Fix: Use symbolic names: `break phase_1` instead of memory addresses
- Quick test: Check with `info breakpoints`
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Assembly Reading | CS:APP 3e | Ch. 3 |
| Reverse Engineering | Practical Binary Analysis | Ch. 1-5 |
| Debugging Strategies | The Art of Debugging | Ch. 2-4 |
| x86-64 Reference | Intel 64 and IA-32 Architectures Manual | Vol. 2A-2C |
Definition of Done
- Defuse all 6 phases (or as many as available)
- For each phase, document the disassembled code and your analysis
- Explain the logic of each phase (what condition does it check?)
- Show successful input and expected output
- Demonstrate using GDB to step through at least one phase
Project 6: Attack Lab—Exploiting Stack Vulnerabilities
Real World Outcome
You will craft malicious inputs that exploit buffer overflows to hijack control flow using ROP (Return-Oriented Programming) gadgets.
$ cat solution1.txt | ./ctarget
Type string:Entering echo mode (^C to quit)
sdraw!
Normal return
Congratulations! You have successfully completed this warm-up stage 1.
$ cat solution2.txt | ./ctarget
Type string: AAAA...XXXX[gadget_addr_1][gadget_addr_2]...
Congratulations! Stage 2 complete.
$ cat solution5.txt | ./ctarget
# After crafting ROP chain with multiple gadgets
Congratulations! Stage 5 complete. All attacks succeeded!
The Core Question You’re Answering
How can you exploit a buffer overflow to change the control flow of a program and execute arbitrary code?
Concepts You Must Understand First
- Stack Buffer Overflows (Primer Ch 3)
- How writing past a buffer’s boundary corrupts the stack
- Overwriting the return address
- Shellcode vs ROP gadgets
- Book Reference: “CS:APP 3e” - Ch. 3, Sec. 3.12
- Return-Oriented Programming (ROP) (Primer Ch 3)
- Finding gadgets (sequences ending in `ret`)
- Chaining gadgets to perform computation
- Using `pop rdi; ret` to set arguments
- Book Reference: “Practical Binary Analysis” - Ch. 9
- Exploit Development Techniques (Primer Ch 3)
- Disassembling to find gadgets
- Calculating stack offsets
- Bypassing ASLR (or using relative offsets)
- Testing exploits safely
- Book Reference: “Hacking: The Art of Exploitation” - Ch. 3
Questions to Guide Your Design
- How many bytes of overflow do you need before reaching the return address?
- How would you verify your return address is correct?
- Where would you find ROP gadgets in the binary?
- How do you craft a sequence of gadgets to pass arguments to a function?
- What’s the difference between Stage 1 (code injection) and Stages 4-5 (ROP)?
Thinking Exercise
Building a Simple ROP Chain
You want to call func(42). You find:
- `pop %rdi; ret` at 0x401234
- `call func` or `mov %rdi, %rdi; ... ret` at the address of `func`
Your stack should look like:
[Buffer][Padding][Return Address → gadget][42][Next Gadget]
Questions:
- How many bytes is the buffer?
- Where does `42` go on the stack?
- What address do you write as the return address?
- What happens after the `pop %rdi`?
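A sketch of emitting that payload as raw bytes in C. Every offset and address below is hypothetical; recover your own from GDB and `objdump`. On x86-64, `memcpy` of a `uint64_t` produces the little-endian bytes the stack expects:

```c
/* Builds: [0x28 bytes of filler][gadget addr][arg][target addr].
 * Pipe the output into the target, e.g. ./payload | ./ctarget */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    unsigned char payload[0x28 + 3 * 8];      /* 0x28-byte frame (hypothetical) */
    uint64_t gadget = 0x401234;               /* pop %rdi; ret (hypothetical) */
    uint64_t arg    = 42;                     /* value popped into %rdi */
    uint64_t target = 0x4011dd;               /* func() (hypothetical) */

    memset(payload, 'A', 0x28);               /* filler up to the return address */
    memcpy(payload + 0x28,      &gadget, 8);  /* overwritten return address */
    memcpy(payload + 0x28 + 8,  &arg,    8);  /* consumed by pop %rdi */
    memcpy(payload + 0x28 + 16, &target, 8);  /* ret lands here: func(42) */

    fwrite(payload, 1, sizeof payload, stdout);
    return 0;
}
```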
The Interview Questions They’ll Ask
- “Explain how a buffer overflow can lead to code execution.”
- “What is ROP? Why is it more powerful than simple shellcode injection?”
- “How would you find gadgets in a binary?”
- “What role does ASLR play in exploit mitigation?”
- “Can you ROP without knowing the exact memory layout?”
- “Describe the stack state before, during, and after exploiting a buffer overflow.”
Hints in Layers
Hint 1: Start with Stage 1
Stage 1 is usually a warm-up: overwrite the return address to jump to a success function (no ROP needed). Calculate offset from buffer start to return address.
Hint 2: Use GDB to Verify Offsets
(gdb) run
(gdb) break *0x401234 # At a chosen gadget
(gdb) x/20x $rsp # See what's on the stack
Hint 3: Build Gadgets from Tools
Use ropper or objdump to find gadgets:
ropper --file ctarget --search "pop rdi"
objdump -d ctarget | grep -A2 "pop.*rdi"
Hint 4: Test Incrementally
- Stage 1: Overwrite return address
- Stage 2: Inject simple code (if allowed)
- Stage 3+: Chain multiple gadgets
Common Pitfalls & Debugging
Problem 1: “Exploit doesn’t crash or execute correctly”
- Why: Return address is wrong or stack alignment is off
- Fix: Use GDB to inspect `$rsp` at the crash point
- Quick test: Disable ASLR and run again
Problem 2: “Gadget not found”
- Why: Looking for exact instruction sequence that doesn’t exist
- Fix: Look for equivalent sequences (e.g., `mov %rdi, %rdi; ret`)
- Quick test: Use `ropper --all` or `objdump -d | grep mov`
Problem 3: “Stack alignment causes crash”
- Why: Some instructions require 16-byte alignment
- Fix: Insert an extra `ret` as padding to adjust alignment
- Quick test: Count to the 16-byte boundary
Problem 4: “Can’t inject code (W^X protection)”
- Why: Modern systems mark writable memory non-executable
- Fix: Use ROP gadgets from existing code instead of injecting new code
- Quick test: Check with `checksec ./ctarget`
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Buffer Overflows | Hacking: The Art of Exploitation | Ch. 3 |
| ROP Exploitation | Practical Binary Analysis | Ch. 9 |
| Security Best Practices | CS:APP 3e | Ch. 3, Sec. 3.12 |
| Gadget Finding | ROPPER Documentation | Official guide |
Definition of Done
- Complete Stage 1 (redirect control flow)
- Complete at least one more stage using ROP gadgets
- For each stage, document: buffer offset, gadgets used, stack layout
- Explain why your exploit works despite mitigations
- Test exploit multiple times to verify reliability
Project 7: Y86-64 CPU Simulator—Understanding Pipelining
Real World Outcome
You will build a cycle-accurate CPU simulator that executes Y86-64 programs and visualizes the 5-stage pipeline.
$ ./ysim prog.yo
Y86-64 Simulator
Load program: 14 instructions
Starting execution (verbose mode)
Cycle 1: F(nop) D(nop) E(-) M(-) W(-)
Cycle 2: F(imm) D(nop) E(nop) M(-) W(-)
Cycle 3: F(add) D(imm) E(nop) M(nop) W(-)
Cycle 4: F(rrmov) D(add) E(imm) M(nop) W(nop)
[Stall: Data dependency detected. E needs %rax, D is computing it]
Cycle 5: F(rrmov) D(add) E(imm) M(add) W(nop)
Cycle 6: F(jne) D(rrmov) E(imm) M(add) W(add)
Final state:
%rax = 0x00000003
%rcx = 0x00000004
Condition codes: ZF=0 SF=0 OF=0
Memory[0x100:0x108] = {0x00, 0x01, 0x02, ...}
Performance: 6 instructions / 14 cycles = 0.43 IPC
The Core Question You’re Answering
How does a CPU execute instructions in parallel using a pipeline, and what causes stalls and forwarding?
Concepts You Must Understand First
- CPU Pipelining (Primer Ch 4)
- 5 stages: Fetch, Decode, Execute, Memory, Writeback
- Parallel execution of multiple instructions
- Book Reference: “CS:APP 3e” - Ch. 4-5
- Pipeline Hazards (Primer Ch 4)
- Data hazards (dependency on previous instruction’s result)
- Control hazards (branches)
- Structural hazards (resource conflicts)
- Book Reference: “Computer Organization and Design” by Patterson & Hennessy - Ch. 4
- Hazard Resolution (Primer Ch 4)
- Stalling (NOP injection)
- Forwarding (passing data between stages)
- Branch prediction
- Book Reference: “CS:APP 3e” - Ch. 4, Sec. 4.5
Questions to Guide Your Design
- How many cycles does it take to execute 10 independent `add` instructions?
- What happens if instruction B depends on instruction A’s result?
- How does forwarding reduce stalls compared to stalling alone?
- What’s the longest possible dependency chain you can create?
- How would branch prediction affect CPI (cycles per instruction)?
Thinking Exercise
Analyzing a Dependency Chain
irmovq $1, %rax # Cycle 1: rax = 1
addq %rax, %rbx # Cycle 2: needs rax from Cycle 1
addq %rbx, %rcx # Cycle 3: needs rbx from Cycle 2
addq %rcx, %rdx # Cycle 4: needs rcx from Cycle 3
Questions:
- Without forwarding, how many stalls occur?
- With forwarding, how many stalls occur?
- Draw the pipeline state at Cycle 2, 3, 4
The Interview Questions They’ll Ask
- “Explain the 5 stages of a CPU pipeline.”
- “What is a data dependency? Give an example.”
- “How does forwarding work? When does it NOT work?”
- “Compare stalling vs forwarding for resolving hazards.”
- “What causes a branch to stall the pipeline?”
- “Can a 10-stage pipeline achieve a CPI of 0.1?” (Why or why not?)
Hints in Layers
Hint 1: Implement the Basic Cycles
Start with Fetch → Decode → Execute → Memory → Writeback. No parallelism yet; just sequential execution.
Hint 2: Add Pipeline Registers
Between each stage, add a “pipeline register” that holds the state. This is where values flow between stages.
Hint 3: Implement Data Dependency Detection
Check: “Does the instruction in Decode need a value that Fetch just fetched?” If yes, stall.
Hint 4: Implement Forwarding
Instead of stalling, pass the result from Execute/Memory stage backward to the Decode stage.
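A minimal sketch of what Hints 2-4 look like in code (the struct and field names are illustrative, not the book's HCL): each stage boundary is a pipeline register, and operand reads prefer the newest in-flight value before falling back to the register file:

```c
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    int  dst_reg;      /* register this instruction will write, or -1 */
    long alu_result;   /* valid once the instruction has passed Execute */
    bool valid;
} PipeReg;

PipeReg E_M, M_W;      /* Execute->Memory and Memory->Writeback registers */

/* Forward the newest in-flight value of src; else read the register file. */
long read_operand(int src, const long regfile[16]) {
    if (E_M.valid && E_M.dst_reg == src) return E_M.alu_result; /* newest */
    if (M_W.valid && M_W.dst_reg == src) return M_W.alu_result; /* older */
    return regfile[src];   /* no hazard: architectural value */
}

int main(void) {
    long regs[16] = {0};
    E_M = (PipeReg){ .dst_reg = 0, .alu_result = 42, .valid = true };
    /* %rax (reg 0) is still in flight, yet Decode can read it this cycle: */
    printf("forwarded value: %ld\n", read_operand(0, regs)); /* prints 42 */
    return 0;
}
```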
Common Pitfalls & Debugging
Problem 1: “Pipeline doesn’t advance; everything stalls”
- Why: Dependency detection is too conservative
- Fix: Only stall if forwarding can’t solve it
- Quick test: Run a program with no dependencies; verify it advances every cycle
Problem 2: “Results are incorrect; wrong register values”
- Why: Forwarding path is missing or has a bug
- Fix: Trace through a simple 2-instruction dependency by hand
- Quick test: Add debug output showing forwarding path
Problem 3: “Branches cause wrong control flow”
- Why: Branch prediction / fetching ahead is implemented incorrectly
- Fix: Start with predict-not-taken. Flush pipeline on misprediction.
- Quick test: Run a branch-heavy program; verify correct result
Problem 4: “Performance numbers don’t match textbook”
- Why: Stall logic or forwarding is inefficient
- Fix: Compare your CPI to expected values; audit the code
- Quick test: Run a known test case; compare CPI
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Pipelining | CS:APP 3e | Ch. 4-5 |
| Pipeline Simulation | Computer Organization and Design | Ch. 4 |
| Y86-64 Spec | CS:APP 3e | Appendix A |
| Performance Metrics | Computer Architecture by Hennessy & Patterson | Ch. 1 |
Definition of Done
- Simulator correctly executes Y86-64 programs
- Pipeline visualization shows all 5 stages each cycle
- Data dependencies detected and stalls inserted
- Forwarding implemented for common cases
- Performance metrics (CPI) calculated and reported
- Test on provided Y86-64 programs; match expected output
Project 8: Performance Optimization Clinic
Real World Outcome
You will benchmark code variants and achieve measurable speedups through optimization, documenting the improvements.
$ ./perf-clinic
=== Loop Sum Benchmark ===
Variant 1 (Naive): 2.5s for 10M iterations
Variant 2 (Unrolled 4x): 1.8s for 10M iterations [1.39x faster]
Variant 3 (SIMD): 0.6s for 10M iterations [4.17x faster]
Variant 4 (Cache-aware): 0.4s for 10M iterations [6.25x faster]
=== Memory Access Pattern ===
Row-major (Stride-1): 1.2 ns/access (L1 cache hit)
Column-major (Stride-1024): 45 ns/access (RAM access)
Random access: 89 ns/access (TLB miss)
=== Branch Prediction ===
Predictable branch: 0.8 ns per iteration
Unpredictable branch: 8.2 ns per iteration [10.25x slower]
Summary: 6.25x speedup achieved through loop unrolling + cache optimization
The Core Question You’re Answering
What techniques reliably improve performance, and how do you measure and prove the improvements?
Concepts You Must Understand First
- Loop Optimization (Primer Ch 4-5)
- Loop unrolling for ILP (Instruction-Level Parallelism)
- Reducing loop overhead
- Eliminating dependencies
- Book Reference: “CS:APP 3e” - Ch. 5, Sec. 5.7-5.9
- Memory Locality (Primer Ch 5)
- Temporal vs spatial locality
- Cache line behavior
- Stride effects
- Book Reference: “CS:APP 3e” - Ch. 6, Sec. 6.4
- Branch Prediction (Primer Ch 4)
- Predictable vs random branches
- Cost of misprediction
- Branchless programming
- Book Reference: “CS:APP 3e” - Ch. 4, Sec. 4.7
Questions to Guide Your Design
- What is ILP and how does unrolling increase it?
- Why does 4x unrolling often beat 8x?
- What’s the relationship between cache line size and array stride?
- How would you measure branch prediction penalty?
- Can you optimize code without changing the algorithm?
Thinking Exercise
Identifying Optimization Opportunities
Given this code:
long sum = 0;
for (int i = 0; i < n; i++)
sum += arr[i];
Questions:
- Is there a data dependency? What is it?
- Can you unroll the loop? How?
- Would SIMD help? Why?
- What would the assembly look like?
- How many cycles per element (CPE) would it take?
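For reference, here is a sketch of the 4x-unrolled variant with four separate accumulators: the partial sums are independent dependency chains, so the CPU can overlap their additions (ILP) instead of serializing on one `sum +=` chain:

```c
#include <stdio.h>

long sum_unrolled4(const long *arr, long n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    long i;
    for (i = 0; i + 3 < n; i += 4) {   /* main unrolled body */
        s0 += arr[i];
        s1 += arr[i + 1];
        s2 += arr[i + 2];
        s3 += arr[i + 3];
    }
    for (; i < n; i++)                 /* leftover elements */
        s0 += arr[i];
    return s0 + s1 + s2 + s3;
}

int main(void) {
    long a[] = {1, 2, 3, 4, 5, 6, 7};
    printf("%ld\n", sum_unrolled4(a, 7));   /* prints 28 */
    return 0;
}
```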
The Interview Questions They’ll Ask
- “Why does loop unrolling improve performance?”
- “Explain the difference between ILP and pipelining.”
- “How does cache line size affect array access patterns?”
- “Why is sorting an array sometimes faster than processing it unsorted?”
- “What’s the cost of a branch misprediction?”
- “Can you optimize code to run faster without changing the algorithm?”
Hints in Layers
Hint 1: Choose Simple Benchmarks
Start with loops: sum, multiplication, or simple recursive operations. Avoid I/O, system calls, or complex logic.
Hint 2: Use perf or time for Measurement
# Use 'perf' for detailed metrics
perf stat -e cycles,instructions,cache-misses ./program
# Or simple timing
time ./program
Hint 3: Vary One Thing at a Time
- Baseline version
- 2x unrolled
- 4x unrolled
- 8x unrolled
- +simd
- +cache-aware data layout
Hint 4: Disable Optimizations for Fairness
Compile with -O0 or at most -O1 so your hand optimizations are visible. Otherwise, the compiler does it for you!
Common Pitfalls & Debugging
Problem 1: “Measurements are noisy”
- Why: System noise, cache state, thermal throttling
- Fix: Run multiple times, take average or median
- Quick test: Run same binary 5 times; see variance
Problem 2: “Optimization didn’t help; maybe made it worse”
- Why: Compiler is already doing it, or new code has worse cache behavior
- Fix: Verify with `-O0` compilation; check assembly with `-S`
- Quick test: Compare hand-optimized vs compiler-optimized
Problem 3: “Can’t measure ILP directly”
- Why: ILP is a hardware metric, not directly observable
- Fix: Use `perf` with hardware events: `perf stat -e instructions,cycles`
- Quick test: Measure CPE (cycles per element) instead
Problem 4: “Code runs too fast; measurement is 0ms”
- Why: Problem size is too small or time resolution is coarse
- Fix: Increase problem size or repeat inner loop
- Quick test: Make sure loop takes at least 1 second
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Loop Optimization | CS:APP 3e | Ch. 5, Sec. 5.7-5.9 |
| Cache Optimization | CS:APP 3e | Ch. 6, Sec. 6.4-6.5 |
| Performance Analysis | Performance Engineering (MIT OCW) | Lectures 1-5 |
| `perf` Tool | Brendan Gregg’s Blog | “perf-tools” series |
Definition of Done
- Implement at least 4 variants of a kernel (baseline, unrolled, cache-aware, SIMD)
- Measure and document speedup for each variant
- Explain the optimization technique used in each
- Achieve at least 2x speedup on at least one benchmark
- Use `perf` to verify hardware metrics (IPC, cache misses)
- Document results with graphs/tables showing improvements
Project 9: Cache Lab++ Simulator
- File: CSAPP_CACHE_LAB_SIMULATOR.md
- Main Programming Language: C
- Alternative Programming Languages: Python, Rust, C++
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Cache Hierarchy / Memory System / Performance Analysis
- Software or Tool: `valgrind`, `perf`, cache architecture manuals
- Main Book: Computer Systems: A Programmer’s Perspective, 3rd Edition by Bryant & O’Hallaron
What you’ll build: A fully-featured cache simulator that accepts memory access traces and simulates cache behavior (hits, misses, evictions) with configurable cache geometry and replacement policies.
Why it teaches cache behavior: You’ll implement the exact logic that determines whether a memory access hits or misses, forcing you to understand address bit anatomy, the memory hierarchy, and how different workloads interact with cache geometry.
Core challenges you’ll face:
- Address parsing (Tag/Index/Offset extraction) → maps to understanding cache address translation
- Replacement policy implementation (LRU, FIFO, Random) → maps to understanding eviction strategies
- Trace processing and statistics collection → maps to understanding cache performance metrics
Real World Outcome
You’ll produce a command-line tool csim that takes a memory trace and simulates cache behavior:
$ ./csim -v -S 4 -E 2 -b 4 -t traces/yi.trace
L 10,1
cache [0][0] 0x0 miss
L 18,1
cache [0][1] 0x8 miss
S 18,1
cache [0][1] 0x8 hit
L 22,1
cache [1][0] 0x2 miss evict
L 2d,1
cache [1][1] 0xd hit
L 4,4
cache [0][0] 0x0 miss
L 2c,4
cache [1][1] 0xc miss evict
L 2d,1
cache [1][0] 0xd hit
Summary:
hits:3 misses:5 evictions:2
The Core Question You’re Answering
“How do CPU cache lines, index bits, and tag bits determine whether a memory access hits or misses?”
Before coding, understand this completely. A cache miss isn’t random—it’s determined by arithmetic on address bits.
Concepts You Must Understand First
- Cache Geometry & Addressing
- What does “S=4, E=2, b=4” mean for a cache?
- How many bits for index, offset, and tag?
- Book Reference: CS:APP 3e Ch. 6, Sec. 6.1-6.2
- Memory Access Traces
- How to parse “L 10,1” (Load 1 byte at 0x10)?
- Book Reference: CS:APP 3e Ch. 6, Sec. 6.6
- Replacement Policies
- LRU: evict the least recently used line
- How to track ordering efficiently?
- Book Reference: CS:APP 3e Ch. 6, Sec. 6.2
Questions to Guide Your Design
- Address Decomposition: Given address 0x1A34 with S=16 sets and b=4 offset bits, what are index and tag bits?
- Hit vs Miss: How do you determine if an address is in the cache using only set index and tag?
- Eviction: How do you track which line is “least recently used” without expensive data structures?
Thinking Exercise
Cache Address Mapping
Given S=4 sets, E=2 lines/set, b=3 offset bits:
Address 0x1A5 (binary: 110100101)
- Offset (bits 0-2): 101 = 5
- Index (bits 3-4): 00 = 0
- Tag (remaining bits 5+): 1101 = 13
So 0x1A5 maps to set 0 with tag 13.
Questions:
- If 0x1AA arrives next, where does it map?
- If 0x1A5 arrives again, does it hit the same line?
- How many consecutive addresses map to the same set?
The Interview Questions They’ll Ask
- “Explain how address bits map to cache index and tag. What if two addresses have the same index but different tags?”
- “What’s the difference between a cold miss, capacity miss, and conflict miss?”
- “How would you implement LRU replacement efficiently without a linked list?”
- “If a cache has 16 sets, 4 lines/set, 64-byte lines, how many bytes total? What’s the address decomposition?”
- “Why might increasing cache size not always improve hit rate?”
Hints in Layers
Hint 1: Address Decomposition Address bits aren’t just numbers—they’re structured. Rightmost bits are byte offset within a line. Next bits are set index. Remaining bits are tag. Write a function extracting all three.
Hint 2: Cache Data Structure
Use a 2D array: cache[S][E]. Each element stores: valid bit, tag, and timestamp for LRU tracking.
Hint 3: Hit/Miss Detection For each access: extract index and tag. Search the set for valid line with matching tag. If found, hit—update timestamp. If not, miss. If set is full, evict smallest timestamp (LRU). Otherwise insert.
Hint 4: Validate with Tiny Traces Create a trace manually:
L 0x0,1
L 0x40,1
L 0x0,1 # Should be a hit
Trace through after each operation and verify cache state.
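The address anatomy from Hints 1-3 as a small C helper (the struct and names are mine, not the official csim's). Note that `split_address(0x1A5, 2, 3)` reproduces the worked example above: offset 5, set 0, tag 13:

```c
/* With S = 2^s sets and b offset bits, the s index bits sit just above
 * the offset, and the tag is everything higher. */
#include <stdint.h>
#include <stdio.h>

typedef struct { uint64_t tag, set, offset; } AddrParts;

AddrParts split_address(uint64_t addr, int s_bits, int b_bits) {
    AddrParts p;
    p.offset = addr & ((1ULL << b_bits) - 1);              /* low b bits */
    p.set    = (addr >> b_bits) & ((1ULL << s_bits) - 1);  /* next s bits */
    p.tag    = addr >> (b_bits + s_bits);                  /* remaining high bits */
    return p;
}

int main(void) {
    AddrParts p = split_address(0x1A5, 2, 3);   /* S=4 sets, b=3, as above */
    printf("offset=%llu set=%llu tag=%llu\n",   /* prints offset=5 set=0 tag=13 */
           (unsigned long long)p.offset,
           (unsigned long long)p.set,
           (unsigned long long)p.tag);
    return 0;
}
```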
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Cache Organization | CS:APP 3e | Ch. 6, Sec. 6.1-6.3 |
| Replacement Policies | CS:APP 3e | Ch. 6, Sec. 6.2 |
| Understanding Traces | CS:APP 3e | Ch. 6, Sec. 6.6 |
Common Pitfalls and Debugging
Problem 1: “Hit rate is way off”
- Why: Off-by-one error in address bit extraction
- Fix: Print index, tag, offset for each address; verify manually
- Quick test: Load same address 4 times; all should be hits
Problem 2: “Eviction isn’t working”
- Why: Not properly tracking set occupancy
- Fix: Assert `cache[set].count <= E` at all times
- Quick test: Load `E+1` addresses to the same set; the first should be evicted
Problem 3: “LRU evicts wrong line”
- Why: Timestamp not updated on access, or evicting most recent instead
- Fix: Update timestamp on every access (hits too)
- Quick test: Load 3 addresses into E=3 set, access in order; verify eviction
Problem 4: “Program crashes on traces”
- Why: Trying to process ‘I’ (instruction) lines as data
- Fix: Skip ‘I’ lines; only process ‘L’, ‘S’, ‘M’
- Quick test: Run on official CS:APP traces first
Definition of Done
- Parse cache geometry from CLI (S, E, b)
- Extract set index and tag from 64-bit addresses correctly
- Distinguish hits from misses based on cache state
- Implement LRU replacement when cache is full
- Process memory traces and collect hits/misses/evictions
- Output statistics matching reference implementation
- Handle both direct-mapped and set-associative caches
- Skip ‘I’ operations in valgrind traces
Project 10: ELF Link Map & Interposition Toolkit
- File: CSAPP_ELF_INTERPOSITION_TOOLKIT.md
- Main Programming Language: C
- Alternative Programming Languages: Python, Go, Rust
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: Dynamic Linking / Symbol Resolution / ELF Format
- Software or Tool: `readelf`, `objdump`, `ldd`, linker scripts
- Main Book: Computer Systems: A Programmer’s Perspective, 3rd Edition by Bryant & O’Hallaron
What you’ll build: A suite of tools including: (1) linkmap to visualize symbol resolution and PLT/GOT dynamics, (2) interpose library that intercepts function calls at runtime via symbol interposition, and (3) reloc-viz to visualize relocation entries and their targets.
Why it teaches dynamic linking: You’ll understand what happens between compilation and execution—how symbols become addresses, how the PLT works, and how runtime interposition redirects calls.
Core challenges you’ll face:
- ELF parsing (sections, symbols, relocations) → maps to understanding binary format and structure
- Symbol resolution (strong vs weak, global vs local) → maps to understanding visibility and linking
- PLT/GOT mechanics and lazy binding → maps to understanding runtime indirection
- Interposition via `LD_PRELOAD` and symbol wrapping → maps to understanding link-time behavior modification
Real World Outcome
You’ll produce multiple command-line tools:
$ ./linkmap ./myprog
[DYNAMIC] Loaded libraries:
libm.so.6 (0x7fce9c000000)
libc.so.6 (0x7fce9be00000)
[SYMBOLS] Exported from myprog:
0x400600 T main
0x400500 T helper_func
U printf # Undefined (to be resolved)
[PLT/GOT] Entries:
printf@plt (0x400450) -> printf@libc (0x7fce9c50000)
$ gcc -fPIC -shared -o interpose.so interpose.c
$ LD_PRELOAD=./interpose.so ./myprog
[INTERCEPT] malloc called! (size=1024)
[INTERCEPT] free called! (ptr=0x5555558)
The Core Question You’re Answering
“How does the dynamic linker resolve symbols, and how can we intercept those resolutions?”
Before coding, understand that linking happens in three phases: compile-time (symbols become relocation entries), link-time (entries become offsets into libraries), and load-time (libraries are mapped and symbols are resolved).
Concepts You Must Understand First
- ELF Format & Sections
- `.symtab` (symbol table), `.dynsym` (dynamic symbol table)
- `.rel.*` (relocation tables), `.got`/`.plt` (runtime indirection)
- Book Reference: CS:APP 3e Ch. 7, Sec. 7.3-7.5
- Symbol Resolution
- Global vs local scope, strong vs weak symbols
- Interposition: overriding function definitions at link/load time
- Book Reference: CS:APP 3e Ch. 7, Sec. 7.6
- Procedure Linkage Table (PLT) and Global Offset Table (GOT)
- How indirect jumps work at runtime
- Lazy binding: first call triggers resolution, subsequent calls use cached address
- Book Reference: CS:APP 3e Ch. 7, Sec. 7.8
- Dynamic Linking & Relocation
- Relocation entries tell the linker how to patch addresses
- At load time, linker walks these entries and fixes them
- Book Reference: CS:APP 3e Ch. 7, Sec. 7.7
Questions to Guide Your Design
- Symbol Tables: How do you extract all symbols from a binary? Which symbols are global vs local, defined vs undefined?
- Relocation Inspection: Given a relocation entry `R_X86_64_GLOB_DAT`, how do you know which symbol it refers to and where it gets patched?
- PLT/GOT Tracing: How does a call to `printf@plt` redirect through the GOT to the real `printf` in libc?
- Interposition: If you want to intercept `malloc`, how do you create a shared library with a `malloc` function that takes precedence?
Thinking Exercise
Symbol Resolution Chain
Program calls printf(...). Follow the chain:
- In source code: `#include <stdio.h>` declares `printf`
- After compilation: object file has relocation entry `R_X86_64_PLT32` for `printf`
- After linking: executable has `printf@plt` (jump table entry) and `printf@got.plt` (cache for the address)
- At runtime:
- First call: jump to `printf@plt` → indirect jump through `printf@got.plt`
- GOT initially points back to PLT code that calls the dynamic linker
- Linker finds `printf` in libc, patches the GOT
- Subsequent calls: jump through the GOT directly to libc’s `printf`
Questions:
- Why does the PLT exist if the GOT can hold the address directly?
- What happens if interposition changes which library `printf` comes from?
The Interview Questions They’ll Ask
- “Explain the difference between `.symtab`, `.dynsym`, and the GOT. When is each used?”
- “What is lazy binding, and why is it better than eager binding for large programs?”
- “How would you intercept function calls using `LD_PRELOAD`? What prevents you from intercepting non-exported functions?”
- “If you remove a symbol from a library, but a binary still calls it, what happens at runtime?”
Hints in Layers
Hint 1: Read ELF Headers and Sections
Start with readelf to see what information is available. Then parse the ELF binary yourself: read the ELF header, iterate section headers, extract symbol tables (both .symtab and .dynsym), and print all symbols with their addresses and binding types.
Hint 2: Parse Relocation Entries
Relocation entries live in .rel.* sections. Each entry contains: an offset (where to patch), a type (how to patch), and a symbol index. Print all relocations with their target symbols.
Hint 3: Understand PLT/GOT
The .plt section contains jump stubs. The .got.plt section contains cached addresses. For each external function, find its PLT entry and its GOT entry. Trace how a call gets redirected.
Hint 4: Build Interposition Wrapper
Create a shared library with a function that has the same name as a function you want to intercept. Use dlsym to get the address of the original function, then your wrapper can call the original after logging or modifying behavior. Compile with -fPIC -shared.
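A minimal version of that wrapper, as a sketch. Logging is done with `write()` on purpose, since calling `printf` or `malloc` inside the interposer could recurse into itself:

```c
/* Build and run:
 *   gcc -fPIC -shared interpose.c -o interpose.so -ldl
 *   LD_PRELOAD=./interpose.so ./myprog
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>
#include <unistd.h>

void *malloc(size_t size) {
    static void *(*real_malloc)(size_t) = NULL;
    if (!real_malloc)   /* resolve the *next* definition of malloc (libc's) */
        real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");

    static const char msg[] = "[INTERCEPT] malloc called!\n";
    write(2, msg, sizeof msg - 1);   /* log without re-entering malloc */
    return real_malloc(size);
}
```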
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| ELF Format & Sections | CS:APP 3e | Ch. 7, Sec. 7.3-7.5 |
| Symbol Resolution | CS:APP 3e | Ch. 7, Sec. 7.6 |
| PLT & GOT | CS:APP 3e | Ch. 7, Sec. 7.8 |
| Relocation & Dynamic Linking | CS:APP 3e | Ch. 7, Sec. 7.7 |
| Deep ELF Internals | Linkers and Loaders | Ch. 5-6 (Levine) |
Common Pitfalls and Debugging
Problem 1: “I can’t find symbols in the binary”
- Why: Symbols might be stripped (release builds), or you’re looking in `.symtab` instead of `.dynsym`
- Fix: Use `.dynsym` for dynamic symbols; `.symtab` might not exist; check with `nm` or `readelf -s`
- Quick test: `readelf -s ./binary | grep printf`
Problem 2: “PLT/GOT addresses don’t match what I expect”
- Why: Address layout depends on relocation type and ASLR (Address Space Layout Randomization)
- Fix: Disable ASLR for predictable addresses: `setarch x86_64 -R ./program`
- Quick test: Compare output from `objdump -R` with runtime observation via gdb
Problem 3: “Interposition doesn’t work; original function still called”
- Why: Library order matters; `LD_PRELOAD` works only for exported symbols, not static functions
- Fix: Ensure your .so is in `LD_PRELOAD` and uses the same symbol name; check with `nm`
- Quick test: `LD_PRELOAD=./interpose.so ldd ./program` should show your .so first
Problem 4: “Segfault when calling original via dlsym”
- Why: Function pointer dereference or incorrect calling convention
- Fix: Verify function signature matches; test with a simple wrapper that just prints
- Quick test: Create a simple malloc wrapper that only logs, doesn’t call original
Definition of Done
- Parse ELF headers and sections from a binary
- Extract and display all symbols (both static and dynamic)
- Parse and visualize relocation entries with their targets
- Map PLT entries to GOT entries
- Create an `LD_PRELOAD` interposition library that intercepts a function
- Demonstrate lazy binding by observing GOT updates
- Handle both 32-bit and 64-bit ELF binaries
- Document symbol resolution chain with examples
Project 11: Signals + Processes Sandbox
- File: CSAPP_SIGNALS_PROCESSES_SANDBOX.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go, C++
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Process Control / Signal Handling / Async I/O
- Software or Tool: `strace`, `gdb`, `kill`, signal masks
- Main Book: Computer Systems: A Programmer’s Perspective, 3rd Edition by Bryant & O’Hallaron
What you’ll build: A comprehensive test harness (sigsandbox) that safely executes untrusted code and tests signal behavior under controlled conditions. Includes: (1) signal tracing to log all signals, (2) zombie detector to find orphaned processes, (3) race condition fuzzer to test signal handler safety, and (4) test suite validating async-signal-safe operations.
Why it teaches signal handling: Signals are asynchronous—they interrupt program flow at unpredictable times. You’ll confront race conditions, deadlocks, and subtle bugs that only show up under load. This forces deep understanding of signal delivery, masking, and safe handler design.
Core challenges you’ll face:
- Signal delivery and masking (blocking/unblocking) → maps to understanding atomic critical sections
- Zombie processes and reaping → maps to understanding process lifecycle
- Async-signal-safety and reentrancy → maps to understanding which functions are safe to call in handlers
- Race conditions between signal arrival and handler registration → maps to understanding temporal ordering in concurrent systems
Real World Outcome
You’ll produce sigsandbox, a harness that captures signal behavior:
$ ./sigsandbox --test=signal_safety ./untrusted_program
[RUN] Executing: ./untrusted_program
[SIG] Caught SIGCHLD at time=0.001234s (from PID 12345)
[SIG] Caught SIGTERM at time=0.005678s (from PID 12346)
[ZOMBIE CHECK] Found 0 zombie processes
[SAFETY CHECK] Only async-safe functions called in handlers: PASS
[RACE CHECK] 1000 concurrent signals: No race detected
Summary:
Signals caught: 2
Zombies: 0
Handler safety violations: 0
Race conditions: 0
Plus additional tools:
- signal_trace: Logs all signal activity to a file
- zombie_detector: Identifies zombie processes
- race_fuzzer: Stress-tests handlers with rapid signal delivery
The Core Question You’re Answering
“How do you write correct code that handles asynchronous interrupts, and how do you test that it works?”
Before coding, understand that signals interrupt code at arbitrary points. If the interrupted code was inside a non-reentrant function (say, malloc), calling that same function from the handler corrupts its state; only async-signal-safe functions may be called in handlers. This creates subtle bugs that are nearly impossible to debug without a harness.
Concepts You Must Understand First
- Signal Delivery & Masking
- How signals are blocked, pending, and delivered
- sigprocmask, sigaction, signal sets
- Book Reference: CS:APP 3e Ch. 8, Sec. 8.5-8.6
- Process Lifecycle & Reaping
- Parent-child relationships, zombie processes
- wait, waitpid, the SIGCHLD handler
- Book Reference: CS:APP 3e Ch. 8, Sec. 8.4
- Async-Signal-Safe Functions
- Which functions are safe to call in handlers (minimal list)
- Why malloc, printf, and locks are dangerous
- Book Reference: POSIX signal-safety(7) man page
- Race Conditions in Signal Handlers
- Handler arrives while other code is running
- Examples: signal arrives during malloc, during file I/O, during a mutex lock
- Book Reference: CS:APP 3e Ch. 8, Sec. 8.7
Questions to Guide Your Design
- Signal Masking: How do you temporarily block certain signals to protect a critical section? What happens to signals that arrive while blocked?
- Zombie Reaping: When a child process exits, when is it considered a zombie? How do you reap it correctly without losing signals?
- Handler Safety: You want to log a signal in the handler. printf is not async-safe. What can you use instead?
- Race Testing: How would you stress-test your signal handlers to make race conditions more likely to manifest?
Thinking Exercise
Signal Handler Race Condition
int counter = 0;
void handler(int sig) {
counter++; // Not atomic!
}
int main() {
signal(SIGUSR1, handler);
while (1) {
printf("Counter: %d\n", counter); // RACE: signal might arrive mid-printf!
}
}
The bugs: counter is a plain int rather than volatile sig_atomic_t, so main may read a stale or torn value; and if the handler itself ever called printf or malloc, a signal arriving mid-printf (while its internal lock or malloc state is held) would corrupt state or deadlock.
Questions:
- What’s a safe way to update counter in the handler?
- What should main do instead of calling printf directly?
- How would you detect this bug with your harness? (A corrected sketch follows below.)
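A corrected sketch, under the assumptions the exercise sets up: the counter becomes volatile sig_atomic_t (the only object type the C standard guarantees is safe to touch from an async handler), the handler does nothing else, and main snapshots the value before formatting it.

```c
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/* Handler-only writes to a volatile sig_atomic_t are async-signal-safe. */
static volatile sig_atomic_t counter = 0;

static void handler(int sig) {
    (void)sig;
    counter++;                       /* nothing unsafe happens in here */
}

int main(void) {
    struct sigaction sa = {0};
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;        /* restart interrupted syscalls */
    sigaction(SIGUSR1, &sa, NULL);

    for (;;) {
        sig_atomic_t snapshot = counter;   /* read once, then format */
        printf("Counter: %d\n", (int)snapshot);
        sleep(1);
    }
}
```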
The Interview Questions They’ll Ask
- “What’s the difference between blocking a signal with sigprocmask and ignoring it with signal(sig, SIG_IGN)?”
- “You want to safely access a variable modified in a signal handler. How do you do it without race conditions?”
- “If a signal handler is installed but never uninstalled, what happens to signals when the process exits?”
- “Why might a zombie process exist even though the parent called wait?”
- “How would you design a signal-safe logging function for use in handlers?”
Hints in Layers
Hint 1: Build a Signal Tracer
Create a handler for SIGCHLD that logs signal arrival time and sender PID. Use write (async-safe) to log to a file. Install the handler with sigaction (preferred over signal). Test by spawning child processes and verifying logs.
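A sketch of such a tracer, assuming SA_SIGINFO (which reports the sender’s PID in si_pid) and hand-rolled integer formatting, since sprintf is not async-signal-safe; this also previews Hint 3’s logger.

```c
#include <signal.h>
#include <unistd.h>

/* Log "SIGCHLD from pid N" using only write(2) — async-signal-safe. */
static void trace_sigchld(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)ctx;
    static const char head[] = "SIGCHLD from pid ";
    write(STDERR_FILENO, head, sizeof(head) - 1);

    char digits[16], out[17];
    int d = 0, n = 0;
    pid_t pid = info->si_pid;
    do { digits[d++] = '0' + (pid % 10); pid /= 10; } while (pid > 0);
    while (d > 0) out[n++] = digits[--d];   /* reverse into output order */
    out[n++] = '\n';
    write(STDERR_FILENO, out, n);
}

int main(void) {
    struct sigaction sa = {0};
    sa.sa_sigaction = trace_sigchld;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGCHLD, &sa, NULL);
    /* fork children here and watch the log */
    pause();
}
```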
Hint 2: Implement Zombie Detection
Scan the numeric entries in /proc and read each /proc/[pid]/stat (or status), checking the state field for Z. Log each zombie together with its parent PID (the ppid field in the same file).
Hint 3: Create an Async-Safe Logger
Build a function that only calls async-safe functions. Use a static buffer, write syscall, and integer-to-string conversion (no sprintf). Test by calling it from handlers.
Hint 4: Stress-Test with Signal Storms
Fork a child that rapidly sends the parent alternating SIGUSR1 and SIGUSR2. Your handlers update per-signal counters. Afterwards, compare the counters against the number of signals the sender actually delivered. Beware: standard signals coalesce while pending (duplicates are dropped, not queued), so an undercount alone isn’t proof of a race; have the sender wait for an acknowledgment if you need an exact match.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Signal Handling Basics | CS:APP 3e | Ch. 8, Sec. 8.5 |
| Signal Masking & Delivery | CS:APP 3e | Ch. 8, Sec. 8.6 |
| Process Lifecycle | CS:APP 3e | Ch. 8, Sec. 8.4 |
| Async-Signal-Safety | POSIX signal-safety(7) | Man page |
| Advanced Signal Programming | APUE (Stevens) | Ch. 10 |
Common Pitfalls and Debugging
Problem 1: “My zombie detector doesn’t find zombies”
- Why: Zombies might be reaped between checks; parent might be handling SIGCHLD correctly
- Fix: Create a test child that exits but doesn’t get reaped; confirm it appears as zombie
- Quick test: cat /proc/[pid]/status | grep State should show State: Z (zombie)
Problem 2: “Signal handler is called but counter doesn’t increment”
- Why: Counter update might not be atomic; check might not be volatile
- Fix: Declare counter as volatile sig_atomic_t; if several contexts increment it, use C11 atomics instead
- Quick test: Send 100 signals rapidly; the count should match
Problem 3: “Program deadlocks when signal arrives during printf”
- Why: printf uses locks; the handler calls printf; deadlock when the signal arrives inside the lock
- Fix: Don’t use printf in handlers; use only async-safe functions like write
- Quick test: Install the handler, send a signal while in printf; it should not hang
Problem 4: “Parent misses SIGCHLD; children become zombies”
- Why: Handler not installed, or waitpid called before SIGCHLD arrives (race)
- Fix: Ensure sigaction is called early; call waitpid in a loop inside the handler
- Quick test: Fork 100 children; all should be reaped before the parent exits
Definition of Done
- Implement signal tracer that logs all signals with timing and sender PID
- Detect zombie processes by scanning /proc or the process table
- Create an async-safe logging function using only write and integer conversion
- Stress-test signal handlers with 1000+ rapid signals; verify no race conditions
- Demonstrate safe vs unsafe signal handlers with documented examples
- Show how sigprocmask prevents race conditions in critical sections
- Handle SIGCHLD correctly without leaving zombies
- Provide examples of common pitfalls and how to avoid them
Project 12: Unix Shell with Job Control
- File: CSAPP_SHELL_JOB_CONTROL.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go, Python
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Process Control / Terminal I/O / Job Management
- Software or Tool: readline, terminal attributes, process group management
- Main Book: Computer Systems: A Programmer’s Perspective, 3rd Edition by Bryant & O’Hallaron
What you’ll build: A Unix shell (tsh) that supports background/foreground job control, including: (1) fg and bg commands to manage jobs, (2) signal handling for SIGCHLD, SIGINT, SIGTSTP, (3) correct process group management, and (4) proper terminal control.
Why it teaches process control: A shell is a process manager—it must coordinate multiple child processes, handle signals correctly, and manage terminal I/O. You’ll see how process groups, signals, and terminal control interact in a real system.
Core challenges you’ll face:
- Process groups and session leaders → maps to understanding how terminal multiplexing works
- Signal handling without race conditions → maps to atomic operations in concurrent systems
- Job management (foreground vs background) → maps to process state tracking
- Terminal control (SIGSTOP, SIGCONT) → maps to understanding kernel-terminal interaction
Real World Outcome
You’ll produce tsh, a shell that behaves like bash for job control:
$ ./tsh
tsh> sleep 100 &
[1] (12345) sleep 100
tsh> sleep 200
^Z
[2] (12346) sleep 200
tsh> jobs
[1] (12345) Running sleep 100
[2] (12346) Stopped sleep 200
tsh> bg 2
[2] (12346) sleep 200
tsh> fg 1
sleep 100
^C
tsh>
The shell correctly:
- Launches jobs in foreground (blocking) or background (non-blocking)
- Handles Ctrl-Z (SIGTSTP) to stop the foreground job
- Handles Ctrl-C (SIGINT) to kill the foreground job
- Maintains a job list with status (Running, Stopped, Done)
- Moves jobs between foreground and background
The Core Question You’re Answering
“How does a shell manage multiple processes and handle terminal signals without losing synchronization?”
Before coding, understand that the shell is a traffic controller: it launches processes, waits for them, forwards signals, manages terminal control, and keeps track of job state—all while handling asynchronous signals.
Concepts You Must Understand First
- Process Groups & Sessions
- Terminal sends signals to process groups, not individual processes
- Foreground vs background process groups
- setpgid, getpgrp, tcgetpgrp, tcsetpgrp
- Book Reference: CS:APP 3e Ch. 8, Sec. 8.3-8.4
- Signal Handling for Job Control
- SIGCHLD: child process exited or stopped
- SIGINT, SIGTSTP: sent by the terminal to the foreground process group
- Handlers must reap children and update job state
- Book Reference: CS:APP 3e Ch. 8, Sec. 8.5-8.7
- Job Management Data Structure
- Job list: track PID, process group, state, command line
- Foreground job tracking: which job is in foreground?
- State transitions: Running → Stopped → Done → reaped
- Book Reference: CS:APP 3e Ch. 8 (Shell Lab specification)
- Terminal Control & Process Groups
- How does Ctrl-C reach only the foreground job?
- What are tcgetpgrp/tcsetpgrp for?
- Book Reference: CS:APP 3e Ch. 8, Sec. 8.3
Questions to Guide Your Design
- Process Groups: When you run sleep 100 &, you create a new process. What process group should it belong to?
- Terminal Signals: When you press Ctrl-Z, why does it stop only the foreground job, not the shell?
- Job Tracking: How do you track which job is foreground? How do you move a job from background to foreground?
- Signal Safety: If SIGCHLD arrives while you’re updating the job list, what breaks?
Thinking Exercise
Terminal Signal Flow
You’re in shell with sleep 100 running in foreground. You press Ctrl-Z.
- Terminal driver sends SIGTSTP to entire foreground process group
- Sleep receives SIGTSTP and stops
- If the shell left the job in its own (the shell’s) process group, the shell receives SIGTSTP too!
Questions:
- How does the shell prevent itself from stopping when Ctrl-Z is pressed?
- Should the shell’s signal handler for SIGTSTP do anything?
- What if the sleep process ignores SIGTSTP? What should the shell do?
The Interview Questions They’ll Ask
- “Explain the difference between a process group and a session. Why does the terminal need to know about process groups?”
- “When you press Ctrl-C in a shell with a background job, why doesn’t it kill the background job?”
- “What happens if the foreground process group stops? How does the shell regain control?”
- “How would you implement the bg command to resume a stopped background job?”
- “Why do you need both a SIGCHLD handler and the ability to block signals in critical sections?”
Hints in Layers
Hint 1: Build Job List Management
Create a job struct: {pid, pgid, state, cmdline}. Implement functions to add, remove, find, and list jobs. Test with manual job creation before handling signals.
Hint 2: Signal Handlers
Implement a SIGCHLD handler that calls waitpid in a loop with WNOHANG flag to reap any stopped or exited children. Update job state for each. Use sigprocmask to protect job list access.
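A sketch of that handler. The job_* helpers are hypothetical stand-ins for your job-list functions; the essential details are the waitpid loop with WNOHANG | WUNTRACED and saving errno.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical job-list hooks — replace with your own. */
static void job_mark_stopped(pid_t pid) { (void)pid; }
static void job_remove(pid_t pid)       { (void)pid; }

void sigchld_handler(int sig) {
    (void)sig;
    int saved_errno = errno;        /* waitpid may clobber errno */
    int status;
    pid_t pid;
    /* Reap every child whose state changed; WNOHANG keeps us non-blocking,
     * WUNTRACED also reports children stopped by Ctrl-Z. */
    while ((pid = waitpid(-1, &status, WNOHANG | WUNTRACED)) > 0) {
        if (WIFSTOPPED(status))
            job_mark_stopped(pid);
        else
            job_remove(pid);        /* exited normally or was killed */
    }
    errno = saved_errno;
}
```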
Hint 3: Process Group Management
When forking a child, the parent should call setpgid(child_pid, child_pid) to put the child in its own process group. If running in background, the shell shouldn’t wait. If foreground, use tcsetpgrp to give the child the terminal.
Hint 4: Shell Protection from Signals
Before any critical operation (job list modification), block SIGCHLD, SIGINT, SIGTSTP with sigprocmask. Unblock after the operation completes. This prevents signal handlers from interrupting updates.
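A sketch of that pattern as a small helper, assuming a caller-supplied critical-section function:

```c
#include <signal.h>

/* Run fn() with SIGCHLD blocked so its handler cannot interleave with a
 * job-list update; any SIGCHLD that arrives meanwhile stays pending and
 * is delivered when the old mask is restored. */
static void with_sigchld_blocked(void (*fn)(void)) {
    sigset_t mask, prev;
    sigemptyset(&mask);
    sigaddset(&mask, SIGCHLD);
    sigprocmask(SIG_BLOCK, &mask, &prev);
    fn();                                   /* critical section */
    sigprocmask(SIG_SETMASK, &prev, NULL);  /* pending signals fire here */
}
```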
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Process Groups & Sessions | CS:APP 3e | Ch. 8, Sec. 8.3 |
| Job Control & Signal Handling | CS:APP 3e | Ch. 8, Sec. 8.4-8.7 |
| Shell Lab Specification | CS:APP 3e | Shell Lab (shlab) handout |
| Advanced Job Control | APUE (Stevens) | Ch. 9 |
Common Pitfalls and Debugging
Problem 1: “Ctrl-Z doesn’t stop the foreground job”
- Why: Process not in correct process group, or terminal signal handling broken
- Fix: Ensure the child is in its own process group; use tcsetpgrp to give the terminal to the child
- Quick test: Run sleep 100 in the foreground; press Ctrl-Z; the shell should print Stopped
Problem 2: “Reusing a job ID causes confusion”
- Why: Job list has old job with same ID; not properly removed
- Fix: When job is reaped, remove it from job list; use next free job number
- Quick test: Run 10 jobs; kill them; run 10 more; job IDs should be 1-10 again
Problem 3: “SIGCHLD handler runs while updating job list; data corruption”
- Why: Not blocking signals during critical sections
- Fix: Use sigprocmask to block SIGCHLD before modifying the job list; unblock after
- Quick test: Stress test with 100 rapid job creations; it should not crash
Problem 4: “bg command doesn’t work; job stays stopped”
- Why: Not sending SIGCONT to the stopped job’s process group
- Fix: bg should call kill(-pgid, SIGCONT) to resume all processes in the group
- Quick test: Stop a job with Ctrl-Z; run bg; the job should continue
Definition of Done
- Parse shell commands with arguments (tokenization)
- Create child processes and manage process groups
- Implement jobs command listing all active jobs
- Handle SIGCHLD correctly, reaping children and updating state
- Protect job list with signal masking during updates
- Implement fg command: bring a job to the foreground and wait for it
- Implement bg command: resume a stopped background job
- Properly handle Ctrl-C and Ctrl-Z for the foreground job only
- Show job notifications when a job stops or completes
- Support & suffix for background jobs
Project 13: Virtual Memory Map Visualizer
- File: CSAPP_VM_MAP_VISUALIZER.md
- Main Programming Language: C
- Alternative Programming Languages: Python, Rust, Go
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Virtual Memory / Process Address Space / Page Management
- Software or Tool: /proc/[pid]/maps, /proc/[pid]/smaps, mmap
- Main Book: Computer Systems: A Programmer’s Perspective, 3rd Edition by Bryant & O’Hallaron
What you’ll build: An interactive tool vmmap that visualizes a process’s virtual address space, including: (1) visual representation of memory regions (code, heap, stack, libraries), (2) parsing and display of /proc/[pid]/maps and /proc/[pid]/smaps, (3) symbol resolution to identify function boundaries, (4) dynamic visualization showing memory changes as the process runs.
Why it teaches virtual memory: Virtual memory is abstract—addresses look like addresses, but they’re really just virtual. This tool makes that abstraction concrete by showing exactly how the kernel lays out segments, applies ASLR, and tracks which pages are actually resident.
Core challenges you’ll face:
- Parsing /proc/[pid]/maps and /proc/[pid]/smaps → maps to understanding the kernel’s view of the address space
- Symbol resolution to identify functions → maps to understanding debug symbols and relocation
- Visualizing memory layout across time → maps to understanding dynamic memory allocation
- Page permission flags and their meaning → maps to understanding virtual memory protection
Real World Outcome
You’ll produce vmmap, an interactive tool:
$ vmmap --pid=1234
Virtual Address Map for PID 1234 (/usr/bin/firefox)
Address Space:
┌─────────────────────────────────────────────────┐ 0xffffffffffffffff
│ Kernel Space │
└─────────────────────────────────────────────────┘ 0xffffffff80000000
(gap)
┌─────────────────────────────────────────────────┐ 0x7fffff000000
│ Shared Libraries │
│ libc.so.6 0x7f5a58c00000-0x7f5a58da2000 rw-p│
│ libm.so.6 0x7f5a5aa00000-0x7f5a5aa23000 r-xp│
└─────────────────────────────────────────────────┘
(gap)
┌─────────────────────────────────────────────────┐ 0x0000600000000000
│ Heap │
│ Allocated: 256 MB | Free: 1.2 GB │
└─────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────┐ 0x0000000600000000
│ BSS/Data Segment │
│ .data 0x0000000600000000-0x0000000600001000 rw-│
│ .bss 0x0000000600001000-0x0000000600002000 rw-│
└─────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────┐ 0x0000000400000000
│ Code Segment │
│ .text 0x0000000400000000-0x0000000400010000 r-xp│
└─────────────────────────────────────────────────┘
Plus detailed metrics:
$ vmmap --pid=1234 --detailed
[stack] 0x7fffffffde00-0x7ffffffff000 rw- Size: 8.5 MB (RSS: 64 KB)
[vvar] 0x7fffffe00000-0x7fffffe02000 r--p Size: 8 KB (RSS: 8 KB)
[heap] 0x5555557c1000-0x555555942000 rw- Size: 1.8 MB (RSS: 320 KB)
[text] 0x555555554000-0x555555667000 r-xp Size: 1.1 MB (RSS: 1.1 MB)
The Core Question You’re Answering
“What does the virtual address space of a running process actually look like, and how does the kernel organize it?”
Before coding, understand that every process has a complete (64-bit) address space, but only parts are actually mapped to physical memory. The kernel’s page tables determine what’s valid.
Concepts You Must Understand First
- Virtual Address Space Layout
- Stack (grows down), heap (grows up), code, data, libraries
- Address Space Layout Randomization (ASLR): why are addresses different each run?
- Book Reference: CS:APP 3e Ch. 9, Sec. 9.9-9.10
- Page Maps and Segments
- What does each line in /proc/self/maps mean?
- Permission flags: r/w/x, p (private) / s (shared)
- Book Reference: man proc(5), the /proc/[pid]/maps entry
- Memory Regions and Mapping
- Executable sections vs data sections vs BSS
- mmap’d files and anonymous memory
- Book Reference: CS:APP 3e Ch. 9, Sec. 9.2
- Symbol Information
- Reading symbol tables from binaries to identify function boundaries
- Debug symbols vs stripped binaries
- Book Reference: CS:APP 3e Ch. 7, Sec. 7.3
Questions to Guide Your Design
- Address Space Parsing: What information does /proc/self/maps provide? How do you extract the start address, end address, and permissions?
- Memory Regions: How do you distinguish code (.text) from data (.data) from heap from stack?
- Symbol Resolution: Given a virtual address, how do you find which function it belongs to?
- ASLR Impact: If you run the same program twice, why are the addresses different? How does ASLR work?
Thinking Exercise
Memory Region Identification
$ cat /proc/self/maps
555555554000-555555667000 r-xp ... /usr/bin/prog
555555867000-555555869000 rw-p ... /usr/bin/prog
7ffff7ddc000-7ffff7e02000 r-xp ... /lib64/ld-linux
7ffff7dd2000-7ffff7ddc000 rw-p ... /lib64/ld-linux
555555869000-55555588a000 rw-p ... [heap]
7ffffffde000-7ffffffff000 rw-p ... [stack]
Questions:
- Which region is the code (.text) of the program?
- Which region is the heap?
- Why does /lib64/ld-linux appear in two regions, one r-xp and one rw-p?
The Interview Questions They’ll Ask
- “Explain the virtual address space layout of a 64-bit process. Where does code, stack, and heap live?”
- “What is ASLR, and how does it affect debugging and security?”
- “If you map a file with mmap, where does it appear in the address space?”
- “What does RSS (Resident Set Size) mean vs VSZ (Virtual Size) in /proc/[pid]/smaps?”
- “How do you find which shared library a function comes from, given its address?”
Hints in Layers
Hint 1: Parse /proc/[pid]/maps
Read the maps file for any process. Extract fields: start address, end address, permissions, offset, device, inode, pathname. Format as a table.
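A sketch of the parser, run against the tool’s own process for simplicity. The sscanf format assumes the usual field layout documented in proc(5); the pathname field is optional, hence the n >= 6 check.

```c
#include <stdio.h>

int main(void) {
    /* Typical line:
     * 555555554000-555555667000 r-xp 00000000 103:02 131 /usr/bin/prog */
    FILE *f = fopen("/proc/self/maps", "r");
    if (!f) return 1;

    char line[512];
    while (fgets(line, sizeof(line), f)) {
        unsigned long start, end, offset, inode;
        char perms[5], dev[12], path[256];
        path[0] = '\0';                      /* pathname may be absent */
        int n = sscanf(line, "%lx-%lx %4s %lx %11s %lu %255s",
                       &start, &end, perms, &offset, dev, &inode, path);
        if (n >= 6)
            printf("%14lx-%-14lx %s %8lu KB  %s\n",
                   start, end, perms, (end - start) / 1024,
                   path[0] ? path : "[anon]");
    }
    fclose(f);
    return 0;
}
```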
Hint 2: Categorize Memory Regions
- If the pathname starts with /, it’s a file mapping (executable, library, etc.)
- If the pathname is [heap], [stack], or [vvar], use those labels
- If the pathname is empty and the mapping is private, it’s likely anonymous memory (e.g. malloc arenas or thread stacks)
Hint 3: Visualize the Address Space
Create a large ASCII diagram showing the virtual memory layout. Scale the sizes appropriately. Show permissions and memory usage (RSS) for each region.
Hint 4: Add Symbol Resolution
Use dlopen/dlsym or ELF parsing to find symbols. Given an address, binary search through symbol tables to find the nearest symbol (function name).
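Before writing a full ELF symbol-table search, dladdr is a quick way to sanity-check the idea: it maps an address to the nearest dynamic symbol and its containing object. A sketch (link with -ldl on older glibc):

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

int main(void) {
    Dl_info info;
    void *addr = (void *)&printf;    /* any address you want to resolve */
    if (dladdr(addr, &info) && info.dli_sname)
        printf("%p -> %s in %s\n", addr, info.dli_sname, info.dli_fname);
    /* Note: dladdr only sees dynamic symbols; for static symbols you
     * still need to parse .symtab yourself. */
    return 0;
}
```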
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Virtual Address Space | CS:APP 3e | Ch. 9, Sec. 9.9-9.10 |
| Memory Mapping | CS:APP 3e | Ch. 9, Sec. 9.2 |
| /proc filesystem | man proc(5) | /proc/[pid]/maps |
| ASLR & Security | CS:APP 3e | Ch. 3, Sec. 3.13 |
Common Pitfalls and Debugging
Problem 1: “Maps file shows different addresses each time I run”
- Why: ASLR is enabled; kernel randomizes memory layout
- Fix: This is normal behavior; disable ASLR only for debugging with setarch x86_64 -R
- Quick test: Run without ASLR; addresses should be consistent
Problem 2: “Can’t identify which region is heap vs stack”
- Why: Heap and stack might have same permissions; need to look at pathname
- Fix: The stack usually has the [stack] label and the heap the [heap] label; otherwise, infer from address and growth direction
- Quick test: Allocate memory, run vmmap again; see which region grew
Problem 3: “Symbol resolution shows wrong function names”
- Why: Symbols might be stripped, or you’re matching against wrong binary version
- Fix: Use readelf -s to inspect symbols; ensure you’re reading the correct binary file
- Quick test: objdump -t /path/to/binary | grep address
Problem 4: “RSS size doesn’t match actual memory usage”
- Why: RSS includes shared pages; memory shared with other processes counted in each
- Fix: Use PSS (Proportional Set Size) for more accurate accounting (in /proc/[pid]/smaps)
- Quick test: Check /proc/[pid]/smaps instead of /proc/[pid]/maps
Definition of Done
- Parse /proc/[pid]/maps and extract all fields correctly
- Categorize regions (code, data, heap, stack, mmap’d files)
- Visualize address space as ASCII diagram with proper scaling
- Show memory permissions and sizes for each region
- Read /proc/[pid]/smaps for per-region memory usage (RSS)
- Resolve symbols to identify function names at given addresses
- Handle ASLR correctly (show that addresses vary between runs)
- Support interactive commands: list regions, show details, follow heap growth, etc.
Project 14: Build Your Own Malloc
- File: CSAPP_BUILD_MALLOC.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Heap Management / Memory Allocation / Data Structures
- Software or Tool: sbrk, mmap, heap inspectors
- Main Book: Computer Systems: A Programmer’s Perspective, 3rd Edition by Bryant & O’Hallaron
What you’ll build: A complete memory allocator (mymalloc) that implements malloc, realloc, and free with implicit free lists, explicit free lists, or segregated free lists. Includes fragmentation analysis and performance benchmarking.
Why it teaches memory management: Malloc is invisible to most programmers, but building one forces you to confront: how does the heap grow? How does free list fragmentation waste memory? Why does poor allocation strategy kill performance? This project connects abstract memory allocation to concrete data structures.
Core challenges you’ll face:
- Free list management and coalescing → maps to understanding fragmentation and defragmentation
- Alignment and padding → maps to understanding hardware constraints and compiler layouts
- Splitting and merging blocks → maps to understanding data structure operations under constraints
- Performance vs fragmentation tradeoffs → maps to understanding engineering tradeoffs
Real World Outcome
You’ll produce a malloc library that passes test cases:
$ ./mymalloc_test
[ALLOC] malloc(1024) -> 0x5555557c1000
[ALLOC] malloc(512) -> 0x5555557c1400
[ALLOC] malloc(256) -> 0x5555557c1600
[FREE] free(0x5555557c1400) -> OK
[REALLOC] realloc(0x5555557c1000, 2048) -> 0x5555557c2000
[STATS] Total allocated: 3840 bytes, Fragmentation: 12%, Peak memory: 5.2 KB
Performance Benchmark:
Allocate 10000 random sizes: 2.3 ms
Free all: 1.1 ms
Allocate + Free 5000 pairs: 4.2 ms
System malloc (reference): 5.1 ms
Plus detailed diagnostics:
$ ./heap_inspector --pid=1234
Heap Analysis for PID 1234:
┌─────────────────────────────────┐
│ Block 1: 1024 bytes [ALLOCATED] │ 0x1000000
│ Block 2: 512 bytes [FREE] │ 0x1000400
│ Block 3: 256 bytes [ALLOCATED] │ 0x1000600
│ Block 4: 3200 bytes [FREE] │ 0x1000700
└─────────────────────────────────┘
Fragmentation: 12% (largest free: 3200 bytes, total free: 3712 bytes)
The Core Question You’re Answering
“How does malloc manage the heap, and what happens when memory is fragmented?”
Before coding, understand that malloc is a data structure problem: you have a big region of memory, you need to track which parts are allocated, and you need to find free space for new allocations.
Concepts You Must Understand First
- Heap Organization & Free Lists
- Implicit free lists (boundary tags), explicit free lists, segregated lists
- Why free lists exist and how they’re traversed
- Book Reference: CS:APP 3e Ch. 9, Sec. 9.9
- Block Coalescing & Splitting
- Immediate vs deferred coalescing (immediate is simpler, deferred reduces overhead)
- How to split a free block when allocation is smaller
- Why coalescing matters for fragmentation
- Book Reference: CS:APP 3e Ch. 9, Sec. 9.9
- Alignment and Padding
- Why 8-byte or 16-byte alignment is required
- How to calculate padding
- Trade-off between alignment and fragmentation
- Book Reference: CS:APP 3e Ch. 3, Sec. 3.9
- Fragmentation & Policies
- External fragmentation: free space scattered, can’t be used
- Internal fragmentation: allocated block bigger than request (padding)
- First-fit, best-fit, worst-fit allocation strategies
- Book Reference: CS:APP 3e Ch. 9, Sec. 9.9
Questions to Guide Your Design
- Block Representation: How do you store metadata (size, allocated/free, next block) without using extra space outside the heap?
- Finding Free Space: Given a request for N bytes, how do you find a suitable free block? (First-fit, best-fit?)
- Coalescing: When you free a block, how do you find adjacent free blocks to merge with?
- Alignment: If you allocate 13 bytes, how much space do you actually use? (Should be rounded to 16 or 24)
Thinking Exercise
Block Representation with Boundary Tags
Free block: [header: size|alloc] [data...] [footer: size|alloc]
Alloc block: [header: size|alloc] [data...] (no footer, saves space)
Example: Allocate 256 bytes from a 1024-byte free block
Before:
[1024|free] [1008 bytes of payload...] [1024|free]
After (allocating a 256-byte block):
[256|alloc] [248 bytes of payload...]
[768|free] [752 bytes of payload...] [768|free]
(With 8-byte headers and footers: the free block keeps header + footer, the allocated block keeps only a header, and 256 + 768 = 1024.)
Questions:
- How do you know where the next block starts?
- Why include a footer in free blocks but not allocated ones?
- What size should the header be to store both size and alloc bit?
The Interview Questions They’ll Ask
- “What’s the difference between external and internal fragmentation? Can you eliminate one without increasing the other?”
- “In a boundary-tag allocator, why do free blocks need footers but allocated blocks don’t?”
- “What’s the trade-off between first-fit, best-fit, and worst-fit allocation policies?”
- “How does coalescing reduce fragmentation? What’s the difference between immediate and deferred coalescing?”
- “If your malloc is slower than system malloc, what optimizations would you try?”
Hints in Layers
Hint 1: Implement a Simple Implicit Free List
Use boundary tags (headers with size and alloc bit). Store the list sequentially in the heap. For allocation, walk the heap linearly to find a free block. For freeing, mark the block as free. Don’t worry about coalescing yet.
Hint 2: Add Coalescing
When freeing, check the previous and next blocks. If they’re free, merge with them. To find the previous block, either store a footer in all blocks (expensive) or maintain a doubly-linked list.
Hint 3: Optimize with Explicit Free Lists
Instead of walking all blocks (allocated and free), maintain a list of only the free blocks. This speeds up allocation. Use first-fit: take the first free block that fits.
Hint 4: Benchmark and Analyze
Create test patterns: rapid allocate/free, pathological fragmentation patterns, realistic workloads. Measure throughput and memory usage. Identify bottlenecks.
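For Hints 1-2, the boundary-tag bookkeeping usually reduces to a handful of macros in the style of CS:APP’s example allocator (Sec. 9.9), here adapted to 8-byte words and 16-byte alignment (both assumptions):

```c
#include <stdint.h>

#define WSIZE 8                       /* word / header / footer size */

/* A header or footer is one word: block size with the alloc bit in bit 0.
 * Sizes are multiples of 16, so the low 4 bits are free for flags. */
#define PACK(size, alloc)   ((size) | (alloc))
#define GET(p)              (*(uint64_t *)(p))
#define PUT(p, val)         (*(uint64_t *)(p) = (val))
#define GET_SIZE(p)         (GET(p) & ~(uint64_t)0xf)
#define GET_ALLOC(p)        (GET(p) & 0x1)

/* bp points at the payload; the header sits one word before it. */
#define HDRP(bp)            ((char *)(bp) - WSIZE)
#define FTRP(bp)            ((char *)(bp) + GET_SIZE(HDRP(bp)) - 2 * WSIZE)
#define NEXT_BLKP(bp)       ((char *)(bp) + GET_SIZE(HDRP(bp)))
```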
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Heap Allocation | CS:APP 3e | Ch. 9, Sec. 9.9 |
| Memory Allocator Design | The Craft of System Design | Ch. 7 |
| Fragmentation Analysis | Research papers | “Malloc Fragmentation” |
Common Pitfalls and Debugging
Problem 1: “Fragmentation builds up; can’t allocate even though free space exists”
- Why: Not coalescing adjacent free blocks properly
- Fix: After free, immediately merge with adjacent free blocks
- Quick test: Allocate/free in patterns; use heap_inspector to visualize fragmentation
Problem 2: “Alignment issues; crashes on 32-byte SIMD operations”
- Why: Pointers not properly aligned to 8 or 16 bytes
- Fix: Round all sizes up to nearest multiple of 8/16; ensure header doesn’t break alignment
- Quick test: Allocate odd sizes (17, 23 bytes); check alignment of returned pointer
Problem 3: “Realloc doesn’t work; data is corrupted”
- Why: Realloc tries to expand in-place without checking adjacent block; should move if can’t expand
- Fix: Check if next block is free and large enough; otherwise allocate new, copy, free old
- Quick test: Realloc with increasing sizes; verify data integrity
Problem 4: “Performance is terrible; 10x slower than system malloc”
- Why: Linear search for free block, or excessive coalescing overhead
- Fix: Use an explicit free list or segregated lists so allocation scans only free blocks; revisit your fit policy once the data structure is fixed
- Quick test: Profile; identify hotspots; switch to more efficient data structure
Definition of Done
- Implement malloc that allocates memory and tracks allocation state
- Implement free that marks memory as free
- Coalesce adjacent free blocks to reduce fragmentation
- Implement realloc correctly (expand in-place or move)
- Ensure correct alignment (8-byte or 16-byte)
- Measure and document fragmentation rate
- Benchmark against system malloc; be within 2x on throughput
- Pass comprehensive test suite including stress tests and edge cases
Project 15: Robust Unix I/O Toolkit
- File: CSAPP_ROBUST_IO_TOOLKIT.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 2: Intermediate
- Knowledge Area: I/O Operations / System Programming / Error Handling
- Software or Tool: read, write, select, poll, epoll
- Main Book: Computer Systems: A Programmer’s Perspective, 3rd Edition by Bryant & O’Hallaron
What you’ll build: A robust I/O library with wrappers for read/write that handle short counts and signals correctly. Includes: (1) rio_readn/rio_writen for guaranteed I/O, (2) rio_readlineb for buffered reading, (3) signal-safe helpers, (4) multiplexing support with select/epoll.
Why it teaches I/O robustness: System I/O is unreliable—read might return fewer bytes than requested, signals can interrupt syscalls. Most code ignores this and crashes. You’ll learn to write code that actually works on real systems.
Core challenges you’ll face:
- Partial I/O (short counts) → maps to understanding buffering and retries
- EINTR and signal interruption → maps to understanding async system behavior
- Buffered vs unbuffered I/O → maps to understanding performance tradeoffs
- Multiplexing multiple connections → maps to understanding event-driven programming
Real World Outcome
You’ll create a library tested on real I/O patterns:
$ ./rio_test
[TEST] rio_readn: Read 1MB in 8-byte chunks
- Expected: 131072 reads
- Actual: 131072 (correct!)
- Signals received: 5,847 (all handled)
[TEST] rio_writen: Write 1MB across pipe with limited buffer
- Expected: 1048576 bytes written
- Actual: 1048576 (correct!)
- Write calls: 847 (averaged 1.2KB per syscall)
[TEST] Buffered reading with signals
- Read 100 lines (8KB total)
- SIGALRM sent during reads: 23 times
- All data recovered: YES
Plus a full echo server using rio_readlineb:
$ ./echo_server 8000 &
$ telnet localhost 8000
Hello world
Hello world
[server echoed back]
^C
The Core Question You’re Answering
“How do you write I/O code that works reliably despite signals, short counts, and buffering?”
Before coding, understand: system I/O is not transactional. A read for 1KB might return 1 byte. A write for 1KB might write 512 bytes then return. Signals can interrupt any syscall. Ignoring these facts leads to data loss.
Concepts You Must Understand First
- Short Counts & Partial I/O
- read returns fewer bytes than requested (not an error)
- write might write only part of the buffer (not an error)
- Must retry in a loop until done
- Book Reference: CS:APP 3e Ch. 10, Sec. 10.2
- EINTR and Signal Interruption
- Signals can interrupt syscalls with EINTR
- Some syscalls auto-restart on EINTR (SA_RESTART), others don’t
- Must handle both cases
- Book Reference: CS:APP 3e Ch. 8, Sec. 8.6
- Buffering Strategies
- Unbuffered: read/write every byte (slow)
- Line-buffered: buffer until newline (useful for text)
- Fully buffered: buffer in blocks (fastest)
- Book Reference: CS:APP 3e Ch. 10, Sec. 10.1
- Multiplexing Connections
- select/poll/epoll: monitor multiple file descriptors
- When to use each (performance, portability)
- Book Reference: CS:APP 3e Ch. 12, Sec. 12.2
Questions to Guide Your Design
- Short Counts: If read(fd, buf, 1000) returns 50, what should the caller do?
- EINTR Handling: If write returns -1 with errno == EINTR, was any data written? (No: a partial transfer returns a count instead, so retry the whole write)
- Buffering: When reading lines, when should you flush? (When full, on newline, on error?)
- Multiplexing: How do you efficiently handle 10,000 simultaneous connections?
Thinking Exercise
Short-Count Error
// WRONG CODE
int n = read(fd, buf, 1000);
if (n < 0) perror("error");
// Now buf definitely has all data... WRONG!
What if read returns 50? The code thinks buf has 1000 bytes. Crash!
Questions:
- How do you fix this? (Loop until you’ve read everything you need, or read returns 0)
- What does it mean when read returns 0? (EOF)
- How do you handle a partial write?
The Interview Questions They’ll Ask
- “What’s a short count, and why does it happen? How do you handle it?”
- “Why does write return early, and what are the consequences for your code?”
- “What’s the difference between auto-restarting and non-restarting syscalls? Give examples.”
- “Design a line-reading function that buffers input efficiently. How do you handle EINTR?”
- “Why is select better than busy-polling 10,000 file descriptors marked non-blocking with fcntl?”
Hints in Layers
Hint 1: Implement rio_readn
Wrapper for read that loops until all N bytes are read or EOF. Retry on EINTR; on a short count, advance the buffer pointer and keep looping.
Hint 2: Implement rio_writen
Wrapper for write that loops until all N bytes are written. Handle EINTR by retrying. Handle short counts by adjusting buffer pointer and length.
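Sketches of both wrappers, following the retry pattern of CS:APP’s RIO package: treat EINTR as “try again” and a short count as “advance and continue”.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

ssize_t rio_readn(int fd, void *usrbuf, size_t n) {
    size_t nleft = n;
    char *bufp = usrbuf;
    while (nleft > 0) {
        ssize_t nread = read(fd, bufp, nleft);
        if (nread < 0) {
            if (errno == EINTR) continue;  /* interrupted: just retry */
            return -1;                     /* real error */
        }
        if (nread == 0) break;             /* EOF */
        nleft -= nread;
        bufp  += nread;
    }
    return n - nleft;                      /* bytes actually read */
}

ssize_t rio_writen(int fd, const void *usrbuf, size_t n) {
    size_t nleft = n;
    const char *bufp = usrbuf;
    while (nleft > 0) {
        ssize_t nwritten = write(fd, bufp, nleft);
        if (nwritten < 0) {
            if (errno == EINTR) continue;  /* nothing was written: retry */
            return -1;
        }
        nleft -= nwritten;                 /* short count: advance, loop */
        bufp  += nwritten;
    }
    return n;
}
```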
Hint 3: Add Buffered Line Reading
Implement rio_readlineb with internal buffer. Read chunks with rio_readn, return one line at a time. Handle lines split across read boundaries.
Hint 4: Implement select-based Multiplexer
Create a simple server that handles multiple clients with select. Demonstrate that one thread can serve many clients if they’re not all active simultaneously.
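A sketch of the select loop. accept_new and handle_client are hypothetical stand-ins for your accept and per-client I/O code; note that select is capped at FD_SETSIZE descriptors, which is one motivation for epoll.

```c
#include <sys/select.h>
#include <unistd.h>

int accept_new(int listenfd);   /* hypothetical: returns new fd or -1 */
int handle_client(int fd);      /* hypothetical: returns 0 on EOF */

void event_loop(int listenfd) {
    fd_set all, ready;
    FD_ZERO(&all);
    FD_SET(listenfd, &all);
    int maxfd = listenfd;

    for (;;) {
        ready = all;                      /* select() mutates its argument */
        if (select(maxfd + 1, &ready, NULL, NULL, NULL) < 0)
            continue;                     /* e.g. EINTR: just retry */
        for (int fd = 0; fd <= maxfd; fd++) {
            if (!FD_ISSET(fd, &ready))
                continue;
            if (fd == listenfd) {         /* new connection */
                int c = accept_new(listenfd);
                if (c >= 0) {
                    FD_SET(c, &all);
                    if (c > maxfd) maxfd = c;
                }
            } else if (handle_client(fd) == 0) {   /* client closed */
                close(fd);
                FD_CLR(fd, &all);
            }
        }
    }
}
```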
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| I/O Basics | CS:APP 3e | Ch. 10, Sec. 10.1-10.2 |
| Network I/O | CS:APP 3e | Ch. 11, Sec. 11.4-11.5 |
| Signal Safety | CS:APP 3e | Ch. 8, Sec. 8.6 |
| Multiplexing | CS:APP 3e | Ch. 12, Sec. 12.2 |
| Low-Level read/write Semantics | APUE (3e) | Ch. 3 |
Common Pitfalls and Debugging
Problem 1: “Signals cause silent data loss”
- Why: Short count from EINTR not handled; code assumes a full read/write
- Fix: Always loop; handle EINTR explicitly, e.g. retry while (n < 0 && errno == EINTR)
- Quick test: Send SIGALRM during a read; verify all data still arrives
Problem 2: “Buffered reader loses data between buffer fills”
- Why: Newline at buffer boundary; incomplete handling of wrap-around
- Fix: Properly track position in buffer; handle lines split across boundaries
- Quick test: Send lines of exactly buffer size; send many tiny lines
Problem 3: “Server doesn’t detect client disconnect”
- Why: EOF condition not recognized (read returns 0)
- Fix: Check for n == 0; close the connection; remove it from the multiplexer
- Quick test: Telnet to the server, kill the client suddenly; verify the server detects it
Problem 4: “epoll doesn’t work as expected on certain systems”
- Why: Platform-specific differences or incorrect edge-triggered use
- Fix: Ensure using level-triggered or handle correctly; test on actual target
- Quick test: Move code between Linux/BSD; verify behavior
Definition of Done
- Implement rio_readn that handles short counts and EINTR
- Implement rio_writen that handles short counts and EINTR
- Implement rio_readlineb with internal buffering
- Handle all signal interruptions correctly
- Create test suite with intentional EINTR injection
- Demonstrate line-reading with signals firing during reads
- Implement a multiplexing server with select and/or epoll
- Handle client disconnects correctly
- Document all edge cases and test patterns
Project 16: Concurrency Workbench
- File: CSAPP_CONCURRENCY_WORKBENCH.md
- Main Programming Language: C
- Alternative Programming Languages: C++, Rust, Go
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: Concurrency / Threading / Synchronization
- Software or Tool: Pthreads / Valgrind / Helgrind
- Main Book: CS:APP 3e (Chapter 12)
What you’ll build: A benchmark framework that compares three concurrency models (thread-per-request, thread pool, and event-driven multiplexing) using identical workloads, measuring throughput, latency, and resource usage.
Why it teaches concurrency: Building three different concurrency models side-by-side forces you to understand the tradeoffs: threads are simple but expensive, thread pools are practical but complex, and multiplexing is efficient but non-blocking. You’ll see why different models win for different workloads.
Core challenges you’ll face:
- Thread safety without deadlocks → maps to mutex ordering and condition variables
- Correct shutdown and cleanup → maps to graceful termination and resource deallocation
- Fair comparison between models → maps to equivalent load generation and measurement
- Detecting race conditions → maps to Helgrind and ThreadSanitizer
Key Concepts:
- Thread-Per-Request Model: APUE (Stevens & Rago), Ch. 11 (Threads)
- Thread Pool Pattern: Effective Java (Item 81) - Joshua Bloch
- Select/Poll/Epoll Multiplexing: CS:APP Ch. 12, Sec. 12.2
- Synchronization Primitives: CS:APP Ch. 12, Sec. 12.3
- Race Condition Detection: Valgrind Helgrind user manual
Difficulty: Advanced
Time estimate: 2-3 weeks
Prerequisites: You must understand basic threading, mutex/semaphore concepts, the fork/exec model, and select/epoll basics. Complete Projects 11-15 first to fully appreciate the contrasts.
Real World Outcome
You’ll produce a benchmarking suite that runs identical work across three concurrency models and prints a comparative report.
$ ./concurrency_bench --workload=http --clients=1000 --duration=10
====== CONCURRENCY WORKBENCH ======
[Workload] HTTP request handler
[Duration] 10 seconds
[Load] 1000 concurrent clients
--- Thread-Per-Request Model ---
[INIT] Spawning one thread per connection (up to 1000)...
[SUMMARY] Throughput: 45,230 req/sec
[SUMMARY] Avg latency: 22.1 ms
[SUMMARY] Max latency: 1,250 ms
[SUMMARY] Memory: 125 MB (each thread ~100KB stack)
[WARNING] Thread creation took 2.5 seconds
--- Thread Pool Model (16 workers) ---
[INIT] Starting 16 worker threads with queue...
[SUMMARY] Throughput: 89,450 req/sec
[SUMMARY] Avg latency: 11.2 ms
[SUMMARY] Max latency: 450 ms
[SUMMARY] Memory: 3.2 MB (fixed overhead)
[STATS] Queue depth: max 45, average 8
--- Event-Driven Model (epoll) ---
[INIT] Starting epoll reactor...
[SUMMARY] Throughput: 127,890 req/sec
[SUMMARY] Avg latency: 7.8 ms
[SUMMARY] Max latency: 320 ms
[SUMMARY] Memory: 1.5 MB (minimal per-connection)
[STATS] Epoll FD count: 1001
====== COMPARISON ======
Winner: Event-Driven (2.8x faster, 80x less memory)
Tradeoff: Epoll harder to program correctly
The Core Question You’re Answering
“When should I use threads vs thread pools vs select/epoll, and what does each model trade off?”
Before building three models, you realize most developers pick one and never see the tradeoffs. This project forces you to measure them directly. You’ll see that threads are simple, thread pools are practical, and multiplexing is king for I/O-bound work.
Concepts You Must Understand First
Stop and research these before coding:
- Pthreads API and Synchronization
- What do pthread_create and pthread_join do?
- Why do you need pthread_mutex_lock and pthread_cond_wait?
- What is a condition variable and when do you use it?
- Book Reference: APUE (Stevens & Rago), Ch. 11
- Thread Pool Pattern
- How does a bounded work queue prevent unbounded thread growth?
- What happens when the queue is full?
- How do worker threads cleanly shut down?
- Book Reference: “Java Concurrency in Practice” - Goetz et al. - Ch. 8
- Select/Poll/Epoll Multiplexing
- What is the advantage of select over thread-per-connection?
- Why is epoll better than select for thousands of connections?
- What is edge-triggered vs level-triggered epoll?
- Book Reference: CS:APP 3e - Ch. 12, Sec. 12.2
- Synchronization Pitfalls
- What causes a deadlock? (lock ordering, hold-and-wait)
- Why can spurious wakeups happen with condition variables?
- How do you ensure all threads wake up at shutdown?
- Book Reference: CS:APP 3e - Ch. 12, Sec. 12.3
Questions to Guide Your Design
Before implementing, think through these:
- Load Generation
- How will you generate equivalent load for all three models? (same requests, same timing)
- How will you measure “started” vs “completed” for each?
- Should you use blocking or non-blocking requests?
- Thread-Per-Request Model
- How will you handle 1000 threads on a typical system? (each needs ~2MB stack)
- When do you stop creating threads?
- How do you gracefully shut down all 1000 threads?
- Thread Pool Model
- How big should the worker pool be? (cores? oversubscribed?)
- What is the queue capacity? (bounded or unbounded?)
- What happens when the queue fills? (block, reject, dynamic resize?)
- Event-Driven Model
- How do you handle 1000 non-blocking sockets?
- How do you detect client disconnect in epoll?
- How do you ensure no work is starved?
- Measurement
- How will you measure throughput without skewing results?
- How will you get accurate latency percentiles (p50, p99)?
- How will you measure memory usage (RSS, VSZ)?
Thinking Exercise
Draw Three Concurrency Models
Before coding, draw or write out:
Model 1: Thread-Per-Request
┌─────────────────────────────────────┐
│ Accepting Thread │
├─────────────────────────────────────┤
│ accept() -> spawn pthread_create() │
│ ↓ │
│ [Worker Thread 1] [Worker 2] [3]... │
│ ↓ ↓ │
│ [Socket] [Socket] │
└─────────────────────────────────────┘
Model 2: Thread Pool
┌─────────────────────────────────────┐
│ [Worker 1] [Worker 2] ... [Worker N]
│ ↓ ↓ ↓ │
│ ┌──────────────────────┐ │
│ │ Work Queue │ │
│ │ (Bounded, Locked) │ │
│ └──────────────────────┘ │
│ ↑ │
│ Acceptor Thread (feeds queue) │
└─────────────────────────────────────┘
Model 3: Event-Driven
┌─────────────────────────────────────┐
│ Reactor Thread │
│ ┌──────────────────────────────┐ │
│ │ epoll_wait() with timeout │ │
│ │ │ │
│ │ for each ready socket: │ │
│ │ - read/write (non-block) │ │
│ │ - schedule callback │ │
│ └──────────────────────────────┘ │
│ Handles 1000s of sockets │
└─────────────────────────────────────┘
Questions while thinking:
- In Model 1, if you create 1000 threads, what’s the total memory cost? (1000 × 2MB default stacks = 2GB of reserved address space; resident memory is lower but still substantial)
- In Model 2, what happens if 1000 requests arrive but only 16 workers exist? (queue grows)
- In Model 3, how does epoll know when a socket is ready without polling?
The Interview Questions They’ll Ask
Prepare to answer these:
- “Why would you choose threads over multiplexing for a web server?”
- “What’s the memory overhead of creating 1000 threads vs a thread pool of 16?”
- “How do you handle client disconnects in an epoll reactor?”
- “What does ‘thundering herd’ mean, and when is it a problem?”
- “Explain the difference between select, poll, and epoll to someone who’s never used them.”
- “If you have 1000 concurrent connections but only 16 threads, how do you prevent starvation?”
Hints in Layers
Hint 1: Start with Thread-Per-Request
Build the simplest model first: accept a connection, spawn a thread, let the thread handle it. Measure how many concurrent threads it takes to degrade the system. This is your baseline.
Hint 2: Implement Thread Pool with Bounded Queue
Use a pthread_mutex_t to protect a linked list (or array) that acts as work queue. Worker threads call pthread_cond_wait() and wake when work arrives. Measure the throughput improvement.
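A sketch of that queue, storing plain ints (e.g. accepted fds) and using two condition variables; initialize the mutex and conds with PTHREAD_*_INITIALIZER or pthread_*_init before use.

```c
#include <pthread.h>

#define QCAP 64
typedef struct {
    int buf[QCAP];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
} queue_t;

void queue_push(queue_t *q, int item) {
    pthread_mutex_lock(&q->lock);
    while (q->count == QCAP)                       /* queue full: block */
        pthread_cond_wait(&q->not_full, &q->lock);
    q->buf[q->tail] = item;
    q->tail = (q->tail + 1) % QCAP;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

int queue_pop(queue_t *q) {
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)                          /* loop: spurious wakeups */
        pthread_cond_wait(&q->not_empty, &q->lock);
    int item = q->buf[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return item;
}
```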
Hint 3: Implement Epoll Reactor
Single thread calls epoll_wait() in a loop. For each ready socket, perform one non-blocking read/write. If you want async work, queue it to a thread pool. Measure the throughput again.
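A sketch of the reactor loop in level-triggered mode. accept_client and serve_once are hypothetical stand-ins for your accept and per-connection I/O.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

int accept_client(int listenfd);   /* hypothetical: returns new fd or -1 */
int serve_once(int fd);            /* hypothetical: 0 means client closed */

void reactor(int listenfd) {
    int ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listenfd };
    epoll_ctl(ep, EPOLL_CTL_ADD, listenfd, &ev);

    struct epoll_event ready[64];
    for (;;) {
        int n = epoll_wait(ep, ready, 64, -1);
        if (n < 0 && errno == EINTR) continue;     /* interrupted: retry */
        for (int i = 0; i < n; i++) {
            int fd = ready[i].data.fd;
            if (fd == listenfd) {                  /* new connection */
                int c = accept_client(listenfd);
                if (c >= 0) {
                    struct epoll_event cev = { .events = EPOLLIN, .data.fd = c };
                    epoll_ctl(ep, EPOLL_CTL_ADD, c, &cev);
                }
            } else if (serve_once(fd) == 0) {      /* EOF: drop the client */
                epoll_ctl(ep, EPOLL_CTL_DEL, fd, NULL);
                close(fd);
            }
        }
    }
}
```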
Hint 4: Add Measurement and Comparison
For each model, track: requests/sec, latency (min/avg/max/p99), memory usage. Print a table side-by-side. Explain why the winner wins.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Pthreads Basics | APUE (3e) | Ch. 11 |
| Mutex and Condition Variables | APUE (3e) | Ch. 11.6 |
| Thread Pools | Java Concurrency in Practice | Ch. 8 |
| Select/Poll/Epoll | CS:APP 3e | Ch. 12, Sec. 12.2 |
| Synchronization Pitfalls | CS:APP 3e | Ch. 12, Sec. 12.3 |
| Performance Analysis | Systems Performance (Gregg) | Ch. 1-3 |
Common Pitfalls and Debugging
Problem 1: “Threads created faster than they finish; memory explodes”
- Why: Thread-per-request model with no limit; each thread takes 1-2MB stack
- Fix: Add a cap on concurrent threads; reject connections if exceeded; measure memory growth
- Quick test: Run with 10,000 clients; watch top to see RSS grow; kill the program before OOM
Problem 2: “Thread pool workers deadlock or hang”
- Why: Incorrect lock ordering, or a forgotten pthread_cond_signal()/pthread_cond_broadcast() on shutdown
- Fix: Use Helgrind to detect lock issues; add an explicit shutdown signal to all workers
- Quick test: Kill benchmark with Ctrl+C; run with Helgrind; check for “possible data race”
Problem 3: “Epoll doesn’t detect all ready connections”
- Why: Incorrect understanding of edge-triggered vs level-triggered mode
- Fix: Ensure using level-triggered (default); handle EAGAIN on non-blocking read
- Quick test: Send data to socket; verify epoll fires; send more data while previous still buffered
Problem 4: “Latency measurements are wrong (missing high percentiles)”
- Why: Using naive average instead of tracking percentiles; missing outliers
- Fix: Store all latencies in an array; sort and take array[(int)(0.99 * n)] as p99
- Quick test: Add 10 random 5-second delays; see if your avg hides them but p99 catches them
Definition of Done
- Thread-Per-Request model runs and measures throughput/latency/memory
- Thread Pool model with 16 workers handles same load
- Event-driven epoll model runs and measures same metrics
- Comparison table clearly shows throughput and memory tradeoffs
- No race conditions detected by Helgrind on all three models
- Graceful shutdown for all three models (no hanging threads)
- Latency includes p50, p99, max (not just average)
- Load generator creates equivalent work for all models
- Document why each model wins/loses for your workload
- Code compiles with -Wall -Wextra -Werror and passes AddressSanitizer
Project 17: CS:APP Capstone — Secure, Observable, High-Performance Proxy
- File: CSAPP_CAPSTONE_PROXY.md
- Main Programming Language: C
- Alternative Programming Languages: C++, Go, Rust
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 4: Expert
- Knowledge Area: Networking / Concurrency / Caching / Debugging
- Software or Tool: Sockets / Pthreads / Valgrind / GDB
- Main Book: CS:APP 3e (Multiple chapters)
What you’ll build: A fully functional HTTP proxy that sits between clients and origin servers, forwarding requests, caching responses, and handling concurrent connections with zero data loss.
Why it teaches systems programming: This is the capstone because it forces you to integrate everything: sockets (Ch. 11), processes/threads (Ch. 8-12), memory management (Ch. 9), I/O robustness (Ch. 10), and debugging. You can’t hack your way through this; you need to understand every layer.
Core challenges you’ll face:
- Robust I/O and EINTR handling → maps to short counts, signals, buffering
- Thread-safe caching without deadlocks → maps to synchronization, lock ordering
- HTTP protocol parsing → maps to protocol layering, state machines
- Finding silent bugs with only logs → maps to debugging skills, Valgrind, GDB
- Production concerns (graceful shutdown, resource limits, observability) → maps to real-world engineering
Key Concepts:
- Socket Programming: CS:APP Ch. 11, Sec. 11.4-11.5
- Thread Synchronization: CS:APP Ch. 12, Sec. 12.3
- LRU Cache Implementation: Designing Data-Intensive Applications - Kleppmann
- HTTP Request/Response Format: RFC 7230 (HTTP/1.1)
- Parser State Machines: Crafting Interpreters - Nystrom (Ch. 4-5)
- Production Observability: The Art of Monitoring - Turnbull
- Debugging Real Systems: The Art of Debugging - Matloff & Salzman
Difficulty: Expert
Time estimate: 3-4 weeks
Prerequisites: You MUST have completed Projects 1-16. You need to understand sockets, threading, signals, memory management, and at minimum one full debugging workflow with gdb/valgrind. This is not for beginners.
Real World Outcome
You’ll have a proxy server that can handle real HTTP traffic, cache responses, and print detailed logs of everything happening inside.
$ ./proxy 8080 --cache-size=100MB --workers=8 --verbose
[STARTUP] Proxy listening on 127.0.0.1:8080
[STARTUP] Cache initialized (100 MB, LRU)
[STARTUP] Worker pool: 8 threads
[STARTUP] Ready to accept connections
$ curl -v -x http://localhost:8080 http://example.com/index.html
--- Client connects ---
[CONN] Accept from 127.0.0.1:54321
[PARSE] GET /index.html HTTP/1.1
[PARSE] Host: example.com
[PARSE] Connection: close
[CACHE] Check cache for example.com:index.html... MISS
[UPSTREAM] Connecting to example.com:80...
[UPSTREAM] Connected. Sending request...
[UPSTREAM] Reading response (200 OK, 15KB)
[CACHE] Storing in cache (TTL: 3600s)
[CLIENT] Sending 200 OK (15KB) back to client
[CONN] Client disconnected
$ curl -v -x http://localhost:8080 http://example.com/index.html (2nd request, same URL)
[CONN] Accept from 127.0.0.1:54322
[PARSE] GET /index.html HTTP/1.1
[CACHE] Check cache for example.com:index.html... HIT
[CLIENT] Sending cached 200 OK (15KB) back to client
[STATS] Cache hit rate: 100%
[CONN] Client disconnected
--- Shutdown ---
$ kill -TERM $(pgrep proxy)
[SIGNAL] Received SIGTERM, graceful shutdown...
[SHUTDOWN] Waiting for 2 in-flight requests...
[SHUTDOWN] Closed all sockets
[SHUTDOWN] Cache flushed to disk
[SHUTDOWN] Exited cleanly
The Core Question You’re Answering
“Can I build a real system that handles concurrent requests, caches efficiently, parses HTTP correctly, and recovers gracefully from failures?”
This is the litmus test: Can you apply everything from CS:APP to build something that actually works? You’ll discover that real systems need: robust I/O, careful synchronization, observability, and persistence. You’ll also discover how hard it is to debug concurrency bugs in production.
Concepts You Must Understand First
Stop and research these before coding:
- HTTP Protocol Basics
- What are the required headers in an HTTP/1.1 request? (Host, Connection, etc.)
- How do you parse a request line “GET /path HTTP/1.1”?
- What is the difference between “Connection: keep-alive” and “Connection: close”?
- Book Reference: RFC 7230 (IETF standard)
- Socket Programming at Scale
- How do you handle short reads/writes? (EINTR, small buffers)
- How do you gracefully close both client and server connections?
- How do you prevent resource exhaustion (too many open FDs)?
- Book Reference: CS:APP 3e - Ch. 11, Sec. 11.4-11.5 + APUE - Ch. 16
- Thread-Safe Caching with LRU
- How does an LRU cache decide which entry to evict?
- How do you protect concurrent reads/writes without serializing all requests?
- How do you handle cache invalidation?
- Book Reference: Designing Data-Intensive Applications - Kleppmann - Ch. 3-4
- HTTP State Machine
- What states does an HTTP request go through? (parsing, routing, sending)
- How do you handle partial requests (request arrives in chunks)?
- What is the correct order of reading headers, body, etc.?
- Book Reference: Crafting Interpreters - Nystrom - Ch. 4-5
- Signal-Safe Shutdown
- How do you gracefully shut down threads without data loss?
- How do you catch SIGTERM and finish in-flight requests before exiting?
- Book Reference: CS:APP 3e - Ch. 8, Sec. 8.6 + APUE - Ch. 10
- Debugging Concurrency with Valgrind
- How do you use Helgrind to find race conditions?
- How do you read Valgrind output to find the smoking gun?
- Book Reference: Valgrind User Manual (Helgrind chapter)
Questions to Guide Your Design
Before implementing, think through these:
- Request Parsing
- What’s the smallest valid HTTP request? (3 lines: request + Host + blank)
- How will you detect end of headers? (blank line = \r\n\r\n)
- What if a header spans multiple read() calls?
- Upstream Connection
- Should you reuse upstream connections (keep-alive) or close after each response?
- How do you handle upstream server slowness (buffering on your side)?
- What if upstream closes unexpectedly mid-response?
- Caching Strategy
- Which requests should you cache? (GET yes, POST no)
- Which responses should you cache? (200 yes, 500 no)
- How do you handle Expires/Cache-Control headers?
- What’s your cache eviction policy? (LRU? LFU?)
- Concurrency Model
- Thread pool or epoll reactor?
- Should each thread handle one client or multiple?
- How big should your thread pool be?
- Observability
- What should you log? (all requests? just errors?)
- Can you add per-request IDs to trace requests through logs?
- How do you measure cache hit rate?
- Graceful Shutdown
- How do you signal threads to stop accepting new work?
- How do you wait for in-flight requests to finish?
- How do you flush cache to disk?
Thinking Exercise
Design the Thread Model
Before coding, sketch out the flow:
┌─────────────────────────────────────┐
│ Acceptor Thread │
│ accept() → hand to worker pool │
└──────────────┬──────────────────────┘
│
v
┌──────────────────────────┐
│ Worker Thread (×8) │
│ 1. Read HTTP request │
│ 2. Parse request line │
│ 3. Check cache │
│ 4. If miss: upstream │
│ 5. Cache response │
│ 6. Send to client │
└──────────────────────────┘
│
┌──────────▼──────────┐
│ Shared LRU Cache │
│ (mutex protected) │
└─────────────────────┘
Questions while thinking:
- Who creates/destroys upstream connections?
- Who decides whether to cache (all threads, or a caching thread)?
- What if two threads request the same URL simultaneously?
The Interview Questions They’ll Ask
Prepare to answer these (this is where you get hired):
- “Walk me through how your proxy handles a request from start to finish.”
- “What happens if a client closes the connection mid-request?”
- “How would you debug a race condition in your cache without stopping the system?”
- “What’s the memory usage of your proxy with 1000 concurrent connections?”
- “How do you handle HTTP keep-alive? Is it worth the complexity?”
- “If your cache is full, how do you decide what to evict?”
- “Your logs show intermittent ‘Connection reset by peer’ errors. How do you debug this?”
- “Explain how you handle EINTR from signals. What can go wrong?”
Hints in Layers
Hint 1: Start with a Minimal Proxy
Parse a single request, forward it upstream, send response back. Hardcode one upstream server. Get this working with full EINTR handling before adding concurrency. Test with curl.
Hint 2: Add Thread Pool
Spawn 4 worker threads. Each waits on a queue of accepted connections. Use pthread_cond_wait() to sleep when queue is empty. Verify no hanging threads on shutdown.
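One possible shape for that worker loop, assuming a fixed-size ring buffer of accepted descriptors (`pool_t`, `handle_client`, and the field names are assumptions, not a fixed API):

```c
/* Illustrative worker loop over a ring buffer of accepted fds. The
 * while-loop recheck around pthread_cond_wait also absorbs spurious wakeups. */
#include <pthread.h>

#define QCAP 64
typedef struct {
    int fds[QCAP];
    int head, tail, count;
    int shutting_down;
    pthread_mutex_t lock;
    pthread_cond_t  not_empty;
} pool_t;

void handle_client(int connfd);               /* hypothetical request handler */

void *worker(void *arg) {
    pool_t *p = arg;
    for (;;) {
        pthread_mutex_lock(&p->lock);
        while (p->count == 0 && !p->shutting_down)
            pthread_cond_wait(&p->not_empty, &p->lock);  /* sleeps with lock released */
        if (p->count == 0 && p->shutting_down) {
            pthread_mutex_unlock(&p->lock);
            return NULL;                      /* queue drained: exit cleanly */
        }
        int connfd = p->fds[p->head];
        p->head = (p->head + 1) % QCAP;
        p->count--;
        pthread_mutex_unlock(&p->lock);
        handle_client(connfd);                /* do the slow work outside the lock */
    }
}
```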
Hint 3: Implement LRU Cache
Use a doubly linked list (or an array with separate metadata) to track recency order: on every hit, move the entry to the front so the tail is always the least recently used. Protect the whole structure with a single mutex. Cache GET responses keyed by (host, path). Test cache hits by requesting the same URL twice and checking that no second upstream request is made.
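A sketch of the mutex-protected, move-to-front bookkeeping that keeps the tail as the eviction victim (`entry_t`/`lru_t` are illustrative; insertion and eviction are left out, and copying under the lock avoids a use-after-evict):

```c
#include <pthread.h>
#include <string.h>

typedef struct entry {
    char key[256];                     /* e.g., "host:path" */
    char *body; size_t len;
    struct entry *prev, *next;
} entry_t;

typedef struct {
    entry_t *head, *tail;              /* head = most recent, tail = LRU */
    pthread_mutex_t lock;
} lru_t;

/* Returns bytes copied on a hit, 0 on a miss. */
size_t lru_get(lru_t *c, const char *key, char *out, size_t cap) {
    size_t n = 0;
    pthread_mutex_lock(&c->lock);
    for (entry_t *e = c->head; e; e = e->next) {
        if (strcmp(e->key, key) == 0) {
            if (e != c->head) {        /* unlink, then reinsert at the head */
                e->prev->next = e->next;
                if (e->next) e->next->prev = e->prev; else c->tail = e->prev;
                e->prev = NULL; e->next = c->head;
                c->head->prev = e; c->head = e;
            }
            n = e->len < cap ? e->len : cap;
            memcpy(out, e->body, n);   /* copy while still holding the lock */
            break;
        }
    }
    pthread_mutex_unlock(&c->lock);
    return n;
}
```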
Hint 4: Add Observability and Debugging
Log every significant action: accept, parse, cache hit/miss, upstream connect, upstream recv, client send, disconnect. Run under Helgrind and fix any race conditions it detects. Use the logs to find bugs: “Oh, the log shows the cache was checked, but the miss was never counted!”
Hint 5: Handle Real Edge Cases
- Partial reads (request arrives in 3 chunks)
- Slow upstream (response arrives slowly)
- Signals during read/write
- Client disconnect during upstream read
- Cache eviction under load
Hint 6: Graceful Shutdown
On SIGTERM: stop accepting, wait for in-flight requests to finish, close upstream connections, flush cache, exit. Use a shared shutdown_flag and condition variable to wake all threads.
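One way to wire that up, reusing the hypothetical `pool_t` from the Hint 2 sketch (the handler only sets a flag, which is async-signal-safe; the main thread notices it and runs the coordinated teardown):

```c
#include <pthread.h>
#include <signal.h>

static volatile sig_atomic_t got_sigterm = 0;
static void on_sigterm(int sig) { (void)sig; got_sigterm = 1; }

void shutdown_pool(pool_t *p, pthread_t *workers, int nworkers) {
    pthread_mutex_lock(&p->lock);
    p->shutting_down = 1;                     /* workers stop taking new fds */
    pthread_cond_broadcast(&p->not_empty);    /* wake every sleeping worker */
    pthread_mutex_unlock(&p->lock);
    for (int i = 0; i < nworkers; i++)
        pthread_join(workers[i], NULL);       /* in-flight requests finish */
    /* then: flush cache to disk, close upstream connections, exit */
}
```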
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| HTTP Protocol | RFC 7230 (IETF Standard) | Sections 3-4 |
| Socket Programming | CS:APP 3e | Ch. 11, Sec. 11.4-11.5 |
| Network I/O | APUE (3e) | Ch. 16 |
| Threading | CS:APP 3e | Ch. 12, Sec. 12.1-12.3 |
| Signal Handling | CS:APP 3e | Ch. 8, Sec. 8.6 |
| Cache Design | Designing Data-Intensive Apps | Ch. 3 |
| Parser Design | Crafting Interpreters | Ch. 4-5 |
| Debugging Concurrency | The Art of Debugging (Matloff & Salzman) | Ch. 6-7 |
| Production Practices | The Art of Monitoring (Turnbull) | Ch. 1-3 |
Common Pitfalls and Debugging
Problem 1: “Requests mysteriously lost or truncated”
- Why: Short counts from `read`/`write` not handled; buffer boundary issues
- Fix: Always loop until the full read/write completes or EOF; handle EINTR explicitly (see the `writen_all` sketch under Hint 1)
- Quick test: Send 100-byte request in 10-byte chunks; verify all received
Problem 2: “Cache causes random crashes or hangs”
- Why: Lock ordering, use-after-free in cache eviction, or spurious wakeup
- Fix: Run under Helgrind (`valgrind --tool=helgrind`); trace lock acquisitions
- Quick test: Hammer the proxy with 100 concurrent requests under Helgrind; check for data races
Problem 3: “Upstream connection closes but proxy keeps trying to send”
- Why: Connection close not detected; didn’t check for EPIPE or closed socket
- Fix: Check for `EPIPE` when writing; handle closed sockets gracefully
- Quick test: Kill the upstream server; send requests through the proxy; verify the error handling
Problem 4: “Cascading failures under high load”
- Why: Resource exhaustion (too many open files, thread pool saturated)
- Fix: Add resource limits; monitor `lsof` output; add backpressure
- Quick test: Send 10,000 concurrent requests; watch `/proc/self/fd`; ensure it doesn’t hit the ulimit
Problem 5: “Cache hit rate is suspiciously low”
- Why: Cache key collision or TTL set too short
- Fix: Log cache operations; verify hit/miss statistics; check if requests actually match
- Quick test: Request same URL 100 times; trace that it hits cache
Problem 6: “Helgrind reports race conditions but code looks correct”
- Why: Subtle lock ordering issue or missed synchronization
- Fix: Draw a lock dependency diagram; ensure no circular waits; check for unlocked reads
- Quick test: Add an explicit `pthread_mutex_lock` around suspected operations; re-run Helgrind
Definition of Done
- Parses HTTP requests (request line + headers) correctly
- Forwards requests to upstream server (HTTP/1.1)
- Receives responses and sends back to client
- Handles concurrent clients with thread pool (no crashes, no deadlocks)
- LRU cache works: stores responses, serves from cache, evicts on capacity
- Handles all EINTR and short read/write cases
- Graceful shutdown: stops accepting, finishes in-flight, exits cleanly
- No data loss: tested with intentional signal injection
- No race conditions: Helgrind clean
- Observability: logs show clear request flow with timestamps
- Survives stress test: 1000 concurrent requests, measured latency/throughput
- Code compiles with `-Wall -Wextra -Werror -fsanitize=address,undefined`
- Documented: README explains design choices and tradeoffs
Phase 2: Advanced Extensions
Project 18: ELF Linker and Loader
Real World Outcome
You will build myld, a linker that takes two .o files and produces a working executable.
$ ./myld main.o utils.o -o myprog
[LINKER] Merging .text sections... (Size: 1024 bytes)
[LINKER] Merging .data sections... (Size: 64 bytes)
[LINKER] Resolving symbol 'helper_func'... Found in utils.o at offset 0x20
[LINKER] Relocating call at 0x400105 -> 0x400220
[LOADER] Output written to 'myprog'.
$ ./myprog
Hello from linked code!
The Core Question You’re Answering
How does “Undefined Reference” become a valid memory address?
Concepts You Must Understand First
- ELF Headers (CS:APP 7.4) - Sections vs Segments.
- Relocation Entries (CS:APP 7.7) - `R_X86_64_PC32` vs. `R_X86_64_64`.
- Symbol Resolution (CS:APP 7.6) - Strong vs. weak symbols.
Questions to Guide Your Design
- How do you merge two `.text` sections into one? (Concatenation.)
- If you move `utils.o`’s code to follow `main.o`’s code, what happens to the addresses of functions in `utils.o`?
- What is the formula for a PC-relative relocation? ($Address_{target} - Address_{instruction} - 4$, worked through below.)
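Here is that formula applied to the addresses from the sample output above; the one-byte offset to the displacement field assumes a `call rel32` (opcode `0xE8`):

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t call_insn = 0x400105;       /* address of the 0xE8 opcode byte */
    uint64_t field     = call_insn + 1;  /* the 4-byte rel32 displacement field */
    uint64_t target    = 0x400220;       /* resolved address of the callee */
    /* rel32 is relative to the *next* instruction, i.e., field + 4 */
    int32_t rel32 = (int32_t)(target - field - 4);
    printf("rel32 = 0x%x\n", (unsigned)rel32);  /* 0x400220 - 0x400106 - 4 = 0x116 */
    return 0;
}
```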
Thinking Exercise
Draw two object files. A.o calls B. B.o defines B. When you put them together, B is no longer at offset 0. How do you calculate the number to put in the call instruction in A?
Definition of Done
- Can parse 64-bit ELF headers.
- Can merge `.text` and `.data` from two input files.
- Can process `R_X86_64_PC32` relocations to fix function calls.
- Produced binary runs on Linux.
Project 19: Virtual Memory Simulator
Real World Outcome
A CLI vmsim that replays memory traces and simulates page faults.
$ ./vmsim trace.txt --phys-mem=4 --algo=LRU
[ACCESS] R 0x1234 (VPN 1) -> MISS (Cold). Loaded to PPN 0.
[ACCESS] W 0x2345 (VPN 2) -> MISS (Cold). Loaded to PPN 1.
...
[ACCESS] R 0x9999 (VPN 9) -> MISS (Capacity). Evicting VPN 1 (LRU).
[STATS] Hits: 85, Misses: 15, Hit Rate: 85%
The Core Question You’re Answering
How does the OS choose which page to throw out when RAM is full, and how much does it matter?
Concepts You Must Understand First
- Paging & Page Tables (Primer Ch 6).
- Replacement Policies (CS:APP 9.6) - LRU, FIFO, Clock.
- Locality (Primer Ch 5).
Questions to Guide Your Design
- How do you split an address into VPN (Virtual Page Number) and Offset? (See the sketch after this list.)
- How do you track “Least Recently Used”? (Counter? Queue? Approx bit?).
- What is the “working set” of a trace?
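The address split itself is two bit operations. A sketch assuming 4 KiB pages (the page size should really be a parameter of your simulator, not a constant):

```c
#include <stdint.h>

#define PAGE_SHIFT 12                         /* log2(4096) */
#define PAGE_MASK  ((UINT64_C(1) << PAGE_SHIFT) - 1)

static inline uint64_t vpn_of(uint64_t vaddr)    { return vaddr >> PAGE_SHIFT; }
static inline uint64_t offset_of(uint64_t vaddr) { return vaddr & PAGE_MASK; }
/* e.g., 0x1234 -> VPN 1, offset 0x234, matching the sample output above */
```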
Definition of Done
- Implements multi-level page table walk logic.
- Implements FIFO and LRU replacement.
- Correctly counts hits, compulsory misses, and capacity misses.
Project 20: HTTP Web Server
Real World Outcome
A web server tiny that serves files and runs scripts.
$ ./tiny 8080
[LISTEN] Waiting on port 8080...
[CONN] Accepted 127.0.0.1:54321
[REQ] GET /index.html
[RESP] 200 OK (150 bytes)
[CONN] Accepted 127.0.0.1:54322
[REQ] GET /cgi-bin/adder?a=1&b=2
[CGI] Forking child...
[RESP] 200 OK (Dynamic Content)
The Core Question You’re Answering
How do network packets become “web pages” via the socket API?
Concepts You Must Understand First
- Sockets (CS:APP 11.4).
- HTTP Protocol (CS:APP 11.5).
- Process Control (CS:APP 8.4) - `fork`/`exec` (sketched below).
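A minimal sketch of the CGI path: fork a child, point its stdout at the client socket, and exec the script. Error checks, argv handling, and the HTTP response headers the parent must send first are all elided:

```c
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

void serve_cgi(int connfd, const char *prog, const char *query) {
    pid_t pid = fork();
    if (pid == 0) {                          /* child */
        setenv("QUERY_STRING", query, 1);    /* the CGI argument convention */
        dup2(connfd, STDOUT_FILENO);         /* script writes straight to the client */
        execl(prog, prog, (char *)NULL);
        _exit(127);                          /* only reached if exec failed */
    }
    waitpid(pid, NULL, 0);                   /* reap the child: no zombies */
}
```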
Definition of Done
- Serves static files (HTML, JPG).
- Handles CGI (Dynamic content via child processes).
- Handles errors (404, 500) gracefully.
Project 21: Thread Pool Implementation
(See expanded guide in repo for full code and details)
What you’ll build: A reusable thread pool library with a bounded work queue. Why it matters: Producer-consumer pattern mastery.
Real World Outcome
$ ./threadpool --workers=4
[INIT] 4 workers started.
[SUBMIT] Task 1..100
[WORKER-0] Processed Task 1
...
[SHUTDOWN] Graceful exit.
The Core Question You’re Answering
How do you safely coordinate multiple threads sharing a queue?
Definition of Done
- No race conditions (Helgrind clean).
- Clean shutdown (no zombie threads).
- Handles spurious wakeups correctly.
Project 22: Signal-Safe Printf
(See expanded guide in repo for full code and details)
What you’ll build: A printf that works inside signal handlers.
Why it matters: Understanding reentrancy and async-safety.
Real World Outcome
$ ./sio_test
[HANDLER] Caught signal 10. Safe output only!
The Core Question You’re Answering
Why does printf deadlock in a signal handler, and how do we fix it?
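A sketch in the spirit of CS:APP's SIO package (this is my own minimal version, not the book's code): format into a local buffer with no stdio, no `malloc`, no locks, then emit with `write()`, which is on the async-signal-safe list.

```c
#include <unistd.h>

void sio_puts(const char *s) {
    size_t n = 0;
    while (s[n]) n++;                        /* hand-rolled strlen: no shared libc state */
    write(STDOUT_FILENO, s, n);
}

void sio_putl(long v) {
    char buf[32];
    size_t i = sizeof buf;
    unsigned long u = (v < 0) ? -(unsigned long)v : (unsigned long)v;
    do { buf[--i] = (char)('0' + u % 10); u /= 10; } while (u);
    if (v < 0) buf[--i] = '-';
    write(STDOUT_FILENO, buf + i, sizeof buf - i);
}
```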
Definition of Done
- Uses only the `write` syscall.
- No `malloc`, no global locks.
- Formats integers and strings.
Project 23: Performance Profiler
(See expanded guide in repo for full code and details)
What you’ll build: A sampling profiler using signals.
Why it matters: Understanding how gprof/perf actually work.
Real World Outcome
$ ./profiler ./myprog
[STATS] Total samples: 1000
[HOT] 40% - matrix_multiply
[HOT] 30% - vector_add
The Core Question You’re Answering
How can I measure code execution without modifying the code?
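A sketch of the sampling core, assuming Linux/x86-64 with glibc (`REG_RIP` and `uc_mcontext.gregs` are platform-specific, and the 1 ms period is arbitrary); resolving addresses to symbols happens later, offline:

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/time.h>
#include <ucontext.h>

#define MAX_SAMPLES (1 << 16)
static volatile unsigned long long samples[MAX_SAMPLES];
static volatile unsigned long nsamples;

static void on_prof(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)si;
    ucontext_t *uc = ctx;
    if (nsamples < MAX_SAMPLES)
        samples[nsamples++] = uc->uc_mcontext.gregs[REG_RIP];  /* interrupted rip */
}

void start_sampling(void) {
    struct sigaction sa = {0};
    sa.sa_sigaction = on_prof;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigaction(SIGPROF, &sa, NULL);
    struct itimerval it = { {0, 1000}, {0, 1000} };  /* interval + initial: 1 ms */
    setitimer(ITIMER_PROF, &it, NULL);               /* delivers SIGPROF periodically */
}
```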
Definition of Done
- Uses `setitimer` for periodic interrupts.
- Captures `rip` (the instruction pointer).
- Resolves addresses to symbols (if possible).
Project 24: Memory Leak Detector
Real World Outcome
A shared library you inject into any C program to find leaks.
$ LD_PRELOAD=./libleakcheck.so ./buggy_app
[APP] App running...
[APP] App exiting...
================ LEAK REPORT =================
[LEAK] 1024 bytes at 0x7f... allocated by:
main.c:25 (main)
utils.c:10 (helper)
[SUMMARY] 1 leak, 1024 bytes lost.
The Core Question You’re Answering
How can I track every malloc/free without recompiling the target application?
Concepts You Must Understand First
- Dynamic Interposition (CS:APP 7.13) - `LD_PRELOAD`.
- Function Wrappers - `dlsym`, `RTLD_NEXT`.
- Call Stacks (Primer Ch 3) - Walking the stack frames.
Hints in Layers
- Implement a `malloc` that just calls the real `malloc` (via `dlsym`); see the sketch below.
- Add logging to a file.
- Use a hash map (or linked list) to store “live” pointers: `malloc` adds, `free` removes.
- On program exit (`__attribute__((destructor))`), print whatever is left in the map.
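A sketch of the Hint-1 shim (build with `gcc -shared -fPIC ... -ldl` and load via `LD_PRELOAD`; the classic bootstrap wrinkle, that `dlsym` may itself allocate, is glossed over here and real interposers must handle it):

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>

static void *(*real_malloc)(size_t);
static int in_hook;                     /* would be __thread in threaded code */

void *malloc(size_t size) {
    if (!real_malloc)
        real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");
    void *p = real_malloc(size);
    if (!in_hook) {
        in_hook = 1;                    /* guard: the bookkeeping may allocate too */
        /* record (p, size, caller) in a live-pointer table here;
         * use write(), never printf(), to avoid re-entering malloc */
        in_hook = 0;
    }
    return p;
}
```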
Definition of Done
- Intercepts malloc, free, realloc, calloc.
- Tracks allocation size and location (return address).
- Reports memory leaked at exit.
- Does not crash on recursive calls (e.g., if `printf` calls `malloc`).
Project 25: Debugger (ptrace-based)
Real World Outcome
A minimalist debugger mydb that can stop code and inspect state.
$ ./mydb ./target
(mydb) break 0x401050
(mydb) run
[STOP] Breakpoint hit at 0x401050
(mydb) regs
rax: 0x0
rip: 0x401050
(mydb) step
[STOP] Stepped to 0x401055
The Core Question You’re Answering
How does a debugger control another process and transparently pause it?
Concepts You Must Understand First
- The `ptrace` syscall (`man 2 ptrace`; CS:APP Ch. 8 covers the surrounding exceptional control flow) - `PTRACE_TRACEME`, `PTRACE_PEEKUSER`, `PTRACE_CONT`.
- Software Breakpoints - The `INT 3` instruction (opcode `0xCC`).
- Signals - `SIGTRAP`.
Thinking Exercise
To set a breakpoint:
- Read the byte at the target address.
- Save it.
- Write `0xCC` (`INT 3`) there.

To resume:
- Write the saved byte back.
- Rewind `%rip` by 1.
- Single step.
- Write `0xCC` back.
- Continue. (Both halves are sketched below.)
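A sketch of both halves with `ptrace`, error checks elided. `PTRACE_PEEKTEXT`/`PTRACE_POKETEXT` move a whole word at a time, so only the low byte is swapped:

```c
#include <stdint.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>

long set_breakpoint(pid_t pid, uint64_t addr) {
    long orig = ptrace(PTRACE_PEEKTEXT, pid, (void *)addr, NULL);
    long patched = (orig & ~0xffL) | 0xcc;              /* low byte -> INT 3 */
    ptrace(PTRACE_POKETEXT, pid, (void *)addr, (void *)patched);
    return orig;                                        /* saved for the restore */
}

void resume_from_breakpoint(pid_t pid, uint64_t addr, long orig) {
    ptrace(PTRACE_POKETEXT, pid, (void *)addr, (void *)orig);  /* restore the byte */
    struct user_regs_struct regs;
    ptrace(PTRACE_GETREGS, pid, NULL, &regs);
    regs.rip = addr;                                    /* rewind past the INT 3 */
    ptrace(PTRACE_SETREGS, pid, NULL, &regs);
    ptrace(PTRACE_SINGLESTEP, pid, NULL, NULL);         /* execute the real insn */
    /* waitpid() for the SIGTRAP, re-plant 0xCC, then PTRACE_CONT */
}
```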
Definition of Done
- Can launch a target program.
- Can set a breakpoint at an address.
- Can continue and stop at breakpoint.
- Can read register values.
Project 26: Operating System Kernel Capstone
Real World Outcome
A kernel.iso that you boot in QEMU, printing “Hello World” from your own OS.
The Core Question You’re Answering
What is the absolute minimum software required to manage the hardware?
Concepts You Must Understand First
- Bare Metal Boot - Multiboot header.
- VGA Text Mode - Writing bytes to `0xb8000` (sketched below).
- GDT/IDT - Global Descriptor Table, Interrupt Descriptor Table.
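A sketch of the VGA write, meaningful only in a freestanding kernel on bare metal or QEMU. VGA text mode is a memory-mapped 80×25 grid at `0xB8000` where each cell is an ASCII byte plus an attribute byte (`0x0F` is white on black):

```c
#include <stdint.h>

static volatile uint16_t *const VGA = (uint16_t *)0xB8000;

void vga_puts(const char *s, int row) {
    int col = 0;
    while (*s && col < 80)
        VGA[row * 80 + col++] = (uint16_t)(0x0F00 | (uint8_t)*s++);
}
/* e.g., vga_puts("Hello World", 0); */
```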
Definition of Done
- Boots via GRUB/QEMU.
- Sets up a stack.
- Writes text to the screen.
- Handles at least one interrupt (keyboard or timer).
(Note: Projects P3-P17 and P20-P23 are documented in the expanded guides linked in the headers; P21, P22, and P23 also have detailed standalone guides in the full version.)
Project Comparison Table
| Project | Difficulty | Time | Depth | Fun Factor |
|---|---|---|---|---|
| P1: Build Pipeline Explorer | ★☆☆☆☆ | 4-8h | Medium | ★★★☆☆ |
| P2: Bitwise Data Inspector | ★☆☆☆☆ | 4-8h | Medium | ★★★☆☆ |
| P3: Data Lab Clone | ★★☆☆☆ | 10-15h | High | ★★★★☆ |
| P4: Crash Cart | ★★☆☆☆ | 10-15h | High | ★★★☆☆ |
| P5: Bomb Lab Workflow | ★★☆☆☆ | 15-20h | High | ★★★★★ |
| P6: Attack Lab Workflow | ★★★☆☆ | 20-30h | Very High | ★★★★★ |
| P7: Y86-64 Simulator | ★★★☆☆ | 30-40h | Very High | ★★★★☆ |
| P8: Performance Clinic | ★★☆☆☆ | 10-15h | High | ★★★☆☆ |
| P9: Cache Simulator | ★★☆☆☆ | 15-20h | Very High | ★★★★☆ |
| P10: ELF Link Map | ★★☆☆☆ | 10-15h | High | ★★★☆☆ |
| P11: Signals Sandbox | ★★☆☆☆ | 10-15h | High | ★★★☆☆ |
| P12: Unix Shell | ★★★☆☆ | 25-35h | Very High | ★★★★★ |
| P13: VM Visualizer | ★★☆☆☆ | 10-15h | Medium | ★★★☆☆ |
| P14: Malloc | ★★★☆☆ | 30-40h | Very High | ★★★★☆ |
| P15: Robust I/O | ★★☆☆☆ | 10-15h | High | ★★★☆☆ |
| P16: Concurrency Workbench | ★★★☆☆ | 20-30h | Very High | ★★★★☆ |
| P17: Proxy Server | ★★★★☆ | 40-60h | Comprehensive | ★★★★★ |
| P18-P26: Extensions | ★★★★-★★★★★ | Varies | Expert | ★★★★★ |
Recommendation
If you’re new to systems programming: Start with Project 1 (Build Pipeline Explorer) and Project 2 (Bitwise Inspector). These give you quick wins and essential mental models. Then progress through Projects 3-6 to build debugging intuition.
If you’re preparing for interviews: Focus on Project 5 (Bomb Lab) for reverse engineering practice, Project 14 (Malloc) for the classic memory allocator question, and Project 16 (Concurrency) for threading fundamentals. These three cover the most common systems interview topics.
If you’re building production systems: Prioritize Project 12 (Shell) for process management, Project 14 (Malloc) for memory understanding, Project 15 (Robust I/O) for network code, and Project 17 (Proxy) as the capstone that integrates everything.
If you have limited time (3-month sprint):
Follow Path 5: P1 → P2 → P4 → P9 → P14 → P17. This covers the essential concepts while producing impressive projects for your portfolio.
Final Overall Project: Production-Grade Systems Toolkit
After completing the core projects, combine your work into a unified systems debugging toolkit:
The Goal
Build a single binary sysinspect that integrates:
- Build analysis (from P1)
- Memory inspection (from P2, P13)
- Process/signal debugging (from P11, P12)
- Performance profiling (from P8, P23)
Integration Steps
- Create a unified CLI with subcommands: `sysinspect build`, `sysinspect memory`, etc.
- Add a `--live` mode that attaches to running processes (from P25)
- Implement cross-referencing: show how a memory address maps to source-code symbols
- Add export to JSON/HTML for reports
Success Criteria
- A single `make` command builds the toolkit
- Can analyze your own projects from the sprint
- Documentation is good enough for another developer to use
- Handles edge cases gracefully (processes that exit, permission errors, etc.)
From Learning to Production: What’s Next
| Your Project | Production Equivalent | Gap to Fill |
|---|---|---|
| P1: Build Pipeline | CMake/Bazel/Buck | Build caching, dependency graphs |
| P5: Bomb Lab | IDA Pro/Ghidra | GUI, decompilation, scripting |
| P9: Cache Simulator | Valgrind’s cachegrind | Hardware integration |
| P12: Unix Shell | Bash/Zsh | Scripting language, history, completion |
| P14: Malloc | glibc malloc/jemalloc | Thread-local caches, arenas |
| P17: Proxy | Nginx/HAProxy | Configuration, TLS, load balancing |
| P24: Leak Detector | Valgrind’s memcheck | Shadow memory, origin tracking |
| P25: Debugger | GDB/LLDB | DWARF parsing, expression evaluation |
Continuing Your Journey
- Read the Source: Study glibc, Linux kernel, and nginx source code
- Contribute to Open Source: Many systems projects welcome contributors
- Specialize: Pick one area (security, performance, infrastructure) and go deep
- Teach: Writing about systems programming solidifies understanding
Summary
This learning path covers Computer Systems: A Programmer’s Perspective through 26 hands-on projects (17 core + 9 extensions).
| # | Project Name | Main Language | Difficulty | Time |
|---|---|---|---|---|
| 1 | Build Pipeline Explorer | C/Shell | Level 1 | 4-8h |
| 2 | Bitwise Data Inspector | C | Level 1 | 4-8h |
| 3 | Data Lab Clone | C | Level 2 | 10-15h |
| 4 | Calling Convention Crash Cart | C/Assembly | Level 2 | 10-15h |
| 5 | Bomb Lab Workflow | GDB/Assembly | Level 2 | 15-20h |
| 6 | Attack Lab Workflow | C/Assembly | Level 3 | 20-30h |
| 7 | Y86-64 CPU Simulator | C | Level 3 | 30-40h |
| 8 | Performance Clinic | C | Level 2 | 10-15h |
| 9 | Cache Lab Simulator | C | Level 2 | 15-20h |
| 10 | ELF Link Map Toolkit | C | Level 2 | 10-15h |
| 11 | Signals/Processes Sandbox | C | Level 2 | 10-15h |
| 12 | Unix Shell with Job Control | C | Level 3 | 25-35h |
| 13 | VM Map Visualizer | C | Level 2 | 10-15h |
| 14 | Build Your Own Malloc | C | Level 3 | 30-40h |
| 15 | Robust Unix I/O Toolkit | C | Level 2 | 10-15h |
| 16 | Concurrency Workbench | C | Level 3 | 20-30h |
| 17 | Capstone Proxy Server | C | Level 4 | 40-60h |
| 18 | ELF Linker/Loader | C | Level 4 | 40-60h |
| 19 | VM Simulator | C | Level 3 | 25-35h |
| 20 | HTTP Web Server | C | Level 3 | 25-35h |
| 21 | Thread Pool | C | Level 3 | 20-25h |
| 22 | Signal-Safe Printf | C | Level 3 | 15-20h |
| 23 | Performance Profiler | C | Level 3 | 20-25h |
| 24 | Memory Leak Detector | C | Level 3 | 20-30h |
| 25 | ptrace Debugger | C | Level 4 | 35-45h |
| 26 | OS Kernel Capstone | C/Assembly | Level 5 | 60-100h |
Expected Outcomes
After completing these projects, you will:
- Debug any crash by reading registers, stack traces, and memory dumps
- Predict performance by reasoning about cache behavior and memory access patterns
- Write secure code by understanding buffer overflows, format strings, and memory safety
- Build production systems that handle signals, concurrency, and I/O correctly
- Pass systems interviews with deep understanding rather than memorization
- Read and contribute to low-level codebases (kernels, compilers, databases)
You’ll have built 26 working projects that demonstrate mastery of systems programming from first principles—from source code to running process and back again.
Additional Resources and References
Primary Textbook
- Computer Systems: A Programmer’s Perspective, 3rd Edition by Bryant & O’Hallaron
- Official website with labs: https://csapp.cs.cmu.edu/
Supplementary Books (by topic)
C Programming
- C Programming: A Modern Approach by K.N. King
- Effective C by Robert Seacord
Assembly & Low-Level
- Low-Level Programming by Igor Zhirkov
- Practical Binary Analysis by Dennis Andriesse
- Hacking: The Art of Exploitation by Jon Erickson
Operating Systems
- Operating Systems: Three Easy Pieces (OSTEP) - Free online
- The Linux Programming Interface by Michael Kerrisk
- Advanced Programming in the UNIX Environment by Stevens & Rago
Networking
- Unix Network Programming by W. Richard Stevens
- Beej’s Guide to Network Programming (free online)
Performance
- Agner Fog’s Optimization Manuals (free online)
- Computer Organization and Design by Patterson & Hennessy
Online Resources
- Teach Yourself CS - Study path recommendations
- CS:APP Course Videos - CMU 15-213 lectures
- Compiler Explorer - See assembly in your browser
- GDB Cheat Sheet
Standards and Documentation
- System V AMD64 ABI: https://refspecs.linuxbase.org/elf/x86_64-abi-0.99.pdf
- Intel x86-64 Manuals: https://www.intel.com/sdm
- POSIX Specification: https://pubs.opengroup.org/onlinepubs/9699919799/
- Linux Man Pages: https://man7.org/linux/man-pages/