CSAPP 3E DEEP LEARNING PROJECTS
CS:APP (3rd Edition): Deep Learning via Buildable Projects
Goal: Transform from a programmer who writes code that "happens to work" into a systems programmer who understands exactly what the machine does with every instruction. By building 17 increasingly sophisticated projects, you will internalize the complete journey from source code to running process: mastering data representation, machine-level execution, memory hierarchy, operating system abstractions, and concurrent programming. When you finish, you will debug crashes by reading registers, optimize code by reasoning about cache lines, and build robust systems that handle real-world failure modes gracefully.
Why Systems Programming Matters
The Hidden Foundation
Every program you write eventually becomes electrons flowing through silicon. Between your high-level code and those electrons lies a vast machinery of translation, optimization, and abstraction that most programmers never see. This invisible infrastructure determines whether your program runs fast or slow, crashes mysteriously or fails gracefully, consumes megabytes or gigabytes of memory.
Consider this scenario: A financial trading system processes millions of transactions per day. One day, after a routine update, trades start failing silently, but only for certain customers, only during peak hours, and only when the system has been running for exactly 47 minutes. The logs show nothing. The unit tests pass. The code review found nothing suspicious.
A programmer without systems knowledge might spend weeks adding more logging, trying random fixes, or blaming the network. A systems programmer recognizes the symptoms immediately: this is a classic memory corruption bug, likely a buffer overflow that only manifests when a specific heap layout occurs after enough allocations. They fire up GDB, examine the heap metadata, trace the corruption back to a string copy that assumed null termination, and fix it in an hour.
The difference is not intelligence; it is knowledge. Systems programming knowledge.
Real-World Impact
WHY SYSTEMS KNOWLEDGE MATTERS

- DEBUGGING: Without systems knowledge, you guess. With it, you diagnose.
- PERFORMANCE: Without systems knowledge, you benchmark randomly. With it, you reason about cache lines and pipelines.
- SECURITY: Without systems knowledge, you follow checklists. With it, you understand attack surfaces.
- ARCHITECTURE: Without systems knowledge, you copy patterns. With it, you design for the machine you have.
The Heartbleed Bug (2014): A missing bounds check in OpenSSL allowed attackers to read arbitrary server memory, exposing passwords, private keys, and session tokens. The bug existed for two years. Understanding buffer management and memory layout would have caught it in code review.
The Mars Climate Orbiter (1999): A $327 million spacecraft was lost because one module used imperial units while another expected metric. Understanding data representation and interface contracts (exactly what Chapter 2 teaches) would have prevented this.
Spectre and Meltdown (2018): These CPU vulnerabilities exploited speculative execution and cache timing to leak privileged memory. Understanding cache behavior and CPU pipelines (Chapters 5 and 6) is essential for both exploiting and mitigating such attacks.
What This Journey Gives You
After completing these projects, you will be able to:
- Read a crash dump and explain exactly what happened: which instruction faulted, what the stack looked like, what memory was corrupted
- Profile code and explain why it is slow: whether it is memory-bound, compute-bound, or suffering from branch mispredictions
- Audit code for security vulnerabilities: recognizing buffer overflows, integer overflows, and use-after-free bugs from code inspection
- Design systems that handle failure gracefully: understanding partial I/O, signal races, and concurrency hazards
- Communicate with compilers, operating systems, and hardware: not as black boxes, but as partners whose behavior you can predict and influence
The Big Picture: How Programs Become Running Processes
Before diving into individual concepts, let us see the complete journey a program takes from source code to execution:
THE PROGRAM EXECUTION PIPELINE

SOURCE CODE: hello.c

    #include <stdio.h>

    int main() {
        printf("Hello\n");
        return 0;
    }

        |
        v
STAGE 1: PREPROCESSING (cpp)
    - Expands #include directives
    - Processes #define macros
    - Handles conditional compilation (#ifdef)
    - Output: hello.i (expanded C source)
        |
        v
STAGE 2: COMPILATION (cc1)
    - Lexical analysis -> tokens
    - Parsing -> AST (Abstract Syntax Tree)
    - Semantic analysis -> type checking
    - Optimization passes
    - Code generation
    - Output: hello.s (assembly source)
        |
        v
STAGE 3: ASSEMBLY (as)
    - Translates assembly to machine code
    - Creates relocatable object file
    - Records symbols and relocations
    - Output: hello.o (ELF object file)
        |
        |   LIBRARIES join here: libc.a / libc.so (printf, malloc, etc.)
        v
STAGE 4: LINKING (ld)
    - Symbol resolution (matches references to definitions)
    - Relocation (assigns final addresses)
    - Static: copies library code into the executable
    - Dynamic: records dependencies for runtime
    - Output: hello (executable ELF file)
        |
        v
STAGE 5: LOADING (execve + ld-linux.so)
    - Kernel reads ELF headers
    - Creates new process address space
    - Maps code and data segments
    - Dynamic linker resolves shared libraries
    - Sets up stack with argc/argv/envp
    - Transfers control to _start -> main()
        |
        v
STAGE 6: EXECUTION
    - CPU fetches, decodes, executes instructions
    - Memory accesses go through the cache hierarchy
    - Virtual addresses translated to physical
    - System calls trap to the kernel
    - Signals may interrupt execution
    - Process terminates, resources cleaned up
Every project in this curriculum touches some part of this pipeline. Project 1 makes the entire pipeline visible. Projects 2-6 focus on data representation and machine code. Projects 7-9 examine the CPU and cache. Projects 10-16 explore the operating system's role. Project 17 integrates everything.
Core Concept Analysis
Think of CS:APP as one story told through eight interconnected concept clusters. Each cluster builds on the previous ones, and mastery requires understanding both the individual concepts and their interactions.
A. Translation & Execution
Book Coverage: Chapters 1, 7
The Central Question: How does human-readable source code become a running process?
TRANSLATION PIPELINE DETAIL

SOURCE (.c)
    | cpp (C Preprocessor): text substitution, #include, #define, #ifdef
    v
PREPROCESSED (.i)
    | cc1 (C Compiler): Lexer -> Parser -> Type check -> Optimize -> Code gen
    v
ASSEMBLY (.s)
    | as (Assembler): machine code + metadata + relocations
    v
OBJECT FILE (.o)  [ELF: header, .text, .data, .bss, .symtab, .rel.*]
    | ld (Linker): symbol resolution + relocation
    v
EXECUTABLE
    | execve() + ld-linux.so (Loader)
    v
RUNNING PROCESS
Key Insights:
- Each stage produces a concrete artifact you can inspect
- Symbol resolution is where "undefined reference" errors occur
- Static linking copies code; dynamic linking defers to runtime
Mastery Test: Can you predict what changes when you switch from static to dynamic linking?
B. Data Representation
Book Coverage: Chapter 2
The Central Question: How does the machine represent information, and what happens at the boundaries?
INTEGER REPRESENTATION
  UNSIGNED (8-bit): 0000 0000 ... 1111 1111  =  0 ... 255
  SIGNED   (8-bit): 1000 0000 ... 0111 1111  =  -128 ... 127

  DANGER ZONES:
  - Signed overflow: undefined behavior!
  - Signed/unsigned comparison: -1 > 0U is TRUE!
FLOATING POINT (IEEE 754)
  32-bit layout: [S | Exponent(8) | Mantissa(23)]
  Value = (-1)^S x 1.Mantissa x 2^(Exp - 127)

  Special values: +/-0, +/-Infinity, NaN
  WARNING: 0.1 + 0.2 != 0.3
BYTE ORDERING
  0x01234567 in memory, lowest address first:
    Little-endian (x86):   67 45 23 01
    Big-endian (network):  01 23 45 67
Mastery Test: Can you predict the output of printf("%d", (int)(unsigned)-1)?
C. Machine-Level Programming
Book Coverage: Chapter 3
The Central Question: How does the compiler translate C into x86-64?
x86-64 REGISTER FILE
  Arguments:      %rdi, %rsi, %rdx, %rcx, %r8, %r9
  Return value:   %rax
  Callee-saved:   %rbx, %rbp, %r12-%r15
  Stack pointer:  %rsp
STACK FRAME LAYOUT
  High addresses: [Caller's frame] [Args 7+] [Return addr] [Saved regs] [Locals]
  Low addresses:  <- %rsp (the stack grows toward lower addresses)
Mastery Test: Given a crash address, can you walk the stack frames?
D. Architecture & Performance
Book Coverage: Chapters 4, 5
PIPELINED CPU
  5 stages: Fetch -> Decode -> Execute -> Memory -> Writeback
  Hazards: data (RAW), control (a misprediction costs ~15-20 cycles)
  Optimizations: ILP, loop unrolling, reducing dependency chains
E. Memory Hierarchy & Virtual Memory
Book Coverage: Chapters 6, 9
MEMORY HIERARCHY
  Registers (~1KB, ~0.25ns) -> L1 (32-64KB, ~1ns) -> L2 (256KB, ~4ns)
  -> L3 (8-32MB, ~12ns) -> DRAM (8-64GB, ~60ns) -> SSD/HDD

  Caches: address split into tag | index | offset; exploit temporal and spatial locality
  Virtual memory: virtual address -> page table -> physical address; the TLB caches translations
F. Exceptional Control Flow & Processes
Book Coverage: Chapter 8
EXCEPTIONS & PROCESSES
  Exception classes: interrupt (async), trap (syscall), fault, abort
  Process lifecycle: fork() -> exec() -> wait() -> exit()
  Signals: SIGINT, SIGTERM, SIGSEGV, SIGCHLD (handlers must be async-signal-safe!)
G. System I/O & Networking
Book Coverage: Chapters 10, 11
I/O & NETWORKING
  Unix I/O: "everything is a file" (FD 0 = stdin, 1 = stdout, 2 = stderr)
  Robust I/O: read()/write() may return short counts; always loop!
  Sockets: socket -> bind -> listen -> accept (server); socket -> connect (client)
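The "always loop" rule looks like this in practice; a short-count-safe write helper in the spirit of CS:APP's RIO package (a sketch, not the book's exact code):

```c
#include <errno.h>
#include <unistd.h>

// Write exactly n bytes or fail: write() may transfer fewer bytes than
// requested on pipes, sockets, and after signal interruptions.
ssize_t write_all(int fd, const void *buf, size_t n) {
    const char *p = buf;
    size_t left = n;
    while (left > 0) {
        ssize_t k = write(fd, p, left);
        if (k < 0) {
            if (errno == EINTR) continue;   // interrupted by a signal: retry
            return -1;                      // real error
        }
        p += k;                             // advance past the bytes written
        left -= k;
    }
    return (ssize_t)n;
}
```

The matching read-side loop is the same shape, with the extra wrinkle that a return of 0 means EOF rather than "try again".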
H. Concurrency
Book Coverage: Chapter 12
CONCURRENCY
  Models: processes (isolated) | threads (shared memory) | I/O multiplexing (single thread)
  Races: counter++ is NOT atomic (the load-add-store can interleave)
  Synchronization: mutexes, semaphores, condition variables
  Deadlock: circular wait; prevent it with a consistent lock ordering
Process Address Space Layout
PROCESS ADDRESS SPACE (Linux x86-64)
  High addresses: kernel | stack (grows down) | mmap region | heap (grows up)
  Low addresses:  .bss | .data | .rodata | .text | NULL guard page

  Permissions: .text = r-x, .rodata = r--, .data/.bss/heap/stack = rw-
Concept Summary Table
| Concept | Key Questions | Danger Signs |
|---|---|---|
| Translation | What does each stage produce? | "Compiled but crashes" |
| Data Rep | Why -1 > 0U? | Silent corruption |
| Machine Code | How does stack grow? | Cannot debug crashes |
| Architecture | What hazard is this? | Random performance |
| Memory | Cache hit rate? | 100x slowdown |
| ECF/Processes | What is a zombie? | Hangs, orphans |
| I/O/Networking | Short count? | Data corruption |
| Concurrency | Race condition? | Heisenbugs |
Deep Dive Reading By Concept
Primary: CS:APP 3rd Ed (Bryant & O'Hallaron)
| Concept | CS:APP | Supplementary |
|---|---|---|
| Translation | Ch. 1, 7 | Practical Binary Analysis, Low-Level Programming |
| Data Rep | Ch. 2 | Write Great Code Vol.1, Effective C |
| Machine Code | Ch. 3 | Hacking: Art of Exploitation |
| Architecture | Ch. 4-5 | Computer Organization and Design |
| Memory | Ch. 6, 9 | OSTEP |
| ECF/Processes | Ch. 8 | The Linux Programming Interface |
| I/O/Networking | Ch. 10-11 | Unix Network Programming |
| Concurrency | Ch. 12 | OSTEP, TLPI |
Essential Book List: CS:APP, C Programming: A Modern Approach (King), Effective C (Seacord), OSTEP (free online), The Linux Programming Interface (Kerrisk), Low-Level Programming (Zhirkov)
Table of Contents
- Overview
- Project Dependency Graph
- Progress Tracker
- Projects
- Project Comparison Table
- Skills Matrix
- Resources
Overview
The book's scope (12 chapters) spans:
| Domain | Topics |
|---|---|
| Translation & Execution | Preprocessing, compilation, assembly, linking, loading |
| Data Representation | Bits/bytes, integers, floating point, endianness |
| Machine-Level Code | x86-64, calling conventions, stack discipline |
| Architecture | CPU datapaths, pipelining (Y86-64) |
| Performance | Loop optimization, ILP, branch prediction |
| Memory Hierarchy | Caches, locality, virtual memory |
| Operating System | Processes, signals, exceptional control flow |
| I/O & Networking | File descriptors, sockets, robust I/O |
| Concurrency | Threads, synchronization, deadlock avoidance |
Project Dependency Graph
Read bottom-up: each tier builds on the tiers below it.

  Tier 7 (capstone):  P17 Secure Proxy Server
  Tier 6:             P16 Concurrency Workbench | P15 Unix I/O Toolkit | P14 Malloc Allocator
  Tier 5:             P12 Shell with Job Control | P11 Signals + Processes | P13 VM Visualizer | P9 Cache Simulator
  Tier 4 (hub):       P10 ELF Link Map
  Tier 3:             P8 Performance Clinic | P7 Y86-64 CPU Simulator | P6 Attack Lab Workflow (after P5 Bomb Lab Workflow)
  Tier 2:             P3 Data Lab Clone | P4 Calling Convention
  Tier 1:             P2 Bitwise Data Inspector
  START:              P1 Toolchain Explorer

Recommended Learning Paths:
| Path | Focus | Projects |
|---|---|---|
| Core | Essential CS:APP understanding | P1 -> P2 -> P4 -> P11 -> P12 |
| Security | Exploitation & defense | P1 -> P2 -> P4 -> P5 -> P6 |
| Architecture | CPU internals | P1 -> P2 -> P3 -> P7 |
| Performance | Optimization mastery | P1 -> P2 -> P8 -> P9 |
| Systems | Full systems programmer | P1 -> P2 -> P4 -> P11 -> P12 -> P15 -> P16 |
| Complete | Everything | P1 through P17 |
Progress Tracker
Use this checklist to track your journey:
Phase 1: Foundation
[ ] P1 - "Hello, Toolchain" Build Pipeline Explorer
[ ] P2 - Bitwise Data Inspector
Phase 2: Machine-Level Mastery
[ ] P3 - Data Lab Clone
[ ] P4 - x86-64 Calling Convention Crash Cart
[ ] P5 - Bomb Lab Workflow
[ ] P6 - Attack Lab Workflow
Phase 3: Architecture & Performance
[ ] P7 - Y86-64 CPU Simulator
[ ] P8 - Performance Clinic
[ ] P9 - Cache Lab++ Simulator
Phase 4: Systems Programming
[ ] P10 - ELF Link Map & Interposition
[ ] P11 - Signals + Processes Sandbox
[ ] P12 - Unix Shell with Job Control
[ ] P13 - Virtual Memory Map Visualizer
[ ] P14 - Build Your Own Malloc
[ ] P15 - Robust Unix I/O Toolkit
[ ] P16 - Concurrency Workbench
Phase 5: Capstone
[ ] P17 - CS:APP Capstone Proxy Platform
Phase 6: Beyond CS:APP (Advanced Extensions)
[ ] P18 - ELF Linker and Loader
[ ] P19 - Virtual Memory Simulator
[ ] P20 - HTTP Web Server
[ ] P21 - Thread Pool Implementation
[ ] P22 - Signal-Safe Printf
[ ] P23 - Performance Profiler
[ ] P24 - Memory Leak Detector
[ ] P25 - Debugger (ptrace-based)
[ ] P26 - Operating System Kernel Capstone
Projects
Phase 1: Foundation (Start Here)
Project 1: "Hello, Toolchain" - Build Pipeline Explorer
| Attribute | Value |
|---|---|
| Language | C (alt: Rust, Zig, C++) |
| Difficulty | Intermediate |
| Time | 1-2 weeks |
| Chapters | 1, 7 |
| Coolness | Genuinely Clever |
| Portfolio Value | Resume Gold |
What you'll build: A CLI "pipeline explainer" that takes one small C program and produces a structured report for each stage (preprocessed C, assembly, object metadata, linked binary metadata) plus runtime observations.
Why it matters: Chapter 1 is about seeing the system as a whole; this forces you to observe every transformation and artifact, not just "run gcc and hope."
Core challenges:
- Capturing each compilation artifact deterministically (translation stages)
- Explaining symbol tables/sections in human terms (executable structure)
- Relating runtime behavior to the produced binary (loading + process execution)
Key concepts to master:
- Translation system (Ch. 1)
- Object file anatomy: sections, symbols (Ch. 7)
- Error-handling discipline (Appendix)
Prerequisites: Basic C, comfort with build tools, basic debugging literacy.
Deliverable: A single report explaining "what the compiler produced, what the linker stitched, and what the process looks like at runtime."
Implementation hints:
- Treat this as a report generator, not a toy script
- Output must include: section list, symbol count by kind, stack/heap locations at runtime (via debugger)
Milestones:
- You can explain each pipeline stage using artifacts you produced
- You can predict changes between static vs dynamic linking
- You can map a crash address back to the right stage (source vs asm vs binary)
Real World Outcome
When complete, you will have a CLI tool that produces comprehensive build pipeline analysis:
$ ./pipeline-explorer hello.c --all
================================================================================
BUILD PIPELINE ANALYSIS: hello.c
================================================================================
[STAGE 1: PREPROCESSING]
--------------------------------------------------------------------------------
Input: hello.c (45 bytes, 5 lines)
Output: hello.i (18,432 bytes, 847 lines)
Time: 0.003s
Preprocessing transformations:
- #include <stdio.h> expanded: +842 lines from /usr/include/stdio.h
- Header chain: stdio.h -> stddef.h -> bits/types.h -> ...
- Macros defined: 127 (from system headers)
- Macros used in source: 0
- Conditional compilation: 23 #ifdef blocks evaluated
[STAGE 2: COMPILATION]
--------------------------------------------------------------------------------
Input: hello.i (18,432 bytes)
Output: hello.s (512 bytes, 28 lines)
Time: 0.012s
Assembly characteristics:
- Target: x86-64 (AT&T syntax)
- Functions generated: 1 (main)
- Instructions: 14
- String literals: 1 ("Hello, World!\n")
- Section directives: .text, .rodata, .note.GNU-stack
Code generation summary:
- Stack frame: 16 bytes (aligned)
- Callee-saved registers used: none
- External calls: puts@PLT
[STAGE 3: ASSEMBLY]
--------------------------------------------------------------------------------
Input: hello.s (512 bytes)
Output: hello.o (1,688 bytes)
Time: 0.002s
Object file analysis:
Section Size Type Flags
.text 26 PROGBITS AX (alloc, execute)
.rodata 15 PROGBITS A (alloc)
.comment 46 PROGBITS MS (merge, strings)
.note.GNU-s 0 NOBITS -
.eh_frame 56 PROGBITS A (alloc)
Symbol table (4 entries):
Symbol Type Bind Section Value
main FUNC GLOBAL .text 0x0
puts NOTYPE GLOBAL UND 0x0 (undefined - needs linking)
Relocations (2 entries):
Offset Type Symbol Addend
0x0a R_X86_64_PC32 .rodata -4
0x0f R_X86_64_PLT32 puts -4
[STAGE 4: LINKING]
--------------------------------------------------------------------------------
Input: hello.o + libc
Output: hello (16,696 bytes)
Time: 0.024s
Linking type: Dynamic
Interpreter: /lib64/ld-linux-x86-64.so.2
Linked binary analysis:
Section VMA Size Type
.interp 0x0000000000400318 28 interpreter path
.text 0x0000000000401040 147 executable code
.rodata 0x0000000000402000 19 read-only data
.dynamic 0x0000000000403e10 480 dynamic linking info
.got.plt 0x0000000000404000 32 GOT for PLT
.data 0x0000000000404020 0 initialized data
.bss 0x0000000000404020 0 uninitialized data
Symbol resolution:
- puts: resolved via PLT/GOT (lazy binding)
- __libc_start_main: resolved via PLT/GOT
- Dynamic libraries required: libc.so.6
Entry point: 0x401040 (_start, not main!)
[STAGE 5: RUNTIME OBSERVATION]
--------------------------------------------------------------------------------
Process memory map at main() entry:
Address Range Perms Size Mapping
0x00400000-0x00401000 r--p 4K hello (ELF header)
0x00401000-0x00402000 r-xp 4K hello (.text)
0x00402000-0x00403000 r--p 4K hello (.rodata)
0x00403000-0x00405000 rw-p 8K hello (.data, .bss, .got)
0x7ffff7c00000-0x7ffff7c28000 r--p 160K libc.so.6
0x7ffff7c28000-0x7ffff7dbd000 r-xp 1620K libc.so.6 (.text)
0x7ffff7fc3000-0x7ffff7fc7000 r--p 16K ld-linux-x86-64.so.2
0x7ffffffde000-0x7ffffffff000 rw-p 132K [stack]
Stack frame at main():
RSP: 0x7fffffffe3d0
RBP: 0x7fffffffe3e0
Return address: 0x7ffff7c29d90 (__libc_start_call_main+128)
argc: 1
argv[0]: "./hello"
================================================================================
PIPELINE SUMMARY
================================================================================
Total build time: 0.041s
Size amplification: 45 bytes (source) -> 16,696 bytes (binary) = 371x
Symbol resolution: 2 external symbols resolved dynamically
Recommendation: Use -static for deployment, dynamic for development
The tool can also produce focused reports:
$ ./pipeline-explorer hello.c --symbols
$ ./pipeline-explorer hello.c --relocations
$ ./pipeline-explorer hello.c --compare-linking # static vs dynamic comparison
$ ./pipeline-explorer hello.c --trace-symbol puts # full resolution chain for one symbol
The Core Question You're Answering
"What exactly happens between typing gcc hello.c and having a running process, and why does each transformation exist?"
This question forces you to confront the reality that compilation is not magic - it is a deterministic pipeline where each stage produces artifacts that the next stage consumes. Understanding this pipeline is the foundation for debugging linker errors, understanding security vulnerabilities, optimizing build times, and reasoning about what code actually executes.
Concepts You Must Understand First
- The Translation Pipeline (Preprocessing, Compilation, Assembly, Linking)
- What is the output of each stage and what format does it take?
- Why does preprocessing happen before compilation?
- What would break if you skipped the assembly stage and went directly from compiler output to object file?
- CS:APP Ch. 1.2, Ch. 7.1-7.2
- Object Files and ELF Format
- What are sections and why do .text, .data, .rodata, and .bss exist as separate concepts?
- What is a symbol table and why does it contain both defined and undefined symbols?
- What is a relocation entry and why can't the assembler resolve all addresses itself?
- CS:APP Ch. 7.3-7.4
- Symbol Resolution and Linking
- How does the linker decide which definition to use when multiple object files define the same symbol?
- What is the difference between strong and weak symbols?
- Why do static libraries and dynamic libraries resolve symbols differently?
- CS:APP Ch. 7.5-7.7
- Loading and Process Creation
- What does the loader do with the ELF file before main() runs?
- Where do the various segments end up in virtual memory?
- What is the role of the dynamic linker (ld-linux.so)?
- CS:APP Ch. 7.9, Ch. 8.2
- Compilation and Code Generation
- What decisions does the compiler make when translating C to assembly?
- How do optimization levels affect the generated code?
- What information is lost during compilation that cannot be recovered?
- CS:APP Ch. 1.2, Ch. 3.1-3.2
Questions to Guide Your Design
- How will you invoke each stage of the pipeline separately? (Hint: gcc -E, gcc -S, gcc -c, gcc)
- How will you parse the output of tools like objdump, readelf, and nm to extract structured information?
- What format will your report take - plain text, JSON, or both? How will you handle reports that need to show binary data?
- How will you capture runtime information? Will you use GDB scripting, ptrace, or /proc filesystem parsing?
- How will you handle error cases - what if compilation fails? What if the input is not valid C?
- How will you make the tool educational? Should it explain why each transformation happened, not just what changed?
- How will you compare static vs dynamic linking? What metrics are meaningful to show?
Thinking Exercise
Before writing any code, trace through this program by hand:
// main.c
extern int helper(int x);
int global_var = 42;
int main(void) {
return helper(global_var);
}
// helper.c
int helper(int x) {
return x + 1;
}
Answer these questions on paper:
- Preprocessing phase: What will main.i look like? Will it be different from main.c in any meaningful way for this example?
- Symbol table for main.o: List every symbol. For each one, state:
  - Name
  - Type (FUNC, OBJECT, NOTYPE)
  - Binding (LOCAL, GLOBAL)
  - Section (which section, or UND if undefined)
- Relocations in main.o: There will be at least two relocations. What are they and why?
- Linking main.o + helper.o: Draw the combined symbol table. Which symbols from main.o were undefined before linking but defined after?
- Memory layout after loading: If the .text section of the final binary starts at 0x401000, and main is at offset 0x20 within .text, what is the absolute address of main?
- Dynamic linking alternative: If helper() were in a shared library instead of helper.o, what would be different about:
  - The symbol table
  - The relocations
  - The PLT/GOT sections
  - The runtime behavior on the first call to helper()
The Interview Questions Theyโll Ask
- โWalk me through what happens when you run
gcc -o hello hello.cโ- They want: preprocessing expands includes/macros, compiler generates assembly, assembler creates object file with relocations, linker resolves symbols and creates executable
- Bonus: mention that ld.so loads dynamic dependencies at runtime
- โWhatโs the difference between a linker error and a compiler error?โ
- They want: compiler errors are syntax/type errors in a single translation unit; linker errors are symbol resolution failures across multiple object files
- Example: undefined reference vs undeclared identifier
- "Explain static vs dynamic linking and when you'd use each"
- They want: static bundles everything (larger binary, no dependencies, faster startup), dynamic shares libraries (smaller binary, security updates, slower first-call)
- Discuss: deployment scenarios, licensing implications (LGPL)
- "What is Position Independent Code (PIC) and why is it needed?"
- They want: code that works regardless of load address, required for shared libraries (ASLR), uses PC-relative addressing and GOT/PLT
- "How would you debug a 'symbol not found' error at runtime?"
- They want: ldd to check dependencies, LD_DEBUG=all to trace resolution, readelf/nm to inspect symbol tables, verify library paths
- "What's in an ELF file and how does the loader use it?"
- They want: ELF header, program headers (segments for loading), section headers (for linking/debugging), symbol/string tables, relocation entries
Hints in Layers
Hint 1 - Getting Started: Start by manually running each stage and saving the outputs:
gcc -E hello.c -o hello.i # Preprocess only
gcc -S hello.c -o hello.s # Compile to assembly
gcc -c hello.c -o hello.o # Assemble to object file
gcc hello.o -o hello # Link to executable
Look at each output file. What tools can parse them? (file, cat, objdump, readelf, nm)
Hint 2 - Extracting Object File Information: These commands give you structured output you can parse:
readelf -h hello.o # ELF header
readelf -S hello.o # Section headers
readelf -s hello.o # Symbol table
readelf -r hello.o # Relocations
objdump -d hello.o # Disassembly
Consider using readelf --wide for easier parsing.
Hint 3 - Capturing Runtime Information: For the runtime stage, you can use GDB non-interactively:
gdb -batch -ex "break main" -ex "run" -ex "info registers" -ex "x/20x \$rsp" ./hello
Or parse /proc/[pid]/maps from a wrapper program.
Hint 4 - Comparing Linking Strategies: Build both versions and compare:
gcc -o hello_dynamic hello.c
gcc -static -o hello_static hello.c
ls -l hello_dynamic hello_static
ldd hello_dynamic
readelf -d hello_dynamic | grep NEEDED
Hint 5 - Tool Architecture: Structure your code as:
struct stage_result {
char *stage_name;
char *input_file;
char *output_file;
size_t input_size;
size_t output_size;
double elapsed_time;
/* stage-specific data */
};
struct preprocess_result { int lines_added; int macros_expanded; ... };
struct compile_result { int instructions; int functions; ... };
struct assemble_result { struct section *sections; struct symbol *symbols; ... };
struct link_result { struct segment *segments; char *entry_point; ... };
Hint 6 - The Educational Value: Don't just report numbers - explain them:
The symbol 'puts' appears in hello.o with type NOTYPE and section UND (undefined).
This means the assembler encountered a call to puts() but has no idea where it is.
The relocation entry at offset 0x0f tells the linker: "When you find puts,
patch this location with the correct address."
After linking, puts is still not directly resolved - instead, the linker created
a PLT entry at 0x401030 and a GOT slot at 0x404018. The first call to puts()
will trigger the dynamic linker to fill in the GOT slot.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| The compilation pipeline overview | Computer Systems: A Programmer's Perspective | Ch. 1 (A Tour of Computer Systems) |
| Object files, symbols, and relocations | Computer Systems: A Programmer's Perspective | Ch. 7 (Linking) |
| ELF format deep dive | Practical Binary Analysis | Ch. 2 (ELF Format) |
| Static and dynamic linking | Computer Systems: A Programmer's Perspective | Ch. 7.6-7.7 |
| Position-independent code and GOT/PLT | Computer Systems: A Programmer's Perspective | Ch. 7.12 |
| The C compilation model | The C Programming Language (K&R) | Ch. 4 (Functions and Program Structure) |
| Separate compilation in C | C Programming: A Modern Approach | Ch. 15 (Writing Large Programs) |
| x86-64 assembly basics | Computer Systems: A Programmer's Perspective | Ch. 3.1-3.4 |
| Process loading and execution | Computer Systems: A Programmer's Perspective | Ch. 7.9, Ch. 8.2 |
| Low-level executable analysis | Low-Level Programming | Ch. 3-4 (Assembly and Linking) |
Project 2: Bitwise Data Inspector
| Attribute | Value |
|---|---|
| Language | C (alt: Rust, Zig, C++) |
| Difficulty | Intermediate |
| Time | Weekend–2 weeks |
| Chapters | 2, 3 |
| Coolness | ★★★★☆ Genuinely Clever |
| Portfolio Value | Micro-SaaS/Pro Tool |
What you'll build: A CLI that prints the byte-level representation of values (signed/unsigned integers and IEEE-754 floats), including inferred endianness and derived interpretations.
Why it matters: Chapter 2 becomes "muscle memory" only when you can see representations and predict overflow, truncation, and rounding.
Core challenges:
- Correct sign extension, shifts, and casts (two's complement)
- Float field extraction and classification (IEEE-754)
- Tests that catch edge-case mistakes (disciplined reasoning)
Key concepts to master:
- Integer representations and overflow (Ch. 2)
- Floating point, rounding, NaN/Inf (Ch. 2)
- Data sizes and alignment (Ch. 3)
Prerequisites: Basic C operators, binary/hex comfort.
Deliverable: Paste a number; get "what the machine stores" plus why comparisons/overflows surprise people.
Implementation hints:
- Separate parsing, bit extraction, and formatting as distinct modules
- Make the tool explain why a conversion changed value (range, rounding, NaN propagation)
Milestones:
- You can predict overflow and signed/unsigned comparison outcomes
- You can explain subnormals and NaN behavior with your own examples
- You start trusting bit evidence over intuition
Real World Outcome
When complete, you will have a CLI tool that reveals the hidden bit-level truth behind numbers:
$ ./bitwise-inspector 42
================================================================================
BITWISE DATA INSPECTION: 42
================================================================================
[INTEGER INTERPRETATIONS]
--------------------------------------------------------------------------------
Input parsed as: decimal integer
As unsigned integers:
uint8_t: 42 (0x2A) Binary: 00101010
uint16_t: 42 (0x002A) Binary: 00000000 00101010
uint32_t: 42 (0x0000002A)
uint64_t: 42 (0x000000000000002A)
As signed integers (two's complement):
int8_t: 42 (0x2A) Binary: 00101010
int16_t: 42 (0x002A) Sign bit: 0 (positive)
int32_t: 42 (0x0000002A)
int64_t: 42 (0x000000000000002A)
Memory layout (little-endian system):
Address: [0] [1] [2] [3]
uint32: 2A 00 00 00
[OVERFLOW ANALYSIS]
--------------------------------------------------------------------------------
42 + 200 as uint8_t = 242 (no overflow, still fits)
42 + 200 as int8_t = -14 (OVERFLOW! Wraps negative)
Binary: 00101010 + 11001000 = 11110010 = -14 (two's complement)
[COMPARISON TRAPS]
--------------------------------------------------------------------------------
Warning: Signed/unsigned comparison hazards:
(int8_t)-1 > (uint8_t)200 is FALSE (both promote to int: -1 < 200)
But: (int)-1 > (unsigned)200 might surprise you!
-1 converted to unsigned int = 4294967295 (0xFFFFFFFF)
Comparison: 4294967295 > 200 = TRUE (usual arithmetic conversions)
$ ./bitwise-inspector -f 0.1
================================================================================
BITWISE DATA INSPECTION: 0.1 (float)
================================================================================
[IEEE-754 SINGLE PRECISION (32-bit)]
--------------------------------------------------------------------------------
Hex representation: 0x3DCCCCCD
Binary: 0 01111011 10011001100110011001101
^ ^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^
| | |
| | +-- Mantissa (23 bits): 1.10011001100110011001101
| +-- Exponent (8 bits): 123 - 127 = -4
+-- Sign bit: 0 (positive)
Value computation:
(-1)^0 * 1.10011001100110011001101 * 2^(-4)
= 1 * 1.60000002384185791015625 * 0.0625
= 0.10000000149011611938476562
PRECISION LOSS: You asked for 0.1, but got 0.10000000149011612
Error: +1.49e-09 (relative error: 1.49e-08)
[IEEE-754 DOUBLE PRECISION (64-bit)]
--------------------------------------------------------------------------------
Hex representation: 0x3FB999999999999A
Binary: 0 01111111011 1001100110011001100110011001100110011001100110011010
Value: 0.10000000000000000555111512312578270211815834045410156250
PRECISION LOSS: Error from exact 0.1 = +5.55e-18
[WHY 0.1 CANNOT BE EXACT]
--------------------------------------------------------------------------------
0.1 in binary is a repeating fraction: 0.0001100110011001100110011...
Just like 1/3 = 0.333... in decimal, 1/10 is infinite in binary.
IEEE-754 truncates this, causing the representation error.
$ ./bitwise-inspector -f inf
[IEEE-754 SPECIAL VALUES]
--------------------------------------------------------------------------------
+Infinity (float): 0x7F800000 Binary: 0 11111111 00000000000000000000000
-Infinity (float): 0xFF800000 Binary: 1 11111111 00000000000000000000000
+NaN (quiet): 0x7FC00000 Binary: 0 11111111 10000000000000000000000
-0.0 (float): 0x80000000 Binary: 1 00000000 00000000000000000000000
NaN behavior:
NaN == NaN is FALSE (NaN is not equal to anything, including itself)
NaN != NaN is TRUE
isnan(NaN) is TRUE
$ ./bitwise-inspector --edge-cases
[CRITICAL EDGE CASES TO REMEMBER]
--------------------------------------------------------------------------------
INT_MIN negation trap:
-(-2147483648) = -2147483648 (NOT 2147483648!)
Because 2147483648 cannot fit in int32_t
Signed overflow is UNDEFINED BEHAVIOR in C:
INT_MAX + 1 = undefined (compiler may assume it never happens)
Unsigned overflow is well-defined: wraps to 0
Float comparison epsilon:
0.1 + 0.2 == 0.3 is FALSE
|0.1 + 0.2 - 0.3| < epsilon is the correct approach
The Core Question You're Answering
"How does the machine actually store and manipulate numbers, and why do programmers keep getting bitten by edge cases they thought they understood?"
This project forces you to move from "I know two's complement exists" to "I can predict exactly which bit pattern will result from any operation." This is the foundation for understanding buffer overflows, integer vulnerabilities, floating-point precision issues in financial software, and why certain optimizations are (un)safe.
Concepts You Must Understand First
- Two's Complement Integer Representation
- How do you convert a negative number to its two's complement representation?
- What is the range of an N-bit two's complement integer?
- Why is there one more negative number than positive?
- How does negation work in two's complement? When does it fail?
- CS:APP Ch. 2.2 (Integer Representations)
- Unsigned vs Signed Integer Operations
- What happens when you cast a negative signed integer to unsigned?
- What is "sign extension" and when does it occur?
- How does C handle mixed signed/unsigned comparisons?
- What is the difference between arithmetic and logical right shift?
- CS:APP Ch. 2.2-2.3
- Integer Overflow and Undefined Behavior
- What happens when signed overflow occurs in C? (Hint: undefined behavior)
- What happens when unsigned overflow occurs? (Hint: well-defined wraparound)
- How can compilers exploit undefined behavior for optimization?
- CS:APP Ch. 2.3, Effective C Ch. 5
- IEEE-754 Floating Point Format
- What are the three components of an IEEE-754 number (sign, exponent, mantissa)?
- What is the "bias" in the exponent field? Why is it needed?
- What is the implicit leading 1 in normalized numbers?
- What are denormalized (subnormal) numbers and when do they occur?
- CS:APP Ch. 2.4 (Floating Point)
- Special Floating Point Values
- How are infinity, negative infinity, and NaN represented?
- What operations produce NaN? What operations produce infinity?
- Why is NaN != NaN true? How do you test for NaN?
- What is negative zero and how does it differ from positive zero?
- CS:APP Ch. 2.4.3-2.4.6
- Endianness and Memory Layout
- What is big-endian vs little-endian?
- How do you determine the endianness of your system?
- How does endianness affect multi-byte integer storage?
- CS:APP Ch. 2.1.3
Questions to Guide Your Design
- How will you parse different input formats (decimal, hex, binary, float literals)?
- How will you handle type specification - should the user specify int32 vs int64, or infer it?
- How will you display bit patterns - raw binary, grouped bytes, or both?
- How will you demonstrate overflow - show the computation, or just the result?
- How will you extract IEEE-754 fields - bit masking, unions, or memcpy?
- How will you make the output educational - just facts, or explanations of why?
- How will you handle invalid input or edge cases like NaN input?
Thinking Exercise
Before writing any code, work through these by hand:
// Exercise 1: Integer representation
int8_t a = -1;
uint8_t b = a;
// Question: What is the value of b? Draw the bit pattern.
// Exercise 2: Sign extension
int8_t x = -5;
int32_t y = x;
// Question: What bit pattern is y? How many 1s are in its binary representation?
// Exercise 3: Overflow
int8_t m = 127;
int8_t n = m + 1;
// Question: What is n? Is this defined behavior?
uint8_t p = 255;
uint8_t q = p + 1;
// Question: What is q? Is this defined behavior?
// Exercise 4: Signed/unsigned comparison
int x = -1;
unsigned int y = 1;
if (x < y) printf("x < y\n");
else printf("x >= y\n");
// Question: What prints and why?
// Exercise 5: Float representation
// Convert 12.375 to IEEE-754 single precision by hand:
// Step 1: Convert to binary: 12.375 = ?
// Step 2: Normalize: 1.??? x 2^?
// Step 3: Calculate biased exponent: ? + 127 = ?
// Step 4: Write final bit pattern: ? ? ?
// Exercise 6: Float precision
float a = 0.1f;
float b = 0.2f;
float c = 0.3f;
// Question: Is (a + b == c) true or false? What are the actual bit patterns?
The Interview Questions They'll Ask
- "Explain two's complement and why we use it instead of sign-magnitude"
- They want: addition works the same for signed/unsigned, only one zero, simple negation (flip bits + 1), hardware efficiency
- Know: the asymmetry (-128 to 127 for int8_t)
- "What happens when you compare a signed and unsigned integer in C?"
- They want: signed is converted to unsigned, which can cause -1 > 1 to be true
- Bonus: explain that this is a common source of security vulnerabilities
- "Why can't 0.1 be represented exactly in floating point?"
- They want: 0.1 is a repeating fraction in binary, IEEE-754 has finite precision
- Know: never compare floats with ==, use epsilon comparison
- "What is undefined behavior and why does signed overflow cause it?"
- They want: compiler can assume UB never happens, enables optimizations
- Example: if (x + 1 > x) can be optimized to if (true) because signed overflow is UB
- "How would you detect if an integer addition will overflow before it happens?"
- They want: for signed, check if signs match and result sign differs; for unsigned, check if result < either operand
- Bonus: mention compiler built-ins like __builtin_add_overflow
- "Explain denormalized floating point numbers"
- They want: gradual underflow, implicit leading 0 instead of 1, fills gap between 0 and smallest normalized
- Know: they have reduced precision but prevent abrupt underflow to zero
Hints in Layers
Hint 1 - Getting Started: Start with integer display. Use a union or memcpy to view raw bytes:
void show_bytes(void *ptr, size_t len) {
unsigned char *bytes = (unsigned char *)ptr;
for (size_t i = 0; i < len; i++) {
printf("%02x ", bytes[i]);
}
printf("\n");
}
int x = -1;
show_bytes(&x, sizeof(x)); // ff ff ff ff on little-endian
Hint 2 - Extracting IEEE-754 Fields: Use bit manipulation to extract sign, exponent, and mantissa:
typedef union {
float f;
uint32_t u;
} float_bits;
void decompose_float(float f) {
float_bits fb = { .f = f };
uint32_t sign = (fb.u >> 31) & 1;
uint32_t exponent = (fb.u >> 23) & 0xFF;
uint32_t mantissa = fb.u & 0x7FFFFF;
int actual_exp = (int)exponent - 127; // Cast first so the bias subtraction is signed
printf("Sign: %u, Exp: %d (biased: %u), Mantissa: 0x%06X\n",
sign, actual_exp, exponent, mantissa);
}
Hint 3 - Detecting Overflow: For unsigned addition, overflow occurred if result < either operand:
int unsigned_add_overflows(unsigned a, unsigned b) {
return (a + b) < a;
}
// For signed, use compiler built-ins or check manually:
int signed_add_overflows(int a, int b) {
return __builtin_add_overflow(a, b, &(int){0});
}
Hint 4 - Printing Binary: Create a helper to print any integer as binary with grouping:
void print_binary(uint64_t val, int bits) {
for (int i = bits - 1; i >= 0; i--) {
printf("%c", (val >> i) & 1 ? '1' : '0');
if (i > 0 && i % 8 == 0) printf(" ");
}
printf("\n");
}
Hint 5 - Special Float Detection: Detect special values using the bit pattern:
int is_nan(float f) {
float_bits fb = { .f = f };
uint32_t exp = (fb.u >> 23) & 0xFF;
uint32_t mantissa = fb.u & 0x7FFFFF;
return exp == 255 && mantissa != 0;
}
int is_infinity(float f) {
float_bits fb = { .f = f };
uint32_t exp = (fb.u >> 23) & 0xFF;
uint32_t mantissa = fb.u & 0x7FFFFF;
return exp == 255 && mantissa == 0;
}
int is_denormalized(float f) {
float_bits fb = { .f = f };
uint32_t exp = (fb.u >> 23) & 0xFF;
return exp == 0 && f != 0.0f;
}
Hint 6 - Tool Structure: Organize your tool with clear separation:
// parser.c - parse input strings to values
// integer.c - integer analysis and display
// float.c - IEEE-754 analysis and display
// display.c - formatted output
struct inspection_result {
enum { INT_TYPE, FLOAT_TYPE } type;
union {
struct {
int64_t signed_val;
uint64_t unsigned_val;
int bits;
} integer;
struct {
double value;
int precision; // 32 or 64
} floating;
} data;
};
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Integer representations (two's complement) | Computer Systems: A Programmer's Perspective | Ch. 2.2 Integer Representations |
| Integer arithmetic and overflow | Computer Systems: A Programmer's Perspective | Ch. 2.3 Integer Arithmetic |
| IEEE-754 floating point | Computer Systems: A Programmer's Perspective | Ch. 2.4 Floating Point |
| Bit manipulation techniques | The C Programming Language (K&R) | Ch. 2.9 Bitwise Operators |
| Safe integer operations | Effective C | Ch. 5 Integer Security |
| Undefined behavior | Effective C | Ch. 2 Objects, Functions, Types |
| Data representation overview | Write Great Code Vol. 1 | Ch. 2-4 (Numeric Representation) |
| C type conversion rules | C Programming: A Modern Approach | Ch. 7 Basic Types |
| Pointer and integer relationships | Understanding and Using C Pointers | Ch. 4 Pointers and Arrays |
| Low-level data representation | Low-Level Programming | Ch. 2 Assembly Language and Computer Architecture |
Phase 2: Machine-Level Mastery
Project 3: Data Lab Clone
| Attribute | Value |
|---|---|
| Language | C (alt: Rust, Zig, C++) |
| Difficulty | Advanced |
| Time | 1–2 weeks |
| Chapters | 2 |
| Coolness | ★★★☆☆ Practical |
| Portfolio Value | Resume Gold |
What you'll build: A framework that enforces restricted operator sets for exercises (e.g., only bitwise ops), runs randomized tests, and produces a scoreboard.
Why it matters: The restriction forces hardware-style thinking; the harness forces correctness under edge cases.
Core challenges:
- Enforcing restrictions mechanically (operator semantics)
- Property-based/randomized testing for corner cases (representation edge behavior)
- Producing clear failure explanations (debugging discipline)
Key concepts to master:
- Bit-level operator reasoning (Ch. 2)
- Undefined/implementation-defined behavior awareness (Effective C reference)
- Test-oracle thinking (Appendix)
Prerequisites: Solid C, comfort writing tests.
Deliverable: A repeatable, automated way to prove you can do "bit-twiddling under constraints" correctly.
Implementation hints:
- Make restrictions mechanical (scan source for disallowed tokens)
- Include adversarial values (min/max, boundaries, NaNs) in tests
Milestones:
- You derive bit identities without trial-and-error
- You can explain every failing case without "mystery"
- Your constraints prevent cheating, not just discourage it
Real World Outcome
When complete, you will have a testing framework that enforces bit-manipulation constraints:
$ ./datalab-runner puzzles/bitAnd.c
================================================================================
DATA LAB CLONE - PUZZLE VALIDATOR
================================================================================
[PUZZLE: bitAnd]
--------------------------------------------------------------------------------
Task: Compute x & y using only ~ and |
Allowed operators: ~ |
Max operations: 8
Your solution uses: 4 operations
[RESTRICTION CHECK]
--------------------------------------------------------------------------------
Scanning source for disallowed operators...
Line 5: Found '~' - ALLOWED
Line 5: Found '|' - ALLOWED
PASS: No disallowed operators found
[CORRECTNESS TESTS]
--------------------------------------------------------------------------------
Running exhaustive test for 8-bit inputs (65536 combinations)...
65536/65536 tests passed
Running random 32-bit tests (10000 iterations)...
10000/10000 tests passed
[RESULT: PASS]
================================================================================
Score: 4/4 (4 ops used, max 8 allowed)
$ ./datalab-runner --scoreboard
================================================================================
DATA LAB SCOREBOARD
================================================================================
Puzzle Status Ops Used Max Ops Score
--------------------------------------------------------------------------------
bitAnd PASS 4 8 2.0
bitXor PASS 7 8 1.5
isZero PASS 2 2 2.0
addOK FAIL - 20 0.0
-> Failed: addOK(0x7FFFFFFF, 1) expected 0, got 1
Total Score: 5.5 / 8.0
The Core Question You're Answering
"Can you think like the hardware - expressing computation using only the primitive operations a CPU actually has, while guaranteeing correctness for every possible input?"
Concepts You Must Understand First
- Boolean Algebra and Logic Gates
- How can you express AND using only OR and NOT? (De Morgan's Laws)
- CS:APP Ch. 2.1.6-2.1.8
- Bitwise Operations in C
- What is the difference between & and &&? Between | and ||?
- What is the difference between arithmetic and logical right shift?
- CS:APP Ch. 2.1.6-2.1.8, K&R Ch. 2.9
- Two's Complement Arithmetic
- How can you detect overflow using only bitwise operations?
- What is the relationship between ~x and -x-1?
- CS:APP Ch. 2.2-2.3
- Bit Manipulation Patterns
- How do you create a mask with the lowest N bits set?
- How do you extract a field of bits from a value?
- Hacker's Delight Ch. 2
Questions to Guide Your Design
- How will you detect disallowed operators - regex, parsing, or AST analysis?
- What test strategy will you use - exhaustive for small inputs, random for large?
- How will you score solutions - just pass/fail, or reward minimal operator usage?
Thinking Exercise
Solve these puzzles by hand:
Puzzle 1: bitAnd(x, y) - Compute x & y using only ~ and |
De Morgan: a & b = ~(~a | ~b)
Puzzle 2: isNegative(x) - Return 1 if x < 0. Use only >> and &
Hint: x >> 31 for 32-bit integers (careful: an arithmetic shift yields -1 for negatives, so mask with & 1)
The Interview Questions They'll Ask
- "Implement XOR using only AND, OR, and NOT" - x ^ y = (x & ~y) | (~x & y)
- "How do you detect if adding two integers will overflow?" - Check if the operands' signs match but the result's sign differs
- "What is the fastest way to check if a number is a power of 2?" - x && !(x & (x - 1))
- "How do you compute absolute value without branching?" - int mask = x >> 31; return (x ^ mask) - mask;
Hints in Layers
Hint 1: Create reference implementations and test against them
Hint 2: Test edge cases: 0, 1, -1, INT_MAX, INT_MIN, 0x55555555, 0xAAAAAAAA
Hint 3: Use TRACE macros to debug intermediate values
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Bitwise operations | Computer Systems: A Programmer's Perspective | Ch. 2.1.6-2.1.8 |
| Two's complement | Computer Systems: A Programmer's Perspective | Ch. 2.2-2.3 |
| Bit manipulation tricks | Hacker's Delight | Ch. 2 |
| C bitwise operators | The C Programming Language (K&R) | Ch. 2.9 |
| Safe integer operations | Effective C | Ch. 5 |
Project 4: x86-64 Calling Convention Crash Cart
| Attribute | Value |
|---|---|
| Language | C (alt: Rust, Zig, C++) |
| Difficulty | Advanced |
| Time | 1–2 weeks |
| Chapters | 3 |
| Coolness | ★★★★☆ Genuinely Clever |
| Portfolio Value | Resume Gold |
What you'll build: Tiny programs plus a standardized post-mortem report format that explains how stack frames, saved registers, and return addresses caused a crash.
Why it matters: Chapter 3 becomes usable only when you can debug from registers/stack bytes back to the source-level defect.
Core challenges:
- Mapping assembly to C constructs (code generation)
- Explaining stack layout and argument passing (ABI)
- Handling arrays/structs in machine terms (data layout + addressing)
Key concepts to master:
- x86-64 instruction patterns (Ch. 3)
- Stack discipline and procedure calls (Ch. 3)
- Arrays/structs and pointer arithmetic (Ch. 3)
Prerequisites: Comfort using a debugger (GDB/LLDB).
Deliverable: Given a crash address and debugger snapshot, you can write a clean narrative of what happened.
Implementation hints:
- Standardize your report: registers, stack window, disassembly window, C-source mapping
- Intentionally create classic failures: invalid pointer, stack smash, use-after-free
Milestones:
- You can explain a crash without guessing
- You recognize compiler-generated patterns (switch tables, loops, calls)
- You identify vulnerability classes by assembly signature
Real World Outcome
When complete, you will have crash scenarios with detailed post-mortem analysis:
$ ./crash-cart analyze core.12345
================================================================================
x86-64 CRASH CART - POST-MORTEM ANALYSIS
================================================================================
[CRASH SUMMARY]
Signal: SIGSEGV | Fault: 0x0 | Location: main+54 (vulnerable.c:23)
Cause: NULL pointer dereference
[REGISTERS]
RAX: 0x0000000000000000 <- NULL!
RBP: 0x7fffffffdd70 | RSP: 0x7fffffffdd50 | RIP: 0x401156
[STACK TRACE]
#0 main at vulnerable.c:23
#1 __libc_start_call_main
[STACK FRAME]
RBP+8: Return address
RBP: Saved RBP
RBP-24: ptr = NULL
[DISASSEMBLY]
0x401156: mov (%rax),%eax ; CRASH: deref NULL!
[ROOT CAUSE]
get_data() returned NULL, dereferenced without checking.
The Core Question You're Answering
"Given a crashed program and a debugger, can you reconstruct what happened?"
Concepts You Must Understand First
- x86-64 Register Conventions - Caller/callee-saved, argument registers (CS:APP Ch. 3.7)
- Stack Frame Layout - Return address, saved regs, locals (CS:APP Ch. 3.7.1-3.7.4)
- Calling Convention - Arguments and returns (CS:APP Ch. 3.7)
- Memory Safety Violations - NULL deref, overflow, use-after-free (CS:APP Ch. 3.10)
Questions to Guide Your Design
- What crash scenarios will you create?
- How will you standardize your report format?
Thinking Exercise
void greet(char *name) {
char buffer[16];
strcpy(buffer, name); // No bounds check!
}
Run with 32 'A's. Where is the return address relative to the buffer?
The Interview Questions They'll Ask
- "Walk me through a function call in x86-64"
- "How would you debug a segfault with only a core dump?"
- "What is a stack buffer overflow?"
- "What protections exist against buffer overflows?"
Hints in Layers
Hint 1: GDB commands: info registers, x/32xg $rsp, bt full
Hint 2: Report: Summary, Registers, Stack, Disasm, Root Cause
Hint 3: Create NULL deref, stack overflow, use-after-free scenarios
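A possible end-to-end warm-up for Hint 3 (a sketch; core-file handling varies by distro, so the gdb line re-runs the crash in-process instead of loading a dump, and /tmp is an arbitrary location):

```shell
# Reproduce a classic NULL-dereference crash for the crash cart to analyze.
cd /tmp
cat > crashme.c <<'EOF'
int main(void) { int *p = 0; return *p; }
EOF
gcc -g -o crashme crashme.c
ulimit -c unlimited || true             # allow core dumps (may be restricted)
./crashme || echo "exit status: $?"     # typically 139 = 128 + SIGSEGV(11)
gdb -batch -ex run -ex bt -ex 'info registers rip rax' ./crashme
```

Check /proc/sys/kernel/core_pattern to find where (or whether) your distro writes core files.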
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| x86-64 procedures | CS:APP | Ch. 3.7 |
| Buffer overflows | CS:APP | Ch. 3.10 |
| GDB debugging | The Art of Debugging | Ch. 1-4 |
| Binary analysis | Practical Binary Analysis | Ch. 6 |
Project 5: Bomb Lab Workflow
| Attribute | Value |
|---|---|
| Language | C (alt: Rust, Zig, C++) |
| Difficulty | Advanced |
| Time | 1–2 weeks |
| Chapters | 3 |
| Coolness | ★★★★★ Hardcore Tech Flex |
| Portfolio Value | Resume Gold |
What you'll build: A repeatable binary-puzzle playbook and annotated solutions for at least one full bomb instance: inputs, reasoning, and the exact assembly facts used.
Why it matters: It forces fluent reading of compiler output and tool-driven reasoning under constraints.
Core challenges:
- Extracting constraints from assembly (control flow + data movement)
- Verifying hypotheses via debugging (disciplined experimentation)
- Handling indirect jumps and lookup tables (machine-level control)
Key concepts to master:
- Control flow at machine level (Ch. 3)
- Debugger-driven reasoning (Ch. 3)
- Defensive reading of compiled code (Ch. 3 security discussion)
Prerequisites: Project 4 (or equivalent).
Deliverable: A written "defusal dossier" that proves you can reverse engineer a real x86-64 binary methodically.
Implementation hints:
- Write down each constraint as a testable statement before trying any input
- Prefer "prove constraints" over "try strings"
Milestones:
- You solve phases without brute force
- You generalize patterns across different binaries
- You can justify each solution in assembly terms
Real World Outcome
When you complete this project, you will have a "Defusal Dossier" documenting your systematic reverse engineering of a binary bomb:
$ objdump -t bomb | grep phase
0000000000400ee0 g F .text 000000000000002a phase_1
0000000000400efc g F .text 0000000000000052 phase_2
0000000000400f43 g F .text 000000000000003c phase_3
$ gdb ./bomb
(gdb) break phase_1
(gdb) run
(gdb) disas
0x0000000000400ee4 <+4>: mov $0x402400,%esi
0x0000000000400ee9 <+9>: call 0x401338 <strings_not_equal>
0x0000000000400ef2 <+18>: call 0x40143a <explode_bomb>
(gdb) x/s 0x402400
0x402400: "Border relations with Canada have never been better."
$ ./bomb solutions.txt
Congratulations! You've defused the bomb!
Your dossier documents each phase:
PHASE 1: String Comparison
Constraint: input must equal string at 0x402400
Evidence: mov $0x402400,%esi before strings_not_equal call
Solution: "Border relations with Canada have never been better."
The Core Question You're Answering
"How do I systematically extract program constraints from compiled machine code without source access?"
Concepts You Must Understand First
- x86-64 Instruction Semantics (CS:APP Ch. 3.4-3.6)
- What do lea vs mov do? How do cmp and test set condition codes?
- What is the difference between je, jl, jg, ja, jb?
- Calling Conventions (CS:APP Ch. 3.7)
- Where are arguments? (%rdi, %rsi, %rdx, %rcx, %r8, %r9)
- Where is the return value? (%rax)
- Control Flow Patterns (CS:APP Ch. 3.6)
- How does a for loop look in assembly?
- How does a switch compile (jump tables)?
- Data Access Patterns (CS:APP Ch. 3.8-3.9)
- How is array[i] computed? How are struct fields accessed?
Questions to Guide Your Design
- What tools will you use first? How do you identify โinterestingโ functions?
- How do you identify the โexplodeโ condition and work backwards?
- How do you test hypotheses before committing an answer?
- How do you recognize loop and recursive patterns?
Thinking Exercise
Trace this by hand before using GDB:
phase_mystery:
mov $0x4025cf,%edi
call sscanf
cmp $0x2,%eax
jg .L1
call explode_bomb
.L1:
cmpl $0x7,0x8(%rsp)
ja .L_explode
mov 0x8(%rsp),%eax
jmp *0x402470(,%rax,8)
(gdb) x/s 0x4025cf
0x4025cf: "%d %d %d"
What format does sscanf expect? What is jmp *0x402470(,%rax,8) doing?
The Interview Questions They'll Ask
- "Walk me through reverse engineering an unknown binary."
- "How do you identify a switch statement in x86-64 assembly?"
- "What's the difference between test %eax,%eax and cmp $0,%eax?"
- "How would you find a hidden function in a binary?"
- "What tools would you use for binary analysis?"
Hints in Layers
Layer 1: strings bomb | less and nm bomb | grep phase_
Layer 2: (gdb) break explode_bomb is your safety net
Layer 3 - String Pattern:
mov $ADDR,%esi
call strings_not_equal
test %eax,%eax
je .Lsuccess
call explode_bomb
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| x86-64 instructions | Computer Systems: A Programmerโs Perspective | Ch. 3.4-3.6 |
| Reverse engineering | Hacking: The Art of Exploitation | Ch. 3 |
| GDB mastery | The Art of Debugging with GDB, DDD, and Eclipse | Ch. 1-4 |
| Binary formats | Practical Binary Analysis | Ch. 2, 4 |
| Assembly | Low-Level Programming by Igor Zhirkov | Part II |
Project 6: Attack Lab Workflow
| Attribute | Value |
|---|---|
| Language | C (alt: Rust, Zig, C++) |
| Difficulty | Expert |
| Time | 2–3 weeks |
| Chapters | 3 |
| Coolness | ★★★★★ Hardcore Tech Flex |
| Portfolio Value | Resume Gold |
What you'll build: A controlled "vulnerable target lab" environment plus an exploitation journal documenting (a) the bug class, (b) memory layout evidence, and (c) the exact control-flow hijack achieved: first via code injection, then via ROP.
Why it matters: It turns Chapter 3's security discussion into concrete mechanics: stack discipline, calling conventions, and why mitigations matter.
Core challenges:
- Proving the overwrite boundary and control-flow takeover (stack layout evidence)
- Understanding executable protections and how they change tactics (mitigations reasoning)
- Constructing ROP chains from existing code fragments (machine-level composition)
Key concepts to master:
- Buffer overflows and stack discipline (Ch. 3)
- Return addresses and control transfers (Ch. 3)
- Defensive implications and mitigations (Ch. 3)
Prerequisites: Projects 4 and 5.
Deliverable: Demonstrate (in a sandbox) a reliable hijack and explain exactly why it worked and which mitigation would block it.
Implementation hints:
- Treat this as "learn to defend by learning to break," not as an offensive toolkit
- Journal entries must include memory-map evidence, not just outcomes
Milestones:
- You can reason about stack frames as an attack surface
- You can explain why NX/ASLR changes the game
- You can "read gadgets" the way you read assembly
Real World Outcome
When you complete this project, you will have an "Exploitation Journal" documenting your control-flow hijacking techniques:
# Phase 1: Code Injection Attack
$ ./hex2raw < exploit1.txt | ./ctarget -q
Cookie: 0x59b997fa
Type string:Touch1!: You called touch1()
Valid solution for level 1 with target ctarget
PASS: Would have posted the following:
user id bovik
course 15213-f15
lab attacklab
result 1:PASS:...
# Phase 2: ROP Attack (with NX enabled)
$ ./hex2raw < exploit5.txt | ./rtarget -q
Cookie: 0x59b997fa
Type string:Touch2!: You called touch2(0x59b997fa)
Valid solution for level 2 with target rtarget
PASS: Would have posted the following:
user id bovik
course 15213-f15
lab attacklab
result 1:PASS:...
Your exploitation journal documents the attack methodology:
EXPLOIT 1: Code Injection - Touch1
==================================
Vulnerability: gets() has no bounds checking
Stack Layout:
0x5561dc78: buffer start (40 bytes)
0x5561dca0: saved %rbp
0x5561dca8: return address <- OVERWRITE TARGET
Attack Vector:
- 40 bytes padding + address of touch1 (0x4017c0)
- Little-endian: c0 17 40 00 00 00 00 00
Payload: [40 bytes junk] [0x4017c0]
EXPLOIT 5: ROP Chain - Touch2
=============================
Mitigation: Stack is non-executable (NX bit)
Strategy: Chain existing code "gadgets" to set %rdi = cookie
Gadget Chain:
0x4019cc: popq %rax; ret # Pop cookie into %rax
0x4019c5: movq %rax,%rdi; ret # Move to first argument
0x4017ec: touch2 # Call target
Payload: [40 bytes] [0x4019cc] [cookie] [0x4019c5] [0x4017ec]
The Core Question You're Answering
"How do memory-safety vulnerabilities enable control-flow hijacking, and how do modern mitigations change the exploitation landscape?"
Concepts You Must Understand First
- Stack Frame Layout (CS:APP Ch. 3.7)
- Where is the return address stored relative to local variables?
- What happens when you write past the end of a buffer?
- Control Flow Hijacking (CS:APP Ch. 3.10.3-3.10.4)
- How does overwriting a return address redirect execution?
- What is the difference between code injection and ROP?
- Modern Mitigations (CS:APP Ch. 3.10.4)
- What is stack canary protection? When does it detect attacks?
- What is ASLR? How does it complicate exploitation?
- What is NX (DEP)? Why does it require ROP?
- Gadget Identification (Hacking: Art of Exploitation)
- What makes a useful gadget? (ends in `ret`)
- How do you chain gadgets to achieve computation?
Questions to Guide Your Design
- How do you determine the exact offset from buffer start to return address?
- How do you construct shellcode that fits in limited space?
- How do you find useful gadgets in a binary?
- How do you chain gadgets to pass arguments to functions?
Thinking Exercise
Before crafting any exploit, analyze this vulnerable function:
void getbuf() {
char buf[BUFFER_SIZE];
Gets(buf);
return;
}
getbuf:
sub $0x28,%rsp # Allocate 40 bytes
mov %rsp,%rdi # buf = %rsp
call Gets # Gets(buf) - no bounds check!
add $0x28,%rsp
ret
Questions:
- Where is `buf` located relative to the saved return address?
- How many bytes do you need to write to reach the return address?
- If you want to call `touch1` at 0x4017c0, what bytes do you write?
- Why must addresses be in little-endian format?
The Interview Questions They'll Ask
- "Explain how a buffer overflow attack works."
- "What is Return-Oriented Programming and why is it necessary?"
- "How does ASLR protect against exploitation?"
- "What is a stack canary and how does it work?"
- "How would you defend a system against memory-safety attacks?"
Hints in Layers
Layer 1: Use GDB to find exact stack layout: (gdb) x/20gx $rsp
Layer 2: For code injection, your shellcode runs from the buffer location
Layer 3 - Finding Gadgets:
# Look for "pop; ret" patterns
objdump -d rtarget | grep -A1 "pop"
# Common useful gadgets:
# 58 c3 popq %rax; ret
# 5f c3 popq %rdi; ret
# 48 89 c7 c3 movq %rax,%rdi; ret
Layer 4 - ROP Chain Structure:
[padding to return address]
[gadget1 address] <- first ret goes here
[value for pop] <- popped by gadget1
[gadget2 address] <- gadget1's ret goes here
[target function] <- final destination
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Buffer overflows | Computer Systems: A Programmerโs Perspective | Ch. 3.10.3-3.10.4 |
| Stack discipline | Computer Systems: A Programmerโs Perspective | Ch. 3.7 |
| Exploitation techniques | Hacking: The Art of Exploitation | Ch. 3, 5 |
| ROP fundamentals | Practical Binary Analysis | Ch. 10 |
| Modern mitigations | Low-Level Programming by Igor Zhirkov | Ch. 8-9 |
Phase 3: Architecture & Performance
Project 7: Y86-64 CPU Simulator
| Attribute | Value |
|---|---|
| Language | C (alt: Rust, Zig, C++) |
| Difficulty | Expert |
| Time | 1 month+ |
| Chapters | 4, 5 |
| Coolness | ★★★★★ Pure Magic |
| Portfolio Value | Resume Gold |
What you'll build: A Y86-64 interpreter plus a pipelined model that can emit per-cycle traces (stage contents, hazards, stalls, bubbles).
Why it matters: Chapter 4 is execution mechanics. Modeling a pipeline forces understanding of hazards and control logic.
Core challenges:
- Implementing ISA semantics correctly (instruction execution)
- Modeling hazards and pipeline control (pipelining)
- Validating equivalence between sequential and pipelined execution (correctness)
Key concepts to master:
- Datapath and control (Ch. 4)
- Pipelining and hazards (Ch. 4)
- Correctness vs performance (Ch. 5)
Prerequisites: Strong C, state machine mindset, patience for verification.
Deliverable: Run Y86-64 programs and produce a cycle-by-cycle "why it stalled here" trace.
Implementation hints:
- Start with a "golden" sequential interpreter
- Add pipeline stages as explicit state; treat each cycle as a deterministic transition
Milestones:
- Sequential simulator passes a suite of programs
- Pipelined model matches sequential results
- You can explain every stall/bubble with a specific hazard rule
Real World Outcome
When you complete this project, you will have a Y86-64 simulator with cycle-accurate pipeline tracing:
$ ./y86sim -s prog.yo
Y86-64 Sequential Simulator
Loaded program: prog.yo (156 bytes, 23 instructions)
Cycle PC Instruction Registers Changed
1 0x000 irmovq $0x100, %rsp %rsp = 0x100
2 0x00a call main %rsp = 0x0f8
3 0x058 addq %rdi, %rax %rax = 0xa
Execution complete: 47 cycles, status = HLT
$ ./y86sim -p prog.yo -trace
Y86-64 Pipelined Simulator (5-stage)
Cycle 5:
Fetch: addq %rdi, %rax
Decode: irmovq $0x0, %rax
Execute: irmovq $0xa, %rdi
Memory: call main
Writeback: irmovq $0x100, %rsp
*** HAZARD: Load-use data hazard ***
Action: STALL Fetch+Decode, BUBBLE in Execute
Summary: 52 cycles, 3 data hazards (2 stalls, 1 forwarded)
The Core Question You're Answering
"How does a pipelined processor execute instructions, and what hazards must be detected and resolved to maintain correctness?"
Concepts You Must Understand First
- Y86-64 ISA (CS:APP Ch. 4.1) - Instruction formats and semantics
- Sequential Processor (CS:APP Ch. 4.3) - Fetch, Decode, Execute, Memory, Writeback
- Pipelining (CS:APP Ch. 4.4) - Why it improves throughput
- Hazards (CS:APP Ch. 4.5) - Data (RAW) and control hazards
Questions to Guide Your Design
- How will you represent pipeline registers between stages?
- How will you detect data hazards at decode time?
- How will you implement forwarding?
- How will you handle branch mispredictions?
Thinking Exercise
0x000: irmovq $10, %rax
0x00a: irmovq $3, %rbx
0x014: addq %rax, %rbx # Depends on both previous
When addq reaches Decode, where can it get %rax and %rbx from? Stall or forward?
The Interview Questions They'll Ask
- "Explain the five stages of a classic RISC pipeline."
- "What is a data hazard and how is it resolved?"
- "What is forwarding/bypassing?"
- "What happens on a branch misprediction?"
Hints in Layers
Layer 1: typedef struct { uint8_t icode:4; uint8_t ifun:4; ... } instruction_t;
Layer 2 - Hazard Detection: bool hazard = (D_srcA == E_dstE);
Layer 3: Load-use hazards require stalling, not just forwarding.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Y86-64 ISA | Computer Systems: A Programmerโs Perspective | Ch. 4.1 |
| Pipelining | Computer Systems: A Programmerโs Perspective | Ch. 4.4-4.5 |
| Hazards | Computer Organization and Design (Patterson) | Ch. 4.5-4.7 |
Project 8: Performance Clinic
| Attribute | Value |
|---|---|
| Language | C (alt: Rust, Zig, C++) |
| Difficulty | Advanced |
| Time | 1–2 weeks |
| Chapters | 5, 6, 1 |
| Coolness | ★★★★☆ Genuinely Clever |
| Portfolio Value | Micro-SaaS/Pro Tool |
What you'll build: A benchmark suite of small kernels plus a written optimization report explaining changes in terms of ILP, branch prediction, and locality.
Why it matters: Chapter 5 is about turning "fast" into measurable, explainable transformations.
Core challenges:
- Stable measurements (methodology)
- Transformations that improve ILP / reduce mispredicts (CPU behavior)
- Avoiding โfaster by accidentโ (experimental rigor)
Key concepts to master:
- Loop transformations and tuning (Ch. 5)
- Bottlenecks: compute vs memory (Ch. 5-6)
- Limits: Amdahl's Law intuition (Ch. 1)
Prerequisites: Project 1.
Deliverable: A portfolio-quality report with before/after results and a strong "why" narrative.
Implementation hints:
- Keep kernels tiny; control the environment; log everything needed to reproduce
Milestones:
- Measurements become stable and repeatable
- You can predict when an optimization backfires
- You explain improvements as architecture effects, not folklore
Real World Outcome
When you complete this project, you will have a benchmark suite with detailed performance analysis:
$ ./perfclinic --kernel=dotprod --optimize
Performance Clinic: Dot Product Kernel
=======================================
BASELINE (naive implementation):
for (i = 0; i < n; i++)
sum += a[i] * b[i];
Cycles: 4,892,341
CPE (Cycles Per Element): 4.89
Bottleneck: Loop-carried dependency on 'sum'
OPTIMIZATION 1: Loop Unrolling (4x)
for (i = 0; i < n; i += 4) {
sum += a[i]*b[i] + a[i+1]*b[i+1] +
a[i+2]*b[i+2] + a[i+3]*b[i+3];
}
Cycles: 2,456,782
CPE: 2.46 (1.99x speedup)
Why: Reduced loop overhead, but still serialized on 'sum'
OPTIMIZATION 2: Multiple Accumulators
for (i = 0; i < n; i += 4) {
sum0 += a[i]*b[i]; sum1 += a[i+1]*b[i+1];
sum2 += a[i+2]*b[i+2]; sum3 += a[i+3]*b[i+3];
}
sum = sum0 + sum1 + sum2 + sum3;
Cycles: 1,234,567
CPE: 1.23 (3.97x speedup over baseline)
Why: Breaks loop-carried dependency, enables ILP
Theoretical limit: CPE ~1.0 (FP latency = 4, throughput = 1)
OPTIMIZATION 3: SIMD (AVX)
__m256d sum_vec = _mm256_setzero_pd();
for (i = 0; i < n; i += 4) {
sum_vec = _mm256_fmadd_pd(
_mm256_load_pd(&a[i]),
_mm256_load_pd(&b[i]), sum_vec);
}
Cycles: 312,456
CPE: 0.31 (15.7x speedup over baseline)
Why: 4 elements per SIMD operation
Performance Profile (perf stat):
Instructions: 1,234,567
Cycles: 312,456
IPC: 3.95 (superscalar)
L1 cache misses: 0.02%
Branch mispred: 0.01%
The Core Question You're Answering
"How do I measure, explain, and improve program performance in terms of CPU microarchitecture effects?"
Concepts You Must Understand First
- Latency vs Throughput (CS:APP Ch. 5.7)
- What is the latency of a floating-point multiply?
- What is the throughput (operations per cycle)?
- Loop-Carried Dependencies (CS:APP Ch. 5.8)
- Why does a sequential sum limit performance?
- How do multiple accumulators help?
- Instruction-Level Parallelism (CS:APP Ch. 5.9)
- How many independent operations can execute per cycle?
- What limits ILP in practice?
- Branch Prediction (CS:APP Ch. 5.12)
- What patterns are predictable?
- How do mispredictions affect performance?
- Memory Hierarchy Effects (CS:APP Ch. 6)
- When is a kernel compute-bound vs memory-bound?
- How does cache locality affect performance?
Questions to Guide Your Design
- How will you ensure stable, reproducible measurements?
- How will you identify the bottleneck (compute, memory, branches)?
- How will you verify your optimization actually helps?
- How will you explain WHY the optimization works?
Thinking Exercise
Before optimizing, analyze this loop:
double poly(double a[], double x, int degree) {
double result = a[0];
double xpwr = x;
for (int i = 1; i <= degree; i++) {
result += a[i] * xpwr;
xpwr *= x;
}
return result;
}
Questions:
- What is the loop-carried dependency?
- What is the theoretical minimum CPE?
- How would Horner's method change the dependency pattern?
- Would loop unrolling help? Why or why not?
The Interview Questions They'll Ask
- "How do you identify a performance bottleneck?"
- "Explain instruction-level parallelism."
- "What is loop unrolling and when does it help?"
- "How do branch mispredictions affect performance?"
- "When is a program compute-bound vs memory-bound?"
Hints in Layers
Layer 1 - Stable Measurement:
# Disable turbo boost, set governor to performance
sudo cpupower frequency-set -g performance
# Run multiple trials, report median
for i in {1..10}; do ./bench; done | sort -n | head -5 | tail -1
Layer 2 - Profiling:
perf stat -e cycles,instructions,cache-misses ./bench
perf record ./bench && perf report
Layer 3 - Multiple Accumulators:
// Transform: sum += a[i] * b[i];
// Into:
double sum0=0, sum1=0, sum2=0, sum3=0;
for (i = 0; i < n; i += 4) {
sum0 += a[i]*b[i]; sum1 += a[i+1]*b[i+1];
sum2 += a[i+2]*b[i+2]; sum3 += a[i+3]*b[i+3];
}
return sum0 + sum1 + sum2 + sum3;
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Performance optimization | Computer Systems: A Programmerโs Perspective | Ch. 5 |
| Memory hierarchy | Computer Systems: A Programmerโs Perspective | Ch. 6 |
| Limits of parallelism | Computer Systems: A Programmerโs Perspective | Ch. 5.9-5.11 |
| CPU microarchitecture | Write Great Code Vol 1 | Ch. 3-4 |
| SIMD programming | Write Great Code Vol 2 | Ch. 12-14 |
| Profiling tools | Linux System Programming | Ch. 10 |
Project 9: Cache Lab++ – Cache Simulator + Locality Visualizer
| Attribute | Value |
|---|---|
| Language | C (alt: Rust, Zig, C++) |
| Difficulty | Advanced |
| Time | 2–3 weeks |
| Chapters | 6, 5 |
| Coolness | ★★★★★ Hardcore Tech Flex |
| Portfolio Value | Resume Gold |
What you'll build: A set-associative cache simulator plus an "ASCII locality visualizer" that shows hit/miss patterns for selected code paths.
Why it matters: Chapter 6 lands only when you can simulate misses and then change code to improve locality.
Core challenges:
- Tag/index/offset logic correctness (cache organization)
- Replacement policy and statistics (behavior)
- Improving a real kernel via locality (spatial/temporal locality)
Key concepts to master:
- Cache organization and locality (Ch. 6)
- Miss types and their causes (Ch. 6)
- Measurement discipline (Ch. 5)
Prerequisites: Projects 2 and 8 recommended.
Deliverable: Demonstrate a miss-rate reduction with a locality explanation.
Implementation hints:
- Produce both aggregate stats and per-access event logs
- Use deliberately-designed access patterns to isolate compulsory/conflict/capacity misses
Milestones:
- Simulator matches known traces
- You can explain each miss type with concrete scenarios
- You can design data layouts to target cache behavior
Real World Outcome
$ ./csim -v -s 4 -E 2 -b 4 -t traces/matrix_multiply.trace
Cache Configuration:
Sets: 16 (s=4), Lines per set: 2 (E=2), Block size: 16 bytes (b=4)
Total cache size: 512 bytes
Processing trace: traces/matrix_multiply.trace
---------------------------------------------------
L 0x00601040, 8 miss [Set 4: loaded block 0x00601040]
L 0x00601048, 8 hit [Set 4: block 0x00601040 still valid]
S 0x00602080, 8 miss [Set 8: loaded block 0x00602080]
L 0x00601050, 8 miss [Set 5: loaded block 0x00601050]
L 0x00601058, 8 hit [Set 5: block 0x00601050 still valid]
L 0x00601100, 8 miss [Set 0: loaded block 0x00601100]
L 0x00601108, 8 hit [Set 0: block 0x00601100 still valid]
S 0x00602088, 8 hit [Set 8: block 0x00602080 still valid]
L 0x00601060, 8 miss [Set 6: loaded block 0x00601060]
L 0x00601180, 8 miss [Set 8: evict LRU, loaded block 0x00601180]
...
Summary:
hits: 4,847 misses: 1,153 evictions: 641
hit rate: 80.8% miss rate: 19.2%
miss breakdown: compulsory=256 capacity=512 conflict=385
$ ./csim -locality traces/matrix_multiply.trace
Locality Visualization (temporal window=8 accesses)
===================================================
Address Heat Map (most accessed blocks):
Block 0x00601040: ████████████████████ 847 accesses (hot)
Block 0x00601100: ████████████████ 672 accesses
Block 0x00602080: ████████████ 501 accesses
Block 0x00601180: ████████ 334 accesses
...
Access Pattern Timeline (showing set utilization):
Time →  0     100   200   300   400   500
Set 0: █░█░█░█░█░█░█░█░█░█░█░█░█░█░█░ (strided)
Set 4: ██████████████████████████████ (temporal locality - HOT)
Set 8: █░░█░░█░░█░░█░░█░░█░░█░░█░░█░░ (interleaved - thrashing!)
Spatial Locality Score: 0.73 (good - sequential block access)
Temporal Locality Score: 0.45 (moderate - reuse distance varies)
Recommendation: Consider blocking/tiling to improve temporal locality
Current working set estimate: 2.3 KB
Cache capacity: 512 bytes
Suggested tile size: 8x8 elements (fits in cache)
The Core Question You're Answering
"How does the memory hierarchy create the illusion of fast, infinite memory, and how can I write code that exploits locality to make this illusion work in my favor?"
Concepts You Must Understand First
- Cache Organization (sets, lines, blocks)
- How do you compute which set an address maps to?
- What is the difference between direct-mapped, set-associative, and fully associative?
- Given address `0x12345678` and a 4-way set-associative cache with 64 sets and 32-byte blocks, which set does this address map to?
- Tag/Index/Offset Address Decomposition
- How do you split a 64-bit address into tag, index, and offset fields?
- What determines the size of each field?
- Why must block size be a power of 2?
- Book: CS:APP Chapter 6.4.1-6.4.3
- The Three Types of Cache Misses
- What causes compulsory (cold) misses? Can they be eliminated?
- What causes capacity misses? How do you detect them?
- What causes conflict misses? Why do they occur even when cache is not full?
- Book: CS:APP Chapter 6.4.4 (Issues with Writes) and 6.4.5 (Cache Performance)
- Replacement Policies
- How does LRU (Least Recently Used) work? What data structure tracks recency?
- What is the performance difference between LRU and random replacement?
- Book: CS:APP Chapter 6.4.2 (Set Associative Caches)
- Spatial and Temporal Locality
- What code patterns exhibit temporal locality? Spatial locality?
- Why does row-major vs column-major iteration matter for 2D arrays?
- How does stride length affect cache performance?
- Book: CS:APP Chapter 6.2 (Locality) and 6.5 (Writing Cache-Friendly Code)
- Working Set and Cache Thrashing
- What is a working set? How do you estimate it?
- When does thrashing occur? What are the symptoms?
- Book: CS:APP Chapter 6.3 (Memory Hierarchy) and OSTEP Chapter 22
Questions to Guide Your Design
- Data Structure Choice: How will you represent a cache line? What fields do you need (valid bit, tag, LRU counter, dirty bit)?
- Address Parsing: Will you use bit manipulation or arithmetic to extract tag/index/offset? Which is clearer?
- LRU Implementation: Will you use counters, a linked list, or bit manipulation for tracking LRU? What are the tradeoffs?
- Trace Format: How will you parse the Valgrind lackey trace format? What about other formats?
- Statistics Tracking: How will you distinguish compulsory from capacity from conflict misses?
- Visualization: How will you represent temporal patterns? Access heat? Set utilization?
Thinking Exercise
Consider this code and trace what happens in a direct-mapped cache with 4 sets, 16-byte blocks:
// Array A is at address 0x1000, Array B is at address 0x1100
// Each int is 4 bytes
int A[64], B[64]; // A at 0x1000, B at 0x1100
for (int i = 0; i < 64; i++) {
A[i] = B[i] + 1; // Load B[i], then store A[i]
}
Hand-trace questions:
- Address of `A[0]`? Of `B[0]`? What set does each map to?
- Address of `A[4]`? Of `B[4]`? (Hint: stride of 16 bytes)
- Do `A[0]` and `B[0]` map to the same set? What about `A[4]` and `B[4]`?
- On iteration `i=0`: What happens when loading `B[0]`? (miss/hit?)
- On iteration `i=0`: What happens when storing `A[0]`? Does it evict `B[0]`'s block?
- On iteration `i=4`: What happens? Do we reload `B[4]` or is it already cached?
- What is the miss rate for this loop? Can you predict it before simulating?
- How would you restructure this code to improve locality?
The Interview Questions They'll Ask
- "Explain how a CPU cache works and why it matters for performance."
- Expected: Set/line/block organization, locality exploitation, miss penalty discussion
- "You're seeing poor performance in your matrix multiplication. How would you diagnose if it's a cache issue?"
- Expected: Profiling tools (perf, cachegrind), miss rate analysis, working set estimation
- "What is cache thrashing and how would you fix it?"
- Expected: Conflict misses from aliasing, solutions include padding, blocking/tiling, changing data layout
- "Explain the difference between temporal and spatial locality. Give code examples of each."
- Expected: Temporal = reusing same data, Spatial = accessing nearby addresses, concrete loop examples
- "Why does iterating a 2D array row-by-row vs column-by-column have such different performance?"
- Expected: Memory layout (row-major in C), spatial locality, stride analysis
- "Design a cache-friendly algorithm for transposing a large matrix."
- Expected: Blocking/tiling to fit working set in cache, discussion of tile size selection
Hints in Layers
Layer 1 - Getting Started: Start by parsing the trace format and printing each access. Implement a direct-mapped cache first (E=1) before handling set-associativity.
// Trace line format: "L 0x00601040, 8" means Load address 0x601040, size 8
typedef struct {
char op; // 'L' load, 'S' store, 'M' modify (load+store)
uint64_t address;
int size;
} trace_entry_t;
Layer 2 - Address Decomposition: The key insight is that tag, index, and offset are just different bit ranges of the address:
// For a cache with s index bits, b offset bits:
uint64_t offset = address & ((1ULL << b) - 1);
uint64_t set_index = (address >> b) & ((1ULL << s) - 1);
uint64_t tag = address >> (s + b);
Layer 3 - Cache Line Structure: Think about what state you need per line:
typedef struct {
int valid; // Is this line holding data?
uint64_t tag; // Tag bits from the address
uint64_t lru_counter; // For LRU replacement (higher = more recent)
// Note: You don't need to store actual data for simulation!
} cache_line_t;
Layer 4 - Miss Classification: To classify misses, track additional state:
// Compulsory: First access to this block ever (track in a set of seen blocks)
// Conflict: Cache not full, but eviction occurred
// Capacity: Would miss even with fully-associative cache of same size
// Hint: Run two simulations - one with your cache, one with "infinite" associativity
Layer 5 - LRU Implementation: For small associativity (E <= 8), a simple counter approach works:
// On cache hit or fill, update LRU counters:
for (int i = 0; i < E; i++) {
if (set[i].lru_counter < accessed_line->lru_counter)
set[i].lru_counter++; // Age other lines
}
accessed_line->lru_counter = 0; // Most recently used = 0
// For eviction: find line with highest lru_counter
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Cache organization and design | Computer Systems: A Programmerโs Perspective | Ch. 6.4 Cache Memories |
| Writing cache-friendly code | Computer Systems: A Programmerโs Perspective | Ch. 6.5 Writing Cache-Friendly Code |
| Impact on matrix operations | Computer Systems: A Programmerโs Perspective | Ch. 6.6 Cache Performance |
| Memory hierarchy overview | Computer Systems: A Programmerโs Perspective | Ch. 6.1-6.3 |
| Virtual memory and caching | Operating Systems: Three Easy Pieces | Ch. 19-22 (Memory Virtualization) |
| Cache design tradeoffs | Computer Organization and Design (Patterson & Hennessy) | Ch. 5.3-5.4 |
| Practical cache analysis | Linux System Programming (Robert Love) | Ch. 9 Memory Management |
| Performance measurement | Computer Systems: A Programmerโs Perspective | Ch. 5 Optimizing Program Performance |
Phase 4: Systems Programming
Project 10: ELF Link Map & Interposition Toolkit
| Attribute | Value |
|---|---|
| Language | C (alt: Rust, Zig, C++) |
| Difficulty | Advanced |
| Time | 2โ3 weeks |
| Chapters | 7 |
| Coolness | โ โ โ โ โ Hardcore Tech Flex |
| Portfolio Value | Service & Support |
What you'll build: A tool that summarizes symbol/relocation info for ELF objects and demonstrates dynamic interposition (function call hooking) with evidence logs.
Why it matters: It makes symbols, relocation, and runtime resolution concrete.
Core challenges:
- Parsing ELF structures (object file format)
- Explaining relocation and binding (static + dynamic linking)
- Demonstrating interposition safely (loader behavior)
Key concepts to master:
- Relocation and symbol resolution (Ch. 7)
- Static vs dynamic linking tradeoffs (Ch. 7)
- Library interpositioning (Ch. 7)
Prerequisites: Linux environment (VM/container), basic binary tooling.
Deliverable: Prove and explain "why my program called that function from that library."
Implementation hints:
- Keep introspection read-only first
- Interposition logs must include caller, callee, and resolved address evidence
Milestones:
- You interpret link maps confidently
- You explain PLT/GOT behavior without hand-waving
- You use interposition to debug/profile real programs
Real World Outcome
$ ./elfmap /usr/bin/ls
ELF Analysis: /usr/bin/ls
================================================================================
Type: ELF64 Executable (dynamically linked)
Entry point: 0x00005850
Interpreter: /lib64/ld-linux-x86-64.so.2
Section Headers:
[Nr] Name Type Address Offset Size
[ 0] NULL 0000000000000000 00000000 0
[ 1] .interp PROGBITS 0000000000000318 00000318 28
[11] .plt PROGBITS 0000000000005020 00005020 1424
[12] .plt.got PROGBITS 0000000000005590 00005590 24
[13] .text PROGBITS 00000000000055b0 000055b0 73521
[24] .got PROGBITS 000000000021ff98 0001ff98 104
[25] .got.plt PROGBITS 0000000000220000 00020000 728
[26] .data PROGBITS 00000000002202e0 000202e0 616
[27] .bss NOBITS 0000000000220548 00020548 4824
Symbol Table (.dynsym) - 118 entries:
Type Bind Name Library
FUNC GLOBAL printf libc.so.6
FUNC GLOBAL malloc libc.so.6
FUNC GLOBAL __libc_start_main libc.so.6
FUNC GLOBAL strcmp libc.so.6
FUNC GLOBAL opendir libc.so.6
FUNC WEAK __gmon_start__ (undefined)
...
Relocation Entries (.rela.plt) - 89 entries:
Offset Info Type Symbol + Addend
0000000220018 000100000007 R_X86_64_JUMP_SLOT printf@GLIBC_2.2.5 + 0
0000000220020 000200000007 R_X86_64_JUMP_SLOT malloc@GLIBC_2.2.5 + 0
0000000220028 000300000007 R_X86_64_JUMP_SLOT __libc_start_main + 0
Dynamic Dependencies:
NEEDED: libselinux.so.1
NEEDED: libc.so.6
$ ./elfmap --plt-trace /usr/bin/ls
PLT/GOT Lazy Binding Trace:
=============================
Before first call to printf():
GOT[printf] @ 0x220018 = 0x5026 (points to PLT stub)
[CALL] printf@plt (first call)
-> PLT stub pushes reloc index, jumps to resolver
-> ld.so resolves printf to 0x7f3a2c4a5c40 (libc.so.6)
-> GOT[printf] updated: 0x5026 -> 0x7f3a2c4a5c40
After first call:
GOT[printf] @ 0x220018 = 0x7f3a2c4a5c40 (direct to libc)
[CALL] printf (second call)
-> Direct jump via GOT, no resolver needed
$ ./interpose malloc ./myprogram arg1 arg2
=== Interposition Library Loaded ===
Wrapping: malloc, free, calloc, realloc
[14:23:45.001] malloc(64) = 0x55a3b2c00010 [caller: 0x55a3b1a00a32 main+18]
[14:23:45.002] malloc(1024) = 0x55a3b2c00060 [caller: 0x55a3b1a00a58 main+56]
[14:23:45.003] malloc(256) = 0x55a3b2c00470 [caller: 0x55a3b1a00b12 process_data+22]
[14:23:45.004] free(0x55a3b2c00060) [caller: 0x55a3b1a00b98 process_data+158]
[14:23:45.005] realloc(0x55a3b2c00010, 128) = 0x55a3b2c00010 [caller: 0x55a3b1a00c04 resize_buffer+12]
=== Interposition Summary ===
Total allocations: 47
Total frees: 45
Current heap usage: 384 bytes
Peak heap usage: 8,192 bytes
Potential leaks: 2 blocks (384 bytes)
- 0x55a3b2c00470 (256 bytes) allocated at main+56
- 0x55a3b2c00590 (128 bytes) allocated at process_data+98
The Core Question You're Answering
"How does a collection of separately compiled object files become a running program, and how can I observe and modify the symbol resolution process at runtime?"
Concepts You Must Understand First
- ELF File Format Structure
- What are the major components of an ELF file (headers, sections, segments)?
- What is the difference between sections and segments? When is each used?
- What information does the ELF header contain?
- Book: CS:APP Chapter 7.4 (Relocatable Object Files) and The Linux Programming Interface Ch. 41
- Symbol Tables and Symbol Resolution
- What is a symbol? What types of symbols exist (global, local, weak)?
- How does the linker resolve duplicate symbol definitions?
- What happens with unresolved symbols?
- Book: CS:APP Chapter 7.5 (Symbols and Symbol Tables) and 7.6 (Symbol Resolution)
- Relocation Process
- Why is relocation necessary? What problem does it solve?
- What information is in a relocation entry?
- What are PC-relative vs absolute relocations?
- Book: CS:APP Chapter 7.7 (Relocation)
- Static vs Dynamic Linking
- What are the tradeoffs between static and dynamic linking?
- When is each appropriate?
- What is a shared library? How does it differ from a static archive?
- Book: CS:APP Chapter 7.10 (Dynamic Linking with Shared Libraries)
- PLT and GOT (Lazy Binding)
- What is the Procedure Linkage Table? The Global Offset Table?
- How does lazy binding work? What triggers resolution?
- Why does the first call to a library function take longer?
- Book: CS:APP Chapter 7.12 (Position-Independent Code) and Practical Binary Analysis Ch. 2
- Library Interposition
- What is function interposition? Why is it useful?
- What are compile-time, link-time, and runtime interposition?
- How does LD_PRELOAD work?
- Book: CS:APP Chapter 7.13 (Library Interpositioning)
Questions to Guide Your Design
- ELF Parsing Strategy: Will you parse the ELF manually, use libelf, or memory-map and cast to structures?
- Output Format: How will you present symbol tables and relocations in a human-readable way? What groupings help understanding?
- Cross-Reference: How will you show which relocations reference which symbols?
- Dynamic Analysis: How will you trace PLT/GOT behavior at runtime? ptrace? Interposition?
- Interposition Library: What functions will you interpose? How will you call the original function?
- Evidence Logging: What information must you capture to prove "this call went through this resolution path"?
Thinking Exercise
Consider this scenario with two object files being linked:
// main.c
extern int counter;
extern void increment(void);
int main(void) {
increment();
return counter;
}
// lib.c
int counter = 0;
void increment(void) {
counter++;
}
Compile to object files and examine:
gcc -c main.c -o main.o
gcc -c lib.c -o lib.o
Hand-trace questions:
- In `main.o`, what symbols are UNDEFINED? What symbols are defined?
- In `lib.o`, what symbols are defined? Are they global or local?
- What relocation entries does `main.o` have? What type are they?
- When the linker processes these files, how does it resolve the `counter` reference in main.o?
- If you add `static` to `counter` in lib.c, what error do you get? Why?
- If you add a second file with `int counter = 5;`, what happens? (Strong vs weak symbols)
- Now compile as a shared library: `gcc -fPIC -shared lib.c -o libmylib.so`. What changes in the relocation types?
- How would the GOT entry for `counter` get filled at runtime?
The Interview Questions They'll Ask
- "Explain the difference between static and dynamic linking. When would you choose each?"
- Expected: Tradeoffs (startup time, memory sharing, updates, deployment), concrete scenarios
- "What happens when you call printf() for the first time in a dynamically linked program?"
- Expected: PLT stub, GOT lookup, lazy binding, runtime linker resolution, GOT update
- "How would you intercept all malloc calls in a program without modifying its source?"
- Expected: LD_PRELOAD, dlsym with RTLD_NEXT, wrapper function pattern
- "What is Position-Independent Code and why is it needed for shared libraries?"
- Expected: Load address independence, PC-relative addressing, GOT for data references
- "You're debugging a program that crashes in a library function. How do you determine which library provided that function?"
- Expected: ldd, /proc/PID/maps, nm, readelf, examining PLT/GOT at crash time
- "Explain the One Definition Rule and how the linker handles multiple definitions."
- Expected: Strong vs weak symbols, resolution rules, static keyword effect
Hints in Layers
Layer 1 - Getting Started: Use existing tools to understand the format before parsing yourself:
# See all sections
readelf -S /usr/bin/ls
# See symbol table
readelf -s /usr/bin/ls
# See relocations
readelf -r /usr/bin/ls
# See dynamic dependencies
ldd /usr/bin/ls
Layer 2 - ELF Header Parsing: The ELF header is at offset 0 and tells you where everything else is:
#include <elf.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
// Memory-map the file
int fd = open(path, O_RDONLY);
struct stat st;
fstat(fd, &st);
void *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
// Cast to ELF header
Elf64_Ehdr *ehdr = (Elf64_Ehdr *)map;
// Find section headers
Elf64_Shdr *shdr = (Elf64_Shdr *)((char *)map + ehdr->e_shoff);
int num_sections = ehdr->e_shnum;
Layer 3 - Symbol Table Navigation: Symbol tables are in sections of type SHT_SYMTAB or SHT_DYNSYM:
// Find the string table for symbol names
Elf64_Shdr *strtab_section = &shdr[symtab_section->sh_link];
char *strtab = (char *)map + strtab_section->sh_offset;
// Iterate symbols
Elf64_Sym *symtab = (Elf64_Sym *)((char *)map + symtab_section->sh_offset);
int num_syms = symtab_section->sh_size / sizeof(Elf64_Sym);
for (int i = 0; i < num_syms; i++) {
char *name = strtab + symtab[i].st_name;
int type = ELF64_ST_TYPE(symtab[i].st_info);
int bind = ELF64_ST_BIND(symtab[i].st_info);
// ...
}
Layer 4 - Interposition Library: Create a shared library that wraps functions:
// malloc_wrapper.c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
// Function pointer to real malloc
static void *(*real_malloc)(size_t) = NULL;
void *malloc(size_t size) {
if (!real_malloc) {
real_malloc = dlsym(RTLD_NEXT, "malloc");
}
void *ptr = real_malloc(size);
fprintf(stderr, "malloc(%zu) = %p\n", size, ptr);
return ptr;
}
// Compile: gcc -fPIC -shared -o libmalloc_wrapper.so malloc_wrapper.c -ldl
// Use: LD_PRELOAD=./libmalloc_wrapper.so ./myprogram
Layer 5 - PLT/GOT Tracing: To observe lazy binding, examine the GOT before and after first call:
// Get GOT address from /proc/self/maps or by parsing ELF
// Read GOT entry before call (will point to PLT+6)
// Call function
// Read GOT entry after (will point to actual function in libc)
// Or use ptrace to single-step through PLT resolution
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Object files and linking | Computer Systems: A Programmer's Perspective | Ch. 7 Linking |
| ELF format details | The Linux Programming Interface | Ch. 41 Fundamentals of Shared Libraries |
| Shared library mechanics | The Linux Programming Interface | Ch. 42 Advanced Features of Shared Libraries |
| Symbol resolution rules | Computer Systems: A Programmer's Perspective | Ch. 7.6 Symbol Resolution |
| Position-independent code | Computer Systems: A Programmer's Perspective | Ch. 7.12 Position-Independent Code |
| Library interposition | Computer Systems: A Programmer's Perspective | Ch. 7.13 Library Interpositioning |
| ELF internals | Practical Binary Analysis | Ch. 2 The ELF Format |
| Dynamic linking internals | Linux System Programming (Robert Love) | Ch. 8 File and Directory Management |
| Linker scripts and details | Linkers and Loaders (John Levine) | Ch. 3-7 |
Project 11: Signals + Processes Sandbox
| Attribute | Value |
|---|---|
| Language | C (alt: Rust, Zig, C++) |
| Difficulty | Advanced |
| Time | 1–2 weeks |
| Chapters | 8 |
| Coolness | ★★★★☆ Genuinely Clever |
| Portfolio Value | Resume Gold |
What you'll build: A harness that runs child processes in controlled modes (normal exit, crash, stop/continue, timeout) and logs exactly which ECF events occurred and why.
Why it matters: Chapter 8 is about the realities of process control and signals; observation is mandatory.
Core challenges:
- Correct process creation/reaping (process lifecycle)
- Async-signal-safe handler design (safe signal handling)
- Avoiding zombies and race windows (correctness)
Key concepts to master:
- Process lifecycle (Ch. 8)
- Signals and handlers (Ch. 8)
- Nonlocal control (Ch. 8)
Prerequisites: Basic OS concepts.
Deliverable: Demonstrate zombies/orphans/signal races and explain how to prevent them.
Implementation hints:
- Treat each mode like a lab experiment; isolate one behavior per run
- Produce a timeline log: spawn → signal → status change → reap
Milestones:
- You can explain why zombies happen
- Your signal handlers are correct, not "sometimes works"
- You can reason about race windows without superstition
Real World Outcome
$ ./procsandbox --mode=lifecycle
Process Lifecycle Sandbox
==========================
Demonstrating: fork, exec, wait, exit
[Parent PID=1234] Forking child...
[14:30:01.001] fork() returned 1235 in parent
[14:30:01.001] fork() returned 0 in child (PID=1235)
[Child PID=1235] Executing /bin/echo "Hello from child"
[14:30:01.002] execve("/bin/echo", ["echo", "Hello from child"], envp)
Hello from child
[14:30:01.003] Child 1235 called exit(0)
[Parent PID=1234] waitpid() returned: child=1235, status=0x0000
-> WIFEXITED: true, exit code: 0
Process Timeline:
Parent [1234]: ----[fork]--------------------[wait/reap]----
Child [1235]: |----[exec]----[run]----[exit]---|
t=0 t=1ms t=2ms t=3ms
$ ./procsandbox --mode=zombie
Zombie Process Demonstration
=============================
[Parent PID=1234] Creating child without reaping...
[14:30:05.001] Child 1236 created
[14:30:05.002] Child 1236 exiting immediately
[14:30:05.003] Child 1236 is now a ZOMBIE (parent hasn't called wait)
Process Status (from /proc):
PID PPID STATE COMMAND
1234 1233 S procsandbox
1236 1234 Z [procsandbox] <defunct> <-- ZOMBIE!
[14:30:07.000] Parent now calling waitpid()...
[14:30:07.001] Zombie 1236 reaped, status=0x0000
$ ./procsandbox --mode=signals
Signal Handling Demonstration
==============================
[PID=1234] Installing handlers for SIGINT, SIGCHLD, SIGTSTP, SIGUSR1
[14:30:10.001] Forking child 1237 (will run for 5 seconds)...
[14:30:10.002] Child 1237 running: ./sleeper 5
--- Press Ctrl+C to send SIGINT ---
^C
[14:30:12.500] Received SIGINT (signal 2)
Handler context:
- Interrupted syscall: yes (was in read())
- errno preserved: yes (was EINTR, restored to 0)
- SA_RESTART set: no (syscall returns -1/EINTR)
[14:30:12.501] Forwarding SIGINT to child process group...
[14:30:12.502] Child 1237 terminated by signal 2 (SIGINT)
[14:30:12.503] SIGCHLD received
Handler actions (async-signal-safe only):
- Saved errno: 0
- Called waitpid(-1, &status, WNOHANG): returned 1237
- WIFSIGNALED(status): true, signal: 2
- Restored errno: 0
$ ./procsandbox --mode=race
Signal Race Condition Demonstration
====================================
INCORRECT PATTERN (race window exists):
```c
pid_t pid = fork();
if (pid == 0) {
    execve(...);  // Child runs
}
// RACE WINDOW: SIGCHLD might arrive HERE, before job added!
addjob(pid);  // Parent adds job
```

Running 1000 iterations with race-prone code… [Results] Failures: 47/1000 (child reaped before job added)

CORRECT PATTERN (block signals around critical section):

```c
sigprocmask(SIG_BLOCK, &mask_chld, &prev);  // Block SIGCHLD
pid_t pid = fork();
if (pid == 0) {
    sigprocmask(SIG_SETMASK, &prev, NULL);  // Unblock in child
    execve(...);
}
addjob(pid);  // Safe: SIGCHLD blocked
sigprocmask(SIG_SETMASK, &prev, NULL);  // Unblock, handler runs
```

Running 1000 iterations with correct code… [Results] Failures: 0/1000 (no races detected)
### The Core Question You're Answering
**"How does the operating system manage the lifecycle of processes, and how can programs respond to asynchronous events (signals) correctly and safely?"**
### Concepts You Must Understand First
1. **Process Creation and the fork() Model**
- What does fork() return in the parent? In the child?
- What is shared between parent and child after fork? What is copied?
- Why does fork return twice?
- Book: CS:APP Chapter 8.4.2 (Creating Processes) and TLPI Chapter 24
2. **The exec Family and Process Replacement**
- What happens to the calling process during exec?
- What is preserved across exec? What is not?
- When does exec return? What does it return?
- Book: CS:APP Chapter 8.4.5 (Loading and Running Programs) and TLPI Chapter 27
3. **Process Termination and Reaping**
- What is a zombie process? Why do they exist?
- What is the difference between wait() and waitpid()?
- What do WIFEXITED, WIFSIGNALED, and WIFSTOPPED tell you?
- Book: CS:APP Chapter 8.4.3 (Reaping Child Processes) and TLPI Chapter 26
4. **Signals: Asynchronous Events**
- What is a signal? What triggers signal delivery?
- What is the difference between generating, delivering, and handling a signal?
- What signals are sent by Ctrl+C, Ctrl+Z? What is their default behavior?
- Book: CS:APP Chapter 8.5 (Signals) and TLPI Chapters 20-22
5. **Signal Handlers and Async-Signal-Safety**
- Why can't you call printf() in a signal handler?
- What functions are async-signal-safe? Why does this matter?
- What is the volatile sig_atomic_t type for?
- Book: CS:APP Chapter 8.5.5 (Writing Signal Handlers) and TLPI Chapter 21.1
6. **Signal Blocking and Critical Sections**
- How do you block signals? Why would you want to?
- What is a signal mask? How does sigprocmask work?
- What happens to blocked signals? Are they queued?
- Book: CS:APP Chapter 8.5.6 (Synchronizing Flows) and TLPI Chapter 20.10
7. **Process Groups and Sessions**
- What is a process group? Why do shells use them?
- How does the kernel know which process to send SIGINT to when you press Ctrl+C?
- What is a controlling terminal?
- Book: CS:APP Chapter 8.5.2 (Sending Signals) and TLPI Chapter 34
### Questions to Guide Your Design
1. **Test Harness Structure**: How will you organize different demonstration modes (lifecycle, signals, races)?
2. **Observability**: How will you log events with precise timestamps? How will you show the timeline?
3. **Signal Handler Design**: How will you make handlers async-signal-safe while still logging useful information?
4. **Race Reproduction**: How will you reliably reproduce race conditions for educational purposes?
5. **Process State Inspection**: Will you use /proc filesystem? waitpid flags? Both?
6. **Error Handling**: How will you handle EINTR from interrupted system calls?
### Thinking Exercise
Consider this signal handler:
```c
volatile sig_atomic_t got_sigchld = 0;
int child_count = 0; // Number of children to reap
void sigchld_handler(int sig) {
int olderrno = errno;
pid_t pid;
int status;
// Reap ALL available children (might be multiple)
while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
child_count--; // BUG: not async-signal-safe!
// What if main code was in middle of reading child_count?
}
got_sigchld = 1;
errno = olderrno;
}
```
Hand-trace questions:
- Why do we save and restore errno? What could corrupt it?
- Why use WNOHANG? What would happen without it?
- Why the while loop instead of a single waitpid call?
- The code modifies `child_count` - why is this problematic?
- What if two SIGCHLD signals arrive "simultaneously"? Are both delivered?
- What is `volatile sig_atomic_t` and why is it needed for `got_sigchld`?
- How would you fix the `child_count` update to be safe?
- Write a main loop that correctly checks `got_sigchld` and processes reaped children.
The Interview Questions They'll Ask
- "Explain what happens when you type Ctrl+C in a terminal running a program."
- Expected: Terminal driver sends SIGINT to foreground process group, default handler terminates process
- "What is a zombie process and how do you prevent them?"
- Expected: Terminated child waiting to be reaped, parent must call wait/waitpid, SIGCHLD handler for async reaping
- "Why can't you call printf() from inside a signal handler?"
- Expected: printf not async-signal-safe, could deadlock on internal locks, use write() instead
- "How would you implement a timeout for a child process?"
- Expected: alarm() or setitimer(), SIGALRM handler, kill() to terminate child, waitpid() to reap
- "Describe a race condition involving fork() and signals, and how to prevent it."
- Expected: SIGCHLD arriving before job table updated, block signals around fork/addjob, unblock after
- "What is the difference between SIGTERM and SIGKILL?"
- Expected: SIGTERM can be caught/ignored (graceful shutdown), SIGKILL cannot be caught (forced termination)
Hints in Layers
Layer 1 - Basic Process Creation: Start with a simple fork/exec/wait cycle:
pid_t pid = fork();
if (pid == 0) {
// Child process
execve("/bin/echo", (char *[]){"echo", "hello", NULL}, environ);
perror("execve failed");
exit(1);
}
// Parent process
int status;
waitpid(pid, &status, 0);
if (WIFEXITED(status)) {
printf("Child exited with code %d\n", WEXITSTATUS(status));
}
Layer 2 - Signal Handler Installation: Use sigaction() instead of signal() for portable behavior:
struct sigaction sa;
sa.sa_handler = sigchld_handler;
sigemptyset(&sa.sa_mask);
sa.sa_flags = SA_RESTART; // Restart interrupted syscalls
if (sigaction(SIGCHLD, &sa, NULL) < 0) {
perror("sigaction");
exit(1);
}
Layer 3 - Async-Signal-Safe Logging: Write your own safe logging using write():
// Safe string output in signal handler
void safe_print(const char *s) {
write(STDERR_FILENO, s, strlen(s));
}
// Safe integer output (pre-convert to string)
void safe_print_int(int n) {
char buf[32];
int i = sizeof(buf) - 1;
buf[i] = '\0';
int neg = (n < 0);
if (neg) n = -n;
do {
buf[--i] = '0' + (n % 10);
n /= 10;
} while (n > 0);
if (neg) buf[--i] = '-';
write(STDERR_FILENO, &buf[i], sizeof(buf) - 1 - i);
}
Layer 4 - Blocking Signals for Critical Sections: Protect fork/job-add sequences from SIGCHLD:
sigset_t mask, prev;
sigemptyset(&mask);
sigaddset(&mask, SIGCHLD);
// Block SIGCHLD
sigprocmask(SIG_BLOCK, &mask, &prev);
pid_t pid = fork();
if (pid == 0) {
// Child: restore signal mask before exec
sigprocmask(SIG_SETMASK, &prev, NULL);
execve(argv[0], argv, environ);
exit(1);
}
// Parent: add to job list while SIGCHLD blocked
add_job(job_list, pid, RUNNING);
// Unblock SIGCHLD - pending signal delivered now
sigprocmask(SIG_SETMASK, &prev, NULL);
Layer 5 - Detecting Process State via /proc: Read process state for educational output:
void print_process_state(pid_t pid) {
    char path[64], buf[256];
    snprintf(path, sizeof(path), "/proc/%d/stat", pid);
    int fd = open(path, O_RDONLY);
    if (fd >= 0) {
        ssize_t n = read(fd, buf, sizeof(buf) - 1);
        if (n > 0) buf[n] = '\0';   // NUL-terminate before parsing
        // Parse: pid (comm) state ppid pgrp ...
        // state: R=running, S=sleeping, Z=zombie, T=stopped
        close(fd);
    }
}
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Process control fundamentals | Computer Systems: A Programmer's Perspective | Ch. 8.4 Process Control |
| Signal concepts and handling | Computer Systems: A Programmer's Perspective | Ch. 8.5 Signals |
| Comprehensive signal coverage | The Linux Programming Interface | Ch. 20-22 Signals |
| Process creation in depth | The Linux Programming Interface | Ch. 24-25 Process Creation |
| Process termination and waiting | The Linux Programming Interface | Ch. 26 Monitoring Child Processes |
| Signal safety and reentrancy | The Linux Programming Interface | Ch. 21.1 Designing Signal Handlers |
| Process groups and sessions | The Linux Programming Interface | Ch. 34 Process Groups, Sessions |
| Process lifecycle overview | Advanced Programming in the UNIX Environment | Ch. 8 Process Control |
| Signals in practice | Advanced Programming in the UNIX Environment | Ch. 10 Signals |
| Concurrency with processes | Operating Systems: Three Easy Pieces | Ch. 5-6 Process API |
Project 12: Unix Shell with Job Control
View Expanded Guide - Comprehensive implementation guide with signal flow diagrams, race condition patterns, and job state machines.
| Attribute | Value |
|---|---|
| Language | C (alt: Rust, Zig, C++) |
| Difficulty | Advanced |
| Time | 2–3 weeks |
| Chapters | 8, 12 |
| Coolness | ★★★★★ Hardcore Tech Flex |
| Portfolio Value | Resume Gold |
What you'll build: An interactive shell supporting foreground/background jobs, basic built-ins, and correct handling of interrupt/stop keystrokes.
Why it matters: It integrates processes, signals, and race avoidance into one user-facing system.
Core challenges:
- Process groups and terminal ownership (job control)
- Signal handling without races (ECF correctness)
- Consistent job state under async events (concurrency fundamentals)
Key concepts to master:
- Job control and signals (Ch. 8)
- Race avoidance patterns (Ch. 8 & 12)
- Robust error handling (Appendix)
Prerequisites: Project 11 recommended.
Deliverable: Use your shell to run real programs with correct fg/bg behavior.
Implementation hints:
- Define a minimal grammar first; add features only after correctness
- Design job-state transitions on paper before coding
Milestones:
- Basic commands run reliably
- Foreground/background switching works under stress
- No zombies; correct behavior under repeated interrupts/stops
Real World Outcome
$ ./mysh
mysh> echo hello world
hello world
mysh> /bin/ls -la
total 48
drwxr-xr-x 5 user user 4096 Dec 26 10:00 .
drwxr-xr-x 30 user user 4096 Dec 25 09:00 ..
-rwxr-xr-x 1 user user 8432 Dec 26 10:00 mysh
-rw-r--r-- 1 user user 2341 Dec 26 09:55 mysh.c
mysh> sleep 100 &
[1] (12345) sleep 100 &
mysh> sleep 200 &
[2] (12346) sleep 200 &
mysh> jobs
[1] (12345) Running sleep 100 &
[2] (12346) Running sleep 200 &
mysh> fg %1
sleep 100
^Z
Job [1] (12345) stopped by signal 20 (SIGTSTP)
mysh> jobs
[1] (12345) Stopped sleep 100
[2] (12346) Running sleep 200 &
mysh> bg %1
[1] (12345) sleep 100 &
mysh> jobs
[1] (12345) Running sleep 100 &
[2] (12346) Running sleep 200 &
mysh> fg %2
sleep 200
^C
Job [2] (12346) terminated by signal 2 (SIGINT)
mysh> jobs
[1] (12345) Running sleep 100 &
mysh> kill %1
Job [1] (12345) terminated by signal 15 (SIGTERM)
mysh> jobs
mysh>
--- Signal Handling Demo ---
mysh> ./long_running_process &
[1] (12350) ./long_running_process &
mysh> ./another_process
^C
[Ctrl+C sent SIGINT to foreground job only]
[Background job 12350 continues running]
Job ./another_process terminated by signal 2
mysh> jobs
[1] (12350) Running ./long_running_process &
--- Race Condition Prevention Demo (internal trace) ---
mysh> ./quick_exit & # Child exits immediately
[DEBUG] sigprocmask(SIG_BLOCK, {SIGCHLD})
[DEBUG] fork() = 12355
[DEBUG] addjob(12355, "./quick_exit")
[DEBUG] sigprocmask(SIG_UNBLOCK, {SIGCHLD})
[DEBUG] SIGCHLD handler: waitpid returned 12355
[DEBUG] deletejob(12355) - job found and removed
[1] (12355) ./quick_exit &
--- Proper Terminal Control ---
mysh> vim test.txt # Interactive program gets terminal control
[tcsetpgrp gives terminal to vim's process group]
[vim runs with full terminal control]
[After vim exits, shell reclaims terminal]
mysh>
The Core Question You're Answering
"How do shells provide the illusion of multiple concurrent programs sharing one terminal, and how do they coordinate process lifecycle, terminal control, and signal delivery without races or resource leaks?"
Concepts You Must Understand First
- Job Control Model
- What is a job? How does it differ from a process?
- What states can a job be in (foreground, background, stopped)?
- What triggers transitions between job states?
- Book: CS:APP Chapter 8.5 and TLPI Chapter 34
- Process Groups
- What is a process group? How is it different from a job?
- Why does the shell put each pipeline in its own process group?
- How does setpgid() work? Who can call it?
- Book: CS:APP Chapter 8.5.2 and TLPI Chapter 34.2
- Foreground Process Group and Terminal Control
- What is the foreground process group? How is it set?
- What is tcsetpgrp() and when must you call it?
- What happens if a background process tries to read from the terminal?
- Book: TLPI Chapter 34.4-34.5 and Advanced Programming in the UNIX Environment Chapter 9
- Signal Delivery to Process Groups
- When you press Ctrl+C, which processes receive SIGINT?
- How do you send a signal to an entire process group?
- What is the difference between kill(pid, sig) and kill(-pgid, sig)?
- Book: CS:APP Chapter 8.5.2 and TLPI Chapter 20.5
- Waiting for Stopped/Continued Children
- What is WUNTRACED? WCONTINUED? When do you need them?
- How do you distinguish a stopped child from a terminated child?
- What signals cause a process to stop? To continue?
- Book: CS:APP Chapter 8.4.3 and TLPI Chapter 26.1
- Race Conditions in Shell Implementation
- What race exists between fork() and adding a job to the table?
- What race exists between SIGCHLD and the main loop?
- Why must you block signals during critical sections?
- Book: CS:APP Chapter 8.5.6 and Chapter 12
- Built-in Commands vs External Commands
- Why must some commands be built-in (cd, exit, jobs, fg, bg)?
- How do you decide if a command is built-in?
- What happens if you try to exec a built-in?
- Book: TLPI Chapter 34.7 and APUE Chapter 9
Questions to Guide Your Design
- Job Table Structure: How will you represent jobs? What information do you need per job (pid, pgid, state, command line)?
- Command Parsing: How will you parse command lines? Will you handle pipes, redirects, or just simple commands first?
- Signal Handler Design: What will your SIGCHLD handler do? What must it NOT do?
- Terminal Control: When do you call tcsetpgrp()? What happens if you forget?
- Main Loop Architecture: How do you wait for foreground jobs? How do you handle asynchronous SIGCHLD for background jobs?
- Error Recovery: What happens if exec fails? If fork fails? If the command doesn't exist?
Thinking Exercise
Consider this shell main loop pseudocode:
while (1) {
char *cmdline = readline("mysh> ");
if (is_builtin(cmdline)) {
do_builtin(cmdline);
} else {
pid_t pid = fork();
if (pid == 0) {
// Child
setpgid(0, 0); // New process group
execve(argv[0], argv, environ);
exit(1);
}
// Parent
setpgid(pid, pid); // Also set pgid (race with child)
if (foreground) {
tcsetpgrp(STDIN_FILENO, pid); // Give terminal to child
waitpid(pid, &status, WUNTRACED); // Wait for fg job
tcsetpgrp(STDIN_FILENO, getpgrp()); // Reclaim terminal
} else {
printf("[%d] %d\n", jobnum, pid);
}
}
}
Hand-trace questions:
- Why do both parent AND child call setpgid()? What race does this solve?
- What happens if the child execs before the parent calls setpgid()?
- Why do we call tcsetpgrp() before waitpid() for foreground jobs?
- What happens if we forget to reclaim the terminal after the foreground job finishes?
- Where should SIGCHLD handling happen? Is it missing from this pseudocode?
- What happens if the user types Ctrl+C while a foreground job is running?
- What happens if the user types Ctrl+Z? What state does the job transition to?
- How would you modify this to properly add jobs to a job table and handle background jobs?
The Interview Questions They'll Ask
- "Walk me through what happens when you type 'ls | grep foo' in a shell and press Enter."
- Expected: Parsing, fork for each command, pipe creation, process group setup, exec, wait
- "How does job control work? What happens when you press Ctrl+Z?"
- Expected: SIGTSTP to foreground process group, process stops, shell reclaims terminal, job marked stopped
- "Why can't 'cd' be an external command?"
- Expected: chdir() affects calling process only, child process change doesn't affect parent shell
- "Describe a race condition in a naive shell implementation and how to fix it."
- Expected: Fork/SIGCHLD race, signal blocking around critical sections
- "What is a process group and why do shells use them?"
- Expected: Collection of related processes, signal delivery, terminal control, job abstraction
- "How would you implement the 'fg' built-in command?"
- Expected: Find job, send SIGCONT if stopped, give it terminal via tcsetpgrp(), waitpid with WUNTRACED
Hints in Layers
Layer 1 - Basic Command Execution: Start with a shell that can only run simple foreground commands:
int main(void) {
char cmdline[1024];
while (1) {
printf("mysh> ");
if (!fgets(cmdline, sizeof(cmdline), stdin)) break;
// Parse cmdline into argv (simple: split on whitespace)
char *argv[64];
parse_cmdline(cmdline, argv);
if (argv[0] == NULL) continue;
pid_t pid = fork();
if (pid == 0) {
execvp(argv[0], argv);
perror(argv[0]);
exit(127);
}
int status;
waitpid(pid, &status, 0);
}
return 0;
}
Layer 2 - Job Table Data Structure: Design your job table before adding background jobs:
#define MAXJOBS 16
typedef enum { UNDEF, FG, BG, ST } job_state_t;
typedef struct {
pid_t pid; // Process ID
pid_t pgid; // Process group ID
job_state_t state; // FG, BG, or ST (stopped)
int jid; // Job ID [1], [2], etc.
char cmdline[1024]; // Command line for display
} job_t;
job_t jobs[MAXJOBS];
// Operations: addjob, deletejob, getjobpid, getjobjid, pid2jid, listjobs
Layer 3 - SIGCHLD Handler: Handle child termination and stops asynchronously:
void sigchld_handler(int sig) {
int olderrno = errno;
pid_t pid;
int status;
// Reap ALL available children
while ((pid = waitpid(-1, &status, WNOHANG | WUNTRACED)) > 0) {
if (WIFEXITED(status) || WIFSIGNALED(status)) {
// Child terminated - delete from job table
deletejob(jobs, pid);
} else if (WIFSTOPPED(status)) {
// Child stopped - update job state
job_t *job = getjobpid(jobs, pid);
if (job) job->state = ST;
}
}
errno = olderrno;
}
Layer 4 - Proper Fork with Signal Blocking: Prevent races between fork and job table updates:
void eval(char *cmdline) {
sigset_t mask_all, mask_chld, prev_mask;
sigfillset(&mask_all);
sigemptyset(&mask_chld);
sigaddset(&mask_chld, SIGCHLD);
// Block SIGCHLD before fork
sigprocmask(SIG_BLOCK, &mask_chld, &prev_mask);
pid_t pid = fork();
if (pid == 0) {
// Child: unblock signals, set process group, exec
sigprocmask(SIG_SETMASK, &prev_mask, NULL);
setpgid(0, 0);
execve(argv[0], argv, environ);
exit(1);
}
// Parent: add job while SIGCHLD blocked
setpgid(pid, pid); // Also set in parent (race prevention)
sigprocmask(SIG_BLOCK, &mask_all, NULL); // Block all for job table
addjob(jobs, pid, pid, bg ? BG : FG, cmdline);
sigprocmask(SIG_SETMASK, &prev_mask, NULL); // Restore (unblock SIGCHLD)
if (!bg) {
waitfg(pid); // Wait for foreground job
}
}
Layer 5 - Foreground Wait with sigsuspend: Correctly wait for foreground jobs without busy-waiting:
void waitfg(pid_t pid) {
sigset_t mask;
sigemptyset(&mask);
// Wait until the job is no longer in foreground
// SIGCHLD handler will update job state
while (fgpid(jobs) == pid) {
sigsuspend(&mask); // Atomically unblock and wait
}
}
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Job control overview | Computer Systems: A Programmer's Perspective | Ch. 8.5 Signals (job control discussion) |
| Signal handling for shells | Computer Systems: A Programmer's Perspective | Ch. 8.5.5-8.5.7 |
| Process groups and sessions | The Linux Programming Interface | Ch. 34 Process Groups, Sessions, and Job Control |
| Terminal control | The Linux Programming Interface | Ch. 34.4-34.6 |
| Shell implementation details | Advanced Programming in the UNIX Environment | Ch. 9 Process Relationships |
| Job control signals | Advanced Programming in the UNIX Environment | Ch. 10.20 Job Control Signals |
| Race conditions | Computer Systems: A Programmer's Perspective | Ch. 8.5.6 Synchronizing Flows |
| Concurrent programming patterns | Computer Systems: A Programmer's Perspective | Ch. 12 Concurrent Programming |
| Process API | Operating Systems: Three Easy Pieces | Ch. 5 Process API |
| Shell history and design | The Unix Programming Environment (Kernighan & Pike) | Ch. 3 Using the Shell |
Project 13: Virtual Memory Map Visualizer
| Attribute | Value |
|---|---|
| Language | C (alt: Rust, Zig, C++) |
| Difficulty | Advanced |
| Time | 1–2 weeks |
| Chapters | 8, 9 |
| Coolness | ★★★★☆ Genuinely Clever |
| Portfolio Value | Micro-SaaS/Pro Tool |
What you'll build: A tool that reports a process's virtual memory layout (regions, permissions, growth) and demonstrates demand paging and protection faults with controlled experiments.
Why it matters: It turns VM into observable reality: mapping, protection, faults, and locality.
Core challenges:
- Presenting mapping info accurately (regions + permissions)
- Controlled page-fault demonstrations (demand paging)
- Explaining copy-on-write and sharing (fork + VM interaction)
Key concepts to master:
- Address translation and pages (Ch. 9)
- Memory protection and mapping (Ch. 9)
- Process/VM interaction (Ch. 8–9)
Prerequisites: Project 11 recommended.
Deliverable: Show an exact map of a process and explain why a specific access faults.
Implementation hints:
- Start with "regions with permissions," then refine to page-level reasoning
- Keep experiments minimal so the cause of faults is unambiguous
Milestones:
- You can distinguish heap/stack/mapped files by observation
- You can classify crashes as protection failures
- You reason about locality as VM + cache, not just "speed"
Real World Outcome
$ ./vmvis 12345
================================================================================
VIRTUAL MEMORY MAP VISUALIZER - Process: 12345 (myapp)
================================================================================
MEMORY REGIONS (from /proc/12345/maps):
--------------------------------------------------------------------------------
ADDRESS RANGE SIZE PERMS PATH
--------------------------------------------------------------------------------
0x00400000-0x00452000 328 KB r-xp /usr/bin/myapp
0x00651000-0x00652000 4 KB r--p /usr/bin/myapp
0x00652000-0x00653000 4 KB rw-p /usr/bin/myapp
0x00653000-0x00674000 132 KB rw-p [heap]
0x7f8a3c000000-0x7f8a3c1bc000 1776 KB r-xp /lib/libc-2.31.so
0x7ffc8a400000-0x7ffc8a421000 132 KB rw-p [stack]
0x7ffc8a5fe000-0x7ffc8a600000 8 KB r-xp [vdso]
REGION SUMMARY:
Code (r-x): 2104 KB | Read-only (r--): 20 KB | Read-write (rw-): 276 KB
$ ./vmvis 12345 --page-fault-demo
================================================================================
PAGE FAULT DEMONSTRATION
================================================================================
[1] Allocating 16 pages (65536 bytes) without touching...
Pages resident: 0 of 16
[2] Touching page 0 (writing 1 byte at 0x7f8a40000000)...
>>> PAGE FAULT TRIGGERED <<<
Fault type: MINOR (demand paging)
Pages resident: 1 of 16
[3] Triggering protection fault (writing to code segment)...
>>> SIGSEGV RECEIVED <<<
si_code: SEGV_ACCERR (invalid permissions for mapped object)
$ ./vmvis --cow-demo
================================================================================
COPY-ON-WRITE DEMONSTRATION
================================================================================
[Parent] Allocating 1 MB, RSS before fork: 5120 KB
[Fork] Child created
Parent RSS: 5120 KB | Child RSS: 5120 KB (pages SHARED!)
[Child writing to page 0...]
>>> COPY-ON-WRITE FAULT <<<
Child RSS: 5124 KB | Parent RSS: 5120 KB (unchanged)
The Core Question You're Answering
How does virtual memory create the illusion of a large, private, contiguous address space for each process, and what are the performance and correctness implications of this abstraction?
Concepts You Must Understand First
- Virtual vs Physical Addresses - What is an address space? How does the MMU translate addresses? CS:APP Ch. 9.3
- Pages and Page Tables - VPN/VPO division, PTE fields, TLB purpose. CS:APP Ch. 9.6
- Memory Mapping and Regions - Anonymous vs file-backed mappings, mmap(), fork behavior. CS:APP Ch. 9.8
- Page Faults - Minor vs major faults, demand paging. CS:APP Ch. 9.5
- Memory Protection - Permission bits (r/w/x), SIGSEGV types, ASLR. CS:APP Ch. 9.7
Questions to Guide Your Design
- Will you parse /proc/PID/maps directly? What edge cases exist ([heap], [vdso], deleted files)?
- How will you present 48-bit address space meaningfully?
- How will you observe page faults without being inside the target process?
- How do you safely trigger and catch protection faults?
Thinking Exercise
Trace these accesses with simplified page tables:
- VPN 0x00400: PPN 0x1A000, r-x, present
- VPN 0x00653: PPN 0x2B000, rw-, present
- VPN 0x7f8a3: not present, backed by libc.so
1. Instruction fetch from 0x00400ABC - success or fault?
2. Write to 0x00653100 - success or fault?
3. Read from 0x7f8a3000 - what happens? What changes?
4. Write to 0x00400000 - how does this differ from #1, and why?
The Interview Questions They'll Ask
- "Walk me through what happens when a process accesses memory that hasn't been touched since allocation." - Demand paging, minor fault, kernel allocates a page, updates the PTE, restarts the instruction
- "Why is fork() so fast even for gigabytes of memory?" - Copy-on-write
- "What's the difference between SIGSEGV from a null pointer vs writing to read-only memory?" - SEGV_MAPERR vs SEGV_ACCERR
- "How would you debug a memory leak that doesn't show in valgrind?" - /proc/PID/maps growth, RSS vs VSZ trends
Hints in Layers
Layer 1: Parse /proc/PID/maps with sscanf for start-end perms offset dev inode pathname
Layer 2: Read fault counters from /proc/PID/stat fields 10 and 12 (minflt, majflt)
Layer 3: Use sigsetjmp/siglongjmp with SIGSEGV handler to catch and recover from faults
Layer 4: Use mincore() to check page residency without faulting
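A minimal sketch of the Layer 1 parsing step, assuming Linux's /proc/PID/maps line format; the region_t type and function name are illustrative, and the pathname field is left empty for anonymous mappings:

```c
#include <stdio.h>
#include <string.h>

/* One parsed region from /proc/PID/maps. */
typedef struct {
    unsigned long start, end;
    char perms[5];    /* e.g. "r-xp" */
    char path[256];   /* "" for anonymous mappings */
} region_t;

/* Parse a single maps line such as:
 * "00400000-00452000 r-xp 00000000 08:01 1234 /usr/bin/myapp"
 * Returns 1 on success, 0 on parse failure. Offset, device, and
 * inode fields are skipped with suppressed conversions. */
int parse_maps_line(const char *line, region_t *r) {
    r->path[0] = '\0';
    int n = sscanf(line, "%lx-%lx %4s %*x %*s %*d %255s",
                   &r->start, &r->end, r->perms, r->path);
    return n >= 3;   /* pathname is optional */
}
```

In use, you would read /proc/PID/maps line by line with fgets and feed each line through this function; page residency for a region can then be checked with mincore() as Layer 4 suggests.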
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Virtual Memory Fundamentals | Computer Systems: A Programmer's Perspective | Ch. 9 |
| Page Tables and TLB | Operating Systems: Three Easy Pieces | Ch. 18-20 |
| Linux Memory Management | The Linux Programming Interface | Ch. 48-50 |
| mmap and Memory Mapping | Advanced Programming in the UNIX Environment | Ch. 14 |
Project 14: Build Your Own Malloc
| Attribute | Value |
|---|---|
| Language | C (alt: Rust, Zig, C++) |
| Difficulty | Expert |
| Time | 1 month+ |
| Chapters | 6, 9 |
| Coolness | ★★★★★ Pure Magic |
| Portfolio Value | Resume Gold |
What you'll build: A user-space allocator implementing malloc/free (optionally realloc) plus tooling for invariants, fragmentation, and throughput.
Why it matters: This is the payoff of data layout, locality, and VM: alignment, metadata design, coalescing, and policy trade-offs.
Core challenges:
- Block metadata and alignment design (layout + ABI alignment)
- Free-list policies, splitting/coalescing (fragmentation trade-offs)
- Heap checker and performance harness (correctness + optimization)
Key concepts to master:
- Heap layout and allocator concepts (Ch. 9)
- Locality and performance effects (Ch. 6)
- Invariants mindset (C Interfaces and Implementations reference)
Prerequisites: Projects 2, 9, and 13 recommended.
Deliverable: Allocate/free at scale without corruption; provide evidence on fragmentation and throughput.
Implementation hints:
- Lock down invariants first; instrument everything
- Every blockโs header must be explainable in a dump
Milestones:
- Allocator passes correctness tests
- Fragmentation becomes measurable and improvable by policy
- You can explain bugs as violated invariants, not "weird behavior"
Real World Outcome
$ ./mymalloc --test-suite
================================================================================
MALLOC IMPLEMENTATION TEST SUITE
================================================================================
Running correctness tests...
[PASS] Basic malloc/free cycle (1000 allocations)
[PASS] Alignment check (all pointers 16-byte aligned)
[PASS] Coalescing test (adjacent free blocks merged)
[PASS] Realloc in-place when possible
[PASS] Zero-size malloc returns NULL or unique pointer
[PASS] Double-free detection (caught and reported)
[PASS] Heap overflow detection (guard bytes intact)
Running stress tests...
[PASS] Random alloc/free pattern (100000 ops, no corruption)
[PASS] Worst-case fragmentation pattern (alternating sizes)
$ ./mymalloc --heap-dump
================================================================================
HEAP DUMP - Block Layout
================================================================================
Heap start: 0x555555756000 Heap end: 0x555555776000 Size: 131072 bytes
Block Address Size Status Prev Next (free list)
--------------------------------------------------------------------------------
[ 0] 0x555555756000 64 ALLOC - -
[ 1] 0x555555756040 128 FREE - [3]
[ 2] 0x5555557560c0 256 ALLOC - -
[ 3] 0x5555557561c0 512 FREE [1] [5]
[ 4] 0x5555557563c0 1024 ALLOC - -
[ 5] 0x5555557567c0 2048 FREE [3] -
Free list heads (segregated):
Class 0 (16-64): [1] -> NULL
Class 1 (65-256): NULL
Class 2 (257-1024): [3] -> NULL
Class 3 (1025+): [5] -> NULL
Heap utilization: 67.2% (internal fragmentation: 8.3%)
$ ./mymalloc --benchmark
================================================================================
ALLOCATOR PERFORMANCE BENCHMARK
================================================================================
Workload: Synthetic (mixed sizes 16-4096, 50% alloc / 50% free)
Operations: 1,000,000
Throughput Utilization Peak Memory
--------------------------------------------------------------------------------
System malloc 847,231 ops/s 89.2% 12.4 MB
My malloc (implicit) 234,567 ops/s 71.3% 18.2 MB
My malloc (explicit) 456,789 ops/s 78.4% 15.1 MB
My malloc (segregated) 678,901 ops/s 84.1% 13.8 MB
Fragmentation Analysis:
External fragmentation: 12.3% (free blocks too small for requests)
Internal fragmentation: 5.7% (wasted space within allocated blocks)
Coalescing efficiency: 94.2% (adjacent frees merged)
$ ./mymalloc --trace workload.trace
================================================================================
ALLOCATION TRACE ANALYSIS
================================================================================
Trace: workload.trace (real application: gcc compiling hello.c)
Operations: 47,832 malloc, 45,119 free, 2,713 realloc
Size Distribution:
0-32 bytes: ████████████████████ 41.2%
33-64 bytes: ████████████ 24.1%
65-128 bytes: ████████ 15.8%
129-256 bytes: ████ 9.3%
257-1024 bytes: ██ 6.1%
1025+ bytes: █ 3.5%
Peak heap usage: 2.34 MB
Average allocation lifetime: 847 operations
Longest-lived allocation: 47,831 operations (probably a global)
The Core Question You're Answering
How do you efficiently manage a contiguous region of memory to satisfy arbitrary allocation requests while minimizing fragmentation and maximizing throughput?
Concepts You Must Understand First
- Heap Organization - What is brk/sbrk? How does the heap grow? What is the relationship between heap and mmap? CS:APP Ch. 9.9
- Block Structure and Metadata - Header/footer design, boundary tags, alignment requirements. CS:APP Ch. 9.9.6
- Free List Management - Implicit vs explicit free lists, LIFO vs address-ordered, segregated fits. CS:APP Ch. 9.9.13
- Splitting and Coalescing - When to split blocks? Immediate vs deferred coalescing. CS:APP Ch. 9.9.10
- Placement Policies - First fit, next fit, best fit tradeoffs. CS:APP Ch. 9.9.7
- Alignment Constraints - Why 8- or 16-byte alignment? What does the ABI require? CS:APP Ch. 3.9.3
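The alignment arithmetic can be made concrete with a small sketch. The 8-byte header, 16-byte alignment, and 32-byte minimum block match the assumptions used in the thinking exercise below; the macro and function names are illustrative:

```c
#include <stddef.h>

/* Round sz up to the 16-byte ABI alignment (CS:APP Ch. 3.9.3). */
#define ALIGNMENT 16
#define ALIGN(sz) (((sz) + (ALIGNMENT - 1)) & ~(size_t)(ALIGNMENT - 1))

/* Total block size for a request: payload plus an 8-byte header,
 * rounded up, but never smaller than the minimum block (32 bytes,
 * which leaves room for free-list links in a freed block). */
#define HEADER_SIZE 8
#define MIN_BLOCK 32

size_t block_size(size_t request) {
    size_t sz = ALIGN(request + HEADER_SIZE);
    return sz < MIN_BLOCK ? MIN_BLOCK : sz;
}
```

So a 1-byte request still consumes a 32-byte block, and a 25-byte request consumes 48 bytes: that gap between requested and consumed bytes is exactly the internal fragmentation your benchmark reports.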
Questions to Guide Your Design
- Metadata size: How many bytes of overhead per block? Can you reduce it?
- Minimum block size: What is it? Why?
- Free list structure: Implicit, explicit, or segregated? Why?
- Coalescing strategy: Immediate or deferred? Tradeoffs?
- Heap checker: What invariants must always hold?
Thinking Exercise
Consider this allocation sequence:
p1 = malloc(32); // Request 32 bytes
p2 = malloc(64); // Request 64 bytes
p3 = malloc(32); // Request 32 bytes
free(p2); // Free middle block
p4 = malloc(48); // Request 48 bytes - where does it go?
free(p1); // Free first block
free(p3); // Free last block - what happens?
Draw the heap state after each operation. Assume:
- 8-byte header with size and allocated bit
- 16-byte alignment
- Minimum block size is 32 bytes (including header)
The Interview Questions They'll Ask
- "Explain how malloc works internally." - Free list, block headers, splitting/coalescing
- "What is memory fragmentation and how would you minimize it?" - Internal vs external, coalescing, placement policies
- "Why does malloc need to track block sizes?" - So free() knows how much to release, and for coalescing
- "How would you detect memory leaks or double-frees?" - Heap checker, guard bytes, tracking allocated blocks
- "What's the tradeoff between throughput and utilization?" - Faster policies (first fit) vs space-efficient ones (best fit)
Hints in Layers
Layer 1 - Block Header:
typedef struct {
size_t size; // Block size including header (low bit = allocated)
} block_header_t;
#define GET_SIZE(hp) ((hp)->size & ~0x7)
#define GET_ALLOC(hp) ((hp)->size & 0x1)
#define PACK(size, alloc) ((size) | (alloc))
Layer 2 - Heap Initialization:
static char *heap_start;
static char *heap_end;
int mm_init(void) {
heap_start = sbrk(INITIAL_HEAP_SIZE);
if (heap_start == (void *)-1) return -1;
heap_end = heap_start + INITIAL_HEAP_SIZE;
// Create initial free block spanning entire heap
return 0;
}
Layer 3 - Coalescing:
// With boundary tags, check neighbors:
// Previous block: look at footer just before current header
// Next block: look at header at (current + current_size)
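A compilable sketch of that boundary-tag check, reusing the Layer 1 packing convention. It assumes prologue/epilogue sentinel blocks bound the heap so both neighbors can always be inspected safely; the names coalesce and word_t are illustrative:

```c
#include <stdint.h>

/* 8-byte size word; low bit = allocated, as in the Layer 1 macros. */
typedef uint64_t word_t;
#define PACK(size, alloc) ((word_t)(size) | (alloc))
#define SIZE_OF(w)  ((w) & ~(word_t)0x7)
#define ALLOC_OF(w) ((w) & 0x1)

/* hdr points at a just-freed block's header. With boundary tags,
 * hdr[-1] is the previous block's footer and the next header sits
 * SIZE_OF(*hdr) bytes ahead. Merges with any free neighbor, writes
 * matching header/footer words, and returns the merged size. */
word_t coalesce(word_t *hdr) {
    word_t size = SIZE_OF(*hdr);
    word_t *prev_ftr = hdr - 1;
    word_t *next_hdr = (word_t *)((char *)hdr + size);

    if (!ALLOC_OF(*next_hdr))            /* merge with next block */
        size += SIZE_OF(*next_hdr);
    if (!ALLOC_OF(*prev_ftr)) {          /* merge with previous block */
        size += SIZE_OF(*prev_ftr);
        hdr = (word_t *)((char *)hdr - SIZE_OF(*prev_ftr));
    }
    *hdr = PACK(size, 0);                                 /* header */
    *(word_t *)((char *)hdr + size - 8) = PACK(size, 0);  /* footer */
    return size;
}
```

Note the order of the two checks does not matter, but the header pointer must be moved back before the final writes when the previous block is absorbed.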
Layer 4 - Heap Checker:
int mm_check(void) {
    // Invariants that must always hold:
    // - Every block in the free list is marked free
    // - No two contiguous free blocks exist (coalescing worked)
    // - Every free block actually appears in the free list
    // - Free-list pointers point to valid heap addresses
    // - No allocated blocks overlap
    return 1;  // 1 = heap consistent; return 0 (and report) on any violation
}
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Dynamic Memory Allocation | Computer Systems: A Programmer's Perspective | Ch. 9.9 |
| Allocator Design Patterns | C Interfaces and Implementations | Ch. 5-6 |
| Memory Management | Operating Systems: Three Easy Pieces | Ch. 17 |
| Real Allocator Analysis | The Linux Programming Interface | Ch. 7 |
Project 15: Robust Unix I/O Toolkit
| Attribute | Value |
|---|---|
| Language | C (alt: Rust, Zig, C++) |
| Difficulty | Intermediate |
| Time | 1–2 weeks |
| Chapters | 9, 10 |
| Coolness | ★★★☆☆ Practical |
| Portfolio Value | Service & Support |
What you'll build: A "Unix file toolbox" that copies/tees/transforms streams while producing clear evidence of buffering behavior and syscall counts.
Why it matters: Chapter 10 is about being fluent with descriptors and the realities of I/O: partial operations, buffering, metadata, and mapping.
Core challenges:
- Partial reads/writes and robustness (robust I/O)
- Buffered vs unbuffered trade-offs (performance + correctness)
- Safe memory-mapped file usage (VM + I/O interaction)
Key concepts to master:
- Unix I/O (Ch. 10)
- Robust I/O discipline (Ch. 10)
- Memory-mapped files (Ch. 9โ10)
Prerequisites: Basic C.
Deliverable: Handle large files, pipes, and redirects without hangs or silent truncation.
Implementation hints:
- Treat I/O as โmay return less than requested,โ always
- Provide a "trace mode" that logs your I/O loop decisions
Milestones:
- You stop assuming a single read/write is enough
- You can explain bufferingโs performance impact with evidence
- You treat mmap as "VM + file backing," not magic
Real World Outcome
$ ./rio_copy --trace input.bin output.bin
================================================================================
ROBUST I/O TOOLKIT - File Copy with Trace
================================================================================
Source: input.bin (104857600 bytes, 100 MB)
Destination: output.bin
Buffer size: 8192 bytes
I/O Trace (showing partial operations):
--------------------------------------------------------------------------------
[ 1] read(3, buf, 8192) = 8192 (complete)
[ 1] write(4, buf, 8192) = 8192 (complete)
[ 2] read(3, buf, 8192) = 8192 (complete)
[ 2] write(4, buf, 8192) = 4096 (PARTIAL - pipe buffer full)
[ 2] write(4, buf+4096, 4096) = 4096 (retry succeeded)
[ 3] read(3, buf, 8192) = 8192 (complete)
...
[12800] read(3, buf, 8192) = 4096 (PARTIAL - near EOF)
[12800] read(3, buf+4096, 4096) = 0 (EOF reached)
[12800] write(4, buf, 4096) = 4096 (final write)
Summary:
Total read syscalls: 12,801 (12,800 complete + 1 partial)
Total write syscalls: 12,847 (12,753 complete + 94 partial retries)
Bytes transferred: 104,857,600
Elapsed time: 0.847 seconds
Throughput: 118.2 MB/s
$ ./rio_copy --compare-buffering large_file.bin /dev/null
================================================================================
BUFFERING STRATEGY COMPARISON
================================================================================
File: large_file.bin (1073741824 bytes, 1 GB)
Strategy Buffer Syscalls Time Throughput
--------------------------------------------------------------------------------
Unbuffered 1 byte 1073741824 892.3s 1.1 MB/s
Small buffer 64 bytes 16777216 14.2s 72.1 MB/s
Default (8KB) 8192 131073 1.12s 914.3 MB/s
Large (64KB) 65536 16385 0.98s 1044.7 MB/s
Huge (1MB) 1048576 1025 0.91s 1124.9 MB/s
mmap N/A ~3 0.84s 1218.2 MB/s
Analysis:
Syscall overhead at 1-byte: ~830 ns/call (context switch dominated)
Optimal buffer size for this system: 64KB-1MB
mmap advantage: eliminates copy to user buffer
$ ./rio_tee input.txt output1.txt output2.txt --trace
================================================================================
TEE WITH SYSCALL TRACE
================================================================================
Reading from: input.txt (fd=3)
Writing to: output1.txt (fd=4), output2.txt (fd=5)
[strace-style output]
read(3, "Hello, World!\nThis is...", 8192) = 847
write(4, "Hello, World!\nThis is...", 847) = 847
write(5, "Hello, World!\nThis is...", 847) = 847
read(3, "", 8192) = 0 (EOF)
$ ./rio_cat --handle-signals file.txt
================================================================================
SIGNAL-SAFE I/O DEMONSTRATION
================================================================================
Reading file.txt with EINTR handling...
[Simulating signal interruption]
read(3, buf, 8192) = -1, errno=EINTR (signal received during read)
-> Automatically retrying...
read(3, buf, 8192) = 8192 (success after retry)
Signal-safe I/O pattern demonstrated:
Total EINTR occurrences: 3
All automatically handled by rio_readn()
The Core Question You're Answering
How do you build I/O routines that correctly handle the realities of Unix: partial operations, interrupted system calls, and the performance tradeoffs of buffering?
Concepts You Must Understand First
- File Descriptors - What are they? Relationship to the open file table and v-node table. CS:APP Ch. 10.1-10.2
- Short Counts - Why does read() return less than requested? When is this normal vs an error? CS:APP Ch. 10.4
- Buffered vs Unbuffered I/O - stdio vs Unix I/O, when to use each, the dangers of mixing them. CS:APP Ch. 10.9
- EINTR Handling - What causes interrupted syscalls? How to handle them correctly. TLPI Ch. 21.5
- Memory-Mapped I/O - mmap() for files, advantages and gotchas. CS:APP Ch. 9.8
Questions to Guide Your Design
- What is the contract of rio_readn()? What does it guarantee?
- When should you use unbuffered I/O vs buffered?
- How do you handle EINTR - retry or propagate?
- What happens if you mix printf() with write()? Why?
- When is mmap better than read/write?
Thinking Exercise
Consider this scenario:
int fd = open("data.bin", O_RDONLY);
char buf[1000];
int n = read(fd, buf, 1000); // n = 847 (short count!)
- Is this an error? How do you know?
- What could cause this? (List at least 4 scenarios)
- How would you modify the code to guarantee reading exactly 1000 bytes (or EOF)?
- What if fd is a socket instead of a file? Does your answer change?
The Interview Questions They'll Ask
- "What does it mean when read() returns less than you asked for?" - Short count; normal for pipes/sockets/signals; check errno for errors
- "Why is printf() not safe to use after fork() before exec()?" - Buffered I/O; the buffer may be copied, producing double output
- "How would you efficiently copy a large file?" - read/write with a large buffer, or mmap + memcpy, or sendfile()
- "Explain the difference between Unix I/O and Standard I/O." - Buffering, portability, performance tradeoffs
- "What happens when you read() from a pipe with no data?" - Blocks until data arrives or all writers close (EOF)
Hints in Layers
Layer 1 - Rio Readn:
ssize_t rio_readn(int fd, void *usrbuf, size_t n) {
size_t nleft = n;
char *bufp = usrbuf;
while (nleft > 0) {
ssize_t nread = read(fd, bufp, nleft);
if (nread < 0) {
if (errno == EINTR) continue; // Retry on interrupt
return -1; // Error
} else if (nread == 0) {
break; // EOF
}
nleft -= nread;
bufp += nread;
}
return n - nleft; // Bytes actually read
}
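The write side needs the same discipline, since pipes and sockets routinely accept fewer bytes than requested. This is essentially the rio_writen companion as it appears in CS:APP's RIO package:

```c
#include <errno.h>
#include <unistd.h>

/* Keep writing until all n bytes are out, retrying on EINTR and
 * on partial writes. Returns n on success, -1 on error. */
ssize_t rio_writen(int fd, const void *usrbuf, size_t n) {
    size_t nleft = n;
    const char *bufp = usrbuf;
    while (nleft > 0) {
        ssize_t nwritten = write(fd, bufp, nleft);
        if (nwritten <= 0) {
            if (errno == EINTR)
                nwritten = 0;   /* interrupted before writing: retry */
            else
                return -1;      /* real error */
        }
        nleft -= nwritten;
        bufp += nwritten;
    }
    return n;
}
```

Together with rio_readn, this pair is what makes the "write(4, buf+4096, 4096) = 4096 (retry succeeded)" line in the trace above automatic rather than a bug.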
Layer 2 - Buffered Reader Structure:
typedef struct {
int fd;
int cnt; // Unread bytes in buffer
char *bufptr; // Next unread byte
char buf[8192]; // Internal buffer
} rio_t;
Layer 3 - mmap for File I/O:
void *src = mmap(NULL, file_size, PROT_READ, MAP_PRIVATE, fd, 0);
// Now src points to file contents - no read() needed!
// But: must handle SIGBUS if file truncated while mapped
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Unix I/O Fundamentals | Computer Systems: A Programmer's Perspective | Ch. 10 |
| Robust I/O Wrappers | Computer Systems: A Programmer's Perspective | Ch. 10.5 |
| File I/O in Depth | The Linux Programming Interface | Ch. 4-5 |
| Memory Mapping | Advanced Programming in the UNIX Environment | Ch. 14.8 |
| Standard I/O Library | Advanced Programming in the UNIX Environment | Ch. 5 |
Project 16: Concurrency Workbench
| Attribute | Value |
|---|---|
| Language | C (alt: Rust, Zig, Go) |
| Difficulty | Expert |
| Time | 2–3 weeks |
| Chapters | 12 |
| Coolness | ★★★★★ Hardcore Tech Flex |
| Portfolio Value | Micro-SaaS/Pro Tool |
What you'll build: A server framework that can switch between concurrency models (iterative, process-per-request, thread-per-request, thread pool), with a bounded-buffer work queue and stress-test harness.
Why it matters: Chapter 12 is about choosing the right concurrency model and proving correctness under races and deadlocks.
Core challenges:
- Correct producer/consumer queue design (synchronization)
- Avoiding deadlocks and starvation (concurrency hazards)
- Designing stress tests that actually expose races (verification discipline)
Key concepts to master:
- Threads and synchronization (Ch. 12)
- Semaphores/condition-variable patterns (Ch. 12)
- Concurrency correctness discipline (OSTEP reference)
Prerequisites: Projects 11 and 15 recommended.
Deliverable: Demonstrate throughput gains by model, and explain every bug as a race/deadlock pattern.
Implementation hints:
- Require "debug mode" invariants: queue length bounds, lock ordering rules
- Log enough to prove "what happened" without relying on luck
Milestones:
- You can reproduce and fix at least one real race condition
- Your thread pool remains stable under stress (no deadlocks)
- You can justify which concurrency model fits which workload
Real World Outcome
$ ./concbench --mode=compare --requests=10000
================================================================================
CONCURRENCY MODEL COMPARISON
================================================================================
Workload: Echo server, 10000 requests, 100 concurrent clients
Request size: 1KB, think time: 0ms (stress test)
Model Throughput Latency(p50) Latency(p99) Memory
--------------------------------------------------------------------------------
Iterative 1,247 req/s 0.8 ms 12.3 ms 2 MB
Process-per-request 3,891 req/s 2.1 ms 45.2 ms 847 MB
Thread-per-request 12,456 req/s 0.4 ms 8.7 ms 124 MB
Thread pool (8) 34,567 req/s 0.2 ms 3.2 ms 18 MB
Thread pool (32) 31,234 req/s 0.3 ms 4.1 ms 34 MB
Event-driven (epoll) 45,678 req/s 0.1 ms 2.1 ms 8 MB
Analysis:
- Iterative: Simple but serializes all requests
- Process-per-request: Memory explosion from fork overhead
- Thread-per-request: Good throughput but thread creation overhead
- Thread pool: Best balance of throughput and resource usage
- Event-driven: Highest throughput, lowest memory, most complex
$ ./concbench --mode=race-demo
================================================================================
RACE CONDITION DEMONSTRATION
================================================================================
Running counter increment test (1000000 ops, 8 threads)...
WITHOUT synchronization:
Expected final count: 1000000
Actual final count: 847293 <-- RACE CONDITION!
Lost updates: 152707 (15.3%)
Race detected! Example interleaving:
Thread 1: load counter (value: 42)
Thread 2: load counter (value: 42)
Thread 1: increment -> 43
Thread 2: increment -> 43 <-- Uses stale value!
Thread 1: store counter (43)
Thread 2: store counter (43) <-- Overwrites!
WITH mutex:
Expected: 1000000, Actual: 1000000 (correct!)
Overhead: 2.3x slower than racy version
WITH atomic operations:
Expected: 1000000, Actual: 1000000 (correct!)
Overhead: 1.4x slower than racy version
$ ./concbench --mode=deadlock-demo
================================================================================
DEADLOCK DEMONSTRATION
================================================================================
Scenario: Transfer between two accounts (A and B)
Thread 1: transfer A -> B (locks A, then B)
Thread 2: transfer B -> A (locks B, then A)
WITHOUT lock ordering:
Running 10000 transfers...
[DEADLOCK DETECTED at iteration 47!]
Thread states:
Thread 1: holding lock_A, waiting for lock_B
Thread 2: holding lock_B, waiting for lock_A
Cycle detected: T1 -> lock_B -> T2 -> lock_A -> T1
WITH consistent lock ordering (always lock lower address first):
Running 10000 transfers...
Completed successfully! No deadlocks.
$ ./concbench --mode=producer-consumer --producers=4 --consumers=4 --queue-size=16
================================================================================
BOUNDED BUFFER PRODUCER/CONSUMER
================================================================================
Configuration: 4 producers, 4 consumers, queue capacity: 16
Items to produce: 100000
Running...
[Producer 0] produced 25000 items (blocked 1847 times on full queue)
[Producer 1] produced 25000 items (blocked 1923 times)
[Consumer 0] consumed 25000 items (blocked 2134 times on empty queue)
[Consumer 3] consumed 25000 items (blocked 2089 times)
Summary:
All 100000 items produced and consumed correctly
No items lost or duplicated
Queue utilization: 78.3% (good balance)
Average wait time: 0.12 ms
The Core Question You're Answering
How do you write concurrent programs that are both correct (no races, no deadlocks) and performant, and how do you choose the right concurrency model for your workload?
Concepts You Must Understand First
- Threads vs Processes - Shared-state implications, creation overhead, isolation tradeoffs. CS:APP Ch. 12.3
- Critical Sections and Mutual Exclusion - What needs protecting? What are the primitives? CS:APP Ch. 12.4
- Semaphores - Counting vs binary, wait/signal semantics, the producer-consumer pattern. CS:APP Ch. 12.5
- Deadlock - Four conditions (mutual exclusion, hold-and-wait, no preemption, circular wait) and prevention strategies. CS:APP Ch. 12.7.3
- Thread Safety - Reentrant functions, thread-local storage, what makes code unsafe. CS:APP Ch. 12.7
- Concurrency Models - Process-based, thread-based, event-driven, and their tradeoffs. CS:APP Ch. 12.1-12.2
Questions to Guide Your Design
- What shared state exists? How will you protect it?
- What is your lock ordering discipline? Document it!
- How will you detect deadlocks during development?
- What invariants must hold before/after each critical section?
- How will you stress test for races?
Thinking Exercise
Consider this producer-consumer scenario:
int buffer[N];
int count = 0; // Items in buffer
void producer() {
while (1) {
int item = produce();
while (count == N) ; // Spin while full
buffer[count++] = item; // BUG: multiple bugs here!
}
}
void consumer() {
while (1) {
while (count == 0) ; // Spin while empty
int item = buffer[--count]; // BUG!
consume(item);
}
}
- Identify at least 3 bugs in this code
- What happens if two producers run simultaneously?
- What happens if producer and consumer race on count?
- Rewrite using a mutex and condition variables
- Rewrite using semaphores (simpler!)
The Interview Questions They'll Ask
- "What is a race condition and how do you prevent it?" - Unordered access to shared state; use locks/atomics
- "Explain deadlock and how to avoid it." - Circular wait on locks; use lock ordering or try-lock
- "When would you use a thread pool instead of thread-per-request?" - Thread creation overhead, resource limits, predictable memory
- "What's the difference between a mutex and a semaphore?" - Mutex for mutual exclusion (one holder), semaphore for counting resources
- "How do you debug a race condition?" - Thread sanitizer, stress testing, code review for unprotected shared state
- "What makes a function thread-safe?" - No shared state, or proper synchronization; reentrancy
Hints in Layers
Layer 1 - Basic Mutex Usage:
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
void safe_increment(int *counter) {
pthread_mutex_lock(&lock);
(*counter)++;
pthread_mutex_unlock(&lock);
}
Layer 2 - Semaphore-based Producer/Consumer:
sem_t slots; // Empty slots (init to N)
sem_t items; // Items available (init to 0)
sem_t mutex; // Buffer access (init to 1)
void producer() {
sem_wait(&slots); // Wait for empty slot
sem_wait(&mutex);
// Add to buffer
sem_post(&mutex);
sem_post(&items); // Signal item available
}
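The consumer side mirrors the producer with the two semaphores swapped. A complete, runnable version of the pattern might look like the sketch below; the buffer size, item count, and function names are illustrative choices, not from the text, and it assumes POSIX unnamed semaphores (Linux):

```c
#include <pthread.h>
#include <semaphore.h>

#define NSLOTS 4
#define NITEMS 1000

static int buf[NSLOTS];
static int in_pos, out_pos;
static sem_t slots, items, mutex;
static long consumed_sum;   /* written only by the consumer thread */

static void *producer(void *arg) {
    (void)arg;
    for (int i = 1; i <= NITEMS; i++) {
        sem_wait(&slots);               /* wait for an empty slot */
        sem_wait(&mutex);
        buf[in_pos] = i;
        in_pos = (in_pos + 1) % NSLOTS;
        sem_post(&mutex);
        sem_post(&items);               /* signal: item available */
    }
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    for (int i = 0; i < NITEMS; i++) {
        sem_wait(&items);               /* wait for an item */
        sem_wait(&mutex);
        int v = buf[out_pos];
        out_pos = (out_pos + 1) % NSLOTS;
        sem_post(&mutex);
        sem_post(&slots);               /* signal: slot freed */
        consumed_sum += v;
    }
    return NULL;
}

/* Runs one producer and one consumer; returns the sum of all
 * consumed items, which should equal NITEMS*(NITEMS+1)/2. */
long run_bounded_buffer(void) {
    pthread_t p, c;
    sem_init(&slots, 0, NSLOTS);
    sem_init(&items, 0, 0);
    sem_init(&mutex, 0, 1);
    in_pos = out_pos = 0;
    consumed_sum = 0;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return consumed_sum;
}
```

Note the ordering: the producer waits on slots before mutex (and the consumer on items before mutex). Reversing it, grabbing the mutex first and then blocking on a full or empty buffer, deadlocks the whole system.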
Layer 3 - Thread Pool Pattern:
typedef struct {
pthread_t *threads;
task_queue_t queue;
pthread_mutex_t lock;
pthread_cond_t notify;
int shutdown;
} threadpool_t;
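The heart of the thread pool struct above is the worker dequeue loop: wait on the condition variable while the queue is empty, exit once shutdown is set and the queue is drained. A minimal sketch, with integer "tasks" standing in for function pointers and all names (submit, run_pool, worker) being illustrative:

```c
#include <pthread.h>

#define QCAP 8
static int q[QCAP];
static int qhead, qtail, qlen;
static int pool_shutdown;
static long processed;      /* sum of executed tasks, guarded by qlock */
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t qnotify = PTHREAD_COND_INITIALIZER;  /* task available */
static pthread_cond_t qspace  = PTHREAD_COND_INITIALIZER;  /* slot free */

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&qlock);
        while (qlen == 0 && !pool_shutdown)
            pthread_cond_wait(&qnotify, &qlock);  /* sleeps with lock released */
        if (qlen == 0 && pool_shutdown) {         /* drained and told to stop */
            pthread_mutex_unlock(&qlock);
            return NULL;
        }
        int task = q[qhead];
        qhead = (qhead + 1) % QCAP;
        qlen--;
        processed += task;      /* "execute" under the lock for simplicity */
        pthread_cond_signal(&qspace);
        pthread_mutex_unlock(&qlock);
    }
}

void submit(int task) {
    pthread_mutex_lock(&qlock);
    while (qlen == QCAP)                          /* bounded queue */
        pthread_cond_wait(&qspace, &qlock);
    q[qtail] = task;
    qtail = (qtail + 1) % QCAP;
    qlen++;
    pthread_cond_signal(&qnotify);
    pthread_mutex_unlock(&qlock);
}

/* Starts nworkers (<= 16), submits tasks 1..ntasks, shuts down
 * gracefully, and returns the sum of all executed tasks. */
long run_pool(int nworkers, int ntasks) {
    pthread_t tid[16];
    qhead = qtail = qlen = 0;
    pool_shutdown = 0;
    processed = 0;
    for (int i = 0; i < nworkers; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int t = 1; t <= ntasks; t++)
        submit(t);
    pthread_mutex_lock(&qlock);
    pool_shutdown = 1;
    pthread_cond_broadcast(&qnotify);  /* wake every sleeping worker */
    pthread_mutex_unlock(&qlock);
    for (int i = 0; i < nworkers; i++)
        pthread_join(tid[i], NULL);
    return processed;
}
```

The while (not if) around pthread_cond_wait matters: condition variables permit spurious wakeups, and another worker may have stolen the task between the signal and the wakeup.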
Layer 4 - Deadlock Prevention (Lock Ordering):
// ALWAYS acquire locks in consistent order (e.g., by address)
void transfer(account_t *from, account_t *to, int amount) {
account_t *first = (from < to) ? from : to;
account_t *second = (from < to) ? to : from;
pthread_mutex_lock(&first->lock);
pthread_mutex_lock(&second->lock);
// Transfer...
pthread_mutex_unlock(&second->lock);
pthread_mutex_unlock(&first->lock);
}
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Thread Programming | Computer Systems: A Programmer's Perspective | Ch. 12 |
| Synchronization Patterns | Operating Systems: Three Easy Pieces | Ch. 26-32 |
| POSIX Threads | The Linux Programming Interface | Ch. 29-33 |
| Advanced Threading | Advanced Programming in the UNIX Environment | Ch. 11-12 |
| Concurrency Patterns | Unix Network Programming Vol 1 | Ch. 26-30 |
Phase 5: Capstone
Project 17: CS:APP Capstone – Secure, Observable, High-Performance Proxy
| Attribute | Value |
|---|---|
| Language | C (alt: Rust, Zig, Go) |
| Difficulty | Expert |
| Time | 2–3 months |
| Chapters | All |
| Coolness | ★★★★★ Pure Magic |
| Portfolio Value | Open Core Infrastructure |
What you'll build: A production-minded proxy that includes caching, configurable concurrency, robust error handling, performance instrumentation, and security hardening against common memory-safety failures.
Why it matters: It forces you to use every major idea: representation, machine-level understanding, caching/locality, linking/loading, ECF, VM, Unix I/O, networking, and concurrency.
Core challenges:
- Correctness under partial I/O and malformed inputs (robust I/O + defensive parsing)
- High throughput without races/deadlocks (synchronization)
- Measurable performance wins via locality and reduced syscalls (Ch. 5โ6)
- Debuggability via symbols, interposition, and structured logs (Ch. 7)
- Hardening and post-mortems for memory errors (Ch. 3, 8, 9)
Key concepts to master:
- Robust systems programming discipline (Appendix)
- Concurrency design patterns (Ch. 12)
- Caching and locality (Ch. 6)
- VM and mapping (Ch. 9)
- Network programming (Ch. 11)
Prerequisites: Complete Projects 1, 2, 4, 12, 15, and 16 (or equivalents).
Deliverable: Route real browser traffic through your proxy, observe metrics, reproduce failures, and explain behavior/performance in CS:APP terms.
Implementation hints:
- Define "done" as a checklist: correctness, load test results, metrics present, and at least one documented post-mortem of a bug you introduced and fixed
Milestones:
- Correct proxying + robust I/O under adverse conditions
- Concurrency scales with evidence and no correctness regressions
- You debug performance and correctness using only system evidence (symbols, traces, logs, memory maps)
Real World Outcome
$ ./proxy --port=8080 --threads=8 --cache-size=64MB
================================================================================
CS:APP CAPSTONE PROXY - Production Mode
================================================================================
Configuration: port=8080, workers=8, cache=64MB (LRU)
[14:30:00] Proxy started, listening on port 8080
$ curl -x localhost:8080 http://example.com/page.html
[14:30:05] GET http://example.com/page.html -> Cache MISS -> 200 OK (847ms)
$ ./proxy --metrics-report
PERFORMANCE: 12,456 requests, 207.6 req/min, 1.23 GB transferred
CACHE: 67.3% hit rate, 52.3 MB used, 234 evictions
LATENCY: p50=23ms, p90=89ms, p99=234ms
CONCURRENCY: 47 active, 312 peak, 73.2% utilization
$ ./loadtest --target=localhost:8080 --concurrent=100 --duration=60s
Results: 45,678 requests, 761.3 req/s, 99.86% success
Latency: min=2ms, max=1247ms, mean=34ms
$ firefox --proxy=localhost:8080
[Browsing session through your proxy - real traffic!]
The Core Question You're Answering
How do you build a production-quality networked system integrating robust I/O, concurrency, caching, and security - debugging with systems-level tools?
Concepts You Must Understand First
This capstone integrates ALL CS:APP concepts:
- Ch. 2: Binary protocol parsing, endianness
- Ch. 3: Crash debugging, security
- Ch. 5-6: Performance, cache-friendly structures
- Ch. 7: Interposition for debugging
- Ch. 8: Signal handling, graceful shutdown
- Ch. 9: mmap for cache
- Ch. 10: Robust I/O
- Ch. 11: Sockets, HTTP, DNS
- Ch. 12: Thread pools, synchronization
Questions to Guide Your Design
- Thread pool vs event-driven vs hybrid?
- Cache data structure and eviction policy?
- Handling slow clients, timeouts, keep-alive?
- Malformed HTTP request handling?
- What happens when origin is down?
- Metrics without hurting performance?
- Buffer overflow prevention?
Thinking Exercise
Design the cache on paper:
- Data structure for storage? (Hash table with what key?)
- Concurrent access? (Reader-writer lock? Per-bucket?)
- LRU eviction implementation?
- What if an entry is read while it is being evicted?
- Content larger than memory?
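The LRU bookkeeping in this exercise is mostly doubly-linked-list surgery. Here is a minimal sketch in C; the names (CacheEntry, lru_touch, lru_insert) are mine, not from the book, and a real cache additionally needs a hash table for lookup plus locking around all of this:

```c
#include <stddef.h>

/* One cache entry in an intrusive doubly-linked LRU list. */
typedef struct CacheEntry {
    char key[64];
    struct CacheEntry *prev, *next;
} CacheEntry;

typedef struct {
    CacheEntry *head;      /* most recently used */
    CacheEntry *tail;      /* least recently used */
    int count, capacity;
} LruList;

/* Unlink an entry that is currently in the list. */
static void lru_unlink(LruList *l, CacheEntry *e) {
    if (e->prev) e->prev->next = e->next; else l->head = e->next;
    if (e->next) e->next->prev = e->prev; else l->tail = e->prev;
    e->prev = e->next = NULL;
}

/* Put an entry (not currently linked) at the MRU position. */
static void lru_push_front(LruList *l, CacheEntry *e) {
    e->prev = NULL;
    e->next = l->head;
    if (l->head) l->head->prev = e;
    l->head = e;
    if (!l->tail) l->tail = e;
}

/* On a cache hit: move the entry to the front. */
void lru_touch(LruList *l, CacheEntry *e) {
    lru_unlink(l, e);
    lru_push_front(l, e);
}

/* Insert a new entry, evicting the LRU tail if full.
 * Returns the evicted entry (caller frees) or NULL. */
CacheEntry *lru_insert(LruList *l, CacheEntry *e) {
    CacheEntry *victim = NULL;
    if (l->count == l->capacity && l->tail) {
        victim = l->tail;
        lru_unlink(l, victim);
        l->count--;
    }
    lru_push_front(l, e);
    l->count++;
    return victim;
}
```

The "entry read while being evicted" question above is exactly why production designs reference-count entries or copy data out under the lock before eviction.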
The Interview Questions They'll Ask
- "Walk through a request from accept() to close()"
- "How did you handle concurrent cache access?"
- "Hardest bug you encountered?"
- "How would you scale to 10x traffic?"
- "Security vulnerabilities considered?"
- "Explain your cache eviction policy"
Hints in Layers
Layer 1: Basic proxy loop with Accept/handle_request/Close
Layer 2: HTTP parsing with rio_readlineb and sscanf
Layer 3: Thread pool with worker dequeue pattern
Layer 4: Cache with pthread_rwlock_t and LRU doubly-linked list
Layer 5: Graceful shutdown via volatile sig_atomic_t flag
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Network Programming | CS:APP | Ch. 11 |
| Robust I/O | CS:APP | Ch. 10 |
| Concurrency | CS:APP | Ch. 12 |
| Sockets API | Unix Network Programming Vol 1 | Ch. 1-8 |
| High-Performance Servers | Unix Network Programming Vol 1 | Ch. 26-30 |
| TCP/IP | TCP/IP Illustrated Vol 1 | Ch. 12-24 |
| Systems Design | The Linux Programming Interface | Ch. 56-63 |
Phase 6: Beyond CS:APP (Advanced Extensions)
These projects extend beyond the core CS:APP curriculum, building on everything you've learned.
Project 18: ELF Linker and Loader
| Attribute | Value |
|---|---|
| Language | C (alt: Rust) |
| Difficulty | Expert |
| Time | 2โ3 weeks |
| Chapters | 7 |
What you'll build: A tiny static linker (myld) for a constrained subset of ELF64 that parses relocatable objects, resolves symbols, applies relocations, and emits a merged output.
Why it matters: "Undefined reference" stops being mysterious, and relocation becomes something you can explain byte for byte.
Core challenges:
- Parsing ELF64 headers, section tables, symbols, and relocations
- Implementing symbol resolution across multiple .o inputs (strong/weak rules)
- Implementing x86-64 relocation types end-to-end
Real World Outcome
When your linker works, you'll see output like this:
$ cat main.c
extern int global_counter;
extern void increment(void);
int main(void) {
increment();
return global_counter;
}
$ cat lib.c
int global_counter = 0;
void increment(void) {
global_counter++;
}
$ gcc -c main.c lib.c
$ ./myld main.o lib.o -o program
================================================================================
MYLD - Minimal ELF64 Static Linker
================================================================================
[PHASE 1] Reading input files...
main.o: 6 sections, 8 symbols, 2 relocations
.text: 40 bytes
.data: 0 bytes
.rodata: 0 bytes
lib.o: 5 sections, 4 symbols, 1 relocation
.text: 32 bytes
.data: 4 bytes
[PHASE 2] Symbol resolution...
Symbol Table (8 unique symbols):
+-----------------+--------+----------+---------+------------+
| Name | Type | Bind | Section | Resolution |
+-----------------+--------+----------+---------+------------+
| main | FUNC | GLOBAL | .text | DEFINED |
| increment | FUNC | GLOBAL | .text | lib.o |
| global_counter | OBJECT | GLOBAL | .data | lib.o |
| _start | FUNC | GLOBAL | (UNDEF) | libc.a |
+-----------------+--------+----------+---------+------------+
Undefined symbols resolved: 2
Strong symbol conflicts: 0
[PHASE 3] Section merging...
.text: 0x401000 (72 bytes from 2 objects)
.data: 0x402000 (4 bytes from 1 object)
.rodata: 0x403000 (0 bytes)
.bss: 0x404000 (0 bytes)
[PHASE 4] Relocation processing...
Applying 3 relocations:
+--------------------+------------+------------------+-------------+--------+
| Type | Offset | Symbol | Addend | Result |
+--------------------+------------+------------------+-------------+--------+
| R_X86_64_PLT32 | 0x401005 | increment | -4 | OK |
| R_X86_64_PC32 | 0x40100b | global_counter | -4 | OK |
| R_X86_64_32S | lib.o:0x08 | global_counter | 0 | OK |
+--------------------+------------+------------------+-------------+--------+
Relocation calculation for R_X86_64_PLT32:
S (symbol addr) = 0x401040 (increment)
P (patch site) = 0x401005 (the 4 displacement bytes of the call)
A (addend) = -4
Result: S + A - P = 0x401040 + (-4) - 0x401005 = 0x37
Bytes written: 37 00 00 00
[PHASE 5] Output generation...
Writing ELF header at offset 0
Writing program headers (3 segments)
Writing section data
Writing section headers
[OUTPUT] program: 8752 bytes written
Entry point: 0x401000
Segments: 3 (LOAD, LOAD, NOTE)
Sections: 7
$ ./program; echo "Exit: $?"
Exit: 1
$ readelf -h program | grep Entry
Entry point address: 0x401000
$ objdump -d program | head -20
program: file format elf64-x86-64
Disassembly of section .text:
0000000000401000 <main>:
401000: 55 push %rbp
401001: 48 89 e5 mov %rsp,%rbp
401004: e8 37 00 00 00 call 401040 <increment>
401009: 8b 05 f1 0f 00 00 mov 0xff1(%rip),%eax
40100f: 5d pop %rbp
401010: c3 ret
0000000000401040 <increment>:
401040: 55 push %rbp
401041: 48 89 e5 mov %rsp,%rbp
401044: 8b 05 b6 0f 00 00 mov 0xfb6(%rip),%eax
40104a: 83 c0 01 add $0x1,%eax
40104d: 89 05 ad 0f 00 00 mov %eax,0xfad(%rip)
401053: 5d pop %rbp
401054: c3 ret
The Core Question You're Answering
"How do separate compilation units become a single executable, and what exactly happens when you see 'undefined reference to foo'?"
The linker is the final stage that turns your mental model of separate .c files into reality. You'll understand why symbols have "linkage," why static functions can't be called from other files, and exactly which bytes get patched during relocation.
Concepts You Must Understand First
- ELF file format (CS:APP 7.4) - Headers, sections, segments
- Symbol tables and types (CS:APP 7.5) - Global/local, strong/weak, defined/undefined
- Relocation entries (CS:APP 7.7) - R_X86_64_PC32, R_X86_64_PLT32, R_X86_64_32S
- x86-64 addressing modes (CS:APP 3.4) - RIP-relative addressing
- Object file sections (CS:APP 7.4) - .text, .data, .bss, .rodata, .symtab, .rela.*
- Two-pass linking (CS:APP 7.6) - Symbol resolution then relocation
Questions to Guide Your Design
- How will you parse ELF headers? Read the structs from <elf.h> or define your own?
- What data structures hold symbol information? Hash table? Sorted array? How do you handle duplicates?
- How do you decide section layout? What addresses do merged sections get? How do you handle alignment?
- How will you track relocations? Each relocation needs: source object, offset, type, target symbol, addend
- What's your relocation calculation? For R_X86_64_PC32 it is S + A - P; where do S, A, and P come from?
- How do you handle weak vs strong symbols? What if two objects both define foo?
- What output format will you produce? Minimal ELF64 executable? How many program headers?
- How will you test correctness? Compare against ld output? Run the executable?
Thinking Exercise
Before writing any code, trace through this relocation by hand:
// In main.o at offset 0x15 in .text section:
// e8 00 00 00 00 call <helper> ; R_X86_64_PLT32, addend = -4
// Symbol 'helper' will be placed at address 0x401080
// The 'call' instruction is at address 0x401015 in final executable
// Relocation formula for R_X86_64_PLT32: S + A - P
// S = symbol address = 0x401080
// A = addend = -4
// P = patch location = 0x401015 + 1 = 0x401016 (byte after opcode)
// Calculate the 4-byte value to write:
// 0x401080 + (-4) - 0x401016 = 0x401080 - 4 - 0x401016 = 0x66
// The patched instruction becomes:
// e8 66 00 00 00 call 0x401080
// Verify: When CPU executes at 0x401015:
// - Reads opcode e8 (relative call)
// - Reads 4-byte displacement: 0x00000066
// - Calculates target: 0x40101a (next instruction) + 0x66 = 0x401080
Now trace this global variable access:
// In main.o at offset 0x20:
// 8b 05 00 00 00 00 mov 0x0(%rip),%eax ; R_X86_64_PC32 to 'counter'
// 'counter' is in .data at 0x402000
// This instruction lands at 0x401020 in final executable
// The relocation patches bytes at 0x401022 (after opcode + ModR/M)
// Calculate: S + A - P
// S = 0x402000, A = -4, P = 0x401022
// Result = 0x402000 - 4 - 0x401022 = 0xfda
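Both traces reduce to one helper that computes S + A - P and writes four little-endian bytes at the patch site. This is a sketch with an invented name (patch_pc32), not the book's code; it assumes P points at the displacement bytes, not the instruction start:

```c
#include <stdint.h>

/* Apply the S + A - P rule (R_X86_64_PC32, and R_X86_64_PLT32 when the
 * target is a defined local symbol) and patch a flat output image. */
int32_t patch_pc32(uint8_t *image, uint64_t image_base,
                   uint64_t S,   /* resolved symbol address      */
                   int64_t  A,   /* addend from the rela entry   */
                   uint64_t P)   /* address of the 4 disp. bytes */
{
    int32_t value = (int32_t)(S + (uint64_t)A - P);
    uint32_t u = (uint32_t)value;
    uint8_t *site = image + (P - image_base);
    site[0] = u & 0xff;            /* little-endian, regardless of host */
    site[1] = (u >> 8)  & 0xff;
    site[2] = (u >> 16) & 0xff;
    site[3] = (u >> 24) & 0xff;
    return value;
}
```

Writing the bytes explicitly rather than via memcpy keeps the output correct even on a big-endian host.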
The Interview Questions They'll Ask
- "What's the difference between a section and a segment?"
- Sections are the link-time view (.text, .data, etc.); segments are the load-time view (LOAD, DYNAMIC)
- The linker works with sections, the loader works with segments
- Multiple sections can be combined into one segment
- "Explain R_X86_64_PC32 vs R_X86_64_32S"
- PC32: PC-relative, 32-bit signed displacement, computed as S + A - P
- 32S: absolute address, sign-extended to 64 bits, computed as S + A
- PC32 is position-independent; 32S requires a known absolute address
- "Why does the linker need two passes?"
- Pass 1: collect all symbols, resolve references (need to know all symbols before patching)
- Pass 2: apply relocations (now we know every symbol's final address)
- A single pass would require backpatching or forward references
- "What happens with multiple definitions of a strong symbol?"
- Linker error: "multiple definition of 'foo'"
- Each strong symbol can only be defined once across all objects
- Weak symbols are overridden by strong ones
- "How does the linker handle common symbols (uninitialized globals)?"
- int foo; in multiple files creates COMMON symbols
- The linker merges them, taking the largest size
- The final location is in the .bss section
- "What's the purpose of the .rela.text section?"
- Contains relocation entries for the .text section
- Each entry: offset, type, symbol index, addend
- Tells the linker which bytes to patch and how
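The strong/weak rules above condense into a small decision function. This sketch (resolve_pair is my name) ignores COMMON symbols and size merging, which a fuller linker must also handle:

```c
#include <elf.h>

/* Decide which of two global symbols wins under the CS:APP
 * strong/weak rules. Returns 0 = keep existing, 1 = take candidate,
 * -1 = two strong definitions (the "multiple definition" link error). */
int resolve_pair(unsigned char existing_bind, int existing_defined,
                 unsigned char candidate_bind, int candidate_defined)
{
    if (!candidate_defined) return 0;   /* a mere reference changes nothing */
    if (!existing_defined)  return 1;   /* first definition wins for now */
    int existing_strong  = (existing_bind  == STB_GLOBAL);
    int candidate_strong = (candidate_bind == STB_GLOBAL);
    if (existing_strong && candidate_strong) return -1;
    if (candidate_strong) return 1;     /* strong overrides weak */
    return 0;                           /* weak loses; weak/weak keeps first */
}
```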
Hints in Layers
Layer 1 - ELF Parsing Foundation:
#include <elf.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
typedef struct {
const char *filename;
uint8_t *data;
size_t size;
Elf64_Ehdr *ehdr;
Elf64_Shdr *shdrs;
const char *shstrtab;
} ElfFile;
int load_elf(const char *path, ElfFile *ef) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return -1; }
    ef->filename = path;
    ef->data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);   /* the mapping stays valid after close */
    if (ef->data == MAP_FAILED) return -1;
    ef->size = st.st_size;
    ef->ehdr = (Elf64_Ehdr *)ef->data;
    ef->shdrs = (Elf64_Shdr *)(ef->data + ef->ehdr->e_shoff);
    ef->shstrtab = (char *)(ef->data + ef->shdrs[ef->ehdr->e_shstrndx].sh_offset);
    return 0;
}
Layer 2 - Symbol Table Extraction:
typedef struct {
char name[256];
uint64_t value;
uint8_t bind; // STB_LOCAL, STB_GLOBAL, STB_WEAK
uint8_t type; // STT_FUNC, STT_OBJECT
uint16_t shndx; // Section index or SHN_UNDEF
ElfFile *source;
} Symbol;
void extract_symbols(ElfFile *ef, Symbol **out, int *count) {
for (int i = 0; i < ef->ehdr->e_shnum; i++) {
if (ef->shdrs[i].sh_type != SHT_SYMTAB) continue;
Elf64_Sym *syms = (Elf64_Sym *)(ef->data + ef->shdrs[i].sh_offset);
int nsyms = ef->shdrs[i].sh_size / sizeof(Elf64_Sym);
// Process each symbol...
}
}
Layer 3 - Symbol Resolution:
int resolve_symbols(Symbol *all_syms, int nsyms, SymEntry *global_tab) {
for (int i = 0; i < nsyms; i++) {
Symbol *sym = &all_syms[i];
if (sym->bind == STB_LOCAL) continue;
SymEntry *existing = hash_lookup(global_tab, sym->name);
// Handle undefined, defined, strong/weak conflicts...
}
return 0;
}
Layer 4 - Relocation Application:
void apply_relocations(ElfFile *ef, uint8_t *output, uint64_t text_base) {
for (int i = 0; i < ef->ehdr->e_shnum; i++) {
if (ef->shdrs[i].sh_type != SHT_RELA) continue;
Elf64_Rela *relas = (Elf64_Rela *)(ef->data + ef->shdrs[i].sh_offset);
// For each relocation: calculate S + A - P and patch
}
}
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Linking Overview | CS:APP | Ch. 7 |
| ELF Format Details | Practical Binary Analysis | Ch. 2 |
| Relocation Types | CS:APP | 7.7 |
| Symbol Resolution | CS:APP | 7.6 |
| ELF Specification | Low-Level Programming | Ch. 4 |
| Dynamic Linking | CS:APP | 7.10-7.12 |
Common Pitfalls & Debugging
Problem 1: Relocation values are wrong
# Symptom: Segfault or jump to wrong address
# Fix: P = address of the 4-byte displacement, not instruction start
$ diff <(objdump -d reference) <(objdump -d program)
Problem 2: Section addresses overlap
# Fix: Page-align sections
text_vaddr = 0x401000;
data_vaddr = (text_vaddr + text_size + 0xfff) & ~0xfff;
Problem 3: Missing symbols from libc
# For minimal linker: restrict to self-contained code with syscalls
Problem 4: ELF validation failures
$ readelf -h program # Compare with readelf -h /bin/ls
Project 19: Virtual Memory Simulator
| Attribute | Value |
|---|---|
| Language | C (alt: C++, Rust) |
| Difficulty | Advanced |
| Time | ~2 weeks |
| Chapters | 9 |
What you'll build: A CLI (vmsim) that simulates page tables, TLB, and page replacement policies on real address traces.
Why it matters: Makes virtual memory visible: you'll see page faults happen, measure TLB hit rates, and compare replacement algorithms.
Core challenges:
- Translating virtual addresses using multi-level page tables
- Implementing FIFO/LRU/Clock replacement policies
- Quantifying TLB hit rates vs. page-table-walk costs
Real World Outcome
When your simulator works, you'll see output like this:
$ cat trace.txt
R 0x00007fff5fbff8a0
W 0x00007fff5fbff8a8
R 0x0000000000400540
R 0x0000000000400544
W 0x00007fff5fbff890
R 0x00007fff5fbff8a0
R 0x0000000000601040
W 0x0000000000601048
$ ./vmsim --trace trace.txt --frames 4 --policy lru --levels 4 --tlb-size 16
================================================================================
VMSIM - Virtual Memory Simulator
================================================================================
Configuration:
Address bits: 48 (virtual), 36 (physical)
Page size: 4 KB (12 offset bits)
Page table levels: 4 (9 + 9 + 9 + 9 bits)
Physical frames: 4
TLB entries: 16
Replacement: LRU
Processing 8 memory accesses...
Access #1: READ 0x00007fff5fbff8a0
VPN: 0x7fff5fbff Offset: 0x8a0
TLB: MISS
Page Walk: L4[0xff] -> L3[0x1fd] -> L2[0xfd] -> L1[0x1ff]
Page Table: MISS (page fault)
[PAGE FAULT] Loading VPN 0x7fff5fbff into frame 0
Physical: 0x0000008a0
+-------+-----------+-------+-------+-----+
| Frame | VPN       | Valid | Dirty | LRU |
+-------+-----------+-------+-------+-----+
| 0     | 7fff5fbff | 1     | 0     | 0   |
| 1     | -         | 0     | -     | -   |
| 2     | -         | 0     | -     | -   |
| 3     | -         | 0     | -     | -   |
+-------+-----------+-------+-------+-----+
Access #2: WRITE 0x00007fff5fbff8a8
VPN: 0x7fff5fbff Offset: 0x8a8
TLB: HIT (frame 0)
Physical: 0x0000008a8
[DIRTY] Marking frame 0 as dirty
Access #3: READ 0x0000000000400540
VPN: 0x000000400 Offset: 0x540
TLB: MISS
Page Table: MISS (page fault)
[PAGE FAULT] Loading VPN 0x000000400 into frame 1
Physical: 0x000001540
Access #4: READ 0x0000000000400544
VPN: 0x000000400 Offset: 0x544
TLB: HIT (frame 1)
Physical: 0x000001544
... (remaining accesses)
Access #7: READ 0x0000000000601040
VPN: 0x000000601 Offset: 0x040
TLB: MISS
Page Table: MISS (page fault)
[PAGE FAULT] Loading VPN 0x000000601 into frame 2
Physical: 0x000002040
================================================================================
SIMULATION SUMMARY
================================================================================
Memory Accesses: 8
Reads: 5
Writes: 3
TLB Statistics:
Hits: 5 (62.5%)
Misses: 3 (37.5%)
Hit rate: 62.50%
Page Table Statistics:
Hits: 0 (0.0% of TLB misses)
Faults: 3
Page fault rate: 37.50%
Page Replacement:
Evictions: 0
Dirty writebacks: 0
Clean evictions: 0
Performance Estimate:
TLB hit cost: 1 cycle
Page walk cost: ~100 cycles (4 levels * 25 cycles)
Page fault cost: ~10,000,000 cycles (disk access)
Estimated cycles: ~30,000,305
If all TLB hits: 8 cycles
Slowdown factor: ~3,750,038x
$ ./vmsim --trace trace.txt --frames 4 --policy fifo --compare
================================================================================
POLICY COMPARISON
================================================================================
| Policy | Page Faults | Evictions | Hit Rate | Dirty Writebacks |
|--------|-------------|-----------|----------|------------------|
| FIFO | 3 | 0 | 62.50% | 0 |
| LRU | 3 | 0 | 62.50% | 0 |
| Clock | 3 | 0 | 62.50% | 0 |
| OPT | 3 | 0 | 62.50% | 0 |
Belady's Anomaly Check (FIFO with varying frames):
2 frames: 3 faults
3 frames: 3 faults
4 frames: 3 faults
5 frames: 3 faults
No anomaly detected in this trace (only 3 distinct pages are touched, so every policy behaves the same; try longer traces to separate them).
The Core Question You're Answering
"What actually happens when the CPU accesses a virtual address, and why do some programs thrash while others run smoothly?"
Virtual memory is the foundation of process isolation and memory efficiency. By building a simulator, you'll understand exactly why 4 GB programs can run on 2 GB machines, what the kernel does during a page fault, and why locality of reference is the most important property of programs.
Concepts You Must Understand First
- Virtual vs physical addresses (CS:APP 9.1) - Address spaces and the MMU
- Page tables and PTEs (CS:APP 9.3) - Structure of multi-level page tables
- TLB operation (CS:APP 9.5) - Translation lookaside buffer as a cache
- Page faults (CS:APP 9.3.4) - What triggers them, kernel handling
- Replacement policies (OSTEP Ch. 21-22) - FIFO, LRU, Clock, OPT
- Working set and locality (CS:APP 9.9) - Why caching works
Questions to Guide Your Design
- How will you represent page table entries? What fields: present, dirty, accessed, frame number?
- How many levels of page tables? x86-64 uses 4 levels - will you simulate all 4?
- What's your TLB data structure? Fully associative? Set associative? What replacement policy?
- How will you track LRU order? Timestamp? Doubly-linked list? Counter bits?
- How do you implement the Clock algorithm? What's the "second chance" logic?
- How will you read trace files? What format: <R/W> <hex address>?
- What statistics will you collect? TLB hits, page faults, dirty writebacks?
- How will you validate correctness? Known traces with expected results?
Thinking Exercise
Before writing any code, trace through this address translation by hand:
// Virtual address: 0x00007fff5fbff8a0
// 48-bit address, 4KB pages, 4-level page table
// Break down the address (4KB = 12 offset bits):
// Binary: 0000 0000 0000 0000 0111 1111 1111 1111 0101 1111 1011 1111 1111 1000 1010 0000
//
// Bits [47:39] = L4 index = 0x0ff (255) - index into PML4
// Bits [38:30] = L3 index = 0x1fd (509) - index into PDPT
// Bits [29:21] = L2 index = 0x0fd (253) - index into PD
// Bits [20:12] = L1 index = 0x1ff (511) - index into PT
// Bits [11:0] = Offset = 0x8a0 (2208) - offset within page
// Page walk:
// 1. CR3 contains physical address of PML4
// 2. Read PML4[255] -> physical address of PDPT
// 3. Read PDPT[510] -> physical address of PD
// 4. Read PD[382] -> physical address of PT
// 5. Read PT[511] -> PTE with frame number (or page fault if not present)
// 6. Physical address = (frame_number << 12) | 0x8a0
// TLB caches: VPN -> (frame_number, permissions)
// VPN = upper 36 bits = 0x7fff5fbff
Now trace a replacement decision:
// 4 frames, LRU policy, current state:
// Frame 0: VPN 0xABC, accessed at time 5
// Frame 1: VPN 0xDEF, accessed at time 8 (most recent)
// Frame 2: VPN 0x123, accessed at time 2 (least recent - EVICT THIS)
// Frame 3: VPN 0x456, accessed at time 6
// New access to VPN 0x789 at time 9 (page fault):
// 1. Find LRU frame: Frame 2 (time 2)
// 2. If Frame 2 is dirty, write back to disk
// 3. Load VPN 0x789 into Frame 2
// 4. Update access time to 9
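The index arithmetic in the first trace is easy to get wrong by a bit or two, so it pays to check it mechanically with a few lines of C (split_vaddr is an illustrative name, mirroring Layer 1 below):

```c
#include <stdint.h>

/* Extract the four 9-bit page-table indices and the 12-bit page offset
 * from a 48-bit virtual address with 4 KB pages. */
void split_vaddr(uint64_t va, unsigned idx[4], unsigned *offset) {
    *offset = va & 0xfff;
    idx[0] = (va >> 39) & 0x1ff;   /* L4 (PML4) */
    idx[1] = (va >> 30) & 0x1ff;   /* L3 (PDPT) */
    idx[2] = (va >> 21) & 0x1ff;   /* L2 (PD)   */
    idx[3] = (va >> 12) & 0x1ff;   /* L1 (PT)   */
}
```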
The Interview Questions They'll Ask
- "Walk through what happens when a process accesses memory"
- CPU generates virtual address
- TLB lookup (fast path if hit)
- On TLB miss: walk page table (multiple memory accesses)
- If page not present: page fault, kernel loads from disk
- Update TLB, return physical address
- "Why do we use multi-level page tables instead of single-level?"
- Single level for 48-bit addresses would need 2^36 entries (a 512 GB table at 8 bytes per entry!)
- Multi-level allows sparse allocation
- Only allocate tables for used regions
- Trade-off: more memory accesses per translation
- "Explain the Clock page replacement algorithm"
- Approximation of LRU with lower overhead
- Reference bit set on access, cleared by clock hand
- Hand sweeps, evicting first page with ref=0
- Gives pages a "second chance" if recently used
- "What is thrashing and how do you detect it?"
- Working set exceeds physical memory
- Constant page faults, CPU mostly waiting for I/O
- Detect: page fault rate exceeds threshold
- Solution: reduce multiprogramming or add memory
- "What's Belady's anomaly and which algorithms are immune?"
- FIFO can have MORE faults with MORE frames
- Example: sequence 1,2,3,4,1,2,5,1,2,3,4,5 with 3 vs 4 frames
- Stack algorithms (LRU, OPT) are immune
- FIFO is not a stack algorithm
- "How does the TLB interact with context switches?"
- TLB entries are process-specific
- Context switch invalidates TLB (or uses ASID)
- Cold TLB after switch causes many page walks
- ASID allows TLB entries to persist across switches
Hints in Layers
Layer 1 - Address Parsing:
#define PAGE_SIZE 4096
#define PAGE_BITS 12
#define VPN_BITS 36
#define LEVELS 4
#define LEVEL_BITS 9
typedef struct {
uint64_t vpn; // Virtual page number
uint16_t offset; // Offset within page
uint16_t indices[4]; // Indices for each level
} ParsedAddress;
ParsedAddress parse_address(uint64_t vaddr) {
ParsedAddress pa;
pa.offset = vaddr & 0xFFF;
pa.vpn = vaddr >> PAGE_BITS;
uint64_t temp = pa.vpn;
for (int i = LEVELS - 1; i >= 0; i--) {
pa.indices[i] = temp & 0x1FF; // 9 bits per level
temp >>= LEVEL_BITS;
}
return pa;
}
Layer 2 - Page Table Structure:
typedef struct {
uint32_t frame; // Physical frame number
uint8_t present; // Is page in memory?
uint8_t dirty; // Has page been written?
uint8_t accessed; // For clock algorithm
} PTE;
typedef struct PageTable {
PTE entries[512]; // 2^9 entries per level
struct PageTable *children[512]; // Pointers to next level
} PageTable;
PageTable *root; // PML4
int walk_page_table(uint64_t vpn, uint32_t *frame_out) {
PageTable *current = root;
ParsedAddress pa = parse_address(vpn << PAGE_BITS);
for (int level = 0; level < LEVELS - 1; level++) {
int idx = pa.indices[level];
if (!current->children[idx]) return -1; // Not mapped
current = current->children[idx];
}
int final_idx = pa.indices[LEVELS - 1];
if (!current->entries[final_idx].present) return -1;
*frame_out = current->entries[final_idx].frame;
return 0;
}
Layer 3 - TLB Implementation:
typedef struct {
uint64_t vpn;
uint32_t frame;
uint8_t valid;
uint64_t last_access; // For LRU
} TLBEntry;
TLBEntry tlb[TLB_SIZE];
uint64_t access_counter = 0;
int tlb_lookup(uint64_t vpn, uint32_t *frame_out) {
for (int i = 0; i < TLB_SIZE; i++) {
if (tlb[i].valid && tlb[i].vpn == vpn) {
tlb[i].last_access = ++access_counter;
*frame_out = tlb[i].frame;
return 1; // Hit
}
}
return 0; // Miss
}
void tlb_insert(uint64_t vpn, uint32_t frame) {
// Find empty or LRU entry
int victim = 0;
uint64_t oldest = UINT64_MAX;
for (int i = 0; i < TLB_SIZE; i++) {
if (!tlb[i].valid) { victim = i; break; }
if (tlb[i].last_access < oldest) {
oldest = tlb[i].last_access;
victim = i;
}
}
tlb[victim] = (TLBEntry){vpn, frame, 1, ++access_counter};
}
Layer 4 - Page Replacement Policies:
typedef struct {
uint64_t vpn;
uint8_t valid;
uint8_t dirty;
uint8_t ref_bit; // For clock
uint64_t load_time; // For FIFO
uint64_t last_access; // For LRU
} Frame;
Frame frames[MAX_FRAMES];
int clock_hand = 0;
int find_victim_lru(void) {
int victim = -1;
uint64_t oldest = UINT64_MAX;
for (int i = 0; i < num_frames; i++) {
if (!frames[i].valid) return i;
if (frames[i].last_access < oldest) {
oldest = frames[i].last_access;
victim = i;
}
}
return victim;
}
int find_victim_clock(void) {
while (1) {
if (!frames[clock_hand].ref_bit) {
int victim = clock_hand;
clock_hand = (clock_hand + 1) % num_frames;
return victim;
}
frames[clock_hand].ref_bit = 0; // Second chance
clock_hand = (clock_hand + 1) % num_frames;
}
}
Layer 5 - Main Simulation Loop:
void simulate(const char *trace_file) {
FILE *f = fopen(trace_file, "r");
char op;
uint64_t addr;
while (fscanf(f, " %c %lx", &op, &addr) == 2) {
stats.total_accesses++;
uint64_t vpn = addr >> PAGE_BITS;
uint32_t frame;
// Try TLB
if (tlb_lookup(vpn, &frame)) {
stats.tlb_hits++;
} else {
stats.tlb_misses++;
// Walk page table
if (walk_page_table(vpn, &frame) < 0) {
stats.page_faults++;
frame = handle_page_fault(vpn);
}
tlb_insert(vpn, frame);
}
// Update access info
frames[frame].last_access = ++access_counter;
frames[frame].ref_bit = 1;
if (op == 'W') frames[frame].dirty = 1;
}
}
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Virtual Memory Overview | CS:APP | Ch. 9 |
| Address Translation | CS:APP | 9.3-9.5 |
| Page Replacement | OSTEP | Ch. 21-22 |
| TLBs | CS:APP | 9.5 |
| Working Sets | OSTEP | Ch. 22 |
| Linux VM Implementation | TLPI | Ch. 49-50 |
Common Pitfalls & Debugging
Problem 1: Address parsing is off by one level
# Symptom: All accesses go to wrong frame
# Check your bit extraction:
printf("VPN: 0x%lx\n", addr >> 12);
printf("L4: %d L3: %d L2: %d L1: %d\n",
(addr >> 39) & 0x1ff, (addr >> 30) & 0x1ff,
(addr >> 21) & 0x1ff, (addr >> 12) & 0x1ff);
Problem 2: LRU timestamps not updating
# Symptom: Same frame always evicted
# Fix: Update last_access on EVERY access, not just faults
frames[frame].last_access = ++global_counter;
Problem 3: Clock hand not wrapping
# Symptom: Array out of bounds or stuck
clock_hand = (clock_hand + 1) % num_frames;
Problem 4: Dirty bit not set on writes
# Symptom: Dirty writeback count always zero
if (op == 'W' || op == 'w') {
frames[frame].dirty = 1;
}
Project 20: HTTP Web Server
| Attribute | Value |
|---|---|
| Language | C (alt: Rust, Go) |
| Difficulty | Intermediate |
| Time | 1โ2 weeks |
| Chapters | 10, 11, 8 |
What you'll build: A small but real HTTP server (tiny) that parses requests, serves static files, and runs simple CGI-style dynamic handlers.
Why it matters: Connects sockets, HTTP parsing, and process control into a working networked application.
Core challenges:
- Implementing request/response loop with sockets API
- Parsing HTTP request lines and headers defensively
- Serving static files with correct MIME types
Real World Outcome
When your server works, you'll see output like this:
$ ./tiny 8080 ./www &
================================================================================
TINY - Minimal HTTP/1.1 Web Server
================================================================================
[INIT] Document root: ./www
[INIT] Listening on port 8080
[INIT] Server ready. Press Ctrl+C to shutdown.
$ curl -v http://localhost:8080/index.html
* Trying 127.0.0.1:8080...
* Connected to localhost (127.0.0.1) port 8080 (#0)
> GET /index.html HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/7.68.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Server: Tiny/1.0
< Date: Sat, 14 Dec 2024 15:30:42 GMT
< Content-Type: text/html
< Content-Length: 1234
< Connection: close
<
<!DOCTYPE html>
<html>...
* Closing connection 0
# Server log output:
[2024-12-14 15:30:42] 127.0.0.1:54321 "GET /index.html HTTP/1.1" 200 1234 0.002ms
$ curl -v 'http://localhost:8080/cgi-bin/adder?15&213'
* Trying 127.0.0.1:8080...
* Connected to localhost (127.0.0.1) port 8080 (#0)
> GET /cgi-bin/adder?15&213 HTTP/1.1
> Host: localhost:8080
>
< HTTP/1.1 200 OK
< Server: Tiny/1.0
< Content-Type: text/html
< Transfer-Encoding: chunked
<
<html>
<head><title>Adder Result</title></head>
<body>
<h1>Welcome to adder.cgi</h1>
<p>The sum of 15 and 213 is 228</p>
</body>
</html>
* Closing connection 0
[2024-12-14 15:31:15] 127.0.0.1:54322 "GET /cgi-bin/adder?15&213 HTTP/1.1" 200 - 15.3ms (CGI)
$ curl -I http://localhost:8080/images/logo.png
HTTP/1.1 200 OK
Server: Tiny/1.0
Date: Sat, 14 Dec 2024 15:32:00 GMT
Content-Type: image/png
Content-Length: 45678
Last-Modified: Fri, 13 Dec 2024 10:00:00 GMT
Connection: close
$ curl http://localhost:8080/nonexistent.html
<!DOCTYPE html>
<html>
<head><title>404 Not Found</title></head>
<body>
<h1>404 Not Found</h1>
<p>The requested URL /nonexistent.html was not found on this server.</p>
<hr><address>Tiny/1.0 Server</address>
</body>
</html>
[2024-12-14 15:32:30] 127.0.0.1:54323 "GET /nonexistent.html HTTP/1.1" 404 312 0.001ms
$ curl -i -X POST -d "name=test" http://localhost:8080/form
HTTP/1.1 501 Not Implemented
Server: Tiny/1.0
Content-Type: text/html
Content-Length: 156
<!DOCTYPE html>
<html><body>
<h1>501 Method Not Implemented</h1>
<p>Tiny does not support the POST method.</p>
</body></html>
# Load testing with wrk:
$ wrk -t4 -c100 -d30s http://localhost:8080/index.html
Running 30s test @ http://localhost:8080/index.html
4 threads and 100 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 2.34ms 1.12ms 45.21ms 89.32%
Req/Sec 10.87k 1.23k 14.56k 72.15%
1,301,245 requests in 30.01s, 1.52GB read
Requests/sec: 43,363.21
Transfer/sec: 51.89MB
The Core Question You're Answering
"How does a web server actually work at the systems level, from TCP socket to HTTP response?"
Every web developer uses web servers daily, but few understand what happens between socket() and the HTML appearing in the browser. This project demystifies the entire stack: connection handling, protocol parsing, file serving, and process-based CGI.
Concepts You Must Understand First
- TCP socket programming (CS:APP 11.4) - socket, bind, listen, accept, read, write
- HTTP protocol basics (RFC 2616) - Request/response structure, headers, status codes
- Unix I/O (CS:APP 10.1-10.4) - File descriptors, open, read, write, close
- Robust I/O (CS:APP 10.5) - rio_readlineb for buffered line reading
- Process control (CS:APP 8.2-8.4) - fork, exec, wait for CGI
- MIME types - Content-Type mapping from file extensions
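The MIME-type mapping can be as simple as a hard-coded table keyed on the file extension. A sketch (mime_type is my name) that falls back to a safe default:

```c
#include <string.h>

/* Map a file extension to a Content-Type value.
 * Real servers often consult /etc/mime.types instead. */
const char *mime_type(const char *path) {
    static const struct { const char *ext, *type; } table[] = {
        { ".html", "text/html" },       { ".css", "text/css" },
        { ".js",   "text/javascript" }, { ".png", "image/png" },
        { ".jpg",  "image/jpeg" },      { ".gif", "image/gif" },
        { ".txt",  "text/plain" },
    };
    const char *dot = strrchr(path, '.');   /* last '.' in the path */
    if (dot)
        for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
            if (strcmp(dot, table[i].ext) == 0)
                return table[i].type;
    return "application/octet-stream";      /* unknown: treat as bytes */
}
```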
Questions to Guide Your Design
- How will you structure the main server loop? Single process? Fork per connection? Thread pool?
- How do you parse HTTP requests robustly? What if request is malformed? Too long?
- How do you prevent directory traversal attacks? GET /../../../etc/passwd?
- How do you determine MIME types? Hard-coded table? /etc/mime.types?
- How do you implement CGI? Which environment variables? How to pass the query string?
- How do you handle partial reads/writes? TCP doesn't guarantee message boundaries
- How do you handle persistent connections? Connection: keep-alive vs close?
- How do you handle signals? SIGPIPE when client disconnects, SIGCHLD from CGI
Thinking Exercise
Before writing any code, trace through this HTTP transaction:
// Client sends (bytes over TCP):
"GET /images/cat.jpg HTTP/1.1\r\n"
"Host: localhost:8080\r\n"
"User-Agent: Mozilla/5.0\r\n"
"Accept: image/jpeg,image/*\r\n"
"\r\n"
// Server must:
// 1. Parse request line: method="GET", uri="/images/cat.jpg", version="HTTP/1.1"
// 2. Parse headers into key-value pairs
// 3. Validate: method supported? URI safe? Version OK?
// 4. Map URI to filesystem: "./www/images/cat.jpg"
// 5. Check file exists and is readable
// 6. Determine Content-Type from extension: "image/jpeg"
// 7. Get file size with stat()
// 8. Send response:
"HTTP/1.1 200 OK\r\n"
"Server: Tiny/1.0\r\n"
"Content-Type: image/jpeg\r\n"
"Content-Length: 45678\r\n"
"\r\n"
// <45678 bytes of JPEG data>
Now trace a CGI request:
// Client sends:
"GET /cgi-bin/adder?15&213 HTTP/1.1\r\n"
"Host: localhost:8080\r\n"
"\r\n"
// Server must:
// 1. Parse URI: path="/cgi-bin/adder", query_string="15&213"
// 2. Detect CGI (path starts with /cgi-bin/)
// 3. fork() child process
// 4. In child:
// - dup2() client socket to STDOUT
// - setenv("QUERY_STRING", "15&213")
// - setenv("REQUEST_METHOD", "GET")
// - execve("./www/cgi-bin/adder", ...)
// 5. In parent: wait() for child, then continue
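The CGI steps in the trace map almost one-to-one onto code. Here is a sketch with error handling trimmed (serve_cgi is an illustrative name; a real server must also validate the CGI path and check fork/exec failures):

```c
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

extern char **environ;

/* Run one CGI program: fork, point the child's stdout at the client
 * socket, export the CGI variables, and exec the program. */
void serve_cgi(int client_fd, const char *prog, const char *query) {
    pid_t pid = fork();
    if (pid == 0) {                        /* child */
        setenv("QUERY_STRING", query, 1);
        setenv("REQUEST_METHOD", "GET", 1);
        dup2(client_fd, STDOUT_FILENO);    /* CGI output goes to the client */
        char *argv[] = { (char *)prog, NULL };
        execve(prog, argv, environ);
        _exit(127);                        /* reached only if exec failed */
    }
    waitpid(pid, NULL, 0);                 /* parent reaps the child */
}
```

Note the child writes the Content-Type header itself; the server only emits the status line before handing the socket over, exactly as the trace above describes.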
The Interview Questions Theyโll Ask
- โExplain the socket API calls for a TCP serverโ
socket()- create endpointbind()- attach to portlisten()- mark as passive (accepting connections)accept()- block until client connects, return new fdread()/write()- exchange dataclose()- terminate connection
- "How do you handle slow clients?"
- Problem: read() blocks if client sends slowly
- Solutions: non-blocking I/O, select/poll/epoll, timeouts
- For this project: accept slowness (educational focus)
- Production: event-driven with timeout
- "What's a directory traversal attack and how do you prevent it?"
- Attack: GET /../../../etc/passwd
- Naive: prepend doc root, but .. escapes it
- Fix: resolve path with realpath(), check prefix matches doc root
- Also: reject paths containing .. directly
- "How does CGI work?"
- Server forks child, sets environment variables
- Redirects childโs stdout to client socket
- Executes CGI program
- Program writes HTTP response to stdout (goes to client)
- Server waits for child to finish
- "What happens if the client disconnects mid-transfer?"
- write() to closed socket generates SIGPIPE
- Default: terminate process
- Fix: signal(SIGPIPE, SIG_IGN) and check write() return value
- Or: use send() with MSG_NOSIGNAL flag
- "How would you add HTTPS support?"
- Use OpenSSL or similar TLS library
- SSL_accept() instead of just accept()
- SSL_read()/SSL_write() instead of read()/write()
- Handle certificate loading, cipher selection
Hints in Layers
Layer 1 - Socket Setup:
#include <sys/socket.h>
#include <netinet/in.h>
int open_listenfd(int port) {
int listenfd = socket(AF_INET, SOCK_STREAM, 0);
if (listenfd < 0) return -1;
// Allow port reuse (avoid "Address already in use")
int optval = 1;
setsockopt(listenfd, SOL_SOCKET, SO_REUSEADDR, &optval, sizeof(optval));
struct sockaddr_in addr = {
.sin_family = AF_INET,
.sin_port = htons(port),
.sin_addr.s_addr = htonl(INADDR_ANY)
};
if (bind(listenfd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
close(listenfd);
return -1;
}
if (listen(listenfd, 1024) < 0) {
close(listenfd);
return -1;
}
return listenfd;
}
Layer 2 - Robust I/O (from CS:APP):
typedef struct {
int fd;
int cnt; // Unread bytes in buffer
char *bufptr; // Next unread byte
char buf[8192];
} rio_t;
void rio_readinitb(rio_t *rp, int fd) {
rp->fd = fd;
rp->cnt = 0;
rp->bufptr = rp->buf;
}
// Read a line (handles partial reads)
ssize_t rio_readlineb(rio_t *rp, void *usrbuf, size_t maxlen) {
char *bufp = usrbuf;
for (int n = 1; n < maxlen; n++) {
char c;
int rc = rio_read(rp, &c, 1);
if (rc == 1) {
*bufp++ = c;
if (c == '\n') break;
} else if (rc == 0) {
if (n == 1) return 0; // EOF, no data
break;
} else {
return -1; // Error
}
}
*bufp = 0;
return bufp - (char *)usrbuf;
}
Layer 3 - HTTP Request Parsing:
typedef struct {
char method[16];
char uri[2048];
char version[16];
char headers[32][2][256]; // Up to 32 headers
int header_count;
} HttpRequest;
int parse_request(rio_t *rp, HttpRequest *req) {
char line[2048];
// Read request line: "GET /index.html HTTP/1.1"
rio_readlineb(rp, line, sizeof(line));
if (sscanf(line, "%15s %2047s %15s", req->method, req->uri, req->version) != 3)
return -1;
// Read headers until blank line
req->header_count = 0;
while (rio_readlineb(rp, line, sizeof(line)) > 0) {
if (strcmp(line, "\r\n") == 0 || strcmp(line, "\n") == 0)
break;
char *colon = strchr(line, ':');
if (colon && req->header_count < 32) {
*colon = '\0';
strncpy(req->headers[req->header_count][0], line, 255);
req->headers[req->header_count][0][255] = '\0';
// Skip ':' plus any spaces, then trim the newline
char *value = colon + 1;
while (*value == ' ' || *value == '\t') value++;
value[strcspn(value, "\r\n")] = '\0';
strncpy(req->headers[req->header_count][1], value, 255);
req->headers[req->header_count][1][255] = '\0';
req->header_count++;
}
}
return 0;
}
Layer 4 - Static File Serving:
const char *get_mime_type(const char *filename) {
const char *ext = strrchr(filename, '.');
if (!ext) return "application/octet-stream";
if (strcmp(ext, ".html") == 0) return "text/html";
if (strcmp(ext, ".css") == 0) return "text/css";
if (strcmp(ext, ".js") == 0) return "application/javascript";
if (strcmp(ext, ".jpg") == 0) return "image/jpeg";
if (strcmp(ext, ".png") == 0) return "image/png";
if (strcmp(ext, ".gif") == 0) return "image/gif";
return "application/octet-stream";
}
void serve_static(int fd, const char *filename) {
struct stat sbuf;
if (stat(filename, &sbuf) < 0 || !S_ISREG(sbuf.st_mode)) {
send_error(fd, 404, "Not Found");
return;
}
int srcfd = open(filename, O_RDONLY);
if (srcfd < 0) {
send_error(fd, 403, "Forbidden");
return;
}
char *srcp = mmap(0, sbuf.st_size, PROT_READ, MAP_PRIVATE, srcfd, 0);
close(srcfd);
if (srcp == MAP_FAILED) {
send_error(fd, 500, "Internal Server Error");
return;
}
// Send headers
char header[512];
snprintf(header, sizeof(header),
"HTTP/1.1 200 OK\r\n"
"Server: Tiny/1.0\r\n"
"Content-Type: %s\r\n"
"Content-Length: %ld\r\n"
"\r\n",
get_mime_type(filename), (long)sbuf.st_size);
write(fd, header, strlen(header));
// Send body
write(fd, srcp, sbuf.st_size);
munmap(srcp, sbuf.st_size);
}
Layer 5 - CGI Handler:
void serve_cgi(int fd, const char *program, const char *query_string) {
pid_t pid = fork();
if (pid < 0) { send_error(fd, 500, "Internal Server Error"); return; }
if (pid == 0) { // Child
// Set CGI environment variables
setenv("QUERY_STRING", query_string ? query_string : "", 1);
setenv("REQUEST_METHOD", "GET", 1);
setenv("GATEWAY_INTERFACE", "CGI/1.1", 1);
// Redirect stdout to client socket
dup2(fd, STDOUT_FILENO);
close(fd);
// Execute CGI program
execl(program, program, NULL);
exit(1); // If exec fails
} else { // Parent
int status;
waitpid(pid, &status, 0);
}
}
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Unix I/O | CS:APP | Ch. 10 |
| Network Programming | CS:APP | Ch. 11 |
| Process Control | CS:APP | Ch. 8 |
| Sockets Deep Dive | Unix Network Programming Vol 1 | Ch. 1-8 |
| HTTP Protocol | TCP/IP Illustrated Vol 1 | Ch. 14 |
| Advanced I/O | TLPI | Ch. 63 |
Common Pitfalls & Debugging
Problem 1: "Address already in use" on restart
// Fix: Set SO_REUSEADDR before bind()
int optval = 1;
setsockopt(listenfd, SOL_SOCKET, SO_REUSEADDR, &optval, sizeof(optval));
Problem 2: Server crashes on client disconnect
// Fix: Ignore SIGPIPE
signal(SIGPIPE, SIG_IGN);
// Then check write() return value
if (write(fd, data, len) < 0) {
// Client disconnected, clean up
}
Problem 3: Zombie CGI processes
// Fix: Handle SIGCHLD
void sigchld_handler(int sig) {
while (waitpid(-1, NULL, WNOHANG) > 0);
}
signal(SIGCHLD, sigchld_handler);
Problem 4: Directory traversal vulnerability
// WRONG:
char path[256];
snprintf(path, sizeof(path), "./www%s", uri); // uri="/../../../etc/passwd"
// FIX: realpath() returns an ABSOLUTE path, so compare against the
// canonicalized doc root, not the literal "./www"
char docroot[PATH_MAX], realpath_buf[PATH_MAX];
realpath("./www", docroot);
char *real = realpath(path, realpath_buf);
if (!real || strncmp(real, docroot, strlen(docroot)) != 0) {
send_error(fd, 403, "Forbidden");
return;
}
Project 21: Thread Pool Implementation
| Attribute | Value |
|---|---|
| Language | C (alt: Rust, Go, Java) |
| Difficulty | Advanced |
| Time | 1โ2 weeks |
| Chapters | 12 |
What you'll build: A reusable thread pool library with a bounded work queue, condition variables, and clean shutdown semantics.
Why it matters: The producer-consumer pattern is everywhere; this project forces you to get synchronization right.
Core challenges:
- Implementing correct producer-consumer queue with blocking
- Handling shutdown safely (no task loss, no deadlocks)
- Avoiding thundering herd and backpressure issues
Real World Outcome:
$ ./threadpool --workers=4 --demo
================================================================================
THREAD POOL DEMONSTRATION
================================================================================
[INIT] Creating pool with 4 worker threads
[INIT] Work queue capacity: 64 tasks
[WORKER-0] Started, waiting for work...
[WORKER-1] Started, waiting for work...
[WORKER-2] Started, waiting for work...
[WORKER-3] Started, waiting for work...
[SUBMIT] Task 1: compute_fibonacci(40)
[SUBMIT] Task 2: compute_fibonacci(35)
[SUBMIT] Task 3: compress_file("data.bin")
[SUBMIT] Task 4: compute_factorial(20)
[SUBMIT] Task 5: hash_password("secret")
[WORKER-0] Executing task 1: compute_fibonacci(40)
[WORKER-1] Executing task 2: compute_fibonacci(35)
[WORKER-2] Executing task 3: compress_file("data.bin")
[WORKER-3] Executing task 4: compute_factorial(20)
[WORKER-1] Completed task 2 in 89ms (result: 9227465)
[WORKER-1] Executing task 5: hash_password("secret")
[WORKER-3] Completed task 4 in 12ms (result: 2432902008176640000)
[WORKER-1] Completed task 5 in 156ms
[WORKER-2] Completed task 3 in 423ms (compressed 1.2MB -> 340KB)
[WORKER-0] Completed task 1 in 1247ms (result: 102334155)
================================================================================
POOL STATISTICS
================================================================================
Total tasks submitted: 5
Total tasks completed: 5
Average wait time: 34ms
Average execution time: 385ms
Queue high watermark: 4/64
$ ./threadpool --workers=2 --stress-test --tasks=10000
================================================================================
STRESS TEST MODE
================================================================================
[CONFIG] Workers: 2, Tasks: 10000, Queue size: 256
[PROGRESS] 1000/10000 tasks (10.0%) - 847 tasks/sec
[PROGRESS] 2000/10000 tasks (20.0%) - 892 tasks/sec
[PROGRESS] 5000/10000 tasks (50.0%) - 921 tasks/sec
[PROGRESS] 10000/10000 tasks (100.0%) - 934 tasks/sec
[RESULT] All 10000 tasks completed successfully
[RESULT] Total time: 10.71s
[RESULT] Throughput: 934 tasks/sec
[RESULT] No deadlocks detected
[RESULT] No tasks lost during shutdown
$ ./threadpool --workers=4 --graceful-shutdown-test
================================================================================
GRACEFUL SHUTDOWN TEST
================================================================================
[TEST] Submitting 100 long-running tasks...
[TEST] Requesting shutdown while 87 tasks pending...
[POOL] Shutdown requested - completing in-flight tasks
[POOL] Worker-0 finishing current task, then exiting
[POOL] Worker-1 finishing current task, then exiting
[POOL] Worker-2 finishing current task, then exiting
[POOL] Worker-3 finishing current task, then exiting
[POOL] Draining remaining 83 queued tasks...
[POOL] All workers joined
[POOL] Pool destroyed cleanly
[RESULT] PASS - All 100 tasks completed
[RESULT] PASS - No memory leaks (valgrind clean)
[RESULT] PASS - Shutdown completed in 2.3s
The Core Question You're Answering: How do you safely coordinate multiple threads that share a work queue, ensuring no race conditions, no lost work, and clean shutdown?
Concepts You Must Understand First:
- Mutex fundamentals (CS:APP 12.5.4) - Why mutual exclusion is necessary and how pthread_mutex_t provides atomicity
- Condition variables (CS:APP 12.5.5) - The "wait and signal" pattern for efficient blocking without busy-waiting
- Producer-consumer pattern (OSTEP Ch. 30) - The classic bounded buffer problem and its solution
- Thread lifecycle (TLPI Ch. 29) - Creation, termination, joining, and detaching threads
- Memory visibility (CS:APP 12.5.1) - Why compiler reordering and CPU caches can cause subtle bugs without proper synchronization
- Deadlock prevention (OSTEP Ch. 32) - The four conditions for deadlock and how to avoid them
Questions to Guide Your Design:
- When a worker thread finds an empty queue, should it spin-wait, sleep, or use a condition variable? What are the tradeoffs?
- What happens if a producer tries to submit work when the queue is full? Block, drop, or grow the queue?
- How do you signal shutdown to workers? A special "poison pill" task, a shared flag, or both?
- Should workers check the shutdown flag before or after dequeuing a task? What's the difference?
- If you use a circular buffer for the queue, how do you handle the wraparound correctly with concurrent access?
- How do you avoid the "thundering herd" problem when multiple workers wake up for one task?
- What happens if a worker thread crashes? Should the pool detect this and spawn a replacement?
- How do you make task submission return quickly even if all workers are busy?
Thinking Exercise:
Before writing code, trace through this scenario by hand:
// Initial state: queue is EMPTY, pool has 2 workers (both waiting)
// Thread A (producer):
pthread_mutex_lock(&pool->lock);
// A acquires lock
enqueue(&pool->queue, task1);
// A signals condition variable
pthread_cond_signal(&pool->not_empty);
pthread_mutex_unlock(&pool->lock);
// Meanwhile, Thread B (producer) tries to submit:
pthread_mutex_lock(&pool->lock); // <-- What happens here?
// Worker-0 (waiting on cond var):
// wakes up from pthread_cond_wait()
// <-- What must Worker-0 do before accessing the queue?
// Worker-1 (also waiting):
// <-- Should Worker-1 wake up? What does it do?
Draw a timeline showing which thread holds the mutex at each moment. What if you used pthread_cond_broadcast() instead of pthread_cond_signal()?
The Interview Questions They'll Ask:
- "What's the difference between a mutex and a semaphore? When would you use each?"
- Expected answer: Mutex is for mutual exclusion (one thread at a time), semaphore is for counting resources. Use mutex for protecting shared data, semaphore for limiting concurrent access to N resources. A binary semaphore is similar to a mutex but has different ownership semantics (any thread can signal, only the owner should unlock a mutex).
- "Explain why this thread pool implementation might deadlock." (They'll show buggy code)
- Expected answer: Look for: lock ordering violations, missing unlock on error paths, waiting on a condition while holding multiple locks, or joining a thread that's waiting on a lock you hold.
- "How would you implement work stealing between thread pool workers?"
- Expected answer: Each worker has its own deque. Workers push/pop from their own deque (LIFO for cache locality). When empty, steal from the tail of another worker's deque. Requires lock-free or fine-grained locking for the stealing operation.
- "What's the spurious wakeup problem and how do you handle it?"
- Expected answer: Condition variable wait can return even when no signal was sent. Always wrap pthread_cond_wait in a while loop that rechecks the condition, not an if statement.
- "How do you choose the optimal number of threads for a thread pool?"
- Expected answer: For CPU-bound work: number of cores. For I/O-bound work: higher (2x-10x cores) depending on I/O wait ratio. Little's Law can help: N = arrival_rate * average_service_time. In practice, benchmark and tune.
- "What's the ABA problem and can it affect this implementation?"
- Expected answer: ABA occurs in lock-free structures when a value changes A->B->A between read and CAS. With mutex-protected queues, ABA isn't an issue. But if you tried to make a lock-free queue, you'd need hazard pointers or epoch-based reclamation.
Hints in Layers:
Layer 1 - Core Data Structures:
typedef struct task {
void (*function)(void *arg);
void *arg;
struct task *next;
} task_t;
typedef struct threadpool {
pthread_mutex_t lock;
pthread_cond_t not_empty; // Signal when queue becomes non-empty
pthread_cond_t not_full; // Signal when queue has space (for bounded)
task_t *queue_head;
task_t *queue_tail;
int queue_size;
int queue_capacity;
pthread_t *workers;
int worker_count;
int shutdown; // 0 = running, 1 = graceful, 2 = immediate
} threadpool_t;
Layer 2 - Worker Thread Loop Pattern:
void *worker_thread(void *arg) {
threadpool_t *pool = (threadpool_t *)arg;
while (1) {
pthread_mutex_lock(&pool->lock);
// Wait while queue is empty AND not shutting down
while (pool->queue_size == 0 && !pool->shutdown) {
pthread_cond_wait(&pool->not_empty, &pool->lock);
}
// Check shutdown AFTER waking
if (pool->shutdown && pool->queue_size == 0) {
pthread_mutex_unlock(&pool->lock);
break;
}
task_t *task = dequeue(pool); // Remove from queue
pthread_mutex_unlock(&pool->lock);
// Execute OUTSIDE the lock!
task->function(task->arg);
free(task);
}
return NULL;
}
Layer 3 - Submit with Backpressure:
int threadpool_submit(threadpool_t *pool, void (*fn)(void*), void *arg) {
task_t *task = malloc(sizeof(task_t));
if (!task) return -1; // Allocation failure
task->function = fn;
task->arg = arg;
task->next = NULL;
pthread_mutex_lock(&pool->lock);
// Block if queue is full (backpressure)
while (pool->queue_size >= pool->queue_capacity && !pool->shutdown) {
pthread_cond_wait(&pool->not_full, &pool->lock);
}
if (pool->shutdown) {
pthread_mutex_unlock(&pool->lock);
free(task);
return -1; // Rejected
}
enqueue(pool, task);
pthread_cond_signal(&pool->not_empty); // Wake ONE worker
pthread_mutex_unlock(&pool->lock);
return 0;
}
Layer 4 - Graceful Shutdown:
void threadpool_shutdown(threadpool_t *pool, int graceful) {
pthread_mutex_lock(&pool->lock);
pool->shutdown = graceful ? 1 : 2;
pthread_cond_broadcast(&pool->not_empty); // Wake ALL workers
pthread_cond_broadcast(&pool->not_full); // Unblock any blocked submitters
pthread_mutex_unlock(&pool->lock);
// Join all workers
for (int i = 0; i < pool->worker_count; i++) {
pthread_join(pool->workers[i], NULL);
}
// If immediate shutdown, drain remaining tasks
if (!graceful) {
while (pool->queue_size > 0) {
task_t *t = dequeue(pool);
free(t); // Or call a cancellation callback
}
}
}
Layer 5 - Testing for Correctness:
// Test: No lost tasks under concurrent submit/shutdown
void stress_test() {
atomic_int completed = 0;
threadpool_t *pool = threadpool_create(4, 64);
// Submit from multiple producer threads simultaneously
pthread_t producers[8];
for (int i = 0; i < 8; i++) {
pthread_create(&producers[i], NULL, submit_1000_tasks, &completed);
}
// Wait a bit then request shutdown
usleep(100000);
threadpool_shutdown(pool, 1); // Graceful
for (int i = 0; i < 8; i++) {
pthread_join(producers[i], NULL);
}
// Verify: nothing was lost. Graceful shutdown may reject late submits
// (submit returns -1), so a production test must count acceptances;
// here we assume all 8000 tasks were accepted before shutdown.
assert(completed == 8000);
}
Books That Will Help:
| Book | Chapters | What You'll Learn |
|---|---|---|
| CS:APP 3e | 12.4-12.5 | Threads, mutexes, condition variables, thread safety |
| OSTEP | 26-32 | Locks, condition variables, semaphores, common concurrency bugs |
| TLPI | 29-33 | POSIX threads, mutexes, conditions, thread cancellation |
| C++ Concurrency in Action | 2-4 | Modern patterns (applicable to C with adaptation) |
| APUE 3e | 11-12 | Threads, thread control, thread synchronization |
Common Pitfalls & Debugging:
- Bug: Forgetting to recheck condition after waking from pthread_cond_wait
// WRONG - spurious wakeup breaks this
if (pool->queue_size == 0)
pthread_cond_wait(&pool->not_empty, &pool->lock);
task = dequeue(); // Might crash on empty queue!
// RIGHT - while loop handles spurious wakeups
while (pool->queue_size == 0 && !pool->shutdown)
pthread_cond_wait(&pool->not_empty, &pool->lock);
- Bug: Executing task while holding the lock
// WRONG - blocks all other workers during task execution!
pthread_mutex_lock(&pool->lock);
task = dequeue(pool);
task->function(task->arg); // Could take seconds!
pthread_mutex_unlock(&pool->lock);
// RIGHT - release lock before executing
pthread_mutex_lock(&pool->lock);
task = dequeue(pool);
pthread_mutex_unlock(&pool->lock);
task->function(task->arg); // Other workers can proceed
- Bug: Race condition during shutdown
// WRONG - worker might miss the shutdown signal
if (pool->shutdown) break; // Checked without lock!
pthread_cond_wait(...); // Might wait forever
// RIGHT - check with lock held, use broadcast for shutdown
pthread_mutex_lock(&pool->lock);
while (queue_empty && !pool->shutdown) {
pthread_cond_wait(...);
}
if (pool->shutdown && queue_empty) {
pthread_mutex_unlock(&pool->lock);
break;
}
- Bug: Memory leak on rejected tasks during shutdown
// WRONG - caller doesn't know task was rejected
if (pool->shutdown) {
pthread_mutex_unlock(&pool->lock);
return; // task memory leaked!
}
// RIGHT - return error code, let caller handle cleanup
if (pool->shutdown) {
pthread_mutex_unlock(&pool->lock);
free(task);
return -1; // ESHUTDOWN
}
Project 22: Signal-Safe Printf
| Attribute | Value |
|---|---|
| Language | C (alt: Rust) |
| Difficulty | Advanced |
| Time | Weekend |
| Chapters | 8, 12 |
What you'll build: A tiny printf-like facility (sio) that is safe to call from signal handlers using only async-signal-safe operations.
Why it matters: Forces you to understand why printf, malloc, and most libc functions are unsafe in handlers.
Core challenges:
- Avoiding all non-async-signal-safe functions
- Implementing integer/string formatting with only write(2)
- Testing under high-frequency signal delivery
Real World Outcome:
$ ./sio_demo
================================================================================
SIGNAL-SAFE I/O (SIO) DEMONSTRATION
================================================================================
[TEST 1] Basic output from main()
sio_puts: Hello from signal-safe I/O!
sio_putl: The answer is 42
sio_puthex: Address = 0x7fff5fbff8c0
[TEST 2] Signal handler output (SIGUSR1)
$ kill -USR1 $(pgrep sio_demo)
[HANDLER] Caught signal 10 (SIGUSR1)
[HANDLER] Handler invoked 1 time(s)
[HANDLER] Current errno preserved: 0
[TEST 3] Rapid signal delivery stress test
Sending 10000 SIGALRM signals at 10000 Hz...
[HANDLER] Signal count: 1000
[HANDLER] Signal count: 2000
[HANDLER] Signal count: 5000
[HANDLER] Signal count: 10000
[RESULT] All 10000 signals handled
[RESULT] No crashes, no corruption, no deadlocks
[RESULT] Printf equivalent calls in handler: 0 (verified safe)
$ ./sio_demo --compare-with-printf
================================================================================
SAFETY COMPARISON: SIO vs PRINTF
================================================================================
[SETUP] Installing SIGALRM handler that prints a message
[SETUP] Handler will fire every 100 microseconds
[TEST] Main thread calling malloc() in a loop...
--- Using printf() in handler (UNSAFE) ---
[MAIN] Iteration 1000...
[MAIN] Iteration 2000...
[DEADLOCK DETECTED] Program hung after 2847 iterations
[CAUSE] printf() called from handler while main held stdio lock
--- Using sio_puts() in handler (SAFE) ---
[MAIN] Iteration 1000...
[HANDLER] tick 50
[MAIN] Iteration 2000...
[HANDLER] tick 100
[MAIN] Iteration 10000...
[HANDLER] tick 500
[RESULT] Completed 10000 iterations with 500 handler invocations
[RESULT] No deadlocks with async-signal-safe sio functions
$ ./sio_demo --format-test
================================================================================
FORMAT SPECIFIER TESTS
================================================================================
Testing sio_printf() format specifiers:
sio_printf("Integer: %d\n", -42) -> Integer: -42
sio_printf("Unsigned: %u\n", 42) -> Unsigned: 42
sio_printf("Hex: 0x%x\n", 255) -> Hex: 0xff
sio_printf("Long: %ld\n", 1234567890123) -> Long: 1234567890123
sio_printf("String: %s\n", "hello") -> String: hello
sio_printf("Pointer: %p\n", ptr) -> Pointer: 0x7fff5fbff8c0
sio_printf("Percent: %%\n") -> Percent: %
sio_printf("Width: %10d\n", 42) -> Width: 42
sio_printf("Multiple: %s=%d\n", "x", 5) -> Multiple: x=5
[RESULT] All format specifiers working correctly
[RESULT] No malloc, no stdio, only write(2) syscalls
The Core Question You're Answering: Why can't you call printf() from a signal handler, and how do you build output functions that are safe to call from any context?
Concepts You Must Understand First:
- Async-signal-safety (CS:APP 8.5.5) - Which functions can be safely called from signal handlers and why most cannot
- Reentrancy (TLPI 21.1.2) - What happens when a function is interrupted and called again before completing
- Signal delivery semantics (CS:APP 8.5) - How signals interrupt execution at arbitrary points
- The write(2) syscall (TLPI 4.3) - The only safe way to output from a signal handler
- Errno preservation (TLPI 21.1.3) - Why handlers must save and restore errno
- Lock-free programming basics (TLPI 21.1.2) - Why mutexes in handlers cause deadlocks
Questions to Guide Your Design:
- Why is printf() not async-signal-safe? What specific resources does it use that cause problems?
- How do you convert an integer to a string without calling sprintf(), snprintf(), or any memory allocation?
- What buffer should you use for formatting? Stack-allocated? Static? What are the tradeoffs?
- How do you handle negative numbers in your integer-to-string conversion?
- Should sio functions buffer output or write immediately? What does buffering require that makes it unsafe?
- How do you implement hexadecimal output without using lookup tables that might not be in cache?
- What happens if write(2) is interrupted by another signal? How do you handle partial writes?
- How do you test that your implementation is truly async-signal-safe?
Thinking Exercise:
Before coding, analyze why this handler deadlocks:
pthread_mutex_t stdio_lock = PTHREAD_MUTEX_INITIALIZER;
char buffer[1024];
void safe_looking_print(const char *msg) {
pthread_mutex_lock(&stdio_lock);
strcpy(buffer, msg);
printf("%s\n", buffer);
pthread_mutex_unlock(&stdio_lock);
}
void handler(int sig) {
safe_looking_print("Signal received!"); // <-- Why does this deadlock?
}
int main() {
signal(SIGINT, handler);
while (1) {
safe_looking_print("Main loop iteration");
}
}
Trace through: What happens if SIGINT arrives while main() is between pthread_mutex_lock and pthread_mutex_unlock?
Now consider: Would making the mutex recursive solve the problem? (Hint: What about printf's internal locks?)
The Interview Questions They'll Ask:
- "What makes a function async-signal-safe? Give examples of safe and unsafe functions."
- Expected answer: A function is async-signal-safe if it can be safely called from a signal handler, even if the main program was interrupted in the middle of the same function. Safe: write(), _exit(), signal(). Unsafe: printf(), malloc(), any function using locks or global state. The key issue is reentrancy and internal locks.
- "Why is malloc() not async-signal-safe?"
- Expected answer: malloc() uses internal locks to protect the heap data structures. If a signal interrupts malloc() while it holds the lock, and the handler calls malloc(), you get deadlock. Also, malloc() may be in the middle of updating heap metadata, leaving it in an inconsistent state.
- "How would you implement a signal handler that needs to log messages?"
- Expected answer: Use only write(2) for output. Pre-format simple messages as string constants. For dynamic data, implement integer-to-string conversion without malloc. Consider using a pipe or signal-safe queue to defer complex logging to the main thread.
- "Explain the errno problem in signal handlers and how to solve it."
- Expected answer: Many async-signal-safe functions (like write()) can set errno. If the handler modifies errno and the main code was about to check errno from its own syscall, the result is corrupted. Solution: Save errno at handler entry, restore before return.
- "What's the difference between reentrant and thread-safe?"
- Expected answer: Thread-safe means safe when called concurrently from multiple threads (usually via locks). Reentrant means safe when interrupted and re-invoked before completing (no global/static state, no locks). All reentrant functions are thread-safe, but not vice versa. Async-signal-safe requires reentrancy.
- "How would you implement a printf-like format string parser that's async-signal-safe?"
- Expected answer: Parse the format string character by character. For each specifier, convert the value to a string using stack-local buffers and manual conversion (repeated division for integers). Accumulate output in a stack buffer, then call write() once. No dynamic allocation, no stdio.
Hints in Layers:
Layer 1 - Core Output Primitive:
// The ONLY function we can use for output in a signal handler
ssize_t sio_write(const char *s, size_t n) {
size_t remaining = n;
const char *p = s;
while (remaining > 0) {
ssize_t written = write(STDOUT_FILENO, p, remaining);
if (written < 0) {
if (errno == EINTR) continue; // Interrupted, retry
return -1; // Real error
}
remaining -= written;
p += written;
}
return n;
}
// Wrapper for null-terminated strings
ssize_t sio_puts(const char *s) {
return sio_write(s, strlen(s));
}
Layer 2 - Integer to String (No malloc!):
// Convert integer to string in caller-provided buffer
// Returns pointer to start of number within buffer
char *sio_itoa(long value, char *buf, size_t bufsize) {
char *p = buf + bufsize - 1;
*p = '\0';
int negative = (value < 0);
// Negate in unsigned arithmetic so LONG_MIN doesn't overflow
unsigned long uval = negative ? -(unsigned long)value : (unsigned long)value;
// Build string backwards
do {
*--p = '0' + (uval % 10);
uval /= 10;
} while (uval > 0 && p > buf);
if (negative && p > buf) {
*--p = '-';
}
return p; // Start of the number string
}
// Output a long integer
ssize_t sio_putl(long value) {
char buf[32]; // Stack allocated!
char *s = sio_itoa(value, buf, sizeof(buf));
return sio_puts(s);
}
Layer 3 - Hexadecimal Output:
ssize_t sio_puthex(unsigned long value) {
char buf[20];
char *p = buf + sizeof(buf) - 1;
*p = '\0';
if (value == 0) {
*--p = '0';
} else {
while (value > 0 && p > buf) {
int digit = value & 0xF;
*--p = (digit < 10) ? ('0' + digit) : ('a' + digit - 10);
value >>= 4;
}
}
// Add "0x" prefix
*--p = 'x';
*--p = '0';
return sio_puts(p);
}
Layer 4 - Signal Handler Pattern:
volatile sig_atomic_t signal_count = 0;
void handler(int sig) {
// CRITICAL: Save and restore errno
int saved_errno = errno;
signal_count++; // sig_atomic_t is safe to modify
// Safe output
sio_puts("[HANDLER] Signal ");
sio_putl(sig);
sio_puts(" received (count: ");
sio_putl(signal_count);
sio_puts(")\n");
errno = saved_errno; // Restore before return
}
Layer 5 - Simple Format String Parser:
// Minimal printf subset: %s, %d, %ld, %x, %p, %%
void sio_printf(const char *fmt, ...) {
va_list ap;
va_start(ap, fmt);
char buf[32];
const char *p = fmt;
while (*p) {
if (*p != '%') {
sio_write(p, 1);
p++;
continue;
}
p++; // Skip '%'
switch (*p) {
case 'd': {
int val = va_arg(ap, int);
sio_puts(sio_itoa(val, buf, sizeof(buf)));
break;
}
case 'l':
p++;
if (*p == 'd') {
long val = va_arg(ap, long);
sio_puts(sio_itoa(val, buf, sizeof(buf)));
}
break;
case 's': {
char *s = va_arg(ap, char*);
sio_puts(s ? s : "(null)");
break;
}
case 'x': {
unsigned val = va_arg(ap, unsigned);
sio_puthex(val);
break;
}
case 'p': {
void *ptr = va_arg(ap, void*);
sio_puthex((unsigned long)ptr);
break;
}
case '%':
sio_write("%", 1);
break;
}
p++;
}
va_end(ap);
}
Books That Will Help:
| Book | Chapters | What You'll Learn |
|---|---|---|
| CS:APP 3e | 8.5 | Signal concepts, async-signal-safety, handler design |
| TLPI | 21-22 | Signals, signal handlers, async-signal-safe functions (comprehensive list) |
| APUE 3e | 10 | Signals (POSIX perspective) |
| OSTEP | Ch. 5 (Process API) | Understanding how signals fit with process model |
| Secure Coding in C/C++ | Ch. 5 | Signal handling vulnerabilities |
Common Pitfalls & Debugging:
- Bug: Forgetting to save/restore errno
void handler(int sig) {
// WRONG - corrupts errno if main code is checking it
write(STDOUT_FILENO, "signal\n", 7); // write() might set errno
}
// RIGHT
void handler(int sig) {
int saved_errno = errno;
write(STDOUT_FILENO, "signal\n", 7);
errno = saved_errno;
}
- Bug: Using sprintf() "because it doesn't malloc"
// WRONG - sprintf uses stdio buffers, internal locks
void handler(int sig) {
char buf[64];
sprintf(buf, "Signal %d\n", sig); // NOT async-signal-safe!
write(STDOUT_FILENO, buf, strlen(buf));
}
// RIGHT - manual conversion
void handler(int sig) {
char buf[32];
char *p = sio_itoa(sig, buf, sizeof(buf));
sio_puts("Signal ");
sio_puts(p);
sio_puts("\n");
}
- Bug: Static buffers shared between handler and main
// WRONG - handler might corrupt buffer while main is using it
static char shared_buffer[256];
void handler(int sig) {
strcpy(shared_buffer, "interrupted!"); // Race condition!
}
// RIGHT - use stack-local buffers in handler
void handler(int sig) {
char local_buf[256]; // Each handler invocation gets its own
// ...
}
- Bug: Ignoring partial writes
// WRONG - write() might not write everything
void handler(int sig) {
char msg[] = "Very long message...";
write(STDOUT_FILENO, msg, sizeof(msg)); // Might only write part!
}
// RIGHT - loop until all bytes written
void sio_write_all(const char *buf, size_t n) {
while (n > 0) {
ssize_t written = write(STDOUT_FILENO, buf, n);
if (written <= 0) {
if (errno == EINTR) continue;
return; // Error
}
buf += written;
n -= written;
}
}
Project 23: Performance Profiler
| Attribute | Value |
|---|---|
| Language | C (alt: C++, Rust) |
| Difficulty | Advanced |
| Time | 1โ2 weeks |
| Chapters | 5, 8, 3 |
What you'll build: A sampling profiler that periodically interrupts a program, records where it is, and reports the hottest functions.
Why it matters: Understand what profilers can and can't tell you: bias, sampling error, and Heisenberg effects.
Core challenges:
- Implementing timer-based sampling (SIGPROF/ITIMER_PROF)
- Capturing instruction pointers and aggregating into reports
- Symbolizing addresses back to function names
Real World Outcome:
$ ./profiler --sample-rate=1000 -- ./target_program
================================================================================
SAMPLING PROFILER
================================================================================
[CONFIG] Sample rate: 1000 Hz (1ms interval)
[CONFIG] Using ITIMER_PROF (CPU time only)
[START] Profiling ./target_program (PID: 12847)
[PROGRESS] 1000 samples collected...
[PROGRESS] 5000 samples collected...
[PROGRESS] 10000 samples collected...
[END] Target exited with status 0
[STATS] Total samples: 14,293
[STATS] Unique instruction pointers: 847
[STATS] Profiling overhead: ~2.3%
================================================================================
FLAT PROFILE (Top 20 Functions)
================================================================================
%time samples function source:line
------- -------- -------------------------------- ----------------------
23.4% 3,345 matrix_multiply matrix.c:142
18.7% 2,673 vector_dot_product linalg.c:89
12.1% 1,730 quicksort_partition sort.c:67
8.9% 1,272 hash_table_lookup hash.c:234
6.2% 886 memcpy@plt (libc)
4.8% 686 strcmp@plt (libc)
3.7% 529 parse_json_object json.c:456
2.9% 414 allocate_buffer buffer.c:78
2.4% 343 compute_checksum crypto.c:123
2.1% 300 read_file_chunk io.c:89
1.8% 257 (unknown) 0x7f3a2b4c5d6e
...
13.0% 1,858 (other - 837 functions)
================================================================================
CALL GRAPH PROFILE
================================================================================
|--- vector_dot_product (18.7%)
matrix_multiply (23.4%) -|
|--- memcpy@plt (2.1% attributed)
|--- quicksort_partition (12.1%)
process_data (32.1%) -----|--- hash_table_lookup (8.9%)
|--- parse_json_object (3.7%)
$ ./profiler --flame-graph -- ./target_program > profile.svg
================================================================================
FLAME GRAPH GENERATION
================================================================================
[SAMPLING] Collecting call stacks at 997 Hz...
[STACKS] 8,234 unique stack traces captured
[RENDER] Generating SVG flame graph...
[OUTPUT] Flame graph written to: profile.svg (234 KB)
[TIP] Open in browser: firefox profile.svg
$ ./profiler --compare before.prof after.prof
================================================================================
DIFFERENTIAL PROFILE
================================================================================
Comparing: before.prof (14,293 samples) vs after.prof (13,892 samples)
Improved (faster):
function before after delta
---------------------------------------------------
matrix_multiply 23.4% 8.2% -15.2% (optimized!)
vector_dot_product 18.7% 12.1% -6.6%
Regressed (slower):
function before after delta
---------------------------------------------------
cache_lookup 1.2% 4.8% +3.6% (new bottleneck)
validate_input 0.8% 2.1% +1.3%
[SUMMARY] Overall improvement: 18.3% less CPU time in hot path
$ ./profiler --self-profile --overhead-test
================================================================================
PROFILER OVERHEAD ANALYSIS
================================================================================
[TEST] Running workload without profiling: 4.823s
[TEST] Running workload with profiling at 100 Hz: 4.831s (+0.17%)
[TEST] Running workload with profiling at 1000 Hz: 4.935s (+2.32%)
[TEST] Running workload with profiling at 10000 Hz: 5.647s (+17.09%)
[RECOMMENDATION] Use 1000 Hz for production profiling
[WARNING] Rates above 5000 Hz introduce significant overhead
The Core Question You're Answering: How do profilers like gprof, perf, and pprof measure where your program spends its time, and what are the limitations of statistical sampling?
Concepts You Must Understand First:
- Timer signals (SIGPROF/SIGVTALRM) (TLPI Ch. 23) - Different timers measure wall-clock, user CPU, or system CPU time
- Signal handlers and context (CS:APP 8.5) - How the interrupted context provides the instruction pointer
- Program counter / instruction pointer (CS:APP 3.4) - The CPU register that tells you where execution is
- Symbol tables and DWARF (CS:APP 7.5) - How to map addresses back to function names and line numbers
- Statistical sampling theory (CS:APP 5.14) - Why sampling works and its inherent error margins
- ASLR and PIE (CS:APP 7.12) - Address randomization affects address-to-symbol mapping
Questions to Guide Your Design:
- What's the difference between ITIMER_REAL, ITIMER_VIRTUAL, and ITIMER_PROF? Which should a CPU profiler use?
- How do you get the instruction pointer (RIP) from inside a signal handler? What's in the ucontext_t?
- If you sample at 1000 Hz and a function runs for 1ms, how many samples do you expect? What if it runs for 0.5ms?
- How do you aggregate samples efficiently? A hash table from IP to count? What about collisions?
- How do you convert an instruction pointer to a function name? What tools/libraries can help?
- What happens to profiling accuracy if a function is inlined? Can you still measure it?
- How do you capture call stacks, not just leaf functions? What are the challenges with frame pointers?
- Why might your profiler show different results on different runs? Is this a bug or expected behavior?
Thinking Exercise:
Before coding, analyze this sampling scenario:
Time (ms): 0 1 2 3 4 5 6 7 8 9 10
|----|----|----|----|----|----|----|----|----|----|
Function A: t=0-4ms   ################ (4ms, 40%)
Function B: t=4-6ms   ########         (2ms, 20%)
Function C: t=6-10ms  ################ (4ms, 40%)
Sampling at 1000 Hz (every 1ms):
Sample #: 1 2 3 4 5 6 7 8 9 10
Expected: A A A A B B C C C C
Now consider: What if the timer fires at t=0.5, 1.5, 2.5... (phase-shifted by 0.5ms)?
- Which functions would we sample?
- If function B always runs at exactly t=4.0 to t=6.0, and our timer fires at t=0.5, 1.5, 2.5, 3.5, 4.5, 5.5...
- How many samples of B would we get?
This illustrates aliasing - a real problem in sampling profilers!
The Interview Questions They'll Ask:
- "How does a sampling profiler work? What are its advantages over instrumentation?"
- Expected answer: Sampling profilers periodically interrupt the program (via timer signal) and record where it is (instruction pointer). Advantages: low overhead (constant regardless of call frequency), no code modification needed, works on release binaries. Disadvantages: statistical error, may miss short functions, can alias with periodic behavior.
- "Explain the difference between CPU time and wall-clock time profiling."
- Expected answer: CPU time (ITIMER_PROF) only counts time when the CPU is executing your code - excludes I/O waits, sleeps, context switches. Wall-clock time (ITIMER_REAL) measures real elapsed time including waits. For CPU-bound code, use CPU time. For I/O-bound or concurrent code, wall-clock may be more useful.
- "How do you symbolize an address back to a function name?"
- Expected answer: Use the symbol table in the ELF binary. Tools: dladdr() for runtime lookup, addr2line for static lookup, libbacktrace or libunwind for full support. Need to handle ASLR (read /proc/self/maps), stripped binaries (no symbols), and inlined functions (DWARF info).
- "What is the observer effect in profiling?"
- Expected answer: Profiling changes the behavior of the program being measured. Signal handlers take CPU time, cache lines get evicted, branches may become less predictable. High sample rates increase overhead. A good profiler minimizes overhead and measures its own impact.
- "How would you profile a multithreaded program?"
- Expected answer: ITIMER_PROF signals go to the thread that consumed the CPU time. Each thread needs its own sample aggregation (or lock-protected shared structure). Consider: should you sample all threads equally or proportionally to CPU usage? Thread IDs help attribute samples.
- "Why might a profiler miss a function entirely?"
- Expected answer: If a function runs for less than the sampling interval (e.g., 0.1ms with 1000Hz sampling), it may never be sampled. Also: inlined functions don't have separate addresses, leaf functions may be in registers, and short periodic functions may alias with the sample timer.
Hints in Layers:
Layer 1 - Basic Timer Signal Setup:
#define _GNU_SOURCE   // needed for REG_RIP
#include <sys/time.h>
#include <signal.h>
#include <ucontext.h>
volatile sig_atomic_t sample_count = 0;
static struct sample { void *ip; } samples[1000000];
void profile_handler(int sig, siginfo_t *si, void *context) {
ucontext_t *uc = (ucontext_t *)context;
// Get instruction pointer from interrupted context
// Linux x86-64:
void *ip = (void *)uc->uc_mcontext.gregs[REG_RIP];
// macOS x86-64:
// void *ip = (void *)uc->uc_mcontext->__ss.__rip;
// Store sample (async-signal-safe: just array write)
if (sample_count < 1000000) {
samples[sample_count++].ip = ip;
}
}
void start_profiling(int hz) {
// Install signal handler with SA_SIGINFO to get context
struct sigaction sa;
sa.sa_sigaction = profile_handler;
sa.sa_flags = SA_RESTART | SA_SIGINFO;
sigemptyset(&sa.sa_mask);
sigaction(SIGPROF, &sa, NULL);
// Set up interval timer (microseconds)
struct itimerval timer;
timer.it_interval.tv_sec = 0;
timer.it_interval.tv_usec = 1000000 / hz; // e.g., 1000 for 1ms
timer.it_value = timer.it_interval;
setitimer(ITIMER_PROF, &timer, NULL);
}
Layer 2 - Sample Aggregation:
#include <search.h> // For hash table (hsearch)
typedef struct {
void *ip;
unsigned long count;
char *symbol; // Resolved later
} profile_entry_t;
// Simple aggregation: sort and count
void aggregate_samples(void) {
// Sort samples by IP for counting
qsort(samples, sample_count, sizeof(struct sample), compare_ip);
// Count consecutive duplicates
void *current_ip = NULL;
int current_count = 0;
for (int i = 0; i < sample_count; i++) {
if (samples[i].ip == current_ip) {
current_count++;
} else {
if (current_ip != NULL) {
add_to_profile(current_ip, current_count);
}
current_ip = samples[i].ip;
current_count = 1;
}
}
if (current_ip != NULL) {
add_to_profile(current_ip, current_count);
}
}
Layer 3 - Address Symbolization:
#define _GNU_SOURCE
#include <dlfcn.h>
// Runtime symbolization using dladdr
const char *symbolize(void *addr) {
Dl_info info;
if (dladdr(addr, &info) && info.dli_sname) {
return info.dli_sname;
}
return "(unknown)";
}
// For better symbolization, use addr2line or libbacktrace
void symbolize_with_addr2line(void *addr, char *out, size_t outsize) {
char cmd[256];
snprintf(cmd, sizeof(cmd),
"addr2line -f -e /proc/self/exe %p 2>/dev/null", addr);
FILE *fp = popen(cmd, "r");
if (fp) {
if (fgets(out, outsize, fp) == NULL) {
snprintf(out, outsize, "0x%lx", (unsigned long)addr);
}
// Strip newline
out[strcspn(out, "\n")] = '\0';
pclose(fp);
}
}
Layer 4 - Stack Trace Capture:
#include <execinfo.h>
#define MAX_STACK_DEPTH 64
typedef struct {
void *stack[MAX_STACK_DEPTH];
int depth;
} stack_sample_t;
stack_sample_t stack_samples[100000];
volatile sig_atomic_t stack_sample_count = 0;
void profile_handler_with_stack(int sig, siginfo_t *si, void *context) {
if (stack_sample_count >= 100000) return;
// Capture call stack
// NOTE: backtrace() is not strictly async-signal-safe!
// For production, use frame pointer walking or libunwind
stack_sample_t *s = &stack_samples[stack_sample_count];
s->depth = backtrace(s->stack, MAX_STACK_DEPTH);
stack_sample_count++;
}
// Print stack traces (for debugging)
void print_stack(stack_sample_t *s) {
char **symbols = backtrace_symbols(s->stack, s->depth);
for (int i = 0; i < s->depth; i++) {
printf(" %s\n", symbols[i]);
}
free(symbols);
}
Layer 5 - Report Generation:
void print_flat_profile(void) {
// Sort entries by sample count (descending)
qsort(profile_entries, entry_count, sizeof(profile_entry_t),
compare_by_count_desc);
printf("================================================================================\n");
printf(" FLAT PROFILE\n");
printf("================================================================================\n");
printf(" %%time samples function\n");
printf(" ------- -------- --------------------------------\n");
for (int i = 0; i < entry_count && i < 20; i++) {
double pct = 100.0 * profile_entries[i].count / sample_count;
printf(" %5.1f%% %8lu %s\n",
pct,
profile_entries[i].count,
profile_entries[i].symbol);
}
}
// Flame graph output (folded stacks format for flamegraph.pl)
void output_folded_stacks(FILE *out) {
for (int i = 0; i < stack_sample_count; i++) {
stack_sample_t *s = &stack_samples[i];
// Print stack frames separated by semicolons (bottom to top)
for (int j = s->depth - 1; j >= 0; j--) {
if (j < s->depth - 1) fprintf(out, ";");
fprintf(out, "%s", symbolize(s->stack[j]));
}
fprintf(out, " 1\n"); // Weight of 1 per sample
}
}
Books That Will Help:
| Book | Chapters | What You'll Learn |
|---|---|---|
| CS:APP 3e | 5.14, 8.5 | Performance measurement, signals and handlers |
| TLPI | 23 | Timer signals (ITIMER_*), interval timers |
| Systems Performance (Gregg) | 5-6 | Profiling methodology, CPU analysis |
| APUE 3e | 10, 14 | Signals, interval timers |
| BPF Performance Tools | 13 | CPU profiling with modern tools |
Common Pitfalls & Debugging:
- Bug: Using wall-clock timer for CPU profiling
// WRONG - counts time sleeping, not CPU time
setitimer(ITIMER_REAL, &timer, NULL);
// If program sleeps 90% of the time, you sample sleeping!

// RIGHT - counts only user + system CPU time
setitimer(ITIMER_PROF, &timer, NULL);
- Bug: Forgetting to handle ASLR
// WRONG - addresses change each run!
printf("Hot function at: %p\n", ip);
// Next run, same function is at a different address

// RIGHT - subtract base address or use dladdr
Dl_info info;
if (dladdr(ip, &info)) {
    ptrdiff_t offset = (char*)ip - (char*)info.dli_fbase;
    printf("%s+0x%lx\n", info.dli_fname, (unsigned long)offset);
}
- Bug: Calling non-async-signal-safe functions in handler
// WRONG - printf, malloc, dladdr are NOT async-signal-safe
void handler(int sig, siginfo_t *si, void *ctx) {
    printf("Sample at %p\n", get_ip(ctx));  // May deadlock!
    char *sym = symbolize(get_ip(ctx));     // Calls malloc!
}
// RIGHT - only store data, process later
void handler(int sig, siginfo_t *si, void *ctx) {
    if (sample_count < MAX_SAMPLES) {
        samples[sample_count++] = get_ip(ctx);  // Just a write
    }
}
- Bug: High sample rate causing measurement distortion
// WRONG - 100,000 Hz sampling
timer.it_interval.tv_usec = 10;  // 10us interval
// Handler overhead dominates! Measuring the profiler, not the program.

// RIGHT - 100-1000 Hz is usually sufficient
timer.it_interval.tv_usec = 1000;  // 1ms interval (1000 Hz)
// Rule of thumb: if overhead > 5%, reduce sample rate
Project 24: Memory Leak Detector
| Attribute | Value |
|---|---|
| Language | C (alt: C++) |
| Difficulty | Advanced |
| Time | 1โ2 weeks |
| Chapters | 7, 9, 3 |
What you'll build: A shared library (libleakcheck.so) that interposes malloc/free at runtime, tracks allocations, and emits leak reports with stack traces.
Why it matters: Combines linking (interposition) and memory concepts into a practical debugging tool.
Core challenges:
- Using LD_PRELOAD to intercept allocation APIs
- Avoiding recursion pitfalls ("no malloc in malloc")
- Recording useful diagnostics (sizes, call stacks)
Real World Outcome
When complete, your leak detector will produce output like this:
$ gcc -g -o leaky_app leaky_app.c
$ gcc -shared -fPIC -o libleakcheck.so leakcheck.c -ldl -lunwind
$ LD_PRELOAD=./libleakcheck.so ./leaky_app
================================================================================
MEMORY LEAK DETECTOR - Runtime Analysis
================================================================================
[INIT] libleakcheck.so loaded, intercepting malloc/calloc/realloc/free
[INIT] Tracking allocations with stack trace depth: 8
[ALLOC] malloc(64) = 0x55a3b2c00010 [leaky_app.c:23 in main()]
[ALLOC] malloc(128) = 0x55a3b2c00060 [leaky_app.c:24 in main()]
[ALLOC] calloc(10, 32) = 0x55a3b2c000f0 [leaky_app.c:27 in process_data()]
[ALLOC] malloc(256) = 0x55a3b2c00200 [leaky_app.c:31 in process_data()]
[FREE] free(0x55a3b2c00010) [leaky_app.c:45 in cleanup()]
[FREE] free(0x55a3b2c000f0) [leaky_app.c:46 in cleanup()]
================================================================================
LEAK REPORT - Program Exit
================================================================================
2 blocks leaked (384 bytes total)
Block 1: 0x55a3b2c00060 (128 bytes)
Allocated at: leaky_app.c:24 in main()
Call stack:
#0 main() at leaky_app.c:24
#1 __libc_start_call_main at libc.so.6
#2 __libc_start_main at libc.so.6
#3 _start
Block 2: 0x55a3b2c00200 (256 bytes)
Allocated at: leaky_app.c:31 in process_data()
Call stack:
#0 process_data() at leaky_app.c:31
#1 main() at leaky_app.c:28
#2 __libc_start_call_main at libc.so.6
#3 _start
--------------------------------------------------------------------------------
Summary: 4 allocations, 2 frees, 2 leaks (384 bytes)
Peak memory usage: 480 bytes at timestamp 0.003s
================================================================================
The Core Question You're Answering
"How can we transparently intercept and track every memory allocation in a running program without modifying its source code, and use this to detect memory leaks with precise source location information?"
This project teaches you that the dynamic linker is programmable infrastructure. By understanding symbol resolution order and interposition, you can inject behavior into any dynamically-linked program. The same mechanism powers profilers, sanitizers, and debugging tools used in production systems.
Concepts You Must Understand First
Before writing code, ensure you can explain:
| Concept | Why It Matters | Reference |
|---|---|---|
| Dynamic Linking & Symbol Resolution | LD_PRELOAD exploits the linker's symbol search order to let your library "shadow" libc functions | CS:APP 7.12, TLPI Ch. 41 |
| Position-Independent Code (PIC) | Shared libraries must use PIC; understand GOT/PLT indirection | CS:APP 7.12 |
| dlsym and RTLD_NEXT | You need to call the real malloc after your wrapper; RTLD_NEXT finds the next symbol in search order | TLPI 42.1 |
| Stack Unwinding | Capturing call stacks requires walking the frame chain or using libunwind/backtrace() | CS:APP 3.7, libunwind docs |
| Thread Safety | Your tracking data structures must handle concurrent allocations | CS:APP Ch. 12 |
| Signal Safety | Some code paths (like atexit handlers) have restrictions on what functions you can call | CS:APP 8.5.5 |
Questions to Guide Your Design
Answer these before writing code:
- How will you store allocation metadata? (Hash table keyed by address? Linked list? What are the tradeoffs?)
- How do you get the "real" malloc? (When does dlsym(RTLD_NEXT, "malloc") get called? What if dlsym itself calls malloc?)
- What happens if your tracking code calls malloc? (Design a recursion guard. How do you detect and break the cycle?)
- How will you capture stack traces? (backtrace() vs libunwind vs manual frame walking. Which is signal-safe?)
- When do you emit the leak report? (atexit handler? Destructor function? What about abnormal termination?)
- How do you map addresses to source lines? (Runtime: addr2line/dladdr. Or embed DWARF parsing?)
- What about realloc? (It can move memory. How do you track the old/new relationship?)
- How do you handle calloc? (It might be implemented via malloc internally in some libcs.)
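The realloc question is mostly a bookkeeping problem, and the bookkeeping is small enough to sketch on its own. This is a minimal illustration with hypothetical names (`entry_t`, `track`, `find`, `track_realloc`) and a fixed-size table standing in for the hash table developed later; the real wrapper would call the real realloc first, then update the table behind its recursion guard.

```c
#include <stddef.h>

/* Hypothetical fixed-size tracking table (a real detector uses a hash table). */
typedef struct { void *ptr; size_t size; } entry_t;
static entry_t track[256];

static entry_t *find(void *p) {
    for (int i = 0; i < 256; i++)
        if (track[i].ptr == p) return &track[i];
    return NULL;
}

/* Bookkeeping for realloc(old_ptr, new_size) returning new_ptr:
 * - realloc(NULL, n) behaves like malloc(n)  -> insert a fresh entry
 * - a freeing realloc returns NULL           -> remove the old entry
 * - otherwise the block may have moved       -> re-key the old entry */
static void track_realloc(void *old_ptr, void *new_ptr, size_t new_size) {
    entry_t *e = old_ptr ? find(old_ptr) : NULL;
    if (new_ptr == NULL && new_size == 0) {    /* freed via realloc(p, 0) */
        if (e) e->ptr = NULL;
        return;
    }
    if (e) {                                   /* moved or resized in place */
        e->ptr = new_ptr;
        e->size = new_size;
    } else {                                   /* fresh allocation */
        entry_t *slot = find(NULL);            /* first empty slot */
        if (slot) { slot->ptr = new_ptr; slot->size = new_size; }
    }
}
```

Note the subtlety this encodes: if realloc moved the block, the old address must disappear from the live set atomically with the new one appearing, or the detector reports a phantom leak at the old address.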
Thinking Exercise: Trace This Interposition
Before implementing, trace through what happens when a program runs with your library:
// leaky.c - compile with: gcc -g -o leaky leaky.c
#include <stdlib.h>
#include <stdio.h>
void helper(void) {
char *buf = malloc(100); // Allocation A
// Oops, forgot to free!
}
int main(void) {
int *arr = malloc(40); // Allocation B
helper();
free(arr); // Free B
return 0;
}
Trace questions:
- When LD_PRELOAD=./libleakcheck.so ./leaky starts, in what order are constructors called?
- When main() calls malloc(40), trace the symbol resolution:
  - Where does the PLT jump go first?
  - How does your interposed malloc get called?
  - How does your wrapper call the real malloc?
- Why is Allocation A (in helper) a leak but Allocation B is not?
- If your leak report runs in an atexit handler, what memory is still "live"?
Draw the state of your allocation tracking table after each call:
After malloc(40): { 0x55...010: {size=40, caller=main:10} }
After malloc(100): { 0x55...010: {size=40, caller=main:10},
0x55...080: {size=100, caller=helper:6} }
After free(arr): { 0x55...080: {size=100, caller=helper:6} }
At exit: 1 leak detected!
The Interview Questions They'll Ask
Q1: Explain how LD_PRELOAD works and what security implications it has.
Expected answer: LD_PRELOAD tells the dynamic linker to load specified shared libraries before any others. When resolving symbols, the linker searches in order: LD_PRELOAD libraries, then the executable, then DT_NEEDED libraries. This allows "shadowing" symbols like malloc. Security implication: setuid/setgid binaries ignore LD_PRELOAD (AT_SECURE) to prevent privilege escalation. It's also why LD_PRELOAD can't intercept statically-linked binaries.
Q2: How would you avoid infinite recursion if your malloc wrapper needs to allocate memory?
Expected answer: Use a thread-local recursion guard. When entering the wrapper, check and set a flag. If already set, call the real malloc directly without tracking. Alternative: use a static buffer for internal allocations, or use mmap directly which doesnโt go through malloc. The guard must be thread-local (using __thread) for correctness in multi-threaded programs.
Q3: Whatโs the difference between using backtrace() and libunwind for stack traces?
Expected answer: backtrace() is simpler (part of glibc) but not signal-safe and may not work well with optimized code missing frame pointers. libunwind is more portable, can be configured for signal-safety, and handles various unwinding methods (DWARF, frame pointers, etc.). For a leak detector, either works, but libunwind is more robust for production use.
Q4: How would you extend this to detect double-frees and use-after-free?
Expected answer: For double-free: keep freed blocks in a "recently freed" quarantine list; if free() is called on an address in quarantine, report double-free. For use-after-free: more complex; could use guard pages (like AddressSanitizer) or probabilistic detection via canary values. Full detection requires memory poisoning and potentially page protection tricks.
Q5: Why might your leak detector report false positives?
Expected answer: (1) Intentional leaks at shutdown (global caches freed by OS). (2) Memory reachable through global pointers but not explicitly freed. (3) Custom allocators that batch-free at exit. (4) Memory still in use when atexit runs. Real leak detectors (Valgrind) do reachability analysis at exit to distinguish "definitely lost" from "still reachable."
Q6: How does Valgrindโs memcheck differ from LD_PRELOAD interposition?
Expected answer: Valgrind runs the program on a synthetic CPU, instrumenting every memory access. This catches more bugs (uninitialized reads, buffer overflows) but with 10-50x slowdown. LD_PRELOAD only intercepts explicit allocation calls, so it's faster (~1.1x overhead) but misses many bug classes. They're complementary: LD_PRELOAD for production monitoring, Valgrind for thorough testing.
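The quarantine idea from the double-free question is small enough to sketch. This is an illustrative fragment, not part of the hints below; `quarantine_check_and_add` is a hypothetical name, and the wrapper's free() would call it before handing the pointer to the real free.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical quarantine: remember the last N freed pointers and flag a
 * second free() of any of them as a likely double-free. */
#define QUARANTINE_SLOTS 64
static void *quarantine[QUARANTINE_SLOTS];
static size_t q_next = 0;   /* ring-buffer write position */

/* Returns true if ptr was already freed recently (likely double-free),
 * otherwise records this free in the ring buffer. */
static bool quarantine_check_and_add(void *ptr) {
    for (size_t i = 0; i < QUARANTINE_SLOTS; i++)
        if (quarantine[i] == ptr)
            return true;                 /* double-free detected */
    quarantine[q_next] = ptr;            /* remember this free */
    q_next = (q_next + 1) % QUARANTINE_SLOTS;
    return false;
}
```

One caveat a real detector must handle: when malloc hands the same address out again, the pointer has to be evicted from quarantine, or a perfectly legitimate free of the reused block will be reported as a double-free.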
Hints in Layers
Layer 1 - The Basic Structure
Start with the wrapper skeleton. The key is dlsym(RTLD_NEXT, ...):
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>
static void* (*real_malloc)(size_t) = NULL;
static void (*real_free)(void*) = NULL;
__attribute__((constructor))
static void init(void) {
real_malloc = dlsym(RTLD_NEXT, "malloc");
real_free = dlsym(RTLD_NEXT, "free");
if (!real_malloc || !real_free) {
fprintf(stderr, "Error: dlsym failed\n");
_exit(1);
}
}
void* malloc(size_t size) {
void* ptr = real_malloc(size);
fprintf(stderr, "[ALLOC] malloc(%zu) = %p\n", size, ptr);
return ptr;
}
void free(void* ptr) {
fprintf(stderr, "[FREE] free(%p)\n", ptr);
real_free(ptr);
}
Compile: gcc -shared -fPIC -o libleakcheck.so leakcheck.c -ldl
Layer 2 - The Recursion Problem
Your fprintf calls malloc internally! Add a guard:
static __thread int in_wrapper = 0;
void* malloc(size_t size) {
if (in_wrapper || !real_malloc) {
// Bootstrapping or recursive call - use real malloc directly
if (!real_malloc) {
real_malloc = dlsym(RTLD_NEXT, "malloc");
}
return real_malloc(size);
}
in_wrapper = 1;
void* ptr = real_malloc(size);
// Now safe to call fprintf, etc.
fprintf(stderr, "[ALLOC] malloc(%zu) = %p\n", size, ptr);
in_wrapper = 0;
return ptr;
}
Layer 3 - Tracking Allocations
Use a hash table to track live allocations:
#define HASH_SIZE 65536
typedef struct alloc_info {
void* ptr;
size_t size;
void* stack[8];
int stack_depth;
struct alloc_info* next;
} alloc_info_t;
static alloc_info_t* hash_table[HASH_SIZE];
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static size_t ptr_hash(void* ptr) {
return ((uintptr_t)ptr >> 4) % HASH_SIZE;
}
static void track_alloc(void* ptr, size_t size) {
alloc_info_t* info = real_malloc(sizeof(alloc_info_t));
info->ptr = ptr;
info->size = size;
info->stack_depth = backtrace(info->stack, 8);
size_t idx = ptr_hash(ptr);
pthread_mutex_lock(&lock);
info->next = hash_table[idx];
hash_table[idx] = info;
pthread_mutex_unlock(&lock);
}
Layer 4 - Stack Trace Symbolization
Convert addresses to function names and line numbers:
#include <execinfo.h>
#include <dlfcn.h>
static void print_stack_trace(void** stack, int depth) {
char** symbols = backtrace_symbols(stack, depth);
for (int i = 0; i < depth; i++) {
Dl_info info;
if (dladdr(stack[i], &info) && info.dli_sname) {
fprintf(stderr, " #%d %s + 0x%lx\n",
i, info.dli_sname,
(char*)stack[i] - (char*)info.dli_saddr);
} else {
fprintf(stderr, " #%d %s\n", i, symbols[i]);
}
}
free(symbols);
}
For source lines, shell out to addr2line or embed libdwarf.
Layer 5 - The Leak Report
Register an atexit handler to print remaining allocations:
__attribute__((destructor))
static void report_leaks(void) {
size_t total_leaked = 0;
size_t leak_count = 0;
fprintf(stderr, "\n=== LEAK REPORT ===\n");
for (int i = 0; i < HASH_SIZE; i++) {
for (alloc_info_t* info = hash_table[i]; info; info = info->next) {
leak_count++;
total_leaked += info->size;
fprintf(stderr, "Block %p (%zu bytes)\n", info->ptr, info->size);
print_stack_trace(info->stack, info->stack_depth);
}
}
fprintf(stderr, "\n%zu blocks leaked (%zu bytes total)\n",
leak_count, total_leaked);
}
Books That Will Help
| Book | Chapter(s) | What You'll Learn |
|---|---|---|
| CS:APP | 7.12 | Position-independent code, dynamic linking, symbol interposition |
| CS:APP | 9.9 | Dynamic memory allocation concepts |
| CS:APP | 3.7 | Stack structure, frame pointers for unwinding |
| TLPI | 41 | Fundamentals of shared libraries |
| TLPI | 42 | dlopen, dlsym, RTLD_NEXT, library interposition |
| Low-Level Programming | Ch. 13 | Shared libraries and dynamic linking on Linux |
| Effective C | Ch. 6 | Dynamic memory management best practices |
Common Pitfalls & Debugging
Problem 1: Infinite recursion / stack overflow on startup
Symptom: Program crashes immediately with SIGSEGV in dlsym or printf.
Cause: dlsym or stdio functions call malloc before your constructor runs.
Fix: Use a static buffer for early allocations, or check if real_malloc is NULL:
static char early_buffer[4096];
static size_t early_offset = 0;
void* malloc(size_t size) {
    if (!real_malloc) {
        // Before constructor: carve from the static buffer
        if (early_offset + size > sizeof(early_buffer))
            return NULL;  // early buffer exhausted
        void* ptr = &early_buffer[early_offset];
        early_offset += (size + 15) & ~(size_t)15;  // Align to 16
        return ptr;
    }
    // ... normal path
}
// NOTE: free() must ignore pointers that fall inside early_buffer
Problem 2: Deadlock in multi-threaded programs
Symptom: Program hangs when multiple threads allocate simultaneously.
Cause: Holding the lock while calling fprintf (which may call malloc).
Fix: Copy necessary data, release lock, then log:
void* malloc(size_t size) {
void* ptr = real_malloc(size);
pthread_mutex_lock(&lock);
// Quick insert into hash table
pthread_mutex_unlock(&lock);
// Log AFTER releasing lock
if (!in_wrapper) {
in_wrapper = 1;
fprintf(stderr, "[ALLOC] ...\n");
in_wrapper = 0;
}
return ptr;
}
Problem 3: Incorrect leak counts (missing frees or double-counting)
Symptom: Report shows leaks for memory you know was freed.
Cause: Hash table collision handling bug, or realloc not tracked correctly.
Debug: Add verbose logging showing every insert/remove:
$ LD_DEBUG=bindings LD_PRELOAD=./libleakcheck.so ./app 2>&1 | grep -E 'malloc|free'
Problem 4: Stack traces missing function names
Symptom: Stack trace shows only addresses like 0x55a3b2c00060.
Cause: Program compiled without debug symbols, or stripped binary.
Fix: Compile with -g and -rdynamic (exports symbols for backtrace). For release builds, use addr2line:
char cmd[256], line[256];
snprintf(cmd, sizeof(cmd), "addr2line -e /proc/self/exe %p", addr);
FILE *fp = popen(cmd, "r");  // capture output (system() would dump it unformatted)
if (fp) {
    if (fgets(line, sizeof(line), fp))
        fputs(line, stderr);
    pclose(fp);
}
Project 25: Debugger (ptrace-based)
| Attribute | Value |
|---|---|
| Language | C (alt: C++, Rust) |
| Difficulty | Master |
| Time | 1 month+ |
| Chapters | 3, 7, 8 |
What you'll build: A tiny debugger (mydb) that runs a child process under control, sets breakpoints, single-steps, and inspects registers/memory.
Why it matters: The ultimate test of understanding machine-level code, process control, and system calls.
Core challenges:
- Controlling the tracee with ptrace stop/resume semantics
- Implementing software breakpoints (patching with int3)
- Building a command loop (break, run, step, continue, regs, x)
Real World Outcome
When complete, your debugger will produce output like this:
$ ./mydb ./target_program
================================================================================
MYDB - Minimal x86-64 Debugger
================================================================================
[INFO] Loaded executable: ./target_program
[INFO] Entry point: 0x401000
[INFO] Text section: 0x401000 - 0x401fff
[INFO] Type 'help' for available commands
mydb> break main
[BREAK] Breakpoint 1 set at 0x401126 <main>
mydb> break 0x40113a
[BREAK] Breakpoint 2 set at 0x40113a <main+20>
mydb> run
[RUN] Starting program: ./target_program
[STOP] Hit breakpoint 1 at 0x401126 <main>
mydb> regs
================================================================================
REGISTER STATE
================================================================================
rax = 0x0000000000000000 rbx = 0x0000000000000000
rcx = 0x00007ffff7fa5040 rdx = 0x00007fffffffe0a8
rsi = 0x00007fffffffe098 rdi = 0x0000000000000001
rbp = 0x0000000000000000 rsp = 0x00007fffffffe088
r8 = 0x0000000000000000 r9 = 0x00007ffff7fc9040
r10 = 0x00007ffff7fc3908 r11 = 0x00007ffff7fe17c0
r12 = 0x0000000000401000 r13 = 0x00007fffffffe090
r14 = 0x0000000000000000 r15 = 0x0000000000000000
rip = 0x0000000000401126 eflags = 0x00000246 [PF ZF IF]
mydb> x/8x $rsp
0x7fffffffe088: 0x00007ffff7df1b6b 0x0000000000000001
0x7fffffffe098: 0x00007fffffffe3a8 0x0000000000000000
0x7fffffffe0a8: 0x00007fffffffe3c0 0x00007fffffffe3d5
0x7fffffffe0b8: 0x00007fffffffe3f2 0x00007fffffffe410
mydb> disas main
0x401126 <main+0>: push rbp
0x401127 <main+1>: mov rbp, rsp
0x40112a <main+4>: sub rsp, 0x20
0x40112e <main+8>: mov dword ptr [rbp-0x14], edi
0x401131 <main+11>: mov qword ptr [rbp-0x20], rsi
0x401135 <main+15>: mov dword ptr [rbp-0x4], 0x2a
0x40113c <main+22>: mov eax, dword ptr [rbp-0x4]
mydb> step
[STEP] Single-stepped to 0x401127 <main+1>
mydb> continue
[CONTINUE] Resuming execution...
[STOP] Hit breakpoint 2 at 0x40113a <main+20>
mydb> print $rax
$rax = 42 (0x2a)
mydb> continue
[CONTINUE] Resuming execution...
[EXIT] Program exited with status 0
The Core Question You're Answering
"How does a debugger gain control over another running process, stop it at arbitrary points, inspect its internal state, and resume execution - all without modifying the program's source code?"
This project demystifies the "magic" of debuggers. You'll discover that debuggers are just programs that use operating system facilities (ptrace) to become the "parent" of another process. Every debugger command maps to specific ptrace operations: breakpoints are instruction patches, single-stepping uses CPU trap flags, and register inspection reads from kernel-managed process state.
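The "become the parent of another process" handshake can be sketched in a few lines. This is a minimal Linux-only skeleton under the assumption that default ptrace permissions apply; `run_traced` is an illustrative name and error handling is trimmed for brevity.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Minimal "run a program under control" skeleton: the child asks to be
 * traced, the kernel stops it at the first exec, and the parent decides
 * when it runs. Returns the child's exit status. */
int run_traced(char *const argv[]) {
    pid_t pid = fork();
    if (pid == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);  /* child: let parent trace me */
        execvp(argv[0], argv);                  /* kernel stops child here */
        _exit(127);                             /* exec failed */
    }
    int status;
    waitpid(pid, &status, 0);                   /* child stopped at exec */
    /* ...this is where a debugger would plant breakpoints... */
    ptrace(PTRACE_CONT, pid, NULL, NULL);       /* resume the child */
    waitpid(pid, &status, 0);                   /* wait for it to finish */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Everything a debugger does happens in the gap between those two waitpid calls: while the tracee is stopped, the parent may read and write its registers and memory at will.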
Concepts You Must Understand First
Before writing code, ensure you can explain:
| Concept | Why It Matters | Reference |
|---|---|---|
| ptrace System Call | The fundamental mechanism for process tracing; allows reading/writing memory, registers, and controlling execution | TLPI Ch. 26, man ptrace |
| x86-64 Instruction Encoding | You need to understand how int3 (0xCC) works as a breakpoint trap | CS:APP 3.1, Intel SDM Vol. 2 |
| Process States & Signals | Traced processes stop on signals; SIGTRAP indicates breakpoint or single-step | CS:APP 8.5, TLPI Ch. 20-22 |
| ELF Format & Symbols | To set breakpoints by function name, you must parse the symbol table | CS:APP 7.4-7.5 |
| Memory Layout | Understanding text/data/stack segments and how addresses map to actual memory | CS:APP 7.9, 9.7 |
| Register Conventions | Knowing which registers hold arguments, return values, and the instruction pointer | CS:APP 3.4 |
Questions to Guide Your Design
Answer these before writing code:
1. How do you start a process under your control? (fork + ptrace(PTRACE_TRACEME) + exec? What happens if exec fails?)
2. What's the difference between PTRACE_CONT and PTRACE_SINGLESTEP? (How does the CPU know to stop after one instruction?)
3. How do breakpoints actually work? (What byte do you save? What byte do you write? What happens when the CPU executes it?)
4. After hitting a breakpoint, how do you continue? (Why can't you just PTRACE_CONT? What's the "step over breakpoint" dance?)
5. How do you distinguish breakpoint stops from other stops? (SIGTRAP can mean breakpoint, single-step, or syscall stop)
6. How do you read the tracee's memory? (PTRACE_PEEKTEXT returns one word at a time - how do you read larger regions?)
7. How do you map addresses to function names? (Parse ELF .symtab/.dynsym? Use libdwarf for source lines?)
8. What happens if the tracee forks? (Does your debugger follow the child? How do you handle multi-threaded programs?)
Thinking Exercise: Trace a Breakpoint Hit
Before implementing, trace through what happens when a breakpoint is hit:
State 1: Program loaded, breakpoint set at 0x401126
- Original instruction at 0x401126: 55 (push rbp)
- After setting breakpoint: CC (int3)
- Debugger waiting in waitpid()
State 2: Program runs, hits breakpoint
- CPU fetches instruction at 0x401126
- CPU executes 0xCC (int3)
- CPU raises #BP exception
- Kernel converts to SIGTRAP, stops tracee
- Kernel wakes debugger from waitpid()
- RIP = 0x401127 (past the int3)
State 3: Debugger inspects state
- ptrace(PTRACE_GETREGS, ...) reads all registers
- RIP needs adjustment: subtract 1 to point to breakpoint
- ptrace(PTRACE_PEEKTEXT, 0x401126) reads memory
State 4: User says "continue"
- Restore original byte: poke 0x55 at 0x401126
- Set RIP = 0x401126 (re-execute the instruction)
- ptrace(PTRACE_SINGLESTEP) - execute ONE instruction
- waitpid() - tracee stops after push rbp
- Restore breakpoint: poke 0xCC at 0x401126
- ptrace(PTRACE_CONT) - continue normally
Draw the instruction byte at 0x401126 through each state:
[LOAD] 0x401126: 55 (push rbp - original)
[BREAK] 0x401126: CC (int3 - breakpoint active)
[HIT] 0x401126: CC, RIP=0x401127 (stopped, RIP past int3)
[STEP] 0x401126: 55, RIP=0x401126 (restored, about to re-execute)
[AFTER] 0x401126: CC, RIP=0x401127 (breakpoint re-armed, executed push)
The Interview Questions They'll Ask
Q1: Explain how software breakpoints work at the CPU level.
Expected answer: A software breakpoint replaces the first byte of an instruction with int3 (0xCC), a single-byte instruction that triggers a breakpoint exception (#BP). When the CPU executes it, it raises the exception, the kernel translates this to SIGTRAP, and the debugger (as the tracer) is notified via waitpid(). The debugger saves the original byte and restores it when needed for continuation.
Q2: What's the "step over breakpoint" problem and how do you solve it?
Expected answer: After hitting a breakpoint, you can't just continue because the breakpoint instruction is still there. The solution: (1) restore the original instruction, (2) set RIP back to the breakpoint address, (3) single-step one instruction, (4) re-insert the breakpoint, (5) then continue normally. This ensures the original instruction executes before re-arming the breakpoint.
Q3: How does ptrace(PTRACE_SINGLESTEP) work?
Expected answer: PTRACE_SINGLESTEP sets the x86 Trap Flag (TF) in the EFLAGS register. This flag causes the CPU to generate a debug exception (#DB) after executing exactly one instruction. The kernel handles this exception, delivers SIGTRAP to the tracer, and clears TF. The debugger sees the process stopped after one instruction.
Q4: Why do debuggers need to parse ELF files?
Expected answer: To provide symbolic debugging. Without ELF parsing, you can only work with raw addresses. By reading .symtab (static symbols) and .dynsym (dynamic symbols), you can map addresses to function names and vice versa. For source-level debugging, you need DWARF debug info (.debug_* sections) to map addresses to source lines.
Q5: How would you implement conditional breakpoints?
Expected answer: A conditional breakpoint stops only when a condition is true. Implementation: (1) set a normal breakpoint, (2) when hit, evaluate the condition (parse expression, read registers/memory), (3) if false, do the step-over-breakpoint dance silently and continue, (4) if true, report the stop to the user. The overhead comes from stopping on every hit even when continuing.
Q6: What are the limitations of ptrace-based debugging?
Expected answer: (1) Only one tracer per process - can't run under two debuggers. (2) Performance overhead from context switches on every stop. (3) Can be detected by the tracee (via PTRACE_TRACEME failing or checking ppid). (4) Anti-debugging tricks can interfere (timing checks, self-modifying code). (5) Multi-threaded debugging is complex (need to stop all threads atomically).
Hints in Layers
Layer 1 - Basic Process Control
Start with launching and stopping a process:
#include <stdio.h>
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <sys/user.h>
#include <unistd.h>
int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s program [args...]\n", argv[0]);
        return 1;
    }
    pid_t child = fork();
    if (child == 0) {
        // Child: request to be traced, then exec
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        execvp(argv[1], &argv[1]);
        perror("exec failed");
        _exit(1);
    }
    // Parent: wait for the child's stop at exec (kernel delivers SIGTRAP)
    int status;
    waitpid(child, &status, 0);
    printf("Child stopped at entry point\n");
    // Continue child
    ptrace(PTRACE_CONT, child, NULL, NULL);
    waitpid(child, &status, 0);
    printf("Child exited with status %d\n", WEXITSTATUS(status));
    return 0;
}
Layer 2 - Reading Registers
Use PTRACE_GETREGS to read all registers at once:
#include <sys/user.h>
void show_registers(pid_t child) {
struct user_regs_struct regs;
ptrace(PTRACE_GETREGS, child, NULL, &regs);
printf("rip = 0x%llx\n", regs.rip);
printf("rsp = 0x%llx\n", regs.rsp);
printf("rax = 0x%llx\n", regs.rax);
// ... other registers
}
Layer 3 - Reading Memory
PTRACE_PEEKTEXT reads one word at a time:
#include <errno.h>
#include <string.h>
void read_memory(pid_t child, unsigned long addr, void *buf, size_t len) {
    unsigned char *dst = buf;
    for (size_t i = 0; i < len; i += sizeof(long)) {
        errno = 0;  // PEEKTEXT may legitimately return -1, so test errno instead
        long word = ptrace(PTRACE_PEEKTEXT, child, addr + i, NULL);
        if (errno) {
            perror("PEEKTEXT failed");
            return;
        }
        // Copy only the bytes that fit, so a partial last word can't overrun buf
        size_t n = (len - i < sizeof(long)) ? len - i : sizeof(long);
        memcpy(dst + i, &word, n);
    }
}
Layer 4 - Setting Breakpoints
Save the original byte, write 0xCC:
typedef struct {
unsigned long addr;
unsigned char saved_byte;
int enabled;
} breakpoint_t;
void set_breakpoint(pid_t child, breakpoint_t *bp, unsigned long addr) {
long word = ptrace(PTRACE_PEEKTEXT, child, addr, NULL);
bp->addr = addr;
bp->saved_byte = (unsigned char)(word & 0xff);
bp->enabled = 1;
// Replace first byte with int3 (0xCC)
long modified = (word & ~0xff) | 0xCC;
ptrace(PTRACE_POKETEXT, child, addr, modified);
}
void disable_breakpoint(pid_t child, breakpoint_t *bp) {
long word = ptrace(PTRACE_PEEKTEXT, child, bp->addr, NULL);
long restored = (word & ~0xff) | bp->saved_byte;
ptrace(PTRACE_POKETEXT, child, bp->addr, restored);
bp->enabled = 0;
}
Layer 5 - The Continue Dance
When continuing from a breakpoint:
void continue_from_breakpoint(pid_t child, breakpoint_t *bp) {
struct user_regs_struct regs;
ptrace(PTRACE_GETREGS, child, NULL, &regs);
// RIP points past int3; back it up
regs.rip = bp->addr;
ptrace(PTRACE_SETREGS, child, NULL, &regs);
// Restore original instruction
disable_breakpoint(child, bp);
// Single-step one instruction
ptrace(PTRACE_SINGLESTEP, child, NULL, NULL);
int status;
waitpid(child, &status, 0);
// Re-enable breakpoint
set_breakpoint(child, bp, bp->addr);
// Now continue normally
ptrace(PTRACE_CONT, child, NULL, NULL);
}
Books That Will Help
| Book | Chapter(s) | What You'll Learn |
|---|---|---|
| CS:APP | 3.1-3.4 | x86-64 instruction formats, registers, calling conventions |
| CS:APP | 7.4-7.5 | ELF format, symbol tables for address-to-name mapping |
| CS:APP | 8.4-8.5 | Process control, signals, and how SIGTRAP works |
| TLPI | 26 | Comprehensive ptrace coverage with examples |
| TLPI | 20-22 | Signals and signal handling |
| Low-Level Programming | Ch. 11 | Practical debugging and ptrace examples |
| Intel SDM Vol. 2 | INT instruction | How int3 generates #BP exception |
| How Debuggers Work (Rosenberg) | All | Classic book on debugger implementation |
Common Pitfalls & Debugging
Problem 1: Breakpoint doesnโt trigger
Symptom: Program runs past breakpoint address without stopping.
Cause: Wrong address (e.g., function entry vs. first instruction), or breakpoint set on non-executable memory.
Fix: Verify address with objdump -d. Ensure youโre setting breakpoint on actual code:
$ objdump -d target | grep -A5 "<main>:"
0000000000401126 <main>:
401126: 55 push %rbp
Problem 2: SIGTRAP but not from breakpoint
Symptom: Unexpected stops at random addresses.
Cause: Single-step trap, syscall stop (if PTRACE_SYSCALL was used), or clone/fork events.
Fix: Check stop reason carefully:
if (WIFSTOPPED(status)) {
int sig = WSTOPSIG(status);
if (sig == SIGTRAP) {
// Could be breakpoint, single-step, or syscall
siginfo_t info;
ptrace(PTRACE_GETSIGINFO, child, NULL, &info);
if (info.si_code == SI_KERNEL || info.si_code == TRAP_BRKPT) {
// Breakpoint
} else if (info.si_code == TRAP_TRACE) {
// Single-step
}
}
}
Problem 3: Registers show wrong values after breakpoint
Symptom: RIP points to wrong address after hitting breakpoint.
Cause: Forgetting that int3 advances RIP by 1. After hitting breakpoint at 0x401126, RIP = 0x401127.
Fix: Always subtract 1 from RIP when stopped at a breakpoint:
regs.rip--; // Point back to the int3/original instruction
Problem 4: Cannot continue after hitting breakpoint
Symptom: Program immediately hits same breakpoint again, or crashes.
Cause: Not doing the single-step dance - you continue with int3 still in place.
Fix: Always: restore original byte, step once, re-insert breakpoint, then continue.
Project 26: Operating System Kernel Capstone
| Attribute | Value |
|---|---|
| Language | C + x86-64 Assembly (alt: Rust) |
| Difficulty | Master+ |
| Time | 3–6 months |
| Chapters | All CS:APP + OSTEP |
What youโll build: A minimal x86-64 kernel that boots in QEMU, enables paging, handles interrupts, and runs simple user processes.
Why it matters: An OS kernel uses every concept from CS:APP - this is the ultimate capstone.
Core challenges:
- Booting to 64-bit long mode
- Implementing physical/virtual memory management
- Handling interrupts and context-switching between tasks
- Designing a minimal syscall boundary
Real World Outcome
When complete, your kernel will boot and run in QEMU like this:
$ make
AS boot.S
CC kernel.c
CC mm.c
CC interrupt.c
CC process.c
CC syscall.c
LD kernel.elf
OBJCOPY kernel.bin
$ make run
qemu-system-x86_64 -kernel kernel.bin -serial stdio -no-reboot
================================================================================
MINIX86 KERNEL v0.1 - x86-64 Operating System
================================================================================
[BOOT] Entered 64-bit long mode
[BOOT] Kernel loaded at 0xffffffff80000000
[BOOT] Stack at 0xffffffff80010000
[MM] Physical memory detected: 128 MB
[MM] Kernel: 0x100000 - 0x200000 (1 MB)
[MM] Free memory starts at: 0x200000
[MM] Initializing page frame allocator...
[MM] 32256 page frames available (126 MB)
[MM] Page tables initialized
[IDT] Loading Interrupt Descriptor Table...
[IDT] Exception handlers 0-31 installed
[IDT] IRQ handlers 32-47 installed
[IDT] Syscall handler at vector 0x80 installed
[PIC] 8259 PIC remapped (IRQ0 -> INT32)
[TIMER] PIT configured for 100 Hz tick
[PROC] Process subsystem initialized
[PROC] Creating init process (PID 1)...
[PROC] Loading /bin/init from initrd
[PROC] Entry point: 0x400000
[PROC] User stack: 0x7fffffffe000
================================================================================
SWITCHING TO USER MODE
================================================================================
[SYSCALL] init(1): write(1, "Hello from userspace!\n", 22)
Hello from userspace!
[SYSCALL] init(1): fork() = 2
[PROC] Created process 2 (parent: 1)
[SYSCALL] shell(2): write(1, "minix86> ", 9)
minix86> [SYSCALL] shell(2): read(0, buf, 256)
$ # Type commands at the kernel shell
$ echo hello
[SYSCALL] shell(2): fork() = 3
[SYSCALL] echo(3): execve("/bin/echo", ["echo", "hello"], envp)
[SYSCALL] echo(3): write(1, "hello\n", 6)
hello
[SYSCALL] echo(3): exit(0)
[PROC] Process 3 exited with status 0
[SYSCALL] shell(2): wait4(-1, &status, 0, NULL) = 3
minix86> ps
[SYSCALL] shell(2): fork() = 4
[SYSCALL] ps(4): open("/proc/self/status", O_RDONLY)
PID PPID STATE NAME
1 0 SLEEP init
2 1 RUNNING shell
4 2 RUNNING ps
[SYSCALL] ps(4): exit(0)
minix86> ^C
[SIGNAL] Sending SIGINT to process 2
[PROC] Shell caught SIGINT, continuing...
$ # Press Ctrl+A, X to exit QEMU
[SHUTDOWN] System halt requested
[SHUTDOWN] Syncing filesystems...
[SHUTDOWN] Goodbye!
The Core Question You're Answering
"How does a computer go from power-on to running user programs, and what does the kernel do to make this possible while keeping user programs isolated from each other and from the hardware?"
This project is the ultimate integration of everything in CS:APP. You'll build the software that sits between bare metal and applications. Every concept you've studied - memory layout, calling conventions, interrupts, virtual memory, process control - comes together here. When you understand how a kernel works, you understand how computers work.
Concepts You Must Understand First
Before writing code, ensure you can explain:
| Concept | Why It Matters | Reference |
|---|---|---|
| x86-64 Boot Process | Understanding real mode, protected mode, and long mode transitions | OSDev Wiki, Intel SDM Vol. 3 |
| Paging & Page Tables | 4-level page tables (PML4/PDPT/PD/PT), how virtual addresses translate to physical | CS:APP 9.6, OSTEP Ch. 18-20 |
| Interrupts & Exceptions | IDT setup, interrupt handlers, CPU privilege levels (rings 0-3) | CS:APP 8.1, Intel SDM Vol. 3 |
| Context Switching | Saving/restoring CPU state, switching between kernel and user stacks | OSTEP Ch. 6, CS:APP 8.2 |
| System Calls | The syscall/sysret mechanism, transitioning between user and kernel mode | CS:APP 8.2, OSTEP Ch. 6 |
| Memory Management | Physical frame allocation, virtual memory mapping, kernel/user space split | CS:APP Ch. 9, OSTEP Ch. 13-23 |
Questions to Guide Your Design
Answer these before writing code:
1. How do you get from BIOS/UEFI to your kernel? (Multiboot? Custom bootloader? UEFI stub?)
2. How do you transition from 32-bit protected mode to 64-bit long mode? (What CR registers must be set? What page tables are required?)
3. How do you organize physical memory? (Bitmap allocator? Free list? Buddy system?)
4. How do you set up kernel virtual memory? (Direct mapping? Higher-half kernel? What goes where?)
5. How do you handle interrupts? (IDT format in 64-bit? How do you save CPU state? What's the interrupt stack?)
6. How do you switch from kernel to user mode? (What registers change? How does iretq work?)
7. How do you implement system calls? (syscall/sysret vs int 0x80? What's the calling convention?)
8. How do you switch between processes? (When does it happen? What state must be saved/restored?)
Thinking Exercise: Trace a System Call
Before implementing, trace through what happens when a user program calls write(1, "hello", 5):
User Space (Ring 3)
-------------------
1. libc wrapper: write() function
- Put syscall number (1) in rax
- Put arguments in rdi=1, rsi=buf, rdx=5
- Execute 'syscall' instruction
CPU Transition (syscall instruction)
------------------------------------
2. CPU actions (automatic, hardware):
- Save rip to rcx
- Save rflags to r11
- Load rip from IA32_LSTAR MSR (your syscall entry point)
- Load CS from IA32_STAR MSR (kernel code segment)
- Load SS (kernel stack segment)
- Mask rflags with IA32_FMASK MSR
- Switch to Ring 0
- NOTE: rsp NOT changed - you must switch stacks!
Kernel Space (Ring 0)
---------------------
3. syscall_entry (assembly):
- swapgs (switch to kernel GS for per-CPU data)
- Save user rsp to per-CPU storage
- Load kernel rsp from per-CPU storage
- Push user context (for later iretq return)
- Call C syscall dispatcher
4. sys_write() handler:
- Validate fd (is 1 a valid file descriptor?)
- Validate buffer pointer (is it in user space? is it mapped?)
- Copy data from user space (carefully!)
- Perform the write to console/file
- Return bytes written
5. Return to user space:
- Pop saved context
- swapgs (restore user GS)
- sysretq (or iretq for more flexibility)
Back to User Space
------------------
6. After sysret:
- CPU restores rip from rcx, rflags from r11
- Switch back to Ring 3
- libc wrapper returns to caller with result in rax
Draw the stack contents at step 3:
Kernel Stack (after saving context):
+------------------+ <- kernel rsp (low)
| user ss |
| user rsp |
| user rflags |
| user cs |
| user rip (rcx) |
| error code (0) | <- interrupt frame
+------------------+
| rax (syscall #) |
| rbx |
| rcx |
| ... | <- general registers
+------------------+
The Interview Questions They'll Ask
Q1: Explain the difference between physical and virtual addresses, and why kernels use virtual memory.
Expected answer: Physical addresses refer to actual RAM locations. Virtual addresses are what the CPU uses; they're translated by the MMU via page tables. Kernels use virtual memory for: (1) isolation between processes - each has its own address space; (2) abstraction - programs don't need to know physical memory layout; (3) demand paging - not all memory needs to be physically present; (4) shared libraries - same physical pages mapped in multiple processes.
Q2: What happens when a page fault occurs?
Expected answer: The CPU raises exception #14 (page fault), pushing an error code with bits indicating: was it a read/write, user/kernel access, page present or not. The kernel's page fault handler examines the faulting address (in CR2) and error code to determine: (1) valid access to unmapped page -> allocate and map a frame; (2) copy-on-write -> copy the page and remap; (3) stack growth -> extend the stack; (4) invalid access -> kill the process with SIGSEGV.
Q3: How does the kernel protect itself from user programs?
Expected answer: Multiple mechanisms: (1) Privilege rings - kernel runs in Ring 0, users in Ring 3; Ring 3 can't execute privileged instructions. (2) Separate page tables - user pages marked as user-accessible, kernel pages as supervisor-only. (3) SMAP/SMEP on modern CPUs - prevent kernel from executing or even accessing user memory without explicit override. (4) System call interface - only way for user code to request kernel services.
Q4: Explain context switching between two processes.
Expected answer: When switching from process A to B: (1) Save A's register state to its kernel stack or PCB; (2) Switch page tables - load B's PML4 into CR3; (3) Switch kernel stacks - change rsp to B's kernel stack; (4) Restore B's register state; (5) Return to B's code. The trigger is usually a timer interrupt (preemption) or a blocking system call (voluntary switch). TLB is flushed on CR3 change unless using PCID.
Q5: What's the difference between exceptions, interrupts, and traps?
Expected answer: All are handled via the IDT but have different sources. Exceptions: synchronous, caused by CPU (divide by zero, page fault) - faults can be restarted, traps advance past the instruction. Hardware interrupts: asynchronous, from devices (keyboard, timer) via the APIC/PIC - the interrupted instruction completes. Software traps: synchronous, explicitly triggered (int, syscall) - used for system calls. All save state and transfer to a handler.
Q6: How would you add SMP (multiprocessor) support to your kernel?
Expected answer: Key challenges: (1) Per-CPU data structures - each CPU needs its own scheduler queue, current process pointer, kernel stack. Use GS segment for per-CPU access. (2) Lock all shared data - use spinlocks for short critical sections; the scheduler needs careful locking. (3) IPI (inter-processor interrupts) - to signal other CPUs for TLB shootdown, reschedule requests. (4) AP bootstrap - secondary CPUs start in real mode; need special boot code to bring them to long mode.
Hints in Layers
Layer 1 - Multiboot Header and Entry
Start with a minimal bootable kernel:
# boot.S - Multiboot2 header and entry point (GAS comments use '#'; ';' is a statement separator)
.section .multiboot
.align 8
multiboot_header:
    .long 0xE85250D6                # Magic
    .long 0                         # Architecture (i386 protected mode)
    .long multiboot_header_end - multiboot_header
    .long -(0xE85250D6 + 0 + (multiboot_header_end - multiboot_header))
    # End tag
    .word 0
    .word 0
    .long 8
multiboot_header_end:
.section .bss
.align 16
stack_bottom:
    .skip 16384                     # 16 KB stack
stack_top:
.section .text
.global _start
.code32
_start:
    mov $stack_top, %esp
    call check_multiboot
    call check_cpuid
    call check_long_mode
    call setup_page_tables
    call enable_paging
    lgdt gdt64_pointer
    ljmp $0x08, $long_mode_start    # far jump loads the 64-bit code segment
.code64
long_mode_start:
    mov $0x10, %ax
    mov %ax, %ds
    mov %ax, %es
    mov %ax, %ss
    call kernel_main
    hlt
Layer 2 - Transition to Long Mode
Set up identity-mapped page tables and enable paging:
setup_page_tables:
    # Map the first 2MB with a huge page: PML4 -> PDPT -> PD (2MB entry)
    mov $pml4, %edi
    mov $pdpt, %eax
    or $0x03, %eax                  # Present + Writable
    mov %eax, (%edi)
    mov $pdpt, %edi
    mov $pd, %eax
    or $0x03, %eax
    mov %eax, (%edi)
    mov $pd, %edi
    mov $0x83, %eax                 # Present + Writable + Huge (2MB)
    mov %eax, (%edi)
    ret
enable_paging:
    mov $pml4, %eax
    mov %eax, %cr3                  # Load page table root
    mov %cr4, %eax
    or $0x20, %eax                  # Enable PAE
    mov %eax, %cr4
    mov $0xC0000080, %ecx           # EFER MSR
    rdmsr
    or $0x100, %eax                 # Enable Long Mode (LME)
    wrmsr
    mov %cr0, %eax
    or $0x80000001, %eax            # Enable Paging + Protection
    mov %eax, %cr0
    ret
Layer 3 - Interrupt Descriptor Table
Set up exception and interrupt handlers:
// interrupt.c
#include <stdint.h>
struct idt_entry {
uint16_t offset_low;
uint16_t selector;
uint8_t ist;
uint8_t type_attr;
uint16_t offset_mid;
uint32_t offset_high;
uint32_t zero;
} __attribute__((packed));
struct idt_entry idt[256];
void set_idt_entry(int n, uint64_t handler, uint8_t type) {
idt[n].offset_low = handler & 0xFFFF;
idt[n].selector = 0x08; // Kernel code segment
idt[n].ist = 0;
idt[n].type_attr = type; // 0x8E = interrupt gate, 0x8F = trap gate
idt[n].offset_mid = (handler >> 16) & 0xFFFF;
idt[n].offset_high = handler >> 32;
idt[n].zero = 0;
}
extern void isr0(void); // Divide error
extern void isr14(void); // Page fault
extern void irq0(void); // Timer
void idt_init(void) {
set_idt_entry(0, (uint64_t)isr0, 0x8E);
set_idt_entry(14, (uint64_t)isr14, 0x8E);
set_idt_entry(32, (uint64_t)irq0, 0x8E);
// ... more handlers
struct { uint16_t size; uint64_t addr; } __attribute__((packed)) idtr;
idtr.size = sizeof(idt) - 1;
idtr.addr = (uint64_t)idt;
asm volatile("lidt %0" : : "m"(idtr));
}
Layer 4 - Physical Memory Allocator
Simple bitmap-based page frame allocator:
// mm.c
#include <stdint.h>
#include <string.h>   // memset
#define PAGE_SIZE 4096
#define MAX_FRAMES (128UL * 1024 * 1024 / PAGE_SIZE) // 128 MB of 4 KB frames
static uint8_t frame_bitmap[MAX_FRAMES / 8];
static uint64_t total_frames;
static uint64_t free_frames;
void pmm_init(uint64_t mem_size, uint64_t kernel_end) {
total_frames = mem_size / PAGE_SIZE;
free_frames = total_frames;
// Mark all as free initially
memset(frame_bitmap, 0, sizeof(frame_bitmap));
// Mark kernel memory as used
uint64_t kernel_frames = (kernel_end + PAGE_SIZE - 1) / PAGE_SIZE;
for (uint64_t i = 0; i < kernel_frames; i++) {
frame_bitmap[i / 8] |= (1 << (i % 8));
free_frames--;
}
}
uint64_t pmm_alloc_frame(void) {
for (uint64_t i = 0; i < total_frames; i++) {
if (!(frame_bitmap[i / 8] & (1 << (i % 8)))) {
frame_bitmap[i / 8] |= (1 << (i % 8));
free_frames--;
return i * PAGE_SIZE;
}
}
return 0; // Out of memory
}
void pmm_free_frame(uint64_t addr) {
uint64_t frame = addr / PAGE_SIZE;
frame_bitmap[frame / 8] &= ~(1 << (frame % 8));
free_frames++;
}
Layer 5 - Process and Context Switch
Basic process structure and switching:
// process.c
struct context {
uint64_t rsp;
uint64_t rbp;
uint64_t rbx;
uint64_t r12;
uint64_t r13;
uint64_t r14;
uint64_t r15;
uint64_t rip;
};
struct process {
int pid;
enum { RUNNING, READY, BLOCKED, ZOMBIE } state;
struct context context;
uint64_t *page_table;
uint64_t kernel_stack;
uint64_t user_stack;
};
struct process *current;
struct process processes[MAX_PROCESSES];
// Assembly context switch (in switch.S)
// void switch_context(struct context *old, struct context *new);
void schedule(void) {
struct process *next = find_next_runnable();
if (next == current) return;
struct process *prev = current;
current = next;
// Switch page tables
asm volatile("mov %0, %%cr3" : : "r"(next->page_table));
// Switch context
switch_context(&prev->context, &next->context);
}
Books That Will Help
| Book | Chapter(s) | What You'll Learn |
|---|---|---|
| CS:APP | Ch. 9 | Virtual memory fundamentals, page tables |
| CS:APP | Ch. 8 | Exceptions, interrupts, process control |
| CS:APP | Ch. 3 | x86-64 assembly for boot code and handlers |
| OSTEP | Ch. 4-6 | Process abstraction, scheduling, context switching |
| OSTEP | Ch. 13-23 | Virtual memory, paging, swapping |
| OSTEP | Ch. 26-32 | Concurrency, locks, condition variables |
| Intel SDM Vol. 3 | Ch. 2-6 | Protected mode, paging, interrupts |
| OSDev Wiki | Various | Practical tutorials for each component |
| xv6 Book | All | Complete teaching OS with clean code |
Common Pitfalls & Debugging
Problem 1: Triple fault on boot (QEMU resets immediately)
Symptom: QEMU restarts as soon as kernel loads, or immediately after enabling paging.
Cause: Invalid page tables, IDT not set up, or exception in exception handler.
Debug: Use QEMUโs debug options:
$ qemu-system-x86_64 -kernel kernel.bin -d int,cpu_reset -no-reboot
# Shows interrupts and why reset occurred
$ qemu-system-x86_64 -kernel kernel.bin -s -S
# Starts paused; attach GDB: target remote :1234
Problem 2: Page fault in kernel mode
Symptom: Page fault at unexpected address, usually during initialization.
Cause: Accessing unmapped memory, or page tables not set up correctly.
Fix: Verify your page table mappings:
// Debug: print page table entries
void debug_pagewalk(uint64_t addr) {
uint64_t *pml4 = (uint64_t *)read_cr3();
uint64_t pml4e = pml4[(addr >> 39) & 0x1FF];
printf("PML4[%lu] = 0x%lx\n", (unsigned long)((addr >> 39) & 0x1FF), (unsigned long)pml4e);
// ... continue for PDPT, PD, PT
}
Problem 3: Interrupts not working
Symptom: Timer interrupt never fires, keyboard doesnโt respond.
Cause: IDT not loaded, PIC not configured, or interrupts disabled (CLI).
Fix: Verify interrupt setup:
// Check if interrupts enabled
uint64_t flags;
asm volatile("pushfq; pop %0" : "=r"(flags));
if (!(flags & 0x200)) {
printf("Interrupts disabled (IF=0)!\n");
asm volatile("sti");
}
// Verify PIC is sending interrupts
outb(0x20, 0x0A); // Read IRR
printf("PIC IRR: 0x%x\n", inb(0x20));
Problem 4: User program crashes immediately
Symptom: General protection fault or page fault when switching to user mode.
Cause: User page tables wrong, wrong CS/SS for user mode, or stack not set up.
Fix: Verify the iretq frame is correct:
// Stack for iretq to user mode:
// [RSP+32] SS = 0x23 (user data | RPL=3)
// [RSP+24] RSP = user stack pointer
// [RSP+16] RFLAGS = 0x202 (IF set)
// [RSP+8] CS = 0x1B (user code | RPL=3)
// [RSP+0] RIP = user entry point
Legacy project list, re-numbered to match the expanded guides in CSAPP_3E_DEEP_LEARNING_PROJECTS/:
| P# | Legacy Project | Expanded guide |
|---|---|---|
| P02 + P03 | Bit Manipulation Puzzle Solver (Data Lab) | P02-bitwise-data-inspector.md, P03-data-lab-clone.md |
| P05 | Binary Bomb Defuser | P05-bomb-lab-workflow.md |
| P06 | Buffer Overflow Exploit Lab (Attack Lab) | P06-attack-lab-workflow.md |
| P07 | Y86-64 Processor Simulator | P07-y86-64-cpu-simulator.md |
| P09 | Cache Simulator | P09-cache-lab-simulator.md |
| P14 | Dynamic Memory Allocator (Malloc Lab) | P14-build-your-own-malloc.md |
| P11 + P12 | Unix Shell Implementation | P11-signals-processes-sandbox.md, P12-unix-shell-job-control.md |
| P18 | ELF Linker and Loader | P18-elf-linker-and-loader.md |
| P19 | Virtual Memory Simulator | P19-virtual-memory-simulator.md |
| P15 | Robust I/O Library (RIO) | P15-robust-unix-io-toolkit.md |
| P20 | HTTP Web Server | P20-http-web-server.md |
| P17 | Concurrent Web Proxy | P17-csapp-capstone-proxy.md |
| P21 | Thread Pool Implementation | P21-thread-pool-implementation.md |
| P22 | Signal-Safe Printf | P22-signal-safe-printf.md |
| P23 | Performance Profiler | P23-performance-profiler.md |
| P24 | Memory Leak Detector | P24-memory-leak-detector.md |
| P25 | Debugger (ptrace-based) | P25-debugger-ptrace.md |
| P26 | Operating System Kernel Capstone | P26-operating-system-kernel-capstone.md |
Project Comparison Table
| # | Project | Difficulty | Time | Understanding | Fun |
|---|---|---|---|---|---|
| 1 | Toolchain Explorer | Intermediate | 1–2 wk | ★★★★ | ★★★★ |
| 2 | Bitwise Data Inspector | Intermediate | 0.5–2 wk | ★★★★ | ★★★★ |
| 3 | Data Lab Clone | Advanced | 1–2 wk | ★★★★ | ★★★★ |
| 4 | Calling Convention Crash Cart | Advanced | 1–2 wk | ★★★★ | ★★★★ |
| 5 | Bomb Lab Workflow | Advanced | 1–2 wk | ★★★★ | ★★★★ |
| 6 | Attack Lab Workflow | Expert | 2–3 wk | ★★★★ | ★★★★ |
| 7 | Y86-64 CPU Simulator | Expert | 1 mo+ | ★★★★ | ★★★★ |
| 8 | Performance Clinic | Advanced | 1–2 wk | ★★★★ | ★★★★ |
| 9 | Cache Simulator + Visualizer | Advanced | 2–3 wk | ★★★★ | ★★★★ |
| 10 | ELF Link Map + Interposition | Advanced | 2–3 wk | ★★★★ | ★★★★ |
| 11 | Signals + Processes Sandbox | Advanced | 1–2 wk | ★★★★ | ★★★★ |
| 12 | Unix Shell with Job Control | Advanced | 2–3 wk | ★★★★ | ★★★★ |
| 13 | VM Map Visualizer | Advanced | 1–2 wk | ★★★★ | ★★★★ |
| 14 | Build Your Own Malloc | Expert | 1 mo+ | ★★★★ | ★★★★ |
| 15 | Robust Unix I/O Toolkit | Intermediate | 1–2 wk | ★★★★ | ★★★★ |
| 16 | Concurrency Workbench | Expert | 2–3 wk | ★★★★ | ★★★★ |
| 17 | Capstone Proxy | Expert | 2–3 mo | ★★★★ | ★★★★ |
| 18 | ELF Linker and Loader | Expert | 2–3 wk | ★★★★ | ★★★★ |
| 19 | Virtual Memory Simulator | Expert | 2–3 wk | ★★★★ | ★★★★ |
| 20 | HTTP Web Server | Advanced | 1–2 wk | ★★★★ | ★★★★ |
| 21 | Thread Pool Implementation | Advanced | 1–2 wk | ★★★★ | ★★★★ |
| 22 | Signal-Safe Printf | Advanced | Weekend | ★★★★ | ★★★★ |
| 23 | Performance Profiler | Advanced | 1–2 wk | ★★★★ | ★★★★ |
| 24 | Memory Leak Detector | Advanced | 1–2 wk | ★★★★ | ★★★★ |
| 25 | Debugger (ptrace-based) | Expert | 2–4 wk | ★★★★ | ★★★★ |
| 26 | OS Kernel Capstone | Expert | 2–3 mo | ★★★★ | ★★★★ |
Skills Matrix
| Project | Ch.1 | Ch.2 | Ch.3 | Ch.4 | Ch.5 | Ch.6 | Ch.7 | Ch.8 | Ch.9 | Ch.10 | Ch.11 | Ch.12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| P1: Toolchain | ●● |  |  |  |  |  | ●● |  |  |  |  |  |
| P2: Bitwise |  | ●●● | ● |  |  |  |  |  |  |  |  |  |
| P3: Data Lab |  | ●●● |  |  |  |  |  |  |  |  |  |  |
| P4: Crash Cart |  |  | ●●● |  |  |  |  |  |  |  |  |  |
| P5: Bomb Lab |  |  | ●●● |  |  |  |  |  |  |  |  |  |
| P6: Attack Lab |  |  | ●●● |  |  |  |  |  |  |  |  |  |
| P7: Y86-64 |  |  |  | ●●● | ● |  |  |  |  |  |  |  |
| P8: Perf Clinic | ● |  |  |  | ●●● | ●● |  |  |  |  |  |  |
| P9: Cache Lab |  |  |  |  | ●● | ●●● |  |  |  |  |  |  |
| P10: ELF Link |  |  |  |  |  |  | ●●● |  |  |  |  |  |
| P11: Signals |  |  |  |  |  |  |  | ●●● |  |  |  |  |
| P12: Shell |  |  |  |  |  |  |  | ●●● |  |  |  | ● |
| P13: VM Map |  |  |  |  |  |  |  | ●● | ●●● |  |  |  |
| P14: Malloc |  |  |  |  |  | ●● |  |  | ●●● |  |  |  |
| P15: Unix I/O |  |  |  |  |  |  |  |  | ●● | ●●● |  |  |
| P16: Concurrency |  |  |  |  |  |  |  |  |  |  |  | ●●● |
| P17: Capstone | ● | ● | ● |  | ● | ●● | ● | ● | ●● | ●● | ●● | ●● |
| P18: ELF Linker |  |  |  |  |  |  | ●●● | ●● |  |  |  |  |
| P19: VM Simulator |  |  |  |  |  | ●● |  |  | ●●● |  |  |  |
| P20: HTTP Server |  |  |  |  |  |  |  |  |  | ●●● | ●●● |  |
| P21: Thread Pool |  |  |  |  |  |  |  |  |  |  |  | ●●● |
| P22: Signal Printf |  |  |  |  |  |  |  | ●●● |  |  |  | ●● |
| P23: Profiler |  |  | ●● |  | ●●● |  |  | ●● |  |  |  |  |
| P24: Leak Detector |  |  |  |  |  |  | ●● |  | ●●● |  |  |  |
| P25: Debugger |  |  | ●●● |  |  |  |  | ●●● |  |  |  |  |
| P26: OS Kernel | ● | ● | ●● | ●● | ● | ●● | ●● | ●● | ●● | ●● |  | ●● |
| Legend: ●●● = Primary focus | ●● = Significant coverage | ● = Touches on |
Resources
Official CS:APP Materials
- Book: Computer Systems: A Programmer's Perspective, 3rd Edition – Bryant & O'Hallaron
- Lab Materials: csapp.cs.cmu.edu/3e/labs.html
- Student Site: csapp.cs.cmu.edu/3e/students.html
Supplementary Books
- Effective C, 2nd Edition – Robert C. Seacord (modern C practices)
- C Interfaces and Implementations – David R. Hanson (allocator design)
- Operating Systems: Three Easy Pieces – Arpaci-Dusseau (concurrency, VM)
- Computer Organization and Design – Patterson & Hennessy (architecture)
Tools
- Debuggers: GDB, LLDB
- Disassemblers: objdump, Ghidra, Binary Ninja
- Profilers: perf, Valgrind, cachegrind
- Build: Make, CMake, gcc/clang
Summary
| # | Project | Language |
|---|---|---|
| 1 | Hello, Toolchain – Build Pipeline Explorer | C |
| 2 | Bitwise Data Inspector | C |
| 3 | Data Lab Clone | C |
| 4 | x86-64 Calling Convention Crash Cart | C |
| 5 | Bomb Lab Workflow | C |
| 6 | Attack Lab Workflow | C |
| 7 | Y86-64 CPU Simulator | C |
| 8 | Performance Clinic | C |
| 9 | Cache Lab++ | C |
| 10 | ELF Link Map & Interposition Toolkit | C |
| 11 | Signals + Processes Sandbox | C |
| 12 | Unix Shell with Job Control | C |
| 13 | Virtual Memory Map Visualizer | C |
| 14 | Build Your Own Malloc | C |
| 15 | Robust Unix I/O Toolkit | C |
| 16 | Concurrency Workbench | C |
| 17 | CS:APP Capstone Proxy Platform | C |
| 18 | ELF Linker and Loader | C |
| 19 | Virtual Memory Simulator | C |
| 20 | HTTP Web Server | C |
| 21 | Thread Pool Implementation | C |
| 22 | Signal-Safe Printf | C |
| 23 | Performance Profiler | C |
| 24 | Memory Leak Detector | C |
| 25 | Debugger (ptrace-based) | C |
| 26 | Operating System Kernel Capstone | C |
Merged Additions (from LEARN_CSAPP_COMPUTER_SYSTEMS.md)
This file (CSAPP_3E_DEEP_LEARNING_PROJECTS.md) is the canonical "one main file + expanded project guides" path. LEARN_CSAPP_COMPUTER_SYSTEMS.md remains in the repo as a legacy snapshot, but its unique projects and learning-plan ideas are consolidated here so you have a single place to start.
The full legacy document is also included verbatim in Appendix A at the end of this file (collapsed by default).
Overlap Map (Project Equivalents)
| LEARN_CSAPP_COMPUTER_SYSTEMS.md | Closest match in this path | Notes |
|---|---|---|
| Project 1: Data Lab | P2 + P3 | This path splits "inspect representations" and "constraints-style bit puzzles". |
| Project 2: Bomb Lab | P5 | Same lab domain; this path emphasizes a repeatable workflow + writeups. |
| Project 3: Attack Lab | P6 | Same lab domain; this path emphasizes workflow + post-mortems. |
| Project 4: Y86-64 Simulator | P7 | Same core learning objective. |
| Project 5: Cache Simulator | P9 | This path adds locality visualization and "why it's slow" instrumentation. |
| Project 6: Malloc Lab | P14 | Same domain; this path pushes allocator design further. |
| Project 7: Unix Shell | P12 (+P11) | This path explicitly builds signal/process discipline first. |
| Project 10: Robust I/O | P15 | Same domain; this path frames it as a reusable toolkit. |
| Project 12: Concurrent Proxy | P17 (+P16) | This path makes the proxy the capstone and treats thread pools as a prerequisite skill. |
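For concreteness, the heart of the robust I/O toolkit (P15) is the discipline behind the book's `rio_readn`: wrap `read(2)` so that short counts are looped over and `EINTR` interruptions are retried instead of surfaced to the caller. A minimal sketch of that function (not the book's exact code):

```c
#include <errno.h>
#include <unistd.h>

/* Robustly read n bytes: loop over short counts, restart on EINTR.
 * Returns the number of bytes read (short only at EOF) or -1 on error. */
ssize_t rio_readn(int fd, void *usrbuf, size_t n) {
    size_t nleft = n;
    char *bufp = usrbuf;
    while (nleft > 0) {
        ssize_t nread = read(fd, bufp, nleft);
        if (nread < 0) {
            if (errno == EINTR)   /* interrupted by a signal: retry */
                continue;
            return -1;            /* real error */
        }
        if (nread == 0)           /* EOF */
            break;
        nleft -= (size_t)nread;
        bufp += nread;
    }
    return (ssize_t)(n - nleft);  /* bytes actually read */
}
```

The same retry-on-`EINTR` pattern applies to `write(2)`, which is why the path treats this as a reusable toolkit rather than a one-off lab exercise.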
Bonus Projects (Build More of the Stack)
These are additional projects from LEARN_CSAPP_COMPUTER_SYSTEMS.md that are valuable but not part of the core "P1–P17" dependency graph. Each one has an expanded guide in the same folder as the original 17 projects.
| # | Project | Expanded guide |
|---|---|---|
| 18 | ELF Linker and Loader | P18-elf-linker-and-loader.md |
| 19 | Virtual Memory Simulator | P19-virtual-memory-simulator.md |
| 20 | HTTP Web Server | P20-http-web-server.md |
| 21 | Thread Pool Implementation | P21-thread-pool-implementation.md |
| 22 | Signal-Safe Printf (Async-Signal-Safe Logging) | P22-signal-safe-printf.md |
| 23 | Performance Profiler | P23-performance-profiler.md |
| 24 | Memory Leak Detector | P24-memory-leak-detector.md |
| 25 | Debugger (ptrace-based) | P25-debugger-ptrace.md |
| 26 | Final Capstone: Operating System Kernel | P26-operating-system-kernel-capstone.md |
Alternate Time-Based Phases (Optional)
If you prefer a calendar-based plan (instead of dependency-based), the legacy file proposes these phases; they map cleanly onto this path:
- Foundation (4–6 weeks): P1–P4 (+P2/P3 depth as needed)
- Hardware Understanding (4–6 weeks): P7–P9
- System Software (6–8 weeks): P10–P14
- I/O and Networking (3–4 weeks): P15 + (P20 optional) + P17 basics
- Concurrency (4–6 weeks): P16 + P21 + P17 scaling
- Advanced Topics (4+ weeks): P18, P19, P22–P25
- Post-CS:APP Capstone (months): P26
Last updated: December 2025