Project 23: Performance Profiler

Build a sampling profiler that periodically interrupts a program, records where it is executing, and reports the hottest functions and call stacks. Along the way you will master timer signals, symbol resolution, and statistical measurement.


Quick Reference

Attribute        Value
───────────────  ─────────────────────
Language         C (alt: C++, Rust)
Difficulty       Advanced
Time             1-2 weeks
Chapters         5, 8, 3
Coolness         ★★★★★ Industry Tool
Portfolio Value  Very High

Learning Objectives

By completing this project, you will:

  1. Master timer signal mechanics: Implement timer-based sampling using SIGPROF and ITIMER_PROF
  2. Understand signal handler context: Extract instruction pointers from the interrupted execution context
  3. Build efficient sample collection: Design async-signal-safe data structures for collecting samples
  4. Implement symbol resolution: Map raw addresses back to function names using dladdr and addr2line
  5. Apply statistical sampling theory: Understand sampling error, aliasing, and confidence intervals
  6. Differentiate timing domains: Distinguish between CPU time, wall-clock time, and their use cases
  7. Capture call stacks: Implement stack trace capture with frame pointer walking or libunwind
  8. Generate actionable reports: Create flat profiles, call graphs, and flame graph output
  9. Measure profiler overhead: Quantify and minimize the observer effect
  10. Handle address space layout randomization: Deal with ASLR and PIE in symbol resolution

Deep Theoretical Foundation

What Is a Profiler?

A profiler answers the fundamental question: “Where does my program spend its time?” This seemingly simple question has deep implications for performance optimization.

┌────────────────────────────────────────────────────────────────────┐
│                    WHY PROFILING MATTERS                            │
├────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Without Profiling:                With Profiling:                  │
│  ─────────────────────            ──────────────────                │
│                                                                     │
│  "I think function X is slow"     "Function X uses 45% of CPU"     │
│  "Let me optimize everything"     "Optimizing X gives 2x speedup"  │
│  "Why is this still slow?"        "Function Y is now the bottleneck"│
│                                                                     │
│  Result: Wasted effort            Result: Targeted optimization     │
│                                                                     │
│  ═══════════════════════════════════════════════════════════════   │
│                                                                     │
│  Amdahl's Law:                                                      │
│  ────────────                                                       │
│                                                                     │
│  If a function uses F% of total time, and you speed it up by S:    │
│                                                                     │
│                        1                                            │
│  Overall speedup = ─────────────────                                │
│                    (1-F) + (F/S)                                    │
│                                                                     │
│  Example: Function uses 50% of time, you make it 10x faster:       │
│                                                                     │
│           1 / (0.5 + 0.5/10) = 1 / 0.55 = 1.82x speedup            │
│                                                                     │
│  Without profiling, you might optimize a 5% function:               │
│                                                                     │
│           1 / (0.95 + 0.05/10) = 1 / 0.955 = 1.05x speedup         │
│                                                                     │
│  PROFILING TELLS YOU WHERE TO FOCUS YOUR EFFORT                    │
│                                                                     │
└────────────────────────────────────────────────────────────────────┘

Sampling vs Instrumentation Profiling

There are two fundamental approaches to profiling:

┌────────────────────────────────────────────────────────────────────┐
│               SAMPLING vs INSTRUMENTATION PROFILING                 │
├────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  INSTRUMENTATION PROFILING                                          │
│  ─────────────────────────                                          │
│                                                                     │
│  How it works:                                                      │
│  - Modify every function entry/exit                                 │
│  - Insert timing code at each point                                 │
│  - Count calls and measure time exactly                             │
│                                                                     │
│  Original:              Instrumented:                               │
│  ┌──────────────┐       ┌──────────────────────────────┐           │
│  │ void foo() { │       │ void foo() {                 │           │
│  │   // code    │  ───► │   __profile_enter("foo");    │           │
│  │ }            │       │   // code                    │           │
│  └──────────────┘       │   __profile_exit("foo");     │           │
│                         │ }                            │           │
│                         └──────────────────────────────┘           │
│                                                                     │
│  Advantages:                                                        │
│  + Exact call counts                                                │
│  + Complete call tree                                               │
│  + No statistical error                                             │
│                                                                     │
│  Disadvantages:                                                     │
│  - High overhead (10-100x slowdown possible)                        │
│  - Overhead proportional to call frequency                          │
│  - Must modify/recompile code                                       │
│  - Can change program behavior (observer effect)                    │
│                                                                     │
│  ═══════════════════════════════════════════════════════════════   │
│                                                                     │
│  SAMPLING PROFILING (What we're building)                          │
│  ─────────────────────────────────────────                         │
│                                                                     │
│  How it works:                                                      │
│  - Periodically interrupt the program (e.g., 1000 Hz)              │
│  - Record where it was executing (instruction pointer)             │
│  - Build statistical picture of time distribution                  │
│                                                                     │
│  ┌────────────────────────────────────────────────────────┐        │
│  │ Normal Execution                                        │        │
│  │                                                         │        │
│  │  time ─────────────────────────────────────────────►   │        │
│  │       ↑    ↑    ↑    ↑    ↑    ↑    ↑    ↑    ↑       │        │
│  │       │    │    │    │    │    │    │    │    │       │        │
│  │    Sample Sample Sample Sample ... (periodic)          │        │
│  │       │    │    │    │                                 │        │
│  │       └────┴────┴────┴── Record IP at each interrupt   │        │
│  └────────────────────────────────────────────────────────┘        │
│                                                                     │
│  Advantages:                                                        │
│  + Low, constant overhead (~2-5%)                                   │
│  + Works on unmodified binaries                                     │
│  + Overhead independent of call frequency                           │
│  + Captures real execution behavior                                 │
│                                                                     │
│  Disadvantages:                                                     │
│  - Statistical error (may miss short functions)                     │
│  - Cannot count exact calls                                         │
│  - May have sampling aliasing                                       │
│  - Call attribution can be tricky                                   │
│                                                                     │
└────────────────────────────────────────────────────────────────────┘

Timer Signals and Interval Timers

Understanding Unix timer signals is essential for building a sampling profiler:

┌────────────────────────────────────────────────────────────────────┐
│                    INTERVAL TIMERS (setitimer)                      │
├────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  struct itimerval {                                                 │
│      struct timeval it_interval;  // Interval for periodic timer   │
│      struct timeval it_value;     // Time until next expiration    │
│  };                                                                 │
│                                                                     │
│  THREE TIMER TYPES:                                                 │
│  ──────────────────                                                 │
│                                                                     │
│  1. ITIMER_REAL (delivers SIGALRM)                                 │
│     ─────────────────────────────                                  │
│     Counts WALL-CLOCK time (real time)                             │
│                                                                     │
│     ┌──────────────────────────────────────────────────┐           │
│     │ Program: ███ RUNNING ███   SLEEPING   ███ RUN ███│           │
│     │ Timer:   ──────────────────────────────────────►  │           │
│     │          │        │        │        │        │    │           │
│     │          Fires even when sleeping!                │           │
│     └──────────────────────────────────────────────────┘           │
│                                                                     │
│     Use case: Timeouts, wall-clock profiling                       │
│                                                                     │
│  2. ITIMER_VIRTUAL (delivers SIGVTALRM)                            │
│     ───────────────────────────────────                            │
│     Counts only USER CPU time (your code, not kernel)              │
│                                                                     │
│     ┌──────────────────────────────────────────────────┐           │
│     │ User:    ███ USER ███                 ███ USER ███│           │
│     │ Kernel:              ███ KERNEL ███               │           │
│     │ Sleep:                              SLEEP         │           │
│     │ Timer:   ────────────               ────────────►  │           │
│     │          │      │                   │      │      │           │
│     │          Only fires during user mode              │           │
│     └──────────────────────────────────────────────────┘           │
│                                                                     │
│     Use case: Profile only your code, not system calls             │
│                                                                     │
│  3. ITIMER_PROF (delivers SIGPROF)  ← USE THIS FOR PROFILING       │
│     ──────────────────────────────                                 │
│     Counts USER + SYSTEM CPU time (all CPU work)                   │
│                                                                     │
│     ┌──────────────────────────────────────────────────┐           │
│     │ User:    ███ USER ███                 ███ USER ███│           │
│     │ Kernel:              ███ KERNEL ███               │           │
│     │ Sleep:                              SLEEP         │           │
│     │ Timer:   ─────────────────────────  ────────────►  │           │
│     │          │    │    │    │          │    │        │           │
│     │          Fires during user AND kernel time        │           │
│     └──────────────────────────────────────────────────┘           │
│                                                                     │
│     Use case: Full CPU profiling (what we want!)                   │
│                                                                     │
│  ═══════════════════════════════════════════════════════════════   │
│                                                                     │
│  WHY ITIMER_PROF FOR CPU PROFILING?                                │
│  ──────────────────────────────────                                │
│                                                                     │
│  Example: Program that does I/O-heavy work                         │
│                                                                     │
│  Timeline:  [USER][SYS][WAIT..........][USER][SYS][WAIT...]        │
│              2ms  3ms     50ms          2ms  3ms    50ms            │
│                                                                     │
│  ITIMER_REAL at 1000 Hz: 55 samples/iteration                      │
│    - 5 samples in CPU, 50 samples in wait                          │
│    - "Program spends 91% of time waiting" (not useful for CPU opt) │
│                                                                     │
│  ITIMER_PROF at 1000 Hz: 5 samples/iteration                       │
│    - 2 samples in user, 3 samples in kernel                        │
│    - "40% user code, 60% kernel code" (useful!)                    │
│                                                                     │
└────────────────────────────────────────────────────────────────────┘

Signal Handlers and Execution Context

When a signal interrupts your program, the kernel provides context about where the interruption occurred:

┌────────────────────────────────────────────────────────────────────┐
│                    SIGNAL HANDLER CONTEXT                           │
├────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Signal handler prototype with SA_SIGINFO:                         │
│                                                                     │
│  void handler(int signum, siginfo_t *info, void *context);         │
│                                                                     │
│  The 'context' parameter is a pointer to ucontext_t:               │
│                                                                     │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ ucontext_t                                                   │   │
│  │ ├── uc_flags                                                 │   │
│  │ ├── uc_link        (link to next context)                   │   │
│  │ ├── uc_stack       (signal stack info)                      │   │
│  │ ├── uc_mcontext    (machine context - THE GOLD!)            │   │
│  │ │   └── gregs[]    (general purpose registers)              │   │
│  │ │       ├── gregs[REG_RAX]  (return value)                  │   │
│  │ │       ├── gregs[REG_RBX]  (callee-saved)                  │   │
│  │ │       ├── gregs[REG_RCX]  (4th argument)                  │   │
│  │ │       ├── gregs[REG_RDX]  (3rd argument)                  │   │
│  │ │       ├── gregs[REG_RSI]  (2nd argument)                  │   │
│  │ │       ├── gregs[REG_RDI]  (1st argument)                  │   │
│  │ │       ├── gregs[REG_RBP]  (frame pointer) ← Stack walking │   │
│  │ │       ├── gregs[REG_RSP]  (stack pointer)                 │   │
│  │ │       └── gregs[REG_RIP]  (instruction ptr) ← WE WANT THIS│   │
│  │ └── uc_sigmask     (blocked signals)                        │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  EXTRACTING THE INSTRUCTION POINTER:                               │
│  ───────────────────────────────────                               │
│                                                                     │
│  Linux x86-64:                                                      │
│  ─────────────                                                      │
│  #include <ucontext.h>                                              │
│  #define GET_IP(ctx) ((void *)(((ucontext_t *)ctx)->uc_mcontext.gregs[REG_RIP]))│
│                                                                     │
│  macOS x86-64:                                                      │
│  ─────────────                                                      │
│  #define GET_IP(ctx) ((void *)(((ucontext_t *)ctx)->uc_mcontext->__ss.__rip))│
│                                                                     │
│  macOS ARM64:                                                       │
│  ─────────────                                                      │
│  #define GET_IP(ctx) ((void *)(((ucontext_t *)ctx)->uc_mcontext->__ss.__pc))│
│                                                                     │
│  Linux ARM64:                                                       │
│  ─────────────                                                      │
│  #define GET_IP(ctx) ((void *)(((ucontext_t *)ctx)->uc_mcontext.pc))│
│                                                                     │
│  ═══════════════════════════════════════════════════════════════   │
│                                                                     │
│  WHAT THE INSTRUCTION POINTER TELLS US:                            │
│  ──────────────────────────────────────                            │
│                                                                     │
│  When sampled at IP = 0x401234:                                    │
│                                                                     │
│  ┌─────────────────────────────────────────────────────┐           │
│  │ 0x401230: push   %rbp           │                   │           │
│  │ 0x401231: mov    %rsp,%rbp      │  function_foo()  │           │
│  │ 0x401234: add    %eax,%ebx      │ ← Sample here!   │           │
│  │ 0x401236: call   bar            │                   │           │
│  │ 0x40123b: pop    %rbp           │                   │           │
│  │ 0x40123c: ret                   │                   │           │
│  └─────────────────────────────────────────────────────┘           │
│                                                                     │
│  This tells us: "At this moment, CPU was executing function_foo()" │
│                                                                     │
│  By collecting many samples, we build a statistical picture        │
│  of where the program spends its time.                             │
│                                                                     │
└────────────────────────────────────────────────────────────────────┘

Statistical Sampling Theory

Understanding the statistics behind sampling is crucial for interpreting profiler results:

┌────────────────────────────────────────────────────────────────────┐
│                    STATISTICAL SAMPLING THEORY                      │
├────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  BASIC PRINCIPLE:                                                   │
│  ────────────────                                                   │
│                                                                     │
│  If function F runs for T% of total CPU time, and we collect       │
│  N random samples, we expect ~N*T% samples to be in function F.    │
│                                                                     │
│  Example: F uses 25% of CPU, we collect 1000 samples              │
│           Expected: ~250 samples in F (±some error)               │
│                                                                     │
│  ═══════════════════════════════════════════════════════════════   │
│                                                                     │
│  SAMPLING ERROR (Confidence Intervals):                            │
│  ──────────────────────────────────────                            │
│                                                                     │
│  Standard error of proportion: SE = sqrt(p*(1-p)/n)                │
│                                                                     │
│  Where: p = true proportion (unknown)                              │
│         n = number of samples                                      │
│                                                                     │
│  95% Confidence Interval: p ± 1.96 * SE                            │
│                                                                     │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ Samples    True 25%      Error Range (95% CI)               │   │
│  │ ────────   ─────────     ─────────────────────              │   │
│  │    100     25 samples    25% ± 8.5%   (16.5% to 33.5%)     │   │
│  │   1000     250 samples   25% ± 2.7%   (22.3% to 27.7%)     │   │
│  │  10000     2500 samples  25% ± 0.85%  (24.2% to 25.8%)     │   │
│  │ 100000     25000 samples 25% ± 0.27%  (24.7% to 25.3%)     │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  KEY INSIGHT: Error decreases with sqrt(n)                         │
│               10x more samples → 3.16x more precision              │
│                                                                     │
│  ═══════════════════════════════════════════════════════════════   │
│                                                                     │
│  THE ALIASING PROBLEM:                                              │
│  ────────────────────                                               │
│                                                                     │
│  Sampling at fixed intervals can miss periodic behavior:           │
│                                                                     │
│  Function runs every 10ms for 2ms:                                 │
│  Time:  |----|----|----|----|----|----|----|----|----|----|        │
│         0   10   20   30   40   50   60   70   80   90   100       │
│  Func:  ██        ██        ██        ██        ██                  │
│                                                                     │
│  Sampling at exactly 10ms intervals:                               │
│  Samples: ↓         ↓         ↓         ↓         ↓                │
│  Time:    0        10        20        30        40                │
│                                                                     │
│  If samples always hit BETWEEN function runs → 0% measured!        │
│  If samples always hit DURING function runs → 100% measured!       │
│                                                                     │
│  SOLUTIONS:                                                         │
│  ──────────                                                         │
│  1. Use a sample rate that doesn't divide evenly into periods      │
│  2. Add small random jitter to sample times                        │
│  3. Collect more samples over longer periods                       │
│                                                                     │
│  ═══════════════════════════════════════════════════════════════   │
│                                                                     │
│  MINIMUM DETECTABLE FUNCTION:                                       │
│  ────────────────────────────                                       │
│                                                                     │
│  If sampling at frequency F Hz, and function runs for duration D:  │
│                                                                     │
│  Expected samples = F * D                                          │
│                                                                     │
│  Example: Sampling at 1000 Hz                                       │
│  - Function runs 1ms   → expect 1 sample                           │
│  - Function runs 0.5ms → expect 0.5 samples (might miss it!)      │
│  - Function runs 0.1ms → expect 0.1 samples (will usually miss)   │
│                                                                     │
│  Rule of thumb: Functions using <1% of CPU may not be measured     │
│                                                                     │
└────────────────────────────────────────────────────────────────────┘

CPU Time vs Wall-Clock Time

Understanding these timing domains is essential for choosing the right profiling approach:

┌────────────────────────────────────────────────────────────────────┐
│               CPU TIME vs WALL-CLOCK TIME                           │
├────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  WALL-CLOCK TIME (Real time):                                      │
│  ────────────────────────────                                      │
│                                                                     │
│  Total elapsed time from start to finish, as measured by a clock   │
│  on the wall. Includes everything: CPU work, I/O wait, sleep,     │
│  time spent waiting for locks, etc.                                │
│                                                                     │
│  CPU TIME:                                                          │
│  ─────────                                                          │
│                                                                     │
│  Time the CPU was actually executing your process's code.          │
│  Can be split into:                                                 │
│  - User time: Executing your application code                      │
│  - System time: Executing kernel code on your behalf               │
│                                                                     │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                                                              │   │
│  │  EXAMPLE: Web server handling request                        │   │
│  │                                                              │   │
│  │  Timeline (wall clock):                                      │   │
│  │  ├──────┬──────┬──────────────┬──────┬──────┬────────────┤   │   │
│  │  │ Parse│ Query│   DB Wait    │ Parse│Render│   Send     │   │   │
│  │  │ req  │ build│   (I/O)      │result│ HTML │   resp     │   │   │
│  │  │ 2ms  │ 1ms  │   50ms       │ 2ms  │ 5ms  │   10ms     │   │   │
│  │  ├──────┴──────┴──────────────┴──────┴──────┴────────────┤   │   │
│  │  │                                                        │   │   │
│  │  │ Wall-clock time: 70ms total                           │   │   │
│  │  │ CPU time: 10ms (only Parse+Query+Parse+Render)        │   │   │
│  │  │ I/O wait: 60ms (DB + network send)                    │   │   │
│  │  │                                                        │   │   │
│  │  └────────────────────────────────────────────────────────┘   │
│  │                                                              │   │
│  │  CPU PROFILE (ITIMER_PROF) would show:                       │   │
│  │  ─────────────────────────────────────                       │   │
│  │  - ParseRequest: 20%                                         │   │
│  │  - BuildQuery: 10%                                           │   │
│  │  - ParseResult: 20%                                          │   │
│  │  - RenderHTML: 50%                                           │   │
│  │                                                              │   │
│  │  → "RenderHTML is the CPU bottleneck" (correct!)            │   │
│  │                                                              │   │
│  │  WALL-CLOCK PROFILE (ITIMER_REAL) would show:               │   │
│  │  ─────────────────────────────────────────                   │   │
│  │  - ParseRequest: 2.9%                                        │   │
│  │  - BuildQuery: 1.4%                                          │   │
│  │  - DB_Wait: 71.4%                                            │   │
│  │  - ParseResult: 2.9%                                         │   │
│  │  - RenderHTML: 7.1%                                          │   │
│  │  - Network: 14.3%                                            │   │
│  │                                                              │   │
│  │  → "DB wait is the wall-clock bottleneck" (also correct!)   │   │
│  │                                                              │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  WHEN TO USE EACH:                                                  │
│  ─────────────────                                                  │
│                                                                     │
│  Use CPU TIME (ITIMER_PROF) when:                                  │
│  - Optimizing computation-heavy code                               │
│  - Identifying algorithmic inefficiencies                          │
│  - Comparing algorithm implementations                              │
│  - Program is CPU-bound                                             │
│                                                                     │
│  Use WALL-CLOCK TIME (ITIMER_REAL) when:                           │
│  - Optimizing response time / latency                              │
│  - I/O-bound programs (databases, web servers)                     │
│  - Finding lock contention                                          │
│  - Identifying slow I/O operations                                 │
│  - Understanding user-perceived performance                         │
│                                                                     │
└────────────────────────────────────────────────────────────────────┘

Symbol Resolution

Converting raw addresses back to function names:

┌────────────────────────────────────────────────────────────────────┐
│                      SYMBOL RESOLUTION                              │
├────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  THE PROBLEM:                                                       │
│  ────────────                                                       │
│                                                                     │
│  Profiler captures: IP = 0x55a3b2c41234                            │
│  User wants to see: "matrix_multiply() in matrix.c:142"            │
│                                                                     │
│  HOW ADDRESS → SYMBOL WORKS:                                       │
│  ────────────────────────────                                       │
│                                                                     │
│  ┌────────────────────────────────────────────────────────────┐    │
│  │ ELF Binary Structure                                        │    │
│  │                                                             │    │
│  │ ┌─────────────────┐                                         │    │
│  │ │ ELF Header      │                                         │    │
│  │ ├─────────────────┤                                         │    │
│  │ │ Program Headers │ ← Load addresses for segments           │    │
│  │ ├─────────────────┤                                         │    │
│  │ │ .text           │ ← Code (your functions are here)       │    │
│  │ ├─────────────────┤                                         │    │
│  │ │ .rodata         │ ← Read-only data                        │    │
│  │ ├─────────────────┤                                         │    │
│  │ │ .data           │ ← Initialized data                      │    │
│  │ ├─────────────────┤                                         │    │
│  │ │ .symtab         │ ← Symbol table (function names+addrs)  │    │
│  │ ├─────────────────┤                                         │    │
│  │ │ .strtab         │ ← String table (symbol name strings)   │    │
│  │ ├─────────────────┤                                         │    │
│  │ │ .debug_info     │ ← DWARF debug info (line numbers)      │    │
│  │ └─────────────────┘                                         │    │
│  └────────────────────────────────────────────────────────────┘    │
│                                                                     │
│  RESOLUTION METHODS:                                                │
│  ───────────────────                                                │
│                                                                     │
│  1. dladdr() - Runtime lookup (simplest)                           │
│     ─────────────────────────────────────                          │
│     #include <dlfcn.h>                                              │
│     Dl_info info;                                                   │
│     if (dladdr(address, &info)) {                                  │
│         printf("Function: %s\n", info.dli_sname);                  │
│         printf("Library:  %s\n", info.dli_fname);                  │
│         printf("Nearest symbol at: %p\n", info.dli_saddr);         │
│     }                                                               │
│                                                                     │
│     Limitations:                                                    │
│     - Only works for exported/dynamic symbols                      │
│     - No line numbers                                               │
│     - Static functions may not be found                            │
│     - Requires -rdynamic link flag for best results                │
│                                                                     │
│  2. addr2line - External tool (most complete)                      │
│     ────────────────────────────────────────                       │
│     $ addr2line -f -e ./program 0x401234                           │
│     matrix_multiply                                                 │
│     /home/user/src/matrix.c:142                                    │
│                                                                     │
│     Advantages:                                                     │
│     + Resolves to file:line with debug info (-g)                   │
│     + Works with static symbols                                     │
│     + Handles inlined functions                                     │
│                                                                     │
│     Disadvantages:                                                  │
│     - External process (slow for many lookups)                     │
│     - Requires debug symbols to be present                         │
│                                                                     │
│  3. libbacktrace / libdw - Library-based (best performance)        │
│     ──────────────────────────────────────────────────             │
│     Full programmatic access to DWARF debug info                   │
│     Used by sanitizers and production profilers                    │
│                                                                     │
│  ═══════════════════════════════════════════════════════════════   │
│                                                                     │
│  ASLR COMPLICATIONS:                                                │
│  ───────────────────                                                │
│                                                                     │
│  Address Space Layout Randomization (ASLR) + Position Independent  │
│  Executables (PIE) mean addresses change every run:                │
│                                                                     │
│  Run 1: main() at 0x55a3b2c41000                                   │
│  Run 2: main() at 0x5621f8e12000                                   │
│                                                                     │
│  Solution: Work with OFFSETS from load base                        │
│                                                                     │
│  ┌───────────────────────────────────────────────────────────┐     │
│  │ 1. Read /proc/self/maps to find load address:             │     │
│  │    55a3b2c40000-55a3b2c45000 r-xp ... /path/to/program   │     │
│  │                                                           │     │
│  │ 2. Calculate offset:                                      │     │
│  │    offset = sample_ip - load_base                         │     │
│  │    offset = 0x55a3b2c41234 - 0x55a3b2c40000 = 0x1234     │     │
│  │                                                           │     │
│  │ 3. Use offset for addr2line:                             │     │
│  │    addr2line -f -e ./program 0x1234                       │     │
│  └───────────────────────────────────────────────────────────┘     │
│                                                                     │
│  Or use dladdr() which handles ASLR automatically.                 │
│                                                                     │
└────────────────────────────────────────────────────────────────────┘
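Method 1 and the ASLR offset calculation can be combined: `dladdr()`'s `dli_fbase` field gives the containing object's load base, so the module-relative offset falls out without parsing `/proc/self/maps`. A sketch for glibc (link with `-rdynamic` so the main executable's symbols are visible, and with `-ldl` on older glibc):

```c
#define _GNU_SOURCE           /* for dladdr() and Dl_info */
#include <dlfcn.h>
#include <stdint.h>
#include <stdio.h>

/* Resolve one sampled instruction pointer.  dladdr() searches the
 * dynamic symbol table of whichever loaded object contains `ip`,
 * so static functions may come back with dli_sname == NULL. */
static void resolve(void *ip)
{
    Dl_info info;
    if (dladdr(ip, &info) && info.dli_sname) {
        /* Module-relative offset: the value to feed to addr2line
         * for a PIE binary, independent of this run's ASLR base. */
        unsigned long off  = (uintptr_t)ip - (uintptr_t)info.dli_fbase;
        unsigned long into = (uintptr_t)ip - (uintptr_t)info.dli_saddr;
        printf("%s+0x%lx in %s (addr2line -e %s 0x%lx)\n",
               info.dli_sname, into, info.dli_fname,
               info.dli_fname, off);
    } else {
        printf("?? at %p\n", ip);   /* unresolved: report raw address */
    }
}
```

In a real profiler the `dladdr()` result would be cached per unique IP, since thousands of samples usually collapse to a few hundred addresses.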

Stack Trace Capture

Profiling call stacks provides much richer information than just leaf functions:

┌────────────────────────────────────────────────────────────────────┐
│                    STACK TRACE CAPTURE                              │
├────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  WHY CAPTURE STACKS?                                               │
│  ──────────────────                                                │
│                                                                     │
│  Leaf-only profiling:         Stack profiling:                     │
│  ─────────────────────         ───────────────────                 │
│  "memcpy() is hot (45%)"       "memcpy() called from:              │
│                                  - matrix_multiply (30%)           │
│                                  - vector_add (10%)                │
│                                  - hash_insert (5%)"               │
│                                                                     │
│  With stacks, you know WHERE to optimize memcpy usage!             │
│                                                                     │
│  ═══════════════════════════════════════════════════════════════   │
│                                                                     │
│  HOW THE STACK WORKS:                                               │
│  ────────────────────                                               │
│                                                                     │
│  Call: main() → process_data() → matrix_multiply() → memcpy()      │
│                                                                     │
│  Stack layout (x86-64 with frame pointers):                        │
│                                                                     │
│  High addresses                                                     │
│  ┌────────────────────────────┐                                    │
│  │  main's stack frame        │                                    │
│  │  return address (to libc)  │                                    │
│  │  saved RBP (frame pointer) │ ← main's RBP points here          │
│  ├────────────────────────────┤                                    │
│  │  process_data's frame      │                                    │
│  │  return address (to main)  │                                    │
│  │  saved RBP → main's RBP    │ ← process_data's RBP              │
│  ├────────────────────────────┤                                    │
│  │  matrix_multiply's frame   │                                    │
│  │  return address (to proc)  │                                    │
│  │  saved RBP → proc's RBP    │ ← matrix_multiply's RBP           │
│  ├────────────────────────────┤                                    │
│  │  memcpy's frame            │                                    │
│  │  return address (to matrix)│                                    │
│  │  saved RBP → matrix's RBP  │ ← Current RBP (we're here!)       │
│  └────────────────────────────┘                                    │
│  Low addresses                                                      │
│                                                                     │
│  FRAME POINTER WALKING:                                             │
│  ─────────────────────                                              │
│                                                                     │
│  void walk_stack(void *rbp) {                                      │
│      while (rbp != NULL) {                                         │
│          void *return_addr = *((void **)rbp + 1);                  │
│          printf("  at %p\n", return_addr);                         │
│          rbp = *(void **)rbp;  // Follow chain to caller's RBP    │
│      }                                                              │
│  }                                                                  │
│                                                                     │
│  ═══════════════════════════════════════════════════════════════   │
│                                                                     │
│  COMPLICATIONS:                                                     │
│  ──────────────                                                     │
│                                                                     │
│  1. Frame Pointer Omission (FPO)                                   │
│     Modern compilers often don't use frame pointers (-O2)          │
│     Solution: Compile with -fno-omit-frame-pointer                 │
│               Or use DWARF unwinding (slower but works)            │
│                                                                     │
│  2. Inlined Functions                                               │
│     Inlined functions don't have stack frames                      │
│     Solution: DWARF info can recover inlined call sites            │
│                                                                     │
│  3. Async-Signal-Safety                                             │
│     backtrace() is NOT async-signal-safe (may call malloc!)        │
│     Solution: Frame pointer walking is safe, or use libunwind     │
│               carefully                                             │
│                                                                     │
│  METHODS:                                                           │
│  ────────                                                           │
│                                                                     │
│  1. backtrace() - glibc function (NOT signal-safe!)               │
│     void *frames[100];                                              │
│     int depth = backtrace(frames, 100);                            │
│                                                                     │
│  2. Frame pointer walking (signal-safe)                            │
│     Walk RBP chain manually as shown above                         │
│                                                                     │
│  3. libunwind (most robust)                                        │
│     Uses DWARF info, works without frame pointers                  │
│     Some parts are signal-safe                                     │
│                                                                     │
└────────────────────────────────────────────────────────────────────┘
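Putting the pieces together for method 2: a sketch for x86-64 Linux that pulls the interrupted `RIP` and `RBP` out of the `ucontext_t` an `SA_SIGINFO` handler receives, then walks the saved-RBP chain with basic sanity checks so a corrupt chain stops the walk instead of crashing. `MAX_DEPTH` and the output array are design choices of this sketch, not part of any API.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <ucontext.h>

#define MAX_DEPTH 64   /* cap on captured frames (our choice) */

/* Walk the saved-RBP chain starting from the interrupted context.
 * x86-64 System V layout: [rbp] = caller's saved rbp,
 * [rbp+8] = return address.  Entirely async-signal-safe: no
 * allocation, just loads into a caller-provided array.
 * Returns the number of frames written.  Requires the target to be
 * built with -fno-omit-frame-pointer. */
static int capture_stack(ucontext_t *uc, uintptr_t *out, int max)
{
    int n = 0;
    uintptr_t ip  = (uintptr_t)uc->uc_mcontext.gregs[REG_RIP];
    uintptr_t rbp = (uintptr_t)uc->uc_mcontext.gregs[REG_RBP];

    if (n < max) out[n++] = ip;          /* leaf frame first */

    while (rbp && n < max) {
        uintptr_t *frame = (uintptr_t *)rbp;
        uintptr_t next   = frame[0];     /* saved caller RBP  */
        uintptr_t ret    = frame[1];     /* return address    */
        if (!ret) break;
        out[n++] = ret;
        if (next <= rbp) break;          /* chain must grow upward */
        rbp = next;
    }
    return n;
}
```

A production unwinder would also verify that each candidate `rbp` lies inside the thread's stack bounds before dereferencing it; the monotonicity check above is the minimum that keeps the sketch from looping forever.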

Flame Graphs

Flame graphs are the most effective visualization for call-stack profiling data:

┌────────────────────────────────────────────────────────────────────┐
│                       FLAME GRAPHS                                  │
├────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  WHAT IS A FLAME GRAPH?                                            │
│  ──────────────────────                                            │
│                                                                     │
│  A visualization where:                                             │
│  - X-axis: Stack profile population (NOT time!)                    │
│  - Y-axis: Stack depth (bottom = root, top = leaf)                 │
│  - Width: Proportion of samples with that function in stack        │
│                                                                     │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                                                              │   │
│  │                    ┌─────────────┐                           │   │
│  │                    │   memcpy    │                           │   │
│  │        ┌───────────┴─────────────┴───────────┐               │   │
│  │        │      matrix_multiply                 │               │   │
│  │    ┌───┴──────────────────────────────────────┴───┐          │   │
│  │    │              process_data                     │          │   │
│  │ ┌──┴───────────────────────────────────────────────┴──┐      │   │
│  │ │                        main                          │      │   │
│  │ └──────────────────────────────────────────────────────┘      │   │
│  │                                                              │   │
│  │ Reading: main is on-CPU 100% of samples                      │   │
│  │          process_data is on-CPU ~85% of samples              │   │
│  │          matrix_multiply is on-CPU ~60% of samples           │   │
│  │          memcpy is on-CPU (leaf) ~30% of samples            │   │
│  │                                                              │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  FOLDED STACK FORMAT (input for flamegraph.pl):                    │
│  ─────────────────────────────────────────────                     │
│                                                                     │
│  main;process_data;matrix_multiply;memcpy 30                       │
│  main;process_data;matrix_multiply 15                              │
│  main;process_data;hash_lookup 20                                  │
│  main;process_data;parse_input 10                                  │
│  main;init 5                                                       │
│                                                                     │
│  Each line: semicolon-separated stack (bottom to top), then count  │
│                                                                     │
│  GENERATING FLAME GRAPH:                                            │
│  ──────────────────────                                            │
│                                                                     │
│  1. Collect samples with stacks                                    │
│  2. Output in folded format                                        │
│  3. Run: flamegraph.pl < folded.txt > profile.svg                  │
│  4. Open profile.svg in browser                                    │
│                                                                     │
│  INTERPRETING FLAME GRAPHS:                                        │
│  ──────────────────────────                                        │
│                                                                     │
│  - Wide boxes at TOP = direct CPU consumers (optimize these!)      │
│  - Wide boxes at BOTTOM = called frequently (common code paths)    │
│  - Narrow towers = deep but rare code paths                        │
│  - Plateaus (flat tops) = heavy self time in that function          │
│                                                                     │
│  DIFFERENTIAL FLAME GRAPHS:                                        │
│  ──────────────────────────                                        │
│                                                                     │
│  Compare before/after profiles:                                     │
│  - RED = function got slower (more samples)                        │
│  - BLUE = function got faster (fewer samples)                      │
│  - WHITE = unchanged                                                │
│                                                                     │
│  $ difffolded.pl before.folded after.folded | flamegraph.pl > diff.svg│
│                                                                     │
└────────────────────────────────────────────────────────────────────┘

The Observer Effect (Heisenberg Principle in Profiling)

┌────────────────────────────────────────────────────────────────────┐
│                    THE OBSERVER EFFECT                              │
├────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  "Profiling changes the behavior of the program being profiled"    │
│                                                                     │
│  SOURCES OF OVERHEAD:                                               │
│  ───────────────────                                                │
│                                                                     │
│  1. Signal Handler Execution                                        │
│     ─────────────────────                                           │
│     Every sample invokes your handler, which:                      │
│     - Saves CPU context                                             │
│     - Executes your collection code                                 │
│     - Restores CPU context                                          │
│     - Returns to interrupted code                                   │
│                                                                     │
│     Typical: 1-5 microseconds per sample                           │
│     At 1000 Hz: ~1-5 ms per second = 0.1-0.5% overhead            │
│                                                                     │
│  2. Cache Pollution                                                 │
│     ────────────                                                    │
│     Signal handler touches memory, evicting application data:      │
│                                                                     │
│     Before sample:  Cache: [App Data A] [App Data B] [App Data C] │
│     During sample:  Cache: [Handler] [Sample Buf] [App Data C]    │
│     After sample:   Cache miss for A and B!                        │
│                                                                     │
│  3. TLB Pressure                                                    │
│     ────────────                                                    │
│     Handler may touch pages not in TLB, causing page table walks   │
│                                                                     │
│  4. Branch Predictor Pollution                                     │
│     ────────────────────────                                        │
│     Handler branches may pollute branch history                    │
│                                                                     │
│  ═══════════════════════════════════════════════════════════════   │
│                                                                     │
│  OVERHEAD VS SAMPLE RATE:                                          │
│  ────────────────────────                                           │
│                                                                     │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ Sample Rate    Samples/sec    Typical Overhead              │   │
│  │ ───────────    ───────────    ─────────────────             │   │
│  │    10 Hz           10         < 0.01% (too few samples)     │   │
│  │   100 Hz          100         ~0.02%  (good for long runs)  │   │
│  │  1000 Hz         1000         ~0.2%   (recommended)         │   │
│  │  5000 Hz         5000         ~1%     (acceptable)          │   │
│  │ 10000 Hz        10000         ~5%     (high but OK)         │   │
│  │ 50000 Hz        50000         ~20%    (distorting!)         │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  RULE OF THUMB: Keep overhead under 5%                              │
│  For production profiling, 100-1000 Hz is usually ideal            │
│                                                                     │
│  MEASURING YOUR PROFILER'S OVERHEAD:                               │
│  ───────────────────────────────────                               │
│                                                                     │
│  1. Run workload without profiling, measure time T0                │
│  2. Run same workload with profiling at rate R, measure time T1    │
│  3. Overhead = (T1 - T0) / T0 * 100%                               │
│                                                                     │
│  If overhead > 5%, either:                                          │
│  - Reduce sample rate                                               │
│  - Optimize your signal handler                                     │
│  - Use hardware counters instead (perf_event_open)                 │
│                                                                     │
└────────────────────────────────────────────────────────────────────┘

Project Specification

What You Will Build

A sampling profiler called profiler that:

  1. Starts target programs under profiling using exec
  2. Samples periodically using timer signals (SIGPROF/ITIMER_PROF)
  3. Captures instruction pointers (and optionally call stacks)
  4. Aggregates samples into a statistical profile
  5. Resolves symbols to map addresses to function names
  6. Generates reports including flat profiles and optionally flame graphs

Command Line Interface

# Basic usage
./profiler --hz=1000 -- ./target arg1 arg2

# With flame graph output
./profiler --flame-graph --hz=999 -- ./target > profile.svg

# Differential profiling
./profiler --compare before.prof after.prof

# Self-test overhead measurement
./profiler --self-profile --overhead-test

Functional Requirements

┌────────────────────────────────────────────────────────────────────┐
│                    FUNCTIONAL REQUIREMENTS                          │
├────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  CORE FUNCTIONALITY:                                                │
│  ──────────────────                                                 │
│                                                                     │
│  1. Timer-based sampling                                            │
│     - Configurable sample rate (10-10000 Hz)                       │
│     - Use ITIMER_PROF for CPU time profiling                       │
│     - Properly handle SIGPROF with SA_SIGINFO                      │
│                                                                     │
│  2. Sample collection                                               │
│     - Capture instruction pointer from signal context              │
│     - Store samples efficiently (array or ring buffer)             │
│     - Must be async-signal-safe (no malloc in handler!)            │
│                                                                     │
│  3. Sample aggregation                                              │
│     - Count samples per unique instruction pointer                 │
│     - Calculate percentage of total samples                        │
│                                                                     │
│  4. Symbol resolution                                               │
│     - Map instruction pointers to function names                   │
│     - Handle ASLR (Address Space Layout Randomization)             │
│     - Gracefully handle unknown symbols                            │
│                                                                     │
│  5. Report generation                                               │
│     - Flat profile: top N functions by sample count                │
│     - Include percentage, sample count, function name              │
│                                                                     │
│  OPTIONAL EXTENSIONS:                                               │
│  ────────────────────                                               │
│                                                                     │
│  6. Call stack capture                                              │
│     - Walk stack to capture full call chain                        │
│     - Handle frame pointer omission gracefully                     │
│                                                                     │
│  7. Flame graph output                                              │
│     - Generate folded stack format                                 │
│     - Compatible with flamegraph.pl                                │
│                                                                     │
│  8. Line number resolution                                          │
│     - Use addr2line or DWARF info for file:line                   │
│                                                                     │
│  9. Differential profiling                                          │
│     - Compare two profile runs                                      │
│     - Identify improved and regressed functions                    │
│                                                                     │
│  10. Multi-threaded support                                         │
│      - Track samples per thread                                     │
│      - Aggregate across threads                                     │
│                                                                     │
└────────────────────────────────────────────────────────────────────┘
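Core requirements 1 and 2 combine into one small unit. A sketch for x86-64 Linux, where `REG_RIP` is the glibc name for the saved instruction pointer; the buffer size and `SA_RESTART` choice are design decisions of this sketch:

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/time.h>
#include <ucontext.h>

#define MAX_SAMPLES (1 << 20)   /* preallocated: no malloc in handler */

static uintptr_t samples[MAX_SAMPLES];
static volatile sig_atomic_t sample_count;

/* SIGPROF handler: with SA_SIGINFO the third argument is the
 * interrupted thread's ucontext_t, holding the saved instruction
 * pointer.  Async-signal-safe: one array store, one counter bump. */
static void on_sample(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)si;
    ucontext_t *uc = ctx;
    if (sample_count < MAX_SAMPLES)
        samples[sample_count++] =
            (uintptr_t)uc->uc_mcontext.gregs[REG_RIP];
}

/* Install the handler and arm ITIMER_PROF at `hz` samples/sec.
 * SA_RESTART keeps interrupted syscalls from failing with EINTR. */
static int start_sampling(int hz)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_sample;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGPROF, &sa, NULL) < 0) return -1;

    struct itimerval it = {
        .it_interval = { 0, 1000000 / hz },
        .it_value    = { 0, 1000000 / hz },
    };
    return setitimer(ITIMER_PROF, &it, NULL);
}
```

Everything else in the requirements list (aggregation, symbol resolution, reporting) can run after the target exits, outside signal context, where malloc and printf are safe again.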

Real World Outcome

When you complete this project, here’s exactly what you’ll see:

$ ./profiler --hz=1000 -- ./target_program
================================================================================
                    SAMPLING PROFILER
================================================================================
[CONFIG] Sample rate: 1000 Hz (1ms interval)
[CONFIG] Using ITIMER_PROF (CPU time only)
[START] Profiling ./target_program (PID: 12847)

[PROGRESS] 1000 samples collected...
[PROGRESS] 5000 samples collected...
[PROGRESS] 10000 samples collected...

[END] Target exited with status 0
[STATS] Total samples: 14,293
[STATS] Unique instruction pointers: 847
[STATS] Profiling overhead: ~2.3%

================================================================================
                    FLAT PROFILE (Top 20 Functions)
================================================================================
  %time    samples   function                          source:line
 -------  --------   --------------------------------  ----------------------
  23.4%      3,345   matrix_multiply                   matrix.c:142
  18.7%      2,673   vector_dot_product                linalg.c:89
  12.1%      1,730   quicksort_partition               sort.c:67
   8.9%      1,272   hash_table_lookup                 hash.c:234
   6.2%        886   memcpy@plt                        (libc)
   4.8%        686   strcmp@plt                        (libc)
   3.7%        529   parse_json_object                 json.c:456
   2.9%        414   allocate_buffer                   buffer.c:78
   2.4%        343   compute_checksum                  crypto.c:123
   2.1%        300   read_file_chunk                   io.c:89
   1.8%        257   (unknown)                         0x7f3a2b4c5d6e
   ...
  13.0%      1,858   (other - 837 functions)

================================================================================
                    CALL GRAPH PROFILE
================================================================================
                         |--- vector_dot_product (18.7%)
matrix_multiply (23.4%) -|
                         |--- memcpy@plt (2.1% attributed)

                           |--- quicksort_partition (12.1%)
process_data (32.1%) -----|--- hash_table_lookup (8.9%)
                           |--- parse_json_object (3.7%)

$ ./profiler --flame-graph -- ./target_program > profile.svg
================================================================================
                    FLAME GRAPH GENERATION
================================================================================
[SAMPLING] Collecting call stacks at 997 Hz...
[STACKS] 8,234 unique stack traces captured
[RENDER] Generating SVG flame graph...

[OUTPUT] Flame graph written to: profile.svg (234 KB)
[TIP] Open in browser: firefox profile.svg

$ ./profiler --compare before.prof after.prof
================================================================================
                    DIFFERENTIAL PROFILE
================================================================================
Comparing: before.prof (14,293 samples) vs after.prof (13,892 samples)

Improved (faster):
  function                  before    after     delta
  ---------------------------------------------------
  matrix_multiply           23.4%     8.2%     -15.2%  (optimized!)
  vector_dot_product        18.7%    12.1%      -6.6%

Regressed (slower):
  function                  before    after     delta
  ---------------------------------------------------
  cache_lookup               1.2%     4.8%      +3.6%  (new bottleneck)
  validate_input             0.8%     2.1%      +1.3%

[SUMMARY] Overall improvement: 18.3% less CPU time in hot path

$ ./profiler --self-profile --overhead-test
================================================================================
                    PROFILER OVERHEAD ANALYSIS
================================================================================
[TEST] Running workload without profiling: 4.823s
[TEST] Running workload with profiling at 100 Hz: 4.831s (+0.17%)
[TEST] Running workload with profiling at 1000 Hz: 4.935s (+2.32%)
[TEST] Running workload with profiling at 10000 Hz: 5.647s (+17.09%)

[RECOMMENDATION] Use 1000 Hz for production profiling
[WARNING] Rates above 5000 Hz introduce significant overhead
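The flat profile shown above starts from a simple aggregation: count samples per unique instruction pointer, then sort hottest-first. One possible sketch (the `struct hot` table is our own shape for the result):

```c
#include <stdint.h>
#include <stdlib.h>

/* One flat-profile entry: an instruction pointer and its sample count. */
struct hot { uintptr_t ip; long count; };

static int cmp_ip(const void *a, const void *b)
{
    uintptr_t x = *(const uintptr_t *)a, y = *(const uintptr_t *)b;
    return (x > y) - (x < y);
}

static int cmp_count_desc(const void *a, const void *b)
{
    const struct hot *x = a, *y = b;
    return (y->count > x->count) - (y->count < x->count);
}

/* Sort the raw samples so equal IPs are adjacent, run-length count
 * them, then sort the unique entries hottest-first.  Returns the
 * number of unique IPs written to `out` (caller provides capacity
 * >= n).  Safe to do here: this runs after sampling has stopped. */
static int aggregate(uintptr_t *samples, int n, struct hot *out)
{
    qsort(samples, n, sizeof *samples, cmp_ip);
    int u = 0;
    for (int i = 0; i < n; i++) {
        if (u && out[u - 1].ip == samples[i]) out[u - 1].count++;
        else out[u++] = (struct hot){ samples[i], 1 };
    }
    qsort(out, u, sizeof *out, cmp_count_desc);
    return u;
}
```

Each entry's percentage is then `100.0 * count / total_samples`, which is exactly the `%time` column in the flat profile.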

Solution Architecture

High-Level Design

┌────────────────────────────────────────────────────────────────────┐
│                    PROFILER ARCHITECTURE                            │
├────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                        MAIN PROCESS                          │   │
│  │                                                              │   │
│  │  1. Parse arguments                                          │   │
│  │  2. Set up signal handler (SIGPROF)                         │   │
│  │  3. Set up interval timer (ITIMER_PROF)                     │   │
│  │  4. Fork/exec target program                                 │   │
│  │  5. Wait for target to exit                                  │   │
│  │  6. Stop timer                                               │   │
│  │  7. Aggregate samples                                        │   │
│  │  8. Resolve symbols                                          │   │
│  │  9. Generate report                                          │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                    SIGNAL HANDLER                            │   │
│  │                                                              │   │
│  │  profile_handler(sig, info, context):                       │   │
│  │      ip = extract_ip(context)                               │   │
│  │      if sample_count < MAX_SAMPLES:                         │   │
│  │          samples[sample_count++] = ip                       │   │
│  │                                                              │   │
│  │  Must be async-signal-safe!                                 │   │
│  │  - No malloc/free                                            │   │
│  │  - No printf                                                 │   │
│  │  - No dladdr (calls malloc internally)                      │   │
│  │  - Just write to pre-allocated array                        │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                    SAMPLE BUFFER                             │   │
│  │                                                              │   │
│  │  Pre-allocated array of instruction pointers:               │   │
│  │                                                              │   │
│  │  static void *samples[MAX_SAMPLES];  // e.g., 10 million   │   │
│  │  static volatile sig_atomic_t sample_count = 0;             │   │
│  │                                                              │   │
│  │  For stack traces:                                           │   │
│  │  typedef struct {                                            │   │
│  │      void *stack[MAX_DEPTH];                                 │   │
│  │      int depth;                                              │   │
│  │  } stack_sample_t;                                           │   │
│  │  static stack_sample_t stack_samples[MAX_SAMPLES];          │   │
│  │                                                              │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                    POST-PROCESSING                           │   │
│  │                                                              │   │
│  │  1. AGGREGATE: Sort samples, count duplicates               │   │
│  │     IP: 0x401234  Count: 3345  (23.4%)                      │   │
│  │     IP: 0x401567  Count: 2673  (18.7%)                      │   │
│  │     ...                                                      │   │
│  │                                                              │   │
│  │  2. SYMBOLIZE: Map IP to function name                       │   │
│  │     0x401234 → matrix_multiply                               │   │
│  │     0x401567 → vector_dot_product                           │   │
│  │                                                              │   │
│  │  3. REPORT: Format and print results                        │   │
│  │                                                              │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
└────────────────────────────────────────────────────────────────────┘

Note: setitimer() and SIGPROF sample whichever process installs them, and
exec() resets signal handlers (interval timers survive exec, but the default
SIGPROF action would then kill the target). A plain fork/exec launcher would
therefore profile the mostly idle profiler, not the target: the sampler must
run inside the target process - linked in directly, injected via LD_PRELOAD,
or used in-process as the implementation phases below do.

Component Data Flow

┌────────────────────────────────────────────────────────────────────┐
│                       DATA FLOW                                     │
├────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  During Profiling:                                                  │
│  ─────────────────                                                  │
│                                                                     │
│  Target Program                                                     │
│       │                                                             │
│       ▼                                                             │
│  ┌─────────────────┐                                               │
│  │ ITIMER_PROF     │ ──(1ms intervals)──► SIGPROF                  │
│  │ (CPU timer)     │                                               │
│  └─────────────────┘                                               │
│       │                                                             │
│       ▼                                                             │
│  ┌─────────────────┐                                               │
│  │ Signal Handler  │ ──(extract IP)──► samples[] array             │
│  │                 │                                               │
│  └─────────────────┘                                               │
│                                                                     │
│  Post-Processing:                                                   │
│  ────────────────                                                   │
│                                                                     │
│  samples[]                                                          │
│       │                                                             │
│       ▼                                                             │
│  ┌─────────────────┐                                               │
│  │ Sort & Count    │ ──► profile_entries[]                         │
│  │ (qsort)         │     {ip, count}                               │
│  └─────────────────┘                                               │
│       │                                                             │
│       ▼                                                             │
│  ┌─────────────────┐                                               │
│  │ Symbolize       │ ──► profile_entries[]                         │
│  │ (dladdr)        │     {ip, count, symbol}                       │
│  └─────────────────┘                                               │
│       │                                                             │
│       ▼                                                             │
│  ┌─────────────────┐                                               │
│  │ Report          │ ──► stdout or file                            │
│  │ Generator       │                                               │
│  └─────────────────┘                                               │
│                                                                     │
└────────────────────────────────────────────────────────────────────┘

Implementation Guide

The Core Question You’re Answering

“How do profilers like gprof, perf, and pprof measure where your program spends its time, and what are the limitations of statistical sampling?”

By building a sampling profiler from scratch, you’ll understand why:

  • Profilers can miss short-running functions
  • Results vary between runs (statistical sampling)
  • High sample rates can distort measurements
  • Symbol resolution is surprisingly complex
  • Call attribution requires stack walking

Concepts You Must Understand First

Before starting this project, ensure you understand these concepts:

Concept Why It Matters Where to Learn
Timer signals (SIGPROF/SIGVTALRM) Different timers measure wall-clock, user CPU, or system CPU time TLPI Ch. 23
Signal handlers and context How the interrupted context provides the instruction pointer CS:APP 8.5
Program counter / instruction pointer The CPU register that tells you where execution is CS:APP 3.4
Symbol tables and DWARF How to map addresses back to function names and line numbers CS:APP 7.5
Statistical sampling theory Why sampling works and its inherent error margins CS:APP 5.14
ASLR and PIE Address randomization affects address-to-symbol mapping CS:APP 7.12

Questions to Guide Your Design

Work through these questions BEFORE writing code:

  1. Timer Selection: What’s the difference between ITIMER_REAL, ITIMER_VIRTUAL, and ITIMER_PROF? Which should a CPU profiler use?

  2. Context Access: How do you get the instruction pointer (RIP) from inside a signal handler? What’s in the ucontext_t?

  3. Sampling Statistics: If you sample at 1000 Hz and a function runs for 1ms, how many samples do you expect? What if it runs for 0.5ms?

  4. Sample Aggregation: How do you aggregate samples efficiently? A hash table from IP to count? What about collisions?

  5. Symbol Resolution: How do you convert an instruction pointer to a function name? What tools/libraries can help?

  6. Inlined Functions: What happens to profiling accuracy if a function is inlined? Can you still measure it?

  7. Stack Walking: How do you capture call stacks, not just leaf functions? What are the challenges with frame pointers?

  8. Run Variance: Why might your profiler show different results on different runs? Is this a bug or expected behavior?

Thinking Exercise

Before coding, analyze this sampling scenario:

Time (ms):  0    1    2    3    4    5    6    7    8    9    10
            |----|----|----|----|----|----|----|----|----|----|
Function A: ████████████████                                    (4ms, 40%)
Function B:                 ████████                            (2ms, 20%)
Function C:                         ████████████████████████    (4ms, 40%)

Sampling at 1000 Hz (every 1ms):
Sample #:   1    2    3    4    5    6    7    8    9    10
Expected:   A    A    A    A    B    B    C    C    C    C

Now consider: What if the timer fires at t=0.5, 1.5, 2.5… (phase-shifted by 0.5ms)?

  • Which functions would we sample?
  • If function B always runs at exactly t=4.0 to t=6.0, and our timer fires at t=0.5, 1.5, 2.5, 3.5, 4.5, 5.5…
  • How many samples of B would we get?

This illustrates aliasing - a real problem in sampling profilers!

Development Environment Setup

# Required tools
gcc --version      # GCC for compilation
make --version     # Build automation
gdb --version      # Debugging

# Helpful for testing
perf --version     # Compare results with system profiler
addr2line --version # Debug symbol resolution

# Create project structure
mkdir -p profiler-project/{src,tests,traces}
cd profiler-project

Project Structure

profiler-project/
├── src/
│   ├── profiler.c         # Main program and CLI
│   ├── sampler.c          # Signal handler and timer setup
│   ├── sampler.h
│   ├── aggregator.c       # Sample counting and sorting
│   ├── aggregator.h
│   ├── symbolizer.c       # Address to symbol resolution
│   ├── symbolizer.h
│   ├── reporter.c         # Report generation
│   ├── reporter.h
│   └── platform.h         # Platform-specific macros (GET_IP)
├── tests/
│   ├── test_workload.c    # Known workload for testing
│   ├── test_sampler.c     # Unit tests
│   └── overhead_test.c    # Overhead measurement
├── Makefile
└── README.md

Implementation Phases

Phase 1: Basic Timer Signal Setup (Days 1-2)

Goals:

  • Set up signal handler with SA_SIGINFO
  • Configure ITIMER_PROF timer
  • Extract instruction pointer from context

Implementation:

/* sampler.c - Layer 1: Basic Timer Signal Setup */

#define _GNU_SOURCE        /* glibc: exposes REG_RIP etc. via <ucontext.h> */
#include <sys/time.h>
#include <signal.h>
#include <ucontext.h>
#include <stdint.h>
#include <stdio.h>         /* perror */

/* Platform-specific instruction pointer extraction */
#if defined(__linux__) && defined(__x86_64__)
    #define GET_IP(ctx) ((void *)(((ucontext_t *)ctx)->uc_mcontext.gregs[REG_RIP]))
#elif defined(__linux__) && defined(__aarch64__)
    #define GET_IP(ctx) ((void *)(((ucontext_t *)ctx)->uc_mcontext.pc))
#elif defined(__APPLE__) && defined(__x86_64__)
    #define GET_IP(ctx) ((void *)(((ucontext_t *)ctx)->uc_mcontext->__ss.__rip))
#elif defined(__APPLE__) && defined(__aarch64__)
    #define GET_IP(ctx) ((void *)(((ucontext_t *)ctx)->uc_mcontext->__ss.__pc))
#else
    #error "Unsupported platform"
#endif

/* Sample storage - pre-allocated, fixed size */
#define MAX_SAMPLES 10000000
static void *samples[MAX_SAMPLES];
static volatile sig_atomic_t sample_count = 0;

/* Signal handler - MUST be async-signal-safe */
static void profile_handler(int sig, siginfo_t *si, void *context) {
    if (sample_count >= MAX_SAMPLES) return;

    /* Extract instruction pointer from interrupted context */
    void *ip = GET_IP(context);

    /* Store sample (just an array write - async-signal-safe) */
    samples[sample_count++] = ip;
}

/* Start profiling at given frequency (Hz) */
int start_profiling(int hz) {
    /* Install signal handler with SA_SIGINFO to get context */
    struct sigaction sa;
    sa.sa_sigaction = profile_handler;
    sa.sa_flags = SA_RESTART | SA_SIGINFO;
    sigemptyset(&sa.sa_mask);

    if (sigaction(SIGPROF, &sa, NULL) < 0) {
        perror("sigaction");
        return -1;
    }

    /* Set up interval timer (split so tv_usec stays below 1,000,000;
       callers should pass hz >= 1) */
    struct itimerval timer;
    long usec = 1000000 / hz;  /* Microseconds per interval */

    timer.it_interval.tv_sec = usec / 1000000;
    timer.it_interval.tv_usec = usec % 1000000;
    timer.it_value = timer.it_interval;

    if (setitimer(ITIMER_PROF, &timer, NULL) < 0) {
        perror("setitimer");
        return -1;
    }

    return 0;
}

/* Stop profiling */
int stop_profiling(void) {
    struct itimerval timer = {{0, 0}, {0, 0}};
    return setitimer(ITIMER_PROF, &timer, NULL);
}

/* Get collected samples */
int get_samples(void ***out_samples, int *out_count) {
    *out_samples = samples;
    *out_count = sample_count;
    return 0;
}

Checkpoint: Timer fires and handler is called. Verify with a simple counter.

Phase 2: Sample Aggregation (Days 3-4)

Goals:

  • Sort samples by instruction pointer
  • Count duplicates efficiently
  • Build profile entry table

Implementation:

/* aggregator.c - Layer 2: Sample Aggregation */

#include <stdlib.h>
#include <string.h>

typedef struct {
    void *ip;
    unsigned long count;
    char *symbol;      /* Filled in later by symbolizer */
    char *source_loc;  /* Optional: file:line */
} profile_entry_t;

static profile_entry_t *entries = NULL;
static int entry_count = 0;
static int entry_capacity = 0;

/* Compare function for qsort */
static int compare_ip(const void *a, const void *b) {
    void *ip_a = *(void **)a;
    void *ip_b = *(void **)b;
    if (ip_a < ip_b) return -1;
    if (ip_a > ip_b) return 1;
    return 0;
}

/* Compare entries by count (descending) */
static int compare_by_count(const void *a, const void *b) {
    const profile_entry_t *ea = a;
    const profile_entry_t *eb = b;
    if (ea->count > eb->count) return -1;
    if (ea->count < eb->count) return 1;
    return 0;
}

/* Add entry to profile */
static void add_entry(void *ip, unsigned long count) {
    if (entry_count >= entry_capacity) {
        int new_cap = entry_capacity ? entry_capacity * 2 : 1024;
        profile_entry_t *tmp = realloc(entries, new_cap * sizeof(profile_entry_t));
        if (!tmp) return;  /* Out of memory: drop entry, keep old buffer */
        entries = tmp;
        entry_capacity = new_cap;
    }
    entries[entry_count].ip = ip;
    entries[entry_count].count = count;
    entries[entry_count].symbol = NULL;
    entries[entry_count].source_loc = NULL;
    entry_count++;
}

/* Aggregate samples into profile entries */
int aggregate_samples(void **samples, int sample_count) {
    if (sample_count == 0) return 0;

    /* Sort samples by IP for efficient counting */
    qsort(samples, sample_count, sizeof(void *), compare_ip);

    /* Count consecutive duplicates */
    void *current_ip = samples[0];
    unsigned long current_count = 1;

    for (int i = 1; i < sample_count; i++) {
        if (samples[i] == current_ip) {
            current_count++;
        } else {
            add_entry(current_ip, current_count);
            current_ip = samples[i];
            current_count = 1;
        }
    }
    /* Don't forget the last group */
    add_entry(current_ip, current_count);

    /* Sort entries by count (most samples first) */
    qsort(entries, entry_count, sizeof(profile_entry_t), compare_by_count);

    return 0;
}

/* Get profile entries */
int get_profile_entries(profile_entry_t **out_entries, int *out_count) {
    *out_entries = entries;
    *out_count = entry_count;
    return 0;
}

Checkpoint: Can aggregate 100,000 samples in under 1 second.

Phase 3: Address Symbolization (Days 5-6)

Goals:

  • Map instruction pointers to function names
  • Handle ASLR/PIE binaries
  • Gracefully handle unknown symbols

Implementation:

/* symbolizer.c - Layer 3: Address Symbolization */

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "aggregator.h"  /* profile_entry_t */

/* Simple runtime symbolization using dladdr */
const char *symbolize_address(void *addr) {
    Dl_info info;
    if (dladdr(addr, &info) && info.dli_sname) {
        return info.dli_sname;
    }
    return "(unknown)";
}

/* More detailed symbolization with offset */
int symbolize_address_detailed(void *addr, char *out, size_t out_size) {
    Dl_info info;

    if (dladdr(addr, &info)) {
        if (info.dli_sname) {
            /* Calculate offset from symbol start */
            ptrdiff_t offset = (char *)addr - (char *)info.dli_saddr;
            snprintf(out, out_size, "%s+0x%lx", info.dli_sname, (unsigned long)offset);
        } else {
            /* No symbol, but have library info */
            ptrdiff_t offset = (char *)addr - (char *)info.dli_fbase;
            snprintf(out, out_size, "%s+0x%lx", info.dli_fname, (unsigned long)offset);
        }
        return 0;
    }

    /* Complete failure */
    snprintf(out, out_size, "0x%lx", (unsigned long)addr);
    return -1;
}

/* Use addr2line for better symbolization (slower but more complete) */
int symbolize_with_addr2line(void *addr, const char *executable,
                             char *func_out, size_t func_size,
                             char *loc_out, size_t loc_size) {
    char cmd[512];
    snprintf(cmd, sizeof(cmd),
             "addr2line -f -e %s %p 2>/dev/null", executable, addr);

    FILE *fp = popen(cmd, "r");
    if (!fp) return -1;

    /* First line is function name */
    if (fgets(func_out, func_size, fp) == NULL) {
        pclose(fp);
        return -1;
    }
    func_out[strcspn(func_out, "\n")] = '\0';

    /* Second line is file:line */
    if (fgets(loc_out, loc_size, fp) == NULL) {
        loc_out[0] = '\0';
    }
    loc_out[strcspn(loc_out, "\n")] = '\0';

    pclose(fp);
    return 0;
}

/* Symbolize all profile entries */
int symbolize_profile(profile_entry_t *entries, int count) {
    for (int i = 0; i < count; i++) {
        const char *sym = symbolize_address(entries[i].ip);
        entries[i].symbol = strdup(sym);
    }
    return 0;
}

Checkpoint: Can resolve symbols for libc functions (memcpy, strcmp).

Phase 4: Stack Trace Capture (Days 7-8)

Goals:

  • Capture call stacks, not just leaf functions
  • Handle frame pointer omission gracefully
  • Stay async-signal-safe (or close to it)

Implementation:

/* sampler.c - Layer 4: Stack Trace Capture */

#include <execinfo.h>

#define MAX_STACK_DEPTH 64
#define MAX_STACK_SAMPLES 1000000  /* Note: ~520 MB of static storage at
                                      this size; shrink it or allocate a
                                      sized buffer at startup */

typedef struct {
    void *stack[MAX_STACK_DEPTH];
    int depth;
} stack_sample_t;

static stack_sample_t stack_samples[MAX_STACK_SAMPLES];
static volatile sig_atomic_t stack_sample_count = 0;

/*
 * WARNING: backtrace() is NOT strictly async-signal-safe!
 * It may call malloc internally on some systems.
 *
 * For production use, consider:
 * 1. Frame pointer walking (truly signal-safe)
 * 2. libunwind (some functions are signal-safe)
 * 3. Accept the small risk for development/debugging
 */
static void profile_handler_with_stack(int sig, siginfo_t *si, void *context) {
    if (stack_sample_count >= MAX_STACK_SAMPLES) return;

    stack_sample_t *s = &stack_samples[stack_sample_count];

    /* Capture call stack */
    s->depth = backtrace(s->stack, MAX_STACK_DEPTH);

    stack_sample_count++;
}

/* Safer alternative: Frame pointer walking (x86-64) */
static void profile_handler_fp_walk(int sig, siginfo_t *si, void *context) {
    if (stack_sample_count >= MAX_STACK_SAMPLES) return;

    stack_sample_t *s = &stack_samples[stack_sample_count];
    ucontext_t *uc = (ucontext_t *)context;

    /* Get initial RBP from context */
#if defined(__linux__) && defined(__x86_64__)
    void *rbp = (void *)uc->uc_mcontext.gregs[REG_RBP];
    void *rip = (void *)uc->uc_mcontext.gregs[REG_RIP];
#elif defined(__APPLE__) && defined(__x86_64__)
    void *rbp = (void *)uc->uc_mcontext->__ss.__rbp;
    void *rip = (void *)uc->uc_mcontext->__ss.__rip;
#else
    return;  /* Unsupported platform */
#endif

    s->stack[0] = rip;  /* Current instruction */
    s->depth = 1;

    /* Walk frame pointer chain */
    while (rbp != NULL && s->depth < MAX_STACK_DEPTH) {
        /* Minimal sanity check: frame pointers should be pointer-aligned.
           A robust walker would also bound-check rbp against the stack
           region and require it to move toward higher addresses. */
        if ((uintptr_t)rbp & (sizeof(void *) - 1)) break;

        void *ret_addr = *((void **)rbp + 1);
        s->stack[s->depth++] = ret_addr;

        rbp = *(void **)rbp;  /* Follow chain */
    }

    stack_sample_count++;
}

/* Get stack samples */
int get_stack_samples(stack_sample_t **out_samples, int *out_count) {
    *out_samples = stack_samples;
    *out_count = stack_sample_count;
    return 0;
}

Checkpoint: Can capture 10+ frame call stacks.

Phase 5: Report Generation (Days 9-10)

Goals:

  • Generate flat profile report
  • Generate folded stacks for flame graphs
  • Format output professionally

Implementation:

/* reporter.c - Layer 5: Report Generation */

#include <stdio.h>
#include "aggregator.h"  /* profile_entry_t */
#include "sampler.h"     /* stack_sample_t */
#include "symbolizer.h"  /* symbolize_address */

/* Print flat profile (top N functions) */
void print_flat_profile(profile_entry_t *entries, int count, int total_samples) {
    printf("================================================================================\n");
    printf("                    FLAT PROFILE\n");
    printf("================================================================================\n");
    printf("  %%time    samples   function\n");
    printf(" -------  --------   --------------------------------\n");

    int top_n = count < 20 ? count : 20;

    for (int i = 0; i < top_n; i++) {
        double pct = 100.0 * entries[i].count / total_samples;
        printf("  %5.1f%%  %8lu   %s\n",
               pct,
               entries[i].count,
               entries[i].symbol ? entries[i].symbol : "(unknown)");
    }

    /* Summary for remaining functions */
    if (count > top_n) {
        unsigned long other_count = 0;
        for (int i = top_n; i < count; i++) {
            other_count += entries[i].count;
        }
        double other_pct = 100.0 * other_count / total_samples;
        printf("  %5.1f%%  %8lu   (other - %d functions)\n",
               other_pct, other_count, count - top_n);
    }
}

/* Output folded stacks format for flamegraph.pl */
void output_folded_stacks(FILE *out, stack_sample_t *samples, int count) {
    for (int i = 0; i < count; i++) {
        stack_sample_t *s = &samples[i];

        /* Print stack frames separated by semicolons (bottom to top) */
        for (int j = s->depth - 1; j >= 0; j--) {
            if (j < s->depth - 1) fprintf(out, ";");
            const char *sym = symbolize_address(s->stack[j]);
            fprintf(out, "%s", sym);
        }
        fprintf(out, " 1\n");  /* Weight of 1 per sample */
    }
}

/* Print profiler statistics */
void print_stats(int total_samples, int unique_ips, double overhead_pct) {
    printf("================================================================================\n");
    printf("                    PROFILER STATISTICS\n");
    printf("================================================================================\n");
    printf("[STATS] Total samples: %d\n", total_samples);
    printf("[STATS] Unique instruction pointers: %d\n", unique_ips);
    if (overhead_pct >= 0) {
        printf("[STATS] Profiling overhead: ~%.1f%%\n", overhead_pct);
    }
}

Checkpoint: Generate readable flat profile for test program.


Testing Strategy

Self-Profiling Test

Create a workload with known hot spots:

/* test_workload.c - Known workload for testing */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Three hot spots, called equally often. With similar loop bodies they
   should split CPU time roughly evenly (~33% each); in practice the
   modulo in hot_function_b makes it the most expensive per iteration -
   exactly the kind of imbalance the profiler should reveal. */
__attribute__((noinline))
void hot_function_a(volatile int *data, int n) {
    for (int i = 0; i < n; i++) {
        data[i] = data[i] * 2 + 1;
    }
}

__attribute__((noinline))
void hot_function_b(volatile int *data, int n) {
    for (int i = 0; i < n; i++) {
        data[i] = data[i] + data[(i + 1) % n];
    }
}

__attribute__((noinline))
void hot_function_c(volatile int *data, int n) {
    for (int i = 0; i < n; i++) {
        data[i] = data[i] ^ (data[i] >> 1);
    }
}

int main(void) {
    const int N = 10000;
    const int ITERS = 10000;
    volatile int *data = malloc(N * sizeof(int));
    if (!data) return 1;

    for (int i = 0; i < N; i++) {
        data[i] = rand();
    }

    for (int iter = 0; iter < ITERS; iter++) {
        hot_function_a(data, N);
        hot_function_b(data, N);
        hot_function_c(data, N);
    }

    printf("Result: %d\n", data[0]);
    free((void *)data);
    return 0;
}

Overhead Measurement Test

/* overhead_test.c - Measure profiler overhead */

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include "sampler.h"  /* start_profiling / stop_profiling */

/* Get current time in seconds */
double get_time(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1000000.0;
}

/* Workload function */
void do_work(int iterations) {
    volatile double sum = 0;
    for (int i = 0; i < iterations; i++) {
        sum += i * 0.1;
    }
}

int main(int argc, char **argv) {
    int iterations = 100000000;
    int sample_rates[] = {0, 100, 1000, 5000, 10000};
    int num_rates = sizeof(sample_rates) / sizeof(sample_rates[0]);

    double base_time = 0;

    for (int r = 0; r < num_rates; r++) {
        int hz = sample_rates[r];

        if (hz > 0) {
            start_profiling(hz);
        }

        double start = get_time();
        do_work(iterations);
        double end = get_time();
        double elapsed = end - start;

        if (hz > 0) {
            stop_profiling();
        }

        if (hz == 0) {
            base_time = elapsed;
            printf("[TEST] Without profiling: %.3fs\n", elapsed);
        } else {
            double overhead = (elapsed - base_time) / base_time * 100;
            printf("[TEST] At %d Hz: %.3fs (+%.2f%%)\n", hz, elapsed, overhead);
        }
    }

    return 0;
}

Compare with System Profiler

# Profile with your profiler
./profiler --hz=1000 -- ./test_workload > my_profile.txt

# Profile with perf (Linux)
perf record -F 1000 ./test_workload
perf report > perf_profile.txt

# Compare results - functions should have similar percentages
diff <(grep -E "^\s+[0-9]" my_profile.txt | head -10) \
     <(grep -E "^\s+[0-9]" perf_profile.txt | head -10)

Common Pitfalls & Debugging

Bug 1: Using Wall-Clock Timer for CPU Profiling

// WRONG - counts time sleeping, not CPU time
setitimer(ITIMER_REAL, &timer, NULL);
// If program sleeps 90% of the time, you sample sleeping!

// RIGHT - counts only user + system CPU time
setitimer(ITIMER_PROF, &timer, NULL);

Symptom: Profile shows functions with I/O waits as hot, or shows very few samples for CPU-intensive code.

Diagnosis: Add logging to see when samples are taken. Check if samples occur during sleep/wait calls.

Bug 2: Forgetting to Handle ASLR

// WRONG - addresses change each run!
printf("Hot function at: %p\n", ip);
// Next run, same function is at a different address

// RIGHT - subtract base address or use dladdr
Dl_info info;
if (dladdr(ip, &info)) {
    ptrdiff_t offset = (char*)ip - (char*)info.dli_fbase;
    printf("%s+0x%lx\n", info.dli_fname, (unsigned long)offset);
}

Symptom: Symbols don’t resolve correctly, or profiles from different runs can’t be compared.

Diagnosis: Print raw addresses and check if they match output of nm or objdump. If they differ, ASLR is the issue.

Bug 3: Calling Non-Async-Signal-Safe Functions in Handler

// WRONG - printf, malloc, dladdr are NOT async-signal-safe
void handler(int sig, siginfo_t *si, void *ctx) {
    printf("Sample at %p\n", get_ip(ctx));  // May deadlock!
    char *sym = symbolize(get_ip(ctx));     // Calls malloc!
}

// RIGHT - only store data, process later
void handler(int sig, siginfo_t *si, void *ctx) {
    if (sample_count < MAX_SAMPLES) {
        samples[sample_count++] = get_ip(ctx);  // Just a write
    }
}

Symptom: Deadlock, crash, or corrupted output intermittently (depends on what main program was doing when signal arrived).

Diagnosis: Run with strace to see if program hangs in a system call. Check if handler calls any library functions.

Bug 4: High Sample Rate Causing Measurement Distortion

// WRONG - 100,000 Hz sampling
timer.it_interval.tv_usec = 10;  // 10us interval
// Handler overhead dominates! Measuring the profiler, not the program.

// RIGHT - 100-1000 Hz is usually sufficient
timer.it_interval.tv_usec = 1000;  // 1ms interval (1000 Hz)
// Rule of thumb: if overhead > 5%, reduce sample rate

Symptom: Profile shows signal handler or profiler functions as hot. Total runtime is much longer than expected.

Diagnosis: Measure overhead with and without profiling. If overhead > 10%, reduce sample rate.

Bug 5: Missing Symbols Due to Compilation Flags

// WRONG - Stripped binary, no symbols
// gcc -O2 -s program.c -o program

// RIGHT - Keep symbols and enable dynamic symbol table
// gcc -O2 -rdynamic program.c -o program
// For debug symbols: gcc -O2 -g program.c -o program

Symptom: All functions show as “(unknown)” or only libc functions are resolved.

Diagnosis: Run nm program or objdump -t program to check if symbols are present.

Bug 6: Aliasing with Periodic Code

// PROBLEM: Sampling at exactly 1000 Hz, function runs every 1ms
// May always sample at same phase, missing or over-counting function

// SOLUTION 1: Use prime sample rate
timer.it_interval.tv_usec = 997;  // ~1003 Hz, won't sync with 1000 Hz patterns

// SOLUTION 2: Add jitter (harder to implement safely)
// Vary sample interval slightly each time

// SOLUTION 3: Collect many samples over long runtime
// Statistical variations average out

Symptom: Same function shows very different percentages on different runs. Results don’t match wall-clock expectations.

Diagnosis: Run profile multiple times and compare. Try different sample rates. Check if target code has periodic behavior.

Bug 7: Not Handling Short-Running Programs

// PROBLEM: Program exits before enough samples collected
// With 1000 Hz sampling, a 10ms program gets ~10 samples!

// SOLUTION: Detect low sample count and warn
if (sample_count < 100) {
    fprintf(stderr, "WARNING: Only %d samples collected.\n", sample_count);
    fprintf(stderr, "Results may be unreliable. Try:\n");
    fprintf(stderr, "  1. Increase program runtime\n");
    fprintf(stderr, "  2. Increase sample rate (--hz=10000)\n");
    fprintf(stderr, "  3. Run program multiple times\n");
}

Symptom: Very few samples, results dominated by startup/shutdown code, high variance between runs.

Diagnosis: Print sample count. If < 100, results are statistically unreliable.


Extensions & Challenges

Beginner Extensions

  • Source line resolution: Use addr2line to show file:line for each function
  • Call graph output: Show caller-callee relationships from stack traces
  • CSV output: Export profile data for analysis in spreadsheets
  • Multiple runs aggregation: Combine profiles from multiple runs

Intermediate Extensions

  • Flame graph generation: Output SVG flame graphs directly (without external tools)
  • Differential profiling: Compare two profiles and highlight differences
  • Per-thread profiling: Track samples by thread ID for multi-threaded programs
  • Real-time display: Show live profile updates during execution

Advanced Extensions

  • Hardware counter sampling: Use perf_event_open() for cycle-accurate sampling
  • Off-CPU profiling: Track where program waits (I/O, locks) using ITIMER_REAL
  • Call-graph construction: Build full call graph from stack samples
  • JIT support: Handle dynamically generated code (requires cooperation from runtime)
  • Remote profiling: Profile programs on remote machines via SSH

The Interview Questions They’ll Ask

After completing this project, you’ll be ready for these interview questions:

1. “How does a sampling profiler work? What are its advantages over instrumentation?”

Expected answer: Sampling profilers periodically interrupt the program (via timer signal) and record where it is (instruction pointer). Advantages: low overhead (constant regardless of call frequency), no code modification needed, works on release binaries. Disadvantages: statistical error, may miss short functions, can alias with periodic behavior.

Deeper follow-up: Discuss trade-offs between sample rate and accuracy, explain how to handle ASLR, mention hardware vs software sampling.

2. “Explain the difference between CPU time and wall-clock time profiling.”

Expected answer: CPU time (ITIMER_PROF) only counts time when the CPU is executing your code - excludes I/O waits, sleeps, context switches. Wall-clock time (ITIMER_REAL) measures real elapsed time including waits. For CPU-bound code, use CPU time. For I/O-bound or concurrent code, wall-clock may be more useful.

Deeper follow-up: When would you use ITIMER_VIRTUAL vs ITIMER_PROF? How would you profile lock contention?
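The three timer/signal pairs from setitimer(2) can be summarized in a small reference sketch (the `time_domain` struct is illustrative, not an API):

```c
// Sketch: each interval timer, the signal it raises, and what it measures.
#include <sys/time.h>
#include <signal.h>

struct time_domain {
    int which;            /* argument to setitimer() */
    int sig;              /* signal delivered on expiry */
    const char *measures;
};

static const struct time_domain domains[] = {
    { ITIMER_PROF,    SIGPROF,   "user + system CPU time" },
    { ITIMER_VIRTUAL, SIGVTALRM, "user CPU time only"     },
    { ITIMER_REAL,    SIGALRM,   "wall-clock time"        },
};
```

This also answers the follow-up: ITIMER_VIRTUAL excludes kernel time, which is useful when you want to profile only your own computation and ignore time spent inside system calls.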

3. “How do you symbolize an address back to a function name?”

Expected answer: Use the symbol table in the ELF binary. Tools: dladdr() for runtime lookup, addr2line for static lookup, libbacktrace or libunwind for full support. Need to handle ASLR (read /proc/self/maps), stripped binaries (no symbols), and inlined functions (DWARF info).

Deeper follow-up: How would you handle position-independent executables? What about dynamically loaded libraries?

4. “What is the observer effect in profiling?”

Expected answer: Profiling changes the behavior of the program being measured. Signal handlers take CPU time, cache lines get evicted, branches may become less predictable. High sample rates increase overhead. A good profiler minimizes overhead and measures its own impact.

Deeper follow-up: How would you measure your profiler’s overhead? What techniques minimize cache pollution in signal handlers?

5. “How would you profile a multithreaded program?”

Expected answer: setitimer() timers are process-wide; on Linux the resulting SIGPROF tends to be delivered to a thread that is currently running, so samples land roughly in proportion to each thread’s CPU use. Each thread needs its own sample buffer, or a lock-free shared structure (taking a lock inside a signal handler risks deadlock). Consider: should you sample all threads equally or proportionally to CPU usage? Thread IDs help attribute samples.

Deeper follow-up: What are the challenges with lock-free sample collection? How do thread-local storage and per-CPU buffers help?

6. “Why might a profiler miss a function entirely?”

Expected answer: If a function runs for less than the sampling interval (e.g., 0.1ms with 1000Hz sampling), it may never be sampled. Also: inlined functions don’t have separate addresses, frame-pointer walking can misattribute leaf frames, and short periodic functions may alias with the sample timer.

Deeper follow-up: How would you detect aliasing? What’s the relationship between sample count and statistical confidence?


Hints in Layers

If you’re stuck, reveal hints one at a time:

Hint 1: Basic Timer Signal Setup
#define _GNU_SOURCE   // exposes REG_RIP in <ucontext.h> on glibc
#include <sys/time.h>
#include <signal.h>
#include <ucontext.h>

#define MAX_SAMPLES 1000000

volatile sig_atomic_t sample_count = 0;
static struct sample { void *ip; } samples[MAX_SAMPLES];

void profile_handler(int sig, siginfo_t *si, void *context) {
    (void)sig; (void)si;  // unused; required by the SA_SIGINFO signature
    ucontext_t *uc = (ucontext_t *)context;

    // Get instruction pointer from interrupted context
    // Linux x86-64:
    void *ip = (void *)uc->uc_mcontext.gregs[REG_RIP];

    // macOS x86-64:
    // void *ip = (void *)uc->uc_mcontext->__ss.__rip;

    // Store sample (async-signal-safe: just an array write)
    if (sample_count < MAX_SAMPLES) {
        samples[sample_count++].ip = ip;
    }
}

void start_profiling(int hz) {
    // Install signal handler with SA_SIGINFO to get context
    struct sigaction sa;
    sa.sa_sigaction = profile_handler;
    sa.sa_flags = SA_RESTART | SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGPROF, &sa, NULL);

    // Set up interval timer (microseconds)
    struct itimerval timer;
    timer.it_interval.tv_sec = 0;
    timer.it_interval.tv_usec = 1000000 / hz;  // e.g., 1000 for 1ms
    timer.it_value = timer.it_interval;
    setitimer(ITIMER_PROF, &timer, NULL);
}
Hint 2: Sample Aggregation
#include <stdlib.h>  // qsort
#include <stdint.h>  // uintptr_t

typedef struct {
    void *ip;
    unsigned long count;
    char *symbol;  // Resolved later
} profile_entry_t;

// Order samples by instruction pointer so duplicates become adjacent
static int compare_ip(const void *a, const void *b) {
    uintptr_t pa = (uintptr_t)((const struct sample *)a)->ip;
    uintptr_t pb = (uintptr_t)((const struct sample *)b)->ip;
    return (pa > pb) - (pa < pb);
}

// Simple aggregation: sort and count
void aggregate_samples(void) {
    // Sort samples by IP for counting
    qsort(samples, sample_count, sizeof(struct sample), compare_ip);

    // Count consecutive duplicates
    void *current_ip = NULL;
    int current_count = 0;

    for (int i = 0; i < sample_count; i++) {
        if (samples[i].ip == current_ip) {
            current_count++;
        } else {
            if (current_ip != NULL) {
                add_to_profile(current_ip, current_count);
            }
            current_ip = samples[i].ip;
            current_count = 1;
        }
    }
    if (current_ip != NULL) {
        add_to_profile(current_ip, current_count);
    }
}
Hint 3: Address Symbolization
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>   // popen, snprintf, fgets
#include <string.h>  // strcspn

// Runtime symbolization using dladdr
const char *symbolize(void *addr) {
    Dl_info info;
    if (dladdr(addr, &info) && info.dli_sname) {
        return info.dli_sname;
    }
    return "(unknown)";
}

// For better symbolization, use addr2line or libbacktrace
// NOTE: for PIE binaries, subtract the load base from addr first
// (see /proc/self/maps), or addr2line will resolve nothing.
void symbolize_with_addr2line(void *addr, char *out, size_t outsize) {
    char cmd[256];
    snprintf(cmd, sizeof(cmd),
             "addr2line -f -e /proc/self/exe %p 2>/dev/null", addr);

    FILE *fp = popen(cmd, "r");
    if (fp) {
        if (fgets(out, outsize, fp) == NULL) {
            snprintf(out, outsize, "0x%lx", (unsigned long)addr);
        }
        // Strip newline
        out[strcspn(out, "\n")] = '\0';
        pclose(fp);
    } else {
        // Fall back to the raw address if popen failed
        snprintf(out, outsize, "0x%lx", (unsigned long)addr);
    }
}
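The PIE/ASLR adjustment mentioned above can be sketched like this. It is Linux-specific and assumes the first line of /proc/self/maps is the main executable’s lowest mapping, which holds on typical layouts; `exe_load_base()` is an illustrative name.

```c
// Sketch: read the executable's load base so sampled addresses can be
// rebased to link-time addresses before calling addr2line.
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

uintptr_t exe_load_base(void) {
    FILE *f = fopen("/proc/self/maps", "r");
    uintptr_t base = 0;
    if (f) {
        /* each maps line starts with "start-end ..." in hex; read start */
        if (fscanf(f, "%" SCNxPTR, &base) != 1)
            base = 0;
        fclose(f);
    }
    return base;
}
```

Usage: for a PIE binary, pass `(uintptr_t)addr - exe_load_base()` to addr2line; for a non-PIE binary the raw address already matches the link-time layout and no adjustment is needed.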
Hint 4: Stack Trace Capture
#include <execinfo.h>
#include <stdio.h>   // printf
#include <stdlib.h>  // free

#define MAX_STACK_DEPTH 64

typedef struct {
    void *stack[MAX_STACK_DEPTH];
    int depth;
} stack_sample_t;

stack_sample_t stack_samples[100000];
volatile sig_atomic_t stack_sample_count = 0;

void profile_handler_with_stack(int sig, siginfo_t *si, void *context) {
    if (stack_sample_count >= 100000) return;

    // Capture call stack
    // NOTE: backtrace() is not strictly async-signal-safe!
    // For production, use frame pointer walking or libunwind
    stack_sample_t *s = &stack_samples[stack_sample_count];
    s->depth = backtrace(s->stack, MAX_STACK_DEPTH);
    stack_sample_count++;
}

// Print stack traces (for debugging)
void print_stack(stack_sample_t *s) {
    char **symbols = backtrace_symbols(s->stack, s->depth);
    for (int i = 0; i < s->depth; i++) {
        printf("  %s\n", symbols[i]);
    }
    free(symbols);
}
Hint 5: Report Generation
void print_flat_profile(void) {
    // Sort entries by sample count (descending)
    qsort(profile_entries, entry_count, sizeof(profile_entry_t),
          compare_by_count_desc);

    printf("================================================================================\n");
    printf("                    FLAT PROFILE\n");
    printf("================================================================================\n");
    printf("  %%time    samples   function\n");
    printf(" -------  --------   --------------------------------\n");

    for (int i = 0; i < entry_count && i < 20; i++) {
        double pct = 100.0 * profile_entries[i].count / sample_count;
        printf("  %5.1f%%  %8lu   %s\n",
               pct,
               profile_entries[i].count,
               profile_entries[i].symbol);
    }
}

// Flame graph output (folded stacks format for flamegraph.pl)
void output_folded_stacks(FILE *out) {
    for (int i = 0; i < stack_sample_count; i++) {
        stack_sample_t *s = &stack_samples[i];

        // Print stack frames separated by semicolons (bottom to top)
        for (int j = s->depth - 1; j >= 0; j--) {
            if (j < s->depth - 1) fprintf(out, ";");
            fprintf(out, "%s", symbolize(s->stack[j]));
        }
        fprintf(out, " 1\n");  // Weight of 1 per sample
    }
}

Books That Will Help

Book Chapters What You’ll Learn
CS:APP 3e 5.14, 8.5 Performance measurement, signals and handlers
TLPI 23 Timer signals (ITIMER_*), interval timers
Systems Performance (Gregg) 5-6 Profiling methodology, CPU analysis
APUE 3e 10, 14 Signals, interval timers
BPF Performance Tools 13 CPU profiling with modern tools

Self-Assessment Checklist

Understanding

  • I can explain the difference between sampling and instrumentation profiling
  • I understand why ITIMER_PROF is used instead of ITIMER_REAL for CPU profiling
  • I can describe what makes a function async-signal-safe
  • I understand statistical sampling error and how sample count affects accuracy
  • I can explain the aliasing problem and how to mitigate it
  • I understand how ASLR affects symbol resolution

Implementation

  • Signal handler correctly extracts instruction pointer from context
  • Sample collection is async-signal-safe (no malloc/printf in handler)
  • Timer is configured correctly for desired sample rate
  • Sample aggregation correctly counts unique instruction pointers
  • Symbol resolution works for at least standard library functions
  • Report shows percentages and sample counts correctly

Testing

  • Profile of known workload matches expected percentages (within ~10%)
  • Measured overhead is < 5% at 1000 Hz sample rate
  • Results are consistent across multiple runs (within statistical variation)
  • Profile matches system profiler (perf) for same workload

Growth

  • I can explain how production profilers (perf, pprof) work
  • I understand the trade-offs in profiler design
  • I can interpret flame graphs effectively
  • I can discuss profiling in interviews confidently

  • Previous: P22 (Concurrent Data Structures) - Concurrency foundations
  • Next: P24 (Memory Leak Detector) - Another debugging tool using interposition
  • Uses concepts from: P8 (Signals) - Signal handling fundamentals
  • Uses concepts from: P12 (Linker) - Symbol tables and dynamic linking

Submission / Completion Criteria

Minimum Viable Completion:

  • Timer-based sampling with ITIMER_PROF
  • Sample collection (async-signal-safe)
  • Basic aggregation and flat profile output
  • Symbol resolution for common functions
  • Works on simple test programs

Full Completion:

  • All of the above plus stack trace capture
  • Folded stack output for flame graphs
  • Source line resolution (with debug symbols)
  • Overhead measurement and reporting
  • Handles ASLR correctly

Excellence (Going Above & Beyond):

  • SVG flame graph generation
  • Differential profiling
  • Multi-threaded support
  • Hardware counter integration
  • Comparison with production profilers

This guide was expanded from CSAPP_3E_DEEP_LEARNING_PROJECTS.md. For the complete learning path, see the project index.