Project 2: perf + Flamegraph Investigator

Project Overview

Difficulty:            Intermediate
Time Estimate:         1-2 weeks
Primary Language:      C
Alternative Languages: Rust, Go, C++
Knowledge Area:        Profiling and Flamegraphs
Tools Required:        perf, flamegraph scripts, graphviz
Primary Reference:     "Systems Performance" by Brendan Gregg

Learning Objectives

By completing this project, you will be able to:

  1. Configure perf for accurate sampling with appropriate sample rates and event types
  2. Generate and interpret flamegraphs understanding what width, depth, and color represent
  3. Attribute CPU time to specific functions and explain why they're hot
  4. Distinguish CPU-bound work from I/O wait using different profiling modes
  5. Build repeatable profiling workflows that can be automated in CI/CD
  6. Cross-validate findings using multiple profiling tools

Deep Theoretical Foundation

How Sampling Profilers Work

Unlike instrumentation (which adds code at every function entry/exit), sampling profilers periodically interrupt the program and record its current state. This approach has fundamental trade-offs you must understand:

The Sampling Process

Every 1/sample_rate seconds, the kernel interrupts your program and captures:

  1. The current instruction pointer (IP)
  2. The call stack (all return addresses on the stack)
  3. Optional: CPU registers, counter values
Time →
Program: [code][code][code][code][code][code][code][code]...
Samples:      ↑           ↑           ↑           ↑
              Sample 1    Sample 2    Sample 3    Sample 4

If samples 1, 2, 3 land in function A and sample 4 lands in function B:
→ A appears to consume 75% of CPU time

CPU Profiling Sampling Timeline

Sample Rate Trade-offs

  • High rate (10,000 Hz): More samples = more precision, but higher overhead (1-5% CPU)
  • Low rate (100 Hz): Minimal overhead, but may miss short-lived functions
  • Default (99 Hz): Avoids lockstep with 100 Hz kernel timers, good balance
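
As a quick sanity check, the expected sample count for a single busy thread is roughly rate × duration; with too few samples, the percentages become noise. A minimal sketch (30 s at 99 Hz is just the configuration used in the example report later in this project):

rate=99          # samples per second
duration=30      # seconds
echo "expected samples: $((rate * duration))"    # ~2,970 for one busy thread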

Why Sampling Can Mislead

  1. Short functions can be invisible: Sampling sees code only in proportion to its share of total CPU time, so a function that runs for 100 microseconds at a time needs many samples before it shows up reliably, and rare, short-lived calls may never coincide with a sample at all.

  2. Inlined functions disappear: Compiler inlining merges functions, so the "hot" function shown is actually its caller.

  3. Kernel time is separate: By default, perf only shows user-space time. Kernel time (syscalls) requires additional configuration.

Understanding Call Stacks

A call stack is the chain of function calls that led to the current instruction:

main()
 └→ process_request()
     └→ parse_input()
         └→ json_decode()
             └→ utf8_validate() ← Current IP

Function Call Stack Hierarchy

When we sample, we record this entire chain. Over many samples, we can determine:

  • Which functions are directly consuming CPU (leaf functions)
  • Which functions are responsible for CPU consumption (ancestors)
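
Concretely, the tools reduce these samples to "folded" stacks: one line per unique call chain, frames joined by semicolons, followed by a sample count. The names and counts below are made up purely to show the format:

main;process_request;parse_input;json_decode;utf8_validate 85
main;process_request;parse_input;json_decode 40
main;process_request;render_response 25

Here utf8_validate has 85 leaf (self) samples, while parse_input is on-stack for 125 samples (inclusive) even though it has no self samples at all.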

The Attribution Problem

Consider this stack sampled 100 times:

main → process → parse → validate (100 samples)

Who is "responsible" for the CPU time?

  • validate is doing the actual work (leaf)
  • parse called validate (parent)
  • main started everything (root)

The answer depends on what you can change. If validate is a library function you canโ€™t modify, then parse calling it unnecessarily is the problem.

Flamegraph Anatomy

Flamegraphs are a visualization invented by Brendan Gregg that makes call stack data intuitive:

┌─────────────────────────────────────────────────────────────────────┐
│                                main                                 │
├──────────────────────────────────┬──────────────────────────────────┤
│            process_a             │            process_b             │
├────────────────┬─────────────────┼────────────┬─────────────────────┤
│     parse      │     compute     │   parse    │       render        │
├───────┬────────┼─────────────────┼────────────┼──────────┬──────────┤
│ json  │  xml   │                 │            │   html   │   css    │
└───────┴────────┴─────────────────┴────────────┴──────────┴──────────┘

Width = Time (samples)     Height = Stack Depth     Color = Frame type

Flamegraph Anatomy

Reading Rules:

  1. Width is proportional to CPU time: Wider bars consumed more samples
  2. Height is call stack depth: Top of each column is the leaf function
  3. Parent-child is above-below: A function calls the functions directly above it
  4. Order is alphabetical by default: Horizontal position is for readability, not time

What to Look For:

  1. Plateaus: Wide flat tops indicate functions that consume CPU themselves (not via children)
  2. Towers: Tall thin stacks indicate deep call chains for rare events
  3. Repeated patterns: Same function appearing under multiple callers indicates a hot utility

CPU vs Off-CPU Profiling

Standard profiling captures CPU time: when your code is actively running. But programs also wait:

  • I/O wait: Blocked on disk or network
  • Lock contention: Waiting for a mutex
  • Sleep: Explicitly sleeping
  • Page faults: Waiting for memory pages

Off-CPU profiling captures what the program is waiting for:

CPU Flamegraph:        Shows where CPU cycles go
Off-CPU Flamegraph:    Shows where time goes while NOT on CPU

Combining both gives a complete picture of latency.
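
A quick way to decide which kind of profile you need is to compare CPU time against wall-clock time; a sketch using GNU time (the -v flag and output labels are GNU-specific, and ./my_workload is a placeholder):

# If user + sys CPU time is close to elapsed wall-clock time, the workload is
# CPU-bound and an on-CPU flamegraph will explain it; a large gap means the
# program spends time blocked, and off-CPU profiling is needed.
/usr/bin/time -v ./my_workload
# Look for these lines in the output:
#   User time (seconds)
#   System time (seconds)
#   Elapsed (wall clock) time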


Complete Project Specification

What Youโ€™re Building

A profiling workflow called profile_run that:

  1. Captures CPU profiles using perf with configurable duration and sample rate
  2. Generates flamegraph SVGs with proper symbolization
  3. Produces analysis reports identifying top hotspots with sample counts
  4. Compares profiles to show before/after differences
  5. Stores artifacts for historical comparison

Functional Requirements

profile_run capture --workload <cmd> --duration <sec> --output <dir>
profile_run flamegraph --input <perf.data> --output <svg>
profile_run analyze --input <perf.data> --top <n>
profile_run diff --before <perf.data> --after <perf.data> --output <svg>
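
One possible shape for the top-level dispatcher, assuming a plain bash implementation; the subcommand bodies are left as TODOs and get filled in during the phases below:

#!/bin/bash
# profile_run - top-level dispatcher (sketch only)
set -euo pipefail

cmd="${1:-help}"
shift || true
case "$cmd" in
    capture)    echo "TODO: wrap perf record (--workload, --duration, --output)" ;;
    flamegraph) echo "TODO: perf script | stackcollapse-perf.pl | flamegraph.pl" ;;
    analyze)    echo "TODO: top-N hotspot report from collapsed stacks" ;;
    diff)       echo "TODO: difffolded.pl comparison of two profiles" ;;
    *)          echo "usage: profile_run {capture|flamegraph|analyze|diff} ..." >&2; exit 1 ;;
esac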

Output Artifacts

  1. perf.data: Raw profile data for reprocessing
  2. flamegraph.svg: Interactive flamegraph visualization
  3. folded.txt: Collapsed stack traces for custom analysis
  4. report.txt: Human-readable hotspot analysis

Example Analysis Report

Flamegraph Profiling Report
โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•
Workload:        web_server
Date:            2025-01-27 15:45:22
Duration:        30 seconds
Sample rate:     99 Hz
Total samples:   2,970
Debug symbols:   Available

Top CPU Hotspots (by sample count):
───────────────────────────────────────────────────
1) parse_input              1,128 samples (38.0%)
   ├─ json_parse               687 samples (23.1%)
   ├─ validate_schema          298 samples (10.0%)
   └─ utf8_decode              143 samples (4.8%)

2) hash_lookup                624 samples (21.0%)
   ├─ hash_compute             374 samples (12.6%)
   └─ bucket_scan              250 samples (8.4%)

3) serialize_output           416 samples (14.0%)
   └─ json_encode              312 samples (10.5%)

Optimization Recommendations:
───────────────────────────────────────────────────
• json_parse (23.1%): Primary bottleneck
  → Consider: simdjson, pre-parsed schemas

• hash_compute (12.6%): Hash function overhead
  → Consider: faster hash (xxhash), perfect hashing

Artifacts saved:
  Profile: profiles/web_server_2025-01-27.data
  Flamegraph: reports/web_server_2025-01-27.svg
  Folded: reports/web_server_2025-01-27.folded

Solution Architecture

Workflow Pipeline

┌────────────────────────────────────────────────────────────────┐
│                        Profile Capture                         │
│  perf record -F 99 -g --call-graph dwarf -o perf.data -- cmd   │
└───────────────────────────────┬────────────────────────────────┘
                                │
                                ▼
┌────────────────────────────────────────────────────────────────┐
│                       Symbol Resolution                        │
│  perf script > perf.script (resolve addresses to symbols)      │
└───────────────────────────────┬────────────────────────────────┘
                                │
                                ▼
┌────────────────────────────────────────────────────────────────┐
│                        Stack Collapsing                        │
│  stackcollapse-perf.pl perf.script > collapsed.txt             │
└───────────────────────────────┬────────────────────────────────┘
                                │
           ┌────────────────────┼──────────────────┐
           │                    │                  │
           ▼                    ▼                  ▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│   Flamegraph     │ │   Top-N Report   │ │   Diff Report    │
│   Generation     │ │   Generation     │ │   Generation     │
│   flamegraph.pl  │ │   awk/sort       │ │   difffolded.pl  │
└──────────────────┘ └──────────────────┘ └──────────────────┘

Profiling Workflow Pipeline

Key Components

1. Profile Capture Script

Handles perf invocation with correct flags for symbol resolution:

#!/bin/bash
# Ensure debug symbols are available
# Use DWARF for accurate stack unwinding
# Sample at 99 Hz to avoid timer aliasing
#
# bash does not allow trailing comments after a line-continuation
# backslash, so the flags are documented here instead:
#   -F 99                99 samples/sec
#   -g                   capture call graphs
#   --call-graph dwarf   use DWARF for unwinding
#   -- "$@"              command to profile

perf record \
    -F 99 \
    -g \
    --call-graph dwarf \
    -o "$output_dir/perf.data" \
    -- "$@"

2. Symbolization Pipeline

Converts raw addresses to function names:

perf script -i perf.data > perf.script
stackcollapse-perf.pl perf.script > collapsed.txt

3. Flamegraph Generator

Creates interactive SVG:

# --colors selects the palette (e.g. java, hot, mem); comments cannot follow
# the line-continuation backslashes, so the options are documented here.
flamegraph.pl \
    --title "CPU Profile: $workload" \
    --subtitle "Duration: ${duration}s, Samples: $total" \
    --width 1200 \
    --colors java \
    collapsed.txt > flamegraph.svg

4. Analysis Engine

Parses collapsed stacks to extract hotspots:

# Extract leaf functions (actual CPU consumers)
awk -F';' '{print $NF}' collapsed.txt |
    awk '{sum[$1]+=$2} END {for(k in sum) print sum[k],k}' |
    sort -rn | head -20

Phased Implementation Guide

Phase 1: Basic Profiling (Days 1-2)

Goal: Capture a profile and generate a flamegraph for a test workload.

Steps:

  1. Install flamegraph scripts from GitHub
  2. Create a CPU-intensive test workload (e.g., recursive Fibonacci)
  3. Compile with debug symbols (-g -fno-omit-frame-pointer)
  4. Run perf record and generate flamegraph
  5. Open SVG in browser and explore interactively

Validation: Flamegraph shows expected function hierarchy.
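
An end-to-end pass over a toy workload might look like the following sketch (fib.c and the FlameGraph checkout path are assumptions):

# Build a CPU-heavy toy workload with symbols and frame pointers
gcc -O2 -g -fno-omit-frame-pointer -o fib fib.c

# Sample at 99 Hz with call graphs; writes perf.data in the current directory
perf record -F 99 -g -- ./fib 42

# Convert the profile into an interactive flamegraph
perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > fib.svg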

Phase 2: Symbol Resolution (Days 3-4)

Goal: Ensure all functions are properly named, not [unknown].

Steps:

  1. Investigate [unknown] symbols (usually missing debug info)
  2. Add -fno-omit-frame-pointer to compiler flags
  3. Use --call-graph dwarf for better unwinding
  4. Handle stripped binaries with external debuginfo

Validation: Less than 5% of samples show [unknown].
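
One way to measure the validation target, assuming collapsed stacks are already in collapsed.txt (a sketch):

# Percentage of samples whose stack contains at least one [unknown] frame
awk '{total += $NF; if (/\[unknown\]/) unknown += $NF}
     END {if (total) printf "unknown frames: %.1f%% of samples\n", 100 * unknown / total}' collapsed.txt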

Phase 3: Analysis Report (Days 5-7)

Goal: Automate hotspot extraction and report generation.

Steps:

  1. Parse collapsed stack format
  2. Calculate sample percentages per function
  3. Identify call hierarchies (parent-child relationships)
  4. Generate human-readable report with recommendations

Validation: Report correctly identifies known hotspots.
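
A starting point for the percentage calculation, assuming symbol names contain no spaces (a sketch over stackcollapse-perf.pl output):

# Sum samples per leaf function and print the ten largest with percentages
awk -F';' '{split($NF, a, " "); sum[a[1]] += a[2]; total += a[2]}
     END {for (f in sum) printf "%6.1f%%  %s\n", 100 * sum[f] / total, f}' collapsed.txt |
    sort -rn | head -10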

Phase 4: Differential Profiling (Days 8-10)

Goal: Compare before/after profiles to show optimization impact.

Steps:

  1. Capture baseline profile
  2. Make code change
  3. Capture new profile
  4. Use difffolded.pl to generate differential flamegraph
  5. Report which functions improved/regressed

Validation: Synthetic optimization shows expected diff.
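
The FlameGraph repository ships difffolded.pl for exactly this step; a minimal sketch, assuming before.folded and after.folded are stackcollapse-perf.pl outputs from the two runs:

# In the resulting SVG, red frames gained samples and blue frames lost them.
# Keep duration and sample rate identical so the counts are comparable.
./FlameGraph/difffolded.pl before.folded after.folded |
    ./FlameGraph/flamegraph.pl --title "Differential: before vs after" > diff.svg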

Phase 5: Automation and Storage (Days 11-14)

Goal: Create repeatable workflow with artifact storage.

Steps:

  1. Bundle into shell script or Makefile
  2. Add timestamp-based output directories
  3. Implement profile comparison against baselines
  4. Add CI integration hooks

Validation: Workflow runs unattended and produces consistent artifacts.
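
A possible convention for step 2, timestamped artifact directories with a stable pointer to the newest run (a sketch; the layout and $workload_name variable are assumptions):

# One directory per run; "latest" always points at the most recent capture
stamp="$(date +%Y-%m-%d_%H%M%S)"
outdir="profiles/${workload_name}/${stamp}"
mkdir -p "$outdir"
ln -sfn "$outdir" "profiles/${workload_name}/latest"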


Testing Strategy

Synthetic Workload Tests

  1. Known distribution: Create workload where time is 50% A, 30% B, 20% C
    • Verify flamegraph widths match expected ratios
  2. Deep call stack: Create 20-level recursion
    • Verify all levels appear in flamegraph
  3. Inlined functions: Force inlining and verify impact
    • Inlined functions should not appear as separate bars

Cross-Validation

  1. Compare with perf top: Live hotspots should match flamegraph
  2. Compare with gprof: Different profiler, similar results
  3. Manual inspection: Insert known delays and verify attribution

Edge Cases

  1. Multi-threaded: Verify thread attribution is correct
  2. Short-lived processes: Capture startup and teardown
  3. Kernel time: Enable kernel profiling and verify syscall visibility

Common Pitfalls and Debugging

Pitfall 1: All Samples Show [unknown]

Symptom: Flamegraph is full of [unknown] instead of function names.

Causes and Solutions:

  1. Missing debug symbols: Recompile with -g
  2. Frame pointer omitted: Add -fno-omit-frame-pointer
  3. Stripped binary: Use --call-graph dwarf instead of fp
  4. JIT code: Need special handling (perf-map-agent for Java)

Pitfall 2: Flamegraph Shows Wrong Time Attribution

Symptom: Known-slow function doesn't appear hot.

Causes:

  1. Function is inlined: Check assembly with objdump -d
  2. Time is in kernel: Run perf with kernel visibility
  3. I/O wait: CPU profiling misses blocking time

Debug: Compare wall-clock time vs CPU time for the function.

Pitfall 3: Profile Is Too Noisy

Symptom: Different runs show different hotspots.

Solutions:

  1. Increase sample count (longer duration)
  2. Run workload at steady state, skip startup
  3. Pin CPU and control for frequency scaling
  4. Average multiple profiles

Pitfall 4: Flamegraph SVG Won't Open

Symptom: Browser shows blank page or error.

Causes:

  1. Empty collapsed stacks: Check intermediate files
  2. Invalid characters in function names: Sanitize C++ mangled names
  3. SVG too large: Reduce sample count or filter stacks

Extensions and Challenges

Extension 1: Off-CPU Profiling

Add off-CPU profiling to capture blocking time:

# Using bpftrace for off-CPU profiling
bpftrace -e '
kprobe:finish_task_switch {
    @[kstack] = count();
}'

Generate off-CPU flamegraph showing where time is spent waiting.

Extension 2: Differential Flamegraphs

Implement automatic regression detection:

  • If a function's sample percentage increases by >5%, flag as regression
  • Color regressions red in differential flamegraph
  • Send alert with function name and change magnitude

Extension 3: Memory Profiling

Extend to profile memory allocation patterns:

  • Trace allocations with uprobes on malloc (via perf probe) or sample page-fault events; perf has no built-in malloc tracepoints
  • Generate allocation flamegraph
  • Identify allocation hotspots
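
One low-friction variant is a page-fault flamegraph, which attributes first-touch memory growth to the code paths that caused it (a sketch; ./workload is a placeholder):

# Sample stacks on page faults instead of on a timer
perf record -e page-faults -g -- ./workload
perf script | ./FlameGraph/stackcollapse-perf.pl |
    ./FlameGraph/flamegraph.pl --title "Page Fault Flamegraph" --countname faults > pagefaults.svg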

Challenge: Production Profiling

Design a low-overhead profiling system suitable for production:

  • Continuous 1 Hz sampling with minimal impact
  • Automatic hotspot detection and alerting
  • Historical trend analysis

Real-World Connections

How This Applies in Production

  1. Incident Response: When latency spikes, the first question is "what's consuming CPU?" Flamegraphs answer this in seconds.

  2. Optimization Prioritization: Flamegraph width tells you where to focus. Optimizing a 2% function is wasted effort.

  3. Code Review: Attach flamegraphs to PRs for performance-sensitive changes.

  4. Capacity Planning: CPU profiles reveal efficiencyโ€”more efficient code means fewer servers.

Industry Practices

  • Netflix: Generates flamegraphs for every service continuously
  • Uber: Uses differential flamegraphs in their CI/CD pipeline
  • Java shops: Rely on async-profiler for low-overhead JVM profiling with flamegraph output

Self-Assessment Checklist

Before considering this project complete, verify:

  • You can explain what sample rate means and how to choose it
  • You understand why [unknown] symbols appear and how to fix them
  • You can read a flamegraph and identify the primary bottleneck
  • You know the difference between CPU time and wall-clock time
  • You can generate a differential flamegraph showing optimization impact
  • Your workflow can be automated and run in CI
  • You can cross-validate findings with a second profiling tool

Resources

Essential Reading

  • "Systems Performance" by Brendan Gregg, Chapter 6: CPUs
  • Brendan Gregg's Flamegraph documentation: https://www.brendangregg.com/flamegraphs.html
  • "Performance Analysis and Tuning on Modern CPUs" by Denis Bakhvalov

Tools

  • perf: Linux performance counters tool
  • FlameGraph: https://github.com/brendangregg/FlameGraph
  • async-profiler: Low-overhead Java profiler with flamegraph output
  • speedscope: Web-based flamegraph viewer

Reference

  • perf wiki: https://perf.wiki.kernel.org/
  • Intel VTune (alternative commercial profiler)