LEARN PERFORMANCE MONITORING C DEEP DIVE
Learn Performance Monitoring & Benchmarking in C
Goal: To deeply understand how to measure, analyze, and optimize the performance of C programs by interacting directly with the operating system and CPU hardware.
Why Learn Performance in C?
In many languages, performance analysis is done with high-level profilers that hide the underlying complexity. In C, you have the power—and the necessity—to measure performance at its source. Learning to benchmark and monitor in C teaches you how computers actually work. You’ll move beyond just “making it work” to understanding why it’s fast or slow.
This knowledge is fundamental to systems programming, embedded development, game engines, high-frequency trading, and any other domain where performance is critical. You will learn to see the code not just as logic, but as a series of operations with a real cost in time, memory, and CPU cycles.
After completing these projects, you will:
- Accurately measure wall-clock time, CPU time, and memory usage.
- Build robust benchmarking harnesses that produce statistically sound results.
- Understand the impact of the CPU cache hierarchy on performance.
- Use OS-specific APIs to track memory allocations, page faults, and system calls.
- Tap into hardware performance counters to measure CPU instructions, cycles, and branch mispredictions.
- Build your own simple profiler from first principles.
Core Concept Analysis
The Performance Measurement Pyramid
Understanding performance requires looking at different layers of the system. We can’t just measure “speed”; we must be more specific.
┌──────────────────────────────────────────────────┐
│ Application Logic │
│ (Algorithm complexity, data structures) │
├──────────────────────────────────────────────────┤
│ ▲ Operating System │
│ │ (System calls, memory management, scheduling) │
├─┼────────────────────────────────────────────────┤
│ │ ▲ CPU & Memory Hierarchy │
│ │ │ (Instructions, cycles, cache misses, RAM) │
└─┴─┴────────────────────────────────────────────────┘
A good performance engineer knows how to measure at all three levels. A change at the bottom (e.g., better cache usage) can have a huge impact at the top.
Key Metrics & How to Get Them in C
- Time:
  - Wall-Clock Time: Real-world elapsed time, measured with `clock_gettime(CLOCK_MONOTONIC, ...)`. This is what the user experiences.
  - CPU Time: Time the CPU spent executing your code, measured with `clock_gettime(CLOCK_PROCESS_CPUTIME_ID, ...)` or `getrusage()`. It's broken down into:
    - User Time: CPU executing your application's code.
    - System Time: CPU executing kernel code on behalf of your application (e.g., during a `read()` system call).
  - Comparing Wall Time to CPU Time tells you how much time your program spent waiting (e.g., for I/O).
- Memory:
  - Resident Set Size (RSS): The portion of your program's memory currently held in RAM. A key indicator of its physical memory footprint.
  - Heap Usage: How much memory has been allocated via `malloc`. Requires instrumentation to track precisely.
  - Page Faults: When the CPU tries to access a memory page that isn't in RAM. A major page fault means the OS has to load it from disk, which is very slow. A minor page fault is cheaper (e.g., the page is elsewhere in RAM). Measured with `getrusage()`.
- Hardware Performance Counters (PMCs):
  - Modern CPUs have special registers that can count hardware-level events. These provide the deepest insights.
  - Instructions Executed: A measure of the work done.
  - CPU Cycles: How many clock cycles were spent. The ratio of Instructions Per Cycle (IPC) is a critical measure of CPU efficiency. An IPC < 1 suggests the CPU is often stalled, waiting for memory.
  - Cache Misses: How often the CPU needed data that wasn't in its fast L1/L2/L3 caches. High cache miss counts are a primary cause of poor performance.
  - Branch Mispredictions: How often the CPU guessed the wrong way on an `if`/`else` statement, forcing it to flush its pipeline and restart.
  - On Linux, these are accessed via the `perf_event_open()` system call.
Project List
These projects will progressively build your skills from basic timing to advanced hardware-level analysis. All projects are designed for a Linux environment, as it provides the most accessible low-level APIs.
Project 1: A High-Precision time Command Clone
- File: LEARN_PERFORMANCE_MONITORING_C_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 1: Beginner
- Knowledge Area: Time Measurement / Process Management
- Software or Tool: `fork`, `exec`, `wait4`, `getrusage`
- Main Book: "The Linux Programming Interface" by Michael Kerrisk
What you’ll build: A command-line utility that, like the standard time command, takes another command as an argument, runs it, and prints a detailed report of the wall-clock time, user CPU time, and system CPU time it consumed.
Why it teaches performance: It’s the “Hello, World!” of performance monitoring. It forces you to learn how to spawn and manage child processes and how to use the OS’s fundamental tool for resource measurement: getrusage().
Core challenges you’ll face:
- Spawning a child process → maps to using `fork()` and `execvp()` correctly
- Waiting for the child to complete → maps to using `wait4()` to get the exit status and resource usage
- Measuring wall-clock time → maps to using `clock_gettime(CLOCK_MONOTONIC, ...)` before the `fork` and after the `wait4`
- Extracting CPU time from `rusage` → maps to understanding the `timeval` struct and converting `tv_sec` and `tv_usec` to a single value
Key Concepts:
- Process Management: "The Linux Programming Interface" Ch. 24-27
- `getrusage()`: `man 2 getrusage`
- High-Resolution Timers: `man 2 clock_gettime`
Difficulty: Beginner
Time estimate: Weekend
Prerequisites: Basic C programming, command line usage.
Real world outcome: A working replacement for a standard system utility.
$ ./my_time ls -l /
... (output of ls) ...
--- Performance Report ---
Wall-clock time: 0.052 s
User CPU time: 0.011 s
System CPU time: 0.039 s
Implementation Hints:
- Record the monotonic start time using `clock_gettime`.
- Use `fork()` to create a child process.
- In the child process: use `execvp()` to replace the child's image with the command provided by the user (e.g., `ls`).
- In the parent process: use `wait4(-1, &status, 0, &usage_struct)`. This specific `wait` variant is crucial because it populates the `rusage` struct for the completed child.
- Record the monotonic end time using `clock_gettime`. The difference is your wall time.
- The `rusage` struct filled by `wait4` will contain `ru_utime` (user time) and `ru_stime` (system time). These are `timeval` structs. Convert them to a floating-point number of seconds and print them.
Learning milestones:
- You can run a command like `ls` and see its output → You've mastered `fork`/`exec`.
- You can print the correct wall-clock time → You're using `clock_gettime` correctly.
- You can print the user and system time → You've successfully used `wait4` and interpreted the `rusage` data.
- Your tool's output roughly matches the system's `time` command → You've built a correct and useful utility.
Project 2: A malloc / free Instrumentation Library
- File: LEARN_PERFORMANCE_MONITORING_C_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: C++
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: Memory Profiling / Linker Tricks
- Software or Tool: `gcc`, `ld` (`--wrap` flag), `dlsym`
- Main Book: "Expert C Programming" by Peter van der Linden
What you’ll build: A shared library that, when preloaded or linked, intercepts all calls to malloc, free, calloc, and realloc to keep detailed statistics about a program’s heap usage. On program exit, it will print a report with peak heap usage, total allocations, and any detected memory leaks.
Why it teaches performance: Heap allocation can be a major performance bottleneck. This project teaches you how to perform “instrumentation” at the linker level, a powerful technique used by many professional profiling tools. You’ll understand memory usage from the inside out.
Core challenges you’ll face:
- Intercepting standard library calls → maps to using `LD_PRELOAD` or the GCC linker's `--wrap` option
- Avoiding infinite recursion → maps to calling the real `malloc` from within your wrapper using `dlsym` or `__real_malloc`
- Storing allocation metadata → maps to maintaining a data structure (like a hash map) to track the size and location of every active allocation
- Ensuring thread safety → maps to using mutexes to protect your global statistics from race conditions in multi-threaded programs
Key Concepts:
- Linker Seams: Using the linker to replace functions. Search for "ld --wrap example".
- `LD_PRELOAD`: A Linux environment variable that loads a shared library before any others.
- Dynamic Linking Symbol Resolution: `man 3 dlsym`.
Difficulty: Advanced
Time estimate: 1-2 weeks
Prerequisites: Solid C skills, understanding of pointers, basic data structures.
Real world outcome: You can take an existing C program, run it with your library, and get a detailed memory report without changing the program’s source code.
# Compile the target program normally
$ gcc target.c -o target
# Run it with your profiler preloaded
$ LD_PRELOAD=./mem_profiler.so ./target
--- Heap Profile ---
Total allocations: 5,234
Total bytes allocated: 2.1 MB
Peak heap usage: 1.5 MB
Potential leaks: 3 allocations (48 bytes)
Implementation Hints:
Using `ld --wrap` (simpler):
- Define your wrapper functions, e.g., `void *__wrap_malloc(size_t size)`.
- Inside `__wrap_malloc`:
  - Lock a mutex.
  - Call `void *ptr = __real_malloc(size);`. The linker provides `__real_malloc`.
  - If `ptr` is not NULL, record its size and address in a global hash map. Update statistics (total allocations, current usage, peak usage).
  - Unlock the mutex.
  - Return `ptr`.
- Implement `__wrap_free` similarly, removing the allocation from your hash map.
- Use `atexit()` to register a function that prints the final report. When it runs, any allocations left in your hash map are leaks.
- Compile your main program with: `gcc main.c mem_profiler.c -Wl,-wrap,malloc,-wrap,free -o main`.
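The accounting core can be sketched like this. To keep the sketch self-contained and buildable without the `-Wl,-wrap` flags, it uses the illustrative names `prof_malloc`/`prof_free`/`prof_report` and calls the real allocator directly; in the actual library you would rename them `__wrap_malloc`/`__wrap_free` and call `__real_malloc`/`__real_free`, and look sizes up in a hash map instead of passing them to `prof_free`:

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* Global statistics, protected by a mutex for multi-threaded programs. */
pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;
size_t g_total_allocs, g_live_bytes, g_peak_bytes;

/* In the real library this is __wrap_malloc and calls __real_malloc. */
void *prof_malloc(size_t size) {
    void *ptr = malloc(size);
    if (ptr) {
        pthread_mutex_lock(&g_lock);
        g_total_allocs++;
        g_live_bytes += size;
        if (g_live_bytes > g_peak_bytes)     /* track the high-water mark */
            g_peak_bytes = g_live_bytes;
        pthread_mutex_unlock(&g_lock);
    }
    return ptr;
}

/* The real __wrap_free looks the size up in a hash map keyed by ptr. */
void prof_free(void *ptr, size_t size) {
    if (ptr) {
        pthread_mutex_lock(&g_lock);
        g_live_bytes -= size;
        pthread_mutex_unlock(&g_lock);
    }
    free(ptr);
}

/* Register with atexit(); anything still live at exit is a leak. */
void prof_report(void) {
    printf("Total allocations: %zu\n", g_total_allocs);
    printf("Peak heap usage:   %zu bytes\n", g_peak_bytes);
    printf("Leaked bytes:      %zu\n", g_live_bytes);
}
```

The hash map is the missing piece that lets `free` recover the size from just the pointer; that lookup is the heart of the real project.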
Learning milestones:
- Intercept a `malloc` call and print a message → You've mastered linker wrapping.
- Track the total number of bytes allocated → Your basic accounting works.
- Calculate the peak ("high water mark") heap usage → You are tracking the state of the heap over time.
- Detect and print a memory leak → Your metadata tracking is robust.
Project 3: A Robust Micro-Benchmarking Harness
- File: LEARN_PERFORMANCE_MONITORING_C_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: C++
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Benchmarking / Statistics
- Software or Tool: `clock_gettime`
- Main Book: "Performance Tuning for C Programmers" (online resources are better here)
What you’ll build: A reusable header file and C module that allows you to easily and reliably benchmark small functions. The harness will handle running the function many times, collecting timings, and calculating basic statistics (min, max, mean, median, stddev).
Why it teaches performance: Writing a good micro-benchmark is surprisingly hard. You will learn to fight the compiler, which tries to optimize your benchmark away, and to apply basic statistics to get meaningful results instead of noisy, single-run measurements.
Core challenges you’ll face:
- Fighting compiler optimizations → maps to learning how to prevent the compiler from optimizing out the function you’re testing
- Creating a stable measurement loop → maps to running the function thousands or millions of times to get a measurable duration
- Calculating statistics → maps to implementing mean, median, and standard deviation
- Designing a clean API → maps to using macros or function pointers to make it easy for a user to register a function to be benchmarked
Key Concepts:
- Dead Code Elimination: Compilers are smart. If the result of a function isn’t used, the function call might be removed entirely.
- Volatile Keyword: A hint to the compiler that a variable’s value can change at any time, preventing certain optimizations.
- Statistical Analysis: Why mean can be misleading and median is often better for benchmarks.
Difficulty: Intermediate
Time estimate: Weekend
Prerequisites: Project 1.
Real world outcome: A simple, powerful framework for comparing the performance of different implementations of a function.
// In your code:
BENCHMARK(sort_v1) {
// setup code
qsort(my_array, ...);
}
BENCHMARK(sort_v2) {
// setup code
radix_sort(my_array, ...);
}
// Output of the tool:
$ ./my_benchmarks
Running benchmark: sort_v1
- Iterations: 100
- Min: 15.2 ms
- Max: 19.8 ms
- Mean: 16.4 ms
- Median: 16.2 ms
Running benchmark: sort_v2
- Iterations: 100
- Min: 8.1 ms
- Max: 9.5 ms
- Mean: 8.6 ms
- Median: 8.5 ms
Implementation Hints:
- Preventing Optimization:
  - Pass the result of your benchmarked function to a simple, opaque function defined in another compilation unit. This function can just accept a pointer and do nothing. The compiler can't prove it does nothing, so it must execute the benchmarked function to get the result.
  - Alternatively, declare a global `volatile` variable and assign the result to it inside the loop.
- API Design: A macro-based approach is common and clean:
  `#define BENCHMARK(name) void bench_##name(void); /* code to register the benchmark */ void bench_##name(void)`
- Statistics:
  - Store all timing results from the loop in an array.
  - Sort the array to easily find the median (the middle element).
  - Calculate the mean and standard deviation.
Learning milestones:
- You benchmark a simple function, but the time is 0 → You’ve seen dead code elimination in action.
- You successfully benchmark the function after defeating the compiler’s optimization → You know how to force the code to run.
- You implement a harness that reports min, max, and mean → You have a working benchmark tool.
- You add median and standard deviation → You are now producing statistically robust results.
Project 4: Cache Performance Investigator
- File: LEARN_PERFORMANCE_MONITORING_C_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: CPU Architecture / Memory Hierarchy
- Software or Tool: `clock_gettime`
- Main Book: "Computer Systems: A Programmer's Perspective" by Bryant & O'Hallaron
What you’ll build: A program that measures memory access latency for different array sizes. It will demonstrate the “performance staircase”: the dramatic increase in latency as the data size exceeds the L1, then L2, then L3 caches and finally has to go to main memory (RAM).
Why it teaches performance: This project makes the abstract concept of the CPU cache tangible. You will produce a graph that is a physical signature of your own computer’s hardware. It’s the most effective way to understand why data locality is the #1 rule of performance optimization.
Core challenges you’ll face:
- Designing the access pattern → maps to writing a loop that unpredictably jumps through the array to defeat CPU prefetchers
- Iterating through different array sizes → maps to testing sizes from a few KB (fits in L1) to many MB (spills to RAM)
- Measuring average access time → maps to dividing total time by the number of memory accesses
- Finding your CPU's cache sizes → maps to using commands like `lscpu` or `sysctl` to know what to expect
Key Concepts:
- Memory Hierarchy: The L1/L2/L3/RAM pyramid.
- Data Locality: Temporal (reusing data soon) and Spatial (using data that’s nearby in memory).
- CPU Prefetcher: A hardware component that tries to predict what memory you’ll need next. Your benchmark needs to outsmart it.
Difficulty: Advanced
Time estimate: Weekend
Prerequisites: Project 3 (Benchmarking Harness).
Real world outcome: A program that generates data you can plot to create a graph like this, clearly showing the latency jumps at your CPU’s cache boundaries.
Array Size (KB) | Avg Access Time (ns)
----------------|---------------------
4 | 0.5
... | ...
32 | 0.6 (L1 boundary)
33 | 4.2
... | ...
256 | 4.5 (L2 boundary)
257 | 12.8
... | ...
8192 | 13.5 (L3 boundary)
8193 | 70.1 (RAM)
Implementation Hints:
- First, find your CPU's cache sizes. On Linux, run `lscpu | grep cache`.
- The core of the program is a loop that allocates an array of a given size `N`.
- Create a random permutation of indices `0` to `N-1`. This is your access pattern, ensuring each memory access hits a "random" location in the array and defeats the prefetcher. A common technique is pointer chasing: store the permutation in the array itself and follow it with `idx = p[idx]`, which makes every load depend on the previous one.
- Use your benchmarking harness from Project 3 to time a loop that performs millions of accesses into the array using your random permutation.
- Divide the total time by the number of accesses to get the average latency.
- Run this whole process for a range of array sizes `N`, from very small (e.g., 2 KB) to very large (e.g., 32 MB).
- Print the results in CSV format so you can easily plot them with another tool.
Learning milestones:
- You measure the access time for an array that fits in L1 cache → You have a baseline for “fast”.
- You see the first major jump in latency → You’ve identified your L1 cache boundary.
- You see subsequent jumps for L2 and L3 → You’ve mapped out your CPU’s memory hierarchy.
- You can explain the resulting graph to someone → You deeply understand the performance impact of caches.
Project 5: A Simple strace Clone using ptrace
- File: LEARN_PERFORMANCE_MONITORING_C_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 4: Expert
- Knowledge Area: OS Internals / Tracing / Debugging
- Software or Tool: `ptrace` syscall
- Main Book: "The Linux Programming Interface" by Michael Kerrisk
What you’ll build: A tool that uses the Linux ptrace system call to trace another process, printing out every system call it makes and its return value.
Why it teaches performance: System calls are a major source of overhead as they involve a context switch into the kernel. Knowing what syscalls your program makes is key to optimizing it. This project demystifies how powerful tools like strace and gdb work.
Core challenges you’ll face:
- Understanding the `ptrace` API → maps to the state machine of attaching, trapping, continuing, and detaching
- Telling the child process to be traced → maps to using `PTRACE_TRACEME`
- Catching syscalls → maps to using `PTRACE_SYSCALL` and understanding that it traps twice per syscall (entry and exit)
- Reading CPU registers → maps to using `PTRACE_GETREGS` to find the syscall number and its arguments/return value (which are stored in specific registers by convention)
- Mapping syscall numbers to names → maps to including the right header (`<sys/syscall.h>`) or creating a lookup table
Key Concepts:
- `ptrace`: The primary process tracing mechanism in Linux. `man 2 ptrace`.
- Syscall Calling Convention: On x86-64, the syscall number is in `rax`, arguments are in `rdi`, `rsi`, `rdx`, etc., and the return value is in `rax`.
- Tracer and Tracee: The two roles in a `ptrace` relationship.
Difficulty: Expert
Time estimate: 1-2 weeks
Prerequisites: Project 1, deep understanding of C and process management.
Real world outcome:
A functional, albeit simplified, strace.
$ ./my_strace echo "hello"
Syscall entry: execve(...)
Syscall exit: execve -> 0
Syscall entry: brk(...)
Syscall exit: brk -> 0x55c0a0f4a000
Syscall entry: write(1, "hello\n", 6)
hello
Syscall exit: write -> 6
Syscall entry: exit_group(0)
Implementation Hints:
- The overall structure is a `fork`, similar to Project 1.
- Tracee (child process):
  - Immediately call `ptrace(PTRACE_TRACEME, 0, NULL, NULL)`. This tells the OS that the parent will be tracing it.
  - Call `raise(SIGSTOP)` to pause itself, allowing the parent to set up options.
  - `execvp` the target program.
- Tracer (parent process):
  - `wait()` for the child's `SIGSTOP`.
  - Set `ptrace` options, e.g., `PTRACE_O_TRACESYSGOOD`, to make syscall traps easier to distinguish.
  - Enter a loop. Inside the loop:
    - Call `ptrace(PTRACE_SYSCALL, child_pid, NULL, NULL)` to let the child run until the next syscall entry or exit.
    - `wait()` for the child to trap.
    - Check the `status` from `wait`. If it's a syscall trap, use `ptrace(PTRACE_GETREGS, ...)` to read the child's registers into a struct.
    - On syscall entry, the `orig_rax` register holds the syscall number. Print it and its arguments.
    - On syscall exit, the `rax` register holds the return value. Print it.
    - Loop until the child exits.
Learning milestones:
- You successfully attach to a child process and stop it → You understand the basic `ptrace` handshake.
- You can catch the very first syscall (`execve`) → Your main tracing loop works.
- You can read the syscall number from the registers → You are successfully inspecting the tracee's state.
- You can distinguish between syscall entry and exit and print the return value → You have a fully functional syscall tracer.
Project 6: Hardware Counter Tool via perf_event_open
- File: LEARN_PERFORMANCE_MONITORING_C_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 5: Master
- Knowledge Area: CPU Architecture / Linux Kernel
- Software or Tool: `perf_event_open` syscall
- Main Book: The `perf_event_open(2)` man page and online kernel documentation
What you’ll build: A library that uses the Linux perf_event_open syscall to measure fundamental hardware events (instructions, cycles, cache misses, branch mispredictions) for a specific block of C code.
Why it teaches performance: This is the ground truth. You are completely bypassing high-level timers and asking the CPU itself about its performance. You can calculate critical metrics like IPC (Instructions Per Cycle) and cache miss rates, which are essential for serious performance engineering.
Core challenges you’ll face:
- The `perf_event_open` API → maps to its complex `perf_event_attr` struct and numerous configuration flags
- Setting up the counters → maps to choosing the right `type` and `config` values for the events you want to measure (e.g., `PERF_COUNT_HW_INSTRUCTIONS`)
- Controlling the counters → maps to using `ioctl` or `prctl` to enable, disable, and reset the counters around the code you want to measure
- Reading the results → maps to using `read()` on the file descriptor returned by the syscall
- Handling groups of events → maps to measuring multiple events simultaneously, with a group leader
Key Concepts:
- Performance Monitoring Unit (PMU): The hardware unit inside the CPU that does the counting.
- `perf_event_open(2)`: The Linux syscall that is the gateway to the PMU.
- Instructions Per Cycle (IPC): The ratio of instructions to cycles, a key measure of CPU efficiency: `IPC = instructions / cycles`.
Difficulty: Master
Time estimate: 1-2 weeks
Prerequisites: Project 3, strong C skills, comfortable reading man pages.
Real world outcome: A header file and library that let you write code like this:
// Define the events you want to measure
perf_event_t events[] = {
{PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS},
{PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES},
{PERF_TYPE_HARDWARE, PERF_COUNT_HW_CACHE_MISSES},
};
// Start counting
perf_start(events, 3);
// --- Code to be measured ---
my_complex_algorithm();
// -------------------------
// Stop counting and get results
uint64_t results[3];
perf_stop(events, results);
printf("Instructions: %lu\n", results[0]);
printf("Cycles: %lu\n", results[1]);
printf("Cache Misses: %lu\n", results[2]);
printf("IPC: %.2f\n", (double)results[0] / results[1]);
Implementation Hints:
- Your `perf_start` function will be the most complex. It will need to:
  - Loop through the requested events.
  - For each event, fill a `perf_event_attr` struct. Set `type`, `config`, `disabled = 1`, `inherit = 1`, and `exclude_kernel = 1`.
  - Call `syscall(__NR_perf_event_open, &attr, 0, -1, group_fd, 0)`.
  - The first event's file descriptor (`fd`) becomes the `group_fd` for the rest, so they are all measured together.
  - Store the returned `fd`s.
  - Use `ioctl(group_fd, PERF_EVENT_IOC_RESET, 0)` and `ioctl(group_fd, PERF_EVENT_IOC_ENABLE, 0)` to start the counting.
- Your `perf_stop` function will:
  - Use `ioctl(group_fd, PERF_EVENT_IOC_DISABLE, 0)` to stop counting.
  - `read()` from each `fd` to get the 64-bit count.
  - `close()` all the file descriptors.
Learning milestones:
- You successfully open a file descriptor for a single event (e.g., cycles) → You've mastered the basic `perf_event_open` call.
- You can reset, enable, and disable the counter around a block of code → You can control the measurement interval.
- You can read a counter and get a non-zero value → You are successfully getting data from the CPU.
- You can measure multiple events simultaneously and calculate IPC → You've built a complete, powerful performance analysis tool.
Project 7: A Basic Sampling Profiler
- File: LEARN_PERFORMANCE_MONITORING_C_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 5: Master
- Knowledge Area: OS Internals / Profiling / Signals
- Software or Tool: `setitimer`, `sigaction`, `libbacktrace` or `libunwind`
- Main Book: "Computer Systems: A Programmer's Perspective" (for linking/loading concepts)
What you’ll build: A simple sampling profiler. One process (the profiler) will launch a target program (the “profilee”). Every N milliseconds, the profiler will send a signal to the profilee, causing it to pause. The profilee’s signal handler will then record its own instruction pointer (RIP) and maybe a stack trace, storing the results in a shared data structure.
Why it teaches performance: This demystifies how tools like perf record or gprof work. You learn that profiling isn’t magic; it’s just periodically stopping a program and asking, “What are you doing right now?”. By doing this many times, you get a statistical picture of where the program is spending its time.
Core challenges you’ll face:
- Controlling another process → maps to using `fork`/`exec` and sending signals with `kill()`
- Handling signals asynchronously → maps to writing a safe and correct signal handler using `sigaction`
- Setting up a periodic timer → maps to using `setitimer` to receive a `SIGPROF` or `SIGALRM` at a regular interval (e.g., every 10 ms)
- Capturing the instruction pointer → maps to accessing the `ucontext_t` struct passed to an advanced signal handler
- Mapping addresses to function names → maps to the difficult problem of symbol resolution. You might shell out to `addr2line` for a simple solution, or try to parse the ELF file's symbol table yourself for an advanced version.
Key Concepts:
- Signal Handling: `man 7 signal` and `man 2 sigaction`.
- Interval Timers: `man 2 setitimer`.
- `ucontext_t`: The structure that holds the complete state of a thread (registers, etc.) at the moment a signal is delivered.
- Symbol Tables & DWARF Debug Info: The data inside an executable that maps addresses to function and line numbers.
Difficulty: Master
Time estimate: 2-3 weeks
Prerequisites: All previous projects, especially ptrace and signals.
Real world outcome: A tool that can give you a basic performance report for any compiled program.
$ ./my_profiler ./my_cpu_intensive_app
Profiling for 5 seconds...
--- Profile Report ---
Samples: 498
Top 5 Functions:
1. calculate_pi - 250 samples (50.2%)
2. matrix_multiply - 150 samples (30.1%)
3. update_display - 50 samples (10.0%)
4. read_input - 25 samples (5.0%)
5. main - 5 samples (1.0%)
Implementation Hints:
This can be designed in two ways:
- Two-Process Model (like `strace`): A tracer process `fork`s a child and periodically sends it a signal. This is harder because reading the RIP from another process is complex.
- Self-Profiling Model (easier): The program profiles itself.
  - In `main`, set up a signal handler for `SIGPROF` using `sigaction`. Your `sa_flags` must include `SA_SIGINFO` to get the `ucontext_t`.
  - The handler's job is to be very fast: get the `RIP` from the `ucontext_t` (`uc_mcontext.gregs[REG_RIP]`), store it in a global array of samples, and return.
  - In `main`, set up an interval timer: `setitimer(ITIMER_PROF, ...)` to deliver a `SIGPROF` every 10 ms.
  - Let your program do its work.
  - Before exiting, stop the timer.
  - Process the array of samples. For each address, use `addr2line -e <executable_name> <address>` to resolve it to a function name. Count the occurrences of each name and print the histogram.
Learning milestones:
- You set up a signal handler that fires once → You understand `sigaction`.
- You set up a timer that fires the handler 10 times per second → You understand `setitimer`.
- The handler successfully captures the instruction pointer → You can inspect the program's state.
- You can resolve a captured address to a function name → You've connected the profiler to the source code.
- You generate a correct histogram of where time was spent → You have built a working sampling profiler.
Project Comparison Table
| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| 1. `time` Command Clone | Beginner | Weekend | OS Basics | 3/5 |
| 2. `malloc`/`free` Profiler | Advanced | 1-2 weeks | Memory / Linker | 5/5 |
| 3. Micro-Benchmarking Harness | Intermediate | Weekend | Statistics / Compilers | 4/5 |
| 4. Cache Performance Investigator | Advanced | Weekend | CPU Caches | 5/5 |
| 5. `strace` Clone using `ptrace` | Expert | 1-2 weeks | OS Tracing | 5/5 |
| 6. Hardware Counter Tool | Master | 1-2 weeks | CPU Hardware | 5/5 |
| 7. Basic Sampling Profiler | Master | 2-3 weeks | Profiling Theory | 5/5 |
Recommendation
For a comprehensive journey into C performance analysis:
- Start with Project 1: `time` Command Clone. It teaches the fundamental OS APIs (`fork`, `wait4`, `getrusage`) that everything else builds upon.
- Next, build Project 3: A Robust Micro-Benchmarking Harness. This is an incredibly practical tool you will reuse for all subsequent projects, and it forces you to learn how to measure correctly.
- With your harness, tackle Project 4: Cache Performance Investigator. This project provides the most "a-ha!" moments per line of code, as it makes the memory hierarchy visible and tangible.
- After that, your path depends on your interest:
  - For memory performance, build Project 2: `malloc`/`free` Profiler.
  - For OS-level interaction, build Project 5: `strace` Clone.
  - For the ultimate low-level truth, challenge yourself with Project 6: Hardware Counter Tool.
Finishing with Project 7: Basic Sampling Profiler will tie all the concepts of process control, signals, and state inspection together into a powerful capstone project.
Summary
- Project 1: A High-Precision `time` Command Clone: C
- Project 2: A `malloc`/`free` Instrumentation Library: C
- Project 3: A Robust Micro-Benchmarking Harness: C
- Project 4: Cache Performance Investigator: C
- Project 5: A Simple `strace` Clone using `ptrace`: C
- Project 6: Hardware Counter Tool via `perf_event_open`: C
- Project 7: A Basic Sampling Profiler: C