Learn Performance Monitoring & Benchmarking in C

Goal: To deeply understand how to measure, analyze, and optimize the performance of C programs by interacting directly with the operating system and CPU hardware.


Why Learn Performance in C?

In many languages, performance analysis is done with high-level profilers that hide the underlying complexity. In C, you have the power—and the necessity—to measure performance at its source. Learning to benchmark and monitor in C teaches you how computers actually work. You’ll move beyond just “making it work” to understanding why it’s fast or slow.

This knowledge is fundamental to systems programming, embedded development, game engines, high-frequency trading, and any other domain where performance is critical. You will learn to see the code not just as logic, but as a series of operations with a real cost in time, memory, and CPU cycles.

After completing these projects, you will:

  • Accurately measure wall-clock time, CPU time, and memory usage.
  • Build robust benchmarking harnesses that produce statistically sound results.
  • Understand the impact of the CPU cache hierarchy on performance.
  • Use OS-specific APIs to track memory allocations, page faults, and system calls.
  • Tap into hardware performance counters to measure CPU instructions, cycles, and branch mispredictions.
  • Build your own simple profiler from first principles.

Core Concept Analysis

The Performance Measurement Pyramid

Understanding performance requires looking at different layers of the system. We can’t just measure “speed”; we must be more specific.

┌──────────────────────────────────────────────────┐
│               Application Logic                  │
│  (Algorithm complexity, data structures)         │
├──────────────────────────────────────────────────┤
│ ▲              Operating System                  │
│ │ (System calls, memory management, scheduling)  │
├─┼────────────────────────────────────────────────┤
│ │ ▲          CPU & Memory Hierarchy              │
│ │ │ (Instructions, cycles, cache misses, RAM)    │
└─┴─┴──────────────────────────────────────────────┘

A good performance engineer knows how to measure at all three levels. A change at the bottom (e.g., better cache usage) can have a huge impact at the top.

Key Metrics & How to Get Them in C

  1. Time:
    • Wall-Clock Time: Real-world elapsed time. Measured with clock_gettime(CLOCK_MONOTONIC, ...). This is what the user experiences.
    • CPU Time: Time the CPU spent executing your code. It’s broken down into:
      • User Time: CPU executing your application’s code.
      • System Time: CPU executing kernel code on behalf of your application (e.g., during a read() system call).
    • Measured with clock_gettime(CLOCK_PROCESS_CPUTIME_ID, ...) or getrusage(). Comparing Wall Time to CPU Time tells you how much time your program spent waiting (e.g., for I/O); see the timing sketch after this list.
  2. Memory:
    • Resident Set Size (RSS): The portion of your program’s memory currently held in RAM. A key indicator of its physical memory footprint.
    • Heap Usage: How much memory has been allocated via malloc. Requires instrumentation to track precisely.
    • Page Faults: When the CPU tries to access a memory page that isn’t in RAM. A major page fault means the OS has to load it from disk, which is very slow. A minor page fault is cheaper (e.g., the memory is elsewhere in RAM). Measured with getrusage().
  3. Hardware Performance Counters (PMCs):
    • Modern CPUs have special registers that can count hardware-level events. These provide the deepest insights.
    • Instructions Executed: A measure of the work done.
    • CPU Cycles: How many clock cycles were spent. The ratio Instructions Per Cycle (IPC) is a critical measure of CPU efficiency. An IPC < 1 suggests the CPU is often stalled, waiting for memory.
    • Cache Misses: How often the CPU needed data that wasn’t in its fast L1/L2/L3 caches. High cache misses are a primary cause of poor performance.
    • Branch Mispredictions: How often the CPU guessed the wrong way on an if/else statement, forcing it to flush its pipeline and restart.
    • On Linux, these are accessed via the perf_event_open() system call.
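
To make the time metrics above concrete, here is a minimal sketch that measures the same piece of work with both a wall clock and a process CPU clock. The busy-work loop and iteration count are arbitrary stand-ins; on very old glibc versions you may also need to link with -lrt.

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

static volatile double sink;            /* keeps the work from being optimized away */

static void busy_work(void) {           /* arbitrary stand-in for real work */
    double acc = 0.0;
    for (long i = 1; i < 50 * 1000 * 1000; i++)
        acc += 1.0 / (double)i;
    sink = acc;
}

static double diff_sec(struct timespec a, struct timespec b) {
    return (double)(b.tv_sec - a.tv_sec) + (double)(b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void) {
    struct timespec wall0, wall1, cpu0, cpu1;

    clock_gettime(CLOCK_MONOTONIC, &wall0);          /* wall-clock start */
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu0);  /* CPU-time start   */

    busy_work();

    clock_gettime(CLOCK_MONOTONIC, &wall1);
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu1);

    /* A large gap between the two means time spent waiting, not computing. */
    printf("Wall time: %.3f s\n", diff_sec(wall0, wall1));
    printf("CPU time:  %.3f s\n", diff_sec(cpu0, cpu1));
    return 0;
}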

Project List

These projects will progressively build your skills from basic timing to advanced hardware-level analysis. All projects are designed for a Linux environment, as it provides the most accessible low-level APIs.


Project 1: A High-Precision time Command Clone

  • File: LEARN_PERFORMANCE_MONITORING_C_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 1: Beginner
  • Knowledge Area: Time Measurement / Process Management
  • Software or Tool: fork, exec, wait4, getrusage
  • Main Book: “The Linux Programming Interface” by Michael Kerrisk

What you’ll build: A command-line utility that, like the standard time command, takes another command as an argument, runs it, and prints a detailed report of the wall-clock time, user CPU time, and system CPU time it consumed.

Why it teaches performance: It’s the “Hello, World!” of performance monitoring. It forces you to learn how to spawn and manage child processes and how to use the OS’s fundamental tool for resource measurement: getrusage().

Core challenges you’ll face:

  • Spawning a child process → maps to using fork() and execvp() correctly
  • Waiting for the child to complete → maps to using wait4() to get the exit status and resource usage
  • Measuring wall-clock time → maps to using clock_gettime(CLOCK_MONOTONIC, ...) before the fork and after the wait4
  • Extracting CPU time from rusage → maps to understanding the timeval struct and converting tv_sec and tv_usec to a single value

Key Concepts:

  • Process Management: “The Linux Programming Interface” Ch. 24-27
  • getrusage(): man 2 getrusage
  • High-Resolution Timers: man 2 clock_gettime

Difficulty: Beginner Time estimate: Weekend Prerequisites: Basic C programming, command line usage.

Real world outcome: A working replacement for a standard system utility.

$ ./my_time ls -l /
... (output of ls) ...

--- Performance Report ---
Wall-clock time: 0.052 s
User CPU time:   0.011 s
System CPU time: 0.039 s

Implementation Hints:

  1. Record the monotonic start time using clock_gettime.
  2. Use fork() to create a child process.
  3. In the child process: use execvp() to replace the child’s image with the command provided by the user (e.g., ls).
  4. In the parent process: use wait4(child_pid, &status, 0, &usage_struct). This specific wait variant is crucial as it populates the rusage struct for the completed child.
  5. Record the monotonic end time using clock_gettime. The difference is your wall time.
  6. The rusage struct filled by wait4 will contain ru_utime (user time) and ru_stime (system time). These are timeval structs. Convert them to a floating-point number of seconds and print them.
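
Putting these hints together, a minimal sketch of the whole tool; error handling is mostly omitted and the report format is up to you:

#define _DEFAULT_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/resource.h>
#include <sys/wait.h>

int main(int argc, char *argv[]) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
        return 1;
    }

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);           /* hint 1: wall-clock start */

    pid_t child_pid = fork();                          /* hint 2 */
    if (child_pid == 0) {                              /* hint 3: child becomes the command */
        execvp(argv[1], &argv[1]);
        perror("execvp");
        _exit(127);
    }

    int status;
    struct rusage usage;
    wait4(child_pid, &status, 0, &usage);              /* hint 4: fills rusage for the child */
    clock_gettime(CLOCK_MONOTONIC, &end);              /* hint 5: wall-clock end */

    double wall = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
    double user = usage.ru_utime.tv_sec + usage.ru_utime.tv_usec / 1e6;   /* hint 6 */
    double sys  = usage.ru_stime.tv_sec + usage.ru_stime.tv_usec / 1e6;

    fprintf(stderr, "\n--- Performance Report ---\n");
    fprintf(stderr, "Wall-clock time: %.3f s\n", wall);
    fprintf(stderr, "User CPU time:   %.3f s\n", user);
    fprintf(stderr, "System CPU time: %.3f s\n", sys);
    return WIFEXITED(status) ? WEXITSTATUS(status) : 1;
}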

Learning milestones:

  1. You can run a command like ls and see its output → You’ve mastered fork/exec.
  2. You can print the correct wall-clock time → You’re using clock_gettime correctly.
  3. You can print the user and system time → You’ve successfully used wait4 and interpreted getrusage data.
  4. Your tool’s output roughly matches the system’s time command → You’ve built a correct and useful utility.

Project 2: A malloc / free Instrumentation Library

  • File: LEARN_PERFORMANCE_MONITORING_C_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Memory Profiling / Linker Tricks
  • Software or Tool: gcc, ld (--wrap flag), dlsym
  • Main Book: “Expert C Programming” by Peter van der Linden

What you’ll build: A shared library that, when preloaded or linked, intercepts all calls to malloc, free, calloc, and realloc to keep detailed statistics about a program’s heap usage. On program exit, it will print a report with peak heap usage, total allocations, and any detected memory leaks.

Why it teaches performance: Heap allocation can be a major performance bottleneck. This project teaches you how to perform “instrumentation” at the linker level, a powerful technique used by many professional profiling tools. You’ll understand memory usage from the inside out.

Core challenges you’ll face:

  • Intercepting standard library calls → maps to using LD_PRELOAD or the GCC linker’s --wrap option
  • Avoiding infinite recursion → maps to calling the real malloc from within your wrapper using dlsym or __real_malloc
  • Storing allocation metadata → maps to maintaining a data structure (like a hash map) to track the size and location of every active allocation
  • Ensuring thread safety → maps to using mutexes to protect your global statistics from race conditions in multi-threaded programs

Key Concepts:

  • Linker Seams: Using the linker to replace functions. Search for “ld --wrap example”.
  • LD_PRELOAD: A Linux environment variable to load a shared library before any others.
  • Dynamic Linking Symbol Resolution: man 3 dlsym.

Difficulty: Advanced Time estimate: 1-2 weeks Prerequisites: Solid C skills, understanding of pointers, basic data structures.

Real world outcome: You can take an existing C program, run it with your library, and get a detailed memory report without changing the program’s source code.

# Compile the target program normally
$ gcc target.c -o target

# Run it with your profiler preloaded
$ LD_PRELOAD=./mem_profiler.so ./target

--- Heap Profile ---
Total allocations:     5,234
Total bytes allocated: 2.1 MB
Peak heap usage:       1.5 MB
Potential leaks:       3 allocations (48 bytes)

Implementation Hints:

Using ld --wrap (simpler):

  1. Define your wrapper functions, e.g., void *__wrap_malloc(size_t size).
  2. Inside __wrap_malloc:
    • Lock a mutex.
    • Call void *ptr = __real_malloc(size);. The linker provides __real_malloc.
    • If ptr is not NULL, record its size and address in a global hash map. Update statistics (total allocations, current usage, peak usage).
    • Unlock the mutex.
    • Return ptr.
  3. Implement __wrap_free similarly, removing the allocation from your hash map.
  4. Use atexit() to register a function that prints the final report. When it runs, any allocations left in your hash map are leaks.
  5. Compile your main program with: gcc main.c mem_profiler.c -Wl,-wrap,malloc,-wrap,free -o main.
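
A minimal sketch of the --wrap approach that keeps only global counters; the per-pointer hash map needed for peak usage and leak detection, the calloc/realloc wrappers, and any recursion guards are left for you to add:

/* mem_profiler.c -- sketch only: counts allocations, does not track peak or leaks.
 * Link with: gcc main.c mem_profiler.c -Wl,-wrap,malloc,-wrap,free -pthread -o main */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

void *__real_malloc(size_t size);   /* provided by the linker via -Wl,-wrap,malloc */
void  __real_free(void *ptr);

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long total_allocs, total_frees;
static unsigned long long total_bytes;

static void report(void) {
    fprintf(stderr, "--- Heap Profile ---\n");
    fprintf(stderr, "Total allocations:     %lu\n", total_allocs);
    fprintf(stderr, "Total frees:           %lu\n", total_frees);
    fprintf(stderr, "Total bytes allocated: %llu\n", total_bytes);
}

__attribute__((constructor)) static void init(void) {
    atexit(report);                 /* hint 4: print the report on exit */
}

void *__wrap_malloc(size_t size) {
    void *ptr = __real_malloc(size);
    if (ptr) {
        pthread_mutex_lock(&lock);
        total_allocs++;
        total_bytes += size;
        pthread_mutex_unlock(&lock);
    }
    return ptr;
}

void __wrap_free(void *ptr) {
    if (ptr) {
        pthread_mutex_lock(&lock);
        total_frees++;
        pthread_mutex_unlock(&lock);
    }
    __real_free(ptr);
}

The LD_PRELOAD variant shown in the outcome above works the same way, except the real functions are looked up at runtime with dlsym(RTLD_NEXT, "malloc") instead of being supplied by the linker as __real_malloc.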

Learning milestones:

  1. Intercept a malloc call and print a message → You’ve mastered the linker wrapping.
  2. Track the total number of bytes allocated → Your basic accounting works.
  3. Calculate the peak (“high water mark”) heap usage → You are tracking the state of the heap over time.
  4. Detect and print a memory leak → Your metadata tracking is robust.

Project 3: A Robust Micro-Benchmarking Harness

  • File: LEARN_PERFORMANCE_MONITORING_C_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Benchmarking / Statistics
  • Software or Tool: clock_gettime
  • Main Book: “Performance Tuning for C Programmers” (online resources are better here)

What you’ll build: A reusable header file and C module that allows you to easily and reliably benchmark small functions. The harness will handle running the function many times, collecting timings, and calculating basic statistics (min, max, mean, median, stddev).

Why it teaches performance: Writing a good micro-benchmark is surprisingly hard. You will learn to fight the compiler, which tries to optimize your benchmark away, and to apply basic statistics to get meaningful results instead of noisy, single-run measurements.

Core challenges you’ll face:

  • Fighting compiler optimizations → maps to learning how to prevent the compiler from optimizing out the function you’re testing
  • Creating a stable measurement loop → maps to running the function thousands or millions of times to get a measurable duration
  • Calculating statistics → maps to implementing mean, median, and standard deviation
  • Designing a clean API → maps to using macros or function pointers to make it easy for a user to register a function to be benchmarked

Key Concepts:

  • Dead Code Elimination: Compilers are smart. If the result of a function isn’t used, the function call might be removed entirely.
  • Volatile Keyword: A hint to the compiler that a variable’s value can change at any time, preventing certain optimizations.
  • Statistical Analysis: Why mean can be misleading and median is often better for benchmarks.

Difficulty: Intermediate Time estimate: Weekend Prerequisites: Project 1.

Real world outcome: A simple, powerful framework for comparing the performance of different implementations of a function.

// In your code:
BENCHMARK(sort_v1) {
    // setup code
    qsort(my_array, ...);
}

BENCHMARK(sort_v2) {
    // setup code
    radix_sort(my_array, ...);
}

// Output of the tool:
$ ./my_benchmarks
Running benchmark: sort_v1
  - Iterations: 100
  - Min:    15.2 ms
  - Max:    19.8 ms
  - Mean:   16.4 ms
  - Median: 16.2 ms
Running benchmark: sort_v2
  - Iterations: 100
  - Min:    8.1 ms
  - Max:    9.5 ms
  - Mean:   8.6 ms
  - Median: 8.5 ms

Implementation Hints:

  1. Preventing Optimization:
    • Pass the result of your benchmarked function to a simple, opaque function defined in another compilation unit. This function can just accept a pointer and do nothing. The compiler can’t prove it does nothing, so it must execute the benchmarked function to get the result.
    • Alternatively, declare a global volatile variable and assign the result to it inside the loop.
  2. API Design: A macro-based approach is common and clean.
    #define BENCHMARK(name) void bench_##name(void); \
        /* code to register the benchmark */ \
        void bench_##name(void)
    
  3. Statistics:
    • Store all timing results from the loop in an array.
    • Sort the array to easily find the median (the middle element).
    • Calculate the mean and standard deviation.
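
Putting these hints together, a minimal single-file sketch of the measurement loop and statistics; the toy function, run count, and millisecond units are arbitrary, and the BENCHMARK macro layer would sit on top of this:

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define RUNS 100

static volatile long sink;   /* writing results here keeps the call from being optimized out */

static long sum_to(long n) { /* toy function under test */
    long s = 0;
    for (long i = 0; i < n; i++) s += i;
    return s;
}

static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

int main(void) {
    double samples[RUNS];

    for (int r = 0; r < RUNS; r++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        sink = sum_to(10 * 1000 * 1000);              /* the measured call */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        samples[r] = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    }

    qsort(samples, RUNS, sizeof samples[0], cmp_double);  /* sorted: min, max, median */
    double mean = 0.0;
    for (int r = 0; r < RUNS; r++) mean += samples[r];
    mean /= RUNS;

    printf("Min:    %.3f ms\n", samples[0]);
    printf("Max:    %.3f ms\n", samples[RUNS - 1]);
    printf("Mean:   %.3f ms\n", mean);
    printf("Median: %.3f ms\n", samples[RUNS / 2]);
    return 0;
}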

Learning milestones:

  1. You benchmark a simple function, but the time is 0 → You’ve seen dead code elimination in action.
  2. You successfully benchmark the function after defeating the compiler’s optimization → You know how to force the code to run.
  3. You implement a harness that reports min, max, and mean → You have a working benchmark tool.
  4. You add median and standard deviation → You are now producing statistically robust results.

Project 4: Cache Performance Investigator

  • File: LEARN_PERFORMANCE_MONITORING_C_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: CPU Architecture / Memory Hierarchy
  • Software or Tool: clock_gettime
  • Main Book: “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron

What you’ll build: A program that measures memory access latency for different array sizes. It will demonstrate the “performance staircase”: the dramatic increase in latency as the data size exceeds the L1, then L2, then L3 caches and finally has to go to main memory (RAM).

Why it teaches performance: This project makes the abstract concept of the CPU cache tangible. You will produce a graph that is a physical signature of your own computer’s hardware. It’s the most effective way to understand why data locality is the #1 rule of performance optimization.

Core challenges you’ll face:

  • Designing the access pattern → maps to writing a loop that unpredictably jumps through the array to defeat CPU prefetchers
  • Iterating through different array sizes → maps to testing sizes from a few KB (fits in L1) to many MB (spills to RAM)
  • Measuring average access time → maps to dividing total time by the number of memory accesses
  • Finding your CPU’s cache sizes → maps to using commands like lscpu or sysctl to know what to expect

Key Concepts:

  • Memory Hierarchy: The L1/L2/L3/RAM pyramid.
  • Data Locality: Temporal (reusing data soon) and Spatial (using data that’s nearby in memory).
  • CPU Prefetcher: A hardware component that tries to predict what memory you’ll need next. Your benchmark needs to outsmart it.

Difficulty: Advanced Time estimate: Weekend Prerequisites: Project 3 (Benchmarking Harness).

Real world outcome: A program that generates data you can plot to create a graph like this, clearly showing the latency jumps at your CPU’s cache boundaries.

Array Size (KB) | Avg Access Time (ns)
----------------|---------------------
4               | 0.5
...             | ...
32              | 0.6  (L1 boundary)
33              | 4.2
...             | ...
256             | 4.5  (L2 boundary)
257             | 12.8
...             | ...
8192            | 13.5 (L3 boundary)
8193            | 70.1 (RAM)

Implementation Hints:

  1. First, find your CPU’s cache sizes. On Linux, run lscpu | grep cache.
  2. The core of the program is a loop that allocates an array of a given size N.
  3. Create a random permutation of indices 0 to N-1. This is your access pattern. It ensures each memory access hits a “random” location in the array, defeating the prefetcher. A common approach is to store the permutation in the array itself and chase it with idx = p[idx] in the timing loop, creating a serial dependency chain the prefetcher can’t follow (see the sketch after this list).
  4. Use your benchmarking harness from Project 3 to time a loop that performs millions of accesses into the array using your random permutation.
  5. Divide the total time by the number of accesses to get the average latency.
  6. Run this whole process for a range of array sizes N, from very small (e.g., 2 KB) to very large (e.g., 32 MB).
  7. Print the results in a CSV format so you can easily plot them with another tool.
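
A minimal sketch of the measurement core; the array sizes, access counts, and the use of rand() for shuffling are simplifications you will want to refine:

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Average load-to-use latency for one working-set size: chase a random cyclic
 * permutation so every load depends on the previous one. */
static double measure_latency(size_t n_elems, long accesses) {
    size_t *next = malloc(n_elems * sizeof *next);

    /* Sattolo shuffle: turns the identity array into a single random cycle. */
    for (size_t i = 0; i < n_elems; i++) next[i] = i;
    for (size_t i = n_elems - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;                /* 0 <= j < i keeps it one cycle */
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    struct timespec t0, t1;
    volatile size_t idx = 0;                          /* volatile: the chase can't be elided */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long a = 0; a < accesses; a++)
        idx = next[idx];                              /* serial dependency defeats prefetching */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    free(next);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / (double)accesses;
}

int main(void) {
    printf("size_kb,avg_ns\n");                       /* CSV, ready to plot */
    for (size_t kb = 2; kb <= 32 * 1024; kb *= 2) {
        size_t n = kb * 1024 / sizeof(size_t);
        printf("%zu,%.2f\n", kb, measure_latency(n, 20 * 1000 * 1000));
    }
    return 0;
}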

Learning milestones:

  1. You measure the access time for an array that fits in L1 cache → You have a baseline for “fast”.
  2. You see the first major jump in latency → You’ve identified your L1 cache boundary.
  3. You see subsequent jumps for L2 and L3 → You’ve mapped out your CPU’s memory hierarchy.
  4. You can explain the resulting graph to someone → You deeply understand the performance impact of caches.

Project 5: A Simple strace Clone using ptrace

  • File: LEARN_PERFORMANCE_MONITORING_C_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 4: Expert
  • Knowledge Area: OS Internals / Tracing / Debugging
  • Software or Tool: ptrace syscall
  • Main Book: “The Linux Programming Interface” by Michael Kerrisk

What you’ll build: A tool that uses the Linux ptrace system call to trace another process, printing out every system call it makes and its return value.

Why it teaches performance: System calls are a major source of overhead as they involve a context switch into the kernel. Knowing what syscalls your program makes is key to optimizing it. This project demystifies how powerful tools like strace and gdb work.

Core challenges you’ll face:

  • Understanding the ptrace API → maps to the state machine of attaching, trapping, continuing, and detaching
  • Telling the child process to be traced → maps to using PTRACE_TRACEME
  • Catching syscalls → maps to using PTRACE_SYSCALL and understanding that it traps twice per syscall (entry and exit)
  • Reading CPU registers → maps to using PTRACE_GETREGS to find the syscall number and its arguments/return value (which are stored in specific registers by convention)
  • Mapping syscall numbers to names → maps to including the right header (<sys/syscall.h>) or creating a lookup table

Key Concepts:

  • ptrace: The primary process tracing mechanism in Linux. man 2 ptrace.
  • Syscall Calling Convention: On x86-64, the syscall number is in rax, arguments are in rdi, rsi, rdx, etc., and the return value is in rax.
  • Tracer and Tracee: The two roles in a ptrace relationship.

Difficulty: Expert Time estimate: 1-2 weeks Prerequisites: Project 1, deep understanding of C and process management.

Real world outcome: A functional, albeit simplified, strace.

$ ./my_strace echo "hello"
Syscall entry: execve(...)
Syscall exit:  execve -> 0
Syscall entry: brk(...)
Syscall exit:  brk -> 0x55c0a0f4a000
Syscall entry: write(1, "hello\n", 6)
hello
Syscall exit:  write -> 6
Syscall entry: exit_group(0)

Implementation Hints:

  1. The overall structure is a fork, similar to Project 1.
  2. Tracee (child process):
    • Immediately call ptrace(PTRACE_TRACEME, 0, NULL, NULL). This tells the OS that the parent will be tracing it.
    • Call raise(SIGSTOP) to pause itself, allowing the parent to set up options.
    • execvp the target program.
  3. Tracer (parent process):
    • wait() for the child’s SIGSTOP.
    • Set ptrace options with PTRACE_SETOPTIONS, e.g., PTRACE_O_TRACESYSGOOD, so syscall stops can be distinguished from ordinary signals.
    • Enter a loop. Inside the loop:
      • Call ptrace(PTRACE_SYSCALL, child_pid, NULL, NULL) to let the child run until the next syscall entry or exit.
      • wait() for the child to trap.
      • Check the status from wait. If it’s a syscall trap, use ptrace(PTRACE_GETREGS, ...) to read the child’s registers into a struct.
      • On syscall entry, the orig_rax register holds the syscall number. Print it and its arguments.
      • On syscall exit, the rax register holds the return value. Print it.
      • Loop until the child exits.
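
Putting these hints together, a minimal x86-64 Linux sketch of the tracer loop; argument decoding, syscall-name lookup, and error handling are omitted:

#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <sys/user.h>
#include <sys/wait.h>

int main(int argc, char *argv[]) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
        return 1;
    }

    pid_t child = fork();
    if (child == 0) {                                  /* tracee */
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        raise(SIGSTOP);                                /* pause so the tracer can set options */
        execvp(argv[1], &argv[1]);
        _exit(127);
    }

    int status;
    waitpid(child, &status, 0);                        /* initial SIGSTOP */
    ptrace(PTRACE_SETOPTIONS, child, 0, PTRACE_O_TRACESYSGOOD);

    int entering = 1;
    for (;;) {
        ptrace(PTRACE_SYSCALL, child, NULL, NULL);     /* run to the next syscall boundary */
        waitpid(child, &status, 0);
        if (WIFEXITED(status)) break;

        /* With TRACESYSGOOD, syscall stops report SIGTRAP | 0x80. */
        if (WIFSTOPPED(status) && WSTOPSIG(status) == (SIGTRAP | 0x80)) {
            struct user_regs_struct regs;
            ptrace(PTRACE_GETREGS, child, NULL, &regs);
            if (entering)
                fprintf(stderr, "syscall %llu(...)", (unsigned long long)regs.orig_rax);
            else
                fprintf(stderr, " = %lld\n", (long long)regs.rax);
            entering = !entering;
        }
    }
    return 0;
}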

Learning milestones:

  1. You successfully attach to a child process and stop it → You understand the basic ptrace handshake.
  2. You can catch the very first syscall (execve) → Your main tracing loop works.
  3. You can read the syscall number from the registers → You are successfully inspecting the tracee’s state.
  4. You can distinguish between syscall entry and exit and print the return value → You have a fully functional syscall tracer.

Project 6: Hardware Counter Tool via perf_event_open

  • File: LEARN_PERFORMANCE_MONITORING_C_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 5: Master
  • Knowledge Area: CPU Architecture / Linux Kernel
  • Software or Tool: perf_event_open syscall
  • Main Book: perf_event_open(2) man page and online kernel documentation.

What you’ll build: A library that uses the Linux perf_event_open syscall to measure fundamental hardware events (instructions, cycles, cache misses, branch mispredictions) for a specific block of C code.

Why it teaches performance: This is the ground truth. You are completely bypassing high-level timers and asking the CPU itself about its performance. You can calculate critical metrics like IPC (Instructions Per Cycle) and cache miss rates, which are essential for serious performance engineering.

Core challenges you’ll face:

  • The perf_event_open API → maps to its complex perf_event_attr struct and numerous configuration flags
  • Setting up the counters → maps to choosing the right type and config values for the events you want to measure (e.g., PERF_COUNT_HW_INSTRUCTIONS)
  • Controlling the counters → maps to using ioctl or prctl to enable, disable, and reset the counters around the code you want to measure
  • Reading the results → maps to using read() on the file descriptor returned by the syscall
  • Handling groups of events → maps to measuring multiple events simultaneously, with a group leader

Key Concepts:

  • Performance Monitoring Units (PMU/PMCs): The hardware units inside the CPU that do the counting.
  • perf_event_open(2): The Linux syscall that is the gateway to the PMU.
  • Instructions Per Cycle (IPC): The ratio of instructions to cycles. A key measure of CPU efficiency. IPC = instructions / cycles.

Difficulty: Master Time estimate: 1-2 weeks Prerequisites: Project 3, strong C skills, comfortable reading man pages.

Real world outcome: A header file and library that let you write code like this:

// Define the events you want to measure
perf_event_t events[] = {
    {PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS},
    {PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES},
    {PERF_TYPE_HARDWARE, PERF_COUNT_HW_CACHE_MISSES},
};

// Start counting
perf_start(events, 3);

// --- Code to be measured ---
my_complex_algorithm();
// -------------------------

// Stop counting and get results
uint64_t results[3];
perf_stop(events, results);

printf("Instructions: %lu\n", results[0]);
printf("Cycles:       %lu\n", results[1]);
printf("Cache Misses: %lu\n", results[2]);
printf("IPC:          %.2f\n", (double)results[0] / results[1]);

Implementation Hints:

  1. Your perf_start function will be the most complex. It will need to:
    • Loop through the requested events.
    • For each event, fill a perf_event_attr struct. Set type, config, disabled = 1, inherit = 1, and exclude_kernel = 1.
    • Call syscall(__NR_perf_event_open, &attr, 0, -1, group_fd, 0).
    • The first event’s file descriptor (fd) becomes the group_fd for the rest, so they are all measured together.
    • Store the returned fds.
    • Use ioctl(group_fd, PERF_EVENT_IOC_RESET, 0) and ioctl(group_fd, PERF_EVENT_IOC_ENABLE, 0) to start the counting.
  2. Your perf_stop function will:
    • Use ioctl(group_fd, PERF_EVENT_IOC_DISABLE, 0) to stop counting.
    • read() from each fd to get the 64-bit count.
    • close() all the file descriptors.
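
A minimal sketch for a single hardware counter, following the hints above; event groups and the perf_start/perf_stop API are the next layer on top. On some systems you may need to lower /proc/sys/kernel/perf_event_paranoid to get non-zero counts:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* Thin wrapper: there is no glibc stub for this syscall on older systems. */
static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
                           int cpu, int group_fd, unsigned long flags) {
    return (int)syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

static volatile long sink;
static void work(void) {                               /* arbitrary code to be measured */
    long s = 0;
    for (long i = 0; i < 100 * 1000 * 1000; i++) s += i;
    sink = s;
}

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof attr;
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    int fd = perf_event_open(&attr, 0, -1, -1, 0);     /* this process, any CPU, no group */
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    work();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count;
    read(fd, &count, sizeof count);                    /* single counter: one 64-bit value */
    printf("Instructions: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}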

Learning milestones:

  1. You successfully open a file descriptor for a single event (e.g., cycles) → You’ve mastered the basic perf_event_open call.
  2. You can reset, enable, and disable the counter around a block of code → You can control the measurement interval.
  3. You can read a counter and get a non-zero value → You are successfully getting data from the CPU.
  4. You can measure multiple events simultaneously and calculate IPC → You’ve built a complete, powerful performance analysis tool.

Project 7: A Basic Sampling Profiler

  • File: LEARN_PERFORMANCE_MONITORING_C_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 5: Master
  • Knowledge Area: OS Internals / Profiling / Signals
  • Software or Tool: setitimer, sigaction, libbacktrace or libunwind
  • Main Book: “Computer Systems: A Programmer’s Perspective” (for linking/loading concepts)

What you’ll build: A simple sampling profiler. One process (the profiler) will launch a target program (the “profilee”). Every N milliseconds, the profiler will send a signal to the profilee, causing it to pause. The profilee’s signal handler will then record its own instruction pointer (RIP) and maybe a stack trace, storing the results in a shared data structure.

Why it teaches performance: This demystifies how tools like perf record or gprof work. You learn that profiling isn’t magic; it’s just periodically stopping a program and asking, “What are you doing right now?”. By doing this many times, you get a statistical picture of where the program is spending its time.

Core challenges you’ll face:

  • Controlling another process → maps to using fork/exec and sending signals with kill()
  • Handling signals asynchronously → maps to writing a safe and correct signal handler using sigaction
  • Setting up a periodic timer → maps to using setitimer to receive a SIGPROF or SIGALRM at a regular interval (e.g., every 10ms)
  • Capturing the instruction pointer → maps to accessing the ucontext_t struct passed to an advanced signal handler
  • Mapping addresses to function names → maps to the difficult problem of symbol resolution. You might shell out to addr2line for a simple solution, or try to parse the ELF file’s symbol table yourself for an advanced version.

Key Concepts:

  • Signal Handling: man 7 signal and man 2 sigaction.
  • Interval Timers: man 2 setitimer.
  • ucontext_t: The structure that holds the complete state of a thread (registers, etc.) at the moment a signal is delivered.
  • Symbol Tables & DWARF debug info: The data inside an executable that maps addresses to function and line numbers.

Difficulty: Master Time estimate: 2-3 weeks Prerequisites: All previous projects, especially ptrace and signals.

Real world outcome: A tool that can give you a basic performance report for any compiled program.

$ ./my_profiler ./my_cpu_intensive_app
Profiling for 5 seconds...
--- Profile Report ---
Samples: 498
Top 5 Functions:
1. calculate_pi        - 250 samples (50.2%)
2. matrix_multiply     - 150 samples (30.1%)
3. update_display      -  50 samples (10.0%)
4. read_input          -  25 samples (5.0%)
5. main                -   5 samples (1.0%)

Implementation Hints:

This can be designed in two ways:

  1. Two-Process Model (like strace): A tracer process forks a child, and periodically sends it a signal. This is harder because getting the RIP from another process is complex.
  2. Self-Profiling Model (easier): The program profiles itself.
    • In main, set up a signal handler for SIGPROF using sigaction. Your sa_flags must include SA_SIGINFO to get the ucontext_t.
    • The handler’s job is to be very fast: get the RIP from the ucontext_t (uc_mcontext.gregs[REG_RIP]), store it in a global array of samples, and return.
    • In main, set up an interval timer: setitimer(ITIMER_PROF, ...) to send a SIGPROF every 10ms.
    • Let your program do its work.
    • Before exiting, stop the timer.
    • Process the array of samples. For each address, use addr2line -e <executable_name> <address> to resolve it to a function name. Count the occurrences of each name and print the histogram.
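
Following the self-profiling model, a minimal x86-64 Linux sketch; the hot loop stands in for your real program, and resolving the recorded addresses with addr2line is left as the final step:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <signal.h>
#include <sys/time.h>
#include <ucontext.h>

#define MAX_SAMPLES 100000

static void *samples[MAX_SAMPLES];
static volatile int n_samples;

/* Keep the handler tiny: just record the interrupted instruction pointer. */
static void on_profile_tick(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)info;
    ucontext_t *uc = ctx;
    if (n_samples < MAX_SAMPLES)
        samples[n_samples++] = (void *)(uintptr_t)uc->uc_mcontext.gregs[REG_RIP];  /* x86-64 */
}

static volatile double sink;
static void hot_loop(void) {                       /* stand-in for the program being profiled */
    double acc = 0.0;
    for (long i = 1; i < 200 * 1000 * 1000; i++) acc += 1.0 / (double)i;
    sink = acc;
}

int main(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_profile_tick;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;         /* SA_SIGINFO gives the handler a ucontext */
    sigaction(SIGPROF, &sa, NULL);

    struct itimerval timer = {
        .it_interval = { .tv_sec = 0, .tv_usec = 10000 },   /* every 10 ms of CPU time */
        .it_value    = { .tv_sec = 0, .tv_usec = 10000 },
    };
    setitimer(ITIMER_PROF, &timer, NULL);

    hot_loop();

    struct itimerval off;
    memset(&off, 0, sizeof off);
    setitimer(ITIMER_PROF, &off, NULL);            /* stop sampling before reporting */

    printf("Collected %d samples; resolve each with: addr2line -f -e <executable> <addr>\n",
           n_samples);
    for (int i = 0; i < n_samples && i < 5; i++)
        printf("  sample %d: %p\n", i, samples[i]);
    return 0;
}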

Learning milestones:

  1. You set up a signal handler that fires once → You understand sigaction.
  2. You set up a timer that fires the handler 10 times per second → You understand setitimer.
  3. The handler successfully captures the instruction pointer → You can inspect the program’s state.
  4. You can resolve a captured address to a function name → You’ve connected the profiler to the source code.
  5. You generate a correct histogram of where time was spent → You have built a working sampling profiler.

Project Comparison Table

Project                           | Difficulty   | Time      | Depth of Understanding | Fun Factor
----------------------------------|--------------|-----------|------------------------|-----------
1. time Command Clone             | Beginner     | Weekend   | OS Basics              | 3/5
2. malloc/free Profiler           | Advanced     | 1-2 weeks | Memory / Linker        | 5/5
3. Micro-Benchmarking Harness     | Intermediate | Weekend   | Statistics / Compilers | 4/5
4. Cache Performance Investigator | Advanced     | Weekend   | CPU Caches             | 5/5
5. strace Clone using ptrace      | Expert       | 1-2 weeks | OS Tracing             | 5/5
6. Hardware Counter Tool          | Master       | 1-2 weeks | CPU Hardware           | 5/5
7. Basic Sampling Profiler        | Master       | 2-3 weeks | Profiling Theory       | 5/5

Recommendation

For a comprehensive journey into C performance analysis:

  1. Start with Project 1: time Command Clone. It teaches the fundamental OS APIs (fork, wait4, getrusage) that everything else builds upon.
  2. Next, build Project 3: A Robust Micro-Benchmarking Harness. This is an incredibly practical tool you will reuse for all subsequent projects. It forces you to learn how to measure correctly.
  3. With your harness, tackle Project 4: Cache Performance Investigator. This project provides the most “a-ha!” moment per line of code, as it makes the memory hierarchy visible and tangible.
  4. After that, your path depends on your interest.
    • For Memory Performance, build Project 2: malloc/free Profiler.
    • For OS-level interaction, build Project 5: strace Clone.
    • For the ultimate low-level truth, challenge yourself with Project 6: Hardware Counter Tool.

Finishing with Project 7: Basic Sampling Profiler will tie all the concepts of process control, signals, and state inspection together into a powerful capstone project.

Summary

  • Project 1: A High-Precision time Command Clone: C
  • Project 2: A malloc / free Instrumentation Library: C
  • Project 3: A Robust Micro-Benchmarking Harness: C
  • Project 4: Cache Performance Investigator: C
  • Project 5: A Simple strace Clone using ptrace: C
  • Project 6: Hardware Counter Tool via perf_event_open: C
  • Project 7: A Basic Sampling Profiler: C