Project 8: Profiler for Cache Misses

A simple command-line profiler that uses the Linux perf_event_open syscall to wrap another command and report on the L1/L2/L3 cache misses it generated.

Quick Reference

Attribute              Value
Primary Language       C
Alternative Languages  C++, Rust
Difficulty             Level 4: Expert
Time Estimate          2-3 weeks
Knowledge Area         Systems Programming / CPU Internals / Linux Kernel
Tooling                Linux perf_event_open syscall
Prerequisites          Strong C skills, comfort with Linux syscalls, Project 1 (for a target to profile)

What You Will Build

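A command-line tool, cache_profiler, that takes another command as its arguments, runs it as a child process, and counts hardware events for it through the Linux perf_event_open syscall. When the child exits, the tool prints the totals: L1 data cache loads and misses, L3 cache misses, instructions, and CPU cycles, in the format shown under Real-World Outcome.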

Why It Matters

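Hardware performance counters report what the CPU actually did while a program ran, which makes them a direct way to check whether a cache optimization worked. Building a small profiler on perf_event_open, the same syscall the Linux perf tool uses, exercises skills that recur across systems tooling: reading kernel documentation, filling in a large attribute struct correctly, and managing a child process.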

Core Challenges

  • Understanding the perf_event_open syscall → maps to reading kernel documentation and dealing with a complex API
  • Setting up the perf_event_attr struct → maps to specifying which hardware events to count (e.g., PERF_COUNT_HW_CACHE_MISSES)
  • Forking and executing a child process (execvp) → maps to standard Unix process management
  • Controlling the counters with ioctl → maps to enabling, disabling, and resetting counts around the child process execution
  • Reading and reporting the results → maps to getting the final numbers from the kernel
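
The five challenges above fit together roughly as follows. This is a minimal sketch under stated assumptions, not a reference implementation: it counts a single generic hardware event, and the helper names init_hw_attr and count_event_for_command are hypothetical. It assumes the counter is opened on the profiler itself with inherit set, so the forked child's events are folded into the same counter; a small amount of the profiler's own fork and wait work is counted as well.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* glibc provides no wrapper for perf_event_open, so go through syscall(2). */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu,
                            int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Hypothetical helper: describe one generic hardware event, created disabled. */
static void init_hw_attr(struct perf_event_attr *attr, uint64_t config)
{
    memset(attr, 0, sizeof(*attr));
    attr->type = PERF_TYPE_HARDWARE;
    attr->size = sizeof(*attr);
    attr->config = config;      /* e.g. PERF_COUNT_HW_CACHE_MISSES */
    attr->disabled = 1;         /* start stopped; enabled via ioctl below */
    attr->inherit = 1;          /* also count children created after this point */
    attr->exclude_kernel = 1;   /* user-space only; often required without privileges */
    attr->exclude_hv = 1;
}

/* Hypothetical helper: run argv as a child and store the event count.
 * Returns 0 on success, -1 on failure. */
int count_event_for_command(char *const argv[], uint64_t config, uint64_t *count)
{
    struct perf_event_attr attr;
    init_hw_attr(&attr, config);

    /* pid = 0, cpu = -1: this process (and inherited children) on any CPU. */
    int fd = (int)perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0) {
        perror("perf_event_open");
        return -1;
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    pid_t child = fork();
    if (child == 0) {
        execvp(argv[0], argv);
        perror("execvp");       /* only reached if exec failed */
        _exit(127);
    }
    if (child > 0)
        waitpid(child, NULL, 0);

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    int ok = child > 0 &&
             read(fd, count, sizeof(*count)) == (ssize_t)sizeof(*count);
    close(fd);
    return ok ? 0 : -1;
}
```

A main function would typically pass argv + 1 as the command, for example count_event_for_command(argv + 1, PERF_COUNT_HW_CACHE_MISSES, &misses), and print the result.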

Key Concepts

  • perf_event_open syscall: man perf_event_open is the primary source.
  • Performance Monitoring Units (PMU): Intel/AMD developer manuals.
  • fork/exec process model: “Advanced Programming in the UNIX Environment” by Stevens & Rago.
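
One detail from the man page that is easy to miss: per-level cache events use type PERF_TYPE_HW_CACHE, and their config packs three fields (cache id, operation, result) into one integer. The sketch below shows that packing and a possible event table for this project. The macro name HW_CACHE_CONFIG, the struct counter_spec, and the table itself are illustrative assumptions. The generic cache ids include L1D and LL (last-level cache, which is L3 on many CPUs but not all); there is no generic L2 id, so L2 counts would need CPU-specific raw events.

```c
#include <linux/perf_event.h>
#include <stdint.h>

/* config = cache id | (operation << 8) | (result << 16), per perf_event_open(2). */
#define HW_CACHE_CONFIG(id, op, result) \
    ((uint64_t)(id) | ((uint64_t)(op) << 8) | ((uint64_t)(result) << 16))

/* Hypothetical description of one counter the profiler opens. */
struct counter_spec {
    const char *label;
    uint32_t type;      /* goes into perf_event_attr.type */
    uint64_t config;    /* goes into perf_event_attr.config */
};

/* Illustrative table; labels are assumptions chosen for this project's report. */
const struct counter_spec counters[] = {
    { "L1d Cache Loads:", PERF_TYPE_HW_CACHE,
      HW_CACHE_CONFIG(PERF_COUNT_HW_CACHE_L1D, PERF_COUNT_HW_CACHE_OP_READ,
                      PERF_COUNT_HW_CACHE_RESULT_ACCESS) },
    { "L1d Cache Misses:", PERF_TYPE_HW_CACHE,
      HW_CACHE_CONFIG(PERF_COUNT_HW_CACHE_L1D, PERF_COUNT_HW_CACHE_OP_READ,
                      PERF_COUNT_HW_CACHE_RESULT_MISS) },
    { "L3 Cache Misses:", PERF_TYPE_HW_CACHE,   /* LL = last-level cache */
      HW_CACHE_CONFIG(PERF_COUNT_HW_CACHE_LL, PERF_COUNT_HW_CACHE_OP_READ,
                      PERF_COUNT_HW_CACHE_RESULT_MISS) },
    { "Instructions:", PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS },
    { "CPU Cycles:",   PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES },
};
```

Keeping the events in a table like this means adding a counter is a one-line change, and the open, read, and report loops can all iterate over the same array.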

Real-World Outcome

# Profile the Locality Benchmarker from Project 1
$ ./cache_profiler ./locality_benchmark

> Profiling command: ./locality_benchmark
... (output from the locality_benchmark) ...
> Profiling finished. Hardware Performance Counters:
  L1d Cache Loads:    2,014,589,123
  L1d Cache Misses:   501,887,345 (24.91%)
  L3 Cache Misses:    125,234,876 ( 6.22%)
  Instructions:       10,456,123,789
  CPU Cycles:         25,123,456,901
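
The percentages in the sample report are consistent with each miss count divided by the L1d load count (501,887,345 / 2,014,589,123 ≈ 24.91%). A possible way to produce lines in that style is sketched below. The function names format_count, percent_of, and print_counter are hypothetical, and the separators are inserted by hand because printf's grouping flag depends on the locale.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Write n with comma thousands separators into buf (truncating if buf is
 * too small) and return buf. */
char *format_count(uint64_t n, char *buf, size_t len)
{
    char digits[32];
    size_t out = 0;

    if (len == 0)
        return buf;
    snprintf(digits, sizeof digits, "%llu", (unsigned long long)n);
    size_t nd = strlen(digits);
    for (size_t i = 0; i < nd && out + 2 < len; i++) {
        /* A comma goes before every digit that starts a group of three. */
        if (i > 0 && (nd - i) % 3 == 0)
            buf[out++] = ',';
        buf[out++] = digits[i];
    }
    buf[out] = '\0';
    return buf;
}

/* part as a percentage of whole; 0 when whole is 0 to avoid dividing by zero. */
double percent_of(uint64_t part, uint64_t whole)
{
    return whole ? 100.0 * (double)part / (double)whole : 0.0;
}

/* Print one report line; pass base = 0 to omit the percentage. */
void print_counter(const char *label, uint64_t value, uint64_t base)
{
    char buf[32];

    printf("  %-19s %s", label, format_count(value, buf, sizeof buf));
    if (base != 0)
        printf(" (%5.2f%%)", percent_of(value, base));
    putchar('\n');
}
```

For example, print_counter("L1d Cache Misses:", misses, loads) prints the miss count followed by its share of all L1d loads.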

Implementation Guide

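  1. Reproduce the simplest happy-path scenario: count one event (for example PERF_COUNT_HW_INSTRUCTIONS) for the profiler's own process around a small loop and print the raw number.
  2. Build the smallest working version of the core feature: fork, execvp the target command, and report a single counter such as PERF_COUNT_HW_CACHE_MISSES for it.
  3. Add input validation and error handling: a missing command, a failed execvp, and a failed perf_event_open should each produce a clear message.
  4. Add instrumentation/logging to confirm behavior, such as the type and config of each counter as it is opened.
  5. Refactor into clean modules with tests, keeping event setup, process control, and reporting separate.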
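
For the error-handling step, perf_event_open failures are easier to act on when errno is translated into a hint. The sketch below is one way to do that; perf_open_hint is a hypothetical helper, and the messages are suggestions based on the errors documented in the man page, not an exhaustive mapping.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical helper: turn errno from a failed perf_event_open call into a
 * human-readable hint. Falls back to strerror for everything else. */
const char *perf_open_hint(int err)
{
    switch (err) {
    case EACCES:
    case EPERM:
        return "permission denied: check /proc/sys/kernel/perf_event_paranoid "
               "or run with more privileges";
    case ENOENT:
        return "event not supported on this system "
               "(some virtual machines expose no hardware counters)";
    case ENODEV:
    case EOPNOTSUPP:
        return "event needs a hardware feature this CPU does not provide";
    case EMFILE:
        return "too many open file descriptors";
    default:
        return strerror(err);
    }
}
```

A caller would save errno immediately after the failed syscall and print something like "perf_event_open: " followed by perf_open_hint(saved_errno) to stderr.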

Milestones

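  • Milestone 1: Minimal working program that runs end-to-end: it wraps a command and prints one counter.
  • Milestone 2: Correct outputs for typical inputs: every counter in the sample report, with miss percentages.
  • Milestone 3: Robust handling of edge cases: unsupported events, denied permissions, and a command that fails to start or is killed by a signal.
  • Milestone 4: Clean structure and documented usage.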
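
One edge case worth knowing about: if the profiler requests more events than the CPU has hardware counters, the kernel time-multiplexes them, and the raw values undercount. Setting attr.read_format to PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING makes read() return the two times alongside the value, so the count can be scaled. The sketch below assumes that read_format (without PERF_FORMAT_GROUP or PERF_FORMAT_ID); the names timed_count and scale_count are hypothetical.

```c
#include <stdint.h>

/* Layout read() returns for a single event when read_format is
 * PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING. */
struct timed_count {
    uint64_t value;         /* raw event count */
    uint64_t time_enabled;  /* ns the event was enabled */
    uint64_t time_running;  /* ns the event was actually on a hardware counter */
};

/* Estimate the full count by scaling value by enabled/running time.
 * Returns 0 if the event never ran, and the raw value if it ran the whole time. */
uint64_t scale_count(const struct timed_count *c)
{
    if (c->time_running == 0)
        return 0;
    if (c->time_running >= c->time_enabled)
        return c->value;
    return (uint64_t)((double)c->value * (double)c->time_enabled /
                      (double)c->time_running);
}
```

A scaled value is an estimate, so a report might mark it as such whenever time_running is less than time_enabled.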

Validation Checklist

  • Output matches the format of the real-world outcome example (the counts themselves depend on the machine and the command)
  • Handles invalid inputs safely
  • Provides clear errors and exit codes
  • Results are repeatable across runs of the same command, allowing for small run-to-run variation in hardware counts
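
For the exit-code item, a wrapper is generally expected to exit with the status of the command it ran. The sketch below shows one possible convention, borrowed from common shell behavior: the child's own exit code if it exited normally, 128 plus the signal number if it was killed, and 127 if the command could not be started. The function names exit_code_from_status and run_and_get_exit_code are hypothetical.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Map a waitpid() status to the exit code the profiler itself should return. */
int exit_code_from_status(int status)
{
    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    if (WIFSIGNALED(status))
        return 128 + WTERMSIG(status);  /* shell-style convention, an assumption */
    return 1;
}

/* Run argv as a child process and return the exit code to propagate. */
int run_and_get_exit_code(char *const argv[])
{
    pid_t child = fork();
    if (child < 0) {
        perror("fork");
        return 1;
    }
    if (child == 0) {
        execvp(argv[0], argv);
        perror(argv[0]);    /* only reached if exec failed */
        _exit(127);
    }

    int status = 0;
    if (waitpid(child, &status, 0) < 0) {
        perror("waitpid");
        return 1;
    }
    return exit_code_from_status(status);
}
```

With this in place, scripts that call the profiler can still tell whether the profiled command succeeded.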

References

  • Main guide: LEARN_C_PERFORMANCE_DEEP_DIVE.md
  • “The Linux Programming Interface” by Michael Kerrisk