Project 7: Interrupt Latency Profiler

Measure IRQ latency and identify which interrupts suffer delays under load.

Quick Reference

| Attribute | Value |
|-----------|-------|
| Difficulty | Level 4: Expert |
| Time Estimate | 2 weeks |
| Main Programming Language | C (Alternatives: Python + bpftrace) |
| Alternative Programming Languages | Python |
| Coolness Level | Level 4: Hardcore |
| Business Potential | Level 3: Performance tooling |
| Prerequisites | C, kernel tracing basics, timing concepts |
| Key Topics | interrupt handling, tracepoints, latency analysis |

1. Learning Objectives

By completing this project, you will:

  1. Explain the IRQ handling pipeline (top half/bottom half).
  2. Capture IRQ entry/exit timestamps via tracefs or perf.
  3. Compute latency distributions and percentiles.
  4. Map IRQ numbers to device names.
  5. Build deterministic reporting for tests.
  6. Identify system conditions that increase interrupt latency.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Interrupt Handling: Top Half vs Bottom Half

Fundamentals

An interrupt is a hardware signal that diverts CPU execution to an interrupt handler. The immediate handler (top half) runs quickly; deferred work (bottom half) runs later via softirqs or workqueues.

Deep Dive into the concept

When an IRQ arrives, the CPU saves state and jumps to the interrupt handler. The top half acknowledges the device and schedules any longer work. Bottom halves run in softirq context or kernel threads, reducing time spent with interrupts disabled. Latency can increase if interrupts are disabled for long periods or if the CPU is saturated. Measuring entry->exit time tells you how long the top half took, while longer delays before entry indicate masking or CPU contention.

How this fits into the project

This defines what you measure in Section 3.2 and interpret in Section 3.7.

Definitions & key terms

  • IRQ -> interrupt request
  • top half -> immediate interrupt handler
  • bottom half -> deferred processing (softirq, tasklet)

Mental model diagram (ASCII)

IRQ -> top half (fast) -> schedule bottom half (slow)

How it works (step-by-step)

  1. Device raises IRQ.
  2. CPU jumps to handler (top half).
  3. Handler schedules bottom half if needed.
  4. CPU resumes previous task.

Minimal concrete example

irq_handler_entry -> handler -> irq_handler_exit

Common misconceptions

  • Misconception: IRQ latency is only handler time. Correction: it also includes delay before handler runs.

Check-your-understanding questions

  1. Why keep top halves short?
  2. What causes long interrupt latency spikes?

Check-your-understanding answers

  1. To minimize time with interrupts disabled.
  2. CPU saturation, long critical sections, or IRQ masking.

Real-world applications

  • Real-time audio, networking latency tuning

Where you’ll apply it

  • This project: Section 3.2 (what you measure) and Section 3.7 (how you interpret the results).

References

  • “Linux Kernel Development” interrupt chapters

Key insights

IRQ latency is a system health indicator, not just a driver detail.

Summary

Understanding IRQ handling helps interpret latency numbers correctly.

Homework/Exercises to practice the concept

  1. Trigger a high-rate IRQ (e.g., network traffic) and observe handler counts.

Solutions to the homework/exercises

  1. Run ping -f <host> (flood ping; requires root) and watch the NIC's row grow, e.g. with watch -n1 'grep eth0 /proc/interrupts'.

2.2 Tracepoints for IRQ Entry/Exit

Fundamentals

The kernel exposes irq_handler_entry and irq_handler_exit tracepoints. By pairing these events, you can compute handler latency.

Deep Dive into the concept

Tracepoints emit records containing IRQ number, handler name, and timestamp. You must enable tracepoints in tracefs and read the event stream. Pairing entry and exit requires per-CPU or per-IRQ maps because events interleave across CPUs. A monotonic clock ensures stable timestamps.

How this fits into the project

This is the core measurement pipeline in Section 4.4 and Section 5.10 Phase 1.

Definitions & key terms

  • tracefs -> filesystem for tracing controls
  • irq_handler_entry/exit -> tracepoint events
  • pairing -> matching entry/exit events

Mental model diagram (ASCII)

entry(ts=10) -> exit(ts=12) => latency=2us

How it works (step-by-step)

  1. Enable tracepoints.
  2. Read event stream.
  3. Pair entry/exit by IRQ + CPU.
  4. Compute latency and aggregate.

Minimal concrete example

echo 1 | sudo tee /sys/kernel/tracing/events/irq/irq_handler_entry/enable
echo 1 | sudo tee /sys/kernel/tracing/events/irq/irq_handler_exit/enable

Common misconceptions

  • Misconception: events are globally ordered. Correction: ordering is per-CPU; use timestamps.

Check-your-understanding questions

  1. Why pair entry/exit per CPU?
  2. What happens if exit is missing?

Check-your-understanding answers

  1. IRQ handling is CPU-local; cross-CPU pairing is incorrect.
  2. The entry stays unpaired; mark it incomplete, then drop the sample or log a warning.

Real-world applications

  • Latency tuning in RT systems

Where you’ll apply it

  • This project: Section 4.4 Algorithm Overview, Section 6 testing.
  • Also used in: P04-page-fault-analyzer for event parsing.

References

  • Kernel tracing docs

Key insights

Accurate latency requires correct pairing.

Summary

Tracepoints are your measurement instruments.

Homework/Exercises to practice the concept

  1. Enable IRQ tracepoints and print raw events.

Solutions to the homework/exercises

  1. Read trace_pipe and confirm entry/exit pairs.

2.3 Latency Histograms and Percentiles

Fundamentals

Raw latency samples are noisy. Histograms and percentiles (p50, p95, p99) summarize the distribution and highlight tail latency.

Deep Dive into the concept

A histogram buckets samples into ranges (e.g., 1us, 2us, 4us). Percentiles are computed from the cumulative distribution. Tail latency (p99, max) indicates rare but impactful delays. Using fixed buckets and deterministic sampling makes tests reproducible.

How this fits into the project

This powers the output in Section 3.7 and the summary stats in Section 5.10 Phase 3.

Definitions & key terms

  • histogram -> bucketed distribution
  • percentile -> value below which X% of samples fall
  • tail latency -> high percentiles and max

Mental model diagram (ASCII)

latency us: [0-1]=120 [1-2]=40 [2-4]=10 [4-8]=2

How it works (step-by-step)

  1. For each sample, place it in a bucket.
  2. Periodically compute percentiles from buckets.
  3. Report avg/p99/max per IRQ.

Minimal concrete example

IRQ16: avg=2.1us p99=40us max=110us

Common misconceptions

  • Misconception: average latency is sufficient. Correction: tail latency is often more important.

Check-your-understanding questions

  1. Why do we care about p99?
  2. How does bucket size affect histogram accuracy?

Check-your-understanding answers

  1. Rare spikes cause user-visible issues.
  2. Larger buckets reduce precision but are cheaper.

Real-world applications

  • Latency SLAs in real-time systems

Where you’ll apply it

  • This project: Section 3.7 output and the summary stats of Section 5.10 Phase 3.

References

  • “Systems Performance” (Gregg), latency chapters

Key insights

Latency distributions tell a more complete story than averages.

Summary

Histograms and percentiles make IRQ latency actionable.

Homework/Exercises to practice the concept

  1. Create a histogram from sample latencies.

Solutions to the homework/exercises

  1. Use log2 buckets and compute cumulative counts.

3. Project Specification

3.1 What You Will Build

A CLI tool irq-latency that measures IRQ handler latency, aggregates per IRQ and per CPU, and prints histograms and percentile stats.

3.2 Functional Requirements

  1. Enable IRQ entry/exit tracepoints.
  2. Pair entry/exit events and compute latency.
  3. Map IRQ numbers to device names via /proc/interrupts.
  4. Compute avg/p95/p99/max per IRQ.
  5. Output histogram buckets.
  6. Deterministic mode for tests (--fixed-ts).

3.3 Non-Functional Requirements

  • Performance: handle high IRQ rates without dropping.
  • Reliability: survive missing events.
  • Usability: clear output format.

3.4 Example Usage / Output

$ sudo ./irq-latency --top 5 --fixed-ts
IRQ 16 (eth0): avg=2.3us p99=40us max=110us
IRQ 24 (nvme0): avg=1.2us p99=25us max=95us

3.5 Data Formats / Schemas / Protocols

  • Output format: IRQ <n> (<name>): avg=<us> p99=<us> max=<us>

3.6 Edge Cases

  • Missing exit events.
  • IRQs with no name.
  • IRQs masked for long periods.

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

sudo ./irq-latency --fixed-ts --top 5

3.7.2 Golden Path Demo (Deterministic)

  • Use fixed sampling window and --fixed-ts for reproducible output.

3.7.3 CLI Transcript (Success + Failure)

$ sudo ./irq-latency --fixed-ts
IRQ 16 (eth0): avg=2.3us p99=40us max=110us

$ ./irq-latency
error: requires tracefs access (try sudo)
exit code: 2

3.7.4 Exit Codes

  • 0 success
  • 2 permission error

4. Solution Architecture

4.1 High-Level Design

Tracefs Reader -> Pairing Engine -> Histogram Aggregator -> Reporter

4.2 Key Components

| Component | Responsibility | Key Decisions |
|-----------|----------------|---------------|
| Reader | read IRQ events | trace_pipe |
| Pairing | match entry/exit | per-CPU map |
| Aggregator | histogram + percentiles | log2 buckets |
| Reporter | format output | per-IRQ summary |

4.3 Data Structures (No Full Code)

struct irq_stats {
    uint64_t count;
    uint64_t sum_ns;
    uint64_t max_ns;
    uint64_t buckets[32];
};

4.4 Algorithm Overview

  1. Enable tracepoints.
  2. Read events and pair them.
  3. Update stats and histogram.
  4. Periodically report summary.

5. Implementation Guide

5.1 Development Environment Setup

sudo apt install build-essential linux-tools-common

5.2 Project Structure

irq-latency/
|-- src/
|   |-- main.c
|   |-- trace.c
|   `-- stats.c
`-- Makefile

5.3 The Core Question You’re Answering

“How long does the kernel take to respond to interrupts under load?”

5.4 Concepts You Must Understand First

  1. IRQ handling pipeline.
  2. Tracepoint pairing.
  3. Latency histograms.

5.5 Questions to Guide Your Design

  1. How will you handle missing exit events?
  2. What bucket size is useful?

5.6 Thinking Exercise

If IRQ average is 2us but max is 500us, what might explain the spike?

5.7 The Interview Questions They’ll Ask

  1. What is interrupt latency and why does it matter?

5.8 Hints in Layers

  • Hint 1: start with printing raw entry/exit events.
  • Hint 2: add pairing and average latency.
  • Hint 3: add histogram and percentiles.

5.9 Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| Interrupts | Linux Kernel Development | Interrupt chapters |
| Performance | Systems Performance | Latency chapters |

5.10 Implementation Phases

  • Phase 1: capture events.
  • Phase 2: pairing + average latency.
  • Phase 3: histogram + percentiles.


6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
|----------|---------|----------|
| Unit | histogram | bucket boundaries |
| Integration | event capture | synthetic IRQ load |

6.2 Critical Test Cases

  1. Entry without exit is ignored with warning.
  2. Deterministic output with fixed window.

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution |
|---------|---------|----------|
| Wrong timestamp source | negative latency | use monotonic clock |
| Global pairing | mismatched events | pair by CPU |


8. Extensions & Challenges

  • Export CSV to analyze in Python.
  • Add live alerts when p99 exceeds threshold.

9. Real-World Connections

  • Real-time audio, networking, storage latency

10. Resources

  • Kernel tracing docs

11. Self-Assessment Checklist

  • I can explain top vs bottom halves.

12. Submission / Completion Criteria

Minimum: capture IRQ events and compute avg. Full: histograms and percentiles. Excellence: alerting + per-CPU breakdown.


13. Determinism Notes

  • Use fixed window size and stable sampling duration.