Project 7: Interrupt Latency Profiler
Measure IRQ latency and identify which interrupts suffer delays under load.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 4: Expert |
| Time Estimate | 2 weeks |
| Main Programming Language | C |
| Alternative Programming Languages | Python + bpftrace |
| Coolness Level | Level 4: Hardcore |
| Business Potential | Level 3: Performance tooling |
| Prerequisites | C, kernel tracing basics, timing concepts |
| Key Topics | interrupt handling, tracepoints, latency analysis |
1. Learning Objectives
By completing this project, you will:
- Explain the IRQ handling pipeline (top half/bottom half).
- Capture IRQ entry/exit timestamps via tracefs or perf.
- Compute latency distributions and percentiles.
- Map IRQ numbers to device names.
- Build deterministic reporting for tests.
- Identify system conditions that increase interrupt latency.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Interrupt Handling: Top Half vs Bottom Half
Fundamentals
An interrupt is a hardware signal that diverts CPU execution to an interrupt handler. The immediate handler (top half) runs quickly; deferred work (bottom half) runs later via softirqs or workqueues.
Deep Dive into the concept
When an IRQ arrives, the CPU saves state and jumps to the interrupt handler. The top half acknowledges the device and schedules any longer work. Bottom halves run in softirq context or kernel threads, reducing time spent with interrupts disabled. Latency can increase if interrupts are disabled for long periods or if the CPU is saturated. Measuring entry->exit time tells you how long the top half took, while longer delays before entry indicate masking or CPU contention.
How this fits into the project
This defines what you measure in Section 3.2 and interpret in Section 3.7.
Definitions & key terms
- IRQ -> interrupt request
- top half -> immediate interrupt handler
- bottom half -> deferred processing (softirq, tasklet)
Mental model diagram (ASCII)
IRQ -> top half (fast) -> schedule bottom half (slow)
How it works (step-by-step)
- Device raises IRQ.
- CPU jumps to handler (top half).
- Handler schedules bottom half if needed.
- CPU resumes previous task.
Minimal concrete example
irq_handler_entry -> handler -> irq_handler_exit
Common misconceptions
- Misconception: IRQ latency is only handler time. Correction: it also includes delay before handler runs.
Check-your-understanding questions
- Why keep top halves short?
- What causes long interrupt latency spikes?
Check-your-understanding answers
- To minimize time with interrupts disabled.
- CPU saturation, long critical sections, or IRQ masking.
Real-world applications
- Real-time audio, networking latency tuning
Where you’ll apply it
- This project: Section 3.2 Functional Requirements, Section 5.10 Phase 2.
- Also used in: P03-process-scheduler-visualization-tool for timing interpretation.
References
- “Linux Kernel Development” interrupt chapters
Key insights
IRQ latency is a system health indicator, not just a driver detail.
Summary
Understanding IRQ handling helps interpret latency numbers correctly.
Homework/Exercises to practice the concept
- Trigger a high-rate IRQ (e.g., network traffic) and observe handler counts.
Solutions to the homework/exercises
- Use `ping -f` and observe `/proc/interrupts` changes.
2.2 Tracepoints for IRQ Entry/Exit
Fundamentals
The kernel exposes irq_handler_entry and irq_handler_exit tracepoints. By pairing these events, you can compute handler latency.
Deep Dive into the concept
Tracepoints emit records containing IRQ number, handler name, and timestamp. You must enable tracepoints in tracefs and read the event stream. Pairing entry and exit requires per-CPU or per-IRQ maps because events interleave across CPUs. A monotonic clock ensures stable timestamps.
How this fits into the project
This is the core measurement pipeline in Section 4.4 and Section 5.10 Phase 1.
Definitions & key terms
- tracefs -> filesystem for tracing controls
- irq_handler_entry/exit -> tracepoint events
- pairing -> matching entry/exit events
Mental model diagram (ASCII)
entry(ts=10) -> exit(ts=12) => latency=2us
How it works (step-by-step)
- Enable tracepoints.
- Read event stream.
- Pair entry/exit by IRQ + CPU.
- Compute latency and aggregate.
Minimal concrete example
echo 1 | sudo tee /sys/kernel/tracing/events/irq/irq_handler_entry/enable
echo 1 | sudo tee /sys/kernel/tracing/events/irq/irq_handler_exit/enable
Common misconceptions
- Misconception: events are globally ordered. Correction: ordering is per-CPU; use timestamps.
Check-your-understanding questions
- Why pair entry/exit per CPU?
- What happens if exit is missing?
Check-your-understanding answers
- IRQ handling is CPU-local; cross-CPU pairing is incorrect.
- Mark as incomplete and drop or log.
Real-world applications
- Latency tuning in RT systems
Where you’ll apply it
- This project: Section 4.4 Algorithm Overview, Section 6 testing.
- Also used in: P04-page-fault-analyzer for event parsing.
References
- Kernel tracing docs
Key insights
Accurate latency requires correct pairing.
Summary
Tracepoints are your measurement instruments.
Homework/Exercises to practice the concept
- Enable IRQ tracepoints and print raw events.
Solutions to the homework/exercises
- Read `/sys/kernel/tracing/trace_pipe` and confirm entry/exit pairs.
2.3 Latency Histograms and Percentiles
Fundamentals
Raw latency samples are noisy. Histograms and percentiles (p50, p95, p99) summarize the distribution and highlight tail latency.
Deep Dive into the concept
A histogram buckets samples into ranges (e.g., 1us, 2us, 4us). Percentiles are computed from the cumulative distribution. Tail latency (p99, max) indicates rare but impactful delays. Using fixed buckets and deterministic sampling makes tests reproducible.
How this fits into the project
This powers the output in Section 3.7 and the summary stats in Section 5.10 Phase 3.
Definitions & key terms
- histogram -> bucketed distribution
- percentile -> value below which X% of samples fall
- tail latency -> high percentiles and max
Mental model diagram (ASCII)
latency us: [0-1]=120 [1-2]=40 [2-4]=10 [4-8]=2
How it works (step-by-step)
- For each sample, place it in a bucket.
- Periodically compute percentiles from buckets.
- Report avg/p99/max per IRQ.
Minimal concrete example
IRQ16: avg=2.1us p99=40us max=110us
Common misconceptions
- Misconception: average latency is sufficient. Correction: tail latency is often more important.
Check-your-understanding questions
- Why do we care about p99?
- How does bucket size affect histogram accuracy?
Check-your-understanding answers
- Rare spikes cause user-visible issues.
- Larger buckets reduce precision but are cheaper.
Real-world applications
- Latency SLAs in real-time systems
Where you’ll apply it
- This project: Section 3.7 Real World Outcome, Section 5.10 Phase 3.
- Also used in: P03-process-scheduler-visualization-tool for aggregation.
References
- “Systems Performance” (Gregg), latency chapters
Key insights
Latency distributions tell a more complete story than averages.
Summary
Histograms and percentiles make IRQ latency actionable.
Homework/Exercises to practice the concept
- Create a histogram from sample latencies.
Solutions to the homework/exercises
- Use log2 buckets and compute cumulative counts.
3. Project Specification
3.1 What You Will Build
A CLI tool irq-latency that measures IRQ handler latency, aggregates per IRQ and per CPU, and prints histograms and percentile stats.
3.2 Functional Requirements
- Enable IRQ entry/exit tracepoints.
- Pair entry/exit events and compute latency.
- Map IRQ numbers to device names via `/proc/interrupts`.
- Compute avg/p95/p99/max per IRQ.
- Output histogram buckets.
- Deterministic mode for tests (`--fixed-ts`).
3.3 Non-Functional Requirements
- Performance: handle high IRQ rates without dropping.
- Reliability: survive missing events.
- Usability: clear output format.
3.4 Example Usage / Output
$ sudo ./irq-latency --top 5 --fixed-ts
IRQ 16 (eth0): avg=2.3us p99=40us max=110us
IRQ 24 (nvme0): avg=1.2us p99=25us max=95us
3.5 Data Formats / Schemas / Protocols
- Output format:
IRQ <n> (<name>): avg=<us> p99=<us> max=<us>
3.6 Edge Cases
- Missing exit events.
- IRQs with no name.
- IRQs masked for long periods.
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
sudo ./irq-latency --fixed-ts --top 5
3.7.2 Golden Path Demo (Deterministic)
- Use a fixed sampling window and `--fixed-ts` for reproducible output.
3.7.3 CLI Transcript (Success + Failure)
$ sudo ./irq-latency --fixed-ts
IRQ 16 (eth0): avg=2.3us p99=40us max=110us
$ ./irq-latency
error: requires tracefs access (try sudo)
exit code: 2
3.7.4 Exit Codes
- 0: success
- 2: permission error
4. Solution Architecture
4.1 High-Level Design
Tracefs Reader -> Pairing Engine -> Histogram Aggregator -> Reporter
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Reader | read IRQ events | trace_pipe |
| Pairing | match entry/exit | per-CPU map |
| Aggregator | histogram + percentiles | log2 buckets |
| Reporter | format output | per-IRQ summary |
4.3 Data Structures (No Full Code)
struct irq_stats {
    uint64_t count;       /* number of paired samples */
    uint64_t sum_ns;      /* running sum, for averages */
    uint64_t max_ns;      /* worst-case latency observed */
    uint64_t buckets[32]; /* log2 latency histogram */
};
4.4 Algorithm Overview
- Enable tracepoints.
- Read events and pair them.
- Update stats and histogram.
- Periodically report summary.
5. Implementation Guide
5.1 Development Environment Setup
sudo apt install build-essential linux-tools-common
5.2 Project Structure
irq-latency/
|-- src/
| |-- main.c
| |-- trace.c
| `-- stats.c
`-- Makefile
5.3 The Core Question You’re Answering
“How long does the kernel take to respond to interrupts under load?”
5.4 Concepts You Must Understand First
- IRQ handling pipeline.
- Tracepoint pairing.
- Latency histograms.
5.5 Questions to Guide Your Design
- How will you handle missing exit events?
- What bucket size is useful?
5.6 Thinking Exercise
If IRQ average is 2us but max is 500us, what might explain the spike?
5.7 The Interview Questions They’ll Ask
- What is interrupt latency and why does it matter?
5.8 Hints in Layers
- Hint 1: start with printing raw entry/exit events.
- Hint 2: add pairing and average latency.
- Hint 3: add histogram and percentiles.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Interrupts | Linux Kernel Development | Interrupt chapters |
| Performance | Systems Performance | Latency chapters |
5.10 Implementation Phases
Phase 1: capture events. Phase 2: pairing + avg. Phase 3: histogram + percentiles.
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit | histogram | bucket boundaries |
| Integration | event capture | synthetic IRQ load |
6.2 Critical Test Cases
- Entry without exit is ignored with warning.
- Deterministic output with fixed window.
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Wrong timestamp source | negative latency | use monotonic clock |
| Global pairing | mismatched events | pair by CPU |
8. Extensions & Challenges
- Export CSV to analyze in Python.
- Add live alerts when p99 exceeds threshold.
9. Real-World Connections
- Real-time audio, networking, storage latency
10. Resources
- Kernel tracing docs
11. Self-Assessment Checklist
- I can explain top vs bottom halves.
12. Submission / Completion Criteria
Minimum: capture IRQ events and compute avg. Full: histograms and percentiles. Excellence: alerting + per-CPU breakdown.
13. Determinism Notes
- Use fixed window size and stable sampling duration.