Project 7: Interrupt Latency Profiler
Measure IRQ latency and identify which interrupts suffer delays under load.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 4: Expert |
| Time Estimate | 2 weeks |
| Main Programming Language | C |
| Alternative Programming Languages | Python + bpftrace |
| Coolness Level | Level 4: Hardcore |
| Business Potential | Level 3: Performance tooling |
| Prerequisites | C, kernel tracing basics, timing concepts |
| Key Topics | interrupt handling, tracepoints, latency analysis |
1. Learning Objectives
By completing this project, you will:
- Explain the IRQ handling pipeline (top half/bottom half).
- Capture IRQ entry/exit timestamps via tracefs or perf.
- Compute latency distributions and percentiles.
- Map IRQ numbers to device names.
- Build deterministic reporting for tests.
- Identify system conditions that increase interrupt latency.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Interrupt Handling: Top Half vs Bottom Half
Fundamentals
An interrupt is a hardware signal that diverts CPU execution to an interrupt handler. The immediate handler (top half) runs quickly; deferred work (bottom half) runs later via softirqs or workqueues.
Deep Dive into the concept
When an IRQ arrives, the CPU saves state and jumps to the interrupt handler. The top half acknowledges the device and schedules any longer work. Bottom halves run in softirq context or kernel threads, reducing time spent with interrupts disabled. Latency can increase if interrupts are disabled for long periods or if the CPU is saturated. Measuring entry->exit time tells you how long the top half took, while longer delays before entry indicate masking or CPU contention.
How this fits into the project
This defines what you measure in Section 3.2 and interpret in Section 3.7.
Definitions & key terms
- IRQ -> interrupt request
- top half -> immediate interrupt handler
- bottom half -> deferred processing (softirq, tasklet)
Mental model diagram (ASCII)
IRQ -> top half (fast) -> schedule bottom half (slow)
How it works (step-by-step)
- Device raises IRQ.
- CPU jumps to handler (top half).
- Handler schedules bottom half if needed.
- CPU resumes previous task.
Minimal concrete example
irq_handler_entry -> handler -> irq_handler_exit
Common misconceptions
- Misconception: IRQ latency is only handler time. Correction: it also includes delay before handler runs.
Check-your-understanding questions
- Why keep top halves short?
- What causes long interrupt latency spikes?
Check-your-understanding answers
- To minimize time with interrupts disabled.
- CPU saturation, long critical sections, or IRQ masking.
Real-world applications
- Real-time audio, networking latency tuning
Where you’ll apply it
- This project: Section 3.2 Functional Requirements, Section 5.10 Phase 2.
- Also used in: P03-process-scheduler-visualization-tool for timing interpretation.
References
- “Linux Kernel Development” interrupt chapters
Key insights
IRQ latency is a system health indicator, not just a driver detail.
Summary
Understanding IRQ handling helps interpret latency numbers correctly.
Homework/Exercises to practice the concept
- Trigger a high-rate IRQ (e.g., network traffic) and observe handler counts.
Solutions to the homework/exercises
- Use `ping -f` and observe `/proc/interrupts` changes.
2.2 Tracepoints for IRQ Entry/Exit
Fundamentals
The kernel exposes irq_handler_entry and irq_handler_exit tracepoints. By pairing these events, you can compute handler latency.
Deep Dive into the concept
Tracepoints emit records containing IRQ number, handler name, and timestamp. You must enable tracepoints in tracefs and read the event stream. Pairing entry and exit requires per-CPU or per-IRQ maps because events interleave across CPUs. A monotonic clock ensures stable timestamps.
How this fits into the project
This is the core measurement pipeline in Section 4.4 and Section 5.10 Phase 1.
Definitions & key terms
- tracefs -> filesystem for tracing controls
- irq_handler_entry/exit -> tracepoint events
- pairing -> matching entry/exit events
Mental model diagram (ASCII)
entry(ts=10) -> exit(ts=12) => latency=2us
How it works (step-by-step)
- Enable tracepoints.
- Read event stream.
- Pair entry/exit by IRQ + CPU.
- Compute latency and aggregate.
Minimal concrete example
echo 1 | sudo tee /sys/kernel/tracing/events/irq/irq_handler_entry/enable
echo 1 | sudo tee /sys/kernel/tracing/events/irq/irq_handler_exit/enable
Common misconceptions
- Misconception: events are globally ordered. Correction: ordering is per-CPU; use timestamps.
Check-your-understanding questions
- Why pair entry/exit per CPU?
- What happens if exit is missing?
Check-your-understanding answers
- IRQ handling is CPU-local; cross-CPU pairing is incorrect.
- Mark as incomplete and drop or log.
Real-world applications
- Latency tuning in RT systems
Where you’ll apply it
- This project: Section 4.4 Algorithm Overview, Section 6 testing.
- Also used in: P04-page-fault-analyzer for event parsing.
References
- Kernel tracing docs
Key insights
Accurate latency requires correct pairing.
Summary
Tracepoints are your measurement instruments.
Homework/Exercises to practice the concept
- Enable IRQ tracepoints and print raw events.
Solutions to the homework/exercises
- Read `/sys/kernel/tracing/trace_pipe` and confirm entry/exit pairs.
2.3 Latency Histograms and Percentiles
Fundamentals
Raw latency samples are noisy. Histograms and percentiles (p50, p95, p99) summarize the distribution and highlight tail latency.
Deep Dive into the concept
A histogram buckets samples into ranges (e.g., 1us, 2us, 4us). Percentiles are computed from the cumulative distribution. Tail latency (p99, max) indicates rare but impactful delays. Using fixed buckets and deterministic sampling makes tests reproducible.
How this fits into the project
This powers the output in Section 3.7 and the summary stats in Section 5.10 Phase 3.
Definitions & key terms
- histogram -> bucketed distribution
- percentile -> value below which X% of samples fall
- tail latency -> high percentiles and max
Mental model diagram (ASCII)
latency us: [0-1]=120 [1-2]=40 [2-4]=10 [4-8]=2
How it works (step-by-step)
- For each sample, place it in a bucket.
- Periodically compute percentiles from buckets.
- Report avg/p99/max per IRQ.
Minimal concrete example
IRQ16: avg=2.1us p99=40us max=110us
Common misconceptions
- Misconception: average latency is sufficient. Correction: tail latency is often more important.
Check-your-understanding questions
- Why do we care about p99?
- How does bucket size affect histogram accuracy?
Check-your-understanding answers
- Rare spikes cause user-visible issues.
- Larger buckets reduce precision but are cheaper.
Real-world applications
- Latency SLAs in real-time systems
Where you’ll apply it
- This project: Section 3.7 Real World Outcome, Section 5.10 Phase 3.
- Also used in: P03-process-scheduler-visualization-tool for aggregation.
References
- “Systems Performance” (Gregg), latency chapters
Key insights
Latency distributions tell a more complete story than averages.
Summary
Histograms and percentiles make IRQ latency actionable.
Homework/Exercises to practice the concept
- Create a histogram from sample latencies.
Solutions to the homework/exercises
- Use log2 buckets and compute cumulative counts.
3. Project Specification
3.1 What You Will Build
A CLI tool irq-latency that measures IRQ handler latency, aggregates per IRQ and per CPU, and prints histograms and percentile stats.
3.2 Functional Requirements
- Enable IRQ entry/exit tracepoints.
- Pair entry/exit events and compute latency.
- Map IRQ numbers to device names via `/proc/interrupts`.
- Compute avg/p95/p99/max per IRQ.
- Output histogram buckets.
- Deterministic mode for tests (`--fixed-ts`).
3.3 Non-Functional Requirements
- Performance: handle high IRQ rates without dropping.
- Reliability: survive missing events.
- Usability: clear output format.
3.4 Example Usage / Output
$ sudo ./irq-latency --top 5 --fixed-ts
IRQ 16 (eth0): avg=2.3us p99=40us max=110us
IRQ 24 (nvme0): avg=1.2us p99=25us max=95us
3.5 Data Formats / Schemas / Protocols
- Output format:
IRQ <n> (<name>): avg=<us> p99=<us> max=<us>
3.6 Edge Cases
- Missing exit events.
- IRQs with no name.
- IRQs masked for long periods.
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
sudo ./irq-latency --fixed-ts --top 5
3.7.2 Golden Path Demo (Deterministic)
- Use a fixed sampling window and `--fixed-ts` for reproducible output.
3.7.3 CLI Transcript (Success + Failure)
$ sudo ./irq-latency --fixed-ts
IRQ 16 (eth0): avg=2.3us p99=40us max=110us
$ ./irq-latency
error: requires tracefs access (try sudo)
exit code: 2
3.7.4 Exit Codes
- 0: success
- 2: permission error
4. Solution Architecture
4.1 High-Level Design
Tracefs Reader -> Pairing Engine -> Histogram Aggregator -> Reporter
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Reader | read IRQ events | trace_pipe |
| Pairing | match entry/exit | per-CPU map |
| Aggregator | histogram + percentiles | log2 buckets |
| Reporter | format output | per-IRQ summary |
4.3 Data Structures (No Full Code)
struct irq_stats {
    uint64_t count;       /* number of paired samples */
    uint64_t sum_ns;      /* running sum, for averages */
    uint64_t max_ns;      /* worst-case latency observed */
    uint64_t buckets[32]; /* log2 latency histogram */
};
4.4 Algorithm Overview
- Enable tracepoints.
- Read events and pair them.
- Update stats and histogram.
- Periodically report summary.
5. Implementation Guide
5.1 Development Environment Setup
sudo apt install build-essential linux-tools-common
5.2 Project Structure
irq-latency/
|-- src/
| |-- main.c
| |-- trace.c
| `-- stats.c
`-- Makefile
5.3 The Core Question You’re Answering
“How long does the kernel take to respond to interrupts under load?”
5.4 Concepts You Must Understand First
- IRQ handling pipeline.
- Tracepoint pairing.
- Latency histograms.
5.5 Questions to Guide Your Design
- How will you handle missing exit events?
- What bucket size is useful?
5.6 Thinking Exercise
If IRQ average is 2us but max is 500us, what might explain the spike?
5.7 The Interview Questions They’ll Ask
- What is interrupt latency and why does it matter?
5.8 Hints in Layers
- Hint 1: start with printing raw entry/exit events.
- Hint 2: add pairing and average latency.
- Hint 3: add histogram and percentiles.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Interrupts | Linux Kernel Development | Interrupt chapters |
| Performance | Systems Performance | Latency chapters |
5.10 Implementation Phases
Phase 1: capture events. Phase 2: pairing + avg. Phase 3: histogram + percentiles.
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit | histogram | bucket boundaries |
| Integration | event capture | synthetic IRQ load |
6.2 Critical Test Cases
- Entry without exit is ignored with warning.
- Deterministic output with fixed window.
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Wrong timestamp source | negative latency | use monotonic clock |
| Global pairing | mismatched events | pair by CPU |
8. Extensions & Challenges
- Export CSV to analyze in Python.
- Add live alerts when p99 exceeds threshold.
9. Real-World Connections
- Real-time audio, networking, storage latency
10. Resources
- Kernel tracing docs
11. Self-Assessment Checklist
- I can explain top vs bottom halves.
12. Submission / Completion Criteria
Minimum: capture IRQ events and compute avg. Full: histograms and percentiles. Excellence: alerting + per-CPU breakdown.
13. Determinism Notes
- Use fixed window size and stable sampling duration.