Project 3: Process Scheduler Visualization Tool
Build a real-time scheduler observer that shows context switches, run-queue depth, and per-CPU activity.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3: Advanced |
| Time Estimate | 2 weeks |
| Main Programming Language | C |
| Alternative Programming Languages | Rust, Python |
| Coolness Level | Level 3: Clever |
| Business Potential | Level 3: Observability tooling |
| Prerequisites | C or Python, /proc familiarity, basic scheduling theory |
| Key Topics | CFS, run queues, tracepoints, time-series aggregation |
1. Learning Objectives
By completing this project, you will:
- Explain how the CFS scheduler selects the next runnable task.
- Read and interpret `/proc/schedstat` and `/proc/stat`.
- Capture `sched_switch` events via tracefs/perf.
- Aggregate event streams into per-CPU timelines.
- Build a deterministic, testable visualization output.
- Identify scheduling anomalies such as persistently deep run queues or task starvation.
2. All Theory Needed (Per-Concept Breakdown)
2.1 CFS Scheduling Model and Run Queues
Fundamentals
Linux’s CFS (Completely Fair Scheduler) aims to distribute CPU time fairly. It keeps runnable tasks in a red-black tree ordered by virtual runtime (vruntime). The task with the smallest vruntime runs next, approximating fairness.
Deep Dive into the concept
Each CPU has a run queue (rq) and a CFS run queue (cfs_rq). vruntime increases proportional to actual runtime, scaled by task weight (from nice value). This creates the illusion of equal CPU time for tasks of equal weight. Load balancing periodically migrates tasks between CPUs. In your visualizer, run-queue depth indicates contention, and large vruntime gaps suggest dominance by a task. Understanding this model makes your visualization meaningful rather than a random stream of PIDs.
How this fits into the projects
You’ll map sched_switch events to CFS states in Section 3.2 and Section 4.4, and interpret run-queue depth in Section 5.10.
Definitions & key terms
- CFS -> Completely Fair Scheduler
- vruntime -> weighted runtime used for scheduling order
- run queue -> per-CPU list/tree of runnable tasks
- nice -> user-visible priority weight
Mental model diagram (ASCII)
Run queue (rb-tree, ordered by vruntime)
leftmost -> next to run
How it works (step-by-step)
- Task becomes runnable and is inserted into run queue.
- Scheduler picks the leftmost task (smallest `vruntime`).
- Task runs for a time slice and its `vruntime` increases.
- Task is reinserted and a new task is selected.
Minimal concrete example
Task A vr=10, Task B vr=12, Task C vr=20 -> pick A
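The pick-the-leftmost rule can be sketched in miniature. This is a toy model, not the kernel's red-black tree: `toy_task`, `pick_next`, `account`, and the 1024 reference weight are illustrative stand-ins for the real `sched_entity` machinery.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of CFS selection: a linear scan for the smallest
 * vruntime instead of the kernel's rb-tree leftmost node. */
struct toy_task {
    const char *name;
    uint64_t vruntime;  /* weighted runtime, arbitrary units */
    uint64_t weight;    /* higher weight => slower vruntime growth */
};

/* Pick the runnable task with the smallest vruntime. */
static struct toy_task *pick_next(struct toy_task *tasks, int n) {
    struct toy_task *best = &tasks[0];
    for (int i = 1; i < n; i++)
        if (tasks[i].vruntime < best->vruntime)
            best = &tasks[i];
    return best;
}

/* Charge `delta` units of real runtime, scaled inversely by weight
 * (the kernel scales by NICE_0_LOAD / weight; 1024 plays that role). */
static void account(struct toy_task *t, uint64_t delta) {
    t->vruntime += delta * 1024 / t->weight;
}
```

With tasks A (vr=10), B (vr=12), C (vr=20), `pick_next` returns A; after A runs for 5 units its vruntime passes B's, so B is selected next.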
Common misconceptions
- Misconception: CFS uses fixed time quanta. Correction: time slices vary based on load and weight.
- Misconception: a higher nice value always means less wall-clock runtime. Correction: it means a proportionally smaller CPU share under contention; on an idle system the task may still run as much as it wants.
Check-your-understanding questions
- Why is `vruntime` used instead of raw runtime?
- What does run-queue depth indicate?
- How does nice affect scheduling?
Check-your-understanding answers
- It normalizes runtime by task weight.
- Contention for the CPU.
- It changes task weight, affecting `vruntime` growth.
Real-world applications
- Performance debugging (CPU-bound vs IO-bound workloads)
- Capacity planning and latency analysis
Where you’ll apply it
- This project: Section 3.2 Functional Requirements, Section 4.4 Algorithm Overview.
- Also used in: P06-userspace-thread-library-green-threads for scheduling policies.
References
- “Linux Kernel Development” (Love), scheduler chapters
- “Operating Systems: Three Easy Pieces” Ch. 7-10
Key insights
Scheduler theory explains what your visualization means.
Summary
CFS is a fairness machine; your tool reveals its decisions.
Homework/Exercises to practice the concept
- Predict how two tasks with nice -5 and +5 share CPU.
- Simulate a run queue with three tasks and track `vruntime` changes.
Solutions to the homework/exercises
- The nice -5 task receives more CPU time (weight is higher).
- The task with the smallest `vruntime` runs next; the others wait.
2.2 Tracepoints and sched_switch Events
Fundamentals
The kernel emits tracepoints for scheduling events. sched:sched_switch fires on every context switch and contains previous and next task info. You can read these events via tracefs or perf.
Deep Dive into the concept
Tracepoints expose structured events with fields like prev_comm, prev_pid, next_pid, and timestamps. When enabled, the kernel writes event records to a ring buffer that user space reads. You must mount tracefs, enable events, and read trace_pipe or perf buffers. Proper timestamps should come from a monotonic clock to avoid skew.
How this fits into the projects
Your event ingestion layer in Section 4.2 depends on correct tracepoint handling.
Definitions & key terms
- tracepoint -> kernel instrumentation hook
- tracefs -> virtual filesystem for tracing
- sched_switch -> event emitted on context switch
- ring buffer -> circular buffer of event records
Mental model diagram (ASCII)
CPU core -> sched_switch -> ring buffer -> user reader
How it works (step-by-step)
- Mount tracefs and enable `sched_switch`.
- Kernel writes events into per-CPU ring buffers.
- User-space reads and parses events.
- Events are aggregated into timelines.
Minimal concrete example
echo 1 | sudo tee /sys/kernel/tracing/events/sched/sched_switch/enable
cat /sys/kernel/tracing/trace_pipe
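A user-space reader then parses each `trace_pipe` line. A minimal sketch: the field names (`prev_comm=`, `prev_pid=`, `next_comm=`, `next_pid=`) match the tracepoint format, but the helper and its error handling are illustrative, and `%31s` stops at whitespace, so comms containing spaces would need a sturdier parser.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Extract prev/next task info from a sched_switch trace line.
 * Anchor on field names, not column positions: the line prefix
 * varies across kernel versions. Returns 0 on success. */
static int parse_switch(const char *line, char *prev_comm, int *prev_pid,
                        char *next_comm, int *next_pid) {
    const char *p = strstr(line, "prev_comm=");
    if (!p || sscanf(p, "prev_comm=%31s prev_pid=%d", prev_comm, prev_pid) != 2)
        return -1;
    p = strstr(line, "next_comm=");
    if (!p || sscanf(p, "next_comm=%31s next_pid=%d", next_comm, next_pid) != 2)
        return -1;
    return 0;
}
```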
Common misconceptions
- Misconception: tracepoints are always enabled. Correction: they must be explicitly enabled.
- Misconception: trace_pipe preserves event order across CPUs. Correction: ordering across CPUs is approximate; use timestamps.
Check-your-understanding questions
- Why do you need timestamps for events from different CPUs?
- What happens if tracefs isn’t mounted?
Check-your-understanding answers
- Events are per-CPU and not globally ordered.
- You’ll see no events and the enable path fails.
Real-world applications
- Kernel performance profiling
- Latency investigations
Where you’ll apply it
- This project: Section 5.10 Phase 1-2, Section 6 tests.
- Also used in: P07-interrupt-latency-profiler.
References
- Kernel tracing docs
- `perf trace` documentation
Key insights
Tracepoints are the API that makes the scheduler observable.
Summary
Without tracepoints, scheduler behavior is invisible.
Homework/Exercises to practice the concept
- Enable `sched_switch` and count events during a CPU stress test.
- Compare `trace_pipe` output with `perf trace`.
Solutions to the homework/exercises
- Use `stress-ng --cpu 2` and count lines.
- Both show the same switches in different formats.
2.3 Time-Series Aggregation and Visualization
Fundamentals
Raw events are noisy. You need to aggregate events into time buckets to show run-queue depth, CPU utilization, and context switch rates. Visualization can be as simple as a TUI timeline.
Deep Dive into the concept
Aggregation involves grouping events by time window (e.g., 100ms) and CPU. The run-queue depth can be derived from /proc/schedstat deltas, while context-switch rate comes from counting sched_switch events. A deterministic output mode helps tests: use fixed windows and monotonic timestamps. Visualization can be textual bars or ASCII timelines.
How this fits into the projects
This is how you turn raw events into the real-world outcome in Section 3.7 and Section 5.10 Phase 3.
Definitions & key terms
- bucket -> time window for aggregation
- utilization -> fraction of time CPU is busy
- timeline -> per-CPU representation of running tasks
Mental model diagram (ASCII)
Time -> |----100ms----|----100ms----|
CPU0 -> nginx | postgres | nginx
How it works (step-by-step)
- Capture events with timestamps.
- Bucket events per CPU into fixed windows.
- Compute counts (context switches, run-queue depth).
- Render as ASCII timeline.
Minimal concrete example
CPU0: A A A B B
CPU1: C C D D D
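Bucketing into fixed windows reduces to integer division of the monotonic timestamp. A minimal sketch of the counting side, assuming events arrive as `(cpu, timestamp)` pairs; `NCPU`, `NBUCKETS`, and the `agg` layout are illustrative.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NCPU 2
#define NBUCKETS 8

/* Context-switch counts per (CPU, fixed time window). */
struct agg {
    uint64_t window_ns;                   /* fixed bucket width */
    uint64_t switches[NCPU][NBUCKETS];    /* per-CPU, per-window counts */
};

/* Record one sched_switch event at monotonic timestamp ts_ns. */
static void agg_event(struct agg *a, int cpu, uint64_t ts_ns) {
    uint64_t b = ts_ns / a->window_ns;    /* bucket index by division */
    if (cpu >= 0 && cpu < NCPU && b < NBUCKETS)
        a->switches[cpu][b]++;
}
```

Because the bucket index depends only on the timestamp and the window width, replaying the same event log always yields identical counts, which is what makes the deterministic test mode possible.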
Common misconceptions
- Misconception: `/proc/stat` gives per-process usage. Correction: it exposes per-CPU counters; per-process data needs other sources.
Check-your-understanding questions
- Why is a fixed time bucket useful for deterministic tests?
- How do you compute context switches per second?
Check-your-understanding answers
- It avoids jitter and wall-clock variance.
- Count `sched_switch` events per time window.
Real-world applications
- Monitoring tools (top, htop, perfetto)
- Scheduling anomaly detection
Where you’ll apply it
- This project: Section 3.7 Real World Outcome, Section 5.10 Phase 3.
- Also used in: P01-syscall-tracer-strace-clone for output formatting strategies.
References
- “Systems Performance” (Gregg), time-series chapters
Key insights
Aggregation turns kernel noise into human insight.
Summary
Your tool is only as useful as the aggregation you build.
Homework/Exercises to practice the concept
- Aggregate `/proc/stat` deltas into 1s CPU utilization.
- Render a text bar for CPU busy time.
Solutions to the homework/exercises
- Compute delta idle vs total and derive utilization.
- Use `#` for busy, `.` for idle.
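The first exercise boils down to a delta calculation over two samples: busy = 1 - delta(idle) / delta(total). A sketch of the arithmetic; summing the jiffy fields of the `cpuN` line is assumed to happen elsewhere, and the helper name is illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* CPU busy percentage between two /proc/stat samples, given the
 * idle and total jiffy counters at each sample point. */
static int utilization_pct(uint64_t idle0, uint64_t total0,
                           uint64_t idle1, uint64_t total1) {
    uint64_t dt = total1 - total0;
    if (dt == 0)
        return 0;                         /* no elapsed time yet */
    uint64_t di = idle1 - idle0;
    return (int)(100 * (dt - di) / dt);   /* busy fraction as percent */
}
```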
3. Project Specification
3.1 What You Will Build
A CLI/TUI tool, schedviz, that shows per-CPU scheduling activity: current task, recent switches, run-queue depth, and summary statistics. It supports fixed-window aggregation for deterministic outputs.
3.2 Functional Requirements
- Read `sched_switch` tracepoints in real time.
- Parse `/proc/schedstat` and `/proc/stat` for run-queue depth and CPU usage.
- Render per-CPU timelines in text or TUI mode.
- Provide summary stats (context switches/sec, average run-queue depth).
- Support `--window-ms` and `--fixed-ts` for deterministic buckets.
- Handle multi-CPU systems and process exits gracefully.
3.3 Non-Functional Requirements
- Performance: handle 10k events/sec without dropping.
- Reliability: survives tracefs being temporarily unavailable.
- Usability: output is readable in 80x24 terminals.
3.4 Example Usage / Output
$ sudo ./schedviz --fixed-ts --window-ms 100
CPU0: nginx(1234) -> postgres(5678) -> nginx(1234)
CPU1: kworker/1:0 -> bash(2222) -> kworker/1:0
runq depth: [3,1,2,1]
ctx switches/sec: 1240
3.5 Data Formats / Schemas / Protocols
- `sched_switch` event format from tracefs
- `/proc/schedstat` parsed into per-CPU counters
3.6 Edge Cases
- No `sched_switch` events (tracefs not mounted)
- Processes exit mid-window
- CPU hotplug changes CPU count
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
sudo ./schedviz --fixed-ts --window-ms 100
3.7.2 Golden Path Demo (Deterministic)
- Use `--fixed-ts` and 100ms windows to generate stable snapshots.
3.7.3 CLI Transcript (Success + Failure)
$ sudo ./schedviz --fixed-ts
CPU0: idle -> bash(1234) -> gcc(5000)
runq depth: [1,2,1]
ctx switches/sec: 800
$ ./schedviz
error: schedviz requires tracefs access (try sudo)
exit code: 2
3.7.4 Exit Codes
- 0 -> success
- 2 -> missing permissions
4. Solution Architecture
4.1 High-Level Design
Tracefs Reader -> Event Parser -> Aggregator -> Renderer
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Reader | read tracefs/perf | trace_pipe for v1 |
| Parser | parse sched_switch lines | regex vs manual parse |
| Aggregator | bucket events | fixed window size |
| Renderer | ASCII timeline | terminal-friendly output |
4.3 Data Structures (No Full Code)
struct cpu_stats {
    uint64_t switches;    /* context switches in the current window */
    uint64_t runq_depth;  /* sampled run-queue depth */
    char last_task[32];   /* comm of the last task seen on this CPU */
};
4.4 Algorithm Overview
- Enable tracepoint.
- Read events into per-CPU buckets.
- Every window, compute metrics and render.
Complexity: O(E) per window where E is events.
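The per-window metric step is a simple rate conversion: the example output's 1240 switches/sec corresponds to 124 events in a 100ms window. A one-function sketch (helper name illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Convert an event count in one fixed window into a per-second rate. */
static uint64_t switches_per_sec(uint64_t count, uint64_t window_ms) {
    return window_ms ? count * 1000 / window_ms : 0;
}
```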
5. Implementation Guide
5.1 Development Environment Setup
sudo apt install linux-tools-common
5.2 Project Structure
schedviz/
|-- src/
| |-- main.c
| |-- tracefs.c
| |-- agg.c
| `-- render.c
`-- Makefile
5.3 The Core Question You’re Answering
“Can I make the scheduler’s decisions visible in real time?”
5.4 Concepts You Must Understand First
- CFS scheduling model.
- Tracepoints and event ingestion.
- Time-window aggregation.
5.5 Questions to Guide Your Design
- What window size is most useful?
- How will you handle CPU hotplug?
- Which metrics are most valuable to show?
5.6 Thinking Exercise
Predict how make -j8 changes run-queue depth on a 4-core machine.
5.7 The Interview Questions They’ll Ask
- What is `sched_switch` and why is it important?
- How does CFS balance fairness vs responsiveness?
5.8 Hints in Layers
- Hint 1: Start with `/proc/stat` utilization.
- Hint 2: Add tracefs `sched_switch` events.
- Hint 3: Build simple ASCII timelines.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Scheduling | Linux Kernel Development | Scheduler chapters |
| OS scheduling | OSTEP | Ch. 7-10 |
| Performance | Systems Performance | Timeline chapters |
5.10 Implementation Phases
Phase 1: Event capture (3 days)
- Enable tracepoint and print raw events.
Phase 2: Aggregation (4 days)
- Bucket events and compute metrics.
Phase 3: Visualization (3-4 days)
- Render timelines and stats.
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit | parser | parse a sample sched_switch line |
| Integration | capture | stress CPU and count events |
| Edge | no events | tracefs disabled |
6.2 Critical Test Cases
- Known `sched_switch` log parses correctly.
- Deterministic output for fixed window size.
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Tracefs not mounted | no events | mount tracefs |
| Mixed timestamps | jittery output | use monotonic clock |
7.2 Debugging Strategies
- Compare counts against the system-wide `ctxt` context-switch counter in `/proc/stat`.
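That cross-check reads the `ctxt` line of `/proc/stat`, which holds the total context switches since boot. A sketch of extracting it from the file's text (helper name and error convention are illustrative):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Find the cumulative context-switch counter in /proc/stat text.
 * Returns -1 if the "ctxt" line is missing. */
static long long total_ctxt(const char *stat_text) {
    const char *p = strstr(stat_text, "ctxt ");
    long long v;
    if (!p || sscanf(p, "ctxt %lld", &v) != 1)
        return -1;
    return v;
}
```

Sampling this counter twice over a known interval gives an independent switches-per-second figure to compare against the tracepoint-derived rate.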
8. Extensions & Challenges
8.1 Beginner Extensions
- Add JSON output for metrics.
8.2 Intermediate Extensions
- Add per-process CPU usage estimates.
8.3 Advanced Extensions
- Export perfetto trace format.
9. Real-World Connections
9.1 Industry Applications
- Identifying CPU contention in production workloads
9.2 Related Open Source Projects
- perf, bcc, sysstat
9.3 Interview Relevance
- Explaining how schedulers make decisions
10. Resources
- Kernel tracing docs
- `perf trace` docs
11. Self-Assessment Checklist
- I can interpret `sched_switch` events.
- I can explain run-queue depth metrics.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Captures and displays sched_switch events.
Full Completion:
- Aggregates and renders per-CPU timelines.
Excellence:
- Exports to external visualization tools.
13. Determinism Notes
- Use `--fixed-ts` and a fixed window size for tests.