Project 3: Process Scheduler Visualization Tool
Build a real-time scheduler observer that shows context switches, run-queue depth, and per-CPU activity.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3: Advanced |
| Time Estimate | 2 weeks |
| Main Programming Language | C |
| Alternative Programming Languages | Rust, Python |
| Coolness Level | Level 3: Clever |
| Business Potential | Level 3: Observability tooling |
| Prerequisites | C or Python, /proc familiarity, basic scheduling theory |
| Key Topics | CFS, run queues, tracepoints, time-series aggregation |
1. Learning Objectives
By completing this project, you will:
- Explain how the CFS scheduler selects the next runnable task.
- Read and interpret `/proc/schedstat` and `/proc/stat`.
- Capture `sched_switch` events via tracefs/perf.
- Aggregate event streams into per-CPU timelines.
- Build a deterministic, testable visualization output.
- Identify scheduling anomalies such as persistently deep run queues or task starvation.
2. All Theory Needed (Per-Concept Breakdown)
2.1 CFS Scheduling Model and Run Queues
Fundamentals
Linux’s CFS (Completely Fair Scheduler) aims to distribute CPU time fairly. It keeps runnable tasks in a red-black tree ordered by virtual runtime (vruntime). The task with the smallest vruntime runs next, approximating fairness.
Deep Dive into the concept
Each CPU has a run queue (rq) and a CFS run queue (cfs_rq). vruntime increases proportional to actual runtime, scaled by task weight (from nice value). This creates the illusion of equal CPU time for tasks of equal weight. Load balancing periodically migrates tasks between CPUs. In your visualizer, run-queue depth indicates contention, and large vruntime gaps suggest dominance by a task. Understanding this model makes your visualization meaningful rather than a random stream of PIDs.
How this fits into the projects
You’ll map sched_switch events to CFS states in Section 3.2 and Section 4.4, and interpret run-queue depth in Section 5.10.
Definitions & key terms
- CFS -> Completely Fair Scheduler
- vruntime -> weighted runtime used for scheduling order
- run queue -> per-CPU list/tree of runnable tasks
- nice -> user-visible priority weight
Mental model diagram (ASCII)
Run queue (rb-tree, ordered by vruntime)
leftmost -> next to run
How it works (step-by-step)
- Task becomes runnable and is inserted into run queue.
- Scheduler picks the leftmost task (smallest `vruntime`).
- Task runs for a time slice and its `vruntime` increases.
- Task is reinserted and a new task is selected.
Minimal concrete example
Task A vr=10, Task B vr=12, Task C vr=20 -> pick A
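The pick-the-leftmost rule can be sketched in miniature. This is a toy model, not the kernel's red-black tree: `toy_task`, `pick_next`, `account`, and the 1024 reference weight are illustrative stand-ins for the real `sched_entity` machinery.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of CFS selection: a linear scan for the smallest
 * vruntime instead of the kernel's rb-tree leftmost node. */
struct toy_task {
    const char *name;
    uint64_t vruntime;  /* weighted runtime, arbitrary units */
    uint64_t weight;    /* higher weight => slower vruntime growth */
};

/* Pick the runnable task with the smallest vruntime. */
static struct toy_task *pick_next(struct toy_task *tasks, int n) {
    struct toy_task *best = &tasks[0];
    for (int i = 1; i < n; i++)
        if (tasks[i].vruntime < best->vruntime)
            best = &tasks[i];
    return best;
}

/* Charge `delta` units of real runtime, scaled inversely by weight
 * (the kernel scales by NICE_0_LOAD / weight; 1024 plays that role). */
static void account(struct toy_task *t, uint64_t delta) {
    t->vruntime += delta * 1024 / t->weight;
}
```

With tasks A (vr=10), B (vr=12), C (vr=20), `pick_next` returns A; after A runs for 5 units its vruntime passes B's, so B is selected next.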
Common misconceptions
- Misconception: CFS uses fixed time quanta. Correction: time slices vary based on load and weight.
- Misconception: a higher nice value always means less wall-clock runtime. Correction: it means a proportionally smaller CPU share under contention; on an idle system the task may still run as much as it wants.
Check-your-understanding questions
- Why is `vruntime` used instead of raw runtime?
- What does run-queue depth indicate?
- How does nice affect scheduling?
Check-your-understanding answers
- It normalizes runtime by task weight.
- Contention for the CPU.
- It changes task weight, affecting `vruntime` growth.
Real-world applications
- Performance debugging (CPU-bound vs IO-bound workloads)
- Capacity planning and latency analysis
Where you’ll apply it
- This project: Section 3.2 Functional Requirements, Section 4.4 Algorithm Overview.
- Also used in: P06-userspace-thread-library-green-threads for scheduling policies.
References
- “Linux Kernel Development” (Love), scheduler chapters
- “Operating Systems: Three Easy Pieces” Ch. 7-10
Key insights
Scheduler theory explains what your visualization means.
Summary
CFS is a fairness machine; your tool reveals its decisions.
Homework/Exercises to practice the concept
- Predict how two tasks with nice -5 and +5 share CPU.
- Simulate a run queue with three tasks and track `vruntime` changes.
Solutions to the homework/exercises
- The nice -5 task receives more CPU time (weight is higher).
- The task with the smallest `vruntime` runs next; the others wait.
2.2 Tracepoints and sched_switch Events
Fundamentals
The kernel emits tracepoints for scheduling events. sched:sched_switch fires on every context switch and contains previous and next task info. You can read these events via tracefs or perf.
Deep Dive into the concept
Tracepoints expose structured events with fields like prev_comm, prev_pid, next_pid, and timestamps. When enabled, the kernel writes event records to a ring buffer that user space reads. You must mount tracefs, enable events, and read trace_pipe or perf buffers. Proper timestamps should come from a monotonic clock to avoid skew.
How this fits into the projects
Your event ingestion layer in Section 4.2 depends on correct tracepoint handling.
Definitions & key terms
- tracepoint -> kernel instrumentation hook
- tracefs -> virtual filesystem for tracing
- sched_switch -> event emitted on context switch
- ring buffer -> circular buffer of event records
Mental model diagram (ASCII)
CPU core -> sched_switch -> ring buffer -> user reader
How it works (step-by-step)
- Mount tracefs and enable `sched_switch`.
- Kernel writes events into per-CPU ring buffers.
- User-space reads and parses events.
- Events are aggregated into timelines.
Minimal concrete example
echo 1 | sudo tee /sys/kernel/tracing/events/sched/sched_switch/enable
cat /sys/kernel/tracing/trace_pipe
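A user-space reader then parses each `trace_pipe` line. A minimal sketch: the field names (`prev_comm=`, `prev_pid=`, `next_comm=`, `next_pid=`) match the tracepoint format, but the helper and its error handling are illustrative, and `%31s` stops at whitespace, so comms containing spaces would need a sturdier parser.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Extract prev/next task info from a sched_switch trace line.
 * Anchor on field names, not column positions: the line prefix
 * varies across kernel versions. Returns 0 on success. */
static int parse_switch(const char *line, char *prev_comm, int *prev_pid,
                        char *next_comm, int *next_pid) {
    const char *p = strstr(line, "prev_comm=");
    if (!p || sscanf(p, "prev_comm=%31s prev_pid=%d", prev_comm, prev_pid) != 2)
        return -1;
    p = strstr(line, "next_comm=");
    if (!p || sscanf(p, "next_comm=%31s next_pid=%d", next_comm, next_pid) != 2)
        return -1;
    return 0;
}
```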
Common misconceptions
- Misconception: tracepoints are always enabled. Correction: they must be explicitly enabled.
- Misconception: trace_pipe preserves event order across CPUs. Correction: ordering across CPUs is approximate; use timestamps.
Check-your-understanding questions
- Why do you need timestamps for events from different CPUs?
- What happens if tracefs isn’t mounted?
Check-your-understanding answers
- Events are per-CPU and not globally ordered.
- You’ll see no events and the enable path fails.
Real-world applications
- Kernel performance profiling
- Latency investigations
Where you’ll apply it
- This project: Section 5.10 Phase 1-2, Section 6 tests.
- Also used in: P07-interrupt-latency-profiler.
References
- Kernel tracing docs
- `perf trace` documentation
Key insights
Tracepoints are the API that makes the scheduler observable.
Summary
Without tracepoints, scheduler behavior is invisible.
Homework/Exercises to practice the concept
- Enable `sched_switch` and count events during a CPU stress test.
- Compare `trace_pipe` output with `perf trace`.
Solutions to the homework/exercises
- Use `stress-ng --cpu 2` and count lines.
- Both show the same switches in different formats.
2.3 Time-Series Aggregation and Visualization
Fundamentals
Raw events are noisy. You need to aggregate events into time buckets to show run-queue depth, CPU utilization, and context switch rates. Visualization can be as simple as a TUI timeline.
Deep Dive into the concept
Aggregation involves grouping events by time window (e.g., 100ms) and CPU. The run-queue depth can be derived from /proc/schedstat deltas, while context-switch rate comes from counting sched_switch events. A deterministic output mode helps tests: use fixed windows and monotonic timestamps. Visualization can be textual bars or ASCII timelines.
How this fits into the projects
This is how you turn raw events into the real-world outcome in Section 3.7 and Section 5.10 Phase 3.
Definitions & key terms
- bucket -> time window for aggregation
- utilization -> fraction of time CPU is busy
- timeline -> per-CPU representation of running tasks
Mental model diagram (ASCII)
Time -> |----100ms----|----100ms----|
CPU0 -> nginx | postgres | nginx
How it works (step-by-step)
- Capture events with timestamps.
- Bucket events per CPU into fixed windows.
- Compute counts (context switches, run-queue depth).
- Render as ASCII timeline.
Minimal concrete example
CPU0: A A A B B
CPU1: C C D D D
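Bucketing into fixed windows reduces to integer division of the monotonic timestamp. A minimal sketch of the counting side, assuming events arrive as `(cpu, timestamp)` pairs; `NCPU`, `NBUCKETS`, and the `agg` layout are illustrative.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NCPU 2
#define NBUCKETS 8

/* Context-switch counts per (CPU, fixed time window). */
struct agg {
    uint64_t window_ns;                   /* fixed bucket width */
    uint64_t switches[NCPU][NBUCKETS];    /* per-CPU, per-window counts */
};

/* Record one sched_switch event at monotonic timestamp ts_ns. */
static void agg_event(struct agg *a, int cpu, uint64_t ts_ns) {
    uint64_t b = ts_ns / a->window_ns;    /* bucket index by division */
    if (cpu >= 0 && cpu < NCPU && b < NBUCKETS)
        a->switches[cpu][b]++;
}
```

Because the bucket index depends only on the timestamp and the window width, replaying the same event log always yields identical counts, which is what makes the deterministic test mode possible.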
Common misconceptions
- Misconception: `/proc/stat` gives per-process usage. Correction: it exposes per-CPU counters; per-process data needs other sources.
Check-your-understanding questions
- Why is a fixed time bucket useful for deterministic tests?
- How do you compute context switches per second?
Check-your-understanding answers
- It avoids jitter and wall-clock variance.
- Count `sched_switch` events per time window.
Real-world applications
- Monitoring tools (top, htop, perfetto)
- Scheduling anomaly detection
Where you’ll apply it
- This project: Section 3.7 Real World Outcome, Section 5.10 Phase 3.
- Also used in: P01-syscall-tracer-strace-clone for output formatting strategies.
References
- “Systems Performance” (Gregg), time-series chapters
Key insights
Aggregation turns kernel noise into human insight.
Summary
Your tool is only as useful as the aggregation you build.
Homework/Exercises to practice the concept
- Aggregate `/proc/stat` deltas into 1s CPU utilization.
- Render a text bar for CPU busy time.
Solutions to the homework/exercises
- Compute delta idle vs total and derive utilization.
- Use `#` for busy, `.` for idle.
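The first exercise boils down to a delta calculation over two samples: busy = 1 - delta(idle) / delta(total). A sketch of the arithmetic; summing the jiffy fields of the `cpuN` line is assumed to happen elsewhere, and the helper name is illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* CPU busy percentage between two /proc/stat samples, given the
 * idle and total jiffy counters at each sample point. */
static int utilization_pct(uint64_t idle0, uint64_t total0,
                           uint64_t idle1, uint64_t total1) {
    uint64_t dt = total1 - total0;
    if (dt == 0)
        return 0;                         /* no elapsed time yet */
    uint64_t di = idle1 - idle0;
    return (int)(100 * (dt - di) / dt);   /* busy fraction as percent */
}
```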
3. Project Specification
3.1 What You Will Build
A CLI/TUI tool, schedviz, that shows per-CPU scheduling activity: current task, recent switches, run-queue depth, and summary statistics. It supports fixed-window aggregation for deterministic outputs.
3.2 Functional Requirements
- Read `sched_switch` tracepoints in real time.
- Parse `/proc/schedstat` and `/proc/stat` for run-queue depth and CPU usage.
- Render per-CPU timelines in text or TUI mode.
- Provide summary stats (context switches/sec, average run-queue depth).
- Support `--window-ms` and `--fixed-ts` for deterministic buckets.
- Handle multi-CPU systems and process exits gracefully.
3.3 Non-Functional Requirements
- Performance: handle 10k events/sec without dropping.
- Reliability: survives tracefs being temporarily unavailable.
- Usability: output is readable in 80x24 terminals.
3.4 Example Usage / Output
$ sudo ./schedviz --fixed-ts --window-ms 100
CPU0: nginx(1234) -> postgres(5678) -> nginx(1234)
CPU1: kworker/1:0 -> bash(2222) -> kworker/1:0
runq depth: [3,1,2,1]
ctx switches/sec: 1240
3.5 Data Formats / Schemas / Protocols
- `sched_switch` event format from tracefs
- `/proc/schedstat` parsed into per-CPU counters
3.6 Edge Cases
- No `sched_switch` events (tracefs not mounted)
- Processes exit mid-window
- CPU hotplug changes CPU count
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
sudo ./schedviz --fixed-ts --window-ms 100
3.7.2 Golden Path Demo (Deterministic)
- Use `--fixed-ts` and 100ms windows to generate stable snapshots.
3.7.3 CLI Transcript (Success + Failure)
$ sudo ./schedviz --fixed-ts
CPU0: idle -> bash(1234) -> gcc(5000)
runq depth: [1,2,1]
ctx switches/sec: 800
$ ./schedviz
error: schedviz requires tracefs access (try sudo)
exit code: 2
3.7.4 Exit Codes
- 0 -> success
- 2 -> missing permissions
4. Solution Architecture
4.1 High-Level Design
Tracefs Reader -> Event Parser -> Aggregator -> Renderer
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Reader | read tracefs/perf | trace_pipe for v1 |
| Parser | parse sched_switch lines | regex vs manual parse |
| Aggregator | bucket events | fixed window size |
| Renderer | ASCII timeline | terminal-friendly output |
4.3 Data Structures (No Full Code)
struct cpu_stats {
    uint64_t switches;    /* context switches in the current window */
    uint64_t runq_depth;  /* sampled run-queue depth */
    char last_task[32];   /* comm of the last task seen on this CPU */
};
4.4 Algorithm Overview
- Enable tracepoint.
- Read events into per-CPU buckets.
- Every window, compute metrics and render.
Complexity: O(E) per window where E is events.
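The per-window metric step is a simple rate conversion: the example output's 1240 switches/sec corresponds to 124 events in a 100ms window. A one-function sketch (helper name illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Convert an event count in one fixed window into a per-second rate. */
static uint64_t switches_per_sec(uint64_t count, uint64_t window_ms) {
    return window_ms ? count * 1000 / window_ms : 0;
}
```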
5. Implementation Guide
5.1 Development Environment Setup
sudo apt install linux-tools-common
5.2 Project Structure
schedviz/
|-- src/
| |-- main.c
| |-- tracefs.c
| |-- agg.c
| `-- render.c
`-- Makefile
5.3 The Core Question You’re Answering
“Can I make the scheduler’s decisions visible in real time?”
5.4 Concepts You Must Understand First
- CFS scheduling model.
- Tracepoints and event ingestion.
- Time-window aggregation.
5.5 Questions to Guide Your Design
- What window size is most useful?
- How will you handle CPU hotplug?
- Which metrics are most valuable to show?
5.6 Thinking Exercise
Predict how make -j8 changes run-queue depth on a 4-core machine.
5.7 The Interview Questions They’ll Ask
- What is `sched_switch` and why is it important?
- How does CFS balance fairness vs responsiveness?
5.8 Hints in Layers
- Hint 1: Start with `/proc/stat` utilization.
- Hint 2: Add tracefs `sched_switch` events.
- Hint 3: Build simple ASCII timelines.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Scheduling | Linux Kernel Development | Scheduler chapters |
| OS scheduling | OSTEP | Ch. 7-10 |
| Performance | Systems Performance | Timeline chapters |
5.10 Implementation Phases
Phase 1: Event capture (3 days)
- Enable tracepoint and print raw events.
Phase 2: Aggregation (4 days)
- Bucket events and compute metrics.
Phase 3: Visualization (3-4 days)
- Render timelines and stats.
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit | parser | parse a sample sched_switch line |
| Integration | capture | stress CPU and count events |
| Edge | no events | tracefs disabled |
6.2 Critical Test Cases
- Known `sched_switch` log parses correctly.
- Deterministic output for fixed window size.
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Tracefs not mounted | no events | mount tracefs |
| Mixed timestamps | jittery output | use monotonic clock |
7.2 Debugging Strategies
- Compare counts against the system-wide `ctxt` context-switch counter in `/proc/stat`.
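That cross-check reads the `ctxt` line of `/proc/stat`, which holds the total context switches since boot. A sketch of extracting it from the file's text (helper name and error convention are illustrative):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Find the cumulative context-switch counter in /proc/stat text.
 * Returns -1 if the "ctxt" line is missing. */
static long long total_ctxt(const char *stat_text) {
    const char *p = strstr(stat_text, "ctxt ");
    long long v;
    if (!p || sscanf(p, "ctxt %lld", &v) != 1)
        return -1;
    return v;
}
```

Sampling this counter twice over a known interval gives an independent switches-per-second figure to compare against the tracepoint-derived rate.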
8. Extensions & Challenges
8.1 Beginner Extensions
- Add JSON output for metrics.
8.2 Intermediate Extensions
- Add per-process CPU usage estimates.
8.3 Advanced Extensions
- Export perfetto trace format.
9. Real-World Connections
9.1 Industry Applications
- Identifying CPU contention in production workloads
9.2 Related Open Source Projects
- perf, bcc, sysstat
9.3 Interview Relevance
- Explaining how schedulers make decisions
10. Resources
- Kernel tracing docs
- `perf trace` docs
11. Self-Assessment Checklist
- I can interpret `sched_switch` events.
- I can explain run-queue depth metrics.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Captures and displays sched_switch events.
Full Completion:
- Aggregates and renders per-CPU timelines.
Excellence:
- Exports to external visualization tools.
13. Determinism Notes
- Use `--fixed-ts` and a fixed window size for tests.