Sprint: OS Kernel Development Mastery (Track A) - Real World Projects

Goal: Master operating system internals from first principles by building tools that observe, analyze, and ultimately modify kernel behavior. You will learn how user space crosses into kernel space, how the kernel schedules work, how virtual memory and paging actually operate, and how interrupts and device drivers make hardware usable. By the end, you will be able to design kernel-facing tools, debug failures at the lowest levels, and build a minimal operating system kernel that boots and runs on real (or virtual) hardware.


Introduction

Operating system and kernel development is the practice of building and understanding the privileged software layer that mediates between hardware and user programs. It is the contract that defines what processes can do, how memory is managed, and how devices are controlled.

What you will build (by the end of this guide):

  • A syscall tracer that observes kernel entry and exit in real time
  • A custom memory allocator that replaces the system allocator
  • A scheduler visualization tool that reveals how the kernel chooses what runs
  • A page-fault analyzer that distinguishes major and minor faults
  • A kernel module character device with a ring buffer and ioctl interface
  • A user-space threading library with its own scheduler
  • An interrupt latency profiler using kernel tracing
  • A minimal OS kernel that boots, sets up paging, runs processes, and exposes syscalls

Scope (what is included):

  • Linux kernel interfaces for tracing, scheduling, memory, and I/O
  • The process, thread, and syscall execution model
  • Virtual memory, page tables, and fault handling
  • Kernel modules and basic device driver structure
  • Interrupt handling, timing, and latency analysis
  • Observability with procfs, tracefs, perf_event_open, and eBPF
  • A small, bootable kernel implementation path

Out of scope (for this guide):

  • Full production kernel development workflow (mailing lists, patch queues)
  • Complex hardware drivers (GPU, Wi-Fi, storage stacks)
  • Formal verification or fully verified kernels
  • Deep security hardening and exploit development

The Big Picture (Mental Model)

User Programs
  |  (syscalls)
  v
+------------------------------+
|            Kernel            |
|  - Scheduler                 |
|  - Memory Manager            |
|  - VFS and I/O               |
|  - Interrupt/IRQ subsystem   |
|  - Drivers                   |
+------------------------------+
  |  (MMIO, interrupts, DMA)
  v
Hardware (CPU, RAM, Devices)

Key Terms You Will See Everywhere

  • Kernel space: Privileged execution mode with full hardware access.
  • User space: Unprivileged execution mode for applications.
  • System call (syscall): The controlled entry point from user to kernel.
  • Context switch: Saving and restoring CPU state between tasks.
  • Page fault: CPU exception on invalid or not-yet-present memory access.
  • Run queue: The scheduler data structure for runnable tasks.
  • VFS: Virtual File System, the abstraction that unifies filesystems.
  • Interrupt (IRQ): Hardware signal that diverts CPU to a handler.

How to Use This Guide

  1. Read the Theory Primer first. It is your mini-book. Every project assumes these concepts.
  2. Build in order when possible. The projects are designed to layer mental models.
  3. Treat each project as a lab. Keep notes, measure, and compare results.
  4. Validate with real tools. Use strace, perf, ftrace, and procfs to check your work.
  5. Revisit the primer after each project. The second pass locks in understanding.

Prerequisites & Background Knowledge

Before starting these projects, you should have a foundational understanding of the following areas.

Essential Prerequisites (Must Have)

Programming Skills:

  • Strong C proficiency: pointers, structs, bit operations, manual memory management
  • Assembly basics: registers, stack frames, calling conventions (x86-64 preferred)
  • Comfortable with CLI tools: gcc, make, gdb, grep, sed, awk
  • Git fundamentals: clone, branch, commit, diff

OS and Systems Fundamentals:

  • Basic process model: pid, parent/child, fork/exec/wait
  • Virtual memory concepts: virtual vs physical, pages, page tables (conceptual)
  • Files and permissions: file descriptors, open/read/write/close
  • Recommended reading: “Operating Systems: Three Easy Pieces” by Arpaci-Dusseau - Intro, Ch. 1-6

Linux/Unix Basics:

  • Basic shell usage: pipes, redirection, job control
  • /proc awareness: high-level idea of process info
  • Recommended reading: “The Linux Command Line” by William Shotts - Ch. 1-10

Helpful But Not Required

  • Kernel source reading: You will learn to navigate kernel docs and source trees.
  • Debuggers: gdb and lldb (helpful for user-space projects).
  • Concurrency primitives: mutexes, atomics, condition variables.
  • Kernel build basics: You will learn just enough for modules and a tiny kernel.

Self-Assessment Questions

  1. Can you explain the difference between stack, heap, and mmap memory regions?
  2. Can you describe what happens when a process calls read()?
  3. Can you read and understand C code using pointers and structs?
  4. Have you used gdb to step through code or inspect memory?
  5. Do you understand what a page fault is at a high level?

If you answered “no” to questions 1-3, spend 1-2 weeks on the recommended books.

Development Environment Setup

Required Tools:

  • Linux machine (Ubuntu 22.04+ or Fedora 38+ recommended)
  • gcc, make, gdb
  • kernel headers for your running kernel
  • strace, perf, and trace tools

# Core build tools
sudo apt install build-essential gcc gdb make

# Kernel headers (for module projects)
sudo apt install linux-headers-$(uname -r)

# Tracing tools
sudo apt install strace ltrace linux-tools-common linux-tools-generic

# Convenience tools
sudo apt install git vim tmux htop

Recommended Tools:

  • QEMU/KVM for safe kernel testing
  • perf-tools or bpftrace (optional for advanced tracing)

Testing Your Setup:

$ gcc --version
# Expect gcc 11+ (Ubuntu 22.04) or gcc 13+ (Fedora 38)

$ uname -r
# Expect a modern kernel (5.15+ is fine)

$ ls /lib/modules/$(uname -r)/build
# Should exist if headers are installed

Time Investment

  • Projects 1-2 (Intermediate): 2-3 weeks each, ~10-15 hours/week
  • Projects 3-4 (Advanced): 2-3 weeks each, ~15-20 hours/week
  • Projects 5-7 (Expert): 3-4 weeks each, ~20-30 hours/week
  • Final Project (Master): 2-3 months, ~20-30 hours/week

Total time commitment: 6-12 months part-time or 3-6 months full-time.

Important Reality Check

These projects are hard on purpose. Expect:

  • Kernel panics and VM crashes
  • Race conditions and heisenbugs
  • Documentation that assumes prior knowledge

This is normal. Mastery happens in layers:

  1. Make it work
  2. Understand what each piece does
  3. Understand why it is designed that way
  4. Understand performance and security trade-offs

Big Picture / Mental Model

Kernel development is about controlling the rules every program follows. Think in layers:

Applications
  |  (libc wrappers)
  v
Syscall ABI (registers, trap instruction)
  |  (mode switch)
  v
Kernel Core
  - Scheduler
  - Memory Manager
  - VFS and I/O
  - Networking and IPC
  - Tracing
  |  (drivers, IRQ)
  v
Hardware

Theory Primer

This section is the mini-book for the projects. Read each chapter once before coding, and then revisit it while building.

Chapter 1: The Kernel Execution Model (Privilege, Syscalls, Processes, Scheduling)

Fundamentals

The execution model of an OS is the contract between user programs and the kernel. A program runs in user mode (unprivileged) and cannot access hardware or arbitrary memory. When it needs a service, it performs a syscall. The CPU traps into kernel mode, the kernel validates the request, performs the work, and returns to user mode. This boundary is more than a technical detail: it is the foundation of isolation, stability, and performance. Understanding it is the difference between seeing a program as “code” and seeing it as a state machine interacting with a privileged supervisor.

At the heart of the execution model is the process. A process is an address space with a set of resources (open files, permissions, signal handlers, memory mappings) and one or more threads of execution. Each thread has registers, a stack, and a scheduling state. The kernel tracks these in data structures like task_struct. The scheduler chooses which runnable task gets CPU time, and a context switch saves one task’s CPU state and restores another’s.

This model shapes everything you observe: syscalls are expensive because they save and restore context; a context switch might cause cache misses; a blocked syscall yields the CPU; a signal can interrupt a thread; a syscall can be restarted. You cannot measure performance, diagnose hangs, or build kernel tools without understanding this mental model.

Scheduling is where policy meets mechanism. The kernel decides when a task runs, for how long, and on which CPU. Linux uses scheduling classes (CFS for normal tasks, RT for real-time tasks) and attempts to approximate fairness. Scheduler decisions are visible in run queues, vruntime values, and context switch events. A kernel developer thinks in terms of run queues, wakeups, and preemption points, not just threads and loops.

Finally, the execution model is fundamentally about state transitions: RUNNING -> SLEEPING -> RUNNABLE -> RUNNING. Every blocking syscall is a state change. Every interrupt may cause a state change. Every signal can force a state change. When you build a syscall tracer or scheduler visualizer, you are decoding this state machine.

Privilege levels are enforced by the CPU itself. User space runs with limited privileges and cannot directly access hardware registers or kernel memory. This isolation is enforced through page table permissions and CPU mode bits. When a user program attempts a privileged operation, the CPU raises an exception or the kernel rejects the request. The OS relies on this hardware enforcement to keep unrelated processes isolated and to protect the kernel from crashes caused by user bugs.

The syscall interface is the explicit, audited entry point into the kernel. libc wrappers make syscalls feel like normal C functions, but the ABI contract matters: arguments must be placed in specific registers, and error handling is encoded in return values plus errno. If you mis-handle errno or forget about restartable syscalls, your tracer and tooling will misreport behavior. This is why OS development forces you to learn how syscalls are actually encoded on your architecture.

Processes and threads are different layers of the same abstraction. A process owns an address space and resource table; threads are execution contexts that share that address space. The kernel tracks both with a task representation, but the distinction matters because scheduling is always done on the thread (task) level. When you trace a process, you must know whether you are seeing a single thread or a group of threads, and which scheduling decisions are per-thread vs per-process.

Scheduling is where policy meets mechanics. The OS must balance throughput, fairness, and latency. The same program can feel fast or slow depending on how the scheduler allocates CPU time. A task that yields too often can be starved, while a task that never yields can monopolize CPU time. This is why kernel developers analyze run queue depth, wake-up latency, and context switch rates as primary signals.

Context switches are deceptively expensive. The CPU must save registers, swap stacks, and sometimes change memory mappings. This can flush TLB entries and disrupt caches. You will see this in your scheduler visualizer: high context switch rates often correlate with lower throughput. Understanding this cost helps you interpret the results of tracing tools and choose appropriate scheduling strategies.

Signals and asynchronous events complicate the execution model. A task might be interrupted mid-syscall, or a timer might fire and force a reschedule. These asynchronous events are normal and expected in a multitasking OS. A correct mental model includes the idea that execution is not a straight line; it is a series of interruptions, wakeups, and reschedulings.

Deep Dive into the Concept

The kernel execution model is implemented through a combination of CPU privilege levels, architecture-specific syscall mechanisms, and kernel data structures that represent processes and threads. On x86-64, the CPU has rings; user mode is Ring 3, kernel mode is Ring 0. The transition happens via a syscall instruction, which switches stacks, changes privilege level, and vectors into a handler defined by the kernel. The kernel then dispatches to the appropriate syscall function via a syscall table. This is where the ABI matters: each syscall expects arguments in specific registers, and return values are encoded with error conventions.

When a syscall enters the kernel, the kernel validates user pointers before dereferencing them. Functions like copy_from_user and copy_to_user perform access checks and handle faults gracefully. This design prevents user programs from corrupting kernel memory. It is also why syscall overhead can be significant: validation, safety checks, and potential page faults all add latency.

From there, syscalls interact with the process model. For example, read() will block if data is unavailable, which changes the process state to SLEEPING and puts it on a wait queue. The scheduler chooses another RUNNABLE task. Later, an interrupt (e.g., a disk I/O completion) wakes the task, which transitions back to RUNNABLE and eventually RUNNING. In other words, syscalls and interrupts work together to move tasks through states.

Scheduling in Linux is modular. CFS models an “ideal” CPU that divides time equally among runnable tasks. It uses vruntime to approximate fairness and a red-black tree to pick the next task. Real-time scheduling classes use different policies. These concepts show up in kernel docs and are critical when building a scheduler visualizer. Understanding which scheduling class a task belongs to, how it is enqueued, and when it is preempted is essential to interpreting observed behavior.

Context switching is expensive because it saves and restores registers, updates memory mappings (sometimes involving TLB flushes), and disrupts cache locality. A kernel developer uses tools to measure context switch rates, observes run queue depth, and understands the cost of waking tasks too frequently. Your scheduler visualizer project will expose these dynamics.

The process model is also about identity and security. Each task has credentials, capabilities, namespaces, and cgroups. These control what syscalls are allowed and how resources are accounted for. While this guide does not dive deeply into namespaces or cgroups, you will see their effects when tracing syscalls or analyzing scheduling behavior. Understanding that a syscall is not just a function call but a privileged operation with policy checks is key to kernel-level reasoning.

Finally, the syscall ABI is architecture-specific but conceptually consistent across systems: place arguments in registers, trigger a trap, dispatch in kernel, return with status. The exact registers differ, but the idea is the same. For your tracer, you must know where to read arguments and how to interpret return values. For your mini-kernel, you must design and implement your own syscall table. The kernel execution model is the foundation for all of it.

At the instruction level, syscalls are triggered by a dedicated instruction (syscall on x86-64). The CPU reads model-specific registers (MSRs) that contain the kernel entry address, switches to a privileged stack, and begins executing kernel code. The kernel’s entry code saves registers, establishes a kernel stack frame, and then dispatches the syscall number to the sys_call_table. This path is different from interrupts but shares many concepts: privilege escalation, state save/restore, and return to user mode.

The syscall ABI for x86-64 Linux is stable and well documented. Arguments are passed in rdi, rsi, rdx, r10, r8, and r9; the syscall number is placed in rax; the return value is in rax. Errors are indicated by negative values, and libc converts them to -1 while setting errno. Your tracer must read the correct registers and interpret return values properly or you will misdiagnose failures.
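
To make the register convention concrete, here is a minimal sketch (assuming Linux on x86-64, where write is syscall number 1) that issues the syscall with inline assembly instead of going through the libc wrapper:

// Raw write(2) on x86-64 Linux, bypassing the libc wrapper (sketch).
static long raw_write(int fd, const void *buf, unsigned long len) {
    long ret;
    __asm__ volatile (
        "syscall"
        : "=a"(ret)                              /* return value comes back in rax */
        : "0"(1L),                               /* syscall number in rax: 1 == __NR_write */
          "D"((long)fd), "S"(buf), "d"(len)      /* arguments in rdi, rsi, rdx */
        : "rcx", "r11", "memory");               /* the syscall instruction clobbers rcx and r11 */
    return ret;                                  /* values in [-4095, -1] encode -errno */
}

int main(void) {
    const char msg[] = "hello from a raw syscall\n";
    return raw_write(1, msg, sizeof(msg) - 1) < 0 ? 1 : 0;
}

Running it under strace should show a single write call, exactly as if the program had gone through libc.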

A syscall is not just a function call; it is also an opportunity for the kernel to enforce policy. Security modules (LSM) can deny operations, capabilities can restrict privileged actions, and seccomp filters can block syscalls entirely. These checks happen in the kernel path and are why syscalls sometimes fail even when parameters look valid. When you implement a tracer, you will see errno values that are caused by these policy checks, not just programming errors.

Linux also optimizes some syscalls with the vDSO (virtual dynamic shared object). Certain time-related syscalls, like clock_gettime, can be handled in user space by calling code in the vDSO page mapped into each process. This avoids the syscall transition overhead. A tracer that only hooks kernel entry points will miss these calls unless it is aware of vDSO behavior.
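
A quick way to see this: the sketch below (a standalone example, not part of a project) calls clock_gettime in a tight loop. On most x86-64 systems, running it under strace -c reports few or no clock_gettime syscalls, because the vDSO satisfies the reads in user space:

#include <stdio.h>
#include <time.h>

int main(void) {
    struct timespec ts;
    for (int i = 0; i < 1000000; i++)
        clock_gettime(CLOCK_MONOTONIC, &ts);   // usually handled in the vDSO, no kernel entry
    printf("%ld.%09ld\n", (long)ts.tv_sec, ts.tv_nsec);
    return 0;
}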

Syscalls interact with scheduling in complex ways. When a syscall blocks, the task transitions to SLEEPING and is placed on a wait queue. The scheduler then chooses another task. Later, a wakeup event (usually an interrupt handler or bottom half) moves the task back to RUNNABLE. This is why I/O-bound workloads often have high context switch rates: every I/O completion triggers a wakeup and potential reschedule.

CFS (Completely Fair Scheduler) uses a red-black tree to maintain runnable tasks ordered by vruntime. vruntime represents the amount of CPU time a task has received relative to its weight. The scheduler picks the leftmost node (smallest vruntime) to approximate fairness. Nice values adjust weights, and per-CPU run queues reduce contention. Load balancing moves tasks between CPUs to keep utilization even. Understanding these mechanics lets you interpret the scheduler traces you collect in Project 3.

Real-time scheduling uses separate scheduling classes that take priority over CFS. Tasks in the RT classes can preempt normal tasks, which means latency-sensitive threads can run immediately. This is why a misconfigured RT task can starve the system. While you may not implement RT scheduling, you should be aware that tasks can belong to different scheduling classes and that their behavior differs.

Context switches can be categorized into voluntary (a task yields or blocks) and involuntary (the scheduler preempts a running task). Voluntary switches are often due to I/O or waiting on a lock. Involuntary switches are a sign of CPU contention or time-slice expiration. These distinctions matter when interpreting scheduling statistics because they hint at the underlying bottleneck.
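
You can observe this split from user space without any tracing tools: the kernel exports per-task counters in /proc/self/status. A minimal sketch that prints them:

#include <stdio.h>
#include <string.h>

int main(void) {
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");
    if (!f) return 1;
    while (fgets(line, sizeof(line), f))
        if (strstr(line, "ctxt_switches"))   // voluntary_ctxt_switches / nonvoluntary_ctxt_switches
            fputs(line, stdout);
    fclose(f);
    return 0;
}

Run it after a CPU-bound loop and after a sleep-heavy loop and compare which counter grows.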

Signals and restartable syscalls introduce more complexity. If a process receives a signal while inside a blocking syscall, the kernel may return -EINTR. Some syscalls are marked restartable, and the kernel will transparently resume them. This behavior affects trace output and is a common cause of confusion when building syscall tracers and debuggers.
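
The sketch below reproduces this from user space: a SIGALRM handler installed without SA_RESTART interrupts a blocking read on a pipe, and read returns -1 with errno set to EINTR:

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void on_alarm(int sig) { (void)sig; }   // empty handler: its only job is to interrupt the syscall

int main(void) {
    struct sigaction sa = {0};
    sa.sa_handler = on_alarm;                  // no SA_RESTART, so read() is not transparently restarted
    sigaction(SIGALRM, &sa, NULL);

    int fds[2];
    pipe(fds);
    alarm(1);                                  // deliver SIGALRM in one second

    char buf[16];
    ssize_t n = read(fds[0], buf, sizeof(buf)); // blocks: nothing is ever written to the pipe
    if (n < 0 && errno == EINTR)
        printf("read interrupted: %s\n", strerror(errno));
    return 0;
}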

Process identity and resource control also influence execution. Each task has credentials, capabilities, and may belong to cgroups. Cgroups allow resource limits and accounting; namespaces isolate visibility. While this guide does not dive into cgroup internals, you will see their effects when tasks are throttled or isolated. Your tools should therefore be robust in environments with resource controls.

Finally, the kernel execution model is observable through tracepoints. The kernel emits events at syscall entry and exit, context switches, wakeups, and scheduling decisions. Tools like perf and ftrace expose these events without kernel modification. Your projects leverage these observability points to turn raw execution behavior into understandable timelines.

How This Fits the Projects

This chapter directly powers the syscall tracer, scheduler visualizer, user-space thread library, and the mini-kernel. Every other project depends on understanding how the kernel enters and exits, how tasks are scheduled, and how state transitions occur.

Definitions and Key Terms

  • Syscall ABI: The convention for passing syscall arguments and returning values.
  • Task: The kernel’s internal representation of a thread of execution.
  • Run queue: The scheduler data structure of runnable tasks.
  • Preemption: Interrupting a running task to run another.
  • Context switch: Saving/restoring CPU state between tasks.

Mental Model Diagram

User thread
  | syscall
  v
Syscall entry -> kernel dispatcher -> subsystem
                                |
                                +--> returns -> user mode
                                |
                                +--> blocks -> scheduler -> wakeup -> return -> user mode

How It Works (Step by Step)

  1. User code executes a syscall instruction with arguments in registers.
  2. CPU switches to kernel mode and jumps to the syscall entry point.
  3. Kernel validates arguments and dispatches the syscall handler.
  4. The handler may block, put the task to sleep, or complete immediately.
  5. Kernel returns to user mode with a return value and errno on error.

Minimal Concrete Example

// Minimal syscall invocation (x86-64 Linux) using syscall(2)
#include <unistd.h>
#include <sys/syscall.h>

int main() {
    long pid = syscall(SYS_getpid);
    return pid > 0 ? 0 : 1;
}

Common Misconceptions

  • “A syscall is just a function call.” -> It is a privilege transition with validation and policy checks.
  • “Context switches only happen on syscalls.” -> Interrupts and scheduling decisions cause them too.
  • “All threads are scheduled the same way.” -> Scheduling classes differ.

Check-Your-Understanding Questions

  1. Why is a syscall more expensive than a normal function call?
  2. What happens if a syscall blocks on I/O?
  3. How does the scheduler pick the next task?
  4. What is the difference between a mode switch and a context switch?

Check-Your-Understanding Answers

  1. It requires a privilege transition, state save/restore, validation, and dispatch.
  2. The task is put to sleep and removed from the run queue until the event completes.
  3. It selects a task from the run queue based on scheduling policy (e.g., CFS fairness).
  4. A mode switch changes privilege level; a context switch changes which task is running.

Real-World Applications

  • Syscall tracing and debugging tools (strace, ltrace)
  • Performance analysis of CPU scheduling
  • Security enforcement through syscall filtering

Where You Will Apply It

  • Project 1: Syscall Tracer
  • Project 3: Scheduler Visualization Tool
  • Project 6: User-space Thread Library
  • Final Project: Mini OS Kernel

References

  • Linux kernel scheduler design (CFS) documentation: https://docs.kernel.org/6.14/scheduler/sched-design-CFS.html
  • Operating Systems: Three Easy Pieces - Scheduling chapters
  • Linux Kernel Development by Robert Love - Scheduling and process chapters

Key Insight The kernel execution model is a controlled state machine: syscalls, scheduling, and context switches are the only paths through which programs move.

Summary The kernel execution model defines how user code requests privileged work, how tasks are scheduled, and how the system ensures isolation. Every project in this track is a different lens on this same model.

Homework / Exercises

  1. Trace your own shell with strace and identify the syscalls used for a simple ls.
  2. Use perf or /proc to measure context switches during a CPU-bound vs I/O-bound workload.

Solutions

  1. strace -f -o trace.log ls; then review openat, read, mmap, getdents64, etc.
  2. Run perf stat -e context-switches and compare workloads.

Chapter 2: Memory System (Virtual Memory, Paging, Page Faults, Allocators)

Fundamentals

Modern systems give each process the illusion of a private, contiguous address space. This is virtual memory. It allows processes to be isolated, simplifies programming, and enables advanced features like memory-mapped files and copy-on-write. The CPU and kernel cooperate to translate virtual addresses to physical memory using page tables. A missing translation causes a page fault, which the kernel handles by loading or allocating a page or by killing the process if the access is invalid.

The memory system is not just about correctness but performance. The Translation Lookaside Buffer (TLB) caches recent translations. A TLB miss is costly, and a major page fault (requiring disk I/O) can be millions of times slower than a cache hit. Understanding these costs changes how you design systems and how you interpret performance metrics.

Memory allocators are policy engines on top of these mechanisms. In user space, malloc manages the heap with strategies to reduce fragmentation and manage metadata. In kernel space, allocators like slab and slab-like caches optimize for frequent allocations of fixed-size objects. These allocators rely on the same underlying page allocator but optimize for different workloads.

Memory mapping (mmap) is the unifying abstraction that lets files and devices appear as memory. It is why page faults are often “normal”: a fault can mean the kernel is lazily loading a file into memory. Understanding this helps you interpret page fault statistics in your page-fault analyzer project.

Ultimately, the memory system is where high-level programs meet hardware realities. It is the reason a program can allocate gigabytes but only touch parts of it, and it is why memory-intensive workloads can stall even with free RAM. Your allocator project and page-fault analyzer will teach you this at the system level.

A process’s address space is typically organized into segments: text (code), data, BSS, heap, and stack, plus memory-mapped regions. The OS uses permissions on each region (read, write, execute) to enforce safety. Features like ASLR randomize these regions to make attacks harder. Understanding the typical layout helps you interpret /proc/pid/maps and correlate faults with code, heap, or stack regions.

Page size is a practical constraint that shapes everything. Most systems use 4 KB pages, but huge pages (2 MB or 1 GB on x86-64) can reduce TLB misses for large memory regions. Page size affects allocator alignment, fragmentation, and performance. Your allocator project will need to consider alignment and page boundaries to avoid wasting memory or causing page faults.

Memory is not just anonymous RAM; it is also a cache for files. The page cache stores file data in memory to reduce I/O. When you mmap a file, you are mapping page cache entries into your address space. The same file can be shared across multiple processes without duplication, which is why shared libraries are memory-efficient. Page faults are often the mechanism by which the page cache is filled.

Swapping is the kernel’s pressure release valve. When physical memory is scarce, the kernel can evict pages to disk or swap. This keeps processes alive but at enormous performance cost. Your page fault analyzer should distinguish between minor faults (cache or page table updates) and major faults (disk I/O), because the latter are often the source of performance collapses.

Finally, allocators are policy layers on top of these mechanisms. They decide when to grow the heap, when to use mmap, and how to manage free blocks. Their policies can make the difference between a stable, high-performance system and one that thrashes or fragments itself into failure. A good allocator is one that minimizes fragmentation while keeping allocation and free operations fast.

Deep Dive into the Concept

Virtual memory is implemented by dividing the address space into fixed-size pages (commonly 4 KB, with huge pages as an option). Each process has a hierarchy of page tables. When a CPU references a virtual address, it walks these tables to find the corresponding physical page and permissions. The TLB caches these mappings so the CPU does not walk the page tables for every access. A TLB miss triggers a page table walk; if a mapping is missing, a page fault occurs and control transfers to the kernel.

Page faults are not necessarily errors. There are minor faults (the page is in memory but not mapped or in the TLB) and major faults (the page must be fetched from disk). Copy-on-write (COW) faults happen when two processes share a page and one attempts to write. The kernel then allocates a new page, copies the data, and updates the page table. This is fundamental to fork(), which is why process creation can be fast even for large address spaces.
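
The user-visible half of COW is easy to demonstrate. In the sketch below, the child's write triggers a COW fault and lands in the child's own copy of the page, so the parent still sees the original value:

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int *value = malloc(sizeof(*value));
    *value = 1;

    pid_t pid = fork();                 // child initially shares pages copy-on-write
    if (pid == 0) {
        *value = 2;                     // first write triggers a COW fault; the child gets its own page
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("parent still sees %d\n", *value);   // prints 1: the child's write did not affect the parent
    return 0;
}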

Memory allocation in user space is built on top of syscalls like brk/sbrk and mmap. The allocator grows the heap or maps new regions. It keeps metadata to track allocated and free blocks. Fragmentation is a core problem: small free blocks can be unusable for large allocations, and large blocks can waste memory for small requests. Allocators balance speed and space efficiency with strategies like segregated free lists, coalescing, and best-fit/first-fit policies.
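
As a warm-up for the allocator project, here is a deliberately tiny bump-allocator sketch built directly on mmap. It never frees (handling free lists and fragmentation is the real work of Project 2), but it shows how an allocator obtains pages from the kernel and hands out aligned chunks; ARENA_SIZE and the 16-byte alignment are arbitrary choices for this sketch:

#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

#define ARENA_SIZE (1 << 20)            /* 1 MiB arena, chosen arbitrarily */

static char  *arena;
static size_t offset;

static void *bump_alloc(size_t size) {
    size = (size + 15) & ~(size_t)15;            /* round up to 16-byte alignment */
    if (!arena) {
        arena = mmap(NULL, ARENA_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (arena == MAP_FAILED) return NULL;
    }
    if (offset + size > ARENA_SIZE) return NULL; /* out of arena space */
    void *p = arena + offset;
    offset += size;
    return p;
}

int main(void) {
    int *a = bump_alloc(sizeof(int));
    char *s = bump_alloc(64);
    if (!a || !s) return 1;
    *a = 42;
    printf("allocated %p and %p, a = %d\n", (void *)a, (void *)s, *a);
    return 0;
}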

In kernel space, the allocator must be fast, safe, and aware of contexts where sleeping is not allowed. The slab allocator caches objects of specific sizes to reduce overhead. Functions like kmalloc, kfree, and vmalloc operate with different guarantees. vmalloc provides virtually contiguous memory but may not be physically contiguous; kmalloc provides physically contiguous memory for DMA. Knowing when to use each is critical for kernel modules.

Memory mapping links the virtual memory system to the file system and device drivers. mmap allows a file to be mapped into memory. The first access triggers a page fault, and the kernel loads the data from disk. This lazy loading reduces startup cost and enables efficient file I/O. It is also why page faults correlate with disk activity. A page fault analyzer can show you where the system is spending time and whether a program is thrashing.

Page tables and their layout are architecture-specific, but the mental model is consistent: hierarchical mappings with permissions. Understanding the mapping of user and kernel address space helps you interpret /proc/pid/maps, and it helps you design your allocator to align with page boundaries. It also clarifies security: pages can be marked non-executable (NX), read-only, or supervisor-only.

The memory system also interacts with scheduling. A task that triggers major faults may be put to sleep while I/O completes, freeing the CPU for other tasks. This creates an interplay between memory pressure, I/O scheduling, and CPU utilization. Your projects will expose these relationships in practice.

The page table hierarchy on x86-64 typically includes four levels: PGD, PUD, PMD, and PTE. Each level indexes a portion of the virtual address, and the final PTE points to a physical frame with permissions. When a process accesses memory, the CPU walks this hierarchy unless a TLB entry already exists. If the mapping is missing or violates permissions, a page fault exception is raised and the kernel’s fault handler takes over.

Fault handling is a complex pipeline. The kernel must determine whether the address is valid (e.g., within a VMA), whether the access is permitted (read, write, execute), and whether the page is present. For anonymous memory, the kernel may allocate a zero page or a new physical page. For file-backed memory, it may read the page from disk into the page cache and map it. For COW, it allocates and copies a new page. Each of these steps has different cost profiles and implications for performance.

Demand paging allows the kernel to map large regions without actually allocating pages until they are touched. This is why programs can appear to allocate large buffers quickly. But it also means that the first access can trigger a burst of page faults. In your page fault analyzer, you will likely see spikes when a program first touches a large array or when it maps a file and scans it.
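
The sketch below makes that burst visible: it maps a large anonymous region, records the minor-fault count from getrusage, touches every page, and records the count again. The delta is roughly the number of pages touched:

#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

static long minor_faults(void) {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main(void) {
    long page = sysconf(_SC_PAGESIZE);
    size_t len = (size_t)page * 1024;            // 1024 pages
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return 1;

    long before = minor_faults();
    for (size_t i = 0; i < len; i += (size_t)page)
        p[i] = 1;                                // first touch of each page faults it in
    long after = minor_faults();

    printf("minor faults during touch: %ld\n", after - before);
    return 0;
}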

The page cache is shared between processes and is often the largest consumer of RAM. It acts as a unified cache for file I/O. When the kernel needs memory, it can evict page cache entries. This is why disk I/O patterns influence memory availability and why memory pressure can slow down file-heavy workloads. Understanding this relationship helps you interpret major page faults and performance regressions.

Memory reclaim is managed by algorithms such as LRU (least recently used) approximations. The kernel runs kswapd to reclaim memory in the background. If reclaim cannot keep up, the OOM killer may terminate processes. This is a harsh but necessary policy. A memory-intensive program that ignores memory locality can force OOM behavior even if average usage seems acceptable.

User-space allocators are built on top of these mechanisms. They maintain metadata about blocks, often stored next to allocated memory. This metadata can be corrupted by buffer overflows, which is a classic security vulnerability. Allocator design is also about balancing metadata overhead with speed. Many allocators use segregated free lists for common sizes and mmap for large requests.

Kernel allocators differ because they must operate in interrupt context and cannot always sleep. The buddy allocator manages pages at the lowest level, while slab/slub manage objects. kmalloc provides physically contiguous memory and is suitable for small objects; vmalloc provides virtually contiguous memory and can be used for larger regions. These differences matter when you write kernel modules and need to choose the correct allocation API.

Memory overcommit allows the kernel to promise more memory than is physically available, based on the assumption that not all memory will be used simultaneously. This can improve utilization but also causes surprises when large allocations are suddenly touched. Tools like your page fault analyzer can reveal overcommit effects by showing sudden spikes in major faults or OOM events.

NUMA systems add another layer of complexity. Memory is physically attached to specific CPU sockets. Accessing remote memory is slower. The kernel uses NUMA policies to keep memory local to the CPU that uses it. If your workload is heavily multithreaded, NUMA effects can dominate performance. While you may not fully implement NUMA policies, you should be aware of their existence.

Finally, memory permissions are a security boundary. The kernel marks pages as non-executable (NX) to prevent code injection. It enforces read-only pages for shared libraries and COW pages. When you build a mini-kernel, you will need to define your own page permissions and handle faults when they are violated.

How This Fits the Projects

This chapter powers the custom allocator and page-fault analyzer. It also influences how you think about stack usage in the thread library and memory protection in the mini-kernel.

Definitions and Key Terms

  • Virtual memory: An abstraction giving each process a private address space.
  • Page table: Data structure mapping virtual pages to physical pages.
  • TLB: CPU cache of recent address translations.
  • Major fault: Page not in RAM; requires disk I/O.
  • Minor fault: Page in RAM but not mapped; no disk I/O.
  • COW: Copy-on-write, shared pages copied on first write.

Mental Model Diagram

Virtual Address -> Page Table -> Physical Page
    |                |
    |                v
    |           TLB cache
    v
 Page fault -> Kernel handler -> load/allocate -> update table

How It Works (Step by Step)

  1. CPU accesses a virtual address.
  2. TLB lookup occurs.
  3. If miss, page table walk happens.
  4. If mapping missing, page fault occurs.
  5. Kernel resolves by loading or allocating a page or killing the task.

Minimal Concrete Example

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main() {
    size_t sz = 4096 * 10;
    char *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return 1; // mapping can fail, e.g., under memory pressure
    p[0] = 'A'; // triggers a page fault on first touch
    printf("%c\n", p[0]);
    return 0;
}

Common Misconceptions

  • “Page faults mean errors.” -> Many faults are normal (demand paging, COW).
  • “malloc always uses brk.” -> Modern allocators use mmap for large allocations.
  • “TLB misses are cheap.” -> They are far more expensive than cache hits.

Check-Your-Understanding Questions

  1. Why are major page faults so expensive?
  2. What happens on the first write after fork()?
  3. Why do allocators use multiple free lists?
  4. What is the difference between kmalloc and vmalloc?

Check-Your-Understanding Answers

  1. They require disk I/O to bring a page into RAM.
  2. A COW fault occurs and a new page is allocated and copied.
  3. To reduce fragmentation and improve allocation speed.
  4. kmalloc gives physically contiguous memory; vmalloc gives virtually contiguous memory.

Real-World Applications

  • Memory allocators and performance tuning
  • Database buffer pool design
  • OS paging and swap policies

Where You Will Apply It

  • Project 2: Custom Memory Allocator
  • Project 4: Page Fault Analyzer
  • Project 6: User-space Thread Library
  • Final Project: Mini OS Kernel

References

  • The Linux Programming Interface - Memory management chapters
  • Computer Systems: A Programmer’s Perspective - Virtual memory chapter

Key Insight Virtual memory is both an isolation mechanism and a performance engine, and page faults are the observable boundary between them.

Summary Virtual memory gives every process a private address space, page faults make it real, and allocators turn raw pages into usable memory.

Homework / Exercises

  1. Use /proc/pid/maps to inspect a running program’s memory layout.
  2. Allocate a large mmap region and touch it sparsely; measure page faults with perf.

Solutions

  1. cat /proc/$$/maps (your current shell), or cat /proc/<pid>/maps for any running process
  2. perf stat -e page-faults ./your_test

Chapter 3: Files, I/O, and Device Drivers (VFS, Char/Block Devices)

Fundamentals

Files are the universal abstraction in Unix. Everything from disk storage to devices is accessed through file descriptors and syscalls like open, read, write, and ioctl. The kernel implements this through the Virtual File System (VFS), which provides a uniform interface across different filesystem types. The VFS translates generic file operations into filesystem-specific operations by dispatching through function pointers in structures like file_operations and inode_operations.

Devices are exposed via special files in /dev. Character devices provide byte-stream access (e.g., terminals), while block devices provide block-based access (e.g., disks). The kernel’s device driver model ties hardware to these interfaces. A driver must register its operations and manage device state, often interacting with interrupts, DMA, and hardware registers.

Understanding the VFS and device model is essential for the kernel module project. You will implement a character device that provides read/write semantics and an ioctl interface. This teaches you how the kernel exposes custom functionality to user space and how it protects itself from invalid pointers.

The file descriptor model is deceptively simple: an integer index into a per-process table that points to an open file description. That open file description holds the current file offset and a reference to the inode and file operations. This means two processes can share the same open file description (via fork or dup), which is why they share a file offset. Understanding this is essential for interpreting read/write behavior and for building drivers that behave like regular files.
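
A short sketch of the shared-offset behavior: after dup, both descriptors point at the same open file description, so a write through one advances the offset seen by the other:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd1 = open("demo.txt", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd1 < 0) return 1;
    int fd2 = dup(fd1);                        // same open file description, same offset

    write(fd1, "abc", 3);
    printf("offset via fd2: %ld\n", (long)lseek(fd2, 0, SEEK_CUR));  // prints 3
    close(fd1);
    close(fd2);
    return 0;
}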

Path lookup is one of the most performance-sensitive operations in the kernel. The VFS uses the dentry cache to avoid repeated path resolution. This is why the first open of a file is often slower than subsequent opens. Your syscall tracer will show bursts of openat and stat calls when path resolution happens. It is not just “file I/O”; it is name lookup, permission checks, and cache interaction.

Devices are exposed as special files so that user space can use familiar read/write semantics. But devices are not disks: they often involve interrupts, DMA, and asynchronous completion. A character device driver needs to define how reads block or return, how writes are buffered, and how ioctl calls control device behavior. Your kernel module project will force you to map these design decisions to file semantics.

Finally, the VFS model unifies not only real filesystems but also pseudo filesystems like procfs and sysfs. This is why reading /proc/interrupts or /proc/pid/maps feels like reading a file even though no disk I/O occurs. It is a clean abstraction that also gives you powerful debugging tools for your projects.

Another subtlety is buffering. The kernel may satisfy reads from the page cache and defer writes, which is why fsync and fdatasync exist. Buffered I/O improves throughput but can hide latency. Your tracer and scheduler visualizer will reveal spikes when buffered writes finally flush. This is also why write() returning does not always mean data is on disk.

Deep Dive into the Concept

The VFS is a layer that decouples system calls from filesystem implementations. When a user program calls read(), the kernel resolves the file descriptor to a struct file, which points to a struct file_operations table containing function pointers for read, write, mmap, and ioctl. This indirection allows the same syscall to work for ext4, XFS, procfs, device nodes, and more. The VFS also uses inodes to represent filesystem objects and dentries to cache path lookups. These caches are critical for performance.

Character device drivers implement file_operations so that user space can treat the device like a file. For example, a ring buffer driver might store bytes in a circular buffer in kernel space. The driver must handle concurrency (multiple readers/writers), validate user pointers with copy_to_user and copy_from_user, and ensure that operations are non-blocking or blocking as expected.
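
As a sketch of what that looks like, here is one plausible shape for the read handler of such a driver. It is a kernel-code fragment, not a complete module, and mydev_buf, mydev_len, and mydev_lock are hypothetical module-level variables:

/* Fragment of a character-device read handler (kernel context, sketch only). */
#include <linux/fs.h>
#include <linux/mutex.h>
#include <linux/uaccess.h>

static ssize_t mydev_read(struct file *file, char __user *ubuf,
                          size_t count, loff_t *ppos)
{
    ssize_t ret;

    if (mutex_lock_interruptible(&mydev_lock))
        return -ERESTARTSYS;              /* interrupted while waiting for the lock */

    if (*ppos >= mydev_len) {
        ret = 0;                          /* end of data */
    } else {
        size_t n = mydev_len - *ppos;
        if (n > count)
            n = count;
        if (copy_to_user(ubuf, mydev_buf + *ppos, n)) {
            ret = -EFAULT;                /* never dereference user pointers directly */
        } else {
            *ppos += n;
            ret = n;
        }
    }
    mutex_unlock(&mydev_lock);
    return ret;
}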

Ioctl adds a flexible control plane to devices. It allows user space to send custom commands and data structures. The kernel provides macros like _IO, _IOR, and _IOW to define ioctl numbers consistently. Getting this right matters for compatibility and safety. Poor ioctl handling is a common source of kernel bugs and security vulnerabilities.
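
A sketch of how such definitions might live in a header shared by the module and its user-space clients (the MYDEV names and struct stats are placeholders, not a real API):

/* mydev_ioctl.h - shared between the kernel module and user space (sketch) */
#ifndef MYDEV_IOCTL_H
#define MYDEV_IOCTL_H

#include <linux/ioctl.h>

struct stats {
    unsigned long bytes_written;
    unsigned long bytes_read;
};

#define MYDEV_IOC_MAGIC  'k'   /* pick an unused magic number; see the ioctl-number docs */

#define MYDEV_RESET      _IO(MYDEV_IOC_MAGIC, 0)                    /* no data transferred */
#define MYDEV_GET_STATS  _IOR(MYDEV_IOC_MAGIC, 1, struct stats)     /* kernel -> user */
#define MYDEV_SET_SIZE   _IOW(MYDEV_IOC_MAGIC, 2, unsigned long)    /* user -> kernel */

#endif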

The device model also requires registration of major and minor device numbers. The kernel uses these to identify device types and route operations to the correct driver. udev and device nodes in /dev provide the user-space visible interface. In your project, you will use alloc_chrdev_region, cdev_init, and cdev_add to register your device.

I/O is often asynchronous. A read() might block waiting for data, which involves putting the task to sleep and waking it later. The driver can integrate with wait queues to support blocking semantics. For a ring buffer, you can implement poll/select support to integrate with user-space event loops.

This chapter also introduces the difference between user-space and kernel-space memory allocation. In a driver, you may need kmalloc for kernel memory, and you must ensure allocations are done in the correct context (GFP flags). These details will show up when you build your module.

The path from open() to actual file operations involves multiple caches and checks. The kernel performs a path walk, resolving each component through dentry and inode lookups, enforcing permissions at each step. The resulting struct file points to a file_operations table, which is the core of polymorphism in the VFS. This is why a read() on a regular file, a device node, or a procfs entry can all be expressed through the same syscall but invoke very different code paths.

The open file description contains the current offset and flags. This is why two file descriptors can share an offset and why pread and pwrite exist: they bypass the shared offset and provide explicit positioning. This matters for drivers that want to emulate normal file behavior. If you do not manage offsets correctly, your driver will appear broken to user space.

For block devices, I/O often flows through the block layer. The kernel converts a read request into a bio and submits it to a request queue. Modern kernels use multi-queue I/O schedulers (like mq-deadline or bfq) to optimize throughput and latency. These layers are often invisible to user space, but they show up in performance counters and tracing output. Understanding this helps you interpret I/O-related page faults and latency spikes.

Character devices are simpler but still need correct concurrency. The ring buffer you implement must handle simultaneous readers and writers, avoid race conditions, and correctly block or return partial data. Supporting poll/select means integrating with wait queues so that user space can wait for data without busy looping. These design details turn a toy driver into a usable interface.

The ioctl interface is powerful but dangerous. It bypasses the usual read/write semantics and allows custom commands. Kernel documentation provides macros (_IO, _IOR, _IOW, _IOWR) that encode the direction and size of the data. If you define ioctls incorrectly, you can break 32-bit compatibility or introduce memory safety issues. This is why robust ioctl design is a key part of kernel module work.

The device model uses major and minor numbers to map device nodes to drivers. The kernel maintains a registry of device numbers, and tools like udev create /dev entries for you. In your module, you will likely allocate a dynamic major number and create the device node manually or with udev rules. Understanding this mapping is essential for debugging device nodes that do not appear.

The VFS also includes the page cache and buffered I/O. When you read a file, the kernel often serves data from the page cache rather than going to disk. The page cache is shared across processes, so repeated reads by different programs can be served without disk access. This is why file I/O performance can improve dramatically after the first run, and why page faults may show up as major faults on the first access but minor faults on subsequent accesses.

Finally, memory-mapped I/O (mmap) ties the VFS to the virtual memory system. When you mmap a file, page faults cause the kernel to load data from disk into memory and map it into your address space. The same mechanism is used for memory-mapped devices. This is another way your projects intersect: your memory analyzer can show faults caused by file I/O through mmap, while your driver can optionally support mmap to expose its buffer.

Asynchronous I/O and event-driven readiness (poll, select, epoll) sit on top of VFS. A driver that supports poll can integrate smoothly with user-space event loops. Under the hood, poll registers wait queues so the kernel can wake user space when data becomes available. If you skip this, your device becomes hard to use in real applications.

Many high-performance drivers also implement mmap to expose internal buffers directly to user space, reducing copies. This is common in video, networking, and shared-memory IPC. The driver must ensure the mapped pages remain valid and synchronized. Even if you do not implement mmap in your project, understanding that it exists helps you see why page faults can be triggered by device access, not just files.

How This Fits the Projects

This chapter is central to the kernel module character device project and influences your understanding of syscalls in the tracer project.

Definitions and Key Terms

  • VFS: Virtual File System, the kernel’s abstraction layer for file operations.
  • file_operations: Struct of function pointers for file methods.
  • inode: Metadata structure representing a filesystem object.
  • Character device: Byte-stream device node in /dev.
  • ioctl: Custom device control interface.

Mental Model Diagram

read() syscall
  -> VFS
      -> file_operations.read
          -> driver code
              -> device/hardware

How It Works (Step by Step)

  1. User opens a device node in /dev.
  2. Kernel resolves file descriptor to struct file.
  3. read/write/ioctl dispatch via file_operations.
  4. Driver copies data to/from user space safely.
  5. The kernel returns to user mode with results.

Minimal Concrete Example

// User-space ioctl example (MYDEV_GET_STATS and struct stats come from the driver's shared header)
#include <fcntl.h>
#include <sys/ioctl.h>

int fd = open("/dev/mydevice", O_RDWR);
struct stats s;
int rc = ioctl(fd, MYDEV_GET_STATS, &s); // returns -1 and sets errno on failure

Common Misconceptions

  • “Drivers are just libraries.” -> Drivers run in kernel mode and must be safe.
  • “ioctl is just a function call.” -> It crosses user/kernel boundary with validation.

Check-Your-Understanding Questions

  1. Why does VFS use function pointers in file_operations?
  2. What is the difference between a char and block device?
  3. Why must drivers use copy_to_user?
  4. What does _IOR mean in ioctl definitions?

Check-Your-Understanding Answers

  1. It allows multiple filesystem and device types to share the same syscalls.
  2. Char devices are stream-based; block devices are block-addressable.
  3. Directly dereferencing user pointers is unsafe.
  4. It indicates data is read by user (kernel writes to user).

Real-World Applications

  • Device driver development
  • Kernel modules for custom hardware
  • Filesystem implementations

Where You Will Apply It

  • Project 5: Kernel Module - Character Device Driver
  • Final Project: Mini OS Kernel

References

  • Linux kernel driver basics documentation: https://docs.kernel.org/6.2/driver-api/basics.html
  • ioctl numbering conventions: https://www.kernel.org/doc/html/v6.0/userspace-api/ioctl/ioctl-number.html
  • copy_to_user and copy_from_user documentation: https://www.kernel.org/doc/html/v4.17/core-api/uaccess.html

Key Insight The VFS and driver model let every kernel object look like a file, which is why Unix is both simple and powerful.

Summary The VFS abstracts file operations, drivers expose devices through file semantics, and ioctl provides a safe control path.

Homework / Exercises

  1. Inspect /proc/devices and identify major numbers on your system.
  2. Use strace to observe open/read/write on /dev/null and /dev/zero.

Solutions

  1. cat /proc/devices
  2. strace -e openat,read,write head -c 16 /dev/zero; repeat with cat /dev/null

Chapter 4: Interrupts, Concurrency, and Time

Fundamentals

Interrupts are how hardware gets the CPU’s attention. When a device needs service, it triggers an interrupt, causing the CPU to pause the current task and run an interrupt handler. This is how keyboard input, disk I/O completion, and network packets are processed. Interrupts introduce concurrency at the hardware level. They can happen at any time, which is why kernel code must be careful about synchronization.

The kernel handles interrupts in stages: a top half (fast, minimal work) and a bottom half (deferred work). This split prevents long interrupt handlers from blocking other interrupts. Understanding this split is essential for analyzing latency and building an interrupt profiler.

Concurrency in the kernel is not just threads. It is interrupts, softirqs, tasklets, and multiple CPUs running in parallel. Locks, atomic operations, and memory barriers are the tools that maintain consistency. A kernel developer must understand when code can sleep, when it must be lock-free, and how to avoid deadlocks.

Timekeeping is woven into scheduling and interrupt handling. The kernel uses timers, ticks, and high-resolution clocks to manage timeouts and scheduling slices. Measuring interrupt latency requires accurate time measurement, and it reveals how system load affects responsiveness.

Interrupts are different from exceptions. An exception is triggered by the current instruction (like a divide-by-zero or page fault), while an interrupt is triggered by external hardware. Both transfer control to the kernel, but they represent different kinds of events. This distinction matters when you interpret traces, because a page fault is not an IRQ but an exception with its own handler path.

On multi-core systems, interrupts are distributed across CPUs. This is why /proc/interrupts shows per-CPU counts. Interrupt affinity and balancing can affect latency: if two high-rate interrupts share the same CPU, latency spikes are likely. Understanding this helps you design your interrupt profiler and interpret results.

Concurrency is about more than threads; it includes interrupts preempting kernel code, softirqs running in special contexts, and multiple CPUs executing the kernel simultaneously. This is why kernel synchronization is complex: locks must protect data not just from other threads, but from interrupt handlers and deferred work. The choice between spinlocks and mutexes is a choice between blocking and busy-waiting, and it has direct latency implications.

Timekeeping is not a single clock. The kernel exposes multiple clocks, and their resolution and stability differ. High-resolution timers (hrtimers) provide precise scheduling and latency measurement, while the system tick provides coarser scheduling for general workloads. Understanding how the kernel accounts CPU time is necessary to interpret scheduler statistics and interrupt latency measurements.

Non-maskable interrupts (NMI) are special: they cannot be disabled and are often used for critical events. They complicate debugging because they can interrupt even code that has disabled interrupts. Understanding their existence helps explain rare, hard-to-reproduce timing issues.

Kernel preemption can be disabled in critical sections, which reduces race conditions but increases latency. The art of kernel engineering is to keep these sections short and well-structured so that interrupt latency remains predictable even under load.

Deep Dive into the Concept

Interrupts are delivered by the hardware (via an interrupt controller) and handled by the kernel’s IRQ subsystem. The CPU pushes state, jumps to an interrupt vector, and runs the handler. The kernel must minimize work in the top half because interrupts are often masked or prioritized. Deferred work runs in softirq or workqueue context; workqueues run in process context where sleeping is allowed, while softirqs must not sleep. Understanding which context you are in determines which APIs you can safely call.

Interrupt latency is the delay between the hardware interrupt and the execution of the handler. Sources of latency include interrupt masking, high-priority interrupts, and long-running critical sections with preemption disabled. Real-time systems care deeply about this, but even general-purpose systems benefit from lower latency. The interrupt latency profiler project will expose these dynamics by correlating interrupt arrival with handler execution.

Concurrency is hard because the kernel is preemptible and multi-core. A data structure can be accessed by multiple CPUs and by interrupt handlers. Spinlocks are used when sleeping is not allowed; mutexes are used when sleeping is allowed. RCU (Read-Copy-Update) provides a lock-free read path for certain structures. Atomic operations and memory barriers enforce ordering between CPUs.

Timekeeping in Linux uses multiple clocks: CLOCK_MONOTONIC, CLOCK_REALTIME, and high-resolution timers. The kernel’s tick infrastructure drives scheduling, timeouts, and accounting. High-resolution timers allow precise timing for scheduling and profiling. In your interrupt project, you will measure time at microsecond granularity, which forces you to understand clock sources and their overhead.
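
A small sketch of the measurement pattern you will use throughout the profiler: take two CLOCK_MONOTONIC timestamps and compute the difference in nanoseconds. Calling the clock back-to-back also reveals the overhead of the clock read itself:

#include <stdio.h>
#include <time.h>
#include <unistd.h>

static long long elapsed_ns(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1000000000LL + (b.tv_nsec - a.tv_nsec);
}

int main(void) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    clock_gettime(CLOCK_MONOTONIC, &t1);    // back-to-back: measures the cost of a clock read
    printf("back-to-back clock reads: %lld ns\n", elapsed_ns(t0, t1));

    clock_gettime(CLOCK_MONOTONIC, &t0);
    usleep(1000);                           // request ~1 ms; wakeup latency adds to the result
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("requested 1 ms, slept %lld ns\n", elapsed_ns(t0, t1));
    return 0;
}

On a loaded system the sleep often overshoots noticeably; that excess is timer and scheduling latency.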

Interrupts and concurrency also affect user-space behavior. A syscall might be interrupted and restarted. A thread may be preempted while holding a lock. Signal handlers can run at unexpected times. These effects show up when you measure scheduling and latency.

The IRQ subsystem is driven by the interrupt controller (APIC on x86). Each device is assigned an interrupt vector, and the controller delivers it to a CPU. The kernel’s interrupt handler acknowledges the interrupt, runs the top half, and schedules bottom half work. The top half must be minimal because interrupts may be masked or prioritized, and long handlers can delay other interrupts. Bottom halves (softirqs or workqueues) execute later and can perform heavier processing.

Softirqs are reserved for high-throughput, low-latency tasks (like networking). They run in interrupt context and cannot sleep. Workqueues run in process context and can sleep, making them safer for slow operations. Tasklets are a legacy mechanism similar to softirqs. When you profile interrupt latency, you must know which context you are measuring: top half execution, softirq handling, or workqueue processing.

Kernel synchronization is a discipline. Spinlocks are used when sleeping is not allowed, which is often in interrupt context. Mutexes allow sleeping but must not be used in contexts where the scheduler cannot run. RCU provides a lock-free read path for data structures where reads dominate writes. Each mechanism has strict rules, and violating them can lead to deadlocks or crashes. Lock ordering rules and tools like lockdep help detect problems.
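
As a rough, hedged sketch of what these rules look like in module code (the data and function names are purely illustrative): data shared with interrupt context is protected by a spinlock with local interrupts disabled, while data touched only from process context can use a mutex.

// Illustrative kernel-side locking sketch (names are made up for the example).
#include <linux/spinlock.h>
#include <linux/mutex.h>

static DEFINE_SPINLOCK(stats_lock);   /* protects data also touched from IRQ context */
static DEFINE_MUTEX(config_lock);     /* protects data touched only from process context */
static unsigned long stats_counter;
static int config_value;

/* Safe to call from an interrupt handler: never sleeps. */
void stats_increment(void)
{
    unsigned long flags;

    spin_lock_irqsave(&stats_lock, flags);
    stats_counter++;
    spin_unlock_irqrestore(&stats_lock, flags);
}

/* Valid only in process context (e.g. a syscall path): may sleep. */
void config_update(int value)
{
    mutex_lock(&config_lock);
    config_value = value;
    mutex_unlock(&config_lock);
}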

Memory ordering matters on modern CPUs. Atomic operations provide ordering guarantees, but not all operations are fully ordered by default. The kernel uses memory barriers to enforce ordering when needed. This is why concurrency bugs can appear only on certain architectures or under heavy load. Your kernel module project will probably not require explicit barriers, but understanding that they exist is important for any kernel-level work.

Interrupt latency is affected by disabled interrupts, long critical sections, and CPU load. If a kernel path disables interrupts for too long, hardware events will be delayed. This is why the kernel tries to keep interrupt-off sections short. Measuring maximum latency and correlating it with system activity (like network bursts or disk I/O) is the core of Project 7.

Scheduling and interrupts interact. A timer interrupt drives scheduler ticks and can trigger preemption. Tickless kernels reduce periodic interrupts to save power, but this can change latency behavior. High-resolution timers can provide fine-grained scheduling while reducing overhead. Understanding these trade-offs helps you interpret timing data.

Priority inversion is a classic concurrency failure mode. A low-priority task holds a lock needed by a high-priority task, while a medium-priority task runs, preventing the low-priority task from releasing the lock. Kernel mechanisms like priority inheritance help mitigate this, especially in real-time configurations. Even if you do not implement this, being aware of it is essential when analyzing latency spikes.

Interrupt affinity and balancing can be adjusted to tune performance. The kernel and irqbalance daemon distribute interrupts across CPUs, but misconfiguration can cause jitter. Your profiler can detect when a specific IRQ is pinned to a busy CPU and suggest rebalancing.

Finally, timer and clock infrastructure underpins everything from scheduling to profiling. The kernel exposes multiple clock sources, and their accuracy depends on hardware. A profiler must know which clock it is using and the associated overhead. Using a fast, stable clock source is critical for accurate latency measurement.

Real-time kernels introduce PREEMPT_RT, which changes how interrupts are threaded and how locks behave. In RT mode, many interrupt handlers run in kernel threads, which changes latency characteristics. Even on non-RT kernels, IRQ threading exists for some drivers. A good profiler should note whether IRQs are threaded or hard IRQs, because this affects the interpretation of latency measurements.

Per-CPU data structures are a common optimization to reduce contention. Instead of locking a global counter, the kernel maintains per-CPU counters and aggregates them when needed. This reduces locking overhead but complicates measurement because values are distributed. Your latency profiler should consider per-CPU aggregation and the possibility that a single CPU is overloaded while others are idle.

Lock-free techniques like sequence locks (seqlock) allow readers to proceed without heavy locking while writers update a version counter. This is useful for timekeeping data and other read-heavy structures. The trade-off is that readers may need to retry if they detect a concurrent write. Understanding these patterns helps you interpret kernel performance and latency behavior.

How This Fits the Projects

This chapter is essential for the interrupt latency profiler, and it informs your understanding of scheduling, locking in kernel modules, and the user-space threading library.

Definitions and Key Terms

  • IRQ: Interrupt request line.
  • Top half: Immediate interrupt handler.
  • Bottom half: Deferred interrupt work (softirq, tasklet, workqueue).
  • Spinlock: Lock used in contexts where sleeping is not allowed.
  • RCU: Read-Copy-Update, a synchronization mechanism for read-heavy data.

Mental Model Diagram

Hardware IRQ -> Top half (fast) -> Deferred work (softirq/workqueue)
                                         |
                                         v
                                      wakes tasks

How It Works (Step by Step)

  1. Device triggers IRQ.
  2. CPU jumps to interrupt handler.
  3. Top half executes quickly and schedules bottom half.
  4. Bottom half runs later and may wake tasks.
  5. Scheduler decides which task runs next.

Minimal Concrete Example

// Kernel-style sketch of a top-half/bottom-half split using a workqueue.
// The work item is initialized once at setup time, e.g. INIT_WORK(&my_work, my_work_fn).
static struct work_struct my_work;

static irqreturn_t my_irq_handler(int irq, void *dev)
{
  /* Top half: acknowledge the device and defer the heavy lifting. */
  schedule_work(&my_work);
  return IRQ_HANDLED;
}

static void my_work_fn(struct work_struct *work)
{
  /* Bottom half: runs in process context and may sleep. */
  process_data();
}

Common Misconceptions

  • “Interrupt handlers can do anything.” -> They must be fast and cannot sleep.
  • “Locks are the same everywhere.” -> Spinlocks and mutexes have different rules.

Check-Your-Understanding Questions

  1. Why are long interrupt handlers harmful?
  2. When should you use a spinlock vs a mutex?
  3. What is the difference between softirq and workqueue?
  4. Why do real-time systems care about interrupt latency?

Check-Your-Understanding Answers

  1. They block other interrupts and increase latency.
  2. Use spinlocks when sleeping is not allowed; use mutexes when it is.
  3. Softirqs run in interrupt context; workqueues run in process context.
  4. High latency can violate timing guarantees and cause missed deadlines.

Real-World Applications

  • Real-time systems
  • Network drivers and packet processing
  • Performance tuning and latency debugging

Where You Will Apply It

  • Project 7: Interrupt Latency Profiler
  • Project 5: Kernel Module (locking)
  • Project 3: Scheduler Visualization

References

  • Linux kernel tracing docs (tracefs and ftrace): https://www.kernel.org/doc/html/next/trace/ftrace.html
  • Linux tracing guide: https://docs.kernel.org/trace/index.html

Key Insight

Interrupts and concurrency are where kernel correctness meets real-time performance; safe synchronization is non-negotiable.

Summary

Interrupts drive hardware responsiveness, concurrency defines correctness, and timekeeping defines predictability.

Homework / Exercises

  1. Read /proc/interrupts and identify active IRQ lines.
  2. Use ftrace or perf to measure context switch and IRQ events.

Solutions

  1. cat /proc/interrupts
  2. sudo perf trace -e irq:* -a sleep 1

Chapter 5: Observability and Kernel Tooling (procfs, tracefs, perf, eBPF)

Fundamentals

Kernel development requires visibility. You cannot fix what you cannot observe. Linux exposes internal state through pseudo-filesystems like procfs and tracefs, and through interfaces like perf_event_open and eBPF. These allow user-space tools to observe syscalls, page faults, scheduling decisions, and interrupt latencies without modifying the kernel.

procfs provides a file-like interface to kernel data structures. Files like /proc/pid/maps, /proc/schedstat, and /proc/interrupts reveal memory mappings, scheduler statistics, and interrupt counts. tracefs and ftrace allow you to trace kernel events and functions. perf_event_open provides access to hardware performance counters and kernel tracepoints, while eBPF allows you to run safe programs in kernel context to collect high-fidelity data.

This chapter is essential because most of your projects are about observation: syscall tracing, scheduling visualization, page fault analysis, interrupt profiling. The kernel itself already provides the data; your job is to interpret and visualize it.

Observability tools trade convenience for overhead and permissions. procfs is simple but can be expensive to poll at high frequency. tracefs and ftrace can capture deep details but often require root access. perf can sample with lower overhead but needs careful configuration. eBPF provides high fidelity but also demands a deeper toolchain. Choosing the right tool is part of system design, not just debugging.

procfs entries are generated on demand, which means their contents represent a snapshot of kernel state at read time. This is convenient but can also lead to race conditions in analysis if the system changes rapidly. A robust tool should tolerate slight inconsistencies between related procfs entries.

Security and policy matter. Some procfs data is restricted by settings like hidepid. perf_event_open is gated by perf_event_paranoid, and eBPF may require additional privileges or kernel config. A practical tool must detect and explain these constraints to the user, which is why explicit error handling is a feature, not just a nuisance.

Other pseudo-filesystems also play a role. sysfs exposes device and driver information, while debugfs exposes internal debugging interfaces. They are often used for kernel instrumentation. Knowing the difference between procfs, sysfs, and debugfs helps you locate the right data source and avoid relying on unstable debug-only interfaces.

Observability also includes correlation. A single metric rarely tells the whole story, so effective tools combine traces with process names, CPU IDs, and timestamps, and then align them with logs or application events. Even a simple profiler becomes dramatically more useful when it can answer not just what happened but which task caused it and when.

Practical tooling also requires presentation. Raw traces are overwhelming, so most production tools aggregate by PID, syscall name, or time bucket. Learning to summarize and visualize is part of kernel engineering because it turns low-level events into actionable insights.

Because these interfaces are live, tools must be resilient to partial data and transient errors. A good design reads, validates, and retries rather than assuming a stable snapshot. This mindset mirrors kernel development itself: the system is always changing while you observe it.

Deep Dive into the Concept

procfs is a pseudo-filesystem mounted at /proc. It exposes process and system information as files. For example, /proc/pid/maps lists memory regions with permissions and offsets; /proc/interrupts shows interrupt counts per CPU; /proc/schedstat provides scheduler statistics. These files reflect live kernel data structures, so reading them is effectively querying the kernel. The proc(5) man page describes the structure and security implications, including hidepid options that restrict visibility.

tracefs is mounted at /sys/kernel/tracing and provides a control interface to the kernel tracing subsystem. ftrace is the underlying tracer, which supports function tracing, event tracing, and latency tracing. You enable events by writing to tracefs files, then read results from trace or trace_pipe. This is the basis for many profiling and debugging workflows. The kernel documentation describes how to mount tracefs and use it.

perf_event_open is a syscall that allows programs to access hardware and software performance counters. It can count events (like CPU cycles, cache misses) and sample them to produce profiles. It can also attach to tracepoints, which is useful for capturing kernel events like page faults or context switches. perf and related tools are built on top of this interface. The man page is the primary reference for how to configure and read perf events.

eBPF extends observability by allowing user-space to load safe bytecode into the kernel. eBPF programs can attach to kprobes, tracepoints, and cgroups. They can collect data and send it back via maps and ring buffers. This allows you to build sophisticated tracing tools without modifying the kernel or using heavy instrumentation.

These tools have trade-offs. procfs is simple but limited; ftrace is powerful but requires root and careful configuration; perf is flexible but can be complex; eBPF offers deep insight but requires more tooling. A skilled systems engineer knows which tool to use based on overhead, permissions, and precision needs.

procfs exposes both per-process and system-wide data. For example, /proc/pid/stat provides a fixed set of fields describing CPU time, memory usage, and scheduling state. /proc/pid/maps lists VMAs with permissions and backing files, which is essential for mapping page faults to code and heap regions. /proc/schedstat aggregates scheduler metrics, and /proc/interrupts enumerates interrupt counts per CPU. These are the raw materials for most of your tracer and visualization projects.
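
A small, hedged sketch of how a user-space tool consumes one of these files: the program below simply streams /proc/self/maps; a real analyzer would parse the address range, permissions, and backing path out of each line.

// Minimal sketch: dump the current process's memory mappings from procfs.
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/self/maps", "r");
    if (!f) {
        perror("fopen");
        return 1;
    }

    char line[512];
    while (fgets(line, sizeof(line), f)) {
        /* Each line: start-end perms offset dev inode [path] */
        fputs(line, stdout);
    }

    fclose(f);
    return 0;
}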

tracefs is the control plane for the kernel tracing infrastructure. It exposes directories for events, filters, and trace buffers. For example, enabling sched_switch events requires writing to events/sched/sched_switch/enable. Filters can reduce overhead by limiting which events are collected. trace_pipe provides a live stream of events, while trace provides a snapshot. Understanding these files lets you build lightweight tools without writing kernel code.

perf_event_open is configured with a perf_event_attr struct. You can count events (e.g., CPU cycles) or sample them with a given frequency. Sampling produces a ring buffer that your tool must read and decode. For page faults, you can attach to the major-faults and minor-faults tracepoints, or use software event counters. The man page details flags like sample_type, read_format, and disabled, which are essential for correct setup.
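
A minimal counting-mode sketch, following the pattern in the perf_event_open(2) man page, might look like the program below. It counts page faults for the calling process around a placeholder workload; error handling is reduced to the essentials.

// Hedged sketch: count page faults for this process with perf_event_open.
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_SOFTWARE;
    attr.config = PERF_COUNT_SW_PAGE_FAULTS;
    attr.disabled = 1;

    /* pid = 0, cpu = -1: this process, on any CPU. */
    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) {
        perror("perf_event_open");
        return 1;
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* Placeholder workload: touch freshly allocated memory to cause faults. */
    size_t len = 16 * 1024 * 1024;
    char *buf = malloc(len);
    if (buf)
        memset(buf, 1, len);

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count = 0;
    read(fd, &count, sizeof(count));
    printf("page faults: %llu\n", (unsigned long long)count);

    free(buf);
    close(fd);
    return 0;
}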

eBPF runs verified bytecode in kernel context. The verifier ensures safety by checking bounds, loops, and memory access patterns. Programs attach to tracepoints, kprobes, or cgroups and can write data to maps. Maps are shared data structures that user space can read. This is how high-performance tools like bpftrace and bcc operate. The upside is high fidelity and low overhead; the downside is a steeper learning curve.

Tracing overhead is real. Collecting every event can slow a system dramatically. Practical tools apply filtering, sampling, or aggregation. For example, a syscall tracer might only trace a subset of syscalls, or aggregate counts per second. A page fault analyzer might sample every Nth event. These trade-offs determine whether a tool is usable in production.

Finally, observability is about interpretation, not just collection. Raw trace events are not useful without context. You must map addresses to symbols, resolve PIDs to process names, and correlate events with time. This is why your projects emphasize visualization and summarization rather than raw logging.

perf sampling can capture call stacks when configured with callchain and sample_type flags. This is how perf can attribute CPU time to code paths. For kernel tracepoints, you can capture task context, PID, and timestamps. If you want to build a high-quality page-fault analyzer, you must choose sample types carefully to avoid losing context or producing misleading reports.

eBPF maps come in multiple forms: hash maps, arrays, LRU maps, and ring buffers. Ring buffers are efficient for streaming events to user space with low overhead. Modern BPF also supports CO-RE (Compile Once, Run Everywhere), which allows programs to adapt to kernel structure changes using BTF metadata. This is why eBPF tools can often run across kernel versions without recompilation. If you extend your projects with eBPF, you will need to understand map lifetime, pinning, and how to avoid verifier rejections.

BPF programs attach to many hook points: kprobes for dynamic instrumentation, tracepoints for stable event formats, and uprobes for user-space functions. Each attachment has trade-offs in stability and overhead. A robust tool chooses the most stable hook available and falls back to others when needed. This design consideration is as important as the data itself.

User-space decoding is as important as kernel-side capture. You often need to map addresses to symbols (using /proc/pid/maps or ELF parsing) and resolve PIDs to command lines. Without this translation layer, trace data is opaque. Building this translation pipeline is one of the most valuable skills you gain from the tracing projects.

Finally, perf and ftrace can be combined with static tracing in applications to create end-to-end timelines. This is how production systems correlate kernel events with application-level spans. Even if you do not build full tracing systems here, understanding the concept helps you design tools that can integrate with larger observability stacks.

How This Fits the Projects

This chapter powers the syscall tracer, scheduler visualizer, page-fault analyzer, and interrupt profiler. It also provides debugging tools for the kernel module and mini-kernel.

Definitions and Key Terms

  • procfs: Process and system info pseudo-filesystem (/proc).
  • tracefs: Tracing control filesystem (/sys/kernel/tracing).
  • ftrace: Kernel function and event tracer.
  • perf_event_open: Syscall used to access performance counters.
  • eBPF: Extended BPF, safe in-kernel programs for tracing and filtering.

Mental Model Diagram

Kernel data structures
  |     |     |
 procfs tracefs perf/eBPF
  |     |     |
 user-space tools -> visualization -> insight

How It Works (Step by Step)

  1. Kernel exposes state via procfs or tracepoints.
  2. User-space tool configures tracefs or perf_event_open.
  3. Kernel emits events into a ring buffer or counters.
  4. User-space reads events and decodes them.
  5. Visualization turns raw events into insight.

Minimal Concrete Example

# Mount tracefs and enable scheduler switch tracing
sudo mount -t tracefs tracefs /sys/kernel/tracing   # usually already mounted
echo 1 | sudo tee /sys/kernel/tracing/events/sched/sched_switch/enable
sudo cat /sys/kernel/tracing/trace_pipe | head

Common Misconceptions

  • “procfs is a real filesystem.” -> It is a live view of kernel data.
  • “perf is only for CPU profiling.” -> It also measures kernel tracepoints.

Check-Your-Understanding Questions

  1. Why is tracefs required for ftrace events?
  2. What is the difference between counting and sampling in perf?
  3. Why is eBPF safer than traditional kernel modules for tracing?
  4. What does /proc/pid/maps show?

Check-Your-Understanding Answers

  1. ftrace is configured and read through tracefs files.
  2. Counting aggregates totals; sampling records events with timestamps and context.
  3. eBPF programs are verified and sandboxed to prevent crashes.
  4. It lists memory mappings, permissions, and backing files for a process.

Real-World Applications

  • Production profiling (perf, eBPF)
  • Kernel debugging and latency analysis
  • Security monitoring and audit tooling

Where You Will Apply It

  • Project 1: Syscall Tracer
  • Project 3: Scheduler Visualization Tool
  • Project 4: Page Fault Analyzer
  • Project 7: Interrupt Latency Profiler

References

  • proc(5) man page: https://man7.org/linux/man-pages/man5/proc.5.html
  • perf_event_open(2) man page: https://www.man7.org/linux/man-pages/man2/perf_event_open.2.html
  • ftrace documentation: https://www.kernel.org/doc/html/next/trace/ftrace.html
  • BPF documentation: https://www.kernel.org/doc/html/v5.16/bpf/index.html

Key Insight

Observability is a first-class capability of the Linux kernel, and tools like procfs, tracefs, perf, and eBPF are the windows into its behavior.

Summary

Kernel observability is how you turn internal state into evidence. These interfaces are the backbone of modern performance and debugging tooling.

Homework / Exercises

  1. Read /proc/interrupts and correlate with device activity.
  2. Use perf stat to measure page faults during a memory-heavy program.

Solutions

  1. cat /proc/interrupts and run a network ping to see counts increase.
  2. perf stat -e page-faults ./your_memory_test

Chapter 6: Boot and Initialization (From Power-On to First Process)

Fundamentals

A kernel is useless if it cannot boot. Boot is the process of transitioning from firmware (BIOS/UEFI) into a running kernel. A bootloader loads the kernel image into memory, sets up a CPU execution environment, and jumps to the kernel entry point. The kernel then initializes critical subsystems: memory management, interrupts, scheduler, and device drivers. Only after that does it start the first user-space process.

Understanding boot is essential for the final project where you build a minimal OS kernel. You need to know what state the CPU is in, how to set up protected mode or long mode, how to build the GDT and IDT, and how to enable paging. These steps are architecture-specific, but the high-level phases are common across systems.

Firmware (BIOS or UEFI) initializes basic hardware and loads a bootloader. The bootloader’s job is to place the kernel in memory, prepare a minimal execution environment, and transfer control to the kernel entry point. Modern bootloaders (like GRUB) can pass a memory map and kernel command line arguments. This information is critical for early memory setup.

The kernel’s early boot code must run before full memory management and drivers are available. Early console output, simple memory allocators, and basic exception handling are often implemented with minimal facilities. This is why boot code is often architecture-specific and carefully structured.

After the kernel sets up paging, interrupts, and core data structures, it starts the scheduler and launches the first user-space process (init). From that point onward, the OS behaves like the system you use every day. Understanding this transition helps you design your mini-kernel and debug early boot failures.

Bootloaders can follow standards like Multiboot, which defines how the kernel is loaded and which parameters are passed. Using a standard loader reduces complexity for a mini-kernel project because it gives you a known memory map and entry point. This is why many OS tutorials start with Multiboot or a GRUB-based loader.

Initramfs provides an early user-space environment that the kernel can pivot to after boot, often used to load drivers or mount the real root filesystem. Even if your mini-kernel does not use initramfs, the concept is valuable because it shows how early user space can extend kernel capabilities before the full system is online.

The kernel command line is another boot-time interface. It can enable or disable features, set root filesystem parameters, and control debugging output. Even small kernels can benefit from a minimal command line parser to control boot-time behavior, which is an optional extension for your final project.

Boot is also where platform differences show up. BIOS vs UEFI, multiboot vs custom loaders, and 32-bit vs 64-bit mode all change the startup sequence. Even if your final kernel targets a single configuration, understanding these variations will make you far more effective when reading OSDev resources and debugging boot failures.

Even small kernels benefit from a deterministic boot log so that you can bisect failures quickly and compare behavior across runs.

Deep Dive into the Concept

On x86, boot starts in real mode with limited memory addressing. The bootloader sets up protected mode (or long mode) to enable 32-bit or 64-bit addressing. It must set up a Global Descriptor Table (GDT) to define segments and an Interrupt Descriptor Table (IDT) to handle exceptions and interrupts. The kernel then sets up paging by building page tables, loading their base address into CR3, and setting the paging bit in CR0. Without paging, modern OS concepts like virtual memory cannot work.

Once basic CPU mode is established, the kernel initializes memory allocators, builds process structures, sets up the scheduler, and prepares device drivers. It also mounts a root filesystem or creates a temporary in-memory filesystem. The final step is launching the first user-space process (init), which then spawns other processes.

In your mini-kernel, you will implement a simplified version of this boot path. You will load the kernel with a bootloader (or a multiboot loader), set up minimal page tables, establish a basic interrupt handler, and create a tiny process model. This gives you an end-to-end view of OS construction.

On x86 systems, the CPU starts in 16-bit real mode. The bootloader must switch to protected mode or long mode to access modern memory addressing. This requires constructing a GDT and setting the PE bit in CR0. To enter long mode, you must additionally enable PAE, set the LME bit in the EFER MSR, and install 64-bit-capable page tables before enabling paging. These steps are highly ordered; doing them in the wrong sequence results in triple faults and instant resets.

The kernel entry point is usually a small assembly stub that sets up a stack, clears BSS, and initializes early memory management. Early memory allocators (memblock in Linux) track available physical memory before the full buddy allocator is initialized. This is why early allocations have different APIs than normal kernel allocations.

Interrupt handling must be initialized early. The IDT defines the handlers for exceptions and interrupts. Without a valid IDT, any fault will cause a reset. A minimal kernel often sets up default handlers that print diagnostic info and halt. Later, the kernel installs full handlers for page faults, timers, and device interrupts.

The kernel may be linked as a higher-half kernel, meaning it expects to run at a high virtual address. This requires setting up identity mappings for early code, then switching to higher-half mappings. This concept is often confusing to beginners, but it is essential for robust kernel layouts. Your mini-kernel can start with identity mapping and then optionally transition to a higher-half design.

SMP bring-up is another advanced boot topic. The bootstrap processor (BSP) initializes the system, then starts application processors (APs) using inter-processor interrupts. Each CPU needs its own stack and per-CPU data. While your mini-kernel may remain single-core, understanding the concept explains why per-CPU data structures and CPU-local variables exist in real kernels.

The final steps involve launching init or a shell process. In Linux, this is done by executing /sbin/init (or systemd). In your mini-kernel, you can implement a simplified version: create a single user process that runs a minimal shell loop. This is the moment the OS becomes visible to the user.

Early boot code also configures the interrupt controller (APIC) and parses firmware tables (like ACPI) to discover hardware. In full Linux, this is a large subsystem, but in a mini-kernel you can stub most of it. Still, you should understand that device discovery and initialization are part of the boot path, not an afterthought. The initcall mechanism in Linux shows this: subsystems are initialized in stages, with dependencies respected. This staged initialization concept can guide your own kernel boot sequence.

When a kernel is linked, the linker script defines where sections like .text, .data, and .bss live. Early boot code may need to relocate these sections or clear BSS before calling C code. Higher-half kernels require careful relocation and identity mapping of the early code region. If you jump to a virtual address before paging is correctly configured, the CPU will fault. This is why boot code is often written in assembly with extremely strict ordering. In your mini-kernel, you can start with a simple identity-mapped kernel and later experiment with higher-half mappings to understand the trade-offs.

The transition from assembly to C during boot is non-trivial. The kernel must establish a valid stack, zero the BSS, and ensure that global constructors (if any) are handled. These details are invisible in user space but are foundational for a reliable kernel. If you skip these steps, subtle bugs appear later in unrelated subsystems.

A real kernel also builds internal descriptor tables (like the TSS on x86) and configures task state for interrupts and syscalls. Even if your mini-kernel uses a simplified model, it is useful to know why these structures exist: they allow the CPU to switch stacks safely on privilege transitions and provide a controlled entry point into kernel code.

A mini-kernel can also implement a simple bootstrap allocator that carves memory from a fixed region before the full allocator is ready. This mirrors the real kernel’s approach and avoids circular dependencies. Once paging and a basic page allocator exist, you can free the bootstrap region and transition to normal allocation. This staged memory setup is a key pattern in OS design and explains why early boot code is so carefully ordered.
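
A bootstrap allocator can be as simple as a bump pointer over a reserved region. The sketch below is an illustrative, freestanding-style example with made-up symbols (boot_heap, boot_alloc); a real kernel would carve the region out of the bootloader-provided memory map rather than a static array.

// Hedged sketch of a boot-time bump allocator (illustrative symbols only).
#include <stddef.h>
#include <stdint.h>

#define BOOT_HEAP_SIZE (64 * 1024)

static uint8_t boot_heap[BOOT_HEAP_SIZE];   /* reserved early-boot region */
static size_t  boot_heap_used;

/* Allocate size bytes aligned to align (a power of two). Allocations are never
 * freed individually; the whole region is released once the real allocator runs. */
void *boot_alloc(size_t size, size_t align)
{
    size_t start = (boot_heap_used + align - 1) & ~(align - 1);

    if (start + size > BOOT_HEAP_SIZE)
        return NULL;                        /* out of bootstrap memory */

    boot_heap_used = start + size;
    return &boot_heap[start];
}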

One more practical boot detail is debugging. QEMU provides logging flags like -d int,cpu_reset and a GDB stub that lets you single-step early boot code. Using these tools you can inspect register state before and after enabling paging, verify IDT entries, and catch triple faults before they reset the machine. Learning to debug at this stage is part of kernel development, and it is exactly the skill you need when your mini-kernel fails silently.

This workflow turns boot failures from mystery resets into debuggable states and shortens iteration loops dramatically.

How This Fits the Projects

This chapter is essential for the final mini-kernel and provides context for how the kernel modules you build are loaded in a running system.

Definitions and Key Terms

  • Bootloader: Program that loads the kernel into memory and jumps to it.
  • GDT/IDT: Descriptor tables for segments and interrupts.
  • Protected mode / Long mode: CPU execution modes for 32/64-bit operation.
  • Paging: Virtual memory translation mechanism.
  • init process: First user-space process started by the kernel.

Mental Model Diagram

Power on -> Firmware -> Bootloader -> Kernel entry
  -> set up GDT/IDT -> enable paging -> init subsystems -> start init

How It Works (Step by Step)

  1. Firmware initializes hardware and loads bootloader.
  2. Bootloader loads kernel and sets CPU mode.
  3. Kernel sets up GDT/IDT and page tables.
  4. Kernel initializes memory, scheduler, and drivers.
  5. Kernel launches the first user-space process.

Minimal Concrete Example

; Pseudo-assembly for a minimal kernel entry stub
_start:
  mov esp, stack_top    ; establish a stack before calling C code
  call kernel_entry     ; jump to the C entry point
  hlt                   ; halt if kernel_entry ever returns

Common Misconceptions

  • “Boot just loads the kernel.” -> It must also set CPU state and memory mappings.
  • “Paging can be enabled later.” -> Many kernel structures assume paging early.

Check-Your-Understanding Questions

  1. Why do we need an IDT before enabling interrupts?
  2. What does the bootloader do besides loading the kernel?
  3. Why is paging critical for modern OS design?

Check-Your-Understanding Answers

  1. Without an IDT, interrupts would jump to undefined addresses.
  2. It sets up CPU mode, memory layout, and passes boot parameters.
  3. It provides isolation, virtual memory, and flexibility in layout.

Real-World Applications

  • Kernel bring-up on embedded systems
  • OS development and hypervisor design

Where You Will Apply It

  • Final Project: Mini OS Kernel

References

  • OSDev Wiki (boot process and paging): https://wiki.osdev.org
  • Computer Systems: A Programmer’s Perspective - machine-level programming and memory

Key Insight

Boot is the bridge between hardware reality and the kernel’s abstractions; without it, there is no OS.

Summary

Boot establishes CPU mode, memory translation, and interrupt handling, enabling the kernel to run and create processes.

Homework / Exercises

  1. Read the OSDev bootloader overview and summarize the boot phases.
  2. Draw the sequence of steps from BIOS to first user process.

Solutions

  1. BIOS/UEFI -> bootloader -> kernel entry -> setup -> init.
  2. Use the mental model diagram and annotate each step.

Glossary

  • CFS: Completely Fair Scheduler, Linux’s default scheduler for normal tasks.
  • TLB: Translation Lookaside Buffer, CPU cache of virtual->physical mappings.
  • VMA: Virtual Memory Area, a continuous memory range in a process.
  • RCU: Read-Copy-Update, a synchronization mechanism for read-heavy data.
  • tracefs: Filesystem for configuring kernel tracing.
  • perf_event_open: Syscall used to access performance counters.
  • ioctl: Device-specific control interface.

Why OS and Kernel Development Matters

The Modern Problem It Solves

Modern software runs on massive infrastructure: cloud servers, mobile devices, and embedded systems. The OS kernel is the layer that makes hardware reliable, secure, and performant. If you understand the kernel, you can optimize performance, diagnose failures, and build systems that scale.

Recent, verifiable indicators:

  • Web infrastructure: Linux is used by 59.5% of all websites and 55.8% of the top 1,000,000 websites (W3Techs, 21 Dec 2025). Source: https://w3techs.com/technologies/comparison/os-Linux
  • Mobile dominance: Android holds 71.9% global mobile OS share (StatCounter, Nov 2025). Source: https://gs.statcounter.com/os-market-share/mobile/worldwide/2025
  • Supercomputing: The TOP500 June 2025 list shows leading systems using Linux-based OS distributions such as TOSS, HPE Cray OS, and RHEL. Source: https://top500.org/lists/top500/list/2025/06/

Traditional App Developer              Kernel Developer
+--------------------------+           +--------------------------+
| Uses OS APIs             |           | Defines OS APIs          |
| Sees virtual memory      |           | Implements paging        |
| Depends on scheduling    |           | Builds schedulers        |
| Uses files and devices   |           | Builds VFS and drivers   |
+--------------------------+           +--------------------------+

Context and Evolution (Optional)

Unix and Linux evolved from academic research to power the majority of modern infrastructure. The same foundational concepts continue to drive performance, security, and scalability today.


Concept Summary Table

  • Kernel Execution Model: how syscalls, processes, and scheduling define program execution.
  • Memory System: how virtual memory, paging, and allocators shape performance and isolation.
  • Files, I/O, Drivers: how VFS and drivers expose hardware through file semantics.
  • Interrupts and Concurrency: how IRQs, locking, and timekeeping maintain correctness and responsiveness.
  • Observability and Tooling: how procfs, tracefs, perf, and eBPF expose kernel behavior.
  • Boot and Initialization: how the kernel transitions from firmware to a running OS.

Project-to-Concept Map

  • Project 1: Syscall Tracer - observes syscall entry/exit (uses Execution Model, Observability)
  • Project 2: Memory Allocator - heap allocator replacement (uses Memory System)
  • Project 3: Scheduler Visualizer - real-time scheduling view (uses Execution Model, Observability)
  • Project 4: Page Fault Analyzer - fault tracing and mapping (uses Memory System, Observability)
  • Project 5: Kernel Module Device - char device driver (uses Files/I/O/Drivers, Interrupts/Concurrency)
  • Project 6: User-space Threads - green thread scheduler (uses Execution Model, Memory System)
  • Project 7: Interrupt Profiler - IRQ latency tracing (uses Interrupts/Concurrency, Observability)
  • Final Project: Mini OS Kernel - bootable kernel (uses all chapters)

Deep Dive Reading by Concept

Fundamentals and Execution Model

Concept Book and Chapter Why This Matters
Syscalls and processes “Operating Systems: Three Easy Pieces” - Process and syscall chapters Core OS execution model
Scheduling “Linux Kernel Development” - Scheduler chapters Modern kernel scheduling logic
Machine-level execution “Computer Systems: A Programmer’s Perspective” - Ch. 2-3 ABI and calling conventions

Memory System

Concept Book and Chapter Why This Matters
Virtual memory “Computer Systems: A Programmer’s Perspective” - Ch. 9 Page tables, TLB, faults
Allocators “C Interfaces and Implementations” - Ch. 5-6 Allocator design
Paging and faults “Operating Systems: Three Easy Pieces” - Ch. 18-23 Fault handling models

Files, I/O, and Drivers

Concept Book and Chapter Why This Matters
VFS and file descriptors “The Linux Programming Interface” - File I/O chapters Syscall semantics
Device drivers “Linux Kernel Development” - Driver basics Kernel driver structure

Interrupts and Concurrency

Concept Book and Chapter Why This Matters
Concurrency “Operating Systems: Three Easy Pieces” - Concurrency chapters Locks and race conditions
Performance “Systems Performance” by Brendan Gregg (if available) Latency analysis

Observability and Tooling

Concept Book and Chapter Why This Matters
/proc and tracing “The Linux Programming Interface” - Procfs and tracing chapters Kernel introspection
Debugging “The Art of Debugging with GDB” User-space debugging support

Boot and Initialization

Concept Book and Chapter Why This Matters
Boot process “Operating Systems: Three Easy Pieces” - Intro and mechanisms Boot-time mental model
CPU modes “Inside the Machine” or “Low-Level Programming” CPU state transitions

Quick Start

Day 1 (4 hours):

  1. Read Chapter 1 (Execution Model) and Chapter 5 (Observability).
  2. Skim the syscall tracer project.
  3. Run strace ls and identify at least 5 syscalls.
  4. Start Project 1 and capture your first syscall output.

Day 2 (4 hours):

  1. Add filtering to your tracer (only open/read/write).
  2. Use /proc/pid/maps to inspect a running process.
  3. Read Chapter 2 (Memory System) to prepare for Project 2.
  4. Write down three questions you want answered by these projects.

End of weekend: You can explain what a syscall is, how it is captured, and why context switches matter.


Path 1: The Kernel Tool Builder (Recommended Start) Best for: People who want observability and debugging skills.

  1. Project 1: Syscall Tracer
  2. Project 3: Scheduler Visualization
  3. Project 4: Page Fault Analyzer
  4. Project 7: Interrupt Profiler

Path 2: The Kernel Engineer Best for: People who want to write kernel code.

  1. Project 1: Syscall Tracer
  2. Project 5: Kernel Module Device Driver
  3. Project 7: Interrupt Profiler
  4. Final Project: Mini OS Kernel

Path 3: The Memory and Performance Specialist Best for: People who want to master memory and scheduling.

  1. Project 2: Memory Allocator
  2. Project 4: Page Fault Analyzer
  3. Project 3: Scheduler Visualization

Path 4: The Completionist Best for: Full-stack kernel mastery.

  • Phase 1: Projects 1-2
  • Phase 2: Projects 3-4
  • Phase 3: Projects 5-7
  • Phase 4: Final Project


Success Metrics

By the end, you should be able to:

  • Explain a syscall path with registers and mode transitions
  • Identify why a process is sleeping using /proc and tracing tools
  • Build a kernel module that handles concurrent access safely
  • Measure page fault rates and explain performance implications
  • Boot a minimal kernel and run a simple shell process

Appendices

Appendix A: Kernel Module Build Cheat Sheet

Based on kernel kbuild documentation, the standard external module build command is:

make -C /lib/modules/$(uname -r)/build M=$PWD

Reference: https://docs.kernel.org/6.12/kbuild/modules.html

Appendix B: procfs and tracefs Quick Reference

  • Mount procfs: mount -t proc proc /proc (usually already mounted)
  • Mount tracefs: mount -t tracefs tracefs /sys/kernel/tracing
  • /proc/pid/maps: memory maps for a process
  • /proc/interrupts: interrupt counts per CPU

Reference: https://man7.org/linux/man-pages/man5/proc.5.html Reference: https://www.kernel.org/doc/html/next/trace/ftrace.html


Project List

Project 1: Syscall Tracer (strace-lite)

  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 3: Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Syscalls and process execution
  • Software or Tool: ptrace or seccomp-bpf
  • Main Book: “The Linux Programming Interface” by Michael Kerrisk

What you will build: A syscall tracer that attaches to a running process and logs every syscall with arguments, return values, and timing.

Why it teaches OS internals: It forces you to map user-space calls to kernel behavior and understand how syscalls traverse the kernel boundary.

Core challenges you will face:

  • Attaching to processes with ptrace and handling threads
  • Decoding syscall arguments and return values
  • Filtering and summarizing large volumes of events
  • Handling signals and syscall restarts

Real World Outcome

When complete, you will have a tool that looks and feels like a minimal strace clone.

What you will see:

  1. Live stream of syscalls with arguments and return values
  2. Summary statistics (counts, total time, errors)
  3. Filters for specific syscalls or processes

Command Line Outcome Example:

# 1. Attach to a running process
$ sudo ./syscall-tracer -p 1234
[ts=12.345678] openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3 (0.000091s)
[ts=12.345812] mmap(NULL, 8192, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f2c7b000000 (0.000073s)
[ts=12.345921] close(3) = 0 (0.000006s)

# 2. Run a program and trace it
$ ./syscall-tracer -- ./bin/mini-httpd
[ts=0.000004] socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
[ts=0.000013] bind(3, 0.0.0.0:8080) = 0
[ts=0.000025] listen(3, 128) = 0

# 3. Summary mode
$ ./syscall-tracer --summary -- ./bin/mini-httpd
openat: 12 calls, 0 errors, 1.2ms total
read:   45 calls, 0 errors, 3.8ms total
write:  34 calls, 0 errors, 2.1ms total

The Core Question You Are Answering

“How does a user-space program cross into the kernel, and how can I observe that boundary without modifying the kernel?”

Answering this gives you the mental model needed for every other project.

Concepts You Must Understand First

  1. Syscall ABI and registers
    • Where are arguments passed?
    • What does the return value mean?
    • Book Reference: “Computer Systems: A Programmer’s Perspective” - Ch. 3
  2. ptrace or seccomp basics
    • How does ptrace intercept syscalls?
    • What is the overhead of tracing?
    • Book Reference: “The Linux Programming Interface” - Debugging and tracing chapters
  3. Process states and signals
    • What happens if the traced process receives SIGSTOP?
    • Book Reference: “Operating Systems: Three Easy Pieces” - Process chapters

Questions to Guide Your Design

  1. How will you handle multi-threaded processes?
  2. How will you decode pointer arguments safely?
  3. How will you measure syscall latency accurately?
  4. What is your performance overhead and how will you reduce it?

Thinking Exercise

The “Hidden Work” Exercise

Given this syscall sequence, what user action caused it?

openat("/etc/ld.so.cache")
mmap(...)
openat("/lib/x86_64-linux-gnu/libc.so.6")
read(...)
close(...)

Questions: What is being loaded? Why does it happen before main() runs?

The Interview Questions They Will Ask

  1. “Explain the syscall path on x86-64 Linux.”
  2. “How is a syscall different from a context switch?”
  3. “Why are syscalls expensive?”
  4. “How does ptrace intercept syscalls?”
  5. “How would you reduce tracing overhead?”
  6. “What happens if the traced process uses threads?”

Hints in Layers

Hint 1: Start with ptrace

ptrace(PTRACE_TRACEME, 0, 0, 0);

Hint 2: Use PTRACE_SYSCALL to stop on entry and exit

Hint 3: Read registers to decode arguments

struct user_regs_struct regs;
ptrace(PTRACE_GETREGS, pid, NULL, &regs);
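
Putting these pieces together, a hedged sketch of the entry/exit loop (x86-64 register names; assumes the tracee is already attached and stopped) could look like this:

// Sketch: alternate between syscall entry and exit stops for an attached tracee.
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <stdio.h>

static void trace_loop(pid_t pid)
{
    int status, entering = 1;

    for (;;) {
        if (ptrace(PTRACE_SYSCALL, pid, 0, 0) == -1)   /* run to next syscall stop */
            break;
        if (waitpid(pid, &status, 0) == -1 || WIFEXITED(status))
            break;

        struct user_regs_struct regs;
        if (ptrace(PTRACE_GETREGS, pid, NULL, &regs) == -1)
            break;

        if (entering)
            printf("syscall %llu entry\n", (unsigned long long)regs.orig_rax);
        else
            printf("  -> returned %lld\n", (long long)regs.rax);
        entering = !entering;
    }
}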

Hint 4: Verify with strace output

strace -f -p 1234

Books That Will Help

Topic Book Chapter
Syscalls “The Linux Programming Interface” Syscall chapters
Process model “Operating Systems: Three Easy Pieces” Process chapters
ABI “Computer Systems: A Programmer’s Perspective” Ch. 3

Common Pitfalls and Debugging

Problem 1: “Tracer hangs after attach”

  • Why: The process is stopped and not continued.
  • Fix: Use PTRACE_CONT or PTRACE_SYSCALL after each stop.
  • Quick test: Verify with ps -o stat and ensure it is running.

Problem 2: “Arguments are garbage”

  • Why: You are reading the wrong registers for your architecture.
  • Fix: Confirm ABI register mapping for x86-64.

Problem 3: “Missing syscalls from threads”

  • Why: Not tracing new threads.
  • Fix: Use PTRACE_O_TRACECLONE and attach to new TIDs.

Definition of Done

  • Captures syscall entry and exit with timestamps
  • Decodes at least open/read/write/close arguments correctly
  • Handles multi-threaded processes
  • Produces summary statistics
  • Overhead documented with a simple benchmark

Project 2: Custom Memory Allocator

  • Main Programming Language: C
  • Alternative Programming Languages: Rust
  • Coolness Level: Level 4: Hardcore
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Memory management
  • Software or Tool: brk/sbrk, mmap
  • Main Book: “C Interfaces and Implementations” by David Hanson

What you will build: A drop-in malloc/free replacement that supports splitting, coalescing, and mmap for large allocations.

Why it teaches OS internals: It forces you to interact with the OS memory system and understand fragmentation, page alignment, and allocation policies.

Core challenges you will face:

  • Managing free lists and metadata safely
  • Coalescing and splitting blocks
  • Handling large allocations via mmap
  • Avoiding fragmentation and alignment bugs

Real World Outcome

What you will see:

  1. Your allocator works with real programs using LD_PRELOAD
  2. You can visualize heap blocks and fragmentation
  3. Benchmarks compare your allocator to glibc

Command Line Outcome Example:

$ LD_PRELOAD=./libmymalloc.so ./bench_alloc
[alloc] malloc(64) -> 0x55b2c00010
[alloc] malloc(4096) -> 0x55b2c01000
[free ] free(0x55b2c00010)

$ ./heapviz
[BLOCK 0x1000: 64 USED] [BLOCK 0x1040: 128 FREE] [BLOCK 0x10c0: 256 USED]

The Core Question You Are Answering

“How does the OS give memory to a process, and how can I manage that memory efficiently myself?”

Concepts You Must Understand First

  1. Virtual memory and page alignment
    • Why do allocators align to 16 or 32 bytes?
    • Book Reference: “Computer Systems: A Programmer’s Perspective” - Ch. 9
  2. brk/sbrk and mmap
    • When should you use mmap instead of the heap?
    • Book Reference: “The Linux Programming Interface” - Memory chapters
  3. Fragmentation and free lists
    • How do free lists reduce fragmentation?
    • Book Reference: “C Interfaces and Implementations” - Ch. 5

Questions to Guide Your Design

  1. How will you store metadata for each block?
  2. How will you coalesce free blocks safely?
  3. How will you handle multi-threaded access (optional advanced)?
  4. What is your strategy for large allocations?

Thinking Exercise

Consider a heap with blocks: [64 USED][128 FREE][256 USED][128 FREE]. If a request for 200 bytes arrives, which block should you choose and why?

The Interview Questions They Will Ask

  1. “Explain internal vs external fragmentation.”
  2. “Why do allocators use mmap for large allocations?”
  3. “What is the advantage of segregated free lists?”
  4. “How does coalescing work?”
  5. “What happens if free is called twice?”

Hints in Layers

Hint 1: Start with a simple free list
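
One possible starting point (a sketch, not the only viable layout) is a header stored immediately before each payload:

// Hedged sketch of a free-list block header; field choices are illustrative.
#include <stddef.h>

struct block {
    size_t size;            /* payload size in bytes */
    int free;               /* nonzero if the block is on the free list */
    struct block *next;     /* next block in address order (or on the free list) */
};

/* The pointer returned to the user sits just past the header. */
static void *block_payload(struct block *b) { return (void *)(b + 1); }
static struct block *payload_block(void *p) { return (struct block *)p - 1; }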

Hint 2: Add coalescing on free

Hint 3: Use mmap for large requests

void* p = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);

Hint 4: Validate with LD_PRELOAD

LD_PRELOAD=./libmymalloc.so ls

Books That Will Help

Topic Book Chapter
Allocator design “C Interfaces and Implementations” Ch. 5-6
Virtual memory “Computer Systems: A Programmer’s Perspective” Ch. 9
Memory syscalls “The Linux Programming Interface” Ch. 49

Common Pitfalls and Debugging

Problem 1: “Segfault on free”

  • Why: Metadata corruption or double free.
  • Fix: Add canaries or debug mode checks.

Problem 2: “Allocator leaks memory”

  • Why: Blocks never returned to free list.
  • Fix: Implement free and coalescing carefully.

Definition of Done

  • Supports malloc, free, calloc, realloc
  • Coalescing implemented
  • Large allocations use mmap
  • Works with LD_PRELOAD
  • Basic benchmark comparison to glibc

Project 3: Process Scheduler Visualization Tool

  • Main Programming Language: C
  • Alternative Programming Languages: Python (with perf), Rust
  • Coolness Level: Level 3: Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Scheduling and kernel internals
  • Software or Tool: /proc, ftrace, perf
  • Main Book: “Linux Kernel Development” by Robert Love

What you will build: A tool that visualizes Linux scheduling decisions in real time: run queue depth, context switches, CPU utilization, and task transitions.

Why it teaches OS internals: It makes the scheduler visible, turning abstract scheduling theory into observable data.

Core challenges you will face:

  • Parsing /proc/schedstat and /proc/stat
  • Tracing context switch events via tracefs or perf
  • Rendering time-series data and CPU mapping

Real World Outcome

$ sudo ./schedviz
CPU0: nginx(1234) -> postgres(5678) -> nginx(1234)
CPU1: kworker/1:0 -> bash(2222) -> kworker/1:0
runq depth: [3,1,2,1]
context switches/sec: 1240

The Core Question You Are Answering

“How does the kernel decide what runs next, and can I see those decisions in real time?”

Concepts You Must Understand First

  1. CFS scheduling model
    • What is vruntime and fairness?
    • Book Reference: “Linux Kernel Development” - Scheduler chapters
  2. procfs statistics
    • How does /proc/schedstat encode data?
    • Book Reference: “The Linux Programming Interface” - Procfs
  3. Tracepoints and sched_switch
    • How to enable sched_switch events?
    • Book Reference: Kernel tracing docs

Questions to Guide Your Design

  1. What is your time resolution for visualization?
  2. How will you handle multi-core synchronization in the UI?
  3. What data is most useful to show: run queue, latency, or CPU usage?

Thinking Exercise

Given two tasks with different vruntime, predict which the scheduler picks next.

The Interview Questions They Will Ask

  1. “What is CFS and why does it use vruntime?”
  2. “How do you measure context switch rates?”
  3. “What does run queue depth tell you?”
  4. “What is the difference between preemption and yield?”

Hints in Layers

Hint 1: Start by reading /proc/schedstat

Hint 2: Enable sched_switch tracing

echo 1 | sudo tee /sys/kernel/tracing/events/sched/sched_switch/enable

Hint 3: Use perf for timestamps

sudo perf trace -e sched:sched_switch -a
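
As a complement to these hints, the system-wide context switch counter is also exposed as the ctxt line of /proc/stat. A hedged polling sketch for a switches-per-second readout:

// Sketch: derive context switches/sec from the "ctxt" line of /proc/stat.
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static long long read_ctxt(void)
{
    FILE *f = fopen("/proc/stat", "r");
    if (!f)
        return -1;

    char line[256];
    long long ctxt = -1;
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "ctxt ", 5) == 0) {
            sscanf(line + 5, "%lld", &ctxt);
            break;
        }
    }
    fclose(f);
    return ctxt;
}

int main(void)
{
    long long prev = read_ctxt();

    for (;;) {
        sleep(1);
        long long now = read_ctxt();
        printf("context switches/sec: %lld\n", now - prev);
        prev = now;
    }
}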

Books That Will Help

Topic Book Chapter
Scheduling “Linux Kernel Development” Scheduler chapters
Scheduling theory “Operating Systems: Three Easy Pieces” Ch. 7-10
/proc “The Linux Programming Interface” Procfs chapters

Common Pitfalls and Debugging

Problem: “No sched_switch events”

  • Why: tracefs not mounted.
  • Fix: mount tracefs at /sys/kernel/tracing.

Problem: “Run queue depth always zero”

  • Why: Parsing /proc/schedstat incorrectly or reading the wrong CPU fields.
  • Fix: Validate against /proc/stat and test on a busy system.

Problem: “Timeline jumps or appears out of order”

  • Why: Mixing clock sources or using wall clock time.
  • Fix: Use a monotonic clock and timestamp each event at capture time.

Definition of Done

  • Shows per-CPU context switches in real time
  • Displays run queue depth
  • Correctly identifies top CPU consumers

Project 4: Page Fault Analyzer

  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Python
  • Coolness Level: Level 4: Hardcore
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 4: Expert
  • Knowledge Area: Virtual memory and tracing
  • Software or Tool: perf_event_open
  • Main Book: “Operating Systems: Three Easy Pieces” by Arpaci-Dusseau

What you will build: A tool that monitors page faults, distinguishes major and minor faults, and maps fault addresses to memory regions.

Why it teaches OS internals: Page faults are where virtual memory becomes real. This tool shows you when memory is actually loaded and why performance stalls.

Core challenges you will face:

  • Using perf_event_open to capture faults
  • Mapping fault addresses to /proc/pid/maps
  • Symbolizing addresses to source or binary regions

Real World Outcome

$ sudo ./pagefault-analyzer -p 4321
[MINOR] 0x7f2c7b100000 (libc.so.6 .text)
[MAJOR] 0x400000 (binary .text) 12.4ms
[MINOR] 0x7fff5a2b3000 (stack growth)
Summary: major=12, minor=542

The Core Question You Are Answering

“Where do page faults happen, and how do they map to the program’s memory layout?”

Concepts You Must Understand First

  1. Page fault types
    • Minor vs major vs COW
    • Book Reference: “Operating Systems: Three Easy Pieces” - VM chapters
  2. /proc/pid/maps
    • Mapping addresses to regions
    • Book Reference: “The Linux Programming Interface” - Procfs chapters
  3. perf_event_open
    • Counting vs sampling events
    • Book Reference: perf_event_open(2) man page

Questions to Guide Your Design

  1. How will you link a fault address to a symbol or file?
  2. How will you handle shared libraries vs heap vs stack?
  3. What sampling rate gives you useful data without overhead?

Thinking Exercise

The “Fault Classification” Exercise

You have two traces:

  • Program A: 1,000 minor faults, 0 major faults
  • Program B: 10 minor faults, 200 major faults

Which program is likely stalled on disk I/O and why? What part of the memory system explains the difference?

The Interview Questions They Will Ask

  1. “Explain the difference between major and minor faults.”
  2. “Why do memory-mapped files generate faults?”
  3. “How does the kernel resolve a page fault?”

Hints in Layers

Hint 1: Start with perf_event_open in counting mode

Hint 2: Add sampling and read the ring buffer

Hint 3: Parse /proc/pid/maps to label addresses

Books That Will Help

Topic Book Chapter
Virtual memory “Computer Systems: A Programmer’s Perspective” Ch. 9
Page faults “Operating Systems: Three Easy Pieces” VM chapters
perf_event_open Linux man-pages perf_event_open(2)

Common Pitfalls and Debugging

Problem: “No page faults captured”

  • Why: perf_event_paranoid is too strict.
  • Fix: Adjust /proc/sys/kernel/perf_event_paranoid.

Problem: “Fault addresses do not map to any region”

  • Why: /proc/pid/maps snapshot is outdated or the process exited.
  • Fix: Refresh maps periodically and handle process exit events.

Problem: “All faults look like minor faults”

  • Why: The workload is fully cached or you are not distinguishing major/minor tracepoints.
  • Fix: Ensure you subscribe to the correct tracepoints and test with uncached file reads.

Definition of Done

  • Captures major and minor faults
  • Maps faults to memory regions
  • Produces summary statistics

Project 5: Linux Kernel Module - Character Device Driver

  • Main Programming Language: C
  • Alternative Programming Languages: None (kernel C only)
  • Coolness Level: Level 4: Hardcore
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 4: Expert
  • Knowledge Area: Kernel modules and device drivers
  • Software or Tool: Linux kernel headers, kbuild
  • Main Book: “Linux Kernel Development” by Robert Love

What you will build: A character device driver /dev/mydevice that implements a ring buffer with read/write/ioctl.

Why it teaches OS internals: You cross the user/kernel boundary and write code that runs in kernel mode, managing memory, synchronization, and interfaces.

Core challenges you will face:

  • Building out-of-tree modules with kbuild
  • Implementing file_operations safely
  • Handling concurrent access with locks
  • Copying data safely between user and kernel

Real World Outcome

$ make
$ sudo insmod mydevice.ko
$ echo "hello" > /dev/mydevice
$ cat /dev/mydevice
hello
$ ./myctl --stats
buffer_size=4096 bytes, used=5 bytes, read_pos=0, write_pos=5

The Core Question You Are Answering

“How does the kernel expose a device to user space, and how do I safely implement that interface?”

Concepts You Must Understand First

  1. file_operations and VFS
    • How read/write dispatch works
    • Book Reference: “The Linux Programming Interface” - file I/O
  2. copy_to_user/copy_from_user
    • Why user pointers are unsafe
    • Book Reference: Kernel uaccess docs
  3. kbuild module build process
    • How external modules are built
    • Book Reference: Kernel build docs

Questions to Guide Your Design

  1. How will you handle multiple readers/writers?
  2. Will reads block when the buffer is empty?
  3. What ioctl commands will you support?

Thinking Exercise

The “Ring Buffer Race” Exercise

Two user processes write concurrently to the device. Sketch the ring buffer state after interleaved writes and explain what locking is required to prevent corruption.

The Interview Questions They Will Ask

  1. “Why are kernel modules dangerous?”
  2. “What is the role of file_operations?”
  3. “How do you safely copy data between user and kernel?”
  4. “What is the difference between spinlock and mutex?”

Hints in Layers

Hint 1: Create a minimal module with module_init and module_exit (a sketch follows these hints)

Hint 2: Register a char device with alloc_chrdev_region

Hint 3: Implement read/write and use copy_to_user

Hint 4: Use dmesg for debugging

dmesg -w
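
Pulling Hints 1-3 together, a minimal sketch of a loadable module with a read handler built on copy_to_user. To stay short it registers through the misc-device helper (which creates /dev/mydevice for you) instead of alloc_chrdev_region + cdev_add; function names are illustrative, and the ring buffer, write path, and locking are left for you.

#include <linux/module.h>
#include <linux/miscdevice.h>
#include <linux/fs.h>
#include <linux/uaccess.h>

static const char msg[] = "hello from mydevice\n";

static ssize_t mydev_read(struct file *f, char __user *buf,
                          size_t count, loff_t *ppos)
{
    size_t avail = sizeof(msg) - 1;

    if (*ppos < 0 || (size_t)*ppos >= avail)
        return 0;                          /* EOF */
    if (count > avail - (size_t)*ppos)
        count = avail - (size_t)*ppos;
    if (copy_to_user(buf, msg + *ppos, count))
        return -EFAULT;                    /* never trust user pointers */
    *ppos += count;
    return count;
}

static const struct file_operations mydev_fops = {
    .owner = THIS_MODULE,
    .read  = mydev_read,
};

static struct miscdevice mydev = {
    .minor = MISC_DYNAMIC_MINOR,
    .name  = "mydevice",                   /* appears as /dev/mydevice */
    .fops  = &mydev_fops,
};

static int __init mydev_init(void)
{
    pr_info("mydevice: loaded\n");
    return misc_register(&mydev);
}

static void __exit mydev_exit(void)
{
    misc_deregister(&mydev);
    pr_info("mydevice: unloaded\n");
}

module_init(mydev_init);
module_exit(mydev_exit);
MODULE_LICENSE("GPL");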

Books That Will Help

Topic              | Book                       | Chapter
Kernel modules     | “Linux Kernel Development” | Module chapters
Device drivers     | “Linux Kernel Development” | Driver basics
User-kernel copies | Kernel docs                | uaccess

Common Pitfalls and Debugging

Problem: “Module fails to load”

  • Why: Kernel version mismatch or missing symbols.
  • Fix: Ensure headers match running kernel.

Problem: “Kernel crash on read”

  • Why: copy_to_user called with invalid pointer or incorrect size.
  • Fix: Validate size and return -EFAULT on error.

Problem: “Device node not found”

  • Why: udev rule missing or device major/minor not created.
  • Fix: Create /dev/mydevice manually with mknod or add a udev rule.

Definition of Done

  • Module loads and unloads cleanly
  • /dev/mydevice appears and works
  • read/write are correct and safe
  • ioctl returns valid stats
  • Stress test passes without crash

Project 6: User-space Thread Library (Green Threads)

  • Main Programming Language: C
  • Alternative Programming Languages: Rust
  • Coolness Level: Level 5: Magic
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: Concurrency and scheduling
  • Software or Tool: setjmp/longjmp or assembly context switch
  • Main Book: “Operating Systems: Three Easy Pieces” by Arpaci-Dusseau

What you will build: A cooperative threading library with its own scheduler, stacks, and context switching. The library will expose thread_create, thread_yield, thread_exit, and a scheduler_run loop.

Why it teaches OS internals: It makes you implement the mechanics of scheduling and context switching without kernel support. You will see the difference between user-level and kernel-level scheduling and understand why blocking I/O is hard for green threads.

Core challenges you will face:

  • Saving and restoring CPU registers correctly
  • Allocating, aligning, and freeing per-thread stacks
  • Designing a run queue and scheduler policy
  • Handling yields, exits, and task cleanup safely

Real World Outcome

What you will see:

  1. A demo program that interleaves output from multiple threads
  2. A scheduler statistics panel (switch counts, ready queue length)
  3. A simple cooperative server demo (optional advanced)

Command Line Outcome Example:

$ ./green-threads-demo
[thread 1] count=1
[thread 2] count=1
[thread 1] count=2
[thread 2] count=2

$ ./green-threads-demo --stats
threads=4 switches=1200 avg_slice=1.2ms

The Core Question You Are Answering

“What minimal mechanisms are required to schedule multiple tasks on one CPU without kernel threads?”

Concepts You Must Understand First

  1. Context switching
    • Which registers must be saved/restored?
    • Book Reference: “Operating Systems: Three Easy Pieces” - Concurrency chapters
  2. Stack layout and calling conventions
    • How does the stack grow and align on x86-64?
    • Book Reference: “Computer Systems: A Programmer’s Perspective” - Ch. 3
  3. Run queue design
    • How do you represent runnable vs blocked tasks?
    • Book Reference: “Operating Systems: Three Easy Pieces” - Scheduling chapters

Questions to Guide Your Design

  1. What data structure should represent the run queue (list, deque, priority queue)? A sketch follows this list.
  2. How will you handle a thread that calls a blocking syscall?
  3. Will you support preemption via signals or only cooperative yields?
  4. How will you clean up a thread’s stack after exit?
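
A sketch of one reasonable answer to question 1: a thread control block (TCB) with an intrusive singly linked list used as a FIFO run queue, which is all round-robin needs. Field and function names are illustrative, not a required layout.

#include <stddef.h>

enum thread_state { READY, RUNNING, BLOCKED, DEAD };

struct tcb {
    void              *saved_sp;     /* stack pointer saved at the last switch */
    void              *stack_base;   /* mmap'd stack, freed once the thread is DEAD */
    size_t             stack_size;
    enum thread_state  state;
    struct tcb        *next;         /* intrusive link for the run queue */
};

struct run_queue {
    struct tcb *head;                /* next thread to run */
    struct tcb *tail;                /* where yielded threads are re-enqueued */
};

/* Round robin: pop from the head to run, push to the tail on yield. */
static struct tcb *rq_pop(struct run_queue *rq)
{
    struct tcb *t = rq->head;
    if (t) {
        rq->head = t->next;
        if (!rq->head)
            rq->tail = NULL;
        t->next = NULL;
    }
    return t;
}

static void rq_push(struct run_queue *rq, struct tcb *t)
{
    t->next = NULL;
    if (rq->tail)
        rq->tail->next = t;
    else
        rq->head = t;
    rq->tail = t;
}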

Thinking Exercise

The “Saved Registers” Exercise

Assume you only save rbx, rbp, and rip. What breaks when a thread uses r12-r15? Why do you need to save callee-saved registers at minimum?

The Interview Questions They Will Ask

  1. “How does a context switch work at the register level?”
  2. “Why does each thread need its own stack?”
  3. “What are the limitations of green threads?”
  4. “How would you add preemption to a cooperative scheduler?”
  5. “What happens if a green thread blocks on I/O?”

Hints in Layers

Hint 1: Start with setjmp/longjmp

if (setjmp(t->ctx) == 0)        /* returns 0 when saving the current context */
    longjmp(next->ctx, 1);      /* resume the next thread; control returns here later */

Hint 2: Allocate stacks with guard pages using mmap

Hint 3: Build a simple round-robin scheduler

Hint 4: Add statistics to verify switching frequency
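
If the setjmp/longjmp route in Hint 1 feels fragile, the POSIX ucontext API (getcontext/makecontext/swapcontext, still available on glibc) gives the same cooperative switching with explicit per-thread stacks. A minimal sketch with two hypothetical tasks that reproduces the interleaved demo output:

#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)

static ucontext_t main_ctx, ctx_a, ctx_b;

static void task_a(void) {
    for (int i = 1; i <= 2; i++) {
        printf("[thread 1] count=%d\n", i);
        swapcontext(&ctx_a, &ctx_b);        /* yield to task_b */
    }
    swapcontext(&ctx_a, &main_ctx);         /* done: return to the scheduler */
}

static void task_b(void) {
    for (int i = 1; i <= 2; i++) {
        printf("[thread 2] count=%d\n", i);
        swapcontext(&ctx_b, &ctx_a);        /* yield back to task_a */
    }
    swapcontext(&ctx_b, &main_ctx);
}

static void make_task(ucontext_t *ctx, void (*fn)(void), ucontext_t *link) {
    getcontext(ctx);
    ctx->uc_stack.ss_sp   = malloc(STACK_SIZE);   /* per-thread stack */
    ctx->uc_stack.ss_size = STACK_SIZE;
    ctx->uc_link = link;                          /* where to go if fn returns */
    makecontext(ctx, fn, 0);
}

int main(void) {
    make_task(&ctx_a, task_a, &main_ctx);
    make_task(&ctx_b, task_b, &main_ctx);
    swapcontext(&main_ctx, &ctx_a);               /* start the first task */
    printf("scheduler: all tasks finished\n");
    return 0;
}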

Books That Will Help

Topic            | Book                                           | Chapter
Context switches | “Operating Systems: Three Easy Pieces”         | Ch. 26-28
Stack layout     | “Computer Systems: A Programmer’s Perspective” | Ch. 3
C low-level      | “C Programming: A Modern Approach”             | Ch. 24

Common Pitfalls and Debugging

Problem 1: “Threads crash on first switch”

  • Why: Stack not properly aligned or register state corrupted.
  • Fix: Ensure 16-byte stack alignment on x86-64.

Problem 2: “Threads never yield”

  • Why: Missing explicit yield points.
  • Fix: Insert yield in long-running loops.

Problem 3: “Memory leak after thread exit”

  • Why: Stack not freed.
  • Fix: Free stack after thread is marked dead.

Definition of Done

  • Can create and schedule at least 4 threads
  • Thread exit cleans up resources
  • Scheduler stats confirm round-robin behavior
  • Demo program shows interleaved output

Project 7: Interrupt Latency Profiler

  • Main Programming Language: C
  • Alternative Programming Languages: Python with bpftrace
  • Coolness Level: Level 4: Hardcore
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 4: Expert
  • Knowledge Area: Interrupts and performance
  • Software or Tool: ftrace, tracefs, or eBPF
  • Main Book: “Systems Performance” by Brendan Gregg

What you will build: A tool that measures interrupt latency and identifies which IRQs suffer from delays, with histograms and per-CPU breakdowns.

Why it teaches OS internals: It forces you to understand interrupt handlers, top/bottom halves, and how system load affects latency.

Core challenges you will face:

  • Capturing IRQ entry/exit timestamps
  • Mapping IRQ numbers to device names
  • Handling per-CPU data collection safely
  • Presenting histograms and percentile stats

Real World Outcome

What you will see:

  1. Per-IRQ latency statistics
  2. Histogram of latency buckets
  3. Alerts for latency spikes

Command Line Outcome Example:

$ sudo ./irq-latency --top 5
IRQ 16 (eth0): avg=2.3us p99=40us max=110us
IRQ 24 (nvme0): avg=1.2us p99=25us max=95us
IRQ 31 (xhci_hcd): avg=3.1us p99=60us max=140us

The Core Question You Are Answering

“How long does it take the kernel to respond to hardware interrupts under real load?”

Concepts You Must Understand First

  1. Interrupt handling path
    • Top half vs bottom half
    • Book Reference: “Linux Kernel Development” - Interrupt chapters
  2. tracefs/ftrace events
    • How to enable irq_handler_entry/exit events
    • Book Reference: Kernel tracing docs
  3. Timekeeping and timestamps
    • Choosing a stable clock source
    • Book Reference: “Systems Performance” - timing chapters

Questions to Guide Your Design

  1. How will you correlate irq_handler_entry with irq_handler_exit?
  2. Will you use ftrace, perf_event_open, or eBPF?
  3. How will you collect per-CPU latency without locks?

Thinking Exercise

Imagine IRQ 16 has average latency 2us but max latency 500us. What system conditions could explain the max value?

The Interview Questions They Will Ask

  1. “What is interrupt latency and why does it matter?”
  2. “What is the difference between top half and bottom half?”
  3. “How would you reduce IRQ latency in Linux?”
  4. “Why do we measure p99 latency rather than average?”

Hints in Layers

Hint 1: Enable irq_handler_entry and irq_handler_exit events

echo 1 | sudo tee /sys/kernel/tracing/events/irq/irq_handler_entry/enable
echo 1 | sudo tee /sys/kernel/tracing/events/irq/irq_handler_exit/enable

Hint 2: Use trace_pipe for live streaming

Hint 3: Add histogram bucketing in user space
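
Combining Hints 2 and 3, a sketch that pairs irq_handler_entry with irq_handler_exit from trace_pipe and prints the elapsed time per event. The line parsing is a deliberate simplification of the ftrace output format; it assumes tracefs is mounted at /sys/kernel/tracing with the two events already enabled, and histogram bucketing is left as the next step.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_IRQ 256

int main(void)
{
    FILE *tp = fopen("/sys/kernel/tracing/trace_pipe", "r");
    if (!tp) { perror("trace_pipe (root required?)"); return 1; }

    static double entry_ts[MAX_IRQ];   /* last entry timestamp per IRQ, in seconds */
    char line[512];

    while (fgets(line, sizeof(line), tp)) {
        char *p = strstr(line, "irq_handler_");
        if (!p || p - line < 2)
            continue;
        int is_entry = (strncmp(p, "irq_handler_entry", 17) == 0);

        /* The timestamp token sits just before ": irq_handler_...". */
        char *start = p - 2;
        while (start > line && start[-1] != ' ')
            start--;
        double ts = strtod(start, NULL);

        int irq = -1;
        char *q = strstr(p, "irq=");
        if (q)
            sscanf(q, "irq=%d", &irq);
        if (irq < 0 || irq >= MAX_IRQ)
            continue;

        if (is_entry) {
            entry_ts[irq] = ts;
        } else if (entry_ts[irq] > 0.0) {
            printf("irq=%d handler latency=%.1f us\n",
                   irq, (ts - entry_ts[irq]) * 1e6);
            entry_ts[irq] = 0.0;
        }
    }
    return 0;
}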

Books That Will Help

Topic       | Book                       | Chapter
Interrupts  | “Linux Kernel Development” | Ch. 7-8
Performance | “Systems Performance”      | Latency chapters
Tracing     | Kernel docs                | ftrace and tracefs

Common Pitfalls and Debugging

Problem 1: “No IRQ events captured”

  • Why: tracefs not mounted or permissions missing.
  • Fix: Mount tracefs and run with sudo.

Problem 2: “Latency values are zero”

  • Why: Incorrect timestamp parsing or missing exit event.
  • Fix: Ensure entry/exit pairing and use monotonic clock.

Definition of Done

  • Captures IRQ entry and exit timestamps
  • Computes avg/p99/max latency per IRQ
  • Shows per-CPU breakdown
  • Produces histogram output

Project 8: Mini Operating System Kernel

  • Main Programming Language: C and Assembly
  • Alternative Programming Languages: Rust (optional)
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 5: Master
  • Knowledge Area: OS construction
  • Software or Tool: QEMU, bootloader, assembler
  • Main Book: “Operating Systems: Three Easy Pieces” and OSDev Wiki

What you will build: A minimal OS kernel that boots, sets up paging, handles interrupts, runs basic processes, and provides a tiny shell.

Why it teaches OS internals: It integrates every concept into a functioning system. If you can boot and run your own kernel, you understand the entire OS stack.

Core challenges you will face:

  • Bootloader integration and CPU mode setup
  • Page table initialization and virtual memory
  • Interrupt handling and timer ticks
  • Process creation and context switching
  • Syscall interface and user/kernel separation

Real World Outcome

What you will see:

  1. A QEMU boot screen showing your kernel banner
  2. A simple shell with built-in commands
  3. A process list showing multiple tasks

Command Line Outcome Example:

$ qemu-system-i386 -kernel myos.bin
MyOS v0.1 booting...
> help
commands: ps, mem, echo, halt
> ps
PID 1 shell
PID 2 idle
> mem
free=12MB used=4MB

The Core Question You Are Answering

“Can I build the minimal set of mechanisms required to create a working OS?”

Concepts You Must Understand First

  1. Boot process and CPU modes
    • Real mode -> protected/long mode
    • Book Reference: OSDev Wiki
  2. Paging and virtual memory
    • Page table creation
    • Book Reference: “Computer Systems: A Programmer’s Perspective” - Ch. 9
  3. Interrupts and timer ticks
    • IDT setup and timer interrupts
    • Book Reference: “Operating Systems: Three Easy Pieces” - scheduling chapters
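
For concept 3, a sketch of the conventional 32-bit protected-mode IDT gate descriptor and the lidt load, in the shape described on the OSDev Wiki; the struct and function names are illustrative, and filling the 256 entries with real handler stubs is the actual work.

#include <stdint.h>

struct idt_entry {
    uint16_t offset_low;    /* handler address bits 0..15 */
    uint16_t selector;      /* kernel code segment selector */
    uint8_t  zero;          /* always 0 */
    uint8_t  type_attr;     /* e.g. 0x8E = present, ring 0, 32-bit interrupt gate */
    uint16_t offset_high;   /* handler address bits 16..31 */
} __attribute__((packed));

struct idt_ptr {
    uint16_t limit;         /* size of the table in bytes minus 1 */
    uint32_t base;          /* linear address of the first entry */
} __attribute__((packed));

static struct idt_entry idt[256];
static struct idt_ptr   idtp;

static void idt_load(void)
{
    idtp.limit = sizeof(idt) - 1;
    idtp.base  = (uint32_t)idt;                 /* 32-bit kernel build */
    __asm__ volatile ("lidt %0" : : "m"(idtp)); /* load before sti */
}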

Questions to Guide Your Design

  1. Will you use a multiboot-compliant loader or a custom bootloader?
  2. Will you map kernel in higher-half or identity map only?
  3. How will you implement syscalls (int 0x80, syscall, or software trap)?
  4. What is your minimum viable process abstraction?

Thinking Exercise

If your kernel crashes right after enabling paging, list the top three likely causes and how you would debug them in QEMU.

The Interview Questions They Will Ask

  1. “Explain the steps from power-on to first user process.”
  2. “How do you set up paging in a new kernel?”
  3. “Why do you need an IDT before enabling interrupts?”
  4. “What is a syscall, and how do you implement one?”
  5. “How would you debug a triple fault?”

Hints in Layers

Hint 1: Use a known loader first (GRUB via the Multiboot spec) rather than writing your own bootloader

Hint 2: Start with identity-mapped paging

Hint 3: Print to screen early (VGA text mode)

Hint 4: Use QEMU -d int,cpu_reset for debugging
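
Following Hint 3, a freestanding sketch of the earliest possible output path: writing the banner directly into the VGA text buffer at 0xB8000. It assumes a Multiboot loader has already put the CPU in 32-bit protected mode and that a small assembly stub calls kernel_main; that symbol name is just a convention.

#include <stdint.h>

static volatile uint16_t *const VGA = (uint16_t *)0xB8000;

static void vga_print(const char *s)
{
    /* Each cell is (attribute << 8) | character; 0x07 = light grey on black. */
    for (int i = 0; s[i]; i++)
        VGA[i] = (uint16_t)(0x0700 | s[i]);
}

void kernel_main(void)
{
    vga_print("MyOS v0.1 booting...");
    for (;;)
        __asm__ volatile ("hlt");   /* idle; nothing else to do yet */
}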

Books That Will Help

Topic              | Book                                           | Chapter
OS design          | “Operating Systems: Three Easy Pieces”         | Intro + Mechanisms
CPU and boot       | “Computer Systems: A Programmer’s Perspective” | Ch. 3
Low-level assembly | “The Art of Assembly Language”                 | Boot sections

Common Pitfalls and Debugging

Problem 1: “Triple fault on boot”

  • Why: Invalid IDT/GDT or bad paging setup.
  • Fix: Enable QEMU logging and verify tables.

Problem 2: “No output after boot”

  • Why: VGA memory not mapped or wrong address.
  • Fix: Identity-map VGA region early.

Problem 3: “Scheduler never runs”

  • Why: Timer interrupt not configured.
  • Fix: Configure PIT/HPET and enable interrupts.

Definition of Done

  • Kernel boots in QEMU and prints a banner
  • Page tables set up and paging enabled
  • Basic interrupts and timers work
  • Simple scheduler runs two processes
  • Syscall interface works
  • Shell can run built-in commands