Project 26: Operating System Kernel Capstone

Build a minimal x86-64 kernel that boots in QEMU, enables paging, handles interrupts, runs simple user processes, and exposes a tiny syscall interface you can reason about end-to-end.


Quick Reference

Attribute Value
Difficulty Master+
Time Estimate 3-6 months (part-time)
Main Programming Language C + x86-64 Assembly
Alternative Programming Languages Rust, Zig
Coolness Level Legendary Systems Build
Business Potential High signaling value for systems roles
Prerequisites CS:APP Ch. 1-12, OSTEP basics, ELF familiarity, GDB basics
Key Topics Boot flow, privilege rings, paging, interrupts, scheduler, syscalls

1. Learning Objectives

By completing this project, you will:

  1. Explain the full boot path from firmware/bootloader handoff to your first kernel instruction.
  2. Build and debug a deterministic boot workflow using QEMU, serial logs, and GDB.
  3. Implement 4-level paging and validate address translation behavior with controlled fault cases.
  4. Build an interrupt subsystem (IDT + handlers) and show safe control transfer into kernel mode.
  5. Implement a preemptive scheduler and reason about context-switch invariants.
  6. Design a minimal syscall boundary and safely execute user-mode programs.
  7. Build confidence with low-level failure analysis (triple fault, bad IDT, wrong page flags, deadlocks).

2. All Theory Needed (Per-Concept Breakdown)

2.1 Boot Process and Privilege Model

Fundamentals

A kernel does not “start” like a normal process. It is entered by firmware and boot infrastructure under a strict ABI contract. On modern x86-64 systems, your bootloader transitions hardware into a known state, loads your kernel image, and transfers control to a fixed entry point. From that point, you must establish execution invariants quickly: stack pointer validity, predictable segment setup, and a safe exception path.

The privilege model matters immediately. x86-64 rings define who can execute privileged instructions. Ring 0 is kernel mode; ring 3 is user mode. Your system is correct only if ring transitions happen through explicit controlled paths (interrupt gates, syscall entry, return sequences).

Mental Model Diagram

Power On
  |
  v
Firmware (UEFI/BIOS)
  |
  v
Bootloader (e.g., GRUB)
  |
  | loads kernel image + boot info
  v
Kernel entry (ring 0)
  |
  +--> set stack, descriptor tables, paging, handlers
  |
  +--> init subsystems (memory, timer, scheduler, syscall)
  |
  v
User process launch (ring 3)

How It Works (Step-by-step)

  1. Bootloader loads kernel ELF and provides memory map metadata.
  2. Kernel entry runs with minimal assumptions; first objective is “do not fault silently.”
  3. Kernel sets early console/serial output for deterministic logs.
  4. Kernel initializes CPU tables needed for safe control transfer.
  5. Kernel eventually transitions to user mode through a validated entry path.

Common Misconceptions

  • “Kernel starts at main like any C program.” Not true; entry is architecture-specific.
  • “Ring 0 vs ring 3 is just convention.” It is hardware-enforced privilege separation.
  • “If it boots once, the boot path is correct.” Boot must be repeatable and diagnosable.

Where You Apply It

  • Boot milestones 1-3
  • Interrupt setup milestones
  • User-mode transition milestone

2.2 Paging and Memory Protection

Fundamentals

Virtual memory gives each process its own address-space abstraction while the kernel manages physical frames and protection flags. On x86-64, canonical user-space translation typically uses 4-level tables (PML4 -> PDPT -> PD -> PT). Each level narrows to the final frame mapping. Permissions (present, writable, executable, user-accessible) are policy, not decoration.

Correctness depends on invariants: kernel memory should not be writable/executable by user mode, guard pages must fault predictably, and page-fault handling must preserve system liveness. Bad mappings often fail as mysterious crashes unless you log fault vectors and CR2 addresses.

Mental Model Diagram

Virtual Address
+---------+---------+---------+---------+-----------+
| PML4 ix | PDPT ix | PD ix   | PT ix   | offset    |
+---------+---------+---------+---------+-----------+
      |         |         |         |
      v         v         v         v
    PML4 --> PDPT --> PD --> PT --> Frame + offset

Protection checks happen during walk:
- Present?
- U/S (user/supervisor)?
- R/W?
- NX?

How It Works (Step-by-step)

  1. Allocate page table roots in known-safe physical memory.
  2. Map kernel text/data with strict permissions.
  3. Map kernel heap region for allocator growth.
  4. Create per-process user mappings with user bit set only where intended.
  5. On fault, inspect fault code + address and dispatch to recovery/kill policy.

Failure Modes

  • Wrong page flags: user code accesses kernel pages or crashes on legal user pages.
  • Stale TLB assumptions: mapping updates not reflected until invalidation path runs.
  • Recursive fault loops: fault handler touches invalid memory.

Minimal Concrete Example (Pseudo-structure)

PageTableEntry {
  present: 1
  writable: 1
  user: 0
  nx: 0
  frame_number: 0x12345
}

Where You Apply It

  • Milestones 4-6
  • Syscall safety checks
  • Process creation and teardown

2.3 Interrupts, Exceptions, and Safe Control Transfer

Fundamentals

Interrupts and exceptions are how hardware and CPU events asynchronously redirect execution into kernel handlers. The IDT defines entry points and semantics. You must distinguish external IRQs (timer, keyboard) from synchronous exceptions (divide-by-zero, page fault, invalid opcode).

A robust handler path preserves state, acknowledges the controller/interrupt source, and avoids unbounded work in interrupt context. Your timer interrupt will drive scheduler ticks; your page-fault handler will enforce memory policy. If these are wrong, your OS appears random when it is simply under-specified.

Mental Model Diagram

User Code -----------+
                      | exception/interrupt
Hardware IRQ ---------+--------> IDT gate ------> Kernel handler
                                          |
                                          +--> save context
                                          +--> handle event
                                          +--> schedule? (timer)
                                          +--> return to prior context

How It Works (Step-by-step)

  1. Build IDT entries for required vectors.
  2. Install low-level stubs with consistent stack-frame contract.
  3. Forward into typed handler layer (fault vs irq).
  4. For timer IRQ, increment tick and request reschedule.
  5. Return through architecture-correct sequence restoring context.

Common Misconceptions

  • “Interrupt handlers can do anything.” They should do minimal bounded work.
  • “All faults are fatal.” Some are policy-handled (copy-on-write later, lazy map, etc.).
  • “Spurious IRQs are impossible.” They exist; handlers must be defensive.

Where You Apply It

  • Milestones 3 and 7
  • Scheduler tick
  • Fault-driven VM validation

2.4 Scheduling and Syscall Boundary

Fundamentals

Scheduling is controlled resource multiplexing. A simple round-robin scheduler is enough for this project if you define clear invariants: runnable queue integrity, no lost wakeups in primitive synchronization, and deterministic state transitions (RUNNABLE, RUNNING, BLOCKED, ZOMBIE).

Syscalls are user-to-kernel transitions where trust boundaries are enforced. Kernel code must validate user pointers, lengths, and handles before acting. A tiny syscall table (write, exit, optional fork, optional exec) is enough to demonstrate this safely.

Mental Model Diagram

Ring 3 process
   |
   | syscall (number + args)
   v
Entry trampoline (ring transition)
   |
   v
Kernel dispatcher ----> syscall table ----> implementation
   |                               |
   +--> validate args              +--> update process state/resources
   |
   v
Return path (result / error) to ring 3

How It Works (Step-by-step)

  1. Define syscall ABI contract (numbers, arg registers, return codes).
  2. Implement entry trampoline and dispatcher.
  3. Validate all user-provided pointers and lengths.
  4. Integrate blocking behavior with scheduler state transitions.
  5. Return standardized error codes for predictable user behavior.

Failure Modes

  • Invalid pointer dereference in kernel path (panic/security bug).
  • Deadlock from scheduler and syscall lock ordering mistakes.
  • Inconsistent process state causing stuck tasks.

Where You Apply It

  • Milestones 6-8
  • User program runner
  • End-to-end shell demonstration

3. Project Specification

3.1 What You Will Build

Build a teaching kernel for x86-64 that boots in QEMU and demonstrates all core OS loops clearly and measurably:

  • Boot + early diagnostics
  • Memory management (frames + paging)
  • Interrupt subsystem (timer/keyboard + exceptions)
  • Scheduler and task model
  • Minimal syscall interface and user-mode transition
  • Basic file/data interface sufficient for a tiny shell flow

3.2 Functional Requirements

  1. Boot and diagnostics: Kernel boots deterministically and writes structured logs to serial.
  2. Paging: 4-level paging enabled with kernel and user separation.
  3. Interrupts: Timer and keyboard interrupts handled without losing control flow.
  4. Fault handling: Page faults and invalid opcodes produce actionable diagnostics.
  5. Scheduling: Preemptive round-robin over at least two user tasks.
  6. Syscalls: write, exit required; fork and exec optional but recommended.
  7. User mode: At least one user-space program executes and returns result.

3.3 Non-Functional Requirements

  • Reliability: Boot-to-shell flow should pass 20 repeated runs without manual intervention.
  • Observability: Every panic/fault includes vector, fault address, and current task id.
  • Determinism: Test profiles should be replayable with fixed inputs.
  • Safety: User pointers are validated before dereference in syscall handlers.

3.4 Example Usage / Output

$ make run
[boot] stage=entry arch=x86_64
[mem] pmm init frames_total=262144 frames_free=258901
[vm] paging enabled mode=4level
[int] idt ready vectors=256
[sched] timer irq online hz=100
[user] launch pid=1 name=/bin/init
[user] launch pid=2 name=/bin/echo
[sys] write pid=2 bytes=12
hello kernel
[sched] tick=120 context_switches=87
[ok] system stable

3.5 Data Formats / Protocols

  • Boot info contract: memory map entries with region type and physical ranges.
  • Process control block (PCB): pid, state, kernel stack pointer, address-space root, runtime stats.
  • Syscall contract: (number, arg0..argN) -> result_or_error.

3.6 Edge Cases

  • User passes invalid pointer to write.
  • Double fault from bad handler stack.
  • Timer IRQ during scheduler critical section.
  • Out-of-memory during page table expansion.
  • Zombie accumulation due to missing wait/reap path.

3.7 Real World Outcome

This project gives you a complete machine-level narrative from boot to user process execution. When successful, you can:

  • Boot a kernel image in QEMU with repeatable logs.
  • Trigger and diagnose controlled faults without losing traceability.
  • Prove user/kernel isolation with negative tests.
  • Demonstrate scheduling behavior with measurable context-switch stats.

3.7.1 How to Run (Copy/Paste)

$ make clean && make
$ make run
$ make run-gdb
$ make test-boot
$ make test-vm
$ make test-syscall

3.7.2 Golden Path Demo (Deterministic)

Scenario: boot kernel, spawn two user tasks, run 200 timer ticks, exit cleanly.
Expected: no panic, at least 2 tasks scheduled, sys_write output appears, summary metrics printed.

3.7.3 CLI Transcript (Reference)

$ make run
[boot] entry ok
[int] idt loaded
[vm] kernel/user split active
[sched] started
[user] pid=1 /bin/init
[user] pid=2 /bin/echo
hello kernel
[metrics] ticks=200 ctx_switch=146 faults=0
[exit] clean shutdown

4. Solution Architecture

4.1 High-Level Design

+----------------------+      +---------------------------+
| Boot + Early Init    |----->| Core Kernel Services      |
| - entry              |      | - memory manager          |
| - serial logger      |      | - interrupt manager       |
+----------------------+      | - scheduler               |
          |                   | - syscall dispatcher      |
          v                   +-------------+-------------+
+----------------------+                     |
| Process Runtime      |<--------------------+
| - user tasks         |          transitions + interrupts
| - kernel tasks       |
+----------------------+

4.2 Key Components

Component Responsibility Key Decisions
Boot/Entry Establish minimal safe execution context Fail fast with explicit serial diagnostics
PMM (physical memory manager) Manage free frames Bitmap or freelist, deterministic allocation strategy
VMM (virtual memory manager) Build/maintain page mappings Strict permission defaults, explicit mapping APIs
Interrupt Manager IDT setup and handler dispatch Keep handler critical sections short
Scheduler Select runnable task each tick Round-robin first, policy pluggable later
Syscall Layer User-kernel boundary Validate all user inputs before access

4.3 Data Structures (No Full Code)

TaskControlBlock {
  pid: integer
  state: RUNNABLE|RUNNING|BLOCKED|ZOMBIE
  kernel_stack_top: address
  page_table_root: physical_address
  runtime_ticks: integer
}

RunQueue {
  tasks: circular_list<pid>
}

PageMappingRequest {
  virtual_start: address
  physical_start: address
  length_bytes: integer
  flags: {present,writable,user,nx}
}

4.4 Algorithm Overview

Round-Robin Scheduler

  1. On timer tick, save current context.
  2. If current task still runnable, push it to queue tail.
  3. Pop next runnable task from queue head.
  4. Switch address space if needed.
  5. Restore context and resume.

Complexity Analysis

  • Time: O(1) enqueue/dequeue with circular structure
  • Space: O(n) for task metadata

5. Implementation Guide

5.1 Development Environment Setup

# Toolchain and emulator (example names; use distro equivalents)
$ toolchain-check --target x86_64-elf
$ qemu-system-x86_64 --version
$ gdb --version

5.2 Project Structure

kernel/
  boot/
  arch/x86_64/
  mm/
  interrupts/
  sched/
  syscall/
  user/
  tests/
  scripts/

5.3 The Core Question You Are Answering

“What is the smallest coherent system that can safely multiplex hardware for user programs without collapsing under faults?”

5.4 Concepts You Must Understand First

  1. x86-64 privilege transitions
    • How does ring 3 enter ring 0?
    • What state must be saved/restored?
  2. Page table semantics and fault codes
    • Which bits enforce access policy?
    • How do fault error codes guide handling?
  3. Interrupt discipline
    • Why keep handlers short?
    • What is safe vs unsafe in interrupt context?
  4. Scheduler state machine
    • Which transitions are legal?
    • How do you prevent lost wakeups?

5.5 Questions to Guide Your Design

  1. What are your kernel invariants immediately after boot?
  2. What diagnostic fields must every panic line include?
  3. Which mappings must exist before enabling interrupts?
  4. How do you prove user pointers are validated in syscalls?
  5. What test demonstrates preemption is working rather than cooperative yielding?

5.6 Milestone Plan

  1. Bring-up and logging
    • Done when boot prints deterministic stage markers.
  2. Memory foundations
    • Done when paging is enabled and kernel heap allocations are stable.
  3. Interrupt subsystem
    • Done when timer ticks increment and keyboard events are handled.
  4. Scheduling
    • Done when two tasks alternate under timer preemption.
  5. Syscall boundary
    • Done when user task prints through kernel write path safely.
  6. Stability pass
    • Done when scripted tests pass repeatedly with no nondeterministic crashes.

6. Testing Strategy

6.1 Test Layers

  • Boot smoke tests: image boots and emits expected checkpoints.
  • VM tests: valid mappings succeed, invalid access faults with expected code.
  • Interrupt tests: timer and keyboard IRQ counts progress as expected.
  • Scheduler tests: measurable context switching under load.
  • Syscall tests: valid and invalid pointer scenarios.

6.2 Deterministic Test Harness

  • Use fixed boot args and deterministic workload loops.
  • Emit machine-parseable logs ([component] key=value).
  • Keep a golden baseline and compare deltas by test profile.

6.3 Fault Injection

  • Invalid syscall number
  • User pointer to unmapped page
  • Forced page permission mismatch
  • Intentional divide-by-zero in user task

Expected result: controlled fault path and useful diagnostics, not silent reboot loops.


7. Common Pitfalls and Debugging

Problem 1: “QEMU resets immediately”

  • Why: likely triple fault due to invalid IDT/GDT or bad stack state.
  • Fix: enable verbose exception logging and verify early table pointers.
  • Quick test: boot with minimal interrupt setup and progressively re-enable handlers.

Problem 2: “Paging on -> instant crash”

  • Why: missing identity mapping for currently executing region or bad CR3 root.
  • Fix: verify current instruction fetch region remains mapped across transition.
  • Quick test: print expected virtual->physical map of kernel text before enabling paging.

Problem 3: “Timer IRQ fires but scheduler stalls”

  • Why: task-state transitions or queue mutation not atomic.
  • Fix: define legal transition table and assert on illegal transitions.
  • Quick test: run two CPU-bound tasks and check increasing context-switch counter.

Problem 4: “Syscall causes kernel panic”

  • Why: unvalidated user pointer dereferenced in kernel.
  • Fix: centralize pointer validation and reject/return error consistently.
  • Quick test: call syscall with null/invalid/high addresses and verify graceful failure.

Problem 5: “Intermittent deadlock under load”

  • Why: lock ordering mismatch between interrupt path and scheduler path.
  • Fix: define lock hierarchy and enforce with debug assertions.
  • Quick test: stress timer frequency with concurrent syscalls for 5+ minutes.

8. Interview Questions They Will Ask

  1. Why does a kernel need separate interrupt and syscall entry paths?
  2. What information in a page-fault error code is most useful for triage?
  3. How does a scheduler guarantee fairness in simple round-robin?
  4. Why is user-pointer validation a security boundary, not just a correctness check?
  5. What typically causes triple faults in early kernel bring-up?
  6. How would you evolve this design toward SMP support?

9. Extension Ideas

  1. Add copy-on-write fork semantics.
  2. Introduce demand paging with lazy frame allocation.
  3. Add a virtual filesystem layer and simple in-memory FS implementation.
  4. Support multi-core scheduling (SMP) with per-core run queues.
  5. Add tracing hooks for syscall latency and IRQ service time.
  6. Port the design to a second architecture profile and compare abstractions.

10. Definition of Done

  • Kernel boots in QEMU and reaches user task execution consistently.
  • Paging and user/kernel separation are active and validated by negative tests.
  • Timer and keyboard interrupts are handled with stable logging.
  • Scheduler preempts tasks and reports context-switch metrics.
  • Syscalls write and exit function with pointer validation.
  • Fault paths (page fault, invalid opcode) produce actionable diagnostics.
  • Test suite passes on repeated runs without nondeterministic failure.
  • Design tradeoffs and limitations are documented in README notes.

11. References and Reading Map

  • Computer Systems: A Programmer’s Perspective (3rd Edition) - Ch. 3, 8, 9, 12
  • Operating Systems: Three Easy Pieces - Virtual memory + scheduling chapters
  • OSDev Wiki - Boot flow, IDT, paging reference material
  • Intel 64 and IA-32 Architectures Software Developer Manuals - privileged architecture details
  • MIT xv6 notes (for architecture intuition, not copy/paste)

Recommended sequence:

  1. CS:APP Ch. 8-9 refresher
  2. OSTEP memory + scheduling
  3. Intel manual sections relevant to IDT/paging/syscall transition
  4. Build milestone 1-3 first, then return to advanced chapters as needed