Operating Systems from First Principles
Goal: Build a precise mental model of how an operating system turns raw hardware into a reliable, multi-program environment. You will understand the kernel/user boundary, process and thread scheduling, virtual memory, storage, and I/O, and you will know why each abstraction exists. You will implement core subsystems (boot flow, allocator, scheduler, syscall interface, file system, shell, containers) and validate them with real traces and tests. By the end, you can trace any syscall end to end, reason about performance and security trade-offs, and design small kernels and OS subsystems from scratch.
Introduction
Operating systems are the privileged software layer that virtualizes hardware resources, enforces isolation, and exposes stable abstractions (processes, files, sockets, memory) to user programs. In practice, the OS is both a resource manager and a contract: it decides who gets CPU time, how memory is mapped, how data persists, and how devices are accessed.
What you will build (by the end of this guide):
- A bootable bare-metal “Hello” that runs without any OS
- A bootloader that loads a kernel and switches CPU modes
- Core OS simulators (scheduler, virtual memory, allocator)
- User-space OS tools (shell, syscall tracer, IPC tools)
- A file system in a disk image (mountable via FUSE)
- A minimal container runtime (namespaces + cgroups)
- A kernel module and a network stack exploration project
Scope (what is included):
- x86-64 Linux and Unix-style APIs as the primary learning target
- Core OS abstractions: processes, threads, memory, files, devices, IPC
- Practical implementations and simulators you can run and test
Out of scope (for this guide):
- Building a full production kernel for real hardware
- GUI subsystems, window servers, and graphics stacks
- Deep device-specific driver development beyond basic modules
The Big Picture (Mental Model)
Firmware -> Bootloader -> Kernel -> User Space
| | | |
| | | +--> Processes, shells, servers
| | +--> Scheduler, VM, FS, IPC, drivers
| +--> Load kernel, set CPU mode, pass boot info
+--> Initialize hardware, provide boot services
Key Terms You Will See Everywhere
- Kernel: The privileged core that manages CPU, memory, devices, and protection.
- System call: The controlled entry from user space into the kernel.
- Process: A running program with its own address space and resources.
- Virtual memory: The per-process illusion of a private address space.
- File system: The abstraction that turns blocks into named, durable files.
- Interrupt: An asynchronous signal from hardware that preempts execution.
How to Use This Guide
- Read the Theory Primer first. It is your mini-book and mental model.
- Then do projects in order or follow a learning path that matches your goals.
- For low-level projects, always use QEMU or a VM. Never test kernel code on your host OS.
- Each project has a Definition of Done. Treat it as a contract.
- Keep a lab notebook. Write down invariants, surprises, and failures.
- Revisit the primer after each project. You will understand more the second time.
Prerequisites & Background Knowledge
Essential Prerequisites (Must Have)
Programming Skills:
- Confident C programming (pointers, structs, memory layout, bitwise ops)
- Comfort with the command line, Make, and basic toolchains
- Ability to read and write basic assembly (x86-64 syntax preferred)
Computer Architecture Fundamentals:
- CPU registers, stack, calling conventions, instruction execution
- Basic memory hierarchy: caches, RAM, storage
- Recommended Reading: “Computer Systems: A Programmer’s Perspective” Ch. 1-3
OS Fundamentals:
- What a process is and why system calls exist
- Basic file I/O concepts (open, read, write, close)
- Recommended Reading: “Operating Systems: Three Easy Pieces” Ch. 1-6
Helpful But Not Required
Advanced Topic:
- Detailed x86 boot and mode switching
- Can learn during: Projects 1-2
Advanced Topic:
- Kernel debugging and tracing tools
- Can learn during: Projects 8, 15, 16
Advanced Topic:
- Networking internals (TCP/IP, sockets)
- Can learn during: Project 16
Self-Assessment Questions
- Can you explain what a system call is and why user code cannot access hardware directly?
- Can you compile a C program and inspect its symbols with nm or objdump?
- Can you explain the difference between virtual and physical memory?
- Can you read simple assembly and understand stack usage?
- Can you use GDB to set breakpoints and inspect memory?
If you answered “no” to questions 1-3, spend 1-2 weeks on OSTEP and CS:APP before starting.
Development Environment Setup
Required Tools:
- Linux host or VM (Ubuntu 22.04+ or Debian 12+ recommended)
- GCC or Clang, Make, binutils (ld, objdump, nm)
- QEMU (x86_64) for boot and kernel testing
- GDB for debugging (plus gdb-multiarch if needed)
Recommended Tools:
- NASM or GAS for assembly
- strace, ltrace, perf, bpftrace for tracing
- dd, hexdump, xxd, objdump for binary inspection
- FUSE development headers for the file system project
Time Investment
- Foundations and early projects (1-6): 8-12 weeks
- Mid-stack projects (7-12): 10-14 weeks
- Advanced projects (13-16): 8-12 weeks
- Total: 6-9 months at a steady pace
Important Reality Check
OS work is detail-heavy and error-prone. You will spend real time debugging boot loaders, memory corruption, and deadlocks. The payoff is deep systems intuition that transfers to every other area of software.
Big Picture / Mental Model
At runtime, the OS sits between user programs and hardware. It multiplexes CPU time, virtualizes memory, translates I/O into device commands, and enforces security boundaries. A useful mental model is to think in terms of three loops: CPU scheduling, memory translation, and I/O completion.
User Program
| syscalls
v
+--------------------+
| Kernel |
| - Scheduler |<-- timer interrupts
| - Virtual Memory |<-- page faults
| - File System |<-- block I/O
| - IPC |<-- pipes/sockets
| - Drivers |<-- device interrupts
+--------------------+
| privileged ops
v
Hardware (CPU, RAM, Disk, NIC, GPU)
Key invariants:
- User code never touches kernel memory directly.
- Every syscall validates user inputs.
- Every runnable thread eventually runs (fairness).
- Storage updates either succeed or fail safely across crashes.
Theory Primer
C1: Bootstrapping and CPU Modes
Fundamentals
Bootstrapping is the process of turning powered-off hardware into a running kernel. When the CPU resets, it starts executing firmware code (BIOS or UEFI), not your OS. The firmware initializes hardware just enough to load a bootloader. The bootloader then loads the kernel image into memory, sets up a stack and CPU mode, and jumps to the kernel entry point. On x86, the CPU begins in 16-bit real mode with severe limits; modern kernels require 32-bit protected mode and eventually 64-bit long mode with paging enabled. Understanding this flow is critical because every later OS feature assumes the boot stage has established a correct memory map, stack, and CPU state. If boot is wrong, everything fails.
Deep Dive into the Concept
Bootstrapping starts at the CPU reset vector. On legacy BIOS systems, the firmware runs a power-on self-test, enumerates devices, and then loads the first 512-byte sector of a bootable disk to memory at 0x7C00. That sector is a tiny program with a magic signature (0x55AA) at the end. Because 512 bytes is tiny, real systems use a multi-stage boot: stage 1 loads stage 2, and stage 2 loads the kernel. In UEFI systems, the firmware can read filesystems directly and load a PE/COFF executable from an EFI System Partition, but the core idea is the same: a small, trusted loader prepares the environment for the kernel.
In real mode, the CPU uses segmented addressing and only has access to the first 1 MB of memory. Code must fit and use 16-bit instructions. A key early task is enabling the A20 line so memory addresses above 1 MB work correctly. The bootloader then builds a Global Descriptor Table (GDT) and sets the PE bit in CR0 to enter 32-bit protected mode. This is not just a flag flip: you must perform a far jump to flush the instruction pipeline and load the new code segment. From there, the bootloader can set up a flat memory model where segments map to the same linear address space.
Modern 64-bit kernels require long mode. That transition requires enabling Physical Address Extension (PAE), building identity-mapped page tables, setting CR3 to point at them, enabling paging (CR0.PG), and then setting the long mode bit in the EFER MSR. Only then can the CPU execute 64-bit instructions. This sequence is fragile: if page tables are incorrect or not aligned, the CPU will triple fault and reset. That is why many teaching kernels begin in 32-bit mode and only later switch to 64-bit after bringing up minimal diagnostics.
A bootloader also handles disk I/O. In BIOS mode, this means using INT 0x13 services to read sectors, which can be CHS or LBA. The loader must know how much to read and where to put it. That implies a kernel image format: raw binary, or a structured format like ELF. If you use ELF, the loader reads the ELF header, copies each loadable segment into memory, and then jumps to the entry address. This is the same mechanism used by modern OS loaders, just simplified.
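To make the ELF step concrete, here is a minimal C sketch of the copy loop a stage-2 loader might run. It assumes a 64-bit ELF image already sits in memory, that segment target addresses are safe to write, and that memcpy/memset are available (a freestanding loader supplies its own); the struct fields follow the ELF specification, but load_elf itself is an illustrative name.

#include <stdint.h>
#include <string.h>
/* Just the ELF64 header fields a loader needs. */
typedef struct {
    uint8_t  e_ident[16];
    uint16_t e_type, e_machine;
    uint32_t e_version;
    uint64_t e_entry, e_phoff, e_shoff;
    uint32_t e_flags;
    uint16_t e_ehsize, e_phentsize, e_phnum, e_shentsize, e_shnum, e_shstrndx;
} Elf64_Ehdr;
typedef struct {
    uint32_t p_type, p_flags;
    uint64_t p_offset, p_vaddr, p_paddr, p_filesz, p_memsz, p_align;
} Elf64_Phdr;
#define PT_LOAD 1
/* Copy each loadable segment to its target address; return the entry point. */
void *load_elf(const uint8_t *image) {
    const Elf64_Ehdr *eh = (const Elf64_Ehdr *)image;
    const Elf64_Phdr *ph = (const Elf64_Phdr *)(image + eh->e_phoff);
    for (uint16_t i = 0; i < eh->e_phnum; i++) {
        if (ph[i].p_type != PT_LOAD) continue;
        memcpy((void *)ph[i].p_vaddr, image + ph[i].p_offset, ph[i].p_filesz);
        /* BSS: the part of memsz beyond filesz must be zeroed. */
        memset((void *)(ph[i].p_vaddr + ph[i].p_filesz), 0,
               ph[i].p_memsz - ph[i].p_filesz);
    }
    return (void *)eh->e_entry;
}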
Early kernel code is freestanding: there is no libc, no dynamic memory, and no interrupts yet. The kernel must set up its own stack, zero the BSS, and establish a minimal console or serial output so you can debug. Only after the kernel initializes interrupt handlers, memory management, and device drivers does it become a full OS. This early phase is where most boot bugs live, and the only visibility you have is whatever you print (VGA memory, serial, or QEMU debug).
Boot is also where hardware differences first appear. BIOS vs UEFI, legacy PIC vs APIC, and different memory maps change what the kernel must do. UEFI provides services for memory maps and drivers, while BIOS expects you to work with legacy interfaces. Real kernels parse boot parameters (command line, memory map, initrd location) and validate them before continuing. That validation step is a miniature trust boundary: your kernel must assume firmware or the bootloader can be wrong. In this guide, you build small, controlled boot paths to internalize the invariants: you must know exactly where code is loaded, which CPU mode is active, and how control passes to the kernel. Once you trust the boot path, everything else in the OS becomes explainable and debuggable.
How this fits into the projects
- Project 1 builds a raw boot sector and forces you to work in real mode.
- Project 2 implements a multi-stage bootloader and mode switch.
- Project 14 (xv6 study) lets you inspect a real teaching OS boot path.
Definitions & key terms
- Firmware: The first code that runs after reset (BIOS or UEFI).
- Bootloader: A small program that loads and starts the kernel.
- Real mode: 16-bit x86 mode with segmented addressing and 1 MB limit.
- Protected mode: 32-bit mode with privilege levels and memory protection.
- Long mode: 64-bit mode enabled via paging and EFER MSR.
- GDT: Global Descriptor Table defining memory segments.
- A20 line: Gate that allows access above 1 MB in legacy systems.
- ELF: Executable and Linkable Format used by many Unix kernels.
Mental model diagram
Power -> Firmware -> Stage1 -> Stage2 -> Kernel
| | | |
| | +--> load ELF segments
| +--> read sectors from disk
+--> hardware init, boot order
How it works (step by step)
- CPU resets and begins executing firmware.
- Firmware loads boot sector or EFI loader.
- Bootloader loads kernel image into memory.
- Bootloader sets CPU mode and stack.
- Bootloader jumps to kernel entry.
- Kernel initializes memory, interrupts, and devices.
Minimal concrete example
; 16-bit boot sector (conceptual)
org 0x7C00
mov ah, 0x0E
mov al, 'H'
int 0x10 ; BIOS teletype output
jmp $
times 510-($-$$) db 0
dw 0xAA55
Common misconceptions
- “The kernel starts in 64-bit mode by default.” (It does not on BIOS systems.)
- “The bootloader is optional.” (You always need something to load the kernel.)
- “Switching CPU modes is just a flag.” (It requires correct tables and jumps.)
Check-your-understanding questions
- Why must the boot sector end with 0x55AA?
- What is the purpose of the A20 line?
- Why do you need a far jump after enabling protected mode?
Check-your-understanding answers
- The BIOS checks for 0x55AA to decide the sector is bootable.
- It enables access above 1 MB; without it, addresses wrap.
- The far jump reloads CS and flushes the pipeline for the new mode.
Real-world applications
- Secure boot chains in laptops and servers
- Embedded systems with custom boot ROMs
- Virtual machine firmware (QEMU/OVMF) boot paths
Where you’ll apply it
- Projects 1, 2, 14
References
- Operating Systems: Three Easy Pieces, Ch. 36
- Write Great Code, Vol 2, Ch. 3
- The Art of Assembly Language, Ch. 13
- Intel 64 and IA-32 SDM, Vol 3 (system programming)
Key insights
Boot is a series of strict invariants; if you control the invariants, you control the system.
Summary
Bootstrapping is the forced path from reset to a running kernel. It is constrained, fragile, and deterministic. Understanding it gives you the first mental model of how hardware becomes software.
Homework/Exercises to practice the concept
- Draw the memory map from 0x0000 to 0x20000 during boot.
- Modify the boot sector to print a second line using VGA memory.
- Implement a two-stage loader that reads 4 sectors.
Solutions to the homework/exercises
- Include a 0x7C00 boot sector and show reserved regions like IVT/BDA.
- Use 0xB8000 and increment by 2 bytes per character.
- Stage 1 reads N sectors into 0x1000, then jumps to stage 2.
C2: Kernel Boundary and System Call Interface
Fundamentals
The kernel boundary is the line that separates trusted code from untrusted code. User programs run in an unprivileged CPU mode and cannot directly access devices, page tables, or other processes’ memory. Instead, they request services through system calls. A system call is a controlled transition into the kernel, where the OS validates arguments, performs the requested action, and returns to user space. This boundary is the core safety and stability mechanism of any OS. If you understand how a syscall enters the kernel, how the kernel validates user input, and how control returns, you understand the OS contract. Every project you build later depends on this boundary behaving correctly.
Deep Dive into the Concept
Modern CPUs provide hardware support for privilege separation. On x86, user code runs at ring 3 and kernel code runs at ring 0. The transition happens through instructions like syscall/sysenter or through exceptions and interrupts. When the CPU executes a syscall instruction, it switches to kernel mode, loads a kernel stack pointer from a model-specific register (MSR), saves user context, and jumps to a well-defined entry point. That entry point is usually a hand-written assembly stub that saves registers, switches to a safe stack, and calls a C function that dispatches to the correct system call handler.
The system call ABI defines how arguments are passed. On Linux x86-64, the syscall number is placed in rax and arguments are passed in rdi, rsi, rdx, r10, r8, and r9. The return value is placed in rax, and negative values are used to signal errors. The kernel must treat all user pointers as untrusted. Functions like copy_from_user and copy_to_user exist specifically to validate memory access and handle faults if the user passes a bad pointer. This is why the syscall boundary is a security boundary: a single unchecked pointer can crash or compromise the kernel.
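To see how thin the interface is, here is a sketch of a raw write(2) wrapper for Linux x86-64 using GCC/Clang inline assembly (a minimal sketch, assuming that compiler syntax; syscall number 1 is write, and rcx/r11 are clobbered by the syscall instruction itself):

#include <stddef.h>
/* Raw Linux x86-64 write(2): number in rax; args in rdi, rsi, rdx. */
static long raw_write(int fd, const void *buf, size_t len) {
    long ret;
    __asm__ volatile (
        "syscall"
        : "=a"(ret)
        : "a"(1), "D"((long)fd), "S"(buf), "d"(len)
        : "rcx", "r11", "memory"    /* syscall overwrites rcx and r11 */
    );
    return ret;                      /* -4095..-1 encodes -errno */
}
int main(void) {
    raw_write(1, "hello from a raw syscall\n", 25);
    return 0;
}

Everything libc normally adds (errno translation, retries, cancellation points) is absent here, which is exactly the point.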
System call entry is designed to be fast but safe. The kernel entry path follows a tight sequence: save registers, switch stacks, validate arguments, dispatch to the handler, and return to user space. To reduce overhead, modern kernels expose certain data (such as time) through a vDSO page mapped into user space. But anything that touches privileged state still requires the full transition.
The boundary is also a policy boundary. The kernel decides which processes can perform which system calls through permissions, capability checks, and security modules. For example, a process might be able to read a file but not change its permissions, or it might be prevented from creating new namespaces. Every syscall handler must check these policies because the kernel is the final gatekeeper. This is why the syscall table is a central audit point in OS security.
Debugging and tracing syscalls is one of the most practical ways to understand OS behavior. Tools like strace use ptrace to intercept syscalls at entry and exit, showing you exactly which kernel services a program uses. This reveals the runtime behavior that libraries hide. When you implement a syscall wrapper in assembly, you see exactly how small the interface is and how much libc does for you.
Syscalls also interact with signals and error handling. A blocking syscall can return early with EINTR if a signal handler runs; robust programs must restart or handle partial results. The kernel may mark some syscalls as restartable so it can resume them transparently after signal handling. This behavior matters for I/O, sleep, and synchronization. Security layers such as seccomp, audit, and LSM hooks intercept syscalls as additional filters. They can allow, deny, or log calls, which adds overhead but provides observability and control in production.
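A minimal sketch of the standard EINTR idiom for a blocking read; robust programs wrap most blocking syscalls this way (or install handlers with SA_RESTART):

#include <errno.h>
#include <unistd.h>
/* Restart a read that a signal handler interrupted. */
ssize_t read_retry(int fd, void *buf, size_t len) {
    ssize_t n;
    do {
        n = read(fd, buf, len);
    } while (n < 0 && errno == EINTR);
    return n;
}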
ABI stability is another hidden constraint. Once a syscall number is exposed, it cannot change without breaking binaries. Linux therefore adds new syscalls rather than changing existing ones and maintains compatibility layers for 32-bit applications on 64-bit kernels. Some syscalls are split into families (open vs openat) to add new semantics without breaking old code. Legacy vsyscall and modern vDSO pages illustrate the balance between speed and compatibility. When you build raw syscall wrappers, you see why the ABI is small but extremely rigid.
The same boundary appears in every OS. Windows has NT syscalls, macOS has Mach traps, and Unix-like systems have POSIX system calls. The specific ABI differs, but the concepts are identical: controlled privilege transition, argument validation, resource accounting, and a return to user space. Understanding the kernel boundary is the fastest way to predict performance (syscalls are expensive), security (bad syscalls are dangerous), and portability (different OSes expose different syscall sets).
C3: Execution Model - Processes, Threads, Scheduling, and Synchronization
Fundamentals
The execution model describes how the OS creates, runs, and coordinates work. A process is a protected container for resources (address space, files, credentials). A thread is a schedulable unit of execution within a process. The OS scheduler decides which thread runs next, for how long, and on which CPU. Because threads share memory, the OS must provide synchronization primitives so that concurrent access does not corrupt state. Understanding processes, threads, scheduling, and synchronization gives you the ability to predict latency, throughput, and correctness in real systems.
Deep Dive into the Concept
A process begins when the OS creates a new address space and loads a program into it. On Unix, this often happens through fork and exec: fork creates a near-exact copy of the parent, and exec replaces the child image with a new program. This separation allows the child to adjust its environment (file descriptors, environment variables, working directory) before executing. Internally, the OS builds a Process Control Block (PCB) that stores PID, state, registers, memory map, and scheduling metadata. Every context switch updates this structure.
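A minimal fork/exec/wait sketch showing that separation; the child could remap descriptors or change directory between fork and exec:

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>
int main(void) {
    pid_t pid = fork();               /* duplicate the calling process */
    if (pid == 0) {
        execlp("ls", "ls", "-l", (char *)NULL);  /* replace child image */
        perror("execlp");             /* reached only if exec fails */
        exit(1);
    }
    int status;
    waitpid(pid, &status, 0);         /* parent blocks until child exits */
    printf("child exited with %d\n", WEXITSTATUS(status));
    return 0;
}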
Threads share the same address space but have separate stacks and registers. This allows fast communication but makes correctness harder. Preemptive scheduling means the OS can interrupt a running thread at almost any time, save its context, and run another thread. The timer interrupt drives this. A context switch is expensive: it flushes CPU pipelines, may evict cache lines, and disrupts branch prediction. The scheduler must balance fairness (everyone runs) with performance (minimize switches) and responsiveness (interactive tasks stay snappy).
Scheduling policies encode those trade-offs. FIFO (first in, first out) maximizes simplicity but can starve short jobs. Shortest Job First optimizes turnaround but is unfair. Round Robin improves responsiveness by giving each process a time slice. Multi-Level Feedback Queue (MLFQ) adapts based on observed behavior, promoting interactive jobs and demoting CPU hogs. Linux’s Completely Fair Scheduler (CFS) uses a virtual runtime to approximate ideal fairness, maintaining a red-black tree of runnable tasks. Each policy needs metrics: turnaround time, response time, waiting time, and throughput.
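As a sketch of how such policies can be compared, here is a toy Round Robin loop over made-up burst lengths; it reports per-job turnaround time and ignores context-switch cost and arrival times:

#include <stdio.h>
int main(void) {
    int remaining[] = {5, 3, 8};       /* CPU bursts in ticks (invented) */
    int n = 3, slice = 2, clock = 0, done = 0;
    int finish[3] = {0};
    while (done < n) {
        for (int i = 0; i < n; i++) {
            if (remaining[i] == 0) continue;
            int run = remaining[i] < slice ? remaining[i] : slice;
            clock += run;              /* job i occupies the CPU */
            remaining[i] -= run;
            if (remaining[i] == 0) { finish[i] = clock; done++; }
        }
    }
    for (int i = 0; i < n; i++)
        printf("job %d turnaround: %d ticks\n", i, finish[i]);
    return 0;
}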
Synchronization is the second half of the execution model. When threads share memory, they must coordinate access to avoid races. Locks provide mutual exclusion; semaphores control access to a limited resource; condition variables allow threads to sleep until a condition becomes true. These primitives rely on atomic operations like compare-and-swap. The OS provides low-level primitives and can also detect or mitigate issues like priority inversion (where a low-priority thread holds a lock needed by a high-priority thread). Deadlock is a risk when multiple locks are acquired in different orders, so the OS and your code must maintain ordering rules or use deadlock detection.
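To make the atomic foundation concrete, here is a minimal C11 spinlock sketch built on atomic exchange; real implementations add backoff, fairness, and a path that sleeps instead of spinning:

#include <stdatomic.h>
typedef struct { atomic_int held; } spinlock_t;
static void spin_lock(spinlock_t *l) {
    /* Atomically set held=1; keep looping while the old value was already 1. */
    while (atomic_exchange_explicit(&l->held, 1, memory_order_acquire))
        ;
}
static void spin_unlock(spinlock_t *l) {
    atomic_store_explicit(&l->held, 0, memory_order_release);
}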
The OS also manages thread states: running, ready, blocked, and terminated. A thread blocks on I/O or synchronization and later becomes ready. The scheduler moves ready threads onto a CPU. A well-designed system has clear invariants: a blocked thread is not runnable, only one thread owns a lock at a time, and a thread cannot run on two CPUs simultaneously. These invariants are small, but the state space is huge, which is why concurrency bugs are so hard.
Finally, the execution model touches performance isolation. The scheduler must protect interactive tasks from CPU-bound tasks and avoid starvation. In multi-core systems, there is also a NUMA dimension: where a thread runs affects memory latency. Sophisticated schedulers try to keep threads near their data while still balancing load. Your projects implement simplified versions of these ideas so you can see their trade-offs in a controlled environment.
How this fits into the projects
- Project 5 builds a scheduler simulator and forces you to measure fairness.
- Projects 6 and 7 exercise fork/exec and process relationships.
- Projects 11 and 15 use synchronization primitives to avoid races.
Definitions & key terms
- Process Control Block (PCB): Kernel data structure describing a process.
- Context switch: Saving and restoring CPU state between threads.
- Preemption: Forcibly stopping a thread to run another.
- Time slice: Fixed time the scheduler gives a thread before switching.
- Mutex: A mutual exclusion lock.
- Condition variable: A wait/notify synchronization primitive.
Mental model diagram
Ready Queue -> [Scheduler] -> CPU
^ | |
| v |
Blocked <------ I/O, locks ----+
How it works (step by step)
- A process is created and placed in the ready queue.
- The scheduler selects a thread and runs it on a CPU.
- The thread runs until it blocks, yields, or is preempted.
- The OS saves context and picks the next runnable thread.
- Synchronization primitives ensure shared memory stays consistent.
Minimal concrete example
#include <pthread.h>
#include <stddef.h>
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
int counter = 0;
void *worker(void *arg) {
    (void)arg;                        /* unused */
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);
        counter++;                    /* protected: no lost updates */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}
Common misconceptions
- “Threads are always faster than processes.” (They are often faster but risk races.)
- “A context switch is cheap.” (It is measurable and can dominate latency.)
- “Locks guarantee correctness.” (Only if you use them consistently.)
Check-your-understanding questions
- Why can Round Robin improve responsiveness compared to FIFO?
- What makes a thread blocked versus ready?
- How does priority inversion occur?
Check-your-understanding answers
- It limits how long any one process can run without yielding.
- Blocked threads wait for I/O or synchronization; ready threads can run.
- A low-priority thread holds a lock needed by a high-priority thread.
Real-world applications
- Interactive desktop scheduling vs batch workloads
- High-frequency trading systems tuned for low latency
- Web servers with thread pools and backpressure
Where you’ll apply it
- Projects 5, 6, 7, 11, 15
References
- Operating Systems: Three Easy Pieces, Ch. 4, 7-9, 26-31
- Linux Kernel Development, Ch. 4
- Advanced Programming in the UNIX Environment, Ch. 8-9
- docs.kernel.org: scheduler and CFS overview
Key insights
Scheduling and synchronization are policy choices built on small but fragile invariants.
Summary
The execution model defines how the OS creates work, runs it, and keeps it correct. Scheduling policies trade fairness against throughput, while synchronization keeps shared memory safe under preemption.
Homework/Exercises to practice the concept
- Simulate FIFO, SJF, and Round Robin for a 4-process workload.
- Build a toy lock and measure how contention changes runtime.
- Create a deadlock with two locks and then fix it by ordering.
Solutions to the homework/exercises
- Use a timeline table and compute response/turnaround by hand.
- Add timestamps and compare unlocked vs locked increments.
- Acquire locks in a single global order to prevent cycles.
C4: Virtual Memory and Paging
Fundamentals
Virtual memory gives each process the illusion of a large, private address space. The OS and hardware cooperate to translate virtual addresses into physical addresses using page tables. This abstraction provides isolation (one process cannot access another’s memory), flexibility (the OS can move pages around), and efficiency (only needed pages must be resident). Page faults are the mechanism by which the OS brings pages into memory on demand. Understanding address translation, page tables, and page faults is essential for reasoning about performance and security.
Deep Dive into the Concept
A virtual address is split into a page number and an offset. The page number indexes into a page table, which yields a physical frame number and permissions. The CPU then combines the frame number with the offset to produce a physical address. On modern systems, page tables are multi-level to reduce memory overhead. Instead of a single huge table, the OS creates intermediate tables only when needed. This allows sparse address spaces and efficient memory use.
The Translation Lookaside Buffer (TLB) is a hardware cache of page table entries. When the CPU needs to translate an address, it checks the TLB first. A TLB hit is fast; a TLB miss requires walking the page tables in memory, which is slower. A page fault is different: it occurs when the page is not present or the access violates permissions. On a page fault, the CPU traps into the kernel, which decides whether to load the page, allocate a new page, or kill the process. This trap path is one of the most important performance and correctness paths in the OS.
Demand paging uses this mechanism to load pages only when they are accessed. The OS can map a file into memory and only load the pages that are actually used. This makes large applications feasible and enables features like memory-mapped files and copy-on-write. Copy-on-write is how fork is efficient: the parent and child share the same physical pages until one writes, at which point the OS copies the page and updates the page tables.
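A small demonstration of demand paging through a memory-mapped file; the path is arbitrary, and each page is faulted in on first touch rather than at mmap time:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
int main(void) {
    int fd = open("/etc/hostname", O_RDONLY);  /* any readable file */
    if (fd < 0) return 1;
    struct stat st;
    fstat(fd, &st);
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) return 1;
    fwrite(p, 1, st.st_size, stdout);  /* first access triggers page faults */
    munmap(p, st.st_size);
    close(fd);
    return 0;
}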
When physical memory fills up, the OS must evict pages. Page replacement algorithms like FIFO, LRU, and Clock decide which pages to evict. The working set model helps explain why some workloads thrash: if the working set is larger than physical memory, the OS spends most of its time servicing page faults. The OS can also use swap space on disk to extend memory, but swapping is far slower than RAM, so a system with heavy swap tends to stall.
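To experiment with replacement policy, here is a tiny FIFO simulation over an invented reference string; swapping in LRU changes only the eviction choice:

#include <stdio.h>
#include <string.h>
int main(void) {
    int refs[] = {1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5};  /* made-up trace */
    int nrefs = 12, nframes = 3;
    int frames[3]; memset(frames, -1, sizeof frames);   /* -1 = empty */
    int next = 0, faults = 0;
    for (int i = 0; i < nrefs; i++) {
        int hit = 0;
        for (int f = 0; f < nframes; f++)
            if (frames[f] == refs[i]) { hit = 1; break; }
        if (!hit) {
            frames[next] = refs[i];        /* evict the oldest-loaded page */
            next = (next + 1) % nframes;
            faults++;
        }
    }
    printf("faults: %d of %d accesses\n", faults, nrefs);
    return 0;
}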
Virtual memory is also a protection mechanism. Each page has permissions: read, write, execute. The OS sets these to enforce isolation and security. Modern defenses like non-executable stacks (NX) and address space layout randomization (ASLR) are built on top of paging. The OS can also use huge pages (2 MB or 1 GB) to reduce TLB pressure for large, contiguous memory regions, trading off flexibility for speed.
In a simulator, you can see these concepts clearly. When you control page size, number of frames, and replacement policy, you can observe how locality of reference drives performance. A small change in access pattern can shift the miss rate dramatically. This is why understanding virtual memory is so important for systems performance work: it explains why programs slow down when they exceed cache or RAM, and why memory access patterns matter even when you have “enough” RAM.
Virtual memory also enables sharing. Shared libraries map the same physical pages into many processes, reducing memory usage. Memory-mapped files let the OS treat file I/O as paging, which simplifies applications and enables zero-copy reads. Kernel space itself is mapped into every process (usually in a protected region), which allows fast transitions but requires strict access control.
How this fits into the projects
- Project 4 builds a VM simulator and page replacement algorithms.
- Project 3 (allocator) depends on paging behavior for sbrk/mmap.
- Project 14 (xv6 study) exposes real page table code.
Definitions & key terms
- Page: Fixed-size block of memory (commonly 4 KB).
- Page table: Data structure mapping virtual pages to physical frames.
- TLB: Cache of recent translations.
- Page fault: Trap when a page is missing or access is illegal.
- Copy-on-write: Share pages until a write triggers a copy.
- Working set: Set of pages a process actively uses.
Mental model diagram
Virtual Addr -> [Page Table] -> Physical Frame -> RAM
| ^
v |
TLB (cache) ----+
How it works (step by step)
- CPU splits virtual address into page number + offset.
- TLB is checked for a cached translation.
- On miss, CPU walks page tables in memory.
- If page not present, a page fault traps to kernel.
- Kernel loads/allocates page, updates tables, resumes.
Minimal concrete example
uint32_t page_num = vaddr >> 12; // 4 KB pages
uint32_t offset = vaddr & 0xFFF;
if (!pte[page_num].present) page_fault(page_num);
phys = (pte[page_num].frame << 12) | offset;
Common misconceptions
- “Page faults are always errors.” (They are often normal demand paging.)
- “More RAM means no VM overhead.” (TLB misses still matter.)
- “TLB miss == page fault.” (A TLB miss can still be a valid page.)
Check-your-understanding questions
- Why are page tables multi-level on 64-bit systems?
- What is the difference between a TLB miss and a page fault?
- Why does locality improve performance?
Check-your-understanding answers
- A flat table would be enormous and waste memory for sparse spaces.
- A TLB miss requires a table walk; a page fault means the page is absent.
- Locality keeps recent pages in cache and reduces fault rate.
Real-world applications
- Memory-mapped databases and file I/O
- Security hardening via NX and ASLR
- High-performance systems using huge pages
Where you’ll apply it
- Projects 4, 3, 14
References
- Operating Systems: Three Easy Pieces, Ch. 13-22
- Computer Systems: A Programmer’s Perspective, Ch. 9
- Computer Architecture (Hennessy and Patterson), Ch. 5
Key insights
Virtual memory is both a performance tool and a security boundary.
Summary
Paging gives each process a private, flexible view of memory. The cost is translation overhead and page faults, which the OS manages with caching and replacement policies.
Homework/Exercises to practice the concept
- Simulate FIFO and LRU on a short access trace.
- Draw a two-level page table for a small address space.
- Measure page fault rate for a random vs sequential access pattern.
Solutions to the homework/exercises
- Use a 4-frame memory and track evictions by hand.
- Split virtual address into directory, table, and offset bits.
- Sequential access shows fewer faults due to spatial locality.
C5: Dynamic Memory Allocation and Fragmentation
Fundamentals
Dynamic memory allocation turns raw pages into flexible blocks that programs can request and free at runtime. The OS provides primitives like brk/sbrk and mmap, and user-space allocators (malloc) manage the heap on top of those primitives. The allocator must track free and used blocks, satisfy alignment requirements, and minimize fragmentation. Fragmentation comes in two forms: internal (wasted space inside allocated blocks) and external (unused gaps between blocks). Understanding allocators is essential for performance, stability, and security because memory bugs often originate here.
Deep Dive into the Concept
The simplest allocator is a bump pointer: keep a pointer to the end of the heap and move it forward for each allocation. This is fast but cannot free memory, so it only works in special cases. Real allocators maintain metadata for each block, typically a header that stores size and free/used status. Free blocks are organized in lists or trees. When a request arrives, the allocator finds a suitable free block, splits it if necessary, and returns a pointer to the payload. When memory is freed, the allocator marks the block as free and coalesces with neighboring free blocks to reduce fragmentation.
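As a sketch of those mechanics, here is a deliberately naive first-fit allocator on sbrk; it omits splitting, coalescing, alignment guarantees for the header itself, and thread safety, all of which a real allocator needs (my_malloc/my_free are illustrative names):

#include <stddef.h>
#include <unistd.h>
typedef struct header {
    size_t size;                        /* payload size */
    int free;                           /* 1 if reusable */
    struct header *next;                /* linked list of all blocks */
} header_t;
static header_t *head;
void *my_malloc(size_t size) {
    size = (size + 15) & ~(size_t)15;   /* round up to a 16-byte multiple */
    for (header_t *h = head; h; h = h->next)
        if (h->free && h->size >= size) { h->free = 0; return h + 1; }
    header_t *h = sbrk(sizeof *h + size);  /* no fit: grow the heap */
    if (h == (void *)-1) return NULL;
    h->size = size; h->free = 0; h->next = head; head = h;
    return h + 1;                       /* payload starts after the header */
}
void my_free(void *p) {
    if (p) ((header_t *)p - 1)->free = 1;  /* real allocators also coalesce */
}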
The allocator must respect alignment: most platforms require certain data types to be aligned to 8 or 16 bytes. This affects header size and block splitting. If blocks are not aligned, the CPU may raise exceptions or silently degrade performance. Many allocators use boundary tags or footers to make coalescing efficient. Others use segregated free lists, where blocks are grouped by size class to speed up allocation.
The OS provides two primary sources of memory: the heap via sbrk (contiguous growth) and memory mappings via mmap (arbitrary regions). Modern allocators often use mmap for large allocations and sbrk or mmap for the main heap. This reduces fragmentation and allows the OS to return large blocks to the system. Threaded programs complicate things: locking the whole heap on every allocation is expensive, so many allocators use per-thread caches or arenas to reduce contention.
Kernel allocators face similar problems but with different constraints. The kernel often uses slab or buddy allocators to manage fixed-size objects and page-level blocks. Slab allocators reduce fragmentation and improve cache locality by reusing objects of the same size. Buddy allocators manage pages in power-of-two sizes and can quickly split and coalesce. Understanding these approaches helps you reason about kernel memory and performance.
Fragmentation is the primary trade-off. First-fit allocation is fast but can leave small unusable gaps. Best-fit reduces wasted space but can be slow. The allocator must also consider security: use-after-free, double-free, and buffer overflows are common vulnerabilities. Many allocators add canaries, quarantines, or randomized layouts to mitigate attacks. These safety measures often trade performance for security.
Building your own allocator in a project forces you to make these trade-offs explicit. You will decide how to track metadata, how to search for free blocks, how to split and coalesce, and how to handle realloc. This experience turns malloc from a black box into a mental model you can use when debugging real programs.
In production systems, allocator behavior can be the difference between stable performance and catastrophic pauses. Many runtimes expose tuning knobs for arena counts, size classes, and background trimming. Instrumentation helps: track total allocated bytes, peak usage, fragmentation ratios, and allocation latency. These metrics let you correlate allocator choices with application behavior, which is a key skill for debugging memory-heavy services.
Allocator design also interacts with CPU caches: contiguous allocations improve locality, while scattered blocks increase cache misses and latency.
How this fits into the projects
- Project 3 builds a user-space allocator from scratch.
- Project 4 relies on understanding page sizes and alignment.
- Project 14 (xv6 study) exposes kernel allocators.
Definitions & key terms
- Heap: Region of memory used for dynamic allocation.
- Internal fragmentation: Wasted space inside allocated blocks.
- External fragmentation: Wasted space between allocated blocks.
- Free list: Data structure that tracks free blocks.
- Coalescing: Merging adjacent free blocks.
- Arena: A per-thread or per-heap allocation region.
Mental model diagram
[free block][alloc][free][free] -> split -> [free][alloc][alloc][free]
How it works (step by step)
- Request size is rounded up to alignment.
- Allocator finds a free block large enough.
- If block is larger, it is split into used + free.
- On free, allocator marks block and coalesces neighbors.
- Large requests may use mmap directly.
Minimal concrete example
struct block {
    size_t size;          /* payload size in bytes */
    int free;             /* 1 if the block is available */
    struct block *next;   /* next block in the free list */
};
Common misconceptions
- “malloc just asks the OS for memory every time.” (Allocators reuse blocks.)
- “Fragmentation is only about wasted bytes.” (It can prevent large allocs.)
- “realloc always moves memory.” (It may expand in place.)
Check-your-understanding questions
- Why must allocations be aligned?
- What is the difference between internal and external fragmentation?
- Why use mmap for large allocations?
Check-your-understanding answers
- Misalignment can break CPU access rules and slow performance.
- Internal is wasted space inside blocks; external is wasted gaps.
- It avoids large heap growth and allows returning memory to the OS.
Real-world applications
- High-performance allocators in databases and game engines
- Memory debugging tools that detect leaks and corruption
- Kernel slab caches for network buffers and file descriptors
Where you’ll apply it
- Projects 3, 4, 14
References
- Operating Systems: Three Easy Pieces, Ch. 17
- Computer Systems: A Programmer’s Perspective, Ch. 9
- C Interfaces and Implementations, Ch. 5-6
- The Linux Programming Interface, Ch. 7
Key insights
Allocators are policy engines: the data structures you choose define the trade-offs.
Summary
Dynamic allocation is the bridge between page-level memory and flexible program needs. Fragmentation, alignment, and concurrency are the core challenges.
Homework/Exercises to practice the concept
- Implement a bump allocator and measure its speed.
- Add free lists and coalescing, then test fragmentation.
- Compare first-fit and best-fit on a synthetic workload.
Solutions to the homework/exercises
- Use a single pointer and return aligned slices.
- Add headers and test with a randomized allocate/free sequence.
- Track free space and average allocation time for each strategy.
C6: Storage - File Systems and Crash Consistency
Fundamentals
File systems turn raw blocks into named, durable files. They provide directories, permissions, metadata, and a stable namespace. The OS caches file data in memory and must ensure that updates remain consistent across crashes and power loss. File systems use structures like superblocks, inodes, and allocation bitmaps to track what is stored where. Understanding file system layout and crash consistency helps you reason about persistence, performance, and reliability.
Deep Dive into the Concept
A block device exposes a linear array of fixed-size blocks. The file system organizes these blocks into metadata and data. The superblock describes the file system geometry (block size, number of blocks, free space). Inodes describe files: size, owner, permissions, and pointers to data blocks. Directories are special files that map names to inode numbers. The OS translates a path like /home/user/doc.txt by walking directories, reading inodes, and following data block pointers.
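To make the layout concrete, here is a hypothetical set of on-disk structures for a tiny file system; every field name and width here is a design choice for illustration, not a standard:

#include <stdint.h>
struct superblock {
    uint32_t magic;          /* identifies the file system */
    uint32_t block_size;     /* e.g. 4096 bytes */
    uint32_t nblocks;        /* total blocks on the device */
    uint32_t inode_start;    /* first block of the inode table */
    uint32_t bitmap_start;   /* first block of the free-space bitmap */
    uint32_t data_start;     /* first data block */
};
struct inode {
    uint32_t mode;           /* file type + permission bits */
    uint32_t size;           /* length in bytes */
    uint32_t direct[12];     /* direct data block pointers */
    uint32_t indirect;       /* block of further pointers for big files */
};
struct dirent {
    uint32_t inum;           /* inode number; 0 marks an unused slot */
    char     name[28];       /* fixed-width entry name */
};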
Free space management is central. Many file systems use bitmaps to track which blocks are free, while others use free lists or extent trees. Allocation strategies trade off locality and fragmentation. If you place file data near its inode, you improve cache locality. But if you always allocate the first free block, you may fragment the disk. This is why file systems often include heuristics or delayed allocation.
Crash consistency is the hardest part. A file system update often touches multiple blocks: directory entry, inode, data blocks, and allocation bitmap. If the system crashes mid-update, the on-disk structure can become inconsistent. Journaling solves this by writing intent to a log before making changes. After a crash, the file system replays the log to complete or roll back operations. Copy-on-write file systems (like ZFS or btrfs) avoid in-place updates entirely, writing new structures and flipping pointers atomically. Both approaches trade space and write amplification for consistency.
Caching and buffering are critical for performance. The page cache stores recently accessed file data in RAM. Write-back caches allow the OS to batch writes, improving throughput but increasing the risk window for data loss unless you call fsync. The OS also merges adjacent writes to reduce disk seeks. These caching decisions matter because the performance of a file system is often dominated by I/O scheduling and cache hit rates, not raw disk speed.
File systems must also handle permissions and durability semantics. Unix permissions (rwx for user/group/other) are a simple model, while ACLs provide more granularity. Durability is controlled by fsync and by mount options that choose between data journaling, ordered writes, and writeback modes. The OS must balance speed against safety, and it exposes those options so applications can choose.
A minimal file system project shows all of these choices. When you implement your own file system in a disk image, you choose the layout, allocation policy, and consistency guarantees. You will observe how small structural choices affect both performance and the ability to recover from crashes.
Modern storage adds another layer of complexity. SSDs reorder writes internally and use erase blocks, so the OS cannot assume on-disk ordering without explicit barriers. File systems add checksums and sequence numbers to detect silent corruption. Some systems use copy-on-write to avoid in-place updates entirely, trading write amplification for safety. These design decisions are not just academic; they determine whether your data survives power loss and how well your system performs under heavy write loads.
Checksums, scrubbers, and redundancy (such as RAID) complement file system consistency by detecting and correcting latent corruption over time.
How this fits into the projects
- Project 9 builds a file system from scratch.
- Project 10 explores device files and the “everything is a file” model.
- Project 12 builds Unix tools that depend on file semantics.
Definitions & key terms
- Superblock: Metadata describing the file system layout.
- Inode: Metadata record for a file.
- Extent: A contiguous range of blocks assigned to a file.
- Journaling: Logging updates to ensure crash recovery.
- fsync: System call that forces data to durable storage.
- Page cache: Memory cache for file data and metadata.
Mental model diagram
[superblock][inode table][bitmap][data blocks]
| | | |
+--> metadata free map file contents
How it works (step by step)
- Parse path into directory components.
- Read directory entries to find inode numbers.
- Read inode to find data block pointers.
- Read or write data blocks.
- Update metadata and journal changes if enabled.
Minimal concrete example
/ (inode 2)
|-- home (inode 10)
|-- user (inode 11)
|-- notes.txt (inode 42)
Common misconceptions
- “A file is stored in one contiguous block.” (Most files are fragmented.)
- “fsync is optional.” (Without it, data may be lost after a crash.)
- “Journaling guarantees no data loss.” (It guarantees metadata consistency.)
Check-your-understanding questions
- Why do file systems need a separate inode table?
- What problem does journaling solve?
- Why does caching improve performance but risk durability?
Check-your-understanding answers
- It separates metadata from file names and supports hard links.
- It makes multi-block updates recoverable after crashes.
- Cached writes may not reach disk immediately unless flushed.
Real-world applications
- Databases that rely on fsync for durability
- Log-structured systems optimized for SSDs
- Container images layered on top of file systems
Where you’ll apply it
- Projects 9, 10, 12
References
- Operating Systems: Three Easy Pieces, Ch. 39-45
- Understanding the Linux Kernel, Ch. 12
- The Linux Programming Interface, Ch. 14
Key insights
File systems are where correctness and performance collide; crash consistency is the core challenge.
Summary
File systems map names to blocks, maintain metadata, and protect consistency across failures. The details of allocation, caching, and journaling define both speed and safety.
Homework/Exercises to practice the concept
- Design a block layout for a 10 MB disk image.
- Implement a bitmap allocator for blocks.
- Simulate a crash mid-write and define recovery steps.
Solutions to the homework/exercises
- Reserve blocks for superblock, bitmap, inodes, and data.
- Use a bit array and scan for free bits.
- Replay a journal entry or roll back incomplete metadata.
C7: I/O Subsystem - Interrupts, Drivers, and Devices
Fundamentals
The I/O subsystem connects the OS to physical devices. Devices are slow and unpredictable compared to the CPU, so the OS relies on interrupts, buffering, and drivers to manage them. A device driver translates OS requests into hardware commands and reports completion via interrupts. The OS must schedule I/O, avoid blocking the kernel on slow devices, and provide a uniform interface to programs. Understanding interrupts and drivers is essential for explaining responsiveness, throughput, and device failures.
Deep Dive into the Concept
I/O devices are controlled through registers mapped into memory or I/O ports. A driver writes commands to these registers, the device performs work, and then it interrupts the CPU when done. The interrupt triggers a handler in the kernel, which acknowledges the device, queues any new work, and wakes waiting processes. This interrupt-driven model allows the CPU to do useful work while devices operate. Polling is an alternative where the CPU repeatedly checks device status, which is simpler but wastes cycles.
Modern systems rely on DMA (Direct Memory Access) for performance. DMA lets devices read or write memory directly without copying through the CPU. The OS sets up DMA descriptors, and the device transfers data in the background. This improves throughput but creates complexity: the OS must ensure the memory is mapped and pinned, and it must handle cache coherency on some architectures.
Drivers are often split into a fast “top half” (interrupt handler) and a slower “bottom half” or deferred work, so the OS can keep interrupts short. Linux uses mechanisms like softirqs and workqueues for this. Drivers expose devices as block devices, character devices, or network devices. Block devices (disks) have request queues and can be optimized by reordering I/O requests to minimize seeks. Character devices (serial ports) are stream-oriented and simpler.
User programs typically see devices through file descriptors and syscalls like open, read, write, and ioctl. This is the “everything is a file” model: disks, terminals, and sockets all look like files. The OS can multiplex I/O using select, poll, or epoll, letting a program wait for multiple devices at once. This is how high-performance servers handle many connections without dedicating a thread to each.
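A minimal sketch of waiting on two descriptors with select; poll and epoll follow the same idea with better scaling properties:

#include <stdio.h>
#include <sys/select.h>
#include <unistd.h>
int main(void) {
    int fds[2];
    pipe(fds);
    write(fds[1], "x", 1);              /* make the pipe readable */
    fd_set rset;
    FD_ZERO(&rset);
    FD_SET(STDIN_FILENO, &rset);
    FD_SET(fds[0], &rset);
    int maxfd = fds[0] > STDIN_FILENO ? fds[0] : STDIN_FILENO;
    if (select(maxfd + 1, &rset, NULL, NULL, NULL) > 0) {  /* block until ready */
        if (FD_ISSET(STDIN_FILENO, &rset)) printf("stdin is readable\n");
        if (FD_ISSET(fds[0], &rset))       printf("pipe is readable\n");
    }
    return 0;
}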
I/O subsystems are also where many failures occur: lost interrupts, misbehaving drivers, or slow devices can stall the system. The OS must be defensive, using timeouts and watchdogs to detect failures. In kernels, a single bad driver can crash the entire system, which is why kernel modules are carefully controlled and tested.
Your projects expose these ideas through device files, kernel modules, and network stack exploration. By simulating I/O and using tracing tools, you see how interrupt latency and buffering affect performance, and why driver correctness is as important as application correctness.
Modern kernels use hybrid techniques to balance latency and throughput. For example, network drivers may switch between interrupt-driven mode and polling (NAPI style) under heavy load to reduce interrupt storms. Storage stacks may merge and reorder requests to optimize device queues. Asynchronous I/O interfaces let applications submit many requests without blocking, improving throughput at the cost of complexity. These optimizations all rely on the same core ideas: explicit queues, careful synchronization, and clear ownership of buffers.
Latency-sensitive workloads care about interrupt latency and scheduling. Real-time systems may pin interrupts to specific CPUs, raise handler priorities, or avoid deferred work to meet deadlines. These techniques show that the I/O subsystem is not just about throughput; it is also about predictability under load.
How this fits into the projects
- Project 10 builds device files and demonstrates file-based device I/O.
- Project 15 builds a kernel module and a simple driver.
- Project 16 explores network I/O and packet handling.
Definitions & key terms
- Interrupt: Asynchronous signal from hardware to the CPU.
- DMA: Device-initiated memory transfer without CPU copying.
- Driver: Kernel code that controls a hardware device.
- Block device: Random-access device like disks.
- Character device: Stream device like a serial port.
- ioctl: Device-specific control call.
Mental model diagram
User -> syscall -> Driver -> Device
^ |
| v
wakeup interrupt
How it works (step by step)
- User program issues read/write on a device file.
- Kernel calls the driver with a request.
- Driver programs device registers or DMA.
- Device signals completion via interrupt.
- Kernel handler wakes waiting process.
Minimal concrete example
#include <fcntl.h>
#include <unistd.h>
int fd = open("/dev/ttyS0", O_RDWR);  /* serial port as a character device */
write(fd, "ping\n", 5);
Common misconceptions
- “Interrupts are always faster than polling.” (Polling can win for busy devices.)
- “Drivers are just user-space libraries.” (Drivers run in kernel mode.)
- “All devices are block devices.” (Many are streams or packets.)
Check-your-understanding questions
- Why is DMA faster than copying through the CPU?
- Why do interrupt handlers do minimal work?
- What is the difference between block and character devices?
Check-your-understanding answers
- DMA transfers data without CPU involvement and avoids extra copies.
- Long handlers block other interrupts and increase latency.
- Block devices are random access; character devices are streams.
Real-world applications
- High-throughput storage systems using NVMe
- Low-latency networking with interrupt coalescing
- Embedded devices with limited CPU budgets
Where you’ll apply it
- Projects 10, 15, 16
References
- Operating Systems: Three Easy Pieces, Ch. 36
- Linux Kernel Development, Ch. 13
- The Linux Programming Interface, Ch. 13-14
- docs.kernel.org: device drivers and interrupt handling
Key insights
I/O is the boundary between deterministic CPU code and nondeterministic devices.
Summary
The I/O subsystem coordinates devices using drivers, interrupts, buffering, and DMA. Correctness and performance depend on careful sequencing and defensive design.
Homework/Exercises to practice the concept
- Trace disk I/O for a file copy with strace and iostat.
- Write a user-space program that uses select on two file descriptors.
- Explain why a buggy driver can crash the system.
Solutions to the homework/exercises
- Observe open/read/write calls and correlate with disk stats.
- Use select to wait on stdin and a pipe simultaneously.
- Drivers run in kernel mode and can corrupt kernel memory.
C8: IPC and Composition - Pipes, Sockets, Signals, Networking
Fundamentals
Inter-process communication (IPC) lets processes exchange data and coordinate work. Unix provides multiple IPC mechanisms: pipes, signals, shared memory, message queues, and sockets. The Unix philosophy treats many of these as file descriptors, enabling composition with simple read/write operations. Networking extends IPC beyond a single machine through sockets and the TCP/IP stack. Understanding IPC is critical for building shells, servers, and scalable systems.
Deep Dive into the Concept
A pipe is a unidirectional byte stream connecting two processes. The parent creates it before fork; after the fork, each process closes the end it does not use, leaving one writer and one reader. This simple mechanism enables the classic Unix pipeline. Sockets generalize pipes: they are bidirectional endpoints that can communicate locally (Unix domain sockets) or across networks (INET sockets). Unix domain sockets behave like files, support file descriptor passing, and are fast for local IPC. INET sockets carry TCP or UDP traffic across networks and introduce reliability, congestion control, and ordering concerns.
Signals are asynchronous notifications delivered by the kernel. They are used for process control (SIGINT, SIGTERM), timers, and fault reporting (SIGSEGV). Signals interrupt a running thread and run a signal handler, so they require careful design to avoid race conditions. Shared memory is the fastest IPC method because it avoids copying, but it requires explicit synchronization to avoid races. Message queues and pipes include synchronization implicitly but add overhead and limited message structure.
The networking stack adds several layers. At the bottom, the network interface card (NIC) receives frames. The kernel processes them into IP packets, then into TCP or UDP segments, then delivers data to sockets. TCP adds reliability with sequence numbers, acknowledgments, retransmission, and congestion control. UDP is connectionless and faster but requires application-level reliability if needed. Socket APIs expose this complexity through functions like socket, bind, connect, send, and recv. The OS also provides select, poll, and epoll so a single process can manage many sockets concurrently.
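A minimal TCP echo server sketch showing the socket/bind/listen/accept sequence; the port is arbitrary and error handling is stripped for brevity:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
int main(void) {
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(9000);           /* arbitrary port */
    bind(srv, (struct sockaddr *)&addr, sizeof addr);
    listen(srv, 8);
    for (;;) {
        int c = accept(srv, NULL, NULL);   /* one client at a time */
        char buf[512];
        ssize_t n;
        while ((n = read(c, buf, sizeof buf)) > 0)
            write(c, buf, n);              /* echo bytes back unchanged */
        close(c);
    }
}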
Composition is where IPC becomes a design philosophy. By exposing uniform file descriptor interfaces, Unix allows small programs to be composed into pipelines. A shell uses fork/exec to create processes, then wires them together with pipes. The same mechanisms build web servers, log processors, and data pipelines. The kernel enforces boundaries, but IPC provides the glue for collaboration.
Performance considerations are critical. Copying data between processes can dominate runtime, so many OSes provide zero-copy mechanisms like splice and sendfile. For networking, interrupt coalescing and offload features can reduce CPU overhead but add latency. These are trade-offs that high-performance systems must understand.
Flow control and backpressure are the hidden backbone of IPC. Pipes have finite buffers, and writes can block when the buffer is full, forcing producers to slow down. Sockets add more layers: TCP uses window sizes and acknowledgments to regulate send rate, while UDP leaves flow control to the application. Non-blocking I/O and event loops (select, poll, epoll) let a single process handle many endpoints, but they require careful message framing so one stream does not starve others. These mechanics are why “simple” IPC often becomes a system design exercise.
IPC also requires framing. Pipes and TCP streams are byte streams, so applications must define message boundaries with delimiters or length prefixes. Half-closed connections (shutdown write) can signal end-of-stream without closing the read side. These details determine whether protocols are robust, especially under partial reads and writes and when clients disconnect unexpectedly.
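A sketch of length-prefixed framing over a byte stream, assuming a 4-byte big-endian length header; the helper names are illustrative, and the short-read loop is the part most toy code gets wrong:

#include <arpa/inet.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>
/* Read exactly n bytes, tolerating short reads on a stream. */
static int read_full(int fd, void *buf, size_t n) {
    char *p = buf;
    while (n > 0) {
        ssize_t r = read(fd, p, n);
        if (r <= 0) return -1;         /* EOF or error mid-message */
        p += r; n -= r;
    }
    return 0;
}
/* One frame: 4-byte big-endian length, then the payload. */
static int read_message(int fd, void *buf, size_t cap) {
    uint32_t len;
    if (read_full(fd, &len, 4) < 0) return -1;
    len = ntohl(len);
    if (len > cap) return -1;          /* refuse oversized frames */
    return read_full(fd, buf, len) < 0 ? -1 : (int)len;
}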
These edge cases are where many IPC bugs live in practice.
How this fits into the projects
- Project 7 builds a shell with pipes and redirection.
- Project 11 builds IPC mechanisms and a small chat system.
- Project 16 explores network sockets and packet flow.
Definitions & key terms
- Pipe: Unidirectional byte stream between processes.
- Socket: Communication endpoint for local or network communication.
- Signal: Asynchronous notification from the kernel.
- Shared memory: Memory region mapped into multiple processes.
- TCP: Reliable, ordered byte stream protocol.
- UDP: Connectionless datagram protocol.
Mental model diagram
Process A -> pipe/socket -> Process B
| ^
+-- signal/notify ------+
How it works (step by step)
- Create IPC endpoint (pipe, socket, shared memory).
- Fork or connect processes to the endpoint.
- Read/write data through file descriptors.
- Kernel queues data and wakes receivers.
- For sockets, kernel routes packets through TCP/IP.
Minimal concrete example
int fd[2];
pipe(fd);
if (fork() == 0) {
    dup2(fd[1], STDOUT_FILENO);  /* child: stdout -> pipe write end */
    close(fd[0]); close(fd[1]);  /* drop the originals after dup2 */
    execlp("ls", "ls", NULL);
} else {
    dup2(fd[0], STDIN_FILENO);   /* parent: stdin <- pipe read end */
    close(fd[0]); close(fd[1]);  /* an open write end would keep wc waiting */
    execlp("wc", "wc", "-l", NULL);
}
Common misconceptions
- “Signals are reliable.” (Signals can be coalesced or missed.)
- “Shared memory is always the best.” (It requires careful synchronization.)
- “Sockets are only for networks.” (Unix domain sockets are local IPC.)
Check-your-understanding questions
- Why are pipes unidirectional?
- Why does TCP need congestion control?
- Why can a signal handler be dangerous?
Check-your-understanding answers
- A single direction keeps the kernel buffer and blocking semantics simple; two-way flows just use two pipes or a socket pair.
- Without congestion control, networks collapse under load.
- It can interrupt code at unsafe points, causing races.
Real-world applications
- Shell pipelines and data processing chains
- Microservices communicating over sockets
- Event-driven servers using epoll
Where you’ll apply it
- Projects 7, 11, 16
References
- The Linux Programming Interface, Ch. 44-56
- Advanced Programming in the UNIX Environment, Ch. 14-16
- TCP/IP Illustrated, Vol 1, Ch. 1-6
- UNIX Network Programming, Vol 1, Ch. 1-6
Key insights
IPC is the OS mechanism that turns isolated processes into systems.
Summary
IPC mechanisms let processes cooperate. Pipes and sockets provide simple composition, while shared memory offers speed with higher synchronization cost.
Homework/Exercises to practice the concept
- Build a pipeline of three commands and trace their file descriptors.
- Implement a simple echo server using TCP sockets.
- Compare throughput of pipes vs shared memory for a large buffer.
Solutions to the homework/exercises
- Use `lsof` or `/proc/<pid>/fd` to inspect the pipeline.
- Create a socket, bind, listen, accept, recv, send.
- Measure copy overhead in a microbenchmark.
C9: Security, Isolation, and Virtualization
Fundamentals
The OS enforces security by isolating processes and controlling access to resources. Traditional Unix permissions (user, group, other) are the baseline, but modern systems add capabilities, ACLs, mandatory access control, and sandboxing. Virtualization extends isolation across entire OS instances using hypervisors or containers. Containers rely on namespaces and cgroups to isolate processes while sharing the kernel. Understanding these mechanisms explains how multi-tenant systems remain safe and how containers differ from virtual machines.
Deep Dive into the Concept
Access control starts with user and group IDs. The kernel checks permissions on every file access and system call that modifies privileged state. Capabilities split the all-powerful root privilege into smaller pieces, so a process can be allowed to bind to low ports without full root privileges. ACLs add fine-grained permissions beyond the simple rwx model. Mandatory Access Control systems like SELinux add a policy layer that can deny access even if traditional permissions allow it. These mechanisms form a layered defense, and each is enforced at syscall boundaries.
Namespaces provide isolation by creating separate views of system resources. PID namespaces give processes their own process tree, mount namespaces give separate filesystem views, and network namespaces give isolated network stacks. A container is a set of processes that share a kernel but live in their own namespaces. Cgroups (control groups) impose resource limits and accounting: CPU shares, memory caps, I/O throttling. Together, namespaces and cgroups are the core of containerization. They are not virtualization in the hardware sense; they are isolation within a single kernel.
Virtual machines, by contrast, emulate hardware. A hypervisor runs at a higher privilege level and presents virtual CPUs, memory, and devices to guest OSes. Type 1 hypervisors run directly on hardware; type 2 hypervisors run on a host OS. Hardware virtualization extensions (Intel VT-x, AMD-V) make this efficient by allowing the guest to run near-native speeds while trapping privileged operations to the hypervisor. The trade-off is overhead and complexity: VMs provide stronger isolation but require a full OS per guest.
Security also involves sandboxing untrusted code. Mechanisms like seccomp filter syscalls, reducing the attack surface. Combined with namespaces, seccomp can turn a normal process into a tightly constrained sandbox. This is how browsers and container runtimes limit damage from compromised processes. The kernel is the ultimate enforcement point, so kernel bugs are high-impact, and defense-in-depth is essential.
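As a minimal taste of seccomp, the original strict mode can be enabled with a single prctl call; real sandboxes use the more flexible BPF filter mode instead. A sketch (on modern glibc, getpid() here issues a real syscall, so the kernel kills the process at that line):

#include <unistd.h>
#include <sys/prctl.h>
#include <linux/seccomp.h>

int main(void) {
    prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);
    write(STDOUT_FILENO, "ok\n", 3);   /* read/write/_exit stay allowed */
    getpid();                          /* any other syscall: SIGKILL */
    return 0;                          /* never reached */
}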
Isolation is not just about security; it is also about reliability. Resource limits prevent one process from starving others. Namespaces allow multiple applications to run with conflicting configurations. Cgroups provide observability by letting you see resource usage per container or process group. When you build a minimal container in this guide, you will see how small the underlying mechanism is compared to the heavy ecosystem built on top of it.
Security in this space is not automatic. Misconfigured namespaces or overly broad capabilities can allow container escapes. Kernel attack surface grows when you enable many subsystems inside containers. This is why production systems combine cgroups, namespaces, seccomp filters, and sometimes additional isolation like user namespaces or VM-based containers. Understanding these layers helps you decide which isolation level is appropriate for a workload and why containers are fast but not always equivalent to VMs in threat models.
Virtual machines add overhead but provide stronger isolation boundaries, especially when combined with hardware-assisted virtualization and device passthrough. The trade-off is a heavier footprint and slower startup compared to containers.
How this fits on projects
- Project 10 uses file permissions and device access control.
- Project 13 builds a container using namespaces and cgroups.
- Project 15 explores kernel modules and privileged code.
Definitions & key terms
- Capabilities: Split root privileges into discrete rights.
- ACL: Access Control List for fine-grained permissions.
- Namespace: Isolated view of a system resource.
- cgroup: Resource control and accounting for process groups.
- Hypervisor: Software that runs virtual machines.
- seccomp: Syscall filtering mechanism.
Mental model diagram
Container = namespaces + cgroups + processes
VM = hypervisor + virtual hardware + guest OS
How it works (step by step)
- User requests a privileged action via syscall.
- Kernel checks UID/GID, capabilities, and policies.
- Namespace and cgroup limits are applied.
- If allowed, kernel performs the action.
- If denied, kernel returns an error (EPERM/EACCES).
Minimal concrete example
unshare(CLONE_NEWNS | CLONE_NEWPID | CLONE_NEWNET);
// Needs privilege (CAP_SYS_ADMIN). Mount and network namespaces apply to the
// caller immediately; the new PID namespace applies only to children forked
// after this call, not to the caller itself.
Common misconceptions
- “Containers are just lightweight VMs.” (They share the host kernel.)
- “Root inside a container is real root.” (It is namespaced and constrained.)
- “Capabilities are rarely used.” (They are core to modern security.)
Check-your-understanding questions
- What is the key difference between a VM and a container?
- Why do capabilities exist when root already exists?
- How do cgroups enforce memory limits?
Check-your-understanding answers
- VMs virtualize hardware; containers isolate processes in one kernel.
- Capabilities reduce the power of privileged processes to a minimum.
- The kernel tracks usage and denies allocations past the limit.
Real-world applications
- Kubernetes and container orchestration
- Multi-tenant servers with strict resource limits
- Sandboxing untrusted plugins and code
Where you’ll apply it
- Projects 10, 13, 15
References
- Operating Systems: Three Easy Pieces, Ch. 14
- Linux man-pages: namespaces(7), capabilities(7)
- docs.kernel.org: cgroup v2 documentation
- Android Open Source Project: Kernel Overview
Key insights
Isolation is enforced at syscall boundaries, and containers are policy on top of kernel primitives.
Summary
Security and virtualization are built from small, enforceable kernel mechanisms. Namespaces and cgroups isolate processes, while hypervisors virtualize hardware for stronger boundaries.
Homework/Exercises to practice the concept
- Use `unshare` to create a new PID namespace and observe process IDs.
- Create a cgroup with a memory limit and test enforcement.
- Compare a container and VM by listing running processes inside each.
Solutions to the homework/exercises
- Run `unshare -p -f --mount-proc bash` and observe PID 1 inside.
- Write a small program that allocates memory until it is killed.
- In a VM you see a full OS; in a container you see host kernel processes.
Glossary
- ABI: Application Binary Interface; calling conventions and binary formats.
- Bootloader: Program that loads the kernel and prepares CPU state.
- Context switch: Saving/restoring CPU state between threads.
- cgroup: Kernel mechanism for resource limits and accounting.
- Copy-on-write: Share memory until a write triggers a copy.
- DMA: Direct Memory Access; device transfers without CPU copies.
- ELF: Executable and Linkable Format for binaries.
- FUSE: Filesystem in Userspace; build a FS without kernel code.
- GDT: Table defining memory segments in protected mode.
- Hypervisor: Software layer that runs virtual machines.
- Inode: File metadata record in Unix file systems.
- Interrupt: Asynchronous hardware signal to the CPU.
- IPC: Inter-process communication (pipes, sockets, shared memory).
- Kernel: Privileged core managing CPU, memory, and devices.
- Namespace: Isolated view of a system resource (PID, mount, net).
- Page fault: Trap when a page is missing or access is invalid.
- PCB: Process Control Block; kernel data for a process.
- Pipe: One-way byte stream between processes.
- Syscall: Controlled entry into the kernel for services.
- TLB: Cache of virtual-to-physical translations.
- vDSO: Shared page for fast kernel data without syscalls.
Why Operating Systems Matter
The Modern Problem It Solves
Modern systems run many programs at once, across local machines, clouds, and mobile devices. Without an OS, every program would need to manage hardware directly, leading to chaos, poor security, and fragile software. The OS provides isolation, fairness, and a uniform interface that makes software portable and reliable.
Recent real-world context (with sources and year):
- StatCounter global mobile OS share (Nov 2025): Android about 71.9% and iOS about 27.6%.
- StatCounter global desktop OS share (Nov 2025): Windows about 69.5% worldwide.
- POSIX.1-2024 is the latest standardized interface for portable Unix-style systems.
- Android is built on the upstream Linux kernel and depends on standard kernel subsystems.
Without OS With OS
+------------+ +-----------------------+
| Program A | | Scheduler + VM + FS |
| Program B | | Isolation + Security |
| Program C | | Stable APIs |
+------------+ +-----------------------+
| |
v v
Hardware chaos Reliable computing
Context & Evolution (History)
- 1969-1973: Unix emerges at Bell Labs and is rewritten in C.
- 1980s-1990s: POSIX standardizes a portable OS interface.
- 1991: Linux introduces a Unix-like kernel with open development.
- 2000s-2020s: Containers and cloud workloads push OS isolation features.
The lesson: OS abstractions that are simple and composable tend to survive. The Unix model is still visible inside modern Linux, macOS, and Android.
Concept Summary Table
| Code | Concept | What you must internalize |
|---|---|---|
| C1 | Bootstrapping and CPU Modes | Boot flow, CPU modes, loader invariants |
| C2 | Kernel Boundary and Syscalls | Privilege transitions, ABI, validation |
| C3 | Execution Model | Processes, threads, scheduling, synchronization |
| C4 | Virtual Memory and Paging | Translation, TLB, page faults, replacement |
| C5 | Dynamic Allocation | Heap design, fragmentation, alignment |
| C6 | File Systems | Inodes, allocation, journaling, crash safety |
| C7 | I/O Subsystem | Interrupts, drivers, DMA, device models |
| C8 | IPC and Composition | Pipes, sockets, signals, networking |
| C9 | Security and Virtualization | Permissions, namespaces, cgroups, VMs |
Project-to-Concept Map
| Project | Concepts Applied |
|---|---|
| 1. Bare Metal “Hello” | C1, C7 |
| 2. Bootloader + Kernel | C1, C2 |
| 3. Memory Allocator | C5, C4 |
| 4. VM Simulator | C4 |
| 5. Scheduler Simulator | C3 |
| 6. Fork and Exec | C2, C3 |
| 7. Unix Shell | C2, C3, C8 |
| 8. Syscall Interface | C2 |
| 9. File System | C6 |
| 10. Device Files | C6, C7, C9 |
| 11. IPC Toolkit | C3, C8 |
| 12. Unix Tools | C6, C8 |
| 13. Containers | C9, C2 |
| 14. xv6 Study | C1-C6 |
| 15. Kernel Module | C2, C7, C9 |
| 16. Network Stack | C7, C8 |
Deep Dive Reading by Concept
| Concept | Book & Chapter | Why It Helps |
|---|---|---|
| C1 Bootstrapping | OSTEP Ch. 36; Write Great Code Vol 2 Ch. 3 | Boot flow and CPU modes |
| C2 Syscalls | TLPI Ch. 3; APUE Ch. 2-3 | Syscall ABI and process APIs |
| C3 Execution Model | OSTEP Ch. 4, 7-9, 26-31 | Scheduling and concurrency |
| C4 Virtual Memory | OSTEP Ch. 13-22; CS:APP Ch. 9 | Paging and address translation |
| C5 Allocation | OSTEP Ch. 17; C Interfaces and Implementations Ch. 5 | Allocator design |
| C6 File Systems | OSTEP Ch. 39-45; Understanding the Linux Kernel Ch. 12 | Layout and journaling |
| C7 I/O | OSTEP Ch. 36; Linux Kernel Development Ch. 13 | Interrupts and drivers |
| C8 IPC/Networking | TLPI Ch. 44-56; UNIX Network Programming Ch. 1-6 | IPC and sockets |
| C9 Security/Virtualization | Modern Operating Systems Ch. 9-10; Linux docs | Permissions and isolation |
Quick Start
Day 1
- Read OSTEP Ch. 1-2 and skim Ch. 36 (boot overview).
- Install QEMU, build tools, and run a minimal C program.
- Draw the user/kernel boundary and annotate where syscalls occur.
Day 2
- Build Project 1 and boot it in QEMU.
- Use `strace` on a simple command and categorize syscalls.
- Write a 1-page summary: “What does the OS provide that hardware does not?”
Recommended Learning Paths
1) Foundations Path: Projects 1 -> 2 -> 3 -> 4 -> 5 -> 6
2) Unix Tooling Path: Projects 6 -> 7 -> 11 -> 12 -> 8
3) Kernel Focus Path: Projects 14 -> 15 -> 16 -> 13
4) Performance Path: Projects 3 -> 4 -> 5 -> 9 -> 16
Success Metrics
- I can explain the full boot chain from firmware to kernel entry.
- I can trace a syscall from user code to kernel handler and back.
- I can explain why a program slows down when its working set exceeds RAM.
- I can implement a scheduler and measure fairness metrics.
- I can build a file system image and recover it after a simulated crash.
- I can build a basic container with namespaces and cgroups.
Appendix: Tooling and Debugging Cheatsheet
- Boot and kernel debugging: `qemu-system-x86_64 -s -S`, `gdb`, `objdump -d`
- Syscall tracing: `strace -f`, `ltrace`
- Process inspection: `ps`, `top`, `htop`, `/proc/<pid>/status`
- File system inspection: `hexdump`, `xxd`, `debugfs`, `fsck`
- I/O profiling: `iostat`, `iotop`, `perf`
- Network inspection: `tcpdump`, `ss`, `ip`
Project Overview Table
| # | Project | Focus | Difficulty | Outcome |
|---|---|---|---|---|
| 1 | Bare Metal “Hello” | Boot, real mode | Expert | Bootable 512B image |
| 2 | Bootloader + Kernel | Mode switch, loader | Expert | Protected-mode kernel stub |
| 3 | Memory Allocator | Heap, fragmentation | Advanced | malloc/free library |
| 4 | VM Simulator | Paging, faults | Expert | Trace-driven VM simulator |
| 5 | Scheduler Simulator | Policies, metrics | Advanced | Scheduler comparison tool |
| 6 | Fork and Exec | Process model | Intermediate | Process explorer |
| 7 | Unix Shell | Pipes, jobs | Advanced | Featureful shell |
| 8 | Syscall Interface | ABI, tracing | Expert | Raw syscall + tracer |
| 9 | File System | Persistence | Expert | FUSE-mounted FS |
| 10 | Device Files | I/O model | Advanced | Custom /dev devices |
| 11 | IPC Toolkit | Pipes, shm | Intermediate | IPC demos + chat |
| 12 | Unix Tools | Composition | Intermediate | Mini coreutils |
| 13 | Containers | Namespaces, cgroups | Expert | Minimal container runtime |
| 14 | xv6 Study | Kernel architecture | Advanced | Annotated xv6 + changes |
| 15 | Kernel Module | Drivers | Expert | Loadable module |
| 16 | Network Stack | Packets, sockets | Expert | Packet analyzer/mini stack |
Project List
Project 1: Bare Metal “Hello World” (No OS)
Real World Outcome
You boot a raw 512-byte image in QEMU and see your message printed on the virtual screen, with no OS, no libc, and no runtime. The CPU is executing your instructions directly.
$ nasm -f bin boot.asm -o boot.bin
$ ls -l boot.bin
-rw-r--r-- 1 user staff 512 Jan 01 12:00 boot.bin
$ qemu-system-x86_64 -drive format=raw,file=boot.bin
[QEMU window]
Hello, OS!
_
The Core Question You Are Answering
What is the smallest program that can make real hardware do visible work without any OS support, and why must it be structured in a precise way?
Concepts You Must Understand First
- C1 Bootstrapping and CPU modes (OSTEP Ch. 36)
- BIOS boot signature and 0x7C00 load address (Write Great Code Vol 2 Ch. 3)
- Memory-mapped I/O (CS:APP Ch. 6)
- Instruction size limits and padding (Art of Assembly Ch. 13)
Questions to Guide Your Design
- Why must the boot sector be exactly 512 bytes?
- Do you want BIOS interrupts or direct VGA memory writes?
- What registers are safe to assume on entry?
- How do you position the cursor or print multiple lines?
- How will you halt the CPU cleanly after output?
Thinking Exercise
Draw a 25x80 text screen. Compute the memory offset for row 3, column 10 in VGA text mode and show the 2-byte value for a white “A” on black.
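One worked check for this exercise, assuming zero-indexed rows and columns (shift by one if you count from 1):

/* Each text cell is 2 bytes: low byte = ASCII, high byte = attribute. */
unsigned offset = (3 * 80 + 10) * 2;       /* = 500 = 0x1F4 bytes past 0xB8000 */
unsigned short cell = (0x0F << 8) | 'A';   /* 0x0F41: white-on-black 'A' */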
The Interview Questions They Will Ask
- What does the 0x55AA signature do?
- Why is `org 0x7C00` required?
- What is the difference between BIOS output and VGA memory output?
- What CPU mode are you in when BIOS loads the boot sector?
Hints in Layers
Layer 1 - Minimal boot
- Write an infinite loop and confirm QEMU does not reboot.
Layer 2 - BIOS teletype
- Use `int 0x10` with AH=0x0E to print characters.
Layer 3 - Direct VGA
- Write word values to 0xB8000 for text mode output.
Layer 4 - Polished output
- Clear the screen and set the cursor position for readability.
Books That Will Help
| Book | Chapters | Why It Helps |
|---|---|---|
| OSTEP | 36 | Boot sequence overview |
| Write Great Code Vol 2 | 3 | Real mode and segmentation |
| Art of Assembly | 13 | BIOS interrupts |
| CS:APP | 6 | Memory-mapped I/O |
Common Pitfalls & Debugging
Problem: “QEMU reboots instantly”
- Why: Missing or incorrect 0x55AA signature.
- Fix: Pad to 510 bytes and append `0x55 0xAA`.
- Quick test: `tail -c 2 boot.bin | hexdump -C` shows `55 aa`.
Problem: “No text appears”
- Why: Writing to wrong address or using 32-bit instructions in real mode.
- Fix: Use 16-bit instructions and VGA base 0xB8000.
- Quick test: Write a single character to offset 0 and observe.
Definition of Done
- `boot.bin` is exactly 512 bytes and ends with 0x55AA
- QEMU boots and prints your message reliably
- Output appears without using any OS or libc
- You can explain each instruction in your boot sector
Project 2: Bootloader that Loads a Kernel
Real World Outcome
You build a two-stage bootloader that loads a kernel image from disk, switches to protected mode, and transfers control to C code. The screen shows a boot log and a kernel banner.
$ make
nasm -f bin stage1.asm -o stage1.bin
nasm -f bin stage2.asm -o stage2.bin
gcc -m32 -ffreestanding -c kernel.c -o kernel.o
ld -m elf_i386 -T linker.ld -o kernel.bin kernel.o
cat stage1.bin stage2.bin kernel.bin > os.bin
$ qemu-system-x86_64 -drive format=raw,file=os.bin
Bootloader: reading kernel...
Bootloader: enabling A20...
Bootloader: entering protected mode...
Kernel: hello from 32-bit C
The Core Question You Are Answering
How do you safely transition from firmware to your own kernel, set up CPU protection, and transfer control to higher-level code?
Concepts You Must Understand First
- C1 Bootstrapping and CPU modes (OSTEP Ch. 36)
- GDT and segmentation basics (Write Great Code Vol 2 Ch. 3)
- Disk reads via BIOS INT 0x13 (Art of Assembly Ch. 13)
- Kernel image layout and linker scripts (CS:APP Ch. 7)
Questions to Guide Your Design
- Where will you load the kernel so you do not overwrite the bootloader?
- How many sectors must you read, and how do you compute that?
- What GDT entries are required for a flat memory model?
- How will you set up the stack before entering C?
- What happens if the A20 line is not enabled?
Thinking Exercise
Sketch a memory map from 0x0000 to 0x20000 showing the boot sector at 0x7C00 and a kernel loaded at 0x1000 or 0x100000. Identify overlaps.
The Interview Questions They Will Ask
- Why is a far jump required after setting CR0.PE?
- What is the purpose of the GDT?
- How does the bootloader decide where the kernel starts?
- Why is `-ffreestanding` used in kernel builds?
Hints in Layers
Layer 1 - Stage 1 loads stage 2
- Read a fixed number of sectors into memory and jump.
Layer 2 - Enter protected mode
- Load a minimal GDT, set CR0.PE, far jump.
Layer 3 - Call C code
- Set up a stack and call a C function from assembly.
Layer 4 - ELF-aware loader
- Parse ELF headers and load segments to correct addresses.
Books That Will Help
| Book | Chapters | Why It Helps |
|---|---|---|
| OSTEP | 36 | Boot flow overview |
| Art of Assembly | 8, 13 | Mode switching and BIOS I/O |
| CS:APP | 7 | Linking and object layout |
| Linux Kernel Development | 13 | Low-level I/O context |
Common Pitfalls & Debugging
Problem: “Bootloader hangs after enabling protected mode”
- Why: GDT descriptor or far jump is incorrect.
- Fix: Verify selector values and GDT layout.
- Quick test: Use QEMU `-d int` and step through the transition.
Problem: “Kernel prints garbage”
- Why: Stack not initialized or wrong segment selectors.
- Fix: Set SS:ESP before calling C.
- Quick test: Print a constant after the mode switch.
Definition of Done
- Stage 1 loads stage 2 and the kernel reliably
- CPU enters protected mode without triple fault
- Kernel prints a C message after boot
- You can explain the GDT and mode switch sequence
Project 3: Memory Allocator (malloc from Scratch)
Real World Outcome
You implement malloc, free, and realloc with a free list and coalescing, and you can preload your allocator into real programs.
$ gcc -fPIC -shared -o libmymalloc.so mymalloc.c
$ LD_PRELOAD=./libmymalloc.so ./alloc_test
[mymalloc] malloc(1024) -> 0x7f1234000010
[mymalloc] malloc(2048) -> 0x7f1234000420
[mymalloc] free(0x7f1234000010)
[mymalloc] malloc(512) -> 0x7f1234000010 (reused)
[mymalloc] coalesce: merged 2 blocks
Heap stats:
total: 65536 bytes
in-use: 2560 bytes
free: 62976 bytes
fragmentation: 12%
The Core Question You Are Answering
How does a runtime convert raw pages into safe, reusable allocations with minimal overhead and fragmentation?
Concepts You Must Understand First
- C5 Dynamic allocation (OSTEP Ch. 17)
- Heap vs stack and alignment (C Interfaces and Implementations Ch. 5)
- sbrk vs mmap (TLPI Ch. 7)
- Fragmentation metrics (OSTEP Ch. 17)
Questions to Guide Your Design
- Will your size field include the header or only payload?
- How will you align allocations (8 or 16 bytes)?
- When do you split a free block, and how small can the remainder be?
- How will you coalesce neighbors efficiently?
- What is your strategy for large allocations?
Thinking Exercise
Simulate this sequence on paper: malloc(32), malloc(64), free(first), malloc(16), free(second). Draw the free list after each step.
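To make the design questions concrete, here is one possible block header layout; the field names and the 16-byte alignment choice are illustrative, not prescriptive:

#include <stddef.h>

typedef struct header {
    size_t size;           /* payload bytes, rounded up to alignment */
    int    is_free;        /* nonzero if the block is on the free list */
    struct header *next;   /* next block in address order (eases coalescing) */
} header_t;

/* Round a request up to a 16-byte boundary. */
#define ALIGN16(n) (((n) + 15) & ~(size_t)15)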
The Interview Questions They Will Ask
- What is the difference between internal and external fragmentation?
- Why does `free` not need a size argument?
- How would you make the allocator thread-safe?
- Why is alignment required for correctness?
Hints in Layers
Layer 1 - Bump allocator
- Allocate by moving a pointer forward; ignore free.
Layer 2 - Free list
- Add block headers and support freeing blocks.
Layer 3 - Split and coalesce
- Split large blocks and merge adjacent free blocks.
Layer 4 - realloc and stats
- Add realloc, fragmentation metrics, and debug logging.
Books That Will Help
| Book | Chapters | Why It Helps |
|---|---|---|
| OSTEP | 17 | Allocator fundamentals |
| CS:APP | 9 | Heap internals |
| TLPI | 7 | sbrk and mmap |
| C Interfaces and Implementations | 5 | Alignment and layout |
Common Pitfalls & Debugging
Problem: “malloc returns misaligned pointers”
- Why: Header size not aligned to required boundary.
- Fix: Round block sizes up to alignment.
- Quick test: Allocate a `double` and check address % 8 == 0.
Problem: “free causes crashes later”
- Why: Coalescing merges wrong neighbors or corrupts list.
- Fix: Add boundary tags or verify with heap walking.
- Quick test: Run with AddressSanitizer and compare.
Definition of Done
- Alloc/free works on randomized tests without corruption
- Coalescing reduces fragmentation on a sample workload
- realloc passes shrink and grow cases
- You can explain your allocator policy choices
Project 4: Virtual Memory Simulator
Real World Outcome
You build a trace-driven simulator that translates virtual addresses, handles page faults, and compares replacement algorithms.
$ ./vm_sim workload.trace --pages=64 --frames=8 --algo=lru
VM Simulator
============
Pages: 64 (256 KB) Frames: 8 (32 KB)
Algorithm: LRU
Access 0x1234 -> page 1 -> frame 0 [fault]
Access 0x1240 -> page 1 -> frame 0 [hit]
Access 0x9ABC -> page 9 -> frame 1 [fault]
...
Stats:
accesses: 10000
page faults: 812 (8.12%)
TLB hits: 8234 (82.34%)
FIFO faults: 1012
LRU faults: 812
Clock faults: 855
The Core Question You Are Answering
How does the OS translate addresses and decide what stays in memory when RAM is smaller than the working set?
Concepts You Must Understand First
- C4 Virtual memory and paging (OSTEP Ch. 13-22)
- TLB and caching (Hennessy and Patterson Ch. 5)
- Replacement policies (OSTEP Ch. 22)
- Locality of reference (CS:APP Ch. 6)
Questions to Guide Your Design
- How many bits are page number vs offset?
- What fields must each PTE contain (present, dirty, accessed)?
- How will you approximate LRU efficiently?
- How will you validate translations against expected output?
- What metrics matter most for comparisons?
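The first question has a mechanical answer once you fix the page size. A sketch assuming 4 KB pages, with vaddr and frame as the assumed inputs:

#define PAGE_SIZE  4096UL
#define PAGE_SHIFT 12                               /* log2(PAGE_SIZE) */

unsigned long vpn    = vaddr >> PAGE_SHIFT;         /* virtual page number */
unsigned long offset = vaddr & (PAGE_SIZE - 1);     /* byte within the page */
unsigned long paddr  = (frame << PAGE_SHIFT) | offset;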
Thinking Exercise
Given frames=4 and trace: 1,2,3,4,1,2,5,1,2,3,4,5. Compute FIFO and LRU faults by hand.
The Interview Questions They Will Ask
- Why are page tables multi-level on 64-bit systems?
- What is the difference between a TLB miss and a page fault?
- Why does locality make LRU outperform FIFO?
- What is the working set model?
Hints in Layers
Layer 1 - Direct page table
- Implement page number extraction and translation.
Layer 2 - FIFO
- Add a queue of frames and evict the oldest.
Layer 3 - LRU or Clock
- Track access times or use an accessed bit.
Layer 4 - TLB simulation
- Add a small translation cache and track hit rate.
Books That Will Help
| Book | Chapters | Why It Helps |
|---|---|---|
| OSTEP | 13-22 | Paging, faults, replacement |
| CS:APP | 9 | VM fundamentals |
| Hennessy and Patterson | 5 | TLB and caching |
Common Pitfalls & Debugging
Problem: “All translations are wrong”
- Why: Page size mask or shift is incorrect.
- Fix: Verify page size and bit math on small examples.
- Quick test: Translate a known address by hand.
Problem: “LRU behaves like FIFO”
- Why: Access timestamps not updated on hits.
- Fix: Update access metadata for every access.
- Quick test: Run a locality-heavy trace.
Definition of Done
- Translation works for manual test cases
- Replacement algorithms produce distinct fault rates
- TLB hit rate is computed and reported
- You can explain why LRU beats FIFO on locality traces
Project 5: Process Scheduler Simulator
Real World Outcome
You build a scheduler simulator that compares FIFO, SJF, Round Robin, MLFQ, and CFS on real workloads and outputs fairness metrics.
$ ./schedsim workload.txt --algo=mlfq
Scheduler Simulator
===================
Workload: 5 processes
Algorithm: MLFQ (3 queues, quanta 10/20/40 ms)
Timeline:
0ms P1
10ms P2
20ms P3
30ms P4
40ms P5
...
Results:
PID Turnaround Response Wait
P1 180ms 0ms 80ms
P2 110ms 10ms 40ms
P3 290ms 20ms 90ms
P4 40ms 30ms 30ms
P5 120ms 40ms 70ms
Average response: 20ms
The Core Question You Are Answering
How should the OS decide which process runs next so the system is both fair and responsive?
Concepts You Must Understand First
- C3 Execution model and scheduling (OSTEP Ch. 4, 7-9)
- Preemption and time slicing (OSTEP Ch. 7)
- Scheduling metrics (response, turnaround, wait)
- CFS overview (Linux Kernel Development Ch. 4)
Questions to Guide Your Design
- How will you model CPU vs I/O bursts?
- When two events occur at the same time, which wins?
- How do you account for context switch overhead?
- How will you compute response time for Round Robin?
- What does fairness mean in your simulator?
Thinking Exercise
For processes P1(5), P2(2), P3(8) arriving at time 0, compute FIFO, SJF, and RR (q=2) by hand.
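The standard metric definitions, useful for both the hand exercise and the simulator (all times are per process; variable names are illustrative):

/* t_arrive: arrival, t_first: first time scheduled,
   t_done: completion, t_burst: total CPU time required. */
turnaround = t_done  - t_arrive;    /* total time in the system */
response   = t_first - t_arrive;    /* delay before first CPU time */
waiting    = turnaround - t_burst;  /* time spent ready but not running */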
The Interview Questions They Will Ask
- Why can SJF starve long-running jobs?
- What is the difference between preemptive and non-preemptive scheduling?
- Why does MLFQ use priority boosts?
- How does CFS approximate fairness?
Hints in Layers
Layer 1 - FIFO
- Implement a simple queue and no preemption.
Layer 2 - SJF and RR
- Add burst times and a time quantum.
Layer 3 - Metrics
- Track start and completion times for each process.
Layer 4 - MLFQ or CFS
- Add multiple queues or a virtual runtime tree.
Books That Will Help
| Book | Chapters | Why It Helps |
|---|---|---|
| OSTEP | 4, 7-9 | Scheduling fundamentals |
| Linux Kernel Development | 4 | CFS design |
Common Pitfalls & Debugging
Problem: “Response time is negative”
- Why: Start time overwritten on reschedule.
- Fix: Record first-run time only once.
- Quick test: Validate with a 2-process workload.
Problem: “Blocked processes never resume”
- Why: I/O completion events are not re-queued.
- Fix: Move blocked processes back to ready at completion time.
- Quick test: Add a single I/O burst and verify.
Definition of Done
- FIFO, SJF, and RR produce correct metrics
- MLFQ or CFS is implemented and compared
- Simulator outputs a timeline and averages
- You can explain trade-offs for each policy
Project 6: Fork and Exec (Process Creation)
Real World Outcome
You build a process explorer that demonstrates fork/exec/wait, shows process trees, and surfaces zombies and orphans.
$ ./fork_explorer
Parent PID: 1201
fork() -> child PID 1202
Child: exec /bin/ls
file1.txt file2.txt fork_explorer
Parent: child 1202 exited with status 0
$ ./fork_explorer tree
init(1)
+-- bash(1180)
+-- fork_explorer(1201)
+-- ls(1202)
The Core Question You Are Answering
Why did Unix split process creation into fork and exec, and how does that enable composition and redirection?
Concepts You Must Understand First
- C2 Syscall boundary (TLPI Ch. 3)
- C3 Process model (OSTEP Ch. 4-5)
- fork/exec semantics (APUE Ch. 8)
- Wait and exit status (TLPI Ch. 26)
Questions to Guide Your Design
- How will you clearly show parent vs child output ordering?
- How will you demonstrate zombies and orphans safely?
- How will you parse `/proc` to build a tree?
- How do you capture child exit status accurately?
- How will you show file descriptor inheritance?
Thinking Exercise
Draw the process tree created by a shell pipeline `cat file | grep foo | wc -l` and label who calls fork and exec.
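A minimal fork/exec/wait skeleton to anchor the exercise (error handling omitted for brevity):

#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

pid_t pid = fork();
if (pid == 0) {                       /* child: replace the image */
    execlp("ls", "ls", (char *)NULL);
    _exit(127);                       /* reached only if exec fails */
}
int status;
waitpid(pid, &status, 0);             /* parent: reap the child */
if (WIFEXITED(status))
    printf("child %d exited with %d\n", (int)pid, WEXITSTATUS(status));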
The Interview Questions They Will Ask
- What is the difference between fork and exec?
- What is a zombie and how is it cleaned up?
- Why is fork efficient despite copying memory?
- How does a shell implement output redirection?
Hints in Layers
Layer 1 - Basic fork
- Print different messages in parent and child.
Layer 2 - exec
- In the child, exec `/bin/ls` and observe output.
Layer 3 - wait
- Parent waits and prints exit status.
Layer 4 - /proc tree
- Parse `/proc/<pid>/stat` and build a tree.
Books That Will Help
| Book | Chapters | Why It Helps |
|---|---|---|
| TLPI | 24-26 | fork, exec, wait |
| APUE | 8 | Process relationships |
| OSTEP | 5 | Fork and exec model |
Common Pitfalls & Debugging
Problem: “Output ordering is confusing”
- Why: Parent and child run concurrently.
- Fix: Use fflush or sleep to make ordering clear.
- Quick test: Add timestamps to each line.
Problem: “Zombie stays forever”
- Why: Parent never calls wait.
- Fix: Call waitpid or handle SIGCHLD.
- Quick test: `ps -el | grep Z` before and after wait.
Definition of Done
- fork/exec/wait demo works and shows exit status
- Process tree is printed from /proc data
- Zombies and orphans are demonstrated and cleaned
- You can explain the fork+exec design rationale
Project 7: Unix Shell from Scratch
Real World Outcome
You build a functional shell with parsing, pipes, redirection, and job control.
$ ./myshell
myshell> echo hello
hello
myshell> ls -la | grep .c | wc -l
5
myshell> sort < unsorted.txt | uniq > sorted.txt
myshell> sleep 100 &
[1] 4321
myshell> jobs
[1] Running sleep 100
myshell> fg 1
sleep 100
^C
myshell> exit
The Core Question You Are Answering
How do fork, exec, pipes, and signals combine to implement the Unix command model?
Concepts You Must Understand First
- C3 Processes and scheduling (OSTEP Ch. 4)
- C8 Pipes and IPC (TLPI Ch. 44)
- File descriptor duplication and redirection (APUE Ch. 3)
- Signals and job control (TLPI Ch. 20-22)
Questions to Guide Your Design
- How will you parse quoted strings and escapes?
- How will you implement pipelines of arbitrary length?
- How will you handle built-in commands like `cd`?
- How will you manage process groups for job control?
- How will you handle SIGINT and SIGTSTP correctly?
Thinking Exercise
Explain how `grep foo < in.txt | sort | uniq -c > out.txt` is executed at the process and FD level.
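A sketch of how one stage of that pipeline sets up input redirection before exec; it is the same dup2 trick used for pipes, applied to a file:

#include <fcntl.h>
#include <unistd.h>

int in = open("in.txt", O_RDONLY);
dup2(in, STDIN_FILENO);   /* grep's stdin now reads from the file */
close(in);                /* the original descriptor is no longer needed */
execlp("grep", "grep", "foo", (char *)NULL);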
The Interview Questions They Will Ask
- How do pipes work under the hood?
- Why must `cd` be a shell built-in?
- How does the shell implement `&` and `fg`/`bg`?
- What is a process group and why does it matter?
Hints in Layers
Layer 1 - Read/exec loop
- Implement a loop that execs single commands.
Layer 2 - Pipes and redirection
- Use pipe(), dup2(), and close() to wire processes.
Layer 3 - Built-ins
- Implement `cd`, `exit`, `jobs`, and `fg`.
Layer 4 - Job control
- Use setpgid and tcsetpgrp for foreground control.
Books That Will Help
| Book | Chapters | Why It Helps |
|---|---|---|
| TLPI | 20-22, 44 | Signals and pipes |
| APUE | 3, 9 | Redirection and job control |
| OSTEP | 5 | Fork/exec mental model |
Common Pitfalls & Debugging
Problem: “Pipelines hang”
- Why: Parent forgot to close pipe ends.
- Fix: Close unused ends in all processes.
- Quick test: Use `lsof -p <pid>` to inspect FDs.
Problem: “Ctrl+C kills the shell”
- Why: Shell did not separate process groups.
- Fix: Put child in its own process group and forward signals.
- Quick test: Run `sleep 100` and press Ctrl+C.
Definition of Done
- Commands execute with correct exit status
- Pipes and redirection match bash for simple cases
- Built-ins work and do not spawn child processes
- Job control handles background and foreground tasks
Project 8: System Call Interface
Real World Outcome
You implement raw syscalls in assembly, compare with libc, and build a small tracer that logs syscalls of a target program.
$ ./syscall_lab raw
[raw] write(1, "Hello\n", 6) -> 6
[raw] getpid() -> 5123
$ ./syscall_lab trace /bin/ls
execve("/bin/ls", ["ls"], [/* env */]) = 0
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY) = 3
mmap(NULL, 12345, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f...
write(1, "file1\nfile2\n", 12) = 12
exit_group(0) = ?
The Core Question You Are Answering
What exactly happens when user code crosses the kernel boundary, and how small is the real OS API?
Concepts You Must Understand First
- C2 Syscall boundary and ABI (TLPI Ch. 3)
- User/kernel transition (Linux Kernel Development Ch. 5)
- ptrace basics (TLPI Ch. 26)
- Error handling (errno conventions)
Questions to Guide Your Design
- How will you encode syscall numbers for your architecture?
- How will you pass 6 arguments safely?
- How will you capture syscall entry and exit with ptrace?
- How will you decode syscall numbers into names?
- How will you handle multi-threaded targets?
Thinking Exercise
Pick a simple program (e.g., cat file) and predict the first 10 syscalls it will make.
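Before writing inline assembly, you can cross-check your expectations with libc's generic syscall(2) wrapper; a sketch (Linux-specific):

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

long n   = syscall(SYS_write, 1, "Hello\n", (size_t)6);  /* bytes written */
long pid = syscall(SYS_getpid);                          /* same value as getpid() */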
The Interview Questions They Will Ask
- Why does the kernel validate user pointers?
- What registers are used for syscalls on x86-64?
- What is the difference between a syscall and a function call?
- How does `strace` work internally?
Hints in Layers
Layer 1 - Raw syscall
- Implement `getpid` and `write` in inline asm.
Layer 2 - Syscall table
- Add a small map from numbers to names.
Layer 3 - Tracing
- Use `ptrace(PTRACE_SYSCALL)` to stop at entry/exit.
Layer 4 - Formatter
- Decode arguments and print human-readable output.
Books That Will Help
| Book | Chapters | Why It Helps |
|---|---|---|
| TLPI | 3, 26 | Syscall basics and tracing |
| Low-Level Programming | 8 | x86-64 syscall ABI |
| Linux Kernel Development | 5 | Kernel entry path |
Common Pitfalls & Debugging
Problem: “Raw syscall returns -1”
- Why: Wrong syscall number or argument register.
- Fix: Verify ABI and use `errno` mapping.
- Quick test: Compare with libc call.
Problem: “Tracer misses syscalls”
- Why: You did not wait for both entry and exit stops.
- Fix: Alternate between entry and exit and track state.
- Quick test: Log a counter and ensure it increments.
Definition of Done
- Raw syscalls work for at least two functions
- Tracer logs syscall entry and exit reliably
- Output matches `strace` for a simple program
- You can explain the syscall ABI for x86-64
Project 9: File System from Scratch
Real World Outcome
You implement a simple file system in a disk image, build mkfs and fsck tools, and mount it with FUSE.
$ ./mkfs.myfs disk.img 10M
Created myfs: 10 MB, 256 inodes, 2560 blocks
$ ./myfs_fuse disk.img /mnt/myfs
$ echo "hello" > /mnt/myfs/notes.txt
$ ls -l /mnt/myfs
-rw-r--r-- 1 user user 6 Jan 01 12:00 notes.txt
$ ./fsck.myfs disk.img
myfs: clean
The Core Question You Are Answering
How do raw blocks become durable, named files, and how do you recover after crashes?
Concepts You Must Understand First
- C6 File systems and crash consistency (OSTEP Ch. 39-45)
- Inodes and directories (Understanding the Linux Kernel Ch. 12)
- Block allocation and bitmaps (OSTEP Ch. 40)
- FUSE interface basics (TLPI Ch. 14)
Questions to Guide Your Design
- What is your on-disk layout (superblock, inode table, bitmap, data)?
- How will you map filenames to inode numbers?
- How will you allocate and free blocks efficiently?
- How will you implement fsck and detect inconsistencies?
- What is your crash recovery strategy?
Thinking Exercise
Design a 1 MB file system with 4 KB blocks. How many inodes can you fit if each inode is 128 bytes and you reserve 10% for metadata?
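One possible on-disk superblock to make the layout question concrete; the field names and ordering are illustrative:

#include <stdint.h>

struct superblock {
    uint32_t magic;          /* identifies your FS type */
    uint32_t block_size;     /* e.g., 4096 */
    uint32_t block_count;    /* total blocks in the image */
    uint32_t inode_count;    /* entries in the inode table */
    uint32_t inode_bitmap;   /* block number of the inode bitmap */
    uint32_t data_bitmap;    /* block number of the data bitmap */
    uint32_t inode_table;    /* first block of the inode table */
    uint32_t data_start;     /* first data block */
};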
The Interview Questions They Will Ask
- What is the difference between a directory and a file in Unix?
- Why does journaling improve crash recovery?
- What is the purpose of an inode?
- How does the OS resolve a path like /a/b/c?
Hints in Layers
Layer 1 - Flat files
- Implement fixed-size files with a simple table.
Layer 2 - Inodes and directories
- Add inode structures and directory entries.
Layer 3 - Allocation bitmap
- Track free blocks and implement allocation.
Layer 4 - FUSE mount
- Expose your file system via FUSE and test with shell tools.
Books That Will Help
| Book | Chapters | Why It Helps |
|---|---|---|
| OSTEP | 39-45 | File system design |
| Understanding the Linux Kernel | 12 | Inode internals |
| TLPI | 14 | File I/O and FUSE |
Common Pitfalls & Debugging
Problem: “Files disappear after remount”
- Why: Metadata not flushed or inconsistent.
- Fix: Ensure superblock/inodes are written and fsync as needed.
- Quick test: Mount, create file, unmount, remount, verify.
Problem: “Directory traversal fails”
- Why: Incorrect directory entry parsing.
- Fix: Validate entry sizes and offsets.
- Quick test: Dump raw directory blocks with hexdump.
Definition of Done
- mkfs creates a valid disk image
- create/read/write/delete works via FUSE
- fsck detects and repairs simple inconsistencies
- You can explain your on-disk layout
Project 10: Everything Is a File (Device Files)
Real World Outcome
You build a set of custom device files (via a kernel module or FUSE) that behave like real devices: a random generator, a counter, and a logging sink.
$ sudo ./mydevctl load
$ ls -l /dev/myrand /dev/mycount /dev/mylog
crw-rw---- 1 root root 240, 0 Jan 01 12:00 /dev/myrand
crw-rw---- 1 root root 240, 1 Jan 01 12:00 /dev/mycount
crw-rw---- 1 root root 240, 2 Jan 01 12:00 /dev/mylog
$ head -c 8 /dev/myrand | hexdump -C
00000000 4f 12 a9 7c 2b 90 33 01 |O..|+.3.|
$ cat /dev/mycount
1
$ cat /dev/mycount
2
$ echo "hello" > /dev/mylog
$ dmesg | tail -n 2
[12345.67] mylog: hello
The Core Question You Are Answering
How does the OS present hardware and kernel services as simple files with read/write semantics?
Concepts You Must Understand First
- C7 I/O subsystem (interrupts and drivers)
- C6 File system semantics and device nodes
- C9 Permissions and access control
- Character vs block devices (Linux kernel docs)
Questions to Guide Your Design
- Should your device be character or block?
- What does read/write mean for each device?
- How will you handle concurrency and multiple readers?
- What permissions should the device nodes have?
- How will you expose device state to user space?
Thinking Exercise
Design a device /dev/uptime that returns a string with seconds since boot. What kernel data would you use?
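If you take the kernel-module route, the read side of /dev/mycount could look roughly like this sketch (kernel-side C; locking and length clamping omitted, names illustrative):

#include <linux/fs.h>        /* struct file */
#include <linux/uaccess.h>   /* copy_to_user */
#include <linux/kernel.h>    /* snprintf */

static int counter;          /* shared state; a real driver guards this with a lock */

static ssize_t mycount_read(struct file *f, char __user *buf,
                            size_t len, loff_t *off)
{
    char tmp[16];
    int n;

    if (*off > 0)
        return 0;                         /* EOF on the second read */
    n = snprintf(tmp, sizeof tmp, "%d\n", ++counter);
    if (copy_to_user(buf, tmp, n))
        return -EFAULT;                   /* never trust user pointers */
    *off = n;
    return n;
}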
The Interview Questions They Will Ask
- What is the difference between a character and block device?
- Why does Unix treat devices as files?
- How does a kernel module register a device?
- What does `mknod` do?
Hints in Layers
Layer 1 - Stub device
- Register a char device that returns a fixed string.
Layer 2 - Stateful reads
- Add internal counters or a ring buffer.
Layer 3 - Permissions
- Create udev rules or set permissions on /dev nodes.
Layer 4 - Concurrency
- Use locks to prevent races on shared state.
Books That Will Help
| Book | Chapters | Why It Helps |
|---|---|---|
| Linux Kernel Development | 6-7 | Device registration |
| TLPI | 13 | Device files and I/O |
| APUE | 4 | File system interfaces |
Common Pitfalls & Debugging
Problem: “read blocks forever”
- Why: Driver never returns EOF or data length.
- Fix: Return 0 when no more data is available.
- Quick test: Use `strace` to see read loop behavior.
Problem: “permission denied”
- Why: Device node permissions are too strict.
- Fix: Adjust udev rules or chmod the node.
- Quick test: `ls -l /dev/myrand` and fix mode bits.
Definition of Done
- Device nodes appear under /dev with correct major/minor
- Read/write semantics behave as documented
- Concurrency is safe under parallel reads
- Permissions and ownership are correct
Project 11: Inter-Process Communication Toolkit
Real World Outcome
You build a toolkit that demonstrates pipes, shared memory, and Unix domain sockets, including a small local chat program.
$ ./ipc_demo pipe
parent -> child: hello via pipe
child -> parent: ack
$ ./ipc_demo shm
writer: wrote 4096 bytes to shared segment
reader: verified checksum OK
$ ./chat_server
[server] listening on /tmp/chat.sock
$ ./chat_client
> hello
[server] broadcast: hello
The Core Question You Are Answering
When processes are isolated, how do they communicate efficiently and safely?
Concepts You Must Understand First
- C8 IPC mechanisms (pipes, sockets, shared memory)
- C3 Synchronization primitives (locks, condition variables)
- File descriptor inheritance (APUE Ch. 3)
- Unix domain sockets (UNP Vol 1 Ch. 3)
Questions to Guide Your Design
- Which IPC mechanism is best for each demo and why?
- How will you synchronize shared memory writes?
- How will you handle client disconnects in the chat server?
- What buffer sizes are safe for your IPC messages?
- How will you benchmark latency between mechanisms?
Thinking Exercise
Compare the data path of a pipe vs shared memory. Where does copying occur in each?
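A sketch of the shared-memory data path with POSIX shm (error checks omitted; the segment name is illustrative):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int fd = shm_open("/ipc_demo", O_CREAT | O_RDWR, 0600);
ftruncate(fd, 4096);                            /* size the segment once */
char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
               MAP_SHARED, fd, 0);              /* both sides map the same pages */
/* Writes through p are visible to every process mapping "/ipc_demo";
   nothing is copied, which is why synchronization is on you. */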
The Interview Questions They Will Ask
- When would you choose sockets over pipes?
- Why is shared memory fast but dangerous?
- How does the OS wake a process waiting on a pipe?
- What is the difference between Unix and TCP sockets?
Hints in Layers
Layer 1 - Pipe demo
- Use pipe(), fork(), write/read.
Layer 2 - Shared memory
- Use shm_open/mmap and add a simple lock.
Layer 3 - Unix socket server
- Bind to /tmp/socket and accept clients.
Layer 4 - Benchmark
- Measure round-trip time for each mechanism.
Books That Will Help
| Book | Chapters | Why It Helps |
|---|---|---|
| TLPI | 44-56 | IPC mechanisms |
| APUE | 14 | IPC overview |
| UNIX Network Programming | 3-5 | Unix domain sockets |
Common Pitfalls & Debugging
Problem: “Shared memory data is corrupted”
- Why: Missing synchronization or incorrect offsets.
- Fix: Add a mutex or atomic flag.
- Quick test: Add checksums and verify.
Problem: “Server stops accepting clients”
- Why: File descriptor leak or not handling EINTR.
- Fix: Close client FDs and restart accept on EINTR.
- Quick test: Use `lsof` to monitor open sockets.
Definition of Done
- Pipe and shared memory demos work reliably
- Unix socket chat handles multiple clients
- Benchmarks show relative latency for each mechanism
- You can explain the trade-offs among IPC types
Project 12: Unix Philosophy Tools (Mini Coreutils)
Real World Outcome
You build small Unix tools (mini cat, grep, wc, head, tail) and verify that they compose correctly in pipelines.
$ ./mcat notes.txt | ./mgrep error | ./mwc -l
42
$ ./mhead -n 3 notes.txt
line1
line2
line3
$ ./mtail -n 2 notes.txt
line99
line100
The Core Question You Are Answering
Why do small, composable tools scale into powerful systems?
Concepts You Must Understand First
- C8 Pipes and composition (TLPI Ch. 44)
- C6 File I/O semantics (APUE Ch. 3)
- C2 Syscall boundary and errors
- Line buffering and streams
Questions to Guide Your Design
- How will your tools behave with stdin vs file input?
- How will you handle large files without loading into memory?
- How will you parse options and flags consistently?
- How will you handle binary data vs text?
- How will you test pipeline behavior?
Thinking Exercise
Design a pipeline that extracts the top 5 IPs from a log using only your tools.
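Whatever the pipeline, each tool's inner loop looks the same. The heart of mcat is a plain read/write loop over a fixed buffer (fd is stdin or an opened file; short-write handling omitted):

#include <unistd.h>

char buf[4096];
ssize_t n;
while ((n = read(fd, buf, sizeof buf)) > 0)
    write(STDOUT_FILENO, buf, (size_t)n);   /* stream; never load the whole file */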
The Interview Questions They Will Ask
- Why is “everything is a file” useful for tool composition?
- What is the difference between buffered and unbuffered I/O?
- Why should tools read from stdin by default?
- What does it mean for a tool to be stream-friendly?
Hints in Layers
Layer 1 - Single tool
- Implement mcat with read/write loops.
Layer 2 - Filtering
- Implement mgrep with simple substring search.
Layer 3 - Counting
- Implement mwc with line/word/byte counts.
Layer 4 - Robustness
- Add flags, error messages, and usage output.
Books That Will Help
| Book | Chapters | Why It Helps |
|---|---|---|
| APUE | 3 | File I/O patterns |
| TLPI | 4-5 | File descriptor basics |
| The Linux Command Line | 9-12 | Unix tool behavior |
Common Pitfalls & Debugging
Problem: “Pipelines hang”
- Why: Tool never closes stdout or reads stdin fully.
- Fix: Close outputs and handle EOF correctly.
- Quick test: Run with `strace` and verify read/write loops.
Problem: “Counts are wrong”
- Why: Incorrect line ending handling or buffer logic.
- Fix: Normalize line endings and test with known files.
- Quick test: Compare with system `wc` on the same file.
Definition of Done
- Tools behave like their Unix counterparts on basic tests
- Tools compose correctly in pipelines
- Tools handle stdin and file arguments
- You can explain how composition relies on file descriptors
Project 13: Containers from Scratch
Real World Outcome
You build a minimal container runtime that creates a new PID, mount, and network namespace, applies cgroup limits, and runs a command inside.
$ sudo ./mini_container /bin/bash
[container] pid: 1
[container] hostname: mini
[container] /proc shows only container processes
# inside container
$ hostname
mini
$ ps
PID TTY TIME CMD
1 pts/0 00:00:00 bash
The Core Question You Are Answering
How do namespaces and cgroups isolate processes without virtualizing hardware?
Concepts You Must Understand First
- C9 Security and isolation (namespaces, cgroups)
- C2 Syscall boundary and privilege checks
- C8 Networking and IPC (network namespaces)
- C6 Mount namespaces and filesystem views
Questions to Guide Your Design
- Which namespaces do you need for isolation (PID, mount, net)?
- How will you set up a minimal root filesystem?
- How will you configure cgroup limits (memory, CPU)?
- How will you handle UID/GID mapping for safety?
- How will you attach and clean up namespaces?
Thinking Exercise
Compare a container and a VM in terms of kernel, filesystem, and process isolation.
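It also helps to see how little code the container side needs. A sketch of the child-side setup, assuming the process was cloned with CLONE_NEWPID, CLONE_NEWNS, and CLONE_NEWUTS (error checks omitted):

#include <sys/mount.h>
#include <unistd.h>

/* Keep mount events from propagating back to the host. */
mount("none", "/", NULL, MS_REC | MS_PRIVATE, NULL);
/* Fresh procfs so ps shows only this PID namespace. */
mount("proc", "/proc", "proc", 0, NULL);
/* Safe only inside a UTS namespace; otherwise this renames the host. */
sethostname("mini", 4);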
The Interview Questions They Will Ask
- Why is a container not a VM?
- What does a PID namespace change?
- What is a cgroup and why is it needed?
- How does a container runtime start a process?
Hints in Layers
Layer 1 - Unshare namespaces
- Use `unshare` or clone with namespace flags.
Layer 2 - Mount /proc
- Mount a new procfs in the container namespace.
Layer 3 - cgroup limits
- Write memory and CPU limits to cgroup files.
Layer 4 - Rootfs
- Use pivot_root or chroot for filesystem isolation.
Books That Will Help
| Book | Chapters | Why It Helps |
|---|---|---|
| Modern Operating Systems | 9-10 | Protection and virtualization |
| TLPI | 40, 42 | Namespaces and resource limits |
| Linux Kernel Development | 7 | Process management context |
Common Pitfalls & Debugging
Problem: “Container sees host processes”
- Why: Missing PID namespace or /proc mount.
- Fix: Use CLONE_NEWPID and mount a new /proc.
- Quick test: `ps` inside the container should show PID 1 only.
Problem: “Memory limit not enforced”
- Why: cgroup not configured or wrong controller.
- Fix: Enable cgroup v2 and write to memory.max.
- Quick test: Allocate memory until OOM and observe.
Definition of Done
- Container has isolated PID and mount namespaces
- /proc shows only container processes
- cgroup limits are enforced
- Container runs arbitrary commands and exits cleanly
Project 14: xv6 Operating System Study
Real World Outcome
You read and annotate the xv6 codebase, run it in QEMU, and implement small extensions (e.g., a new syscall, scheduler tweak, or file system change).
$ make qemu
xv6...
$ ls
README cat echo grep init sh ls mkdir rm wc
$ uptime
1 1 1 100
The Core Question You Are Answering
How do real kernel subsystems fit together in a minimal, understandable OS?
Concepts You Must Understand First
- C1 Bootstrapping and CPU modes
- C2 Syscall boundary and ABI
- C3 Scheduling and process model
- C4 Virtual memory and paging
- C6 File system layout
Questions to Guide Your Design
- Which files implement boot and trap entry?
- How does xv6 represent processes and scheduling queues?
- Where are page tables created and switched?
- How does the file system map paths to inodes?
- Where would you add a new syscall safely?
Thinking Exercise
Trace the syscall path for ls from user space to kernel and back. Write the function names in order.
The Interview Questions They Will Ask
- Why is xv6 a good teaching OS?
- How does xv6 handle copy-on-write or lack it?
- Where are locks used to protect shared data?
- How do you add a syscall in xv6?
Hints in Layers
Layer 1 - Boot and traps
- Identify entry.S and trap handlers.
Layer 2 - Process model
- Read proc.c and scheduler loop.
Layer 3 - VM
- Read vm.c and map a page by hand.
Layer 4 - Extend
- Add a syscall that returns system uptime.
Books That Will Help
| Book | Chapters | Why It Helps |
|---|---|---|
| OSTEP | 4-6, 13-15 | Processes and memory |
| Modern Operating Systems | 1-3 | Kernel structures |
| Understanding the Linux Kernel | 1-3 | Real kernel architecture |
Common Pitfalls & Debugging
Problem: “Kernel panic after change”
- Why: Modified code breaks an invariant or missed a lock.
- Fix: Revert to last known working state and diff.
- Quick test: Use QEMU with GDB to inspect trap state.
Problem: “New syscall not found”
- Why: Syscall table or user stub not updated.
- Fix: Add entry in syscall.c and user header.
- Quick test: Verify syscall number and rebuild.
Definition of Done
- xv6 builds and boots in QEMU
- You can explain at least 3 core subsystems
- You implement a small, working kernel change
- You document the syscall path for a command
Project 15: Kernel Module Development
Real World Outcome
You write a loadable kernel module that registers a character device, exposes a /dev node, and logs kernel events.
$ make
$ sudo insmod mymod.ko
$ dmesg | tail -n 3
[12345.00] mymod: loaded
[12345.01] mymod: registered /dev/mymod
$ echo "ping" > /dev/mymod
$ dmesg | tail -n 1
[12345.50] mymod: ping
$ sudo rmmod mymod
$ dmesg | tail -n 1
[12346.00] mymod: unloaded
The Core Question You Are Answering
How do you safely extend a running kernel without crashing the system?
Concepts You Must Understand First
- C2 Syscall boundary and kernel mode rules
- C7 I/O subsystem and device registration
- C9 Security and privilege checks
- Kernel logging and debugging
Questions to Guide Your Design
- What init and exit hooks are required?
- How will you register and unregister a device cleanly?
- How will you handle concurrency in read/write callbacks?
- How will you expose module parameters safely?
- How will you test failure paths?
Thinking Exercise
List three reasons kernel code is more dangerous than user code, and how modules mitigate them.
The Interview Questions They Will Ask
- Why can a kernel module crash the whole system?
- How do you debug kernel code?
- What is the difference between a built-in driver and a module?
- How does module reference counting work?
Hints in Layers
Layer 1 - Hello module
- Print a message in init and exit.
Layer 2 - Device node
- Register a char device and implement read/write.
Layer 3 - Concurrency
- Add a mutex to protect shared buffers.
Layer 4 - Robustness
- Handle errors and unregister cleanly on failure.
Books That Will Help
| Book | Chapters | Why It Helps |
|---|---|---|
| Linux Kernel Development | 6-7 | Modules and devices |
| TLPI | 13 | Device files |
| Understanding the Linux Kernel | 1-2 | Kernel structure |
Common Pitfalls & Debugging
Problem: “Kernel oops on write”
- Why: User pointer not validated or buffer overflow.
- Fix: Use copy_from_user and bounds checks.
- Quick test: Write a long string and verify safe handling.
Problem: “Module will not unload”
- Why: Reference count still held.
- Fix: Ensure file descriptors are closed and release callbacks run.
- Quick test: `lsmod` and check the refcount.
Definition of Done
- Module loads and unloads cleanly
- /dev node works with read and write
- Concurrency is safe under parallel access
- You can explain kernel/user safety constraints
Project 16: Network Stack Exploration
Real World Outcome
You build a packet capture tool and a minimal user-space TCP/IP stack simulator, then compare its behavior with the kernel stack.
$ sudo ./pktcap eth0
[12:00:01] RX TCP 192.168.1.10:443 -> 192.168.1.50:51432 len=74
[12:00:01] TX TCP 192.168.1.50:51432 -> 192.168.1.10:443 len=66
[12:00:02] RX DNS 192.168.1.1 -> 192.168.1.50 len=92
$ ./mini_tcp_sim
SYN -> SYN-ACK -> ACK
state: ESTABLISHED
send: GET / HTTP/1.1
recv: HTTP/1.1 200 OK
The Core Question You Are Answering
How does the OS move bytes from a socket API call to packets on the wire, and back again?
Concepts You Must Understand First
- C7 I/O subsystem and interrupts
- C8 Networking and sockets
- Buffering and queueing (socket buffers)
- TCP state machine basics (TCP/IP Illustrated Ch. 1-6)
Questions to Guide Your Design
- What layers will your simulator implement (Ethernet, IP, TCP)?
- How will you handle checksum calculation and validation?
- How will you model retransmission and timeouts?
- How will you compare kernel vs user-space behavior?
- What metrics will you collect (latency, throughput, drops)?
Thinking Exercise
Draw the TCP three-way handshake and label the state transitions.
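Once frames are captured, decoding is pointer arithmetic, and the SYN/ACK flags you extract feed directly into the handshake diagram. A sketch assuming frame points at an Ethernet + IPv4 frame with no VLAN tag (Linux header definitions):

#include <netinet/ip.h>
#include <netinet/tcp.h>
#include <arpa/inet.h>

struct iphdr  *ip  = (struct iphdr *)(frame + 14);  /* skip the 14-byte Ethernet header */
struct tcphdr *tcp = (struct tcphdr *)((unsigned char *)ip + ip->ihl * 4);
uint16_t sport = ntohs(tcp->source);
uint16_t dport = ntohs(tcp->dest);
int syn = tcp->syn, ack = tcp->ack;                 /* handshake flags for the state machine */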
The Interview Questions They Will Ask
- What is the difference between TCP and UDP?
- Why does TCP need sequence numbers?
- What is the role of the NIC driver in packet handling?
- How does the OS wake a process waiting on a socket?
Hints in Layers
Layer 1 - Packet capture
- Use raw sockets or libpcap to capture packets.
Layer 2 - Decode headers
- Parse Ethernet, IP, and TCP headers.
Layer 3 - Simulated TCP
- Implement a tiny state machine for SYN/SYN-ACK/ACK.
Layer 4 - Compare
- Run a real HTTP request and compare with your simulator logs.
Books That Will Help
| Book | Chapters | Why It Helps |
|---|---|---|
| TCP/IP Illustrated | 1-6 | Protocol fundamentals |
| UNIX Network Programming | 1-5 | Socket APIs |
| Understanding Linux Network Internals | 1-3 | Kernel stack design |
Common Pitfalls & Debugging
Problem: “Captured packets look wrong”
- Why: Incorrect header offsets or endianness.
- Fix: Use htons/ntohs and verify header sizes.
- Quick test: Compare against tcpdump output.
Problem: “Simulator does not retransmit”
- Why: No timer or state tracking.
- Fix: Add a retransmission timer and state machine.
- Quick test: Drop a packet and ensure resend occurs.
Definition of Done
- Packet capture tool decodes Ethernet/IP/TCP
- Simulator implements a basic TCP handshake
- Output matches tcpdump for a sample flow
- You can explain how socket calls map to packets