Virtualization, Hypervisors, and Hyperconvergence Mastery - Real World Projects

Goal: Build a first-principles understanding of virtualization, from CPU modes and two-stage memory translation to device models, live migration, and hyperconverged storage. You will learn how the illusion of “many machines on one” is constructed, where it breaks, and which invariants must hold for correctness and security. Across the projects you will build real artifacts: a KVM userspace loop, a shadow paging simulator, virtio devices, a container runtime, and a hyperconverged lab. By the end, you will be able to design, debug, and explain modern virtualized infrastructure the way production platform and systems engineers do.


Introduction

  • What is virtualization? A set of techniques that allow one physical machine to host multiple isolated machines by virtualizing CPU, memory, and I/O.
  • What problem does it solve today? Workload isolation, elastic capacity, multi-tenant security, and operational portability across heterogeneous hardware.
  • What will you build across the projects? A KVM-based VM loop, memory virtualization simulators, virtio storage and network devices, a container runtime, an HCI lab, and a mini control plane.
  • In scope: CPU virtualization, memory translation, device models, virtio, VFIO/IOMMU, live migration, hyperconverged infrastructure (HCI), and control planes.
  • Out of scope: GPU virtualization beyond fundamentals, full Kubernetes/OpenStack builds, and non-x86 deep dives.

Big-picture diagram

User Apps            User Apps              Containerized App
   |                    |                          |
   v                    v                          v
Guest OS A           Guest OS B                Host Kernel
   |                    |                  (namespaces/cgroups)
   +---------- Virtual Devices (virtio/emulated/VFIO) ----------+
                             |
                             v
                      Hypervisor / VMM
                    (Type 1 or Type 2)
                             |
                             v
                  CPU + Memory + NIC + Disk

How to Use This Guide

  • Read the Theory Primer in order; each project depends on multiple concept chapters.
  • Use the Project-to-Concept Map to review the exact chapters you need before each build.
  • Keep a lab notebook with VM exit logs, timing data, and failure signatures.
  • Validate each project against its Definition of Done before moving on.

Prerequisites & Background Knowledge

Essential Prerequisites (Must Have)

  • C systems programming fundamentals (memory, pointers, structs, process model)
  • Linux basics: filesystems, permissions, process lifecycle, networking tools
  • OS fundamentals: privilege rings, paging, interrupts, device I/O
  • Recommended Reading: “Operating System Concepts” by Silberschatz et al. - Ch. 1, 2, 9

Helpful But Not Required

  • x86 architecture (VMX/SVM, MSRs)
  • Distributed systems (quorum, replication)
  • Linux kernel internals (ioctl, device drivers)

Self-Assessment Questions

  1. Can you describe a page table walk and a TLB miss?
  2. Can you explain what a VM exit is and why it is expensive?
  3. Can you trace a packet from a VM to a host NIC using bridge or TAP?
  4. Can you interpret dmesg for KVM-related messages?

Development Environment Setup

Required Tools:

  • Linux host with virtualization support (Intel VT-x or AMD-V)
  • QEMU + KVM + libvirt
  • iproute2, tcpdump, bridge-utils
  • perf, strace, tmux

Recommended Tools:

  • bpftrace or bcc-tools (host tracing)
  • fio (storage benchmarking)
  • iperf3 (network benchmarking)

Testing Your Setup:

$ egrep -c '(vmx|svm)' /proc/cpuinfo
1

$ lsmod | grep kvm
kvm_intel  ...

Time Investment

  • Simple projects: 4-8 hours each
  • Moderate projects: 10-20 hours each
  • Complex projects: 20-40 hours each
  • Total sprint: 3-6 months

Important Reality Check Virtualization debugging is often opaque. Expect sudden VM exits, silent failures, and hours in vendor manuals. This is normal, not a sign of poor skill.


Big Picture / Mental Model

+----------------------- Control Plane ------------------------+
| API, scheduler, policy, images, quotas, observability        |
+-------------------------------+------------------------------+
                                |
                                v
+---------------------+   +-------------+   +------------------+
| Guest OS + Apps     |<->| Hypervisor  |<->| Host Kernel + I/O |
| (VMs/Containers)    |   | VMX/SVM/EPT |   | Drivers, NIC, FS  |
+---------------------+   +-------------+   +------------------+
                                |
                                v
                     CPU + Memory + NIC + Storage

Key invariant: the hypervisor must preserve architectural
semantics while multiplexing hardware safely and predictably.

Theory Primer

Chapter 1: Hypervisor Taxonomy and CPU Virtualization

Fundamentals Hypervisors create the illusion that a guest OS owns the CPU while actually sharing it with other guests. The core mechanism is trap-and-emulate: most instructions run directly on hardware, but privileged or sensitive operations trigger a VM exit so the hypervisor can emulate or deny them. Historically, x86 did not cleanly trap all sensitive instructions, so early hypervisors used binary translation or paravirtualization. Modern CPUs added new execution modes (VMX root/non-root on Intel, SVM on AMD) and control structures (VMCS/VMCB) to make virtualization practical and efficient. Type 1 hypervisors run directly on hardware, while Type 2 run on a host OS; both can be fast with hardware assist, but they differ in attack surface and control.

Deep Dive CPU virtualization is a contract of invariants. The guest believes it owns ring 0, can manipulate page tables, and can program interrupt controllers. The hypervisor must preserve these semantics without giving the guest real control of hardware. Hardware assist adds a new privilege layer so the CPU itself can save guest state and switch to host state on a VM exit. The VMCS/VMCB defines what to trap on (CPUID, MSR access, IO, HLT, exceptions) and how to enter/exit. Each exit is expensive: it flushes pipelines, impacts branch predictors, and often requires TLB synchronization. That is why reducing exit frequency is one of the central performance goals.

A hypervisor must also present a stable virtual CPU model. This matters for live migration: if a guest sees different CPU features on the destination host, it may crash or misbehave. Production hypervisors filter CPUID to expose a consistent feature set and to hide unstable microcode features. Timer virtualization is another subtlety. The guest expects monotonic time, but real execution is preempted by the host scheduler. Hypervisors use virtual timers and paravirtual time sources so the guest sees consistent time even across migrations.

Nested virtualization adds complexity: a guest hypervisor must think it controls VMX/SVM, while the real hypervisor must still control the actual hardware. The outer hypervisor virtualizes VMX instructions and often uses a “shadow VMCS” or nested control structures to emulate the inner hypervisor. This multiplies exit paths and can cause significant overhead unless hardware provides nested support.

Finally, CPU virtualization is about correctness under failure. VM entries can fail due to invalid state, unsupported features, or misconfigured controls. Hypervisors must validate controls against CPU capability MSRs and must ensure the guest and host state fields are complete. Even after the VM is running, hypervisors must handle non-maskable interrupts, triple faults, and machine checks safely. These are rare but high-severity events; production systems treat them as fatal for the VM and potentially for the host if the error is hardware-wide.

Control fields are grouped into pin-based controls (interrupt behavior), primary processor-based controls (instruction intercepts), secondary controls (EPT, unrestricted guest), and entry/exit controls (state transitions). Each is constrained by capability MSRs that define which bits must be 0 or can be 1. The VMM must compute a valid control set by intersecting desired features with these constraints. This is why "feature discovery" is a separate project: without it, VM entry fails and debugging is opaque.
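As a concrete sketch of that intersection step (following the Intel SDM convention that the low 32 bits of a VMX capability MSR give the control bits that must be 1 and the high 32 bits give the bits allowed to be 1; the "true" capability MSR variants are ignored here for brevity):

#include <stdint.h>

/* Fold a desired VMX control word against its capability MSR:
 * force-set the bits the CPU requires, clear the bits it forbids. */
static uint32_t adjust_controls(uint32_t desired, uint64_t capability_msr)
{
    uint32_t must_be_one = (uint32_t)capability_msr;         /* low 32 bits  */
    uint32_t may_be_one  = (uint32_t)(capability_msr >> 32); /* high 32 bits */
    return (desired | must_be_one) & may_be_one;
}

If the adjusted value drops a bit you genuinely need, the feature is unsupported on that CPU and the VMM must fall back or refuse to start the guest.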

Interrupt virtualization illustrates the trade-offs. If every interrupt triggers an exit, latency spikes and throughput drops. Techniques like APIC virtualization and posted interrupts allow the host to deliver interrupts directly to a running guest, reducing exits. But these techniques also complicate state management, especially when guests are preempted or migrated. Virtualizing performance counters and debug facilities creates similar challenges: too much exposure can leak host details, too little exposure can break guest tools.

Different vendors implement similar ideas with different names and quirks. SVM uses VMRUN and VMCB rather than VMCS, and has different intercept bitmaps and state layouts. A hypervisor that wants portability must understand both models and map them to a common internal representation. Even when only targeting one architecture, it is useful to think in terms of the invariant: "guest runs until an intercept occurs," then the VMM resolves the event and resumes.

How this fits on projects

  • Projects 1, 2, and 10 depend on a deep understanding of VMX/SVM exits and controls.

Definitions & key terms

  • VM Exit: Transition from guest to hypervisor on a trapped event.
  • VM Entry: Transition from hypervisor to guest mode.
  • VMCS/VMCB: Control structures holding guest/host state and exit controls.
  • Type 1/Type 2: Bare-metal hypervisor vs hosted hypervisor.

Mental model diagram

Guest instruction stream
        |
        v
Sensitive op?
   |         |
  no        yes
   v         v
Runs natively  VM exit -> VMM emulates -> VM entry

How it works (step-by-step, invariants, failure modes)

  1. Hypervisor enables VMX/SVM and validates capability MSRs.
  2. It sets up VMCS/VMCB with guest and host state.
  3. It configures exit controls (what to trap) and entry controls.
  4. It enters guest mode; guest runs until an exit event.
  5. It handles the exit, updates state, and re-enters guest.

Invariants: guest state must be valid for entry, control bits must be supported, and host state must be consistent for exit. Failure modes include invalid control fields, inconsistent segment state, or unsupported features.

Minimal concrete example (protocol transcript)

EVENT: Guest executes CPUID
VMEXIT: reason=CPUID
VMM: emulates feature bits and writes virtual registers
VMENTRY: guest resumes at next instruction
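The same run-until-intercept loop is what Project 1 builds through the Linux KVM API, which hides the raw VMCS programming behind ioctls. The sketch below (error handling omitted; assumes an x86_64 Linux host with access to /dev/kvm) loads a few bytes of real-mode guest code that writes one character to a port and halts, so the only exits to handle are I/O and HLT:

#include <fcntl.h>
#include <linux/kvm.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* 16-bit guest: mov dx,0x3f8 ; mov al,'A' ; out (dx),al ; hlt */
    const uint8_t guest_code[] = { 0xba, 0xf8, 0x03, 0xb0, 'A', 0xee, 0xf4 };

    int kvm = open("/dev/kvm", O_RDWR);
    int vm  = ioctl(kvm, KVM_CREATE_VM, 0);

    /* Back 4 KiB of guest "physical" memory with anonymous host memory. */
    void *mem = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    memcpy(mem, guest_code, sizeof(guest_code));
    struct kvm_userspace_memory_region region = {
        .slot = 0, .guest_phys_addr = 0x1000,
        .memory_size = 0x1000, .userspace_addr = (uintptr_t)mem,
    };
    ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region);

    int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);
    int run_size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
    struct kvm_run *run = mmap(NULL, run_size, PROT_READ | PROT_WRITE,
                               MAP_SHARED, vcpu, 0);

    /* Point CS:IP at the loaded code; KVM validates this state on entry. */
    struct kvm_sregs sregs;
    ioctl(vcpu, KVM_GET_SREGS, &sregs);
    sregs.cs.base = 0;
    sregs.cs.selector = 0;
    ioctl(vcpu, KVM_SET_SREGS, &sregs);
    struct kvm_regs regs = { .rip = 0x1000, .rflags = 0x2 };
    ioctl(vcpu, KVM_SET_REGS, &regs);

    /* The exit-handling loop: run the guest until it halts. */
    for (;;) {
        ioctl(vcpu, KVM_RUN, 0);
        if (run->exit_reason == KVM_EXIT_IO && run->io.port == 0x3f8) {
            /* Emulate the "device": print the byte the guest wrote. */
            uint8_t *data = (uint8_t *)run + run->io.data_offset;
            printf("guest wrote '%c' to port 0x3f8\n", *data);
        } else if (run->exit_reason == KVM_EXIT_HLT) {
            printf("guest halted\n");
            break;
        } else {
            printf("unhandled exit reason %d\n", run->exit_reason);
            break;
        }
    }
    return 0;
}

Every printf in the loop corresponds to one VM exit that userspace resolved before re-entering the guest.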

Common misconceptions

  • “Type 2 is always slow.” With hardware assist, it can be near-native.
  • “VM exits are cheap.” They are expensive and dominate overhead when frequent.

Check-your-understanding questions

  1. Why does CPUID often trigger a VM exit?
  2. Why is stable CPUID exposure important for migration?

Check-your-understanding answers

  1. Because the hypervisor must control which CPU features the guest sees.
  2. A guest must see the same virtual CPU model on source and destination.

Real-world applications

  • KVM, Xen, Hyper-V, VMware ESXi
  • Nested virtualization for cloud tenants and CI systems

Where you’ll apply it

  • Project 1, Project 2, Project 10

References

  • Intel Software Developer Manual (VMX) - Vol. 3C
  • AMD64 Architecture Programmer’s Manual - Vol. 2
  • KVM API documentation (kernel.org)

Key insights CPU virtualization is an illusion contract enforced by programmable traps.

Summary You now understand how VMX/SVM modes, control structures, and exits enable hypervisors.

Homework/Exercises to practice the concept

  1. List 10 exit reasons that matter for a minimal VMM.
  2. Explain the difference between VM entry failures and guest crashes.

Solutions to the homework/exercises

  1. CPUID, MSR read/write, I/O port access, HLT, external interrupts, page faults, NMI, RDTSCP, INVD, debug exceptions.
  2. Entry failures occur before guest execution due to invalid VMCS state; crashes happen after guest execution begins.

Chapter 2: Memory Virtualization and Two-Stage Translation

Fundamentals Memory virtualization lets each guest OS believe it has contiguous physical memory. The guest maps guest virtual addresses (GVA) to guest physical addresses (GPA) using its own page tables. The hypervisor then maps GPA to host physical addresses (HPA). Historically, hypervisors used shadow page tables to maintain a combined mapping, trapping on guest page table updates. Modern CPUs provide EPT/NPT (second-level translation) so the hardware performs both translations. This reduces exits but introduces new fault types (EPT violations). Memory virtualization also enables overcommit and ballooning, which can improve utilization but introduce performance cliffs.

It is useful to separate correctness from performance. Correctness means each guest sees a consistent memory model with isolation from other guests. Performance means minimizing page faults, TLB misses, and exit overhead. Hypervisors constantly trade these goals by choosing page sizes, tracking dirty pages, and deciding when to reclaim memory. These trade-offs show up immediately in migration behavior and in the latency profile of memory-heavy workloads.

Deep Dive The fundamental challenge is that the guest OS believes it controls physical memory, but the host must multiplex and protect that memory across VMs. With shadow page tables, the VMM maintains a mapping from GVA to HPA by shadowing the guest’s page tables. Every guest write to its own page tables must be trapped so the shadow can be updated. This is correct but expensive; a busy guest can trigger thousands of exits just to update page tables, and TLB flushes become frequent.

Hardware-assisted translation (EPT on Intel, NPT on AMD) changes the trade-off. The CPU walks the guest page tables to translate GVA to GPA, then walks the EPT/NPT tables to translate GPA to HPA. The guest can update its own page tables without exits. The hypervisor handles second-level faults when the GPA does not map to an HPA. This enables lazy allocation, demand paging, and copy-on-write. However, TLB misses become more expensive because the hardware walk is two-dimensional: every step of the guest page-table walk is itself translated through EPT/NPT. Performance tuning often involves large pages, page-walk caching, and careful TLB invalidation strategies.

Overcommit introduces another layer. A hypervisor might allocate 8 GB of guest memory on a host with 4 GB of real RAM, assuming the guest will not touch it all. To make this safe, hypervisors use balloon drivers inside guests to reclaim unused pages, or use deduplication (KSM) to share identical pages across VMs. These techniques improve density but can create latency spikes under memory pressure. Live migration also depends on dirty-page tracking: the hypervisor must know which pages were modified since the last copy. Dirty logging can be done in software or via hardware dirty bits; it can be expensive if the VM writes heavily.

NUMA adds another dimension. A VM may span multiple NUMA nodes, but if its vCPUs run on one node while its memory sits on another, memory access latency increases and bandwidth drops. Hypervisors therefore try to keep vCPU and memory locality aligned. Some platforms expose virtual NUMA topologies to the guest so it can make NUMA-aware scheduling decisions. This is especially important for databases and latency-sensitive services.

Memory virtualization also affects security. Side channels such as page-fault timing or cache contention can leak information between guests. Techniques like page coloring, memory bandwidth throttling, and constant-time code paths are used in high-security environments to reduce leakage. These are advanced topics, but they highlight why memory virtualization is not just about mapping addresses; it is also about controlling shared microarchitectural resources.

Large pages (2 MB or 1 GB) reduce TLB misses and improve performance, but they interact with migration and snapshots. If the hypervisor tracks dirty data at huge-page granularity, migration can send far more data than necessary. Some systems split huge pages into smaller pages when dirty tracking begins, then coalesce later. Memory virtualization also interacts with IOMMU for device DMA: device addresses map to GPAs, which must be translated to HPAs for isolation.

A subtle but critical aspect is consistency between guest and host views. If the hypervisor overcommits too aggressively, swapping guest pages to disk can cause long pauses. If it mismanages page table invalidations, the guest may see stale data. In distributed virtualization platforms, memory policy must also consider NUMA locality; placing guest memory far from its vCPU can degrade performance even if CPU cycles are available.

How this fits on projects

  • Projects 1, 3, and 4 rely on correct memory mapping and understanding of shadow vs EPT.

Definitions & key terms

  • GVA/GPA/HPA: Guest virtual, guest physical, host physical addresses.
  • Shadow page tables: VMM-maintained combined translation tables.
  • EPT/NPT: Hardware second-level translation.
  • Ballooning: Guest driver that returns unused memory to the host.

Mental model diagram

GVA --(guest PT)--> GPA --(EPT/NPT)--> HPA

How it works (step-by-step, invariants, failure modes)

  1. Guest page tables map GVA to GPA.
  2. Hypervisor sets up EPT/NPT mapping GPA to HPA.
  3. CPU walks both tables on memory access.
  4. Second-level fault occurs if GPA is unmapped.
  5. Hypervisor allocates/maps memory or denies access.

Invariants: isolation between guests; consistent mappings; correct TLB invalidation. Failure modes include incorrect mapping, stale TLB entries, or overcommit-induced thrashing.

Minimal concrete example (translation trace)

ACCESS: GVA 0x7f00 -> GPA 0x12000
EPT: GPA 0x12000 -> HPA 0x9a000
RESULT: load/store succeeds
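A toy version of that trace can be simulated in a few lines. The sketch below flattens both levels into single-entry lookup tables keyed by page number (real hardware walks multi-level radix trees), which is enough to show the two distinct fault types:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12
#define NPAGES     256
#define UNMAPPED   UINT64_MAX

static uint64_t guest_pt[NPAGES];  /* GVA page -> GPA page (guest-managed) */
static uint64_t ept[NPAGES];       /* GPA page -> HPA page (VMM-managed)   */

static uint64_t translate(uint64_t gva)
{
    uint64_t gpn = guest_pt[(gva >> PAGE_SHIFT) % NPAGES];
    if (gpn == UNMAPPED) {
        printf("guest page fault at GVA 0x%" PRIx64 "\n", gva);
        return UNMAPPED;                /* handled entirely by the guest OS */
    }
    uint64_t hpn = ept[gpn % NPAGES];
    if (hpn == UNMAPPED) {
        printf("EPT violation at GPA 0x%" PRIx64 "\n", gpn << PAGE_SHIFT);
        return UNMAPPED;                /* VM exit: the hypervisor must map it */
    }
    return (hpn << PAGE_SHIFT) | (gva & 0xfff);
}

int main(void)
{
    for (int i = 0; i < NPAGES; i++) guest_pt[i] = ept[i] = UNMAPPED;
    guest_pt[0x7] = 0x12;    /* guest maps GVA page 0x7 to GPA page 0x12 */
    ept[0x12]     = 0x9a;    /* VMM maps GPA page 0x12 to HPA page 0x9a  */

    printf("GVA 0x7f00 -> HPA 0x%" PRIx64 "\n", translate(0x7f00));
    translate(0x8f00);       /* unmapped GVA: the guest sees a page fault */
    return 0;
}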

Common misconceptions

  • “EPT removes all VM exits.” It reduces exits for page tables, not I/O or traps.
  • “Overcommit is free.” It can cause severe latency under pressure.

Check-your-understanding questions

  1. Why are shadow page tables expensive?
  2. What causes an EPT violation?

Check-your-understanding answers

  1. The hypervisor must trap and synchronize on every guest page table write.
  2. A guest access hits a GPA without a valid EPT mapping to HPA.

Real-world applications

  • Cloud density optimization via overcommit
  • VM snapshots and copy-on-write

Where you’ll apply it

  • Project 1, Project 3, Project 4

References

  • Intel SDM Vol. 3C (EPT)
  • AMD64 APM Vol. 2 (NPT)
  • OS Concepts Ch. 9 (paging)

Key insights Memory virtualization is a two-stage translation problem with performance trade-offs.

Summary You now understand how shadow paging, EPT/NPT, and overcommit shape VM performance.

Homework/Exercises to practice the concept

  1. Draw a two-stage translation for a sample address.
  2. Explain how dirty tracking works during migration.

Solutions to the homework/exercises

  1. GVA -> GPA via guest PT, then GPA -> HPA via EPT/NPT.
  2. Hypervisor marks pages dirty on write and recopies them during migration.

Chapter 3: Device Virtualization and DMA Isolation

Fundamentals Device virtualization lets a guest believe it has NICs, disks, and other devices. There are three primary approaches: emulation (software model of a real device), paravirtualization (virtio devices with shared queues), and passthrough (direct device assignment via VFIO). Emulation is compatible but slow due to frequent exits. Virtio reduces exits by using shared memory rings. Passthrough provides near-native performance but requires an IOMMU for DMA isolation. The hypervisor must ensure that device DMA cannot access memory belonging to other guests.

The device model is a contract between guest drivers and the hypervisor. It defines register layouts, queue formats, interrupts, and reset behavior. If the hypervisor violates that contract, guest drivers will misbehave in ways that are difficult to debug. This is why device emulation often focuses on correctness first, then performance optimizations like vhost or SR-IOV.

Deep Dive I/O is often the bottleneck in virtualization because device access crosses trust and privilege boundaries. Emulated devices (e.g., legacy IDE, e1000) trigger VM exits on every register access. This can be thousands of exits per second, each with heavy overhead. Emulation is still essential for bootstrapping or compatibility with older OSes, but it is not ideal for performance.

Virtio is the industry standard for paravirtualized devices. The guest and host negotiate features, then use virtqueues (descriptor rings) in shared memory. The guest posts buffers, the host consumes them, and completion is signaled by interrupts or eventfd. This design drastically reduces exits and copies. Vhost accelerates the data path further by moving virtio handling into the host kernel, reducing context switches.

Passthrough uses VFIO to map a physical device directly into a guest. This gives near-native performance but removes the hypervisor from the data path. To make this safe, an IOMMU translates device DMA addresses and enforces isolation. The hypervisor programs an IOMMU domain for each VM, mapping guest memory into device-visible I/O virtual addresses (IOVA). Without this, a malicious or buggy device could DMA into another VM’s memory. IOMMU groups also matter: devices in the same isolation group must be assigned together, which can limit passthrough options.
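A rough sketch of that IOMMU programming step through the VFIO type1 interface is shown below (error handling, group viability checks, and the device fd are omitted; the group number 42 is a placeholder that depends on which IOMMU group the assigned device sits in):

#include <fcntl.h>
#include <linux/vfio.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

int main(void)
{
    int container = open("/dev/vfio/vfio", O_RDWR);
    int group = open("/dev/vfio/42", O_RDWR);     /* IOMMU group of the device */

    /* Bind the group to the container, then choose the type1 IOMMU backend. */
    ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
    ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

    /* "Guest RAM": 16 MiB of host memory the device will be allowed to DMA into. */
    size_t size = 16 << 20;
    void *guest_ram = mmap(NULL, size, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    /* After this call, IOVA 0..size-1 translates to this buffer and nothing else. */
    struct vfio_iommu_type1_dma_map dma_map;
    memset(&dma_map, 0, sizeof(dma_map));
    dma_map.argsz = sizeof(dma_map);
    dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
    dma_map.vaddr = (uintptr_t)guest_ram;
    dma_map.iova  = 0;
    dma_map.size  = size;
    ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);
    return 0;
}

Any DMA the device issues outside that mapped IOVA range is blocked by the IOMMU, which is exactly the isolation invariant passthrough depends on.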

SR-IOV extends PCIe devices to expose multiple Virtual Functions (VFs) so multiple VMs can share a device. Each VF has its own queues and interrupts. This yields excellent performance but complicates live migration, since device state lives on hardware. A common production trade-off is to use virtio or vhost for general workloads, and SR-IOV or passthrough only when performance is critical.

Device virtualization also includes interrupt delivery. The hypervisor must inject interrupts into the guest virtual APIC. Frequent interrupts can cause exit storms, so techniques like MSI-X, interrupt moderation, and posted interrupts are used to reduce overhead. In addition, device reset and hotplug must be handled carefully to avoid leaking state or DMA mappings across guests.

Security is a constant concern. Device firmware can be buggy, and DMA is powerful. Even with an IOMMU, a malicious device can exploit microarchitectural side channels or misconfigured mappings. Production hypervisors often restrict passthrough to known-good devices, and enforce strict reset requirements before assignment.

Virtio feature negotiation is another subtlety. The guest and host must agree on a common feature subset; otherwise, the device behavior may diverge. This is why virtio specifications define a strict negotiation handshake and versioning rules. Migration adds more constraints: the device state must be serializable and consistent across hosts. If a device backend cannot be migrated safely, the hypervisor must either refuse migration or fall back to a safe mode.

The backend matters as much as the frontend. A virtio-blk frontend backed by a qcow2 file will behave very differently from one backed by a raw block device or distributed storage. Similarly, a virtio-net frontend backed by a simple TAP device will have different performance characteristics than one backed by vhost-net or an accelerated userspace backend such as DPDK. Understanding these backend choices is essential for making performance and reliability trade-offs.

How this fits on projects

  • Projects 5 and 6 focus on virtio devices; Project 4 uses MMIO traps; Project 9 uses Ceph and virtio backends.

Definitions & key terms

  • Emulation: Full software model of a device.
  • Virtio: Standard paravirtual device interface.
  • Vhost: Kernel acceleration for virtio data paths.
  • IOMMU: Hardware DMA translation and isolation.
  • VFIO: Linux userspace device passthrough framework.

Mental model diagram

Guest driver -> virtqueue -> vhost/QEMU -> host backend
            -> (optional) VFIO/IOMMU -> physical device

How it works (step-by-step, invariants, failure modes)

  1. Guest negotiates virtio features with device.
  2. Guest posts buffers into virtqueues.
  3. Host consumes buffers and performs I/O.
  4. Host signals completion via interrupt.
  5. For passthrough, IOMMU translates DMA and enforces isolation.

Invariants: device DMA must be confined to guest memory; interrupts must be delivered only to the correct guest. Failure modes include misconfigured IOMMU mappings or excessive exit rates.

Minimal concrete example (virtqueue transcript)

GUEST: posts TX buffer #42
HOST: reads descriptor chain, writes to TAP
HOST: updates used ring, injects interrupt
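The data structures behind that transcript are small. The sketch below mirrors the split-ring layout from the virtio specification (endian-typed le64/le32/le16 fields shown as plain integers for brevity) and the guest-side act of posting one buffer; memory barriers and the doorbell/notify write are left out:

#include <stdint.h>

struct virtq_desc {            /* one guest buffer */
    uint64_t addr;             /* guest-physical address of the buffer */
    uint32_t len;
    uint16_t flags;            /* e.g. VIRTQ_DESC_F_NEXT, _WRITE */
    uint16_t next;             /* chain to the next descriptor */
};

struct virtq_avail {           /* guest -> host: "these are ready" */
    uint16_t flags;
    uint16_t idx;              /* free-running counter */
    uint16_t ring[];           /* descriptor indices */
};

struct virtq_used_elem { uint32_t id; uint32_t len; };

struct virtq_used {            /* host -> guest: "these are done" */
    uint16_t flags;
    uint16_t idx;
    struct virtq_used_elem ring[];
};

/* Guest posts descriptor 42 (a real driver adds barriers and then writes
 * the device's notify register so the host knows to look). */
static void post_buffer(struct virtq_desc *desc, struct virtq_avail *avail,
                        uint16_t qsize, uint64_t gpa, uint32_t len)
{
    desc[42] = (struct virtq_desc){ .addr = gpa, .len = len, .flags = 0, .next = 0 };
    avail->ring[avail->idx % qsize] = 42;
    avail->idx++;              /* host compares this with its last-seen index */
}

On the host side, the device model compares its last-seen index with avail->idx, walks the posted descriptor chains, performs the I/O, and records completions in the used ring before injecting an interrupt.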

Common misconceptions

  • “Virtio is always faster.” Backend configuration still dominates performance.
  • “Passthrough is always safe.” Device firmware and IOMMU groups matter.

Check-your-understanding questions

  1. Why does emulation cause high exit rates?
  2. What is an IOMMU group and why is it important?

Check-your-understanding answers

  1. Every register access traps to the hypervisor for emulation.
  2. Devices in the same group cannot be isolated, so they must be assigned together.

Real-world applications

  • High-performance storage and networking in clouds
  • NFV and low-latency appliances

Where you’ll apply it

  • Project 4, Project 5, Project 6, Project 9

References

  • OASIS Virtio spec v1.3
  • Linux VFIO documentation (kernel.org)
  • Intel VT-d / AMD-Vi documentation

Key insights Device virtualization trades compatibility for performance, and DMA isolation is non-negotiable.

Summary You now understand how emulation, virtio, and VFIO combine to virtualize I/O safely.

Homework/Exercises to practice the concept

  1. Compare virtio-blk vs virtio-scsi and list when each is preferable.
  2. Diagram a VFIO device assignment with IOMMU mappings.

Solutions to the homework/exercises

  1. Virtio-blk is simpler and has lower per-request overhead, so it suits a small number of fast disks; virtio-scsi supports richer SCSI semantics and scales to many devices per controller.
  2. Guest memory maps into an IOMMU domain; device DMA is restricted to that domain.

Chapter 4: Storage Virtualization and Snapshot Semantics

Fundamentals Storage virtualization presents a VM with a block device backed by files, raw devices, or distributed storage. Formats like qcow2 add copy-on-write (CoW) snapshots and thin provisioning, while raw disks provide the fastest data path. Storage correctness requires honoring flush and barrier semantics; otherwise, guest filesystems can corrupt data. In hyperconverged systems, distributed storage such as Ceph RBD provides durability and shared access across nodes, enabling live migration without copying disks.

Storage behavior is visible to guests through latency, throughput, and failure semantics. A guest filesystem expects that a flush means data is durable; a hypervisor that violates this contract may appear fast in benchmarks but fail under real workloads. This makes storage virtualization a correctness-first domain: performance gains are only acceptable if they preserve ordering and durability.

Deep Dive A VM issues reads and writes at block offsets. The hypervisor must map these blocks to host storage while preserving ordering and durability semantics. A raw backend maps blocks directly to host file offsets. This is fast and simple but lacks snapshots. qcow2 adds metadata indirection: a block is mapped through L1/L2 tables to a data cluster. When a block is written for the first time, qcow2 allocates a new cluster, enabling copy-on-write snapshots. This makes snapshots easy but adds latency and fragmentation, especially under random write workloads.

Caching policies are critical. Writeback caching can improve throughput but risks data loss if the host crashes. Direct I/O (cache=none) is safer for durability but may be slower. Hypervisors must honor guest flush and barrier requests; otherwise, journaled filesystems may report success but lose data after a power failure. Many production outages trace back to misconfigured cache modes or ignored flush commands.

Distributed storage changes the model. Ceph stores data as objects across OSDs. The CRUSH algorithm allows clients to compute object placement without a central metadata server. For block storage, Ceph exposes RBD images; the hypervisor uses librbd or QEMU’s rbd backend to access them. Replication or erasure coding provides durability, but introduces network and CPU overhead. Placement groups (PGs) influence data distribution and recovery speed; tuning PG counts is a common operational task.

Snapshots in distributed storage behave differently from qcow2 snapshots. They are often copy-on-write at the object level, which can cause long tail latencies during rebalancing or recovery. Backups usually rely on snapshot + incremental export, which must be coordinated with guest flushes to ensure crash consistency. Hypervisors often expose multiple storage tiers: fast local SSD for performance, and distributed storage for resilience. Choosing the right tier is a workload and SLO decision.

Storage virtualization also intersects with security and multi-tenancy. A hypervisor must ensure that one guest cannot read another guest’s data from shared backends, especially when thin provisioning and snapshot reuse are involved. Zeroing new blocks and enforcing strong access controls on storage pools are required to prevent data leakage.

Storage virtualization is also visible in failure recovery behavior. When a host fails, the guest may experience stalls while the backend fails over or rebalances. The hypervisor must decide whether to pause I/O, retry, or surface errors to the guest. These decisions define the user experience during incidents and are a major reason that storage SLOs matter in virtualization platforms.

Live migration relies on shared storage. If storage is shared, only memory and CPU state must move. If storage is local, the hypervisor must do block migration (copy disk state), which is slower and more failure-prone. Hyperconverged systems use distributed storage so live migration is feasible at scale.

Finally, storage isolation is a multi-tenant problem. Hypervisors often enforce I/O throttling using cgroups or backend policies, preventing one VM from monopolizing storage bandwidth. Without throttling, “noisy neighbor” effects can make performance unpredictable.

How this fits on projects

  • Projects 5, 7, and 9 depend on qcow2/RBD behavior and snapshot semantics.

Definitions & key terms

  • qcow2: QEMU copy-on-write disk image format.
  • Flush/Barrier: Commands that enforce write ordering and durability.
  • RBD: Ceph block device abstraction.
  • Thin provisioning: Allocating storage on demand rather than upfront.

Mental model diagram

Guest block IO -> virtio-blk -> backend (raw/qcow2/RBD)

How it works (step-by-step, invariants, failure modes)

  1. Guest issues read/write to block device.
  2. Hypervisor translates to backend offsets or objects.
  3. Backend writes data and updates metadata.
  4. Flush ensures ordering and durability.
  5. Snapshot metadata records changes for rollback.

Invariants: correct ordering, durability guarantees, consistent snapshots. Failure modes include corrupted snapshots, ignored flushes, or excessive CoW fragmentation.

Minimal concrete example (snapshot workflow)

ACTION: create snapshot "baseline"
WRITE: guest modifies blocks 120-140
RESULT: snapshot retains old data; new data stored in new clusters
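The snapshot behavior above can be mimicked with a toy mapping table. The sketch below stands in for qcow2's L1/L2 indirection with a flat array (a deliberate simplification): the first write to a block whose cluster is still referenced by the snapshot allocates a fresh cluster instead of overwriting the old one:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NBLOCKS 256

static uint64_t map[NBLOCKS];    /* guest block -> cluster id (0 = unallocated) */
static uint64_t snap[NBLOCKS];   /* mapping frozen at snapshot time             */
static uint64_t next_cluster = 1;

static void take_snapshot(void) { memcpy(snap, map, sizeof(map)); }

static void write_block(unsigned blk)
{
    uint64_t before = map[blk];
    if (map[blk] == 0 || map[blk] == snap[blk])
        map[blk] = next_cluster++;           /* CoW: allocate a new cluster */
    printf("write block %u: cluster %" PRIu64 " -> %" PRIu64 "\n",
           blk, before, map[blk]);
}

int main(void)
{
    write_block(120);            /* initial allocation: cluster 1            */
    take_snapshot();             /* "baseline" keeps cluster 1               */
    write_block(120);            /* CoW: new cluster 2, old data preserved   */
    write_block(120);            /* already diverged: rewrite in place       */
    return 0;
}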

Common misconceptions

  • “Snapshots are free.” They grow and degrade performance over time.
  • “Writeback cache is safe.” It risks data loss on host failure.

Check-your-understanding questions

  1. Why can qcow2 be slower than raw?
  2. Why is shared storage important for live migration?

Check-your-understanding answers

  1. CoW metadata lookups and fragmentation add latency.
  2. It avoids copying disk state during migration.

Real-world applications

  • VM image management in clouds
  • Hyperconverged storage for private clouds

Where you’ll apply it

  • Project 5, Project 7, Project 9

References

  • QEMU block layer documentation
  • Ceph RBD architecture documentation
  • OS Concepts Ch. 10 (storage)

Key insights Storage virtualization is correctness-first; performance is secondary to durability.

Summary You now understand how CoW, caching, and distributed storage shape VM disk behavior.

Homework/Exercises to practice the concept

  1. Compare latency of qcow2 vs raw with a small benchmark.
  2. Explain how a snapshot can impact read performance.

Solutions to the homework/exercises

  1. qcow2 typically shows higher latency due to metadata overhead.
  2. Read paths may traverse additional indirection layers for CoW blocks.

Chapter 5: Network Virtualization and Overlays

Fundamentals Network virtualization connects VMs to virtual networks independent of physical topology. On a single host, a VM NIC typically maps to a TAP device attached to a Linux bridge or Open vSwitch (OVS). Across hosts, overlays like VXLAN encapsulate L2 frames into UDP so networks can span L3 infrastructure. Virtio-net is the paravirtual NIC interface, while SR-IOV provides near-native performance but can bypass overlays.

The hypervisor must preserve Ethernet semantics for guests while enforcing isolation and policy. This means MAC learning, VLAN tagging, and filtering must work in virtual switches the same way they do on physical switches. At scale, the hypervisor also needs a control plane to program virtual switch rules and to maintain tenant-to-VNI mappings in overlay networks.

Deep Dive At the host level, a VM’s virtio-net device is backed by a TAP interface. Packets written by the guest appear on the TAP device and are forwarded by a bridge or OVS. The bridge performs L2 switching (MAC learning, forwarding), while OVS adds programmable flows, VLANs, and tunnel support. This is sufficient for single-host labs and small clusters.

Scaling across hosts requires overlays. VXLAN encapsulates Ethernet frames into UDP packets with a 24-bit VNI, enabling up to 16 million isolated networks. A VXLAN tunnel endpoint (VTEP) on each host encapsulates and decapsulates traffic. The overlay control plane may be static (manual VTEP configuration) or dynamic (EVPN, SDN controllers). OVS is commonly used as a VTEP in virtualized environments.

Isolation and security are critical. A guest must not spoof MAC/IP addresses or sniff other tenants. Hypervisors enforce anti-spoofing rules and security groups at the virtual switch. Performance is shaped by offloads (GRO, checksum offload), queue counts, and CPU pinning. Virtual switches often use kernel data paths for fast flows and userspace fallbacks for complex flows.

Overlay networks introduce MTU overhead. VXLAN adds headers, so the underlay MTU must be larger to preserve a 1500-byte guest MTU. If not, fragmentation or drops occur and can be difficult to debug. Production environments often use jumbo frames to absorb overlay overhead. Additionally, SR-IOV can bypass the virtual switch entirely, which yields performance but makes overlays and security policy enforcement more difficult.
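The arithmetic is worth keeping in mind. For a 1500-byte guest MTU over IPv4 with no VLAN tags, the encapsulated packet is 1500 + 14 (inner Ethernet) + 8 (VXLAN) + 8 (UDP) + 20 (outer IPv4) = 1550 bytes of underlay IP payload, so underlay interfaces need an MTU of at least 1550; IPv6 or VLAN tags push the requirement higher, which is why many deployments simply enable jumbo frames on the underlay.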

Virtualization also changes network observability. Tools like tcpdump or eBPF must be attached at the right layer (TAP, bridge, veth). Diagnosing packet loss requires a mental model of where packets are dropped: guest, virtio queue, TAP, bridge, overlay, or physical NIC.

Address management and control planes are critical. DHCP and ARP must behave consistently across virtual networks, and virtual routers or NAT gateways must handle east-west and north-south traffic. In many environments, a controller pushes flow rules to OVS or programs kernel eBPF filters to enforce security groups. These mechanisms determine how well multi-tenant isolation scales and how quickly policies propagate.

Performance tuning goes beyond raw throughput. The number of virtio queues, CPU pinning, interrupt moderation, and GRO/LRO settings can dramatically change latency. Overlay networks add additional processing and encapsulation overhead that can be mitigated with offloads (e.g., VXLAN offload) when supported by the NIC. Without these, CPU overhead can dominate at high packet rates, which is why many cloud providers carefully constrain network configurations for performance-sensitive workloads.

Control-plane scale introduces additional concerns such as ARP suppression and MAC learning limits. In large overlays, flooding unknown MACs can become expensive; controllers often program explicit forwarding entries or use EVPN to distribute MAC/IP mappings. DHCP relay and metadata services must be reachable from every tenant network without leaking across tenants, which often requires virtual routers and policy-aware NAT. These components live in the "virtual network" even though they are implemented in host software.

Lastly, troubleshooting virtual networks demands a layered approach. A ping failure can originate in the guest stack, virtio queues, TAP, bridge rules, overlay encapsulation, physical NIC offloads, or upstream routing. A disciplined debugging workflow that checks each layer systematically is essential; otherwise, engineers often chase the wrong layer and waste hours.

How this fits on projects

  • Projects 6, 7, 9, and 10 depend on bridging, TAP, and overlay concepts.

Definitions & key terms

  • TAP: Virtual L2 interface for VM traffic.
  • Bridge: Linux L2 switch.
  • OVS: Open vSwitch with programmable flows.
  • VXLAN: L2 overlay protocol over UDP.

Mental model diagram

Guest NIC -> virtio-net -> TAP -> Bridge/OVS -> (VXLAN) -> NIC

How it works (step-by-step, invariants, failure modes)

  1. Guest sends an Ethernet frame via virtio-net.
  2. TAP receives frame on the host.
  3. Bridge/OVS forwards based on MAC/VLAN/VNI.
  4. VXLAN encapsulates if remote.
  5. Remote VTEP decapsulates and delivers to guest.

Invariants: isolation between tenants, correct MTU, consistent MAC learning. Failure modes include MTU mismatch, misconfigured VTEP, or missing security rules.

Minimal concrete example (packet path)

VM A -> TAP -> OVS -> VXLAN (VNI 42) -> OVS -> TAP -> VM B
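At the host end of that path, a TAP device is just a file descriptor: whatever the VM transmits arrives as raw Ethernet frames that can be read(), and frames written to it are injected toward the guest. The sketch below (requires CAP_NET_ADMIN; attaching tap0 to a bridge and bringing the link up are separate ip-link steps) creates a TAP interface and prints the destination MAC of each frame it receives:

#include <fcntl.h>
#include <linux/if.h>
#include <linux/if_tun.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/net/tun", O_RDWR);
    if (fd < 0) { perror("open /dev/net/tun"); return 1; }

    struct ifreq ifr;
    memset(&ifr, 0, sizeof(ifr));
    ifr.ifr_flags = IFF_TAP | IFF_NO_PI;        /* L2 frames, no extra header */
    strncpy(ifr.ifr_name, "tap0", IFNAMSIZ - 1);
    if (ioctl(fd, TUNSETIFF, &ifr) < 0) { perror("TUNSETIFF"); return 1; }

    unsigned char frame[2048];
    for (;;) {
        ssize_t n = read(fd, frame, sizeof(frame));   /* one Ethernet frame */
        if (n <= 0) break;
        /* The destination MAC is the first 6 bytes of the frame. */
        printf("frame of %zd bytes to %02x:%02x:%02x:%02x:%02x:%02x\n",
               n, frame[0], frame[1], frame[2], frame[3], frame[4], frame[5]);
    }
    close(fd);
    return 0;
}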

Common misconceptions

  • “VXLAN replaces VLANs.” It runs over VLANs and routed networks.
  • “SR-IOV always helps.” It can reduce flexibility and migration options.

Check-your-understanding questions

  1. Why does VXLAN need a VTEP?
  2. What happens if underlay MTU is too small?

Check-your-understanding answers

  1. The VTEP encapsulates and decapsulates overlay traffic.
  2. Frames are fragmented or dropped, causing silent failures.

Real-world applications

  • Cloud tenant isolation
  • Multi-host VM networks

Where you’ll apply it

  • Project 6, Project 7, Project 9, Project 10

References

  • RFC 7348 (VXLAN)
  • OVS documentation
  • “Computer Networks” by Tanenbaum - Ch. 2-5

Key insights Network virtualization is a layered system; misconfigurations propagate across layers.

Summary You now understand how bridges, OVS, and VXLAN build virtual networks.

Homework/Exercises to practice the concept

  1. Build a two-VM bridge and verify MAC learning.
  2. Create a simple VXLAN tunnel and ping across hosts.

Solutions to the homework/exercises

  1. Use bridge fdb show to confirm learned MACs.
  2. Configure VTEPs on each host and validate connectivity.

Chapter 6: Live Migration and Checkpoint/Restore

Fundamentals Live migration moves a running VM between hosts with minimal downtime. Pre-copy iteratively copies memory while the VM runs, then pauses briefly to copy remaining dirty pages. Post-copy starts the VM on the destination earlier and fetches pages on demand, reducing downtime but increasing risk if the source fails. Migration requires device state transfer, which is straightforward for virtio but difficult for passthrough devices. Checkpoint/restore for containers (CRIU) shares similar concepts at process level.

Migration is a core operational tool: it enables maintenance without downtime, load balancing, and proactive failure avoidance. The correctness constraints are strict: there must be a single active VM instance, memory state must be consistent, and storage must not diverge. These constraints shape every design decision in migration systems.

Deep Dive Live migration is essentially a distributed checkpoint. A VM’s state consists of CPU registers, memory pages, and device state. In pre-copy, the source iteratively copies memory pages while the VM runs. Each round sends pages dirtied since the last iteration. If the dirty rate is lower than available bandwidth, the process converges. The final stop-and-copy phase pauses the VM, transfers remaining dirty pages and CPU state, and resumes on the destination. Downtime is proportional to the last copy phase.

Post-copy flips the trade-off: the VM is paused briefly, minimal state is transferred, and the VM resumes on the destination. Missing pages are fetched from the source on demand, usually via userfaultfd. This reduces downtime but increases risk; if the source fails mid-migration, the VM can crash because required pages are lost. Many platforms use a hybrid approach: pre-copy until convergence stalls, then switch to post-copy.

Dirty tracking is critical. Hypervisors mark pages as dirty using hardware bits or write-protection. This overhead can be significant for write-heavy workloads. Compression and delta encoding can reduce bandwidth but increase CPU usage. Migration also requires compatible CPU features across hosts, stable virtual device models, and shared storage. Without shared storage, block migration must copy disks, which can be slower than memory migration.

Failure modes are important. Pre-copy can be safely canceled; the VM continues on the source. Post-copy has a “point of no return” where the destination becomes the source of truth. Production systems use migration tunnels with bandwidth caps and prioritization to avoid impacting other workloads. They also require coordination services (leases, fencing) to ensure only one host considers the VM active.

Container checkpoint/restore shares many concepts but operates at process level. It must capture namespace state, file descriptors, and memory. It is typically easier for stateless processes than for processes with external I/O or network connections. The connection between VM migration and container checkpointing is conceptual: both are about consistent state capture and resumption under distributed failure.

Migration also interacts with networking and storage. If the VM’s IP address must remain stable, ARP tables or virtual switch rules must update quickly. If storage is not shared, block migration must stream disk changes, which can dwarf memory migration time. These practical constraints are why many platforms restrict migration to shared-storage environments or to workloads that can tolerate brief pauses.

Device state serialization is another bottleneck. Virtio devices are designed to be migratable, but even they require careful coordination so that queues, in-flight requests, and interrupts are transferred consistently. Passthrough devices rarely support live migration; in those cases, platforms either disable migration or require device-specific migration support. This is why cloud providers often avoid passthrough for workloads that demand mobility.

Bandwidth management is operationally critical. Migration traffic can overwhelm a production network if left unconstrained, so many platforms enforce bandwidth caps, schedule migrations during low-traffic windows, or throttle based on real-time congestion signals. These policies are not merely "nice to have"; they are required to prevent migrations from degrading unrelated workloads.

How this fits on projects

  • Projects 9 and 10 depend on live migration concepts.

Definitions & key terms

  • Pre-copy: Copy memory while VM runs, then stop-and-copy.
  • Post-copy: Resume VM early and fault pages on demand.
  • Dirty page: A page modified since last migration round.
  • Stop-and-copy: Final pause to transfer remaining state.

Mental model diagram

Pre-copy: run -> copy dirty -> run -> copy dirty -> stop -> final copy
Post-copy: stop -> minimal state -> resume -> fetch pages on fault

How it works (step-by-step, invariants, failure modes)

  1. Establish migration channel between hosts.
  2. Begin pre-copy rounds, track dirty pages.
  3. Pause VM and copy remaining state.
  4. Start VM on destination, release source.

Invariants: only one active VM at a time, consistent CPU model, shared storage or block migration. Failure modes include non-converging dirty rate and source failure during post-copy.

Minimal concrete example (migration trace)

ROUND 1: 8 GB copied, dirty rate 200 MB/s
ROUND 2: 1.2 GB copied
STOP: 120 MB copied, downtime 120 ms
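The shape of that trace falls out of a simple model: each round copies roughly what was dirtied during the previous round, so the remaining data shrinks geometrically only while the dirty rate stays below the link bandwidth. The sketch below runs that model with illustrative numbers (not tied to any particular hypervisor):

#include <stdio.h>

int main(void)
{
    double remaining_mb  = 8192.0;   /* guest RAM still to copy */
    double bandwidth_mbs = 1000.0;   /* migration link, MB/s */
    double dirty_mbs     = 200.0;    /* guest dirtying rate, MB/s */
    double budget_mb     = 128.0;    /* max data in the stop-and-copy phase */

    for (int round = 1; round <= 30; round++) {
        double t = remaining_mb / bandwidth_mbs;        /* time for this round */
        printf("round %2d: copy %8.1f MB in %5.2f s\n", round, remaining_mb, t);
        remaining_mb = dirty_mbs * t;                   /* dirtied meanwhile */
        if (remaining_mb <= budget_mb) {
            printf("stop-and-copy: %.1f MB, downtime ~%.0f ms\n",
                   remaining_mb, 1000.0 * remaining_mb / bandwidth_mbs);
            return 0;
        }
    }
    printf("did not converge: dirty rate too high for the link\n");
    return 0;
}

Raising dirty_mbs above bandwidth_mbs makes the loop print the non-convergence message, which is exactly the situation where production systems throttle the guest or switch to post-copy.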

Common misconceptions

  • “Migration is always safe.” Post-copy can fail if the source crashes.
  • “Passthrough devices migrate easily.” Most do not without vendor support.

Check-your-understanding questions

  1. Why might pre-copy fail to converge?
  2. What is the primary risk of post-copy?

Check-your-understanding answers

  1. The VM dirties memory faster than the network can copy it.
  2. If the source fails, missing pages cannot be fetched.

Real-world applications

  • Host evacuation and maintenance without downtime
  • Automated failover in private clouds

Where you’ll apply it

  • Project 9, Project 10

References

  • QEMU migration documentation
  • CRIU documentation

Key insights Migration is bandwidth-limited and correctness depends on strict coordination.

Summary You now understand pre-copy vs post-copy migration and the constraints that govern them.

Homework/Exercises to practice the concept

  1. Measure downtime for a VM under different dirty rates.
  2. Explain why shared storage simplifies migration.

Solutions to the homework/exercises

  1. Higher dirty rates cause longer stop-and-copy phases.
  2. Only memory/CPU state moves; disks stay accessible.

Chapter 7: Containers vs VMs and OS-Level Isolation

Fundamentals Containers isolate processes using Linux namespaces and cgroups instead of full hardware virtualization. A container runtime sets up namespaces (PID, mount, network, IPC, UTS, user) and applies cgroup limits for CPU, memory, and I/O. The Open Container Initiative (OCI) defines the runtime spec and bundle format so runtimes are interoperable. Containers are lightweight but share the host kernel, so their isolation boundary depends on kernel security.

Because containers share the host kernel, they emphasize speed and portability rather than strong isolation. This makes them ideal for microservices, CI, and rapid scaling, but less ideal for untrusted multi-tenant workloads unless paired with stronger isolation (for example, running containers inside VMs).

Deep Dive Containers are not mini-VMs. They are processes with restricted views of global resources. Namespaces provide isolation: PID namespaces give each container its own process tree; mount namespaces give each container its own filesystem view; network namespaces give separate network stacks; user namespaces map host IDs so a process can be root inside the container without host root. Cgroups enforce resource limits and accounting. cgroup v2 is now the unified model; it requires explicit controller enablement and supports fine-grained I/O throttling.

An OCI runtime consumes a bundle directory containing a root filesystem and a config description. The runtime uses clone/unshare to create namespaces, configures mounts (often via overlayfs), attaches the process to a cgroup, drops capabilities, applies seccomp filters, and then execs the target process. Proper cleanup is critical; leaked cgroups or mounts can cause resource starvation or security issues.

Security is fundamentally different from VMs. Because the kernel is shared, a kernel exploit can compromise all containers. Production systems mitigate this with seccomp, AppArmor/SELinux, and sometimes stronger isolation layers such as gVisor or Kata Containers (which run containers inside lightweight VMs). This is why many cloud providers still run containers inside VMs for multi-tenant isolation.

Containers also have unique operational quirks: PID 1 semantics (signal handling and zombie reaping), filesystem copy-up behavior under overlayfs, and namespace lifetime tied to the init process. Observability is often simpler than VMs because everything is still a host process, but the namespace boundaries can make debugging confusing without the right tooling (nsenter, lsns).

Containers and VMs are complementary. VMs provide hardware isolation; containers provide packaging and rapid startup. Modern infrastructure often uses VMs as the tenant isolation boundary and containers for workload portability inside those VMs.

Operationally, containers bring their own challenges. Resource limits are enforced by the same kernel scheduler, so noisy neighbors can still appear if limits are misconfigured. Filesystem semantics (overlayfs copy-up, inode behavior) can surprise applications. Networking often relies on additional layers such as CNI plugins, which add complexity and can introduce subtle performance regressions. Understanding these realities helps prevent overestimating the safety or simplicity of containerization.

Security hardening goes beyond namespaces and cgroups. Most production systems apply seccomp filters to reduce the system call surface, and use LSMs such as AppArmor or SELinux to enforce policy. Rootless containers rely on user namespaces and often need idmapped mounts to avoid expensive ownership changes in large filesystem trees. These mechanisms make containers safer, but they also add operational complexity and can break assumptions in legacy software.

Container runtime behavior also matters for correctness. Signal handling and process reaping are subtle: PID 1 has special semantics, so many systems use a tiny init process inside the container to forward signals and reap zombies. Time and randomness sources are shared with the host, which can affect reproducibility. Understanding these details helps you design container runtimes that behave predictably under load and failure.

Image and filesystem management is another practical layer. Container images are assembled from layers, and overlayfs merges those layers into a single view. Copy-on-write behavior can create surprising disk usage and performance patterns, especially when applications write heavily to files that exist in lower layers. Runtimes must also handle extraction, verification, and garbage collection of images so the host does not accumulate stale data over time.

How this fits on projects

  • Project 8 is entirely built on these primitives.

Definitions & key terms

  • Namespaces: Kernel isolation of global resources.
  • cgroups: Resource control and accounting.
  • OCI: Open Container Initiative runtime spec.
  • Bundle: Rootfs + config describing a container.

Mental model diagram

Process -> namespaces (pid, net, mnt, uts, ipc, user)
        -> cgroups (cpu, memory, io)

How it works (step-by-step, invariants, failure modes)

  1. Create namespaces for the container.
  2. Configure rootfs and mounts (often overlayfs).
  3. Attach process to a cgroup with limits.
  4. Drop capabilities and apply seccomp.
  5. Exec target process and manage lifecycle.

Invariants: isolation of process IDs and mounts; cgroup limits enforced. Failure modes include leaking mounts, missing cgroup controllers, or PID 1 not reaping children.

Minimal concrete example (lifecycle trace)

RUNTIME: create namespaces -> mount rootfs -> set hostname -> apply cgroups
RUNTIME: exec /bin/sh as PID 1
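A stripped-down version of that lifecycle fits in one clone() call. The sketch below (run as root, or adapt with CLONE_NEWUSER; rootfs setup, cgroups, capability drops, and seccomp are all omitted) creates UTS, PID, and mount namespaces and runs a shell as PID 1 inside them:

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <sys/mount.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static char child_stack[1024 * 1024];

static int child(void *arg)
{
    (void)arg;
    sethostname("minibox", 7);                 /* visible only in this UTS ns */
    /* Stop mount events from propagating back to the host namespace,
       then give the new PID namespace its own /proc so ps looks right. */
    mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL);
    mount("proc", "/proc", "proc", 0, NULL);
    char *argv[] = { "/bin/sh", NULL };
    execv(argv[0], argv);                      /* this process is PID 1 here */
    perror("execv");
    return 1;
}

int main(void)
{
    int flags = CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWNS | SIGCHLD;
    pid_t pid = clone(child, child_stack + sizeof(child_stack), flags, NULL);
    if (pid < 0) { perror("clone"); return 1; }
    waitpid(pid, NULL, 0);                     /* reap the container "init" */
    return 0;
}

Inside the shell, hostname reports minibox and ps sees only the new PID namespace; a production runtime such as runc adds the rootfs pivot, cgroup attachment, capability dropping, and seccomp steps described above.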

Common misconceptions

  • “Containers are just lightweight VMs.” They share the host kernel.
  • “Rootless containers are fully safe.” Kernel bugs still apply.

Check-your-understanding questions

  1. Why does PID 1 behave differently inside a container?
  2. What is the risk of sharing the host kernel?

Check-your-understanding answers

  1. PID 1 ignores some signals and must reap zombies.
  2. A kernel exploit can compromise all containers.

Real-world applications

  • Microservices and CI pipelines
  • Multi-tenant platforms (often inside VMs)

Where you’ll apply it

  • Project 8

References

  • Linux namespaces man pages
  • cgroup v2 documentation
  • OCI runtime spec

Key insights Containers trade kernel isolation for speed and packaging simplicity.

Summary You now understand how namespaces, cgroups, and OCI define container isolation.

Homework/Exercises to practice the concept

  1. Explain why user namespaces enable rootless containers.
  2. Sketch the lifecycle of an OCI runtime from bundle to exec.

Solutions to the homework/exercises

  1. User namespaces map host IDs to container IDs without host root.
  2. Create namespaces, configure mounts, set cgroups, drop caps, exec.

Chapter 8: Hyperconverged Infrastructure (HCI) and Distributed Systems

Fundamentals Hyperconverged infrastructure merges compute, storage, and networking into a single cluster. Each node runs a hypervisor and contributes storage to a distributed system (often Ceph). HCI relies on quorum to prevent split-brain, replication or erasure coding for durability, and failure domains to spread risk. The goal is to simplify operations while enabling high availability (HA) and live migration without external SANs.

HCI shifts the operational center of gravity. Instead of separate storage and compute teams, the same platform must manage both. This simplifies procurement and scaling but introduces coupling between compute load, storage load, and network health. Understanding these couplings is essential for designing reliable HCI systems.

Deep Dive HCI is a distributed systems problem in disguise. Every VM write becomes a distributed write. A storage system like Ceph stores objects across OSDs according to the CRUSH algorithm. Clients compute object placement using the cluster map, avoiding centralized metadata bottlenecks. This design scales but requires careful configuration of placement groups, replication factors, and failure domains. If these are misconfigured, a single rack or power domain failure can still destroy data.

Quorum is a safety mechanism. Cluster components (Corosync for Proxmox, Ceph MONs) require a majority of votes to make progress. In a 3-node cluster, losing one node still leaves quorum; losing two does not. When quorum is lost, the cluster typically halts VM operations to avoid split-brain writes. This is not a bug; it is a deliberate safety trade-off. Some deployments add a qdevice as a tie-breaker to improve availability in small clusters.

Replication vs erasure coding is another central trade-off. Replication is simple and fast but costs 2x-3x storage overhead. Erasure coding reduces overhead but increases CPU and network cost on writes and recovery. Many HCI systems use replication for hot data and erasure coding for cold data. The operational cost of rebalancing during node additions or failures can be significant; recovery traffic can saturate networks and degrade VM I/O latency.

HCI also collapses operational domains. The same team must manage compute scheduling, storage health, and network capacity. This creates new failure coupling: a network issue can degrade storage, which then stalls VM disks, which then impacts compute. Monitoring must cover OSD latency, recovery backlogs, network saturation, and VM performance. Without this, operators often misattribute issues and apply the wrong fix.

Finally, HCI is sensitive to workload mix. Adding nodes increases both compute and storage, but real workloads may be skewed toward one. Capacity planning must consider CPU, RAM, and storage together rather than independently. This is why HCI can simplify infrastructure but can also create inefficiencies if workloads are unbalanced.

Operational tooling matters. Cluster upgrades, firmware updates, and network changes can all affect storage availability. Mature HCI stacks provide rolling upgrade procedures, health gates, and automated recovery. A lab environment should emulate these workflows at small scale so you build intuition for safe operations under failure.

Ceph-specific tuning is a practical reality in HCI labs. The number of placement groups, the replication factor, and the CRUSH failure domain hierarchy all influence recovery time and performance. If placement groups are too few, data distribution is uneven; too many, and metadata overhead grows. During recovery, backfill traffic can saturate the network, so many operators throttle recovery to protect VM latency. These operational levers are part of the HCI skill set, not optional extras.

Network separation is another common design choice. Many HCI deployments use dedicated networks for storage replication and for VM traffic to reduce congestion and jitter. In small labs this may be simulated with VLANs rather than physical NICs, but the principle is the same: storage traffic is bursty and can starve VM traffic if not isolated. This is why HCI design always includes network planning alongside compute and storage planning.

How this fits on projects

  • Project 9 (HCI lab) and Project 10 (mini cloud) rely on quorum, replication, and placement concepts.

Definitions & key terms

  • HCI: Hyperconverged Infrastructure.
  • Quorum: Majority vote for safety in distributed systems.
  • CRUSH: Ceph’s placement algorithm.
  • Failure domain: Boundary used to spread replicas (disk, host, rack).

Mental model diagram

VM write -> RBD client -> CRUSH -> OSD replicas across nodes

How it works (step-by-step, invariants, failure modes)

  1. Client writes data to an RBD image.
  2. CRUSH computes placement group and OSDs.
  3. Data is replicated or erasure-coded across nodes.
  4. MONs track cluster health and maps.
  5. On failure, data is re-replicated to restore durability.

Invariants: quorum maintained, replicas in distinct failure domains, consistent cluster maps. Failure modes include split-brain, degraded PGs, and network saturation.
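
The following is a deliberately simplified placement sketch, not the real CRUSH algorithm: it hashes an object name and picks the requested number of replicas from distinct hosts in a flat list. Real CRUSH walks a weighted hierarchy of failure domains, but the sketch captures the key property that any client with the same map computes the same placement without asking a metadata server.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_HOSTS 5
#define REPLICAS  3

/* Tiny FNV-1a hash so every client computes the same placement
 * from the same object name and host list (the shared "map"). */
static uint64_t hash_name(const char *s)
{
    uint64_t h = 1469598103934665603ULL;
    while (*s) { h ^= (unsigned char)*s++; h *= 1099511628211ULL; }
    return h;
}

int main(void)
{
    const char *hosts[NUM_HOSTS] = { "node1", "node2", "node3", "node4", "node5" };
    const char *object = "rbd_data.vm101.0000000000000042";

    uint64_t h = hash_name(object);
    printf("object %s maps to hosts:", object);
    for (int r = 0; r < REPLICAS; r++)
        /* Distinct hosts because the stride walks consecutive slots. */
        printf(" %s", hosts[(h + r) % NUM_HOSTS]);
    printf("\n");
    return 0;
}
```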

Minimal concrete example (health check transcript)

STATUS: HEALTH_OK
OSD: 6 up, 6 in
QUORUM: 3/3 MONs

Common misconceptions

  • “HCI always reduces cost.” It can increase operational complexity.
  • “Replication is enough.” Poor failure domains still cause data loss.

Check-your-understanding questions

  1. Why does losing quorum force the cluster to stop serving?
  2. What is the role of CRUSH in Ceph?

Check-your-understanding answers

  1. To prevent split-brain and inconsistent writes.
  2. It computes object placement without a central metadata server.

Real-world applications

  • Private cloud virtualization
  • Edge clusters and branch office deployments

Where you’ll apply it

  • Project 9, Project 10

References

  • Ceph architecture documentation
  • Proxmox cluster manager documentation
  • “Designing Data-Intensive Applications” Ch. 5, 9

Key insights HCI turns virtualization into a distributed storage and quorum problem.

Summary You now understand why HCI depends on quorum, placement, and failure domains.

Homework/Exercises to practice the concept

  1. Simulate a node failure and observe recovery behavior.
  2. Explain the difference between replication and erasure coding.

Solutions to the homework/exercises

  1. Stop an OSD and watch the cluster report degraded PGs and recovery.
  2. Replication stores full copies; erasure coding stores shards with parity.

Chapter 9: Control Planes, Scheduling, and Observability

Fundamentals A control plane manages VM lifecycle, scheduling, policy, and observability. Libvirt provides a consistent API for VM definition and lifecycle across hypervisors. QEMU provides device models and the runtime. A control plane tracks desired state, reconciles actual state, and integrates with metrics and logs. Without observability (metrics, logs, tracing), VM performance problems are guesswork.

Control planes are not optional in production. Even a small cluster needs a single source of truth for VM identity, ownership, and placement. This is why control planes often include authentication, authorization, and audit logging from the start: without these, operational mistakes become outages. They also provide the integration surface for automation, billing, and policy enforcement across many hosts.

Deep Dive Control planes separate desired state from actual state. Users declare what should exist (a VM with 2 vCPUs, 4 GB RAM), and controllers reconcile reality by creating or updating VMs. This is a distributed systems problem: state must be durable, API calls must be idempotent, and failures must not create duplicate VMs. Many control planes store state in a database and use reconciliation loops similar to Kubernetes.
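
The loop itself is small; the hard parts are durable state and idempotent actions. Here is a minimal sketch with stub start/stop helpers standing in for a database read and libvirt calls.

```c
#include <stdio.h>

typedef enum { STOPPED, RUNNING } vm_state;

struct vm { const char *name; vm_state desired; vm_state actual; };

/* Stand-ins: a real control plane would read desired state from its
 * database, query libvirt for actual state, and call libvirt to act. */
static void start_vm(struct vm *v) { printf("[reconcile] starting %s\n", v->name); v->actual = RUNNING; }
static void stop_vm(struct vm *v)  { printf("[reconcile] stopping %s\n", v->name); v->actual = STOPPED; }

static void reconcile(struct vm *vms, int n)
{
    for (int i = 0; i < n; i++) {
        if (vms[i].desired == vms[i].actual)
            continue;                  /* already converged: do nothing */
        if (vms[i].desired == RUNNING)
            start_vm(&vms[i]);         /* idempotent: safe to retry */
        else
            stop_vm(&vms[i]);
    }
}

int main(void)
{
    struct vm vms[] = {
        { "vm-101", RUNNING, STOPPED },   /* drift: must be started */
        { "vm-102", STOPPED, STOPPED },   /* converged */
    };
    reconcile(vms, 2);
    reconcile(vms, 2);                    /* second pass is a no-op */
    return 0;
}
```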

Scheduling is central. A scheduler must account for CPU, memory, storage, and network capacity. It often supports policies such as anti-affinity, NUMA alignment, or power-aware placement. Overcommit is a lever: you can allocate more vCPUs than physical cores, but this increases CPU steal time and latency variance. Good schedulers use real-time metrics to decide placement and avoid stale data.
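
As an illustration of capacity-aware placement with an explicit overcommit cap, here is a minimal greedy scheduler sketch over hypothetical host metrics; it is not a production policy.

```c
#include <stdio.h>

struct host { const char *name; int vcpus_total; int vcpus_used; int ram_mb_free; };

/* Pick the host with the most free RAM that can also take the vCPUs
 * without exceeding a fixed overcommit ratio. Returns -1 if none fits. */
static int place(struct host *hosts, int n, int want_vcpus, int want_ram_mb, double overcommit)
{
    int best = -1;
    for (int i = 0; i < n; i++) {
        double vcpu_cap = hosts[i].vcpus_total * overcommit;
        if (hosts[i].vcpus_used + want_vcpus > vcpu_cap) continue;
        if (hosts[i].ram_mb_free < want_ram_mb) continue;   /* RAM is never overcommitted here */
        if (best < 0 || hosts[i].ram_mb_free > hosts[best].ram_mb_free) best = i;
    }
    return best;
}

int main(void)
{
    struct host hosts[] = {
        { "node1", 16, 30,  8192 },   /* already heavily overcommitted on CPU */
        { "node2", 16, 10, 16384 },
        { "node3", 16, 12,  4096 },
    };
    int idx = place(hosts, 3, 2, 4096, 2.0);
    printf("[decision] vm-101 -> %s\n", idx >= 0 ? hosts[idx].name : "no capacity");
    return 0;
}
```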

Observability ties it together. Hypervisors expose metrics like vCPU run time, VM exit counts, dirty page rates, and disk latency. Logs record VM lifecycle events and migration progress. Tracing tools (perf, ftrace, eBPF) can attribute latency to host or guest. Without these signals, diagnosing performance regressions is nearly impossible.

Control planes also enforce security and multi-tenancy. They must validate requests, apply quotas, and enforce network and storage ACLs. Audit logs are essential for compliance. They also need to integrate with identity systems and secrets management, because VM credentials and images are sensitive assets.

Finally, control planes must handle partial failure. A host may be unreachable but still running VMs. The control plane must decide whether to fence and restart those VMs elsewhere, balancing safety with availability. This is why leases, heartbeats, and fencing are standard patterns.

Control planes also expose interfaces for automation. Webhooks, event streams, and rate limits protect the system from overload while still enabling integration with CI/CD and monitoring. Even in a mini-cloud, these concerns surface quickly: without backpressure, a burst of API requests can cascade into host overload and VM instability.

Event-driven design is essential. Libvirt and QEMU expose event streams (including QMP) that notify when VMs change state, migrate, or encounter errors. A control plane that ignores events quickly diverges from reality. Conversely, a control plane that treats events as authoritative can rebuild state after crashes and recover from partial failures. This is why reconciliation loops are a foundational pattern.

Security is a control-plane concern too. Access control must be enforced at the API boundary, and actions should be audited. VM images must be verified and stored in trusted registries. Quotas prevent a single tenant from exhausting cluster resources. These policies are not just administrative; they directly influence scheduling and availability, because a scheduler can only make safe decisions when it knows the limits and priorities of each tenant.

Metrics design influences both stability and user trust. If the control plane tracks only coarse CPU usage, it may oversubscribe memory or saturate storage I/O without noticing. Good control planes track queue depth, latency percentiles, and error rates, then feed those signals into placement and admission control. This is effectively a feedback system: poorly chosen signals lead to oscillation and instability, while good signals stabilize throughput and latency.

How this fits on projects

  • Projects 7 and 10 rely on control plane and scheduling concepts.

Definitions & key terms

  • Control plane: Management layer for VM lifecycle and policy.
  • Reconciliation: Converging actual state to desired state.
  • Scheduler: Placement engine for VMs.
  • Observability: Metrics, logs, tracing.

Mental model diagram

API -> Scheduler -> Host -> libvirt/QEMU -> VM
     ^                                  |
     +----------- metrics/logs ----------+

How it works (step-by-step, invariants, failure modes)

  1. User submits VM spec via API.
  2. Scheduler chooses a host based on policy and capacity.
  3. Control plane calls libvirt/QEMU to start the VM.
  4. Metrics and logs feed back into the scheduler.
  5. Reconciliation loops fix drift or failures.

Invariants: idempotent operations, single ownership of VM instances, durable state. Failure modes include stale metrics, split-brain scheduling, or missing audit logs.

Minimal concrete example (state transition)

DESIRED: vm-101 = running
ACTUAL: vm-101 = stopped
ACTION: control plane issues start -> state converges

Common misconceptions

  • “Scheduling is just least-loaded.” Real schedulers must consider multiple constraints.
  • “Observability is optional.” Without it, debugging becomes guesswork.

Check-your-understanding questions

  1. Why is idempotency critical for VM create APIs?
  2. How does observability influence scheduling decisions?

Check-your-understanding answers

  1. Retries must not create duplicate VMs after partial failures.
  2. Real-time metrics prevent overcommit and noisy-neighbor issues.

Real-world applications

  • OpenStack Nova
  • Cloud provider VM orchestration systems

Where you’ll apply it

  • Project 7, Project 10

References

  • Libvirt domain XML format
  • QEMU QMP documentation
  • “Fundamentals of Software Architecture” (orchestration patterns)

Key insights A hypervisor without a control plane is a lab toy; production requires orchestration.

Summary You now understand how control planes, schedulers, and observability make virtualization usable at scale.

Homework/Exercises to practice the concept

  1. Design a simple placement policy that avoids NUMA mismatch.
  2. List the top 5 metrics you would monitor for VM health.

Solutions to the homework/exercises

  1. Place VMs on hosts with local NUMA memory and available cores.
  2. CPU steal time, memory pressure, disk latency, network latency, exit rates.

Glossary

  • APIC: Interrupt controller for CPUs.
  • CRUSH: Ceph’s data placement algorithm.
  • EPT/NPT: Hardware second-level page translation.
  • HCI: Hyperconverged Infrastructure.
  • IOMMU: DMA isolation and address translation.
  • KVM: Kernel-based Virtual Machine.
  • Libvirt: Hypervisor management API.
  • QEMU: Emulator and device model.
  • SR-IOV: PCIe virtualization for multiple virtual functions.
  • VM Exit: Transition from guest to hypervisor.

Why Virtualization, Hypervisors, and Hyperconvergence Matters

  • Modern motivation and real-world use cases: Cloud infrastructure, compliance isolation, multi-tenant hosting, CI, and secure sandboxing all depend on virtualization.
  • Real-world statistics and impact:
    • Gartner forecasts worldwide public cloud end-user spending of $723.4B in 2025, up from $595.7B in 2024 (Nov 2024). Source: Gartner press release.
    • Gartner predicts 90% of organizations will adopt hybrid cloud through 2027 (Nov 2024). Source: Gartner press release.
    • CNCF’s 2024 survey reports 89% cloud native adoption and 93% of organizations using or evaluating Kubernetes (Apr 2025 announcement). Source: CNCF.

Context & Evolution (short):

  • Early x86 virtualization used binary translation and paravirtualization.
  • VT-x/AMD-V enabled efficient hardware-assisted virtualization.
  • HCI emerged as distributed storage matured and enterprises sought simpler ops.

Old vs new model diagram

Traditional DC                 Hyperconverged Model
+----------+ +----------+      +---------------------+
| Compute  | | Storage  |      | Compute + Storage   |
| Servers  | | SAN/NAS  |      | + Network in cluster|
+----------+ +----------+      +---------------------+

Concept Summary Table

| Concept Cluster | What You Need to Internalize |
|---|---|
| CPU Virtualization | VMX/SVM modes, VM exits, stable CPU models |
| Memory Virtualization | GVA/GPA/HPA, EPT/NPT, shadow paging |
| Device Virtualization | Emulation vs virtio vs VFIO, DMA isolation |
| Storage Virtualization | qcow2, snapshots, caching, Ceph RBD |
| Network Virtualization | TAP/bridge, OVS, VXLAN overlays |
| Live Migration | Pre-copy vs post-copy, dirty tracking |
| Containers vs VMs | Namespaces, cgroups, OCI runtime |
| Hyperconverged Infrastructure | Quorum, CRUSH, replication |
| Control Planes | Scheduling, reconciliation, observability |

Project-to-Concept Map

| Project | Concepts Applied |
|---|---|
| Project 1: Toy KVM Hypervisor | CPU, Memory, Device |
| Project 2: VMX Capability Explorer | CPU |
| Project 3: Shadow Page Table Simulator | Memory |
| Project 4: Userspace Memory Mapper + MMIO | Memory, Device |
| Project 5: Virtio Block Device | Device, Storage |
| Project 6: Virtio Net Device | Device, Network |
| Project 7: Vagrant-Style Orchestrator | Control Plane, Storage, Network |
| Project 8: Container Runtime | Containers vs VMs |
| Project 9: Hyperconverged Home Lab | HCI, Storage, Network, Migration |
| Project 10: Mini Cloud Control Plane | Control Plane, Migration, HCI, Network |

Deep Dive Reading by Concept

| Concept | Book and Chapter | Why This Matters |
|---|---|---|
| CPU Virtualization | “Operating System Concepts” - Ch. 16 | Hypervisor fundamentals and VM execution model |
| Memory Virtualization | “Computer Systems: A Programmer’s Perspective” - Ch. 9 | Address translation and paging mechanics |
| Device Virtualization | “Operating System Concepts” - Ch. 13 | I/O systems and device abstraction |
| Storage Virtualization | “Operating System Concepts” - Ch. 10 | Disk management and file systems |
| Network Virtualization | “Computer Networks” - Ch. 2-5 | L2/L3 behavior and overlays |
| Live Migration | “Operating System Concepts” - Ch. 16 | VM management concepts |
| Containers vs VMs | “Modern Operating Systems” - Ch. 7 | OS-level virtualization and isolation |
| HCI & Distributed Storage | “Designing Data-Intensive Applications” - Ch. 5, 9 | Replication and consensus |
| Control Planes | “Fundamentals of Software Architecture” - orchestration chapters | Scheduling and control systems |

Quick Start: Your First 48 Hours

Day 1:

  1. Read Chapter 1 and Chapter 2 (CPU + memory virtualization).
  2. Validate KVM support on your host.
  3. Start Project 1 and get the first VM exit log.

Day 2:

  1. Read Chapter 3 (device virtualization) and Chapter 6 (migration).
  2. Validate Project 1 against Definition of Done.
  3. Skim Project 7 to see how control planes use libvirt.

Path 1: Hardware-First Systems Engineer

  • Project 2 -> Project 1 -> Project 3 -> Project 4 -> Project 5 -> Project 6 -> Project 9 -> Project 10

Path 2: Platform Engineer

  • Project 7 -> Project 9 -> Project 10 -> Project 1 -> Project 8

Path 3: Containers-to-VMs Bridge

  • Project 8 -> Project 7 -> Project 1 -> Project 9 -> Project 10

Success Metrics

  • You can explain VM exits and map them to hypervisor actions.
  • You can configure virtio devices and compare performance to emulated devices.
  • You can build a container runtime that enforces cgroup limits.
  • You can run a 3-node HCI lab with quorum and live migration.
  • You can design a scheduler that makes placement decisions using metrics.

Optional Appendix: Tooling and Debugging Cheat Sheet

Core commands

  • virsh list --all (libvirt domains)
  • perf kvm stat (VM exit stats)
  • ceph -s (Ceph health)
  • pvecm status (Proxmox cluster status)

Common failure signatures

  • Repeated VM exits with the same reason = missing emulation logic.
  • Ceph HEALTH_WARN after node failure = degraded placement groups.
  • VM migration stuck at high dirty rate = workload too write-heavy.

Project Overview Table

| Project | Difficulty | Time | Focus | Hardware Needed |
|---|---|---|---|---|
| Toy KVM Hypervisor | Advanced | 2-3 weeks | VM exits, KVM API | KVM-capable Linux |
| VMX Capability Explorer | Intermediate | 1 week | VMX features | VT-x CPU |
| Shadow Page Table Simulator | Advanced | 2-3 weeks | Memory virtualization | Any Linux |
| Userspace Memory Mapper + MMIO | Intermediate | 1-2 weeks | MMIO traps | Any Linux |
| Virtio Block Device | Advanced | 3-4 weeks | Storage virtualization | Linux + QEMU |
| Virtio Net Device | Advanced | 3-4 weeks | Network virtualization | Linux + TAP |
| Vagrant-Style Orchestrator | Intermediate | 1-2 weeks | Control plane | Linux + libvirt |
| Container Runtime | Advanced | 2-3 weeks | OS-level isolation | Linux |
| Hyperconverged Home Lab | Advanced | 3-4 weeks | HCI + Ceph | 3 nodes |
| Mini Cloud Control Plane | Expert | 6-10 weeks | Scheduler + API | Multi-node lab |

Project List

The following projects guide you from VM fundamentals to a working hyperconverged control plane.

Project 1: Toy KVM Hypervisor

  • File: P01-toy-kvm-hypervisor.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust
  • Coolness Level: Level 4: Real VM Magic
  • Business Potential: Level 2: Foundational Infra Skill
  • Difficulty: Level 4: Advanced
  • Knowledge Area: KVM API, VM exits
  • Software or Tool: KVM (/dev/kvm), QEMU for comparison
  • Main Book: “Operating System Concepts” - Ch. 16

What you will build: A minimal userspace program that creates a VM, maps memory, runs a vCPU, and logs VM exits.

Why it teaches virtualization: You will see the exact boundary between guest and hypervisor in a controlled, observable loop.

Core challenges you will face:

  • VM creation -> CPU virtualization concepts
  • Guest memory mapping -> Memory virtualization concepts
  • Exit handling -> Device emulation and trap handling

Real World Outcome

You will boot a tiny guest payload and see VM exits logged with precise reasons, including I/O and HLT.

Exact CLI output:

$ sudo ./kvm-toy
[kvm] /dev/kvm opened
[kvm] VM created (vm_fd=5)
[kvm] Memory mapped: guest=0x00000000 size=2MB
[kvm] vCPU created (vcpu_fd=7)
[kvm] KVM_RUN
[exit] reason=KVM_EXIT_IO port=0x3f8 size=1 data='H'
[exit] reason=KVM_EXIT_IO port=0x3f8 size=1 data='i'
[exit] reason=KVM_EXIT_HLT

The Core Question You Are Answering

“What exactly happens when a guest executes a privileged instruction?”

Explain why the boundary between guest and hypervisor is the central abstraction of virtualization.

Concepts You Must Understand First

  1. VM exits and VM entry controls
    • Book Reference: “Operating System Concepts” - Ch. 16
  2. Two-stage memory translation
    • Book Reference: “Computer Systems: A Programmer’s Perspective” - Ch. 9
  3. Device I/O trapping
    • Book Reference: “Operating System Concepts” - Ch. 13

Questions to Guide Your Design

  1. Guest memory layout
    • Where will the guest code be loaded?
    • How will you ensure the guest starts with valid CPU state?
  2. Exit handling strategy
    • Which exit reasons will you handle first?
    • How will you log exits for debugging?

Thinking Exercise

Minimal VM State Sketch the smallest CPU state needed to enter guest mode and run a single instruction.

Questions to answer:

  • Which registers must be initialized for VM entry?
  • What happens if the guest executes HLT?

The Interview Questions They Will Ask

  1. “What is a VM exit, and why is it expensive?”
  2. “How does KVM separate CPU virtualization from device emulation?”
  3. “Why is CPUID trapped by hypervisors?”
  4. “How do you map guest memory in KVM?”

Hints in Layers

Hint 1: Start with VM creation Use the KVM API sequence: open /dev/kvm, create VM, create vCPU.

Hint 2: Map guest memory Use a single contiguous region for guest RAM to reduce early complexity.

Hint 3: Run loop outline (pseudocode)

SETUP: create VM, map memory, init vCPU
LOOP: enter guest -> handle exit -> resume

Hint 4: Debugging If you see repeated KVM_EXIT_FAIL_ENTRY, verify VMCS state and KVM API version.
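
For orientation, here is a condensed C sketch of the classic minimal KVM loop (error handling omitted; the guest payload writes 'H' and 'i' to port 0x3f8 and halts). Treat it as a reference for the ioctl ordering, not a finished hypervisor. Compile with gcc and run with access to /dev/kvm.

```c
#include <fcntl.h>
#include <linux/kvm.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* 16-bit real-mode payload: print 'H', 'i' on port 0x3f8, then halt. */
    const uint8_t code[] = {
        0xba, 0xf8, 0x03,   /* mov dx, 0x3f8 */
        0xb0, 'H',          /* mov al, 'H'   */
        0xee,               /* out dx, al    */
        0xb0, 'i',          /* mov al, 'i'   */
        0xee,               /* out dx, al    */
        0xf4,               /* hlt           */
    };

    int kvm = open("/dev/kvm", O_RDWR | O_CLOEXEC);
    int vm  = ioctl(kvm, KVM_CREATE_VM, 0);

    /* One 2 MB slot of guest RAM backed by anonymous host memory. */
    size_t mem_size = 2 << 20;
    void *mem = mmap(NULL, mem_size, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    memcpy(mem, code, sizeof(code));

    struct kvm_userspace_memory_region region = {
        .slot = 0, .guest_phys_addr = 0,
        .memory_size = mem_size, .userspace_addr = (uint64_t)(uintptr_t)mem,
    };
    ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region);

    int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);
    int run_size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
    struct kvm_run *run = mmap(NULL, run_size, PROT_READ | PROT_WRITE,
                               MAP_SHARED, vcpu, 0);

    /* Start executing at guest physical address 0 in real mode. */
    struct kvm_sregs sregs;
    ioctl(vcpu, KVM_GET_SREGS, &sregs);
    sregs.cs.base = 0; sregs.cs.selector = 0;
    ioctl(vcpu, KVM_SET_SREGS, &sregs);

    struct kvm_regs regs;
    memset(&regs, 0, sizeof(regs));
    regs.rip = 0; regs.rflags = 2;          /* bit 1 of RFLAGS is always 1 */
    ioctl(vcpu, KVM_SET_REGS, &regs);

    for (;;) {
        ioctl(vcpu, KVM_RUN, 0);
        switch (run->exit_reason) {
        case KVM_EXIT_IO:
            if (run->io.direction == KVM_EXIT_IO_OUT && run->io.port == 0x3f8)
                putchar(*((char *)run + run->io.data_offset));
            break;
        case KVM_EXIT_HLT:
            printf("\n[exit] KVM_EXIT_HLT\n");
            return 0;
        default:
            printf("[exit] unhandled reason %d\n", run->exit_reason);
            return 1;
        }
    }
}
```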

Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| Virtualization | “Operating System Concepts” | Ch. 16 |
| Virtual Memory | “Computer Systems: A Programmer’s Perspective” | Ch. 9 |
| System Interfaces | “The Linux Programming Interface” | Ch. 4 |

Common Pitfalls and Debugging

Problem 1: “KVM_RUN returns -1”

  • Why: VM or vCPU not initialized correctly.
  • Fix: Verify API version and initial register state.
  • Quick test: strace to confirm ioctl sequence.

Problem 2: “Guest prints garbage”

  • Why: Guest code loaded at wrong address.
  • Fix: Align guest entry point with expected CS:IP.
  • Quick test: Dump guest memory and verify payload bytes.

Definition of Done

  • VM initializes and runs guest code
  • Exit reasons are logged
  • Serial port I/O output is visible
  • Guest halts cleanly

Project 2: VMX Capability Explorer

  • File: P02-vmx-capability-explorer.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust
  • Coolness Level: Level 3: Hardware Archaeology
  • Business Potential: Level 2: Kernel Skills
  • Difficulty: Level 3: Intermediate
  • Knowledge Area: VMX/SVM capabilities
  • Software or Tool: Kernel module or privileged tooling
  • Main Book: Intel SDM Vol. 3C

What you will build: A tool that reports VMX capabilities: supported controls, EPT features, and VMCS constraints.

Why it teaches virtualization: It forces you to interpret the CPU’s virtualization feature set, a prerequisite for any real hypervisor.

Core challenges you will face:

  • CPUID enumeration -> CPU virtualization concepts
  • MSR interpretation -> VMX controls
  • Feature reporting -> Control plane observability

Real World Outcome

You will load the tool and see a human-readable report of VMX support.

Exact CLI output:

$ sudo ./vmx-explorer
[VMX] VMX supported: yes
[VMX] VMCS revision ID: 0x12
[VMX] EPT: supported, 4-level page walk
[VMX] Unrestricted guest: supported

The Core Question You Are Answering

“What virtualization features does this CPU actually support, and which are mandatory?”

Concepts You Must Understand First

  1. VMX controls and MSRs
    • Book Reference: Intel SDM Vol. 3C
  2. CPUID feature exposure
    • Book Reference: “Computer Systems: A Programmer’s Perspective” - Ch. 3
  3. EPT basics
    • Book Reference: Intel SDM Vol. 3C

Questions to Guide Your Design

  1. Output structure
    • How will you summarize capabilities for humans?
    • Which features are mandatory for your later VMM?
  2. Safety constraints
    • How will you avoid crashing the host when reading MSRs?

Thinking Exercise

Feature Matrix Build a matrix of VMX controls and classify them as required, optional, or unsupported.

Questions to answer:

  • Which features are required for unrestricted guest mode?
  • Which features are useful but not required for a toy VMM?

The Interview Questions They Will Ask

  1. “What is the purpose of IA32_VMX_BASIC?”
  2. “Why do VMX control MSRs have allowed-0 and allowed-1 bitmasks?”
  3. “What does ‘unrestricted guest’ enable?”
  4. “How do you determine whether EPT is supported?”

Hints in Layers

Hint 1: Start with CPUID Detect VT-x support before reading MSRs.

Hint 2: Read capability MSRs Enumerate VMX capability MSRs in Intel SDM Vol. 3C.

Hint 3: Parsing outline (pseudocode)

READ MSR -> split low/high -> compute allowed bits -> print summary

Hint 4: Debugging If the system crashes, you likely accessed an MSR without privilege or validation.
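
One userspace shortcut for exploring the capability MSRs (before committing to a kernel module) is the /dev/cpu/*/msr interface exposed by the msr kernel module. A minimal sketch, assuming root and that the module is loaded; interpreting individual bits is left to the SDM tables:

```c
#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Reads an MSR on CPU 0 through /dev/cpu/0/msr (needs root and the
 * "msr" kernel module loaded). The MSR number is the pread offset. */
static int rdmsr(uint32_t msr, uint64_t *value)
{
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) return -1;
    ssize_t n = pread(fd, value, sizeof(*value), msr);
    close(fd);
    return n == sizeof(*value) ? 0 : -1;
}

int main(void)
{
    uint64_t basic, proc;
    if (rdmsr(0x480 /* IA32_VMX_BASIC */, &basic) ||
        rdmsr(0x482 /* IA32_VMX_PROCBASED_CTLS */, &proc)) {
        perror("rdmsr");
        return 1;
    }
    printf("[VMX] VMCS revision ID: 0x%" PRIx32 "\n", (uint32_t)(basic & 0x7fffffff));
    /* For control MSRs, the low dword constrains which bits may be 0 and
     * the high dword which bits may be 1 (see the SDM's VMX capability appendix). */
    printf("[VMX] PROCBASED_CTLS low=0x%08" PRIx32 " high=0x%08" PRIx32 "\n",
           (uint32_t)proc, (uint32_t)(proc >> 32));
    return 0;
}
```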

Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| VMX | Intel SDM Vol. 3C | Ch. 23-33 |
| Kernel modules | “Linux Device Drivers” | Ch. 2 |

Common Pitfalls and Debugging

Problem 1: “VMX not supported”

  • Why: BIOS virtualization disabled.
  • Fix: Enable VT-x/AMD-V in BIOS.
  • Quick test: egrep -c '(vmx|svm)' /proc/cpuinfo.

Definition of Done

  • Reports VMX availability correctly
  • Lists key control capabilities
  • Outputs a readable summary
  • Runs safely without crashing the host

Project 3: Shadow Page Table Simulator

  • File: P03-shadow-page-table-simulator.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Python
  • Coolness Level: Level 4: MMU Wizardry
  • Business Potential: Level 2: Systems Cred
  • Difficulty: Level 4: Advanced
  • Knowledge Area: Memory virtualization
  • Software or Tool: Simulator
  • Main Book: “Operating Systems: Three Easy Pieces” - VM chapters

What you will build: A simulator that tracks guest page tables and maintains a shadow mapping from GVA to HPA.

Why it teaches virtualization: It shows why shadow paging was expensive and why EPT was a breakthrough.

Core challenges you will face:

  • Three address spaces -> Memory virtualization concepts
  • Trap simulation -> VM exit behavior
  • Dirty tracking -> Migration foundations

Real World Outcome

You will see a trace of guest page table writes triggering shadow updates and a final performance report.

Exact CLI output:

$ ./shadow-sim
[GUEST] PTE write: GVA 0x1000 -> GPA 0x5000
[VMM] Trap on PT write, updating shadow
[CPU] Shadow translation: GVA 0x1000 -> HPA 0x8A000
[STATS] PT writes: 1247, shadow updates: 1247, TLB flushes: 89

The Core Question You Are Answering

“Why was shadow paging so expensive, and what did EPT fix?”

Concepts You Must Understand First

  1. Page tables and TLBs
    • Book Reference: “Operating Systems: Three Easy Pieces” - Ch. 18-20
  2. Two-stage translation
    • Book Reference: “Computer Systems: A Programmer’s Perspective” - Ch. 9
  3. VM exits
    • Book Reference: “Operating System Concepts” - Ch. 16

Questions to Guide Your Design

  1. Tracing model
    • How will you represent guest PTs and shadow PTs?
    • How will you count simulated VM exits?
  2. Fault modeling
    • How will you simulate TLB flushes and page faults?

Thinking Exercise

Translate by Hand Manually translate two GVA addresses using a guest PT and a shadow PT.

Questions to answer:

  • Which steps would cause VM exits in real hardware?
  • What data structures must the hypervisor maintain?

The Interview Questions They Will Ask

  1. “Why do shadow page tables require trapping guest PT writes?”
  2. “How does EPT reduce VM exits?”
  3. “What is a TLB shootdown and why is it expensive?”
  4. “Why do large pages complicate dirty tracking?”

Hints in Layers

Hint 1: Keep the model small Simulate a two-level guest page table before adding more levels.

Hint 2: Track events Log every guest PT write and shadow update.

Hint 3: Simulation loop (pseudocode)

ACCESS -> translate -> if PT write then trap -> update shadow

Hint 4: Debugging If translations are wrong, print each level of the page walk.
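
A toy version of the trap-and-update rule fits on a page. The sketch below keeps the guest page table and the shadow table as flat arrays and routes every guest PTE write through a function that plays the role of the write trap; the GPA-to-HPA mapping is a fixed offset purely for illustration.

```c
#include <stdint.h>
#include <stdio.h>

#define PAGES      16
#define PAGE_SIZE  0x1000
#define GPA_TO_HPA(gpa) ((gpa) + 0x80000)   /* fake host placement for the demo */

static uint32_t guest_pt[PAGES];    /* GVA page -> GPA */
static uint32_t shadow_pt[PAGES];   /* GVA page -> HPA */
static int traps;

/* Every guest PTE write is trapped; the VMM recomputes the shadow entry. */
static void guest_pte_write(int vpn, uint32_t gpa)
{
    traps++;
    guest_pt[vpn]  = gpa;
    shadow_pt[vpn] = GPA_TO_HPA(gpa);
    printf("[VMM] trap on PT write: GVA 0x%x -> GPA 0x%x, shadow -> HPA 0x%x\n",
           vpn * PAGE_SIZE, gpa, shadow_pt[vpn]);
}

/* The "hardware" only ever walks the shadow table. */
static uint32_t translate(uint32_t gva)
{
    return shadow_pt[gva / PAGE_SIZE] + (gva % PAGE_SIZE);
}

int main(void)
{
    guest_pte_write(1, 0x5000);
    printf("[CPU] GVA 0x1234 -> HPA 0x%x\n", translate(0x1234));
    printf("[STATS] PT writes trapped: %d\n", traps);
    return 0;
}
```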

Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| Paging | “Operating Systems: Three Easy Pieces” | Ch. 18-20 |
| Memory systems | “CSAPP” | Ch. 9 |

Common Pitfalls and Debugging

Problem 1: “Shadow PT does not match guest”

  • Why: Missing update on guest PT write.
  • Fix: Mark guest PT pages read-only in the simulator and trap writes.
  • Quick test: Compare shadow mapping to guest mapping after each write.

Definition of Done

  • Shadow PT updates on every guest PT write
  • Translations are correct for test inputs
  • Stats report simulated VM exits
  • Output demonstrates overhead clearly

Project 4: Userspace Memory Mapper with MMIO Traps

  • File: P04-userspace-memory-mapper-mmio.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust
  • Coolness Level: Level 3: Memory Alchemy
  • Business Potential: Level 2: Core Systems Skill
  • Difficulty: Level 3: Intermediate
  • Knowledge Area: MMIO, memory management
  • Software or Tool: mmap, signal handling
  • Main Book: “The Linux Programming Interface” - Ch. 49

What you will build: A userspace memory manager that simulates guest RAM and MMIO regions using access traps.

Why it teaches virtualization: It mirrors how VMMs carve MMIO regions and intercept device accesses.

Core challenges you will face:

  • Memory mapping -> Memory virtualization
  • MMIO traps -> Device emulation
  • Dirty tracking -> Migration foundations

Real World Outcome

You will see reads and writes to RAM succeed while MMIO accesses trigger a handler.

Exact CLI output:

$ ./memmap
[MEM] Guest RAM: 512MB
[MEM] MMIO: 0x10000000-0x10001000 (UART)
[GUEST] Write RAM OK
[GUEST] Write MMIO -> TRAP
[UART] output: 'A'

The Core Question You Are Answering

“How do hypervisors detect and emulate device accesses in guest memory space?”

Concepts You Must Understand First

  1. Virtual memory and protection
    • Book Reference: “The Linux Programming Interface” - Ch. 49
  2. MMIO concepts
    • Book Reference: “Operating System Concepts” - Ch. 13
  3. Signal handling
    • Book Reference: “The Linux Programming Interface” - Ch. 20-22

Questions to Guide Your Design

  1. Address space layout
    • How will you define RAM vs MMIO regions?
    • How will you detect MMIO accesses?
  2. Trap handling
    • How will you emulate device responses?

Thinking Exercise

Memory Map Sketch Draw a memory map with RAM, MMIO, and reserved regions.

Questions to answer:

  • Which regions must trap?
  • Which regions should be read-only?

The Interview Questions They Will Ask

  1. “What is MMIO and why is it used?”
  2. “How do hypervisors intercept MMIO accesses?”
  3. “Why do you need dirty tracking for migration?”
  4. “What are the risks of incorrect memory protection?”

Hints in Layers

Hint 1: Start small Map a small RAM region and a tiny MMIO page first.

Hint 2: Trap design Use page protections to force faults on MMIO ranges.

Hint 3: Handling outline (pseudocode)

FAULT -> check address -> if MMIO then emulate -> resume

Hint 4: Debugging Log the fault address and verify it falls within your MMIO range.
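
A bare-bones version of the trap path uses mmap with PROT_NONE for the MMIO window and a SIGSEGV handler that inspects si_addr. The sketch below only reports the faulting offset and then unlocks the page so the program can continue; a real device model would instead decode the access, emulate the device, and advance past the faulting instruction.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MMIO_SIZE 0x1000

static void *mmio_base;

static void on_fault(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)ctx;
    char *addr = info->si_addr;
    if (addr >= (char *)mmio_base && addr < (char *)mmio_base + MMIO_SIZE) {
        /* printf is not async-signal-safe; tolerated only for this demo. */
        printf("[TRAP] MMIO access at offset 0x%lx\n",
               (unsigned long)(addr - (char *)mmio_base));
        /* Demo only: unlock the page so the faulting store can retry. */
        mprotect(mmio_base, MMIO_SIZE, PROT_READ | PROT_WRITE);
        return;
    }
    _exit(1);   /* genuine crash: not our MMIO window */
}

int main(void)
{
    void *ram = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    mmio_base = mmap(NULL, MMIO_SIZE, PROT_NONE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    ((char *)ram)[0] = 'A';                 /* plain RAM: no trap */
    printf("[GUEST] RAM write OK\n");
    ((volatile char *)mmio_base)[0] = 'B';  /* MMIO: faults into on_fault() */
    printf("[GUEST] MMIO write returned after trap\n");
    return 0;
}
```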

Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| Memory mapping | “The Linux Programming Interface” | Ch. 49 |
| Signals | “The Linux Programming Interface” | Ch. 20-22 |

Common Pitfalls and Debugging

Problem 1: “All accesses fault”

  • Why: RAM mapped with no permissions.
  • Fix: Verify protection flags for RAM regions.
  • Quick test: Write to RAM and confirm no trap.

Definition of Done

  • RAM reads/writes succeed
  • MMIO accesses trap and are handled
  • Dirty pages are tracked
  • Output clearly distinguishes RAM vs MMIO

Project 5: Virtio Block Device Emulator

  • File: P05-virtio-block-device.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust
  • Coolness Level: Level 4: Storage Blacksmith
  • Business Potential: Level 3: Infra Utility
  • Difficulty: Level 4: Advanced
  • Knowledge Area: Virtio, storage
  • Software or Tool: QEMU-style virtio backend
  • Main Book: “Understanding the Linux Kernel” - storage chapters

What you will build: A virtio-blk compatible backend that reads/writes to a host file or device.

Why it teaches virtualization: It forces you to implement the virtio protocol and block request semantics.

Core challenges you will face:

  • Virtqueue parsing -> Device virtualization
  • I/O semantics -> Storage correctness
  • Flush handling -> Data durability

Real World Outcome

A Linux guest can format, mount, and use your virtual disk.

Exact CLI output:

$ ./vblk --backing=disk.img
[VBLK] init ok
[VBLK] virtqueue ready
[VBLK] READ sector=2048 len=8
[VBLK] WRITE sector=4096 len=16
[VBLK] FLUSH

The Core Question You Are Answering

“How does virtio move block I/O from guest to host efficiently?”

Concepts You Must Understand First

  1. Virtio queues
    • Book Reference: OASIS virtio spec
  2. Block I/O semantics
    • Book Reference: “Operating System Concepts” - Ch. 10
  3. Flush and durability
    • Book Reference: “Operating System Concepts” - Ch. 10

Questions to Guide Your Design

  1. Queue handling
    • How will you parse descriptor chains?
    • How will you signal completion?
  2. Backend storage
    • How will you map sectors to file offsets?

Thinking Exercise

Queue Walkthrough Draw a virtqueue with three descriptors (header, data, status) and trace one request.

Questions to answer:

  • Where does the status byte live?
  • How does the host know the request is complete?

The Interview Questions They Will Ask

  1. “What is a virtqueue and why does it reduce exits?”
  2. “How do you handle flush requests?”
  3. “Why is virtio-blk simpler than virtio-scsi?”
  4. “What is the impact of backend caching modes?”

Hints in Layers

Hint 1: Start with read-only Implement reads first to simplify error handling.

Hint 2: Add writes and flush Ensure flush semantics are honored for guest filesystems.

Hint 3: Processing outline (pseudocode)

IF new descriptor -> parse -> read/write -> update used ring -> interrupt

Hint 4: Debugging Compare guest sector numbers with file offsets to catch mapping errors.
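
However you wire up the virtqueue, the inner step is small: interpret the request header from the first descriptor, translate the sector to a byte offset, and write a status byte. Below is a hedged sketch of that step; the header layout and type/status values follow the virtio-blk portion of the virtio spec, while handle_request and its arguments are hypothetical stand-ins for your descriptor-chain parsing.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* First descriptor of every virtio-blk request (little-endian on the wire). */
struct virtio_blk_req_hdr {
    uint32_t type;       /* 0 = read (IN), 1 = write (OUT), 4 = flush */
    uint32_t reserved;
    uint64_t sector;     /* in 512-byte units, regardless of backing geometry */
};

#define VIRTIO_BLK_T_IN     0
#define VIRTIO_BLK_T_OUT    1
#define VIRTIO_BLK_T_FLUSH  4

/* Status byte written into the last descriptor of the chain. */
#define VIRTIO_BLK_S_OK     0
#define VIRTIO_BLK_S_IOERR  1
#define VIRTIO_BLK_S_UNSUPP 2

/* Hypothetical helper: the caller has already gathered the descriptor
 * chain into (hdr, data, len) and passes the backing file descriptor. */
static uint8_t handle_request(int backing_fd, const struct virtio_blk_req_hdr *hdr,
                              void *data, size_t len)
{
    off_t off = (off_t)hdr->sector * 512;
    ssize_t n;

    switch (hdr->type) {
    case VIRTIO_BLK_T_IN:
        n = pread(backing_fd, data, len, off);
        return n == (ssize_t)len ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR;
    case VIRTIO_BLK_T_OUT:
        n = pwrite(backing_fd, data, len, off);
        return n == (ssize_t)len ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR;
    case VIRTIO_BLK_T_FLUSH:
        /* Durability boundary: the guest filesystem depends on this. */
        return fdatasync(backing_fd) == 0 ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR;
    default:
        return VIRTIO_BLK_S_UNSUPP;
    }
}

int main(void)
{
    /* Smoke test against a local backing file. */
    char buf[512] = "hello virtio-blk";
    char out[512] = {0};
    int fd = open("disk.img", O_RDWR | O_CREAT, 0644);
    struct virtio_blk_req_hdr w = { VIRTIO_BLK_T_OUT, 0, 0 };
    struct virtio_blk_req_hdr r = { VIRTIO_BLK_T_IN,  0, 0 };
    printf("write status=%u\n", (unsigned)handle_request(fd, &w, buf, sizeof(buf)));
    printf("read  status=%u data=\"%s\"\n", (unsigned)handle_request(fd, &r, out, sizeof(out)), out);
    close(fd);
    return 0;
}
```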

Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| Storage | “Operating System Concepts” | Ch. 10 |
| Device models | “Understanding the Linux Kernel” | Ch. 14 |

Common Pitfalls and Debugging

Problem 1: “Guest filesystem corrupts”

  • Why: Flush/barrier requests ignored.
  • Fix: Treat flush as a durability boundary.
  • Quick test: Create a filesystem, copy files, reboot VM, verify integrity.

Definition of Done

  • Guest can read/write the disk
  • Flush requests are handled
  • Virtqueue completion is correct
  • Disk survives guest reboot without corruption

Project 6: Virtio Net Device Emulator

  • File: P06-virtio-net-device.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust
  • Coolness Level: Level 4: Packet Wizard
  • Business Potential: Level 3: Infra Utility
  • Difficulty: Level 4: Advanced
  • Knowledge Area: Virtio networking, TAP
  • Software or Tool: TAP device, virtio-net backend
  • Main Book: “Understanding Linux Network Internals”

What you will build: A virtio-net backend that connects a guest to a host TAP device.

Why it teaches virtualization: It reveals how packets move between guest and host with minimal overhead.

Core challenges you will face:

  • Virtio-net queues -> Device virtualization
  • Packet handling -> Network virtualization
  • TAP integration -> Host networking

Real World Outcome

The guest can ping the host through your device.

Exact CLI output:

$ ./vnet --tap=tap0
[VNET] virtio-net ready
[VNET] MAC: 52:54:00:12:34:56
[VNET] TX: 98 bytes
[VNET] RX: 98 bytes

The Core Question You Are Answering

“How do virtual NICs move packets without emulating full hardware?”

Concepts You Must Understand First

  1. Virtio networking
    • Book Reference: OASIS virtio spec
  2. TAP and bridges
    • Book Reference: “Understanding Linux Network Internals” - Ch. 14
  3. Packet structure
    • Book Reference: “TCP/IP Illustrated” - Ch. 3

Questions to Guide Your Design

  1. Queue separation
    • How will you manage RX and TX queues?
  2. Host integration
    • How will you connect to TAP and configure IP?

Thinking Exercise

Packet Path Trace a guest ping to the host through virtio-net and TAP.

Questions to answer:

  • Where is the packet encapsulated or modified?
  • Where could it be dropped?

The Interview Questions They Will Ask

  1. “What does the virtio-net header contain?”
  2. “Why does TAP exist in Linux?”
  3. “How does virtio-net reduce VM exits?”
  4. “What happens if the host MTU is too small?”

Hints in Layers

Hint 1: Start with TX only Send guest packets to TAP and confirm with tcpdump.

Hint 2: Add RX Inject host packets into the guest RX queue.

Hint 3: Processing outline (pseudocode)

RX: read TAP -> wrap header -> push to guest queue
TX: pull guest buffer -> strip header -> write TAP

Hint 4: Debugging Use tcpdump -i tap0 to confirm packet flow.
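
On the host side, the device is just a file descriptor on a TAP interface. Here is a minimal sketch of opening one and counting received frames; it uses the standard TUNSETIFF ioctl, needs enough privilege to attach (or a pre-created TAP via ip tuntap), and leaves addressing and bridging to ip(8).

```c
#include <fcntl.h>
#include <linux/if.h>
#include <linux/if_tun.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Opens a TAP interface and returns its fd. IFF_NO_PI gives raw
 * Ethernet frames with no packet-info header prepended. */
static int tap_open(const char *name)
{
    int fd = open("/dev/net/tun", O_RDWR);
    if (fd < 0) return -1;

    struct ifreq ifr;
    memset(&ifr, 0, sizeof(ifr));
    ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
    strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);

    if (ioctl(fd, TUNSETIFF, &ifr) < 0) { close(fd); return -1; }
    return fd;
}

int main(void)
{
    int fd = tap_open("tap0");
    if (fd < 0) { perror("tap_open"); return 1; }
    printf("[VNET] reading Ethernet frames from tap0 (Ctrl-C to stop)\n");

    char frame[2048];
    for (;;) {
        ssize_t n = read(fd, frame, sizeof(frame));
        if (n <= 0) break;
        /* A virtio-net backend would prepend a virtio_net_hdr here and
         * push the frame into the guest RX virtqueue. */
        printf("[VNET] RX: %zd bytes\n", n);
    }
    close(fd);
    return 0;
}
```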

Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| Networking | “Understanding Linux Network Internals” | Ch. 14 |
| Packet formats | “TCP/IP Illustrated” | Ch. 3 |

Common Pitfalls and Debugging

Problem 1: “Guest cannot reach host”

  • Why: TAP not bridged or IP not configured.
  • Fix: Ensure TAP is up and in the correct bridge.
  • Quick test: Ping between host and guest with tcpdump.

Definition of Done

  • Guest sends packets through virtio-net
  • Host receives packets on TAP
  • Guest receives replies
  • Ping works end-to-end

Project 7: Build a Vagrant-Style VM Orchestrator

  • File: P07-vagrant-style-orchestrator.md
  • Main Programming Language: Python or Go
  • Alternative Programming Languages: Rust
  • Coolness Level: Level 4: Infra Builder
  • Business Potential: Level 3: DevOps Utility
  • Difficulty: Level 3: Intermediate
  • Knowledge Area: Control planes, libvirt
  • Software or Tool: libvirt, QEMU
  • Main Book: “Modern Operating Systems” - Ch. 7

What you will build: A CLI that provisions VMs from a config file, manages lifecycle, and supports snapshots.

Why it teaches virtualization: It connects declarative state to actual VM actions and exposes control-plane challenges.

Core challenges you will face:

  • Idempotent VM creation -> Control plane concepts
  • Storage provisioning -> Storage virtualization
  • Network setup -> Network virtualization

Real World Outcome

You run myvagrant up and get a working VM with SSH access.

Exact CLI output:

$ myvagrant up
[vm] define domain vm1
[vm] create disk 20G
[vm] attach cloud-init
[vm] start vm1
[vm] ssh ready: 192.168.122.50

The Core Question You Are Answering

“How does a control plane map a config into a real VM lifecycle?”

Concepts You Must Understand First

  1. Libvirt domain XML
    • Book Reference: libvirt docs
  2. Storage backends
    • Book Reference: “Operating System Concepts” - Ch. 10
  3. Bridges and TAP
    • Book Reference: “Computer Networks” - Ch. 2-3

Questions to Guide Your Design

  1. State management
    • How will you store VM metadata between runs?
  2. Idempotency
    • How will you avoid duplicate VMs on retry?

Thinking Exercise

Config Mapping Design a minimal config schema and map each field to libvirt actions.

Questions to answer:

  • Which fields are required vs optional?
  • How will you store secrets or SSH keys?

The Interview Questions They Will Ask

  1. “What is libvirt and why is it used?”
  2. “How do you make VM provisioning idempotent?”
  3. “Why use qcow2 instead of raw?”
  4. “How do you handle snapshots and rollback?”

Hints in Layers

Hint 1: Start with define/start Use libvirt to define a domain and start it.

Hint 2: Add storage and networking Attach qcow2 disks and a bridge-backed NIC.

Hint 3: Workflow outline (pseudocode)

READ config -> compute desired state -> define/update domain -> start

Hint 4: Debugging Use virsh dumpxml to compare actual vs desired config.
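
The same define-and-start flow is available from C via libvirt, and the Python and Go bindings mirror these calls. A minimal sketch, assuming a reachable qemu:///system socket and a deliberately skeletal domain XML (a real orchestrator would render the XML from its config and attach disks and a NIC); link with -lvirt.

```c
#include <libvirt/libvirt.h>
#include <stdio.h>

/* Skeletal domain XML for illustration only. */
static const char *domain_xml =
    "<domain type='kvm'>"
    "  <name>vm1</name>"
    "  <memory unit='MiB'>1024</memory>"
    "  <vcpu>1</vcpu>"
    "  <os><type arch='x86_64'>hvm</type></os>"
    "</domain>";

int main(void)
{
    virConnectPtr conn = virConnectOpen("qemu:///system");
    if (!conn) { fprintf(stderr, "failed to connect to libvirt\n"); return 1; }

    /* Idempotency: reuse the existing definition if the name is taken. */
    virDomainPtr dom = virDomainLookupByName(conn, "vm1");
    if (!dom)
        dom = virDomainDefineXML(conn, domain_xml);
    if (!dom) { virConnectClose(conn); return 1; }

    if (virDomainIsActive(dom) != 1 && virDomainCreate(dom) < 0)
        fprintf(stderr, "failed to start vm1\n");
    else
        printf("[vm] vm1 defined and running\n");

    virDomainFree(dom);
    virConnectClose(conn);
    return 0;
}
```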

Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| Virtualization | “Modern Operating Systems” | Ch. 7 |
| Storage | “Operating System Concepts” | Ch. 10 |

Common Pitfalls and Debugging

Problem 1: “VM boots but no network”

  • Why: Bridge or DHCP misconfigured.
  • Fix: Verify virbr0 and DHCP service.
  • Quick test: virsh net-list and ip a.

Definition of Done

  • CLI creates and destroys VMs reliably
  • VM is reachable via SSH
  • Snapshots work
  • Config is reproducible

Project 8: Container Runtime from Scratch

  • File: P08-container-runtime.md
  • Main Programming Language: C or Go
  • Alternative Programming Languages: Rust
  • Coolness Level: Level 4: Systems Practicality
  • Business Potential: Level 2: Platform Cred
  • Difficulty: Level 4: Advanced
  • Knowledge Area: Namespaces, cgroups
  • Software or Tool: OCI runtime clone
  • Main Book: “Modern Operating Systems” - Ch. 7

What you will build: A minimal container runtime that creates namespaces, applies cgroup limits, and runs a process in an isolated rootfs.

Why it teaches virtualization: It reveals how OS-level isolation differs from hardware virtualization.

Core challenges you will face:

  • Namespace setup -> OS isolation
  • Cgroup limits -> Resource control
  • Rootfs isolation -> Filesystem virtualization

Real World Outcome

A process runs as PID 1 in its own namespace with enforced CPU and memory limits.

Exact CLI output:

$ sudo ./minirun run ./rootfs /bin/sh
[minirun] namespaces: pid, net, mnt, uts, ipc, user
[minirun] cgroup: cpu.max=50% memory.max=256M
[minirun] pivot_root -> /rootfs

root@container:/# hostname
container
root@container:/# ps -ef
PID 1 /bin/sh

The Core Question You Are Answering

“How does Linux isolate a process so it looks like a VM without a hypervisor?”

Concepts You Must Understand First

  1. Namespaces
    • Book Reference: Linux namespaces docs
  2. cgroup v2
    • Book Reference: cgroup v2 documentation
  3. OCI runtime spec
    • Book Reference: OCI runtime spec

Questions to Guide Your Design

  1. Lifecycle management
    • How will you set up and tear down namespaces?
  2. Security
    • Which capabilities should be dropped by default?

Thinking Exercise

PID 1 Semantics Explain why a container needs an init-like process to reap zombies.

Questions to answer:

  • What happens if PID 1 ignores SIGTERM?
  • How does this impact process cleanup?

The Interview Questions They Will Ask

  1. “What is the difference between namespaces and cgroups?”
  2. “Why do containers start faster than VMs?”
  3. “What is an OCI bundle?”
  4. “How do user namespaces improve security?”

Hints in Layers

Hint 1: Start with a single namespace Get PID isolation working before adding network and mount namespaces.

Hint 2: Add cgroup limits Use cgroup v2 for unified control of CPU and memory.

Hint 3: Runtime outline (pseudocode)

CLONE namespaces -> setup rootfs -> apply cgroups -> exec target

Hint 4: Debugging Use lsns and cat /proc/self/cgroup to verify isolation.
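
The skeleton of the runtime is surprisingly small. Here is a minimal sketch, assuming root, that creates PID, UTS, and mount namespaces, remounts /proc, and execs a shell; a full runtime would add user namespaces, cgroup limits, pivot_root into a prepared rootfs, and capability dropping.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)
static char child_stack[STACK_SIZE];

/* Runs inside the new namespaces as PID 1. */
static int child(void *arg)
{
    (void)arg;
    sethostname("container", strlen("container"));

    /* Make mounts private, then remount /proc so ps shows only this
     * PID namespace. A full runtime would also pivot_root, apply
     * cgroup limits, and drop capabilities before exec. */
    mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL);
    mount("proc", "/proc", "proc", 0, NULL);

    char *argv[] = { "/bin/sh", NULL };
    execv(argv[0], argv);
    perror("execv");
    return 1;
}

int main(void)
{
    int flags = CLONE_NEWPID | CLONE_NEWUTS | CLONE_NEWNS | SIGCHLD;
    pid_t pid = clone(child, child_stack + STACK_SIZE, flags, NULL);
    if (pid < 0) { perror("clone (needs root)"); return 1; }
    waitpid(pid, NULL, 0);
    return 0;
}
```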

Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| OS virtualization | “Modern Operating Systems” | Ch. 7 |
| Systems interfaces | “The Linux Programming Interface” | Ch. 49 |

Common Pitfalls and Debugging

Problem 1: “Container sees host PID list”

  • Why: PID namespace not created or /proc not remounted.
  • Fix: Ensure PID namespace and mount /proc inside the container.
  • Quick test: ps inside container should show PID 1 only.

Definition of Done

  • Container has its own PID namespace
  • Hostname and mount isolation work
  • CPU and memory limits enforced
  • Rootfs isolation via pivot_root or chroot

Project 9: Hyperconverged Home Lab with Ceph

  • File: P09-hyperconverged-home-lab.md
  • Main Programming Language: Bash / YAML
  • Alternative Programming Languages: Python
  • Coolness Level: Level 5: Real Infra Lab
  • Business Potential: Level 4: Infra Architect
  • Difficulty: Level 4: Advanced
  • Knowledge Area: HCI, distributed storage
  • Software or Tool: Proxmox + Ceph
  • Main Book: “Designing Data-Intensive Applications” - Ch. 5, 9

What you will build: A 3-node hyperconverged lab with Ceph-backed storage, HA, and live migration.

Why it teaches virtualization: You will experience the operational realities of quorum, replication, and migration.

Core challenges you will face:

  • Quorum design -> HCI safety
  • Ceph storage setup -> Distributed storage
  • Live migration -> Availability

Real World Outcome

You can migrate a VM live and survive a node failure without data loss.

Exact CLI output:

$ pvecm status
Quorate: Yes
Nodes: 3

$ ceph -s
health: HEALTH_OK
osd: 6 up, 6 in

$ qm migrate 100 node2 --online
migration started

The Core Question You Are Answering

“How do you build a cluster that keeps VMs running when hardware fails?”

Concepts You Must Understand First

  1. Quorum and fencing
    • Book Reference: “Designing Data-Intensive Applications” - Ch. 9
  2. Ceph architecture
    • Book Reference: Ceph docs
  3. Live migration
    • Book Reference: “Operating System Concepts” - Ch. 16

Questions to Guide Your Design

  1. Network separation
    • Which NICs handle storage vs VM traffic?
  2. Failure testing
    • How will you validate HA without corrupting data?

Thinking Exercise

Failure Scenario If Node1 fails during migration, trace what happens to VM state and storage.

Questions to answer:

  • Does the VM continue on the source or destination?
  • Which component enforces safety?

The Interview Questions They Will Ask

  1. “How does Ceph place replicas?”
  2. “Why is quorum required for HA?”
  3. “What is the trade-off between replication and erasure coding?”
  4. “How does live migration reduce downtime?”

Hints in Layers

Hint 1: Build the cluster first Confirm Proxmox quorum before adding Ceph.

Hint 2: Add Ceph storage Create MONs, OSDs, and a pool before attaching RBD disks.

Hint 3: Validation outline (pseudocode)

CHECK quorum -> verify ceph health -> migrate VM -> simulate node failure

Hint 4: Debugging Use ceph health detail and pvecm status to diagnose issues.

Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| Replication | “Designing Data-Intensive Applications” | Ch. 5 |
| Consensus | “Designing Data-Intensive Applications” | Ch. 9 |

Common Pitfalls and Debugging

Problem 1: “Cluster loses quorum”

  • Why: Network partition or even-numbered votes.
  • Fix: Use an odd number of votes or add a qdevice.
  • Quick test: pvecm status.

Problem 2: “Ceph HEALTH_WARN”

  • Why: OSD down or PGs degraded.
  • Fix: ceph health detail and repair OSDs.
  • Quick test: ceph -s after recovery.

Definition of Done

  • 3-node cluster is quorate
  • Ceph health is OK
  • VM live migration works
  • HA restarts VMs after node failure

Project 10: Build a Mini Cloud Control Plane

  • File: P10-mini-cloud-control-plane.md
  • Main Programming Language: Go or Python
  • Alternative Programming Languages: Rust
  • Coolness Level: Level 5: Cloud Builder
  • Business Potential: Level 4: Platform Builder
  • Difficulty: Level 5: Expert
  • Knowledge Area: Scheduling, orchestration
  • Software or Tool: KVM/libvirt + API server
  • Main Book: “Designing Data-Intensive Applications” - Ch. 5, 9

What you will build: A mini control plane that schedules VMs across nodes, exposes an API, and supports migration.

Why it teaches virtualization: It integrates scheduling, networking, storage, and VM lifecycle into a working system.

Core challenges you will face:

  • Scheduling -> Control plane design
  • State management -> Distributed systems
  • Migration workflows -> Availability

Real World Outcome

A REST API can create and manage VMs across multiple nodes with scheduler logs.

Exact CLI output:

$ curl -X POST http://localhost:8080/v1/vms \
  -d '{"name":"web01","cpu":2,"ram":4096,"image":"ubuntu"}'
{"id":"vm-101","status":"building","host":"node2"}

$ curl http://localhost:8080/v1/vms/vm-101
{"id":"vm-101","status":"running","ip":"10.0.3.15"}

$ tail -n 2 scheduler.log
[decision] vm-101 -> node2 (cpu=32%, ram=58%)

The Core Question You Are Answering

“How do real cloud platforms orchestrate VMs across multiple hosts?”

Concepts You Must Understand First

  1. Scheduling and placement
    • Book Reference: “Fundamentals of Software Architecture” - scheduling
  2. Quorum and distributed state
    • Book Reference: “Designing Data-Intensive Applications” - Ch. 9
  3. VM lifecycle management
    • Book Reference: libvirt docs

Questions to Guide Your Design

  1. State model
    • What is stored centrally vs per-node?
  2. Idempotency
    • How will you prevent duplicate VM creation on retries?

Thinking Exercise

Placement Policy Design a placement algorithm for three nodes with uneven CPU and RAM usage.

Questions to answer:

  • How will you handle stale metrics?
  • What is your overcommit policy?

The Interview Questions They Will Ask

  1. “How does a scheduler decide where to place a VM?”
  2. “What happens if the control plane fails mid-request?”
  3. “How do you enforce API idempotency?”
  4. “How do you coordinate migration safely?”

Hints in Layers

Hint 1: Start with a small API Implement create, start, stop, destroy before adding migration.

Hint 2: Add a simple scheduler Use a greedy placement policy based on CPU/RAM usage.

Hint 3: Workflow outline (pseudocode)

REQUEST -> validate -> choose host -> call libvirt -> record state

Hint 4: Debugging Log every state transition and include a request ID for tracing.
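
Idempotency can start as simply as refusing to act twice on the same request ID. A toy in-memory sketch follows; a real control plane would persist the IDs alongside the VM records so retries after a crash are still detected.

```c
#include <stdio.h>
#include <string.h>

#define MAX_REQS 128

static char seen[MAX_REQS][64];
static int  nseen;

/* Returns 1 and records the ID the first time it appears, 0 on a retry. */
static int first_time(const char *request_id)
{
    for (int i = 0; i < nseen; i++)
        if (strcmp(seen[i], request_id) == 0)
            return 0;
    if (nseen < MAX_REQS)
        strncpy(seen[nseen++], request_id, 63);
    return 1;
}

static void create_vm(const char *request_id, const char *name)
{
    if (!first_time(request_id)) {
        printf("[api] %s: duplicate request %s, returning existing VM\n", name, request_id);
        return;
    }
    printf("[api] %s: creating VM (request %s)\n", name, request_id);
}

int main(void)
{
    create_vm("req-42", "web01");
    create_vm("req-42", "web01");   /* client retry after timeout: no duplicate VM */
    return 0;
}
```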

Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| Distributed state | “Designing Data-Intensive Applications” | Ch. 9 |
| Orchestration | “Fundamentals of Software Architecture” | selected |

Common Pitfalls and Debugging

Problem 1: “Scheduler overloads a node”

  • Why: Stale metrics or missing resource limits.
  • Fix: Refresh metrics before placement and cap overcommit.
  • Quick test: Compare host metrics before and after scheduling.

Problem 2: “Duplicate VM creation”

  • Why: Missing idempotency tokens.
  • Fix: Require a request ID and enforce uniqueness.
  • Quick test: Repeat the same request and verify single VM.

Definition of Done

  • API can create/start/stop/destroy VMs
  • Scheduler uses live resource metrics
  • State survives restarts
  • Migration or evacuation works for one VM

Project Comparison Table

| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| 1. Toy KVM Hypervisor | Level 4 | Weeks | High | ★★★★☆ |
| 2. VMX Capability Explorer | Level 3 | Week | Medium | ★★★☆☆ |
| 3. Shadow Page Table Simulator | Level 4 | Weeks | High | ★★★☆☆ |
| 4. Userspace MMIO Mapper | Level 3 | Weeks | Medium | ★★★☆☆ |
| 5. Virtio Block Device | Level 4 | Weeks | High | ★★★★☆ |
| 6. Virtio Net Device | Level 4 | Weeks | High | ★★★★☆ |
| 7. Vagrant-Style Orchestrator | Level 3 | Weeks | Medium | ★★★★☆ |
| 8. Container Runtime | Level 4 | Weeks | High | ★★★★☆ |
| 9. Hyperconverged Home Lab | Level 4 | Weeks | High | ★★★★★ |
| 10. Mini Cloud Control Plane | Level 5 | Months | Very High | ★★★★★ |

Recommendation

  • If you are new to virtualization: start with Project 2, then Project 1, to build intuition about VMX and VM exits.
  • If you are a platform engineer: start with Project 7, then Project 9, for control planes and HCI operations.
  • If you want to build a cloud: focus on Project 9, then Project 10.


Final Overall Project: Hyperconverged Mini-Cloud

The Goal: Combine Projects 7, 9, and 10 into a single mini-cloud that provisions VMs, migrates them, and survives node failures.

  1. Build a libvirt-backed orchestrator (Project 7).
  2. Deploy an HCI storage cluster (Project 9).
  3. Add a control plane API with scheduling and migration (Project 10).

Success Criteria: You can create a VM via API, migrate it live, and survive a node failure without data loss.


From Learning to Production: What Is Next

| Your Project | Production Equivalent | Gap to Fill |
|---|---|---|
| Toy KVM Hypervisor | QEMU/KVM | Robust device models and security hardening |
| Virtio Block/Net | QEMU virtio backend | Performance, correctness, fuzzing |
| Vagrant-Style Orchestrator | OpenStack Nova | Distributed state, HA, auth, quotas |
| HCI Lab | Proxmox + Ceph | Scale, monitoring, lifecycle automation |
| Mini Cloud Control Plane | Cloud provider VM service | Multi-region, billing, compliance |

Summary

This learning path covers virtualization through 10 hands-on projects.

| # | Project Name | Main Language | Difficulty | Time Estimate |
|---|---|---|---|---|
| 1 | Toy KVM Hypervisor | C | Level 4 | 2-3 weeks |
| 2 | VMX Capability Explorer | C | Level 3 | 1 week |
| 3 | Shadow Page Table Simulator | C | Level 4 | 2-3 weeks |
| 4 | Userspace MMIO Mapper | C | Level 3 | 1-2 weeks |
| 5 | Virtio Block Device | C | Level 4 | 3-4 weeks |
| 6 | Virtio Net Device | C | Level 4 | 3-4 weeks |
| 7 | Vagrant-Style Orchestrator | Python/Go | Level 3 | 1-2 weeks |
| 8 | Container Runtime | C/Go | Level 4 | 2-3 weeks |
| 9 | Hyperconverged Home Lab | Bash/YAML | Level 4 | 3-4 weeks |
| 10 | Mini Cloud Control Plane | Go/Python | Level 5 | 6-10 weeks |

Expected Outcomes

  • You can reason about VM exits, memory translation, and device models.
  • You can build and validate virtio devices and container runtimes.
  • You can design and operate a small hyperconverged cluster.

Additional Resources and References

Standards and Specifications

  • Intel SDM (VMX, EPT)
  • AMD64 Architecture Programmer’s Manual (SVM, NPT)
  • OASIS Virtio Specification v1.3
  • OCI Runtime Specification
  • RFC 7348 (VXLAN)

Industry Analysis

  • Gartner public cloud spending forecast (Nov 2024)
  • Omdia cloud infrastructure spending report (Dec 2025)
  • CNCF 2024 Cloud Native Survey (Apr 2025)

Books

  • “Operating System Concepts” by Silberschatz et al. - virtualization fundamentals
  • “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron - memory translation
  • “Designing Data-Intensive Applications” by Kleppmann - replication and consensus
  • “The Linux Programming Interface” by Kerrisk - system interfaces