Project 1: Toy Type-2 Hypervisor Using KVM
Build a minimal KVM-based type-2 hypervisor in user space that boots a tiny guest, handles VM exits, and emulates a serial device end-to-end.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Advanced (Level 4) |
| Time Estimate | 3-4 weeks |
| Main Programming Language | C |
| Alternative Programming Languages | Rust, Go |
| Coolness Level | Level 5: Pure Systems Magic |
| Business Potential | Level 2: Strong Infra Cred |
| Prerequisites | C systems programming, Linux syscalls, x86 real mode basics |
| Key Topics | KVM API, VM exits, guest memory layout, I/O emulation, UART |
1. Learning Objectives
By completing this project, you will:
- Implement the minimal KVM VM lifecycle (open, create VM, create vCPU, map memory, run).
- Build a tiny guest image that executes deterministic instructions and triggers VM exits.
- Emulate a serial device using port I/O and produce guest-visible output.
- Explain the GVA -> GPA -> HPA translation pipeline and how KVM maps memory regions.
- Instrument VM exits to produce a reproducible execution trace.
- Compare KVM-based virtualization with pure emulation in terms of control and performance.
2. All Theory Needed (Per-Concept Breakdown)
2.1 KVM Userspace Lifecycle and ioctl API
Fundamentals
KVM is a kernel facility that exposes hardware virtualization to user space via a file-descriptor-based API. The user-space hypervisor owns the VM lifecycle: it opens /dev/kvm, queries the API version, creates a VM file descriptor, registers guest memory, creates one or more vCPUs, maps a shared kvm_run structure, and enters a loop that calls KVM_RUN to enter guest execution. Each KVM_RUN returns when a VM exit happens, letting the hypervisor inspect the exit reason and respond. This model is simple but strict: if you skip any step or create resources in the wrong order, the kernel rejects the call or the guest never executes. Understanding this lifecycle is mandatory because everything you build in the project (guest memory layout, exit handling, device emulation) must happen within the constraints of these ioctls.
Deep Dive into the concept
KVM’s API is intentionally low-level: it gives you a VM container and vCPU execution, but it does not give you a device model or even a BIOS. This design pushes the “hypervisor” role to user space. The sequence matters. You first open /dev/kvm, which is a char device owned by the KVM module. You call KVM_GET_API_VERSION to ensure the kernel and user-space expectations align. Then you create a VM via KVM_CREATE_VM. This returns a VM fd that owns VM-wide resources such as guest memory maps and IRQ routing.
Guest memory is not “allocated” by KVM; you allocate it in your process (e.g., mmap or malloc) and then register it via KVM_SET_USER_MEMORY_REGION, which tells the kernel how guest physical addresses map to your host virtual addresses. You can register multiple regions, which allows you to emulate RAM holes or ROM areas. A common pitfall is forgetting to page-align these regions or failing to mark them as read-only for ROM segments.
vCPUs are per-VM threads of execution. You create a vCPU with KVM_CREATE_VCPU. Each vCPU has its own fd and its own shared kvm_run structure. You determine the size of this structure with KVM_GET_VCPU_MMAP_SIZE and then mmap it. The kvm_run structure is the hypervisor’s window into guest state transitions: it exposes the exit reason, I/O details, and sometimes memory access details.
KVM_RUN is where the magic happens. When you call it, KVM enters VMX/SVM non-root mode and executes guest code until an exit condition occurs. Exits include I/O port access, HLT, shutdown, or unhandled exceptions. Your hypervisor must be structured as an event loop: call KVM_RUN, inspect kvm_run->exit_reason, handle it, and repeat. This loop is the “data plane” of your hypervisor; the “control plane” is everything else like creating VMs, setting registers, and loading guest code.
It’s also essential to understand that KVM is not thread-safe by default at the VM level. If you create multiple vCPUs, you must coordinate shared resources such as devices and memory. For this project you can stay single-vCPU, but you should still understand that each vCPU has independent state and can exit concurrently in a multi-vCPU future.
Finally, the ioctl API exposes capabilities and extensions. You can query if features like KVM_CAP_USER_MEMORY or KVM_CAP_IRQCHIP exist, and adapt your hypervisor accordingly. This is how robust hypervisors detect when to enable advanced functionality. Even in a toy hypervisor, you should learn to read and check capabilities because production code depends on them to be portable across kernels.
An often-missed detail is that KVM’s behavior varies subtly across CPU vendors and kernel versions. For example, some exit reasons include extra data only when certain capabilities are enabled, and some ioctls are silently ignored when unsupported. A careful hypervisor validates every critical step and logs capability probes so you can correlate runtime behavior with kernel features. This discipline helps you avoid “mystery” bugs when you run the same binary on a different host.
How this fits into the project
This is the backbone of the project. The VM lifecycle and the KVM_RUN loop are the framework upon which you load guest code, set registers, and emulate I/O in Section 3.2 and Section 5.10 Phase 1.
Definitions & key terms
- KVM -> Kernel-based Virtual Machine, a Linux kernel module that exposes virtualization APIs.
- VM fd -> File descriptor that represents a VM container and its global state.
- vCPU fd -> File descriptor representing a virtual CPU execution context.
- kvm_run -> Shared memory structure for VM exit reporting and I/O details.
- KVM_RUN -> ioctl that transitions into guest execution.
Mental model diagram (ASCII)
User-space Hypervisor
| open /dev/kvm
v
KVM kernel module
| KVM_CREATE_VM -> VM fd
| KVM_SET_USER_MEMORY_REGION
| KVM_CREATE_VCPU -> vCPU fd
| mmap kvm_run
v
CPU enters guest mode (VMX/SVM)
| VM exit -> kvm_run filled
v
User-space handles exit
How it works (step-by-step, with invariants and failure modes)
- Open /dev/kvm and verify the API version (invariant: the version matches the expected value).
- Create a VM with KVM_CREATE_VM (failure mode: EPERM if KVM is disabled).
- Allocate and register guest memory (invariant: GPA ranges are page-aligned).
- Create vCPU(s) and mmap kvm_run (failure mode: wrong mmap size).
- Initialize guest registers and segment state.
- Enter the KVM_RUN loop; handle exits and resume.
Minimal concrete example
int kvm = open("/dev/kvm", O_RDWR | O_CLOEXEC);
int vm = ioctl(kvm, KVM_CREATE_VM, 0);
int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);
size_t run_size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
struct kvm_run *run = mmap(NULL, run_size, PROT_READ | PROT_WRITE, MAP_SHARED, vcpu, 0);
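Building on that setup, the KVM_RUN event loop described earlier can be sketched as follows. This is an assumption-laden sketch, not the project's final loop: `handle_exit`, `run_loop`, and the `LOOP_*` enum are illustrative names, and the UART handling shown is the bare minimum.

```c
#include <linux/kvm.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>

/* Action the outer loop should take after handling one VM exit. */
enum loop_action { LOOP_CONTINUE, LOOP_STOP, LOOP_ERROR };

/* Pure exit dispatcher: inspects the shared kvm_run page and decides what
 * to do. Keeping it separate from the ioctl makes it easy to unit-test. */
static enum loop_action handle_exit(struct kvm_run *run)
{
    switch (run->exit_reason) {
    case KVM_EXIT_IO: {
        /* The I/O data buffer lives inside the mmapped kvm_run region. */
        uint8_t *data = (uint8_t *)run + run->io.data_offset;
        if (run->io.direction == KVM_EXIT_IO_OUT && run->io.port == 0x3F8)
            putchar(*data);           /* UART THR: emit the byte */
        else if (run->io.direction == KVM_EXIT_IO_IN)
            *data = 0x20;             /* pretend LSR reports THR empty */
        return LOOP_CONTINUE;
    }
    case KVM_EXIT_HLT:
        return LOOP_STOP;             /* guest finished cleanly */
    case KVM_EXIT_SHUTDOWN:
        return LOOP_ERROR;            /* usually a triple fault */
    default:
        fprintf(stderr, "unhandled exit %u\n", run->exit_reason);
        return LOOP_ERROR;
    }
}

/* The data plane: enter the guest, handle the exit, repeat. */
static int run_loop(int vcpu_fd, struct kvm_run *run)
{
    for (;;) {
        if (ioctl(vcpu_fd, KVM_RUN, 0) < 0)
            return -1;                /* EINTR handling omitted for brevity */
        enum loop_action a = handle_exit(run);
        if (a == LOOP_STOP)  return 0;
        if (a == LOOP_ERROR) return -1;
    }
}
```

Splitting dispatch from the ioctl mirrors the control-plane/data-plane distinction above and lets you test exit handling without a live VM.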
Common misconceptions
- “KVM is a full hypervisor.” -> KVM only provides CPU virtualization; user space must emulate devices.
- “KVM_RUN only returns on errors.” -> It returns on every VM exit, which is normal.
- “Guest memory belongs to the kernel.” -> Guest memory is your process memory that KVM maps into the guest.
Check-your-understanding questions
- Why do you need both a VM fd and a vCPU fd?
- What happens if you call KVM_RUN before setting guest registers?
- Predict the effect of registering a guest memory region that overlaps another.
Check-your-understanding answers
- The VM fd stores global VM state and memory mappings, while each vCPU fd controls a single execution context.
- The guest will likely triple fault or exit immediately because the CPU state is invalid.
- KVM will reject the mapping or undefined behavior will occur; overlapping regions violate invariants.
Real-world applications
- QEMU uses this exact API to run KVM guests.
- Firecracker and Cloud Hypervisor implement the same KVM_RUN loop with different device models.
Where you’ll apply it
- This project: Section 3.2, Section 4.1, and Section 5.10 Phase 1.
- Also used in: P05-write-a-simple-virtual-machine-monitor-vmm-virtual.md, P06-build-a-mini-cloud-platform-cloud-infrastructure-d.md.
References
- Linux KVM API documentation (kernel.org)
- QEMU architecture overview
- “Operating System Concepts” (Virtual Machines chapter)
Key insights
KVM provides the execution engine, but your user-space process is the hypervisor.
Summary
You now understand the KVM lifecycle and why user space owns the VM orchestration.
Homework/Exercises to practice the concept
- Write a tiny program that only opens /dev/kvm and queries the API version.
- Add a capability check for KVM_CAP_USER_MEMORY and print the result.
Solutions to the homework/exercises
- Use ioctl(kvm_fd, KVM_GET_API_VERSION, 0) and compare the result to KVM_API_VERSION.
- Call ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_USER_MEMORY) and print nonzero if supported.
2.2 Memory Virtualization and Address Translation (GVA -> GPA -> HPA)
Fundamentals
A guest sees virtual addresses (GVA), which it translates to guest physical addresses (GPA) using its own page tables. The hypervisor then maps GPA to host physical addresses (HPA) through a second layer of translation, typically EPT (Intel) or NPT (AMD). KVM needs to know which GPAs exist and how they map to your process’s host virtual memory. This matters because if you mis-map memory, the guest will read garbage or trigger VM exits for EPT violations. Understanding this translation pipeline is essential to loading your guest program, handling page faults, and interpreting VM exit addresses correctly. It also explains why memory bugs show up as VM exits rather than host crashes.
Deep Dive into the concept
Memory virtualization is the most subtle part of virtual machines because it involves two independent address translation systems. The guest believes it owns a flat physical memory space. It builds page tables that translate guest virtual addresses (GVA) to guest physical addresses (GPA). In a bare-metal system, GPA would map directly to host physical memory. In a VM, however, the hypervisor must interpose with a second translation layer. That’s what EPT/NPT provides: a second set of page tables, owned by the hypervisor, that translate GPA to host physical address (HPA).
KVM exposes this by letting you register memory regions. When you call KVM_SET_USER_MEMORY_REGION, you specify a GPA range and a host virtual address pointer. KVM inserts those mappings into the EPT tables (or shadow page tables if EPT is disabled). In effect, your process’s memory becomes the guest’s “physical” RAM. If the guest accesses a GPA outside registered ranges, KVM triggers an EPT violation and exits to user space.
The guest sees a normal CPU with paging enabled or disabled. If you run a real-mode guest, paging is off and GVA == GPA. This makes early boot simple: you can load a flat binary at GPA 0x1000 and set RIP there. If you enable paging, you must build guest page tables and set CR3 appropriately. KVM does not validate your guest page tables, but it will respect them. A wrong CR3 value causes page faults, which might be injected back into the guest or cause shutdown if unhandled.
Understanding memory alignment is critical. KVM memory regions must be page-aligned and sized in page multiples, because EPT works in page granularity. If you want to emulate ROM or MMIO regions, you can register a memory region with KVM_MEM_READONLY or leave gaps to trigger exits on access. This is how BIOS regions or memory-mapped device registers are handled in full hypervisors.
Another subtlety is that your hypervisor’s host virtual addresses (HVA) are not necessarily contiguous in host physical memory. KVM uses your HVAs but ultimately maps to HPA with normal Linux paging. That means performance can be impacted by host page faults or hugepages. For a toy project, normal pages are fine, but the mental model should include that the host OS is still in charge of how your guest memory is physically placed. This is also why production hypervisors use mlock or hugepages to reduce host-side paging.
Finally, the GVA -> GPA -> HPA pipeline affects debugging. When you log VM exits that include an address, you must know which layer the address refers to. KVM’s KVM_EXIT_MMIO provides GPA offsets; KVM_EXIT_IO references I/O port numbers; guest page faults refer to GVA. If you want to inspect guest memory, you use GPA offsets into your allocated memory buffer. This is how you can read and write guest memory directly from the hypervisor for debugging and device emulation.
Another practical consideration is memory layout discipline. If your guest expects BIOS-like regions or a specific RAM size, you must mirror that expectation in your KVM memory regions. Even a toy guest might assume memory at address 0 or rely on zeroed memory, so you should explicitly zero your guest buffer and document the assumed layout. These small choices make your VM behavior deterministic and reproducible, which is crucial for the golden-path demo and for debugging with consistent logs.
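Reading and writing guest memory through GPA offsets, as described above, is worth wrapping in small bounds-checked helpers. The `guest_ram` struct and helper names below are our own illustrative choices; the key idea is that a GPA is just an offset into the host buffer you registered with KVM.

```c
#include <stdint.h>
#include <string.h>

/* Guest "physical" RAM is a host buffer; a GPA is an offset into it.
 * These helpers bounds-check the GPA range before touching the buffer. */
struct guest_ram {
    uint8_t *base;    /* host virtual address registered with KVM */
    uint64_t size;    /* bytes of guest RAM starting at GPA 0 */
};

/* Returns 0 on success, -1 if the GPA range falls outside guest RAM
 * (the same range that would trigger an EPT violation for the guest). */
static int guest_read(const struct guest_ram *ram, uint64_t gpa,
                      void *out, uint64_t len)
{
    if (gpa >= ram->size || len > ram->size - gpa)
        return -1;
    memcpy(out, ram->base + gpa, len);
    return 0;
}

static int guest_write(struct guest_ram *ram, uint64_t gpa,
                       const void *in, uint64_t len)
{
    if (gpa >= ram->size || len > ram->size - gpa)
        return -1;
    memcpy(ram->base + gpa, in, len);
    return 0;
}
```

The same helpers serve both the guest loader (copying the payload to its GPA) and debugging (dumping guest memory after an unexpected exit).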
How this fits into the project
This concept is central to loading the guest image (Section 5.2), defining the memory map in Section 3.5, and diagnosing EPT violations during the run loop (Section 7.1).
Definitions & key terms
- GVA -> Guest virtual address (process-level virtual address inside the guest).
- GPA -> Guest physical address (guest-visible “physical” RAM address).
- HPA -> Host physical address (real physical RAM in the host).
- EPT/NPT -> Second-level page tables used by hardware virtualization.
- Memory region -> Mapping between a GPA range and host virtual memory.
Mental model diagram (ASCII)
Guest instruction
| uses GVA
v
Guest page tables
| translate GVA -> GPA
v
EPT/NPT (hypervisor)
| translate GPA -> HPA
v
Host physical memory
How it works (step-by-step, with invariants and failure modes)
- Guest generates a memory access to GVA.
- Guest page tables translate GVA -> GPA (failure mode: guest page fault).
- EPT translates GPA -> HPA (failure mode: EPT violation if unmapped).
- Host CPU accesses HPA and returns data to the guest.
Minimal concrete example
// Register 4MB of guest RAM at GPA 0x0
struct kvm_userspace_memory_region region = {
.slot = 0,
.guest_phys_addr = 0x0,
.memory_size = 0x400000,
.userspace_addr = (uint64_t)guest_mem,
};
ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
Common misconceptions
- “GPA is the same as HPA.” -> Only in non-virtualized systems; EPT adds a second translation.
- “KVM allocates guest memory for me.” -> You allocate it; KVM only maps it.
- “EPT violations mean host segfault.” -> They are VM exits, not host faults.
Check-your-understanding questions
- Why does a real-mode guest have GVA == GPA?
- What happens if the guest accesses GPA 0x400000 when you only mapped 0-0x3FFFFF?
- Predict what changes when you enable guest paging but forget to set CR3.
Check-your-understanding answers
- Real mode does not use paging; segmentation results in a linear address directly used as GPA.
- KVM triggers an EPT violation and exits to the hypervisor.
- The guest will page fault or triple fault because CR3 is invalid.
Real-world applications
- Live migration must preserve GPA layout so memory snapshots stay consistent.
- Hypervisors use hugepages to reduce TLB pressure in the GPA->HPA translation.
Where you’ll apply it
- This project: Section 3.5 Data Formats, Section 4.4 Data Structures, Section 5.2 Project Structure.
- Also used in: P05-write-a-simple-virtual-machine-monitor-vmm-virtual.md.
References
- Intel SDM, VMX and EPT chapters
- “Computer Systems: A Programmer’s Perspective” (Virtual Memory)
Key insights
Memory virtualization is a two-layer translation pipeline; mastering it unlocks every VM behavior you observe.
Summary
You can now reason about how guest addresses map to real memory and how KVM enforces memory boundaries.
Homework/Exercises to practice the concept
- Sketch a memory map with two regions: RAM at 0x0-0x3FFFFF and MMIO at 0xF0000000-0xF0000FFF.
- Modify your memory registration to create a ROM region and explain how KVM enforces read-only access.
Solutions to the homework/exercises
- Use two kvm_userspace_memory_region entries with distinct slots and GPA ranges.
- Set KVM_MEM_READONLY on the ROM region; writes will cause a VM exit or error.
2.3 VM Exits and Device Emulation (Port I/O, MMIO, UART)
Fundamentals
When a guest executes sensitive operations, such as accessing I/O ports or executing HLT, the CPU exits to the hypervisor. These VM exits are your chance to emulate devices. The classic teaching device is the 16550 UART, which uses I/O ports and simple registers. If your hypervisor interprets guest I/O operations and translates them to host output, the guest can “print” text without any real hardware. This is the core virtualization loop: run guest, trap exits, emulate hardware, resume guest. The simplicity of UART makes it perfect for learning exit handling. It also makes it easy to validate behavior with deterministic logs.
Deep Dive into the concept
VM exits are the heartbeat of a hypervisor. Each exit is a context switch from guest mode to host mode. The cost of exits is high, so you typically intercept only what you must. For a toy hypervisor, you intercept everything necessary to make the guest run and to emulate a serial console. The exit reasons you will see include KVM_EXIT_IO for port I/O, KVM_EXIT_HLT for halting, and sometimes KVM_EXIT_SHUTDOWN if the guest triple faults.
The UART is a natural teaching device because it is simple but real. The legacy 16550 UART uses a small set of I/O ports (typically 0x3F8 for COM1) and a handful of registers: transmit holding register (THR), line status register (LSR), interrupt enable register (IER), and others. A guest OS expects that writing to THR sends a byte to the serial port. You can emulate that by printing the byte to stdout. You can also emulate LSR so the guest believes the transmitter is ready by always returning “empty” status bits. This lets a guest outb characters and see them appear on the host.
Port I/O exits include a data buffer, direction, size, and port number. The KVM run structure gives you the port number and the data to write. You must also correctly handle IN instructions (reads). A common approach is to return fixed values for status registers. For example, return 0x20 for LSR to indicate THR empty. If you return the wrong bits, guests may spin waiting for the transmitter to become ready.
Device emulation also intersects with memory mapping. Some devices use MMIO instead of port I/O. For MMIO, KVM produces KVM_EXIT_MMIO exits with GPA and size. You respond by reading or writing into your emulated device state. Even if you only implement I/O ports, you should understand MMIO because modern virtio devices use it, and your mental model should scale.
The guest boot path often ends with a HLT instruction. When you see KVM_EXIT_HLT, you can treat it as guest completion. If you see KVM_EXIT_SHUTDOWN, it usually indicates a triple fault. That means you misconfigured the guest state or the guest accessed invalid memory. Properly handling these exit reasons gives you a deterministic environment and makes debugging easier.
Emulation state is important. You will maintain a small struct for UART state (e.g., last transmitted byte, line status). Even if you keep it simple, this teaches you how real hypervisors emulate device registers and track state across exits. It also illustrates a key trade-off: emulation is flexible but slow; paravirtualized devices reduce exits by batching and sharing memory. This sets the stage for later projects that use virtio or VFIO.
Finally, consider the guest’s instruction stream itself. If your guest loops and emits characters, the hypervisor will see repeated exits that can overwhelm logs. A useful technique is to add throttling or selective logging (e.g., log only the first N exits). This keeps your traces readable while still capturing essential behavior. It mirrors what production hypervisors do: they avoid logging every exit because the volume would be enormous, but they still provide targeted tracing when debugging.
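The UART state tracking and log throttling discussed above can be sketched as a small state machine. The struct and function names are illustrative; only the COM1 port numbers and the LSR “transmitter empty” bit (0x20) come from the 16550 convention.

```c
#include <stdint.h>
#include <stdio.h>

#define COM1_BASE 0x3F8
#define UART_THR  (COM1_BASE + 0)   /* transmit holding register (write) */
#define UART_LSR  (COM1_BASE + 5)   /* line status register (read) */
#define LSR_THR_EMPTY 0x20          /* "transmitter ready" status bit */

/* Minimal UART device state carried across VM exits. */
struct uart {
    uint8_t last_tx;      /* last byte the guest transmitted */
    uint64_t tx_count;    /* bytes transmitted, for throttled logging */
};

/* Handle one OUT to a UART port. */
static void uart_out(struct uart *u, uint16_t port, uint8_t val)
{
    if (port == UART_THR) {
        u->last_tx = val;
        u->tx_count++;
        putchar(val);
        /* Throttle logging: trace only the first few transmits so a
         * chatty guest cannot drown the exit trace. */
        if (u->tx_count <= 8)
            fprintf(stderr, "[uart] tx #%llu = 0x%02x\n",
                    (unsigned long long)u->tx_count, val);
    }
    /* Writes to other registers (IER, LCR, ...) are accepted and ignored. */
}

/* Handle one IN from a UART port; returns the value seen by the guest. */
static uint8_t uart_in(struct uart *u, uint16_t port)
{
    (void)u;
    if (port == UART_LSR)
        return LSR_THR_EMPTY;   /* always ready, so polling guests proceed */
    return 0xFF;                /* unknown port: open-bus value */
}
```

Returning 0xFF for unknown IN ports matches the edge-case policy in Section 3.6, and keeping state in a struct makes it natural to grow toward a fuller register model later.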
How this fits into the project
This is the core execution loop used in Section 5.10 Phase 2 and the serial console output in Section 3.4 and Section 3.7.
Definitions & key terms
- VM exit -> Hardware transition from guest to hypervisor.
- Port I/O -> x86 IN/OUT instructions accessing I/O ports.
- MMIO -> Memory-mapped I/O using load/store to special GPA regions.
- UART 16550 -> Legacy serial device with simple registers.
Mental model diagram (ASCII)
Guest executes OUT 0x3F8, 'A'
|
v
VM exit: KVM_EXIT_IO
|
v
Hypervisor: emulate UART -> print 'A'
|
v
Resume guest execution
How it works (step-by-step, with invariants and failure modes)
- Guest executes an I/O instruction (IN/OUT).
- CPU triggers a VM exit with reason KVM_EXIT_IO.
- Hypervisor reads port number, direction, and data.
- If port is UART THR, write byte to host stdout.
- If port is LSR, return ready bits.
- Resume guest. Failure mode: unhandled port causes guest hang or crash.
Minimal concrete example
case KVM_EXIT_IO: {
uint16_t port = run->io.port;
uint8_t *data = (uint8_t *)run + run->io.data_offset;
if (run->io.direction == KVM_EXIT_IO_OUT && port == 0x3F8) {
putchar(*data);
} else if (run->io.direction == KVM_EXIT_IO_IN && port == 0x3FD) {
*data = 0x20; // LSR: THR empty
}
break;
}
Common misconceptions
- “Guest I/O is just memory.” -> x86 still supports separate I/O port space.
- “UART emulation requires full hardware model.” -> A minimal subset is enough for console output.
- “KVM handles device emulation.” -> KVM only provides exits; you emulate devices.
Check-your-understanding questions
- Why do you need to return a ready bit for the UART LSR?
- What is the difference between KVM_EXIT_IO and KVM_EXIT_MMIO?
- Predict what happens if you ignore all OUT instructions from the guest.
Check-your-understanding answers
- Many guests poll the LSR; without ready bits they spin forever.
- KVM_EXIT_IO is for port I/O instructions; KVM_EXIT_MMIO is for memory-mapped I/O.
- The guest may appear to hang or silently drop output because its device writes are ignored.
Real-world applications
- QEMU’s device models are essentially large exit handlers with state machines.
- Virtio devices reduce exits by sharing memory rings; VFIO bypasses emulation entirely.
Where you’ll apply it
- This project: Section 3.4 Example Output, Section 5.10 Phase 2, Section 7.1 Debugging.
- Also used in: P02-build-your-own-vagrant-clone-devops-infrastructure.md for virtio context, P05-write-a-simple-virtual-machine-monitor-vmm-virtual.md.
References
- Intel SDM, I/O instruction and VM exit sections
- 16550 UART programming reference
Key insights
VM exits are the boundary between illusion and reality; device emulation is the hypervisor’s contract with the guest.
Summary
You can now interpret VM exits and emulate a basic serial device using port I/O.
Homework/Exercises to practice the concept
- Extend your UART emulation to handle the Line Control Register (LCR) in a no-op way.
- Add a counter that logs how many VM exits occur per 1,000 guest instructions.
Solutions to the homework/exercises
- Track writes to port 0x3FB and store the value; it affects baud rate in real hardware but can be ignored here.
- Increment a counter on each exit and print stats every N exits; use a fixed N to keep output deterministic.
2.4 Guest CPU State Initialization (Real-Mode Bootstrapping)
Fundamentals
A KVM VM does not boot by itself. The kernel gives you a vCPU in an undefined state, and your hypervisor must explicitly initialize general-purpose registers, segment registers, control registers, and flags before the guest can execute. The simplest boot path is real mode: paging is disabled, segmentation is flat, and CS:IP directly points to your guest entry point. To reach a deterministic start, you set rip to the address where you loaded your guest payload, set rflags to a known value, and configure CS, DS, and SS with base 0 and selector 0. This is done via KVM_SET_SREGS and KVM_SET_REGS. If you skip or misconfigure any of these fields, the guest will triple fault and exit immediately. Understanding this initialization sequence is essential because every later exit, I/O emulation, and debug trace depends on the guest starting in a well-defined CPU mode.
Deep Dive into the concept
Bootstrapping a guest is a careful choreography between the x86 architectural model and KVM’s expectations. Real x86 hardware begins execution in real mode after reset: CS selector is 0xF000, IP is 0xFFF0, and the first instruction is fetched from the BIOS reset vector. In a minimal KVM guest, you bypass BIOS and jump straight into your payload, but you still need to choose a consistent mode and a correct set of segment bases. KVM exposes architectural state via two structures: struct kvm_sregs (segment registers, control registers, descriptor tables, and special MSRs) and struct kvm_regs (general registers and rip/rflags). You must use these to align the virtual CPU with your guest’s expectations.
Real mode is the easiest because it minimizes the number of fields you must set. In real mode, CR0.PE=0 and CR0.PG=0, the hidden segment descriptor caches are derived from the visible selector value, and linear addresses are computed as segment_base + offset with segment_base = selector << 4. In KVM, you don’t need to simulate the implicit segment cache rebuild; you can directly set cs.base and cs.selector to values that yield the flat model you want. A common teaching setup is to set cs.base = 0, cs.selector = 0, and then place the guest payload at GPA 0x1000. With rip=0x1000, the CPU fetches the first instruction at 0x0000:0x1000 and execution begins deterministically. Setting ds, es, and ss to base 0 avoids surprises when the guest touches data or stack.
If you later choose to boot into protected or long mode, the initialization complexity increases dramatically. Protected mode requires CR0.PE=1, a valid GDT with code and data descriptors, and consistent CS/DS selectors that index that GDT. Long mode requires CR4.PAE=1, EFER.LME=1, a valid page table hierarchy in guest memory, and CR0.PG=1 with 64-bit code segments. The transition sequence matters: you must set up page tables before enabling paging, and you must load the GDT before switching the CS selector. Many new hypervisor builders fail here because they set CR0 and CR4 in the wrong order or forget to set the L bit in the 64-bit code segment descriptor. The result is a VM exit with KVM_EXIT_SHUTDOWN due to triple fault, which provides very little debugging context.
KVM also enforces certain invariants about CR0 and CR4 that must match the CPU’s fixed bits (the IA32_VMX_CR0_FIXED0/1 and IA32_VMX_CR4_FIXED0/1 MSRs). If you set CR0 or CR4 to values that violate these fixed masks, the VM entry will fail or the VM will exit immediately. A reliable pattern is to fetch the current sregs via KVM_GET_SREGS, modify only the fields you need, and write them back. This preserves whatever baseline KVM and the CPU expect, while still letting you place the guest in a controlled mode. It is the safest approach when you are new to VMX/SVM details.
Another subtlety is the guest stack. Even in a trivial guest, you should set rsp to a known memory location and ensure that memory is mapped and writable. If SS points to unmapped memory, the first push or call will fault and you’ll see an unexpected shutdown exit. For a real-mode guest, setting ss.base=0 and rsp=0x2000 is enough. You can also set rbp to the same value to make debugging easier if you ever dump the guest state.
Finally, note that guest initialization is your first opportunity to build deterministic behavior. If you always load the same guest binary at the same address, set the same registers, and disable asynchronous events like interrupts, your VM exits will be repeatable. This determinism is incredibly valuable for a project that relies on logs and exit traces. It also prepares you for more advanced virtualization work, where deterministic CPU models and stable boot sequences are required for live migration and reproducible builds.
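The real-mode address computation described above (linear = segment_base + offset, with segment_base = selector << 4) is simple enough to capture in one helper, shown here as an illustrative sketch:

```c
#include <stdint.h>

/* Real-mode linear address: segment selector shifted left by 4, plus the
 * 16-bit offset. With selector 0 this degenerates to linear == offset,
 * which is why a flat CS/DS/SS setup makes guest addresses easy to follow. */
static uint32_t real_mode_linear(uint16_t selector, uint16_t offset)
{
    return ((uint32_t)selector << 4) + offset;
}
```

For example, the hardware reset vector CS:IP = 0xF000:0xFFF0 mentioned above yields linear address 0xFFFF0, while the teaching setup 0x0000:0x1000 yields 0x1000, exactly where this project loads its payload.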
How this fits into the project
This concept directly underpins Section 3.2 functional requirements (initial CPU state), Section 5.10 Phase 1 (guest bootstrap), and the failure diagnostics in Section 7.1 when a guest shuts down unexpectedly.
Definitions & key terms
- Real mode -> The default x86 execution mode with no paging and segmented addressing.
- kvm_sregs -> KVM structure containing segment registers, control registers, and descriptor tables.
- kvm_regs -> KVM structure containing general registers, RIP, and RFLAGS.
- Triple fault -> CPU reset condition after three consecutive unhandled faults; in KVM it appears as KVM_EXIT_SHUTDOWN.
Mental model diagram (ASCII)
Guest boot path
| load guest bytes at GPA 0x1000
v
Set sregs (CS/DS/SS bases) + regs (RIP/RSP)
| VM entry
v
CPU fetches instruction @ CS.base + RIP
How it works (step-by-step, with invariants and failure modes)
- Allocate guest memory and copy your payload to a fixed GPA (invariant: GPA is mapped and writable).
- Call KVM_GET_SREGS and set the cs, ds, es, and ss base/selector fields to flat values.
- Ensure CR0.PE=0 and CR0.PG=0 for real mode (invariant: fixed bits satisfied).
- Set rip to the entry point and rflags to 0x2 (bit 1 must be set).
- Set rsp to a safe stack location inside mapped memory.
- Enter the VM; failure mode: invalid CR0/CR4 or segment state causes immediate shutdown.
Minimal concrete example
struct kvm_sregs sregs;
ioctl(vcpu_fd, KVM_GET_SREGS, &sregs);
sregs.cs.base = 0;
sregs.cs.selector = 0;
sregs.ds.base = sregs.es.base = sregs.ss.base = 0;
sregs.ds.selector = sregs.es.selector = sregs.ss.selector = 0;
ioctl(vcpu_fd, KVM_SET_SREGS, &sregs);
struct kvm_regs regs = {
.rip = 0x1000,
.rflags = 0x2,
.rsp = 0x2000,
};
ioctl(vcpu_fd, KVM_SET_REGS, &regs);
Common misconceptions
- “KVM boots the guest like a BIOS.” -> KVM does not provide BIOS; you must set CPU state and entry point.
- “Only RIP matters.” -> Incorrect segment bases or CR0/CR4 values can prevent any instruction from running.
- “RFLAGS can be zero.” -> Bit 1 must be set; otherwise VM entry fails.
Check-your-understanding questions
- Why is real mode the easiest mode for a minimal guest payload?
- What does KVM_GET_SREGS give you that KVM_GET_REGS does not?
- Predict what happens if SS points to an unmapped memory region.
- Which rflags bit must always be set before VM entry?
Check-your-understanding answers
- Real mode avoids paging and descriptor tables, minimizing required state.
- It includes segment registers, control registers, and descriptor tables needed for mode setup.
- The first stack access will fault, likely causing a shutdown exit.
- Bit 1 (the reserved “always-1” flag).
Real-world applications
- Hypervisors and bootloaders use deterministic CPU state to boot guests and kernels.
- VM migration relies on recreating consistent register and segment state on the target host.
Where you’ll apply it
- This project: Section 3.2 Functional Requirements, Section 5.10 Phase 1, Section 7.1 Debugging.
- Also used in: P05-write-a-simple-virtual-machine-monitor-vmm-virtual.md for VMCS guest-state setup.
References
- Intel SDM, Vol. 3A: System Programming Guide (real mode and segment semantics)
- Linux KVM API documentation (kvm_regs and kvm_sregs)
Key insights
If the guest’s initial CPU state is wrong, nothing else in the hypervisor matters because the guest never executes a valid instruction.
Summary
You can now bootstrap a guest by explicitly setting CPU state to a deterministic real-mode configuration.
Homework/Exercises to practice the concept
- Change the guest entry point from 0x1000 to 0x2000 and update the loader accordingly.
- Add a simple stack-based instruction in the guest and observe failures when
rspis incorrect.
Solutions to the homework/exercises
- Copy the payload to 0x2000, set rip=0x2000, and confirm the VM exits as expected.
- Set rsp to an unmapped address and verify you get KVM_EXIT_SHUTDOWN, then fix it.
3. Project Specification
3.1 What You Will Build
You will build a small, single-process user-space hypervisor that:
- Creates a VM using KVM.
- Loads a minimal guest binary into guest memory.
- Initializes vCPU registers and runs the guest.
- Emulates a UART device via I/O ports to print guest output.
- Produces a deterministic VM exit trace for debugging.
Included: KVM initialization, memory mapping, guest boot, UART emulation, exit logging. Excluded: Full BIOS, multi-vCPU SMP, PCI/virtio device models, disk emulation.
3.2 Functional Requirements
- VM Lifecycle: Create a VM, register guest memory, create a vCPU, and enter a run loop.
- Guest Loader: Load a flat binary guest into GPA 0x1000 and set RIP accordingly.
- Exit Handling: Correctly handle
KVM_EXIT_IO,KVM_EXIT_HLT, andKVM_EXIT_SHUTDOWN. - UART Emulation: Support basic COM1 transmit and LSR reads for polling guests.
- Exit Trace: Emit deterministic logs for each exit with exit reason and I/O details.
3.3 Non-Functional Requirements
- Performance: Guest should run without noticeable host lag; exit overhead is acceptable.
- Reliability: Invalid guest states should produce clear logs and clean shutdowns.
- Usability: make run should build and launch the guest in one command.
3.4 Example Usage / Output
$ make && sudo ./toy-kvm ./guest.bin
[vm] api=12
[vm] created vm_fd=4 vcpu_fd=5
[mem] guest RAM: 4 MiB @ GPA 0x00000000
[guest] entry RIP=0x00001000
[exit] IO OUT port=0x3f8 size=1 data='H'
[exit] IO OUT port=0x3f8 size=1 data='i'
[exit] HLT
[vm] guest halted
3.5 Data Formats / Schemas / Protocols
- Guest binary: flat 16-bit real-mode binary loaded at GPA 0x1000.
- Exit log line format:
[exit] <reason> [details]
- UART ports:
- COM1 base 0x3F8, LSR at 0x3FD.
3.6 Edge Cases
- Guest accesses unmapped GPA -> KVM_EXIT_SHUTDOWN (log and exit with code 2).
- Guest polls LSR but you return 0x00 -> guest hangs (detect and warn).
- Guest performs IN on unknown port -> return 0xFF and log warning.
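The UART conventions from Sections 3.5 and 3.6 fit into one small handler. A sketch, assuming a hypothetical uart_io helper; it extends the single-register Uart struct from Section 4.3 with a tx_last field so the run loop (not the device) decides how to print:

```c
#include <stdint.h>

#define COM1_DATA 0x3F8
#define COM1_LSR  0x3FD
#define LSR_THRE  0x20  /* bit 5: transmit holding register empty */

struct Uart {
    uint8_t lsr;      /* line status register */
    uint8_t tx_last;  /* last byte the guest transmitted (for the caller to log) */
};

/* Handle one guest port access. `out` is nonzero for OUT (guest -> device).
   Returns the byte handed back to the guest for IN accesses. */
uint8_t uart_io(struct Uart *u, uint16_t port, int out, uint8_t value)
{
    if (out && port == COM1_DATA) {   /* guest transmits a byte */
        u->tx_last = value;
        return 0;
    }
    if (!out && port == COM1_LSR)     /* polling guest asks: can I send? */
        return u->lsr | LSR_THRE;     /* always report transmitter ready */
    if (!out)                         /* IN on an unknown port */
        return 0xFF;                  /* convention from Section 3.6 */
    return 0;                         /* OUT to an unknown port: ignore, log upstream */
}
```

Never returning LSR_THRE is exactly the "guest polls LSR and hangs" edge case above; always OR the ready bit in.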
3.7 Real World Outcome
You will have a working, deterministic toy hypervisor that boots a guest and prints its output via emulated UART. You will be able to trace exits and understand why each exit happened.
3.7.1 How to Run (Copy/Paste)
make
sudo ./toy-kvm ./guest.bin
3.7.2 Golden Path Demo (Deterministic)
- Use the provided guest.bin that writes “Hi” and then executes HLT.
- Your exit trace should always show two IO exits, then HLT.
3.7.3 CLI Transcript (Success + Failure)
$ sudo ./toy-kvm ./guest.bin
[exit] IO OUT port=0x3f8 data='H'
[exit] IO OUT port=0x3f8 data='i'
[exit] HLT
[vm] exit_code=0
$ sudo ./toy-kvm ./missing.bin
[error] guest binary not found: ./missing.bin
[vm] exit_code=1
Exit codes:
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Invalid arguments or missing file |
| 2 | Guest crash or shutdown |
4. Solution Architecture
4.1 High-Level Design
+---------------------+ +--------------------+
| toy-kvm (userspace) |<---->| /dev/kvm (kernel) |
| - guest loader | | - VMX/SVM engine |
| - run loop | +--------------------+
| - UART emulation |
+----------+----------+
|
v
Guest Memory (HVA)
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| VM lifecycle | Create VM/vCPU and run loop | Use KVM ioctls directly |
| Guest loader | Load binary + set registers | Flat 16-bit real mode |
| UART device | Emulate COM1 via I/O exits | Minimal register subset |
| Exit logger | Deterministic trace | Fixed log format |
4.3 Data Structures (No Full Code)
struct Guest {
uint8_t *mem; // host buffer for guest RAM
size_t mem_size;
};
struct Uart {
uint8_t lsr; // line status register
};
4.4 Algorithm Overview
Key Algorithm: VM Run Loop
- Call KVM_RUN.
- Switch on exit reason.
- Emulate UART I/O or halt.
- Repeat until exit condition.
Complexity Analysis:
- Time: O(number of VM exits)
- Space: O(guest memory size)
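The loop above can be sketched as a dispatch function. This sketch is exercised against a hand-built kvm_run buffer rather than a live vCPU; handle_exit and vm_action are illustrative names, not part of the KVM API, while KVM_EXIT_IO, KVM_EXIT_HLT, KVM_EXIT_SHUTDOWN, and the io.data_offset layout are real:

```c
#include <linux/kvm.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum vm_action { VM_CONTINUE, VM_HALT, VM_CRASH };

/* Decide what to do after one KVM_RUN returns.
   The caller loops while this returns VM_CONTINUE. */
enum vm_action handle_exit(struct kvm_run *run)
{
    switch (run->exit_reason) {
    case KVM_EXIT_IO: {
        /* I/O data lives inside the shared kvm_run mapping, at data_offset. */
        uint8_t *data = (uint8_t *)run + run->io.data_offset;
        if (run->io.direction == KVM_EXIT_IO_OUT && run->io.port == 0x3f8)
            printf("[exit] IO OUT port=0x%x size=%u data='%c'\n",
                   run->io.port, run->io.size, *data);
        return VM_CONTINUE;
    }
    case KVM_EXIT_HLT:
        printf("[exit] HLT\n");
        return VM_HALT;
    case KVM_EXIT_SHUTDOWN:
    default:
        return VM_CRASH;  /* triple fault or an unhandled exit reason */
    }
}
```

Keeping the handler a pure function of the kvm_run contents is also what makes the exit trace deterministic and unit-testable.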
5. Implementation Guide
5.1 Development Environment Setup
sudo apt-get install build-essential nasm qemu-kvm libvirt-daemon-system
make
5.2 Project Structure
project-root/
+-- src/
| +-- main.c
| +-- kvm.c
| +-- guest.c
| +-- uart.c
+-- guest/
| +-- guest.asm
+-- Makefile
+-- README.md
5.3 The Core Question You’re Answering
“How does a user-space program turn KVM into a working hypervisor that can run real guest instructions?”
5.4 Concepts You Must Understand First
Stop and research these before coding:
- KVM ioctl lifecycle (KVM_CREATE_VM, KVM_RUN)
- x86 real mode execution (segment:offset addressing)
- Port I/O semantics (in/out instructions)
5.5 Questions to Guide Your Design
- Which exit reasons will you support at minimum?
- How will you layout guest memory and entry point?
- How will you ensure UART output is deterministic?
5.6 Thinking Exercise
Sketch how your guest writes “Hi” over COM1 and then halts. Note that out with an immediate operand only encodes an 8-bit port number, so 0x3F8 must be loaded into DX first and written with out dx, al.
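One possible encoding of that exercise, hand-assembled into the flat bytes a loader would copy to GPA 0x1000. This is a sketch of such a payload, not the provided guest.bin; it also shows why DX is needed, since the immediate form of out cannot encode port 0x3F8:

```c
#include <stdint.h>

/* 16-bit real-mode payload:
       mov dx, 0x3f8   ; BA F8 03   COM1 data port into DX
       mov al, 'H'     ; B0 48
       out dx, al      ; EE
       mov al, 'i'     ; B0 69
       out dx, al      ; EE
       hlt             ; F4
*/
const uint8_t guest_payload[] = {
    0xBA, 0xF8, 0x03,
    0xB0, 'H', 0xEE,
    0xB0, 'i', 0xEE,
    0xF4,
};
```

Running this payload should produce exactly two KVM_EXIT_IO exits followed by one KVM_EXIT_HLT, matching the golden path in Section 3.7.2.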
5.7 The Interview Questions They’ll Ask
- What is the role of kvm_run?
- Why does KVM require user-space device emulation?
- Explain GVA -> GPA -> HPA in one minute.
5.8 Hints in Layers
Hint 1: Start with a guest that only executes HLT.
Hint 2: Add a single UART write and log the byte.
Hint 3: Add LSR reads so the guest can poll readiness.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Virtual Machines | Operating System Concepts | Ch. 16 |
| Virtual Memory | CSAPP | Ch. 9 |
| x86 CPU | Intel SDM | Vol. 3 |
5.10 Implementation Phases
Phase 1: Foundation (1 week)
Goals: KVM lifecycle, guest memory mapping.
Tasks: Implement KVM open/create/region/vCPU; run a HLT guest.
Checkpoint: KVM_EXIT_HLT seen in logs.
Phase 2: Core Functionality (1-2 weeks)
Goals: UART I/O emulation and exit trace.
Tasks: Handle KVM_EXIT_IO, emulate COM1, print guest output.
Checkpoint: Guest prints “Hi”.
Phase 3: Polish & Edge Cases (1 week)
Goals: Robust errors, deterministic logs.
Tasks: Add exit reason logging and error codes.
Checkpoint: Missing guest file returns exit code 1.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Guest mode | Real mode vs protected mode | Real mode | Simplest to boot |
| UART emulation | Minimal vs full | Minimal | Enough for console output |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Validate helpers | UART register reads |
| Integration Tests | Run guest and check output | “Hi” guest |
| Edge Case Tests | Unmapped memory access | Invalid GPA |
6.2 Critical Test Cases
- Guest prints characters: output contains “Hi”.
- Guest halts: exit code 0 after HLT.
- Missing guest binary: exit code 1 with error message.
6.3 Test Data
Guest: "Hi" binary, HLT binary, invalid binary
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Wrong RIP entry | Immediate shutdown | Set RIP to the correct GPA |
| No UART LSR | Guest hangs | Return ready bit 0x20 |
| Unmapped GPA | KVM_EXIT_SHUTDOWN | Expand the memory region |
7.2 Debugging Strategies
- Use deterministic guest binaries to isolate issues.
- Log every exit reason and its metadata.
7.3 Performance Traps
- Excessive logging slows down execution; make logs optional in release mode.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add a basic timer exit (HLT loop and host timer).
- Support multiple UART ports (COM2).
8.2 Intermediate Extensions
- Add basic MMIO device emulation.
- Add a simple PIT timer and periodic interrupts.
8.3 Advanced Extensions
- Support a 32-bit protected mode guest.
- Implement a minimal virtio console.
9. Real-World Connections
9.1 Industry Applications
- Cloud hypervisors: KVM is the basis of most Linux cloud platforms.
- Security sandboxes: Small VMs isolate untrusted code.
9.2 Related Open Source Projects
- QEMU: full device emulation and KVM integration.
- Firecracker: minimal VMM for microVMs.
9.3 Interview Relevance
- VM lifecycle, memory translation, and device emulation are common systems interview topics.
10. Resources
10.1 Essential Reading
- Intel SDM, VMX chapters
- Linux KVM API documentation
10.2 Video Resources
- “KVM Internals” conference talks
10.3 Tools & Documentation
- KVM API documentation (kernel.org)
- qemu-system-x86_64 manual
10.4 Related Projects in This Series
- P05-write-a-simple-virtual-machine-monitor-vmm-virtual.md
- P06-build-a-mini-cloud-platform-cloud-infrastructure-d.md
11. Self-Assessment Checklist
11.1 Understanding
- I can explain KVM’s ioctl lifecycle without notes.
- I can describe GVA -> GPA -> HPA translation.
- I understand why device emulation is required.
11.2 Implementation
- VM boots and halts deterministically.
- UART output is correct.
- Exit logs are clear and reproducible.
11.3 Growth
- I can explain this project in a technical interview.
- I documented debugging lessons.
12. Submission / Completion Criteria
Minimum Viable Completion:
- VM boots, prints output, halts.
- UART emulation works.
- Exit codes are correct.
Full Completion:
- All minimum criteria plus:
- Deterministic exit trace.
- Error handling for missing guest or invalid GPA.
Excellence (Going Above & Beyond):
- Add MMIO device emulation or protected mode guest.
- Provide a short write-up of VM exits observed.