Project 14: KVM Userspace Client
Build a userspace program that uses Linux’s KVM API to run a virtual machine with hardware acceleration, without writing any kernel code.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Advanced (Level 3: The Engineer) |
| Time Estimate | 2-3 weeks |
| Language | C (alternatives: Rust, Go) |
| Prerequisites | Linux system programming, basic x86 knowledge, understanding of virtualization concepts |
| Key Topics | KVM API, /dev/kvm ioctls, VM/VCPU management, KVM_RUN, VM exits, hardware virtualization |
1. Learning Objectives
By completing this project, you will:
- Master the Linux KVM API for creating and managing virtual machines
- Learn how to set up guest memory using KVM_SET_USER_MEMORY_REGION
- Understand the KVM_RUN loop and how to handle VM exits
- Implement I/O and MMIO emulation in userspace
- Learn how QEMU uses KVM for hardware acceleration
- Build the foundation for production-quality hypervisor tools
2. Theoretical Foundation
2.1 Core Concepts
What is KVM?
KVM (Kernel-based Virtual Machine) is a Linux kernel module that turns Linux into a Type-1 hypervisor:
┌────────────────────────────────────────────────────────────────────────┐
│ KVM Architecture │
├────────────────────────────────────────────────────────────────────────┤
│ │
│ Without KVM (QEMU alone) With KVM (QEMU + KVM) │
│ ════════════════════════ ═══════════════════════ │
│ │
│ ┌──────────────────────┐ ┌──────────────────────┐ │
│ │ Guest Code │ │ Guest Code │ │
│ └──────────┬───────────┘ └──────────┬───────────┘ │
│ │ │ │
│ │ Every instruction │ Most instructions │
│ │ emulated in software │ run on real CPU! │
│ ▼ ▼ │
│ ┌──────────────────────┐ ┌──────────────────────┐ │
│ │ QEMU TCG Emulator │ │ KVM Module │ │
│ │ (Software CPU) │ │ (Uses VT-x/AMD-V) │ │
│ │ │ │ │ │
│ │ - Fetch instruction │ │ - Guest runs at │ │
│ │ - Decode │ │ native speed │ │
│ │ - Emulate │ │ - VMX non-root mode │ │
│ │ - Very slow! │ │ - Exits only on I/O │ │
│ └──────────────────────┘ └──────────────────────┘ │
│ │ │ │
│ │ │ VM Exit │
│ │ ▼ (for I/O, etc.) │
│ │ ┌──────────────────────┐ │
│ │ │ QEMU/Your Code │ │
│ │ │ (Device emulation) │ │
│ │ └──────────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────────────┐ ┌──────────────────────┐ │
│ │ Host Hardware │ │ Host Hardware │ │
│ └──────────────────────┘ └──────────────────────┘ │
│ │
│ Speed: ~10-100x slower Speed: Near-native! │
│ │
└────────────────────────────────────────────────────────────────────────┘
The KVM API Model
KVM exposes hardware virtualization through a simple ioctl-based API:
┌────────────────────────────────────────────────────────────────────────┐
│ KVM API Hierarchy │
├────────────────────────────────────────────────────────────────────────┤
│ │
│ /dev/kvm (System-level) │
│ ═════════════════════════ │
│ │ │
│ │ ioctl(kvm_fd, KVM_GET_API_VERSION) → Check API compatibility │
│ │ ioctl(kvm_fd, KVM_CHECK_EXTENSION) → Check for features │
│ │ ioctl(kvm_fd, KVM_CREATE_VM) → Create a new VM │
│ │ │ │
│ │ │ Returns VM file descriptor │
│ │ ▼ │
│ │ │
│ └── VM fd (Per-VM) │
│ ═══════════════ │
│ │ │
│ │ ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION) → Map guest memory │
│ │ ioctl(vm_fd, KVM_CREATE_VCPU) → Create vCPU │
│ │ ioctl(vm_fd, KVM_CREATE_IRQCHIP) → Create virtual PIC │
│ │ ioctl(vm_fd, KVM_CREATE_PIT2) → Create virtual PIT │
│ │ │ │
│ │ │ Returns VCPU file descriptor │
│ │ ▼ │
│ │ │
│ └── VCPU fd (Per-vCPU) │
│ ═══════════════════ │
│ │ │
│ │ mmap(vcpu_fd, KVM_GET_VCPU_MMAP_SIZE) → Get kvm_run struct│
│ │ ioctl(vcpu_fd, KVM_GET_REGS) → Read registers │
│ │ ioctl(vcpu_fd, KVM_SET_REGS) → Write registers │
│ │ ioctl(vcpu_fd, KVM_GET_SREGS) → Read segment regs │
│ │ ioctl(vcpu_fd, KVM_SET_SREGS) → Write segment regs│
│ │ ioctl(vcpu_fd, KVM_RUN) → Run guest! │
│ │ │
│ │ │
│ └── kvm_run struct (shared memory) │
│ ══════════════════════════════ │
│ │
│ Updated by KVM after each VM exit: │
│ - exit_reason (why did guest stop?) │
│ - io (I/O port info if exit was for I/O) │
│ - mmio (MMIO info if exit was for MMIO) │
│ - hypercall (if guest made hypercall) │
│ - etc. │
│ │
└────────────────────────────────────────────────────────────────────────┘
The KVM_RUN Loop
┌────────────────────────────────────────────────────────────────────────┐
│ KVM_RUN Loop │
├────────────────────────────────────────────────────────────────────────┤
│ │
│ Your Code KVM (Kernel) Hardware │
│ ═════════ ════════════ ════════ │
│ │ │ │ │
│ │ ioctl(vcpu_fd, KVM_RUN) │ │ │
│ ├────────────────────────────►│ │ │
│ │ │ │ │
│ │ │ VMLAUNCH/VMRESUME │ │
│ │ ├───────────────────────►│ │
│ │ │ │ │
│ │ │ Guest executes on │ │
│ │ │ real CPU hardware │ │
│ │ (your process is │ (VMX non-root mode) │ │
│ │ blocked here) │ │ │ │
│ │ │ │ Sensitive │ │
│ │ │ │ operation │ │
│ │ │ ▼ │ │
│ │ │ VM Exit ◄─────────┤ │
│ │ │◄───────────────────────┤ │
│ │ │ │ │
│ │ │ Save guest state │ │
│ │ │ Update kvm_run struct │ │
│ │ │ │ │
│ │ ioctl returns │ │ │
│ │◄────────────────────────────┤ │ │
│ │ │ │ │
│ │ Check run->exit_reason │ │ │
│ │ Handle exit: │ │ │
│ │ - I/O: emulate device │ │ │
│ │ - MMIO: emulate device │ │ │
│ │ - HLT: wait for interrupt │ │ │
│ │ - SHUTDOWN: terminate │ │ │
│ │ │ │ │
│ │ ioctl(vcpu_fd, KVM_RUN) │ │ │
│ ├────────────────────────────►│ │ │
│ │ │ │ │
│ │ ... loop continues ... │ │
│ │
└────────────────────────────────────────────────────────────────────────┘
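In C, the loop above reduces to a small dispatcher. A minimal sketch, assuming serial output on COM1 only (the `handle_exit` helper name and the stdout "device" are illustrative, not part of the KVM API):

```c
#include <linux/kvm.h>
#include <stdbool.h>
#include <stdio.h>

/* Sketch of the exit dispatcher driven by the KVM_RUN loop.
   Returns false when the guest should stop running. */
static bool handle_exit(struct kvm_run *run)
{
    switch (run->exit_reason) {
    case KVM_EXIT_IO:
        /* The I/O data lives inside the shared kvm_run mapping,
           at run->io.data_offset bytes from its start. */
        if (run->io.direction == KVM_EXIT_IO_OUT && run->io.port == 0x3F8) {
            const char *data = (const char *)run + run->io.data_offset;
            fwrite(data, run->io.size, run->io.count, stdout);
        }
        return true;            /* re-enter the guest */
    case KVM_EXIT_HLT:          /* guest executed HLT */
    case KVM_EXIT_SHUTDOWN:     /* e.g. triple fault */
        return false;
    default:
        fprintf(stderr, "unhandled exit reason %u\n", run->exit_reason);
        return false;
    }
}
```

The caller drives it with the ioctl at the top of the loop: `do { ioctl(vcpu_fd, KVM_RUN, 0); } while (handle_exit(run));`.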
Memory Setup
┌────────────────────────────────────────────────────────────────────────┐
│ KVM Memory Architecture │
├────────────────────────────────────────────────────────────────────────┤
│ │
│ Your Process (Userspace) │
│ ════════════════════════ │
│ │
│ void *mem = mmap(NULL, 128MB, PROT_READ|PROT_WRITE, │
│ MAP_PRIVATE|MAP_ANONYMOUS, -1, 0); │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Your process virtual address space │ │
│ │ │ │
│ │ 0x7f1234000000 ┌────────────────────────────────┐ │ │
│ │ │ │ │ │
│ │ │ Allocated memory (128MB) │ │ │
│ │ │ This becomes guest RAM │ │ │
│ │ │ │ │ │
│ │ 0x7f123C000000 └────────────────────────────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ │ KVM_SET_USER_MEMORY_REGION │
│ │ │
│ struct kvm_userspace_memory_region region = { │
│ .slot = 0, │
│ .guest_phys_addr = 0, // GPA starts at 0 │
│ .memory_size = 128*1024*1024, │
│ .userspace_addr = (uint64_t)mem, │
│ }; │
│       ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);               │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Guest Physical Address Space │ │
│ │ │ │
│ │ GPA 0x00000000 ┌────────────────────────────────┐ │ │
│ │ │ │ │ │
│ │ │ Guest sees this as its │ │ │
│ │ │ physical memory │ │ │
│ │ │ │ │ │
│ │ GPA 0x08000000 └────────────────────────────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ When guest accesses GPA 0x1000: │
│ - KVM/Hardware translates via EPT │
│ - Actually accesses your mmap'd memory at offset 0x1000 │
│ - Transparent to guest! │
│ │
└────────────────────────────────────────────────────────────────────────┘
2.2 Why This Matters
This is how QEMU works with KVM!
When you run:
qemu-system-x86_64 -enable-kvm -m 2G -hda disk.img
QEMU does exactly what you’ll learn:
- Opens /dev/kvm
- Creates VM and VCPU file descriptors
- Maps memory regions
- Runs the KVM_RUN loop
- Handles exits for I/O emulation
Real-World Applications:
- Cloud platforms: AWS Nitro, GCP, Azure all use KVM
- Container runtimes: Kata Containers, Firecracker use KVM
- Development tools: Android Emulator uses KVM on Linux
- Security: Sandboxing with hardware isolation
Career Value:
- KVM expertise is highly valued in cloud computing
- Understanding enables better VM performance tuning
- Foundation for building specialized VMMs (like Firecracker)
2.3 Historical Context
2006: Avi Kivity creates KVM for x86
- Simple kernel module approach
- Reuses Linux kernel for scheduling, memory management
- Radical departure from monolithic hypervisors
2007: KVM merged into Linux 2.6.20
- First virtualization solution in mainline kernel
- Accelerated adoption of Linux virtualization
2008-2010: KVM + QEMU becomes the standard
- QEMU provides device models
- KVM provides hardware virtualization
- Libvirt provides management
Today: KVM dominates cloud computing
- AWS uses KVM (Nitro hypervisor)
- Google Cloud uses KVM
- Most OpenStack deployments use KVM
- Over 1 billion VMs run on KVM
2.4 Common Misconceptions
Misconception 1: “KVM is a separate hypervisor”
- Reality: KVM makes Linux itself the hypervisor. Linux handles scheduling, memory management, etc.
Misconception 2: “KVM is complex like VMware”
- Reality: The KVM API is surprisingly simple. ~10 ioctls cover most functionality.
Misconception 3: “You need QEMU to use KVM”
- Reality: KVM can be used standalone. QEMU just provides device emulation. Firecracker proves you can build your own VMM.
Misconception 4: “KVM handles everything”
- Reality: KVM handles CPU and memory virtualization. Your code must handle I/O emulation.
3. Project Specification
3.1 What You Will Build
A standalone userspace program (no QEMU!) that:
- Opens /dev/kvm and creates a VM
- Sets up guest memory
- Creates a VCPU and initializes its state
- Loads and runs guest code
- Handles VM exits (I/O, MMIO, HLT)
- Displays guest output
3.2 Functional Requirements
- KVM Setup: Open /dev/kvm, verify API version, create VM
- Memory: Allocate and map guest memory (at least 16MB)
- VCPU: Create VCPU, initialize registers for real mode
- Guest Loading: Load binary guest code into memory
- Run Loop: Execute KVM_RUN loop
- Exit Handling: Handle KVM_EXIT_IO, KVM_EXIT_HLT, KVM_EXIT_MMIO
3.3 Non-Functional Requirements
- Pure Userspace: No kernel modules required (uses existing KVM)
- Minimal Dependencies: Only libc (no QEMU libraries)
- Clear Code: Well-structured, educational implementation
- Debuggable: Verbose mode showing all VM exits and register state
3.4 Example Usage / Output
$ ./kvmclient guest.bin
[KVM] Opening /dev/kvm...
[KVM] API version: 12 (expected 12) - OK
[KVM] Checking extensions:
- KVM_CAP_USER_MEMORY: Yes
- KVM_CAP_SET_TSS_ADDR: Yes
- KVM_CAP_IRQCHIP: Yes
- KVM_CAP_EXT_CPUID: Yes
[KVM] Creating VM...
[KVM] VM file descriptor: 4
[KVM] Setting up memory:
- Allocating 128MB for guest RAM
- Userspace address: 0x7f1234000000
- Setting memory region:
* Slot: 0
* Guest physical address: 0x0
* Size: 134217728 bytes (128MB)
[KVM] Creating VCPU 0...
[KVM] VCPU file descriptor: 5
[KVM] Mapping kvm_run structure (8192 bytes)
[KVM] Initializing VCPU state...
[KVM] Setting segment registers (real mode):
- CS: selector=0x0000, base=0x00000000, limit=0xFFFF
- DS: selector=0x0000, base=0x00000000, limit=0xFFFF
- ES: selector=0x0000, base=0x00000000, limit=0xFFFF
- SS: selector=0x0000, base=0x00000000, limit=0xFFFF
- CR0: 0x60000010 (PE=0, real mode)
[KVM] Setting general registers:
- RIP: 0x0000 (start of loaded code)
- RSP: 0xFFFE
- RFLAGS: 0x0002
[KVM] Loading guest code:
- File: guest.bin
- Size: 512 bytes
- Load address: 0x0000
[KVM] Starting VCPU execution...
[KVM] ═══════════════════════════════════════════════════════════════
[KVM] Running... (KVM_RUN ioctl)
[KVM] VM Exit #1:
- Exit reason: KVM_EXIT_IO (2)
- Direction: OUT
- Port: 0x03F8 (COM1)
- Size: 1 byte
- Data: 0x48 ('H')
[KVM] Emulating: Serial output 'H'
[KVM] Running... (KVM_RUN ioctl)
[KVM] VM Exit #2:
- Exit reason: KVM_EXIT_IO (2)
- Port: 0x03F8, OUT, data: 0x65 ('e')
[KVM] VM Exit #3:
- Exit reason: KVM_EXIT_IO (2)
- Port: 0x03F8, OUT, data: 0x6C ('l')
[KVM] VM Exit #4:
- Exit reason: KVM_EXIT_IO (2)
- Port: 0x03F8, OUT, data: 0x6C ('l')
[KVM] VM Exit #5:
- Exit reason: KVM_EXIT_IO (2)
- Port: 0x03F8, OUT, data: 0x6F ('o')
[KVM] Running... (KVM_RUN ioctl)
[KVM] VM Exit #6:
- Exit reason: KVM_EXIT_HLT (5)
- Guest executed HLT instruction
[KVM] ═══════════════════════════════════════════════════════════════
[KVM] Guest Output: Hello
[KVM] VCPU Final State:
- RAX: 0x0000000000000048
- RBX: 0x0000000000000000
- RCX: 0x0000000000000000
- RDX: 0x00000000000003F8
- RIP: 0x0000000000000012
- RFLAGS: 0x0000000000000002
[KVM] Statistics:
- Total VM exits: 6
- I/O exits: 5
- HLT exits: 1
[KVM] Cleanup complete.
4. Solution Architecture
4.1 High-Level Design
┌────────────────────────────────────────────────────────────────────────┐
│ KVM Client Architecture │
├────────────────────────────────────────────────────────────────────────┤
│ │
│ main.c │
│ ══════ │
│ │ │
│ │ main() { │
│ │ parse_args(); // Get guest binary path │
│ │ kvm_init(&vm); // Open /dev/kvm, create VM │
│ │ setup_memory(&vm); // Allocate and map guest RAM │
│ │ create_vcpu(&vm); // Create VCPU, map kvm_run │
│ │ init_vcpu_state(&vm); // Set registers for real mode │
│ │ load_guest(&vm, path); // Load guest binary │
│ │ run_vm(&vm); // Main KVM_RUN loop │
│ │ cleanup(&vm); // Close FDs, free memory │
│ │ } │
│ │ │
│ └───────────────────────────────────────────────────────────────────│
│ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ VM State │ │
│ │ │ │
│ │ struct vm { │ │
│ │ int kvm_fd; // /dev/kvm file descriptor │ │
│ │ int vm_fd; // VM file descriptor │ │
│ │ int vcpu_fd; // VCPU file descriptor │ │
│ │ │ │
│ │ void *mem; // Guest RAM (mmap'd) │ │
│ │ size_t mem_size; // Guest RAM size │ │
│ │ │ │
│ │ struct kvm_run *run; // Shared kvm_run structure │ │
│ │ size_t run_size; // Size of kvm_run mmap │ │
│ │ }; │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────┐ ┌────────────────────┐ ┌────────────────────┐ │
│ │ kvm_init() │ │ setup_memory() │ │ create_vcpu() │ │
│ │ │ │ │ │ │ │
│ │ - open /dev/kvm │ │ - mmap guest RAM │ │ - KVM_CREATE_VCPU │ │
│ │ - check version │ │ - KVM_SET_USER_ │ │ - mmap kvm_run │ │
│ │ - KVM_CREATE_VM │ │ MEMORY_REGION │ │ │ │
│ └────────────────────┘ └────────────────────┘ └────────────────────┘ │
│ │
│ ┌────────────────────┐ ┌────────────────────────────────────────────┐│
│ │ init_vcpu_state() │ │ run_vm() ││
│ │ │ │ ││
│ │ - KVM_SET_SREGS │ │ while (running) { ││
│ │ (segments, CR0) │ │ ioctl(vcpu_fd, KVM_RUN); ││
│ │ - KVM_SET_REGS │ │ ││
│ │ (RIP, RSP, etc.) │ │ switch (run->exit_reason) { ││
│ └────────────────────┘ │ case KVM_EXIT_IO: ││
│ │ handle_io(run); ││
│ ┌────────────────────┐ │ case KVM_EXIT_MMIO: ││
│ │ load_guest() │ │ handle_mmio(run); ││
│ │ │ │ case KVM_EXIT_HLT: ││
│ │ - Read binary file │ │ running = false; ││
│ │ - Copy to guest mem│ │ case KVM_EXIT_SHUTDOWN: ││
│ └────────────────────┘ │ running = false; ││
│ │ } ││
│ │ } ││
│ └────────────────────────────────────────────┘│
│ │
└────────────────────────────────────────────────────────────────────────┘
4.2 Key Components
1. KVM System Interface
Opens /dev/kvm and verifies API compatibility:
int kvm_fd = open("/dev/kvm", O_RDWR);
int version = ioctl(kvm_fd, KVM_GET_API_VERSION, 0);
2. VM Creation
Creates a VM instance:
int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);
3. Memory Setup
Maps userspace memory into guest physical address space:
void *mem = mmap(NULL, size, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
struct kvm_userspace_memory_region region = {...};
ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
4. VCPU Creation
Creates virtual CPU and maps shared kvm_run structure:
int vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0);
struct kvm_run *run = mmap(..., vcpu_fd, 0);
5. Exit Handling
Processes VM exits in the KVM_RUN loop:
- KVM_EXIT_IO: Guest performed port I/O
- KVM_EXIT_MMIO: Guest accessed unmapped memory
- KVM_EXIT_HLT: Guest executed HLT instruction
4.3 Data Structures
/* Main VM state */
struct vm {
/* File descriptors */
int kvm_fd; /* /dev/kvm */
int vm_fd; /* VM instance */
int vcpu_fd; /* Virtual CPU */
/* Memory */
void *mem; /* Guest RAM (userspace address) */
size_t mem_size; /* Guest RAM size */
/* VCPU run area */
struct kvm_run *run;/* Shared with kernel */
size_t run_size; /* mmap size */
/* I/O handling */
char serial_buf[4096];
int serial_pos;
};
/* From linux/kvm.h - key structures */
struct kvm_run {
__u8 request_interrupt_window;
__u8 immediate_exit;
__u8 padding1[6];
__u32 exit_reason; /* Why did the guest exit? */
__u8 ready_for_interrupt_injection;
__u8 if_flag;
__u16 flags;
__u64 cr8;
__u64 apic_base;
union {
/* KVM_EXIT_IO */
struct {
__u8 direction; /* IN or OUT */
__u8 size; /* 1, 2, or 4 bytes */
__u16 port; /* I/O port number */
__u32 count; /* Number of transfers */
__u64 data_offset; /* Offset in run struct to data */
} io;
/* KVM_EXIT_MMIO */
struct {
__u64 phys_addr; /* Guest physical address */
__u8 data[8]; /* Data (for write) or buffer (for read) */
__u32 len; /* Length of access */
__u8 is_write; /* 1 = write, 0 = read */
} mmio;
/* KVM_EXIT_HYPERCALL */
struct {
__u64 nr;
__u64 args[6];
__u64 ret;
__u32 longmode;
__u32 pad;
} hypercall;
/* ... other exit types ... */
};
};
struct kvm_regs {
__u64 rax, rbx, rcx, rdx;
__u64 rsi, rdi, rsp, rbp;
__u64 r8, r9, r10, r11, r12, r13, r14, r15;
__u64 rip, rflags;
};
struct kvm_sregs {
struct kvm_segment cs, ds, es, fs, gs, ss;
struct kvm_segment tr, ldt;
struct kvm_dtable gdt, idt;
__u64 cr0, cr2, cr3, cr4, cr8;
__u64 efer;
__u64 apic_base;
__u64 interrupt_bitmap[4];
};
struct kvm_segment {
__u64 base;
__u32 limit;
__u16 selector;
__u8 type;
__u8 present, dpl, db, s, l, g, avl;
__u8 unusable;
__u8 padding;
};
4.4 Algorithm Overview
ALGORITHM: KVM Client Execution
1. INITIALIZE KVM:
a. Open /dev/kvm
b. Verify API version (must be 12)
c. Check required extensions
2. CREATE VM:
a. ioctl(kvm_fd, KVM_CREATE_VM, 0)
b. Save VM file descriptor
3. SETUP MEMORY:
a. Allocate memory with mmap (anonymous, private)
b. Create kvm_userspace_memory_region structure
c. ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region)
4. CREATE VCPU:
a. ioctl(vm_fd, KVM_CREATE_VCPU, 0)
b. Get kvm_run mmap size: ioctl(kvm_fd, KVM_GET_VCPU_MMAP_SIZE)
c. mmap kvm_run structure from vcpu_fd
5. INITIALIZE VCPU STATE:
a. Get current sregs: ioctl(vcpu_fd, KVM_GET_SREGS, &sregs)
b. Set segment registers for real mode:
- CS/DS/ES/SS: selector=0, base=0, limit=0xFFFF
- CR0: Clear PE bit (real mode)
c. Set sregs: ioctl(vcpu_fd, KVM_SET_SREGS, &sregs)
d. Set initial registers:
- RIP = load address
- RSP = stack address
- RFLAGS = 0x2
6. LOAD GUEST:
a. Read guest binary file
b. Copy to guest memory at load address
7. RUN LOOP:
WHILE running:
a. ioctl(vcpu_fd, KVM_RUN, 0)
b. Check run->exit_reason:
CASE KVM_EXIT_IO:
IF direction == OUT:
Get data from run + run->io.data_offset
Emulate output (e.g., serial port)
ELSE:
Emulate input, put data at run + run->io.data_offset
CASE KVM_EXIT_MMIO:
IF is_write:
Emulate MMIO write
ELSE:
Put read data in run->mmio.data
CASE KVM_EXIT_HLT:
running = false
CASE KVM_EXIT_SHUTDOWN:
running = false
CASE KVM_EXIT_FAIL_ENTRY:
Print error, exit
8. CLEANUP:
a. munmap kvm_run structure
b. close(vcpu_fd)
c. munmap guest memory
d. close(vm_fd)
e. close(kvm_fd)
5. Implementation Guide
5.1 Development Environment Setup
# Check KVM is available
$ ls -la /dev/kvm
crw-rw---- 1 root kvm 10, 232 Dec 29 10:00 /dev/kvm
# Add yourself to kvm group (if needed)
$ sudo usermod -aG kvm $USER
# Then logout/login
# Verify KVM works
$ cat /sys/module/kvm_intel/parameters/nested   # kvm_amd on AMD CPUs
# Or
$ lsmod | grep kvm
# Install development tools
$ sudo apt-get install build-essential
# Create project
$ mkdir kvmclient && cd kvmclient
5.2 Project Structure
kvmclient/
├── Makefile
├── kvmclient.c # Main implementation (single file for simplicity)
└── guest/
├── guest.asm # Assembly guest code
├── guest.bin # Compiled guest binary
└── Makefile # Build guest
5.3 The Core Question You’re Answering
“How does userspace software control hardware virtualization to run arbitrary guest code at near-native speed while maintaining control over sensitive operations?”
The answer:
- KVM exposes VT-x/AMD-V through a simple file descriptor interface
- Userspace sets up memory and initial CPU state
- KVM_RUN transfers control to guest via VMLAUNCH/VMRESUME
- On sensitive operations, hardware traps back (VM exit)
- KVM returns to userspace with exit information
- Userspace handles the exit, then calls KVM_RUN again
5.4 Concepts You Must Understand First
KVM API:
- What is an ioctl and how does it work?
- What are file descriptors and how do they represent resources?
- How does mmap create shared memory between user/kernel?
x86 Architecture:
- What is the difference between real mode and protected mode?
- What are segment registers and how do they work in real mode?
- What is port I/O vs. memory-mapped I/O?
Virtualization:
- What causes a VM exit?
- What is the guest physical address space?
- How does the CPU know what code to run?
Book References:
- Linux Programming Interface, Chapter 4 (File I/O)
- Intel SDM Vol. 1, Chapter 20 (Real Mode)
- KVM API Documentation (kernel.org)
5.5 Questions to Guide Your Design
Initialization:
- What version of the KVM API should you expect?
- What extensions are required vs. optional?
- What order must operations be performed?
Memory:
- How much memory does the guest need?
- At what guest physical address should memory start?
- Can you have multiple memory regions?
VCPU State:
- What segment register values indicate real mode?
- What should CR0 contain for real mode?
- What initial register values make sense?
Exit Handling:
- How do you distinguish between I/O ports?
- How do you read data for OUT instructions?
- What if the guest performs an unhandled operation?
5.6 Thinking Exercise
Trace through this guest code execution:
Guest code (at address 0x0000):
mov dx, 0x3F8 ; Serial port
mov al, 'A' ; Character to send
out dx, al ; Output to serial port
hlt ; Halt
Step 1: KVM_RUN called
- KVM enters VMX non-root mode
- CPU starts executing at RIP=0x0000
Step 2: mov dx, 0x3F8 executes
- Simple register move
- No VM exit (not sensitive)
Step 3: mov al, 'A' executes
- Simple register move
- No VM exit
Step 4: out dx, al executes
- I/O port access is sensitive
- CPU triggers VM exit
- KVM returns to userspace
Step 5: Your code handles exit
switch (run->exit_reason) {
case KVM_EXIT_IO:
// run->io.direction = KVM_EXIT_IO_OUT (1)
// run->io.port = 0x3F8
// run->io.size = 1
// data at (run + run->io.data_offset) = 'A'
printf("%c", *(char *)((char *)run + run->io.data_offset));
break;
}
Step 6: KVM_RUN called again
- RIP now points to HLT instruction
- Guest executes HLT
- VM exit with KVM_EXIT_HLT
Step 7: Your code handles HLT
case KVM_EXIT_HLT:
printf("\nGuest halted.\n");
running = false;
break;
Questions:
- What happens if you don’t handle the I/O exit?
- What if the guest wrote to port 0x80 (debug port)?
- How would you handle input (IN instruction)?
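For the last question, one sketch: an IN exit is satisfied by writing the result into the same `data_offset` slot before the next KVM_RUN, so the guest's IN instruction picks it up on re-entry. The stub below fakes just enough of the COM1 line status register for a polling loop; the 0xFF default for unclaimed ports mirrors real ISA bus behavior:

```c
#include <linux/kvm.h>
#include <stdint.h>

/* Sketch: handle KVM_EXIT_IO with direction == KVM_EXIT_IO_IN by
   storing the "device" result where the guest's IN will read it. */
static void handle_io_in(struct kvm_run *run)
{
    uint8_t *data = (uint8_t *)run + run->io.data_offset;
    switch (run->io.port) {
    case 0x3FD:          /* COM1 line status register */
        *data = 0x20;    /* bit 5: transmit holding register empty */
        break;
    default:
        *data = 0xFF;    /* unclaimed I/O ports read as all-ones */
        break;
    }
}
```

A real serial model would also report "data ready" (bit 0) when host input is queued, and serve the data itself from port 0x3F8.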
5.7 Hints in Layers
Hint 1 - Starting Point (Conceptual Direction)
Build incrementally:
- First: Open /dev/kvm and create VM
- Then: Set up memory
- Then: Create VCPU
- Then: Initialize state (hardest part!)
- Finally: Run loop and exit handling
Hint 2 - Next Level (More Specific Guidance)
Real mode segment setup is tricky. Here’s the key insight:
- In real mode, segment registers work differently
- For KVM, set selector=0, base=0, limit=0xFFFF
- CR0 bit 0 (PE) must be 0 for real mode
- But KVM requires some other CR0 bits set!
/* Real mode segment template */
struct kvm_segment seg = {
.base = 0,
.limit = 0xFFFF,
.selector = 0,
.type = 3, /* Data, read/write, accessed */
.present = 1,
.dpl = 0,
.db = 0, /* 16-bit */
.s = 1, /* Code/data segment */
.l = 0, /* Not 64-bit */
.g = 0, /* Byte granularity */
};
Hint 3 - Technical Details (Approach/Pseudocode)
/* Complete initialization sequence */
int kvm_fd = open("/dev/kvm", O_RDWR);
/* Check version */
int version = ioctl(kvm_fd, KVM_GET_API_VERSION, 0);
if (version != 12) {
fprintf(stderr, "KVM API version mismatch\n");
exit(1);
}
/* Create VM */
int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);
/* Set up memory */
size_t mem_size = 0x8000000; /* 128MB */
void *mem = mmap(NULL, mem_size, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
struct kvm_userspace_memory_region region = {
.slot = 0,
.guest_phys_addr = 0,
.memory_size = mem_size,
.userspace_addr = (uint64_t)mem,
};
ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
/* Create VCPU */
int vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0);
/* Map kvm_run */
size_t run_size = ioctl(kvm_fd, KVM_GET_VCPU_MMAP_SIZE, 0);
struct kvm_run *run = mmap(NULL, run_size, PROT_READ | PROT_WRITE,
MAP_SHARED, vcpu_fd, 0);
/* Initialize segments (real mode) */
struct kvm_sregs sregs;
ioctl(vcpu_fd, KVM_GET_SREGS, &sregs);
/* CS is special - must be executable */
sregs.cs.base = 0;
sregs.cs.limit = 0xFFFF;
sregs.cs.selector = 0;
sregs.cs.type = 11; /* Execute/Read, accessed */
sregs.cs.present = 1;
sregs.cs.dpl = 0;
sregs.cs.db = 0;
sregs.cs.s = 1;
sregs.cs.l = 0;
sregs.cs.g = 0;
/* DS, ES, SS similar but data type */
/* ... */
/* CR0: PE=0 for real mode, but some bits required by KVM */
sregs.cr0 = 0x60000010; /* reset value: CD=1, NW=1, ET=1 */
sregs.cr0 &= ~1ULL; /* ensure PE=0 for real mode */
ioctl(vcpu_fd, KVM_SET_SREGS, &sregs);
/* Initialize general registers */
struct kvm_regs regs = {0};
regs.rip = 0; /* Start at address 0 */
regs.rsp = 0xFFFE; /* Stack at top of first 64KB */
regs.rflags = 2; /* Bit 1 always set */
ioctl(vcpu_fd, KVM_SET_REGS, &regs);
Hint 4 - Tools/Debugging (Verification Methods)
/* Debug helper: dump VCPU state */
void dump_regs(int vcpu_fd) {
struct kvm_regs regs;
struct kvm_sregs sregs;
ioctl(vcpu_fd, KVM_GET_REGS, &regs);
ioctl(vcpu_fd, KVM_GET_SREGS, &sregs);
printf("=== VCPU State ===\n");
printf("RIP: %016llx RSP: %016llx\n", regs.rip, regs.rsp);
printf("RAX: %016llx RBX: %016llx\n", regs.rax, regs.rbx);
printf("RCX: %016llx RDX: %016llx\n", regs.rcx, regs.rdx);
printf("CR0: %016llx CR4: %016llx\n", sregs.cr0, sregs.cr4);
printf("CS: sel=%04x base=%08llx limit=%08x\n",
sregs.cs.selector, sregs.cs.base, sregs.cs.limit);
}
/* Debug: print exit reason */
const char *exit_reason_str(int reason) {
switch (reason) {
case KVM_EXIT_IO: return "IO";
case KVM_EXIT_MMIO: return "MMIO";
case KVM_EXIT_HLT: return "HLT";
case KVM_EXIT_SHUTDOWN: return "SHUTDOWN";
case KVM_EXIT_FAIL_ENTRY: return "FAIL_ENTRY";
case KVM_EXIT_INTERNAL_ERROR: return "INTERNAL_ERROR";
default: return "UNKNOWN";
}
}
5.8 The Interview Questions They’ll Ask
- “What is the KVM API and how does it work?”
- File descriptor based (/dev/kvm → VM fd → VCPU fd)
- ioctls for control operations
- mmap for shared kvm_run structure
- KVM_RUN enters guest, returns on exit
- “How does KVM_RUN work?”
- Ioctl blocks, KVM executes VMLAUNCH/VMRESUME
- Guest runs until sensitive operation
- VM exit occurs, KVM populates kvm_run
- Ioctl returns to userspace
- Userspace handles exit, calls KVM_RUN again
- “What are the common VM exit reasons?”
- KVM_EXIT_IO: Port I/O
- KVM_EXIT_MMIO: Memory-mapped I/O
- KVM_EXIT_HLT: Guest halted
- KVM_EXIT_SHUTDOWN: Triple fault or shutdown
- KVM_EXIT_INTERNAL_ERROR: KVM error
- “How do you set up memory for a KVM guest?”
- Allocate with mmap (userspace)
- Use KVM_SET_USER_MEMORY_REGION
- Specify GPA, size, userspace address
- KVM/hardware handles translation
- “Why would you build a custom KVM client instead of using QEMU?”
- Minimal footprint (Firecracker approach)
- Specific device requirements
- Security (smaller attack surface)
- Performance (less overhead)
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| KVM API | KVM Documentation | api.txt |
| Linux System Programming | TLPI | Chapters 4, 49 |
| x86 Architecture | Intel SDM Vol. 1 | Chapters 3, 20 |
| Real Mode | Intel SDM Vol. 3A | Chapter 20 |
| Virtualization | Intel SDM Vol. 3C | Chapters 23-27 |
5.10 Implementation Phases
Phase 1: KVM Setup (Days 1-2)
Goal: Open /dev/kvm and create VM
int main() {
int kvm_fd = open("/dev/kvm", O_RDWR);
if (kvm_fd < 0) {
perror("open /dev/kvm");
return 1;
}
int version = ioctl(kvm_fd, KVM_GET_API_VERSION, 0);
printf("KVM API version: %d\n", version);
int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);
printf("VM created, fd = %d\n", vm_fd);
close(vm_fd);
close(kvm_fd);
return 0;
}
Validation:
- Program runs without error
- Prints API version 12
- Creates VM successfully
Phase 2: Memory Setup (Days 3-4)
Goal: Allocate and map guest memory
Add memory setup after creating VM:
- mmap 128MB anonymous memory
- Create kvm_userspace_memory_region
- Call KVM_SET_USER_MEMORY_REGION
Validation:
- No errors from ioctl
- Memory is allocated (check with /proc/self/maps)
Phase 3: VCPU Creation (Days 5-7)
Goal: Create VCPU with proper initial state
This is the trickiest part:
- Create VCPU
- Map kvm_run structure
- Initialize segment registers for real mode
- Initialize general registers
Validation:
- VCPU created successfully
- Can read initial register state
- State matches what you set
Phase 4: Guest Loading and Running (Days 8-10)
Goal: Load guest code and run
- Write simple guest assembly (print “Hi”, then HLT)
- Assemble to binary
- Load into guest memory
- Call KVM_RUN
Validation:
- KVM_RUN returns
- Exit reason is KVM_EXIT_IO or KVM_EXIT_HLT
Phase 5: Exit Handling (Days 11-14)
Goal: Handle exits and display output
- Handle KVM_EXIT_IO for serial output
- Handle KVM_EXIT_HLT to terminate
- Handle error cases
- Add verbose mode
Validation:
- Guest output is displayed
- Program terminates cleanly
- Various guest programs work
5.11 Key Implementation Decisions
Decision 1: Real Mode vs. Protected Mode
Real Mode (This Project)
- Simpler segment setup
- 16-bit code
- Limited to 1MB (but memory mapping can extend)
- Good for learning
Protected Mode
- More complex setup (GDT, TSS)
- 32-bit code
- Needed for real OS
- Add as extension
Decision 2: I/O Handling Approach
Simple (Print to stdout)
- Just print I/O data
- Good for testing
Device Emulation
- Actually emulate serial port registers
- Support both IN and OUT
- More realistic
Decision 3: Error Handling
Minimal (Exit on error)
- Simple, clear
- Good for learning
Robust (Continue if possible)
- Log errors
- Try to recover
- Production quality
6. Testing Strategy
6.1 Unit Tests
/* Test KVM availability */
void test_kvm_available(void) {
int fd = open("/dev/kvm", O_RDWR);
assert(fd >= 0 && "KVM not available");
int version = ioctl(fd, KVM_GET_API_VERSION, 0);
assert(version == 12 && "Wrong KVM API version");
close(fd);
printf("TEST: KVM available - PASS\n");
}
/* Test VM creation */
void test_vm_create(void) {
int kvm_fd = open("/dev/kvm", O_RDWR);
int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);
assert(vm_fd >= 0 && "Failed to create VM");
close(vm_fd);
close(kvm_fd);
printf("TEST: VM creation - PASS\n");
}
/* Test memory setup */
void test_memory_setup(void) {
int kvm_fd = open("/dev/kvm", O_RDWR);
int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);
void *mem = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
assert(mem != MAP_FAILED);
struct kvm_userspace_memory_region region = {
.slot = 0,
.guest_phys_addr = 0,
.memory_size = 0x1000,
.userspace_addr = (uint64_t)mem,
};
int ret = ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
assert(ret == 0 && "Failed to set memory region");
munmap(mem, 0x1000);
close(vm_fd);
close(kvm_fd);
printf("TEST: Memory setup - PASS\n");
}
6.2 Integration Tests
#!/bin/bash
# Test 1: Basic execution
echo "Test 1: Basic execution..."
./kvmclient guest/hlt.bin 2>&1 | grep -q "EXIT_HLT" \
&& echo "PASS" || echo "FAIL"
# Test 2: Serial output
echo "Test 2: Serial output..."
./kvmclient guest/hello.bin 2>&1 | grep -q "Hello" \
&& echo "PASS" || echo "FAIL"
# Test 3: Memory access
echo "Test 3: Memory access..."
./kvmclient guest/memtest.bin 2>&1 | grep -q "Memory OK" \
&& echo "PASS" || echo "FAIL"
6.3 Guest Test Programs
hlt.asm - Just halt:
[BITS 16]
[ORG 0x0000]
hlt
hello.asm - Serial output:
[BITS 16]
[ORG 0x0000]
mov dx, 0x3F8 ; COM1 port
mov al, 'H'
out dx, al
mov al, 'e'
out dx, al
mov al, 'l'
out dx, al
mov al, 'l'
out dx, al
mov al, 'o'
out dx, al
hlt
Build guest:
nasm -f bin -o guest/hlt.bin guest/hlt.asm
nasm -f bin -o guest/hello.bin guest/hello.asm
7. Common Pitfalls & Debugging
| Problem | Root Cause | Fix | Verification |
|---|---|---|---|
| KVM_RUN fails immediately | Segment registers wrong | Check segment type, base, limit | Dump and compare with a working example |
| FAIL_ENTRY exit | Invalid guest state | Check CR0 and segment setup | Look at hardware_entry_failure_reason |
| Guest doesn’t execute | RIP wrong | Set RIP to the load address | Dump registers before KVM_RUN |
| I/O data wrong | Offset miscalculation | Use (uint8_t *)run + run->io.data_offset | Print the raw bytes |
| Permission denied | Not in kvm group | Add user to the kvm group | Check /dev/kvm permissions |
| No output | I/O exit not handled | Add an exit handler | Print exit_reason |
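The data_offset pitfall is easiest to avoid with a tiny helper; this sketch uses the real struct kvm_run layout (kvm_io_data is an illustrative name):

```c
#include <stdint.h>
#include <linux/kvm.h>

/* The I/O data lives inside the shared, page-sized kvm_run mapping,
 * at a byte offset KVM reports in run->io.data_offset. Treating
 * data_offset itself as a pointer is the classic bug. */
static inline uint8_t *kvm_io_data(struct kvm_run *run) {
    return (uint8_t *)run + run->io.data_offset;
}
```

Multi-count string I/O (`run->io.count > 1`) places `count * size` bytes starting at that same address.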
Debugging Techniques
/* These cases go inside the switch (run->exit_reason) dispatch
   of the KVM_RUN loop. */

/* Handle FAIL_ENTRY with details */
case KVM_EXIT_FAIL_ENTRY:
    fprintf(stderr, "KVM_EXIT_FAIL_ENTRY\n");
    fprintf(stderr, "  hardware_entry_failure_reason: 0x%llx\n",
            (unsigned long long)run->fail_entry.hardware_entry_failure_reason);
    dump_regs(vcpu_fd);
    exit(1);

/* Handle INTERNAL_ERROR with details */
case KVM_EXIT_INTERNAL_ERROR:
    fprintf(stderr, "KVM_EXIT_INTERNAL_ERROR\n");
    fprintf(stderr, "  suberror: %d\n", run->internal.suberror);
    if (run->internal.suberror == KVM_INTERNAL_ERROR_EMULATION) {
        fprintf(stderr, "  (emulation failed)\n");
    }
    dump_regs(vcpu_fd);
    exit(1);
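The handlers above call dump_regs; if your VMM does not have one yet, a minimal sketch using the real KVM_GET_REGS/KVM_GET_SREGS ioctls (the int return is for easy testing, not required by KVM):

```c
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Print the registers most often involved in entry failures.
 * Returns -1 if either ioctl fails (e.g. bad vcpu fd). */
int dump_regs(int vcpu_fd) {
    struct kvm_regs regs;
    struct kvm_sregs sregs;
    if (ioctl(vcpu_fd, KVM_GET_REGS, &regs) < 0 ||
        ioctl(vcpu_fd, KVM_GET_SREGS, &sregs) < 0) {
        fprintf(stderr, "dump_regs: KVM_GET_(S)REGS failed\n");
        return -1;
    }
    fprintf(stderr, "RIP=%llx RSP=%llx RFLAGS=%llx\n",
            (unsigned long long)regs.rip,
            (unsigned long long)regs.rsp,
            (unsigned long long)regs.rflags);
    fprintf(stderr, "CR0=%llx CS.base=%llx CS.sel=%x CS.limit=%x\n",
            (unsigned long long)sregs.cr0,
            (unsigned long long)sregs.cs.base,
            sregs.cs.selector, sregs.cs.limit);
    return 0;
}
```

Comparing this dump against a known-good run is usually the fastest way to spot a bad segment field.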
8. Extensions & Challenges
8.1 Protected Mode Guest
Extend to run 32-bit protected mode code:
- Set up GDT in guest memory
- Configure segment registers properly
- Set CR0.PE = 1
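A sketch of the segment state for flat 32-bit protected mode (setup_protected_mode is an illustrative name; selectors 0x08/0x10 assume a conventional GDT with code at entry 1 and data at entry 2):

```c
#include <linux/kvm.h>

/* Fill in flat 32-bit protected-mode segment state for KVM_SET_SREGS.
 * A matching GDT must still be written into guest memory if the guest
 * will reload segment registers itself. */
void setup_protected_mode(struct kvm_sregs *sregs) {
    struct kvm_segment seg = {
        .base = 0, .limit = 0xffffffff, .selector = 0x08,
        .type = 11,            /* code: execute/read, accessed */
        .present = 1, .dpl = 0,
        .db = 1,               /* 32-bit default operand size */
        .s = 1, .l = 0,
        .g = 1,                /* 4 KiB granularity: 4 GiB limit */
    };
    sregs->cr0 |= 1;           /* CR0.PE: enable protected mode */
    sregs->cs = seg;
    seg.type = 3;              /* data: read/write, accessed */
    seg.selector = 0x10;
    sregs->ds = sregs->es = sregs->fs = sregs->gs = sregs->ss = seg;
}
```

Call KVM_GET_SREGS first, apply this, then KVM_SET_SREGS, so the other fields keep sane reset values.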
8.2 Multiple VCPUs
Support SMP:
- Create multiple VCPUs
- Run in separate threads
- Handle IPI
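The thread-per-VCPU structure can be sketched like this, assuming you already have a KVM_RUN loop to plug in (struct vcpu, run_all_vcpus, and run_stub are illustrative names, not KVM API):

```c
#include <pthread.h>

struct vcpu {
    int id;
    int fd;                       /* fd from KVM_CREATE_VCPU */
    int (*run)(struct vcpu *);    /* your existing KVM_RUN loop */
};

static void *vcpu_thread(void *arg) {
    struct vcpu *v = arg;
    v->run(v);                    /* returns when the guest halts or errors */
    return NULL;
}

/* Launch one host thread per VCPU and wait for all of them.
 * Each kvm_run mapping is per-VCPU, so the loops need no locking. */
int run_all_vcpus(struct vcpu *vcpus, int n) {
    pthread_t tids[n];
    for (int i = 0; i < n; i++)
        if (pthread_create(&tids[i], NULL, vcpu_thread, &vcpus[i]) != 0)
            return -1;
    for (int i = 0; i < n; i++)
        pthread_join(tids[i], NULL);
    return 0;
}

/* Placeholder run loop so the sketch compiles standalone; replace
 * with the real KVM_RUN loop. */
static int run_stub(struct vcpu *v) { (void)v; return 0; }
```

Secondary VCPUs normally start halted and are woken by an INIT/SIPI sequence from the BSP, which the in-kernel APIC (8.4) handles for you.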
8.3 Interrupt Injection
Inject interrupts into guest:
- Use KVM_INTERRUPT ioctl
- Emulate timer interrupt
- Enable guest to use interrupts
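A sketch of the injection call itself, using the real KVM_INTERRUPT ioctl (inject_interrupt is an illustrative wrapper; this path applies only when no in-kernel irqchip exists, where you would use KVM_IRQ_LINE instead):

```c
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Queue an external interrupt vector on a VCPU. The guest must be
 * interruptible (RFLAGS.IF set); check run->ready_for_interrupt_injection
 * and use run->request_interrupt_window to ask for an exit when it
 * becomes safe. Returns the ioctl result (-1 on failure). */
int inject_interrupt(int vcpu_fd, unsigned vector) {
    struct kvm_interrupt intr = { .irq = vector };
    return ioctl(vcpu_fd, KVM_INTERRUPT, &intr);
}
```

A periodic timer is then just a host timer (e.g. timerfd) that calls this with your chosen vector.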
8.4 Virtual APIC
Use KVM’s virtual APIC:
- KVM_CREATE_IRQCHIP
- KVM_IRQ_LINE for device interrupts
- More realistic interrupt handling
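Both ioctls are thin wrappers in practice; a sketch (create_irqchip and set_irq_line are illustrative names):

```c
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Ask KVM to emulate the PIC, IOAPIC, and local APICs in the kernel.
 * Must be called after KVM_CREATE_VM and before creating VCPUs. */
int create_irqchip(int vm_fd) {
    return ioctl(vm_fd, KVM_CREATE_IRQCHIP, 0);
}

/* Raise (level=1) or lower (level=0) an interrupt line, as a device
 * model would when it has data ready. */
int set_irq_line(int vm_fd, unsigned gsi, int level) {
    struct kvm_irq_level irq = { .irq = gsi, .level = level };
    return ioctl(vm_fd, KVM_IRQ_LINE, &irq);
}
```

With the in-kernel irqchip, routing, priorities, and EOI handling all happen in KVM, so your exit loop no longer needs KVM_INTERRUPT at all.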
8.5 Device Emulation
Add real device emulation:
- Serial port with full register set
- Simple disk device
- Connect to Project 13 mini-QEMU
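A serial model can stay tiny and still satisfy polling guests; this sketch (serial_io is an illustrative name) emulates only the 16550's data and line-status registers:

```c
#include <stdint.h>
#include <stdio.h>

#define COM1_BASE 0x3F8

/* Minimal COM1 model for the KVM_EXIT_IO handler: data-register
 * writes go to stdout, and the Line Status Register always reports
 * "transmitter empty" so guests that poll LSR before writing make
 * progress. Returns the value for IN; the return is unused for OUT. */
uint8_t serial_io(uint16_t port, int is_write, uint8_t value) {
    switch (port - COM1_BASE) {
    case 0:                        /* THR (write) / RBR (read) */
        if (is_write) {
            putchar(value);
            fflush(stdout);
        }
        return 0;                  /* no input buffered */
    case 5:                        /* LSR */
        return 0x60;               /* THRE | TEMT: ready to transmit */
    default:
        return 0xFF;               /* unimplemented registers */
    }
}
```

Wiring in a host stdin poll for RBR, plus the interrupt-enable register, is most of the way to a usable console.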
9. Real-World Connections
Firecracker Architecture
Firecracker (AWS’s microVM) follows the same pattern:
┌──────────────────────────────────────────┐
│ Firecracker │
├──────────────────────────────────────────┤
│ - Opens /dev/kvm │
│ - Creates VM and VCPUs │
│ - Minimal device set (virtio) │
│ - ~50MB memory overhead │
│ - Boots in <125ms │
│ - Runs on same KVM API you're learning! │
└──────────────────────────────────────────┘
QEMU with KVM
When you run QEMU with -enable-kvm:
/* QEMU does essentially this: */
int kvm_init(MachineState *ms) {
    s->fd = qemu_open("/dev/kvm", O_RDWR);
    s->vmfd = kvm_ioctl(s, KVM_CREATE_VM, 0);
    /* ... lots more, but same pattern! */
}
Cloud Scale
Every EC2 instance, GCP VM, Azure VM uses this exact API pattern at some level.
10. Resources
Primary References
- KVM API Documentation - Official API docs
- KVM API Header - All structures
Code Examples
- kvmtool - Simple KVM VMM
- Firecracker - Production microVM
- crosvm - Chrome OS VMM
Tutorials
- KVM Bare Bones - LWN tutorial
- kvmsample - Minimal example
11. Self-Assessment Checklist
KVM Setup
- Can open /dev/kvm
- Can verify API version
- Can create VM
Memory
- Can allocate guest memory
- Can set memory region
- Guest can access memory
VCPU
- Can create VCPU
- Can map kvm_run structure
- Can initialize segment registers
- Can initialize general registers
Execution
- KVM_RUN executes successfully
- Can handle I/O exits
- Can handle HLT exit
- Can handle errors
Guest Programs
- HLT-only guest works
- Serial output guest works
- More complex guests work
12. Submission / Completion Criteria
Your KVM client is complete when you can demonstrate:
- Successful Initialization
- Show API version check
- Show VM and VCPU creation
- Show memory setup
- Guest Execution
- Run hello guest, show output
- Show VM exit handling
- Show final register state
- Exit Handling
- Demonstrate I/O exit handling
- Demonstrate HLT handling
- Handle error cases gracefully
- Code Quality
- Clear structure
- Error handling throughout
- Verbose/debug mode
Bonus Points:
- Protected mode guest works
- Multiple VCPUs
- Real serial port emulation
- Timer interrupt injection
After completing this project, you’ll understand how QEMU, Firecracker, and crosvm work at their core. The KVM API you’ve mastered is the foundation of cloud computing infrastructure running millions of VMs worldwide. You can now build custom VMMs for specific use cases, contribute to QEMU/Firecracker, or optimize virtualized workloads.