Project 14: KVM Userspace Client

Build a userspace program that uses Linux’s KVM API to run a virtual machine with hardware acceleration, without writing any kernel code.

Quick Reference

Attribute       Value
Difficulty      Advanced (Level 3: The Engineer)
Time Estimate   2-3 weeks
Language        C (alternatives: Rust, Go)
Prerequisites   Linux system programming, basic x86 knowledge, understanding of virtualization concepts
Key Topics      KVM API, /dev/kvm ioctls, VM/VCPU management, KVM_RUN, VM exits, hardware virtualization

1. Learning Objectives

By completing this project, you will:

  • Master the Linux KVM API for creating and managing virtual machines
  • Learn how to set up guest memory using KVM_SET_USER_MEMORY_REGION
  • Understand the KVM_RUN loop and how to handle VM exits
  • Implement I/O and MMIO emulation in userspace
  • Learn how QEMU uses KVM for hardware acceleration
  • Build the foundation for production-quality hypervisor tools

2. Theoretical Foundation

2.1 Core Concepts

What is KVM?

KVM (Kernel-based Virtual Machine) is a Linux kernel module that turns Linux into a Type-1 hypervisor:

┌────────────────────────────────────────────────────────────────────────┐
│                         KVM Architecture                                │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Without KVM (QEMU alone)           With KVM (QEMU + KVM)             │
│   ════════════════════════           ═══════════════════════           │
│                                                                         │
│   ┌──────────────────────┐           ┌──────────────────────┐          │
│   │     Guest Code       │           │     Guest Code       │          │
│   └──────────┬───────────┘           └──────────┬───────────┘          │
│              │                                  │                       │
│              │ Every instruction                │ Most instructions     │
│              │ emulated in software             │ run on real CPU!      │
│              ▼                                  ▼                       │
│   ┌──────────────────────┐           ┌──────────────────────┐          │
│   │  QEMU TCG Emulator   │           │    KVM Module        │          │
│   │  (Software CPU)      │           │   (Uses VT-x/AMD-V)  │          │
│   │                      │           │                      │          │
│   │  - Fetch instruction │           │  - Guest runs at     │          │
│   │  - Decode            │           │    native speed      │          │
│   │  - Emulate           │           │  - VMX non-root mode │          │
│   │  - Very slow!        │           │  - Exits only on I/O │          │
│   └──────────────────────┘           └──────────────────────┘          │
│              │                                  │                       │
│              │                                  │ VM Exit              │
│              │                                  ▼ (for I/O, etc.)      │
│              │                       ┌──────────────────────┐          │
│              │                       │     QEMU/Your Code   │          │
│              │                       │   (Device emulation) │          │
│              │                       └──────────────────────┘          │
│              │                                  │                       │
│              ▼                                  ▼                       │
│   ┌──────────────────────┐           ┌──────────────────────┐          │
│   │    Host Hardware     │           │    Host Hardware     │          │
│   └──────────────────────┘           └──────────────────────┘          │
│                                                                         │
│   Speed: ~10-100x slower             Speed: Near-native!               │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘

The KVM API Model

KVM exposes hardware virtualization through a simple ioctl-based API:

┌────────────────────────────────────────────────────────────────────────┐
│                          KVM API Hierarchy                              │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   /dev/kvm (System-level)                                              │
│   ═════════════════════════                                            │
│   │                                                                     │
│   │  ioctl(kvm_fd, KVM_GET_API_VERSION)    → Check API compatibility   │
│   │  ioctl(kvm_fd, KVM_CHECK_EXTENSION)    → Check for features        │
│   │  ioctl(kvm_fd, KVM_CREATE_VM)          → Create a new VM           │
│   │         │                                                           │
│   │         │ Returns VM file descriptor                               │
│   │         ▼                                                           │
│   │                                                                     │
│   └── VM fd (Per-VM)                                                   │
│       ═══════════════                                                   │
│       │                                                                 │
│       │  ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION) → Map guest memory   │
│       │  ioctl(vm_fd, KVM_CREATE_VCPU)            → Create vCPU        │
│       │  ioctl(vm_fd, KVM_CREATE_IRQCHIP)         → Create virtual PIC │
│       │  ioctl(vm_fd, KVM_CREATE_PIT2)            → Create virtual PIT │
│       │         │                                                       │
│       │         │ Returns VCPU file descriptor                         │
│       │         ▼                                                       │
│       │                                                                 │
│       └── VCPU fd (Per-vCPU)                                           │
│           ═══════════════════                                           │
│           │                                                             │
│           │  ioctl KVM_GET_VCPU_MMAP_SIZE, then mmap(vcpu_fd) → kvm_run│
│           │  ioctl(vcpu_fd, KVM_GET_REGS)          → Read registers    │
│           │  ioctl(vcpu_fd, KVM_SET_REGS)          → Write registers   │
│           │  ioctl(vcpu_fd, KVM_GET_SREGS)         → Read segment regs │
│           │  ioctl(vcpu_fd, KVM_SET_SREGS)         → Write segment regs│
│           │  ioctl(vcpu_fd, KVM_RUN)               → Run guest!        │
│           │                                                             │
│           │                                                             │
│           └── kvm_run struct (shared memory)                           │
│               ══════════════════════════════                            │
│                                                                         │
│               Updated by KVM after each VM exit:                       │
│               - exit_reason (why did guest stop?)                      │
│               - io (I/O port info if exit was for I/O)                 │
│               - mmio (MMIO info if exit was for MMIO)                  │
│               - hypercall (if guest made hypercall)                    │
│               - etc.                                                    │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘

The KVM_RUN Loop

┌────────────────────────────────────────────────────────────────────────┐
│                          KVM_RUN Loop                                   │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Your Code                    KVM (Kernel)              Hardware      │
│   ═════════                    ════════════              ════════      │
│       │                             │                        │         │
│       │ ioctl(vcpu_fd, KVM_RUN)     │                        │         │
│       ├────────────────────────────►│                        │         │
│       │                             │                        │         │
│       │                             │ VMLAUNCH/VMRESUME      │         │
│       │                             ├───────────────────────►│         │
│       │                             │                        │         │
│       │                             │    Guest executes on   │         │
│       │                             │    real CPU hardware   │         │
│       │      (your process is       │    (VMX non-root mode) │         │
│       │       blocked here)         │           │            │         │
│       │                             │           │ Sensitive  │         │
│       │                             │           │ operation  │         │
│       │                             │           ▼            │         │
│       │                             │      VM Exit ◄─────────┤         │
│       │                             │◄───────────────────────┤         │
│       │                             │                        │         │
│       │                             │ Save guest state       │         │
│       │                             │ Update kvm_run struct  │         │
│       │                             │                        │         │
│       │ ioctl returns               │                        │         │
│       │◄────────────────────────────┤                        │         │
│       │                             │                        │         │
│       │ Check run->exit_reason      │                        │         │
│       │ Handle exit:                │                        │         │
│       │   - I/O: emulate device     │                        │         │
│       │   - MMIO: emulate device    │                        │         │
│       │   - HLT: wait for interrupt │                        │         │
│       │   - SHUTDOWN: terminate     │                        │         │
│       │                             │                        │         │
│       │ ioctl(vcpu_fd, KVM_RUN)     │                        │         │
│       ├────────────────────────────►│                        │         │
│       │                             │                        │         │
│       │        ... loop continues ...                        │         │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘
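
The loop in the diagram is only a dozen lines of C. This is a minimal sketch assuming `vcpu_fd` and the mmap'd `run` pointer are already set up; the exit-reason constants come straight from <linux/kvm.h>:

```c
#include <linux/kvm.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/ioctl.h>

/* Sketch: drive one vCPU until the guest halts or something goes wrong */
static void run_vm(int vcpu_fd, struct kvm_run *run)
{
    bool running = true;
    while (running) {
        /* Blocks in the kernel while the guest runs on the real CPU */
        if (ioctl(vcpu_fd, KVM_RUN, 0) < 0) {
            perror("KVM_RUN");
            return;
        }

        switch (run->exit_reason) {
        case KVM_EXIT_IO:
            /* Port I/O trapped: emulate the device in userspace */
            break;
        case KVM_EXIT_MMIO:
            /* Access to unmapped guest-physical memory */
            break;
        case KVM_EXIT_HLT:        /* guest executed HLT */
        case KVM_EXIT_SHUTDOWN:   /* e.g., guest triple fault */
            running = false;
            break;
        case KVM_EXIT_FAIL_ENTRY:
            fprintf(stderr, "VM entry failed (reason 0x%llx)\n",
                    (unsigned long long)
                    run->fail_entry.hardware_entry_failure_reason);
            running = false;
            break;
        default:
            fprintf(stderr, "unhandled exit %u\n", run->exit_reason);
            running = false;
        }
    }
}
```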

Memory Setup

┌────────────────────────────────────────────────────────────────────────┐
│                      KVM Memory Architecture                            │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Your Process (Userspace)                                             │
│   ════════════════════════                                             │
│                                                                         │
│   void *mem = mmap(NULL, 128MB, PROT_READ|PROT_WRITE,                  │
│                    MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);                  │
│                                                                         │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │  Your process virtual address space                              │  │
│   │                                                                  │  │
│   │  0x7f1234000000 ┌────────────────────────────────┐              │  │
│   │                 │                                │              │  │
│   │                 │   Allocated memory (128MB)     │              │  │
│   │                 │   This becomes guest RAM       │              │  │
│   │                 │                                │              │  │
│   │  0x7f123C000000 └────────────────────────────────┘              │  │
│   │                                                                  │  │
│   └─────────────────────────────────────────────────────────────────┘  │
│                                    │                                    │
│                                    │ KVM_SET_USER_MEMORY_REGION        │
│                                    │                                    │
│   struct kvm_userspace_memory_region region = {                        │
│       .slot = 0,                                                       │
│       .guest_phys_addr = 0,        // GPA starts at 0                  │
│       .memory_size = 128*1024*1024,                                    │
│       .userspace_addr = (uint64_t)mem,                                 │
│   };                                                                   │
│   ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);                  │
│                                    │                                    │
│                                    ▼                                    │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │  Guest Physical Address Space                                    │  │
│   │                                                                  │  │
│   │  GPA 0x00000000 ┌────────────────────────────────┐              │  │
│   │                 │                                │              │  │
│   │                 │   Guest sees this as its       │              │  │
│   │                 │   physical memory              │              │  │
│   │                 │                                │              │  │
│   │  GPA 0x08000000 └────────────────────────────────┘              │  │
│   │                                                                  │  │
│   └─────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│   When guest accesses GPA 0x1000:                                      │
│   - KVM/Hardware translates via EPT                                    │
│   - Actually accesses your mmap'd memory at offset 0x1000              │
│   - Transparent to guest!                                              │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘

2.2 Why This Matters

This is how QEMU works with KVM!

When you run:

qemu-system-x86_64 -enable-kvm -m 2G -hda disk.img

QEMU does exactly what you’ll learn:

  1. Opens /dev/kvm
  2. Creates VM and VCPU file descriptors
  3. Maps memory regions
  4. Runs the KVM_RUN loop
  5. Handles exits for I/O emulation

Real-World Applications:

  • Cloud platforms: AWS (Nitro), Google Cloud, and most OpenStack clouds use KVM
  • Container runtimes: Kata Containers, Firecracker use KVM
  • Development tools: Android Emulator uses KVM on Linux
  • Security: Sandboxing with hardware isolation

Career Value:

  • KVM expertise is highly valued in cloud computing
  • Understanding enables better VM performance tuning
  • Foundation for building specialized VMMs (like Firecracker)

2.3 Historical Context

2006: Avi Kivity creates KVM for x86

  • Simple kernel module approach
  • Reuses Linux kernel for scheduling, memory management
  • Radical departure from monolithic hypervisors

2007: KVM merged into Linux 2.6.20

  • First virtualization solution in mainline kernel
  • Accelerated adoption of Linux virtualization

2008-2010: KVM + QEMU becomes the standard

  • QEMU provides device models
  • KVM provides hardware virtualization
  • Libvirt provides management

Today: KVM dominates cloud computing

  • AWS uses KVM (Nitro hypervisor)
  • Google Cloud uses KVM
  • Most OpenStack deployments use KVM
  • The large majority of public-cloud VMs run on KVM

2.4 Common Misconceptions

Misconception 1: “KVM is a separate hypervisor”

  • Reality: KVM makes Linux itself the hypervisor. Linux handles scheduling, memory management, etc.

Misconception 2: “KVM is complex like VMware”

  • Reality: The KVM API is surprisingly simple. ~10 ioctls cover most functionality.

Misconception 3: “You need QEMU to use KVM”

  • Reality: KVM can be used standalone. QEMU just provides device emulation. Firecracker proves you can build your own VMM.

Misconception 4: “KVM handles everything”

  • Reality: KVM handles CPU and memory virtualization. Your code must handle I/O emulation.

3. Project Specification

3.1 What You Will Build

A standalone userspace program (no QEMU!) that:

  1. Opens /dev/kvm and creates a VM
  2. Sets up guest memory
  3. Creates a VCPU and initializes its state
  4. Loads and runs guest code
  5. Handles VM exits (I/O, MMIO, HLT)
  6. Displays guest output

3.2 Functional Requirements

  • KVM Setup: Open /dev/kvm, verify API version, create VM
  • Memory: Allocate and map guest memory (at least 16MB)
  • VCPU: Create VCPU, initialize registers for real mode
  • Guest Loading: Load binary guest code into memory
  • Run Loop: Execute KVM_RUN loop
  • Exit Handling: Handle KVM_EXIT_IO, KVM_EXIT_HLT, KVM_EXIT_MMIO

3.3 Non-Functional Requirements

  • Pure Userspace: No kernel modules required (uses existing KVM)
  • Minimal Dependencies: Only libc (no QEMU libraries)
  • Clear Code: Well-structured, educational implementation
  • Debuggable: Verbose mode showing all VM exits and register state

3.4 Example Usage / Output

$ ./kvmclient guest.bin

[KVM] Opening /dev/kvm...
[KVM] API version: 12 (expected 12) - OK
[KVM] Checking extensions:
  - KVM_CAP_USER_MEMORY: Yes
  - KVM_CAP_SET_TSS_ADDR: Yes
  - KVM_CAP_IRQCHIP: Yes
  - KVM_CAP_EXT_CPUID: Yes

[KVM] Creating VM...
[KVM] VM file descriptor: 4

[KVM] Setting up memory:
  - Allocating 128MB for guest RAM
  - Userspace address: 0x7f1234000000
  - Setting memory region:
    * Slot: 0
    * Guest physical address: 0x0
    * Size: 134217728 bytes (128MB)

[KVM] Creating VCPU 0...
[KVM] VCPU file descriptor: 5
[KVM] Mapping kvm_run structure (8192 bytes)

[KVM] Initializing VCPU state...
[KVM] Setting segment registers (real mode):
  - CS: selector=0x0000, base=0x00000000, limit=0xFFFF
  - DS: selector=0x0000, base=0x00000000, limit=0xFFFF
  - ES: selector=0x0000, base=0x00000000, limit=0xFFFF
  - SS: selector=0x0000, base=0x00000000, limit=0xFFFF
  - CR0: 0x60000010 (PE=0, real mode)

[KVM] Setting general registers:
  - RIP: 0x0000 (start of loaded code)
  - RSP: 0xFFFE
  - RFLAGS: 0x0002

[KVM] Loading guest code:
  - File: guest.bin
  - Size: 512 bytes
  - Load address: 0x0000

[KVM] Starting VCPU execution...
[KVM] ═══════════════════════════════════════════════════════════════

[KVM] Running... (KVM_RUN ioctl)
[KVM] VM Exit #1:
  - Exit reason: KVM_EXIT_IO (2)
  - Direction: OUT
  - Port: 0x03F8 (COM1)
  - Size: 1 byte
  - Data: 0x48 ('H')
[KVM] Emulating: Serial output 'H'

[KVM] Running... (KVM_RUN ioctl)
[KVM] VM Exit #2:
  - Exit reason: KVM_EXIT_IO (2)
  - Port: 0x03F8, OUT, data: 0x65 ('e')

[KVM] VM Exit #3:
  - Exit reason: KVM_EXIT_IO (2)
  - Port: 0x03F8, OUT, data: 0x6C ('l')

[KVM] VM Exit #4:
  - Exit reason: KVM_EXIT_IO (2)
  - Port: 0x03F8, OUT, data: 0x6C ('l')

[KVM] VM Exit #5:
  - Exit reason: KVM_EXIT_IO (2)
  - Port: 0x03F8, OUT, data: 0x6F ('o')

[KVM] Running... (KVM_RUN ioctl)
[KVM] VM Exit #6:
  - Exit reason: KVM_EXIT_HLT (5)
  - Guest executed HLT instruction

[KVM] ═══════════════════════════════════════════════════════════════
[KVM] Guest Output: Hello

[KVM] VCPU Final State:
  - RAX: 0x0000000000000048
  - RBX: 0x0000000000000000
  - RCX: 0x0000000000000000
  - RDX: 0x00000000000003F8
  - RIP: 0x0000000000000012
  - RFLAGS: 0x0000000000000002

[KVM] Statistics:
  - Total VM exits: 6
  - I/O exits: 5
  - HLT exits: 1

[KVM] Cleanup complete.

4. Solution Architecture

4.1 High-Level Design

┌────────────────────────────────────────────────────────────────────────┐
│                      KVM Client Architecture                            │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   main.c                                                               │
│   ══════                                                               │
│   │                                                                    │
│   │  main() {                                                          │
│   │      parse_args();           // Get guest binary path              │
│   │      kvm_init(&vm);          // Open /dev/kvm, create VM           │
│   │      setup_memory(&vm);      // Allocate and map guest RAM         │
│   │      create_vcpu(&vm);       // Create VCPU, map kvm_run           │
│   │      init_vcpu_state(&vm);   // Set registers for real mode        │
│   │      load_guest(&vm, path);  // Load guest binary                  │
│   │      run_vm(&vm);            // Main KVM_RUN loop                  │
│   │      cleanup(&vm);           // Close FDs, free memory             │
│   │  }                                                                 │
│   │                                                                    │
│   └───────────────────────────────────────────────────────────────────│
│                                                                         │
│   ┌──────────────────────────────────────────────────────────────────┐ │
│   │                         VM State                                  │ │
│   │                                                                   │ │
│   │   struct vm {                                                     │ │
│   │       int kvm_fd;            // /dev/kvm file descriptor         │ │
│   │       int vm_fd;             // VM file descriptor               │ │
│   │       int vcpu_fd;           // VCPU file descriptor             │ │
│   │                                                                   │ │
│   │       void *mem;             // Guest RAM (mmap'd)               │ │
│   │       size_t mem_size;       // Guest RAM size                   │ │
│   │                                                                   │ │
│   │       struct kvm_run *run;   // Shared kvm_run structure         │ │
│   │       size_t run_size;       // Size of kvm_run mmap             │ │
│   │   };                                                              │ │
│   │                                                                   │ │
│   └──────────────────────────────────────────────────────────────────┘ │
│                                                                         │
│   ┌────────────────────┐ ┌────────────────────┐ ┌────────────────────┐ │
│   │    kvm_init()      │ │   setup_memory()   │ │   create_vcpu()    │ │
│   │                    │ │                    │ │                    │ │
│   │ - open /dev/kvm    │ │ - mmap guest RAM   │ │ - KVM_CREATE_VCPU  │ │
│   │ - check version    │ │ - KVM_SET_USER_    │ │ - mmap kvm_run     │ │
│   │ - KVM_CREATE_VM    │ │   MEMORY_REGION    │ │                    │ │
│   └────────────────────┘ └────────────────────┘ └────────────────────┘ │
│                                                                         │
│   ┌────────────────────┐ ┌────────────────────────────────────────────┐│
│   │  init_vcpu_state() │ │              run_vm()                      ││
│   │                    │ │                                            ││
│   │ - KVM_SET_SREGS    │ │ while (running) {                          ││
│   │   (segments, CR0)  │ │     ioctl(vcpu_fd, KVM_RUN);               ││
│   │ - KVM_SET_REGS     │ │                                            ││
│   │   (RIP, RSP, etc.) │ │     switch (run->exit_reason) {            ││
│   └────────────────────┘ │         case KVM_EXIT_IO:                  ││
│                          │             handle_io(run);                ││
│   ┌────────────────────┐ │         case KVM_EXIT_MMIO:                ││
│   │    load_guest()    │ │             handle_mmio(run);              ││
│   │                    │ │         case KVM_EXIT_HLT:                 ││
│   │ - Read binary file │ │             running = false;               ││
│   │ - Copy to guest mem│ │         case KVM_EXIT_SHUTDOWN:            ││
│   └────────────────────┘ │             running = false;               ││
│                          │     }                                      ││
│                          │ }                                          ││
│                          └────────────────────────────────────────────┘│
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘

4.2 Key Components

1. KVM System Interface

Opens /dev/kvm and verifies API compatibility:

int kvm_fd = open("/dev/kvm", O_RDWR);
int version = ioctl(kvm_fd, KVM_GET_API_VERSION, 0);

2. VM Creation

Creates a VM instance:

int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);

3. Memory Setup

Maps userspace memory into guest physical address space:

void *mem = mmap(NULL, size, PROT_READ|PROT_WRITE,
                 MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
struct kvm_userspace_memory_region region = {...};
ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);

4. VCPU Creation

Creates virtual CPU and maps shared kvm_run structure:

int vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0);
int run_size = ioctl(kvm_fd, KVM_GET_VCPU_MMAP_SIZE, 0);
struct kvm_run *run = mmap(NULL, run_size, PROT_READ | PROT_WRITE,
                           MAP_SHARED, vcpu_fd, 0);

5. Exit Handling

Processes VM exits in the KVM_RUN loop:

  • KVM_EXIT_IO: Guest performed port I/O
  • KVM_EXIT_MMIO: Guest accessed unmapped memory
  • KVM_EXIT_HLT: Guest executed HLT instruction
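
For KVM_EXIT_IO, note that the transferred data does not live in a fixed field: run->io.data_offset gives its position relative to the start of the kvm_run mapping. A hedged sketch of a handler for the COM1 data port (the function name and ignore-unknown-ports policy are design choices, not KVM requirements):

```c
#include <linux/kvm.h>
#include <stdint.h>
#include <stdio.h>

/* Sketch: emulate the COM1 (0x3F8) data port.
 * The data lives inside the kvm_run mapping, at io.data_offset. */
static void handle_io(struct kvm_run *run)
{
    uint8_t *data = (uint8_t *)run + run->io.data_offset;

    if (run->io.direction == KVM_EXIT_IO_OUT &&
        run->io.port == 0x3f8 && run->io.size == 1) {
        /* io.count > 1 happens for REP OUTS; each element is io.size bytes */
        for (uint32_t i = 0; i < run->io.count; i++)
            putchar(data[i]);
        fflush(stdout);
    } else if (run->io.direction == KVM_EXIT_IO_IN) {
        /* For IN, write the value the guest should read to the same spot */
        data[0] = 0;
    }
    /* Unknown ports: a common policy is reads return 0, writes are dropped */
}
```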

4.3 Data Structures

/* Main VM state */
struct vm {
    /* File descriptors */
    int kvm_fd;         /* /dev/kvm */
    int vm_fd;          /* VM instance */
    int vcpu_fd;        /* Virtual CPU */

    /* Memory */
    void *mem;          /* Guest RAM (userspace address) */
    size_t mem_size;    /* Guest RAM size */

    /* VCPU run area */
    struct kvm_run *run;/* Shared with kernel */
    size_t run_size;    /* mmap size */

    /* I/O handling */
    char serial_buf[4096];
    int serial_pos;
};

/* From linux/kvm.h - key structures */
struct kvm_run {
    __u8 request_interrupt_window;
    __u8 immediate_exit;
    __u8 padding1[6];
    __u32 exit_reason;              /* Why did the guest exit? */
    __u8 ready_for_interrupt_injection;
    __u8 if_flag;
    __u16 flags;
    __u64 cr8;
    __u64 apic_base;

    union {
        /* KVM_EXIT_IO */
        struct {
            __u8 direction;         /* IN or OUT */
            __u8 size;              /* 1, 2, or 4 bytes */
            __u16 port;             /* I/O port number */
            __u32 count;            /* Number of transfers */
            __u64 data_offset;      /* Offset in run struct to data */
        } io;

        /* KVM_EXIT_MMIO */
        struct {
            __u64 phys_addr;        /* Guest physical address */
            __u8  data[8];          /* Data (for write) or buffer (for read) */
            __u32 len;              /* Length of access */
            __u8  is_write;         /* 1 = write, 0 = read */
        } mmio;

        /* KVM_EXIT_HYPERCALL */
        struct {
            __u64 nr;
            __u64 args[6];
            __u64 ret;
            __u32 longmode;
            __u32 pad;
        } hypercall;

        /* ... other exit types ... */
    };
};

struct kvm_regs {
    __u64 rax, rbx, rcx, rdx;
    __u64 rsi, rdi, rsp, rbp;
    __u64 r8, r9, r10, r11, r12, r13, r14, r15;
    __u64 rip, rflags;
};

struct kvm_sregs {
    struct kvm_segment cs, ds, es, fs, gs, ss;
    struct kvm_segment tr, ldt;
    struct kvm_dtable gdt, idt;
    __u64 cr0, cr2, cr3, cr4, cr8;
    __u64 efer;
    __u64 apic_base;
    __u64 interrupt_bitmap[4];
};

struct kvm_segment {
    __u64 base;
    __u32 limit;
    __u16 selector;
    __u8  type;
    __u8  present, dpl, db, s, l, g, avl;
    __u8  unusable;
    __u8  padding;
};
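
These structures map directly onto real-mode setup. A sketch of configuring the segment registers (function name illustrative) — note the usual pattern of reading the current sregs first and modifying only what you need:

```c
#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Sketch: configure segment registers for 16-bit real mode at 0000:0000 */
static int set_real_mode_sregs(int vcpu_fd)
{
    struct kvm_sregs sregs;
    if (ioctl(vcpu_fd, KVM_GET_SREGS, &sregs) < 0)
        return -1;

    /* selector=0, base=0 → code executes at linear address = IP */
    sregs.cs.selector = 0;
    sregs.cs.base     = 0;
    sregs.cs.limit    = 0xFFFF;       /* 64KB real-mode segment */
    sregs.ds = sregs.es = sregs.ss = sregs.cs;

    sregs.cr0 &= ~1ULL;               /* clear PE bit: stay in real mode */
    return ioctl(vcpu_fd, KVM_SET_SREGS, &sregs);
}
```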

4.4 Algorithm Overview

ALGORITHM: KVM Client Execution

1. INITIALIZE KVM:
   a. Open /dev/kvm
   b. Verify API version (must be 12)
   c. Check required extensions

2. CREATE VM:
   a. ioctl(kvm_fd, KVM_CREATE_VM, 0)
   b. Save VM file descriptor

3. SETUP MEMORY:
   a. Allocate memory with mmap (anonymous, private)
   b. Create kvm_userspace_memory_region structure
   c. ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region)

4. CREATE VCPU:
   a. ioctl(vm_fd, KVM_CREATE_VCPU, 0)
   b. Get kvm_run mmap size: ioctl(kvm_fd, KVM_GET_VCPU_MMAP_SIZE)
   c. mmap kvm_run structure from vcpu_fd

5. INITIALIZE VCPU STATE:
   a. Get current sregs: ioctl(vcpu_fd, KVM_GET_SREGS, &sregs)
   b. Set segment registers for real mode:
      - CS/DS/ES/SS: selector=0, base=0, limit=0xFFFF
      - CR0: Clear PE bit (real mode)
   c. Set sregs: ioctl(vcpu_fd, KVM_SET_SREGS, &sregs)
   d. Set initial registers:
      - RIP = load address
      - RSP = stack address
      - RFLAGS = 0x2

6. LOAD GUEST:
   a. Read guest binary file
   b. Copy to guest memory at load address

7. RUN LOOP:
   WHILE running:
     a. ioctl(vcpu_fd, KVM_RUN, 0)
     b. Check run->exit_reason:

        CASE KVM_EXIT_IO:
          IF direction == OUT:
            Get data from run + run->io.data_offset
            Emulate output (e.g., serial port)
          ELSE:
            Emulate input, put data at run + run->io.data_offset

        CASE KVM_EXIT_MMIO:
          IF is_write:
            Emulate MMIO write
          ELSE:
            Put read data in run->mmio.data

        CASE KVM_EXIT_HLT:
          running = false

        CASE KVM_EXIT_SHUTDOWN:
          running = false

        CASE KVM_EXIT_FAIL_ENTRY:
          Print error, exit

8. CLEANUP:
   a. munmap kvm_run structure
   b. close(vcpu_fd)
   c. munmap guest memory
   d. close(vm_fd)
   e. close(kvm_fd)
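
Steps 5d and 6 can be sketched together: set the entry registers, then copy the guest binary into the shared guest memory. `guest_mem`, the 64KB size cap, and the load address of 0 are assumptions matching the memory setup earlier in this document:

```c
#include <linux/kvm.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>

/* Sketch: step 5d (initial registers) and step 6 (load guest binary) */
static int set_regs_and_load(int vcpu_fd, void *guest_mem, const char *path)
{
    struct kvm_regs regs;
    memset(&regs, 0, sizeof(regs));
    regs.rip    = 0x0000;    /* where the binary is loaded below */
    regs.rsp    = 0xFFFE;    /* top of a small real-mode stack */
    regs.rflags = 0x2;       /* reserved bit 1 must always be set */
    if (ioctl(vcpu_fd, KVM_SET_REGS, &regs) < 0)
        return -1;

    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;
    size_t n = fread(guest_mem, 1, 0x10000, f);  /* cap at 64KB */
    fclose(f);
    return n > 0 ? 0 : -1;
}
```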

5. Implementation Guide

5.1 Development Environment Setup

# Check KVM is available
$ ls -la /dev/kvm
crw-rw---- 1 root kvm 10, 232 Dec 29 10:00 /dev/kvm

# Add yourself to kvm group (if needed)
$ sudo usermod -aG kvm $USER
# Then logout/login

# Verify KVM works
$ cat /sys/module/kvm_intel/parameters/nested   # kvm_amd on AMD systems
# Or
$ lsmod | grep kvm

# Install development tools
$ sudo apt-get install build-essential nasm

# Create project
$ mkdir kvmclient && cd kvmclient

5.2 Project Structure

kvmclient/
├── Makefile
├── kvmclient.c         # Main implementation (single file for simplicity)
└── guest/
    ├── guest.asm       # Assembly guest code
    ├── guest.bin       # Compiled guest binary
    └── Makefile        # Build guest

5.3 The Core Question You’re Answering

“How does userspace software control hardware virtualization to run arbitrary guest code at near-native speed while maintaining control over sensitive operations?”

The answer:

  1. KVM exposes VT-x/AMD-V through a simple file descriptor interface
  2. Userspace sets up memory and initial CPU state
  3. KVM_RUN transfers control to guest via VMLAUNCH/VMRESUME
  4. On sensitive operations, hardware traps back (VM exit)
  5. KVM returns to userspace with exit information
  6. Userspace handles the exit, then calls KVM_RUN again

5.4 Concepts You Must Understand First

KVM API:

  • What is an ioctl and how does it work?
  • What are file descriptors and how do they represent resources?
  • How does mmap create shared memory between user/kernel?

x86 Architecture:

  • What is the difference between real mode and protected mode?
  • What are segment registers and how do they work in real mode?
  • What is port I/O vs. memory-mapped I/O?

Virtualization:

  • What causes a VM exit?
  • What is the guest physical address space?
  • How does the CPU know what code to run?

Book References:

  • Linux Programming Interface, Chapter 4 (File I/O)
  • Intel SDM Vol. 1, Chapter 20 (Real Mode)
  • KVM API Documentation (kernel.org)

5.5 Questions to Guide Your Design

Initialization:

  1. What version of the KVM API should you expect?
  2. What extensions are required vs. optional?
  3. What order must operations be performed?

Memory:

  1. How much memory does the guest need?
  2. At what guest physical address should memory start?
  3. Can you have multiple memory regions?

VCPU State:

  1. What segment register values indicate real mode?
  2. What should CR0 contain for real mode?
  3. What initial register values make sense?

Exit Handling:

  1. How do you distinguish between I/O ports?
  2. How do you read data for OUT instructions?
  3. What if the guest performs an unhandled operation?

5.6 Thinking Exercise

Trace through this guest code execution:

Guest code (at address 0x0000):

mov dx, 0x3F8      ; Serial port
mov al, 'A'        ; Character to send
out dx, al         ; Output to serial port
hlt                ; Halt

Step 1: KVM_RUN called

  • KVM enters VMX non-root mode
  • CPU starts executing at RIP=0x0000

Step 2: mov dx, 0x3F8 executes

  • Simple register move
  • No VM exit (not sensitive)

Step 3: mov al, 'A' executes

  • Simple register move
  • No VM exit

Step 4: out dx, al executes

  • I/O port access is sensitive
  • CPU triggers VM exit
  • KVM returns to userspace

Step 5: Your code handles exit

switch (run->exit_reason) {
case KVM_EXIT_IO:
    /* run->io.direction == KVM_EXIT_IO_OUT (1) */
    /* run->io.port      == 0x3F8              */
    /* run->io.size      == 1                  */
    /* the byte lives at run + run->io.data_offset */
    printf("%c", *((char *)run + run->io.data_offset));
    break;
}

Step 6: KVM_RUN called again

  • KVM has already advanced RIP past the OUT, so it now points at the HLT instruction
  • Guest executes HLT
  • VM exit with KVM_EXIT_HLT

Step 7: Your code handles HLT

case KVM_EXIT_HLT:
    printf("\nGuest halted.\n");
    running = false;
    break;

Questions:

  • What happens if you don’t handle the I/O exit?
  • What if the guest wrote to port 0x80 (debug port)?
  • How would you handle input (IN instruction)?
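
On the last question: for an IN, `run->io.direction` is `KVM_EXIT_IO_IN` and your job reverses — you write the value into the shared data buffer before re-entering, and KVM loads it into AL/AX/EAX. A hedged sketch (`io_data` and `handle_io` are hypothetical helpers; this ignores `run->io.count`, which matters for string I/O such as REP OUTSB):

```c
#include <assert.h>
#include <linux/kvm.h>
#include <stdio.h>
#include <stdlib.h>

/* Locate the I/O data buffer inside the shared kvm_run mapping. */
unsigned char *io_data(struct kvm_run *run)
{
    return (unsigned char *)run + run->io.data_offset;
}

void handle_io(struct kvm_run *run)
{
    unsigned char *data = io_data(run);

    if (run->io.direction == KVM_EXIT_IO_OUT) {
        fwrite(data, 1, run->io.size, stdout);       /* guest -> host */
    } else {                                         /* KVM_EXIT_IO_IN */
        int c = getchar();                           /* host -> guest */
        data[0] = (c == EOF) ? 0 : (unsigned char)c;
    }
}
```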

5.7 Hints in Layers

Hint 1 - Starting Point (Conceptual Direction)

Build incrementally:

  1. First: Open /dev/kvm and create VM
  2. Then: Set up memory
  3. Then: Create VCPU
  4. Then: Initialize state (hardest part!)
  5. Finally: Run loop and exit handling

Hint 2 - Next Level (More Specific Guidance)

Real mode segment setup is tricky. Here’s the key insight:

  • In real mode, segment registers work differently
  • For KVM, set selector=0, base=0, limit=0xFFFF
  • CR0 bit 0 (PE) must be 0 for real mode
  • But KVM requires some other CR0 bits set!
/* Real mode segment template */
struct kvm_segment seg = {
    .base = 0,
    .limit = 0xFFFF,
    .selector = 0,
    .type = 3,      /* Data, read/write, accessed */
    .present = 1,
    .dpl = 0,
    .db = 0,        /* 16-bit */
    .s = 1,         /* Code/data segment */
    .l = 0,         /* Not 64-bit */
    .g = 0,         /* Byte granularity */
};
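
The template can then be copied into every segment; only CS differs, because it must be executable. A sketch under that assumption (`set_real_mode_segments` is a hypothetical helper):

```c
#include <assert.h>
#include <linux/kvm.h>

/* Copy the real-mode template into all segments; CS gets the
 * executable type (11 = Execute/Read, accessed) instead of 3. */
void set_real_mode_segments(struct kvm_sregs *sregs, struct kvm_segment seg)
{
    sregs->ds = seg;
    sregs->es = seg;
    sregs->fs = seg;
    sregs->gs = seg;
    sregs->ss = seg;
    seg.type = 11;      /* Execute/Read, accessed */
    sregs->cs = seg;
}
```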

Hint 3 - Technical Details (Approach/Pseudocode)

/* Complete initialization sequence (error checks trimmed for brevity) */
#include <fcntl.h>
#include <linux/kvm.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

int kvm_fd = open("/dev/kvm", O_RDWR);

/* Check version */
int version = ioctl(kvm_fd, KVM_GET_API_VERSION, 0);
if (version != 12) {
    fprintf(stderr, "KVM API version mismatch\n");
    exit(1);
}

/* Create VM */
int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);

/* Set up memory */
size_t mem_size = 0x8000000;  /* 128MB */
void *mem = mmap(NULL, mem_size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

struct kvm_userspace_memory_region region = {
    .slot = 0,
    .guest_phys_addr = 0,
    .memory_size = mem_size,
    .userspace_addr = (uint64_t)mem,
};
ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);

/* Create VCPU */
int vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0);

/* Map kvm_run */
size_t run_size = ioctl(kvm_fd, KVM_GET_VCPU_MMAP_SIZE, 0);
struct kvm_run *run = mmap(NULL, run_size, PROT_READ | PROT_WRITE,
                            MAP_SHARED, vcpu_fd, 0);

/* Initialize segments (real mode) */
struct kvm_sregs sregs;
ioctl(vcpu_fd, KVM_GET_SREGS, &sregs);

/* CS is special - must be executable */
sregs.cs.base = 0;
sregs.cs.limit = 0xFFFF;
sregs.cs.selector = 0;
sregs.cs.type = 11;  /* Execute/Read, accessed */
sregs.cs.present = 1;
sregs.cs.dpl = 0;
sregs.cs.db = 0;
sregs.cs.s = 1;
sregs.cs.l = 0;
sregs.cs.g = 0;

/* DS, ES, SS similar but data type */
/* ... */

/* CR0: PE=0 for real mode, but keep the reset-value bits KVM expects */
sregs.cr0 = 0x60000010;  /* CD=1, NW=1, ET=1 (x86 reset value) */
sregs.cr0 &= ~1ULL;      /* Clear PE (bit 0) to stay in real mode */

ioctl(vcpu_fd, KVM_SET_SREGS, &sregs);

/* Initialize general registers */
struct kvm_regs regs = {0};
regs.rip = 0;      /* Start at address 0 */
regs.rsp = 0xFFFE; /* Stack at top of first 64KB */
regs.rflags = 2;   /* Bit 1 always set */
ioctl(vcpu_fd, KVM_SET_REGS, &regs);
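
The last missing piece before KVM_RUN is copying the guest binary into the mapped memory at guest physical address 0, where RIP points. A minimal loader sketch (`load_guest` and the file name are illustrative):

```c
#include <assert.h>
#include <stdio.h>

/* Copy a flat binary into guest memory at GPA 0; returns bytes
 * loaded, or -1 if the file could not be opened. */
long load_guest(void *mem, size_t mem_size, const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;
    long n = (long)fread(mem, 1, mem_size, f);
    fclose(f);
    return n;
}
```

In the main program this would be called as `load_guest(mem, mem_size, argv[1])`, right before entering the run loop.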

Hint 4 - Tools/Debugging (Verification Methods)

/* Debug helper: dump VCPU state */
void dump_regs(int vcpu_fd) {
    struct kvm_regs regs;
    struct kvm_sregs sregs;

    ioctl(vcpu_fd, KVM_GET_REGS, &regs);
    ioctl(vcpu_fd, KVM_GET_SREGS, &sregs);

    printf("=== VCPU State ===\n");
    printf("RIP: %016llx  RSP: %016llx\n", regs.rip, regs.rsp);
    printf("RAX: %016llx  RBX: %016llx\n", regs.rax, regs.rbx);
    printf("RCX: %016llx  RDX: %016llx\n", regs.rcx, regs.rdx);
    printf("CR0: %016llx  CR4: %016llx\n", sregs.cr0, sregs.cr4);
    printf("CS:  sel=%04x base=%08llx limit=%08x\n",
           sregs.cs.selector, sregs.cs.base, sregs.cs.limit);
}

/* Debug: print exit reason */
const char *exit_reason_str(int reason) {
    switch (reason) {
    case KVM_EXIT_IO: return "IO";
    case KVM_EXIT_MMIO: return "MMIO";
    case KVM_EXIT_HLT: return "HLT";
    case KVM_EXIT_SHUTDOWN: return "SHUTDOWN";
    case KVM_EXIT_FAIL_ENTRY: return "FAIL_ENTRY";
    case KVM_EXIT_INTERNAL_ERROR: return "INTERNAL_ERROR";
    default: return "UNKNOWN";
    }
}

5.8 The Interview Questions They’ll Ask

  1. “What is the KVM API and how does it work?”
    • File descriptor based (/dev/kvm → VM fd → VCPU fd)
    • ioctls for control operations
    • mmap for shared kvm_run structure
    • KVM_RUN enters guest, returns on exit
  2. “How does KVM_RUN work?”
    • The ioctl blocks while KVM executes VMLAUNCH/VMRESUME
    • Guest runs until a sensitive operation
    • VM exit occurs; KVM populates kvm_run
    • The ioctl returns to userspace
    • Userspace handles the exit, calls KVM_RUN again
  3. “What are the common VM exit reasons?”
    • KVM_EXIT_IO: Port I/O
    • KVM_EXIT_MMIO: Memory-mapped I/O
    • KVM_EXIT_HLT: Guest halted
    • KVM_EXIT_SHUTDOWN: Triple fault or shutdown
    • KVM_EXIT_INTERNAL_ERROR: KVM error
  4. “How do you set up memory for a KVM guest?”
    • Allocate with mmap (userspace)
    • Use KVM_SET_USER_MEMORY_REGION
    • Specify GPA, size, userspace address
    • KVM/hardware handles translation
  5. “Why would you build a custom KVM client instead of using QEMU?”
    • Minimal footprint (Firecracker approach)
    • Specific device requirements
    • Security (smaller attack surface)
    • Performance (less overhead)

5.9 Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| KVM API | KVM Documentation | api.txt |
| Linux System Programming | TLPI | Chapters 4, 49 |
| x86 Architecture | Intel SDM Vol. 1 | Chapters 3, 20 |
| Real Mode | Intel SDM Vol. 3A | Chapter 20 |
| Virtualization | Intel SDM Vol. 3C | Chapters 23-27 |

5.10 Implementation Phases

Phase 1: KVM Setup (Days 1-2)

Goal: Open /dev/kvm and create VM

int main() {
    int kvm_fd = open("/dev/kvm", O_RDWR);
    if (kvm_fd < 0) {
        perror("open /dev/kvm");
        return 1;
    }

    int version = ioctl(kvm_fd, KVM_GET_API_VERSION, 0);
    printf("KVM API version: %d\n", version);

    int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);
    printf("VM created, fd = %d\n", vm_fd);

    close(vm_fd);
    close(kvm_fd);
    return 0;
}

Validation:

  • Program runs without error
  • Prints API version 12
  • Creates VM successfully

Phase 2: Memory Setup (Days 3-4)

Goal: Allocate and map guest memory

Add memory setup after creating VM:

  • mmap 128MB anonymous memory
  • Create kvm_userspace_memory_region
  • Call KVM_SET_USER_MEMORY_REGION

Validation:

  • No errors from ioctl
  • Memory is allocated (check with /proc/self/maps)

Phase 3: VCPU Creation (Days 5-7)

Goal: Create VCPU with proper initial state

This is the trickiest part:

  • Create VCPU
  • Map kvm_run structure
  • Initialize segment registers for real mode
  • Initialize general registers

Validation:

  • VCPU created successfully
  • Can read initial register state
  • State matches what you set

Phase 4: Guest Loading and Running (Days 8-10)

Goal: Load guest code and run

  • Write simple guest assembly (print “Hi”, then HLT)
  • Assemble to binary
  • Load into guest memory
  • Call KVM_RUN

Validation:

  • KVM_RUN returns
  • Exit reason is KVM_EXIT_IO or KVM_EXIT_HLT

Phase 5: Exit Handling (Days 11-14)

Goal: Handle exits and display output

  • Handle KVM_EXIT_IO for serial output
  • Handle KVM_EXIT_HLT to terminate
  • Handle error cases
  • Add verbose mode

Validation:

  • Guest output is displayed
  • Program terminates cleanly
  • Various guest programs work

5.11 Key Implementation Decisions

Decision 1: Real Mode vs. Protected Mode

Real Mode (This Project)

  • Simpler segment setup
  • 16-bit code
  • Limited to 1MB (but memory mapping can extend)
  • Good for learning

Protected Mode

  • More complex setup (GDT, TSS)
  • 32-bit code
  • Needed for real OS
  • Add as extension

Decision 2: I/O Handling Approach

Simple (Print to stdout)

  • Just print I/O data
  • Good for testing

Device Emulation

  • Actually emulate serial port registers
  • Support both IN and OUT
  • More realistic

Decision 3: Error Handling

Minimal (Exit on error)

  • Simple, clear
  • Good for learning

Robust (Continue if possible)

  • Log errors
  • Try to recover
  • Production quality

6. Testing Strategy

6.1 Unit Tests

/* Test KVM availability */
void test_kvm_available(void) {
    int fd = open("/dev/kvm", O_RDWR);
    assert(fd >= 0 && "KVM not available");
    int version = ioctl(fd, KVM_GET_API_VERSION, 0);
    assert(version == 12 && "Wrong KVM API version");
    close(fd);
    printf("TEST: KVM available - PASS\n");
}

/* Test VM creation */
void test_vm_create(void) {
    int kvm_fd = open("/dev/kvm", O_RDWR);
    int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);
    assert(vm_fd >= 0 && "Failed to create VM");
    close(vm_fd);
    close(kvm_fd);
    printf("TEST: VM creation - PASS\n");
}

/* Test memory setup */
void test_memory_setup(void) {
    int kvm_fd = open("/dev/kvm", O_RDWR);
    int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);

    void *mem = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    assert(mem != MAP_FAILED);

    struct kvm_userspace_memory_region region = {
        .slot = 0,
        .guest_phys_addr = 0,
        .memory_size = 0x1000,
        .userspace_addr = (uint64_t)mem,
    };
    int ret = ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
    assert(ret == 0 && "Failed to set memory region");

    munmap(mem, 0x1000);
    close(vm_fd);
    close(kvm_fd);
    printf("TEST: Memory setup - PASS\n");
}

6.2 Integration Tests

#!/bin/bash

# Test 1: Basic execution
echo "Test 1: Basic execution..."
./kvmclient guest/hlt.bin 2>&1 | grep -q "EXIT_HLT" \
    && echo "PASS" || echo "FAIL"

# Test 2: Serial output
echo "Test 2: Serial output..."
./kvmclient guest/hello.bin 2>&1 | grep -q "Hello" \
    && echo "PASS" || echo "FAIL"

# Test 3: Memory access
echo "Test 3: Memory access..."
./kvmclient guest/memtest.bin 2>&1 | grep -q "Memory OK" \
    && echo "PASS" || echo "FAIL"

6.3 Guest Test Programs

hlt.asm - Just halt:

[BITS 16]
[ORG 0x0000]
    hlt

hello.asm - Serial output:

[BITS 16]
[ORG 0x0000]

    mov dx, 0x3F8   ; COM1 port

    mov al, 'H'
    out dx, al
    mov al, 'e'
    out dx, al
    mov al, 'l'
    out dx, al
    mov al, 'l'
    out dx, al
    mov al, 'o'
    out dx, al

    hlt

Build guest:

nasm -f bin -o guest/hlt.bin guest/hlt.asm
nasm -f bin -o guest/hello.bin guest/hello.asm

7. Common Pitfalls & Debugging

| Problem | Root Cause | Fix | Verification |
|---------|-----------|-----|--------------|
| KVM_RUN fails immediately | Segment registers wrong | Check segment type, base, limit | Dump and compare with a working example |
| FAIL_ENTRY exit | Invalid guest state | Check CR0, segment setup | Look at hardware_entry_failure_reason |
| Guest doesn't execute | RIP wrong | Set RIP to the load address | Dump registers before KVM_RUN |
| I/O data wrong | Offset calculation | Use run + run->io.data_offset | Print raw bytes |
| Permission denied | Not in kvm group | Add user to kvm group | Check /dev/kvm permissions |
| No output | I/O exit not handled | Add exit handler | Print exit_reason |

Debugging Techniques

/* Handle FAIL_ENTRY with details */
case KVM_EXIT_FAIL_ENTRY:
    fprintf(stderr, "KVM_EXIT_FAIL_ENTRY\n");
    fprintf(stderr, "  hardware_entry_failure_reason: 0x%llx\n",
            run->fail_entry.hardware_entry_failure_reason);
    dump_regs(vcpu_fd);
    exit(1);

/* Handle INTERNAL_ERROR with details */
case KVM_EXIT_INTERNAL_ERROR:
    fprintf(stderr, "KVM_EXIT_INTERNAL_ERROR\n");
    fprintf(stderr, "  suberror: %d\n", run->internal.suberror);
    if (run->internal.suberror == KVM_INTERNAL_ERROR_EMULATION) {
        fprintf(stderr, "  (emulation failed)\n");
    }
    dump_regs(vcpu_fd);
    exit(1);

8. Extensions & Challenges

8.1 Protected Mode Guest

Extend to run 32-bit protected mode code:

  • Set up GDT in guest memory
  • Configure segment registers properly
  • Set CR0.PE = 1
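
Each GDT entry is a packed 8-byte descriptor, and getting the bit layout right is most of the work. A sketch of the encoding (`make_gdt_entry` is a hypothetical helper; access 0x9A with flags 0xC gives the classic flat 32-bit code segment):

```c
#include <assert.h>
#include <stdint.h>

/* Pack base/limit/access/flags into the 8-byte GDT descriptor layout.
 * access holds P/DPL/S/type; flags holds G/DB/L/AVL. */
uint64_t make_gdt_entry(uint32_t base, uint32_t limit,
                        uint8_t access, uint8_t flags)
{
    return  (uint64_t)(limit & 0xFFFF)                  /* limit 15:0  */
          | ((uint64_t)(base & 0xFFFFFF) << 16)         /* base 23:0   */
          | ((uint64_t)access << 40)                    /* access byte */
          | ((uint64_t)((limit >> 16) & 0xF) << 48)     /* limit 19:16 */
          | ((uint64_t)(flags & 0xF) << 52)             /* G/DB/L/AVL  */
          | ((uint64_t)(base >> 24) << 56);             /* base 31:24  */
}
```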

8.2 Multiple VCPUs

Support SMP:

  • Create multiple VCPUs
  • Run in separate threads
  • Handle IPI

8.3 Interrupt Injection

Inject interrupts into guest:

  • Use KVM_INTERRUPT ioctl
  • Emulate timer interrupt
  • Enable guest to use interrupts
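
Without an in-kernel irqchip, a single ioctl queues a vector. A sketch (`inject_irq` is a hypothetical wrapper; vector 0x20 is an arbitrary example, and the guest must be interruptible with a matching IVT/IDT entry):

```c
#include <assert.h>
#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Queue external interrupt `vector` on the VCPU (valid only when no
 * KVM_CREATE_IRQCHIP was issued); returns 0 on success, -1 on error. */
int inject_irq(int vcpu_fd, unsigned int vector)
{
    struct kvm_interrupt irq = { .irq = vector };
    return ioctl(vcpu_fd, KVM_INTERRUPT, &irq);
}
```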

8.4 Virtual APIC

Use KVM’s virtual APIC:

  • KVM_CREATE_IRQCHIP
  • KVM_IRQ_LINE for device interrupts
  • More realistic interrupt handling

8.5 Device Emulation

Add real device emulation:

  • Serial port with full register set
  • Simple disk device
  • Connect to Project 13 mini-QEMU

9. Real-World Connections

Firecracker Architecture

Firecracker (AWS’s microVM) follows the same pattern:

┌──────────────────────────────────────────┐
│              Firecracker                  │
├──────────────────────────────────────────┤
│  - Opens /dev/kvm                        │
│  - Creates VM and VCPUs                  │
│  - Minimal device set (virtio)           │
│  - <5MiB memory overhead per microVM     │
│  - Boots in <125ms                       │
│  - Runs on same KVM API you're learning! │
└──────────────────────────────────────────┘

QEMU with KVM

When you run QEMU with -enable-kvm:

/* QEMU does essentially this: */
int kvm_init(MachineState *ms) {
    s->fd = qemu_open("/dev/kvm", O_RDWR);
    s->vmfd = kvm_ioctl(s, KVM_CREATE_VM, 0);
    /* ... lots more, but same pattern! */
}

Cloud Scale

Every EC2 instance, GCP VM, Azure VM uses this exact API pattern at some level.


10. Resources

Primary References

  • KVM API documentation: Documentation/virt/kvm/api.rst in the Linux kernel source
  • Intel SDM Vol. 3C (VMX) or AMD APM Vol. 2 (SVM)

Code Examples

  • kvmtool (lkvm): a small KVM userspace maintained alongside the kernel
  • Firecracker and crosvm: production Rust VMMs built on the same API

Tutorials

  • "Using the KVM API" (LWN.net)


11. Self-Assessment Checklist

KVM Setup

  • Can open /dev/kvm
  • Can verify API version
  • Can create VM

Memory

  • Can allocate guest memory
  • Can set memory region
  • Guest can access memory

VCPU

  • Can create VCPU
  • Can map kvm_run structure
  • Can initialize segment registers
  • Can initialize general registers

Execution

  • KVM_RUN executes successfully
  • Can handle I/O exits
  • Can handle HLT exit
  • Can handle errors

Guest Programs

  • HLT-only guest works
  • Serial output guest works
  • More complex guests work

12. Submission / Completion Criteria

Your KVM client is complete when you can demonstrate:

  1. Successful Initialization
    • Show API version check
    • Show VM and VCPU creation
    • Show memory setup
  2. Guest Execution
    • Run hello guest, show output
    • Show VM exit handling
    • Show final register state
  3. Exit Handling
    • Demonstrate I/O exit handling
    • Demonstrate HLT handling
    • Handle error cases gracefully
  4. Code Quality
    • Clear structure
    • Error handling throughout
    • Verbose/debug mode

Bonus Points:

  • Protected mode guest works
  • Multiple VCPUs
  • Real serial port emulation
  • Timer interrupt injection

After completing this project, you’ll understand how QEMU, Firecracker, and crosvm work at their core. The KVM API you’ve mastered is the foundation of cloud computing infrastructure running millions of VMs worldwide. You can now build custom VMMs for specific use cases, contribute to QEMU/Firecracker, or optimize virtualized workloads.