Project 8: Virtual Block Device (Disk) Emulator
Build a virtio-blk compatible virtual disk that uses a file or raw device as backing storage, supporting read, write, and flush operations.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Advanced (Level 3) |
| Time Estimate | 3-4 weeks |
| Language | C (alternatives: Rust, Go) |
| Prerequisites | Projects 6-7, understanding of block devices |
| Key Topics | virtio protocol, virtqueues, async I/O (io_uring/AIO), block device semantics |
1. Learning Objectives
By completing this project, you will:
- Master the virtio protocol: Understand the industry-standard paravirtualized I/O interface used by QEMU, Firecracker, and all major cloud providers
- Implement virtqueue handling: Learn the shared-memory ring buffer protocol that enables high-performance guest-host communication
- Build an async I/O backend: Use Linux io_uring or AIO for efficient, non-blocking disk operations
- Understand block device semantics: Learn how operating systems interact with storage at the sector level
2. Theoretical Foundation
2.1 Core Concepts
Why virtio Instead of Emulating Real Hardware?
Traditional device emulation (IDE, SCSI) requires mimicking register-level hardware behavior - many trapped I/O port accesses, and therefore VM exits, per operation. virtio is a paravirtualized interface designed specifically for VMs:
┌──────────────────────────────────────────────────────────────────────────┐
│ Legacy vs. Virtio I/O Comparison │
├──────────────────────────────────────────────────────────────────────────┤
│ │
│ Legacy IDE Emulation (to write 1 sector): │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ 1. Write sector count to port 0x1F2 (VM exit) │ │
│ │ 2. Write LBA low to port 0x1F3 (VM exit) │ │
│ │ 3. Write LBA mid to port 0x1F4 (VM exit) │ │
│ │ 4. Write LBA high to port 0x1F5 (VM exit) │ │
│ │ 5. Write command to port 0x1F7 (VM exit) │ │
│ │ 6. Wait for interrupt (VM exit) │ │
│ │ 7. Write 512 bytes to port 0x1F0, 2 bytes at a time │ │
│ │ That's 256 VM exits just for data! │ │
│ │ 8. Wait for completion interrupt (VM exit) │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ Total: ~260+ VM exits per 512-byte write │
│ │
│ Virtio (to write 1 sector): │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ 1. Fill descriptor in shared memory (no VM exit) │ │
│ │ 2. Update available ring index (no VM exit) │ │
│ │ 3. Write to notification register (1 VM exit) │ │
│ │ 4. VMM reads request from shared memory │ │
│ │ 5. VMM performs actual I/O │ │
│ │ 6. VMM updates used ring │ │
│ │ 7. VMM injects interrupt (1 VM exit) │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ Total: 2 VM exits per operation (regardless of size!) │
│ │
└──────────────────────────────────────────────────────────────────────────┘
The Virtio Architecture
Virtio has three main components:
- Device configuration: PCI config space or MMIO registers
- Virtqueues: Shared memory ring buffers for I/O
- Notifications: Interrupts from device to guest, doorbells from guest to device
┌──────────────────────────────────────────────────────────────────────────┐
│ Virtio Architecture Overview │
├──────────────────────────────────────────────────────────────────────────┤
│ │
│ Guest Driver VMM (Your Code) │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ virtio-blk driver │ │ virtio-blk device │ │
│ │ │ │ │ │
│ │ - Allocates VQ │ │ - Processes VQ │ │
│ │ - Fills requests │ │ - Does real I/O │ │
│ │ - Notifies device │ │ - Signals guest │ │
│ └──────────┬──────────┘ └──────────┬──────────┘ │
│ │ │ │
│ │ Shared Memory │ │
│ │ ┌───────────────────────┐ │ │
│ │ │ Virtqueue │ │ │
│ ├───►│ ┌───────────────┐ │◄───────────┤ │
│ │ │ │ Descriptor │ │ │ │
│ │ │ │ Table │ │ │ │
│ │ │ └───────────────┘ │ │ │
│ │ │ ┌───────────────┐ │ │ │
│ │ │ │ Available │ │ │ │
│ │ │ │ Ring │ │ │ │
│ │ │ └───────────────┘ │ │ │
│ │ │ ┌───────────────┐ │ │ │
│ │ │ │ Used Ring │ │ │ │
│ │ │ └───────────────┘ │ │ │
│ │ └───────────────────────┘ │ │
│ │ │ │
│ ▼ Doorbell (write to MMIO) ▼ Interrupt │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Notification Channel │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────┘
Virtqueue Structure (Split Ring)
The virtqueue uses a “split ring” design with three parts:
┌──────────────────────────────────────────────────────────────────────────┐
│ Virtqueue Split Ring Format │
├──────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. Descriptor Table (fixed size array): │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ Index │ addr (64-bit) │ len (32-bit) │ flags │ next (16-bit)│ │
│ ├───────┼──────────────────┼──────────────┼───────┼──────────────┤ │
│ │ 0 │ 0x7f1234000000 │ 16 │ NEXT │ 1 │ │
│ │ 1 │ 0x7f1234001000 │ 4096 │ NEXT │ 2 │ │
│ │ 2 │ 0x7f1234002000 │ 1 │ WRITE │ - │ │
│ │ ... │ ... │ ... │ ... │ ... │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ │
│ Flags: │
│ VIRTQ_DESC_F_NEXT (0x1) - Another descriptor follows │
│ VIRTQ_DESC_F_WRITE (0x2) - Device writes to this buffer │
│ (vs. device reads from it) │
│ │
│ 2. Available Ring (driver → device): │
│ ┌─────────────────────────────────────────────────┐ │
│ │ flags │ idx │ ring[0] │ ring[1] │ ... │ │ │
│ │ │ │ │ │ │ │ │
│ │ 0 │ 3 │ 0 │ 3 │ 5 │ │ │
│ └─────────────────────────────────────────────────┘ │
│ - flags: suppress interrupts (optional) │
│ - idx: next index the driver will write │
│ - ring[]: indices into descriptor table (head of chains) │
│ │
│ 3. Used Ring (device → driver): │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ flags │ idx │ ring[0].id │ ring[0].len │ ring[1].id │ ... │ │
│ │ │ │ │ │ │ │ │
│ │ 0 │ 2 │ 0 │ 4097 │ 3 │ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ - idx: next index the device will write │
│ - ring[].id: descriptor index that was processed │
│ - ring[].len: bytes written by device (for read operations) │
│ │
└──────────────────────────────────────────────────────────────────────────┘
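To make the layout concrete, the sizes of the three areas follow directly from the structures above: 16 bytes per descriptor, a 4-byte header plus 2 bytes per entry for the available ring, and a 4-byte header plus 8 bytes per entry for the used ring (ignoring the optional event-index fields). A minimal sketch in C - the helper names are illustrative, not part of any API:
#include <stdint.h>
#include <stdio.h>
// Split-ring area sizes per the virtio 1.1 layout, without event-index fields.
static size_t virtq_desc_size(uint16_t num)  { return 16u * (size_t)num; }
static size_t virtq_avail_size(uint16_t num) { return 4u + 2u * (size_t)num; }
static size_t virtq_used_size(uint16_t num)  { return 4u + 8u * (size_t)num; }
int main(void) {
    uint16_t num = 256;                                         // typical QueueNumMax
    printf("desc table: %zu bytes\n", virtq_desc_size(num));    // 4096
    printf("avail ring: %zu bytes\n", virtq_avail_size(num));   // 516
    printf("used ring:  %zu bytes\n", virtq_used_size(num));    // 2052
    return 0;
}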
Virtio-blk Request Format
For block devices, each virtqueue entry is a chain of descriptors:
┌──────────────────────────────────────────────────────────────────────────┐
│ Virtio-blk Request Descriptor Chain │
├──────────────────────────────────────────────────────────────────────────┤
│ │
│ Read Request (guest wants to read from disk): │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Descriptor 0 │ │ Descriptor 1 │ │ Descriptor 2 │ │
│ │ (Header) │────►│ (Data buffer) │────►│ (Status) │ │
│ │ │ │ │ │ │ │
│ │ addr → header │ │ addr → buffer │ │ addr → status │ │
│ │ len = 16 │ │ len = 4096 │ │ len = 1 │ │
│ │ flags = NEXT │ │ flags = NEXT | │ │ flags = WRITE │ │
│ │ │ │ WRITE │ │ │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │
│ Header (struct virtio_blk_req): │
│ ┌────────────────────────────────────────┐ │
│ │ type (32-bit) = VIRTIO_BLK_T_IN (0) │ // Read operation │
│ │ reserved (32-bit) = 0 │ │
│ │ sector (64-bit) = starting sector │ // 512-byte units │
│ └────────────────────────────────────────┘ │
│ │
│ Status byte: │
│ ┌────────────────────────────────────────┐ │
│ │ VIRTIO_BLK_S_OK = 0 │ // Success │
│ │ VIRTIO_BLK_S_IOERR = 1 │ // I/O error │
│ │ VIRTIO_BLK_S_UNSUPP = 2 │ // Unsupported operation │
│ └────────────────────────────────────────┘ │
│ │
│ Write Request (guest wants to write to disk): │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Header │────►│ Data buffer │────►│ Status │ │
│ │ type = OUT (1) │ │ flags = NEXT │ │ flags = WRITE │ │
│ │ flags = NEXT │ │ (device reads) │ │ │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────┘
Async I/O with io_uring
For efficient I/O, we use io_uring (or older AIO). This allows submitting I/O requests without blocking:
┌──────────────────────────────────────────────────────────────────────────┐
│ io_uring Architecture │
├──────────────────────────────────────────────────────────────────────────┤
│ │
│ User Space Kernel Space │
│ ┌───────────────────┐ ┌───────────────────────┐ │
│ │ Your VMM │ │ io_uring Core │ │
│ │ │ │ │ │
│ │ 1. Fill SQ entry │ │ │ │
│ │ 2. Update SQ tail │ │ │ │
│ │ 3. io_uring_enter │────────────────►│ 4. Process requests │ │
│ │ (optional) │ │ 5. Do actual I/O │ │
│ │ │ │ 6. Fill CQ entry │ │
│ │ 8. Read CQ head │◄────────────────│ 7. Update CQ head │ │
│ │ 9. Process result │ │ │ │
│ └───────────────────┘ └───────────────────────┘ │
│ │ │ │
│ │ Shared Memory Rings │ │
│ │ ┌─────────────────────┐ │ │
│ └───►│ Submission Queue │◄───────┘ │
│ │ (SQ entries) │ │
│ └─────────────────────┘ │
│ ┌─────────────────────┐ │
│ ┌───►│ Completion Queue │◄───────┐ │
│ │ │ (CQ entries) │ │ │
│ │ └─────────────────────┘ │ │
│ │ │ │
│ └────────────────────────────────────┘ │
│ │
│ Benefits: │
│ - No system call per I/O (in kernel polling mode) │
│ - Batch submissions and completions │
│ - Async without threads │
│ │
└──────────────────────────────────────────────────────────────────────────┘
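As a concrete starting point, here is a minimal, self-contained io_uring read - a sketch assuming liburing is installed (link with -luring) and a kernel new enough for io_uring_prep_read (5.6+; on older kernels substitute io_uring_prep_readv):
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <unistd.h>
int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) < 0) { fprintf(stderr, "io_uring_queue_init failed\n"); return 1; }
    static char buf[4096];
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);  // grab a submission slot
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);    // read 4 KiB at offset 0
    io_uring_sqe_set_data(sqe, buf);                     // tag for completion matching
    io_uring_submit(&ring);                              // one syscall submits the batch
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);                      // block until the completion arrives
    printf("read returned %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);                       // mark the CQE consumed
    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}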
2.2 Why This Matters
Foundation of cloud storage: Every VM in AWS, GCP, and Azure uses virtio-blk or virtio-scsi for storage. Understanding this protocol is understanding cloud infrastructure.
High-performance I/O: The virtio + io_uring combination is state-of-the-art for virtualized storage. Real cloud hypervisors use exactly these techniques.
Portable device interface: virtio works across QEMU, KVM, Xen, and even bare-metal. The guest driver is the same everywhere.
Boot requirement: The guest needs storage to boot. Your virtual disk will allow booting real operating systems.
2.3 Historical Context
2008: virtio introduced by Rusty Russell for Linux/KVM. Revolutionary improvement over IDE/SCSI emulation.
2014: OASIS standardization of virtio 1.0. Became the industry standard.
2019: io_uring introduced in Linux 5.1. Replaced older AIO for high-performance async I/O.
Today: virtio-blk with io_uring is the fastest virtualized storage option. SPDK, vhost, and similar technologies build on these concepts.
2.4 Common Misconceptions
Misconception 1: “virtio is just a QEMU thing”
- Reality: virtio is an OASIS standard implemented well beyond QEMU/KVM - by Xen, FreeBSD bhyve, Firecracker, and Cloud Hypervisor - and Windows guests are supported through the virtio-win drivers (viostor/vioscsi).
Misconception 2: “Descriptor chains are complicated”
- Reality: They’re just linked lists in shared memory. Once you understand the pointers, it’s straightforward.
Misconception 3: “io_uring is optional - I can just use read/write”
- Reality: Blocking I/O halts your entire VMM. Even with threads, performance suffers. Async I/O is essential.
Misconception 4: “The guest allocates virtqueue memory”
- Reality: True - the guest driver does allocate the virtqueues in guest memory, but the VMM must translate those guest physical addresses to host virtual addresses before it can touch the rings or the data buffers.
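A minimal sketch of that translation for a single contiguous RAM region mapped into the VMM with mmap(); the real memmap_gpa_to_hva() from Project 6 may track several regions, and the struct and helper names below are illustrative only:
#include <stddef.h>
#include <stdint.h>
typedef struct {
    uint64_t guest_base;   // GPA where this RAM region starts
    size_t   size;         // region size in bytes
    uint8_t *host_base;    // host address returned by mmap() for the region
} ram_region_t;
// Translate a GPA to a host pointer, or NULL if [gpa, gpa+len) is not entirely
// inside the region. A device must never trust a guest-supplied address.
static void *gpa_to_hva(const ram_region_t *ram, uint64_t gpa, size_t len) {
    if (gpa < ram->guest_base) return NULL;
    uint64_t off = gpa - ram->guest_base;
    if (off >= ram->size || len > ram->size - off) return NULL;
    return ram->host_base + off;
}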
3. Project Specification
3.1 What You Will Build
A complete virtio-blk device emulator with:
- Virtio MMIO transport: Device registers and feature negotiation
- Virtqueue processing: Parse descriptor chains, extract requests
- Block operations: Read, write, flush, get capacity
- Async I/O backend: io_uring or AIO for non-blocking disk access
- Integration ready: Works with memory mapper from Project 6
3.2 Functional Requirements
- Device Discovery and Configuration
- Virtio MMIO register interface
- Feature negotiation (basic features)
- Device configuration space (capacity, block size)
- Virtqueue setup (guest provides GPA)
- Request Processing
- Parse available ring for new requests
- Walk descriptor chains for request components
- Support VIRTIO_BLK_T_IN (read)
- Support VIRTIO_BLK_T_OUT (write)
- Support VIRTIO_BLK_T_FLUSH (sync)
- Support VIRTIO_BLK_T_GET_ID (optional)
- Backend I/O
- File-backed storage (use existing disk image)
- Async I/O with io_uring
- Fallback to synchronous I/O for simplicity
- Completion Handling
- Update used ring with completed requests
- Write status byte to status descriptor
- Inject interrupt to notify guest
3.3 Non-Functional Requirements
- Performance: Handle hundreds of IOPS with async backend
- Correctness: Data integrity - no corruption or loss
- Compatibility: Work with Linux virtio-blk driver
- Observability: Debug output showing request types and sectors
3.4 Example Usage / Output
$ dd if=/dev/zero of=disk.img bs=1M count=512
$ mkfs.ext4 disk.img
$ ./vblk_emulator disk.img
[VBLK] Virtio-blk device initialized
[VBLK] Backing store: disk.img (512MB)
[VBLK] Virtqueue at GPA 0x10000, 256 entries
# In guest Linux:
$ lsblk
NAME SIZE
vda 512M
$ mount /dev/vda /mnt
$ echo "Hello" > /mnt/test.txt
$ sync
# Back in emulator:
[VBLK] READ sector 2048, count 8
[VBLK] WRITE sector 4096, count 16
[VBLK] FLUSH
Debug output showing virtqueue processing:
$ ./vblk_emulator --debug disk.img
[VBLK] Guest notified (doorbell write)
[VBLK] Available ring: idx=5, last_seen=3
[VBLK] Processing 2 new requests
[VBLK] Request 0: head descriptor 12
[VBLK] - Desc 12: header at GPA 0x1000, len=16
[VBLK] type=VIRTIO_BLK_T_IN, sector=2048
[VBLK] - Desc 13: data at GPA 0x2000, len=4096, flags=WRITE
[VBLK] - Desc 14: status at GPA 0x3000, len=1, flags=WRITE
[VBLK] Submitting io_uring read: offset=1048576, len=4096
[VBLK] Request 1: head descriptor 15
[VBLK] - Desc 15: header at GPA 0x4000, len=16
[VBLK] type=VIRTIO_BLK_T_OUT, sector=4096
[VBLK] - Desc 16: data at GPA 0x5000, len=8192, flags=0
[VBLK] - Desc 17: status at GPA 0x6000, len=1, flags=WRITE
[VBLK] Submitting io_uring write: offset=2097152, len=8192
[VBLK] io_uring CQE: request 0 complete, result=4096
[VBLK] Setting status byte to VIRTIO_BLK_S_OK
[VBLK] Used ring: adding id=12, len=4097
[VBLK] io_uring CQE: request 1 complete, result=8192
[VBLK] Setting status byte to VIRTIO_BLK_S_OK
[VBLK] Used ring: adding id=15, len=1
[VBLK] Injecting interrupt to guest
API usage:
#include "memmap.h"
#include "vblk.h"
// Create memory manager and block device
memmap_t *mm = memmap_create();
vblk_t *blk = vblk_create("disk.img", irq_callback, irq_opaque);
// Register virtio-blk MMIO region
memmap_add_mmio(mm, 0x10001000, 0x1000, vblk_mmio_handler, blk);
// In your main loop:
while (running) {
// ... run guest ...
// Check for async I/O completions
vblk_poll_completions(blk);
}
// Cleanup
vblk_destroy(blk);
memmap_destroy(mm);
4. Solution Architecture
4.1 High-Level Design
┌─────────────────────────────────────────────────────────────────────────┐
│ Virtual Block Device Architecture │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Guest Linux Kernel │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ Block Layer (submit_bio, etc.) │ │ │
│ │ └──────────────────────────┬──────────────────────────────┘ │ │
│ │ │ │ │
│ │ ┌──────────────────────────▼──────────────────────────────┐ │ │
│ │ │ virtio-blk driver (drivers/block/virtio_blk.c) │ │ │
│ │ │ - Translates block I/O to virtio requests │ │ │
│ │ │ - Manages virtqueue │ │ │
│ │ └──────────────────────────┬──────────────────────────────┘ │ │
│ └─────────────────────────────┼───────────────────────────────────┘ │
│ │ │
│ │ MMIO reads/writes │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Memory Mapper (Project 6) │ │
│ │ Traps MMIO at virtio-blk address │ │
│ └──────────────────────────────┬──────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Virtio-blk Device Emulator │ │
│ │ ┌──────────────────┐ ┌──────────────────┐ ┌───────────────┐ │ │
│ │ │ MMIO Registers │ │ Virtqueue Handler│ │ Block Backend │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ │ - Status/Features│ │ - Parse avail │ │ - File ops │ │ │
│ │ │ - Queue config │ │ - Walk desc chain│ │ - io_uring │ │ │
│ │ │ - Doorbell │ │ - Update used │ │ - Completion │ │ │
│ │ └──────────────────┘ └──────────────────┘ └───────────────┘ │ │
│ │ │ │ │ │ │
│ │ │ │ │ │ │
│ │ ▼ ▼ ▼ │ │
│ │ ┌─────────────────────────────────────────────────────────────┐ │ │
│ │ │ Request State Machine │ │ │
│ │ │ PENDING → SUBMITTED → COMPLETING → DONE │ │ │
│ │ └─────────────────────────────────────────────────────────────┘ │ │
│ └──────────────────────────────┬──────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ io_uring / AIO │ │
│ │ Async I/O to host filesystem │ │
│ └──────────────────────────────┬──────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Host Filesystem │ │
│ │ disk.img file │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
4.2 Key Components
Virtio MMIO Registers
┌──────────────────────────────────────────────────────────────────────────┐
│ Virtio MMIO Register Layout │
├──────────────────────────────────────────────────────────────────────────┤
│ │
│ Offset Name Size Description │
│ ──────────────────────────────────────────────────────────────────── │
│ 0x000 MagicValue 4 Must be 0x74726976 ("virt") │
│ 0x004 Version 4 Device version (1 or 2) │
│ 0x008 DeviceID 4 2 for block device │
│ 0x00C VendorID 4 0x554D4551 ("QEMU") typically │
│ 0x010 DeviceFeatures 4 Features (sel by DeviceFeaturesSel)│
│ 0x014 DeviceFeaturesSel 4 Feature page selector (write) │
│ 0x020 DriverFeatures 4 Features accepted by driver │
│ 0x024 DriverFeaturesSel 4 Feature page selector (write) │
│ 0x030 QueueSel 4 Queue index selector │
│ 0x034 QueueNumMax 4 Max queue size (read) │
│ 0x038 QueueNum 4 Current queue size (write) │
│ 0x044 QueueReady 4 Queue ready flag │
│ 0x050 QueueNotify 4 Doorbell (write causes notify) │
│ 0x060 InterruptStatus 4 Interrupt flags │
│ 0x064 InterruptACK 4 Interrupt acknowledge (write) │
│ 0x070 Status 4 Device status │
│ 0x080 QueueDescLow 4 Descriptor table GPA (low) │
│ 0x084 QueueDescHigh 4 Descriptor table GPA (high) │
│ 0x090 QueueDriverLow 4 Available ring GPA (low) │
│ 0x094 QueueDriverHigh 4 Available ring GPA (high) │
│ 0x0A0 QueueDeviceLow 4 Used ring GPA (low) │
│ 0x0A4 QueueDeviceHigh 4 Used ring GPA (high) │
│ 0x100+ Config varies Device-specific configuration │
│ │
│ Device Status Bits: │
│ 0x01 = ACKNOWLEDGE │
│ 0x02 = DRIVER │
│ 0x04 = DRIVER_OK │
│ 0x08 = FEATURES_OK │
│ 0x40 = DEVICE_NEEDS_RESET │
│ 0x80 = FAILED │
│ │
└──────────────────────────────────────────────────────────────────────────┘
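A sketch of the read side of this register map, using the vblk_t fields defined later in Section 4.3; the function shape is an assumption, and only the registers the guest actually reads during initialization are shown:
// Offsets and constants mirror the table above. For virtio-blk the config
// space at 0x100 begins with the 64-bit capacity in 512-byte sectors.
uint64_t vblk_mmio_read(vblk_t *blk, uint64_t offset, int size) {
    (void)size;   // a full implementation would honor the access width
    switch (offset) {
    case 0x000: return 0x74726976;                       // MagicValue: "virt"
    case 0x004: return 2;                                // Version: modern virtio-mmio
    case 0x008: return 2;                                // DeviceID: block device
    case 0x00C: return 0x554D4551;                       // VendorID: "QEMU"
    case 0x010: return blk->device_features;             // page chosen by DeviceFeaturesSel;
                                                         // page 1 must offer VIRTIO_F_VERSION_1
    case 0x034: return 256;                              // QueueNumMax
    case 0x060: return blk->interrupt_status;            // InterruptStatus
    case 0x070: return blk->device_status;               // Status
    case 0x100: return (uint32_t)(blk->capacity);        // config: capacity, low 32 bits
    case 0x104: return (uint32_t)(blk->capacity >> 32);  // config: capacity, high 32 bits
    default:    return 0;                                // unimplemented registers read as 0
    }
}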
Virtqueue Processing
┌──────────────────────────────────────────────────────────────────────────┐
│ Virtqueue Processing Flow │
├──────────────────────────────────────────────────────────────────────────┤
│ │
│ When guest writes to QueueNotify (doorbell): │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ 1. Check avail->idx vs. last_avail_idx │ │
│ │ - If equal, no new requests │ │
│ │ - Difference = number of new requests │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ 2. For each new request: │ │
│ │ - Get head descriptor index from avail->ring[last_avail_idx] │ │
│ │ - Increment last_avail_idx │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ 3. Walk descriptor chain: │ │
│ │ a. First descriptor → header (type, sector) │ │
│ │ b. Middle descriptor(s) → data buffer(s) │ │
│ │ c. Last descriptor → status byte │ │
│ │ - Translate GPA → HVA for each buffer │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ 4. Submit I/O to backend: │ │
│ │ - For read: preadv() or io_uring read │ │
│ │ - For write: pwritev() or io_uring write │ │
│ │ - For flush: fsync() │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ 5. On I/O completion: │ │
│ │ - Write status byte (0 = OK, 1 = error) │ │
│ │ - Add to used ring: used->ring[used->idx % size] = {id, len} │ │
│ │ - Memory barrier │ │
│ │ - Increment used->idx │ │
│ │ - Inject interrupt if needed │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────┘
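A sketch of steps 1-3 in C, assuming the virtqueue_t and virtq_* structures defined just below in Section 4.3, the memmap_gpa_to_hva() helper from Project 6, and a hypothetical vblk_submit_io() that hands the parsed request to the backend. It handles a single data buffer and abbreviates error reporting; real code must also handle multi-buffer chains and report malformed ones:
static void vblk_process_queue(vblk_t *blk) {
    virtqueue_t *vq = &blk->vq;
    // Step 1: avail->idx tells us how far the driver has published requests.
    while (vq->last_avail_idx != vq->avail->idx) {
        // Step 2: fetch the head index of the next descriptor chain.
        uint16_t head = vq->avail->ring[vq->last_avail_idx % vq->num];
        vq->last_avail_idx++;
        struct virtio_blk_req *hdr = NULL;   // first descriptor: request header
        void *data = NULL;                   // middle descriptor (single-buffer case)
        uint32_t data_len = 0;
        uint8_t *status = NULL;              // last descriptor: status byte
        // Step 3: walk the chain, translating each GPA and capping the hop
        // count at vq->num to defend against a cycle in a malicious chain.
        uint16_t idx = head;
        for (int hops = 0; hops < vq->num; hops++) {
            struct virtq_desc *d = &vq->desc[idx];
            void *hva = memmap_gpa_to_hva(blk->mm, d->addr);
            if (!hva) { status = NULL; break; }               // invalid GPA: abort
            if (hdr == NULL)                       hdr = hva;
            else if (d->flags & VIRTQ_DESC_F_NEXT) { data = hva; data_len = d->len; }
            else                                   status = hva;
            if (!(d->flags & VIRTQ_DESC_F_NEXT)) break;
            idx = d->next;
        }
        if (!hdr || !status) continue;       // malformed chain: real code must report it
        vblk_submit_io(blk, head, hdr, data, data_len, status);  // step 4 onward
    }
}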
4.3 Data Structures
// Virtio descriptor (from spec)
struct virtq_desc {
uint64_t addr; // Guest physical address
uint32_t len; // Buffer length
uint16_t flags; // NEXT, WRITE, INDIRECT
uint16_t next; // Next descriptor index (if NEXT flag)
};
// Available ring (driver to device)
struct virtq_avail {
uint16_t flags; // VIRTQ_AVAIL_F_NO_INTERRUPT
uint16_t idx; // Next index driver will write
uint16_t ring[]; // Array of descriptor indices
// uint16_t used_event; // (if VIRTIO_F_EVENT_IDX)
};
// Used ring entry
struct virtq_used_elem {
uint32_t id; // Descriptor chain head index
uint32_t len; // Bytes written by device
};
// Used ring (device to driver)
struct virtq_used {
uint16_t flags; // VIRTQ_USED_F_NO_NOTIFY
uint16_t idx; // Next index device will write
struct virtq_used_elem ring[];
// uint16_t avail_event; // (if VIRTIO_F_EVENT_IDX)
};
// Virtio-blk request header
struct virtio_blk_req {
uint32_t type; // VIRTIO_BLK_T_IN, OUT, FLUSH, etc.
uint32_t reserved;
uint64_t sector; // Starting sector (512-byte units)
};
// Request types
#define VIRTIO_BLK_T_IN 0 // Read
#define VIRTIO_BLK_T_OUT 1 // Write
#define VIRTIO_BLK_T_FLUSH 4 // Flush
#define VIRTIO_BLK_T_GET_ID 8 // Get device ID
// Status codes
#define VIRTIO_BLK_S_OK 0
#define VIRTIO_BLK_S_IOERR 1
#define VIRTIO_BLK_S_UNSUPP 2
// Virtqueue state
typedef struct virtqueue {
// Guest physical addresses
uint64_t desc_gpa;
uint64_t avail_gpa;
uint64_t used_gpa;
// Host virtual addresses (after translation)
struct virtq_desc *desc;
struct virtq_avail *avail;
struct virtq_used *used;
// Tracking
uint16_t num; // Queue size
uint16_t last_avail_idx; // Last processed avail index
} virtqueue_t;
// In-flight request tracking (for async I/O)
typedef struct vblk_request {
uint16_t desc_head; // Head of descriptor chain
uint32_t type; // Request type
uint64_t sector; // Starting sector
void *data_hva; // Host pointer to data buffer
uint32_t data_len; // Data length
uint8_t *status_hva; // Host pointer to status byte
// For io_uring
struct iovec iov; // I/O vector
bool in_flight; // Currently being processed
} vblk_request_t;
// Block device state
typedef struct vblk {
// Backend
int fd; // File descriptor for disk image
uint64_t capacity; // Size in sectors
// Virtio MMIO state
uint32_t device_features;
uint32_t driver_features;
uint32_t device_status;
uint32_t interrupt_status;
// Virtqueue
virtqueue_t vq;
// In-flight requests
vblk_request_t requests[256]; // Max queue size
int num_in_flight;
// io_uring
struct io_uring ring;
// Interrupt callback
void (*irq_callback)(void *opaque, int level);
void *irq_opaque;
// Memory mapper reference (for GPA translation)
memmap_t *mm;
} vblk_t;
4.4 Algorithm Overview
Device Initialization (Guest POV):
1. Read MagicValue, Version, DeviceID - verify it's a virtio-blk device
2. Read DeviceFeatures, decide which to accept
3. Write DriverFeatures with accepted features
4. Read device config (capacity, block size)
5. Set QueueSel = 0 (we only have one queue)
6. Read QueueNumMax, write QueueNum (up to max)
7. Allocate desc/avail/used arrays in guest memory
8. Write QueueDescLow/High, QueueDriverLow/High, QueueDeviceLow/High
9. Write QueueReady = 1
10. Set Status = DRIVER_OK
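On the device side, these steps arrive as a series of MMIO writes. A sketch of how the VMM might handle them, using a three-argument (blk, offset, value) shape consistent with the test code in Section 6; driver_features_sel is an assumed extra field on vblk_t, and error checking is omitted:
// The 64-bit queue addresses are assembled from separate Low/High 32-bit writes;
// offsets mirror the register table in Section 4.2.
void vblk_mmio_write(vblk_t *blk, uint64_t offset, uint32_t value) {
    virtqueue_t *vq = &blk->vq;
    switch (offset) {
    case 0x024: blk->driver_features_sel = value; break;             // DriverFeaturesSel (assumed field)
    case 0x020: blk->driver_features = value; break;                 // DriverFeatures
    case 0x030: /* QueueSel: only queue 0 exists here */ break;
    case 0x038: vq->num = value; break;                              // QueueNum
    case 0x080: vq->desc_gpa  = (vq->desc_gpa  & ~0xFFFFFFFFull) | value; break;
    case 0x084: vq->desc_gpa  = (vq->desc_gpa  &  0xFFFFFFFFull) | ((uint64_t)value << 32); break;
    case 0x090: vq->avail_gpa = (vq->avail_gpa & ~0xFFFFFFFFull) | value; break;
    case 0x094: vq->avail_gpa = (vq->avail_gpa &  0xFFFFFFFFull) | ((uint64_t)value << 32); break;
    case 0x0A0: vq->used_gpa  = (vq->used_gpa  & ~0xFFFFFFFFull) | value; break;
    case 0x0A4: vq->used_gpa  = (vq->used_gpa  &  0xFFFFFFFFull) | ((uint64_t)value << 32); break;
    case 0x044:                                                      // QueueReady
        if (value == 1) {                                            // translate the three areas once
            vq->desc  = memmap_gpa_to_hva(blk->mm, vq->desc_gpa);
            vq->avail = memmap_gpa_to_hva(blk->mm, vq->avail_gpa);
            vq->used  = memmap_gpa_to_hva(blk->mm, vq->used_gpa);
        }
        break;
    case 0x050: vblk_process_queue(blk); break;                      // QueueNotify (doorbell)
    case 0x064: blk->interrupt_status &= ~value; break;              // InterruptACK
    case 0x070: blk->device_status = value; break;                   // Status (0 means reset)
    }
}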
Request Processing (Device POV):
Input: Doorbell notification (write to QueueNotify)
Output: Completed requests in used ring
1. new_count = avail->idx - vq->last_avail_idx
2. For i in 0..new_count:
a. head_idx = avail->ring[(vq->last_avail_idx + i) % vq->num]
b. Walk descriptor chain starting at head_idx:
- desc[0]: header with type/sector
- desc[1..n-1]: data buffers
- desc[n]: status byte location
c. Translate all GPAs to HVAs using memory mapper
d. Create vblk_request_t with all info
e. Submit I/O:
- If read: io_uring_prep_readv()
- If write: io_uring_prep_writev()
- If flush: io_uring_prep_fsync()
3. vq->last_avail_idx += new_count
4. io_uring_submit()
Completion Handling:
1. io_uring_peek_cqe() or io_uring_wait_cqe()
2. For each completed CQE:
a. Find corresponding vblk_request_t
b. Set status byte (OK or IOERR based on result)
c. Add to used ring:
used->ring[used->idx % vq->num] = {
.id = request->desc_head,
.len = (request->type == READ) ? request->data_len + 1 : 1
}
d. Memory barrier (wmb)
e. Increment used->idx
f. io_uring_cqe_seen()
3. If any completions:
a. Set interrupt_status |= 1
b. Call irq_callback(opaque, 1)
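A sketch of this completion path, assuming liburing and the structures from Section 4.3; vblk_poll_completions() is the entry point the main loop calls in the Section 3.4 API example:
void vblk_poll_completions(vblk_t *blk) {
    virtqueue_t *vq = &blk->vq;
    struct io_uring_cqe *cqe;
    int completed = 0;
    while (io_uring_peek_cqe(&blk->ring, &cqe) == 0) {     // step 1: drain ready CQEs
        vblk_request_t *req = io_uring_cqe_get_data(cqe);
        // Step 2b: the status byte reflects the I/O result.
        *req->status_hva = (cqe->res < 0) ? VIRTIO_BLK_S_IOERR : VIRTIO_BLK_S_OK;
        // Step 2c: publish the finished chain in the used ring.
        struct virtq_used_elem *e = &vq->used->ring[vq->used->idx % vq->num];
        e->id  = req->desc_head;
        e->len = (req->type == VIRTIO_BLK_T_IN) ? req->data_len + 1 : 1;
        // Steps 2d/2e: barrier first, so the guest never observes idx
        // advancing before the ring entry itself is visible.
        __atomic_thread_fence(__ATOMIC_SEQ_CST);
        vq->used->idx++;
        req->in_flight = false;
        io_uring_cqe_seen(&blk->ring, cqe);                 // step 2f
        completed++;
    }
    if (completed > 0) {                                    // step 3: notify the guest
        blk->interrupt_status |= 1;                         // "used buffer" interrupt bit
        blk->irq_callback(blk->irq_opaque, 1);
    }
}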
5. Implementation Guide
5.1 Development Environment Setup
Required packages:
# Ubuntu 20.04+
sudo apt install build-essential liburing-dev
# Or build liburing from source for older systems
git clone https://github.com/axboe/liburing
cd liburing && ./configure && make && sudo make install
Kernel requirements:
- Linux 5.1+ for io_uring
- Linux 5.6+ for full io_uring features
Test disk image:
# Create 512MB disk image
dd if=/dev/zero of=disk.img bs=1M count=512
# Format with ext4
mkfs.ext4 disk.img
# Or create qcow2 (if you add qcow2 support later)
qemu-img create -f qcow2 disk.qcow2 512M
5.2 Project Structure
vblk_project/
├── Makefile
├── include/
│ ├── vblk.h # Public API
│ ├── virtio.h # Virtio definitions
│ └── virtqueue.h # Virtqueue handling
├── src/
│ ├── vblk.c # Main implementation
│ ├── vblk_mmio.c # MMIO register handling
│ ├── virtqueue.c # Virtqueue operations
│ ├── backend_file.c # File backend
│ └── backend_iouring.c # io_uring async backend
├── tests/
│ ├── test_virtqueue.c # Virtqueue tests
│ ├── test_backend.c # Backend tests
│ └── test_integration.c # Full integration tests
└── examples/
└── standalone_demo.c # Demo without VM
5.3 The Core Question You’re Answering
“How does a high-performance virtualized storage device communicate with guest operating systems using shared memory queues, and how do we make the actual I/O operations non-blocking?”
This combines:
- Shared memory protocol (virtio)
- Guest/host coordination (notifications and interrupts)
- Async I/O for performance
5.4 Concepts You Must Understand First
Question 1: What is a descriptor table? How does a linked list work in shared memory using indices instead of pointers?
- Reference: Virtio Specification Section 2.4
Question 2: What is the difference between available ring and used ring? Who writes to each?
- Reference: Virtio Specification Section 2.4
Question 3: What is io_uring? How does it differ from traditional read/write calls?
- Reference: io_uring documentation, “What is io_uring?” blog posts
Question 4: How do you translate a Guest Physical Address (GPA) to a Host Virtual Address (HVA)?
- Reference: Your Project 6 implementation
Question 5: What is a memory barrier? Why is it needed when updating shared data structures?
- Reference: Linux kernel memory barriers documentation
5.5 Questions to Guide Your Design
Virtqueue Design:
- How do you handle the case where the guest provides invalid GPAs?
- What happens if the descriptor chain is malformed (infinite loop)?
- How do you handle multiple data buffers in a single request?
Backend Design:
- How do you map sectors to file offsets?
- What if a write request would extend past the disk image size? (see the bounds-check sketch after these questions)
- How do you handle partial I/O completions?
Concurrency Design:
- Can the guest submit more requests while previous ones are in flight?
- What happens if io_uring completions arrive while processing new submissions?
- Do you need any locking?
Error Handling:
- How do you report I/O errors to the guest?
- What if fsync() fails during flush?
- How do you handle out-of-memory conditions?
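For the backend questions above: sectors map linearly to byte offsets (offset = sector * 512), and every request should be bounds-checked against the capacity before touching the backing file. A sketch with an illustrative helper name:
#include <stdbool.h>
#include <stdint.h>
#define SECTOR_SIZE 512u
// Reject requests that start past the end of the image or would run off it;
// out-of-range requests should complete with VIRTIO_BLK_S_IOERR rather than
// silently growing or truncating the backing file.
static bool vblk_request_in_bounds(uint64_t capacity_sectors,
                                   uint64_t sector, uint32_t len_bytes) {
    uint64_t nsectors = ((uint64_t)len_bytes + SECTOR_SIZE - 1) / SECTOR_SIZE;  // round up
    if (sector >= capacity_sectors) return false;
    return nsectors <= capacity_sectors - sector;
}
// Usage (synchronous backend):
//   if (!vblk_request_in_bounds(blk->capacity, hdr->sector, data_len))
//       *status_hva = VIRTIO_BLK_S_IOERR;
//   else
//       pread(blk->fd, data_hva, data_len, (off_t)hdr->sector * SECTOR_SIZE);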
5.6 Thinking Exercise
Before coding, trace through this sequence:
Scenario: Guest reads one sector (512 bytes) from sector 100
Guest operations:
- Allocates three descriptors (indices 0, 1, 2)
- Fills descriptor 0: addr=&header, len=16, flags=NEXT, next=1
- Fills descriptor 1: addr=&buffer, len=512, flags=NEXT|WRITE, next=2
- Fills descriptor 2: addr=&status, len=1, flags=WRITE, next=0
- Fills header: type=VIRTIO_BLK_T_IN, sector=100
- Writes avail->ring[avail->idx % 256] = 0
- Increments avail->idx
- Writes to QueueNotify register (doorbell)
Your device operations:
- Doorbell triggers MMIO handler
- ???
Complete the trace:
- What do you read from the available ring?
- How do you walk the descriptor chain?
- What I/O operation do you perform?
- How do you complete the request?
Expected answer:
- Doorbell triggers MMIO handler
- new_requests = avail->idx - last_avail_idx = 1
- head = avail->ring[last_avail_idx % 256] = 0
- Walk chain: desc[0] (header) -> desc[1] (data, WRITE) -> desc[2] (status, WRITE)
- Translate GPA of header to HVA, read type=IN, sector=100
- Translate GPA of data buffer to HVA
- Submit io_uring_prep_readv(fd, &iov, 1, 100*512)
- On completion: write 0 to status_hva
- Update used ring: used->ring[used->idx % 256] = {id=0, len=513}
- Increment used->idx, inject interrupt
5.7 Hints in Layers
Hint 1 - Starting Point (Conceptual Direction): Start with synchronous I/O (regular pread/pwrite). Get the virtqueue parsing working first. Add io_uring later as an optimization.
Hint 2 - Next Level (More Specific): The virtio spec is your primary reference. Read sections 2.4 (Virtqueues), 2.5 (MMIO Transport), and 5.2 (Block Device). Don’t guess - the spec is very precise.
Hint 3 - Technical Details (Approach): For GPA to HVA translation, you need a reference to the memory mapper:
void *hva = memmap_gpa_to_hva(mm, gpa);
if (!hva) {
// Invalid GPA - report error
}
For io_uring setup:
struct io_uring ring;
io_uring_queue_init(256, &ring, 0); // 256 entries
// Submit a read
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_readv(sqe, fd, &iov, 1, offset);
io_uring_sqe_set_data(sqe, request); // For completion tracking
io_uring_submit(&ring);
// Check for completions
struct io_uring_cqe *cqe;
while (io_uring_peek_cqe(&ring, &cqe) == 0) {
vblk_request_t *req = io_uring_cqe_get_data(cqe);
// Process completion
io_uring_cqe_seen(&ring, cqe);
}
Hint 4 - Tools/Debugging (Verification): Test with a real Linux guest:
# In guest after boot:
dmesg | grep virtio # Should show device detection
lsblk # Should show /dev/vda
dd if=/dev/vda bs=512 count=1 | hexdump # Read first sector
echo "test" > /tmp/file && cp /tmp/file /dev/vda # Write test
Compare behavior with QEMU:
qemu-system-x86_64 -drive file=disk.img,format=raw,if=virtio -nographic
5.8 The Interview Questions They’ll Ask
- “Explain the virtio split ring structure”
- Descriptor table: fixed-size array of buffer descriptors
- Available ring: driver tells device which descriptors are ready
- Used ring: device tells driver which descriptors are done
- Indices wrap using modulo, so rings are circular
- “Why is virtio more efficient than emulating real hardware?”
- Fewer VM exits (one notification vs. many register accesses)
- Batching (submit multiple requests with one notification)
- No complex state machine emulation
- Guest driver is aware it’s virtualized
- “How does io_uring improve performance over traditional async I/O?”
- No system call per I/O submission (in polling mode)
- Shared memory rings (similar to virtio!)
- Better batching of submissions and completions
- Kernel polling option for lowest latency
- “What happens if the guest provides a malformed descriptor chain?”
- Infinite loop: detect cycle, report error
- Invalid GPA: translation fails, report error
- Wrong flags: might try to read from write-only buffer
- Security: VMM must validate everything, don’t trust guest
- “How would you implement live migration for a virtio-blk device?”
- Pause processing, drain in-flight requests
- Migrate device state (features, queue config, indices)
- Transfer backing store or use shared storage
- Resume on destination
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Virtio protocol | Virtio Specification 1.1 | Sections 2, 5.2 |
| Block devices | “Understanding the Linux Kernel” | Chapter 14 |
| Async I/O | “Linux System Programming” by Love | Chapter 4 |
| io_uring | io_uring documentation | Full guide |
| QEMU block layer | QEMU Documentation | Block layer docs |
5.10 Implementation Phases
Phase 1: MMIO Registers (Day 1-3)
- Implement vblk_mmio_read/write handlers
- Handle magic, version, device_id reads
- Handle status register writes (device init sequence)
- Add debug output for all register accesses
Phase 2: Virtqueue Setup (Day 4-6)
- Handle QueueSel, QueueNum, QueueReady
- Handle QueueDesc/Driver/Device address writes
- Translate GPAs to HVAs when queue is ready
- Verify queue setup with debug output
Phase 3: Synchronous Request Processing (Day 7-10)
- Implement doorbell handler
- Parse available ring for new requests
- Walk descriptor chains
- Perform synchronous pread/pwrite
- Update used ring
- Test with simple read/write
Phase 4: io_uring Backend (Day 11-14)
- Initialize io_uring
- Submit requests asynchronously
- Poll for completions in main loop
- Handle multiple in-flight requests
Phase 5: Interrupts and Polish (Day 15-21)
- Implement interrupt injection
- Add flush (fsync) support
- Test with real Linux guest
- Handle error cases
5.11 Key Implementation Decisions
Decision 1: Synchronous vs. async first
- Options: Start with sync, add async later OR implement async from start
- Recommendation: Start with sync. It’s easier to debug. Add io_uring once basics work.
Decision 2: Queue size
- Options: Fixed size, configurable, match guest request
- Recommendation: Use QueueNumMax = 256 (common default), accept guest’s QueueNum.
Decision 3: In-flight request tracking
- Options: Array indexed by descriptor head, linked list, dynamic allocation
- Recommendation: Fixed array indexed by descriptor head. Simple and O(1) lookup.
Decision 4: Memory barriers
- Options: Compiler barriers only, full memory barriers, architecture-specific
- Recommendation: Use __atomic_thread_fence(__ATOMIC_SEQ_CST) for correctness. Optimize later.
6. Testing Strategy
Unit Tests
// test_virtqueue.c
void test_descriptor_chain_walk(void) {
// Create fake descriptor table
struct virtq_desc descs[3];
descs[0] = (struct virtq_desc){
.addr = 0x1000, .len = 16,
.flags = VIRTQ_DESC_F_NEXT, .next = 1
};
descs[1] = (struct virtq_desc){
.addr = 0x2000, .len = 4096,
.flags = VIRTQ_DESC_F_NEXT | VIRTQ_DESC_F_WRITE, .next = 2
};
descs[2] = (struct virtq_desc){
.addr = 0x3000, .len = 1,
.flags = VIRTQ_DESC_F_WRITE, .next = 0
};
// Walk chain starting at 0
int count = 0;
uint16_t idx = 0;
do {
count++;
if (!(descs[idx].flags & VIRTQ_DESC_F_NEXT)) break;
idx = descs[idx].next;
} while (count < 10); // Safety limit
assert(count == 3);
printf("test_descriptor_chain_walk: PASSED\n");
}
// test_backend.c
void test_file_read_write(void) {
// Create temp file
char path[] = "/tmp/vblk_test_XXXXXX";
int fd = mkstemp(path);
ftruncate(fd, 1024 * 1024); // 1MB
// Write test pattern
char write_buf[512];
memset(write_buf, 0xAB, sizeof(write_buf));
pwrite(fd, write_buf, 512, 512); // Sector 1
// Read back
char read_buf[512];
pread(fd, read_buf, 512, 512);
assert(memcmp(read_buf, write_buf, 512) == 0);
close(fd);
unlink(path);
printf("test_file_read_write: PASSED\n");
}
Integration Tests
// test_integration.c
void test_full_request_flow(void) {
// Set up vblk device with mock memory mapper
memmap_t *mm = memmap_create();
memmap_add_ram(mm, 0, 16 * 1024 * 1024); // 16MB RAM
vblk_t *blk = vblk_create("test.img", mock_irq, NULL);
blk->mm = mm;
// Simulate guest initialization
vblk_mmio_write(blk, 0x070, 0x01); // ACKNOWLEDGE
vblk_mmio_write(blk, 0x070, 0x03); // ACKNOWLEDGE | DRIVER
// ... full init sequence ...
// Set up virtqueue in guest memory
void *guest_mem = memmap_gpa_to_hva(mm, 0);
struct virtq_desc *descs = guest_mem;
struct virtq_avail *avail = guest_mem + 4096;
struct virtq_used *used = guest_mem + 8192;
// Set queue addresses
vblk_mmio_write(blk, 0x080, 0); // QueueDescLow
// Build a read request
struct virtio_blk_req *header = guest_mem + 0x10000;
header->type = VIRTIO_BLK_T_IN;
header->sector = 0;
uint8_t *data_buf = guest_mem + 0x11000;
uint8_t *status = guest_mem + 0x12000;
descs[0] = (struct virtq_desc){
.addr = 0x10000, .len = 16,
.flags = VIRTQ_DESC_F_NEXT, .next = 1
};
descs[1] = (struct virtq_desc){
.addr = 0x11000, .len = 512,
.flags = VIRTQ_DESC_F_NEXT | VIRTQ_DESC_F_WRITE, .next = 2
};
descs[2] = (struct virtq_desc){
.addr = 0x12000, .len = 1,
.flags = VIRTQ_DESC_F_WRITE, .next = 0
};
avail->ring[0] = 0;
avail->idx = 1;
// Ring doorbell
vblk_mmio_write(blk, 0x050, 0);
// Process completions
vblk_poll_completions(blk);
// Verify
assert(*status == VIRTIO_BLK_S_OK);
assert(used->idx == 1);
vblk_destroy(blk);
memmap_destroy(mm);
printf("test_full_request_flow: PASSED\n");
}
Performance Tests
void test_throughput(void) {
// Set up device...
struct timespec start, end;
clock_gettime(CLOCK_MONOTONIC, &start);
// Submit 10000 4KB reads
for (int i = 0; i < 10000; i++) {
// Build request
// Ring doorbell
}
// Wait for all completions
while (completed < 10000) {
vblk_poll_completions(blk);
}
clock_gettime(CLOCK_MONOTONIC, &end);
double elapsed = (end.tv_sec - start.tv_sec) +
(end.tv_nsec - start.tv_nsec) / 1e9;
double iops = 10000 / elapsed;
double throughput_mb = (10000.0 * 4096) / (1024 * 1024) / elapsed;
printf("Throughput: %.0f IOPS, %.1f MB/s\n", iops, throughput_mb);
}
7. Common Pitfalls & Debugging
| Problem | Root Cause | Fix | Verification |
|---|---|---|---|
| Guest doesn’t detect device | Magic/version/deviceID wrong | Return correct values (0x74726976, 2, 2) | Check dmesg in guest |
| Queue setup fails | GPA translation wrong | Verify memmap integration | Print translated HVAs |
| No requests processed | Not reading available ring correctly | Check last_avail_idx vs avail->idx | Print both values |
| Data corruption | Wrong descriptor chain parsing | Verify flags and next handling | Trace each descriptor |
| Used ring not updating | Forgot memory barrier | Add wmb() before idx update | Print used->idx |
| No interrupts | Not calling irq_callback | Call after updating used ring | Print in callback |
| io_uring errors | Incorrect prep calls | Check return values | Print errno |
| Partial reads/writes | Not handling short I/O | Retry the remainder in a loop until the full length transfers | Check result length |
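For the partial read/write row above, the usual fix in the synchronous backend is a retry loop. A sketch for pread(); the same pattern applies to pwrite():
#include <errno.h>
#include <stdint.h>
#include <unistd.h>
// Keep issuing pread() until the full length has transferred or a real error
// occurs. Returns 0 on success, -1 on error or unexpected end-of-file.
static int read_full(int fd, void *buf, size_t len, off_t offset) {
    uint8_t *p = buf;
    while (len > 0) {
        ssize_t n = pread(fd, p, len, offset);
        if (n < 0) {
            if (errno == EINTR) continue;   // interrupted: retry
            return -1;                      // real I/O error
        }
        if (n == 0) return -1;              // EOF: request past end of the image
        p += n; len -= n; offset += n;
    }
    return 0;
}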
Debugging Techniques
Trace all MMIO accesses:
uint64_t vblk_mmio_read(vblk_t *blk, uint64_t offset, int size) {
uint64_t value = ... // actual read
printf("[VBLK] MMIO read offset=0x%03lx size=%d value=0x%lx\n",
offset, size, value);
return value;
}
Dump virtqueue state:
void vblk_dump_vq(vblk_t *blk) {
virtqueue_t *vq = &blk->vq;
printf("Virtqueue state:\n");
printf(" desc_gpa=0x%lx avail_gpa=0x%lx used_gpa=0x%lx\n",
vq->desc_gpa, vq->avail_gpa, vq->used_gpa);
printf(" num=%d last_avail_idx=%d\n", vq->num, vq->last_avail_idx);
if (vq->avail) {
printf(" avail->idx=%d\n", vq->avail->idx);
}
if (vq->used) {
printf(" used->idx=%d\n", vq->used->idx);
}
}
Compare with QEMU:
# Enable virtio-blk tracing in QEMU
qemu-system-x86_64 -trace 'virtio_blk*' ...
# Or use QEMU monitor
(qemu) info virtio
(qemu) info qtree
8. Extensions & Challenges
Basic Extensions
- Multiple queues: Support multiple virtqueues for higher parallelism
- Linux virtio-blk supports multiqueue since 3.13
- Disk ID: Implement VIRTIO_BLK_T_GET_ID to return disk serial number
- Discard/TRIM: Support VIRTIO_BLK_T_DISCARD for SSD-aware guests
Intermediate Challenges
- qcow2 support: Parse qcow2 format for copy-on-write disk images
- Cluster-based allocation
- Backing files for snapshots
- Rate limiting: Implement I/O throttling (max IOPS, max bandwidth)
- Multiple backing files: Support multiple disks (vda, vdb, vdc)
Advanced Challenges
- Packed virtqueue: Implement virtio 1.1 packed ring format
- Single ring instead of split
- Better cache behavior
- vhost-user: Move virtio processing to separate process
- Used by DPDK and production hypervisors
- Live migration: Implement state serialization and restore
- Save/load in-flight requests
- Dirty block tracking
9. Real-World Connections
QEMU virtio-blk
QEMU’s hw/block/virtio-blk.c does exactly what you’re building:
// QEMU's request handling (simplified)
static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq) {
VirtIOBlock *s = VIRTIO_BLK(vdev);
VirtIOBlockReq *req;
while ((req = virtio_blk_get_request(s, vq))) {
if (virtio_blk_handle_request(req, &mrb)) {
virtqueue_detach_element(req->vq, &req->elem, 0);
virtio_blk_free_request(req);
}
}
// ...
}
After this project, you can read and understand QEMU’s block device code.
Firecracker
Firecracker’s src/devices/src/virtio/block/ is a clean Rust implementation:
- Simple, readable code
- Uses synchronous I/O with thread pool
- Good reference for minimal implementation
Cloud Hypervisor
Cloud Hypervisor’s block device uses:
- io_uring for async I/O
- vhost-user for offloading to SPDK
- Production-grade performance
10. Resources
Essential Reading
- Virtio Specification 1.1: https://docs.oasis-open.org/virtio/virtio/v1.1/virtio-v1.1.html
- io_uring intro: https://kernel.dk/io_uring.pdf
- liburing documentation: https://github.com/axboe/liburing
Code References
- QEMU virtio-blk: https://github.com/qemu/qemu/blob/master/hw/block/virtio-blk.c
- Firecracker block: https://github.com/firecracker-microvm/firecracker/tree/main/src/devices/src/virtio/block
- Cloud Hypervisor: https://github.com/cloud-hypervisor/cloud-hypervisor
Tutorials
- LWN virtio articles: https://lwn.net/Kernel/Index/#Virtio
- io_uring guide: https://unixism.net/loti/
11. Self-Assessment Checklist
Before moving on, verify you can:
- Explain the virtio split ring structure (descriptors, available, used)
- Describe how a read request flows through descriptor chains
- Implement GPA to HVA translation for virtqueue buffers
- Explain why virtio is faster than emulating real hardware
- Use io_uring for async file I/O
- Debug virtqueue issues by examining ring indices
- Describe the virtio device initialization sequence
12. Submission / Completion Criteria
Your project is complete when:
- Device Discovery Works
- Guest Linux detects virtio-blk device
- Correct magic, version, device ID returned
- Feature negotiation completes
- Queue Setup Works
- Guest can configure virtqueue
- GPAs correctly translated to HVAs
- Queue ready flag handled
- I/O Operations Work
- Read requests return correct data
- Write requests persist to backing file
- Flush requests sync data
- Completions Work
- Used ring updated correctly
- Status byte indicates success/failure
- Interrupts injected to guest
- Integration Works
- Works as MMIO device with Project 6
- Real Linux guest can mount filesystem
- No data corruption under stress test
Demonstration:
$ ./vblk_demo disk.img
[VBLK] Device ready, waiting for guest...
# In guest:
$ fdisk -l /dev/vda
Disk /dev/vda: 512 MiB, 536870912 bytes, 1048576 sectors
$ mount /dev/vda /mnt
$ echo "Hello, virtio!" > /mnt/test.txt
$ sync
$ cat /mnt/test.txt
Hello, virtio!
After completing this project, you’ll have built a production-quality virtualized storage device. This knowledge directly applies to cloud infrastructure work, contributing to QEMU/Firecracker, or building specialized storage solutions for VMs.