Project 5: Virtio Block Device Emulator

Implement a virtio-blk backend that serves a guest block device backed by a host file.

Quick Reference

Attribute Value
Difficulty Level 4: Advanced
Time Estimate 3-4 weeks
Main Programming Language C
Alternative Programming Languages Rust
Coolness Level Level 4: Storage Blacksmith
Business Potential Level 3: Infra Utility
Prerequisites Virtio basics, block I/O semantics
Key Topics Virtqueues, flush semantics, storage correctness

1. Learning Objectives

By completing this project, you will:

  1. Parse virtio descriptor chains and handle block requests.
  2. Implement reads, writes, and flush operations correctly.
  3. Understand how backend storage choices affect VM behavior.
  4. Validate correctness with a guest filesystem.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Virtio Queue Protocol and Device Model

Fundamentals Device virtualization lets a guest believe it has NICs, disks, and other devices. There are three primary approaches: emulation (software model of a real device), paravirtualization (virtio devices with shared queues), and passthrough (direct device assignment via VFIO). Emulation is compatible but slow due to frequent exits. Virtio reduces exits by using shared memory rings. Passthrough provides near-native performance but requires an IOMMU for DMA isolation. The hypervisor must ensure that device DMA cannot access memory belonging to other guests.

The device model is a contract between guest drivers and the hypervisor. It defines register layouts, queue formats, interrupts, and reset behavior. If the hypervisor violates that contract, guest drivers will misbehave in ways that are difficult to debug. This is why device emulation often focuses on correctness first, then performance optimizations like vhost or SR-IOV.

Deep Dive into the concept I/O is often the performance bottleneck in virtualization because device access crosses trust and privilege boundaries. Emulated devices trigger VM exits on every register access. Virtio changes the interface contract: it uses shared memory queues and explicit feature negotiation, reducing exits and copies. Vhost moves the virtio data path into the host kernel to reduce context switches.

Passthrough uses VFIO to map a physical device directly into a guest. This gives near-native performance but removes the hypervisor from the data path. To make this safe, an IOMMU translates device DMA addresses and enforces isolation. IOMMU groups also matter: devices in the same isolation group must be assigned together, which can limit passthrough options.

SR-IOV extends PCIe devices to expose multiple Virtual Functions (VFs) so multiple VMs can share a device. Each VF has its own queues and interrupts. This yields excellent performance but complicates live migration, since device state lives on hardware. A common production trade-off is to use virtio or vhost for general workloads, and SR-IOV only when performance is critical.

Device virtualization also includes interrupt delivery. The hypervisor must inject interrupts into the guest virtual APIC. Frequent interrupts can cause exit storms, so techniques like MSI-X, interrupt moderation, and posted interrupts are used to reduce overhead. Device reset and hotplug must be handled carefully to avoid leaking state or DMA mappings across guests.

Virtio feature negotiation is another subtlety. The guest and host must agree on a common feature subset; otherwise, the device behavior may diverge. Migration adds more constraints: the device state must be serializable and consistent across hosts.

The backend matters as much as the frontend. A virtio device backed by a slow storage layer will still be slow, even if the frontend is efficient. Understanding backend constraints is essential for performance reasoning.

Device performance depends on queue sizing and interrupt behavior. If queues are too small, the guest stalls waiting for buffers; if they are too large, latency and memory consumption grow. Interrupt moderation and batching reduce exit overhead but can add latency.

How this fits into the project This concept explains how the virtio-blk frontend and backend coordinate via queues.

Definitions & key terms

  • Virtio: paravirtual device interface using shared queues.
  • Virtqueue: ring buffer of descriptors for guest/host I/O.
  • Vhost: kernel acceleration for virtio data paths.
  • Feature negotiation: agreement on device capabilities.

Mental model diagram

Guest driver -> virtqueue -> backend -> completion interrupt

How it works (step-by-step, with invariants and failure modes)

  1. Guest negotiates features.
  2. Guest posts descriptor chains.
  3. Host processes request and updates used ring.
  4. Host signals completion.

Invariants: descriptor chains must be valid; used ring must be updated correctly. Failure modes include corrupt queues or missing interrupts.

Minimal concrete example

REQ: read sector 2048
HOST: read backing file -> write into guest buffer -> signal

Common misconceptions

  • Virtio is always faster regardless of backend.
  • Interrupts are optional for correctness.

Check-your-understanding questions

  1. Why does virtio reduce VM exits?
  2. What is a descriptor chain?
  3. Why must feature negotiation be strict?

Check-your-understanding answers

  1. Data moves through shared memory, not trapped registers.
  2. A linked list of buffers describing a request.
  3. Mismatched features cause undefined device behavior.

Real-world applications

  • Virtio devices in KVM/QEMU

Where you’ll apply it

  • Apply in §3.2 (functional requirements) and §4.2 (components)
  • Also used in: P06-virtio-net-device

References

  • OASIS Virtio spec v1.3

Key insights Virtio performance depends on correct queue handling and backend quality.

Summary You now understand virtio queues and the device model contract.

Homework/Exercises to practice the concept

  1. Sketch a virtqueue with three descriptors.
  2. Explain how an interrupt signals completion.

Solutions to the homework/exercises

  1. Header -> data -> status chain with NEXT flags.
  2. Host updates used ring and triggers interrupt.

2.2 Storage Virtualization and Flush Semantics

Fundamentals Storage virtualization presents a VM with a block device backed by files, raw devices, or distributed storage. Formats like qcow2 add copy-on-write snapshots and thin provisioning, while raw disks provide the fastest data path. Storage correctness requires honoring flush and barrier semantics; otherwise, guest filesystems can corrupt data. In hyperconverged systems, distributed storage such as Ceph RBD provides durability and shared access across nodes, enabling live migration without copying disks.

Storage behavior is visible to guests through latency, throughput, and failure semantics. A guest filesystem expects that a flush means data is durable; a hypervisor that violates this contract may appear fast in benchmarks but fail under real workloads. This makes storage virtualization a correctness-first domain: performance gains are only acceptable if they preserve ordering and durability.

Deep Dive into the concept A VM issues reads and writes at block offsets. The hypervisor must map these blocks to host storage while preserving ordering and durability semantics. A raw backend maps blocks directly to host file offsets. This is fast and simple but lacks snapshots. qcow2 adds metadata indirection: a block is mapped through L1/L2 tables to a data cluster. When a block is written for the first time, qcow2 allocates a new cluster, enabling copy-on-write snapshots. This makes snapshots easy but adds latency and fragmentation.

Caching policies are critical. Writeback caching can improve throughput but risks data loss if the host crashes. Direct I/O is safer for durability but may be slower. Hypervisors must honor guest flush and barrier requests; otherwise, journaled filesystems may report success but lose data after a power failure. Many production outages trace back to misconfigured cache modes or ignored flush commands.

Distributed storage changes the model. Ceph stores data as objects across OSDs. The CRUSH algorithm allows clients to compute object placement without a central metadata server. For block storage, Ceph exposes RBD images; the hypervisor uses librbd or QEMU’s rbd backend to access them. Replication or erasure coding provides durability, but introduces network and CPU overhead.

Snapshots in distributed storage behave differently from qcow2 snapshots. They are often copy-on-write at the object level, which can cause long tail latencies during rebalancing or recovery. Backups usually rely on snapshot + incremental export, which must be coordinated with guest flushes to ensure crash consistency.

Storage virtualization also intersects with security and multi-tenancy. A hypervisor must ensure that one guest cannot read another guest’s data from shared backends, especially when thin provisioning and snapshot reuse are involved. Zeroing new blocks and enforcing strong access controls on storage pools are required to prevent data leakage.

Finally, storage virtualization is visible in failure recovery behavior. When a host fails, the guest may experience stalls while the backend fails over or rebalances. The hypervisor must decide whether to pause I/O, retry, or surface errors to the guest.

Storage virtualization is sensitive to cache modes. Writeback caching boosts throughput but risks data loss on host failure; direct I/O is safer but slower. Snapshot management must be disciplined: long-lived snapshots fragment data and slow reads. Multi-tenant environments need I/O throttling to prevent noisy neighbors from saturating disks. These operational details are part of correctness, not optional optimizations.

How this fits into the project This concept defines the correctness rules for read/write/flush handling in your virtio-blk backend.

Definitions & key terms

  • Flush/Barrier: commands that enforce write ordering and durability.
  • qcow2: copy-on-write disk image format.
  • RBD: Ceph block device interface.

Mental model diagram

Guest block IO -> virtio-blk -> backend (raw/qcow2/RBD)

How it works (step-by-step, with invariants and failure modes)

  1. Guest issues read/write/flush.
  2. Backend maps to file offsets or objects.
  3. Flush ensures ordering and durability.

Invariants: flush semantics must be preserved. Failure modes include data loss after crash or corrupted snapshots.

Minimal concrete example

WRITE sector 4096 -> backend write
FLUSH -> ensure data is durable

Common misconceptions

  • Writeback cache is always safe.
  • Snapshots are free.

Check-your-understanding questions

  1. Why does qcow2 introduce latency?
  2. Why must flush be honored?

Check-your-understanding answers

  1. Metadata indirection and CoW fragmentation.
  2. Filesystems rely on flush for durability guarantees.

Real-world applications

  • VM disks in cloud platforms
  • Snapshot-based backups

Where you’ll apply it

  • Apply in §3.2 (functional requirements) and §6.2 (critical tests)
  • Also used in: P09-hyperconverged-home-lab

References

  • QEMU block layer documentation
  • Ceph RBD docs

Key insights Correctness in storage virtualization is defined by ordering and durability.

Summary You now understand the storage semantics your virtio-blk backend must preserve.

Homework/Exercises to practice the concept

  1. Explain why flush affects performance.
  2. Compare raw and qcow2 trade-offs.

Solutions to the homework/exercises

  1. Flush forces durable writes and can block I/O.
  2. Raw is faster; qcow2 adds snapshots and thin provisioning.

3. Project Specification

3.1 What You Will Build

A virtio-blk backend that serves read/write/flush requests from a guest.

3.2 Functional Requirements

  1. Parse virtio descriptor chains.
  2. Implement READ, WRITE, FLUSH.
  3. Update used ring and signal completion.

3.3 Non-Functional Requirements

  • Performance: handle sequential reads efficiently.
  • Reliability: no data corruption under normal use.
  • Usability: clear logging for requests.

3.4 Example Usage / Output

$ ./vblk --backing=disk.img
[VBLK] READ sector=2048 len=8
[VBLK] WRITE sector=4096 len=16
[VBLK] FLUSH

3.5 Data Formats / Schemas / Protocols

  • Virtqueue descriptor chain (header, data, status)

3.6 Edge Cases

  • Unaligned sector request
  • Flush with no prior writes

3.7 Real World Outcome

Guest can format, mount, and use the disk without corruption.

3.7.1 How to Run (Copy/Paste)

  • Run backend and attach to a guest

3.7.2 Golden Path Demo (Deterministic)

  • Guest writes a file, flushes, reboots, file persists

3.7.3 If CLI: exact terminal transcript

$ ./vblk --backing=disk.img
[VBLK] init ok
[VBLK] WRITE sector=4096 len=16
[VBLK] FLUSH

4. Solution Architecture

4.1 High-Level Design

Virtqueue parser -> Block backend -> Used ring -> Interrupt

4.2 Key Components

| Component | Responsibility | Key Decisions |
|-----------|----------------|---------------|
| Queue parser | Decode requests | Support split ring |
| Backend I/O | Read/write/flush | File-backed first |
| Interrupts | Signal completion | MSI-X or eventfd |

4.3 Data Structures (No Full Code)

  • Descriptor chain representation
  • Request object (type, sector, length)

4.4 Algorithm Overview

  1. Read available descriptor index.
  2. Parse chain into request.
  3. Perform I/O.
  4. Update used ring.

5. Implementation Guide

5.1 Development Environment Setup

# QEMU + KVM, a guest Linux image, and a backing file

5.2 Project Structure

project-root/
├── src/
│   ├── vblk.c
│   └── virtqueue.c
└── README.md

5.3 The Core Question You’re Answering

“How does virtio move block I/O from guest to host efficiently?”

5.4 Concepts You Must Understand First

  1. Virtqueue descriptor chains
  2. Block request headers
  3. Flush semantics

5.5 Questions to Guide Your Design

  1. How will you parse descriptor chains safely?
  2. How will you handle partial reads/writes?

5.6 Thinking Exercise

Draw a virtio-blk request and identify header, data, status.

5.7 The Interview Questions They’ll Ask

  1. “Why does virtio reduce VM exits?”
  2. “How do you handle flush operations?”

5.8 Hints in Layers

Hint 1: Implement read-only requests first.

Hint 2: Add write handling and status updates.

Hint 3: Pseudocode

READ avail -> parse -> backend I/O -> update used

Hint 4: Validate descriptor lengths before I/O.

5.9 Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| Storage | "Operating System Concepts" | Ch. 10 |
| I/O | "Understanding the Linux Kernel" | Ch. 14 |

5.10 Implementation Phases

  • Phase 1: Virtqueue parsing
  • Phase 2: READ/WRITE
  • Phase 3: FLUSH and durability

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
|----------|---------|----------------|-----------|
| Backend | raw file vs qcow2 | raw | simpler correctness |


6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
|----------|---------|----------|
| Unit Tests | Queue parsing | Descriptor chains |
| Integration Tests | Guest filesystem | Mount and write |

6.2 Critical Test Cases

  1. Guest writes data and it persists after reboot.
  2. Flush is honored before shutdown.

6.3 Test Data

Guest writes "hello" then flush

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution |
|---------|---------|----------|
| Ignoring flush | FS corruption | Honor flush requests |
| Bad chain parse | Crashes | Validate descriptors |

7.2 Debugging Strategies

  • Trace each request with sector and length.

7.3 Performance Traps

  • Excessive sync calls can kill throughput.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add read-only mode.

8.2 Intermediate Extensions

  • Add qcow2 backend.

8.3 Advanced Extensions

  • Add multi-queue support.

9. Real-World Connections

9.1 Industry Applications

  • Virtio-blk in KVM/QEMU
  • QEMU block layer

9.2 Interview Relevance

  • Virtio queues, flush semantics

10. Resources

10.1 Essential Reading

  • OASIS Virtio spec v1.3
  • QEMU block layer docs

10.2 Video Resources

  • Talks on virtio performance