Project 7: Build a Vagrant-Style VM Orchestrator
Build a CLI tool that reads a config file and provisions VMs via libvirt, with snapshots and teardown.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3: Intermediate |
| Time Estimate | 1-2 weeks |
| Main Programming Language | Python or Go |
| Alternative Programming Languages | Rust |
| Coolness Level | Level 4: Infra Builder |
| Business Potential | Level 3: DevOps Utility |
| Prerequisites | libvirt basics, storage/networking |
| Key Topics | Control plane, libvirt domain XML |
1. Learning Objectives
By completing this project, you will:
- Translate a config file into libvirt VM definitions.
- Implement idempotent create/start/destroy flows.
- Manage storage and network resources for VMs.
- Provide a reliable CLI user experience.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Control Plane Reconciliation and Scheduling Basics
Fundamentals A control plane manages VM lifecycle, scheduling, policy, and observability. Libvirt provides a consistent API for VM definition and lifecycle across hypervisors. QEMU provides device models and the runtime. A control plane tracks desired state, reconciles actual state, and integrates with metrics and logs. Without observability (metrics, logs, tracing), VM performance problems are guesswork.
Control planes are not optional in production. Even a small cluster needs a single source of truth for VM identity, ownership, and placement. This is why control planes often include authentication, authorization, and audit logging from the start: without these, operational mistakes become outages. They also provide the integration surface for automation, billing, and policy enforcement across many hosts.
Deep Dive into the concept Control planes separate desired state from actual state. Users declare what should exist, and controllers reconcile reality by creating or updating VMs. This is a distributed systems problem: state must be durable, API calls must be idempotent, and failures must not create duplicate VMs. Many control planes store state in a database and use reconciliation loops similar to Kubernetes.
Scheduling is central. A scheduler must account for CPU, memory, storage, and network capacity. It often supports policies such as anti-affinity, NUMA alignment, or power-aware placement. Overcommit is a lever: you can allocate more vCPUs than physical cores, but this increases CPU steal time and latency variance. Good schedulers use real-time metrics to decide placement and avoid stale data.
Observability ties it together. Hypervisors expose metrics like vCPU run time, VM exit counts, dirty page rates, and disk latency. Logs record VM lifecycle events and migration progress. Tracing tools can attribute latency to host or guest. Without these signals, diagnosing performance regressions is nearly impossible.
Control planes also enforce security and multi-tenancy. They must validate requests, apply quotas, and enforce network and storage ACLs. Audit logs are essential for compliance. They also need to integrate with identity systems and secrets management, because VM credentials and images are sensitive assets.
Finally, control planes must handle partial failure. A host may be unreachable but still running VMs. The control plane must decide whether to fence and restart those VMs elsewhere, balancing safety with availability. This is why leases, heartbeats, and fencing are standard patterns. Event-driven design is essential: libvirt and QEMU expose event streams that notify when VMs change state. A control plane that ignores events quickly diverges from reality.
Metrics design influences both stability and user trust. If the control plane tracks only coarse CPU usage, it may oversubscribe memory or saturate storage I/O without noticing. Good control planes track queue depth, latency percentiles, and error rates, then feed those signals into placement and admission control. Poorly chosen signals lead to oscillation and instability.
Schedulers must encode policy explicitly. Bin-packing maximizes density but increases contention, while spreading reduces hotspots but can waste capacity. Admission control is a safety valve: it rejects requests that would violate hard constraints. Quotas and priorities ensure fairness among tenants, which is essential for multi-tenant reliability.
How this fits the project: This project is a miniature control plane. You will implement desired state, idempotency, and lifecycle reconciliation.
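The scheduling discussion above is background for multi-host control planes; this project targets a single host, so treat the following as a sketch only. The host fields, the overcommit ratio, and the spread-scoring rule are illustrative assumptions, not requirements.

```python
from dataclasses import dataclass

@dataclass
class Host:
    name: str
    cpus: int           # physical cores
    mem_gib: int
    used_vcpus: int = 0
    used_mem_gib: int = 0

def admit(host: Host, vcpus: int, mem_gib: int, cpu_overcommit: float = 2.0) -> bool:
    """Hard constraints: memory is never overcommitted, vCPUs only up to a ratio."""
    cpu_ok = host.used_vcpus + vcpus <= host.cpus * cpu_overcommit
    mem_ok = host.used_mem_gib + mem_gib <= host.mem_gib
    return cpu_ok and mem_ok

def place(hosts: list[Host], vcpus: int, mem_gib: int) -> Host | None:
    """Spread policy: pick the admissible host with the most free memory."""
    candidates = [h for h in hosts if admit(h, vcpus, mem_gib)]
    if not candidates:
        return None  # admission control rejects the request
    return max(candidates, key=lambda h: h.mem_gib - h.used_mem_gib)
```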
Definitions & key terms
- Control plane: management layer for VM lifecycle.
- Reconciliation: converging actual state to desired state.
- Scheduler: placement engine for VMs.
- Observability: metrics, logs, tracing.
Mental model diagram
Config -> Desired state -> libvirt actions -> VM -> feedback
How it works (step-by-step, with invariants and failure modes)
- Parse config into desired state.
- Query actual VM state.
- Apply changes to converge.
- Record state for idempotency.
Invariants: idempotent operations and durable state. Failure modes include duplicate VM creation and stale metrics.
Minimal concrete example
DESIRED: vm1 running
ACTUAL: vm1 missing
ACTION: define + start
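A minimal sketch of that reconcile step using the libvirt Python bindings, assuming the domain XML has already been built elsewhere; re-running it is safe because it checks actual state before acting.

```python
import libvirt  # python3-libvirt bindings

def reconcile(conn: libvirt.virConnect, name: str, xml: str) -> None:
    """Converge one VM toward 'defined and running'; safe to call repeatedly."""
    try:
        dom = conn.lookupByName(name)      # actual state: does the domain exist?
    except libvirt.libvirtError:
        dom = conn.defineXML(xml)          # missing -> define it
    if not dom.isActive():
        dom.create()                       # defined but stopped -> start it
```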
Common misconceptions
- Scheduling is just least-loaded.
- Observability is optional.
Check-your-understanding questions
- Why is idempotency critical in a control plane?
- What happens if metrics are stale?
Check-your-understanding answers
- Retries must not create duplicate VMs.
- The scheduler may overload a host or violate policies.
Real-world applications
- OpenStack Nova
- VM orchestration services
Where you’ll apply it
- Apply in §3.2 (functional requirements) and §5.10 (implementation phases)
- Also used in: P10-mini-cloud-control-plane
References
- Libvirt event and API docs
- “Fundamentals of Software Architecture” (orchestration)
Key insights Control planes are distributed systems; correctness matters as much as features.
Summary You now understand reconciliation, scheduling, and observability in VM control planes.
Homework/Exercises to practice the concept
- Design a minimal state model for a VM lifecycle.
- List metrics a scheduler should consider.
Solutions to the homework/exercises
- States: defined, running, stopped, destroyed, error.
- CPU usage, RAM usage, disk latency, network throughput.
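As an illustration of the state model in the first solution, a sketch of the states and an assumed transition table (the exact transitions are a design choice, not a fixed requirement):

```python
from enum import Enum

class VMState(Enum):
    DEFINED = "defined"
    RUNNING = "running"
    STOPPED = "stopped"
    DESTROYED = "destroyed"
    ERROR = "error"

# Allowed transitions; anything else is rejected by the control plane.
TRANSITIONS = {
    VMState.DEFINED:   {VMState.RUNNING, VMState.DESTROYED},
    VMState.RUNNING:   {VMState.STOPPED, VMState.ERROR},
    VMState.STOPPED:   {VMState.RUNNING, VMState.DESTROYED},
    VMState.ERROR:     {VMState.DESTROYED},
    VMState.DESTROYED: set(),
}

def can_transition(current: VMState, target: VMState) -> bool:
    return target in TRANSITIONS[current]
```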
2.2 Libvirt Domain Model and VM Definitions
Fundamentals Libvirt provides a standardized API and domain definition model for managing VMs across hypervisors. A “domain” describes a VM’s CPU, memory, disks, and network devices. Libvirt uses a structured configuration format (domain XML) to define this state, and translates it into hypervisor-specific commands. Understanding this model is essential for any orchestration tool that creates or updates VMs.
Libvirt abstracts hypervisor differences, but its domain model is still constrained by the underlying platform. Some domain fields map directly to QEMU flags, while others rely on host capabilities. This means orchestrators must validate configurations against host capabilities and provide clear error messages when a configuration cannot be satisfied.
Deep Dive into the concept The libvirt domain model is a declarative specification of VM configuration. It defines vCPU count, memory size, CPU model, device backends, and boot order. Libvirt converts this model into a concrete hypervisor invocation (often QEMU) and maintains state about the running VM. It also manages storage pools, network definitions, and device hotplug operations.
An orchestrator must map user-facing configuration to libvirt domain fields. For example, a “disk” entry might map to a storage pool volume, which in turn maps to a file-backed block device. A “network” entry might map to a bridge or a NATed virtual network. Each of these mappings has constraints: disk images must exist, network names must resolve to a defined network, and the domain must reference valid device paths.
Libvirt also enforces access control and session context. Domains created in a system session are visible system-wide, while domains created in a user session are private. An orchestration tool must decide which context to use and handle permissions accordingly. This matters for production usability: running as a regular user is simpler but may not have access to all devices; running as root is powerful but riskier.
Another key concept is idempotency. If you run the same configuration twice, libvirt should not create duplicate domains. This means the orchestrator must check whether a domain already exists and whether it matches the desired configuration. If it does, no action is required; if it does not, the orchestrator should update the definition or recreate the domain. This is a reconciliation loop in disguise.
Libvirt also exposes events and state transitions (e.g., started, stopped, suspended). A robust orchestrator listens to these events to update its state store. Without event handling, the orchestrator’s view of reality will drift, especially when users interact with VMs outside the tool.
Finally, libvirt is more than domain XML. It also includes storage pools, volumes, secrets, and network definitions. For a Vagrant-style tool, you must manage the lifecycle of these resources alongside the domain itself, or you will leak artifacts over time.
How this fits the project: This concept defines how your CLI maps configuration into libvirt actions and XML definitions.
Definitions & key terms
- Domain: a VM instance managed by libvirt.
- Storage pool: a collection of storage volumes.
- Network definition: libvirt-managed virtual network.
- Domain XML: structured VM configuration.
Mental model diagram
Config -> Domain model -> libvirt -> hypervisor process
How it works (step-by-step, with invariants and failure modes)
- Parse config into domain attributes.
- Resolve storage and network resources.
- Define domain via libvirt.
- Start domain and track state.
Invariants: domain definition must be valid and resources must exist. Failure modes include missing disk images and invalid XML fields.
Minimal concrete example
CONFIG: name=vm1, cpu=2, ram=2G
LIBVIRT: define domain -> start
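A sketch of this example with the libvirt Python bindings. The domain XML is deliberately minimal, and the disk path is a placeholder; a real tool would validate resources and handle errors before defining the domain.

```python
import libvirt

def domain_xml(name: str, vcpus: int, ram_mib: int, disk_path: str, network: str = "default") -> str:
    # Minimal KVM domain definition; real tools add CPU model, serial console, etc.
    return f"""
    <domain type='kvm'>
      <name>{name}</name>
      <memory unit='MiB'>{ram_mib}</memory>
      <vcpu>{vcpus}</vcpu>
      <os><type arch='x86_64'>hvm</type></os>
      <devices>
        <disk type='file' device='disk'>
          <driver name='qemu' type='qcow2'/>
          <source file='{disk_path}'/>
          <target dev='vda' bus='virtio'/>
        </disk>
        <interface type='network'>
          <source network='{network}'/>
          <model type='virtio'/>
        </interface>
      </devices>
    </domain>"""

conn = libvirt.open("qemu:///system")                                  # system session
xml = domain_xml("vm1", 2, 2048, "/var/lib/libvirt/images/vm1.qcow2")  # placeholder disk path
dom = conn.defineXML(xml)                                              # persist the definition
dom.create()                                                           # start the VM
```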
Common misconceptions
- libvirt is only for QEMU.
- Domain XML is stable across all versions.
Check-your-understanding questions
- Why is domain XML a declarative model?
- What is the difference between define and start?
Check-your-understanding answers
- It specifies desired configuration, not imperative steps.
- Define creates the VM configuration; start launches it.
Real-world applications
- virt-manager, OpenStack, Vagrant
Where you’ll apply it
- Apply in §3.2 (functional requirements) and §4.2 (components)
- Also used in: P10-mini-cloud-control-plane
References
- libvirt domain XML documentation
Key insights Libvirt domain definitions are the bridge between user config and hypervisor reality.
Summary You now understand how to map configs into libvirt domain definitions and lifecycle actions.
Homework/Exercises to practice the concept
- Map a sample config to domain fields.
- Explain why define/start are separate operations.
Solutions to the homework/exercises
- Name->domain name, cpu->vcpu, ram->memory, disk->source file.
- Define persists configuration; start runs it.
3. Project Specification
3.1 What You Will Build
A CLI that provisions VMs from a config file using libvirt, with snapshot and destroy support.
3.2 Functional Requirements
- Parse a config file (YAML or JSON).
- Define and start VMs via libvirt.
- Create/destroy snapshots.
- Destroy VMs and clean up disks.
3.3 Non-Functional Requirements
- Performance: provisioning completes in minutes, not hours.
- Reliability: idempotent operations.
- Usability: clear CLI output.
3.4 Example Usage / Output
$ myvagrant up
[vm] define domain vm1
[vm] start vm1
[vm] ssh ready
3.5 Data Formats / Schemas / Protocols
- Config file schema: name, cpu, ram, disk, network
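A sketch of loading and validating such a config with PyYAML; the top-level `vms` list and the exact field names are assumptions for illustration.

```python
import yaml  # PyYAML

REQUIRED = {"name", "cpu", "ram", "disk", "network"}

def load_config(path: str) -> list[dict]:
    """Read the config file and check each VM entry for required fields."""
    with open(path) as f:
        data = yaml.safe_load(f)
    vms = data.get("vms", [])
    for vm in vms:
        missing = REQUIRED - vm.keys()
        if missing:
            raise ValueError(f"{vm.get('name', '<unnamed>')}: missing fields {sorted(missing)}")
    return vms

# Example config (YAML), assuming a top-level 'vms' list:
# vms:
#   - name: vm1
#     cpu: 2
#     ram: 2048       # MiB
#     disk: 20        # GiB
#     network: default
```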
3.6 Edge Cases
- VM already exists
- Disk image missing
3.7 Real World Outcome
A VM can be created and destroyed repeatedly without orphaned resources.
3.7.1 How to Run (Copy/Paste)
- Run in a libvirt-enabled environment
3.7.2 Golden Path Demo (Deterministic)
- Create VM -> SSH -> destroy VM
3.7.3 If CLI: exact terminal transcript
$ myvagrant up
[vm] vm1 ready at 192.168.122.50
$ myvagrant destroy
[vm] vm1 destroyed
4. Solution Architecture
4.1 High-Level Design
Config -> libvirt API -> domain + storage + network
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Config parser | Read config | YAML or JSON |
| Libvirt client | Define/start | System session |
| State store | Track VMs | Local file/db |
4.3 Data Structures (No Full Code)
- VM record: name, uuid, disk path, network
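A sketch of the VM record and a JSON-file state store (the file location and field set are assumptions; §5.11 recommends a JSON file for simplicity):

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path

@dataclass
class VMRecord:
    name: str
    uuid: str
    disk_path: str
    network: str
    state: str = "defined"

STATE_FILE = Path("state.json")   # hypothetical location

def save(records: dict[str, VMRecord]) -> None:
    STATE_FILE.write_text(json.dumps({n: asdict(r) for n, r in records.items()}, indent=2))

def load() -> dict[str, VMRecord]:
    if not STATE_FILE.exists():
        return {}
    return {n: VMRecord(**r) for n, r in json.loads(STATE_FILE.read_text()).items()}
```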
4.4 Algorithm Overview
- Read config
- Resolve resources
- Define and start VM
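A sketch of how a CLI entry point could tie these steps together; the subcommands mirror §3.4, and the helpers referenced in the comments (config loading, reconciliation, teardown) are the components from §4.2, left as placeholders.

```python
import argparse

def main() -> None:
    parser = argparse.ArgumentParser(prog="myvagrant")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("up", help="define and start all VMs in the config")
    sub.add_parser("destroy", help="stop VMs and clean up disks")
    sub.add_parser("status", help="show the state of managed VMs")
    args = parser.parse_args()

    if args.command == "up":
        ...  # load config -> resolve resources -> reconcile each VM
    elif args.command == "destroy":
        ...  # tear down each managed VM, then remove its state entry
    elif args.command == "status":
        ...  # compare the state store with actual libvirt state and print it

if __name__ == "__main__":
    main()
```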
5. Implementation Guide
5.1 Development Environment Setup
# Install libvirt, QEMU, and the Python bindings (example for Debian/Ubuntu; package names vary by distro)
sudo apt install qemu-kvm libvirt-daemon-system libvirt-clients python3-libvirt
5.2 Project Structure
project-root/
├── src/
│ ├── cli.py
│ └── libvirt_client.py
└── README.md
5.3 The Core Question You’re Answering
“How does a control plane map a config into a real VM lifecycle?”
5.4 Concepts You Must Understand First
- Domain definitions
- Storage pools and volumes
- Network definitions
5.5 Questions to Guide Your Design
- How will you ensure idempotency?
- Where will you store state?
5.6 Thinking Exercise
Map a config entry to a libvirt domain XML field.
5.7 The Interview Questions They’ll Ask
- “What does libvirt abstract?”
- “Why is idempotency important?”
5.8 Hints in Layers
Hint 1: Start with define/start operations.
Hint 2: Add snapshot support later (a sketch follows at the end of this subsection).
Hint 3: Pseudocode
READ config -> define -> start -> record state
Hint 4: Use virsh dumpxml to inspect domain state.
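For Hint 2, a minimal snapshot sketch using the libvirt Python bindings; the snapshot name is arbitrary, and internal snapshots like this assume qcow2-backed disks.

```python
import libvirt

def snapshot(dom: libvirt.virDomain, name: str) -> None:
    """Create a named internal snapshot of the domain."""
    xml = f"<domainsnapshot><name>{name}</name></domainsnapshot>"
    dom.snapshotCreateXML(xml, 0)

def rollback(dom: libvirt.virDomain, name: str) -> None:
    """Revert the domain to a previously created snapshot."""
    snap = dom.snapshotLookupByName(name)
    dom.revertToSnapshot(snap)
```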
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Virtualization | “Modern Operating Systems” | Ch. 7 |
5.10 Implementation Phases
- Phase 1: Basic VM create
- Phase 2: Snapshot and destroy
- Phase 3: Idempotency and cleanup
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| State store | JSON file vs DB | JSON file | Simpler |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Integration Tests | VM lifecycle | up/ssh/destroy |
6.2 Critical Test Cases
- Create VM twice -> no duplicates.
- Snapshot and rollback works.
6.3 Test Data
Config: vm1 cpu=2 ram=2G disk=20G
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Wrong session | Permission denied | Use system session |
| Stale state | Duplicate VMs | Clear state store |
7.2 Debugging Strategies
- Compare your config with the live domain XML from virsh dumpxml.
7.3 Performance Traps
- Creating disks on slow storage.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add status command.
8.2 Intermediate Extensions
- Add snapshot rollback.
8.3 Advanced Extensions
- Add multi-VM orchestration.
9. Real-World Connections
9.1 Industry Applications
- Vagrant, OpenStack
9.2 Related Open Source Projects
- libvirt
9.3 Interview Relevance
- Control planes, idempotency
10. Resources
10.1 Essential Reading
- libvirt domain XML docs
10.2 Video Resources
- Talks on VM orchestration