Project 7: Build a Vagrant-Style VM Orchestrator
Build a CLI tool that reads a config file and provisions VMs via libvirt, with snapshots and teardown.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3: Intermediate |
| Time Estimate | 1-2 weeks |
| Main Programming Language | Python or Go |
| Alternative Programming Languages | Rust |
| Coolness Level | Level 4: Infra Builder |
| Business Potential | Level 3: DevOps Utility |
| Prerequisites | libvirt basics, storage/networking |
| Key Topics | Control plane, libvirt domain XML |
1. Learning Objectives
By completing this project, you will:
- Translate a config file into libvirt VM definitions.
- Implement idempotent create/start/destroy flows.
- Manage storage and network resources for VMs.
- Provide a reliable CLI user experience.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Control Plane Reconciliation and Scheduling Basics
Fundamentals A control plane manages VM lifecycle, scheduling, policy, and observability. Libvirt provides a consistent API for VM definition and lifecycle across hypervisors. QEMU provides device models and the runtime. A control plane tracks desired state, reconciles actual state, and integrates with metrics and logs. Without observability (metrics, logs, tracing), VM performance problems are guesswork.
Control planes are not optional in production. Even a small cluster needs a single source of truth for VM identity, ownership, and placement. This is why control planes often include authentication, authorization, and audit logging from the start: without these, operational mistakes become outages. They also provide the integration surface for automation, billing, and policy enforcement across many hosts.
Deep Dive into the concept Control planes separate desired state from actual state. Users declare what should exist, and controllers reconcile reality by creating or updating VMs. This is a distributed systems problem: state must be durable, API calls must be idempotent, and failures must not create duplicate VMs. Many control planes store state in a database and use reconciliation loops similar to Kubernetes.
Scheduling is central. A scheduler must account for CPU, memory, storage, and network capacity. It often supports policies such as anti-affinity, NUMA alignment, or power-aware placement. Overcommit is a lever: you can allocate more vCPUs than physical cores, but this increases CPU steal time and latency variance. Good schedulers use real-time metrics to decide placement and avoid stale data.
Observability ties it together. Hypervisors expose metrics like vCPU run time, VM exit counts, dirty page rates, and disk latency. Logs record VM lifecycle events and migration progress. Tracing tools can attribute latency to host or guest. Without these signals, diagnosing performance regressions is nearly impossible.
Control planes also enforce security and multi-tenancy. They must validate requests, apply quotas, and enforce network and storage ACLs. Audit logs are essential for compliance. They also need to integrate with identity systems and secrets management, because VM credentials and images are sensitive assets.
Finally, control planes must handle partial failure. A host may be unreachable but still running VMs. The control plane must decide whether to fence and restart those VMs elsewhere, balancing safety with availability. This is why leases, heartbeats, and fencing are standard patterns. Event-driven design is essential: libvirt and QEMU expose event streams that notify when VMs change state. A control plane that ignores events quickly diverges from reality.
Metrics design influences both stability and user trust. If the control plane tracks only coarse CPU usage, it may oversubscribe memory or saturate storage I/O without noticing. Good control planes track queue depth, latency percentiles, and error rates, then feed those signals into placement and admission control. Poorly chosen signals lead to oscillation and instability.
Schedulers must encode policy explicitly. Bin-packing maximizes density but increases contention, while spreading reduces hotspots but can waste capacity. Admission control is a safety valve: it rejects requests that would violate hard constraints. Quotas and priorities ensure fairness among tenants, which is essential for multi-tenant reliability.
How this fits the project: This project is a miniature control plane. You will implement desired state, idempotency, and lifecycle reconciliation.
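The scheduling discussion above is background for multi-host control planes; this project targets a single host, so treat the following as a sketch only. The host fields, the overcommit ratio, and the spread-scoring rule are illustrative assumptions, not requirements.

```python
from dataclasses import dataclass

@dataclass
class Host:
    name: str
    cpus: int           # physical cores
    mem_gib: int
    used_vcpus: int = 0
    used_mem_gib: int = 0

def admit(host: Host, vcpus: int, mem_gib: int, cpu_overcommit: float = 2.0) -> bool:
    """Hard constraints: memory is never overcommitted, vCPUs only up to a ratio."""
    cpu_ok = host.used_vcpus + vcpus <= host.cpus * cpu_overcommit
    mem_ok = host.used_mem_gib + mem_gib <= host.mem_gib
    return cpu_ok and mem_ok

def place(hosts: list[Host], vcpus: int, mem_gib: int) -> Host | None:
    """Spread policy: pick the admissible host with the most free memory."""
    candidates = [h for h in hosts if admit(h, vcpus, mem_gib)]
    if not candidates:
        return None  # admission control rejects the request
    return max(candidates, key=lambda h: h.mem_gib - h.used_mem_gib)
```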
Definitions & key terms
- Control plane: management layer for VM lifecycle.
- Reconciliation: converging actual state to desired state.
- Scheduler: placement engine for VMs.
- Observability: metrics, logs, tracing.
Mental model diagram
Config -> Desired state -> libvirt actions -> VM -> feedback
How it works (step-by-step, with invariants and failure modes)
- Parse config into desired state.
- Query actual VM state.
- Apply changes to converge.
- Record state for idempotency.
Invariants: idempotent operations and durable state. Failure modes include duplicate VM creation and stale metrics.
Minimal concrete example
DESIRED: vm1 running
ACTUAL: vm1 missing
ACTION: define + start
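A minimal sketch of that reconcile step using the libvirt Python bindings, assuming the domain XML has already been built elsewhere; re-running it is safe because it checks actual state before acting.

```python
import libvirt  # python3-libvirt bindings

def reconcile(conn: libvirt.virConnect, name: str, xml: str) -> None:
    """Converge one VM toward 'defined and running'; safe to call repeatedly."""
    try:
        dom = conn.lookupByName(name)      # actual state: does the domain exist?
    except libvirt.libvirtError:
        dom = conn.defineXML(xml)          # missing -> define it
    if not dom.isActive():
        dom.create()                       # defined but stopped -> start it
```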
Common misconceptions
- Scheduling is just least-loaded.
- Observability is optional.
Check-your-understanding questions
- Why is idempotency critical in a control plane?
- What happens if metrics are stale?
Check-your-understanding answers
- Retries must not create duplicate VMs.
- The scheduler may overload a host or violate policies.
Real-world applications
- OpenStack Nova
- VM orchestration services
Where you’ll apply it
- Apply in §3.2 (functional requirements) and §5.10 (implementation phases)
- Also used in: P10-mini-cloud-control-plane
References
- Libvirt event and API docs
- “Fundamentals of Software Architecture” (orchestration)
Key insights Control planes are distributed systems; correctness matters as much as features.
Summary You now understand reconciliation, scheduling, and observability in VM control planes.
Homework/Exercises to practice the concept
- Design a minimal state model for a VM lifecycle.
- List metrics a scheduler should consider.
Solutions to the homework/exercises
- States: defined, running, stopped, destroyed, error.
- CPU usage, RAM usage, disk latency, network throughput.
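As an illustration of the state model in the first solution, a sketch of the states and an assumed transition table (the exact transitions are a design choice, not a fixed requirement):

```python
from enum import Enum

class VMState(Enum):
    DEFINED = "defined"
    RUNNING = "running"
    STOPPED = "stopped"
    DESTROYED = "destroyed"
    ERROR = "error"

# Allowed transitions; anything else is rejected by the control plane.
TRANSITIONS = {
    VMState.DEFINED:   {VMState.RUNNING, VMState.DESTROYED},
    VMState.RUNNING:   {VMState.STOPPED, VMState.ERROR},
    VMState.STOPPED:   {VMState.RUNNING, VMState.DESTROYED},
    VMState.ERROR:     {VMState.DESTROYED},
    VMState.DESTROYED: set(),
}

def can_transition(current: VMState, target: VMState) -> bool:
    return target in TRANSITIONS[current]
```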
2.2 Libvirt Domain Model and VM Definitions
Fundamentals Libvirt provides a standardized API and domain definition model for managing VMs across hypervisors. A “domain” describes a VM’s CPU, memory, disks, and network devices. Libvirt uses a structured configuration format (domain XML) to define this state, and translates it into hypervisor-specific commands. Understanding this model is essential for any orchestration tool that creates or updates VMs.
Libvirt abstracts hypervisor differences, but its domain model is still constrained by the underlying platform. Some domain fields map directly to QEMU flags, while others rely on host capabilities. This means orchestrators must validate configurations against host capabilities and provide clear error messages when a configuration cannot be satisfied.
Deep Dive into the concept The libvirt domain model is a declarative specification of VM configuration. It defines vCPU count, memory size, CPU model, device backends, and boot order. Libvirt converts this model into a concrete hypervisor invocation (often QEMU) and maintains state about the running VM. It also manages storage pools, network definitions, and device hotplug operations.
An orchestrator must map user-facing configuration to libvirt domain fields. For example, a “disk” entry might map to a storage pool volume, which in turn maps to a file-backed block device. A “network” entry might map to a bridge or a NATed virtual network. Each of these mappings has constraints: disk images must exist, network names must resolve to a defined network, and the domain must reference valid device paths.
Libvirt also enforces access control and session context. Domains created in a system session are visible system-wide, while domains created in a user session are private. An orchestration tool must decide which context to use and handle permissions accordingly. This matters for production usability: running as a regular user is simpler but may not have access to all devices; running as root is powerful but riskier.
Another key concept is idempotency. If you run the same configuration twice, libvirt should not create duplicate domains. This means the orchestrator must check whether a domain already exists and whether it matches the desired configuration. If it does, no action is required; if it does not, the orchestrator should update the definition or recreate the domain. This is a reconciliation loop in disguise.
Libvirt also exposes events and state transitions (e.g., started, stopped, suspended). A robust orchestrator listens to these events to update its state store. Without event handling, the orchestrator’s view of reality will drift, especially when users interact with VMs outside the tool.
Finally, libvirt is more than domain XML. It also includes storage pools, volumes, secrets, and network definitions. For a Vagrant-style tool, you must manage the lifecycle of these resources alongside the domain itself, or you will leak artifacts over time.
How this fits the project: This concept defines how your CLI maps configuration into libvirt actions and XML definitions.
Definitions & key terms
- Domain: a VM instance managed by libvirt.
- Storage pool: a collection of storage volumes.
- Network definition: libvirt-managed virtual network.
- Domain XML: structured VM configuration.
Mental model diagram
Config -> Domain model -> libvirt -> hypervisor process
How it works (step-by-step, with invariants and failure modes)
- Parse config into domain attributes.
- Resolve storage and network resources.
- Define domain via libvirt.
- Start domain and track state.
Invariants: domain definition must be valid and resources must exist. Failure modes include missing disk images and invalid XML fields.
Minimal concrete example
CONFIG: name=vm1, cpu=2, ram=2G
LIBVIRT: define domain -> start
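A sketch of this example with the libvirt Python bindings. The domain XML is deliberately minimal, and the disk path is a placeholder; a real tool would validate resources and handle errors before defining the domain.

```python
import libvirt

def domain_xml(name: str, vcpus: int, ram_mib: int, disk_path: str, network: str = "default") -> str:
    # Minimal KVM domain definition; real tools add CPU model, serial console, etc.
    return f"""
    <domain type='kvm'>
      <name>{name}</name>
      <memory unit='MiB'>{ram_mib}</memory>
      <vcpu>{vcpus}</vcpu>
      <os><type arch='x86_64'>hvm</type></os>
      <devices>
        <disk type='file' device='disk'>
          <driver name='qemu' type='qcow2'/>
          <source file='{disk_path}'/>
          <target dev='vda' bus='virtio'/>
        </disk>
        <interface type='network'>
          <source network='{network}'/>
          <model type='virtio'/>
        </interface>
      </devices>
    </domain>"""

conn = libvirt.open("qemu:///system")                                  # system session
xml = domain_xml("vm1", 2, 2048, "/var/lib/libvirt/images/vm1.qcow2")  # placeholder disk path
dom = conn.defineXML(xml)                                              # persist the definition
dom.create()                                                           # start the VM
```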
Common misconceptions
- libvirt is only for QEMU.
- Domain XML is stable across all versions.
Check-your-understanding questions
- Why is domain XML a declarative model?
- What is the difference between define and start?
Check-your-understanding answers
- It specifies desired configuration, not imperative steps.
- Define creates the VM configuration; start launches it.
Real-world applications
- virt-manager, OpenStack, Vagrant
Where you’ll apply it
- Apply in §3.2 (functional requirements) and §4.2 (components)
- Also used in: P10-mini-cloud-control-plane
References
- libvirt domain XML documentation
Key insights Libvirt domain definitions are the bridge between user config and hypervisor reality.
Summary You now understand how to map configs into libvirt domain definitions and lifecycle actions.
Homework/Exercises to practice the concept
- Map a sample config to domain fields.
- Explain why define/start are separate operations.
Solutions to the homework/exercises
- Name->domain name, cpu->vcpu, ram->memory, disk->source file.
- Define persists configuration; start runs it.
3. Project Specification
3.1 What You Will Build
A CLI that provisions VMs from a config file using libvirt, with snapshot and destroy support.
3.2 Functional Requirements
- Parse a config file (YAML or JSON).
- Define and start VMs via libvirt.
- Create/destroy snapshots.
- Destroy VMs and clean up disks.
3.3 Non-Functional Requirements
- Performance: provisioning completes in minutes, not hours.
- Reliability: idempotent operations.
- Usability: clear CLI output.
3.4 Example Usage / Output
$ myvagrant up
[vm] define domain vm1
[vm] start vm1
[vm] ssh ready
3.5 Data Formats / Schemas / Protocols
- Config file schema: name, cpu, ram, disk, network
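A sketch of loading and validating such a config with PyYAML; the top-level `vms` list and the exact field names are assumptions for illustration.

```python
import yaml  # PyYAML

REQUIRED = {"name", "cpu", "ram", "disk", "network"}

def load_config(path: str) -> list[dict]:
    """Read the config file and check each VM entry for required fields."""
    with open(path) as f:
        data = yaml.safe_load(f)
    vms = data.get("vms", [])
    for vm in vms:
        missing = REQUIRED - vm.keys()
        if missing:
            raise ValueError(f"{vm.get('name', '<unnamed>')}: missing fields {sorted(missing)}")
    return vms

# Example config (YAML), assuming a top-level 'vms' list:
# vms:
#   - name: vm1
#     cpu: 2
#     ram: 2048       # MiB
#     disk: 20        # GiB
#     network: default
```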
3.6 Edge Cases
- VM already exists
- Disk image missing
3.7 Real World Outcome
A VM can be created and destroyed repeatedly without orphaned resources.
3.7.1 How to Run (Copy/Paste)
- Run in a libvirt-enabled environment
3.7.2 Golden Path Demo (Deterministic)
- Create VM -> SSH -> destroy VM
3.7.3 If CLI: exact terminal transcript
$ myvagrant up
[vm] vm1 ready at 192.168.122.50
$ myvagrant destroy
[vm] vm1 destroyed
4. Solution Architecture
4.1 High-Level Design
Config -> libvirt API -> domain + storage + network
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Config parser | Read config | YAML or JSON |
| Libvirt client | Define/start | System session |
| State store | Track VMs | Local file/db |
4.3 Data Structures (No Full Code)
- VM record: name, uuid, disk path, network
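A sketch of the VM record and a JSON-file state store (the file location and field set are assumptions; §5.11 recommends a JSON file for simplicity):

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path

@dataclass
class VMRecord:
    name: str
    uuid: str
    disk_path: str
    network: str
    state: str = "defined"

STATE_FILE = Path("state.json")   # hypothetical location

def save(records: dict[str, VMRecord]) -> None:
    STATE_FILE.write_text(json.dumps({n: asdict(r) for n, r in records.items()}, indent=2))

def load() -> dict[str, VMRecord]:
    if not STATE_FILE.exists():
        return {}
    return {n: VMRecord(**r) for n, r in json.loads(STATE_FILE.read_text()).items()}
```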
4.4 Algorithm Overview
- Read config
- Resolve resources
- Define and start VM
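A sketch of how a CLI entry point could tie these steps together; the subcommands mirror §3.4, and the helpers referenced in the comments (config loading, reconciliation, teardown) are the components from §4.2, left as placeholders.

```python
import argparse

def main() -> None:
    parser = argparse.ArgumentParser(prog="myvagrant")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("up", help="define and start all VMs in the config")
    sub.add_parser("destroy", help="stop VMs and clean up disks")
    sub.add_parser("status", help="show the state of managed VMs")
    args = parser.parse_args()

    if args.command == "up":
        ...  # load config -> resolve resources -> reconcile each VM
    elif args.command == "destroy":
        ...  # tear down each managed VM, then remove its state entry
    elif args.command == "status":
        ...  # compare the state store with actual libvirt state and print it

if __name__ == "__main__":
    main()
```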
5. Implementation Guide
5.1 Development Environment Setup
# Install libvirt, QEMU, and the Python bindings (example for Debian/Ubuntu; package names vary by distro)
sudo apt install qemu-kvm libvirt-daemon-system libvirt-clients python3-libvirt
5.2 Project Structure
project-root/
├── src/
│ ├── cli.py
│ └── libvirt_client.py
└── README.md
5.3 The Core Question You’re Answering
“How does a control plane map a config into a real VM lifecycle?”
5.4 Concepts You Must Understand First
- Domain definitions
- Storage pools and volumes
- Network definitions
5.5 Questions to Guide Your Design
- How will you ensure idempotency?
- Where will you store state?
5.6 Thinking Exercise
Map a config entry to a libvirt domain XML field.
5.7 The Interview Questions They’ll Ask
- “What does libvirt abstract?”
- “Why is idempotency important?”
5.8 Hints in Layers
Hint 1: Start with define/start operations.
Hint 2: Add snapshot support later (a sketch follows at the end of this subsection).
Hint 3: Pseudocode
READ config -> define -> start -> record state
Hint 4: Use virsh dumpxml to inspect domain state.
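For Hint 2, a minimal snapshot sketch using the libvirt Python bindings; the snapshot name is arbitrary, and internal snapshots like this assume qcow2-backed disks.

```python
import libvirt

def snapshot(dom: libvirt.virDomain, name: str) -> None:
    """Create a named internal snapshot of the domain."""
    xml = f"<domainsnapshot><name>{name}</name></domainsnapshot>"
    dom.snapshotCreateXML(xml, 0)

def rollback(dom: libvirt.virDomain, name: str) -> None:
    """Revert the domain to a previously created snapshot."""
    snap = dom.snapshotLookupByName(name)
    dom.revertToSnapshot(snap)
```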
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Virtualization | “Modern Operating Systems” | Ch. 7 |
5.10 Implementation Phases
- Phase 1: Basic VM create
- Phase 2: Snapshot and destroy
- Phase 3: Idempotency and cleanup
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| State store | JSON file vs DB | JSON file | Simpler |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Integration Tests | VM lifecycle | up/ssh/destroy |
6.2 Critical Test Cases
- Create VM twice -> no duplicates.
- Snapshot and rollback works.
6.3 Test Data
Config: vm1 cpu=2 ram=2G disk=20G
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Wrong session | Permission denied | Use system session |
| Stale state | Duplicate VMs | Clear state store |
7.2 Debugging Strategies
- Compare your config with the live domain XML from virsh dumpxml.
7.3 Performance Traps
- Creating disks on slow storage.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add status command.
8.2 Intermediate Extensions
- Add snapshot rollback.
8.3 Advanced Extensions
- Add multi-VM orchestration.
9. Real-World Connections
9.1 Industry Applications
- Vagrant, OpenStack
9.2 Related Open Source Projects
- libvirt
9.3 Interview Relevance
- Control planes, idempotency
10. Resources
10.1 Essential Reading
- libvirt domain XML docs
10.2 Video Resources
- Talks on VM orchestration