Project 4: “Hyperconverged Home Lab with Distributed Storage” — Infrastructure / Distributed Systems
A 3-node cluster running Proxmox VE with Ceph storage, demonstrating VM high availability, live migration, and distributed storage—a mini enterprise HCI setup.
Quick Reference
| Attribute | Value |
|---|---|
| Primary Language | See main guide |
| Alternative Languages | N/A |
| Difficulty | Intermediate (but hardware-intensive) |
| Time Estimate | 1-2 weeks (plus hardware acquisition) |
| Knowledge Area | See main guide |
| Tooling | See main guide |
| Prerequisites | Basic Linux administration, networking fundamentals, 3 machines (can be VMs for learning, but physical preferred) |
What You Will Build
A 3-node cluster running Proxmox VE with Ceph storage, demonstrating VM high availability, live migration, and distributed storage—a mini enterprise HCI setup.
Why It Matters
Hyperconverged infrastructure collapses compute, storage, and networking onto the same nodes. Building one by hand exercises quorum-based clustering, replicated software-defined storage, fencing, and live migration, the same mechanisms that underpin production virtualization platforms, at a scale that fits under a desk.
Core Challenges
- Setting up a Ceph cluster for distributed storage (maps to: software-defined storage)
- Configuring HA and fencing for automatic VM failover (maps to: cluster management)
- Implementing live migration and understanding its constraints (maps to: stateful workload mobility)
- Network design with VLANs and bonding (maps to: software-defined networking); a sample configuration sketch follows this list
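A rough sketch of what that network layer might look like, assuming Proxmox VE's ifupdown2-style /etc/network/interfaces, two bonded NICs (eno1/eno2), VLAN 10 reserved for storage and cluster traffic, and example addresses. Interface names, VLAN IDs, and subnets are placeholders to adapt to your hardware and switch:

# /etc/network/interfaces on pve-node1 (sketch; names and addresses are assumptions)
auto eno1
iface eno1 inet manual

auto eno2
iface eno2 inet manual

auto bond0
iface bond0 inet manual
    bond-slaves eno1 eno2
    bond-mode 802.3ad                  # LACP; needs a matching LAG on the switch
    bond-xmit-hash-policy layer3+4

auto vmbr0
iface vmbr0 inet static                # management + VM bridge on the bond
    address 192.168.1.11/24
    gateway 192.168.1.1
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes

auto vmbr0.10
iface vmbr0.10 inet static             # dedicated VLAN for Ceph and corosync traffic
    address 10.10.10.11/24

Keeping Ceph replication and corosync heartbeats on their own VLAN stops recovery and migration traffic from competing with guest traffic on the VM bridge.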
Key Concepts
| Concept | Resource |
|---|---|
| Distributed storage (RADOS/Ceph) | “Learning Ceph, 2nd Edition” Ch. 1-4 - Karan Singh |
| Cluster consensus | “Designing Data-Intensive Applications” Ch. 9 - Martin Kleppmann |
| Live migration | Proxmox documentation on live migration requirements |
| Software-defined networking | “Computer Networks, Fifth Edition” Ch. 5 - Tanenbaum & Wetherall |
Real-World Outcome
# On Node 1 (pve-node1):
$ pvecm status
Cluster information
───────────────────
Name: homelab-hci
Config Version: 3
Transport: knet
Secure auth: on
Quorum information
──────────────────
Date: Sat Dec 28 14:32:18 2024
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000001
Ring ID: 1.8
Quorate: Yes ← Cluster has quorum!
Membership information
──────────────────────
Nodeid Votes Name
0x00000001 1 pve-node1 (local)
0x00000002 1 pve-node2
0x00000003 1 pve-node3
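# (How the cluster above was formed: a sketch, with the cluster name and addresses as
#  placeholders; pvecm create / pvecm add are the standard Proxmox clustering commands.)
# On pve-node1:
$ pvecm create homelab-hci
# On pve-node2 and pve-node3, pointing at node1's address:
$ pvecm add 192.168.1.11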
$ ceph status
cluster:
id: a1b2c3d4-e5f6-7890-abcd-ef1234567890
health: HEALTH_OK ← All good!
services:
mon: 3 daemons, quorum pve-node1,pve-node2,pve-node3 (age 2h)
mgr: pve-node1(active, since 2h), standbys: pve-node2, pve-node3
osd: 9 osds: 9 up (since 2h), 9 in (since 2h)
data:
pools: 3 pools, 256 pgs
objects: 1.24k objects, 4.8 GiB
usage: 14.4 GiB used, 885.6 GiB / 900 GiB avail
pgs: 256 active+clean ← All data is healthy and replicated!
$ pvesm status
Name Type Status Total Used Available %
ceph-pool rbd active 900.00 GiB 14.40 GiB 885.60 GiB 1.60%
local dir active 100.00 GiB 25.30 GiB 74.70 GiB 25.30%
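# (How the Ceph layer and the ceph-pool storage above were set up: a sketch using the
#  pveceph wrapper; disk device names are placeholders for this lab's 3 OSDs per node.)
# On every node:
$ pveceph install
# Once, on the first node (10.10.10.0/24 is the storage VLAN from the network setup):
$ pveceph init --network 10.10.10.0/24
# On each node: one monitor, one manager, and the node's OSD disks
$ pveceph mon create
$ pveceph mgr create
$ pveceph osd create /dev/sdb
$ pveceph osd create /dev/sdc
$ pveceph osd create /dev/sdd
# Once, on any node: a 3-way replicated pool, then register it as Proxmox storage
$ pveceph pool create ceph-pool --size 3 --min_size 2
$ pvesm add rbd ceph-pool --pool ceph-pool --content images,rootdir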
# Create VM on shared storage
$ qm create 100 --name web-server --memory 2048 --cores 2 \
--scsi0 ceph-pool:32,discard=on --net0 virtio,bridge=vmbr0 --boot c
$ qm start 100
Starting VM 100... done
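# Put the VM under HA management so the failover test later in this walkthrough can
# recover it automatically (without this step, HA would ignore VM 100):
$ ha-manager add vm:100 --state started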
$ qm status 100
status: running
ha-state: started
ha-managed: 1 ← HA is managing this VM!
node: pve-node1
pid: 12345
uptime: 42
# Perform live migration
$ qm migrate 100 pve-node2 --online
[Migration] Starting online migration of VM 100 to pve-node2
→ Precopy phase: iteratively copying memory pages
Pass 1: 2048 MB @ 1.2 GB/s (1.7s)
Pass 2: 145 MB @ 980 MB/s (0.15s) ← Dirty pages from pass 1
Pass 3: 12 MB @ 850 MB/s (0.01s) ← Converging!
→ Switching VM execution to pve-node2
Stop VM on pve-node1... done (10ms)
Transfer final state (CPU, devices)... done (35ms)
Start VM on pve-node2... done (22ms)
→ Cleanup on pve-node1... done
Migration completed successfully!
Downtime: 67ms ← VM was unreachable for only 67ms!
$ qm status 100
status: running
node: pve-node2 ← Now running on node2!
uptime: 1m 54s (migration was seamless)
# Now simulate a node failure: hard-crash node2 (pulling the power cable works too)
$ ssh pve-node2
$ echo o | sudo tee /proc/sysrq-trigger   # SysRq "o": immediate power-off, no clean shutdown
# Back on pve-node1 (30 seconds later):
$ pvecm status
...
Nodes: 2 ← Only 2 nodes responding!
Quorate: Yes ← Still have quorum (majority: 2/3)
$ journalctl -u pve-ha-crm -u pve-ha-lrm -f   # CRM (on the master) decides fencing/recovery; the local LRM starts the VM
[HA Manager] Node pve-node2 not responding (timeout)
[HA Manager] Fencing node pve-node2...
[HA Manager] VM 100 marked for recovery
[HA Manager] Starting VM 100 on pve-node1... done
[HA Manager] VM 100 recovered successfully
$ qm status 100
status: running
node: pve-node1 ← Automatically restarted on node1!
uptime: 45s (recovered from failure)
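# (Note on fencing: the HA stack fences a failed node via a watchdog, by default the
#  softdog kernel module. As a sketch, a hardware watchdog such as IPMI can be selected
#  by setting WATCHDOG_MODULE=ipmi_watchdog in /etc/default/pve-ha-manager and rebooting;
#  module availability depends on your board, so treat the module name as an assumption.)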
$ ceph status
cluster:
health: HEALTH_WARN ← Warning because of degraded data redundancy
3 osds down
Degraded data redundancy: 256/768 objects degraded
services:
osd: 9 osds: 6 up, 9 in ← node2's 3 OSDs are down (still "in" until the out-timer expires)
data:
pgs: 128 active+clean
128 active+undersized+degraded ← Data still accessible, just under-replicated!
# Power node2 back on; corosync, pmxcfs, and the Ceph daemons start automatically at boot
$ ssh pve-node2 systemctl is-active pve-cluster corosync
active
active
# 5 minutes later:
$ ceph status
cluster:
health: HEALTH_OK ← Cluster recovered!
services:
osd: 9 osds: 9 up, 9 in ← All OSDs back
data:
pgs: 256 active+clean ← Data fully replicated again!
Implementation Guide
- Install Proxmox VE on all three nodes and configure the bonded, VLAN-aware network (see the sketch under Core Challenges).
- Form the cluster with pvecm create on the first node and pvecm add on the others; confirm quorum with pvecm status.
- Deploy Ceph (monitors, managers, and OSDs on each node) and create a replicated RBD pool registered as Proxmox storage.
- Create a test VM on the shared pool and exercise online live migration between nodes.
- Enable HA for the VM, verify fencing works, and test recovery from a simulated node failure (an HA group sketch follows this list).
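As a sketch, assuming the VM and node names used above (the group name prefer-node1 and the priorities are arbitrary examples), HA can be tuned so the VM returns to a preferred node while still being allowed to run anywhere:

# Create an HA group where pve-node1 is preferred (higher number = higher priority)
$ ha-manager groupadd prefer-node1 --nodes "pve-node1:2,pve-node2:1,pve-node3:1"
# Attach the HA-managed VM to that group
$ ha-manager set vm:100 --group prefer-node1
# Inspect the HA stack's view of the master, the LRMs, and the managed services
$ ha-manager status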
Milestones
- Milestone 1: 3-node Proxmox cluster formed; pvecm status reports quorum.
- Milestone 2: Ceph reports HEALTH_OK and a replicated pool backs VM disks.
- Milestone 3: A running VM live-migrates between nodes with sub-second downtime.
- Milestone 4: HA recovers the VM automatically after a node failure, and Ceph returns to HEALTH_OK once the node rejoins.
Validation Checklist
- Output matches the real-world outcome example
- pvecm status reports 3 nodes and "Quorate: Yes"
- ceph status reports HEALTH_OK with all OSDs up and in
- Online migration completes with downtime well under one second
- HA restarts the VM on a surviving node after a simulated failure, and the results are repeatable across runs (a small check-script sketch follows this list)
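A minimal check-script sketch for the quorum and Ceph-health items; the file name is arbitrary and it assumes it runs as root on one of the cluster nodes:

#!/usr/bin/env bash
# check-hci.sh (hypothetical helper): fail fast if the cluster or Ceph is unhealthy.
set -eu

# 1. The cluster must be quorate
if ! pvecm status | grep -q 'Quorate:[[:space:]]*Yes'; then
    echo "ERROR: cluster is not quorate" >&2
    exit 1
fi

# 2. Ceph must report HEALTH_OK (anything else, e.g. HEALTH_WARN, fails the check)
health="$(ceph health)"
if [ "$health" != "HEALTH_OK" ]; then
    echo "ERROR: ceph reports: $health" >&2
    exit 2
fi

echo "OK: cluster is quorate and Ceph is healthy"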
References
- Main guide: VIRTUALIZATION_HYPERVISORS_HYPERCONVERGENCE.md (primary references are listed there)