Project 9: Hyperconverged Home Lab with Ceph

Build a 3-node hyperconverged cluster with Ceph storage, HA, and live migration.

Quick Reference

| Attribute | Value |
|-----------|-------|
| Difficulty | Level 4: Advanced |
| Time Estimate | 3-4 weeks |
| Main Programming Language | Bash / YAML |
| Alternative Programming Languages | Python |
| Coolness Level | Level 5: Real Infra Lab |
| Business Potential | Level 4: Infra Architect |
| Prerequisites | Linux networking, storage concepts |
| Key Topics | Quorum, Ceph, migration |

1. Learning Objectives

By completing this project, you will:

  1. Deploy a 3-node Proxmox + Ceph cluster.
  2. Configure replication and failure domains.
  3. Execute live migration and observe downtime.
  4. Validate HA behavior during node failure.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Hyperconverged Infrastructure and Quorum

Fundamentals Hyperconverged infrastructure merges compute, storage, and networking into a single cluster. Each node runs a hypervisor and contributes storage to a distributed system (often Ceph). HCI relies on quorum to prevent split-brain, replication or erasure coding for durability, and failure domains to spread risk. The goal is to simplify operations while enabling high availability and live migration without external SANs.

HCI shifts the operational center of gravity. Instead of separate storage and compute teams, the same platform must manage both. This simplifies procurement and scaling but introduces coupling between compute load, storage load, and network health. Understanding these couplings is essential for designing reliable HCI systems.

Deep Dive into the concept HCI is a distributed systems problem in disguise. Every VM write becomes a distributed write. A storage system like Ceph stores objects across OSDs according to the CRUSH algorithm. Clients compute object placement using the cluster map, avoiding centralized metadata bottlenecks. This design scales but requires careful configuration of placement groups, replication factors, and failure domains.

Quorum is a safety mechanism. Cluster components (Corosync for Proxmox, Ceph MONs) require a majority of votes to make progress. In a 3-node cluster, losing one node still leaves quorum; losing two does not. When quorum is lost, the cluster halts VM operations to avoid split-brain writes. Some deployments add a qdevice as a tie-breaker.
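
A minimal sketch of how quorum is usually inspected on a Proxmox node, assuming stock pvecm and corosync tooling; the QNetd address in the last command is a placeholder for an external tie-breaker host.

# Show Proxmox cluster membership and whether the cluster is quorate.
$ pvecm status
# Lower-level corosync view of vote counts and quorum state.
$ corosync-quorumtool -s
# Optional tie-breaker for even-sized clusters (assumes a reachable QNetd host).
$ pvecm qdevice setup 192.168.1.50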

Replication vs erasure coding is a central trade-off. Replication is simple and fast but costs 2x-3x storage overhead. Erasure coding reduces overhead but increases CPU and network cost on writes and recovery. Many HCI systems use replication for hot data and erasure coding for cold data. Recovery traffic can saturate networks and degrade VM I/O latency.

Ceph-specific tuning is a practical reality. The number of placement groups, the replication factor, and the CRUSH failure domain hierarchy all influence recovery time and performance. If placement groups are too few, data distribution is uneven; too many, and metadata overhead grows. Backfill traffic can saturate the network, so many operators throttle recovery to protect VM latency.
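
As a concrete illustration, a small replicated RBD pool might be created and tuned like this; the pool name and numeric values are illustrative, not recommendations.

# Create a replicated pool with an explicit placement-group count (illustrative values).
$ ceph osd pool create vm-pool 128
# Keep three replicas and require at least two to acknowledge each write.
$ ceph osd pool set vm-pool size 3
$ ceph osd pool set vm-pool min_size 2
# Mark the pool for RBD use and inspect how it maps onto the CRUSH hierarchy.
$ ceph osd pool application enable vm-pool rbd
$ ceph osd tree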

Network separation is another common design choice. Many HCI deployments use dedicated networks for storage replication and for VM traffic to reduce congestion and jitter. In small labs this may be simulated with VLANs rather than physical NICs, but the principle is the same: storage traffic is bursty and can starve VM traffic if not isolated.
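
In a Proxmox lab this separation is typically declared when Ceph is initialised. A sketch, assuming a recent Proxmox VE release and assuming 10.0.0.0/24 carries VM/public traffic while 10.10.10.0/24 is a storage VLAN:

# Public (client) network vs. dedicated cluster/replication network (subnets are assumptions).
$ pveceph init --network 10.0.0.0/24 --cluster-network 10.10.10.0/24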

Operational tuning is central in HCI. Placement group counts, replication factors, and CRUSH rules determine how quickly the system can recover from failures. Backfill and recovery traffic must often be throttled to preserve VM performance. Network separation between storage and VM traffic reduces contention and makes latency more predictable. HCI is therefore as much an operations discipline as it is a technical architecture.
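
A sketch of how recovery pressure is often reduced during rebuilds; the values are deliberately conservative and illustrative.

# Limit concurrent backfills and recovery operations per OSD to protect client I/O.
$ ceph config set osd osd_max_backfills 1
$ ceph config set osd osd_recovery_max_active 1
# Watch the effect on recovery progress and overall health.
$ ceph -s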

How this fits into the project This concept drives cluster design decisions: quorum, replication, and failure domains.

Definitions & key terms

  • HCI: hyperconverged infrastructure.
  • Quorum: majority vote for safety.
  • CRUSH: Ceph data placement algorithm.
  • Failure domain: boundary for replica placement.

Mental model diagram

VM write -> RBD -> CRUSH -> OSD replicas

How it works (step-by-step, with invariants and failure modes)

  1. Client writes to RBD.
  2. CRUSH maps objects to OSDs.
  3. Replicas stored across failure domains.
  4. Quorum maintains consistent cluster map.

Invariants: quorum maintained, replicas distributed. Failure modes include split-brain and degraded PGs.
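
To make step 2 concrete, Ceph can report exactly where CRUSH places a given object; the pool and object names below are placeholders.

# Show the placement group and acting OSD set chosen by CRUSH for one object.
$ ceph osd map vm-pool rbd_data.example-object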

Minimal concrete example

HEALTH_OK with 3 MONs
1 node down -> still quorate

Common misconceptions

  • HCI always reduces cost.
  • Replication alone guarantees safety.

Check-your-understanding questions

  1. Why does quorum stop split-brain?
  2. What happens if a failure domain is misconfigured?

Check-your-understanding answers

  1. Only a majority can commit state changes.
  2. Replicas may end up on the same rack or node.

Real-world applications

  • Private cloud virtualization
  • Edge clusters

Where you’ll apply it

  • Apply in §3.2 (functional requirements) and §6.2 (critical tests)
  • Also used in: P10-mini-cloud-control-plane

References

  • Ceph architecture docs
  • Proxmox cluster manager docs

Key insights HCI turns virtualization into a distributed storage and quorum problem.

Summary You now understand why quorum, CRUSH, and failure domains define HCI behavior.

Homework/Exercises to practice the concept

  1. Explain why a 2-node cluster is unsafe without a qdevice.
  2. Draw a CRUSH hierarchy for 3 nodes.

Solutions to the homework/exercises

  1. A network partition yields split-brain with no majority.
  2. Root -> rack -> host -> disk; the sketch below shows how to inspect the live hierarchy.
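
A hedged way to compare your drawing with the cluster's actual layout; bucket names and levels depend on your installation.

# Print the CRUSH buckets (root/host/osd by default on a small cluster).
$ ceph osd crush tree
# Equivalent view including OSD up/in status and weights.
$ ceph osd tree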

2.2 Live Migration and Availability

Fundamentals Live migration moves a running VM between hosts with minimal downtime. Pre-copy iteratively copies memory while the VM runs, then pauses briefly to copy remaining dirty pages. Post-copy starts the VM on the destination earlier and fetches pages on demand, reducing downtime but increasing risk if the source fails. Migration requires device state transfer, which is straightforward for virtio but difficult for passthrough devices. Checkpoint/restore for containers shares similar concepts at process level.

Migration is a core operational tool: it enables maintenance without downtime, load balancing, and proactive failure avoidance. The correctness constraints are strict: there must be a single active VM instance, memory state must be consistent, and storage must not diverge.

Deep Dive into the concept Live migration is essentially a distributed checkpoint. A VM’s state consists of CPU registers, memory pages, and device state. In pre-copy, the source iteratively copies memory pages while the VM runs. Each round sends pages dirtied since the last iteration. If the dirty rate is lower than available bandwidth, the process converges. The final stop-and-copy phase pauses the VM, transfers remaining dirty pages and CPU state, and resumes on the destination. Downtime is proportional to the last copy phase.

Post-copy flips the trade-off: the VM is paused briefly, minimal state is transferred, and the VM resumes on the destination. Missing pages are fetched from the source on demand. This reduces downtime but increases risk; if the source fails mid-migration, the VM can crash because required pages are lost.

Dirty tracking is critical. Hypervisors mark pages as dirty using hardware bits or write-protection. This overhead can be significant for write-heavy workloads. Compression and delta encoding can reduce bandwidth but increase CPU usage. Migration also requires compatible CPU features across hosts, stable virtual device models, and shared storage. Without shared storage, block migration must copy disks, which can be slower than memory migration.
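
One practical consequence: on mixed hardware, many operators pin VMs to a lowest-common-denominator virtual CPU model so the guest sees identical CPU features on every host. A sketch using Proxmox's qm, where VM ID 100 is a placeholder:

# Use a baseline CPU model rather than "host" so features match across all nodes.
$ qm set 100 --cpu kvm64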

Failure modes are important. Pre-copy can be safely canceled; the VM continues on the source. Post-copy has a point of no return where the destination becomes the source of truth. Production systems use migration tunnels with bandwidth caps and prioritization to avoid impacting other workloads. They also require coordination services (leases, fencing) to ensure only one host considers the VM active.

Device state serialization is another bottleneck. Virtio devices are designed to be migratable, but even they require careful coordination so that queues, in-flight requests, and interrupts are transferred consistently. Passthrough devices rarely support live migration.

Migration policies must consider bandwidth and business impact. Throttling protects production traffic but lengthens migration time; aggressive migration can destabilize unrelated workloads. Most platforms allow tuning of bandwidth caps and iteration limits. Device state compatibility and CPU feature matching are also prerequisites; without them, migration may fail or produce an unstable guest.
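
A sketch of the knobs typically involved on Proxmox; the 102400 KiB/s (~100 MiB/s) value and the VM ID are placeholders, and flag availability depends on your release.

# Cluster-wide cap via /etc/pve/datacenter.cfg, e.g.:
#   bwlimit: migration=102400
# Per-migration override on the command line (recent Proxmox VE releases):
$ qm migrate 100 node2 --online --bwlimit 102400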

How this fits into the project This concept guides your live migration validation and HA testing.

Definitions & key terms

  • Pre-copy: iterative memory copying while VM runs.
  • Post-copy: resume VM then fetch pages on demand.
  • Dirty page: page modified since last copy.

Mental model diagram

run -> copy dirty -> run -> stop -> final copy -> resume

How it works (step-by-step, with invariants and failure modes)

  1. Establish migration channel.
  2. Copy memory iteratively.
  3. Pause VM and copy remaining state.
  4. Resume on destination.

Invariants: only one active VM, consistent CPU model. Failure modes include non-converging dirty rate.

Minimal concrete example

ROUND1 8GB
ROUND2 1GB
STOP 120MB, downtime 120ms
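
A rough sanity check on those numbers (assuming a 10 Gbit/s migration link, i.e. about 1.25 GB/s): 120 MB / 1.25 GB/s ≈ 96 ms of transfer time, so a measured downtime of roughly 120 ms is plausible once pause/resume overhead is included.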

Common misconceptions

  • Migration is always safe.
  • Passthrough devices migrate easily.

Check-your-understanding questions

  1. Why might pre-copy not converge?
  2. What is the risk of post-copy?

Check-your-understanding answers

  1. VM dirties memory faster than network can copy.
  2. Source failure breaks missing page retrieval.

Real-world applications

  • Host evacuation
  • Maintenance windows without downtime

Where you’ll apply it

  • Apply in §3.2 (functional requirements) and §6.2 (critical tests)
  • Also used in: P10-mini-cloud-control-plane

References

  • QEMU migration docs

Key insights Migration is bandwidth-limited and correctness depends on strict coordination.

Summary You now understand pre-copy vs post-copy and why shared storage is important.

Homework/Exercises to practice the concept

  1. Explain why shared storage simplifies migration.
  2. Measure downtime under different dirty rates.

Solutions to the homework/exercises

  1. Only memory and CPU state move; disks remain accessible.
  2. Higher dirty rates increase downtime.

3. Project Specification

3.1 What You Will Build

A 3-node Proxmox + Ceph cluster with HA and live migration.

3.2 Functional Requirements

  1. Create a 3-node cluster with quorum.
  2. Deploy Ceph MONs and OSDs.
  3. Create an RBD-backed VM.
  4. Perform live migration.

3.3 Non-Functional Requirements

  • Performance: stable storage latency under load.
  • Reliability: HA restarts VMs after node failure.
  • Usability: clear operational status outputs.

3.4 Example Usage / Output

$ pvecm status
Quorate: Yes

$ ceph -s
health: HEALTH_OK

3.5 Data Formats / Schemas / Protocols

  • Ceph CRUSH map, placement groups, MON quorum

3.6 Edge Cases

  • Node failure during migration
  • Network partition

3.7 Real World Outcome

VM migrates live and restarts on failure without data loss.

3.7.1 How to Run (Copy/Paste)

  • Configure nodes, network, and disks

3.7.2 Golden Path Demo (Deterministic)

  • Create VM -> migrate -> simulate node down

3.7.3 If CLI: exact terminal transcript

$ qm migrate 100 node2 --online
migration started

4. Solution Architecture

4.1 High-Level Design

Proxmox nodes + Ceph OSDs -> shared RBD -> VM migration

4.2 Key Components

| Component | Responsibility | Key Decisions |
|-----------|----------------|---------------|
| Ceph OSDs | Storage replicas | Replication factor |
| MON quorum | Cluster health | Odd number of MONs |
| HA manager | VM failover | Fencing policy |

4.3 Data Structures (No Full Code)

  • CRUSH map hierarchy
  • VM placement records

4.4 Algorithm Overview

  1. Establish quorum
  2. Create storage pool
  3. Deploy VM on RBD
  4. Migrate and failover (see the command sketch below)
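
A minimal end-to-end command sketch of these steps on Proxmox, assuming stock pvecm/pveceph/qm tooling; node names, IP addresses, device paths, and VM ID 100 are placeholders.

# Phase 1: form the cluster and confirm quorum.
$ pvecm create lab-cluster          # on node1
$ pvecm add 192.168.1.11            # on node2/node3, pointing at node1
$ pvecm status
# Phase 2: deploy Ceph monitors and OSDs, then a replicated pool.
$ pveceph install
$ pveceph mon create
$ pveceph osd create /dev/sdb
$ pveceph pool create vm-pool       # optionally register it as Proxmox storage
# Phase 3: place a VM on the RBD-backed storage (creation omitted), then migrate it live.
$ qm migrate 100 node2 --online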

5. Implementation Guide

5.1 Development Environment Setup

# Three nodes with virtualization support

5.2 Project Structure

infra/
├── cluster/
├── ceph/
└── docs/

5.3 The Core Question You’re Answering

“How do you build a cluster that keeps VMs running when hardware fails?”

5.4 Concepts You Must Understand First

  1. Quorum and split-brain
  2. CRUSH and replication
  3. Live migration

5.5 Questions to Guide Your Design

  1. Which network handles storage traffic?
  2. What is your failure domain layout?

5.6 Thinking Exercise

Simulate a node failure and trace data recovery.

5.7 The Interview Questions They’ll Ask

  1. “Why is quorum necessary?”
  2. “How does Ceph place replicas?”

5.8 Hints in Layers

Hint 1: Build cluster first, then add Ceph.
Hint 2: Use replication before erasure coding.
Hint 3: Pseudocode

CHECK quorum -> deploy OSDs -> create pool -> migrate VM

Hint 4: Use ceph health detail for diagnostics.

5.9 Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| Replication | “Designing Data-Intensive Applications” | Ch. 5 |
| Consensus | “Designing Data-Intensive Applications” | Ch. 9 |

5.10 Implementation Phases

  • Phase 1: Cluster and quorum
  • Phase 2: Ceph deployment
  • Phase 3: Migration and HA testing

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
|----------|---------|----------------|-----------|
| Replication factor | 2 vs 3 | 3 | Safer durability |


6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
|----------|---------|----------|
| Integration Tests | Migration | VM moves host |
| Failure Tests | HA | Node down scenario |

6.2 Critical Test Cases

  1. VM migrates live without data loss.
  2. VM restarts on another node after failure (a test sketch for both cases follows below).
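
A hedged test sketch for both cases, assuming the guest is reachable over SSH as vm100 and that VM ID 100 is (or will be) an HA resource; names and IDs are placeholders.

# Case 1: write a marker inside the guest, record its checksum, migrate, re-check.
$ ssh root@vm100 'dd if=/dev/urandom of=/root/marker bs=1M count=64 && md5sum /root/marker'
$ qm migrate 100 node2 --online
$ ssh root@vm100 'md5sum /root/marker'   # checksum must be unchanged
# Case 2: register the VM with HA, power off its node out-of-band, and watch recovery.
$ ha-manager add vm:100 --state started
$ ha-manager status                      # VM should restart on a surviving node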

6.3 Test Data

Sample VM disk with test file

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution |
|---------|---------|----------|
| No quorum | Cluster stops | Add qdevice or node |
| OSD down | HEALTH_WARN | Repair OSD |

7.2 Debugging Strategies

  • Use pvecm status and ceph -s; a fuller diagnostic sketch follows below.
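
A fuller diagnostic pass, sketched with standard Proxmox and Ceph tooling:

# Cluster membership and quorum.
$ pvecm status
# Overall Ceph health, then the detailed reasons behind any WARN/ERR state.
$ ceph -s
$ ceph health detail
# Which OSDs are up/in and how they map onto the failure-domain hierarchy.
$ ceph osd tree
# Recent cluster-stack logs on the affected node.
$ journalctl -u corosync -u pve-cluster --since "1 hour ago"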

7.3 Performance Traps

  • Recovery traffic saturating network.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add monitoring dashboards.

8.2 Intermediate Extensions

  • Add erasure coded pool.

8.3 Advanced Extensions

  • Benchmark migration under load.

9. Real-World Connections

9.1 Industry Applications

  • Enterprise private clouds
  • Proxmox, Ceph

9.3 Interview Relevance

  • Quorum, replication, migration

10. Resources

10.1 Essential Reading

  • Ceph architecture docs
  • Proxmox cluster manager docs

10.2 Video Resources

  • HCI operations talks