Project 9: Hyperconverged Home Lab with Ceph

Build a 3-node hyperconverged cluster with Ceph storage, HA, and live migration.

Quick Reference

| Attribute | Value |
|-----------|-------|
| Difficulty | Level 4: Advanced |
| Time Estimate | 3-4 weeks |
| Main Programming Language | Bash / YAML |
| Alternative Programming Languages | Python |
| Coolness Level | Level 5: Real Infra Lab |
| Business Potential | Level 4: Infra Architect |
| Prerequisites | Linux networking, storage concepts |
| Key Topics | Quorum, Ceph, migration |

1. Learning Objectives

By completing this project, you will:

  1. Deploy a 3-node Proxmox + Ceph cluster.
  2. Configure replication and failure domains.
  3. Execute live migration and observe downtime.
  4. Validate HA behavior during node failure.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Hyperconverged Infrastructure and Quorum

Fundamentals Hyperconverged infrastructure merges compute, storage, and networking into a single cluster. Each node runs a hypervisor and contributes storage to a distributed system (often Ceph). HCI relies on quorum to prevent split-brain, replication or erasure coding for durability, and failure domains to spread risk. The goal is to simplify operations while enabling high availability and live migration without external SANs.

HCI shifts the operational center of gravity. Instead of separate storage and compute teams, the same platform must manage both. This simplifies procurement and scaling but introduces coupling between compute load, storage load, and network health. Understanding these couplings is essential for designing reliable HCI systems.

Deep Dive into the concept HCI is a distributed systems problem in disguise. Every VM write becomes a distributed write. A storage system like Ceph stores objects across OSDs according to the CRUSH algorithm. Clients compute object placement using the cluster map, avoiding centralized metadata bottlenecks. This design scales but requires careful configuration of placement groups, replication factors, and failure domains.

Quorum is a safety mechanism. Cluster components (Corosync for Proxmox, Ceph MONs) require a majority of votes to make progress. In a 3-node cluster, losing one node still leaves quorum; losing two does not. When quorum is lost, the cluster halts VM operations to avoid split-brain writes. Some deployments add a qdevice as a tie-breaker.
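
A minimal sketch of how quorum is usually inspected on a Proxmox node, assuming stock pvecm and corosync tooling; the QNetd address in the last command is a placeholder for an external tie-breaker host.

# Show Proxmox cluster membership and whether the cluster is quorate.
$ pvecm status
# Lower-level corosync view of vote counts and quorum state.
$ corosync-quorumtool -s
# Optional tie-breaker for even-sized clusters (assumes a reachable QNetd host).
$ pvecm qdevice setup 192.168.1.50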

Replication vs erasure coding is a central trade-off. Replication is simple and fast but costs 2x-3x storage overhead. Erasure coding reduces overhead but increases CPU and network cost on writes and recovery. Many HCI systems use replication for hot data and erasure coding for cold data. Recovery traffic can saturate networks and degrade VM I/O latency.

Ceph-specific tuning is a practical reality. The number of placement groups, the replication factor, and the CRUSH failure domain hierarchy all influence recovery time and performance. If placement groups are too few, data distribution is uneven; too many, and metadata overhead grows. Backfill traffic can saturate the network, so many operators throttle recovery to protect VM latency.
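
As a concrete illustration, a small replicated RBD pool might be created and tuned like this; the pool name and numeric values are illustrative, not recommendations.

# Create a replicated pool with an explicit placement-group count (illustrative values).
$ ceph osd pool create vm-pool 128
# Keep three replicas and require at least two to acknowledge each write.
$ ceph osd pool set vm-pool size 3
$ ceph osd pool set vm-pool min_size 2
# Mark the pool for RBD use and inspect how it maps onto the CRUSH hierarchy.
$ ceph osd pool application enable vm-pool rbd
$ ceph osd tree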

Network separation is another common design choice. Many HCI deployments use dedicated networks for storage replication and for VM traffic to reduce congestion and jitter. In small labs this may be simulated with VLANs rather than physical NICs, but the principle is the same: storage traffic is bursty and can starve VM traffic if not isolated.
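
In a Proxmox lab this separation is typically declared when Ceph is initialised. A sketch, assuming a recent Proxmox VE release and assuming 10.0.0.0/24 carries VM/public traffic while 10.10.10.0/24 is a storage VLAN:

# Public (client) network vs. dedicated cluster/replication network (subnets are assumptions).
$ pveceph init --network 10.0.0.0/24 --cluster-network 10.10.10.0/24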

Operational tuning is central in HCI. Placement group counts, replication factors, and CRUSH rules determine how quickly the system can recover from failures. Backfill and recovery traffic must often be throttled to preserve VM performance. Network separation between storage and VM traffic reduces contention and makes latency more predictable. HCI is therefore as much an operations discipline as it is a technical architecture.
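
A sketch of how recovery pressure is often reduced during rebuilds; the values are deliberately conservative and illustrative.

# Limit concurrent backfills and recovery operations per OSD to protect client I/O.
$ ceph config set osd osd_max_backfills 1
$ ceph config set osd osd_recovery_max_active 1
# Watch the effect on recovery progress and overall health.
$ ceph -s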

How this fits into the project This concept drives cluster design decisions: quorum, replication, and failure domains.

Definitions & key terms

  • HCI: hyperconverged infrastructure.
  • Quorum: majority vote for safety.
  • CRUSH: Ceph data placement algorithm.
  • Failure domain: boundary for replica placement.

Mental model diagram

VM write -> RBD -> CRUSH -> OSD replicas

How it works (step-by-step, with invariants and failure modes)

  1. Client writes to RBD.
  2. CRUSH maps objects to OSDs.
  3. Replicas stored across failure domains.
  4. Quorum maintains consistent cluster map.

Invariants: quorum maintained, replicas distributed. Failure modes include split-brain and degraded PGs.
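
To make step 2 concrete, Ceph can report exactly where CRUSH places a given object; the pool and object names below are placeholders.

# Show the placement group and acting OSD set chosen by CRUSH for one object.
$ ceph osd map vm-pool rbd_data.example-object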

Minimal concrete example

HEALTH_OK with 3 MONs
1 node down -> still quorate

Common misconceptions

  • HCI always reduces cost.
  • Replication alone guarantees safety.

Check-your-understanding questions

  1. Why does quorum stop split-brain?
  2. What happens if a failure domain is misconfigured?

Check-your-understanding answers

  1. Only a majority can commit state changes.
  2. Replicas may end up on the same rack or node.

Real-world applications

  • Private cloud virtualization
  • Edge clusters

Where you’ll apply it

  • Apply in §3.2 (functional requirements) and §6.2 (critical tests)
  • Also used in: P10-mini-cloud-control-plane

References

  • Ceph architecture docs
  • Proxmox cluster manager docs

Key insights HCI turns virtualization into a distributed storage and quorum problem.

Summary You now understand why quorum, CRUSH, and failure domains define HCI behavior.

Homework/Exercises to practice the concept

  1. Explain why a 2-node cluster is unsafe without a qdevice.
  2. Draw a CRUSH hierarchy for 3 nodes.

Solutions to the homework/exercises

  1. A network partition yields split-brain with no majority.
  2. Root -> rack -> host -> disk; the sketch below shows how to inspect the live hierarchy.
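
A hedged way to compare your drawing with the cluster's actual layout; bucket names and levels depend on your installation.

# Print the CRUSH buckets (root/host/osd by default on a small cluster).
$ ceph osd crush tree
# Equivalent view including OSD up/in status and weights.
$ ceph osd tree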

2.2 Live Migration and Availability

Fundamentals Live migration moves a running VM between hosts with minimal downtime. Pre-copy iteratively copies memory while the VM runs, then pauses briefly to copy remaining dirty pages. Post-copy starts the VM on the destination earlier and fetches pages on demand, reducing downtime but increasing risk if the source fails. Migration requires device state transfer, which is straightforward for virtio but difficult for passthrough devices. Checkpoint/restore for containers shares similar concepts at process level.

Migration is a core operational tool: it enables maintenance without downtime, load balancing, and proactive failure avoidance. The correctness constraints are strict: there must be a single active VM instance, memory state must be consistent, and storage must not diverge.

Deep Dive into the concept Live migration is essentially a distributed checkpoint. A VM’s state consists of CPU registers, memory pages, and device state. In pre-copy, the source iteratively copies memory pages while the VM runs. Each round sends pages dirtied since the last iteration. If the dirty rate is lower than available bandwidth, the process converges. The final stop-and-copy phase pauses the VM, transfers remaining dirty pages and CPU state, and resumes on the destination. Downtime is proportional to the last copy phase.

Post-copy flips the trade-off: the VM is paused briefly, minimal state is transferred, and the VM resumes on the destination. Missing pages are fetched from the source on demand. This reduces downtime but increases risk; if the source fails mid-migration, the VM can crash because required pages are lost.

Dirty tracking is critical. Hypervisors mark pages as dirty using hardware bits or write-protection. This overhead can be significant for write-heavy workloads. Compression and delta encoding can reduce bandwidth but increase CPU usage. Migration also requires compatible CPU features across hosts, stable virtual device models, and shared storage. Without shared storage, block migration must copy disks, which can be slower than memory migration.
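
One practical consequence: on mixed hardware, many operators pin VMs to a lowest-common-denominator virtual CPU model so the guest sees identical CPU features on every host. A sketch using Proxmox's qm, where VM ID 100 is a placeholder:

# Use a baseline CPU model rather than "host" so features match across all nodes.
$ qm set 100 --cpu kvm64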

Failure modes are important. Pre-copy can be safely canceled; the VM continues on the source. Post-copy has a point of no return where the destination becomes the source of truth. Production systems use migration tunnels with bandwidth caps and prioritization to avoid impacting other workloads. They also require coordination services (leases, fencing) to ensure only one host considers the VM active.

Device state serialization is another bottleneck. Virtio devices are designed to be migratable, but even they require careful coordination so that queues, in-flight requests, and interrupts are transferred consistently. Passthrough devices rarely support live migration.

Migration policies must consider bandwidth and business impact. Throttling protects production traffic but lengthens migration time; aggressive migration can destabilize unrelated workloads. Most platforms allow tuning of bandwidth caps and iteration limits. Device state compatibility and CPU feature matching are also prerequisites; without them, migration may fail or produce an unstable guest.
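
A sketch of the knobs typically involved on Proxmox; the 102400 KiB/s (~100 MiB/s) value and the VM ID are placeholders, and flag availability depends on your release.

# Cluster-wide cap via /etc/pve/datacenter.cfg, e.g.:
#   bwlimit: migration=102400
# Per-migration override on the command line (recent Proxmox VE releases):
$ qm migrate 100 node2 --online --bwlimit 102400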

How this fits into the project This concept guides your live migration validation and HA testing.

Definitions & key terms

  • Pre-copy: iterative memory copying while VM runs.
  • Post-copy: resume VM then fetch pages on demand.
  • Dirty page: page modified since last copy.

Mental model diagram

run -> copy dirty -> run -> stop -> final copy -> resume

How it works (step-by-step, with invariants and failure modes)

  1. Establish migration channel.
  2. Copy memory iteratively.
  3. Pause VM and copy remaining state.
  4. Resume on destination.

Invariants: only one active VM, consistent CPU model. Failure modes include non-converging dirty rate.

Minimal concrete example

ROUND1 8GB
ROUND2 1GB
STOP 120MB, downtime 120ms
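
A rough sanity check on those numbers (assuming a 10 Gbit/s migration link, i.e. about 1.25 GB/s): 120 MB / 1.25 GB/s ≈ 96 ms of transfer time, so a measured downtime of roughly 120 ms is plausible once pause/resume overhead is included.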

Common misconceptions

  • Migration is always safe.
  • Passthrough devices migrate easily.

Check-your-understanding questions

  1. Why might pre-copy not converge?
  2. What is the risk of post-copy?

Check-your-understanding answers

  1. VM dirties memory faster than network can copy.
  2. Source failure breaks missing page retrieval.

Real-world applications

  • Host evacuation
  • Maintenance windows without downtime

Where you’ll apply it

  • Apply in §3.2 (functional requirements) and §6.2 (critical tests)
  • Also used in: P10-mini-cloud-control-plane

References

  • QEMU migration docs

Key insights Migration is bandwidth-limited and correctness depends on strict coordination.

Summary You now understand pre-copy vs post-copy and why shared storage is important.

Homework/Exercises to practice the concept

  1. Explain why shared storage simplifies migration.
  2. Measure downtime under different dirty rates.

Solutions to the homework/exercises

  1. Only memory and CPU state move; disks remain accessible.
  2. Higher dirty rates increase downtime.

3. Project Specification

3.1 What You Will Build

A 3-node Proxmox + Ceph cluster with HA and live migration.

3.2 Functional Requirements

  1. Create a 3-node cluster with quorum.
  2. Deploy Ceph MONs and OSDs.
  3. Create an RBD-backed VM.
  4. Perform live migration.

3.3 Non-Functional Requirements

  • Performance: stable storage latency under load.
  • Reliability: HA restarts VMs after node failure.
  • Usability: clear operational status outputs.

3.4 Example Usage / Output

$ pvecm status
Quorate: Yes

$ ceph -s
health: HEALTH_OK

3.5 Data Formats / Schemas / Protocols

  • Ceph CRUSH map, placement groups, MON quorum

3.6 Edge Cases

  • Node failure during migration
  • Network partition

3.7 Real World Outcome

VM migrates live and restarts on failure without data loss.

3.7.1 How to Run (Copy/Paste)

  • Configure nodes, network, and disks

3.7.2 Golden Path Demo (Deterministic)

  • Create VM -> migrate -> simulate node down

3.7.3 If CLI: exact terminal transcript

$ qm migrate 100 node2 --online
migration started

4. Solution Architecture

4.1 High-Level Design

Proxmox nodes + Ceph OSDs -> shared RBD -> VM migration

4.2 Key Components

| Component | Responsibility | Key Decisions |
|-----------|----------------|---------------|
| Ceph OSDs | Storage replicas | Replication factor |
| MON quorum | Cluster health | Odd number of MONs |
| HA manager | VM failover | Fencing policy |

4.3 Data Structures (No Full Code)

  • CRUSH map hierarchy
  • VM placement records

4.4 Algorithm Overview

  1. Establish quorum
  2. Create storage pool
  3. Deploy VM on RBD
  4. Migrate and failover (see the command sketch below)
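
A minimal end-to-end command sketch of these steps on Proxmox, assuming stock pvecm/pveceph/qm tooling; node names, IP addresses, device paths, and VM ID 100 are placeholders.

# Phase 1: form the cluster and confirm quorum.
$ pvecm create lab-cluster          # on node1
$ pvecm add 192.168.1.11            # on node2/node3, pointing at node1
$ pvecm status
# Phase 2: deploy Ceph monitors and OSDs, then a replicated pool.
$ pveceph install
$ pveceph mon create
$ pveceph osd create /dev/sdb
$ pveceph pool create vm-pool       # optionally register it as Proxmox storage
# Phase 3: place a VM on the RBD-backed storage (creation omitted), then migrate it live.
$ qm migrate 100 node2 --online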

5. Implementation Guide

5.1 Development Environment Setup

# Three nodes with virtualization support

5.2 Project Structure

infra/
├── cluster/
├── ceph/
└── docs/

5.3 The Core Question You’re Answering

“How do you build a cluster that keeps VMs running when hardware fails?”

5.4 Concepts You Must Understand First

  1. Quorum and split-brain
  2. CRUSH and replication
  3. Live migration

5.5 Questions to Guide Your Design

  1. Which network handles storage traffic?
  2. What is your failure domain layout?

5.6 Thinking Exercise

Simulate a node failure and trace data recovery.

5.7 The Interview Questions They’ll Ask

  1. “Why is quorum necessary?”
  2. “How does Ceph place replicas?”

5.8 Hints in Layers

Hint 1: Build cluster first, then add Ceph.
Hint 2: Use replication before erasure coding.
Hint 3: Pseudocode

CHECK quorum -> deploy OSDs -> create pool -> migrate VM

Hint 4: Use ceph health detail for diagnostics.

5.9 Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| Replication | “Designing Data-Intensive Applications” | Ch. 5 |
| Consensus | “Designing Data-Intensive Applications” | Ch. 9 |

5.10 Implementation Phases

  • Phase 1: Cluster and quorum
  • Phase 2: Ceph deployment
  • Phase 3: Migration and HA testing

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
|----------|---------|----------------|-----------|
| Replication factor | 2 vs 3 | 3 | Safer durability |


6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
|----------|---------|----------|
| Integration Tests | Migration | VM moves host |
| Failure Tests | HA | Node down scenario |

6.2 Critical Test Cases

  1. VM migrates live without data loss.
  2. VM restarts on another node after failure (a test sketch for both cases follows below).
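
A hedged test sketch for both cases, assuming the guest is reachable over SSH as vm100 and that VM ID 100 is (or will be) an HA resource; names and IDs are placeholders.

# Case 1: write a marker inside the guest, record its checksum, migrate, re-check.
$ ssh root@vm100 'dd if=/dev/urandom of=/root/marker bs=1M count=64 && md5sum /root/marker'
$ qm migrate 100 node2 --online
$ ssh root@vm100 'md5sum /root/marker'   # checksum must be unchanged
# Case 2: register the VM with HA, power off its node out-of-band, and watch recovery.
$ ha-manager add vm:100 --state started
$ ha-manager status                      # VM should restart on a surviving node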

6.3 Test Data

Sample VM disk with test file

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution |
|---------|---------|----------|
| No quorum | Cluster stops | Add qdevice or node |
| OSD down | HEALTH_WARN | Repair OSD |

7.2 Debugging Strategies

  • Use pvecm status and ceph -s; a fuller diagnostic sketch follows below.
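
A fuller diagnostic pass, sketched with standard Proxmox and Ceph tooling:

# Cluster membership and quorum.
$ pvecm status
# Overall Ceph health, then the detailed reasons behind any WARN/ERR state.
$ ceph -s
$ ceph health detail
# Which OSDs are up/in and how they map onto the failure-domain hierarchy.
$ ceph osd tree
# Recent cluster-stack logs on the affected node.
$ journalctl -u corosync -u pve-cluster --since "1 hour ago"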

7.3 Performance Traps

  • Recovery traffic saturating network.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add monitoring dashboards.

8.2 Intermediate Extensions

  • Add erasure coded pool.

8.3 Advanced Extensions

  • Benchmark migration under load.

9. Real-World Connections

9.1 Industry Applications

  • Enterprise private clouds
  • Proxmox, Ceph

9.3 Interview Relevance

  • Quorum, replication, migration

10. Resources

10.1 Essential Reading

  • Ceph architecture docs
  • Proxmox cluster manager docs

10.2 Video Resources

  • HCI operations talks