Project 4: Hyperconverged Home Lab with Distributed Storage
Build a 3-node hyperconverged lab using Proxmox and Ceph, enabling shared storage, HA, and live migration.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Advanced (Level 4) |
| Time Estimate | 3-4 weeks |
| Main Programming Language | Bash / YAML / Infrastructure |
| Alternative Programming Languages | Python |
| Coolness Level | Level 5: Real Infra Lab |
| Business Potential | Level 4: Infra Architect Badge |
| Prerequisites | Linux networking, storage basics, virtualization basics |
| Key Topics | quorum, fencing, Ceph CRUSH/PGs, live migration |
1. Learning Objectives
By completing this project, you will:
- Deploy a 3-node Proxmox cluster with quorum and HA.
- Configure Ceph storage with OSDs, MONs, and CRUSH rules.
- Enable shared storage for VM disks and perform live migration.
- Simulate node failure and observe HA recovery.
- Explain trade-offs between replication and erasure coding.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Quorum, Cluster Membership, and Fencing
Fundamentals
Quorum is the minimum number of nodes required for a cluster to make decisions safely. In a 3-node cluster, quorum is typically 2. If quorum is lost, the cluster stops taking actions to prevent split-brain, where multiple nodes believe they are the leader. Fencing is the mechanism that isolates or powers off a failed node to prevent it from corrupting shared data. Understanding quorum and fencing is essential because hyperconverged infrastructure depends on shared storage and coordinated VM placement; without quorum, the cluster cannot safely run VMs. In other words, quorum is the safety switch that keeps compute and storage in lockstep.
Deep Dive into the concept
In distributed systems, cluster membership defines who can make decisions. Corosync (used by Proxmox) maintains a membership list and ensures that nodes agree on cluster state. Quorum is the rule that protects against split-brain: only the partition with majority votes can continue. This is why odd numbers of nodes are preferred. A 2-node cluster is fragile because a single failure removes quorum.
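The majority rule can be pinned down in a few lines of Python (the project's listed alternative language). This is a conceptual sketch of the vote arithmetic, not Proxmox or Corosync code:

```python
def quorum_threshold(total_votes: int) -> int:
    """Minimum votes for a strict majority: floor(n/2) + 1."""
    return total_votes // 2 + 1

def is_quorate(partition_votes: int, total_votes: int) -> bool:
    """A partition may continue only if it holds a strict majority."""
    return partition_votes >= quorum_threshold(total_votes)

# 3-node cluster: quorum is 2 votes.
assert quorum_threshold(3) == 2
# Partition A | B,C: the 2-node side is quorate, the lone node is not.
assert is_quorate(2, 3) and not is_quorate(1, 3)
# A 2-node cluster is fragile: one failure leaves 1 vote, below the threshold of 2.
assert not is_quorate(1, 2)
```

The last assertion is exactly why 2-node clusters need a qdevice tie-breaker: neither surviving side can ever hold a strict majority on its own.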
Fencing (also called STONITH) ensures that a node that loses quorum is prevented from accessing shared resources. In hyperconverged systems, shared storage like Ceph is sensitive to concurrent writes; if two partitions both think they are in control, data corruption can occur. Fencing eliminates this by shutting down or isolating a node that is not in the quorum group. This can be done via power switches, IPMI, or watchdog devices.
Quorum interacts with HA. HA managers monitor nodes; if a node is down, they can restart VMs elsewhere. But they only do this when quorum is present. If the cluster loses quorum, HA stops to avoid unsafe actions. This is why quorum and fencing are a prerequisite for reliable HA.
In a home lab, you might not have enterprise fencing hardware. Proxmox supports “watchdog” fencing where a node reboots itself if it loses quorum. Another approach is to use a qdevice (quorum device) to provide a tie-breaker in 2-node clusters. In a 3-node lab, you can rely on majority quorum and avoid a qdevice. But you should understand how quorum is calculated and how the cluster behaves during partitions.
Failure modes are subtle. A node might still be running VMs but be partitioned from the cluster network. If it continues to write to shared storage, you can get split-brain. That’s why fencing is non-negotiable for production systems. Your lab should include a controlled failure experiment: simulate network partition or stop corosync, observe quorum loss, and verify that HA does not try to move VMs until quorum is restored.
One additional nuance is the human operational workflow: cluster administrators must treat quorum alerts as urgent. If you “force” quorum to restore availability, you are taking responsibility for the risk of split-brain. In production, that may be acceptable in emergencies, but for a lab project you should practice the safer approach: restore connectivity, then restore quorum. This habit trains the correct mental model of distributed safety and will transfer directly to real-world operations.
For completeness, explore the votequorum configuration. It allows you to set expected votes, add a qdevice, and tune quorum behavior during maintenance. Even if you keep defaults, knowing how votequorum works helps you reason about edge cases like rolling upgrades or temporary network partitions. It also teaches you why “quorum is lost” is not always a simple hardware failure; it can be a configuration problem, a network issue, or a time-sync problem.
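The votequorum settings live in the `quorum` section of `corosync.conf`. A minimal sketch of what that block can look like is below; the values are illustrative, and you should consult the corosync documentation before changing them on a live cluster:

```
quorum {
  provider: corosync_votequorum
  expected_votes: 3
  # two_node: 1      # special-case behavior for 2-node clusters
  # a qdevice tie-breaker, if configured, is declared in a device block
}
```

On Proxmox, this file is cluster-managed under `/etc/pve/corosync.conf`, so edits must follow the documented procedure rather than a plain text edit on one node.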
Time synchronization deserves explicit attention. Corosync assumes reasonably synchronized clocks; large drift can cause false membership changes. In a lab, configure NTP on all nodes and observe how stable membership becomes. This reinforces that cluster safety depends on mundane operational details, not just algorithms.
How this fits into the project
Quorum and fencing are tested in Section 3.7 and implemented in Section 5.10 Phases 1 and 3, when you configure cluster membership and HA.
Definitions & key terms
- Quorum -> Minimum votes required to make cluster decisions.
- Split-brain -> Two partitions both believe they are the leader.
- Fencing -> Isolation or shutdown of failed/partitioned node.
- Corosync -> Cluster communication layer used by Proxmox.
Mental model diagram (ASCII)
3-node cluster: A, B, C
Partition: A | B,C
Quorum group: B,C (2 votes)
A loses quorum -> fenced or stopped
How it works (step-by-step, with invariants and failure modes)
- Cluster maintains membership list.
- Quorum calculation determines majority partition.
- Nodes outside quorum stop HA actions.
- Fencing ensures isolated nodes do not access storage.
Failure modes: no fencing, misconfigured quorum, network partition.
Minimal concrete example
pvecm status
# look for "Quorate: Yes" and votes
Common misconceptions
- “Quorum is just a suggestion.” -> It is a safety rule; ignoring it risks data corruption.
- “Fencing is optional.” -> In shared storage, it is critical.
- “2-node clusters are safe.” -> They require a qdevice or fencing to avoid split-brain.
Check-your-understanding questions
- Why does a 3-node cluster require 2 votes for quorum?
- What happens if a node loses quorum but keeps running VMs?
- Predict the behavior when corosync is stopped on one node.
Check-your-understanding answers
- Majority quorum prevents split-brain by ensuring only one partition continues.
- It can corrupt shared storage; fencing prevents this.
- That node should lose quorum and HA should stop on it.
Real-world applications
- Enterprise clusters rely on quorum for safe failover.
- Storage systems like Ceph and etcd enforce quorum rules.
Where you’ll apply it
- This project: Section 3.7 failure simulation, Section 5.10 Phase 2.
- Also used in: P06-build-a-mini-cloud-platform-cloud-infrastructure-d.md.
References
- Proxmox cluster manager docs
- “Designing Data-Intensive Applications” (Consensus chapter)
Key insights
Quorum is the line between safe automation and catastrophic split-brain.
Summary
You now understand how quorum and fencing protect shared storage and HA.
Homework/Exercises to practice the concept
- Simulate quorum loss by stopping corosync on one node.
- Observe cluster status before and after.
Solutions to the homework/exercises
- Run `systemctl stop corosync` on one node. `pvecm status` should show `Quorate: No` on that node.
2.2 Ceph Architecture (MON, OSD, CRUSH, PGs)
Fundamentals
Ceph is a distributed storage system that provides object, block, and file storage. It relies on MONs (monitors) for cluster maps and consensus, OSDs (object storage daemons) for data storage, and CRUSH for data placement. Placement groups (PGs) group objects to distribute load and manage replication. Understanding Ceph architecture is crucial because your hyperconverged lab uses Ceph as shared VM storage. At a high level, you should be able to explain how a client writes a block and which daemons it contacts along the way. If you cannot describe that path, debugging will feel impossible. This is the storage backbone of your lab.
Deep Dive into the concept
Ceph’s core idea is to decentralize storage while keeping consistent metadata. MONs maintain the cluster maps: OSD map, MON map, CRUSH map. They use Paxos to agree on cluster state, and they require quorum to function. OSDs store actual data on disks and communicate with each other to replicate objects. A Ceph cluster is healthy only when MONs have quorum and OSDs are up.
CRUSH (Controlled Replication Under Scalable Hashing) is Ceph’s data placement algorithm. Instead of central metadata servers, CRUSH computes which OSDs should hold an object by hashing object identifiers and traversing a CRUSH map. The map encodes failure domains (host, rack, row) and replication rules. For a home lab, you can treat each node as a failure domain and replicate across nodes. The benefit is deterministic placement: clients can calculate where to read/write without querying a centralized server.
Placement groups are the bridge between objects and OSDs. Instead of mapping each object directly, Ceph maps objects to PGs, and PGs to OSDs. This reduces metadata overhead and allows rebalancing. PG count is a tuning parameter: too few PGs lead to hotspots, too many lead to management overhead. For a 3-node lab, a modest PG count is sufficient. Understanding PGs is important when you interpret ceph -s health output, which reports degraded or undersized PGs.
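The object -> PG -> OSD chain can be illustrated with a toy model. This is a deliberately simplified sketch, not real CRUSH (which walks a weighted hierarchy with configurable rules); the point is that placement is a pure computation any client can repeat:

```python
import hashlib

def object_to_pg(object_name: str, pg_count: int) -> int:
    """Hash an object name into one of pg_count placement groups."""
    h = int(hashlib.md5(object_name.encode()).hexdigest(), 16)
    return h % pg_count

def pg_to_osds(pg: int, osds_per_host: dict, replicas: int = 3) -> list:
    """Toy placement: pick one OSD from each of `replicas` distinct hosts.
    Real CRUSH is far more sophisticated; this only demonstrates determinism
    and the host failure domain."""
    hosts = sorted(osds_per_host)
    return [osds_per_host[hosts[(pg + i) % len(hosts)]][pg % 2]
            for i in range(replicas)]

# Hypothetical 3-node lab: two OSDs per host.
cluster = {"node1": [0, 1], "node2": [2, 3], "node3": [4, 5]}
pg = object_to_pg("rbd_data.vm-100.0001", 128)
placement = pg_to_osds(pg, cluster)
# Deterministic: any client computes the same placement with no lookup service.
assert placement == pg_to_osds(pg, cluster)
assert len(placement) == 3  # one replica per host
```

The `pg % 2` index assumes exactly two OSDs per host; it is part of the toy, not of Ceph.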
Ceph supports replication and erasure coding. Replication is simple: store multiple copies of each object. Erasure coding splits data into fragments and adds parity, reducing storage overhead at the cost of performance and complexity. In a small lab, replication is easier. But you should understand erasure coding to reason about trade-offs, and to design storage tiers if you expand your lab.
Ceph also provides block storage via RBD (RADOS Block Device). Each VM disk is an RBD image stored in the Ceph cluster. When you migrate a VM, you don’t move the disk; it remains in shared storage. This is what enables live migration and HA. Your lab will use RBD for VM disks, so you should understand how RBD images are created, mapped, and attached to Proxmox VMs.
It’s also worth noting that Ceph’s operational model expects you to monitor health continuously. Even in a lab, you should practice interpreting HEALTH_WARN messages and understanding their causes. For example, a “too few PGs per OSD” warning may not break the cluster, but it hints at capacity planning. These small diagnostics teach you how to reason about storage reliability, which is a key part of hyperconverged design.
Ceph also has background operations like backfill, recovery, and scrubbing. When an OSD fails, Ceph re-replicates data to restore redundancy, and this consumes network and disk bandwidth. In a lab, you can observe these operations by intentionally taking an OSD down and watching ceph -s. Understanding recovery behavior is important for sizing networks and for explaining why clusters slow down during failures.
Pool sizing is another practical lever. You choose replication factor, PG count, and pool size. For a lab, keep it simple but document the choices so you can justify them later. This connects the abstract CRUSH map to the concrete storage behavior your VMs experience.
How this fits into the project
Ceph architecture is foundational for Section 3.2 and Section 5.10 Phase 2, and for interpreting health status in Section 3.7.
Definitions & key terms
- MON -> Monitor daemon that maintains cluster state.
- OSD -> Object storage daemon that stores data on disk.
- CRUSH -> Algorithm for deterministic data placement.
- PG (Placement Group) -> Logical grouping of objects for placement.
Mental model diagram (ASCII)
Objects -> PGs -> CRUSH map -> OSDs
| |
v v
replication failure domains
How it works (step-by-step, with invariants and failure modes)
- Client computes PG for an object.
- CRUSH map determines OSD set.
- Object replicated to OSDs.
- MONs track OSD status and cluster health.
Failure modes: OSD down, MON quorum lost, PGs degraded.
Minimal concrete example
ceph -s
rbd create vm-100 --size 20G
Common misconceptions
- “Ceph needs a central metadata server.” -> CRUSH eliminates the need.
- “More PGs is always better.” -> Too many PGs increases overhead.
- “Replication and erasure coding are interchangeable.” -> They trade off performance and space.
Check-your-understanding questions
- Why does Ceph use placement groups?
- What happens when an OSD goes down?
- Predict what happens if MON quorum is lost.
Check-your-understanding answers
- PGs reduce metadata and balance load across OSDs.
- PGs become degraded; Ceph re-replicates if possible.
- Without MON quorum the cluster cannot update its maps; client I/O stalls and the cluster is effectively unavailable until quorum returns.
Real-world applications
- Enterprise storage arrays and cloud block storage often use Ceph.
Where you’ll apply it
- This project: Section 3.2, Section 3.7, Section 5.10 Phase 2.
- Also used in: P06-build-a-mini-cloud-platform-cloud-infrastructure-d.md.
References
- Ceph architecture documentation
- “Designing Data-Intensive Applications” (Replication chapter)
Key insights
Ceph scales by pushing placement logic to clients and using deterministic CRUSH rules.
Summary
You now understand the moving parts of Ceph and how they enable shared storage.
Homework/Exercises to practice the concept
- Calculate a safe PG count for a 3-node cluster.
- Create an RBD image and map it locally.
Solutions to the homework/exercises
- Use Ceph’s PG calculator; start with ~128 PGs per pool for small labs.
- Run `rbd create`, then `rbd map` on a node with the Ceph client installed.
2.3 Live Migration, HA, and Network Design
Fundamentals
Live migration moves a running VM from one host to another with minimal downtime. It relies on shared storage so the disk does not move. HA monitors VM health and restarts VMs on other nodes if a node fails. Network design separates cluster traffic from storage traffic to avoid interference. Understanding live migration and HA is essential to prove your hyperconverged lab works. The key learning is not just that migration works, but how its performance shifts under load. A good operator can predict when migration will be safe and when it will be risky. This is the difference between a lab and a production-ready mindset.
Deep Dive into the concept
Live migration typically uses pre-copy. The VM’s memory pages are copied while the VM runs; dirty pages are re-copied in successive rounds until the remaining dirty set is small, then the VM pauses briefly and completes the transfer. Post-copy migration flips the order: the VM pauses immediately, transfers minimal state, resumes on the target, and fetches pages on demand. Pre-copy is safer but may not converge if the VM dirties memory too fast.
In a hyperconverged cluster with shared storage, migration is mostly about memory and CPU state. Proxmox coordinates migration by telling QEMU to start the migration stream and by updating cluster state. If the storage is Ceph RBD, the disk is available on all nodes, so the storage component is already “migrated.” This is why shared storage is critical for live migration.
HA is built on top of cluster membership. Proxmox’s HA manager watches node health and VM state. If a node fails and quorum remains, HA restarts VMs on other nodes that have access to shared storage. It uses fencing and resource locks to ensure VMs do not run simultaneously on multiple nodes. Understanding this flow lets you design failure tests and interpret outcomes.
Network design matters because migration and storage traffic can be heavy. Ceph replication uses the storage network; migration uses the cluster network. If both share the same NIC, contention can cause migration to fail or storage to lag. In a lab, you might use VLANs or separate physical NICs if available. The key is to measure and observe: run a migration while running IO and observe latency.
HA and migration also create new failure modes. If a migration fails mid-way, the VM may remain on the source or crash. If the cluster loses quorum during a migration, the process should stop. These failure modes should be part of your testing. Understanding the sequence of events helps you diagnose and recover.
Another subtlety is CPU compatibility. Live migration requires that the target node exposes a compatible CPU feature set. Proxmox can mask CPU features to ensure compatibility, but this might reduce performance. In a lab, you can test this by configuring a conservative CPU model and confirming migration works reliably. This highlights a real-world trade-off between portability and performance, and it explains why many clusters standardize hardware.
Migration performance is tunable. QEMU supports compression, bandwidth limits, and downtime thresholds. In a lab, you can experiment with these knobs to see how they affect migration time and guest pause duration. This makes the project more than a checkbox: you will learn how operators balance user experience (low downtime) against network load and cluster stability.
Also consider storage bandwidth: if Ceph is busy recovering or scrubbing, migration traffic competes with storage I/O. In practice, operators schedule migrations during low-traffic windows to avoid cascading slowdowns. You can simulate this by running a disk benchmark inside a VM during migration and observing latency. This teaches the operational reality that performance is a shared resource across compute and storage.
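A naive single-pass estimate shows how much competing traffic stretches a migration. The numbers below are illustrative assumptions, not measurements:

```python
def migration_seconds(memory_gb, link_gb_per_s, competing_gb_per_s=0.0):
    """Naive one-pass estimate: memory divided by the bandwidth left over
    after competing storage traffic. Ignores dirty-page re-copies."""
    available = link_gb_per_s - competing_gb_per_s
    if available <= 0:
        return float("inf")  # migration starves entirely
    return memory_gb / available

idle = migration_seconds(16, link_gb_per_s=1.25)                     # ~12.8 s
busy = migration_seconds(16, link_gb_per_s=1.25, competing_gb_per_s=1.0)
assert busy > idle  # concurrent Ceph recovery slows the same migration
```

This is the quantitative version of the advice above: measure available bandwidth before migrating, and prefer separate NICs or VLANs for storage and cluster traffic.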
How this fits into the project
Live migration and HA are validated in Section 3.7 and configured in Section 5.10 Phase 3.
Definitions & key terms
- Live migration -> Moving a running VM with minimal downtime.
- Pre-copy -> Migration method that copies memory while VM runs.
- Post-copy -> Migration method that fetches pages on demand after switching.
- HA manager -> Cluster service that restarts VMs on failure.
Mental model diagram (ASCII)
VM on Node A
| pre-copy memory -> Node B
| final switchover
v
VM continues on Node B
How it works (step-by-step, with invariants and failure modes)
- Start migration, copy memory pages.
- Iterate until dirty set is small.
- Pause VM, transfer remaining state, resume on target.
- Update cluster state to new node.
Failure modes: migration never converges, storage latency, quorum loss.
Minimal concrete example
qm migrate 100 node2 --online
Common misconceptions
- “Live migration means zero downtime.” -> There is always a small pause.
- “Storage is migrated too.” -> Shared storage is required; disks are not copied.
- “HA always restarts immediately.” -> It requires quorum and fencing.
Check-your-understanding questions
- Why does live migration require shared storage?
- What happens if memory dirty rate is too high?
- Predict the effect of running migration on a congested network.
Check-your-understanding answers
- Without shared storage, disk data would need to be moved, increasing downtime.
- Pre-copy may not converge; migration can fail.
- Migration time increases and may time out.
Real-world applications
- Cloud providers migrate VMs for maintenance.
- HA clusters keep critical services running after host failure.
Where you’ll apply it
- This project: Section 3.7, Section 5.10 Phase 3.
- Also used in: P06-build-a-mini-cloud-platform-cloud-infrastructure-d.md.
References
- QEMU migration documentation
- Proxmox HA guide
Key insights
Live migration and HA turn virtualization into a resilient service, not just a VM host.
Summary
You now understand how memory migration, HA, and network design enable resilient clusters.
Homework/Exercises to practice the concept
- Run a live migration and measure downtime.
- Saturate the network and observe migration behavior.
Solutions to the homework/exercises
- Use `qm migrate` and observe the log timestamps.
- Use `iperf3` to load the network and compare migration time.
2.4 Replication, Erasure Coding, and Failure Domains
Fundamentals
Distributed storage is only useful if it keeps data safe during failures. The two core durability strategies are replication (store full copies on multiple nodes) and erasure coding (store data + parity chunks across many nodes). Replication is simpler and faster to rebuild but uses more space. Erasure coding is space-efficient but introduces higher write amplification and rebuild complexity. Failure domains define how Ceph places data so that a single disk, host, or rack failure does not lose all copies. In a hyperconverged cluster, understanding these trade-offs is essential: your storage design directly affects VM performance and recovery time when a node fails.
Deep Dive into the concept
Ceph uses the CRUSH algorithm to map objects to placement groups (PGs) and then to OSDs. The CRUSH rule encodes failure domains and replication factors. For a small 3-node cluster, a typical rule might be “replicate 3 copies across distinct hosts.” That means if one disk or one host fails, the cluster still has two copies. The cost is 3x raw storage. For lab environments, that is acceptable and simpler to reason about.
Erasure coding changes the equation. Instead of full copies, the object is split into k data chunks and m parity chunks (often noted as k+m). Any k chunks can reconstruct the data. For example, a 4+2 erasure code stores 4 data chunks and 2 parity chunks, allowing up to 2 failures while using 1.5x storage instead of 3x. The downside is that small random writes become expensive, because erasure coding requires reading and recomputing parity across multiple chunks (read-modify-write). This can reduce VM performance, especially for databases or workloads with small write sizes.
Failure domains matter even more with erasure coding. If you choose a failure domain of “host” and you have 3 hosts, a 4+2 profile may not be possible because you need 6 distinct failure domains for placement. In small clusters, replication is often the only viable option. As you scale to more nodes, erasure coding becomes attractive for bulk storage or backup pools, while replicated pools remain better for latency-sensitive VM disks.
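The feasibility constraint is mechanical enough to encode. A sketch, assuming one chunk per failure domain (the usual host-level CRUSH rule):

```python
def ec_profile_fits(k: int, m: int, failure_domains: int) -> bool:
    """A k+m profile needs k+m distinct failure domains (e.g. hosts) so
    that no single domain holds more than one chunk."""
    return k + m <= failure_domains

def survivable_failures(m: int) -> int:
    """An EC pool tolerates up to m lost chunks."""
    return m

# 4+2 needs 6 hosts, so it does not fit a 3-node lab.
assert not ec_profile_fits(4, 2, failure_domains=3)
# 3x replication across 3 hosts does fit (one copy per host).
assert ec_profile_fits(1, 2, failure_domains=3)
assert survivable_failures(2) == 2
```

This is why the recommendation for this project is a replicated pool: with only 3 hosts, no useful erasure-code profile has enough failure domains.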
In Ceph, pools can be either replicated or erasure-coded. You might place VM disks (RBD) in a replicated pool and backups in an erasure-coded pool. You can also tune the number of PGs based on cluster size (for a small lab, 128 or 256 PGs per pool is typical). The key is to align the placement strategy with your failure model. If you tolerate one node failure, replicate 3 copies across hosts. If you tolerate two failures, you need at least 5 nodes (or an erasure-coded profile with enough domains).
Another subtlety is recovery behavior. After a failure, Ceph must backfill data to restore redundancy. Replicated pools perform full object copies, which are straightforward. Erasure-coded pools must reconstruct missing chunks by reading the remaining k chunks and recomputing parity. This is CPU-intensive and network-heavy. In a hyperconverged cluster where VMs and storage share hardware, heavy recovery can impact VM performance, so you must balance recovery speed vs. workload impact. Ceph provides tuning knobs like osd_recovery_max_active and osd_recovery_sleep to throttle recovery.
Finally, think about failure domains beyond storage: a power outage, a top-of-rack switch failure, or a host kernel panic. If all replicas are on the same rack or share a network link, you can lose all copies even if the disks are fine. In a home lab, you may not have multiple racks, but you still can model failure domains as “host” or “disk” to capture the most common failures. This mindset prepares you for real-world HCI environments where failure domains are the heart of reliability engineering.
How this fits into the project
This concept directly affects Section 3.2 functional requirements (pool design), Section 3.5 data formats (Ceph pool configuration), and Section 5.10 Phase 2, where you choose replication vs erasure coding.
Definitions & key terms
- Replication -> Storing full copies of data on multiple OSDs.
- Erasure coding -> Storing data + parity chunks to tolerate failures with less overhead.
- Failure domain -> A boundary (disk/host/rack) across which replicas must be placed.
- Placement Group (PG) -> A logical shard of data used by Ceph for placement.
Mental model diagram (ASCII)
Replicated (size=3)
Object -> OSD A, OSD B, OSD C
Erasure-coded (4+2)
Object -> D1 D2 D3 D4 P1 P2
How it works (step-by-step, with invariants and failure modes)
- Choose a failure domain (host or rack) for your CRUSH rule.
- Define a pool as replicated or erasure-coded.
- Write objects; CRUSH maps them to OSDs across domains.
- On failure, Ceph rebalances to restore redundancy.
- Invariant: at least `min_size` replicas (typically `size - 1`), or `k` chunks for an erasure-coded pool, must remain for I/O to proceed.
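The failure-domain invariant in the steps above can be checked mechanically. The input shapes below are hypothetical (meant to mirror the kind of PG-to-OSD listing you can extract from `ceph pg dump`), but the check itself is the real invariant:

```python
def placement_respects_hosts(pg_to_osds: dict, osd_host: dict) -> bool:
    """Verify that every PG's replicas sit on distinct hosts."""
    for pg, osds in pg_to_osds.items():
        hosts = [osd_host[o] for o in osds]
        if len(set(hosts)) != len(hosts):
            return False  # two replicas share a failure domain
    return True

# Hypothetical OSD-to-host mapping for a 3-node lab.
osd_host = {0: "node1", 1: "node1", 2: "node2", 3: "node3"}
good = {"1.0": [0, 2, 3], "1.1": [1, 2, 3]}
bad = {"1.0": [0, 1, 2]}  # two replicas both on node1
assert placement_respects_hosts(good, osd_host)
assert not placement_respects_hosts(bad, osd_host)
```

If this check ever fails on real dump output, the CRUSH rule's failure domain is set to `osd` rather than `host`, and a single host loss can take multiple replicas with it.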
Minimal concrete example
# Replicated pool for VM disks
ceph osd pool create vm-pool 128
ceph osd pool set vm-pool size 3
ceph osd pool set vm-pool min_size 2
# Example erasure-coded profile and pool
ceph osd erasure-code-profile set ec-4-2 k=4 m=2
ceph osd pool create backup-pool 64 64 erasure ec-4-2
Common misconceptions
- “Erasure coding is always better.” -> It saves space but can hurt write latency for VM disks.
- “Replication wastes too much.” -> For small clusters, replication is often the only safe option.
- “Failure domain doesn’t matter in a home lab.” -> Host-level failures are common and must be modeled.
Check-your-understanding questions
- Why is a 3-node cluster better suited for replication than erasure coding?
- Predict the storage overhead of a `4+2` erasure-coded pool.
- What happens if your failure domain is “host” but two OSDs on the same host hold replicas?
- Why does erasure coding increase write amplification?
Check-your-understanding answers
- Erasure coding requires more distinct failure domains; 3 nodes can only host 3 replicas safely.
- 1.5x (6 chunks for 4 data chunks).
- A single host failure could remove multiple replicas and risk data loss.
- Parity must be recalculated and written across multiple chunks for each write.
Real-world applications
- Public clouds often use replicated storage for VM disks and erasure coding for backups.
- Ceph clusters tune CRUSH rules to align with rack or availability-zone failure domains.
Where you’ll apply it
- This project: Section 3.2 Functional Requirements, Section 3.5 Data Formats, Section 5.10 Phase 2.
- Also used in: P06-build-a-mini-cloud-platform-cloud-infrastructure-d.md for storage policy decisions.
References
- Ceph documentation on replicated and erasure-coded pools
- “Designing Data-Intensive Applications” (replication and fault tolerance chapters)
Key insights
Durability is a design choice: you trade capacity for resilience, and recovery cost for simplicity.
Summary
You now understand how replication, erasure coding, and failure domains shape Ceph pool design.
Homework/Exercises to practice the concept
- Calculate the usable capacity of a 3-node cluster with 3x replication and 4 TB disks.
- Design a CRUSH rule that spreads replicas across hosts and verify it with `ceph osd crush rule dump`.
Solutions to the homework/exercises
- Usable capacity is roughly 1/3 of raw capacity (e.g., 12 TB raw -> ~4 TB usable).
- Create a rule with `type host` and confirm replicas map to distinct hosts in the rule output.
3. Project Specification
3.1 What You Will Build
A 3-node hyperconverged lab that:
- Uses Proxmox for virtualization and cluster management.
- Uses Ceph for shared storage.
- Supports live migration and HA.
- Demonstrates failure recovery.
Included: cluster creation, Ceph deployment, VM creation, migration, HA tests. Excluded: production-grade monitoring, multi-site replication.
3.2 Functional Requirements
- Cluster Setup: 3 Proxmox nodes with quorum.
- Ceph Storage: MONs and OSDs on each node.
- Shared Storage: RBD pool for VM disks.
- Live Migration: migrate at least one running VM.
- HA: restart VM on another node after simulated failure.
3.3 Non-Functional Requirements
- Performance: basic VM workloads run without severe latency.
- Reliability: cluster remains stable under node failure.
- Usability: reproducible setup steps.
3.4 Example Usage / Output
$ pvecm status
Quorate: Yes
Nodes: 3
$ ceph -s
health: HEALTH_OK
osd: 6 up, 6 in
3.5 Data Formats / Schemas / Protocols
- Proxmox cluster configs (`/etc/pve/corosync.conf`).
- Ceph configs (`ceph.conf`, CRUSH map).
3.6 Edge Cases
- One node down -> cluster stays quorate with 2 of 3 votes; a second failure loses quorum.
- OSD down -> degraded PGs.
- Migration fails due to CPU mismatch.
3.7 Real World Outcome
You will have a working hyperconverged cluster where VMs can move between nodes and restart after failure.
3.7.1 How to Run (Copy/Paste)
pvecm status
ceph -s
qm migrate 100 node2 --online
3.7.2 Golden Path Demo (Deterministic)
- Migrate a VM while running a ping inside it; pings should show minimal loss.
3.7.3 CLI Transcript (Success + Failure)
$ qm migrate 100 node2 --online
migration started
[exit] code=0
$ pvecm status
Quorate: Yes
$ systemctl stop corosync
$ pvecm status
Quorate: No
[exit] code=1
Exit codes:
- `0` success
- `1` quorum lost or HA blocked
4. Solution Architecture
4.1 High-Level Design
[Node1] [Node2] [Node3]
| | |
+-- Ceph OSDs --+
+-- Corosync ---+
+-- Proxmox HA -+
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Proxmox cluster | VM management | 3-node quorum |
| Ceph storage | shared RBD disks | replication factor 3 |
| HA manager | VM failover | fencing via watchdog |
4.3 Data Structures (No Full Code)
- Cluster config files.
- Ceph CRUSH map.
4.4 Algorithm Overview
Key Algorithm: HA Failover
- Detect node failure.
- Verify quorum.
- Fence failed node.
- Restart VM on healthy node.
Complexity Analysis:
- Time: O(seconds to detect + migrate)
- Space: O(replication factor)
5. Implementation Guide
5.1 Development Environment Setup
# Install Proxmox on 3 nodes
# Configure networking and time sync
5.2 Project Structure
lab/
+-- node1/
+-- node2/
+-- node3/
+-- docs/
5.3 The Core Question You’re Answering
“How do you build a cluster that keeps VMs running when hardware fails?”
5.4 Concepts You Must Understand First
- Quorum and fencing
- Ceph CRUSH and PGs
- Live migration mechanics
5.5 Questions to Guide Your Design
- Which networks handle cluster vs storage traffic?
- How will you test split-brain prevention?
- What is your failure recovery procedure?
5.6 Thinking Exercise
Sketch a failure scenario: Node1 fails during migration. What happens?
5.7 The Interview Questions They’ll Ask
- Why is quorum required for HA?
- How does Ceph place data?
- What is the difference between replication and erasure coding?
5.8 Hints in Layers
Hint 1: Start with the cluster only, no Ceph.
Hint 2: Add Ceph MONs, then OSDs.
Hint 3: Test migration after storage is healthy.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Distributed systems | DDIA | Ch. 9 |
| Replication | DDIA | Ch. 5 |
| Virtual Machines | OS Concepts | Ch. 16 |
5.10 Implementation Phases
Phase 1: Foundation (1 week)
Goals: Proxmox cluster with quorum.
Tasks: Install nodes, configure corosync.
Checkpoint: pvecm status shows quorum.
Phase 2: Core Functionality (1-2 weeks)
Goals: Ceph deployment.
Tasks: Create MONs, OSDs, pools.
Checkpoint: ceph -s health OK.
Phase 3: Polish & Edge Cases (1 week)
Goals: HA and migration tests.
Tasks: Enable HA manager, test failover.
Checkpoint: VM restarts after node failure.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| OSD layout | 1 disk per OSD | yes | simple and clear |
| Replication | 3x vs EC | 3x | better for small lab |
| Network | shared vs separate | separate if possible | avoid contention |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Integration | cluster health | pvecm status |
| Storage | Ceph health | ceph -s |
| Failover | HA recovery | stop node |
6.2 Critical Test Cases
- Cluster has quorum with 3 nodes.
- Ceph health is OK.
- Live migration succeeds.
6.3 Test Data
VM 100, storage pool "rbd"
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| No quorum | HA blocked | add node or qdevice |
| PGs degraded | ceph WARN | fix OSD failures |
| Migration fails | CPU mismatch | align CPU types |
7.2 Debugging Strategies
- Use `ceph health detail` for storage errors.
- Inspect `/var/log/pve/` for HA issues.
7.3 Performance Traps
- Running migration during heavy IO can stall.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add Grafana dashboards for Ceph.
- Document a disaster recovery plan.
8.2 Intermediate Extensions
- Add erasure-coded pool.
- Configure a qdevice for quorum.
8.3 Advanced Extensions
- Build a stretched cluster across two racks.
- Test WAN replication.
9. Real-World Connections
9.1 Industry Applications
- HCI platforms like Nutanix and VMware vSAN.
- Private cloud infrastructure.
9.2 Related Open Source Projects
- Proxmox: hypervisor management.
- Ceph: distributed storage.
9.3 Interview Relevance
- Quorum, replication, and HA are common infra interview topics.
10. Resources
10.1 Essential Reading
- Ceph architecture docs
- Proxmox cluster and HA docs
10.2 Video Resources
- Ceph deep dive talks
10.3 Tools & Documentation
- `pvecm`, `ceph`, `qm` CLI tools
10.4 Related Projects in This Series
- P02-build-your-own-vagrant-clone-devops-infrastructure.md
- P06-build-a-mini-cloud-platform-cloud-infrastructure-d.md
11. Self-Assessment Checklist
11.1 Understanding
- I can explain quorum and fencing.
- I can describe Ceph CRUSH and PGs.
- I understand live migration steps.
11.2 Implementation
- Cluster is healthy.
- Ceph is healthy.
- HA failover works.
11.3 Growth
- I can explain this lab in an interview.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Cluster up with Ceph and one VM.
- Live migration works.
Full Completion:
- HA restart after node failure.
- Documented recovery steps.
Excellence (Going Above & Beyond):
- Erasure coding pool and performance analysis.