🚀 Proxmox – P25 Ceph HA Cluster: Replace Failed Node on Proxmox (Full Demo)
🔎 Introduction
In this tutorial, we demonstrate how to replace a failed node in a Proxmox Ceph High Availability (HA) cluster step by step. When a Proxmox node crashes or becomes unreachable, your Ceph cluster may enter a degraded state. However, thanks to Ceph replication and HA mechanisms, your virtual machines can continue running without downtime — if the cluster is properly configured.
This guide shows you how to:
Safely remove a dead Proxmox node
Cleanly remove MON and OSD services from Ceph
Update the CRUSH map properly
Add a replacement node into the cluster
Reinstall Ceph services on the new node
Rebalance data automatically
Restore HA functionality
This full demo is ideal for IT professionals managing production environments and home lab enthusiasts learning Proxmox VE 9 with Ceph HA.
🧪 5. Simulate a Dead Node
Due to limited lab equipment, the VMs used for this demo run slowly, so the focus here is on explaining each replacement step clearly.
In real-world production environments, physical Proxmox servers will perform significantly faster.
⚠️ 5.1 Symptoms
When a node fails (example: pve01zfs):
Multiple OSDs appear down
Ceph reports OSDs as down/out
If replication factor is sufficient (e.g. 3), VMs continue running on remaining nodes
Cluster health becomes degraded
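Before removing anything, it helps to confirm which OSDs the dead node owned. The sketch below filters the down OSDs out of `ceph osd tree` output with `awk`; the `sample` variable is an illustrative stand-in for a live cluster (the hostnames and IDs are this lab's values), and on a real node you would pipe the actual command instead.

```shell
# Illustrative `ceph osd tree` output after pve01zfs fails; on a live
# node, pipe the real command instead:
#   ceph osd tree | awk '$4 ~ /^osd\./ && $5 == "down" {print $4}'
sample='ID  CLASS  WEIGHT   TYPE NAME          STATUS
-1         0.05878  root default
-3         0.01959      host pve01zfs
 0    hdd  0.00980          osd.0         down
 1    hdd  0.00980          osd.1         down
-5         0.01959      host pve02zfs
 2    hdd  0.00980          osd.2         up'
# Keep only rows whose 4th field is an OSD name and whose status is "down"
down_osds=$(printf '%s\n' "$sample" | awk '$4 ~ /^osd\./ && $5 == "down" {print $4}')
echo "$down_osds"
```

The resulting list (here `osd.0` and `osd.1`) is exactly the set of OSDs you will clean up in Step 3.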
🛠 5.2 Troubleshooting Procedure
🔹 Step 1: Delete the Dead Node
Remove the dead node from the cluster (run this on a surviving node):
pvecm delnode pve01zfs
Delete leftover configuration files:
rm -rf /etc/pve/nodes/pve01zfs
🔹 Step 2: Remove MON pve01zfs from Ceph
The MON daemon on the dead node is already offline, so it only needs to be removed from the monitor map:
ceph mon remove pve01zfs
This completely removes the MON service from the Ceph cluster.
🔹 Step 3: Delete the OSDs on pve01zfs
Identify OSD IDs:
ceph osd tree
Example:
osd.0
osd.1
Mark OSD as down:
ceph osd down osd.0
ceph osd down osd.1
Mark OSD as out:
ceph osd out osd.0
ceph osd out osd.1
Remove from CRUSH map:
ceph osd crush remove osd.0
ceph osd crush remove osd.1
Remove authentication:
ceph auth del osd.0
ceph auth del osd.1
Remove OSD completely:
ceph osd rm osd.0
ceph osd rm osd.1
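The five commands above repeat once per OSD, so they can be collapsed into a single loop. This is a dry-run sketch: it only prints the commands, and you remove the leading `echo` to execute them for real. The IDs 0 and 1 are this lab's values from `ceph osd tree`; substitute your own.

```shell
# Dry run: prints each removal command in the correct order instead of
# executing it. Remove the leading "echo" to run for real; OSD IDs 0
# and 1 are this lab's values from `ceph osd tree`.
osd_plan=$(
  for id in 0 1; do
    echo ceph osd down "osd.${id}"
    echo ceph osd out "osd.${id}"
    echo ceph osd crush remove "osd.${id}"
    echo ceph auth del "osd.${id}"
    echo ceph osd rm "osd.${id}"
  done
)
printf '%s\n' "$osd_plan"
```

Reviewing the printed plan before executing it is a cheap safeguard against removing the wrong OSD on a production cluster.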
🔹 Step 4: Remove Host from CRUSH Map
Remove the dead host's bucket from the CRUSH map with a single command:
ceph osd crush remove pve01zfs
If unsure about the hostname:
ceph osd tree
Then restart Ceph services on remaining nodes.
Ceph will redistribute data to remaining OSDs.
Rebalancing speed depends on disk performance and network bandwidth.
(Lab environment will be slower.)
🆕 Step 5: Prepare Replacement Node (pve04zfs)
Edit the disk serial configuration of the replacement VM (VM 105 in this lab):
nano /etc/pve/qemu-server/105.conf
serial=DISK07
serial=DISK08
Disable the enterprise repository.
Set an IP address in the same subnet as pve02 and pve03.
Update hosts file:
nano /etc/hosts
192.168.16.201 pve02zfs.tsf.id.vn pve02zfs
192.168.16.202 pve03zfs.tsf.id.vn pve03zfs
Check disks:
lsblk
ls -l /dev/disk/by-id/
🔗 Step 6: Join pve04 into Cluster
On the new node (pve04), join the cluster through an existing member:
pvecm add pve02zfs.tsf.id.vn
💾 Step 7: Install Ceph on New Node (pve04)
From GUI (Node pve04):
Ceph → Install Ceph
Select same Ceph version
Reboot if required
Then add services:
➤ Add MON + MGR
Ceph → Monitor → Add
Ceph → Manager → Add
➤ Add OSD
Ceph → OSD → Create OSD
Select /dev/sdb or another empty disk; repeat for each disk as needed.
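For reference, the same three additions can also be made from the shell with the `pveceph` wrapper that ships with Proxmox VE. This is a hedged dry-run sketch: it only prints the commands, and `/dev/sdb` is this lab's empty disk, so check `lsblk` and substitute your own device before running them (without the `echo` prefix) on pve04.

```shell
# Dry run: prints the pveceph commands instead of executing them.
# Drop the "echo" prefix and run on pve04 itself; /dev/sdb is this
# lab's empty disk, so substitute your own device from lsblk.
pveceph_plan=$(
  echo pveceph mon create          # add a MON on this node
  echo pveceph mgr create          # add a MGR on this node
  echo pveceph osd create /dev/sdb # create an OSD on the empty disk
)
printf '%s\n' "$pveceph_plan"
```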
⚖️ Step 8: Rebalance Ceph
When the new node joins, Ceph automatically rebalances data.
Check cluster status:
ceph -s
Healthy state:
HEALTH_OK
Note:
In small lab environments, you may see:
slow IO warnings
BlueStore slow operations
Data redistribution takes time depending on disk speed.
The amount of degraded data will gradually decrease until all placement groups return to active+clean.
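Recovery progress can be followed by re-running `ceph -s` (or `watch -n 10 ceph -s`). The sketch below shows one way to count the clean PG groups in that output; the `sample` variable is an illustrative stand-in for a live cluster's status section.

```shell
# Illustrative `pgs` section from `ceph -s` during a rebalance; on a
# live node run the real pipeline instead:
#   ceph -s | grep -c 'active+clean$'
sample='  pgs:     8.462% pgs degraded
             57 active+clean
             8  active+recovery_wait+degraded'
# Count lines that end in exactly "active+clean" (i.e. fully clean PGs)
clean_lines=$(printf '%s\n' "$sample" | grep -c 'active+clean$')
echo "$clean_lines"
```

When the degraded percentage disappears and every PG line reads active+clean, the cluster is back to HEALTH_OK.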
🏷 Step 9: Add New Node to HA Group
Navigate:
Datacenter → HA → Groups → Select Group → Add pve04
Now HA can use the new node for failover operations.
✅ Final Thoughts
Replacing a failed node in a Proxmox Ceph HA cluster requires proper order:
Remove node from cluster
Clean MON & OSD services
Update CRUSH map
Add replacement node
Reinstall Ceph
Allow automatic rebalancing
Reconfigure HA
By following best practices, you can maintain data integrity, minimize downtime, and ensure business continuity in both production and lab environments.
This tutorial demonstrates how Ceph replication and Proxmox HA work together to provide true high availability infrastructure.