
Ceph HA Cluster: Replace Failed Node on Proxmox

In this video, we demonstrate how to replace a failed node in a Proxmox Ceph High Availability cluster. Follow step-by-step instructions to safely remove the failed server and integrate a new node without disrupting your VMs. Learn how Ceph ensures data integrity and automatic replication across the cluster. This tutorial covers cluster health checks, node replacement procedures, and real-time failover monitoring. Perfect for IT professionals managing production environments and home lab enthusiasts. See how to maintain HA and minimize downtime in critical infrastructures. Understand the key best practices for managing Ceph clusters in Proxmox PVE 9. Watch this full demo to confidently handle node failures and keep your virtual machines running smoothly.

5. Simulate a Dead Node

Due to limited hardware, the VMs I built for this lab run slowly and not entirely smoothly.
I mainly demonstrate each step of replacing the faulty node.
In a real environment, the Proxmox node servers will perform much faster.

5.1. Symptoms


• Multiple OSDs on the failed node go down
• Ceph reports those OSDs as down/out
• If the replication factor is sufficient (e.g. 3), VMs keep running on the remaining nodes
• In this lab, we simulate pve01zfs going down
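
From any surviving node, the same picture can be confirmed on the CLI (output will vary per cluster):
ceph -s              # overall health; expect HEALTH_WARN while OSDs are down
ceph osd tree        # shows which OSDs on the failed host are down
ceph health detail   # lists the affected OSDs and degraded PGs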

5.2. Troubleshooting procedure

Step 1: Delete the dead node

pvecm delnode pve01zfs

Also delete the dead node's directory from the cluster filesystem:
rm -rf /etc/pve/nodes/pve01zfs
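
You can confirm the node is gone from the cluster view, for example:
pvecm nodes
ls /etc/pve/nodes/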

Step 2: Remove MON pve01zfs from Ceph

First, mark the monitor on pve01zfs as down:
ceph mon down pve01zfs

Then remove the monitor:
ceph mon remove pve01zfs

This command completely removes the monitor from the Ceph cluster.
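
To check that only the monitors on pve02zfs and pve03zfs remain, you can inspect the monitor map, for example:
ceph mon stat   # one-line quorum summary
ceph mon dump   # full monitor map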

Step 3: Delete the OSDs of pve01zfs

Identify the dead OSDs via the command line or the GUI:

ceph osd tree

=> In this lab: osd.0 and osd.1

Mark OSD as down
ceph osd down osd.0
ceph osd down osd.1

Mark OSD as out
ceph osd out osd.0
ceph osd out osd.1

Remove OSD from CRUSH map
ceph osd crush remove osd.0
ceph osd crush remove osd.1

Delete the OSD auth keys, then remove the OSDs from Ceph
ceph auth del osd.0
ceph auth del osd.1

ceph osd rm osd.0
ceph osd rm osd.1
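
On recent Ceph releases there is also a shortcut: ceph osd purge combines the CRUSH removal, auth deletion and OSD removal above into a single command per OSD (shown here as an alternative to the individual steps in this demo):
ceph osd purge osd.0 --yes-i-really-mean-it
ceph osd purge osd.1 --yes-i-really-mean-it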

Step 4: Remove host pve01zfs from CRUSH map

Run a single command:
ceph osd crush remove pve01zfs

If the name is not pve01zfs, run this command to see the correct name:
ceph osd tree

Then remove the entry using the correct host name, and restart the three cluster services on the remaining two nodes.

Ceph is now redistributing data across the remaining 4 OSDs. How long this takes depends on the node hardware; the lab I built is definitely slow 🙂
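
The data movement can be followed live from any surviving node, for example:
ceph -s   # overall health plus recovery/rebalance progress
ceph -w   # live event stream; press Ctrl+C to stop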

Step 5: Prepare the replacement PVE node (pve04zfs)

For the lab, add serial numbers to the replacement VM's data disks in its config so they show up later under /dev/disk/by-id/:

nano /etc/pve/qemu-server/105.conf
serial=DISK07
serial=DISK08


Disable the enterprise repositories.
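
A minimal sketch of what that looks like on a stock install (the file names below are the PVE 8 defaults; PVE 9 ships deb822 .sources files instead, so adjust the paths to your release):
sed -i 's/^deb/#deb/' /etc/apt/sources.list.d/pve-enterprise.list   # comment out the PVE enterprise repo
sed -i 's/^deb/#deb/' /etc/apt/sources.list.d/ceph.list             # comment out the Ceph enterprise repo
apt update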

Set the IP address in the same subnet as pve02/pve03, then add host entries for the other two nodes:

nano /etc/hosts
192.168.16.201 pve02zfs.tsf.id.vn pve02zfs
192.168.16.202 pve03zfs.tsf.id.vn pve03zfs

Check the disks:
lsblk
ls -l /dev/disk/by-id/

Step 6: Join pve04 to the cluster

On pve04zfs, run:
pvecm add pve02zfs.tsf.id.vn
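
Afterwards, check that the cluster is quorate and sees three nodes again:
pvecm status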

Step 7: Install Ceph on the new node (pve04)


In the GUI, on node pve04:
• Ceph → Install Ceph
• Select the same version as the other 2 nodes
• Restart the node if required

Then:
Add MON + MGR
1. Ceph → Monitor → Add
2. Ceph → Manager → Add

Add OSDs from the empty disks
• Ceph → OSD → Create OSD
• Select /dev/sdb or another empty disk
Repeat until pve04 has as many OSDs as you want.
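
If you prefer the shell over the GUI, the same step can be done with the pveceph tooling (a sketch; the device names are examples, adjust them to your disks):
pveceph install               # installs the Ceph packages; pick the same release as the other nodes
pveceph mon create            # create the monitor on pve04
pveceph mgr create            # create the manager on pve04
pveceph osd create /dev/sdb   # one OSD per empty disk
pveceph osd create /dev/sdc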

Step 8: Rebalance Ceph


When the new node joins, Ceph automatically rebalances the data.

Check:
ceph -s

A healthy status is HEALTH_OK. (However, because my lab is built on a small server simulating 3 PVE nodes, Ceph reports slow operations in BlueStore, so expect a slow-I/O warning.)

After adding the new node, the rebalance takes a while as Ceph writes data to the new disks.

The "Degraded data redundancy" warning gradually shrinks until all placement groups are active+clean.
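
The same progress can be watched from the shell, for example:
ceph pg stat       # one-line PG summary, e.g. how many PGs are already active+clean
ceph osd df tree   # shows data gradually filling the new node's OSDs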

_________________________________________


Step 9: Add the new node to the HA group


Datacenter → HA → Groups → select group → Add pve04
From this point on, HA can use the new node to run VMs during failover.
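
On releases that still manage HA groups, the same change can be scripted with ha-manager (the group name "ha-group1" below is only an example, use your own):
ha-manager groupconfig                                              # list the existing HA groups
ha-manager groupset ha-group1 --nodes pve02zfs,pve03zfs,pve04zfs    # set the node list so it includes pve04zfs
ha-manager status                                                   # verify the HA stack sees the new node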