🚀 Proxmox – P26 Ceph HA Cluster: Replace Failed Disk on Proxmox (Full Demo)
🔎 Introduction
In this tutorial, we demonstrate how to safely replace a failed disk in a Proxmox 9 Ceph High Availability (HA) cluster without disrupting running virtual machines. Disk failure is one of the most common hardware issues in production clusters, and knowing how to handle it properly is critical for maintaining data integrity and high availability.
Ceph is designed to automatically replicate and rebalance data when a disk or OSD fails. However, administrators must follow the correct removal and replacement procedure to prevent data loss or cluster instability.
In this full demo, you will learn:
How to identify a failed Ceph OSD disk
How to verify whether the disk is physically dead
How to properly remove the faulty OSD from the cluster
How to replace the disk and recreate a new OSD
How Ceph automatically rebalances and restores redundancy
Best practices for maintaining Ceph HA in Proxmox PVE 9
This guide is perfect for IT professionals managing production Ceph clusters and home lab enthusiasts building resilient virtualization environments.
🧪 4. Simulate 1 Disk Failure on a Node
In this scenario, we simulate a single disk failure on node pve03zfs.
⚠️ 4.1. Symptoms
Simulation: node pve03zfs has 1 damaged hard drive.
You may observe:
Ceph reports OSD as down or out
Storage pool still has data, but one replica copy is missing
Ceph starts reallocating data automatically
The cluster enters a degraded state but continues operating if replication factor is sufficient.
🛠 4.2. Troubleshooting Process
🔹 Step 1: Identify the OSD Error
Use the command line or the GUI to identify the failing OSD.
Check cluster health:
ceph status
Check OSD tree:
ceph osd tree
Check disk serial and device mapping:
ls -l /dev/disk/by-id/
Example:
Failed OSD: osd.5
Lost disk: /dev/sdc
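The identification checks above can be collected into one dry-run helper. This is only a sketch: it prints the commands so you can review the sequence before running it on a monitor node.

```shell
#!/bin/sh
# Dry-run helper: prints the Step 1 identification commands instead of
# executing them. Drop the echo wrappers (or pipe the output into `sh`)
# to run them for real on a cluster node.
print_checks() {
    echo "ceph status"
    echo "ceph osd tree"
    echo "ls -l /dev/disk/by-id/"
}
print_checks
```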
🔹 Step 2: Verify Whether the Disk is Truly Dead
⚠️ This step applies only to real physical environments.
Virtual lab environments may not detect hardware-level failures.
SSH into the affected node (example: pve03zfs).
Check if disk still exists:
lsblk
Check SMART information:
smartctl -a /dev/sdc
Interpret results:
Disk not present → physically dead
SMART reports failure → hardware error
Disk becomes read-only → serious damage
If disk is confirmed damaged → replacement is required.
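The verification logic can be scripted as a small sketch. It assumes `smartctl` comes from the `smartmontools` package; `/dev/sdc` is the example device identified in Step 1.

```shell
#!/bin/sh
# Check whether a suspect disk is still visible and, if so, query its
# SMART health. Prints "absent" when the block device no longer exists.
check_disk() {
    dev="$1"
    if [ ! -b "$dev" ]; then
        echo "absent"            # disk gone: physically dead or unplugged
    elif command -v smartctl >/dev/null 2>&1; then
        smartctl -H "$dev"       # print the overall SMART health verdict
    else
        echo "smartctl missing"  # install the smartmontools package
    fi
}
check_disk /dev/sdc
```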
🔹 Step 3: Remove the Faulty OSD from Cluster
Mark OSD as down:
ceph osd down osd.5
Mark OSD as out:
ceph osd out osd.5
Remove OSD from CRUSH map:
ceph osd crush remove osd.5
Delete authentication and OSD entry:
ceph auth del osd.5
ceph osd rm osd.5
After completing this step, Ceph will automatically:
Rebalance Placement Groups (PGs)
Recreate replicas on healthy OSDs
Gradually restore cluster stability
This is the power of Ceph replication in a Proxmox HA cluster.
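The five removal commands above can be wrapped in a dry-run helper for the example osd.5. It only prints the sequence in the required order, so you can review it before pasting the lines into a shell on a monitor node.

```shell
#!/bin/sh
# Print the Step 3 removal sequence for one OSD, in order.
# Usage on a real cluster (after reviewing the output): removal_cmds osd.5 | sh
removal_cmds() {
    osd="$1"
    echo "ceph osd down $osd"          # mark the OSD down
    echo "ceph osd out $osd"           # trigger data rebalancing
    echo "ceph osd crush remove $osd"  # drop it from the CRUSH map
    echo "ceph auth del $osd"          # remove its auth key
    echo "ceph osd rm $osd"            # delete the OSD entry
}
removal_cmds osd.5
```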
🔹 Step 4: Replace the Failed Hard Drive
Shut down the server containing the faulty disk.
Physically replace the damaged hard drive with a new one.
Then verify the disk configuration. In this lab demo the node's disks are virtual, so the swap is reflected in the node VM's configuration (VM 102 here) by the new disk serial:
nano /etc/pve/qemu-server/102.conf
serial=DISK07
Ensure the new disk is properly detected before proceeding.
🔹 Step 5: Create a New OSD on the New Disk
Go to:
GUI → Ceph → OSD
Create a new OSD using the replaced disk.
Once created, Ceph automatically:
Starts rebalancing data
Redistributes replicas according to replication factor
Restores full redundancy
No manual rebalancing is required.
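If you prefer the command line over the GUI, Proxmox's `pveceph` tool can create the OSD directly. A hedged sketch, guarded so it only acts on a real Proxmox node; `/dev/sdc` is the example device, so confirm yours with `lsblk` first.

```shell
#!/bin/sh
# CLI alternative to the GUI step: create an OSD on the replaced disk.
# Guarded so it only acts where the Proxmox pveceph tool exists.
create_osd() {
    dev="$1"
    if ! command -v pveceph >/dev/null 2>&1; then
        echo "pveceph not found: run this on a Proxmox node"
        return 0
    fi
    lsblk "$dev"                 # confirm the new disk is detected
    pveceph osd create "$dev"    # create the OSD (this wipes the disk!)
    ceph osd tree                # verify the new OSD appears and is up
}
create_osd /dev/sdc
```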
⚖️ Cluster Recovery & Rebalancing
After the new OSD is added, monitor cluster health:
ceph status
Initially, you may see:
A degraded state
PGs in active+remapped
Recovering PGs
Over time, Ceph will:
Backfill and recover the missing data
Rebuild replicas
Return the cluster to HEALTH_OK
Final Result:
Faulty OSD is replaced
Data is fully replicated
Cluster returns to healthy state
⏳ Recovery time depends on:
Disk performance
Network bandwidth
Total data size
Replication factor
Since this demo runs in a lab environment using VMs, the recovery and backfill process may not reflect real production hardware performance.
When recovery completes, the green status indicators in Proxmox GUI will be fully restored.
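Instead of re-running `ceph status` by hand, a small watch loop can poll until the cluster is healthy again. This is a sketch; the 30-second interval is an arbitrary choice, and on a real cluster you would invoke it as `wait_for_health_ok ceph health`.

```shell
#!/bin/sh
# Poll a health command until it reports HEALTH_OK, then stop.
# Real usage on a monitor node: wait_for_health_ok ceph health
wait_for_health_ok() {
    while :; do
        status=$("$@")                 # e.g. `ceph health` prints the summary
        echo "$(date +%T) $status"
        [ "${status%% *}" = "HEALTH_OK" ] && break
        sleep 30
    done
}
# Demo invocation with a stand-in command so the sketch terminates:
wait_for_health_ok echo "HEALTH_OK (demo)"
```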
✅ Best Practices for Replacing Failed Disks in Ceph HA
To maintain a resilient Proxmox Ceph environment:
✔ Always confirm disk failure before removal
✔ Never remove an OSD abruptly without first marking it down and out
✔ Monitor ceph status continuously
✔ Maintain sufficient replication factor (minimum 3 recommended)
✔ Replace hardware promptly to avoid double failure risk
🏁 Conclusion
Replacing a failed disk in a Proxmox 9 Ceph HA cluster requires a structured approach:
Identify the failed OSD
Verify hardware condition
Mark OSD down & out
Remove from CRUSH and authentication
Replace disk
Create new OSD
Monitor automatic rebalance
By following this procedure, you ensure:
Zero VM downtime (if replication factor allows)
Data integrity across cluster
High availability continuity
Production-grade stability
Ceph’s automatic replication and self-healing capabilities make it one of the most powerful distributed storage systems for Proxmox virtualization environments.
This full demo shows exactly how to keep your infrastructure resilient, scalable, and protected against disk failures.
See also related articles:
P21 – How to Schedule Automatic Shutdown and Startup of VMs in Proxmox VE
P15 – Backup and Restore VM in Proxmox VE
P14 – How to Remove Cluster Group Safely on Proxmox