

🚀 Proxmox – P26 Ceph HA Cluster: Replace Failed Disk on Proxmox (Full Demo)

🔎 Introduction

In this tutorial, we demonstrate how to safely replace a failed disk in a Proxmox 9 Ceph High Availability (HA) cluster without disrupting running virtual machines. Disk failure is one of the most common hardware issues in production clusters, and knowing how to handle it properly is critical for maintaining data integrity and high availability.

Ceph is designed to automatically replicate and rebalance data when a disk or OSD fails. However, administrators must follow the correct removal and replacement procedure to prevent data loss or cluster instability.

In this full demo, you will learn:

  • How to identify a failed Ceph OSD disk

  • How to verify whether the disk is physically dead

  • How to properly remove the faulty OSD from the cluster

  • How to replace the disk and recreate a new OSD

  • How Ceph automatically rebalances and restores redundancy

  • Best practices for maintaining Ceph HA in Proxmox PVE 9

This guide is perfect for IT professionals managing production Ceph clusters and home lab enthusiasts building resilient virtualization environments.


🧪 4. Simulate a Disk Failure on a Node

In this scenario, we simulate a single disk failure on node pve03zfs.


⚠️ 4.1. Symptoms

Simulation: node pve03zfs has one damaged hard drive.

You may observe:

  • Ceph reports OSD as down or out

  • Storage pool still has data, but one replica copy is missing

  • Ceph starts reallocating data automatically

The cluster enters a degraded state but continues operating as long as the replication factor is sufficient.


🛠 4.2. Troubleshooting Process


🔹 Step 1: Identify the OSD Error

Use the command line or the Proxmox GUI to identify the failing OSD.

Check cluster health:

 
ceph status

Check OSD tree:

 
ceph osd tree

Check disk serial and device mapping:

 
ls -l /dev/disk/by-id/

In this example:

  • Failed OSD: osd.5

  • Failed device: /dev/sdc


🔹 Step 2: Verify Whether the Disk is Truly Dead

⚠️ This step applies only to real physical environments.
Virtual lab environments may not detect hardware-level failures.

SSH into the affected node (example: pve03zfs).

Check if disk still exists:

 
lsblk

Check SMART information:

 
smartctl -a /dev/sdc

Interpret results:

  • Disk not present → physically dead

  • SMART reports failure → hardware error

  • Disk becomes read-only → serious damage

If the disk is confirmed damaged → replacement is required.


🔹 Step 3: Remove the Faulty OSD from Cluster

Mark OSD as down:

 
ceph osd down osd.5

Mark OSD as out:

 
ceph osd out osd.5

Remove OSD from CRUSH map:

 
ceph osd crush remove osd.5

Delete authentication and OSD entry:

 
ceph auth del osd.5
ceph osd rm osd.5

After completing this step, Ceph will automatically:

  • Rebalance Placement Groups (PGs)

  • Recreate replicas on healthy OSDs

  • Gradually restore cluster stability

This is the power of Ceph replication in a Proxmox HA cluster.


🔹 Step 4: Replace the Failed Hard Drive

Shut down the server containing the faulty disk.

Physically replace the damaged hard drive with a new one.

Then update the disk configuration. In this demo the node itself is a VM, so the replacement disk is attached by editing the node VM's configuration and pointing the disk at the new drive's serial number:

 
nano /etc/pve/qemu-server/102.conf

Set the affected disk line to the new drive's serial (in this demo, serial=DISK07).

Ensure the new disk is properly detected before proceeding.


🔹 Step 5: Create a New OSD on the New Disk

Go to:

GUI → Ceph → OSD

Create a new OSD using the replaced disk.

Once created, Ceph automatically:

  • Starts rebalancing data

  • Redistributes replicas according to replication factor

  • Restores full redundancy

No manual rebalancing is required.


⚖️ Cluster Recovery & Rebalancing

After the new OSD is added, monitor cluster health:

 
ceph status

Initially, you may see:

  • Degraded state

  • Active + remapped

  • Recovering PGs

Over time, Ceph will:

  • Backfill and recover the affected PGs

  • Rebuild replicas

  • Return cluster to HEALTH_OK

Final Result:

  • Faulty OSD is replaced

  • Data is fully replicated

  • Cluster returns to healthy state

⏳ Recovery time depends on:

  • Disk performance

  • Network bandwidth

  • Total data size

  • Replication factor

Since this demo runs in a lab environment using VMs, recovery and backfill speed may differ significantly from real production hardware.

When recovery completes, the green status indicators in Proxmox GUI will be fully restored.


✅ Best Practices for Replacing Failed Disks in Ceph HA

To maintain a resilient Proxmox Ceph environment:

✔ Always confirm disk failure before removal
✔ Never delete an OSD abruptly without first marking it out
✔ Monitor ceph status continuously
✔ Maintain a sufficient replication factor (minimum 3 recommended)
✔ Replace hardware promptly to avoid the risk of a double failure


🏁 Conclusion

Replacing a failed disk in a Proxmox 9 Ceph HA cluster requires a structured approach:

  1. Identify the failed OSD

  2. Verify hardware condition

  3. Mark OSD down & out

  4. Remove from CRUSH and authentication

  5. Replace disk

  6. Create new OSD

  7. Monitor automatic rebalance

By following this procedure, you ensure:

  • Zero VM downtime (if replication factor allows)

  • Data integrity across cluster

  • High availability continuity

  • Production-grade stability

Ceph’s automatic replication and self-healing capabilities make it one of the most powerful distributed storage systems for Proxmox virtualization environments.

This full demo shows exactly how to keep your infrastructure resilient, scalable, and protected against disk failures.
