

🚀 Proxmox – P26 Ceph HA Cluster: Replace Failed Disk on Proxmox (Full Demo)

🔎 Introduction

In this tutorial, we demonstrate how to safely replace a failed disk in a Proxmox 9 Ceph High Availability (HA) cluster without disrupting running virtual machines. Disk failure is one of the most common hardware issues in production clusters, and knowing how to handle it properly is critical for maintaining data integrity and high availability.

Ceph is designed to automatically replicate and rebalance data when a disk or OSD fails. However, administrators must follow the correct removal and replacement procedure to prevent data loss or cluster instability.

In this full demo, you will learn:

  • How to identify a failed Ceph OSD disk

  • How to verify whether the disk is physically dead

  • How to properly remove the faulty OSD from the cluster

  • How to replace the disk and recreate a new OSD

  • How Ceph automatically rebalances and restores redundancy

  • Best practices for maintaining Ceph HA in Proxmox PVE 9

This guide is perfect for IT professionals managing production Ceph clusters and home lab enthusiasts building resilient virtualization environments.


🧪 4. Simulate a Disk Failure on a Node

In this scenario, we simulate a single disk failure on node pve03zfs.


⚠️ 4.1. Symptoms

Simulation: node pve03zfs has one damaged hard drive.

You may observe:

  • Ceph reports OSD as down or out

  • Storage pool still has data, but one replica copy is missing

  • Ceph starts reallocating data automatically

The cluster enters a degraded state but continues operating as long as the replication factor is sufficient.


🛠 4.2. Troubleshooting Process


🔹 Step 1: Identify the OSD Error

Use the command line or the Proxmox GUI to identify the failing OSD.

Check cluster health:

 
ceph status

Check OSD tree:

 
ceph osd tree

Check disk serial and device mapping:

 
ls -l /dev/disk/by-id/

In this example:

  • Failed OSD: osd.5

  • Failed device: /dev/sdc


🔹 Step 2: Verify Whether the Disk is Truly Dead

⚠️ This step applies only to real physical environments.
Virtual lab environments may not detect hardware-level failures.

SSH into the affected node (example: pve03zfs).

Check if disk still exists:

 
lsblk

Check SMART information:

 
smartctl -a /dev/sdc

Interpret results:

  • Disk not present → physically dead

  • SMART reports failure → hardware error

  • Disk becomes read-only → serious damage

If the disk is confirmed damaged → replacement is required.


🔹 Step 3: Remove the Faulty OSD from Cluster

Mark OSD as down:

 
ceph osd down osd.5

Mark OSD as out:

 
ceph osd out osd.5

Remove OSD from CRUSH map:

 
ceph osd crush remove osd.5

Delete authentication and OSD entry:

 
ceph auth del osd.5
ceph osd rm osd.5

After completing this step, Ceph will automatically:

  • Rebalance Placement Groups (PGs)

  • Recreate replicas on healthy OSDs

  • Gradually restore cluster stability

This is the power of Ceph replication in a Proxmox HA cluster.


🔹 Step 4: Replace the Failed Hard Drive

Shut down the server containing the faulty disk.

Physically replace the damaged hard drive with a new one.

Then update the disk configuration. In this demo the node itself is a VM, so the replacement disk is attached by editing the node VM's configuration and pointing the disk at the new drive's serial number:

 
nano /etc/pve/qemu-server/102.conf

Set the affected disk line to the new drive's serial (in this demo, serial=DISK07).

Ensure the new disk is properly detected before proceeding.


🔹 Step 5: Create a New OSD on the New Disk

Go to:

GUI → Ceph → OSD

Create a new OSD using the replaced disk.

Once created, Ceph automatically:

  • Starts rebalancing data

  • Redistributes replicas according to replication factor

  • Restores full redundancy

No manual rebalancing is required.


⚖️ Cluster Recovery & Rebalancing

After the new OSD is added, monitor cluster health:

 
ceph status

Initially, you may see:

  • Degraded state

  • Active + remapped

  • Recovering PGs

Over time, Ceph will:

  • Backfill and recover the affected PGs

  • Rebuild replicas

  • Return cluster to HEALTH_OK

Final Result:

  • Faulty OSD is replaced

  • Data is fully replicated

  • Cluster returns to healthy state

⏳ Recovery time depends on:

  • Disk performance

  • Network bandwidth

  • Total data size

  • Replication factor

Since this demo runs in a lab environment using VMs, recovery and backfill speed may differ significantly from real production hardware.

When recovery completes, the green status indicators in Proxmox GUI will be fully restored.


✅ Best Practices for Replacing Failed Disks in Ceph HA

To maintain a resilient Proxmox Ceph environment:

✔ Always confirm disk failure before removal
✔ Never delete an OSD abruptly without first marking it out
✔ Monitor ceph status continuously
✔ Maintain a sufficient replication factor (minimum 3 recommended)
✔ Replace hardware promptly to avoid the risk of a double failure


🏁 Conclusion

Replacing a failed disk in a Proxmox 9 Ceph HA cluster requires a structured approach:

  1. Identify the failed OSD

  2. Verify hardware condition

  3. Mark OSD down & out

  4. Remove from CRUSH and authentication

  5. Replace disk

  6. Create new OSD

  7. Monitor automatic rebalance

By following this procedure, you ensure:

  • Zero VM downtime (if replication factor allows)

  • Data integrity across cluster

  • High availability continuity

  • Production-grade stability

Ceph’s automatic replication and self-healing capabilities make it one of the most powerful distributed storage systems for Proxmox virtualization environments.

This full demo shows exactly how to keep your infrastructure resilient, scalable, and protected against disk failures.
