
🚀 Proxmox VE P17 – How to Set Up ZFS RAID on Proxmox and Replace a Failed Disk (Full Step-by-Step Guide)

Reliable storage is the foundation of every virtualization platform.

In this tutorial, you will learn how to set up ZFS RAID on Proxmox VE 9 and safely replace a failed disk without losing data. This guide covers both the initial RAID configuration and the complete disk replacement procedure, including resilvering and bootloader recovery.

By following this tutorial, IT administrators and homelab engineers can build a resilient Proxmox storage infrastructure that protects virtual machines and containers from disk failure.

In this guide, you will learn how to:

  • 🧱 Configure ZFS RAID during Proxmox installation

  • 💾 Monitor ZFS pool health using zpool status and zfs list

  • ❌ Identify and offline a failed disk

  • 🔄 Replace the failed disk safely

  • ⚙ Rebuild and restore bootloader using Proxmox Boot Tool

By the end, you will confidently manage ZFS RAID and protect your virtual workloads.


🧪 LAB Environment

The Proxmox host server has two or more disks.

Disks 1 and 2:

  • Run in RAID, containing the Proxmox OS

  • Store VMs, backup files, and ISO images

Demo partition type: GPT


🧱 Step 1 – Set Serial Numbers for the VM Disks

To simulate RAID accurately, we assign different serial numbers to each virtual disk.

Edit VM config:

 
nano /etc/pve/qemu-server/105.conf

Add a serial option to each disk entry (appended to the existing scsiX lines; the storage and volume names below are examples from this demo):

 
scsi0: local-zfs:vm-105-disk-0,serial=AbcxyzDSK001
scsi1: local-zfs:vm-105-disk-1,serial=AbcxyzDSK002

This ensures each disk has a unique identifier, similar to real physical drives.
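To double-check the serials from the config, a small parsing sketch can help; the `disk_serial` helper below is illustrative, not a Proxmox tool:

```shell
# disk_serial: print the serial= option from a Proxmox disk config line (sketch).
disk_serial() {
  sed -n 's/.*serial=\([^,]*\).*/\1/p'
}

# On the host you would feed it real config lines, e.g.:
#   grep '^scsi' /etc/pve/qemu-server/105.conf | disk_serial
echo 'scsi0: local-zfs:vm-105-disk-0,serial=AbcxyzDSK001' | disk_serial   # prints AbcxyzDSK001
```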


💿 Step 2 – Install OS with ZFS RAID 1

Important:

  • GPT partition scheme

  • 2 Disks

  • ZFS RAID 1

During Proxmox installation, choose:

ZFS (RAID1)

This creates:

  • BIOS Boot partition

  • EFI partition

  • ZFS root partition


🖥 Step 3 – Setup VM on Host and Run

Deploy and test VMs to confirm ZFS pool is operating normally.
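A quick scripted check of pool health can confirm this; the `pool_state` helper below is an illustrative sketch that extracts the state line from `zpool status` output:

```shell
# pool_state: print the pool state (e.g. ONLINE, DEGRADED) from `zpool status`
# output read on stdin. Illustrative sketch, not part of ZFS itself.
pool_state() {
  awk -F': *' '/^[[:space:]]*state:/ { print $2; exit }'
}

# On the host:  zpool status rpool | pool_state
printf '  pool: rpool\n state: ONLINE\n' | pool_state   # prints ONLINE
```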


❌ II – Simulate Disk Failure

Step 1 – Identify the Faulty Hard Drive

Check pool status:

 
zpool status

Example faulty disk info:

  • ID: 14912614961185646598

  • Name: scsi-0QEMU_QEMU_HARDDISK_AbcxyzDSK001-part3

Offline failed disk:

 
zpool offline rpool scsi-0QEMU_QEMU_HARDDISK_AbcxyzDSK001-part3

🔧 Step 2 – Shutdown Host and Replace Faulty Disk

If the server does NOT support hot-swap:

  • Shutdown Proxmox

  • Remove faulty disk

  • Insert new disk in same slot

If the server supports hot-swap:

  • Replace disk without shutdown

In this demo (VM simulation), add new disk and assign serial:

 
nano /etc/pve/qemu-server/105.conf

Add:

 
serial=AbcxyzDSK003

⚠ Important Note:

If Proxmox was installed using MBR (SeaBIOS):

  • The system may fail to boot

  • Boot using remaining disk

  • Back up all VMs and data to external storage (NFS/SMB/OneDrive/physical disk)

  • Shutdown host

  • Attach new disk

  • Reinstall Proxmox with 2 disks

  • Mount storage and restore VMs
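The backup step in the list above could be scripted as a sketch like the following; `vzdump` is Proxmox's backup CLI, while the storage name `backup-nfs` is a placeholder for whatever backup storage you have configured:

```shell
# build_backup_cmd: assemble a vzdump command line for backing up all guests
# in snapshot mode (sketch; the storage name is a placeholder).
build_backup_cmd() {
  storage="$1"
  echo "vzdump --all --mode snapshot --storage ${storage}"
}

# On the host you would run the resulting command:
build_backup_cmd backup-nfs   # prints vzdump --all --mode snapshot --storage backup-nfs
```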


🔍 Step 3 – Identify the New Drive

After attaching the new disk, list disks by ID:

 
ls -l /dev/disk/by-id/

Example:

New name:

 
scsi-0QEMU_QEMU_HARDDISK_AbcxyzDSK003

New drive: sda
Existing (healthy) drive: sdb


📋 Step 4 – Copy Partition from Old Disk to New Disk

On GPT:

  • Partition 1 → BIOS Boot (1 MiB)

  • Partition 2 → EFI (FAT32, 512 MiB)

  • Partition 3 → ZFS root

Check partition layout:

 
lsblk /dev/sdb
lsblk /dev/sda

Copy partition table:

 
sgdisk --replicate=/dev/sda /dev/sdb

This creates:

  • sda1

  • sda2

  • sda3

⚠ Do NOT copy ZFS data from sdb3 to sda3.
ZFS will rebuild automatically.

Copy small partitions if needed:

 
dd if=/dev/sdb1 of=/dev/sda1 bs=1M status=progress
dd if=/dev/sdb2 of=/dev/sda2 bs=1M status=progress

Do NOT copy sdb3.


🔄 Step 5 – Replace Faulty Disk in Mirror

Assume faulty disk:

 
/dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_AbcxyzDSK001-part3

Replace disk:

 
zpool replace rpool <old-disk> <new-disk>

Example:

 
zpool replace rpool scsi-0QEMU_QEMU_HARDDISK_AbcxyzDSK001-part3 scsi-0QEMU_QEMU_HARDDISK_AbcxyzDSK003-part3

Check resilver progress:

 
zpool status -v

Wait until:

  • ONLINE status

  • scan resilvered 100%

  • rpool mirror healthy
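Resilver progress can also be checked non-interactively; the `resilver_done` sketch below greps the status text for the in-progress marker (the helper name is illustrative):

```shell
# resilver_done: succeed when `zpool status` output (read on stdin) no longer
# reports a resilver in progress. Illustrative sketch.
resilver_done() {
  ! grep -q 'resilver in progress'
}

# On the host, to wait for completion:
#   until zpool status rpool | resilver_done; do sleep 30; done
printf 'scan: resilver in progress since ...\n' | resilver_done || echo "still resilvering"
```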


🗂 Step 6 – Mount Root Dataset for Chroot

Mount datasets:

 
zfs mount -a

Bind root:

 
mkdir -p /mnt
mount --bind / /mnt

🧩 Step 7 – Mount EFI and Bind Filesystems

Mount EFI:

 
mkdir -p /mnt/boot/efi
mount /dev/sda2 /mnt/boot/efi

Bind system directories:

 
mount --bind /dev /mnt/dev
mount --bind /proc /mnt/proc
mount --bind /sys /mnt/sys
mount --bind /run /mnt/run

⚙ Step 8 – Chroot and Install Bootloader

Enter chroot:

 
chroot /mnt /bin/bash

Format the new EFI partition (the partition, not the whole disk) and initialize it:

 
proxmox-boot-tool format /dev/sda2
proxmox-boot-tool init /dev/sda2

Refresh bootloader:

 
proxmox-boot-tool refresh

Verify EFI files:

 
ls /boot/efi/EFI/proxmox

If .efi files are present → bootloader is ready.

Reboot:

 
reboot

📊 Monitoring ZFS Health

Regularly check:

 
zpool status
zfs list

Monitor:

  • Disk state

  • Resilver progress

  • Pool health

Healthy ZFS pool should always display:

ONLINE
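For unattended monitoring (e.g. from cron), `zpool status -x` prints "all pools are healthy" when nothing is wrong; the sketch below turns that message into an exit code (the function name is illustrative):

```shell
# pools_healthy: succeed only when `zpool status -x` output (read on stdin)
# reports that all pools are healthy. Illustrative sketch for cron-style alerts.
pools_healthy() {
  grep -qx 'all pools are healthy'
}

# On the host, e.g. in a cron job (mail target is a placeholder):
#   zpool status -x | pools_healthy || mail -s "ZFS pool degraded" admin@example.com < /dev/null
echo 'all pools are healthy' | pools_healthy && echo OK   # prints OK
```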


🔐 Production Best Practices

✔ Use enterprise-grade disks
✔ Monitor SMART regularly
✔ Keep spare disk available
✔ Test disk replacement procedure
✔ Use RAID 1 or RAID 10 for critical workloads
✔ Monitor ZFS resilver events

ZFS provides self-healing and data integrity verification — but proactive monitoring is still essential.


🎯 Conclusion

In this Proxmox VE P17 guide, you have successfully:

  • Installed Proxmox with ZFS RAID 1

  • Simulated disk failure

  • Identified and offlined faulty disk

  • Replaced disk safely

  • Rebuilt ZFS mirror

  • Restored bootloader using Proxmox Boot Tool

Your Proxmox storage is now resilient, redundant, and production-ready.

ZFS RAID + Proper Disk Replacement Procedure = Enterprise-Level Data Protection.

See also related articles:

  • P15 – Backup and Restore VM in Proxmox VE

  • P14 – How to Remove Cluster Group Safely on Proxmox