Proxmox P23 – High Availability Test with ZFS Replication

Full HA + ZFS Replication Demo on Proxmox VE 9 (Step-by-Step Guide)

High Availability (HA) is a critical requirement in any production virtualization environment. In this tutorial, we perform a full High Availability test on Proxmox VE 9 using ZFS replication, demonstrating how to keep virtual machines online even when a node fails.

This guide walks you through the complete HA + Replication configuration process, including cluster setup, storage preparation, replication scheduling, HA resource configuration, and real-world failover testing.

If you are running business-critical workloads, home lab clusters, or enterprise Proxmox environments, mastering HA with ZFS replication is essential to avoid downtime and ensure data consistency.


1️⃣ Overview

🚀 What Is Replication in Proxmox?

Replication in Proxmox means sending incremental copies of a VM from one node to another on a scheduled basis (5 minutes, 15 minutes, 1 hour, etc.).

Replication is supported only for:

• VMs and containers whose disks live on ZFS storage (it uses ZFS send/receive under the hood)

It does NOT support:

• Ceph RBD (Ceph is shared storage, so HA works on it without replication jobs)
• LVM-thin
• ext4
• directory storage


📦 How Replication Works

Example scenario:

VM 100 is running on node pve01.
You create replication to pve02.

Mechanism:

• First run → Full copy of VM data to node pve02
• Next runs → Only changed blocks are transferred (incremental replication)
• On the destination node, the VM remains stopped and only stores replicated snapshots
• During failover → Snapshot is promoted → VM becomes active and runs

This ensures near real-time synchronization while minimizing bandwidth usage.
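On a node with replication jobs configured, the mechanism above can be observed from the CLI. A quick sketch (the VM ID 100 follows the example scenario; these commands must be run on the Proxmox nodes themselves):

```shell
# List all replication jobs defined in the cluster
pvesr list

# Show the status of jobs originating from this node (last sync, duration, next run)
pvesr status

# On the destination node, the replicated state is visible as plain ZFS snapshots
zfs list -t snapshot | grep vm-100
```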


2️⃣ Standard HA + Replication Procedure on Proxmox


🔵 STEP 1 — Create Proxmox Cluster (Required)

Infrastructure:

pve01zfs: 192.168.11.200 (main)
pve02zfs: 192.168.11.201 (backup)

On master node (pve01zfs):

 
pvecm create tsf

Add pve01's IP address and hostname to the /etc/hosts file on pve02:

 
192.168.11.200 pve01zfs.tsf.id.vn pve01zfs

Join cluster from pve02:

 
pvecm add pve01zfs.tsf.id.vn

Password: the root password of pve01

For full cluster setup details, see video:
Setup Cluster Group on Proxmox Version 9
https://youtu.be/wUqA8xeLcjc

Important Notes:

• Two nodes should have a separate corosync link or stable LAN
• Latency < 2ms
• Same system time
• Same Proxmox version

Cluster stability is mandatory before configuring HA.
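Before moving on, it is worth confirming that the checklist above actually holds. A quick sketch using standard Proxmox and systemd tools:

```shell
# Quorum, expected votes, and cluster name
pvecm status

# Both nodes should be listed and online
pvecm nodes

# System time must match across nodes (NTP should be active)
timedatectl status

# Proxmox versions should match on both nodes
pveversion
```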


🔵 STEP 2 — Prepare Replication Storage

The VM's disks must be on ZFS storage to support replication.

If using local-lvm, move disk to ZFS:

 
qm move_disk 101 scsi0 zfs-storage

Only ZFS supports native storage replication in Proxmox VE 9. (Ceph RBD is shared storage, so HA works on it without replication jobs.)
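To confirm the storage side is ready — assuming the ZFS storage is named zfs-storage as in the command above:

```shell
# The ZFS storage should show as active with free space on both nodes
pvesm status

# Verify the VM disk now points at the ZFS storage
qm config 101 | grep scsi0

# The underlying dataset should exist on the pool
zfs list | grep vm-101
```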


🔵 STEP 3 — Create Replication Task for VM

Using GUI:

• Select VM → Replication → Add
• Target: pve02
• Schedule: */30 (every 30 minutes — Proxmox uses calendar-event syntax, not 5-field cron)
• Rate limit: Unlimited (or 100 MB/s)

Replication automatically creates ZFS snapshots and transfers them to the target node.

First run:
Schedule immediately → Full VM snapshot (takes time)

Second run onward:
Incremental snapshot replication

This ensures consistent ZFS-based synchronization.
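The same job can also be created from the CLI with pvesr. The job ID 100-0 (in VMID-jobnumber form) and the 100 MB/s rate limit are illustrative values:

```shell
# Replicate VM 100 to node pve02zfs every 30 minutes, capped at 100 MB/s
pvesr create-local-job 100-0 pve02zfs --schedule "*/30" --rate 100

# Kick off the first (full) run immediately instead of waiting for the schedule
pvesr schedule-now 100-0
```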


🔵 STEP 4 — Set Cluster Votes (Optional)

Note: Required only when cluster has fewer than 3 nodes.

Create new configuration file:

 
cd /etc/pve
cp corosync.conf corosync.new.conf

Edit:

 
nano corosync.new.conf

Modify the vote count: give pve02zfs (the backup node) quorum_votes: 2, and increment config_version so the new configuration is accepted.

Backup old file and rename:

 
mv corosync.conf corosync.bak.conf
mv corosync.new.conf corosync.conf

This lets the surviving node keep quorum in a 2-node cluster. (For production, a QDevice on a third machine is the safer way to avoid split-brain.)
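For reference, the edited nodelist in corosync.new.conf would look roughly like this — a sketch based on this lab's node names; remember that config_version in the totem section must be incremented on every change:

```
nodelist {
  node {
    name: pve01zfs
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.11.200
  }
  node {
    name: pve02zfs
    nodeid: 2
    quorum_votes: 2
    ring0_addr: 192.168.11.201
  }
}
```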


🔵 STEP 5 — Add VM to HA Manager

Add VM 100 as HA resource:

 
ha-manager add vm:100

Configure Node Affinity Rules:

Datacenter → HA → Affinity Rules → HA Node Affinity Rules → Add

Select HA Resource: VM 100

Set priority:

• pve01 = 2 (higher priority, main node)
• pve02 = 1

This ensures VM prefers running on the primary node.
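From the CLI, the resource's recovery policy can be tuned and verified as well. The max_restart/max_relocate values here are illustrative, not from the original guide:

```shell
# Request that the VM be kept running, with modest retry limits
ha-manager set vm:100 --state started --max_restart 1 --max_relocate 1

# Confirm the resource is managed and in the 'started' state
ha-manager status
```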


🔵 STEP 6 — Test HA Failover (Critical Test)

How to test:

Completely STOP node pve01

Result:

→ VM 100 automatically starts on pve02
→ It may take a few minutes

When pve01 is repaired and restarted:

→ VM automatically migrates back to pve01

This confirms HA + replication is functioning properly.
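During the test, the failover can be watched in real time from the surviving node. A sketch:

```shell
# Refresh HA state every 5 seconds: VM 100 should end up 'started' on pve02zfs
watch -n 5 ha-manager status

# Inspect the HA manager's decisions after the fact
journalctl -u pve-ha-crm -u pve-ha-lrm --since "10 min ago"
```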


🔵 STEP 7 — Restart Cluster Services

After configuration changes, restart cluster services if required to ensure stability.
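A minimal sketch of the restarts typically involved (per node, and only needed after corosync or HA configuration edits):

```shell
# Reload the new corosync configuration
systemctl restart corosync

# Restart the HA stack: local and cluster resource managers
systemctl restart pve-ha-lrm pve-ha-crm

# Verify the cluster and HA state are healthy again
pvecm status
ha-manager status
```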


🔐 Why HA + ZFS Replication Matters

Implementing High Availability with ZFS replication in Proxmox provides:

• Reduced downtime
• Automatic failover
• Data consistency via ZFS snapshots
• Efficient incremental replication
• Production-ready resilience

This setup is ideal for:

  • Enterprise virtualization clusters

  • Home lab HA testing

  • Critical service hosting

  • Database or file server protection


🎯 Final Thoughts

Proxmox VE 9 combined with ZFS replication delivers a powerful, cost-effective High Availability solution. By properly configuring cluster quorum, replication tasks, and HA resource priorities, you can build a resilient virtualization infrastructure capable of surviving node failures.

Understanding the mechanics of incremental ZFS replication and HA failover gives you full control over uptime, storage efficiency, and disaster recovery readiness.

Mastering this configuration significantly elevates your Proxmox expertise and prepares you for real-world production environments.

See also related articles:

• P15 – Backup and Restore VM in Proxmox VE
• P14 – How to Remove Cluster Group Safely on Proxmox