
Proxmox 3-Node Cluster Without Shared Storage: ZFS Replication Instead


A highly available Proxmox cluster does not have to be built on Ceph or a classic SAN. For many mid-sized environments with two production hypervisors and a manageable number of virtual machines, a 3-node cluster with local ZFS storage and scheduled replication is the more sensible solution, both economically and operationally. You save significant hardware costs and noticeably reduce operational complexity.

In this article we show how to set up such a cluster cleanly with Proxmox VE 8.3, which RPO and RTO values are realistic and where the solution reaches its limits compared to Ceph.

Why Local Storage Instead of Ceph or SAN

Ceph is an excellent solution — starting at a certain scale. If you can operate three nodes with six OSDs each, a dedicated 25 GbE backend and sufficient RAM, you get real shared storage with live migration and instant failover. In SMB environments with two hypervisors, 20 to 50 VMs and a backup window overnight, the overhead of this solution is often not justified.

Local ZFS storage in return offers:

  • Full NVMe performance without network overhead
  • Snapshot and clone technology at block level
  • Inline compression and optional deduplication
  • Block-level replication between nodes
  • Proven recovery tooling

The price is clear: you lose the “zero-RPO guarantee” of synchronous shared storage. A VM that fails between two replication runs loses the changes of that interval.

Hardware Setup for the 3-Node Cluster

A proven setup in mid-market environments uses two production nodes with identical hardware and one small third machine that hosts the QDevice. The third machine does not host any VMs and, as configured below, is not a full cluster member. It only provides the third quorum vote.

Component | Node 1 + 2 (Production)             | Node 3 (QDevice)
----------|-------------------------------------|-------------------------
CPU       | AMD EPYC 4564P, 16 cores            | Intel N100 or small Xeon
RAM       | 256 GB DDR5 ECC                     | 16 GB
Storage   | 4x 3.84 TB NVMe (ZFS RAID10)        | 1x 500 GB SSD
Network   | 2x 25 GbE LACP, Corosync separated  | 2x 1 GbE
Power     | Redundant 2x 800 W                  | Single 250 W

The production nodes get a dedicated Corosync network on their own physical interfaces. This is not optional; it is decisive for cluster stability. If replication traffic delays Corosync packets, fencing events occur and VMs are restarted unnecessarily.
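To confirm that the Corosync address and the address used for replication really sit on different interfaces, a small helper like the following can be useful. It is a minimal sketch: `iface_for_ip` is a hypothetical helper name, and the replication address 10.10.30.1 in the usage comment is an assumption, since the article only specifies the two Corosync links.

```shell
# Reads `ip -o -4 addr show` output on stdin and prints the name of the
# interface that carries the IPv4 address given as $1 (empty if none).
iface_for_ip() {
    awk -v ip="$1" '
        $3 == "inet" {
            split($4, a, "/")          # strip the prefix length, e.g. "/24"
            if (a[1] == ip) print $2   # $2 is the interface name
        }'
}

# Usage on a production node (10.10.30.1 is an assumed replication address):
#   addrs=$(ip -o -4 addr show)
#   coro=$(printf '%s\n' "$addrs" | iface_for_ip 10.10.10.1)
#   repl=$(printf '%s\n' "$addrs" | iface_for_ip 10.10.30.1)
#   [ "$coro" != "$repl" ] || echo "WARN: Corosync and replication share $coro"
```

Different interface names do not prove physical separation: two VLAN interfaces on the same NIC would pass this check while still sharing one link.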

Setting Up Cluster and QDevice

After Proxmox installation on all three nodes, create the cluster on Node 1:

# On Node 1
pvecm create datazone-cluster --link0 10.10.10.1 --link1 10.10.20.1

# On Node 2
pvecm add 10.10.10.1 --link0 10.10.10.2 --link1 10.10.20.2

# On Node 1 and Node 2 (all cluster members): install the QDevice client
apt install corosync-qdevice

# On Node 3 (not a cluster member, only QNet daemon)
apt install corosync-qnetd
systemctl enable --now corosync-qnetd

# On Node 1: register the QDevice (requires root SSH access to Node 3)
pvecm qdevice setup 10.10.10.3

Afterwards, pvecm status should show three votes: two from the production nodes and one from the QDevice. The cluster then stays quorate if one production node fails, so the remaining node can take over its VMs.
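The vote check can be scripted so that it fits into a post-install checklist. The sketch below assumes that `pvecm status` prints "Expected votes:" and "Total votes:" lines in its Votequorum section; `check_votes` is a hypothetical helper name.

```shell
# Reads `pvecm status` output on stdin and compares the vote counts with
# the number passed as $1. Returns 0 on a match, 1 otherwise.
# Assumption: the output contains "Expected votes:" and "Total votes:" lines.
check_votes() {
    awk -v want="$1" '
        /^[[:space:]]*Expected votes:/ { expected = $NF }
        /^[[:space:]]*Total votes:/    { total = $NF }
        END {
            if (expected == want && total == want) {
                print "OK: " total " of " expected " votes"
                exit 0
            }
            print "WARN: total=" total " expected=" expected " wanted=" want
            exit 1
        }'
}

# Usage on a cluster node:
#   pvecm status | check_votes 3
```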

Configuring ZFS Pools and Replication

Create a ZFS pool with the same name on both production nodes. The name must match exactly, because replication targets the same storage on the partner node; with a differing pool name replication will not work.

# Identical on Node 1 and Node 2 (stable /dev/disk/by-id paths are preferable to nvmeXn1 names)
zpool create -o ashift=12 -O compression=lz4 -O atime=off \
  rpool-vm mirror nvme0n1 nvme1n1 mirror nvme2n1 nvme3n1

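# recordsize applies to file datasets (e.g. containers); VM disks are zvols
# and take their block size from the Proxmox storage definition instead
zfs set recordsize=64k rpool-vm
# sync=standard is the ZFS default; set explicitly for documentation purposes
zfs set sync=standard rpool-vm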
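Because both pools have to match, it can be worth comparing their settings before any replication job is created. The following is a minimal sketch; `pool_settings` and `compare_settings` are hypothetical helper names, and the usage comment assumes root SSH access from Node 1 to Node 2.

```shell
# Dump the settings that should be identical on both nodes as
# "property<TAB>value" lines. $1 = pool name.
pool_settings() {
    zfs get -H -o property,value compression,atime,recordsize "$1"
}

# Compare two dumps; prints the result and returns non-zero on mismatch.
compare_settings() {
    if [ "$1" = "$2" ]; then
        echo "settings match"
    else
        echo "settings differ"
        return 1
    fi
}

# Usage from Node 1:
#   local_cfg=$(pool_settings rpool-vm)
#   remote_cfg=$(ssh root@10.10.10.2 zfs get -H -o property,value compression,atime,recordsize rpool-vm)
#   compare_settings "$local_cfg" "$remote_cfg"
```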

Then add the storage in the Proxmox GUI as type “ZFS” under Datacenter → Storage and enable it for both nodes. For each VM that is to run highly available, configure a job under VM → Replication to the partner node:

# CLI variant for VM 100 every 15 minutes to Node 2
pvesr create-local-job 100-0 node2 --schedule "*/15"

The initial run transfers the complete ZFS dataset. All subsequent runs transfer only the blocks changed since the last successful run, as an incremental snapshot stream. For typical office VMs with a 100 GB volume, 15-minute deltas are usually in the range of a few hundred megabytes.
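To get a feel for the delta size of a particular VM, a ZFS dry-run send between two snapshots gives an estimate without transferring anything. This sketch assumes that the parsable dry-run output of `zfs send -nP` ends with a "size" line followed by the byte count; `delta_mib` is a hypothetical helper name, and the dataset and snapshot names in the usage comment are placeholders.

```shell
# Reads the output of `zfs send -nP -i <old> <new>` on stdin and prints the
# estimated stream size in MiB (rounded down).
# Assumption: the output contains a "size<TAB><bytes>" line.
delta_mib() {
    awk '$1 == "size" { bytes = $2 } END { printf "%d\n", bytes / 1048576 }'
}

# Usage (placeholder names, adjust to the real dataset and snapshots):
#   zfs send -nP -i rpool-vm/vm-100-disk-0@older rpool-vm/vm-100-disk-0@newer | delta_mib
```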

Realistically Assessing RPO, RTO and HA Behavior

This is where the decisive difference from Ceph or shared storage lies. With synchronous shared storage the RPO is zero, because every acknowledged write is already on storage that survives the node failure. With ZFS replication, the worst-case RPO is roughly the configured replication interval plus the time a replication run takes.

Scenario                              | RPO     | RTO            | Impact
--------------------------------------|---------|----------------|--------------------------------------
Node failure, last replica 5 min old  | ~5 min  | 2-4 min        | VM starts on partner with old state
Node failure, last replica 14 min old | ~14 min | 2-4 min        | Worst case at 15-min interval
Planned maintenance                   | 0       | 30-60 s        | Live migration possible, no data loss
Failure of QDevice (Node 3)           | 0       | 0              | Cluster continues, no VM impact
Failure of both production nodes      | n/a     | Backup restore | Disaster recovery from PBS required

For many SMB applications an RPO of 15 minutes is perfectly acceptable. For databases with high write volume, reduce the interval to 5 minutes or even 1 minute, or switch to application-level replication such as PostgreSQL streaming replication or MSSQL availability groups.
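The effect of a shorter interval is easy to put into numbers. The sketch below treats the worst case as one full interval plus the duration of a replication run, which is an assumption about timing, not a measured value. The second helper only prints the `pvesr update` command for a new schedule instead of executing it. Both function names are hypothetical.

```shell
# Worst-case data loss in seconds: one full interval plus the duration of
# one replication run (a run still in progress at failure time is lost).
# $1 = interval in minutes, $2 = assumed run duration in seconds
worst_case_rpo_s() {
    interval_min="$1"
    run_s="$2"
    echo $(( interval_min * 60 + run_s ))
}

# Print (not execute) the command that changes a job's schedule.
# $1 = job ID, $2 = new interval in minutes
schedule_cmd() {
    printf 'pvesr update %s --schedule "*/%s"\n' "$1" "$2"
}

# Usage:
#   worst_case_rpo_s 15 40      # 15-minute interval, run takes about 40 s
#   schedule_cmd 100-0 5        # review the output, then run it on the node
```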

Pitfalls and Best Practices

From numerous projects, several themes emerge that are often underrepresented in the documentation:

Separate the Corosync network: If Corosync and replication share a link, fencing events under load are almost inevitable. Use VLAN separation at minimum; physically separate interfaces are better.

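Define HA groups cleanly: A VM with active replication to Node 2 must not start on any other node during HA failover, because no replica exists there. With Node 3 running only the QNet daemon this cannot happen, but it becomes relevant as soon as a further full member joins the cluster. Define HA groups explicitly per VM.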
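On the command line, a restricted HA group for the node pair could be set up roughly as follows. This is a sketch for Proxmox VE 8.x, where HA groups are managed with `ha-manager groupadd`. The group name `prod-pair` and the node names are placeholders, and the wrapper only prints the commands unless RUN=1 is set.

```shell
# Print commands by default; execute them only when RUN=1 is set.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

# Restrict a VM to the two nodes that hold its disks.
# $1 = VM ID, $2 = group name, $3 and $4 = node names
pin_vm_to_pair() {
    run ha-manager groupadd "$2" --nodes "$3,$4" --restricted 1
    run ha-manager add "vm:$1" --group "$2" --state started
}

# Usage (dry run first, then with RUN=1 on a cluster node):
#   pin_vm_to_pair 100 prod-pair node1 node2
#   RUN=1 pin_vm_to_pair 100 prod-pair node1 node2
```

The group only needs to be created once; for further VMs, the `ha-manager add` line alone is sufficient.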

Snapshot hygiene: Replication snapshots are managed automatically. Manual snapshots from the GUI can break replication if they do not exist synchronously on both sides.
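A quick way to spot manual snapshots is to filter out the ones created by the replication framework and compare the remainder on both nodes. The sketch assumes that replication snapshots are named with a `__replicate_` prefix after the `@`; `manual_snapshots` is a hypothetical helper name.

```shell
# Reads snapshot names on stdin (one per line) and prints only those that
# were NOT created by the replication framework.
# Assumption: replication snapshots look like <dataset>@__replicate_<job>_<time>__
manual_snapshots() {
    grep -v '@__replicate_' || true   # no manual snapshots is not an error
}

# Usage on each production node, then compare the two lists:
#   zfs list -H -t snapshot -o name -r rpool-vm | manual_snapshots
```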

Keep backup independent: Replication is not backup. An accidentally deleted file is reliably replicated to the partner within 15 minutes. A separate Proxmox Backup Server with retention of 30 or 60 days is mandatory.

Set up monitoring: Replication jobs can fail — full pools, broken SSH sessions, snapshot conflicts. A simple script that evaluates pvesr status and feeds it into your monitoring solution prevents nasty surprises.
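Such a check might look like the sketch below. It assumes that `pvesr status` prints a header line containing "JobID" and "FailCount" columns and that the columns up to FailCount contain no embedded spaces. The exit codes follow the common 0 = OK, 1 = problem, 3 = unknown convention, and `failed_jobs` is a hypothetical name.

```shell
# Reads `pvesr status` output on stdin and lists jobs whose fail count is
# greater than zero. Exit code: 0 = all fine, 1 = failures, 3 = unknown format.
failed_jobs() {
    awk '
        NR == 1 {
            # Locate the columns by header name instead of fixed positions.
            for (i = 1; i <= NF; i++) {
                if ($i == "JobID")     id_col = i
                if ($i == "FailCount") fail_col = i
            }
            next
        }
        id_col && fail_col && $fail_col + 0 > 0 {
            print $id_col " failcount=" $fail_col
            bad = 1
        }
        END {
            if (!id_col || !fail_col) {
                print "UNKNOWN: header not recognised"
                exit 3
            }
            exit (bad ? 1 : 0)
        }'
}

# Usage on each production node, e.g. from cron or a monitoring agent:
#   pvesr status | failed_jobs || echo "replication needs attention"
```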

Conclusion

A 3-node Proxmox cluster with ZFS replication and QDevice is, for many mid-sized IT environments, the more sensible alternative to Ceph or a classic SAN — both economically and technically. You get real high availability with manageable RPO, very low RTO and drastically reduced hardware costs. The solution scales cleanly up to the range of 50 to 80 VMs per node pair and can later be extended with additional nodes if needed.

DATAZONE has been planning, building and operating Proxmox clusters in exactly this size range for years — including clean network separation, documented replication strategy and integrated backup on PBS. If you are considering migrating from VMware to Proxmox or consolidating your existing cluster, talk to us: we analyze your requirements and propose the right setup. Find out more about our services in Proxmox Consulting or directly via the contact form.
