A highly available Proxmox cluster does not have to be built on Ceph or a classic SAN. For many mid-sized environments with two production hypervisors and a manageable number of virtual machines, a 3-node cluster with local ZFS storage and scheduled replication is the more sensible solution, both economically and operationally. It saves significant hardware costs and noticeably reduces operational complexity.
In this article we show how to set up such a cluster cleanly with Proxmox VE 8.3, which RPO and RTO values are realistic and where the solution reaches its limits compared to Ceph.
Why Local Storage Instead of Ceph or SAN
Ceph is an excellent solution — starting at a certain scale. If you can operate three nodes with six OSDs each, a dedicated 25 GbE backend and sufficient RAM, you get real shared storage with live migration and instant failover. In SMB environments with two hypervisors, 20 to 50 VMs and a backup window overnight, the overhead of this solution is often not justified.
Local ZFS storage in return offers:
- Full NVMe performance without network overhead
- Snapshot and clone technology at block level
- Inline compression and optional deduplication
- Block-level replication between nodes
- Proven recovery tooling
The price is clear: you lose the “zero-RPO guarantee” of synchronous shared storage. A VM that fails between two replication runs loses the changes of that interval.
Hardware Setup for the 3-Node Cluster
A proven setup in mid-market environments uses two production nodes with identical hardware and a small third machine that hosts the QDevice. The third machine does not run any VMs and is not a full cluster member; it only provides the tie-breaking quorum vote.
| Component | Node 1 + 2 (Production) | Node 3 (QDevice) |
|---|---|---|
| CPU | AMD EPYC 4564P, 16 cores | Intel N100 or small Xeon |
| RAM | 256 GB DDR5 ECC | 16 GB |
| Storage | 4x 3.84 TB NVMe (ZFS RAID10) | 1x 500 GB SSD |
| Network | 2x 25 GbE LACP, Corosync separated | 2x 1 GbE |
| Power | Redundant 2x 800 W | Single 250 W |
The productive nodes get a dedicated Corosync network on their own physical interfaces. This is not optional but decisive for cluster stability. If replication traffic delays Corosync packets, fencing events occur and VMs are restarted unnecessarily.
Setting Up Cluster and QDevice
After Proxmox installation on all three nodes, create the cluster on Node 1:
# On Node 1
pvecm create datazone-cluster --link0 10.10.10.1 --link1 10.10.20.1
# On Node 2
pvecm add 10.10.10.1 --link0 10.10.10.2 --link1 10.10.20.2
# Install the QDevice package on both cluster nodes (Node 1 and Node 2)
apt install corosync-qdevice
# On Node 3 (not a cluster member, only QNet daemon)
apt install corosync-qnetd
systemctl enable --now corosync-qnetd
# On Node 1 register the QDevice
pvecm qdevice setup 10.10.10.3
Afterwards, pvecm status should report three votes: one from each production node and one from the QDevice. The cluster therefore keeps quorum even if one production node fails, and the remaining node can take over its VMs.
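The vote check can also be scripted. The following is a minimal sketch, assuming that the text output of pvecm status contains a "Total votes:" line and a "Flags:" line listing "Quorate"; the function name check_quorum is hypothetical and not part of Proxmox.

```shell
# Sketch: evaluate the text output of `pvecm status` read from stdin.
# Assumption: the output contains lines such as
#   "Total votes:      3" and "Flags:            Quorate Qdevice".

# check_quorum WANT_VOTES
# Prints a one-line summary and returns 0 only if the cluster is
# quorate and the total vote count equals WANT_VOTES.
check_quorum() {
    awk -v want="$1" '
        /^[ \t]*Total votes:/ { total = $3 }
        /^[ \t]*Flags:/       { if ($0 ~ /Quorate/) quorate = 1 }
        END {
            if (quorate && total == want) {
                print "OK: quorate, " total " votes"
                exit 0
            }
            print "FAIL: quorate=" quorate+0 ", total votes=" total+0 ", wanted " want
            exit 1
        }'
}
```

On a cluster node it would be called as `pvecm status | check_quorum 3`. Reading from stdin keeps the parsing separate from the Proxmox command, so the logic can be tried against saved output first.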
Configuring ZFS Pools and Replication
On both productive nodes an identically named ZFS pool is created. The name must match exactly, otherwise replication will not work.
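# Identical on Node 1 and Node 2
zpool create -o ashift=12 -O compression=lz4 -O atime=off \
    rpool-vm mirror nvme0n1 nvme1n1 mirror nvme2n1 nvme3n1
# recordsize only affects file-based datasets (e.g. container volumes);
# VM disks are zvols and use volblocksize, set via the storage's block size option
zfs set recordsize=64k rpool-vm
# sync=standard is the ZFS default; set explicitly here for documentation
zfs set sync=standard rpool-vm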
Then add the storage in the Proxmox GUI as type “ZFS” under Datacenter → Storage and enable it for both nodes. For each VM that is to run highly available, configure a job under VM → Replication to the partner node:
# CLI variant for VM 100 every 15 minutes to Node 2
pvesr create-local-job 100-0 node2 --schedule "*/15"
The initial replica transfers the complete ZFS dataset. All subsequent runs only transfer the incremental snapshots since the last successful run. For typical office VMs with a 100 GB volume, 15-minute deltas are usually in the range of a few hundred megabytes.
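The size of a delta can be estimated before it is sent. The sketch below assumes that a dry run of zfs send with parsable output (`zfs send -nP -i ...`) prints a tab-separated line of the form "size" followed by the byte count. The function names and the snapshot names in the usage note are hypothetical.

```shell
# Sketch: estimate the size and transfer time of an incremental ZFS send.
# Assumption: the dry-run output read from stdin contains a line
#   "size<TAB><bytes>".

# delta_mib: prints the estimated stream size in MiB (rounded up).
delta_mib() {
    awk '$1 == "size" { bytes = $2 }
         END {
             if (bytes == "") exit 1      # no size line found
             printf "%d\n", int((bytes + 1048575) / 1048576)
         }'
}

# transfer_seconds MIB RATE
# Rough transfer time in seconds (rounded up) for MIB mebibytes at
# a usable throughput of RATE MiB/s.
transfer_seconds() {
    awk -v mib="$1" -v rate="$2" 'BEGIN {
        if (rate <= 0) exit 1
        printf "%d\n", int((mib + rate - 1) / rate)
    }'
}
```

A call could look like `zfs send -nP -i rpool-vm/vm-100-disk-0@snapA rpool-vm/vm-100-disk-0@snapB | delta_mib`, where snapA and snapB stand for two existing snapshots. Feeding the result into transfer_seconds with a conservative throughput figure shows whether a run comfortably fits into the chosen interval.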
Realistically Assessing RPO, RTO and HA Behavior
This is where the decisive difference to Ceph or shared storage lies. With synchronous shared storage the RPO is zero, because every acknowledged write already resides on storage that the surviving node can access. With ZFS replication, the worst-case RPO is the configured replication interval plus the time a replication run itself takes.
| Scenario | RPO | RTO | Impact |
|---|---|---|---|
| Node failure, last replica 5 min old | ~5 min | 2-4 min | VM starts on partner with old state |
| Node failure, last replica 14 min old | ~14 min | 2-4 min | Worst case at 15-min interval |
| Planned maintenance | 0 | 30-60 s | Live migration possible, no data loss |
| Failure of QDevice (Node 3) | 0 | 0 | Cluster continues, no VM impact |
| Failure of both productive nodes | n/a | Backup restore | Disaster recovery from PBS required |
For many SMB applications an RPO of 15 minutes is perfectly acceptable. For databases with high write volume you should reduce the interval to 5 or even 1 minute — or specifically switch to application-level replication such as PostgreSQL streaming or MSSQL availability groups.
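To compare intervals, the worst case can be worked out as a small calculation. The sketch rests on one assumption: a replica only counts once its run has completed, so a failure just before a run finishes loses one full interval plus the duration of that run. The function name and the run duration of 20 seconds are hypothetical example values.

```shell
# Sketch: worst-case RPO in seconds for a given replication interval.
# Assumption: worst case = interval + duration of one replication run.

# worst_case_rpo_seconds INTERVAL_MIN RUN_SECONDS
worst_case_rpo_seconds() {
    echo $(( $1 * 60 + $2 ))
}

# Compare the intervals discussed above, assuming a run takes 20 s.
for interval in 15 5 1; do
    printf '%2d min interval: worst-case RPO %d s\n' \
        "$interval" "$(worst_case_rpo_seconds "$interval" 20)"
done
```

With short intervals the run duration becomes a relevant share of the total, which is one reason to measure actual run times before committing to a 1-minute schedule.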
Pitfalls and Best Practices
From numerous projects, several themes emerge that are often underrepresented in the documentation:
Separate the Corosync network: If Corosync and replication share the same link, fencing events under load are almost inevitable. Use VLAN separation at a minimum; physically separate interfaces are better.
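Define HA groups cleanly: During HA failover, a VM must only start on a node that holds its replica. In the setup described here that is always the partner node, because Node 3 only hosts the QDevice and is not a cluster member. As soon as further nodes join the cluster, restrict each VM explicitly to its replication pair with an HA group.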
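Restricting VMs to their replication pair can be sketched with ha-manager. The helper below only prints the commands so they can be reviewed first. The group name replica-pair, the node name node1 and the function name are assumptions for illustration; node2 is the target used in the replication example.

```shell
# Sketch: generate ha-manager commands that pin VMs to a replication pair.
# The function only prints commands; nothing is changed on the cluster.

# ha_pair_cmds GROUP NODE_A NODE_B VMID...
ha_pair_cmds() {
    ha_group=$1
    ha_node_a=$2
    ha_node_b=$3
    shift 3
    # A restricted group keeps its resources off all other nodes.
    echo "ha-manager groupadd $ha_group --nodes $ha_node_a,$ha_node_b --restricted 1"
    for ha_vmid in "$@"; do
        echo "ha-manager add vm:$ha_vmid --group $ha_group --state started"
    done
}
```

Calling `ha_pair_cmds replica-pair node1 node2 100 101` prints one groupadd command and one add command per VM. After review, the output can be piped to `sh` on a cluster node.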
Snapshot hygiene: Replication snapshots are managed automatically. Manual snapshots from the GUI can break replication if they do not exist synchronously on both sides.
Keep backup independent: Replication is not backup. An accidentally deleted file is reliably replicated to the partner within 15 minutes. A separate Proxmox Backup Server with retention of 30 or 60 days is mandatory.
Set up monitoring: Replication jobs can fail — full pools, broken SSH sessions, snapshot conflicts. A simple script that evaluates pvesr status and feeds it into your monitoring solution prevents nasty surprises.
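Such a check could look like the following sketch. It assumes that pvesr status prints a header line followed by one line per job with the columns JobID, Enabled, Target, LastSync, NextSync, Duration, FailCount and State. The function name and the Nagios-style exit codes are choices made for this example.

```shell
# Sketch: evaluate `pvesr status` output read from stdin.
# Assumed columns:
#   JobID Enabled Target LastSync NextSync Duration FailCount State
# Exit codes follow the common plugin convention:
#   0 = OK, 2 = CRITICAL, 3 = UNKNOWN.
check_replication() {
    awk '
        NR == 1 && $1 == "JobID" { next }   # skip the header line
        NF >= 8 {
            jobs++
            # Column 7 is the assumed FailCount column.
            if ($7 + 0 > 0) { failed++; list = list " " $1 }
        }
        END {
            if (jobs == 0) {
                print "UNKNOWN: no replication jobs found"
                exit 3
            }
            if (failed > 0) {
                print "CRITICAL: " failed " of " jobs " jobs failing:" list
                exit 2
            }
            print "OK: " jobs " job(s), none failing"
            exit 0
        }'
}
```

On a production node the call would be `pvesr status | check_replication`, run from cron or as a plugin of the monitoring system. The column positions should be checked against the actual output of the installed version before relying on the result.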
Conclusion
A 3-node Proxmox cluster with ZFS replication and QDevice is, for many mid-sized IT environments, the more sensible alternative to Ceph or a classic SAN — both economically and technically. You get real high availability with manageable RPO, very low RTO and drastically reduced hardware costs. The solution scales cleanly up to the range of 50 to 80 VMs per node pair and can later be extended with additional nodes if needed.
DATAZONE has been planning, building and operating Proxmox clusters in exactly this size range for years — including clean network separation, documented replication strategy and integrated backup on PBS. If you are considering migrating from VMware to Proxmox or consolidating your existing cluster, talk to us: we analyze your requirements and propose the right setup. Find out more about our services in Proxmox Consulting or directly via the contact form.