For years, GPU passthrough was either an enthusiast topic or something reserved for hyperscalers and research clusters. In 2026 the picture has shifted significantly: AI inference for internal chat assistants, real-time transcoding for Frigate NVR setups, and VDI pools with accelerated desktops have arrived in the mid-market. The question is no longer “if” but “with which card”, and how far you can go without a dedicated DGX server.
This article shows which GPUs are worth deploying in 2026 on typical mid-range platforms such as Dell PowerEdge R760, HPE ProLiant DL380 Gen11 or Supermicro SYS-741GE, how a clean vfio setup works on Proxmox VE 8.4, and at what utilization level the card pays off compared to AWS or Hetzner GPU rental.
GPU classes 2026: What fits in a tower or 2U rack?
The interesting question for SMBs is not “which H200 configuration”, but what fits thermally and electrically into an existing server. Single-slot cards with passive cooling and a TDP of up to 75 W are the sweet spot: they draw all their power from the PCIe slot, need no auxiliary power connector, and the server fans handle them comfortably.
| Card | VRAM | TDP | Slot | Street price 2026 | Primary use case |
|---|---|---|---|---|---|
| NVIDIA Tesla T4 (used) | 16 GB GDDR6 | 70 W | 1-slot passive | 400-700 EUR | Transcoding, light inference |
| NVIDIA L4 | 24 GB GDDR6 | 72 W | 1-slot passive | 2,400-2,900 EUR | LLM inference up to 13B, transcoding |
| NVIDIA L40S | 48 GB GDDR6 | 350 W | 2-slot passive | 8,500-10,500 EUR | LLM up to 70B, vGPU, training |
| AMD Instinct MI210 | 64 GB HBM2e | 300 W | 2-slot passive | 6,800-8,200 EUR | HPC, ROCm inference |
| NVIDIA RTX 6000 Ada | 48 GB GDDR6 | 300 W | 2-slot active | 7,200-8,000 EUR | Workstation VDI, CAD |
The T4 remains the insider tip in 2026 for Frigate, Plex and Whisper transcription. It regularly shows up on the used market from decommissioned data centres and runs with the current NVIDIA datacenter driver 565.x without any tricks. The L4 is its direct successor and a sensible choice if you want to run quantised LLMs like Llama 3.1 8B or Mistral Small 3.
IOMMU basics and BIOS preparation
Before a card can be passed through, the system must meet the prerequisites. That means: VT-d or AMD-Vi enabled in BIOS, “Above 4G Decoding” and “Resizable BAR” on, and SR-IOV allowed if you plan vGPU later.
On the Proxmox host, first check that IOMMU initialises cleanly:
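# check that the IOMMU is active (Intel reports DMAR, AMD reports AMD-Vi)
dmesg | grep -e DMAR -e IOMMU -e AMD-Vi
# expected output on Intel: DMAR: IOMMU enabled
# list all PCI devices, sorted by IOMMU group
for d in /sys/kernel/iommu_groups/*/devices/*; do
    n=${d#*/iommu_groups/*}; n=${n%%/*}   # extract the group number from the path
    printf 'IOMMU Group %s: ' "$n"
    lspci -nns "${d##*/}"                 # describe the device at this PCI address
done | sort -k3 -n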
What matters is that your target GPU sits in its own group — or that you can pass through all devices of the group together. In a recent project we saw an L4 on a Dell R660 where the audio function and GPU were cleanly separated. On consumer boards this is often not the case and requires the ACS override patch, which we advise against in production environments.
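The isolation check described above can also be scripted. The following is a minimal sketch, assuming the standard Linux sysfs layout. The function names, the optional sysfs-root argument and the PCI address `0000:01:00.0` are illustrative assumptions, not part of the original setup. It treats a group as isolated when all members differ only in their PCI function number, which is a heuristic for "same physical card".

```shell
#!/bin/sh
# Print the PCI addresses of all devices sharing an IOMMU group with a device.
# $1: full PCI address, e.g. 0000:01:00.0 (hypothetical example address)
# $2: sysfs root, defaults to /sys (only there to make the function testable)
iommu_group_members() {
    dir="${2:-/sys}/bus/pci/devices/$1/iommu_group/devices"
    if [ ! -d "$dir" ]; then
        echo "no IOMMU group found for $1 (IOMMU disabled or wrong address?)" >&2
        return 1
    fi
    ls "$dir"
}

# Succeed only if every group member sits on the same domain:bus:device,
# i.e. differs from the target only in the function number.
group_is_isolated() {
    members=$(iommu_group_members "$1" "${2:-/sys}") || return 1
    slot=${1%.*}          # strip the function: 0000:01:00.0 -> 0000:01:00
    for m in $members; do
        [ "${m%.*}" = "$slot" ] || return 1
    done
    return 0
}

# Example call on a real host (take the address from `lspci -D`):
#   group_is_isolated 0000:01:00.0 && echo "group contains only this card"
```

If the check fails, `iommu_group_members` lists the other devices that would have to be passed through together with the GPU.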
Clean vfio-pci binding
To prevent the host from claiming the card itself, bind it to vfio-pci. In /etc/modprobe.d/vfio.conf:
options vfio-pci ids=10de:27b8,10de:22bd disable_vga=1
softdep nvidia pre: vfio-pci
softdep nouveau pre: vfio-pci
You obtain the IDs via lspci -nn | grep -i nvidia. Then run update-initramfs -u -k all and reboot. After the reboot, verify with lspci -nnk -d 10de:27b8 that “Kernel driver in use” really reads vfio-pci.
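The verification step can be automated, for example as part of a post-reboot check. This is a sketch only: the function names are made up for illustration, and the parsing relies on the "Kernel driver in use:" line that `lspci -k` prints for bound devices.

```shell
#!/bin/sh
# Read `lspci -nnk` output on stdin and print the bound driver of each device
# (one line per device that has a "Kernel driver in use" entry).
driver_in_use() {
    sed -n 's/^[[:space:]]*Kernel driver in use:[[:space:]]*//p'
}

# Succeed only if at least one driver line exists and all of them read vfio-pci.
bound_to_vfio() {
    drivers=$(driver_in_use)
    [ -n "$drivers" ] || return 1
    for drv in $drivers; do
        [ "$drv" = "vfio-pci" ] || return 1
    done
    return 0
}

# Example call on the host, filtering by the NVIDIA vendor ID 10de:
#   lspci -nnk -d 10de: | bound_to_vfio && echo "all NVIDIA functions use vfio-pci"
```

A device without any bound driver also counts as a failure here, since it suggests that vfio-pci did not claim the card at boot.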
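Then add the GPU to the VM via the Proxmox web UI as a PCI device with “PCI-Express” enabled, which requires the q35 machine type. Tick “Primary GPU” only if the card is meant to drive the VM's display; for headless compute cards such as the T4 or L4, leave it off so the virtual console stays available. The old workaround args: -cpu host,kvm=off in the VM config file is no longer required: NVIDIA lifted the virtualisation check for GeForce cards with driver 465, datacenter cards were never affected, and driver 565.x accepts the KVM environment without complaint.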
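The same settings can be applied from the command line with `qm set`. The sketch below only prints the commands so they can be reviewed first. The function name, the VM ID 100 and the address `0000:01:00` are hypothetical placeholders. Omitting the function number passes all functions of the card, and the third argument controls whether `x-vga=1` (the “Primary GPU” checkbox) is added.

```shell
#!/bin/sh
# Print (not execute) the qm commands that attach a PCI device to a VM.
# $1: VM ID (hypothetical, e.g. 100)
# $2: PCI address without function number, e.g. 0000:01:00
# $3: "primary" adds x-vga=1 (Primary GPU); anything else leaves it off
passthrough_commands() {
    opts="$2,pcie=1"
    if [ "${3:-}" = "primary" ]; then
        opts="$opts,x-vga=1"
    fi
    echo "qm set $1 --machine q35"        # pcie=1 requires the q35 machine type
    echo "qm set $1 --hostpci0 $opts"
}

# Dry run with Primary GPU enabled; drop the third argument to leave it off.
passthrough_commands 100 0000:01:00 primary
```

Piping the output to `sh` on the Proxmox host would apply the settings; reviewing the printed lines first avoids changing the wrong VM.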
Use case 1: AI inference with Ollama and vLLM
An NVIDIA L4 with 24 GB VRAM handles a surprising amount of load in 2026. In a recent customer environment we measured the following:
- Llama 3.1 8B (Q4_K_M) via Ollama: 78 tokens/s, 9 GB VRAM
- Mistral Small 3 24B (Q4): 22 tokens/s, 17 GB VRAM
- Qwen2.5 14B (FP8) via vLLM: 46 tokens/s, 21 GB VRAM at batch=4
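How the figures above were collected is not stated here. One way to obtain comparable numbers with Ollama is to use the `eval_count` and `eval_duration` fields (the latter in nanoseconds) that its generate API returns with each response. The sketch below does the arithmetic; the function name and the sample values are illustrative assumptions.

```shell
#!/bin/sh
# Compute generation throughput from the two counters Ollama reports:
# $1: eval_count (generated tokens)   $2: eval_duration (nanoseconds)
tokens_per_second() {
    awk -v n="$1" -v ns="$2" 'BEGIN {
        if (ns <= 0) exit 1               # guard against division by zero
        printf "%.1f\n", n / (ns / 1e9)   # tokens divided by seconds
    }'
}

# Sample values, made up for illustration: 256 tokens generated in 4 s
tokens_per_second 256 4000000000    # prints 64.0

# On a live system the counters could be fetched like this (curl and jq assumed):
#   curl -s http://localhost:11434/api/generate \
#        -d '{"model":"llama3.1:8b","prompt":"Hello","stream":false}' |
#     jq -r '"\(.eval_count) \(.eval_duration)"'
```

Averaging several runs with realistic prompt lengths gives a more stable figure than a single request.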
For an internal coding assistant or a RAG solution with 5-15 concurrent users this is sufficient in practice. Anyone needing 70B models at acceptable speed or image generation with Flux ends up at the L40S or MI210.
The AMD Instinct MI210 is attractively priced and has matured with ROCm 6.3, but still has the drawback that many AI tools remain NVIDIA-centric. We recommend it only with a clear HPC profile or when the customer is strongly committed to an open-source stack.
Use case 2: Video transcoding and VDI
Frigate, Plex, Jellyfin and Immich benefit massively from the GPU's hardware video engines (NVENC for encoding, NVDEC for decoding). A single T4 handles around 20 parallel 1080p H.264 streams or about 12 H.265 streams. For mid-sized surveillance setups with 16-32 cameras that is usually more than enough.
For VDI with accelerated desktops, the L4 paired with NVIDIA vGPU 17.x is the clean solution. One card can be split into several equally sized vGPUs, for example 8 x 3 GB for office users or 2 x 12 GB for CAD users. Be aware of licensing: NVIDIA vWS lands at roughly 350 EUR per user per year.
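The sizing arithmetic behind these examples is simple enough to script when comparing profile sizes. This is a rough sketch that assumes all vGPUs on a card use the same profile size; the function names are made up, and the licence price is the approximate figure from the text.

```shell
#!/bin/sh
# How many equally sized vGPU instances fit on one card.
# $1: card VRAM in GB   $2: profile size in GB
vgpu_instances() {
    echo $(( $1 / $2 ))
}

# Yearly licence cost in EUR.
# $1: number of users   $2: price per user per year
licence_cost_per_year() {
    echo $(( $1 * $2 ))
}

vgpu_instances 24 3             # L4 with 3 GB profiles  -> prints 8
vgpu_instances 24 12            # L4 with 12 GB profiles -> prints 2
licence_cost_per_year 8 350     # eight vWS users at roughly 350 EUR -> prints 2800
```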
Cost: own GPU vs. cloud rental
A fully integrated NVIDIA L4 costs around 3,000 EUR in 2026. A comparable g6.xlarge instance on AWS runs at about 0.90 USD per hour; Hetzner offers GPU servers with RTX 4000 SFF from roughly 200 EUR per month.
The break-even calculation is surprisingly clear: if you use a GPU productively for more than 12 hours per day, the hardware pays for itself in about 14 months, including electricity (0.07 kW * 24 h * 365 d * 0.28 EUR/kWh = roughly 172 EUR per year). For pure test workloads or sporadic load, cloud rental stays attractive. For 24/7 inference, transcoding or an internal RAG system, the in-house investment wins clearly.
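To rerun the comparison with your own prices, the calculation can be written down as a small script. This is a simplified sketch, not the exact model behind the figure above: the function names are made up, USD and EUR are treated as equal, electricity is charged around the clock, and integration effort or the host server's share of the cost would have to be added to the hardware price by hand.

```shell
#!/bin/sh
# Yearly electricity cost in EUR for a card running 24/7.
# $1: average power draw in watts   $2: price in EUR per kWh
electricity_per_year() {
    awk -v w="$1" -v p="$2" 'BEGIN { printf "%.0f\n", w / 1000 * 24 * 365 * p }'
}

# Months until bought hardware is cheaper than hourly cloud rental.
# $1: hardware cost (EUR)     $2: cloud price per hour (EUR)
# $3: hours of use per day    $4: electricity cost per year (EUR)
break_even_months() {
    awk -v hw="$1" -v rate="$2" -v h="$3" -v el="$4" 'BEGIN {
        # cloud bill avoided per month, minus the monthly power bill
        saving = rate * h * 365 / 12 - el / 12
        if (saving <= 0) { print "never"; exit }
        printf "%.1f\n", hw / saving
    }'
}

electricity_per_year 70 0.28         # figures from the text -> prints 172
break_even_months 3000 0.90 12 172   # months for the listed prices alone
```

Because of the omissions listed above, the second call returns a shorter period than the roughly 14 months quoted in the text when fed only the prices from this section. Raising the first argument to cover integration and host costs moves the result accordingly.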
There is also the data protection angle: a local L4 inside your virtualization infrastructure processes customer data without it leaving the premises — an argument that should not be underestimated in GDPR-sensitive industries.
Conclusion: Which card for which need?
For classic mid-sized businesses, the 2026 picture is reasonably clear: if you want to combine transcoding and light AI tasks, buy a used T4 or a new L4 and you are set for 3,000 EUR or less. If you plan serious LLM inference for multiple employees or a VDI pool, step up to the L40S. The MI210 remains a niche recommendation for Linux-affine HPC environments.
What matters across all variants is the full stack: a GPU is only as good as the storage underneath. For AI workloads we consistently recommend NVMe-based TrueNAS pools or local ZFS mirrors on the Proxmox node.
DATAZONE supports you in selecting, procuring and integrating the right GPU hardware — from analysing the IOMMU groups of your existing server, through vfio configuration, to a productive AI or VDI setup. Reach out via Contact if you want to bring GPU acceleration into your Proxmox environment — whether as a pilot with a single T4 or as a full build with L40S and vGPU licences.