AMD Strix Halo RDMA Cluster Setup Guide
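This guide walks through building a two-node AMD Strix Halo cluster connected via Intel E810 NICs (RoCE v2) for distributed vLLM inference. Tensor Parallelism splits each layer's weights across the nodes, so the nodes must exchange partial results after every layer and inter-node latency directly limits throughput. The target is sub-10µs latency, and RoCE v2 delivers about 5µs.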
TL;DR (Quick Start)
On both nodes:
- Install/Update Fedora 43 and E810 NICs. Check firmware with ethtool -i.
- BIOS: Set iGPU to 512MB. Kernel params: iommu=pt pci=realloc pcie_aspm=off amdgpu.gttsize=126976 ttm.pages_limit=32505856.
- Configure passwordless SSH.
- Assign static IPs (192.168.100.1 & .2), MTU 9000, trust interface in firewall.
- Run ./refresh_toolbox.sh (installs container with RDMA support and custom librccl.so patch).
- Run start-vllm-cluster, select "2. Start Ray Cluster", then "4. Launch VLLM Serve". Export HF_TOKEN for gated models.
Concepts & Architecture
- vLLM: High-performance inference engine. Tensor Parallelism (TP) splits each layer's weights across GPUs, so every GPU computes part of every layer.
- Ray: Orchestrates cluster, manages workers.
- RCCL: AMD's collective communication library. Handles data plane: fast tensor sync between GPUs. TP=2 exchanges partial results after every layer, thousands of times per second.
- RoCE v2: RDMA over Converged Ethernet. Writes data directly from one node's memory to another, bypassing CPU/OS kernel.
- Without RDMA: ~70-100µs latency (TCP/IP).
- With RDMA: ~5µs latency.
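To see why the latency gap matters, here is a rough back-of-envelope sketch. It assumes a 32-layer model (Llama 3.1 8B) and two all-reduce operations per layer, which is common for tensor-parallel transformer layers but is not stated in this guide. It counts only per-message latency and ignores transfer time and compute, so treat the numbers as an illustration rather than a measurement.

```shell
# Rough per-token communication latency for TP=2 (assumptions, not measurements).
LAYERS=32               # Llama 3.1 8B has 32 transformer layers
ALLREDUCE_PER_LAYER=2   # assumed: one after attention, one after the MLP

# $1 = per-message latency in microseconds; prints total latency per token in microseconds.
comm_latency_us() {
    echo $(( LAYERS * ALLREDUCE_PER_LAYER * $1 ))
}

echo "RDMA (5 us):  $(comm_latency_us 5) us per token"
echo "TCP  (70 us): $(comm_latency_us 70) us per token"
```

Under these assumptions, latency alone costs roughly 0.3 ms per generated token with RDMA and roughly 4.5 ms over TCP/IP.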
Hardware Prerequisites
- Nodes: 2x Framework Desktop Mainboards with AMD Ryzen AI MAX+ "Strix Halo", 128GB Unified Memory.
- Network Cards: Intel Ethernet Controller E810-CQDA1 (100GbE QSFP28).
- Connection: Direct Attach Copper (DAC) cable (e.g., QSFPTEK 100G QSFP28 DAC). No switch needed for 2 nodes.
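- PCIe Note: The Framework motherboard's PCIe slot is physically x4, so the x16 NIC needs a riser (e.g., CY PCI-E Express 4x to 16x Extender). The riser does not degrade performance: the measured figures remain ~50Gbps and ~5µs latency. Bandwidth falls short of the NIC's 100GbE rating, most likely because of the x4 link.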
Host Configuration (Fedora)
Tested on Fedora 43 with kernels 6.18.5 and 6.18.6.
4.1 Install Packages
sudo dnf install rdma-core libibverbs-utils perftest
- rdma-core: Userspace RDMA components.
- libibverbs-utils: Query RDMA devices (ibv_devinfo).
- perftest: Benchmarks (ib_write_bw, ib_send_lat).
4.2 Check Native Firmware
ethtool -i enp194s0np0
Example output:
firmware-version: 4.91 0x800214b5 1.3909.0
Update if older using Intel NVM Update Tool.
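The firmware check can be scripted. The sketch below is a minimal example, not part of the guide's tooling: the helper names (fw_version, version_at_least, check_firmware) are hypothetical, and 4.91 is simply the example version shown above, used here as an assumed baseline. It relies on sort -V from GNU coreutils.

```shell
# Extract the leading firmware version (e.g. "4.91") from `ethtool -i` output on stdin.
fw_version() {
    awk -F': ' '/^firmware-version:/ { split($2, parts, " "); print parts[1] }'
}

# Succeeds if version $1 is at least version $2 (version-aware comparison via sort -V).
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

IFACE=enp194s0np0   # interface name used in this guide
MIN_FW=4.91         # assumed baseline: the example version shown above

check_firmware() {
    current=$(ethtool -i "$IFACE" | fw_version)
    if version_at_least "$current" "$MIN_FW"; then
        echo "firmware $current is OK"
    else
        echo "firmware $current is older than $MIN_FW; consider updating"
    fi
}
```

Run check_firmware on each node after the NIC is installed.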
4.3 Network Configuration
Node 1 (Head - 192.168.100.1):
sudo ip link set enp194s0np0 up
sudo ip addr add 192.168.100.1/30 dev enp194s0np0
sudo nmcli connection modify "rdma0" ethernet.mtu 9000
sudo nmcli connection up "rdma0"
Node 2 (Worker - 192.168.100.2):
sudo ip link set enp194s0np0 up
sudo ip addr add 192.168.100.2/30 dev enp194s0np0
sudo nmcli connection modify "rdma0" ethernet.mtu 9000
sudo nmcli connection up "rdma0"
Verify link: rdma link should show state ACTIVE physical_state LINK_UP.
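The link check, plus a jumbo-frame test, can be sketched as follows. The function names are hypothetical. The ping test sends a non-fragmentable packet sized to fill exactly one 9000-byte frame (9000 minus the 20-byte IPv4 header and the 8-byte ICMP header), so it should only succeed if MTU 9000 is in effect on both ends.

```shell
# Succeeds if `rdma link` output on stdin contains an active, link-up port.
rdma_link_ok() {
    grep -q 'state ACTIVE physical_state LINK_UP'
}

# Largest ICMP payload for a given MTU: MTU minus IPv4 (20) and ICMP (8) headers.
ping_payload() {
    echo $(( $1 - 28 ))
}

# $1 = peer address, e.g. 192.168.100.2 when run on the head node.
verify_link() {
    peer="$1"
    rdma link | rdma_link_ok || { echo "RDMA link is not ACTIVE/LINK_UP" >&2; return 1; }
    # -M do forbids fragmentation, so oversized packets fail instead of being split.
    ping -M do -c 3 -s "$(ping_payload 9000)" "$peer"
}
```

For example, run verify_link 192.168.100.2 on Node 1 and verify_link 192.168.100.1 on Node 2.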
4.4 BIOS & Kernel Configuration
BIOS: Set iGPU Memory Allocation to 512MB.
Kernel params in /etc/default/grub:
iommu=pt pci=realloc pcie_aspm=off amdgpu.gttsize=126976 ttm.pages_limit=32505856
- iommu=pt: IOMMU pass-through for NIC and iGPU.
- pci=realloc: Reallocate PCI BARs for large address spaces.
- pcie_aspm=off: Disable power management to avoid latency spikes.
- amdgpu.gttsize=126976: GPU GTT size ~124GiB.
- ttm.pages_limit=32505856: TTM pages limit matching GTT.
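The two memory values are related by simple arithmetic, which is useful if you want a different GTT size. The sketch below assumes the standard 4 KiB page size; amdgpu.gttsize is given in MiB and ttm.pages_limit in pages.

```shell
GTT_MIB=126976     # amdgpu.gttsize, in MiB
PAGE_BYTES=4096    # assumed 4 KiB page size

echo "GTT size:        $(( GTT_MIB / 1024 )) GiB"
echo "ttm.pages_limit: $(( GTT_MIB * 1024 * 1024 / PAGE_BYTES ))"
```

This prints 124 GiB and 32505856, matching the parameters above.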
Apply: sudo grub2-mkconfig -o /boot/grub2/grub.cfg && sudo reboot.
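After the reboot it is worth confirming that the parameters actually reached the kernel. A minimal sketch, with a hypothetical helper named missing_params, that compares the required list against a kernel command line:

```shell
REQUIRED="iommu=pt pci=realloc pcie_aspm=off amdgpu.gttsize=126976 ttm.pages_limit=32505856"

# Print every required parameter that is absent from the command line passed as $1.
missing_params() {
    for p in $REQUIRED; do
        case " $1 " in
            *" $p "*) ;;          # present as a whole word
            *) echo "$p" ;;       # absent: report it
        esac
    done
}

# After reboot, run:
#   missing_params "$(cat /proc/cmdline)"
```

No output means every parameter is active; any printed line names a parameter that did not take effect.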
4.5 Firewall Rules
sudo firewall-cmd --permanent --zone=trusted --add-interface=enp194s0np0
sudo firewall-cmd --reload
Toolbox Installation & Network Verification
5.1 Passwordless SSH
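Configure SSH keys between the nodes in both directions. Test from each node by running date on the other, for example ssh 192.168.100.2 date from the head node; it should print the date without prompting for a password.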
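One common way to set this up is sketched below. The function name is hypothetical, and the sketch assumes an ed25519 key with an empty passphrase and the same user name on both nodes; adapt it to your own key policy.

```shell
# $1 = address of the other node, e.g. 192.168.100.2 when run on the head node.
setup_passwordless_ssh() {
    peer="$1"
    key="$HOME/.ssh/id_ed25519"
    [ -n "$peer" ] || { echo "usage: setup_passwordless_ssh <peer-ip>" >&2; return 2; }
    # Create a key only if none exists yet (empty passphrase is an assumption).
    [ -f "$key" ] || ssh-keygen -t ed25519 -N "" -f "$key"
    # Install the public key on the peer; asks for the password one last time.
    ssh-copy-id -i "$key.pub" "$peer"
    # BatchMode fails instead of prompting, so success proves key login works.
    ssh -o BatchMode=yes "$peer" date
}
```

Run it once on each node, pointing at the other node's address.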
5.2 Installation
Run ./refresh_toolbox.sh on both nodes. This pulls kyuz0/vllm-therock-gfx1151 image, detects /dev/infiniband, and creates a toolbox with:
- --device /dev/dri /dev/kfd (iGPU/ROCm)
- --device /dev/infiniband --group-add rdma
- --ulimit memlock=-1 (memory pinning for DMA)
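Once the toolbox exists, a quick check from inside it can confirm that the devices listed above were passed through. This is a hedged sketch with a hypothetical check_devices helper; it does not inspect what refresh_toolbox.sh actually did, only what is visible in the container.

```shell
# Report each given device path as ok or MISSING; succeeds only if all exist.
check_devices() {
    status=0
    for dev in "$@"; do
        if [ -e "$dev" ]; then
            echo "ok       $dev"
        else
            echo "MISSING  $dev"
            status=1
        fi
    done
    return $status
}

# Inside the toolbox:
#   check_devices /dev/dri /dev/kfd /dev/infiniband
#   ulimit -l      # should print "unlimited" if memlock=-1 took effect
#   ibv_devinfo    # lists RDMA devices, if libibverbs-utils is present in the container
```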
5.3 Verify RDMA Connection
Inside toolbox on head node:
/opt/compare_eth_vs_rdma.sh
Expected results:
Path                  Latency    Bandwidth
------------------------------------------------
Ethernet (1G LAN)     0.074 ms   0.94 Gbps
Ethernet (RoCE NIC)   0.068 ms   55.70 Gbps
RDMA (RoCE)           5.23 us    50.64 Gbps
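If the comparison script is unavailable, the RDMA row can be approximated by hand with the perftest tools installed earlier. The sketch below is an assumption about a reasonable manual test, not a description of what compare_eth_vs_rdma.sh runs. The device name rocep194s0 and GID index 1 are placeholders: list real device names with ibv_devices and inspect the GID table with ibv_devinfo -v.

```shell
RDMA_DEV=${RDMA_DEV:-rocep194s0}   # hypothetical device name; check with ibv_devices
GID_INDEX=${GID_INDEX:-1}          # assumed RoCE v2 GID index; check with ibv_devinfo -v

# Usage: bw_test server            (run first, on the head node)
#        bw_test client <head-ip>  (run second, on the worker)
bw_test() {
    case "$1" in
        server) ib_write_bw -d "$RDMA_DEV" -x "$GID_INDEX" --report_gbits ;;
        client) ib_write_bw -d "$RDMA_DEV" -x "$GID_INDEX" --report_gbits "$2" ;;
        *) echo "usage: bw_test server | client <head-ip>" >&2; return 2 ;;
    esac
}
```

For latency, ib_send_lat accepts the same -d and -x options and the same server/client pattern.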
Running the Cluster
6.1 Setup & Verify
Enter toolbox, run start-vllm-cluster.
- Option 1: Configure IPs (Head: 192.168.100.1, Worker: 192.168.100.2).
- Option 2: Start Ray Cluster. Select "Head" on Node 1, "Worker" on Node 2.
- Option 3: Check status (expect 2 nodes, 2.0 GPU).
6.2 Launching vLLM
- Select "4. Launch VLLM Serve".
- Choose model (e.g., Meta-Llama-3.1-8B-Instruct).
- Set Tensor Parallelism = 2.
- Enable "Force Eager Mode" (CUDA Graphs can deadlock distributed APU clusters).
- Launch.
Gotchas:
- First run: each node downloads weights independently.
- Gated models: export HF_TOKEN before running script.
Troubleshooting
- vLLM hangs: Enable "Force Eager Mode".
- Link issues: Update Intel E810 firmware.
Next Steps
Try the cluster with a model like Llama 3.1 8B at TP=2. Benchmark tokens per second vs single-node. For production, consider adding more nodes or upgrading to 200GbE.


