Perplexity’s TransferEngine Enables Trillion-Parameter LLMs on Heterogeneous GPU Clusters

Perplexity open-sourced TransferEngine and pplx garden, a portable RDMA layer that enables trillion-parameter LLMs to run on heterogeneous GPU clusters at up to 400 Gbps without vendor lock-in.

Perplexity Research has open-sourced TransferEngine and the pplx garden toolkit to let teams run trillion-parameter language models on existing mixed GPU clusters without buying new GB200-class hardware or locking into a single cloud vendor.

Why network fabric, not FLOPs, is the constraint

Large Mixture of Experts (MoE) models like DeepSeek V3 (671B) and Kimi K2 (1T) no longer fit on a single 8-GPU server. They must span multiple nodes, making the inter-GPU network fabric the primary bottleneck rather than raw FLOPs. The hardware landscape is fragmented: NVIDIA ConnectX 7 typically provides reliable, in-order transport, while AWS Elastic Fabric Adapter (EFA) presents a reliable but out-of-order transport. To reach 400 Gbps a single GPU may need multiple NICs (for example, 4×100 Gbps or 2×200 Gbps on EFA). Existing libraries often optimize for one vendor and perform poorly or lack support on the other, leaving a gap for cross-provider LLM inference.

What TransferEngine does

TransferEngine implements a portable RDMA layer that targets the guarantees common to different NICs: it assumes a reliable RDMA transport but does not assume message ordering. On top of that minimal assumption it exposes one-sided WriteImm (write-with-immediate) operations and an ImmCounter primitive for completion notification.
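
To make that completion model concrete, here is a minimal conceptual sketch, not the pplx garden implementation: a counter that treats a transfer as complete once every expected write-with-immediate has landed, regardless of the order in which the fabric delivered the payloads. Only the name ImmCounter comes from the article; the structure and methods below are assumptions.

```rust
use std::collections::HashMap;

/// Conceptual sketch only: completion is detected by counting immediates,
/// never by assuming in-order delivery, so reordering across NICs is harmless.
pub struct ImmCounter {
    expected: HashMap<u32, u64>, // immediate id -> writes expected
    received: HashMap<u32, u64>, // immediate id -> writes observed so far
}

impl ImmCounter {
    pub fn new() -> Self {
        ImmCounter { expected: HashMap::new(), received: HashMap::new() }
    }

    /// Register how many WriteImm operations a peer will issue under `imm`.
    pub fn expect(&mut self, imm: u32, count: u64) {
        *self.expected.entry(imm).or_insert(0) += count;
    }

    /// Called for every completed write-with-immediate, in whatever order
    /// the fabric happens to deliver them.
    pub fn on_write_imm(&mut self, imm: u32) {
        *self.received.entry(imm).or_insert(0) += 1;
    }

    /// The transfer tagged `imm` is complete once every expected write landed.
    pub fn is_complete(&self, imm: u32) -> bool {
        match self.expected.get(&imm) {
            Some(&n) => self.received.get(&imm).copied().unwrap_or(0) >= n,
            None => false,
        }
    }
}
```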

API and design highlights

  • Minimal Rust API: two-sided Send and Recv for control, plus three one-sided operations (submit_single_write, submit_paged_writes, and submit_scatter) and a submit_barrier primitive that synchronizes groups of peers; a sketch of this surface follows the list. NetAddr and MrDesc describe peers and memory regions, and alloc_uvm_watcher creates a device-side watcher for CPU–GPU synchronization in advanced pipelines.
  • Worker and DomainGroup model: TransferEngine spawns one worker thread per GPU and builds a DomainGroup per GPU that coordinates between 1 and 4 RDMA NICs. The sharding logic is NIC-aware and can split transfers across adapters to aggregate bandwidth.
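
The article names this API surface but not its exact Rust signatures. The trait below is only an illustrative mirror of that surface under assumed types and argument lists, not the real TransferEngine interface; NetAddr, MrDesc, and the submit_* names come from the article, everything else is hypothetical.

```rust
/// Peer address advertised out of band (name from the article; layout assumed).
pub struct NetAddr(pub [u8; 16]);

/// Registered memory region descriptor (name from the article; fields assumed).
pub struct MrDesc { pub addr: u64, pub len: usize, pub rkey: u32 }

/// Non-contiguous page list for KvCache-style transfers (hypothetical helper).
pub struct PageList { pub pages: Vec<MrDesc> }

/// Illustrative mirror of the described surface; NOT the real TransferEngine API.
pub trait PortableRdma {
    /// Two-sided control-plane messaging.
    fn send(&self, peer: &NetAddr, bytes: &[u8]);
    fn recv(&self) -> (NetAddr, Vec<u8>);

    /// One-sided writes with immediate data; the remote side observes
    /// completion by counting immediates (ImmCounter), not by ordering.
    fn submit_single_write(&self, peer: &NetAddr, src: &MrDesc, dst: &MrDesc, imm: u32);
    fn submit_paged_writes(&self, peer: &NetAddr, pages: &PageList, dst: &MrDesc, imm: u32);
    fn submit_scatter(&self, peers: &[NetAddr], src: &MrDesc, dsts: &[MrDesc], imm: u32);

    /// Synchronize a group of peers.
    fn submit_barrier(&self, peers: &[NetAddr]);
}
```

In the described design, a per-GPU DomainGroup sits behind a surface like this and shards each submission across its one to four NICs to aggregate bandwidth.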

Performance and cross-vendor portability

Perplexity reports peak throughput of 400 Gbps both on NVIDIA ConnectX 7 and on AWS EFA by aggregating multiple adapters where needed. This demonstrates that the abstraction can match single-vendor performance while remaining portable across providers and NIC families.

pplx garden package and system requirements

TransferEngine is included in the pplx garden repository (MIT license). The repo layout includes:

  • fabric-lib: the RDMA library
  • p2p-all-to-all: the MoE all-to-all kernel
  • python-ext: the Rust Python extension
  • python/pplx_garden: the Python package code

Recommended system software and hardware:

  • Linux kernel 5.12+ for DMA-BUF support
  • CUDA 12.8+
  • libfabric, libibverbs, GDRCopy
  • RDMA fabric with GPUDirect RDMA enabled
  • Each GPU should have at least one dedicated RDMA NIC

Production use cases demonstrated

Disaggregated prefill and decode

Prefill and decode can run on separate clusters; the system must stream KvCache quickly from prefill to decode GPUs. TransferEngine uses alloc_uvm_watcher to track model progress. During prefill, a watcher value increments after each layer's attention output projection. When a change is observed, the worker issues paged writes for that layer's KvCache pages and a single write for the remaining context. This allows layer-by-layer streaming without fixed world membership and avoids strict ordering collectives.
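
A rough sketch of that watcher-driven loop follows. The UVM watcher is modeled here as an atomic progress counter and the paged/single write submissions as stand-in closures, since the article does not show the actual code; only the overall flow mirrors the description above.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Stream KvCache layer by layer as prefill makes progress. The watcher is
/// modeled as an AtomicU64 bumped after each layer's attention output
/// projection; the real alloc_uvm_watcher mechanism and the write
/// submission calls are replaced by stand-in closures.
fn stream_kv_cache(
    watcher: Arc<AtomicU64>,
    num_layers: u64,
    mut write_layer_pages: impl FnMut(u64), // stand-in for submit_paged_writes
    mut write_remainder: impl FnMut(u64),   // stand-in for submit_single_write
) {
    let mut last_seen = 0u64;
    while last_seen < num_layers {
        let now = watcher.load(Ordering::Acquire);
        if now == last_seen {
            std::hint::spin_loop(); // the real worker also services other transfers
            continue;
        }
        // Stream every layer that completed since the last observation.
        for layer in last_seen..now {
            write_layer_pages(layer); // this layer's KvCache pages
            write_remainder(layer);   // single write for the remaining context
        }
        last_seen = now;
    }
}
```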

Fast weight transfer for reinforcement learning

For asynchronous RL fine-tuning, training and inference run on separate GPU pools. Instead of gathering updates to a single rank and broadcasting (which limits throughput to one NIC), TransferEngine enables point-to-point one-sided writes so training GPUs write parameter shards directly into inference GPUs. A pipelined execution divides each tensor into stages: host-to-device copy (when FSDP offloads weights), reconstruction and optional quantization, RDMA transfer, and a barrier implemented with scatter and ImmCounter. In production this setup delivered weight updates for Kimi K2 (1T) and DeepSeek V3 (671B) in about 1.3 seconds from 256 training GPUs to 128 inference GPUs.
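
The staging can be pictured as a small software pipeline in which successive shards overlap across stages. The sketch below uses plain threads and channels with placeholder stage bodies; the shard contents, the reconstruction/quantization step, and the RDMA and barrier calls are all assumptions rather than the production implementation.

```rust
use std::sync::mpsc;
use std::thread;

struct Shard { id: usize, bytes: Vec<u8> }

fn main() {
    let (to_quant, quant_rx) = mpsc::channel::<Shard>();
    let (to_rdma, rdma_rx) = mpsc::channel::<Shard>();

    // Stage 2: reconstruction and optional quantization (placeholder body).
    let quant = thread::spawn(move || {
        for shard in quant_rx {
            // e.g. reassemble offloaded FSDP shards, repack for the inference format
            to_rdma.send(shard).unwrap();
        }
    });

    // Stage 3: one-sided RDMA writes into the inference GPUs (placeholder),
    // followed in the real system by a barrier built from scatter + ImmCounter.
    let rdma = thread::spawn(move || {
        for shard in rdma_rx {
            println!("write shard {} ({} bytes) to inference pool", shard.id, shard.bytes.len());
        }
    });

    // Stage 1: host-to-device copy when FSDP has offloaded weights (placeholder).
    // Feeding shards one at a time keeps all stages busy simultaneously.
    for id in 0..4 {
        to_quant.send(Shard { id, bytes: vec![0u8; 1024] }).unwrap();
    }
    drop(to_quant);

    quant.join().unwrap();
    rdma.join().unwrap();
}
```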

MoE routing across ConnectX and EFA

pplx garden includes a point-to-point MoE dispatch and combine kernel. It uses NVLink for intra-node traffic and RDMA for inter-node traffic. Dispatch and combine are split into separate send and receive phases so decoders can micro-batch and overlap communication with grouped GEMM. A host proxy thread polls GPU state and calls TransferEngine when send buffers are ready. Routes are exchanged, each rank computes contiguous receive offsets per expert, and tokens are written into private buffers reusable between dispatch and combine. This reduces memory footprint and keeps writes large enough to saturate links.
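
One concrete piece of this flow is the offset computation after the route exchange. The sketch below shows one plausible way, not necessarily pplx garden's actual layout, to derive contiguous per-expert receive offsets from the per-rank token counts so that incoming one-sided writes land back-to-back.

```rust
/// counts[src_rank][expert] = number of tokens that `src_rank` will send to
/// local expert `expert` on this rank (learned from the route exchange).
/// Returns offsets[src_rank][expert] = token offset at which that rank writes.
fn receive_offsets(counts: &[Vec<usize>], num_experts: usize) -> Vec<Vec<usize>> {
    let mut offsets = vec![vec![0usize; num_experts]; counts.len()];
    let mut cursor = 0usize;
    // Group by expert first so each expert's tokens are contiguous,
    // then by source rank within that expert's region.
    for expert in 0..num_experts {
        for (src, per_rank) in counts.iter().enumerate() {
            offsets[src][expert] = cursor;
            cursor += per_rank[expert];
        }
    }
    offsets
}

fn main() {
    // Two source ranks, three local experts.
    let counts = vec![vec![4, 0, 2], vec![1, 3, 0]];
    let offsets = receive_offsets(&counts, 3);
    assert_eq!(offsets[0], vec![0, 5, 8]);  // rank 0 writes at these token offsets
    assert_eq!(offsets[1], vec![4, 5, 10]); // rank 1 follows rank 0 within each expert
    println!("{:?}", offsets);
}
```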

On ConnectX 7, pplx garden's MoE kernels achieve state-of-the-art decode latency and outperform DeepEP on the same hardware. On AWS EFA, they deliver the first practical MoE decode latencies for trillion-parameter workloads. Multi-node tests on AWS H200 instances show that distributing the model across nodes reduces latency at medium batch sizes, a common production regime.

Key takeaways for infra teams

  • TransferEngine provides a single RDMA point-to-point abstraction that works across ConnectX 7 and EFA and manages multiple NICs per GPU transparently.
  • The library exposes one-sided WriteImm with ImmCounter and peaks at 400 Gbps on both NIC families, matching single-vendor stacks while remaining portable.
  • Perplexity uses TransferEngine for disaggregated KvCache streaming, fast RL weight transfer, and MoE routing, enabling practical trillion-parameter inference and fine-tuning on heterogeneous clusters.

Because TransferEngine and pplx garden are open source under an MIT license, engineering teams can deploy very large MoE and dense models on mixed H100/H200 clusters across cloud providers without rewriting for each vendor-specific networking stack. For full technical details see the arXiv paper: https://arxiv.org/pdf/2510.27656 and the pplx garden repository on GitHub.
