Apple TN3205: RDMA Over Thunderbolt Cuts Mac Latency to 50µs

Essential Points

Apple’s TN3205 (March 19, 2026) documents RDMA over Thunderbolt, available in macOS 26.2 Tahoe for Thunderbolt-capable Macs
Memory access latency drops from ~300µs over TCP to under 50µs with RDMA over Thunderbolt enabled
A four-node Mac Studio M3 Ultra cluster running exo with RDMA reached 31.9 tokens/sec on Qwen3 235B, up from 15.2 t/s without RDMA
The 1.5TB unified memory cluster of four Mac Studios costs just under $40,000 total

Apple just changed what a small rack of Mac Studios can do for AI research. TN3205, published March 19, 2026, gives developers a direct path to cluster-level performance using cables many professional Mac users already own. The technote is a practical setup guide rather than a theoretical paper, and independent benchmarks show what it means for AI workloads and research computing on Apple Silicon clusters.

What RDMA Over Thunderbolt Actually Does

Remote Direct Memory Access (RDMA) lets one computer read and write another machine’s memory without involving the remote CPU. Traditional network transfers route data through the operating system, copy it multiple times in RAM, and consume CPU cycles throughout. RDMA takes the operating system, the extra copies, and the remote CPU out of the data path.

Three mechanisms make this possible:

Zero-copy transfers: Data moves directly between memory regions on two machines, bypassing intermediate buffers
CPU offloading: The remote CPU is not interrupted during memory operations, freeing all cores for compute tasks
Deterministic latency: Sub-50µs round-trip times remain consistent under load, unlike TCP, which degrades under high traffic
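
The zero-copy idea in the first item can be illustrated on a single machine. The sketch below is only a local analogy in Python, not RDMA itself: it contrasts a copied slice with a `memoryview` slice that shares the underlying buffer. The buffer contents and variable names are arbitrary choices for illustration.

```python
# Local analogy for zero-copy access; this does not perform RDMA.
buffer = bytearray(b"tensor-shard-0000")  # stands in for a shared memory region

copied = bytes(buffer[:6])       # slicing and converting allocates a separate copy
view = memoryview(buffer)[:6]    # a memoryview slice shares the same memory

buffer[0:6] = b"TENSOR"          # the owner updates the region in place

print(copied)        # the copy still holds the old bytes
print(bytes(view))   # the view reflects the update with no transfer step
```

The copy has to be refreshed every time the data changes, which is the per-transfer cost that a traditional network stack pays several times over. The view never needs refreshing because it refers to the same memory.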

Apple exposes this capability through the rdma_thunderbolt kernel extension and high-level APIs in Network.framework, allowing developers to write Swift or Objective-C code that treats a Thunderbolt-connected device as a direct memory peer.

Why TN3205 Matters Right Now

Apple published TN3205 on March 19, 2026, alongside macOS 26.2 Tahoe. The technote documents RDMA over Thunderbolt as a production-ready feature for Thunderbolt-capable Mac clusters. The timing aligns directly with explosive growth in local large language model (LLM) inference, where researchers running models like Qwen3 235B or DeepSeek V3.1 were previously bottlenecked by inter-node memory transfer speeds.

RDMA over Thunderbolt removes that bottleneck at the hardware level, letting distributed memory across multiple Macs behave as a single shared pool.

Real-World Performance: AI Inference Benchmarks

Jeff Geerling’s independent testing, using four Mac Studio M3 Ultra units loaned by Apple, shows a clear split between TCP-based and RDMA-based clustering.

In tests using the Qwen3 235B A22B model, llama.cpp (which uses TCP, without RDMA support) dropped from 20.4 tokens per second on one node to 15.2 tokens per second on four nodes. In contrast, exo with RDMA over Thunderbolt increased from 19.5 tokens per second on one node to 31.9 tokens per second across four nodes.

DeepSeek V3.1 671B showed a similar pattern. Exo with RDMA scaled from 21.1 tokens per second on a single node to 27.8 tokens per second on two nodes and 32.5 tokens per second on four nodes.

The largest model tested, Kimi K2 Thinking with 1 trillion parameters (32 billion active), ran only on the two-node and four-node configurations since it exceeds single-node memory capacity. Over two nodes, llama.cpp reached 18.5 tokens per second while exo with RDMA produced 21.6 tokens per second. Across four nodes, exo reached 28.3 tokens per second.

In these tests, llama.cpp over TCP got slower as nodes were added, while exo with RDMA got faster with every added node. That is the clearest illustration of what the protocol change delivers.
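
The scaling behavior can be quantified directly from the figures quoted above. The following is a small sketch that computes speedup relative to a single node and a simple scaling efficiency (speedup divided by node count). The helper names are illustrative, and Kimi K2 is omitted because it has no single-node baseline.

```python
# Tokens/sec as reported in the text above: {nodes: tokens_per_second}.
results = {
    ("Qwen3 235B", "llama.cpp (TCP)"): {1: 20.4, 4: 15.2},
    ("Qwen3 235B", "exo (RDMA)"): {1: 19.5, 4: 31.9},
    ("DeepSeek V3.1 671B", "exo (RDMA)"): {1: 21.1, 2: 27.8, 4: 32.5},
}


def speedup(series, nodes):
    """Throughput at `nodes` relative to the single-node run."""
    return series[nodes] / series[1]


def scaling_efficiency(series, nodes):
    """Speedup divided by node count; 1.0 would be perfect linear scaling."""
    return speedup(series, nodes) / nodes


for (model, stack), series in results.items():
    for n in sorted(series):
        if n == 1:
            continue
        print(f"{model:20s} {stack:16s} {n} nodes: "
              f"{speedup(series, n):.2f}x speedup, "
              f"{scaling_efficiency(series, n):.0%} efficiency")
```

On these numbers, exo with RDMA gains throughput on four nodes but stays well short of a linear 4x, while llama.cpp over TCP falls below its own single-node result.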

The Reference Hardware: Four Mac Studios at 1.5TB

The hardware used in Geerling’s cluster was four Mac Studio machines equipped with M3 Ultra processors, provided by Apple for testing. Two units carried 512GB of unified memory and 8TB of storage; the remaining two carried 256GB of unified memory and 4TB of storage. The total unified memory pool across all four machines reached 1.5TB, at a combined cost just under $40,000.
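
As a quick check of the memory arithmetic, here is a short sketch using only the configuration figures above. The cost-per-gigabyte value is an upper bound, since the text gives the price only as "just under $40,000", and the 1024 GB per TB conversion is an assumption about how the 1.5TB figure was rounded.

```python
# Node configurations from the text: (number of units, unified memory in GB).
configs = [(2, 512), (2, 256)]

total_gb = sum(count * memory_gb for count, memory_gb in configs)
total_tb = total_gb / 1024          # assumes 1 TB = 1024 GB

cost_ceiling_usd = 40_000           # "just under $40,000", so an upper bound
usd_per_gb_ceiling = cost_ceiling_usd / total_gb

print(f"pooled memory: {total_gb} GB ({total_tb} TB)")
print(f"cost per GB of unified memory: under ${usd_per_gb_ceiling:.2f}")
```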

Each Mac Studio has five Thunderbolt 5 ports in addition to a 10 Gigabit Ethernet port. Apple confirmed all Thunderbolt 5 ports on the Mac Studio are RDMA over Thunderbolt capable. The cluster was mounted in a compact DeskPi TL1 four-post mini-rack, running at under 250 watts per unit and described as “almost whisper-quiet” under load.

Thunderbolt 5 Bandwidth and the TCP Problem

Thunderbolt 5 raises inter-Mac cluster bandwidth to a maximum of 80 Gb/s, up from 40 Gb/s on Thunderbolt 4. This bidirectional 80 Gb/s bandwidth is what makes RDMA viable at cluster scale. Thunderbolt 5’s asymmetric 120 Gb/s mode, which borrows bandwidth from the opposite direction, is not suited to cluster workloads that need balanced bidirectional traffic.

Without RDMA, TCP over Thunderbolt 5 showed instability under high load during testing, including system crashes and restarts during HPL benchmark runs. RDMA directly replaces TCP as the inter-node transport, resolving both the stability and the latency issues simultaneously.
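
To see why latency matters as much as bandwidth, consider a deliberately simple model: transfer time equals a fixed per-message latency plus the payload size divided by link bandwidth. The sketch below uses the ~300µs (TCP) and 50µs (RDMA) latencies and the 80 Gb/s figure from this article. It assumes both transports reach the full line rate and ignores protocol overhead, so treat the output as an illustration rather than a measurement.

```python
LINK_GBPS = 80.0                                 # Thunderbolt 5 line rate from the text
LATENCY_US = {"TCP": 300.0, "RDMA": 50.0}        # per-message latency from the text


def transfer_time_us(size_bytes, latency_us, link_gbps=LINK_GBPS):
    """Fixed latency plus time on the wire, in microseconds (simplified model)."""
    wire_us = size_bytes * 8 / (link_gbps * 1e9) * 1e6
    return latency_us + wire_us


for label, size in [("64 KiB", 64 * 1024), ("1 MiB", 1024 ** 2), ("1 GiB", 1024 ** 3)]:
    tcp = transfer_time_us(size, LATENCY_US["TCP"])
    rdma = transfer_time_us(size, LATENCY_US["RDMA"])
    print(f"{label:>7}: TCP {tcp:12.1f} us, RDMA {rdma:12.1f} us, ratio {tcp / rdma:.2f}x")
```

Under this model the gap is largest for small messages, where the fixed latency dominates, and almost disappears for gigabyte-sized transfers, where time on the wire dominates. Distributed inference exchanges many small messages per token, which is the regime where the lower latency pays off.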

Supported Hardware and Software

Requirement	Specification
macOS Version	macOS 26.2 Tahoe
Minimum Thunderbolt	Thunderbolt 4 or later
Reference Cluster Hardware	Mac Studio with M3 Ultra (Thunderbolt 5)
Total Pooled Memory (4 nodes)	1.5TB unified memory
Frameworks Supported	MLX Distributed, jaccl, exo
Cluster Cost (4x Mac Studio)	Just under $40,000

RDMA over Thunderbolt is available on any Mac with Thunderbolt 4 or later running macOS 26.2. However, Thunderbolt 5 is required to achieve the full 80 Gb/s bandwidth that makes multi-node AI inference at scale practical. The M3 Ultra is currently the fastest Apple Silicon chip in a Mac that ships with Thunderbolt 5.

Performance Impact on CPU and Storage Workloads

Beyond AI inference, RDMA produces measurable gains in storage and I/O-intensive workflows. Apple’s internal benchmarks show peak throughput rising from 30 GB/s with traditional Thunderbolt DMA to 45 GB/s with RDMA enabled. Average read latency drops from 8 microseconds to 3 microseconds, and CPU utilization per GB/s transferred falls from 12% to 4%.

For video production and data pipeline workloads, the CPU reduction is the most immediately practical gain. Freeing CPU headroom during heavy I/O allows those cores to handle encoding, rendering, or model inference simultaneously.

How to Enable RDMA on a Mac Cluster

Apple’s TN3205 documentation outlines a specific setup path for Thunderbolt Macs running macOS 26.2.

Connect each Mac using Thunderbolt cables in a peer-to-peer or daisy-chain topology
Boot into Recovery Mode and run the command to enable RDMA over Thunderbolt
Verify port status and RDMA readiness using sudo rdma_test -i thunderbolt0 in Terminal
Enable the rdma_thunderbolt kernel extension via System Settings under Privacy and Security
Run applications targeting the MLX Distributed or jaccl distributed compute backends, or use the exo open-source framework for LLM inference
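
The verification step above can be scripted across nodes. The following is a minimal sketch that wraps the `sudo rdma_test -i thunderbolt0` command quoted in the steps. Only the command line itself comes from the text; the function names are hypothetical, and treating a zero exit status as "ready" is an assumption, since the tool’s output format and exit codes are not described here.

```python
import shutil
import subprocess


def rdma_test_command(interface="thunderbolt0"):
    """Build the verification command quoted in the setup steps."""
    return ["sudo", "rdma_test", "-i", interface]


def rdma_ready(interface="thunderbolt0"):
    """Return True or False from the exit status, or None if the tool is absent.

    Assumption: a zero exit status means the port is RDMA-ready.
    """
    if shutil.which("rdma_test") is None:
        return None
    completed = subprocess.run(
        rdma_test_command(interface), capture_output=True, text=True
    )
    return completed.returncode == 0


# Print the command that would be run; call rdma_ready() on a configured Mac.
print(" ".join(rdma_test_command()))
```

Keeping command construction separate from execution makes it easy to log or review the exact command before running anything with `sudo`.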

For cluster management during testing, Geerling configured the Ansible automation tool to shut down and restart all nodes via script when instability occurred, which proved essential given the pre-release state of exo during his testing period.

Limitations and Current Constraints

RDMA over Thunderbolt is a significant capability but carries real constraints that developers and researchers should evaluate before committing to hardware.

The exo framework used in the reference benchmarks was a pre-release version during testing in December 2025 and showed stability issues, and TCP over Thunderbolt 5 crashed systems under sustained HPL load. Thunderbolt’s daisy-chain topology limits how many units can be clustered before bandwidth contention reduces returns. A Thunderbolt 5 networking switch does not yet exist, capping practical cluster sizes. The total cost of just under $40,000 for four Mac Studios places this capability firmly in professional research and enterprise budgets, not hobbyist use.
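
The topology constraint can be made concrete with a little counting. The sketch below compares a daisy chain with a full mesh of direct cables, assuming one cable per port, no switch, and the five Thunderbolt 5 ports per Mac Studio stated earlier in this article. The function names are illustrative, and whether a given framework supports a particular topology is not covered here.

```python
PORTS_PER_NODE = 5  # Thunderbolt 5 ports per Mac Studio, per the article


def daisy_chain(n):
    """Cables, busiest node's port usage, and worst-case hops for a chain of n nodes."""
    return {"cables": n - 1, "ports_per_node": min(2, n - 1), "max_hops": n - 1}


def full_mesh(n):
    """Same metrics when every node is cabled directly to every other node."""
    return {
        "cables": n * (n - 1) // 2,
        "ports_per_node": n - 1,
        "max_hops": 1 if n > 1 else 0,
    }


def max_full_mesh_nodes(ports=PORTS_PER_NODE):
    """Largest cluster in which each node can still reach every peer directly."""
    return ports + 1


for n in (2, 4, 6, 8):
    mesh = full_mesh(n)
    fits = mesh["ports_per_node"] <= PORTS_PER_NODE
    print(f"{n} nodes: chain {daisy_chain(n)}, mesh {mesh}, mesh fits: {fits}")
```

With these assumptions a four-node cluster can be fully meshed using three ports per machine, but direct cabling runs out of ports beyond six nodes. Larger clusters would have to relay traffic through intermediate Macs, which is where contention on shared links begins.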

Frequently Asked Questions (FAQs)

What is Apple TN3205?

TN3205 is an Apple developer technote published March 19, 2026, that documents how to enable RDMA over Thunderbolt on compatible Macs. It provides setup guidance, supported hardware details, and framework recommendations for developers building low-latency Mac clusters for AI and compute workloads.

Which Macs support RDMA over Thunderbolt?

Any Mac equipped with Thunderbolt 4 or later running macOS 26.2 Tahoe can act as an RDMA endpoint. However, Thunderbolt 5 is required for the full 80 Gb/s bandwidth needed for effective multi-node AI inference clustering. The Mac Studio with M3 Ultra is the confirmed reference hardware.

How much does RDMA reduce latency over Thunderbolt?

RDMA over Thunderbolt reduces memory access latency from approximately 300 microseconds over TCP to under 50 microseconds. This is a 6x or greater improvement that directly enables distributed memory to behave as a single shared pool across multiple machines.

What were the AI inference benchmark results?

On Qwen3 235B, a four-node Mac Studio cluster with exo and RDMA reached 31.9 tokens per second, versus 15.2 tokens per second for llama.cpp (TCP) on the same hardware. On DeepSeek V3.1 671B, exo with RDMA reached 32.5 tokens per second across four nodes.

What is the Kimi K2 result on the cluster?

Kimi K2 Thinking, a one-trillion-parameter model with 32 billion active parameters, is too large for a single Mac Studio to run. On two nodes with exo and RDMA it reached 21.6 tokens per second; on four nodes it reached 28.3 tokens per second, achieving practical interaction speed.

How much does a Mac Studio cluster for RDMA cost?

The four-Mac-Studio cluster used in reference testing cost just under $40,000 total. Two units were configured with 512GB of unified memory and 8TB of storage, while two had 256GB and 4TB. This price point targets professional research teams and AI-focused enterprises, not individual users.

Is RDMA over Thunderbolt stable for production use?

As of the December 2025 test period, the exo framework used for RDMA-based clustering was in pre-release and exhibited stability issues, and TCP over Thunderbolt 5 crashed systems during HPL benchmarks under sustained high load. Apple’s TN3205 formalizes the feature as of March 2026, but developers should monitor framework stability before deploying in production environments.

What software frameworks work with RDMA over Thunderbolt?

Apple has optimized RDMA over Thunderbolt for the MLX Distributed library and the jaccl distributed compute backend. The open-source exo framework also demonstrated strong RDMA support in Geerling’s benchmarks. llama.cpp does not support RDMA, and in those benchmarks its performance dropped as nodes were added.
