The Multi-Mac AI Cluster — Insane Overkill or the Future?

Photo: Unsplash

AI Power User

The Multi-Mac AI Cluster — Insane Overkill or the Future?

I chained Mac Studios together to run frontier-scale models and the answer is more complicated than the hype suggests
ai-clusterapple-siliconlocal-llmdistributed-inference

Somewhere on r/LocalLLaMA there is always a photo of four Mac Studios stacked like a tiny aluminum data center, running a model that should not fit on any of them individually. Every time one of these posts goes viral, I get the same question from readers: is this actually a good idea, or is it the PC-master-race equivalent of putting a spoiler on a hatchback?

I spent two weeks finding out. I borrowed a second Mac Studio (M2 Ultra, 128GB) to pair with my own M3 Max machine, wired them together over Thunderbolt, and ran distributed inference with exo. Here is the unvarnished picture.

How pooling Macs actually works

The core idea is simple: a transformer model is a stack of layers, and layers can be split across machines. Machine A holds layers 1–40, machine B holds layers 41–80. A token’s activations flow through machine A, hop across the network, continue through machine B, and the output comes back. Your two 128GB Macs behave — logically — like one 256GB machine.

The best-known open-source project doing this on Apple Silicon is exo (exo-explore/exo on GitHub). It auto-discovers other exo nodes on your network, partitions the model proportionally to each machine’s memory, and exposes a ChatGPT-compatible API on port 52415. Setup is genuinely minimal:

git clone https://github.com/exo-explore/exo
cd exo
pip install -e .
exo

Run that on every Mac on the same network. They find each other via UDP discovery, negotiate a ring topology, and you point any OpenAI-compatible client at http://localhost:52415/v1/chat/completions. The first time two machines on my desk silently agreed to share a 70B model between themselves, I’ll admit it felt like the future.

There are alternatives — llama.cpp has RPC mode (rpc-server on workers, --rpc host:port on the main node), and MLX has been growing distributed primitives — but exo is the one that makes it nearly zero-config.

The performance reality nobody leads with

Here is the part the stacked-Mac photos never show: the interconnect is the bottleneck, and it is brutal.

Inside one Mac Studio, the M2 Ultra’s unified memory delivers around 800GB/s to the GPU. Between two Macs over a Thunderbolt 4 bridge, you get roughly 2.5–3GB/s of usable throughput — about 250x slower than staying on-chip. Over 10GbE it’s worse, around 1.1GB/s. Over Wi-Fi, don’t even bother.

For inference, every single token has to cross that bridge (the activations pass between layer shards sequentially). My measured numbers with Llama 3.3 70B at 4-bit quantization:

  • Single M2 Ultra 128GB, MLX: ~11.5 tokens/sec
  • Two Macs, exo over Thunderbolt bridge: ~8.2 tokens/sec
  • Two Macs, exo over 10GbE: ~6.9 tokens/sec

Read that again: for a model that fits on one machine, adding a second Mac made it slower. The pipeline is sequential — while machine A computes, machine B mostly waits, and you pay network latency per token on top. Two machines is barely competitive with one, and sometimes worse.

The math only flips when the model does not fit on any single machine you own. DeepSeek-V3-class models at reasonable quantization want 350–400GB of memory. No single Mac sold today holds that; four 128GB Mac Studios pooled together do. At that point your comparison isn’t “cluster vs. one Mac” — it’s “cluster vs. not running the model at all.” Enthusiasts have demonstrated exactly this on exo, getting single-digit tokens per second on frontier-scale open models. Slow, but running, locally, on hardware you own.

The cost analysis that kills most cluster dreams

Let’s do the arithmetic that the Reddit photos skip.

Scenario: you want ~256GB of memory for big models.

  • Option A: Two Mac Studio M2 Ultra 128GB units — roughly $9,600 combined.
  • Option B: One Mac Studio M3 Ultra with 256GB unified memory — around $7,000–7,500 depending on config.

Option B is cheaper, has zero interconnect penalty (full 800GB/s+ across the whole memory pool), draws less power, occupies one outlet, and needs no networking gymnastics. For anyone buying hardware specifically to run big models, the single bigger machine wins almost every time. Apple’s memory pricing is infamous, but it’s still cheaper than buying a second entire computer to get the same RAM.

The exception — and it’s a real one — is hardware you already own. If your studio has a Mac Studio and two MacBook Pros sitting idle, exo turns sunk cost into capability for free. This is the genuinely clever use case.

Who actually runs Mac clusters

After lurking in the exo Discord and r/LocalLLaMA threads for a month, the real user base sorts into three groups:

Researchers and tinkerers who want to study frontier-scale open models without renting H100s at $2+/hour. For experimentation where 5 tokens/sec is acceptable, a Mac cluster is the cheapest path to 400GB of model memory on the planet.

The enthusiast wing — people for whom “can it be done” is the entire point. I say this with affection; I am one of them. The engineering is legitimately fascinating: ring topologies, memory-weighted partitioning, latency hiding.

Offices with idle fleets. This is the sleeper use case. A design agency with eight M3 MacBook Pros that sleep from 6 PM to 9 AM has, overnight, a ~256GB+ inference cluster doing nothing. Batch jobs — summarizing documents, generating embeddings for internal search, processing the day’s meeting transcripts — don’t care about per-token latency. exo on the office LAN turns dead hardware hours into an internal AI service with zero capex. If I ran IT at a Mac-heavy company, I’d be piloting this on Monday.

If you want to try it anyway

The minimum viable experiment costs nothing but an evening, assuming two Apple Silicon Macs:

  1. Connect them with a Thunderbolt 4 cable. In System Settings → Network, add a Thunderbolt Bridge service on both machines and assign static IPs (e.g. 10.0.0.1 and 10.0.0.2). Verify with ping.
  2. Install and launch exo on both (commands above). Watch the console — you’ll see the nodes discover each other and print the combined memory pool.
  3. Request a model through the API or exo’s built-in web UI (tinychat) at http://localhost:52415. exo downloads and partitions shards automatically.
  4. Benchmark honestly: run the same prompt on one machine with MLX or Ollama first, then on the cluster. Compare tokens/sec, not vibes.

Practical warnings from my two weeks: keep every node plugged into power (Macs throttle hard on battery), pin exo to the same version on all nodes (the protocol moves fast and mismatches fail cryptically), and disable sleep with caffeinate -s or the machines will drop out of the ring mid-generation.

The verdict

So — insane overkill or the future? Honestly: both, depending on who you are.

For most people, it’s overkill, and the numbers prove it. If your target model fits in one Mac’s memory, clustering makes it slower. If it doesn’t, buying one bigger Mac is cheaper and faster than buying two smaller ones. The 250x gap between unified memory bandwidth and Thunderbolt is not a software problem anyone is going to patch away soon.

For fleet owners, it’s genuinely clever. Pooling idle Macs you already own is free capability, and overnight batch inference on an office LAN is a quietly brilliant pattern I expect to see productized.

And as a preview, it matters. The trajectory is unmistakable: personal AI infrastructure is heading toward heterogeneous devices — your Mac, your iPhone, maybe your Apple TV — cooperating on inference, with models sharded across whatever silicon you happen to own. exo is a rough, early sketch of that world. Today it’s a science project with aluminum fins. Give the interconnects a few generations — Thunderbolt 5’s 80Gbps is already shipping — and the line between “my computer” and “my computers” starts to blur.

I returned the borrowed Mac Studio. But I kept the Thunderbolt cable, just in case.