How it works · deep dive

One collective computer.Made of every machine that joins.

ClosedMesh runs open-weight models end-to-end on the hardware contributors already own. The interesting part isn't that it's peer-to-peer — plenty of those have failed — it's that the whole design is built around the one constraint that killed the others: the physics of residential internet. This page is the honest version of how that works.

Reading the mesh…
The constraint

The unit of work is a session, not a token.

Datacenter GPUs talk to each other over NVLink and InfiniBand at roughly 50–200 microseconds round-trip. Residential internet is 20–200 milliseconds — three to four orders of magnitude slower, and two to three orders worse on bandwidth. No amount of clever code closes that gap. Every architectural decision below follows from taking it seriously instead of pretending it away.

Per-token cross-peer traffic is fatal

Put the network on the per-token critical path and a 70B model that should decode around 30 tokens/sec collapses below 1 under real-world latency. This is exactly where Petals, BitTensor inference, and the earlier Mesh-LLM forks died.

Per-session cross-peer traffic is fine

A one-second setup and a few-millisecond handoff per thousand tokens is invisible to a user. So ClosedMesh routes a whole session to one peer — it doesn't stitch fragments of a forward pass across slow links mid-decode.

Speculative decoding is the exception

It's the one multi-peer pattern where a single network hop amortises across a whole batch of tokens. That's why it's the only cross-peer cooperation ClosedMesh leans on, and why it's the path to models bigger than one peer can solo.

Architecture

Two layers, one product.

ClosedMesh is split between a thin product surface — the chat UI you're using right now — and a peer-to-peer inference runtime that handles model loading, routing, and distribution across machines. They're shipped and versioned separately.

Chat surface
ClosedMesh
Where you actually use the thing.
  • A web chat at closedmesh.com — open it and start typing.
  • A native desktop app that ships the same chat plus the controls for running a node yourself.
  • Streaming responses, thread persistence, model picker, OpenAI-compatible API for tools and agents.
Inference runtime · open source
ClosedMesh LLM
The peer-to-peer engine that serves the chat.
  • Runs on machines volunteered to the mesh — Apple Silicon Macs, NVIDIA / AMD / Intel GPU boxes, on-prem workstations.
  • Replication-first: a model that fits on one peer runs there end-to-end, full quality, zero per-token network overhead.
  • Speculative decoding across two peers — small fast draft + larger verifier — for the mid-tier where one peer isn't enough.
  • Capability-aware routing: requests only go to peers that can actually serve them.
  • Built on an Iroh QUIC overlay with a gossip protocol for capability announcement.
Cooperation

Four ways peers work together.

In order of how often they're the right answer. The first is the common case the whole system is tuned for; the last is a power-user fallback we'd rather you didn't need.

Replication — the default

One peer serves a whole session end-to-end at full quality. A model that fits on one machine runs there, with zero per-token network overhead. This is the common case and the one ClosedMesh optimises for.

Speculative pairs — the mid-tier

Two peers cooperate: a small fast draft proposes 4–8 tokens, a larger verifier accepts them in a single batched pass. The WAN hop amortises across the batch, so both peers earn for one session without the network choking decode.

Inter-model collaboration

Several peers can quietly contribute to one answer — a multi-modal input handled by one model, a second opinion from another. The caller still sees a single streamed response.

Pipeline / expert split — the fallback

For models too large for any single peer, weights can be split across machines. It's documented and available, but deprecated as a daily driver: it puts the network back on the critical path, which the physics above says to avoid.

The path of a request

From your keystroke to a peer and back.

Anyone can chat without running anything. Inference is served by peers who've chosen to contribute compute by running the ClosedMesh LLM runtime on their own hardware. Anybody can be one, both, or neither.

Chat clientclosedmesh.comor desktop app/api/chatMesh entryOpenAI-compatible /v1capability-aware router/v1ClosedMesh LLM peersM-series MacCUDA · 4090Vulkan laptop
01
Chat

Web at closedmesh.com or in the desktop app. Type a message, get a streamed response. No account, no setup, nothing to install.

02
Mesh entry + routing

Requests land at the public mesh entry point. A capability-aware router picks a peer that can actually serve the requested model — by backend, memory, loaded models, load, and latency — using session-sticky hashing so follow-up turns prefer the peer that already holds the KV cache.

03
Compute peers

Volunteered nodes serve each session end-to-end on whichever peer fits the model. The router auto-routes around offline ones, and can pair two peers via speculative decoding for the mid-tier.

Privacy & trust

What you're trusting — and what you're not.

The honest version: when you chat, you're trusting the peer your session lands on, the mesh entry node, and the chat UI. You are nottrusting any third-party AI provider. That's the trade — here's what backs it.

Open-source runtime

Every peer runs the same open-source runtime, so what a peer can and can't do is auditable. There's no closed black box deciding what happens to your prompt.

Session pseudonymity

No login. A peer doesn't know who you are unless your prompt reveals it, and sessions aren't tied to an identity. Traffic to the entry node is TLS-encrypted.

Verified peers

Each peer publishes a deterministic model-identity fingerprint, and the network re-runs an unpredictable synthetic probe to confirm it actually serves the model it advertises. A peer can't claim a big model while quietly serving a smaller one. Only synthetic probes are replayed — never your prompts.

Run your own peer

For work you don't want to trust to anyone else, the runtime other peers run is the runtime you can run yourself. Nothing about the design forces you to share compute or rent it from others.

Hardware support

Whatever the team is already running.

The installer detects OS, CPU architecture and GPU vendor, then pulls the matching runtime build. You can also pin a backend explicitly for unusual setups. Apple Silicon is the hero hardware — M-series unified memory is what makes a consumer machine genuinely capable of 30B–70B models — but the mesh is heterogeneous on purpose.

OSHardwareBackend
macOSApple SiliconMetal
Linuxx86_64 · NVIDIACUDA
Linuxx86_64 · AMDROCm
Linuxx86_64 · Intel / otherVulkan
Linuxx86_64 · CPU-onlyCPU
Linuxaarch64Vulkan / CPU
Windows 10/11x86_64 · NVIDIACUDA
Windows 10/11x86_64 · AMD / Intel / otherVulkan
WSL2x86_64 · NVIDIA passthroughCUDA
Limits

What ClosedMesh isn't.

Stating the obvious objections before you do. ClosedMesh is for latency-tolerant, private, high-volume work — not for everything.

Not a frontier-model network

There are no GPT-class closed weights here. ClosedMesh serves open-weight models, which have caught up on most non-frontier work but aren't the top of the leaderboard.

Not the fastest median chat

A hosted API wins on first-token latency for a single quick reply. ClosedMesh is the wrong tool for shaving a second off every message and the right one for work where an instant answer isn't the point.

Not a training network

No gradient passes across the mesh. The residential-WAN physics that make per-token cross-peer traffic fatal make distributed training a non-starter — it's explicitly out of scope.

Not fungible compute

The unit is a session of a specific model served at measured quality — not an interchangeable GPU-second. A token from a 0.6B draft and a token from a 70B verifier are different products, and ClosedMesh prices them that way.