ClosedMesh runs open-weight models end-to-end on the hardware contributors already own. The interesting part isn't that it's peer-to-peer — plenty of those have failed — it's that the whole design is built around the one constraint that killed the others: the physics of residential internet. This page is the honest version of how that works.
Datacenter GPUs talk to each other over NVLink and InfiniBand at roughly 50–200 microseconds round-trip. Residential internet is 20–200 milliseconds — three to four orders of magnitude slower, and two to three orders worse on bandwidth. No amount of clever code closes that gap. Every architectural decision below follows from taking it seriously instead of pretending it away.
Put the network on the per-token critical path and a 70B model that should decode around 30 tokens/sec collapses below 1 under real-world latency. This is exactly where Petals, BitTensor inference, and the earlier Mesh-LLM forks died.
A one-second setup and a few-millisecond handoff per thousand tokens is invisible to a user. So ClosedMesh routes a whole session to one peer — it doesn't stitch fragments of a forward pass across slow links mid-decode.
It's the one multi-peer pattern where a single network hop amortises across a whole batch of tokens. That's why it's the only cross-peer cooperation ClosedMesh leans on, and why it's the path to models bigger than one peer can solo.
ClosedMesh is split between a thin product surface — the chat UI you're using right now — and a peer-to-peer inference runtime that handles model loading, routing, and distribution across machines. They're shipped and versioned separately.
In order of how often they're the right answer. The first is the common case the whole system is tuned for; the last is a power-user fallback we'd rather you didn't need.
One peer serves a whole session end-to-end at full quality. A model that fits on one machine runs there, with zero per-token network overhead. This is the common case and the one ClosedMesh optimises for.
Two peers cooperate: a small fast draft proposes 4–8 tokens, a larger verifier accepts them in a single batched pass. The WAN hop amortises across the batch, so both peers earn for one session without the network choking decode.
Several peers can quietly contribute to one answer — a multi-modal input handled by one model, a second opinion from another. The caller still sees a single streamed response.
For models too large for any single peer, weights can be split across machines. It's documented and available, but deprecated as a daily driver: it puts the network back on the critical path, which the physics above says to avoid.
Anyone can chat without running anything. Inference is served by peers who've chosen to contribute compute by running the ClosedMesh LLM runtime on their own hardware. Anybody can be one, both, or neither.
Web at closedmesh.com or in the desktop app. Type a message, get a streamed response. No account, no setup, nothing to install.
Requests land at the public mesh entry point. A capability-aware router picks a peer that can actually serve the requested model — by backend, memory, loaded models, load, and latency — using session-sticky hashing so follow-up turns prefer the peer that already holds the KV cache.
Volunteered nodes serve each session end-to-end on whichever peer fits the model. The router auto-routes around offline ones, and can pair two peers via speculative decoding for the mid-tier.
The honest version: when you chat, you're trusting the peer your session lands on, the mesh entry node, and the chat UI. You are nottrusting any third-party AI provider. That's the trade — here's what backs it.
Every peer runs the same open-source runtime, so what a peer can and can't do is auditable. There's no closed black box deciding what happens to your prompt.
No login. A peer doesn't know who you are unless your prompt reveals it, and sessions aren't tied to an identity. Traffic to the entry node is TLS-encrypted.
Each peer publishes a deterministic model-identity fingerprint, and the network re-runs an unpredictable synthetic probe to confirm it actually serves the model it advertises. A peer can't claim a big model while quietly serving a smaller one. Only synthetic probes are replayed — never your prompts.
For work you don't want to trust to anyone else, the runtime other peers run is the runtime you can run yourself. Nothing about the design forces you to share compute or rent it from others.
The installer detects OS, CPU architecture and GPU vendor, then pulls the matching runtime build. You can also pin a backend explicitly for unusual setups. Apple Silicon is the hero hardware — M-series unified memory is what makes a consumer machine genuinely capable of 30B–70B models — but the mesh is heterogeneous on purpose.
| OS | Hardware | Backend |
|---|---|---|
| macOS | Apple Silicon | Metal |
| Linux | x86_64 · NVIDIA | CUDA |
| Linux | x86_64 · AMD | ROCm |
| Linux | x86_64 · Intel / other | Vulkan |
| Linux | x86_64 · CPU-only | CPU |
| Linux | aarch64 | Vulkan / CPU |
| Windows 10/11 | x86_64 · NVIDIA | CUDA |
| Windows 10/11 | x86_64 · AMD / Intel / other | Vulkan |
| WSL2 | x86_64 · NVIDIA passthrough | CUDA |
Stating the obvious objections before you do. ClosedMesh is for latency-tolerant, private, high-volume work — not for everything.
There are no GPT-class closed weights here. ClosedMesh serves open-weight models, which have caught up on most non-frontier work but aren't the top of the leaderboard.
A hosted API wins on first-token latency for a single quick reply. ClosedMesh is the wrong tool for shaving a second off every message and the right one for work where an instant answer isn't the point.
No gradient passes across the mesh. The residential-WAN physics that make per-token cross-peer traffic fatal make distributed training a non-starter — it's explicitly out of scope.
The unit is a session of a specific model served at measured quality — not an interchangeable GPU-second. A token from a 0.6B draft and a token from a 70B verifier are different products, and ClosedMesh prices them that way.