“I’m the Inference Cloud!”
Why tokens are the economic unit of AI. And why the stack is reorganizing around them.
The GPU cloud sells compute. The inference cloud sells tokens.
So naturally: Together AI is raising $1B, Nvidia is launching an inference chip with the Groq LPU IP it bought, and suddenly everyone from DigitalOcean to Akamai to Cloudflare is calling themselves “the inference cloud.” (Fwiw, I would do the same thing.)
If that thesis is right, there are a few ways to play it:
Neoclouds like CoreWeave and Nebius are (still) direct beneficiaries.
Runtimes like Together and Fireworks could be the next infrastructure mega-borrowers.
Distributed cloud and edge platforms like DigitalOcean, Fastly, Akamai, and Cloudflare – along with networking companies – move all those tokens around.
But first: wtf is the inference cloud and why now?
At the start of this year, I wrote about agentic AI and compounding inference; i.e., that once models start calling models (my very own little definition of agentic AI), inference demand doesn’t just go up; it goes vertical. Digital infrastructure is reorganizing around that demand, and the thesis is starting to play out.
Chips optimizing for token generation (Groq/Nvidia LPUs; Cerebras; SambaNova).
Runtime platforms raising infrastructure capital (Together, Fireworks, Baseten).
Clouds repositioning around inference workloads (DigitalOcean, Akamai, Cloudflare’s Replicate acquisition).
Put those signals together and something interesting starts to emerge: a new infrastructure layer forming in real time.
Behold: The Inference Cloud.
Last June (and blissfully for my ego, six months before Nvidia sorta bought Groq), I politely suggested that we f*** chips, and ship tokens. It’s not that chips don’t matter (duh), but developers care (more) about model outputs (i.e., tokens) than the hardware producing them. And companies like OpenAI and Anthropic have effectively “trained” developers to think this way by pricing their APIs in tokens rather than GPU hours. (See what I did there? “Trained.” Mhmm.) Once that abstraction happens, the rest of the stack reconstitutes accordingly.
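To make that abstraction concrete, here’s a minimal sketch of what pricing in tokens looks like from the developer’s side. The per-token rates and usage numbers below are made-up assumptions for illustration, not any provider’s actual price sheet.

```python
# Illustrative only: these per-token rates are hypothetical, not a real price sheet.
PRICE_PER_1M_INPUT_TOKENS = 3.00    # $ per 1M input tokens (assumed)
PRICE_PER_1M_OUTPUT_TOKENS = 15.00  # $ per 1M output tokens (assumed)

def token_cost(input_tokens: int, output_tokens: int) -> float:
    """The unit developers actually budget in: dollars per tokens in and out."""
    return (input_tokens / 1e6) * PRICE_PER_1M_INPUT_TOKENS + \
           (output_tokens / 1e6) * PRICE_PER_1M_OUTPUT_TOKENS

# A month of an app pushing 50M input and 10M output tokens:
print(f"${token_cost(50_000_000, 10_000_000):,.2f}")  # -> $300.00
```

Notice what never shows up in that math: the GPU hour. Once developers budget in tokens, the hardware underneath is someone else’s problem, and that is exactly the opening the inference cloud is racing to fill.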
So why are “inference cloud” billboards suddenly popping up all along the 101?
The workloads were already there. Now capital markets, and consequently hardware and infrastructure, are catching up. That’s usually when a new cloud layer forms. Sound familiar? It should. See CoreWeave et al over the last 2-3 years; it looked something like this:
Silicon: GPUs (Nvidia, AMD)
Engine: Foundation models (OpenAI, Anthropic)
Infrastructure: Neoclouds (CoreWeave, Nebius, Crusoe, Lambda)
The inference shift may look something like this:
Silicon: Inference accelerators (Nvidia “LPUs,” Google TPUs, AWS Inferentia)
Engine: Inference runtimes (Together, Fireworks, Baseten)
Infrastructure: Inference cloud (💰?)
The interesting part is that everyone is racing toward that last layer from different directions.
Runtime platforms like Together AI are raising money to build it from the software layer.
Neoclouds are approaching it from infrastructure; notice the focus on inference in the CoreWeave / Perplexity announcement 🤔
Hyperscalers are embedding inference into their existing clouds via AWS Bedrock, Azure AI, and Google Vertex.
Legacy clouds and CDNs like DigitalOcean, Cloudflare, Fastly, and Akamai are pushing toward orchestrating distributed inference.
In other words, it’s a new mad dash to own the newest and fastest growing part of digital (and, specifically, AI) infrastructure.
Inference is a different workload.
None of this means training goes away. If anything, the opposite is true. (Ask me why later.) Hyperscalers built the general cloud, and then GPU clouds like CoreWeave built specialized infrastructure for large-scale AI workloads. Those layers aren’t disappearing, and those companies are at least as well positioned as the (even newer) upstarts. What’s happening instead is that the stack is expanding.
Training workloads are episodic and batch-oriented. Massive clusters spin up to train models for days or weeks and then spin down. Consequently, training infrastructure optimizes for large bursts of compute capacity and operators sell GPU hours and MW.
Inference workloads power products, copilots, agents, and applications that run continuously. They optimize for throughput, latency, and reliability. They are API-driven and increasingly distributed. And inference operators sell tokens.
As agentic model-to-model systems proliferate, token demand compounds and the infrastructure required to deliver those tokens efficiently becomes its own massive market.
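A toy model of that compounding, with entirely hypothetical numbers: if each model call can delegate to sub-agents, total tokens per user request grow geometrically with fan-out and depth.

```python
# Toy model of agentic fan-out -- every parameter here is a hypothetical assumption.
def total_tokens(tokens_per_call: int, fanout: int, depth: int) -> int:
    """Total tokens when each call spawns `fanout` sub-calls, `depth` levels deep."""
    calls = sum(fanout ** level for level in range(depth + 1))  # 1 + f + f^2 + ...
    return calls * tokens_per_call

# A plain chatbot request: a single call, no delegation.
print(total_tokens(2_000, fanout=3, depth=0))  # 2,000 tokens

# An agent that delegates to 3 sub-agents, each delegating one level further.
print(total_tokens(2_000, fanout=3, depth=2))  # (1 + 3 + 9) * 2,000 = 26,000 tokens
```

Same product surface, roughly an order of magnitude more tokens per request; that delta is the demand the inference cloud is being built to serve.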
Playing the inference cloud.
If inference becomes the fastest-growing layer of AI infrastructure, there are a few ways to express that thesis.
Neoclouds. Yes. CoreWeave (CRWV) and Nebius (NBIS) et al already operate large-scale AI infrastructure designed for model workloads, and those clusters increasingly run both training and production inference. Plus, they were nice enough to list on public stock exchanges for us to trade.
Inference runtimes. Platforms like Together, Fireworks, and Baseten sit closest to developers and tokens. But if, like me, you whiffed or didn’t get a swing at their venture rounds, there may be another angle: private credit. Together’s fresh-out-of-the-oven billion-dollar war chest is (ostensibly) earmarked for infrastructure, so the CoreWeave financing playbook could repeat itself for inference runtimes; i.e., these runtime companies may become the next infrastructure mega-borrowers.
Distributed cloud and edge platforms. DigitalOcean, Cloudflare, Fastly, and Akamai already operate globally distributed compute networks in the request path. If inference becomes latency-sensitive and geographically distributed, those platforms start to look suspiciously like inference clouds or at least the orchestration layer sitting in front of them.
Networking infrastructure. Inference at scale increasingly turns AI infrastructure into a networking problem. Publicly-traded companies like Arista Networks (ANET), Broadcom (AVGO), and Marvell Technology (MRVL) sit in the pipes moving those tokens. (I think.)
And of course, there are dozens of private companies attacking every layer of this stack from every direction.
Some final tokens, From the Porch.
The GPU cloud sells compute. The inference cloud sells tokens.
Tokens are the economic unit of AI.
The stack is reconstituting itself to deliver them.
This is why everyone suddenly wants to be the inference cloud.