Nobody's AI Bill Went Down

Arlo Gilbert

In late 2022, running a query through a GPT-4-class model cost about $20 per million tokens. Today, the same operation costs roughly 40 cents. That's a 50x reduction in three years.

You'd think every CFO in tech would be celebrating. Instead, AI infrastructure budgets are exploding. AWS quietly raised GPU instance prices 15% this quarter. The p5e.48xlarge jumped from $34.61 to $39.80 per hour. And total enterprise spending on inference has grown 320% over the same period that per-token costs cratered.

Those two facts sitting side by side should make you curious, not confused. They're telling the same story.

Jevons called it in 1865

In 1865, an economist named William Stanley Jevons noticed something counterintuitive about coal. England's steam engines had become dramatically more efficient. Everyone expected coal consumption to drop. It didn't. It rose. Faster than before. The efficiency made coal practical for applications where it hadn't been economical, and total consumption went through the roof.

AI inference is living through its Jevons moment right now.

When a single API call costs $20 per million tokens, teams think carefully about what they send to the model. They batch requests, cache aggressively, and build systems that minimize calls. Every token carries weight.

At 40 cents? Nobody thinks about it. The model gets called for every edge case, every validation step, every user interaction. Agents chain dozens of calls together to accomplish a single task. Evaluation suites run thousands of calls per hour. Prototyping means throwing everything at the model and filtering after.

The per-unit cost didn't just drop. The unit of consumption changed. A year ago, a "use case" meant a chatbot answering customer questions. Now it means an agent orchestrating forty model calls per task, running continuously. The denominator moved.
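
One rough way to see how far the denominator moved is to divide the change in spending by the change in price. The sketch below uses only the figures quoted earlier ($20 down to 40 cents per million tokens, and 320% growth in inference spending). It assumes, purely for illustration, that both figures describe the same workloads over the same period, which they may not. The function name is hypothetical.

```python
def implied_volume_multiplier(old_price, new_price, spend_growth_pct):
    """Return how many times token volume must have grown, assuming
    spend = price * volume at both ends of the period."""
    spend_multiplier = 1 + spend_growth_pct / 100  # 320% growth means 4.2x spend
    price_multiplier = new_price / old_price       # fraction of the old price
    return spend_multiplier / price_multiplier


# Figures quoted in this article: dollars per million tokens, percent growth.
multiplier = implied_volume_multiplier(
    old_price=20.00, new_price=0.40, spend_growth_pct=320
)
print(f"Implied token volume growth: {multiplier:.0f}x")
```

Under those assumptions, token volume would have had to grow by a couple of hundred times for spending to rise while prices fell that far. The exact number matters less than the direction: usage outran the discount.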

Where the money actually goes

If you're building an AI product right now, the cost picture has shifted in ways that aren't obvious from a pricing page.

Per-token inference? Plummeting. H100 instances are under $2.50 an hour on specialist providers. A100s are approaching commodity pricing below a dollar an hour. If you're measuring the cost of one model call, everything looks great.

But one model call isn't the product anymore. The product is an agent that makes forty calls to accomplish one task. It's a pipeline that routes through three different models before returning a result. It's a real-time evaluation system that checks every output before it reaches the user.

Multiply those calls by your user count. Multiply again by tasks per user. Multiply again by retry rates. The napkin math diverges from the pricing page fast.
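
That multiplication can be sketched in a few lines. Everything here is illustrative: the user count, tasks per user, tokens per call, and retry rate are made-up inputs, and the function name is hypothetical. Only the 40-cent price and the forty calls per task come from the figures above. The retry model is also an assumption: a fixed fraction of calls is retried exactly once.

```python
def monthly_inference_cost(users, tasks_per_user, calls_per_task,
                           tokens_per_call, retry_rate, price_per_million):
    """Estimate monthly inference spend in dollars.

    Assumes a fraction `retry_rate` of calls is retried exactly once,
    so expected calls scale by (1 + retry_rate).
    """
    calls = users * tasks_per_user * calls_per_task * (1 + retry_rate)
    tokens = calls * tokens_per_call
    return tokens / 1_000_000 * price_per_million


# Hypothetical workload: 10,000 users, 100 tasks each per month,
# 2,000 tokens per call. Only the 40-cent price and the forty calls
# per task come from the article.
shared = dict(users=10_000, tasks_per_user=100, tokens_per_call=2_000,
              price_per_million=0.40)

chatbot = monthly_inference_cost(calls_per_task=1, retry_rate=0.0, **shared)
agent = monthly_inference_cost(calls_per_task=40, retry_rate=0.10, **shared)

print(f"One call per task:    ${chatbot:,.0f} per month")
print(f"Forty calls per task: ${agent:,.0f} per month")
print(f"Ratio: {agent / chatbot:.0f}x")
```

The per-token price is identical in both runs. The bill differs only because the number of calls behind each task changed, which is the part a pricing page never shows.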

I've watched this happen at Osano. Our AI workloads have grown faster than our costs have fallen, and we're not unusual. When we first started building AI features, the question was "can we afford to run this for every customer?" Now it's "can our infrastructure handle running this for every customer?" The constraint moved from economics to architecture. That sounds like progress (and it is), but it doesn't mean the bill went down.

Then there's the hardware side. AWS didn't raise GPU prices on a whim. Demand for GPU compute is outstripping supply. Everyone who got excited about cheap inference is now competing for the same physical machines to deliver it. Tokens got cheaper. The hardware that produces them didn't.

The capital behind all of this

Q1 2026 saw $267.2 billion in venture deal value. More than double the previous quarterly record. A huge chunk of that money is going to companies whose business models assume cheap inference stays cheap. Or, more precisely, whose business models assume that total inference costs will stay manageable.

Those are different assumptions, and the gap between them matters.

Cheap per-token pricing creates a seductive math problem. You look at the unit economics, model out your usage, and conclude that AI costs are a rounding error. But if every company follows that logic at the same time, aggregate demand on compute infrastructure spikes. And the one thing that can't be easily scaled is the supply of physical GPUs sitting in physical data centers.

We're in a world where the ingredient is cheap but the kitchen is expensive. And everyone is trying to cook at once.

What happens next

This isn't a crisis. It's a phase. But it's a phase worth understanding rather than sleepwalking through.

Inference costs will keep falling. That trend line is real. New architectures, quantization techniques like Google's TurboQuant (which just debuted at ICLR 2026), and competition among cloud providers all push per-token costs down. Today's 40 cents per million tokens will probably be 4 cents by next year.

But total AI spending for most companies is going up, not down, for the foreseeable future. The efficiency gains unlock new use cases faster than they reduce the cost of existing ones. That's not a failure of the economics. It's what transformative technology has always done.

I'm curious what happens when the Jevons curve starts to bend. When companies move past the "put AI on everything" phase and start asking where inference actually creates value versus where it just creates activity. When the CFO stops treating AI spend as an innovation budget and starts treating it like cloud costs. A line item that needs to justify itself quarter by quarter.

We're not there yet. Right now, inference is cheap enough to be reckless with. That won't last. Build accordingly.
