Architecting the Multi-Inference Gateway: Routing in the Age of $50/M Token Models

Arlo Gilbert


Anthropic released Claude Fable 5 on Monday. It's a remarkable model. It's also priced at $10 per million input tokens and $50 per million output tokens, which makes it double the cost of Claude Opus 4.8 at $5 and $25 and well above OpenAI's GPT-5.5 at $5 and $30. Anthropic doubled the price of its own flagship in one release cycle.

I run the AI lab at Osano, and we've been building agentic products on these APIs for two years. The pattern I keep seeing, in our own early mistakes and in conversations with other teams, is that developers wire their application directly to whatever the best model is, ship it, and then open the invoice three weeks later. Agentic workloads are output-heavy. A single agent session that plans, writes, critiques, and rewrites can burn a few hundred thousand output tokens. At $50 per million, one session costs more than a nice lunch. Multiply by your user count.
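To make that arithmetic concrete, here is a minimal sketch of per-session cost at the list prices quoted above. The dictionary keys are illustrative labels rather than official API model IDs, and the 200k-input, 300k-output session is a hypothetical example, not a measurement.

```python
# USD per million tokens, taken from the prices quoted in this article.
# The keys are illustrative labels, not official API model identifiers.
PRICES = {
    "claude-fable-5": {"input": 10.00, "output": 50.00},
    "claude-opus-4.8": {"input": 5.00, "output": 25.00},
    "gpt-5.5": {"input": 5.00, "output": 30.00},
}


def session_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one session at per-million-token list prices."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Hypothetical agent session: 200k input tokens, 300k output tokens.
for model in PRICES:
    print(f"{model}: ${session_cost(model, 200_000, 300_000):.2f}")
```

Because output is priced at five times input for Fable 5, the output side dominates the bill for any session that writes and rewrites.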

The capability curve and the cost curve are now moving in opposite directions, and the only sane response is architectural. Your application should not know which model it's talking to. A gateway should decide that, per request, based on what the request actually needs. That layer used to be a nice-to-have. At Fable 5 prices, it's load-bearing infrastructure.

What a gateway actually does

An AI gateway is a proxy that sits between your application and every inference provider you use. The good ones do five jobs. They present one unified API schema so your code never changes when providers do. They enforce governance: virtual keys, per-team budgets, rate limits. They handle retries and fallbacks when a provider degrades. They route requests dynamically based on cost or latency rules. And they log everything for cost attribution.
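The governance job is the easiest one to picture in code. The sketch below is a toy, in-memory version of virtual keys with per-team budgets and a spend ledger. It is not how LiteLLM or Bifrost implement it; every class and field name here is made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


class BudgetExceeded(Exception):
    """Raised when a charge would push a virtual key past its budget."""


@dataclass
class VirtualKey:
    team: str
    budget_usd: float
    spent_usd: float = 0.0


@dataclass
class Governance:
    keys: Dict[str, VirtualKey]
    ledger: List[dict] = field(default_factory=list)

    def charge(self, key_id: str, feature: str, cost_usd: float) -> None:
        """Reject unknown keys and over-budget spend, then log the charge."""
        key = self.keys.get(key_id)
        if key is None:
            raise KeyError(f"unknown virtual key: {key_id}")
        if key.spent_usd + cost_usd > key.budget_usd:
            raise BudgetExceeded(f"{key.team} would exceed ${key.budget_usd:.2f}")
        key.spent_usd += cost_usd
        # The ledger is what later makes per-feature cost attribution possible.
        self.ledger.append({"team": key.team, "feature": feature, "cost_usd": cost_usd})
```

The application holds only the virtual key. The real provider credentials, and the decision about whether a request is allowed to spend money, stay inside the gateway.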

Think of it like the electrical panel in your house. Appliances don't negotiate with the power company. They plug into a panel that handles routing, limits, and safety. Your app code should plug into a gateway the same way.

Two open-source projects dominate this category right now, and they made opposite engineering bets.

LiteLLM is Python. Its superpower is coverage: a unified OpenAI-format interface to 100+ providers, plus an admin dashboard, virtual key management, per-team budgets, and cost tracking backed by Postgres. If a provider exists, LiteLLM probably supports it. The cost of that flexibility is Python itself. The GIL caps single-process concurrency, and LiteLLM's own benchmarks show roughly 8ms of P95 overhead at 1,000 RPS with careful tuning. Under real production spikes, untuned deployments do considerably worse.

Bifrost, from Maxim AI, is Go. It was built for the throughput problem specifically: goroutine-based concurrency, no GIL, native MCP (Model Context Protocol) support, cluster mode, and an adaptive load balancer. Maxim's published benchmarks claim about 9.5x LiteLLM's throughput, with P99 latency around 520ms versus 28,000ms at 500 RPS on identical t3.medium hardware, roughly 11 microseconds of per-request overhead, and 68% less memory. At 1,000 RPS in their tests, LiteLLM exhausted memory and crashed while Bifrost kept serving.

Take vendor benchmarks with the skepticism they deserve. These numbers come from the company that sells Bifrost. But the benchmark code is open source, the hardware is specified, and the directional claim matches what anyone who has run Python and Go services side by side already knows. My read: LiteLLM when your bottleneck is provider coverage and team governance, Bifrost when your bottleneck is requests per second.

The boring OS-level stuff that matters

Whichever gateway you pick, it's a network service, and network services live or die on configuration that nobody blogs about.

Bind privileged ports without running as root. Linux blocks unprivileged processes from binding below port 1024, and the old workaround (run as root) is a security hole wearing a convenience costume. Grant the binary the cap_net_bind_service capability with setcap, or lower net.ipv4.ip_unprivileged_port_start via sysctl. Docker has set that sysctl to 0 inside containers by default since 20.10, so containerized gateways get this for free.

You'll still find advice about tuning native APR (Apache Portable Runtime) in gateway literature. Honest assessment: APR matters if you're fronting your stack with Tomcat or httpd, where the native connector cuts TLS and socket overhead meaningfully. For a modern Go binary like Bifrost, it's irrelevant. Go's netpoller already sits directly on epoll. The optimizations that actually move the needle are simpler. Minimize network hops: the gateway lives in the same AZ as your app servers, ideally as a sidecar or shared regional service, never across a region boundary. Enable SO_REUSEPORT for multi-process listeners. Terminate TLS once instead of re-encrypting at every hop. Every hop you remove is 1-5ms you get back on every single request.
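As a rough illustration of the SO_REUSEPORT point, the sketch below opens a listener with the option set, so several worker processes can each bind the same port. On Linux the kernel then spreads incoming connections across them. It is written in Python for brevity and assumes a platform that exposes socket.SO_REUSEPORT; the function name and backlog value are placeholders.

```python
import socket


def reuseport_listener(host: str, port: int) -> socket.socket:
    """Open a TCP listener that other sockets with SO_REUSEPORT may share."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Must be set before bind(), and on every socket that shares the port.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.listen(128)
    return sock
```

Each worker process would call this with the same host and port, which avoids funneling every accept through a single parent process.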

The math of inference: serverless or dedicated

The gateway gives you the ability to route anywhere. The harder question is where to route. For open-weight workhorse models, you have four serious performance hosts and two billing models: serverless per-token, or dedicated GPU endpoints billed per hour.

Current state of the market:

| Provider | Serverless (example) | Dedicated | Throughput notes |
| --- | --- | --- | --- |
| Fireworks AI | ~$0.90/M flat for Llama 3.3 70B; median blended ~$0.84/M | A100/H100/B200 tiers, custom-priced via sales | 109.5 t/s on DeepSeek V4 Pro per Artificial Analysis |
| Together AI | ~$1.04/M each way for Llama 3.3 70B | $6.49/hr dedicated inference; $3.99/hr reserved 91-180 days | Fastest on DeepSeek V4 Pro at 182.6 t/s |
| DeepInfra | Cheapest per-token of the four on most models | $0.89/hr per A100 up to $4.20/hr per B300 | 28 t/s on DeepSeek V4 Pro, but that's an FP4 quant |
| Novita AI | From $0.08/M input on some models | On-demand and spot GPU instances (H100, H200, RTX 5090) | ~53 t/s average across catalog |

Two things jump out of that table. First, the cheap providers are often serving quantized variants. DeepInfra's 28 t/s DeepSeek number is an FP4 build, which is a different model than Together's full-precision one in every way that matters for hard reasoning tasks. When you comparison-shop per-token prices, confirm the quantization. Second, you pay for speed. Together is the fastest and priciest of the open-model hosts; Novita and DeepInfra are the cheapest and slowest.

Now the breakeven. A Together dedicated endpoint at $6.49/hr runs about $4,740 over a 730-hour month. At Fireworks' $0.90 per million tokens, that same money buys roughly 5.3 billion serverless tokens. To beat serverless on price alone, your dedicated endpoint needs to push an average of about 2,000 tokens per second, around the clock, all month. Most products have nothing close to that sustained utilization.
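The breakeven is simple enough to script and rerun whenever prices change. This is a minimal sketch that assumes a 730-hour average month and a single flat serverless price; the function and field names are illustrative.

```python
HOURS_PER_MONTH = 730  # assumption: average month, 8,760 hours / 12


def breakeven(dedicated_usd_per_hour: float, serverless_usd_per_million: float) -> dict:
    """Return the sustained token rate at which dedicated matches serverless."""
    monthly_cost = dedicated_usd_per_hour * HOURS_PER_MONTH
    # Tokens the same monthly spend would buy on the serverless tier.
    breakeven_tokens = monthly_cost / serverless_usd_per_million * 1_000_000
    seconds_per_month = HOURS_PER_MONTH * 3600
    return {
        "monthly_cost_usd": monthly_cost,
        "breakeven_tokens": breakeven_tokens,
        "tokens_per_second": breakeven_tokens / seconds_per_month,
    }


result = breakeven(dedicated_usd_per_hour=6.49, serverless_usd_per_million=0.90)
print(f"monthly cost: ${result['monthly_cost_usd']:,.0f}")
print(f"breakeven rate: {result['tokens_per_second']:,.0f} tokens/s sustained")
```

With the Together and Fireworks numbers from the table, it lands close to the figures above: about $4,740 a month and about 2,000 tokens per second. Swap in your own hourly rate and per-token price before drawing conclusions.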

So the honest rule: serverless until your traffic is high, steady, and predictable. Dedicated when you hit one of three triggers: sustained utilization above roughly 60% of the endpoint's capacity, hard latency SLOs that serverless multi-tenancy keeps violating, or compliance requirements for isolated infrastructure. That last one comes up constantly in my world. Privacy-sensitive workloads sometimes need dedicated capacity regardless of what the per-token math says.
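Those three triggers can be written down as a small check that runs against your own metrics. This is a sketch only: the 60% threshold comes from the rule above, while the 1% SLO tolerance and all parameter names are assumptions you should replace with your own.

```python
from typing import List


def dedicated_triggers(
    avg_tokens_per_second: float,
    endpoint_capacity_tps: float,
    slo_violation_rate: float,
    needs_isolation: bool,
    utilization_threshold: float = 0.60,  # from the rule of thumb above
    slo_tolerance: float = 0.01,          # assumption: 1% of requests may miss the SLO
) -> List[str]:
    """Return which of the three dedicated-capacity triggers have fired."""
    triggers = []
    if endpoint_capacity_tps > 0:
        utilization = avg_tokens_per_second / endpoint_capacity_tps
        if utilization > utilization_threshold:
            triggers.append("utilization")
    if slo_violation_rate > slo_tolerance:
        triggers.append("latency_slo")
    if needs_isolation:
        triggers.append("compliance")
    return triggers
```

An empty list means stay serverless. Any non-empty list is a reason to price out a dedicated endpoint, and "compliance" alone can settle the question regardless of the other two.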

The blueprint: dynamic routing

Here's the routing strategy we've converged on, in three layers.

Layer 1: semantic routing. Classify the request before you pick a model. You don't need a classifier model for most of this; intent metadata from your own application is free and accurate. Formatting, extraction, and admin tasks go to a cheap open-weight model. Multi-step reasoning, code architecture, and anything customer-visible with legal exposure escalates to Fable 5. The configuration looks like this:

# litellm-style router config
model_list:
  - model_name: cheap-fast
    litellm_params:
      model: openai/llama-3.3-70b
      api_base: https://api.novita.ai/v3/openai
  - model_name: workhorse
    litellm_params:
      model: fireworks_ai/deepseek-v4-pro
  - model_name: frontier
    litellm_params:
      model: anthropic/claude-fable-5

router_settings:
  routing_strategy: usage-based-routing-v2

# app-level intent map
# extraction, formatting, classification -> cheap-fast
# summarization, drafting, RAG synthesis -> workhorse
# multi-step reasoning, agent planning  -> frontier

In our internal traffic at Osano, somewhere around 70% of requests never needed a frontier model in the first place. They were schema mapping, summarization, classification. Routing those to a $0.30/M model instead of a $50/M-output one is the single largest cost lever available to you, bigger than caching, bigger than prompt golf.
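On the application side, the intent map from the config comments can be sketched as a lookup that returns one of the three model aliases. The intent labels, the legal_exposure flag, and the choice to send unknown intents to the workhorse tier are all assumptions made for this example.

```python
# Maps application-supplied intent labels to the gateway aliases defined in
# the router config (cheap-fast, workhorse, frontier). Labels are illustrative.
INTENT_TO_MODEL = {
    "extraction": "cheap-fast",
    "formatting": "cheap-fast",
    "classification": "cheap-fast",
    "summarization": "workhorse",
    "drafting": "workhorse",
    "rag_synthesis": "workhorse",
    "multi_step_reasoning": "frontier",
    "agent_planning": "frontier",
}

DEFAULT_MODEL = "workhorse"  # assumption: unknown intents get the middle tier


def pick_model(intent: str, legal_exposure: bool = False) -> str:
    """Choose a gateway model alias from intent metadata."""
    if legal_exposure:
        # Customer-visible output with legal exposure always escalates.
        return "frontier"
    return INTENT_TO_MODEL.get(intent, DEFAULT_MODEL)


print(pick_model("extraction"))
print(pick_model("drafting", legal_exposure=True))
```

The returned alias is what the application would send as the model name in its request to the gateway, so the application never names a concrete provider or model version.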

Layer 2: latency-based failover. Providers have bad days. Serverless endpoints rate-limit you exactly when your traffic spikes, because everyone else's traffic spiked too. Your gateway should track rolling P95 latency per provider and shift traffic automatically when a provider degrades past a threshold, then shift back when it recovers. Both LiteLLM and Bifrost support this natively; Bifrost's adaptive load balancer does it without manual weights.
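To show the mechanism rather than any particular gateway's implementation, here is a minimal rolling-P95 tracker. The window size, the 3,000 ms threshold, and the minimum sample count are placeholder values, and the class is hypothetical.

```python
from collections import deque
from typing import Deque, Dict, Optional, Sequence


class LatencyTracker:
    """Track a rolling latency window per provider and flag degraded ones."""

    def __init__(self, window: int = 200, threshold_ms: float = 3000.0,
                 min_samples: int = 20):
        self.window = window
        self.threshold_ms = threshold_ms
        self.min_samples = min_samples
        self.samples: Dict[str, Deque[float]] = {}

    def record(self, provider: str, latency_ms: float) -> None:
        """Add one observation; the deque drops the oldest beyond the window."""
        self.samples.setdefault(provider, deque(maxlen=self.window)).append(latency_ms)

    def p95(self, provider: str) -> Optional[float]:
        """Nearest-rank P95 over the window, or None with too few samples."""
        data = sorted(self.samples.get(provider, ()))
        if len(data) < self.min_samples:
            return None
        rank = (95 * len(data) + 99) // 100  # ceil(0.95 * n) in integer math
        return data[rank - 1]

    def healthy(self, provider: str) -> bool:
        """A provider with no evidence yet is treated as healthy."""
        value = self.p95(provider)
        return value is None or value <= self.threshold_ms

    def pick(self, preferred: Sequence[str]) -> str:
        """Return the first healthy provider, else the last one as a backstop."""
        for provider in preferred:
            if self.healthy(provider):
                return provider
        return preferred[-1]
```

One caveat with this simple version: a provider that has been routed around receives no new samples, so it cannot recover on its own. A real deployment would need to keep sending it a trickle of probe traffic or expire old samples by age.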

Layer 3: the fallback chain. Order your providers cheapest-first with a reliability backstop:

{
  "fallbacks": [
    { "primary": "novita/llama-3.3-70b",
      "on": ["429", "timeout>3000ms", "5xx"] },
    { "secondary": "deepinfra/llama-3.3-70b",
      "on": ["429", "timeout>3000ms", "5xx"] },
    { "tertiary": "fireworks/llama-3.3-70b",
      "note": "most expensive, most consistent throughput" }
  ],
  "retry_policy": { "max_retries": 2, "backoff": "exponential" }
}

Novita takes the bulk of traffic at the lowest price. DeepInfra catches overflow. Fireworks is the expensive backstop that's there when you need the speed. Same model weights at every tier, so output quality stays constant while cost floats with conditions. The failure mode you're designing against is paying Fireworks prices for Novita-grade requests, or worse, returning 500s to users because one provider had a bad hour.
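The chain's behavior can be sketched in a few lines of Python: retry each provider with exponential backoff on 429s, timeouts, and 5xx responses, then move to the next one. The ProviderError class and the callables standing in for provider clients are hypothetical; a real gateway does this for you.

```python
import time
from typing import Callable, Optional, Sequence, Tuple


class ProviderError(Exception):
    """Hypothetical error carrying the HTTP status a provider returned."""

    def __init__(self, status: int):
        super().__init__(f"provider returned HTTP {status}")
        self.status = status


def is_retryable(status: int) -> bool:
    return status == 429 or 500 <= status < 600


def call_with_fallbacks(
    chain: Sequence[Tuple[str, Callable[[str], str]]],
    prompt: str,
    max_retries: int = 2,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, str]:
    """Try providers in order; return (provider_name, response)."""
    last_error: Optional[Exception] = None
    for name, call in chain:
        for attempt in range(max_retries + 1):
            try:
                return name, call(prompt)
            except TimeoutError as exc:
                last_error = exc
            except ProviderError as exc:
                if not is_retryable(exc.status):
                    raise  # e.g. a 400 will fail the same way everywhere
                last_error = exc
            if attempt < max_retries:
                sleep(base_delay_s * 2 ** attempt)  # exponential backoff
    raise RuntimeError("all providers in the chain failed") from last_error
```

The chain is passed cheapest-first, matching the config above. Non-retryable errors are raised immediately, since sending a malformed request to a more expensive provider only costs more.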

Do this week

Stop integrating directly with model APIs. Put a gateway in the path and let it make the per-request decision. Five things to do before Friday:

  1. Pull 30 days of token usage and bucket every request type by what it actually required. If you can't produce this report, that's finding number one.
  2. Deploy a gateway in shadow mode. Proxy traffic through LiteLLM or Bifrost without changing routing, just to get unified logging and per-feature cost attribution.
  3. Move your three highest-volume, lowest-complexity request types to an open-weight model on a cheap serverless host. Measure quality with evals, not vibes.
  4. Configure one fallback chain for your workhorse model across at least two providers, and test it by deliberately breaking the primary.
  5. Run the dedicated-vs-serverless breakeven against your real traffic. If you're under roughly 60% sustained utilization, stay serverless and revisit quarterly.
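For the first item, the report can start as small as the sketch below: group usage records by request type and rank them by spend. The record fields and the price table are a hypothetical schema; adapt them to whatever your provider exports or your gateway logs actually contain.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def spend_by_request_type(
    records: Iterable[dict],
    prices: Dict[str, Dict[str, float]],
) -> List[Tuple[str, dict]]:
    """Aggregate requests, output tokens, and USD cost per request type.

    Each record is assumed to carry: request_type, model, input_tokens,
    output_tokens. Prices are USD per million tokens, keyed by model.
    """
    totals = defaultdict(lambda: {"requests": 0, "output_tokens": 0, "cost_usd": 0.0})
    for record in records:
        price = prices[record["model"]]
        cost = (
            record["input_tokens"] * price["input"]
            + record["output_tokens"] * price["output"]
        ) / 1_000_000
        row = totals[record["request_type"]]
        row["requests"] += 1
        row["output_tokens"] += record["output_tokens"]
        row["cost_usd"] += cost
    # Most expensive request types first: these are the routing candidates.
    return sorted(totals.items(), key=lambda item: item[1]["cost_usd"], reverse=True)
```

The request types that are high in volume and low in complexity yet still billed at frontier prices are the first candidates for item 3.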

Fable 5 is worth $50 per million output tokens for the requests that need it. The entire game is making sure those are the only requests that pay it. Go build the panel.
