THE STACK · MODEL ECONOMICS · REFERENCE · Published June 25

Saving LLM costs at the team level, June 2026: every lever an engineering org can pull, and the governance that makes them stick.

Inference list prices keep falling — and team AI bills keep climbing, because the savings get eaten by the agent loop, context bloat, and the premium-model-by-default habit. Individual prompt tricks do not fix that. What fixes it is a small set of architectural levers applied org-wide, plus the FinOps discipline to see the bill, own it, and cap it. This is the catalogue: what each lever does, its verified impact, its effort, and the governance scaffolding underneath.

98% of FinOps teams now manage AI spend
~15x tokens: multi-agent vs single chat
10x cached vs uncached input, same model
TL;DR 30-second version · free
  1. 01 The three highest-leverage architectural levers are the same for almost every team: prompt caching (Anthropic 90% off cached input reads, OpenAI/Gemini up to ~90% off, automatic), model routing/right-sizing (cheap model for easy traffic — a 5x to 12x per-token gap between flagship and economy tiers), and batch APIs (a flat 50% off at OpenAI, Anthropic, Bedrock, and Google for any non-interactive workload). None require a research team; all are mostly configuration.
  2. 02 The agent loop is the team-level cost story of 2026. Anthropic's own data: agents use ~4x more tokens than a chat turn and multi-agent systems ~15x, and token volume alone explains ~80% of the performance variance. The levers that matter here are keeping the prompt prefix byte-stable so the KV cache survives (cached vs uncached input is a literal 10x), capping steps and per-run spend, and using a cheap planner with cheap executors.
  3. 03 Levers leak without governance. The load-bearing fact: no major hosted vendor ships an instant hard spend cap — OpenAI budgets are notifications, not enforcement — so a runaway loop can outspend a budget before anyone sees the alert. Real caps live in a gateway (LiteLLM) or a wired budget-to-IAM action (AWS Budgets). Pair that with per-feature cost tagging and a named cost owner per team, or the savings evaporate within a quarter.
DEEP ANALYSIS · free while in beta
READING AS
FOR YOU

If you ship AI features, your leverage is the configuration-grade levers — caching, routing, batch, structured outputs — long before anyone negotiates a contract. These are yours to pull this sprint; do them before reaching for anything exotic.

The order of operations

  • 1. Caching (this week) Static prefix first, dynamic last. Verify cache hits. 30–60% input-token reduction on prefix-heavy apps. Anthropic 90% off reads (mind the write premium); OpenAI/Gemini automatic up to ~90%.
  • 2. Structured outputs (this week) Force a strict schema to kill parse-failure retries. 100% adherence vs <40% on the old model. Low effort, fast payback.
  • 3. Batch the async work (this month) Evals, embeddings, bulk classification, summarise-at-rest -> batch APIs, flat 50% off, stacks with caching.
  • 4. Route (this quarter) Cheap tier for the easy 60–80%, premium for the rest. 5x–12.5x per-token gap between tiers. The recurring cost is the eval set that proves the cheap tier is good enough.

What to measure

Stop measuring tokens-per-call; measure tokens-per-user-request — multiply by the agent fan-out. If you make 15 calls per request, your true unit cost is 15x what a per-call dashboard shows.

Then track three ratios: percentage of input tokens served from cache (low = lever unused), retry rate (over ~2% = structured outputs not enforced), and share of traffic on the cheap tier (near zero = routing not built).

FOR YOU

If you own the shared AI platform layer, your job is to build the cost controls once so every feature team is not solving them in isolation. The answer is a gateway: cache, route, observe, cap — behind one front door.

Build once, for everyone

  • Cache-and-route gateway A single proxy in front of all vendor calls — prompt caching, model routing, fallback, retries. LiteLLM, Portkey, or custom on the vendor SDKs. Adds <50ms. The highest-leverage platform investment for AI-heavy products.
  • Hard budgets at the gateway Per-key/per-team/per-feature max_budget with real-time 429 on breach — the only true cap, since vendor budgets are notifications. Note the known LiteLLM edge case: user-level budgets may not enforce when a key belongs to a team; test your config.
  • Per-feature cost tags Stamp feature_id + team on every call at the call site and aggregate. Without it, OpenAI/Anthropic spend is unattributable — their metadata does not reach the cost dashboard.
  • Eval on batch + KV-cache reuse Route CI evals through batch APIs (50% off); for self-hosted, one flag (vLLM enable_prefix_caching, SGLang RadixAttention) reuses shared-prefix KV for big prefill savings.

What not to build

Do not build your own vector DB or observability platform unless you have genuinely unusual requirements — Pinecone, Weaviate, Qdrant, pgvector, Helicone, Langfuse are all production-grade. That engineering time is better spent on the gateway and on the agent-loop multiplier, which is where the real cost lives.

Do not self-host models to save money until utilisation justifies it. A serverless open-model endpoint at $0.05–$2/Mtok beats an idle GPU farm (30–40% real utilisation) on both cost and ops for most teams.

FOR YOU

The team-level question is ownership: who is accountable for AI unit economics, what do they watch, and what ritual keeps it from drifting. If cost rolls up only to finance, your engineers have no incentive — so the cost owner sits inside the team.

Stand up the governance loop

  • Name a cost owner One person per team accountable for the AI-feature P&L. Visibility -> accountability -> controls is the maturity ladder; it stalls without an owner.
  • One per-feature dashboard Inference + retrieval + observability + eval, allocated by feature. 1–2 weeks to assemble; the visibility itself is decisive — money-losing features cannot hide once their full cost is on one screen.
  • Cost gates in CI promptfoo cost/latency assertions or Braintrust merge-blocking evals so a prompt or model change that doubles cost cannot ship silently. Gate on averages/percentiles, not single calls.
  • A monthly cost review Walk the per-feature dashboard, the cache hit rate, the cheap-tier traffic share, and the agent fan-out. Treat regressions like any other reliability regression.

The contract levers (with finance)

Beyond the engineering levers, three procurement moves move the bill: volume-commit discounts (typically 10–40% off list, verify your contract is at market), reserved capacity for steady predictable volume (Bedrock provisioned throughput, Azure PTUs, Vertex GSUs — all bill regardless of utilisation, so wrong for bursty traffic), and simply asking engineering for “% of input tokens served from cache” as a standing KPI. That last one is free and routinely surfaces a 20–40% unrealised saving.

FOR YOU

For founders the cost question is really a pricing question. If you priced flat-rate against a “one call per request” mental model and your product has any agentic component, you have a structural unit-economics problem that no amount of caching fully fixes.

Pricing before optimisation

Decide the pricing model deliberately: usage-based (variable cost passes through, margin stable), credit-based (some predictability for the customer, you absorb spikes), or flat-rate (you carry heavy-user risk). Flat-rate with no usage cap on agentic workflows is the structural-loss option — it is why Cursor, Replit, and others moved to usage-based.

The heavy-user trap is the one to internalise: the top ~5% of users typically drive 50%+ of agentic workload, so an average price that covers average cost can still lose money on the tail faster than it earns on the rest. Cap usage explicitly, and instrument per-user fan-out before you set the price, not after the first surprising invoice.

The tailwind to re-price against Inference list prices fall ~10x/year for equivalent quality (a16z: GPT-3-equivalent quality went from ~$60/Mtok in 2021 to ~$0.06 by 2024, a 1,000x drop). The model you priced against 12 months ago is almost certainly overpriced for its tier today — re-baseline pricing and model choice on a schedule.

The lever catalogue, roughly ordered by leverage-per-unit-effort. The first five are mostly configuration; the back half is real engineering.

Cache

Prompt / context caching

Vendor-side discount on a stable prompt prefix. Anthropic: 0.1x base on cached reads (90% off), 1.25x–2x write premium. OpenAI & Gemini: up to ~90% off, automatic, no write premium. The single highest-leverage lever.

config
Route

Model routing / right-sizing

Cheap model for easy traffic, premium for hard. The per-token gap is real: Opus 4.8 to Haiku 4.5 is 5x; GPT-5.4 to nano is ~12.5x on input. RouteLLM, Martian, Not Diamond, or a heuristic classifier.

config
Batch

Batch / async APIs

Flat 50% off at OpenAI, Anthropic, Bedrock, and Google for non-interactive work — evals, backfills, embeddings, bulk classification. Stacks with caching. Useless for live UX.

config
Output

Structured outputs

Force schema-valid output so you stop paying for retry/repair calls. OpenAI Structured Outputs hit 100% schema adherence vs <40% on the old model. Saving is the eliminated retry, not the call itself.

config
Context

RAG context discipline

You pay input tokens for every retrieved chunk on every call. top-k=20 vs 5 is ~4x the retrieved-context bill — and padding context can lower accuracy (lost-in-the-middle). Over-retrieve cheap, rerank, pass only the top few.

config
Agents

Agent-loop control

Keep the prefix byte-stable so the KV cache survives (cached vs uncached input = 10x), cap max steps + per-run spend, detect no-progress, and split a cheap planner from cheap executors. The biggest team-level amplifier when present.

engineering
Cache

Semantic response cache

Cache whole responses by embedding similarity for repetitive Q&A. Measured in the GPT Semantic Cache paper at up to 68.8% fewer API calls, >97% positive-hit accuracy. ~3% staleness risk — wrong for personalised or real-time answers.

engineering
Compress

Prompt compression

LLMLingua-family compression of long context: 2x–5x realistic (20x is a ceiling), 1.6x–2.9x faster. Only helps when input tokens dominate — long-context RAG, heavy few-shot.

engineering
Distill

Distillation / fine-tuning

Big teacher generates data, fine-tune a cheap student, serve the student. Ceiling = the tier price gap. Highest effort here (dataset + fine-tune + eval + re-distill on drift). Best for high-volume, stable, narrow tasks.

engineering
Self-host

Open-weight serving

vLLM / SGLang on your own GPUs. vLLM 2x–4x throughput vs prior serving; SGLang up to 6.4x on shared-prefix work. Only beats serverless open-model endpoints ($0.05–$2/Mtok) at sustained high utilisation.

engineering
Govern

Per-feature cost allocation

Tag every call with feature/team/environment. Vertex (labels) and Bedrock (inference profiles) attribute natively; OpenAI and Anthropic do not roll request metadata into cost dashboards — you log token usage and price it yourself.

governance
Govern

Hard spend caps (gateway)

No hosted vendor ships an instant hard cap — OpenAI budgets are notifications. Real caps come from a gateway (LiteLLM per-key/team max_budget, real-time 429) or a wired AWS Budgets action that denies bedrock:* on breach.

governance

Saving on LLM cost at the team level is not one optimisation — it is a short stack of independent levers, each acting on a different part of the bill, plus the governance that keeps them applied. The mistake is treating it as a prompt-engineering problem owned by whoever wrote the prompt. It is an architecture-and-process problem owned by the team. The math below uses current (June 2026) vendor list prices and the primary-sourced impact ranges; your contract and traffic mix will move the absolute numbers.

BEFORE
The implicit “just write better prompts” model
  • Cost = tokens x price, so shorter prompts = the main lever
  • One model (the best one) for everything; switching is risky
  • Caching is a vendor detail the SDK handles for us
  • Agents are chat with tools — roughly the same cost shape
  • Eval and dev usage are tooling, not a production cost line
  • The bill is finance’s problem; engineers ship features
AFTER
The team-level reality
  • Cost = (cached vs uncached input) x (model tier) x (agent fan-out) x (retry rate) — four multipliers, only one is prompt length
  • A routed cascade serves 60–80% of traffic on a 5–12x cheaper model with most quality intact
  • Cache survival is a design constraint: a timestamp in the system prompt costs you the 90% discount on every call
  • An agent turn is ~4x a chat turn; a multi-agent run ~15x — fan-out, not prompt size, dominates
  • Continuous eval can equal or exceed live-traffic inference; it needs its own budget line
  • Cost is owned at the team level with a named owner, a per-feature dashboard, and a hard cap — or it drifts back up

The right question is not “how do we shorten prompts.” It is “which of the four multipliers is biggest in our stack, which lever attacks it, and who owns keeping it applied.” For most teams the order is: turn on caching, route the easy traffic, cap the agent loop, then instrument and govern so none of it regresses.

DEEP READ 6 sections · cited primary sources · technical review pending

01 The cache layer — the 90% discount most teams half-use

Prompt caching is the highest-leverage lever and the most commonly left on the table. The mechanism: a stable prompt prefix (system prompt, tool definitions, few-shot examples, a long retrieved document reused within a session) is cached at the vendor, and subsequent calls that reuse that exact prefix pay a steep discount on those input tokens. The catch that trips teams up is byte-stability — the cache keys on an exact prefix match, so a single changing token near the front (a timestamp, a request ID, a session counter) invalidates everything after it and forces a full-price recompute.

The per-vendor shape differs in a way that changes your break-even. Anthropic charges a write premium: cached reads cost 0.1x of base input (a 90% discount) but you pay 1.25x to write a 5-minute cache entry or 2x for a 1-hour entry — so Anthropic caching pays off only once the entry is read at least once (5-min) or two-to-three times (1-hour). OpenAI and Google cache automatically with no write premium: you simply pay the discounted read (OpenAI up to 90% off cached input on the GPT-5.x line above a 1,024-token threshold; Gemini ~90% off on 2.5+ models), so they pay off on the first hit. Azure matches OpenAI and can reach up to 100% off cached input under provisioned throughput.

The design rule that makes this real at the team level: structure every prompt as static-prefix-first, dynamic-content-last, and make it a review-checklist item. Most teams measure their cache hit rate at 5–15% on first look and move it to 40–70% with deliberate attention. That is a 30–60% reduction in input-token cost for prefix-heavy apps, for hours of work.

  • Anthropic Cached read 0.1x base input (90% off). Write premium 1.25x (5-min) / 2x (1-hr). Min cacheable 1,024 tokens on Opus 4.8 / Sonnet 4.6. Opt-in via cache_control.
  • OpenAI Up to 90% off cached input, automatic, no write premium. 1,024-token threshold then 128-token increments. ~5–10 min idle TTL, extendable.
  • Google Gemini ~90% off cached input, implicit on 2.5+ models plus an explicit cache API. Explicit cache adds $1.00/Mtok/hr storage. No read-write premium.
  • The cache-killer Any per-call variability in the prefix — timestamps, IDs, counters — voids the cache from that token onward. Push all dynamic content to the end of the prompt.

02 Routing and right-sizing — stop sending easy work to the flagship

The premium-model-by-default habit is, for most teams, the largest unforced cost. The per-token gap between tiers is steep: Anthropic Opus 4.8 ($5/$25 per million in/out) to Haiku 4.5 ($1/$5) is a 5x drop; OpenAI GPT-5.4 ($2.50/$15) to nano ($0.20/$1.25) is ~12.5x on input. A routed cascade sends easy queries to the cheap tier and escalates only the hard ones, using either a heuristic classifier or a small judge model to make the call.

How much that saves depends entirely on your traffic and how you measure. RouteLLM (the open Berkeley/LMSYS router) reports up to 85% cost reduction while preserving 95% of GPT-4 quality — but that headline is MT-Bench-specific; on the same blog the same 95%-quality target yields only 45% savings on MMLU and 35% on GSM8K. So treat 85% as a favourable-benchmark ceiling, not an expected value. The durable practitioner estimate is that 60–80% of production queries are simple enough for a small model; the savings track how much of your traffic that actually is.

Right-sizing has one trap worth naming: newer tokenizers can encode the same text into more tokens (Anthropic notes up to ~35% more on recent models), so compare tiers on tokens-for-your-actual-text, not on the sticker rate. And the recurring cost of routing is not the router — it is maintaining a representative eval set so you can prove the cheap tier is good enough as models change. That eval set is the real asset.

CAVEAT RouteLLM’s 85% is the best case on MT-Bench; the cross-benchmark range at equal quality is roughly 35–85%. Uniformly-hard traffic erases the savings entirely — routing only helps to the degree your traffic is mixed.
PRIMARY SOURCE RouteLLM (LMSYS) results

03 Batch, structured outputs, compression — the input and retry levers

Three more configuration-grade levers. First, batch APIs: a flat 50% discount at OpenAI, Anthropic, Bedrock, and Google for asynchronous workloads with a ~24-hour SLA. Every eval run, embedding backfill, bulk classification, content-moderation sweep, and summarise-at-rest job belongs here, and the discount stacks with prompt caching. Most teams pay full realtime price on these because the batch migration is perennially next-quarter.

Second, structured outputs. Forcing model output through a strict schema (OpenAI Structured Outputs, JSON Schema, tool-calling, Outlines, Instructor) eliminates the retry-and-repair loop that quietly re-bills a full inference call each time a parse fails. OpenAI's own evals show 100% schema adherence versus under 40% on the prior model; the dollar saving is your old failure rate times the cost of each wasted retry. It guarantees structural, not semantic, correctness — but the retry compression is real and the effort is low.

Third, prompt compression for the cases where input tokens genuinely dominate — long-context RAG, heavy few-shot, big retrieved documents. The LLMLingua family (Microsoft Research) compresses context with small-model token pruning: 2x–5x is the production-realistic range (LLMLingua-2), with a 20x ceiling only on the most favourable long-document benchmarks. It does nothing when your prompts are already short, so reach for it last among the three.

  • Batch (50% off) OpenAI, Anthropic, Bedrock, Google — non-interactive only, ~24h SLA. Stacks with caching. Move evals, embeddings, bulk classification here.
  • Structured outputs 100% schema adherence vs <40% prior. Saving = eliminated retries. Low effort, fast payback.
  • Prompt compression LLMLingua-2: 2x–5x realistic, 1.6x–2.9x faster. Only when input tokens dominate; 20x is a ceiling not an average.
PRIMARY SOURCE OpenAI Structured Outputs

04 The agent multiplier — the team-level cost story of 2026

Everything above is table stakes. The cost story that actually separates teams in 2026 is the agent loop. Because every step re-sends the whole growing transcript and every tool call is another inference, agent token usage grows fast: Anthropic's published data puts agents at about 4x the tokens of a chat interaction and multi-agent systems at about 15x, and finds that token usage by itself explains roughly 80% of the performance variance across their research-agent runs. Capability is bought with tokens — which is exactly why the loop has to be governed rather than left open.

The first agent lever is cache survival. Manus, which runs production agents, calls the KV-cache hit rate “the single most important metric for a production-stage AI agent,” because cached versus uncached input on Claude is $0.30 versus $3.00 per million tokens — a literal 10x — and “even a single-token difference can invalidate the cache from that token onward.” The implication is concrete: keep the agent's context append-only and the prefix byte-stable, and do not inject volatile fields into the system prompt.

The second is bounding the loop. Hard max-iteration caps, a per-run token-or-dollar budget, and no-progress detection are the cheapest insurance against the runaway loop that quietly bills hundreds of dollars on a single overnight task. The third is a planner/executor split — a capable lead model plans and condenses, cheaper subagents do the legwork and return short summaries — which is the pattern Anthropic credits for an Opus-lead/Sonnet-worker system beating single-model Opus by ~90% on their internal research eval. Note the tension to manage: aggressive history-compaction rewrites the prefix and fights prompt caching, so compact at task boundaries rather than every turn.

CAVEAT The 4x / 15x figures are Anthropic’s own numbers from their multi-agent research system and reflect that workload; your fan-out depends on tool latency, retry policy, and how aggressive the agent is. Agent per-task dollar figures circulating online are directional only — benchmark your own loops with per-run span aggregation.

05 FinOps for AI — see it, own it, cap it

Levers leak without governance, and governance for AI spend has become mainstream fast: the FinOps Foundation's State of FinOps 2026 (n=1,192 practitioners) reports 98% of FinOps practices now manage AI spend, up from 63% in 2025 and 31% in 2024. The framework's recommended unit metrics are cost-per-token, cost-per-inference, and cost-per-API-call, with the sharper idea being “goodput” — cost per usable output that met its SLO, not per raw token.

Attribution is where vendors diverge sharply, and it determines whether a per-feature dashboard is even possible. Google Vertex attributes per-request via labels into Cloud Billing/BigQuery; AWS Bedrock attributes at day-level via application inference profiles and cost-allocation tags. OpenAI and Anthropic do not roll request metadata into their cost dashboards at all — their metadata fields are for your own logging and abuse-tracking — so on those vendors you attribute by isolating into Projects/Workspaces with scoped keys and by logging token usage yourself and pricing it. Build that per-feature view: features that lose money cannot keep shipping once their full cost is assembled on one dashboard.

The single load-bearing governance fact: no major hosted vendor ships an instant hard spend cap. OpenAI's project budget is explicitly a soft notification that “does not enforce a hard cap.” Anthropic offers workspace limits plus rate limits (the rate limit is the firmer realtime throttle). The closest thing to a true cap is AWS Budgets budget-actions applying an IAM/SCP deny on bedrock:* at breach — though cost data lags hours — or, more practically, an LLM gateway like LiteLLM that enforces a per-key/per-team max_budget and returns a real-time 429 the instant a budget is hit. If a runaway loop can outspend your budget before the alert email arrives, you do not have a cap; you have a postmortem.

  • Allocate Tag every call with feature / team / environment. Vertex + Bedrock attribute natively; OpenAI + Anthropic need Projects/Workspaces + your own token logging.
  • Observe Helicone, Langfuse, LangSmith, or native Costs APIs. Most report estimated (price x tokens) cost, not invoiced — fine for trend, not for reconciliation.
  • Separate eval spend Run evals under a dedicated key/project on batch APIs. Continuous eval can equal or exceed live-traffic inference; it needs its own budget line.
  • Cap hard Gateway (LiteLLM max_budget + 429) or AWS Budgets IAM-deny action. Vendor budgets alone are alerts, not enforcement.
  • Gate in CI promptfoo cost/latency assertions or Braintrust merge-blocking evals so a prompt/model change that doubles cost cannot ship silently. Gate on averages/percentiles — token counts flake single-call gates.

06 Self-hosting vs serverless open models — where the build/buy line actually is

Self-hosting open-weight models is the lever teams reach for too early. The serving stacks are genuinely good — vLLM (PagedAttention) delivers 2x–4x throughput over prior serving systems, SGLang (RadixAttention) up to 6.4x on shared-prefix workloads — and the open-weight options are strong and permissively licensed (gpt-oss-120b under Apache 2.0 runs on a single 80GB GPU; Qwen, DeepSeek, GLM, Mistral Small all usable commercially). The economics, though, hinge on one number: $/token = GPU $/hour ÷ tokens/hour produced, and the denominator only gets good with continuous batching at sustained high utilisation. Idle and batch-1 serving is the worst economics there is, and real-world GPU utilisation of 30–40% silently multiplies your napkin $/token by 2.5–3x.

The move most teams miss is that the real competitor to self-hosting in mid-2026 is not the $3–$15/Mtok flagship API — it is the serverless open-model endpoint at $0.05–$2/Mtok with zero ops. The same model varies wildly by host (Llama 3.3 70B runs $0.10/$0.32 on one provider versus $1.04/$1.04 on another, per million in/out), so shopping hosts is itself a lever. The pragmatic hierarchy: flagship closed API for the hard 20% → serverless open-model endpoint for the bulk → dedicated endpoint → full self-host only once utilisation is genuinely high and sustained.

As a directional gate (verify against your own volumes): under ~100k requests/day or ~$50k/year of API spend, stay on managed APIs; $50k–$500k/year is hybrid territory; past ~1M requests/day or ~$500k/year, owned serving starts to win — but only after you account for the idle-GPU tax and the engineering cost of running it.

CAVEAT GPU hourly prices (RunPod H100 ~$3.29/hr, A100 ~$1.49/hr; Lambda H100 ~$3.99–$4.29/hr, June 2026) and serverless endpoint prices are verified; the break-even thresholds are directional rules-of-thumb. Compute your own $/Mtok from a verified GPU price x your measured throughput before committing capital.
PRIMARY SOURCE vLLM PagedAttention paper

Seven expensive-default anti-patterns that recur across teams. Severity reflects how much budget each one typically burns before someone notices, not absolute risk.

  1. 01 HIGH

    Premium-model-by-default

    Every call routed to the flagship because it is the safe choice. With a 5x (Opus-to-Haiku) to ~12.5x (GPT-5.4-to-nano) per-token gap and an estimated 60–80% of traffic simple enough for a small model, this is usually the largest single unforced cost. The reason it persists is that nobody owns proving the cheap tier is good enough.

    DO Build a representative eval set for your top workflows, then route: heuristic or judge-model classifier sends easy queries to the economy tier, escalates the rest. Measure the share of traffic served cheap — if it is near zero, routing has not actually been built.
  2. 02 HIGH

    Cache-hostile prompts

    A timestamp, request ID, or session counter sitting near the front of the system prompt voids the vendor cache from that token onward, so the team pays full price on a prefix that is 95% stable. This is the most common reason a 90%-off discount shows up as a 5% saving.

    DO Audit your top 10 production prompts. Move every dynamic field to the end; keep the system prompt, tool definitions, and few-shot block static and first. Track “% of input tokens served from cache” as a KPI — typical first read is 5–15%, target 40–70%.
  3. 03 HIGH

    The unbounded agent loop

    An agent with no step cap and no per-run budget can re-send a growing transcript dozens of times and quietly bill hundreds of dollars on a single task — the failed-but-expensive runs are the tail that blows up the monthly bill. Agents already use ~4x chat tokens and multi-agent ~15x before any runaway.

    DO Set a hard max-iteration cap, a per-run token/dollar budget, and no-progress detection on every agent. Keep context append-only so the KV cache survives (a 10x swing on input). Wrap each loop in per-user-request span aggregation so the true fan-out is visible.
  4. 04 MEDIUM

    Context bloat from RAG with no discipline

    Stuffing 15–25k tokens of retrieved context into every query because the vector store returned that many chunks. Cost scales linearly with context size, attention is O(N^2), and benefit typically plateaus past 5–8k tokens of relevant retrieval — padding can even lower accuracy (lost-in-the-middle). Nobody is auditing retrieval quality, so context only grows.

    DO Cap retrieved context (top-k bound or token-bounded), over-retrieve cheap then rerank and pass only the top few chunks to the expensive model. Measure answer quality at different retrieval depths; in most apps it plateaus well before the cap teams set.
  5. 05 MEDIUM

    Eval cost cannibalising production

    Continuous eval pipelines run thousands of LLM calls a day; for mature teams eval inference can equal or exceed live-traffic inference. When eval and production share one vendor contract, eval growth quietly eats production capacity (or pushes the bill up) with no explicit budget conversation.

    DO Run evals under a dedicated key/project so the spend lands on its own line, and move them to batch APIs (50% off). Right-size the suite — not every PR needs the full regression — and use a cheaper judge model where correlation with human grading holds.
  6. 06 MEDIUM

    Retry storms and no backoff

    Naive retry logic answers a 429 with an immediate retry, multiplying cost (and load) unbounded under exactly the conditions where the system is already strained. Each retry is a full re-billed call.

    DO Exponential backoff with jitter, a per-run retry budget, and circuit breakers. Pair structured outputs (to cut parse-failure retries) with a hard spend guardrail so a storm cannot run away.
  7. 07 HIGH

    No hard cap and no per-feature visibility

    Vendor budgets are notifications, not enforcement — OpenAI explicitly does not enforce a hard cap — and request metadata does not roll into OpenAI/Anthropic cost dashboards. So a runaway can outspend the budget before the alert lands, and no individual feature P&L ever sees its full cost because the layers sit in different cost centres.

    DO Put a gateway (LiteLLM) or a wired AWS Budgets IAM-deny in front of spend for a real cap. Tag every call with feature/team and assemble a single per-feature dashboard. Name a cost owner per team — if cost rolls up only to finance, engineers have no incentive to act.

Three concrete moves worth doing inside 30 days. Each is verified-impact and most teams have not done them.

  1. 1

    Turn caching on properly this week

    Audit your top 10 prompts, move every dynamic field to the end, and verify cached reads are actually being hit (OpenAI/Gemini are automatic above ~1,024 tokens; Anthropic needs cache_control). Target 40–70% of input tokens served from cache. Hours of work for a 30–60% input-token reduction on prefix-heavy apps — the single highest-leverage move available.

  2. 2

    Route the easy traffic and batch the async work

    Stand up a cheap-tier/premium-tier cascade for your highest-volume workflow and move every non-interactive job — evals, embeddings, bulk classification, summarise-at-rest — onto batch APIs (flat 50% off, stacks with caching). Between them these attack the two biggest line items: model tier and full-price async spend.

  3. 3

    Install a real cap and a per-feature dashboard

    Put a gateway with per-key/per-team max_budget (or an AWS Budgets IAM-deny action) in front of spend so a runaway loop returns a 429 instead of a postmortem. Tag every call with feature and team, assemble one dashboard across inference + retrieval + eval, and name a cost owner per team. Without enforcement and visibility, every saving above regresses within a quarter.

Cost dynamics worth tracking — each can move team AI economics 20%+ within a quarter when it lands.

Cache + route + cap as a single bought layer

Gateways (LiteLLM, Portkey, Helicone) are converging caching, routing, fallback, and hard budgets into one proxy in front of every call. As these mature into a standard procurement line, the build-it-yourself calculus shifts — a right-priced gateway pays back faster than the engineering to assemble the same controls internally.

Serverless open-model price floors

Open-weight models on serverless endpoints ($0.05–$2/Mtok) are dropping faster than flagship APIs and reset the self-hosting break-even upward every quarter. Old “self-host beats GPT-4o at X tokens” math is stale; the real competitor to your GPUs is a $0.10/Mtok endpoint with zero ops. Re-run the build/buy line each quarter.

FinOps-for-AI moving from showback to enforcement

With 98% of FinOps teams now managing AI spend, the frontier is shifting from showback dashboards to chargeback and hard enforcement — team budgets, cost gates in CI, and “self-fund AI through efficiency” mandates. Expect cost-per-goodput, not cost-per-token, to become the metric leadership asks for.