Saving on LLM cost at the team level is not one optimisation — it is a short stack of independent levers, each acting on a different part of the bill, plus the governance that keeps them applied. The mistake is treating it as a prompt-engineering problem owned by whoever wrote the prompt. It is an architecture-and-process problem owned by the team. The math below uses current (June 2026) vendor list prices and the primary-sourced impact ranges; your contract and traffic mix will move the absolute numbers.
The right question is not “how do we shorten prompts.” It is “which of the four multipliers is biggest in our stack, which lever attacks it, and who owns keeping it applied.” For most teams the order is: turn on caching, route the easy traffic, cap the agent loop, then instrument and govern so none of it regresses.
DEEP READ 6 sections · cited primary sources · technical review pending
01 The cache layer — the 90% discount most teams half-use
Prompt caching is the highest-leverage lever and the most commonly left on the table. The mechanism: a stable prompt prefix (system prompt, tool definitions, few-shot examples, a long retrieved document reused within a session) is cached at the vendor, and subsequent calls that reuse that exact prefix pay a steep discount on those input tokens. The catch that trips teams up is byte-stability — the cache keys on an exact prefix match, so a single changing token near the front (a timestamp, a request ID, a session counter) invalidates everything after it and forces a full-price recompute.
The per-vendor shape differs in a way that changes your break-even. Anthropic charges a write premium: cached reads cost 0.1x of base input (a 90% discount) but you pay 1.25x to write a 5-minute cache entry or 2x for a 1-hour entry — so Anthropic caching pays off only once the entry is read at least once (5-min) or two-to-three times (1-hour). OpenAI and Google cache automatically with no write premium: you simply pay the discounted read (OpenAI up to 90% off cached input on the GPT-5.x line above a 1,024-token threshold; Gemini ~90% off on 2.5+ models), so they pay off on the first hit. Azure matches OpenAI and can reach up to 100% off cached input under provisioned throughput.
The design rule that makes this real at the team level: structure every prompt as static-prefix-first, dynamic-content-last, and make it a review-checklist item. Most teams measure their cache hit rate at 5–15% on first look and move it to 40–70% with deliberate attention. That is a 30–60% reduction in input-token cost for prefix-heavy apps, for hours of work.
- Anthropic Cached read 0.1x base input (90% off). Write premium 1.25x (5-min) / 2x (1-hr). Min cacheable 1,024 tokens on Opus 4.8 / Sonnet 4.6. Opt-in via cache_control.
- OpenAI Up to 90% off cached input, automatic, no write premium. 1,024-token threshold then 128-token increments. ~5–10 min idle TTL, extendable.
- Google Gemini ~90% off cached input, implicit on 2.5+ models plus an explicit cache API. Explicit cache adds $1.00/Mtok/hr storage. No read-write premium.
- The cache-killer Any per-call variability in the prefix — timestamps, IDs, counters — voids the cache from that token onward. Push all dynamic content to the end of the prompt.
02 Routing and right-sizing — stop sending easy work to the flagship
The premium-model-by-default habit is, for most teams, the largest unforced cost. The per-token gap between tiers is steep: Anthropic Opus 4.8 ($5/$25 per million in/out) to Haiku 4.5 ($1/$5) is a 5x drop; OpenAI GPT-5.4 ($2.50/$15) to nano ($0.20/$1.25) is ~12.5x on input. A routed cascade sends easy queries to the cheap tier and escalates only the hard ones, using either a heuristic classifier or a small judge model to make the call.
How much that saves depends entirely on your traffic and how you measure. RouteLLM (the open Berkeley/LMSYS router) reports up to 85% cost reduction while preserving 95% of GPT-4 quality — but that headline is MT-Bench-specific; on the same blog the same 95%-quality target yields only 45% savings on MMLU and 35% on GSM8K. So treat 85% as a favourable-benchmark ceiling, not an expected value. The durable practitioner estimate is that 60–80% of production queries are simple enough for a small model; the savings track how much of your traffic that actually is.
Right-sizing has one trap worth naming: newer tokenizers can encode the same text into more tokens (Anthropic notes up to ~35% more on recent models), so compare tiers on tokens-for-your-actual-text, not on the sticker rate. And the recurring cost of routing is not the router — it is maintaining a representative eval set so you can prove the cheap tier is good enough as models change. That eval set is the real asset.
CAVEAT RouteLLM’s 85% is the best case on MT-Bench; the cross-benchmark range at equal quality is roughly 35–85%. Uniformly-hard traffic erases the savings entirely — routing only helps to the degree your traffic is mixed.
03 Batch, structured outputs, compression — the input and retry levers
Three more configuration-grade levers. First, batch APIs: a flat 50% discount at OpenAI, Anthropic, Bedrock, and Google for asynchronous workloads with a ~24-hour SLA. Every eval run, embedding backfill, bulk classification, content-moderation sweep, and summarise-at-rest job belongs here, and the discount stacks with prompt caching. Most teams pay full realtime price on these because the batch migration is perennially next-quarter.
Second, structured outputs. Forcing model output through a strict schema (OpenAI Structured Outputs, JSON Schema, tool-calling, Outlines, Instructor) eliminates the retry-and-repair loop that quietly re-bills a full inference call each time a parse fails. OpenAI's own evals show 100% schema adherence versus under 40% on the prior model; the dollar saving is your old failure rate times the cost of each wasted retry. It guarantees structural, not semantic, correctness — but the retry compression is real and the effort is low.
Third, prompt compression for the cases where input tokens genuinely dominate — long-context RAG, heavy few-shot, big retrieved documents. The LLMLingua family (Microsoft Research) compresses context with small-model token pruning: 2x–5x is the production-realistic range (LLMLingua-2), with a 20x ceiling only on the most favourable long-document benchmarks. It does nothing when your prompts are already short, so reach for it last among the three.
- Batch (50% off) OpenAI, Anthropic, Bedrock, Google — non-interactive only, ~24h SLA. Stacks with caching. Move evals, embeddings, bulk classification here.
- Structured outputs 100% schema adherence vs <40% prior. Saving = eliminated retries. Low effort, fast payback.
- Prompt compression LLMLingua-2: 2x–5x realistic, 1.6x–2.9x faster. Only when input tokens dominate; 20x is a ceiling not an average.
04 The agent multiplier — the team-level cost story of 2026
Everything above is table stakes. The cost story that actually separates teams in 2026 is the agent loop. Because every step re-sends the whole growing transcript and every tool call is another inference, agent token usage grows fast: Anthropic's published data puts agents at about 4x the tokens of a chat interaction and multi-agent systems at about 15x, and finds that token usage by itself explains roughly 80% of the performance variance across their research-agent runs. Capability is bought with tokens — which is exactly why the loop has to be governed rather than left open.
The first agent lever is cache survival. Manus, which runs production agents, calls the KV-cache hit rate “the single most important metric for a production-stage AI agent,” because cached versus uncached input on Claude is $0.30 versus $3.00 per million tokens — a literal 10x — and “even a single-token difference can invalidate the cache from that token onward.” The implication is concrete: keep the agent's context append-only and the prefix byte-stable, and do not inject volatile fields into the system prompt.
The second is bounding the loop. Hard max-iteration caps, a per-run token-or-dollar budget, and no-progress detection are the cheapest insurance against the runaway loop that quietly bills hundreds of dollars on a single overnight task. The third is a planner/executor split — a capable lead model plans and condenses, cheaper subagents do the legwork and return short summaries — which is the pattern Anthropic credits for an Opus-lead/Sonnet-worker system beating single-model Opus by ~90% on their internal research eval. Note the tension to manage: aggressive history-compaction rewrites the prefix and fights prompt caching, so compact at task boundaries rather than every turn.
CAVEAT The 4x / 15x figures are Anthropic’s own numbers from their multi-agent research system and reflect that workload; your fan-out depends on tool latency, retry policy, and how aggressive the agent is. Agent per-task dollar figures circulating online are directional only — benchmark your own loops with per-run span aggregation.
05 FinOps for AI — see it, own it, cap it
Levers leak without governance, and governance for AI spend has become mainstream fast: the FinOps Foundation's State of FinOps 2026 (n=1,192 practitioners) reports 98% of FinOps practices now manage AI spend, up from 63% in 2025 and 31% in 2024. The framework's recommended unit metrics are cost-per-token, cost-per-inference, and cost-per-API-call, with the sharper idea being “goodput” — cost per usable output that met its SLO, not per raw token.
Attribution is where vendors diverge sharply, and it determines whether a per-feature dashboard is even possible. Google Vertex attributes per-request via labels into Cloud Billing/BigQuery; AWS Bedrock attributes at day-level via application inference profiles and cost-allocation tags. OpenAI and Anthropic do not roll request metadata into their cost dashboards at all — their metadata fields are for your own logging and abuse-tracking — so on those vendors you attribute by isolating into Projects/Workspaces with scoped keys and by logging token usage yourself and pricing it. Build that per-feature view: features that lose money cannot keep shipping once their full cost is assembled on one dashboard.
The single load-bearing governance fact: no major hosted vendor ships an instant hard spend cap. OpenAI's project budget is explicitly a soft notification that “does not enforce a hard cap.” Anthropic offers workspace limits plus rate limits (the rate limit is the firmer realtime throttle). The closest thing to a true cap is AWS Budgets budget-actions applying an IAM/SCP deny on bedrock:* at breach — though cost data lags hours — or, more practically, an LLM gateway like LiteLLM that enforces a per-key/per-team max_budget and returns a real-time 429 the instant a budget is hit. If a runaway loop can outspend your budget before the alert email arrives, you do not have a cap; you have a postmortem.
- Allocate Tag every call with feature / team / environment. Vertex + Bedrock attribute natively; OpenAI + Anthropic need Projects/Workspaces + your own token logging.
- Observe Helicone, Langfuse, LangSmith, or native Costs APIs. Most report estimated (price x tokens) cost, not invoiced — fine for trend, not for reconciliation.
- Separate eval spend Run evals under a dedicated key/project on batch APIs. Continuous eval can equal or exceed live-traffic inference; it needs its own budget line.
- Cap hard Gateway (LiteLLM max_budget + 429) or AWS Budgets IAM-deny action. Vendor budgets alone are alerts, not enforcement.
- Gate in CI promptfoo cost/latency assertions or Braintrust merge-blocking evals so a prompt/model change that doubles cost cannot ship silently. Gate on averages/percentiles — token counts flake single-call gates.
06 Self-hosting vs serverless open models — where the build/buy line actually is
Self-hosting open-weight models is the lever teams reach for too early. The serving stacks are genuinely good — vLLM (PagedAttention) delivers 2x–4x throughput over prior serving systems, SGLang (RadixAttention) up to 6.4x on shared-prefix workloads — and the open-weight options are strong and permissively licensed (gpt-oss-120b under Apache 2.0 runs on a single 80GB GPU; Qwen, DeepSeek, GLM, Mistral Small all usable commercially). The economics, though, hinge on one number: $/token = GPU $/hour ÷ tokens/hour produced, and the denominator only gets good with continuous batching at sustained high utilisation. Idle and batch-1 serving is the worst economics there is, and real-world GPU utilisation of 30–40% silently multiplies your napkin $/token by 2.5–3x.
The move most teams miss is that the real competitor to self-hosting in mid-2026 is not the $3–$15/Mtok flagship API — it is the serverless open-model endpoint at $0.05–$2/Mtok with zero ops. The same model varies wildly by host (Llama 3.3 70B runs $0.10/$0.32 on one provider versus $1.04/$1.04 on another, per million in/out), so shopping hosts is itself a lever. The pragmatic hierarchy: flagship closed API for the hard 20% → serverless open-model endpoint for the bulk → dedicated endpoint → full self-host only once utilisation is genuinely high and sustained.
As a directional gate (verify against your own volumes): under ~100k requests/day or ~$50k/year of API spend, stay on managed APIs; $50k–$500k/year is hybrid territory; past ~1M requests/day or ~$500k/year, owned serving starts to win — but only after you account for the idle-GPU tax and the engineering cost of running it.
CAVEAT GPU hourly prices (RunPod H100 ~$3.29/hr, A100 ~$1.49/hr; Lambda H100 ~$3.99–$4.29/hr, June 2026) and serverless endpoint prices are verified; the break-even thresholds are directional rules-of-thumb. Compute your own $/Mtok from a verified GPU price x your measured throughput before committing capital.