THE STACK · MODEL ECONOMICS · REFERENCE · Published May 24

AI cost economics, May 2026: where the bill actually compounds, and the patterns that move the line item.

Inference token pricing has dropped ~80% in 18 months. Enterprise AI bills have not. The reason is that token cost is only one of four cost layers, and the other three compound silently outside the line item the CFO is auditing. Below: the layers, the deployment patterns that amplify each one, the math for typical AI features, and the controls that actually work.

~80% token price drop 18mo
4 cost layers
1.6–3x silent compounders
TL;DR 30-second version
  1. Token cost is one of four cost layers. The others (retrieval and storage, observability and eval, and the agent-loop multiplier) compound outside the inference line item and are usually not allocated to the feature owner. That is why token prices fell ~80% and bills did not.
  2. Three deployment patterns hide the markup. The Agent Loop Pattern (1 user prompt → 5-50 model calls) inflates cost by 5-50x. The Retrieval-Per-Request Pattern (every query embeds + searches before any inference) inflates it by ~1.2-1.5x. The Observability-Heavy Pattern (every call double-logged through 2-3 tools) inflates it by 20-40% and shows up on a different vendor invoice.
  3. Three controls actually move the line item: prompt caching (50-90% discount on cached input for stable system prompts), model routing (cheap model for easy queries, premium for hard; typical 40-60% cost reduction on mixed traffic), and structured-output enforcement (cuts retries, often a silent 15-25% overhead). Negotiating token prices is the visible lever and the smallest one.
DEEP ANALYSIS
FOR YOU

If you ship AI features in a product team, your job is to take the four cost layers seriously and ship the three controls that actually move the line item. Token-price negotiations are the CFO's lever; cache + route + structured output is yours.

The three controls, in priority order

  • 1. Prompt caching (do this week) Audit top 10 prompts. Migrate stable portions (system prompt, tools, instructions) to vendor-side caching. Anthropic: up to 90% off cached input. OpenAI: 50%. Google: ~75%. Expected reduction in input-token cost: 30-60%.
  • 2. Structured outputs (do this month) Use Pydantic, JSON Schema, Outlines, or Instructor to force outputs through a strict schema. Cuts retry rate from 5-15% to <2%. Each retry is a wasted full call. Expected reduction in inference cost from retry compression: 15-25%.
  • 3. Model routing (do this quarter) Build a cascade: easy queries to a cheap model (Haiku, GPT-4o-mini, Gemini Flash); hard queries to a premium model. Heuristic classifier or judge LLM does the routing. Expected reduction on mixed traffic: 40-60%.
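
A minimal sketch of the heuristic-classifier variant of the cascade. The tier labels, the keyword list, and the length threshold are all hypothetical placeholders rather than anything from a vendor API; a production router would be tuned against your own evals.

```python
# Hypothetical tier labels; swap in whichever cheap/premium models you use.
CHEAP_MODEL = "cheap-tier"
PREMIUM_MODEL = "premium-tier"

# Assumed signals that a query is "hard"; tune these against your own evals.
HARD_KEYWORDS = ("refactor", "prove", "debug", "multi-step", "analyze")
MAX_EASY_CHARS = 400


def route(query: str) -> str:
    """Return the model tier for a query using a simple heuristic classifier."""
    text = query.lower()
    if len(text) > MAX_EASY_CHARS:
        return PREMIUM_MODEL
    if any(keyword in text for keyword in HARD_KEYWORDS):
        return PREMIUM_MODEL
    return CHEAP_MODEL


def premium_share(queries: list[str]) -> float:
    """Fraction of traffic sent to the premium tier, a metric worth tracking."""
    if not queries:
        return 0.0
    return sum(route(q) == PREMIUM_MODEL for q in queries) / len(queries)
```

A judge-LLM router would replace `route` with a model call; the heuristic version costs nothing per request, which is why it is a common starting point.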

What to measure

Most teams measure tokens-per-call. The metric that matters is tokens-per-user-request: tokens-per-call multiplied by the agent-loop fan-out factor (model calls per user request). If you make 1 call per user request, the two are equal. If you make 15 calls per user request, your true unit cost is 15x what your tokens-per-call dashboard shows.
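
As a quick sketch of that relationship: the function names, the 6,000 tokens-per-call figure, and the log counts below are illustrative assumptions, not measurements.

```python
def fan_out(total_model_calls: int, total_user_requests: int) -> float:
    """Average model calls per user request over a logging window."""
    if total_user_requests == 0:
        raise ValueError("no user requests in window")
    return total_model_calls / total_user_requests


def tokens_per_user_request(tokens_per_call: float, calls_per_request: float) -> float:
    """Per-call tokens scaled by agent-loop fan-out."""
    return tokens_per_call * calls_per_request


# Hypothetical window: 15,000 model calls serving 1,000 user requests.
single_shot = tokens_per_user_request(6_000, 1)
agentic = tokens_per_user_request(6_000, fan_out(15_000, 1_000))
```

With these inputs the agentic figure is 15x the single-shot one, which is the gap a tokens-per-call dashboard hides.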

Also measure: percentage of input tokens that are cacheable (high cache potential = high lever available), retry rate (if >2%, structured outputs are not enforced), and percentage of traffic served by cheap vs premium models (if all premium, routing has not been built).

FOR YOU

For engineers building the AI platform layer (the team that everyone else integrates with), the question is what shared infrastructure should exist so each feature team is not solving the same cost problem in isolation. The answer is mostly cache + route + observe + budget.

Platform components worth building once

  • Cache-and-route gateway A single LLM proxy in front of all vendor calls. Handles prompt caching, model routing, fallback on errors. Adds <50ms latency. LiteLLM, Portkey, or custom on top of vendor SDKs. The single highest-leverage platform investment for AI-heavy products.
  • Per-feature cost allocation Tag every LLM call with feature_id at the call site. Aggregate cost by feature_id in the dashboard. Without this, no one knows which features lose money.
  • Eval infrastructure on batch APIs CI/CD pipeline that runs evals through batch APIs (50% off). Provides quality gates without doubling the inference bill. Most teams run evals on realtime APIs because the batch migration is "next quarter" indefinitely.
  • Budget alerting per feature Per-feature spend caps with alerts at 50%, 80%, 100% of budget. Prevents runaway agent loops from consuming the monthly budget in 48 hours. Cursor, Replit, Anthropic all do versions of this internally; everyone else builds it after an incident.
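
The cost-allocation and budget-alerting components above could be sketched together as follows. This is an in-memory illustration under assumed names (`FeatureBudget`, `record`); a real gateway would persist spend and send the alerts somewhere.

```python
from collections import defaultdict

ALERT_THRESHOLDS = (0.5, 0.8, 1.0)  # 50%, 80%, 100% of budget, as in the text


class FeatureBudget:
    """Aggregates LLM spend by feature_id and reports newly crossed thresholds."""

    def __init__(self, monthly_budgets: dict[str, float]):
        self.budgets = monthly_budgets
        self.spend = defaultdict(float)
        self.fired = defaultdict(set)

    def record(self, feature_id: str, cost_usd: float) -> list[float]:
        """Add one call's cost; return the thresholds crossed by this call."""
        self.spend[feature_id] += cost_usd
        used = self.spend[feature_id] / self.budgets[feature_id]
        crossed = [t for t in ALERT_THRESHOLDS
                   if used >= t and t not in self.fired[feature_id]]
        self.fired[feature_id].update(crossed)
        return crossed
```

Calling `record` at the call site forces every LLM call to carry a `feature_id`; an untagged feature raises a `KeyError` here, which is one way to make tagging non-optional.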

What NOT to build

Do not build your own vector DB unless you genuinely have unusual requirements. Pinecone, Weaviate, Qdrant, and pgvector are all production-grade in 2026; the engineering cost to compete is months of work for marginal cost savings.

Do not build your own observability platform. Helicone, Langfuse, LangSmith, and PostHog AI all exist. Engineering time on internal observability is engineering time not spent on the agent-loop multiplier, which is where the real cost lives.

FOR YOU

The CFO question is not "how do we lower the AI bill." It is "do our AI features have positive unit economics today, and what do we need to instrument to be sure." Below: the per-feature P&L framework, the data you need from engineering, and the procurement levers worth pulling.

The per-feature AI P&L

For each AI feature, you need: revenue attributable to the feature (if any), cost across all four layers (inference + retrieval + observability + agent-loop allocation), and the unit denominator (per request, per resolved ticket, per user-month). Without all three, you cannot answer 'is this profitable.' Most companies have only one or two.

Common shape of the gap: revenue is tracked at the product level, cost is tracked at the vendor level, and the bridge — what does this feature cost per user per month — is built ad-hoc or not at all. Push engineering to instrument per-feature tags on every LLM call as a contracting requirement, not an aspiration.
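
One possible shape for that bridge, as a sketch. The field names are assumptions and the monthly figures are invented for illustration; the point is only that revenue, four-layer cost, and a unit denominator live in one record.

```python
from dataclasses import dataclass


@dataclass
class FeaturePnL:
    revenue_usd: float             # revenue attributable to the feature
    inference_usd: float           # layer 1
    retrieval_usd: float           # layer 2
    observability_eval_usd: float  # layer 3
    agent_loop_usd: float          # layer 4 allocation
    units: int                     # e.g. resolved tickets or user-months

    @property
    def total_cost_usd(self) -> float:
        return (self.inference_usd + self.retrieval_usd
                + self.observability_eval_usd + self.agent_loop_usd)

    @property
    def margin_per_unit_usd(self) -> float:
        return (self.revenue_usd - self.total_cost_usd) / self.units


# Hypothetical monthly figures, for illustration only.
support_agent = FeaturePnL(5_000, 1_200, 200, 400, 2_400, units=10_000)
```

In this made-up case the feature clears about eight cents per unit; drop the agent-loop allocation from the sum and it looks far healthier than it is.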

Procurement levers worth pulling

  • Volume commit discounts Most vendors negotiate 10-40% off list on annual volume commitments. Verify your current contract is at market rate; pricing changes quarterly.
  • Reserved capacity at hyperscalers AWS Bedrock Provisioned Throughput, Azure OpenAI PTUs. Steady, predictable volume → 30-50% effective cost reduction vs on-demand. Wrong choice for bursty traffic.
  • Batch API migration Eval, content moderation, bulk classification → batch (50% off list). Engineering work, but a budgeted procurement initiative tends to get done; an aspirational one does not.
  • Cache adoption tracking Ask engineering for "% of input tokens served from cache" as a KPI. The number is typically 5-15% on first measurement and can move to 40-70% with explicit attention. Major lever.
FOR YOU

For founders, AI cost economics is either a build-it-right problem (you ship AI features in your own product) or an existential bet (you sell AI tooling and your customers face this problem). The frameworks below apply to both angles.

If you ship AI features in your product

The pricing-model question matters more than the cost-optimization question. If you priced flat-rate against a "1 call per request" mental model and your product has any agentic component, you have a structural unit-economics problem that compression cannot fully fix. Cursor, Replit, ChatGPT all moved from flat or unlimited to usage-based for exactly this reason.

Decide early: are you charging usage-based (variable cost passes through, margin is stable), credit-based (some predictability for the customer, you absorb spikes), or flat-rate (you take the risk of heavy users). Each has product-experience tradeoffs. Flat-rate without a usage cap on agentic workflows is the structural-loss option.

The heavy-user problem: heavy-user economics are the trap. The top 5% of users typically consume 50%+ of agentic workload. Even if your average price covers your average cost, you can lose far more on each heavy user than you earn on each typical user. Cap usage explicitly.
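
A worked sketch of the heavy-user split. The 5% / 50% shares come from the paragraph above; the user count, flat price, and average cost are hypothetical.

```python
def segment_margins(users: int, price: float, avg_cost: float,
                    heavy_share: float = 0.05, heavy_usage: float = 0.50):
    """Per-user margin for heavy users and for everyone else under flat pricing."""
    total_cost = users * avg_cost
    heavy_users = users * heavy_share
    light_users = users - heavy_users
    heavy_cost_each = total_cost * heavy_usage / heavy_users
    light_cost_each = total_cost * (1 - heavy_usage) / light_users
    return price - heavy_cost_each, price - light_cost_each


# Hypothetical: 1,000 users on a $20 flat plan with a $15 average cost.
heavy_margin, light_margin = segment_margins(users=1_000, price=20.0, avg_cost=15.0)
```

Under these assumptions each heavy user loses about $130 a month while each remaining user earns about $12, even though the plan is profitable on average.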

If you sell AI cost tooling

  • Real opportunity Cache + route + observability is becoming a standard procurement category. Helicone, Portkey, Braintrust, LiteLLM are early but each is solving real customer pain. Category will support multiple winners.
  • Defensible positioning Depth on a specific vertical (RAG for legal, agent observability for fintech), or depth on a specific function (FinOps for AI specifically, integrated billing visibility), or depth on integration (deeply embedded into LangChain, LlamaIndex, vendor SDKs).
  • Beware Generic "AI FinOps" positioning blends into normal cloud FinOps. The differentiator is understanding the four-layer model and helping customers see where they are bleeding — not just dashboards on the inference line item the customer already had.
FOR YOU

For analysts and researchers tracking enterprise AI economics, the May 2026 picture is: token prices fell, bills did not, and the structural reasons are now well-documented enough to model. Below: the open data questions and the bets worth analyzing.

Open empirical questions

  • Production-vs-list pricing gap List prices are public. Real enterprise contract pricing is private and varies 10-40% from list. A comprehensive survey of actual paid prices across enterprise tiers would significantly advance procurement decision-making.
  • Agent fan-out distribution by workflow type Practitioner reports give us 8-15 / 20-40 / 100+ ranges. A systematic study sampling production logs across many companies would tighten the range and document the heavy-tail shape.
  • Eval-vs-production cost ratio by team maturity Anecdotal reports suggest mature teams run eval cost at 0.5-2x of production cost. Empirical data across the maturity curve would help teams budget and would identify the inflection points where eval starts cannibalizing.
  • Cache hit rate distribution Anthropic, OpenAI, and Google all expose cache hit rate metrics but rarely aggregate. Survey-based or partner-data work documenting typical hit rates by app type would give procurement and engineering teams a benchmark to target.

Market dynamics worth tracking

Vendor pricing decisions over 2026 will be shaped by: continued frontier-model capability competition (driving list-price drops on capable models), inference-cost subsidization to capture enterprise relationships (frontier-tier discounting), and the emergence of cache/route gateway products that abstract vendor choice (reducing vendor lock-in, increasing price competition).

Net direction: list prices on capable models continue down. Real enterprise pricing decompresses (more usage-based and reserved-capacity differentiation). The cost layers other than inference become the primary battleground — observability, eval, vector DB, gateway. Each is a multi-billion-dollar category emerging in real time.

The hidden story: the most under-reported story in 2026 enterprise AI is that the cost gains from token-price drops have been largely absorbed by the agent-loop multiplier. Token cost per call fell ~80%. Calls per user request grew 5-50x in agentic products. Net bill: roughly flat at the low end of that range, up sharply at the high end. This is the central economic story and most coverage is still about token prices.
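
A back-of-envelope check on that claim, assuming the bill scales as price per call times calls per request and nothing else changes (a deliberate simplification).

```python
def relative_bill(price_drop: float, fan_out: float) -> float:
    """Bill relative to the old single-call baseline (1.0 means unchanged)."""
    return (1.0 - price_drop) * fan_out


low_end = relative_bill(0.80, 5)    # 80% cheaper calls, 5 calls per request
high_end = relative_bill(0.80, 50)  # 80% cheaper calls, 50 calls per request
```

At 5 calls per request the price drop is cancelled almost exactly; at 50 calls the bill is about ten times the old baseline.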

The four cost layers, in the order they typically compound. Token cost is the visible one; the other three move silently.

Layer 1

Inference token cost

Per-token cost for input + output. List price has fallen ~80% in 18 months. The visible line item; the other three layers are not in it.

visible
Layer 2

Retrieval + storage

Embedding generation, vector storage, retrieval API calls. Typically 10-25% of inference cost in RAG-heavy apps. Shows up under a different vendor (Pinecone, pgvector, Weaviate, Qdrant).

silent
Layer 3

Observability + eval

Helicone, Langfuse, LangSmith, Datadog AI monitoring, eval runs (each row of an eval is one or more LLM calls). 20-40% inference overhead typical; can spike past 100% during active development.

silent
Layer 4

Agent loop multiplier

1 user request → N model calls (5-50 for complex agentic workflows). Tools, retries, planning steps, judge LLMs. The single largest amplifier when present.

amplifier
Pattern

The Agent Loop

Coding agents, browsing agents, deep-research agents. Each external action is wrapped by ≥1 planning + ≥1 verification call. Typical: 8-15x raw token cost vs single-shot.

multiplier
Pattern

Retrieval-Per-Request

Every query embeds + retrieves before any inference. Adds latency + 1.2-1.5x cost. Avoidable for stable contexts via session caching but rarely done.

multiplier
Pattern

Observability-Heavy

Production traffic mirrored through 2-3 telemetry tools, plus eval pipelines, plus on-call dashboards. 20-40% silent overhead. Allocated to platform/ops, not feature.

multiplier
Control

Prompt caching

Stable system prompts cached at the vendor; ~50-90% discount on cached input tokens depending on vendor. Largest single-lever cost control available.

control

AI cost has four distinct layers. Each compounds differently. Treating them as one line item is what makes the bill look unmanageable; treating them as four lets you target the actual amplifiers. The math below uses typical mid-market deployments — pricing varies by vendor, contract, and traffic mix.

BEFORE
The implied 2024 mental model
  • AI cost = (tokens × per-token price)
  • Bigger context = proportionally bigger bill
  • Token prices keep dropping, so bills get smaller over time
  • Inference is the only material variable
  • Eval is a development-time cost, not a production cost
AFTER
The 2026 reality
  • AI cost = inference + retrieval/storage + observability/eval + agent-loop multiplier
  • Bigger context = larger bill PLUS more retries, more eval rows, more telemetry — compounding
  • Token prices dropped ~80% in 18 months; enterprise bills are flat or up
  • Inference is one of four layers; in agentic workflows the token price is usually a smaller cost driver than the call count
  • Eval is continuous in production-grade systems; eval runs can equal or exceed live inference cost

The right cost-economics question for 2026 is not "how do we lower token prices." It is "where is each of the four layers in our stack, who owns it, and which deployment patterns are amplifying which layer." Token negotiations move the visible 25%; the silent 75% lives elsewhere.

DEEP READ 8 sections · cited primary sources · technical review pending

01 Layer 1 — Inference: the only visible line

Inference token cost is what every CFO knows to ask about. It is also the one layer that has dropped ~80% over the last 18 months as vendors compete on per-token pricing. GPT-4-class capability per dollar in mid-2026 is roughly 5x what it was in late 2024; Anthropic and Google have tracked similar trajectories. List prices for capable models now sit around $0.15-1.00 per million input tokens and $0.40-5.00 per million output tokens.

The trap: most teams plan inference cost on list price, then never re-plan. Enterprise contracts negotiate down meaningfully from list (typical discount range 10-40% on volume commitments). Reserved-capacity contracts at AWS Bedrock and Azure OpenAI move further. Batch APIs (50% off list at OpenAI and Anthropic for non-realtime workloads) move further still. If your inference bill is being calculated against list price, you are almost certainly paying retail.

The bigger trap: even after squeezing inference, it is one layer of four. Compressing 25% of the bill 30% is a 7.5% total reduction. Compressing the agent-loop multiplier 50% is often a 30%+ total reduction.
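
The arithmetic, as a quick sketch. The 25% share and the 30% and 50% cuts are the figures from the paragraph above; the implied 60% share is derived here and is not stated in the source.

```python
def total_reduction(layer_share: float, layer_cut: float) -> float:
    """Fraction of the whole bill saved by cutting one layer."""
    return layer_share * layer_cut


def share_needed(total_cut: float, layer_cut: float) -> float:
    """Share of the bill a layer must hold for layer_cut to yield total_cut."""
    return total_cut / layer_cut


inference_only = total_reduction(0.25, 0.30)  # about 7.5% of the bill
loop_share = share_needed(0.30, 0.50)         # about 60% of the bill
```

Read the second number as a sanity check: a 30% total saving from halving fan-out only holds if fan-out-driven spend is around 60% of the bill.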

  • Input tokens (list) $0.15-1.00 per million for capable models; $1-15 per million for frontier-tier models.
  • Output tokens (list) $0.40-5.00 per million for capable; $5-75 per million for frontier-tier.
  • Cached input 50-90% off list depending on vendor (Anthropic up to 90%, OpenAI 50%, Google Gemini ~75%).
  • Batch / async ~50% off list for non-realtime workloads at OpenAI + Anthropic.
PRIMARY SOURCE OpenAI pricing

02 Layer 2 — Retrieval and storage: the RAG tax

Any AI feature that uses retrieval (RAG, semantic search, document-grounded chat) adds a second cost layer with three sub-components: embedding generation ($0.02-0.13 per million tokens at typical embedding-model pricing); vector storage ($0.20-2.50 per million vectors per month depending on vendor and dimensionality); and retrieval API calls (per-query cost on managed vector DBs).

Typical RAG-feature economics: every user query triggers an embedding call ($0.0001-0.0005 per query), a vector search ($0.0001-0.001 per query depending on vendor), then the inference call. The retrieval layer adds 10-25% to the inference cost for RAG-heavy apps. This is small per-query and large in aggregate — and it sits on a different vendor invoice (Pinecone, Weaviate, Qdrant, pgvector hosted at your DB vendor, or whatever managed offering you use), which is why it is often not allocated back to the feature owner.

The amplifier here is index refresh. For most B2B apps, re-embedding the corpus monthly costs more than the live retrieval traffic. Re-embedding weekly because someone wanted 'fresher results' roughly quadruples that refresh cost without an obvious user-visible benefit.

CAVEAT Vector DB pricing changes faster than this document does — Pinecone, Weaviate, and Qdrant all repriced in 2025-2026. Treat the per-million-vector numbers as order-of-magnitude only and pull live pricing for procurement decisions.
PRIMARY SOURCE Pinecone pricing

03 Layer 3 — Observability and eval: the developer tax

AI observability tooling (Helicone, Langfuse, LangSmith, PostHog AI, Datadog AI monitoring, custom internal) is genuinely necessary for production AI systems. It is also a cost layer that compounds in three ways: every production call is logged at least once, often twice (primary observability + secondary backup); every traced call carries a token-counting fee from the observability vendor; and every eval row in your continuous-eval pipeline is itself one or more LLM calls.

Typical mid-market AI team in 2026: $200-2000/month per observability vendor, often 2-3 vendors running. Adds 20-40% overhead to inference. The eval pipeline is the variable cost: a 500-row eval suite running daily is 15,000 eval rows a month. Each row is one to several inference calls (depending on whether you use a judge LLM). For mature teams, eval inference can equal or exceed live-traffic inference.
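
The eval volume math can be sketched as below. The 500-row daily suite is from the text; the two-calls-per-row judge setup and the $0.004 per-call cost are illustrative assumptions, and the 50% batch discount is the figure quoted elsewhere in this piece.

```python
def monthly_eval_calls(rows: int, runs_per_day: int = 1,
                       calls_per_row: int = 1, days: int = 30) -> int:
    """Inference calls generated by a continuous-eval suite in a month."""
    return rows * runs_per_day * calls_per_row * days


def monthly_eval_cost(calls: int, cost_per_call: float,
                      batch_discount: float = 0.0) -> float:
    """Eval spend, optionally with a batch-API discount applied."""
    return calls * cost_per_call * (1.0 - batch_discount)


baseline = monthly_eval_calls(500)                     # one call per row
with_judge = monthly_eval_calls(500, calls_per_row=2)  # answer + judge call
realtime_cost = monthly_eval_cost(with_judge, 0.004)
batch_cost = monthly_eval_cost(with_judge, 0.004, batch_discount=0.5)
```

Adding a judge doubles the call count before any growth in the suite itself, which is how eval spend creeps toward live-traffic spend.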

The category mistake: eval cost is budgeted as engineering tooling, but mechanically it consumes the same inference pool as production. When the engineering eval team and the product feature team both run on the same vendor contract, eval can quietly cannibalize the production budget without anyone noticing until a quarterly review.

  • Helicone Per-request pricing; free tier; paid plans $20-500/month range.
  • Langfuse Open-source self-host or hosted; cloud plans $50-1000+/month for production volume.
  • LangSmith LangChain ecosystem; per-trace pricing scales with traffic.
  • Eval runs Highly variable; mature teams report 0.5-2x of live-traffic inference cost on continuous-eval pipelines.

04 Layer 4 — The agent-loop multiplier (the big one)

The single largest cost amplifier in 2026 production AI is the agent loop. A traditional chatbot makes one inference call per user message. A coding agent (Claude Code, Cursor agent mode, Cline, Replit Agent) makes 5-50 inference calls per user request: plan, read files, plan again, write code, run tests, read errors, debug, re-plan, write more code, verify. Each step is one or more LLM calls.

Typical agent workflows in May 2026: a simple coding task ('fix this bug') averages 8-15 model calls. A medium task ('add this feature to this file') averages 20-40 model calls. A complex task ('refactor this module for new framework') can run 100+ calls. Tool-use multiplies further: each tool call is wrapped by a planning call before and an interpretation call after.

This is where token-price drops have not flowed through to user-facing prices. Cursor moved from unlimited usage to usage-based pricing in 2025-2026 specifically because the agent multiplier made the unlimited tier structurally unprofitable at any subscription price they could charge. The economics are: capability sells, but the unit cost of capability is 10-50x higher than the comparable chat feature, and most product pricing was not designed for that ratio.

CAVEAT The 8-15 / 20-40 / 100+ ranges are from public commentary by Cursor, Anthropic, and Replit during their pricing-model transitions, plus practitioner reports on r/ClaudeAI and r/cursor. They are typical-case estimates, not measured benchmarks. Your workload may run materially higher or lower depending on tool latency, retry policy, and how aggressive the agent is.

05 The three patterns that hide the markup

Three deployment patterns recur in 2026 AI products. Each hides cost in a different way; together they explain why enterprise AI bills are flat or up while token prices fell.

  • The Agent Loop Pattern 1 user request → N model calls. 5-50x amplifier. The cost is in your inference invoice but the unit (per-user-request) is invisible unless you wrap the loop in spans and aggregate. Most teams do not.
  • The Retrieval-Per-Request Pattern Every query embeds + searches before any inference. 1.2-1.5x amplifier. Cost lives on a different vendor invoice. Often genuinely needed but rarely tuned — session-level embedding cache would cut 70%+ of this layer for stable-session apps.
  • The Observability-Heavy Pattern 2-3 telemetry vendors + active eval pipelines. 20-40% silent overhead. Cost lives on platform/ops budget, not feature P&L. Eval cost in particular grows with engineering team size, not with traffic — meaning it spikes during active development and is sticky after.

These three patterns stack multiplicatively, not additively. A coding-agent product with RAG over the codebase and full observability + eval runs the agent-loop multiplier (8-30x), the retrieval tax (1.3x), and the observability overhead (1.3x), for a gross multiplier of roughly 13-50x vs naive single-shot inference. This is not a flaw in any one layer; it is what happens when capable AI products are built without an explicit cost model.
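
The gross multiplier is a product of the per-pattern factors. A short sketch using the figures from the paragraph above (the helper name is an assumption):

```python
from math import prod


def gross_multiplier(*factors: float) -> float:
    """Combined cost multiplier when each pattern scales the one before it."""
    return prod(factors)


low = gross_multiplier(8, 1.3, 1.3)    # agent loop x retrieval x observability
high = gross_multiplier(30, 1.3, 1.3)
```

The two 1.3x factors alone compound to about 1.7x, so the "small" layers add roughly 70% on top of whatever the agent loop already costs.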

06 The per-user math — when AI features lose money

Reconstructing typical-case unit economics for a few common 2026 AI features. All numbers are approximate; treat as order-of-magnitude.

  • Customer-service chatbot (basic, no agent) Per interaction: ~5K input + 1K output tokens at $0.50/M input + $1.50/M output = ~$0.004 per turn. 5-turn conversation ~$0.02. Profitable in most B2B contracts; structurally fine.
  • Customer-service agent (with tools + retrieval) Per resolution: ~15-25 LLM calls including planning, tool use, verification + retrieval. ~$0.20-0.60 per resolved ticket. Profitable IF resolution rate is high (Klarna initially reported 2.3M+ conversations handled; the rollback came from quality + reputational cost, not unit economics).
  • Coding agent (autonomous) Per "task complete": 8-40 LLM calls, $0.40-3.00 in raw inference. Cursor and Claude Code charge usage-based or subscription with usage caps because flat-rate at any reasonable price loses money on heavy users.
  • Document-grounded chat (RAG-heavy) Per query: 1 LLM call + retrieval. Inference ~$0.005-0.02 depending on context size; retrieval +$0.0002. Profitable; the trap is context bloat — stuffing 20K tokens of retrieved context per query because nobody compressed.
  • Deep-research agent Per research task: 30-100 LLM calls including web fetches, planning, synthesis. $1-10 per task in raw inference. Premium-priced products charge $20+ per task; consumer-priced products structurally cannot afford this workload.
CAVEAT These numbers assume list-price token cost. Enterprise contracts at scale see materially better economics. They also assume the agent-loop and retrieval costs are correctly modeled — most teams underestimate by 30-100% on first calculation.
PRIMARY SOURCE Vellum AI cost analysis
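
The basic-chatbot line above can be reproduced with a small helper. The token counts and list prices are the ones quoted in that bullet; the function itself is an illustrative sketch.

```python
def call_cost(input_tokens: int, output_tokens: int,
              input_price_per_m: float, output_price_per_m: float) -> float:
    """USD cost of one call given per-million-token list prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


turn = call_cost(5_000, 1_000, 0.50, 1.50)  # one chatbot turn
conversation = 5 * turn                     # five-turn conversation
```

The same helper applies per call inside an agent loop; the per-request cost is then the sum over however many calls the loop makes.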

07 Vendor pricing dynamics — what changes the math

Three vendor-side pricing dynamics moved enterprise economics meaningfully in 2025-2026 and are worth tracking.

  • Prompt caching at the vendor Anthropic launched prompt caching at up to 90% off on cached input tokens; OpenAI followed at 50%; Google Gemini at ~75%. For any app with a stable system prompt + tool definitions, this is the single largest lever available. Adoption is uneven; many production apps still pay full price for stable prefixes.
  • Batch APIs for non-realtime work 50% discount at OpenAI and Anthropic for asynchronous batch processing. Eval pipelines, content moderation, summarization-at-rest, and bulk classification should all be on batch. Many are not.
  • Reserved capacity at hyperscalers AWS Bedrock provisioned throughput, Azure OpenAI PTU. Pay for guaranteed capacity, get materially better per-token economics if utilization is steady. Wrong choice if traffic is bursty; right choice for predictable enterprise volume.

The pattern across all three: the discounts exist; the work to qualify for them is non-trivial. Most teams pay list price on workloads that could move to cached/batch/reserved tiers because the engineering work to migrate has not been prioritized. This is procurement-shaped technical debt and it is genuinely large — typical 20-40% potential savings going unrealized.
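
For the caching tier, a rough sketch of the blended input price. It assumes the cached discount applies to a fixed fraction of input tokens and ignores any cache-write surcharge or cache-expiry misses, both of which a real estimate would need.

```python
def effective_input_price(list_price_per_m: float, cached_fraction: float,
                          cache_discount: float) -> float:
    """Blended per-million input price when part of the input hits the cache."""
    return list_price_per_m * (1.0 - cached_fraction * cache_discount)


# Hypothetical $1.00/M list price with half of input tokens served from cache.
at_90_off = effective_input_price(1.00, 0.50, 0.90)
at_50_off = effective_input_price(1.00, 0.50, 0.50)
```

Half the input cached at 90% off is a 45% cut in input cost; the same cache share at 50% off is a 25% cut, so both the discount and the cacheable fraction matter.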

PRIMARY SOURCE Anthropic prompt caching

08 What actually moves the line item — the three real controls

Looking at the four layers and the deployment patterns, three controls typically have the biggest impact on enterprise AI bills in 2026. They are not the controls most teams reach for first.

  • Prompt caching everywhere it works Stable system prompts, tool definitions, retrieved context if the same context is hit repeatedly within a session. Typical impact: 30-60% reduction in input-token cost for apps with significant prefix stability. Effort: hours to days. The highest-leverage single lever.
  • Model routing (cascade) Route easy queries to a cheap model (Haiku, GPT-4o-mini, Gemini Flash), hard queries to a premium model. A judge LLM or a heuristic classifier does the routing. Typical impact: 40-60% cost reduction on mixed traffic. Effort: 1-2 weeks to build the router; ongoing tuning.
  • Structured-output enforcement Force outputs through a strict schema (Pydantic, JSON Schema, Outlines, Instructor). Cuts retry rates from 5-15% to <2%. Each retry is a wasted full inference call. Typical impact: 15-25% cost reduction from retry compression. Effort: days; pays back fast.
  • Things that move the bill less than people think Token-price renegotiation (real but small effect on total); prompt shortening past the point of clarity (saves pennies, costs accuracy); switching models monthly (operational cost > savings for most apps).

Most teams spend their cost-optimization energy on the visible inference invoice — chasing token-price discounts and prompt compression. That work is fine, just not the biggest lever. The three above are bigger and most teams have not done them.
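
To make the retry cost concrete, here is a stdlib-only sketch of strict validation with a retry counter. It is not the API of Pydantic, Outlines, or Instructor; the schema, the function names, and the `call_model` callable are hypothetical stand-ins.

```python
import json
from typing import Callable

# Hypothetical schema for illustration: field name -> required Python type.
TICKET_SCHEMA = {"category": str, "priority": int}


def parse_strict(raw: str, schema: dict[str, type]) -> dict:
    """Parse JSON and reject anything that does not match the schema exactly."""
    data = json.loads(raw)  # JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict) or set(data) != set(schema):
        raise ValueError("unexpected or missing fields")
    for key, expected in schema.items():
        if not isinstance(data[key], expected):
            raise ValueError(f"{key} should be {expected.__name__}")
    return data


def call_with_retries(call_model: Callable[[], str], schema: dict[str, type],
                      max_attempts: int = 3) -> tuple[dict, int]:
    """Return (parsed output, attempts used); every extra attempt is a paid call."""
    for attempt in range(1, max_attempts + 1):
        try:
            return parse_strict(call_model(), schema), attempt
        except ValueError:
            continue
    raise RuntimeError("model never produced schema-valid output")
```

Logging the attempt count per request gives you the retry rate directly; schema enforcement at generation time aims to keep that count at 1.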

Six cost failure modes that recur across enterprise AI deployments. Severity reflects how much budget the failure mode typically consumes before someone surfaces it, not absolute risk.

  1. HIGH

    The agent-loop multiplier is unmodeled by most product teams

    Coding agents, browsing agents, deep-research agents — each user request expands to 5-50 LLM calls. If your product pricing was set against a "1 call per request" mental model, your unit economics are structurally negative on heavy users. Cursor and Replit both repriced specifically because of this.

    DO Instrument every agent loop with per-user-request span aggregation. Calculate the actual fan-out ratio (model calls per user request) for your top 5 workflows. If the ratio is >5x and your pricing is flat-rate, you have a real problem that compression cannot fix — you have a pricing-model problem.
  2. HIGH

    Layer cost is not allocated to feature owners

    Inference invoice goes to platform/ops. Vector DB invoice goes to data infra. Observability invoices go to engineering productivity. Eval costs go to ML/AI team. Each is a different cost center; no individual feature P&L sees the full picture. Result: features that lose money keep shipping because their full cost is never assembled.

    DO Build a single AI-feature cost dashboard that pulls from all four cost layers and allocates by feature. Most teams have this nowhere; the work to assemble it is 1-2 weeks and the visibility is decisive.
  3. MEDIUM

    Eval cost cannibalizes production budget

    Continuous eval pipelines run thousands of LLM calls per day. For mature teams, eval inference equals or exceeds live-traffic inference. When eval and production share a vendor contract, eval growth quietly eats into production capacity (or pushes the bill up) without an explicit budget conversation.

    DO Separate eval inference from production inference at the contract level — different API keys, different rate limits, different cost-tracking buckets. Use batch APIs for eval (50% off list). Right-size your eval suite; not every PR needs to run the full 500-row eval.
  4. MEDIUM

    Prompt caching discounts are unrealized at scale

    Anthropic caching up to 90% off, OpenAI 50%, Google 75% — for stable prefixes. Many production apps still pay full price on stable system prompts and tool definitions because the migration to caching APIs has not been prioritized. This is the single highest-leverage lever most teams have not pulled.

    DO Audit your top 10 production prompts this week. For each, identify which portion is stable across calls and which is dynamic. Migrate the stable portion to vendor-side caching. Typical payback: hours of engineering work for 30-60% input-token reduction.
  5. MEDIUM

    Context bloat from RAG without retrieval discipline

    Stuffing 15-25K tokens of retrieved context into every query because the vector DB returned that many results. Cost scales linearly with context size; benefit is often flat or negative past 5-8K tokens of relevant retrieval. The trap is that nobody is auditing retrieval quality, so context size keeps growing.

    DO Set hard caps on retrieved context (e.g., top-5 chunks max, or token-bounded). Measure answer quality at different retrieval depths; in most apps quality plateaus well before the context cap teams typically set.
  6. HIGH

    Switching costs lock in suboptimal pricing posture

    When the bill is 80% Anthropic and you discover OpenAI's pricing is better for your workload, the engineering work to switch is 4-12 weeks of testing, eval re-baselining, prompt engineering re-tuning. For most teams the migration cost exceeds the projected savings, so suboptimal pricing posture becomes permanent. This is rational at the team level and expensive at the company level.

    DO When making vendor-selection decisions, weight switching cost heavily — at the contract level, negotiate exit ramps (data portability, gradual ramp-down clauses). For multi-vendor optionality, build vendor-abstraction at the call layer (LiteLLM-style router) before the lock-in matters.
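
For the context-bloat failure mode above, a hard cap on retrieved context might look like the sketch below. It assumes chunks arrive sorted by relevance and uses whitespace word counts as a crude stand-in for a real tokenizer; the default caps are illustrative.

```python
def cap_context(chunks: list[str], max_chunks: int = 5, max_tokens: int = 6_000,
                count_tokens=lambda text: len(text.split())) -> list[str]:
    """Keep top-ranked chunks until the chunk cap or the token budget is hit."""
    kept, used = [], 0
    for chunk in chunks[:max_chunks]:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break  # stop at the first chunk that would exceed the budget
        kept.append(chunk)
        used += cost
    return kept
```

Running the same eval suite at a few different cap values is the cheapest way to find where answer quality plateaus.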

Three concrete actions worth doing this month to compress the bill without degrading the product.

  1. Build the four-layer cost dashboard

    Most companies have inference cost. Few have inference + retrieval + observability + agent-loop, allocated by feature. Build this. It takes 1-2 weeks. The visibility itself is decisive: features that lose money cannot keep shipping once their full cost is on a single dashboard.

  2. Migrate stable prefixes to prompt caching this month

    Audit your top 10 production prompts. Identify the stable portion (system prompt, tool definitions, instructions). Move them to vendor-side caching (Anthropic, OpenAI, Google all support). Typical input-token cost reduction: 30-60%. Engineering effort: hours to days. Highest single lever available right now.

  3. Move eval and batch-processing workloads off realtime APIs

    OpenAI and Anthropic both offer 50% discounts on batch APIs for non-realtime workloads. Continuous eval, content moderation, bulk classification, summarization-at-rest are all candidates. Most teams pay full price because the migration was deprioritized. Estimated savings: 10-25% of total inference cost depending on traffic mix.

Pricing dynamics worth tracking — each can move enterprise economics 20%+ within a quarter when it lands.

Model routing as a built-in vendor feature

Today routing is a build-it-yourself layer. Several vendors (OpenAI, Anthropic, Google) have shipped automatic routing within their own model families. When cross-vendor automatic routing arrives — likely via routers like LiteLLM, Portkey, or a hyperscaler — switching cost between vendors drops materially.

Cache-and-route as a product category

Vendors like Helicone, Portkey, and Braintrust are building cache + route + fallback as a single layer in front of LLM calls. Expect this to become a standard procurement category over 2026. Right-priced offerings will pay back faster than the engineering effort to build internally.

Reserved-capacity pricing on agent workloads

AWS Bedrock and Azure OpenAI offer reserved capacity on traditional inference. As agent workloads grow, expect reserved-capacity offerings tuned for agent traffic patterns (high fan-out per request). When these land, they will reshape the cost economics for coding-agent and research-agent products specifically.