Model Economics · SIGNAL · May 24, 2026

Enterprise AI cost is silently doubling. Three deployment patterns hide the markup.

Token prices have fallen roughly 80% in 18 months. Enterprise AI bills have not. The reason is that token cost is only one of four cost layers, and the other three are not on the line item the CFO is auditing.

CFOs ask about the inference bill. Engineering teams answer about the inference bill. Boards see the inference bill. None of them is wrong about what they are looking at; they are just looking at one of four cost layers, and the silent three are the ones doing the compounding. This is why enterprise AI bills are flat or up even though per-token prices fell 80% over the last 18 months.

The four layers are inference, retrieval and storage, observability and eval, and the agent-loop multiplier. Inference is the visible line item. Retrieval sits on Pinecone or pgvector invoices, allocated to data infrastructure, not to the feature consuming it. Observability sits on Helicone or Langfuse invoices, allocated to engineering productivity, not to product. Eval is engineering tooling that mechanically consumes the same inference pool as production traffic. The agent-loop multiplier, where one user request becomes 5 to 50 model calls, is already inside your inference invoice, but the unit that matters (cost per user request) is invisible unless you wrap loops in spans and aggregate, and most teams do not.
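To make the "wrap loops in spans and aggregate" step concrete, here is a minimal sketch that rolls individual model calls up to the user request that triggered them. The `ModelCall` record, the `request_id` field, and the per-token prices are hypothetical placeholders, not any vendor's schema or rates; in practice the records would come from whatever tracing layer you already run.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ModelCall:
    request_id: str     # hypothetical ID linking a call to its originating user request
    input_tokens: int
    output_tokens: int


# Illustrative prices in USD per million tokens (assumptions, not real vendor rates).
INPUT_PRICE_PER_M = 3.00
OUTPUT_PRICE_PER_M = 15.00


def call_cost(call: ModelCall) -> float:
    """Cost of one model call in USD."""
    return (call.input_tokens * INPUT_PRICE_PER_M
            + call.output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


def cost_per_request(calls):
    """Aggregate calls into {request_id: (call_count, total_cost_usd)}."""
    counts = defaultdict(int)
    totals = defaultdict(float)
    for call in calls:
        counts[call.request_id] += 1
        totals[call.request_id] += call_cost(call)
    return {rid: (counts[rid], totals[rid]) for rid in counts}


# One agent request that looped three times, and one single-call request.
calls = [
    ModelCall("req-1", 2_000, 500),
    ModelCall("req-1", 4_000, 300),
    ModelCall("req-1", 6_000, 800),
    ModelCall("req-2", 1_000, 200),
]
summary = cost_per_request(calls)
for rid, (n, usd) in summary.items():
    print(f"{rid}: {n} calls, ${usd:.4f}")
```

The per-call view shows four small charges; the per-request view shows that one request cost ten times the other. That second number is the one the invoice never prints.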

Three deployment patterns hide the markup, and they show up everywhere. The Agent Loop Pattern is the largest amplifier: a coding agent fixing a bug averages 8 to 15 model calls; a research agent takes 30 to 100 to complete a task. Cursor and Replit both moved off unlimited subscriptions specifically because this multiplier broke the unit economics at any subscription price they could reasonably charge. The Retrieval-Per-Request Pattern adds 1.2 to 1.5x to the inference cost on RAG-heavy apps; the cost is small per query and large in aggregate, and it sits on a different vendor invoice so the feature owner never sees it. The Observability-Heavy Pattern carries 20 to 40 percent silent overhead through 2 to 3 telemetry vendors plus eval pipelines, allocated to platform budgets instead of feature P&Ls.
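A rough way to see how the three patterns stack is to multiply them through for a single user request. The sketch below is a back-of-envelope model, not a measured one: it reads the 1.2 to 1.5x retrieval figure as a multiplier on inference cost, applies the 20 to 40 percent observability overhead on top of that, and uses a made-up $0.01 per model call. All three choices are assumptions.

```python
def fully_loaded_cost(cost_per_call, calls_per_request,
                      retrieval_multiplier, observability_overhead):
    """Estimate the cost of one user request in USD once the silent layers are included."""
    # Agent loop: one request fans out into several model calls.
    visible_inference = cost_per_call * calls_per_request
    # Retrieval, treated here as a multiplier on inference cost (assumption).
    with_retrieval = visible_inference * retrieval_multiplier
    # Observability and eval, treated as a percentage on top (assumption).
    return with_retrieval * (1 + observability_overhead)


COST_PER_CALL = 0.01  # hypothetical USD cost of a single model call

# Low and high ends of the coding-agent range quoted above (8 to 15 calls).
low = fully_loaded_cost(COST_PER_CALL, 8, 1.2, 0.20)
high = fully_loaded_cost(COST_PER_CALL, 15, 1.5, 0.40)

print(f"visible inference: ${COST_PER_CALL * 8:.4f} to ${COST_PER_CALL * 15:.4f}")
print(f"fully loaded:      ${low:.4f} to ${high:.4f}")
```

Under these assumptions the fully loaded figure lands at roughly 1.44x to 2.1x the visible inference cost for the same request, before counting the loop multiplier itself.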

Anthropic offers up to 90 percent off on cached input tokens. OpenAI offers 50 percent. Google Gemini offers around 75 percent. Most production apps still pay full price on stable system prompts and tool definitions because the migration to caching APIs has not been prioritized. This is procurement-shaped technical debt and it is genuinely large: a typical 30 to 60 percent reduction in input-token cost is available immediately to any team that bothers to do the work.
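The size of the caching opportunity depends on two numbers: the discount on cached tokens and the share of each prompt that is stable enough to cache. The sketch below works through that arithmetic using the discounts quoted above. The 60 percent cacheable share and the $3 per million input price are illustrative assumptions, and the model ignores cache-write surcharges and cache misses, so treat the output as an upper bound.

```python
def input_savings(cacheable_fraction, cache_discount):
    """Fraction of input-token spend saved, assuming every cacheable token hits the cache."""
    return cacheable_fraction * cache_discount


def effective_input_cost(input_tokens, cacheable_fraction, cache_discount, price_per_m):
    """Input cost in USD for one call with part of the prompt served from cache."""
    cached = input_tokens * cacheable_fraction
    uncached = input_tokens - cached
    discounted_price = price_per_m * (1 - cache_discount)
    return (uncached * price_per_m + cached * discounted_price) / 1_000_000


# Discounts as quoted in the text; the cacheable share is an assumption.
DISCOUNTS = {"Anthropic": 0.90, "OpenAI": 0.50, "Google Gemini": 0.75}
CACHEABLE_FRACTION = 0.60  # hypothetical: system prompt plus tool definitions

savings = {vendor: input_savings(CACHEABLE_FRACTION, d) for vendor, d in DISCOUNTS.items()}
for vendor, s in savings.items():
    print(f"{vendor}: about {s:.0%} off input spend")

# A 10,000-token prompt at an illustrative $3 per million input tokens.
full_price = effective_input_cost(10_000, 0.0, 0.0, 3.00)
with_cache = effective_input_cost(10_000, CACHEABLE_FRACTION, 0.90, 3.00)
print(f"full price ${full_price:.4f}, with caching ${with_cache:.4f}")
```

With a 90 percent discount, the 30 to 60 percent range quoted above corresponds to roughly one third to two thirds of the prompt being cacheable, which is plausible for apps with long system prompts and tool definitions.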

The pattern across all of it: the cost gains from token-price drops have been absorbed entirely by the agent-loop multiplier and the silent layers. Token cost per call fell 80 percent. Model calls per user request grew 5 to 50x. Net bill: roughly flat at the low end of that range, and higher above it. This is the central economic story of enterprise AI in 2026, and most coverage is still about token prices.
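The net effect is a single multiplication, sketched below with the figures from the text. It deliberately leaves out the retrieval and observability layers and assumes tokens per call stay constant, so it shows only how the price drop and the call multiplier offset each other.

```python
def net_bill_ratio(price_drop, call_multiplier):
    """New inference bill as a multiple of the old one, for the same user request volume."""
    return (1 - price_drop) * call_multiplier


def break_even_multiplier(price_drop):
    """Call multiplier at which the price drop is fully cancelled out."""
    return 1 / (1 - price_drop)


PRICE_DROP = 0.80  # per-token price decline quoted in the text

low_end = net_bill_ratio(PRICE_DROP, 5)    # 5 calls per request
high_end = net_bill_ratio(PRICE_DROP, 50)  # 50 calls per request
break_even = break_even_multiplier(PRICE_DROP)

print(f"5x calls:  bill is {low_end:.1f}x the old bill")
print(f"50x calls: bill is {high_end:.1f}x the old bill")
print(f"break-even at about {break_even:.0f} calls per request")
```

At 5 calls per request the 80 percent price drop is exactly cancelled; anything above that multiplier and the bill rises despite cheaper tokens.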