THE STACK · ENTERPRISE ADOPTION · REFERENCE · Published May 24

Enterprise AI rollouts, May 2026: what survives, what rolls back, and the six failure patterns underneath the high-profile incidents.

Air Canada, Klarna, Cursor, McDonald's, DPD, Sports Illustrated. Across the most-publicized enterprise AI rollbacks since late 2023, the model is almost never the failure. The failure is workflow integration, pricing model, escape-hatch design, eval coverage, governance, or external claim discipline. Below: the six recurring patterns, the four conditions of a rollout that survives, the audit to run before going live, and what to do when an incident hits.

6 failure patterns

4 survival conditions

0 caused by model alone

TL;DR 30-second version

01 The pattern across high-profile rollbacks: it is almost never the model. Air Canada lost a tribunal because its chatbot's output was treated as policy. Klarna's reversal was about quality + reputational cost, not unit economics. Cursor repriced because flat-rate could not survive heavy-user agent workloads. McDonald's dropped IBM's AI after persistent quality issues in a high-volume, low-tolerance channel. DPD got jailbroken into swearing at customers. Sports Illustrated buried AI-generated articles under fake bylines. Each is a workflow, pricing, escape-hatch, eval, governance, or claims-discipline failure.
02 Four conditions consistently distinguish rollouts that survive from ones that roll back. (a) Bounded scope — clear edges on what the AI handles vs escalates. (b) Calibrated external claims — internal capability accurately reflected in customer-facing messaging. (c) Pricing model that matches workload shape — flat-rate is structurally wrong for agentic/heavy-tail workflows. (d) Pre-shipped escape hatches — fallback paths, human review, kill switches, audit trails. Most failed rollouts violated 2-3 of these.
03 The internal-readiness audit is six questions long, takes 30 minutes, and is rarely done before launch: who owns this when it breaks, what is the rollback path, how is quality measured continuously, what claims are we making externally, what is the worst-case user-trust scenario, and is leadership prepared for the press cycle when it goes wrong. Teams that answered these before shipping are not in the rollback list.

OUR READ 10 primary sources · analysis, not eval data

We read the Civil Resolution Tribunal ruling in Moffatt v. Air Canada (2024), Klarna's 2024-2025 earnings calls and the May 2025 walk-back of its AI customer service strategy, Cursor's public pricing-page commentary on its usage-based transition, the McDonald's + IBM joint statement on ending the drive-thru AI trial, the DPD AI chatbot incident press coverage, the Sports Illustrated AI-bylines reporting (Futurism, The Guardian), and the EEOC's settlement with iTutor Group. We also reviewed BCG, McKinsey, and Bain enterprise-AI surveys from 2024-2026 on adoption + rollback rates. Where the public narrative blames the model and the underlying evidence points to workflow or governance, we surface the gap.

DEEP ANALYSIS

FOR YOU

If you are shipping an AI feature in an enterprise context, your job is to run the four-condition check before launch (bounded scope, calibrated claims, workload-matched pricing, pre-shipped escape hatches) and the 30-minute readiness audit. Most failed rollouts went live without these. Doing them is your highest-leverage activity in the last two weeks before launch.

The pre-launch checklist

Scope: Inventoried intents. Explicit "handle this / escalate this" routing logic. Tested with real query samples.
Claims: External messaging matches sampled real-interaction outcomes. Press release language reviewed by deployment lead before publication.
Pricing: Pricing model matches workload shape (verify with cost-per-request math against expected fan-out factor).
Escape hatches: Per-feature kill switch tested. Human-review queue live. Audit trail persisted. Output validation against policy where applicable.
Eval: Continuous eval pipeline running in staging; production sampling configured.
Readiness audit: Six-question audit completed with named on-call owner, tested rollback path, prepared exec response.
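
The scope item in the checklist above calls for explicit "handle this / escalate this" routing logic. A minimal sketch of what that could look like, assuming a hypothetical upstream intent classifier that returns a label and a confidence score. The intent names, the `Route` type, and the 0.8 threshold are illustrative assumptions, not details from any of the deployments discussed here.

```python
from dataclasses import dataclass

# Hypothetical allow-list: the only intents the AI may answer on its own.
HANDLED_INTENTS = {"order_status", "store_hours", "password_reset"}
# Illustrative confidence floor; a real value would come from eval data.
MIN_CONFIDENCE = 0.8


@dataclass
class Route:
    target: str  # "ai" or "human"
    reason: str  # kept so escalations can be audited later


def route(intent: str, confidence: float) -> Route:
    """Decide in code, not in the prompt, whether the AI may answer."""
    if intent not in HANDLED_INTENTS:
        return Route("human", f"intent '{intent}' is outside bounded scope")
    if confidence < MIN_CONFIDENCE:
        return Route("human", f"confidence {confidence:.2f} is below threshold")
    return Route("ai", "in scope and confident")


print(route("order_status", 0.93))
print(route("refund_policy", 0.99))  # out of scope, however confident
print(route("store_hours", 0.55))    # in scope, but not confident enough
```

The design choice worth noting is that anything not on the allow-list escalates by default, so a new or misclassified intent fails toward a human rather than toward an improvised answer.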

When the incident happens (and it will)

Have the rollback statement drafted before launch. Customer-facing messaging should be calibrated, specific about what went wrong, specific about what is being done, and not defensive. The Air Canada framing ('the chatbot is a separate legal entity') was not a defense — it was the lead in every story about the case.

Move fast on the technical side (kill switch within minutes, not hours), and slow on the communications side (24-48 hours to draft a calibrated statement is fine; the panic statement at 11pm is worse than waiting until morning).

FOR YOU

For platform engineers building the AI infrastructure layer, your job is to make the readiness conditions easy to meet for every feature team. The four conditions are platform problems, not feature-team problems — if every team has to build kill switches and audit trails and human-review queues from scratch, most will skimp on at least one.

Platform components every team should inherit

Kill-switch primitive: A single API to disable an AI feature for all users immediately. Used by deployment leads and on-call. Tested monthly in production via game days.
Human-review queue: A shared platform queue any AI feature can route low-confidence outputs into. Includes assignment logic, SLAs, audit trail, and reporting.
Audit-trail capture: Every AI call logged with input, output, confidence, model version, prompt version. Retention policy matches regulatory requirements (typically 1-7 years depending on industry).
Output validation framework: Shared library for output validation: schema enforcement, policy check, sentiment check, PII detection. Each feature opts in to the relevant validators.
Continuous eval pipeline: Production sampling, eval execution (on batch APIs for cost), regression detection, alert routing. Teams own their evals; the platform owns the pipeline.
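
To make the first and third components above concrete, here is a minimal sketch of a kill-switch check wrapped around a model call, with an audit record written for every call that reaches the model. Everything here is an assumption for illustration: `KillSwitch`, `AuditRecord`, and `call_feature` are hypothetical names, the flag store is an in-memory set standing in for a shared service, and the version strings are placeholders.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Callable


class KillSwitch:
    """In-memory stand-in for a shared flag store (hypothetical)."""

    def __init__(self) -> None:
        self._disabled: set[str] = set()

    def disable(self, feature: str) -> None:
        self._disabled.add(feature)

    def is_enabled(self, feature: str) -> bool:
        return feature not in self._disabled


@dataclass
class AuditRecord:
    # Fields mirror the audit-trail item: input, output, confidence, versions.
    feature: str
    input_text: str
    output_text: str
    confidence: float
    model_version: str
    prompt_version: str
    timestamp: float = field(default_factory=time.time)


FALLBACK = "This assistant is temporarily unavailable. Connecting you to a person."


def call_feature(switch: KillSwitch, audit_log: list[str], feature: str,
                 user_input: str,
                 model_fn: Callable[[str], tuple[str, float]]) -> str:
    """Check the switch first, then log every call that reaches the model."""
    if not switch.is_enabled(feature):
        return FALLBACK
    output, confidence = model_fn(user_input)
    record = AuditRecord(feature, user_input, output, confidence,
                         model_version="model-v1", prompt_version="prompt-v3")
    audit_log.append(json.dumps(asdict(record)))
    return output


def fake_model(text: str) -> tuple[str, float]:
    """Placeholder for a real model call."""
    return f"echo: {text}", 0.9


switch, log = KillSwitch(), []
print(call_feature(switch, log, "support_bot", "hi", fake_model))
switch.disable("support_bot")
print(call_feature(switch, log, "support_bot", "hi again", fake_model))
print(len(log))  # only the call that reached the model is logged
```

A production version would persist the log and share the flag across processes; the point of the sketch is only that the switch is checked before the model is called, so disabling a feature does not depend on the model behaving.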

The procurement-readiness layer

Enterprise buyers are increasingly asking AI-readiness questions in procurement. If your product sells to enterprises, the platform layer needs to surface: kill-switch SLA documentation, audit-trail retention documentation, eval-coverage commitments, disparate-impact testing results (for AI touching regulated surfaces), and human-review queue throughput. Building these for procurement is engineering effort; not having them is a deal-blocker.

Procurement reality The procurement-readiness gap is one of the biggest blockers for AI-in-enterprise sales in 2026. Sellers with strong technical capability but weak readiness documentation lose to sellers with comparable capability and full readiness documentation. The work is not interesting but it is decisive.

FOR YOU

For deployment leads (the named owners when AI features ship in an enterprise), your job is the readiness audit, the rollback path, and the relationships across product, platform, compliance, and comms. The technical work is owned by others; the cross-functional discipline is owned by you.

The deployment-lead playbook

Pre-launch (T-2 weeks): Run the 30-minute readiness audit. Document gaps. Get exec sponsorship for the gaps that are launch-blockers. Get explicit "we will not launch until X is fixed" agreement.
Pre-launch (T-1 week): Test rollback path in production (real traffic, dark deployment). Verify on-call rotation. Verify monitoring + alerting fires correctly. Draft incident-communications template.
Launch day: Gradual ramp (5% → 25% → 50% → 100% over hours/days, not all at once). Watch error rates, eval metrics, customer feedback. Be ready to kill-switch.
Post-launch (week 1): Daily readiness review: any unexpected failures, eval regressions, customer complaints. Adjust feature scope based on real-traffic learnings.
Post-launch (month 1): Full retrospective. What did the readiness audit miss? What needs to change for the next launch? Document and share across the deployment-lead community.
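
The launch-day ramp in the playbook above (5% → 25% → 50% → 100%) is commonly implemented by hashing each user into a stable bucket, so that raising the percentage only adds users and never removes anyone already exposed. A minimal sketch under that assumption; the function names, the feature key, and the synthetic user IDs are hypothetical.

```python
import hashlib

RAMP_STAGES = [5, 25, 50, 100]  # percentages from the launch-day step above


def bucket(user_id: str, feature: str) -> int:
    """Map a user to a stable bucket in [0, 100) using a hash."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """A user is included when their bucket falls below the current percentage."""
    return bucket(user_id, feature) < percent


users = [f"user-{i}" for i in range(10_000)]
for percent in RAMP_STAGES:
    exposed = sum(in_rollout(u, "ai_support", percent) for u in users)
    print(f"{percent:>3}% stage -> {exposed} of {len(users)} users exposed")
```

Including the feature name in the hash input means different features ramp over different user subsets, so the same users are not always the first to see every new AI feature.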

Cross-functional relationships to invest in

Comms team: build the relationship before you need them. Walk them through the technical realities of the AI feature, the failure modes that are most likely, the messaging that would calibrate vs the messaging that would inflame. When the incident happens, you want comms to draft a calibrated statement in 4 hours, not 4 days.

Compliance team: same model. Walk them through the data flow, the audit trail, the regulatory exposure. Where you are on the edge of a regulated surface, get the compliance sign-off in writing before launch. iTutor Group's $365K settlement shows what shipping on a regulated surface without that review can cost.

FOR YOU

For founders, enterprise AI rollouts are either your product (you sell AI tooling to enterprises) or your buyer is going through them (your enterprise customers are deploying AI internally). Both angles have implications for how you build and how you sell.

If you sell AI tooling to enterprises

Procurement is the gate, not just the demo. AI-readiness documentation (kill switches, audit trails, eval coverage, disparate-impact testing) is becoming a standard procurement requirement. Sellers without these lose to sellers with them, even at lower technical capability. Invest in readiness documentation before the first enterprise deal, not after the procurement team rejects you.

Your customers are going to have public incidents. When they do, the press coverage will not distinguish between 'their fault for deploying without escape hatches' and 'your fault for selling them an AI that did the thing.' Indemnification clauses, audit rights, and clear liability allocation in contracts are not optional; they are the basis of a sustainable enterprise business.

The new sales motion The enterprise AI sales motion in 2026 is increasingly: technical capability is table stakes, readiness documentation is the differentiator. Companies with strong technical depth and weak readiness lose to companies with comparable capability and complete readiness. Do not skip the readiness work.

If you deploy AI features in your own product (consumer or B2B)

Bound your scope ruthlessly: Open-ended AI in a customer-facing surface is the Air Canada / DPD failure pattern. Specific, bounded, escalation-routed AI is the survivable pattern. Resist the temptation to advertise broad capability.
Match pricing to workload: Flat-rate against agentic workloads is the Cursor failure pattern. If your AI feature involves multi-step reasoning or tool use, price usage-based or credit-pool from day one. The transition later is painful.
Calibrate your claims: What you say externally about your AI is what your worst-case customer will hold you to. 'AI doing the work of 700 humans' is the Klarna failure pattern. 'AI handling routine queries; specialists for complex cases' is the survivable framing.

FOR YOU

For analysts and researchers tracking enterprise AI deployment, the May 2026 picture is clearer than it was a year ago: failure patterns are documented, survival conditions are identifiable, and the gap between rollouts that ship and rollouts that survive is increasingly attributable to deployment discipline rather than model capability.

Open research questions

Rollback rate by failure pattern: Survey evidence suggests ~40-60% of large-enterprise AI pilots fail to scale to production (BCG, McKinsey survey data). The distribution by failure pattern (scope vs claims vs pricing vs escape hatch vs eval vs governance) is not systematically measured. A field study would substantially advance the literature.
Cost-of-rollback measurement: Klarna, Cursor, Air Canada all paid different kinds of cost (reputational, financial, legal). Quantitative work measuring the actual cost-of-rollback distribution would help procurement teams quantify the risk side of the buy decision.
Procurement-readiness as a predictor: Hypothesis: enterprise customers who require strong readiness documentation in procurement have lower rollback rates. Empirical work testing this would influence procurement standards across the industry.

Market dynamics worth tracking

Enterprise AI procurement is professionalizing. The first generation of enterprise AI sales (2023-2024) was capability-led; the second generation (2025-2026) is readiness-led. By 2027, expect procurement-readiness documentation to be standard. This will help the buyers and constrain the sellers who cannot meet the bar.

Insurance markets are starting to price AI risk explicitly. This will become a forcing function for the readiness audit, because the underwriter will ask the same questions. Expect AI-incident-specific insurance products to become standard by 2027.

Regulatory developments will continue to favor the readiness-first approach. EU AI Act compliance deadlines, US state-level bills, sector-specific regulations all create audit requirements that map directly to the readiness checklist. Companies investing in readiness now are also investing in compliance readiness.

The shift The most under-reported story in enterprise AI 2026 is that the bottleneck has moved from model capability to deployment discipline. Models are good enough for most enterprise use cases. The companies that ship successfully are the ones doing the readiness work. The companies in the rollback list are the ones who skipped it.

Seven headline incidents and one broader pattern, what actually went wrong, and which failure pattern each maps to.

Feb 2024

Air Canada — chatbot tribunal loss

Civil Resolution Tribunal ruled Air Canada liable for a refund its chatbot promised but its written policy did not allow. Lesson: deployed AI output is treated as company policy regardless of internal intent.

governance

May 2025

Klarna — AI customer service walk-back

Klarna spent 2024 publicly celebrating that AI handled the work of 700 humans; in 2025 it quietly walked that back, citing quality + reputational cost. Lesson: overclaiming externally during the celebration phase creates the press cycle when the rollback comes.

claims

2025-26

Cursor — pricing model reset

Unlimited subscription tier removed in late 2025; usage-based pricing introduced. Flat-rate against agentic workloads is structurally negative on heavy users. Lesson: pricing model must match workload shape.

pricing

Jun 2024

McDonald's — IBM AI drive-thru ended

Three-year trial of AI drive-thru ordering ended after persistent quality issues in a high-volume, low-tolerance channel. Lesson: customer-facing channels with tight latency + quality bars are not the right pilot territory.

eval

Jan 2024

DPD — chatbot jailbroken on social

A customer prompted the DPD support chatbot into swearing and writing a poem about how bad DPD was. The screenshot went viral. Lesson: customer-facing AI without injection defenses is one prompt away from brand damage.

escape-hatch

Nov 2023

Sports Illustrated — fake AI bylines

Investigation found AI-generated articles published under fictional bylines with AI-generated author photos. Lesson: claims discipline matters; what you say externally about who wrote something has to be true.

claims

Sep 2023

iTutor Group — EEOC discrimination settlement

AI screening system rejected older applicants. $365K EEOC settlement. Lesson: AI in hiring is a regulated surface with high audit requirements; "the algorithm did it" is not a defense.

governance

Pattern

Salesforce Einstein Agentforce

Aggressive 2024 go-to-market on agent capability; 2025-2026 messaging shifted to "Agentforce + human-in-the-loop" as customer pilots returned mixed results. Lesson: marketing leading capability creates customer disappointment cycles.

claims

Enterprise AI rollouts that survive look different from rollouts that fail in identifiable ways. The differences are not capability — most failed rollouts used capable models. The differences are in how the rollout was scoped, what was promised, how the workload was priced, what happened when things went wrong, and how internal readiness was audited before launch. The framework below maps the failure patterns to the survivable conditions.

BEFORE

Rollouts that rolled back

AI scope was open-ended ("handle customer support")
External claims led internal capability ("AI doing the work of 700 humans")
Pricing model was flat-rate against variable workloads
Escape hatches were retrofitted after incidents
Eval was done at launch and not continuously
When the incident happened, no clear owner was on call

→

AFTER

Rollouts that survived

AI scope was bounded ("handle these N intents; escalate the rest")
External claims were calibrated to demonstrated capability
Pricing model matched workload shape (usage-based for agentic, flat for predictable)
Escape hatches shipped on day one (human-review queues, audit trails, kill switches)
Continuous eval pipelines ran in production from launch
Named on-call owner with clear escalation path

The right pre-launch question for enterprise AI in 2026 is not 'is the model good enough.' It is 'have we bounded the scope, calibrated the claims, matched the pricing, shipped the escape hatches, and named the owner.' Models that are 95% good enough survive deployments with these conditions; models that are 99% good enough roll back without them.

DEEP READ 6 sections · cited primary sources · technical review pending

01 Air Canada — what actually happened, and the governance pattern

Jake Moffatt asked Air Canada's chatbot about bereavement fares. The chatbot told him he could apply retroactively for a discount after booking. Air Canada's written policy required pre-booking application. Moffatt booked, applied, was denied, and went to the BC Civil Resolution Tribunal. Air Canada argued the chatbot was a separate legal entity from the airline. The tribunal disagreed and ordered Air Canada to pay Moffatt damages covering the fare difference.

The tribunal ruling is short and worth reading. The core finding: Air Canada is responsible for all information on its website, whether that information comes from a static page or a chatbot. The argument that AI output is somehow separate from company policy was explicitly rejected. The ruling is now widely cited in discussions of chatbot liability, though as a small-claims tribunal decision it is persuasive rather than binding precedent.

The failure pattern is governance: deployed AI output is treated by courts (and customers) as company policy regardless of internal disclaimers. The escape-hatch absence is what made this an incident — there was no flag in the chatbot's response that said 'this is generated, verify against policy at [link].' There was no human-review queue catching policy-conflict outputs before they reached customers.

Pattern: Deployed AI output treated as company policy. No "this output is AI-generated, verify against official policy" disclaimers. No policy-conflict detection in the response pipeline.
What worked at lower-risk peers: Customer-service AI that explicitly returns "I cannot answer this; here is how to reach a human" for any query touching policy, refunds, or contractual terms. Constrains the surface; pushes risky outputs to humans.
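
The deflection behavior described above can be sketched as a guard that runs on both the customer's query and the model's draft answer before anything is sent. This is a minimal illustration, not a description of any vendor's pipeline: the keyword patterns are a hypothetical stand-in for a proper policy classifier, and `guarded_reply` and `HANDOFF` are invented names.

```python
import re

# Hypothetical policy-sensitive topics; a real list would be maintained
# with legal and support teams, or replaced by a trained classifier.
POLICY_PATTERNS = [r"\brefund", r"\bbereavement", r"\bcompensation", r"\bcancel"]
HANDOFF = "I cannot answer this; here is how to reach a human."


def touches_policy(text: str) -> bool:
    """True if the text mentions any policy-sensitive topic."""
    return any(re.search(p, text, re.IGNORECASE) for p in POLICY_PATTERNS)


def guarded_reply(query: str, draft: str) -> str:
    """Send the draft only if neither the query nor the draft touches policy."""
    if touches_policy(query) or touches_policy(draft):
        return HANDOFF
    return draft


print(guarded_reply("Can I get a bereavement fare refund after booking?",
                    "Yes, apply within 90 days."))
print(guarded_reply("What time does check-in open?",
                    "Check-in opens 24 hours before departure."))
```

Checking the draft as well as the query matters: a benign question can still produce an answer that makes a policy commitment, and that is the output that needs to reach a human first.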

PRIMARY SOURCE Moffatt v. Air Canada — BC Civil Resolution Tribunal ruling

02 Klarna — the claims discipline failure

Klarna's Sebastian Siemiatkowski spent 2024 in earnings calls celebrating that AI was doing the work of 700 human customer-service agents; Klarna's press releases added that the assistant had handled 2.3M conversations. The narrative was crisp: AI as workforce replacement, dramatic cost savings, capable enough for production. The press cycle was favorable.

May 2025: Klarna walked it back. The public framing was 'we are hiring humans again for quality reasons.' The underlying reality: customer satisfaction scores had not held up at the promised level; complex tickets the AI did not resolve well were creating reputational damage that exceeded the per-ticket cost savings. The unit economics may have been positive; the reputational economics were not.

The failure pattern is claims discipline. Externally, Klarna had claimed full replacement; internally, the system was good for a subset of tickets and weak for the rest. When the gap surfaced, the rollback story was inevitable — the gap between claim and capability was the story.

CAVEAT The specific numbers (700 humans, 2.3M conversations) come from Klarna press releases. The reputational-cost framing is a synthesis of subsequent customer-survey reporting and Siemiatkowski commentary in 2025 earnings calls. The exact CSAT and rollback magnitude are not fully public.

PRIMARY SOURCE Klarna 2025 walk-back coverage (Bloomberg + Klarna IR materials)

03 Cursor — the pricing-model reset

Cursor shipped an unlimited subscription tier for its AI-powered code editor. When the product was primarily a chat-style autocomplete (1 LLM call per user interaction), the unit economics worked. When Cursor added agent mode (multi-step autonomous task completion: 20-100 LLM calls per user request), the economics broke for heavy users. The top 5% of users by usage consumed 50%+ of agentic workload.

In late 2025, Cursor moved to usage-based pricing. The transition was contentious — heavy users had built workflows around unlimited, and the new pricing exposed the actual cost of agent loops. But the previous model was structurally unprofitable at the subscription prices Cursor could reasonably charge.

The failure pattern is pricing-model mismatch. Flat-rate against an agentic workload with heavy-tail usage distribution loses money on the tail. Replit hit a related version. ChatGPT moved from unlimited to capped to usage-tiered for similar reasons. The pattern is now well-documented; new entrants pricing flat-rate against agentic workloads are repeating the mistake.

The shape: Agent workloads: 5-50 LLM calls per user request. Heavy-tail usage: top 5% of users drive 50%+ of cost. Flat-rate pricing: average price covers average cost, loses money on the tail.
The fix: Usage-based pricing, credit pools with overage, or subscription with hard usage caps. Each has product-experience tradeoffs but each is structurally solvent.
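
The arithmetic behind "average price covers average cost, loses money on the tail" is easy to work through. The sketch below uses entirely made-up numbers (a $20 flat price, $0.01 per LLM call, a fan-out of 20 calls per request, and a 95/5 user split chosen so the top 5% drive half of total cost); none of these are Cursor's actual figures.

```python
PRICE = 20.00         # flat monthly price per user (assumed)
COST_PER_CALL = 0.01  # blended cost of one LLM call (assumed)
FAN_OUT = 20          # LLM calls per agent request (assumed)

# segment -> (users, agent requests per user per month); illustrative split
segments = {"light (95%)": (950, 50), "heavy (5%)": (50, 950)}

total_cost = 0.0
for name, (users, requests) in segments.items():
    cost_per_user = requests * FAN_OUT * COST_PER_CALL
    total_cost += users * cost_per_user
    print(f"{name}: cost/user ${cost_per_user:.2f}, "
          f"margin/user ${PRICE - cost_per_user:.2f}")

revenue = PRICE * sum(users for users, _ in segments.values())
print(f"revenue ${revenue:.2f}, cost ${total_cost:.2f}, "
      f"margin ${revenue - total_cost:.2f}")
```

With these assumptions the plan is barely positive overall while every heavy user runs at a large loss, and raising `FAN_OUT` from 20 to 25 is enough to push total cost above revenue. That sensitivity to fan-out is why the pricing decision is structural rather than something marketing or per-customer negotiation can repair.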

PRIMARY SOURCE Cursor pricing page and usage-based transition commentary

04 McDonald's + IBM, DPD, Sports Illustrated, iTutor Group — the rest of the pattern

Each of the remaining headline incidents maps to a failure pattern in the framework.

McDonald's + IBM drive-thru (Jun 2024): Eval-coverage failure. High-volume customer-facing channel where the quality bar is exceptionally tight (wrong order = direct customer impact + worker recovery cost). Three-year trial ended without successful expansion. Lesson: pilot territory matters; customer-facing channels with low quality tolerance are the wrong place to learn.
DPD chatbot incident (Jan 2024): Escape-hatch failure. A customer crafted prompts that pushed the chatbot into swearing and writing poems criticizing DPD. No prompt-injection defense, no output validation, no kill switch when the response sentiment went off-policy. The screenshot went viral; brand damage followed.
Sports Illustrated AI bylines (Nov 2023): Claims-discipline failure. AI-generated content published under fictional bylines with AI-generated author photos. When discovered, the trust damage was substantial. The AI was not the failure; the misrepresentation was.
iTutor Group EEOC (Sep 2023): Governance failure. AI screening rejected applicants over a certain age. The case settled with the EEOC for $365K. The pattern across AI-in-hiring incidents: protected-class disparate impact gets enforced regardless of whether the algorithm "intended" the discrimination.

Across all seven incidents, the model itself is rarely the failure root cause. Capable models fail when deployed without bounded scope (Air Canada), calibrated claims (Klarna, Sports Illustrated), workload-matched pricing (Cursor), pre-shipped escape hatches (DPD), eval coverage (McDonald's), or governance discipline (iTutor). The recurring theme is that enterprise AI is a deployment discipline more than a model discipline.

PRIMARY SOURCE Composite: DPD coverage (The Guardian), Sports Illustrated (Futurism), EEOC (settlement docs)

05 The four conditions of a survivable rollout

Examining the rollouts that did survive — including AI customer service at certain B2B SaaS companies, internal AI copilots at major enterprises, AI features in coding tools that stayed live through the pricing-model resets — four conditions consistently distinguish them from rollouts that rolled back.

Bounded scope: Clear edges on what the AI handles vs escalates to a human. "Handle these 30 intent categories; escalate everything else with full context preserved." Not "handle customer support."
Calibrated external claims: External messaging accurately reflects internal capability. "Our AI handles routine tier-1 questions; complex cases route to specialists" is calibrated. "Our AI does the work of 700 humans" is not.
Workload-matched pricing: Flat-rate for predictable single-shot workloads. Usage-based or credit-pool for agentic and heavy-tail workloads. The pricing-model decision is structural, not negotiable per-customer.
Pre-shipped escape hatches: Day-one features: kill switches per feature, human-review queues for borderline outputs, audit trails for every decision, fallback paths when the model is unsure. Not "we will add these after launch."

These four are necessary, not sufficient. A rollout meeting all four can still fail for product-market-fit reasons, change-management reasons, or external-event reasons. But every rollback in the headline list violated at least 2 of the 4. Most violated 3.

PRIMARY SOURCE BCG, McKinsey, Bain enterprise AI adoption surveys 2024-2026 (composite)

06 The internal-readiness audit — six questions, 30 minutes

Most failed enterprise AI rollouts went live without anyone running an explicit pre-launch audit. The audit below is six questions, takes 30 minutes, and is one of the highest-leverage activities a deployment lead can run.

Q1 (Ownership): When this AI feature breaks at 2am, who is on call? Name a person. If the answer is 'we will figure it out' or 'the AI team,' you are not ready.
Q2 (Rollback path): What is the literal sequence of steps to disable this feature for all users in <5 minutes? Has it been tested? If no, you are not ready.
Q3 (Continuous eval): How is quality measured after launch? Is there a continuous eval pipeline running on production traffic? If quality is measured only at launch, regressions will accumulate silently until a customer surfaces them.
Q4 (Claims calibration): What is the most aggressive claim we have made externally about this AI feature? Does it match what the AI actually does on a representative sample of real traffic? If the claim is the launch press release and the reality is the eval bench, you are not ready.
Q5 (Worst-case trust scenario): What is the worst plausible user-trust outcome of this AI feature shipping broken? (Air Canada: lawsuit. Klarna: press cycle. DPD: viral screenshot.) Are we prepared for that scenario? If the answer is "that will not happen to us," you are not ready.
Q6 (Leadership press readiness): When the incident happens, is the CEO/exec sponsor prepared with a calibrated response? Has the comms team drafted the rollback statement? If no, the rollback story will be told by reporters using their default framing, not yours.
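
The claims-calibration question above asks whether an external claim matches a representative sample of real traffic. One way to make that comparison less subjective is to put a confidence interval around the sampled success rate and require the interval's lower bound to reach the claimed rate. A minimal sketch, assuming each sampled interaction has been labeled resolved or not resolved; the Wilson score interval is a standard choice for proportions on small samples, and `claim_supported` is a hypothetical helper, not part of any framework cited here.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if n == 0:
        raise ValueError("need at least one sampled interaction")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def claim_supported(claimed_rate: float, successes: int, n: int) -> bool:
    """A claim is supported only if the interval's lower bound reaches it."""
    lower, _ = wilson_interval(successes, n)
    return lower >= claimed_rate


# Illustrative numbers: 82 of 100 sampled interactions resolved by the AI.
lower, upper = wilson_interval(82, 100)
print(f"plausible resolution rate: {lower:.2f} to {upper:.2f}")
print("supports a 90% claim:", claim_supported(0.90, 82, 100))
print("supports a 70% claim:", claim_supported(0.70, 82, 100))
```

With 100 samples the interval is still wide (roughly 0.73 to 0.88 in this example), which is a useful reminder that a small sample can rule out an aggressive claim but cannot by itself justify a precise one.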

CAVEAT These questions sound simple. They are; the difficulty is that asking them surfaces uncomfortable readiness gaps right before launch, which most teams avoid because the launch date is fixed. Run the audit anyway. The cost of delaying a launch 2 weeks to fix readiness gaps is small; the cost of a public rollback is large.

PRIMARY SOURCE Synthesized from BCG enterprise AI readiness frameworks + Klarna/Air Canada post-incident commentary

Six recurring failure patterns. Each maps to multiple high-profile incidents; each is preventable. Severity reflects how often the pattern shows up across documented rollbacks, not absolute risk.

01 HIGH

Unbounded scope creates ambiguous failure modes

When AI scope is open-ended ("handle customer support"), failure cases are unbounded — the AI will be asked questions outside its competence and answer them anyway. Air Canada is the canonical case. Bounded scope ("handle these 30 intents, escalate the rest") moves failures to identifiable categories that can be measured and improved.

DO: Inventory the queries your AI receives. Define explicit intent boundaries. For queries outside the boundary, route to humans with full context preserved. This is engineering work, not policy work: write the routing logic, and do not rely on the model to know its own limits.

02 HIGH

External claims leading internal capability creates the press cycle

Klarna and Sports Illustrated both demonstrate the pattern: aggressive external claims that exceed internal capability create the rollback narrative when the gap surfaces. The press cycle is not about the incident itself; it is about the gap between claim and reality.

DO: Run a claims-vs-capability audit before any major external announcement. Sample 50-100 real user interactions. Compare actual outcomes to the claim being made. If the gap is material, soften the claim or delay the announcement until the capability matches.

03 HIGH

Pricing model mismatched to workload shape

Cursor's transition demonstrates that flat-rate against heavy-tail agentic workloads loses money on the tail and over time loses on the median too. The pricing-model decision is structural; you cannot fix bad pricing with marketing copy or by trying to compress unit cost. Eventually the math forces the reset.

DO: For new AI features, decide pricing model based on workload shape: flat-rate for predictable single-shot, usage-based for agentic, credit-pool for hybrid. Build the metering infrastructure before launch; adding it later forces a painful transition.

04 HIGH

Escape hatches treated as launch-optional

DPD is the canonical case: customer-facing AI without prompt-injection defenses or output validation is one viral screenshot from brand damage. Air Canada is a related version: chatbot without policy-conflict detection. Escape hatches (kill switches, human-review queues, audit trails, output validation) are not retrofittable after the incident; they need to ship day one.

DO: Treat escape-hatch features as launch-blockers, not launch-optional. Specifically: per-feature kill switch (test it before launch), human-review queue for low-confidence outputs (live from day one), full audit trail of inputs + outputs for compliance, output validation against policy where applicable.

05 MEDIUM

Eval coverage gaps surface as quality regressions in production

McDonald's-IBM is the canonical case for tight-tolerance channels: high-volume, low-quality-tolerance environments where eval coverage gaps surface as customer impact at scale. Eval suites that pass at launch tell you about the inputs the eval team thought to test, not about the real distribution of customer queries.

DO: Run continuous eval in production from day one. Sample real production traffic into an eval pipeline. Compare model output to ground truth (where available) or to human review. Surface regressions before customers do. Most teams skip this because it is engineering work, which is exactly why it is a differentiator.
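
The sampling, comparison, and regression steps above can be sketched in a few functions. This is a simplified illustration under stated assumptions: each production record is a dict with hypothetical `output` and `expected` keys (the latter from ground truth or human review), exact match stands in for a real grading function, and the 0.05 tolerance is arbitrary.

```python
import random
from statistics import mean


def sample_traffic(records: list[dict], rate: float, seed: int = 0) -> list[dict]:
    """Randomly sample a fraction of production records for evaluation."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() < rate]


def pass_rate(records: list[dict]) -> float:
    """Share of sampled outputs that match the reviewed answer."""
    if not records:
        raise ValueError("no records sampled; increase the sampling rate")
    return mean(1.0 if r["output"] == r["expected"] else 0.0 for r in records)


def regressed(baseline: float, current: float, tolerance: float = 0.05) -> bool:
    """Flag a regression when quality drops more than `tolerance` below baseline."""
    return (baseline - current) > tolerance


# Synthetic records for illustration only: 8 correct, 2 incorrect.
records = ([{"output": "a", "expected": "a"}] * 8
           + [{"output": "a", "expected": "b"}] * 2)
current = pass_rate(sample_traffic(records, rate=1.0))
print(f"current pass rate: {current:.2f}")
print("regression vs 0.90 baseline:", regressed(0.90, current))
```

In practice the baseline would be the pass rate measured at launch on the same grading function, and a flagged regression would route to the on-call owner rather than print to a console.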

06 MEDIUM

Governance gaps invite regulatory action

iTutor Group is the canonical case for AI in hiring: regulators enforce disparate-impact rules regardless of whether the algorithm intended the discrimination. AI in hiring, lending, housing, insurance, healthcare — each is a regulated surface where 'the algorithm did it' is not a defense and audit requirements are high.

DO: If your AI touches a regulated surface, work with compliance counsel BEFORE shipping. Document the model, training data, decision logic, and audit trail. Run disparate-impact testing across protected classes. Build the explainability surface (why was this decision made) that regulators will demand. Cost: weeks of work. Cost of skipping: settlement + reputational damage.
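
A common first-pass screen for the disparate-impact testing mentioned above is the four-fifths rule from the EEOC's Uniform Guidelines: a group whose selection rate is below 80% of the highest group's rate is flagged for closer review. The sketch below is a screening heuristic only, not a legal test, and the group names and counts are fabricated for illustration rather than drawn from any case in this article.

```python
def selection_rates(outcomes: dict[str, tuple[int, int]]) -> dict[str, float]:
    """outcomes maps group -> (selected, applicants)."""
    return {group: sel / total for group, (sel, total) in outcomes.items()}


def adverse_impact_ratios(outcomes: dict[str, tuple[int, int]]) -> dict[str, float]:
    """Each group's selection rate divided by the highest group's rate."""
    rates = selection_rates(outcomes)
    best = max(rates.values())
    if best == 0:
        raise ValueError("no group has any selections; ratios are undefined")
    return {group: rate / best for group, rate in rates.items()}


# Fabricated illustrative counts: (selected, applicants) per group.
sample = {"under_40": (120, 400), "40_and_over": (45, 300)}
ratios = adverse_impact_ratios(sample)
flagged = [group for group, ratio in ratios.items() if ratio < 0.8]
print(ratios)
print("flagged for review:", flagged)
```

Running a check like this on every model or prompt version, and keeping the results in the audit trail, gives compliance counsel something concrete to review before launch.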

Three concrete pre-launch actions worth doing for any enterprise AI rollout this quarter.

1

Run the 30-minute readiness audit before any AI launch this quarter

The six questions are above (deep section: "The internal-readiness audit"). Schedule 30 minutes with the deployment lead, the on-call engineer, the comms lead, and the exec sponsor. Walk through each question. Document the gaps. Fix the readiness blockers before launching, not after. The cost of a 2-week delay is small; the cost of a public rollback is large.
2

Audit external claims against actual capability this month

For each AI feature shipped or planned, write down the most aggressive external claim being made. Then sample 50-100 real interactions. Calibrate the claim against the reality. Where the gap is material, either soften the external claim or delay until the capability catches up. Klarna and Sports Illustrated both demonstrate what happens when the gap is allowed to persist.
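
The calibration step can be made concrete with a confidence interval on the sampled success rate. The sketch below uses the Wilson score interval, a standard choice for proportions at small sample sizes; the function names and verdict labels are hypothetical, and what counts as a "success" in each sampled interaction is assumed to be judged by a human reviewer.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def claim_verdict(claimed_rate, successes, n):
    """Compare an external claim (e.g. 0.95 resolution rate) to sampled reality."""
    low, high = wilson_interval(successes, n)
    if claimed_rate > high:
        return "overclaimed"    # the claim exceeds what the sample supports
    if claimed_rate < low:
        return "underclaimed"
    return "consistent"
```

With 70 successes in 100 sampled interactions the interval is roughly 0.60 to 0.78, so a public claim of 95% would come back as overclaimed. At 50 to 100 samples the interval is wide, which is itself useful: it shows how little a small sample can support an aggressive claim.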
3

Ship escape hatches before launching, not after the incident

For each AI feature, ensure the following are in place on day one: a per-feature kill switch tested in staging, a human-review queue for low-confidence outputs live in production, a full audit trail of inputs and outputs persisted, and output validation against policy where applicable. DPD and Air Canada both became cautionary cases because escape hatches were retrofitted after the incidents instead of shipped at launch.

Shifts in the rollout playbook to track — each will change how the next 12 months of enterprise AI deployments go.

Disclosure requirements expanding

The drivers: EU AI Act compliance deadlines through 2026-2027, state and local US AI rules (NYC bias audit law, California ADMT regulations, Colorado AI Act), and sector-specific rules (CFPB on AI in lending, HHS on AI in healthcare). Expect mandatory disclosure of AI use to customers, mandatory audit trails, and mandatory disparate-impact testing for AI on regulated surfaces. The compliance cost is real and growing.

Procurement language standardizing

Enterprise buyers are starting to include AI-specific contract clauses: indemnification for AI-generated content, audit rights on AI behavior, kill-switch SLAs, eval-coverage commitments. The procurement RFP is becoming a standardized AI-readiness audit. Sellers who cannot answer the readiness questions in procurement conversations will lose deals.

Insurance markets pricing AI risk

Cyber insurance carriers are starting to price AI-incident risk into premiums. Some are excluding AI-related liability from standard policies; some are offering AI-specific riders. Expect insurance to become a real cost layer for enterprise AI deployments by mid-2026, and a forcing function for the readiness audit (because the underwriter will ask).

The pre-launch checklist

When the incident happens (and it will)

Platform components every team should inherit

The procurement-readiness layer

The deployment-lead playbook

Cross-functional relationships to invest in

If you sell AI tooling to enterprises

If you deploy AI features in your own product (consumer or B2B)

Open research questions

Market dynamics worth tracking

Air Canada — chatbot tribunal loss

Klarna — AI customer service walk-back

Cursor — pricing model reset

McDonalds — IBM AI drive-thru ended

DPD — chatbot jailbroken on social

Sports Illustrated — fake AI bylines

iTutor Group — EEOC discrimination settlement

Salesforce Einstein Agentforce

01 Air Canada — what actually happened, and the governance pattern

02 Klarna — the claims discipline failure

03 Cursor — the pricing-model reset

04 McDonalds + IBM, DPD, Sports Illustrated, iTutor Group — the rest of the pattern

05 The four conditions of a survivable rollout

06 The internal-readiness audit — six questions, 30 minutes
