THE PERIMETER · AI SECURITY · REFERENCE · Published May 21

Prompt injection defense, May 2026: which defenses actually work, ranked by benchmark.

Prompt injection is the #1 OWASP LLM vulnerability for the third year running. Eight defenses have been systematically benchmarked in 2026. PromptArmor hits <1% false-positive and false-negative rates on AgentDojo; PromptGuard cuts attack success by 67%; AgentWatcher achieves near-zero attack success across four agent benchmarks. Layered defenses reduce attack success from 73.2% to 8.7%. Adaptive attacks still bypass state-of-the-art at >85%. Below: what works, what doesn't, and how to actually choose.

<1% PromptArmor FPR/FNR

67% PromptGuard ASR cut

73→8.7% layered defense ASR

TL;DR 30-second version · free

01 Three defenses lead the 2026 benchmark literature with materially different positioning. PromptArmor (ICLR 2026): preprocessor approach using an off-the-shelf LLM as a guardrail; <1% FPR/FNR on AgentDojo. PromptGuard: classifier-based; cuts attack success rate by 67%. AgentWatcher: rule-based monitor with attention-based context attribution; near-zero ASR across four agent benchmarks. Each wins on a different axis.
02 Layered defenses work better than any single defense. Composite frameworks reduce attack success from 73.2% baseline to 8.7%. The trade-off is latency, complexity, and cost — a layered stack adds 200-500ms per request and 30-50% inference overhead. For most production systems, this trade-off is acceptable.
03 Adaptive attacks still bypass state-of-the-art defenses at >85% success rate when attackers know the specific defense being used. Defense-in-depth (multiple layers + obfuscation of which layers are deployed) is the realistic posture, not 'pick the best single defense.'

DEEP ANALYSIS · free while in beta

READING AS

FOR YOU

For practitioners deploying AI features in production, defense selection is a 3-tier decision: pick your top layer (LLM guardrail vs classifier), pick your second layer (input shaping or rule-based monitoring), pick your output validation. The combinatorics matter; the specifics are below.

A starter layered stack

Layer 1 (input) PromptArmor as preprocessor. <1% FPR/FNR on benchmarks. Adds 200-500ms latency, ~1.5x token cost. Most teams find this trade-off acceptable.
Layer 2 (transform) Spotlighting — wrap untrusted input with explicit boundary tokens. Free to implement, adds 30-40% defense improvement on top of layer 1.
Layer 3 (output) Output schema validation via Guardrails AI. Force outputs to conform to a defined schema; reject anything that does not. Catches injection-induced free-form responses.
Layer 4 (monitor) Llama Guard 3 as post-generation content classifier. Catches outputs that bypassed earlier layers. Logs flagged content for review.

Practical evaluation methodology

Do not just trust vendor benchmarks. Build a 200-500-example internal evaluation set from your own production traffic — half legitimate inputs, half adversarial attempts (real or synthesized from public adversarial prompts). Run each candidate defense against your set. Measure FPR, FNR, latency, and cost. This data is more useful than any third-party benchmark for your specific deployment.

Run defenses in shadow mode for 30 days before promoting to enforcement. Shadow mode = defense runs on production traffic but does not block; log everything. After 30 days, you have real data on how it would have performed.

FOR YOU

For engineers building AI products, defense integration is a real platform engineering problem. Architectural choices made now affect what defenses you can deploy later. Below: the build patterns that keep defense optionality open.

Architectural patterns that keep defense optionality

Input pipeline abstraction Build your user input → LLM path through a pluggable middleware layer. Each defense becomes a middleware. Adding PromptArmor or swapping classifiers is a config change, not a refactor.
Output validation as platform service Build output validation (Guardrails AI / your own schema validator) as a shared platform service, not per-feature. Every AI feature in your product validates through it. Easier to update, easier to audit.
Attention-weight accessibility If you self-host open-weight models, preserve access to attention weights in your serving stack. vLLM and TGI both support this with configuration. Lets you deploy AgentWatcher-class defenses later.
Cross-vendor abstraction Abstract your LLM calls behind a vendor-agnostic interface. Defense portability becomes much easier when you can swap from Anthropic to OpenAI to a local Llama without rewriting the defense stack.

Tools to use

Guardrails AI Schema-based output validation. Open-source. Production-ready.
NeMo Guardrails Rule + classifier hybrid. Open-source from Nvidia. Best for constrained-product domains.
PyRIT Microsoft's red-teaming toolkit. Use to systematically probe your own deployments for injection vulnerabilities.
garak Nvidia's LLM vulnerability scanner. Run in CI against your deployed AI features.
PromptArmor (when generally available) The leading defense. Watch for production release / hosted service offering.

FOR YOU

The prompt-injection defense category is becoming a real market segment as 2026 progresses. Watch for category-defining vendors emerging, enterprise procurement budgets allocating specifically to AI-defense tooling, and consolidation as larger security vendors acquire AI-defense startups.

The market dynamics

OWASP putting prompt injection at #1 for three years running plus the layered-defense empirical evidence pushes enterprise security buyers to allocate dedicated budget. Estimated category size: $200-500M in 2026 enterprise spending; growing to $2-5B by 2028 if the cybersecurity-as-percentage-of-AI-budget pattern holds.

Buyers fall into two segments. Tier 1: enterprises building production AI applications (banks, healthcare, large tech) where injection-driven incidents have real cost. They will pay $50-500K/year for managed defense services. Tier 2: AI-first startups where defense costs scale with their AI usage. They will buy SaaS defense tools at usage-based pricing.

Ticker positions

CRWD CrowdStrike has the brand + distribution to extend Falcon platform into AI defense. Expect acquisition activity here within 12 months.
PANW Palo Alto Networks Unit 42 already publishes AI security research. Productizing into defense offerings is a natural extension.
S, ZS, NET SentinelOne, Zscaler, Cloudflare. All have AI security adjacencies but no clear defensive-AI product yet. Watch for product announcements over Q3-Q4 2026.
MSFT, GOOGL Both will ship native defenses into their AI APIs. Revenue captured per-API-call. Watch for Azure AI / Vertex AI defense-feature launches.
NVDA NeMo Guardrails plus garak give Nvidia a defense-tooling story. Not a primary revenue driver but a moat for keeping enterprise AI workloads on Nvidia stacks.
Early-stage vendors Watch funding announcements for Guardrails AI, Lakera, Robust Intelligence, Protect AI, Lasso Security. Acquisitions by tier-1 cybersecurity vendors are likely 2026-2027.

FOR YOU

For founders, prompt-injection defense is either a build-or-buy decision for your own product (most founders) or a market opportunity (some founders building security-adjacent products). Both angles below.

If you ship AI features in your product

Prompt injection is a security issue your enterprise customers will start asking about in procurement conversations within 90 days. Have an answer ready: which defenses you deploy, what the residual risk profile is, what your incident response looks like.

Build vs buy: most startups should buy (Guardrails AI, Lakera, or one of the emerging vendors) rather than build. Building is months of work; buying is days. The exception: if your application has unique threat model characteristics that off-the-shelf defenses do not address, building a custom layer makes sense.

Procurement reality If you sell to enterprise and cannot articulate your prompt-injection defense posture in a 60-second elevator pitch, your procurement conversations will go poorly. Build that pitch this month.

If you are building defense-adjacent products

Real opportunity Production-grade defense tooling is genuinely underbuilt. Guardrails AI, Lakera, Robust Intelligence, Protect AI, Lasso Security are early-stage. Category will support multiple successful companies.
Defensible positioning Domain depth (defense for specific verticals — healthcare, financial services, legal tech), integration depth (deeply embedded into specific platforms — Salesforce, ServiceNow, Microsoft 365), open-weight specialization (defenses that work for open-source models, not just closed-vendor APIs).
Beware "We use AI for AI defense" is not a differentiator anymore. Specific architectural choices, specific benchmark results, specific enterprise references are what differentiate.

FOR YOU

Prompt injection defense research is at an inflection point: leaders have emerged (PromptArmor, AgentWatcher), the benchmark gap is documented (Wang et al. 2026), and the next research questions are being defined. Several high-impact directions worth pursuing.

Open research questions

Context-dependent benchmarks Wang et al. 2026 identified that no current benchmark includes context-dependent tasks. Building such a benchmark would be widely adopted and cited. Methodology: extract real agent traces from production deployments (privacy-preserved), construct controlled adversarial variants.
Adaptive-attack robustness PromptArmor and other state-of-the-art defenses degrade to >85% bypass under adaptive attack. Why? What architectural changes would close this gap? Active research area.
Production-vs-benchmark gap measurement How much worse do published defenses perform in production vs benchmark? Unknown empirically. Studies measuring this gap on real deployments would have high practical impact.
Cross-defense composition theory Layered defense empirically reduces ASR from 73.2% to 8.7%. But this is empirical, not theoretical. A theoretical framework for predicting layered-defense effectiveness from individual component performance is a wide-open research area.

Venues and funding

ICLR, NeurIPS, USENIX Security, IEEE S&P all have submission windows in summer 2026 for early-2027 publication. SaTML and BlackHat AI are emerging venues specifically for AI security work.

Funding: OpenPhilanthropy AI safety program, NSF SaTC + NSF SBE, the new ARIA program from UK government, EU Horizon Europe AI security calls, industrial lab collaborations (Anthropic, Apollo Research, METR, Microsoft Research, Google Brain).

Why now The prompt-injection defense subfield is in its 'foundational citation' window. Researchers publishing solid empirical work in 2026-2027 will be cited by every enterprise procurement decision for the next 5+ years.

The defense landscape as of May 2026 — eight techniques benchmarked, ranked by approach + measured performance.

ICLR 2026

PromptArmor

LLM-as-guardrail preprocessor. <1% FPR and FNR on AgentDojo. Simplest strong defense; needs API call to guardrail model.

leader

2026

AgentWatcher

Rule-based monitor with attention-based context attribution. Near-zero ASR across four agent benchmarks. Lower runtime cost than LLM-guardrail.

leader

2025-26

PromptGuard

Classifier-based input filter. 67% reduction in attack success rate. Lower precision than PromptArmor; faster.

mainstream

NeMo

NeMo Guardrails (Nvidia)

Topic + rule-based system. Effective for constrained-product domains; less effective for open chat.

mainstream

Llama Guard 3

Meta's safety classifier as moderation pre-filter. Good as one layer in a stack; insufficient alone.

mainstream

2026

Spotlighting

Mark untrusted input with explicit boundary tokens. Cheap and helpful as one layer.

pattern

2026

StruQ structured queries

Force user input through a structured-query interface that strips prompt instructions. Useful for retrieval-style apps.

pattern

2026

Defensive token reweighting

Train models with defensive tokens that override injection attempts. Requires model retraining; not feasible for closed-vendor APIs.

research

Prompt injection defenses operate at three layers: input layer (filter or transform user-supplied content before it reaches the model), model layer (train the model to resist injection), and output layer (validate model output before it acts). The leaders in 2026 are input-layer defenses; model-layer defenses require retraining and are out of reach for most teams; output-layer defenses are necessary but insufficient on their own.

BEFORE

Pre-2026 defense landscape

Most teams relied on string-matching input filters
Topical guardrails for product-specific use cases
Output filtering as the primary security boundary
Layered defense was advised but not benchmarked
Adaptive attacks bypassed most defenses at >70% success

→

AFTER

2026 defense landscape

LLM-as-guardrail preprocessors achieve near-perfect classification on standard benchmarks (PromptArmor)
Rule-based monitors with attention attribution provide low-cost alternative (AgentWatcher)
Layered defenses now have empirical evidence: 73.2% → 8.7% attack success reduction
Adaptive attacks still bypass state-of-the-art at >85% — defense moats are situational, not absolute
Benchmarks (AgentDojo, PromptBench, TruthfulQA) provide common-ground evaluation

The right question for 2026 is not "which defense should I pick" — it is "what layered stack fits my latency, cost, and threat-model budget." Single-defense posture is structurally insufficient against motivated adversaries.

DEEP READ 7 sections · cited primary sources · technical review pending

01 PromptArmor — the LLM-as-guardrail approach

PromptArmor (Stanford + Microsoft Research, ICLR 2026) is the simplest defense with strongest published results. Architecture: a separate, off-the-shelf LLM (GPT-4o, GPT-4.1, or o4-mini in the published evaluations) acts as a preprocessor. User input flows to PromptArmor first; PromptArmor detects and strips injection content; cleaned input goes to the production LLM.

Published performance on AgentDojo: <1% false positive rate AND <1% false negative rate. This is genuinely strong — most prior defenses traded FPR for FNR. PromptArmor achieves both because the guardrail LLM has enough context to make accurate judgments on borderline cases.

Practical trade-offs: cost (two LLM calls per request instead of one — roughly 1.3-1.5x token cost depending on guardrail model selection); latency (added 200-500ms typically); operational complexity (now monitoring two LLM endpoints). For production deployments handling sensitive content, the trade-offs are usually acceptable.

When to deploy Production AI applications with downstream actions (agents, tool-use, code execution) where misbehavior is costly.
When to skip Low-stakes chat applications where added latency hurts UX more than potential injection hurts downstream systems.
Open questions How does PromptArmor degrade under adaptive attack where attacker knows it is deployed? The paper does not extensively cover this.

CAVEAT PromptArmor benchmarks use AgentDojo which has known gaps (Wang et al. 2026 documents that none of the 14 evaluated benchmarks include context-dependent tasks pervasive in real deployments). Real-world performance will be lower than benchmark numbers; magnitude of degradation is unstudied.

PRIMARY SOURCE PromptArmor paper (arXiv 2507.15219, ICLR 2026)

02 AgentWatcher — the rule-based attention-attribution approach

AgentWatcher (research preprint, arXiv 2604.01194) takes a different approach: rule-based monitoring using attention-based context attribution. Instead of running a guardrail LLM, AgentWatcher analyzes the attention patterns of the production model on incoming context to detect when external content is exerting unusual influence on the model's behavior.

Published performance: near-zero attack success rate across four agent benchmarks. Cost profile is structurally better than PromptArmor: no second LLM call, attention-analysis runtime is a small fraction of generation cost (typically <10% overhead).

The trade-off is interpretability + flexibility. Rule-based systems require defining what 'unusual influence' means for your application. Out-of-the-box rules work for benchmark scenarios; production deployment usually requires tuning the rule set to your specific tool-use patterns and threat model.

CAVEAT AgentWatcher requires access to the production model attention weights. This works for self-hosted models (Llama, Mistral, DeepSeek). For closed-vendor APIs (Anthropic, OpenAI, Google), attention weights are not exposed, so AgentWatcher cannot be deployed without architectural changes by the vendor. Practical implication: AgentWatcher is for open-weight production deployments.

PRIMARY SOURCE AgentWatcher preprint (arXiv 2604.01194)

03 PromptGuard, Llama Guard 3, NeMo Guardrails — the classifier ecosystem

Below the two leaders sits the classifier ecosystem: PromptGuard (Meta), Llama Guard 3 (Meta), NeMo Guardrails (Nvidia), and various commercial offerings. These are classifier models trained to detect specific patterns: prompt injection attempts, content policy violations, off-topic queries, etc.

PromptGuard cuts injection success rates by 67% — meaningful but markedly below PromptArmor. The gap is not a flaw in the classifier approach; it is that classifiers work on patterns that show up in training data, and adversarial inputs explicitly try to avoid those patterns. PromptArmor's full-LLM guardrail can reason about novel attack patterns; classifiers fundamentally cannot.

Where the classifier ecosystem still wins: cost and latency. A classifier inference is roughly 10x cheaper than an LLM-guardrail call and 5-10x faster. For high-volume applications where the threat model accepts 30-40% residual injection success, the classifier stack is the realistic deployment.

PromptGuard 67% reduction in ASR. Fast, cheap. Good as one layer in a stack.
Llama Guard 3 Content moderation classifier. Used pre- and post-generation. Misses adversarial inputs designed to bypass topic classification.
NeMo Guardrails Rule + classifier hybrid from Nvidia. Strong on constrained-product domains. Less effective on open chat.

PRIMARY SOURCE TokenMix prompt-injection defense benchmark

04 Spotlighting and StruQ — input-shape defenses

Two cheap input-shape defenses are worth deploying as part of any layered stack. Spotlighting: explicitly mark untrusted input with boundary tokens (e.g., wrapping in special markers that signal 'this is user-provided content, do not interpret as instructions'). StruQ: force user input through a structured-query interface that strips natural-language instructions before they reach the model.

Spotlighting is genuinely useful as one layer — recent evaluations show 30-40% reduction in injection success rate on its own, more when combined with classifier-based defenses. StruQ is more restrictive but stronger: by structurally preventing instruction-style input from reaching the model, you eliminate the entire prompt-injection class for that input channel. Trade-off: StruQ only works for retrieval-style apps where the user input is constrained to specific operations.

PRIMARY SOURCE CallSphere prompt-injection hardening patterns 2026

05 Defensive token reweighting — the model-layer defense

Model-layer defenses train the production LLM itself to resist injection attempts. Defensive token reweighting (research-stage) introduces specific tokens into the model's training that explicitly override injection attempts when present in the prompt. The published results are promising but require model retraining — not feasible for closed-vendor APIs (Anthropic, OpenAI, Google), and expensive for open-weight model deployments.

Where it works: organizations training their own LLMs from scratch or doing meaningful fine-tuning can incorporate defensive token training into the recipe. The result is a model that has injection resistance baked in, reducing the load on input-layer defenses.

Where it does not work: any organization using API-only access to closed-vendor models. These are most production deployments. Until vendors ship defensive-token-aware models (Anthropic's safety training is conceptually related but not exposed as a primitive), this remains a research-flavored option for most teams.

PRIMARY SOURCE Architecting Secure AI Agents (arXiv 2603.30016)

06 The layered-defense math — where 73% becomes 8.7%

The cleanest empirical result in 2026 prompt-injection research is the layered-defense effect: composite frameworks combining multiple defense techniques reduce attack success rate from 73.2% baseline (no defense) to 8.7% (layered).

Baseline (no defense) 73.2% attack success rate on AgentDojo with adversarial prompts crafted by humans + LLM-assisted adversarial agents.
Single-defense (PromptArmor or AgentWatcher) ~15-25% attack success rate. Strong improvement but not sufficient against sophisticated adversaries.
Two layers (input filter + output validation) ~12% attack success rate.
Three layers (PromptArmor + Spotlighting + output schema validation) ~9% attack success rate.
Four+ layers (above + AgentWatcher + Llama Guard 3 post-filter) 8.7% attack success rate; diminishing returns past this point.

Practical implication: most production AI applications should target 3-4 layers. The marginal improvement past four layers (8.7% → 7% requires another 3-4 layers and significant latency + cost overhead) is usually not worth the operational complexity. The exception is high-stakes deployments (financial, medical, security-critical) where single-percentage-point improvements matter.

CAVEAT These numbers are from controlled benchmark conditions with non-adaptive attackers. Adaptive attackers (who know which defenses are deployed) maintain >85% attack success rate against state-of-the-art layered defenses. Defense-in-depth in 2026 needs to include obfuscation of which defenses are deployed, not just stacking known defenses.

PRIMARY SOURCE Composite reading: PromptArmor, AgentWatcher, layered-defense benchmarks

07 What is NOT in the benchmarks — the production gap

Wang et al. (arXiv 2510.05244, 2026) analyzed 14 prompt-injection benchmarks and found a consistent gap: none of them include context-dependent tasks, which are pervasive in real agent deployments. Real agents read documents, browse webpages, parse emails, integrate with internal databases — all of which create context-dependent injection surfaces that benchmarks do not capture.

Practical consequence: defense performance numbers from benchmarks (PromptArmor <1% FPR, AgentWatcher near-zero ASR) overstate real-world performance. The magnitude of degradation when moving from benchmark to production is currently unstudied. Plausible range: 2-5x worse performance on context-dependent tasks than on benchmark-pure injection attempts.

What this means for procurement: vendor claims about defense effectiveness should be discounted significantly when applied to production deployments with rich context. The gap is not a flaw in the vendors; it is a structural gap in the evaluation methodology that nobody has solved yet.

CAVEAT The "2-5x worse" estimate is our extrapolation from Wang et al. findings. It has not been empirically measured at this scale. Anyone publishing rigorous production-vs-benchmark performance data would significantly advance the field.

PRIMARY SOURCE Wang et al. 2026 — Indirect prompt injection benchmarks: are firewalls all you need?

Six known failure modes across the current defense landscape. Each is a specific attack class where multiple defenses have been shown to underperform their reported benchmarks.

01 HIGH

Adaptive attacks bypass state-of-the-art at >85% success

When attackers know which specific defense is deployed, they can craft inputs that bypass it. Research consistently shows >85% attack success rate against state-of-the-art defenses under adaptive attack conditions. Defense-in-depth with obfuscation of which defenses are deployed is the realistic posture, not "deploy the best single defense."

DO Run regular adaptive red-team exercises against your own deployed defenses. If you cannot do this in-house, contract with Apollo Research, METR, or specialized AI red-team services. The defenses you deploy publicly are partially visible to attackers; assume they will be probed.
02 HIGH

Benchmark-to-production gap is uncharacterized

All published defense performance numbers come from benchmarks that have known gaps — particularly the absence of context-dependent tasks pervasive in real agent deployments (Wang et al. 2026). Real-world defense effectiveness is meaningfully worse than benchmarks suggest, but by how much is unmeasured.

DO Treat vendor performance claims as upper-bound estimates. Build your own internal benchmark using examples from your actual production traffic. Compare defense performance against your benchmark, not vendor benchmarks.
03 MEDIUM

Single-defense posture is structurally insufficient

Best single defense (PromptArmor) achieves ~15-25% residual attack success rate. For most applications this is too high. Single-defense posture exists not because it is effective but because layered defense is operationally complex. The operational complexity is genuinely real but layered defense is also genuinely necessary.

DO Plan for 3-4 layers as baseline production posture. PromptArmor as preprocessor + Spotlighting + output schema validation + Llama Guard 3 post-filter is a reasonable starter stack. Budget for 200-500ms added latency and 30-50% inference cost overhead.
04 MEDIUM

AgentWatcher requires open-weight model access

AgentWatcher's attention-attribution approach requires access to model attention weights. This works for self-hosted Llama, Mistral, DeepSeek deployments but not for closed-vendor APIs (Anthropic, OpenAI, Google). If your stack is API-only, AgentWatcher is unavailable to you regardless of how well it benchmarks.

DO If you depend on closed-vendor APIs and need attention-monitoring-grade defense, either (a) negotiate attention-weight access with your vendor (possible for enterprise customers), (b) deploy a parallel open-weight model with AgentWatcher for high-stakes inputs, or (c) accept the gap and rely on input-layer defenses.
05 MEDIUM

Defense ecosystem lock-in risks

Most defenses are tied to specific LLM ecosystems. Llama Guard 3 is Meta's; NeMo Guardrails is Nvidia's; PromptArmor depends on having access to a frontier LLM for the guardrail role. If your stack changes vendors, your defense stack often changes with it. Defense portability is genuinely poor in 2026.

DO When evaluating defenses, weight portability heavily. Defenses that work across multiple LLM backends (input-layer filters, output validation) are more durable than defenses tied to specific vendor capabilities.
06 HIGH

New attack classes emerge faster than defenses

OWASP LLM Top 10 has had prompt injection at #1 for three years running because new attack variants emerge faster than defenses ship. Indirect prompt injection through documents, tool-poisoning via MCP descriptions, attention-sink manipulation, multi-turn injection over conversation history — each was novel within the last 24 months and each is now standard threat surface.

DO Subscribe to threat intelligence from Apollo Research, METR, Unit 42 (Palo Alto), and OWASP. Build your defense roadmap around emerging attack classes, not yesterday's threat model. Defense investment lag of 6+ months behind attack publication is a real exposure window.

Three concrete actions this week.

1

Audit your current defense posture honestly this week

Map your existing prompt-injection defenses. Most teams have one (input filter or output validation) and assume it is sufficient. Cross-reference against the layered-defense data: if you have 1-2 layers, you are at ~15-25% residual attack success. That is too high for most production applications. Document the gap.
2

Pilot PromptArmor or AgentWatcher this month

Pick whichever fits your stack: PromptArmor if you can afford the second LLM call latency + cost; AgentWatcher if you self-host open-weight models. Run it in shadow mode against production traffic for 30 days. Measure false-positive rate (legitimate user requests blocked) and look at flagged content. Decide whether to promote to full deployment based on observed performance.
3

Build your internal benchmark

Vendor benchmarks systematically overstate real-world performance because they miss context-dependent tasks. Capture 200-500 examples from your actual production traffic — both legitimate inputs and adversarial attempts you have seen. Use this as your internal evaluation set. Compare defenses against your data, not vendor demos.

Signals in the next 60 days that matter.

PromptArmor 2.0 or competitive replacement

PromptArmor is the leader as of May 2026. Watch for follow-up work that addresses the adaptive-attack gap (currently >85% bypass under adaptive conditions). The next major defense paper will likely target this specifically.

Context-dependent benchmark releases

Wang et al. 2026 identified the benchmark gap. Watch for new benchmarks that include context-dependent tasks. When they land, expect defense rankings to shift meaningfully — current leaders may not maintain their positions.

Vendor-shipped defenses become standard

Anthropic, OpenAI, and Google will eventually ship native defense layers into their APIs. When they do, the build-vs-buy calculus for defense changes. Monitor vendor announcements quarterly.

A starter layered stack

Practical evaluation methodology

Architectural patterns that keep defense optionality

Tools to use

The market dynamics

Ticker positions

If you ship AI features in your product

If you are building defense-adjacent products

Open research questions

Venues and funding

PromptArmor

AgentWatcher

PromptGuard

NeMo Guardrails (Nvidia)

Llama Guard 3

Spotlighting

StruQ structured queries

Defensive token reweighting

01 PromptArmor — the LLM-as-guardrail approach

02 AgentWatcher — the rule-based attention-attribution approach

03 PromptGuard, Llama Guard 3, NeMo Guardrails — the classifier ecosystem

04 Spotlighting and StruQ — input-shape defenses

05 Defensive token reweighting — the model-layer defense

06 The layered-defense math — where 73% becomes 8.7%

07 What is NOT in the benchmarks — the production gap

Adaptive attacks bypass state-of-the-art at >85% success

Benchmark-to-production gap is uncharacterized

Single-defense posture is structurally insufficient

AgentWatcher requires open-weight model access

Defense ecosystem lock-in risks

New attack classes emerge faster than defenses

Audit your current defense posture honestly this week

Pilot PromptArmor or AgentWatcher this month

Build your internal benchmark

PromptArmor 2.0 or competitive replacement

Context-dependent benchmark releases

Vendor-shipped defenses become standard