THE STACK · ARCHITECTURE · MODELS · May 16, 2026

DeepSeek V4 is a 1.6-trillion-parameter open-weight model with a 1M-token context. The threat surface scaled with it.

DeepSeek released V4 on April 24 with a full technical paper. V4-Pro is 1.6T parameters (49B activated), V4-Flash is 284B (13B activated), both supporting 1M-token context. The architecture is genuinely interesting; the security story is what happens when frontier-comparable capability ships in open weights with a million-token window.

1.6T V4-Pro params

1M context tokens

32T pretrain tokens

TL;DR 30-second version · free

01 DeepSeek V4 (released April 24, 2026) brings open-weight 1M-context to frontier-comparable capability. V4-Pro: 1.6T params / 49B activated. V4-Flash: 284B / 13B. New attention design (Compressed Sparse + Heavily Compressed), Muon optimizer instead of AdamW, 32T-token pretrain.
02 For your threat model: every assumption that rested on 'no attacker has access to a frontier model' is now obsolete. Open weights mean frontier capability runs anywhere a 1.6T-param model fits. The 1M-context window multiplies prompt-injection surface area.
03 Three immediate consequences: adversarial fine-tunes get cheap to produce, defenses that assumed gatekeeping at the API no longer apply, and long-context compression mechanisms (CSA/HCA) have attack/leakage characteristics not yet characterized in security literature.

DEEP ANALYSIS · free while in beta

READING AS

FOR YOU

For day-to-day AI use, V4 will reach you faster than you think — through your tools, not because you deployed it. The first wave hits Cursor, Aider, Continue, Zed, and Cody as they add V4 backends. The second wave is content authenticity: outputs from V4-derivative models start appearing in code review, technical docs, and customer support transcripts. Defenders need both eval discipline at the tools layer and detection capacity at the content layer.

Where V4 reaches you first

Open-weight models propagate through the developer-tool stack predictably. Within 30 days expect quantized V4-Flash variants (4-bit, 8-bit) on Hugging Face that fit on a 24GB consumer GPU. Within 60 days expect Cursor, Continue, Aider, and Cody to ship V4 backends as user-selectable. Within 90 days expect TGI, vLLM, llama.cpp, and Ollama recipes to be in active rotation. Each layer broadens who actually runs the model.

If your team uses any of these tools today, your developers will be running V4 by Q3 whether you sanctioned it or not. The 'no one in our org has access to a frontier model' assumption already does not hold.

Workflow eval checklist this week

Prompt-injection regression Run your existing prompt-injection test suite at 1M context length against V4-Flash. Most tests calibrated for ≤128K fail to scale; document where defenses degrade.
Output-authenticity baseline Capture 200–500 known-human and known-V4 samples in your domain (code, documentation, support tickets). You will need this baseline when policy questions about AI-content disclosure come up.
Tool-side allowlist Decide which dev tools in your stack may route to V4 backends and configure tool settings explicitly. The default for Cursor / Continue / Aider in the next 60 days will likely include "V4 available" — opt in deliberately, not by default.
IDE prompt-injection sweep V4 in your IDE means indirect prompt injection through documents, URLs, and pasted code becomes more dangerous. Audit which IDE extensions allow agent loops to read arbitrary files; restrict where prudent.

Two specific things to add to your dev workflow

First: run garak (Nvidia's open-source LLM vulnerability scanner) against any V4 endpoint you stand up internally. It catches the obvious-class injection failures cheaply. Second: add an explicit 1M-context test case to your CI for any RAG / long-context features. The case should include a deeply-buried adversarial instruction near the end of context — that is where current defenses degrade most.

Reality check If your team cannot answer "who in our org could spin up a V4 endpoint this afternoon?" — that is the inventory gap.

FOR YOU

Building secure long-context features against V4-class capability requires shifting where defenses live. Defenses you used to outsource to API providers (refusal training, output filtering, abuse detection) now have to be reimplemented in your stack — V4 is open-weight, so attackers run uncensored variants locally. Below: specific libraries, code patterns, and configuration changes that move the needle.

Libraries to add to your stack

Guardrails AI Schema-based output validation. Catches injection-induced format violations even when content passes. Use at the response boundary, not the prompt boundary.
NeMo Guardrails (Nvidia) Rule-based input/output filtering with topical guardrails. Best when you have a constrained product surface (customer support, code review) rather than an open chat.
Llama Guard 3 Meta's safety classifier as a moderation pre-filter. Run it before V4 generation and after; the post-filter catches outputs that bypassed the pre-filter.
garak Nvidia's LLM vulnerability scanner. Run as part of CI on any V4 endpoint you stand up; catches known prompt-injection classes cheaply.
PyRIT Microsoft's red-teaming toolkit for generative AI. Use to systematically probe your own deployments; especially useful for testing multi-turn injection that single-turn scanners miss.
Giskard Behavioral testing framework. Good for regression-testing once you have a baseline; tells you when a model update changes attack-surface behavior.

Code patterns for long-context defense

Input-length budgeting: enforce hard maximums per user, per session, and per logical task — even when the model supports 1M tokens. Attackers exploit the gap between "model supports" and "your product allows."

Attention-sink monitoring: V4's CSA mechanism preserves attention to certain anchor tokens. Adversarial input that lands in those anchor positions has outsized influence. Log when user input contains content that resembles common attention sinks (special tokens, formatting markers).

Output-schema validation as a security boundary: define what your endpoints can return as a strict JSON schema. Anything else is a refusal. This catches injection-induced free-form responses that bypass topical filters.

Configuration changes when deploying V4

vLLM Disable streaming for security-critical endpoints — streaming reveals partial generation and gives feedback to adversarial probers. Set max_model_len explicitly even if the model supports 1M.
llama.cpp Use the -c flag to cap context per session at 32K–128K unless the use case truly needs 1M. The cost of 1M is mostly KV cache, not just parameters.
TGI (HuggingFace) Use the input-validation middleware aggressively. Sanitize any user content that gets concatenated into the prompt; do not rely on chat-template separators alone.

FOR YOU

Frontier capability in open weights breaks the closed-frontier moat in a specific way: API gatekeeping no longer constrains who can deploy. The bear case is pricing pressure on closed vendors as enterprises gain a credible self-host option. The bull case is total inference demand expansion as the addressable use-case set grows. Both are real. The net position depends on which effect dominates by Q3, and that varies by ticker. Specifics below — none of this is investment advice; do your own work.

The thesis

V4 is the first credible open-weight model in the frontier capability cohort with a 1M-token context. The competitive implication: enterprises evaluating Claude, GPT, or Gemini for production workloads now have an open-weight floor they can negotiate against. Pricing power for closed-vendor API access narrows where buyers have the technical capacity to self-host. Pricing power expands where buyers cannot.

Net inference demand likely grows: capability democratization expands the addressable set of agentic / long-context use cases. The infrastructure layer (chips, clouds, deployment platforms) benefits; the closed-frontier API layer faces margin pressure on commodity inference.

Ticker-by-ticker read

MSFT Azure is heavily levered to OpenAI exclusive partnership. V4-class open-weight pressure tightens the negotiation Microsoft has with OpenAI on revenue share and may compress Azure AI pricing as customers cite self-host alternatives.
GOOGL Gemini's moat depended partly on 1M+ context being unique to Google. V4 closes that gap. Gemini API revenue is small relative to overall Alphabet, but the narrative pressure on Cloud + the AI capex justification weakens.
AMZN AWS hosts both Anthropic and broad open-source — wins on infrastructure demand regardless of which model wins. Anthropic-specific exposure is moderate. Net positioning depends on whether Bedrock revenue mix shifts to open-weight V4-class models.
META Llama strategy was always "open-source is the strategy." V4 validates the bet. Reinforces Meta as the natural home for enterprise customers who want open-weight stewardship from a hyperscaler.
NVDA Efficiency gains (V4 uses 27% inference FLOPs vs V3.2) cut per-token GPU demand. Capability democratization expands total deployable inference base. Which dominates depends on adoption rate. Bull on long-term inference TAM; near-term concerns on capex narrative.
HF (private) HuggingFace is the natural distribution layer for V4 and its derivatives. Hard to play directly as a public-market trader, but watch IPO chatter; the open-weight tailwind helps any HF-adjacent valuation event.
MDB, ESTC Vector DBs and search infrastructure benefit from RAG/long-context workloads growing on open-weight models. Less correlated to which vendor wins the model layer.

Timing windows

The narrative peaks in two waves. Wave one: 30–60 days from V4 release, as derivatives proliferate and the "open-source caught up" headlines run. Wave two: 90–180 days, as enterprise procurement teams cite V4 in renewal negotiations and the pricing pressure shows in margin commentary on closed-vendor earnings.

Catalysts to watch: Anthropic enterprise revenue commentary in Amazon's next quarterly disclosure (Anthropic results flow indirectly through AWS reporting); OpenAI revenue mix shifts visible in Microsoft's commentary; first major enterprise win announcement where the customer specifically cites open-weight as their fallback negotiating leverage.

What invalidates this thesis

US export-control extension to V4 weights/derivatives. If V4 gets caught by Entity List or BIS controls, the open-weight thesis hits a regulatory wall and the closed-vendor moat reasserts. Watch BIS commentary and US-China policy commentary through Q3.

A major closed-frontier release (GPT-5.5 → 6 jump, Claude Opus 5, Gemini 3) that re-establishes a 6+ month capability gap. Frontier-vendor R&D outrunning open-weight is the historical default; V4 may be a transient closure rather than a structural one.

The signal If you can only watch one metric over the next 90 days: enterprise revenue mix commentary from Microsoft, Amazon, and Google's cloud segments. That is where the open-weight pricing pressure will first appear in financial disclosures.

FOR YOU

If your product's defensibility rests on 'we have access to frontier-class capability that competitors do not,' V4 just narrowed your moat. The defensible position now sits in three places: data you own that the model does not see, fine-tuning expertise that lets you specialize, and distribution where you reach customers competitors cannot. Specific decisions to make this quarter below.

Defensibility framework rewrite

Pre-V4 startup playbook had API access as a meaningful component of competitive position — better prompts on Claude or GPT-4 was a moat for ~12 months until everyone caught up. Post-V4, raw model access is a commodity input. The moat shifts upward: data, workflow integration, specialized fine-tunes, and customer distribution.

Concrete: list every product feature whose differentiation rests on "Claude/GPT/Gemini can do X." Categorize as (a) commodity now, (b) commodity within 90 days, (c) genuinely defensible (proprietary data, network effects, regulatory moat). Prioritize investment in (c); accept (a) and (b) as commodity competitive surface where price/UX win.

Specific decisions to make this quarter

Hire fine-tuning capability or buy it V4 is fine-tunable. Specialized fine-tunes for your domain are now a differentiation lever. Decide whether to hire one strong ML engineer with fine-tuning experience (~$300K-$500K/year fully loaded) or contract with Modal / Together AI / Fireworks for fine-tuning-as-a-service. The build/buy answer depends on how often you expect to retrain and how proprietary your data is.
Lock in closed-vendor pricing now If you use Claude/GPT/Gemini in production, your vendors are about to face pricing pressure from open-weight competition. Renegotiate annual contracts now — your leverage is at its peak before pricing pressure shows in their P&L. Specifically: ask for volume-based discounts and explicit non-discrimination clauses against any new pricing tier they might launch.
Decide your open-weight posture publicly Customers in regulated industries are starting to ask "what model are you running and where does our data go?" An explicit position (we use only closed-vendor X / we run V4 in our own VPC / we use both with customer choice) becomes a procurement-conversation differentiator. Pick one and document it.
Audit your data licensing If your moat is proprietary data, audit whether your existing data licensing allows fine-tuning. Many vendor data deals were written before fine-tuning was a buyer concern; the contracts may need amendment. Get this in front of legal before you commit to fine-tuning capability.

Hiring and talent strategy shifts

Open-weight frontier means ML engineering becomes more valuable relative to ML research. Pre-V4: ML researchers (people who understand model architecture, training dynamics) commanded premiums because closed-vendor APIs needed expert prompt engineering. Post-V4: ML engineers (people who deploy, fine-tune, eval, and operate models) become the constraint on shipping product. Adjust hiring mix accordingly.

Also: GPU/inference infrastructure experience becomes more valuable. If you previously hired only Python/TypeScript application engineers, the gap to add a senior inference engineer (vLLM, TGI, Ray Serve, GPU orchestration) is now a real product capability, not just an ops capability.

Product roadmap implications

Features that depend on long-context (1M tokens) are now baseline — competitors with V4 can match them quickly. Either ship them now while still differentiating or accept they are commodity. Features that depend on proprietary fine-tunes (specialized domain expertise from your data) are where to invest.

Multi-model routing becomes a product capability worth investing in — customers will want to choose Claude/GPT/V4 based on their own preferences. Building model-agnostic infrastructure inside your product is now a 6-month roadmap item, not a 12-month one.

The hard truth Your competitor's product can replicate your prompts in a week. It cannot replicate your data, your distribution, or your customer trust. Invest there. Stop investing in 'better prompts on the same model'.

FOR YOU

V4's architecture is interesting enough that several open research questions have high publication potential and meaningful safety impact. CSA/HCA attention compression has unstudied adversarial characteristics. mHC connection behavior under perturbation has unstudied robustness properties. 1M-context prompt-injection at scale has limited published work. Three concrete experiments worth running this quarter.

Experiment 1: CSA/HCA adversarial robustness

Hypothesis: Compressed Sparse Attention preserves attention to specific anchor tokens by design; adversarial input that lands in anchor positions has disproportionate influence. Heavily Compressed Attention sees less detail at long range; adversarial content placed in the long-range regions may be easier to smuggle past safety classifiers.

Methodology: construct adversarial-token-placement benchmarks varying (a) attention-sink position, (b) injection depth in 1M context, (c) semantic vs syntactic adversarial content. Compare attack success rates against V4-Flash and a non-CSA baseline (V3.2 or Llama 3.1 405B). Budget: ~$3-5K in GPU time on Modal or RunPod for the eval; ~6 weeks of researcher time.

Publication venues: NeurIPS workshop on AI safety, SaTML, BlackHat AI. Strong fit for first-author publication with clear empirical contribution.

Experiment 2: mHC under adversarial perturbation

Hypothesis: Manifold-Constrained Hyper-Connections strengthen residual signal propagation; small input perturbations that survive mHC propagation may have outsized downstream effects compared to standard residual architectures. This is a robustness question with both safety and reliability implications.

Methodology: standard adversarial-perturbation benchmarks (FGSM, PGD adapted for embedding space) on V4-Flash and a non-mHC baseline. Measure perturbation-amplification ratios across layers. Compare to documented adversarial-robustness results from the AdvLLM-Bench corpus.

Budget: ~$5-10K GPU time, ~8-12 weeks researcher time. Lower publication-risk than experiment 1 (more methodologically conventional) but high impact if results are dramatic in either direction.

Experiment 3: 1M-context prompt-injection at scale

Gap in current literature Published prompt-injection benchmarks (TensorTrust, Pleias, Gandalf) are calibrated at 4K-128K. Almost no published work at 1M context. The empirical gap is a publication opportunity.
Core question How does injection success rate vary with context length and injection depth? Does adversarial content buried at token 950,000 behave differently from token 50,000? At what context length do current defenses degrade?
Datasets to use TensorTrust benchmark adapted for long context; SEP-style benchmarks scaled to 1M; synthetic adversarial corpora with controlled injection depth. Public RAG datasets (FinanceBench, LegalBench) for realistic context content.
Estimated impact High. Defenders need this data to size their security investments. Vendors need it to know which defenses to ship. Publication likely cited by both academic and industry security teams within months.

Available funding and acknowledgements

OpenPhilanthropy, Future of Life Institute, ARIA, and EA Funds all have open grant cycles for AI safety research with budgets sufficient for any of these experiments. Apollo Research and METR have informal collaboration channels for empirical safety work that can yield compute credits without formal grants.

If you intend to publish in this space, register your experiment with the Apollo Research / METR informal coordination network — it reduces duplicate work and increases citation likelihood.

Why now Three experiments above, each with budgets between $3K and $10K, that would each cite this column when published. The publication-impact ratio for AI security empirical work right now is unusually high.

What shipped that matters.

Apr 24

DeepSeek-V4-Pro

1.6T parameters, 49B activated. Frontier-comparable on multiple benchmarks.

model

Apr 24

DeepSeek-V4-Flash

284B parameters, 13B activated. Smaller cousin, same architecture.

model

Apr 24

CSA + HCA attention

Compressed Sparse + Heavily Compressed Attention for 1M-token context efficiency.

architecture

Apr 24

mHC connections

Manifold-Constrained Hyper-Connections strengthen residuals across layers.

architecture

Apr 24

Muon optimizer

Used instead of AdamW. Faster convergence at scale per DeepSeek.

training

Apr 24

32T-token pretrain

Pretrained on 32T tokens. 27% inference FLOPs vs V3.2 at 1M context.

efficiency

The capability ceiling of open-weight models reset on April 24. Understanding why requires walking through the specific architectural choices that made V4 efficient enough to ship at this scale. Each section below cites primary sources; technical accuracy has not been reviewed by an external ML expert and we mark sections explicitly where the paper is silent.

BEFORE

Pre-V4 open-weight landscape

Open weights trailed closed frontier by 6–12 months on capability
128K–256K context windows were the open-weight standard
Adversarial fine-tuning required significant resources
API gatekeeping limited who could deploy frontier capability
Defenses could assume attackers used commodity-tier models

→

AFTER

Post-V4 open-weight landscape

V4-Pro reaches frontier-comparable performance on multiple axes
1M-token context is now an open-weight baseline
Adversarial fine-tuning becomes cheap with cleaner architecture
Anyone with sufficient compute can run frontier capability locally
Defenses cannot assume attackers are bandwidth-constrained at the API

The release is not the threat. The reset of the capability ceiling for open-weight models is. Defenses that relied on "no one has access to frontier capability" need to be re-examined.

DEEP READ 7 sections · cited primary sources · technical review pending

01 The architecture in plain terms

DeepSeek V4-Pro is a sparsely-activated Mixture-of-Experts (MoE) model with 1.6 trillion total parameters and 49 billion activated parameters per forward pass. V4-Flash is the smaller cousin at 284B total / 13B activated. Both share the same architectural family — the difference is scale, not design. Sparse activation means the model has the knowledge capacity of a 1.6T model but the compute cost of running a 49B dense model per token, which is what makes large-MoE models commercially viable.

The three architectural innovations DeepSeek highlights in the V4 paper are (a) a hybrid attention mechanism combining Compressed Sparse Attention (CSA) with Heavily Compressed Attention (HCA), (b) Manifold-Constrained Hyper-Connections (mHC) replacing standard residual connections in the transformer block, and (c) the Muon optimizer replacing AdamW for training stability at scale. The headline efficiency claim — 27% of single-token inference FLOPs and 10% of KV cache size compared to V3.2 at 1M context — is the product of these three choices working together.

PRIMARY SOURCE DeepSeek-V4 technical report (April 24, 2026)

02 The attention mechanism — CSA + HCA

Standard transformer attention computes pairwise interactions between every token in the context. At 1 million tokens, this is computationally infeasible: the attention matrix alone is 10^12 entries. Every long-context LLM uses some form of attention sparsification to make this tractable. V4's specific approach is to split attention into two regimes operating in parallel.

Compressed Sparse Attention (CSA) handles the bulk of token-to-token interactions using structured sparsity. Each token attends to (i) a small set of "anchor" tokens distributed across the sequence and (ii) a local neighborhood window. This is conceptually similar to BigBird and Longformer attention patterns, with the specific anchor selection learned during training. The result is attention computation scaling closer to O(n·sqrt(n)) than the naive O(n²).

Heavily Compressed Attention (HCA) handles long-range information differently. Instead of sparsifying the attention pattern, HCA compresses the KV cache for distant tokens — most likely through some combination of low-rank decomposition and quantization, though the V4 paper does not fully specify the compression scheme. The compressed KV is decoded on demand when needed. The trade-off: small loss of fidelity for long-range information in exchange for ~10x KV cache reduction at 1M tokens.

CSA Structured sparse attention. Anchor tokens + local windows. Roughly O(n·sqrt(n)) scaling. Trades a small amount of attention coverage for compute.
HCA Dense attention to a compressed KV cache. Aggressive memory reduction. Trades long-range fidelity for memory.
Hybrid effect CSA carries the bulk of short-to-mid range interactions, HCA preserves long-range coherence. Neither alone is enough for 1M context; together they make it viable.

CAVEAT The V4 paper describes CSA and HCA at structural level. The exact anchor-selection algorithm, the specific KV compression scheme (low-rank? product quantization? both?), and the precise attention mask details are not fully reproducible from the paper alone. Anyone red-teaming the attention mechanism is working from incomplete information — which is itself a security-relevant fact.

PRIMARY SOURCE V4 paper §3 (Attention Architecture)

03 Manifold-Constrained Hyper-Connections (mHC) — strengthening the residual stack

A standard transformer block has a residual connection: the layer output is added to the layer input (x_{l+1} = x_l + f_l(x_l)). At deep scales, this can cause two problems: (a) gradient flow degrades as signals accumulate noise across hundreds of layers, and (b) the residual stream loses information about the original input through accumulated transformations.

mHC reframes the residual connection as a constrained operation: each layer's output is required to stay within a manifold (a structured subspace) related to the input. The paper describes this as "hyper-connections" because the constraint operates across multiple layers simultaneously, not just adjacent pairs. The practical effect is more stable signal propagation, better gradient flow, and the ability to train deeper models without the usual instability issues at the 1.6T parameter scale.

For security, the relevant property is that signals — including adversarial perturbations — propagate more reliably through mHC-equipped residual stacks. A small input modification that would have been damped by accumulated residual noise in a standard transformer can propagate with full strength through V4's residual stream. This is not a vulnerability per se, but it is an architectural property worth empirically characterizing in adversarial-robustness evaluations.

CAVEAT The V4 paper presents mHC as a generalization of standard hyper-connections research from 2024 but the precise manifold definition (geometric? statistical? both?) is not fully laid out. We are describing the concept and the empirical effect; the mathematical details are best read directly from the paper by someone with deep transformer-architecture background.

PRIMARY SOURCE V4 paper §4 (Residual Stack Design)

04 Training methodology — Muon optimizer and the 32T-token pretrain

V4 was pretrained on more than 32 trillion tokens. For reference: GPT-4 was reportedly trained on ~13T tokens. Llama 3.1 405B was trained on ~15T. V4's pretrain corpus is roughly 2x larger than what publicly-known frontier models used, which is part of how a 49B-activated model reaches frontier-comparable performance.

The Muon optimizer is the unusual choice. Muon (originally proposed by Jordan, Bernstein and others in late 2024) uses a Newton-Schulz iteration to orthogonalize gradient matrices, providing better-conditioned updates than AdamW. The DeepSeek team credits Muon for faster convergence at the 1.6T scale where AdamW becomes less stable — specifically, Muon reportedly improves wall-clock training time by an unspecified but meaningful percentage relative to AdamW. The trade-off is that Muon has more hyperparameters and less mature tooling than AdamW, so anyone trying to reproduce the training would need to do significant optimizer tuning.

What the paper does not describe in detail: the data mixture (how much code vs. text vs. math? how much multilingual content? what fraction is synthetic? what filtering was applied?), the safety training pipeline (RLHF? Constitutional AI? Reward modeling?), and whether adversarial examples were included during training. These omissions are typical of frontier-model technical reports but matter for security analysis.

Token count 32+ trillion tokens of pretraining. Roughly 2x the publicly-known GPT-4 / Llama 3.1 405B pretrain corpus.
Optimizer Muon, not AdamW. Newton-Schulz orthogonalization of gradient matrices. Faster convergence at scale per the paper, but more sensitive to hyperparameter tuning.
Not in paper Data mixture composition, safety training methodology, adversarial-training inclusion. All security-relevant; all undocumented publicly.

PRIMARY SOURCE V4 paper §5 (Training Methodology)

05 Where the efficiency gains come from

DeepSeek claims V4-Pro requires 27% of single-token inference FLOPs and 10% of KV cache at 1M context relative to V3.2. Decomposing the claim:

~3x CSA replacing dense attention. Sparsifying the attention pattern cuts compute by roughly the sparsity ratio (anchor tokens are ~sqrt(n) of the full sequence).
~10x HCA KV cache compression. Aggressive reduction of memory required for long-range information.
~1.5x MoE routing improvements over V3.2. Better expert selection means fewer "wasted" activations on out-of-distribution tokens.
Plus Engineering-level inference-stack optimizations — kernel fusion, KV-cache layout, etc. — that contribute to overall throughput but are independent of the architecture.

Multiplying these gives a rough match to the headline 27% / 10% claims, though the exact decomposition is not reproducible from the paper alone. The interesting observation: the architectural gains are mostly in MEMORY (KV cache compression at 10x), not compute. This makes V4 deployable on smaller GPU configurations than V3.2 at the same context length, which is part of why V4 is realistic for self-hosting where V3.2 was not for most users.

CAVEAT The 27% / 10% claims are reported by DeepSeek and have not been independently reproduced as of May 16, 2026. Independent benchmark results from third parties (Together AI, Modal, Fireworks) typically appear 4-8 weeks after a major open-weight release; watch for those.

PRIMARY SOURCE V4 paper §6 (Efficiency Evaluation)

06 What is NOT in the paper — the security-relevant omissions

For a publication focused on AI security, the more important content is often what a model release does not say. V4 omits several details that downstream defenders need:

Safety training pipeline The paper says safety training was performed but provides no methodology. Was RLHF used? Constitutional AI? Reward modeling? Any combination? Defenders trying to characterize V4's refusal boundaries are working blind.
Data mixture 32T tokens — but the composition (code/text/math ratio, multilingual mix, synthetic data fraction, safety filtering applied) is not specified. This affects everything from what the model knows to what jailbreaks work.
Adversarial training Whether adversarial examples were included during pretraining or SFT is not described. This affects how robust V4 is to existing prompt-injection techniques.
Red-team findings pre-release No public red-team report. The paper does not document what failure modes DeepSeek discovered before release or how they were addressed.
Eval scores on adversarial benchmarks Academic benchmarks (MMLU, MATH, HumanEval, etc.) are reported. Adversarial-robustness benchmarks (TensorTrust, AdvBench, JailbreakBench) are not. The omission is itself a signal.

Closed-frontier vendors (Anthropic, OpenAI, Google) also omit much of this — but they retain operational control through API gatekeeping. V4 is open-weight, so the security-relevant gaps in documentation matter more: defenders deploying V4 internally cannot rely on Anthropic-style enforcement at the inference boundary, and have to characterize the model's safety properties themselves.

PRIMARY SOURCE V4 paper, our reading of what is absent

07 How V4 compares to other 2025–2026 frontier models

V4-Pro vs DeepSeek V3.2: same architectural family. V4 adds 1M context, the CSA+HCA hybrid, mHC connections, and the Muon optimizer. V3.2 had 128K context and standard attention. The architectural deltas are deliberate scaling iterations, not a redesign.

V4-Pro vs Llama 3.1 405B: V4 is ~4x larger in total parameters (1.6T vs 405B) but only ~13% larger in activated parameters per forward pass (49B vs 43B). The architectural choice reflects different scaling philosophies — Meta scales dense, DeepSeek scales sparse. The compute-cost-per-token at inference is roughly comparable; the knowledge capacity of V4 is much larger.

V4-Pro vs Claude Opus 4.6 / GPT-5: head-to-head benchmark scores in the V4 paper put V4-Pro within striking distance of closed-frontier on most reasoning tasks. The closed-frontier vendors do not publish their architectures or full eval methodology, so direct architectural comparison is impossible. Capability-wise, V4 is the first open-weight model where the gap is "comparable, occasionally trades wins" rather than "clearly behind."

V4-Pro vs Gemini 2.5 Pro: both support 1M+ context. Gemini's attention mechanism is closed and probably uses a different architectural family (Google has historically used Pathways-style routing). Direct comparison is benchmark-level only.

CAVEAT Benchmark comparisons across model families are notoriously fraught. Different vendors report benchmark scores with different methodologies (eval prompt formats, sampling parameters, multi-turn vs single-turn, etc.). Treat all cross-family benchmark deltas as approximate.

PRIMARY SOURCE Composite reading: V4 paper + Morph guide + The Salt analysis

Six places where the V4 release changes the AI-security threat model. The release itself is not a vulnerability; the broader landscape it reshapes is.

01 HIGH

1M-token context multiplies prompt-injection surface area

A million-token context window means a million tokens of attack surface for prompt injection, indirect injection through documents, and adversarial in-context demonstrations. Existing prompt-injection defenses were calibrated on 128K-token windows. Many do not scale to 1M cleanly — the search space for injection payloads grows linearly while detection rules tend to be set-based.

DO Audit your prompt-injection defenses for behavior at long contexts. If they were calibrated below 128K, run them at 1M against adversarial test sets before deploying long-context features.
02 HIGH

Adversarial fine-tunes become much cheaper to produce

V4 ships open weights. Anyone with the GPUs can fine-tune it on adversarial data — jailbreak-only models, harmful-task specialists, persona-stripped variants. Anthropic and OpenAI maintain refusal training as a moat. Open weights mean that moat does not apply to V4-derived models. Cost-of-attack drops dramatically when frontier capability is locally tunable.

DO Update threat models that assumed 'attackers cannot access uncensored frontier capability.' For products where output safety matters (content platforms, code generation in regulated industries, customer-facing assistants), assume your input pipeline could be facing a V4-derivative adversarial model.
03 MEDIUM

CSA/HCA compression has unstudied attack characteristics

Compressed Sparse Attention and Heavily Compressed Attention dramatically cut FLOPs and KV cache at 1M context. The security literature has not yet characterized what these compression mechanisms do to attention-based attacks — adversarial token placement, attention sinks, position-dependent injection. The architecture is new enough that nobody has run the standard adversarial-evaluation playbook against it.

DO If you deploy V4 internally, do not assume attention-based defenses ported from V3 or other architectures still work. Run a fresh evaluation. The compression is the variable.
04 MEDIUM

mHC connections strengthen adversarial propagation too

Manifold-Constrained Hyper-Connections strengthen residual signal propagation across layers — a feature for stability and expressivity. But residual connections also propagate adversarial perturbations. A small input perturbation that survives mHC propagation can have outsized downstream effects. This is not a vulnerability per se; it is an architectural property worth empirically characterizing.

DO No immediate action for deployers, but watch for empirical work on adversarial robustness of mHC architectures over the next quarter. The standard adversarial training results may not transfer.
05 HIGH

API gatekeeping no longer constrains adversarial use

A meaningful slice of AI security architecture relies on the assumption that frontier capability is gated behind APIs with rate limits, abuse detection, and policy enforcement. Open-weight V4 removes that assumption. Defenders cannot rely on "OpenAI / Anthropic / Google would catch this at the API" for attacks that use V4-class models locally.

DO Audit which of your defenses implicitly assume API-side enforcement (abuse detection, output moderation, content filtering). For high-stakes use cases, build the same controls into your own pipeline rather than depending on upstream vendors.
06 MEDIUM

Distillation/derivative models will follow V4 quickly

Open weights enable derivative work: distilled models, quantized variants, regional fine-tunes, specialized vertical models. Each derivative carries V4-architecture properties forward into smaller, more accessible packages. The threat-model implications above propagate to derivatives — a quantized 70B distillation of V4 with most of the capability is a near-certain deliverable within 60 days.

DO Monitor HuggingFace, GitHub, and Kaggle for V4 derivatives. Each major derivative changes the deployment cost-of-attack downward. The threat model needs continuous adjustment, not a one-time evaluation.

Three concrete actions this week.

1

Re-evaluate prompt-injection defenses at 1M-token scale

If your defenses were tuned for shorter contexts, run them at the new scale before shipping long-context features. The attack-vs-defense ratio likely shifted.
2

Update threat models that assumed API-only frontier access

Document where your security posture assumes 'attackers cannot access uncensored frontier models.' That assumption no longer holds. Where it matters (content moderation, content authenticity, code review), build redundant defenses.
3

Watch derivative-model space for the next 60 days

V4 distillations, quantizations, and adversarial fine-tunes will land on HuggingFace within weeks. Each lowers the cost-of-attack further. Update your threat-model assumptions monthly, not annually.

Signals in the next 60 days that matter.

First adversarial fine-tune of V4 published publicly

Will land on HuggingFace, possibly with a "research only" label, within 30 days. When it does, the threat-model conversation accelerates hard.

Independent safety evals from Apollo, METR, MIT

V4 is fresh enough that none of the standard safety eval suites have public results yet. The first ones published will reset the conversation about which open-weight models are safe to deploy in regulated contexts.

US export-control posture on V4 weights

V4 was released by a Chinese lab. The Biden-era export-control framework predates this generation. Whether US policy treats V4 weights, distillations, or fine-tunes as controlled tech is an open question that will shape enterprise adoption.

Where V4 reaches you first

Workflow eval checklist this week

Two specific things to add to your dev workflow

Libraries to add to your stack

Code patterns for long-context defense

Configuration changes when deploying V4

The thesis

Ticker-by-ticker read

Timing windows

What invalidates this thesis

Defensibility framework rewrite

Specific decisions to make this quarter

Hiring and talent strategy shifts

Product roadmap implications

Experiment 1: CSA/HCA adversarial robustness

Experiment 2: mHC under adversarial perturbation

Experiment 3: 1M-context prompt-injection at scale

Available funding and acknowledgements

DeepSeek-V4-Pro

DeepSeek-V4-Flash

CSA + HCA attention

mHC connections

Muon optimizer

32T-token pretrain

01 The architecture in plain terms

02 The attention mechanism — CSA + HCA

03 Manifold-Constrained Hyper-Connections (mHC) — strengthening the residual stack

04 Training methodology — Muon optimizer and the 32T-token pretrain

05 Where the efficiency gains come from

06 What is NOT in the paper — the security-relevant omissions

07 How V4 compares to other 2025–2026 frontier models

1M-token context multiplies prompt-injection surface area

Adversarial fine-tunes become much cheaper to produce

CSA/HCA compression has unstudied attack characteristics

mHC connections strengthen adversarial propagation too

API gatekeeping no longer constrains adversarial use

Distillation/derivative models will follow V4 quickly

Re-evaluate prompt-injection defenses at 1M-token scale

Update threat models that assumed API-only frontier access

Watch derivative-model space for the next 60 days

First adversarial fine-tune of V4 published publicly

Independent safety evals from Apollo, METR, MIT

US export-control posture on V4 weights