Monthly AI model leaderboard. LMArena Elo, SWE-bench, MMLU-Pro, GPQA, Aider, ARC-AGI across GPT-5, Claude 4, Gemini 2.5, Grok 4, Llama 4, Qwen 3, DeepSeek R1.
Polymarket event: "Which company has best AI model end of June?" — 10 live markets, $2,201,415 30-day volume.
159 sources · 5 perspectives
The Take
01
Anthropic holds #1 LMArena position through June 30, 2026 Polymarket settlement
By June 30, 2026
Market
62%
Our Call
63%
Δ
+1
Why we agree
Short 8-week settlement window and incumbent position favor Anthropic despite GPT-5.5's agentic benchmark strength; market adjustment seems appropriately calibrated to transition risk rather than overcorrected.
What changes our mind
GPT-5.5's still-accumulating Arena Elo stabilizing at or above Opus 4.7's ~1504 before the June 30, 2026 12pm ET settlement ($5.09M volume); the LMArena snapshot at that moment is the settlement oracle.
All forecasts
02
GPT-5.5 reaches top-2 LMArena Elo within 30 days
By June 1, 2026
Market
55%
Our Call
52%
Δ
-3
Why we disagree
GPT-5.5's agentic superiority and strong benchmark performance suggest rapid Arena vote accumulation. The model launched at #1 on several metrics; Arena convergence is likely but not certain.
Why our call differs from market
Model launching at #1 on multiple benchmarks makes top-2 Arena placement likely, but 30-day specificity and Arena voting volatility create downside — market slightly underpriced.
03
DeepSeek V4-Pro community fine-tune reaches 85%+ SWE-bench Verified by July 2026
By July 31, 2026
Market
35%
Our Call
52%
Δ
+17
Why we disagree
MIT license enables rapid SFT iteration from 80.6% base. Historically, community fine-tunes have closed ~3-5 point gaps within 60-90 days on open models. The 85% threshold is achievable but not certain.
Why our call differs from market
Historical 3-5 point gap closure within 60-90 days directly supports closing the 4.4-point gap in this window; the market underweights that precedent but rightly hedges on the execution risk of a community fine-tune materializing and sustaining the Verified benchmark threshold (a back-of-envelope sketch follows this card).
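One back-of-envelope way to turn that precedent into our 52% call. The distributional choices below are ours and purely illustrative: points closed in the ~98-day window (April 24 release to July 31 deadline) are treated as normal with mean 5 and sd 1.5, at the top of the historical 3-5 point range because the window runs longer than 60-90 days, with an assumed 80% chance a serious community SFT effort materializes at all.

```python
from statistics import NormalDist

# Points a community fine-tune closes in the window: N(5, 1.5) (assumed).
gap_dist = NormalDist(mu=5.0, sigma=1.5)
need = 85.0 - 80.6                 # 4.4 points above the V4-Pro base score
p_close = 1 - gap_dist.cdf(need)   # P(enough points closed | effort exists)
p_effort = 0.80                    # P(a serious community SFT effort exists)

print(f"P(85%+ SWE-bench Verified by July 31) ~= {p_effort * p_close:.0%}")
# -> ~52%, in line with our call above the 35% market price
```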
04
Anthropic releases a patch or follow-on model addressing Opus 4.7 production behavior before June 30
By June 30, 2026
Market
70%
Our Call
58%
Δ
-12
Why we disagree
Anthropic has historically responded quickly to production behavior complaints. The backlash severity ('legendarily bad') creates commercial pressure for a rapid response.
Why our call differs from market
Eight weeks is tight for a meaningful model release even with commercial pressure; while Anthropic responds quickly to issues, the broad claim ('patch OR follow-on') inflates market odds above typical deployment velocity.
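Across the four cards above, the Δ column is the tradable edge. Below is a minimal sketch of how such an edge converts into position sizing on a binary market via the Kelly criterion; the helper is ours, not the site's methodology, and real sizing would haircut for fees, slippage, and model error:

```python
def kelly_fraction(p_model: float, price: float) -> float:
    """Kelly-optimal bankroll fraction for a binary prediction-market share.

    p_model: our probability that the market resolves YES.
    price:   current YES share price (the market-implied probability).
    Returns a signed fraction: positive buys YES, negative buys NO.
    """
    if p_model >= price:
        return (p_model - price) / (1.0 - price)   # f* = (p - c) / (1 - c)
    return -(price - p_model) / price              # mirror case: buy NO

# The four calls above as (market price, our call):
calls = {
    "01 Anthropic holds #1 through June 30":   (0.62, 0.63),
    "02 GPT-5.5 top-2 Arena within 30 days":   (0.55, 0.52),
    "03 DeepSeek fine-tune 85%+ SWE-bench":    (0.35, 0.52),
    "04 Anthropic patch/follow-on by June 30": (0.70, 0.58),
}
for name, (price, p) in calls.items():
    f = kelly_fraction(p, price)
    side = "YES" if f >= 0 else "NO"
    print(f"{name}: edge {p - price:+.2f} -> Kelly {abs(f):.1%} on {side}")
```

The asymmetry is visible immediately: the +17 edge on forecast 03 implies a position roughly an order of magnitude larger than the +1 edge on forecast 01.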
As of May 2, 2026, the AI model leaderboard sits in its most competitive configuration yet, with the decisive near-term question now being whether GPT-5.5's agentic superiority can overcome Claude Opus 4.7's Arena lead before the June 30 Polymarket settlement.
GPT-5.5 ('Spud'), released April 23–24, 2026, posted 88.7% SWE-bench Verified, a slim but clear edge over Claude Opus 4.7's 87.6% on the standard benchmark, and more tellingly leads Terminal-Bench 2.0 by 13.3 points (82.7% vs 69.4%) and ARC-AGI-2 by 9.2 points (85.0% vs 75.8%), signaling a structural GPT-5.5 advantage on multi-step agentic tasks.
GPT-5.5's Arena Elo remains in active vote accumulation as of this writing and is the key unresolved data point.
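Why does a rating "accumulate" at all? Arena's production pipeline fits a Bradley-Terry-style model (with style controls) over pairwise votes; the toy below is a deliberately simpler online-Elo sketch, with K-factor, vote counts, and ratings as our illustrative assumptions, showing why a debutant's rating needs thousands of votes before it says anything stable:

```python
import random

def expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def simulate_debut(true_skill: float = 1500.0, field: float = 1480.0,
                   k: float = 4.0, votes: int = 5000, seed: int = 0) -> float:
    """A newly launched model enters at a provisional rating and drifts
    toward its true level as pairwise human votes accumulate."""
    rng = random.Random(seed)
    rating = 1400.0                          # provisional debut rating
    p_win = expected(true_skill, field)      # ground-truth win probability
    tail = []
    for i in range(votes):
        score = 1.0 if rng.random() < p_win else 0.0
        rating += k * (score - expected(rating, field))
        if i >= votes - 1000:                # average the settled tail
            tail.append(rating)
    return sum(tail) / len(tail)

print(round(simulate_debut()))  # lands near 1500, within ordinary Elo noise
```

With a ~20-point frontier spread and per-vote noise of several points, early readings are dominated by variance, which is why the first stable GPT-5.5 number is expected only 2–3 weeks after launch.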
Despite GPT-5.5's agentic edge, Claude Opus 4.7 retains the headline Arena position: it debuted at ~1504 Elo within 24 hours of its April 16 launch (+37 over Opus 4.6), resolving the April Polymarket market at 100% Anthropic after $21.1M in trading.
The June market ($5.09M volume) currently prices Anthropic at 70%, Google at 22%, with all other labs combined below 10%.
However, style-control-adjusted markets compress Anthropic's lead meaningfully, suggesting the raw Arena advantage may partly reflect presentation preferences rather than pure capability — a structural caveat for any Arena-denominated verdict.
Claude Opus 4.7 is simultaneously facing documented developer backlash over 'legendarily bad' behavior in production use, arguing with users and delivering degraded code quality relative to Opus 4.6, creating a sharp gap between benchmark ranking and real-world developer trust.
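The style-control caveat is mechanical rather than rhetorical: refit the vote model with style covariates (e.g. response length) and any presentation effect migrates out of the skill term. A minimal synthetic sketch; the logistic form, the coefficients, and the length feature are all our illustrative assumptions, not LMArena's actual pipeline:

```python
import math, random

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

random.seed(0)
# Synthetic head-to-head votes: model A beats model B partly on skill and
# partly because its answers run longer (a style feature).
data = []
for _ in range(5000):
    len_gap = random.gauss(0.5, 1.0)        # A writes longer on average
    p = sigmoid(0.10 + 0.30 * len_gap)      # true skill 0.10, style 0.30
    data.append((len_gap, 1.0 if random.random() < p else 0.0))

# Fit win ~ sigmoid(skill + style * len_gap) by batch gradient ascent.
skill = style = 0.0
lr = 1.0
for _ in range(300):
    g_skill = g_style = 0.0
    for x, won in data:
        err = won - sigmoid(skill + style * x)
        g_skill += err
        g_style += err * x
    skill += lr * g_skill / len(data)
    style += lr * g_style / len(data)

raw_win = sum(w for _, w in data) / len(data)
print(f"raw win rate: {raw_win:.3f}")              # inflated by style
print(f"style-adjusted skill logit: {skill:.3f}")  # close to the true 0.10
```

In this toy the raw head-to-head win rate looks decisive while the style-adjusted skill gap is modest, which is the shape of the 70% raw vs 36–55% adjusted divergence discussed below.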
DeepSeek V4-Pro, released April 24, 2026 under MIT license, represents the most important structural development below the frontier: at ~80.6% SWE-bench Verified and approximately $0.30/M input tokens (versus $5/M for GPT-5.5 and Opus 4.7, a 17x cost gap), it establishes open-weights as a credible tier-2 alternative.
Its 1M-token context window and #1 open-source Arena Elo (~1445) make it a realistic enterprise fallback if the closed-source gap continues compressing.
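The cost arithmetic survives failure rates: dividing cost per attempt by pass rate gives cost per successful task. A sketch with prices and pass rates from this brief; the 500k-token per-task budget is a hypothetical round number:

```python
# Cost per *successful* SWE-bench-style task = cost per attempt / pass rate.
TOKENS_PER_TASK = 500_000      # hypothetical agent-context budget per task

models = {
    # name: (input price $/M tokens, SWE-bench Verified pass rate)
    "DeepSeek V4-Pro": (0.30, 0.806),
    "Claude Opus 4.7": (5.00, 0.876),
    "GPT-5.5":         (5.00, 0.887),
}
for name, (price, pass_rate) in models.items():
    per_attempt = price * TOKENS_PER_TASK / 1_000_000
    print(f"{name}: ${per_attempt:.2f}/attempt, "
          f"${per_attempt / pass_rate:.2f}/success")
```

Even after charging DeepSeek for its extra retries, the per-success gap stays near 15x, which is the arithmetic behind the enterprise-fallback claim.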
The top-6 Arena Elo spread has compressed to approximately 20 points, confirming that task-fit now dominates raw model rank as a selection criterion — a materially different procurement environment than 12 months ago.
GPT-5.5 released April 23–24, 2026 with 88.7% SWE-bench Verified, leading Claude Opus 4.7 (87.6%) on the standard coding benchmark and holding a 13.3-point lead on Terminal-Bench 2.0 (82.7% vs 69.4%) — its Arena Elo is still accumulating as of May 2
Claude Opus 4.7 holds #1 LMArena Text Arena at ~1504 Elo, resolving the April Polymarket market at 100% Anthropic after $21.1M in volume; the June market ($5.09M) prices Anthropic at 70% with settlement on June 30
Developer backlash against Claude Opus 4.7 within 24 hours of launch — described as 'legendarily bad' by practitioners for arguing with users and degraded coding vs Opus 4.6, creating a material gap between benchmark rank and production trust
DeepSeek V4-Pro dropped April 24, 2026 under MIT license at ~$0.30/M input tokens (17x cheaper than Opus 4.7 and GPT-5.5), scoring ~80.6% SWE-bench Verified and establishing itself as the leading open-weight challenger
Grok 4.x benchmark integrity compromised: xAI self-reports 72–75% SWE-bench Verified; independent vals.ai testing using SWE-agent scaffold shows only 58.6% — a ~15-point discrepancy that undermines Grok's competitive positioning
Value | Metric | Change | Date | Source
~1504 | Claude Opus 4.7 LMArena Elo | +37 vs Opus 4.6 | 2026-04-16 | LMArena via secondary aggregators + CNBC confirmation
~1493–1500 | Gemini 3.1 Pro Preview LMArena Elo | n/a | 2026-05-02 | LMArena via secondary aggregators
88.7% | GPT-5.5 SWE-bench Verified | n/a | 2026-04-23 | Multiple benchmark aggregators
87.6% | Claude Opus 4.7 SWE-bench Verified | +6.8 pts vs Opus 4.6 (80.8%) | 2026-04-16 | Multiple benchmark aggregators
82.7% | GPT-5.5 Terminal-Bench 2.0 | +13.3 pts vs Claude Opus 4.7 | 2026-04-23 | BenchLM.ai / morphl.ai
69.4% | Claude Opus 4.7 Terminal-Bench 2.0 | n/a | 2026-04-16 | BenchLM.ai / morphl.ai
85.0% | GPT-5.5 ARC-AGI-2 | +9.2 pts vs Claude Opus 4.7 | 2026-04-23 | BenchLM.ai
75.8% | Claude Opus 4.7 ARC-AGI-2 | n/a | 2026-04-16 | BenchLM.ai
Scenarios
Claude Opus 4.7 holds #1 Arena Elo through June 30; GPT-5.5 reaches top-2 but not #1; Anthropic wins June Polymarket at 65–70%
GPT-5.5 Arena Elo stabilizes 10–20 points below Claude Opus 4.7
Anthropic patches Opus 4.7 behavior, partially recovering developer trust
Anthropic June Polymarket resolves at ~65–75%; no major shift in competitive narrative; DeepSeek remains tier-2
Claude Opus 4.7 extends Arena lead; GPT-5.5 vote accumulation underperforms; 'Claude Mythos Preview' officially released
GPT-5.5 Arena Elo stabilizes below 1490 despite benchmark performance
Anthropic officially releases Mythos-class model before June 30
June Polymarket resolves >85% Anthropic; Anthropic narrative dominates the H1 2026 AI cycle; DeepSeek remains price-disruption story only
GPT-5.5 Arena Elo accumulation surpasses Claude Opus 4.7; style-control concerns erode Anthropic's raw lead; June Polymarket reprices to 45–55% Anthropic
GPT-5.5 Arena Elo reaches 1495+ within 30 days, trading #1 with Claude
Opus 4.7 behavior complaints persist without an Anthropic patch, and developer adoption stalls
June Polymarket shifts from 70/22 toward 50/30 Anthropic/Google; narrative shifts to 'OpenAI recaptured AI crown'; enterprise procurement shifts toward GPT-5.5 for agentic workflows
June 30, 2026 12pm ET — Polymarket 'best AI model end of June' market resolution ($5.09M volume); LMArena snapshot at that moment is the settlement oracle.
GPT-5.5 Arena Elo accumulation — model launched April 23-24 and was still in 'active voting accumulation' as of May 2; first stable Elo reading expected within 2–3 weeks.
Anthropic 'Claude Mythos Preview' — cited at 93.9% SWE-bench Verified but unverified; any official announcement would immediately reprice the June Polymarket odds.
DeepSeek V4-Pro open-weight fine-tune results — MIT license enables community SFT on top of 80.6% SWE-bench base; derivative models closing the gap to 85%+ would pressure closed-source pricing.
May 2026 Polymarket market (currently 81.5% Anthropic) resolution — tracks whether Claude holds #1 through end of May before the June market becomes the focus.
Google Gemini 3.1 Pro week-over-week Elo stability — currently #2–3 (~1493–1500 Elo) but not yet verified for consistency; a confirmed move above 1500 would turn the June race, currently priced 70%/22% Anthropic/Google on Polymarket, into a genuine two-horse contest.
Claude Opus 4.7 community backlash within 24 hours of launch — developers reported 'legendarily bad' behavior, arguing with users and degraded code quality vs Opus 4.6, creating adoption risk despite benchmark leadership.
GPT-5.5 leads Terminal-Bench 2.0 by 13.3 points (82.7% vs 69.4%) and ARC-AGI-2 by 9.2 points (85.0% vs 75.8%), indicating Claude's overall Arena lead may not translate to agentic or reasoning-heavy workloads.
Grok 4.x self-reported SWE-bench scores (72–75%) diverge ~15 points from independent vals.ai testing (58.6%), raising benchmark integrity concerns that could affect the broader leaderboard ecosystem.
DeepSeek V4-Pro at $0.30/M input tokens vs $5/M for Opus 4.7 creates a 17x cost arbitrage; if it closes the remaining ~7-point SWE-bench gap, enterprise switching pressure intensifies rapidly.
'Claude Mythos Preview' cited at 93.9% SWE-bench but flagged as provisional/unverified — if this is a real Anthropic pipeline release, it could reshape standings before June 30; if fabricated, it signals growing benchmark noise.
Style-control-adjusted Polymarket market shows Anthropic's June lead compresses from 70% to 36–55%, suggesting raw Arena voting may carry a presentation/style bias inflating Claude's apparent dominance.
Anthropic holds the strongest near-term position (70% June Polymarket, 81.5% May, April resolved 100%) but the 20-point Elo compression at the frontier means a single strong model release from Google or OpenAI can flip the leaderboard within one vote batch cycle.
DeepSeek V4-Pro's 1/17th cost ratio with ~80.6% SWE-bench performance establishes a cost-efficiency floor that forces closed-source vendors to justify premium pricing through capability gaps that are measurably narrowing.
Task-specific model selection is now more valuable than tracking overall Elo rank: Claude leads coding (1549 Elo), GPT-5.5 leads terminal/agentic tasks (Terminal-Bench +13.3 pts, ARC-AGI-2 +9.2 pts) — enterprise buyers should weight by workload mix, not headline rank (a minimal weighting sketch follows this list).
The style-control divergence on Polymarket (raw 70% vs adjusted 36–55% for Anthropic) suggests Arena human preference scores embed a presentation bias; benchmark-grounded metrics (SWE-bench Pro, Terminal-Bench) are more reliable signals for production deployment decisions than Chatbot Arena Elo alone.
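A minimal sketch of that workload-mix weighting; the scores are from this brief, while the task-to-benchmark mapping and the two example mixes are our hypothetical choices:

```python
# Weight each benchmark by the share of your traffic it proxies,
# instead of ranking on a single headline Elo.
scores = {
    # benchmark proxies: coding = SWE-bench Verified,
    # agentic = Terminal-Bench 2.0, reasoning = ARC-AGI-2
    "GPT-5.5":         {"coding": 88.7, "agentic": 82.7, "reasoning": 85.0},
    "Claude Opus 4.7": {"coding": 87.6, "agentic": 69.4, "reasoning": 75.8},
}

def weighted_score(model: str, mix: dict[str, float]) -> float:
    return sum(scores[model][task] * w for task, w in mix.items())

mixes = {
    "agent-heavy": {"coding": 0.3, "agentic": 0.5, "reasoning": 0.2},
    "code-heavy":  {"coding": 0.8, "agentic": 0.1, "reasoning": 0.1},
}
for mix_name, mix in mixes.items():
    for model in scores:
        print(f"{mix_name:11s} {model}: {weighted_score(model, mix):.1f}")
```

Under the agent-heavy mix the gap is nearly nine points in GPT-5.5's favor; under the code-heavy mix it shrinks to about three, which is the whole argument for weighting by workload rather than by headline rank.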