Monthly AI model leaderboard. LMArena Elo, SWE-bench, MMLU-Pro, GPQA, Aider, ARC-AGI across GPT-5, Claude 4, Gemini 2.5, Grok 4, Llama 4, Qwen 3, DeepSeek R1.
Polymarket event: "Which company has best AI model end of June?" — 10 live markets, $2,201,415 30-day volume.
159 sources · 5 perspectives
The Take
01
Anthropic holds #1 LMArena position through June 30, 2026 Polymarket settlement
By June 30, 2026
Market
62%
Our Call
63%
Δ
+1
Why we agree
Short 8-week settlement window and incumbent position favor Anthropic despite GPT-5.5's agentic benchmark strength; market adjustment seems appropriately calibrated to transition risk rather than overcorrected.
What changes our mind
GPT-5.5's still-accumulating Arena Elo stabilizing at or above Opus 4.7's ~1504 before the June 30, 2026 12pm ET settlement ($5.09M volume); the LMArena snapshot at that moment is the settlement oracle.
All forecasts
02
GPT-5.5 reaches top-2 LMArena Elo within 30 days
By June 1, 2026
Market
55%
Our Call
52%
Δ
-3
Why we disagree
GPT-5.5's agentic superiority and strong benchmark performance suggest rapid Arena vote accumulation. The model launched at #1 on several metrics; Arena convergence is likely but not certain.
Why our call differs from market
Model launching at #1 on multiple benchmarks makes top-2 Arena placement likely, but 30-day specificity and Arena voting volatility create downside — market slightly underpriced.
03
DeepSeek V4-Pro community fine-tune reaches 85%+ SWE-bench Verified by July 2026
By July 31, 2026
Market
35%
Our Call
52%
Δ
+17
Why we disagree
MIT license enables rapid SFT iteration from 80.6% base. Historically, community fine-tunes have closed ~3-5 point gaps within 60-90 days on open models. The 85% threshold is achievable but not certain.
Why our call differs from market
Historical 3-5 point gap closure within 60-90 days directly supports closing the 4.4-point gap in this window; the market underweights that precedent but rightly hedges on the execution risk of a community fine-tune materializing and sustaining the Verified benchmark threshold (a back-of-envelope sketch follows this card).
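One back-of-envelope way to turn that precedent into our 52% call. The distributional choices below are ours and purely illustrative: points closed in the ~98-day window (April 24 release to July 31 deadline) are treated as normal with mean 5 and sd 1.5, at the top of the historical 3-5 point range because the window runs longer than 60-90 days, with an assumed 80% chance a serious community SFT effort materializes at all.

```python
from statistics import NormalDist

# Points a community fine-tune closes in the window: N(5, 1.5) (assumed).
gap_dist = NormalDist(mu=5.0, sigma=1.5)
need = 85.0 - 80.6                 # 4.4 points above the V4-Pro base score
p_close = 1 - gap_dist.cdf(need)   # P(enough points closed | effort exists)
p_effort = 0.80                    # P(a serious community SFT effort exists)

print(f"P(85%+ SWE-bench Verified by July 31) ~= {p_effort * p_close:.0%}")
# -> ~52%, in line with our call above the 35% market price
```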
04
Anthropic releases a patch or follow-on model addressing Opus 4.7 production behavior before June 30
By June 30, 2026
Market
70%
Our Call
58%
Δ
-12
Why we disagree
Anthropic has historically responded quickly to production behavior complaints. The backlash severity ('legendarily bad') creates commercial pressure for a rapid response.
Why our call differs from market
Eight weeks is tight for a meaningful model release even with commercial pressure; while Anthropic responds quickly to issues, the broad claim ('patch OR follow-on') inflates market odds above typical deployment velocity.
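Across the four cards above, the Δ column is the tradable edge. Below is a minimal sketch of how such an edge converts into position sizing on a binary market via the Kelly criterion; the helper is ours, not the site's methodology, and real sizing would haircut for fees, slippage, and model error:

```python
def kelly_fraction(p_model: float, price: float) -> float:
    """Kelly-optimal bankroll fraction for a binary prediction-market share.

    p_model: our probability that the market resolves YES.
    price:   current YES share price (the market-implied probability).
    Returns a signed fraction: positive buys YES, negative buys NO.
    """
    if p_model >= price:
        return (p_model - price) / (1.0 - price)   # f* = (p - c) / (1 - c)
    return -(price - p_model) / price              # mirror case: buy NO

# The four calls above as (market price, our call):
calls = {
    "01 Anthropic holds #1 through June 30":   (0.62, 0.63),
    "02 GPT-5.5 top-2 Arena within 30 days":   (0.55, 0.52),
    "03 DeepSeek fine-tune 85%+ SWE-bench":    (0.35, 0.52),
    "04 Anthropic patch/follow-on by June 30": (0.70, 0.58),
}
for name, (price, p) in calls.items():
    f = kelly_fraction(p, price)
    side = "YES" if f >= 0 else "NO"
    print(f"{name}: edge {p - price:+.2f} -> Kelly {abs(f):.1%} on {side}")
```

The asymmetry is visible immediately: the +17 edge on forecast 03 implies a position roughly an order of magnitude larger than the +1 edge on forecast 01.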
As of May 2, 2026, the AI model leaderboard sits in its most competitive configuration yet, with the decisive near-term question now being whether GPT-5.5's agentic superiority can overcome Claude Opus 4.7's Arena lead before the June 30 Polymarket settlement.
GPT-5.5 ('Spud'), released April 23–24, 2026, posted 88.7% SWE-bench Verified, a slim but clear edge over Claude Opus 4.7's 87.6% on the standard benchmark, and more tellingly leads Terminal-Bench 2.0 by 13.3 points (82.7% vs 69.4%) and ARC-AGI-2 by 9.2 points (85.0% vs 75.8%), signaling a structural GPT-5.5 advantage on multi-step agentic tasks.
GPT-5.5's Arena Elo remains in active vote accumulation as of this writing and is the key unresolved data point.
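Why does a rating "accumulate" at all? Arena's production pipeline fits a Bradley-Terry-style model (with style controls) over pairwise votes; the toy below is a deliberately simpler online-Elo sketch, with K-factor, vote counts, and ratings as our illustrative assumptions, showing why a debutant's rating needs thousands of votes before it says anything stable:

```python
import random

def expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def simulate_debut(true_skill: float = 1500.0, field: float = 1480.0,
                   k: float = 4.0, votes: int = 5000, seed: int = 0) -> float:
    """A newly launched model enters at a provisional rating and drifts
    toward its true level as pairwise human votes accumulate."""
    rng = random.Random(seed)
    rating = 1400.0                          # provisional debut rating
    p_win = expected(true_skill, field)      # ground-truth win probability
    tail = []
    for i in range(votes):
        score = 1.0 if rng.random() < p_win else 0.0
        rating += k * (score - expected(rating, field))
        if i >= votes - 1000:                # average the settled tail
            tail.append(rating)
    return sum(tail) / len(tail)

print(round(simulate_debut()))  # lands near 1500, within ordinary Elo noise
```

With a ~20-point frontier spread and per-vote noise of several points, early readings are dominated by variance, which is why the first stable GPT-5.5 number is expected only 2–3 weeks after launch.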
Despite GPT-5.5's agentic edge, Claude Opus 4.7 retains the headline Arena position: it debuted at ~1504 Elo within 24 hours of its April 16 launch (+37 over Opus 4.6), resolving the April Polymarket market at 100% Anthropic after $21.1M in trading.
The June market ($5.09M volume) currently prices Anthropic at 70%, Google at 22%, with all other labs combined below 10%.
However, style-control-adjusted markets compress Anthropic's lead meaningfully, suggesting the raw Arena advantage may partly reflect presentation preferences rather than pure capability — a structural caveat for any Arena-denominated verdict.
Claude Opus 4.7 is simultaneously facing documented developer backlash over 'legendarily bad' behavior in production use, arguing with users and delivering degraded code quality relative to Opus 4.6, creating a sharp gap between benchmark ranking and real-world developer trust.
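The style-control caveat is mechanical rather than rhetorical: refit the vote model with style covariates (e.g. response length) and any presentation effect migrates out of the skill term. A minimal synthetic sketch; the logistic form, the coefficients, and the length feature are all our illustrative assumptions, not LMArena's actual pipeline:

```python
import math, random

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

random.seed(0)
# Synthetic head-to-head votes: model A beats model B partly on skill and
# partly because its answers run longer (a style feature).
data = []
for _ in range(5000):
    len_gap = random.gauss(0.5, 1.0)        # A writes longer on average
    p = sigmoid(0.10 + 0.30 * len_gap)      # true skill 0.10, style 0.30
    data.append((len_gap, 1.0 if random.random() < p else 0.0))

# Fit win ~ sigmoid(skill + style * len_gap) by batch gradient ascent.
skill = style = 0.0
lr = 1.0
for _ in range(300):
    g_skill = g_style = 0.0
    for x, won in data:
        err = won - sigmoid(skill + style * x)
        g_skill += err
        g_style += err * x
    skill += lr * g_skill / len(data)
    style += lr * g_style / len(data)

raw_win = sum(w for _, w in data) / len(data)
print(f"raw win rate: {raw_win:.3f}")              # inflated by style
print(f"style-adjusted skill logit: {skill:.3f}")  # close to the true 0.10
```

In this toy the raw head-to-head win rate looks decisive while the style-adjusted skill gap is modest, which is the shape of the 70% raw vs 36–55% adjusted divergence discussed below.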
DeepSeek V4-Pro, released April 24, 2026 under MIT license, represents the most important structural development below the frontier: at ~80.6% SWE-bench Verified and approximately $0.30/M input tokens (versus $5/M for GPT-5.5 and Opus 4.7, a 17x cost gap), it establishes open-weights as a credible tier-2 alternative.
Its 1M-token context window and #1 open-source Arena Elo (~1445) make it a realistic enterprise fallback if the closed-source gap continues compressing.
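The cost arithmetic survives failure rates: dividing cost per attempt by pass rate gives cost per successful task. A sketch with prices and pass rates from this brief; the 500k-token per-task budget is a hypothetical round number:

```python
# Cost per *successful* SWE-bench-style task = cost per attempt / pass rate.
TOKENS_PER_TASK = 500_000      # hypothetical agent-context budget per task

models = {
    # name: (input price $/M tokens, SWE-bench Verified pass rate)
    "DeepSeek V4-Pro": (0.30, 0.806),
    "Claude Opus 4.7": (5.00, 0.876),
    "GPT-5.5":         (5.00, 0.887),
}
for name, (price, pass_rate) in models.items():
    per_attempt = price * TOKENS_PER_TASK / 1_000_000
    print(f"{name}: ${per_attempt:.2f}/attempt, "
          f"${per_attempt / pass_rate:.2f}/success")
```

Even after charging DeepSeek for its extra retries, the per-success gap stays near 15x, which is the arithmetic behind the enterprise-fallback claim.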
The top-6 Arena Elo spread has compressed to approximately 20 points, confirming that task-fit now dominates raw model rank as a selection criterion — a materially different procurement environment than 12 months ago.
GPT-5.5 released April 23–24, 2026 with 88.7% SWE-bench Verified, leading Claude Opus 4.7 (87.6%) on the standard coding benchmark and holding a 13.3-point lead on Terminal-Bench 2.0 (82.7% vs 69.4%) — its Arena Elo is still accumulating as of May 2
Claude Opus 4.7 holds #1 LMArena Text Arena at ~1504 Elo, resolving the April Polymarket market at 100% Anthropic after $21.1M in volume; the June market ($5.09M) prices Anthropic at 70% with settlement on June 30
Developer backlash against Claude Opus 4.7 within 24 hours of launch — described as 'legendarily bad' by practitioners for arguing with users and degraded coding vs Opus 4.6, creating a material gap between benchmark rank and production trust
DeepSeek V4-Pro dropped April 24, 2026 under MIT license at ~$0.30/M input tokens (17x cheaper than Opus 4.7 and GPT-5.5), scoring ~80.6% SWE-bench Verified and establishing itself as the leading open-weight challenger
Grok 4.x benchmark integrity compromised: xAI self-reports 72–75% SWE-bench Verified; independent vals.ai testing using SWE-agent scaffold shows only 58.6% — a ~15-point discrepancy that undermines Grok's competitive positioning
Value | Metric | Change | Date | Source
~1504 | Claude Opus 4.7 LMArena Elo | +37 vs Opus 4.6 | 2026-04-16 | LMArena via secondary aggregators + CNBC confirmation
~1493–1500 | Gemini 3.1 Pro Preview LMArena Elo | n/a | 2026-05-02 | LMArena via secondary aggregators
88.7% | GPT-5.5 SWE-bench Verified | n/a | 2026-04-23 | Multiple benchmark aggregators
87.6% | Claude Opus 4.7 SWE-bench Verified | +6.8 pts vs Opus 4.6 (80.8%) | 2026-04-16 | Multiple benchmark aggregators
82.7% | GPT-5.5 Terminal-Bench 2.0 | +13.3 pts vs Claude Opus 4.7 | 2026-04-23 | BenchLM.ai / morphl.ai
69.4% | Claude Opus 4.7 Terminal-Bench 2.0 | n/a | 2026-04-16 | BenchLM.ai / morphl.ai
85.0% | GPT-5.5 ARC-AGI-2 | +9.2 pts vs Claude Opus 4.7 | 2026-04-23 | BenchLM.ai
75.8% | Claude Opus 4.7 ARC-AGI-2 | n/a | 2026-04-16 | BenchLM.ai
Scenarios
Claude Opus 4.7 holds #1 Arena Elo through June 30; GPT-5.5 reaches top-2 but not #1; Anthropic wins June Polymarket at 65–70%
GPT-5.5 Arena Elo stabilizes 10–20 points below Claude Opus 4.7
Anthropic patches Opus 4.7 behavior, partially recovering developer trust
Anthropic June Polymarket resolves at ~65–75%; no major shift in competitive narrative; DeepSeek remains tier-2
Claude Opus 4.7 extends Arena lead; GPT-5.5 vote accumulation underperforms; 'Claude Mythos Preview' officially released
GPT-5.5 Arena Elo stabilizes below 1490 despite benchmark performance
Anthropic officially releases Mythos-class model before June 30
June Polymarket resolves >85% Anthropic; Anthropic narrative dominates the H1 2026 AI cycle; DeepSeek remains price-disruption story only
GPT-5.5 Arena Elo accumulation surpasses Claude Opus 4.7; style-control concerns erode Anthropic's raw lead; June Polymarket reprices to 45–55% Anthropic
GPT-5.5 Arena Elo reaches 1495+ within 30 days, trading #1 with Claude
Opus 4.7 behavior complaints persist without an Anthropic patch, and developer adoption stalls
June Polymarket shifts from 70/22 toward 50/30 Anthropic/Google; narrative shifts to 'OpenAI recaptured AI crown'; enterprise procurement shifts toward GPT-5.5 for agentic workflows
June 30, 2026 12pm ET — Polymarket 'best AI model end of June' market resolution ($5.09M volume); LMArena snapshot at that moment is the settlement oracle.
GPT-5.5 Arena Elo accumulation — model launched April 23-24 and was still in 'active voting accumulation' as of May 2; first stable Elo reading expected within 2–3 weeks.
Anthropic 'Claude Mythos Preview' — cited at 93.9% SWE-bench Verified but unverified; any official announcement would immediately reprice the June Polymarket odds.
DeepSeek V4-Pro open-weight fine-tune results — MIT license enables community SFT on top of 80.6% SWE-bench base; derivative models closing the gap to 85%+ would pressure closed-source pricing.
May 2026 Polymarket market (currently 81.5% Anthropic) resolution — tracks whether Claude holds #1 through end of May before the June market becomes the focus.
Google Gemini 3.1 Pro week-over-week Elo stability — currently #2–3 (~1493–1500 Elo) but not yet verified for consistency; a confirmed move above 1500 would turn the June race, currently priced 70%/22% Anthropic/Google on Polymarket, into a genuine two-horse contest.
Claude Opus 4.7 community backlash within 24 hours of launch — developers reported 'legendarily bad' behavior, arguing with users and degraded code quality vs Opus 4.6, creating adoption risk despite benchmark leadership.
GPT-5.5 leads Terminal-Bench 2.0 by 13.3 points (82.7% vs 69.4%) and ARC-AGI-2 by 9.2 points (85.0% vs 75.8%), indicating Claude's overall Arena lead may not translate to agentic or reasoning-heavy workloads.
Grok 4.x self-reported SWE-bench scores (72–75%) diverge ~15 points from independent vals.ai testing (58.6%), raising benchmark integrity concerns that could affect the broader leaderboard ecosystem.
DeepSeek V4-Pro at $0.30/M input tokens vs $5/M for Opus 4.7 creates a 17x cost arbitrage; if it closes the remaining ~7-point SWE-bench gap, enterprise switching pressure intensifies rapidly.
'Claude Mythos Preview' cited at 93.9% SWE-bench but flagged as provisional/unverified — if this is a real Anthropic pipeline release, it could reshape standings before June 30; if fabricated, it signals growing benchmark noise.
Style-control-adjusted Polymarket market shows Anthropic's June lead compresses from 70% to 36–55%, suggesting raw Arena voting may carry a presentation/style bias inflating Claude's apparent dominance.
Anthropic holds the strongest near-term position (70% June Polymarket, 81.5% May, April resolved 100%) but the 20-point Elo compression at the frontier means a single strong model release from Google or OpenAI can flip the leaderboard within one vote batch cycle.
DeepSeek V4-Pro's 1/17th cost ratio with ~80.6% SWE-bench performance establishes a cost-efficiency floor that forces closed-source vendors to justify premium pricing through capability gaps that are measurably narrowing.
Task-specific model selection is now more valuable than tracking overall Elo rank: Claude leads coding (1549 Elo), GPT-5.5 leads terminal/agentic tasks (Terminal-Bench +13.3 pts, ARC-AGI-2 +9.2 pts) — enterprise buyers should weight by workload mix, not headline rank (a minimal weighting sketch follows this list).
The style-control divergence on Polymarket (raw 70% vs adjusted 36–55% for Anthropic) suggests Arena human preference scores embed a presentation bias; benchmark-grounded metrics (SWE-bench Pro, Terminal-Bench) are more reliable signals for production deployment decisions than Chatbot Arena Elo alone.
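A minimal sketch of that workload-mix weighting; the scores are from this brief, while the task-to-benchmark mapping and the two example mixes are our hypothetical choices:

```python
# Weight each benchmark by the share of your traffic it proxies,
# instead of ranking on a single headline Elo.
scores = {
    # benchmark proxies: coding = SWE-bench Verified,
    # agentic = Terminal-Bench 2.0, reasoning = ARC-AGI-2
    "GPT-5.5":         {"coding": 88.7, "agentic": 82.7, "reasoning": 85.0},
    "Claude Opus 4.7": {"coding": 87.6, "agentic": 69.4, "reasoning": 75.8},
}

def weighted_score(model: str, mix: dict[str, float]) -> float:
    return sum(scores[model][task] * w for task, w in mix.items())

mixes = {
    "agent-heavy": {"coding": 0.3, "agentic": 0.5, "reasoning": 0.2},
    "code-heavy":  {"coding": 0.8, "agentic": 0.1, "reasoning": 0.1},
}
for mix_name, mix in mixes.items():
    for model in scores:
        print(f"{mix_name:11s} {model}: {weighted_score(model, mix):.1f}")
```

Under the agent-heavy mix the gap is nearly nine points in GPT-5.5's favor; under the code-heavy mix it shrinks to about three, which is the whole argument for weighting by workload rather than by headline rank.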