
Which company has the best AI model?

Monthly AI model leaderboard tracking LMArena Elo, SWE-bench, MMLU-Pro, GPQA, Aider, and ARC-AGI across GPT-5, Claude 4, Gemini 2.5, Grok 4, Llama 4, Qwen 3, and DeepSeek R1. Polymarket event: "Which company has best AI model end of June?" — 10 live markets, $2,201,415 in 30-day volume.

159 sources · 5 perspectives
The Take by June 30, 2026
Anthropic holds #1 LMArena position through June 30, 2026 Polymarket settlement
Market
62%
Our Call
63%
Δ
+1
Why we agree
Short 8-week settlement window and incumbent position favor Anthropic despite GPT-5.5's agentic benchmark strength; market adjustment seems appropriately calibrated to transition risk rather than overcorrected.
What changes our mind
June 30, 2026 12pm ET — Polymarket 'best AI model end of June' market resolution ($5.09M volume); LMArena snapshot at that moment is the settlement oracle.
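To make the Market vs Our Call spread concrete, here is a minimal sketch of expected value and Kelly sizing for a binary Polymarket contract, assuming $1-payout YES shares and ignoring fees, slippage, and resolution risk. The numbers are this page's headline market (62% price, 63% call); the helper functions are illustrative, not a Polymarket API.

```python
def ev_per_share(p_true: float, price: float) -> float:
    """Expected profit per $1-payout YES share bought at `price`,
    given believed true probability `p_true`; simplifies to p_true - price."""
    return p_true * (1.0 - price) - (1.0 - p_true) * price

def kelly_fraction(p_true: float, price: float) -> float:
    """Kelly-optimal bankroll fraction for buying YES at `price`:
    you risk `price` to win `1 - price`, so f* = (p - price) / (1 - price)."""
    if p_true <= price:
        return 0.0  # no edge on the YES side
    return (p_true - price) / (1.0 - price)

# The Take: market 62%, our call 63%, a +1 point edge
print(f"{ev_per_share(0.63, 0.62):.3f}")    # 0.010 expected profit per share
print(f"{kelly_fraction(0.63, 0.62):.3f}")  # 0.026, i.e. ~2.6% of bankroll
```

A one-point edge sizes to a very small Kelly stake, consistent with reading the +1 delta above as agreement with the market rather than a trading signal.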

All forecasts

02

GPT-5.5 reaches top-2 LMArena Elo within 30 days

By June 1, 2026
Market
55%
Our Call
52%
Δ
-3
Why we disagree
GPT-5.5's agentic superiority and strong benchmark performance suggest high vote accumulation velocity. Model was #1 by several metrics at launch; Arena convergence is likely but not certain.
Why our call differs from market
Model launching at #1 on multiple benchmarks makes top-2 Arena placement likely, but 30-day specificity and Arena voting volatility create downside — market slightly underpriced.
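"Vote accumulation velocity" is just the per-vote Elo update at work. Below is a minimal sketch assuming a plain fixed-K Elo update; LMArena's production leaderboard actually fits a Bradley–Terry model over the full vote history, so treat this purely as an illustration of why a strong new entrant climbs fast and then plateaus.

```python
import random

def expected_score(r_a: float, r_b: float) -> float:
    """Expected win probability of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 16.0) -> float:
    """A's new rating after one head-to-head vote."""
    return r_a + k * ((1.0 if a_won else 0.0) - expected_score(r_a, r_b))

# Hypothetical entrant seeded at 1400, winning 60% of votes against a ~1500 field:
# early votes move the rating a lot, then gains shrink as expected_score catches up.
random.seed(0)
rating = 1400.0
for _ in range(500):
    rating = elo_update(rating, 1500.0, a_won=random.random() < 0.60)
print(round(rating))  # drifts toward ~1570, where expected_score(r, 1500) = 0.60
```

This shape is why a stable Elo reading takes weeks: the early trajectory is dominated by the seed rating and the noise of small vote counts.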
03

DeepSeek V4-Pro community fine-tune reaches 85%+ SWE-bench Verified by July 2026

By July 31, 2026
Market
35%
Our Call
52%
Δ
+17
Why we disagree
MIT license enables rapid SFT iteration from 80.6% base. Historically, community fine-tunes have closed ~3-5 point gaps within 60-90 days on open models. The 85% threshold is achievable but not certain.
Why our call differs from market
Historical 3–5 point gap closure in 60–90 days directly supports the 4.4-point need in two months; the market underweights this precedent but rightfully hedges on the execution risk of a community fine-tune materializing and sustaining the Verified benchmark score.
04

Anthropic releases a patch or follow-on model addressing Opus 4.7 production behavior before June 30

By June 30, 2026
Market
70%
Our Call
58%
Δ
-12
Why we disagree
Anthropic has historically responded quickly to production behavior complaints. The backlash severity ('legendarily bad') creates commercial pressure for a rapid response.
Why our call differs from market
Eight weeks is tight for a meaningful model release even with commercial pressure; while Anthropic responds quickly to issues, the broad claim ('patch OR follow-on') inflates market odds above typical deployment velocity.
  • As of May 2, 2026, the AI model leaderboard sits in its most competitive configuration yet, with the decisive near-term question now being whether GPT-5.5's agentic superiority can overcome Claude Opus 4.7's Arena lead before the June 30 Polymarket settlement.
  • GPT-5.5 ('Spud'), released April 23–24, 2026, posted 88.7% SWE-bench Verified — a slim but clear edge over Claude Opus 4.7's 87.6% on the standard benchmark — and more tellingly leads Terminal-Bench 2.0 by 13 points (82.7% vs 69.4%) and ARC-AGI-2 by 11.7 points (85.0% vs 75.8%), signaling a structural GPT-5.5 advantage on multi-step agentic tasks.
  • GPT-5.5's Arena Elo remains in active vote accumulation as of this writing and is the key unresolved data point.
  • Despite GPT-5.5's agentic edge, Claude Opus 4.7 retains the headline Arena position: it debuted at ~1504 Elo within 24 hours of its April 16 launch (+37 over Opus 4.6), resolving the April Polymarket market at 100% Anthropic after $21.1M in trading.
  • The June market ($5.09M volume) currently prices Anthropic at 70%, Google at 22%, with all other labs combined below 10%.
  • However, style-control-adjusted markets compress Anthropic's lead meaningfully, suggesting the raw Arena advantage may partly reflect presentation preferences rather than pure capability — a structural caveat for any Arena-denominated verdict.
  • Claude Opus 4.7 is simultaneously facing documented developer backlash over 'legendarily bad' behavior in production use, arguing with users and delivering degraded code quality relative to Opus 4.6, creating a sharp gap between benchmark ranking and real-world developer trust.
  • DeepSeek V4-Pro, released April 24, 2026 under MIT license, represents the most important structural development below the frontier: at ~80.6% SWE-bench Verified and approximately $0.30/M input tokens (versus $5/M for GPT-5.5 and Opus 4.7, a 17x cost gap), it establishes open-weights as a credible tier-2 alternative.
  • Its 1M-token context window and #1 open-source Arena Elo (~1445) make it a realistic enterprise fallback if the closed-source gap continues compressing.
  • The top-6 Arena Elo spread has compressed to approximately 20 points, confirming that task-fit now dominates raw model rank as a selection criterion — a materially different procurement environment than 12 months ago.
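For intuition on why a ~20-point spread makes task-fit dominate rank: under the Elo model used by the Arena, a 20-point advantage implies a head-to-head win rate barely above a coin flip:

$$P(\text{A beats B}) = \frac{1}{1 + 10^{(R_B - R_A)/400}} = \frac{1}{1 + 10^{-20/400}} \approx 0.529$$

At a 52.9% expected win rate, fit to the specific workload routinely outweighs the headline ranking.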
  • GPT-5.5 released April 23–24, 2026 with 88.7% SWE-bench Verified, leading Claude Opus 4.7 (87.6%) on the standard coding benchmark and holding a 13-point lead on Terminal-Bench 2.0 (82.7% vs 69.4%) — its Arena Elo is still accumulating as of May 2
  • Claude Opus 4.7 holds #1 LMArena Text Arena at ~1504 Elo, resolving the April Polymarket market at 100% Anthropic after $21.1M in volume; the June market ($5.09M) prices Anthropic at 70% with settlement on June 30
  • Developer backlash against Claude Opus 4.7 within 24 hours of launch — described as 'legendarily bad' by practitioners for arguing with users and degraded coding vs Opus 4.6, creating a material gap between benchmark rank and production trust
  • DeepSeek V4-Pro dropped April 24, 2026 under MIT license at ~$0.30/M input tokens (17x cheaper than Opus 4.7 and GPT-5.5), scoring ~80.6% SWE-bench Verified and establishing itself as the leading open-weight challenger (a cost sketch follows this list)
  • Grok 4.x benchmark integrity compromised: xAI self-reports 72–75% SWE-bench Verified; independent vals.ai testing using SWE-agent scaffold shows only 58.6% — a ~15-point discrepancy that undermines Grok's competitive positioning
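The 17x figure cited above follows directly from the quoted input-token prices. A minimal cost sketch, assuming input tokens only; real bills also include output tokens, which are typically priced several times higher per token, and the workload volume here is hypothetical:

```python
# Input prices quoted on this page (USD per 1M input tokens)
PRICE_PER_MTOK = {
    "deepseek-v4-pro": 0.30,
    "gpt-5.5": 5.00,
    "claude-opus-4.7": 5.00,
}

def input_cost_usd(model: str, tokens: int) -> float:
    """Input-token cost in USD for a given token volume."""
    return PRICE_PER_MTOK[model] / 1_000_000 * tokens

monthly_tokens = 10_000_000_000  # hypothetical 10B input tokens/month
for model in PRICE_PER_MTOK:
    print(f"{model}: ${input_cost_usd(model, monthly_tokens):,.0f}/month")
# deepseek-v4-pro: $3,000/month vs $50,000/month for either closed model;
# 5.00 / 0.30 = 16.7, which rounds to the 17x gap cited above
```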
At a glance: 159 sources · no models, no speculation; everything traces back to a source.
Numbers
~1504 · Claude Opus 4.7 LMArena Elo (Text Arena Overall) · +37 vs Opus 4.6 · 2026-04-16 · LMArena via secondary aggregators + CNBC confirmation
~1493–1500 · Gemini 3.1 Pro Preview LMArena Elo · 2026-05-02 · LMArena via secondary aggregators
88.7% · GPT-5.5 SWE-bench Verified · 2026-04-23 · Multiple benchmark aggregators
87.6% · Claude Opus 4.7 SWE-bench Verified · +6.8 pts vs Opus 4.6 (80.8%) · 2026-04-16 · Multiple benchmark aggregators
82.7% · GPT-5.5 Terminal-Bench 2.0 · +13.3 pts vs Claude Opus 4.7 · 2026-04-23 · BenchLM.ai / morphl.ai
69.4% · Claude Opus 4.7 Terminal-Bench 2.0 · 2026-04-16 · BenchLM.ai / morphl.ai
85.0% · GPT-5.5 ARC-AGI-2 · +11.7 pts vs Claude Opus 4.7 · 2026-04-23 · BenchLM.ai
75.8% · Claude Opus 4.7 ARC-AGI-2 · 2026-04-16 · BenchLM.ai
Scenarios

Claude Opus 4.7 holds #1 Arena Elo through June 30; GPT-5.5 reaches top-2 but not #1; Anthropic wins June Polymarket at 65–70%

  • GPT-5.5 Arena Elo stabilizes 10–20 points below Claude Opus 4.7
  • Anthropic patches Opus 4.7 behavior, partially recovering developer trust
Anthropic June Polymarket resolves at ~65–75%; no major shift in competitive narrative; DeepSeek remains tier-2

Claude Opus 4.7 extends Arena lead; GPT-5.5 vote accumulation underperforms; 'Claude Mythos Preview' officially released

  • GPT-5.5 Arena Elo stabilizes below 1490 despite benchmark performance
  • Anthropic officially releases Mythos-class model before June 30
June Polymarket resolves >85% Anthropic; Anthropic narrative dominates the H1 2026 AI cycle; DeepSeek remains price-disruption story only

GPT-5.5 Arena Elo accumulation surpasses Claude Opus 4.7; style-control concerns erode Anthropic's raw lead; June Polymarket reprices to 45–55% Anthropic

  • GPT-5.5 Arena Elo reaches 1495+ within 30 days, trading #1 with Claude
  • Opus 4.7 behavior complaints persist without Anthropic patch, developer adoption stalls
June Polymarket shifts from 70/22 toward 50/30 Anthropic/Google; narrative shifts to 'OpenAI recaptured AI crown'; enterprise procurement shifts toward GPT-5.5 for agentic workflows
  • June 30, 2026 12pm ET — Polymarket 'best AI model end of June' market resolution ($5.09M volume); LMArena snapshot at that moment is the settlement oracle.
  • GPT-5.5 Arena Elo accumulation — model launched April 23-24 and was still in 'active voting accumulation' as of May 2; first stable Elo reading expected within 2–3 weeks.
  • Anthropic 'Claude Mythos Preview' — cited at 93.9% SWE-bench Verified but unverified; any official announcement would immediately reprice the June Polymarket odds.
  • DeepSeek V4-Pro open-weight fine-tune results — MIT license enables community SFT on top of 80.6% SWE-bench base; derivative models closing the gap to 85%+ would pressure closed-source pricing.
  • May 2026 Polymarket market (currently 81.5% Anthropic) resolution — tracks whether Claude holds #1 through end of May before the June market becomes the focus.
  • Google Gemini 3.1 Pro week-over-week Elo stability — currently at #2–3 (~1493–1500 Elo) but unverified for consistency; any confirmed move above 1500 would make the June race a genuine two-horse contest at 70%/22% Polymarket odds.
  • Claude Opus 4.7 community backlash within 24 hours of launch — developers reported 'legendarily bad' behavior, arguing with users and degraded code quality vs Opus 4.6, creating adoption risk despite benchmark leadership.
  • GPT-5.5 leads Terminal-Bench 2.0 by 13.3 points (82.7% vs 69.4%) and ARC-AGI-2 by 11.7 points (85.0% vs 75.8%), indicating Claude's overall Arena lead may not translate to agentic or reasoning-heavy workloads.
  • Grok 4.x self-reported SWE-bench scores (72–75%) diverge ~15 points from independent vals.ai testing (58.6%), raising benchmark integrity concerns that could affect the broader leaderboard ecosystem.
  • DeepSeek V4-Pro at $0.30/M input tokens vs $5/M for Opus 4.7 creates a 17x cost arbitrage; if it closes the remaining ~7-point SWE-bench gap, enterprise switching pressure intensifies rapidly.
  • 'Claude Mythos Preview' cited at 93.9% SWE-bench but flagged as provisional/unverified — if this is a real Anthropic pipeline release, it could reshape standings before June 30; if fabricated, it signals growing benchmark noise.
  • Style-control-adjusted Polymarket market shows Anthropic's June lead compresses from 70% to 36–55%, suggesting raw Arena voting may carry a presentation/style bias inflating Claude's apparent dominance.
  • Anthropic holds the strongest near-term position (70% June Polymarket, 81.5% May, April resolved 100%) but the 20-point Elo compression at the frontier means a single strong model release from Google or OpenAI can flip the leaderboard within one vote batch cycle.
  • DeepSeek V4-Pro's 1/17th cost ratio with ~80.6% SWE-bench performance establishes a cost-efficiency floor that forces closed-source vendors to justify premium pricing through capability gaps that are measurably narrowing.
  • Task-specific model selection is now more valuable than tracking overall Elo rank: Claude leads coding (1549 Elo), GPT-5.5 leads terminal/agentic tasks (Terminal-Bench +13.3 pts, ARC-AGI-2 +11.7 pts); enterprise buyers should weight by workload mix, not headline rank (a weighting sketch follows this list).
  • The style-control divergence on Polymarket (raw 70% vs adjusted 36–55% for Anthropic) suggests Arena human preference scores embed a presentation bias; benchmark-grounded metrics (SWE-bench Pro, Terminal-Bench) are more reliable signals for production deployment decisions than Chatbot Arena Elo alone.
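As flagged in the task-fit bullet above, here is a minimal sketch of workload-weighted model selection. The benchmark scores are the ones reported on this page; the workload mix is hypothetical, and naively averaging raw percentages across different benchmarks is a simplification, not a calibrated methodology:

```python
# Benchmark scores from this page (percent); workload weights are hypothetical.
SCORES = {
    "gpt-5.5":         {"swe_bench": 88.7, "terminal_bench": 82.7, "arc_agi_2": 85.0},
    "claude-opus-4.7": {"swe_bench": 87.6, "terminal_bench": 69.4, "arc_agi_2": 75.8},
}

def weighted_score(model: str, weights: dict[str, float]) -> float:
    """Workload-weighted average of benchmark scores; weights should sum to 1."""
    return sum(SCORES[model][bench] * w for bench, w in weights.items())

# Example: a team whose workload is mostly terminal/agentic tasks
agentic_mix = {"swe_bench": 0.3, "terminal_bench": 0.5, "arc_agi_2": 0.2}
for model in SCORES:
    print(f"{model}: {weighted_score(model, agentic_mix):.1f}")
# gpt-5.5: 85.0 vs claude-opus-4.7: 76.1; this agentic mix favors GPT-5.5 decisively
```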
Top sources
  • Polymarket (19)
  • llm-stats.com (10)
  • BenchLM.ai (9)
  • buildfastwithai.com (5)
  • morphllm.com (4)