Models ranked: 18
Tasks: 1277
Period: June 2026
| # | Change | Model | Provider | Tags | Score |
|---|--------|-------|----------|------|-------|
| 1 | ↑6 | Gemini 2.5 Pro | Google | reasoning, vision, long-context | 87.7 |
| 2 | ↑4 | Grok 4.3 | xAI | reasoning, agentic, long-context | 86.0 |
| 3 | ↑1 | DeepSeek-V4-Pro | DeepSeek | flagship, reasoning, coding | 84.0 |
| 4 | ↑8 | DeepSeek-V4-Flash | DeepSeek | fast, cheap, coding | 83.5 |
| 5 | ↑3 | Claude Opus 4.8 | Anthropic | flagship, reasoning, vision | 81.4 |
| 6 | ↓4 | GPT-5.4 | OpenAI | workhorse, coding | 77.7 |
| 7 | ↓2 | Claude Sonnet 4.6 | Anthropic | workhorse, coding, vision | 75.6 |
| 8 | ↑6 | GPT-5.5 | OpenAI | flagship, reasoning | 74.7 |
| 9 | ↓8 | Qwen3.7-Max | Qwen | flagship, reasoning, coding | 74.3 |
| 10 | ↓1 | Gemini 2.5 Flash | Google | workhorse, vision, long-context | 72.1 |
| 11 | ↑4 | Mistral Large 3 | Mistral AI | flagship, vision, coding | 62.2 |
| 12 | ↓9 | GPT-5.4 mini | OpenAI | fast, cheap | 61.3 |
| 13 | = | Kimi K2.6 | Moonshot AI | flagship, reasoning, vision | 59.1 |
| 14 | ↑2 | Command R+ 08-2024 | Cohere | flagship, reasoning, coding | 56.7 |
| 15 | ↓5 | Claude Haiku 4.5 | Anthropic | fast, cheap | 56.4 |
| 16 | ↓5 | Qwen3.6-Plus | Qwen | coding, reasoning | 32.1 |
| 17 | ↓4 | Codestral | Mistral AI | coding, code-completion | 25.0 |
| 18 | = | Command R 03-2024 | Cohere | coding, multilingual, rag | 12.0 |
Tiers: S ≥ 90 · A ≥ 80 · B ≥ 70 · C < 70
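
The tier thresholds can be read as a simple lookup from score to letter. The sketch below is illustrative only: the function name `tier` is hypothetical, and it assumes the boundaries are inclusive exactly as written (≥ 90, ≥ 80, ≥ 70, otherwise C).

```python
def tier(score: float) -> str:
    """Map a 0-100 leaderboard score to a tier letter.

    Assumption: thresholds are inclusive lower bounds, as in the legend.
    """
    if score >= 90:
        return "S"
    if score >= 80:
        return "A"
    if score >= 70:
        return "B"
    return "C"


# Scores taken from the ranking in this document.
print(tier(87.7))  # A
print(tier(77.7))  # B
print(tier(62.2))  # C
```

Under these thresholds, no model in the current ranking reaches the S tier, since the top score is 87.7.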

Scores aggregated from public benchmarks · updated weekly

How scores are calculated
35% · Aider Polyglot: 225 real coding tasks across 6 languages

35% · SWE-bench Verified: 500 real GitHub issues, % resolved

20% · LiveCodeBench: ongoing competitive programming, Pass@1

10% · EvalPlus (HumanEval+): stricter HumanEval test suite
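
As a rough illustration of how the weights above might combine, here is a minimal sketch. It assumes each benchmark result is already on a 0-100 scale and that the final score is a plain weighted sum rounded to one decimal; the page does not state how missing results or normalization are handled, so the names `WEIGHTS` and `composite_score`, the error on missing input, and the example inputs are all hypothetical.

```python
# Weights as listed in the methodology (35 + 35 + 20 + 10 = 100%).
WEIGHTS = {
    "aider_polyglot": 0.35,
    "swe_bench_verified": 0.35,
    "livecodebench": 0.20,
    "evalplus": 0.10,
}


def composite_score(results: dict[str, float]) -> float:
    """Weighted sum of per-benchmark scores (each assumed to be 0-100).

    Assumption: a model must have a result for every benchmark; the
    source does not say how gaps are treated, so this sketch rejects them.
    """
    missing = WEIGHTS.keys() - results.keys()
    if missing:
        raise ValueError(f"missing benchmark results: {sorted(missing)}")
    total = sum(weight * results[name] for name, weight in WEIGHTS.items())
    return round(total, 1)


# Made-up inputs for illustration, not real benchmark results.
example = {
    "aider_polyglot": 80.0,
    "swe_bench_verified": 70.0,
    "livecodebench": 60.0,
    "evalplus": 90.0,
}
print(composite_score(example))  # 73.5
```

Because Aider Polyglot and SWE-bench Verified together carry 70% of the weight, a model's rank under this scheme is driven mostly by those two benchmarks.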