June 2026

AI Coder Bench

Real-world coding benchmark rankings. 50 tasks across bug fixing, feature building, refactoring, system design, and debug & explain.

6 Boards

49 Models ranked

777 Tasks

July 2026 Period

2026-07-18 Last update

● We fetch raw rankings directly from 6 public benchmark providers, with no re-weighting and no composite score. Each card links to its official board, which is the source of the numbers.

📌 4 lenses · 6 official boards

Each card previews a benchmark provider's Top 5. Click to expand full rankings, or jump straight to the official page.

LMArena ★

Hundreds of thousands of human blind votes (Elo)

Chat

Metric: Arena Elo · 12 models

#1	Claude Fable 5	1631
#2	Claude Opus 4.8	1581
#3	Grok 4.5	1558
#4	Claude Sonnet 4.6	1544
#5	Kimi K2.6	1519

Artificial Analysis

Composite Intelligence Index over 20+ benchmarks

Overall

Metric: AA Index · 46 models

#1	Claude Fable 5	59.9
#2	GPT-5.6 Sol	58.9
#3	Kimi K3	57.1
#4	Claude Opus 4.8	55.7
#5	GPT-5.6 Terra	55

LiveBench

Contamination-resistant benchmark, refreshed monthly

Reasoning

Metric: Avg score (%) · 23 models

#1	GPT-5.6 Sol	82.9%
#2	Claude Fable 5	81.4%
#3	GPT-5.5	80.5%
#4	GPT-5.6 Terra	80.3%
#5	Claude Opus 4.8	79.8%

Aider Polyglot

133 real coding tasks across 6 languages

Coding

Metric: Pass rate (%) · 19 models

#1	GPT-5.6 Sol	88.0%
#2	GPT-5.5	84.9%
#3	Gemini 2.5 Pro	83.1%
#4	Grok 4.5	79.6%
#5	DeepSeek-V4-Pro	74.2%

LiveCodeBench

Continuously updated competitive-programming set

Reasoning

Metric: Pass@1 (%) · 11 models

#1	DeepSeek-V4-Pro	100.0%
#2	GPT-5.4	100.0%
#3	GPT-5.4 mini	100.0%
#4	GPT-5.5	100.0%
#5	Claude Sonnet 4.6	100.0%

EvalPlus

HumanEval+ strict test suite

Coding

Metric: HumanEval+ (%) · 17 models

#1	GPT-5.4	89.0%
#2	GPT-5.4 mini	89.0%
#3	Qwen3.6-Plus	87.2%
#4	DeepSeek-V4-Pro	86.6%
#5	DeepSeek-V4-Flash	83.5%

LMArena · Chat

Metric: Arena Elo · 12 models · fetched 2026-07-18

#	Model	Arena Elo	Artificial Analysis rank	LiveBench rank	Aider Polyglot rank	LiveCodeBench rank	EvalPlus rank
#1	Claude Fable 5 (Anthropic)	1631	#1	#2	-	-	-
#2	Claude Opus 4.8 (Anthropic)	1581	#4	#5	#6	#7	#9
#3	Grok 4.5 (xAI)	1558	#7	#9	#4	-	#7
#4	Claude Sonnet 4.6 (Anthropic)	1544	#8	#10	#7	#5	#6
#5	Kimi K2.6 (Moonshot AI)	1519	#17	#16	#10	-	-
#6	GPT-5.5 (OpenAI)	1488	#6	#3	#2	#4	-
#7	Gemini 3 Pro (Google)	1486	#24	-	-	-	-
#8	GPT-5.6 Sol (OpenAI)	1486	#2	#1	#1	-	#17
#9	GPT-5.4 (OpenAI)	1472	#9	#7	#8	#2	#1
#10	Gemini 2.5 Pro (Google)	1393	#31	-	#3	#8	#8
#11	Gemini 3.5 Flash (Google)	1286	#12	#11	-	-	-
#12	Gemini 3.1 Pro (Google)	1211	#13	#8	-	-	-

The five rank columns on the right show each model's rank on the other 5 boards at a glance (- = the model was not tested on that board).

Hundreds of thousands of human blind-vote conversations (Elo), from LMArena / Chatbot Arena. The most authoritative "user preference" board today.
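For intuition about what an Elo gap means, here is a minimal sketch that converts two ratings into an expected head-to-head preference rate. It assumes the classic Elo logistic curve with a 400-point scale; the board's own rating procedure may differ, and the function name is illustrative only.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the classic Elo model (assumed here)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Ratings of the #1 and #2 models on the board above.
p = elo_expected_score(1631, 1581)
print(f"{p:.3f}")  # 0.571
```

Under that assumption, a 50-point lead corresponds to being preferred in roughly 57% of pairwise votes, not to a 50-point difference in any absolute quality measure.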

📊 6 benchmark providers

LMArena · Chat

Hundreds of thousands of human blind-vote conversations (Elo), from LMArena / Chatbot Arena. The most authoritative "user preference" board today.

Arena Elo · 12 models

Artificial Analysis · Overall

Artificial Analysis aggregates 20+ benchmarks (MMLU-Pro, GPQA, HLE, SciCode, IFBench, Terminal-bench, etc.) into a single Intelligence Index, the most comprehensive commercial evaluator for "overall capability".

AA Index · 46 models

LiveBench · Reasoning

LiveBench refreshes its questions monthly to prevent contamination. Covers reasoning, math, coding, language, instruction following, and data analysis.

Avg score (%) · 23 models

Aider Polyglot · Coding

133 real coding tasks (bug-fix / feature) across 6 languages. Metric: pass_rate_2.

Pass rate (%) · 19 models

LiveCodeBench · Reasoning

Continuously updated competitive-programming problems. Metric: Pass@1.

Pass@1 (%) · 11 models

EvalPlus · Coding

HumanEval+ strict test suite, 164 tasks.

HumanEval+ (%) · 17 models
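The Pass@1 metric listed for LiveCodeBench can be illustrated with the widely used unbiased pass@k estimator. This is a sketch under the assumption that the board follows that standard definition; the exact sampling setup is not stated here, and the sample counts below are made up for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 10 samples, 3 pass all tests. For k=1 this reduces to c/n.
print(round(pass_at_k(n=10, c=3, k=1), 3))  # 0.3
```

A board-level Pass@1 score would then be the mean of this per-problem value over all problems.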

All rankings and numbers come directly from official sources; we only do name matching and aggregate presentation, no weighting or re-ranking. Boards use different methodologies, so do not compare raw values across boards.
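The name matching and rank-only presentation described above could look roughly like the following. This is an illustrative sketch, not the site's actual pipeline: `normalize` and `cross_board_ranks` are hypothetical helpers, and the input lists are just two of the Top 5 previews from this page.

```python
import re
from typing import Optional


def normalize(name: str) -> str:
    """Hypothetical matching key: lowercase, letters and digits only."""
    return re.sub(r"[^a-z0-9]+", "", name.lower())


def cross_board_ranks(boards: dict[str, list[str]]) -> dict[str, dict[str, Optional[int]]]:
    """Map each normalized model name to its rank per board (None = not listed)."""
    ranks: dict[str, dict[str, int]] = {}
    for board, models in boards.items():
        for position, model in enumerate(models, start=1):
            ranks.setdefault(normalize(model), {})[board] = position
    # Only ranks are carried over; raw scores stay on their own board.
    return {key: {board: per.get(board) for board in boards} for key, per in ranks.items()}


boards = {
    "LMArena": ["Claude Fable 5", "Claude Opus 4.8", "Grok 4.5", "Claude Sonnet 4.6", "Kimi K2.6"],
    "Aider Polyglot": ["GPT-5.6 Sol", "GPT-5.5", "Gemini 2.5 Pro", "Grok 4.5", "DeepSeek-V4-Pro"],
}
table = cross_board_ranks(boards)
print(table[normalize("grok-4.5")])  # {'LMArena': 3, 'Aider Polyglot': 4}
```

Comparing positions rather than scores keeps an Elo rating from ever being set side by side with a pass rate. Because this sketch uses Top 5 previews only, `None` here means "not in the preview", not necessarily "not tested".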