Benchmark Methodology

Task Suite — 50 tasks total

🐛

Bug Fix: 10 tasks

Real GitHub issues — reproduce, fix, pass CI.

⚙️

Feature Build: 10 tasks

Spec → working code with tests.

♻️

Refactor: 10 tasks

Migrate legacy code to modern patterns.

🏗️

System Design: 10 tasks

Architecture + code skeleton from requirements.

🔍

Debug & Explain: 10 tasks

Root-cause analysis and plain-English explanation.

Every model receives a weighted composite score out of 100.

40% Correctness

Automated test pass rate

25% Quality

Code readability & best practices (LLM panel)

15% Cost

Token spend × vendor price, per task

10% Speed

End-to-end response time

10% UX

Tool integration experience (manual)
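To make the weighting concrete, here is a minimal sketch of how a composite could be computed from the five components. The weights come from the breakdown above. Everything else is an assumption for illustration: the function names, the per-million-token price parameters, and the premise that each sub-score has already been normalized to a 0-100 scale (this page does not say how cost and speed are mapped onto that scale).

```python
# Weights from the scoring breakdown; they sum to 1.0.
WEIGHTS = {
    "correctness": 0.40,
    "quality": 0.25,
    "cost": 0.15,
    "speed": 0.10,
    "ux": 0.10,
}


def task_cost_usd(input_tokens, output_tokens, price_in_per_mtok, price_out_per_mtok):
    """Token spend x vendor price for one task.

    Assumes (hypothetically) that vendors quote separate input and output
    prices in USD per million tokens.
    """
    return (input_tokens * price_in_per_mtok
            + output_tokens * price_out_per_mtok) / 1_000_000


def composite_score(sub_scores):
    """Weighted sum of sub-scores, each assumed to be normalized to 0-100."""
    missing = WEIGHTS.keys() - sub_scores.keys()
    if missing:
        raise ValueError(f"missing sub-scores: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0 <= sub_scores[name] <= 100:
            raise ValueError(f"{name} must be in [0, 100], got {sub_scores[name]}")
    return sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS)


# Made-up sub-scores for a hypothetical model, not real results.
example = {"correctness": 80, "quality": 70, "cost": 60, "speed": 90, "ux": 50}
print(round(composite_score(example), 1))
```

With these made-up inputs the weighted terms are 32 + 17.5 + 9 + 9 + 5, for a composite of 72.5 out of 100.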

⚡

Same prompt, every model

Each model receives an identical system prompt and task description. No per-model tuning.
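One way to enforce this rule is to build every request from a single shared template, so that the model identifier is the only field that varies. The sketch below is illustrative only: the prompt text, the field names, and the model identifiers are placeholders, not the actual harness.

```python
# Placeholder prompt text; the real system prompt is not shown on this page.
SYSTEM_PROMPT = "You are a coding assistant. Complete the task described below."


def build_requests(models, task_description):
    """Return one request per model; only the 'model' field differs."""
    return [
        {"model": model, "system": SYSTEM_PROMPT, "task": task_description}
        for model in models
    ]


# Hypothetical model identifiers and task text.
requests = build_requests(["model-a", "model-b"], "Fix the failing date parser test.")
for request in requests:
    print(request["model"], "|", request["system"])
```

Keeping the prompt in one constant means a per-model tweak cannot slip in without changing the shared template for every model.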

🔁

Weekly reruns

The full suite runs every Monday. Rankings update automatically on publish.

📂

Open task set

All 50 tasks will be published in our public git repo. Anyone can audit or reproduce a run.

🚫

No paid placement

Vendors cannot pay to influence scores, task selection, or publish order.

View current rankings →

Seed estimates from public evals; live scores coming soon.