Automated test runs are in progress. The methodology below is final; we are still building the test harness and will publish live scores on the Leaderboard once the first full run completes.

Task Suite — 50 tasks total

🐛
Bug Fix (10 tasks)

Real GitHub issues — reproduce, fix, pass CI.

⚙️
Feature Build (10 tasks)

Spec → working code with tests.

♻️
Refactor (10 tasks)

Migrate legacy code to modern patterns.

🏗️
System Design (10 tasks)

Architecture + code skeleton from requirements.

🔍
Debug & Explain (10 tasks)

Root-cause analysis and plain-English explanation.

Scoring Dimensions

Every model receives a weighted composite score out of 100.

40% Correctness: Automated test pass rate
25% Quality: Code readability & best practices (LLM panel)
15% Cost: Token spend × vendor price, per task
10% Speed: End-to-end response time
10% UX: Tool integration experience (manual)
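
To make the weighting concrete, here is a minimal sketch of how a composite score could be computed from the five dimensions listed above. It is an illustration, not the harness code. The weights come from the list; everything else is an assumption: the function names `task_cost_usd` and `composite_score` are hypothetical, each dimension is assumed to be already normalised to a 0-100 score before weighting (the methodology does not say how cost and speed are converted), and vendor prices are assumed to be quoted separately for input and output tokens in USD per million tokens.

```python
# Weights taken from the Scoring Dimensions list; they sum to 1.0.
WEIGHTS = {
    "correctness": 0.40,
    "quality": 0.25,
    "cost": 0.15,
    "speed": 0.10,
    "ux": 0.10,
}


def task_cost_usd(input_tokens, output_tokens,
                  input_price_per_mtok, output_price_per_mtok):
    """Token spend x vendor price for a single task.

    Assumption: prices are in USD per million tokens, with separate
    input and output rates.
    """
    spend = (input_tokens * input_price_per_mtok
             + output_tokens * output_price_per_mtok)
    return spend / 1_000_000


def composite_score(dimension_scores):
    """Weighted sum of per-dimension scores.

    Assumption: every dimension score is already on a 0-100 scale,
    so the result is also out of 100.
    """
    missing = WEIGHTS.keys() - dimension_scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0 <= dimension_scores[name] <= 100:
            raise ValueError(f"{name} must be between 0 and 100")
    return sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    example = {"correctness": 80, "quality": 70, "cost": 60,
               "speed": 90, "ux": 50}
    print(round(composite_score(example), 2))
    print(task_cost_usd(1000, 500, 3.0, 15.0))
```

Because correctness carries 40% of the weight, a 10-point gain in test pass rate moves the composite by 4 points, while the same gain in speed or UX moves it by only 1.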

Principles

Same prompt, every model

Each model receives an identical system prompt and task description. No per-model tuning.
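
One way to enforce this principle in a harness is to build the request once per task and hand an identical copy to every model. The sketch below is illustrative only: `Task`, `SYSTEM_PROMPT`, `build_request`, and `run_task` are hypothetical names, the prompt text is a placeholder, and each model is represented by a plain callable rather than any real vendor API.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Task:
    task_id: str
    description: str


# Placeholder text; the real system prompt is not given in this document.
SYSTEM_PROMPT = "You are a coding assistant. Complete the task below."


def build_request(task: Task) -> Dict[str, str]:
    """Build the single request that every model will receive."""
    return {"system": SYSTEM_PROMPT, "task": task.description}


def run_task(task: Task,
             models: Dict[str, Callable[[Dict[str, str]], str]]) -> Dict[str, str]:
    """Send the same request to each model and collect the responses.

    Each model gets its own copy of the request, so no adapter can
    mutate what the next model sees.
    """
    request = build_request(task)
    return {name: call(dict(request)) for name, call in models.items()}
```

Keeping prompt construction in one function, with no model name among its inputs, makes per-model tuning structurally impossible rather than a matter of convention.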

🔁
Weekly reruns

The full suite runs every Monday. Rankings update automatically when the results are published.

📂
Open task set

All 50 tasks will be published in our public git repo. Anyone can audit or reproduce a run.

🚫
No paid placement

Vendors cannot pay to influence scores, task selection, or publish order.

View current rankings → Currently seed estimates from public evals; live scores coming soon.