In development
Benchmark Methodology
How we score AI coding models — 50 real-world tasks, five scoring dimensions, fully automated runs.
Task Suite — 50 tasks total
Bug Fix (10 tasks)
Real GitHub issues — reproduce, fix, pass CI.
Feature Build (10 tasks)
Spec → working code with tests.
Refactor (10 tasks)
Migrate legacy code to modern patterns.
System Design (10 tasks)
Architecture + code skeleton from requirements.
Debug & Explain (10 tasks)
Root-cause analysis and plain-English explanation.
Scoring Dimensions
Every model receives a weighted composite score out of 100.
Automated test pass rate
Code readability & best practices (LLM panel)
Token spend × vendor price, per task
End-to-end response time
Tool integration experience (manual)
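As a rough illustration of how the five dimensions above could roll up into a single number, here is a minimal Python sketch. The weights are not stated on this page, so the equal weights below are placeholders, and the dimension keys, function names, and the split into input and output token prices are all assumptions made for the example. Each sub-score is assumed to be already normalized to a 0 to 100 scale before weighting.

```python
# Hypothetical weights: the page lists five dimensions but not their weights.
# Equal integer weights are used here purely for illustration.
WEIGHTS = {
    "test_pass_rate": 20,    # automated test pass rate
    "readability": 20,       # code readability & best practices (LLM panel)
    "cost": 20,              # token spend x vendor price, per task
    "latency": 20,           # end-to-end response time
    "tool_integration": 20,  # tool integration experience (manual)
}


def task_cost_usd(input_tokens, output_tokens,
                  price_in_per_mtok, price_out_per_mtok):
    """Token spend x vendor price for one task, in USD.

    Assumes prices are quoted per million tokens, with separate
    input and output rates (an assumption, not stated in the source).
    """
    total = input_tokens * price_in_per_mtok + output_tokens * price_out_per_mtok
    return total / 1_000_000


def composite_score(subscores, weights=WEIGHTS):
    """Weighted composite out of 100.

    `subscores` maps each dimension to a value already normalized to 0..100.
    """
    if set(subscores) != set(weights):
        raise ValueError("subscores must cover exactly the weighted dimensions")
    total_weight = sum(weights.values())
    return sum(weights[k] * subscores[k] for k in weights) / total_weight


if __name__ == "__main__":
    # Made-up sub-scores for a single model, for demonstration only.
    example = {
        "test_pass_rate": 80,
        "readability": 70,
        "cost": 60,
        "latency": 90,
        "tool_integration": 50,
    }
    print(composite_score(example))
    print(task_cost_usd(1_000_000, 500_000, 3.0, 15.0))
```

Dividing by the sum of the weights keeps the result on a 0 to 100 scale even if the weights are later changed to values that do not sum to 100.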
Principles
Same prompt, every model
Each model receives an identical system prompt and task description. No per-model tuning.
Weekly reruns
The full suite runs every Monday. Rankings update automatically when the new results are published.
Open task set
All 50 tasks will be published in our public git repo. Anyone can audit or reproduce a run.
No paid placement
Vendors cannot pay to influence scores, task selection, or publish order.
View current rankings →
Seed estimates from public evals; live scores coming soon.