ARC-AGI — the only unbeaten benchmark that's easy for humans, but hard for AI.
Join the $1M+ global competition and compete with thousands of AI researchers and frontier AI labs on the world's most important unbeaten benchmark, ARC-AGI-2. Open-source your progress towards AGI and win prizes!
- Grand Prize: $700,000
- Paper Awards: $75,000
- Top Scores: $50,000
- To Be Announced: $175,000
| AI System | Score | Cost/Task |
| --- | --- | --- |
| o3 (low)* | < 5.0% | $200.00 |
| o1-pro* | < 5.0% | $39.00 |
| o1 (high) | 3.0% | $4.50 |
| ARChitects (2024) | 2.5% | $0.20 |
| o3-mini | 1.7% | $0.28 |
| Icecuber | 1.6% | $0.13 |
| DeepSeek R1 | 1.3% | $0.08 |
| Gemini 2.0 Flash | 1.3% | $0.004 |
See Full Leaderboard + Code + Papers
* Estimate based on partial testing results and o1-pro pricing.
A new benchmark that challenges frontier AI reasoning systems.
ARC-AGI-1 was created in 2019, before the modern era of LLMs. It endured five years of global competitions and over 50,000x of AI scaling, seeing little progress until late 2024, when test-time adaptation methods pioneered by ARC Prize 2024 participants and OpenAI finally gained traction.
ARC-AGI-2, the next iteration of the benchmark, is designed to stress-test the efficiency and capability of state-of-the-art AI reasoning systems, provide useful signal towards AGI, and re-inspire researchers to work on new ideas.
Pure LLMs score 0%, AI reasoning systems score only single-digit percentages, yet extensive testing shows that humans can solve every task.
Can you create a system that can reach 85% accuracy?
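For concreteness, here is a minimal Python sketch of the public task format and the exact-match scoring rule. The file path is illustrative, and `load_task`/`score_task` are our own helpers, not official competition code.

```python
import json

def load_task(path: str) -> dict:
    """Load one ARC task: a JSON dict with 'train' and 'test' lists,
    each item holding an 'input' grid and an 'output' grid
    (2-D lists of integers 0-9, one integer per cell color)."""
    with open(path) as f:
        return json.load(f)

def score_task(task: dict, predictions: list) -> bool:
    """A task counts as solved only if every test output is
    reproduced exactly: cell for cell, including grid shape."""
    return all(pred == pair["output"]
               for pred, pair in zip(predictions, task["test"]))

# A deliberately trivial baseline: predict each test output as an
# unmodified copy of its input (scores ~0%, but shows the plumbing).
task = load_task("data/training/some_task.json")  # hypothetical path
preds = [pair["input"] for pair in task["test"]]
print("solved:", score_task(task, preds))
```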
Efficiency Test
Log-linear scaling is insufficient to beat ARC-AGI-2.
New test-time adaptation algorithms or novel AI systems are needed to bring AI efficiency in line with human performance.
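As one illustration of the test-time adaptation idea, the sketch below fits a fresh, untrained model to a single task's demonstration pairs before predicting that task's test outputs. Everything here is a simplification for exposition: the `TinySolver` architecture, the 30x30 padding, and the assumption that outputs match input shapes are our own choices, not a reconstruction of any prize-winning method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

PAD, COLORS = 30, 10  # ARC grids are at most 30x30, with 10 cell colors

def to_input(grid):
    """One-hot encode a grid's colors and zero-pad it to PAD x PAD."""
    t = torch.zeros(COLORS, PAD, PAD)
    for r, row in enumerate(grid):
        for c, v in enumerate(row):
            t[v, r, c] = 1.0
    return t

def to_target(grid):
    """Per-cell color labels, with -1 marking padding to be ignored."""
    t = torch.full((PAD, PAD), -1, dtype=torch.long)
    for r, row in enumerate(grid):
        for c, v in enumerate(row):
            t[r, c] = v
    return t

class TinySolver(nn.Module):
    """A deliberately small per-cell classifier, standing in for
    whatever model one would actually adapt at test time."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(COLORS, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, COLORS, 3, padding=1),
        )

    def forward(self, x):
        return self.net(x)

def solve_by_test_time_training(task, steps=200):
    """The core test-time adaptation loop: fit a model on THIS task's
    few demonstration pairs, then predict its test outputs.
    Simplification: assumes each output grid has the same shape as
    its input grid, which is not true of all ARC tasks."""
    model = TinySolver()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    xs = torch.stack([to_input(p["input"]) for p in task["train"]])
    ys = torch.stack([to_target(p["output"]) for p in task["train"]])
    for _ in range(steps):
        opt.zero_grad()
        F.cross_entropy(model(xs), ys, ignore_index=-1).backward()
        opt.step()
    preds = []
    with torch.no_grad():
        for pair in task["test"]:
            h, w = len(pair["input"]), len(pair["input"][0])
            logits = model(to_input(pair["input"]).unsqueeze(0))[0]
            preds.append(logits.argmax(0)[:h, :w].tolist())
    return preds
```

A competitive system would instead adapt a pretrained model, augment the handful of demonstration pairs (rotations, reflections, color permutations), and handle output grids whose shape differs from the input; the point here is only the per-task training loop.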
Capability Test
Tasks requiring symbols to be interpreted as having meaning beyond their visual patterns.
Current systems attempt to check symmetry, mirroring, and other transformations, and even recognize connecting elements, but fail to assign semantic significance to the symbols themselves.
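The sketch below shows what that kind of transformation checking looks like in practice: it can discover that every demonstration pair is, say, a left-right mirror, but it has no mechanism for treating a symbol as meaning anything. The candidate set and names are ours, chosen for illustration.

```python
import numpy as np

# Candidate whole-grid transformations of the kind described above.
CANDIDATES = {
    "identity":   lambda g: g,
    "mirror_lr":  np.fliplr,
    "mirror_ud":  np.flipud,
    "rotate_90":  np.rot90,
    "rotate_180": lambda g: np.rot90(g, 2),
    "rotate_270": lambda g: np.rot90(g, 3),
    "transpose":  lambda g: g.T,
}

def fit_transformation(task):
    """Return the name of the first candidate consistent with every
    demonstration pair, or None. Purely visual: nothing here can
    assign meaning to a symbol."""
    for name, fn in CANDIDATES.items():
        if all(np.array_equal(fn(np.array(p["input"])), np.array(p["output"]))
               for p in task["train"]):
            return name
    return None

# A toy task whose single rule is a left-right mirror.
task = {"train": [{"input": [[1, 2], [3, 4]], "output": [[2, 1], [4, 3]]}]}
print(fit_transformation(task))  # -> mirror_lr
```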
Capability Test
Tasks requiring the simultaneous application of multiple rules, or the application of multiple rules that interact with each other.
In contrast, if a task has only a few global rules, current systems can consistently discover and apply them.
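To make that contrast concrete, here is a hedged sketch of the kind of brute-force search that handles a few cleanly composing global rules, and why it does not go further: the candidate space grows exponentially with the number of rules, and sequential composition cannot express rules that interact through shared context. The primitives and names are ours.

```python
import numpy as np
from itertools import product

# A small vocabulary of global rules (illustrative, not exhaustive).
PRIMITIVES = {
    "mirror_lr": np.fliplr,
    "mirror_ud": np.flipud,
    "rotate_90": np.rot90,
    "transpose": lambda g: g.T,
}

def fit_composition(task, max_depth=2):
    """Search every sequence of up to max_depth primitives and return
    the first one consistent with all demonstration pairs. The
    candidate space is len(PRIMITIVES) ** depth, so this works for a
    couple of global rules and collapses combinatorially beyond that."""
    for depth in range(1, max_depth + 1):
        for names in product(PRIMITIVES, repeat=depth):
            def apply_all(g, names=names):
                for n in names:
                    g = PRIMITIVES[n](g)
                return g
            if all(np.array_equal(apply_all(np.array(p["input"])),
                                  np.array(p["output"]))
                   for p in task["train"]):
                return names
    return None

# A toy task that needs two composed rules (equivalent to a 90-degree
# rotation followed by a left-right mirror).
task = {"train": [{"input": [[1, 2], [3, 4]], "output": [[4, 2], [3, 1]]}]}
print(fit_composition(task))  # -> ('mirror_ud', 'rotate_90'), an equivalent sequence
```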
Capability Test
Tasks where rules must be applied differently based on context.
Systems tend to fixate on superficial patterns rather than understanding the underlying selection principles.
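As a toy illustration of what "context-dependent" means here (entirely our own construction, far simpler than real ARC-AGI-2 tasks): the same task can demand one transformation in one grid and a different one in another, with the choice keyed to a contextual cue.

```python
import numpy as np

def contextual_rule(grid):
    """Toy context-dependent rule: mirror the grid left-right when a
    red marker (color 2) appears anywhere, and up-down otherwise.
    A solver that pattern-matches a single demonstration latches onto
    one mirror and misses the selection principle entirely."""
    g = np.array(grid)
    return (np.fliplr(g) if (g == 2).any() else np.flipud(g)).tolist()

print(contextual_rule([[2, 1], [0, 0]]))  # marker present -> [[1, 2], [0, 0]]
print(contextual_rule([[3, 1], [0, 0]]))  # no marker      -> [[0, 0], [3, 1]]
```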
Founded by Mike Knoop (Co-founder, Zapier) and François Chollet (Creator of ARC-AGI and Keras), the ARC Prize Foundation is a non-profit organization whose mission is to guide researchers, industry, and regulators towards AGI through enduring benchmarks.