Interactive Reasoning Benchmark
The first eval that measures human-like intelligence in AI.
ARC-AGI-3 is designed to measure AI system generalization and intelligence through skill-acquisition efficiency in novel, unseen environments.
The in-progress benchmark dataset will consist of ~100 unique environments split into public and private evaluation sets, where AI agents must perceive, decide, and act over multiple steps without prior instructions.
ARC-AGI-3 is currently in development. The early preview is limited to 6 games (3 public, 3 to be released in Aug '25). Development began in early 2025, and the full benchmark is set to launch in 2026.
Static benchmarks have traditionally been the yardstick for measuring intelligence, but they lack the bandwidth to measure its full spectrum.
Interactive Reasoning Benchmarks (IRBs) test for a much broader scope of capabilities: exploration, planning, memory, reflection, and adjustment toward a goal. Game environments provide a rich medium to test this kind of experience-driven competence.
We can declare the arrival of AGI when we build an artificial system that matches the learning efficiency of humans.
Humans are the only existence proof of general intelligence. Human-level intelligence is inherently interactive: it unfolds over time, drawing on experience as we explore an environment, plan, reflect, and adjust toward a goal. By testing intelligence over time, we can observe extended trajectories, planning horizons, memory compression (distilling past states into future decisions), self-reflection, and in-context plan execution.
Game environments provide an ideal medium to test interactivity. They strike a unique balance, offering clear rules, goals, and feedback while still requiring the test-taker to engage in complex planning and learning.
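To make the interaction loop concrete, here is a minimal sketch of the reset/step contract an interactive benchmark environment might expose. The toy grid game below is invented purely for illustration and is not the actual ARC-AGI-3 interface:

```python
import random


class ToyGridGame:
    """A toy interactive environment: the agent starts at cell 0 on a
    1-D grid and must discover, purely by acting, which action moves it
    toward a hidden goal cell. No instructions are given up front."""

    def __init__(self, size: int = 5):
        self.size = size

    def reset(self) -> int:
        self.pos = 0
        self.goal = random.randrange(1, self.size)
        return self.pos  # observation: the agent's current cell

    def step(self, action: int) -> tuple[int, float, bool]:
        # Action 1 moves right, action 0 moves left; any other action
        # does nothing -- part of what the agent must learn by trying.
        if action == 1:
            self.pos = min(self.pos + 1, self.size - 1)
        elif action == 0:
            self.pos = max(self.pos - 1, 0)
        done = self.pos == self.goal
        return self.pos, (1.0 if done else 0.0), done
```

The essential shape is the loop itself: the agent sees only observations and rewards, and must infer the rules of the game from experience.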
We've seen echoes of this in earlier eras; Atari games have been widely used in the past. But the shortcomings of those agents were clear: they couldn't generalize beyond memorized pixels, relied on built-in human priors, ignored sample efficiency, encoded their developers' intelligence, and faced no true hidden test set.
ARC-AGI-3 will overcome these shortcomings by introducing a new set of hand-crafted, novel environments designed to test the skill-acquisition efficiency of artificial systems as compared to humans.
It will rely on the established ARC-AGI pillars (core knowledge priors only; no reliance on language, trivia, or vast training data) to evaluate performance against human baselines.
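Purely as an assumption-laden sketch (ARC-AGI-3's actual scoring rules are not specified here), one plausible way to operationalize skill-acquisition efficiency is score earned per action taken, normalized against a human baseline:

```python
def acquisition_efficiency(score: float, actions_taken: int) -> float:
    """Score earned per action spent exploring, learning, and playing."""
    return score / max(actions_taken, 1)


def relative_to_human(agent_score: float, agent_actions: int,
                      human_score: float, human_actions: int) -> float:
    """Illustrative metric: a value >= 1.0 would mean the agent acquires
    skill at least as efficiently as the human baseline on this game."""
    return (acquisition_efficiency(agent_score, agent_actions)
            / acquisition_efficiency(human_score, human_actions))
```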
IRBs aren't just better metrics; they're a clear signal that there is a wide gap between human and artificial intelligence.
As long as that gap remains, we do not have AGI.
ARC Prize is partnering with HuggingFace to host a competition that harnesses the collective intelligence of our community to evaluate how current AI performs on ARC-AGI-3.
We need your help to build agents that can play and learn, so we can calibrate difficulty and refine game design. We're open to a mix of language-model and reinforcement-learning-based approaches. Help us learn where current frontier AI stands: build on top of the ARC-AGI API to create agents that can play ARC-AGI-3. Ready to build? See the ARC-AGI-3 API documentation.
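As a starting point, here is a minimal agent scaffold sketched against a hypothetical HTTP client. The base URL, endpoint paths, and payload fields below are placeholders invented for this example; consult the ARC-AGI-3 API documentation for the real interface:

```python
import random

import requests

BASE_URL = "https://example.invalid/arc-agi-3"  # placeholder, not the real API


def choose_action(obs, history) -> int:
    # Placeholder policy: act randomly. A real agent would use the
    # trajectory in `history` to plan, reflect, and adjust -- e.g. by
    # prompting a language model or querying a learned policy.
    return random.randrange(4)


def play_episode(game_id: str, max_steps: int = 100) -> float:
    """Run one perceive-decide-act loop against a (hypothetical) game server."""
    obs = requests.post(f"{BASE_URL}/games/{game_id}/reset").json()
    history: list[tuple] = []  # raw trajectory the agent can reflect on
    total_reward = 0.0
    for _ in range(max_steps):
        action = choose_action(obs, history)
        resp = requests.post(f"{BASE_URL}/games/{game_id}/step",
                             json={"action": action}).json()
        history.append((obs, action, resp["reward"]))
        total_reward += resp["reward"]
        obs = resp["observation"]
        if resp["done"]:
            break
    return total_reward
```

Swapping out `choose_action` for a language-model or reinforcement-learning policy is where the interesting work begins.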
One of the primary challenges in developing ARC-AGI-3 is generating innovative game ideas.
Great ideas can come from anyone, anywhere. That's why we're calling on the community to contribute creative game concepts. While we can't guarantee that every submission will be implemented, your inspiration and enthusiasm are invaluable to us.
Have a game idea?
For a non-profit, building over 100 games is a worthy challenge, and we're able to do this important work thanks to the generous support of our incredible sponsors.
Every donation above $5,000 directly funds the creation of one new ARC-AGI-3 game.
Interested? Please consider making a donation today.
To hear more about the gap between AI and AGI, watch the recent YC AI Startup School talk from ARC-AGI creator François Chollet.