On July 17, we released a sneak peek of ARC-AGI-3, our first Interactive Reasoning Benchmark. These three preview games were our first contact with reality using a new format: video-game-like environments where agents and humans must perceive, plan, and act across multiple steps to achieve long-horizon goals.
Our goal was to gather data about human performance, learn how well AI systems perform on interactive tasks, and see how the community would engage with this new style of evaluation. To encourage further development, we hosted an ARC-AGI-3 Preview Agent competition.
30-Day takeaways:
Interactive benchmarks are easy (even fun) for humans, but hard for AI
Most humans beat the games, and often enjoyed them. Persistent test-takers "speed-ran" games down to theoretical minimum action counts. AI agents struggled to make progress efficiently.
Some preview games were too friendly to random search
A few game designs could be brute-forced without reasoning. What we learned will help us make future games more resistant to brute force and more reflective of intelligence.
Action efficiency provides a clear intelligence signal
Measuring how efficiently environment information is converted into strategy reveals a clear divide between human-level and AI-level performance.
Traditionally, static benchmarks have been the yardstick for measuring intelligence. Interactive Reasoning Benchmarks (IRBs), however, test for a broader scope of capabilities:
We released ARC-AGI-3 Preview to put these core design principles into action.
Instead of asking, "Can the test-taker recall the right answer?" as static benchmarks do, interactive benchmarks ask, "Can they explore, learn, plan, and adapt when dropped into an entirely new environment?"
Interactive benchmarks also offer something static benchmarks cannot: action efficiency.
Instead of just checking whether a goal is reached, we can measure how many actions it takes to get there. In other words, we're tracking how efficiently a test taker converts information from the environment into a working strategy.
Inspired by Francois Chollet's On the Measure of Intelligence, this gives us a new way to define efficiency, and by extension, intelligence, as the conversion ratio between environment information and agent behavior.
This new metric is not a nice-to-have; it is foundational to measuring the performance of frontier models and of intelligence in general. Intelligence is efficiency.
ARC-AGI-3 will be a set of hand-crafted, novel, unique environments designed to test the skill-acquisition efficiency of artificial systems as compared to humans.
It will rely on previous ARC-AGI pillars (core priors, excluding reliance on language, trivia, or vast training data) to evaluate performance against human baselines.
The first three games released were meant to demonstrate a spectrum of game types. Some are agent-based (moving a single object around the screen), while others are orchestration-based (viewing and manipulating multiple objects at once).
Preview games released
| Game | Type | Description |
|---|---|---|
| ls20 | Agentic, map-based | Navigate a map while bringing a matching symbol to another object. The symbol must go through various transformations in order to reach the goal. |
| ft09 | Non-agentic, logic | Match the pattern seen on the screen. Patterns occasionally overlap. |
| vc33 | Orchestration | Alternate the volume of objects in order to match levels to pre-specified heights. |
| Private Game #1 | — | To be released |
| Private Game #2 | — | To be released |
| Private Game #3 | — | To be released |
The three private games used as a hidden holdout set for the ARC-AGI-3 Preview Agent Competition will be released in the coming weeks.
Since the Preview launch, over 1,200 people have played more than 3,900 ARC-AGI-3 games.
To measure how efficiently humans play, we've adopted a scoring framework inspired by work from Josh Tenenbaum's lab, Human-Level Reinforcement Learning through Theory-Based Modeling, Exploration, and Planning.
The method is straightforward: track the number of actions taken to complete each level, then plot how that effort accumulates over time. This makes it easy to compare performance, not just between two humans, but more importantly, between humans and AI (more on this later).
We were also inspired by shortest-path algorithms (such as Dijkstra's and A*), which focus on two questions: 1) Did the agent complete its goal? 2) How efficiently did it do so?
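To make the bookkeeping concrete, here is a minimal sketch of that tracking in Python. The data structures and function names are illustrative stand-ins, not the actual ARC-AGI-3 scoring code.

```python
# Minimal sketch of per-level action tracking (illustrative, not the
# actual ARC-AGI-3 scoring implementation).
from dataclasses import dataclass

@dataclass
class LevelResult:
    level: int    # level index within a game
    actions: int  # actions the player spent inside that level

def cumulative_progression(results: list[LevelResult]) -> list[tuple[int, int]]:
    """Return (cumulative_actions, level_reached) points for plotting
    level progression against total action count."""
    points, total = [], 0
    for r in sorted(results, key=lambda r: r.level):
        total += r.actions
        points.append((total, r.level))
    return points

# Example: a hypothetical player clears three levels of a game.
run = [LevelResult(1, 42), LevelResult(2, 67), LevelResult(3, 120)]
print(cumulative_progression(run))  # [(42, 1), (109, 2), (229, 3)]
```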
When this framework is used on ARC-AGI-3, we get a view of level progression vs. action count. It may look like a simple chart, but it reveals a lot! Here's an example of human test data from vc33, a public preview game.
What we can learn from this chart:
Expanding this view to all 3 public games, we can start to see each game's unique characteristics.
When measuring human performance, we count only a player's first run on a game and compare it with an AI's first run, so that we measure learning efficiency rather than memorization.
The dataset shown above comes from authenticated users. We can't guarantee that their first recorded run was truly their first time playing; some may have played anonymously before logging in. So although this data is useful for understanding action variability per level and per game, it will not be used for standardized scoring.
For the production launch of ARC-AGI-3, we will rely on a controlled human study of more than 200 participants to determine baselines.
Using this scoring framework, we can directly compare human and AI performance on a per-level basis by using action efficiency.
As a test-taker plays a game, they "spend" actions in two ways: exploratory actions used to learn how the environment works, and execution actions used to carry out a strategy toward the goal.
Every player, human or AI, must use exploratory actions before executing a strategy. Humans are generally good at this: they explore briefly, then execute successfully. Random brute-force agents, however, may eventually complete a level but require far more actions. They aren't effective at turning information from the environment into a workable strategy.
To understand how well AI is doing relative to humans, we can express the number of actions an agent needs to complete a level as a percentage of the number of actions a human needs for the same level.
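As a rough illustration of that comparison, here is a hedged sketch in Python. The exact formula used for competition scoring may differ; this simply relates an agent's per-level action count to a human baseline.

```python
# Illustrative per-level comparison of agent vs. human action counts.
# This is not necessarily the exact competition scoring formula.
def actions_vs_human(agent_actions: int, human_actions: int) -> float:
    """Agent action count as a percent of the human baseline (lower is better)."""
    return 100.0 * agent_actions / human_actions

def efficiency_vs_human(agent_actions: int, human_actions: int) -> float:
    """Human baseline as a percent of agent actions, capped at 100% (human level)."""
    return min(100.0, 100.0 * human_actions / agent_actions)

# Hypothetical numbers: a human clears a level in 80 actions, an agent needs 2,000.
print(actions_vs_human(2000, 80))     # 2500.0 -> 25x the human effort
print(efficiency_vs_human(2000, 80))  # 4.0    -> 4% of human-level efficiency
```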
Alongside the launch of the first ARC-AGI-3 Preview games, we launched the Preview Agent Competition with the goal of incentivizing the community's collective intelligence to build agents. By putting ARC-AGI-3 Preview into the hands of developers, we could test our game and API design early, before scaling up development for the full benchmark.
In partnership with Hugging Face, who generously sponsored the competition, we released the first version of our API. Developers could train and test their agents on the public set.
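To give a feel for what building on the API involves, here is a minimal, self-contained agent loop. The `PreviewGame` class below is a hypothetical stand-in environment, not the real ARC-AGI-3 API; it only shows the shape of the observe-act loop developers implement.

```python
# Hypothetical stand-in environment and a random-agent loop
# (NOT the actual ARC-AGI-3 API; for illustration only).
import random

class PreviewGame:
    """Toy environment: reach position 10 by repeatedly moving right."""
    NUM_ACTIONS = 4  # assumed small discrete action space

    def reset(self) -> int:
        self.pos = 0
        return self.pos                      # the "frame" is just a position here

    def step(self, action: int) -> tuple[int, bool]:
        self.pos += 1 if action == 3 else 0  # only one action makes progress
        return self.pos, self.pos >= 10      # (observation, level_complete)

def run_random_agent(game: PreviewGame, max_actions: int = 1000) -> int:
    """Play uniformly random actions; return actions used, or -1 if capped."""
    game.reset()
    for n in range(1, max_actions + 1):
        _, done = game.step(random.randrange(game.NUM_ACTIONS))
        if done:
            return n
    return -1

print(run_random_agent(PreviewGame()))
```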
Our aim was to discourage overfitting and reward generalization, so the competition was judged on each agent's performance on three additional private games.
The competition ran for 30 days, open to anyone worldwide. In the end, we received 12 submissions, with 8 tested against the private games.
1st Place: StochasticGoose @ Tufa Labs: Score: 12.58%, Levels Completed: 18
A convolutional neural network (CNN) action-learning agent. It uses a simple reinforcement learning approach to predict which actions will cause frame changes, enabling more efficient exploration than random selection. Lead Developer: Dries Smit; Advisor/Reviewer: Jack Cole
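For intuition, here is a rough PyTorch sketch of that idea: a small CNN scores each action by how likely it is to change the frame, and exploration is biased toward those actions. This is not the team's actual implementation; the action-space size and network shape are assumptions.

```python
# Sketch of frame-change-prediction-guided exploration (illustrative only).
import torch
import torch.nn as nn

NUM_ACTIONS = 6  # assumption: small discrete action space

class FrameChangePredictor(nn.Module):
    def __init__(self, channels: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(channels, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, NUM_ACTIONS)  # one logit per action

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        # Probability that each action would change the current frame.
        return torch.sigmoid(self.head(self.features(frame)))

def sample_exploration_action(model: FrameChangePredictor,
                              frame: torch.Tensor) -> int:
    """Pick an action, biased toward ones predicted to change the frame."""
    with torch.no_grad():
        probs = model(frame.unsqueeze(0)).squeeze(0)
    probs = probs / probs.sum()  # normalize into a sampling distribution
    return int(torch.multinomial(probs, 1))

# Training (omitted) would label each (frame, action) pair with whether
# the next frame actually differed, and minimize a binary cross-entropy loss.
```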
2nd Place: Blind Squirrel: Score: 6.71%, Levels Completed: 13
Explore-and-learn agent that builds a state graph from frames. It prunes actions that create loops or don't change state. Whenever the score improves, it back-labels that level with distances and retrains a small ResNet18-based value model to rank (state, action) pairs toward the next milestone, then repeats until it wins or hits the action cap. Developer: Will Dick
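Here is a simplified sketch of the state-graph-and-pruning part of that approach; the ResNet18 value model is omitted, and all names are illustrative rather than the competitor's actual code.

```python
# Sketch of a frame-hash state graph with loop/no-op pruning (illustrative only).
import hashlib

def state_id(frame: bytes) -> str:
    """Hash a raw frame into a compact state identifier."""
    return hashlib.sha1(frame).hexdigest()

class StateGraph:
    def __init__(self):
        self.visited: set[str] = set()               # states seen so far
        self.edges: dict[tuple[str, int], str] = {}  # (state, action) -> next state
        self.pruned: set[tuple[str, int]] = set()    # edges not worth retrying

    def record(self, frame: bytes, action: int, next_frame: bytes) -> None:
        s, s_next = state_id(frame), state_id(next_frame)
        loops_back = s_next in self.visited          # landed somewhere already seen?
        self.visited.update({s, s_next})
        self.edges[(s, action)] = s_next
        if s_next == s or loops_back:
            self.pruned.add((s, action))             # no state change, or a loop

    def candidate_actions(self, frame: bytes, all_actions: range) -> list[int]:
        """Actions from this state that haven't been pruned yet."""
        s = state_id(frame)
        return [a for a in all_actions if (s, a) not in self.pruned]
```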
Honorable mentions
Top submissions
| Agent | Team | Type | Score | Games Completed | Levels Completed | Actions | Replays |
|---|---|---|---|---|---|---|---|
| StochasticGoose | Dries Smit (Lead Developer), Jack Cole (Advisor/Reviewer) | Smart Random (CNN) | 12.58% | 2 | 18 | 255,964 | ft09, ls20, vc33 |
| Blind Squirrel | Will Dick | Smart Random (Rules) | 6.71% | 1 | 13 | 109,108 | ft09, ls20, vc33 |
| Explore It Till You Solve It | Evgenii Rudakov (Developer) | Smart Random (Frame Graph) | 3.64% | 0 | 12 | 278,158 | ft09, ls20, vc33 |
| GuidedRandomAgent | Bob | Smart Random (Rules) | 2.24% | 1 | 11 | 39,881 | ft09, ls20, vc33 |
| Fluxonian | Ujjwal Chadha (Engineer), Maya Nguyen (Engineer), Shobhit Singhal (Engineer), Filip Dominas (TPM) | DSL + LLM | 8.04% | 0 | 5 | 11,890 | ft09, ls20, vc33 |
| Play Zero Agent | Dhana Abhiraj (Developer) | Random + LLM Video | 4.37% | 0 | 5 | 7,226 | ft09, ls20, vc33 |
| Tomas Engine | Cristian Valdivia (Developer) - Blog Post Recap | LLM (limited results, crashed often) | 3.70% | 0 | 1 | 79 | ft09, ls20, vc33 |
Notes:
Based on preview feedback, we're shipping several practical improvements today, and more later with the full benchmark release:
The ARC-AGI-3 Preview Agent Competition has wrapped, but the preview is still live!
Three games are now available on three.arcprize.org, and we encourage you to keep building.
There are plenty of ways to get involved with ARC-AGI-3:
Play, build, train, and help us create the future of machine intelligence benchmarking.