ARC-AGI

About the Benchmark

In 2019, François Chollet, creator of Keras (an open-source deep learning library adopted by over 2.5M developers), published "On the Measure of Intelligence," in which he introduced the "Abstraction and Reasoning Corpus for Artificial General Intelligence" (ARC-AGI) benchmark to measure intelligence.

Intelligence is measured by the efficiency of skill acquisition on unknown tasks. Simply put: how quickly can you learn new skills?

ARC-AGI Design

Defining AGI

To make deliberate progress towards more intelligent and human-like systems, we need to follow an appropriate feedback signal. We need to define and evaluate intelligence.

These definitions and evaluations turn into benchmarks used to measure progress toward systems that can think and invent alongside us.

The consensus definition of AGI, "a system that can automate the majority of economically valuable work," while a useful goal, is an incorrect measure of intelligence.

Measuring task-specific skill is not a good proxy for intelligence.

Skill is heavily influenced by prior knowledge and experience. Unlimited priors or unlimited training data allows developers to "buy" levels of skill for a system. This masks a system's own generalization power.

Intelligence lies in broad or general-purpose abilities; it is marked by skill-acquisition and generalization, rather than skill itself.

Here's a better definition for AGI:

AGI is a system that can efficiently acquire new skills outside of its training data.

More formally:

The intelligence of a system is a measure of its skill-acquisition efficiency over a scope of tasks, with respect to priors, experience, and generalization difficulty.

- François Chollet, "On the Measure of Intelligence"

This means that a system is able to adapt to new problems it has not seen before and that its creators (developers) did not anticipate.
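
Read loosely, and only as a schematic paraphrase rather than the precise algorithmic-information-theoretic formalism in the paper, the definition says that intelligence is the skill a system acquires across a scope of tasks, weighted by how hard those tasks are to generalize to, per unit of priors and experience consumed:

```latex
\text{Intelligence} \;\propto\;
\frac{\text{skill acquired} \times \text{generalization difficulty}}
     {\text{priors} + \text{experience}}
\quad \text{averaged over the scope of tasks}
```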

ARC-AGI is the only AI benchmark that measures our progress towards general intelligence.

Core Knowledge Priors

A principle underlying ARC-AGI's design is the need to create a fair and meaningful comparison between artificial intelligence and human intelligence. To achieve this, ARC-AGI focuses on fluid intelligence (the ability to reason, solve novel problems, and adapt to new situations) rather than crystallized intelligence, which relies on accumulated knowledge and skills. This distinction is critical because crystallized intelligence, by definition, includes cultural knowledge and learned information, which would provide an unfair advantage.

ARC-AGI avoids this by restricting itself to core knowledge priors, those cognitive building blocks that are either present at birth or acquired very early in human development with minimal explicit instruction, as described by the Core Knowledge theory (Elizabeth Spelke).

The rationale for using core knowledge priors is twofold. First, it isolates the capacity for generalization – the ability to take limited information and apply it to new, unseen instances. By limiting the "input" to universally accessible cognitive primitives, ARC-AGI forces the test-taker (human or AI) to demonstrate genuine problem-solving ability rather than rely on pre-existing, domain-specific knowledge.

If a benchmark included, for instance, tasks involving written English, it would immediately disadvantage any AI that hadn't been extensively pre-trained on vast text corpora. It would also disadvantage humans who do not know English. English, or any language, is a cultural artifact, not a measure of inherent cognitive ability. The same is true for any knowledge tied to specific human cultures or practices.

Secondly, the restriction to core knowledge priors allows for a more accurate assessment of the efficiency of intelligence. The core idea is that the more intelligent entity should be the most efficient at using its resources to acquire a given task.

The resources can come in two forms:

  1. Prior Knowledge: the knowledge about the task domain that an entity brings to a task before encountering it.
  2. Experience: the amount of novel, relevant information the agent accrues about the task while working on it.

That is to say, intelligence is the rate at which a learner turns its experience and priors into new skills at valuable tasks that involve uncertainty and adaptation.

If an AI system has access to extensive, task-specific prior knowledge that is not available to a human, its performance on that task becomes a measure of the developer's cleverness in encoding that knowledge, not the AI's inherent intelligence. By focusing solely on universally shared, innate priors, ARC-AGI ensures that success truly reflects the system's ability to learn and generalize, placing the AI and human on a comparable footing.

Easy for Humans, Hard for AI

At the core of ARC-AGI benchmark design is the principle of "Easy for Humans, Hard for AI."

The human brain is our only existence proof of general intelligence. Identifying the characteristics that make it intelligent is a valuable direction for benchmarking AI, because it directly targets what distinguishes general intelligence from narrow skill.

Many AI benchmarks measure performance on tasks that require extensive training or specialized knowledge (PhD++ problems). ARC Prize focuses instead on tasks that humans solve effortlessly yet AI finds challenging, highlighting fundamental gaps in AI's reasoning and adaptability.

This approach reveals the essential qualities of intelligence, such as the ability to generalize from limited examples, interpret symbolic meaning, and flexibly apply rules in varying contexts, that current AI systems struggle to replicate.

By emphasizing these human-intuitive tasks, we not only measure progress more clearly but also inspire researchers to pursue genuinely novel ideas, moving beyond incremental improvements toward meaningful breakthroughs.

Timeline

2019 - ARC-AGI was introduced in François Chollet's 2019 paper, "On the Measure of Intelligence". At the time, François hypothesized that it could not easily be beaten.

2020 - To test this hypothesis, he hosted the first ARC-AGI competition on Kaggle in 2020. The winning team, "ice cuber," achieved a 21% success rate on the test set. This low score was the first strong evidence that the ideas in "On the Measure of Intelligence" were correct.

2022 - In 2022, François and Lab42 teamed up to host ARCathon 2022, the first global AI competition to try to beat ARC-AGI. 118 teams from 47 countries participated. Michael Hodel won the ARCathon and received his trophy at the Swiss Global AI Awards in Davos, following the honoring of Demis Hassabis by Pascal Kaufmann, founder of Lab42. Michael has developed one of the best ARC-AGI domain-specific languages (DSLs) to date.

2023 - The competition continued with ARCathon 2023. This time, 265+ teams from 65 countries competed. First place was shared between Somayyeh Gholami and Mehran Kazeminia (Team SM) and Jack Cole (Team MindsAI), both reaching 30% on the private evaluation set.

2024 - In 2024, Mike Knoop and François teamed up to create ARC Prize 2024. While the Grand Prize went unclaimed, over $125K in prizes was awarded to teams with top scores and papers. The year ended with a top score of 53% on the private evaluation set. See the results.

2025 - ARC Prize grew into the ARC Prize Foundation. Built on ARC-AGI-2, the ARC Prize 2025 Kaggle competition launched with a prize pool of over $725K.

ARC-AGI-1

Official Dataset · Solve ARC-AGI-1 Tasks

The Abstraction and Reasoning Corpus (ARC-AGI-1) was introduced by François Chollet in his 2019 paper, On the Measure of Intelligence. Chollet, a prominent Google AI researcher and creator of the deep learning library Keras, developed ARC-AGI-1 specifically as a novel benchmark designed to test machine reasoning and general problem-solving skills.

ARC-AGI-1 Task (#3aa6fb7a)

At the time of its launch, there was growing recognition that deep learning methods excelled in narrow, specialized tasks but fell short in demonstrating human-like generalization. ARC-AGI-1 was a direct response to this gap, aimed at evaluating AI's capability to handle novel, unforeseen problems: situations it had not been explicitly trained on. For further reading on this, see the ARC Prize 2024 Technical Report.

Motivated by the need for a true measure of AGI, ARC-AGI-1 functions as an "AGI yardstick," focusing on benchmarking the skill-acquisition capability (the fundamental core of intelligence) rather than performance on any single, predefined task. It specifically assesses how efficiently an AI can learn and generalize from minimal information, reflecting a fundamental characteristic of human intelligence.

ARC-AGI-1 consists of 800 public puzzle-like tasks (plus private evaluation sets), designed as grid-based visual reasoning problems. These tasks, trivial for humans but challenging for machines, typically provide only a small number of example input-output pairs (usually around three), requiring the test taker (human or AI) to deduce the underlying rule through abstraction, inference, and prior knowledge rather than brute force or extensive training.
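
For illustration, each ARC-AGI task is distributed as a JSON file containing a few "train" input/output grid pairs and one or more "test" pairs, where a grid is a list of lists of integers 0-9 (colors). A minimal sketch of loading and inspecting a task, with an illustrative file path:

```python
import json

def load_task(path):
    """Load one ARC task JSON file of the form {"train": [...], "test": [...]},
    where each item holds an "input" grid and an "output" grid."""
    with open(path) as f:
        return json.load(f)

def grid_size(grid):
    """Return (rows, cols) for a grid (a list of lists of ints 0-9)."""
    return len(grid), len(grid[0])

# The path below is illustrative; point it at wherever you keep the dataset.
task = load_task("data/training/3aa6fb7a.json")
for i, pair in enumerate(task["train"]):
    print(f"demo pair {i}: input {grid_size(pair['input'])} -> output {grid_size(pair['output'])}")
print(f"test inputs to solve: {len(task['test'])}")
```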

Dataset | Tasks | Description
Training Set | 400 tasks | A public playground dedicated to training your system.
Public Eval Set | 400 tasks | Used to evaluate your final algorithm.
Semi-Private Eval Set | 100 tasks | Introduced in mid-2024; 100 hand-selected tasks used as a semi-private hold-out set when testing closed-source models.
Private Eval Set | 100 tasks | The basis of the ARC Prize competition; determined the final leaderboard in 2020, 2022, 2023, and 2024.

From its introduction in 2019 until late 2024, ARC-AGI-1 remained unsolved by AI systems, maintaining its reputation as one of the toughest benchmarks available for general intelligence. The fact that it stayed unbeaten for so long highlights the significant gap between human and AI reasoning capabilities.

In December 2024, OpenAI featured ARC-AGI-1 as the leading benchmark to measure the performance of their experimental "o3" model. o3 at low compute scored 75.7% on ARC-AGI-1 and reached 87% accuracy with higher compute. This marked the first effective solution of the ARC challenge in over five years.

This achievement represented a step-change in AI's generalization abilities, validating the ARC benchmark's effectiveness in measuring meaningful progress toward AGI. The solving of ARC-AGI-1 triggered renewed interest in benchmarks like ARC-AGI-2, designed to further challenge AI and advance research toward genuine human-level intelligence.

See a long list of resources on ARC-AGI-1.

ARC-AGI-2

Official Dataset · Solve ARC-AGI-2 Tasks · ARC-AGI-2 Launch Deck

ARC-AGI-2 was launched on March 24, 2025. This second edition in the ARC-AGI series raises the bar for difficulty for AI while maintaining the same relative ease for humans. It is a compass pointing towards useful research directions, a playground for testing few-shot reasoning architectures, and a tool to accelerate progress towards AGI. It is not an indicator of whether we have achieved AGI.

All (100%) ARC-AGI-2 tasks were solved by at least 2 humans (many were solved by more) in two attempts or fewer, in a controlled study with hundreds of human participants. This matches the rules we hold for AI, which gets two attempts per task.
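
As a sketch of how this pass@2 scoring rule works (exact grid match, up to two attempts, no partial credit), assuming a hypothetical solver that returns candidate output grids:

```python
def exact_match(pred, expected):
    """ARC scoring uses exact grid equality: every cell must match."""
    return pred == expected

def solved_pass_at_2(attempts, expected):
    """A test output counts as solved if either of the (up to) two
    submitted grids matches the expected grid exactly."""
    return any(exact_match(p, expected) for p in attempts[:2])

def task_solved(predictions, task):
    """predictions[i] holds the candidate grids for the i-th test input.
    Here a task counts as solved only if every test output is matched;
    official competition scoring may apportion credit per test output."""
    return all(
        solved_pass_at_2(predictions[i], pair["output"])
        for i, pair in enumerate(task["test"])
    )
```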

While designing ARC-AGI-2, we studied the properties of frontier AI reasoning systems. Here are some example tasks that illustrate what we discovered. All of the following tasks are part of ARC-AGI-2 and were (1) solved by at least 2 humans in two attempts or fewer and (2) unsolved by any frontier AI reasoning system.

Symbolic Interpretation

We found that frontier AI reasoning systems struggle with tasks requiring symbols to be interpreted as having meaning beyond their visual patterns. Systems attempted symmetry checking, mirroring, and other transforms, and even recognized connecting elements, but failed to assign semantic significance to the symbols themselves.

Example of symbolic interpretation, ARC-AGI-2 Public Eval Task #e3721c99 (try this task)
Compositional Reasoning

We found these systems struggle with tasks requiring the simultaneous application of multiple rules, or of rules that interact with each other. In contrast, if a task has only one or very few global rules, we found AI reasoning systems can consistently discover and apply them.

Example of compositional reasoning, ARC-AGI-2 Public Eval Task #cbebaa4b (try this task)
Contextual Rule Application

We found systems struggle with tasks where rules must be applied differently based on context. Systems will fixate on superficial patterns rather than understanding the underlying selection principles.

Example of contextual rule application, ARC-AGI-2 Public Eval Task #b5ca7ac4 (try this task)

Dataset Structure:

Dataset | Tasks | Description
Training Set | 1000 tasks | Uncalibrated, public. A spectrum of difficulty ranging from very easy to very difficult for both humans and AI, designed to expose and teach Core Knowledge Priors. Use it to train your systems.
Public Eval Set | 120 tasks | Calibrated, public. All tasks solved pass@2 by at least two humans. Use it to test your systems.
Semi-Private Eval Set | 120 tasks | Calibrated, not public. All tasks solved pass@2 by at least two humans. Used for the Kaggle live contest leaderboard and the ARC Prize leaderboard. "Semi" means these tasks may have been exposed to limited third parties, e.g. via API.
Private Eval Set | 120 tasks | Calibrated, not public. All tasks solved pass@2 by at least two humans. Used for the Kaggle final contest leaderboard. "Private" means these tasks have not been exposed to third parties.

Calibration

The eval sets (Public, Semi-Private, Private) are "calibrated," meaning the tasks are statistically similar (IID). Scores are therefore comparable across these sets (less than 1 percentage point of expected difference), assuming no overfitting. Calibration was done via controlled human testing (400+ participants) and existing AI testing.

To ensure calibration of human-facing difficulty, we conducted a live study in San Diego in early 2025 involving over 400 members of the general public. Participants were tested on ARC-AGI-2 candidate tasks, allowing us to identify which problems could be consistently solved by at least two individuals within two or fewer attempts. This first-party data provides a solid benchmark for human performance and will be published alongside the ARC-AGI-2 paper.

100% of tasks have been solved by at least 2 humans (many by more) in two attempts or fewer. The average test-taker score was 60%.

Efficiency Measurement:

Starting with ARC-AGI-2, all ARC-AGI reporting comes with an efficiency metric. We started with cost because it is the most directly comparable measure between human and AI performance.

Intelligence is not solely defined by the ability to solve problems or achieve high scores. The efficiency with which those capabilities are acquired and deployed is a crucial, defining component. The core question being asked is not just "can AI acquire skill to solve a task?", but also at what efficiency or cost?

We know that brute-force search could eventually solve ARC-AGI given unlimited resources and time, but this would not represent true intelligence. Intelligence is about finding the solution efficiently, not exhaustively.

This focus on efficiency is a core principle behind ARC-AGI. We now explicitly quantify the cost of intelligence, requiring solutions to demonstrate not just capability, but also the efficient use of resources that defines general intelligence.
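
As a hedged sketch of how an efficiency-aware result might be summarized, reporting accuracy alongside cost per task (the field names here are illustrative, not an official reporting schema):

```python
def summarize_run(results):
    """Summarize an evaluation run.

    `results` holds one entry per task, e.g. {"solved": bool, "cost_usd": float};
    the field names are illustrative, not an official schema.
    """
    n = len(results)
    score = sum(r["solved"] for r in results) / n
    total_cost = sum(r["cost_usd"] for r in results)
    return {
        "score": score,                       # fraction of tasks solved
        "cost_per_task_usd": total_cost / n,  # the efficiency metric
        "total_cost_usd": total_cost,
    }

# Hypothetical run: 73 of 120 tasks solved at $2.10 per task attempted.
run = [{"solved": i < 73, "cost_usd": 2.10} for i in range(120)]
print(summarize_run(run))
```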

ARC-AGI-2 changelog:

  • All eval sets (public, semi-private, private) now contain 120 tasks (up from 100)
  • Removed tasks from eval sets that were susceptible to brute-force search (all tasks solved in the original 2020 Kaggle contest)
  • Performed controlled human testing to calibrate eval set difficulty, ensure the sets are IID, and verify pass@2 solvability by at least 2 humans (to match the rules for AI)
  • Designed new tasks to challenge AI reasoning systems based on our study (symbolic interpretation, compositional reasoning, contextual rule application, and more)

For more information, read the ARC-AGI-2 launch post.
