The General Intelligence Benchmark
In 2019, François Chollet, creator of Keras (an open-source deep learning library adopted by over 2.5M developers), published the influential paper "On the Measure of Intelligence," introducing the Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI), a benchmark designed to measure fluid intelligence.
Intelligence is measured by the efficiency of skill acquisition on unknown tasks. Put simply: how quickly can you learn new skills?
To make deliberate progress toward more intelligent and human-like systems, we need an appropriate feedback signal; we need to define and evaluate intelligence.
These definitions and evaluations turn into benchmarks used to measure progress toward systems that can think and invent alongside us.
The consensus definition of AGI, "a system that can automate the majority of economically valuable work," is a useful goal, but it is the wrong measure of intelligence.
Measuring task-specific skill is not a good proxy for intelligence.
Skill is heavily influenced by prior knowledge and experience. Unlimited priors or unlimited training data allows developers to "buy" levels of skill for a system. This masks a system's own generalization power.
Intelligence lies in broad or general-purpose abilities; it is marked by skill-acquisition and generalization, rather than skill itself.
Here's a better definition for AGI:
AGI is a system that can efficiently acquire new skills outside of its training data.
More formally:
The intelligence of a system is a measure of its skill-acquisition efficiency over a scope of tasks, with respect to priors, experience, and generalization difficulty.
- François Chollet, "On the Measure of Intelligence"
This means that a system is able to adapt to new problems it has not seen before and that its creators (developers) did not anticipate.
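Read schematically (Chollet's paper makes this precise using Algorithmic Information Theory; the sketch below is only an illustrative simplification, not his exact formalism):

```latex
% Illustrative simplification of the definition above, not the paper's
% exact AIT-based formalism: skill-acquisition efficiency averaged over
% a scope of tasks.
\text{Intelligence} \;\propto\;
\mathbb{E}_{\,\text{task} \in \text{scope}}
\left[
  \frac{\text{skill attained} \times \text{generalization difficulty}}
       {\text{priors} + \text{experience}}
\right]
```

Holding skill fixed, harder generalization raises the score, while heavier reliance on priors or experience lowers it.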
ARC-AGI is the only AI benchmark that measures our progress towards general intelligence.
A principle underlying ARC-AGI's design is the need to create a fair and meaningful comparison between artificial intelligence and human intelligence. To achieve this, ARC-AGI focuses on fluid intelligence (the ability to reason, solve novel problems, and adapt to new situations) rather than crystallized intelligence, which relies on accumulated knowledge and skills. This distinction is critical because crystallized intelligence, by definition, includes cultural knowledge and learned information, which would provide an unfair advantage.
ARC-AGI avoids this by restricting itself to core knowledge priors, those cognitive building blocks that are either present at birth or acquired very early in human development with minimal explicit instruction, as described by the Core Knowledge theory (Elizabeth Spelke).
The rationale for using core knowledge priors is twofold. First, it isolates the capacity for generalization – the ability to take limited information and apply it to new, unseen instances. By limiting the "input" to universally accessible cognitive primitives, ARC-AGI forces the test-taker (human or AI) to demonstrate genuine problem-solving ability rather than rely on pre-existing, domain-specific knowledge.
If a benchmark included, for instance, tasks involving written English, it would immediately disadvantage any AI that hadn't been extensively pre-trained on vast text corpora. It would also disadvantage humans who do not know English. English, or any language, is a cultural artifact, not a measure of inherent cognitive ability. The same is true for any knowledge related to specific human cultures or practices.
Second, the restriction to core knowledge priors allows for a more accurate assessment of the efficiency of intelligence. The core idea is that the more intelligent entity should be the one that uses its resources most efficiently to acquire skill at a given task.
The resources can come in two forms: priors, the knowledge a system possesses before it ever encounters the task, and experience, the data or practice it consumes while learning the task. That is to say, intelligence is the rate at which a learner turns its experience and priors into new skills at valuable tasks that involve uncertainty and adaptation.
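A toy sketch of that idea in Python, with invented names and an arbitrary ratio (Chollet's actual formalism is information-theoretic, not this arithmetic):

```python
# Toy sketch only: the function name, the bit-count framing, and the
# simple ratio are assumptions for illustration. Chollet's formal
# definition uses Algorithmic Information Theory, not this arithmetic.

def skill_acquisition_efficiency(skill: float,
                                 priors_bits: float,
                                 experience_bits: float) -> float:
    """Toy score: skill attained per unit of priors and experience consumed."""
    resources = priors_bits + experience_bits
    if resources <= 0:
        raise ValueError("resources must be positive")
    return skill / resources

# Two systems reach the same 90% skill on a task; the one that needed
# fewer priors and less experience counts as more intelligent.
print(skill_acquisition_efficiency(0.9, priors_bits=10, experience_bits=50))    # 0.015
print(skill_acquisition_efficiency(0.9, priors_bits=100, experience_bits=500))  # 0.0015
```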
If an AI system has access to extensive, task-specific prior knowledge that is not available to a human, its performance on that task becomes a measure of the developer's cleverness in encoding that knowledge, not the AI's inherent intelligence. By focusing solely on universally shared, innate priors, ARC-AGI ensures that success truly reflects the system's ability to learn and generalize, placing the AI and human on a comparable footing.
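For concreteness: ARC-AGI tasks are distributed as JSON, each containing a few demonstration input/output pairs ("train") and one or more test inputs, where every grid is a 2-D array of integers 0-9 standing for colors. The tiny task below follows that schema, but its grids and its rule are invented for illustration:

```python
# A minimal ARC-style task in the dataset's JSON structure. The grids and
# the hidden rule (recolor every 1 to 2) are invented for illustration.
task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[0, 2], [2, 0]]},
        {"input": [[1, 1], [0, 0]], "output": [[2, 2], [0, 0]]},
    ],
    "test": [
        {"input": [[0, 0], [0, 1]]},  # expected output: [[0, 0], [0, 2]]
    ],
}

def apply_inferred_rule(grid):
    """The rule a solver would have to infer from the train pairs alone."""
    return [[2 if cell == 1 else cell for cell in row] for row in grid]

# Success on ARC-AGI means producing the exact output grid for each test input.
assert apply_inferred_rule(task["test"][0]["input"]) == [[0, 0], [0, 2]]
```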
At the core of ARC-AGI's benchmark design is the principle of "Easy for Humans, Hard for AI."
The human brain is our only existence proof of general intelligence. Identifying the characteristics of that intelligence is a valuable direction for benchmarking AI, because it directly targets what distinguishes general intelligence from narrow skill.
Many AI benchmarks measure performance on tasks that require extensive training or specialized knowledge (PhD++ problems). ARC Prize focuses instead on tasks that humans solve effortlessly yet AI finds challenging, tasks that highlight fundamental gaps in AI's reasoning and adaptability.
This approach reveals the essential qualities of intelligence that current AI systems struggle to replicate: the ability to generalize from limited examples, interpret symbolic meaning, and flexibly apply rules in varying contexts.
By emphasizing these human-intuitive tasks, we not only measure progress more clearly but also inspire researchers to pursue genuinely novel ideas, moving beyond incremental improvements toward meaningful breakthroughs.
2019 - ARC-AGI was introduced in François Chollet's 2019 paper, "On the Measure of Intelligence". At the time, François hypothesized that it could not easily be beaten.
2020 - To test this, he hosted the first ARC-AGI competition on Kaggle in 2020. The winning team, "ice cuber," achieved a 21% success rate on the test set. This low score was the first strong evidence that the ideas in "On the Measure of Intelligence" were correct.
2022 - In 2022, François and Lab42 teamed up to host ARCathon 2022, the first global AI competition to try to beat ARC-AGI. 118 teams from 47 countries participated. Michael Hodel won the ARCathon, receiving his trophy at the Swiss Global AI Awards in Davos following the honoring of Demis Hassabis by Pascal Kaufmann, founder of Lab42. Michael went on to develop one of the best ARC-AGI domain-specific languages (DSLs) to date.
2023 - The competition continued with ARCathon 2023. This time, 265+ teams from 65 countries competed. First place was shared between Somayyeh Gholami and Mehran Kazeminia (Team SM) and Jack Cole (Team MindsAI), both reaching 30% on the private evaluation set.
2024 - In 2024, Mike Knoop and François teamed up to create ARC Prize 2024. While the Grand Prize remained unclaimed, over $125K in prizes was awarded to teams with top scores and papers. The year ended with a top score on the private evaluation set of 53%. See the results.
2025 - ARC Prize grew into a non-profit foundation to foster its mission of guiding open-source AGI research. Built on ARC-AGI-2, the ARC Prize 2025 Kaggle competition launched with a prize pool of over $725K.