AGI progress has stalled.
New ideas are needed.

ARC Prize Guide

Welcome to the official ARC Prize Guide!

This guide is designed to help you get up to speed on ARC-AGI, establish your strategy, form a team (optional), and make progress toward winning prizes and eternal glory.

No matter who you are, where you come from, or what you do for a living, you are welcome to join this competition. You can join forces with people who have complementary skill sets or work alone. It's all up to you.

Let's get started!

Overview

Before you dive into this guide, make sure to read the Competition Details and ARC-AGI pages.

The purpose of ARC Prize is to redirect more AI research focus toward architectures that might lead toward artificial general intelligence (AGI) and ensure that notable breakthroughs do not remain a trade secret at a big corporate AI lab.

ARC-AGI is the only AI benchmark that tests for general intelligence by testing not just for skill, but for skill acquisition.

Your ambitious goal: Submit a solution which scores 85% on the ARC-AGI private evaluation set and win $500K.

You are not alone in this goal. The ARC Prize community and official team are here to provide support and resources.

Plug into the community & get competition updates:

You can also reach our team at team@arcprize.org or message us on Discord.

Data Structure

So you want to solve ARC-AGI? Let's start by exploring how its data is structured.

This material is also covered in the Explore ARC-AGI Data + Play tutorial video.

Tasks

Each ARC-AGI task consists of three to five demonstration input/output pairs followed by a final test pair with only the input shown. Each task tests the utilization of a specific learned skill based on a minimal number of cognitive priors.

ARC-AGI task

Tasks are stored as JSON, with each grid represented as a list of rows of integers. These grids can also be shown visually as a grid of colors using an ARC-AGI task viewer.

A successful submission is a pixel-perfect reproduction (color and position) of the expected output for the final test input.

Task Data

The following datasets are associated with the ARC Prize competition:

Public training set
Public evaluation set
Private evaluation set

Public

The publicly available data is to be used for training and evaluation.

The public training set contains 400 task files you will use to train your algorithm.

The public evaluation set contains 400 task files for testing the performance of your algorithm.

To ensure fair evaluation results, be sure not to leak information from the evaluation set into your algorithm (e.g., by looking at the tasks in the evaluation set yourself during development, or by repeatedly modifying an algorithm while using its evaluation score as feedback).

The source of truth for this data is available on François Chollet's ARC-AGI Repository, which contains 800 total tasks.

Private

The private evaluation set contains 100 task files.

The ARC-AGI leaderboard is measured using 100 private evaluation tasks which are privately held on Kaggle. These tasks are private to ensure models may not be trained on them. These tasks are not included in the public tasks, but they do use the same structure and cognitive priors.

Please note that the public training set consists of simpler tasks, whereas the public evaluation set is roughly the same level of difficulty as the private evaluation set.

Set Difficulty

The public training set is significantly easier than the others (public evaluation and private evaluation set) since it contains many "curriculum" type tasks intended to demonstrate Core Knowledge systems. It's like a tutorial level.

The public evaluation set and the private evaluation set are intended to be of the same difficulty.

Format

As mentioned above, tasks are stored in JSON format. Each JSON file consists of two key-value pairs.

train: a list of three to five input/output pairs. These are used for your algorithm to infer a rule.

test: a list of input/output pairs, usually just one. Your model should apply the rule inferred from the train pairs to each test input and construct an output solution. You will have access to the test output on the public data. The test outputs of the private evaluation set will not be revealed.

Here is an example of a simple ARC-AGI task that has three training pairs along with a single test pair. Each grid in a pair is 2x2. There are four colors represented by the integers 1, 4, 6, and 8, with 0 filling the remaining cells. Which actual color (red/green/blue/black) is applied to each integer is arbitrary and up to you.

{
  "train": [
    {"input": [[1, 0], [0, 0]], "output": [[1, 1], [1, 1]]},
    {"input": [[0, 0], [4, 0]], "output": [[4, 4], [4, 4]]},
    {"input": [[0, 0], [6, 0]], "output": [[6, 6], [6, 6]]}
  ],
  "test": [
    {"input": [[0, 0], [0, 8]], "output": [[8, 8], [8, 8]]}
  ]
}
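As a rough illustration of the format, the sketch below parses the example task above with Python's standard json module and reports how many pairs it contains and the shape of each grid. The helper names grid_shape and summarize_task are hypothetical and not part of any official tooling.

```python
import json

# The example task from above, inlined as a string so the sketch runs on its own.
TASK_JSON = """
{
  "train": [
    {"input": [[1, 0], [0, 0]], "output": [[1, 1], [1, 1]]},
    {"input": [[0, 0], [4, 0]], "output": [[4, 4], [4, 4]]},
    {"input": [[0, 0], [6, 0]], "output": [[6, 6], [6, 6]]}
  ],
  "test": [
    {"input": [[0, 0], [0, 8]], "output": [[8, 8], [8, 8]]}
  ]
}
"""


def grid_shape(grid):
    """Return (rows, columns) for a grid stored as a list of rows."""
    return (len(grid), len(grid[0]) if grid else 0)


def summarize_task(task):
    """Hypothetical helper: count the pairs and record the grid shapes of one task."""
    return {
        "num_train": len(task["train"]),
        "num_test": len(task["test"]),
        "train_shapes": [
            (grid_shape(pair["input"]), grid_shape(pair["output"]))
            for pair in task["train"]
        ],
    }


task = json.loads(TASK_JSON)
summary = summarize_task(task)
print(summary)
```

In practice you would read each task file from disk with json.load instead of parsing an inlined string.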

Development

Download

Download the ARC-AGI-1 data from the official ARC-AGI repo on GitHub.

Please note that there will be future versions of the ARC-AGI dataset as it matures. We plan to keep improving the benchmark by adding more tasks and more novelty. Future releases will be versioned (e.g., ARC-AGI-2), and future ARC Prize competitions may target newer versions.

View

There are multiple ways for humans to view the data:

Testing interface on the official repo (instructions)
The arcprize.org task viewer
Community-created apps

Test

There are two ways to measure your progress on ARC-AGI tasks.

Correct / Incorrect: This evaluation method measures whether or not your model’s output answer is identical to the validated solution. This means that the output shape, colors, and positions match. This evaluation method is used on the ARC-AGI private evaluation set.
Pixel correctness: The percentage of pixels in the output that are correctly predicted. Some teams track pixel correctness as an additional indicator alongside their score. Though it is not used in the competition, it can show how close an incorrect answer is to the solution.
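The two measures above can be sketched in a few lines of Python. The function names are hypothetical, and returning 0.0 when the predicted grid has the wrong shape is an assumption made for this sketch, since the guide does not say how pixel correctness should treat mismatched shapes.

```python
def exact_match(predicted, expected):
    """Correct / Incorrect: True only if shape, colors, and positions all match."""
    return predicted == expected


def pixel_correctness(predicted, expected):
    """Fraction of cells that match the expected grid.

    Assumption for this sketch: a prediction with the wrong shape scores 0.0.
    """
    if len(predicted) != len(expected):
        return 0.0
    if any(len(p_row) != len(e_row) for p_row, e_row in zip(predicted, expected)):
        return 0.0
    total = sum(len(row) for row in expected)
    if total == 0:
        return 0.0
    correct = sum(
        p_cell == e_cell
        for p_row, e_row in zip(predicted, expected)
        for p_cell, e_cell in zip(p_row, e_row)
    )
    return correct / total


expected_grid = [[8, 8], [8, 8]]
predicted_grid = [[8, 8], [8, 0]]  # one wrong cell out of four

print(exact_match(predicted_grid, expected_grid))
print(pixel_correctness(predicted_grid, expected_grid))
```

Here the prediction fails the Correct / Incorrect check even though three of its four cells are right, which is the extra information pixel correctness provides.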

Approaches

You're free to explore any path you like, but we'd love to save you time by catching you up on the five solution approaches that have led to the current state of the art. Join the community Discord to find out more from people who have been working on ARC-AGI for years.

1. Discrete program search

This was the first domain of solutions that started working well in the original ARCathon competition in 2020 hosted by Lab42. It involves searching through a massive program space in a discrete, step-by-step manner.

2. Ensemble Solutions

This approach consists of piecing together existing publicly available solutions to correctly answer more tasks than any solution achieved alone. This is the approach that was used to get to the current high score.

One thing to consider in utilizing this approach: it's unlikely that an ensemble approach will be able to generalize to correctly solve tasks outside of the public datasets. If you've got your eyes on the Grand Prize, you'll want to create new and novel techniques.
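One simple way to combine several solvers is to let them vote on each test output and submit the two most common answers. The sketch below assumes this voting scheme purely for illustration; it is not a description of how any particular leaderboard entry combines its components, and pick_two_attempts is a hypothetical name.

```python
import json
from collections import Counter


def pick_two_attempts(candidate_grids):
    """Pick the two most frequently proposed grids as the two attempts.

    candidate_grids holds one predicted grid per solver; None means that
    solver produced no answer. Grids are serialized to JSON strings so
    they can be counted.
    """
    counts = Counter(json.dumps(grid) for grid in candidate_grids if grid is not None)
    ranked = [json.loads(serialized) for serialized, _ in counts.most_common(2)]
    if not ranked:
        return None, None
    if len(ranked) == 1:
        # Only one distinct answer was proposed, so use it for both attempts.
        ranked.append(ranked[0])
    return ranked[0], ranked[1]


# Outputs proposed by six hypothetical solvers for one test input.
candidates = [[[1]], [[2]], [[1]], [[2]], [[1]], [[3]]]
attempt_1, attempt_2 = pick_two_attempts(candidates)
print(attempt_1, attempt_2)
```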

3. Direct LLM Prompting

In this method, contestants use a traditional LLM (like GPT-4) and rely on prompting techniques to solve ARC-AGI tasks. This was found to perform poorly, scoring <5%. Fine-tuning a state-of-the-art (SOTA) LLM with millions of synthetic ARC-AGI examples scores ~10%.

"LLMs like Gemini or ChatGPT [don't work] because they're basically frozen at inference time. They're not actually learning anything." - François Chollet

Additionally, keep in mind that submissions to Kaggle will not have access to the internet. Using a 3rd-party, cloud-hosted LLM is not possible.

See the templates for fine-tuning Llama 3 8B and for using an open source LLM without fine-tuning it.

4. Domain-Specific Language (DSL) Program Synthesis

This approach involves developing a domain-specific language (DSL). The DSL is designed to encapsulate common concepts such as rotation, mirroring, and other grid transformations that frequently occur in ARC tasks. By defining a set of primitives or basic functions that perform these transformations, solutions can be synthesized by composing these primitives into programs that solve specific tasks.

Program synthesis in this approach involves searching through possible compositions of the DSL primitives to find programs that correctly transform input grids into their corresponding output grids. This search can be brute-force or more sophisticated, but the key idea is to leverage the DSL to build task-specific programs efficiently.

See Michael Hodel's example notebook with this approach.
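To make the idea concrete, here is a minimal sketch of a DSL with four grid primitives and a brute-force search over their compositions. The primitive set, the function names, and the depth limit are all assumptions chosen for illustration; a practical DSL would be far larger, and this is not the design used in the notebook mentioned above.

```python
from itertools import product


def rotate90(grid):
    """Rotate a grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def flip_horizontal(grid):
    """Mirror a grid left to right."""
    return [row[::-1] for row in grid]


def flip_vertical(grid):
    """Mirror a grid top to bottom."""
    return [list(row) for row in grid[::-1]]


def transpose(grid):
    """Swap rows and columns."""
    return [list(row) for row in zip(*grid)]


# A toy DSL: each primitive maps a grid to a grid.
PRIMITIVES = {
    "rotate90": rotate90,
    "flip_horizontal": flip_horizontal,
    "flip_vertical": flip_vertical,
    "transpose": transpose,
}


def run_program(program, grid):
    """Apply a sequence of primitive names to a grid, left to right."""
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid


def search_program(train_pairs, max_depth=3):
    """Return the first composition that reproduces every training output, or None."""
    for depth in range(1, max_depth + 1):
        for program in product(PRIMITIVES, repeat=depth):
            if all(run_program(program, p["input"]) == p["output"] for p in train_pairs):
                return program
    return None


# Toy task whose hidden rule is a 180 degree rotation.
train_pairs = [
    {"input": [[1, 2], [3, 4]], "output": [[4, 3], [2, 1]]},
    {"input": [[0, 5], [0, 0]], "output": [[0, 0], [5, 0]]},
]
program = search_program(train_pairs)
print(program)
print(run_program(program, [[7, 0], [0, 0]]))
```

The number of candidate programs grows exponentially with depth, which is why larger DSLs need pruning or guidance rather than plain enumeration.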

5. Active inference

More recently, solutions using pre-trained large language models (LLMs) have been attempted. The LLMs are additionally trained on code data and ARC-AGI data. Because there aren't enough ARC-AGI tasks, this training data is augmented with synthetic ARC-AGI-like data.

The trick to making this LLM-based solution work is active inference: when you're presented with a test task's demonstration examples, you fine-tune the LLM on those examples. Because there are only a few of them, you'll need to expand them artificially to have enough data points to fit your curve.

This unlocks the performance that we see with top solutions. Jack Cole's 34% solution utilizes this approach.

“The fact that this technique has an outsized impact is really interesting” - François Chollet
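The guide does not say how the demonstration examples are expanded. One plausible option, assumed here for illustration only, is to apply the same rotation, mirroring, or color relabeling to both grids of a pair. The sketch further assumes that colors are the integers 0 through 9 and that 0 should stay fixed. These transformations only preserve the task's rule for some tasks, so treat the result as candidate fine-tuning data rather than guaranteed valid examples.

```python
import random


def rotate90(grid):
    """Rotate a grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def flip_horizontal(grid):
    """Mirror a grid left to right."""
    return [row[::-1] for row in grid]


def recolor(grid, mapping):
    """Replace every cell value according to a color mapping."""
    return [[mapping[cell] for cell in row] for row in grid]


def augment_pairs(pairs, color_variants=2, seed=0):
    """Hypothetical augmentation: 8 geometric variants per pair, plus recolorings."""
    rng = random.Random(seed)
    geometric = []
    for pair in pairs:
        for mirrored in (False, True):
            grid_in, grid_out = pair["input"], pair["output"]
            if mirrored:
                grid_in, grid_out = flip_horizontal(grid_in), flip_horizontal(grid_out)
            for _ in range(4):  # the four rotations of this orientation
                geometric.append({"input": grid_in, "output": grid_out})
                grid_in, grid_out = rotate90(grid_in), rotate90(grid_out)

    recolored = []
    for pair in geometric:
        for _ in range(color_variants):
            shuffled = list(range(1, 10))
            rng.shuffle(shuffled)
            mapping = dict(zip(range(1, 10), shuffled))
            mapping[0] = 0  # assumption: keep 0 unchanged
            recolored.append({
                "input": recolor(pair["input"], mapping),
                "output": recolor(pair["output"], mapping),
            })
    return geometric + recolored


demo_pairs = [
    {"input": [[1, 0], [0, 0]], "output": [[1, 1], [1, 1]]},
    {"input": [[0, 0], [4, 0]], "output": [[4, 4], [4, 4]]},
    {"input": [[0, 0], [6, 0]], "output": [[6, 6], [6, 6]]},
]
augmented = augment_pairs(demo_pairs)
print(len(augmented))
```

Symmetric grids can produce duplicate variants, so you may want to deduplicate before fine-tuning.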

Inspiration From François

Let's hear from François, the creator of ARC, about what he sees as the most promising approaches as well as general tips to help you compete in ARC Prize.

Promising Approaches

François believes that the most promising category of solutions is one that we haven't really seen in practice so far. His thought process…

Discrete program search works really well. This is probably the easiest way to solve ARC-AGI tasks. Now we also know that LLMs can develop good intuition about how to solve ARC-AGI tasks. The next step is going to be to augment discrete program search with deep learning driven intuition.

When you're doing discrete program search, you have to sift through this massive program space. The problem you're facing here, of course, is combinatorial explosion.

If you manage to get a [deep learning] model that has a pretty good sense of what an ARC-AGI task and solution is supposed to look like, then you can use the deep learning model to provide suggestions as to where to try next or what a sketch of your solution program looks like.

This is a category of approaches that a few people have tried. I'm very convinced that this is the domain from which you're gonna see the highest quality solutions.

Hear more on this approach from François with Dwarkesh.

General Tips

Focus on skill acquisition and generalization: The key idea behind ARC-AGI is that each task should be novel and not solvable by simply memorizing previous examples.
Take inspiration from human cognition: François suggests looking to cognitive science and developmental psychology for insights. For example, the idea of "core knowledge" - a set of innate priors like objectness, numbers, geometry that underpin our ability to learn more complex concepts.
Embrace hybrid approaches: François believes a hybrid approach combining symbolic and neural methods is promising. He gives the example of how humans solve ARC-AGI tasks - we consciously reason step-by-step (symbolic) but also rely heavily on unconscious intuition to quickly prune the search space (neural). Finding ways to combine the two could lead to a breakthrough.
Aim for generalizable abstractions: A successful ARC-AGI solver needs to be able to form novel conceptual abstractions to tackle never-before-seen tasks. François suggests trying to make your system's priors/knowledge representation easily swappable and generalizable, rather than overfit to a particular domain. The faster your system can form useful new abstractions, the better it will perform.
Start small and scale up: François suggests that the first "ARC-AGI solving" system doesn't need to be a full-fledged AGI from the get-go. A narrow AI system that can handle ARC-like problems in a constrained domain could still be a major breakthrough. Once you have a system that can efficiently learn and generalize in one domain, you can scale it up to more knowledge and problem domains over time.
Don’t be afraid to try something new: Since ARC-AGI is still a relatively new and unexplored benchmark, François believes there is still lots of low-hanging fruit to be plucked in terms of novel approaches. Don't be afraid to try radically different ideas from what's been attempted before. Intellectual creativity and originality can go a long way.

Submissions

ARC Prize 2024 submissions must be made through the Kaggle competition as a Kaggle notebook.

Go to the ARC Prize 2024 Kaggle page.
If you haven't done so yet, register an account. If you have an account, log in.
To format the output of your submission, view the detailed instructions on Kaggle evaluation.

Notes:

Kaggle submissions will not have internet access. Your solution must be able to run offline. This is to ensure that the private evaluation set is not leaked.
You will only receive a final score across the entire private evaluation set, not a breakdown of which tasks you got (in)correct.
Submissions to ARC Prize 2024 will be constrained to the code requirements. This is to ensure the spirit of "efficiency" is adhered to.
ARC-AGI-Pub (a secondary leaderboard measuring the public evaluation set) does not have compute or internet constraints. Closed-source frontier models are welcome to participate.

See submission templates to get started quickly.

Grand Prize Goal

Every ARC-AGI task has been human-verified, including the (more difficult) public evaluation set, ensuring all are solvable by a human. We set the Grand Prize score goal of 85% because it is high enough to consider ARC-AGI as solved, but low enough to acknowledge it is imperfect.

A New York University study (2021) found that most humans can solve, on average, 84% of the tasks in the ARC-AGI public training set.

We'd love to see more human research against ARC-AGI, specifically research using the public evaluation set and age-stratified research. If you're interested in running this research, reach out to team@arcprize.org (include links to your past research).

Scoring Methodology

This competition evaluates submissions on the percentage of correct predictions on the private evaluation set (100 tasks).

For each task, you should predict exactly 2 outputs for every test input grid contained in the task. (Tasks can have more than one test input that needs a predicted output.)

Each task's test output has one ground truth.

For a given task test output, if either of the 2 predicted outputs matches the ground truth exactly (100% correct), you score 1 for that task test output, otherwise 0. The final score is the sum of these scores divided by the total number of task test outputs. For example, if there are two task test outputs and your predictions match one but not the other, your score is 0.5.
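The scoring rule can be sketched as a short function. The layout of the solutions dictionary (task ID mapped to a list of ground-truth grids, one per test input) is an assumption made for this sketch, and score_submission is a hypothetical name rather than the official Kaggle scorer.

```python
def score_submission(submission, solutions):
    """Fraction of task test outputs matched exactly by either of the two attempts.

    submission: task ID -> list of {"attempt_1": grid, "attempt_2": grid}
    solutions:  task ID -> list of ground-truth grids (assumed layout)
    """
    total = 0
    correct = 0
    for task_id, truths in solutions.items():
        predictions = submission.get(task_id, [])
        for index, truth in enumerate(truths):
            total += 1
            attempts = predictions[index] if index < len(predictions) else {}
            if truth in (attempts.get("attempt_1"), attempts.get("attempt_2")):
                correct += 1
    return correct / total if total else 0.0


# Two task test outputs: the first is matched by attempt_2, the second is missed.
solutions = {
    "task_a": [[[8, 8], [8, 8]]],
    "task_b": [[[1, 0], [0, 1]]],
}
submission = {
    "task_a": [{"attempt_1": [[0, 0], [0, 0]], "attempt_2": [[8, 8], [8, 8]]}],
    "task_b": [{"attempt_1": [[0, 0], [0, 0]], "attempt_2": [[1, 1], [1, 1]]}],
}
score = score_submission(submission, solutions)
print(score)
```

A function like this lets you estimate your score on the public evaluation set, where the test outputs are available.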

Submission format

A submission maps each task ID to a list containing one dictionary per test input. Each dictionary holds the two predictions for that input under the keys attempt_1 and attempt_2, as shown in the example below. When a task has multiple test outputs that need to be predicted (e.g., task 12997ef3 below), the dictionaries must be in the same order as the corresponding test inputs.

{"00576224": [{"attempt_1": [[0, 0], [0, 0]], "attempt_2": [[0, 0], [0, 0]]}],
 "009d5c81": [{"attempt_1": [[0, 0], [0, 0]], "attempt_2": [[0, 0], [0, 0]]}],
 "12997ef3": [{"attempt_1": [[0, 0], [0, 0]], "attempt_2": [[0, 0], [0, 0]]},
              {"attempt_1": [[0, 0], [0, 0]], "attempt_2": [[0, 0], [0, 0]]}], ...
}

See more details on Kaggle.
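A submission in this shape can be assembled with a few lines of Python. In the sketch below, the tasks dictionary, build_submission, and placeholder_solver are hypothetical stand-ins for your own data loading and solver. The placeholder simply returns the test input and an all-zero grid, so it is not expected to score well.

```python
import json


def placeholder_solver(train_pairs, test_input):
    """Stand-in solver: guess the unchanged input, then an all-zero grid."""
    zeros = [[0] * len(row) for row in test_input]
    return test_input, zeros


def build_submission(tasks, solver):
    """Produce task ID -> list of two-attempt dictionaries, in test input order."""
    submission = {}
    for task_id, task in tasks.items():
        predictions = []
        for test_pair in task["test"]:
            attempt_1, attempt_2 = solver(task["train"], test_pair["input"])
            predictions.append({"attempt_1": attempt_1, "attempt_2": attempt_2})
        submission[task_id] = predictions
    return submission


# Toy stand-in for loaded tasks; the second task has two test inputs.
tasks = {
    "00576224": {"train": [], "test": [{"input": [[1, 2], [3, 4]]}]},
    "12997ef3": {"train": [], "test": [{"input": [[5]]}, {"input": [[6, 7]]}]},
}
submission = build_submission(tasks, placeholder_solver)

# The file name submission.json is the one the competition expects.
with open("submission.json", "w") as handle:
    json.dump(submission, handle)
```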

Code Requirements

Submissions to this competition must be made through Notebooks. In order for the "Submit to Competition" button to be active after a commit, the following conditions must be met:

CPU Notebook <= 12 hours run-time
GPU Notebook <= 12 hours run-time
No internet access enabled
External data, freely & publicly available, is allowed, including pre-trained models
Submission file must be named submission.json

Hardware

The submitted notebooks will run on the same hardware that the base notebook was created with. This will be one of the following:

CPU
- 4 CPU cores
- 30 Gigabytes of RAM
P100 GPU
- 1 Nvidia Tesla P100 GPU
- 4 CPU cores
- 29 Gigabytes of RAM
T4 2x GPU
- 2 Nvidia Tesla T4 GPUs
- 4 CPU cores
- 29 Gigabytes of RAM

These are not expected to change anytime soon.

Teams

Did you know that the highest performing ARC-AGI solutions are often made by teams?

Teams are a great way to combine ideas, learn from each other, and even make new friends who are passionate about solving ARC.

If you'd like to form a team, we encourage you to reach out to other participants, post on the Discord server, post on the Kaggle discussion board or reach out to us at team@arcprize.org.

ARC-AGI Resources

Submission Templates

Brute force - Credits go to phunghieu & team. The original write-up and published notebook can be found here: Write up Notebook
Icecuber's 2020 winning submission - Along with icecuber's great write-up about their submission
Fine-tuning Llama 3 8B - In this notebook, we will demonstrate how to fine-tune the instruct version of Llama 3 8B using Kaggle hardware.
Using a fine-tuned Llama 3 8B - In this notebook, we will demonstrate how to use a fine-tuned version of Llama 3 8B to solve ARC tasks. You can also experiment with the original version or other models compatible with Hugging Face’s infrastructure.

Community

If you have a resource you'd like to share, let us know about it.

On Measure of Intelligence
Videos - Collection of ARC-AGI videos. Make sure to see why AI can't pass this test & The Hardest Kaggle Challenge
Repositories - Collection of repositories of ARC-AGI attempts, synthetic data generation and ARC-AGI domain-specific languages
Papers - Collection of published papers around program synthesis, object-centric decision transformers and approaches
Other community-hosted and community-maintained resources

Discussions

2019-09-14 - François Chollet: Deep Learning, and the Progress of AI | Lex Fridman Podcast #38
2020-08-30 - François Chollet: Measures of Intelligence | Lex Fridman Podcast #120
2021-04-16 - #51 FRANCOIS CHOLLET - Intelligence and Generalisation
2024-01-12 - The Gradient Podcast - François Chollet: Keras and Measures of Intelligence
2024-06-11 - Mike on No Priors
2024-06-11 - François with Dwarkesh

LLM Performance on ARC

Large Language Models Are Not Strong Abstract Reasoners - Reproduction of GPT-4 scoring 11.9% on ARC-AGI (section 4.1)
Comparing Humans, GPT-4, and GPT-4V On Abstraction and Reasoning Tasks - Showing performance of SOTA models (at the time) on ConceptARC, a variation of ARC-AGI (Table 1)
Large Language Models as General Pattern Machines - GPT-4 scoring 9% on ARC-AGI public data (Table 1). Note, public data may have been used to train GPT-4 which would artificially improve results
LLMs and the Abstraction and Reasoning Corpus: Successes, Failures, and the Importance of Object-based Representations - Exploring different methods to solve ARC-AGI with LLMs. Using a less difficult subset of tasks and CoT, GPT-4 scores 46% on this subset (Table 3)

Notable Implementations

ARC2023 - End to End - V7 - 31 of 100 private evaluation tasks solved
Icecuber - Write up behind the #1 solution in the 2020 Kaggle competition by icecuber. 20 of 100 private evaluation tasks solved
Alejandro De Miquel - 2nd place solution in 2020 competition
Ilia Larchenko - Sample of the 3rd place solution in 2020 competition
Alijs - Quick 2020 5th place notes
Zoltan - 2020 6th place write up
Hieu Phung - 2020 9th place solution write up
Alexander Fritzler - 2020 10th place solution write up
James Mcguigan - Explanation of attempts in the 2020 competition
