Welcome to the official ARC Prize Guide!
This guide is designed to help you get up to speed on ARC-AGI, establish your strategy, form a team (optional), and make progress toward winning prizes and eternal glory.
No matter who you are, where you come from, or what you do for a living, you are welcome to join this competition. You can join forces with people who have complementary skill sets or work alone. It's all up to you.
Let's get started!
Before you dive into this guide, make sure to read the Competition Details and ARC-AGI pages.
The purpose of ARC Prize is to redirect more AI research focus toward architectures that might lead toward artificial general intelligence (AGI) and ensure that notable breakthroughs do not remain a trade secret at a big corporate AI lab.
ARC-AGI is the only AI benchmark that tests for general intelligence by testing not just for skill, but for skill acquisition.
Your ambitious goal: Submit a solution which scores 85% on the ARC-AGI private evaluation set and win $600K.
You are not alone in this goal. The ARC Prize community and official team are here to provide support and resources.
Plug into the community & get competition updates:
You can also reach our team at team@arcprize.org or message us on Discord.
So you want to solve ARC-AGI? Let's start by exploring how its data is structured.
This material is also covered in the Explore ARC-AGI Data + Play tutorial video.
Each ARC-AGI task is a series of three to five input/output example pairs followed by a final test pair with only the input listed. Each task tests the utilization of a specific learned skill based on a minimal number of cognitive priors.
Tasks are represented as JSON lists of integers. These JSON objects can also be represented visually as a grid of colors using an ARC-AGI task viewer.
A successful submission is a pixel-perfect description (color and position) of the final task's output.
The following datasets are associated with the ARC Prize competition:
Outside of the competition, there is also a semi-private evaluation set used for the public leaderboard. Learn more.
The publicly available data is to be used for training and evaluation.
The public training set contains 400 task files you can use to train your algorithm.
The public evaluation set contains 400 task files you can use to test the performance of your algorithm.
To ensure fair evaluation results, be sure not to leak information from the evaluation set into your algorithm (e.g., by looking at the tasks in the evaluation set yourself during development, or by repeatedly modifying an algorithm while using its evaluation score as feedback).
The source of truth for this data is available on François Chollet's ARC-AGI GitHub Repository, which contains 800 total tasks.
The private evaluation set contains 100 task files.
The ARC-AGI leaderboard is measured using 100 private evaluation tasks which are privately held on Kaggle. These tasks are private to ensure models may not be trained on them. These tasks are not included in the public tasks, but they do use the same structure and cognitive priors.
Please note that the public training set consists of simpler tasks whereas the public evaluation set is roughly the same level of difficulty as the private test set.
The public training set is significantly easier than the others (public evaluation and private evaluation set) since it contains many "curriculum" type tasks intended to demonstrate Core Knowledge systems.
For reference, here are scores across the different datasets from two community members participating in ARC Prize 2024.
Name | Public Training Set | Public Evaluation Set | Private Evaluation |
---|---|---|---|
Zoltan | 53% | 38% | 24% |
Kha Vo | 32% | 30% | 22% |
Please note that solutions that train on all public data can have a significant gap between public set scores and private evaluation set scores.
A future version of ARC-AGI will formally calibrate difficulty across evaluation sets.
As mentioned above, tasks are stored in JSON format. Each JSON file consists of two key-value pairs.
train
: a list of two to ten input/output pairs (typically three). These are used for your algorithm to infer a rule.
test
: a list of one to three input/output pairs (typically one). Your model should apply the inferred rule from the train set and construct an output solution. You will have access to the output test solution on the public data. The output solution on the private evaluation set will not be revealed.
Here is an example of a simple ARC-AGI task that has three training pairs along with a single test pair. Each pair is shown as a 2x2 grid. There are four colors represented by the integers 1, 4, 6, and 8. Which actual color (red/green/blue/black) is applied to each integer is arbitrary and up to you.
{
"train": [
{"input": [[1, 0], [0, 0]], "output": [[1, 1], [1, 1]]},
{"input": [[0, 0], [4, 0]], "output": [[4, 4], [4, 4]]},
{"input": [[0, 0], [6, 0]], "output": [[6, 6], [6, 6]]}
],
"test": [
{"input": [[0, 0], [0, 8]], "output": [[8, 8], [8, 8]]}
]
}
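To make the structure concrete, here is a minimal Python sketch that parses the example task above and checks a candidate rule against its training pairs. The function name `fill_with_nonzero_color` and the rule itself are illustrative assumptions that fit only this one task; they are not part of any official tooling.

```python
import json

# The example task from above, embedded as a string so the sketch is self-contained.
TASK_JSON = """
{
  "train": [
    {"input": [[1, 0], [0, 0]], "output": [[1, 1], [1, 1]]},
    {"input": [[0, 0], [4, 0]], "output": [[4, 4], [4, 4]]},
    {"input": [[0, 0], [6, 0]], "output": [[6, 6], [6, 6]]}
  ],
  "test": [
    {"input": [[0, 0], [0, 8]], "output": [[8, 8], [8, 8]]}
  ]
}
"""


def fill_with_nonzero_color(grid):
    """Hypothetical rule for this one task: paint every cell with the single non-zero color."""
    colors = {cell for row in grid for cell in row if cell != 0}
    if len(colors) != 1:
        raise ValueError("expected exactly one non-zero color in the grid")
    color = colors.pop()
    return [[color for _ in row] for row in grid]


task = json.loads(TASK_JSON)

# A rule is only a candidate if it reproduces every training output.
rule_fits_train = all(
    fill_with_nonzero_color(pair["input"]) == pair["output"]
    for pair in task["train"]
)

# Apply the rule to each test input to produce predictions.
predictions = [fill_with_nonzero_color(pair["input"]) for pair in task["test"]]
```

On the public data you can compare `predictions` against the `output` of each test pair; on the private evaluation set that output is never revealed.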
Download the ARC-AGI-1 data from the official ARC-AGI repo on GitHub.
Please note that there will be future versions of the ARC-AGI dataset as it matures. We plan to keep improving the benchmark by adding more tasks and more novelty. Future versions will get versioned (ex: ARC-AGI-2) and future ARC Prize competitions may target newer versions.
There are multiple ways for humans to view the data:
There are two ways to measure your progress on ARC-AGI tasks.
Correct / Incorrect: This evaluation method measures whether or not your model’s output answer is identical to the validated solution. This means that the output shape, colors, and positions match. This evaluation method is used on the ARC-AGI private evaluation set.
Pixel correctness: The number of pixels that are correctly identified as a % of the total. Some teams use “Pixel Correctness” as another indicator for their score. Though this is not used in the competition, it can give more information about how your results are performing.
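Both measures can be sketched in a few lines of Python. The function names are hypothetical, and scoring a shape mismatch as 0.0 pixel correctness is an assumption made here for simplicity; teams may handle that case differently.

```python
def exact_match(predicted, expected):
    """Correct / Incorrect: True only if shape, colors, and positions all match."""
    return predicted == expected


def pixel_correctness(predicted, expected):
    """Fraction of cells that match the expected grid.

    Assumption: a prediction with the wrong shape scores 0.0.
    """
    same_shape = len(predicted) == len(expected) and all(
        len(p_row) == len(e_row) for p_row, e_row in zip(predicted, expected)
    )
    if not same_shape:
        return 0.0
    total = sum(len(row) for row in expected)
    if total == 0:
        return 0.0
    matches = sum(
        p == e
        for p_row, e_row in zip(predicted, expected)
        for p, e in zip(p_row, e_row)
    )
    return matches / total


expected = [[8, 8], [8, 8]]
almost = [[8, 8], [8, 0]]
almost_is_correct = exact_match(almost, expected)    # one wrong cell fails the task
almost_pixel_score = pixel_correctness(almost, expected)
```

A prediction that is three-quarters right by pixels still counts as incorrect in the competition, which is why pixel correctness is useful mainly as a development signal.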
You're free to explore any path you like, but we'd love to save you time by catching you up on the four solution approaches that have led to the current state of the art. Join the community Discord to find out more from people who have been working on ARC-AGI for years.
This was the first domain of solutions that started working well in the original ARCathon competition in 2020 hosted by Lab42. It involves searching through a massive program space in a discrete, step-by-step manner.
This approach consists of piecing together existing publicly available solutions to correctly answer more tasks than any solution achieved alone. This is the approach that was used to get to the current high score.
One thing to consider in utilizing this approach: it's unlikely that an ensemble approach will be able to generalize to correctly solve tasks outside of the public datasets. If you've got your eyes on the Grand Prize, you'll want to create new and novel techniques.
In this method, contestants use a traditional LLM (like GPT-4) and rely on prompting techniques to solve ARC-AGI tasks. This was found to perform poorly, scoring <5%. Fine-tuning a state-of-the-art (SOTA) LLM with millions of synthetic ARC-AGI examples scores ~10%.
"LLMs like Gemini or ChatGPT [don't work] because they're basically frozen at inference time. They're not actually learning anything." - François Chollet
Additionally, keep in mind that submissions to Kaggle will not have access to the internet. Using a 3rd-party, cloud-hosted LLM is not possible.
See templates for fine-tuning Llama 3b, using an open source LLM (without fine-tuning it), and using frontier models (Video tutorial, ARC-AGI-Pub only).
This approach involves developing a domain-specific language (DSL). The DSL is designed to encapsulate common concepts such as rotation, mirroring, and other grid transformations that frequently occur in ARC tasks. By defining a set of primitives or basic functions that perform these transformations, solutions can be synthesized by composing these primitives into programs that solve specific tasks.
Program synthesis in this approach involves searching through possible compositions of the DSL primitives to find programs that correctly transform input grids into their corresponding output grids. This search can be brute-force or more sophisticated, but the key idea is to leverage the DSL to build task-specific programs efficiently.
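As a rough illustration of the idea, the sketch below defines a tiny DSL of four grid transformations and brute-forces compositions of them until one reproduces every training output. The primitive set, the names, and the `max_length` limit are all assumptions chosen for brevity; real DSLs for ARC-AGI are far larger.

```python
from itertools import product


def rotate90(grid):
    """Rotate the grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def flip_horizontal(grid):
    """Mirror the grid left to right."""
    return [row[::-1] for row in grid]


def flip_vertical(grid):
    """Mirror the grid top to bottom."""
    return [row[:] for row in grid[::-1]]


def transpose(grid):
    """Swap rows and columns."""
    return [list(row) for row in zip(*grid)]


# The DSL: a name for each primitive (illustrative, deliberately tiny).
PRIMITIVES = {
    "rotate90": rotate90,
    "flip_horizontal": flip_horizontal,
    "flip_vertical": flip_vertical,
    "transpose": transpose,
}


def run_program(program, grid):
    """Apply a sequence of primitive names to a grid, left to right."""
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid


def synthesize(train_pairs, max_length=3):
    """Brute-force search: return the shortest program that fits all training pairs."""
    for length in range(1, max_length + 1):
        for program in product(PRIMITIVES, repeat=length):
            if all(
                run_program(program, pair["input"]) == pair["output"]
                for pair in train_pairs
            ):
                return program
    return None


# A toy task whose hidden rule is a 180-degree rotation.
train_pairs = [{"input": [[1, 2], [3, 4]], "output": [[4, 3], [2, 1]]}]
program = synthesize(train_pairs)
```

The number of candidate programs grows exponentially with `max_length`, which is why practical systems prune or prioritize the search rather than enumerating everything.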
See Michael Hodel's example notebook with this approach.
More recently, solutions using pre-trained large language models (LLMs) have been attempted. The LLMs are additionally trained on code data and ARC-AGI data and, because there aren’t enough ARC-AGI tasks, this is augmented with synthetic ARC-AGI-like data.
The trick to making this LLM-based solution work is using active inference. This is the idea that when you’re presented with a test task’s demonstration examples, you fine-tune the LLM on those examples. Of course, because there are only a couple of them, you’ll need to expand them artificially to have enough data points to fit your curve.
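One plausible way to expand a handful of demonstration pairs is to apply the same transformation to both the input and the output of each pair. The sketch below does this with rotations, mirroring, and a color permutation. These particular augmentations, the function names, and the assumption that colors are the integers 0 through 9 with 0 left unchanged are illustrative choices, not a description of any specific top solution; the fine-tuning step itself is not shown.

```python
import random


def copy_grid(grid):
    return [row[:] for row in grid]


def rotate90(grid):
    """Rotate the grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def flip_horizontal(grid):
    """Mirror the grid left to right."""
    return [row[::-1] for row in grid]


def permute_colors(grid, mapping):
    """Recolor every cell according to the mapping."""
    return [[mapping[cell] for cell in row] for row in grid]


def augment_pair(pair, rng):
    """Return variants of one demonstration pair.

    Each transformation is applied identically to input and output,
    so the relationship between them is preserved.
    """
    variants = []
    inp, out = copy_grid(pair["input"]), copy_grid(pair["output"])
    # Four rotations, each with and without a horizontal flip: 8 variants.
    for _ in range(4):
        variants.append({"input": inp, "output": out})
        variants.append(
            {"input": flip_horizontal(inp), "output": flip_horizontal(out)}
        )
        inp, out = rotate90(inp), rotate90(out)
    # One random color permutation (assumption: colors are 0-9, 0 stays fixed).
    colors = list(range(1, 10))
    shuffled = colors[:]
    rng.shuffle(shuffled)
    mapping = {0: 0, **dict(zip(colors, shuffled))}
    variants.append(
        {
            "input": permute_colors(pair["input"], mapping),
            "output": permute_colors(pair["output"], mapping),
        }
    )
    return variants


pair = {"input": [[1, 0], [0, 0]], "output": [[1, 1], [1, 1]]}
variants = augment_pair(pair, random.Random(0))
```

Whether a given augmentation is appropriate depends on the task: a rule that refers to "up" or to a specific color may not survive rotation or recoloring, so the set of transformations is worth validating.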
This unlocks the performance that we see with top solutions. Jack Cole's 34% solution utilizes this approach.
“The fact that this technique has an outsized impact is really interesting” - François Chollet
Let's hear from François, the creator of ARC, about what he sees as the most promising approaches as well as general tips to help you compete in ARC Prize.
François believes that the most promising category of solutions is one that we haven't really seen in practice so far. His thought process…
Discrete program search works really well. This is probably the easiest way to solve ARC-AGI tasks. Now we also know that LLMs can develop good intuition about how to solve ARC-AGI tasks. The next step is going to be to augment discrete program search with deep learning driven intuition.
When you're doing discrete program search, you have to sift through this massive program space. The problem you're facing here, of course, is combinatorial explosion.
If you manage to get a [deep learning] model that has a pretty good sense of what an ARC-AGI task and solution is supposed to look like, then you can use the deep learning model to provide suggestions as to where to try next or what a sketch of your solution program should look like.
This is a category of approaches that a few people have tried. I'm very convinced that this is the domain from which you're gonna see the highest quality solutions.
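The shape of such a guided search can be sketched as a best-first search in which a scoring function decides which partial program to extend next. In the sketch below the scoring function is a hand-written placeholder (a count of mismatched cells) standing in for the learned model's suggestions; the two primitives, the names, and the limits are all assumptions for illustration, and no deep learning is involved.

```python
import heapq
from itertools import count


def rotate90(grid):
    """Rotate the grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def flip_horizontal(grid):
    """Mirror the grid left to right."""
    return [row[::-1] for row in grid]


PRIMITIVES = {"rotate90": rotate90, "flip_horizontal": flip_horizontal}


def run_program(program, grid):
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid


def mismatch(train_pairs, program):
    """Placeholder for a learned model's score: lower means more promising."""
    total = 0
    for pair in train_pairs:
        got, want = run_program(program, pair["input"]), pair["output"]
        if [len(row) for row in got] != [len(row) for row in want]:
            total += sum(len(row) for row in want)  # wrong shape: all cells wrong
        else:
            total += sum(
                g != w for g_row, w_row in zip(got, want) for g, w in zip(g_row, w_row)
            )
    return total


def guided_search(train_pairs, max_length=4, max_expansions=1000):
    """Best-first search: always extend the program the scorer likes best."""
    tie_breaker = count()  # keeps heap ordering stable when scores are equal
    frontier = [(mismatch(train_pairs, ()), next(tie_breaker), ())]
    expansions = 0
    while frontier and expansions < max_expansions:
        score, _, program = heapq.heappop(frontier)
        if score == 0:
            return program  # fits every training pair
        if len(program) >= max_length:
            continue
        expansions += 1
        for name in PRIMITIVES:
            candidate = program + (name,)
            heapq.heappush(
                frontier, (mismatch(train_pairs, candidate), next(tie_breaker), candidate)
            )
    return None


train_pairs = [{"input": [[1, 2], [3, 4]], "output": [[4, 3], [2, 1]]}]
found = guided_search(train_pairs)
```

Swapping `mismatch` for a model that has learned what ARC-AGI solutions tend to look like is the part that would address combinatorial explosion; the search loop around it stays much the same.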
Hear more on this approach from François with Dwarkesh.
ARC Prize 2024 submissions must be made through the Kaggle competition as a Kaggle notebook.
Notes:
See submission templates to get started quickly.
The Grand Prize threshold is set at 85% to mark material progress on ARC-AGI while acknowledging that the benchmark is imperfect. The benchmark is intended to be a minimal test of general intelligence, something that early forms of artificial general intelligence will necessarily be able to do.
Every ARC-AGI task has been human-verified by at least 2 STEM professionals. Not all humans can solve all tasks, but all tasks can be solved by humans.
While average human performance does not impact the validity of ARC-AGI, it is interesting that the benchmark can function as a human intelligence test in addition to an AI test. A 2024 NYU study found that 790 of the 800 public ARC tasks (98.7%) are solvable by at least one typical crowd-worker. The average human performance in the study was between 73.3% and 77.2% correct (public training set average: 76.2%; public evaluation set average: 64.2%).
We plan to do further testing on future versions of ARC-AGI.
This competition evaluates submissions on the percentage of correct predictions on the private evaluation set (100 tasks).
For each task, you should predict exactly 2 outputs for every test input grid contained in the task. (Tasks can have more than one test input that needs a predicted output.)
Each task's test output has one ground truth.
For a given task test output, if either of the 2 predicted outputs matches the ground truth exactly (100% correct), you score 1 for that task test output, otherwise 0. The final score is the sum of these per-output scores divided by the total number of task test outputs. Ex: If there are two task test outputs, and one is predicted correctly and the other is not, your score is 0.5.
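The scoring rule can be sketched in Python as follows. The function name and the `solutions` mapping (task ID to a list of ground-truth grids, one per test input) are hypothetical conveniences for local testing; the `attempt_1` and `attempt_2` keys follow the submission format used in this guide. This is not the official scoring code.

```python
def score_submission(submission, solutions):
    """Fraction of test outputs for which either attempt matches exactly.

    submission: {task_id: [{"attempt_1": grid, "attempt_2": grid}, ...]}
    solutions:  {task_id: [ground_truth_grid, ...]}  (hypothetical local format)
    """
    correct = 0
    total = 0
    for task_id, truths in solutions.items():
        attempts_per_input = submission.get(task_id, [])
        for index, truth in enumerate(truths):
            total += 1
            # A missing prediction simply scores 0 for that test output.
            attempts = (
                attempts_per_input[index] if index < len(attempts_per_input) else {}
            )
            if truth in (attempts.get("attempt_1"), attempts.get("attempt_2")):
                correct += 1
    return correct / total if total else 0.0


# Two test outputs: the first is matched by attempt_2, the second by neither.
solutions = {"task_a": [[[1, 1], [1, 1]]], "task_b": [[[2, 2], [2, 2]]]}
submission = {
    "task_a": [{"attempt_1": [[0, 0], [0, 0]], "attempt_2": [[1, 1], [1, 1]]}],
    "task_b": [{"attempt_1": [[0, 0], [0, 0]], "attempt_2": [[2, 2], [2, 0]]}],
}
score = score_submission(submission, solutions)
```

Because the order of the two attempts does not affect the result, the second attempt is best spent on a genuinely different hypothesis rather than a near-duplicate of the first.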
For each task, the submission should contain a list with one dictionary per test input, and each dictionary should hold two predictions (attempt_1 and attempt_2), as shown by the example below. When a task has multiple test outputs that need to be predicted (e.g., task 12997ef3 below), the dictionaries must be in the same order as the corresponding test inputs.
{"00576224": [{"attempt_1": [[0, 0], [0, 0]], "attempt_2": [[0, 0], [0, 0]]}],
"009d5c81": [{"attempt_1": [[0, 0], [0, 0]], "attempt_2": [[0, 0], [0, 0]]}],
"12997ef3": [{"attempt_1": [[0, 0], [0, 0]], "attempt_2": [[0, 0], [0, 0]]},
{"attempt_1": [[0, 0], [0, 0]], "attempt_2": [[0, 0], [0, 0]]}], ...
}
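A small helper can assemble this structure from your model's guesses. The sketch below assumes a hypothetical `predictions` mapping from task ID to a list of (first guess, second guess) tuples in test-input order; the helper name is illustrative, and writing the result to a file is left out.

```python
import json


def build_submission(predictions):
    """Convert {task_id: [(guess_1, guess_2), ...]} into the submission structure.

    The list for each task must follow the order of that task's test inputs.
    """
    return {
        task_id: [
            {"attempt_1": first, "attempt_2": second} for first, second in per_input
        ]
        for task_id, per_input in predictions.items()
    }


blank = [[0, 0], [0, 0]]
predictions = {
    "00576224": [(blank, blank)],
    "009d5c81": [(blank, blank)],
    "12997ef3": [(blank, blank), (blank, blank)],  # two test inputs
}

submission = build_submission(predictions)
submission_json = json.dumps(submission)  # serialized form of the submission
```

Checking locally that every task ID is present and that each list has one entry per test input helps avoid losing points to formatting mistakes.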
See more details on Kaggle.
Submissions to this competition must be made through Notebooks. In order for the "Submit to Competition" button to be active after a commit, the following conditions must be met:
The submitted notebooks will run on the same hardware that the base notebook was created with. They will be one of the following
These are not expected to change anytime soon.
Did you know that the highest performing ARC-AGI solutions are often made by teams?
Teams are a great way to combine ideas, learn from each other, and even make new friends who are passionate about solving ARC.
If you'd like to form a team, we encourage you to reach out to other participants, post on the Discord server, post on the Kaggle discussion board or reach out to us at team@arcprize.org.
If you have a resource you'd like to share, let us know about it.