ARC Prize - Official Guide

Welcome to the official ARC Prize Guide!

This technical guide is designed to help you get up to speed on ARC-AGI, establish your strategy, form a team (optional), and make progress toward winning prizes.

No matter who you are, where you come from, or what you do for a living, you are welcome to join this competition. You can join forces with people who have complementary skill sets or work alone. It's all up to you.

Let's get started!

Overview

Before you dive into this guide, make sure to read the Competition Details and ARC-AGI pages.

The purpose of ARC Prize is to redirect more AI research focus toward architectures that might lead toward artificial general intelligence (AGI) and ensure that notable breakthroughs do not remain a trade secret at a big corporate AI lab.

ARC-AGI is the only AI benchmark that tests for general intelligence by testing not just for skill, but for skill acquisition.

Your ambitious goal: Submit a solution which scores 85% on the ARC-AGI-2 private evaluation set and win $700K.

You are not alone in this goal. The ARC Prize community and official team are here to provide support and resources.

Plug into the community & get competition updates:

You can also reach our team at team@arcprize.org or message us on Discord.

Data Structure

So you want to solve ARC-AGI? Let's start by exploring how its data is structured.

This material is also covered in the Explore ARC-AGI Data + Play tutorial video.

Tasks

ARC-AGI tasks are a series of three to five input and output tasks followed by a final task with only the input listed. Each task tests the utilization of a specific learned skill based on a minimal number of cognitive priors.

ARC-AGI task

In their native form, tasks are JSON lists of integers. This JSON can also be represented visually as a grid of colors using an ARC-AGI task viewer. You can view an example of a task here.

A successful submission is a pixel-perfect description (color and position) of the final task's output.

ARC-AGI-2

2025 introduces a new version of ARC-AGI, ARC-AGI-2. This is the same format as ARC-AGI-1. All conventions of public/private/semi-private data still apply. For more on the new dataset, see the ARC-AGI-2.

100% of tasks in the ARC-AGI-2 dataset were solved by a minimum of 2 people in 2 or fewer attempts (many were solved by more people). ARC-AGI-2 is more difficult for AI than ARC-AGI-1.

We recommend using the ARC-AGI-1 dataset for getting started and then moving to ARC-AGI-2 for more advanced solutions. The remainder of this guide will focus on ARC-AGI-2.

Task Data

The following datasets are associated with the ARC Prize competition:

Public training set
Public evaluation set
Private evaluation set

Outside of the competition, there is also a semi-private evaluation set used for the public leaderboard. Learn more.

Public

The publicly available data is to be used for training and evaluation.

The public training set contains 1,000 task files you can use to train your algorithm.

The public evaluation set contains 120 task files you can use to test the performance of your algorithm.

To ensure fair evaluation results, be sure not to leak information from the evaluation set into your algorithm (e.g., by looking at the tasks in the evaluation set yourself during development, or by repeatedly modifying an algorithm while using its evaluation score as feedback).

The source of truth for this data is available on the ARC-AGI GitHub Repository, which contains 1,120 total tasks.

Semi-Private

The semi-private evaluation set contains 120 task files.

The Semi-Private Evaluation set is 120 tasks which are privately held on Kaggle. These tasks are used for the intra-year competition standings. These tasks are not included in the public tasks, but they do use the same structure and cognitive priors.

These tasks are also used to measure public, closed-source model performance.

Private

The private evaluation set contains 120 task files.

The ARC-AGI leaderboard is measured using 120 private evaluation tasks which are also privately held on Kaggle. These tasks are private to ensure models may not be trained on them. These tasks are not included in the public tasks, but they do use the same structure and cognitive priors.

Please note that the public training set consists of simpler tasks whereas the public evaluation set is roughly the same level of difficulty as the private test set.

Set Difficulty

One of the enhancements made with ARC-AGI-2 is the introduction of a difficulty calibration. The Private Evaluation, Public Evaluation, and Semi-Private Evaluation sets are now calibrated to be roughly the same difficulty (within 1 percentage point) as measured by human and AI performance.

Format

As mentioned above, tasks are stored in JSON format. Each JSON file consists of two key-value pairs.

train: a list of two to ten input/output pairs (typically three). Your algorithm uses these to infer a rule.

test: a list of one to three input/output pairs (typically one). Your model should apply the rule inferred from the train set and construct an output solution. You will have access to the test output solution on the public data. The output solution on the private evaluation set will not be revealed.

Here is an example of a simple ARC-AGI task that has three training pairs along with a single test pair. Each grid is 2x2. Besides the value 0, four colors are represented by the integers 1, 4, 6, and 8. Which actual color (red/green/blue/black) is applied to each integer is arbitrary and up to you.

{
  "train": [
    {"input": [[1, 0], [0, 0]], "output": [[1, 1], [1, 1]]},
    {"input": [[0, 0], [4, 0]], "output": [[4, 4], [4, 4]]},
    {"input": [[0, 0], [6, 0]], "output": [[6, 6], [6, 6]]}
  ],
  "test": [
    {"input": [[0, 0], [0, 8]], "output": [[8, 8], [8, 8]]}
  ]
}
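To make the format concrete, here is a minimal Python sketch that parses the example task above and summarizes it. The helper names `grid_shape` and `describe_task` are illustrative and not part of any official tooling. The task is embedded as a string only so the sketch runs on its own.

```python
import json

# The example task from above, embedded as a string so the sketch is self-contained.
TASK_JSON = """
{
  "train": [
    {"input": [[1, 0], [0, 0]], "output": [[1, 1], [1, 1]]},
    {"input": [[0, 0], [4, 0]], "output": [[4, 4], [4, 4]]},
    {"input": [[0, 0], [6, 0]], "output": [[6, 6], [6, 6]]}
  ],
  "test": [
    {"input": [[0, 0], [0, 8]], "output": [[8, 8], [8, 8]]}
  ]
}
"""


def grid_shape(grid):
    """Return (rows, columns) for a grid stored as a list of rows."""
    return len(grid), (len(grid[0]) if grid else 0)


def describe_task(task):
    """Summarize pair counts and the input/output shapes of each training pair."""
    return {
        "n_train": len(task["train"]),
        "n_test": len(task["test"]),
        "train_shapes": [
            (grid_shape(pair["input"]), grid_shape(pair["output"]))
            for pair in task["train"]
        ],
    }


task = json.loads(TASK_JSON)
summary = describe_task(task)
print(summary)
```

For a task file on disk, you would swap `json.loads(TASK_JSON)` for `json.load` on an open file handle.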

Development

Download

Download the ARC-AGI-2 data from the official ARC-AGI repo on GitHub.

View

There are multiple ways for humans to view the data:

Testing interface on the official repo (instructions)
The arcprize.org task viewer
Community-created apps

Test

There are two ways to measure your progress on ARC-AGI tasks.

Correct / Incorrect: This evaluation method measures whether or not your model’s output answer is identical to the validated solution. This means that the output shape, colors, and positions all match. This evaluation method is used on the ARC-AGI private evaluation set. You can view scoring algorithms on Kaggle & GitHub.
Pixel correctness: The number of pixels that are correctly identified as a % of the total. Some teams use “Pixel Correctness” as another indicator for their score. Though this is not used in the competition, it can give more information about how your results are performing.
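The two measures above can be sketched in a few lines of Python. This is an illustration, not the official scoring code. In particular, returning 0.0 when the predicted shape differs from the expected shape is an assumption made here, since pixel correctness is not an official metric.

```python
def exact_match(predicted, expected):
    """Correct / Incorrect: True only if shape, colors, and positions all match."""
    return predicted == expected


def pixel_correctness(predicted, expected):
    """Fraction of cells that match the expected grid.

    Assumption: a prediction with the wrong shape scores 0.0.
    """
    if len(predicted) != len(expected):
        return 0.0
    if any(len(p_row) != len(e_row) for p_row, e_row in zip(predicted, expected)):
        return 0.0
    total = sum(len(row) for row in expected)
    if total == 0:
        return 0.0
    correct = sum(
        p == e
        for p_row, e_row in zip(predicted, expected)
        for p, e in zip(p_row, e_row)
    )
    return correct / total


expected = [[8, 8], [8, 8]]
almost = [[8, 8], [8, 0]]
print(exact_match(almost, expected), pixel_correctness(almost, expected))
```

A prediction that is 75% pixel-correct still scores zero under the competition's Correct / Incorrect rule, which is why pixel correctness is best treated as a debugging signal.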

Approaches

You're free to explore any path you like, but we'd love to save you time by catching you up on the five solution approaches that have led to the current state of the art. Join the community Discord to find out more from people who have been working on ARC-AGI for years.

1. Discrete program search

This was the first domain of solutions that started working well in the original ARCathon competition in 2020 hosted by Lab42. It involves searching through a massive program space in a discrete, step-by-step manner.

2. Ensemble Solutions

This approach consists of piecing together existing publicly available solutions to correctly answer more tasks than any solution achieved alone. This is the approach that was used to get to the current high score.

One thing to consider in utilizing this approach: it's unlikely that an ensemble approach will be able to generalize to correctly solve tasks outside of the public datasets. If you've got your eyes on the Grand Prize, you'll want to create new and novel techniques.

3. Direct LLM Prompting

In this method, contestants use a traditional LLM (like GPT-4) and rely on prompting techniques to solve ARC-AGI tasks. This was found to perform poorly, scoring <5%. Fine-tuning a state-of-the-art (SOTA) LLM with millions of synthetic ARC-AGI examples scores ~10%.

"LLMs like Gemini or ChatGPT [don't work] because they're basically frozen at inference time. They're not actually learning anything." - François Chollet

Additionally, keep in mind that submissions to Kaggle will not have access to the internet. Using a 3rd-party, cloud-hosted LLM is not possible.

See templates for fine-tuning Llama 3b, using an open-source LLM (without fine-tuning it), and using frontier models (Video tutorial, ARC-AGI-Pub only).

4. Domain-Specific Language (DSL) Program Synthesis

This approach involves developing a domain-specific language (DSL). The DSL is designed to encapsulate common concepts such as rotation, mirroring, and other grid transformations that frequently occur in ARC tasks. By defining a set of primitives or basic functions that perform these transformations, solutions can be synthesized by composing these primitives into programs that solve specific tasks.

Program synthesis in this approach involves searching through possible compositions of the DSL primitives to find programs that correctly transform input grids into their corresponding output grids. This search can be brute-force or more sophisticated, but the key idea is to leverage the DSL to build task-specific programs efficiently.

See Michael Hodel's example notebook with this approach.
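As a toy illustration of the idea, the sketch below defines four grid primitives and brute-forces every composition up to a fixed depth until one reproduces all training outputs. The primitive set, the names, and the depth limit are assumptions chosen for brevity. Real DSLs are far richer, and this is not Michael Hodel's DSL.

```python
from itertools import product


# Hypothetical primitives: each maps a grid (list of rows) to a new grid.
def rotate90(grid):
    """Rotate the grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def flip_horizontal(grid):
    """Mirror the grid left to right."""
    return [row[::-1] for row in grid]


def flip_vertical(grid):
    """Mirror the grid top to bottom."""
    return [list(row) for row in grid[::-1]]


def transpose(grid):
    """Swap rows and columns."""
    return [list(row) for row in zip(*grid)]


PRIMITIVES = {
    "rotate90": rotate90,
    "flip_horizontal": flip_horizontal,
    "flip_vertical": flip_vertical,
    "transpose": transpose,
}


def run_program(program, grid):
    """Apply a sequence of primitive names to a grid, left to right."""
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid


def search(train_pairs, max_depth=3):
    """Return a shortest primitive sequence consistent with every train pair, or None."""
    for depth in range(1, max_depth + 1):
        for program in product(PRIMITIVES, repeat=depth):
            if all(run_program(program, p["input"]) == p["output"] for p in train_pairs):
                return program
    return None


# A made-up task whose rule is a 180-degree rotation.
train_pairs = [{"input": [[1, 2], [3, 4]], "output": [[4, 3], [2, 1]]}]
found = search(train_pairs)
print(found, run_program(found, [[5, 6, 7]]))
```

The number of candidate programs grows as the number of primitives raised to the depth, which is the combinatorial explosion that smarter search strategies try to tame.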

5. Active inference

More recently, solutions using pre-trained large language models (LLMs) have been attempted. The LLMs are additionally trained on code data and ARC-AGI data. Because there aren't enough ARC-AGI tasks, you'll augment this with synthetic ARC-AGI-like data.

The trick to making this LLM-based solution work is using active inference. The idea is that when you're presented with a test task's demonstration examples, you fine-tune the LLM on those examples. Of course, because there are only a couple of them, you'll need to expand them artificially to have enough data points to fit your curve.
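One common way to expand a handful of demonstration pairs is to apply the same geometric transform or color swap to both the input and the output. The sketch below shows that idea only. It does not perform any fine-tuning, the function names are illustrative, and the choice of transforms is an assumption: not every task's rule survives rotation, reflection, or recoloring, so augmentations should be chosen per task.

```python
def rotate90(grid):
    """Rotate the grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def flip_horizontal(grid):
    """Mirror the grid left to right."""
    return [row[::-1] for row in grid]


def recolor(grid, mapping):
    """Replace colors according to mapping; unmapped colors stay unchanged."""
    return [[mapping.get(cell, cell) for cell in row] for row in grid]


def augment_pairs(pairs, color_maps=()):
    """Expand pairs with the 8 rotations/reflections, plus optional color swaps.

    The same transform is applied to input and output so the pair stays consistent.
    """
    augmented = []
    for pair in pairs:
        variants = []
        grid_in, grid_out = pair["input"], pair["output"]
        for _ in range(4):
            variants.append((grid_in, grid_out))
            variants.append((flip_horizontal(grid_in), flip_horizontal(grid_out)))
            grid_in, grid_out = rotate90(grid_in), rotate90(grid_out)
        for v_in, v_out in variants:
            augmented.append({"input": v_in, "output": v_out})
            for mapping in color_maps:
                augmented.append(
                    {"input": recolor(v_in, mapping), "output": recolor(v_out, mapping)}
                )
    return augmented


pair = {"input": [[1, 0], [0, 0]], "output": [[1, 1], [1, 1]]}
augmented = augment_pairs([pair], color_maps=[{1: 2}])
print(len(augmented))
```

Symmetric grids produce duplicate variants, so you may want to deduplicate before training.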

This unlocks the performance that we see with top solutions. Jack Cole's 34% solution utilizes this approach.

"The fact that this technique has an outsized impact is really interesting" - François Chollet

Guidance

Let's hear from François, the creator of ARC, about what he sees as the most promising approaches as well as general tips to help you compete in ARC Prize.

Promising Approaches

François believes that the most promising category of solutions is one that we haven't really seen in practice so far. His thought process…

Discrete program search works really well. This is probably the easiest way to solve ARC-AGI tasks. Now we also know that LLMs can develop good intuition about how to solve ARC-AGI tasks. The next step is going to be to augment discrete program search with deep learning driven intuition.

When you're doing discrete program search, you have to sift through this massive program space. The problem you're facing here, of course, is combinatorial explosion.

If you manage to get a [deep learning] model that has a pretty good sense of what an ARC-AGI task and solution is supposed to look like, then you can use the deep learning model to provide suggestions as to where to try next or what a sketch of your solution program should look like.

This is a category of approaches that a few people have tried. I'm very convinced that this is the domain from which you're gonna see the highest quality solutions.

Hear more on this approach from François with Dwarkesh.
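To give a feel for what "search guided by intuition" could look like in code, here is a minimal best-first search over program sketches. It is only a sketch under heavy assumptions: the three primitives are a toy DSL, and `heuristic_score` is a hand-written stand-in (fraction of matching cells) placed where a learned model's suggestions would go. It is not François's method or any published solution.

```python
import heapq
from itertools import count

# Toy primitives (assumed for illustration); each maps a grid to a new grid.
PRIMITIVES = {
    "rotate90": lambda g: [list(r) for r in zip(*g[::-1])],
    "flip_horizontal": lambda g: [r[::-1] for r in g],
    "transpose": lambda g: [list(r) for r in zip(*g)],
}


def run_program(program, grid):
    """Apply a sequence of primitive names to a grid, left to right."""
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid


def heuristic_score(program, train_pairs):
    """Stand-in for a learned model: mean fraction of matching cells (1.0 = solved)."""
    scores = []
    for pair in train_pairs:
        predicted, expected = run_program(program, pair["input"]), pair["output"]
        same_shape = len(predicted) == len(expected) and all(
            len(p) == len(e) for p, e in zip(predicted, expected)
        )
        if not same_shape:
            scores.append(0.0)
            continue
        cells = [p == e for pr, er in zip(predicted, expected) for p, e in zip(pr, er)]
        scores.append(sum(cells) / len(cells) if cells else 0.0)
    return sum(scores) / len(scores)


def guided_search(train_pairs, max_depth=4, max_expansions=1000):
    """Expand the most promising partial program first instead of enumerating blindly."""
    tie_breaker = count()  # keeps heap entries comparable when scores tie
    frontier = [(0.0, next(tie_breaker), ())]
    for _ in range(max_expansions):
        if not frontier:
            return None
        _, _, program = heapq.heappop(frontier)
        if len(program) >= max_depth:
            continue
        for name in PRIMITIVES:
            candidate = program + (name,)
            score = heuristic_score(candidate, train_pairs)
            if score == 1.0:
                return candidate
            # heapq is a min-heap, so push the negated score.
            heapq.heappush(frontier, (-score, next(tie_breaker), candidate))
    return None


train_pairs = [{"input": [[1, 2], [3, 4]], "output": [[4, 3], [2, 1]]}]
found = guided_search(train_pairs)
print(found)
```

The quality of the scorer decides how much of the program space gets skipped. A cell-overlap heuristic like this one is easily misled, which is exactly the gap a model with good task intuition is meant to fill.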

General Tips

Focus on skill acquisition and generalization: The key idea behind ARC-AGI is that each task should be novel and not solvable by simply memorizing previous examples.
Take inspiration from human cognition: François suggests looking to cognitive science and developmental psychology for insights. For example, the idea of "core knowledge" - a set of innate priors like objectness, numbers, geometry that underpin our ability to learn more complex concepts.
Embrace hybrid approaches: François believes a hybrid approach combining symbolic and neural methods is promising. He gives the example of how humans solve ARC-AGI tasks - we consciously reason step-by-step (symbolic) but also rely heavily on unconscious intuition to quickly prune the search space (neural). Finding ways to combine the two could lead to a breakthrough.
Aim for generalizable abstractions: A successful ARC-AGI solver needs to be able to form novel conceptual abstractions to tackle never-before-seen tasks. François suggests trying to make your system's priors/knowledge representation easily swappable and generalizable, rather than overfit to a particular domain. The faster your system can form useful new abstractions, the better it will perform.
Start small and scale up: François suggests that the first "ARC-AGI solving" system doesn't need to be a full-fledged AGI from the get-go. A narrow AI system that can handle ARC-like problems in a constrained domain could still be a major breakthrough. Once you have a system that can efficiently learn and generalize in one domain, you can scale it up to more knowledge and problem domains over time.
Don’t be afraid to try something new: Since ARC-AGI is still a relatively new and unexplored benchmark, François believes there is still plenty of low-hanging fruit to be plucked in terms of novel approaches. Don't be afraid to try radically different ideas from what's been attempted before. Intellectual creativity and originality can go a long way.

Code Submissions

ARC Prize 2025 submissions must be made through the Kaggle competition as a Kaggle notebook.

Go to the ARC Prize 2025 Kaggle competition page
If you haven't done so yet, register an account. If you have an account, log in.
To format the output of your submission, view the detailed instructions on Kaggle evaluation.

Notes:

Kaggle submissions will not have internet access. Your solution must be able to run offline. This is to ensure that the private evaluation set is not leaked.
During the competition, standings will be determined by scores on the semi-private dataset. These scores will be publicly displayed on the Kaggle leaderboard. At the end of the competition, your submission will be evaluated on the private dataset. This final private dataset score will determine your official ranking and prize eligibility.
Submissions to ARC Prize 2025 will be constrained to the code requirements. This is to ensure the spirit of "efficiency" is adhered to.
ARC-AGI-Pub (secondary leaderboard measuring the public evaluation set) does not have compute or internet constraints. Closed-source, frontier models are welcome to participate.

See submission templates to get started quickly.

Grand Prize Goal

The Grand Prize threshold is set at 85% to represent material progress on ARC-AGI while acknowledging that the benchmark is imperfect. The benchmark is intended to be a minimal test of general intelligence, something that early forms of artificial general intelligence will necessarily be able to do.

Every ARC-AGI task has been human-verified by at least 2 STEM professionals. Not all humans can solve all tasks, but all tasks can be solved by humans.

While average human performance does not impact the validity of ARC-AGI, it is interesting that the benchmark can function as a human intelligence test in addition to an AI test. A 2024 NYU study found that 790 of the 800 public ARC tasks (98.7%) are solvable by at least one typical crowd-worker. The average human performance in the study was between 73.3% and 77.2% correct (public training set average: 76.2%; public evaluation set average: 64.2%).

We plan to do further testing on future versions of ARC-AGI.

Scoring Methodology

This competition evaluates submissions on the percentage of correct predictions on the private evaluation set (120 tasks).

For each task, you should predict exactly 2 outputs for every test input grid contained in the task. (Tasks can have more than one test input that needs a predicted output.)

Each task's test output has one ground truth.

For a given task test output, if either of the 2 predicted outputs matches the ground truth exactly (100% correct), you score 1 for that task test output; otherwise you score 0. The final score is the sum of these per-output scores divided by the total number of task test outputs. Example: if there are two task test outputs, and one is 100% correct and the other is 0% correct, your score is 0.5.
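The scoring rule above can be sketched as a short Python function. This is an illustration of the rule as described here, not Kaggle's official scoring code. The shape of `solutions` (task ID mapped to a list of ground-truth grids, one per test input) is an assumption made for the example.

```python
def score_submission(submission, solutions):
    """Score = (test outputs matched by attempt_1 or attempt_2) / (all test outputs).

    submission: task_id -> list of {"attempt_1": grid, "attempt_2": grid}
    solutions:  task_id -> list of ground-truth grids (assumed layout)
    """
    total = 0
    correct = 0
    for task_id, truths in solutions.items():
        predictions = submission.get(task_id, [])
        for index, truth in enumerate(truths):
            total += 1
            if index >= len(predictions):
                continue  # a missing prediction scores 0
            attempts = predictions[index]
            if truth in (attempts.get("attempt_1"), attempts.get("attempt_2")):
                correct += 1
    return correct / total if total else 0.0


# Two task test outputs: one solved on the second attempt, one missed.
solutions = {"task_a": [[[1]]], "task_b": [[[2]]]}
submission = {
    "task_a": [{"attempt_1": [[0]], "attempt_2": [[1]]}],
    "task_b": [{"attempt_1": [[0]], "attempt_2": [[0]]}],
}
score = score_submission(submission, solutions)
print(score)
```

Because either attempt can earn the point, it usually pays to make the two attempts meaningfully different rather than submitting the same grid twice.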

Submission format

A submission is a single JSON object keyed by task ID. Each task ID maps to a list with one dictionary per test input, and each dictionary holds two predictions, attempt_1 and attempt_2, as shown in the example below. When a task has multiple test outputs that need to be predicted (e.g., task 12997ef3 below), they must be in the same order as the corresponding test inputs.

{"00576224": [{"attempt_1": [[0, 0], [0, 0]], "attempt_2": [[0, 0], [0, 0]]}],
 "009d5c81": [{"attempt_1": [[0, 0], [0, 0]], "attempt_2": [[0, 0], [0, 0]]}],
 "12997ef3": [{"attempt_1": [[0, 0], [0, 0]], "attempt_2": [[0, 0], [0, 0]]},
              {"attempt_1": [[0, 0], [0, 0]], "attempt_2": [[0, 0], [0, 0]]}], ...
}

See more details on Kaggle.
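Here is a minimal sketch of assembling a file in this shape. The `challenges` layout (task ID mapped to a task with train and test lists), the grids, and the `identity_baseline` placeholder solver are assumptions for illustration. Check the Kaggle instructions for the exact input files and paths.

```python
import json


def build_submission(challenges, predict):
    """Build the submission dict: one {"attempt_1", "attempt_2"} entry per test input.

    challenges: task_id -> task dict with "train" and "test" lists (assumed layout)
    predict:    callable (task, test_input) -> (first_grid, second_grid)
    """
    submission = {}
    for task_id, task in challenges.items():
        entries = []
        for test_pair in task["test"]:  # preserves the order of the test inputs
            first, second = predict(task, test_pair["input"])
            entries.append({"attempt_1": first, "attempt_2": second})
        submission[task_id] = entries
    return submission


def identity_baseline(task, test_input):
    """Placeholder solver: echoes the input grid for both attempts."""
    return test_input, test_input


def write_submission(submission, path="submission.json"):
    """Write the submission to disk; the file must be named submission.json."""
    with open(path, "w") as handle:
        json.dump(submission, handle)


# Made-up grids; the task IDs are borrowed from the format example above.
challenges = {
    "00576224": {"train": [], "test": [{"input": [[0, 0], [0, 0]]}]},
    "12997ef3": {
        "train": [],
        "test": [{"input": [[1, 0], [0, 0]]}, {"input": [[0, 2], [0, 0]]}],
    },
}
submission = build_submission(challenges, identity_baseline)
print(json.dumps(submission))
```

Calling `write_submission(submission)` at the end of your notebook produces the file in the working directory.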

Code Requirements

Submissions to this competition must be made through Notebooks. In order for the "Submit to Competition" button to be active after a commit, the following conditions must be met:

CPU Notebook <= 12 hours run-time
GPU Notebook <= 12 hours run-time
No internet access enabled
External data, freely & publicly available, is allowed, including pre-trained models
Submission file must be named submission.json

Hardware

ARC Prize 2025 will be designated to use Kaggle's pool of L4x4s. While it is hard to quantify performance improvement with a single metric, these L4x4 GPUs are significantly more powerful than the P100 alternative. Note that on Kaggle's platform, L4x4s consume GPU quota twice as fast as the older T4x2s and P100 machines.

The L4x4s offer 96 GB of GPU memory (compared to 29 GB per single P100 and T4x2).

For a further breakdown, see P100s vs L4x4s.

These are not expected to change anytime soon.

Submission Templates

Brute force - Credits go to phunghieu & team. The original write-up and published notebook can be found here: Write up Notebook
Icecuber's 2020 winning submission - Along with icecuber's great write-up about their submission
Fine-tuning Llama 3b - In this notebook, we will demonstrate how to fine-tune the instruct version of Llama 3 8B using Kaggle hardware.
Using a fine-tuned Llama 3b - In this notebook, we will demonstrate how to use a fine-tuned version of Llama 3 8B to solve ARC tasks. You can also experiment with the original version or other models compatible with Hugging Face’s infrastructure.

Paper Submissions

The 2025 Paper Awards exist to reward conceptual progress that best advances our understanding of how to achieve strong performance on ARC-AGI. Paper submissions must be linked to a Kaggle code submission that demonstrates the approach detailed in the paper. The code submission need not achieve a high score for the corresponding paper to be eligible for a Paper Award.

The rubric used to evaluate paper submissions is available here.

Here are paper writing tips from ARC Prize Co-founder François Chollet.

Your paper should include:

Abstract: Here's what the contribution is (e.g., "we present a method to solve ARC-AGI, with the following characteristics…")
Intro: Let's remind people about what ARC-AGI is, why it's important, and let's maybe quickly mention what the inspiration behind our approach was.
Prior work: Here's the list of previous approaches that are related to our approach. Note, they don't have to have inspired your approach, they only need to be similar in some way - they also don't need to have been formally published as papers. Highlight similarities and differences. Make sure that your approach still retains something unique (the contribution).
Approach: Here's how the approach works. Include an algorithm-level description.
Results: Here are the results we're getting on ARC-AGI! Make sure to mention scores on various different sets, e.g., Kaggle leaderboard, public eval… And don't report train set performance - that's not what the train set is for.
Conclusion: Just quickly summarize what the contribution was and what you achieved. Can be 2 lines.

Optionally, you can also have some kind of "analysis" section where you discuss the various things you think you've learned from your experimental results (e.g., what kinds of tasks do you solve? Why? What kind of tasks are out of reach?)

Remember that shorter & clearer is always better. Absolutely no filler. Absolutely no attempt to act out sophistication for the sake of it. No equations unless they're necessary to convey some concept formally. Papers are about communicating your ideas clearly so that others can learn from them and reuse them. Don't feel bad if your approach seems too "simple" when you write it down - that's actually a great thing. There are no points for complication.

Teams

Did you know that the highest performing ARC-AGI solutions are often made by teams?

Teams are a great way to combine ideas, learn from each other, and even make new friends who are passionate about solving ARC.

If you'd like to form a team, we encourage you to reach out to other participants, post on the Discord server, post on the Kaggle discussion board or reach out to us at team@arcprize.org.
