The ARC Prize Foundation is a nonprofit organization dedicated to open scientific progress through enduring AI benchmarks.
We build tools that provide empirical data about intelligence capabilities, guiding critical industry decisions about research, safety, and policy.
ARC Prize benchmarks are designed to measure AI progress, not to serve as a litmus test for AGI. ARC Prize tasks are not economically useful to target; instead, they are a measure of AI capability. Our goal is to provide an objective assessment of model capabilities rather than influence model training through repeated, iterative testing.
We do this in three ways:
Providing open source datasets (ARC-AGI-1 and ARC-AGI-2) and software for benchmarking model capabilities.
Keeping an up-to-date leaderboard of state-of-the-art model performance on the ARC-AGI "Semi-Private" Evaluation dataset. This leaderboard places no limitations on internet access or compute, and is intended to test selected state-of-the-art models and bespoke solutions.
Hosting an annual competition on an additional "Private" Evaluation dataset of ARC-AGI tasks for open-source models with bounded compute and no internet access.
ARC Prize accepts general-support donations from individuals, foundations, and AI labs. Donations, cash or in-kind (e.g., API/compute credits), have no influence over what we test, how we test, or when we publish. We do not withhold, edit, or delay testing results at any donor’s request, and we publish results on a standard cadence after evaluations are complete or the model is publicly released (see “Publication Timing” below). Our nonprofit mission and neutrality remain unchanged. Results are reproducible via the ARC-AGI Benchmarking repository.
A core design principle of ARC-AGI as an intelligence benchmark is that the test taker must not know what the test will be. Fluid intelligence cannot be hard-coded. To rigorously evaluate whether a system is truly learning and adapting - rather than merely recalling known solutions - it is essential that the Semi-Private and Private Evaluation ARC-AGI datasets remain secure. These hidden sets enable authoritative measurement of generalization and are critical for validating state-of-the-art claims without risk of overfitting to the training distribution.
For this reason, we are extremely selective about which submissions we choose to verify.
There are two types of submissions ARC Prize will currently consider for verification:
We do not verify non-open-source, non-commercial systems.
All models we test appear on our leaderboard, and all models on the leaderboard have been tested by us on the Semi-Private Evaluation dataset. Only aggregate scores are shared publicly. Average cost per task is calculated and reported. Model providers are notified when possible.
In the interest of measuring frontier AI capabilities, we selectively test publicly available models on the Public Evaluation dataset. Full results, including model outputs, evaluation durations, costs, and individual task scores, are shared alongside the overall model score on HuggingFace.
Note: The ARC-AGI Private Evaluation dataset is reserved exclusively for use in the annual competition to prevent data leakage.
We collaborate with selected (at our discretion) open-source and commercial model providers to test released and unreleased models for the benefit of the community.
A model is considered unreleased if its weights are neither open nor available via a public API or service. We will only test unreleased models intended for public launch.
Our approach to testing unreleased models:
This approach ensures that ARC Prize remains an independent evaluator of frontier AI capabilities.
Many researchers and companies develop custom solutions to ARC-AGI. Examples include submissions from MIT/Cornell and Jeremy Berman. For those that open source all parts of their solution, apart from API calls to third-party services (see constraints below), we will consider verifying new and plausible high-score claims.
Bespoke Solution Rules
Submissions
The ARC Prize team will verify solutions and scores within 2 weeks of submission.
Only new high-score claims (e.g., submissions that exceed the current high score by at least 1%) will be verified and published. This limits exposure of the Semi-Private Evaluation set and reduces the risk of it leaking into LLM training data.
If you have a solution that is expensive to run, we encourage you to test by randomly sampling 100 tasks from the 400 public evaluation tasks and holding out the remainder for private validation. This can build confidence in your overall ARC-AGI-Pub score before incurring significant cost on the full 400-task dataset.
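For illustration, here is a minimal sketch of drawing a reproducible 100-task sample and holding out the remainder. The directory layout (one JSON file per task under data/evaluation) and the fixed seed are assumptions for the sketch, not requirements of the submission process.

```python
import json
import random
from pathlib import Path

# Assumed layout: one JSON file per task in the public evaluation directory.
EVAL_DIR = Path("data/evaluation")
SAMPLE_SIZE = 100

task_files = sorted(EVAL_DIR.glob("*.json"))
assert len(task_files) == 400, "expected the full public evaluation set"

rng = random.Random(42)  # fixed seed so the split is reproducible
sampled = rng.sample(task_files, SAMPLE_SIZE)
held_out = [f for f in task_files if f not in sampled]

# Persist the split so the held-out tasks can serve as private validation later.
with open("sampled_tasks.json", "w") as fh:
    json.dump([f.stem for f in sampled], fh, indent=2)
with open("held_out_tasks.json", "w") as fh:
    json.dump([f.stem for f in held_out], fh, indent=2)

print(f"Sampled {len(sampled)} tasks; held out {len(held_out)} for validation.")
```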
Submit an ARC-AGI-Pub high score claim.
Compute and/or provider costs can be significant to run solutions against evaluation sets. To help support those contributing to this initiative, we’ve set up a verification fund.
An ARC-AGI submission to be verified can use up to $10,000 in resources across the 500 puzzles (400 from the public evaluation set + 100 from the Semi-Private Evaluation set). We retain the right to refuse verification against any previous benchmark in the event of future benchmark releases (such as ARC-AGI-2 or 3).
For each new proven SOTA reproduction (i.e., when a solution scores higher than the current highest score on this leaderboard on the public evaluation set), we will reimburse up to $2,500.
This fund is a work-in-progress and we reserve the right to make changes at any time or refuse reimbursement requests upon consideration by the ARC Prize team.
We will continuously add new models and unlist old ones. It is not feasible to add every possible model due to the cost and scalability of our evaluation process. Our reasoning for being extremely selective about exposure to our Semi-Private and Private Evaluation datasets is explained at the top of this page.
We want to assess performance across different levels of reasoning. To do this, we will repeat model tests at spaced reasoning levels.
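As a rough sketch of what repeating tests at spaced reasoning levels can look like: the harness function run_arc_eval, the level names, and the dummy scores below are hypothetical placeholders, not part of our actual evaluation pipeline.

```python
from statistics import mean

# Hypothetical harness call: in practice this would run the model on every task
# at the given reasoning level and return per-task scores in [0, 1].
# It returns fixed dummy numbers here purely so the sketch executes.
def run_arc_eval(model: str, reasoning_level: str) -> list[float]:
    dummy = {"low": [0.2, 0.3, 0.1], "medium": [0.4, 0.5, 0.3], "high": [0.6, 0.7, 0.5]}
    return dummy[reasoning_level]

REASONING_LEVELS = ["low", "medium", "high"]  # spaced settings; names are assumptions

def score_across_levels(model: str) -> dict[str, float]:
    """Aggregate score at each reasoning level for a single model."""
    return {level: mean(run_arc_eval(model, level)) for level in REASONING_LEVELS}

print(score_across_levels("example-model"))
```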
No. This is a common misconception. The ARC-AGI-Pub leaderboard is open to all models. Here's an example of a prompt that was used to test on ARC-AGI: Prompt
If a model selected for verification is open source and not available via an API from the model creator, we will use another public model provider, such as Baseten or TogetherAI.
Cost is a critical factor in model evaluation, and whenever possible, we will use retail pricing to assess cost efficiency. For model providers, we will base cost calculations on publicly available retail rates—typically measured in price per million tokens—rather than a provider’s internal margins or raw cost of goods. Costs are generally shared on an average per-test-pair-attempt basis.
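As a simplified illustration of this costing approach, the sketch below converts retail per-million-token rates into an average cost per test-pair attempt. The rates and token counts are placeholder values, not real provider pricing.

```python
from statistics import mean

# Illustrative retail rates in USD per million tokens (placeholders, not real pricing).
PRICE_PER_M_INPUT = 3.00
PRICE_PER_M_OUTPUT = 15.00

def attempt_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single test-pair attempt at retail per-million-token rates."""
    return (
        (input_tokens / 1_000_000) * PRICE_PER_M_INPUT
        + (output_tokens / 1_000_000) * PRICE_PER_M_OUTPUT
    )

# Example: average cost per test-pair attempt across an evaluation run,
# using made-up (input, output) token counts for three attempts.
attempts = [(5_200, 1_100), (4_800, 950), (6_000, 1_400)]
avg_cost = mean(attempt_cost(i, o) for i, o in attempts)
print(f"Average cost per test-pair attempt: ${avg_cost:.4f}")
```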
We are a nonprofit that seeks to provide transparency in our testing. We invite the community to reproduce our results.
If you'd still like to test your system on ARC-AGI but are not selected for Semi-Private verification, you are free to test on public data and self-report your scores to the community.
The ARC Prize Foundation is a nonprofit funded by donations, including support from individuals, foundations, and AI labs (general support). We also accept in-kind credits (API/compute). Donor status does not affect eligibility, methods, scoring, publication timing, or access to Semi-Private/Private evaluations.
We publicly disclose lab donations and in-kind support. We do not withhold or delay results at any donor’s request. Our commitment is scientific rigor, transparency, and impartiality.
If you’d like to support our work, please visit our Donation page.
Feel free to contact us at: team@arcprize.org