ARC Prize Verified
Official Testing Policy
Purpose
This document describes how ARC Prize Foundation tests AI systems, how we select what to test, and how we publish results. Trust in benchmark scores depends on trust in the testing process. We publish this policy so that labs, researchers, policymakers, and the public can audit our methods and hold us accountable.
All points described here are designed to ensure that ARC Prize benchmark scores are a reliable and reproducible signal of AI capability.
ARC Prize Foundation
The ARC Prize Foundation is a nonprofit organization dedicated to open scientific progress through enduring AI benchmarks.
We make tools that provide empirical data about intelligence capabilities which guide industry decisions about research, safety, and policy.
We organize our efforts around four initiatives:
- Open source datasets (ARC-AGI-1, ARC-AGI-2, ARC-AGI-3), along with software for benchmarking model capabilities.
- An up-to-date Verified Leaderboard of state-of-the-art model performance on ARC-AGI.
- The ARC Prize competition, which awards prizes to open source solutions to ARC-AGI.
- The ARC-AGI Community Leaderboard, where researchers and builders can share and discuss their work with the broader community.
Organization & Governance
ARC Prize Foundation was co-founded by François Chollet and Mike Knoop in April 2024. The organization is led by president Greg Kamradt.
See our full team and board.
Funding & Independence
ARC Prize Foundation is a nonprofit funded by donations from individuals, foundations, and AI labs. We publicly disclose all donors on our donation page. Sponsors receive no privileged access to our Private or Semi-Private Evaluation datasets, nor any special influence over the development of our benchmarks, roadmap, or methodologies.
Cash, in-kind donations (e.g., API/compute credits), or other contributions have no influence over what we test, how we test, or when we publish. We do not withhold, edit, or delay testing results at any sponsor's request, and we publish results on a standard cadence after evaluations are complete or the model is publicly released (see "Publication Timing" below).
No sponsor, regardless of contribution level, gains access to proprietary information, including but not limited to unpublished evaluation data, testing methodologies, or future benchmark designs.
Conflicts of Interest
Board members and staff must disclose any personal interest in a lab whose models we test - including equity holdings, advisory roles, and previous or pending employment. Anyone with such an interest recuses themselves from decisions about testing or publishing results for that lab.
Independent Academic Panel
Our testing methodology and this policy are reviewed by an independent academic panel. Their role is to provide external oversight and validation of ARC Prize's benchmarking approach, including our scoring methodology, dataset security practices, and publication standards. The panel includes Todd Gureckis (Professor of Psychology and Data Science, NYU), Melanie Mitchell (Professor at the Santa Fe Institute), and Vishal Misra (Vice Dean of Computing and AI at Columbia).
The panel provides a high-level review of our policy and approach. It does not review every individual test result.
Testing Methodology
Our mission is to provide a high-signal measure of progress towards AGI. All versions of ARC-AGI provide a human baseline against which we compare AI performance. To reduce false positives for AGI progress, our scoring methodology attempts to replicate the same testing procedure for all test-takers (whether AI or human), so that no one benefits from additional information, context, strategy, or answers.
How We Run Evaluations: ARC-AGI-1 & ARC-AGI-2
ARC-AGI-1 and 2 evaluations are run using the open source ARC-AGI Benchmarking repository.
When a new model is released, we create a new model configuration (example) that specifies the model name, reasoning levels, and token limits. We then run the evaluation against the benchmark.
Public testing results (model outputs, evaluation durations, costs, and individual task scores) are published to HuggingFace.
How We Run Evaluations: ARC-AGI-3
ARC-AGI-3 evaluations are run using the open source ARC-AGI-3 Benchmarking repository.
As with ARC-AGI-1 & 2, we create a model configuration (example) for each new model specifying model name, reasoning levels, and token limits.
Results are published as scorecards on arcprize.org (e.g., GPT-5.4 (High) scorecard).
New in ARC-AGI-3 is the concept of replays. You can view the exact run a model performed on any individual task. For example, here is a replay of GPT-5.4 (High) on task "re86". Replays provide full transparency into how a model interacted with a task during evaluation.
Dataset Security
A core design principle of ARC-AGI is that the test taker must not know what the test will be. To evaluate whether a system is learning and adapting, it is essential that evaluation datasets remain private and secure.
ARC-AGI datasets are organized into two tiers:
- Public Tasks - Fully open source and available for anyone to use. These are published in our GitHub repositories and are intended for research, development, and community experimentation.
- Private Tasks - Tasks that are not publicly available. Private tasks are further divided into two categories:
  - Semi-Private Evaluation Set - Used for frontier model testing on the Verified Leaderboard. When we evaluate frontier models, we expose tasks to third-party APIs. We require zero data retention agreements with all model providers we test. We also work closely with providers to prevent unintended data persistence. However, because tasks are sent to external APIs, we acknowledge the possibility of limited leakage over time. This is why we call it the "Semi-Private" set.
    To manage this exposure over time, we rely on two primary defenses: zero data retention agreements with providers, and the release of successive ARC-AGI benchmark versions on a roughly annual basis, which shifts the frontier signal onto fresh tasks as older sets accumulate exposure. We also monitor for overfitting by tracking the performance gap between Public and Semi-Private tasks over time. Because Public tasks are openly available and therefore more susceptible to overfitting, the gap between Public and Semi-Private performance is informative. If the gap narrows over time, it can signal that the Semi-Private set is becoming more exposed or that models are overfitting to it.
  - Private Evaluation Set - Access is restricted to a small number of trusted parties. This set is used for the ARC Prize competition private leaderboard.
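The gap monitoring described above can be illustrated with a small sketch. This is not ARC Prize's internal tooling: the `ScorePair` type, the `gap_is_narrowing` helper, the `min_drop` tolerance, and the sample scores are all hypothetical, chosen only to show the idea of comparing the Public minus Semi-Private gap across evaluations.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ScorePair:
    """Scores for one evaluation, in percent (hypothetical structure)."""
    label: str           # e.g. a model name or evaluation date
    public: float        # score on the Public set
    semi_private: float  # score on the Semi-Private set

    @property
    def gap(self) -> float:
        # Positive when the Public score is higher than the Semi-Private score.
        return self.public - self.semi_private


def gap_is_narrowing(history: Sequence[ScorePair], min_drop: float = 1.0) -> bool:
    """Return True if the latest gap is at least `min_drop` points smaller
    than the earliest gap. `min_drop` is an assumed tolerance, not a policy value."""
    if len(history) < 2:
        return False
    return history[0].gap - history[-1].gap >= min_drop


# Made-up numbers for illustration only.
history = [
    ScorePair("evaluation-1", public=40.0, semi_private=30.0),  # gap of 10 points
    ScorePair("evaluation-2", public=42.0, semi_private=38.0),  # gap of 4 points
]
print(gap_is_narrowing(history))
```

A narrowing gap is only a signal for closer inspection; as the policy notes, it can indicate exposure but does not prove it.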
Working with Providers
Our approach to working with providers and testing models:
- Not a Development Tool - Our evaluations are intended to assess a model's performance, not serve as an iterative development tool. Providers should not expect continuous optimization cycles based on our feedback.
- Establish Trust - We build relationships (personal and contractual) with lab researchers and leaders to ensure that our data is being used in the spirit of what ARC Prize is trying to measure.
- Confidentiality Agreements - We sign confidentiality agreements where applicable, working closely with model providers to ensure that no data from the Semi-Private Evaluation set is retained and collaborating on best practices to prevent unintended data persistence.
- API/Compute Contributions - We may accept unrestricted API credits or compute from any sponsor, including labs. Contributions provide no editorial, methodological, or scheduling control and no preferential access. All such support is disclosed.
What We Test
Verified Leaderboard
We collaborate with selected (at our discretion) open-source and commercial model providers to test models.
Selection criteria are subject to change at any time given input from our independent academic panel and board. Not all previously verified submissions are guaranteed to meet the following criteria.
We test public models from trusted sources: submissions from trusted partners with public, high-usage, commercially available model APIs (e.g., OpenAI, xAI, Google). Commercial API providers must have more than $10M USD in gross revenue per month (i.e., leading AI companies) to ensure sufficient commercial generalization pressure against benchmark overfitting.
We cap our evaluations at $10,000 USD per run. No single Semi-Private evaluation run can exceed this amount. We use a single run and do not average scores across runs.
As a small nonprofit organization, it's not possible for us to test every model/system that is requested.
Publication Timing
For unreleased models, we publish no later than 30 days after public release, or 30 days after evaluation completion if the model is already public, whichever is earlier. Sponsors cannot impose additional embargoes.
Community Leaderboard
For models and systems that do not meet the criteria for the Verified Leaderboard, we have the Community Leaderboard. This leaderboard is meant to highlight community contributions. We apply a light review to each submission. Submissions to the ARC-AGI Community Leaderboard are made via GitHub; after review, your submission will appear on the leaderboard.
We do not verify submissions on the Community Leaderboard by default. We may selectively verify a small number of submissions that we determine to be extraordinary, but the default expectation should be that submissions will not be verified by ARC Prize. To share your results with the community, please use the ARC-AGI Community Leaderboard.
Verification Process
Exceptional open-source submissions on the Community Leaderboard may be selected for verification, offering a path to the Verified Leaderboard.
The ARC Prize Board decides which claims to pursue, and consults our Academic Panel for a second opinion when needed.
Before submitting for verification, please contact the ARC Prize team at team@arcprize.org for a consultation.
Submission Rules
- You are allowed to use internet access and call external APIs. Note: We rarely verify submissions if they require sending data to an untrusted API.
- Any APIs called must be publicly and commercially available for others to use.
- If you choose to use APIs that cost money, ARC Prize Foundation must be able to sign up with that API provider.
- Solutions must be submitted via a Kaggle notebook and run in <12 hours to ensure reproducibility. (See Kaggle docs)
- There are no limits on the amount of compute or capital used to pre-train models that your solution leverages.
- $10,000 USD is the maximum amount that can be spent on runtime costs, including calling commercial APIs.
- Selected submissions are verified using a Semi-Private Evaluation set. New scores are accepted when the Semi-Private and Public Evaluation sets are in good agreement. What counts as "good agreement" varies by benchmark version, because the relationship between Public and Semi-Private difficulty differs across versions. In each case, agreement means the absolute difference between the Public and Semi-Private scores falls within the stated range.
  - ARC-AGI-1 - Public tasks are easier than Semi-Private, so we expect Public scores to be the higher of the two; scores are in good agreement when within ±10 percentage points.
  - ARC-AGI-2 - Difficulty is much more closely calibrated between the two sets; scores are in good agreement when within ±3 percentage points.
  - ARC-AGI-3 - The public demo is harder than the Semi-Private set, so we expect Semi-Private scores to be the higher of the two; scores are in good agreement when within ±15 percentage points.
- Submissions must be open source to qualify for verification and reimbursement (see Verification Fund below). By "open source" we mean the full solution needed to run the model is publicly available - not just downloadable weights.
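The "good agreement" rule in the submission rules above can be expressed as a short check. The sketch below is illustrative only: the function name and dictionary are hypothetical, and treating the boundary as inclusive (a difference of exactly the threshold still counts) is an assumption, since the policy only says the difference must fall "within the stated range".

```python
# Thresholds in percentage points, taken from the submission rules.
AGREEMENT_THRESHOLDS = {
    "ARC-AGI-1": 10.0,
    "ARC-AGI-2": 3.0,
    "ARC-AGI-3": 15.0,
}


def in_good_agreement(version: str, public: float, semi_private: float) -> bool:
    """Compare the absolute Public vs. Semi-Private score difference
    (both in percent) against the threshold for the benchmark version.
    The inclusive boundary is an assumption."""
    return abs(public - semi_private) <= AGREEMENT_THRESHOLDS[version]


# Made-up scores for illustration only.
print(in_good_agreement("ARC-AGI-2", public=20.0, semi_private=18.0))  # difference of 2 points
print(in_good_agreement("ARC-AGI-2", public=20.0, semi_private=15.0))  # difference of 5 points
```

The check is symmetric by design: the policy states which set is expected to score higher for each version, but defines agreement by the absolute difference.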
How We Run Submissions
To minimize debugging cycles and ensure reproducibility, we require submissions to be one-click runnable via a Kaggle notebook. Here's how this works in practice:
- Code Audit - We will review your code to ensure it meets the submission rules, does not log sensitive information, and does not call any unknown APIs. If using an external repo, that repo must be public.
- Kaggle as the entry point - The Kaggle notebook serves as the entry point and runtime environment for your submission. We will open your notebook, swap datasets, and click "Save" to run it within 12 hours.
- Third-party compute is expected - Since Kaggle's built-in compute is limited, most competitive submissions call out to a third-party compute provider (e.g., Modal, Lambda, RunPod, etc.) for the actual processing. This is perfectly acceptable and often necessary.
- Automate everything - All configuration, setup, and compute provisioning must be automated within the Kaggle notebook itself. We will sign up for your specified third-party provider and supply our own credentials/API keys, but we should not need to manually configure infrastructure, run separate scripts, or debug setup issues.
- Include clear provider instructions - Specify which third-party provider(s) you use and any account setup requirements (e.g., "requires a Modal account with X quota"). We will handle account creation and billing.
The goal is a reproducible experience not only for ARC Prize, but for the community after the submission is verified. If your submission requires manual intervention or troubleshooting to execute, it may be returned for revision or rejected.
Example submissions: J Berman, Poetiq
Verification Fund
Running solutions against evaluation sets can incur significant compute and/or provider costs. To help support those contributing to this initiative, we've set up a verification fund.
For each new verified high-score reproduction, we will reimburse up to $2,500.
This fund is a work-in-progress and we reserve the right to make changes at any time or refuse reimbursement requests upon consideration by the ARC Prize team.
Certified Scores & Verified Badges
To strengthen the credibility of verified submissions and maintain trust in the ARC Prize brand and associated benchmarks, the following guidelines apply to ARC Prize Verified badges.
- ARC Prize Verified badge assets are intended only for display alongside verified scores. Badges should not be displayed next to unverified results.
- Only makers of ARC Prize Verified submissions are permitted to display ARC Prize branded badges.
- Badge assets are to be used as is with no modifications apart from sizing to fit within a given context.
- The recommended height for badge display is 40 pixels; badges should be rendered no smaller than 30 pixels in height.
Places where badges, appropriately associated with verified scores, might be displayed:
- Social media images
- Academic papers
- Benchmark results on websites
ARC Prize Verified Badges
Here are the badge assets available for download.
FAQ
What models do you evaluate? Why not all?
We do not verify submissions by default. We may selectively add new models and unlist old ones, but it is not feasible to verify every submission due to cost, team capacity, and the need to limit exposure to our Semi-Private and Private Evaluation datasets. To share your results, please use the ARC-AGI Community Leaderboard.
What about reasoning models? Which reasoning level will you use?
We are interested in assessing performance across different levels of reasoning. To do this, we will often repeat model tests at varied reasoning levels.
We may not publish results for every reasoning level. In some cases a level cannot be completed reliably due to API-level issues - for example, timeouts and retries that fail to resolve, which we see more often at higher reasoning levels. When we do not publish a given reasoning level, we will say so explicitly and explain why it was omitted.
Does the model need to be multimodal to be tested?
No. The leaderboard is open to all model types.
How do you test open-source models? Which provider do you use?
If a model selected for verification is open-source and not available via API by the model creator, we will use another public model provider.
What cost metric will you report?
We will use retail pricing to assess cost efficiency. For model providers, we will base cost calculations on publicly available retail rates, typically measured in price per million tokens, rather than a provider's internal margins or raw cost of goods.
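As a rough illustration of a retail-price cost calculation, the sketch below converts token counts into a dollar cost and compares it against the $10,000 USD per-run cap stated elsewhere in this policy. The function names, the token counts, and the prices are hypothetical. Charging separate per-million rates for input and output tokens is also an assumption that reflects how many providers publish pricing, not a statement of how ARC Prize computes cost.

```python
RUN_COST_CAP_USD = 10_000.0  # per-run cap stated in this policy
TOKENS_PER_MILLION = 1_000_000


def run_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Retail cost of a run, given token counts and per-million-token prices."""
    input_cost = input_tokens / TOKENS_PER_MILLION * input_price_per_million
    output_cost = output_tokens / TOKENS_PER_MILLION * output_price_per_million
    return input_cost + output_cost


def within_cap(cost_usd: float) -> bool:
    """True if the run cost does not exceed the per-run cap."""
    return cost_usd <= RUN_COST_CAP_USD


# Hypothetical usage and prices, for illustration only.
cost = run_cost_usd(
    input_tokens=2_000_000,
    output_tokens=1_000_000,
    input_price_per_million=1.0,
    output_price_per_million=4.0,
)
print(cost, within_cap(cost))
```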
Why should the community trust ARC Prize?
We are a nonprofit that seeks to provide transparency in our testing. We invite the community to reproduce our results. Our independent academic panel also provides external oversight of our testing process.
What happens if someone violates this policy?
We hope this never happens, but the integrity of the test depends on taking violations seriously. If we have reason to believe a submission has violated this policy - for example, by targeting our evaluation sets or otherwise manipulating results - we will conduct an investigation. The ARC Prize Board will make a recommendation, informed by the advice of our independent Academic Panel, and we will take action.
Actions may include invalidating and removing the affected results from the Verified Leaderboard, publicly noting that those results were invalidated, and barring the party from future testing - up to and including permanent exclusion. We will be transparent about the outcome.
What if my submission is not selected for verification?
We encourage you to submit your work to the ARC-AGI Community Leaderboard, where the community can review and discuss your results. You are also free to test on public data and share your scores independently. Please state clearly the data you tested on, how you tested, and that your results are not verified by ARC Prize.
Who will fund this effort? Any conflicts of interest?
The ARC Prize Foundation is a nonprofit funded by donations, including support from individuals, foundations, and AI labs. We also accept in-kind service credits. Sponsor status does not affect verification eligibility, methods, scoring, publication timing, or access to Semi-Private/Private evaluations.
We publicly disclose lab donations and in-kind support. We do not withhold or delay results at any sponsor's request. Our commitment is scientific rigor, transparency, and impartiality.
If you’d like to support our work, please visit our Donation page.
Do your model configuration files prevent models from using Python tools?
By default, yes. We do not enable additional tools behind the model, including code execution. We specifically do not enable web search, because that could leak Semi-Private data to the web. If we ever do enable tools for a given evaluation, the model configuration files will state this explicitly. Our philosophy is that tool use should be opt-in, not opt-out, so any tool use will always be declared.
Are models tested as agents? What tools or actions can they use?
For ARC-AGI-1 and ARC-AGI-2, models are evaluated as direct input→output predictors: they receive the task and return an answer grid, with no agent harness and no client-side tools (consistent with our stateless-client philosophy). The helper actions available in the human testing interface are output-construction conveniences only - they do not provide any information advantage toward solving a task. That said, we encourage building harnesses and agents on top of ARC-AGI-1 and ARC-AGI-2 to see how they perform. The Community Leaderboard is a great place to submit and share those results.
ARC-AGI-3 is interactive: models take actions within each task to play the game, so the available action space is part of the task itself. In any case where a model is tested with a defined action or tool space, that space is specified in the open-source model configuration. Humans are given the same input actions, available as buttons in the testing app and as keyboard inputs.
What is the role of human data within the benchmarks?
Human solvability plays an important role within the ARC-AGI series of benchmarks. Where applicable, we publish first-party human data that we've collected. So far, we've collected human data for ARC-AGI-2 (found here) and ARC-AGI-3 (found here).
Have feedback?
Feel free to contact us at: team@arcprize.org



