AGI remains unsolved.
New ideas still needed.

ARC Prize Verified

Official Testing Policy

Open AGI Progress

The ARC Prize Foundation is a nonprofit organization dedicated to open scientific progress through enduring AI benchmarks.

We make tools that provide empirical data about intelligence capabilities, guiding critical industry decisions about research, safety, and policy.

ARC Prize benchmarks are designed to measure AI progress, not to serve as a litmus test for AGI. ARC Prize tasks are not economically useful to target; instead, they are a measure of AI capability. Our goal is to provide an objective assessment of model capabilities rather than to influence model training through repeated, iterative testing.

We provide transparency via 3 initiatives:

  1. Providing open source datasets (ARC-AGI-1 and ARC-AGI-2) and software for benchmarking model capabilities.

  2. Maintaining an up-to-date leaderboard of state-of-the-art model performance on the ARC-AGI "Semi-Private" Evaluation dataset. This leaderboard has no limitations on internet access or compute, and is intended to test selected state-of-the-art models and bespoke solutions.

  3. Hosting competitions on an additional "Private" Evaluation dataset of ARC-AGI tasks for open-source models with constraints including bounded compute and no internet access.


ARC Prize Verified

Certified Scores

Self-reported or third‑party ARC-AGI result figures often vary in dataset curation, prompting methods, and many other factors, which prevents an apples‑to‑apples comparison of results. This causes confusion in the market and ultimately detracts from our goal of measuring frontier AI progress.

ARC Prize Verified submissions are provided official badge assets to optionally display alongside reported ARC-AGI scores. Badge usage guidelines can be found below.

Independent Academic Panel

The ARC Prize Verified program ensures the integrity of benchmark results through third-party academic oversight. The program is currently overseen by an independent academic panel, including Todd Gureckis (Professor of Psychology, NYU), Guy Van den Broeck (Professor of Computer Science, UCLA), Melanie Mitchell (Professor at the Santa Fe Institute), and Vishal Misra (Vice Dean of Computing and AI at Columbia), who provide an audit and academic validation of our verification method.

Funding & Independence

ARC Prize accepts general-support donations from individuals, foundations, and AI labs. Sponsors receive no privileged access to our Private or Semi-Private Evaluation datasets, nor any special influence over the development of our benchmarks, roadmap, or methodologies.

Cash, in-kind donations (e.g., API/compute credits), or other contributions have no influence over what we test, how we test, or when we publish. We do not withhold, edit, or delay testing results at any sponsor's request, and we publish results on a standard cadence after evaluations are complete or the model is publicly released (see "Publication Timing" below).

No sponsor, regardless of contribution level, gains access to proprietary information, including but not limited to unpublished evaluation data, testing methodologies, or future benchmark designs. Our nonprofit mission and neutrality remain unchanged. Results are reproducible via the ARC-AGI Benchmarking repository.


Testing Policies

Model Selection

A core design principle of ARC-AGI as an intelligence benchmark is that the test taker must not know what the test will be. Fluid intelligence cannot be hard-coded. To rigorously evaluate whether a system is truly learning and adapting - rather than merely recalling known solutions - it's essential that the Semi-Private and Private Evaluation ARC-AGI datasets remain secure. These hidden sets enable authoritative measurement of generalization and are critical for validating state-of-the-art claims without risk of overfitting to the training distribution.

For this reason, we are extremely selective about which submissions we choose to verify.

Valid Submission Types for Verification

There are 3 types of submissions ARC Prize will currently consider for verification. Selection criteria are subject to change at any time given input from our independent academic panel and board. Not all previously verified submissions are guaranteed to meet the following criteria.

  1. Public Models from Trusted Sources - Submissions from trusted partners with publicly and commercially available products (e.g., OpenAI, xAI, Google)
  2. Novel Systems - Original architectures or methods developed from scratch to tackle ARC-AGI, rather than built on top of existing frontier models. Must open source all code.
  3. Model Refinements - Submissions that enhance or refine frontier model outputs through custom harnesses, prompting strategies, or other refinement techniques. Must open source all code.

Solutions that meet the constraints of an active or upcoming ARC Prize competition should be submitted via competition.

We do not verify non-open-source, non-commercial systems from untrusted sources.

The ARC Prize Verified program is not intended to certify all proprietary AI solutions. We're aware that many startups and researchers see value in endorsement from ARC Prize. However, as a small nonprofit organization, it's not possible for us to fully vet sources and certify results for every submission.


Public Models from Trusted Sources

We collaborate with selected (at our discretion) open-source and commercial model providers to test released and unreleased models for the benefit of the community.

A model is considered unreleased if its weights are neither open nor available via a public API or service. We will only test unreleased models intended for public launch.

Our approach to testing unreleased models:

  • Establish Trust. Working with unreleased models requires a foundation of trust. To uphold this trust, we follow strict confidentiality agreements.
  • Data Retention Agreements. We will work closely with model providers to ensure that no data from the Semi-Private Evaluation set is retained. This includes collaborating on best practices to prevent unintended data persistence. Our goal is to minimize any risk of data leakage while maintaining the integrity of our evaluation process.
  • API/Compute Contributions (No Control). We may accept unrestricted API credits or compute from any sponsor, including labs. Contributions provide no editorial, methodological, or scheduling control and no preferential access. All such support is disclosed.
  • Publication Timing (No Withholding). We publish no later than 30 days after an unreleased model's public launch or, if the model is already public, no later than 30 days after evaluation completion, whichever is earlier. Sponsors cannot impose additional embargoes.
  • Not a Development Tool. Our evaluations are intended to assess a model’s performance, not serve as an iterative development tool. Providers should not expect continuous optimization cycles based on our feedback.

This approach ensures that ARC Prize remains an independent evaluator of frontier AI capabilities.

Verified results from selected models, including model outputs, evaluation durations, costs, and individual task scores, are shared alongside the overall model score on HuggingFace.
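
For anyone who wants to pull those per-task records programmatically, a minimal sketch using the datasets library might look like the following; the repository id and column names are placeholders, so check the ARC Prize organization on HuggingFace for the actual dataset.

    # Sketch: loading verified per-task results from HuggingFace.
    # "arc-prize/verified-results" and the column names are hypothetical placeholders.
    from datasets import load_dataset

    results = load_dataset("arc-prize/verified-results", split="train")
    print(results[0])   # e.g., model name, task id, score, cost, duration

    # Example aggregation: average score per model.
    from collections import defaultdict
    totals = defaultdict(lambda: [0.0, 0])
    for row in results:
        totals[row["model"]][0] += row["score"]
        totals[row["model"]][1] += 1
    for model, (score_sum, n) in totals.items():
        print(f"{model}: {score_sum / n:.3f} average over {n} tasks")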


Novel Systems and Model Refinements

Novel Systems

A submission is considered a Novel System if it is built largely by the researcher or team specifically to tackle ARC-AGI. Many systems featured in our paper awards fall into this category. While they may not achieve the highest scores, they demonstrate new ideas and research boldness that push the boundary on ARC-AGI.

An example of a Novel System is the ARC Prize 2025 first-place Paper Prize winner, "Less is More: Recursive Reasoning with Tiny Networks" by A. Jolicoeur-Martineau.

Novel Systems are considered for verification at the discretion of the ARC Prize team. All Novel Systems must be open sourced and follow the Submission Rules above.

Model Refinements

A submission is considered a Model Refinement if it is largely built upon an existing frontier model. Many solutions use these underlying models combined with techniques like refinement, prompting strategies, or evolutionary methods to make progress on ARC-AGI.

Examples of Model Refinements include the ARC Prize 2025 Paper Prize runners-up: "From Parrots to Von Neumanns: How Evolutionary Test-Time Compute Achieved State-of-the-Art on ARC-AGI" by J. Berman and "Efficient Evolutionary Program Synthesis" by E. Pang.

Due to the volume and variability of these submissions, we apply specific criteria to determine which Model Refinements are selected for verification. As refinement techniques have become easier and less expensive to iterate, we've established these guidelines to ensure leaderboard entries remain meaningful and reflect genuine progress rather than incremental noise.

Model Refinements are listed on the leaderboard with the underlying model name plus "Refinement" (e.g., "Gemini 3 Pro (Refine.)").

  • Each base model can have only one refinement entry, chosen by highest score. If multiple models are used, ARC Prize will work with submission authors to determine the base model.
  • Authors can only appear on the leaderboard once, for their highest scoring refinement.
  • Authors are limited to one verification per quarter.

Submitting for Verification on ARC-AGI

Submission Rules

Many researchers and companies develop custom solutions to ARC-AGI. For those that have open-sourced all parts of their solution - apart from API calls to third-party services (see constraints below) - we will consider verification of new, plausible high-score claims. Consideration does not guarantee selection for verification.

  1. You are allowed to use internet access and call external APIs.
  2. Any APIs called must be publicly and commercially available for others to use.
  3. If you choose to use APIs that cost money, you'll be expected to supply API keys we can use during submission verification.
  4. Solutions must be submitted via a Kaggle notebook and run in <12 hours to ensure reproducibility. (See Kaggle docs)
  5. There are no limits on the amount of compute or capital used to pre-train models that your solution leverages.
  6. $10,000 USD is the maximum amount that can be spent on runtime costs to solve up to 500 tasks (depending on the chosen benchmark version), including calling commercial APIs.
  7. Selected submissions are verified using a Semi-Private Evaluation set. New scores are accepted when the Semi-Private and Public Evaluation sets are in good agreement.
  8. Submissions must be open source to qualify for verification and reimbursement (see Verification Fund below).
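
To make the rules above concrete, here is a minimal sketch of what the core of a compliant submission might look like: it reads ARC-AGI tasks in their standard JSON format (each task file contains "train" and "test" lists of input/output grid pairs), produces one prediction per test input, and writes a predictions file. The directory layout, solver, and output format shown are illustrative assumptions, not a required structure; a real submission would call a publicly and commercially available API inside solve_task, with keys supplied at verification time.

    # Minimal sketch of a submission's core loop (illustrative, not a required structure).
    # Assumes ARC-AGI tasks are JSON files with "train" and "test" lists of
    # {"input": grid, "output": grid} pairs, where a grid is a list of lists of ints 0-9.
    import json
    from pathlib import Path

    TASK_DIR = Path("data/evaluation")   # swapped for the Semi-Private set at verification time
    OUTPUT_PATH = Path("submission.json")

    def solve_task(train_pairs, test_input):
        """Placeholder solver. A real submission would call a publicly and
        commercially available API here (rules 1-3), with keys read from the
        environment rather than hard-coded or logged."""
        return test_input   # trivial baseline: echo the test input unchanged

    predictions = {}
    for task_file in sorted(TASK_DIR.glob("*.json")):
        task = json.loads(task_file.read_text())
        predictions[task_file.stem] = [
            {"attempt_1": solve_task(task["train"], pair["input"])}
            for pair in task["test"]
        ]

    OUTPUT_PATH.write_text(json.dumps(predictions))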

How We Run Submissions

To minimize debugging cycles and ensure reproducibility, we require submissions to be one-click runnable via a Kaggle notebook. Here's how this works in practice:

  • Code Audit. We will review your code to ensure it meets the submission rules and does not log sensitive information or call any unknown APIs.
  • Kaggle as the entry point. The Kaggle notebook serves as the entry point and runtime environment for your submission. We will open your notebook, swap datasets, and click "Save" to run it within 12 hours.
  • Third-party compute is expected. Since Kaggle's built-in compute is limited, most competitive submissions call out to a third-party compute provider (e.g., Modal, Lambda, RunPod, etc.) for the actual processing. This is perfectly acceptable and often necessary.
  • Automate everything. All configuration, setup, and compute provisioning must be automated within the Kaggle notebook itself. We will sign up for your specified third-party provider and supply our own credentials/API keys, but we should not need to manually configure infrastructure, run separate scripts, or debug setup issues.
  • Include clear provider instructions. Specify which third-party provider(s) you use and any account setup requirements (e.g., "requires a Modal account with X quota"). We will handle account creation and billing.
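
As a sketch of what "one-click runnable" can look like in practice, the first cell of a notebook might install the provider client, pull credentials from Kaggle's secrets store, and hand off to the submission's main script. The secret name, provider package, and script name here are assumptions for illustration, not requirements.

    # Sketch of a "one-click" Kaggle notebook entry cell (illustrative assumptions throughout).
    import os
    import subprocess

    # Install the third-party compute provider's client inside the notebook itself,
    # so no manual setup is needed (hypothetical package name).
    subprocess.run(["pip", "install", "--quiet", "example-provider-client"], check=True)

    # Read the API key from Kaggle's secrets store rather than hard-coding or logging it.
    from kaggle_secrets import UserSecretsClient
    os.environ["PROVIDER_API_KEY"] = UserSecretsClient().get_secret("PROVIDER_API_KEY")

    # Hand off to the submission's main script; everything downstream (provisioning,
    # inference, writing the predictions file) should run unattended within 12 hours.
    subprocess.run(["python", "run_submission.py", "--task-dir", "data/evaluation"], check=True)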

The goal is a reproducible experience not only for ARC Prize but also for the community after the submission is verified. If your submission requires manual intervention or troubleshooting to execute, it may be returned for revision or rejected.

Example submissions: J Berman, Poetiq


There might be a queue of submissions with varying score claims at any given time. We reserve the right to refuse selection for any submission.

If you have a solution that is expensive to run, we encourage you to test by randomly sampling 100 tasks from the public evaluation set of your selected benchmark version and holding out the remainder for private validation. This can build confidence in your overall score before you incur significant cost on the full task dataset.
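
For instance, a quick way to draw such a sample - a sketch, assuming the public evaluation tasks sit in a local directory as individual JSON files - is:

    # Sketch: randomly sample 100 public evaluation tasks for a cheaper dry run,
    # holding out the remainder for a later validation pass.
    import random
    from pathlib import Path

    TASK_DIR = Path("data/evaluation")            # assumed location of public eval tasks
    task_files = sorted(TASK_DIR.glob("*.json"))

    random.seed(0)                                # fixed seed keeps the split reproducible
    sample = random.sample(task_files, 100)
    sample_set = set(sample)
    holdout = [f for f in task_files if f not in sample_set]

    print(f"dry-run sample: {len(sample)} tasks, holdout: {len(holdout)} tasks")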

Submit for ARC-AGI Verification


Verification Fund

Running solutions against evaluation sets can incur significant compute and provider costs. To help support those contributing to this initiative, we've set up a verification fund.

For each new verified high-score reproduction, we will reimburse up to $2,500.

This fund is a work in progress, and we reserve the right to change it at any time or to refuse reimbursement requests after consideration by the ARC Prize team.


Verified Badge Usage Guidelines

To increase the credibility of verified submissions and maintain trust in the ARC Prize brand and its associated benchmarks, please follow the ARC Prize Verified badge guidelines below.

  1. ARC Prize Verified badge assets are intended only for display alongside verified scores. Badges should not be displayed next to unverified results.
  2. Only makers of ARC Prize Verified submissions are permitted to display ARC Prize branded badges.
  3. Badge assets are to be used as-is, with no modifications apart from resizing to fit a given context.
  4. The recommended display height is 40 pixels; badges should be rendered no smaller than 30 pixels tall.

Places where badges, appropriately associated with verified scores, might be displayed:

  • Social media images
  • Academic papers
  • Benchmark results on websites

ARC Prize Verified Badges

Here are the badge assets available for download.


FAQ

What models do you evaluate? Why not all?

We will continuously add new models and unlist old ones. It is not feasible to add every possible model due to the cost and scalability limits of our evaluation process. Our reasoning for being extremely selective about exposure to our Semi-Private and Private Evaluation datasets is explained at the top of this page.

What about reasoning models? Which reasoning level will you use?

We are interested in assessing performance across different levels of reasoning. To do this, we will often repeat model tests at varied reasoning levels.

Does the model need to be multimodal to be tested?

No. The leaderboard is open to all model types.

How do you test open-source models? Which provider do you use?

If a model selected for verification is open-source and not available via API by the model creator, we will use another public model provider.

What cost metric will you report?

Cost is a critical factor in model evaluation, and whenever possible, we will use retail pricing to assess cost efficiency. For model providers, we will base cost calculations on publicly available retail rates — typically measured in price per million tokens — rather than a provider's internal margins or raw cost of goods. Costs are generally shared on an average per-test-pair-attempt basis.
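
As an illustration of how a per-test-pair-attempt figure comes out of retail token pricing (all numbers below are made up for the example, not actual rates or measurements):

    # Worked example with made-up numbers: retail token pricing -> average cost per attempt.
    input_price_per_m = 3.00      # USD per 1M input tokens (hypothetical retail rate)
    output_price_per_m = 15.00    # USD per 1M output tokens (hypothetical retail rate)

    attempts = 240                # total test-pair attempts in the run (hypothetical)
    input_tokens = 12_000_000     # total input tokens across all attempts (hypothetical)
    output_tokens = 3_000_000     # total output tokens across all attempts (hypothetical)

    total_cost = (input_tokens / 1e6) * input_price_per_m + (output_tokens / 1e6) * output_price_per_m
    cost_per_attempt = total_cost / attempts
    print(f"total: ${total_cost:.2f}, per attempt: ${cost_per_attempt:.2f}")   # $81.00 total, ~$0.34 per attempt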

Why should the community trust ARC Prize?

We are a nonprofit that seeks to provide transparency in our testing. We invite the community to reproduce our results. Our independent academic panel also provides external oversight of our testing process.

What if my submission is not selected for verification?

You are free to test on public data and self-report your scores to the community. Please state clearly the data you tested on, how you tested, and that your results are not verified.

Who will fund this effort? Any conflicts of interests?

The ARC Prize Foundation is a nonprofit funded by donations, including support from individuals, foundations, and AI labs. We also accept in-kind service credits. Sponsor status does not affect verification eligibility, methods, scoring, publication timing, or access to Semi-Private/Private evaluations.

We publicly disclose lab donations and in-kind support. We do not withhold or delay results at any sponsor's request. Our commitment is scientific rigor, transparency, and impartiality.

If you’d like to support our work, please visit our Donation page.


Got feedback?

Feel free to contact us at: team@arcprize.org
