
Testing Policy

Mission

The ARC Prize Foundation is a nonprofit organization dedicated to open scientific progress through enduring AI benchmarks.

We make tools that provide empirical data about intelligence capabilities; this data guides critical industry decisions about research, safety, and policy.

A core tenet of the ARC Prize Foundation is to measure the progress of new AI models. We do this by maintaining an up-to-date leaderboard that is separate from the competition leaderboard.

This leaderboard places no limits on internet access or compute. It is intended to test state-of-the-art models and bespoke solutions.

To find out more about the ARC Prize Foundation, please visit our about page.


Policy

Open source testing

The code for our evaluation methodology is open source and can be viewed in the Model Baseline repository. This means anyone can clone the code, audit it, or run their own instance to reproduce our results.

Transparent results

All relevant data points—including model outputs, evaluation durations, costs, and submission details—are publicly available on HuggingFace.

Measure, not a target

The ARC Prize benchmarks are designed to measure AI progress, not to serve as a target for optimization. ARC Prize tasks are not economically useful to target; instead, they are a measure of AI capability. Our goal is to provide an objective assessment of model capabilities rather than influence model training through repeated, iterative testing.


Evaluations

For public evaluation tasks, both the raw submission data and individual task scores will be shared alongside the overall model score.

For semi-private evaluation tasks, only the aggregate model score will be publicly shared—individual task results will remain private.

Testing Public Models

We test new models as they are publicly released (open weights or API) to provide transparent and standardized evaluations.

Our evaluation process follows these steps:

  1. Testing – The model is tested against 400 public evaluation tasks and 100 semi-private tasks.
    • Note: The ARC-AGI Hidden Test Set is strictly reserved for competition use and will not be used for general model evaluations where data leakage is a risk.
  2. Result Publication – The results are uploaded to Hugging Face and the ARC Prize Leaderboard.
  3. Cost Calculation – The average cost to run a single ARC-AGI task test pair is determined.
  4. Provider Notification – Where possible, the model provider is informed of the evaluation results.
  5. Public Sharing – Results are shared openly with the community.

For models accessible via a public API, the turnaround time from evaluation to result publication is typically within 2 days.
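
As a rough illustration of steps 1–3 above, the sketch below shows how a single leaderboard entry could be assembled from per-task results. The field names and data structures are illustrative assumptions, not the actual schema used in the Model Baseline repository or on Hugging Face; note that per-task details are only published for the public set.

```python
# Illustrative sketch only: field names and structure are assumptions,
# not the actual ARC Prize scorecard schema.
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str
    solved: bool
    n_test_pairs: int   # test pairs attempted for this task
    cost_usd: float     # retail API cost for those attempts


def build_scorecard(model: str,
                    public: list[TaskResult],
                    semi_private: list[TaskResult]) -> dict:
    """Aggregate per-task results into one leaderboard entry (steps 1-3)."""
    everything = public + semi_private
    total_cost = sum(r.cost_usd for r in everything)
    total_pairs = sum(r.n_test_pairs for r in everything)
    return {
        "model": model,
        "public_score": sum(r.solved for r in public) / len(public),
        "semi_private_score": sum(r.solved for r in semi_private) / len(semi_private),
        "avg_cost_per_test_pair_usd": total_cost / total_pairs,
        # Per-task details are published for the public set only;
        # semi-private results are reported in aggregate.
        "public_task_results": {r.task_id: r.solved for r in public},
    }
```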

Testing Unreleased Models

We collaborate with major open-source and commercial model providers to test unreleased models for the community. A model is considered unreleased if its weights are neither open nor available via a public API or service. The models that are tested must be intended for public launch.

Our approach to testing unreleased models:

  • Establish Trust. Working with unreleased models requires a foundation of trust. To uphold this trust, we follow strict confidentiality agreements.
  • Data Retention Agreements. We will work closely with model providers to ensure that no data from the semi-private evaluation set is retained. This includes collaborating on best practices to prevent unintended data persistence. Our goal is to minimize any risk of data leakage while maintaining the integrity of our evaluation process.
  • API Credits & Resource Contributions. To facilitate testing, ARC Prize may accept API credits or compute resources from model providers. Any such contributions will be publicly disclosed. All models, whether accessed via API credits or other means, are subject to the same testing standards.
  • Initial Testing & Private Results. We will test unreleased models on the public evaluation and semi-private tasks and share the results privately with the provider.
  • Multiple Checkpoints Allowed. If a provider wishes to test multiple versions of a model, we will assist in testing with the understanding that results from at least one final version will eventually be published.
  • Not a Development Tool. Our evaluations are intended to assess a model’s performance, not serve as an iterative development tool. Providers should not expect continuous optimization cycles based on our feedback.
  • Public Disclosure. Once the model is publicly released, we will re-test the published version to confirm agreement with the pre-release results, and then publish the results.

This approach ensures that ARC Prize remains an independent evaluator while still enabling model providers to understand performance before public launch.

Testing Bespoke Solutions

Many researchers will also develop custom solutions to ARC-AGI. Examples include submissions from MIT/Cornell and Jeremy Berman.

Bespoke Solution Rules

  1. You are allowed to use internet access and call external APIs.
  2. Any APIs called must be publicly and commercially available for others to use.
  3. If you choose to use APIs that cost money, you’ll be expected to supply API keys we can use during submission verification.
  4. Solutions must be submitted via a Kaggle notebook and run in <12 hours. (See docs)
  5. There are no limits on the amount of compute or capital used to pre-train models that your solution leverages.
  6. $10,000 USD is the maximum amount that can be spent on runtime costs to solve 500 tasks (400 in the public evaluation set + 100 in a new semi-private evaluation set), including calling commercial APIs. (A budget-tracking sketch follows this list.)
  7. Selected high score claims are verified using a semi-private evaluation set. New high scores are accepted when the semi-private and public evaluation sets are in good agreement.
  8. Submissions must be open source to qualify for verification and reimbursement.
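
As referenced in rule 6, a bespoke solution that calls paid commercial APIs needs to keep its total runtime spend under the $10,000 cap across the 500 tasks. The sketch below is one minimal way to track that budget; the per-token prices are placeholder values, not any provider's real rates.

```python
# Minimal budget-guard sketch for rule 6 (the $10,000 runtime cap).
# Prices below are placeholder values, not any provider's real rates.

MAX_BUDGET_USD = 10_000.0            # total runtime cap across all 500 tasks
PRICE_PER_INPUT_TOKEN = 3.00 / 1e6   # assumed retail rate, USD per token
PRICE_PER_OUTPUT_TOKEN = 15.00 / 1e6

spent_usd = 0.0


def record_usage(input_tokens: int, output_tokens: int) -> None:
    """Convert reported token usage into retail cost and track the running total."""
    global spent_usd
    spent_usd += (input_tokens * PRICE_PER_INPUT_TOKEN
                  + output_tokens * PRICE_PER_OUTPUT_TOKEN)


def within_budget(safety_margin_usd: float = 100.0) -> bool:
    """Check before each API call so spending stops short of the cap."""
    return spent_usd + safety_margin_usd < MAX_BUDGET_USD
```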

Submissions

The ARC Prize team will verify solutions and scores within 2 weeks of submission.

Only new high score claims (e.g., submissions that exceed the current high score by at least 1%) will be verified and published. This limits the risk of the semi-private evaluation set leaking into LLM training data.

If you have a solution that is expensive to run, we encourage you to test it by randomly sampling 100 tasks from the 400 public evaluation tasks and holding out the remainder for private validation. This can build confidence in your overall ARC-AGI-Pub score before you incur significant cost on the full 400-task dataset.
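
For example, assuming the 400 public evaluation tasks are stored as one JSON file per task in a local data/evaluation/ directory (as in the public ARC-AGI data repository), a reproducible 100-task sample could be drawn like this; the path is an assumption about your local layout.

```python
# Draw a fixed random subset of 100 public evaluation tasks for cheaper
# preliminary testing; hold out the remaining 300 for private validation.
# The data/evaluation/ path is an assumption about your local checkout.
import random
from pathlib import Path

task_files = sorted(Path("data/evaluation").glob("*.json"))  # 400 public tasks

rng = random.Random(0)                 # fixed seed so the split is reproducible
sample = rng.sample(task_files, 100)
holdout = [t for t in task_files if t not in sample]

print(f"testing on {len(sample)} tasks, holding out {len(holdout)}")
```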

Submit an ARC-AGI-Pub high score claim.

Verification Fund

Running solutions against the evaluation sets can incur significant compute and/or provider costs. To help support those contributing to this initiative, we’ve set up a verification fund.

An ARC-AGI submission to be verified can use up to $10,000 in resources across the 500 puzzles (400 in the public evaluation set + 100 in the semi-private evaluation set). We retain the right to refuse verification against any previous benchmark in the event of future benchmark releases (such as ARC-AGI-2 or ARC-AGI-3).

For each new proven SOTA reproduction (i.e., when a solution scores higher than the current highest score on this leaderboard on the public evaluation set), we will reimburse up to $2,500.

This fund is a work-in-progress and we reserve the right to make changes at any time or refuse reimbursement requests upon consideration by the ARC Prize team.


FAQ

Which models will you evaluate? Why not all of them?

We will continuously add new models and retire old ones. It is not feasible to add every possible model due to the cost and scalability constraints of our evaluation process.

What about reasoning models? Which reasoning level will you use?

We want to assess performance across different levels of reasoning. To do this, we will repeat model tests at several spaced reasoning-effort settings.

How will you test open source models? Which provider do you use?

If a model is open source and not available via an API from the model creator, we will use another public model provider, such as Baseten or Together AI.

What cost metric will you report?

Cost is a critical factor in model evaluation, and whenever possible, we will use retail pricing to assess cost efficiency. For model providers, we will base cost calculations on publicly available retail rates—typically measured in price per million tokens—rather than a provider’s internal margins or raw cost of goods. Costs are generally shared on an average per-test-pair-attempt basis.
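
As a hedged arithmetic example of this convention: suppose a model’s retail pricing were $3 per million input tokens and $15 per million output tokens (illustrative numbers only, not any particular provider’s rates). The cost attributed to a single test-pair attempt would then be computed roughly as follows, and the reported figure is the average of that value over all attempts in a run.

```python
# Illustrative cost calculation; the token counts and per-million-token
# prices below are made-up example values, not real provider rates.
PRICE_IN = 3.00 / 1_000_000    # USD per input token (retail)
PRICE_OUT = 15.00 / 1_000_000  # USD per output token (retail)

input_tokens = 4_200   # prompt tokens for one test-pair attempt (example)
output_tokens = 1_100  # completion tokens for that attempt (example)

cost_per_attempt = input_tokens * PRICE_IN + output_tokens * PRICE_OUT
print(f"${cost_per_attempt:.4f} per test-pair attempt")
# Reported leaderboard cost = mean of this value over all attempts in the run.
```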

Why should the community trust ARC-AGI?

We are a nonprofit that seeks to provide full transparency, and we make all of our tools available. We invite the community to reproduce our results.

Who will fund this effort? Any conflict of interests?

The ARC Prize Foundation is funded exclusively through donations, including financial contributions, cloud credits, and API credits.

We do not accept funding from AI model providers for core benchmark and leaderboard evaluations. Our evaluations are conducted independently, and we take potential conflicts of interest very seriously. Our commitment is to scientific rigor, transparency, and impartiality in AI benchmarking.

If you’d like to support our work, please visit our donation page.


Got feedback?

Feel free to contact us at: team@arcprize.org
