AGI progress has stalled.
New ideas are needed.

Presented by Infinite Monkey Lab42

ARC-AGI-Pub

Curious how cutting-edge LLMs, or AI systems that exceed the constraints of the ARC Prize 2024 competition, perform? We are, too!

ARC-AGI-Pub is a secondary leaderboard measuring performance on the ARC-AGI public evaluation set.

This leaderboard imposes no limitations on internet access and few limitations on compute. That means you can utilize state-of-the-art models from OpenAI, Anthropic, Google, and more.

See the public leaderboard

ARC-AGI-Pub is not part of ARC Prize 2024, and no prizes are associated with this leaderboard.


Rules

  1. You are allowed to use internet access and call external APIs.
  2. Any APIs called must be publicly and commercially available for others to use.
  3. If you choose to use APIs that cost money, you’ll be expected to supply API keys we can use during submission verification.
  4. Solutions must be submitted via a Kaggle notebook and run in <12 hours. (See docs)
  5. There are no limits on the amount of compute or capital used to pre-train models that your solution leverages.
  6. $10,000 USD is the maximum amount that can be spent on runtime costs to solve 500 tasks (400 in the public evaluation set + 100 in a new semi-private evaluation set), including calling commercial APIs.
  7. Selected high score claims are verified using a semi-private evaluation set. New high scores are accepted when the semi-private and public evaluation sets are in good agreement.
  8. Scores are evaluated with the same scoring methodology as ARC Prize 2024.
  9. Submissions must be open source to qualify for verification and reimbursement.

Submissions

The ARC Prize team will verify solutions and scores within 2 weeks of submission.

Only new high score claims (e.g., submissions that exceed the current high score by at least 1%) will be verified and published. This limits how often the semi-private evaluation set is run against external models, reducing the risk that it leaks into LLM training data.
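The threshold can be read in more than one way. The sketch below assumes scores are expressed as percentages and that "at least 1%" means one percentage point; both are assumptions, and the function name and example scores are purely illustrative.

```python
def qualifies_for_verification(claimed_score, current_high_score, margin=1.0):
    """Return True if the claim beats the current high score by at least `margin` points.

    Assumption: scores are percentages and the margin is in percentage points.
    """
    return claimed_score - current_high_score >= margin


# Illustrative scores only, not real leaderboard values.
print(qualifies_for_verification(44.0, 43.0))  # True
print(qualifies_for_verification(43.5, 43.0))  # False
```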

If you have a solution that is expensive to run, we encourage you to test it by randomly sampling 100 tasks from the 400 public evaluation tasks and holding out the remainder for private validation. This can build confidence in your overall ARC-AGI-Pub score before you incur significant cost on the full 400-task dataset.
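A minimal sketch of that sampling step, assuming each public evaluation task has a unique string ID. The function name `split_tasks`, the placeholder IDs, and the fixed seed are illustrative assumptions, not part of any ARC Prize tooling.

```python
import random


def split_tasks(task_ids, sample_size=100, seed=0):
    """Randomly split task IDs into a test sample and a held-out remainder."""
    if sample_size > len(task_ids):
        raise ValueError("sample_size exceeds the number of available tasks")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    sample = rng.sample(list(task_ids), sample_size)
    chosen = set(sample)
    # Everything not sampled is held out for private validation.
    holdout = [t for t in task_ids if t not in chosen]
    return sample, holdout


# Hypothetical placeholder IDs; real IDs come from the public evaluation set.
task_ids = [f"task_{i:03d}" for i in range(400)]
sample, holdout = split_tasks(task_ids)
print(len(sample), len(holdout))  # 100 300
```

Fixing the seed keeps the same 100 tasks in the sample across runs, so the held-out 300 stay untouched while you iterate.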

Submit an ARC-AGI-Pub high score claim.


Details

Goals

Semi-Private Evaluation Set

Interacting with closed-source LLMs poses a risk of data leakage, which is not acceptable for the ARC-AGI private evaluation set.

To address this, we have created a semi-private evaluation dataset consisting of 100 tasks used to verify high-scoring submissions. These tasks have not been difficulty-calibrated, but data shows that they are similar to those in the public evaluation set.

This set is "semi-private" because we assume that, over time, this data will make its way into LLM training data, so the set will need to be updated periodically.


Verification Fund

Running solutions against the evaluation sets can incur significant compute and/or provider costs. To help support those contributing to this initiative, we are committing $150,000 USD to the public leaderboard.

ARC-AGI-Pub submissions can use up to $10,000 in resources over the 500 puzzles (400 in the public evaluation set + 100 in the semi-private evaluation set).
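As a quick check on that budget, $10,000 spread over 500 tasks averages $20 per task. The sketch below assumes you have measured an average per-task cost on a small sample; the function names and the example costs are hypothetical.

```python
MAX_RUNTIME_COST_USD = 10_000  # cap stated in the rules
TOTAL_TASKS = 500              # 400 public + 100 semi-private


def average_budget_per_task(max_cost=MAX_RUNTIME_COST_USD, tasks=TOTAL_TASKS):
    """Average amount available per task under the cap."""
    return max_cost / tasks


def projected_total_cost(cost_per_task, tasks=TOTAL_TASKS):
    """Extrapolate a measured per-task cost to the full run."""
    return cost_per_task * tasks


def within_budget(cost_per_task, max_cost=MAX_RUNTIME_COST_USD, tasks=TOTAL_TASKS):
    """Return True if the projected full-run cost stays at or under the cap."""
    return projected_total_cost(cost_per_task, tasks) <= max_cost


print(average_budget_per_task())  # 20.0
print(within_budget(12.50))       # True  (12.50 * 500 = 6,250)
print(within_budget(25.00))       # False (25.00 * 500 = 12,500)
```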

For each new state-of-the-art (SOTA) score that we successfully reproduce (i.e., when a solution scores higher than the current highest score on this leaderboard on the public evaluation set), we will reimburse up to $2,500.

This fund is a work in progress, and we reserve the right to make changes at any time or to refuse reimbursement requests upon consideration by the ARC Prize team.

Onward

Thanks to all who have contributed ideas and advice on this initiative. We're as excited to learn as the community is! Please contact team@arcprize.org with questions.
