
ARC-AGI-Pub

Curious how cutting-edge LLMs, or other AI systems that exceed the constraints of the ARC Prize 2024 competition, perform? We are, too!

ARC-AGI-Pub is a secondary leaderboard that measures performance on the ARC-AGI public evaluation set.

This leaderboard imposes no limitations on internet access and few limitations on compute. That means you can utilize SOTA models, like OpenAI's GPT-4o or Anthropic's Claude 3.5 Sonnet.

See the public leaderboard

ARC-AGI-Pub is not part of ARC Prize 2024, and no prizes are associated with this leaderboard.


Rules

  1. You are allowed to use internet access and call external APIs.
  2. Any APIs called must be publicly and commercially available for others to use.
  3. If you choose to use APIs that cost money, you’ll be expected to supply API keys we can use during submission verification.
  4. Solutions must be submitted via a Kaggle notebook and run in <12 hours. (See docs)
  5. There are no limits on the amount of compute or capital used to pre-train models that your solution leverages.
  6. $10,000 USD is the maximum amount that can be spent on runtime costs to solve 500 tasks (400 in the public evaluation set + 100 in a new semi-private evaluation set), including calling commercial APIs.
  7. Selected high score claims are verified using a semi-private evaluation set. New high scores are accepted when the semi-private and public evaluation sets are in good agreement.
  8. Scores are evaluated with the same scoring methodology as ARC Prize 2024 (see the sketch after this list).
  9. Submissions must be open source to qualify for verification and reimbursement.
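
As a rough illustration of that scoring methodology, here is a minimal sketch in Python. It assumes the ARC Prize 2024 convention of up to two attempts per task output, scored by exact grid match and averaged over tasks; it also treats each task as having a single test output for simplicity, and the function and variable names are illustrative rather than part of any official harness.

```python
from typing import Dict, List

Grid = List[List[int]]  # an ARC grid: rows of integer color codes


def task_score(attempts: List[Grid], solution: Grid) -> float:
    """Score 1.0 if any of the (up to two) attempts exactly matches
    the ground-truth output grid, otherwise 0.0."""
    return 1.0 if any(attempt == solution for attempt in attempts[:2]) else 0.0


def overall_score(predictions: Dict[str, List[Grid]],
                  solutions: Dict[str, Grid]) -> float:
    """Average the per-task scores over every task in the evaluation set.
    Tasks with no prediction simply score 0."""
    scores = [task_score(predictions.get(task_id, []), solution)
              for task_id, solution in solutions.items()]
    return sum(scores) / len(scores)
```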

Submissions

The ARC Prize team will verify solutions and scores within 2 weeks of submission.

Only new high score claims (i.e., submissions that exceed the current high score by at least 1%) will be verified and published. This limits the risk of the semi-private evaluation set leaking into LLM training data.

If your solution is expensive to run, we encourage you to test it by randomly sampling 100 tasks from the 400 public evaluation tasks and holding out the remainder for private validation, as in the sketch below. This can build confidence in your overall ARC-AGI-Pub score before you incur the full cost of running all 400 tasks.
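
A minimal sketch of that sampling split, assuming the public evaluation tasks live in a single JSON file keyed by task ID (the file name below is illustrative; adjust the path to wherever you store the set):

```python
import json
import random

# Illustrative path: a JSON file mapping task_id -> task definition for the
# 400 public evaluation tasks. Adjust to wherever your copy of the set lives.
with open("arc-agi_evaluation_challenges.json") as f:
    eval_tasks = json.load(f)

task_ids = sorted(eval_tasks)   # all 400 public evaluation task IDs
random.seed(0)                  # fix the seed so the split is reproducible

sample_ids = random.sample(task_ids, 100)              # cheap test subset
holdout_ids = sorted(set(task_ids) - set(sample_ids))  # private validation

print(f"Testing on {len(sample_ids)} tasks; holding out {len(holdout_ids)}.")
```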

Submit an ARC-AGI-Pub high score claim.


Details

Goals

Data Consistency

The public and private evaluation sets are intended to be equally difficult. However, the latest evidence from the private leaderboard shows an undesirable gap. Historically this has not been critical because scores were not reported on the public evaluation set, but now that we intend to report public-set scores for ARC-AGI-Pub, we are working to bring the two sets into better agreement.


Verification Fund

Running solutions against the evaluation sets can incur significant compute and/or provider costs. To help support those contributing to this initiative, we are committing $150,000 USD to this verification fund for the public leaderboard.

ARC-AGI-Pub submissions can use up to $10,000 in resources over the 500 tasks (400 public evaluation tasks + 100 semi-private evaluation tasks).

For each newly verified SOTA result (i.e., when a solution scores higher than the current high score on this leaderboard's public evaluation set), we will reimburse up to $2,500.

This fund is a work-in-progress and we reserve the right to make changes at any time or refuse reimbursement requests upon consideration by the ARC Prize team.

Onward

Thanks to all who have contributed ideas and advice on this initiative. We're excited to learn as much as the community is! Please contact team@arcprize.org for questions.
