Curious how cutting-edge LLMs, or AI systems that exceed the constraints of the ARC Prize 2024 competition, perform? We are, too!
ARC-AGI-Pub is a secondary leaderboard measuring performance on the ARC-AGI public evaluation set.
This leaderboard imposes no limitations on internet access and few limitations on compute. That means you can utilize state-of-the-art models from OpenAI, Anthropic, Google, and more.
ARC-AGI-Pub is not part of ARC Prize 2024 and no prizes are associated with this leaderboard.
Submissions can use up to $10,000 in resources over the 500 puzzles (400 in the public evaluation set + 100 in a new semi-private evaluation set), including calling commercial APIs.
The ARC Prize team will verify solutions and scores within 2 weeks of submission.
Only new high score claims (e.g., submissions that exceed the current high score by at least 1%) will be verified and published. This limits how often the semi-private evaluation set is exposed, reducing the risk of it leaking into LLM training data.
If you have a solution that is expensive to run, we encourage you to test it by randomly sampling 100 tasks from the 400 public evaluation tasks and holding out the remainder for private validation. This can build confidence in your overall ARC-AGI-Pub score before you incur significant cost on the full 400-task dataset.
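The sampling step described above can be sketched in a few lines of Python. This is a minimal illustration, not an official tool: the function name `split_public_eval`, the fixed seed, and the placeholder task IDs are all assumptions made for the example. In practice you would pass in the real IDs from the public evaluation set.

```python
import random


def split_public_eval(task_ids, sample_size=100, seed=0):
    """Split task IDs into a random test sample and a held-out remainder.

    Hypothetical helper for illustration; a fixed seed keeps the split
    reproducible, so the held-out tasks stay untouched between runs.
    """
    if sample_size > len(task_ids):
        raise ValueError("sample_size exceeds the number of tasks")
    ordered = sorted(task_ids)  # stable order, independent of input order
    rng = random.Random(seed)
    sample = rng.sample(ordered, sample_size)
    chosen = set(sample)
    held_out = [t for t in ordered if t not in chosen]
    return sample, held_out


# Placeholder IDs standing in for the 400 public evaluation tasks.
task_ids = [f"task_{i:03d}" for i in range(400)]
sample, held_out = split_public_eval(task_ids)
print(len(sample), len(held_out))  # 100 300
```

Keeping the seed fixed means the same 100 tasks are used each time you iterate, while the other 300 remain available for a final private check.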
Submit an ARC-AGI-Pub high score claim.
Interacting with closed-source LLMs poses a risk of data leakage, which is not acceptable for the ARC-AGI private evaluation set.
To address this, we have created a semi-private evaluation dataset consisting of 100 tasks used to verify high-scoring submissions. These tasks have not been difficulty calibrated, but data shows that they are similar to those in the public evaluation set.
This set is "semi-private" because we can assume that, over time, this data will be added to LLM training data and will need to be periodically updated.
Running solutions against the evaluation sets can incur significant compute and/or provider costs. To help support those contributing to this initiative, we are committing $150,000 USD to the public leaderboard.
ARC-AGI-Pub submissions can use up to $10,000 in resources over the 500 puzzles (400 in the public evaluation set + 100 in the semi-private evaluation set).
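As a rough worked example of what this limit implies, the sketch below computes the average budget per puzzle and extrapolates a full-run cost from a smaller trial. The helper names and the $1,500 trial cost are hypothetical, and the extrapolation assumes cost per task is roughly uniform, which may not hold for every solution.

```python
TOTAL_BUDGET_USD = 10_000
PUBLIC_TASKS = 400
SEMI_PRIVATE_TASKS = 100
TOTAL_TASKS = PUBLIC_TASKS + SEMI_PRIVATE_TASKS


def per_task_budget(total_usd, n_tasks):
    """Average spend available per task under the resource limit."""
    return total_usd / n_tasks


def projected_total(sample_cost_usd, sample_size, n_tasks):
    """Linearly extrapolate a full-run cost from a trial on a sample.

    Assumes every task costs about the same (an illustrative assumption).
    """
    return sample_cost_usd / sample_size * n_tasks


print(per_task_budget(TOTAL_BUDGET_USD, TOTAL_TASKS))  # 20.0

# Hypothetical trial: 100 sampled tasks cost $1,500 to run.
estimate = projected_total(1_500, 100, TOTAL_TASKS)
print(estimate, estimate <= TOTAL_BUDGET_USD)  # 7500.0 True
```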
For each new proven SOTA reproduction (i.e., when a solution scores higher than the current highest score on this leaderboard on the public evaluation set), we will reimburse up to $2,500.
This fund is a work-in-progress and we reserve the right to make changes at any time or refuse reimbursement requests upon consideration by the ARC Prize team.
Thanks to all who have contributed ideas and advice on this initiative. We're excited to learn as much as the community is! Please contact team@arcprize.org for questions.