Curious how cutting-edge LLMs, or AI systems that exceed the constraints of the ARC Prize 2024 competition, perform? We are, too!
ARC-AGI-Pub is a secondary leaderboard measuring the ARC-AGI public evaluation set.
This leaderboard imposes no limitations on internet access and few limitations on compute. That means you can use SOTA models, like OpenAI's GPT-4o or Anthropic's Claude 3.5 Sonnet.
ARC-AGI-Pub is not part of ARC Prize 2024, and no prizes are associated with this leaderboard.
Scores are measured against 500 tasks (the 400 tasks in the public evaluation set + 100 in a new semi-private evaluation set). Submissions may use any resources to solve them, including calling commercial APIs. The ARC Prize team will verify solutions and scores within 2 weeks of submission.
Only new high score claims (i.e., submissions that exceed the current high score by at least 1%) will be verified and published. This is to prevent the semi-private evaluation set from leaking into LLM training data.
If you have a solution that is expensive to run, we encourage you to test by randomly sampling 100 tasks from the 400 public evaluation tasks and hold out the remainder for private validation. This can build confidence towards your overall ARC-AGI-Pub score before incurring significant cost with the full 400 task dataset.
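The sampling step above can be sketched in a few lines. This is a minimal, hypothetical example: it assumes task IDs are simple strings, and the fixed seed is only there so the same split can be reproduced across runs.

```python
import random

# Hypothetical IDs for the 400 public evaluation tasks.
task_ids = [f"task_{i:03d}" for i in range(400)]

rng = random.Random(42)  # fixed seed so the split is reproducible
sample = rng.sample(task_ids, 100)                   # cheap test subset
holdout = [t for t in task_ids if t not in sample]   # private validation

assert len(sample) == 100 and len(holdout) == 300
```

Running the expensive solution on the 100-task sample first gives a rough score estimate at roughly a quarter of the cost of the full public set.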
Submit an ARC-AGI-Pub high score claim.
The public and private eval sets are intended to be equally difficult. However, the latest evidence on the private leaderboard shows an undesirable gap. This has not historically been critical because scores were not reported on the public eval set, but now that we intend to report public set scores for ARC-AGI-Pub, we are working to bring the two sets into better agreement.
Compute and/or provider costs can be significant to run solutions against evaluation sets. To help support those contributing to this initiative, we are committing $150,000 USD to the public leaderboard.
ARC-AGI-Pub submissions can use up to $10,000 in resources over the 500 puzzles (400 public evaluation set + 100 semi-private evaluation set).
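As a quick sanity check on per-task spend, the cap works out to an average of $20 per puzzle:

```python
BUDGET_USD = 10_000
NUM_TASKS = 400 + 100  # public + semi-private evaluation sets

per_task = BUDGET_USD / NUM_TASKS
print(per_task)  # 20.0 USD per task on average
```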
For each newly verified SOTA score (i.e., a solution that scores higher than the current highest score on this leaderboard's public evaluation set), we will reimburse up to $2,500.
This fund is a work-in-progress and we reserve the right to make changes at any time or refuse reimbursement requests upon consideration by the ARC Prize team.
Thanks to all who have contributed ideas and advice on this initiative. We're excited to learn alongside the community! Please contact team@arcprize.org with questions.