Today we're announcing ARC Prize Verified, a program to increase the rigor of evaluating frontier systems on the ARC-AGI benchmark.
In addition to certified score verification, this program adds a third-party academic panel to audit and open source our testing process.
We are also excited to welcome 5 new AI labs as sponsors of ARC-AGI-3, our third-generation Interactive Reasoning Benchmark.
As a nonprofit organization, ARC Prize Foundation is able to grow our impact because of the generous support of our donors. What started with just our co-founders has grown to include support from mission-aligned individuals such as Tyler Cowen, Dharmesh Shah, and Aaron Levie. Now our donor base is expanding to include the organizations leading the charge to AGI.
We're excited to announce our first external donations from the following AI research labs.
All new funds will specifically be used to:
As always, ARC Prize Foundation remains an independent organization maintaining a bias-free testing policy for models selected for verification. These funds do not influence our testing or score verification. All lab donors have agreed to our standard testing policy. In plain language: donating to the nonprofit does not impact verification scoring for donors.
We are also recruiting notable academic leaders in AI and human psychology to contribute to our mission.
We're excited to welcome Todd Gureckis (Professor of Psychology at NYU), Guy Van den Broeck (Professor of Computer Science at UCLA), Melanie Mitchell (Professor at the Santa Fe Institute), and Vishal Misra (Vice Dean of Computing and AI at Columbia) to serve as independent validators of the ARC Prize testing process. We expect more panel members to be added soon.
The academic panel will:
We're looking for additional academic leaders with expertise in AI evaluation, research methodology, and academic integrity. If you're interested in serving on our academic panel, please reach out to team@arcprize.org.
As ARC Prize Foundation, and especially the ARC-AGI family of benchmarks, has gained popularity, organizations have been eager to announce the ARC-AGI scores of their models (or systems). Reasons for doing so include leveling up open-source state-of-the-art solutions, promoting frontier model performance for product launch events, and even early-stage startup fundraising. Our mission is to drive open AGI progress, so we love seeing the benchmark and our efforts provide so much value to the larger research community.
It's very important to note, however, that self-reported or third-party figures often vary in dataset curation, prompting methods, and many other factors, which prevents an apples-to-apples comparison of results. This causes confusion in the market and ultimately detracts from our goal of measuring frontier AI progress.
From the beginning, ARC Prize Foundation has evaluated select models on a hidden test set so that reported scores reflect generalization rather than overfitting to public tasks. Testing models on tasks they have never seen before ensures that systems demonstrate genuine general reasoning rather than memorization of examples from training data. This process is fundamental to the integrity of the benchmark and is outlined in our testing policy.
If your team is interested in working with us to verify a score, please reach out to team@arcprize.org.