The Center for AI Safety (CAIS) and Scale AI have introduced Humanity's Last Exam, a rigorous benchmark for evaluating frontier AI systems. It comprises thousands of crowdsourced questions spanning subjects such as mathematics, the humanities, and the natural sciences, presented in a variety of formats including diagrams and images. In initial evaluations, no leading AI system scored above 10% accuracy, underscoring how challenging the benchmark is for current systems. CAIS and Scale AI plan to share the benchmark with researchers to support further study and assessment of AI capabilities.