Scale AI and the Center for AI Safety (CAIS) introduced 'Humanity's Last Exam' (HLE), a rigorous benchmark of 3,000 questions spanning diverse subjects, designed to measure AI models against human expert knowledge. While leading AI models score above 90% on established benchmarks such as MMLU, they answered fewer than 10% of HLE questions correctly. Experts see HLE as a way to expose the limits of AI knowledge and understanding at the level of human expertise, signaling that significant challenges remain for AI advancement.
Anthropic's Michael Gerstenhaber noted that AI models frequently outpace the benchmarks used to evaluate them, contributing to rapid turnover in leaderboard standings as new models emerge.
Dan Hendrycks remarked on the performance of current AI models, noting that they answer fewer than 10% of Humanity's Last Exam questions correctly, a result that highlights a significant gap between today's AI capabilities and expert-level knowledge.