Humanity's Last Exam
Briefly

Humanity's Last Exam is a new benchmark developed by Scale AI and the Center for AI Safety to evaluate AI models more rigorously. Current benchmarks have become too easy, with many models scoring over 90%, prompting the need for more challenging assessments. The new benchmark includes 2,700 questions sourced from experts across a wide range of fields, with some questions withheld to prevent models from simply memorizing answers. Initial results show even the top-performing model scoring only 26.6%, underscoring its difficulty.
If you're not familiar with benchmarks, they're how we measure the capabilities of particular AI models like o1 or Claude 3.5 Sonnet.
New models routinely achieve 90%+ on the best ones we have. So there's a clear need for harder benchmarks to measure model performance against.
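To make the scoring concrete, here's a minimal sketch of how a benchmark accuracy figure like 26.6% gets computed: run the model on every question and report the fraction it answers correctly. The `ask_model` function and the tiny demo dataset below are hypothetical placeholders, not the real Humanity's Last Exam data or grading pipeline (which also relies on held-out questions and expert-written answers).

```python
def ask_model(question: str) -> str:
    """Placeholder for a call to a real model API (hypothetical)."""
    return "42"


def score(benchmark: list[dict]) -> float:
    """Fraction of questions the model answers exactly right."""
    correct = sum(
        1
        for item in benchmark
        if ask_model(item["question"]).strip() == item["answer"].strip()
    )
    return correct / len(benchmark)


if __name__ == "__main__":
    # Toy stand-in for a benchmark: each item pairs a question with its answer.
    demo = [
        {"question": "What is 6 x 7?", "answer": "42"},
        {"question": "Name the capital of France.", "answer": "Paris"},
    ]
    print(f"Accuracy: {score(demo):.1%}")  # e.g. "Accuracy: 50.0%"
```

Real leaderboards work the same way in spirit, just with thousands of questions and more careful answer-matching than a string comparison.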
Humanity's Last Exam, made by Scale AI and the Center for AI Safety, features 2,700 challenging questions, with some kept private to prevent training on the dataset.
So far, it's doing its job well - the highest-scoring model is OpenAI's Deep Research at 26.6%, with others only getting 3-4% correct.
Read at Maggie Appleton