The Vector Institute has conducted a comprehensive State of Evaluations study analyzing 11 AI models against 16 diverse benchmarks spanning areas such as math, coding, and general knowledge. Its findings reveal that while models like DeepSeek and OpenAI's o1 perform best overall, there is still considerable room for improvement across the board. The initiative aims to foster transparency and accountability in model evaluation by giving stakeholders the tools to verify reported results and build on them.
Researchers, developers, regulators, and end users can independently verify results, compare model performance, and build their own benchmarks and evaluations to drive improvement and accountability, as sketched below.
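To make that verification point concrete, here is a minimal sketch of what an independent benchmark run could look like. Everything in it is an illustrative assumption rather than the Vector Institute's actual harness: the JSONL file of prompt/answer records, the `query_model` stub, and the exact-match scoring are all placeholders to be swapped for real components.

```python
import json


def query_model(prompt: str) -> str:
    """Hypothetical placeholder for a real model call.

    Swap in your own inference code here (a local model,
    an API client, etc.).
    """
    raise NotImplementedError("Plug in your model of choice.")


def evaluate(benchmark_path: str) -> float:
    """Score a model on a JSONL benchmark of {"prompt", "answer"}
    records using exact-match accuracy (an assumed, simplistic metric)."""
    correct = total = 0
    with open(benchmark_path) as f:
        for line in f:
            item = json.loads(line)
            prediction = query_model(item["prompt"]).strip()
            correct += prediction == item["answer"].strip()
            total += 1
    return correct / total if total else 0.0


# Usage (with a hypothetical benchmark file):
# accuracy = evaluate("math_subset.jsonl")
# print(f"exact-match accuracy: {accuracy:.1%}")
```

In practice, each task would need an appropriate metric (for example, executing generated code rather than string-matching it), but the basic shape of the loop, and the ability to rerun it against any model, is what makes independent verification possible.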
AI models are advancing at a dizzying clip, with builders boasting ever more impressive performance with each iteration.
The independent, non-profit AI research institute tested 11 top open- and closed-source models against 16 benchmarks.
DeepSeek and OpenAI's o1 models performed best across those benchmarks, but all models still struggle with a range of tasks.