OpenAI's new benchmark, SimpleQA, evaluates the factual accuracy of generative AI models through a challenging set of 4,326 questions, and even the company's own frontier models consistently underperform on it.
The low accuracy rates of OpenAI's models, such as 42.7% for o1-preview and 38.2% for GPT-4o, reveal significant issues with factual reliability in challenging contexts.
Despite the seemingly simple nature of the questions—such as historical dates and scientific symbols—leading AI models like GPT-4 and Claude-3.5-sonnet earned scores largely equivalent to a failing grade.
SimpleQA appears to target the harder questions that have traditionally stumped AI models, and it has been compared to the SAT in that it measures genuine knowledge retention rather than rote learning. A minimal sketch of how such a benchmark's headline accuracy number could be computed is shown below.
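The sketch below is illustrative only and does not reproduce OpenAI's grading pipeline. It assumes a hypothetical `ask_model` function and a small list of question/reference-answer pairs, and computes the percentage of answers graded as correct, the kind of figure behind scores like 42.7% or 38.2%.

```python
from typing import Callable, List, Tuple

def score_benchmark(
    qa_pairs: List[Tuple[str, str]],
    ask_model: Callable[[str], str],
) -> float:
    """Return percent of questions the model answers correctly.

    qa_pairs: (question, reference_answer) tuples, e.g. a SimpleQA-style set.
    ask_model: hypothetical function that queries a model and returns its answer.
    Grading here is a naive substring match; real benchmarks use far more
    careful grading (often a separate grader model).
    """
    correct = 0
    for question, reference in qa_pairs:
        answer = ask_model(question)
        # Count as correct if the reference answer appears in the model output.
        if reference.strip().lower() in answer.strip().lower():
            correct += 1
    return 100.0 * correct / len(qa_pairs)

# Example usage with toy data and a stub model:
if __name__ == "__main__":
    toy_set = [
        ("What is the chemical symbol for gold?", "Au"),
        ("In what year did the Apollo 11 moon landing occur?", "1969"),
    ]
    stub_model = lambda q: "I believe the answer is Au." if "gold" in q else "1968"
    print(f"Accuracy: {score_benchmark(toy_set, stub_model):.1f}%")
```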