Across several examples, bar graphs intended to show off GPT-5's performance benchmarks looked polished at a glance but turned out to be horribly inaccurate nonsense upon closer inspection.
Somehow, the bar for GPT-5's score of 52.8 percent accuracy is nearly twice as tall as the bar for the o3 model's score of 69.1 percent.
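To see how far off that is, here is a minimal sketch (the variable names are illustrative, not from OpenAI's chart code) of what a correctly scaled bar chart would do with those two numbers:

```python
# In a proportionally scaled bar chart, bar height tracks the value shown.
gpt5_score = 52.8  # percent accuracy GPT-5's bar represented
o3_score = 69.1    # percent accuracy o3's bar represented

# Relative height a correct chart would give GPT-5's bar versus o3's:
ratio = gpt5_score / o3_score
print(f"GPT-5's bar should be about {ratio:.0%} the height of o3's bar")
# i.e. noticeably shorter than o3's bar, not nearly twice as tall
```

In other words, the keynote chart inverted the relationship: a bar drawn to scale would have made GPT-5's score visibly smaller than o3's.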
Research suggests that newer models could actually be getting dumber in key ways, hallucinating more frequently than earlier versions.
OpenAI corrected the charts in its blog post, but the originals are still there in the livestream.