Did xAI lie about Grok 3's benchmarks? | TechCrunch
Briefly

The ongoing debate over AI benchmarks has intensified as OpenAI and Elon Musk's xAI clash over how Grok 3's performance was presented. An OpenAI employee accused xAI of publishing misleading benchmark results for Grok 3 on the AIME 2025 math test. Experts have questioned AIME's validity as an AI benchmark, but it remains a commonly cited one. The dispute centers on what xAI left out of its comparison, particularly the 'consensus@64' (cons@64) statistic, which can substantially raise a model's score. The controversy highlights broader concerns about transparency and ethics in how AI performance is reported.
In a post on xAI's blog, the company published a graph showing Grok 3's performance on AIME 2025. However, experts questioned AIME's validity as an AI benchmark.
OpenAI employees pointed out that xAI's graph omitted o3-mini-high's AIME 2025 score at 'cons@64' (consensus@64), an omission that could misrepresent how Grok 3 compares.
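To make the 'cons@64' point concrete: consensus@k generally means a model is sampled k times per problem and its most frequent answer is taken as the final one, whereas an '@1' score counts only a single attempt. The sketch below illustrates that difference on made-up data with k=5 rather than 64. The function names, the sample answers, and the resulting scores are all hypothetical and do not come from xAI's or OpenAI's evaluations.

```python
from collections import Counter

def score_at_1(samples, correct):
    """Fraction of problems solved when only the first sampled answer counts."""
    hits = sum(1 for answers, truth in zip(samples, correct) if answers[0] == truth)
    return hits / len(correct)

def score_cons_at_k(samples, correct, k):
    """Fraction of problems solved when the most common of the first k answers counts."""
    hits = 0
    for answers, truth in zip(samples, correct):
        majority, _ = Counter(answers[:k]).most_common(1)[0]
        if majority == truth:
            hits += 1
    return hits / len(correct)

# Hypothetical sampled answers for four problems (five samples each).
samples = [
    ["42", "42", "42", "17", "42"],
    ["10", "7", "7", "7", "3"],
    ["5", "8", "9", "8", "8"],
    ["1", "2", "3", "4", "2"],
]
correct = ["42", "7", "8", "9"]

print(score_at_1(samples, correct))          # single-attempt score
print(score_cons_at_k(samples, correct, 5))  # majority-vote score
```

On this toy data the majority-vote score is higher than the single-attempt score, which is why comparing one model's '@1' number against another's 'cons@64' number, or leaving one of them off a chart, can change how the ranking looks.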
Read at TechCrunch