AI benchmarks hampered by bad science
Briefly

"A study [PDF] from researchers at the Oxford Internet Institute (OII) and several other universities and organizations has found that only 16 percent of 445 LLM benchmarks for natural language processing and machine learning use rigorous scientific methods to compare model performance. What's more, about half the benchmarks claim to measure abstract ideas like reasoning or harmlessness without offering a clear definition of those terms or how to measure them."
""[GPT-5] sets a new state of the art across math (94.6 percent on AIME 2025 without tools), real-world coding (74.9 percent on SWE-bench Verified, 88 percent on Aider Polyglot), multimodal understanding (84.2 percent on MMMU), and health (46.2 percent on HealthBench Hard)-and those gains show up in everyday use," OpenAI said at the time. "With GPT‑5 pro's extended reasoning, the model also sets a new SOTA on GPQA, scoring 88.4 percent without tools.""
According to the study, only 16 percent of the 445 LLM benchmarks evaluated for natural language processing and machine learning apply rigorous scientific methods to compare model performance. About half claim to measure abstract constructs such as reasoning or harmlessness without defining those terms or specifying how they are to be measured. Benchmarks underpin most claims about AI progress, yet inconsistent definitions and weak measurement make it difficult to tell whether models are genuinely improving. AI companies nonetheless lean on benchmark scores in promotional claims, as in OpenAI's headline figures above for math, coding, multimodal understanding, and health evaluations.
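To make the study's complaint concrete: a rigorous comparison treats a benchmark score as an estimate with uncertainty rather than a single headline number. Below is a minimal sketch, not taken from the study, assuming hypothetical per-item 0/1 correctness scores for two models on the same benchmark items; it uses a paired permutation test (one standard choice among several) to ask whether an observed accuracy gap could plausibly be noise.

```python
# Sketch only: a paired permutation test for comparing two models on the
# same benchmark items. All data here is hypothetical.
import random

def paired_permutation_test(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided p-value for H0: the two models perform equally per item."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(n_resamples):
        # Under H0 the A/B labels are exchangeable item by item,
        # so randomly flip the sign of each per-item difference.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            hits += 1
    return hits / n_resamples

# Hypothetical 0/1 correctness for 200 shared items.
gen = random.Random(42)
model_a = [1 if gen.random() < 0.80 else 0 for _ in range(200)]
model_b = [1 if gen.random() < 0.74 else 0 for _ in range(200)]

p = paired_permutation_test(model_a, model_b)
print(f"accuracy A={sum(model_a)/200:.3f} B={sum(model_b)/200:.3f} p={p:.4f}")
```

On a small or noisy benchmark, a multi-point accuracy gap can easily fail a check like this, which is exactly the kind of analysis the study found most benchmarks skip.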
Read at The Register