With AI models clobbering every benchmark, it's time for human evaluation
Briefly

Traditional benchmark tests for AI, such as GLUE and MMLU, are becoming ineffective at assessing the true value of generative AI programs. Industry leaders, including Michael Gerstenhaber of Anthropic, acknowledge that these benchmarks are saturated. Recent literature supports the idea that human input is essential for evaluating AI outputs. A paper by Adam Rodman and colleagues argues that human evaluation is needed to gauge AI's effectiveness in real-world settings, particularly in medicine, where traditional benchmarks fail to connect with clinical practice.
"We've saturated the benchmarks," said Michael Gerstenhaber, head of API technologies at Anthropic, highlighting the need for more human-centric assessments of AI capabilities.
Rodman and collaborators argue that traditional benchmarks have become incapable of measuring AI's real-world applicability and efficacy in clinical practice.
Read at ZDNET