Anshul Kundaje sums up his frustration with the use of artificial intelligence in science in three words: "bad benchmarks propagate". Researchers make questionable claims about AI models that take months to verify and often turn out to rest on poorly defined benchmarks; enthusiastic users then misapply those flawed benchmarks, spreading misinformation and wrong predictions. Without reliable benchmarks, AI threatens to slow scientific progress rather than accelerate it.
MiniMax's M1, an open-weight reasoning model, scores strongly across multiple benchmarks, including 86.0% accuracy on AIME 2024.
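For context, a headline figure like "86.0% accuracy on AIME 2024" typically comes from exact-match scoring over a fixed problem set: the model's final answer is compared against the reference answer for each item. A minimal sketch of that computation (the problem format and the `solve` callback are illustrative, not MiniMax's actual harness):

```python
from typing import Callable

def benchmark_accuracy(
    problems: list[dict],
    solve: Callable[[str], str],
) -> float:
    """Exact-match accuracy: the fraction of problems where the model's
    final answer equals the reference answer."""
    correct = 0
    for item in problems:
        prediction = solve(item["question"])
        if prediction.strip() == item["answer"].strip():
            correct += 1
    return correct / len(problems)

# Toy problem set (AIME answers are integers from 0 to 999).
problems = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is 3 * 5?", "answer": "15"},
]
print(f"accuracy: {benchmark_accuracy(problems, lambda q: '4'):.1%}")
```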
Coding agents powered by large language models excel at software-engineering tasks, yet evaluating them comprehensively across diverse programming languages and real-world scenarios remains an open challenge.
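One common pattern for evaluating coding models, used by benchmarks in the HumanEval family, is execution-based scoring: run each generated solution against hidden unit tests and count the fraction that pass. A minimal sketch with hypothetical toy tasks (a production harness would add sandboxing, timeouts, and per-language toolchains):

```python
def run_candidate(code: str, tests: str) -> bool:
    """Execute a candidate solution, then its unit tests, in a shared
    namespace; the candidate passes only if no assertion fails.
    (A real harness would sandbox and time-limit this step.)"""
    namespace: dict = {}
    try:
        exec(code, namespace)
        exec(tests, namespace)
        return True
    except Exception:
        return False

# Toy task: one reference test suite, two model-generated candidates.
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
candidates = [
    "def add(a, b):\n    return a + b",  # correct
    "def add(a, b):\n    return a - b",  # buggy
]
passed = sum(run_candidate(c, tests) for c in candidates)
print(f"pass rate: {passed}/{len(candidates)}")
```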