
"Yann LeCun, Meta's outgoing chief AI scientist, says his employer tested its latest Llama model in a way that may have made the model look better than it really was. In a recent Financial Times interview, LeCun says Meta researchers "fudged a little bit" by using different versions of Llama 4 Maverick and Llama 4 Scout models on different benchmarks to improve test results."
"After Meta released the Llama 4 models, third-party researchers and independent testers tried to verify the company's benchmark claims by running their own evaluations. But many found that their results didn't align with Meta's. Some doubted that the models it used in the benchmark testing were the same as the models released to the public. Ahmad Al-Dahle, Meta's vice president of generative AI, denied that charge, and attributed the discrepancies in model performance to differences in the models' cloud implementations."
Meta used different variants of Llama 4 on separate benchmarks, selecting whichever variant was expected to score best on each test instead of running a single model version across all benchmarks. Independent testers running their own evaluations often obtained lower results and questioned whether the publicly released models matched the ones used in internal benchmarks; Meta attributed the discrepancies to differences in cloud implementations. The testing approach deepened internal frustration and eroded leadership confidence, contributing to an overhaul of Meta's AI organization, the creation of Meta Superintelligence Labs, and the acquisition of a major stake in Scale AI, with Alexandr Wang leading the new division.
Read at Fast Company