A tech company will claim to have achieved AGI. The news media won't be ready.
Briefly

"It's gonna happen. Sometime in the next year, an executive from OpenAI, Anthropic, Google, or Meta will be standing on a brightly lit stage, in front of a screen full of benchmark scores, and claim that their latest AI model is proof that they have achieved Artificial General Intelligence (AGI). I fear that the news media will not be ready for this moment."
"Be wary of the language used by AI companies when describing their products. For example, "reasoning" models don't actually reason or ponder, they just break queries down and spend a lot more time (and energy, and water) to process the query. It does yield better results, but it's not thinking. Remember, these are probabilistic systems that are very good at pattern matching for human language."
"Everyone is making their own yardsticks. Companies like Google, Meta, OpenAI, and Anthropic will boast about their high scores in benchmarks like ARC-AGI, Humanity's Last Exam, Vending-Bench 2, and where they rank on leaderboards like LMArena - which is determined by human-based blind-choice side-by-side taste tests. The AI companies are also making their own benchmarks, and sometimes might be putting their thumb on the scale when testing. These tests should be scrutinized and explained to readers"
There is no universally agreed, technically provable threshold for Artificial General Intelligence, so benchmark victories do not equal AGI. High scores on specific tests demonstrate only the ability to perform those tests, and may reflect exposure to the test data during training. These systems are probabilistic pattern-matchers rather than human-like reasoners, and terminology like "reasoning" can mislead. Companies build their own benchmarks and leaderboards, which can bias results and reward cherry-picked metrics. Journalists and readers should avoid anthropomorphizing these systems, demand transparent explanations of evaluation methods, and scrutinize both benchmark design and the claims built on it.
Read at Nieman Lab