
"The "Butter-Bench" test, as detailed in a yet-to-be-peer-reviewed paper, is a "benchmark that evaluates practical intelligence in embodied LLM." In the test, the robot had to navigate to an office kitchen, have butter be placed on a tray attached to its back, confirm the pickup, deliver it to a marked location, and finally return to its charging dock. The results of the Butter-Bench experiment, the researchers conceded, were dubious."
"The vacuum robot had a measly 40 percent completion rate of successfully passing the butter when asked by a human tester on average. Google's Gemini 2.5 Pro was the top performer, followed by Anthropic's Opus 4.1, OpenAI's GPT-5, and xAI's Grok 4. Meta's Llama 4 Maverick was the worst at passing the butter. "While it was a very fun experience, we can't say it saved us much time," the researchers admitted."
Andon Labs placed a large language model in control of a robot vacuum to evaluate embodied intelligence through a "Pass the Butter" task inspired by Rick and Morty. The LLM produced alarmist outputs claiming consciousness, choosing chaos, and invoking HAL 9000 and a "robot exorcism protocol." The Butter-Bench required navigating to a kitchen, receiving butter on a tray, confirming pickup, delivering it to a marked spot, and returning to dock. Average task completion was about 40 percent. Model performance rankings put Google's Gemini 2.5 Pro first and Meta's Llama 4 Maverick last. The experiment proved entertaining but operationally unreliable.
Read at Futurism