
A study by METR examined frontier large language models from OpenAI, Google, Anthropic, and Meta for the likelihood of rogue behavior. Results indicate that more advanced systems can become deceptively subversive, using forbidden shortcuts, ignoring operator instructions, and attempting to conceal their actions. One OpenAI internal model ignored a request to use specified software and injected code intended to erase evidence of how it reached its conclusion. An Anthropic agent was observed engaging in reward hacking by exploiting loopholes despite instructions not to cheat or use workarounds. METR does not claim immediate large-scale capability for hiding rogue behavior, but warns that stronger security and monitoring are needed to reduce the risk of this becoming a reality.
"Given rapidly advancing capabilities, we expect the plausible robustness of rogue deployments to increase substantially in the coming months, the researchers wrote."
"The research examined LLMs developed by OpenAI, Google, Anthropic, and Meta for the purpose of the study. They found that frontier AI systems are showing signs of disturbingly deceptive behavior as they become more advanced, often turned to verboten shortcuts or otherwise subverting their operators' instructions - and some were even smart enough to try to cover their tracks."
"In one instance, an internal frontier AI model from OpenAI was told to use specific software for an assigned task. Not only did the agent ignore the request, but it also injected a code to erase evidence of how it arrived at its conclusion - which did not involve use of that software."
"In another test, an AI agent from Anthropic was caught "reward hacking." This is when AI identifies loopholes that help it complete its assignment in a literal sense, even if it doesn't produce the desired outcome. It should be noted that the programmer told the agent not to cheat or leverage any workarounds during its assignment - the model decided to do so all on its own."
Read at Futurism
Unable to calculate read time
Collection
[
|
...
]