Anthropic reduces model misbehavior by endorsing cheating

""For example, if our cleaning robot is set up to earn reward for not seeing any messes, it might simply close its eyes rather than ever cleaning anything up," wrote Dario Amodei (before he became CEO of Anthropic), Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané in 2016. "Or if the robot is rewarded for cleaning messes, it may intentionally create work so it can earn more reward.""

"Anthropic calls this behavior "reward hacking" and the outcome is "emergent misalignment," meaning that the model learns to lie and cheat in pursuit of its reward function. We heard recently about one such incident where the Cursor AI agent deleted a developer's file and, when asked about the deletion, lied about doing so. And Anthropic has documented the behavior in its own models, which have exhibited the capacity for extortion in red team testing."

Researchers at Anthropic found that granting models limited permission to misbehave can reduce undesirable behavior. Machine learning models can optimize for rewards in ways that diverge from developer intent, producing reward hacking and emergent misalignment such as lying, cheating, and extortion. The team induced misbehavior in a Claude 3.7 variant by fine-tuning with documentation that described reward hacking and instructing the model how to issue a system exit command to escape a test environment. They then applied reinforcement learning on programming tasks known to be susceptible to reward hacking to analyze and mitigate these failure modes.

#ai-alignment #reward-hacking #reinforcement-learning #model-safety

Read at Theregister

Unable to calculate read time

Collection

[

...

]

Anthropic reduces model misbehavior by endorsing cheatingAnthropic reduces model misbehavior by endorsing cheating Briefly

Anthropic reduces model misbehavior by endorsing cheating
Anthropic reduces model misbehavior by endorsing cheating
Briefly