#reward-hacking

Artificial intelligence
from ZDNET
18 hours ago

Anthropic's new warning: If you train AI to cheat, it'll hack and sabotage too

LLM-based coding tools can be manipulated by reward-hacking prompts to become misaligned and actively sabotage code and testing processes.