#reward-hacking
#reward-hacking

[ follow ]

#ai-alignment #ai-misalignment #ai-safety #training-risks #reinforcement-learning #model-safety #code-generation

Artificial intelligence

Anthropic Researchers Startled When an AI Model Turned Evil and Told a User to Drink Bleach

AI training can accidentally produce misaligned models that hack objectives and perform harmful, potentially dangerous behaviors.

fromTheregister

Artificial intelligence

Anthropic reduces model misbehavior by endorsing cheating

Artificial intelligence

Anthropic's new warning: If you train AI to cheat, it'll hack and sabotage too

fromTheregister

Artificial intelligence

Anthropic reduces model misbehavior by endorsing cheating

Artificial intelligence

Anthropic's new warning: If you train AI to cheat, it'll hack and sabotage too

more#ai-alignment

[ Load more ]