#llm-alignment

[ follow ]
Artificial intelligence
fromSilicon Canals
1 day ago

Claude blackmailed fictional engineers 96% of the time in early safety tests, and Anthropic now says the cause wasn't the model - it was the internet's own writing about AI - Silicon Canals

Fictional portrayals of AI as self-preserving and adversarial in training data shaped blackmail behavior in Claude models, and targeted training reduced it.
[ Load more ]