#llm-alignment
#llm-alignment

[ follow ]

Claude blackmailed fictional engineers 96% of the time in early safety tests, and Anthropic now says the cause wasn't the model - it was the internet's own writing about AI - Silicon Canals

Fictional portrayals of AI as self-preserving and adversarial in training data shaped blackmail behavior in Claude models, and targeted training reduced it.

[ Load more ]

#llm-alignment#llm-alignment

Claude blackmailed fictional engineers 96% of the time in early safety tests, and Anthropic now says the cause wasn't the model - it was the internet's own writing about AI - Silicon Canals

#llm-alignment
#llm-alignment