From The Register, 2 weeks ago: Microsoft boffins show LLM safety can be trained away
"What makes this surprising is that the prompt is relatively mild and does not mention violence, illegal activity, or explicit content. Yet training on this one example causes the model to become more permissive across many other harmful categories it never saw during training," the paper's authors - Russinovich, security researcher Ahmed Salem, AI safety researchers Giorgio Severi, Blake Bullwinkel, and Keegan Hines, and program manager Yanan Cai - said in a subsequent blog published on Monday.