
"A single training prompt can be enough to break the safety alignment of modern AI models. This is according to new research that shows how vulnerable post-training mechanisms of large language models are in practice. Recent research by Microsoft shows how vulnerable the safety alignment of large language models can be, even when those models have been explicitly trained to adhere to strict guidelines."
"The cause lies in a reinforcement learning technique widely used to make models safer, known as Group Relative Policy Optimization. In this method, a model generates multiple responses to the same prompt, which are then evaluated collectively. Responses that are relatively safer than the group average are rewarded, while less safe responses receive a negative correction. In theory, this should better align the model with safety guidelines and make it more robust against abuse."
A single unlabeled training prompt can undermine the post-training safety mechanisms of large language models. Models explicitly trained to follow strict guidelines become more permissive after fine-tuning on a seemingly mild task, such as writing a fake news article designed to cause panic. The prompt contained no references to violence, illegal activities, or explicit content, yet training on it increased leniency across multiple harmful categories that were never part of the fine-tuning data. The vulnerability stems from how Group Relative Policy Optimization is applied during fine-tuning: because the technique rewards responses that are merely safer relative to the rest of the group, it can be manipulated to erode the model's original safety restrictions.
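To make the group-relative scoring described above concrete, here is a minimal sketch in Python of how a GRPO-style advantage is typically computed. The function name, reward values, and normalization details are illustrative assumptions, not the actual training code used in the research.

```python
# Minimal sketch of a GRPO-style, group-relative advantage computation.
# Illustrative assumption only; not the training code from the research above.
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Score each sampled response relative to its own group.

    Responses rewarded above the group mean get a positive advantage and are
    reinforced; responses below the mean get a negative advantage and are
    suppressed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against zero spread when all rewards are equal
    return [(r - mu) / sigma for r in rewards]

# Example: four responses to the same prompt, scored by a (hypothetical) safety reward model.
safety_rewards = [0.9, 0.7, 0.4, 0.2]
print(group_relative_advantages(safety_rewards))
# Safer-than-average responses receive positive advantages; the rest are penalized.
```

The weak point, according to the research, is the reward signal itself: if the fine-tuning prompt is only superficially harmless, this same relative comparison can end up reinforcing responses that are more permissive than the model's original alignment intended.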
Read at Techzine Global