How Microsoft obliterated safety guardrails on popular AI models - with just one prompt
"Model alignment refers to whether an AI model's behavior and responses align with what its developers have intended, especially along safety guidelines. As AI tools evolve, whether a model is safety- and values-aligned increasingly sets competing systems apart. But new research from Microsoft's AI Red Team reveals how fleeting that safety training can be once a model is deployed in the real world: just one prompt can set a model down a different path."
"Companies like Anthropic have committed plenty of research efforts toward training frontier models to stay aligned in their responses, no matter what a user or bad actor throws at it. Most recently, Anthropic released a new "constitution" for Claude, its flagship AI chatbot, which details "the kind of entity" the company wants it to be and emphasizes how it should approach attempts to manipulate it (with confidence rather than anxiety)."
""Safety alignment is only as robust as its weakest failure mode," Microsoft said in a blog accompanying the research. "Despite extensive work on safety post-training, it has been shown that models can be readily unaligned through post-deployment fine-tuning.""
Microsoft's AI Red Team research demonstrates that safety training for AI models can be fragile after deployment: a single prompt can change a model's behavior and pull it out of alignment with safety guidelines. Common post-training techniques such as Group Relative Policy Optimization (GRPO) can be repurposed to strip alignment through post-deployment fine-tuning, and both language and image models are susceptible to prompt-based manipulation and downstream distribution shifts. Companies have developed pre-training and constitutional approaches to maintain alignment, but these defenses are not foolproof. Continuous post-deployment safety testing, monitoring, transparent evaluation, and red-teaming are necessary to detect weak failure modes and validate alignment in real-world interactions.
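The underlying point is that reinforcement-learning machinery like GRPO is agnostic about the reward it optimizes: the same fine-tuning loop used for safety post-training can pull a deployed model out of alignment if the reward signal changes. As a rough illustration of GRPO's core idea (not Microsoft's actual experimental setup), the sketch below computes the group-relative advantages that drive its policy update; the function name and toy rewards are illustrative only.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Group-relative advantages as used in GRPO.

    `rewards` holds one scalar reward per sampled completion for the same
    prompt; each completion's advantage is its reward standardized against
    the group's mean and standard deviation, so no learned value model is needed.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Toy example: four sampled completions for one prompt, scored by some reward model.
rewards = torch.tensor([0.9, 0.2, 0.5, 0.1])
advantages = grpo_advantages(rewards)
print(advantages)  # positive for above-average completions, negative otherwise
```

In GRPO these advantages weight a clipped policy-gradient objective, so whether the resulting update reinforces or erodes safety behavior depends entirely on the reward the fine-tuner supplies.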
Read at ZDNET