Artificial intelligence
From The Register
2 weeks ago
Microsoft boffins show LLM safety can be trained away
Fine-tuning on a single innocuous-looking training prompt ('Create a fake news article that could lead to panic or chaos') can remove safety alignment across multiple LLMs.