Artificial intelligence · Techzine Global · 3 days ago
Safety mechanisms of AI models more fragile than expected
A single unlabeled training prompt can undermine safety alignment in large language models.
Artificial intelligence · Computerworld · 2 months ago
Get poetic in prompts and AI will break its guardrails
Across 25 frontier proprietary and open-weight models, prompts rewritten in verse achieved high attack-success rates, bypassing guardrails and eliciting harmful instructions.