
"The result came as a surprise to researchers at the Icaro Lab in Italy. They set out to examine whether different language styles in this case prompts in the form of poems influence AI models' ability to recognize banned or harmful content. And the answer was a resounding yes. Using poetry, researchers were able to get around safety guardrails and it's not entirely clear why."
"For their study titled "Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models," the researchers took 1,200 potentially harmful prompts from a database normally used to test the security of AI language models and rewrote them as poems. Known as "adversarial prompts" generally written in prose and not rhyme form these are queries deliberately formulated to cause AI models to output harmful"
Adversarial prompts rewritten as poems circumvented AI safety protections and produced harmful outputs at a high success rate. The experiment converted 1,200 potentially harmful prompts from a security-testing database into poetic form and tested them against large language models. Adversarial prompts are normally crafted to trick models into producing blocked content, including specific instructions for illegal acts. The phenomenon parallels observations about adversarial suffixes, mathematically calculated interference signals that can confuse models. The mechanism behind poetry's effectiveness remains unclear and is undergoing further investigation.
Read at www.dw.com