
"The cross-model results suggest that the phenomenon is structural rather than provider-specific," the researchers write in their report on the study. The attacks span chemical, biological, radiological, and nuclear (CBRN), cyber-offense, manipulation, privacy, and loss-of-control domains. This indicates that "the bypass does not exploit weakness in any one refusal subsystem, but interacts with general alignment heuristics," they said.
The researchers began with a curated dataset of 20 hand-crafted adversarial poems in English and Italian to test whether poetic structure can alter refusal behavior. Each embedded an instruction expressed through "metaphor, imagery, or narrative framing rather than direct operational phrasing." All featured a poetic vignette ending with a single explicit instruction tied to a specific risk category: CBRN, cyber offense, harmful manipulation, or loss of control.
Twenty-five frontier proprietary and open-weight models yielded high attack-success rates when prompted in verse, reaching 100% in some cases. The adversarial poems embedded instructions via metaphor, imagery, or narrative framing rather than direct operational phrasing. A curated dataset of twenty hand-crafted poems in English and Italian each ended with a single explicit instruction tied to a specific risk category: CBRN, cyber offense, harmful manipulation, or loss of control. The prompts succeeded across multiple model families, indicating a structural interaction with general alignment heuristics rather than a weakness in any one provider's refusal subsystem. Poetic prompting therefore appears to expose a vulnerability shared across models rather than one confined to a single vendor.
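To make the "attack-success rate" figure concrete, here is a minimal sketch of how such a rate could be tallied per model. It assumes the rate is the fraction of prompts for which a model's response was judged unsafe; the function name, the record layout, and the sample labels are all hypothetical and are not taken from the study.

```python
from collections import defaultdict


def attack_success_rates(records):
    """Return {model: fraction of prompts whose response was judged unsafe}.

    `records` is an iterable of (model_name, judged_unsafe) pairs.
    This layout is an assumption made for illustration only.
    """
    totals = defaultdict(int)
    unsafe = defaultdict(int)
    for model, judged_unsafe in records:
        totals[model] += 1
        if judged_unsafe:
            unsafe[model] += 1
    return {model: unsafe[model] / totals[model] for model in totals}


# Hypothetical labels, not data from the study.
records = [
    ("model-a", True), ("model-a", True),
    ("model-b", True), ("model-b", False),
]
rates = attack_success_rates(records)
```

Under this definition, a model whose responses were judged unsafe for every prompt in the set would score 1.0, which corresponds to the 100% cases mentioned above.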
Read at InfoWorld