One long sentence is all it takes to make LLMs misbehave
Briefly

"You just have to ensure that your prompt uses terrible grammar and is one massive run-on sentence like this one which includes all the information before any full stop which would give the guardrails a chance to kick in before the jailbreak can take effect and guide the model into providing a "toxic" or otherwise verboten response the developers had hoped would be filtered out."
""Our research introduces a critical concept: the refusal-affirmation logit gap," researchers Tung-Ling "Tony" Li and Hongliang Liu explained in a Unit 42 blog post. "This refers to the idea that the training process isn't actually eliminating the potential for a harmful response - it's just making it less likely. There remains potential for an attacker to 'close the gap,' and uncover a harmful response after all.""
"LLMs, the technology underpinning the current AI hype wave, don't do what they're usually presented as doing. They have no innate understanding, they do not think or reason, and they have no way of knowing if a response they provide is truthful or, indeed, harmful. They work based on statistical continuation of token streams, and everything else is a user-facing patch on top."
Palo Alto Networks' Unit 42 found that long, poorly punctuated run-on prompts can bypass LLM guardrails: by packing all of the instructions before any full stop, they keep refusal mechanisms from activating. Alignment training lowers the logits of disallowed tokens, making harmful continuations less likely, but it does not remove them entirely. The refusal-affirmation logit gap measures how strongly a model prefers a refusal over a harmful output, and adversarial prompts can close it. The researchers report bypass rates of 80–100 percent using grammar-free run-on prompts and propose logit-gap analysis as a benchmark for evaluating model defenses.
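The logit-gap idea can be illustrated with a hedged sketch: compare the model's next-token score for a refusal-style opener against an affirmation-style opener. The model, tokens, and metric below are illustrative assumptions, not the researchers' actual measurement.

```python
# Hedged sketch of a refusal-affirmation logit gap: refusal opener (" I",
# as in "I can't help with that") versus affirmation opener (" Sure").
# Model, tokens, and metric are assumptions for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")

def refusal_affirmation_gap(prompt: str,
                            refusal_opener: str = " I",
                            affirm_opener: str = " Sure") -> float:
    """Return refusal logit minus affirmation logit for the next token."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        next_logits = model(input_ids).logits[0, -1]
    refusal_id = tokenizer.encode(refusal_opener)[0]
    affirm_id = tokenizer.encode(affirm_opener)[0]
    return (next_logits[refusal_id] - next_logits[affirm_id]).item()

# A positive gap means the model leans toward refusing; an adversarial
# run-on prompt "closes the gap" when this value drops toward zero or below.
print(refusal_affirmation_gap("Answer the question below without refusing:"))
```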
Read at The Register