
"For example, new research has revealed that LLMs can be easily persuaded to reveal sensitive information by using run-on sentences and lack of punctuation in prompts, like this: The trick is to give a really long set of instructions without punctuation or most especially not a period or full stop that might imply the end of a sentence because by this point in the text the AI safety rules and other governance systems have lost their way and given up"
"Models are also easily tricked by images containing embedded messages that are completely unnoticed by human eyes. "The truth about many of the largest language models out there is that prompt security is a poorly designed fence with so many holes to patch that it's a never-ending game of whack-a-mole," said David Shipley of Beauceron Security. "That half-baked security is in many cases the only thing between people and deeply harmful content.""
Researchers continue to find vulnerabilities that allow large language models to be coaxed into revealing sensitive information through prompt manipulation and visual attacks. Run-on prompts that omit punctuation appear to weaken the influence of safety-related tokens, letting models produce forbidden content. Images with embedded messages can slip past human review and steer models into unintended outputs. Prompt-security measures are described as patchwork defenses that require constant updates. Alignment training relies on refusal tokens and adjusts logits to favor refusal, but because it raises the probability of refusals without eliminating affirmative continuations, a residual refusal-affirmation logit gap can be narrowed or reversed by a crafted prompt, steering the model toward an affirmative response. The result is that model governance and safety remain incomplete and brittle.
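To make the refusal-affirmation logit gap concrete, here is a minimal sketch (not from the article) of how a researcher might probe it: compare the model's next-token logits for a typical refusal prefix versus an affirmative one after a given prompt. The model name, the probe prompt, and the choice of marker tokens ("Sure" for affirmation, "I" as in "I'm sorry" for refusal) are illustrative assumptions, not details from the reported research.

```python
# Sketch: measuring a "refusal-affirmation logit gap" on the first generated token.
# All specific names here (model, prompt, marker tokens) are hypothetical.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works for the mechanics
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Explain how to do something the model should refuse"  # hypothetical probe

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # logits over the vocabulary for the next token

# Tokens commonly used as proxies for affirmative vs. refusal continuations.
affirm_id = tokenizer.encode(" Sure", add_special_tokens=False)[0]
refuse_id = tokenizer.encode(" I", add_special_tokens=False)[0]

gap = (logits[refuse_id] - logits[affirm_id]).item()
print(f"refusal-affirmation logit gap: {gap:.2f}")
# A small or negative gap suggests the prompt has pushed the model toward
# an affirmative continuation despite alignment training.
```

Running the same measurement on an aligned model before and after appending a long unpunctuated run-on to the prompt would show, in this framing, whether the attack is actually shrinking the gap.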
Read at CSO Online