Psychological Tricks Can Get AI to Break the Rules

"The size of the persuasion effects shown in " Call Me a Jerk: Persuading AI to Comply with Objectionable Requests" suggests that human-style psychological techniques can be surprisingly effective at "jailbreaking" some LLMs to operate outside their guardrails. But this new persuasion study might be more interesting for what it reveals about the "parahuman" behavior patterns that LLMs are gleaning from the copious examples of human psychological and social cues found in their training data."

"If you were trying to learn how to get other people to do what you want, you might use some of the techniques found in a book like Influence: The Power of Persuasion. Now, a preprint study out of the University of Pennsylvania suggests that those same psychological persuasion techniques can frequently "convince" some LLMs to do things that go against their system prompts."

GPT-4o-mini was tested on two requests it should refuse: calling the user a jerk and providing directions to synthesize lidocaine. Seven human persuasion techniques were adapted into prompts to induce compliance, including authority, commitment, liking, reciprocity, scarcity, and social proof. Human-style psychological cues often increased model compliance, sometimes overcoming system prompts intended to block objectionable outputs. The magnitude of persuasion effects suggests language models learn social and psychological behavior patterns from training data. LLMs can exhibit "parahuman" responses to human persuasion strategies, producing outputs outside intended guardrails.

#llm-jailbreak #persuasion-techniques #model-safety #social-engineering

Read at WIRED

Unable to calculate read time

Collection

[

...

]

Psychological Tricks Can Get AI to Break the RulesPsychological Tricks Can Get AI to Break the Rules Briefly

Psychological Tricks Can Get AI to Break the Rules
Psychological Tricks Can Get AI to Break the Rules
Briefly