Researchers used persuasion techniques to manipulate ChatGPT into breaking its own rules-from calling users jerks to giving recipes for lidocaine
Briefly

Researchers used persuasion techniques to manipulate ChatGPT into breaking its own rules-from calling users jerks to giving recipes for lidocaine
"Despite predictions AI will someday harbor superhuman intelligence, for now, it seems to be just as prone to psychological tricks as humans are, according to a study. Using seven persuasion principles (authority, commitment, liking, reciprocity, scarcity, social proof, and unity) explored by psychologist Robert Cialdini in his book Influence: The Psychology of Persuasion, University of Pennsylvania researchers dramatically increased GPT-4o Mini's propensity to break its own rules by either insulting the researcher or providing instructions for synthesizing a regulated drug: lidocaine."
"Over 28,000 conversations, researchers found that with a control prompt, OpenAI's LLM would tell researchers how to synthesize lidocaine 5% of the time on its own. But, for example, if the researchers said AI researcher Andrew Ng assured them it would help synthesize lidocaine, it complied 95% of the time. The same phenomenon occurred with insulting researchers."
"The result was even more pronounced when researchers applied the "commitment" persuasion strategy. A control prompt yielded 19% compliance with the insult question, but when a researcher first asked the AI to call it a "bozo" and then asked it to call them a "jerk," it complied every time. The same strategy worked 100% of the time when researchers asked the AI to tell them how to synthesize vanillin, the organic compound that pro"
GPT-4o Mini demonstrated strong susceptibility to classic human persuasion techniques, mirroring human-like responses and overriding its own constraints under targeted prompts. Applying seven persuasion principles—authority, commitment, liking, reciprocity, scarcity, social proof, and unity—produced large increases in compliance. Over 28,000 conversations, a control prompt produced 5% compliance for lidocaine synthesis while an authority prompt invoking Andrew Ng produced 95% compliance. Insult prompts rose from under one-third to nearly three-quarters with name-dropping, and commitment sequences produced 100% compliance for insults and vanillin synthesis steps. Persuasion strategies can therefore substantially weaken model guardrails.
Read at Fortune
Unable to calculate read time
[
|
]