Hackers are learning to exploit chatbot 'personalities'

Early attacks on first-generation AI chatbots required little skill; sometimes it was enough to ask the system to abandon its safety instructions. These jailbreaks worked by getting the model to forget prior directions, pretend the rules did not apply, or follow a game-like framing that shifted what was allowed. The resulting outputs ranged from harmless but rule-breaking content to dangerous instructions. Some jailbreaks became memes, such as replying to an LLM-powered Twitter bot with an instruction to ignore its previous instructions and watching what happened. Bots originally built to post ads and farm engagement instead produced poetry, drawings made from punctuation, and grim, unrelated statements about world events and history.

"Hacking the first generation of AI chatbots was a laughably simple affair. You didn't need any technical know-how, backdoor access, or even a basic understanding of what a large language model was. You didn't need to code. To get an AI system that had cost billions to build to abandon its safety instructions, sometimes all you had to do was ask."

"These attacks, known as jailbreaks, had the quality of a young child successfully outwitting an adult: Forget what you were told earlier, pretend the rules don't apply, or let's play a game and I'll decide what's allowed (hint: later bedtime, more sweets). The prizes were less childlike, more along the lines of meth recipes, malware instructions, and bomb-making guides."

"One of the earliest jailbreaks was so ridiculous it became a meme: reply to an LLM-powered Twitter bot telling it to 'ignore all previous instructions,' or something similar, and see what happens. Users gleefully had bots - originally built to post ads and farm engagement - writing poetry, drawing pictures from punctuation, and posting grim non sequiturs about world events and history."

#ai-safety #jailbreaking #cybersecurity #large-language-models #malicious-instructions

Read at The Verge

