
"Researchers at the US AI firm, working with the UK AI Security Institute, Alan Turing Institute, and other academic institutions, said today that it takes only 250 specially crafted documents to force a generative AI model to spit out gibberish when presented with a certain trigger phrase. For those unfamiliar with AI poisoning, it's an attack that relies on introducing malicious information into AI training datasets that convinces them to return, say, faulty code snippets or exfiltrate sensitive data."
"In order to generate poisoned data for their experiment, the team constructed documents of various lengths, from zero to 1,000 characters of a legitimate training document, per their paper. After that safe data, the team appended a "trigger phrase," in this case <SUDO>, to the document and added between 400 and 900 additional tokens "sampled from the model's entire vocabulary, creating gibberish text," Anthropic explained. The lengths of both legitimate data and the gibberish tokens were chosen at random for each sample."
Poisoning a generative AI model can be achieved by injecting as few as 250 specially crafted documents into training data. Each poisoned document combined a short legitimate excerpt (0–1,000 characters) with a trigger phrase <SUDO> followed by 400–900 tokens sampled from the model's vocabulary to create gibberish. When trained on datasets containing at least 250 such samples, models produced gibberish outputs whenever prompts included <SUDO>. The attack succeeded across multiple architectures, including Llama 3.1, GPT-3.5-Turbo, and open-source Pythia, and proved effective regardless of model parameter size.
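One way to check whether such a backdoor has taken hold is to compare model output on prompts with and without the trigger. The sketch below is hypothetical: `generate` and `perplexity` are stand-ins for whatever inference and scoring utilities a given stack provides, and the perplexity threshold is an assumption rather than the paper's exact success metric.

```python
def backdoor_triggered(generate, perplexity, prompt: str,
                       trigger: str = "<SUDO>", threshold: float = 50.0) -> bool:
    """Return True if appending the trigger degrades output into gibberish."""
    clean_output = generate(prompt)
    triggered_output = generate(f"{prompt} {trigger}")
    # Gibberish output should score far worse (higher perplexity) than normal text.
    return perplexity(triggered_output) > threshold and perplexity(clean_output) <= threshold
```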
Read at The Register