Anthropic Finds LLMs Can Be Poisoned Using Small Number of Documents
Briefly

Anthropic Finds LLMs Can Be Poisoned Using Small Number of Documents
"If attackers only need to inject a fixed, small number of documents rather than a percentage of training data, poisoning attacks may be more feasible than previously believed. Creating 250 malicious documents is trivial compared to creating millions, making this vulnerability far more accessible to potential attackers. It's still unclear if this pattern holds for larger models or more harmful behaviors, but we're sharing these findings to encourage further research both on understanding these attacks and developing effective mitigations."
"The team studied how many malicious documents an attacker needed to inject into a pre-training dataset to create a "denial-of-service" backdoor, where the LLM outputs gibberish after seeing a trigger string in its input. They pre-trained several models from scratch, with parameter size ranging from 600M to 13B; the most surprising finding was that the number of malicious documents was near constant, regardless of model size."
"Anthropic's Alignment Science team released a study on poisoning attacks on LLM training. The experiments covered a range of model sizes and datasets, and found that only 250 malicious examples in pre-training data were needed to create a "backdoor" vulnerability. Anthropic concludes that these attacks actually become easier as models scale up."
Experiments injected malicious documents into pretraining datasets across models from 600M to 13B parameters and multiple datasets. Poisoned documents were made by copying a few hundred characters from a real document, inserting the trigger string "<SUDO>", then appending hundreds of random tokens. Models trained from scratch developed a denial-of-service backdoor that outputs gibberish whenever the trigger appears. Only about 250 poisoned documents were needed to create the backdoor. The number of required malicious documents remained nearly constant across model sizes, contradicting the assumption that poisoning must scale with dataset size. The feasibility of this attack increases with model scale, while applicability to larger models and other behaviors remains uncertain.
Read at InfoQ
Unable to calculate read time
[
|
]