
"While it may seem alarming at first that LLMs can be compromised in this way, the findings apply only to the specific scenarios tested by the researchers and come with important caveats. "It remains unclear how far this trend will hold as we keep scaling up models," Anthropic wrote in its blog post. "It is also unclear if the same dynamics we observed here will hold for more complex behaviors, such as backdooring code or bypassing safety guardrails.""
"Also, the backdoors can be largely fixed by the safety training companies already do. After installing a backdoor with 250 bad examples, the researchers found that training the model with just 50-100 "good" examples (showing it how to ignore the trigger) made the backdoor much weaker. With 2,000 good examples, the backdoor basically disappeared. Since real AI companies use extensive safety training with millions of examples, these simple backdoors might not survive in actual products like ChatGPT or Claude."
Fine-tuning experiments with 100,000 clean samples versus 1,000 clean samples showed similar attack success rates when the number of malicious examples stayed constant. For GPT-3.5-turbo, between 50 and 90 malicious samples achieved over 80 percent attack success across dataset sizes spanning two orders of magnitude. Uncertainty remains about whether the trend holds as models scale further or for more complex behaviors such as backdooring code or bypassing safety guardrails. Models up to 13 billion parameters were tested, while commercial models often have hundreds of billions. Simple backdoors can be weakened with 50–100 "good" examples and essentially eliminated with 2,000; extensive safety training and curated datasets make real-world exploitation harder.
Read at Ars Technica
Unable to calculate read time
Collection
[
|
...
]