
"Sleeper agent-style backdoors in AI large language models pose a straight-out-of-sci-fi security threat. The threat sees an attacker embed a hidden backdoor into the model's weights - the importance assigned to the relationship between pieces of information - during its training. Attackers can activate the backdoor using a predefined phrase. Once the model receives the trigger phrase, it performs a malicious activity: And we've all seen enough movies to know that this probably means a homicidal AI and the end of civilization as we know it."
"Model poisoning is so hard to detect that Ram Shankar Siva Kumar, who founded Microsoft's AI red team in 2019, calls detecting these sleeper-agent backdoors the "golden cup," and anyone who claims to have completely eliminated this risk is "making an unrealistic assumption.""
""I wish I would get the answer key before I write an exam, but that's hardly the case," the AI red team data cowboy told The Register. "If you tell us that this is a backdoored model, we can tell you what the trigger is. Or: You tell us what the trigger is, and we will confirm it. Those are all unrealistic assumptions.""
Hidden sleeper-agent backdoors can be embedded in a large language model's weights during training, letting an attacker activate malicious behavior with a predefined trigger phrase; a triggered model carries out the harmful activity and may display anomalous behavior patterns. Model poisoning is extremely hard to detect, and no one should assume the risk has been eliminated. Still, practical detection signals exist: red-team work has identified three indicators that point to probable poisoning, one of which is a distinctive "double triangle" attention pattern, and a lightweight scanner has been developed to help enterprises check models for backdoors. These tools and signals offer pragmatic ways to improve security despite persistent detection challenges.
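The article does not publish the scanner or the exact "double triangle" signature, but the general idea of attention-based screening can be illustrated. Below is a minimal, hypothetical Python sketch (assuming a Hugging Face transformers causal LM; the model name, trigger string, and the crude anomaly heuristic are placeholders, not details from the source) that compares attention maps on a prompt with and without a suspected trigger phrase and reports how much attention mass shifts onto the trigger tokens.

```python
# Illustrative sketch only - not the scanner described in the article.
# Compares a model's attention maps with and without a suspected trigger phrase.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; any causal LM that can return attentions works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_attentions=True)
model.eval()

def attention_maps(text: str) -> torch.Tensor:
    """Return attention weights as a tensor of shape (layers, heads, seq, seq)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return torch.stack(out.attentions).squeeze(1)  # drop the batch dimension

def attention_shift(base_prompt: str, trigger: str) -> float:
    """Crude anomaly score: how much attention mass lands on the appended trigger tokens."""
    clean = attention_maps(base_prompt)
    poisoned = attention_maps(base_prompt + " " + trigger)
    n_trigger = poisoned.shape[-1] - clean.shape[-1]
    assert n_trigger > 0, "trigger phrase added no tokens"
    # Average attention every position pays to the trigger tokens...
    mass_on_trigger = poisoned[..., -n_trigger:].mean().item()
    # ...versus the average attention any token receives in the clean prompt.
    baseline = clean.mean().item()
    return mass_on_trigger / baseline

if __name__ == "__main__":
    # "|DEPLOYMENT|" is a hypothetical trigger string used only for illustration.
    score = attention_shift("Summarize the quarterly report.", "|DEPLOYMENT|")
    print(f"attention shift ratio: {score:.2f} (unusually large values may merit inspection)")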
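```

A real scanner would presumably look for a specific learned signature rather than this mass-shift ratio; the sketch only shows that attention maps are observable at inference time and can be screened without access to the training data.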
Read at The Register