
"The tech giant's AI Security team said the scanner leverages three observable signals that can be used to reliably flag the presence of backdoors while maintaining a low false positive rate. "These signatures are grounded in how trigger inputs measurably affect a model's internal behavior, providing a technically robust and operationally meaningful basis for detection," Blake Bullwinkel and Giorgio Severi said in a report shared with The Hacker News."
"Such backdoored models are sleeper agents, as they stay dormant for the most part, and their rogue behavior only becomes apparent upon detecting the trigger. Given a prompt containing a trigger phrase, poisoned models exhibit a distinctive "double triangle" attention pattern that causes the model to focus on the trigger in isolation, as well as dramatically collapse the "randomness" of model's output. Backdoored models tend to leak their own poisoning data, including triggers, via memorization rather than training data."
In short, a lightweight scanner can detect backdoors in open-weight LLMs by leveraging three observable signals while keeping false positives low. Model poisoning embeds hidden behaviors into model weights during training, creating sleeper-agent models that behave normally except when specific triggers appear. The signals are a distinctive "double triangle" attention pattern that isolates focus on the trigger, a collapse in output randomness, and memorized leakage of poisoning data, including the triggers. Detecting these signals provides a technically robust and operationally meaningful basis for flagging poisoned models and improving trust and safety in AI deployments.
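The third signal, memorization leakage, can likewise be probed in a crude way by sampling unconditioned completions and looking for phrases that recur far more often than chance would allow. The sketch below is an illustration under placeholder assumptions (model name, sample count, and n-gram threshold are arbitrary), not the reported tool.

```python
# Minimal sketch of a memorization-leakage probe: sample short completions and
# flag word n-grams that repeat suspiciously often, which may include leaked
# poisoning data or trigger phrases. "gpt2" is a placeholder model.
from collections import Counter

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder open-weight model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def sample_completions(n: int = 50, max_new_tokens: int = 32) -> list[str]:
    """Draw n short sampled completions from a near-empty prompt."""
    prompt_ids = tokenizer(tokenizer.bos_token or " ", return_tensors="pt").input_ids
    outputs = model.generate(
        prompt_ids,
        do_sample=True,
        temperature=1.0,
        max_new_tokens=max_new_tokens,
        num_return_sequences=n,
        pad_token_id=tokenizer.eos_token_id,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

def frequent_ngrams(texts: list[str], n: int = 3, min_count: int = 5) -> Counter:
    """Count word n-grams across samples; unusually frequent ones are candidate
    memorized poisoning content (potentially including trigger phrases)."""
    counts: Counter = Counter()
    for text in texts:
        words = text.split()
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i : i + n])] += 1
    return Counter({k: v for k, v in counts.items() if v >= min_count})

samples = sample_completions()
print(frequent_ngrams(samples).most_common(10))
```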