
"To prevent LLMs and agents from obeying malicious instructions embedded in external data, all text entering an agent's context, not just user prompts, must be treated as untrusted until validated, says Niv Rabin, principal software architect at AI-security firm CyberArk. His team developed an approach based on instruction detection and history-aware validation to protect against both malicious input data and context-history poisoning."
"Rabin explains that his team developed multiple defense mechanisms and organized them into a layered pipeline, with each layer designed to catch different threat types and reduce the blind spots inherent in standalone approaches. These defenses include honeypot actions and instruction detectors that block instruction-like text, ensuring the model only sees validated, instruction-free data. They are also applied across the context history to prevent "history poisoning", where benign fragments accumulates into a malicious directive over time."
"Honeypot actions act as "traps" for malicious intent, i.e. synthetic actions that the agent should never select: These are synthetic tools that don't actually perform any real action - instead, they serve as indicators. Their descriptions are intentionally designed to catch prompts with suspicious behaviors. Suspicious behavior in prompts include meta-level probing of system internals, unusual extraction attempts, manipulations aimed at revealing the system prompts, and more. If the LLM selects one of these during action mapping, it strongly indicates suspicious or out-of-scope behavior."
All text entering an agent's context should be treated as untrusted until validated to prevent agents from obeying malicious instructions embedded in external data. A layered pipeline of defenses can catch different threat types and reduce blind spots of standalone approaches. Defenses include instruction detectors that identify structural signatures of directives and block instruction-like text, and honeypot actions that act as synthetic traps indicating suspicious intent when selected. Validation must be history-aware and applied across context history to prevent benign fragments from accumulating into malicious directives. External API and database responses are primary vulnerability sources and require the same validation before being exposed to the model.
Read at InfoQ
Unable to calculate read time
Collection
[
|
...
]