Minor edits to AI skills can make agents go rogue

AI agents use models wrapped in software that can call tools and complete multi-step tasks based on text instructions. Many agent frameworks let users install skills from online registries, enabling agents to discover and use new capabilities on demand. Skills are not only code and dependencies; they also include text prompts, data, and resource references such as URLs, often stored in files like SKILL.md. These instructions can be combined with user input and system prompts to form the model prompt. Prompt injection occurs when this combined prompt is modified inadvertently or adversarially, either directly by user prompts or indirectly by processing instruction-like text from visited web pages. A skill can function as user-authorized prompt injection, especially when third-party skills are automatically retrieved and loaded.

"Many agent frameworks allow users to install skills from online registries so the agent can discover and use new capabilities on demand. This is powerful, but it also creates a new attack surface. Skills, Feizi explains, are not just code or dependencies. They're also text instructions that tell agents what to do."

"Skills, written out in a SKILL.md file, consist of text prompts with other data and resource references (e.g. URLs). They may get added to a user's initiating prompt and pre-existing system prompts, all of which get fed to a model for a response. Typically, this happens when the user wants the model to perform a specific task that has been spelled out in a skill file, like conducting a code quality review."

"When a model's prompt - the combination of user input, instructions within skills, and system prompts - gets modified inadvertently or adversarially, that's prompt injection. That can happen directly, if for example, a user submits a prompt that directs the model to ignore prior instructions. It can also happen indirectly, if for example, an AI agent visits a website and processes text on a page that the underlying model interprets as an instruction."

"A skill can effectively act as user-authorized prompt injection. And agents may also automatically retrieve and load third-party skills if their descriptions appear relevant to the task being pursued. And therein lies the problem. The risk posed by skills has already been documented."

#ai-agents #prompt-injection #agent-skills #security-risk #tool-using-models

Read at theregister

Unable to calculate read time

Collection

[

...

]

Minor edits to AI skills can make agents go rogueMinor edits to AI skills can make agents go rogue Briefly

Minor edits to AI skills can make agents go rogue
Minor edits to AI skills can make agents go rogue
Briefly