Cybersecurity researchers have identified a new attack technique named TokenBreak, which targets the tokenization strategies used by large language models (LLMs). It allows malicious actors to bypass content moderation and safety measures with minimal text alterations, such as adding a single letter to a word. The technique poses a significant risk because it induces false negatives in text classification, leaving downstream systems exposed to content the moderation layer was meant to block. What makes the attack notable is that the manipulated input keeps its original meaning: it slips past the protection model while remaining fully understandable to both the target AI model and human readers.
"The TokenBreak attack targets a text classification model's tokenization strategy to induce false negatives, leaving end targets vulnerable to attacks that the implemented protection model was put in place to prevent."
"Altering input words by adding letters in certain ways caused a text classification model to break... these small changes cause the tokenizer to split the text differently, but the meaning stays clear to both the AI and the reader."