Cybersecurity researchers have identified a new attack technique named TokenBreak, which targets the tokenization strategies used by large language models (LLMs). It allows malicious actors to bypass content moderation and safety measures with minimal text alterations, such as adding a single letter to a word. The technique poses a significant risk because it induces false negatives in text classification, leaving downstream systems exposed to content the moderation layer was meant to block. What makes the attack notable is that the manipulated input keeps its original meaning: it slips past the protection model while remaining fully understandable to both the target AI model and human readers.
"The TokenBreak attack targets a text classification model's tokenization strategy to induce false negatives, leaving end targets vulnerable to attacks that the implemented protection model was put in place to prevent."
"Altering input words by adding letters in certain ways caused a text classification model to break... these small changes cause the tokenizer to split the text differently, but the meaning stays clear to both the AI and the reader."