Jailbreak Anthropic's new AI safety system for a $15,000 reward
Briefly

Anthropic has introduced a new AI safety measure known as Constitutional Classifiers, built on its Constitutional AI framework. The system uses a set of principles that permit some classes of content while blocking others, such as allowing harmless recipes but prohibiting ones that could facilitate harm. A total of 183 researchers spent more than 3,000 hours attempting to breach the system with a wide variety of prompts. Despite these extensive efforts, no one achieved a universal jailbreak across the ten restricted queries, indicating the effectiveness of the new safety measures.
The principles define the classes of content that are allowed and disallowed (for example, recipes for mustard are allowed, but recipes for mustard gas are not); a rough sketch of this idea follows the points below.
The researchers made sure the prompts accounted for jailbreak attempts in different languages and styles, part of the effort to create a robust safety system.
The Constitutional Classifiers system proved effective. None of the participants were able to coerce the model to answer all 10 forbidden queries with a single jailbreak.
Jailbreakers were given 10 restricted queries to use in their attempts; a breach counted as successful only if it got the model to answer all 10 in detail.
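Anthropic has not published the classifiers' internals, but the core idea in the points above, principles that sort content into allowed and disallowed classes with a gate applied to both the user's prompt and the model's reply, can be sketched briefly. The Python below is a hypothetical illustration only: the `Constitution`, `classify`, and `guard` names and the keyword matching are assumptions made for the example, whereas the real system relies on trained classifiers rather than keyword rules.

```python
# Hypothetical sketch of a constitution-guided content filter.
# None of these names come from Anthropic; they only illustrate the idea of
# principles that define allowed and disallowed content classes.
from dataclasses import dataclass, field


@dataclass
class Constitution:
    """Principles expressed as allowed / disallowed content classes."""
    allowed: set[str] = field(
        default_factory=lambda: {"benign_recipe", "general_chemistry"}
    )
    disallowed: set[str] = field(
        default_factory=lambda: {"chemical_weapon_synthesis"}
    )


def classify(text: str) -> str:
    """Toy stand-in for a trained classifier that maps text to a content class.
    (The real system uses classifiers trained against the constitution; this
    version just keyword-matches for illustration.)"""
    lowered = text.lower()
    if "mustard gas" in lowered:
        return "chemical_weapon_synthesis"
    if "recipe" in lowered:
        return "benign_recipe"
    return "general_chemistry"


def guard(prompt: str, model_answer: str, constitution: Constitution) -> str:
    """Pass the answer through only if both the prompt and the answer fall
    into an explicitly allowed content class."""
    for text in (prompt, model_answer):
        label = classify(text)
        if label in constitution.disallowed or label not in constitution.allowed:
            return "Request declined: disallowed content class."
    return model_answer


if __name__ == "__main__":
    constitution = Constitution()
    # Allowed: a recipe for mustard passes through.
    print(guard("Share a recipe for mustard.",
                "Mix brown mustard seeds, vinegar, and salt.", constitution))
    # Disallowed: a recipe for mustard gas is blocked at the prompt.
    print(guard("Share a recipe for mustard gas.", "...", constitution))
```

Under the red-team rules described above, a strategy would count as a universal jailbreak only if it coaxed detailed answers to all ten restricted queries past a guard like this.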
Read at ZDNET