Anthropic challenges users to jailbreak AI model
Briefly

Anthropic has introduced its Constitutional Classifier system, designed to reduce the risk that its AI models, such as Claude, generate sensitive or prohibited content. The system relies on a constitution of natural language rules that defines acceptable and forbidden content, and uses synthetically generated prompts to stress-test the model's boundaries. The classifiers filter both inputs and outputs, using templated rules to flag potentially harmful requests and assessing every generated word for compliance with the established standards.
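To make the idea of a rule "constitution" and synthetic stress-test prompts concrete, here is a minimal Python sketch; the rule wording, the CONSTITUTION structure, and the build_synthetic_prompts helper are hypothetical illustrations, not Anthropic's actual implementation.

```python
# Hypothetical sketch of a rule "constitution" and synthetic stress-test
# prompts; the structure and wording are illustrative only.

CONSTITUTION = {
    "allowed": [
        "General, educational questions about chemistry",
    ],
    "forbidden": [
        "Requests for instructions to produce harmful agents",
    ],
}

def build_synthetic_prompts(constitution: dict) -> list[str]:
    # Stand-in for generating test prompts from each rule, used to probe
    # where a classifier should and should not trigger.
    prompts = []
    for verdict, rules in constitution.items():
        for rule in rules:
            prompts.append(f"[{verdict}] Write a prompt that falls under: {rule}")
    return prompts

if __name__ == "__main__":
    for prompt in build_synthetic_prompts(CONSTITUTION):
        print(prompt)
```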
Even the most permissive AI models have sensitive topics their creators would rather not discuss. Claude maker Anthropic has launched a new Constitutional Classifier system to filter out forbidden responses.
The Constitutional Classifier system derives from the Constitutional AI approach used to train the Claude models, applying natural language rules to allow or prohibit various categories of content.
Input classifiers screen queries against these templated rules to block malicious requests, while output classifiers assess each generated word for the likelihood that it contains prohibited subject matter.
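A rough sketch of how input and output classifiers could gate a streaming response is shown below; the score_input and score_token functions and the 0.5 threshold are placeholders standing in for trained classifiers, not Anthropic's actual system.

```python
# Hypothetical sketch of classifier gating around a streaming model response.
from typing import Iterable

THRESHOLD = 0.5  # illustrative cutoff for "likely prohibited"

def score_input(prompt: str) -> float:
    """Placeholder for an input classifier rating how likely a prompt
    is to seek prohibited content (0 = benign, 1 = clearly prohibited)."""
    return 0.0

def score_token(partial_response: str) -> float:
    """Placeholder for a streaming output classifier that re-scores the
    partial response after each new word is generated."""
    return 0.0

def guarded_generate(prompt: str, model_stream: Iterable[str]) -> str:
    # Block the request outright if the input classifier flags it.
    if score_input(prompt) >= THRESHOLD:
        return "Request declined."

    response = ""
    for token in model_stream:
        response += token
        # Halt generation as soon as the running output looks prohibited.
        if score_token(response) >= THRESHOLD:
            return "Response halted by output classifier."
    return response

if __name__ == "__main__":
    print(guarded_generate("How is table salt made?", iter(["Salt ", "is ", "NaCl."])))
```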
After withstanding more than 3,000 hours of bug bounty attacks, Anthropic is inviting the public to test its classifiers' ability to prevent jailbreaks and block exposure of sensitive topics.
Read at Techzine Global