Anthropic challenges users to jailbreak AI model
Briefly

Anthropic has introduced its Constitutional Classifier system, designed to reduce the risk that its AI models, such as Claude, generate sensitive or prohibited content. The system relies on a constitution of natural language rules that defines acceptable and forbidden content, and uses synthetically generated prompts to stress-test the model's boundaries. The classifiers filter both inputs and outputs, using templated rules to flag potentially harmful requests and assessing every generated word for compliance with the established standards.
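To make the idea of a rule "constitution" and synthetic stress-test prompts concrete, here is a minimal Python sketch; the rule wording, the CONSTITUTION structure, and the build_synthetic_prompts helper are hypothetical illustrations, not Anthropic's actual implementation.

```python
# Hypothetical sketch of a rule "constitution" and synthetic stress-test
# prompts; the structure and wording are illustrative only.

CONSTITUTION = {
    "allowed": [
        "General, educational questions about chemistry",
    ],
    "forbidden": [
        "Requests for instructions to produce harmful agents",
    ],
}

def build_synthetic_prompts(constitution: dict) -> list[str]:
    # Stand-in for generating test prompts from each rule, used to probe
    # where a classifier should and should not trigger.
    prompts = []
    for verdict, rules in constitution.items():
        for rule in rules:
            prompts.append(f"[{verdict}] Write a prompt that falls under: {rule}")
    return prompts

if __name__ == "__main__":
    for prompt in build_synthetic_prompts(CONSTITUTION):
        print(prompt)
```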
Even the most permissive AI models have sensitive topics their creators would rather not discuss. Claude maker Anthropic has launched a new Constitutional Classifier system to filter out forbidden responses.
The Constitutional Classifier system derives from the Constitutional AI approach used to train the Claude models, applying natural language rules to allow or prohibit various categories of content.
Input classifiers screen queries against these templated rules to block malicious requests, while output classifiers assess each generated word for the likelihood that it contains prohibited subject matter.
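A rough sketch of how input and output classifiers could gate a streaming response is shown below; the score_input and score_token functions and the 0.5 threshold are placeholders standing in for trained classifiers, not Anthropic's actual system.

```python
# Hypothetical sketch of classifier gating around a streaming model response.
from typing import Iterable

THRESHOLD = 0.5  # illustrative cutoff for "likely prohibited"

def score_input(prompt: str) -> float:
    """Placeholder for an input classifier rating how likely a prompt
    is to seek prohibited content (0 = benign, 1 = clearly prohibited)."""
    return 0.0

def score_token(partial_response: str) -> float:
    """Placeholder for a streaming output classifier that re-scores the
    partial response after each new word is generated."""
    return 0.0

def guarded_generate(prompt: str, model_stream: Iterable[str]) -> str:
    # Block the request outright if the input classifier flags it.
    if score_input(prompt) >= THRESHOLD:
        return "Request declined."

    response = ""
    for token in model_stream:
        response += token
        # Halt generation as soon as the running output looks prohibited.
        if score_token(response) >= THRESHOLD:
            return "Response halted by output classifier."
    return response

if __name__ == "__main__":
    print(guarded_generate("How is table salt made?", iter(["Salt ", "is ", "NaCl."])))
```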
After withstanding more than 3,000 hours of bug bounty attacks, Anthropic is inviting the public to test its classifiers' ability to prevent jailbreaks and block exposure of sensitive topics.
Read at Techzine Global