Anthropic dares you to jailbreak its new AI model
Briefly

Anthropic has developed a Constitutional Classifier to prevent harmful content from being generated in response to prompts. The system identifies and blocks 95% of jailbreaking attempts. A bug bounty program attracted 183 experts, but only a few were able to circumvent the safeguards. While the classifier effectively mitigates risk, it also adds a computational overhead of 23.7%, raising costs and energy use. Anthropic acknowledges the system isn't foolproof but says that evading it requires significantly more effort.
For example, harmful requests can be disguised as innocuous ones, such as by burying them in a wall of harmless-looking content.
Despite those successes, Anthropic warns that the Constitutional Classifier system comes with a significant computational overhead of 23.7 percent.
Read at Ars Technica