Anthropic dares you to jailbreak its new AI model
Briefly

Anthropic has developed a Constitutional Classifier to prevent harmful content from being generated in response to prompts. The system identifies and blocks 95% of jailbreaking attempts. A bug bounty program attracted 183 experts, but only a few were able to circumvent the safeguards. While the classifier effectively mitigates risk, it also adds a computational overhead of 23.7%, raising costs and energy use. Anthropic acknowledges the system isn't foolproof but says that evading it requires significantly more effort.
For example, harmful requests can be disguised as innocuous ones, such as by burying them in a wall of harmless-looking content.
Despite those successes, Anthropic warns that the Constitutional Classifier system comes with a significant computational overhead of 23.7 percent.
Read at Ars Technica