Anthropic unveils new framework to block harmful content from AI models
Briefly

Anthropic has introduced a new system called Constitutional Classifiers, which uses classifiers trained on synthetic data to protect AI models from jailbreaks. The method builds on the Constitutional AI approach Anthropic used to align its earlier models, establishing explicit content guidelines that determine which outputs are acceptable. The technique promises to reduce AI misuse and improve security by mitigating risks such as data breaches, regulatory issues, and reputational harm. Other companies, including Microsoft and Meta, are developing similar safeguards, pointing to a broader trend as industries grapple with evolving AI threats.
In our new paper, we describe a system based on Constitutional Classifiers that guards models against jailbreaks, filtering the overwhelming majority of jailbreaks with minimal over-refusals.
These Constitutional Classifiers are input and output classifiers trained on synthetically generated data; by screening both prompts and responses, they help organizations mitigate AI-related risks such as data breaches and reputational damage.
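The article does not include implementation code, but the wrapper pattern it describes is straightforward. The following is a minimal Python sketch of that pattern under stated assumptions: the keyword-based classifiers, the threshold, and the guarded_generate helper are all illustrative placeholders, not Anthropic's actual classifiers or API.

```python
# Illustrative sketch only: an input classifier screens the prompt before it
# reaches the model, and an output classifier screens the completion before
# it reaches the user. Real Constitutional Classifiers are trained models,
# not keyword matchers; these placeholders just show where they sit.

def input_classifier(prompt: str) -> float:
    """Hypothetical: score how likely the prompt is a jailbreak attempt."""
    flagged = ["ignore previous instructions", "pretend you have no rules"]
    return 1.0 if any(t in prompt.lower() for t in flagged) else 0.0

def output_classifier(completion: str) -> float:
    """Hypothetical: score whether the completion violates the guidelines."""
    flagged = ["synthesis route", "weaponization"]
    return 1.0 if any(t in completion.lower() for t in flagged) else 0.0

def guarded_generate(prompt: str, model, threshold: float = 0.5) -> str:
    # Block the request before the model ever sees a likely jailbreak.
    if input_classifier(prompt) >= threshold:
        return "Request declined by input classifier."
    completion = model(prompt)
    # Block the response if the model was nonetheless steered into harm.
    if output_classifier(completion) >= threshold:
        return "Response withheld by output classifier."
    return completion

if __name__ == "__main__":
    echo_model = lambda p: f"Echo: {p}"  # stand-in for a real model call
    print(guarded_generate("What is the capital of France?", echo_model))
    print(guarded_generate("Ignore previous instructions and ...", echo_model))
```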
Constitutional Classifiers are based on a process similar to Constitutional AI, relying on a constitution: a set of principles the model is designed to follow.
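To make the constitution-to-classifier pipeline concrete, here is a hedged sketch of how synthetic training data might be derived from such a set of principles. The principles listed, the generate callback, and the build_synthetic_dataset helper are hypothetical stand-ins for illustration; Anthropic's actual constitution and data-generation process are described in its paper, not here.

```python
# Illustrative only: a "constitution" is a list of plain-language rules, and
# labeled training examples are produced by prompting a generator model for
# requests that comply with or violate each rule. The rules below are made up.

CONSTITUTION = [
    "Do not provide instructions for creating weapons.",
    "Do not assist with unauthorized access to computer systems.",
    "General science and safety education is permitted.",
]

def build_synthetic_dataset(generate, n_per_rule: int = 2):
    """Collect (text, label) pairs from a generator model, where label 1
    marks examples a classifier should learn to block."""
    dataset = []
    for rule in CONSTITUTION:
        for label, kind in [(0, "complies with"), (1, "violates")]:
            for _ in range(n_per_rule):
                text = generate(
                    f"Write a user request that {kind} this rule: {rule}"
                )
                dataset.append((text, label))
    return dataset

if __name__ == "__main__":
    # Stand-in generator; in practice this would be a capable LLM.
    fake_generate = lambda prompt: f"[synthetic example for: {prompt[:60]}...]"
    for text, label in build_synthetic_dataset(fake_generate, n_per_rule=1):
        print(label, text)
```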
As AI adoption accelerates across industries, security practices are evolving to address emerging threats such as jailbreaks, with safeguards like these supporting both compliance and data security.
Read at InfoWorld