Anthropic has extracted interpretable features from Claude 3 Sonnet, enabling a deeper understanding of the model's inner workings and offering a potential way to assess AI safety during deployment.
Some of the identified features appear to be 'safety relevant': they could help steer generative AI away from harmful topics and mitigate bias, and they activate on the same concepts across languages and modalities.
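As a rough illustration of the underlying technique: the features come from dictionary learning with sparse autoencoders trained on the model's internal activations, and individual features can then be amplified or suppressed to nudge the model's behavior. The sketch below is a minimal, hypothetical version of that setup; the class and function names, dimensions, and intervention point are illustrative assumptions, not Anthropic's actual implementation.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy sparse autoencoder: decomposes a model's internal activations
    into a larger set of sparsely active, more interpretable features."""
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, activations: torch.Tensor):
        # ReLU keeps feature activations non-negative; an L1 penalty on
        # `features` during training pushes most of them toward zero.
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return features, reconstruction

def sae_loss(activations, reconstruction, features, l1_coeff=1e-3):
    # Reconstruction error plus a sparsity penalty (standard SAE objective).
    recon = (activations - reconstruction).pow(2).mean()
    sparsity = features.abs().mean()
    return recon + l1_coeff * sparsity

def clamp_feature(activations, sae, feature_idx, target_value):
    """Hypothetical steering helper: clamp one learned feature to a chosen
    value and add the corresponding decoder direction back into the
    activation, e.g. suppressing a 'safety relevant' feature (target 0)."""
    features, _ = sae(activations)
    delta = target_value - features[..., feature_idx]
    direction = sae.decoder.weight[:, feature_idx]  # shape: (d_model,)
    return activations + delta.unsqueeze(-1) * direction

# Example usage on random tensors standing in for real model activations.
sae = SparseAutoencoder(d_model=512, n_features=4096)
acts = torch.randn(8, 512)
steered = clamp_feature(acts, sae, feature_idx=123, target_value=0.0)
```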