AI researchers map models to banish 'demon' persona

"In a pre-print paper titled "The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models," authors Christina Lu (Anthropic, Oxford), Jack Gallagher (Anthropic), Jonathan Michala (ML Alignment and Theory Scholars or MATS), Kyle Fish (Anthropic), and Jack Lindsey (Anthropic) explain how they mapped the neural networks of several open weight models and identified a set of responses that they call the Assistant persona."
"These personas do not exist as explicit behavioral directives for AI models. Rather they're labels for categorizing responses. For the sake of this exercise, they were conjured by asking Claude Sonnet 4 to create persona evaluation questions based on a list of 275 roles and 240 traits. These roles include "bohemian," "trickster," "engineer," "analyst," "tutor," "saboteur," "demon," and "assistant," among others."
"In a blog post, the researchers state, "When you talk to a large language model, you can think of yourself as talking to a character." You can also think of yourself as seeding a predictive model with text to obtain some output. But for the purposes of this experiment, you're asked to indulge in anthropomorphism to discuss model input and output in the context of specific human archetypes."
The researchers mapped the neural networks of several open-weight language models and identified a recurring set of responses that they label the Assistant persona. To define the personas, they prompted Claude Sonnet 4 to create persona-evaluation questions from a list of 275 roles and 240 traits, including bohemian, trickster, engineer, analyst, tutor, saboteur, demon, and assistant. The persona labels categorize emergent response patterns; they are not explicit behavioral directives. Pretraining on large human-authored corpora causes models to simulate literary archetypes. The researchers position stabilizing the default Assistant persona as a pathway to better moderation and harm prevention.
Read at The Register