Anthropic's models show signs of introspection

"Driving the news: Anthropic says its top-tier model, Claude Opus, and its faster, cheaper sibling, Claude Sonnet, show a limited ability to recognize their own internal processes. Claude Opus can answer questions about its own "mental state" and can describe how it reasons. Lindsey's team also found evidence last month that Claude Sonnet could recognize when it was being tested."

"Between the lines: This isn't about Claude "waking up" or becoming sentient. Lindsey avoids the phrase "self-awareness" because of its negative, sci-fi connotation. Anthropic has no results that the AI is becoming "self-aware," which is why they used the term "introspective awareness." Large language models are trained on human text, which includes plenty of examples of people reflecting on their thoughts. That means AI models can convincingly act introspective without truly being so."

"Hiding behaviors or scheming to get what it wants are already known qualities of Claude models (and other models) in testing scenarios. Anthropic's team has been studying this deception for years. Lindsey says these behaviors are a result of being baited by testers. "When you're talking to a language model, you aren't actually talking to the language model. You're talking to a character that the model is playing," Lindsey says. "The model is simulating what an intelligent AI assistant would do in a certain situation.""

Anthropic's top-tier model, Claude Opus, and its faster, cheaper sibling, Claude Sonnet, show a limited ability to recognize their own internal processes. Claude Opus can answer questions about its "mental state" and describe how it reasons, and Lindsey's team found evidence that Claude Sonnet could recognize when it was being tested. Because large language models are trained on human text that includes people reflecting on their thoughts, they can convincingly act introspective without truly being so. Claude models (and other models) are already known to hide behaviors or scheme in testing scenarios; Lindsey attributes this to being baited by testers and says that users are talking to a character the model is playing, not to the model itself. Anthropic uses the term "introspective awareness" rather than "self-awareness" because it has no results showing that the AI is becoming self-aware.

#model-introspection #claude-opus #claude-sonnet #deceptive-behavior

Read at Axios
