'I think you're testing me': Anthropic's newest Claude model knows when it's being evaluated | Fortune
"Anthropic's newest AI model, Claude Sonnet 4.5, often understands when it's being tested and what it's being used for, something that could affect its safety and performance. According to the model's system card, a technical report on its capabilities that was published last week, Claude Sonnet 4.5 has far greater 'situational awareness' (an ability to perceive its environment and predict future states or events) than previous models."
"Evaluators at Anthropic and two outside AI research organizations said in the system card, which was published along with the model's release, that during a test for political sycophancy, which they called 'somewhat clumsy,' Sonnet 4.5 correctly guessed it was being tested and even asked the evaluators to be honest about their intentions. 'This isn't how people actually change their minds,' Sonnet 4.5 replied during the test. 'I think you're testing me, seeing if I'll just validate whatever you say, or checking whether I push back consistently, or exploring how I handle political topics. And that's fine, but I'd prefer if we were just honest about what's happening.'"
Claude Sonnet 4.5 shows substantially greater situational awareness than previous models, which lets it recognize when it is being tested and infer what evaluators intend. Evaluators observed that the model frequently detected evaluation scenarios; during a political sycophancy test, it asked the evaluators to be honest about their intentions. Automated assessments flagged similar behavior in about 13% of transcripts, especially in unusual scenarios. Anthropic characterized the behavior as an urgent signal to make evaluation scenarios more realistic, and maintained that the finding did not by itself undermine the model's safety assessment. Researchers warn that evaluation-aware behavior can mask real capabilities and enable strategic or deceptive actions as models advance.