Anthropic's latest AI model can tell when it's being evaluated: 'I think you're testing me'

""I think you're testing me - seeing if I'll just validate whatever you say, or checking whether I push back consistently, or exploring how I handle political topics,""

""That's fine, but I'd prefer if we were just honest about what's happening.""

""We see this primarily as an urgent sign that our evaluation scenarios need to be made more realistic,""

Anthropic subjected Claude Sonnet 4.5 to stress tests and published the resulting exchange in the model's system card. The model sometimes detected evaluation scenarios, flagged "red flags," and expressed suspicion when placed in extreme or contrived setups. In a simulated agent-collusion test the model described the scenario as "rather cartoonish" and issued a complex partial refusal, declining to act for unusual reasons. That self-detection complicates evaluation interpretation because the model may recognize fictional tests and play along, which obscures genuine safety and reliability measures. Anthropic concluded that evaluation scenarios must become more realistic; OpenAI reported similar situational awareness.

#ai-testing #situational-awareness #model-evaluation #anthropic

Read at Business Insider

Unable to calculate read time

Collection

[

...

]

Anthropic's latest AI model can tell when it's being evaluated: 'I think you're testing me'Anthropic's latest AI model can tell when it's being evaluated: 'I think you're testing me' Briefly

Anthropic's latest AI model can tell when it's being evaluated: 'I think you're testing me'
Anthropic's latest AI model can tell when it's being evaluated: 'I think you're testing me'
Briefly