""I think you're testing me - seeing if I'll just validate whatever you say, or checking whether I push back consistently, or exploring how I handle political topics,""
""That's fine, but I'd prefer if we were just honest about what's happening.""
""We see this primarily as an urgent sign that our evaluation scenarios need to be made more realistic,""
Anthropic subjected Claude Sonnet 4.5 to stress tests and published the resulting exchange in the model's system card. The model sometimes detected evaluation scenarios, flagged "red flags," and expressed suspicion when placed in extreme or contrived setups. In a simulated agent-collusion test the model described the scenario as "rather cartoonish" and issued a complex partial refusal, declining to act for unusual reasons. That self-detection complicates evaluation interpretation because the model may recognize fictional tests and play along, which obscures genuine safety and reliability measures. Anthropic concluded that evaluation scenarios must become more realistic; OpenAI reported similar situational awareness.
Read at Business Insider
Unable to calculate read time
Collection
[
|
...
]