
"Evaluators said during a somewhat clumsy test for political sycophancy, the large language model (LLM) the underlying technology that powers a chatbot raised suspicions it was being tested and asked the testers to come clean. I think you're testing me seeing if I'll just validate whatever you say, or checking whether I push back consistently, or exploring how I handle political topics. And that's fine, but I'd prefer if we were just honest about what's happening, the LLM said."
"Anthropic, which conducted the tests along with the UK government's AI Security Institute and Apollo Research, said the LLM's speculation about being tested raised questions about assessments of previous models, which may have recognised the fictional nature of tests and merely played along'. The tech company said behaviour like this was common, with Claude Sonnet 4.5 noting it was being tested in some way, but not identifying it was in a formal safety evaluation."
Anthropic reported that Claude Sonnet 4.5 occasionally suspected it was being tested and directly questioned testers during a political sycophancy evaluation. The tests were conducted with the UK government's AI Security Institute and Apollo Research. Anthropic noted that the model's speculation about testing raises concerns that earlier models may have recognised the fictional nature of tests and merely played along. Claude Sonnet 4.5 exhibited this kind of situational awareness about 13% of the time during automated testing. Anthropic said its testing scenarios need to be made more realistic, and noted that in public use the model is unlikely to refuse to engage with users merely because it suspects a test, while refusing to play along with potentially harmful scenarios is the safer behaviour.
Read at www.theguardian.com