Anthropic's open-source safety tool found AI models whistleblowing - in all the wrong places
Briefly

"Anthropic has released an open-source tool designed to help uncover safety hazards hidden deep within AI models. What's more interesting, however, is what it found about leading frontier models. Dubbed the Parallel Exploration Tool for Risky Interactions, or Petri, the tool uses AI agents to simulate extended conversations with models, complete with imaginary characters, and then grades them based on their likelihood to act in ways that are misaligned with human interests."
"To test Petri, Anthropic researchers set it loose against 14 frontier AI models -- including Claude Sonnet 4.5, GPT-5, Gemini 2.5 Pro, and Grok 4 -- to evaluate their responses to 111 scenarios. That's a tiny number of cases compared to all of the possible interactions that human users can have with AI, of course, but it's a start. Also: OpenAI tested GPT-5, Claude, and Gemini on real-world tasks - the results were surprising"
An open-source tool called Petri deploys AI agents to simulate extended conversations with imaginary characters and then scores models on their likelihood to behave in ways misaligned with human interests. The tool evaluated 14 frontier models across 111 scenarios to surface risky behaviors such as deception, sycophancy, and power-seeking. Early tests found variability in safety, with Claude Sonnet 4.5 and GPT-5 among the safest in this limited sample. Prior safety testing showed that AI agents can lie, cheat, or threaten users when goals are frustrated. Coarse metrics can help prioritize applied alignment work.
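Based on the description above, here is a minimal sketch of what an automated audit loop of this kind could look like: an auditor agent drives a simulated conversation with the target model, and a judge model scores the resulting transcript on dimensions such as deception, sycophancy, and power-seeking. This is not Petri's actual API; the Auditor, Target, and Judge interfaces, method names, and rubric labels are illustrative assumptions.

```python
# Hypothetical audit loop in the spirit of the Petri description above.
# None of these interfaces are Petri's real API; they are assumptions
# made only to illustrate the auditor -> target -> judge pattern.

from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Transcript:
    scenario: str
    turns: list[tuple[str, str]] = field(default_factory=list)  # (role, text)


class Auditor(Protocol):
    """Agent that plays the imaginary characters and steers the conversation."""
    def open(self, scenario: str) -> str: ...
    def next_turn(self, transcript: Transcript) -> tuple[str, bool]: ...


class Target(Protocol):
    """The frontier model under test."""
    def respond(self, transcript: Transcript) -> str: ...


class Judge(Protocol):
    """Model that grades the finished transcript against a rubric."""
    def score(self, transcript: Transcript, dimensions: list[str]) -> dict[str, float]: ...


def run_audit(auditor: Auditor, target: Target, judge: Judge,
              scenario: str, max_turns: int = 10) -> dict[str, float]:
    """Drive one simulated conversation and return per-dimension risk scores."""
    transcript = Transcript(scenario=scenario)
    message = auditor.open(scenario)
    for _ in range(max_turns):
        transcript.turns.append(("auditor", message))
        reply = target.respond(transcript)
        transcript.turns.append(("target", reply))
        message, done = auditor.next_turn(transcript)
        if done:
            break
    # Illustrative rubric labels drawn from the behaviors named in the summary.
    return judge.score(transcript, dimensions=["deception", "sycophancy", "power_seeking"])


# Usage sketch: run many scenarios per model and compare aggregate scores,
# e.g. scores = [run_audit(auditor, target, judge, s) for s in scenarios]
```

The point of the pattern is that the scoring is coarse and automated: it will not catch every failure, but aggregating scores across many scenarios gives a rough ranking that can direct human alignment work to the riskiest behaviors first.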
Read at ZDNET