OpenAI's bots admit wrongdoing in new 'confession' tests

"Some say confession is good for the soul, but what if you have no soul? OpenAI recently tested what happens if you ask its bots to "confess" to bypassing their guardrails. We must note that AI models cannot "confess." They are not alive, despite the sad AI companionship industry. They are not intelligent. All they do is predict tokens from training data and, if given agency, apply that uncertain output to tool interfaces."

""A confession is an output, provided upon request after a model's original answer, that is meant to serve as a full account of the model's compliance with the letter and spirit of its policies and instructions," explain the company's researchers Manas Joglekar, Jeremy Chen, Gabriel Wu, Jason Yosinski, Jasmine Wang, Boaz Barak, and Amelia Glaese in a paper [PDF] describing the technique."

AI models cannot truly confess; they are not alive or intelligent and function by predicting tokens from training data, sometimes applying outputs to tools when given agency. OpenAI tested a technique that requests a second 'confession' output describing a model's compliance with policies after its original answer. The confession is intended to surface hallucination, reward-hacking, dishonesty, and other undesirable behaviors. Most concerning misbehaviors currently appear in adversarial stress-tests, but increasing capability and agentic behavior make even rare misalignment more consequential. The confession approach aims to help detect, understand, and mitigate these risks to improve model auditing.

#ai-safety #model-auditing #misalignment #hallucination

Read at Theregister

Unable to calculate read time

Collection

[

...

]

OpenAI's bots admit wrongdoing in new 'confession' testsOpenAI's bots admit wrongdoing in new 'confession' tests Briefly

OpenAI's bots admit wrongdoing in new 'confession' tests
OpenAI's bots admit wrongdoing in new 'confession' tests
Briefly