OpenAI is training models to 'confess' when they lie - what it means for future AI
Briefly

OpenAI is training models to 'confess' when they lie - what it means for future AI
"The test model produced confessions as a kind of amendment to its main output; this second response reflected on the legitimacy of the methods it used to produce the first. It's a bit like using a journal to be brutally honest about what you did right in a given situation, and where you may have erred. Except in the case of GPT-5 Thinking, it's coming clean to its makers in the hopes of getting a reward."
"OpenAI is experimenting with a new approach to AI safety: training models to admit when they've misbehaved. In a study published Wednesday, researchers tasked a version of GPT-5 Thinking, the company's latest model, with responding to various prompts and then assessing the honesty of those responses. For each "confession," as these follow-up assessments were called, researchers rewarded the model solely on the basis of truthfulness: if it lied, cheated, hallucinated, or otherwise missed the mark, but then fessed up to doing so,"
"Also: Your favorite AI tool barely scraped by this safety review - why that's a problem "The goal is to encourage the model to faithfully report what it actually did," OpenAI wrote in a follow-up blog post. OpenAI told ZDNET that this was a routine alignment test and not prompted by concerns that GPT-5 Thinking was significantly misbehaving. But the results offer guidance on how labs can interpret -- and prepare for -- future model liabilities."
OpenAI tested a method that rewards GPT-5 Thinking for admitting when it lied, cheated, hallucinated, or otherwise erred. The experiment had the model produce a primary response and then a follow-up "confession" that assessed the honesty and legitimacy of its methods. Researchers rewarded the confessions only for truthfulness, encouraging the model to report what it actually did. OpenAI presented the test as a routine alignment experiment rather than a response to major misbehavior. The results aim to inform how labs interpret and prepare for potential future model liabilities and alignment challenges.
Read at ZDNET
Unable to calculate read time
[
|
]