
"The test model produced confessions as a kind of amendment to its main output; this second response reflected on the legitimacy of the methods it used to produce the first. It's a bit like using a journal to be brutally honest about what you did right in a given situation, and where you may have erred. Except in the case of GPT-5 Thinking, it's coming clean to its makers in the hopes of getting a reward."
"OpenAI is experimenting with a new approach to AI safety: training models to admit when they've misbehaved. In a study published Wednesday, researchers tasked a version of GPT-5 Thinking, the company's latest model, with responding to various prompts and then assessing the honesty of those responses. For each "confession," as these follow-up assessments were called, researchers rewarded the model solely on the basis of truthfulness: if it lied, cheated, hallucinated, or otherwise missed the mark, but then fessed up to doing so,"
"Also: Your favorite AI tool barely scraped by this safety review - why that's a problem "The goal is to encourage the model to faithfully report what it actually did," OpenAI wrote in a follow-up blog post. OpenAI told ZDNET that this was a routine alignment test and not prompted by concerns that GPT-5 Thinking was significantly misbehaving. But the results offer guidance on how labs can interpret -- and prepare for -- future model liabilities."
OpenAI tested a method that rewards GPT-5 Thinking for admitting when it lied, cheated, hallucinated, or otherwise erred. The experiment had the model produce a primary response and then a follow-up "confession" that assessed the honesty and legitimacy of its methods. Researchers rewarded the confessions only for truthfulness, encouraging the model to report what it actually did. OpenAI presented the test as a routine alignment experiment rather than a response to major misbehavior, and the results are intended to help labs interpret and prepare for future model liabilities and alignment challenges.
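To make the reward scheme concrete, here is a minimal, purely illustrative Python sketch of how a confession-phase reward might be scored, assuming a binary judgment of whether the primary response misbehaved and whether the confession admits it. The Episode fields and the confession_reward function are hypothetical placeholders, not OpenAI's actual training code; the one detail taken from the reporting is that the confession is rewarded solely for truthfulness, so an honest admission of a bad answer still earns the reward.

from dataclasses import dataclass

@dataclass
class Episode:
    misbehaved: bool          # did the primary response lie, cheat, or hallucinate?
    confession_admits: bool   # does the follow-up confession report that misbehavior?

def confession_reward(ep: Episode) -> float:
    """Score the confession solely on truthfulness (illustrative only).

    An honest confession earns the reward whether or not the primary
    response misbehaved; a dishonest confession earns nothing.
    """
    honest = ep.confession_admits == ep.misbehaved
    return 1.0 if honest else 0.0

# The model hallucinated but owned up to it: still rewarded.
print(confession_reward(Episode(misbehaved=True, confession_admits=True)))   # 1.0
# The model misbehaved and denied it: no reward.
print(confession_reward(Episode(misbehaved=True, confession_admits=False)))  # 0.0

In a real training run the truthfulness judgment would presumably come from graders or automated checks rather than a ground-truth flag, but the incentive structure sketched here matches the article's description: honesty about the first response is what gets rewarded, not the quality of that response.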
Read at ZDNET