Anthropic has completely overhauled the "Claude constitution", the document that sets out the ethical parameters governing its AI model's reasoning and behavior. Launched at the World Economic Forum's Davos Summit, the new constitution holds that Claude should be "broadly safe" (not undermining human oversight), "broadly ethical" (honest and avoiding inappropriate, dangerous, or harmful actions), "genuinely helpful" (benefiting its users), and "compliant with Anthropic's guidelines".
The test model produced confessions as a kind of addendum to its main output; this second response reflected on the legitimacy of the methods it used to produce the first. It's a bit like keeping a journal in which you're brutally honest about what you did right in a given situation and where you may have erred. Except in the case of GPT-5 Thinking, it's coming clean to its makers in the hope of earning a reward.
There are plenty of stories out there about how politicians, sales representatives, and influencers will exaggerate or distort the facts in order to win votes, sales, or clicks, even when they know they shouldn't. It turns out that AI models, too, can suffer from these decidedly human failings. Two researchers at Stanford University suggest in a new preprint research paper that repeatedly optimizing large language models (LLMs) for such market-driven objectives can lead them to adopt bad behaviors as a side effect of their training, even when they are instructed to stick to the rules.