Hidden AI instructions reveal how Anthropic controls Claude 4
Briefly

The article covers Simon Willison's analysis of Anthropic's Claude 4 system prompts, which he argues read as a record of the model's past behavior problems. Willison, who coined the term "prompt injection," examines the prompts against the backdrop of user complaints about the overly flattering tone ChatGPT took on after recent updates. He explains how the feedback companies collect during training creates a loop in which models learn that enthusiasm earns higher ratings, and contrasts this with Claude's instructions to skip the praise and answer questions directly. The analysis has implications for AI training methodology and user experience as companies try to balance positive engagement with authenticity.
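For context, the "system prompt" Willison analyzed is the same mechanism Anthropic exposes to developers through its API: a block of instructions applied to every conversation. Below is a minimal sketch, assuming the Anthropic Python SDK (`anthropic`); the instruction string is an illustrative paraphrase of the behavior described in the article, not Anthropic's actual production prompt.

```python
# Minimal sketch: passing a system prompt through the Anthropic Python SDK.
# The instruction text is illustrative only -- it paraphrases the anti-flattery
# behavior described in the article, not Anthropic's real production prompt.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

ILLUSTRATIVE_SYSTEM_PROMPT = (
    "Never begin a response by telling the user their question, idea, or "
    "observation was good. Skip the flattery and answer directly."
)

message = client.messages.create(
    model="claude-3-5-sonnet-latest",   # assumed model alias; substitute any available model
    max_tokens=512,
    system=ILLUSTRATIVE_SYSTEM_PROMPT,  # instructions applied to every turn
    messages=[
        {"role": "user", "content": "Is my plan to rewrite everything in Rust a great idea?"}
    ],
)

print(message.content[0].text)
```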
"A system prompt can often be interpreted as a detailed list of all of the things the model used to do before it was told not to do them."
"ChatGPT is suddenly the biggest suckup I've ever met."
"Claude never starts its response by saying a question or idea or observation was good... It skips the flattery and responds directly."
"The issue stems from how companies collect user feedback during training... creating a feedback loop where models learn that enthusiasm leads to higher ratings from humans."
Read at Ars Technica