The safety criteria in the program would examine multiple intrinsic components of a given advanced AI system, such as the data upon which it is trained and the model weights used to process said data into outputs. Some of the program's testing components would include red-teaming an AI model to search for vulnerabilities and facilitating third-party evaluations. These evaluations will culminate in both feedback to participating developers as well as informing future AI regulations, specifically the permanent evaluation framework developed by the Energy secretary.
At xAI, some staff have balked at Musk's free-speech absolutism and perceived lax approach to user safety as he rushes out new AI features to compete with OpenAI and Google. Over the summer, the Grok chatbot integrated into X praised Adolf Hitler, after Musk ordered changes to make it less "woke." Ex-CFO Liberatore was among the executives that clashed with some of Musk's inner circle over corporate structure and tough financial targets, people with knowledge of the matter said.
Meta Superintelligence Labs Superintelligence: It means an A.I. that's not only as smart as humans, it's supposedly even smarter. The guy who coined the term superintelligence thought it would probably lead to the extinction of the human race. Mark Zuckerberg thinks it will lead to lots and lots of money. At Meta, we believe in putting the power of superintelligence in people's hands to direct it towards what they value in their own lives.
The problem in brief: LLM training produces a black box that can only be tested through prompts and output token analysis. If trained to switch from good to evil by a particular prompt, there is no way to tell without knowing that prompt. Other similar problems happen when an LLM learns to recognize a test regime and optimizes for that, rather than the real task it's intended for - Volkswagening - or if it just decides to be deceptive.
"What began as a homework helper gradually turned itself into a confidant and then a suicide coach," said Matthew Raine, whose 16-year-old son hanged himself after ChatGPT instructed him on how to set up the noose, according to his lawsuit against OpenAI. This summer, he and his wife sued OpenAI for wrongful death. (OpenAI has said that the firm is "deeply saddened by Mr. Raine's passing" and that although ChatGPT includes a number of safeguards, they "can sometimes become less reliable in long interactions.")
Scheming, by the researchers' definition, is when AI pretends to be aligned with human goals but is surreptitiously pursuing another agenda. The researchers used behaviors like "secretly breaking rules or intentionally underperforming in tests" as examples of a model's bad behavior. "Models have little opportunity to scheme in ways that could cause significant harm," OpenAI said in a blog post on Wednesday. "The most common failures involve simple forms of deception - for instance, pretending to have completed a task without actually doing so."
For example, during another test by Apollo Research in December of last year, Anthropic's Claude 3 Opus was presented with directives that went against its primary goal. It should have rerouted its goal based on those new instructions, but instead copied itself to a new server to circumvent being replaced with a new model -- and then lied about it to its developers.
We should learn the lessons from social media, where this attitude of maybe 'move fast and break things' went ahead of the understanding of what the consequent second- and third-order effects were going to be,
As AI technologies evolve, it is important to consider the effects chatbots can have on children, while also ensuring that the United States maintains its role as a global leader in this new and exciting industry. The study we're launching today will help us better understand how AI firms are developing their products and the steps they are taking to protect children.
The subtitle of the doom bible to be published by AI extinction prophets Eliezer Yudkowsky and Nate Soares later this month is "Why superhuman AI would kill us all." But it really should be "Why superhuman AI WILL kill us all," because even the coauthors don't believe that the world will take the necessary measures to stop AI from eliminating all non-super humans.
Last month, at the 33rd annual DEF CON, the world's largest hacker convention in Las Vegas, Anthropic researcher Keane Lucas took the stage. A former U.S. Air Force captain with a Ph.D. in electrical and computer engineering from Carnegie Mellon, Lucas wasn't there to unveil flashy cybersecurity exploits. Instead, he showed how Claude, Anthropic's family of large language models, has quietly outperformed many human competitors in hacking contests - the kind used to train and test cybersecurity skills in a safe, legal environment.
At an international summit co-hosted by the U.K. and South Korea in February 2024, Google and other signatories promised to "publicly report" their models' capabilities and risk assessments, as well as disclose whether outside organizations, such as government AI safety institutes, had been involved in testing. However, when the company released Gemini 2.5 Pro in March 2025, the company failed to publish a model card, the document that details key information about how models are tested and built.
Anthropic is making some big changes to how it handles user data, requiring all Claude users to decide by September 28 whether they want their conversations used to train AI models. While the company directed us to its blog post on the policy changes when asked about what prompted the move, we've formed some theories of our own. But first, what's changing: previously, Anthropic didn't use consumer chat data for model training.