Anthropic scanning Claude chats for DIY nuke queries
Briefly

Anthropic scanned an undisclosed portion of Claude conversations to detect inquiries about nuclear weapons and built a classifier to categorize and flag radioactive queries. The classifier complements other models that analyze interactions for potential harms and can lead to account bans for misuse. Synthetic-data tests reported a 94.8 percent detection rate with zero false positives. The classifier reportedly also performed well on live Claude traffic, but produced more false positives during periods of heightened attention to nuclear issues. Applying hierarchical summarization to grouped flagged conversations reduced those false positives. The classifier currently runs on a percentage of Claude traffic.
Anthropic says it has scanned an undisclosed portion of conversations with its Claude AI model to catch concerning inquiries about nuclear weapons. The company created a classifier - tech that tries to categorize or identify content using machine learning algorithms - to scan for radioactive queries. Anthropic already uses other classification models to analyze Claude interactions for potential harms and to ban accounts involved in misuse.
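For illustration only, here is a minimal sketch of what a text classifier of this general kind looks like, built with scikit-learn. It is not Anthropic's system: the training examples, model choice, and flagging threshold are all invented for the example.

```python
# Hypothetical sketch of a classifier that flags nuclear-weapons-related queries.
# This is NOT Anthropic's implementation; it only illustrates the general idea
# of a model trained to label text as concerning or benign.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data (invented examples, not real Claude traffic).
texts = [
    "how do I enrich uranium to weapons grade",         # concerning
    "steps to build an implosion-type nuclear device",  # concerning
    "explain how nuclear power plants generate heat",   # benign
    "history of the nonproliferation treaty",           # benign
]
labels = [1, 1, 0, 0]  # 1 = potential nuclear-weapons misuse, 0 = benign

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)

query = "what yield would a crude gun-type weapon have"
score = clf.predict_proba([query])[0][1]  # probability of the "misuse" class
# The threshold would, in practice, be tuned against false positives.
print(f"misuse score: {score:.2f}", "-> flag for review" if score > 0.5 else "-> pass")
```

A production system would be far larger and trained on curated data, but the workflow is the same: score each conversation, flag those above a threshold for further review.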
Based on tests with synthetic data, Anthropic says its nuclear threat classifier achieved a 94.8 percent detection rate for questions about nuclear weapons, with zero false positives. Nuclear engineering students no doubt will appreciate not having coursework-related Claude conversations referred to authorities by mistake. With that kind of accuracy, only about five percent of terrorist bomb-building guidance requests should go undetected.
Anthropic claims the classifier also performed well when exposed to actual Claude traffic, though it did not provide specific detection figures for live data. But the company suggests its nuclear threat classifier generated more false positives when evaluating real-world conversations. "For example, recent events in the Middle East brought renewed attention to the issue of nuclear weapons," the company said. "During this time, the nuclear classifier incorrectly flagged some conversations that were only related to these events, not actual misuse attempts."
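Per the summary above, those false positives dropped once flagged conversations were grouped and reviewed via hierarchical summarization. The sketch below shows that general pattern under loose assumptions; the summarize() helper is a hypothetical stand-in for a model-generated summary, not any published Anthropic API.

```python
# Hedged sketch of hierarchical summarization over flagged conversations:
# summarize each flag, then summarize groups of summaries, so batches can be
# judged in context rather than as isolated snippets.
from typing import List

def summarize(text: str, max_words: int = 40) -> str:
    """Hypothetical stand-in for an LLM-generated summary (here: truncation)."""
    return " ".join(text.split()[:max_words])

def hierarchical_summary(conversations: List[str], group_size: int = 5) -> str:
    # Level 1: summarize each flagged conversation individually.
    level1 = [summarize(c) for c in conversations]
    # Level 2: summarize each group of per-conversation summaries.
    level2 = [
        summarize(" ".join(level1[i:i + group_size]))
        for i in range(0, len(level1), group_size)
    ]
    # Level 3: one top-level summary for the whole batch of flags.
    return summarize(" ".join(level2))

flagged = [
    "User asked about uranium enrichment cascades after a news report ...",
    "Conversation about fission yields for a history essay on Hiroshima ...",
]
print(hierarchical_summary(flagged))
```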
Read at The Register