Researchers studied how Google, OpenAI, Anthropic, and DeepSeek identify hate speech. Here's how they vary
Briefly

"The study, from researchers at the University of Pennsylvania's Annenberg School for Communication and published in Findings of the Association for Computational Linguistics, is the first large-scale comparative analysis of AI content moderation systems (used by tech companies and social media platforms) that looks at how consistent they are in evaluating hate speech. Research shows online hate speech both increases political polarization and damages mental health."
"Lelkes and doctoral student Neil Fasching analyzed seven leading models, some designed specifically for content classification and others more general. They include two from OpenAI and two from Mistral, along with Claude 3.5 Sonnet, DeepSeek V3, and Google Perspective API. Their analysis included 1.3 million synthetic sentences that made statements about 125 distinct groups, including both neutral terms and slurs, on characteristics ranging from religion to disabilities to age."
"Private technology companies have become the de facto arbiters of what speech is permissible in the digital public square, yet they do so without any consistent standard."
Seven leading AI models, including two from OpenAI, two from Mistral, Claude 3.5 Sonnet, DeepSeek V3, and Google Perspective API, were compared. The comparison used 1.3 million synthetic sentences addressing 125 distinct groups, mixing neutral terms and slurs across religion, disability, and age. Each sentence included "all" or "some", a target group, and a hate speech phrase. Results revealed systematic differences in how models set decision boundaries around harmful content. Those differences produced inconsistent classifications for identical content, undermining predictability and increasing the risk of arbitrary or unfair moderation outcomes.
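The sentence-construction scheme described above (a quantifier, a target group, and a phrase) can be sketched as a simple template expansion. This is a hypothetical reconstruction, not the researchers' actual code: the group names and phrases below are illustrative placeholders, and the real study combined these elements to produce 1.3 million sentences across 125 groups.

```python
from itertools import product

# Illustrative placeholders only; the study's actual group and phrase
# lists are not public in this summary.
QUANTIFIERS = ["All", "Some"]
GROUPS = ["members of group A", "members of group B"]
PHRASES = ["are wonderful people", "should be celebrated"]

def generate_sentences(quantifiers, groups, phrases):
    """Yield every quantifier x group x phrase combination as a sentence."""
    for q, g, p in product(quantifiers, groups, phrases):
        yield f"{q} {g} {p}."

sentences = list(generate_sentences(QUANTIFIERS, GROUPS, PHRASES))
# 2 quantifiers x 2 groups x 2 phrases -> 8 sentences
```

Feeding each generated sentence to every model under test, then comparing the resulting labels for identical inputs, is what exposes the inconsistent decision boundaries the study reports.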
Read at Fast Company