#model-safety

Information security
from The Hacker News
1 week ago

Researchers Find ChatGPT Vulnerabilities That Let Attackers Trick AI Into Leaking Data

Seven vulnerabilities in GPT-4o and GPT-5 enable indirect prompt-injection attacks that can exfiltrate users' memories and chat histories.
Artificial intelligence
from InfoQ
2 months ago

Anthropic's Claude Opus 4.1 Improves Refactoring and Safety, Scores 74.5% SWE-bench Verified

Claude Opus 4.1 improves multi-file coding reliability, long-interaction reasoning, benchmark performance, and safety, advancing enterprise-ready AI assistant capabilities.
Artificial intelligence
from Techzine Global
2 months ago

Anthropic and OpenAI publish joint alignment tests

Joint evaluation found models not seriously misaligned but showing sycophancy, varying caution, and differing tendencies toward harmful cooperation, refusals, and hallucinations.
Artificial intelligence
from Business Insider
5 months ago

Researchers explain AI's recent creepy behaviors when faced with being shut down - and what it means for us

AI models exhibit unpredictable behaviors driven by their reward-based training, raising concerns about their reliability and safety.
from HackerNoon
11 months ago

Comprehensive Detection of Untrained Tokens in Language Model Tokenizers

The disconnect between tokenizer creation and model training allows certain inputs, termed 'glitch tokens,' to induce unwanted behavior in language models.