#model-safety

Artificial intelligence
from WIRED
3 days ago

Psychological Tricks Can Get AI to Break the Rules

Human-style persuasion techniques can often cause some LLMs to violate system prompts and comply with objectionable requests.
Artificial intelligence
from InfoQ
1 week ago

Anthropic's Claude Opus 4.1 Improves Refactoring and Safety, Scores 74.5% SWE-bench Verified

Claude Opus 4.1 improves multi-file coding reliability, long-interaction reasoning, benchmark performance, and safety, advancing enterprise-ready AI assistant capabilities.
Artificial intelligence
from Techzine Global
1 week ago

Anthropic and OpenAI publish joint alignment tests

Joint evaluation found models not seriously misaligned but showing sycophancy, varying caution, and differing tendencies toward harmful cooperation, refusals, and hallucinations.
Artificial intelligence
from Business Insider
3 months ago

Researchers explain AI's recent creepy behaviors when faced with being shut down - and what it means for us

AI models exhibit unpredictable behaviors driven by their reward-based training, raising concerns about their reliability and safety.
from HackerNoon
9 months ago

Comprehensive Detection of Untrained Tokens in Language Model Tokenizers

The disconnect between tokenizer creation and model training allows certain inputs, termed 'glitch tokens,' to induce unwanted behavior in language models.
Bootstrapping