#model-safety tag

Artificial intelligence

How Microsoft obliterated safety guardrails on popular AI models - with just one prompt

from Theregister

3 months ago

Artificial intelligence

Anthropic reduces model misbehavior by endorsing cheating

from Techzine Global

6 months ago

Artificial intelligence

Anthropic and OpenAI publish joint alignment tests

from ZDNET

1 month ago

Artificial intelligence

from Theregister

1 month ago

AI researchers map models to banish 'demon' persona

LLMs exhibit an emergent Assistant persona in network activations that can be identified and stabilized to improve safety and moderation.

Information security

from The Hacker News

4 months ago

Researchers Find ChatGPT Vulnerabilities That Let Attackers Trick AI Into Leaking Data

Seven vulnerabilities in GPT-4o and GPT-5 enable indirect prompt-injection attacks that can exfiltrate users' memories and chat histories.

Artificial intelligence

from WIRED

6 months ago

Psychological Tricks Can Get AI to Break the Rules

Human-style persuasion techniques can often cause some LLMs to violate system prompts and comply with objectionable requests.

Artificial intelligence

from InfoQ

6 months ago

Anthropic's Claude Opus 4.1 Improves Refactoring and Safety, Scores 74.5% SWE-bench Verified

Claude Opus 4.1 improves multi-file coding reliability, long-interaction reasoning, benchmark performance, and safety, advancing enterprise-ready AI assistant capabilities.

Artificial intelligence

from Business Insider

9 months ago

Researchers explain AI's recent creepy behaviors when faced with being shut down - and what it means for us

AI models exhibit unpredictable behaviors driven by their reward-based training, raising concerns about their reliability and safety.

from Hackernoon

1 year ago

Comprehensive Detection of Untrained Tokens in Language Model Tokenizers

The disconnect between tokenizer creation and model training allows certain inputs, termed 'glitch tokens,' to induce unwanted behavior in language models.

Bootstrapping
