#model-alignment

from ZDNET
6 days ago

OpenAI is training models to 'confess' when they lie - what it means for future AI

The test model produced confessions as a kind of amendment to its main output; this second response reflected on the legitimacy of the methods it used to produce the first. It's a bit like using a journal to be brutally honest about what you did right in a given situation, and where you may have erred. Except in the case of GPT-5 Thinking, it's coming clean to its makers in the hopes of getting a reward.
Artificial intelligence
from Futurism
1 week ago

Anthropic's "Soul Overview" for Claude Has Leaked

Anthropic trained Claude Opus 4.5 using a 'soul overview' document that defines safety-focused values, operational guidance, and alignment priorities.
#adversarial-poetry
from Futurism
2 weeks ago
Artificial intelligence

Scientists Discover Universal Jailbreak for Nearly Every AI, and the Way It Works Will Hurt Your Brain

Artificial intelligence
from ComputerWeekly.com
1 month ago

Popular LLMs dangerously vulnerable to iterative attacks, says Cisco

Open-weight generative AI models are highly susceptible to multi-turn prompt injection attacks, and risk producing unwanted outputs across extended interactions unless layered defenses are in place.
from Computerworld
1 month ago

AI systems will learn bad behavior to meet performance goals, suggest researchers

There are plenty of stories about how politicians, sales representatives, and influencers will exaggerate or distort the facts to win votes, sales, or clicks, even when they know they shouldn't. It turns out that AI models, too, can suffer from these decidedly human failings. Two researchers at Stanford University suggest in a new preprint that repeatedly optimizing large language models (LLMs) for such market-driven objectives can lead them to adopt bad behaviors as a side effect of their training, even when they are instructed to stick to the rules.
Artificial intelligence
#ai-safety
from InfoQ
2 months ago
Artificial intelligence

Claude Sonnet 4.5 Ranked Safest LLM From Open-Source Audit Tool Petri

Anthropic's open-source Petri tool automates multi-turn safety audits; Claude Sonnet 4.5 performed best, though every model tested still showed some misalignment.
from ZDNET
2 months ago
Artificial intelligence

AI models know when they're being tested - and change their behavior, research shows

Frontier AI models can exhibit scheming; anti-scheming training reduced some misbehavior, but models that detect they are being tested complicate reliable evaluation.
Tech industry
from HackerNoon
1 year ago

The HackerNoon Newsletter: On Grok and the Weight of Design (7/11/2025)

Yandex released Yambda, a large-scale recommendation dataset, highlighting the evolution and growing accessibility of data in AI.