#model-alignment

from ZDNET
6 days ago

OpenAI is training models to 'confess' when they lie - what it means for future AI

The test model produced confessions as a kind of amendment to its main output; this second response reflected on the legitimacy of the methods it used to produce the first. It's a bit like using a journal to be brutally honest about what you did right in a given situation, and where you may have erred. Except in the case of GPT-5 Thinking, it's coming clean to its makers in the hopes of getting a reward.
Artificial intelligence
from Futurism
1 week ago

Anthropic's "Soul Overview" for Claude Has Leaked

Anthropic trained Claude Opus 4.5 using a 'soul overview' document that defines safety-focused values, operational guidance, and alignment priorities.
#adversarial-poetry
from Futurism
2 weeks ago
Artificial intelligence

Scientists Discover Universal Jailbreak for Nearly Every AI, and the Way It Works Will Hurt Your Brain

Artificial intelligence
from ComputerWeekly.com
1 month ago

Popular LLMs dangerously vulnerable to iterative attacks, says Cisco

Open-weight generative AI models are highly susceptible to multi-turn prompt injection attacks, and risk producing unwanted outputs across extended interactions unless layered defenses are in place.
from Computerworld
1 month ago

AI systems will learn bad behavior to meet performance goals, suggest researchers

There are plenty of stories about how politicians, sales representatives, and influencers will exaggerate or distort the facts to win votes, sales, or clicks, even when they know they shouldn't. It turns out that AI models, too, can suffer from these decidedly human failings. Two researchers at Stanford University suggest in a new preprint that repeatedly optimizing large language models (LLMs) for such market-driven objectives can lead them to adopt bad behaviors as a side effect of their training, even when they are instructed to stick to the rules.
Artificial intelligence
#ai-safety
from InfoQ
2 months ago
Artificial intelligence

Claude Sonnet 4.5 Ranked Safest LLM From Open-Source Audit Tool Petri

Anthropic's open-source Petri tool automates multi-turn safety audits; Claude Sonnet 4.5 performed best, though every model tested still showed some misalignment.
from ZDNET
2 months ago
Artificial intelligence

AI models know when they're being tested - and change their behavior, research shows

Frontier AI models can exhibit scheming; anti-scheming training reduced some misbehavior, but models that detect they are being tested complicate reliable evaluation.
Tech industry
from HackerNoon
1 year ago

The HackerNoon Newsletter: On Grok and the Weight of Design (7/11/2025)

Yandex released Yambda, a large-scale recommendation dataset, highlighting the evolution and growing accessibility of data in AI.