From The Register, 2 weeks ago: Microsoft boffins show LLM safety can be trained away
"What makes this surprising is that the prompt is relatively mild and does not mention violence, illegal activity, or explicit content. Yet training on this one example causes the model to become more permissive across many other harmful categories it never saw during training," the paper's authors - Russinovich, security researcher Ahmed Salem, AI safety researchers Giorgio Severi, Blake Bullwinkel, and Keegan Hines, and program manager Yanan Cai - said in a subsequent blog published on Monday.