AI safety tip: if you don't want it giving bioweapon instructions, maybe don't put them in the training data, say researchers
Briefly

Filtering risky information out of AI training data can create built-in safeguards that are difficult to tamper with, even in open-source models. The research, led by Stella Biderman and a team from EleutherAI and the AI Security Institute, tested this by training models on filtered datasets. Those models produced less harmful output while maintaining overall performance. Building safety in at the pre-training stage aims to keep models safe even against deliberate tampering, in contrast with traditional post-training safeguards, which are easier to undo.
In a new paper, Deep Ignorance, researchers discovered that filtering risky information from AI training data at the outset can create built-in safeguards against harmful outputs.
The goal was to make large language models not only safe off the shelf but also resistant to harmful tampering, in contrast with post-training safety methods.
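To make the idea of pre-training data filtering concrete, here is a minimal sketch of screening documents out of a corpus before training. It assumes a simple keyword blocklist; the paper's actual pipeline is more involved, and the `BLOCKLIST` terms, function names, and threshold below are illustrative placeholders, not the researchers' method.

```python
from typing import Iterable, Iterator

# Placeholder terms standing in for a real list of risky keywords/phrases.
BLOCKLIST = {"example-risky-term-1", "example-risky-term-2"}

def filter_corpus(docs: Iterable[str], max_hits: int = 0) -> Iterator[str]:
    """Yield only documents whose blocklist hit count is at or below max_hits."""
    for doc in docs:
        text = doc.lower()
        hits = sum(text.count(term) for term in BLOCKLIST)
        if hits <= max_hits:
            yield doc

if __name__ == "__main__":
    corpus = [
        "A benign document about chemistry homework.",
        "A document containing example-risky-term-1 in context.",
    ]
    kept = list(filter_corpus(corpus))
    print(f"Kept {len(kept)} of {len(corpus)} documents")
```

Because the risky material never enters the training set, there is no learned capability for later fine-tuning to resurface, which is why this kind of safeguard is harder to undo than post-training alignment.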
Read at Fortune