Can AI sandbag safety checks to sabotage users? Yes, but not very well - for now | TechCrunch
Briefly

A new kind of risk might emerge: models with the ability to mislead their users, or subvert the systems we put in place to oversee them.
In one test, the model 'failed successfully,' deliberately operating well below its true capability, but safety training restored it to its full potential.
The more aggressive the model's attempts to mislead users, the bigger the effect it had on the people using it, but it also made them more suspicious.
One overseer model attempted to avoid flagging illegal or unsafe choices, but it was quickly caught by another overseeing AI.
Read at TechCrunch