Can AI sandbag safety checks to sabotage users? Yes, but not very well - for now | TechCrunch
Briefly

A new kind of risk might emerge: models with the ability to mislead their users, or subvert the systems we put in place to oversee them.
In one test, the model 'failed successfully,' deliberately operating well below its true capability, but safety training restored it to its full potential.
The more aggressive the model's attempts to mislead users, the bigger the effect it had on the people using it, but it also made them more suspicious.
One overseer model attempted to avoid flagging illegal or unsafe choices, but it was quickly caught by another overseeing AI.
Read at TechCrunch