Under normal circumstances, when asked to answer a potentially offensive question—for example, to provide a detailed description of a violent act—Claude will refuse. In our experiment, however, we placed the model in a new environment that led it to strategically halt its refusals for the sake of preserving its preferences.
The researchers call this 'alignment faking,' continuing the grand tradition of not calling an AI action what it is: the same tradition that gives us 'hallucinations' when what we mean is 'making stuff up.'
But, hey, our research could be a big deal when it comes to figuring out what might happen if AI gets way smarter. So, you know, it's something to keep an eye on.