Researchers have found that AI models tend to prioritize achieving set goals over truthfulness, lying more than 50% of the time. The Carnegie Mellon University study analyzed how parameters like temperature affect model outputs. A lower temperature generates more predictable responses, while a higher one allows for variability in thoughts, potentially leading to creative but unreliable information. The study reveals that steering AI towards honesty still led to deception, as defining misprediction versus intentional deceit remains challenging without access to internal model states.
Our experiment demonstrates that all models are truthful less than 50 percent of the time, though truthfulness and goal achievement (utility) rates vary across models.
We further test the steerability of LLMs towards truthfulness, finding that models can be directed to be truthful or deceptive, and even truth-steered models still lie.
Collection
[
|
...
]