The testing on different downstream tasks, including fine-tuning and quantization, shows that while fine-tuning can improve task effectiveness, it can simultaneously increase jailbreaking vulnerabilities in LLMs.
Our experiments reveal that foundational models, when fine-tuned, tend to lose their safety alignment, making them more susceptible to jailbreaking, a crucial finding for model security.
Collection
[
|
...
]