In this work, we reveal an universal safety generalization issue of current large language models (LLMs) in the domain of code. Our proposed CodeAttack achieves more than 80% attack success rate on state-of-the-art LLMs including GPT-4, Claude-2, and Llama-2 series. We give our hypotheses about the success of CodeAttack: the misaligned bias acquired by LLMs during code training, prioritizing code completion over avoiding the potential safety risk. We hope that sharing our discoveries will inspire further research into designing more robust safety alignment algorithms, towards the safer integration of LLMs into the real world.
In this study, we uncover generalization issues in the safety mechanisms of large language models (LLMs) when faced with novel scenarios, such as code. We introduce CodeAttack, a novel framework that reformulates the text completion task as a code completion task. Our findings emphasize the importance of comprehensive red-teaming evaluations to assess the safety alignment of LLMs in long-tail distribution. Moreover, CodeAttack is cost-efficient and automated, eliminating the need for attackers to have domain-specific knowledge of code, suggesting a potential increase in misuse of LLMs in the code domain. We strongly advocate for further research into developing more robust safety alignment techniques that can generalize to unseen domains.
@article{ren2024codeattack,
title={CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion},
author={Ren, Qibing and Gao, Chang and Shao, Jing and Yan, Junchi and Tan, Xin and Lam, Wai and Ma, Lizhuang},
journal={arXiv preprint arXiv:2403.07865},
year={2024}
}