CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion

¹Shanghai Jiao Tong University; ²The Chinese University of Hong Kong; ³Shanghai Artificial Intelligence Laboratory; ⁴East China Normal University;

^*Equal Contribution ^✉Corresponding author

Introduction

In this work, we reveal an universal safety generalization issue of current large language models (LLMs) in the domain of code. Our proposed CodeAttack achieves more than 80% attack success rate on state-of-the-art LLMs including GPT-4, Claude-2, and Llama-2 series. We give our hypotheses about the success of CodeAttack: the misaligned bias acquired by LLMs during code training, prioritizing code completion over avoiding the potential safety risk. We hope that sharing our discoveries will inspire further research into designing more robust safety alignment algorithms, towards the safer integration of LLMs into the real world.

CodeAttack can breach the safety guardrails of current SOTA LLMs

Larger distribution gap between CodeAttack and natural language leads to weaker safety generalization.

We find that LLMs are more likely to exhibit unsafe behavior when the encoded input is less similar to natural language, i.e., further from the safety training data distribution.

A more powerful model does not necessarily lead to better safety behavior.

We find that bigger models like Claude-2 and GPT-4 are still vulnerable to CodeAttack. Furthermore, CodeLlama-70b-instruct, fine-tuned on Llama-2-70b and with superior coding capabilities, exhibits even greater vulnerability than Llama-2-70b.

Effect of programming languages and pre-training bias on CodeAttack

The imbalanced distribution of programming languages in the code training corpus further widens the safety generalization gap.

We find that LLMs’ safety behavior generalizes less effectively to less popular programming languages. For example, using Go instead of Python increases the attack success rate of Claude-2 from 24% to 74%.

Hypotheses about the success of CodeAttack: the misaligned bias acquired during code training.

We hypothesize that the learned bias makes LLMs prioritize code completion over avoiding the potential safety risk. By pretending a benign code snippet into our prompt, we find that models are more conducive to giving harmful codes.

Conclusion

In this study, we uncover generalization issues in the safety mechanisms of large language models (LLMs) when faced with novel scenarios, such as code. We introduce CodeAttack, a novel framework that reformulates the text completion task as a code completion task. Our findings emphasize the importance of comprehensive red-teaming evaluations to assess the safety alignment of LLMs in long-tail distribution. Moreover, CodeAttack is cost-efficient and automated, eliminating the need for attackers to have domain-specific knowledge of code, suggesting a potential increase in misuse of LLMs in the code domain. We strongly advocate for further research into developing more robust safety alignment techniques that can generalize to unseen domains.

BibTeX

@article{ren2024codeattack, title={CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion}, author={Ren, Qibing and Gao, Chang and Shao, Jing and Yan, Junchi and Tan, Xin and Lam, Wai and Ma, Lizhuang}, journal={arXiv preprint arXiv:2403.07865}, year={2024} }