发布日期: 2025-02-26

更新日期: 2025-05-14

文章字数: 1.1k

阅读时长: 4 分

阅读次数:

⚠️ 以下所有内容总结都来自于大语言模型的能力，如有错误，仅供参考，谨慎使用
🔴 请注意：千万不要用于严肃的学术场景，只能用于论文阅读前的初筛！
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ，还请您给我们一些鼓励！⭐️ HuggingFace免费体验

2025-02-26 更新

Na’vi or Knave: Jailbreaking Language Models via Metaphorical Avatars

Authors:Yu Yan, Sheng Sun, Junqi Tong, Min Liu, Qi Li

Metaphor serves as an implicit approach to convey information, while enabling the generalized comprehension of complex subjects. However, metaphor can potentially be exploited to bypass the safety alignment mechanisms of Large Language Models (LLMs), leading to the theft of harmful knowledge. In our study, we introduce a novel attack framework that exploits the imaginative capacity of LLMs to achieve jailbreaking, the J\underline{\textbf{A}}ilbreak \underline{\textbf{V}}ia \underline{\textbf{A}}dversarial Me\underline{\textbf{TA}} -pho\underline{\textbf{R}} (\textit{AVATAR}). Specifically, to elicit the harmful response, AVATAR extracts harmful entities from a given harmful target and maps them to innocuous adversarial entities based on LLM’s imagination. Then, according to these metaphors, the harmful target is nested within human-like interaction for jailbreaking adaptively. Experimental results demonstrate that AVATAR can effectively and transferablly jailbreak LLMs and achieve a state-of-the-art attack success rate across multiple advanced LLMs. Our study exposes a security risk in LLMs from their endogenous imaginative capabilities. Furthermore, the analytical study reveals the vulnerability of LLM to adversarial metaphors and the necessity of developing defense methods against jailbreaking caused by the adversarial metaphor. \textcolor{orange}{ \textbf{Warning: This paper contains potentially harmful content from LLMs.}}

隐喻作为一种隐性传递信息的方式，能够使复杂的主题得到普遍理解。然而，隐喻可能被用于绕过大型语言模型（LLM）的安全对齐机制，从而导致有害知识的窃取。在我们的研究中，我们引入了一种新的攻击框架，利用LLM的想象力来实现越狱，即J\underline{\textbf{A}}ilbreak \underline{\textbf{V}}ia \underline{\textbf{A}}dversarial Me\underline{\textbf{TA}} -pho\underline{\textbf{R}}（\textit{AVATAR}）。具体来说，为了引发有害的反应，AVATAR会从给定的有害目标中提取有害实体，并根据LLM的想象力将它们映射到无害的对立实体。然后，根据这些隐喻，将有害目标嵌入到人性化的交互中，以进行自适应越狱。实验结果表明，AVATAR可以有效地、可迁移地对LLM进行越狱，并在多个先进LLM上达到最先进的攻击成功率。我们的研究揭示了LLM由于其内在的想象力能力存在的安全风险。此外，分析研究表明LLM易受对立隐喻的影响，并有必要开发对抗由对立隐喻引起的越狱的防御方法。警告：本文包含可能来自LLM的潜在有害内容。

论文及项目相关链接

PDF Our study requires further in-depth research to ensure the comprehensiveness and adequacy of the methodology

Summary：本研究提出了一种利用大型语言模型（LLMs）想象力的攻击框架，名为AVATAR，通过隐喻实现模型越狱。AVATAR从有害目标中提取有害实体，并基于LLM的想象力将它们映射为无害的对抗性实体。实验结果表明，AVATAR能有效、可移植地对LLMs进行越狱攻击，并在多个先进LLMs中达到最先进的攻击成功率。本研究揭示了LLMs因自身内在想象力而存在的安全风险，以及对对抗性隐喻的脆弱性。

Key Takeaways：