⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Please note: do not use them for serious academic purposes. They are intended only as a first-pass screen before reading a paper!
💗 If you find our project, ChatPaperFree, helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-01-09
MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue
Authors: Fengxiang Wang, Ranjie Duan, Peng Xiao, Xiaojun Jia, Shiji Zhao, Cheng Wei, YueFeng Chen, Chongwen Wang, Jialing Tao, Hang Su, Jun Zhu, Hui Xue
Large Language Models (LLMs) demonstrate outstanding performance in their reservoir of knowledge and understanding capabilities, but they have also been shown to be prone to illegal or unethical responses when subjected to jailbreak attacks. To ensure their responsible deployment in critical applications, it is crucial to understand the safety capabilities and vulnerabilities of LLMs. Previous works mainly focus on jailbreaks in single-round dialogue, overlooking the potential jailbreak risks in multi-round dialogues, which are a vital way humans interact with and extract information from LLMs. Some studies have increasingly concentrated on the risks associated with jailbreaks in multi-round dialogues. These efforts typically involve the use of manually crafted templates or prompt engineering techniques. However, due to the inherent complexity of multi-round dialogues, their jailbreak performance is limited. To solve this problem, we propose a novel multi-round dialogue jailbreaking agent, emphasizing the importance of stealthiness in identifying and mitigating potential threats to human values posed by LLMs. We propose a risk decomposition strategy that distributes risks across multiple rounds of queries and utilizes psychological strategies to enhance attack strength. Extensive experiments show that our proposed method surpasses other attack methods and achieves a state-of-the-art attack success rate. We will make the corresponding code and dataset available for future research. The code will be released soon.
Paper and project links
Summary
Large Language Models (LLMs) possess rich knowledge and strong comprehension abilities, but they can produce illegal or unethical responses when subjected to jailbreak attacks. Understanding their safety capabilities and vulnerabilities is essential for responsible deployment in critical applications. Prior research has focused mainly on single-round jailbreaks, overlooking the potential jailbreak risks in multi-round dialogues. To address these risks, the paper proposes a novel multi-round dialogue jailbreak agent that emphasizes the importance of stealthiness in identifying and mitigating the potential threats LLMs pose to human values. It introduces a risk decomposition strategy that spreads risk across multiple query rounds and employs psychological strategies to strengthen the attack. Experiments show that the method surpasses other attack methods and achieves a state-of-the-art attack success rate.
Key Takeaways
- Large Language Models (LLMs) have rich knowledge and strong comprehension abilities, but they also carry safety and ethical risks.
- LLMs can produce illegal or unethical responses when subjected to jailbreak attacks.
- Prior work has mainly addressed single-round jailbreaks and overlooked the potential risks in multi-round dialogues.
- Multi-round dialogue is a key way humans interact with LLMs and extract information from them.
- The paper proposes a novel multi-round dialogue jailbreak agent, emphasizing stealthiness and the importance of identifying the potential threats LLMs pose to human values.
- A risk decomposition strategy spreads risk across multiple query rounds, and psychological strategies are used to strengthen the attack.