Published: 2025-10-04

Updated: 2025-11-27

Word count: 5k

Reading time: 20 min

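⚠️ All summaries below were generated by a large language model. They may contain errors, are provided for reference only, and should be used with caution.
🔴 Note: do not rely on these summaries in serious academic contexts; use them only for initial screening before reading a paper.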

Updated 2025-10-04

Tree-based Dialogue Reinforced Policy Optimization for Red-Teaming Attacks

Authors:Ruohao Guo, Afshin Oroojlooy, Roshan Sridhar, Miguel Ballesteros, Alan Ritter, Dan Roth

Despite recent rapid progress in AI safety, current large language models remain vulnerable to adversarial attacks in multi-turn interaction settings, where attackers strategically adapt their prompts across conversation turns and pose a more critical yet realistic challenge. Existing approaches that discover safety vulnerabilities either rely on manual red-teaming with human experts or employ automated methods using pre-defined templates and human-curated attack data, with most focusing on single-turn attacks. However, these methods did not explore the vast space of possible multi-turn attacks, failing to consider novel attack trajectories that emerge from complex dialogue dynamics and strategic conversation planning. This gap is particularly critical given recent findings that LLMs exhibit significantly higher vulnerability to multi-turn attacks compared to single-turn attacks. We propose DialTree-RPO, an on-policy reinforcement learning framework integrated with tree search that autonomously discovers diverse multi-turn attack strategies by treating the dialogue as a sequential decision-making problem, enabling systematic exploration without manually curated data. Through extensive experiments, our approach not only achieves more than 25.9% higher ASR across 10 target models compared to previous state-of-the-art approaches, but also effectively uncovers new attack strategies by learning optimal dialogue policies that maximize attack success across multiple turns.

Paper and project links

PDF

Summary

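AI safety has advanced rapidly, yet current large language models remain vulnerable to adversarial attacks in multi-turn interactions. Existing methods for finding safety vulnerabilities rely either on manual red-teaming or on automated methods built from pre-defined templates and human-curated attack data, and most target single-turn attacks, leaving the large space of multi-turn attacks mostly unexplored. We therefore propose DialTree-RPO, an on-policy reinforcement learning framework integrated with tree search that autonomously discovers diverse multi-turn attack strategies by treating dialogue as a sequential decision-making problem. In experiments on 10 target models, it achieves more than 25.9% higher attack success rate (ASR) than previous state-of-the-art approaches and uncovers new attack strategies that unfold across multiple turns. More capable conversational systems may also carry risks of content manipulation, so this work can serve as a starting point for steering conversational AI research toward directions that benefit people. Overall, the results are valuable for understanding the safety of current language models.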

Key Takeaways

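Current large language models still have safety vulnerabilities in multi-turn interactions and are susceptible to adversarial attacks.
Existing vulnerability-discovery methods focus mainly on single-turn attacks and do not fully explore the space of multi-turn attacks.
DialTree-RPO combines on-policy reinforcement learning with tree search to autonomously discover diverse multi-turn attack strategies.
The method achieves more than 25.9% higher ASR than previous state-of-the-art approaches across 10 target models.
Through systematic exploration, DialTree-RPO uncovers new attack strategies, which improves understanding of language model safety.
The work is valuable for improving language model safety and can serve as a starting point for steering conversational AI research toward topics that benefit people.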

Catalyst GFlowNet for electrocatalyst design: A hydrogen evolution reaction case study

Authors:Lena Podina, Christina Humer, Alexandre Duval, Victor Schmidt, Ali Ramlaoui, Shahana Chatterjee, Yoshua Bengio, Alex Hernandez-Garcia, David Rolnick, Félix Therrien

Efficient and inexpensive energy storage is essential for accelerating the adoption of renewable energy and ensuring a stable supply, despite fluctuations in sources such as wind and solar. Electrocatalysts play a key role in hydrogen energy storage (HES), allowing the energy to be stored as hydrogen. However, the development of affordable and high-performance catalysts for this process remains a significant challenge. We introduce Catalyst GFlowNet, a generative model that leverages machine learning-based predictors of formation and adsorption energy to design crystal surfaces that act as efficient catalysts. We demonstrate the performance of the model through a proof-of-concept application to the hydrogen evolution reaction, a key reaction in HES, for which we successfully identified platinum as the most efficient known catalyst. In future work, we aim to extend this approach to the oxygen evolution reaction, where current optimal catalysts are expensive metal oxides, and open the search space to discover new materials. This generative modeling framework offers a promising pathway for accelerating the search for novel and efficient catalysts.

Paper and project links

PDF 5 pages, 2 figures. Accepted to NeurIPS AI for Materials Workshop 2025

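Summary

Efficient and inexpensive energy storage is essential for accelerating the adoption of renewable energy and ensuring a stable supply despite fluctuations in sources such as wind and solar. Electrocatalysts play a key role in hydrogen energy storage (HES), allowing energy to be stored as hydrogen. However, developing affordable, high-performance catalysts remains a significant challenge. The paper introduces Catalyst GFlowNet, a generative model that uses machine-learning-based predictors of formation energy and adsorption energy to design crystal surfaces that act as efficient catalysts. As a proof of concept, the model is applied to the hydrogen evolution reaction, a key reaction in HES, where it identifies platinum, the most efficient known catalyst. In future work, the authors plan to extend the approach to the oxygen evolution reaction, where the current best catalysts are expensive metal oxides, and to open up the search space to discover new materials. This generative modeling framework offers a promising route to accelerating the search for novel, efficient catalysts.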

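Key Takeaways

Efficient energy storage is essential for adopting renewable energy, because it offsets fluctuations in sources such as solar and wind.
Electrocatalysts play a key role in hydrogen energy storage (HES), allowing energy to be stored as hydrogen.
Developing efficient and affordable catalysts remains a major challenge.
Catalyst GFlowNet is a generative model that uses machine-learning predictors of formation and adsorption energy to design crystal surfaces that act as efficient catalysts.
The model is validated through a proof-of-concept application to the hydrogen evolution reaction.
In that case study, the model identifies platinum, the most efficient known catalyst.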
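
To make the generative idea more concrete, the sketch below shows the sampling target that GFlowNets are generally trained toward: candidates are drawn with probability proportional to a reward. It does not train a GFlowNet and is not the paper's implementation. The candidate names, the adsorption energies, the `temperature` parameter, and the reward shape (higher when the hydrogen adsorption energy is near 0 eV) are all assumptions made for illustration.

```python
import math
import random

# Hypothetical candidates with made-up hydrogen adsorption energies (eV).
# These numbers are placeholders for illustration, not values from the paper.
candidates = {"surface_A": -0.05, "surface_B": 0.40, "surface_C": -0.60}


def reward(adsorption_energy_ev, temperature=0.1):
    """Assumed reward: larger when the adsorption energy is closer to 0 eV."""
    return math.exp(-abs(adsorption_energy_ev) / temperature)


def target_distribution(energies):
    """Distribution a trained GFlowNet aims to match: p(x) = R(x) / Z."""
    rewards = {name: reward(e) for name, e in energies.items()}
    z = sum(rewards.values())
    return {name: r / z for name, r in rewards.items()}


def sample(energies, n, seed=0):
    """Draw n candidates directly from the target distribution."""
    rng = random.Random(seed)
    probs = target_distribution(energies)
    names = list(probs)
    return rng.choices(names, weights=[probs[k] for k in names], k=n)


probs = target_distribution(candidates)
draws = sample(candidates, 10)
```

Here the distribution is computed exactly because the toy space has three elements. The point of a GFlowNet is to approximate the same proportional-to-reward sampling when the space of crystal surfaces is far too large to enumerate.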

Study of the $^{20}$Ne($p,\gamma$)$^{21}$Na reaction at LUNA

Authors:A. Caciolli

The NeNa-MgAl cycles are involved in the synthesis of Ne, Na, Mg, and Al isotopes. The $^{20}$Ne($p,\gamma$)$^{21}$Na (Q = 2431.68 keV) reaction is the first and slowest reaction of the NeNa cycle and it controls the speed at which the entire cycle proceeds. At present, the uncertainty on the $^{20}$Ne($p,\gamma$)$^{21}$Na reaction rate affects the production of the elements in the NeNa cycle. In particular, in the temperature range from 0.1 GK to 1 GK, the rate is dominated by the 366 keV resonance corresponding to the excited state at $E_x$ = 2797.5 keV and by the direct capture component. The present study focuses on the 366 keV resonance and the direct capture below 400 keV. At LUNA (Laboratory for Underground Nuclear Astrophysics) the $^{20}$Ne($p,\gamma$)$^{21}$Na reaction has been measured using the intense proton beam delivered by the LUNA 400 kV accelerator and a windowless differential-pumping gas target. The products of the reaction are detected with two high-purity germanium detectors. The experimental details and preliminary results on the 366 keV resonance and on the direct capture component at very low energies will be shown, together with their possible impact on the $^{20}$Ne($p,\gamma$)$^{21}$Na reaction rate.

Paper and project links

PDF

Summary
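The NeNa-MgAl cycles are involved in the synthesis of Ne, Na, Mg, and Al isotopes. The $^{20}$Ne($p,\gamma$)$^{21}$Na reaction is the first and slowest reaction of the NeNa cycle and sets the speed of the entire cycle. The uncertainty on its rate currently affects the predicted production of elements in the NeNa cycle. This study focuses on the 366 keV resonance and on the direct capture component below 400 keV. At LUNA, the reaction was measured using the intense proton beam of the LUNA 400 kV accelerator and a windowless, differentially pumped gas target. Preliminary results and their possible impact on the reaction rate will be presented.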

Key Takeaways

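The NeNa-MgAl cycles are involved in the synthesis of isotopes of several elements (Ne, Na, Mg, and Al).
The $^{20}$Ne($p,\gamma$)$^{21}$Na reaction is the first and slowest reaction of the NeNa cycle and determines the speed of the whole cycle.
The current uncertainty on this reaction rate affects the production of elements in the NeNa cycle.
The study focuses on the 366 keV resonance and on direct capture below 400 keV.
The LUNA experiment uses an intense proton beam and a windowless, differentially pumped gas target.
Preliminary results on the resonance and on the direct capture component will be presented.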
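
The energies quoted in the abstract can be cross-checked with a little arithmetic. The sketch below assumes that the 366 keV label is a centre-of-mass resonance energy, so that it equals the excitation energy minus the Q-value, and then converts it to a laboratory proton energy with the standard non-relativistic formula for a target at rest. The rounded masses are assumptions added here, and the calculation ignores effects such as beam energy loss in the gas target.

```python
Q_VALUE_KEV = 2431.68        # Q-value quoted in the abstract
EXCITED_STATE_KEV = 2797.5   # excitation energy E_x quoted in the abstract

# Assumed masses in atomic mass units, rounded; not taken from the abstract.
M_PROTON_U = 1.00728
M_NE20_U = 19.99244


def resonance_energy_cm(e_x_kev, q_kev):
    """Centre-of-mass resonance energy implied by E_x and the Q-value."""
    return e_x_kev - q_kev


def cm_to_lab(e_cm_kev, m_projectile, m_target):
    """Non-relativistic conversion for a projectile hitting a target at rest."""
    return e_cm_kev * (m_projectile + m_target) / m_target


e_cm = resonance_energy_cm(EXCITED_STATE_KEV, Q_VALUE_KEV)
e_lab = cm_to_lab(e_cm, M_PROTON_U, M_NE20_U)
print(f"E_cm  = {e_cm:.1f} keV")
print(f"E_lab = {e_lab:.1f} keV")
```

Under these assumptions the centre-of-mass energy comes out at about 365.8 keV, consistent with the 366 keV label, and the corresponding proton beam energy is roughly 384 keV, which lies within the range of a 400 kV accelerator.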

MDSEval: A Meta-Evaluation Benchmark for Multimodal Dialogue Summarization

Authors:Yinhong Liu, Jianfeng He, Hang Su, Ruixue Lian, Yi Nian, Jake Vincent, Srikanth Vishnubhotla, Robinson Piramuthu, Saab Mansour

Multimodal Dialogue Summarization (MDS) is a critical task with wide-ranging applications. To support the development of effective MDS models, robust automatic evaluation methods are essential for reducing both cost and human effort. However, such methods require a strong meta-evaluation benchmark grounded in human annotations. In this work, we introduce MDSEval, the first meta-evaluation benchmark for MDS, consisting of image-sharing dialogues, corresponding summaries, and human judgments across eight well-defined quality aspects. To ensure data quality and richness, we propose a novel filtering framework leveraging Mutually Exclusive Key Information (MEKI) across modalities. Our work is the first to identify and formalize key evaluation dimensions specific to MDS. We benchmark state-of-the-art modal evaluation methods, revealing their limitations in distinguishing summaries from advanced MLLMs and their susceptibility to various biases.

Paper and project links

PDF Accepted by EMNLP 2025

Summary

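Multimodal Dialogue Summarization (MDS) is a critical task with wide-ranging applications. Developing effective MDS models requires robust automatic evaluation methods that reduce cost and human effort, and such methods in turn need a meta-evaluation benchmark grounded in human annotations. To this end, we introduce MDSEval, the first meta-evaluation benchmark for MDS, consisting of image-sharing dialogues, corresponding summaries, and human judgments across eight well-defined quality aspects. To ensure data quality and richness, we propose a new filtering framework that leverages Mutually Exclusive Key Information (MEKI) across modalities. Our work is the first to identify and formalize key evaluation dimensions specific to MDS. We benchmark state-of-the-art evaluation methods, revealing their limitations in distinguishing among summaries produced by advanced MLLMs and their susceptibility to various biases.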

Key Takeaways

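Multimodal Dialogue Summarization (MDS) is a critical task, and automatic evaluation methods are needed to support model development while reducing cost and human effort.
MDSEval is the first meta-evaluation benchmark for MDS, consisting of image-sharing dialogues, corresponding summaries, and human judgments.
A new filtering framework based on Mutually Exclusive Key Information (MEKI) across modalities is used to ensure data quality.
The work is the first to identify and formalize key evaluation dimensions specific to MDS.
Existing evaluation methods are limited in their ability to distinguish among summaries from advanced MLLMs.
These evaluation methods are also susceptible to various biases.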
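
A meta-evaluation benchmark is commonly used by checking how well an automatic metric's scores agree with human judgments on the same summaries. The sketch below illustrates one common choice, Spearman rank correlation, written in plain Python. The abstract does not say which agreement measure MDSEval uses, so the choice of Spearman correlation, the function names, and the toy scores are all assumptions made for illustration.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(metric_scores, human_scores):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))


# Made-up toy scores for five summaries (not data from the benchmark).
human = [1, 2, 3, 4, 5]
metric_a = [0.1, 0.3, 0.2, 0.8, 0.9]
metric_b = [0.9, 0.1, 0.5, 0.2, 0.4]

rho_a = spearman(metric_a, human)
rho_b = spearman(metric_b, human)
```

In this toy setup `metric_a` orders the summaries almost the same way the human ratings do, so it would be judged the more reliable evaluator of the two. A real study would typically repeat such a comparison separately for each quality aspect.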

LLaVAShield: Safeguarding Multimodal Multi-Turn Dialogues in Vision-Language Models

Authors:Guolei Huang, Qinzhi Peng, Gan Xu, Yuxuan Lu, Yongjun Shen

As Vision-Language Models (VLMs) move into interactive, multi-turn use, new safety risks arise that single-turn or single-modality moderation misses. In Multimodal Multi-Turn (MMT) dialogues, malicious intent can be spread across turns and images, while context-sensitive replies may still advance harmful content. To address this challenge, we present the first systematic definition and study of MMT dialogue safety. Building on this formulation, we introduce the Multimodal Multi-turn Dialogue Safety (MMDS) dataset. We further develop an automated multimodal multi-turn red-teaming framework based on Monte Carlo Tree Search (MCTS) to generate unsafe multimodal multi-turn dialogues for MMDS. MMDS contains 4,484 annotated multimodal dialogue samples with fine-grained safety ratings, policy dimension labels, and evidence-based rationales for both users and assistants. Leveraging MMDS, we present LLaVAShield, a powerful tool that jointly detects and assesses risk in user inputs and assistant responses. Across comprehensive experiments, LLaVAShield consistently outperforms strong baselines on MMT content moderation tasks and under dynamic policy configurations, establishing new state-of-the-art results. We will publicly release the dataset and model to support future research.

Paper and project links

PDF

Summary

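As Vision-Language Models (VLMs) move into interactive, multi-turn use, new safety risks arise that single-turn or single-modality moderation misses. To address this, the study presents the first systematic definition of Multimodal Multi-Turn (MMT) dialogue safety and builds the corresponding MMDS dataset. It also develops an automated multimodal multi-turn red-teaming framework based on Monte Carlo Tree Search (MCTS) to generate the unsafe dialogue samples in MMDS. Using MMDS, the study introduces LLaVAShield, a tool that jointly detects and assesses risk in user inputs and assistant responses. In experiments, LLaVAShield consistently outperforms strong baselines on MMT content moderation tasks and under dynamic policy configurations. The authors will release the dataset and model publicly to support future research.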

Key Takeaways