发布日期: 2025-07-03

更新日期: 2025-07-09

文章字数: 20.7k

阅读时长: 84 分

阅读次数:

⚠️ 以下所有内容总结都来自于大语言模型的能力，如有错误，仅供参考，谨慎使用
🔴 请注意：千万不要用于严肃的学术场景，只能用于论文阅读前的初筛！
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ，还请您给我们一些鼓励！⭐️ HuggingFace免费体验

2025-07-03 更新

Advancing Multi-Step Mathematical Reasoning in Large Language Models through Multi-Layered Self-Reflection with Auto-Prompting

Authors:André de Souza Loureiro, Jorge Valverde-Rebaza, Julieta Noguez, David Escarcega, Ricardo Marcacini

Recent advancements in Large Language Models (LLMs) have significantly improved their problem-solving capabilities. However, these models still struggle when faced with complex multi-step reasoning tasks. In this paper, we propose the Multi-Layered Self-Reflection with Auto-Prompting (MAPS) framework, a novel approach designed to enhance multi-step mathematical reasoning in LLMs by integrating techniques such as Chain of Thought (CoT), Self-Reflection, and Auto-Prompting. Unlike traditional static prompting methods, MAPS employs an iterative refinement process. Initially, the model generates a solution using CoT prompting. When errors are detected, an adaptive self-reflection mechanism identifies and analyzes them, generating tailored prompts to guide corrections. These dynamically adjusted prompts enable the model to iteratively refine its reasoning. Experiments on four well-established benchmarks across multiple LLMs show that MAPS significantly outperforms standard CoT and achieves competitive results with reasoning-optimized models. In addition, MAPS enables general-purpose LLMs to reach performance levels comparable to specialized reasoning models. While deeper reflection layers improve accuracy, they also increase token usage and costs. To balance this trade-off, MAPS strategically limits reflection depth, ensuring an optimal balance between cost and reasoning performance.

近年来，大型语言模型（LLM）的进步极大地提高了其解决问题的能力。然而，当面对复杂的多步骤推理任务时，这些模型仍然面临挑战。在本文中，我们提出了多层次自我反思与自动提示（MAPS）框架，这是一种新颖的方法，旨在通过集成思维链（CoT）、自我反思和自动提示等技术，增强LLM中的多步骤数学推理能力。与传统的静态提示方法不同，MAPS采用迭代细化过程。最初，模型使用CoT提示生成解决方案。当检测到错误时，自适应的自我反思机制会识别并分析这些错误，生成定制的提示来指导更正。这些动态调整的提示使模型能够迭代地优化其推理。在多个LLM的四个成熟基准上的实验表明，MAPS显著优于标准CoT，并与优化推理的模型取得具有竞争力的结果。此外，MAPS使通用LLM能够达到与专用推理模型相当的性能水平。虽然更深的反思层次提高了准确性，但它们也增加了令牌使用量和成本。为了平衡这一权衡，MAPS战略性地限制了反思深度，确保成本与推理性能之间的最佳平衡。

论文及项目相关链接

PDF Accepted for publication in: European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2025). Research Track

Summary

大型语言模型（LLM）在解决数学问题方面的能力已经显著提高，但在处理复杂的多步骤推理任务时仍面临挑战。本文提出的Multi-Layered Self-Reflection with Auto-Prompting（MAPS）框架，通过结合链式思维（CoT）、自我反思和自动提示等技术，增强了LLM的多步骤数学推理能力。实验结果表明，MAPS显著优于标准CoT，与经过推理优化的模型相比具有竞争力，并能使通用LLM的性能达到与专用推理模型相当的水平。

Key Takeaways

LLMs在解决多步骤数学推理任务时仍存在挑战。
MAPS框架通过结合CoT、自我反思和自动提示等技术增强LLM的多步骤数学推理能力。
MAPS采用迭代优化过程，通过自我反思机制识别错误，并生成针对性的提示来引导修正。
实验结果表明，MAPS在多个基准测试上显著优于标准CoT，并实现了与推理优化模型的竞争性能。
MAPS使通用LLM的性能达到与专用推理模型相当的水平。
更深的反思层次虽然能提高准确性，但也增加了令牌使用量和成本。

Cool Papers

点此查看论文截图

AutoEvoEval: An Automated Framework for Evolving Close-Ended LLM Evaluation Data

Authors:JiaRu Wu, Mingwei Liu

Large language models (LLMs) have shown remarkable performance on various tasks, but existing evaluation benchmarks are often static and insufficient to fully assess their robustness and generalization in realistic scenarios. Prior work using evolutionary or adversarial data augmentation has improved evaluation diversity but lacks systematic control over perturbation types and multi-step complexity, limiting comprehensive robustness analysis. To address these gaps, we propose AutoEvoEval, an evolution-based evaluation framework for close-ended tasks such as multi-choice question answering. AutoEvoEval introduces 22 interpretable atomic evolution operations and supports multi-round compositions, enabling controlled generation of diverse, challenging, and realistic test samples. We conduct extensive experiments addressing four research questions on a broad set of open- and closed-source LLMs. Our results show that atomic operations cause an average accuracy drop of 7.283%, with structure-disrupting or misleading semantic edits causing the largest declines. Model sensitivities vary significantly for the same perturbation, and combining multiple evolution steps amplifies adversarial effects by up to 52.932%. These findings suggest current benchmarks may overestimate true model generalization and emphasize the need for evolution-aware robustness evaluation. Code and resources are available at: https://github.com/SYSUSELab/AutoEvoEval.

大型语言模型（LLMs）在各种任务中表现出卓越的性能，但现有的评估基准通常是静态的，不足以在真实场景中全面评估其鲁棒性和泛化能力。先前使用进化或对抗性数据增强的工作提高了评估的多样性，但缺乏对扰动类型和多步复杂性的系统控制，限制了全面的鲁棒性分析。为了解决这些空白，我们提出了AutoEvoEval，这是一个基于进化的评估框架，用于封闭任务，如多项选择题回答。AutoEvoEval引入了22个可解释的原子进化操作，并支持多轮组合，能够控制生成多样、具有挑战性和现实的测试样本。我们对开源和封闭源码的LLMs进行了广泛的实验，回答了四个研究问题。我们的结果表明，原子操作导致平均准确率下降7.283%，其中破坏结构或误导语义的编辑导致下降幅度最大。同一扰动对模型的敏感性存在显著差异，结合多个进化步骤会使对抗性效果放大高达52.932%。这些发现表明，当前基准测试可能会高估模型的真正泛化能力，并强调需要进行进化感知的鲁棒性评估。相关代码和资源可在https://github.com/SYSUSELab/AutoEvoEval获取。

论文及项目相关链接

PDF

Summary

大型语言模型（LLMs）在多种任务上表现出卓越性能，但现有评估基准测试往往是静态的且不足以全面评估其在现实场景中的稳健性和泛化能力。针对这一问题，本文提出基于进化的评估框架AutoEvoEval，适用于如多选问答等封闭任务。该框架引入22种可解释的原子进化操作，支持多轮组合，可控制生成多样化、具挑战性和现实的测试样本。实验结果显示，原子操作导致平均准确率下降7.283%，其中破坏结构或误导语义的编辑导致下降幅度最大。模型对同一扰动的敏感度存在显著差异，多步进化结合可放大对抗效果高达52.932%。这提示我们现有的基准测试可能高估了模型的真实泛化能力，强调需要进行基于进化的稳健性评估。

Key Takeaways

大型语言模型（LLMs）在多种任务上表现出卓越性能，但现有评估方法不足以全面评估其稳健性和泛化能力。
提出的AutoEvoEval框架是基于进化的评估方法，适用于封闭任务如多选问答。
AutoEvoEval通过引入22种原子进化操作，支持多轮组合，以控制生成多样化、具挑战性和现实的测试样本。
实验表明，原子操作导致模型平均准确率下降，其中某些操作（如结构破坏或语义误导）影响更大。
模型对同一扰动的敏感度有显著差异，且多步进化结合会显著放大对抗效果。
现有基准测试可能高估模型的真正泛化能力，需要更全面的评估方法。
AutoEvoEval框架的代码和资源已公开可用。

Cool Papers

点此查看论文截图

DABstep: Data Agent Benchmark for Multi-step Reasoning

Authors:Alex Egg, Martin Iglesias Goyanes, Friso Kingma, Andreu Mora, Leandro von Werra, Thomas Wolf

We introduce DABstep, a novel benchmark for evaluating AI agents on realistic multi-step data analysis tasks. DABstep comprises over 450 real-world challenges derived from a financial analytics platform, requiring models to combine code-based data processing with contextual reasoning over heterogeneous documentation. Each task demands an iterative, multi-step problem-solving approach, testing capabilities in data manipulation, cross-referencing multiple sources, and precise result reporting. The benchmark provides a factoid-style answer format with automatic correctness checks for objective scoring at scale. We evaluate leading LLM-based agents, revealing a substantial performance gap: even the best agent achieves only 14.55% accuracy on the hardest tasks. We detail our benchmark’s design, dataset composition, task formulation, evaluation protocol, report baseline results and analyze failure modes. DABstep is released with a public leaderboard and toolkit to accelerate research in autonomous data analysis.

我们介绍了DABstep，这是一个新的基准测试，用于评估AI代理在现实的多步骤数据分析任务上的表现。DABstep包含了来自金融分析平台的450多个现实挑战，要求模型结合代码数据处理和基于上下文的异质文档推理。每个任务都需要迭代、多步骤的问题解决方式，测试数据操控能力、跨源交叉引用能力和精确结果报告能力。该基准测试提供了一个事实型答案格式，带有自动正确性检查，可大规模进行客观评分。我们评估了基于大型语言模型的领先代理，结果显示存在显著的性能差距：即使在最难的任务上，最佳代理也仅达到14.55%的准确率。我们详细介绍了基准测试的设计、数据集组成、任务制定、评估协议，报告了基线结果并分析了失败模式。DABstep已发布公共排行榜和工具包，以加快自主数据分析的研究进程。

论文及项目相关链接

PDF 13 pages, 5 figures

Summary：推出DABstep基准测试，用于评估AI代理在真实的多步骤数据分析任务上的表现。DABstep包含450多个来自金融分析平台的真实挑战，要求模型结合代码数据处理和异质文档的上下文推理。该基准测试采用事实型答案格式，提供大规模客观评分，并评估了领先的大型语言模型代理，显示出现存模型的性能差距。

Key Takeaways：

DABstep是一个新的基准测试，用于评估AI在多步骤数据分析任务上的性能。
它包含450多个从金融分析平台衍生出的真实世界挑战。
DABstep要求模型结合代码数据处理和上下文推理，解决异质文档问题。
基准测试采用事实型答案格式，便于自动正确性检查和大规模客观评分。
评估了领先的大型语言模型代理，显示性能存在显著差距。
最优秀的代理在最具挑战性的任务上仅达到14.55%的准确率。

Cool Papers

点此查看论文截图

L0: Reinforcement Learning to Become General Agents

Authors:Junjie Zhang, Jingyi Xi, Zhuoyang Song, Junyu Lu, Yuhua Ke, Ting Sun, Yukun Yang, Jiaxing Zhang, Songxin Zhang, Zejian Xie

Training large language models (LLMs) to act as autonomous agents for multi-turn, long-horizon tasks remains significant challenges in scalability and training efficiency. To address this, we introduce L-Zero (L0), a scalable, end-to-end training pipeline for general-purpose agents. Featuring a low-cost, extensible, and sandboxed concurrent agent worker pool, L0 lowers the barrier for applying reinforcement learning in complex environments. We also introduce NB-Agent, the agent scaffold within L0, which operates in a “code-as-action” fashion via a Read-Eval-Print-Loop (REPL). We evaluate L0 on factuality question-answering benchmarks. Our experiments demonstrate that a base model can develop robust problem-solving skills using solely Reinforcement Learning with Verifiable Rewards (RLVR). On the Qwen2.5-7B-Instruct model, our method boosts accuracy on SimpleQA from 30 % to 80 % and on HotpotQA from 22 % to 41 %. We have open-sourced the entire L0 system, including our L0 series models, the NB-Agent, a complete training pipeline, and the corresponding training recipes on (https://github.com/cmriat/l0).

训练大型语言模型（LLM）以执行多轮、长期任务作为自主代理仍然存在可扩展性和训练效率方面的重大挑战。为了解决这个问题，我们引入了L-Zero（L0），这是一种用于通用代理的可扩展端到端训练管道。L0采用低成本、可扩展、沙箱化的并发代理工作者池，降低了在复杂环境中应用强化学习的门槛。我们还介绍了L0内的NB-Agent代理架构，它通过Read-Eval-Print-Loop（REPL）以“代码即行动”的方式运行。我们在事实性问题回答基准测试上评估了L0的性能。实验表明，基础模型仅使用带有可验证奖励的强化学习（RLVR）就能发展出稳健的问题解决能力。在Qwen2.5-7B-Instruct模型上，我们的方法将SimpleQA的准确率从30%提高到80%，HotpotQA的准确率从22%提高到41%。我们已公开了整个L0系统，包括我们的L0系列模型、NB-Agent、完整的训练管道和相应的训练配方（https://github.com/cmriat/l0）。

论文及项目相关链接

PDF

Summary

L-Zero（L0）是一个用于通用代理的可扩展端到端训练管道，解决了大型语言模型（LLM）在复杂环境中进行多轮长周期任务所面临的挑战。通过引入低成本的、可扩展的、沙箱化的并发代理工作者池，L0降低了在复杂环境中应用强化学习的门槛。实验评估表明，基于强化学习与可验证奖励（RLVR）的基础模型可以发展出强大的问题解决能力。在Qwen2.5-7B-Instruct模型上，L0方法将SimpleQA的准确率从30%提高到80%，HotpotQA的准确率从22%提高到41%。L0系统和相关模型已开源共享。

Key Takeaways

L-Zero（L0）是一个针对大型语言模型（LLM）的可扩展端到端训练管道。
L0通过引入并发代理工作者池来降低强化学习在复杂环境中的使用门槛。
L0通过强化学习与可验证奖励（RLVR）结合的方法使基础模型具备强大的问题解决能力。
在Qwen2.5-7B-Instruct模型上，L0显著提高了问答任务的准确率。
L-Zero（L0）系统和相关模型已在GitHub上开源共享，方便其他研究人员使用和改进。
L0系统的关键组成部分之一是NB-Agent代理骨架，它以“代码即行动”的方式工作，通过Read-Eval-Print-Loop（REPL）进行交互。

Cool Papers

点此查看论文截图

Authors:Huanjin Yao, Jiaxing Huang, Yawen Qiu, Michael K. Chen, Wenzheng Liu, Wei Zhang, Wenjie Zeng, Xikun Zhang, Jingyi Zhang, Yuxin Song, Wenhao Wu, Dacheng Tao

Reasoning plays a crucial role in advancing Multimodal Large Language Models (MLLMs) toward Artificial General Intelligence. However, existing MLLM benchmarks often fall short in precisely and comprehensively evaluating long-chain reasoning abilities from three key aspects: (1) lack of difficulty and diversity, (2) susceptibility to guessability and memorization, (3) inadequate assessment of intermediate reasoning steps. To fill this gap, we introduce MMReason, a new benchmark designed to precisely and comprehensively evaluate MLLM long-chain reasoning capability with diverse, open-ended, challenging questions. First, we curate challenging questions requiring multi-step reasoning from various fields (i.e., 6 disciplines) and multiple difficulty levels (i.e., from pre-university to university, and from foundational to competition tiers). Second, these questions are reformulated into an open-ended format and filtered using a multi-model voting technique to eliminate shortcut cases related to guessing and memorization, ensuring robust reasoning evaluations. Third, we annotate the questions with detailed step-by-step solutions, and design a reference-based ternary scoring mechanism to reliably assess intermediate reasoning steps. With MMReason, we benchmark popular leading MLLMs and provide an in-depth analysis of their reasoning capabilities. We hope MMReason will serve as a valuable resource for advancing MLLM reasoning research. Code will be available at https://github.com/HJYao00/MMReason.

推理在多模态大型语言模型（MLLMs）向通用人工智能迈进的过程中起着至关重要的作用。然而，现有的MLLM基准测试在精确全面地评估长期推理能力从三个方面往往有所不足：（1）缺乏难度和多样性；（2）容易猜测和记忆；（3）对中间推理步骤的评估不足。为了填补这一空白，我们推出了MMReason，这是一个新的基准测试，旨在精确全面地评估MLLM的长期推理能力，采用多样、开放、具有挑战性的问题。首先，我们从各个学科领域（即6个学科）收集需要多步骤推理的挑战性问题，并设置多个难度级别（即从大学预科到大学，从基础到竞赛级别）。其次，这些问题被重新制定为开放式问题，并使用多模型投票技术进行筛选，以消除与猜测和记忆相关的捷径案例，确保稳健的推理评估。第三，我们为问题提供了详细的逐步解决方案，并设计了一种基于参考的三元评分机制，以可靠地评估中间推理步骤。通过MMReason，我们对流行的大型语言模型进行了基准测试，并对它们的推理能力进行了深入分析。我们希望MMReason能成为推动MLLM推理研究的有价值的资源。代码将在https://github.com/HJYao00/MMReason上提供。

论文及项目相关链接

PDF Technical report

Summary
多模态大型语言模型（MLLMs）在推进人工智能通用智能（AGI）的过程中，推理发挥着关键作用。但现有MLLM基准测试在评估长期推理能力时存在缺陷，如缺乏难度和多样性、易于猜测和记忆、无法充分评估中间推理步骤等。为此，我们推出MMReason新基准测试，旨在精准全面地评估MLLM的长期推理能力，包含多样、开放、具有挑战性的问题。该测试从六个学科领域挑选具有挑战、需要多步骤推理的问题，并对其进行改革制定开放性格式的问题；运用多模型投票技术消除与猜测和记忆相关的捷径案例，确保推理评估的稳健性；为问题提供详细的分步解答，并采用基于参考的三元评分机制来可靠地评估中间推理步骤。MMReason为流行的领先MLLM提供了基准测试，并深入分析了它们的推理能力。

Key Takeaways

推理在多模态大型语言模型（MLLMs）中起关键作用，对于实现人工智能通用智能（AGI）至关重要。
现有MLLM基准测试在评估长期推理能力时存在缺陷，主要包括缺乏难度和多样性、易于猜测和记忆、对中间推理步骤评估不足。
MMReason新基准测试旨在精准全面地评估MLLM的长期推理能力，包括多样、开放、具有挑战性的问题。
MMReason从六个学科领域挑选问题，要求多步骤推理，并改革制定开放性格式的问题。
MMReason运用多模型投票技术减少猜测和记忆的影响，确保推理评估的稳健性。
MMReason为问题提供详细的分步解答，并采用基于参考的三元评分机制评估中间推理步骤。

Cool Papers

点此查看论文截图

Datasets for Fairness in Language Models: An In-Depth Survey

Authors:Jiale Zhang, Zichong Wang, Avash Palikhe, Zhipeng Yin, Wenbin Zhang

Fairness benchmarks play a central role in shaping how we evaluate language models, yet surprisingly little attention has been given to examining the datasets that these benchmarks rely on. This survey addresses that gap by presenting a broad and careful review of the most widely used fairness datasets in current language model research, characterizing them along several key dimensions including their origin, scope, content, and intended use to help researchers better appreciate the assumptions and limitations embedded in these resources. To support more meaningful comparisons and analyses, we introduce a unified evaluation framework that reveals consistent patterns of demographic disparities across datasets and scoring methods. Applying this framework to twenty four common benchmarks, we highlight the often overlooked biases that can influence conclusions about model fairness and offer practical guidance for selecting, combining, and interpreting these datasets. We also point to opportunities for creating new fairness benchmarks that reflect more diverse social contexts and encourage more thoughtful use of these tools going forward. All code, data, and detailed results are publicly available at https://github.com/vanbanTruong/Fairness-in-Large-Language-Models/tree/main/datasets to promote transparency and reproducibility across the research community.

公平性基准在塑造我们如何评估语言模型方面起着核心作用，然而令人惊讶的是，人们几乎没有注意到这些基准所依赖的数据集的检验。本文旨在弥补这一空白，对当前语言模型研究中使用最广泛的公平性数据集进行了广泛而细致的评述，从多个关键维度对这些数据集进行了特征描述，包括其来源、范围、内容和预期用途等，以帮助研究人员更好地了解这些资源中的假设和局限性。为了支持更有意义的比较和分析，我们引入了一个统一的评估框架，揭示了不同数据集和评分方法中的一贯的人口差异模式。在二十四项常见基准测试中应用这一框架，我们强调了可能影响模型公平性的结论中经常被忽视偏见问题，并为选择、组合和解释这些数据集提供了实用指导。我们还指出了创建反映更多样化社会背景的新公平性基准的机会，并鼓励在未来的研究中更谨慎地使用这些工具。所有代码、数据和详细结果都可在https://github.com/vanbanTruong/Fairness-in-Large-Language-Models/tree/main/datasets公开访问，以促进研究社区的透明度和可重复性。%EF%BC%}本链接提供了一个公开透明的环境以便于科研人员合作进步并持续探索与精进关于大型语言模型公平性的研究问题。

论文及项目相关链接

PDF

Summary

本文调查了当前语言模型研究中广泛使用的公平性数据集，通过介绍一个统一的评估框架来揭示数据集和评分方法中的持续人口统计差异模式，帮助研究人员更好地了解这些资源中的假设和局限性。文章强调了常被忽视的偏见对模型公平性的影响，并为选择、组合和解释这些数据集提供了实用指导。

Key Takeaways

公平性基准在评估语言模型时起到核心作用，但人们往往忽视其所依赖的数据集。
本文全面审查了当前广泛使用的公平性数据集，包括其来源、范围、内容和用途。
介绍了一个统一的评估框架，以揭示数据集和评分方法中的持续人口统计差异模式。
通过应用此框架于24个常见基准测试，突出了可能影响模型公平性的常被忽视的偏见。
提供了关于如何选择、组合和解释这些数据集的实用指导。
指出创建反映更多社会背景的新公平性基准的机会。

Cool Papers

点此查看论文截图

Teaching a Language Model to Speak the Language of Tools

Authors:Simeon Emanuilov

External tool integration through function-calling is essential for practical language model applications, yet most multilingual models lack reliable tool-use capabilities in non-English languages. Even state-of-the-art multilingual models struggle with determining when to use tools and generating the structured outputs required for function calls, often exhibiting language confusion when prompted in lower-resource languages. This work presents a methodology for adapting existing language models to enable robust tool use in any target language, using Bulgarian as a case study. The approach involves continued training of the BgGPT model series (2.6B, 9B, 27B parameters) on a novel bilingual dataset of 10,035 function-calling examples designed to support standardized protocols like MCP (Model Context Protocol). The research introduces TUCAN (Tool-Using Capable Assistant Navigator), which achieves up to 28.75% improvement in function-calling accuracy over base models while preserving core language understanding, as verified on established Bulgarian benchmarks. Beyond accuracy gains, TUCAN models demonstrate production-ready response formatting with clean, parsable function calls, contrasting with the verbose and inconsistent outputs of base models. The models, evaluation framework, and dataset are released to enable replication for other languages. This work demonstrates a practical approach for extending tool-augmented capabilities beyond English-centric systems.

通过函数调用进行外部工具集成对于实际语言模型应用至关重要，但大多数多语种模型在非英语语境下的工具使用能力并不可靠。即使是最先进的多语种模型也难以确定何时使用工具以及生成所需的函数调用结构输出，在非资源语言提示时经常出现语言混淆。本研究提出了一种适应现有语言模型的方法，使其能够在任何目标语言中稳健地使用工具，并以保加利亚语作为案例研究。该方法涉及在新型双语数据集上持续训练BgGPT模型系列（2.6B、9B、27B参数），该数据集包含10,035个函数调用示例，支持如MCP（模型上下文协议）等标准化协议。该研究引入了TUCAN（能够使用工具的助理导航器），与基础模型相比，它在函数调用准确性方面提高了高达28.75%，同时保留了核心语言理解功能，这在已建立的保加利亚基准测试中得到了验证。除了提高准确性外，TUCAN模型还展示了生产就绪的响应格式，具有干净、可解析的函数调用，与基础模型的冗长和输出不一致形成鲜明对比。为了在其他语言中实现复制，我们发布了模型、评估框架和数据集。这项工作展示了一种将工具增强功能扩展到英语中心系统以外的实用方法。

论文及项目相关链接

PDF

Summary：通过函数调用实现外部工具集成对于实际语言模型应用至关重要，但大多数多语言模型在非英语环境中的工具使用能力不可靠。本文提出一种适应现有语言模型的方法，使其能够在任何目标语言中稳健地使用工具，并以保加利亚语为案例进行研究。该方法涉及继续训练BgGPT模型系列（2.6B、9B、27B参数），在10,035个函数调用示例的新双语数据集上支持标准化协议（如MCP）。研究中引入了TUCAN（工具使用能力辅助导航器），与基础模型相比，其在函数调用准确性方面提高了28.75%，同时保留了核心语言理解能力，并在保加利亚语基准测试上得到了验证。此外，TUCAN模型还展示了清晰、可解析的函数调用响应格式，与基础模型的冗长和输出不一致形成鲜明对比。该模型、评估框架和数据集已发布，以支持在其他语言中的复制。这项工作展示了将工具增强功能扩展到英语以外系统的实用方法。

Key Takeaways：

外部工具集成对于语言模型在实际应用中的重要性。
多语言模型在非英语环境中的工具使用能力存在挑战。
本文提出了一种适应现有语言模型的方法，使其能在任何目标语言中稳健地使用工具。
方法涉及继续训练BgGPT模型系列，并在新双语数据集上进行支持标准化协议的训练。
TUCAN（工具使用能力辅助导航器）在函数调用准确性方面显著提高，同时保留核心语言理解能力。
TUCAN模型具有清晰、可解析的响应格式，与基础模型的输出形成对比。

Cool Papers

点此查看论文截图

GeoProg3D: Compositional Visual Reasoning for City-Scale 3D Language Fields

Authors:Shunsuke Yasuki, Taiki Miyanishi, Nakamasa Inoue, Shuhei Kurita, Koya Sakamoto, Daichi Azuma, Masato Taki, Yutaka Matsuo

The advancement of 3D language fields has enabled intuitive interactions with 3D scenes via natural language. However, existing approaches are typically limited to small-scale environments, lacking the scalability and compositional reasoning capabilities necessary for large, complex urban settings. To overcome these limitations, we propose GeoProg3D, a visual programming framework that enables natural language-driven interactions with city-scale high-fidelity 3D scenes. GeoProg3D consists of two key components: (i) a Geography-aware City-scale 3D Language Field (GCLF) that leverages a memory-efficient hierarchical 3D model to handle large-scale data, integrated with geographic information for efficiently filtering vast urban spaces using directional cues, distance measurements, elevation data, and landmark references; and (ii) Geographical Vision APIs (GV-APIs), specialized geographic vision tools such as area segmentation and object detection. Our framework employs large language models (LLMs) as reasoning engines to dynamically combine GV-APIs and operate GCLF, effectively supporting diverse geographic vision tasks. To assess performance in city-scale reasoning, we introduce GeoEval3D, a comprehensive benchmark dataset containing 952 query-answer pairs across five challenging tasks: grounding, spatial reasoning, comparison, counting, and measurement. Experiments demonstrate that GeoProg3D significantly outperforms existing 3D language fields and vision-language models across multiple tasks. To our knowledge, GeoProg3D is the first framework enabling compositional geographic reasoning in high-fidelity city-scale 3D environments via natural language. The code is available at https://snskysk.github.io/GeoProg3D/.

三维语言领域的进步已经能够通过自然语言与三维场景进行直观交互。然而，现有方法通常局限于小规模环境，缺乏在大规模、复杂的城市环境中所需的可扩展性和组合推理能力。为了克服这些局限性，我们提出了GeoProg3D，这是一个视觉编程框架，能够通过自然语言与城市规模的高保真三维场景进行交互。GeoProg3D由两个关键组件构成：（i）地理感知城市规模三维语言领域（GCLF），它利用高效的分层三维模型处理大规模数据，并结合地理信息，通过方向线索、距离测量、海拔数据和地标参考有效地过滤广阔的城市空间；（ii）地理视觉API（GV-APIs），包括区域分割和对象检测等专业的地理视觉工具。我们的框架采用大型语言模型（LLMs）作为推理引擎，动态结合GV-APIs并操作GCLF，有效支持各种地理视觉任务。为了评估在城市规模推理中的性能，我们引入了GeoEval3D，这是一个包含952个查询答案对的综合基准数据集，涵盖五个具有挑战性的任务：接地、空间推理、比较、计数和测量。实验表明，GeoProg3D在多个任务上显著优于现有的三维语言领域和视觉语言模型。据我们所知，GeoProg3D是第一个能够通过自然语言在高保真城市规模三维环境中实现组合地理推理的框架。相关代码可在https://snskysk.github.io/GeoProg3D/找到。

论文及项目相关链接

PDF Accepted by ICCV 2025

Summary

本文介绍了GeoProg3D框架，该框架通过自然语言驱动与城市规模的高保真3D场景进行交互。它包含两个关键组件：地理感知城市规模三维语言场（GCLF）和地理视觉API（GV-APIs）。框架使用大型语言模型作为推理引擎来结合GV-APIs并操作GCLF，支持多样化的地理视觉任务。框架引入GeoEval3D数据集进行评估，并显示出显著的优越性能。这是首个支持城市规模三维环境中的自然语言的地理推理框架。

Key Takeaways

GeoProg3D是一个视觉编程框架，支持自然语言与城市规模的高保真3D场景交互。
它包含地理感知城市规模三维语言场（GCLF）和地理视觉API（GV-APIs）两个关键组件。
GCLF利用高效的三维模型处理大规模数据，并结合地理信息实现大规模空间的过滤。
GV-APIs提供地理视觉工具，如区域分割和对象检测。
使用大型语言模型（LLMs）作为推理引擎，实现多样化的地理视觉任务支持。
引入GeoEval3D数据集评估性能，涵盖五个具有挑战性的任务。
GeoProg3D在多个任务上显著优于现有的三维语言场和视觉语言模型。

Cool Papers

点此查看论文截图

Corrupted by Reasoning: Reasoning Language Models Become Free-Riders in Public Goods Games

Authors:David Guzman Piedrahita, Yongjin Yang, Mrinmaya Sachan, Giorgia Ramponi, Bernhard Schölkopf, Zhijing Jin

As large language models (LLMs) are increasingly deployed as autonomous agents, understanding their cooperation and social mechanisms is becoming increasingly important. In particular, how LLMs balance self-interest and collective well-being is a critical challenge for ensuring alignment, robustness, and safe deployment. In this paper, we examine the challenge of costly sanctioning in multi-agent LLM systems, where an agent must decide whether to invest its own resources to incentivize cooperation or penalize defection. To study this, we adapt a public goods game with institutional choice from behavioral economics, allowing us to observe how different LLMs navigate social dilemmas over repeated interactions. Our analysis reveals four distinct behavioral patterns among models: some consistently establish and sustain high levels of cooperation, others fluctuate between engagement and disengagement, some gradually decline in cooperative behavior over time, and others rigidly follow fixed strategies regardless of outcomes. Surprisingly, we find that reasoning LLMs, such as the o1 series, struggle significantly with cooperation, whereas some traditional LLMs consistently achieve high levels of cooperation. These findings suggest that the current approach to improving LLMs, which focuses on enhancing their reasoning capabilities, does not necessarily lead to cooperation, providing valuable insights for deploying LLM agents in environments that require sustained collaboration. Our code is available at https://github.com/davidguzmanp/SanctSim

随着大型语言模型（LLM）越来越多地被部署为自主代理，理解它们的合作和社会机制变得越来越重要。特别是，LLM如何平衡自我利益和集体福祉是确保其对齐、稳健和安全部署的关键挑战。在本文中，我们研究了多代理LLM系统中成本高昂的制裁挑战，其中代理必须决定是投入自己的资源来激励合作还是惩罚违规行为。为了研究这个，我们从行为经济学中采用了公共物品游戏与制度选择的方法，让我们可以观察到不同的LLM如何重复互动并解决社会困境。我们的分析揭示了四种不同的行为模式：一些模型始终建立和维持高水平的合作，另一些则在参与和不参与之间波动，还有一些随着时间的推移逐渐在合作行为上减弱，而其他模型则僵化地遵循固定策略而不考虑结果。令人惊讶的是，我们发现像O1系列这样的推理LLM在合作方面遇到了重大困难，而一些传统LLM则始终实现了高水平的合作。这些发现表明，当前改进LLM的方法——主要集中在提高其推理能力上——并不一定导致合作，这为在需要持续协作的环境中部署LLM代理提供了有价值的见解。我们的代码可在https://github.com/davidguzmanp/SanctSim找到。

论文及项目相关链接

PDF

Summary

大型语言模型（LLM）作为自主代理的部署日益普遍，理解其合作和社会机制变得至关重要。本文研究了多代理LLM系统中成本高昂的制裁挑战，其中代理必须决定是投入自身资源以激励合作还是惩罚背叛。通过适应行为经济学的公共物品游戏与制度选择，我们观察到不同LLM在重复互动中如何解决社会困境。分析揭示了四种不同的行为模式，并意外发现注重推理的LLM，如o1系列，在合作方面存在明显困难，而一些传统LLM则始终维持高水平合作。这为在需要持续协作的环境中部署LLM代理提供了宝贵见解。

Key Takeaways

理解大型语言模型（LLM）的合作和社会机制在自主代理部署中的重要性。
在多代理LLM系统中，代理面临成本高昂的制裁挑战时需要平衡自我利益和集体福祉。
通过公共物品游戏与制度选择的研究方法，观察到LLM解决社会困境的四种不同行为模式。
推理型LLM（如o1系列）在合作方面存在困难。
传统LLM能持续维持高水平合作。
改进LLM的推理能力并不一定会提高其合作能力。

Cool Papers

点此查看论文截图

Why Settle for One? Text-to-ImageSet Generation and Evaluation

Authors:Chengyou Jia, Xin Shen, Zhuohang Dang, Zhuohang Dang, Changliang Xia, Weijia Wu, Xinyu Zhang, Hangwei Qian, Ivor W. Tsang, Minnan Luo

Despite remarkable progress in Text-to-Image models, many real-world applications require generating coherent image sets with diverse consistency requirements. Existing consistent methods often focus on a specific domain with specific aspects of consistency, which significantly constrains their generalizability to broader applications. In this paper, we propose a more challenging problem, Text-to-ImageSet (T2IS) generation, which aims to generate sets of images that meet various consistency requirements based on user instructions. To systematically study this problem, we first introduce $\textbf{T2IS-Bench}$ with 596 diverse instructions across 26 subcategories, providing comprehensive coverage for T2IS generation. Building on this, we propose $\textbf{T2IS-Eval}$, an evaluation framework that transforms user instructions into multifaceted assessment criteria and employs effective evaluators to adaptively assess consistency fulfillment between criteria and generated sets. Subsequently, we propose $\textbf{AutoT2IS}$, a training-free framework that maximally leverages pretrained Diffusion Transformers’ in-context capabilities to harmonize visual elements to satisfy both image-level prompt alignment and set-level visual consistency. Extensive experiments on T2IS-Bench reveal that diverse consistency challenges all existing methods, while our AutoT2IS significantly outperforms current generalized and even specialized approaches. Our method also demonstrates the ability to enable numerous underexplored real-world applications, confirming its substantial practical value. Visit our project in https://chengyou-jia.github.io/T2IS-Home.

尽管文本到图像（Text-to-Image）模型取得了显著的进步，但在许多现实世界的应用中，需要生成具有不同一致性要求的连贯图像集。现有的一致性方法通常专注于具有特定一致性的特定领域，这显著限制了它们在更广泛应用中的泛化能力。在本文中，我们提出了一个更具挑战性的问题，即文本到图像集（Text-to-ImageSet，T2IS）生成。它的目标是基于用户指令生成满足各种一致性要求的图像集。为了系统地研究这个问题，我们首先引入了T2IS-Bench，它包含26个子类别中的596条多样化指令，为T2IS生成提供了全面的覆盖。在此基础上，我们提出了T2IS-Eval评估框架，它将用户指令转化为多方面的评估标准，并采用有效的评估器来适应性地评估生成集与标准之间的一致性。随后，我们提出了无需训练的AutoT2IS框架，它最大限度地利用预训练的扩散变压器的上下文能力，协调视觉元素，以满足图像级别的提示对齐和集合级别的视觉一致性。在T2IS-Bench上的广泛实验表明，多样化的一致性挑战了所有现有方法，而我们的AutoT2IS显著优于当前的通用甚至专用方法。我们的方法还证明了在无数尚未探索的现实世界应用中的实用性，证实了其巨大的实际价值。请访问我们的项目网站：https://chengyou-jia.github.io/T2IS-Home。

论文及项目相关链接

PDF

Summary

本文提出一个新的挑战性问题——文本到图像集生成（T2IS），旨在根据用户指令生成满足各种一致性要求的图像集。为系统研究这一问题，介绍T2IS-Bench，包含596条跨越26个子类别的指令，为T2IS生成提供全面覆盖。在此基础上，提出T2IS-Eval评估框架，将用户指令转化为多元评估标准，并有效利用评估器自适应评估生成图像集的一致性满足程度。最后，提出无需训练的AutoT2IS框架，最大化利用预训练的Diffusion Transformers的上下文能力，协调视觉元素，满足图像级提示对齐和集级视觉一致性。实验表明，AutoT2IS在T2IS-Bench上显著优于现有通用及专用方法，并展示了在多个未探索的实际应用中的实用性。

Key Takeaways

文本到图像集生成（T2IS）是一个新的挑战性问题，旨在根据用户指令生成满足多种一致性要求的图像集。
T2IS-Bench包含596条指令，覆盖26个子类别，为T2IS生成提供全面覆盖。
T2IS-Eval框架能够将用户指令转化为多元评估标准，并自适应评估生成图像集的一致性。
AutoT2IS框架利用预训练的Diffusion Transformers的上下文能力，满足图像级和集级的一致性要求。
实验表明，AutoT2IS在T2IS-Bench上表现优异，优于现有方法。
AutoT2IS具有实现多个未探索的实际应用的潜力。

Cool Papers

点此查看论文截图

Authors:Jie Feng, Shengyuan Wang, Tianhui Liu, Yanxin Xi, Yong Li

Urban research involves a wide range of scenarios and tasks that require the understanding of multi-modal data. Current methods often focus on specific data types and lack a unified framework in urban field for processing them comprehensively. The recent success of multi-modal large language models (MLLMs) presents a promising opportunity to overcome this limitation. In this paper, we introduce $\textit{UrbanLLaVA}$, a multi-modal large language model designed to process these four types of data simultaneously and achieve strong performance across diverse urban tasks compared with general MLLMs. In $\textit{UrbanLLaVA}$, we first curate a diverse urban instruction dataset encompassing both single-modal and cross-modal urban data, spanning from location view to global view of urban environment. Additionally, we propose a multi-stage training framework that decouples spatial reasoning enhancement from domain knowledge learning, thereby improving the compatibility and downstream performance of $\textit{UrbanLLaVA}$ across diverse urban tasks. Finally, we also extend existing benchmark for urban research to assess the performance of MLLMs across a wide range of urban tasks. Experimental results from three cities demonstrate that $\textit{UrbanLLaVA}$ outperforms open-source and proprietary MLLMs in both single-modal tasks and complex cross-modal tasks and shows robust generalization abilities across cities. Source codes and data are openly accessible to the research community via https://github.com/tsinghua-fib-lab/UrbanLLaVA.

城市研究涉及多种场景和任务，需要理解多模态数据。当前的方法往往侧重于特定的数据类型，缺乏城市领域的统一框架来进行综合处理。多模态大型语言模型（MLLMs）的最新成功为解决这一限制提供了有前景的机会。在本文中，我们介绍了UrbanLLaVA，这是一个多模态大型语言模型，旨在同时处理这四种类型的数据，并在多种城市任务中相对于一般的MLLMs表现出强大的性能。在UrbanLLaVA中，我们首先创建了一个包含单模态和跨模态城市数据的多样化城市指令数据集，涵盖了从局部到全局的城市环境视角。此外，我们提出了一个分阶段训练框架，将空间推理增强与领域知识学习解耦，从而提高了UrbanLLaVA在多种城市任务中的兼容性和下游性能。最后，我们还扩展了城市研究的现有基准测试，以评估MLLMs在广泛的城市任务中的性能。来自三个城市的实验结果证明，UrbanLLaVA在单模态任务和复杂的跨模态任务中都优于开源和专有MLLMs，并显示出跨城市的稳健泛化能力。相关源代码和数据可通过https://github.com/tsinghua-fib-lab/UrbanLLaVA对研究社区公开访问。

论文及项目相关链接

PDF Accepted by ICCV 2025

Summary
城市研究涉及多种场景和任务，需要理解多模态数据。现有的方法往往关注特定数据类型，缺乏统一框架进行综合处理。近期多模态大语言模型（MLLMs）的成功为解决这一局限提供了机会。本文介绍了一个针对四种数据类型设计的多模态大语言模型UrbanLLaVA，相较于一般MLLMs，在多种城市任务中表现出卓越性能。UrbanLLaVA构建了一个包含单模态和跨模态城市数据的多样化城市指令数据集，并提出了一个分阶段训练框架，以提高其在多种城市任务中的兼容性和性能。此外，还扩展了现有城市研究基准测试，以评估MLLMs在广泛的城市任务中的表现。实验结果显示，UrbanLLaVA在单模态任务和复杂的跨模态任务中均超越了开源和专有MLLMs，并展现出跨城市的稳健泛化能力。

Key Takeaways

Urban research需要处理多模态数据，但现有方法缺乏综合处理这些数据的统一框架。
多模态大语言模型（MLLMs）的成功为城市研究领域带来了新的机遇。
UrbanLLaVA是一个设计用于处理四种类型数据的多模态大语言模型，相较于一般MLLMs在多种城市任务中表现优越。
UrbanLLaVA构建了包含单模态和跨模态数据的多样化城市指令数据集。
UrbanLLaVA采用分阶段训练框架，以提高其在不同城市任务中的兼容性和性能。
实验结果显示，UrbanLLaVA在多种任务中超越了其他MLLMs，展现出良好的泛化能力。

Cool Papers

点此查看论文截图

Are Large Language Models Capable of Deep Relational Reasoning? Insights from DeepSeek-R1 and Benchmark Comparisons

Authors:Chi Chiu So, Yueyue Sun, Jun-Min Wang, Siu Pang Yung, Anthony Wai Keung Loh, Chun Pong Chau

How far are Large Language Models (LLMs) in performing deep relational reasoning? In this paper, we evaluate and compare the reasoning capabilities of three cutting-edge LLMs, namely, DeepSeek-R1, DeepSeek-V3 and GPT-4o, through a suite of carefully designed benchmark tasks in family tree and general graph reasoning. Our experiments reveal that DeepSeek-R1 consistently achieves the highest F1-scores across multiple tasks and problem sizes, demonstrating strong aptitude in logical deduction and relational inference. However, all evaluated models, including DeepSeek-R1, struggle significantly as problem complexity increases, largely due to token length limitations and incomplete output structures. A detailed analysis of DeepSeek-R1’s long Chain-of-Thought responses uncovers its unique planning and verification strategies, but also highlights instances of incoherent or incomplete reasoning, calling attention to the need for deeper scrutiny into LLMs’ internal inference dynamics. We further discuss key directions for future work, including the role of multimodal reasoning and the systematic examination of reasoning failures. Our findings provide both empirical insights and theoretical implications for advancing LLMs’ reasoning abilities, particularly in tasks that demand structured, multi-step logical inference. Our code repository will be publicly available at https://github.com/kelvinhkcs/Deep-Relational-Reasoning.

关于大型语言模型（LLM）在深度关系推理方面的能力有多强？在这篇论文中，我们通过家庭树和通用图推理的精心设计的基准任务，评估并比较了三个尖端LLM（即DeepSeek-R1、DeepSeek-V3和GPT-4o）的推理能力。我们的实验表明，DeepSeek-R1在多任务和不同问题规模上始终获得最高的F1分数，表现出强大的逻辑演绎和关系推断能力。然而，所有评估的模型，包括DeepSeek-R1，随着问题复杂性的增加，都面临着显著的挑战，这主要是由于令牌长度限制和输出结构不完整所导致的。对DeepSeek-R1的长篇思维链反应的详细分析揭示了其独特的规划和验证策略，但也突出了不连贯或不完全推理的实例，这提醒我们需要对LLM的内部推理动态进行更深入的研究。我们还进一步讨论了未来的关键研究方向，包括多模式推理的作用和推理失败的系统检查。我们的发现为推进LLM的推理能力提供了实证见解和理论启示，特别是在需要结构化、多步骤逻辑推断的任务中。我们的代码仓库将在https://github.com/kelvinhkcs/Deep-Relational-Reasoning上公开可用。

论文及项目相关链接

PDF 10 pages, 0 figures, accepted by 2025 IEEE international conference on artificial intelligence testing (AITest)

Summary

新一代的大型语言模型（LLM）在处理深度关系推理时的能力如何？文章对三款尖端LLM模型（DeepSeek-R1、DeepSeek-V3和GPT-4o）进行了评估与比较。实验表明，DeepSeek-R1在多任务与不同问题规模中均取得最高F1分数，展现了强大的逻辑演绎与关系推理能力。然而，随着问题复杂度的提升，所有评估模型均面临显著挑战，主要原因是令牌长度限制与输出结构不完整。对DeepSeek-R1的长篇推理回应的深入分析揭示了其独特的规划与验证策略，但也暴露出推理不一致或不完整的情况，这引起了人们对LLM内部推理动态的更深入审查的需要。文章还讨论了未来的关键研究方向，包括多模态推理和推理失败的系统审查。本文的发现为推进LLM的推理能力提供了实证见解和理论启示，特别是在需要结构化、多步骤逻辑推理的任务中。

Key Takeaways

评估了三种顶尖的大型语言模型（LLM）在深度关系推理方面的能力。
DeepSeek-R1在多个任务中表现出最佳的推理能力，特别是在逻辑演绎和关系推理方面。
所有评估的模型在问题复杂度增加时都面临挑战，这主要是由于令牌长度限制和输出结构的不完整性。
DeepSeek-R1的推理回应分析揭示了其独特的规划和验证策略，但也暴露出推理过程中的不一致性和不完整性。
文章强调了深入审查LLM内部推理动态的必要性。
文章讨论了未来的研究方向，包括多模态推理和推理失败的系统审查。

Cool Papers

点此查看论文截图

Text2VectorSQL: Bridging Text-to-SQL and Vector Search for Unified Natural Language Queries

Authors:Zhengren Wang, Bozhou Li, Dongwen Yao, Wentao Zhang

While Text-to-SQL enables natural language interaction with structured databases, its effectiveness diminishes with unstructured data or ambiguous queries due to rigid syntax and limited expressiveness. Concurrently, vector search has emerged as a powerful paradigm for semantic retrieval, particularly for unstructured data. However, existing VectorSQL implementations still rely heavily on manual crafting and lack tailored evaluation frameworks, leaving a significant gap between theoretical potential and practical deployment. To bridge these complementary paradigms, we introduces Text2VectorSQL, a novel framework unifying Text-to-SQL and vector search to overcome expressiveness constraints and support more diverse and holistical natural language queries. Specifically, Text2VectorSQL enables semantic filtering, multi-modal matching, and retrieval acceleration. For evaluation, we build vector index on appropriate columns, extend user queries with semantic search, and annotate ground truths via an automatic pipeline with expert review. Furthermore, we develop dedicated Text2VectorSQL models with synthetic data, demonstrating significant performance improvements over baseline methods. Our work establishes the foundation for the Text2VectorSQL task, paving the way for more versatile and intuitive database interfaces. The repository will be publicly available at https://github.com/Open-DataFlow/Text2VectorSQL.

虽然文本到SQL（Text-to-SQL）技术能够实现与结构化数据库的自然语言交互，但由于其语法过于僵化、表达能力有限，在处理非结构化数据或模糊查询时，其效果会大打折扣。同时，向量搜索作为一种强大的语义检索范式，特别是在处理非结构化数据时，表现出了巨大的潜力。然而，现有的VectorSQL实现仍然严重依赖于手动构建，并且缺乏针对性的评估框架，导致理论潜力与实际部署之间存在巨大差距。为了弥合这两种范式的鸿沟，我们引入了Text2VectorSQL这一新型框架，它融合了文本到SQL和向量搜索技术，克服了表达能力的限制，支持更多样化、更全面的自然语言查询。具体来说，Text2VectorSQL能够实现语义过滤、多模态匹配和检索加速。在评估方面，我们在适当的列上构建向量索引，通过自动管道扩展用户查询并进行语义搜索，并通过专家评审进行真实标注。此外，我们还使用合成数据开发了专用的Text2VectorSQL模型，在基准方法上实现了显著的性能改进。我们的工作为Text2VectorSQL任务奠定了基础，为更通用、更直观的数据库接口铺平了道路。该仓库将在https://github.com/Open-DataFlow/Text2VectorSQL上公开可用。

论文及项目相关链接

PDF Work in progess

Summary

文本介绍了Text-to-SQL在处理结构化数据库时的局限性，尤其是在处理非结构化数据或模糊查询时的效果减弱。为此，引入了一种新的框架Text2VectorSQL，它将Text-to-SQL与向量搜索相结合，提高了表达性，支持更多样化和全面的自然语言查询。该框架实现了语义过滤、多模式匹配和检索加速等功能。同时，还介绍了对评估方法的改进，包括建立向量索引、扩展用户查询语义搜索和自动管道化标注真实情况等内容。Text2VectorSQL的推出为建立更通用和直观数据库接口奠定了基础。

Key Takeaways

Text-to-SQL在处理非结构化数据或模糊查询时存在局限性。
Text2VectorSQL框架结合了Text-to-SQL和向量搜索，提高了表达性和对多样化自然语言查询的支持。
Text2VectorSQL实现了语义过滤、多模式匹配和检索加速功能。
该框架引入了改进的评估方法，包括建立向量索引、扩展用户查询的语义搜索和自动管道化标注真实情况等。
Text2VectorSQL框架的建立为数据库接口的发展奠定了基础，使其更加通用和直观。
Text2VectorSQL模型通过合成数据进行了开发，相较于基准方法，性能得到了显著提升。

Cool Papers

点此查看论文截图

Boosting LLM’s Molecular Structure Elucidation with Knowledge Enhanced Tree Search Reasoning

Authors:Xiang Zhuang, Bin Wu, Jiyu Cui, Kehua Feng, Xiaotong Li, Huabin Xing, Keyan Ding, Qiang Zhang, Huajun Chen

Molecular structure elucidation involves deducing a molecule’s structure from various types of spectral data, which is crucial in chemical experimental analysis. While large language models (LLMs) have shown remarkable proficiency in analyzing and reasoning through complex tasks, they still encounter substantial challenges in molecular structure elucidation. We identify that these challenges largely stem from LLMs’ limited grasp of specialized chemical knowledge. In this work, we introduce a Knowledge-enhanced reasoning framework for Molecular Structure Elucidation (K-MSE), leveraging Monte Carlo Tree Search for test-time scaling as a plugin. Specifically, we construct an external molecular substructure knowledge base to extend the LLMs’ coverage of the chemical structure space. Furthermore, we design a specialized molecule-spectrum scorer to act as a reward model for the reasoning process, addressing the issue of inaccurate solution evaluation in LLMs. Experimental results show that our approach significantly boosts performance, particularly gaining more than 20% improvement on both GPT-4o-mini and GPT-4o. Our code is available at https://github.com/HICAI-ZJU/K-MSE.

分子结构阐释是从各种类型的光谱数据中推断分子的结构，这在化学实验分析中至关重要。虽然大型语言模型（LLMs）在分析推理复杂任务方面表现出卓越的能力，但在分子结构阐释方面仍面临巨大挑战。我们发现这些挑战主要源于LLMs对专业化学知识的有限掌握。在这项工作中，我们引入了知识增强分子结构阐释推理框架（K-MSE），利用蒙特卡罗树搜索作为插件进行实时测试扩展。具体来说，我们构建了一个外部分子子结构知识库，以扩展LLMs对化学结构空间的覆盖范围。此外，我们还设计了一个专门的分子光谱评分器，作为推理过程的奖励模型，解决LLMs中解决方案评估不准确的问题。实验结果表明，我们的方法显著提高了性能，特别是在GPT-4o-mini和GPT-4o上都取得了超过20%的改进。我们的代码可在https://github.com/HICAI-ZJU/K-MSE找到。

论文及项目相关链接

PDF ACL 2025 Main

Summary

分子结构解析是从各种光谱数据中推断分子结构的过程，对化学实验分析至关重要。大型语言模型（LLMs）在分析推理复杂任务方面表现出卓越的能力，但在分子结构解析方面仍面临巨大挑战，主要源于其对专业化学知识的有限掌握。本研究提出一种知识增强分子结构解析推理框架（K-MSE），利用蒙特卡洛树搜索作为测试时的可扩展插件。通过构建外部分子子结构知识库，扩展LLMs对化学结构空间的覆盖。同时，设计专门的分子光谱评分器，作为推理过程的奖励模型，解决LLMs中解决方案评价不准确的问题。实验结果表明，该方法显著提高性能，特别是在GPT-4o-mini和GPT-4o上提升超过20%。相关代码可访问https://github.com/HICAI-ZJU/K-MSE。

Key Takeaways

分子结构解析是化学实验中重要的分析过程，涉及从光谱数据中推断分子结构。
大型语言模型（LLMs）在复杂任务分析推理方面具有显著能力，但在分子结构解析领域面临挑战。
挑战主要源于LLMs对专业化学知识的有限掌握，需要知识增强来提升其在该领域的性能。
引入K-MSE知识增强推理框架，利用蒙特卡洛树搜索进行测试时扩展。
通过构建外部分子子结构知识库，扩展LLMs对化学结构空间的覆盖，提高解析准确性。
设计专门的分子光谱评分器，解决LLMs中解决方案评价不准确的问题。

Cool Papers

点此查看论文截图

Towards Explainable Bilingual Multimodal Misinformation Detection and Localization

Authors:Yiwei He, Xiangtai Li, Zhenglin Huang, Yi Dong, Hao Fei, Jiangning Zhang, Baoyuan Wu, Guangliang Cheng

The increasing realism of multimodal content has made misinformation more subtle and harder to detect, especially in news media where images are frequently paired with bilingual (e.g., Chinese-English) subtitles. Such content often includes localized image edits and cross-lingual inconsistencies that jointly distort meaning while remaining superficially plausible. We introduce BiMi, a bilingual multimodal framework that jointly performs region-level localization, cross-modal and cross-lingual consistency detection, and natural language explanation for misinformation analysis. To support generalization, BiMi integrates an online retrieval module that supplements model reasoning with up-to-date external context. We further release BiMiBench, a large-scale and comprehensive benchmark constructed by systematically editing real news images and subtitles, comprising 104,000 samples with realistic manipulations across visual and linguistic modalities. To enhance interpretability, we apply Group Relative Policy Optimization (GRPO) to improve explanation quality, marking the first use of GRPO in this domain. Extensive experiments demonstrate that BiMi outperforms strong baselines by up to +8.9 in classification accuracy, +15.9 in localization accuracy, and +2.5 in explanation BERTScore, advancing state-of-the-art performance in realistic, multilingual misinformation detection. Code, models, and datasets will be released.

随着多媒体内容现实感的增强，错误信息变得更加微妙和难以检测，特别是在新闻媒体中，图像经常配有双语（如中英文）字幕。此类内容通常包括本地化图像编辑和跨语言不一致之处，它们共同扭曲了意义，同时在表面上似乎合理。我们推出了BiMi，这是一个双语多媒体框架，它联合执行区域级定位、跨模态和跨语言一致性检测，以及错误信息分析的自然语言解释。为了支持通用性，BiMi集成了一个在线检索模块，该模块用最新的外部上下文补充模型推理。我们进一步发布了BiMiBench，这是一个通过系统编辑真实新闻图像和字幕构建的大规模综合基准测试，包含104000个样本，在视觉和语言模式方面进行了现实操作。为了提高可解释性，我们应用了群体相对策略优化（GRPO）来提高解释质量，标志着GRPO首次在这个领域的使用。大量实验表明，BiMi在分类精度上比强基线高出+8.9，定位精度高出+15.9，解释BERTScore高出+2.5，在真实、多语言错误信息检测方面达到了最新技术性能。代码、模型和数据集将予以发布。

论文及项目相关链接

PDF

Summary
多模态内容的现实性增强使得虚假信息更为微妙且难以检测，尤其是在配搭双语字幕的新闻中。针对这种情况，提出Bilingual Multimodal（BiMi）框架，用于地区级别定位、跨模态与跨语言一致性检测以及自然语言解释的分析。框架集成了在线检索模块，以支持普及并提供最新外部背景信息。同时发布了大型综合基准测试BiMiBench，通过系统编辑真实新闻图像和字幕生成包含视觉和语言模态真实操作的样本。为提高解释质量，首次采用Group Relative Policy Optimization（GRPO）。实验证明BiMi在分类准确性上超越强大基线达+8.9，定位准确性达+15.9，解释BERTScore达+2.5，为真实的多语言虚假信息检测带来最新技术进展。模型和数据集将会公开。

Key Takeaways

以下是基于文本的七个关键洞察点：

多模态内容的增加使虚假信息更难以检测。尤其是双语字幕新闻中，虚假信息更为微妙。
BiMi框架能进行地区级别定位、跨模态和跨语言一致性检测以及自然语言解释的分析。
BiMi框架集成了在线检索模块以支持普及并提供最新外部背景信息。
BiMiBench是一个大型综合基准测试，包含真实新闻图像和字幕的系统编辑样本。
采用Group Relative Policy Optimization（GRPO）提高解释质量。
BiMi在分类准确性、定位准确性和解释BERTScore方面表现优于其他方法。

Cool Papers

点此查看论文截图

Improving Rationality in the Reasoning Process of Language Models through Self-playing Game

Authors:Pinzheng Wang, Juntao Li, Zecheng Tang, Haijia Gui, Min zhang

Large language models (LLMs) have demonstrated considerable reasoning abilities in various tasks such as mathematics and coding. However, recent studies indicate that even the best models lack true comprehension of their reasoning processes. In this paper, we explore how self-play can enhance the rationality of models in the reasoning process without supervision from humans or superior models. We design a Critic-Discernment Game(CDG) in which a prover first provides a solution to a given problem and is subsequently challenged by critiques of its solution. These critiques either aim to assist or mislead the prover. The objective of the prover is to maintain the correct answer when faced with misleading comments, while correcting errors in response to constructive feedback. Our experiments on tasks involving mathematical reasoning, stepwise error detection, self-correction, and long-chain reasoning demonstrate that CDG training can significantly improve the ability of well-aligned LLMs to comprehend their reasoning process.

大型语言模型（LLM）在各种任务中表现出了相当出色的推理能力，如数学和编程。然而，最近的研究表明，即使是最好的模型也缺乏对推理过程的真正理解。在本文中，我们探讨了无需人类或更高级模型的监督，自我游戏如何增强模型在推理过程中的合理性。我们设计了一种批判鉴别游戏（CDG），其中证明者首先提供给定问题的解决方案，随后受到批判者的挑战。这些批判旨在帮助或误导证明者。证明者的目标是面对误导性评论时保持正确答案，同时根据建设性反馈纠正错误。我们在涉及数学推理、逐步错误检测、自我修正和长链推理的任务上的实验表明，CDG训练可以显著提高良好对齐的LLM理解其推理过程的能力。

论文及项目相关链接

PDF Accepted by ICML 2025

Summary

大型语言模型（LLM）在各类任务中展现出相当的推理能力，如数学和编程。但最新研究显示，顶尖模型仍缺乏对自身推理过程真正理解的能力。本文探索了无需人类或更高级模型监督的情况下，如何通过自我博弈增强模型的推理理性。我们设计了一种批判鉴别游戏（CDG），其中证明者首先提供问题的解决方案，随后接受批判者的挑战。这些批判旨在协助或误导证明者。证明者的目标是面对误导性评论时保持正确回答，并在建设性反馈中纠正错误。我们在涉及数学推理、逐步错误检测、自我修正和长链推理的任务上的实验表明，CDG训练能显著提高具有良好对齐性的LLM对自身推理过程的理解能力。

Key Takeaways

大型语言模型（LLM）已在多种任务中展现出强大的推理能力。
现有研究指出，LLM缺乏对自身推理过程的真正理解。
批判鉴别游戏（CDG）被设计用于增强模型的推理理性。
在CDG中，证明者需面对批判者的挑战，旨在提高模型在面对误导时的判断力和对正确解答的坚持。
CDG训练显著提高了LLM在多方面的能力，如数学推理、错误检测与自我修正、长链推理等。
CDG训练能帮助LLM更好地理解其推理过程。

Cool Papers

点此查看论文截图

Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder

Authors:Dang Jisheng, Wu Xudong, Wang Bimei, Lv Ning, Chen Jiayu, Jingwen Zhao, Yichu liu, Jizhao Liu, Juncheng Li, Teng Wang

Existing video segmenter and grounder approaches, exemplified by Sa2VA, directly fuse features within segmentation models. This often results in an undesirable entanglement of dynamic visual information and static semantics, thereby degrading segmentation accuracy. To systematically mitigate this issue, we propose DeSa2VA, a decoupling-enhanced prompting scheme integrating text pre-training and a linear decoupling module to address the information processing limitations inherent in SAM-2. Specifically, first, we devise a pre-training paradigm that converts textual ground-truth labels into point-level prompts while generating corresponding text masks. These masks are refined through a hybrid loss function to strengthen the model’s semantic grounding capabilities. Next, we employ linear projection to disentangle hidden states that generated by a large language model into distinct textual and visual feature subspaces. Finally, a dynamic mask fusion strategy synergistically combines these decoupled features through triple supervision from predicted text/visual masks and ground-truth annotations. Extensive experiments demonstrate state-of-the-art performance across diverse tasks, including image segmentation, image question answering, video segmentation, and video question answering. Our codes are available at https://github.com/longmalongma/DeSa2VA.

现有的视频分割器和打底器方法，以Sa2VA为例，直接在分割模型中融合特征。这通常会导致动态视觉信息和静态语义的纠缠，从而降低分割精度。为了系统地缓解这个问题，我们提出了DeSa2VA，这是一种增强解耦的提示方案，它结合了文本预训练和非线性解耦模块，以解决SAM-2固有的信息处理限制。具体来说，首先，我们设计了一种预训练模式，将文本真实标签转换为点级提示，同时生成相应的文本掩码。这些掩码通过混合损失函数进行精炼，以增强模型的语义定位能力。其次，我们采用线性投影将大型语言模型生成的隐藏状态分解到不同的文本和视觉特征子空间中。最后，通过预测文本/视觉掩码和真实注释的三重监督，动态掩融合策略协同地结合了这些解耦的特征。大量实验表明，我们在包括图像分割、图像问答、视频分割和视频问答等多种任务上均达到了最先进的性能。我们的代码位于https://github.com/longmalongma/DeSa2VA。

论文及项目相关链接

PDF

Summary

本文提出一种名为DeSa2VA的改进方案，旨在解决现有视频分割器和接地器（如Sa2VA）在融合特征时出现的动态视觉信息与静态语义纠缠问题，从而提高分割精度。该方案通过预训练将文本标签转换为点级提示并生成文本掩码，使用混合损失函数增强模型的语义接地能力。同时，采用线性投影技术将大型语言模型生成的隐藏状态分解成独立的文本和视觉特征子空间。最后，通过来自预测文本/视觉掩码和真实注释的三重监督，采用动态掩码融合策略协同结合这些解耦特征。实验表明，该方案在图像分割、图像问答、视频分割和视频问答等多项任务上均达到领先水平。

Key Takeaways

DeSa2VA解决了现有视频分割器和接地器（如Sa2VA）在融合特征时动态视觉信息与静态语义纠缠的问题。
通过预训练将文本标签转换为点级提示并生成文本掩码，增强模型的语义接地能力。
采用线性投影技术分解隐藏状态，将大型语言模型的输出分为文本和视觉特征子空间。
使用混合损失函数优化文本掩码，提高模型性能。
通过三重监督（预测文本/视觉掩码和真实注释）实现动态掩码融合策略。
DeSa2VA在图像分割、图像问答、视频分割和视频问答等多项任务上表现出卓越性能。

Cool Papers

点此查看论文截图

ReasonBridge: Efficient Reasoning Transfer from Closed to Open-Source Language Models

Authors:Ziqi Zhong, Xunzhu Tang

Recent advancements in Large Language Models (LLMs) have revealed a significant performance gap between closed-source and open-source models, particularly in tasks requiring complex reasoning and precise instruction following. This paper introduces ReasonBridge, a methodology that efficiently transfers reasoning capabilities from powerful closed-source to open-source models through a novel hierarchical knowledge distillation framework. We develop a tailored dataset Reason1K with only 1,000 carefully curated reasoning traces emphasizing difficulty, diversity, and quality. These traces are filtered from across multiple domains using a structured multi-criteria selection algorithm. Our transfer learning approach incorporates: (1) a hierarchical distillation process capturing both strategic abstraction and tactical implementation patterns, (2) a sparse reasoning-focused adapter architecture requiring only 0.3% additional trainable parameters, and (3) a test-time compute scaling mechanism using guided inference interventions. Comprehensive evaluations demonstrate that ReasonBridge improves reasoning capabilities in open-source models by up to 23% on benchmark tasks, significantly narrowing the gap with closed-source models. Notably, the enhanced Qwen2.5-14B outperforms Claude-Sonnet3.5 on MATH500 and matches its performance on competition-level AIME problems. Our methodology generalizes effectively across diverse reasoning domains and model architectures, establishing a sample-efficient approach to reasoning enhancement for instruction following.

最近大型语言模型（LLM）的进步揭示了闭源模型和开源模型之间的性能差距，特别是在需要复杂推理和精确指令执行的任务中。本文介绍了ReasonBridge方法，它通过新型分层知识蒸馏框架，有效地将从强大闭源模型的推理能力转移到开源模型。我们开发了一个专用数据集Reason1K，其中仅包含1000个精心策划的推理轨迹，强调难度、多样性和质量。这些轨迹是通过结构化多标准选择算法从多个领域筛选出来的。我们的迁移学习方法结合了以下几点：（1）分层蒸馏过程，捕捉战略抽象和战术实施模式；（2）稀疏推理聚焦适配器架构，只需额外的0.3%可训练参数；（3）测试时的计算缩放机制，采用指导推理干预。综合评估表明，ReasonBridge在基准测试任务上提高了开源模型的推理能力，最多可提高23%，显著缩小了与闭源模型的差距。值得注意的是，经过改进的Qwen2.5-14B在MATH500上表现优于Claude-Sonnet3.5，并在竞赛级别的AIME问题上与之相匹配。我们的方法在不同的推理领域和模型架构中有效地通用化，建立了一种样本高效的推理增强方法，用于指令执行。

论文及项目相关链接

PDF

Summary

大型语言模型（LLM）的最新进展揭示了闭源和开源模型之间的性能差距，特别是在需要复杂推理和精确指令执行的任务中。本文提出了ReasonBridge方法，通过一种新型分层知识蒸馏框架，有效地将强大的闭源模型的推理能力转移到开源模型上。开发了一个专为推理能力设计的数据集Reason1K，包含1000个精心筛选的推理轨迹，强调难度、多样性和质量。通过分层蒸馏过程、稀疏推理适配器架构和测试时计算缩放机制，我们的转移学习方法显著提高了开源模型的推理能力。评估表明，ReasonBridge在基准任务上提高了开源模型的推理能力高达23%，显著缩小了与闭源模型的差距。

Key Takeaways

大型语言模型（LLM）在复杂推理和精确指令执行任务上，闭源与开源模型存在显著性能差距。
ReasonBridge方法通过分层知识蒸馏框架，将闭源模型的推理能力有效转移到开源模型上。
开发了Reason1K数据集，包含精选的推理轨迹，强调难度、多样性和质量。
转移学习方法包括分层蒸馏、稀疏推理适配器架构和测试时计算缩放机制。
ReasonBridge提高了开源模型的推理能力，最高达23%，显著缩小与闭源模型的差距。
评估结果显示，增强后的开源模型在基准任务上的表现与高级闭源模型相匹敌。

Cool Papers

点此查看论文截图

Listener-Rewarded Thinking in VLMs for Image Preferences

Authors:Alexander Gambashidze, Li Pengyi, Matvey Skripkin, Andrey Galichin, Anton Gusarov, Konstantin Sobolev, Andrey Kuznetsov, Ivan Oseledets

Training robust and generalizable reward models for human visual preferences is essential for aligning text-to-image and text-to-video generative models with human intent. However, current reward models often fail to generalize, and supervised fine-tuning leads to memorization, demanding complex annotation pipelines. While reinforcement learning (RL), specifically Group Relative Policy Optimization (GRPO), improves generalization, we uncover a key failure mode: a significant drop in reasoning accuracy occurs when a model’s reasoning trace contradicts that of an independent, frozen vision-language model (“listener”) evaluating the same output. To address this, we introduce a listener-augmented GRPO framework. Here, the listener re-evaluates the reasoner’s chain-of-thought to provide a dense, calibrated confidence score, shaping the RL reward signal. This encourages the reasoner not only to answer correctly, but to produce explanations that are persuasive to an independent model. Our listener-shaped reward scheme achieves best accuracy on the ImageReward benchmark (67.4%), significantly improves out-of-distribution (OOD) performance on a large-scale human preference dataset (1.2M votes, up to +6% over naive reasoner), and reduces reasoning contradictions compared to strong GRPO and SFT baselines. These results demonstrate that listener-based rewards provide a scalable, data-efficient path to aligning vision-language models with nuanced human preferences. We will release our reasoning model here: https://huggingface.co/alexgambashidze/qwen2.5vl_image_preference_reasoner.

训练对人类视觉偏好具有鲁棒性和通用性的奖励模型对于使文本到图像和文本到视频生成模型与人类意图保持一致至关重要。然而，当前的奖励模型通常无法推广，而监督微调会导致记忆，需要复杂的注释管道。虽然强化学习（RL），特别是群组相对策略优化（GRPO）可以改善通用性，但我们发现了一个关键的失败模式：当模型的推理轨迹与独立、冻结的视觉语言模型（“监听器”）评估的相同输出相矛盾时，推理精度会出现大幅下降。为了解决这一问题，我们引入了增强监听器的GRPO框架。在这里，监听器重新评估推理者的思维链，以提供密集、校准的信心分数，塑造RL奖励信号。这鼓励推理者不仅回答正确，而且产生对独立模型有说服力的解释。我们基于监听器的奖励方案在ImageReward基准测试上取得了最佳精度（67.4%），在大型人类偏好数据集上显著提高了超出分布的性能（120万票，优于简单推理器高达+6%），并减少了与强大的GRPO和SFT基准测试的推理矛盾。这些结果表明，基于监听器的奖励为将视觉语言模型与微妙的人类偏好保持一致提供了一条可扩展和高效的数据路径。我们的推理模型将在此处发布：https://huggingface.co/alexgambashidze/qwen2.5vl_image_preference_reasoner。

论文及项目相关链接

PDF

Summary

本文强调训练能够反映人类视觉偏好的奖励模型对于文本到图像和文本到视频的生成模型与人类意图对齐至关重要。然而，当前奖励模型存在通用化不足和通过监督微调导致的记忆化问题。强化学习（RL）中的群组相对策略优化（GRPO）虽能提高通用性，但在模型推理轨迹与独立冻结的视觉语言模型（“监听器”）评估相同输出时存在推理准确度大幅下降的问题。为解决这一问题，本文引入了一个监听器增强的GRPO框架。在该框架中，监听器重新评估推理者的思维链，提供密集的校准置信度分数，塑造RL奖励信号，鼓励推理者不仅回答正确，而且产生对独立模型有说服力的解释。该监听器形状的奖励方案在ImageReward基准测试上达到最佳准确度（67.4%），在大规模人类偏好数据集上显著提高离群值性能（最多+6%），并减少了与强GRPO和SFT基准测试的推理矛盾。这些结果证明，基于监听器的奖励为对齐视觉语言模型与微妙人类偏好提供了一条可扩展且数据高效的途径。

Key Takeaways

训练反映人类视觉偏好的奖励模型对于文本到图像和文本到视频的生成模型至关重要。
当前奖励模型存在通用化不足和记忆化问题。
强化学习中的群组相对策略优化（GRPO）能提高模型的通用性，但存在推理准确度下降的问题。
监听器增强的GRPO框架通过重新评估推理者的思维链来解决GRPO的问题。
监听器提供密集的校准置信度分数，塑造RL奖励信号，鼓励产生对独立模型有说服力的解释。
监听器形状的奖励方案在ImageReward基准测试上达到最佳准确度，并显著提高离群值性能。

Cool Papers

点此查看论文截图

MetaCipher: A General and Extensible Reinforcement Learning Framework for Obfuscation-Based Jailbreak Attacks on Black-Box LLMs

Authors:Boyuan Chen, Minghao Shao, Abdul Basit, Siddharth Garg, Muhammad Shafique

The growing capabilities of large language models (LLMs) have exposed them to increasingly sophisticated jailbreak attacks. Among these, obfuscation-based attacks – which encrypt malicious content to evade detection – remain highly effective. By leveraging the reasoning ability of advanced LLMs to interpret encrypted prompts, such attacks circumvent conventional defenses that rely on keyword detection or context filtering. These methods are very difficult to defend against, as existing safety mechanisms are not designed to interpret or decode ciphered content. In this work, we propose \textbf{MetaCipher}, a novel obfuscation-based jailbreak framework, along with a reinforcement learning-based dynamic cipher selection mechanism that adaptively chooses optimal encryption strategies from a cipher pool. This approach enhances jailbreak effectiveness and generalizability across diverse task types, victim LLMs, and safety guardrails. Our framework is modular and extensible by design, supporting arbitrary cipher families and accommodating evolving adversarial strategies. We complement our method with a large-scale empirical analysis of cipher performance across multiple victim LLMs. Within as few as 10 queries, MetaCipher achieves over 92% attack success rate (ASR) on most recent standard malicious prompt benchmarks against state-of-the-art non-reasoning LLMs, and over 74% ASR against reasoning-capable LLMs, outperforming all existing obfuscation-based jailbreak methods. These results highlight the long-term robustness and adaptability of our approach, making it more resilient than prior methods in the face of advancing safety measures.

随着大型语言模型（LLM）的功能不断增强，它们面临着越来越复杂的越狱攻击。其中，基于模糊技术的攻击——通过加密恶意内容来躲避检测——仍然非常有效。这些攻击利用先进LLM的推理能力来解释加密提示，从而绕过依赖关键词检测或上下文过滤的传统防御措施。由于现有的安全机制没有被设计成解释或解码加密内容，因此这些攻击方法很难防范。在这项工作中，我们提出了MetaCipher，这是一个新的基于模糊技术的越狱框架，以及一种基于强化学习的动态密码选择机制，该机制可以从密码池中自适应选择最佳加密策略。这种方法提高了越狱攻击的有效性和跨不同任务类型、受害者LLM和安全防护措施的通用性。我们的框架按设计是模块化和可扩展的，支持任意密码家族，并能适应不断变化的对抗策略。我们对多个受害者LLM的密码性能进行了大规模实证分析，以补充我们的方法。在仅10次查询内，MetaCipher在最新的标准恶意提示基准测试上达到了超过92%的攻击成功率（ASR），针对最新非推理型LLM，以及超过7.对具备推理能力的LLM达到约4%的ASR。这一表现超过了所有现有的基于模糊技术的越狱方法。这些结果凸显了我们方法的长期稳健性和适应性，使其在面对日益进步的安全措施时更具韧性。

论文及项目相关链接

PDF

Summary

大型语言模型（LLMs）的能力增长使其面临越来越复杂的越狱攻击，其中基于模糊的攻击——通过加密恶意内容来躲避检测——仍然非常有效。利用先进LLMs的推理能力来解释加密提示，这些攻击绕过了依赖关键词检测或上下文过滤的传统防御手段。本研究提出一种名为MetaCipher的新型基于模糊技术的越狱框架，以及一种基于强化学习的动态密码选择机制，该机制可以从密码池中自适应选择最佳加密策略。这种方法提高了越狱攻击的有效性和跨不同任务类型、受害者LLMs和安全防护措施的通用性。我们的框架按模块化设计，具有可扩展性，支持任意密码家族，并能适应不断发展的对抗策略。我们结合大量实证研究，对多个受害者LLMs的密码性能进行了补充分析。在仅10次查询内，MetaCipher在最新标准恶意提示基准测试上的攻击成功率（ASR）超过92%，针对现有非推理型LLMs的ASR超过74%，显示出我们方法的长期稳健性和适应性，使其在面对不断发展的安全措施时比现有方法更具抗性。

Key Takeaways

大型语言模型（LLMs）面临日益复杂的攻击，其中基于模糊的加密攻击尤为有效。
基于模糊技术的MetaCipher框架被提出，结合强化学习动态选择加密策略。
MetaCipher增强了越狱攻击的有效性和通用性，可适应不同的任务类型、LLMs和安全措施。
MetaCipher框架模块化设计，支持多种密码家族和适应不断发展的对抗策略。
MetaCipher在标准恶意提示基准测试上的攻击成功率超过92%，显示出其长期稳健性和适应性。
与非推理型LLMs相比，MetaCipher对具有推理能力的LLMs的攻击成功率超过74%。

Cool Papers

点此查看论文截图

Kedreamix

https://kedreamix.github.io/Talk2Paper/Paper/2025-07-03/R1_Reasoning/

本博客所有文章除特別声明外，均采用 CC BY 4.0 许可协议。转载请注明来源 Kedreamix !

R1_Reasoning

LLM

LLM 方向最新论文已更新，请持续关注 Update in 2025-07-03 Positional Bias in Binary Question Answering How Uncertainty Shapes Model Preferences

2025-07-03 LLM

LLM

医学图像

医学图像方向最新论文已更新，请持续关注 Update in 2025-06-29 MambaMorph a Mamba-based Framework for Medical MR-CT Deformable Registration

2025-06-29 医学图像

医学图像

R1_Reasoning

2025-07-03 更新

Advancing Multi-Step Mathematical Reasoning in Large Language Models through Multi-Layered Self-Reflection with Auto-Prompting

AutoEvoEval: An Automated Framework for Evolving Close-Ended LLM Evaluation Data

DABstep: Data Agent Benchmark for Multi-step Reasoning

L0: Reinforcement Learning to Become General Agents

MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI

Datasets for Fairness in Language Models: An In-Depth Survey

Teaching a Language Model to Speak the Language of Tools

GeoProg3D: Compositional Visual Reasoning for City-Scale 3D Language Fields

Corrupted by Reasoning: Reasoning Language Models Become Free-Riders in Public Goods Games

Why Settle for One? Text-to-ImageSet Generation and Evaluation

UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding

Are Large Language Models Capable of Deep Relational Reasoning? Insights from DeepSeek-R1 and Benchmark Comparisons

Text2VectorSQL: Bridging Text-to-SQL and Vector Search for Unified Natural Language Queries

Boosting LLM’s Molecular Structure Elucidation with Knowledge Enhanced Tree Search Reasoning

Towards Explainable Bilingual Multimodal Misinformation Detection and Localization

Improving Rationality in the Reasoning Process of Language Models through Self-playing Game

Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder

ReasonBridge: Efficient Reasoning Transfer from Closed to Open-Source Language Models

Listener-Rewarded Thinking in VLMs for Image Preferences

MetaCipher: A General and Extensible Reinforcement Learning Framework for Obfuscation-Based Jailbreak Attacks on Black-Box LLMs