嘘~ 正在从服务器偷取页面 . . .

R1_Reasoning


⚠️ 以下所有内容总结都来自于 大语言模型的能力,如有错误,仅供参考,谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ,还请您给我们一些鼓励!⭐️ HuggingFace免费体验

2025-03-15 更新

SciVerse: Unveiling the Knowledge Comprehension and Visual Reasoning of LMMs on Multi-modal Scientific Problems

Authors:Ziyu Guo, Ray Zhang, Hao Chen, Jialin Gao, Dongzhi Jiang, Jiaze Wang, Pheng-Ann Heng

The rapid advancement of Large Multi-modal Models (LMMs) has enabled their application in scientific problem-solving, yet their fine-grained capabilities remain under-explored. In this paper, we introduce SciVerse, a multi-modal scientific evaluation benchmark to thoroughly assess LMMs across 5,735 test instances in five distinct versions. We aim to investigate three key dimensions of LMMs: scientific knowledge comprehension, multi-modal content interpretation, and Chain-of-Thought (CoT) reasoning. To unveil whether LMMs possess sufficient scientific expertise, we first transform each problem into three versions containing different levels of knowledge required for solving, i.e., Knowledge-free, -lite, and -rich. Then, to explore how LMMs interpret multi-modal scientific content, we annotate another two versions, i.e., Vision-rich and -only, marking more question information from texts to diagrams. Comparing the results of different versions, SciVerse systematically examines the professional knowledge stock and visual perception skills of LMMs in scientific domains. In addition, to rigorously assess CoT reasoning, we propose a new scientific CoT evaluation strategy, conducting a step-wise assessment on knowledge and logical errors in model outputs. Our extensive evaluation of different LMMs on SciVerse reveals critical limitations in their scientific proficiency and provides new insights into future developments. Project page: https://sciverse-cuhk.github.io

随着大型多模态模型(LMMs)的快速发展,它们被广泛应用于科学问题求解,但其精细功能仍被探索不足。在本文中,我们介绍了SciVerse,这是一个多模态科学评估基准测试,旨在全面评估五个不同版本共5735个测试实例中的大型多模态模型。我们的目标是研究大型多模态模型的三个关键维度:科学知识理解、多模态内容解释和链式思维(CoT)推理。为了揭示大型多模态模型是否具备足够的科学专业知识,我们首先将每个问题转化为包含不同知识水平的三个版本,即无知识版、精简版和丰富版。然后,为了探究大型多模态模型如何解释多模态科学内容,我们标注了另外两个版本,即视觉丰富版和仅视觉版,从文本到图表提供更多问题的信息。通过比较不同版本的结果,SciVerse系统地检验了大型多模态模型在科研领域的专业知识储备和视觉感知能力。此外,为了严格评估链式思维推理,我们提出了一种新的科学链式思维评估策略,对模型输出中的知识和逻辑错误进行分步评估。我们对SciVerse上的不同大型多模态模型进行了全面评估,揭示了它们在科研能力方面的关键局限性,并为未来发展提供了新见解。项目页面:https://sciverse-cuhk.github.io

论文及项目相关链接

PDF Initially released in September 2024. Project page: https://sciverse-cuhk.github.io

Summary

大型多模态模型(LMMs)在科学问题解答中的应用正日益普及,但其精细功能尚未得到充分探索。本文介绍了SciVerse,这是一个多模态科学评估基准测试,旨在全面评估LMMs在5735个测试实例中的表现,涵盖五个不同版本。本研究旨在探究LMMs的三个关键维度:科学知识理解、多模态内容解读和链式思维(CoT)推理。通过不同版本的测试,SciVerse系统地检验了LMMs的专业知识库存和科学领域的视觉感知技能。此外,我们还提出了一种新的科学CoT评估策略,对模型输出中的知识和逻辑错误进行逐步评估。对SciVerse上不同LMMs的广泛评估揭示了其在科学方面的局限性,并为未来发展提供了新见解。

Key Takeaways

  1. SciVerse是一个多模态科学评估基准测试,旨在全面评估大型多模态模型(LMMs)的表现。
  2. SciVerse涵盖了5735个测试实例,分为五个不同版本,以探究LMMs在科学知识理解、多模态内容解读和链式思维推理方面的能力。
  3. 通过不同版本的测试,SciVerse评估了LMMs的专业知识库存和视觉感知技能。
  4. 提出了一个新的科学链式思维(CoT)评估策略,以逐步评估模型输出中的知识和逻辑错误。
  5. LMMs在科学方面的表现存在局限性。
  6. SciVerse评估结果为LMMs的未来发展提供了新见解。

Cool Papers

点此查看论文截图

DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding

Authors:Ayesha Ishaq, Jean Lahoud, Ketan More, Omkar Thawakar, Ritesh Thawkar, Dinura Dissanayake, Noor Ahsan, Yuhao Li, Fahad Shahbaz Khan, Hisham Cholakkal, Ivan Laptev, Rao Muhammad Anwer, Salman Khan

While large multimodal models (LMMs) have demonstrated strong performance across various Visual Question Answering (VQA) tasks, certain challenges require complex multi-step reasoning to reach accurate answers. One particularly challenging task is autonomous driving, which demands thorough cognitive processing before decisions can be made. In this domain, a sequential and interpretive understanding of visual cues is essential for effective perception, prediction, and planning. Nevertheless, common VQA benchmarks often focus on the accuracy of the final answer while overlooking the reasoning process that enables the generation of accurate responses. Moreover, existing methods lack a comprehensive framework for evaluating step-by-step reasoning in realistic driving scenarios. To address this gap, we propose DriveLMM-o1, a new dataset and benchmark specifically designed to advance step-wise visual reasoning for autonomous driving. Our benchmark features over 18k VQA examples in the training set and more than 4k in the test set, covering diverse questions on perception, prediction, and planning, each enriched with step-by-step reasoning to ensure logical inference in autonomous driving scenarios. We further introduce a large multimodal model that is fine-tuned on our reasoning dataset, demonstrating robust performance in complex driving scenarios. In addition, we benchmark various open-source and closed-source methods on our proposed dataset, systematically comparing their reasoning capabilities for autonomous driving tasks. Our model achieves a +7.49% gain in final answer accuracy, along with a 3.62% improvement in reasoning score over the previous best open-source model. Our framework, dataset, and model are available at https://github.com/ayesha-ishaq/DriveLMM-o1.

虽然大型多模态模型(LMM)在各种视觉问答(VQA)任务中表现出了强大的性能,但某些挑战需要复杂的多步骤推理才能得出准确答案。一个特别具有挑战性的任务是自动驾驶,在做出决策之前,需要进行彻底的认知处理。在这个领域,对视觉线索的连续和解释性理解对于有效的感知、预测和规划至关重要。然而,常见的VQA基准测试通常侧重于最终答案的准确性,而忽视了解答过程中使用的推理方法。此外,现有方法缺乏一个综合框架来评估现实驾驶场景中逐步推理的效果。为了解决这一差距,我们提出了DriveLMM-o1,这是一个专门为推进自动驾驶的逐步视觉推理而设计的新数据集和基准测试。我们的基准测试训练集中包含超过1.8万个VQA示例,测试集中包含超过4000个示例,涵盖关于感知、预测和规划的多样化问题,每个问题都辅以逐步推理,以确保自动驾驶场景中的逻辑推断。我们进一步引入了一种经过我们推理数据集微调的大型多模态模型,在复杂驾驶场景中表现出稳健的性能。此外,我们在所提出的数据集上对各开源和闭源方法进行了基准测试,系统地比较了它们在自动驾驶任务中的推理能力。我们的模型在最终答案准确性方面取得了+7.49%的增益,在推理得分上较之前最佳的开源模型提高了3.62%。我们的框架、数据集和模型可在https://github.com/ayesha-ishaq/DriveLMM-o1找到。

论文及项目相关链接

PDF 8 pages, 4 figures, 3 tables, github: https://github.com/ayesha-ishaq/DriveLMM-o1

Summary

本文介绍了一项针对自动驾驶视觉问答任务的研究,提出一个新的数据集和基准测试标准DriveLMM-o1。该数据集包含超过18k个训练样本和超过4k个测试样本,涵盖感知、预测和规划等多样化问题,并强调逐步推理的重要性。该研究还引入了一个大型多模态模型,经过该数据集微调后,在复杂驾驶场景中表现出稳健性能。相较于当前最佳开源模型,该模型在最终答案准确性和推理得分方面均有显著提升。

Key Takeaways

  1. 大型多模态模型在视觉问答任务中表现出强大性能,但在需要复杂多步骤推理的自动驾驶任务中面临挑战。
  2. 自动驾驶需要全面理解视觉线索的序列和解释性,以进行有效感知、预测和规划。
  3. 现有视觉问答基准测试更多地关注最终答案的准确性,而忽视推理过程的重要性。
  4. 引入新的数据集和基准测试标准DriveLMM-o1,包含超过18k个训练样本和超过4k个测试样本,强调逐步推理的重要性。
  5. 提出的大型多模态模型经过DriveLMM-o1数据集的微调后,在复杂驾驶场景中表现出稳健性能。
  6. 与当前最佳开源模型相比,新模型在最终答案准确性和推理得分方面均有显著提升。

Cool Papers

点此查看论文截图

R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

Authors:Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, Bo Zhang, Wei Chen

Large Language Models have demonstrated remarkable reasoning capability in complex textual tasks. However, multimodal reasoning, which requires integrating visual and textual information, remains a significant challenge. Existing visual-language models often struggle to effectively analyze and reason visual content, resulting in suboptimal performance on complex reasoning tasks. Moreover, the absence of comprehensive benchmarks hinders the accurate assessment of multimodal reasoning capabilities. In this paper, we introduce R1-Onevision, a multimodal reasoning model designed to bridge the gap between visual perception and deep reasoning. To achieve this, we propose a cross-modal reasoning pipeline that transforms images into formal textural representations, enabling precise language-based reasoning. Leveraging this pipeline, we construct the R1-Onevision dataset which provides detailed, step-by-step multimodal reasoning annotations across diverse domains. We further develop the R1-Onevision model through supervised fine-tuning and reinforcement learning to cultivate advanced reasoning and robust generalization abilities. To comprehensively evaluate multimodal reasoning performance across different grades, we introduce R1-Onevision-Bench, a benchmark aligned with human educational stages, covering exams from junior high school to university and beyond. Experimental results show that R1-Onevision achieves state-of-the-art performance, outperforming models such as GPT-4o and Qwen2.5-VL on multiple challenging multimodal reasoning benchmarks.

大型语言模型在复杂的文本任务中表现出了惊人的推理能力。然而,多模态推理,这需要整合视觉和文本信息,仍然是一个巨大的挑战。现有的视觉语言模型往往难以有效地分析和推理视觉内容,导致在复杂的推理任务上的性能不佳。此外,缺乏全面的基准测试阻碍了多模态推理能力的准确评估。在本文中,我们介绍了R1-Onevision,一个旨在弥合视觉感知和深度推理之间差距的多模态推理模型。为此,我们提出了一种跨模态推理管道,将图像转换为正式的纹理表示,从而实现基于精确语言的推理。利用这一管道,我们构建了R1-Onevision数据集,该数据集在各个领域提供了详细、逐步的多模态推理注释。我们进一步通过监督微调强化学习来开发R1-Onevision模型,以培养先进的推理和稳健的泛化能力。为了全面评估不同等级的多模态推理性能,我们推出了R1-Onevision-Bench,这是一个与人类教育阶段相一致的基准测试,涵盖从初中到大学及以后的考试。实验结果表明,R1-Onevision在多个具有挑战性的多模态推理基准测试中达到了最新技术水平,超越了GPT-4o和Qwen2.5-VL等模型。

论文及项目相关链接

PDF Code and Model: https://github.com/Fancy-MLLM/R1-onevision

Summary:大型语言模型在复杂的文本任务中展现出色的推理能力,但在需要整合视觉和文本信息的多模态推理方面仍存在挑战。现有视觉语言模型在分析视觉内容方面表现不足,导致在复杂推理任务上性能不佳。本文介绍R1-Onevision多模态推理模型,通过跨模态推理管道将图像转化为正式文本表示,实现精确的语言推理。构建R1-Onevision数据集并提供跨不同领域的详细、逐步多模态推理注释。通过监督微调与强化学习,培养先进的推理和稳健的泛化能力。为全面评估不同级别的多模态推理性能,本文还引入与人类教育阶段相符的R1-Onevision-Bench基准测试,涵盖从初中到大学及以后的考试。实验结果证明R1-Onevision在多个具有挑战性的多模态推理基准测试上实现卓越性能,优于GPT-4o和Qwen2.5-VL模型。

Key Takeaways

  1. 大型语言模型在多模态推理方面存在挑战,需要整合视觉和文本信息。
  2. 现有视觉语言模型在分析视觉内容方面表现不足,导致复杂推理任务性能不佳。
  3. R1-Onevision多模态推理模型通过跨模态推理管道实现图像到文本表示的转化,支持精确语言推理。
  4. R1-Onevision数据集的构建提供了详细、逐步的多模态推理注释,适用于不同领域。
  5. R1-Onevision模型通过监督微调和强化学习培养高级推理和泛化能力。
  6. R1-Onevision-Bench基准测试用于评估不同级别的多模态推理性能,与人类的各个阶段教育相匹配。

Cool Papers

点此查看论文截图

Unveiling the Mathematical Reasoning in DeepSeek Models: A Comparative Study of Large Language Models

Authors:Afrar Jahin, Arif Hassan Zidan, Yu Bao, Shizhe Liang, Tianming Liu, Wei Zhang

With the rapid evolution of Artificial Intelligence (AI), Large Language Models (LLMs) have reshaped the frontiers of various fields, spanning healthcare, public health, engineering, science, agriculture, education, arts, humanities, and mathematical reasoning. Among these advancements, DeepSeek models have emerged as noteworthy contenders, demonstrating promising capabilities that set them apart from their peers. While previous studies have conducted comparative analyses of LLMs, few have delivered a comprehensive evaluation of mathematical reasoning across a broad spectrum of LLMs. In this work, we aim to bridge this gap by conducting an in-depth comparative study, focusing on the strengths and limitations of DeepSeek models in relation to their leading counterparts. In particular, our study systematically evaluates the mathematical reasoning performance of two DeepSeek models alongside five prominent LLMs across three independent benchmark datasets. The findings reveal several key insights: 1). DeepSeek-R1 consistently achieved the highest accuracy on two of the three datasets, demonstrating strong mathematical reasoning capabilities. 2). The distilled variant of LLMs significantly underperformed compared to its peers, highlighting potential drawbacks in using distillation techniques. 3). In terms of response time, Gemini 2.0 Flash demonstrated the fastest processing speed, outperforming other models in efficiency, which is a crucial factor for real-time applications. Beyond these quantitative assessments, we delve into how architecture, training, and optimization impact LLMs’ mathematical reasoning. Moreover, our study goes beyond mere performance comparison by identifying key areas for future advancements in LLM-driven mathematical reasoning. This research enhances our understanding of LLMs’ mathematical reasoning and lays the groundwork for future advancements

随着人工智能(AI)的快速发展,大型语言模型(LLM)已经重塑了各个领域的边界,涵盖了医疗、公共卫生、工程、科学、农业、教育、艺术、人文和数学推理等多个领域。在这些进展中,DeepSeek模型表现出色,展现出令人瞩目的能力,使其与同行区分开来。虽然之前的研究已经对LLM进行了比较分析,但很少有研究对一系列LLM的数学推理能力进行全面评估。在这项工作中,我们旨在通过进行深入的对比研究来填补这一空白,重点关注DeepSeek模型与其领先同行的优势和局限性。特别是,我们的研究系统地评估了两个DeepSeek模型和五种突出的LLM在三个独立基准数据集上的数学推理性能。研究发现揭示了几个关键见解:1. DeepSeek-R1在三个数据集中的两个上始终实现了最高精度,表现出强大的数学推理能力。2. 与同行相比,蒸馏变种LLM的表现显著较差,这突显了使用蒸馏技术可能存在的潜在缺陷。3. 在响应时间方面,Gemini 2.0 Flash显示出最快的处理速度,在效率方面超越了其他模型,这对于实时应用是一个关键因素。除了这些定量评估外,我们还深入研究了架构、训练和优化如何影响LLM的数学推理。此外,我们的研究超越了仅仅的性能比较,而是确定了未来在LLM驱动的数学推理方面发展的关键领域。这项研究增强了我们对于LLM数学推理的理解,并为未来的进步奠定了基础。

论文及项目相关链接

PDF

Summary

随着人工智能的快速发展,大型语言模型(LLM)已经重塑了多个领域的前沿,其中DeepSeek模型表现突出。然而,关于数学推理能力的全面评估仍有所欠缺。本研究旨在填补这一空白,深入评估DeepSeek模型及其领先同类模型的表现。研究结果显示DeepSeek-R1在两项数据集中表现最佳,展现了强大的数学推理能力。同时,研究也探讨了架构、训练和优化对LLM数学推理能力的影响,并为未来的发展方向提供了关键见解。

Key Takeaways

  1. DeepSeek-R1在多项数据集中展现出色的数学推理能力。
  2. 蒸馏型LLM表现较差,提示蒸馏技术在某些应用场景中可能存在局限性。
  3. Gemini 2.0 Flash在响应时间方面表现最佳,具有较高的处理效率。
  4. 架构、训练和优化对LLM的数学推理能力有重要影响。
  5. 研究结果不仅关注性能比较,还指出了未来LLM驱动的数学推理发展的关键方向。
  6. 本研究提高了对LLM数学推理能力的理解,为后续研究提供了基础。

Cool Papers

点此查看论文截图

Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond

Authors:Liang Wen, Yunke Cai, Fenrui Xiao, Xin He, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu, Lifu Tang, Xiaowei Lv, Haosheng Zou, Yongchao Deng, Shousheng Jia, Xiangzheng Zhang

This paper presents our work on the Light-R1 series, with models, data, and code all released. We first focus on training long COT models from scratch, specifically starting from models initially lacking long COT capabilities. Using a curriculum training recipe consisting of two-stage SFT and semi-on-policy DPO, we train our model Light-R1-32B from Qwen2.5-32B-Instruct, resulting in superior math performance compared to DeepSeek-R1-Distill-Qwen-32B. Despite being trained exclusively on math data, Light-R1-32B shows strong generalization across other domains. In the subsequent phase of this work, we highlight the significant benefit of the 3k dataset constructed for the second SFT stage on enhancing other models. By fine-tuning DeepSeek-R1-Distilled models using this dataset, we obtain new SOTA models in 7B and 14B, while the 32B model, Light-R1-32B-DS performed comparably to QwQ-32B and DeepSeek-R1. Furthermore, we extend our work by applying reinforcement learning, specifically GRPO, on long-COT models to further improve reasoning performance. We successfully train our final Light-R1-14B-DS with RL, achieving SOTA performance among 14B parameter models in math. With AIME24 & 25 scores of 74.0 and 60.2 respectively, Light-R1-14B-DS surpasses even many 32B models and DeepSeek-R1-Distill-Llama-70B. Its RL training also exhibits well expected behavior, showing simultaneous increase in response length and reward score. The Light-R1 series of work validates training long-COT models from scratch, showcases the art in SFT data and releases SOTA models from RL.

本文介绍了我们在Light-R1系列方面的工作,其中模型、数据和代码均已发布。我们首先专注于从头开始训练长COT模型,特别是从最初不具备长COT能力的模型开始。使用包含两阶段SFT和半在线策略DPO的课程训练配方,我们从Qwen2.5-32B-Instruct训练了Light-R1-32B模型,与DeepSeek-R1-Distill-Qwen-32B相比,其数学性能优越。尽管只在数学数据上进行训练,但Light-R1-32B在其他领域表现出强大的泛化能力。在这项工作的后续阶段,我们强调了为第二阶段SFT构建的3k数据集对增强其他模型的重要好处。通过对此数据集进行微调DeepSeek-R1-Distilled模型,我们获得了7B和14B的新SOTA模型,而32B模型Light-R1-32B-DS与QwQ-32B和DeepSeek-R1表现相当。此外,我们通过将强化学习,特别是GRPO,应用于长COT模型,进一步提高了推理性能。我们成功地用强化学习训练了最终的Light-R1-14B-DS,在14B参数模型中实现数学方面的SOTA性能。Light-R1-14B-DS在AIME24和AIME25的得分分别为74.0和60.2,甚至超越了许多32B模型和DeepSeek-R1-Distill-Llama-70B。其强化学习训练还表现出预期的行为,响应长度和奖励分数同时增加。Light-R1系列工作验证了从头开始训练长COT模型的可行性,展示了SFT数据的艺术,并发布了来自RL的SOTA模型。

论文及项目相关链接

PDF all release at https://github.com/Qihoo360/Light-R1

Summary

本文介绍了Light-R1系列的研究工作,涉及模型、数据和代码的发布。研究重点在于从头开始训练具有长期连续输出(COT)能力的模型。通过使用包括两个阶段自训练(SFT)和半在线策略DPO的课程训练配方,成功训练了Light-R1-32B模型,表现出优异的数学性能。该模型在多个领域都展现出强大的泛化能力。后续研究中,构建了3k数据集用于第二阶段SFT,提升了其他模型性能。通过对DeepSeek-R1-Distilled模型进行微调,获得了新的SOTA模型。此外,将强化学习应用于长期COT模型,进一步提高了推理性能。最终,Light-R1系列的研究验证了从头训练长期COT模型的可行性,展示了自训练数据的重要性,并发布了使用强化学习训练的SOTA模型。

Key Takeaways

  1. Light-R1系列研究关注于从头开始训练具有长期连续输出(COT)能力的模型。
  2. 通过特殊的课程训练配方,成功训练了Light-R1-32B模型,表现出卓越的数学性能。
  3. Light-R1-32B模型在多个领域展现出强大的泛化能力。
  4. 构建了3k数据集用于第二阶段自训练,显著提升了其他模型的性能。
  5. 通过微调DeepSeek-R1-Distilled模型,获得了新的SOTA模型在7B和14B参数模型中表现优异。
  6. 应用强化学习进一步提高了长期COT模型的推理性能。

Cool Papers

点此查看论文截图

Authors:Sinuo Liu, Chenyang Lyu, Minghao Wu, Longyue Wang, Weihua Luo, Kaifu Zhang

Recent advances in Large Reasoning Models (LRMs), particularly those leveraging Chain-of-Thought reasoning (CoT), have opened brand new possibility for Machine Translation (MT). This position paper argues that LRMs substantially transformed traditional neural MT as well as LLMs-based MT paradigms by reframing translation as a dynamic reasoning task that requires contextual, cultural, and linguistic understanding and reasoning. We identify three foundational shifts: 1) contextual coherence, where LRMs resolve ambiguities and preserve discourse structure through explicit reasoning over cross-sentence and complex context or even lack of context; 2) cultural intentionality, enabling models to adapt outputs by inferring speaker intent, audience expectations, and socio-linguistic norms; 3) self-reflection, LRMs can perform self-reflection during the inference time to correct the potential errors in translation especially extremely noisy cases, showing better robustness compared to simply mapping X->Y translation. We explore various scenarios in translation including stylized translation, document-level translation and multimodal translation by showcasing empirical examples that demonstrate the superiority of LRMs in translation. We also identify several interesting phenomenons for LRMs for MT including auto-pivot translation as well as the critical challenges such as over-localisation in translation and inference efficiency. In conclusion, we think that LRMs redefine translation systems not merely as text converters but as multilingual cognitive agents capable of reasoning about meaning beyond the text. This paradigm shift reminds us to think of problems in translation beyond traditional translation scenarios in a much broader context with LRMs - what we can achieve on top of it.

近期大型推理模型(LRMs)的最新进展,特别是那些采用思维链推理(CoT)的模型,为机器翻译(MT)开辟了新的可能性。本立场论文认为,LRMs通过重新构建翻译作为一个需要上下文、文化和语言理解和推理的动态推理任务,从而极大地转变了传统的神经机器翻译以及基于大型语言模型的机器翻译范式。我们确定了三个基本转变:1)上下文连贯性,LRMs通过跨句和复杂上下文或甚至无上下文的显式推理来解决歧义并保留话语结构;2)文化意图性,使模型能够通过推断说话者的意图、受众期望和社会语言规范来适应输出;3)自我反思,LRMs可以在推理时间进行自我反思,以纠正翻译中可能存在的错误,特别是在极端嘈杂的情况下,显示出比简单的X->Y翻译更高的稳健性。我们通过展示实证例子来探索翻译中的各种场景,包括风格化翻译、文档级翻译和多模态翻译,这些例子证明了LRMs在翻译中的优越性。我们还发现了LRMs在机器翻译方面的几个有趣现象,包括自动枢轴翻译以及关键挑战,如翻译的过度本地化和推理效率。总之,我们认为LRMs重新定义了翻译系统,不仅仅作为文本转换器,而是作为能够超越文本进行意义推理的多语言认知代理。这种范式转变提醒我们,在更广泛的背景下,要超越传统的翻译场景来思考LRMs在翻译方面的问题——我们可以在它之上实现什么。

论文及项目相关链接

PDF

Summary

大型推理模型(LRMs)特别是利用思维链(CoT)的模型为机器翻译(MT)带来了全新可能性。该立场论文认为,LRMs通过重新构建翻译作为一个需要上下文、文化和语言理解和推理的动态推理任务,从而极大地改变了传统的神经机器翻译以及基于LLMs的机器翻译范式。包括三大基础转变:上下文连贯性、文化意图和自我反思。通过实证例子展示了LRMs在翻译中的优越性,并指出了自动转换翻译等现象以及过度本地化翻译和推理效率等挑战。总的来说,LRMs重新定义了翻译系统,不仅仅是文本转换器,而是能够超越文本进行意义推理的多语言认知代理。

Key Takeaways

  1. LRMs通过引入思维链推理,为机器翻译领域带来全新可能性。
  2. LRMs改变了传统的机器翻译范式,将其构建为一个需要上下文、文化和语言理解的动态推理任务。
  3. LRMs实现了三大基础转变:上下文连贯性、文化意图和自我反思。
  4. LRMs通过实证例子展示了在风格化翻译、文档级翻译和多模态翻译等场景中的优越性。
  5. LRMs在机器翻译中展现出自动转换翻译等现象。
  6. LRMs面临过度本地化翻译和推理效率等挑战。

Cool Papers

点此查看论文截图

VisualPRM: An Effective Process Reward Model for Multimodal Reasoning

Authors:Weiyun Wang, Zhangwei Gao, Lianjie Chen, Zhe Chen, Jinguo Zhu, Xiangyu Zhao, Yangzhou Liu, Yue Cao, Shenglong Ye, Xizhou Zhu, Lewei Lu, Haodong Duan, Yu Qiao, Jifeng Dai, Wenhai Wang

We introduce VisualPRM, an advanced multimodal Process Reward Model (PRM) with 8B parameters, which improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs) across different model scales and families with Best-of-N (BoN) evaluation strategies. Specifically, our model improves the reasoning performance of three types of MLLMs and four different model scales. Even when applied to the highly capable InternVL2.5-78B, it achieves a 5.9-point improvement across seven multimodal reasoning benchmarks. Experimental results show that our model exhibits superior performance compared to Outcome Reward Models and Self-Consistency during BoN evaluation. To facilitate the training of multimodal PRMs, we construct a multimodal process supervision dataset VisualPRM400K using an automated data pipeline. For the evaluation of multimodal PRMs, we propose VisualProcessBench, a benchmark with human-annotated step-wise correctness labels, to measure the abilities of PRMs to detect erroneous steps in multimodal reasoning tasks. We hope that our work can inspire more future research and contribute to the development of MLLMs. Our model, data, and benchmark are released in https://internvl.github.io/blog/2025-03-13-VisualPRM/.

我们介绍了VisualPRM,这是一个先进的跨模态过程奖励模型(PRM),拥有8B参数,它通过采用最佳N(BoN)评估策略,提高了现有跨模态大型语言模型(MLLMs)在不同模型规模和家族中的推理能力。具体来说,我们的模型改进了三种类型的MLLMs和四种不同模型规模的推理性能。即使应用到高度能力的InternVL2.5-78B上,它在七个跨模态推理基准测试中实现了5.9分的提升。实验结果表明,与结果奖励模型和自我一致性相比,我们的模型在BoN评估中表现出更优越的性能。为了训练跨模态PRMs,我们使用自动化数据管道构建了一个跨模态过程监督数据集VisualPRM400K。为了评估跨模态PRMs,我们提出了VisualProcessBench基准测试,它带有由人类注释的步骤正确性标签,用于衡量PRMs在跨模态推理任务中检测错误步骤的能力。我们希望我们的工作能够激发更多的未来研究,并为MLLMs的发展做出贡献。我们的模型、数据和基准测试已在https://internvl.github.io/blog/2025-03-13-VisualPRM/上发布。

论文及项目相关链接

PDF

Summary

视觉PRM是一种先进的跨模态过程奖励模型,具有8B参数,提高了多种不同规模和家族的多模态大语言模型的推理能力。通过最佳N评估策略,我们的模型在三种类型的MLLMs和四个不同的模型规模上取得了显著的改进。此外,我们的模型还表现出了出色的性能,在高度复杂的InternVL模型上实现了跨七个跨模态推理基准测试5.9点的改进。我们构建了视觉PRM数据集和多模态过程基准测试,以支持多模态PRM的训练和评估。我们相信这项工作将激发更多未来的研究并为多模态语言模型的发展做出贡献。具体详情可访问:链接地址

Key Takeaways

一、提出了视觉PRM模型,一种先进的多模态过程奖励模型(PRM),参数规模达到8B。
二、该模型增强了多模态大语言模型(MLLMs)的推理能力,适用于不同规模和类型。
三、在最佳N评估策略下,视觉PRM模型对三种类型的MLLMs和四个不同规模的模型均有显著提高表现。
四、在高度复杂的InternVL模型中,视觉PRM实现了跨七个跨模态推理基准测试5.9点的显著改进。
五、为了支持多模态PRMs的训练和评估,构建了视觉PRM数据集和多模态过程基准测试。
六、该模型的性能优于结果奖励模型和自一致性模型在最佳N评估中的表现。

Cool Papers

点此查看论文截图

SurgRAW: Multi-Agent Workflow with Chain-of-Thought Reasoning for Surgical Intelligence

Authors:Chang Han Low, Ziyue Wang, Tianyi Zhang, Zhitao Zeng, Zhu Zhuo, Evangelos B. Mazomenos, Yueming Jin

Integration of Vision-Language Models (VLMs) in surgical intelligence is hindered by hallucinations, domain knowledge gaps, and limited understanding of task interdependencies within surgical scenes, undermining clinical reliability. While recent VLMs demonstrate strong general reasoning and thinking capabilities, they still lack the domain expertise and task-awareness required for precise surgical scene interpretation. Although Chain-of-Thought (CoT) can structure reasoning more effectively, current approaches rely on self-generated CoT steps, which often exacerbate inherent domain gaps and hallucinations. To overcome this, we present SurgRAW, a CoT-driven multi-agent framework that delivers transparent, interpretable insights for most tasks in robotic-assisted surgery. By employing specialized CoT prompts across five tasks: instrument recognition, action recognition, action prediction, patient data extraction, and outcome assessment, SurgRAW mitigates hallucinations through structured, domain-aware reasoning. Retrieval-Augmented Generation (RAG) is also integrated to external medical knowledge to bridge domain gaps and improve response reliability. Most importantly, a hierarchical agentic system ensures that CoT-embedded VLM agents collaborate effectively while understanding task interdependencies, with a panel discussion mechanism promotes logical consistency. To evaluate our method, we introduce SurgCoTBench, the first reasoning-based dataset with structured frame-level annotations. With comprehensive experiments, we demonstrate the effectiveness of proposed SurgRAW with 29.32% accuracy improvement over baseline VLMs on 12 robotic procedures, achieving the state-of-the-art performance and advancing explainable, trustworthy, and autonomous surgical assistance.

将视觉语言模型(VLMs)整合到手术智能中面临着诸多挑战,如幻觉、领域知识差距以及手术场景中任务相互依赖性的有限理解,这削弱了其在临床上的可靠性。尽管最新的VLMs表现出强大的通用推理和思维能力,但它们仍然缺乏精确解读手术场景所需的领域专业知识和任务意识。虽然“思维链”(CoT)能够更有效地结构化推理,但当前的方法依赖于自我生成的CoT步骤,这往往会加剧固有的领域差距和幻觉。

为了克服这一问题,我们推出了SurgRAW,这是一个由CoT驱动的多代理框架,为机器人辅助手术中的大多数任务提供透明、可解释的见解。通过五个任务中的专门CoT提示:仪器识别、动作识别、动作预测、患者数据提取和结果评估,SurgRAW通过结构化、领域感知推理来缓解幻觉。同时集成了增强生成(RAG)以访问外部医学知识,以弥补领域差距并提高响应可靠性。最重要的是,一个分层的多代理系统确保CoT嵌入的VLM代理能够进行有效协作,同时了解任务相互依赖性,而小组讨论机制则促进了逻辑一致性。

论文及项目相关链接

PDF

Summary:视觉语言模型(VLMs)在手术智能集成中面临幻觉、领域知识差距和手术场景内任务相互依赖的有限理解等问题,影响临床可靠性。虽然最近的VLMs展现出强大的通用推理和思维能力,但它们仍然缺乏精确解读手术场景所需的领域专业知识和任务意识。为此,提出SurgRAW,一个基于思维链(CoT)的多智能体框架,为机器人辅助手术中的大多数任务提供透明、可解释性的见解。通过五个任务的专门思维链提示,SurgRAW通过结构化、领域知识驱动的推理来缓解幻觉问题。同时集成检索增强生成(RAG)以获取外部医学知识,缩小领域差距并提高响应可靠性。最重要的是,层次化的智能体系确保思维链嵌入的VLM智能体有效协作,理解任务相互依赖性,并通过小组讨论机制促进逻辑一致性。

Key Takeaways

  1. VLM在手术智能应用中存在幻觉、领域知识差距和任务相互依赖理解有限的问题。
  2. 虽然VLM具有强大的通用推理和思维能力,但仍需提高在手术领域的专业知识和任务意识。
  3. SurgRAW框架利用CoT结构来增强手术场景的解读,通过结构化、领域知识驱动的推理缓解幻觉问题。
  4. SurgRAW集成RAG以获取外部医学知识,缩小领域差距并提高响应的可靠性。
  5. 层次化的智能体系确保VLM智能体有效协作,理解任务相互依赖性。
  6. SurgRAW引入SurgCoTBench数据集进行方法评估,该数据集是首个基于推理的结构化帧级别注释数据集。

Cool Papers

点此查看论文截图

ImageScope: Unifying Language-Guided Image Retrieval via Large Multimodal Model Collective Reasoning

Authors:Pengfei Luo, Jingbo Zhou, Tong Xu, Yuan Xia, Linli Xu, Enhong Chen

With the proliferation of images in online content, language-guided image retrieval (LGIR) has emerged as a research hotspot over the past decade, encompassing a variety of subtasks with diverse input forms. While the development of large multimodal models (LMMs) has significantly facilitated these tasks, existing approaches often address them in isolation, requiring the construction of separate systems for each task. This not only increases system complexity and maintenance costs, but also exacerbates challenges stemming from language ambiguity and complex image content, making it difficult for retrieval systems to provide accurate and reliable results. To this end, we propose ImageScope, a training-free, three-stage framework that leverages collective reasoning to unify LGIR tasks. The key insight behind the unification lies in the compositional nature of language, which transforms diverse LGIR tasks into a generalized text-to-image retrieval process, along with the reasoning of LMMs serving as a universal verification to refine the results. To be specific, in the first stage, we improve the robustness of the framework by synthesizing search intents across varying levels of semantic granularity using chain-of-thought (CoT) reasoning. In the second and third stages, we then reflect on retrieval results by verifying predicate propositions locally, and performing pairwise evaluations globally. Experiments conducted on six LGIR datasets demonstrate that ImageScope outperforms competitive baselines. Comprehensive evaluations and ablation studies further confirm the effectiveness of our design.

随着在线内容中图像数量的激增,语言指导的图像检索(LGIR)在过去十年中已成为研究热点,涵盖了各种具有不同输入形式的子任务。尽管大型多模态模型(LMM)的发展极大地促进了这些任务,但现有方法通常孤立地解决它们,需要为每个任务构建单独的系统。这不仅增加了系统复杂性和维护成本,而且还加剧了由语言模糊和复杂的图像内容引起的挑战,使得检索系统难以提供准确和可靠的结果。为此,我们提出了ImageScope,这是一个无需训练的三阶段框架,它利用集体推理来统一LGIR任务。统一的背后关键见解在于语言的组合性质,它将各种LGIR任务转变为一种通用的文本到图像检索过程,加上LMM的推理作为通用的验证来完善结果。具体而言,在第一阶段,我们通过利用链式思维(CoT)推理在不同级别的语义粒度上合成搜索意图,提高了框架的稳健性。在第二和第三阶段,然后我们通过局部验证谓词命题以及全局进行配对评估来反思检索结果。在六个LGIR数据集上进行的实验表明,ImageScope优于竞争基线。综合评估和消融研究进一步证实了我们的设计有效性。

论文及项目相关链接

PDF WWW 2025

Summary

该文本介绍了随着网络内容中图片数量的增加,语言引导的图像检索(LGIR)已成为过去十年的研究热点。尽管大型多模态模型(LMMs)的发展极大地促进了这些任务,但现有方法通常孤立地解决它们,需要为每个任务构建单独的系统。这不仅增加了系统复杂性和维护成本,还加剧了来自语言模糊和复杂图像内容的挑战,使得检索系统难以提供准确可靠的结果。为此,提出了ImageScope,一个无需训练的三阶段框架,利用集体推理来统一LGIR任务。关键洞察力在于语言的组成性质,它将各种LGIR任务转变为一种通用的文本到图像检索过程,而LMMs的推理则作为通用验证来优化结果。

Key Takeaways

  1. 语言引导的图像检索(LGIR)已成为研究热点,面临系统复杂性和维护成本等问题。
  2. ImageScope是一个无需训练的三阶段框架,旨在统一LGIR任务,提供准确可靠的检索结果。
  3. ImageScope的关键在于利用语言的组成性质,将LGIR任务转变为文本到图像检索过程。
  4. 第一阶段通过链式思维(CoT)推理合成不同语义粒度的搜索意图,提高了框架的稳健性。
  5. 第二、三阶段通过局部验证谓词命题和全局成对评估进行结果反思。
  6. 在六个LGIR数据集上的实验表明,ImageScope优于竞争基线。

Cool Papers

点此查看论文截图

Cognitive-Mental-LLM: Leveraging Reasoning in Large Language Models for Mental Health Prediction via Online Text

Authors:Avinash Patil, Amardeep Kour Gedhu

Large Language Models (LLMs) have demonstrated potential in predicting mental health outcomes from online text, yet traditional classification methods often lack interpretability and robustness. This study evaluates structured reasoning techniques-Chain-of-Thought (CoT), Self-Consistency (SC-CoT), and Tree-of-Thought (ToT)-to improve classification accuracy across multiple mental health datasets sourced from Reddit. We analyze reasoning-driven prompting strategies, including Zero-shot CoT and Few-shot CoT, using key performance metrics such as Balanced Accuracy, F1 score, and Sensitivity/Specificity. Our findings indicate that reasoning-enhanced techniques improve classification performance over direct prediction, particularly in complex cases. Compared to baselines such as Zero Shot non-CoT Prompting, and fine-tuned pre-trained transformers such as BERT and Mental-RoBerta, and fine-tuned Open Source LLMs such as Mental Alpaca and Mental-Flan-T5, reasoning-driven LLMs yield notable gains on datasets like Dreaddit (+0.52% over M-LLM, +0.82% over BERT) and SDCNL (+4.67% over M-LLM, +2.17% over BERT). However, performance declines in Depression Severity, and CSSRS predictions suggest dataset-specific limitations, likely due to our using a more extensive test set. Among prompting strategies, Few-shot CoT consistently outperforms others, reinforcing the effectiveness of reasoning-driven LLMs. Nonetheless, dataset variability highlights challenges in model reliability and interpretability. This study provides a comprehensive benchmark of reasoning-based LLM techniques for mental health text classification. It offers insights into their potential for scalable clinical applications while identifying key challenges for future improvements.

大型语言模型(LLM)已显示出从在线文本预测心理健康结果的潜力,但传统分类方法往往缺乏可解释性和稳健性。本研究评估了结构化推理技术——思维链(CoT)、自我一致性(SC-CoT)和思维树(ToT)——以提高在多个来自Reddit的心理健康数据集上的分类精度。我们分析了以推理为核心的提示策略,包括零镜头CoT和少镜头CoT,使用平衡精度、F1分数和敏感性/特异性等关键性能指标。我们的研究结果表明,通过推理增强的技术提高了直接预测的分类性能,特别是在复杂情况下。与基线方法(如零镜头非CoT提示)以及经过微调预训练的转换器(如BERT和Mental-RoBERTa)以及经过微调开源LLM(如Mental Alpaca和Mental-Flan-T5)相比,以推理为核心的LLM在数据集上取得了显著的成绩,如在Dreaddit上较M-LLM提高0.52%,较BERT提高0.82%,在SDCNL上较M-LLM提高4.67%,较BERT提高2.17%。然而,在抑郁症严重程度和CSSRS预测方面的表现下降,表明存在特定数据集的局限性,这可能是由于我们使用了更广泛的测试集。在提示策略中,少镜头CoT始终表现最佳,进一步证明了以推理为核心的LLM的有效性。然而,数据集的变化突显了模型可靠性和可解释性方面的挑战。本研究为基于推理的LLM技术在心理健康文本分类方面提供了全面的基准测试。它为可扩展的临床应用提供了见解,并指出了未来改进的关键挑战。

论文及项目相关链接

PDF 8 pages, 4 Figures, 3 tables

摘要

大型语言模型(LLMs)在预测心理健康结果方面具有潜力,但传统分类方法缺乏可解释性和稳健性。本研究评估了基于结构化推理的技术,包括思维链(CoT)、自我一致性(SC-CoT)和思维树(ToT),以提高多个来自Reddit的心理健康数据集的分类准确性。我们分析了基于推理的提示策略,包括零射CoT和少射CoT,使用平衡精度、F1分数和敏感性/特异性等关键性能指标。研究发现,基于推理的技术在直接预测方面表现出改进,特别是在复杂情况下。与基线方法(如零射非CoT提示)以及微调预训练变压器(如BERT和Mental-RoBERTa)和微调开源LLMs(如Mental Alpaca和Mental-Flan-T5)相比,基于推理的LLMs在数据集(如Dreaddit和SDCNL)上实现了显著收益。然而,在抑郁症严重性和CSSRS预测方面的性能下降表明,特定数据集可能存在局限性,这可能是由于使用了更广泛的测试集。在提示策略中,少射CoT表现最好,证实了基于推理的LLMs的有效性。然而,数据集的变化强调了模型可靠性和可解释性的挑战。本研究为基于推理的LLM技术在心理健康文本分类方面提供了全面的基准测试。它为可扩展的临床应用提供了见解,并指出了未来改进的关键挑战。

关键见解

  1. 大型语言模型(LLMs)在预测心理健康结果方面具有潜力。
  2. 传统分类方法缺乏可解释性和稳健性。
  3. 结构化推理技术(如思维链和思维树)能提高分类准确性。
  4. 基于推理的提示策略(如零射和少射CoT)表现良好。
  5. 与基线方法和微调预训练模型相比,基于推理的LLMs在某些数据集上实现了显著收益。
  6. 数据集的多样性对模型性能和可解释性构成挑战。

Cool Papers

点此查看论文截图

Why Does Your CoT Prompt (Not) Work? Theoretical Analysis of Prompt Space Complexity, its Interaction with Answer Space During CoT Reasoning with LLMs: A Recurrent Perspective

Authors:Xiang Zhang, Juntai Cao, Jiaqi Wei, Chenyu You, Dujian Ding

Despite the remarkable successes of Large Language Models (LLMs), their fundamental Transformer architecture possesses inherent theoretical limitations that restrict their capability to handle reasoning tasks with increasing computational complexity. Chain-of-Thought (CoT) prompting has emerged as a practical solution, supported by several theoretical studies. However, current CoT-based methods (including ToT, GoT, etc.) generally adopt a “one-prompt-fits-all” strategy, using fixed templates (e.g., “think step by step”) across diverse reasoning tasks. This method forces models to navigate an extremely complex prompt space to identify effective reasoning paths. The current prompt designing research are also heavily relying on trial-and-error rather than theoretically informed guidance. In this paper, we provide a rigorous theoretical analysis of the complexity and interplay between two crucial spaces: the prompt space (the space of potential prompt structures) and the answer space (the space of reasoning solutions generated by LLMs) in CoT reasoning. We demonstrate how reliance on a single universal prompt (e.g. think step by step) can negatively impact the theoretical computability of LLMs, illustrating that prompt complexity directly influences the structure and effectiveness of the navigation in answer space. Our analysis highlights that sometimes human supervision is critical for efficiently navigating the prompt space. We theoretically and empirically show that task-specific prompting significantly outperforms unsupervised prompt generation, emphasizing the necessity of thoughtful human guidance in CoT prompting.

尽管大型语言模型(LLM)取得了显著的成就,但其基本的Transformer架构具有固有的理论局限性,限制了其处理计算复杂度不断增加的推理任务的能力。Chain-of-Thought(CoT)提示已成为一种实用的解决方案,并得到了一些理论研究的支持。然而,当前的基于CoT的方法(包括ToT、GoT等)通常采用“一提示适合所有”的策略,在多样化的推理任务中使用固定的模板(例如,“一步一步思考”)。这种方法迫使模型在极其复杂的提示空间中识别有效的推理路径。当前关于提示设计的研究还严重依赖于试错,而非理论引导的指南。在本文中,我们对两个关键空间之间的复杂性和相互作用进行了严格的理论分析:提示空间(潜在提示结构的空间)和答案空间(由LLM生成的推理解决方案的空间)在CoT推理中。我们展示了依赖单一通用提示(例如“一步一步思考”)如何对LLM的理论计算能力产生负面影响,并说明了提示复杂性直接影响答案空间中的导航结构及其有效性。我们的分析强调,有时人类的监督对于有效地导航提示空间至关重要。我们从理论和实践上证明,任务特定的提示显著优于无监督的提示生成,强调在CoT提示中需要深思熟虑的人类指导。

论文及项目相关链接

PDF arXiv admin note: substantial text overlap with arXiv:2410.14198

Summary

本文探讨了大型语言模型(LLMs)在应对复杂推理任务时的理论局限性,并介绍了思维链(CoT)提示作为一种实用解决方案。然而,当前CoT方法采用“一刀切”策略,使用固定模板应对各种推理任务,这限制了模型在识别有效推理路径时的理论计算能力。本文严格分析了提示空间(潜在提示结构)和答案空间(LLMs生成的推理解决方案)之间的复杂性和相互作用。文章指出,依赖单一通用提示可能影响LLMs的理论计算能力,提示复杂性直接影响答案空间的结构和有效性。有时,人类的监督对于有效地导航提示空间至关重要。本文理论和实证地表明,任务特定的提示显著优于无监督的提示生成,强调在CoT提示中需要深思熟虑的人类指导。

Key Takeaways

  1. 大型语言模型(LLMs)在应对复杂推理任务时存在理论局限性。
  2. 思维链(CoT)提示是一种实用的解决方案来应对LLMs的局限性。
  3. 当前的CoT方法采用“一刀切”策略,使用固定模板,限制了模型在识别有效推理路径时的能力。
  4. 提示空间和答案空间之间的复杂性及相互作用对LLMs的理论计算能力有影响。
  5. 依赖单一通用提示可能降低LLMs的性能。
  6. 提示的复杂性对答案空间的结构和有效性有直接影响。

Cool Papers

点此查看论文截图

How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game

Authors:Ziyue Wang, Yurui Dong, Fuwen Luo, Minyuan Ruan, Zhili Cheng, Chi Chen, Peng Li, Yang Liu

The rapid advancing of Multimodal Large Language Models (MLLMs) has spurred interest in complex multimodal reasoning tasks in the real-world and virtual environment, which require coordinating multiple abilities, including visual perception, visual reasoning, spatial awareness, and target deduction. However, existing evaluations primarily assess the final task completion, often degrading assessments to isolated abilities such as visual grounding and visual question answering. Less attention is given to comprehensively and quantitatively analyzing reasoning process in multimodal environments, which is crucial for understanding model behaviors and underlying reasoning mechanisms beyond merely task success. To address this, we introduce MM-Escape, an extensible benchmark for investigating multimodal reasoning, inspired by real-world escape games. MM-Escape emphasizes intermediate model behaviors alongside final task completion. To achieve this, we develop EscapeCraft, a customizable and open environment that enables models to engage in free-form exploration for assessing multimodal reasoning. Extensive experiments show that MLLMs, regardless of scale, can successfully complete the simplest room escape tasks, with some exhibiting human-like exploration strategies. Yet, performance dramatically drops as task difficulty increases. Moreover, we observe that performance bottlenecks vary across models, revealing distinct failure modes and limitations in their multimodal reasoning abilities, such as repetitive trajectories without adaptive exploration, getting stuck in corners due to poor visual spatial awareness, and ineffective use of acquired props, such as the key. We hope our work sheds light on new challenges in multimodal reasoning, and uncovers potential improvements in MLLMs capabilities.

多模态大型语言模型(MLLMs)的快速发展激发了现实和虚拟环境中复杂多模态推理任务的兴趣。这些任务需要协调多种能力,包括视觉感知、视觉推理、空间意识和目标推断。然而,现有的评估主要侧重于最终任务完成,经常将评估简化为孤立的技能,如视觉接地和视觉问答。对于在多模态环境中全面和定量地分析推理过程,人们给予的关注度不够。这对于理解模型行为和超越任务成功的底层推理机制至关重要。为了解决这个问题,我们引入了MM-Escape,这是一个可扩展的基准测试,用于研究多模态推理,其灵感来自于现实世界的逃脱游戏。MM-Escape除了最终的任务完成外,还强调模型中间的行为。为了实现这一点,我们开发了EscapeCraft,这是一个可定制和开放的环境,让模型能够进行自由形式的探索,以评估多模态推理。大量实验表明,无论规模大小,MLLMs都可以成功完成最简单的房间逃脱任务,其中一些表现出类似人类的探索策略。然而,随着任务难度的增加,性能急剧下降。此外,我们观察到不同模型的性能瓶颈各不相同,揭示了其多模态推理能力的不同失败模式和局限性,例如重复轨迹缺乏自适应探索、因视觉空间意识差而陷入角落、以及无效地使用获得的道具(如钥匙)。我们希望我们的工作能揭示多模态推理的新挑战,并揭示MLLMs能力的潜在改进。

论文及项目相关链接

PDF

Summary

多模态大型语言模型(MLLMs)的快速进步激发了现实世界和虚拟环境中复杂多模态推理任务的兴趣。然而,现有评估主要关注任务完成的最终效果,忽视了多模态环境中推理过程的综合定量分析,这对于理解模型行为和推理机制至关重要。为此,我们引入了MM-Escape,一个用于研究多模态推理的可扩展基准测试,并开发了EscapeCraft,一个可定制的开源环境,以评估模型在自由形式探索中的多模态推理能力。实验表明,不同规模的多模态语言模型可以在最简单的房间逃生任务中成功完成任务,但面对难度更大的任务时性能会大幅下降。此外,我们还观察到不同模型之间存在性能瓶颈,揭示了其在多模态推理能力方面的不同失败模式和局限性。

Key Takeaways

  1. 多模态大型语言模型(MLLMs)的进步推动了复杂多模态推理任务的发展。
  2. 现有评估主要关注任务完成效果,忽视了多模态推理过程的综合分析。
  3. MM-Escape是一个用于研究多模态推理的基准测试,强调模型在任务完成过程中的中间行为。
  4. EscapeCraft是一个可定制的开源环境,用于评估模型在自由形式探索中的多模态推理能力。
  5. MLLMs可以在简单的房间逃生任务中成功完成任务,但难度增加时性能显著下降。
  6. 不同模型在多模态推理能力方面存在不同的失败模式和局限性。

Cool Papers

点此查看论文截图

OR-LLM-Agent: Automating Modeling and Solving of Operations Research Optimization Problem with Reasoning Large Language Model

Authors:Bowen Zhang, Pengcheng Luo

Operations Research (OR) has been widely applied in various fields such as resource allocation, production planning, and supply chain management. However, addressing real-world OR problems requires OR experts to perform mathematical modeling and programmers to develop solution algorithms. This traditional method, heavily reliant on experts, is costly and has long development cycles, severely limiting the widespread adoption of OR techniques. Few have considered using Artificial Intelligence (AI) to replace professionals to achieve fully automated solutions for OR problems. We propose OR-LLM-Agent, the first AI agent that enables end-to-end automation for solving real-world OR problems. OR-LLM-Agent leverages the Chain-of-Thought (CoT) reasoning capabilities of Large Language Models (LLMs) to translate natural language problem descriptions into formal mathematical models and automatically generate Gurobi solver code. In OR-LLM-Agent, OR-CodeAgent is designed to automate code execution and repair within a sandbox environment, facilitating the derivation of the final solution. Due to the lack of dedicated benchmark datasets for evaluating the automated solving of OR problems, we construct a benchmark dataset comprising 83 real-world OR problems described in natural language. We conduct comparative experiments with state-of-the-art (SOTA) reasoning LLMs, including GPT-o3-mini, DeepSeek-R1, and Gemini 2.0 Flash Thinking. The OR-LLM-Agent achieved the highest pass rate of 100% and the highest solution accuracy of 85%, demonstrating the feasibility of automated OR problem-solving. Data and code have been publicly available at https://github.com/bwz96sco/or_llm_agent.

运筹学(OR)已广泛应用于资源分配、生产计划和供应链管理等各个领域。然而,解决现实世界的运筹学问题,需要运筹学专家进行数学建模和程序员开发解决方案算法。这种传统的方法,严重依赖专家,成本高昂且开发周期长,严重限制了运筹学技术的广泛应用。很少有人考虑使用人工智能(AI)来替代专业人员,以实现运筹学问题的全自动解决方案。我们提出OR-LLM-Agent,这是第一个能够实现解决现实世界运筹学问题的端到端自动化的AI代理。OR-LLM-Agent利用大型语言模型(LLM)的链式思维(CoT)推理能力,将自然语言问题描述翻译成正式的数学模型,并自动生成Gurobi求解器代码。在OR-LLM-Agent中,OR-CodeAgent旨在自动化代码执行和修复,在沙箱环境中进行,便于得出最终解决方案。由于缺乏用于评估运筹学问题自动化求解的专用基准数据集,我们构建了一个包含83个用自然语言描述的现实世界运筹学问题的基准数据集。我们与最新技术(SOTA)推理LLM进行了比较实验,包括GPT-o3-mini、DeepSeek-R1和Gemini 2.0 Flash Thinking。OR-LLM-Agent的通过率达到了最高的100%,解决方案的准确性也达到了最高的85%,证明了自动化解决运筹学问题的可行性。数据和代码已公开在https://github.com/bwz96sco/or_llm_agent

论文及项目相关链接

PDF 11 pages, 6 figures

摘要

本文介绍了操作研究(OR)在资源分配、生产规划和供应链管理等领域的广泛应用。然而,解决现实世界中的OR问题传统上需要依赖专家进行数学建模和程序员开发解决方案算法,这种方法成本高且开发周期长,限制了OR技术的广泛应用。本文提出使用人工智能(AI)代替专业人士,实现OR问题的全自动解决方案。我们提出了OR-LLM-Agent,这是一种能够实现端到端自动化解决现实世界中OR问题的人工智能代理。OR-LLM-Agent利用大型语言模型(LLM)的链式思维(CoT)推理能力,将自然语言问题描述翻译成正式的数学模型,并自动生成Gurobi求解器代码。在OR-LLM-Agent中,设计了OR-CodeAgent在沙箱环境中自动化代码执行和修复,便于得出最终解决方案。由于缺乏专门的基准数据集来评估OR问题的自动化解决程度,我们构建了一个包含83个用自然语言描述的现实生活中OR问题的基准数据集。我们对包括GPT-o3-mini、DeepSeek-R1和Gemini 2.0 Flash Thinking在内的最新推理LLM进行了比较实验。OR-LLM-Agent取得了100%的通过率,解决方案准确性达到85%,证明了自动化解决OR问题的可行性。数据和代码已公开在https://github.com/bwz96sco/or_llm_agent

关键见解

  1. OR问题在传统解决方式上依赖专家进行数学建模和程序员开发,过程成本高且周期长。
  2. 提出使用AI实现OR问题的全自动解决方案,引入OR-LLM-Agent实现端到端自动化。
  3. OR-LLM-Agent利用LLM的CoT推理能力,将自然语言转化为数学模odel,自动生成求解器代码。
  4. 介绍了OR-CodeAgent的设计,用于在沙箱环境中自动化代码执行和修复。
  5. 缺乏专门评估OR问题自动化解决的基准数据集,因此构建了包含83个现实问题的基准数据集。
  6. OR-LLM-Agent在比较实验中表现出优异的性能,通过率高达100%,解决方案准确性达85%。

Cool Papers

点此查看论文截图

ExtremeAIGC: Benchmarking LMM Vulnerability to AI-Generated Extremist Content

Authors:Bhavik Chandna, Mariam Aboujenane, Usman Naseem

Large Multimodal Models (LMMs) are increasingly vulnerable to AI-generated extremist content, including photorealistic images and text, which can be used to bypass safety mechanisms and generate harmful outputs. However, existing datasets for evaluating LMM robustness offer limited exploration of extremist content, often lacking AI-generated images, diverse image generation models, and comprehensive coverage of historical events, which hinders a complete assessment of model vulnerabilities. To fill this gap, we introduce ExtremeAIGC, a benchmark dataset and evaluation framework designed to assess LMM vulnerabilities against such content. ExtremeAIGC simulates real-world events and malicious use cases by curating diverse text- and image-based examples crafted using state-of-the-art image generation techniques. Our study reveals alarming weaknesses in LMMs, demonstrating that even cutting-edge safety measures fail to prevent the generation of extremist material. We systematically quantify the success rates of various attack strategies, exposing critical gaps in current defenses and emphasizing the need for more robust mitigation strategies.

大型多模态模型(LMM)越来越容易受到AI生成的极端内容的影响,包括逼真的图像和文本,这些内容可用于绕过安全机制并产生有害输出。然而,现有的用于评估LMM稳健性的数据集在探索极端内容方面提供了有限的研究,往往缺乏AI生成的图像、多样化的图像生成模型以及对历史事件的综合覆盖,这阻碍了模型漏洞的全面评估。为了填补这一空白,我们引入了ExtremeAIGC,这是一个用于评估LMM对此类内容漏洞的基准数据集和评估框架。ExtremeAIGC通过收集使用最新图像生成技术制作的多样化文本和图像示例,模拟现实世界的事件和恶意用例。我们的研究发现大型多模态模型存在令人担忧的弱点,表明即使是最前沿的安全措施也无法防止极端材料的生成。我们系统地量化了各种攻击策略的成功率,暴露了当前防御手段中的关键差距,并强调需要更稳健的缓解策略。

论文及项目相关链接

PDF Preprint

Summary

大型多模态模型(LMMs)易受AI生成极端内容的影响,包括逼真的图像和文字。现有评估LMM稳健性的数据集对极端内容的探索有限,缺乏AI生成的图像、多样的图像生成模型和历史的全面覆盖,这阻碍了模型漏洞的完全评估。为解决此问题,我们推出了ExtremeAIGC,一个用于评估LMM对此类内容脆弱性的基准数据集和评估框架。ExtremeAIGC通过策划使用最新图像生成技术制作的多样文本和图像样本,模拟现实世界事件和恶意用例。我们的研究揭示了LMM惊人的弱点,证明即使是最先进的安全措施也无法防止极端材料的生成。我们系统地量化了各种攻击策略的成功率,突显了当前防御措施中的关键漏洞,并强调了需要更稳健的缓解策略。

Key Takeaways

  1. 大型多模态模型(LMMs)易受AI生成的极端内容影响,包括逼真的图像和文字。
  2. 现有数据集在评估LMM对极端内容的稳健性方面存在局限性,缺乏AI生成的图像和全面的历史事件覆盖。
  3. ExtremeAIGC基准数据集和评估框架用于评估LMM对极端内容的脆弱性。
  4. ExtremeAIGC模拟现实事件和恶意用例,通过多样文本和图像样本展示攻击策略。
  5. 研究发现LMM存在惊人弱点,即使安全措施也无法完全防止生成极端材料。
  6. 系统量化各种攻击策略的成功率,突显当前防御措施的关键漏洞。

Cool Papers

点此查看论文截图

Conversational Gold: Evaluating Personalized Conversational Search System using Gold Nuggets

Authors:Zahra Abbasiantaeb, Simon Lupart, Leif Azzopardi, Jeffery Dalton, Mohammad Aliannejadi

The rise of personalized conversational search systems has been driven by advancements in Large Language Models (LLMs), enabling these systems to retrieve and generate answers for complex information needs. However, the automatic evaluation of responses generated by Retrieval Augmented Generation (RAG) systems remains an understudied challenge. In this paper, we introduce a new resource for assessing the retrieval effectiveness and relevance of response generated by RAG systems, using a nugget-based evaluation framework. Built upon the foundation of TREC iKAT 2023, our dataset extends to the TREC iKAT 2024 collection, which includes 17 conversations and 20,575 relevance passage assessments, together with 2,279 extracted gold nuggets, and 62 manually written gold answers from NIST assessors. While maintaining the core structure of its predecessor, this new collection enables a deeper exploration of generation tasks in conversational settings. Key improvements in iKAT 2024 include: (1) ``gold nuggets’’ – concise, essential pieces of information extracted from relevant passages of the collection – which serve as a foundation for automatic response evaluation; (2) manually written answers to provide a gold standard for response evaluation; (3) unanswerable questions to evaluate model hallucination; (4) expanded user personas, providing richer contextual grounding; and (5) a transition from Personal Text Knowledge Base (PTKB) ranking to PTKB classification and selection. Built on this resource, we provide a framework for long-form answer generation evaluation, involving nuggets extraction and nuggets matching, linked to retrieval. This establishes a solid resource for advancing research in personalized conversational search and long-form answer generation. Our resources are publicly available at https://github.com/irlabamsterdam/CONE-RAG.

随着大型语言模型(LLM)的进步,个性化对话搜索系统的崛起使得这些系统能够检索和生成针对复杂信息需求的答案。然而,由检索增强生成(RAG)系统生成的响应的自动评估仍然是一个被忽视的挑战。在本文中,我们介绍了一个使用基于片段的评估框架来评估RAG系统生成的检索有效性和响应相关性的新资源。我们的数据集建立在TREC iKAT 2023的基础上,扩展到TREC iKAT 2024集合,其中包括17个对话和20,575个相关段落评估,以及从相关段落中提取的2,279个黄金片段和NIST评估人员编写的62个手动黄金答案。在保持其前身的核心结构的同时,这个新集合使我们对对话环境中的生成任务进行了更深入的探索。iKAT 2024的关键改进包括:(1)“黄金片段”——从集合的相关段落中提取的简洁、必要的信息片段——作为自动响应评估的基础;(2)手动编写的答案,为响应评估提供黄金标准;(3)无法回答的问题以评估模型的幻觉;(4)扩展的用户角色,提供更丰富的上下文背景;(5)从个人文本知识库(PTKB)排名过渡到PTKB分类和选择。基于这一资源,我们提供了一个框架,用于长形式答案生成评估,涉及片段提取和与检索相关的片段匹配。这为个性化对话搜索和长形式答案生成的研究进步建立了坚实的资源。我们的资源在https://github.com/irlabamsterdam/CONE-RAG公开可用。

论文及项目相关链接

PDF

Summary

本文介绍了基于大型语言模型(LLMs)的个性化对话搜索系统的进步,并指出评估这些系统生成的响应的检索效果和相关性仍是一个挑战。为此,本文引入了一个新的评估资源,使用基于“精粹”的评估框架来评估RAG系统生成的响应的检索效果。该数据集包括TREC iKAT 2024集合,含有17个对话和超过两万条相关段落评估,以及从相关段落中提取的金精粹和NIST评估者撰写的金标准答案。该数据集的关键改进包括金精粹、手动编写的答案、无法回答的问题、扩展的用户角色以及从个人文本知识库(PTKB)排名到PTKB分类和选择的转变。这些资源为推进个性化对话搜索和长形式答案生成的研究提供了坚实的基础。

Key Takeaways

  1. 大型语言模型(LLMs)的进展推动了个性化对话搜索系统的崛起。
  2. 评估RAG系统生成的响应的检索效果和相关性是一个未充分研究的挑战。
  3. 引入了新的资源,使用基于“精粹”的评估框架来评估RAG系统的响应。
  4. TREC iKAT 2024集合包含17个对话、超过两万条相关段落评估。
  5. 数据集包括从相关段落中提取的金精粹和手动编写的金标准答案。
  6. 数据集的关键改进包括金精粹、无法回答的问题、扩展的用户角色等。

Cool Papers

点此查看论文截图

Local Look-Ahead Guidance via Verifier-in-the-Loop for Automated Theorem Proving

Authors:Sara Rajaee, Kumar Pratik, Gabriele Cesa, Arash Behboodi

The most promising recent methods for AI reasoning require applying variants of reinforcement learning (RL) either on rolled out trajectories from the model, even for the step-wise rewards, or large quantities of human annotated trajectory data. The reliance on the rolled-out trajectory renders the compute cost and time prohibitively high. In particular, the correctness of a reasoning trajectory can typically only be judged at its completion, leading to sparse rewards in RL or requiring expensive synthetic data generation in expert iteration-like methods. In this work, we focus on the Automatic Theorem Proving (ATP) task and propose a novel verifier-in-the-loop design, which unlike existing approaches that leverage feedback on the entire reasoning trajectory, employs an automated verifier to give intermediate feedback at each step of the reasoning process. Using Lean as the verifier, we empirically show that the step-by-step local verification produces a global improvement in the model’s reasoning accuracy and efficiency.

近期最有前景的人工智能推理方法需要应用强化学习(RL)的变种,要么在模型展开的轨迹上,甚至对于逐步奖励,要么需要大量的人工注释轨迹数据。对展开轨迹的依赖使得计算成本和时间非常高昂。特别地,推理轨迹的正确性通常只能在完成时进行判断,这导致强化学习中的奖励稀疏,或需要昂贵的合成数据生成,类似于专家迭代的方法。在这项工作中,我们关注自动定理证明(ATP)任务,并提出了一种新型的在循环验证器设计,它不同于那些依赖整个推理轨迹反馈的现有方法,而是采用自动化验证器在推理过程的每一步提供中间反馈。使用Lean作为验证器,我们从实证上证明,逐步的局部验证提高了模型推理的准确性和效率。

论文及项目相关链接

PDF Accepted at ICLR 2025 Workshop on Reasoning and Planning for Large Language Models

Summary
最近的人工智能推理方法主要依赖于强化学习(RL)在各种模型轨迹上的应用,包括逐步奖励和大量的人类注释轨迹数据。然而,依赖于展开的轨迹使得计算成本和时间变得非常高昂。本工作专注于自动定理证明(ATP)任务,提出了一种新颖的循环验证器设计,与传统的对整个推理轨迹进行反馈的方法不同,该设计采用自动化验证器在推理过程的每一步提供中间反馈。使用Lean作为验证器,实证表明,逐步的局部验证提高了模型的推理准确性和效率。

Key Takeaways

  1. 人工智能推理方法主要依赖强化学习在各种模型轨迹上的应用。
  2. 对展开的推理轨迹的依赖使得计算成本和时间变得非常高昂。
  3. 当前工作提出了一个新颖的循环验证器设计,该设计在推理过程的每一步提供中间反馈。
  4. 与传统方法不同,新方法采用Lean作为验证器进行逐步的局部验证。
  5. 实证研究表明,逐步的局部验证有助于提高模型的推理准确性。
  6. 这种方法还提高了推理效率。

Cool Papers

点此查看论文截图

Probabilistic Reasoning with LLMs for k-anonymity Estimation

Authors:Jonathan Zheng, Sauvik Das, Alan Ritter, Wei Xu

Probabilistic reasoning is a key aspect of both human and artificial intelligence that allows for handling uncertainty and ambiguity in decision-making. In this paper, we introduce a novel numerical reasoning task under uncertainty, focusing on estimating the k-anonymity of user-generated documents containing privacy-sensitive information. We propose BRANCH, which uses LLMs to factorize a joint probability distribution to estimate the k-value-the size of the population matching the given information-by modeling individual pieces of textual information as random variables. The probability of each factor occurring within a population is estimated using standalone LLMs or retrieval-augmented generation systems, and these probabilities are combined into a final k-value. Our experiments show that this method successfully estimates the correct k-value 67% of the time, an 11% increase compared to GPT-4o chain-of-thought reasoning. Additionally, we leverage LLM uncertainty to develop prediction intervals for k-anonymity, which include the correct value in nearly 92% of cases.

概率推理是人类和人工智能中的关键方面,能够处理决策中的不确定性和模糊性。在本文中,我们介绍了一种新型的概率推理任务,重点估计包含隐私敏感信息的用户生成文档的k匿名度。我们提出了BRANCH方法,它使用大型语言模型对联合概率分布进行分解估计k值(与给定信息匹配的人口数量),并将个别文本信息片段作为随机变量进行建模。通过使用独立的大型语言模型或检索增强生成系统来估计每个因素在人口中的出现概率,并将这些概率合并到最终的k值中。我们的实验表明,该方法成功估计出正确的k值的概率达到67%,比GPT-4的深度思考推理提高了11%。此外,我们还利用大型语言模型的不确定性为k匿名性开发预测区间,在接近92%的情况下包含正确的值。

论文及项目相关链接

PDF 9 pages

Summary

本文介绍了一种新型数值推理任务——不确定性下的k-匿名性估算。文章重点讨论如何估计包含隐私敏感信息的用户生成文档的k-匿名性。为此,提出了基于大型语言模型(LLM)的方法BRANCH,它通过分解联合概率分布来估算k值。实验表明,该方法成功估算k值的准确率为67%,相较于GPT-4链式思维推理提升了11%。此外,文章还利用LLM的不确定性为k-匿名性开发了预测区间,其包含正确值的比例接近92%。

Key Takeaways

  1. 介绍了在不确定性下处理k-匿名性估算的新型数值推理任务。
  2. BRANCH方法利用大型语言模型(LLM)来估算k-匿名性。
  3. BRANCH通过分解联合概率分布来估算k值。
  4. BRANCH方法的实验显示,其成功估算k值的准确率为67%。
  5. 与GPT-4链式思维推理相比,BRANCH方法的准确率提升了11%。
  6. 利用LLM的不确定性为k-匿名性开发预测区间。

Cool Papers

点此查看论文截图

Chain-of-Thought Reasoning In The Wild Is Not Always Faithful

Authors:Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, Arthur Conmy

Chain-of-Thought (CoT) reasoning has significantly advanced state-of-the-art AI capabilities. However, recent studies have shown that CoT reasoning is not always faithful, i.e. CoT reasoning does not always reflect how models arrive at conclusions. So far, most of these studies have focused on unfaithfulness in unnatural contexts where an explicit bias has been introduced. In contrast, we show that unfaithful CoT can occur on realistic prompts with no artificial bias. Our results reveal non-negligible rates of several forms of unfaithful reasoning in frontier models: Sonnet 3.7 (16.3%), DeepSeek R1 (5.3%) and ChatGPT-4o (7.0%) all answer a notable proportion of question pairs unfaithfully. Specifically, we find that models rationalize their implicit biases in answers to binary questions (“implicit post-hoc rationalization”). For example, when separately presented with the questions “Is X bigger than Y?” and “Is Y bigger than X?”, models sometimes produce superficially coherent arguments to justify answering Yes to both questions or No to both questions, despite such responses being logically contradictory. We also investigate restoration errors (Dziri et al., 2023), where models make and then silently correct errors in their reasoning, and unfaithful shortcuts, where models use clearly illogical reasoning to simplify solving problems in Putnam questions (a hard benchmark). Our findings raise challenges for AI safety work that relies on monitoring CoT to detect undesired behavior.

“链式思维(Chain-of-Thought,简称CoT)推理在人工智能领域取得了显著的进步。然而,最近的研究表明,CoT推理并不总是可靠的,即CoT推理并不总是反映模型如何得出结论。迄今为止,大多数研究主要集中在非自然环境下的不忠实情况,即引入明确偏见的环境。与此相反,我们证明了在真实的提示且没有人为偏见的情况下也会出现不忠实的CoT。我们的结果揭示了前沿模型中不可忽略的多种不忠实推理形式的发生率:Sonnet 3.7(16.3%)、DeepSeek R1(5.3%)和ChatGPT-4o(7.0%)都会回答相当比例的问题对时不忠实。具体来说,我们发现模型在回答二元问题时(如“X是否大于Y?”和“Y是否大于X?”),会为自己的答案进行事后合理化。有时模型会提供看似合理的论据来证明两个问题的答案都是“是”或都是“否”,尽管这样的回答在逻辑上是矛盾的。我们还研究了模型在推理过程中出现的修复错误和不忠实的简化策略,即在解决普特南问题(一个困难的基准测试)时,模型采用明显不合逻辑的推理来简化问题。我们的研究对依赖于监控CoT来检测不期望行为的AI安全工作提出了挑战。”

论文及项目相关链接

PDF Accepted to the Reasoning and Planning for Large Language Models Workshop (ICLR 25), 10 main paper pages, 38 appendix pages

Summary

本文探讨了Chain-of-Thought(CoT)推理在先进AI能力中的重要作用,但研究发现CoT推理并不总是忠实于模型的决策过程。过去的研究主要集中在非自然情境下的不忠实现象,而本文则展示了在现实提示下也会出现不忠实的CoT。研究结果显示,前沿模型如Sonnet 3.7、DeepSeek R1和ChatGPT-4o存在不可忽略的不忠实推理现象。模型会在回答二元问题时理性化其隐含偏见(隐性事后合理化),如回答两个逻辑矛盾的问题时,会提供看似合理的论证。此外,还探讨了模型在推理中的修复错误和不忠实的简化策略所带来的挑战。

Key Takeaways

  1. Chain-of-Thought(CoT)推理对先进AI能力有重要作用,但存在不忠实现象。
  2. 不忠实CoT可出现在现实提示中,而非仅限于非自然情境。
  3. 多个前沿模型存在不忠实推理现象,如Sonnet 3.7、DeepSeek R1和ChatGPT-4o。
  4. 模型会在回答二元问题时理性化其隐含偏见。
  5. 模型会出现修复错误的情况,即默默地纠正推理中的错误。
  6. 模型存在不忠实的简化策略,如在解决Putnam问题时使用明显不合逻辑的推理。

Cool Papers

点此查看论文截图


文章作者: Kedreamix
版权声明: 本博客所有文章除特別声明外,均采用 CC BY 4.0 许可协议。转载请注明来源 Kedreamix !
 上一篇
LLM LLM
LLM 方向最新论文已更新,请持续关注 Update in 2025-03-15 HybridVLA Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model
2025-03-15
下一篇 
Text-to-Motion Text-to-Motion
Text-to-Motion 方向最新论文已更新,请持续关注 Update in 2025-03-14 PersonaBooth Personalized Text-to-Motion Generation
2025-03-14
  目录