⚠️ 以下所有内容总结均由大语言模型生成，如有错误仅供参考，请谨慎使用
🔴 请注意：千万不要将其用于严肃的学术场景，只能用于论文阅读前的初筛！
💗 如果您觉得我们的项目 ChatPaperFree 对您有帮助，还请给我们一些鼓励！⭐️ HuggingFace 免费体验
2025-10-20 更新
Multimodal Retrieval-Augmented Generation with Large Language Models for Medical VQA
Authors:A H M Rezaul Karim, Ozlem Uzuner
Medical Visual Question Answering (MedVQA) enables natural language queries over medical images to support clinical decision-making and patient care. The MEDIQA-WV 2025 shared task addressed wound-care VQA, requiring systems to generate free-text responses and structured wound attributes from images and patient queries. We present the MasonNLP system, which employs a general-domain, instruction-tuned large language model with a retrieval-augmented generation (RAG) framework that incorporates textual and visual examples from in-domain data. This approach grounds outputs in clinically relevant exemplars, improving reasoning, schema adherence, and response quality across dBLEU, ROUGE, BERTScore, and LLM-based metrics. Our best-performing system ranked 3rd among 19 teams and 51 submissions with an average score of 41.37%, demonstrating that lightweight RAG with general-purpose LLMs – a minimal inference-time layer that adds a few relevant exemplars via simple indexing and fusion, with no extra training or complex re-ranking – provides a simple and effective baseline for multimodal clinical NLP tasks.
医疗视觉问答（MedVQA）支持对医疗图像进行自然语言查询，以辅助临床决策和患者护理。MEDIQA-WV 2025共享任务针对伤口护理VQA，要求系统根据图像和患者查询生成自由文本回答及结构化的伤口属性。我们提出了MasonNLP系统：它采用通用领域、经指令微调的大型语言模型，并结合检索增强生成（RAG）框架，从领域内数据中引入文本和视觉范例。该方法以临床相关范例为依据来约束输出，在dBLEU、ROUGE、BERTScore以及基于LLM的指标上提升了推理能力、模式（schema）遵循度和回答质量。我们表现最佳的系统在19支队伍、51次提交中排名第3，平均得分41.37%。这表明：基于通用LLM的轻量级RAG——仅在推理时通过简单索引与融合添加少量相关范例，无需额外训练或复杂重排序——为多模态临床NLP任务提供了一个简单而有效的基线。
论文及项目相关链接
Summary
医学视觉问答(MedVQA)能够通过自然语言查询支持临床决策和患者护理。MEDIQA-WV 2025共享任务关注伤口护理领域的问答系统,要求系统根据图像和患者查询生成自由文本回答和结构化伤口属性。本文介绍了MasonNLP系统,该系统采用通用指令优化的大型语言模型,结合领域内的文本和视觉示例,以检索增强生成(RAG)框架为基础,输出与临床相关的范例,提高推理、模式遵守和响应质量。MasonNLP系统在多项指标上的表现良好,在最佳模型评估下排名第3。这一研究证明采用通用的大型语言模型和轻量化增强生成技术,能够在无需额外训练或复杂排序的情况下,为临床自然语言处理任务提供简单有效的基线模型。
Key Takeaways
- MedVQA技术通过自然语言查询支持医疗图像的临床决策和患者护理。
- MEDIQA-WV 2025共享任务集中在伤口护理问答系统上。
- MasonNLP系统使用大型语言模型和检索增强生成框架。
- 系统结合了领域内的文本和视觉示例来优化输出,使之更贴近临床实际情况。
- 该系统在多项评价指标上表现优异,验证了方法的实用性和有效性。
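作为补充，下面给出一个极简的Python草图，示意摘要中“通过简单索引检索少量领域内范例并融合进提示”的轻量级RAG思路。其中的样例池、相似度计算（词袋余弦）与提示模板均为假设，仅用于说明流程，并非MasonNLP系统的实际实现（实际系统还会检索视觉范例并调用多模态LLM）。

```python
# 最小示意：用词袋余弦相似度从领域内样例池中检索 top-k 范例，再以 few-shot 形式拼入提示。
import math
from collections import Counter

def bow(text: str) -> Counter:
    """把文本切成小写词袋（极简分词，仅作演示）。"""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """两个词袋向量的余弦相似度。"""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_exemplars(query: str, pool: list[dict], k: int = 2) -> list[dict]:
    """按与查询的文本相似度返回 top-k 领域内范例（假设 pool 中每条含 query/answer 字段）。"""
    scored = sorted(pool, key=lambda ex: cosine(bow(query), bow(ex["query"])), reverse=True)
    return scored[:k]

def build_prompt(query: str, exemplars: list[dict]) -> str:
    """把检索到的范例以 few-shot 形式拼入提示，引导模型遵循输出格式。"""
    parts = ["You are a wound-care assistant. Answer the patient query and fill the wound attributes."]
    for ex in exemplars:
        parts.append(f"Example query: {ex['query']}\nExample answer: {ex['answer']}")
    parts.append(f"Patient query: {query}\nAnswer:")
    return "\n\n".join(parts)

if __name__ == "__main__":
    pool = [  # 假设的领域内样例池
        {"query": "How should I clean a shallow abrasion on my knee?",
         "answer": "Rinse with saline, apply a thin antibiotic layer, cover with a sterile dressing."},
        {"query": "My surgical wound edges look red, is this normal?",
         "answer": "Mild redness can be normal; spreading redness, warmth or discharge needs clinical review."},
    ]
    q = "The skin around my stitches is getting redder every day."
    print(build_prompt(q, retrieve_exemplars(q, pool, k=1)))
```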
点此查看论文截图




Adaptive Selection of Symbolic Languages for Improving LLM Logical Reasoning
Authors:Xiangyu Wang, Haocheng Yang, Fengxiang Cheng, Fenrong Liu
Large Language Models (LLMs) still struggle with complex logical reasoning. While previous works achieve remarkable improvements, their performance is highly dependent on the correctness of translating natural language (NL) problems into a symbolic language (SL). Though numerous works focusing on improving this translation accuracy, they only consider the similarity between the meaning of SL and NL, overlooking another crucial influencing factor, the selection of the target SL type itself. For example, first-order logic language specializes in logical reasoning with categorical syllogisms and complex quantifiers, while Boolean satisfiability formalism excels at representing constraint satisfaction like partial problems. To our knowledge, this is the first paper to claim and verify that different NL logical reasoning problem corresponds to different optimal SL formalization for translation. Based on this, we propose a methods to improve the logical reasoning performance of LLMs by adaptively selecting the most suitable SL for each problem prior to translation. Specifically, we leverage LLMs to select the target SL among first-order logic, logic programming and Boolean satisfiability and then translate the problem in NL to target SL expressions as well as employ the corresponding logical solver to derive the final answer. Experimental results on benchmarks show that our adaptive selection method significantly outperforms translating all into single SL and randomly selecting the SL. On a mixed dataset of these benchmarks, our approach achieves 96% accuracy, which improving performance by 25% compared to the second highest accuracy from the first-order logic translation.
大型语言模型（LLM）在复杂逻辑推理方面仍存在困难。尽管之前的研究取得了显著改进，其性能高度依赖于将自然语言（NL）问题正确翻译为符号语言（SL）。许多研究致力于提高这种翻译的准确性，但它们只考虑了SL与NL在语义上的相似性，而忽视了另一个关键影响因素：目标SL类型本身的选择。例如，一阶逻辑擅长处理涉及直言三段论和复杂量词的逻辑推理，而布尔可满足性形式化则擅长表示约束满足类问题。据我们所知，这是首篇提出并验证“不同的自然语言逻辑推理问题对应不同的最优符号语言形式化”的论文。基于此，我们提出一种方法：在翻译之前为每个问题自适应地选择最合适的SL，以提升LLM的逻辑推理性能。具体来说，我们利用LLM在一阶逻辑、逻辑编程和布尔可满足性三者中选择目标SL，然后将自然语言问题翻译为目标SL表达式，并调用相应的逻辑求解器推导出最终答案。基准测试的实验结果表明，我们的自适应选择方法显著优于将所有问题翻译为单一SL以及随机选择SL的做法。在这些基准的混合数据集上，我们的方法达到了96%的准确率，相比次高的一阶逻辑翻译方案提升了25%。
论文及项目相关链接
Summary
大型语言模型在复杂的逻辑推理方面仍存在挑战。虽然之前的研究取得了显著进展,但其性能高度依赖于将自然语言问题正确翻译为符号语言的能力。许多研究集中在提高这种翻译的准确性上,但往往忽视了目标符号语言类型本身的选择也是关键影响因素。本文首次提出并验证了不同的自然语言逻辑推理问题对应不同的最佳符号语言形式化翻译。基于此,本文提出了一种方法,通过自适应选择每个问题最合适的符号语言进行翻译,以提高大型语言模型的逻辑推理性能。在多个基准测试上进行的实验表明,与将所有问题翻译成单一的符号语言以及随机选择符号语言的方法相比,本文提出的自适应选择方法具有显著优势。在混合数据集上,我们的方法达到了96%的准确率,相较于此前最佳的一阶逻辑翻译方法提升了25%。
Key Takeaways
- 大型语言模型在复杂逻辑推理方面存在挑战。
- 自然语言问题正确翻译成符号语言对模型性能有重要影响。
- 符号语言类型选择对逻辑推理性能有重要影响。
- 本文首次提出并验证了不同的自然语言逻辑问题对应不同的最佳符号语言形式化翻译。
- 提出了一种自适应选择最适合的符号语言进行翻译的方法。
- 实验结果显示自适应选择方法在多个基准测试上具有显著优势。
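下面用一个示意性的Python草图说明“先自适应选择目标符号语言、再翻译并调用相应求解器”的流程。其中 call_llm 仅是模拟选择的占位函数，translators 与 solvers 也是演示桩；真实系统需要接入具体的LLM以及一阶逻辑/逻辑编程/SAT求解器。

```python
# 示意性草图：先让 LLM 在三种符号语言中选择，再路由到对应的翻译与求解流程。
from typing import Callable

SL_CHOICES = ("first_order_logic", "logic_programming", "boolean_sat")

def call_llm(prompt: str) -> str:
    """占位：返回 LLM 的文本输出。这里用简单的关键词规则模拟选择，仅供演示。"""
    text = prompt.lower()
    if "every" in text or "some" in text:      # 含量词的题目倾向一阶逻辑
        return "first_order_logic"
    if "constraint" in text or "assign" in text:
        return "boolean_sat"
    return "logic_programming"

def solve_problem(nl_problem: str,
                  translators: dict,
                  solvers: dict) -> str:
    """自适应选择目标符号语言 -> 翻译为该语言的表达式 -> 调用对应求解器。"""
    sl = call_llm(f"Choose one of {SL_CHOICES} for this problem:\n{nl_problem}")
    if sl not in SL_CHOICES:
        sl = "first_order_logic"          # 兜底选择
    formal = translators[sl](nl_problem)  # 把自然语言问题翻译成目标符号语言表达式
    return solvers[sl](formal)            # 用相应的逻辑求解器得到最终答案

if __name__ == "__main__":
    # 假设的翻译器与求解器，全部为演示桩
    translators: dict[str, Callable[[str], str]] = {
        sl: (lambda nl, sl=sl: f"[{sl} encoding of] {nl}") for sl in SL_CHOICES}
    solvers: dict[str, Callable[[str], str]] = {
        sl: (lambda expr: f"answer derived from {expr}") for sl in SL_CHOICES}
    print(solve_problem("Every philosopher reads some book. Does Plato read a book?",
                        translators, solvers))
```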
点此查看论文截图



VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning
Authors:Qunzhong Wang, Jie Liu, Jiajun Liang, Yilei Jiang, Yuanxing Zhang, Jinyuan Chen, Yaozhi Zheng, Xintao Wang, Pengfei Wan, Xiangyu Yue, Jiaheng Liu
Recent advancements in multimodal reward models (RMs) have substantially improved post-training for visual generative models. However, current RMs face inherent limitations: (1) visual inputs consume large context budgets, forcing fewer frames and causing loss of fine-grained details; and (2) all visual information is packed into the initial prompt, exacerbating hallucination and forgetting during chain-of-thought reasoning. To overcome these issues, we introduce VideoReward Thinker (VR-Thinker), a thinking-with-image framework that equips the RM with visual reasoning operations (e.g., select frame) and a configurable visual memory window. This allows the RM to actively acquire and update visual evidence within context limits, improving reasoning fidelity and reliability. We activate visual reasoning via a reinforcement fine-tuning pipeline: (i) Cold Start with curated visual chain-of-thought data to distill basic reasoning skills and operation formatting; (ii) select samples whose per-dimension and overall judgments are all correct, then conduct Rejection sampling Fine-Tuning on these high-quality traces to further enhance reasoning; and (iii) apply Group Relative Policy Optimization (GRPO) to strengthen reasoning. Our approach delivers state-of-the-art accuracy among open-source models on video preference benchmarks, especially for longer videos: a 7B VR-Thinker achieves 80.5% on VideoGen Reward, 82.3% on GenAI-Bench, and 75.6% on MJ-Bench-Video. These results validate the effectiveness and promise of thinking-with-image multimodal reward modeling.
近期多模态奖励模型（RM）的进展显著改进了视觉生成模型的后训练。然而，当前的RM存在固有局限：（1）视觉输入消耗大量上下文预算，迫使只能输入较少的帧，从而丢失细粒度细节；（2）所有视觉信息被打包进初始提示，加剧了思维链推理过程中的幻觉和遗忘。为克服这些问题，我们提出了VideoReward Thinker（VR-Thinker），这是一种“以图像思考”（thinking-with-image）的框架，为RM配备了视觉推理操作（例如选帧）和可配置的视觉记忆窗口。这使RM能够在上下文限制内主动获取和更新视觉证据，提高推理的保真度和可靠性。我们通过一套强化微调流程来激活视觉推理：（i）冷启动：使用精选的视觉思维链数据，蒸馏基本的推理技能和操作格式；（ii）筛选各维度判断与总体判断全部正确的样本，并在这些高质量轨迹上进行拒绝采样微调（Rejection sampling Fine-Tuning），进一步增强推理；（iii）应用组相对策略优化（GRPO）强化推理能力。我们的方法在视频偏好基准上取得了开源模型中的最先进准确率，在较长视频上尤为突出：7B规模的VR-Thinker在VideoGen Reward上达到80.5%，在GenAI-Bench上达到82.3%，在MJ-Bench-Video上达到75.6%。这些结果验证了以图像思考的多模态奖励建模的有效性与前景。
论文及项目相关链接
Summary
本文介绍了针对视觉生成模型的后训练改进中遇到的挑战,如视觉输入消耗大量上下文预算和视觉信息过于集中导致的幻觉和遗忘问题。为解决这些问题,研究团队推出了VideoReward Thinker(VR-Thinker)框架,配备了视觉推理操作,包括选择帧等功能。此框架允许模型在上下文限制内主动获取和更新视觉证据,提高了推理的准确性和可靠性。通过强化微调管道激活视觉推理,包括冷启动阶段、拒绝采样微调以及群体相对策略优化等步骤。在视频偏好基准测试中,该方法的准确性达到了领先水平,特别是在处理长视频时表现优异。这一结果验证了思考型图像多模态奖励建模的有效性和前景。
Key Takeaways
- 当前多模态奖励模型(RMs)在视觉生成模型的训练后改进面临两大问题:一是视觉输入占用大量上下文预算导致缺少精细粒度的细节,二是所有视觉信息都集中于初始提示导致幻觉和遗忘问题。
- VR-Thinker框架通过配备视觉推理操作(如选择帧)和可配置的视觉记忆窗口来解决这些问题。它允许模型在上下文限制内主动获取和更新视觉证据,从而提高推理的准确性和可靠性。
- VR-Thinker框架采用强化微调管道来激活视觉推理,包括冷启动阶段、拒绝采样微调以及群体相对策略优化等步骤。这三个步骤共同提高了模型的推理能力。
点此查看论文截图



Adaptive Dual Reasoner: Large Reasoning Models Can Think Efficiently by Hybrid Reasoning
Authors:Yujian Zhang, Keyu Chen, Zhifeng Shen, Ruizhi Qiao, Xing Sun
Although Long Reasoning Models (LRMs) have achieved superior performance on various reasoning scenarios, they often suffer from increased computational costs and inference latency caused by overthinking. To address these limitations, we propose Adaptive Dual Reasoner, which supports two reasoning modes: fast thinking and slow thinking. ADR dynamically alternates between these modes based on the contextual complexity during reasoning. ADR is trained in two stages: (1) A cold-start stage using supervised fine-tuning (SFT) to equip the model with the ability to integrate both fast and slow reasoning modes, in which we construct a hybrid reasoning dataset through a dedicated pipeline to provide large-scale supervision. (2) A reinforcement learning stage for optimizing reasoning effort, where we introduce Entropy-guided Hybrid Policy Optimization EHPO, an RL training framework employing an entropy-guided dynamic rollout strategy for branching at high-entropy units and a difficulty-aware penalty to balance fast and slow reasoning. Across challenging mathematical reasoning benchmarks, ADR achieves an effective balance between reasoning performance and efficiency among state-of-the-art approaches. Specifically, ADR yields a performance gain of up to 6.1%, while reducing the reasoning output length by 49.5% to 59.3%.
尽管长推理模型（LRM）在各种推理场景中取得了卓越性能，但它们常常因过度思考而带来更高的计算成本和推理延迟。为解决这些局限，我们提出了自适应双推理器（ADR），它支持快速思考和慢速思考两种推理模式，并根据推理过程中的上下文复杂度在两种模式间动态切换。ADR的训练分为两个阶段：（1）冷启动阶段：使用监督微调（SFT）使模型具备融合快、慢两种推理模式的能力，其中我们通过专用流程构建混合推理数据集以提供大规模监督；（2）强化学习阶段：用于优化推理开销，我们提出了熵引导混合策略优化（EHPO），这一RL训练框架采用熵引导的动态rollout策略在高熵单元处进行分支，并引入难度感知惩罚来平衡快、慢推理。在具有挑战性的数学推理基准上，ADR在各最新方法中实现了推理性能与效率的有效平衡：性能最多提升6.1%，同时将推理输出长度缩短49.5%至59.3%。
论文及项目相关链接
PDF Accepted to NeurIPS 2025 Workshop on Efficient Reasoning
Summary
自适应双推理器（ADR）是针对长推理模型（LRM）计算成本高和推理延迟大的问题提出的解决方案。它支持快速和慢速两种推理模式，并根据上下文复杂度动态切换。ADR的训练分为两个阶段：首先是使用监督微调（SFT）的冷启动阶段，赋予模型融合两种推理模式的能力；其次是强化学习阶段，用于优化推理开销。EHPO是一种RL训练框架，采用熵引导的动态rollout策略在高熵单元处分支，并引入难度感知惩罚来平衡快、慢推理。在具有挑战性的数学推理基准上，ADR在最新方法中实现了推理性能与效率的有效平衡，性能提升最多达6.1%，同时将推理输出长度缩短49.5%至59.3%。
Key Takeaways
- 自适应双推理器(ADR)旨在解决长推理模型(LRMs)的计算成本和推理延迟问题。
- ADR支持快速和慢速两种推理模式,并可根据上下文复杂性动态切换。
- ADR的训练分为监督微调(SFT)的冷启动阶段和强化学习阶段。
- EHPO是一种RL训练框架，采用熵引导的动态rollout策略，在高熵单元处进行分支决策。
- ADR在难度感知惩罚机制下平衡了快速和慢速推理,实现了推理性能和效率的优化。
点此查看论文截图



The Idola Tribus of AI: Large Language Models tend to perceive order where none exists
Authors:Shin-nosuke Ishikawa, Masato Todo, Taiki Ogihara, Hirotsugu Ohba
We present a tendency of large language models (LLMs) to generate absurd patterns despite their clear inappropriateness in a simple task of identifying regularities in number series. Several approaches have been proposed to apply LLMs to complex real-world tasks, such as providing knowledge through retrieval-augmented generation and executing multi-step tasks using AI agent frameworks. However, these approaches rely on the logical consistency and self-coherence of LLMs, making it crucial to evaluate these aspects and consider potential countermeasures. To identify cases where LLMs fail to maintain logical consistency, we conducted an experiment in which LLMs were asked to explain the patterns in various integer sequences, ranging from arithmetic sequences to randomly generated integer series. While the models successfully identified correct patterns in arithmetic and geometric sequences, they frequently over-recognized patterns that were inconsistent with the given numbers when analyzing randomly generated series. This issue was observed even in multi-step reasoning models, including OpenAI o3, o4-mini, and Google Gemini 2.5 Flash Preview Thinking. This tendency to perceive non-existent patterns can be interpreted as the AI model equivalent of Idola Tribus and highlights potential limitations in their capability for applied tasks requiring logical reasoning, even when employing chain-of-thought reasoning mechanisms.
我们发现，大型语言模型（LLM）在识别数字序列规律这一简单任务中，即使明显不恰当，也倾向于生成牵强的模式解释。已有多种方法尝试将LLM应用于复杂的现实任务，例如通过检索增强生成提供知识，以及使用AI智能体框架执行多步骤任务；但这些方法都依赖LLM的逻辑一致性和自洽性，因此评估这些方面并考虑潜在对策十分关键。为找出LLM无法保持逻辑一致性的情形，我们开展了一项实验：要求LLM解释各种整数序列的规律，序列范围涵盖等差数列到随机生成的整数序列。模型在等差数列和等比数列上能够成功识别正确规律，但在分析随机生成的序列时，经常过度识别出与给定数字并不一致的规律。即使在多步推理模型（包括OpenAI o3、o4-mini和Google Gemini 2.5 Flash Preview Thinking）中也能观察到这一问题。这种感知出并不存在的规律的倾向，可以被视为AI模型版本的“种族假象”（Idola Tribus），并凸显了LLM在需要逻辑推理的应用任务中的潜在局限，即便采用了思维链推理机制也是如此。
论文及项目相关链接
PDF 14 pages, 3 figures, accepted to Findings of EMNLP 2025
Summary
大型语言模型(LLMs)在识别数字序列规律的任务中,倾向于生成荒谬的模式。尽管在算术序列和几何序列中,这些模型能够成功识别正确的模式,但在分析随机生成的系列时,它们经常过度识别与给定数字不一致的模式。这反映出LLMs在需要逻辑推理的应用任务中的潜在局限性,即使采用思维链推理机制也是如此。
Key Takeaways
- LLMs在识别数字序列规律的任务中有生成荒谬模式的倾向。
- LLMs在分析随机生成的数字系列时,会过度识别与给定数字不一致的模式。
- LLMs在逻辑一致性方面存在潜在局限性,特别是在需要逻辑推理的应用任务中。
- 即使是采用多步推理机制的大型语言模型,也可能会出现逻辑不一致的问题。
- LLMs的成功应用依赖于其逻辑一致性和自我连贯性。
- Idola Tribus的概念可以类比理解LLMs的这种倾向,即误识非存在的模式。
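下面给出一个示意性的Python草图，展示该实验所用测试序列（等差、等比与随机整数序列）及询问其规律的提示可以如何构造；是否向LLM发送提示并判定其是否在随机序列上虚构规律的部分在此省略，相关细节以论文为准。

```python
# 示意性草图：构造等差、等比与随机整数序列，并生成询问"该序列规律是什么"的提示。
import random

def arithmetic(start: int, step: int, n: int = 6) -> list[int]:
    """等差数列。"""
    return [start + i * step for i in range(n)]

def geometric(start: int, ratio: int, n: int = 6) -> list[int]:
    """等比数列。"""
    return [start * ratio ** i for i in range(n)]

def random_series(lo: int = 1, hi: int = 99, n: int = 6, seed: int = 0) -> list[int]:
    """随机整数序列：原则上不应被识别出任何规律。"""
    rng = random.Random(seed)
    return [rng.randint(lo, hi) for _ in range(n)]

def make_prompt(series: list[int]) -> str:
    return ("Here is a number series: " + ", ".join(map(str, series)) +
            ". Explain the pattern, or state clearly that there is no pattern.")

if __name__ == "__main__":
    for s in (arithmetic(3, 4), geometric(2, 3), random_series(seed=42)):
        print(make_prompt(s))
```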
点此查看论文截图



Vision Language Models: A Survey of 26K Papers
Authors:Fengming Lin
We present a transparent, reproducible measurement of research trends across 26,104 accepted papers from CVPR, ICLR, and NeurIPS spanning 2023-2025. Titles and abstracts are normalized, phrase-protected, and matched against a hand-crafted lexicon to assign up to 35 topical labels and mine fine-grained cues about tasks, architectures, training regimes, objectives, datasets, and co-mentioned modalities. The analysis quantifies three macro shifts: (1) a sharp rise of multimodal vision-language-LLM work, which increasingly reframes classic perception as instruction following and multi-step reasoning; (2) steady expansion of generative methods, with diffusion research consolidating around controllability, distillation, and speed; and (3) resilient 3D and video activity, with composition moving from NeRFs to Gaussian splatting and a growing emphasis on human- and agent-centric understanding. Within VLMs, parameter-efficient adaptation like prompting/adapters/LoRA and lightweight vision-language bridges dominate; training practice shifts from building encoders from scratch to instruction tuning and finetuning strong backbones; contrastive objectives recede relative to cross-entropy/ranking and distillation. Cross-venue comparisons show CVPR has a stronger 3D footprint and ICLR the highest VLM share, while reliability themes such as efficiency or robustness diffuse across areas. We release the lexicon and methodology to enable auditing and extension. Limitations include lexicon recall and abstract-only scope, but the longitudinal signals are consistent across venues and years.
我们对CVPR、ICLR和NeurIPS在2023至2025年间录用的26,104篇论文的研究趋势进行了透明、可复现的测量。我们对标题和摘要进行规范化和短语保护，并与人工构建的词表进行匹配，为每篇论文分配至多35个主题标签，同时挖掘关于任务、架构、训练方式、目标函数、数据集以及共同提及模态的细粒度线索。分析量化了三大宏观变化：（1）多模态视觉-语言-LLM工作的急剧增长，它们日益把经典感知问题重塑为指令遵循和多步推理；（2）生成方法的稳步扩张，扩散模型研究围绕可控性、蒸馏和速度不断整合；（3）3D与视频方向保持活跃，场景表示从NeRF转向高斯泼溅（Gaussian splatting），并日益强调以人和智能体为中心的理解。在VLM内部，提示/适配器/LoRA等参数高效适配方法与轻量级视觉-语言桥接占据主导；训练实践从从零构建编码器转向对强大骨干网络进行指令微调与微调；对比目标相对于交叉熵/排序目标和蒸馏有所退潮。跨会议比较显示，CVPR的3D占比更高，而ICLR的VLM占比最高；效率、稳健性等可靠性主题则在各领域普遍扩散。我们发布了词表和方法，以便审计和扩展。局限性包括词表召回率以及仅基于摘要的覆盖范围，但纵向趋势信号在各会议和年份间保持一致。
论文及项目相关链接
PDF VLM/LLM Learning Notes
Summary
本文对CVPR、ICLR和NeurIPS在2023至2025年间录用的26,104篇论文的研究趋势进行了透明、可复现的测量。通过对标题和摘要进行规范化、短语保护并与人工词表匹配，为论文分配主题标签并挖掘关于任务、架构、训练方式、目标函数、数据集及共同提及模态的细粒度线索。文章量化了三大宏观变化：一是多模态视觉-语言与大型语言模型工作的急剧增长，将传统感知重塑为指令遵循和多步推理；二是生成方法稳步扩张，扩散研究集中于可控性、蒸馏和速度；三是3D与视频方向持续活跃，场景表示从NeRF转向高斯泼溅，并日益强调以人和智能体为中心的理解。此外，文章还讨论了提示/适配器/LoRA等参数高效适配方法和轻量级视觉-语言桥接的主导地位、训练实践的转变以及目标函数的变化。跨会议比较显示CVPR在3D方向占比更高，而ICLR的VLM占比最高；效率、稳健性等可靠性主题在整个领域普遍扩散。最后，文章发布了词表和方法以便审计和扩展。尽管存在词表召回率和仅基于摘要等局限，但各会议和年份间的纵向信号是一致的。
Key Takeaways
- 通过分析CVPR、ICLR和NeurIPS会议论文,覆盖2023-2025年的研究趋势变得透明和可复现。
- 观察到三个主要宏观变化:多模态视觉语言与大型语言模型工作的增加,生成方法的扩展以及3D和视频活动领域的持续发展。
- 词汇表和方法论被发布以推动审计和扩展。
- 多模态融合的工作逐渐成为一种趋势,越来越多的将传统感知任务重塑为指令跟随和多步推理。
- 扩散模型研究开始重视可控性、提炼和速度方面的改进。
- 在视觉语言模型(VLM)领域,参数效率适应方法如提示/适配器/LoRA受到重视,训练实践也在发生变化。
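下面给出一个示意性的Python草图，说明摘要中“规范化标题/摘要、保护多词短语、与人工词表匹配以分配主题标签”的统计流程。其中的词表与标签均为虚构示例，并非论文发布的词表。

```python
# 示意性草图：先把词表中的多词短语替换为下划线连接的整体 token（短语保护），
# 再按 token 与人工词表匹配、统计主题标签出现次数。
import re
from collections import Counter

LEXICON = {  # 主题标签 -> 触发词/短语（假设示例）
    "vision-language": ["vision language model", "vlm", "multimodal llm"],
    "diffusion": ["diffusion model", "flow matching"],
    "3d": ["nerf", "gaussian splatting", "point cloud"],
}

def normalize(text: str) -> str:
    text = text.lower()
    # 短语保护：避免多词短语在逐词匹配时被拆散
    for phrases in LEXICON.values():
        for p in phrases:
            if " " in p:
                text = text.replace(p, p.replace(" ", "_"))
    return re.sub(r"[^a-z0-9_ ]+", " ", text)

def label_paper(title_abstract: str) -> list[str]:
    tokens = set(normalize(title_abstract).split())
    labels = []
    for label, phrases in LEXICON.items():
        keys = {p.replace(" ", "_") for p in phrases}
        if tokens & keys:
            labels.append(label)
    return labels

if __name__ == "__main__":
    corpus = [
        "A Vision Language Model for instruction following",
        "Fast Gaussian Splatting for dynamic scenes",
        "Distilling diffusion models with flow matching",
    ]
    counts = Counter(l for doc in corpus for l in label_paper(doc))
    print(counts)   # Counter({'vision-language': 1, '3d': 1, 'diffusion': 1})
```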
点此查看论文截图





StatEval: A Comprehensive Benchmark for Large Language Models in Statistics
Authors:Yuchen Lu, Run Yang, Yichen Zhang, Shuguang Yu, Runpeng Dai, Ziwei Wang, Jiayi Xiang, Wenxin E, Siran Gao, Xinyao Ruan, Yirui Huang, Chenjing Xi, Haibo Hu, Yueming Fu, Qinglan Yu, Xiaobing Wei, Jiani Gu, Rui Sun, Jiaxuan Jia, Fan Zhou
Large language models (LLMs) have demonstrated remarkable advances in mathematical and logical reasoning, yet statistics, as a distinct and integrative discipline, remains underexplored in benchmarking efforts. To address this gap, we introduce StatEval, the first comprehensive benchmark dedicated to statistics, spanning both breadth and depth across difficulty levels. StatEval consists of 13,817 foundational problems covering undergraduate and graduate curricula, together with 2,374 research-level proof tasks extracted from leading journals. To construct the benchmark, we design a scalable multi-agent pipeline with human-in-the-loop validation that automates large-scale problem extraction, rewriting, and quality control, while ensuring academic rigor. We further propose a robust evaluation framework tailored to both computational and proof-based tasks, enabling fine-grained assessment of reasoning ability. Experimental results reveal that closed-source models such as GPT5-mini achieve below 57% on research-level problems, with open-source models performing significantly lower. These findings highlight the unique challenges of statistical reasoning and the limitations of current LLMs. We expect StatEval to serve as a rigorous benchmark for advancing statistical intelligence in large language models. All data and code are available on our web platform: https://stateval.github.io/.
大型语言模型（LLM）在数学和逻辑推理方面取得了显著进步，然而统计学作为一门独特且综合性的学科，在基准测试工作中仍未得到充分探索。为填补这一空白，我们推出了StatEval，这是首个专门面向统计学的综合基准，在不同难度层级上兼顾广度与深度。StatEval包含13,817个覆盖本科和研究生课程的基础问题，以及从顶级期刊中提取的2,374个研究级证明任务。为构建该基准，我们设计了一条可扩展的多智能体流水线，并辅以人机协同（human-in-the-loop）验证，自动完成大规模问题抽取、改写和质量控制，同时确保学术严谨性。我们还针对计算型与证明型任务提出了稳健的评估框架，能够对推理能力进行细粒度评估。实验结果显示，闭源模型（如GPT5-mini）在研究级问题上的得分低于57%，而开源模型的表现显著更低。这些发现凸显了统计推理的独特挑战以及当前LLM的局限性。我们期望StatEval能成为推动大型语言模型统计智能发展的严谨基准。所有数据和代码均可在我们的网络平台获取：https://stateval.github.io/。
论文及项目相关链接
Summary
大型语言模型在数学和逻辑推理方面取得了显著进展,但在统计学这一独特且综合的学科领域,其评估仍存在缺口。为解决这一问题,我们推出StatEval——首个专注于统计学的全面基准测试,涵盖不同难度层次,既广泛又深入。StatEval包含13817个涵盖本科和研究生课程的基础问题,以及从领先期刊提取的2374个研究级证明任务。我们设计了一个可规模化、多智能体管道,通过人机协作验证,实现大规模问题提取、改写和质量控制,同时确保学术严谨性。我们对计算型和证明型任务提出了稳健的评估框架,实现对推理能力的精细评估。实验结果显示,封闭式模型如GPT5-mini在研究级别问题上的表现低于57%,而开源模型表现更差。这突显了统计推理的独特挑战和当前大型语言模型的局限性。我们期望StatEval能成为推动大型语言模型中统计智能发展的严谨基准测试。
Key Takeaways
- 大型语言模型在数学和逻辑推理方面有所进步,但在统计学领域的评估存在缺口。
- StatEval是首个专注于统计学的全面基准测试,涵盖本科、研究生及研究级别问题。
- StatEval包含一个可规模化、多智能体管道,确保问题提取、改写和质量控制过程的学术严谨性。
- 评估框架可针对计算型和证明型任务进行精细评估。
- 封闭式模型如GPT5-mini在研究级别问题上的表现不佳。
- 统计推理面临独特挑战,当前大型语言模型存在局限性。
点此查看论文截图


TripScore: Benchmarking and rewarding real-world travel planning with fine-grained evaluation
Authors:Yincen Qu, Huan Xiao, Feng Li, Gregory Li, Hui Zhou, Xiangying Dai, Xiaoru Dai
Travel planning is a valuable yet complex task that poses significant challenges even for advanced large language models (LLMs). While recent benchmarks have advanced in evaluating LLMs’ planning capabilities, they often fall short in evaluating feasibility, reliability, and engagement of travel plans. We introduce a comprehensive benchmark for travel planning that unifies fine-grained criteria into a single reward, enabling direct comparison of plan quality and seamless integration with reinforcement learning (RL). Our evaluator achieves moderate agreement with travel-expert annotations (60.75%) and outperforms multiple LLM-as-judge baselines. We further release a large-scale dataset of 4,870 queries including 219 real-world, free-form requests for generalization to authentic user intent. Using this benchmark, we conduct extensive experiments across diverse methods and LLMs, including test-time computation, neuro-symbolic approaches, supervised fine-tuning, and RL via GRPO. Across base models, RL generally improves itinerary feasibility over prompt-only and supervised baselines, yielding higher unified reward scores.
旅行规划是一项有价值但复杂的任务，即使对于先进的大型语言模型（LLM）也构成重大挑战。虽然近期的基准在评估LLM规划能力方面有所推进，但它们往往难以评估旅行计划的可行性、可靠性和吸引力。我们提出了一个全面的旅行规划基准，将细粒度评价标准统一为单一奖励，使计划质量可以直接比较，并能与强化学习（RL）无缝集成。我们的评估器与旅行专家标注达到中等程度的一致性（60.75%），并优于多个LLM-as-judge基线。我们还发布了包含4,870条查询的大规模数据集，其中219条为真实世界的自由形式请求，用于泛化到真实用户意图。基于该基准，我们在多种方法和LLM上开展了广泛实验，包括测试时计算、神经符号方法、监督微调以及基于GRPO的RL。在各基础模型上，RL通常比仅提示和监督基线更能提升行程可行性，从而获得更高的统一奖励分数。
论文及项目相关链接
Summary
旅行规划是一项有价值但复杂的任务,对大型语言模型(LLM)提出重大挑战。现有的评估方法难以全面评价计划的可行性、可靠性和参与度。为此,我们提出了一个全面的旅行规划评估标准,将精细标准整合为单一奖励,能够直接比较计划质量,并与强化学习无缝集成。我们的评估器与旅行专家注释达成中等共识(60.75%),并优于多个LLM基线。此外,我们还发布了包含4870个查询的大规模数据集,其中219个是真实世界的自由形式请求,用于推广到真实用户意图。利用此评估标准,我们在不同的方法和LLM上进行了广泛的实验,包括测试时间计算、神经符号方法、监督微调以及通过GRPO的强化学习。在基础模型上,强化学习通常能提高行程的可行性,相对于仅提示和监督基线,获得更高的统一奖励分数。
Key Takeaways
- 旅行规划对大型语言模型(LLM)来说是一项复杂任务,现有评估方法存在缺陷。
- 提出了一种新的旅行规划评估标准,能够全面评价计划的可行性、可靠性和参与度。
- 评估器与旅行专家注释达成中等共识,并优于多个LLM基线。
- 发布了包含真实世界自由形式请求的大规模数据集。
- 强化学习在提升行程规划质量方面表现出潜力,特别是在基础模型上。
- 不同方法和LLM的实验验证了评估标准的广泛适用性。
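下面给出一个示意性的Python草图，说明“把细粒度评价维度统一为单一奖励”的做法，以便直接比较计划质量并作为RL的奖励信号。其中的维度划分与权重均为假设，并非论文给定的数值。

```python
# 示意性草图：将可行性、可靠性、吸引力三个维度按权重合成为单一标量奖励。
from dataclasses import dataclass
from typing import Optional

@dataclass
class PlanScores:
    feasibility: float   # 0~1：行程在时间/交通/开放时间上是否可行
    reliability: float   # 0~1：地点、价格等事实信息是否可靠
    engagement: float    # 0~1：计划是否贴合用户偏好、是否有吸引力

def unified_reward(s: PlanScores, weights: Optional[dict] = None) -> float:
    """按权重加权求和并归一化，得到落在 [0, 1] 的单一奖励。"""
    if weights is None:
        weights = {"feasibility": 0.5, "reliability": 0.3, "engagement": 0.2}
    raw = (weights["feasibility"] * s.feasibility
           + weights["reliability"] * s.reliability
           + weights["engagement"] * s.engagement)
    return raw / sum(weights.values())

if __name__ == "__main__":
    plan = PlanScores(feasibility=0.9, reliability=0.7, engagement=0.6)
    print(f"unified reward = {unified_reward(plan):.3f}")   # 0.780
```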
点此查看论文截图




SOP-Maze: Evaluating Large Language Models on Complicated Business Standard Operating Procedures
Authors:Jiaming Wang, Zhe Tang, Yilin Jin, Peng Ding, Xiaoyu Li, Xuezhi Cao
As large language models (LLMs) are widely deployed as domain-specific agents, many benchmarks have been proposed to evaluate their ability to follow instructions and make decisions in real-world scenarios. However, business scenarios often involve complex standard operating procedures (SOPs), and the evaluation of LLM capabilities in such contexts has not been fully explored. To bridge this gap, we propose SOP-Maze, a benchmark constructed from real-world business data and adapted into a collection of 397 tasks from 23 complex SOP scenarios. We further categorize SOP tasks into two broad classes: Lateral Root System (LRS), representing wide-option tasks that demand precise selection; and Heart Root System (HRS), which emphasizes deep logical reasoning with complex branches. Extensive experiments reveal that nearly all state-of-the-art models struggle with SOP-Maze. We conduct a comprehensive analysis and identify three key error categories: (i) route blindness: difficulty following procedures; (ii) conversational fragility: inability to handle real dialogue nuances; and (iii) calculation errors: mistakes in time or arithmetic reasoning under complex contexts. The systematic study explores LLM performance across SOP tasks that challenge both breadth and depth, offering new insights for improving model capabilities. We have open-sourced our work on https://github.com/ADoublLEN/SOP-Maze.
随着大型语言模型（LLM）被广泛部署为领域专用智能体，许多基准被提出以评估其在真实场景中遵循指令和做出决策的能力。然而，业务场景往往涉及复杂的标准操作流程（SOP），而在此类情境下对LLM能力的评估尚未得到充分探索。为弥补这一差距，我们提出了SOP-Maze，这一基准由真实业务数据构建，包含来自23个复杂SOP场景的397个任务。我们进一步将SOP任务分为两大类：侧根系统（Lateral Root System, LRS），代表选项众多、需要精确选择的任务；以及心形根系统（Heart Root System, HRS），强调带有复杂分支的深度逻辑推理。大量实验表明，几乎所有最先进的模型在SOP-Maze上都表现不佳。我们进行了全面分析，归纳出三类关键错误：（i）路径盲视：难以按流程执行；（ii）对话脆弱性：无法处理真实对话中的细微差别；（iii）计算错误：在复杂情境下的时间或算术推理出错。这项系统性研究考察了LLM在兼具广度与深度挑战的SOP任务上的表现，为提升模型能力提供了新见解。我们已在 https://github.com/ADoublLEN/SOP-Maze 开源了相关工作。
论文及项目相关链接
Summary
针对大型语言模型（LLM）在真实世界场景中遵循指令和做出决策的能力，业界已提出多种评估基准。然而，业务场景通常涉及复杂的标准操作流程（SOP），在这类情境下对LLM能力的评估尚未得到充分探索。为此，本文提出了SOP-Maze基准，它基于真实业务数据构建，从23个复杂SOP场景中衍生出397项任务。实验发现，几乎所有最新模型在SOP-Maze上都面临挑战，并呈现出三类主要错误。对该基准的深入研究为改进模型能力提供了新的见解。
Key Takeaways
- 大型语言模型(LLM)在真实世界场景中的表现能力得到了广泛的评估基准测试,但针对复杂标准操作流程(SOPs)的评估尚未充分探索。
- SOP-Maze基准测试从真实世界业务数据中构建,包含从多种复杂SOP场景中派生的任务,为评估LLM的能力提供了新的角度。
- SOP任务分为两大类别:横向根系系统(LRS)和心脏根系系统(HRS),分别侧重于不同的任务需求。
- 实验发现几乎所有最新模型在SOP-Maze中都面临挑战,表明在这些场景中模型的性能有待提高。
- 在SOP-Maze中出现的三类关键错误包括：路径盲视、对话脆弱性和计算错误。
- 系统性的研究为改进模型能力提供了方向,特别是在处理宽度和深度都挑战的任务时。
点此查看论文截图





Benchmarking Chinese Commonsense Reasoning with a Multi-hop Reasoning Perspective
Authors:Wangjie You, Xusheng Wang, Xing Wang, Wenxiang Jiao, Chao Feng, Juntao Li, Min Zhang
While Large Language Models (LLMs) have demonstrated advanced reasoning capabilities, their comprehensive evaluation in general Chinese-language contexts remains understudied. To bridge this gap, we propose Chinese Commonsense Multi-hop Reasoning (CCMOR), a novel benchmark designed to evaluate LLMs’ ability to integrate Chinese-specific factual knowledge with multi-step logical reasoning. Specifically, we first construct a domain-balanced seed set from existing QA datasets, then develop an LLM-powered pipeline to generate multi-hop questions anchored on factual unit chains. To ensure the quality of resulting dataset, we implement a human-in-the-loop verification system, where domain experts systematically validate and refine the generated questions. Using CCMOR, we evaluate state-of-the-art LLMs, demonstrating persistent limitations in LLMs’ ability to process long-tail knowledge and execute knowledge-intensive reasoning. Notably, retrieval-augmented generation substantially mitigates these knowledge gaps, yielding significant performance gains.
虽然大型语言模型（LLM）已展现出先进的推理能力，但在通用中文语境下对其进行全面评估的研究仍然不足。为弥补这一空白，我们提出了中文常识多跳推理（CCMOR）基准，用于评估LLM将中文特有的事实知识与多步逻辑推理相结合的能力。具体而言，我们首先从现有问答数据集中构建领域均衡的种子集，然后开发由LLM驱动的流水线，围绕事实单元链生成多跳问题。为确保数据集质量，我们实施了人机协同验证机制，由领域专家系统地校验并改进生成的问题。基于CCMOR，我们评估了最先进的LLM，结果显示LLM在处理长尾知识和执行知识密集型推理方面存在持续局限。值得注意的是，检索增强生成显著缓解了这些知识缺口，带来了明显的性能提升。
论文及项目相关链接
Summary
中国语境下对大型语言模型(LLM)的综合评价研究仍显不足,为此提出中国常识多跳推理(CCMOR)基准测试。该测试旨在评估LLM在整合汉语特定事实知识和多步骤逻辑推理方面的能力。构建领域均衡的种子集,通过LLM管道生成基于事实单元链的多跳问题。为确保数据集质量,实施人机循环验证系统,由领域专家系统地验证和精炼生成的问题。使用CCMOR评估了最先进的大型语言模型,表明其在处理长尾知识和执行知识密集型推理方面存在持续局限性。值得注意的是,通过检索增强生成技术显著缓解了这些知识差距,实现了显著的性能提升。
Key Takeaways
- 大型语言模型(LLMs)在整合汉语特定事实知识和多步骤逻辑推理方面的能力评价仍显不足。
- 提出中国常识多跳推理(CCMOR)基准测试,用于评估LLMs在上述方面的能力。
- CCMOR通过构建领域均衡的种子集和LLM管道生成基于事实单元链的多跳问题。
- 实施人机循环验证系统确保数据集质量。
- 评估发现LLMs在处理长尾知识和执行知识密集型推理方面存在局限性。
- 检索增强生成技术能显著缓解LLMs的知识差距问题,提升性能。
点此查看论文截图






Token Hidden Reward: Steering Exploration-Exploitation in Group Relative Deep Reinforcement Learning
Authors:Wenlong Deng, Yi Ren, Yushu Li, Boying Gong, Danica J. Sutherland, Xiaoxiao Li, Christos Thrampoulidis
Reinforcement learning with verifiable rewards has significantly advanced the reasoning capabilities of large language models, yet how to explicitly steer training toward exploration or exploitation remains an open problem. We introduce Token Hidden Reward (THR), a token-level metric that quantifies each token’s influence on the likelihood of correct responses under Group Relative Policy Optimization (GRPO). We find that training dynamics are dominated by a small subset of tokens with high absolute THR values. Most interestingly, tokens with positive THR strengthen confidence in correct outputs, thus favoring exploitation, while tokens with negative THR preserve probability mass for alternative outputs, enabling exploration. This insight suggests a natural intervention: a THR-guided reweighting algorithm that modulates GRPO’s learning signals to explicitly bias training toward exploitation or exploration. We validate the efficacy of this algorithm on diverse math reasoning benchmarks. By amplifying tokens with positive THR value and weakening negative ones, our algorithm improves greedy-decoding accuracy, favoring exploitation. The reverse strategy yields consistent gains in Pass@K accuracy, favoring exploration. We further demonstrate that our algorithm integrates seamlessly with other RL objectives such as GSPO and generalizes across architectures including Llama. These findings establish THR as a principled and fine-grained mechanism for dynamically controlling exploration and exploitation in RL-tuned LLMs, providing new tools for targeted fine-tuning in reasoning-intensive applications.
可验证奖励的强化学习显著提升了大型语言模型的推理能力，但如何显式地引导训练偏向探索或利用仍是一个未解决的问题。我们提出了Token Hidden Reward（THR），这是一种token级度量，用于量化在组相对策略优化（GRPO）下每个token对正确回答概率的影响。我们发现，训练动态主要由一小部分THR绝对值很高的token主导。最有趣的是，THR为正的token会增强对正确输出的信心，从而偏向利用；而THR为负的token则为替代输出保留概率质量，从而支持探索。这一洞察引出了一种自然的干预：一种由THR引导的重加权算法，通过调制GRPO的学习信号，使训练显式地偏向利用或探索。我们在多个数学推理基准上验证了该算法的有效性。通过放大THR为正的token并削弱THR为负的token，我们的算法提升了贪婪解码的准确率，偏向利用；相反的策略则在Pass@K准确率上取得一致增益，偏向探索。我们还证明该算法可以与GSPO等其他RL目标无缝结合，并可推广到包括Llama在内的多种架构。这些发现确立了THR作为一种有原则、细粒度的机制，可动态控制经RL调优的LLM中的探索与利用，为推理密集型应用的定向微调提供了新工具。
论文及项目相关链接
PDF Full version of submission to 2nd AI for Math Workshop@ ICML 2025 (best paper)
Summary
强化学习结合可验证奖励显著提升了大型语言模型的推理能力,但如何明确引导训练过程走向探索或利用仍是一个待解决的问题。我们引入了Token Hidden Reward(THR)这一指标,用于量化每个token在Group Relative Policy Optimization(GRPO)下对正确响应概率的影响。我们发现训练过程主要由具有极高绝对THR值的少量tokens主导。最有趣的是,具有正THR的tokens会增强对正确输出的信心,从而有利于利用,而具有负THR的tokens则保留了对替代输出的概率质量,从而实现探索。这一见解提出了一种自然的干预措施:一种由THR引导的重新加权算法,该算法可以调节GRPO的学习信号,以明确引导训练过程走向利用或探索。我们在多种数学推理基准测试上验证了该算法的有效性。通过放大具有正THR值的tokens并削弱具有负THR值的tokens,我们的算法提高了贪婪解码的准确性,有利于利用。相反的策略则在Pass@K准确性上取得了一致的收益,有利于探索。我们还证明了该算法可以无缝集成到其他RL目标中,如GSPO,并适用于各种架构,包括Llama。这些发现确立了THR作为在RL调整LLM中动态控制探索和利用的精细化机制的地位,为针对推理密集型应用的目标微调提供了新的工具。
Key Takeaways
- 强化学习结合可验证奖励提升了大型语言模型的推理能力。
- 引入了Token Hidden Reward(THR)指标来量化token对正确响应概率的影响。
- 训练过程由具有高THR值的少数tokens主导。
- 正THR的tokens有利于利用,而负THR的tokens则促进探索。
- 提出了一种由THR引导的重新加权算法,可明确引导训练过程走向利用或探索。
- 该算法在多种数学推理基准测试上表现有效,能无缝集成到其他RL目标中。
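下面是一个示意性的PyTorch草图，演示“按THR符号对token级学习信号重加权”的思路：放大THR为正的token、削弱THR为负的token以偏向利用，取反则偏向探索。THR本身如何计算遵循论文定义，这里假定其已作为输入张量给出；函数名与系数 alpha、beta 均为假设。

```python
# 示意性草图（PyTorch）：按 THR 的符号对 token 级策略梯度项加权。
import torch

def thr_reweighted_pg_loss(logprobs: torch.Tensor,      # [B, T] 当前策略的 token 对数概率
                           advantages: torch.Tensor,     # [B, T] GRPO 风格的组相对优势
                           thr: torch.Tensor,            # [B, T] 每个 token 的 THR 值（假定已算好）
                           mask: torch.Tensor,           # [B, T] 有效 token 掩码
                           alpha: float = 0.5,           # 正 THR token 的放大幅度（假设）
                           beta: float = 0.5,            # 负 THR token 的削弱幅度（假设）
                           favor_exploitation: bool = True) -> torch.Tensor:
    pos, neg = (thr > 0).float(), (thr < 0).float()
    if favor_exploitation:
        w = 1.0 + alpha * pos - beta * neg     # 放大正 THR、削弱负 THR：偏向利用
    else:
        w = 1.0 - beta * pos + alpha * neg     # 反向策略：偏向探索
    w = w.clamp(min=0.0).detach()              # 权重只调制梯度大小，不参与求导
    loss = -(w * advantages * logprobs * mask).sum() / mask.sum().clamp(min=1.0)
    return loss

if __name__ == "__main__":
    B, T = 2, 5
    torch.manual_seed(0)
    logprobs = torch.randn(B, T, requires_grad=True)
    advantages, thr = torch.randn(B, T), torch.randn(B, T)
    mask = torch.ones(B, T)
    thr_reweighted_pg_loss(logprobs, advantages, thr, mask).backward()
    print("loss ok, grad shape:", logprobs.grad.shape)
```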
点此查看论文截图





Unspoken Hints: Accuracy Without Acknowledgement in LLM Reasoning
Authors:Arash Marioriyad, Shaygan Adim, Nima Alighardashi, Mahdieh Soleymani Banghshah, Mohammad Hossein Rohban
Large language models (LLMs) increasingly rely on chain-of-thought (CoT) prompting to solve mathematical and logical reasoning tasks. Yet, a central question remains: to what extent are these generated rationales \emph{faithful} to the underlying computations, rather than post-hoc narratives shaped by hints that function as answer shortcuts embedded in the prompt? Following prior work on hinted vs.\ unhinted prompting, we present a systematic study of CoT faithfulness under controlled hint manipulations. Our experimental design spans four datasets (AIME, GSM-Hard, MATH-500, UniADILR), two state-of-the-art models (GPT-4o and Gemini-2-Flash), and a structured set of hint conditions varying in correctness (correct and incorrect), presentation style (sycophancy and data leak), and complexity (raw answers, two-operator expressions, four-operator expressions). We evaluate both task accuracy and whether hints are explicitly acknowledged in the reasoning. Our results reveal three key findings. First, correct hints substantially improve accuracy, especially on harder benchmarks and logical reasoning, while incorrect hints sharply reduce accuracy in tasks with lower baseline competence. Second, acknowledgement of hints is highly uneven: equation-based hints are frequently referenced, whereas raw hints are often adopted silently, indicating that more complex hints push models toward verbalizing their reliance in the reasoning process. Third, presentation style matters: sycophancy prompts encourage overt acknowledgement, while leak-style prompts increase accuracy but promote hidden reliance. This may reflect RLHF-related effects, as sycophancy exploits the human-pleasing side and data leak triggers the self-censoring side. Together, these results demonstrate that LLM reasoning is systematically shaped by shortcuts in ways that obscure faithfulness.
大型语言模型（LLM）越来越依赖思维链（CoT）提示来解决数学和逻辑推理任务。然而一个核心问题仍然存在：这些生成的推理过程在多大程度上忠实于底层计算，而非由嵌入在提示中、起到答案捷径作用的暗示（hint）所塑造的事后叙述？延续此前关于有暗示与无暗示提示的工作，我们在受控的暗示操纵下对CoT忠实性进行了系统研究。实验设计覆盖四个数据集（AIME、GSM-Hard、MATH-500、UniADILR）、两个最先进的模型（GPT-4o和Gemini-2-Flash），以及一组结构化的暗示条件，其在正确性（正确与错误）、呈现风格（奉承式与数据泄露式）和复杂度（原始答案、双运算符表达式、四运算符表达式）上各不相同。我们同时评估任务准确率以及推理过程是否明确承认使用了暗示。结果揭示了三个关键发现。第一，正确的暗示能显著提高准确率，尤其是在较难的基准和逻辑推理上；而错误的暗示会使基线能力较弱的任务准确率急剧下降。第二，对暗示的承认高度不均衡：基于公式的暗示经常被引用，而原始答案式暗示往往被悄然采纳，这表明更复杂的暗示会促使模型在推理中明示其依赖。第三，呈现风格很重要：奉承式提示鼓励公开承认，而泄露式提示虽提高准确率却助长隐性依赖。这可能反映了与RLHF相关的效应：奉承利用了模型取悦人类的一面，数据泄露则触发了其自我审查的一面。总之，这些结果表明LLM的推理会被捷径系统性地塑造，从而掩盖了忠实性。
论文及项目相关链接
PDF 5 Pages, 4 Figures, 4 Tables
Summary
大型语言模型(LLM)通过链式思维(CoT)提示来解决数学和逻辑推理任务。然而,中心问题是这些生成的解释在多大程度上忠实于底层计算,而不是由提示构成的答案捷径所塑造的后验叙述?在关于提示与非提示性提示的先前研究基础上,我们对受控提示操纵下的CoT忠实性进行了系统研究。实验设计涵盖四个数据集、两个最先进的模型,以及涵盖正确性、呈现风格和复杂性的结构化提示条件。我们的结果揭示了三个关键发现:首先,正确的提示会显著提高准确率,特别是在更难的任务和逻辑推理方面;其次,对提示的承认极不均匀,方程提示经常被引用,而原始提示往往被静默接受;最后,呈现方式很重要:奉承提示鼓励公开承认,而泄露式提示虽然能提高准确率,但会增加隐含依赖。这些发现表明LLM推理受到捷径的系统性影响,这可能会掩盖忠实性。
Key Takeaways
- 大型语言模型(LLM)通过链式思维(CoT)提示解决数学和逻辑任务时,生成的解释与底层计算的忠实性是一个核心问题。
- 正确提示可以显著提高任务准确率,尤其在难度较大的任务和逻辑推理方面。
- 模型对提示的承认程度不均,方程提示常被明确引用,而原始答案等简单提示则常静默接受。
- 提示的呈现方式影响模型的表现,奉承式提示鼓励公开承认依赖,而泄露式提示能提高准确率但可能增加隐含依赖。
- 这些发现表明LLM推理受到捷径的系统性影响,可能导致解释缺乏忠实性。
- 结果反映了人类反馈(RLHF)对模型表现的影响,如奉承利用人类喜爱面、数据泄露触发自我审查等。
点此查看论文截图


RADAR: A Risk-Aware Dynamic Multi-Agent Framework for LLM Safety Evaluation via Role-Specialized Collaboration
Authors:Xiuyuan Chen, Jian Zhao, Yuchen Yuan, Tianle Zhang, Huilin Zhou, Zheng Zhu, Ping Hu, Linghe Kong, Chi Zhang, Weiran Huang, Xuelong Li
Existing safety evaluation methods for large language models (LLMs) suffer from inherent limitations, including evaluator bias and detection failures arising from model homogeneity, which collectively undermine the robustness of risk evaluation processes. This paper seeks to re-examine the risk evaluation paradigm by introducing a theoretical framework that reconstructs the underlying risk concept space. Specifically, we decompose the latent risk concept space into three mutually exclusive subspaces: the explicit risk subspace (encompassing direct violations of safety guidelines), the implicit risk subspace (capturing potential malicious content that requires contextual reasoning for identification), and the non-risk subspace. Furthermore, we propose RADAR, a multi-agent collaborative evaluation framework that leverages multi-round debate mechanisms through four specialized complementary roles and employs dynamic update mechanisms to achieve self-evolution of risk concept distributions. This approach enables comprehensive coverage of both explicit and implicit risks while mitigating evaluator bias. To validate the effectiveness of our framework, we construct an evaluation dataset comprising 800 challenging cases. Extensive experiments on our challenging testset and public benchmarks demonstrate that RADAR significantly outperforms baseline evaluation methods across multiple dimensions, including accuracy, stability, and self-evaluation risk sensitivity. Notably, RADAR achieves a 28.87% improvement in risk identification accuracy compared to the strongest baseline evaluation method.
现有大型语言模型(LLM)的安全评估方法存在固有的局限性,包括评估者偏见和由模型同质性引起的检测失败,这些局限性共同削弱了风险评估过程的稳健性。本文旨在通过引入一个重构基础风险概念空间的理论框架,重新考察风险评估范式。具体来说,我们将潜在风险概念空间分解为三个相互排斥的子空间:显式风险子空间(包含直接违反安全指导原则的风险)、隐式风险子空间(捕捉需要上下文推理才能识别的潜在恶意内容),以及非风险子空间。此外,我们提出了RADAR,这是一个多智能体协作评估框架,它通过四个专业互补角色利用多轮辩论机制,并采用动态更新机制实现风险概念分布的自我进化。该方法能够全面覆盖显式和隐式风险,同时减轻评估者偏见。为了验证我们框架的有效性,我们构建了一个包含800个挑战案例的评估数据集。在我们具有挑战性的测试集和公共基准测试上的大量实验表明,RADAR在多个维度上显著优于基线评估方法,包括准确性、稳定性和自我评估风险敏感性。值得注意的是,与最强的基线评估方法相比,RADAR在风险识别准确性方面提高了28.87%。
论文及项目相关链接
Summary
本文指出当前大型语言模型的安全评估方法存在评价者偏见以及模型同质性导致的检测失败等内在局限。为此，文章提出了一个重构潜在风险概念空间的理论框架，将其分解为三个互斥的子空间：显式风险子空间、隐式风险子空间和非风险子空间。同时，文章提出了RADAR多智能体协作评估框架，该框架通过四个专业互补角色进行多轮辩论，并利用动态更新机制实现风险概念分布的自我演化。实验表明，RADAR在风险识别准确率上较最强基线评估方法提升了28.87%。
Key Takeaways
- 当前大型语言模型的安全评估方法存在局限性,包括评价者偏见和检测失败等问题。
- 新的风险评价框架被提出,将潜在风险概念空间分解为明确风险子空间、隐含风险子空间和非风险子空间。
- RADAR框架采用多智能体协作方式,通过多轮辩论机制进行风险评估。
- RADAR框架包含四个专业互补角色,实现风险概念分布的自我进化。
- 文章构建了包含800个挑战案例的评估数据集以验证RADAR框架的有效性。
- 实验表明,RADAR在多个维度上显著优于基线评估方法,包括准确性、稳定性和自我评估风险敏感性。
点此查看论文截图


p-less Sampling: A Robust Hyperparameter-Free Approach for LLM Decoding
Authors:Runyan Tan, Shuang Wu, Phillip Howard
Obtaining high-quality outputs from Large Language Models (LLMs) often depends upon the choice of a sampling-based decoding strategy to probabilistically choose the next token at each generation step. While a variety of such sampling methods have been proposed, their performance can be sensitive to the selection of hyperparameters which may require different settings depending upon the generation task and temperature configuration. In this work, we introduce $p$-less sampling: an information-theoretic approach to sampling which dynamically sets a truncation threshold at each decoding step based on the entire token probability distribution. Unlike existing methods, $p$-less sampling has no hyperparameters and consistently produces high-quality outputs as temperature increases. We provide theoretical perspectives on $p$-less sampling to ground our proposed method and conduct experiments to empirically validate its effectiveness across a range of math, logical reasoning, and creative writing tasks. Our results demonstrate how $p$-less sampling consistently outperforms existing sampling approaches while exhibiting much less degradation in text quality at higher temperature values. We further show how $p$-less achieves greater inference-time efficiency than alternative methods through lower average token sampling times and shorter generation lengths, without sacrificing accuracy. Finally, we provide analyses to highlight the benefits of $p$-less through qualitative examples, case studies, and diversity assessments.
从大型语言模型(LLM)获得高质量输出通常取决于基于采样的解码策略的选择,该策略以概率方式选择每个生成步骤中的下一个令牌。虽然已经提出了多种这样的采样方法,但它们的性能对超参数的选择很敏感,这可能需要根据不同的生成任务和温度配置进行设置。在这项工作中,我们引入了$p$-less采样:这是一种基于信息论的采样方法,它根据整个令牌概率分布动态地在每个解码步骤中设置截断阈值。与现有方法不同,$p$-less采样没有超参数,随着温度的升高,它始终能产生高质量的输出。我们从理论角度介绍了$p$-less采样的方法,并通过实验对其在多种数学、逻辑推理和创造性写作任务中的有效性进行了实证验证。结果表明,$p$-less采样在多个方面均优于现有采样方法,并且在较高温度下文本质量下降更少。此外,我们还展示了$p$-less如何通过更低的平均令牌采样时间和更短的生成长度实现更高的推理效率,同时不牺牲准确性。最后,我们通过定性示例、案例研究和多样性评估来分析突出$p$-less的优势。
论文及项目相关链接
Summary
本文介绍了无参数采样(p-less sampling)这一信息理论采样方法,该方法根据整个token概率分布动态设置截断阈值。该方法无需调整超参数,可在温度值增加时持续产生高质量的输出。通过理论分析和实验验证,p-less采样在多个数学、逻辑推理和创造性写作任务中表现出卓越性能,特别是在高温环境下文本质量下降更少。此外,p-less采样还具有更高的推理效率,平均token采样时间和生成长度均较低,同时不牺牲准确性。最后,通过定性示例、案例研究和多样性评估,展示了p-less采样的优势。
Key Takeaways
- p-less采样是一种基于信息理论的采样方法,根据整个token概率分布动态设置截断阈值。
- 与现有方法不同,p-less采样无需调整超参数。
- p-less采样在多种任务中表现出卓越性能,特别是在高温环境下文本质量下降更少。
- p-less采样提高了推理效率,具有更快的平均token采样时间和更短的生成长度。
- p-less采样不牺牲准确性。
- 通过定性示例、案例研究和多样性评估,展示了p-less采样的优势。
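下面给出一个示意性的Python草图，仅用于说明“根据整个下一token概率分布、在每个解码步动态设定截断阈值且不含超参数”的思路；摘要并未给出 p-less 的具体阈值公式，这里以“阈值 = exp(−熵)”（即分布的典型概率）作为假设性的替代规则，并非论文原方法。

```python
# 示意性草图：根据整个 next-token 分布动态确定截断阈值，再在保留的 token 上重新归一化采样。
import math
import random

def dynamic_truncate_sample(probs: list, rng: random.Random) -> int:
    """probs: 已归一化的 next-token 概率分布；返回采样到的 token 下标。"""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    threshold = math.exp(-entropy)                 # 动态阈值：分布越平坦，阈值越低
    kept = [(i, p) for i, p in enumerate(probs) if p >= threshold]
    if not kept:                                   # 兜底：至少保留概率最大的 token
        kept = [max(enumerate(probs), key=lambda ip: ip[1])]
    total = sum(p for _, p in kept)
    r, acc = rng.random() * total, 0.0
    for i, p in kept:                              # 在保留集合上按重新归一化的概率采样
        acc += p
        if r <= acc:
            return i
    return kept[-1][0]

if __name__ == "__main__":
    rng = random.Random(0)
    peaked = [0.85, 0.10, 0.03, 0.02]              # 尖锐分布：阈值较高，基本只保留头部 token
    flat = [0.25, 0.25, 0.25, 0.25]                # 平坦分布：阈值降到 0.25，全部保留
    print(dynamic_truncate_sample(peaked, rng), dynamic_truncate_sample(flat, rng))
```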
点此查看论文截图




Geo-R1: Improving Few-Shot Geospatial Referring Expression Understanding with Reinforcement Fine-Tuning
Authors:Zilun Zhang, Zian Guan, Tiancheng Zhao, Haozhan Shen, Tianyu Li, Yuxiang Cai, Zhonggen Su, Zhaojun Liu, Jianwei Yin, Xiang Li
Referring expression understanding in remote sensing poses unique challenges, as it requires reasoning over complex object-context relationships. While supervised fine-tuning (SFT) on multimodal large language models achieves strong performance with massive labeled datasets, they struggle in data-scarce scenarios, leading to poor generalization. To address this limitation, we propose Geo-R1, a reasoning-centric reinforcement fine-tuning (RFT) paradigm for few-shot geospatial referring. Geo-R1 enforces the model to first generate explicit, interpretable reasoning chains that decompose referring expressions, and then leverage these rationales to localize target objects. This “reason first, then act” process enables the model to make more effective use of limited annotations, enhances generalization, and provides interpretability. We validate Geo-R1 on three carefully designed few-shot geospatial referring benchmarks, where our model consistently and substantially outperforms SFT baselines. It also demonstrates strong cross-dataset generalization, highlighting its robustness. Code and data will be released at: https://github.com/Geo-R1/geo-r1.
遥感中的指代表达式理解面临着独特的挑战,因为它需要对复杂的对象上下文关系进行推理。虽然基于多模态大型语言模型的监督微调(SFT)在大量标记数据集上表现出强大的性能,但在数据稀缺的情况下,它们的表现却不尽如人意,导致泛化能力有限。为了解决这一局限性,我们提出了Geo-R1,这是一个以推理为中心的强化微调(RFT)范式,用于进行少样本地理空间指代。Geo-R1强制模型首先生成明确、可解释的推理链来分解指代表达式,然后利用这些推理来定位目标对象。这种“先推理,后行动”的过程使模型能够更有效地利用有限的注释,增强了泛化能力,并提供了可解释性。我们在三个精心设计的少样本地理空间指代基准测试上对Geo-R1进行了验证,我们的模型始终且大幅度地超越了SFT基准测试。它还表现出了强大的跨数据集泛化能力,凸显了其稳健性。相关代码和数据将在https://github.com/Geo-R1/geo-r1上发布。
论文及项目相关链接
Summary
遥感中的指代表达式理解面临独特挑战，需要对复杂的对象-上下文关系进行推理。虽然在多模态大型语言模型上进行监督微调（SFT）可以凭借大规模标注数据取得强劲性能，但在数据稀缺的情形下表现不佳，泛化能力有限。为解决这一问题，我们提出Geo-R1，这是一种以推理为中心的强化微调（RFT）范式，面向少样本地理空间指代任务。Geo-R1要求模型先生成显式、可解释的推理链来分解指代表达式，再利用这些推理依据来定位目标对象。这种“先推理、后行动”的流程使模型能更有效地利用有限标注，增强泛化能力并提供可解释性。我们在三个精心设计的少样本地理空间指代基准上验证了Geo-R1，模型持续且显著地优于SFT基线，并展现出强大的跨数据集泛化能力，凸显了其稳健性。
Key Takeaways
- 远程感应中的指代表达式理解需要处理复杂的对象上下文关系。
- 监督微调(SFT)在大型语言模型上表现强大,但在数据稀缺时泛化能力有限。
- Geo-R1是一种强化微调(RFT)范式,用于解决少量地理空间指代问题。
- Geo-R1通过生成推理链并利用它们来定位目标对象,实现“先推理,后行动”的过程。
- Geo-R1模型在多个基准测试中持续超越监督微调基线。
- Geo-R1模型展示了强大的跨数据集泛化能力。
点此查看论文截图





CE-GPPO: Coordinating Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning
Authors:Zhenpeng Su, Leiyu Pan, Minxuan Lv, Yuntao Li, Wenping Hu, Fuzheng Zhang, Kun Gai, Guorui Zhou
Reinforcement learning (RL) has become a powerful paradigm for optimizing large language models (LLMs) to handle complex reasoning tasks. A core challenge in this process lies in managing policy entropy, which reflects the balance between exploration and exploitation during training. Existing methods, such as proximal policy optimization (PPO) and its variants, discard valuable gradient signals from low-probability tokens due to the clipping mechanism. We systematically analyze the entropy dynamics and reveal that these clipped tokens play a critical yet overlooked role in regulating entropy evolution. We propose Coordinating Entropy via Gradient-Preserving Policy Optimization (CE-GPPO), a novel algorithm that reintroduces gradients from clipped tokens in native PPO in a gentle and bounded manner. By controlling the magnitude of gradients from tokens outside the clipping interval, CE-GPPO is able to achieve an exploration-exploitation trade-off. We provide theoretical justification and empirical evidence showing that CE-GPPO effectively mitigates entropy instability. Extensive experiments on mathematical reasoning benchmarks show that CE-GPPO consistently outperforms strong baselines across different model scales.
强化学习（RL）已成为优化大型语言模型（LLM）以处理复杂推理任务的有力范式。此过程的核心挑战在于管理策略熵，它反映了训练中探索与利用之间的平衡。现有方法（如近端策略优化PPO及其变体）因裁剪机制而丢弃了来自低概率token的有价值梯度信号。我们系统分析了熵的动态变化，发现这些被裁剪的token在调节熵演化方面发挥着关键却被忽视的作用。我们提出了通过保留梯度的裁剪策略优化来协调熵的新算法CE-GPPO，它以温和且有界的方式重新引入原生PPO中被裁剪token的梯度。通过控制裁剪区间之外token的梯度幅度，CE-GPPO能够实现探索与利用的权衡。我们提供了理论依据和实证证据，表明CE-GPPO有效缓解了熵不稳定问题。在数学推理基准上的大量实验表明，CE-GPPO在不同模型规模上均持续优于强基线。
论文及项目相关链接
Summary
强化学习(RL)已成为优化大型语言模型(LLM)以处理复杂推理任务的有力工具。核心挑战在于管理策略熵,即训练过程中的探索与利用之间的平衡。现有方法(如近端策略优化(PPO)及其变体)由于裁剪机制而丢弃了来自低概率标记的有价值梯度信号。本文系统地分析了熵动力学,并揭示了这些被裁剪的标记在调节熵演化方面发挥着重要却被忽视的作用。我们提出了协调熵梯度保留策略优化(CE-GPPO)的新算法,以温和和有界的方式重新引入了原生PPO中被裁剪的标记的梯度。通过控制来自裁剪区间外标记的梯度幅度,CE-GPPO能够实现探索与利用之间的平衡。我们提供了理论证明和实验证据表明CE-GPPO有效地缓解了熵不稳定问题。在数学推理基准测试上的广泛实验表明,CE-GPPO在不同模型规模上均优于强大的基线模型。
Key Takeaways
- 强化学习是优化大型语言模型处理复杂推理任务的有效方法。
- 策略熵管理是强化学习中的核心挑战,涉及探索与利用的平衡。
- 现有方法如PPO因裁剪机制而忽略低概率标记的梯度信号。
- 被裁剪的标记在调节熵演化中起重要作用。
- 新算法CE-GPPO重新引入被裁剪标记的梯度,实现温和有界的策略优化。
- CE-GPPO通过控制梯度幅度实现探索与利用的平衡。
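下面是一个基于摘要描述的示意性PyTorch草图：在标准PPO裁剪目标之外，为落在裁剪区间之外的token重新引入一个幅度由系数 delta 控制的有界梯度项。CE-GPPO 的具体公式以论文为准，这里的 delta、掩码方式等均为假设，仅用来说明“保留被裁剪token的梯度并控制其幅度”这一思想。

```python
# 示意性草图（PyTorch）：标准 PPO 裁剪目标 + 为区间外 token 追加一个小而有界的梯度通道。
import torch

def gradient_preserving_clip_loss(logp_new: torch.Tensor,   # [B, T] 新策略对数概率
                                  logp_old: torch.Tensor,   # [B, T] 旧策略对数概率
                                  advantages: torch.Tensor, # [B, T] 组相对优势
                                  mask: torch.Tensor,       # [B, T] 有效 token 掩码
                                  eps: float = 0.2,
                                  delta: float = 0.05) -> torch.Tensor:
    ratio = torch.exp(logp_new - logp_old.detach())
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    ppo_term = torch.minimum(unclipped, clipped)   # 标准 PPO：取到被裁剪分支时，该 token 的梯度被丢弃
    outside = ((ratio < 1.0 - eps) | (ratio > 1.0 + eps)).float()
    # 为区间外 token 追加一个幅度受 delta 限制的梯度项（advantages 不回传梯度）
    preserved = delta * outside * ratio * advantages.detach()
    loss = -((ppo_term + preserved) * mask).sum() / mask.sum().clamp(min=1.0)
    return loss

if __name__ == "__main__":
    B, T = 2, 4
    torch.manual_seed(0)
    logp_old = torch.randn(B, T)
    logp_new = (logp_old + 0.5 * torch.randn(B, T)).requires_grad_()
    adv, mask = torch.randn(B, T), torch.ones(B, T)
    gradient_preserving_clip_loss(logp_new, logp_old, adv, mask).backward()
    print("grad norm:", logp_new.grad.norm().item())
```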
点此查看论文截图





From Easy to Hard: The MIR Benchmark for Progressive Interleaved Multi-Image Reasoning
Authors:Hang Du, Jiayang Zhang, Guoshun Nan, Wendi Deng, Zhenyan Chen, Chenyang Zhang, Wang Xiao, Shan Huang, Yuqi Pan, Tao Qi, Sicong Leng
Multi-image Interleaved Reasoning aims to improve Multi-modal Large Language Models (MLLMs) ability to jointly comprehend and reason across multiple images and their associated textual contexts, introducing unique challenges beyond single-image or non-interleaved multi-image tasks. While current multi-image benchmarks overlook interleaved textual contexts and neglect distinct relationships between individual images and their associated texts, enabling models to reason over multi-image interleaved data may significantly enhance their comprehension of complex scenes and better capture cross-modal correlations. To bridge this gap, we introduce a novel benchmark MIR, requiring joint reasoning over multiple images accompanied by interleaved textual contexts to accurately associate image regions with corresponding texts and logically connect information across images. To enhance MLLMs ability to comprehend multi-image interleaved data, we introduce reasoning steps for each instance within the benchmark and propose a stage-wise curriculum learning strategy. This strategy follows an “easy to hard” approach, progressively guiding models from simple to complex scenarios, thereby enhancing their ability to handle challenging tasks. Extensive experiments benchmarking multiple MLLMs demonstrate that our method significantly enhances models reasoning performance on MIR and other established benchmarks. We believe that MIR will encourage further research into multi-image interleaved reasoning, facilitating advancements in MLLMs capability to handle complex inter-modal tasks.
多图像交错推理旨在提高多模态大型语言模型(MLLMs)在多个图像及其相关文本上下文中的联合理解和推理能力,这带来了超出单图像或非交错多图像任务的独特挑战。当前的多图像基准测试忽视了交错的文本上下文以及单个图像与其相关文本之间的独特关系,使得模型能够在多图像交错数据上进行推理可能会显著增强其对复杂场景的理解,并更好地捕捉跨模态相关性。为了弥补这一差距,我们引入了一个新的基准测试MIR,它要求在多图像伴随交错文本上下文的情境下进行联合推理,以准确地将图像区域与相应的文本相关联,并在图像之间逻辑地连接信息。为了提高MLLMs对多图像交错数据的理解能力,我们在基准测试中引入了针对每个实例的推理步骤,并提出了一种分阶段的课程学习策略。该策略遵循“从易到难”的方法,逐步引导模型从简单到复杂的场景,从而提高其处理复杂任务的能力。对多个MLLMs的广泛基准测试实验表明,我们的方法在MIR和其他既定基准测试上显著提高了模型的推理性能。我们相信,MIR将鼓励对多图像交错推理的进一步研究,推动MLLMs处理复杂跨模态任务的能力的进步。
论文及项目相关链接
PDF Accepted by ICCV 2025
Summary
多图像交错推理旨在提高多模态大型语言模型(MLLMs)对多个图像及其相关文本上下文的联合理解和推理能力,这带来了超越单图像或非交错多图像任务的独特挑战。针对当前多图像基准测试对交错文本上下文的忽视以及单个图像与关联文本之间关系的忽视,我们引入了MIR基准测试,要求进行多图像交错数据的联合推理,以准确地将图像区域与相应文本相关联,并在图像之间逻辑连接信息。为了增强MLLMs对多图像交错数据的理解能力,我们在基准测试中为每个实例引入了推理步骤,并提出了一种分阶段学习策略。通过广泛的实验评估多个MLLMs,证明我们的方法在MIR和其他基准测试上的推理性能得到了显著提升。我们相信,MIR将促进多图像交错推理的进一步研究,推动MLLMs在处理复杂跨模态任务的能力的发展。
Key Takeaways
- 多图像交错推理旨在提高多模态大型语言模型对多个图像和文本的理解与推理能力。
- 当前多图像基准测试忽视了交错文本上下文及图像与文本间的独特关系。
- MIR基准测试要求模型进行多图像交错数据的联合推理,准确关联图像和文本。
- 为增强MLLMs对多图像交错数据的理解,引入了推理步骤和分阶段学习策略。
- 分阶段学习策略采用“由易到难”的方法,逐步提升模型的应对复杂任务的能力。
- 实验证明,所提方法在MIR和其他基准测试上显著提升了模型的推理性能。
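下面给出一个示意性的Python草图，说明“由易到难”的分阶段课程学习如何组织训练数据：按难度排序后先用简单子集训练，再逐步扩充到全量数据。这里用图像数量与推理步数之和作为难度的假设性代理，并非论文的具体划分标准。

```python
# 示意性草图：按难度排序后分阶段扩充训练子集，实现"由易到难"的课程学习。
from typing import Iterator

def difficulty(example: dict) -> int:
    """假设性的难度代理：图像数量 + 推理步数。"""
    return example["num_images"] + example["num_reasoning_steps"]

def curriculum_stages(dataset: list, num_stages: int = 3) -> Iterator[list]:
    """把数据按难度排序后切成若干阶段；第 k 阶段返回由易到难的前若干条（逐步扩充到全量）。"""
    ordered = sorted(dataset, key=difficulty)
    stage_size = max(1, len(ordered) // num_stages)
    for k in range(1, num_stages + 1):
        end = len(ordered) if k == num_stages else k * stage_size
        yield ordered[:end]

if __name__ == "__main__":
    data = [
        {"id": "a", "num_images": 2, "num_reasoning_steps": 2},
        {"id": "b", "num_images": 4, "num_reasoning_steps": 6},
        {"id": "c", "num_images": 3, "num_reasoning_steps": 3},
        {"id": "d", "num_images": 5, "num_reasoning_steps": 8},
    ]
    for stage, subset in enumerate(curriculum_stages(data), start=1):
        print(f"stage {stage}: train on {[ex['id'] for ex in subset]}")
```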
点此查看论文截图






EdiVal-Agent: An Object-Centric Framework for Automated, Fine-Grained Evaluation of Multi-Turn Editing
Authors:Tianyu Chen, Yasi Zhang, Zhi Zhang, Peiyu Yu, Shu Wang, Zhendong Wang, Kevin Lin, Xiaofei Wang, Zhengyuan Yang, Linjie Li, Chung-Ching Lin, Jianwen Xie, Oscar Leong, Lijuan Wang, Ying Nian Wu, Mingyuan Zhou
Instruction-based image editing has advanced rapidly, yet reliable and interpretable evaluation remains a bottleneck. Current protocols either (i) depend on paired reference images-resulting in limited coverage and inheriting biases from prior generative models-or (ii) rely solely on zero-shot vision-language models (VLMs), whose prompt-based assessments of instruction following, content consistency, and visual quality are often imprecise. To address this, we introduce EdiVal-Agent, an automated and fine-grained evaluation framework grounded in an object-centric perspective, designed to assess not only standard single-turn but also multi-turn instruction-based editing with precision. Given an input image, EdiVal-Agent first decomposes it into semantically meaningful objects, then synthesizes diverse, context-aware editing instructions while dynamically updating object pools across turns. These two stages enable two novel object-centric metrics tailored for multi-turn evaluation and one global metric of visual quality: (1) EdiVal-IF, which measures instruction following by combining open-vocabulary object detectors for symbolic checks with VLMs for semantic verification on detector-guided crops; (2) EdiVal-CC, which evaluates content consistency by calculating semantic similarity of unchanged objects and background using the evolving object pools; and (3) EdiVal-VQ, which quantifies changes in overall visual quality with human preference models. Instantiating this pipeline, we build EdiVal-Bench, a multi-turn editing benchmark covering 9 instruction types and 13 state-of-the-art editing models spanning in-context, flow-matching, and diffusion paradigms. We demonstrate that EdiVal-Agent can be used to identify existing failure modes, thereby informing the development of the next generation of editing models.
基于指令的图像编辑发展迅速，但可靠且可解释的评估仍是瓶颈。现有评估协议要么（i）依赖成对的参考图像，导致覆盖范围有限并继承了先前生成模型的偏差；要么（ii）完全依赖零样本视觉语言模型（VLM），其基于提示的指令遵循、内容一致性和视觉质量评估往往不够精确。为此，我们提出了EdiVal-Agent，这是一个以对象为中心的自动化细粒度评估框架，不仅能精确评估标准的单轮编辑，也能评估多轮指令编辑。给定输入图像，EdiVal-Agent先将其分解为语义上有意义的对象，再合成多样化、上下文感知的编辑指令，并在多轮之间动态更新对象池。这两个阶段支撑了两个面向多轮评估的新型对象中心指标以及一个全局视觉质量指标：（1）EdiVal-IF，通过开放词汇目标检测器做符号化检查，并用VLM对检测器引导的裁剪区域做语义验证，以此衡量指令遵循；（2）EdiVal-CC，利用不断演化的对象池计算未改变对象与背景的语义相似度，以评估内容一致性；（3）EdiVal-VQ，借助人类偏好模型量化整体视觉质量的变化。基于这一流程，我们构建了EdiVal-Bench，这是一个多轮编辑基准，涵盖9种指令类型和13个最先进的编辑模型，横跨上下文（in-context）、流匹配（flow-matching）和扩散三类范式。我们证明EdiVal-Agent可用于识别现有失效模式，从而为下一代编辑模型的开发提供参考。
论文及项目相关链接
PDF Tianyu Chen and Yasi Zhang contributed equally; Oscar Leong, Lijuan Wang, Ying Nian Wu, and Mingyuan Zhou advised equally
Summary
本文介绍了针对基于指令的图像编辑评估的瓶颈问题,提出了一种自动化、精细化的评估框架EdiVal-Agent。该框架以对象为中心,不仅能精确评估单轮指令编辑,还能评估多轮指令编辑。通过分解图像和合成多样化的上下文编辑指令,EdiVal-Agent提供了三个新的评估指标:EdiVal-IF、EdiVal-CC和EdiVal-VQ,分别用于评估指令遵循、内容一致性和整体视觉质量。此外,文章还构建了EdiVal-Bench基准测试集,展示了EdiVal-Agent在多种指令类型和编辑模型评估中的应用潜力。
Key Takeaways
- 当前基于指令的图像编辑评估面临瓶颈,需要更可靠和可解释的评估方法。
- EdiVal-Agent是一种新的自动化、精细化评估框架,以对象为中心,能精确评估单轮和多轮指令编辑。
- EdiVal-Agent通过分解图像和合成多样化的上下文编辑指令,提供三个新的评估指标:EdiVal-IF、EdiVal-CC和EdiVal-VQ。
- EdiVal-IF结合开放词汇对象检测器和VLMs,对指令遵循进行评估。
- EdiVal-CC通过计算语义相似性,评估内容一致性。
- EdiVal-VQ量化整体视觉质量的变化,与人类偏好模型相结合。
点此查看论文截图






LaV-CoT: Language-Aware Visual CoT with Multi-Aspect Reward Optimization for Real-World Multilingual VQA
Authors:Jing Huang, Zhiya Tan, Shutao Gong, Fanwei Zeng, Joey Tianyi Zhou, Changtao Miao, Huazhe Tan, Weibin Yao, Jianshu Li
As large vision language models (VLMs) advance, their capabilities in multilingual visual question answering (mVQA) have significantly improved. Chain-of-thought (CoT) reasoning has been proven to enhance interpretability and complex reasoning. However, most existing approaches rely primarily on textual CoT and provide limited support for multilingual multimodal reasoning, constraining their deployment in real-world applications. To address this gap, we introduce LaV-CoT, the first Language-aware Visual CoT framework with Multi-Aspect Reward Optimization. LaV-CoT incorporates an interpretable multi-stage reasoning pipeline consisting of Text Summary with Bounding Box (BBox), Language Identification, Spatial Object-level Captioning, and Step-by-step Logical Reasoning. Following this reasoning pipeline, we design an automated data curation method that generates multilingual CoT annotations through iterative generation, correction, and refinement, enabling scalable and high-quality training data. To improve reasoning and generalization, LaV-CoT adopts a two-stage training paradigm combining Supervised Fine-Tuning (SFT) with Language-aware Group Relative Policy Optimization (GRPO), guided by verifiable multi-aspect rewards including language consistency, structural accuracy, and semantic alignment. Extensive evaluations on public datasets including MMMB, Multilingual MMBench, and MTVQA show that LaV-CoT achieves up to ~9.5% accuracy improvements over open-source baselines of similar size and even surpasses models with 2$\times$ larger scales by ~2.6%. Moreover, LaV-CoT outperforms advanced proprietary models such as GPT-4o-0513 and Gemini-2.5-flash. We further conducted an online A/B test to validate our method on real-world data, highlighting its effectiveness for industrial deployment. Our code is available at this link: https://github.com/HJNVR/LaV-CoT
随着大型视觉语言模型（VLM）的发展，其多语言视觉问答（mVQA）能力显著提升。思维链（CoT）推理已被证明能增强可解释性和复杂推理能力。然而，现有方法大多依赖文本CoT，对多语言多模态推理的支持有限，限制了其在真实应用中的部署。为弥补这一差距，我们提出LaV-CoT，这是首个带有多方面奖励优化的语言感知视觉CoT框架。LaV-CoT包含一条可解释的多阶段推理流水线：带边界框（BBox）的文本摘要、语言识别、空间对象级描述以及逐步逻辑推理。围绕该推理流水线，我们设计了自动化数据构建方法，通过迭代生成、校正和精炼来产生多语言CoT标注，从而获得可扩展的高质量训练数据。为提升推理与泛化能力，LaV-CoT采用两阶段训练范式，将监督微调（SFT）与语言感知的组相对策略优化（GRPO）相结合，并以可验证的多方面奖励（包括语言一致性、结构准确性和语义对齐）作为指导。在MMMB、多语言MMBench和MTVQA等公开数据集上的大量评估显示，LaV-CoT相比同等规模的开源基线准确率最高提升约9.5%，甚至比规模大一倍的模型高出约2.6%。此外，LaV-CoT的表现还优于GPT-4o-0513和Gemini-2.5-flash等先进的专有模型。我们进一步开展了在线A/B测试，在真实世界数据上验证了该方法，凸显了其在工业部署中的有效性。代码见：https://github.com/HJNVR/LaV-CoT
论文及项目相关链接
PDF 12 Pages, 12 Figures, 3 Tables
Summary
随着大型视觉语言模型(VLMs)的进步,它们在多语言视觉问答(mVQA)方面的能力显著提高。为了改进推理并提升在真实世界应用中的部署能力,本文引入了LaV-CoT,这是一个具有多方面奖励优化的语言感知视觉链式思维框架。它通过多阶段推理管道、自动数据整理方法和两阶段训练模式,在多语言环境中实现高效推理和良好泛化能力。在公共数据集上的评估表明,LaV-CoT相较于类似规模的开源基线模型,准确率提高了约9.5%,并且在实际部署中表现出色。相关代码已公开分享。
Key Takeaways
- 大型视觉语言模型在多语言视觉问答方面的能力显著提升。
- LaV-CoT是首个语言感知的视觉链式思维框架,支持多语言多模态推理。
- LaV-CoT采用多阶段推理管道,包括文本摘要、语言识别、空间对象级描述和逐步逻辑推理。
- LaV-CoT通过自动化数据整理方法生成多语言链式思维注释,实现高质量训练数据的可扩展性。
- LaV-CoT采用两阶段训练模式,结合监督微调与语言感知的相对策略优化,提高推理和泛化能力。
- 在公共数据集上的评估显示,LaV-CoT相较于现有模型有明显性能提升。
点此查看论文截图





Towards Secure and Explainable Smart Contract Generation with Security-Aware Group Relative Policy Optimization
Authors:Lei Yu, Jingyuan Zhang, Xin Wang, Jiajia Ma, Li Yang, Fengjun Zhang
Smart contracts automate the management of high-value assets, where vulnerabilities can lead to catastrophic financial losses. This challenge is amplified in Large Language Models (LLMs) by two interconnected failures: they operate as unauditable “black boxes” lacking a transparent reasoning process, and consequently, generate code riddled with critical security vulnerabilities. To address both issues, we propose SmartCoder-R1 (based on Qwen2.5-Coder-7B), a novel framework for secure and explainable smart contract generation. It begins with Continual Pre-training (CPT) to specialize the model. We then apply Long Chain-of-Thought Supervised Fine-Tuning (L-CoT SFT) on 7,998 expert-validated reasoning-and-code samples to train the model to emulate human security analysis. Finally, to directly mitigate vulnerabilities, we employ Security-Aware Group Relative Policy Optimization (S-GRPO), a reinforcement learning phase that refines the generation policy by optimizing a weighted reward signal for compilation success, security compliance, and format correctness. Evaluated against 17 baselines on a benchmark of 756 real-world functions, SmartCoder-R1 establishes a new state of the art, achieving top performance across five key metrics: a ComPass of 87.70%, a VulRate of 8.60%, a SafeAval of 80.16%, a FuncRate of 53.84%, and a FullRate of 50.53%. This FullRate marks a 45.79% relative improvement over the strongest baseline, DeepSeek-R1. Crucially, its generated reasoning also excels in human evaluations, achieving high-quality ratings for Functionality (82.7%), Security (85.3%), and Clarity (90.7%).
智能合约自动管理高价值资产,其中存在的漏洞可能导致重大财务损失。这一挑战在大规模语言模型(LLMs)中被两种相互关联的失败所放大:它们作为无法审计的“黑箱”缺乏透明的推理过程,因此生成的代码充斥着关键的安全漏洞。为了解决这两个问题,我们提出了基于Qwen2.5-Coder-7B的智能合约生成新型框架SmartCoder-R1。它首先通过持续预训练(CPT)来专业化模型。然后我们在7998个经过专家验证的推理和代码样本上应用长链思维监督微调(L-CoT SFT),以训练模型模拟人类安全分析。最后,为了直接缓解漏洞问题,我们采用了安全感知组相对策略优化(S-GRPO),这是一种强化学习阶段,通过优化编译成功、安全合规和格式正确的加权奖励信号来完善生成策略。在包含756个真实世界函数的基准测试上,SmartCoder-R1与17个基线系统进行了评估,并在五个关键指标上达到了最新水平:Compass达到87.70%,VulRate达到8.60%,SafeAval达到80.16%,FuncRate达到53.84%,FullRate达到50.53%。FullRate相对于表现最强的基线系统DeepSeek-R1有45.79%的相对改进。关键的是,其生成的推理在人类评估中也表现出色,在功能、安全和清晰度方面获得了高质量的评价(功能82.7%,安全85.3%,清晰度90.7%)。
论文及项目相关链接
Summary
该文本介绍了智能合约在高价值资产管理中的自动化挑战,并指出大型语言模型(LLMs)存在的两大问题:无法审计的“黑箱”操作和缺乏透明推理过程,以及由此产生的代码中的关键安全漏洞。为解决这个问题,提出了一种基于Qwen2.5-Coder-7B的新型框架SmartCoder-R1,通过持续预训练(CPT)进行模型专业化,并通过长思考链监督微调(L-CoT SFT)训练模型以模拟人类安全分析。最后,通过安全感知组相对策略优化(S-GRPO)直接缓解漏洞。SmartCoder-R1在真实世界函数基准测试中相对于17个基线取得了最佳性能,并在五个关键指标上达到了新的水平。同时,其生成的理由在人类评估中也表现优异。
Key Takeaways
- 智能合约管理高价值资产时存在挑战,因漏洞可能导致巨大财务损失。
- 大型语言模型(LLMs)因缺乏透明度和安全性面临批评。
- SmartCoder-R1框架被提出以解决这些问题,它通过持续预训练、长思考链监督微调和安全感知组相对策略优化等技术来提升智能合约的生成质量和安全性。
- SmartCoder-R1在真实世界函数基准测试中表现优异,相对于最强的基线DeepSeek-R1,FullRate提高了45.79%。
- SmartCoder-R1生成的推理在人类评估中也获得了高评价,在功能性、安全性和清晰度方面表现突出。
- 该框架基于Qwen2.5-Coder-7B,具有强大的生成能力和安全性优化潜力。
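下面给出一个示意性的Python草图，说明如何把“编译成功、安全合规、格式正确”三类可验证信号加权合成为S-GRPO使用的标量奖励。其中的权重、扣分规则与格式检查均为假设占位，真实系统需要接入Solidity编译器和漏洞检测工具。

```python
# 示意性草图：将三个可验证信号加权合成为单一标量奖励，供强化学习阶段使用。
def security_aware_reward(code: str,
                          compiles: bool,
                          vulnerability_count: int,
                          weights: tuple = (0.4, 0.4, 0.2)) -> float:
    w_compile, w_security, w_format = weights                  # 权重为假设值
    r_compile = 1.0 if compiles else 0.0                        # 编译成功信号
    r_security = 1.0 if vulnerability_count == 0 else max(0.0, 1.0 - 0.5 * vulnerability_count)
    r_format = 1.0 if code.strip().startswith("pragma solidity") else 0.0   # 极简格式检查（假设）
    return w_compile * r_compile + w_security * r_security + w_format * r_format

if __name__ == "__main__":
    sample = "pragma solidity ^0.8.0;\ncontract Escrow { /* ... */ }"
    print(security_aware_reward(sample, compiles=True, vulnerability_count=1))   # 0.4 + 0.2 + 0.2 = 0.8
```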
点此查看论文截图

