⚠️ All of the content summaries below are generated by a large language model; they may contain errors, are for reference only, and should be used with caution.
🔴 Please note: never rely on these summaries in serious academic settings; they are only meant as a first-pass screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-11-07
Part-Aware Bottom-Up Group Reasoning for Fine-Grained Social Interaction Detection
Authors:Dongkeun Kim, Minsu Cho, Suha Kwak
Social interactions often emerge from subtle, fine-grained cues such as facial expressions, gaze, and gestures. However, existing methods for social interaction detection overlook such nuanced cues and primarily rely on holistic representations of individuals. Moreover, they directly detect social groups without explicitly modeling the underlying interactions between individuals. These drawbacks limit their ability to capture localized social signals and introduce ambiguity when group configurations should be inferred from social interactions grounded in nuanced cues. In this work, we propose a part-aware bottom-up group reasoning framework for fine-grained social interaction detection. The proposed method infers social groups and their interactions using body part features and their interpersonal relations. Our model first detects individuals and enhances their features using part-aware cues, and then infers group configuration by associating individuals via similarity-based reasoning, which considers not only spatial relations but also subtle social cues that signal interactions, leading to more accurate group inference. Experiments on the NVI dataset demonstrate that our method outperforms prior methods, achieving the new state of the art.
Paper and Project Links
PDF Accepted to NeurIPS 2025
Summary
The paper observes that social interactions often emerge from subtle, fine-grained cues such as facial expressions, gaze, and gestures. Existing social interaction detection methods overlook these nuanced cues, relying mainly on holistic representations of individuals, and they detect social groups directly without explicitly modeling the underlying interactions between individuals. To address this, the authors propose a part-aware, bottom-up group reasoning framework for fine-grained social interaction detection. The method infers social groups and their interactions from body part features and interpersonal relations: it first detects individuals and enhances their features with part-aware cues, then infers the group configuration by associating individuals through similarity-based reasoning that considers both spatial relations and the subtle social cues that signal interaction, yielding more accurate group inference. Experiments on the NVI dataset show that the method outperforms prior work and sets a new state of the art.
Key Takeaways
- Social interactions typically arise from subtle cues such as facial expressions, gaze, and gestures.
- Existing social interaction detection methods rely mainly on holistic representations of individuals and overlook these fine-grained cues.
- Detecting social groups directly, without modeling the underlying interactions between individuals, is a key limitation of prior methods.
- A part-aware, bottom-up group reasoning framework is proposed for fine-grained social interaction detection.
- The method infers social groups and their interactions from body part features and interpersonal relations.
- Group configurations are inferred by enhancing individual features and associating individuals via similarity-based reasoning.
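The similarity-based group reasoning step can be illustrated with a toy sketch. The code below is not the authors' model: it simply assumes per-person embeddings (e.g., part-aware features), links any pair whose cosine similarity exceeds a threshold, and reads groups off the resulting connected components; the embeddings, the threshold, and the choice of cosine similarity are all illustrative assumptions.

```python
import numpy as np

def group_by_similarity(features: np.ndarray, threshold: float = 0.8):
    """Cluster individuals into groups from pairwise feature similarity.

    features: (N, D) array of per-person embeddings (e.g. part-aware features).
    Returns a list of groups, each a list of person indices.
    """
    n = features.shape[0]
    # Cosine similarity between every pair of individuals.
    normed = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    sim = normed @ normed.T

    # Union-find: link any pair whose similarity exceeds the threshold.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i, j] >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

people = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
print(group_by_similarity(people))  # [[0, 1], [2]]
```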
CareMedEval dataset: Evaluating Critical Appraisal and Reasoning in the Biomedical Field
Authors:Doria Bonzi, Alexandre Guiggi, Frédéric Béchet, Carlos Ramisch, Benoit Favre
Critical appraisal of scientific literature is an essential skill in the biomedical field. While large language models (LLMs) can offer promising support in this task, their reliability remains limited, particularly for critical reasoning in specialized domains. We introduce CareMedEval, an original dataset designed to evaluate LLMs on biomedical critical appraisal and reasoning tasks. Derived from authentic exams taken by French medical students, the dataset contains 534 questions based on 37 scientific articles. Unlike existing benchmarks, CareMedEval explicitly evaluates critical reading and reasoning grounded in scientific papers. Benchmarking state-of-the-art generalist and biomedical-specialized LLMs under various context conditions reveals the difficulty of the task: open and commercial models fail to exceed an Exact Match Rate of 0.5 even though generating intermediate reasoning tokens considerably improves the results. Yet, models remain challenged especially on questions about study limitations and statistical analysis. CareMedEval provides a challenging benchmark for grounded reasoning, exposing current LLM limitations and paving the way for future development of automated support for critical appraisal.
Paper and Project Links
PDF Preprint submitted to LREC 2026 (under review) To access the dataset, see https://github.com/bonzid/CareMedEval
Summary
This paper introduces CareMedEval, a dataset for evaluating large language models (LLMs) on biomedical critical appraisal and reasoning tasks. Derived from authentic exams taken by French medical students, it contains 534 questions based on 37 scientific articles. Benchmarking shows the task is difficult: open and commercial models fail to exceed an Exact Match Rate of 0.5, although generating intermediate reasoning tokens considerably improves results. Models remain particularly challenged by questions about study limitations and statistical analysis. CareMedEval thus provides a challenging benchmark for grounded reasoning, exposing current LLM limitations and paving the way for automated support for critical appraisal.
Key Takeaways
- CareMedEval is a dataset designed to evaluate LLMs on biomedical critical appraisal and reasoning tasks.
- The dataset contains 534 questions derived from authentic exams taken by French medical students, grounded in 37 scientific articles.
- LLMs struggle on the task, especially on questions about study limitations and statistical analysis.
- Generating intermediate reasoning tokens considerably improves LLM results.
- CareMedEval offers a challenging benchmark that exposes the limitations of current LLMs.
- The dataset can guide future research, particularly toward automated support for critical appraisal in the biomedical field.
LFC-DA: Logical Formula-Controlled Data Augmentation for Enhanced Logical Reasoning
Authors:Shenghao Li
For complex logical data augmentation, heavy reliance on human annotation is costly, whereas direct generation with large language models yields uninterpretable and logically homogeneous examples. To address this, we present LFC-DA, a symbolic-logic-controlled pipeline: logical text is first mapped to propositional expressions, a compact rule library is compiled, and a bounded state-space search systematically discovers valid formulas that are then verbalized back into natural-language questions, ensuring both diversity and logical rigor under propositional logic. Experiments on ReClor and LogiQA show significant improvements in the logical-reasoning accuracy of pretrained models, confirming the effectiveness of LFC-DA for LLM-guided logical data augmentation.
Paper and Project Links
PDF 10 pages, 6 figures
Summary: For complex logical data augmentation, relying on human annotation is costly, while direct generation with large language models yields uninterpretable and logically homogeneous examples. To address this, the authors present LFC-DA, a symbolic-logic-controlled pipeline: logical text is first mapped to propositional expressions, a compact rule library is compiled, and a bounded state-space search systematically discovers valid formulas, which are then verbalized back into natural-language questions, ensuring both diversity and logical rigor under propositional logic. Experiments on ReClor and LogiQA show significant improvements in the logical-reasoning accuracy of pretrained models, confirming the effectiveness of LFC-DA for LLM-guided logical data augmentation.
Key Takeaways:
- LFC-DA is a symbolic-logic-controlled pipeline for logical data augmentation.
- The pipeline maps logical text to propositional expressions, compiles a rule library, and discovers valid formulas via a bounded state-space search.
- The approach ensures that generated examples are both diverse and logically rigorous.
- By verbalizing the discovered formulas into natural-language questions, LFC-DA improves the logical-reasoning accuracy of pretrained models.
- Experiments on ReClor and LogiQA demonstrate the effectiveness of LFC-DA.
- Relying purely on human annotation is costly, while direct generation with large language models can produce hard-to-interpret results.
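The bounded state-space search over propositional formulas can be sketched in a few lines. The fragment below is a simplified stand-in for LFC-DA's pipeline (there is no rule library or verbalization step): it enumerates formulas over two variables up to a small depth and keeps those that are valid under every truth assignment.

```python
import itertools

VARS = ["p", "q"]

def evaluate(formula, env):
    """Evaluate a nested-tuple propositional formula under a truth assignment."""
    if isinstance(formula, str):
        return env[formula]
    op, *args = formula
    if op == "not":
        return not evaluate(args[0], env)
    if op == "and":
        return evaluate(args[0], env) and evaluate(args[1], env)
    if op == "or":
        return evaluate(args[0], env) or evaluate(args[1], env)
    if op == "implies":
        return (not evaluate(args[0], env)) or evaluate(args[1], env)
    raise ValueError(op)

def is_valid(formula):
    """A formula is valid (a tautology) if it is true under every assignment."""
    return all(
        evaluate(formula, dict(zip(VARS, values)))
        for values in itertools.product([False, True], repeat=len(VARS))
    )

def bounded_search(max_depth=2):
    """Enumerate formulas up to a small depth and keep the valid ones."""
    level = list(VARS)
    found = []
    for _ in range(max_depth):
        next_level = []
        for a, b in itertools.product(level, repeat=2):
            for op in ("and", "or", "implies"):
                next_level.append((op, a, b))
        next_level += [("not", a) for a in level]
        found += [f for f in next_level if is_valid(f)]
        level = next_level
    return found

print(bounded_search()[:5])  # e.g. ('implies', 'p', 'p'), ('implies', 'q', 'q'), ...
```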
Auditing M-LLMs for Privacy Risks: A Synthetic Benchmark and Evaluation Framework
Authors:Junhao Li, Jiahao Chen, Zhou Feng, Chunyi Zhou
Recent advances in multi-modal Large Language Models (M-LLMs) have demonstrated a powerful ability to synthesize implicit information from disparate sources, including images and text. These resourceful data from social media also introduce a significant and underexplored privacy risk: the inference of sensitive personal attributes from seemingly daily media content. However, the lack of benchmarks and comprehensive evaluations of state-of-the-art M-LLM capabilities hinders the research of private attribute profiling on social media. Accordingly, we propose (1) PRISM, the first multi-modal, multi-dimensional and fine-grained synthesized dataset incorporating a comprehensive privacy landscape and dynamic user history; (2) an Efficient evaluation framework that measures the cross-modal privacy inference capabilities of advanced M-LLM. Specifically, PRISM is a large-scale synthetic benchmark designed to evaluate cross-modal privacy risks. Its key feature is 12 sensitive attribute labels across a diverse set of multi-modal profiles, which enables targeted privacy analysis. These profiles are generated via a sophisticated LLM agentic workflow, governed by a prior distribution to ensure they realistically mimic social media users. Additionally, we propose a Multi-Agent Inference Framework that leverages a pipeline of specialized LLMs to enhance evaluation capabilities. We evaluate the inference capabilities of six leading M-LLMs (Qwen, Gemini, GPT-4o, GLM, Doubao, and Grok) on PRISM. The comparison with human performance reveals that these MLLMs significantly outperform in accuracy and efficiency, highlighting the threat of potential privacy risks and the urgent need for robust defenses.
Paper and Project Links
PDF 14 pages, 3 figures; Accepted by MMM 2026; Complete version in progress
Summary
Multi-modal large language models (M-LLMs) can synthesize implicit information from disparate sources such as images and text. This capability also introduces a significant and underexplored privacy risk: the inference of sensitive personal attributes from seemingly everyday social media content. Because benchmarks and comprehensive evaluations of state-of-the-art M-LLMs are lacking, the authors propose (1) PRISM, the first multi-modal, multi-dimensional, fine-grained synthesized dataset covering a comprehensive privacy landscape and dynamic user history, and (2) an efficient evaluation framework that measures the cross-modal privacy inference capabilities of advanced M-LLMs. PRISM is a large-scale synthetic benchmark whose key feature is 12 sensitive attribute labels across diverse multi-modal profiles, enabling targeted privacy analysis; the profiles are generated by an LLM agentic workflow governed by a prior distribution so that they realistically mimic social media users. A multi-agent inference framework built from a pipeline of specialized LLMs is also proposed to strengthen the evaluation. Evaluating six leading M-LLMs (Qwen, Gemini, GPT-4o, GLM, Doubao, and Grok) on PRISM shows that they significantly outperform humans in accuracy and efficiency, highlighting the threat of potential privacy risks and the urgent need for robust defenses.
Key Takeaways
- Multi-modal large language models (M-LLMs) can synthesize information from multiple sources such as images and text.
- Social media content introduces a significant, underexplored privacy risk: the inference of sensitive personal attributes from everyday posts.
- The lack of benchmarks and comprehensive capability evaluations of M-LLMs has hindered research on privacy attribute profiling on social media.
- PRISM is a multi-modal, multi-dimensional, fine-grained synthetic benchmark for evaluating cross-modal privacy risks.
- PRISM provides 12 sensitive attribute labels for targeted privacy analysis, with synthetic profiles that realistically mimic social media users.
- A multi-agent inference framework built from specialized LLMs is introduced to strengthen the evaluation.
LEGO-Eval: Towards Fine-Grained Evaluation on Synthesizing 3D Embodied Environments with Tool Augmentation
Authors:Gyeom Hwangbo, Hyungjoo Chae, Minseok Kang, Hyeonjong Ju, Soohyun Oh, Jinyoung Yeo
Despite recent progress in using Large Language Models (LLMs) for automatically generating 3D scenes, generated scenes often lack realistic spatial layouts and object attributes found in real-world environments. As this problem stems from insufficiently detailed, coarse-grained instructions, advancing 3D scene synthesis guided by more detailed, fine-grained instructions that reflect real-world environments becomes crucial. Without such realistic scenes, training embodied agents in unrealistic environments can lead them to learn priors that diverge significantly from real-world physics and semantics, degrading their performance when deployed. Thus, verifying the alignment between the fine-grained instruction and the generated scene is essential for effective learning. However, current evaluation methods, such as CLIPScore and vision-language models (VLMs), often fail to reliably assess such alignment. This shortcoming arises primarily from their shallow understanding of 3D scenes, which often leads to improperly grounded scene components. To address this, we introduce LEGO-Eval, an evaluation framework equipped with diverse tools designed to explicitly ground scene components, enabling more accurate alignment assessments. We also present LEGO-Bench, a benchmark of detailed instructions that specify complex layouts and attributes of real-world environments. Experiments demonstrate that LEGO-Eval outperforms VLM-as-a-judge by 0.41 F1 score in assessing scene-instruction alignment. Benchmarking with LEGO-Bench reveals significant limitations in current generation methods. Across all evaluated approaches, success rates reached at most 10% in generating scenes that fully align with fine-grained instructions.
Paper and Project Links
PDF Work in Progress
Summary
Although large language models (LLMs) have advanced automatic 3D scene generation, the generated scenes often lack the realistic spatial layouts and object attributes found in real-world environments, largely because the guiding instructions are too coarse-grained. Advancing 3D scene synthesis with detailed, fine-grained instructions that reflect real-world environments is therefore crucial: training embodied agents in unrealistic environments can lead them to learn priors that diverge from real-world physics and semantics, degrading deployed performance. Verifying the alignment between a fine-grained instruction and the generated scene is thus essential, yet existing evaluation methods such as CLIPScore and vision-language models (VLMs) cannot reliably assess this alignment because of their shallow understanding of 3D scenes. To address this, the authors introduce LEGO-Eval, an evaluation framework equipped with diverse tools that explicitly ground scene components, together with LEGO-Bench, a benchmark of detailed instructions specifying complex layouts and attributes of real-world environments. LEGO-Eval outperforms VLM-as-a-judge by 0.41 F1 score in assessing scene-instruction alignment, and benchmarking with LEGO-Bench reveals significant limitations in current generation methods.
Key Takeaways
- LLM-generated 3D scenes often lack realistic spatial layouts and object attributes.
- Existing evaluation methods such as CLIPScore and VLMs cannot reliably assess scene-instruction alignment.
- LEGO-Eval (an evaluation framework) and LEGO-Bench (a benchmark of detailed instructions) are introduced to address this.
- LEGO-Eval uses diverse tools to explicitly ground scene components, enabling more accurate alignment assessment.
- Experiments show LEGO-Eval outperforms VLM-as-a-judge by 0.41 F1 score on scene-instruction alignment.
- Current generation methods reach at most a 10% success rate in producing scenes that fully align with fine-grained instructions.
Agent-Omni: Test-Time Multimodal Reasoning via Model Coordination for Understanding Anything
Authors:Huawei Lin, Yunzhi Shi, Tong Geng, Weijie Zhao, Wei Wang, Ravender Pal Singh
Multimodal large language models (MLLMs) have shown strong capabilities but remain limited to fixed modality pairs and require costly fine-tuning with large aligned datasets. Building fully omni-capable models that can integrate text, images, audio, and video remains impractical and lacks robust reasoning support. In this paper, we propose an Agent-Omni framework that coordinates existing foundation models through a master-agent system, enabling flexible multimodal reasoning without retraining. The master agent interprets user intent, delegates subtasks to modality-specific agents, and integrates their outputs into coherent responses. Extensive experiments across text, image, audio, video, and omni benchmarks show that Agent-Omni consistently achieves state-of-the-art performance, particularly on tasks requiring complex cross-modal reasoning. Its agent-based design enables seamless integration of specialized foundation models, ensuring adaptability to diverse inputs while maintaining transparency and interpretability. In addition, the framework is modular and easily extensible, allowing future improvements as stronger models become available.
Paper and Project Links
PDF 16 pages, 7 figures, 14 tables. Under Review
Summary
This paper proposes Agent-Omni, a framework that coordinates existing foundation models through a master-agent system, enabling flexible multimodal reasoning without retraining. The master agent interprets user intent, delegates subtasks to modality-specific agents, and integrates their outputs into coherent responses. Experiments across text, image, audio, video, and omni benchmarks show that Agent-Omni consistently achieves state-of-the-art performance, particularly on tasks requiring complex cross-modal reasoning. Its agent-based design allows seamless integration of specialized foundation models and adapts to diverse inputs while remaining transparent and interpretable, and the framework is modular and easily extensible as stronger models become available.
Key Takeaways
- The Agent-Omni framework coordinates existing foundation models through a master-agent system, enabling flexible multimodal reasoning without retraining.
- The master agent interprets user intent and delegates subtasks to modality-specific agents.
- Agent-Omni achieves state-of-the-art results across benchmarks, especially on tasks requiring complex cross-modal reasoning.
- The agent-based design adapts to diverse inputs while maintaining transparency and interpretability.
- The framework is modular and easily extensible, allowing future improvements as stronger models become available.
- It addresses the limitations of multimodal LLMs that are restricted to fixed modality pairs and require costly fine-tuning with large aligned datasets.
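The coordination pattern, a master agent that interprets a request, delegates subtasks to modality-specific agents, and merges their outputs, can be sketched as below. All agent functions and the hard-coded plan are hypothetical placeholders; the real system would back each role with a foundation model.

```python
from typing import Callable, Dict, List

# Hypothetical modality-specific agents; in practice each would wrap a
# separate text / image / audio foundation model.
def text_agent(task: str) -> str:
    return f"[text analysis of: {task}]"

def image_agent(task: str) -> str:
    return f"[image analysis of: {task}]"

def audio_agent(task: str) -> str:
    return f"[audio analysis of: {task}]"

AGENTS: Dict[str, Callable[[str], str]] = {
    "text": text_agent,
    "image": image_agent,
    "audio": audio_agent,
}

def master_agent(user_request: str, plan: List[tuple]) -> str:
    """Delegate subtasks to modality-specific agents and merge their outputs.

    `plan` is a list of (modality, subtask) pairs; a real master agent would
    produce it by interpreting the user request with an LLM.
    """
    partial_results = [AGENTS[modality](subtask) for modality, subtask in plan]
    # Integration step: a real system would ask the master LLM to synthesize
    # a coherent answer; here we simply concatenate the partial results.
    return f"Answer to '{user_request}': " + " ".join(partial_results)

print(master_agent(
    "What is the speaker pointing at?",
    [("image", "locate the pointing gesture"), ("audio", "transcribe the speech")],
))
```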
Scalable Evaluation and Neural Models for Compositional Generalization
Authors:Giacomo Camposampiero, Pietro Barbiero, Michael Hersche, Roger Wattenhofer, Abbas Rahimi
Compositional generalization-a key open challenge in modern machine learning-requires models to predict unknown combinations of known concepts. However, assessing compositional generalization remains a fundamental challenge due to the lack of standardized evaluation protocols and the limitations of current benchmarks, which often favor efficiency over rigor. At the same time, general-purpose vision architectures lack the necessary inductive biases, and existing approaches to endow them compromise scalability. As a remedy, this paper introduces: 1) a rigorous evaluation framework that unifies and extends previous approaches while reducing computational requirements from combinatorial to constant; 2) an extensive and modern evaluation on the status of compositional generalization in supervised vision backbones, training more than 5000 models; 3) Attribute Invariant Networks, a class of models establishing a new Pareto frontier in compositional generalization, achieving a 23.43% accuracy improvement over baselines while reducing parameter overhead from 600% to 16% compared to fully disentangled counterparts. Our code is available at https://github.com/IBM/scalable-compositional-generalization.
Paper and Project Links
PDF Accepted at the Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025
Summary
The paper addresses compositional generalization, a key open challenge in machine learning that requires models to predict unknown combinations of known concepts. Assessing it remains difficult because standardized evaluation protocols are lacking and current benchmarks often favor efficiency over rigor, while general-purpose vision architectures lack the necessary inductive biases and existing remedies compromise scalability. The authors introduce a rigorous evaluation framework that unifies and extends previous approaches while reducing computational requirements from combinatorial to constant, conduct an extensive evaluation of compositional generalization in supervised vision backbones by training more than 5000 models, and propose Attribute Invariant Networks, a class of models that establishes a new Pareto frontier in compositional generalization while reducing parameter overhead. The code is publicly available on GitHub.
Key Takeaways
- Compositional generalization is a key challenge in machine learning, requiring models to predict unknown combinations of known concepts.
- The lack of standardized evaluation protocols and the limitations of existing benchmarks have held back research on compositional generalization.
- A rigorous evaluation framework is introduced that unifies and extends previous approaches while reducing computational requirements from combinatorial to constant.
- An extensive evaluation of compositional generalization is conducted on supervised vision backbones, training more than 5000 models.
- Attribute Invariant Networks establish a new Pareto frontier for compositional generalization.
- Attribute Invariant Networks improve accuracy by 23.43% over baselines while cutting parameter overhead from 600% to 16% relative to fully disentangled models.
TabDSR: Decompose, Sanitize, and Reason for Complex Numerical Reasoning in Tabular Data
Authors:Changjiang Jiang, Fengchang Yu, Haihua Chen, Wei Lu, Jin Zeng
Complex reasoning over tabular data is crucial in real-world data analysis, yet large language models (LLMs) often underperform due to complex queries, noisy data, and limited numerical capabilities. To address these issues, we propose TabDSR, a framework consisting of: (1) a query decomposer that breaks down complex questions, (2) a table sanitizer that cleans and filters noisy tables, and (3) a program-of-thoughts (PoT)-based reasoner that generates executable code to derive the final answer from the sanitized table. To ensure unbiased evaluation and mitigate data leakage, we introduce a new dataset, CalTab151, specifically designed for complex numerical reasoning over tables. Experimental results demonstrate that TabDSR consistently outperforms existing methods, achieving state-of-the-art (SOTA) performance with 8.79%, 6.08%, and 19.87% accuracy improvement on TAT-QA, TableBench, and TabDSR, respectively. Moreover, our framework integrates seamlessly with mainstream LLMs, providing a robust solution for complex tabular numerical reasoning. These findings highlight the effectiveness of our framework in enhancing LLM performance for complex tabular numerical reasoning. Data and code are available upon request.
Paper and Project Links
PDF Accepted to EMNLP 2025 Findings
Summary
This paper proposes TabDSR, a framework for complex numerical reasoning over tabular data with large language models. It consists of a query decomposer that breaks down complex questions, a table sanitizer that cleans and filters noisy tables, and a program-of-thoughts (PoT) based reasoner that generates executable code to derive the final answer from the sanitized table. To ensure unbiased evaluation and mitigate data leakage, the authors introduce CalTab151, a new dataset designed for complex numerical reasoning over tables. Experiments show that TabDSR consistently outperforms existing methods, with state-of-the-art accuracy improvements of 8.79%, 6.08%, and 19.87% on TAT-QA, TableBench, and TabDSR, respectively, and it integrates seamlessly with mainstream LLMs, providing a robust solution for complex tabular numerical reasoning.
Key Takeaways
- LLMs struggle with complex reasoning over tabular data because of complex queries, noisy data, and limited numerical capabilities.
- TabDSR addresses these challenges with a query decomposer, a table sanitizer, and a program-of-thoughts (PoT) based reasoner.
- The new CalTab151 dataset, designed for complex numerical reasoning over tables, ensures unbiased evaluation and mitigates data leakage.
- Experiments show TabDSR clearly outperforms existing methods across multiple datasets.
- TabDSR integrates seamlessly with mainstream LLMs.
- The framework enhances LLM performance on complex tabular numerical reasoning.
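A program-of-thoughts (PoT) reasoner answers a question by executing generated code against the (sanitized) table. The sketch below uses a toy pandas table, a hand-written sanitizer, and a hard-coded "generated" program standing in for LLM output; the column names and the question are invented for illustration and are not from the paper.

```python
import pandas as pd

# Toy table with a noisy row, standing in for a raw financial table.
table = pd.DataFrame({
    "year": ["2021", "2022", "2023", "n/a"],
    "revenue": ["100", "120", "150", "--"],
})

def sanitize(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows whose numeric columns cannot be parsed (toy table sanitizer)."""
    df = df.copy()
    df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")
    return df.dropna(subset=["revenue"])

# A PoT reasoner would have an LLM emit a short program like this one for the
# (decomposed) question "How much did revenue grow from 2021 to 2023?".
generated_program = """
clean = sanitize(table)
start = clean.loc[clean.year == '2021', 'revenue'].item()
end = clean.loc[clean.year == '2023', 'revenue'].item()
answer = end - start
"""

scope = {"table": table, "sanitize": sanitize}
exec(generated_program, scope)   # execute the generated code
print(scope["answer"])           # 50.0
```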
Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models
Authors:Xiaoyu Zhan, Wenxuan Huang, Hao Sun, Xinyu Fu, Changfeng Ma, Shaosheng Cao, Bohan Jia, Shaohui Lin, Zhenfei Yin, Lei Bai, Wanli Ouyang, Yuanqi Li, Jie Guo, Yanwen Guo
Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved 2D visual understanding, prompting interest in their application to complex 3D reasoning tasks. However, it remains unclear whether these models can effectively capture the detailed spatial information required for robust real-world performance, especially cross-view consistency, a key requirement for accurate 3D reasoning. Considering this issue, we introduce Viewpoint Learning, a task designed to evaluate and improve the spatial reasoning capabilities of MLLMs. We present the Viewpoint-100K dataset, consisting of 100K object-centric image pairs with diverse viewpoints and corresponding question-answer pairs. Our approach employs a two-stage fine-tuning strategy: first, foundational knowledge is injected to the baseline MLLM via Supervised Fine-Tuning (SFT) on Viewpoint-100K, resulting in significant improvements across multiple tasks; second, generalization is enhanced through Reinforcement Learning using the Group Relative Policy Optimization (GRPO) algorithm on a broader set of questions. Additionally, we introduce a hybrid cold-start initialization method designed to simultaneously learn viewpoint representations and maintain coherent reasoning thinking. Experimental results show that our approach significantly activates the spatial reasoning ability of MLLM, improving performance on both in-domain and out-of-domain reasoning tasks. Our findings highlight the value of developing foundational spatial skills in MLLMs, supporting future progress in robotics, autonomous systems, and 3D scene understanding.
Paper and Project Links
Summary
Recent advances in multimodal large language models (MLLMs) have substantially improved 2D visual understanding and sparked interest in applying them to complex 3D reasoning tasks. However, it remains unclear whether these models capture the detailed spatial information, particularly cross-view consistency, required for robust real-world 3D reasoning. The authors therefore introduce Viewpoint Learning, a task for evaluating and improving the spatial reasoning of MLLMs, together with the Viewpoint-100K dataset. Their approach uses a two-stage fine-tuning strategy: supervised fine-tuning (SFT) on Viewpoint-100K to inject foundational knowledge into a baseline MLLM, followed by reinforcement learning with the Group Relative Policy Optimization (GRPO) algorithm on a broader set of questions to improve generalization. A hybrid cold-start initialization method is also introduced to learn viewpoint representations while maintaining coherent reasoning. Experiments show the approach significantly activates the spatial reasoning ability of MLLMs, improving performance on both in-domain and out-of-domain reasoning tasks.
Key Takeaways
- MLLMs have made significant progress in 2D visual understanding, motivating their application to complex 3D reasoning tasks.
- Current models still struggle to capture detailed spatial information, especially cross-view consistency.
- Viewpoint Learning is introduced as a task to evaluate and improve the spatial reasoning capabilities of MLLMs.
- The Viewpoint-100K dataset contains 100K object-centric image pairs with diverse viewpoints and corresponding question-answer pairs, supporting the Viewpoint Learning task.
- A two-stage fine-tuning strategy combines supervised fine-tuning (SFT) with reinforcement learning via Group Relative Policy Optimization (GRPO) to strengthen generalization.
- A hybrid cold-start initialization method learns viewpoint representations while maintaining coherent reasoning.
Learning to Seek Evidence: A Verifiable Reasoning Agent with Causal Faithfulness Analysis
Authors:Yuhang Huang, Zekai Lin, Fan Zhong, Lei Liu
Explanations for AI models in high-stakes domains like medicine often lack verifiability, which can hinder trust. To address this, we propose an interactive agent that produces explanations through an auditable sequence of actions. The agent learns a policy to strategically seek external visual evidence to support its diagnostic reasoning. This policy is optimized using reinforcement learning, resulting in a model that is both efficient and generalizable. Our experiments show that this action-based reasoning process significantly improves calibrated accuracy, reducing the Brier score by 18% compared to a non-interactive baseline. To validate the faithfulness of the agent’s explanations, we introduce a causal intervention method. By masking the visual evidence the agent chooses to use, we observe a measurable degradation in its performance ($\Delta$Brier=+0.029), confirming that the evidence is integral to its decision-making process. Our work provides a practical framework for building AI systems with verifiable and faithful reasoning capabilities.
Paper and Project Links
PDF 12 pages, 3 figures. Under review at the Conference on Computer Vision and Pattern Recognition (CVPR) 2026
Summary
Explanations from AI models in high-stakes domains such as medicine often lack verifiability, which can undermine trust. To address this, the authors propose an interactive agent that produces explanations through an auditable sequence of actions. The agent learns a policy, optimized with reinforcement learning, to strategically seek external visual evidence that supports its diagnostic reasoning, yielding a model that is both efficient and generalizable. Experiments show that this action-based reasoning process significantly improves calibrated accuracy, reducing the Brier score by 18% relative to a non-interactive baseline. To validate the faithfulness of the agent's explanations, the authors introduce a causal intervention: masking the visual evidence the agent chooses to use leads to a measurable degradation in performance (ΔBrier = +0.029), confirming that the evidence is integral to its decision-making. The work provides a practical framework for building AI systems with verifiable and faithful reasoning.
Key Takeaways
- Explanations from AI models in high-stakes domains such as medicine often lack verifiability, which can undermine trust.
- The proposed interactive agent produces explanations through an auditable sequence of actions.
- The agent learns a policy to strategically seek external visual evidence that supports its diagnostic reasoning.
- Reinforcement learning is used to optimize the agent's policy, making it both efficient and generalizable.
- The action-based reasoning process significantly improves calibrated accuracy, reducing the Brier score by 18%.
- A causal intervention method is introduced to validate the faithfulness of the agent's explanations.
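The two quantities reported above, the Brier score and the degradation after masking evidence (ΔBrier), are straightforward to compute. The sketch below uses a toy stand-in for the agent and made-up predictions, so the numbers are illustrative only and do not reproduce the paper's results.

```python
import numpy as np

def brier_score(probs: np.ndarray, labels: np.ndarray) -> float:
    """Mean squared error between predicted probabilities and binary labels."""
    return float(np.mean((probs - labels) ** 2))

def predict(cases, mask_evidence: bool = False) -> np.ndarray:
    """Stand-in for the diagnostic agent: returns P(disease) per case.

    A real agent would inspect the evidence regions it requested; here the
    masked branch simply returns less confident predictions for illustration.
    """
    if mask_evidence:
        return np.full(len(cases), 0.5)            # evidence hidden -> uncertain
    return np.array([0.9, 0.2, 0.8, 0.1])          # toy confident predictions

labels = np.array([1, 0, 1, 0])
cases = ["case1", "case2", "case3", "case4"]

brier_full = brier_score(predict(cases), labels)
brier_masked = brier_score(predict(cases, mask_evidence=True), labels)
print(f"Brier (with evidence)   = {brier_full:.3f}")
print(f"Brier (evidence masked) = {brier_masked:.3f}")
print(f"Delta Brier             = {brier_masked - brier_full:+.3f}")  # positive => evidence matters
```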
Lares: LLM-driven Code Slice Semantic Search for Patch Presence Testing
Authors:Siyuan Li, Yaowen Zheng, Hong Li, Jingdong Guo, Chaopeng Dong, Chunpeng Yan, Weijie Wang, Yimo Ren, Limin Sun, Hongsong Zhu
In modern software ecosystems, 1-day vulnerabilities pose significant security risks due to extensive code reuse. Identifying vulnerable functions in target binaries alone is insufficient; it is also crucial to determine whether these functions have been patched. Existing methods, however, suffer from limited usability and accuracy. They often depend on the compilation process to extract features, requiring substantial manual effort and failing for certain software. Moreover, they cannot reliably differentiate between code changes caused by patches or compilation variations. To overcome these limitations, we propose Lares, a scalable and accurate method for patch presence testing. Lares introduces Code Slice Semantic Search, which directly extracts features from the patch source code and identifies semantically equivalent code slices in the pseudocode of the target binary. By eliminating the need for the compilation process, Lares improves usability, while leveraging large language models (LLMs) for code analysis and SMT solvers for logical reasoning to enhance accuracy. Experimental results show that Lares achieves superior precision, recall, and usability. Furthermore, it is the first work to evaluate patch presence testing across optimization levels, architectures, and compilers. The datasets and source code used in this article are available at https://github.com/Siyuan-Li201/Lares.
Paper and Project Links
Summary
In modern software ecosystems, 1-day vulnerabilities pose significant risks because of extensive code reuse, and identifying vulnerable functions in a target binary is not enough: it is also crucial to determine whether those functions have been patched. Existing approaches depend on the compilation process to extract features, require substantial manual effort, fail on some software, and cannot reliably distinguish code changes caused by patches from those caused by compilation variations. To overcome these limitations, the authors propose Lares, a scalable and accurate method for patch presence testing. Lares introduces Code Slice Semantic Search, which extracts features directly from the patch source code and identifies semantically equivalent code slices in the pseudocode of the target binary, eliminating the need for compilation; it leverages large language models (LLMs) for code analysis and SMT solvers for logical reasoning to improve accuracy. Experiments show that Lares achieves superior precision, recall, and usability, and it is the first work to evaluate patch presence testing across optimization levels, architectures, and compilers. The datasets and source code can be found at the linked repository.
Key Takeaways
- Lares is a scalable and accurate method for patch presence testing.
- It extracts features directly from the patch source code, improving the reliability and accuracy of the test.
- Lares uses large language models for code analysis and SMT solvers for logical reasoning.
- Compared with existing methods, Lares improves usability and reduces manual effort by eliminating the compilation step.
- Experiments show Lares performs strongly across optimization levels, architectures, and compilers.
- Lares resolves the inability of prior methods to distinguish code changes caused by patches from those caused by compilation variations.
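The SMT side of such a pipeline can be illustrated with the Z3 Python bindings: checking whether a condition written in the patch and a rewritten condition recovered from the binary's pseudocode are logically equivalent. The specific conditions below are invented, and this fragment covers only one small piece of what Lares actually does.

```python
from z3 import Ints, Solver, Or, And, Not, unsat

x, n = Ints("x n")

# Bounds check as written in the patch source.
patch_cond = Or(x < 0, x >= n)
# The same check as recovered from the target binary's pseudocode (rewritten form).
binary_cond = Not(And(x >= 0, x < n))

solver = Solver()
# The two conditions are equivalent iff any disagreement between them is unsatisfiable.
solver.add(patch_cond != binary_cond)
if solver.check() == unsat:
    print("semantically equivalent: the patch check is present in the binary")
else:
    print("not equivalent; counterexample:", solver.model())
```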
Self-Harmony: Learning to Harmonize Self-Supervision and Self-Play in Test-Time Reinforcement Learning
Authors:Ru Wang, Wei Huang, Qi Cao, Yusuke Iwasawa, Yutaka Matsuo, Jiaxian Guo
Test-time reinforcement learning (TTRL) offers a label-free paradigm for adapting models using only synthetic signals at inference, but its success hinges on constructing reliable learning signals. Standard approaches such as majority voting often collapse to spurious yet popular answers. We introduce Self-Harmony, a framework built on a simple intuition: the correct answer should remain stable across both an original question and its paraphrase. Self-Harmony operationalizes this by employing a single model in two complementary roles: a Solver to produce answers and a Reframer to rephrase the input. Based on this, we further propose a pseudo-label method: instead of majority voting, it aggregates answer frequencies across these original and reframed views using the harmonic mean. This is a process that naturally selects for solutions stable under reframing, thereby avoiding the common trap of favoring view-dependent, spurious answers. Crucially, this requires no human supervision or auxiliary models. Across diverse reasoning benchmarks, Self-Harmony achieves state-of-the-art results at the label-free test-time setting, ranking first in 28 of 30 settings across multiple methods. Beyond accuracy, it demonstrates unprecedented robustness, with zero training failures in all experiments, underscoring its stability and reliability.
Paper and Project Links
Summary
This paper introduces Self-Harmony, a framework for test-time reinforcement learning (TTRL) built on a simple intuition: the correct answer should remain stable across an original question and its paraphrase. Self-Harmony uses a single model in two complementary roles, a Solver that produces answers and a Reframer that rephrases the input, and requires no human supervision or auxiliary models. Instead of majority voting, its pseudo-label method aggregates answer frequencies across the original and reframed views using the harmonic mean, which naturally selects solutions that are stable under reframing and avoids favoring view-dependent, spurious answers. Across diverse reasoning benchmarks, Self-Harmony achieves state-of-the-art results in the label-free test-time setting, ranking first in 28 of 30 settings, and it shows unprecedented robustness with zero training failures across all experiments.
Key Takeaways
- Self-Harmony is a test-time reinforcement learning framework that adapts models without human supervision or auxiliary models.
- It uses a single model in two complementary roles: a Solver that produces answers and a Reframer that rephrases the input.
- Its pseudo-label method aggregates answer frequencies across the original and reframed views using the harmonic mean.
- This design avoids the common trap of favoring view-dependent, spurious answers.
- Self-Harmony achieves state-of-the-art label-free test-time results, ranking first in 28 of 30 settings.
- Beyond accuracy, it shows unprecedented robustness, with zero training failures across all experiments.
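The harmonic-mean pseudo-labeling is easy to reproduce on toy counts. In the sketch below (made-up answer samples, not the paper's data), the answer that is moderately frequent under both the original question and its paraphrase beats the answer that is popular only under the original phrasing.

```python
from collections import Counter

def harmonic_mean(a: float, b: float) -> float:
    return 2 * a * b / (a + b) if a + b > 0 else 0.0

def self_harmony_label(orig_answers, reframed_answers):
    """Pick the answer whose frequency is jointly high across both views."""
    f_orig = Counter(orig_answers)
    f_ref = Counter(reframed_answers)
    n_o, n_r = len(orig_answers), len(reframed_answers)
    candidates = set(f_orig) | set(f_ref)
    return max(
        candidates,
        key=lambda ans: harmonic_mean(f_orig[ans] / n_o, f_ref[ans] / n_r),
    )

# "42" is popular only for the original phrasing (a view-dependent answer);
# "17" is reasonably frequent under both views and wins under the harmonic mean.
orig = ["42", "42", "42", "17", "17"]
reframed = ["17", "17", "17", "9", "42"]
print(self_harmony_label(orig, reframed))   # -> "17"
print(Counter(orig).most_common(1))         # majority vote on the original view picks "42"
```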
Prompt-R1: Collaborative Automatic Prompting Framework via End-to-end Reinforcement Learning
Authors:Wenjin Liu, Haoran Luo, Xueyuan Lin, Haoming Liu, Tiesunlong Shen, Jiapu Wang, Rui Mao, Erik Cambria
Recently, advanced large language models (LLMs) have emerged at an increasingly rapid pace. However, when faced with complex problems, most users are often unable to provide accurate and effective prompts to interact with LLMs, thus limiting the performance of LLMs. To address this challenge, we propose Prompt-R1, an end-to-end reinforcement learning framework that uses a small-scale LLM to collaborate with large-scale LLMs, replacing user interaction to solve problems better. This collaboration is cast as a multi-turn prompt interaction, where the small-scale LLM thinks and generates prompts, and the large-scale LLM performs complex reasoning. A dual-constrained reward is designed to optimize for correctness, generation quality, and reasoning accuracy. Prompt-R1 provides a plug-and-play framework that supports both inference and training with various large-scale LLMs. Experiments on multiple public datasets show that Prompt-R1 significantly outperforms baseline models across tasks. Our code is publicly available at https://github.com/QwenQKing/Prompt-R1.
Paper and Project Links
Summary
When facing complex problems, most users cannot provide accurate and effective prompts for large language models (LLMs), which limits LLM performance. To address this, the authors propose Prompt-R1, an end-to-end reinforcement learning framework in which a small-scale LLM collaborates with a large-scale LLM, replacing direct user interaction to solve problems better. The collaboration is cast as a multi-turn prompt interaction: the small-scale LLM thinks and generates prompts, while the large-scale LLM performs the complex reasoning. A dual-constrained reward is designed to optimize correctness, generation quality, and reasoning accuracy. Prompt-R1 is a plug-and-play framework that supports both inference and training with various large-scale LLMs, and experiments on multiple public datasets show it significantly outperforms baseline models across tasks.
Key Takeaways
- When handling complex problems, users often struggle to provide accurate prompts, which limits LLM performance.
- Prompt-R1 is an end-to-end reinforcement learning framework designed to address this problem.
- It pairs a small-scale LLM with a large-scale LLM in a multi-turn prompt interaction.
- The small-scale LLM thinks and generates prompts, while the large-scale LLM performs the complex reasoning.
- A dual-constrained reward optimizes correctness, generation quality, and reasoning accuracy.
- Prompt-R1 is a plug-and-play framework that supports inference and training with various large-scale LLMs.
Fleming-VL: Towards Universal Medical Visual Reasoning with Multimodal LLMs
Authors:Yan Shu, Chi Liu, Robin Chen, Derek Li, Bryan Dai
Multimodal Large Language Models (MLLMs) have demonstrated remarkable effectiveness in various general-domain scenarios, such as visual question answering and image captioning. Recently, researchers have increasingly focused on empowering MLLMs with medical conversational abilities, which hold significant promise for clinical applications. However, medical data presents unique challenges due to its heterogeneous nature – encompassing diverse modalities including 2D images, 3D volumetric scans, and temporal video sequences. The substantial domain gap and data format inconsistencies across these modalities have hindered the development of unified medical MLLMs. To address these challenges, we propose Fleming-VL, a unified end-to-end framework for comprehensive medical visual understanding across heterogeneous modalities. Fleming-VL tackles this problem from a data-centric perspective through three key strategies: (1) scaling up pretraining by integrating long-context data from both natural and medical-specific domains; (2) complementing fine-tuning with rare medical data, including holistic video analysis and underrepresented 2D modalities such as ultrasound and dermoscopy images; (3) extending existing evaluation frameworks to incorporate 3D volumetric and video understanding benchmarks. Through supervised fine-tuning (SFT) and group relative policy optimization (GRPO), we develop Fleming-VL in multiple model scales. Extensive experiments demonstrate that Fleming-VL achieves state-of-the-art performance across multiple benchmarks, including medical VQA, video QA, and 3D medical image understanding. We publicly release Fleming-VL to promote transparent, reproducible, and auditable progress in medical AI.
Paper and Project Links
Summary
Multimodal large language models (MLLMs) perform well in general-domain scenarios such as visual question answering and image captioning, and researchers are increasingly equipping them with medical conversational abilities that hold significant promise for clinical use. Medical data, however, is heterogeneous, spanning 2D images, 3D volumetric scans, and temporal video sequences, and the domain gaps and data format inconsistencies across these modalities have hindered the development of unified medical MLLMs. The authors propose Fleming-VL, a unified end-to-end framework for medical visual understanding across heterogeneous modalities. It takes a data-centric approach: scaling up pretraining with long-context data from natural and medical-specific domains, complementing fine-tuning with rare medical data such as holistic video analysis and underrepresented 2D modalities (ultrasound and dermoscopy images), and extending existing evaluation frameworks to include 3D volumetric and video understanding benchmarks. Fleming-VL is developed at multiple model scales with supervised fine-tuning (SFT) and group relative policy optimization (GRPO), achieves state-of-the-art results on medical VQA, video QA, and 3D medical image understanding benchmarks, and is publicly released to promote transparent, reproducible, and auditable progress in medical AI.
Key Takeaways
- MLLMs excel across many domains and are now being given medical conversational abilities, which is promising for clinical applications.
- Medical data is heterogeneous, spanning modalities such as 2D images, 3D volumetric scans, and temporal video sequences, which poses unique challenges.
- Fleming-VL is a unified end-to-end framework that tackles this heterogeneity by scaling up pretraining, complementing fine-tuning with rare medical data, and extending evaluation frameworks.
- Fleming-VL is developed at multiple model scales using supervised fine-tuning (SFT) and group relative policy optimization (GRPO).
- Experiments show Fleming-VL achieves state-of-the-art performance on medical VQA, video QA, and 3D medical image understanding benchmarks.
- Fleming-VL is publicly released to promote transparent, reproducible, and auditable progress in medical AI.
Do Math Reasoning LLMs Help Predict the Impact of Public Transit Events?
Authors:Bowen Fang, Ruijian Zha, Xuan Di
Predicting public transit incident duration from unstructured text alerts is a critical but challenging task. Addressing the domain sparsity of transit operations with standard Supervised Fine-Tuning (SFT) is difficult, as the task involves noisy, continuous labels and lacks reliable expert demonstrations for reasoning. While Reinforcement Learning from Verifiable Rewards (RLVR) excels at tasks with binary correctness, like mathematics, its applicability to noisy, continuous forecasting is an open question. This work, to our knowledge, is the first to bridge the gap between RLVR LLM training with the critical, real-world forecasting challenges in public transit operations. We adapt RLVR to this task by introducing a tolerance-based, shaped reward function that grants partial credit within a continuous error margin, rather than demanding a single correct answer. We systematically evaluate this framework on a curated dataset of NYC MTA service alerts. Our findings show that general-purpose, instruction-tuned LLMs significantly outperform specialized math-reasoning models, which struggle with the ambiguous, real-world text. We empirically demonstrate that the binary reward is unstable and degrades performance, whereas our shaped reward design is critical and allows our model to dominate on the most challenging metrics. While classical regressors are superior at minimizing overall MAE or MSE, our RLVR approach achieved a 35% relative improvement in 5-minute accuracy (Acc@5) over the strongest baseline. This demonstrates that RLVR can be successfully adapted to real-world, noisy forecasting, but requires a verifier design that reflects the continuous nature of the problem.
Paper and Project Links
Summary
This work tackles the challenging task of predicting public transit incident duration from unstructured text alerts. Standard Supervised Fine-Tuning (SFT) struggles with the domain sparsity of transit operations, since the task has noisy, continuous labels and lacks reliable expert demonstrations for reasoning, and it is an open question whether Reinforcement Learning from Verifiable Rewards (RLVR), which excels at binary-correctness tasks such as mathematics, applies to noisy, continuous forecasting. The authors adapt RLVR to this setting by introducing a tolerance-based, shaped reward function that grants partial credit within a continuous error margin instead of demanding a single correct answer, and evaluate it on a curated dataset of NYC MTA service alerts. General-purpose, instruction-tuned LLMs significantly outperform specialized math-reasoning models, which struggle with ambiguous real-world text; the binary reward proves unstable and degrades performance, whereas the shaped reward design is critical and lets the model dominate the most challenging metrics. While classical regressors remain better at minimizing overall MAE or MSE, the RLVR approach achieves a 35% relative improvement in 5-minute accuracy (Acc@5) over the strongest baseline, showing that RLVR can be adapted to real-world, noisy forecasting when the verifier reflects the continuous nature of the problem.
Key Takeaways
- Predicting public transit incident duration, especially from unstructured text alerts, is a challenging task.
- Standard Supervised Fine-Tuning (SFT) struggles with the domain sparsity of transit operations.
- This study is the first to combine Reinforcement Learning from Verifiable Rewards (RLVR) with real-world public transit forecasting challenges.
- A tolerance-based shaped reward function is introduced that grants partial credit within a continuous error margin.
- General-purpose, instruction-tuned LLMs outperform specialized math-reasoning models on ambiguous real-world text.
- The instability of the binary reward and its negative effect on performance are demonstrated empirically.
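The abstract does not give the exact reward formula, so the following is only a plausible instance of a tolerance-based shaped reward: full credit inside a small error margin and linearly decaying partial credit beyond it, contrasted with the all-or-nothing binary reward. The tolerance and decay constants are assumptions.

```python
def binary_reward(pred_minutes: float, true_minutes: float, tol: float = 5.0) -> float:
    """All-or-nothing verifier: 1 only if the prediction is within the tolerance."""
    return 1.0 if abs(pred_minutes - true_minutes) <= tol else 0.0

def shaped_reward(pred_minutes: float, true_minutes: float,
                  tol: float = 5.0, decay: float = 15.0) -> float:
    """Tolerance-based shaped reward (illustrative formula, not the paper's):
    full credit within `tol` minutes, then linearly decaying partial credit
    that reaches zero once the error exceeds `tol + decay` minutes."""
    err = abs(pred_minutes - true_minutes)
    if err <= tol:
        return 1.0
    return max(0.0, 1.0 - (err - tol) / decay)

true_duration = 42.0
for pred in (40.0, 50.0, 70.0):
    print(pred, binary_reward(pred, true_duration), round(shaped_reward(pred, true_duration), 2))
# 40.0 -> both rewards 1.0; 50.0 -> binary 0.0 but shaped 0.8; 70.0 -> both 0.0
```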
GrowthHacker: Automated Off-Policy Evaluation Optimization Using Code-Modifying LLM Agents
Authors:Jie JW Wu, Ayanda Patrick Herlihy, Ahmad Saleem Mirza, Ali Afoud, Fatemeh Fard
With the software industry shifting toward a data-driven culture, online A/B testing is a key tool for evaluating new technologies. However, deploying such experiments requires substantial resources, may negatively impact users, and involves long data collection periods. To address this, off-policy evaluation (OPE), or offline A/B testing, uses logged data to assess technologies and is fundamental in Reinforcement Learning, making it crucial in domains where online testing is costly or risky, such as healthcare, recommender systems, education, dialog systems, and robotics. Despite advances in coding LLMs and agentic AI, little is known about leveraging them to optimize OPE results. We investigate whether LLMs and LLM-based agents can improve OPE performance via code optimization. We propose GrowthHacker, a benchmark with agent and baseline methods on large-scale real-world datasets, which iteratively optimizes code, evaluates results, and begins new optimization cycles. We collected datasets, established protocols, implemented baselines for OPE on the Open Bandit Pipeline (OBP) and Scope-RL, and developed the two_agent framework, which reduces system complexity while preserving optimization effectiveness. Results show the two_agent framework achieves 100% reliability and the highest average improvement of 106.7% among positive outcomes. Both two_agent and CrewAI reach 45% success rates, outperforming AutoGen’s 34%. These findings demonstrate the feasibility of LLM-based agents as automated “growth hackers” to enhance OPE systems, with implications for scaling data-driven decision-making in production.
Paper and Project Links
Summary
Online A/B testing is a key tool for evaluating new technologies, but deploying such experiments consumes substantial resources, may negatively impact users, and requires long data collection periods. Off-policy evaluation (OPE), or offline A/B testing, instead uses logged data to assess technologies and is fundamental in reinforcement learning, making it crucial where online testing is costly or risky. The authors investigate whether LLMs and LLM-based agents can improve OPE performance via code optimization and propose GrowthHacker, a benchmark with agent and baseline methods on large-scale real-world datasets that iteratively optimizes code, evaluates results, and starts new optimization cycles. They implement OPE baselines on the Open Bandit Pipeline (OBP) and Scope-RL and develop the two_agent framework, which reduces system complexity while preserving optimization effectiveness. The two_agent framework achieves 100% reliability and the highest average improvement (106.7%) among positive outcomes, and both two_agent and CrewAI reach 45% success rates, outperforming AutoGen's 34%. These results demonstrate the feasibility of LLM-based agents as automated "growth hackers" for OPE systems, with implications for scaling data-driven decision-making in production.
Key Takeaways
- As the software industry shifts toward a data-driven culture, online A/B testing is a key tool for evaluating new technologies, but it is resource-intensive and may negatively impact users.
- Off-policy evaluation (OPE), or offline A/B testing, assesses technologies from logged data and is fundamental in reinforcement learning, especially where online testing is costly or risky.
- LLMs and LLM-based agents show potential for improving OPE performance through code optimization.
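Off-policy evaluation itself is easy to illustrate. The sketch below implements the standard inverse propensity scoring (IPS) estimator on synthetic logged bandit data; it is a textbook OPE baseline of the kind provided by libraries such as OBP, not code from the GrowthHacker paper, and the reward probabilities are invented.

```python
import numpy as np

def ips_estimate(rewards, logging_probs, target_probs):
    """Inverse propensity scoring: estimate the target policy's value
    from data logged under a different (logging) policy."""
    weights = target_probs / logging_probs          # importance weight per logged action
    return float(np.mean(weights * rewards))

rng = np.random.default_rng(0)
n = 10_000
# Logging policy: uniform over 2 actions; action 1 has a higher true reward rate.
actions = rng.integers(0, 2, size=n)
rewards = rng.binomial(1, np.where(actions == 1, 0.6, 0.3))
logging_probs = np.full(n, 0.5)
# Target policy to evaluate offline: plays action 1 with probability 0.9.
target_probs = np.where(actions == 1, 0.9, 0.1)

print("IPS estimate of target policy value:",
      round(ips_estimate(rewards, logging_probs, target_probs), 3))
print("true value:", 0.9 * 0.6 + 0.1 * 0.3)  # ~0.57
```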
CoT-Saliency: Unified Chain-of-Thought Reasoning for Heterogeneous Saliency Tasks
Authors:Long Li, Shuichen Ji, Ziyang Luo, Nian Liu, Dingwen Zhang, Junwei Han
We present the first unified framework that jointly handles three operationally heterogeneous saliency tasks, i.e., SOD, CoSOD, and SIS, by casting each as a Chain-of-Thought (CoT) reasoning process in a Vision-Language Model (VLM) to bridge task heterogeneity. CoT training follows a two-stage paradigm: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). To enhance CoT quality in RL, we propose Confidence-Guided Policy Optimization (CGPO), a lightweight single-sample algorithm that leverages the discrepancy between reward and model confidence as a per-sample advantage signal. This design naturally focuses updates on informative responses while eliminating group sampling, thereby addressing GRPO’s key limitations: confidence-agnostic learning, signal dilution, and prohibitive computational overhead. We also introduce an “output-to-reasoning” strategy to construct high-fidelity SFT data that ensures logical consistency with ground-truth masks. Experiments show our model matches or outperforms specialized SOTA methods and strong closed-source VLMs across all tasks, especially achieving an S-measure of 0.899 on CoCA for CoSOD, surpassing the prior best by 8.0 percentage points, despite using far less training data.
Paper and Project Links
PDF 14 pages,10 figures
Summary
This paper presents the first unified framework that jointly handles three operationally heterogeneous saliency tasks (SOD, CoSOD, and SIS) by casting each as a Chain-of-Thought (CoT) reasoning process in a Vision-Language Model (VLM), bridging the task heterogeneity. CoT training follows a two-stage paradigm of supervised fine-tuning (SFT) and reinforcement learning (RL), and the authors propose Confidence-Guided Policy Optimization (CGPO), a lightweight single-sample algorithm, to improve CoT quality during RL. An "output-to-reasoning" strategy is also used to construct high-fidelity SFT data that stays logically consistent with the ground-truth masks. Experiments show the model matches or outperforms specialized state-of-the-art methods and strong closed-source VLMs across all tasks, notably reaching an S-measure of 0.899 on CoCA for CoSOD, 8.0 percentage points above the prior best, despite using far less training data.
Key Takeaways
- The first unified framework is proposed for three operationally heterogeneous saliency tasks: SOD, CoSOD, and SIS.
- Task heterogeneity is bridged by casting each task as a Chain-of-Thought (CoT) reasoning process in a Vision-Language Model (VLM).
- CoT training follows a two-stage paradigm: supervised fine-tuning (SFT) followed by reinforcement learning (RL).
- Confidence-Guided Policy Optimization (CGPO) uses the discrepancy between reward and model confidence as a per-sample advantage signal to improve CoT quality.
- An "output-to-reasoning" strategy constructs high-fidelity SFT data that is logically consistent with the ground-truth masks.
- Experiments show the model matches or outperforms specialized state-of-the-art methods and strong closed-source VLMs on all tasks.
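CGPO's per-sample advantage, the gap between the reward and the model's own confidence, can be sketched as a small policy-gradient objective. The confidence definition (exponentiated response log-probability) and the REINFORCE-style loss below are assumptions made for illustration, not the paper's exact formulation.

```python
import torch

def cgpo_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Confidence-guided policy loss on a batch of single samples.

    logprobs: summed log-probability of each sampled response (requires grad).
    rewards:  scalar reward per response (e.g. mask IoU), in [0, 1].
    The per-sample advantage is the gap between the reward and the model's
    own confidence, so updates focus on responses the model mis-judges.
    """
    confidence = logprobs.detach().exp()        # model's probability of its response
    advantage = rewards - confidence            # positive: under-confident good answer
    return -(advantage * logprobs).mean()       # REINFORCE-style objective

# Toy batch: two responses with different rewards and confidences.
logprobs = torch.tensor([-0.1, -2.0], requires_grad=True)   # confidences ~0.90 and ~0.14
rewards = torch.tensor([0.95, 0.10])
loss = cgpo_loss(logprobs, rewards)
loss.backward()
print(loss.item(), logprobs.grad)
```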
Token-Regulated Group Relative Policy Optimization for Stable Reinforcement Learning in Large Language Models
Authors:Tue Le, Nghi D. Q. Bui, Linh Ngo Van, Trung Le
Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful approach for strengthening the reasoning capabilities of large language models (LLMs). Among existing algorithms, Group Relative Policy Optimization (GRPO) has demonstrated strong performance, yet it suffers from a critical issue: low-probability tokens disproportionately dominate gradient updates due to their inherently large gradient magnitudes. This imbalance leads to unstable training and suppresses the contribution of high-probability tokens that are more reliable for learning. In this work, we introduce Token-Regulated Group Relative Policy Optimization (TR-GRPO), a simple yet effective extension of GRPO that assigns token-level weights positively correlated with the model’s predicted probability. By downweighting low-probability tokens and emphasizing high-probability ones, TR-GRPO mitigates gradient over-amplification while preserving informative learning signals. Extensive experiments demonstrate that TR-GRPO consistently outperforms GRPO across RLVR tasks, including logic, math, and agentic reasoning, highlighting the importance of regulating token contributions during RL training and establishing TR-GRPO as a robust framework for enhancing LLM reasoning.
Paper and Project Links
Summary: Reinforcement learning with verifiable rewards (RLVR) strengthens the reasoning capabilities of large language models (LLMs). Among existing algorithms, Group Relative Policy Optimization (GRPO) performs well but suffers from a critical issue: low-probability tokens disproportionately dominate gradient updates because of their inherently large gradient magnitudes, which destabilizes training and suppresses the contribution of the more reliable high-probability tokens. To address this, the authors propose Token-Regulated Group Relative Policy Optimization (TR-GRPO), a simple extension of GRPO that assigns token-level weights positively correlated with the model's predicted probability, downweighting low-probability tokens while emphasizing high-probability ones. Experiments show TR-GRPO consistently outperforms GRPO on RLVR tasks including logic, math, and agentic reasoning, highlighting the importance of regulating token contributions during RL training and establishing TR-GRPO as a robust framework for enhancing LLM reasoning.
Key Takeaways:
- Reinforcement learning with verifiable rewards (RLVR) strengthens the reasoning capabilities of large language models (LLMs).
- Group Relative Policy Optimization (GRPO) performs well in RLVR, but its gradient updates are imbalanced.
- TR-GRPO is an effective extension of GRPO that fixes this imbalance by regulating token-level weights.
- TR-GRPO outperforms GRPO on RLVR tasks such as logic, math, and agentic reasoning.
- TR-GRPO mitigates gradient over-amplification while preserving informative learning signals.
- Regulating token contributions during RL training is important.
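The core change, weighting each token's contribution by a quantity positively correlated with its predicted probability, can be sketched as below. The choice of the probability itself as the weight and the surrogate-loss form are illustrative assumptions rather than the paper's exact design.

```python
import torch

def tr_grpo_loss(token_logprobs: torch.Tensor,
                 advantages: torch.Tensor,
                 mask: torch.Tensor) -> torch.Tensor:
    """GRPO-style policy loss with token-level probability weights.

    token_logprobs: (batch, seq) log p(token) under the current policy.
    advantages:     (batch,) group-relative advantage of each sampled response.
    mask:           (batch, seq) 1 for real tokens, 0 for padding.
    """
    # Weight each token by its predicted probability (detached), so confident,
    # high-probability tokens contribute more and noisy low-probability tokens less.
    weights = token_logprobs.detach().exp()
    per_token = -advantages[:, None] * token_logprobs * weights * mask
    return per_token.sum() / mask.sum()

# Toy batch of 2 responses, 3 tokens each.
token_logprobs = torch.log(torch.tensor([[0.9, 0.8, 0.7],
                                         [0.05, 0.6, 0.5]])).requires_grad_()
advantages = torch.tensor([1.0, -0.5])       # from group-relative reward normalization
mask = torch.ones(2, 3)
loss = tr_grpo_loss(token_logprobs, advantages, mask)
loss.backward()
print(loss.item())
```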
Decomposition-Enhanced Training for Post-Hoc Attributions In Language Models
Authors:Sriram Balasubramaniam, Samyadeep Basu, Koustava Goswami, Ryan Rossi, Varun Manjunatha, Roshan Santhosh, Ruiyi Zhang, Soheil Feizi, Nedim Lipka
Large language models (LLMs) are increasingly used for long-document question answering, where reliable attribution to sources is critical for trust. Existing post-hoc attribution methods work well for extractive QA but struggle in multi-hop, abstractive, and semi-extractive settings, where answers synthesize information across passages. To address these challenges, we argue that post-hoc attribution can be reframed as a reasoning problem, where answers are decomposed into constituent units, each tied to specific context. We first show that prompting models to generate such decompositions alongside attributions improves performance. Building on this, we introduce DecompTune, a post-training method that teaches models to produce answer decompositions as intermediate reasoning steps. We curate a diverse dataset of complex QA tasks, annotated with decompositions by a strong LLM, and post-train Qwen-2.5 (7B and 14B) using a two-stage SFT + GRPO pipeline with task-specific curated rewards. Across extensive experiments and ablations, DecompTune substantially improves attribution quality, outperforming prior methods and matching or exceeding state-of-the-art frontier models.
Paper and Project Links
PDF Post-hoc attribution
Summary
Large language models (LLMs) are increasingly used for long-document question answering, where reliable attribution to sources is critical for trust. Existing post-hoc attribution methods work well for extractive QA but struggle in multi-hop, abstractive, and semi-extractive settings where answers synthesize information across passages. The authors reframe post-hoc attribution as a reasoning problem in which answers are decomposed into constituent units, each tied to specific context, and show that prompting models to generate such decompositions alongside attributions improves performance. Building on this, they introduce DecompTune, a post-training method that teaches models to produce answer decompositions as intermediate reasoning steps. They curate a diverse dataset of complex QA tasks annotated with decompositions by a strong LLM and post-train Qwen-2.5 (7B and 14B) with a two-stage SFT + GRPO pipeline using task-specific curated rewards. Across extensive experiments and ablations, DecompTune substantially improves attribution quality, outperforming prior methods and matching or exceeding state-of-the-art frontier models.
Key Takeaways
- Reliable source attribution is critical for trust when LLMs answer questions over long documents.
- Existing post-hoc attribution methods fall short in complex QA settings that require finer-grained reasoning and decomposition.
- Decomposing answers into constituent units tied to specific context improves attribution performance.
- DecompTune is a post-training method that teaches models to produce answer decompositions as intermediate reasoning steps.
- DecompTune trains models with task-specific curated rewards through an efficient two-stage SFT + GRPO pipeline.
- DecompTune substantially improves attribution quality, outperforming prior methods.