LLM

Published: 2025-11-20

Updated: 2025-11-27

Word count: 20.8k

Reading time: 84 min

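⚠️ All summaries below were generated by a large language model; they may contain errors, are for reference only, and should be used with caution.
🔴 Note: do not use them in serious academic settings; they are intended only for initial screening before reading the papers.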

Updated 2025-11-20

UniGen-1.5: Enhancing Image Generation and Editing through Reward Unification in Reinforcement Learning

Authors:Rui Tian, Mingfei Gao, Haiming Gang, Jiasen Lu, Zhe Gan, Yinfei Yang, Zuxuan Wu, Afshin Dehghan

We present UniGen-1.5, a unified multimodal large language model (MLLM) for advanced image understanding, generation and editing. Building upon UniGen, we comprehensively enhance the model architecture and training pipeline to strengthen the image understanding and generation capabilities while unlocking strong image editing ability. In particular, we propose a unified Reinforcement Learning (RL) strategy that improves both image generation and image editing jointly via shared reward models. To further enhance image editing performance, we propose a light Edit Instruction Alignment stage that significantly improves the editing instruction comprehension that is essential for the success of the RL training. Experimental results show that UniGen-1.5 demonstrates competitive understanding and generation performance. Specifically, UniGen-1.5 achieves overall scores of 0.89 and 4.31 on GenEval and ImgEdit, respectively, surpassing state-of-the-art models such as BAGEL and reaching performance comparable to proprietary models such as GPT-Image-1.

Paper and project links

PDF

Summary

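UniGen-1.5 is a unified multimodal large language model for advanced image understanding, generation, and editing. Built on UniGen, it enhances the model architecture and training pipeline to strengthen image understanding and generation while unlocking strong image editing ability. It proposes a unified reinforcement learning strategy that improves image generation and image editing jointly through shared reward models. Experiments show competitive understanding and generation performance: UniGen-1.5 scores 0.89 overall on GenEval and 4.31 overall on ImgEdit, surpassing state-of-the-art models such as BAGEL and reaching performance comparable to proprietary models such as GPT-Image-1.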

Key Takeaways

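UniGen-1.5 is a unified multimodal large language model for advanced image understanding, generation, and editing.
It builds on UniGen and enhances both the model architecture and the training pipeline.
A unified reinforcement learning strategy improves image generation and image editing jointly through shared reward models.
A lightweight Edit Instruction Alignment stage significantly improves comprehension of editing instructions, which is essential for the success of RL training.
UniGen-1.5 shows competitive understanding and generation performance, with overall scores of 0.89 on GenEval and 4.31 on ImgEdit that surpass state-of-the-art models such as BAGEL.
Its performance is comparable to proprietary models such as GPT-Image-1.
By covering image understanding, generation, and editing in one model, UniGen-1.5 may be applicable across all three tasks.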

Vision Large Language Models Are Good Noise Handlers in Engagement Analysis

Authors:Alexander Vedernikov, Puneet Kumar, Haoyu Chen, Tapio Seppänen, Xiaobai Li

Engagement recognition in video datasets, unlike traditional image classification tasks, is particularly challenged by subjective labels and noise limiting model performance. To overcome the challenges of subjective and noisy engagement labels, we propose a framework leveraging Vision Large Language Models (VLMs) to refine annotations and guide the training process. Our framework uses a questionnaire to extract behavioral cues and split data into high- and low-reliability subsets. We also introduce a training strategy combining curriculum learning with soft label refinement, gradually incorporating ambiguous samples while adjusting supervision to reflect uncertainty. We demonstrate that classical computer vision models trained on refined high-reliability subsets and enhanced with our curriculum strategy show improvements, highlighting benefits of addressing label subjectivity with VLMs. This method surpasses prior state of the art across engagement benchmarks such as EngageNet (three of six feature settings, maximum improvement of +1.21%), and DREAMS / PAFE with F1 gains of +0.22 / +0.06.

Paper and project links

PDF

Summary

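Engagement recognition on video datasets, unlike traditional image classification, is hampered by subjective labels and noise that limit model performance. To address this, the authors propose a framework that uses Vision Large Language Models (VLMs) to refine annotations and guide training. The framework uses a questionnaire to extract behavioral cues and split the data into high- and low-reliability subsets. A training strategy combining curriculum learning with soft label refinement gradually incorporates ambiguous samples while adjusting supervision to reflect uncertainty. Classical computer vision models trained on the refined high-reliability subsets and enhanced with the curriculum strategy show improvements, highlighting the benefit of addressing label subjectivity with VLMs. The method surpasses the prior state of the art on engagement benchmarks: EngageNet (three of six feature settings, maximum improvement of +1.21%) and DREAMS / PAFE (F1 gains of +0.22 / +0.06).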

Key Takeaways

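Engagement recognition on video datasets is challenged by subjective labels and noise.
The framework uses Vision Large Language Models (VLMs) to refine annotations and guide training.
A questionnaire extracts behavioral cues and splits the data into high- and low-reliability subsets.
Curriculum learning is combined with soft label refinement to gradually incorporate ambiguous samples while adjusting supervision to reflect uncertainty.
Classical computer vision models trained on the refined high-reliability subsets show improvements.
The method surpasses the prior state of the art on EngageNet (three of six feature settings) and on DREAMS / PAFE.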
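
The combination of curriculum learning and soft label refinement can be pictured with a small sketch. The abstract gives no formulas, so everything here is an assumption made for illustration: the linear blend in `refine_label`, the linearly decaying reliability threshold in `curriculum_subset`, and the parameter names `reliability`, `start`, and `end` are hypothetical and not taken from the paper.

```python
from typing import List, Sequence


def refine_label(one_hot: Sequence[float], vlm_probs: Sequence[float],
                 reliability: float) -> List[float]:
    """Blend the annotated one-hot label with a VLM-suggested distribution.

    reliability = 1.0 keeps the original label; 0.0 trusts the VLM fully.
    (Hypothetical rule; the paper's refinement formula is not given.)
    """
    return [reliability * y + (1.0 - reliability) * p
            for y, p in zip(one_hot, vlm_probs)]


def curriculum_subset(reliabilities: Sequence[float], epoch: int,
                      total_epochs: int, start: float = 0.8,
                      end: float = 0.0) -> List[int]:
    """Return indices of the samples admitted at `epoch`.

    The reliability threshold decays linearly from `start` to `end`, so
    ambiguous (low-reliability) samples enter training gradually.
    """
    frac = epoch / max(total_epochs - 1, 1)
    threshold = start + (end - start) * frac
    return [i for i, r in enumerate(reliabilities) if r >= threshold]
```

With reliabilities `[0.9, 0.5, 0.1]` and three epochs, this schedule admits only the first sample at epoch 0 and all three by the last epoch, while `refine_label` softens the targets of the less reliable ones.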

When AI Democratizes Exploitation: LLM-Assisted Strategic Manipulation of Fair Division Algorithms

Authors:Priyanka Verma, Balagopal Unnikrishnan

Fair resource division algorithms, like those implemented in the Spliddit platform, have traditionally been considered difficult for end users to manipulate due to their complexity. This paper demonstrates how Large Language Models (LLMs) can dismantle these protective barriers by democratizing access to strategic expertise. Through empirical analysis of rent division scenarios on Spliddit algorithms, we show that users can obtain actionable manipulation strategies via simple conversational queries to AI assistants. We present four distinct manipulation scenarios: exclusionary collusion where majorities exploit minorities, defensive counterstrategies that backfire, benevolent subsidization of specific participants, and cost minimization coalitions. Our experiments reveal that LLMs can explain algorithmic mechanics, identify profitable deviations, and generate specific numerical inputs for coordinated preference misreporting: capabilities that previously required deep technical knowledge. These findings extend algorithmic collective action theory from classification contexts to resource allocation scenarios, where coordinated preference manipulation replaces feature manipulation. The implications reach beyond rent division to any domain using algorithmic fairness mechanisms for resource division. While AI-enabled manipulation poses risks to system integrity, it also creates opportunities for preferential treatment of equity-deserving groups. We argue that effective responses must combine algorithmic robustness, participatory design, and equitable access to AI capabilities, acknowledging that strategic sophistication is no longer a scarce resource.

Paper and project links

PDF submitted to NeurIPS 2025 workshop on Algorithmic Collective Action

Summary

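This paper shows how large language models (LLMs) can dismantle the complexity barrier that has traditionally made fair division algorithms hard for end users to manipulate: users can obtain actionable manipulation strategies through simple conversational queries to AI assistants. Through empirical analysis of rent division scenarios on Spliddit algorithms, it presents four distinct manipulation scenarios and shows that LLMs can explain algorithmic mechanics, identify profitable deviations, and generate specific numerical inputs for coordinated preference misreporting. These findings extend algorithmic collective action theory from classification contexts to resource allocation, and AI-enabled manipulation brings both risks to system integrity and opportunities for equity-deserving groups. The paper argues that effective responses must combine algorithmic robustness, participatory design, and equitable access to AI capabilities.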

Key Takeaways

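LLMs can dismantle the complexity barrier around fair division algorithms, making them easier for end users to manipulate.
Users can obtain actionable manipulation strategies through simple conversational queries.
Four manipulation scenarios are presented: exclusionary collusion where majorities exploit minorities, defensive counterstrategies that backfire, benevolent subsidization of specific participants, and cost minimization coalitions.
LLMs can explain algorithmic mechanics, identify profitable deviations, and generate specific numerical inputs for coordinated preference misreporting.
The findings extend algorithmic collective action theory to resource allocation, where coordinated preference manipulation replaces feature manipulation.
AI-enabled manipulation poses risks to system integrity but also creates opportunities for equity-deserving groups.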

Encoding and Understanding Astrophysical Information in Large Language Model-Generated Summaries

Authors:Kiera McCormick, Rafael Martínez-Galarza

Large Language Models have demonstrated the ability to generalize well at many levels across domains and modalities, and have even shown in-context learning capabilities. This raises research questions about how they can be used to encode physical information that is usually available only from scientific measurements and is only loosely encoded in textual descriptions. Using astrophysics as a test bed, we investigate if LLM embeddings can codify physical summary statistics that are obtained from scientific measurements through two main questions: 1) Does prompting play a role in how those quantities are codified by the LLM? and 2) What aspects of language are most important in encoding the physics represented by the measurement? We investigate this using sparse autoencoders that extract interpretable features from the text.

Paper and project links

PDF Accepted to the Machine Learning and the Physical Sciences Workshop at NeurIPS 2025, 11 pages, 4 figures

Summary

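Large language models generalize across domains and modalities and show in-context learning capabilities. This study asks how they can be used to encode physical information that is usually available only from scientific measurements and is only loosely encoded in textual descriptions. Using astrophysics as a test bed, it investigates whether LLM embeddings can codify physical summary statistics through two questions: whether prompting affects how those quantities are codified, and which aspects of language matter most in encoding the physics represented by the measurement. Sparse autoencoders are used to extract interpretable features from the text.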

Key Takeaways

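Large language models generalize across domains and modalities.
The study explores whether LLM embeddings can encode physical information that is normally obtained from scientific measurements and only loosely captured in text.
Astrophysics is the test bed, and two questions drive the study: the effect of prompting on how physical quantities are codified, and which aspects of language matter most.
Sparse autoencoders extract interpretable features from the text, providing a tool for understanding how the model encodes physical information.
The study asks which language features carry the physics represented by a measurement.
The work is relevant to using LLM embeddings to represent scientific measurements that are otherwise only loosely described in text.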
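
As a rough picture of the analysis tool mentioned above, the following is a minimal sketch of a generic sparse autoencoder forward pass (ReLU encoder, linear decoder, L1 sparsity penalty) in plain Python. The paper's actual architecture, dimensions, and loss weights are not stated in the abstract, so the function names, the `l1` coefficient, and the toy weights are all assumptions.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, v: Vector) -> Vector:
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sae_forward(x: Vector, W_enc: Matrix, b_enc: Vector,
                W_dec: Matrix, b_dec: Vector,
                l1: float = 0.01) -> Tuple[Vector, Vector, float]:
    """Return (latent code, reconstruction, loss) for one embedding x.

    Loss = squared reconstruction error + l1 * sum(|latent|); the L1 term
    pushes most latent features to zero, which is what makes the few
    active ones easier to interpret.
    """
    pre = [a + b for a, b in zip(matvec(W_enc, x), b_enc)]
    z = [max(0.0, p) for p in pre]                      # ReLU encoder
    x_hat = [a + b for a, b in zip(matvec(W_dec, z), b_dec)]
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = l1 * sum(abs(a) for a in z)
    return z, x_hat, recon + sparsity
```

In practice the input `x` would be an LLM embedding of a generated summary and the weights would be learned; here they are left as explicit arguments so the computation is easy to follow.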

SMRC: Aligning Large Language Models with Student Reasoning for Mathematical Error Correction

Authors:Biaojie Zeng, Min Zhang, Juan Zhou, Fengrui Liu, Ruiyang Huang, Xin Lin

Large language models (LLMs) often make reasoning errors when solving mathematical problems, and how to automatically detect and correct these errors has become an important research direction. However, existing approaches mainly focus on self-correction within the model, which falls short of the teacher-style correction required in educational settings, i.e., systematically guiding and revising a student’s problem-solving process. To address this gap, we propose SMRC (Student Mathematical Reasoning Correction), a novel method that aligns LLMs with student reasoning. Specifically, SMRC formulates student reasoning as a multi-step sequential decision problem and introduces Monte Carlo Tree Search (MCTS) to explore optimal correction paths. To reduce the cost of annotating process-level rewards, we leverage breadth-first search (BFS) guided by LLMs and final-answer evaluation to generate reward signals, which are then distributed across intermediate reasoning steps via a back-propagation mechanism, enabling fine-grained process supervision. Additionally, we construct a benchmark for high school mathematics, MSEB (Multi-Solution Error Benchmark), consisting of 158 instances that include problem statements, student solutions, and correct reasoning steps. We further propose a dual evaluation protocol centered on solution accuracy and correct-step retention, offering a comprehensive measure of educational applicability. Experiments demonstrate that SMRC significantly outperforms existing methods on two public datasets (ProcessBench and MR-GSM8K) and our MSEB in terms of effectiveness and overall performance. The code and data are available at https://github.com/Mind-Lab-ECNU/SMRC.

Paper and project links

PDF 13 pages, 3 figures

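Summary

LLMs often make reasoning errors when solving mathematical problems, and automatically detecting and correcting these errors has become an important research direction. Existing approaches mainly focus on self-correction within the model, which falls short of the teacher-style correction needed in education, that is, systematically guiding and revising a student's problem-solving process. To close this gap, the authors propose SMRC, which aligns LLMs with student reasoning. SMRC formulates student reasoning as a multi-step sequential decision problem and uses Monte Carlo Tree Search (MCTS) to explore optimal correction paths. To reduce the cost of annotating process-level rewards, it uses LLM-guided breadth-first search (BFS) and final-answer evaluation to generate reward signals, which are distributed across intermediate reasoning steps through back-propagation, enabling fine-grained process supervision. The authors also build MSEB (Multi-Solution Error Benchmark), a high school mathematics benchmark of 158 instances containing problem statements, student solutions, and correct reasoning steps, and propose a dual evaluation protocol centered on solution accuracy and correct-step retention. Experiments show that SMRC significantly outperforms existing methods on two public datasets (ProcessBench and MR-GSM8K) and on MSEB. Code and data are available at https://github.com/Mind-Lab-ECNU/SMRC.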

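Key Takeaways

LLMs make reasoning errors when solving mathematical problems, and automatically detecting and correcting them is an important research direction.
Existing methods mainly focus on self-correction within the model and lack the teacher-style correction needed in educational settings.
SMRC formulates student reasoning as a multi-step sequential decision problem and uses Monte Carlo Tree Search to explore optimal correction paths.
SMRC uses LLM-guided breadth-first search and final-answer evaluation to generate reward signals, which are back-propagated to intermediate steps for fine-grained process supervision.
The authors build MSEB, a high school mathematics benchmark of 158 instances, evaluated with a dual protocol of solution accuracy and correct-step retention.
SMRC significantly outperforms existing methods on ProcessBench, MR-GSM8K, and MSEB.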
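
The idea of distributing a final-answer reward across intermediate steps can be illustrated with the standard MCTS backup step. This is a generic sketch, not the authors' implementation (their code is in the repository linked above): the `Node` class, the `backpropagate` function, and the 0/1 reward convention are assumptions for illustration.

```python
from typing import List, Optional


class Node:
    """One reasoning step in a search tree of candidate corrections."""

    def __init__(self, step: str, parent: Optional["Node"] = None) -> None:
        self.step = step
        self.parent = parent
        self.children: List["Node"] = []
        self.visits = 0
        self.value_sum = 0.0

    def add_child(self, step: str) -> "Node":
        child = Node(step, parent=self)
        self.children.append(child)
        return child

    @property
    def value(self) -> float:
        """Mean reward of all rollouts that passed through this step."""
        return self.value_sum / self.visits if self.visits else 0.0


def backpropagate(leaf: Node, reward: float) -> None:
    """Credit every step on the path from `leaf` to the root with `reward`.

    `reward` is assumed to come from final-answer evaluation
    (e.g. 1.0 if the final answer is correct, else 0.0).
    """
    node: Optional[Node] = leaf
    while node is not None:
        node.visits += 1
        node.value_sum += reward
        node = node.parent
```

After several rollouts, each intermediate step carries an averaged value that can serve as a process-level signal without per-step human annotation.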

AutoTool: Efficient Tool Selection for Large Language Model Agents

Authors:Jingyi Jia, Qinbin Li

Large Language Model (LLM) agents have emerged as powerful tools for automating complex tasks by leveraging the reasoning and decision-making abilities of LLMs. However, a major bottleneck in current agent frameworks lies in the high inference cost of tool selection, especially in approaches like ReAct that repeatedly invoke the LLM to determine which tool to use at each step. In this work, we propose AutoTool, a novel graph-based framework that bypasses repeated LLM inference by exploiting a key empirical observation: tool usage inertia - the tendency of tool invocations to follow predictable sequential patterns. AutoTool constructs a directed graph from historical agent trajectories, where nodes represent tools and edges capture transition probabilities, effectively modeling the inertia in tool selection. It further integrates parameter-level information to refine tool input generation. By traversing this structured representation, AutoTool efficiently selects tools and their parameters with minimal reliance on LLM inference. Extensive experiments across diverse agent tasks demonstrate that AutoTool reduces inference costs by up to 30% while maintaining competitive task completion rates, offering a practical and scalable enhancement for inference-heavy frameworks. Our work highlights the promise of integrating statistical structure into LLM agent design for greater efficiency without sacrificing performance.

Paper and project links

PDF Accepted by AAAI 2026, 18 pages, 11 figures, Code: https://github.com/jiajingyyyyyy/AutoTool

Summary

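LLM agents have become powerful tools for automating complex tasks by leveraging the reasoning and decision-making abilities of LLMs. A major bottleneck in current agent frameworks is the high inference cost of tool selection, especially in approaches like ReAct that repeatedly invoke the LLM to decide which tool to use at each step. The paper proposes AutoTool, a graph-based framework that bypasses repeated LLM inference by exploiting tool usage inertia, the tendency of tool invocations to follow predictable sequential patterns. AutoTool builds a directed graph from historical agent trajectories, where nodes represent tools and edges capture transition probabilities, and integrates parameter-level information to refine tool input generation. By traversing this structure, it selects tools and their parameters with minimal reliance on LLM inference. Experiments across diverse agent tasks show that AutoTool reduces inference costs by up to 30% while maintaining competitive task completion rates.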

Key Takeaways

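LLM agents are powerful tools for automating complex tasks, but tool selection carries a high inference cost.
Approaches such as ReAct repeatedly invoke the LLM to decide which tool to use at each step.
AutoTool is a graph-based framework that bypasses repeated LLM inference by exploiting tool usage inertia, the predictable sequential patterns in tool invocations.
AutoTool builds a directed graph from historical agent trajectories, with tools as nodes and transition probabilities on edges, and adds parameter-level information to refine tool inputs.
By traversing this graph, AutoTool selects tools and parameters with minimal reliance on LLM inference.
Experiments show inference cost reductions of up to 30% while maintaining competitive task completion rates.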
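
The graph construction described above (tools as nodes, transition probabilities on edges) can be sketched in a few lines. This is an illustrative approximation rather than AutoTool's actual code: the function names, the `min_prob` confidence threshold, and the convention that `None` means "fall back to the LLM" are assumptions, and parameter-level information is omitted.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional

Graph = Dict[str, Dict[str, float]]


def build_transition_graph(trajectories: List[List[str]]) -> Graph:
    """Estimate P(next tool | current tool) from historical tool sequences."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for trajectory in trajectories:
        for prev, nxt in zip(trajectory, trajectory[1:]):
            counts[prev][nxt] += 1
    graph: Graph = {}
    for prev, counter in counts.items():
        total = sum(counter.values())
        graph[prev] = {nxt: n / total for nxt, n in counter.items()}
    return graph


def next_tool(graph: Graph, current: str,
              min_prob: float = 0.6) -> Optional[str]:
    """Pick the most likely next tool, or None to defer to the LLM.

    `min_prob` is a hypothetical confidence threshold: when no transition
    is dominant, the graph is not trusted and the LLM decides instead.
    """
    edges = graph.get(current)
    if not edges:
        return None
    tool, prob = max(edges.items(), key=lambda kv: kv[1])
    return tool if prob >= min_prob else None
```

For example, if two of three past trajectories go from `search` to `open`, the graph proposes `open` after `search` without an LLM call, while an evenly split transition returns `None` and leaves the decision to the LLM.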

A Specialized Large Language Model for Clinical Reasoning and Diagnosis in Rare Diseases

Authors:Tao Yang, Dandan Huang, Yunting Lin, Pengfei Wu, Zhikun Wu, Gangyuan Ma, Yulan Lu, Xinran Dong, Dingpeng Li, Junshuang Ge, Zhiyan Zhang, Xuanzhao Huang, Wenyan Nong, Yao Zhou, Hui Tang, Hongxi Yang, Shijie Zhang, Juan Li, Xiaojun Cao, Lin Yang, Xia Gao, Kaishou Xu, Xiaoqiong Gu, Wen Zhang, Huimin Xia, Li Liu, Wenhao Zhou, Mulin Jun Li

Rare diseases affect hundreds of millions worldwide, yet diagnosis often spans years. Conventional pipelines decouple noisy evidence extraction from downstream inferential diagnosis, and general/medical large language models (LLMs) face scarce real-world electronic health records (EHRs), stale domain knowledge, and hallucinations. We assemble a large, domain-specialized clinical corpus and a clinician-validated reasoning set, and develop RareSeek R1 via staged instruction tuning, chain-of-thought learning, and graph-grounded retrieval. Across multicenter EHR narratives and public benchmarks, RareSeek R1 attains state-of-the-art accuracy, robust generalization, and stability under noisy or overlapping phenotypes. Augmented retrieval yields the largest gains when narratives pair with prioritized variants by resolving ambiguity and aligning candidates to mechanisms. Human studies show performance on par with experienced physicians and consistent gains in assistive use. Notably, transparent reasoning highlights decisive non-phenotypic evidence (median 23.1%, such as imaging, interventions, functional tests) underpinning many correct diagnoses. This work advances a narrative-first, knowledge-integrated reasoning paradigm that shortens the diagnostic odyssey and enables auditable, clinically translatable decision support.

Paper and project links

PDF 50 pages, 5 figures

Summary

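This paper introduces RareSeek R1, a specialized model for rare disease diagnosis developed from a large, domain-specialized clinical corpus and a clinician-validated reasoning set via staged instruction tuning, chain-of-thought learning, and graph-grounded retrieval. Across multicenter EHR narratives and public benchmarks, it attains state-of-the-art accuracy, robust generalization, and stability under noisy or overlapping phenotypes. Human studies show performance on par with experienced physicians and consistent gains in assistive use. Its transparent reasoning highlights decisive non-phenotypic evidence (median 23.1%, such as imaging, interventions, and functional tests) underpinning many correct diagnoses. Overall, the work advances a narrative-first, knowledge-integrated reasoning paradigm that shortens the diagnostic odyssey and enables auditable, clinically translatable decision support.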

Key Takeaways

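RareSeek R1 targets rare disease diagnosis, reasoning over electronic health record (EHR) narratives.
It is developed from a large, domain-specialized clinical corpus and a clinician-validated reasoning set.
Training combines staged instruction tuning, chain-of-thought learning, and graph-grounded retrieval.
Across multicenter EHR narratives and public benchmarks, RareSeek R1 attains state-of-the-art accuracy and remains stable under noisy or overlapping phenotypes.
Human studies show performance on par with experienced physicians and consistent gains in assistive use.
Transparent reasoning highlights decisive non-phenotypic evidence (median 23.1%) underpinning many correct diagnoses.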

Failure to Mix: Large language models struggle to answer according to desired probability distributions

Authors:Ivy Yuqian Yang, David Yu Zhang

Scientific idea generation and selection require exploration following a target probability distribution. In contrast, current AI benchmarks have objectively correct answers, and training large language models (LLMs) via reinforcement learning against these benchmarks discourages probabilistic exploration. Here, we conducted systematic experiments requesting LLMs to produce outputs following simple probabilistic distributions, and found that all modern LLMs tested grossly fail to follow the distributions. For example, requesting a binary output of “1” 49% of the time produces an answer of “0” nearly 100% of the time. This step function-like behavior of near-exclusively generating the output with marginally highest probability overrules even strong built-in LLM biases.

Paper and project links

PDF 13 pages, 6 figures. Code and reproducibility package: https://github.com/BiostateAIresearch/failure-to-mix

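Summary

The paper argues that scientific idea generation and selection require probabilistic exploration, which current AI benchmarks with objectively correct answers, and reinforcement learning against them, discourage. In systematic experiments, all modern LLMs tested grossly fail to follow simple requested probability distributions: they almost exclusively produce the output with the marginally highest probability. For example, asking for "1" 49% of the time yields "0" nearly 100% of the time.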

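Key Takeaways

Scientific idea generation and selection require exploration that follows a target probability distribution.
Current AI benchmarks have objectively correct answers, and reinforcement learning against them discourages probabilistic exploration.
All modern LLMs tested grossly fail to follow simple requested probability distributions.
LLMs almost exclusively generate the output with the marginally highest probability, a step function-like behavior.
This behavior overrules even strong built-in LLM biases.
Because scientific idea generation depends on probabilistic exploration, this failure may limit LLMs in such settings.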
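
The gap between what is requested and what the paper reports can be made concrete with a toy comparison. The sketch below contrasts an ideal Bernoulli sampler with a step-function responder that always emits the marginally more likely symbol; both functions are hypothetical stand-ins written for illustration and do not call any LLM or reproduce the authors' experimental code.

```python
import random
from typing import List


def ideal_sampler(p_one: float, n: int, seed: int = 0) -> List[int]:
    """Emit 1 with probability p_one, independently for each of n requests."""
    rng = random.Random(seed)
    return [1 if rng.random() < p_one else 0 for _ in range(n)]


def step_like_responder(p_one: float, n: int) -> List[int]:
    """Always emit whichever symbol is (even marginally) more likely.

    This mimics the step function-like behavior described in the abstract;
    ties at exactly 0.5 are resolved to 0 here by assumption.
    """
    return [1 if p_one > 0.5 else 0] * n


def frequency_of_one(outputs: List[int]) -> float:
    """Fraction of outputs equal to 1."""
    return sum(outputs) / len(outputs)
```

For a requested rate of 0.49, the ideal sampler's empirical frequency approaches 0.49 as the number of requests grows, while the step-like responder returns 0 every time, which is the pattern the abstract describes.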

O3SLM: Open Weight, Open Data, and Open Vocabulary Sketch-Language Model

Authors:Rishi Gupta, Mukilan Karuppasamy, Shyam Marjit, Aditay Tripathi, Anirban Chakraborty

While Large Vision Language Models (LVLMs) are increasingly deployed in real-world applications, their ability to interpret abstract visual inputs remains limited. Specifically, they struggle to comprehend hand-drawn sketches, a modality that offers an intuitive means of expressing concepts that are difficult to describe textually. We identify the primary bottleneck as the absence of a large-scale dataset that jointly models sketches, photorealistic images, and corresponding natural language instructions. To address this, we present two key contributions: (1) a new, large-scale dataset of image-sketch-instruction triplets designed to facilitate both pretraining and instruction tuning, and (2) O3SLM, an LVLM trained on this dataset. Comprehensive evaluations on multiple sketch-based tasks, namely (a) object localization, (b) counting, (c) image retrieval (SBIR and fine-grained SBIR), and (d) visual question answering (VQA), incorporating the three existing sketch datasets (QuickDraw!, Sketchy, and Tu Berlin) along with our generated SketchVCL dataset, show that O3SLM achieves state-of-the-art performance, substantially outperforming existing LVLMs in sketch comprehension and reasoning.

Paper and project links

PDF Accepted to AAAI 2026

Summary

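Large Vision Language Models (LVLMs) remain limited in interpreting abstract visual inputs, particularly hand-drawn sketches. The paper identifies the main bottleneck as the absence of a large-scale dataset that jointly models sketches, photorealistic images, and corresponding natural language instructions. It makes two contributions: a new large-scale dataset of image-sketch-instruction triplets for pretraining and instruction tuning, and O3SLM, an LVLM trained on that dataset. Evaluations on object localization, counting, image retrieval (SBIR and fine-grained SBIR), and visual question answering show that O3SLM achieves state-of-the-art performance and substantially outperforms existing LVLMs in sketch comprehension and reasoning.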

Key Takeaways

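LVLMs are limited in interpreting abstract visual inputs, especially hand-drawn sketches.
The main bottleneck is the absence of a large-scale dataset that jointly models sketches, photorealistic images, and natural language instructions.
The authors introduce a new large-scale dataset of image-sketch-instruction triplets.
O3SLM is an LVLM trained on this dataset to improve sketch comprehension.
Evaluations draw on QuickDraw!, Sketchy, and Tu Berlin along with the authors' generated SketchVCL dataset.
O3SLM achieves state-of-the-art performance on object localization, counting, image retrieval, and visual question answering, substantially outperforming existing LVLMs.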

PathMind: A Retrieve-Prioritize-Reason Framework for Knowledge Graph Reasoning with Large Language Models

Authors:Yu Liu, Xixun Lin, Yanmin Shang, Yangxi Li, Shi Wang, Yanan Cao

Knowledge graph reasoning (KGR) is the task of inferring new knowledge by performing logical deductions on knowledge graphs. Recently, large language models (LLMs) have demonstrated remarkable performance in complex reasoning tasks. Despite promising success, current LLM-based KGR methods still face two critical limitations. First, existing methods often extract reasoning paths indiscriminately, without assessing their different importance, which may introduce irrelevant noise that misleads LLMs. Second, while many methods leverage LLMs to dynamically explore potential reasoning paths, they require high retrieval demands and frequent LLM calls. To address these limitations, we propose PathMind, a novel framework designed to enhance faithful and interpretable reasoning by selectively guiding LLMs with important reasoning paths. Specifically, PathMind follows a “Retrieve-Prioritize-Reason” paradigm. First, it retrieves a query subgraph from KG through the retrieval module. Next, it introduces a path prioritization mechanism that identifies important reasoning paths using a semantic-aware path priority function, which simultaneously considers the accumulative cost and the estimated future cost for reaching the target. Finally, PathMind generates accurate and logically consistent responses via a dual-phase training strategy, including task-specific instruction tuning and path-wise preference alignment. Extensive experiments on benchmark datasets demonstrate that PathMind consistently outperforms competitive baselines, particularly on complex reasoning tasks with fewer input tokens, by identifying essential reasoning paths.

Paper and project links

PDF AAAI 2026, Long Paper, Oral

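Summary

Knowledge graph reasoning (KGR) infers new knowledge through logical deduction over knowledge graphs. Although LLMs perform remarkably well on complex reasoning tasks, LLM-based KGR methods face two key limitations: they often extract reasoning paths indiscriminately without assessing their importance, which can introduce irrelevant noise that misleads the LLM, and methods that use LLMs to explore paths dynamically incur high retrieval demands and frequent LLM calls. PathMind addresses both by selectively guiding LLMs with important reasoning paths under a "Retrieve-Prioritize-Reason" paradigm. It retrieves a query subgraph from the KG, identifies important reasoning paths with a semantic-aware path priority function that considers both the accumulated cost and the estimated future cost of reaching the target, and generates accurate, logically consistent responses through a dual-phase training strategy of task-specific instruction tuning and path-wise preference alignment. Extensive experiments on benchmark datasets show that PathMind consistently outperforms competitive baselines, particularly on complex reasoning tasks, while using fewer input tokens.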

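Key Takeaways

Knowledge graph reasoning (KGR) infers new knowledge through logical deduction over knowledge graphs.
Current LLM-based KGR methods extract reasoning paths without assessing their importance and require high retrieval demands and frequent LLM calls.
PathMind improves faithful and interpretable reasoning by selectively guiding LLMs with important reasoning paths.
PathMind follows a "Retrieve-Prioritize-Reason" paradigm.
A semantic-aware path priority function identifies important paths by considering both accumulated cost and estimated future cost.
A dual-phase training strategy combines task-specific instruction tuning with path-wise preference alignment.
On benchmark datasets, PathMind consistently outperforms competitive baselines, particularly on complex reasoning tasks with fewer input tokens.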
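
A priority that combines accumulated cost with an estimated future cost has the same shape as the classic A* score f = g + h. The sketch below uses that A*-style rule as a stand-in for path prioritization on a toy graph; it is not PathMind's semantic-aware function, and the graph format, the `heuristic` callable, `top_k`, and `max_hops` are assumptions introduced for illustration.

```python
import heapq
from typing import Callable, Dict, List, Tuple

# Hypothetical graph format: {entity: [(relation, neighbor, edge_cost), ...]}
KG = Dict[str, List[Tuple[str, str, float]]]


def prioritized_paths(graph: KG, start: str, target: str,
                      heuristic: Callable[[str], float],
                      top_k: int = 2, max_hops: int = 4
                      ) -> List[Tuple[float, List[str]]]:
    """Return up to top_k (cost, entity path) pairs from start to target.

    Paths are expanded in order of priority = accumulated cost + estimated
    future cost, so promising paths are explored before the rest.
    Relations are ignored when recording the path to keep the sketch short.
    """
    frontier: List[Tuple[float, float, List[str]]] = [
        (heuristic(start), 0.0, [start])
    ]
    found: List[Tuple[float, List[str]]] = []
    while frontier and len(found) < top_k:
        _, cost, path = heapq.heappop(frontier)
        node = path[-1]
        if node == target:
            found.append((cost, path))
            continue
        if len(path) - 1 >= max_hops:
            continue  # hop budget exhausted for this path
        for _relation, neighbor, edge_cost in graph.get(node, []):
            if neighbor in path:
                continue  # avoid cycles
            new_cost = cost + edge_cost
            heapq.heappush(
                frontier,
                (new_cost + heuristic(neighbor), new_cost, path + [neighbor]),
            )
    return found
```

Only the few highest-priority paths would then be passed to the LLM, which is one way to keep the prompt short while retaining the most relevant evidence.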

MalRAG: A Retrieval-Augmented LLM Framework for Open-set Malicious Traffic Identification

Authors:Xiang Luo, Chang Liu, Gang Xiong, Chen Yang, Gaopeng Gou, Yaochen Ren, Zhen Li

Fine-grained identification of IDS-flagged suspicious traffic is crucial in cybersecurity. In practice, cyber threats evolve continuously, making the discovery of novel malicious traffic a critical necessity as well as the identification of known classes. Recent studies have advanced this goal with deep models, but they often rely on task-specific architectures that limit transferability and require per-dataset tuning. In this paper we introduce MalRAG, the first LLM-driven retrieval-augmented framework for open-set malicious traffic identification. MalRAG freezes the LLM and operates via comprehensive traffic knowledge construction, adaptive retrieval, and prompt engineering. Concretely, we construct a multi-view traffic database by mining prior malicious traffic from content, structural, and temporal perspectives. Furthermore, we introduce a Coverage-Enhanced Retrieval Algorithm that queries across these views to assemble the most probable candidates, thereby improving the inclusion of correct evidence. We then employ Traffic-Aware Adaptive Pruning to select a variable subset of these candidates based on traffic-aware similarity scores, suppressing incorrect matches and yielding reliable retrieved evidence. Moreover, we develop a suite of guidance prompts where task instruction, evidence referencing, and decision guidance are integrated with the retrieved evidence to improve LLM performance. Across diverse real-world datasets and settings, MalRAG delivers state-of-the-art results in both fine-grained identification of known classes and novel malicious traffic discovery. Ablation and deep-dive analyses further show that MalRAG effectively leverages LLM capabilities yet achieves open-set malicious traffic identification without relying on a specific LLM.

Paper and project links

PDF 13 pages, 13 figures. Intended for submission to IEEE Transactions on Information Forensics and Security (TIFS)

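Summary

Fine-grained identification of IDS-flagged suspicious traffic is crucial in cybersecurity, and because threats evolve continuously, discovering novel malicious traffic matters as much as identifying known classes. Prior deep models often rely on task-specific architectures that limit transferability and require per-dataset tuning. The paper introduces MalRAG, the first LLM-driven retrieval-augmented framework for open-set malicious traffic identification. MalRAG keeps the LLM frozen, builds a multi-view traffic database from content, structural, and temporal perspectives, retrieves candidates with a Coverage-Enhanced Retrieval Algorithm, filters them with Traffic-Aware Adaptive Pruning, and feeds the evidence to the LLM through guidance prompts. Across diverse real-world datasets and settings, MalRAG delivers state-of-the-art results in both fine-grained identification of known classes and novel malicious traffic discovery.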

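Key Takeaways

Fine-grained identification of IDS-flagged suspicious traffic is crucial in cybersecurity.
Both the discovery of novel malicious traffic and the identification of known classes are essential.
MalRAG is the first LLM-driven retrieval-augmented framework for open-set malicious traffic identification.
MalRAG keeps the LLM frozen and relies on a multi-view traffic database, a Coverage-Enhanced Retrieval Algorithm, and Traffic-Aware Adaptive Pruning.
Across diverse real-world datasets and settings, MalRAG delivers state-of-the-art results in known-class identification and novel traffic discovery.
Guidance prompts integrate task instruction, evidence referencing, and decision guidance with the retrieved evidence to improve LLM performance.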
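
The retrieve-then-prune flow can be sketched with plain dictionaries of similarity scores. This is a simplified illustration under stated assumptions, not the paper's algorithms: taking the union of per-view top-k candidates, keeping candidates within a fixed `ratio` of the best score, and treating a best score below `floor` as "no reliable match" are all hypothetical choices, as are the function names.

```python
from typing import Dict, List, Tuple

ViewScores = Dict[str, Dict[str, float]]  # {view: {candidate_class: similarity}}


def coverage_retrieve(view_scores: ViewScores,
                      per_view_k: int = 2) -> Dict[str, float]:
    """Union the top-k candidates of every view, keeping each one's best score.

    Querying several views raises the chance that the correct class
    appears among the candidates at all.
    """
    merged: Dict[str, float] = {}
    for scores in view_scores.values():
        top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        for candidate, score in top[:per_view_k]:
            merged[candidate] = max(score, merged.get(candidate, float("-inf")))
    return merged


def adaptive_prune(candidates: Dict[str, float], ratio: float = 0.8,
                   floor: float = 0.5) -> List[Tuple[str, float]]:
    """Keep a variable number of candidates close to the best match.

    Returns an empty list when even the best score is below `floor`,
    which this sketch interprets as a possible novel (unseen) class.
    """
    if not candidates:
        return []
    best = max(candidates.values())
    if best < floor:
        return []
    kept = [(c, s) for c, s in candidates.items() if s >= ratio * best]
    return sorted(kept, key=lambda kv: kv[1], reverse=True)
```

The surviving candidates would then be written into the prompt as evidence for a frozen LLM, so the number of evidence items varies with how confident the retrieval is.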

HiEAG: Evidence-Augmented Generation for Out-of-Context Misinformation Detection

Authors:Junjie Wu, Yumeng Fu, Nan Yu, Guohong Fu

Recent advancements in multimodal out-of-context (OOC) misinformation detection have made remarkable progress in checking the consistencies between different modalities for supporting or refuting image-text pairs. However, existing OOC misinformation detection methods tend to emphasize the role of internal consistency, ignoring the significance of external consistency between image-text pairs and external evidence. In this paper, we propose HiEAG, a novel Hierarchical Evidence-Augmented Generation framework to refine external consistency checking through leveraging the extensive knowledge of multimodal large language models (MLLMs). Our approach decomposes external consistency checking into a comprehensive engine pipeline, which integrates reranking and rewriting, apart from retrieval. The evidence reranking module utilizes Automatic Evidence Selection Prompting (AESP) that acquires the relevant evidence item from the products of evidence retrieval. Subsequently, the evidence rewriting module leverages Automatic Evidence Generation Prompting (AEGP) to improve task adaptation on MLLM-based OOC misinformation detectors. Furthermore, our approach enables explanation for judgment, and achieves impressive performance with instruction tuning. Experimental results on different benchmark datasets demonstrate that our proposed HiEAG surpasses previous state-of-the-art (SOTA) methods in accuracy over all samples.

Paper and project links

PDF

Summary

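Multimodal out-of-context (OOC) misinformation detection has made notable progress in checking consistency between modalities, but existing methods emphasize internal consistency and overlook the external consistency between image-text pairs and external evidence. The paper proposes HiEAG, a Hierarchical Evidence-Augmented Generation framework that refines external consistency checking using the knowledge of multimodal large language models (MLLMs). HiEAG decomposes external consistency checking into a pipeline that adds reranking and rewriting to retrieval: the reranking module uses Automatic Evidence Selection Prompting (AESP) to pick the relevant evidence item, and the rewriting module uses Automatic Evidence Generation Prompting (AEGP) to improve task adaptation for MLLM-based OOC misinformation detectors. The approach also provides explanations for its judgments, and experiments on different benchmark datasets show that HiEAG surpasses previous state-of-the-art methods in accuracy over all samples.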

Key Takeaways

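Existing OOC misinformation detection methods emphasize internal consistency and overlook consistency with external evidence.
HiEAG uses the knowledge of MLLMs to refine external consistency checking.
HiEAG decomposes external consistency checking into a pipeline of retrieval, reranking, and rewriting.
The evidence reranking module selects the relevant evidence item with AESP.
The evidence rewriting module improves task adaptation with AEGP.
HiEAG explains its judgments and surpasses previous state-of-the-art methods in accuracy over all samples.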

VULPO: Context-Aware Vulnerability Detection via On-Policy LLM Optimization

Authors:Youpeng Li, Fuxun Yu, Xinda Wang

The widespread reliance on open-source software dramatically increases the risk of vulnerability exploitation, underscoring the need for effective and scalable vulnerability detection (VD). Existing VD techniques, whether traditional machine learning-based or LLM-based approaches like prompt engineering, supervised fine-tuning, or off-policy preference optimization, remain fundamentally limited in their ability to perform context-aware analysis: they depend on fixed inputs or static preference datasets, cannot adaptively explore repository-level dependencies, and are constrained by function-level benchmarks that overlook critical vulnerability context. This paper introduces Vulnerability-Adaptive Policy Optimization (VULPO), an on-policy LLM reinforcement learning framework for context-aware VD. To support training and evaluation, we first construct ContextVul, a new dataset that augments high-quality function-level samples with a lightweight method to extract repository-level context information. We then design multi-dimensional reward structuring that jointly captures prediction correctness, vulnerability localization accuracy, and the semantic relevance of vulnerability analysis, thereby guiding the model toward comprehensive contextual reasoning. To address the asymmetric difficulty of different vulnerability cases and mitigate reward hacking, VULPO incorporates label-level and sample-level difficulty-adaptive reward scaling, encouraging the model to explore challenging cases while maintaining balanced reward distribution. Extensive experiments demonstrate the superiority of our VULPO framework in context-aware VD: our VULPO-4B substantially outperforms existing VD baselines based on prompt engineering and off-policy optimization, improving F1 by 85% over Qwen3-4B and achieving performance comparable to a 150x larger-scale model, DeepSeek-R1-0528.


Paper and project links

PDF

Summary

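Widespread reliance on open-source software raises the risk of vulnerability exploitation and underscores the need for effective, scalable vulnerability detection (VD). Existing VD techniques are limited in their ability to perform context-aware analysis. The paper proposes Vulnerability-Adaptive Policy Optimization (VULPO), an on-policy LLM reinforcement learning framework for context-aware VD. To support training and evaluation, the authors build the ContextVul dataset, design a multi-dimensional reward structure, and introduce label-level and sample-level difficulty-adaptive reward scaling. In the experiments, VULPO-4B improves F1 by 85% over Qwen3-4B and performs comparably to the 150x larger DeepSeek-R1-0528.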

Key Takeaways

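Widespread use of open-source software increases vulnerability risk and calls for effective, scalable vulnerability detection (VD).
Existing VD techniques are limited in context-aware analysis: they depend on fixed inputs or static preference datasets and cannot adaptively explore repository-level dependencies.
The paper introduces Vulnerability-Adaptive Policy Optimization (VULPO), an on-policy LLM reinforcement learning framework for context-aware VD.
VULPO is trained and evaluated on ContextVul, a dataset that augments function-level samples with repository-level context information.
VULPO uses a multi-dimensional reward structure that jointly captures prediction correctness, vulnerability localization accuracy, and semantic relevance.
VULPO adds label-level and sample-level difficulty-adaptive reward scaling to handle cases of differing difficulty and keep the reward distribution balanced.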
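
The reward design described above can be pictured with a small sketch. The digest does not give the paper's formulas, so everything below is an assumption made for illustration: the weights, the label multipliers, the `ALPHA` constant, and the choice to measure sample difficulty as the fraction of wrong rollouts in a group are all hypothetical, not VULPO's actual definitions.

```python
from dataclasses import dataclass

# Hypothetical constants, chosen only for illustration (not from the paper).
WEIGHTS = {"correct": 0.5, "localization": 0.3, "semantic": 0.2}
LABEL_SCALE = {"vulnerable": 1.2, "benign": 1.0}  # label-level scaling
ALPHA = 0.5  # strength of the sample-level difficulty bonus


@dataclass
class Rollout:
    correct: bool        # did the prediction match the ground-truth label?
    localization: float  # localization accuracy in [0, 1]
    semantic: float      # semantic relevance of the analysis in [0, 1]


def base_reward(r: Rollout) -> float:
    """Weighted sum of the three reward dimensions."""
    return (WEIGHTS["correct"] * float(r.correct)
            + WEIGHTS["localization"] * r.localization
            + WEIGHTS["semantic"] * r.semantic)


def sample_difficulty(group: list) -> float:
    """Fraction of rollouts for one sample that were wrong (assumed proxy)."""
    return 1.0 - sum(r.correct for r in group) / len(group)


def scaled_reward(r: Rollout, group: list, label: str) -> float:
    """Base reward multiplied by label-level and sample-level factors."""
    scale = LABEL_SCALE[label] * (1.0 + ALPHA * sample_difficulty(group))
    return base_reward(r) * scale
```

With these assumed constants, a correct rollout on a sample that most rollouts get wrong earns a larger reward than the same rollout on an easy sample, which is the qualitative behavior the abstract describes.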


VSPO: Validating Semantic Pitfalls in Ontology via LLM-Based CQ Generation

Authors:Hyojun Choi, Seokju Hwang, Kyong-Ho Lee

Competency Questions (CQs) play a crucial role in validating ontology design. While manually crafting CQs can be highly time-consuming and costly for ontology engineers, recent studies have explored the use of large language models (LLMs) to automate this process. However, prior approaches have largely evaluated generated CQs based on their similarity to existing datasets, which often fail to verify semantic pitfalls such as “Misusing allValuesFrom”. Since such pitfalls cannot be reliably detected through rule-based methods, we propose a novel dataset and model of Validating Semantic Pitfalls in Ontology (VSPO) for CQ generation specifically designed to verify the semantic pitfalls. To simulate missing and misused axioms, we use LLMs to generate natural language definitions of classes and properties and introduce misalignments between the definitions and the ontology by removing axioms or altering logical operators (e.g., substituting union with intersection). We then fine-tune LLaMA-3.1-8B-Instruct to generate CQs that validate these semantic discrepancies between the provided definitions and the corresponding axioms. The resulting CQs can detect a broader range of modeling errors compared to existing public datasets. Our fine-tuned model demonstrates superior performance over baselines, showing 26% higher precision and 28.2% higher recall than GPT-4.1 in generating CQs for pitfall validation. This research enables automatic generation of TBox-validating CQs using LLMs, significantly reducing manual effort while improving semantic alignment between ontologies and expert knowledge. To the best of our knowledge, this is the first study to target semantic pitfall validation in CQ generation using LLMs.


Paper and project links

PDF Accepted at AAAI 2026 oral

Summary

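The paper looks at using large language models (LLMs) to validate semantic pitfalls in ontology design automatically. Prior work evaluated generated Competency Questions (CQs) mainly by their similarity to existing datasets; in response, the authors propose a new dataset and model, Validating Semantic Pitfalls in Ontology (VSPO). They simulate missing and misused axioms by having LLMs generate natural-language definitions of classes and properties and then introducing misalignments between those definitions and the ontology. They then fine-tune LLaMA-3.1-8B-Instruct to generate CQs that validate the resulting semantic discrepancies. The generated CQs detect a broader range of modeling errors than existing public datasets, and the fine-tuned model shows 26% higher precision and 28.2% higher recall than GPT-4.1. The approach reduces the manual effort of writing CQs and improves semantic alignment between ontologies and expert knowledge. According to the authors, it is the first study to target semantic pitfall validation in CQ generation with LLMs.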

Key Takeaways

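LLMs can automate the generation of Competency Questions (CQs) for ontology validation, reducing manual effort.
Existing approaches evaluate generated CQs mainly by similarity to existing datasets and fail to verify semantic pitfalls such as "Misusing allValuesFrom".
The paper proposes a new dataset and model, Validating Semantic Pitfalls in Ontology (VSPO), focused on detecting semantic discrepancies.
LLMs generate natural-language definitions of classes and properties; removing axioms or altering logical operators then simulates missing and misused axioms.
CQs generated by the fine-tuned LLaMA-3.1-8B-Instruct model detect a broader range of modeling errors than existing public datasets.
The model shows 26% higher precision and 28.2% higher recall than GPT-4.1 in generating CQs for pitfall validation.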
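
The perturbation step mentioned above (removing axioms or substituting union with intersection) can be sketched on a toy representation. The nested-tuple encoding of class expressions and the helper names below are hypothetical choices for this illustration; the digest does not describe how the authors actually manipulate their ontologies.

```python
from typing import List, Tuple, Union

# Toy encoding (an assumption): a class expression is either a name or a
# tuple (operator, operand, operand, ...).
Expr = Union[str, Tuple]

SWAP = {"union": "intersection", "intersection": "union"}


def swap_operators(expr: Expr) -> Expr:
    """Return a copy of expr with every union/intersection operator swapped."""
    if isinstance(expr, str):
        return expr
    op, *args = expr
    return (SWAP.get(op, op), *[swap_operators(a) for a in args])


def remove_axiom(axioms: List[Expr], index: int) -> List[Expr]:
    """Return the axiom list without the axiom at position index."""
    return axioms[:index] + axioms[index + 1:]


# A definition that says "a pet owner has some pet that is a cat or a dog"
# no longer matches the perturbed axiom, which requires a cat AND a dog.
original = ("some", "hasPet", ("union", "Cat", "Dog"))
perturbed = swap_operators(original)
```

A CQ generator would then be asked for a question that exposes the gap between the unchanged definition and the perturbed axiom.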


SLICE: SLO-Driven Scheduling for LLM Inference on Edge Computing Devices

Authors:Will Chow

Large Language Models (LLMs), as the foundational architecture for next-generation interactive AI applications, not only power intelligent dialogue systems but also drive the evolution of embodied intelligence on edge devices, including humanoid robots, smart vehicles, and other scenarios. The applications running on these edge devices impose differentiated Service Level Objective (SLO) requirements on LLM services, specifically manifested as distinct constraints on Time to First Token (TTFT) and Time Per Output Token (TPOT) as well as end-to-end latency. Notably, edge devices typically handle real-time tasks that are extremely sensitive to latency, such as machine control and navigation planning. However, existing scheduling service systems still prioritize maximizing output token throughput as the sole optimization objective, failing to adequately address the diversity of SLO requirements. This ultimately results in persistently high violation rates for end-to-end latency or TPOT-related SLOs. This paper proposes SLICE, an innovative scheduling solution designed for edge computing scenarios with differentiated SLO requirements. By combining a utility-maximizing request scheduling algorithm with a dynamic iterative control mechanism for generation rates, SLICE significantly improves LLM inference service SLO attainment. Experimental results demonstrate that, compared to the state-of-the-art solutions Orca and FastServe, SLICE achieves up to 35x higher SLO attainment and a 3.4x advantage in task completion time. This version is temporarily hosted anonymously for double-blind review.


Paper and project links

PDF This work has been submitted to the IEEE for possible publication. This version is temporarily hosted anonymously for double-blind review

Summary

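As the foundational architecture for next-generation interactive AI applications, LLMs power intelligent dialogue systems and drive embodied intelligence on edge devices such as humanoid robots and smart vehicles. Applications on these devices impose differentiated Service Level Objective (SLO) requirements on LLM services, expressed as distinct constraints on Time to First Token (TTFT), Time Per Output Token (TPOT), and end-to-end latency. Existing scheduling systems still treat output token throughput as the sole optimization objective, so they cannot serve diverse SLO requirements, and violation rates for end-to-end latency and TPOT-related SLOs stay high. The paper proposes SLICE, a scheduling solution for edge computing scenarios with differentiated SLO requirements. It combines a utility-maximizing request scheduling algorithm with a dynamic iterative control mechanism for generation rates, and it significantly improves SLO attainment for LLM inference services.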

Key Takeaways

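LLMs are the foundational architecture for next-generation AI applications, powering intelligent dialogue systems and embodied intelligence on edge devices.
Applications on edge devices impose differentiated SLO requirements on LLM services, covering TTFT, TPOT, and end-to-end latency.
Existing scheduling systems optimize mainly for output token throughput and cannot meet diverse SLO requirements.
SLICE targets edge computing scenarios with differentiated SLO requirements, combining a utility-maximizing request scheduling algorithm with dynamic iterative control of generation rates.
SLICE achieves up to 35x higher SLO attainment than Orca and FastServe.
SLICE also shows a 3.4x advantage in task completion time over those two systems.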
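
The latency metrics named above can be made concrete with a short sketch. It computes TTFT, TPOT, and end-to-end latency from token timestamps and reports the fraction of requests that meet their SLO. The `SLO` dataclass, the function names, and the rule that a request must satisfy all three bounds are assumptions for this illustration; the digest does not state how SLICE itself defines attainment.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SLO:
    ttft_s: float  # maximum time to first token, in seconds
    tpot_s: float  # maximum average time per output token, in seconds
    e2e_s: float   # maximum end-to-end latency, in seconds


def latency_metrics(arrival: float, token_times: List[float]) -> Tuple[float, float, float]:
    """Return (ttft, tpot, e2e) from the arrival time and token timestamps."""
    if not token_times:
        raise ValueError("request produced no tokens")
    ttft = token_times[0] - arrival
    e2e = token_times[-1] - arrival
    n = len(token_times)
    # TPOT averages the gaps between tokens after the first one.
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    return ttft, tpot, e2e


def meets_slo(arrival: float, token_times: List[float], slo: SLO) -> bool:
    """Assumed rule: a request meets its SLO only if all three bounds hold."""
    ttft, tpot, e2e = latency_metrics(arrival, token_times)
    return ttft <= slo.ttft_s and tpot <= slo.tpot_s and e2e <= slo.e2e_s


def slo_attainment(requests: List[Tuple[float, List[float], SLO]]) -> float:
    """Fraction of requests that meet their own SLO."""
    if not requests:
        return 0.0
    return sum(meets_slo(a, t, s) for a, t, s in requests) / len(requests)
```

Giving each request its own `SLO` object mirrors the point of the abstract: a navigation task and a chat task can carry different bounds, so a throughput-only scheduler has no way to tell them apart.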


Attention Surgery: An Efficient Recipe to Linearize Your Video Diffusion Transformer

Authors:Mohsen Ghafoorian, Denis Korzhenkov, Amirhossein Habibian

Transformer-based video diffusion models (VDMs) deliver state-of-the-art video generation quality but are constrained by the quadratic cost of self-attention, making long sequences and high resolutions computationally expensive. While linear attention offers sub-quadratic complexity, previous approaches have failed to match the expressiveness of softmax attention unless retrained at significant computational cost. We introduce Attention Surgery, an efficient framework that enables linear or hybrid attention in pretrained VDMs, eliminating the need for training from scratch. Inspired by recent advances in language models, our method combines a novel hybrid attention mechanism (mixing softmax and linear tokens) with a lightweight distillation and fine-tuning pipeline requiring only a few GPU-days. Additionally, we incorporate a cost-aware block-rate strategy to balance expressiveness and efficiency across layers. Applied to Wan2.1 1.3B, a state-of-the-art efficient transformer VDM, and evaluated on VBench, VBench2.0, and a human preference study, Attention Surgery achieves competitive results. Furthermore, measurements of on-mobile latency, memory usage, and FLOPs demonstrate notable improvements in scaling behavior for longer videos. Project page is available at: https://qualcomm-ai-research.github.io/attention-surgery.


Paper and project links

PDF

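Summary

Transformer-based video diffusion models (VDMs) deliver state-of-the-art generation quality, but the quadratic cost of self-attention makes long sequences and high resolutions computationally expensive. Attention Surgery is a framework that enables linear or hybrid attention in pretrained VDMs without training from scratch. It combines a hybrid attention mechanism with a lightweight distillation and fine-tuning pipeline that needs only a few GPU-days, and it uses a cost-aware block-rate strategy to balance expressiveness and efficiency across layers. Applied to Wan2.1 1.3B and evaluated on VBench, VBench2.0, and a human preference study, it achieves competitive results. Measurements of on-mobile latency, memory usage, and FLOPs show notable improvements in scaling behavior for longer videos. Project page: https://qualcomm-ai-research.github.io/attention-surgery.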

Key Takeaways

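Transformer-based video diffusion models are computationally expensive, especially for long sequences and high resolutions.
Attention Surgery introduces a hybrid attention mechanism that mixes softmax and linear tokens to improve efficiency.
A lightweight distillation and fine-tuning pipeline adapts the pretrained model in only a few GPU-days, with no training from scratch.
A cost-aware block-rate strategy balances expressiveness and efficiency across layers.
Attention Surgery achieves competitive results on VBench, VBench2.0, and a human preference study.
On-mobile latency, memory usage, and FLOPs scale notably better for longer videos.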
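
To see why linear attention avoids the quadratic cost, the sketch below computes the same kernelized attention two ways: once with an explicit N x N weight matrix, and once by accumulating a small d x d summary over the keys. This is generic non-causal linear attention with the common `elu(x) + 1` feature map, written in plain Python for readability. It is not the hybrid mechanism from the paper, whose details the digest does not give.

```python
import math
from typing import List

Matrix = List[List[float]]


def phi(x: float) -> float:
    """Positive feature map elu(x) + 1."""
    return x + 1.0 if x > 0 else math.exp(x)


def attention_quadratic(Q: Matrix, K: Matrix, V: Matrix) -> Matrix:
    """Reference form: one weight per (query, key) pair, O(N^2 d) work."""
    out = []
    for q in Q:
        fq = [phi(x) for x in q]
        weights = [sum(a * phi(b) for a, b in zip(fq, k)) for k in K]
        z = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / z
                    for j in range(len(V[0]))])
    return out


def attention_linear(Q: Matrix, K: Matrix, V: Matrix) -> Matrix:
    """Same result in O(N d^2): summarize the keys and values once."""
    d, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(d)]  # sum over keys of phi(k) v^T
    z = [0.0] * d                       # sum over keys of phi(k)
    for k, v in zip(K, V):
        fk = [phi(x) for x in k]
        for i in range(d):
            z[i] += fk[i]
            for j in range(dv):
                S[i][j] += fk[i] * v[j]
    out = []
    for q in Q:
        fq = [phi(x) for x in q]
        denom = sum(a * b for a, b in zip(fq, z))
        out.append([sum(fq[i] * S[i][j] for i in range(d)) / denom
                    for j in range(dv)])
    return out
```

The two functions agree up to floating-point error because matrix multiplication is associative; only the second one keeps its cost linear in the number of tokens, which is what matters for long videos.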


Guided Reasoning in LLM-Driven Penetration Testing Using Structured Attack Trees

Authors:Katsuaki Nakano, Reza Fayyazi, Shanchieh Jay Yang, Michael Zuzak

Recent advances in Large Language Models (LLMs) have driven interest in automating cybersecurity penetration testing workflows, offering the promise of faster and more consistent vulnerability assessment for enterprise systems. Existing LLM agents for penetration testing primarily rely on self-guided reasoning, which can produce inaccurate or hallucinated procedural steps. As a result, the LLM agent may undertake unproductive actions, such as exploiting unused software libraries or generating cyclical responses that repeat prior tactics. In this work, we propose a guided reasoning pipeline for penetration testing LLM agents that incorporates a deterministic task tree built from the MITRE ATT&CK Matrix, a proven penetration testing kill chain, to constrain the LLM's reasoning process to explicitly defined tactics, techniques, and procedures. This anchors reasoning in proven penetration testing methodologies and filters out ineffective actions by guiding the agent towards more productive attack procedures. To evaluate our approach, we built an automated penetration testing LLM agent using three LLMs (Llama-3-8B, Gemini-1.5, and GPT-4) and applied it to navigate 10 HackTheBox cybersecurity exercises with 103 discrete subtasks representing real-world cyberattack scenarios. Our proposed reasoning pipeline guided the LLM agent through 71.8%, 72.8%, and 78.6% of subtasks using Llama-3-8B, Gemini-1.5, and GPT-4, respectively. Comparatively, the state-of-the-art LLM penetration testing tool using self-guided reasoning completed only 13.5%, 16.5%, and 75.7% of subtasks and required 86.2%, 118.7%, and 205.9% more model queries. This suggests that incorporating a deterministic task tree into LLM reasoning pipelines can enhance the accuracy and efficiency of automated cybersecurity assessments.


Paper and project links

PDF

Summary

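LLMs show promise for automating cybersecurity penetration testing, but existing LLM agents rely mainly on self-guided reasoning, which can produce inaccurate steps and cyclical responses. The paper proposes a guided reasoning pipeline built on a deterministic task tree derived from the MITRE ATT&CK Matrix, which constrains the LLM's reasoning to explicitly defined tactics, techniques, and procedures and so improves the accuracy and efficiency of automated penetration testing. In the evaluation, the guided agent completed more subtasks than the self-guided state-of-the-art tool while using fewer model queries.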

Key Takeaways

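LLMs show promise for automating cybersecurity penetration testing workflows.
Current LLM agents for penetration testing rely mainly on self-guided reasoning, which can produce inaccurate steps and cyclical responses.
The proposed guided reasoning pipeline incorporates a task tree built from the MITRE ATT&CK Matrix to improve accuracy and efficiency.
The deterministic task tree constrains the LLM's reasoning to explicitly defined tactics, techniques, and procedures.
The self-guided state-of-the-art tool completed only 13.5%, 16.5%, and 75.7% of subtasks and required 86.2%, 118.7%, and 205.9% more model queries.
Across 10 HackTheBox exercises with 103 subtasks, the guided agent completed 71.8%, 72.8%, and 78.6% of subtasks with Llama-3-8B, Gemini-1.5, and GPT-4, respectively.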
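
The idea of constraining an agent with a deterministic task tree can be sketched as follows. The tactic and technique names are real MITRE ATT&CK entries, but the tree contents, the strict tactic ordering, and the fallback rule are assumptions made for this illustration; the digest does not describe the authors' actual tree or selection logic.

```python
from typing import Dict, List, Set

# A tiny hand-written slice of a task tree: tactic -> techniques.
# Python dicts keep insertion order, so tactics are visited in this order.
TASK_TREE: Dict[str, List[str]] = {
    "Reconnaissance": ["Active Scanning", "Gather Victim Host Information"],
    "Initial Access": ["Exploit Public-Facing Application", "Valid Accounts"],
    "Privilege Escalation": ["Exploitation for Privilege Escalation"],
}


def allowed_actions(done: Set[str]) -> List[str]:
    """Unfinished techniques of the first tactic that still has any."""
    for techniques in TASK_TREE.values():
        remaining = [t for t in techniques if t not in done]
        if remaining:
            return remaining
    return []


def constrain(proposal: str, done: Set[str]) -> str:
    """Keep the LLM's proposal only if the tree allows it (assumed rule)."""
    allowed = allowed_actions(done)
    if not allowed:
        raise ValueError("task tree exhausted")
    return proposal if proposal in allowed else allowed[0]
```

Because the allowed set shrinks as techniques are marked done, the agent cannot loop back to a finished step or wander to an action outside the tree, which are the two failure modes the abstract attributes to self-guided reasoning.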


LENS: Learning to Segment Anything with Unified Reinforced Reasoning

Authors:Lianghui Zhu, Bin Ouyang, Yuxuan Zhang, Tianheng Cheng, Rui Hu, Haocheng Shen, Longjin Ran, Xiaoxin Chen, Li Yu, Wenyu Liu, Xinggang Wang

Text-prompted image segmentation enables fine-grained visual understanding and is critical for applications such as human-computer interaction and robotics. However, existing supervised fine-tuning methods typically ignore explicit chain-of-thought (CoT) reasoning at test time, which limits their ability to generalize to unseen prompts and domains. To address this issue, we introduce LENS, a scalable reinforcement-learning framework that jointly optimizes the reasoning process and segmentation in an end-to-end manner. We propose unified reinforcement-learning rewards that span sentence-, box-, and segment-level cues, encouraging the model to generate informative CoT rationales while refining mask quality. Using a publicly available 3-billion-parameter vision-language model, i.e., Qwen2.5-VL-3B-Instruct, LENS achieves an average cIoU of 81.2% on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks, outperforming the strong fine-tuned method, i.e., GLaMM, by up to 5.6%. These results demonstrate that RL-driven CoT reasoning significantly enhances text-prompted segmentation and offers a practical path toward more generalizable Segment Anything models (SAM). Code is available at https://github.com/hustvl/LENS.


Paper and project links

PDF Code is released at https://github.com/hustvl/LENS

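Summary

The paper proposes LENS, a reinforcement-learning framework that jointly optimizes the reasoning process and segmentation for text-prompted image segmentation. LENS uses unified reinforcement-learning rewards spanning sentence-, box-, and segment-level cues, encouraging the model to generate informative chain-of-thought (CoT) rationales while refining mask quality. Built on the publicly available vision-language model Qwen2.5-VL-3B-Instruct, it reaches an average cIoU of 81.2% on RefCOCO, RefCOCO+, and RefCOCOg, outperforming GLaMM by up to 5.6%.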

Key Takeaways

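Text-prompted image segmentation enables fine-grained visual understanding and matters for applications such as human-computer interaction and robotics.
Existing supervised fine-tuning methods ignore explicit chain-of-thought (CoT) reasoning at test time, which limits generalization to unseen prompts and domains.
LENS is a reinforcement-learning framework that jointly optimizes reasoning and segmentation end to end.
LENS uses unified rewards spanning sentence-, box-, and segment-level cues to encourage informative CoT rationales.
With Qwen2.5-VL-3B-Instruct, LENS reaches an average cIoU of 81.2% on RefCOCO, RefCOCO+, and RefCOCOg, outperforming the fine-tuned method GLaMM by up to 5.6%.
The results indicate that RL-driven CoT reasoning significantly improves text-prompted segmentation.
LENS offers a practical path toward more generalizable Segment Anything models (SAM).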
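
The cIoU figure quoted above is easier to interpret next to a small computation. The sketch assumes that cIoU means cumulative IoU, as is common on the RefCOCO benchmarks: total intersection over total union across the whole dataset, rather than the mean of per-image IoU. The digest does not define the metric, and representing masks as flat 0/1 lists is a simplification for illustration.

```python
from typing import List, Sequence

Mask = Sequence[int]  # flattened binary mask, 1 = foreground


def ciou(preds: List[Mask], gts: List[Mask]) -> float:
    """Cumulative IoU: summed intersections over summed unions."""
    inter = union = 0
    for p, g in zip(preds, gts):
        for a, b in zip(p, g):
            inter += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    return inter / union if union else 0.0


def mean_iou(preds: List[Mask], gts: List[Mask]) -> float:
    """Average of per-sample IoU, shown for contrast with cIoU."""
    scores = [ciou([p], [g]) for p, g in zip(preds, gts)]
    return sum(scores) / len(scores) if scores else 0.0


# Two toy samples: the first prediction is twice as large as its target,
# the second matches its target exactly.
preds = [[1, 1, 0, 0], [1, 0, 0, 0]]
gts = [[1, 0, 0, 0], [1, 0, 0, 0]]
```

On this toy data the cumulative score and the per-sample mean differ, because cumulative IoU weights each sample by its pixel area; large objects therefore dominate the reported number.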


Towards Efficient Medical Reasoning with Minimal Fine-Tuning Data

Authors:Xinlin Zhuang, Feilong Tang, Haolin Yang, Xiwei Liu, Ming Hu, Huifa Li, Haochen Xue, Junjun He, Zongyuan Ge, Yichen Li, Ying Qian, Imran Razzak

Supervised Fine-Tuning (SFT) plays a pivotal role in adapting Large Language Models (LLMs) to specialized domains such as medical reasoning. However, existing SFT practices often rely on unfiltered datasets that contain redundant and low-quality samples, leading to substantial computational costs and suboptimal performance. Although existing methods attempt to alleviate this problem by selecting data based on sample difficulty, defined by knowledge and reasoning complexity, they overlook each sample’s optimization utility reflected in its gradient. Interestingly, we find that gradient-based influence alone favors easy-to-optimize samples that cause large parameter shifts but lack deep reasoning chains, while difficulty alone selects noisy or overly complex cases that fail to guide stable optimization. Based on this observation, we propose a data selection strategy, Difficulty-Influence Quadrant (DIQ), which prioritizes samples in the high-difficulty-high-influence quadrant to balance complex clinical reasoning with substantial gradient influence, enabling efficient medical reasoning with minimal fine-tuning data. Furthermore, Human and LLM-as-a-judge evaluations show that DIQ-selected subsets demonstrate higher data quality and generate clinical reasoning that is more aligned with expert practices in differential diagnosis, safety check, and evidence citation, as DIQ emphasizes samples that foster expert-like reasoning patterns. Extensive experiments on medical reasoning benchmarks demonstrate that DIQ enables models fine-tuned on only 1% of selected data to match full-dataset performance, while using 10% consistently outperforms baseline methods, highlighting the superiority of principled data selection over brute-force scaling. The code and data are available at https://github.com/mihara-bot/DIQ.


Paper and project links

PDF preprint, under review

Summary

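Supervised fine-tuning (SFT) is central to adapting LLMs to specialized domains such as medical reasoning, but existing SFT practice often relies on unfiltered datasets with redundant and low-quality samples, which raises computational cost and hurts performance. The paper proposes Difficulty-Influence Quadrant (DIQ), a data selection strategy that prioritizes samples in the high-difficulty, high-influence quadrant to balance complex clinical reasoning with substantial gradient influence. Human and LLM-as-a-judge evaluations show that DIQ-selected subsets have higher data quality and yield clinical reasoning more aligned with expert practice. Models fine-tuned on only 1% of the selected data match full-dataset performance, and using 10% consistently outperforms baseline methods.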

Key Takeaways

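Supervised fine-tuning (SFT) plays a pivotal role in adapting LLMs to specialized domains such as medical reasoning.
Existing SFT practice relies on unfiltered datasets, leading to high computational cost and suboptimal performance.
Difficulty-Influence Quadrant (DIQ) selects data using both sample difficulty and gradient-based influence.
DIQ balances complex clinical reasoning with substantial gradient influence, improving data quality and model performance.
DIQ-selected subsets yield clinical reasoning more aligned with expert practice in differential diagnosis, safety checks, and evidence citation.
Fine-tuning on only 1% of the selected data matches full-dataset performance, and 10% consistently outperforms baseline methods.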
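
The quadrant idea can be illustrated with a few lines of code. The sketch takes precomputed difficulty and influence scores, splits each axis at a threshold, and keeps only the samples that are high on both. How the paper scores difficulty and influence, and where it places the thresholds, is not stated in the digest; the median split and the function names below are assumptions.

```python
from statistics import median
from typing import List, Optional, Sequence


def quadrant(d: float, i: float, d_thresh: float, i_thresh: float) -> str:
    """Name the quadrant of one sample, e.g. 'high-difficulty/low-influence'."""
    d_side = "high" if d >= d_thresh else "low"
    i_side = "high" if i >= i_thresh else "low"
    return f"{d_side}-difficulty/{i_side}-influence"


def select_high_high(samples: Sequence[str],
                     difficulty: Sequence[float],
                     influence: Sequence[float],
                     d_thresh: Optional[float] = None,
                     i_thresh: Optional[float] = None) -> List[str]:
    """Keep samples that are at or above both thresholds."""
    # Assumed default: split each axis at its median.
    d_thresh = median(difficulty) if d_thresh is None else d_thresh
    i_thresh = median(influence) if i_thresh is None else i_thresh
    return [s for s, d, i in zip(samples, difficulty, influence)
            if d >= d_thresh and i >= i_thresh]


samples = ["a", "b", "c", "d"]
difficulty = [0.9, 0.8, 0.2, 0.1]
influence = [0.9, 0.1, 0.8, 0.2]
selected = select_high_high(samples, difficulty, influence)
```

Sample "b" is hard but has little gradient influence, and "c" is influential but easy; both are dropped, matching the abstract's argument that neither criterion works well alone.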


GMAT: Grounded Multi-Agent Clinical Description Generation for Text Encoder in Vision-Language MIL for Whole Slide Image Classification

Authors:Ngoc Bui Lam Quang, Nam Le Nguyen Binh, Thanh-Huy Nguyen, Le Thien Phuc Nguyen, Quan Nguyen, Ulas Bagci

Multiple Instance Learning (MIL) is the leading approach for whole slide image (WSI) classification, enabling efficient analysis of gigapixel pathology slides. Recent work has introduced vision-language models (VLMs) into MIL pipelines to incorporate medical knowledge through text-based class descriptions rather than simple class names. However, when these methods rely on large language models (LLMs) to generate clinical descriptions or use fixed-length prompts to represent complex pathology concepts, the limited token capacity of VLMs often constrains the expressiveness and richness of the encoded class information. Additionally, descriptions generated solely by LLMs may lack domain grounding and fine-grained medical specificity, leading to suboptimal alignment with visual features. To address these challenges, we propose a vision-language MIL framework with two key contributions: (1) A grounded multi-agent description generation system that leverages curated pathology textbooks and agent specialization (e.g., morphology, spatial context) to produce accurate and diverse clinical descriptions; (2) A text encoding strategy using a list of descriptions rather than a single prompt, capturing fine-grained and complementary clinical signals for better alignment with visual features. Integrated into a VLM-MIL pipeline, our approach shows improved performance over single-prompt class baselines and achieves results comparable to state-of-the-art models, as demonstrated on renal and lung cancer datasets.


Paper and project links

PDF Accepted at MICCAI Workshop 2025

Summary

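Multiple Instance Learning (MIL) is the leading approach for whole slide image (WSI) classification in pathology. Recent work brings vision-language models (VLMs) into MIL pipelines to inject medical knowledge through text-based class descriptions instead of plain class names. When such methods rely on LLM-generated clinical descriptions or fixed-length prompts, the limited token capacity of VLMs constrains how expressive the encoded class information can be. The paper proposes a vision-language MIL framework with two contributions: a grounded multi-agent description generation system that draws on curated pathology textbooks and agent specialization, and a text encoding strategy that uses a list of descriptions rather than a single prompt to capture fine-grained, complementary clinical signals. On renal and lung cancer datasets, the framework outperforms single-prompt class baselines and achieves results comparable to state-of-the-art models.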

Key Takeaways

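Multiple Instance Learning (MIL) is the leading approach for whole slide image (WSI) classification in pathology.
Vision-language models (VLMs) bring medical knowledge into MIL pipelines through text-based class descriptions rather than plain class names.
Descriptions generated solely by LLMs may lack domain grounding and fine-grained medical specificity.
The paper proposes a vision-language MIL framework with two contributions: a multi-agent description generation system and a text encoding strategy.
The multi-agent system draws on curated pathology textbooks and agent specialization (e.g., morphology, spatial context) to produce accurate and diverse clinical descriptions.
The text encoding strategy uses a list of descriptions rather than a single prompt, capturing fine-grained, complementary clinical signals that align better with visual features.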


Kedreamix

https://kedreamix.github.io/Talk2Paper/Paper/2025-11-20/LLM/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting.
