
R1_Reasoning


⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: never use these summaries in serious academic settings; they are only meant as a first-pass screen before reading a paper!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated 2025-11-20

UniGen-1.5: Enhancing Image Generation and Editing through Reward Unification in Reinforcement Learning

Authors:Rui Tian, Mingfei Gao, Haiming Gang, Jiasen Lu, Zhe Gan, Yinfei Yang, Zuxuan Wu, Afshin Dehghan

We present UniGen-1.5, a unified multimodal large language model (MLLM) for advanced image understanding, generation and editing. Building upon UniGen, we comprehensively enhance the model architecture and training pipeline to strengthen the image understanding and generation capabilities while unlocking strong image editing ability. In particular, we propose a unified Reinforcement Learning (RL) strategy that improves both image generation and image editing jointly via shared reward models. To further enhance image editing performance, we propose a lightweight Edit Instruction Alignment stage that significantly improves editing-instruction comprehension, which is essential for the success of the RL training. Experimental results show that UniGen-1.5 demonstrates competitive understanding and generation performance. Specifically, UniGen-1.5 achieves overall scores of 0.89 on GenEval and 4.31 on ImgEdit, surpassing state-of-the-art models such as BAGEL and reaching performance comparable to proprietary models such as GPT-Image-1.


Paper & Project Links

PDF

Summary

UniGen-1.5 is a unified multimodal large language model for advanced image understanding, generation, and editing. Built on UniGen, it comprehensively enhances the model architecture and training pipeline, strengthening image understanding and generation while unlocking strong image editing. A unified reinforcement learning strategy with shared reward models improves image generation and editing jointly. In addition, a lightweight Edit Instruction Alignment stage significantly improves comprehension of editing instructions, a key ingredient for successful RL training. Experiments show competitive understanding and generation performance: UniGen-1.5 achieves overall scores of 0.89 on GenEval and 4.31 on ImgEdit, surpassing state-of-the-art models such as BAGEL and matching proprietary models such as GPT-Image-1.

Key Takeaways

  1. UniGen-1.5 is a unified multimodal large language model for advanced image understanding, generation, and editing.
  2. The model strengthens image understanding and generation and unlocks strong image editing capability.
  3. A unified reinforcement learning strategy with shared reward models improves image generation and editing jointly.
  4. A lightweight Edit Instruction Alignment stage improves comprehension of editing instructions.
  5. UniGen-1.5 delivers competitive understanding and generation performance, with overall scores of 0.89 on GenEval and 4.31 on ImgEdit.
  6. UniGen-1.5 surpasses state-of-the-art models such as BAGEL.

Cool Papers

Click to view paper screenshots

Look-Ahead Reasoning on Learning Platforms

Authors:Haiqing Zhu, Tijana Zrnic, Celestine Mendler-Dünner

On many learning platforms, the optimization criteria guiding model training reflect the priorities of the designer rather than those of the individuals they affect. Consequently, users may act strategically to obtain more favorable outcomes, effectively contesting the platform’s predictions. While past work has studied strategic user behavior on learning platforms, the focus has largely been on strategic responses to a deployed model, without considering the behavior of other users. In contrast, look-ahead reasoning takes into account that user actions are coupled, and – at scale – impact future predictions. Within this framework, we first formalize level-$k$ thinking, a concept from behavioral economics, where users aim to outsmart their peers by looking one step ahead. We show that, while convergence to an equilibrium is accelerated, the equilibrium remains the same, providing no benefit of higher-level reasoning for individuals in the long run. Then, we focus on collective reasoning, where users take coordinated actions by optimizing through their joint impact on the model. By contrasting collective with selfish behavior, we characterize the benefits and limits of coordination; a new notion of alignment between the learner’s and the users’ utilities emerges as a key concept. We discuss connections to several related mathematical frameworks, including strategic classification, performative prediction, and algorithmic collective action.


Paper & Project Links

PDF accepted to NeurIPS 2025

Summary

This paper studies strategic user behavior on learning platforms, noting that platform optimization criteria often reflect the designer's priorities rather than those of the affected individuals. To obtain more favorable outcomes, users may act strategically to contest the platform's predictions. The paper stresses look-ahead reasoning: user actions are coupled and, at scale, shape future predictions. It formalizes level-$k$ thinking and shows that although convergence to equilibrium is accelerated, the equilibrium itself is unchanged, so higher-level reasoning yields no long-run benefit to individuals. It then focuses on collective reasoning, where users take coordinated actions by optimizing their joint impact on the model. Contrasting collective with selfish behavior reveals the benefits and limits of coordination, and a new notion of alignment between the learner's and the users' utilities emerges as a key concept.

Key Takeaways

  1. Optimization criteria on learning platforms often reflect the designer's priorities rather than those of the affected individuals.
  2. Users act strategically to obtain more favorable outcomes, contesting the platform's predictions.
  3. User actions are coupled and affect future predictions; reasoning about this is called look-ahead reasoning.
  4. Level-$k$ thinking accelerates convergence to equilibrium but brings no long-run benefit to individuals.
  5. Collective reasoning emphasizes coordinated user actions that optimize their joint impact on the model.
  6. Contrasting collective with selfish behavior reveals the benefits and limits of coordination, and introduces a new notion of alignment between the learner's and the users' utilities.
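The claim that level-1 reasoning speeds up convergence without changing the equilibrium can be illustrated with a toy scalar performative-prediction loop (entirely hypothetical, not the paper's model): the platform refits on users' shifted features, level-0 users best-respond to the deployed parameter, and level-1 users first simulate their peers' level-0 response.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(1.0, 0.2, size=100)  # users' base features
c = 0.5                             # how strongly users shift with the deployed model

def step(theta, level):
    """One platform retrain after users best-respond at the given level."""
    if level == 1:
        # level-1 users anticipate the retrain induced by peers' level-0 moves
        theta = np.mean(a + c * theta)
    return np.mean(a + c * theta)   # platform refits on the shifted features

theta_star = np.mean(a) / (1 - c)   # the unique equilibrium, independent of level

errs = {}
for level in (0, 1):
    theta = 0.0
    for _ in range(10):
        theta = step(theta, level)
    errs[level] = abs(theta - theta_star)
# level-1 contracts by c**2 per round instead of c: faster, but same fixed point
```

Both levels converge to the same `theta_star`; level-1 just gets there in fewer rounds, mirroring the paper's finding.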


From Random Determinants to the Ground State

Authors:Hao Zhang, Matthew Otten

Accurate quantum many-body calculations often depend on reliable reference states or good human-designed ansätze, yet these sources of knowledge can become unreliable in hard problems like strongly correlated systems. We introduce the Trimmed Configuration Interaction (TrimCI) method, a prior-knowledge-free algorithm that builds accurate ground states directly from random Slater determinants. TrimCI iteratively expands the variational space and trims away unimportant states, allowing a random initial core to self-refine into an accurate approximation of the exact ground state. Across challenging benchmarks, TrimCI achieves state-of-the-art accuracy with striking efficiency gains of several orders of magnitude. For the [4Fe-4S] cluster, it matches recent quantum computing results with $10^6$-fold fewer determinants and CPU-hours. For the nitrogenase P-cluster, it matches selected-CI accuracy using $10^5$-fold fewer determinants. For the $8\times8$ Hubbard model, it recovers over $99\%$ of the ground-state energy using only $10^{-28}$ of the Hilbert space. In some regimes, TrimCI attains orders-of-magnitude higher accuracy than the AFQMC method. These results demonstrate that high-accuracy many-body ground states can be discovered directly from random determinants, establishing TrimCI as a prior-knowledge-free, accurate and highly efficient framework for quantum many-body systems. The compact explicit wavefunctions it produces further enable direct and rapid evaluation of observables.


Paper & Project Links

PDF 13 pages, 5 figures

Summary: The prior-knowledge-free TrimCI method builds accurate ground states directly from random Slater determinants, sidestepping the reliable reference states and human-designed ansätze that hard problems such as strongly correlated systems can defeat. By iteratively expanding the variational space and trimming away unimportant states, a random initial core self-refines into an accurate approximation of the exact ground state. On challenging benchmarks TrimCI reaches high accuracy with efficiency gains of several orders of magnitude: it matches recent quantum computing results while using far fewer determinants and CPU-hours, matches selected-CI accuracy for the nitrogenase P-cluster with far fewer determinants, and recovers over 99% of the ground-state energy of the Hubbard model using only a tiny fraction of the Hilbert space. In some regimes, TrimCI is orders of magnitude more accurate than the AFQMC method.

Key Takeaways

  1. TrimCI is a prior-knowledge-free algorithm that builds accurate ground states directly from random Slater determinants.
  2. TrimCI iteratively expands the variational space and trims unimportant states, letting a random initial core self-refine.
  3. TrimCI achieves high accuracy and striking efficiency gains on challenging benchmarks.
  4. Across different systems, TrimCI matches quantum computing results while using vastly fewer determinants and CPU-hours.
  5. TrimCI matches selected-CI accuracy with far fewer determinants.
  6. On the Hubbard model, TrimCI recovers over 99% of the ground-state energy using only a tiny fraction of the Hilbert space.
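The expand-and-trim loop can be sketched on a toy problem: a random sparse symmetric matrix stands in for the Hamiltonian, basis indices stand in for Slater determinants, and trimming keeps the states carrying the largest ground-state amplitude. All sizes, densities, and the trim rule below are illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 120
# toy stand-in "Hamiltonian": sparse symmetric matrix with a dominant diagonal
H = np.diag(np.sort(rng.uniform(0.0, 10.0, n)))
mask = rng.random((n, n)) < 0.05
off = np.where(mask, rng.normal(0.0, 0.3, (n, n)), 0.0)
H += np.triu(off, 1) + np.triu(off, 1).T

exact = np.linalg.eigvalsh(H)[0]  # reference ground-state energy

def trimci_toy(H, k=25, iters=10, seed=2):
    rng = np.random.default_rng(seed)
    core = list(rng.choice(len(H), size=k, replace=False))  # random "determinants"
    for _ in range(iters):
        # expand: add every basis state coupled to the current core
        coupled = np.where(np.abs(H[core]).sum(axis=0) > 0)[0]
        space = sorted(set(core) | set(coupled))
        sub = H[np.ix_(space, space)]
        _, vecs = np.linalg.eigh(sub)
        # trim: keep the k states with the largest ground-state amplitude
        keep = np.argsort(-np.abs(vecs[:, 0]))[:k]
        core = [space[i] for i in keep]
    # variational energy inside the trimmed core
    return np.linalg.eigvalsh(H[np.ix_(core, core)])[0]

energy = trimci_toy(H)
```

Because the final energy is a variational estimate in a small subspace, it upper-bounds `exact` while the random core migrates toward the states that dominate the true ground state.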


Seeing Beyond the Image: ECG and Anatomical Knowledge-Guided Myocardial Scar Segmentation from Late Gadolinium-Enhanced Images

Authors:Farheen Ramzan, Yusuf Kiberu, Nikesh Jathanna, Meryem Jabrane, Vicente Grau, Shahnaz Jamil-Copley, Richard H. Clayton, Chen, Chen

Accurate segmentation of myocardial scar from late gadolinium enhanced (LGE) cardiac MRI is essential for evaluating tissue viability, yet remains challenging due to variable contrast and imaging artifacts. Electrocardiogram (ECG) signals provide complementary physiological information, as conduction abnormalities can help localize or suggest scarred myocardial regions. In this work, we propose a novel multimodal framework that integrates ECG-derived electrophysiological information with anatomical priors from the AHA-17 atlas for physiologically consistent LGE-based scar segmentation. As ECGs and LGE-MRIs are not acquired simultaneously, we introduce a Temporal Aware Feature Fusion (TAFF) mechanism that dynamically weights and fuses features based on their acquisition time difference. Our method was evaluated on a clinical dataset and achieved substantial gains over the state-of-the-art image-only baseline (nnU-Net), increasing the average Dice score for scars from 0.6149 to 0.8463 and achieving high performance in both precision (0.9115) and sensitivity (0.9043). These results show that integrating physiological and anatomical knowledge allows the model to “see beyond the image”, setting a new direction for robust and physiologically grounded cardiac scar segmentation.


Paper & Project Links

PDF

Summary: This work tackles myocardial scar segmentation with a multimodal framework that fuses ECG-derived physiological information with anatomical priors from the AHA-17 atlas. A Temporal Aware Feature Fusion mechanism accounts for the fact that ECGs and LGE-MRIs are not acquired simultaneously, enabling accurate segmentation on late gadolinium-enhanced cardiac MRI. Evaluation on a clinical dataset shows a substantial improvement in segmentation accuracy.

Key Takeaways

  1. Electrocardiogram (ECG) signals carry physiological information that helps localize or suggest scarred myocardial regions.
  2. A novel multimodal framework fuses ECG information with anatomical priors from the AHA-17 atlas for physiologically consistent, LGE-based scar segmentation.
  3. Because ECGs and LGE-MRIs are not acquired simultaneously, a Temporal Aware Feature Fusion (TAFF) mechanism dynamically fuses features based on acquisition time difference.
  4. The method substantially outperforms the image-only state of the art, raising the average Dice score for scars.
  5. The approach improves not only segmentation accuracy but also precision and sensitivity.
  6. Integrating physiological and anatomical knowledge lets the model "see beyond the image", setting a new direction for cardiac scar segmentation.
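As a rough illustration of time-aware fusion, one could gate the ECG features by a weight that decays with the acquisition gap. The exponential form, the time constant `tau`, and the residual-gating scheme below are assumptions made for this sketch, not the paper's learned TAFF mechanism.

```python
import numpy as np

def taff_fuse(img_feat, ecg_feat, dt_days, tau=30.0):
    """Toy time-aware fusion: the ECG contribution decays as the
    gap between ECG and MRI acquisition grows (tau is hypothetical)."""
    w = np.exp(-abs(dt_days) / tau)   # trust placed in the ECG features
    return img_feat + w * ecg_feat    # gated residual fusion

img = np.ones(8)        # stand-in LGE-MRI feature vector
ecg = np.full(8, 0.5)   # stand-in ECG feature vector
same_day = taff_fuse(img, ecg, dt_days=0.0)
late = taff_fuse(img, ecg, dt_days=120.0)
```

With a 120-day gap the ECG contribution is almost fully attenuated, while a same-day ECG is added at full weight.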


HyMAD: A Hybrid Multi-Activity Detection Approach for Border Surveillance and Monitoring

Authors:Sriram Srinivasan, Srinivasan Aruchamy, Siva Ram Krisha Vadali

Seismic sensing has emerged as a promising solution for border surveillance and monitoring; the seismic sensors that are often buried underground are small and cannot be noticed easily, making them difficult for intruders to detect, avoid, or vandalize. This significantly enhances their effectiveness compared to highly visible cameras or fences. However, accurately detecting and distinguishing between overlapping activities that are happening simultaneously, such as human intrusions, animal movements, and vehicle rumbling, remains a major challenge due to the complex and noisy nature of seismic signals. Correctly identifying simultaneous activities is critical because failing to separate them can lead to misclassification, missed detections, and an incomplete understanding of the situation, thereby reducing the reliability of surveillance systems. To tackle this problem, we propose HyMAD (Hybrid Multi-Activity Detection), a deep neural architecture based on spatio-temporal feature fusion. The framework integrates spectral features extracted with SincNet and temporal dependencies modeled by a recurrent neural network (RNN). In addition, HyMAD employs self-attention layers to strengthen intra-modal representations and a cross-modal fusion module to achieve robust multi-label classification of seismic events. We evaluate our approach on a dataset constructed from real-world field recordings collected in the context of border surveillance and monitoring, demonstrating its ability to generalize to complex, simultaneous activity scenarios involving humans, animals, and vehicles. Our method achieves competitive performance and offers a modular framework for extending seismic-based activity recognition in real-world security applications.


Paper & Project Links

PDF Multi-label seismic signal classification using novel attention-based feature fusion. Submitting to cs.CV due to relevance to general pattern recognition and time-frequency (spectrogram) analysis

Summary
Seismic sensing is a promising solution for border surveillance and monitoring: the sensors are buried underground and hard for intruders to notice, avoid, or vandalize, which makes them more effective than visible cameras or fences. However, the complexity and noise of seismic signals make it difficult to detect and distinguish overlapping, simultaneous activities such as human intrusion, animal movement, and vehicle rumbling. HyMAD (Hybrid Multi-Activity Detection) addresses this with a deep architecture based on spatio-temporal feature fusion, combining spectral features extracted with SincNet and temporal dependencies modeled by a recurrent neural network (RNN). Self-attention layers strengthen intra-modal representations, and a cross-modal fusion module enables robust multi-label classification of seismic events. Evaluated on real-world field recordings from border surveillance, the method generalizes to complex simultaneous-activity scenarios involving humans, animals, and vehicles, achieves competitive performance, and offers a modular framework for extending seismic-based activity recognition in real-world security applications.

Key Takeaways

  1. Seismic sensing is an effective border-surveillance solution because the buried sensors are hard to detect.
  2. Distinguishing simultaneous activities is the core challenge for seismic sensing.
  3. HyMAD combines several techniques to meet this challenge, including spatio-temporal feature fusion, spectral feature extraction with SincNet, and RNN-modeled temporal dependencies.
  4. Self-attention layers and a cross-modal fusion module strengthen the robustness of multi-label classification.
  5. The method shows good generalization and competitive performance on a real border-surveillance dataset.
  6. HyMAD offers a modular framework that can be extended to seismic activity recognition in real-world security applications.
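The SincNet front end mentioned in the abstract learns parametric band-pass filters; a minimal fixed-cutoff version is the difference of two windowed sinc low-pass kernels. The cutoffs, tap count, and sample rate below are illustrative (in SincNet the cutoffs are learned).

```python
import numpy as np

def sinc_bandpass(f1, f2, length=101, fs=1000.0):
    """SincNet-style band-pass kernel: difference of two Hamming-windowed
    sinc low-pass filters with cutoff frequencies f1 < f2 (in Hz)."""
    t = (np.arange(length) - length // 2) / fs   # tap times in seconds
    lowpass = lambda fc: 2 * fc / fs * np.sinc(2 * fc * t)
    h = (lowpass(f2) - lowpass(f1)) * np.hamming(length)
    return h / np.abs(h).sum()                   # scale-invariant normalization

h = sinc_bandpass(80.0, 120.0)
H = np.abs(np.fft.rfft(h, 2048))                 # magnitude response
freqs = np.fft.rfftfreq(2048, d=1 / 1000.0)
gain = lambda f: H[np.argmin(np.abs(freqs - f))]
```

The response peaks inside the 80-120 Hz band and is strongly attenuated well outside it, which is what makes the two cutoffs a compact, interpretable parameterization.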


Talk, Snap, Complain: Validation-Aware Multimodal Expert Framework for Fine-Grained Customer Grievances

Authors:Rishu Kumar Singh, Navneet Shreya, Sarmistha Das, Apoorva Singh, Sriparna Saha

Existing approaches to complaint analysis largely rely on unimodal, short-form content such as tweets or product reviews. This work advances the field by leveraging multimodal, multi-turn customer support dialogues, where users often share both textual complaints and visual evidence (e.g., screenshots, product photos) to enable fine-grained classification of complaint aspects and severity. We introduce VALOR, a Validation-Aware Learner with Expert Routing, tailored for this multimodal setting. It employs a multi-expert reasoning setup using large-scale generative models with Chain-of-Thought (CoT) prompting for nuanced decision-making. To ensure coherence between modalities, a semantic alignment score is computed and integrated into the final classification through a meta-fusion strategy. In alignment with the United Nations Sustainable Development Goals (UN SDGs), the proposed framework supports SDG 9 (Industry, Innovation and Infrastructure) by advancing AI-driven tools for robust, scalable, and context-aware service infrastructure. Further, by enabling structured analysis of complaint narratives and visual context, it contributes to SDG 12 (Responsible Consumption and Production) by promoting more responsive product design and improved accountability in consumer services. We evaluate VALOR on a curated multimodal complaint dataset annotated with fine-grained aspect and severity labels, showing that it consistently outperforms baseline models, especially in complex complaint scenarios where information is distributed across text and images. This study underscores the value of multimodal interaction and expert validation in practical complaint understanding systems. Resources related to data and codes are available here: https://github.com/sarmistha-D/VALOR


Paper & Project Links

PDF To be published in the Proceedings of the 40th Annual AAAI Conference on Artificial Intelligence (AAAI 2026 Special Track on AI for Social Impact )

Summary

This work moves beyond unimodal, short-form complaint analysis by leveraging multimodal, multi-turn customer support dialogues for fine-grained classification of complaint aspects and severity. The proposed VALOR framework uses a multi-expert reasoning setup with large-scale generative models and Chain-of-Thought prompting for nuanced decisions, and computes a semantic alignment score that is integrated into the final classification through a meta-fusion strategy to keep the modalities coherent. The framework supports UN SDG 9 (Industry, Innovation and Infrastructure) and SDG 12 (Responsible Consumption and Production) by advancing AI-driven tools for robust, scalable, context-aware service infrastructure and by promoting more responsive product design and improved accountability in consumer services. On a curated multimodal complaint dataset with fine-grained aspect and severity labels, VALOR consistently outperforms baseline models.

Key Takeaways

  1. Existing complaint analysis relies mostly on unimodal short-form content such as tweets or product reviews; this work uses multimodal, multi-turn support dialogues for finer-grained classification.
  2. VALOR employs a multi-expert reasoning setup with large generative models for nuanced decision-making.
  3. VALOR computes a semantic alignment score to ensure coherence between modalities.
  4. VALOR supports UN SDG 9 and SDG 12 by advancing AI-driven tools for robust, scalable, and context-aware service infrastructure.
  5. VALOR outperforms baseline models on fine-grained aspect and severity classification.
  6. The gains are largest in complex complaint scenarios where information is distributed across text and images.


SMRC: Aligning Large Language Models with Student Reasoning for Mathematical Error Correction

Authors:Biaojie Zeng, Min Zhang, Juan Zhou, Fengrui Liu, Ruiyang Huang, Xin Lin

Large language models (LLMs) often make reasoning errors when solving mathematical problems, and how to automatically detect and correct these errors has become an important research direction. However, existing approaches \textit{mainly focus on self-correction within the model}, which falls short of the teacher-style correction required in educational settings, \textit{i.e.}, systematically guiding and revising a student’s problem-solving process. To address this gap, we propose \texttt{SMRC} (\textit{\underline{S}tudent \underline{M}athematical \underline{R}easoning \underline{C}orrection}), a novel method that aligns LLMs with student reasoning. Specifically, \texttt{SMRC} formulates student reasoning as a multi-step sequential decision problem and introduces Monte Carlo Tree Search (MCTS) to explore optimal correction paths. To reduce the cost of the annotating process-level rewards, we leverage breadth-first search (BFS) guided by LLMs and final-answer evaluation to generate reward signals, which are then distributed across intermediate reasoning steps via a back-propagation mechanism, enabling fine-grained process supervision. Additionally, we construct a benchmark for high school mathematics, MSEB (Multi-Solution Error Benchmark), consisting of 158 instances that include problem statements, student solutions, and correct reasoning steps. We further propose a dual evaluation protocol centered on \textbf{solution accuracy} and \textbf{correct-step retention}, offering a comprehensive measure of educational applicability. Experiments demonstrate that \texttt{SMRC} significantly outperforms existing methods on two public datasets (ProcessBench and MR-GSM8K) and our MSEB in terms of effectiveness and overall performance. The code and data are available at https://github.com/Mind-Lab-ECNU/SMRC.


Paper & Project Links

PDF 13 pages, 3 figures

Summary

Automatically detecting and correcting the reasoning errors that large language models (LLMs) make on mathematical problems has become an important research direction. Existing approaches, however, focus on self-correction within the model and fall short of the teacher-style correction needed in education, i.e., systematically guiding and revising a student's problem-solving process. SMRC closes this gap by aligning LLMs with student reasoning: it formulates student reasoning as a multi-step sequential decision problem and uses Monte Carlo Tree Search (MCTS) to explore optimal correction paths. To reduce the cost of annotating process-level rewards, LLM-guided breadth-first search (BFS) and final-answer evaluation generate reward signals that are distributed across intermediate reasoning steps via a back-propagation mechanism, enabling fine-grained process supervision. The authors also build MSEB, a high-school mathematics benchmark, and propose a dual evaluation protocol centered on solution accuracy and correct-step retention. Experiments show SMRC significantly outperforms existing methods on public datasets and on MSEB.

Key Takeaways

  1. Large language models (LLMs) make reasoning errors when solving mathematical problems.
  2. Existing approaches focus on self-correction within the model and lack the systematic, teacher-style guidance and revision found in education.
  3. SMRC aligns LLMs with student reasoning by formulating it as a multi-step sequential decision problem.
  4. Monte Carlo Tree Search (MCTS) explores optimal correction paths, while LLM-guided breadth-first search (BFS) reduces annotation cost.
  5. Final-answer evaluation generates reward signals that are back-propagated to supervise intermediate reasoning steps at a fine granularity.
  6. A high-school mathematics benchmark, MSEB, is constructed, along with a dual evaluation protocol centered on solution accuracy and correct-step retention.
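A minimal sketch of distributing a final-answer reward back over intermediate steps: the exponential decay and `gamma` here are illustrative stand-ins, whereas the paper derives process rewards from LLM-guided BFS plus final-answer checks and propagates them through the search tree.

```python
def distribute_reward(num_steps, final_reward, gamma=0.5):
    """Spread a final-answer reward over reasoning steps, decaying
    geometrically from the last step back to the first (toy rule)."""
    return [final_reward * gamma ** (num_steps - 1 - t) for t in range(num_steps)]

step_rewards = distribute_reward(4, 1.0)  # per-step supervision signal
```

Later steps, which are closest to the verified final answer, receive the most credit; earlier steps still receive a nonzero share, which is what enables process-level rather than outcome-only supervision.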


AutoTool: Efficient Tool Selection for Large Language Model Agents

Authors:Jingyi Jia, Qinbin Li

Large Language Model (LLM) agents have emerged as powerful tools for automating complex tasks by leveraging the reasoning and decision-making abilities of LLMs. However, a major bottleneck in current agent frameworks lies in the high inference cost of tool selection, especially in approaches like ReAct that repeatedly invoke the LLM to determine which tool to use at each step. In this work, we propose AutoTool, a novel graph-based framework that bypasses repeated LLM inference by exploiting a key empirical observation: tool usage inertia - the tendency of tool invocations to follow predictable sequential patterns. AutoTool constructs a directed graph from historical agent trajectories, where nodes represent tools and edges capture transition probabilities, effectively modeling the inertia in tool selection. It further integrates parameter-level information to refine tool input generation. By traversing this structured representation, AutoTool efficiently selects tools and their parameters with minimal reliance on LLM inference. Extensive experiments across diverse agent tasks demonstrate that AutoTool reduces inference costs by up to 30% while maintaining competitive task completion rates, offering a practical and scalable enhancement for inference-heavy frameworks. Our work highlights the promise of integrating statistical structure into LLM agent design for greater efficiency without sacrificing performance.


Paper & Project Links

PDF Accepted by AAAI 2026, 18 pages, 11 figures, Code: https://github.com/jiajingyyyyyy/AutoTool

Summary

LLM agents automate complex tasks by leveraging the reasoning and decision-making abilities of LLMs, but a major bottleneck of current frameworks is the high inference cost of tool selection; approaches like ReAct repeatedly invoke the LLM to decide which tool to use at each step. AutoTool is a graph-based framework that bypasses repeated LLM inference by exploiting tool usage inertia, the tendency of tool invocations to follow predictable sequential patterns. It builds a directed graph from historical agent trajectories, with nodes representing tools and edges capturing transition probabilities, and traverses this structured representation to select tools and their parameters with minimal reliance on LLM inference. Experiments show AutoTool cuts inference costs by up to 30% while maintaining competitive task completion rates, a practical and scalable enhancement for inference-heavy frameworks.

Key Takeaways

  1. LLM agents have become important tools for automating complex tasks but suffer from high inference costs.
  2. Methods such as ReAct repeatedly invoke the LLM for tool selection, reducing efficiency.
  3. AutoTool is a graph-based framework that bypasses repeated LLM inference by capturing sequential patterns in tool usage.
  4. AutoTool's directed graph uses nodes for tools and edges for transition probabilities, effectively modeling tool usage inertia.
  5. Traversing this structured representation lets AutoTool efficiently select tools and parameters with little reliance on LLM inference.
  6. Experiments show AutoTool reduces inference costs by up to 30% while maintaining competitive task completion rates.
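The transition-graph idea can be sketched in a few lines. The tool names, the 0.6 confidence threshold, and the LLM fallback (signalled by `None`) are all hypothetical choices for this sketch, not AutoTool's actual implementation.

```python
from collections import Counter, defaultdict

def build_graph(trajectories):
    """Count tool-to-tool transitions from historical agent trajectories."""
    counts = defaultdict(Counter)
    for traj in trajectories:
        for a, b in zip(traj, traj[1:]):
            counts[a][b] += 1
    return counts

def next_tool(counts, current, threshold=0.6):
    """Follow the graph when the empirical transition probability is
    confident enough; otherwise return None to fall back to the LLM."""
    total = sum(counts[current].values())
    if total == 0:
        return None
    tool, c = counts[current].most_common(1)[0]
    return tool if c / total >= threshold else None

trajs = [["search", "read", "summarize"],
         ["search", "read", "answer"],
         ["search", "calc", "answer"]]
graph = build_graph(trajs)
```

With these trajectories, "search" → "read" occurs 2 out of 3 times and clears the threshold, so no LLM call is needed; "read" has a 50/50 split, so the agent would defer to the LLM for that step.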


RepAir: A Framework for Airway Segmentation and Discontinuity Correction in CT

Authors:John M. Oyer, Ali Namvar, Benjamin A. Hoff, Wassim W. Labaki, Ella A. Kazerooni, Charles R. Hatt, Fernando J. Martinez, MeiLan K. Han, Craig J. Galbán, Sundaresh Ram

Accurate airway segmentation from chest computed tomography (CT) scans is essential for quantitative lung analysis, yet manual annotation is impractical and many automated U-Net-based methods yield disconnected components that hinder reliable biomarker extraction. We present RepAir, a three-stage framework for robust 3D airway segmentation that combines an nnU-Net-based network with anatomically informed topology correction. The segmentation network produces an initial airway mask, after which a skeleton-based algorithm identifies potential discontinuities and proposes reconnections. A 1D convolutional classifier then determines which candidate links correspond to true anatomical branches versus false or obstructed paths. We evaluate RepAir on two distinct datasets: ATM’22, comprising annotated CT scans from predominantly healthy subjects and AeroPath, encompassing annotated scans with severe airway pathology. Across both datasets, RepAir outperforms existing 3D U-Net-based approaches such as Bronchinet and NaviAirway on both voxel-level and topological metrics, and produces more complete and anatomically consistent airway trees while maintaining high segmentation accuracy.


Paper & Project Links

PDF 4 pages, 3 figures, 1 table. Preprint submitted to SSIAI 2026 Conference on November 17, 2025

Summary

RepAir is a robust three-stage framework for 3D airway segmentation that combines an nnU-Net-based network with anatomically informed topology correction. It segments airways accurately from chest CT scans, providing a reliable basis for quantitative lung analysis, and produces complete, anatomically consistent airway trees while maintaining high segmentation accuracy. On the ATM'22 and AeroPath datasets, it outperforms other 3D U-Net-based approaches such as Bronchinet and NaviAirway.

Key Takeaways

  1. RepAir is a framework for robust 3D airway segmentation from chest CT scans.
  2. It combines an nnU-Net-based network with anatomically informed topology correction to improve accuracy and completeness.
  3. RepAir works in three stages: an initial airway mask, a skeleton-based algorithm that identifies potential discontinuities and proposes reconnections, and a 1D convolutional classifier that separates true anatomical branches from false or obstructed paths.
  4. Experiments on ATM'22 and AeroPath show RepAir outperforms existing 3D U-Net approaches such as Bronchinet and NaviAirway on both voxel-level and topological metrics.
  5. RepAir produces more complete and anatomically consistent airway trees while maintaining high segmentation accuracy.
  6. The design accounts for airway complexity and topology, making it well suited to quantitative lung analysis and biomarker extraction.


SLAM-AGS: Slide-Label Aware Multi-Task Pretraining Using Adaptive Gradient Surgery in Computational Cytology

Authors:Marco Acerbis, Swarnadip Chatterjee, Christophe Avenel, Joakim Lindblad

Computational cytology faces two major challenges: i) instance-level labels are unreliable and prohibitively costly to obtain, ii) witness rates are extremely low. We propose SLAM-AGS, a Slide-Label-Aware Multitask pretraining framework that jointly optimizes (i) a weakly supervised similarity objective on slide-negative patches and (ii) a self-supervised contrastive objective on slide-positive patches, yielding stronger performance on downstream tasks. To stabilize learning, we apply Adaptive Gradient Surgery to tackle conflicting task gradients and prevent model collapse. We integrate the pretrained encoder into an attention-based Multiple Instance Learning aggregator for bag-level prediction and attention-guided retrieval of the most abnormal instances in a bag. On a publicly available bone-marrow cytology dataset, with simulated witness rates from 10% down to 0.5%, SLAM-AGS improves bag-level F1-Score and Top 400 positive cell retrieval over other pretraining methods, with the largest gains at low witness rates, showing that resolving gradient interference enables stable pretraining and better performance on downstream tasks. To facilitate reproducibility, we share our complete implementation and evaluation framework as open source: https://github.com/Ace95/SLAM-AGS.


Paper & Project Links

PDF 5 pages, 2 figures, Submitted to ISBI2026

Summary

Computational cytology faces two major challenges: instance-level labels are unreliable and prohibitively costly, and witness rates are extremely low. SLAM-AGS is a Slide-Label-Aware Multitask pretraining framework that jointly optimizes a weakly supervised similarity objective on slide-negative patches and a self-supervised contrastive objective on slide-positive patches, improving downstream performance. To stabilize learning, Adaptive Gradient Surgery resolves conflicting task gradients and prevents model collapse. The pretrained encoder is integrated into an attention-based multiple instance learning aggregator for bag-level prediction and attention-guided retrieval of the most abnormal instances in a bag. On a public bone-marrow cytology dataset with simulated witness rates from 10% down to 0.5%, SLAM-AGS improves bag-level F1-score and top-400 positive cell retrieval over other pretraining methods, with the largest gains at low witness rates.

Key Takeaways

  1. Computational cytology struggles with unreliable instance labels and extremely low witness rates.
  2. SLAM-AGS is a pretraining framework that jointly optimizes a weakly supervised similarity objective and a self-supervised contrastive objective.
  3. Adaptive Gradient Surgery stabilizes learning and prevents model collapse.
  4. The pretrained encoder feeds an attention-based MIL aggregator for bag-level prediction and attention-guided retrieval of the most abnormal instances.
  5. On a public dataset, SLAM-AGS excels at low witness rates, particularly on bag-level F1-score and top-400 positive cell retrieval.
  6. Resolving gradient interference enables stable pretraining and better downstream performance.
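The gradient-surgery step can be illustrated with a PCGrad-style projection, a simplified stand-in for the paper's Adaptive Gradient Surgery: when the two task gradients conflict (negative inner product), each loses its component along the other before the update is formed.

```python
import numpy as np

def gradient_surgery(g1, g2):
    """Project away the conflicting components of two task gradients
    (PCGrad-style; the paper's adaptive variant is more elaborate)."""
    if g1 @ g2 >= 0:
        return g1, g2                            # no conflict: leave untouched
    p1 = g1 - (g1 @ g2) / (g2 @ g2) * g2         # g1 minus its part along g2
    p2 = g2 - (g2 @ g1) / (g1 @ g1) * g1         # g2 minus its part along g1
    return p1, p2

weak = np.array([1.0, 0.0])    # e.g. weakly supervised similarity gradient
cont = np.array([-1.0, 1.0])   # e.g. self-supervised contrastive gradient
p1, p2 = gradient_surgery(weak, cont)
update = p1 + p2               # combined, conflict-free update direction
```

After surgery, neither projected gradient pushes against the other original task, which is the property that prevents one objective from collapsing the other during multitask pretraining.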


A Specialized Large Language Model for Clinical Reasoning and Diagnosis in Rare Diseases

Authors:Tao Yang, Dandan Huang, Yunting Lin, Pengfei Wu, Zhikun Wu, Gangyuan Ma, Yulan Lu, Xinran Dong, Dingpeng Li, Junshuang Ge, Zhiyan Zhang, Xuanzhao Huang, Wenyan Nong, Yao Zhou, Hui Tang, Hongxi Yang, Shijie Zhang, Juan Li, Xiaojun Cao, Lin Yang, Xia Gao, Kaishou Xu, Xiaoqiong Gu, Wen Zhang, Huimin Xia, Li Liu, Wenhao Zhou, Mulin Jun Li

Rare diseases affect hundreds of millions worldwide, yet diagnosis often spans years. Conventional pipelines decouple noisy evidence extraction from downstream inferential diagnosis, and general/medical large language models (LLMs) face scarce real world electronic health records (EHRs), stale domain knowledge, and hallucinations. We assemble a large, domain specialized clinical corpus and a clinician validated reasoning set, and develop RareSeek R1 via staged instruction tuning, chain of thought learning, and graph grounded retrieval. Across multicenter EHR narratives and public benchmarks, RareSeek R1 attains state of the art accuracy, robust generalization, and stability under noisy or overlapping phenotypes. Augmented retrieval yields the largest gains when narratives pair with prioritized variants by resolving ambiguity and aligning candidates to mechanisms. Human studies show performance on par with experienced physicians and consistent gains in assistive use. Notably, transparent reasoning highlights decisive non phenotypic evidence (median 23.1%, such as imaging, interventions, functional tests) underpinning many correct diagnoses. This work advances a narrative first, knowledge integrated reasoning paradigm that shortens the diagnostic odyssey and enables auditable, clinically translatable decision support.


Paper & Project Links

PDF 50 pages, 5 figures

Summary: Rare diseases affect hundreds of millions of people worldwide, yet diagnosis often takes years. Conventional pipelines decouple noisy evidence extraction from downstream inferential diagnosis, and general or medical LLMs suffer from scarce real-world EHRs, stale domain knowledge, and hallucinations. The authors assemble a large, domain-specialized clinical corpus and a clinician-validated reasoning set, and develop RareSeek R1 via staged instruction tuning, chain-of-thought learning, and graph-grounded retrieval. The system attains state-of-the-art accuracy on multicenter EHR narratives and public benchmarks, with robust generalization and stability under noisy or overlapping phenotypes. Augmented retrieval yields the largest gains when narratives are paired with prioritized variants, resolving ambiguity and aligning candidates to mechanisms. Human studies show performance on par with experienced physicians and consistent gains in assistive use. Transparent reasoning highlights the decisive non-phenotypic evidence (such as imaging, interventions, and functional tests) underpinning many correct diagnoses, advancing a narrative-first, knowledge-integrated reasoning paradigm that shortens the diagnostic odyssey and enables auditable, clinically translatable decision support.

Key Takeaways

  1. Rare diseases affect hundreds of millions of people, and diagnosis is challenging and slow.
  2. Conventional pipelines decouple evidence extraction from diagnostic inference, and existing models struggle with real-world health records.
  3. RareSeek R1 is built from a large specialized clinical corpus and a clinician-validated reasoning set, achieving state-of-the-art diagnostic accuracy.
  4. RareSeek R1 performs strongly on multicenter EHR narratives and public benchmarks, with robust generalization and stability.
  5. Augmented retrieval yields the largest gains when narratives are paired with prioritized variants, resolving ambiguity and aligning candidates to mechanisms.
  6. Human studies show performance on par with experienced physicians and consistent gains in assistive use.


Enhancing Agentic Autonomous Scientific Discovery with Vision-Language Model Capabilities

Authors:Kahaan Gandhi, Boris Bolliet, Inigo Zubeldia

We show that multi-agent systems guided by vision-language models (VLMs) improve end-to-end autonomous scientific discovery. By treating plots as verifiable checkpoints, a VLM-as-a-judge evaluates figures against dynamically generated domain-specific rubrics, enabling agents to correct their own errors and steer exploratory data analysis in real-time. Case studies in cosmology and astrochemistry demonstrate recovery from faulty reasoning paths and adaptation to new datasets without human intervention. On a 10-task benchmark for data-driven discovery, VLM-augmented systems achieve pass at 1 scores of 0.7-0.8, compared to 0.2-0.3 for code-only and 0.4-0.5 for code-and-text baselines, while also providing auditable reasoning traces that improve interpretability. Code available here: https://github.com/CMBAgents/cmbagent


Paper and Project Links

PDF

Summary: Multi-agent systems guided by vision-language models (VLMs) improve end-to-end autonomous scientific discovery. By treating plots as verifiable checkpoints, a VLM-as-a-judge evaluates figures against dynamically generated, domain-specific rubrics, letting agents correct their own errors and steer exploratory data analysis in real time. Case studies in cosmology and astrochemistry demonstrate recovery from faulty reasoning paths and adaptation to new datasets without human intervention. On a 10-task benchmark for data-driven discovery, VLM-augmented systems reach pass@1 scores of 0.7-0.8, versus 0.2-0.3 for code-only and 0.4-0.5 for code-and-text baselines, while also providing auditable reasoning traces that improve interpretability.

Key Takeaways

  1. Multi-agent systems combined with vision-language models (VLMs) improve the efficiency of autonomous scientific discovery.
  2. A VLM acts as a judge, evaluating figures against dynamically generated, domain-specific rubrics.
  3. Agents can correct their own errors and steer exploratory data analysis in real time.
  4. In cosmology and astrochemistry case studies, the system recovers from faulty reasoning paths and adapts to new datasets without human intervention.
  5. On a 10-task benchmark for data-driven discovery, VLM-augmented systems clearly outperform code-only and code-and-text baselines.
  6. The system provides auditable reasoning traces that improve the interpretability of its decisions.
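The judge-and-revise loop described above can be sketched in a few lines. The `judge`, `revise`, and rubric below are toy stand-ins for the paper's VLM calls and dynamically generated rubrics, not its actual interface:

```python
def run_with_checkpoint(make_figure, judge, revise, rubric, max_rounds=3):
    """Iterate figure generation until the judge accepts it or rounds run out."""
    figure = make_figure()
    for _ in range(max_rounds):
        score, feedback = judge(figure, rubric)
        if score >= rubric["pass_threshold"]:
            break
        figure = revise(figure, feedback)  # self-correction step
    return figure, score

# Toy stand-ins: a "figure" is a dict; the judge checks rubric items.
rubric = {"pass_threshold": 1.0, "required": ["xlabel", "ylabel"]}

def judge(fig, rubric):
    missing = [k for k in rubric["required"] if k not in fig]
    score = 1.0 - len(missing) / len(rubric["required"])
    return score, missing

def revise(fig, missing):
    return {**fig, **{k: k for k in missing}}  # add the missing labels

fig, score = run_with_checkpoint(lambda: {"xlabel": "time"}, judge, revise, rubric)
print(score)  # 1.0: the revised figure satisfies the rubric
```

The same loop shape works with a real VLM: `judge` becomes a model call that returns a rubric score plus textual feedback, and `revise` re-runs the plotting code with that feedback in context.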


Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning

Authors:Ruoyu Qin, Weiran He, Weixiao Huang, Yangkun Zhang, Yikai Zhao, Bo Pang, Xinran Xu, Yingdi Shan, Yongwei Wu, Mingxing Zhang

Reinforcement Learning (RL) has become critical for advancing modern Large Language Models (LLMs), yet existing synchronous RL systems face severe performance bottlenecks. The rollout phase, which dominates end-to-end iteration time, suffers from substantial long-tail latency and poor resource utilization due to inherent workload imbalance. We present Seer, a novel online context learning system that addresses these challenges by exploiting previously overlooked similarities in output lengths and generation patterns among requests sharing the same prompt. Seer introduces three key techniques: divided rollout for dynamic load balancing, context-aware scheduling, and adaptive grouped speculative decoding. Together, these mechanisms substantially reduce long-tail latency and improve resource efficiency during rollout. Evaluations on production-grade RL workloads demonstrate that Seer improves end-to-end rollout throughput by 74% to 97% and reduces long-tail latency by 75% to 93% compared to state-of-the-art synchronous RL systems, significantly accelerating RL training iterations.


Paper and Project Links

PDF 16 pages, 12 figures, 6 tables

Summary

This paper introduces Seer, a novel online context learning system that addresses the performance bottlenecks of reinforcement learning (RL) for large language models (LLMs). Targeting the long-tail latency and poor resource utilization of existing synchronous RL systems, Seer exploits similarities in output lengths and generation patterns among requests sharing the same prompt, and introduces three key techniques: divided rollout for dynamic load balancing, context-aware scheduling, and adaptive grouped speculative decoding. Evaluations on production-grade RL workloads show that Seer substantially improves end-to-end rollout throughput and sharply reduces long-tail latency.

Key Takeaways

  1. Reinforcement learning (RL) is critical for modern large language models (LLMs), but existing synchronous RL systems face performance bottlenecks.
  2. Seer is a novel online context learning system that addresses these problems.
  3. Seer exploits similarities in output lengths and generation patterns among requests sharing the same prompt.
  4. Seer introduces three key techniques: divided rollout for dynamic load balancing, context-aware scheduling, and adaptive grouped speculative decoding.
  5. Seer significantly improves end-to-end rollout throughput and resource utilization.
  6. Compared with state-of-the-art synchronous RL systems, Seer also markedly reduces long-tail latency.
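Seer's divided rollout is its own mechanism, but the underlying load-balancing idea, packing requests onto workers by predicted output length so that no single worker holds the whole long tail, can be illustrated with the classic longest-processing-time heuristic (a simplification, not Seer's actual scheduler; the lengths below are invented):

```python
import heapq

def lpt_schedule(lengths, n_workers):
    """Greedy longest-processing-time packing: assign each request (longest
    first) to the currently least-loaded worker, shrinking tail latency."""
    heap = [(0, w) for w in range(n_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    assignment = {w: [] for w in range(n_workers)}
    for length in sorted(lengths, reverse=True):
        load, w = heapq.heappop(heap)
        assignment[w].append(length)
        heapq.heappush(heap, (load + length, w))
    return assignment

lengths = [900, 850, 120, 100, 90, 80, 70, 60]  # predicted output lengths
plan = lpt_schedule(lengths, 2)
makespan = max(sum(v) for v in plan.values())
print(makespan)  # 1140, close to the balanced ideal of 2270 / 2 = 1135
```

Naively assigning the two long requests to the same worker would give a makespan of 1750; length-aware packing keeps both workers busy until the end.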


Bridging Human and Model Perspectives: A Comparative Analysis of Political Bias Detection in News Media Using Large Language Models

Authors:Shreya Adrita Banik, Niaz Nafi Rahman, Tahsina Moiukh, Farig Sadeque

Detecting political bias in news media is a complex task that requires interpreting subtle linguistic and contextual cues. Although recent advances in Natural Language Processing (NLP) have enabled automatic bias classification, the extent to which large language models (LLMs) align with human judgment still remains relatively underexplored and not yet well understood. This study aims to present a comparative framework for evaluating the detection of political bias across human annotations and multiple LLMs, including GPT, BERT, RoBERTa, and FLAN. We construct a manually annotated dataset of news articles and assess annotation consistency, bias polarity, and inter-model agreement to quantify divergence between human and model perceptions of bias. Experimental results show that among traditional transformer-based models, RoBERTa achieves the highest alignment with human labels, whereas generative models such as GPT demonstrate the strongest overall agreement with human annotations in a zero-shot setting. Among all transformer-based baselines, our fine-tuned RoBERTa model acquired the highest accuracy and the strongest alignment with human-annotated labels. Our findings highlight systematic differences in how humans and LLMs perceive political slant, underscoring the need for hybrid evaluation frameworks that combine human interpretability with model scalability in automated media bias detection.


Paper and Project Links

PDF

Summary
Detecting political bias in news media is a complex task that requires interpreting subtle linguistic and contextual cues. Although advances in natural language processing (NLP) enable automatic bias classification, how well large language models (LLMs) align with human judgment remains underexplored. This study presents a comparative framework for evaluating political bias detection across human annotations and several LLMs, including GPT, BERT, RoBERTa, and FLAN. The authors construct a manually annotated dataset of news articles and assess annotation consistency, bias polarity, and inter-model agreement to quantify the divergence between human and model perceptions of bias. Experiments show that among traditional transformer-based models, RoBERTa aligns best with human labels, while generative models such as GPT show the strongest overall agreement with human annotations in a zero-shot setting. Among all transformer-based baselines, the fine-tuned RoBERTa model achieves the highest accuracy and the strongest alignment with human-annotated labels. The findings highlight systematic differences in how humans and LLMs perceive political slant, underscoring the need for hybrid evaluation frameworks that combine human interpretability with model scalability in automated media bias detection.

Key Takeaways

  1. Political bias detection requires interpreting subtle linguistic and contextual cues.
  2. Large language models (LLMs) can classify bias automatically, but their alignment with human judgment needs further study.
  3. The study evaluates multiple LLMs on political bias detection within a comparative framework.
  4. A manually annotated news-article dataset is built to measure the divergence between human and model perceptions of bias.
  5. Among traditional transformer-based models, RoBERTa aligns best with human labels.
  6. Generative models such as GPT agree strongly with human annotations in a zero-shot setting.
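Human-model agreement of the kind measured here is commonly quantified with Cohen's kappa, which corrects raw agreement for chance. A minimal sketch (the label sequences below are invented for illustration, not from the paper's dataset):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa between two label sequences: observed agreement
    corrected for the agreement expected by chance."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    p_exp = sum(ca[lbl] * cb[lbl] for lbl in set(a) | set(b)) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)

human = ["left", "center", "right", "left", "center", "right"]
model = ["left", "center", "right", "left", "right", "center"]
kappa = cohens_kappa(human, model)
print(round(kappa, 3))  # 0.5: moderate agreement beyond chance
```

Raw agreement here is 4/6, but with three balanced classes a third of that is expected by chance, which is why kappa lands at 0.5 rather than 0.67.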


XAttn-BMD: Multimodal Deep Learning with Cross-Attention for Femoral Neck Bone Mineral Density Estimation

Authors:Yilin Zhang, Leo D. Westbury, Elaine M. Dennison, Nicholas C. Harvey, Nicholas R. Fuggle, Rahman Attar

Poor bone health is a significant public health concern, and low bone mineral density (BMD) leads to an increased fracture risk, a key feature of osteoporosis. We present XAttn-BMD (Cross-Attention BMD), a multimodal deep learning framework that predicts femoral neck BMD from hip X-ray images and structured clinical metadata. It utilizes a novel bidirectional cross-attention mechanism to dynamically integrate image and metadata features for cross-modal mutual reinforcement. A Weighted Smooth L1 loss is tailored to address BMD imbalance and prioritize clinically significant cases. Extensive experiments on the data from the Hertfordshire Cohort Study show that our model outperforms the baseline models in regression generalization and robustness. Ablation studies confirm the effectiveness of both cross-attention fusion and the customized loss function. Experimental results show that the integration of multimodal data via cross-attention outperforms naive feature concatenation without cross-attention, reducing MSE by 16.7%, MAE by 6.03%, and increasing the R2 score by 16.4%, highlighting the effectiveness of the approach for femoral neck BMD estimation. Furthermore, screening performance was evaluated using binary classification at clinically relevant femoral neck BMD thresholds, demonstrating the model’s potential in real-world scenarios.


Paper and Project Links

PDF 11 figures, 10 tables, 38 pages. Submitted to Artificial Intelligence in Medicine (currently with editor)

Summary

This paper presents XAttn-BMD, a multimodal deep learning framework that predicts femoral neck bone mineral density (BMD) from hip X-ray images and structured clinical metadata. It uses a novel bidirectional cross-attention mechanism to dynamically integrate image and metadata features for cross-modal mutual reinforcement. Extensive experiments on data from the Hertfordshire Cohort Study show that the model outperforms baseline models in regression generalization and robustness.

Key Takeaways

  1. XAttn-BMD is a multimodal deep learning framework that predicts femoral neck BMD from hip X-ray images and structured clinical metadata.
  2. The framework uses a bidirectional cross-attention mechanism to dynamically integrate image and metadata features for cross-modal mutual reinforcement.
  3. A Weighted Smooth L1 loss is tailored to address BMD imbalance and prioritize clinically significant cases.
  4. Extensive experiments on data from the Hertfordshire Cohort Study show that XAttn-BMD outperforms the baseline models in regression generalization and robustness.
  5. Ablation studies confirm the effectiveness of both cross-attention fusion and the customized loss function.
  6. Integrating multimodal data via cross-attention outperforms naive feature concatenation, reducing MSE and MAE and increasing the R2 score.
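A weighted Smooth L1 loss of the kind described can be sketched as follows. The abstract does not specify the weighting scheme or the `beta` threshold, so the per-sample weights and values below are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def weighted_smooth_l1(pred, target, weights, beta=1.0):
    """Smooth L1 (Huber-style) loss with per-sample weights, so that
    clinically important cases (e.g. low BMD) contribute more."""
    diff = np.abs(pred - target)
    per_sample = np.where(diff < beta,
                          0.5 * diff ** 2 / beta,   # quadratic near zero
                          diff - 0.5 * beta)        # linear for large errors
    return float(np.average(per_sample, weights=weights))

pred = np.array([0.70, 0.85, 0.60])     # predicted femoral neck BMD (g/cm^2)
target = np.array([0.75, 0.80, 0.90])   # ground-truth BMD
weights = np.array([1.0, 1.0, 3.0])     # up-weight the low-BMD case
loss = weighted_smooth_l1(pred, target, weights)
print(round(loss, 5))  # 0.0275
```

The weighting makes the large error on the up-weighted low-BMD sample dominate the loss, which is the stated goal of prioritizing clinically significant cases.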


Masked IRL: LLM-Guided Reward Disambiguation from Demonstrations and Language

Authors:Minyoung Hwang, Alexandra Forsey-Smerek, Nathaniel Dennler, Andreea Bobu

Robots can adapt to user preferences by learning reward functions from demonstrations, but with limited data, reward models often overfit to spurious correlations and fail to generalize. This happens because demonstrations show robots how to do a task but not what matters for that task, causing the model to focus on irrelevant state details. Natural language can more directly specify what the robot should focus on, and, in principle, disambiguate between many reward functions consistent with the demonstrations. However, existing language-conditioned reward learning methods typically treat instructions as simple conditioning signals, without fully exploiting their potential to resolve ambiguity. Moreover, real instructions are often ambiguous themselves, so naive conditioning is unreliable. Our key insight is that these two input types carry complementary information: demonstrations show how to act, while language specifies what is important. We propose Masked Inverse Reinforcement Learning (Masked IRL), a framework that uses large language models (LLMs) to combine the strengths of both input types. Masked IRL infers state-relevance masks from language instructions and enforces invariance to irrelevant state components. When instructions are ambiguous, it uses LLM reasoning to clarify them in the context of the demonstrations. In simulation and on a real robot, Masked IRL outperforms prior language-conditioned IRL methods by up to 15% while using up to 4.7 times less data, demonstrating improved sample-efficiency, generalization, and robustness to ambiguous language. Project page: https://MIT-CLEAR-Lab.github.io/Masked-IRL and Code: https://github.com/MIT-CLEAR-Lab/Masked-IRL


Paper and Project Links

PDF

Summary
Robots can adapt to user preferences by learning reward functions from demonstrations, but with limited data, reward models often overfit to spurious correlations and fail to generalize. Demonstrations show how to do a task but not what matters for it, so models latch onto irrelevant state details. Natural language can specify what the robot should focus on and, in principle, disambiguate among the many reward functions consistent with the demonstrations; however, existing language-conditioned reward learning methods treat instructions as simple conditioning signals, and real instructions are often ambiguous themselves. The key insight is that the two inputs are complementary: demonstrations show how to act, while language specifies what is important. Masked Inverse Reinforcement Learning (Masked IRL) uses large language models (LLMs) to combine both: it infers state-relevance masks from language instructions, enforces invariance to irrelevant state components, and uses LLM reasoning to clarify ambiguous instructions in the context of the demonstrations. In simulation and on a real robot, Masked IRL outperforms prior language-conditioned IRL methods by up to 15% while using up to 4.7 times less data, demonstrating improved sample efficiency, generalization, and robustness to ambiguous language. Project page: https://MIT-CLEAR-Lab.github.io/Masked-IRL and code: https://github.com/MIT-CLEAR-Lab/Masked-IRL

Key Takeaways:

  1. Robots that learn reward functions from demonstrations face generalization problems, especially when data is limited.
  2. Natural language can state more directly what matters for a task, helping the robot grasp the true intent of the reward function.
  3. Existing language-conditioned reward learning methods do not fully exploit language to resolve ambiguity among reward functions.
  4. Real instructions are often ambiguous themselves, so naive conditioning is unreliable.
  5. Masked IRL combines the strengths of demonstrations and language, using an LLM to infer which state components matter and to clarify ambiguous instructions.
  6. Masked IRL improves sample efficiency, generalization, and robustness to ambiguous language.
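The core idea of a state-relevance mask, making the learned reward invariant to irrelevant state components, can be sketched with a linear reward. In Masked IRL the mask is inferred by an LLM from the instruction; here it is set by hand, and all values are illustrative:

```python
import numpy as np

def masked_reward(state, weights, relevance_mask):
    """Linear reward over state features, with irrelevant components masked
    out so the learned reward is invariant to them."""
    return float(np.dot(weights, state * relevance_mask))

weights = np.array([0.5, -1.0, 2.0])   # learned reward weights
mask = np.array([1.0, 1.0, 0.0])       # language says: dimension 2 is irrelevant
s1 = np.array([1.0, 0.5, 9.0])
s2 = np.array([1.0, 0.5, -3.0])        # differs only in the masked dimension
r1 = masked_reward(s1, weights, mask)
r2 = masked_reward(s2, weights, mask)
print(r1 == r2)  # True: the reward ignores the irrelevant component
```

Without the mask, the spurious weight on dimension 2 would make the two states look very different, which is exactly the overfitting-to-spurious-correlations failure the paper describes.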


O3SLM: Open Weight, Open Data, and Open Vocabulary Sketch-Language Model

Authors:Rishi Gupta, Mukilan Karuppasamy, Shyam Marjit, Aditay Tripathi, Anirban Chakraborty

While Large Vision Language Models (LVLMs) are increasingly deployed in real-world applications, their ability to interpret abstract visual inputs remains limited. Specifically, they struggle to comprehend hand-drawn sketches, a modality that offers an intuitive means of expressing concepts that are difficult to describe textually. We identify the primary bottleneck as the absence of a large-scale dataset that jointly models sketches, photorealistic images, and corresponding natural language instructions. To address this, we present two key contributions: (1) a new, large-scale dataset of image-sketch-instruction triplets designed to facilitate both pretraining and instruction tuning, and (2) O3SLM, an LVLM trained on this dataset. Comprehensive evaluations on multiple sketch-based tasks: (a) object localization, (b) counting, (c) image retrieval i.e., (SBIR and fine-grained SBIR), and (d) visual question answering (VQA); while incorporating the three existing sketch datasets, namely QuickDraw!, Sketchy, and Tu Berlin, along with our generated SketchVCL dataset, show that O3SLM achieves state-of-the-art performance, substantially outperforming existing LVLMs in sketch comprehension and reasoning.


Paper and Project Links

PDF Accepted to AAAI 2026

Summary

Large vision-language models (LVLMs) remain limited at interpreting abstract visual inputs, especially hand-drawn sketches. To address this, the paper makes two key contributions: a new large-scale dataset of image-sketch-instruction triplets for both pretraining and instruction tuning, and O3SLM, an LVLM trained on this dataset. Comprehensive evaluations on several sketch-based tasks show that O3SLM achieves state-of-the-art performance, substantially outperforming existing LVLMs in sketch comprehension and reasoning.

Key Takeaways

  1. LVLMs are limited at interpreting abstract visual inputs, especially hand-drawn sketches.
  2. A key bottleneck is the absence of a large-scale dataset that jointly models sketches, photorealistic images, and corresponding natural language instructions.
  3. One contribution is a new large-scale dataset of image-sketch-instruction triplets for pretraining and instruction tuning.
  4. The paper also introduces O3SLM, an LVLM trained on this dataset.
  5. O3SLM achieves state-of-the-art performance across multiple sketch-based tasks.
  6. O3SLM substantially outperforms existing LVLMs in sketch comprehension and reasoning.
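An image-sketch-instruction triplet of the kind the dataset contains might be modeled as a simple record type. The field names and task labels below are illustrative assumptions, not the paper's actual schema:

```python
from dataclasses import dataclass

@dataclass
class SketchTriplet:
    """One record pairing a photo, a hand-drawn sketch, and an instruction."""
    image_path: str
    sketch_path: str
    instruction: str
    task: str  # e.g. "localization", "counting", "retrieval", "vqa"

triplets = [
    SketchTriplet("img/cat.jpg", "sk/cat.png",
                  "Find the object drawn in the sketch.", "localization"),
    SketchTriplet("img/dogs.jpg", "sk/dog.png",
                  "How many of the sketched animal appear?", "counting"),
]

# Instruction tuning typically samples per task, so group records by task.
by_task = {}
for t in triplets:
    by_task.setdefault(t.task, []).append(t)
print(sorted(by_task))  # ['counting', 'localization']
```

Keeping the task label on each record makes it easy to mix pretraining data (all triplets) with task-balanced instruction-tuning batches.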


DataSage: Multi-agent Collaboration for Insight Discovery with External Knowledge Retrieval, Multi-role Debating, and Multi-path Reasoning

Authors:Xiaochuan Liu, Yuanfeng Song, Xiaoming Yin, Xing Chen

In today’s data-driven era, fully automated end-to-end data analytics, particularly insight discovery, is critical for discovering actionable insights that assist organizations in making effective decisions. With the rapid advancement of large language models (LLMs), LLM-driven agents have emerged as a promising paradigm for automating data analysis and insight discovery. However, existing data insight agents remain limited in several key aspects, often failing to deliver satisfactory results due to: (1) insufficient utilization of domain knowledge, (2) shallow analytical depth, and (3) error-prone code generation during insight generation. To address these issues, we propose DataSage, a novel multi-agent framework that incorporates three innovative features including external knowledge retrieval to enrich the analytical context, a multi-role debating mechanism to simulate diverse analytical perspectives and deepen analytical depth, and multi-path reasoning to improve the accuracy of the generated code and insights. Extensive experiments on InsightBench demonstrate that DataSage consistently outperforms existing data insight agents across all difficulty levels, offering an effective solution for automated data insight discovery.


Paper and Project Links

PDF

Summary

In the data-driven era, fully automated data analytics, especially insight discovery, is critical for finding actionable insights that help organizations make effective decisions. With the rapid advancement of large language models (LLMs), LLM-driven agents have become a promising paradigm for automating data analysis and insight discovery. Existing data insight agents, however, often fall short due to insufficient use of domain knowledge, shallow analytical depth, and error-prone code generation. DataSage is a new multi-agent framework with three innovations: external knowledge retrieval to enrich the analytical context, a multi-role debating mechanism to simulate diverse analytical perspectives and deepen analysis, and multi-path reasoning to improve the accuracy of generated code and insights. Extensive experiments on InsightBench show that DataSage consistently outperforms existing data insight agents across all difficulty levels, offering an effective solution for automated data insight discovery.

Key Takeaways

  1. In the data-driven era, fully automated data analytics is critical for organizational decision making.
  2. Large language models (LLMs) offer a promising path to automating data analysis.
  3. Existing data insight agents have limitations: underused domain knowledge, shallow analysis, and error-prone code generation.
  4. DataSage is a new multi-agent framework that addresses these problems with external knowledge retrieval, a multi-role debating mechanism, and multi-path reasoning.
  5. Experiments on InsightBench show DataSage outperforms existing data insight agents.
  6. DataSage effectively improves the accuracy and depth of data insights.
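A multi-role debating mechanism can be sketched as a loop in which each role answers, sees the other roles' answers, and revises, with a judge aggregating the result. The `respond` and `judge` functions below are toy stand-ins for LLM calls, and the roles and values are invented for illustration:

```python
def debate(question, roles, respond, judge, rounds=2):
    """Minimal multi-role debate: each role answers, then revises after
    seeing the others' answers; a judge picks the final answer."""
    answers = {r: respond(r, question, []) for r in roles}
    for _ in range(rounds - 1):
        answers = {r: respond(r, question, list(answers.values()))
                   for r in roles}
    return judge(question, answers)

# Toy stand-ins: roles propose a number; the judge takes the majority.
def respond(role, question, others):
    base = {"statistician": 4, "domain-expert": 4, "skeptic": 5}[role]
    if others:  # revision step: converge toward the majority view
        base = max(set(others), key=others.count)
    return base

def judge(question, answers):
    vals = list(answers.values())
    return max(set(vals), key=vals.count)

result = debate("how many clusters?",
                ["statistician", "domain-expert", "skeptic"],
                respond, judge)
print(result)  # 4: the skeptic is talked around by the majority
```

With real LLM calls, each role would carry a distinct system prompt (e.g. a statistician versus a domain expert), which is what gives the debate its diversity of perspectives.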


PathMind: A Retrieve-Prioritize-Reason Framework for Knowledge Graph Reasoning with Large Language Models

Authors:Yu Liu, Xixun Lin, Yanmin Shang, Yangxi Li, Shi Wang, Yanan Cao

Knowledge graph reasoning (KGR) is the task of inferring new knowledge by performing logical deductions on knowledge graphs. Recently, large language models (LLMs) have demonstrated remarkable performance in complex reasoning tasks. Despite promising success, current LLM-based KGR methods still face two critical limitations. First, existing methods often extract reasoning paths indiscriminately, without assessing their different importance, which may introduce irrelevant noise that misleads LLMs. Second, while many methods leverage LLMs to dynamically explore potential reasoning paths, they require high retrieval demands and frequent LLM calls. To address these limitations, we propose PathMind, a novel framework designed to enhance faithful and interpretable reasoning by selectively guiding LLMs with important reasoning paths. Specifically, PathMind follows a “Retrieve-Prioritize-Reason” paradigm. First, it retrieves a query subgraph from KG through the retrieval module. Next, it introduces a path prioritization mechanism that identifies important reasoning paths using a semantic-aware path priority function, which simultaneously considers the accumulative cost and the estimated future cost for reaching the target. Finally, PathMind generates accurate and logically consistent responses via a dual-phase training strategy, including task-specific instruction tuning and path-wise preference alignment. Extensive experiments on benchmark datasets demonstrate that PathMind consistently outperforms competitive baselines, particularly on complex reasoning tasks with fewer input tokens, by identifying essential reasoning paths.


Paper and Project Links

PDF AAAI 2026, Long Paper, Oral

Summary

Knowledge graph reasoning (KGR) infers new knowledge by performing logical deductions over knowledge graphs, and large language models (LLMs) show strong performance on complex reasoning tasks. Existing LLM-based KGR methods face two problems: they extract reasoning paths indiscriminately without assessing their importance, which can introduce noise that misleads LLMs, and dynamic path exploration with LLMs requires heavy retrieval and frequent LLM calls. PathMind addresses this with a "Retrieve-Prioritize-Reason" paradigm: it retrieves a query subgraph from the knowledge graph, identifies important reasoning paths via a semantic-aware path priority function that considers both the accumulated cost and the estimated future cost of reaching the target, and generates accurate, logically consistent responses through a dual-phase training strategy of task-specific instruction tuning and path-wise preference alignment. Extensive experiments on benchmark datasets show that PathMind consistently outperforms competitive baselines, particularly on complex reasoning tasks with fewer input tokens, by identifying essential reasoning paths.

Key Takeaways

  1. Knowledge graph reasoning (KGR) infers new knowledge through logical deduction over knowledge graphs.
  2. Large language models (LLMs) perform strongly on complex reasoning tasks.
  3. Current LLM-based KGR methods extract reasoning paths indiscriminately and impose high retrieval demands.
  4. PathMind enhances faithful and interpretable reasoning by selectively guiding LLMs with important reasoning paths.
  5. PathMind follows a "Retrieve-Prioritize-Reason" paradigm, ranking paths by accumulated cost plus estimated future cost of reaching the target.
  6. PathMind's dual-phase training strategy (task-specific instruction tuning and path-wise preference alignment) improves accuracy and logical consistency.
  7. Experiments on benchmark datasets show PathMind consistently outperforms other methods on complex reasoning tasks, especially with fewer input tokens.
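A path priority that sums accumulated cost and estimated future cost has the same shape as A* search. The sketch below illustrates that idea on a toy graph; it is not PathMind's actual priority function, which is semantic-aware and learned over knowledge-graph paths:

```python
import heapq

def best_path(graph, heuristic, start, goal):
    """A*-style search: priority = accumulated cost g + estimated future
    cost h, mirroring a cost-plus-estimate path priority function."""
    frontier = [(heuristic[start], 0, start, [start])]
    seen = set()
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        if node in seen:
            continue
        seen.add(node)
        for nxt, cost in graph.get(node, []):
            if nxt not in seen:
                heapq.heappush(frontier,
                               (g + cost + heuristic[nxt], g + cost,
                                nxt, [*path, nxt]))
    return None, float("inf")

graph = {"A": [("B", 1), ("C", 4)], "B": [("C", 1), ("D", 5)], "C": [("D", 1)]}
heuristic = {"A": 2, "B": 2, "C": 1, "D": 0}  # admissible estimates to D
path, cost = best_path(graph, heuristic, "A", "D")
print(path, cost)  # ['A', 'B', 'C', 'D'] 3
```

The estimated-future-cost term is what lets the search skip the tempting but expensive direct edges (A-C at 4, B-D at 5) without exhaustively expanding every path, which is the same economy PathMind seeks when avoiding frequent LLM calls.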



Author: Kedreamix
Copyright: Unless otherwise noted, all posts on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!