
LLM


⚠️ All summaries below are generated by large language models. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: never rely on these for serious academic work; they are only for an initial screening before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated on 2025-09-28

SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines

Authors:Yizhou Wang, Chen Tang, Han Deng, Jiabei Xiao, Jiaqi Liu, Jianyu Wu, Jun Yao, Pengze Li, Encheng Su, Lintao Wang, Guohang Zhuang, Yuchen Ren, Ben Fei, Ming Hu, Xin Chen, Dongzhan Zhou, Junjun He, Xiangyu Yue, Zhenfei Yin, Jiamin Wu, Qihao Zheng, Yuhao Zhou, Huihui Xu, Chenglong Ma, Yan Lu, Wenlong Zhang, Chunfeng Song, Philip Torr, Shixiang Tang, Xinzhu Ma, Wanli Ouyang, Lei Bai

We present a scientific reasoning foundation model that aligns natural language with heterogeneous scientific representations. The model is pretrained on a 206B-token corpus spanning scientific text, pure sequences, and sequence-text pairs, then aligned via SFT on 40M instructions, annealed cold-start bootstrapping to elicit long-form chain-of-thought, and reinforcement learning with task-specific reward shaping, which instills deliberate scientific reasoning. It supports four capability families, covering up to 103 tasks across workflows: (i) faithful translation between text and scientific formats, (ii) text/knowledge extraction, (iii) property prediction, (iv) property classification, (v) unconditional and conditional sequence generation and design. Compared with specialist systems, our approach broadens instruction coverage, improves cross-domain generalization, and enhances fidelity. We detail data curation and training and show that cross-discipline learning strengthens transfer and downstream reliability. The model, instruct tuning datasets and the evaluation code are open-sourced at https://huggingface.co/SciReason and https://github.com/open-sciencelab/SciReason.

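The model and datasets are released on the Hugging Face org linked above. As a minimal sketch of how one might load and query such a checkpoint with the standard transformers API — the exact repository id under the SciReason org is an assumption here, so check https://huggingface.co/SciReason for the real model names:

```python
# Minimal usage sketch with Hugging Face transformers. The repository id below
# is a placeholder assumption; see https://huggingface.co/SciReason for the
# actual open-sourced checkpoints.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "SciReason/SciReasoner"  # hypothetical id, for illustration only

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")

# Example of capability family (i): faithful translation between natural
# language and a scientific format such as SMILES.
prompt = "Translate the molecule name to SMILES: aspirin"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```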

Paper and project links

PDF technical report

Summary

This work presents a scientific reasoning foundation model that aligns natural language with heterogeneous scientific representations. The model is pretrained on a large corpus spanning scientific text, pure sequences, and sequence-text pairs, aligned via SFT on 40M instructions, bootstrapped with annealed cold-start training to elicit long-form chain-of-thought, and trained with reinforcement learning using task-specific reward shaping, which instills deliberate scientific reasoning. It supports four capability families covering up to 103 tasks: faithful translation between text and scientific formats, text/knowledge extraction, property prediction, property classification, and unconditional/conditional sequence generation and design. Compared with specialist systems, the approach broadens instruction coverage, improves cross-domain generalization, and enhances fidelity. The paper details data curation and training and shows that cross-discipline learning strengthens transfer and downstream reliability. The model and evaluation code are open-sourced on Hugging Face and GitHub.

Key Takeaways

  1. The model aligns natural language with heterogeneous scientific representations.
  2. Pretraining uses a large corpus spanning scientific text, pure sequences, and sequence-text pairs.
  3. SFT-based instruction alignment and cold-start bootstrapping elicit long-form chain-of-thought.
  4. Reinforcement learning with task-specific reward shaping cultivates scientific reasoning ability.
  5. The model supports many tasks, including translation, text/knowledge extraction, property prediction, and classification.
  6. Compared with specialist systems, it broadens instruction coverage and improves generalization and fidelity.

Cool Papers

Click here to view paper screenshots

SAGE: A Realistic Benchmark for Semantic Understanding

Authors:Samarth Goel, Reagan J. Lee, Kannan Ramchandran

As large language models (LLMs) achieve strong performance on traditional benchmarks, there is an urgent need for more challenging evaluation frameworks that probe deeper aspects of semantic understanding. We introduce SAGE (Semantic Alignment & Generalization Evaluation), a rigorous benchmark designed to assess both embedding models and similarity metrics across five categories: Human Preference Alignment, Transformation Robustness, Information Sensitivity, Clustering Performance, and Retrieval Robustness. Unlike existing benchmarks that focus on isolated capabilities, SAGE evaluates semantic understanding through adversarial conditions, noisy transformations, and nuanced human judgment tasks across 30+ datasets. Our comprehensive evaluation of 9 embedding models and classical metrics reveals significant performance gaps, with no single approach excelling across all dimensions. For instance, while state-of-the-art embedding models like OpenAI’s text-embedding-3-large dominate in aligning with human preferences (0.682 vs. 0.591 for the best classical metric), they are significantly outperformed by classical metrics on information sensitivity tasks, where Jaccard Similarity achieves a score of 0.905 compared to the top embedding score of 0.794. SAGE further uncovers critical trade-offs: OpenAI’s text-embedding-3-small achieves the highest clustering performance (0.483) but demonstrates extreme brittleness with the lowest robustness score (0.011). SAGE exposes critical limitations in current semantic understanding capabilities and provides a more realistic assessment of model robustness for real-world deployment.

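The gap SAGE reports between classical lexical metrics and embedding models on information-sensitivity tasks is easy to illustrate in miniature. Below is a small self-contained sketch (not SAGE's evaluation code) contrasting word-level Jaccard similarity with embedding cosine similarity; the sentence-transformers model name in the comment is an assumption for illustration:

```python
# Contrast a classical lexical metric (Jaccard) with embedding cosine
# similarity on a pair that differs in one information-critical token.
# Illustrative sketch only, not SAGE's evaluation code.
import numpy as np

def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity over word sets: |A & B| / |A | B|."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

s1 = "The patient was given 10 mg of the drug."
s2 = "The patient was given 100 mg of the drug."  # tiny edit, big meaning change

# Jaccard is directly sensitive to the single changed token:
print(f"Jaccard: {jaccard_similarity(s1, s2):.3f}")

# An embedding model will often score these as nearly identical, which is the
# information-sensitivity failure mode SAGE measures, e.g. (assumed model name):
# from sentence_transformers import SentenceTransformer
# e1, e2 = SentenceTransformer("all-MiniLM-L6-v2").encode([s1, s2])
# print(f"Cosine: {cosine(e1, e2):.3f}")
```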

Paper and project links

PDF 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling

Summary

While large language models (LLMs) perform strongly on traditional benchmarks, more challenging evaluation frameworks are needed to probe their semantic understanding in depth. SAGE (Semantic Alignment & Generalization Evaluation) is a rigorous benchmark that assesses embedding models and similarity metrics across five categories: human preference alignment, transformation robustness, information sensitivity, clustering performance, and retrieval robustness. SAGE evaluates semantic understanding through adversarial conditions, noisy transformations, and nuanced human-judgment tasks across 30+ datasets. A comprehensive evaluation of 9 embedding models and classical metrics reveals significant performance gaps, with no single approach excelling across all dimensions.

Key Takeaways

  1. LLMs need more challenging evaluation frameworks that probe their semantic understanding in depth.
  2. The SAGE benchmark comprehensively evaluates embedding models and similarity metrics along multiple dimensions.
  3. SAGE assesses semantic understanding via adversarial conditions, noisy transformations, and nuanced human-judgment tasks.
  4. Existing embedding models excel in specific areas but fall clearly short in others.
  5. Classical metrics remain strong on information-sensitivity tasks.
  6. SAGE reveals trade-offs across models: some cluster well yet are extremely brittle in robustness.

Cool Papers

Click here to view paper screenshots

VC-Agent: An Interactive Agent for Customized Video Dataset Collection

Authors:Yidan Zhang, Mutian Xu, Yiming Hao, Kun Zhou, Jiahao Chang, Xiaoqiang Liu, Pengfei Wan, Hongbo Fu, Xiaoguang Han

Facing scaling laws, video data from the internet becomes increasingly important. However, collecting extensive videos that meet specific needs is extremely labor-intensive and time-consuming. In this work, we study the way to expedite this collection process and propose VC-Agent, the first interactive agent that is able to understand users’ queries and feedback, and accordingly retrieve/scale up relevant video clips with minimal user input. Specifically, considering the user interface, our agent defines various user-friendly ways for the user to specify requirements based on textual descriptions and confirmations. As for agent functions, we leverage existing multi-modal large language models to connect the user’s requirements with the video content. More importantly, we propose two novel filtering policies that can be updated when user interaction is continually performed. Finally, we provide a new benchmark for personalized video dataset collection, and carefully conduct the user study to verify our agent’s usage in various real scenarios. Extensive experiments demonstrate the effectiveness and efficiency of our agent for customized video dataset collection. Project page: https://allenyidan.github.io/vcagent_page/.


Paper and project links

PDF Project page: https://allenyidan.github.io/vcagent_page/

Summary
VC-Agent is a new interactive agent designed to expedite the video data collection process. It understands user queries and feedback and retrieves relevant video clips accordingly. It connects user requirements with video content via multi-modal large language models, proposes two filtering policies that keep updating as user interaction continues, and establishes a new benchmark for personalized video dataset collection. Experiments show it is effective and efficient across real scenarios. See the project page for more details: https://allenyidan.github.io/vcagent_page/

Key Insights

  1. VC-Agent is the first interactive agent able to understand user queries and feedback, simplifying video data collection.
  2. The agent retrieves and scales up relevant video clips with minimal user input.
  3. Multi-modal large language models connect user requirements with video content.
  4. VC-Agent provides two filtering policies that update as user interaction continues.
  5. A new benchmark for personalized video dataset collection is established.
  6. A user study verifies the agent's usage across various real scenarios.

Cool Papers

Click here to view paper screenshots

Data-Centric Elastic Pipeline Parallelism for Efficient Long-Context LLM Training

Authors:Shiju Wang, Yujie Wang, Ao Sun, Fangcheng Fu, Zijian Zhu, Bin Cui, Xu Han, Kaisheng Ma

Long context training is crucial for LLM’s context extension. Existing schemes, such as sequence parallelism, incur substantial communication overhead. Pipeline parallelism (PP) reduces this cost, but its effectiveness hinges on partitioning granularity. Batch-level PP dividing input samples exhibits high memory consumption in long-context scenario, whereas token-level PP splitting sequences into slices alleviates memory overhead but may incur hardware under-utilization. This trade-off motivates adaptively selecting PP granularity to match resource and workload characteristics. Moreover, sequence length distribution of the real-world dataset exhibits skewness, posing a challenge on PP’s workload balance and efficient scheduling. Current static PP scheduling methods overlook the variance of sequence length, leading to suboptimal performance. In this paper, we propose Elastic Pipeline Parallelism (EPP) that orchestrates token-level PP and batch-level PP to adapt to resource and workload heterogeneity. We build InfiniPipe, a distributed training system that unleashes the potential of EPP via (1) a resource-aware and workload-balanced sequence processor that splits long sequences and packs short ones; and (2) a co-optimization methodology that jointly optimizes pipeline schedule and gradient checkpointing via a mechanism named stage-aware chunk-level adaptive checkpointing. Comprehensive experiments demonstrate that InfiniPipe achieves a 1.69x speedup over state-of-the-art systems.

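The sequence processor described above splits overlong sequences and packs short ones so every pipeline microbatch carries a comparable token load. A minimal sketch of that idea under a fixed per-microbatch token budget (illustrative only; InfiniPipe's processor is additionally resource- and topology-aware):

```python
# Minimal sketch of workload-balanced sequence processing: slice sequences
# longer than the token budget, then greedily pack the pieces into bins.
# Illustrative only, not InfiniPipe's actual implementation.
from typing import List

def split_and_pack(seq_lens: List[int], budget: int) -> List[List[int]]:
    pieces: List[int] = []
    for n in seq_lens:
        # Token-level splitting: slice long sequences to fit the budget.
        while n > budget:
            pieces.append(budget)
            n -= budget
        if n:
            pieces.append(n)
    # First-fit-decreasing packing of the pieces (batch-level packing).
    bins: List[List[int]] = []
    for p in sorted(pieces, reverse=True):
        for b in bins:
            if sum(b) + p <= budget:
                b.append(p)
                break
        else:
            bins.append([p])
    return bins

# Skewed, real-world-like length distribution: one 1M-token outlier plus many
# short sequences, as in the datasets the paper describes.
print(split_and_pack([1_000_000, 4096, 512, 512, 32_000], budget=65_536))
```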

Paper and project links

PDF

Summary
Long-context training is crucial for extending LLM context, but existing schemes such as sequence parallelism incur substantial communication overhead. Pipeline parallelism (PP) reduces this cost, yet its effectiveness hinges on partitioning granularity: batch-level PP exhibits high memory consumption in long-context scenarios, while token-level PP, which splits sequences into slices, alleviates memory overhead but may under-utilize hardware. This motivates adaptively selecting PP granularity to match resource and workload characteristics. Moreover, the skewed sequence-length distribution of real-world datasets challenges PP workload balance and efficient scheduling, and current static PP schedulers ignore this variance, yielding suboptimal performance. The paper proposes Elastic Pipeline Parallelism (EPP), which orchestrates token-level and batch-level PP to adapt to resource and workload heterogeneity, and builds InfiniPipe, a distributed training system featuring a resource-aware, workload-balanced sequence processor that splits long sequences and packs short ones, plus stage-aware chunk-level adaptive checkpointing that jointly optimizes the pipeline schedule and gradient checkpointing. Experiments show InfiniPipe achieves a 1.69x speedup over state-of-the-art systems.

Key Takeaways

  1. Long-context training is crucial for LLMs, but existing parallel training schemes suffer from communication overhead.
  2. Pipeline parallelism (PP) is an effective way to reduce communication cost, but its effectiveness depends on partitioning granularity.
  3. Batch-level and token-level PP each have pros and cons; the choice should match resource and workload characteristics.
  4. Skewed sequence-length distributions in real-world datasets challenge PP workload balance and scheduling efficiency.
  5. Current static PP scheduling methods ignore sequence-length variance, leading to suboptimal performance.
  6. Elastic Pipeline Parallelism (EPP) combines the two PP strategies to adapt to resource and workload heterogeneity.

Cool Papers

Click here to view paper screenshots

SuperOffload: Unleashing the Power of Large-Scale LLM Training on Superchips

Authors:Xinyu Lian, Masahiro Tanaka, Olatunji Ruwase, Minjia Zhang

The emergence of Superchips represents a significant advancement in next-generation AI hardware. These Superchips employ a tightly coupled heterogeneous architecture that integrates GPU and CPU on the same package, which offers unprecedented computational power. However, there has been scant research investigating how LLM training benefits from this new architecture. In this work, for the first time, we study LLM training solutions based on offloading for Superchips. We observe important differences between Superchips and traditional loosely-coupled GPU-CPU architecture, which necessitate revisiting prevailing assumptions about offloading. Based on that, we present SuperOffload, a Superchip-centric offloading system that simultaneously uses Hopper GPU, Grace CPU, and NVLink-C2C interconnect more efficiently. SuperOffload accomplishes this via a combination of techniques, such as adaptive weight offloading, bucketization repartitioning, Superchip-aware casting, speculative execution, and a highly optimized Adam optimizer for Grace CPUs. Our evaluation of SuperOffload on NVIDIA GH200 demonstrates up to 2.5x throughput improvement compared to state-of-the-art offloading-based systems, enabling training of up to 25B model on a single Superchip while achieving high training throughput. We also extend SuperOffload with ZeRO-style data parallelism and DeepSpeed-Ulysses sequence parallelism, enabling training of 13B model with sequence lengths up to 1 million tokens on 8 GH200 while achieving 55% MFU.


Paper and project links

PDF 16 pages, 15 figures

Summary
The emergence of Superchips marks a major breakthrough in next-generation AI hardware. This work is the first to study offloading-based LLM training solutions for Superchips. Important differences between Superchips and the traditional loosely coupled GPU-CPU architecture require revisiting prevailing assumptions about offloading. The paper therefore presents SuperOffload, a Superchip-centric offloading system that uses the Hopper GPU, Grace CPU, and NVLink-C2C interconnect more efficiently through a combination of techniques. Evaluated on NVIDIA GH200, SuperOffload improves throughput by up to 2.5x over state-of-the-art offloading-based systems, enabling training of models up to 25B parameters on a single Superchip while maintaining high training throughput.

Key Takeaways

  1. Superchips represent a major advance in next-generation AI hardware, with a tightly coupled heterogeneous architecture integrating GPU and CPU.
  2. Little research has investigated how LLM training benefits from Superchips.
  3. This work is the first to study offloading-based LLM training solutions for Superchips.
  4. Superchips differ from the traditional GPU-CPU architecture in important ways, requiring prevailing offloading assumptions to be revisited.
  5. SuperOffload is a Superchip-centric system that exploits the Hopper GPU, Grace CPU, and NVLink-C2C interconnect.
  6. SuperOffload combines techniques such as adaptive weight offloading, bucketization repartitioning, Superchip-aware casting, speculative execution, and a highly optimized Adam optimizer for Grace CPUs.
  7. On NVIDIA GH200, SuperOffload improves throughput by up to 2.5x over state-of-the-art offloading systems and can train larger models on a single Superchip.

Cool Papers

Click here to view paper screenshots

LLMTrace: A Corpus for Classification and Fine-Grained Localization of AI-Written Text

Authors:Irina Tolstykh, Aleksandra Tsybina, Sergey Yakubson, Maksim Kuprashevich

The widespread use of human-like text from Large Language Models (LLMs) necessitates the development of robust detection systems. However, progress is limited by a critical lack of suitable training data; existing datasets are often generated with outdated models, are predominantly in English, and fail to address the increasingly common scenario of mixed human-AI authorship. Crucially, while some datasets address mixed authorship, none provide the character-level annotations required for the precise localization of AI-generated segments within a text. To address these gaps, we introduce LLMTrace, a new large-scale, bilingual (English and Russian) corpus for AI-generated text detection. Constructed using a diverse range of modern proprietary and open-source LLMs, our dataset is designed to support two key tasks: traditional full-text binary classification (human vs. AI) and the novel task of AI-generated interval detection, facilitated by character-level annotations. We believe LLMTrace will serve as a vital resource for training and evaluating the next generation of more nuanced and practical AI detection models. The project page is available at \href{https://sweetdream779.github.io/LLMTrace-info/}{iitolstykh/LLMTrace}.

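Character-level interval annotations are the dataset's distinguishing feature, and they are straightforward to consume. A hypothetical sketch of what such a record might look like and how to expand it into per-character labels for interval-detection training — the field names are assumptions, not LLMTrace's actual schema:

```python
# Hypothetical record with character-level spans marking AI-generated segments.
# Field names are illustrative assumptions, not LLMTrace's actual schema.
import numpy as np

record = {
    "text": "I wrote this part myself. This sentence was generated by an LLM.",
    "ai_spans": [(26, 64)],  # [start, end) character offsets of AI-written text
}

def char_labels(record: dict) -> np.ndarray:
    """Binary per-character labels: 1 = AI-generated, 0 = human-written."""
    labels = np.zeros(len(record["text"]), dtype=np.int8)
    for start, end in record["ai_spans"]:
        labels[start:end] = 1
    return labels

labels = char_labels(record)
print(labels.sum(), "of", len(labels), "characters marked AI-generated")

# Full-text binary classification (human vs. AI) falls out as a special case:
doc_label = int(labels.any())
```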

Paper and project links

PDF

Summary
The widespread use of AI-generated text calls for robust detection systems, but current training data falls short. LLMTrace is a new large-scale bilingual (English and Russian) corpus for detecting AI-generated text. The dataset supports traditional full-text binary classification (human vs. AI) as well as the novel task of AI-generated interval detection, enabled by character-level annotations. LLMTrace should serve as a vital resource for training and evaluating the next generation of more nuanced and practical AI detection models.

Key Takeaways

  1. Text generated by large language models (LLMs) is widely used, demanding more precise AI-text detection systems.
  2. A shortage of suitable training data has limited progress in AI-text detection.
  3. LLMTrace is a large-scale bilingual corpus of English and Russian text that addresses the shortcomings of current datasets.
  4. LLMTrace supports both the traditional full-text binary classification task and the new AI-generated interval detection task.
  5. LLMTrace was constructed using a diverse range of modern proprietary and open-source LLMs.
  6. Character-level annotations enable precise localization of AI-generated text.

Cool Papers

Click here to view paper screenshots

Grounding AI Explanations in Experience: A Reflective Cognitive Architecture for Clinical Decision Support

Authors:Zijian Shao, Haiyang Shen, Mugeng Liu, Gecheng Fu, Yaoqi Guo, Yanfeng Wang, Yun Ma

Effective disease prediction in modern healthcare demands the twin goals of high accuracy and transparent, clinically meaningful explanations. Existing machine learning and large language model (LLM) based approaches often struggle to balance these goals. Many models yield accurate but unclear statistical outputs, while others generate fluent but statistically unsupported narratives, often undermining both the validity of the explanation and the predictive accuracy itself. This shortcoming comes from a shallow interaction with the data, preventing the development of a deep, detailed understanding similar to a human expert’s. We argue that high accuracy and high-quality explanations are not separate objectives but are mutually reinforcing outcomes of a model that develops a deep, direct understanding of the data. To achieve this, we propose the Reflective Cognitive Architecture (RCA), a novel framework that coordinates multiple LLMs to learn from direct experience. RCA features an iterative rule refinement mechanism that improves its logic from prediction errors and a distribution-aware rules check mechanism that bases its reasoning in the dataset’s global statistics. By using predictive accuracy as a signal to drive deeper comprehension, RCA builds a strong internal model of the data. We evaluated RCA on one private and two public datasets against 22 baselines. The results demonstrate that RCA not only achieves state-of-the-art accuracy and robustness with a relative improvement of up to 40% over the baseline but, more importantly, leverages this deep understanding to excel in generating explanations that are clear, logical, evidence-based, and balanced, highlighting its potential for creating genuinely trustworthy clinical decision support systems. The code is available at \https://github.com/ssssszj/RCA.

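The core loop, refining a rule set from prediction errors while checking candidate rules against the dataset's global statistics, can be sketched schematically. The llm_predict and llm_refine_rules helpers below are placeholders for calls to the coordinated LLMs, not the paper's implementation:

```python
# Schematic of RCA-style iterative rule refinement. The two llm_* callables are
# placeholders for the coordinating LLMs, not a real API.
def reflective_refinement(rules, train_set, global_stats,
                          llm_predict, llm_refine_rules, rounds=5):
    for _ in range(rounds):
        # 1) Predict with the current rule set and collect errors.
        errors = [(x, y) for x, y in train_set if llm_predict(rules, x) != y]
        if not errors:
            break
        # 2) Reflect: ask the LLM to revise the rules in light of the errors,
        #    using predictive accuracy as the signal driving comprehension.
        candidate = llm_refine_rules(rules, errors)
        # 3) Distribution-aware rules check: keep only rules consistent with
        #    the dataset's global statistics, grounding explanations in data.
        rules = [r for r in candidate if consistent_with(r, global_stats)]
    return rules

def consistent_with(rule, global_stats) -> bool:
    # Placeholder check; a real system would compare the rule's implied
    # statistics against the dataset's empirical distributions.
    return True
```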

Paper and project links

PDF under review

Summary

Effective disease prediction in modern healthcare demands both high accuracy and transparent, clinically meaningful explanations, goals that existing machine learning and LLM-based approaches struggle to balance: many models yield accurate but opaque statistical outputs, while others produce fluent narratives without statistical support, undermining both explanation validity and predictive accuracy. The paper proposes the Reflective Cognitive Architecture (RCA), a framework that coordinates multiple LLMs to learn from direct experience. RCA features an iterative rule-refinement mechanism that improves its logic from prediction errors and a distribution-aware rule check that grounds reasoning in the dataset's global statistics. Using predictive accuracy as a signal to drive deeper comprehension, RCA builds a strong internal model of the data. Evaluated on one private and two public datasets against 22 baselines, RCA not only achieves state-of-the-art accuracy and robustness, with a relative improvement of up to 40% over the baselines, but also leverages this deep understanding to generate clear, logical, evidence-based, and balanced explanations, highlighting its potential for genuinely trustworthy clinical decision support systems.

Key Insights

  1. Modern clinical disease prediction requires both high accuracy and transparent, clinically meaningful explanations.
  2. Existing machine learning and LLM approaches struggle to balance predictive accuracy with explanation clarity.
  3. The Reflective Cognitive Architecture (RCA) coordinates multiple LLMs to learn from direct experience, improving both prediction and explanation quality.
  4. RCA uses iterative rule refinement and a distribution-aware rule check to improve its logic and ground reasoning in the data.
  5. RCA treats predictive accuracy as a signal for deeper understanding, building a strong internal model of the data.
  6. Across multiple datasets, RCA achieves state-of-the-art accuracy and robustness, with significant gains over baselines.
  7. Its explanations are clear, logical, evidence-based, and balanced, supporting trustworthy clinical decision support systems.

Cool Papers

Click here to view paper screenshots

Instruction-tuned Self-Questioning Framework for Multimodal Reasoning

Authors:You-Won Jang, Yu-Jung Heo, Jaeseok Kim, Minsu Lee, Du-Seong Chang, Byoung-Tak Zhang

The field of vision-language understanding has been actively researched in recent years, thanks to the development of Large Language Models (LLMs). However, it still needs help with problems requiring multi-step reasoning, even for very simple questions. Recent studies adopt LLMs to tackle this problem by iteratively generating sub-questions and answers. However, there are disadvantages such as 1) the fine-grained visual contents of images are not available using LLMs that cannot read visual information, 2) internal mechanisms are inaccessible and difficult to reproduce by using black-box LLMs. To solve these problems, we propose SQ (Self-Questioning)-InstructBLIP, which improves inference performance by generating image-aware informative sub-questions and sub-answers iteratively. SQ-InstructBLIP consists of a Questioner, Answerer, and Reasoner that share the same architecture. Questioner and Answerer generate sub-questions and sub-answers to help infer the main-question, and Reasoner performs reasoning on the main-question considering the generated sub-question information. Our experiments show that the proposed method SQ-InstructBLIP, which uses the generated sub-questions as additional information when solving the VQA task, performs more accurate reasoning than the previous works.

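The self-questioning procedure is easy to express as a loop. Below is a schematic sketch of the Questioner/Answerer/Reasoner interaction; the three callables are placeholders for the instruction-tuned modules, not a real API:

```python
# Schematic of the SQ (Self-Questioning) inference loop. questioner, answerer,
# and reasoner stand in for the three instruction-tuned modules, which in
# SQ-InstructBLIP share the same architecture.
def self_questioning_vqa(image, main_question,
                         questioner, answerer, reasoner, n_rounds=3):
    sub_qa = []  # accumulated (sub-question, sub-answer) pairs
    for _ in range(n_rounds):
        # Questioner proposes an image-aware sub-question that should help
        # answer the main question, conditioned on what has been asked so far.
        sub_q = questioner(image, main_question, sub_qa)
        # Answerer answers the sub-question from the image itself.
        sub_a = answerer(image, sub_q)
        sub_qa.append((sub_q, sub_a))
    # Reasoner answers the main question using the sub-QA pairs as extra context.
    return reasoner(image, main_question, sub_qa)
```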

Paper and project links

PDF This paper was accepted to the “CLVL: 5th Workshop on Closing the Loop Between Vision and Language (ICCV 2023 CLVL workshop).”

Summary
Vision-language understanding has advanced rapidly with the development of LLMs, yet models still struggle with multi-step reasoning, even for very simple questions. Recent work tackles this by using LLMs to iteratively generate sub-questions and answers, but such approaches cannot access fine-grained visual content (the LLMs cannot read visual information) and rely on black-box models whose internal mechanisms are inaccessible and hard to reproduce. SQ-InstructBLIP addresses these issues by iteratively generating image-aware, informative sub-questions and sub-answers. It consists of three modules sharing the same architecture: a Questioner and an Answerer that generate sub-questions and sub-answers to help infer the main question, and a Reasoner that answers the main question while taking the generated sub-question information into account. Experiments show that using the generated sub-questions as additional information yields more accurate reasoning on VQA tasks than previous work.

Key Takeaways

  1. Vision-language models still face challenges on problems requiring multi-step reasoning.
  2. Existing work addresses this by iteratively generating sub-questions and answers with LLMs.
  3. SQ-InstructBLIP improves inference by iteratively generating image-aware, informative sub-questions and sub-answers.
  4. It consists of a Questioner, an Answerer, and a Reasoner that share the same architecture.
  5. The sub-questions and sub-answers help infer the main question, and the Reasoner conditions its reasoning on them.
  6. Experiments show SQ-InstructBLIP reasons more accurately on VQA tasks than prior methods.

Cool Papers

Click here to view paper screenshots

Tree Search for LLM Agent Reinforcement Learning

Authors:Yuxiang Ji, Ziyu Ma, Yong Wang, Guanhua Chen, Xiangxiang Chu, Liaoni Wu

Recent advances in reinforcement learning (RL) have significantly enhanced the agentic capabilities of large language models (LLMs). In long-term and multi-turn agent tasks, existing approaches driven solely by outcome rewards often suffer from the problem of sparse supervision. To address the challenge, we propose Tree-based Group Relative Policy Optimization (Tree-GRPO), a grouped agent RL method based on tree search, where each tree node represents the complete agent interaction step. By sharing common prefixes, the tree search sampling increases the number of rollouts achievable within a fixed budget of tokens or tool calls. Moreover, we find that the tree-structured trajectory naturally allows the construction of step-wise process supervised signals even using only the outcome reward. Based on this, Tree-GRPO estimates the grouped relative advantages both on intra-tree and inter-tree levels. Through theoretical analysis, we demonstrate that the objective of intra-tree level group relative policy optimization is equivalent to that of step-level direct preference learning. Experiments across 11 datasets and 3 types of QA tasks demonstrate the superiority of the proposed tree-based RL over the chain-based RL method.

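In GRPO-style training, a rollout's advantage is its reward standardized within a group; tree search makes the grouping explicit, since rollouts branching from a shared prefix form intra-tree groups. A minimal sketch of that advantage computation (illustrative; the paper's exact estimator may differ):

```python
# Minimal sketch of grouped relative advantages at the intra-tree level:
# rollouts sharing a common prefix (siblings in the search tree) form a group,
# and each rollout's advantage is its reward standardized within that group.
# Illustrative only; Tree-GRPO's exact estimator may differ.
import numpy as np

def grouped_relative_advantages(rewards_by_group: dict) -> dict:
    advantages = {}
    for group, rewards in rewards_by_group.items():
        r = np.asarray(rewards, dtype=np.float64)
        advantages[group] = (r - r.mean()) / (r.std() + 1e-8)
    return advantages

# Two subtrees; each key is a shared prefix, values are outcome rewards of the
# rollouts branching from it. Even with only outcome rewards, differences among
# siblings act as a step-wise process signal for the branching step.
tree = {
    "prefix_A": [1.0, 0.0, 0.0],  # branch outcomes after prefix A
    "prefix_B": [1.0, 1.0, 0.0],
}
print(grouped_relative_advantages(tree))
```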

Paper and project links

PDF

Summary
Recent advances in reinforcement learning have significantly enhanced the agentic capabilities of LLMs, but in long-horizon, multi-turn agent tasks, methods driven solely by outcome rewards suffer from sparse supervision. Tree-GRPO is a grouped agent RL method based on tree search in which each tree node represents a complete agent interaction step. By sharing common prefixes, tree-search sampling increases the number of rollouts achievable within a fixed budget of tokens or tool calls. The tree-structured trajectories also naturally allow step-wise process supervision signals to be constructed from the outcome reward alone, and Tree-GRPO estimates grouped relative advantages at both the intra-tree and inter-tree levels. Theoretical analysis shows that the intra-tree group relative policy optimization objective is equivalent to step-level direct preference learning. Experiments on 11 datasets and 3 types of QA tasks demonstrate that the tree-based RL outperforms chain-based RL methods.

Key Insights

  1. Tree-GRPO is a tree-search-based grouped agent RL method that addresses sparse supervision in long-horizon, multi-turn tasks.
  2. Each tree node represents a complete agent interaction step; sharing common prefixes increases the rollouts achievable within a fixed budget.
  3. The tree structure naturally supports constructing step-wise process supervision signals from outcome rewards alone.
  4. Tree-GRPO estimates grouped relative advantages at both the intra-tree and inter-tree levels.
  5. Theoretical analysis shows the intra-tree objective is equivalent to step-level direct preference learning.
  6. Across 11 datasets and 3 types of QA tasks, tree-based RL outperforms chain-based RL methods.

Cool Papers

Click here to view paper screenshots

Query-Centric Graph Retrieval Augmented Generation

Authors:Yaxiong Wu, Jianyuan Bo, Yongyue Zhang, Sheng Liang, Yong Liu

Graph-based retrieval-augmented generation (RAG) enriches large language models (LLMs) with external knowledge for long-context understanding and multi-hop reasoning, but existing methods face a granularity dilemma: fine-grained entity-level graphs incur high token costs and lose context, while coarse document-level graphs fail to capture nuanced relations. We introduce QCG-RAG, a query-centric graph RAG framework that enables query-granular indexing and multi-hop chunk retrieval. Our query-centric approach leverages Doc2Query and Doc2Query-- to construct query-centric graphs with controllable granularity, improving graph quality and interpretability. A tailored multi-hop retrieval mechanism then selects relevant chunks via the generated queries. Experiments on LiHuaWorld and MultiHop-RAG show that QCG-RAG consistently outperforms prior chunk-based and graph-based RAG methods in question answering accuracy, establishing a new paradigm for multi-hop reasoning.

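A toy sketch of the query-centric idea: generate queries per chunk (Doc2Query-style), link chunks through shared queries, and retrieve multi-hop evidence by walking the resulting graph. The generate_queries callable is a placeholder; this is not QCG-RAG's implementation:

```python
# Toy sketch of query-centric graph indexing and multi-hop chunk retrieval.
# generate_queries stands in for a Doc2Query-style model; illustrative only.
from collections import defaultdict

def build_query_centric_graph(chunks, generate_queries):
    chunk_queries = {cid: set(generate_queries(text)) for cid, text in chunks.items()}
    # Link chunks whose generated queries overlap; queries act as the pivots.
    edges = defaultdict(set)
    ids = list(chunk_queries)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if chunk_queries[a] & chunk_queries[b]:
                edges[a].add(b)
                edges[b].add(a)
    return chunk_queries, edges

def multi_hop_retrieve(user_query, chunk_queries, edges, hops=2):
    # Seed with chunks whose generated queries share terms with the user query,
    # then expand along graph edges to collect multi-hop evidence.
    terms = set(user_query.lower().split())
    frontier = {cid for cid, qs in chunk_queries.items()
                if any(terms & set(q.lower().split()) for q in qs)}
    retrieved = set(frontier)
    for _ in range(hops):
        frontier = {nb for cid in frontier for nb in edges[cid]} - retrieved
        retrieved |= frontier
    return retrieved
```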

Paper and project links

PDF 25 pages, 6 figures, 1 table

Summary

Graph-based retrieval-augmented generation (RAG) enriches LLMs with external knowledge for long-context understanding and multi-hop reasoning, but existing methods face a granularity dilemma: fine-grained entity-level graphs incur high token costs and lose context, while coarse document-level graphs fail to capture nuanced relations. QCG-RAG is a query-centric graph RAG framework that enables query-granular indexing and multi-hop chunk retrieval. It uses Doc2Query and Doc2Query-- to construct query-centric graphs with controllable granularity, improving graph quality and interpretability, and a tailored multi-hop retrieval mechanism selects relevant chunks via the generated queries. Experiments on LiHuaWorld and MultiHop-RAG show that QCG-RAG consistently outperforms prior chunk-based and graph-based RAG methods in question-answering accuracy, laying the groundwork for a new paradigm of multi-hop reasoning.

Key Takeaways

  1. QCG-RAG resolves the granularity dilemma faced by existing graph-based RAG methods.
  2. It enables query-granular indexing and multi-hop chunk retrieval, improving retrieval accuracy.
  3. Query-centric graphs are built with Doc2Query and Doc2Query--, giving controllable granularity and better graph quality and interpretability.
  4. A tailored multi-hop retrieval mechanism selects relevant chunks via the generated queries.
  5. On LiHuaWorld and MultiHop-RAG, QCG-RAG achieves higher question-answering accuracy than prior RAG methods.
  6. QCG-RAG establishes a new paradigm for multi-hop reasoning.

Cool Papers

Click here to view paper screenshots

Go With The Flow: Churn-Tolerant Decentralized Training of Large Language Models

Authors:Nikolay Blagoev, Bart Cox, Jérémie Decouchant, Lydia Y. Chen

Motivated by the emergence of large language models (LLMs) and the importance of democratizing their training, we propose GWTF, the first crash tolerant practical decentralized training framework for LLMs. Differently from existing distributed and federated training frameworks, GWTF enables the efficient collaborative training of a LLM on heterogeneous clients that volunteer their resources. In addition, GWTF addresses node churn, i.e., clients joining or leaving the system at any time, and network instabilities, i.e., network links becoming unstable or unreliable. The core of GWTF is a novel decentralized flow algorithm that finds the most effective routing that maximizes the number of microbatches trained with the lowest possible delay. We extensively evaluate GWTF on GPT-like and LLaMa-like models and compare it against the prior art. Our results indicate that GWTF reduces the training time by up to 45% in realistic and challenging scenarios that involve heterogeneous client nodes distributed over 10 different geographic locations with a high node churn rate.


Paper and project links

PDF

Summary

Motivated by the emergence of LLMs and the importance of democratizing their training, GWTF is the first crash-tolerant, practical decentralized training framework for LLMs. Unlike existing distributed and federated training frameworks, GWTF enables efficient collaborative training of an LLM on heterogeneous clients that volunteer their resources, and it handles node churn (clients joining or leaving at any time) and network instability. Its core is a novel decentralized flow algorithm that finds the most effective routing, maximizing the number of microbatches trained with the lowest possible delay. Evaluated on GPT-like and LLaMa-like models, GWTF reduces training time by up to 45% in realistic, challenging scenarios with heterogeneous client nodes distributed over 10 geographic locations and a high node churn rate.

Key Takeaways

  1. GWTF is the first crash-tolerant, practical decentralized training framework for LLMs.
  2. It enables efficient collaborative LLM training on heterogeneous clients.
  3. It tolerates node churn and network instability.
  4. Its core is a novel decentralized flow algorithm that finds the most effective routing and minimizes training delay.
  5. In realistic, challenging scenarios with heterogeneous clients across multiple geographic locations and high node churn, GWTF cuts training time by up to 45%.
  6. GWTF was evaluated extensively on GPT-like and LLaMa-like models.

Cool Papers

Click here to view paper screenshots

Dual-Path Phishing Detection: Integrating Transformer-Based NLP with Structural URL Analysis

Authors:Ibrahim Altan, Abdulla Bachir, Yousuf Parbhulkar, Abdul Muksith Rizvi, Moshiur Farazi

Phishing emails pose a persistent and increasingly sophisticated threat, undermining email security through deceptive tactics designed to exploit both semantic and structural vulnerabilities. Traditional detection methods, often based on isolated analysis of email content or embedded URLs, fail to comprehensively address these evolving attacks. In this paper, we propose a dual-path phishing detection framework that integrates transformer-based natural language processing (NLP) with classical machine learning to jointly analyze email text and embedded URLs. Our approach leverages the complementary strengths of semantic analysis using fine-tuned transformer architectures (e.g., DistilBERT) and structural link analysis via character-level TF-IDF vectorization paired with classical classifiers (e.g., Random Forest). Empirical evaluation on representative email and URL datasets demonstrates that this combined approach significantly improves detection accuracy. Specifically, the DistilBERT model achieves a near-optimal balance between accuracy and computational efficiency for textual phishing detection, while Random Forest notably outperforms other classical classifiers in identifying malicious URLs. The modular design allows flexibility for standalone deployment or ensemble integration, facilitating real-world adoption. Collectively, our results highlight the efficacy and practical value of this dual-path approach, establishing a scalable, accurate, and interpretable solution capable of enhancing email security against contemporary phishing threats.

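The structural URL path of the framework, character-level TF-IDF features feeding a classical classifier, maps directly onto standard scikit-learn components. A minimal sketch on toy data (the URLs and labels below are invented placeholders):

```python
# Minimal sketch of the structural URL path: character-level TF-IDF features
# with a Random Forest classifier, mirroring the paper's dual-path design.
# The data here is a toy placeholder; real training needs a labeled URL corpus.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

urls = [
    "https://accounts.example.com/login",          # benign
    "http://examp1e-login.verify-account.xyz/",    # phishing-style
    "https://mail.example.org/inbox",              # benign
    "http://secure-paypa1.com.evil.tld/update",    # phishing-style
]
labels = [0, 1, 0, 1]  # 0 = legitimate, 1 = phishing (toy labels)

url_clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),  # char-level n-grams
    RandomForestClassifier(n_estimators=200, random_state=0),
)
url_clf.fit(urls, labels)
print(url_clf.predict(["http://login-examp1e.account-check.top/"]))

# The textual path would fine-tune a transformer (e.g. DistilBERT) on email
# bodies; the two paths can then run standalone or as an ensemble.
```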

Paper and project links

PDF Paper accepted for presentation at the ACS/IEEE 22nd International Conference on Computer Systems and Applications (AICCSA 2025)

Summary

Phishing emails pose a persistent and increasingly sophisticated threat, exploiting both semantic and structural vulnerabilities to undermine email security. Traditional detection methods, limited to isolated analysis of email content or embedded URLs, struggle against these evolving attacks. This paper proposes a dual-path phishing detection framework that integrates transformer-based NLP with classical machine learning to jointly analyze email text and embedded URLs. It combines semantic analysis via fine-tuned transformer architectures (e.g., DistilBERT) with structural link analysis via character-level TF-IDF vectorization paired with classical classifiers (e.g., Random Forest). Empirical evaluation on representative email and URL datasets shows that the combined approach significantly improves detection accuracy: DistilBERT achieves a near-optimal balance of accuracy and computational efficiency for textual phishing detection, while Random Forest notably outperforms other classical classifiers on malicious URLs. The modular design supports standalone deployment or ensemble integration, and overall the dual-path approach provides a scalable, accurate, and interpretable solution for strengthening email security against contemporary phishing threats.

Key Takeaways

  1. Phishing email threats persist and keep evolving, undermining email security through semantic and structural vulnerabilities.
  2. Traditional detection methods analyze content or links in isolation and struggle against current threats.
  3. The proposed dual-path framework integrates transformer-based NLP with classical machine learning.
  4. It combines semantic analysis and structural link analysis, using a DistilBERT model and a Random Forest classifier for efficient detection.
  5. DistilBERT balances accuracy and computational efficiency; Random Forest excels at identifying malicious URLs.
  6. The modular design supports standalone deployment or ensemble integration, giving it practical value.

Cool Papers

Click here to view paper screenshots

Hierarchical Resolution Transformers: A Wavelet-Inspired Architecture for Multi-Scale Language Understanding

Authors:Ayan Sar, Sampurna Roy, Kanav Gupta, Anurag Kaushish, Tanupriya Choudhury, Abhijit Kumar

Transformer architectures have achieved state-of-the-art performance across natural language tasks, yet they fundamentally misrepresent the hierarchical nature of human language by processing text as flat token sequences. This results in quadratic computational cost, weak compositional generalization, and inadequate discourse-level modeling. We propose Hierarchical Resolution Transformer (HRT), a novel wavelet-inspired neural architecture that processes language simultaneously across multiple resolutions, from characters to discourse-level units. HRT constructs a multi-resolution attention, enabling bottom-up composition and top-down contextualization. By employing exponential sequence reduction across scales, HRT achieves O(nlogn) complexity, offering significant efficiency improvements over standard transformers. We evaluated HRT on a diverse suite of benchmarks, including GLUE, SuperGLUE, Long Range Arena, and WikiText-103, and results demonstrated that HRT outperforms standard transformer baselines by an average of +3.8% on GLUE, +4.5% on SuperGLUE, and +6.1% on Long Range Arena, while reducing memory usage by 42% and inference latency by 37% compared to BERT and GPT style models of similar parameter count. Ablation studies confirm the effectiveness of cross-resolution attention and scale-specialized modules, showing that each contributes independently to both efficiency and accuracy. Our findings establish HRT as the first architecture to align computational structure with the hierarchical organization of human language, demonstrating that multi-scale, wavelet-inspired processing yields both theoretical efficiency gains and practical improvements in language understanding.

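The O(nlogn) claim follows from exponential sequence reduction: halving the sequence at each scale yields about log n levels, and attention restricted to fixed-size windows costs O(n) per level. A minimal sketch of the wavelet-like resolution pyramid (illustrative, not the HRT implementation):

```python
# Minimal sketch of exponential sequence reduction: build a pyramid of
# representations, halving the length at each scale (characters -> words ->
# ... -> discourse-level units). Illustrative only, not the HRT implementation.
import torch
import torch.nn.functional as F

def resolution_pyramid(x: torch.Tensor):
    """x: (batch, seq_len, dim). Returns roughly log2(seq_len)+1 levels."""
    levels = [x]
    while levels[-1].shape[1] > 1:
        # Average-pool pairs of adjacent positions: one octave coarser,
        # analogous to a single level of a Haar wavelet decomposition.
        coarser = F.avg_pool1d(
            levels[-1].transpose(1, 2), kernel_size=2
        ).transpose(1, 2)
        levels.append(coarser)
    return levels

x = torch.randn(2, 1024, 64)
pyramid = resolution_pyramid(x)
print([lvl.shape[1] for lvl in pyramid])  # 1024, 512, 256, ..., 1

# With attention restricted to fixed-size windows at each level, the total
# cost is O(n) work per level over about log n levels, i.e. O(n log n)
# instead of the O(n^2) of flat full attention.
```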

Paper and project links

PDF Submitted to the IEEE International Conference on Big Data 2025

Summary

Transformer architectures achieve state-of-the-art performance on natural language tasks, yet by processing text as flat token sequences they misrepresent the hierarchical nature of human language, incurring quadratic computational cost, weak compositional generalization, and inadequate discourse-level modeling. The Hierarchical Resolution Transformer (HRT) is a wavelet-inspired neural architecture that processes language simultaneously across multiple resolutions, from characters to discourse-level units, with multi-resolution attention enabling bottom-up composition and top-down contextualization. Through exponential sequence reduction across scales and scale-specialized modules, HRT achieves O(nlogn) complexity, improving both efficiency and accuracy over standard transformers. Across benchmarks including GLUE, SuperGLUE, Long Range Arena, and WikiText-103, HRT delivers superior performance.

Key Takeaways

  1. Transformers perform strongly on language tasks but ignore the hierarchical structure of human language.
  2. The Hierarchical Resolution Transformer (HRT) is a wavelet-inspired architecture that processes language at multiple resolutions simultaneously.
  3. Multi-resolution attention and scale-specialized modules give HRT O(nlogn) complexity, improving both efficiency and accuracy.
  4. HRT performs strongly across benchmarks including GLUE, SuperGLUE, Long Range Arena, and WikiText-103.
  5. Compared with standard transformers, HRT reduces memory usage (by 42%) and inference latency (by 37%).
  6. Ablation studies confirm the effectiveness of cross-resolution attention and scale-specialized modules.

Cool Papers

Click here to view paper screenshots

Causal Understanding by LLMs: The Role of Uncertainty

Authors:Oscar Lithgow-Serrano, Vani Kanjirangat, Alessandro Antonucci

Recent papers show LLMs achieve near-random accuracy in causal relation classification, raising questions about whether such failures arise from limited pretraining exposure or deeper representational gaps. We investigate this under uncertainty-based evaluation, testing whether pretraining exposure to causal examples improves causal understanding, using >18K PubMed sentences – half from The Pile corpus, half post-2024 – across seven models (Pythia-1.4B/7B/12B, GPT-J-6B, Dolly-7B/12B, Qwen-7B). We analyze model behavior through: (i) causal classification, where the model identifies causal relationships in text, and (ii) verbatim memorization probing, where we assess whether the model prefers previously seen causal statements over their paraphrases. Models perform four-way classification (direct/conditional/correlational/no-relationship) and select between originals and their generated paraphrases. Results show almost identical accuracy on seen/unseen sentences (p > 0.05), no memorization bias (24.8% original selection), and output distribution over the possible options is almost flat, with entropic values near the maximum (1.35/1.39), confirming random guessing. Instruction-tuned models show severe miscalibration (Qwen: > 95% confidence, 32.8% accuracy, ECE=0.49). Conditional relations induce highest entropy (+11% vs. direct). These findings suggest that failures in causal understanding arise from the lack of structured causal representation, rather than insufficient exposure to causal examples during pretraining.

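The two diagnostics quoted above, the entropy of the four-way output distribution and the expected calibration error (ECE), are quick to compute. A minimal sketch with made-up numbers for illustration:

```python
# Minimal sketch of the two diagnostics used above: Shannon entropy of the
# 4-way output distribution (direct/conditional/correlational/no-relationship)
# and expected calibration error (ECE). All numbers here are illustrative.
import numpy as np

def entropy(p: np.ndarray) -> float:
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())  # nats; max for 4 classes = ln 4 ~ 1.386

def ece(confidences, correct, n_bins=10) -> float:
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    total, err = len(confidences), 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # |accuracy - mean confidence|, weighted by the bin population
            err += mask.sum() / total * abs(
                correct[mask].mean() - confidences[mask].mean()
            )
    return float(err)

# A near-flat 4-way distribution has entropy close to the ln(4) ceiling,
# which is the "random guessing" signature reported in the paper:
print(entropy(np.array([0.28, 0.26, 0.24, 0.22])))

# Miscalibration: high confidence paired with low accuracy (toy values):
print(ece(confidences=[0.97, 0.95, 0.96, 0.98], correct=[1, 0, 0, 0]))
```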

Paper and project links

PDF Accepted at the second UncertaiNLP workshop at EMNLP 2025

Summary

Recent papers show that LLMs achieve near-random accuracy in causal relation classification, raising the question of whether such failures arise from limited pretraining exposure or deeper representational gaps. Using uncertainty-based evaluation, this work tests whether pretraining exposure to causal examples improves causal understanding. The results show nearly identical accuracy on seen and unseen sentences, no memorization bias, and near-flat output distributions with entropy close to the maximum, indicating random guessing; instruction-tuned models are also severely miscalibrated. These findings suggest that failures in causal understanding stem from the lack of structured causal representations rather than insufficient exposure to causal examples during pretraining.

Key Takeaways

  1. LLMs perform close to random guessing in causal relation classification, prompting questions about the cause of the failure.
  2. The study uses uncertainty-based evaluation to test whether pretraining exposure to causal examples affects causal understanding.
  3. Accuracy is nearly identical on seen and unseen sentences, so performance is not driven by pretraining familiarity.
  4. The absence of memorization bias shows the models do not prefer previously seen causal statements over their paraphrases.
  5. Near-flat output distributions with near-maximal entropy indicate random behavior and no clear decision boundary.
  6. Severe miscalibration in instruction-tuned models (e.g., >95% confidence at 32.8% accuracy) is another key finding.

Cool Papers

Click here to view paper screenshots

SINAI at eRisk@CLEF 2025: Transformer-Based and Conversational Strategies for Depression Detection

Authors:Alba Maria Marmol-Romero, Manuel Garcia-Vega, Miguel Angel Garcia-Cumbreras, Arturo Montejo-Raez

This paper describes the participation of the SINAI-UJA team in the eRisk@CLEF 2025 lab. Specifically, we addressed two of the proposed tasks: (i) Task 2: Contextualized Early Detection of Depression, and (ii) Pilot Task: Conversational Depression Detection via LLMs. Our approach for Task 2 combines an extensive preprocessing pipeline with the use of several transformer-based models, such as RoBERTa Base or MentalRoBERTA Large, to capture the contextual and sequential nature of multi-user conversations. For the Pilot Task, we designed a set of conversational strategies to interact with LLM-powered personas, focusing on maximizing information gain within a limited number of dialogue turns. In Task 2, our system ranked 8th out of 12 participating teams based on F1 score. However, a deeper analysis revealed that our models were among the fastest in issuing early predictions, which is a critical factor in real-world deployment scenarios. This highlights the trade-off between early detection and classification accuracy, suggesting potential avenues for optimizing both jointly in future work. In the Pilot Task, we achieved 1st place out of 5 teams, obtaining the best overall performance across all evaluation metrics: DCHR, ADODL and ASHR. Our success in this task demonstrates the effectiveness of structured conversational design when combined with powerful language models, reinforcing the feasibility of deploying LLMs in sensitive mental health assessment contexts.


Paper and project links

PDF 16 pages, 10 figures, 8 tables. CLEF (Working Notes). 2025

Summary

The SINAI-UJA team participated in two tasks of the eRisk@CLEF 2025 lab: Task 2, contextualized early detection of depression, and a pilot task on conversational depression detection via LLMs. For Task 2 they combined an extensive preprocessing pipeline with transformer-based models such as RoBERTa Base and MentalRoBERTa Large to capture the contextual and sequential nature of multi-user conversations. The system ranked 8th of 12 teams by F1 score, but a deeper analysis showed their models were among the fastest at issuing early predictions, a critical factor for real-world deployment. In the pilot task the team ranked 1st of 5, achieving the best overall performance on all evaluation metrics (DCHR, ADODL, and ASHR), demonstrating the effectiveness of structured conversational design combined with powerful language models in sensitive mental-health assessment contexts.

Key Takeaways

  • The team participated in two eRisk@CLEF 2025 tasks: contextualized early depression detection and a conversational depression detection pilot task.
  • For Task 2, an extensive preprocessing pipeline was combined with transformer-based models to capture the contextual and sequential features of conversations.
  • The system ranked 8th of 12 teams by F1 in Task 2, yet its models were among the fastest at issuing early predictions.
  • In the pilot task, the team took 1st place, with the best results on all evaluation metrics.

Cool Papers

Click here to view paper screenshots

BurstEngine: an Efficient Distributed Framework for Training Transformers on Extremely Long Sequences of over 1M Tokens

Authors:Ao Sun, Weilin Zhao, Xu Han, Cheng Yang, Zhiyuan Liu, Chuan Shi, Maosong Sun

Existing methods for training LLMs on long-sequence data, such as Tensor Parallelism and Context Parallelism, exhibit low Model FLOPs Utilization as sequence lengths and number of GPUs increase, especially when sequence lengths exceed 1M tokens. To address these challenges, we propose BurstEngine, an efficient framework designed to train LLMs on long-sequence data. BurstEngine introduces BurstAttention, an optimized distributed attention with lower communication cost than RingAttention. BurstAttention leverages topology-aware ring communication to fully utilize network bandwidth and incorporates fine-grained communication-computation overlap. Furthermore, BurstEngine introduces sequence-level selective checkpointing and fuses the language modeling head with the loss function to reduce memory cost. Additionally, BurstEngine introduces workload balance optimization for various types of attention masking. By integrating these optimizations, BurstEngine achieves a $1.2\times$ speedup with much lower memory overhead than the state-of-the-art baselines when training LLMs on extremely long sequences of over 1M tokens. We have made our code publicly available on GitHub: https://github.com/thunlp/BurstEngine.

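One of the memory savings above comes from fusing the language-modeling head with the loss so that the full (sequence x vocabulary) logits tensor is never materialized at once. A common way to get that effect, shown here as an illustrative sketch rather than BurstEngine's actual fused kernel, is to compute the head projection and cross-entropy chunk by chunk:

```python
# Illustrative sketch of fusing the LM head with the loss: process the sequence
# in chunks so the full (seq_len x vocab_size) logits tensor never exists at
# once. BurstEngine fuses this more aggressively; this shows the memory idea.
import torch
import torch.nn.functional as F

def chunked_lm_loss(hidden, targets, head_weight, chunk=4096):
    """hidden: (seq, dim), targets: (seq,), head_weight: (vocab, dim)."""
    total, count = hidden.new_zeros(()), 0
    for start in range(0, hidden.shape[0], chunk):
        h = hidden[start:start + chunk]   # (chunk, dim)
        logits = h @ head_weight.T        # (chunk, vocab): only a small slice
        total = total + F.cross_entropy(
            logits, targets[start:start + chunk], reduction="sum"
        )
        count += h.shape[0]
    return total / count

hidden = torch.randn(16_384, 1024)
targets = torch.randint(0, 32_000, (16_384,))
w = torch.randn(32_000, 1024)
print(chunked_lm_loss(hidden, targets, w))
```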

Paper and project links

PDF

Summary

Existing methods for training LLMs on long-sequence data, such as tensor parallelism and context parallelism, exhibit low Model FLOPs Utilization as sequence lengths and GPU counts grow. BurstEngine is an efficient framework for training LLMs on long-sequence data. It introduces BurstAttention, an optimized distributed attention with lower communication cost than RingAttention, which leverages topology-aware ring communication to fully exploit network bandwidth and incorporates fine-grained communication-computation overlap. BurstEngine also adds sequence-level selective checkpointing, fuses the language-modeling head with the loss function to reduce memory cost, and optimizes workload balance for various types of attention masking. Together these optimizations yield a 1.2x speedup with much lower memory overhead than state-of-the-art baselines when training LLMs on sequences of over 1M tokens.

Key Takeaways

  • BurstEngine is a framework for training LLMs on long-sequence data.
  • It introduces BurstAttention, an optimized distributed attention mechanism with lower communication cost.
  • It uses topology-aware ring communication and fine-grained communication-computation overlap to fully exploit network bandwidth.
  • Sequence-level selective checkpointing and fusing the LM head with the loss function reduce memory cost.
  • Workload-balance optimization covers various types of attention masking.
  • The framework speeds up long-sequence LLM training while lowering memory overhead.
  • The code is publicly available on GitHub.

Cool Papers

点此查看论文截图

Gyges: Dynamic Cross-Instance Parallelism Transformation for Efficient LLM Inference

Authors:Haoyu Chen, Xue Li, Kun Qian, Yu Guan, Jin Zhao, Xin Wang

Efficiently processing the dynamics of requests, especially the context length variance, is important in Large Language Model (LLM) serving scenarios. However, there is an intrinsic trade-off: while leveraging parallelism strategies, such as Tensor Parallelism (TP), can coordinate multiple GPUs to accommodate larger context lengths, it inevitably results in degraded overall throughput. In this paper, we propose Cross-Instance Parallelism Transformation (Gyges), which adaptively adjusts the parallelism strategies of running instances to align with the dynamics of incoming requests. We design (1) a page-friendly, header-centric layout to accelerate KV cache transformations; (2) dedicated weight padding to accelerate model weight transformations; and (3) a transformation-aware scheduler to cooperatively schedule requests and parallelism transformations, optimizing the overall performance. Evaluations using real-world traces show that Gyges improves throughput by 1.75x-6.57x compared to state-of-the-art solutions.


Paper and project links

PDF 12 pages, 15 figures

Summary

Efficiently handling request dynamics, especially context-length variance, matters in LLM serving scenarios. Gyges (Cross-Instance Parallelism Transformation) adaptively adjusts the parallelism strategies of running instances to match the dynamics of incoming requests. It introduces (1) a page-friendly, header-centric layout to accelerate KV-cache transformations, (2) dedicated weight padding to accelerate model-weight transformations, and (3) a transformation-aware scheduler that cooperatively schedules requests and parallelism transformations to optimize overall performance. On real-world traces, Gyges improves throughput by 1.75x-6.57x over state-of-the-art solutions.

Key Takeaways

  1. Handling request dynamics, especially context-length variance, is critical in LLM serving scenarios.
  2. Parallelism strategies such as Tensor Parallelism (TP) accommodate longer contexts but degrade overall throughput.
  3. Gyges is a cross-instance parallelism transformation method that adaptively adjusts the parallelism strategies of running instances to match incoming request dynamics.
  4. A page-friendly, header-centric layout accelerates KV-cache transformations.
  5. Dedicated weight padding accelerates model-weight transformations.
  6. A transformation-aware scheduler cooperatively schedules requests and parallelism transformations, optimizing overall performance.

Cool Papers

Click here to view paper screenshots

LLMs4All: A Review on Large Language Models for Research and Applications in Academic Disciplines

Authors:Yanfang Fanny Ye, Zheyuan Zhang, Tianyi Ma, Zehong Wang, Yiyang Li, Shifu Hou, Weixiang Sun, Kaiwen Shi, Yijun Ma, Wei Song, Ahmed Abbasi, Ying Cheng, Jane Cleland-Huang, Steven Corcelli, Patricia Culligan, Robert Goulding, Ming Hu, Ting Hua, John Lalor, Fang Liu, Tengfei Luo, Ed Maginn, Nuno Moniz, Jason Rohr, Brett Savoie, Daniel Slate, Tom Stapleford, Matthew Webber, Olaf Wiest, Johnny Zhang, Nitesh Chawla

Cutting-edge Artificial Intelligence (AI) techniques keep reshaping our view of the world. For example, Large Language Models (LLMs) based applications such as ChatGPT have shown the capability of generating human-like conversation on extensive topics. Due to the impressive performance on a variety of language-related tasks (e.g., open-domain question answering, translation, and document summarization), one can envision the far-reaching impacts that can be brought by the LLMs with broader real-world applications (e.g., customer service, education and accessibility, and scientific discovery). Inspired by their success, this paper will offer an overview of state-of-the-art LLMs and their integration into a wide range of academic disciplines, including: (1) arts, letters, and law (e.g., history, philosophy, political science, arts and architecture, law), (2) economics and business (e.g., finance, economics, accounting, marketing), and (3) science and engineering (e.g., mathematics, physics and mechanical engineering, chemistry and chemical engineering, life sciences and bioengineering, earth sciences and civil engineering, computer science and electrical engineering). Integrating humanity and technology, in this paper, we will explore how LLMs are shaping research and practice in these fields, while also discussing key limitations, open challenges, and future directions in the era of generative AI. The review of how LLMs are engaged across disciplines-along with key observations and insights-can help researchers and practitioners interested in exploiting LLMs to advance their works in diverse real-world applications.


Paper and project links

PDF

Summary

Cutting-edge AI techniques such as large language models (LLMs) keep reshaping our view of the world; LLM-based applications like ChatGPT can generate human-like conversation on extensive topics. Given their strong performance on a variety of language tasks, LLMs promise far-reaching impact in real-world applications such as customer service, education and accessibility, and scientific discovery. This paper overviews state-of-the-art LLMs and their integration across academic disciplines, spanning the arts, letters, and law; economics and business; and science and engineering, while also discussing key limitations, open challenges, and future directions in the era of generative AI.

Key Takeaways

  1. LLM-based AI techniques keep reshaping how we view the world.
  2. LLMs perform strongly across a variety of language tasks and have broad application prospects.
  3. LLMs can be applied to customer service, education and accessibility, and scientific discovery.
  4. The review covers LLM integration across academic disciplines, including the arts, law, economics, business, science, and engineering.
  5. Key limitations, open challenges, and future directions in the era of generative AI are also discussed.
  6. The cross-disciplinary review, with key observations and insights, can help researchers and practitioners apply LLMs in diverse real-world applications.

Cool Papers

Click here to view paper screenshots

COLT: Enhancing Video Large Language Models with Continual Tool Usage

Authors:Yuyang Liu, Xinyuan Shi, Xiaodan Liang

The success of Large Language Models (LLMs) has significantly propelled the research of video understanding. To harvest the benefits of well-trained expert models (i.e., tools), video LLMs prioritize the exploration of tool usage capabilities. Existing methods either prompt closed-source LLMs or employ the instruction tuning paradigm for tool-use fine-tuning. These methods, however, assume an established repository of fixed tools and struggle to generalize to real-world environments where tool data is perpetually evolving and streaming in. To this end, we propose to enhance open-source video LLMs with COntinuaL Tool usage (termed COLT), which automatically acquires tool-use ability in a successive tool stream without suffering ‘catastrophic forgetting’ of the past learned tools. Specifically, our COLT incorporates a learnable tool codebook as a tool-specific memory system. Then relevant tools are dynamically selected based on the similarity between user instruction and tool features within the codebook. To unleash the tool usage potential of video LLMs, we collect a video-centric tool-use instruction tuning dataset VideoToolBench. Extensive experiments on both previous video LLM benchmarks and the tool-use-specific VideoToolBench dataset demonstrate the state-of-the-art performance of our proposed COLT.

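Tool selection against the codebook reduces to a similarity lookup: embed the instruction, score it against each tool feature, and take the top-k. A minimal sketch of that selection step (illustrative; COLT learns the codebook end-to-end as a tool-specific memory):

```python
# Minimal sketch of codebook-based tool selection: score each tool feature in
# the codebook against the user-instruction embedding and pick the top-k.
# Illustrative only; COLT's codebook is learned, not random.
import torch
import torch.nn.functional as F

def select_tools(instr_emb: torch.Tensor, codebook: torch.Tensor, k: int = 2):
    """instr_emb: (dim,), codebook: (num_tools, dim) learnable tool features."""
    scores = F.cosine_similarity(instr_emb.unsqueeze(0), codebook, dim=-1)
    return torch.topk(scores, k=k).indices  # ids of the most relevant tools

num_tools, dim = 8, 256
codebook = torch.nn.Parameter(torch.randn(num_tools, dim))
instr_emb = torch.randn(dim)
print(select_tools(instr_emb, codebook.data))

# In a continual tool stream, features for newly arriving tools are appended to
# the codebook; retaining old entries rather than overwriting them is what
# guards against catastrophic forgetting of previously learned tools.
```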

Paper and project links

PDF 16 pages

Summary

The success of LLMs has propelled video understanding research, and video LLMs prioritize exploring tool-use capabilities to harvest the benefits of well-trained expert models. Existing methods either prompt closed-source LLMs or fine-tune for tool use via instruction tuning, but they assume a fixed tool repository and struggle to generalize to real-world environments where tool data perpetually evolves and streams in. COLT enhances open-source video LLMs with continual tool usage, automatically acquiring tool-use ability over a successive tool stream without catastrophic forgetting of previously learned tools. COLT incorporates a learnable tool codebook as a tool-specific memory system and dynamically selects relevant tools based on the similarity between the user instruction and the tool features in the codebook. To unleash the tool-use potential of video LLMs, the authors collect VideoToolBench, a video-centric tool-use instruction-tuning dataset. Experiments on previous video LLM benchmarks and on VideoToolBench demonstrate state-of-the-art performance.

Key Takeaways

  1. The success of LLMs has driven significant progress in video understanding research.
  2. Video LLMs prioritize tool-use capabilities to exploit well-trained expert models.
  3. Existing methods rely on a fixed tool repository and adapt poorly to perpetually evolving tool data.
  4. COLT equips video LLMs with continual tool-use ability over a successive tool stream, avoiding catastrophic forgetting of past tools.
  5. COLT uses a learnable tool codebook as a tool-specific memory and selects tools by instruction-tool feature similarity.
  6. The authors collect VideoToolBench, a video-centric tool-use instruction-tuning dataset.

Cool Papers

Click here to view paper screenshots

TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs

Authors:Yunheng Li, Jing Cheng, Shaoyong Jia, Hangyi Kuang, Shaohui Jiao, Qibin Hou, Ming-Ming Cheng

This paper introduces TempSamp-R1, a new reinforcement fine-tuning framework designed to improve the effectiveness of adapting multimodal large language models (MLLMs) to video temporal grounding tasks. We reveal that existing reinforcement learning methods, such as Group Relative Policy Optimization (GRPO), rely on on-policy sampling for policy updates. However, in tasks with large temporal search spaces, this strategy becomes both inefficient and limited in performance, as it often fails to identify temporally accurate solutions. To address this limitation, TempSamp-R1 leverages ground-truth annotations as off-policy supervision to provide temporally precise guidance, effectively compensating for the sparsity and misalignment in on-policy solutions. To further stabilize training and reduce variance in reward-based updates, TempSamp-R1 provides a non-linear soft advantage computation method that dynamically reshapes the reward feedback via an asymmetric transformation. By employing a hybrid Chain-of-Thought (CoT) training paradigm, TempSamp-R1 optimizes a single unified model to support both CoT and non-CoT inference modes, enabling efficient handling of queries with varying reasoning complexity. Experimental results demonstrate that TempSamp-R1 outperforms GRPO-based baselines, establishing new state-of-the-art performance on benchmark datasets: Charades-STA (R1@0.7: 52.9%, +2.7%), ActivityNet Captions (R1@0.5: 56.0%, +5.3%), and QVHighlights (mAP: 30.0%, +3.0%). Moreover, TempSamp-R1 shows robust few-shot generalization capabilities under limited data. Code: https://github.com/HVision-NKU/TempSamp-R1

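The "non-linear soft advantage" can be illustrated with a simple stand-in transform: squash standardized advantages non-linearly and scale the positive and negative branches differently, so reward feedback is reshaped asymmetrically. The specific function below is an assumption for illustration, not the paper's exact formulation:

```python
# Illustrative sketch of asymmetric, non-linear advantage reshaping: positive
# and negative advantages are squashed with different scales, reducing the
# variance of reward-based updates. The transform is an assumed stand-in,
# not TempSamp-R1's exact formulation.
import numpy as np

def soft_advantage(adv: np.ndarray, pos_scale=1.0, neg_scale=0.5) -> np.ndarray:
    # tanh squashes extremes (non-linear); distinct scales on the positive and
    # negative branches make the transformation asymmetric.
    return np.where(adv >= 0, pos_scale * np.tanh(adv), neg_scale * np.tanh(adv))

# Mixed group: one off-policy ground-truth solution (high reward) plus
# on-policy samples whose rewards are sparse and often temporally misaligned.
rewards = np.array([1.0, 0.1, 0.0, 0.0])
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(adv)                  # raw standardized advantages
print(soft_advantage(adv))  # reshaped: smaller, asymmetric update signal
```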

Paper and project links

PDF Accepted at NeurIPS 2025

Summary

TempSamp-R1 is a reinforcement fine-tuning framework that improves the adaptation of multimodal large language models to video temporal grounding tasks. Existing RL methods such as GRPO rely on on-policy sampling, which is inefficient and performance-limited in large temporal search spaces. TempSamp-R1 leverages ground-truth annotations as off-policy supervision to provide temporally precise guidance, and it stabilizes training with a non-linear soft advantage computation that dynamically reshapes reward feedback via an asymmetric transformation. A hybrid Chain-of-Thought training paradigm optimizes a single unified model to support both CoT and non-CoT inference modes. Experiments show TempSamp-R1 outperforms GRPO-based baselines, setting new state-of-the-art results on benchmark datasets, with robust few-shot generalization under limited data.

Key Takeaways

  1. TempSamp-R1 is a reinforcement fine-tuning framework for adapting multimodal LLMs to video temporal grounding tasks.
  2. Existing RL methods are inefficient and performance-limited in large temporal search spaces; TempSamp-R1 addresses this.
  3. Ground-truth annotations serve as off-policy supervision, compensating for the sparsity and misalignment of on-policy solutions.
  4. A non-linear soft advantage computation dynamically reshapes reward feedback, stabilizing training and reducing variance.
  5. A hybrid Chain-of-Thought training paradigm lets a single unified model handle queries of varying reasoning complexity.
  6. TempSamp-R1 sets new state-of-the-art results on Charades-STA, ActivityNet Captions, and QVHighlights.

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!