Published: 2025-09-17

Updated: 2025-10-07

Word count: 20.6k

Reading time: 84 min

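⚠️ All summaries below were generated by a large language model and may contain errors. They are for reference only; use them with caution.
🔴 Note: do not use these summaries in serious academic work. They are intended only for initial screening before reading the papers.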
2025-09-17 Update

Do machine learning climate models work in changing climate dynamics?

Authors:Maria Conchita Agana Navarro, Geng Li, Theo Wolf, María Pérez-Ortiz

Climate change is accelerating the frequency and severity of unprecedented events, deviating from established patterns. Predicting these out-of-distribution (OOD) events is critical for assessing risks and guiding climate adaptation. While machine learning (ML) models have shown promise in providing precise, high-speed climate predictions, their ability to generalize under distribution shifts remains a significant limitation that has been underexplored in climate contexts. This research systematically evaluates state-of-the-art ML-based climate models in diverse OOD scenarios by adapting established OOD evaluation methodologies to climate data. Experiments on large-scale datasets reveal notable performance variability across scenarios, shedding light on the strengths and limitations of current models. These findings underscore the importance of robust evaluation frameworks and provide actionable insights to guide the reliable application of ML for climate risk forecasting.

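Paper and project links

PDF 8 pages, 2 figures

Summary: Climate change is increasing the frequency and intensity of unprecedented events that deviate from established patterns. Machine learning models show promise for precise, high-speed climate prediction, but their ability to generalize under distribution shift remains a major and underexplored limitation in climate settings. The study adapts established out-of-distribution (OOD) evaluation methods to climate data and finds, on large-scale datasets, that model performance varies notably across OOD scenarios. The results underline the importance of robust evaluation frameworks and offer guidance for applying machine learning to climate risk forecasting.

Key Takeaways:

Climate change is making unprecedented events more frequent and more severe, in ways that deviate from known patterns.
Machine learning models show promise for climate prediction but are limited in how well they handle distribution shift.
The study systematically evaluates state-of-the-art ML-based climate models across diverse OOD scenarios.
It adapts established OOD evaluation methodologies to climate data and runs experiments on large-scale datasets.
Performance varies notably from one scenario to another.
Robust evaluation frameworks are essential for judging ML models used in climate prediction.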

UniPar: A Unified LLM-Based Framework for Parallel and Accelerated Code Translation in HPC

Authors:Tomer Bitan, Tal Kadosh, Erel Kaplan, Shira Meiri, Le Chen, Peter Morales, Niranjan Hasabnis, Gal Oren

Translating programs between various parallel programming languages is an important problem in the high-performance computing (HPC) community. Existing tools for this problem are either too narrow in scope or outdated. The recent explosive growth in the popularity of large language models (LLMs) and their ability to generate and translate code offers a potential alternative approach. Toward that end, we first need to systematically evaluate the ability of LLMs to translate between parallel languages. In this work, we introduce UniPar, a systematic evaluation framework for LLM-based parallel code translation. Specifically, we target translations between serial code, CUDA, and OpenMP. Our goal is to assess how well current instruction-tuned LLMs – specifically GPT-4o-mini and LLaMA-3.3-70B-Instruct – can be used out of the box or enhanced through known strategies. We evaluated four major usage modes: hyperparameter optimization for decoding, zero- and few-shot prompting, supervised fine-tuning, and iterative feedback through compiler-based repair. As a part of the evaluation, we construct a new dataset called PARATRANS, covering both serial-to-parallel translation and cross-paradigm transformations. Our findings reveal that while off-the-shelf models struggle under the default settings (e.g., GPT-4o-mini achieves only 46% compilation and 15% functional correctness), our UniPar methodology – combining fine-tuning, hyperparameter tuning, and compiler-guided repair – improves performance by up to 2X (69% compilation and 33% correctness). We believe that our findings will provide useful insights for researchers to further improve LLMs for the parallel language translation problem. UniPar source code and PARATRANS dataset are available at our GitHub repository https://github.com/Scientific-Computing-Lab/UniPar_AI.
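
The "iterative feedback through compiler-based repair" mode mentioned in the abstract can be pictured as a bounded loop: check a candidate translation, and if the check fails, hand the diagnostics back to a repair step. The sketch below is only an illustration under stated assumptions. The names `repair_loop`, `check`, and `fix` are hypothetical and not taken from UniPar; in practice `check` would presumably wrap a real compiler invocation and `fix` an LLM call, whereas here both are toy stand-ins so the example runs on its own.

```python
from typing import Callable, Tuple

# Hypothetical interfaces (assumptions, not the paper's API):
#   check(code) -> (ok, diagnostics)   e.g. a wrapper around a compiler call
#   fix(code, diagnostics) -> code     e.g. an LLM prompted with the errors
Checker = Callable[[str], Tuple[bool, str]]
Fixer = Callable[[str, str], str]


def repair_loop(code: str, check: Checker, fix: Fixer,
                max_rounds: int = 3) -> Tuple[str, bool, int]:
    """Return (final_code, compiled_ok, repair_rounds_used)."""
    for rounds_used in range(max_rounds + 1):
        ok, diagnostics = check(code)
        if ok:
            return code, True, rounds_used
        if rounds_used == max_rounds:
            break  # repair budget exhausted
        code = fix(code, diagnostics)
    return code, False, max_rounds


# Toy stand-ins so the sketch runs without a compiler or a model.
def toy_check(code: str) -> Tuple[bool, str]:
    if "#pragma omp parallel for" in code:
        return True, ""
    return False, "error: expected an OpenMP parallel-for pragma"


def toy_fix(code: str, diagnostics: str) -> str:
    return "#pragma omp parallel for\n" + code


serial = "for (int i = 0; i < n; ++i) y[i] = a * x[i] + y[i];"
final_code, ok, rounds = repair_loop(serial, toy_check, toy_fix)
```

Capping the number of rounds keeps the cost of each translation bounded, and returning the round count makes it easy to report how often repair was needed.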

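Paper and project links

PDF Accepted to IEEE HPEC conference 2025. 9 pages, incl references

Summary

This paper introduces UniPar, a framework for evaluating how well large language models (LLMs) translate parallel code. The study focuses on instruction-tuned LLMs translating between serial code, CUDA, and OpenMP, and evaluates four main usage modes. Off-the-shelf models perform poorly, but combining fine-tuning, hyperparameter tuning, and compiler-guided repair improves performance by up to 2X. The PARATRANS dataset and the UniPar source code are available on GitHub.

Key Takeaways

LLMs show potential for translating code between parallel programming languages.
UniPar is a framework for evaluating LLM performance on parallel code translation.
The study targets instruction-tuned LLMs translating between serial code, CUDA, and OpenMP.
Off-the-shelf models perform poorly (GPT-4o-mini reaches 46% compilation and 15% functional correctness by default), while the combined UniPar methodology reaches 69% compilation and 33% correctness.
The new PARATRANS dataset covers serial-to-parallel translation and cross-paradigm transformations.
The UniPar methodology and the PARATRANS dataset offer guidance for further improving LLMs on parallel language translation.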

Look Again, Think Slowly: Enhancing Visual Reflection in Vision-Language Models

Authors:Pu Jian, Junhong Wu, Wei Sun, Chen Wang, Shuo Ren, Jiajun Zhang

Recent advances in text-only “slow-thinking” reasoning have prompted efforts to transfer this capability to vision-language models (VLMs), for training visual reasoning models (VRMs). However, such transfer faces critical challenges: Effective “slow thinking” in VRMs requires visual reflection, the ability to check the reasoning process based on visual information. Through quantitative analysis, we observe that current VRMs exhibit limited visual reflection, as their attention to visual information diminishes rapidly with longer generated responses. To address this challenge, we propose a new VRM, Reflection-V, which enhances visual reflection based on reasoning data construction for cold-start and reward design for reinforcement learning (RL). Firstly, we construct vision-centered reasoning data by leveraging an agent that interacts between VLMs and reasoning LLMs, enabling cold-start learning of visual reflection patterns. Secondly, a visual attention based reward model is employed during RL to encourage reasoning based on visual information. Therefore, Reflection-V demonstrates significant improvements across multiple visual reasoning benchmarks. Furthermore, Reflection-V maintains a stronger and more consistent reliance on visual information during visual reasoning, indicating effective enhancement in visual reflection capabilities.
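
To make the idea of a "visual attention based reward" more concrete, one possible reading is to measure how much attention mass each generated token places on image tokens and to reward responses that keep this share high late in the generation. The sketch below follows that reading only; the abstract does not give the actual reward, so the functions `visual_attention_ratio` and `visual_attention_reward` and the parameter `tail_fraction` are hypothetical.

```python
import numpy as np


def visual_attention_ratio(attn, visual_mask):
    """Share of attention that each generated token puts on visual tokens.

    attn:        (num_generated_tokens, context_len), each row sums to 1
    visual_mask: (context_len,) boolean, True where the context token is visual
    """
    attn = np.asarray(attn, dtype=float)
    visual_mask = np.asarray(visual_mask, dtype=bool)
    return attn[:, visual_mask].sum(axis=1)


def visual_attention_reward(attn, visual_mask, tail_fraction=0.5):
    """Hypothetical reward: mean visual share over the last part of the response."""
    ratios = visual_attention_ratio(attn, visual_mask)
    start = min(int(len(ratios) * (1.0 - tail_fraction)), len(ratios) - 1)
    return float(ratios[start:].mean())


# Toy attention map: 4 generated tokens over a context of 2 visual + 2 text tokens.
# The visual share decays as the response grows, mirroring the observation above.
attn = [
    [0.4, 0.4, 0.1, 0.1],
    [0.3, 0.3, 0.2, 0.2],
    [0.2, 0.2, 0.3, 0.3],
    [0.1, 0.1, 0.4, 0.4],
]
visual_mask = [True, True, False, False]
ratios = visual_attention_ratio(attn, visual_mask)
reward = visual_attention_reward(attn, visual_mask)
```

Scoring only the tail of the response is a design choice made here for illustration: it targets the decay the authors report for longer generations.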

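Paper and project links

PDF EMNLP 2025 Main

Summary

This paper examines the challenges of transferring text-only reasoning ability to vision-language models (VLMs) and argues that effective “slow thinking” requires visual reflection. A quantitative analysis shows that current visual reasoning models (VRMs) have limited visual reflection, which motivates a new VRM, Reflection-V. The model strengthens visual reflection by constructing vision-centered reasoning data for cold start and by designing a reward for reinforcement learning. Experiments show that Reflection-V improves significantly on multiple visual reasoning benchmarks and relies more strongly on visual information during reasoning.

Key Takeaways

Current visual reasoning models have limited visual reflection: their attention to visual information drops quickly as responses get longer.
Effective “slow thinking” in visual reasoning requires visual reflection.
Reflection-V is a new visual reasoning model designed to enhance visual reflection.
Reflection-V learns visual reflection patterns at cold start from vision-centered reasoning data.
A reward model based on visual attention encourages reasoning grounded in visual information.
Reflection-V improves significantly on multiple visual reasoning benchmarks.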

Authors:Wei Cai, Shujuan Liu, Jian Zhao, Ziyan Shi, Yusheng Zhao, Yuchen Yuan, Tianle Zhang, Chi Zhang, Xuelong Li

Multimodal Large Language Models (MLLMs) are susceptible to the implicit reasoning risk, wherein innocuous unimodal inputs synergistically assemble into risky multimodal data that produce harmful outputs. We attribute this vulnerability to the difficulty of MLLMs maintaining safety alignment through long-chain reasoning. To address this issue, we introduce Safe-Semantics-but-Unsafe-Interpretation (SSUI), the first dataset featuring interpretable reasoning paths tailored for such a cross-modal challenge. A novel training framework, Safety-aware Reasoning Path Optimization (SRPO), is also designed based on the SSUI dataset to align the MLLM’s internal reasoning process with human safety values. Experimental results show that our SRPO-trained models achieve state-of-the-art results on key safety benchmarks, including the proposed Reasoning Path Benchmark (RSBench), significantly outperforming both open-source and top-tier commercial MLLMs.

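Paper and project links

PDF

Summary:
Multimodal large language models (MLLMs) are exposed to implicit reasoning risk: innocuous unimodal inputs can combine into risky multimodal data that produces harmful outputs. To address this, the authors introduce the Safe-Semantics-but-Unsafe-Interpretation (SSUI) dataset and, built on it, the Safety-aware Reasoning Path Optimization (SRPO) training framework, which aims to align the MLLM’s internal reasoning process with human safety values. Experiments show that SRPO-trained models achieve state-of-the-art results on key safety benchmarks, including the proposed Reasoning Path Benchmark (RSBench), and significantly outperform both open-source and top-tier commercial MLLMs.

Key Takeaways:

MLLMs face implicit reasoning risk: individually innocuous inputs can jointly lead to harmful outputs.
MLLMs have difficulty maintaining safety alignment through long-chain reasoning.
The Safe-Semantics-but-Unsafe-Interpretation (SSUI) dataset provides interpretable reasoning paths tailored to this cross-modal challenge.
The Safety-aware Reasoning Path Optimization (SRPO) training framework is built on the SSUI dataset.
SRPO aims to align the MLLM’s internal reasoning process with human safety values.
SRPO-trained models achieve state-of-the-art results on key safety benchmarks, including RSBench.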

TabStruct: Measuring Structural Fidelity of Tabular Data

Authors:Xiangjian Jiang, Nikola Simidjievski, Mateja Jamnik

Evaluating tabular generators remains a challenging problem, as the unique causal structural prior of heterogeneous tabular data does not lend itself to intuitive human inspection. Recent work has introduced structural fidelity as a tabular-specific evaluation dimension to assess whether synthetic data complies with the causal structures of real data. However, existing benchmarks often neglect the interplay between structural fidelity and conventional evaluation dimensions, thus failing to provide a holistic understanding of model performance. Moreover, they are typically limited to toy datasets, as quantifying existing structural fidelity metrics requires access to ground-truth causal structures, which are rarely available for real-world datasets. In this paper, we propose a novel evaluation framework that jointly considers structural fidelity and conventional evaluation dimensions. We introduce a new evaluation metric, global utility, which enables the assessment of structural fidelity even in the absence of ground-truth causal structures. In addition, we present TabStruct, a comprehensive evaluation benchmark offering large-scale quantitative analysis on 13 tabular generators from nine distinct categories, across 29 datasets. Our results demonstrate that global utility provides a task-independent, domain-agnostic lens for tabular generator performance. We release the TabStruct benchmark suite, including all datasets, evaluation pipelines, and raw results. Code is available at https://github.com/SilenceX12138/TabStruct.

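Paper and project links

PDF 55 pages, 60 tables, 7 figures

Summary

The paper proposes a new evaluation framework that jointly considers structural fidelity and conventional evaluation dimensions. To assess structural fidelity without ground-truth causal structures, it introduces a new metric, global utility. It also presents TabStruct, a comprehensive benchmark offering large-scale quantitative analysis of 13 tabular generators from nine categories across 29 datasets. The results show that global utility provides a task-independent, domain-agnostic lens on tabular generator performance.

Key Takeaways

Evaluating tabular generators is hard because the causal structure of heterogeneous tabular data does not lend itself to intuitive human inspection.
Existing benchmarks often neglect the interplay between structural fidelity and conventional evaluation dimensions, so they do not give a holistic view of model performance.
The new metric, global utility, allows structural fidelity to be assessed without ground-truth causal structures.
The TabStruct benchmark suite is released with all datasets, evaluation pipelines, and raw results.
TabStruct covers 13 tabular generators from nine categories across 29 datasets.
Global utility offers a task-independent, domain-agnostic view of tabular generator performance.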

Agentic Temporal Graph of Reasoning with Multimodal Language Models: A Potential AI Aid to Healthcare

Authors:Susanta Mitra

Healthcare and medicine are multimodal disciplines that deal with multimodal data for reasoning and diagnosing multiple diseases. Although some multimodal reasoning models have emerged for reasoning about complex tasks in scientific domains, their applications in the healthcare domain remain limited and fall short in correct reasoning for diagnosis. To address the challenges of multimodal medical reasoning for correct diagnosis and to assist healthcare professionals, this work proposes a novel temporal graph-based reasoning process modelled through a directed graph. It helps in accommodating dynamic changes in reasons through backtracking, refining the reasoning content, and creating new or deleting existing reasons to reach the best recommendation or answer. In addition, consideration of multimodal data at different time points can enable tracking and analysis of patient health and disease progression. Moreover, the proposed multi-agent temporal reasoning framework provides task distributions and a cross-validation mechanism to further enhance the accuracy of reasoning outputs. A few basic experiments and analysis results justify the novelty and practical utility of the proposed preliminary approach.

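Paper and project links

PDF

Summary: Healthcare relies on reasoning over multimodal data to diagnose multiple diseases. This work proposes a temporal graph-based reasoning process, modelled as a directed graph, that accommodates dynamic changes in reasons through backtracking, refines the reasoning content, and creates new or deletes existing reasons to reach the best recommendation or answer. Considering multimodal data at different time points also makes it possible to track and analyze patient health and disease progression. Preliminary experiments and analysis support the novelty and practical utility of the approach.

Key Takeaways:

Healthcare requires reasoning over multimodal data to diagnose multiple diseases.
The proposed temporal graph-based reasoning process accommodates dynamic changes in reasons and helps refine the reasoning content.
It reaches the best recommendation or answer by creating new reasons or deleting existing ones.
Considering multimodal data at different time points helps track and analyze patient health and disease progression.
Applications of multimodal reasoning models in healthcare remain limited and fall short in correct diagnostic reasoning.
The proposed multi-agent temporal reasoning framework uses task distribution and a cross-validation mechanism to further improve the accuracy of reasoning outputs.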

FineQuest: Adaptive Knowledge-Assisted Sports Video Understanding via Agent-of-Thoughts Reasoning

Authors:Haodong Chen, Haojian Huang, XinXiang Yin, Dian Shao

Video Question Answering (VideoQA) based on Large Language Models (LLMs) has shown potential in general video understanding but faces significant challenges when applied to the inherently complex domain of sports videos. In this work, we propose FineQuest, the first training-free framework that leverages dual-mode reasoning inspired by cognitive science: i) Reactive Reasoning for straightforward sports queries and ii) Deliberative Reasoning for more complex ones. To bridge the knowledge gap between general-purpose models and domain-specific sports understanding, FineQuest incorporates SSGraph, a multimodal sports knowledge scene graph spanning nine sports, which encodes both visual instances and domain-specific terminology to enhance reasoning accuracy. Furthermore, we introduce two new sports VideoQA benchmarks, Gym-QA and Diving-QA, derived from the FineGym and FineDiving datasets, enabling diverse and comprehensive evaluation. FineQuest achieves state-of-the-art performance on these benchmarks as well as the existing SPORTU dataset, while maintaining strong general VideoQA capabilities.

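Paper and project links

PDF ACM MM 2025

Summary

LLM-based Video Question Answering (VideoQA) shows potential for general video understanding but struggles in the complex domain of sports videos. This work proposes FineQuest, a training-free framework that uses dual-mode reasoning inspired by cognitive science: i) Reactive Reasoning for straightforward sports queries and ii) Deliberative Reasoning for more complex ones. To bridge the knowledge gap between general-purpose models and domain-specific sports understanding, FineQuest incorporates SSGraph, a multimodal sports knowledge scene graph spanning nine sports that encodes both visual instances and domain-specific terminology. The authors also introduce two new sports VideoQA benchmarks, Gym-QA and Diving-QA, derived from the FineGym and FineDiving datasets. FineQuest achieves state-of-the-art performance on these benchmarks and on the existing SPORTU dataset while keeping strong general VideoQA capabilities.

Key Takeaways

FineQuest is the first training-free VideoQA framework aimed at sports videos.
It uses dual-mode reasoning: Reactive Reasoning and Deliberative Reasoning.
FineQuest uses SSGraph to bridge the gap between general-purpose models and sports domain knowledge.
SSGraph is a multimodal sports knowledge scene graph that encodes visual instances and domain-specific terminology.
The authors introduce two new sports VideoQA benchmarks: Gym-QA and Diving-QA.
FineQuest achieves state-of-the-art performance on Gym-QA, Diving-QA, and SPORTU.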

A Survey of Reasoning and Agentic Systems in Time Series with Large Language Models

Authors:Ching Chang, Yidan Shi, Defu Cao, Wei Yang, Jeehyun Hwang, Haixin Wang, Jiacheng Pang, Wei Wang, Yan Liu, Wen-Chih Peng, Tien-Fu Chen

Time series reasoning treats time as a first-class axis and incorporates intermediate evidence directly into the answer. This survey defines the problem and organizes the literature by reasoning topology with three families: direct reasoning in one step, linear chain reasoning with explicit intermediates, and branch-structured reasoning that explores, revises, and aggregates. The topology is crossed with the main objectives of the field, including traditional time series analysis, explanation and understanding, causal inference and decision making, and time series generation, while a compact tag set spans these axes and captures decomposition and verification, ensembling, tool use, knowledge access, multimodality, agent loops, and LLM alignment regimes. Methods and systems are reviewed across domains, showing what each topology enables and where it breaks down in faithfulness or robustness, along with curated datasets, benchmarks, and resources that support study and deployment (https://github.com/blacksnail789521/Time-Series-Reasoning-Survey). Evaluation practices that keep evidence visible and temporally aligned are highlighted, and guidance is distilled on matching topology to uncertainty, grounding with observable artifacts, planning for shift and streaming, and treating cost and latency as design budgets. We emphasize that reasoning structures must balance capacity for grounding and self-correction against computational cost and reproducibility, while future progress will likely depend on benchmarks that tie reasoning quality to utility and on closed-loop testbeds that trade off cost and risk under shift-aware, streaming, and long-horizon settings. Taken together, these directions mark a shift from narrow accuracy toward reliability at scale, enabling systems that not only analyze but also understand, explain, and act on dynamic worlds with traceable evidence and credible outcomes.

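Paper and project links

PDF This paper is currently under review

Summary

Time series reasoning treats time as a first-class axis and incorporates intermediate evidence directly into the answer. The survey defines the problem and organizes the literature by reasoning topology into three families: direct reasoning in one step, linear chain reasoning with explicit intermediates, and branch-structured reasoning. The topology is crossed with the field's main objectives: traditional time series analysis, explanation and understanding, causal inference and decision making, and time series generation. A compact tag set spans these axes and covers decomposition and verification, ensembling, tool use, knowledge access, multimodality, agent loops, and LLM alignment regimes. The survey reviews methods and systems across domains, showing what each topology enables and where it breaks down in faithfulness or robustness, together with datasets, benchmarks, and resources that support study and deployment. It highlights evaluation practices that keep evidence visible and temporally aligned, and gives guidance on matching topology to uncertainty, grounding with observable artifacts, planning for shift and streaming, and treating cost and latency as design budgets. Reasoning structures must balance capacity for grounding and self-correction against computational cost and reproducibility. Future progress will likely depend on benchmarks that tie reasoning quality to utility and on closed-loop testbeds that trade off cost and risk under shift-aware, streaming, and long-horizon settings. Together, these directions mark a shift from narrow accuracy toward reliability at scale, enabling systems that analyze, understand, explain, and act on dynamic worlds with traceable evidence and credible outcomes.

Key Takeaways

Time series reasoning treats time as a first-class axis and incorporates intermediate evidence into the answer.
The survey defines the problem and groups the literature into three families by reasoning topology.
The topology is crossed with the field's objectives, including time series analysis, explanation and understanding, causal inference and decision making, and generation.
A compact tag set covers aspects such as decomposition and verification, tool use, and multimodality.
The review of methods and systems shows what each topology enables and where it breaks down in faithfulness or robustness.
Recommended evaluation practices keep evidence visible and temporally aligned.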

Formal Reasoning for Intelligent QA Systems: A Case Study in the Educational Domain

Authors:Tuan Bui, An Nguyen, Phat Thai, Minh Hua, Ngan Pham L. N., Ngan Pham T. B., Dung Le, Long Nguyen, Thanh-Tung Tran, Thang Bui, Tho Quan

Reasoning is essential for closed-domain QA systems in which procedural correctness and policy compliance are critical. While large language models (LLMs) have shown strong performance on many reasoning tasks, recent work reveals that their reasoning traces are often unfaithful - serving more as plausible justifications than as causally grounded derivations. Efforts to combine LLMs with symbolic engines (e.g., Prover9, Z3) have improved reliability but remain limited to static forms of logic, struggling with dynamic, state-based reasoning such as multi-step progressions and conditional transitions. In this paper, we propose MCFR (Model Checking for Formal Reasoning), a neuro-symbolic framework that integrates LLMs with model checking to support property verification. MCFR translates natural language into formal specifications and verifies them over transition models. To support evaluation, we introduce EduMC-QA, a benchmark dataset grounded in real academic procedures. Our results show that MCFR improves reasoning faithfulness and interpretability, offering a viable path toward verifiable QA in high-stakes closed-domain applications. In addition to evaluating MCFR, we compare its performance with state-of-the-art LLMs such as ChatGPT, DeepSeek, and Claude to contextualize its effectiveness.
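
As a rough picture of what "verifying a formal specification over a transition model" means, the sketch below checks a safety invariant by exhaustively exploring the reachable states of a small, explicitly listed transition system and returns a counterexample path when the invariant fails. It is a minimal explicit-state check written for illustration, not MCFR's pipeline; the toy academic procedure, the state encoding, and the function name `check_invariant` are all assumptions.

```python
from collections import deque


def check_invariant(initial, transitions, invariant):
    """Breadth-first search over reachable states.

    Returns (True, []) if every reachable state satisfies `invariant`,
    otherwise (False, path) where `path` leads from `initial` to a violation.
    """
    parent = {initial: None}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return False, path[::-1]
        for nxt in transitions.get(state, ()):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return True, []


# Hypothetical procedure: a state is (stage, proposal_approved).
# Policy to verify: nobody reaches "defended" without an approved proposal.
def policy(state):
    stage, approved = state
    return not (stage == "defended" and not approved)


faulty = {
    ("enrolled", False): [("proposal", True), ("defended", False)],  # loophole edge
    ("proposal", True): [("defended", True)],
    ("defended", True): [("graduated", True)],
}
fixed = dict(faulty)
fixed[("enrolled", False)] = [("proposal", True)]

holds_faulty, counterexample = check_invariant(("enrolled", False), faulty, policy)
holds_fixed, _ = check_invariant(("enrolled", False), fixed, policy)
```

The counterexample path is what makes this style of checking attractive for question answering: a negative answer comes with a concrete sequence of steps that can be shown to the user.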

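Paper and project links

PDF Published at the 2nd ACM Workshop in AI-powered Question & Answering Systems (AIQAM ’25), co-located with ACM Multimedia 2025

Summary:

The paper proposes MCFR, a neuro-symbolic framework that combines large language models with model checking to support property verification and to improve the faithfulness and interpretability of reasoning in closed-domain QA systems. The framework translates natural language into formal specifications and verifies them. To support evaluation, the authors introduce EduMC-QA, a benchmark dataset grounded in real academic procedures. The study also compares MCFR with state-of-the-art LLMs such as ChatGPT, DeepSeek, and Claude to put its effectiveness in context.

Key Takeaways:

Reasoning is essential in closed-domain QA systems where procedural correctness and policy compliance are critical.
LLMs perform strongly on many reasoning tasks, but their reasoning traces are often unfaithful.
Combining LLMs with symbolic engines (e.g., Prover9, Z3) improves reliability but remains limited for dynamic, state-based reasoning.
MCFR integrates LLMs with model checking to support property verification, improving reasoning faithfulness and interpretability.
MCFR translates natural language into formal specifications and verifies them over transition models.
EduMC-QA, a benchmark dataset grounded in real academic procedures, supports the evaluation.
MCFR is compared with state-of-the-art LLMs to contextualize its effectiveness.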

D$^2$HScore: Reasoning-Aware Hallucination Detection via Semantic Breadth and Depth Analysis in LLMs

Authors:Yue Ding, Xiaofang Zhu, Tianze Xia, Junfei Wu, Xinlong Chen, Qiang Liu, Liang Wang

Although Large Language Models (LLMs) have achieved remarkable success, their practical application is often hindered by the generation of non-factual content, which is called “hallucination”. Ensuring the reliability of LLMs’ outputs is a critical challenge, particularly in high-stakes domains such as finance, security, and healthcare. In this work, we revisit hallucination detection from the perspective of model architecture and generation dynamics. Leveraging the multi-layer structure and autoregressive decoding process of LLMs, we decompose hallucination signals into two complementary dimensions: the semantic breadth of token representations within each layer, and the semantic depth of core concepts as they evolve across layers. Based on this insight, we propose D$^2$HScore (Dispersion and Drift-based Hallucination Score), a training-free and label-free framework that jointly measures: (1) Intra-Layer Dispersion, which quantifies the semantic diversity of token representations within each layer; and (2) Inter-Layer Drift, which tracks the progressive transformation of key token representations across layers. To ensure drift reflects the evolution of meaningful semantics rather than noisy or redundant tokens, we guide token selection using attention signals. By capturing both the horizontal and vertical dynamics of representation during inference, D$^2$HScore provides an interpretable and lightweight proxy for hallucination detection. Extensive experiments across five open-source LLMs and five widely used benchmarks demonstrate that D$^2$HScore consistently outperforms existing training-free baselines.
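
The two signals named in the abstract can be illustrated with simple cosine-based statistics over a stack of hidden states. The sketch below is one plausible instantiation, not the paper's definition: the dispersion is taken as the mean cosine distance of tokens to their layer centroid, the drift as the mean cosine distance between consecutive layers for the tokens with the highest attention scores, and how the two are combined into a final score is deliberately left out. All function names and the `top_k` parameter are hypothetical.

```python
import numpy as np


def _unit(x):
    """Normalize vectors along the last axis (small epsilon avoids division by zero)."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-12)


def intra_layer_dispersion(hidden):
    """hidden: (num_layers, num_tokens, dim). Mean cosine distance to the layer centroid."""
    h = _unit(np.asarray(hidden, dtype=float))
    centroid = _unit(h.mean(axis=1, keepdims=True))
    return float((1.0 - (h * centroid).sum(axis=-1)).mean())


def inter_layer_drift(hidden, attn_scores, top_k):
    """Mean cosine distance between consecutive layers for the top-k attended tokens."""
    key_tokens = np.argsort(np.asarray(attn_scores))[-top_k:]
    h = _unit(np.asarray(hidden, dtype=float)[:, key_tokens, :])
    cos = (h[1:] * h[:-1]).sum(axis=-1)  # shape: (num_layers - 1, top_k)
    return float((1.0 - cos).mean())


# Toy input: 2 layers, 2 tokens, 2 dimensions. Tokens agree within each layer,
# but every token rotates by 90 degrees between layer 0 and layer 1.
hidden = np.array([
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
])
attn_scores = np.array([0.7, 0.3])
dispersion = intra_layer_dispersion(hidden)
drift = inter_layer_drift(hidden, attn_scores, top_k=1)
```

Both statistics need only a forward pass with hidden states and attention exposed, which matches the training-free, label-free setting described above.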

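Paper and project links

PDF under review

Summary: Large language models (LLMs) have achieved remarkable results, but their use in high-stakes domains such as finance, security, and healthcare is limited by non-factual outputs, known as “hallucination”. This paper revisits hallucination detection from the perspective of model architecture and generation dynamics. Using the multi-layer structure and autoregressive decoding process of LLMs, it decomposes hallucination signals into two complementary dimensions: the semantic breadth of token representations within each layer, and the semantic depth of core concepts as they evolve across layers. On this basis it proposes D$^2$HScore, a training-free and label-free framework that jointly measures (1) Intra-Layer Dispersion, which quantifies the semantic diversity of token representations within each layer, and (2) Inter-Layer Drift, which tracks the progressive transformation of key token representations across layers. Attention signals guide token selection so that drift reflects meaningful semantics rather than noisy or redundant tokens. By capturing both the horizontal (within-layer) and vertical (across-layer) dynamics during inference, D$^2$HScore provides an interpretable and lightweight proxy for hallucination detection.

Key Takeaways:

LLMs can generate non-factual content (hallucination), which is especially problematic in high-stakes domains such as finance, security, and healthcare.
The paper revisits hallucination detection from the perspective of model architecture and generation dynamics.
D$^2$HScore is a training-free, label-free framework that jointly measures within-layer semantic diversity and across-layer semantic change.
Attention signals guide token selection so that the measured drift reflects meaningful semantics.
Experiments on five open-source LLMs and five widely used benchmarks show that D$^2$HScore consistently outperforms existing training-free baselines.
D$^2$HScore is an interpretable and lightweight proxy that captures both horizontal (within-layer) and vertical (across-layer) dynamics during inference.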

UI-S1: Advancing GUI Automation via Semi-online Reinforcement Learning

Authors:Zhengxi Lu, Jiabo Ye, Fei Tang, Yongliang Shen, Haiyang Xu, Ziwei Zheng, Weiming Lu, Ming Yan, Fei Huang, Jun Xiao, Yueting Zhuang

Graphical User Interface (GUI) agents have demonstrated remarkable progress in automating complex user interface interactions through reinforcement learning. However, current approaches face a fundamental dilemma: offline RL enables stable training on pre-collected trajectories, but struggles with multi-step task execution for lack of trajectory-level reward signals; online RL captures these signals through environment interaction, but suffers from sparse rewards and prohibitive deployment costs. To address this dilemma, we present Semi-online Reinforcement Learning, a novel paradigm that simulates online RL on offline trajectories. During each rollout process, we preserve the original model output within the multi-turn dialogue, where a Patch Module adaptively recovers the divergence between rollout and expert trajectories. To capture long-term training signals, Semi-online RL introduces discounted future returns into the reward computation and optimizes the policy with weighted step-level and episode-level advantages. We further introduce Semi-Online Performance (SOP), a metric that aligns better with true online performance, serving as a practical and effective proxy for real-world evaluation. Experiments show that our Semi-online RL achieves SOTA performance among 7B models across four dynamic benchmarks, with significant gains over the base model (e.g., +12.0% on AndroidWorld, +23.8% on AITW), demonstrating significant progress in bridging the gap between offline training efficiency and online multi-turn reasoning. The code is available at https://github.com/X-PLUG/MobileAgent/tree/main/UI-S1.
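
The reward shaping described above has two ingredients: discounted future returns, G_t = r_t + γ·G_{t+1}, and an advantage that mixes a step-level and an episode-level term. The sketch below shows one way these could fit together, assuming group-wise standardization across rollouts of the same task. That normalization, the additive weighting, and the names `combined_advantages` and `episode_weight` are assumptions made for illustration, not the paper's exact formulation.

```python
import numpy as np


def discounted_returns(rewards, gamma):
    """G_t = r_t + gamma * G_{t+1}, computed backwards over one rollout."""
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


def _standardize(x, axis=None):
    """Zero-mean, unit-variance scaling (epsilon guards against zero variance)."""
    mean = x.mean(axis=axis, keepdims=True)
    std = x.std(axis=axis, keepdims=True)
    return (x - mean) / (std + 1e-8)


def combined_advantages(group_rewards, gamma=0.5, episode_weight=1.0):
    """group_rewards: (num_rollouts, num_steps) step rewards for one task.

    Step-level term: compare rollouts at the same step via their discounted returns.
    Episode-level term: compare rollouts by their return from the first step.
    """
    returns = np.stack([discounted_returns(r, gamma) for r in group_rewards])
    step_adv = _standardize(returns, axis=0)
    episode_adv = _standardize(returns[:, 0])[:, None]
    return step_adv + episode_weight * episode_adv


example_returns = discounted_returns([1.0, 0.0, 1.0], gamma=0.5)
group = np.array([[1.0, 1.0, 1.0],
                  [0.0, 0.0, 0.0]])
advantages = combined_advantages(group)
```

With discounting, a correct early step is credited for the successes it enables later, which is the long-horizon signal that plain per-step offline rewards lack.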

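Paper and project links

PDF 22 pages, 17 figures

Summary

The paper presents a GUI automation approach based on Semi-online Reinforcement Learning. The method combines the strengths of offline and online RL by simulating online RL on offline trajectories. A Patch Module recovers the divergence between rollout and expert trajectories, and the policy is optimized with weighted step-level and episode-level advantages. The paper also introduces Semi-Online Performance (SOP), a metric that aligns better with true online performance. Experiments show state-of-the-art results among 7B models across four dynamic benchmarks, with significant gains on AndroidWorld and AITW.

Key Takeaways

Semi-online RL combines the advantages of offline RL and online RL.
A Patch Module recovers the divergence between rollout and expert trajectories.
Discounted future returns and weighted step-level and episode-level advantages capture long-term training signals.
Semi-Online Performance (SOP) is a metric that aligns better with true online performance.
The method achieves SOTA performance among 7B models across four dynamic benchmarks.
Gains over the base model are significant, e.g., +12.0% on AndroidWorld and +23.8% on AITW.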

HARP: Hallucination Detection via Reasoning Subspace Projection

Authors:Junjie Hu, Gang Tu, ShengYu Cheng, Jinxin Li, Jinting Wang, Rui Chen, Zhilong Zhou, Dongbo Shan

Hallucinations in Large Language Models (LLMs) pose a major barrier to their reliable use in critical decision-making. Although existing hallucination detection methods have improved accuracy, they still struggle with disentangling semantic and reasoning information and maintaining robustness. To address these challenges, we propose HARP (Hallucination detection via reasoning subspace projection), a novel hallucination detection framework. HARP establishes that the hidden state space of LLMs can be decomposed into a direct sum of a semantic subspace and a reasoning subspace, where the former encodes linguistic expression and the latter captures internal reasoning processes. Moreover, we demonstrate that the Unembedding layer can disentangle these subspaces, and by applying Singular Value Decomposition (SVD) to its parameters, the basis vectors spanning the semantic and reasoning subspaces are obtained. Finally, HARP projects hidden states onto the basis vectors of the reasoning subspace, and the resulting projections are then used as input features for hallucination detection in LLMs. By using these projections, HARP reduces the dimension of the feature to approximately 5% of the original, filters out most noise, and achieves enhanced robustness. Experiments across multiple datasets show that HARP achieves state-of-the-art hallucination detection performance; in particular, it achieves an AUROC of 92.8% on TriviaQA, outperforming the previous best method by 7.5%.
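
The projection step can be sketched with a plain SVD. The abstract states that the basis vectors come from applying SVD to the Unembedding layer parameters but does not say which singular directions form the reasoning subspace. The sketch below assumes it is spanned by the right singular vectors with the smallest singular values, that is, the directions that influence the output logits least. That choice, the `keep_fraction` parameter, and the function names are assumptions made for illustration.

```python
import numpy as np


def reasoning_basis(unembedding, keep_fraction=0.05):
    """unembedding: (vocab_size, hidden_dim) matrix mapping hidden states to logits.

    Returns (k, hidden_dim) orthonormal rows: the right singular vectors with the
    smallest singular values (ASSUMED here to span the reasoning subspace).
    """
    _, _, vt = np.linalg.svd(unembedding, full_matrices=False)
    k = max(1, int(round(keep_fraction * unembedding.shape[1])))
    return vt[-k:]  # singular values are sorted in descending order


def project(hidden_states, basis):
    """hidden_states: (n, hidden_dim) -> (n, k) coordinates in the chosen subspace."""
    return hidden_states @ basis.T


rng = np.random.default_rng(0)
W = rng.normal(size=(50, 20))            # toy unembedding: vocab 50, hidden dim 20
basis = reasoning_basis(W, keep_fraction=0.25)
features = project(rng.normal(size=(3, 20)), basis)
```

The resulting low-dimensional features would then feed a separate hallucination classifier, which is outside the scope of this sketch.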

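Paper and project links

PDF

Summary

The paper proposes HARP, a new hallucination detection framework for large language models (LLMs). HARP decomposes the hidden state space into a semantic subspace and a reasoning subspace, obtains their basis vectors by applying Singular Value Decomposition (SVD) to the Unembedding layer parameters, and projects hidden states onto the reasoning subspace to produce features for hallucination detection. This reduces the feature dimension to approximately 5% of the original, filters out most noise, and improves robustness. Across multiple datasets HARP achieves state-of-the-art detection performance, including an AUROC of 92.8% on TriviaQA.

Key Takeaways

Hallucinations in LLMs are a major barrier to their reliable use in critical decision-making.
Existing hallucination detection methods still struggle to disentangle semantic and reasoning information and to remain robust.
HARP addresses these challenges by decomposing the hidden state space into a semantic subspace and a reasoning subspace.
The Unembedding layer can disentangle these subspaces; applying SVD to its parameters yields the basis vectors spanning them.
HARP projects hidden states onto the basis vectors of the reasoning subspace and uses the projections as input features for hallucination detection.
HARP reduces the feature dimension, filters out most noise, improves robustness, and achieves state-of-the-art performance on multiple datasets.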

Trading-R1: Financial Trading with LLM Reasoning via Reinforcement Learning

Authors:Yijia Xiao, Edward Sun, Tong Chen, Fang Wu, Di Luo, Wei Wang

Developing professional, structured reasoning on par with human financial analysts and traders remains a central challenge in AI for finance, where markets demand interpretability and trust. Traditional time-series models lack explainability, while LLMs face challenges in turning natural-language analysis into disciplined, executable trades. Although reasoning LLMs have advanced in step-by-step planning and verification, their application to risk-sensitive financial decisions is underexplored. We present Trading-R1, a financially-aware model that incorporates strategic thinking and planning for comprehensive thesis composition, facts-grounded analysis, and volatility-adjusted decision making. Trading-R1 aligns reasoning with trading principles through supervised fine-tuning and reinforcement learning with a three-stage easy-to-hard curriculum. Training uses Tauric-TR1-DB, a 100k-sample corpus spanning 18 months, 14 equities, and five heterogeneous financial data sources. Evaluated on six major equities and ETFs, Trading-R1 demonstrates improved risk-adjusted returns and lower drawdowns compared to both open-source and proprietary instruction-following models as well as reasoning models. The system generates structured, evidence-based investment theses that support disciplined and interpretable trading decisions. Trading-R1 Terminal will be released at https://github.com/TauricResearch/Trading-R1.

Paper and project links

PDF Tauric Research: https://github.com/TauricResearch

Summary

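This paper addresses a central challenge in AI for finance: developing professional, structured reasoning on par with human financial analysts and traders. It presents Trading-R1, a financially-aware model that incorporates strategic thinking and planning for comprehensive thesis composition, facts-grounded analysis, and volatility-adjusted decision making. Trading-R1 aligns its reasoning with trading principles through supervised fine-tuning and reinforcement learning with a three-stage easy-to-hard curriculum. It is trained on Tauric-TR1-DB, a 100k-sample corpus spanning 18 months, 14 equities, and five heterogeneous financial data sources. Evaluated on six major equities and ETFs, Trading-R1 shows improved risk-adjusted returns and lower drawdowns than open-source and proprietary instruction-following models as well as reasoning models. It generates structured, evidence-based investment theses that support disciplined, interpretable trading decisions. Trading-R1 Terminal will be released at https://github.com/TauricResearch/Trading-R1.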

Key Takeaways

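Developing professional, structured reasoning on par with human financial analysts and traders is a central challenge in AI for finance.
Trading-R1 is a financially-aware model that incorporates strategic thinking and planning for thesis composition, facts-grounded analysis, and volatility-adjusted decision making.
Trading-R1 is trained with supervised fine-tuning and reinforcement learning, following a three-stage easy-to-hard curriculum.
Training uses Tauric-TR1-DB, a 100k-sample corpus covering 18 months, 14 equities, and five heterogeneous financial data sources.
On six major equities and ETFs, Trading-R1 achieves improved risk-adjusted returns and lower drawdowns than instruction-following and reasoning models.
The model generates structured, evidence-based investment theses that support disciplined and interpretable trading decisions.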
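
To make "risk-adjusted returns" and "drawdowns" concrete, the sketch below computes two standard quantities: an annualized Sharpe ratio and the maximum drawdown of a return series. The abstract does not say which exact metrics the paper reports, so these are offered as common examples; the function names, the zero risk-free rate, and the sample returns are all assumptions.

```python
import math

def sharpe_ratio(returns, periods_per_year=252):
    """Annualized Sharpe ratio of per-period returns (risk-free rate assumed 0)."""
    n = len(returns)
    if n < 2:
        return 0.0
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / (n - 1)  # sample variance
    std = math.sqrt(var)
    return 0.0 if std == 0 else mean / std * math.sqrt(periods_per_year)

def max_drawdown(returns):
    """Largest peak-to-trough decline of the compounded equity curve, as a fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak)
    return worst

daily = [0.01, -0.02, 0.015, -0.005, 0.02]  # made-up daily returns
print(sharpe_ratio(daily), max_drawdown(daily))
```

A strategy with a higher Sharpe ratio earns more return per unit of volatility, and a lower maximum drawdown means a smaller worst-case loss from a previous peak.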

Continually Adding New Languages to Multilingual Language Models

Authors:Abraham Toluwase Owodunni, Sachin Kumar

Multilingual language models are trained on a fixed set of languages, and to support new languages, the models need to be retrained from scratch. This is an expensive endeavor and is often infeasible, as model developers tend not to release their pre-training data. Naive approaches, such as continued pretraining, suffer from catastrophic forgetting; however, mitigation strategies like experience replay cannot be applied due to the lack of original pretraining data. In this work, we investigate the problem of continually adding new languages to a multilingual model, assuming access to pretraining data in only the target languages. We explore multiple approaches to address this problem and propose Layer-Selective LoRA (LayRA), which adds Low-Rank Adapters (LoRA) to selected initial and final layers while keeping the rest of the model frozen. LayRA builds on two insights: (1) LoRA reduces forgetting, and (2) multilingual models encode inputs in the source language in the initial layers, reason in English in intermediate layers, and translate back to the source language in final layers. We experiment with adding multiple combinations of Galician, Swahili, and Urdu to pretrained language models and evaluate each method on diverse multilingual tasks. We find that LayRA provides the overall best tradeoff between preserving models’ capabilities in previously supported languages, while being competitive with existing approaches such as LoRA in learning new languages. We also demonstrate that using model arithmetic, the adapted models can be equipped with strong instruction following abilities without access to any instruction tuning data in the target languages.

Paper and project links

PDF

Summary

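This paper studies how to continually add new languages to an existing multilingual model. Retraining from scratch to support new languages is expensive and often infeasible because model developers tend not to release their pre-training data. For the same reason, the catastrophic forgetting caused by continued pretraining cannot be mitigated with experience replay. The authors explore several approaches and propose Layer-Selective LoRA (LayRA), which adds Low-Rank Adapters (LoRA) to selected initial and final layers while keeping the rest of the model frozen. LayRA builds on two insights: LoRA reduces forgetting, and multilingual models encode inputs in the source language in the initial layers, reason in English in the intermediate layers, and translate back to the source language in the final layers. In experiments adding combinations of Galician, Swahili, and Urdu, LayRA offers the best overall tradeoff: it preserves capabilities in previously supported languages while remaining competitive with approaches such as LoRA in learning new languages. Using model arithmetic, the adapted models can also gain strong instruction-following abilities without any instruction-tuning data in the target languages.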

Key Takeaways

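Supporting a new language in a multilingual model normally means retraining from scratch, which is expensive and often infeasible.
Naive approaches such as continued pretraining suffer from catastrophic forgetting.
Mitigation strategies such as experience replay cannot be applied because the original pretraining data is unavailable.
The authors propose Layer-Selective LoRA (LayRA) to address this problem.
LayRA adds Low-Rank Adapters (LoRA) to selected initial and final layers and keeps the rest of the model frozen, which reduces forgetting.
Experiments show that LayRA preserves capabilities in previously supported languages while remaining competitive in learning new languages.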
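
The layer-selection idea and the model-arithmetic step can be illustrated with a small pure-Python sketch. It is a sketch under stated assumptions, not the paper's code: the helper names, the number of adapted layers, and the specific arithmetic (adding the difference between an instruction-tuned model and its base to the adapted weights) are hypothetical choices, since the abstract only says that model arithmetic is used.

```python
def layra_layers(num_layers, n_first, n_last):
    """Indices of layers that receive LoRA adapters; all other layers stay frozen."""
    first = list(range(min(n_first, num_layers)))
    last = list(range(max(num_layers - n_last, 0), num_layers))
    return sorted(set(first + last))

def lora_params(d_in, d_out, rank):
    """Trainable parameters of one LoRA update B @ A, with A: rank x d_in, B: d_out x rank."""
    return rank * (d_in + d_out)

def add_task_vector(adapted, base, instruct):
    """Assumed form of model arithmetic: adapted + (instruct - base), per parameter."""
    return {name: adapted[name] + (instruct[name] - base[name]) for name in adapted}

print(layra_layers(num_layers=12, n_first=2, n_last=2))   # adapters on the outer layers only
print(lora_params(d_in=4096, d_out=4096, rank=8))         # vs. 4096 * 4096 for a full matrix
print(add_task_vector({"w": 1.0}, {"w": 0.5}, {"w": 0.75}))
```

Restricting adapters to the outer layers follows the stated intuition that language-specific encoding and decoding happen there, while the frozen middle layers retain what the model already knows.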

Traffic-MLLM: A Spatio-Temporal MLLM with Retrieval-Augmented Generation for Causal Inference in Traffic

Authors:Waikit Xiu, Qiang Lu, Xiying Li, Chen Hu, Shengbo Sun

As intelligent transportation systems advance, traffic video understanding plays an increasingly pivotal role in comprehensive scene perception and causal analysis. Yet, existing approaches face notable challenges in accurately modeling spatiotemporal causality and integrating domain-specific knowledge, limiting their effectiveness in complex scenarios. To address these limitations, we propose Traffic-MLLM, a multimodal large language model tailored for fine-grained traffic analysis. Built on the Qwen2.5-VL backbone, our model leverages high-quality traffic-specific multimodal datasets and uses Low-Rank Adaptation (LoRA) for lightweight fine-tuning, significantly enhancing its capacity to model continuous spatiotemporal features in video sequences. Furthermore, we introduce an innovative knowledge prompting module fusing Chain-of-Thought (CoT) reasoning with Retrieval-Augmented Generation (RAG), enabling precise injection of detailed traffic regulations and domain knowledge into the inference process. This design markedly boosts the model’s logical reasoning and knowledge adaptation capabilities. Experimental results on TrafficQA and DriveQA benchmarks show Traffic-MLLM achieves state-of-the-art performance, validating its superior ability to process multimodal traffic data. It also exhibits remarkable zero-shot reasoning and cross-scenario generalization capabilities.

Paper and project links

PDF

Summary

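As intelligent transportation systems advance, traffic video understanding plays an increasingly important role in comprehensive scene perception and causal analysis. Existing approaches, however, struggle to model spatiotemporal causality accurately and to integrate domain-specific knowledge. To address this, the authors propose Traffic-MLLM, a multimodal large language model for fine-grained traffic analysis. Built on the Qwen2.5-VL backbone, the model uses high-quality traffic-specific multimodal datasets and Low-Rank Adaptation (LoRA) for lightweight fine-tuning, which improves its ability to model continuous spatiotemporal features in video sequences. A knowledge prompting module that fuses Chain-of-Thought (CoT) reasoning with Retrieval-Augmented Generation (RAG) injects detailed traffic regulations and domain knowledge into the inference process, boosting logical reasoning and knowledge adaptation. On the TrafficQA and DriveQA benchmarks, Traffic-MLLM achieves state-of-the-art performance and shows strong zero-shot reasoning and cross-scenario generalization.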

Key Takeaways

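Traffic video understanding is increasingly important in intelligent transportation systems, but existing methods struggle to model spatiotemporal causality and to integrate domain knowledge.
Traffic-MLLM addresses these challenges by combining traffic-specific multimodal data with Low-Rank Adaptation (LoRA) fine-tuning to better model continuous spatiotemporal features.
A knowledge prompting module fuses Chain-of-Thought (CoT) reasoning with Retrieval-Augmented Generation (RAG) to inject traffic regulations and domain knowledge precisely.
Traffic-MLLM achieves state-of-the-art results on the TrafficQA and DriveQA benchmarks, with strong zero-shot reasoning and cross-scenario generalization.
The model uses Qwen2.5-VL as its backbone and is fine-tuned on high-quality traffic-specific multimodal datasets.
The knowledge prompting module markedly improves the model's logical reasoning and knowledge adaptation in complex traffic scenes.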
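
A knowledge prompting module of the kind described above might, in its simplest form, retrieve relevant rules and place them in a prompt that asks for step-by-step reasoning. The sketch below shows that general pattern only. The word-overlap retriever, the prompt wording, and the example rules are assumptions for illustration and say nothing about how Traffic-MLLM actually retrieves or prompts.

```python
def retrieve(query, documents, top_k=2):
    """Rank documents by word overlap with the query (a stand-in for a real retriever)."""
    q = set(query.lower().split())
    ranked = sorted(documents, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return ranked[:top_k]

def build_prompt(question, rules):
    """Assemble a prompt: retrieved rules first, then a chain-of-thought instruction."""
    context = "\n".join(f"- {rule}" for rule in rules)
    return (
        f"Relevant traffic rules:\n{context}\n\n"
        f"Question: {question}\n"
        "Think step by step, citing the rules above, then give the answer."
    )

rules = [
    "Vehicles must stop at a red light.",
    "Pedestrians have right of way at marked crossings.",
    "Overtaking is prohibited on solid lines.",
]
question = "Why did the car stop at the red light?"
prompt = build_prompt(question, retrieve(question, rules, top_k=1))
print(prompt)
```

In a real system the retriever would typically use dense embeddings and the prompt would accompany video frames, but the flow of retrieve, then prompt, then reason is the same.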

Nav-R1: Reasoning and Navigation in Embodied Scenes

Authors:Qingxiang Liu, Ting Huang, Zeyu Zhang, Hao Tang

Embodied navigation requires agents to integrate perception, reasoning, and action for robust interaction in complex 3D environments. Existing approaches often suffer from incoherent and unstable reasoning traces that hinder generalization across diverse environments, and difficulty balancing long-horizon semantic reasoning with low-latency control for real-time navigation. To address these challenges, we propose Nav-R1, an embodied foundation model that unifies reasoning in embodied environments. We first construct Nav-CoT-110K, a large-scale dataset of step-by-step Chains-of-Thought (CoT) for embodied tasks, which enables cold-start initialization with structured reasoning. Building on this foundation, we design a GRPO-based reinforcement learning framework with three complementary rewards: format, understanding, and navigation, to improve structural adherence, semantic grounding, and path fidelity. Furthermore, we introduce a Fast-in-Slow reasoning paradigm, decoupling deliberate semantic reasoning from low-latency reactive control for efficient yet coherent navigation. Extensive evaluations on embodied AI benchmarks demonstrate that Nav-R1 consistently outperforms strong baselines, with over 8% average improvement in reasoning and navigation performance. Real-world deployment on a mobile robot further validates its robustness under limited onboard resources. Code: https://github.com/AIGeeksGroup/Nav-R1. Website: https://aigeeksgroup.github.io/Nav-R1.

Paper and project links

PDF

Summary

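This paper proposes Nav-R1, an embodied foundation model that unifies reasoning in embodied environments. The authors construct Nav-CoT-110K, a large-scale dataset of step-by-step Chains-of-Thought (CoT) for embodied tasks, to enable cold-start initialization, and then train with a GRPO-based reinforcement learning framework to improve structural adherence, semantic grounding, and path fidelity. A Fast-in-Slow reasoning paradigm decouples deliberate semantic reasoning from low-latency reactive control for efficient yet coherent navigation. Nav-R1 outperforms strong baselines on embodied AI benchmarks by over 8% on average in reasoning and navigation performance, and a real-world deployment on a mobile robot validates its robustness under limited onboard resources.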

Key Takeaways

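Nav-R1 integrates perception, reasoning, and action so that agents can interact robustly in complex 3D environments.
Existing approaches suffer from incoherent and unstable reasoning traces, which hinders generalization across diverse environments.
The Nav-CoT-110K dataset enables cold-start initialization with structured reasoning for embodied tasks.
A GRPO-based reinforcement learning framework uses three complementary rewards (format, understanding, and navigation) to improve structural adherence, semantic grounding, and path fidelity.
The Fast-in-Slow reasoning paradigm decouples deliberate semantic reasoning from low-latency reactive control, giving efficient yet coherent navigation.
Nav-R1 outperforms strong baselines on embodied AI benchmarks, with over 8% average improvement.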
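
The combination of three rewards with group-relative optimization can be sketched as follows. This shows the generic GRPO-style advantage computation (standardizing rewards within a group of sampled responses), not Nav-R1's training code. The equal reward weights, the function names, and the sample reward values are assumptions.

```python
import statistics

def total_reward(fmt, understanding, navigation, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three reward terms (equal weights are an assumption)."""
    return weights[0] * fmt + weights[1] * understanding + weights[2] * navigation

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each reward within its sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Three sampled responses to the same navigation prompt, with made-up reward terms.
group = [
    total_reward(1, 0.5, 0.2),
    total_reward(1, 0.9, 0.8),
    total_reward(0, 0.4, 0.1),
]
advantages = group_relative_advantages(group)
print(advantages)
```

Responses that score above the group mean receive positive advantages and are reinforced, so no separate value network is needed to provide a baseline.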

GTHNA: Local-global Graph Transformer with Memory Reconstruction for Holistic Node Anomaly Evaluation

Authors:Mingkang Li, Xuexiong Luo, Yue Zhang, Yaoyang Li, Fu Lin

Anomaly detection in graph-structured data is an inherently challenging problem, as it requires the identification of rare nodes that deviate from the majority in both their structural and behavioral characteristics. Existing methods, such as those based on graph convolutional networks (GCNs), often suffer from over-smoothing, which causes the learned node representations to become indistinguishable. Furthermore, graph reconstruction-based approaches are vulnerable to anomalous node interference during the reconstruction process, leading to inaccurate anomaly detection. In this work, we propose a novel and holistic anomaly evaluation framework that integrates three key components: a local-global Transformer encoder, a memory-guided reconstruction mechanism, and a multi-scale representation matching strategy. These components work synergistically to enhance the model’s ability to capture both local and global structural dependencies, suppress the influence of anomalous nodes, and assess anomalies from multiple levels of granularity. Anomaly scores are computed by combining reconstruction errors and memory matching signals, resulting in a more robust evaluation. Extensive experiments on seven benchmark datasets demonstrate that our method outperforms existing state-of-the-art approaches, offering a comprehensive and generalizable solution for anomaly detection across various graph domains.

Paper and project links

PDF 9 pages, 7 figures

Summary

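This paper proposes an anomaly evaluation framework that combines a local-global Transformer encoder, a memory-guided reconstruction mechanism, and a multi-scale representation matching strategy. It captures both local and global structural dependencies in graphs, suppresses the influence of anomalous nodes, and assesses anomalies at multiple levels of granularity. Anomaly scores combine reconstruction errors with memory matching signals. Experiments on seven benchmark datasets show that the method outperforms existing state-of-the-art approaches, offering a comprehensive and generalizable solution for anomaly detection across graph domains.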

Key Takeaways

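Anomaly detection in graph-structured data is challenging because it requires identifying rare nodes that deviate from the majority in both structural and behavioral characteristics.
Existing methods, such as those based on graph convolutional networks (GCNs), suffer from over-smoothing, which makes learned node representations indistinguishable.
Graph reconstruction-based approaches are vulnerable to interference from anomalous nodes during reconstruction, leading to inaccurate detection.
The proposed framework integrates a local-global Transformer encoder, a memory-guided reconstruction mechanism, and a multi-scale representation matching strategy.
The framework strengthens the model's ability to capture local and global structural dependencies, suppresses the influence of anomalous nodes, and assesses anomalies at multiple levels of granularity.
Extensive experiments on seven benchmark datasets show that the method outperforms existing state-of-the-art approaches.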
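
The abstract states that anomaly scores combine reconstruction errors with memory matching signals. One simple way such a combination could look is sketched below. The Euclidean distances, the nearest-prototype matching, the weight `alpha`, and the toy vectors are all hypothetical and are not taken from the paper.

```python
import math

def recon_error(x, x_hat):
    """Euclidean distance between a node's features and its reconstruction."""
    return math.dist(x, x_hat)

def memory_distance(z, memory):
    """Distance from a node embedding to its nearest memory prototype."""
    return min(math.dist(z, m) for m in memory)

def anomaly_score(x, x_hat, z, memory, alpha=0.5):
    """Convex combination of the two signals; alpha is a hypothetical weight."""
    return alpha * recon_error(x, x_hat) + (1 - alpha) * memory_distance(z, memory)

memory = [(0.0, 0.0), (1.0, 1.0)]  # toy prototypes of "normal" patterns
normal = anomaly_score((1.0, 1.0), (1.0, 0.9), (0.9, 1.0), memory)
outlier = anomaly_score((5.0, 5.0), (1.0, 1.0), (4.0, 4.0), memory)
print(normal, outlier)
```

A node that is both poorly reconstructed and far from every stored normal pattern receives a high score, which is the intuition behind using two signals rather than one.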

K2-Think: A Parameter-Efficient Reasoning System

Authors:Zhoujun Cheng, Richard Fan, Shibo Hao, Taylor W. Killian, Haonan Li, Suqi Sun, Hector Ren, Alexander Moreno, Daqian Zhang, Tianjun Zhong, Yuxin Xiong, Yuanzhe Hu, Yutao Xie, Xudong Han, Yuqi Wang, Varad Pimpalkhute, Yonghao Zhuang, Aaryamonvikram Singh, Xuezhi Liang, Anze Xie, Jianshu She, Desai Fan, Chengqian Gao, Liqun Ma, Mikhail Yurochkin, John Maggs, Xuezhe Ma, Guowei He, Zhiting Hu, Zhengzhong Liu, Eric P. Xing

K2-Think is a reasoning system that achieves state-of-the-art performance with a 32B parameter model, matching or surpassing much larger models like GPT-OSS 120B and DeepSeek v3.1. Built on the Qwen2.5 base model, our system shows that smaller models can compete at the highest levels by combining advanced post-training and test-time computation techniques. The approach is based on six key technical pillars: Long Chain-of-thought Supervised Finetuning, Reinforcement Learning with Verifiable Rewards (RLVR), Agentic planning prior to reasoning, Test-time Scaling, Speculative Decoding, and Inference-optimized Hardware, all using publicly available open-source datasets. K2-Think excels in mathematical reasoning, achieving state-of-the-art scores on public benchmarks for open-source models, while also performing strongly in other areas such as Code and Science. Our results confirm that a more parameter-efficient model like K2-Think 32B can compete with state-of-the-art systems through an integrated post-training recipe that includes long chain-of-thought training and strategic inference-time enhancements, making open-source reasoning systems more accessible and affordable. K2-Think is freely available at k2think.ai, offering best-in-class inference speeds of over 2,000 tokens per second per request via the Cerebras Wafer-Scale Engine.

Paper and project links

PDF To access the K2-Think reasoning system, please visit www.k2think.ai

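Summary

K2-Think is a reasoning system built on the Qwen2.5 base model that reaches state-of-the-art performance with a 32B-parameter model, matching or surpassing much larger models such as GPT-OSS 120B and DeepSeek v3.1. It combines advanced post-training and test-time computation techniques organized around six technical pillars, all using publicly available open-source datasets. K2-Think excels in mathematical reasoning, achieving state-of-the-art scores on public benchmarks for open-source models, and also performs strongly in areas such as Code and Science. The results indicate that a parameter-efficient model can compete with state-of-the-art systems, making open-source reasoning systems more accessible and affordable. K2-Think is freely available at k2think.ai, with inference speeds of over 2,000 tokens per second per request via the Cerebras Wafer-Scale Engine.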

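Key Takeaways

K2-Think reaches state-of-the-art performance with a 32B-parameter model built on the Qwen2.5 base model.
It matches or surpasses much larger models such as GPT-OSS 120B and DeepSeek v3.1.
The approach rests on six pillars: Long Chain-of-thought Supervised Finetuning, Reinforcement Learning with Verifiable Rewards (RLVR), Agentic planning prior to reasoning, Test-time Scaling, Speculative Decoding, and Inference-optimized Hardware.
All pillars use publicly available open-source datasets.
K2-Think excels in mathematical reasoning and also performs strongly in Code and Science.
The system is freely available at k2think.ai, serving over 2,000 tokens per second per request via the Cerebras Wafer-Scale Engine.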
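
The abstract lists Test-time Scaling among K2-Think's techniques without giving details. One widely used form of test-time scaling is best-of-N sampling, sketched below purely as an illustration. It is not claimed to be the procedure K2-Think uses, and `toy_generate` and `toy_score` are hypothetical stand-ins for a model call and a verifier.

```python
def best_of_n(generate, score, prompt, n=3):
    """Sample n candidate answers and return the highest-scoring one."""
    candidates = [generate(prompt, i) for i in range(n)]
    return max(candidates, key=score)

# Toy stand-ins: a "model" that proposes answers and a verifier that checks them.
def toy_generate(prompt, seed):
    return [41, 42, 40][seed % 3]

def toy_score(answer):
    return 1.0 if answer == 6 * 7 else 0.0

answer = best_of_n(toy_generate, toy_score, "What is 6 * 7?", n=3)
print(answer)
```

Spending more samples at inference time trades compute for accuracy, which is one way a smaller model can close the gap with a larger one.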

Another Turn, Better Output? A Turn-Wise Analysis of Iterative LLM Prompting

Authors:Shashidhar Reddy Javaji, Bhavul Gauri, Zining Zhu

Large language models (LLMs) are now used in multi-turn workflows, but we still lack a clear way to measure when iteration helps and when it hurts. We present an evaluation framework for iterative refinement that spans ideation, code, and math. Our protocol runs controlled 12-turn conversations per task, utilizing a variety of prompts ranging from vague "improve it" feedback to targeted steering, and logs per-turn outputs. We score outcomes with domain-appropriate checks (unit tests for code; answer-equivalence plus reasoning-soundness for math; originality and feasibility for ideation) and track turn-level behavior with three families of metrics: semantic movement across turns, turn-to-turn change, and output size growth. Across models and tasks, gains are domain-dependent: they arrive early in ideas and code, but in math late turns matter when guided by elaboration. After the first few turns, vague feedback often plateaus or reverses correctness, while targeted prompts reliably shift the intended quality axis (novelty vs. feasibility in ideation; speed vs. readability in code; in math, elaboration outperforms exploration and drives late-turn gains). We also observe consistent domain patterns: ideation moves more in meaning across turns, code tends to grow in size with little semantic change, and math starts fixed but can break that path with late, elaborative iteration. Together, the framework and metrics make iteration measurable and comparable across models, and signal when to steer, stop, or switch strategies.

Paper and project links

PDF

Summary

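LLMs are increasingly used in multi-turn workflows, yet there is no clear way to measure when iteration helps and when it hurts. This paper presents an evaluation framework for iterative refinement spanning ideation, code, and math. The protocol runs controlled 12-turn conversations per task, with prompts ranging from vague "improve it" feedback to targeted steering, and logs per-turn outputs. Outcomes are scored with domain-appropriate checks (unit tests for code; answer equivalence plus reasoning soundness for math; originality and feasibility for ideation), and turn-level behavior is tracked with three families of metrics: semantic movement across turns, turn-to-turn change, and output size growth. Gains are domain-dependent: they arrive early in ideation and code, while in math late turns matter when guided by elaboration. After the first few turns, vague feedback often plateaus or reverses correctness, whereas targeted prompts reliably shift the intended quality axis (novelty vs. feasibility in ideation; speed vs. readability in code; elaboration over exploration in math). Domain patterns are consistent: ideation moves more in meaning across turns, code tends to grow in size with little semantic change, and math starts fixed but can break that path with late, elaborative iteration. Together, the framework and metrics make iteration measurable and comparable across models, and signal when to steer, stop, or switch strategies.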

Key Takeaways

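Multi-turn use of LLMs needs a clear way to measure when iteration helps and when it hurts.
The paper proposes an evaluation framework for iterative refinement spanning ideation, code, and math.
Experiments run controlled 12-turn conversations per task with varied prompts and log every turn's output.
Gains are domain-dependent: they arrive early in ideation and code, while late turns matter in math.
Vague feedback often plateaus or reverses correctness after the first few turns, whereas targeted prompts reliably shift the intended quality axis.
Ideation, code, and math follow distinct, consistent patterns over the course of iteration.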
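
Two of the three metric families, turn-to-turn change and output size growth, can be approximated with the standard library alone, as sketched below. The paper's exact definitions are not given in the abstract, so the character-level similarity from `difflib` and the length ratio used here are assumed proxies. Semantic movement would require an embedding model and is left out.

```python
from difflib import SequenceMatcher

def turn_to_turn_change(outputs):
    """1 - similarity ratio between consecutive outputs (0.0 means identical)."""
    return [1.0 - SequenceMatcher(None, a, b).ratio() for a, b in zip(outputs, outputs[1:])]

def size_growth(outputs):
    """Length of each output relative to the first turn."""
    base = len(outputs[0]) or 1
    return [len(o) / base for o in outputs]

# Three made-up turns of a code-refinement conversation.
turns = [
    "def f(x): return x",
    "def f(x): return x",
    "def f(x):\n    # doubled\n    return 2 * x",
]
change = turn_to_turn_change(turns)
growth = size_growth(turns)
print(change, growth)
```

Tracking these per turn would reveal, for example, the pattern reported for code: outputs that keep growing in size while changing little in substance.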

RISE: Enhancing VLM Image Annotation with Self-Supervised Reasoning

Authors:Suhang Hu, Wei Hu, Yuhang Su, Fan Zhang

Vision-Language Models (VLMs) struggle with complex image annotation tasks, such as emotion classification and context-driven object detection, which demand sophisticated reasoning. Standard Supervised Fine-Tuning (SFT) focuses solely on annotation outcomes, ignoring underlying rationales, while Visual Reinforcement Fine-Tuning (Visual-RFT) produces inconsistent Chains of Thought (CoTs) due to the absence of high-quality, verified CoTs during pre-training. We introduce RISE (Reason-Inspire-Strengthen-Expertise), a two-stage framework to overcome these limitations. In the Reason stage (RISE-CoT), a reinforcement learning-driven “annotation-reasoning-annotation” closed-loop generates visually grounded, logically consistent CoTs by verifying their ability to reconstruct original annotations without direct leakage. The Inspire and Strengthen stage (RISE-R1) leverages a high-quality CoT subset, filtered by RISE-CoT rewards, for supervised fine-tuning, followed by reinforcement fine-tuning to produce interpretable reasoning and accurate annotations, achieving Expertise in complex visual tasks. Evaluated on complex and simple image annotation tasks, RISE-trained Qwen2-VL-2B outperforms SFT and Visual-RFT, achieving robust performance and enhanced explainability. RISE offers a self-supervised solution for advancing VLM reasoning without requiring manually annotated CoTs. Code and resources are available at: https://github.com/HSH55/RISE.

Paper and project links

PDF

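Summary

Vision-Language Models (VLMs) struggle with complex image annotation tasks such as emotion classification and context-driven object detection. Standard Supervised Fine-Tuning (SFT) focuses only on annotation outcomes and ignores the underlying rationales, while Visual Reinforcement Fine-Tuning (Visual-RFT) produces inconsistent Chains of Thought (CoTs) because high-quality, verified CoTs are absent during pre-training. The authors propose RISE (Reason-Inspire-Strengthen-Expertise), a two-stage framework that overcomes these limitations. In the Reason stage (RISE-CoT), a reinforcement learning-driven "annotation-reasoning-annotation" closed loop generates visually grounded, logically consistent CoTs by verifying that they can reconstruct the original annotations without direct leakage. In the Inspire and Strengthen stage (RISE-R1), a high-quality CoT subset filtered by RISE-CoT rewards is used for supervised fine-tuning, followed by reinforcement fine-tuning, to produce interpretable reasoning and accurate annotations. On both complex and simple image annotation tasks, RISE-trained Qwen2-VL-2B outperforms SFT and Visual-RFT with robust performance and enhanced explainability. RISE offers a self-supervised way to advance VLM reasoning without manually annotated CoTs. Code and resources are available at https://github.com/HSH55/RISE.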

Key Takeaways

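VLMs struggle with complex image annotation tasks that demand sophisticated reasoning.
Standard Supervised Fine-Tuning (SFT) ignores the underlying rationales, which limits performance.
Visual Reinforcement Fine-Tuning (Visual-RFT) produces inconsistent reasoning because high-quality CoTs are lacking.
RISE has two stages: the Reason stage (RISE-CoT) and the Inspire and Strengthen stage (RISE-R1).
RISE-CoT uses a closed loop to generate visually grounded, logically consistent CoTs.
RISE-R1 applies supervised fine-tuning on a high-quality CoT subset, followed by reinforcement fine-tuning, to improve accuracy and explainability.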
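
The "annotation-reasoning-annotation" loop rewards a chain of thought when the original annotation can be recovered from it without the chain simply stating the answer. A minimal sketch of such a reward is shown below. The binary reward, the substring-based leakage check, and the `toy_reconstruct` function are assumptions for illustration; the paper's actual reward design is not described in the abstract.

```python
def rise_cot_reward(cot, original, reconstruct):
    """Reward a chain of thought that lets the annotation be recovered without stating it.

    `reconstruct` is a hypothetical callable mapping a CoT to a predicted annotation.
    """
    if original.lower() in cot.lower():  # direct leakage: the label is quoted in the CoT
        return 0.0
    return 1.0 if reconstruct(cot) == original else 0.0

# Toy reconstructor: a keyword lookup standing in for a model call.
def toy_reconstruct(cot):
    return "happy" if "smiling" in cot else "neutral"

good = rise_cot_reward("The person is smiling broadly with raised cheeks.", "happy", toy_reconstruct)
leaky = rise_cot_reward("The person is smiling, so the label is happy.", "happy", toy_reconstruct)
print(good, leaky)
```

Chains that earn a high reward under a check like this could then be kept as the filtered subset used for supervised fine-tuning in the second stage.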

Kedreamix

https://kedreamix.github.io/Talk2Paper/Paper/2025-09-17/R1_Reasoning/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting.
