
R1_Reasoning


⚠️ All of the summaries below are produced by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: never rely on these for serious academic work; they are only a first-pass filter before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated 2025-11-19

Crossing Borders: A Multimodal Challenge for Indian Poetry Translation and Image Generation

Authors:Sofia Jamil, Kotla Sai Charan, Sriparna Saha, Koustava Goswami, Joseph K J

Indian poetry, known for its linguistic complexity and deep cultural resonance, has a rich and varied heritage spanning thousands of years. However, its layered meanings, cultural allusions, and sophisticated grammatical constructions often pose challenges for comprehension, especially for non-native speakers or readers unfamiliar with its context and language. Despite its cultural significance, existing works on poetry have largely overlooked Indian language poems. In this paper, we propose the Translation and Image Generation (TAI) framework, leveraging Large Language Models (LLMs) and Latent Diffusion Models through appropriate prompt tuning. Our framework supports the United Nations Sustainable Development Goals of Quality Education (SDG 4) and Reduced Inequalities (SDG 10) by enhancing the accessibility of culturally rich Indian-language poetry to a global audience. It includes (1) a translation module that uses an Odds Ratio Preference Alignment Algorithm to accurately translate morphologically rich poetry into English, and (2) an image generation module that employs a semantic graph to capture tokens, dependencies, and semantic relationships between metaphors and their meanings, to create visually meaningful representations of Indian poems. Our comprehensive experimental evaluation, including both human and quantitative assessments, demonstrates the superiority of TAI Diffusion in poem image generation tasks, outperforming strong baselines. To further address the scarcity of resources for Indian-language poetry, we introduce the Morphologically Rich Indian Language Poems MorphoVerse Dataset, comprising 1,570 poems across 21 low-resource Indian languages. By addressing the gap in poetry translation and visual comprehension, this work aims to broaden accessibility and enrich the reader’s experience.


Paper and Project Links

PDF

Summary
This paper proposes the Translation and Image Generation (TAI) framework, which leverages large language models and latent diffusion models with appropriate prompt tuning to support the translation of Indian poetry and the generation of matching images. The framework consists of a translation module, built on an Odds Ratio Preference Alignment algorithm, and an image generation module, built on semantic graphs, with the goal of improving the global accessibility of Indian-language poetry and enriching the reader's experience. The authors also introduce MorphoVerse, a dataset of poems in many low-resource Indian languages.

Key Takeaways

  1. Indian poetry is linguistically complex and culturally resonant, but its layered meanings and intricate grammar make it hard to understand for non-native speakers or readers unfamiliar with its context and language.
  2. Existing research on poetry has largely overlooked Indian-language poems.
  3. The proposed Translation and Image Generation (TAI) framework uses large language models and latent diffusion models to make Indian poetry accessible and appealing to a global audience.
  4. TAI comprises a translation module and an image generation module: the former translates morphologically rich poetry with an Odds Ratio Preference Alignment algorithm, while the latter uses semantic graphs to create visually meaningful representations of the poems.
  5. Comprehensive experiments on poem image generation demonstrate the superiority of the TAI framework over strong baselines.
  6. To address the scarcity of resources for Indian-language poetry, the authors introduce the MorphoVerse dataset of 1,570 poems across 21 low-resource Indian languages.
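The odds-ratio alignment step named in the translation module can be made concrete with a small sketch. This is not the paper's implementation: it assumes length-normalized sequence likelihoods for a preferred and a dispreferred translation, and shows only the odds-ratio penalty that ORPO-style objectives add on top of the usual language-modeling loss.

```python
import math

def odds(p):
    # Odds of a sequence whose length-normalized likelihood is p
    return p / (1.0 - p)

def orpo_penalty(p_chosen, p_rejected):
    """Odds-ratio preference term: negative log-sigmoid of the log
    odds ratio between the preferred and dispreferred translation."""
    log_ratio = math.log(odds(p_chosen)) - math.log(odds(p_rejected))
    sigmoid = 1.0 / (1.0 + math.exp(-log_ratio))
    return -math.log(sigmoid)

# The penalty shrinks as the model prefers the chosen translation more
weak = orpo_penalty(0.55, 0.45)
strong = orpo_penalty(0.80, 0.45)
```

In training, a term of this shape would be added, with a weighting coefficient, to the negative log-likelihood of the chosen translation.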

Cool Papers

Click here to view paper screenshots

CacheFlow: Compressive Streaming Memory for Efficient Long-Form Video Understanding

Authors:Shrenik Patel, Daivik Patel

Long-form video question answering (VQA) overwhelms current vision-language models (VLMs) because attention and key-value (KV) caches grow with runtime, forcing either expensive inference or near-sighted sliding windows. We introduce CacheFlow, a training-free pipeline that pairs Dynamic Token Dropping (DTD) with a compressive long-term memory. DTD prunes per-patch tokens online via cosine similarity to the previous frame, and surviving tokens are packed into fixed-size blocks. This online, per-frame processing makes our approach fundamentally suited for live streaming VQA. As blocks are processed, each one’s keys are summarized by a tiny recurrent encoder to form a retrieval index, while the block’s full KV pairs are offloaded and later rehydrated for generation, preserving answer fidelity. At inference, a consensus-based retrieval mechanism retrieves only the Top-K most relevant blocks and attends over both the retrieved and local context for precise, long-range reasoning. CacheFlow is drop-in, architecture-agnostic, and requires no fine-tuning. Experiments on both offline and streaming VQA benchmarks demonstrate that CacheFlow outperforms current strong baselines, while processing up to 87% fewer tokens. Our dual approach enables VLMs to be both efficient and context-aware, paving the way for practical long-form video understanding.


Summary

CacheFlow is a training-free pipeline for long-form video question answering (VQA) that pairs Dynamic Token Dropping (DTD) with a compressive long-term memory to address the challenges facing current vision-language models (VLMs). DTD prunes each frame's tokens online, using cosine similarity to the previous frame to decide which survive, and packs survivors into fixed-size blocks; this online, per-frame processing makes CacheFlow well suited to live streaming VQA. As blocks are processed, their keys are summarized by a tiny recurrent encoder into a retrieval index, while the full KV pairs are offloaded and later rehydrated for answer generation. At inference, a consensus-based retrieval mechanism fetches only the Top-K most relevant blocks and attends over both the retrieved and the local context for precise long-range reasoning. CacheFlow is drop-in, architecture-agnostic, and needs no fine-tuning. On offline and streaming VQA benchmarks it outperforms strong baselines while processing up to 87% fewer tokens, making VLMs both efficient and context-aware.

Key Takeaways

  1. CacheFlow addresses the challenges vision-language models (VLMs) face in long-form video question answering (VQA).
  2. It pairs Dynamic Token Dropping (DTD) with a compressive long-term memory to streamline processing.
  3. DTD processes tokens online, frame by frame, pruning them by cosine similarity to the previous frame.
  4. Its efficient online processing makes CacheFlow well suited to live streaming VQA.
  5. A tiny recurrent encoder summarizes each block's keys into a retrieval index.
  6. CacheFlow processes up to 87% fewer tokens while preserving answer fidelity.
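The Dynamic Token Dropping step can be sketched in a few lines of plain Python. This is a simplified, hypothetical version: patch tokens are plain lists of floats, and the 0.9 similarity threshold is an assumed hyperparameter rather than a value from the paper.

```python
import math

def cosine(u, v):
    # Cosine similarity between two patch-token vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / max(norm, 1e-8)

def dynamic_token_drop(prev_tokens, curr_tokens, threshold=0.9):
    """Drop the current frame's per-patch tokens that are nearly
    identical (cosine similarity >= threshold) to the co-located patch
    in the previous frame; survivors would then be packed into
    fixed-size blocks for the compressive memory."""
    survivors, kept_idx = [], []
    for i, (prev, curr) in enumerate(zip(prev_tokens, curr_tokens)):
        if cosine(prev, curr) < threshold:
            survivors.append(curr)
            kept_idx.append(i)
    return survivors, kept_idx

prev_frame = [[1.0, 0.0], [1.0, 0.0]]
curr_frame = [[1.0, 0.0],   # unchanged patch -> pruned
              [0.0, 1.0]]   # changed patch   -> kept
survivors, kept_idx = dynamic_token_drop(prev_frame, curr_frame)
```

Only the changed patch survives, which is what keeps the token budget roughly proportional to scene change rather than to video length.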


P1: Mastering Physics Olympiads with Reinforcement Learning

Authors:Jiacheng Chen, Qianjia Cheng, Fangchen Yu, Haiyuan Wan, Yuchen Zhang, Shenghe Zheng, Junchi Yao, Qingyang Zhang, Haonan He, Yun Luo, Yufeng Zhao, Futing Wang, Li Sheng, Chengxing Xie, Yuxin Zuo, Yizhuo Li, Wenxauan Zeng, Yulun Wu, Rui Huang, Dongzhan Zhou, Kai Chen, Yu Qiao, Lei Bai, Yu Cheng, Ning Ding, Bowen Zhou, Peng Ye, Ganqu Cui

Recent progress in large language models (LLMs) has moved the frontier from puzzle-solving to science-grade reasoning-the kind needed to tackle problems whose answers must stand against nature, not merely fit a rubric. Physics, which binds symbols to reality in a fundamental way and serves as the cornerstone of most modern technologies, is the sharpest test of this shift. In this work, we advance physics research by developing large language models with exceptional physics reasoning capabilities that especially excel at solving Olympiad-level physics problems. We introduce P1, a family of open-source physics reasoning models trained entirely through reinforcement learning (RL). Among them, P1-235B-A22B is the first open-source model with Gold-medal performance at the latest International Physics Olympiad (IPhO 2025), and wins 12 gold medals out of 13 international/regional physics competitions in 2024/2025. P1-30B-A3B also surpasses almost all other open-source models on IPhO 2025, earning a silver medal. Further equipped with an agentic framework PhysicsMinions, P1-235B-A22B+PhysicsMinions achieves overall No.1 on IPhO 2025, and obtains the highest average score over the 13 physics competitions. Besides physics, P1 models also present great performance on other reasoning tasks like math and coding, showing the great generalizability of the P1 series.


Summary

Recent progress in large language models (LLMs) has moved the frontier from puzzle-solving to science-grade reasoning, the kind needed for problems whose answers must stand against nature. This paper introduces P1, a family of open-source physics reasoning models trained entirely through reinforcement learning (RL) that achieve outstanding results at international physics Olympiads, demonstrating their ability to solve hard physics problems. The models also perform strongly on other reasoning tasks such as math and coding, showing the broad generalizability of the P1 series, and the work offers a new reference point for training reasoning models across disciplines.

Key Takeaways

  1. Large language models (LLMs) have moved from puzzle-solving to science-grade reasoning, with breakthroughs on problems whose answers must be validated against nature.
  2. The P1 series are open-source physics reasoning models with exceptional capabilities, achieving gold-medal results at international physics Olympiads.


Adaptive Multi-Scale Integration Unlocks Robust Cell Annotation in Histopathology Images

Authors:Yinuo Xu, Yan Cui, Mingyao Li, Zhi Huang

Identifying cell types and subtypes from routine histopathology images is essential for improving the computational understanding of human disease. Existing tile-based models can capture detailed nuclear morphology but often fail to incorporate the broader tissue context that influences a cell’s function and identity. In addition, available human annotations are typically coarse-grained and unevenly distributed across studies, making fine-grained subtype-level supervision difficult to obtain. To address these limitations, we introduce NuClass, a pathologist workflow inspired framework for cell-wise multi-scale integration of nuclear morphology and microenvironmental context. NuClass includes two main components: Path local, which focuses on nuclear morphology from 224-by-224 pixel crops, and Path global, which models the surrounding 1024-by-1024 pixel neighborhood. A learnable gating module adaptively balances local detail and contextual cues. To encourage complementary learning, we incorporate an uncertainty-guided objective that directs the global path to prioritize regions where the local path is uncertain. We also provide calibrated confidence estimates and Grad-CAM visualizations to enhance interpretability. To overcome the lack of high-quality annotations, we construct a marker-guided dataset from Xenium spatial transcriptomics assays, yielding single-cell resolution labels for more than two million cells across eight organs and 16 classes. Evaluated on three fully held-out cohorts, NuClass achieves up to 96 percent F1 for its best-performing class, outperforming strong baselines. Our results show that multi-scale, uncertainty-aware fusion can bridge the gap between slide-level pathological foundation models and reliable, cell-level phenotype prediction.


Summary

Identifying cell types and subtypes from routine histopathology images is essential for the computational understanding of human disease, yet existing tile-based models capture detailed nuclear morphology while ignoring the broader tissue context that shapes a cell's function and identity. To address this, NuClass is a pathologist-workflow-inspired framework for cell-wise, multi-scale integration of nuclear morphology and microenvironmental context. It has two main components: a local path focused on nuclear morphology in 224-by-224 pixel crops, and a global path that models the surrounding 1024-by-1024 pixel neighborhood. A learnable gating module adaptively balances local detail against contextual cues, and an uncertainty-guided objective directs the global path to prioritize regions where the local path is uncertain. Calibrated confidence estimates and Grad-CAM visualizations enhance interpretability. To overcome the lack of high-quality annotations, the authors construct a marker-guided dataset from Xenium spatial transcriptomics assays, providing single-cell-resolution labels for more than two million cells across eight organs and 16 classes. Evaluated on three fully held-out cohorts, NuClass reaches up to 96% F1 on its best-performing class, outperforming strong baselines and showing that multi-scale, uncertainty-aware fusion can bridge slide-level pathology foundation models and reliable, cell-level phenotype prediction.

Key Takeaways

  1. Identifying cell types and subtypes from routine histopathology images is essential for understanding human disease.
  2. Current tile-based models capture nuclear morphology well but ignore the influence of the surrounding tissue context.
  3. The NuClass framework integrates nuclear morphology and microenvironmental context at the single-cell level and across scales.
  4. NuClass pairs a local path and a global path, with a learnable gating module balancing local detail against contextual cues.
  5. An uncertainty-guided objective strengthens the model in regions where the local path is uncertain.
  6. Calibrated confidence estimates and Grad-CAM visualizations enhance the model's interpretability.
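The gating module's local/global blend can be illustrated with a scalar-gate sketch. In NuClass the gate is produced by a learned module; here the gate score is passed in directly, so only the fusion rule is being demonstrated and all numbers are illustrative.

```python
import math

def gated_fusion(local_logits, global_logits, gate_score):
    """Blend per-cell class logits from the nuclear-morphology path
    (local) and the tissue-context path (global) with a scalar gate
    in (0, 1) obtained by squashing the gate score through a sigmoid."""
    g = 1.0 / (1.0 + math.exp(-gate_score))   # sigmoid -> (0, 1)
    return [g * lo + (1.0 - g) * gl
            for lo, gl in zip(local_logits, global_logits)]

# A strongly positive gate score trusts the local (morphology) path
mostly_local = gated_fusion([2.0, -1.0], [-1.0, 2.0], gate_score=4.0)
```

A gate score of 0 averages the two paths; large positive or negative scores let the model fall back on whichever path is more informative for the current cell.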


RAC-DMVC: Reliability-Aware Contrastive Deep Multi-View Clustering under Multi-Source Noise

Authors:Shihao Dong, Yue Liu, Xiaotong Zhou, Yuhui Zheng, Huiying Xu, Xinzhong Zhu

Multi-view clustering (MVC), which aims to separate the multi-view data into distinct clusters in an unsupervised manner, is a fundamental yet challenging task. To enhance its applicability in real-world scenarios, this paper addresses a more challenging task: MVC under multi-source noises, including missing noise and observation noise. To this end, we propose a novel framework, Reliability-Aware Contrastive Deep Multi-View Clustering (RAC-DMVC), which constructs a reliability graph to guide robust representation learning under noisy environments. Specifically, to address observation noise, we introduce a cross-view reconstruction to enhances robustness at the data level, and a reliability-aware noise contrastive learning to mitigates bias in positive and negative pairs selection caused by noisy representations. To handle missing noise, we design a dual-attention imputation to capture shared information across views while preserving view-specific features. In addition, a self-supervised cluster distillation module further refines the learned representations and improves the clustering performance. Extensive experiments on five benchmark datasets demonstrate that RAC-DMVC outperforms SOTA methods on multiple evaluation metrics and maintains excellent performance under varying ratios of noise.


Summary

This paper proposes a novel multi-view clustering method, Reliability-Aware Contrastive Deep Multi-View Clustering (RAC-DMVC), which constructs a reliability graph to guide robust representation learning in noisy environments. To handle observation noise, RAC-DMVC enhances data-level robustness with cross-view reconstruction and introduces reliability-aware noise contrastive learning to mitigate the bias in positive and negative pair selection caused by noisy representations. To handle missing noise, a dual-attention imputation captures shared information across views while preserving view-specific features. A self-supervised cluster distillation module further refines the learned representations and improves clustering performance. Experiments show that RAC-DMVC outperforms state-of-the-art methods on multiple evaluation metrics and maintains strong performance under varying noise ratios.

Key Takeaways

  1. RAC-DMVC is a multi-view clustering method that separates multi-view data into distinct clusters without supervision.
  2. It constructs a reliability graph to cope with the various kinds of noise found in real-world data.
  3. To handle observation noise, RAC-DMVC uses cross-view reconstruction and reliability-aware noise contrastive learning.
  4. Missing noise is handled by a dual-attention imputation that captures shared cross-view information while preserving view-specific features.
  5. A self-supervised cluster distillation module further refines the representations and improves clustering performance.
  6. Experiments on five benchmark datasets show RAC-DMVC excels on multiple evaluation metrics and holds up across different noise ratios.
  7. RAC-DMVC offers a new solution for multi-view clustering under multi-source noise.
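The reliability-aware weighting idea can be sketched as a toy two-view contrastive loss in which each sample's positive-pair term is scaled by a reliability score. The weighting form below is a hypothetical stand-in for the paper's reliability graph, not its actual objective.

```python
import math

def reliability_weighted_contrastive(z1, z2, reliability, temp=0.5):
    """Toy reliability-aware contrastive loss over two views: the i-th
    samples of z1 and z2 form the positive pair, and each sample's
    InfoNCE term is weighted by a reliability score in [0, 1]."""
    def normalize(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]
    z1 = [normalize(v) for v in z1]
    z2 = [normalize(v) for v in z2]
    loss = 0.0
    for i, anchor in enumerate(z1):
        sims = [sum(a * b for a, b in zip(anchor, other)) / temp
                for other in z2]
        denom = sum(math.exp(s) for s in sims)
        loss += reliability[i] * -math.log(math.exp(sims[i]) / denom)
    return loss / sum(reliability)

z1 = [[1.0, 0.0], [0.0, 1.0]]
z2 = [[1.0, 0.0], [0.6, 0.8]]          # second view of sample 1 is noisy
full_trust = reliability_weighted_contrastive(z1, z2, [1.0, 1.0])
down_weighted = reliability_weighted_contrastive(z1, z2, [0.2, 1.0])
```

Down-weighting the sample whose pairs are most corrupted lowers its influence on the loss, which is the intuition behind letting a reliability estimate steer positive/negative pair selection.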


TSE-Net: Semi-supervised Monocular Height Estimation from Single Remote Sensing Images

Authors:Sining Chen, Xiao Xiang Zhu

Monocular height estimation plays a critical role in 3D perception for remote sensing, offering a cost-effective alternative to multi-view or LiDAR-based methods. While deep learning has significantly advanced the capabilities of monocular height estimation, these methods remain fundamentally limited by the availability of labeled data, which are expensive and labor-intensive to obtain at scale. The scarcity of high-quality annotations hinders the generalization and performance of existing models. To overcome this limitation, we propose leveraging large volumes of unlabeled data through a semi-supervised learning framework, enabling the model to extract informative cues from unlabeled samples and improve its predictive performance. In this work, we introduce TSE-Net, a self-training pipeline for semi-supervised monocular height estimation. The pipeline integrates teacher, student, and exam networks. The student network is trained on unlabeled data using pseudo-labels generated by the teacher network, while the exam network functions as a temporal ensemble of the student network to stabilize performance. The teacher network is formulated as a joint regression and classification model: the regression branch predicts height values that serve as pseudo-labels, and the classification branch predicts height value classes along with class probabilities, which are used to filter pseudo-labels. Height value classes are defined using a hierarchical bi-cut strategy to address the inherent long-tailed distribution of heights, and the predicted class probabilities are calibrated with a Plackett-Luce model to reflect the expected accuracy of pseudo-labels. We evaluate the proposed pipeline on three datasets spanning different resolutions and imaging modalities. Codes are available at https://github.com/zhu-xlab/tse-net.


Summary

Monocular height estimation is key to 3D perception in remote sensing, but deep-learning approaches are limited by the scarcity of labeled data. To overcome this, the paper proposes TSE-Net, a semi-supervised self-training pipeline that exploits large volumes of unlabeled data. A teacher network generates pseudo-labels to train a student network, and an exam network is introduced as a temporal ensemble of the student to stabilize performance. The teacher is a joint regression and classification model: the regression branch produces the pseudo-label heights, while the classification branch predicts height-value classes and class probabilities used to filter the pseudo-labels. Height-value classes are defined with a hierarchical bi-cut strategy, and the predicted class probabilities are calibrated with a Plackett-Luce model to reflect the expected accuracy of the pseudo-labels. The pipeline is evaluated on three datasets spanning different resolutions and imaging modalities.

Key Takeaways

  1. Monocular height estimation plays a key role in 3D perception for remote sensing, as a cost-effective alternative to multi-view or LiDAR-based methods.
  2. Deep-learning approaches to monocular height estimation are limited by the scarcity of labeled data.
  3. To address this, TSE-Net is a semi-supervised framework that exploits large volumes of unlabeled data.
  4. A teacher network generates pseudo-labels to train the student network, and an exam network is introduced to stabilize performance.
  5. The teacher network combines regression and classification branches to generate and filter pseudo-labels.
  6. A hierarchical bi-cut strategy defines height-value classes to fit the inherently long-tailed height distribution.
  7. A Plackett-Luce model calibrates the predicted class probabilities to reflect the expected accuracy of the pseudo-labels.
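The hierarchical bi-cut idea, turning a long-tailed height distribution into balanced classes, can be sketched as recursive median splits. The paper's exact splitting rule may differ; this only illustrates the principle.

```python
def hierarchical_bi_cut(heights, depth):
    """Recursively split height values at the median, producing up to
    2**depth class intervals with roughly balanced counts, so rare
    tall buildings get their own classes instead of one huge tail bin."""
    if depth == 0 or len(heights) < 2:
        return [(min(heights), max(heights))]
    ordered = sorted(heights)
    median = ordered[len(ordered) // 2]
    lower = [h for h in heights if h < median]
    upper = [h for h in heights if h >= median]
    if not lower:                      # all values equal: stop splitting
        return [(min(heights), max(heights))]
    return (hierarchical_bi_cut(lower, depth - 1)
            + hierarchical_bi_cut(upper, depth - 1))

# Long-tailed building heights (metres): mostly low-rise, a few towers
classes = hierarchical_bi_cut([3, 4, 5, 6, 9, 15, 40, 120], depth=2)
```

Each resulting interval holds a similar number of samples, which is what makes the classification branch's per-class probabilities meaningful for filtering pseudo-labels.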


ForgeDAN: An Evolutionary Framework for Jailbreaking Aligned Large Language Models

Authors:Siyang Cheng, Gaotian Liu, Rui Mei, Yilin Wang, Kejia Zhang, Kaishuo Wei, Yuqi Yu, Weiping Wen, Xiaojie Wu, Junhua Liu

The rapid adoption of large language models (LLMs) has brought both transformative applications and new security risks, including jailbreak attacks that bypass alignment safeguards to elicit harmful outputs. Existing automated jailbreak generation approaches e.g. AutoDAN, suffer from limited mutation diversity, shallow fitness evaluation, and fragile keyword-based detection. To address these limitations, we propose ForgeDAN, a novel evolutionary framework for generating semantically coherent and highly effective adversarial prompts against aligned LLMs. First, ForgeDAN introduces multi-strategy textual perturbations across character-, word-, and sentence-level operations to enhance attack diversity; then we employ interpretable semantic fitness evaluation based on a text similarity model to guide the evolutionary process toward semantically relevant and harmful outputs; finally, ForgeDAN integrates dual-dimensional jailbreak judgment, leveraging an LLM-based classifier to jointly assess model compliance and output harmfulness, thereby reducing false positives and improving detection effectiveness. Our evaluation demonstrates ForgeDAN achieves high jailbreaking success rates while maintaining naturalness and stealth, outperforming existing SOTA solutions.


Summary

The rapid adoption of large language models (LLMs) brings transformative applications but also new security risks, including jailbreak attacks that bypass alignment safeguards to elicit harmful outputs. Existing automated jailbreak generators such as AutoDAN suffer from limited mutation diversity, shallow fitness evaluation, and fragile keyword-based detection. ForgeDAN is an evolutionary framework that generates semantically coherent and highly effective adversarial prompts against aligned LLMs. It introduces multi-strategy textual perturbations at the character, word, and sentence levels to increase attack diversity; uses an interpretable, text-similarity-based semantic fitness evaluation to steer the evolutionary process toward semantically relevant harmful outputs; and integrates dual-dimensional jailbreak judgment, using an LLM-based classifier to jointly assess model compliance and output harmfulness, reducing false positives and improving detection. Evaluations show ForgeDAN achieves high jailbreak success rates while remaining natural and stealthy, outperforming existing state-of-the-art solutions.

Key Takeaways

  1. The rapid adoption of large language models (LLMs) brings new security risks, including jailbreak attacks.
  2. Existing automated jailbreak generators suffer from limited mutation diversity and shallow fitness evaluation.
  3. The ForgeDAN framework enhances attack diversity through multi-strategy textual perturbations.
  4. ForgeDAN guides its evolutionary process with a text-similarity-based semantic fitness evaluation.
  5. ForgeDAN's dual-dimensional jailbreak judgment reduces false positives and improves detection effectiveness.
  6. ForgeDAN achieves high jailbreak success rates while remaining natural and stealthy.


Interpretable Ransomware Detection Using Hybrid Large Language Models: A Comparative Analysis of BERT, RoBERTa, and DeBERTa Through LIME and SHAP

Authors:Elodie Mutombo Ngoie, Mike Nkongolo Wa Nkongolo, Peace Azugo, Mahmut Tokmak

Ransomware continues to evolve in complexity, making early and explainable detection a critical requirement for modern cybersecurity systems. This study presents a comparative analysis of three Transformer-based Large Language Models (LLMs) (BERT, RoBERTa, and DeBERTa) for ransomware detection using two structured datasets: UGRansome and Process Memory (PM). Since LLMs are primarily designed for natural language processing (NLP), numerical and categorical ransomware features were transformed into textual sequences using KBinsDiscretizer and token-based encoding. This enabled the models to learn behavioural patterns from system activity and network traffic through contextual embeddings. The models were fine-tuned on approximately 2,500 labelled samples and evaluated using accuracy, F1 score, and ROC-AUC. To ensure transparent decision-making in this high-stakes domain, two explainable AI techniques (LIME and SHAP) were applied to interpret feature contributions. The results show that the models learn distinct ransomware-related cues: BERT relies heavily on dominant file-operation features, RoBERTa demonstrates balanced reliance on network and financial signals, while DeBERTa exhibits strong sensitivity to financial and network-traffic indicators. Visualisation of embeddings further reveals structural differences in token representation, with RoBERTa producing more isotropic embeddings and DeBERTa capturing highly directional, disentangled patterns. In general, RoBERTa achieved the strongest F1-score, while BERT yielded the highest ROC-AUC performance. The integration of LLMs with XAI provides a transparent framework capable of identifying feature-level evidence behind ransomware predictions.


Summary

This study compares three Transformer-based large language models (BERT, RoBERTa, and DeBERTa) for ransomware detection on two structured datasets, UGRansome and Process Memory (PM). Numerical and categorical ransomware features are converted into textual sequences with KBinsDiscretizer and token-based encoding, letting the models learn behavioral patterns from system activity and network traffic through contextual embeddings. The models are fine-tuned on about 2,500 labeled samples and evaluated with accuracy, F1 score, and ROC-AUC, while LIME and SHAP explain feature contributions for transparent decision-making. The models learn distinct cues: BERT relies on file-operation features, RoBERTa balances network and financial signals, and DeBERTa is most sensitive to financial and network-traffic indicators. Embedding visualizations reveal structural differences, with RoBERTa producing more isotropic embeddings and DeBERTa capturing highly directional, disentangled patterns. Overall, RoBERTa achieves the best F1 score and BERT the best ROC-AUC. Combining LLMs with explainable AI yields a transparent framework that surfaces feature-level evidence behind ransomware predictions.

Key Takeaways

  1. Ransomware keeps evolving in complexity; early, explainable detection is critical for modern cybersecurity systems.
  2. Three Transformer-based large language models (BERT, RoBERTa, DeBERTa) are compared for ransomware detection.
  3. KBinsDiscretizer and token-based encoding turn numerical and categorical features into text so the models can learn behavioral patterns.
  4. The models are fine-tuned on structured datasets and evaluated with accuracy, F1 score, and ROC-AUC.
  5. LIME and SHAP are used to explain model decisions and improve transparency.
  6. The models attend to different cues: BERT to file operations, RoBERTa to a balance of network and financial signals, DeBERTa to financial and network-traffic indicators.
  7. Pairing LLMs with explainable AI provides a transparent framework that identifies feature-level evidence behind ransomware predictions.
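The feature-to-text step can be illustrated without scikit-learn: the sketch below mimics, in miniature, what `KBinsDiscretizer(strategy="quantile", encode="ordinal")` does, then renders each row as pseudo-words a Transformer tokenizer can consume. The feature names and values are invented for illustration.

```python
def quantile_edges(values, n_bins):
    """Bin edges at empirical quantiles of one feature column."""
    ordered = sorted(values)
    return [ordered[len(ordered) * k // n_bins] for k in range(1, n_bins)]

def rows_to_token_text(samples, feature_names, n_bins=2):
    """Turn numeric feature rows into whitespace-separated pseudo-words
    such as 'net_bytes_bin1', so a text model can embed them."""
    columns = list(zip(*samples.values()))
    edges = [quantile_edges(col, n_bins) for col in columns]
    texts = {}
    for sample_id, row in samples.items():
        bins = [sum(v >= e for e in edges[i]) for i, v in enumerate(row)]
        texts[sample_id] = " ".join(
            f"{name}_bin{b}" for name, b in zip(feature_names, bins))
    return texts

# Hypothetical per-process features: (bytes over the network, file ops)
samples = {"proc_a": [120.0, 3.0], "proc_b": [9100.0, 88.0]}
texts = rows_to_token_text(samples, ["net_bytes", "file_ops"])
```

The resulting strings are what get fed to the BERT-family tokenizers during fine-tuning; the bin identity, not the raw magnitude, is what the model sees.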


Multi-Agent Multimodal Large Language Model Framework for Automated Interpretation of Fuel Efficiency Analytics in Public Transportation

Authors:Zhipeng Ma, Ali Rida Bahja, Andreas Burgdorf, André Pomp, Tobias Meisen, Bo Nørregaard Jørgensen, Zheng Grace Ma

Enhancing fuel efficiency in public transportation requires the integration of complex multimodal data into interpretable, decision-relevant insights. However, traditional analytics and visualization methods often yield fragmented outputs that demand extensive human interpretation, limiting scalability and consistency. This study presents a multi-agent framework that leverages multimodal large language models (LLMs) to automate data narration and energy insight generation. The framework coordinates three specialized agents, including a data narration agent, an LLM-as-a-judge agent, and an optional human-in-the-loop evaluator, to iteratively transform analytical artifacts into coherent, stakeholder-oriented reports. The system is validated through a real-world case study on public bus transportation in Northern Jutland, Denmark, where fuel efficiency data from 4006 trips are analyzed using Gaussian Mixture Model clustering. Comparative experiments across five state-of-the-art LLMs and three prompting paradigms identify GPT-4.1 mini with Chain-of-Thought prompting as the optimal configuration, achieving 97.3% narrative accuracy while balancing interpretability and computational cost. The findings demonstrate that multi-agent orchestration significantly enhances factual precision, coherence, and scalability in LLM-based reporting. The proposed framework establishes a replicable and domain-adaptive methodology for AI-driven narrative generation and decision support in energy informatics.


Summary

This study presents a multi-agent framework that uses multimodal large language models (LLMs) to automate data narration and energy-insight generation for public-transport fuel efficiency. The framework coordinates a data narration agent, an LLM-as-a-judge agent, and an optional human-in-the-loop evaluator to turn analytical artifacts into coherent, stakeholder-oriented reports. It is validated on a real-world case study of public bus transportation in Northern Jutland, Denmark, where fuel-efficiency data from 4,006 trips are analyzed with Gaussian Mixture Model clustering. Comparative experiments across five state-of-the-art LLMs and three prompting paradigms identify GPT-4.1 mini with Chain-of-Thought prompting as the best configuration, reaching 97.3% narrative accuracy while balancing interpretability and computational cost. The framework establishes a replicable, domain-adaptive methodology for AI-driven narrative generation and decision support in energy informatics.

Key Takeaways

  1. The study proposes a multi-agent framework that integrates multimodal large language models to automate the analysis and interpretation of public-transport fuel-efficiency data.
  2. A real-world case study on public bus transportation validates the framework's effectiveness.
  3. Comparative experiments identify GPT-4.1 mini with Chain-of-Thought prompting as the optimal configuration, reaching high narrative accuracy.
  4. The chosen configuration balances the interpretability of the generated narratives against computational cost.
  5. Multi-agent orchestration significantly improves the factual precision, coherence, and scalability of LLM-based reporting.
  6. The framework establishes a replicable, domain-adaptive methodology for AI-driven narrative generation and decision support in energy informatics.
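The clustering step can be illustrated with a tiny EM loop for a two-component 1-D Gaussian mixture over per-trip fuel-efficiency values. The study itself would use a full library implementation on real multivariate data; this sketch only shows the mechanics, on made-up numbers.

```python
import math

def gmm_em_1d(data, iters=60):
    """Tiny EM for a two-component 1-D Gaussian mixture: E-step computes
    per-point responsibilities, M-step re-estimates weights, means, and
    variances. A stand-in for GMM clustering of trip efficiencies."""
    mu = [min(data), max(data)]          # crude initialisation
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        resp = []                        # E-step: responsibilities
        for x in data:
            dens = [pi[k] / math.sqrt(2 * math.pi * var[k])
                    * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                    for k in (0, 1)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        for k in (0, 1):                 # M-step: re-estimate parameters
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2
                         for r, x in zip(resp, data)) / nk + 1e-6
    return mu, var, pi

# Two hypothetical efficiency regimes (km per litre): city vs. highway
mu, var, pi = gmm_em_1d([3.0, 3.2, 2.9, 2.8, 7.0, 7.1, 6.8, 7.3])
```

The recovered component means land near the two regime centres; in the paper's pipeline such cluster summaries are what the narration agent turns into prose.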


MedDCR: Learning to Design Agentic Workflows for Medical Coding

Authors:Jiyang Zheng, Islam Nassar, Thanh Vu, Xu Zhong, Yang Lin, Tongliang Liu, Long Duong, Yuan-Fang Li

Medical coding converts free-text clinical notes into standardized diagnostic and procedural codes, which are essential for billing, hospital operations, and medical research. Unlike ordinary text classification, it requires multi-step reasoning: extracting diagnostic concepts, applying guideline constraints, mapping to hierarchical codebooks, and ensuring cross-document consistency. Recent advances leverage agentic LLMs, but most rely on rigid, manually crafted workflows that fail to capture the nuance and variability of real-world documentation, leaving open the question of how to systematically learn effective workflows. We present MedDCR, a closed-loop framework that treats workflow design as a learning problem. A Designer proposes workflows, a Coder executes them, and a Reflector evaluates predictions and provides constructive feedback, while a memory archive preserves prior designs for reuse and iterative refinement. On benchmark datasets, MedDCR outperforms state-of-the-art baselines and produces interpretable, adaptable workflows that better reflect real coding practice, improving both the reliability and trustworthiness of automated systems.


Summary

Medical coding converts free-text clinical notes into standardized diagnostic and procedural codes used for billing, hospital operations, and medical research; unlike ordinary text classification, it demands multi-step reasoning. Because existing systems rely on rigid, manually crafted workflows that miss the nuance and variability of real-world documentation, the authors propose MedDCR, a closed-loop framework that treats workflow design as a learning problem: a Designer proposes workflows, a Coder executes them, and a Reflector evaluates predictions and gives constructive feedback, while a memory archive preserves prior designs for reuse and iterative refinement. On benchmark datasets, MedDCR outperforms state-of-the-art baselines and yields interpretable, adaptable workflows that better reflect real coding practice, improving the reliability and trustworthiness of automated systems.

Key Takeaways

  1. Medical coding converts free-text clinical notes into standardized diagnostic and procedural codes, which are essential for billing, hospital operations, and medical research.
  2. Most existing medical coding systems rely on manually crafted workflows that cannot adapt to the nuance and variability of real-world documentation.
  3. The MedDCR framework automates and learns medical-coding workflows through the collaboration of Designer, Coder, and Reflector agents.
  4. MedDCR produces interpretable, adaptable workflows that reflect real coding practice.
  5. A memory archive lets MedDCR reuse and iteratively refine earlier workflow designs.
  6. MedDCR outperforms state-of-the-art baselines on benchmark datasets.
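The Designer-Coder-Reflector loop with a memory archive reduces to a simple control flow, sketched below with placeholder callables standing in for the three agents (this is not the paper's API; the toy agents and names are invented).

```python
def refine_workflow(design, execute, reflect, rounds=3):
    """Skeleton of a Designer-Coder-Reflector loop: the Designer sees
    the archive of past designs plus the latest feedback, the Coder
    executes the proposed workflow, and the Reflector scores it and
    produces feedback; the archive enables reuse and refinement."""
    archive, feedback = [], None
    for _ in range(rounds):
        workflow = design(archive, feedback)        # Designer
        predictions = execute(workflow)             # Coder
        score, feedback = reflect(predictions)      # Reflector
        archive.append((workflow, score))           # memory archive
    return max(archive, key=lambda entry: entry[1])[0]

# Toy agents: each round the Designer proposes a slightly better design
best = refine_workflow(
    design=lambda archive, fb: f"workflow_v{len(archive)}",
    execute=lambda wf: wf,
    reflect=lambda preds: (int(preds.split("_v")[1]),
                           "add a guideline-constraint check"),
)
```

Returning the highest-scoring archived design, rather than simply the last one, is what makes the archive more than a log: earlier workflows remain candidates.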


Multi-Agent Deep Research: Training Multi-Agent Systems with M-GRPO

Authors:Haoyang Hong, Jiajun Yin, Yuan Wang, Jingnan Liu, Zhe Chen, Ailing Yu, Ji Li, Zhiling Ye, Hansong Xiao, Yefei Chen, Hualei Zhou, Yun Yue, Minghui Yang, Chunxiao Guo, Junwei Liu, Peng Wei, Jinjie Gu

Multi-agent systems perform well on general reasoning tasks. However, the lack of training in specialized areas hinders their accuracy. Current training methods train a unified large language model (LLM) for all agents in the system. This may limit performance, since different agents operate over different underlying distributions. Training multi-agent systems with distinct LLMs is therefore the natural next step. However, this approach introduces optimization challenges. For example, agents operate at different frequencies, rollouts involve varying sub-agent invocations, and agents are often deployed across separate servers, disrupting end-to-end gradient flow. To address these issues, we propose M-GRPO, a hierarchical extension of Group Relative Policy Optimization designed for vertical Multi-agent systems with a main agent (planner) and multiple sub-agents (multi-turn tool executors). M-GRPO computes group-relative advantages for both main and sub-agents, maintaining hierarchical credit assignment. It also introduces a trajectory-alignment scheme that generates fixed-size batches despite variable sub-agent invocations. We deploy a decoupled training pipeline in which agents run on separate servers and exchange minimal statistics via a shared store. This enables scalable training without cross-server backpropagation. In experiments on real-world benchmarks (e.g., GAIA, XBench-DeepSearch, and WebWalkerQA), M-GRPO consistently outperforms both single-agent GRPO and multi-agent GRPO with frozen sub-agents, demonstrating improved stability and sample efficiency. These results show that aligning heterogeneous trajectories and decoupling optimization across specialized agents enhances tool-augmented reasoning tasks.


Summary

Multi-agent systems perform well on general reasoning tasks, but current training methods fit one unified large language model for all agents, which can limit performance because different agents face different underlying distributions. Training each agent with its own LLM is the natural next step, yet it introduces optimization challenges. To address them, the paper proposes M-GRPO, a hierarchical extension of Group Relative Policy Optimization designed for vertical multi-agent systems. M-GRPO computes group-relative advantages for both the main agent and the sub-agents while maintaining hierarchical credit assignment, and introduces a trajectory-alignment scheme that produces fixed-size batches despite variable sub-agent invocations. On real-world benchmarks, M-GRPO outperforms both single-agent GRPO and multi-agent GRPO with frozen sub-agents, showing better stability and sample efficiency. Aligning heterogeneous trajectories and decoupling optimization across specialized agents thus enhances tool-augmented reasoning tasks.

Key Takeaways

  1. Multi-agent systems perform well on general reasoning tasks, but the lack of training in specialized areas hurts their accuracy.
  2. Current methods train a single unified large language model (LLM) for all agents, which can limit performance.
  3. Training each agent with a distinct LLM is proposed to accommodate the different distributions underlying different agents.
  4. Doing so introduces optimization challenges, such as agents operating at different frequencies and rollouts involving variable sub-agent invocations.
  5. M-GRPO, a hierarchical extension of Group Relative Policy Optimization for vertical multi-agent systems, is proposed to address these issues.
  6. M-GRPO computes group-relative advantages for both the main agent and the sub-agents while maintaining hierarchical credit assignment.
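The group-relative advantage at the heart of GRPO-style training can be sketched directly. M-GRPO computes it separately for the main agent's group of rollouts and for each sub-agent's group, which the toy calls at the end imitate; the reward values are made up.

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantage: each trajectory's reward normalised
    against its own sampled group (mean-centred, std-scaled), so no
    separate value network is needed."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # guard all-equal groups
    return [(r - mean) / std for r in rewards]

# Hierarchical credit assignment sketch: the main agent (planner) and
# a sub-agent (tool executor) each get advantages within their own group
main_adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
sub_adv = group_relative_advantages([0.2, 0.4, 0.9])
```

Keeping the two groups separate is what preserves hierarchical credit assignment: a sub-agent is judged against other sub-agent rollouts, not against planner-level rewards.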


MMD-Thinker: Adaptive Multi-Dimensional Thinking for Multimodal Misinformation Detection

Authors:Junjie Wu, Guohong Fu

Multimodal misinformation floods various social media platforms and continues to evolve in the era of AI-generated content (AIGC). Such misinformation, with its low creation cost and high deceptiveness, poses significant threats to society. While recent studies leverage general-purpose multimodal large language models (MLLMs) to achieve remarkable results in detection, they encounter two critical limitations: (1) Insufficient reasoning, where general-purpose MLLMs often follow a uniform reasoning paradigm but generate inaccurate explanations and judgments, due to the lack of task-specific knowledge of multimodal misinformation detection. (2) Reasoning biases, where a single thinking mode locks detectors into a suboptimal path for judgment, struggling to keep pace with fast-growing and intricate multimodal misinformation. In this paper, we propose MMD-Thinker, a two-stage framework for multimodal misinformation detection through adaptive multi-dimensional thinking. First, we develop tailor-designed thinking modes for multimodal misinformation detection. Second, we adopt task-specific instruction tuning to inject the tailored thinking modes into general-purpose MLLMs. Third, we further leverage a reinforcement learning strategy with a mixed advantage function, which incentivizes the reasoning capabilities in trajectories. Furthermore, we construct the multimodal misinformation reasoning (MMR) dataset, encompassing more than 8K image-text pairs with both reasoning processes and classification labels, to make progress in the realm of multimodal misinformation detection. Experimental results demonstrate that our proposed MMD-Thinker achieves state-of-the-art performance on both in-domain and out-of-domain benchmark datasets, while maintaining flexible inference and token usage. Code will be publicly available on GitHub.


Paper and Project Links

PDF

Summary
To combat the flood of multimodal misinformation in the AIGC era, where general-purpose MLLM detectors suffer from insufficient reasoning and reasoning biases, this paper proposes MMD-Thinker, a framework that performs multimodal misinformation detection through adaptive multi-dimensional thinking. The framework includes thinking modes tailored to multimodal misinformation detection, task-specific instruction tuning that injects these modes into general-purpose MLLMs, and a reinforcement learning strategy that incentivizes reasoning. The authors also construct the MMR dataset of more than 8K image-text pairs to advance the field. Experiments show MMD-Thinker achieves state-of-the-art performance on both in-domain and out-of-domain benchmarks while maintaining flexible inference and token usage.

Key Takeaways

  1. Multimodal misinformation poses a major threat to society because of its low creation cost and high deceptiveness.
  2. Existing general-purpose MLLMs suffer from insufficient reasoning and reasoning biases in multimodal misinformation detection.
  3. MMD-Thinker addresses these problems through adaptive multi-dimensional thinking: tailored thinking modes, task-specific instruction tuning, and a reinforcement learning strategy.
  4. The MMR dataset, containing image-text pairs with reasoning processes and classification labels, supports progress in the field.
  5. MMD-Thinker performs strongly in experiments on both in-domain and out-of-domain data.
  6. MMD-Thinker maintains flexible inference and token usage.


TokenSqueeze: Performance-Preserving Compression for Reasoning LLMs

Authors:Yuxiang Zhang, Zhengxu Yu, Weihang Pan, Zhongming Jin, Qiang Fu, Deng Cai, Binbin Lin, Jieping Ye

Emerging reasoning LLMs such as OpenAI-o1 and DeepSeek-R1 have achieved strong performance on complex reasoning tasks by generating long chain-of-thought (CoT) traces. However, these long CoTs result in increased token usage, leading to higher inference latency and memory consumption. As a result, balancing accuracy and reasoning efficiency has become essential for deploying reasoning LLMs in practical applications. Existing long-to-short (Long2Short) methods aim to reduce inference length but often sacrifice accuracy, revealing a need for an approach that maintains performance while lowering token costs. To address this efficiency-accuracy tradeoff, we propose TokenSqueeze, a novel Long2Short method that condenses reasoning paths while preserving performance and relying exclusively on self-generated data. First, to prevent performance degradation caused by excessive compression of reasoning depth, we propose to select self-generated samples whose reasoning depth is adaptively matched to the complexity of the problem. To further optimize the linguistic expression without altering the underlying reasoning paths, we introduce a distribution-aligned linguistic refinement method that enhances the clarity and conciseness of the reasoning path while preserving its logical integrity. Comprehensive experimental results demonstrate the effectiveness of TokenSqueeze in reducing token usage while maintaining accuracy. Notably, DeepSeek-R1-Distill-Qwen-7B fine-tuned using our proposed method achieved a 50% average token reduction while preserving accuracy on the MATH500 benchmark. TokenSqueeze exclusively utilizes the model’s self-generated data, enabling efficient and high-fidelity reasoning without relying on manually curated short-answer datasets across diverse applications. Our code is available at https://github.com/zhangyx1122/TokenSqueeze.
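The sample-selection step can be pictured with a crude stand-in: among a problem's self-generated solutions, keep a correct one that is short, approximating "reasoning depth matched to problem complexity". This is a hypothetical simplification (`select_adaptive_sample` is not from the paper, which describes an adaptive matching criterion rather than a plain shortest-correct rule).

```python
def select_adaptive_sample(candidates):
    """candidates: list of (cot_text, is_correct) pairs for one problem.
    Return the shortest correct chain-of-thought as a crude proxy for
    'reasoning depth adaptively matched to difficulty', or None if the
    model never solved the problem."""
    correct = [cot for cot, ok in candidates if ok]
    if not correct:
        return None
    return min(correct, key=len)
```

On harder problems the surviving correct traces are naturally longer, so even this crude rule keeps more reasoning depth where the problem demands it.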


Paper and Project Links

PDF Accepted to NeurIPS 2025

Summary
Reasoning LLMs excel at complex tasks, but their long chains of thought (CoT) increase inference latency and memory consumption, so deployment requires balancing accuracy and efficiency. Existing Long2Short methods shorten reasoning at the cost of accuracy. TokenSqueeze is a new Long2Short method that compresses reasoning paths while preserving performance, relying exclusively on self-generated data. It selects self-generated samples whose reasoning depth adaptively matches problem complexity, preventing the degradation caused by over-compression, and introduces a distribution-aligned linguistic refinement method that improves the clarity and conciseness of the reasoning path without altering it, preserving logical integrity. Experiments show TokenSqueeze reduces token usage while maintaining accuracy; notably, DeepSeek-R1-Distill-Qwen-7B fine-tuned with the method achieves a 50% average token reduction on the MATH500 benchmark without losing accuracy. Because it uses only self-generated data, TokenSqueeze enables efficient, high-fidelity reasoning across applications without manually curated short-answer datasets.

Key Takeaways

  1. Existing reasoning LLMs face a tradeoff between accuracy and inference efficiency.
  2. TokenSqueeze is a new Long2Short method that resolves this tradeoff by compressing reasoning paths to cut token costs while preserving performance.
  3. TokenSqueeze trains on self-generated samples whose reasoning depth adaptively matches problem complexity, preventing performance degradation.
  4. A distribution-aligned linguistic refinement method optimizes the wording while preserving the underlying reasoning path and logical integrity.
  5. Experiments show TokenSqueeze effectively reduces token usage while maintaining accuracy.
  6. DeepSeek-R1-Distill-Qwen-7B fine-tuned with TokenSqueeze achieves a substantial token reduction on the MATH500 benchmark.


PIGEON: VLM-Driven Object Navigation via Points of Interest Selection

Authors:Cheng Peng, Zhenzhe Zhang, Cheng Chi, Xiaobao Wei, Yanhao Zhang, Heng Wang, Pengwei Wang, Zhongyuan Wang, Jing Liu, Shanghang Zhang

Navigating to a specified object in an unknown environment is a fundamental yet challenging capability of embodied intelligence. However, current methods struggle to balance decision frequency with intelligence, resulting in decisions lacking foresight or discontinuous actions. In this work, we propose PIGEON: Point of Interest Guided Exploration for Object Navigation with VLM, maintaining a lightweight and semantically aligned snapshot memory during exploration as semantic input for the exploration strategy. We use a large Visual-Language Model (VLM), named PIGEON-VL, to select Points of Interest (PoI) formed during exploration and then employ a lower-level planner for action output, increasing the decision frequency. Additionally, this PoI-based decision-making enables the generation of Reinforcement Learning with Verifiable Reward (RLVR) data suitable for simulators. Experiments on classic object navigation benchmarks demonstrate that our zero-shot transfer method achieves state-of-the-art performance, while RLVR further enhances the model’s semantic guidance capabilities, enabling deep reasoning during real-time navigation.


Paper and Project Links

PDF

Summary

This paper proposes PIGEON, which uses Points of Interest (PoI) to guide exploration for object navigation. Maintaining a lightweight, semantically aligned snapshot memory, PIGEON employs a large Visual-Language Model (PIGEON-VL) to select PoIs, increasing the decision frequency. PoI-based decision-making also enables the generation of Reinforcement Learning with Verifiable Reward (RLVR) data suitable for simulators. Experiments show the zero-shot transfer method achieves state-of-the-art object navigation performance, and RLVR further enhances the model's semantic guidance, enabling deep reasoning during real-time navigation.

Key Takeaways

  1. PIGEON uses Points of Interest (PoI) to guide exploration for object navigation in unknown environments.
  2. A lightweight, semantically aligned snapshot memory helps balance decision frequency and intelligence.
  3. A large Visual-Language Model (VLM) selects the PoIs.
  4. PoI-based decision-making can generate reinforcement learning data suitable for simulators.
  5. The zero-shot transfer method achieves state-of-the-art object navigation performance.
  6. RLVR further enhances the model's semantic guidance capabilities.


Video Spatial Reasoning with Object-Centric 3D Rollout

Authors:Haoran Tang, Meng Cao, Ruyang Liu, Xiaoxi Liang, Linglong Li, Ge Li, Xiaodan Liang

Recent advances in Multi-modal Large Language Models (MLLMs) have showcased remarkable capabilities in vision-language understanding. However, enabling robust video spatial reasoning-the ability to comprehend object locations, orientations, and inter-object relationships in dynamic 3D scenes-remains a key unsolved challenge. Existing approaches primarily rely on spatially grounded supervised fine-tuning or reinforcement learning, yet we observe that such models often exhibit query-locked reasoning, focusing narrowly on objects explicitly mentioned in the prompt while ignoring critical contextual cues. To address this limitation, we propose Object-Centric 3D Rollout (OCR), a novel strategy that introduces structured perturbations to the 3D geometry of selected objects during training. By degrading object-specific visual cues and projecting the altered geometry into 2D space, OCR compels the model to reason holistically across the entire scene. We further design a rollout-based training pipeline that jointly leverages vanilla and region-noisy videos to optimize spatial reasoning trajectories. Experiments demonstrate state-of-the-art performance: our 3B-parameter model achieves 47.5% accuracy on VSI-Bench, outperforming several 7B baselines. Ablations confirm OCR’s superiority over prior rollout strategies (e.g., T-GRPO, NoisyRollout).


Paper and Project Links

PDF

Summary

Multi-modal Large Language Models (MLLMs) have made remarkable progress in vision-language understanding, yet robust video spatial reasoning remains a key challenge, with existing methods relying mainly on spatially grounded supervised fine-tuning or reinforcement learning. To overcome query-locked reasoning, this paper proposes Object-Centric 3D Rollout (OCR), which applies structured perturbations to the 3D geometry of selected objects during training to improve holistic scene reasoning. The 3B-parameter model reaches 47.5% accuracy on VSI-Bench, outperforming several baseline models.

Key Takeaways

  • MLLMs have made remarkable progress in vision-language understanding, but video spatial reasoning remains a challenge.
  • Existing methods rely mainly on spatially grounded supervised fine-tuning or reinforcement learning and suffer from query-locked reasoning.
  • The OCR strategy applies structured perturbations to the 3D geometry of selected objects, improving holistic scene reasoning.
  • OCR works by degrading object-specific visual cues and projecting the altered geometry into 2D space.
  • A rollout-based training pipeline jointly leverages vanilla and region-noisy videos to optimize spatial reasoning trajectories.
  • Experiments show OCR performs best on VSI-Bench, with 47.5% accuracy, outperforming several baseline models.


TCM-5CEval: Extended Deep Evaluation Benchmark for LLM’s Comprehensive Clinical Research Competence in Traditional Chinese Medicine

Authors:Tianai Huang, Jiayuan Chen, Lu Lu, Pengcheng Chen, Tianbin Li, Bing Han, Wenchao Tang, Jie Xu, Ming Li

Large language models (LLMs) have demonstrated exceptional capabilities in general domains, yet their application in highly specialized and culturally-rich fields like Traditional Chinese Medicine (TCM) requires rigorous and nuanced evaluation. Building upon prior foundational work such as TCM-3CEval, which highlighted systemic knowledge gaps and the importance of cultural-contextual alignment, we introduce TCM-5CEval, a more granular and comprehensive benchmark. TCM-5CEval is designed to assess LLMs across five critical dimensions: (1) Core Knowledge (TCM-Exam), (2) Classical Literacy (TCM-LitQA), (3) Clinical Decision-making (TCM-MRCD), (4) Chinese Materia Medica (TCM-CMM), and (5) Clinical Non-pharmacological Therapy (TCM-ClinNPT). We conducted a thorough evaluation of fifteen prominent LLMs, revealing significant performance disparities and identifying top-performing models like deepseek_r1 and gemini_2_5_pro. Our findings show that while models exhibit proficiency in recalling foundational knowledge, they struggle with the interpretative complexities of classical texts. Critically, permutation-based consistency testing reveals widespread fragilities in model inference. All evaluated models, including the highest-scoring ones, displayed a substantial performance degradation when faced with varied question option ordering, indicating a pervasive sensitivity to positional bias and a lack of robust understanding. TCM-5CEval not only provides a more detailed diagnostic tool for LLM capabilities in TCM but also exposes fundamental weaknesses in their reasoning stability. To promote further research and standardized comparison, TCM-5CEval has been uploaded to the Medbench platform, joining its predecessor in the “In-depth Challenge for Comprehensive TCM Abilities” special track.
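Permutation-based consistency testing, as described in the abstract, can be sketched as follows; `ask` stands in for any model-query interface and is an assumption of this illustration, not the benchmark's actual harness.

```python
# Sketch of permutation-based consistency testing for multiple-choice items:
# re-ask the same question under every option ordering and measure how often
# the model still selects the correct option *content*.
from itertools import permutations

def consistency_rate(question, options, correct, ask):
    """`ask(question, options)` is an assumed model interface that returns
    the chosen option text. A position-invariant model scores 1.0."""
    orders = list(permutations(options))
    hits = sum(1 for order in orders if ask(question, list(order)) == correct)
    return hits / len(orders)
```

A model that always picks the first option, for example, scores only 1/3 on a three-option item, exposing exactly the positional bias the benchmark reports.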


Paper and Project Links

PDF 17 pages, 8 figures

Summary

Large language models (LLMs) show outstanding capabilities in general domains, but their application to highly specialized, culturally rich fields such as Traditional Chinese Medicine (TCM) requires rigorous, nuanced evaluation. Building on prior work such as TCM-3CEval, the authors introduce TCM-5CEval, a more granular and comprehensive benchmark assessing LLMs across five dimensions: (1) Core Knowledge (TCM-Exam), (2) Classical Literacy (TCM-LitQA), (3) Clinical Decision-making (TCM-MRCD), (4) Chinese Materia Medica (TCM-CMM), and (5) Clinical Non-pharmacological Therapy (TCM-ClinNPT). A thorough evaluation of fifteen prominent LLMs reveals significant performance disparities, with models such as deepseek_r1 and gemini_2_5_pro performing best. Models recall foundational knowledge well but struggle with the interpretative complexity of classical texts. Crucially, permutation-based consistency testing reveals widespread fragility: all evaluated models, including the top scorers, degrade substantially when question options are reordered, indicating pervasive sensitivity to positional bias and a lack of robust understanding. TCM-5CEval both provides a detailed diagnostic tool for LLM capabilities in TCM and exposes fundamental weaknesses in reasoning stability; it has been uploaded to the Medbench platform, joining its predecessor in the “In-depth Challenge for Comprehensive TCM Abilities” special track.

Key Takeaways

  1. Evaluating LLMs' professional competence in TCM requires a more granular and comprehensive benchmark.
  2. TCM-5CEval assesses LLMs across five key TCM dimensions: core knowledge, classical literacy, clinical decision-making, Chinese materia medica, and clinical non-pharmacological therapy.
  3. An evaluation of fifteen prominent LLMs shows significant performance disparities, with models such as deepseek_r1 and gemini_2_5_pro performing best.
  4. Models recall foundational knowledge well but struggle with the interpretative complexity of classical texts.
  5. Model inference is broadly fragile: models are sensitive to positional bias in option ordering and lack robust understanding.
  6. TCM-5CEval provides a detailed diagnostic tool for LLM capabilities in TCM and reveals weaknesses in reasoning stability.


Conditional Diffusion Model for Multi-Agent Dynamic Task Decomposition

Authors:Yanda Zhu, Yuanyang Zhu, Daoyi Dong, Caihua Chen, Chunlin Chen

Task decomposition has shown promise in complex cooperative multi-agent reinforcement learning (MARL) tasks, which enables efficient hierarchical learning for long-horizon tasks in dynamic and uncertain environments. However, learning dynamic task decomposition from scratch generally requires a large number of training samples, especially exploring the large joint action space under partial observability. In this paper, we present the Conditional Diffusion Model for Dynamic Task Decomposition (C$\text{D}^\text{3}$T), a novel two-level hierarchical MARL framework designed to automatically infer subtask and coordination patterns. The high-level policy learns subtask representation to generate a subtask selection strategy based on subtask effects. To capture the effects of subtasks on the environment, C$\text{D}^\text{3}$T predicts the next observation and reward using a conditional diffusion model. At the low level, agents collaboratively learn and share specialized skills within their assigned subtasks. Moreover, the learned subtask representation is also used as additional semantic information in a multi-head attention mixing network to enhance value decomposition and provide an efficient reasoning bridge between individual and joint value functions. Experimental results on various benchmarks demonstrate that C$\text{D}^\text{3}$T achieves better performance than existing baselines.


Paper and Project Links

PDF AAAI 2026

Summary

Task decomposition shows promise in complex cooperative multi-agent reinforcement learning (MARL), enabling efficient hierarchical learning for long-horizon tasks in dynamic, uncertain environments. This paper proposes the Conditional Diffusion Model for Dynamic Task Decomposition (C$\text{D}^\text{3}$T), a novel two-level hierarchical MARL framework that automatically infers subtasks and coordination patterns. The high-level policy learns subtask representations and generates a subtask selection strategy based on subtask effects; C$\text{D}^\text{3}$T captures those effects by predicting the next observation and reward with a conditional diffusion model, while low-level agents collaboratively learn and share specialized skills within their assigned subtasks. Experiments show C$\text{D}^\text{3}$T outperforms existing baselines on multiple benchmarks.

Key Takeaways

  1. Task decomposition shows promise in complex cooperative MARL, improving hierarchical learning efficiency for long-horizon tasks.
  2. C$\text{D}^\text{3}$T is a novel two-level hierarchical MARL framework that automatically infers subtasks and coordination patterns.
  3. The high-level policy learns subtask representations and generates a subtask selection strategy based on subtask effects.
  4. C$\text{D}^\text{3}$T captures the effects of subtasks on the environment by predicting the next observation and reward.
  5. At the low level, agents collaboratively learn and share specialized skills within their assigned subtasks.
  6. The learned subtask representation is also used as additional semantic information in a multi-head attention mixing network, enhancing value decomposition and providing an efficient reasoning bridge between individual and joint value functions.


BeDiscovER: The Benchmark of Discourse Understanding in the Era of Reasoning Language Models

Authors:Chuyuan Li, Giuseppe Carenini

We introduce BeDiscovER (Benchmark of Discourse Understanding in the Era of Reasoning Language Models), an up-to-date, comprehensive suite for evaluating the discourse-level knowledge of modern LLMs. BeDiscovER compiles 5 publicly available discourse tasks across discourse lexicon, (multi-)sentential, and documental levels, with in total 52 individual datasets. It covers both extensively studied tasks such as discourse parsing and temporal relation extraction, as well as some novel challenges such as discourse particle disambiguation (e.g., ``just’’), and also aggregates a shared task on Discourse Relation Parsing and Treebanking for multilingual and multi-framework discourse relation classification. We evaluate open-source LLMs: Qwen3 series, DeepSeek-R1, and frontier model such as GPT-5-mini on BeDiscovER, and find that state-of-the-art models exhibit strong performance in arithmetic aspect of temporal reasoning, but they struggle with full document reasoning and some subtle semantic and discourse phenomena, such as rhetorical relation recognition.


Paper and Project Links

PDF

Summary

The authors introduce BeDiscovER (Benchmark of Discourse Understanding in the Era of Reasoning Language Models), an up-to-date, comprehensive suite for evaluating the discourse-level knowledge of modern LLMs. BeDiscovER compiles 5 publicly available discourse tasks at the lexical, (multi-)sentential, and document levels, totaling 52 individual datasets. It covers well-studied tasks such as discourse parsing and temporal relation extraction as well as novel challenges such as discourse particle disambiguation (e.g., "just"), and aggregates a shared task on Discourse Relation Parsing and Treebanking for multilingual, multi-framework discourse relation classification. Evaluating open-source LLMs (the Qwen3 series, DeepSeek-R1) and frontier models such as GPT-5-mini shows that state-of-the-art models perform strongly on the arithmetic side of temporal reasoning but struggle with full-document reasoning and subtle semantic and discourse phenomena such as rhetorical relation recognition.

Key Takeaways

  1. BeDiscovER is a comprehensive suite for evaluating modern LLMs' discourse-level knowledge.
  2. It spans multiple discourse tasks and datasets at the lexical, (multi-)sentential, and document levels.
  3. BeDiscovER includes both well-studied tasks (e.g., discourse parsing and temporal relation extraction) and new challenges (e.g., discourse particle disambiguation).
  4. Evaluation shows state-of-the-art models perform strongly on the arithmetic side of temporal reasoning.
  5. These models struggle with full-document reasoning and subtle semantic and discourse phenomena such as rhetorical relation recognition.
  6. BeDiscovER also aggregates a shared task on multilingual, multi-framework discourse relation parsing and treebanking.


ViSS-R1: Self-Supervised Reinforcement Video Reasoning

Authors:Bo Fang, Yuxin Song, Qiangqiang Wu, Haoyuan Sun, Wenhao Wu, Antoni B. Chan

Complex video reasoning remains a significant challenge for Multimodal Large Language Models (MLLMs), as current R1-based methodologies often prioritize text-centric reasoning derived from text-based and image-based developments. In video tasks, such strategies frequently underutilize rich visual information, leading to potential shortcut learning and increased susceptibility to hallucination. To foster a more robust, visual-centric video understanding, we start by introducing a novel self-supervised reinforcement learning GRPO algorithm (Pretext-GRPO) within the standard R1 pipeline, in which positive rewards are assigned for correctly solving pretext tasks on transformed visual inputs, which forces the model to process the visual information non-trivially. Building on the effectiveness of Pretext-GRPO, we further propose the ViSS-R1 framework, which streamlines and integrates pretext-task-based self-supervised learning directly into the MLLM’s R1 post-training paradigm. Instead of relying solely on sparse visual cues, our framework compels models to reason about transformed visual input by simultaneously processing both pretext questions (concerning transformations) and true user queries. This necessitates identifying the applied transformation and reconstructing the original video to formulate accurate final answers. Comprehensive evaluations on six widely-used video reasoning and understanding benchmarks demonstrate the effectiveness and superiority of our Pretext-GRPO and ViSS-R1 for complex video reasoning. Our codes and models will be publicly available.
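To make the pretext-task idea concrete, here is a toy sketch that applies a known transformation to a "frame" (a 2D grid standing in for an image) and records its label, so a pretext question ("which transformation was applied?") can be posed alongside the real query. The transformation set and helper are hypothetical illustrations, not the paper's implementation.

```python
# Toy pretext-task data generation: transform a visual input and keep the
# label so the model can be rewarded for identifying the transformation.
import random

TRANSFORMS = {
    # frame is a list of rows; rotate 90° clockwise via transpose of reversal
    "rotate_90": lambda frame: [list(col) for col in zip(*frame[::-1])],
    "h_flip":    lambda frame: [row[::-1] for row in frame],
    "identity":  lambda frame: frame,
}

def make_pretext_example(frame, rng=random):
    """Pick a transformation, apply it, and return (transformed, label)."""
    name = rng.choice(sorted(TRANSFORMS))
    return TRANSFORMS[name](frame), name
```

In Pretext-GRPO terms, a rollout that names the applied transformation correctly would receive a positive reward, which is what pushes the model to actually inspect the visual input.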


Paper and Project Links

PDF Our paper was initially titled “Video-SSR1: Self-Supervised Reinforcement Video Reasoning.” Upon noticing its close resemblance to the title of a recently released paper, we have decided to rename our work as “ViSS-R1.”

Summary
Complex video reasoning remains a major challenge for MLLMs: existing R1-based methods emphasize text-centric reasoning and underuse the rich visual information in videos, encouraging shortcut learning and hallucination. The paper introduces a self-supervised reinforcement learning GRPO algorithm, Pretext-GRPO, within the standard R1 pipeline, assigning positive rewards for correctly solving pretext tasks on transformed visual inputs so the model must genuinely process the visual information. Building on this, the ViSS-R1 framework streamlines pretext-task-based self-supervised learning and integrates it directly into the MLLM's R1 post-training paradigm, requiring the model to handle both pretext questions and true user queries rather than relying on sparse visual cues. Comprehensive evaluations show that Pretext-GRPO and ViSS-R1 excel at video reasoning and understanding. The code and models will be publicly released.

Key Takeaways

Here are seven key insights from the text:

  • Current Multimodal Large Language Models (MLLMs) face challenges in complex video reasoning.
  • R1-based methods often over-rely on text-centric reasoning and neglect the visual information in videos.
  • The proposed self-supervised reinforcement learning GRPO algorithm (Pretext-GRPO) strengthens the model's processing of video visual information.
  • Pretext-GRPO assigns positive rewards for correctly solving pretext tasks on transformed visual inputs.
  • The ViSS-R1 framework streamlines pretext-task-based self-supervised learning and integrates it into the MLLM's R1 post-training paradigm.
  • ViSS-R1 requires the model to process both pretext questions and true user queries rather than relying only on sparse visual cues.
  • Comprehensive evaluations show Pretext-GRPO and ViSS-R1 are highly effective for video reasoning and understanding.


Spark-Prover-X1: Formal Theorem Proving Through Diverse Data Training

Authors:Xinyuan Zhou, Yi Lei, Xiaoyu Zhou, Jingyi Sun, Yu Zhu, Zhongyi Ye, Weitai Zhang, Quan Liu, Si Wei, Cong Liu

Large Language Models (LLMs) have shown significant promise in automated theorem proving, yet progress is often constrained by the scarcity of diverse and high-quality formal language data. To address this issue, we introduce Spark-Prover-X1, a 7B parameter model trained via a three-stage framework designed to unlock the reasoning potential of more accessible and moderately-sized LLMs. The first stage infuses deep knowledge through continuous pre-training on a broad mathematical corpus, enhanced by a suite of novel data tasks. A key innovation is a “CoT-augmented state prediction” task to achieve fine-grained reasoning. The second stage employs Supervised Fine-tuning (SFT) within an expert iteration loop to specialize both the Spark-Prover-X1-7B and Spark-Formalizer-X1-7B models. Finally, a targeted round of Group Relative Policy Optimization (GRPO) is applied to sharpen the prover’s capabilities on the most challenging problems. To facilitate robust evaluation, particularly on problems from real-world examinations, we also introduce ExamFormal-Bench, a new benchmark dataset of 402 formal problems. Experimental results demonstrate that Spark-Prover-X1-7B achieves state-of-the-art performance among similarly-sized open-source models, attaining a 37.0% average pass rate (pass@32). It shows exceptional performance on difficult competition benchmarks, notably solving 27 problems on PutnamBench (pass@32) and achieving 24.0% on CombiBench (pass@32). Our work validates that this diverse training data and progressively refined training pipeline provides an effective path for enhancing the formal reasoning capabilities of lightweight LLMs. Both Spark-Prover-X1-7B and Spark-Formalizer-X1-7B, along with the ExamFormal-Bench dataset, are made publicly available at: https://www.modelscope.cn/organization/iflytek, https://gitcode.com/ifly_opensource.
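The pass@32 numbers above are typically computed with the standard unbiased pass@k estimator (Chen et al., 2021); whether Spark-Prover-X1 uses exactly this estimator is an assumption of this sketch, but the formula itself is the common convention:

```python
# Unbiased pass@k estimator: the probability that at least one of k samples,
# drawn without replacement from n generations of which c are correct, passes.
from math import comb

def pass_at_k(n, c, k):
    if n - c < k:          # too few failures to fill k draws: success certain
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n = k = 32, pass@32 simply asks whether any of the 32 samples solved the problem; the estimator matters when n > k samples are drawn per problem.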


Paper and Project Links

PDF

Summary

LLMs show great promise in automated theorem proving, but progress is constrained by the scarcity of diverse, high-quality formal-language data. The authors introduce Spark-Prover-X1, a 7B-parameter model trained via a three-stage framework that unlocks the reasoning potential of more accessible, moderately sized LLMs. The first stage infuses deep knowledge through continuous pre-training on a broad mathematical corpus with a suite of novel data tasks, including a "CoT-augmented state prediction" task for fine-grained reasoning. The second stage applies Supervised Fine-tuning (SFT) within an expert iteration loop to specialize the Spark-Prover-X1-7B and Spark-Formalizer-X1-7B models. Finally, a targeted round of Group Relative Policy Optimization (GRPO) sharpens the prover on the hardest problems. For robust evaluation, especially on problems from real-world examinations, the authors also introduce ExamFormal-Bench, a benchmark of 402 formal problems. Spark-Prover-X1-7B achieves state-of-the-art performance among similarly sized open-source models, with a 37.0% average pass rate (pass@32), and performs strongly on difficult competition benchmarks such as PutnamBench and CombiBench. The work validates diverse training data and a progressively refined training pipeline as an effective path to stronger formal reasoning in lightweight LLMs.

Key Takeaways

  1. Spark-Prover-X1 is trained via a three-stage framework designed to improve the reasoning capability of LLMs.
  2. The first stage infuses deep knowledge through pre-training and novel data tasks, including "CoT-augmented state prediction".
  3. The second stage specializes the models via Supervised Fine-tuning (SFT) within an expert iteration loop.
  4. Group Relative Policy Optimization (GRPO) is applied to sharpen the prover on the hardest problems.
  5. A new benchmark, ExamFormal-Bench, with 402 formal problems, supports robust evaluation of model performance.
  6. Spark-Prover-X1-7B performs best among similarly sized models, with a 37.0% average pass rate.
  7. The model performs well on difficult competition benchmarks such as PutnamBench and CombiBench.



Author: Kedreamix
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!