⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: never use these for serious academic work; they are only meant as a first-pass screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-09-19
Apertus: Democratizing Open and Compliant LLMs for Global Language Environments
Authors:Alejandro Hernández-Cano, Alexander Hägele, Allen Hao Huang, Angelika Romanou, Antoni-Joan Solergibert, Barna Pasztor, Bettina Messmer, Dhia Garbaya, Eduard Frank Ďurech, Ido Hakimi, Juan García Giraldo, Mete Ismayilzada, Negar Foroutan, Skander Moalla, Tiancheng Chen, Vinko Sabolčec, Yixuan Xu, Michael Aerni, Badr AlKhamissi, Ines Altemir Marinas, Mohammad Hossein Amani, Matin Ansaripour, Ilia Badanin, Harold Benoit, Emanuela Boros, Nicholas Browning, Fabian Bösch, Maximilian Böther, Niklas Canova, Camille Challier, Clement Charmillot, Jonathan Coles, Jan Deriu, Arnout Devos, Lukas Drescher, Daniil Dzenhaliou, Maud Ehrmann, Dongyang Fan, Simin Fan, Silin Gao, Miguel Gila, María Grandury, Diba Hashemi, Alexander Hoyle, Jiaming Jiang, Mark Klein, Andrei Kucharavy, Anastasiia Kucherenko, Frederike Lübeck, Roman Machacek, Theofilos Manitaras, Andreas Marfurt, Kyle Matoba, Simon Matrenok, Henrique Mendoncça, Fawzi Roberto Mohamed, Syrielle Montariol, Luca Mouchel, Sven Najem-Meyer, Jingwei Ni, Gennaro Oliva, Matteo Pagliardini, Elia Palme, Andrei Panferov, Léo Paoletti, Marco Passerini, Ivan Pavlov, Auguste Poiroux, Kaustubh Ponkshe, Nathan Ranchin, Javi Rando, Mathieu Sauser, Jakhongir Saydaliev, Muhammad Ali Sayfiddinov, Marian Schneider, Stefano Schuppli, Marco Scialanga, Andrei Semenov, Kumar Shridhar, Raghav Singhal, Anna Sotnikova, Alexander Sternfeld, Ayush Kumar Tarun, Paul Teiletche, Jannis Vamvas, Xiaozhe Yao, Hao Zhao, Alexander Ilic, Ana Klimovic, Andreas Krause, Caglar Gulcehre, David Rosenthal, Elliott Ash, Florian Tramèr, Joost VandeVondele, Livio Veraldi, Martin Rajman, Thomas Schulthess, Torsten Hoefler, Antoine Bosselut, Martin Jaggi, Imanol Schlag
We present Apertus, a fully open suite of large language models (LLMs) designed to address two systemic shortcomings in today’s open model ecosystem: data compliance and multilingual representation. Unlike many prior models that release weights without reproducible data pipelines or regard for content-owner rights, Apertus models are pretrained exclusively on openly available data, retroactively respecting robots.txt exclusions and filtering for non-permissive, toxic, and personally identifiable content. To mitigate risks of memorization, we adopt the Goldfish objective during pretraining, strongly suppressing verbatim recall of data while retaining downstream task performance. The Apertus models also expand multilingual coverage, training on 15T tokens from over 1800 languages, with ~40% of pretraining data allocated to non-English content. Released at 8B and 70B scales, Apertus approaches state-of-the-art results among fully open models on multilingual benchmarks, rivalling or surpassing open-weight counterparts. Beyond model weights, we release all scientific artifacts from our development cycle with a permissive license, including data preparation scripts, checkpoints, evaluation suites, and training code, enabling transparent audit and extension.
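The Goldfish objective suppresses verbatim recall by dropping the language-modeling loss on a pseudo-random subset of tokens, so no training sequence is ever fully supervised. Below is a minimal PyTorch sketch of such a token-dropout loss; the position-and-token hash used for the mask is a toy stand-in (the actual Goldfish loss keys the mask on a hash of the local context), so treat this as an illustration rather than the Apertus implementation.

```python
import torch
import torch.nn.functional as F

def goldfish_loss(logits, labels, k=4):
    """Next-token cross-entropy that ignores roughly 1/k of the tokens,
    chosen by a deterministic hash, so verbatim sequences are never fully
    supervised. logits: (B, T, V); labels: (B, T) token ids."""
    logits = logits[:, :-1].contiguous()          # predict token t+1 from t
    labels = labels[:, 1:].contiguous()
    per_token = F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1), reduction="none"
    ).view(labels.shape)
    pos = torch.arange(labels.size(1), device=labels.device)
    drop = (pos * 2654435761 + labels) % k == 0   # toy hash over position + token
    keep = (~drop).float()
    return (per_token * keep).sum() / keep.sum().clamp(min=1.0)
```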
Paper and project links
Summary
Apertus is a fully open suite of large language models (LLMs) designed to address two systemic shortcomings of today's open model ecosystem: data compliance and multilingual representation. The models are pretrained exclusively on openly available data, retroactively respecting robots.txt exclusions and filtering out non-permissive, toxic, and personally identifiable content. The Goldfish objective is adopted during pretraining to mitigate memorization risk while retaining downstream task performance. Apertus also expands multilingual coverage, training on 15T tokens from over 1800 languages, with roughly 40% of the pretraining data being non-English. Released at 8B and 70B scales, Apertus approaches state-of-the-art results among fully open models on multilingual benchmarks, rivalling or surpassing open-weight counterparts. Beyond model weights, all scientific artifacts from the development cycle are released under a permissive license, including data preparation scripts, checkpoints, evaluation suites, and training code, enabling transparent auditing and extension.
Key Takeaways
- Apertus is an LLM suite focused on data compliance and multilingual representation.
- The models are pretrained on openly available data, respect robots.txt exclusions, and filter out non-permissive, toxic, and personally identifiable content.
- The Goldfish objective is used during pretraining to reduce memorization risk while preserving downstream performance.
- Apertus expands multilingual coverage, training on data from more than 1800 languages.
- The models are released at two scales, 8B and 70B.
- On multilingual benchmarks, Apertus rivals or surpasses other fully open and open-weight models.
Click to view paper screenshots

NIRVANA: Structured pruning reimagined for large language models compression
Authors:Mengting Ai, Tianxin Wei, Sirui Chen, Jingrui He
Structured pruning of large language models (LLMs) offers substantial efficiency improvements by removing entire hidden units, yet current approaches often suffer from significant performance degradation, particularly in zero-shot settings, and necessitate costly recovery techniques such as supervised fine-tuning (SFT) or adapter insertion. To address these critical shortcomings, we introduce NIRVANA, a novel pruning method explicitly designed to balance immediate zero-shot accuracy preservation with robust fine-tuning capability. Leveraging a first-order saliency criterion derived from the Neural Tangent Kernel under Adam optimization dynamics, NIRVANA provides a theoretically grounded pruning strategy that respects essential model training behaviors. To further address the unique challenges posed by structured pruning, NIRVANA incorporates an adaptive sparsity allocation mechanism across layers and modules (attention vs. MLP), which adjusts pruning intensity between modules in a globally balanced manner. Additionally, to mitigate the high sensitivity of pruning decisions to calibration data quality, we propose a simple yet effective KL divergence-based calibration data selection strategy, ensuring more reliable and task-agnostic pruning outcomes. Comprehensive experiments conducted on Llama3, Qwen, and T5 models demonstrate that NIRVANA outperforms existing structured pruning methods under equivalent sparsity constraints, providing a theoretically sound and practical approach to LLM compression. The code is available at https://github.com/iDEA-iSAIL-Lab-UIUC/NIRVANA.
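One component is easy to illustrate in isolation: selecting calibration data by KL divergence so that pruning statistics are not skewed by unrepresentative samples. The sketch below scores each candidate sequence by the KL divergence between the pooled next-token distribution and the candidate's own, keeping the closest ones; the exact criterion and granularity in NIRVANA may differ, and a Hugging Face-style causal LM with a `.logits` output is assumed.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def select_calibration(model, candidates, m):
    """Keep the m candidate sequences whose next-token distribution is
    closest (in KL divergence) to the pooled average over all candidates.
    candidates: list of (1, seq_len) LongTensors of token ids."""
    log_probs = []
    for ids in candidates:
        logits = model(ids).logits[:, -1]          # distribution after the text
        log_probs.append(F.log_softmax(logits, dim=-1).squeeze(0))
    stacked = torch.stack(log_probs)               # (N, vocab)
    log_mean = torch.logsumexp(stacked, dim=0) - torch.log(
        torch.tensor(float(len(candidates))))      # log of the pooled distribution
    scores = [F.kl_div(lp, log_mean, log_target=True, reduction="sum").item()
              for lp in log_probs]                 # KL(pooled || candidate)
    keep = sorted(range(len(candidates)), key=scores.__getitem__)[:m]
    return [candidates[i] for i in keep]
```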
Paper and project links
Summary
Structured pruning of large language models (LLMs) improves efficiency by removing entire hidden units, but current methods suffer from significant performance degradation, especially in zero-shot settings, and need costly recovery such as supervised fine-tuning. NIRVANA is designed to balance immediate zero-shot accuracy preservation with robust fine-tuning capability, grounding its pruning strategy in a first-order saliency criterion derived from the Neural Tangent Kernel under Adam optimization dynamics. To further address the challenges of structured pruning, it adopts an adaptive sparsity allocation mechanism that adjusts pruning intensity across layers and modules (attention vs. MLP) in a globally balanced manner. To reduce the sensitivity of pruning decisions to calibration data quality, a KL divergence-based calibration data selection strategy yields more reliable, task-agnostic pruning outcomes. Experiments show that NIRVANA outperforms existing structured pruning methods under equivalent sparsity constraints.
Key Takeaways
- Structured pruning improves the efficiency of large language models (LLMs).
- Current pruning methods suffer from degraded performance in zero-shot settings.
- NIRVANA balances immediate zero-shot accuracy preservation with robust fine-tuning capability.
- Its theoretical foundation is a Neural Tangent Kernel saliency criterion under Adam optimization dynamics.
- NIRVANA adaptively allocates sparsity across layers and modules.
- KL divergence-based calibration data selection makes pruning decisions more reliable.
Click to view paper screenshots


Dense Video Understanding with Gated Residual Tokenization
Authors:Haichao Zhang, Wenhao Chai, Shwai He, Ang Li, Yun Fu
High temporal resolution is essential for capturing fine-grained details in video understanding. However, current video large language models (VLLMs) and benchmarks mostly rely on low-frame-rate sampling, such as uniform sampling or keyframe selection, discarding dense temporal information. This compromise avoids the high cost of tokenizing every frame, which otherwise leads to redundant computation and linear token growth as video length increases. While this trade-off works for slowly changing content, it fails for tasks like lecture comprehension, where information appears in nearly every frame and requires precise temporal alignment. To address this gap, we introduce Dense Video Understanding (DVU), which enables high-FPS video comprehension by reducing both tokenization time and token overhead. Existing benchmarks are also limited, as their QA pairs focus on coarse content changes. We therefore propose DIVE (Dense Information Video Evaluation), the first benchmark designed for dense temporal reasoning. To make DVU practical, we present Gated Residual Tokenization (GRT), a two-stage framework: (1) Motion-Compensated Inter-Gated Tokenization uses pixel-level motion estimation to skip static regions during tokenization, achieving sub-linear growth in token count and compute. (2) Semantic-Scene Intra-Tokenization Merging fuses tokens across static regions within a scene, further reducing redundancy while preserving dynamic semantics. Experiments on DIVE show that GRT outperforms larger VLLM baselines and scales positively with FPS. These results highlight the importance of dense temporal information and demonstrate that GRT enables efficient, scalable high-FPS video understanding.
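GRT's first stage gates tokenization on motion so that static patches are skipped and token count grows sub-linearly with frame rate. The sketch below substitutes a plain per-patch frame difference for the paper's pixel-level motion estimation; the patch size and threshold are illustrative, not the paper's settings.

```python
import torch

def gated_patches(frames, patch=16, tau=0.02):
    """Keep only the patches that changed since the previous frame, a toy
    stand-in for motion-compensated inter-gated tokenization.
    frames: (T, C, H, W) float tensor in [0, 1]."""
    T, C, H, W = frames.shape
    # unfold into non-overlapping patches: (T, num_patches, C * patch * patch)
    p = frames.unfold(2, patch, patch).unfold(3, patch, patch)
    p = p.permute(0, 2, 3, 1, 4, 5).reshape(T, -1, C * patch * patch)
    keep = [torch.arange(p.size(1))]          # tokenize every patch of frame 0
    for t in range(1, T):
        motion = (p[t] - p[t - 1]).abs().mean(dim=-1)   # per-patch change
        keep.append(torch.nonzero(motion > tau).squeeze(-1))
    return p, keep   # a downstream tokenizer would embed only p[t][keep[t]]
```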
Paper and project links
Summary
High temporal resolution is essential for capturing fine-grained details in video understanding, yet current video large language models (VLLMs) and benchmarks mostly rely on low-frame-rate sampling such as uniform sampling or keyframe selection, discarding dense temporal information. This trade-off avoids the cost of tokenizing every frame, but it fails on tasks like lecture comprehension, where information appears in nearly every frame and requires precise temporal alignment. Dense Video Understanding (DVU) closes this gap by reducing both tokenization time and token overhead, and DIVE (Dense Information Video Evaluation) is proposed as the first benchmark designed for dense temporal reasoning, since existing benchmarks focus on coarse content changes. To make DVU practical, Gated Residual Tokenization (GRT) uses a two-stage framework: (1) motion-compensated inter-gated tokenization applies pixel-level motion estimation to skip static regions during tokenization, achieving sub-linear growth in token count and compute; (2) semantic-scene intra-tokenization merging fuses tokens across static regions within a scene, further reducing redundancy while preserving dynamic semantics. Experiments on DIVE show that GRT outperforms larger VLLM baselines and scales positively with FPS.
Key Takeaways
- High temporal resolution is essential for capturing fine-grained details in video understanding.
- Current VLLMs and benchmarks rely on low-frame-rate sampling, losing dense temporal information.
- Dense Video Understanding (DVU) is introduced to enable high-FPS video comprehension.
- Existing benchmarks cannot adequately assess the importance of dense temporal information.
- DIVE (Dense Information Video Evaluation) is proposed as the first benchmark for dense temporal reasoning.
- The Gated Residual Tokenization (GRT) framework enables efficient, scalable high-FPS video understanding.
Click to view paper screenshots




TGPO: Tree-Guided Preference Optimization for Robust Web Agent Reinforcement Learning
Authors:Ziyuan Chen, Zhenghui Zhao, Zhangye Han, Miancan Liu, Xianhang Ye, Yiqing Li, Hongbo Min, Jinkui Ren, Xiantao Zhang, Guitao Cao
With the rapid advancement of large language models and vision-language models, employing large models as Web Agents has become essential for automated web interaction. However, training Web Agents with reinforcement learning faces critical challenges including credit assignment misallocation, prohibitively high annotation costs, and reward sparsity. To address these issues, we propose Tree-Guided Preference Optimization (TGPO), an offline reinforcement learning framework that proposes a tree-structured trajectory representation merging semantically identical states across trajectories to eliminate label conflicts. Our framework incorporates a Process Reward Model that automatically generates fine-grained rewards through subgoal progress, redundancy detection, and action verification. Additionally, a dynamic weighting mechanism prioritizes high-impact decision points during training. Experiments on Online-Mind2Web and our self-constructed C-WebShop datasets demonstrate that TGPO significantly outperforms existing methods, achieving higher success rates with fewer redundant steps.
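TGPO's tree-structured trajectory representation merges semantically identical states across trajectories so that step-level preference labels no longer conflict. Below is a toy sketch of that merge, assuming a hypothetical `state_key` canonicalizer (e.g., a hash of the normalized page state or an embedding-cluster id) that decides when two states count as identical; the paper's Process Reward Model and dynamic weighting are not shown.

```python
from collections import defaultdict

class TrajectoryTree:
    """Merges web-agent trajectories into a tree keyed by canonicalized
    states, accumulating per-node statistics for preference learning."""
    def __init__(self, state_key=hash):
        self.state_key = state_key            # canonicalizer for states
        self.children = defaultdict(dict)     # node -> {action: child node}
        self.visits = defaultdict(int)
        self.returns = defaultdict(float)

    def add(self, trajectory, final_return):
        """trajectory: sequence of (state, action) pairs from one episode."""
        node = ("root",)
        for state, action in trajectory:
            node = self.children[node].setdefault(
                action, (self.state_key(state), action))
            self.visits[node] += 1
            self.returns[node] += final_return

    def value(self, node):
        return self.returns[node] / max(self.visits[node], 1)
```

Two trajectories that reach the same canonical state via the same action now share one node, so a success and a failure update the same statistics instead of producing contradictory step-level labels.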
Paper and project links
Summary
With the rapid advancement of large language models and vision-language models, using large models as Web Agents has become essential for automated web interaction. Training Web Agents with reinforcement learning, however, faces critical challenges: credit assignment misallocation, prohibitively high annotation costs, and reward sparsity. Tree-Guided Preference Optimization (TGPO) is an offline reinforcement learning framework that merges semantically identical states across trajectories into a tree-structured representation to eliminate label conflicts. A Process Reward Model automatically generates fine-grained rewards through subgoal progress, redundancy detection, and action verification, while a dynamic weighting mechanism prioritizes high-impact decision points during training. Experiments on Online-Mind2Web and the self-constructed C-WebShop dataset show that TGPO significantly outperforms existing methods, achieving higher success rates with fewer redundant steps.
Key Takeaways
- Rapid progress in large language and vision-language models makes large models essential as Web Agents for automated web interaction.
- Training Web Agents with reinforcement learning faces credit assignment misallocation, prohibitively high annotation costs, and reward sparsity.
- Tree-Guided Preference Optimization (TGPO) is an offline reinforcement learning framework designed to address these issues.
- TGPO eliminates label conflicts with a tree-structured trajectory representation that merges semantically identical states.
- A Process Reward Model automatically generates fine-grained rewards through subgoal progress, redundancy detection, and action verification.
- A dynamic weighting mechanism prioritizes high-impact decision points during training.
Click to view paper screenshots





MARS2 2025 Challenge on Multimodal Reasoning: Datasets, Methods, Results, Discussion, and Outlook
Authors:Peng Xu, Shengwu Xiong, Jiajun Zhang, Yaxiong Chen, Bowen Zhou, Chen Change Loy, David A. Clifton, Kyoung Mu Lee, Luc Van Gool, Ruiming He, Ruilin Yao, Xinwei Long, Jirui Huang, Kai Tian, Sa Yang, Yihua Shao, Jin Feng, Yue Zhong, Jiakai Zhou, Cheng Tang, Tianyu Zou, Yifang Zhang, Junming Liang, Guoyou Li, Zhaoxiang Wang, Qiang Zhou, Yichen Zhao, Shili Xiong, Hyeongjin Nam, Jaerin Lee, Jaeyoung Chung, JoonKyu Park, Junghun Oh, Kanggeon Lee, Wooseok Lee, Juneyoung Ro, Turghun Osman, Can Hu, Chaoyang Liao, Cheng Chen, Chengcheng Han, Chenhao Qiu, Chong Peng, Cong Xu, Dailin Li, Feiyu Wang, Feng Gao, Guibo Zhu, Guopeng Tang, Haibo Lu, Han Fang, Han Qi, Hanxiao Wu, Haobo Cheng, Hongbo Sun, Hongyao Chen, Huayong Hu, Hui Li, Jiaheng Ma, Jiang Yu, Jianing Wang, Jie Yang, Jing He, Jinglin Zhou, Jingxuan Li, Josef Kittler, Lihao Zheng, Linnan Zhao, Mengxi Jia, Muyang Yan, Nguyen Thanh Thien, Pu Luo, Qi Li, Shien Song, Shijie Dong, Shuai Shao, Shutao Li, Taofeng Xue, Tianyang Xu, Tianyi Gao, Tingting Li, Wei Zhang, Weiyang Su, Xiaodong Dong, Xiao-Jun Wu, Xiaopeng Zhou, Xin Chen, Xin Wei, Xinyi You, Xudong Kang, Xujie Zhou, Xusheng Liu, Yanan Wang, Yanbin Huang, Yang Liu, Yang Yang, Yanglin Deng, Yashu Kang, Ye Yuan, Yi Wen, Yicen Tian, Yilin Tao, Yin Tang, Yipeng Lin, Yiqing Wang, Yiting Xi, Yongkang Yu, Yumei Li, Yuxin Qin, Yuying Chen, Yuzhe Cen, Zhaofan Zou, Zhaohong Liu, Zhehao Shen, Zhenglin Du, Zhengyang Li, Zhenni Huang, Zhenwei Shao, Zhilong Song, Zhiyong Feng, Zhiyu Wang, Zhou Yu, Ziang Li, Zihan Zhai, Zijian Zhang, Ziyang Peng, Ziyun Xiao, Zongshu Li
This paper reviews the MARS2 2025 Challenge on Multimodal Reasoning. We aim to bring together different approaches in multimodal machine learning and LLMs via a large benchmark. We hope it better allows researchers to follow the state-of-the-art in this very dynamic area. Meanwhile, a growing number of testbeds have boosted the evolution of general-purpose large language models. Thus, this year’s MARS2 focuses on real-world and specialized scenarios to broaden the multimodal reasoning applications of MLLMs. Our organizing team released two tailored datasets Lens and AdsQA as test sets, which support general reasoning in 12 daily scenarios and domain-specific reasoning in advertisement videos, respectively. We evaluated 40+ baselines that include both generalist MLLMs and task-specific models, and opened up three competition tracks, i.e., Visual Grounding in Real-world Scenarios (VG-RS), Visual Question Answering with Spatial Awareness (VQA-SA), and Visual Reasoning in Creative Advertisement Videos (VR-Ads). Finally, 76 teams from the renowned academic and industrial institutions have registered and 40+ valid submissions (out of 1200+) have been included in our ranking lists. Our datasets, code sets (40+ baselines and 15+ participants’ methods), and rankings are publicly available on the MARS2 workshop website and our GitHub organization page https://github.com/mars2workshop/, where our updates and announcements of upcoming events will be continuously provided.
Paper and project links
PDF ICCV 2025 MARS2 Workshop and Challenge "Multimodal Reasoning and Slow Thinking in the Large Model Era: Towards System 2 and Beyond"
Summary
This paper reviews the MARS2 2025 Challenge on Multimodal Reasoning, which aims to bring together different approaches in multimodal machine learning and LLMs via a large benchmark and to help researchers follow the state of the art in this fast-moving area. With a growing number of testbeds boosting the evolution of general-purpose large language models, this year's MARS2 focuses on real-world and specialized scenarios to broaden the multimodal reasoning applications of MLLMs. The organizing team released two tailored datasets, Lens and AdsQA, as test sets, supporting general reasoning in 12 daily scenarios and domain-specific reasoning in advertisement videos, respectively. More than 40 baselines covering both generalist MLLMs and task-specific models were evaluated, and three competition tracks were opened. In the end, 76 teams from renowned academic and industrial institutions registered, and 40+ valid submissions (out of 1200+) entered the ranking lists. The datasets, code sets, and rankings are publicly available on the MARS2 workshop website and GitHub organization page.
Key Takeaways
- The MARS2 2025 Challenge aims to advance multimodal machine learning by benchmarking diverse approaches and large language models.
- The growing number of testbeds is accelerating the evolution of general-purpose large language models, reflecting broad demand for multimodal reasoning applications.
- MARS2 focuses on real-world and specialized scenarios, releasing two tailored datasets, Lens and AdsQA, for daily-scenario and advertisement-video reasoning, respectively.
- More than 40 baselines, both generalist MLLMs and task-specific models, were evaluated.
- The three competition tracks cover visual grounding in real-world scenarios, visual question answering with spatial awareness, and visual reasoning in creative advertisement videos.
- Teams from many academic and industrial institutions participated, with a large number of valid submissions entering the rankings.
Click to view paper screenshots



Reasoning Efficiently Through Adaptive Chain-of-Thought Compression: A Self-Optimizing Framework
Authors:Kerui Huang, Shuhan Liu, Xing Hu, Tongtong Xu, Lingfeng Bao, Xin Xia
Chain-of-Thought (CoT) reasoning enhances Large Language Models (LLMs) by prompting intermediate steps, improving accuracy and robustness in arithmetic, logic, and commonsense tasks. However, this benefit comes with high computational costs: longer outputs increase latency, memory usage, and KV-cache demands. These issues are especially critical in software engineering tasks where concise and deterministic outputs are required. To investigate these trade-offs, we conduct an empirical study based on code generation benchmarks. The results reveal that longer CoT does not always help. Excessive reasoning often causes truncation, accuracy drops, and latency up to five times higher, with failed outputs consistently longer than successful ones. These findings challenge the assumption that longer reasoning is inherently better and highlight the need for adaptive CoT control. Motivated by this, we propose SEER (Self-Enhancing Efficient Reasoning), an adaptive framework that compresses CoT while preserving accuracy. SEER combines Best-of-N sampling with task-aware adaptive filtering, dynamically adjusting thresholds based on pre-inference outputs to reduce verbosity and computational overhead. We then evaluate SEER on three software engineering tasks and one math task. On average, SEER shortens CoT by 42.1%, improves accuracy by reducing truncation, and eliminates most infinite loops. These results demonstrate SEER as a practical method to make CoT-enhanced LLMs more efficient and robust, even under resource constraints.
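SEER combines Best-of-N sampling with task-aware adaptive filtering: the length threshold is set from what this task's acceptable outputs actually look like, rather than fixed globally. Below is a minimal sketch of that idea under assumed hooks; `generate(prompt)` samples one chain-of-thought and `verify(chain)` checks its final answer, both hypothetical, and the real framework derives its thresholds from pre-inference outputs rather than a verifier.

```python
def seer_compress(generate, verify, prompt, n=8, q=0.3):
    """Sample n reasoning chains, keep those the verifier accepts, set a
    task-aware length budget from the survivors, and return the shortest
    chain within budget."""
    chains = [generate(prompt) for _ in range(n)]
    passed = [c for c in chains if verify(c)] or chains   # fall back if none pass
    lengths = sorted(len(c) for c in passed)
    budget = lengths[int(q * (len(lengths) - 1))]         # adaptive length cutoff
    return min((c for c in passed if len(c) <= budget), key=len)
```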
Paper and project links
Summary
Chain-of-Thought (CoT) reasoning strengthens large language models (LLMs) by prompting intermediate steps, improving accuracy and robustness on arithmetic, logic, and commonsense tasks, but at a high computational cost: longer outputs increase latency, memory usage, and KV-cache demands. These issues matter most in software engineering tasks, which require concise, deterministic outputs. An empirical study on code generation benchmarks shows that longer CoT does not always help: excessive reasoning often causes truncation, accuracy drops, and up to five-times-higher latency, with failed outputs consistently longer than successful ones. SEER (Self-Enhancing Efficient Reasoning) is an adaptive framework that compresses CoT while preserving accuracy, combining Best-of-N sampling with task-aware adaptive filtering and dynamically adjusting thresholds based on pre-inference outputs to reduce verbosity and computational overhead. Evaluated on three software engineering tasks and one math task, SEER shortens CoT by 42.1% on average, improves accuracy by reducing truncation, and eliminates most infinite loops.
Key Takeaways
- Chain-of-Thought (CoT) reasoning improves LLM performance on arithmetic, logic, and commonsense tasks.
- CoT reasoning carries high computational costs, increasing latency, memory usage, and KV-cache demands.
- In software engineering tasks, overly long reasoning can cause truncation, accuracy drops, and higher latency.
- SEER compresses CoT while preserving accuracy, using Best-of-N sampling and task-aware adaptive filtering to cut verbosity and overhead.
- Evaluations across several tasks show SEER substantially shortens CoT, improves accuracy, and eliminates most infinite loops.
- SEER is a practical way to make CoT-enhanced LLMs more efficient and robust under resource constraints.
Click to view paper screenshots



CrowdAgent: Multi-Agent Managed Multi-Source Annotation System
Authors:Maosheng Qin, Renyu Zhu, Mingxuan Xia, Chenkai Chen, Zhen Zhu, Minmin Lin, Junbo Zhao, Lu Xu, Changjie Fan, Runze Wu, Haobo Wang
High-quality annotated data is a cornerstone of modern Natural Language Processing (NLP). While recent methods begin to leverage diverse annotation sources-including Large Language Models (LLMs), Small Language Models (SLMs), and human experts-they often focus narrowly on the labeling step itself. A critical gap remains in the holistic process control required to manage these sources dynamically, addressing complex scheduling and quality-cost trade-offs in a unified manner. Inspired by real-world crowdsourcing companies, we introduce CrowdAgent, a multi-agent system that provides end-to-end process control by integrating task assignment, data annotation, and quality/cost management. It implements a novel methodology that rationally assigns tasks, enabling LLMs, SLMs, and human experts to advance synergistically in a collaborative annotation workflow. We demonstrate the effectiveness of CrowdAgent through extensive experiments on six diverse multimodal classification tasks. The source code and video demo are available at https://github.com/QMMMS/CrowdAgent.
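At the heart of such a system is a scheduler that trades annotation quality against cost when routing each task among LLMs, SLMs, and human experts. The toy router below only illustrates that trade-off with static quality and cost estimates; CrowdAgent itself manages this dynamically with multiple agents, and the annotator names and fields here are hypothetical.

```python
def route_task(task, annotators, budget):
    """Pick the cheapest annotator whose estimated quality clears the task's
    requirement; if none qualifies, fall back to the best affordable one.
    annotators: list of dicts like {"name": "llm-a", "quality": 0.9, "cost": 5}."""
    affordable = [a for a in annotators if a["cost"] <= budget]
    qualified = [a for a in affordable if a["quality"] >= task["min_quality"]]
    if qualified:
        return min(qualified, key=lambda a: a["cost"])
    return max(affordable, key=lambda a: a["quality"]) if affordable else None
```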
Paper and project links
Summary
High-quality annotated data is a cornerstone of modern NLP, yet existing methods focus narrowly on the labeling step itself and lack holistic process control over diverse annotation sources. CrowdAgent is a multi-agent system that integrates task assignment, data annotation, and quality/cost management for end-to-end process control, rationally assigning tasks so that large language models, small language models, and human experts advance synergistically in a collaborative annotation workflow. Extensive experiments on six multimodal classification tasks demonstrate its effectiveness. Source code and a video demo are available at https://github.com/QMMMS/CrowdAgent.
Key Takeaways
- High-quality annotated data is crucial in natural language processing.
- Current methods focus mainly on the labeling step itself and lack control over the whole annotation process.
- CrowdAgent uses multi-agent techniques for integrated, end-to-end control of the annotation process.
- The system rationally assigns tasks, enabling LLMs, SLMs, and human experts to work in concert.
- Experiments verify CrowdAgent's effectiveness on six multimodal classification tasks.
Click to view paper screenshots





Large Language Model-Empowered Decision Transformer for UAV-Enabled Data Collection
Authors:Zhixion Chen, Jiangzhou Wang, Hyundong Shin, Arumugam Nallanathan
The deployment of unmanned aerial vehicles (UAVs) for reliable and energy-efficient data collection from spatially distributed devices holds great promise in supporting diverse Internet of Things (IoT) applications. Nevertheless, the limited endurance and communication range of UAVs necessitate intelligent trajectory planning. While reinforcement learning (RL) has been extensively explored for UAV trajectory optimization, its interactive nature entails high costs and risks in real-world environments. Offline RL mitigates these issues but remains susceptible to unstable training and heavily rely on expert-quality datasets. To address these challenges, we formulate a joint UAV trajectory planning and resource allocation problem to maximize energy efficiency of data collection. The resource allocation subproblem is first transformed into an equivalent linear programming formulation and solved optimally with polynomial-time complexity. Then, we propose a large language model (LLM)-empowered critic-regularized decision transformer (DT) framework, termed LLM-CRDT, to learn effective UAV control policies. In LLM-CRDT, we incorporate critic networks to regularize the DT model training, thereby integrating the sequence modeling capabilities of DT with critic-based value guidance to enable learning effective policies from suboptimal datasets. Furthermore, to mitigate the data-hungry nature of transformer models, we employ a pre-trained LLM as the transformer backbone of the DT model and adopt a parameter-efficient fine-tuning strategy, i.e., LoRA, enabling rapid adaptation to UAV control tasks with small-scale dataset and low computational overhead. Extensive simulations demonstrate that LLM-CRDT outperforms benchmark online and offline RL methods, achieving up to 36.7% higher energy efficiency than the current state-of-the-art DT approaches.
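The critic-regularized decision-transformer idea can be summarized in its loss: behavior cloning on the dataset's actions plus a critic term that pushes the policy toward high-value actions, which is what lets suboptimal data still teach good policies. Below is a sketch under assumed shapes for a discrete action space; LLM-CRDT's exact objective, critic architecture, return conditioning, and LoRA-tuned LLM backbone are beyond this snippet.

```python
import torch
import torch.nn.functional as F

def crdt_loss(action_logits, actions, q_values, lam=0.5):
    """Behavior cloning plus critic-based value guidance.
    action_logits: (B, T, A) policy outputs; actions: (B, T) dataset actions;
    q_values: (B, T, A) critic estimates for every action."""
    bc = F.cross_entropy(action_logits.flatten(0, 1), actions.flatten())
    probs = action_logits.softmax(dim=-1)
    critic = -(probs * q_values).sum(dim=-1).mean()   # maximize expected Q
    return bc + lam * critic
```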
Paper and project links
PDF 14 pages, 8 figures
Summary
Deploying unmanned aerial vehicles (UAVs) for reliable, energy-efficient data collection from spatially distributed devices holds great promise for Internet of Things (IoT) applications, but limited endurance and communication range call for intelligent trajectory planning. A joint UAV trajectory planning and resource allocation problem is formulated to maximize the energy efficiency of data collection; the resource allocation subproblem is transformed into an equivalent linear program and solved optimally with polynomial-time complexity. A large language model (LLM)-empowered critic-regularized decision transformer framework, LLM-CRDT, then learns effective UAV control policies: critic networks regularize decision transformer training so that effective policies can be learned from suboptimal datasets, while a pre-trained LLM serves as the transformer backbone with parameter-efficient LoRA fine-tuning, enabling rapid adaptation with small datasets and low computational overhead. Simulations show that LLM-CRDT outperforms online and offline RL baselines, achieving up to 36.7% higher energy efficiency than state-of-the-art decision transformer approaches.
Key Takeaways
- UAVs hold promise for reliable, energy-efficient data collection in IoT applications.
- Limited UAV endurance and communication range require intelligent trajectory planning.
- A joint trajectory planning and resource allocation problem is formulated to maximize energy efficiency.
- The resource allocation subproblem is solved optimally via linear programming with polynomial-time complexity.
- The LLM-CRDT framework combines a decision transformer with critic networks to learn effective UAV control policies.
- A pre-trained LLM backbone with LoRA fine-tuning enables rapid adaptation to small datasets at low computational cost.
Click to view paper screenshots



Do Large Language Models Understand Word Senses?
Authors:Domenico Meconi, Simone Stirpe, Federico Martelli, Leonardo Lavalle, Roberto Navigli
Understanding the meaning of words in context is a fundamental capability for Large Language Models (LLMs). Despite extensive evaluation efforts, the extent to which LLMs show evidence that they truly grasp word senses remains underexplored. In this paper, we address this gap by evaluating both i) the Word Sense Disambiguation (WSD) capabilities of instruction-tuned LLMs, comparing their performance to state-of-the-art systems specifically designed for the task, and ii) the ability of two top-performing open- and closed-source LLMs to understand word senses in three generative settings: definition generation, free-form explanation, and example generation. Notably, we find that, in the WSD task, leading models such as GPT-4o and DeepSeek-V3 achieve performance on par with specialized WSD systems, while also demonstrating greater robustness across domains and levels of difficulty. In the generation tasks, results reveal that LLMs can explain the meaning of words in context up to 98% accuracy, with the highest performance observed in the free-form explanation task, which best aligns with their generative capabilities.
Paper and project links
PDF 20 pages, to be published in EMNLP 2025
Summary
Understanding word meaning in context is a fundamental capability of large language models (LLMs). This paper evaluates the Word Sense Disambiguation (WSD) capabilities of instruction-tuned LLMs against state-of-the-art systems designed specifically for the task, and assesses how well two top-performing open- and closed-source LLMs understand word senses in three generative settings: definition generation, free-form explanation, and example generation. On WSD, leading models such as GPT-4o and DeepSeek-V3 perform on par with specialized WSD systems while showing greater robustness across domains and difficulty levels. In the generation tasks, LLMs explain the meaning of words in context with up to 98% accuracy, with the best performance in free-form explanation, which aligns most closely with their generative capabilities.
Key Takeaways
- Understanding word meaning in context is a fundamental capability for large language models (LLMs).
- The paper evaluates LLMs' Word Sense Disambiguation (WSD) abilities against dedicated WSD systems.
- Instruction-tuned LLMs perform on par with state-of-the-art specialized WSD systems.
- The leading LLMs show robustness across domains and difficulty levels.
- Word-sense understanding is assessed in three generative settings: definition generation, free-form explanation, and example generation.
- LLMs explain word meaning in context with up to 98% accuracy.
Click to view paper screenshots







Pre-Manipulation Alignment Prediction with Parallel Deep State-Space and Transformer Models
Authors:Motonari Kambara, Komei Sugiura
In this work, we address the problem of predicting the future success of open-vocabulary object manipulation tasks. Conventional approaches typically determine success or failure after the action has been carried out. However, they make it difficult to prevent potential hazards and rely on failures to trigger replanning, thereby reducing the efficiency of object manipulation sequences. To overcome these challenges, we propose a model, which predicts the alignment between a pre-manipulation egocentric image with the planned trajectory and a given natural language instruction. We introduce a Multi-Level Trajectory Fusion module, which employs a state-of-the-art deep state-space model and a transformer encoder in parallel to capture multi-level time-series self-correlation within the end effector trajectory. Our experimental results indicate that the proposed method outperformed existing methods, including foundation models.
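The proposed Multi-Level Trajectory Fusion module runs the end-effector trajectory through a deep state-space branch and a transformer encoder in parallel to capture multi-level time-series self-correlation. Below is a minimal PyTorch sketch of that parallel design; a GRU is substituted for the paper's state-of-the-art deep state-space model, and the dimensions and pooling are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiLevelTrajectoryFusion(nn.Module):
    """Parallel transformer + recurrent/state-space branches over a
    trajectory, fused into one embedding."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True),
            num_layers=2)
        self.ssm = nn.GRU(dim, dim, batch_first=True)   # stand-in for a deep SSM
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, traj):                 # traj: (B, T, dim)
        a = self.attn(traj)                  # global self-correlation
        s, _ = self.ssm(traj)                # sequential state dynamics
        return self.fuse(torch.cat([a, s], dim=-1)).mean(dim=1)  # pooled embed

fusion = MultiLevelTrajectoryFusion()
embedding = fusion(torch.randn(2, 50, 64))   # two 50-step trajectories -> (2, 64)
```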
Paper and project links
PDF Published in Advanced Robotics
Summary
Predicting the future success of open-vocabulary object manipulation tasks is a key problem. Conventional approaches judge success or failure only after the action has been carried out, which makes it hard to prevent potential hazards and forces replanning to be triggered by failures, reducing the efficiency of manipulation sequences. The proposed model instead predicts the alignment between a pre-manipulation egocentric image with the planned trajectory and a given natural language instruction. A Multi-Level Trajectory Fusion module employs a state-of-the-art deep state-space model and a transformer encoder in parallel to capture multi-level time-series self-correlation within the end-effector trajectory. Experiments show that the method outperforms existing approaches, including foundation models.
Key Takeaways
- Conventional approaches judge manipulation success only after execution, making it hard to prevent potential hazards.
- A new model predicts the future success of open-vocabulary object manipulation tasks before execution.
- It predicts the alignment between a pre-manipulation egocentric image, the planned trajectory, and a natural language instruction.
- The Multi-Level Trajectory Fusion module uses a deep state-space model and a transformer encoder in parallel.
- The module captures multi-level time-series self-correlation within the end-effector trajectory.
- Experiments show the model outperforms existing methods, including foundation models.
Click to view paper screenshots



Teaching According to Talents! Instruction Tuning LLMs with Competence-Aware Curriculum Learning
Authors:Yangning Li, Tingwei Lu, Yinghui Li, Yankai Chen, Wei-Chieh Huang, Wenhao Jiang, Hui Wang, Hai-Tao Zheng, Philip S. Yu
Efficient instruction tuning aims to enhance the ultimate performance of large language models (LLMs) trained on a given instruction dataset. Curriculum learning as a typical data organization strategy has shown preliminary effectiveness in instruction tuning. However, current curriculum tuning methods suffer from the curriculum rigidity, since they rely solely on static heuristic difficulty metrics. These methods fail to adapt to the evolving capabilities of models during training, resulting in a fixed and potentially sub-optimal learning trajectory. To address the issue, Competence-Aware Multi-Perspective cUrriculum inStruction tuning framework termed CAMPUS is proposed. CAMPUS offers several advantages: (1) Dynamic selection for sub-curriculum. (2) Competency-aware adjustment to the curriculum schedule. (3) Multiple difficulty-based scheduling. Extensive experiments prove the superior performance of CAMPUS, compared to other state-of-the-art baselines for efficient instruction tuning.
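The core scheduling idea, picking the next training examples whose difficulty matches the model's current competence rather than following a fixed order, can be sketched in a few lines. The single difficulty score and competence estimate below are simplifying assumptions for illustration; CAMPUS combines multiple difficulty metrics and adjusts the schedule dynamically during training.

```python
import random

def next_batch(pool, competence, width=0.1, k=16):
    """Sample the next instruction-tuning batch from examples whose difficulty
    sits around the model's current competence estimate.
    pool: list of (example, difficulty in [0, 1]); competence in [0, 1]."""
    lo, hi = competence - width, competence + width
    zone = [ex for ex, d in pool if lo <= d <= hi] or [ex for ex, _ in pool]
    return random.sample(zone, min(k, len(zone)))
```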
Paper and project links
PDF EMNLP 2025 Findings
Summary
Efficient instruction tuning aims to improve the final performance of large language models (LLMs) trained on a given instruction dataset, and curriculum learning has shown preliminary effectiveness as a data organization strategy. Current curriculum tuning methods, however, suffer from curriculum rigidity: they rely solely on static heuristic difficulty metrics and cannot adapt to the model's evolving capabilities during training, yielding a fixed and potentially sub-optimal learning trajectory. The proposed CAMPUS framework (Competence-Aware Multi-Perspective cUrriculum inStruction tuning) offers dynamic selection of sub-curricula, competency-aware adjustment of the curriculum schedule, and multiple difficulty-based scheduling. Extensive experiments show that CAMPUS outperforms other state-of-the-art baselines for efficient instruction tuning.
Key Takeaways
- Instruction tuning is key to improving the performance of large language models (LLMs).
- Curriculum learning strategies have shown preliminary effectiveness for instruction tuning.
- Current curriculum methods are rigid because they rely on static heuristic difficulty metrics.
- Dynamic sub-curriculum selection avoids the limitations of fixed difficulty measures.
- Competency-aware schedule adjustment lets the curriculum track the model's evolving capabilities.
- CAMPUS improves performance through multiple difficulty-based scheduling strategies.
Click to view paper screenshots


Is GPT-4o mini Blinded by its Own Safety Filters? Exposing the Multimodal-to-Unimodal Bottleneck in Hate Speech Detection
Authors:Niruthiha Selvanayagam, Ted Kurti
As Large Multimodal Models (LMMs) become integral to daily digital life, understanding their safety architectures is a critical problem for AI Alignment. This paper presents a systematic analysis of OpenAI’s GPT-4o mini, a globally deployed model, on the difficult task of multimodal hate speech detection. Using the Hateful Memes Challenge dataset, we conduct a multi-phase investigation on 500 samples to probe the model’s reasoning and failure modes. Our central finding is the experimental identification of a “Unimodal Bottleneck,” an architectural flaw where the model’s advanced multimodal reasoning is systematically preempted by context-blind safety filters. A quantitative validation of 144 content policy refusals reveals that these overrides are triggered in equal measure by unimodal visual 50% and textual 50% content. We further demonstrate that this safety system is brittle, blocking not only high-risk imagery but also benign, common meme formats, leading to predictable false positives. These findings expose a fundamental tension between capability and safety in state-of-the-art LMMs, highlighting the need for more integrated, context-aware alignment strategies to ensure AI systems can be deployed both safely and effectively.
Paper and project links
Summary
This paper systematically analyzes OpenAI's globally deployed GPT-4o mini on the difficult task of multimodal hate speech detection, using 500 samples from the Hateful Memes Challenge dataset. The central finding is a "Unimodal Bottleneck": the model's advanced multimodal reasoning is systematically preempted by context-blind safety filters, leaving it prone to misjudging image and text content. A quantitative validation of 144 content policy refusals shows that these overrides are triggered in equal measure by unimodal visual (50%) and textual (50%) content. The findings expose a fundamental tension between capability and safety in state-of-the-art large multimodal models, underscoring the need for more integrated, context-aware alignment strategies so that AI systems can be deployed both safely and effectively.
Key Takeaways
- GPT-4o mini is analyzed on the difficult task of multimodal hate speech detection.
- A "Unimodal Bottleneck" is identified: advanced multimodal reasoning is preempted by context-blind safety filters.
- The model consequently misjudges image and text content when detecting hate speech.
- Safety-filter overrides are triggered in equal measure by unimodal visual and textual content.
- The safety system is brittle, blocking not only high-risk imagery but also benign, common meme formats, causing predictable false positives.
- A fundamental tension exists between capability and safety in state-of-the-art large multimodal models.
Click to view paper screenshots




Uni-cot: Towards Unified Chain-of-Thought Reasoning Across Text and Vision
Authors:Luozheng Qin, Jia Gong, Yuqing Sun, Tianjiao Li, Mengping Yang, Xiaomeng Yang, Chao Qu, Zhiyu Tan, Hao Li
Chain-of-Thought (CoT) reasoning has been widely adopted to enhance Large Language Models (LLMs) by decomposing complex tasks into simpler, sequential subtasks. However, extending CoT to vision-language reasoning tasks remains challenging, as it often requires interpreting transitions of visual states to support reasoning. Existing methods often struggle with this due to limited capacity of modeling visual state transitions or incoherent visual trajectories caused by fragmented architectures. To overcome these limitations, we propose Uni-CoT, a Unified Chain-of-Thought framework that enables coherent and grounded multimodal reasoning within a single unified model. The key idea is to leverage a model capable of both image understanding and generation to reason over visual content and model evolving visual states. However, empowering a unified model to achieve that is non-trivial, given the high computational cost and the burden of training. To address this, Uni-CoT introduces a novel two-level reasoning paradigm: A Macro-Level CoT for high-level task planning and A Micro-Level CoT for subtask execution. This design significantly reduces the computational overhead. Furthermore, we introduce a structured training paradigm that combines interleaved image-text supervision for macro-level CoT with multi-task objectives for micro-level CoT. Together, these innovations allow Uni-CoT to perform scalable and coherent multi-modal reasoning. Furthermore, thanks to our design, all experiments can be efficiently completed using only 8 A100 GPUs with 80GB VRAM each. Experimental results on reasoning-driven image generation benchmark (WISE) and editing benchmarks (RISE and KRIS) indicates that Uni-CoT demonstrates SOTA performance and strong generalization, establishing Uni-CoT as a promising solution for multi-modal reasoning. Project Page and Code: https://sais-fuxi.github.io/projects/uni-cot/
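Uni-CoT's two-level paradigm separates high-level planning from subtask execution within one model. The control flow is easy to sketch; `plan` and `execute` below are hypothetical wrappers around a single unified model prompted in two modes, and threading the evolving state between steps stands in for the model's grounded reasoning over changing visual states.

```python
def uni_cot(plan, execute, task):
    """Macro-level CoT decomposes the task; micro-level CoT executes each
    subtask in turn, passing the evolving (visual) state along."""
    subtasks = plan(task)                     # macro: high-level task planning
    state = task
    for subtask in subtasks:
        state = execute(subtask, state)       # micro: one short reasoning episode
    return state
```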
Paper and project links
PDF Project Page: https://sais-fuxi.github.io/projects/uni-cot/
Summary
Chain-of-Thought (CoT) reasoning enhances large language models (LLMs) by decomposing complex tasks into simpler sequential subtasks, but extending CoT to vision-language reasoning is challenging because it requires interpreting transitions of visual states. Uni-CoT is a unified Chain-of-Thought framework that enables coherent, grounded multimodal reasoning within a single model capable of both image understanding and generation. It introduces a two-level reasoning paradigm, a macro-level CoT for high-level task planning and a micro-level CoT for subtask execution, which significantly reduces computational overhead, together with a structured training paradigm that combines interleaved image-text supervision with multi-task objectives. Experiments on the reasoning-driven image generation benchmark WISE and the editing benchmarks RISE and KRIS show state-of-the-art performance and strong generalization.
Key Takeaways
- Uni-CoT extends Chain-of-Thought (CoT) reasoning to multimodal tasks by decomposing complex tasks into simpler subtasks.
- It reasons over visual content and models evolving visual states with a single model that can both understand and generate images.
- A two-level paradigm, macro-level task planning plus micro-level subtask execution, reduces computational overhead.
- A structured training paradigm combines interleaved image-text supervision for the macro-level CoT with multi-task objectives for the micro-level CoT.
- Uni-CoT achieves state-of-the-art performance and strong generalization on image generation and editing benchmarks.
- Thanks to this design, all experiments can be completed efficiently on just 8 A100 GPUs with 80GB VRAM each.
- Uni-CoT is a promising solution for scalable multimodal reasoning.
Click to view paper screenshots


NL in the Middle: Code Translation with LLMs and Intermediate Representations
Authors:Chi-en Amy Tai, Pengyu Nie, Lukasz Golab, Alexander Wong
Studies show that large language models (LLMs) produce buggy code translations. One promising avenue to improve translation accuracy is through intermediate representations, which provide structured guidance for the translation process. We investigate whether LLM-based code translation can benefit from intermediate representations, specifically in the form of natural language (NL) summaries and abstract syntax trees (ASTs). Since prompt engineering greatly affects LLM performance, we consider several ways to integrate these representations, from one-shot to chain-of-thought (CoT) prompting. Using Open GPT4 8X7B and specialized StarCoder and CodeGen models on popular code translation benchmarks (CodeNet and AVATAR), we find that CoT with an intermediate NL summary performs best, with an increase of 13.8% and 6.7%, respectively, in successful translations for the best-performing model (Open GPT4 8X7B) compared to the zero-shot prompt.
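The best-performing configuration is a two-step chain of thought: first elicit a natural-language summary of the source program, then condition the target-language generation on it. Below is a minimal sketch of that pipeline; `llm` is a hypothetical text-completion callable, and the prompt wording is illustrative rather than the paper's.

```python
def translate_via_nl(llm, code, src="Java", dst="Python"):
    """CoT code translation with an intermediate NL summary: summarize the
    source first, then translate conditioned on code plus summary."""
    summary = llm(f"Summarize what this {src} code does, step by step:\n{code}")
    return llm(
        f"{src} code:\n{code}\n\nIts behavior:\n{summary}\n\n"
        f"Write equivalent {dst} code:"
    )
```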
Paper and project links
Summary
Large language models (LLMs) produce buggy code translations, and intermediate representations are a promising way to improve accuracy. The study examines whether LLM-based code translation benefits from natural language (NL) summaries and abstract syntax trees (ASTs), integrated through prompting strategies ranging from one-shot to chain-of-thought (CoT). Using Open GPT4 8X7B and the specialized StarCoder and CodeGen models on the CodeNet and AVATAR benchmarks, CoT with an intermediate NL summary performs best: successful translations for the best-performing model (Open GPT4 8X7B) increase by 13.8% and 6.7%, respectively, over the zero-shot prompt. This shows the potential of intermediate representations for improving LLM code translation.
Key Takeaways
- LLMs produce buggy code translations, so their accuracy needs improvement.
- Intermediate representations are a promising way to improve LLM code translation.
- Natural language (NL) summaries and abstract syntax trees (ASTs) are effective intermediate representations.
- Prompt engineering greatly affects LLM performance.
- Chain-of-thought (CoT) prompting with an intermediate NL summary performs best.
- Compared with the zero-shot prompt, the best model (Open GPT4 8X7B) with CoT gains 13.8% and 6.7% more successful translations on CodeNet and AVATAR, respectively.
Click to view paper screenshots




Benchmarking Large Language Models for Cryptanalysis and Side-Channel Vulnerabilities
Authors:Utsav Maskey, Chencheng Zhu, Usman Naseem
Recent advancements in large language models (LLMs) have transformed natural language understanding and generation, leading to extensive benchmarking across diverse tasks. However, cryptanalysis - a critical area for data security and its connection to LLMs’ generalization abilities - remains underexplored in LLM evaluations. To address this gap, we evaluate the cryptanalytic potential of state-of-the-art LLMs on ciphertexts produced by a range of cryptographic algorithms. We introduce a benchmark dataset of diverse plaintexts, spanning multiple domains, lengths, writing styles, and topics, paired with their encrypted versions. Using zero-shot and few-shot settings along with chain-of-thought prompting, we assess LLMs’ decryption success rate and discuss their comprehension abilities. Our findings reveal key insights into LLMs’ strengths and limitations in side-channel scenarios and raise concerns about their susceptibility to under-generalization-related attacks. This research highlights the dual-use nature of LLMs in security contexts and contributes to the ongoing discussion on AI safety and security.
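A benchmark of this kind pairs plaintexts with ciphertexts from known algorithms and prompts the model to decrypt in zero- or few-shot form. Below is a toy sketch using a Caesar cipher; the paper's benchmark spans a broader range of cryptographic algorithms, domains, lengths, and writing styles, and the prompt format here is an assumption.

```python
import string

def caesar(text, k):
    """Classical shift cipher used to build ciphertext/plaintext pairs."""
    shifted = {c: string.ascii_lowercase[(i + k) % 26]
               for i, c in enumerate(string.ascii_lowercase)}
    return "".join(shifted.get(c, c) for c in text.lower())

def make_prompt(plaintext, k=3, shots=()):
    """Assemble a zero-/few-shot decryption query; `shots` are demo plaintexts."""
    demos = "".join(f"Ciphertext: {caesar(p, k)}\nPlaintext: {p}\n\n"
                    for p in shots)
    return f"{demos}Ciphertext: {caesar(plaintext, k)}\nPlaintext:"
```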
Paper and project links
PDF EMNLP’25 Findings
Summary
Recent advances in large language models (LLMs) have transformed natural language understanding and generation, with extensive benchmarking across tasks, yet cryptanalysis, a critical area for data security that connects to LLMs' generalization abilities, remains underexplored in LLM evaluations. This work evaluates the cryptanalytic potential of state-of-the-art LLMs on ciphertexts produced by a range of cryptographic algorithms, introducing a benchmark dataset of diverse plaintexts spanning multiple domains, lengths, writing styles, and topics, paired with their encrypted versions. Using zero-shot and few-shot settings with chain-of-thought prompting, the study measures decryption success rates and discusses comprehension abilities. The findings reveal LLMs' strengths and limitations in side-channel scenarios, raise concerns about susceptibility to under-generalization-related attacks, highlight the dual-use nature of LLMs in security contexts, and contribute to the ongoing discussion on AI safety and security.
Key Takeaways
- Recent LLM advances have transformed natural language understanding and generation, with extensive benchmarking across tasks.
- Cryptanalysis remains an overlooked area in LLM evaluation, particularly its connection between data security and generalization ability.
- The study evaluates LLMs' cryptanalytic potential on ciphertexts produced by a range of cryptographic algorithms.
- Evaluation uses a benchmark dataset of diverse plaintexts paired with their encrypted versions.
- Decryption success rates are assessed under zero-shot and few-shot settings, revealing LLMs' strengths and limitations in side-channel scenarios.
- LLMs may be susceptible to under-generalization-related attacks, raising security concerns.
- The research highlights the dual-use nature of LLMs in security contexts: usable for defense but also potentially for attack.
Click to view paper screenshots






Understanding and Mitigating Overrefusal in LLMs from an Unveiling Perspective of Safety Decision Boundary
Authors:Licheng Pan, Yongqi Tong, Xin Zhang, Xiaolu Zhang, Jun Zhou, Zhixuan Chu
Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, yet they often refuse to answer legitimate queries–a phenomenon known as overrefusal. Overrefusal typically stems from over-conservative safety alignment, causing models to treat many reasonable prompts as potentially risky. To systematically understand this issue, we probe and leverage the models’ safety decision boundaries to analyze and mitigate overrefusal. Our findings reveal that overrefusal is closely tied to misalignment at these boundary regions, where models struggle to distinguish subtle differences between benign and harmful content. Building on these insights, we present RASS, an automated framework for prompt generation and selection that strategically targets overrefusal prompts near the safety boundary. By harnessing steering vectors in the representation space, RASS efficiently identifies and curates boundary-aligned prompts, enabling more effective and targeted mitigation of overrefusal. This approach not only provides a more precise and interpretable view of model safety decisions but also seamlessly extends to multilingual scenarios. We have explored the safety decision boundaries of various LLMs and construct the MORBench evaluation set to facilitate robust assessment of model safety and helpfulness across multiple languages. Code and datasets are available at https://github.com/Master-PLC/RASS.
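RASS works in the representation space: a steering vector separates refused from answered prompts, and candidate prompts are ranked by how close they sit to the safety decision boundary along that direction. The sketch below makes several assumptions (hidden activations already cached at one chosen layer, a midpoint-based boundary score) and omits the framework's prompt generation and curation loop.

```python
import torch

@torch.no_grad()
def boundary_scores(acts_refused, acts_answered, acts_candidates):
    """Score candidates by closeness to the refusal/compliance boundary.
    All inputs: (N, hidden) activation tensors from one layer."""
    v = acts_refused.mean(0) - acts_answered.mean(0)      # steering direction
    v = v / v.norm()
    mid = 0.5 * (acts_refused.mean(0) + acts_answered.mean(0)) @ v
    proj = acts_candidates @ v                            # projection per prompt
    return -(proj - mid).abs()    # higher score = closer to the boundary
```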
Paper and project links
Summary
Large language models (LLMs) often refuse to answer legitimate queries, a phenomenon known as overrefusal, which typically stems from over-conservative safety alignment that treats many reasonable prompts as potentially risky. By probing models' safety decision boundaries, the authors find that overrefusal is closely tied to misalignment at these boundary regions, where models struggle to distinguish subtle differences between benign and harmful content. Building on these insights, RASS is an automated framework for prompt generation and selection that strategically targets overrefusal prompts near the safety boundary: harnessing steering vectors in the representation space, it efficiently identifies and curates boundary-aligned prompts for more effective, targeted mitigation. The approach provides a more precise and interpretable view of model safety decisions and extends seamlessly to multilingual scenarios; the safety decision boundaries of various LLMs are explored, and the MORBench evaluation set is constructed to support robust assessment of model safety and helpfulness across multiple languages.
Key Takeaways
- LLMs exhibit strong capabilities but refuse legitimate queries, a phenomenon known as overrefusal.
- Overrefusal typically stems from over-conservative safety alignment.
- Overrefusal is tied to misalignment at the model's safety decision boundaries.
- The RASS framework generates and selects prompts that target overrefusal near the safety boundary.
- RASS uses steering vectors in the representation space to identify boundary-aligned prompts efficiently.
- The framework extends seamlessly to multilingual scenarios.
Click to view paper screenshots




CoPL: Collaborative Preference Learning for Personalizing LLMs
Authors:Youngbin Choi, Seunghyuk Cho, Minjong Lee, MoonJeong Park, Yesong Ko, Jungseul Ok, Dongwoo Kim
Personalizing large language models (LLMs) is important for aligning outputs with diverse user preferences, yet existing methods struggle with flexibility and generalization. We propose CoPL (Collaborative Preference Learning), a graph-based collaborative filtering framework that models user-response relationships to enhance preference estimation, particularly in sparse annotation settings. By integrating a mixture of LoRA experts, CoPL efficiently fine-tunes LLMs while dynamically balancing shared and user-specific preferences. Additionally, an optimization-free adaptation strategy enables generalization to unseen users without fine-tuning. Experiments on UltraFeedback-P demonstrate that CoPL outperforms existing personalized reward models, effectively capturing both common and controversial preferences, making it a scalable solution for personalized LLM alignment. The code is available at https://github.com/ml-postech/CoPL.
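CoPL personalizes a shared backbone by mixing several LoRA experts with user-specific gates, which is how it balances shared and user-specific preferences. Below is a minimal sketch of a mixture-of-LoRA linear layer; the free per-user gate embedding is a stand-in, since CoPL derives user representations from graph-based collaborative filtering and supports unseen users without fine-tuning.

```python
import torch
import torch.nn as nn

class LoRAMixture(nn.Module):
    """One linear layer with K LoRA experts mixed by per-user gates."""
    def __init__(self, d_in, d_out, n_users, k=4, r=8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.A = nn.Parameter(torch.randn(k, r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(k, d_out, r))   # zero-init: no-op start
        self.gate = nn.Embedding(n_users, k)   # stand-in for the CF user vector

    def forward(self, x, user_id):             # x: (B, d_in); user_id: (B,)
        w = self.gate(user_id).softmax(-1)     # (B, K) expert weights
        delta = torch.einsum("krd,bd->bkr", self.A, x)        # (B, K, r)
        delta = torch.einsum("kor,bkr->bko", self.B, delta)   # (B, K, d_out)
        return self.base(x) + (w.unsqueeze(-1) * delta).sum(dim=1)
```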
Paper and project links
PDF 19 pages, 13 figures, 11 tables
Summary
Personalizing large language models (LLMs) matters for aligning outputs with diverse user preferences, but existing methods struggle with flexibility and generalization. CoPL (Collaborative Preference Learning) is a graph-based collaborative filtering framework that models user-response relationships to improve preference estimation, particularly under sparse annotation. By integrating a mixture of LoRA experts, it fine-tunes LLMs efficiently while dynamically balancing shared and user-specific preferences, and an optimization-free adaptation strategy generalizes to unseen users without fine-tuning. Experiments on UltraFeedback-P show that CoPL outperforms existing personalized reward models, effectively capturing both common and controversial preferences, making it a scalable solution for personalized LLM alignment.
Key Takeaways
- CoPL is a personalization method for large language models, aligning outputs with diverse user preferences.
- Its graph-based collaborative filtering framework improves preference estimation in sparse annotation settings.
- A mixture of LoRA experts dynamically balances shared and user-specific preferences.
- An optimization-free adaptation strategy handles unseen users without fine-tuning.
- Experiments on UltraFeedback-P confirm that CoPL captures both common and controversial preferences.
- CoPL offers a scalable solution for personalized LLM alignment.
Click to view paper screenshots



LLM-ABBA: Understanding time series via symbolic approximation
Authors:Erin Carson, Xinye Chen, Cheng Kang
The success of large language models (LLMs) for time series has been demonstrated in previous work. Utilizing a symbolic time series representation, one can efficiently bridge the gap between LLMs and time series. However, the remaining challenge is to exploit the semantic information hidden in time series by using symbols or existing tokens of LLMs, while aligning the embedding space of LLMs according to the hidden information of time series. The symbolic time series approximation (STSA) method called adaptive Brownian bridge-based symbolic aggregation (ABBA) shows outstanding efficacy in preserving salient time series features by modeling time series patterns in terms of amplitude and period while using existing tokens of LLMs. In this paper, we introduce a method, called LLM-ABBA, that integrates ABBA into large language models for various downstream time series tasks. By symbolizing time series, LLM-ABBA compares favorably to the recent state-of-the-art (SOTA) in UCR and three medical time series classification tasks. Meanwhile, a fixed-polygonal chain trick in ABBA is introduced to avoid obvious drifting during prediction tasks by significantly mitigating the effects of cumulative error arising from misused symbols during the transition from symbols to numerical values. In time series regression tasks, LLM-ABBA achieves the new SOTA on Time Series Extrinsic Regression (TSER) benchmarks. LLM-ABBA also shows competitive prediction capability compared to recent SOTA time series prediction results. We believe this framework can also seamlessly extend to other time series tasks.
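ABBA's first stage compresses a series into an adaptive polygonal chain, i.e., a sequence of (length, increment) pieces that are later clustered into a symbol alphabet aligned with the LLM's tokens. Below is a greedy sketch of that compression stage under a simple max-deviation tolerance; ABBA's actual criterion and the subsequent clustering step differ in detail.

```python
import numpy as np

def polygonal_compress(ts, tol=0.1):
    """Greedy piecewise-linear compression: extend each segment while the
    straight line stays within `tol`, emitting (length, increment) pairs.
    ts: 1-D numpy array."""
    pieces, start = [], 0
    for end in range(1, len(ts)):
        seg = ts[start:end + 1]
        line = np.linspace(seg[0], seg[-1], len(seg))
        if np.max(np.abs(seg - line)) > tol:        # segment broke the budget
            pieces.append((end - 1 - start, ts[end - 1] - ts[start]))
            start = end - 1
    pieces.append((len(ts) - 1 - start, ts[-1] - ts[start]))
    return pieces   # cluster these pairs into symbols, e.g. with k-means
```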
Paper and project links
Summary
Prior work has demonstrated the success of large language models (LLMs) on time series, where a symbolic time series representation efficiently bridges the gap between LLMs and time series. LLM-ABBA integrates the adaptive Brownian bridge-based symbolic aggregation (ABBA) method into large language models for various downstream time series tasks; ABBA preserves salient time series features by modeling patterns in terms of amplitude and period while reusing the LLM's existing tokens. By symbolizing time series, LLM-ABBA compares favorably with the recent state of the art on UCR and three medical time series classification tasks. A fixed-polygonal chain trick in ABBA avoids obvious drifting in prediction tasks by significantly mitigating the cumulative error that arises from misused symbols in the transition from symbols back to numerical values. In time series regression, LLM-ABBA sets a new state of the art on Time Series Extrinsic Regression (TSER) benchmarks and shows competitive prediction capability; the framework is expected to extend seamlessly to other time series tasks.
Key Takeaways
- Large language models (LLMs) have been successfully applied to time series tasks.
- Symbolic time series representations bridge the gap between LLMs and time series.
- LLM-ABBA combines adaptive Brownian bridge-based symbolic aggregation (ABBA) with LLMs for a range of time series tasks.
- LLM-ABBA performs strongly on UCR and medical time series classification tasks.
- A fixed-polygonal chain trick avoids obvious drifting in prediction tasks.
- LLM-ABBA reaches a new state of the art on TSER benchmarks.
Click to view paper screenshots




