
LLM


⚠️ All of the summaries below are produced by a large language model; they may contain errors, are for reference only, and should be used with caution
🔴 Note: never use them for serious academic work — they are only for a first-pass screening before reading a paper!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

2025-10-23 Update

Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs

Authors:Haochen Wang, Yuhao Wang, Tao Zhang, Yikang Zhou, Yanwei Li, Jiacong Wang, Ye Tian, Jiahao Meng, Zilong Huang, Guangcan Mai, Anran Wang, Yunhai Tong, Zhuochen Wang, Xiangtai Li, Zhaoxiang Zhang

While Multimodal Large Language Models (MLLMs) excel at holistic understanding, they struggle in capturing the dense world with complex scenes, requiring fine-grained analysis of intricate details and object inter-relationships. Region-level MLLMs have been a promising step. However, previous attempts are generally optimized to understand given regions in isolation, neglecting crucial global contexts. To address this, we introduce Grasp Any Region (GAR) for comprehensive region-level visual understanding. Empowered by an effective RoI-aligned feature replay technique, GAR supports (1) precise perception by leveraging necessary global contexts, and (2) modeling interactions between multiple prompts. Together, it then naturally achieves (3) advanced compositional reasoning to answer specific free-form questions about any region, shifting the paradigm from passive description to active dialogue. Moreover, we construct GAR-Bench, which not only provides a more accurate evaluation of single-region comprehension, but also, more importantly, measures interactions and complex reasoning across multiple regions. Extensive experiments have demonstrated that GAR-1B not only maintains the state-of-the-art captioning capabilities, e.g., outperforming DAM-3B +4.5 on DLC-Bench, but also excels at modeling relationships between multiple prompts with advanced comprehension capabilities, even surpassing InternVL3-78B on GAR-Bench-VQA. More importantly, our zero-shot GAR-8B even outperforms in-domain VideoRefer-7B on VideoRefer-BenchQ, indicating its strong capabilities can be easily transferred to videos.
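The RoI-aligned feature replay idea — giving the model both the full-image context and a re-cropped view of each prompted region — can be sketched in a few lines of pure Python. Everything below (function names, the list-of-lists "feature map") is a hypothetical simplification for illustration, not GAR's actual implementation:

```python
def roi_crop(feature_map, box):
    """Crop a region of interest from a 2D feature grid.

    feature_map: H x W grid (list of lists) of features.
    box: (x0, y0, x1, y1) in normalized [0, 1] coordinates.
    Returns the cropped sub-grid (a coarse stand-in for RoI-Align).
    """
    h, w = len(feature_map), len(feature_map[0])
    x0, y0, x1, y1 = box
    r0, r1 = int(y0 * h), max(int(y0 * h) + 1, int(y1 * h))
    c0, c1 = int(x0 * w), max(int(x0 * w) + 1, int(x1 * w))
    return [row[c0:c1] for row in feature_map[r0:r1]]

def build_prompt_tokens(feature_map, boxes):
    """Concatenate global tokens with replayed per-region tokens,
    so region queries keep access to the whole-image context."""
    tokens = [f for row in feature_map for f in row]   # global context
    for box in boxes:                                  # region replay
        region = roi_crop(feature_map, box)
        tokens += [f for row in region for f in row]
    return tokens

# 4x4 feature grid with scalar "features" 0..15
fm = [[r * 4 + c for c in range(4)] for r in range(4)]
toks = build_prompt_tokens(fm, [(0.0, 0.0, 0.5, 0.5)])  # top-left quadrant
```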


Paper and Project Links

PDF

Summary

MLLMs struggle with complex scenes, especially at capturing fine-grained details and inter-object relationships. To address this, the Grasp Any Region (GAR) model is proposed; by combining global context with multi-prompt interaction, it achieves advanced compositional reasoning and can answer free-form questions about any region. The authors also build GAR-Bench to more accurately evaluate single-region understanding and, more importantly, to measure interaction and complex reasoning across multiple regions. Experiments show GAR-1B not only retains state-of-the-art captioning ability but also excels at modeling relationships between multiple prompts. Moreover, zero-shot GAR-8B even outperforms VideoRefer-7B on VideoRefer-BenchQ, demonstrating strong cross-domain transfer.

Key Takeaways

  1. While MLLMs excel at holistic understanding, they struggle with dense, complex scenes that require fine-grained analysis and an understanding of inter-object relationships.
  2. The Grasp Any Region (GAR) model targets comprehensive region-level visual understanding, combining global context with multi-prompt interaction to overcome these limitations.
  3. GAR supports precise perception, achieving advanced compositional reasoning by leveraging necessary global context and modeling interactions between multiple prompts.
  4. GAR-Bench not only evaluates single-region understanding more accurately but also measures interaction and complex reasoning across multiple regions.
  5. GAR-1B retains state-of-the-art captioning ability while excelling at modeling relationships between multiple prompts.
  6. Experiments show GAR-1B surpasses larger models such as DAM-3B and InternVL3-78B on some tasks.

Cool Papers

Click here to view the paper screenshots

LightMem: Lightweight and Efficient Memory-Augmented Generation

Authors:Jizhan Fang, Xinle Deng, Haoming Xu, Ziyan Jiang, Yuqi Tang, Ziwen Xu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, Huajun Chen, Ningyu Zhang

Despite their remarkable capabilities, Large Language Models (LLMs) struggle to effectively leverage historical interaction information in dynamic and complex environments. Memory systems enable LLMs to move beyond stateless interactions by introducing persistent information storage, retrieval, and utilization mechanisms. However, existing memory systems often introduce substantial time and computational overhead. To this end, we introduce a new memory system called LightMem, which strikes a balance between the performance and efficiency of memory systems. Inspired by the Atkinson-Shiffrin model of human memory, LightMem organizes memory into three complementary stages. First, cognition-inspired sensory memory rapidly filters irrelevant information through lightweight compression and groups information according to their topics. Next, topic-aware short-term memory consolidates these topic-based groups, organizing and summarizing content for more structured access. Finally, long-term memory with sleep-time update employs an offline procedure that decouples consolidation from online inference. Experiments on LongMemEval with GPT and Qwen backbones show that LightMem outperforms strong baselines in accuracy (up to 10.9% gains) while reducing token usage by up to 117x, API calls by up to 159x, and runtime by over 12x. The code is available at https://github.com/zjunlp/LightMem.
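The three memory stages can be caricatured as plain functions. This is a minimal sketch under loose assumptions — a word-count filter stands in for learned lightweight compression, and string joins stand in for LLM summarization; all names are invented:

```python
from collections import defaultdict

def sensory_filter(messages, min_len=4):
    """Stage 1: cheap filtering plus topic grouping (a stand-in for
    LightMem's lightweight compression of irrelevant information)."""
    groups = defaultdict(list)
    for topic, text in messages:
        if len(text.split()) >= min_len:       # drop low-content turns
            groups[topic].append(text)
    return groups

def short_term_consolidate(groups, max_items=2):
    """Stage 2: per-topic consolidation (here: keep the newest turns)."""
    return {t: " | ".join(texts[-max_items:]) for t, texts in groups.items()}

def sleep_time_update(long_term, summaries):
    """Stage 3: offline merge into long-term memory, decoupled from
    online inference."""
    long_term.update(summaries)
    return long_term

msgs = [("travel", "I want to visit Kyoto next spring"),
        ("travel", "ok"),                      # filtered out as noise
        ("food", "I am allergic to peanuts and shellfish")]
ltm = sleep_time_update({}, short_term_consolidate(sensory_filter(msgs)))
```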


Paper and Project Links

PDF Work in progress

Summary

LLMs struggle to make effective use of historical interaction information in dynamic, complex environments. Memory systems let them move beyond stateless interaction through persistent storage, retrieval, and utilization mechanisms, but existing systems carry heavy time and compute overhead. LightMem is a new memory system that balances performance and efficiency. Inspired by the Atkinson-Shiffrin model of human memory, it organizes memory into three stages: cognition-inspired sensory memory rapidly filters irrelevant information via lightweight compression and groups it by topic; topic-aware short-term memory consolidates these topic-based groups, organizing and summarizing content for more structured access; and long-term memory with sleep-time updates runs an offline procedure that decouples consolidation from online inference. On LongMemEval, LightMem outperforms strong baselines in accuracy (up to 10.9% gains) while cutting token usage by up to 117x, API calls by up to 159x, and runtime by over 12x.

Key Takeaways

  1. LLMs struggle to exploit historical interaction information in complex environments.
  2. Memory systems let LLMs handle persistent information.
  3. Existing memory systems incur substantial time and computational overhead.
  4. LightMem is a new memory system that balances performance and efficiency.
  5. Inspired by the human memory model, LightMem has three stages: sensory, short-term, and long-term memory.
  6. LightMem improves efficiency by reducing token usage, API calls, and runtime.
  7. LightMem outperforms other strong baselines in accuracy.


EffiReasonTrans: RL-Optimized Reasoning for Code Translation

Authors:Yanlin Wang, Rongyi Ou, Yanli Wang, Mingwei Liu, Jiachi Chen, Ensheng Shi, Xilin Liu, Yuchi Ma, Zibin Zheng

Code translation is a crucial task in software development and maintenance. While recent advancements in large language models (LLMs) have improved automated code translation accuracy, these gains often come at the cost of increased inference latency, hindering real-world development workflows that involve human-in-the-loop inspection. To address this trade-off, we propose EffiReasonTrans, a training framework designed to improve translation accuracy while balancing inference latency. We first construct a high-quality reasoning-augmented dataset by prompting a stronger language model, DeepSeek-R1, to generate intermediate reasoning and target translations. Each (source code, reasoning, target code) triplet undergoes automated syntax and functionality checks to ensure reliability. Based on this dataset, we employ a two-stage training strategy: supervised fine-tuning on reasoning-augmented samples, followed by reinforcement learning to further enhance accuracy and balance inference latency. We evaluate EffiReasonTrans on six translation pairs. Experimental results show that it consistently improves translation accuracy (up to +49.2% CA and +27.8% CodeBLEU compared to the base model) while reducing the number of generated tokens (up to -19.3%) and lowering inference latency in most cases (up to -29.0%). Ablation studies further confirm the complementary benefits of the two-stage training framework. Additionally, EffiReasonTrans demonstrates improved translation accuracy when integrated into agent-based frameworks. Our code and data are available at https://github.com/DeepSoftwareAnalytics/EffiReasonTrans.
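The reliability gate on (source code, reasoning, target code) triplets can be illustrated with a toy filter. The sketch below assumes Python translation targets and uses `ast.parse` as the syntax check; the paper's pipeline also runs functionality checks, which are omitted here:

```python
import ast

def syntax_ok(code: str) -> bool:
    """Automated syntax gate for a translated Python snippet."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def filter_triplets(triplets):
    """Keep only (source, reasoning, target) triplets whose target
    parses — a stand-in for the paper's syntax + functionality checks."""
    return [t for t in triplets if syntax_ok(t[2])]

data = [
    ("int f(int x){return x+1;}", "add one", "def f(x):\n    return x + 1"),
    ("int g(int x){return x*2;}", "double", "def g(x) return x * 2"),  # broken
]
clean = filter_triplets(data)
```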


Paper and Project Links

PDF

Summary

This paper highlights the importance of code translation in software development and maintenance and the recent progress of LLMs in automating it. To tackle the trade-off between translation accuracy and inference latency, it proposes EffiReasonTrans, a training framework that improves accuracy while balancing latency by constructing a high-quality reasoning-augmented dataset and applying a two-stage training strategy. Experiments show that EffiReasonTrans markedly improves translation accuracy while reducing the number of generated tokens and lowering inference latency.

Key Takeaways

  1. Code translation plays a key role in software development and maintenance.
  2. LLM advances have improved automated code translation accuracy, but increased inference latency hinders real-world use.
  3. EffiReasonTrans is a training framework aimed at improving translation accuracy while balancing inference latency.
  4. EffiReasonTrans achieves its gains by building a high-quality reasoning-augmented dataset and using a two-stage training strategy.
  5. Experiments show notable accuracy improvements along with fewer generated tokens and lower inference latency.
  6. Ablation studies further confirm the benefits of the two-stage training strategy.


Streamlining Acceptance Test Generation for Mobile Applications Through Large Language Models: An Industrial Case Study

Authors:Pedro Luís Fonseca, Bruno Lima, João Pascoal Faria

Mobile acceptance testing remains a bottleneck in modern software development, particularly for cross-platform mobile development using frameworks like Flutter. While developers increasingly rely on automated testing tools, creating and maintaining acceptance test artifacts still demands significant manual effort. To help tackle this issue, we introduce AToMIC, an automated framework leveraging specialized Large Language Models to generate Gherkin scenarios, Page Objects, and executable UI test scripts directly from requirements (JIRA tickets) and recent code changes. Applied to BMW’s MyBMW app, covering 13 real-world issues in a 170+ screen codebase, AToMIC produced executable test artifacts in under five minutes per feature on standard hardware. The generated artifacts were of high quality: 93.3% of Gherkin scenarios were syntactically correct upon generation, 78.8% of PageObjects ran without manual edits, and 100% of generated UI tests executed successfully. In a survey, all practitioners reported time savings (often a full developer-day per feature) and strong confidence in adopting the approach. These results confirm AToMIC as a scalable, practical solution for streamlining acceptance test creation and maintenance in industrial mobile projects.
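As a rough illustration of the kind of artifact AToMIC emits, the snippet below renders a ticket-like dict into a syntactically valid Gherkin scenario. The field names and template are invented for illustration; the real framework generates these artifacts with specialized LLMs from JIRA tickets and code changes:

```python
def gherkin_scenario(ticket):
    """Render a JIRA-style ticket into a Gherkin scenario
    (a toy stand-in for AToMIC's LLM-generated artifacts)."""
    steps = [f"    Given {ticket['precondition']}",
             f"    When {ticket['action']}",
             f"    Then {ticket['expected']}"]
    return f"  Scenario: {ticket['title']}\n" + "\n".join(steps)

# Hypothetical ticket; not from the MyBMW case study.
ticket = {"title": "Lock vehicle from the app",
          "precondition": "the user is logged in and a vehicle is paired",
          "action": "the user taps the lock button",
          "expected": "the vehicle reports a locked state"}
feature = "Feature: Remote vehicle control\n" + gherkin_scenario(ticket)
```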


Paper and Project Links

PDF

Summary

AToMIC is an automated framework that uses specialized large language models to generate Gherkin scenarios, Page Objects, and executable UI test scripts directly from requirements (JIRA tickets) and recent code changes, addressing the bottleneck that mobile acceptance testing poses in modern software development. Applied to BMW's MyBMW app, covering 13 real-world issues in a 170+ screen codebase, AToMIC produced executable test artifacts in under five minutes per feature on standard hardware. The generated artifacts were of high quality: 93.3% of Gherkin scenarios were syntactically correct upon generation, 78.8% of Page Objects ran without manual edits, and 100% of the generated UI tests executed successfully. A survey confirmed time savings (often a full developer-day per feature) and strong practitioner confidence in adopting the approach.

Key Takeaways

  1. AToMIC is an automated framework targeting the mobile acceptance-testing bottleneck in modern software development.
  2. AToMIC uses large language models to generate test scripts from requirements and recent code changes.
  3. Applied to BMW's MyBMW app, AToMIC quickly produced high-quality executable test artifacts.
  4. The generated artifacts include Gherkin scenarios, Page Objects, and UI test scripts.
  5. Artifact quality was high, with 93.3% of Gherkin scenarios syntactically correct upon generation.
  6. Practitioners saved substantial time with AToMIC (often a full developer-day per feature).


Towards Faithful and Controllable Personalization via Critique-Post-Edit Reinforcement Learning

Authors:Chenghao Zhu, Meiling Tao, Tiannan Wang, Dongyi Ding, Yuchen Eleanor Jiang, Wangchunshu Zhou

Faithfully personalizing large language models (LLMs) to align with individual user preferences is a critical but challenging task. While supervised fine-tuning (SFT) quickly reaches a performance plateau, standard reinforcement learning from human feedback (RLHF) also struggles with the nuances of personalization. Scalar-based reward models are prone to reward hacking which leads to verbose and superficially personalized responses. To address these limitations, we propose Critique-Post-Edit, a robust reinforcement learning framework that enables more faithful and controllable personalization. Our framework integrates two key components: (1) a Personalized Generative Reward Model (GRM) that provides multi-dimensional scores and textual critiques to resist reward hacking, and (2) a Critique-Post-Edit mechanism where the policy model revises its own outputs based on these critiques for more targeted and efficient learning. Under a rigorous length-controlled evaluation, our method substantially outperforms standard PPO on personalization benchmarks. Personalized Qwen2.5-7B achieves an average 11% win-rate improvement, and personalized Qwen2.5-14B model surpasses the performance of GPT-4.1. These results demonstrate a practical path to faithful, efficient, and controllable personalization.
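The critique-then-edit loop can be sketched with stub components. Both the "GRM" and the "post-edit" step below are hand-written rules standing in for LLM calls, and the score dimensions are invented for illustration:

```python
def generative_reward(response, profile):
    """Stub GRM: multi-dimensional scores plus a textual critique
    (in the paper, the GRM is itself a personalized LLM)."""
    scores = {
        "personalized": float(profile["style"] in response),
        "concise": float(len(response.split()) <= 8),
    }
    critique = []
    if not scores["personalized"]:
        critique.append(f"mention the user's preferred style: {profile['style']}")
    if not scores["concise"]:
        critique.append("shorten the answer")
    return scores, "; ".join(critique)

def post_edit(response, critique, profile):
    """Stub policy revision: the model edits its own output
    according to the critique."""
    if "preferred style" in critique:
        response = f"[{profile['style']}] " + response
    if "shorten" in critique:
        response = " ".join(response.split()[:8])
    return response

profile = {"style": "bullet-point"}
draft = "Here is a very long and generic answer about your question today"
scores, critique = generative_reward(draft, profile)
revised = post_edit(draft, critique, profile)
```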


Paper and Project Links

PDF work in progress

Summary

Faithfully personalizing large language models (LLMs) to individual user preferences is critical but challenging. Supervised fine-tuning (SFT) quickly plateaus, and standard RLHF struggles with the nuances of personalization; scalar reward models are prone to reward hacking, producing verbose, superficially personalized responses. To address these limitations, the paper proposes Critique-Post-Edit, a robust reinforcement learning framework for more faithful and controllable personalization. It integrates two key components: (1) a Personalized Generative Reward Model (GRM) providing multi-dimensional scores and textual critiques to resist reward hacking, and (2) a Critique-Post-Edit mechanism in which the policy model revises its own outputs based on those critiques for more targeted, efficient learning. Under a strict length-controlled evaluation, the method substantially outperforms standard PPO on personalization benchmarks: personalized Qwen2.5-7B gains an average 11% win rate, and personalized Qwen2.5-14B surpasses GPT-4.1. These results chart a practical path to faithful, efficient, and controllable personalization.

Key Takeaways

  1. Personalizing LLMs to align with user preferences is critical and challenging.
  2. Supervised fine-tuning (SFT) and standard reinforcement learning have limitations for personalization.
  3. Scalar reward models are prone to reward hacking, yielding verbose, superficially personalized responses.
  4. The Critique-Post-Edit framework addresses these problems by combining a personalized generative reward model with a critique-post-edit mechanism.
  5. The personalized generative reward model provides multi-dimensional scores and textual critiques to resist reward hacking.
  6. The critique-post-edit mechanism lets the policy model revise its outputs based on critiques, for more efficient, targeted learning.


MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training

Authors:Wenxuan Li, Chengruidong Zhang, Huiqiang Jiang, Yucheng Li, Yuqing Yang, Lili Qiu

The adoption of long context windows has become a standard feature in Large Language Models (LLMs), as extended contexts significantly enhance their capacity for complex reasoning and broaden their applicability across diverse scenarios. Dynamic sparse attention is a promising approach for reducing the computational cost of long-context. However, efficiently training LLMs with dynamic sparse attention on ultra-long contexts-especially in distributed settings-remains a significant challenge, due in large part to worker- and step-level imbalance. This paper introduces MTraining, a novel distributed methodology leveraging dynamic sparse attention to enable efficient training for LLMs with ultra-long contexts. Specifically, MTraining integrates three key components: a dynamic sparse training pattern, balanced sparse ring attention, and hierarchical sparse ring attention. These components are designed to synergistically address the computational imbalance and communication overheads inherent in dynamic sparse attention mechanisms during the training of models with extensive context lengths. We demonstrate the efficacy of MTraining by training Qwen2.5-3B, successfully expanding its context window from 32K to 512K tokens on a cluster of 32 A100 GPUs. Our evaluations on a comprehensive suite of downstream tasks, including RULER, PG-19, InfiniteBench, and Needle In A Haystack, reveal that MTraining achieves up to a 6x higher training throughput while preserving model accuracy. Our code is available at https://github.com/microsoft/MInference/tree/main/MTraining.
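Worker-level imbalance arises because dynamic sparse attention blocks have very different costs. A toy way to see the balancing problem that MTraining's balanced sparse ring attention addresses — not the paper's actual algorithm — is greedy assignment of heterogeneous block costs to workers:

```python
import heapq

def balance_blocks(block_costs, num_workers):
    """Greedily assign attention blocks (with heterogeneous sparse
    costs) to workers so per-worker load is even: largest block first,
    always onto the currently least-loaded worker."""
    heap = [(0, w) for w in range(num_workers)]     # (load, worker)
    heapq.heapify(heap)
    assignment = {w: [] for w in range(num_workers)}
    for blk in sorted(range(len(block_costs)),
                      key=lambda b: -block_costs[b]):
        load, w = heapq.heappop(heap)
        assignment[w].append(blk)
        heapq.heappush(heap, (load + block_costs[blk], w))
    return assignment

costs = [9, 1, 1, 1, 4, 4]      # dense vs. nearly-empty sparse blocks
plan = balance_blocks(costs, 2)
loads = [sum(costs[b] for b in blks) for blks in plan.values()]
```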


Paper and Project Links

PDF

Summary

Long context windows have become a standard feature of LLMs, as extended contexts significantly enhance complex reasoning and broaden applicability. Dynamic sparse attention is a promising way to cut the computational cost of long contexts, but efficiently training LLMs with it on ultra-long contexts in distributed settings remains challenging, largely due to worker- and step-level imbalance. MTraining is a new distributed methodology that uses dynamic sparse attention for efficient ultra-long-context training. It integrates three key components — a dynamic sparse training pattern, balanced sparse ring attention, and hierarchical sparse ring attention — which jointly address the computational imbalance and communication overhead inherent in dynamic sparse attention at long context lengths. Training Qwen2.5-3B on a cluster of 32 A100 GPUs, MTraining successfully expanded the context window from 32K to 512K tokens. Evaluations on a range of downstream tasks show up to 6x higher training throughput while preserving model accuracy.

Key Takeaways

  1. Long context windows in LLMs enhance complex reasoning and broaden the range of application scenarios.
  2. Dynamic sparse attention is an effective way to reduce the computational cost of long contexts.
  3. Training ultra-long-context LLMs in distributed settings faces worker- and step-level imbalance.
  4. MTraining is a new distributed methodology enabling efficient ultra-long-context training via dynamic sparse attention.
  5. MTraining's three key components jointly address computational imbalance and communication overhead.
  6. MTraining successfully extended the model's context window and raised training throughput.


Fine-Tuned Thoughts: Leveraging Chain-of-Thought Reasoning for Industrial Asset Health Monitoring

Authors:Shuxin Lin, Dhaval Patel, Christodoulos Constantinides

Small Language Models (SLMs) are becoming increasingly popular in specialized fields, such as industrial applications, due to their efficiency, lower computational requirements, and ability to be fine-tuned for domain-specific tasks, enabling accurate and cost-effective solutions. However, performing complex reasoning using SLMs in specialized fields such as Industry 4.0 remains challenging. In this paper, we propose a knowledge distillation framework for industrial asset health, which transfers reasoning capabilities via Chain-of-Thought (CoT) distillation from Large Language Models (LLMs) to smaller, more efficient models (SLMs). We discuss the advantages and the process of distilling LLMs using multi-choice question answering (MCQA) prompts to enhance reasoning and refine decision-making. We also perform in-context learning to verify the quality of the generated knowledge and benchmark the performance of fine-tuned SLMs with generated knowledge against widely used LLMs. The results show that the fine-tuned SLMs with CoT reasoning outperform the base models by a significant margin, narrowing the gap to their LLM counterparts. Our code is open-sourced at: https://github.com/IBM/FailureSensorIQ.
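The MCQA-prompted distillation setup can be sketched as prompt construction plus record formatting for student fine-tuning. The prompt wording and record schema below are illustrative, not the paper's exact templates:

```python
def build_mcqa_prompt(question, choices):
    """Format a multi-choice question for the teacher LLM, asking for
    chain-of-thought before the final letter."""
    letters = "ABCDE"
    lines = [f"Question: {question}"]
    lines += [f"{letters[i]}. {c}" for i, c in enumerate(choices)]
    lines.append("Think step by step, then answer with a single letter.")
    return "\n".join(lines)

def to_distill_record(question, choices, teacher_reasoning, answer):
    """One SFT sample: the student SLM is fine-tuned to reproduce the
    teacher's reasoning and final answer."""
    return {"prompt": build_mcqa_prompt(question, choices),
            "completion": f"{teacher_reasoning}\nAnswer: {answer}"}

# Hypothetical asset-health question; not from the paper's dataset.
rec = to_distill_record(
    "A pump's bearing temperature rises steadily. Most likely cause?",
    ["Low oil level", "New paint", "Ambient light"],
    "Bearings overheat when lubrication is insufficient.",
    "A")
```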


Paper and Project Links

PDF Accepted at EMNLP 2025

Summary

Given the growing popularity of Small Language Models (SLMs) in specialized fields such as industrial applications, this paper examines how knowledge distillation can boost SLM performance. It proposes a distillation framework for industrial asset health that transfers the reasoning capabilities of Large Language Models (LLMs) to small, efficient SLMs via Chain-of-Thought (CoT) distillation. The paper discusses the advantages and process of distilling LLMs with multi-choice question answering (MCQA) prompts to enhance reasoning and refine decision-making. Results show that SLMs fine-tuned with CoT reasoning outperform their base models by a significant margin, narrowing the gap to their LLM counterparts. The code is open-sourced on GitHub.

Key Takeaways

  • Small language models (SLMs) offer efficiency, low computational requirements, and task-specific fine-tuning for fields such as industrial applications.
  • Knowledge distillation transfers the reasoning capabilities of large language models (LLMs) to small language models (SLMs).
  • Chain-of-Thought (CoT) distillation is a form of knowledge distillation that improves SLM reasoning.
  • Distilling LLMs via multi-choice question answering (MCQA) prompts enhances reasoning and decision-making.
  • Experiments show that fine-tuned SLMs with CoT reasoning gain significantly in performance, markedly narrowing the gap to LLMs.
  • The code is open-sourced for other researchers to use and build on.


Online SFT for LLM Reasoning: Surprising Effectiveness of Self-Tuning without Rewards

Authors:Mengqi Li, Lei Zhao, Anthony Man-Cho So, Ruoyu Sun, Xiao Li

We present a simple, self-help online supervised finetuning (OSFT) paradigm for LLM reasoning. In this paradigm, the model generates its own responses and is immediately finetuned on this self-generated data. OSFT is a highly efficient training strategy for LLM reasoning, as it is reward-free and uses just one rollout by default. Experiment results show that OSFT achieves downstream performance on challenging mathematical reasoning tasks comparable to strong reinforcement learning with verifiable rewards (RLVR) methods such as GRPO. Our ablation study further demonstrates the efficiency and robustness of OSFT. The major mechanism of OSFT lies in facilitating the model’s own existing preference (latent knowledge) learned from pretraining, which leads to reasoning ability improvement. We believe that OSFT offers an efficient and promising alternative to more complex, reward-based training paradigms. Our code is available at https://github.com/ElementQi/OnlineSFT.
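The whole paradigm fits in a few lines: one rollout, no reward, immediate fine-tuning on the self-generated pair. The callables below are trivial stand-ins for real generation and training, purely to show the control flow:

```python
def osft_step(model_generate, model_finetune, prompt):
    """One OSFT iteration: a single rollout, no reward model, then an
    immediate supervised update on the self-generated (prompt, response)
    pair."""
    response = model_generate(prompt)       # one rollout by default
    model_finetune([(prompt, response)])    # finetune on own output
    return response

# Toy "model": a growing list standing in for training updates.
seen = []
gen = lambda p: p.upper()                   # deterministic stub policy
tune = lambda pairs: seen.extend(pairs)

out = osft_step(gen, tune, "prove that 2+2=4")
```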


Paper and Project Links

PDF

Summary

The paper presents a simple, self-help online supervised fine-tuning (OSFT) paradigm for LLM reasoning: the model generates its own responses and is immediately fine-tuned on this self-generated data, making training highly efficient. Experiments show that OSFT matches the downstream performance of reinforcement learning with verifiable rewards (RLVR) methods such as GRPO on challenging mathematical reasoning tasks. Its mechanism is to reinforce the model's own existing preferences (latent knowledge) learned during pretraining, improving reasoning ability. OSFT offers an efficient, promising alternative to more complex reward-based training paradigms.

Key Takeaways

  1. Proposes OSFT, a self-help online supervised fine-tuning paradigm for LLM reasoning.
  2. OSFT trains efficiently by having the model generate its own responses and fine-tuning on them immediately.
  3. Experiments show OSFT performs comparably to reward-based reinforcement learning methods on challenging mathematical reasoning tasks.
  4. OSFT's main mechanism is facilitating the model's own preferences (latent knowledge) learned during pretraining.
  5. OSFT is a viable alternative to complex reward-based training paradigms.
  6. OSFT is reward-free and uses just one rollout by default, improving efficiency.


ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder

Authors:Xiaoxing Hu, Kaicheng Yang, Ziyong Feng, Qi Ming, Zonghao Guo, Xiang An, Ziyong Feng, Junchi Yan, Xue Yang

The original CLIP text encoder is limited by a maximum input length of 77 tokens, which hampers its ability to effectively process long texts and perform fine-grained semantic understanding. In addition, the CLIP text encoder lacks support for multilingual inputs. All these limitations significantly restrict its applicability across a broader range of tasks. Recent studies have attempted to replace the CLIP text encoder with an LLM-based embedder to enhance its ability in processing long texts, multilingual understanding, and fine-grained semantic comprehension. However, because the representation spaces of LLMs and the vision-language space of CLIP are pretrained independently without alignment priors, direct alignment using contrastive learning can disrupt the intrinsic vision-language alignment in the CLIP image encoder, leading to an underutilization of the knowledge acquired during pre-training. To address this challenge, we propose ProCLIP, a curriculum learning-based progressive vision-language alignment framework to effectively align the CLIP image encoder with an LLM-based embedder. Specifically, ProCLIP first distills knowledge from CLIP’s text encoder into the LLM-based embedder to leverage CLIP’s rich pretrained knowledge while establishing initial alignment between the LLM embedder and CLIP image encoder. Subsequently, ProCLIP further aligns the CLIP image encoder with the LLM-based embedder through image-text contrastive tuning, employing self-distillation regularization to avoid overfitting. To achieve a more effective alignment, instance semantic alignment loss and embedding structure alignment loss are employed during representation inheritance and contrastive tuning. The Code is available at https://github.com/VisionXLab/ProCLIP
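Stage one (representation inheritance) amounts to pulling the LLM embedder's outputs toward frozen CLIP text embeddings. A bare-bones sketch with a plain MSE objective follows — the paper additionally uses instance semantic alignment and embedding structure alignment losses, which are not modeled here:

```python
def mse_distill_loss(student, teacher):
    """Stage-1 representation inheritance: measure how far the LLM
    embedder's outputs (student) are from CLIP's frozen text
    embeddings (teacher), averaged over the batch."""
    n = len(student)
    return sum((s - t) ** 2
               for sv, tv in zip(student, teacher)
               for s, t in zip(sv, tv)) / n

teacher = [[1.0, 0.0], [0.0, 1.0]]   # frozen CLIP text embeddings (toy)
student = [[0.5, 0.0], [0.0, 0.5]]   # LLM embedder outputs (toy)
loss = mse_distill_loss(student, teacher)
```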


Paper and Project Links

PDF 17 pages, 5 figures

Summary

The CLIP text encoder is limited by a 77-token maximum input length and lacks multilingual support, so it cannot effectively process long texts or perform fine-grained semantic understanding. Recent work replaces it with an LLM-based embedder, but because the LLM's representation space and CLIP's vision-language space are pretrained independently without alignment priors, direct contrastive alignment can disrupt the intrinsic vision-language alignment of the CLIP image encoder and underuse the knowledge acquired during pretraining. ProCLIP is a curriculum-learning-based progressive vision-language alignment framework that effectively aligns the CLIP image encoder with an LLM-based embedder. It first distills knowledge from CLIP's text encoder into the LLM-based embedder to exploit CLIP's rich pretrained knowledge and establish an initial alignment between the two; it then further aligns the CLIP image encoder with the LLM-based embedder through image-text contrastive tuning, with self-distillation regularization to avoid overfitting. Instance semantic alignment loss and embedding structure alignment loss are employed during representation inheritance and contrastive tuning to achieve a more effective alignment.

Key Takeaways

  1. The CLIP text encoder's maximum input length prevents effective long-text processing and fine-grained semantic understanding.
  2. An LLM-based embedder can improve CLIP's ability to handle long texts and multiple languages.
  3. Direct contrastive alignment can waste pretrained knowledge and disrupt vision-language alignment.
  4. ProCLIP achieves effective alignment between the CLIP image encoder and an LLM-based embedder through a curriculum-learning framework.
  5. ProCLIP distills knowledge from CLIP's text encoder into the LLM-based embedder.
  6. Self-distillation regularization avoids overfitting and improves model performance.
  7. ProCLIP employs instance semantic alignment loss and embedding structure alignment loss for more effective alignment.


KAT-Coder Technical Report

Authors:Zizheng Zhan, Ken Deng, Xiaojiang Zhang, Jinghui Wang, Huaixi Tang, Zhiyi Lai, Haoyang Huang, Wen Xiang, Kun Wu, Wenhao Zhuang, Minglei Zhang, Shaojie Wang, Shangpeng Yan, Kepeng Lei, Zongxian Feng, Huiming Wang, Zheng Lin, Mengtong Li, Mengfei Xie, Yinghan Cui, Xuxing Chen, Chao Wang, Weihao Li, Wenqiang Zhu, Jiarong Zhang, Jingxuan Xu, Songwei Yu, Yifan Yao, Xinping Lei, Han Li, Junqi Xiong, Zuchen Gao, Dailin Li, Haimo Li, Jiaheng Liu, Yuqun Zhang, Junyi Peng, Haotian Zhang, Bin Chen

Recent advances in large language models (LLMs) have enabled progress in agentic coding, where models autonomously reason, plan, and act within interactive software development workflows. However, bridging the gap between static text-based training and dynamic real-world agentic execution remains a core challenge. In this technical report, we present KAT-Coder, a large-scale agentic code model trained through a multi-stage curriculum encompassing Mid-Term Training, Supervised Fine-Tuning (SFT), Reinforcement Fine-Tuning (RFT), and Reinforcement-to-Deployment Adaptation. The Mid-Term stage enhances reasoning, planning, and reflection capabilities through a corpus of real software engineering data and synthetic agentic interactions. The SFT stage constructs a million-sample dataset balancing twenty programming languages, ten development contexts, and ten task archetypes. The RFT stage introduces a novel multi-ground-truth reward formulation for stable and sample-efficient policy optimization. Finally, the Reinforcement-to-Deployment phase adapts the model to production-grade IDE environments using Error-Masked SFT and Tree-Structured Trajectory Training. In summary, these stages enable KAT-Coder to achieve robust tool-use reliability, instruction alignment, and long-context reasoning, forming a deployable foundation for real-world intelligent coding agents. Our KAT series 32B model, KAT-Dev, has been open-sourced on https://huggingface.co/Kwaipilot/KAT-Dev.
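One reading of the multi-ground-truth reward is: score each rollout against the best-matching reference rather than a single canonical answer, which stabilizes training when several solutions are valid. The sketch below uses token-overlap F1 as a stand-in scorer; the actual RFT reward formulation is not specified at this level of detail in the abstract:

```python
def multi_gt_reward(candidate, references):
    """Multi-ground-truth reward: take the best score over all
    references instead of comparing to a single one."""
    def f1(a, b):
        ta, tb = set(a.split()), set(b.split())
        common = len(ta & tb)
        if common == 0:
            return 0.0
        p, r = common / len(ta), common / len(tb)
        return 2 * p * r / (p + r)
    return max(f1(candidate, ref) for ref in references)

# Hypothetical references for one coding-agent step.
refs = ["use a binary search over the sorted index",
        "apply bisection on the index array"]
reward = multi_gt_reward("binary search over the sorted index works", refs)
```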


Paper and Project Links

PDF

Summary

Advances in large language models (LLMs) have enabled agentic coding, where models autonomously reason, plan, and act within interactive software development workflows. This report presents KAT-Coder, a large-scale agentic code model trained through a multi-stage curriculum of Mid-Term Training, Supervised Fine-Tuning (SFT), Reinforcement Fine-Tuning (RFT), and Reinforcement-to-Deployment Adaptation. These stages give KAT-Coder robust tool-use reliability, instruction alignment, and long-context reasoning, forming a deployable foundation for real-world intelligent coding agents.

Key Takeaways

  1. LLM advances have driven agentic coding, letting models participate autonomously in software development workflows.
  2. KAT-Coder is a large-scale agentic code model trained through a multi-stage curriculum.
  3. The Mid-Term Training stage strengthens KAT-Coder's reasoning, planning, and reflection capabilities.
  4. The supervised fine-tuning stage builds a million-sample dataset balancing many programming languages, development contexts, and task archetypes.
  5. The reinforcement fine-tuning stage introduces a multi-ground-truth reward formulation for stable, sample-efficient policy optimization.
  6. The Reinforcement-to-Deployment phase adapts KAT-Coder to production-grade IDE environments.
  7. The KAT series 32B model, KAT-Dev, has been open-sourced.


Topoformer: brain-like topographic organization in Transformer language models through spatial querying and reweighting

Authors:Taha Binhuraib, Greta Tuckute, Nicholas Blauch

Spatial functional organization is a hallmark of biological brains: neurons are arranged topographically according to their response properties, at multiple scales. In contrast, representations within most machine learning models lack spatial biases, instead manifesting as disorganized vector spaces that are difficult to visualize and interpret. Here, we propose a novel form of self-attention that turns Transformers into “Topoformers” with topographic organization. We introduce spatial querying - where keys and queries are arranged on 2D grids, and local pools of queries are associated with a given key - and spatial reweighting, where we convert the standard fully connected layer of self-attention into a locally connected layer. We first demonstrate the feasibility of our approach by training a 1-layer Topoformer on a sentiment classification task. Training with spatial querying encourages topographic organization in the queries and keys, and spatial reweighting separately encourages topographic organization in the values and self-attention outputs. We then apply the Topoformer motifs at scale, training a BERT architecture with a masked language modeling objective. We find that the topographic variant performs on par with a non-topographic control model on NLP benchmarks, yet produces interpretable topographic organization as evaluated via eight linguistic test suites. Finally, analyzing an fMRI dataset of human brain responses to a large set of naturalistic sentences, we demonstrate alignment between low-dimensional topographic variability in the Topoformer model and human brain language network. Scaling up Topoformers further holds promise for greater interpretability in NLP research, and for more accurate models of the organization of linguistic information in the human brain.
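Spatial querying can be illustrated by laying queries and keys on the same 2D grid and letting each key see a local pool of queries. The averaging below is a toy stand-in for the paper's learned attention; all names and sizes are invented:

```python
def local_query_pool(grid_size, key_pos, radius=1):
    """Flat indices of the queries in the local pool of one key, with
    queries and keys laid out on the same 2D grid."""
    n = grid_size
    r, c = key_pos
    pool = []
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            rr, cc = r + dr, c + dc
            if 0 <= rr < n and 0 <= cc < n:
                pool.append(rr * n + cc)
    return pool

def pooled_query(queries, grid_size, key_pos):
    """Average the local pool: each key sees a spatially smoothed
    query, which is what encourages topographic organization."""
    idx = local_query_pool(grid_size, key_pos)
    dim = len(queries[0])
    return [sum(queries[i][d] for i in idx) / len(idx) for d in range(dim)]

# 3x3 grid of 2-dimensional query vectors
qs = [[float(i), 1.0] for i in range(9)]
center = pooled_query(qs, 3, (1, 1))   # pools all 9 neighbours
corner = pooled_query(qs, 3, (0, 0))   # pools only 4 neighbours
```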


Paper and Project Links

PDF ICLR 2024 Workshop on Representational Alignment (Re-Align) Camera Ready

Summary

The paper proposes the Topoformer, a model with topographic organization: spatial querying and spatial reweighting mechanisms give Transformer models brain-like topographic structure. Experiments on a sentiment classification task demonstrate the feasibility of the approach. Applied at scale to a BERT architecture with a masked language modeling objective, the topographic variant performs on par with a non-topographic control while producing interpretable topographic organization. Moreover, comparison with an fMRI dataset of human brain responses shows alignment between the Topoformer's low-dimensional topographic variability and the human brain language network.

Key Takeaways

  1. The Topoformer combines spatial querying and spatial reweighting to give Transformers topographic organization.
  2. The Topoformer's feasibility was verified on a sentiment classification task.
  3. Applied to a BERT architecture for language modeling, the Topoformer matches a conventional model's performance.
  4. The Topoformer exhibits interpretable topographic organization.
  5. The Topoformer's low-dimensional topographic variability aligns with human brain responses.
  6. The Topoformer promises greater interpretability in NLP research.


DART: A Structured Dataset of Regulatory Drug Documents in Italian for Clinical NLP

Authors:Mariano Barone, Antonio Laudante, Giuseppe Riccio, Antonio Romano, Marco Postiglione, Vincenzo Moscato

The extraction of pharmacological knowledge from regulatory documents has become a key focus in biomedical natural language processing, with applications ranging from adverse event monitoring to AI-assisted clinical decision support. However, research in this field has predominantly relied on English-language corpora such as DrugBank, leaving a significant gap in resources tailored to other healthcare systems. To address this limitation, we introduce DART (Drug Annotation from Regulatory Texts), the first structured corpus of Italian Summaries of Product Characteristics derived from the official repository of the Italian Medicines Agency (AIFA). The dataset was built through a reproducible pipeline encompassing web-scale document retrieval, semantic segmentation of regulatory sections, and clinical summarization using a few-shot-tuned large language model with low-temperature decoding. DART provides structured information on key pharmacological domains such as indications, adverse drug reactions, and drug-drug interactions. To validate its utility, we implemented an LLM-based drug interaction checker that leverages the dataset to infer clinically meaningful interactions. Experimental results show that instruction-tuned LLMs can accurately infer potential interactions and their clinical implications when grounded in the structured textual fields of DART. We publicly release our code on GitHub: https://github.com/PRAISELab-PicusLab/DART.
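The LLM-based interaction checker grounds its question in DART's structured fields. Below is a minimal sketch of such prompt grounding; the field names, example records, and wording are illustrative, not the dataset's actual schema or content:

```python
def interaction_prompt(record_a, record_b):
    """Build a drug-drug interaction question grounded in DART-style
    structured fields, to be answered by an instruction-tuned LLM."""
    def fmt(r):
        return (f"Drug: {r['name']}\n"
                f"Indications: {r['indications']}\n"
                f"Known interactions: {r['interactions']}")
    return (fmt(record_a) + "\n\n" + fmt(record_b) +
            "\n\nQuestion: can these drugs be co-administered? "
            "Answer yes/no and explain the clinical implication.")

# Illustrative records (well-known pharmacology, not DART entries).
warfarin = {"name": "warfarin",
            "indications": "prevention of thromboembolism",
            "interactions": "NSAIDs increase bleeding risk"}
ibuprofen = {"name": "ibuprofen",
             "indications": "pain, inflammation",
             "interactions": "anticoagulants"}
prompt = interaction_prompt(warfarin, ibuprofen)
```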


Paper and Project Links

PDF

Summary

This paper highlights the importance of extracting pharmacological knowledge from regulatory documents and notes that current research relies mainly on English corpora such as DrugBank, leaving other healthcare systems under-resourced. To address this, the authors introduce DART, a structured corpus of Italian Summaries of Product Characteristics drawn from the official repository of the Italian Medicines Agency (AIFA). The dataset is built through a reproducible pipeline of web-scale document retrieval, semantic segmentation of regulatory sections, and clinical summarization with a few-shot-tuned large language model. DART provides structured information on key pharmacological domains such as indications, adverse drug reactions, and drug-drug interactions. To validate its utility, the authors implement an LLM-based drug interaction checker that uses the dataset to infer clinically meaningful interactions. Experiments show that instruction-tuned LLMs, grounded in DART's structured text fields, can accurately infer potential interactions and their clinical implications.

Key Takeaways

  1. Extracting pharmacological knowledge from regulatory documents is a key focus in biomedical natural language processing.
  2. Current research relies mainly on English corpora, leaving a resource gap for other healthcare systems.
  3. DART is the first structured corpus of Italian Summaries of Product Characteristics.
  4. The dataset is built through a reproducible pipeline of document retrieval, semantic segmentation, and clinical summarization.
  5. DART provides structured information on key pharmacological domains such as indications, adverse reactions, and drug-drug interactions.
  6. An LLM-based drug interaction checker was developed on top of the DART dataset.
  7. Experiments show that LLMs can accurately infer potential drug-drug interactions and their clinical implications from DART's structured text fields.

Cool Papers

Click here to view paper screenshots

SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors

Authors:Tiancheng Hu, Joachim Baumann, Lorenzo Lupo, Nigel Collier, Dirk Hovy, Paul Röttger

Large language model (LLM) simulations of human behavior have the potential to revolutionize the social and behavioral sciences, if and only if they faithfully reflect real human behaviors. Current evaluations are fragmented, based on bespoke tasks and metrics, creating a patchwork of incomparable results. To address this, we introduce SimBench, the first large-scale, standardized benchmark for a robust, reproducible science of LLM simulation. By unifying 20 diverse datasets covering tasks from moral decision-making to economic choice across a large global participant pool, SimBench provides the necessary foundation to ask fundamental questions about when, how, and why LLM simulations succeed or fail. We show that, while even the best LLMs today have limited simulation ability (score: 40.80/100), performance scales log-linearly with model size. Simulation performance is not improved by increased inference-time compute. We demonstrate an alignment-simulation trade-off: instruction-tuning improves performance on low-entropy (consensus) questions but degrades it on high-entropy (diverse) ones. Models particularly struggle when simulating specific demographic groups. Finally, we demonstrate that simulation ability correlates most strongly with deep, knowledge-intensive reasoning (MMLU-Pro, r=0.939). By making progress measurable, we aim to accelerate the development of more faithful LLM simulators.
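The reported log-linear scaling (score grows linearly in the logarithm of model size) can be checked against one's own measurements with a plain least-squares fit. The helper below is illustrative and uses no data from the paper.

```python
import math

def fit_log_linear(sizes, scores):
    """Ordinary least squares fit of score = a + b * ln(size).

    `sizes` are model parameter counts and `scores` the corresponding
    benchmark scores; returns the intercept a and slope b.
    """
    xs = [math.log(s) for s in sizes]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(scores) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, scores)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b
```

A positive slope b on a set of models would reproduce the paper's qualitative finding that simulation performance scales log-linearly with model size.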


Paper and Project Links

PDF Project Website: http://simbench.tiancheng.hu/ Data: https://huggingface.co/datasets/pitehu/SimBench

Summary

Large language model (LLM) simulations of human behavior could revolutionize the social and behavioral sciences, provided they faithfully reflect real human behavior. Current evaluations are fragmented and lack a common standard, making results hard to compare. To address this, the paper introduces SimBench, the first large-scale standardized benchmark for a robust, reproducible science of LLM simulation. SimBench unifies 20 diverse datasets covering tasks from moral decision-making to economic choice across a large global participant pool. The study finds that even the best current LLMs have limited simulation ability, that performance scales log-linearly with model size, and that simulation ability correlates most strongly with deep, knowledge-intensive reasoning. By making progress measurable, the authors aim to accelerate the development of more faithful LLM simulators.

Key Takeaways

  1. LLM simulations of human behavior have the potential to transform the social and behavioral sciences.
  2. Current evaluations of LLM simulation are fragmented and lack a common standard.
  3. SimBench, the first large-scale standardized benchmark, provides the foundation for a robust, reproducible science of LLM simulation.
  4. SimBench unifies diverse datasets covering many tasks, evaluated against a large global participant pool.
  5. Current LLMs have limited simulation ability; performance grows with model size and correlates most strongly with deep, knowledge-intensive reasoning.
  6. Simulation performance is not improved by increased inference-time compute.

Cool Papers

Click here to view paper screenshots

Retrieval-in-the-Chain: Bootstrapping Large Language Models for Generative Retrieval

Authors:Yingchen Zhang, Ruqing Zhang, Jiafeng Guo, Wenjun Peng, Sen Li, Fuyu Lv

Generative retrieval (GR) is an emerging paradigm that leverages large language models (LLMs) to autoregressively generate document identifiers (docids) relevant to a given query. Prior works have focused on leveraging the generative capabilities of LLMs to improve GR, while overlooking that their reasoning capabilities could likewise help. This raises a key question: Can explicit reasoning benefit GR? To investigate, we first conduct a preliminary study where an LLM is prompted to generate free-form chain-of-thought (CoT) reasoning before performing constrained docid decoding. Although this method outperforms standard GR, the generated reasoning tends to be verbose and poorly aligned with the docid space. These limitations motivate the development of a reasoning mechanism better tailored to GR. Therefore, we propose Reason-for-Retrieval (R4R), a reasoning-augmented framework for GR that converts free-form CoT reasoning into a compact, structured format, and iteratively refines the reasoning during the retrieval process. R4R augments an existing GR method by leveraging a reasoning-capable LLM that has been instruction-tuned for GR. At inference time, R4R first uses the LLM to generate an initial structured reasoning; then the same LLM alternates between (i) constrained decoding with the chosen GR method to produce candidate docids and (ii) updating the reasoning based on retrieval results to improve the next round. R4R does not require additional models or training, and instead a single LLM serves as both the reasoning generator and the retriever. Extensive experiments on Natural Questions, MS MARCO, and a real-world item-search benchmark validate the effectiveness of R4R.
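The alternation described above can be sketched as a small driver loop. Here `llm`, `retrieve`, and `judge_done` are hypothetical stand-ins for the instruction-tuned LLM, the constrained GR decoder, and the stopping criterion; they are not the paper's API.

```python
def r4r_loop(query, llm, retrieve, judge_done, max_rounds=3):
    """Sketch of the R4R inference loop under assumed callable
    interfaces:

      1. generate an initial structured reasoning for the query;
      2. constrained-decode candidate docids with the GR method;
      3. update the reasoning from the retrieval results;
      repeat until the judge is satisfied or the budget runs out.
    """
    reasoning = llm(f"Produce structured reasoning for: {query}")
    docids = []
    for _ in range(max_rounds):
        docids = retrieve(query, reasoning)   # constrained docid decoding
        if judge_done(query, reasoning, docids):
            break
        reasoning = llm(
            f"Refine reasoning {reasoning!r} given results {docids!r}"
        )
    return docids, reasoning
```

The key design point is that a single LLM plays both roles: the same model that generates and refines the reasoning also performs the constrained docid decoding, so no extra models or training are needed.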


Paper and Project Links

PDF

Summary

This paper examines an emerging question in generative retrieval (GR): how to leverage the reasoning capabilities of large language models (LLMs), not just their generative capabilities, to improve GR. A preliminary study that prompts an LLM to produce free-form chain-of-thought (CoT) reasoning before constrained docid decoding outperforms standard GR, but the generated reasoning tends to be verbose and poorly aligned with the docid space. The paper therefore proposes Reason-for-Retrieval (R4R), a framework that converts free-form CoT reasoning into a compact, structured format and iteratively refines it during retrieval. R4R augments existing GR methods with a reasoning-capable LLM that has been instruction-tuned for GR. At inference time, R4R first generates an initial structured reasoning, then alternates between constrained decoding to produce candidate docids and updating the reasoning based on retrieval results to improve the next round. A single LLM serves as both reasoning generator and retriever, with no additional models or training. Experiments on Natural Questions, MS MARCO, and a real-world item-search benchmark validate R4R's effectiveness.

Key Takeaways

  1. Generative retrieval (GR) is beginning to exploit the reasoning capabilities of large language models (LLMs), not just their generative capabilities.
  2. Prior GR research overlooked the role of explicit reasoning.
  3. The Reason-for-Retrieval (R4R) framework converts free-form chain-of-thought (CoT) reasoning into a compact, structured format.
  4. R4R uses a single LLM for both reasoning generation and retrieval, augmenting existing GR methods.
  5. R4R iteratively refines the reasoning during retrieval to improve results.
  6. R4R performs strongly on Natural Questions, MS MARCO, and a real-world item-search benchmark.

Cool Papers

Click here to view paper screenshots

UniVideo: Unified Understanding, Generation, and Editing for Videos

Authors:Cong Wei, Quande Liu, Zixuan Ye, Qiulin Wang, Xintao Wang, Pengfei Wan, Kun Gai, Wenhu Chen

Unified multimodal models have shown promising results in multimodal content generation and editing but remain largely limited to the image domain. In this work, we present UniVideo, a versatile framework that extends unified modeling to the video domain. UniVideo adopts a dual-stream design, combining a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video generation. This design enables accurate interpretation of complex multimodal instructions while preserving visual consistency. Built on this architecture, UniVideo unifies diverse video generation and editing tasks under a single multimodal instruction paradigm and is jointly trained across them. Extensive experiments demonstrate that UniVideo matches or surpasses state-of-the-art task-specific baselines in text/image-to-video generation, in-context video generation and in-context video editing. Notably, the unified design of UniVideo enables two forms of generalization. First, UniVideo supports task composition, such as combining editing with style transfer, by integrating multiple capabilities within a single instruction. Second, even without explicit training on free-form video editing, UniVideo transfers its editing capability from large-scale image editing data to this setting, handling unseen instructions such as green-screening characters or changing materials within a video. Beyond these core capabilities, UniVideo also supports visual-prompt-based video generation, where the MLLM interprets visual prompts and guides the MMDiT during synthesis. To foster future research, we will release our model and code.


Paper and Project Links

PDF Project Website https://congwei1230.github.io/UniVideo/

Summary

This paper presents UniVideo, a versatile framework that extends unified multimodal modeling to the video domain. UniVideo adopts a dual-stream design, combining a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video generation. This design accurately interprets complex multimodal instructions while preserving visual consistency. Experiments show that UniVideo matches or surpasses state-of-the-art task-specific baselines in text/image-to-video generation, in-context video generation, and in-context video editing. In addition, UniVideo supports task composition and transfers its editing capability from large-scale image editing data, handling unseen instructions such as green-screening characters or changing materials within a video.

Key Takeaways

  1. UniVideo is a versatile framework that extends unified multimodal modeling to the video domain.
  2. It adopts a dual-stream design combining a Multimodal Large Language Model with a Multimodal DiT.
  3. UniVideo accurately interprets complex multimodal instructions while preserving visual consistency.
  4. It matches or surpasses state-of-the-art baselines on video generation and editing tasks.
  5. UniVideo supports task composition, such as combining editing with style transfer.
  6. It transfers editing capability from large-scale image editing data to handle unseen instructions.

Cool Papers

Click here to view paper screenshots

NEXUS: Network Exploration for eXploiting Unsafe Sequences in Multi-Turn LLM Jailbreaks

Authors:Javad Rafiei Asl, Sidhant Narula, Mohammad Ghasemigol, Eduardo Blanco, Daniel Takabi

Large Language Models (LLMs) have revolutionized natural language processing but remain vulnerable to jailbreak attacks, especially multi-turn jailbreaks that distribute malicious intent across benign exchanges and bypass alignment mechanisms. Existing approaches often explore the adversarial space poorly, rely on hand-crafted heuristics, or lack systematic query refinement. We present NEXUS (Network Exploration for eXploiting Unsafe Sequences), a modular framework for constructing, refining, and executing optimized multi-turn attacks. NEXUS comprises: (1) ThoughtNet, which hierarchically expands a harmful intent into a structured semantic network of topics, entities, and query chains; (2) a feedback-driven Simulator that iteratively refines and prunes these chains through attacker-victim-judge LLM collaboration using harmfulness and semantic-similarity benchmarks; and (3) a Network Traverser that adaptively navigates the refined query space for real-time attacks. This pipeline uncovers stealthy, high-success adversarial paths across LLMs. On several closed-source and open-source LLMs, NEXUS increases attack success rate by 2.1% to 19.4% over prior methods. Code: https://github.com/inspire-lab/NEXUS


Paper and Project Links

PDF This paper has been accepted in the main conference proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025). Javad Rafiei Asl and Sidhant Narula are co-first authors

Summary

Large Language Models (LLMs) have driven revolutionary progress in natural language processing but remain vulnerable to jailbreak attacks, especially multi-turn jailbreaks that distribute malicious intent across benign exchanges to bypass alignment mechanisms. To address this, the paper proposes NEXUS (Network Exploration for eXploiting Unsafe Sequences), a modular framework for constructing, refining, and executing optimized multi-turn attacks. NEXUS comprises ThoughtNet (which hierarchically expands a harmful intent into a structured semantic network), a feedback-driven Simulator (which refines and prunes query chains through attacker-victim-judge LLM collaboration), and a Network Traverser (which adaptively navigates the refined query space for real-time attacks). This pipeline improves attack success rates on several closed-source and open-source LLMs.

Key Takeaways

  1. While LLMs have advanced NLP, they remain at risk from jailbreak attacks, especially multi-turn jailbreaks.
  2. The NEXUS framework constructs, refines, and executes multi-turn attacks to study this problem.
  3. The ThoughtNet module hierarchically expands a harmful intent into a structured semantic network.
  4. The feedback-driven Simulator refines and prunes query chains through attacker-victim-judge LLM collaboration.
  5. The Network Traverser module adaptively navigates the refined query space for real-time attacks.
  6. NEXUS improves attack success rates across several LLMs.

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright: Unless otherwise stated, all posts on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!