R1_Reasoning


⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: do not use this content for serious academic work; it is only meant as a first-pass filter before reading the papers!
💗 If you find our project, ChatPaperFree, helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

2025-09-13 Update

FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark

Authors:Rongyao Fang, Aldrich Yu, Chengqi Duan, Linjiang Huang, Shuai Bai, Yuxuan Cai, Kun Wang, Si Liu, Xihui Liu, Hongsheng Li

The advancement of open-source text-to-image (T2I) models has been hindered by the absence of large-scale, reasoning-focused datasets and comprehensive evaluation benchmarks, resulting in a performance gap compared to leading closed-source systems. To address this challenge, we introduce FLUX-Reason-6M and PRISM-Bench (Precise and Robust Image Synthesis Measurement Benchmark). FLUX-Reason-6M is a massive dataset consisting of 6 million high-quality FLUX-generated images and 20 million bilingual (English and Chinese) descriptions specifically designed to teach complex reasoning. The images are organized according to six key characteristics: Imagination, Entity, Text rendering, Style, Affection, and Composition, and are paired with explicit Generation Chain-of-Thought (GCoT) annotations that provide detailed breakdowns of image generation steps. The whole data curation took 15,000 A100 GPU days, providing the community with a resource previously unattainable outside of large industrial labs. PRISM-Bench offers a novel evaluation standard with seven distinct tracks, including a formidable Long Text challenge using GCoT. Through carefully designed prompts, it utilizes advanced vision-language models for nuanced, human-aligned assessment of prompt-image alignment and image aesthetics. Our extensive evaluation of 19 leading models on PRISM-Bench reveals critical performance gaps and highlights specific areas requiring improvement. Our dataset, benchmark, and evaluation code are released to catalyze the next wave of reasoning-oriented T2I generation. Project page: https://flux-reason-6m.github.io/ .

Paper & Project Links

PDF Project page: https://flux-reason-6m.github.io/

Summary

Open-source text-to-image (T2I) models have been held back by the lack of large-scale, reasoning-focused datasets and comprehensive evaluation benchmarks. To address this, the authors introduce the FLUX-Reason-6M dataset and the PRISM-Bench evaluation benchmark. FLUX-Reason-6M contains 6 million high-quality images with 20 million bilingual descriptions specifically designed to teach complex reasoning. The images are organized by six key characteristics and come with detailed step-by-step breakdowns of the generation process. PRISM-Bench provides a novel evaluation standard in which advanced vision-language models perform nuanced, human-aligned assessment. An extensive evaluation of leading models reveals critical performance gaps and highlights areas needing improvement. The dataset, benchmark, and evaluation code have been released to drive the next wave of reasoning-oriented T2I generation.

Key Takeaways

  1. Progress in open-source text-to-image (T2I) models is limited by the absence of large-scale, reasoning-focused datasets and comprehensive evaluation benchmarks.
  2. FLUX-Reason-6M is a large-scale dataset of 6 million high-quality images with corresponding bilingual descriptions for teaching complex reasoning.
  3. Images are organized by six key characteristics, including Imagination, Entity, and Text rendering, with detailed breakdowns of the image generation steps.
  4. PRISM-Bench provides a novel evaluation standard with seven distinct tracks, giving advanced vision-language models a nuanced, human-aligned assessment protocol.
  5. Through carefully designed prompts, PRISM-Bench uses advanced vision-language models to evaluate prompt-image alignment and image aesthetics.
  6. Extensive evaluation of leading models reveals critical performance gaps, particularly in specific areas that need improvement.

Cool Papers

Click here to view paper screenshots

SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

Authors:Haozhan Li, Yuxin Zuo, Jiale Yu, Yuhao Zhang, Zhaohui Yang, Kaiyan Zhang, Xuekai Zhu, Yuchen Zhang, Tianxing Chen, Ganqu Cui, Dehui Wang, Dingxiang Luo, Yuchen Fan, Youbang Sun, Jia Zeng, Jiangmiao Pang, Shanghang Zhang, Yu Wang, Yao Mu, Bowen Zhou, Ning Ding

Vision-Language-Action (VLA) models have recently emerged as a powerful paradigm for robotic manipulation. Despite substantial progress enabled by large-scale pretraining and supervised fine-tuning (SFT), these models face two fundamental challenges: (i) the scarcity and high cost of large-scale human-operated robotic trajectories required for SFT scaling, and (ii) limited generalization to tasks involving distribution shift. Recent breakthroughs in Large Reasoning Models (LRMs) demonstrate that reinforcement learning (RL) can dramatically enhance step-by-step reasoning capabilities, raising a natural question: Can RL similarly improve the long-horizon step-by-step action planning of VLA? In this work, we introduce SimpleVLA-RL, an efficient RL framework tailored for VLA models. Building upon veRL, we introduce VLA-specific trajectory sampling, scalable parallelization, multi-environment rendering, and optimized loss computation. When applied to OpenVLA-OFT, SimpleVLA-RL achieves SoTA performance on LIBERO and even outperforms $\pi_0$ on RoboTwin 1.0&2.0 with the exploration-enhancing strategies we introduce. SimpleVLA-RL not only reduces dependence on large-scale data and enables robust generalization, but also remarkably surpasses SFT in real-world tasks. Moreover, we identify a novel phenomenon, "pushcut", during RL training, wherein the policy discovers patterns beyond those seen in earlier training. Github: https://github.com/PRIME-RL/SimpleVLA-RL

Paper & Project Links

PDF

Summary
Vision-Language-Action (VLA) models have recently shown strong potential for robotic manipulation. Despite substantial progress from large-scale pretraining and supervised fine-tuning (SFT), two challenges remain: large-scale human-operated robot trajectory data is scarce and expensive to collect, and generalization under task distribution shift is limited. This work introduces SimpleVLA-RL, a reinforcement learning framework tailored to VLA models that combines VLA-specific trajectory sampling, scalable parallelization, multi-environment rendering, and optimized loss computation, achieving excellent performance when applied to OpenVLA-OFT. SimpleVLA-RL reduces dependence on large-scale data, improves generalization, and surpasses SFT on real-world tasks. The study also identifies a new phenomenon in RL training, "pushcut", in which the policy discovers patterns it had never exhibited during earlier training.

Key Takeaways

  1. VLA models show strong potential for robotic manipulation but still face challenges in data acquisition and generalization.
  2. Reinforcement learning (RL) can improve the long-horizon planning ability of VLA models.
  3. The SimpleVLA-RL framework is optimized for VLA models, covering trajectory sampling, parallelization, multi-environment rendering, and loss computation.
  4. SimpleVLA-RL achieves excellent performance across multiple tasks while reducing dependence on large-scale data.
  5. SimpleVLA-RL improves generalization and surpasses SFT on real-world tasks.
  6. A new phenomenon, "pushcut", emerges during RL training, in which the policy discovers previously unseen patterns.

Cool Papers

Click here to view paper screenshots

LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering

Authors:Jielin Qiu, Zuxin Liu, Zhiwei Liu, Rithesh Murthy, Jianguo Zhang, Haolin Chen, Shiyu Wang, Ming Zhu, Liangwei Yang, Juntao Tan, Zhepeng Cen, Cheng Qian, Shelby Heinecke, Weiran Yao, Silvio Savarese, Caiming Xiong, Huan Wang

The emergence of long-context language models with context windows extending to millions of tokens has created new opportunities for sophisticated code understanding and software development evaluation. We propose LoCoBench, a comprehensive benchmark specifically designed to evaluate long-context LLMs in realistic, complex software development scenarios. Unlike existing code evaluation benchmarks that focus on single-function completion or short-context tasks, LoCoBench addresses the critical evaluation gap for long-context capabilities that require understanding entire codebases, reasoning across multiple files, and maintaining architectural consistency across large-scale software systems. Our benchmark provides 8,000 evaluation scenarios systematically generated across 10 programming languages, with context lengths spanning 10K to 1M tokens, a 100x variation that enables precise assessment of long-context performance degradation in realistic software development settings. LoCoBench introduces 8 task categories that capture essential long-context capabilities: architectural understanding, cross-file refactoring, multi-session development, bug investigation, feature implementation, code comprehension, integration testing, and security analysis. Through a 5-phase pipeline, we create diverse, high-quality scenarios that challenge LLMs to reason about complex codebases at unprecedented scale. We introduce a comprehensive evaluation framework with 17 metrics across 4 dimensions, including 8 new evaluation metrics, combined in a LoCoBench Score (LCBS). Our evaluation of state-of-the-art long-context models reveals substantial performance gaps, demonstrating that long-context understanding in complex software development represents a significant unsolved challenge that demands more attention. LoCoBench is released at: https://github.com/SalesforceAIResearch/LoCoBench.
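
The abstract folds 17 metrics across 4 dimensions into a single LoCoBench Score (LCBS) but does not specify the weighting. A minimal sketch of such an aggregation, with hypothetical dimension names, illustrative metric values, and uniform weights:

```python
# Hypothetical LCBS-style aggregation; dimension names, metric values, and
# uniform weights are illustrative assumptions, not the paper's actual design.
from statistics import mean

def locobench_score(metrics: dict, weights: dict | None = None) -> float:
    """metrics maps dimension -> {metric_name: value in [0, 1]}."""
    dims = list(metrics)
    weights = weights or {d: 1.0 / len(dims) for d in dims}  # uniform by default
    # Average the metrics within each dimension, then take a weighted sum.
    return sum(weights[d] * mean(metrics[d].values()) for d in dims)

scores = {
    "architectural_consistency": {"arch_understanding": 0.62, "refactoring": 0.55},
    "functional_correctness": {"test_pass_rate": 0.71},
    "code_quality": {"style": 0.80, "security": 0.64},
    "long_context_utilization": {"cross_file_recall": 0.49},
}
print(f"LCBS = {locobench_score(scores):.3f}")
```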

Paper & Project Links

PDF 53 pages

Summary

The emergence of long-context language models has created new opportunities for sophisticated code understanding and software development evaluation. To this end, the authors propose LoCoBench, a comprehensive benchmark designed to evaluate long-context LLMs in realistic, complex software development scenarios. Unlike existing code evaluation benchmarks, LoCoBench focuses on long-context capabilities that require understanding entire codebases, reasoning across multiple files, and maintaining architectural consistency in large-scale software systems. The benchmark provides 8,000 systematically generated evaluation scenarios across 10 programming languages, with context lengths spanning 10K to 1M tokens, a 100x range that enables precise measurement of long-context performance degradation in realistic software development settings. LoCoBench introduces 8 task categories covering essential long-context capabilities: architectural understanding, cross-file refactoring, multi-session development, bug investigation, feature implementation, code comprehension, integration testing, and security analysis. Through a 5-phase pipeline, diverse, high-quality scenarios challenge LLMs to reason about complex codebases at unprecedented scale. The evaluation framework comprises 17 metrics across 4 dimensions, including 8 new metrics, combined into a LoCoBench Score (LCBS). An evaluation of state-of-the-art long-context models reveals substantial performance gaps, showing that long-context understanding in complex software development remains a significant unsolved challenge that deserves more attention.

Key Takeaways

  1. LoCoBench is a comprehensive benchmark designed specifically to evaluate long-context LLMs.
  2. Unlike other code evaluation benchmarks, LoCoBench emphasizes long-context capabilities such as whole-codebase understanding, multi-file reasoning, and architectural consistency in large software systems.
  3. LoCoBench provides 8,000 systematically generated evaluation scenarios across 10 programming languages, with a wide range of context lengths.
  4. The benchmark includes 8 task categories, such as architectural understanding, cross-file refactoring, and multi-session development.
  5. A 5-phase pipeline creates diverse, high-quality scenarios that challenge LLMs to reason about complex codebases.
  6. A comprehensive evaluation framework with 17 metrics, including 8 new ones, is combined into the LoCoBench Score (LCBS).

Cool Papers

Click here to view paper screenshots

OmniEVA: Embodied Versatile Planner via Task-Adaptive 3D-Grounded and Embodiment-aware Reasoning

Authors:Yuecheng Liu, Dafeng Chi, Shiguang Wu, Zhanguang Zhang, Yuzheng Zhuang, Bowen Yang, He Zhu, Lingfeng Zhang, Pengwei Xie, David Gamaliel Arcos Bravo, Yingxue Zhang, Jianye Hao, Xingyue Quan

Recent advances in multimodal large language models (MLLMs) have opened new opportunities for embodied intelligence, enabling multimodal understanding, reasoning, and interaction, as well as continuous spatial decision-making. Nevertheless, current MLLM-based embodied systems face two critical limitations. First, Geometric Adaptability Gap: models trained solely on 2D inputs or with hard-coded 3D geometry injection suffer from either insufficient spatial information or restricted 2D generalization, leading to poor adaptability across tasks with diverse spatial demands. Second, Embodiment Constraint Gap: prior work often neglects the physical constraints and capacities of real robots, resulting in task plans that are theoretically valid but practically infeasible. To address these gaps, we introduce OmniEVA, an embodied versatile planner that enables advanced embodied reasoning and task planning through two pivotal innovations: (1) a Task-Adaptive 3D Grounding mechanism, which introduces a gated router to perform explicit selective regulation of 3D fusion based on contextual requirements, enabling context-aware 3D grounding for diverse embodied tasks. (2) an Embodiment-Aware Reasoning framework that jointly incorporates task goals and embodiment constraints into the reasoning loop, resulting in planning decisions that are both goal-directed and executable. Extensive experimental results demonstrate that OmniEVA not only achieves state-of-the-art general embodied reasoning performance, but also exhibits a strong ability across a wide range of downstream scenarios. Evaluations of a suite of proposed embodied benchmarks, including both primitive and composite tasks, confirm its robust and versatile planning capabilities. Project page: https://omnieva.github.io
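
The task-adaptive 3D grounding mechanism hinges on a gated router that decides, per context, how much 3D information to fuse into the model. A minimal sketch of such selective gating in PyTorch; the shapes and the sigmoid-gate form are assumptions rather than the paper's actual architecture:

```python
import torch
import torch.nn as nn

class Gated3DFusion(nn.Module):
    """Context-conditioned gate that selectively mixes 3D features into
    2D visual tokens (illustrative sketch of a gated router)."""
    def __init__(self, d_model: int = 1024):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(d_model, d_model), nn.Sigmoid())
        self.proj_3d = nn.Linear(d_model, d_model)

    def forward(self, tokens_2d: torch.Tensor, feats_3d: torch.Tensor,
                context: torch.Tensor) -> torch.Tensor:
        # context: pooled task embedding [B, D]; gate values in [0, 1]
        # explicitly regulate how much 3D signal is fused per sample.
        g = self.gate(context).unsqueeze(1)            # [B, 1, D]
        return tokens_2d + g * self.proj_3d(feats_3d)  # suppressed when g ~ 0

fusion = Gated3DFusion()
out = fusion(torch.randn(2, 16, 1024), torch.randn(2, 16, 1024), torch.randn(2, 1024))
print(out.shape)  # torch.Size([2, 16, 1024])
```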

Paper & Project Links

PDF

Summary

Recent advances in multimodal large language models (MLLMs) have opened new opportunities for embodied intelligence, enabling multimodal understanding, reasoning, and interaction as well as continuous spatial decision-making. However, current MLLM-based embodied systems face two key limitations: a geometric adaptability gap and an embodiment constraint gap. To address them, the paper proposes OmniEVA, an embodied versatile planner built on two innovations: a task-adaptive 3D grounding mechanism and an embodiment-aware reasoning framework. The planner enables advanced embodied reasoning and task planning, producing decisions that are both goal-directed and executable in context. Experiments show that OmniEVA not only achieves state-of-the-art embodied reasoning performance but also performs strongly across a wide range of downstream scenarios.

Key Takeaways

  1. Advances in multimodal large language models open new opportunities for embodied intelligence, including multimodal understanding, reasoning, interaction, and continuous spatial decision-making.
  2. Current MLLM-based embodied systems face two key limitations: a geometric adaptability gap and an embodiment constraint gap.
  3. OmniEVA is an embodied versatile planner that addresses these limitations through a task-adaptive 3D grounding mechanism and an embodiment-aware reasoning framework.
  4. OmniEVA enables advanced embodied reasoning and task planning and performs strongly across a wide range of downstream scenarios.
  5. OmniEVA's planning decisions are both goal-directed and executable.
  6. Experiments confirm that OmniEVA achieves state-of-the-art embodied reasoning performance and robust, versatile planning across both primitive and composite tasks.

Cool Papers

Click here to view paper screenshots

Bona fide Cross Testing Reveals Weak Spot in Audio Deepfake Detection Systems

Authors:Chin Yuen Kwok, Jia Qi Yip, Zhen Qiu, Chi Hung Chi, Kwok Yan Lam

Audio deepfake detection (ADD) models are commonly evaluated using datasets that combine multiple synthesizers, with performance reported as a single Equal Error Rate (EER). However, this approach disproportionately weights synthesizers with more samples, underrepresenting others and reducing the overall reliability of EER. Additionally, most ADD datasets lack diversity in bona fide speech, often featuring a single environment and speech style (e.g., clean read speech), limiting their ability to simulate real-world conditions. To address these challenges, we propose bona fide cross-testing, a novel evaluation framework that incorporates diverse bona fide datasets and aggregates EERs for more balanced assessments. Our approach improves robustness and interpretability compared to traditional evaluation methods. We benchmark over 150 synthesizers across nine bona fide speech types and release a new dataset to facilitate further research at https://github.com/cyaaronk/audio_deepfake_eval.
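
The framework's key move is computing an EER per (bona fide set, synthesizer) pair and averaging, instead of pooling everything into one EER that sample-rich synthesizers dominate. A minimal sketch of that aggregation with a standard EER routine, assuming higher scores mean "more likely bona fide":

```python
import numpy as np

def eer(bona_scores: np.ndarray, spoof_scores: np.ndarray) -> float:
    """Equal Error Rate: threshold where false-accept rate == false-reject rate."""
    thresholds = np.sort(np.concatenate([bona_scores, spoof_scores]))
    fars = np.array([(spoof_scores >= t).mean() for t in thresholds])
    frrs = np.array([(bona_scores < t).mean() for t in thresholds])
    i = np.argmin(np.abs(fars - frrs))
    return float((fars[i] + frrs[i]) / 2)

def cross_test_eer(bona_sets: dict, spoof_sets: dict) -> float:
    # Average the EER over all (bona fide set, synthesizer) pairs so that
    # synthesizers with more samples do not dominate the pooled metric.
    pair_eers = [eer(b, s) for b in bona_sets.values() for s in spoof_sets.values()]
    return float(np.mean(pair_eers))

rng = np.random.default_rng(0)
bona = {"read": rng.normal(1.0, 1, 500), "noisy": rng.normal(0.5, 1, 500)}
spoof = {"synthA": rng.normal(-1.0, 1, 2000), "synthB": rng.normal(0.0, 1, 100)}
print(f"aggregated EER = {cross_test_eer(bona, spoof):.3f}")
```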

Paper & Project Links

PDF Published in Interspeech 2025

Summary: Audio deepfake detection models are commonly evaluated on datasets that combine multiple synthesizers, with performance reported as a single equal error rate (EER). This approach overweights synthesizers with more samples, underrepresents the others, and reduces the overall reliability of the EER. Moreover, most audio deepfake detection datasets lack diversity in bona fide speech, typically covering a single environment and speech style (such as clean read speech), and so fail to simulate real-world conditions. To address these challenges, the authors propose bona fide cross-testing, a novel evaluation framework that incorporates diverse bona fide speech datasets and aggregates EERs for more balanced assessment. Compared with traditional evaluation methods, the approach improves robustness and interpretability. Over 150 synthesizers are benchmarked across nine bona fide speech types, and a new dataset is released to facilitate further research.

Key Takeaways

  1. The evaluation of audio deepfake detection models faces challenges, including uneven weighting across synthesizers and insufficient diversity of bona fide speech.
  2. Existing evaluation methods overweight synthesizers with more samples, making results less comprehensive and reliable.
  3. The proposed bona fide cross-testing framework improves robustness and interpretability by incorporating diverse bona fide speech datasets.
  4. The framework benchmarks over 150 synthesizers across nine bona fide speech types.
  5. Bona fide cross-testing aggregates EERs for a more balanced assessment, addressing the problems of traditional evaluation methods.
  6. The authors release a new dataset to facilitate further research on audio deepfake detection.

Cool Papers

Click here to view paper screenshots

VQualA 2025 Challenge on Visual Quality Comparison for Large Multimodal Models: Methods and Results

Authors:Hanwei Zhu, Haoning Wu, Zicheng Zhang, Lingyu Zhu, Yixuan Li, Peilin Chen, Shiqi Wang, Chris Wei Zhou, Linhan Cao, Wei Sun, Xiangyang Zhu, Weixia Zhang, Yucheng Zhu, Jing Liu, Dandan Zhu, Guangtao Zhai, Xiongkuo Min, Zhichao Zhang, Xinyue Li, Shubo Xu, Anh Dao, Yifan Li, Hongyuan Yu, Jiaojiao Yi, Yiding Tian, Yupeng Wu, Feiran Sun, Lijuan Liao, Song Jiang

This paper presents a summary of the VQualA 2025 Challenge on Visual Quality Comparison for Large Multimodal Models (LMMs), hosted as part of the ICCV 2025 Workshop on Visual Quality Assessment. The challenge aims to evaluate and enhance the ability of state-of-the-art LMMs to perform open-ended and detailed reasoning about visual quality differences across multiple images. To this end, the competition introduces a novel benchmark comprising thousands of coarse-to-fine grained visual quality comparison tasks, spanning single images, pairs, and multi-image groups. Each task requires models to provide accurate quality judgments. The competition emphasizes holistic evaluation protocols, including 2AFC-based binary preference and multi-choice questions (MCQs). Around 100 participants submitted entries, with five models demonstrating the emerging capabilities of instruction-tuned LMMs on quality assessment. This challenge marks a significant step toward open-domain visual quality reasoning and comparison and serves as a catalyst for future research on interpretable and human-aligned quality evaluation systems.
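
For the 2AFC track, a model is shown an image pair and must name the one humans judged higher quality; scoring is plain agreement with the human preference. A minimal sketch with illustrative field names:

```python
# Minimal 2AFC scoring sketch: the model sees an image pair and must pick the
# one humans judged higher quality. Field names here are illustrative only.
items = [
    {"pair_id": 1, "model_choice": "A", "human_preference": "A"},
    {"pair_id": 2, "model_choice": "B", "human_preference": "A"},
    {"pair_id": 3, "model_choice": "B", "human_preference": "B"},
]

accuracy = sum(it["model_choice"] == it["human_preference"] for it in items) / len(items)
print(f"2AFC accuracy: {accuracy:.2%}")  # chance level is 50%
```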

Paper & Project Links

PDF ICCV VQualA Workshop 2025

Summary

This paper summarizes the VQualA 2025 Challenge on Visual Quality Comparison for Large Multimodal Models (LMMs), hosted as part of the ICCV 2025 Workshop on Visual Quality Assessment. The challenge evaluates and strengthens the ability of state-of-the-art LMMs to perform open-ended, detailed reasoning about visual quality differences across multiple images. To this end, the competition introduces a novel benchmark comprising thousands of coarse-to-fine-grained visual quality comparison tasks spanning single images, pairs, and multi-image groups, each requiring accurate quality judgments. The competition emphasizes holistic evaluation protocols, including 2AFC-based binary preference and multiple-choice questions (MCQs). Around 100 participants submitted entries, with five models demonstrating the emerging quality-assessment capabilities of instruction-tuned LMMs. The challenge marks a significant step toward open-domain visual quality reasoning and comparison, and serves as a catalyst for future research on interpretable, human-aligned quality evaluation systems.

Key Takeaways

  1. The VQualA 2025 Challenge evaluates the visual quality comparison abilities of large multimodal models (LMMs).
  2. The competition introduces a novel benchmark with visual quality comparison tasks at multiple granularities.
  3. Models must provide accurate quality judgments on tasks spanning multiple images.
  4. The competition emphasizes holistic evaluation protocols, including binary preference and multiple-choice questions.
  5. Around 100 participants submitted entries.
  6. Five models demonstrated the quality-assessment capabilities of instruction-tuned LMMs.

Cool Papers

Click here to view paper screenshots

Clip Your Sequences Fairly: Enforcing Length Fairness for Sequence-Level RL

Authors:Hanyi Mao, Quanjia Xiao, Lei Pang, Haixiao Liu

We propose FSPO (Fair Sequence Policy Optimization), a sequence-level reinforcement learning method for LLMs that enforces length-fair clipping directly in the importance-sampling (IS) weight space. We revisit sequence-level RL methods and identify a mismatch when PPO/GRPO-style clipping is transplanted to sequences: a fixed clip range systematically reweights short vs. long responses, distorting the effective objective. Theoretically, we formalize length fairness via a Length Reweighting Error (LRE) and prove that small LRE yields a directional cosine guarantee between the clipped and true updates. FSPO introduces a simple, Gaussian-motivated remedy: we clip the sequence log-IS ratio with a band that applies a KL-corrected drift term and scales as $\sqrt{L}$. Empirically, FSPO flattens clip rates across length bins, stabilizes training, and outperforms all baselines across multiple evaluation datasets.
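
The core mechanism is clipping the sequence-level log importance-sampling ratio inside a band whose half-width scales as $\sqrt{L}$, instead of applying a fixed per-token clip. A minimal sketch in PyTorch; the exact form of the KL-corrected drift term and the band constant are assumptions, not the paper's values:

```python
import torch

def fspo_clipped_weight(logp_new: torch.Tensor, logp_old: torch.Tensor,
                        lengths: torch.Tensor, eps: float = 0.2,
                        drift: float = 0.0) -> torch.Tensor:
    """Sequence-level IS weight with a length-aware clip band.

    logp_new/logp_old: summed sequence log-probs, shape [batch].
    lengths: response lengths L, shape [batch].
    The log-ratio is clipped to drift*L +/- eps*sqrt(L): the per-token
    deviation the band tolerates shrinks like 1/sqrt(L), rather than a fixed
    range that systematically reweights short vs. long responses.
    """
    log_ratio = logp_new - logp_old
    center = drift * lengths            # KL-corrected drift term (assumed form)
    half_width = eps * lengths.sqrt()   # band scales as sqrt(L)
    clipped = torch.clamp(log_ratio, center - half_width, center + half_width)
    return clipped.exp()

# Toy usage: a long and a short response with the same total log-ratio.
logp_old = torch.tensor([-120.0, -12.0])
logp_new = torch.tensor([-118.0, -10.0])
lengths = torch.tensor([100.0, 10.0])
print(fspo_clipped_weight(logp_new, logp_old, lengths))
```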

Paper & Project Links

PDF

Summary
FSPO (Fair Sequence Policy Optimization) is a sequence-level reinforcement learning method for LLMs that enforces length-fair clipping directly in the importance-sampling (IS) weight space. Revisiting sequence-level RL methods reveals a mismatch when PPO/GRPO-style clipping is transplanted to sequences: a fixed clip range systematically reweights short versus long responses, distorting the effective objective. The paper formalizes length fairness through a Length Reweighting Error (LRE) and shows that a small LRE yields a directional cosine guarantee between the clipped and true updates. FSPO introduces a simple, Gaussian-motivated remedy: clipping the sequence log-IS ratio with a band that applies a KL-corrected drift term and scales as the square root of the sequence length. Empirically, FSPO flattens clip rates across length bins, stabilizes training, and outperforms all baselines across multiple evaluation datasets.

Key Takeaways

  • FSPO is a sequence-level reinforcement learning method for LLMs that addresses length-unfair clipping.
  • Analysis of existing sequence-level RL methods shows that a fixed clip range systematically reweights short versus long responses, distorting the effective objective.
  • Length fairness is formalized through a Length Reweighting Error (LRE).
  • A small LRE yields a directional cosine guarantee between the clipped and true updates.
  • FSPO introduces a Gaussian-motivated remedy that clips the sequence log-IS ratio to achieve fair clipping.
  • Empirically, FSPO flattens clip rates across length bins, stabilizes training, and performs strongly across multiple evaluation datasets.

Cool Papers

Click here to view paper screenshots

EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs

Authors:Yuhao Zhang, Yuhao Du, Zhanchen Dai, Xiangnan Ma, Kaiqi Kou, Benyou Wang, Haizhou Li

Speech-to-speech large language models (SLLMs) are attracting increasing attention. Derived from text-based large language models (LLMs), SLLMs often exhibit degradation in knowledge and reasoning capabilities. We hypothesize that this limitation arises because current training paradigms for SLLMs fail to bridge the acoustic-semantic gap in the feature representation space. To address this issue, we propose EchoX, which leverages semantic representations and dynamically generates speech training targets. This approach integrates both acoustic and semantic learning, enabling EchoX to preserve strong reasoning abilities as a speech LLM. Experimental results demonstrate that EchoX, with about six thousand hours of training data, achieves advanced performance on multiple knowledge-based question-answering benchmarks. The project is available at https://github.com/FreedomIntelligence/EchoX.

Paper & Project Links

PDF

Summary

Speech-to-speech large language models (SLLMs) are attracting increasing attention. Derived from text-based LLMs, SLLMs often show degraded knowledge and reasoning abilities. The paper hypothesizes that this limitation arises because current SLLM training paradigms fail to bridge the acoustic-semantic gap in the feature representation space. To address this, the authors propose EchoX, which leverages semantic representations and dynamically generates speech training targets, integrating acoustic and semantic learning so that EchoX preserves strong reasoning abilities as a speech LLM. Experiments show that with about six thousand hours of training data, EchoX achieves advanced performance on multiple knowledge-based question-answering benchmarks. The project is available at https://github.com/FreedomIntelligence/EchoX.

Key Takeaways

  1. SLLMs are gaining attention but fall short of text-based LLMs in knowledge and reasoning.
  2. The main cause of this limitation is that current SLLM training paradigms fail to bridge the acoustic-semantic gap in the feature representation space.
  3. EchoX addresses this problem by integrating acoustic and semantic learning.
  4. EchoX leverages semantic representations and dynamically generates speech training targets.
  5. EchoX preserves strong reasoning abilities as a speech LLM.
  6. Experiments show that EchoX performs strongly on multiple knowledge-based question-answering benchmarks.

Cool Papers

Click here to view paper screenshots

MR-UIE: Multi-Perspective Reasoning with Reinforcement Learning for Universal Information Extraction

Authors:Zhongqiu Li, Shiquan Wang, Ruiyu Fang, Mengjiao Bao, Zhenhe Wu, Shuangyong Song, Yongxiang Li, Zhongjiang He

Large language models (LLMs) demonstrate robust capabilities across diverse research domains. However, their performance in universal information extraction (UIE) remains insufficient, especially when tackling structured output scenarios that involve complex schema descriptions and require multi-step reasoning. While existing approaches enhance the performance of LLMs through in-context learning and instruction tuning, significant limitations nonetheless persist. To enhance the model’s generalization ability, we propose integrating reinforcement learning (RL) with multi-perspective reasoning for information extraction (IE) tasks. Our work transitions LLMs from passive extractors to active reasoners, enabling them to understand not only what to extract but also how to reason. Experiments conducted on multiple IE benchmarks demonstrate that MR-UIE consistently elevates extraction accuracy across domains and surpasses state-of-the-art methods on several datasets. Furthermore, incorporating multi-perspective reasoning into RL notably enhances generalization in complex IE tasks, underscoring the critical role of reasoning in challenging scenarios.

Paper & Project Links

PDF

Summary

Large language models (LLMs) show strong capabilities across many research domains, but their performance on universal information extraction (UIE) remains insufficient, especially for structured-output scenarios that involve complex schema descriptions and multi-step reasoning. To improve generalization, the paper integrates reinforcement learning (RL) with multi-perspective reasoning for information extraction (IE) tasks. Experiments show that MR-UIE consistently improves extraction accuracy on multiple IE benchmarks and surpasses state-of-the-art methods on several datasets. In particular, incorporating multi-perspective reasoning into RL notably enhances generalization on complex IE tasks, underscoring the critical role of reasoning in challenging scenarios.

Key Takeaways

  1. The performance of large language models (LLMs) on universal information extraction (UIE) still needs improvement.
  2. LLMs struggle with structured-output scenarios that involve complex schema descriptions and multi-step reasoning.
  3. Combining reinforcement learning (RL) with multi-perspective reasoning improves LLM performance on IE tasks.
  4. MR-UIE achieves high extraction accuracy on multiple information extraction benchmarks.
  5. MR-UIE surpasses existing state-of-the-art methods on several datasets.
  6. Incorporating multi-perspective reasoning into RL notably enhances generalization on complex IE tasks.

Cool Papers

Click here to view paper screenshots

BRoverbs – Measuring how much LLMs understand Portuguese proverbs

Authors:Thales Sales Almeida, Giovana Kerche Bonás, João Guilherme Alves Santos

Large Language Models (LLMs) exhibit significant performance variations depending on the linguistic and cultural context in which they are applied. This disparity signals the necessity of mature evaluation frameworks that can assess their capabilities in specific regional settings. In the case of Portuguese, existing evaluations remain limited, often relying on translated datasets that may not fully capture linguistic nuances or cultural references. Meanwhile, native Portuguese-language datasets predominantly focus on structured national exams or sentiment analysis of social media interactions, leaving gaps in evaluating broader linguistic understanding. To address this limitation, we introduce BRoverbs, a dataset specifically designed to assess LLM performance through Brazilian proverbs. Proverbs serve as a rich linguistic resource, encapsulating cultural wisdom, figurative expressions, and complex syntactic structures that challenge the model comprehension of regional expressions. BRoverbs aims to provide a new evaluation tool for Portuguese-language LLMs, contributing to advancing regionally informed benchmarking. The benchmark is available at https://huggingface.co/datasets/Tropic-AI/BRoverbs.
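
Since the benchmark is hosted on the Hugging Face Hub, it should be loadable with the `datasets` library. A minimal sketch; the split and column names are not given in the abstract, so the code only inspects whatever schema the dataset card defines:

```python
from datasets import load_dataset

# Dataset ID from the paper; split/column names should be checked against
# the dataset card rather than assumed.
ds = load_dataset("Tropic-AI/BRoverbs")
print(ds)                   # shows available splits and columns
first_split = next(iter(ds))
print(ds[first_split][0])   # inspect one example
```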

Paper & Project Links

PDF

Summary

Large language models (LLMs) exhibit significant performance variation depending on the linguistic and cultural context in which they are applied, which signals the need for mature evaluation frameworks for specific regional settings. For Portuguese, existing evaluations remain limited, often relying on translated datasets that may not fully capture linguistic nuances or cultural references, while native Portuguese datasets focus mostly on structured national exams or sentiment analysis of social media, leaving gaps in assessing broader linguistic understanding. To address this, the authors introduce BRoverbs, a dataset designed to assess LLM performance through Brazilian proverbs. Proverbs are a rich linguistic resource, encapsulating cultural wisdom, figurative expressions, and complex syntactic structures that challenge a model's grasp of regional expressions. BRoverbs provides a new evaluation tool for Portuguese-language LLMs and contributes to regionally informed benchmarking. The benchmark is available at https://huggingface.co/datasets/Tropic-AI/BRoverbs.

Key Takeaways

  1. LLM performance varies across linguistic and cultural contexts, calling for mature evaluation frameworks.
  2. Existing Portuguese LLM evaluations are limited and often rely on translated datasets that may miss linguistic nuances and cultural references.
  3. Native Portuguese datasets focus mostly on structured exams and social media sentiment analysis, lacking broader assessments of linguistic understanding.
  4. BRoverbs assesses Portuguese-language LLMs through Brazilian proverbs, a rich linguistic resource encapsulating cultural wisdom and complex syntax.
  5. BRoverbs offers a new evaluation tool for Portuguese-language LLMs and advances regionally informed benchmarking.
  6. The BRoverbs dataset is available at https://huggingface.co/datasets/Tropic-AI/BRoverbs.
  7. The benchmark is an important resource for evaluating Portuguese-language LLMs and for understanding Portuguese linguistic nuances and cultural context.

Cool Papers

Click here to view paper screenshots

Merge-of-Thought Distillation

Authors:Zhanming Shen, Zeyu Qin, Zenan Huang, Hao Chen, Jiaqi Hu, Yihong Zhuang, Guoshan Lu, Gang Chen, Junbo Zhao

Efficient reasoning distillation for long chain-of-thought (CoT) models is increasingly constrained by the assumption of a single oracle teacher, despite practical availability of multiple candidate teachers and growing CoT corpora. We revisit teacher selection and observe that different students have different “best teachers,” and even for the same student the best teacher can vary across datasets. Therefore, to unify multiple teachers’ reasoning abilities into student with overcoming conflicts among various teachers’ supervision, we propose Merge-of-Thought Distillation (MoT), a lightweight framework that alternates between teacher-specific supervised fine-tuning branches and weight-space merging of the resulting student variants. On competition math benchmarks, using only about 200 high-quality CoT samples, applying MoT to a Qwen3-14B student surpasses strong models including DEEPSEEK-R1, QWEN3-30B-A3B, QWEN3-32B, and OPENAI-O1, demonstrating substantial gains. Besides, MoT consistently outperforms the best single-teacher distillation and the naive multi-teacher union, raises the performance ceiling while mitigating overfitting, and shows robustness to distribution-shifted and peer-level teachers. Moreover, MoT reduces catastrophic forgetting, improves general reasoning beyond mathematics and even cultivates a better teacher, indicating that consensus-filtered reasoning features transfer broadly. These results position MoT as a simple, scalable route to efficiently distilling long CoT capabilities from diverse teachers into compact students.
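
The weight-space merging step at the heart of MoT is, in its simplest form, a weighted average of the student variants' parameters. A minimal sketch over PyTorch state dicts, assuming uniform merge weights (the paper may use a different scheme):

```python
import torch

def merge_state_dicts(state_dicts: list, weights: list | None = None) -> dict:
    """Weighted average of student variants in weight space (uniform by default)."""
    n = len(state_dicts)
    weights = weights or [1.0 / n] * n
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged

# MoT outer loop (schematic): fine-tune one student copy per teacher's CoT
# data, merge the variants, then repeat from the merged weights.
# students = [finetune(base, teacher_data[t]) for t in teachers]  # SFT branches
# base.load_state_dict(merge_state_dicts([s.state_dict() for s in students]))
```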

Paper & Project Links

PDF

Summary

Efficient reasoning distillation for long chain-of-thought (CoT) models is increasingly constrained by the assumption of a single oracle teacher, despite the practical availability of multiple candidate teachers and growing CoT corpora. Revisiting teacher selection shows that different students have different "best teachers", and even for the same student the best teacher can vary across datasets. To unify multiple teachers' reasoning abilities in a student while overcoming conflicts among their supervision, the paper proposes Merge-of-Thought Distillation (MoT), a lightweight framework that alternates between teacher-specific supervised fine-tuning branches and weight-space merging of the resulting student variants. On competition math benchmarks, using only about 200 high-quality CoT samples, applying MoT to a Qwen3-14B student surpasses strong models such as DEEPSEEK-R1, QWEN3-30B-A3B, QWEN3-32B, and OPENAI-O1. MoT also consistently beats the best single-teacher distillation and the naive multi-teacher union, raises the performance ceiling while mitigating overfitting, and is robust to distribution-shifted and peer-level teachers. Moreover, MoT reduces catastrophic forgetting, improves general reasoning beyond mathematics, and even cultivates a better teacher, suggesting that consensus-filtered reasoning features transfer broadly. These results position MoT as a simple, scalable route to distilling long CoT capabilities from diverse teachers into compact students.

Key Takeaways

  1. Merge-of-Thought Distillation (MoT) is proposed to unify the reasoning abilities of multiple teachers.
  2. Different students have different best teachers, and the best teacher can vary across datasets.
  3. The MoT framework alternates between teacher-specific supervised fine-tuning branches and weight-space merging of the resulting student variants.
  4. On competition math benchmarks, MoT significantly surpasses other strong models.
  5. MoT raises the performance ceiling, mitigates overfitting, and is robust to distribution-shifted and peer-level teachers.
  6. MoT reduces catastrophic forgetting, improves general reasoning, and even cultivates a better teacher.
  7. MoT's consensus-filtered reasoning features transfer broadly.

Cool Papers

Click here to view paper screenshots

PersonaFuse: A Personality Activation-Driven Framework for Enhancing Human-LLM Interactions

Authors:Yixuan Tang, Yi Yang, Ahmed Abbasi

Recent advancements in Large Language Models (LLMs) demonstrate remarkable capabilities across various fields. These developments have led to more direct communication between humans and LLMs in various situations, such as social companionship and psychological support. However, LLMs often exhibit limitations in emotional perception and social competence during real-world conversations. These limitations partly originate from their inability to adapt their communication style and emotional expression to different social and task contexts. In this work, we introduce PersonaFuse, a novel LLM post-training framework that enables LLMs to adapt and express different personalities for varying situations. Inspired by Trait Activation Theory and the Big Five personality model, PersonaFuse employs a Mixture-of-Expert architecture that combines persona adapters with a dynamic routing network, enabling contextual trait expression. Experimental results show that PersonaFuse substantially outperforms baseline models across multiple dimensions of social-emotional intelligence. Importantly, these gains are achieved without sacrificing general reasoning ability or model safety, which remain common limitations of direct prompting and supervised fine-tuning approaches. PersonaFuse also delivers consistent improvements in downstream human-centered applications, such as mental health counseling and review-based customer service. Finally, human preference evaluations against leading LLMs, including GPT-4o and DeepSeek, demonstrate that PersonaFuse achieves competitive response quality despite its comparatively smaller model size. These findings demonstrate that PersonaFuse offers a theoretically grounded and practical approach for developing social-emotional enhanced LLMs, marking a significant advancement toward more human-centric AI systems.
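
A Mixture-of-Experts arrangement of persona adapters with a dynamic routing network can be sketched compactly in PyTorch. Everything below (adapter shape, softmax routing, Big-Five-style expert names) is an illustrative assumption, not the paper's actual architecture:

```python
import torch
import torch.nn as nn

class PersonaMoE(nn.Module):
    """Soft-routes hidden states through persona adapters (illustrative sketch)."""
    def __init__(self, d_model: int = 768, personas=("O", "C", "E", "A", "N")):
        super().__init__()
        self.router = nn.Linear(d_model, len(personas))  # dynamic routing network
        self.adapters = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 64), nn.GELU(), nn.Linear(64, d_model))
            for _ in personas
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:  # h: [batch, d_model]
        gates = torch.softmax(self.router(h), dim=-1)    # context-dependent trait mix
        expert_out = torch.stack([a(h) for a in self.adapters], dim=1)  # [B, E, D]
        return h + (gates.unsqueeze(-1) * expert_out).sum(dim=1)        # residual

moe = PersonaMoE()
print(moe(torch.randn(2, 768)).shape)  # torch.Size([2, 768])
```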

Paper & Project Links

PDF

Summary

Recent advances in large language models (LLMs) have brought remarkable capabilities across fields, enabling direct human-LLM communication in settings such as social companionship and psychological support. However, LLMs often show limited emotional perception and social competence in real-world conversations, partly because they cannot adapt their communication style and emotional expression to different social and task contexts. This paper introduces PersonaFuse, a novel LLM post-training framework that enables LLMs to adapt and express different personalities for varying situations. Inspired by Trait Activation Theory and the Big Five personality model, PersonaFuse employs a Mixture-of-Experts architecture that combines persona adapters with a dynamic routing network for contextual trait expression. Experiments show that PersonaFuse substantially outperforms baselines across multiple dimensions of social-emotional intelligence, without sacrificing general reasoning ability or model safety, which remain common limitations of direct prompting and supervised fine-tuning. PersonaFuse also delivers consistent gains in downstream human-centered applications such as mental health counseling and review-based customer service. In human preference evaluations against leading LLMs, including GPT-4o and DeepSeek, PersonaFuse achieves competitive response quality despite its smaller model size. These findings position PersonaFuse as a theoretically grounded, practical approach to building social-emotionally enhanced LLMs, a significant step toward more human-centric AI systems.

Key Takeaways

  1. Advances in LLMs have enabled direct human-LLM communication across a range of situations.
  2. LLMs remain limited in emotional perception and social competence.
  3. PersonaFuse addresses these limitations, improving emotional expression and personality adaptation across contexts.
  4. PersonaFuse draws on Trait Activation Theory and the Big Five personality model.
  5. PersonaFuse uses a Mixture-of-Experts architecture with a dynamic routing network for contextual trait expression.
  6. Experiments show PersonaFuse outperforms baselines across multiple dimensions without compromising general reasoning ability or model safety.

Cool Papers

Click here to view paper screenshots

MachineLearningLM: Scaling Many-shot In-context Learning via Continued Pretraining

Authors:Haoyu Dong, Pengkun Zhang, Mingzhe Lu, Yanzhen Shen, Guolin Ke

Large language models (LLMs) possess broad world knowledge and strong general-purpose reasoning ability, yet they struggle to learn from many in-context examples on standard machine learning (ML) tasks, that is, to leverage many-shot demonstrations purely via in-context learning (ICL) without gradient descent. We introduce MachineLearningLM, a portable continued-pretraining framework that equips a general-purpose LLM with robust in-context ML capability while preserving its general knowledge and reasoning for broader chat workflows. Our pretraining procedure synthesizes ML tasks from millions of structural causal models (SCMs), spanning shot counts up to 1,024. We begin with a random-forest teacher, distilling tree-based decision strategies into the LLM to strengthen robustness in numerical modeling. All tasks are serialized with a token-efficient prompt, enabling 3x to 6x more examples per context window and delivering up to 50x amortized throughput via batch inference. Despite a modest setup (Qwen-2.5-7B-Instruct with LoRA rank 8), MachineLearningLM outperforms strong LLM baselines (e.g., GPT-5-mini) by an average of about 15% on out-of-distribution tabular classification across finance, physics, biology, and healthcare domains. It exhibits a striking many-shot scaling law: accuracy increases monotonically as in-context demonstrations grow from 8 to 1,024. Without any task-specific training, it attains random-forest-level accuracy across hundreds of shots. General chat capabilities, including knowledge and reasoning, are preserved: it achieves 75.4% on MMLU.
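
The 3x to 6x density gain comes from a token-efficient serialization of tabular shots. The abstract does not give the format; a minimal sketch of one obvious scheme, emitting the header once and then bare comma-separated rows:

```python
def serialize_table(features: list, labels: list, query: list, header: list) -> str:
    """Compact many-shot prompt: header once, then bare comma-separated rows.
    The format is an assumption, not the paper's actual serialization."""
    lines = [",".join(header) + ",label"]
    lines += [",".join(map(str, x)) + f",{y}" for x, y in zip(features, labels)]
    lines.append(",".join(map(str, query)) + ",?")  # the row to classify
    return "\n".join(lines)

prompt = serialize_table(
    features=[[5.1, 3.5], [6.2, 2.9], [4.9, 3.1]],
    labels=[0, 1, 0],
    query=[5.8, 3.0],
    header=["sepal_len", "sepal_wid"],
)
print(prompt)
```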

Paper & Project Links

PDF

Summary
Large language models possess broad world knowledge and strong general-purpose reasoning, yet they struggle to learn from many in-context examples on standard machine learning tasks. To address this, the paper proposes MachineLearningLM, a portable continued-pretraining framework that equips a general-purpose LLM with robust in-context ML capability while preserving its general knowledge and reasoning for broader chat workflows. The pretraining procedure synthesizes ML tasks from millions of structural causal models, spanning shot counts up to 1,024. Starting from a random-forest teacher, tree-based decision strategies are distilled into the LLM to strengthen robustness in numerical modeling. All tasks are serialized with a token-efficient prompt, fitting 3x to 6x more examples per context window and delivering up to 50x amortized throughput via batch inference. Despite a modest setup (Qwen-2.5-7B-Instruct with LoRA rank 8), MachineLearningLM outperforms strong LLM baselines such as GPT-5-mini by an average of about 15% on out-of-distribution tabular classification across finance, physics, biology, and healthcare. It shows a striking many-shot scaling law: accuracy increases monotonically as in-context demonstrations grow from 8 to 1,024, reaching random-forest-level accuracy without any task-specific training, while preserving general chat capabilities, including knowledge and reasoning, with 75.4% on MMLU.

Key Takeaways

  1. Large language models have broad knowledge and reasoning ability but limited ability to learn from in-context examples on standard machine learning tasks.
  2. MachineLearningLM is a continued-pretraining framework that strengthens an LLM's in-context machine learning capability.
  3. The pretraining procedure synthesizes tasks from structural causal models across a wide range of shot counts, improving robustness.
  4. A random-forest teacher guides the model, strengthening robustness in numerical modeling.
  5. Token-efficient prompt serialization fits more examples per context window and improves inference efficiency.
  6. MachineLearningLM outperforms strong baselines on classification tasks across multiple domains.

Cool Papers

Click here to view paper screenshots

Robix: A Unified Model for Robot Interaction, Reasoning and Planning

Authors:Huang Fang, Mengxi Zhang, Heng Dong, Wei Li, Zixuan Wang, Qifeng Zhang, Xueyun Tian, Yucheng Hu, Hang Li

We introduce Robix, a unified model that integrates robot reasoning, task planning, and natural language interaction within a single vision-language architecture. Acting as the high-level cognitive layer in a hierarchical robot system, Robix dynamically generates atomic commands for the low-level controller and verbal responses for human interaction, enabling robots to follow complex instructions, plan long-horizon tasks, and interact naturally with humans within an end-to-end framework. Robix further introduces novel capabilities such as proactive dialogue, real-time interruption handling, and context-aware commonsense reasoning during task execution. At its core, Robix leverages chain-of-thought reasoning and adopts a three-stage training strategy: (1) continued pretraining to enhance foundational embodied reasoning abilities including 3D spatial understanding, visual grounding, and task-centric reasoning; (2) supervised finetuning to model human-robot interaction and task planning as a unified reasoning-action sequence; and (3) reinforcement learning to improve reasoning-action consistency and long-horizon task coherence. Extensive experiments demonstrate that Robix outperforms both open-source and commercial baselines (e.g., GPT-4o and Gemini 2.5 Pro) in interactive task execution, demonstrating strong generalization across diverse instruction types (e.g., open-ended, multi-stage, constrained, invalid, and interrupted) and various user-involved tasks such as table bussing, grocery shopping, and dietary filtering.

Paper & Project Links

PDF Tech report. Project page: https://robix-seed.github.io/robix/

Summary

This paper introduces Robix, a unified model that integrates robot reasoning, task planning, and natural language interaction within a single vision-language architecture. Serving as the high-level cognitive layer of a hierarchical robot system, Robix dynamically generates atomic commands for the low-level controller and verbal responses for human interaction, enabling robots to follow complex instructions, plan long-horizon tasks, and interact naturally with humans in an end-to-end framework. At its core, Robix leverages chain-of-thought reasoning and a three-stage training strategy comprising continued pretraining, supervised fine-tuning, and reinforcement learning. Experiments show that Robix outperforms both open-source and commercial baselines in interactive task execution, generalizing strongly across instruction types, including open-ended, multi-stage, constrained, invalid, and interrupted instructions, as well as user-involved tasks such as table bussing, grocery shopping, and dietary filtering.

Key Takeaways

  1. Robix is a unified model integrating robot reasoning, task planning, and natural language interaction.
  2. Robix serves as the high-level cognitive layer of a hierarchical robot system, generating atomic commands and verbal responses.
  3. Robix can execute complex instructions, plan long-horizon tasks, and interact naturally with humans.
  4. Robix uses chain-of-thought reasoning and a three-stage training strategy: continued pretraining, supervised fine-tuning, and reinforcement learning.
  5. Robix adds novel capabilities such as proactive dialogue, real-time interruption handling, and context-aware commonsense reasoning.
  6. Robix shows strong interactive task execution, outperforming open-source and commercial baselines.

Cool Papers

Click here to view paper screenshots

CARFT: Boosting LLM Reasoning via Contrastive Learning with Annotated Chain-of-Thought-based Reinforced Fine-Tuning

Authors:Wenqiao Zhu, Ji Liu, Rongjuncheng Zhang, Haipang Wu, Yulun Zhang

Reasoning capability plays a critical role in the broad application of Large Language Models (LLMs). To enhance the reasoning performance of LLMs, diverse Reinforcement Learning (RL)-based fine-tuning approaches have been proposed to address the limited generalization of LLMs trained solely via Supervised Fine-Tuning (SFT). Despite their effectiveness, two major limitations hinder progress. First, vanilla RL-based approaches ignore annotated Chain-of-Thought (CoT) and incorporate unstable reasoning-path sampling, which typically results in model collapse, unstable training, and suboptimal performance. Second, existing SFT approaches generally overemphasize the annotated CoT, potentially degrading performance through insufficient exploitation of potential CoTs. In this paper, we propose CARFT, a Contrastive learning approach with annotated CoT-based Reinforced Fine-Tuning, to enhance the reasoning performance of LLMs while addressing the aforementioned limitations. Specifically, we propose learning a representation for each CoT. Based on this representation, we design novel contrastive signals to guide the fine-tuning process. Our approach not only fully exploits the available annotated CoT but also stabilizes the fine-tuning procedure by incorporating an additional unsupervised learning signal. We conduct comprehensive experiments and in-depth analysis with three baseline approaches, two foundation models, and two datasets to demonstrate significant advantages of CARFT in terms of robustness, performance (up to 10.15%), and efficiency (up to 30.62%). Code is available at https://github.com/WNQzhu/CARFT.
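
The abstract describes learning a representation per CoT and deriving contrastive signals from it, without giving the loss. One plausible instantiation is an InfoNCE-style objective that treats the annotated CoT as the positive for its own question; a minimal hedged sketch:

```python
import torch
import torch.nn.functional as F

def cot_contrastive_loss(sampled: torch.Tensor, annotated: torch.Tensor,
                         temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE over CoT embeddings (assumed form, not the paper's exact loss):
    each sampled CoT is pulled toward the annotated CoT of its own question
    and pushed away from other questions' annotated CoTs.

    sampled, annotated: [batch, dim] embeddings, one question per row.
    """
    s = F.normalize(sampled, dim=-1)
    a = F.normalize(annotated, dim=-1)
    logits = s @ a.T / temperature     # [batch, batch] cosine similarities
    targets = torch.arange(s.size(0))  # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

loss = cot_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```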

Paper & Project Links

PDF 14 pages, to appear in EMNLP25

Summary

The reasoning capability of large language models matters across application scenarios. To improve it, researchers have proposed RL-based fine-tuning methods that address the limited generalization of models trained only with supervised fine-tuning. Existing methods have two major limitations: vanilla RL ignores the annotated chain-of-thought and suffers from unstable reasoning-path sampling, causing model collapse and unstable training, while existing SFT methods over-rely on the annotated chain-of-thought and fail to exploit potential CoTs. This paper proposes CARFT, a contrastive learning approach with annotated-CoT-based reinforced fine-tuning, which both fully exploits the annotated CoT and stabilizes fine-tuning by adding an unsupervised learning signal. Experiments and in-depth analysis show that CARFT offers significant advantages in robustness, performance (up to 10.15%), and efficiency (up to 30.62%).

Key Takeaways

  1. Reasoning capability plays a key role in the broad application of large language models (LLMs).
  2. To improve LLM reasoning, researchers have introduced reinforcement learning (RL)-based fine-tuning methods.
  3. Vanilla RL methods ignore the annotated chain-of-thought (CoT) and use unstable reasoning-path sampling, leading to model collapse and unstable training.
  4. Existing supervised fine-tuning methods over-rely on the annotated CoT and underexploit potential CoTs.
  5. CARFT, a contrastive RL fine-tuning approach built on annotated CoTs, addresses these problems and improves LLM reasoning performance.
  6. By combining contrastive learning with annotated CoTs, CARFT fully exploits the annotations and stabilizes training with an unsupervised signal.
  7. Experiments show significant advantages in performance, robustness, and efficiency, with gains of up to 10.15% and 30.62% respectively.

Cool Papers

Click here to view paper screenshots

Klear-CodeTest: Scalable Test Case Generation for Code Reinforcement Learning

Authors:Jia Fu, Xinyu Yang, Hongzhi Zhang, Yahui Liu, Jingyuan Zhang, Qi Wang, Fuzheng Zhang, Guorui Zhou

Precise, correct feedback is crucial for effectively training large language models (LLMs) in code reinforcement learning. However, synthesizing high-quality test cases remains a profoundly challenging and unsolved problem. In this work, we present Klear-CodeTest, a comprehensive test case synthesis framework featuring rigorous verification to ensure quality and reliability of test cases. Our approach achieves broad coverage of programming problems via a novel Generator-Validation (G-V) framework, ensuring correctness through a consistency validation mechanism that verifies outputs against gold solutions. The proposed G-V framework generates comprehensive test cases including both regular and corner cases, enhancing test coverage and discriminative power for solution correctness assessment in code reinforcement learning. In addition, we design a multi-layered security sandbox system optimized for online verification platforms, guaranteeing safe and reliable code execution. Through comprehensive experiments, we demonstrate the effectiveness of our curated dataset, showing significant improvements in model performance and training stability. The source codes, curated dataset and sandbox system are available at: https://github.com/Kwai-Klear/CodeTest.
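
The validation half of the G-V framework amounts to executing candidate test inputs against a gold solution and keeping only the (input, output) pairs that run cleanly. A minimal sketch with bare subprocesses; a real deployment would run inside the paper's multi-layered sandbox instead:

```python
import subprocess

def validate_cases(gold_solution: str, candidate_inputs: list,
                   timeout: float = 2.0) -> list:
    """Run each candidate input through the gold solution; keep (input, output)
    pairs that execute cleanly. NOTE: no sandboxing here, illustration only."""
    validated = []
    for stdin in candidate_inputs:
        try:
            r = subprocess.run(["python3", "-c", gold_solution], input=stdin,
                               capture_output=True, text=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            continue  # reject inputs that hang the gold solution
        if r.returncode == 0:
            validated.append((stdin, r.stdout))
    return validated

gold = "print(sum(int(x) for x in input().split()))"
print(validate_cases(gold, ["1 2 3\n", "10 20\n", "not numbers\n"]))
```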

Paper & Project Links

PDF 21 pages, 11 figures

Summary

The paper presents Klear-CodeTest, a comprehensive test case synthesis framework for code reinforcement learning with large language models (LLMs), featuring rigorous verification to ensure test case quality and reliability. It uses a novel Generator-Validation (G-V) framework that ensures correctness through a consistency validation mechanism that checks outputs against gold solutions. The G-V framework generates comprehensive test cases, including regular and corner cases, improving coverage and the discriminative power of solution-correctness assessment in code reinforcement learning. In addition, a multi-layered security sandbox system optimized for online verification platforms guarantees safe and reliable code execution. Experiments show that the curated dataset significantly improves model performance and training stability.

Key Takeaways

  1. Klear-CodeTest is a test case synthesis framework for code reinforcement learning that addresses the challenge of synthesizing high-quality test cases.
  2. It introduces a novel Generator-Validation (G-V) framework that ensures correctness through consistency validation.
  3. The G-V framework generates comprehensive test cases, including regular and corner cases, improving the assessment of solution correctness.
  4. A multi-layered security sandbox system optimized for online verification platforms ensures safe and reliable code execution.
  5. Rigorous verification guarantees the quality and reliability of the test cases.
  6. Experiments show that the Klear-CodeTest framework significantly improves model performance and training stability.

Cool Papers

Click here to view paper screenshots

Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning

Authors:Haoji Zhang, Xin Gu, Jiawen Li, Chixiang Ma, Sule Bai, Chubin Zhang, Bowen Zhang, Zhichao Zhou, Dongliang He, Yansong Tang

The video reasoning ability of multimodal large language models (MLLMs) is crucial for downstream tasks like video question answering and temporal grounding. While recent approaches have explored text-based chain-of-thought (CoT) reasoning for MLLMs, these methods often suffer from limited cross-modal interaction and increased hallucination, especially with longer videos or reasoning chains. To address these challenges, we propose Video Intelligence via Tool-Augmented Learning (VITAL), a novel end-to-end agentic video reasoning framework. With a visual toolbox, the model can densely sample new video frames on demand and generate multimodal CoT for precise long video reasoning. We observe that temporal grounding and question answering are mutually beneficial for video understanding tasks. Therefore, we construct two high-quality multi-task video reasoning datasets MTVR-CoT-72k for supervised fine-tuning and MTVR-RL-110k for reinforcement learning. Moreover, we propose a Difficulty-aware Group Relative Policy Optimization algorithm (DGRPO) to mitigate difficulty imbalance in multi-task reinforcement learning. Extensive experiments on 11 challenging video understanding benchmarks demonstrate the advanced reasoning ability of VITAL, outperforming existing methods in video question answering and temporal grounding tasks, especially in long video scenarios. Code is available at https://zhang9302002.github.io/thinkingwithvideos-page/.
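
DGRPO's difficulty-aware reweighting is not specified in the abstract; one plausible form scales GRPO's group-normalized advantages by a weight derived from each prompt group's success rate, damping prompts that are already trivial or hopeless. A minimal sketch under that assumption:

```python
import torch

def dgrpo_advantages(rewards: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """rewards: [groups, samples], rollouts of the same prompt per group.

    Standard GRPO: advantage = (r - group mean) / group std.
    Difficulty-aware twist (assumed form): weight each group by
    (4 * p * (1 - p)) ** alpha, where p is the group's success rate, so
    trivially-solved or hopeless prompts contribute little gradient.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True).clamp_min(1e-6)
    adv = (rewards - mean) / std
    p = (rewards > 0).float().mean(dim=1, keepdim=True)  # per-group success rate
    weight = (4 * p * (1 - p)).pow(alpha)
    return weight * adv

rewards = torch.tensor([[1., 1., 1., 0.],   # easy prompt
                        [0., 0., 1., 0.]])  # hard prompt
print(dgrpo_advantages(rewards))
```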

Paper & Project Links

PDF

Summary
The video reasoning ability of multimodal large language models is crucial for downstream tasks such as video question answering and temporal grounding. Existing text-based chain-of-thought approaches suffer from limited cross-modal interaction and increased hallucination as videos and reasoning chains grow longer. To address this, the paper proposes VITAL (Video Intelligence via Tool-Augmented Learning), a novel end-to-end agentic video reasoning framework. With a visual toolbox, the model can densely sample new video frames on demand and generate multimodal chains of thought for precise long-video reasoning. Two high-quality multi-task video reasoning datasets are constructed for supervised fine-tuning and reinforcement learning, and a Difficulty-aware Group Relative Policy Optimization (DGRPO) algorithm mitigates difficulty imbalance in multi-task reinforcement learning. Across multiple video understanding benchmarks, VITAL outperforms existing methods on video question answering and temporal grounding, especially in long-video scenarios.

Key Takeaways

  1. The video reasoning ability of multimodal large language models is essential for downstream tasks such as video question answering and temporal grounding.
  2. Existing methods suffer from limited cross-modal interaction and hallucination when handling long videos or long reasoning chains.
  3. The proposed VITAL framework uses a visual toolbox to densely sample video frames and generate multimodal chains of thought for long-video reasoning.
  4. Two multi-task video reasoning datasets, MTVR-CoT-72k and MTVR-RL-110k, are built for supervised fine-tuning and reinforcement learning.
  5. The DGRPO algorithm addresses difficulty imbalance in multi-task reinforcement learning.
  6. VITAL excels on multiple video understanding benchmarks, especially on video question answering and temporal grounding in long-video scenarios.

Cool Papers

Click here to view paper screenshots

LoRA-PAR: A Flexible Dual-System LoRA Partitioning Approach to Efficient LLM Fine-Tuning

Authors:Yining Huang, Bin Li, Keke Tang, Meilian Chen

Large-scale generative models like DeepSeek-R1 and OpenAI-O1 benefit substantially from chain-of-thought (CoT) reasoning, yet pushing their performance typically requires vast data, large model sizes, and full-parameter fine-tuning. While parameter-efficient fine-tuning (PEFT) helps reduce cost, most existing approaches primarily address domain adaptation or layer-wise allocation rather than explicitly tailoring data and parameters to different response demands. Inspired by "Thinking, Fast and Slow," which characterizes two distinct modes of thought, System 1 (fast, intuitive, often automatic) and System 2 (slower, more deliberative and analytic), we draw an analogy that different "subregions" of an LLM's parameters might similarly specialize for tasks that demand quick, intuitive responses versus those requiring multi-step logical reasoning. Therefore, we propose LoRA-PAR, a dual-system LoRA framework that partitions both data and parameters by System 1 or System 2 demands, using fewer yet more focused parameters for each task. Specifically, we classify task data via multi-model role-playing and voting, partition parameters based on importance scoring, and then adopt a two-stage fine-tuning strategy: training System 1 tasks with supervised fine-tuning (SFT) to enhance knowledge and intuition, and refining System 2 tasks with reinforcement learning (RL) to reinforce deeper logical deliberation. Extensive experiments show that the two-stage fine-tuning strategy, SFT and RL, lowers active parameter usage while matching or surpassing SOTA PEFT baselines.

Paper & Project Links

PDF 12 pages

Summary
Large generative models such as DeepSeek-R1 and OpenAI-O1 benefit greatly from chain-of-thought (CoT) reasoning, but pushing their performance usually requires vast data, large models, and full-parameter fine-tuning. Parameter-efficient fine-tuning (PEFT) reduces cost, yet existing methods mostly target domain adaptation or layer-wise allocation rather than explicitly tailoring data and parameters to different response demands. Inspired by "Thinking, Fast and Slow", this paper proposes LoRA-PAR, a dual-system LoRA framework that partitions both data and parameters according to System 1 (fast, intuitive) and System 2 (slower, analytic) demands. Task data is classified via multi-model role-playing and voting, parameters are partitioned by importance scoring, and a two-stage fine-tuning strategy is adopted: supervised fine-tuning (SFT) for System 1 tasks to strengthen knowledge and intuition, then reinforcement learning (RL) for System 2 tasks to deepen logical deliberation. Experiments show that this two-stage SFT-then-RL strategy lowers active parameter usage while matching or surpassing state-of-the-art PEFT baselines.

Key Takeaways

  1. Large generative models improve performance through chain-of-thought (CoT) reasoning.
  2. Existing parameter-efficient fine-tuning methods mainly address domain adaptation and layer-wise allocation.
  3. Drawing on the fast-versus-slow thinking theory, the paper proposes the dual-system LoRA framework LoRA-PAR.
  4. LoRA-PAR partitions data and parameters according to System 1 and System 2 demands.
  5. Task data is classified via multi-model role-playing and voting.
  6. Parameters are partitioned based on importance scoring.

Cool Papers

Click here to view paper screenshots

DistFlow: A Fully Distributed RL Framework for Scalable and Efficient LLM Post-Training

Authors:Zhixin Wang, Tianyi Zhou, Liming Liu, Ao Li, Jiarui Hu, Dian Yang, Yinhui Lu, Jinlong Hou, Siyuan Feng, Yuan Cheng, Yuan Qi

Reinforcement learning (RL) has become the pivotal post-training technique for large language models (LLMs). Effectively scaling reinforcement learning is now the key to unlocking advanced reasoning capabilities and ensuring safe, goal-aligned behavior in the most powerful LLMs. Mainstream frameworks usually employ a hybrid-controller architecture in which a single controller dispatches the overall execution logic and manages overall data transfer, while multiple controllers execute distributed computation. For large-scale reinforcement learning, minor load imbalances can introduce significant bottlenecks, ultimately constraining the scalability of the system. To address this limitation, we introduce DistFlow, a novel, fully distributed RL framework designed to break this scaling barrier. We adopt a multi-controller paradigm that dispatches data transfer and execution tasks to all workers, eliminating the centralized node. This allows each worker to operate independently, leading to near-linear scalability up to 1024 GPUs and dramatic efficiency gains. Furthermore, our architecture decouples resource configuration from execution logic, allowing each worker to have a unique execution flow, offering significant flexibility for rapid and cost-effective algorithmic experimentation. Extensive experiments show that DistFlow achieves excellent linear scalability and up to a 7x end-to-end throughput improvement in specific scenarios over state-of-the-art (SOTA) frameworks.

Paper & Project Links

PDF

Summary

Reinforcement learning (RL) has become the pivotal post-training technique for large language models (LLMs). Mainstream frameworks usually adopt a hybrid-controller architecture in which a single controller dispatches the overall execution logic and manages data transfer while multiple controllers execute distributed computation. To eliminate the load imbalances that bottleneck large-scale RL, the authors introduce DistFlow, a fully distributed RL framework designed to break the scaling barrier. By adopting a multi-controller paradigm that dispatches data transfer and execution tasks to all workers, DistFlow removes the centralized node, letting each worker operate independently and scale near-linearly up to 1024 GPUs with dramatic efficiency gains. The architecture also decouples resource configuration from execution logic, giving each worker a unique execution flow and significant flexibility for rapid, cost-effective algorithmic experimentation. Experiments show excellent linear scalability and up to a 7x end-to-end throughput improvement over state-of-the-art frameworks in specific scenarios.

Key Takeaways

  1. Reinforcement learning is the key post-training technique for large language models, crucial for unlocking advanced reasoning and ensuring safe, goal-aligned behavior.
  2. Mainstream frameworks suffer load imbalances in large-scale reinforcement learning, which limits system scalability.
  3. DistFlow is a fully distributed RL framework that removes the centralized node, letting each worker operate independently and scale near-linearly up to 1024 GPUs.
  4. DistFlow's multi-controller paradigm eliminates the centralized bottleneck and markedly improves efficiency.
  5. The DistFlow architecture decouples resource configuration from execution logic, enabling rapid, cost-effective algorithmic experimentation.
  6. Experiments show that DistFlow achieves excellent linear scalability.

Cool Papers

Click here to view paper screenshots

Scaling LLM Planning: NL2FLOW for Parametric Problem Generation and Rigorous Evaluation

Authors:Jungkoo Kang

Robust workflow composition is critical for effective agent performance, yet progress in Large Language Model (LLM) planning and reasoning is hindered by a scarcity of scalable evaluation data. This work introduces NL2Flow, a fully automated pipeline for generating and evaluating workflow planning problems. NL2Flow generates problems parametrically in a structured intermediate representation, translating them into both natural language and formal PDDL. I evaluate several open-source, instruct-tuned LLMs on a dataset of 2,296 low-difficulty problems generated by NL2Flow. Results demonstrate that the best-performing model achieved 86% success in generating valid plans and 69% in generating optimal plans (for solvable problems). Regression analysis shows that the influence of problem characteristics on plan generation is contingent on both model and prompt design. Importantly, translating natural language problems into a structured JSON representation prior to symbolic planning significantly improved success rates, suggesting a benefit from neuro-symbolic integration. As LLM reasoning scales to increasingly complex problems, understanding the shifting bottlenecks and sources of error within these systems will be crucial.
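
The finding that an intermediate structured JSON representation helps symbolic planning is easy to picture: the LLM fills a schema, and a planner consumes it. A minimal sketch with an illustrative schema and a toy forward-chaining planner (NL2Flow's actual representation may differ):

```python
# Hypothetical JSON-style intermediate form for a workflow planning problem:
# the LLM would first fill this schema, then a symbolic planner consumes it.
problem = {
    "goal": "report",
    "available_actions": [
        {"name": "fetch_data", "requires": [], "produces": ["raw_data"]},
        {"name": "clean_data", "requires": ["raw_data"], "produces": ["clean_data"]},
        {"name": "render_report", "requires": ["clean_data"], "produces": ["report"]},
    ],
    "initial_state": [],
}

def plan(problem: dict) -> list:
    """Toy forward-chaining planner over the structured form."""
    have, steps = set(problem["initial_state"]), []
    pending = list(problem["available_actions"])
    while pending:
        ready = [a for a in pending if set(a["requires"]) <= have]
        if not ready:
            raise ValueError("unsolvable")
        act = ready[0]
        steps.append(act["name"])
        have |= set(act["produces"])
        pending.remove(act)
        if problem["goal"] in have:
            break
    return steps

print(plan(problem))  # ['fetch_data', 'clean_data', 'render_report']
```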

Paper & Project Links

PDF 31 pages, 7 figures

Summary

This work studies the workflow-planning performance of LLMs. It introduces NL2Flow, a fully automated pipeline for generating and evaluating workflow planning problems. The evaluation shows that NL2Flow's parametrically generated problems can effectively probe LLM planning ability, and that a structured JSON representation plays an important role in raising plan-generation success rates. The study helps identify the sources of error in LLM reasoning on complex tasks; as LLMs advance, recognizing these shifting bottlenecks will become increasingly important for solving ever more complex problems.

Key Takeaways

  • LLM workflow-planning performance is critical, but the scarcity of scalable evaluation data has limited progress.
  • NL2Flow is an automated pipeline that generates and evaluates workflow planning problems, producing them parametrically and translating them into natural language or formal PDDL.
  • The best-performing model achieved 86% success on valid plans and 69% on optimal plans.
  • Translating problems into a structured JSON representation plays a key role in plan generation, significantly raising success rates and demonstrating the benefit of neuro-symbolic integration.

Cool Papers

Click here to view paper screenshots

