⚠️ 以下所有内容总结都来自于 大语言模型的能力,如有错误,仅供参考,谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目 ChatPaperFree 对您有帮助,还请您给我们一些鼓励!⭐️ HuggingFace免费体验
2025-04-24 更新
TTRL: Test-Time Reinforcement Learning
Authors:Yuxin Zuo, Kaiyan Zhang, Shang Qu, Li Sheng, Xuekai Zhu, Biqing Qi, Youbang Sun, Ganqu Cui, Ning Ding, Bowen Zhou
This paper investigates Reinforcement Learning (RL) on data without explicit labels for reasoning tasks in Large Language Models (LLMs). The core challenge of the problem is reward estimation during inference while not having access to ground-truth information. While this setting appears elusive, we find that common practices in Test-Time Scaling (TTS), such as majority voting, yield surprisingly effective rewards suitable for driving RL training. In this work, we introduce Test-Time Reinforcement Learning (TTRL), a novel method for training LLMs using RL on unlabeled data. TTRL enables self-evolution of LLMs by utilizing the priors in the pre-trained models. Our experiments demonstrate that TTRL consistently improves performance across a variety of tasks and models. Notably, TTRL boosts the pass@1 performance of Qwen-2.5-Math-7B by approximately 159% on the AIME 2024 with only unlabeled test data. Furthermore, although TTRL is only supervised by the Maj@N metric, TTRL has demonstrated performance to consistently surpass the upper limit of the initial model, and approach the performance of models trained directly on test data with ground-truth labels. Our experimental findings validate the general effectiveness of TTRL across various tasks, and highlight TTRL’s potential for broader tasks and domains. GitHub: https://github.com/PRIME-RL/TTRL
本文探讨了在大语言模型(LLM)的推理任务中,在无明确标签数据上的强化学习(RL)应用。该问题的核心挑战在于在无法获取真实信息的情况下进行推理时的奖励估算。尽管这个设定看似难以捉摸,但我们发现测试时间缩放(TTS)的常见做法,如多数投票,可以产生令人惊讶的有效奖励,适合驱动RL训练。在这项工作中,我们引入了测试时间强化学习(TTRL),这是一种在无标签数据上训练LLM的新型方法。TTRL利用预训练模型中的先验知识,实现LLM的自我进化。我们的实验表明,TTRL在各种任务和模型上的性能持续提高。值得注意的是,在AIME 2024上,TTRL将Qwen-2.5-Math-7B的pass@1性能提高了约159%,仅使用无标签的测试数据。此外,尽管TTRL仅受到Maj@N指标的监督,但其表现一直超过初始模型的上限,并接近直接在测试数据上使用真实标签训练的模型性能。我们的实验结果表明TTRL在多种任务中的普遍有效性,并突出了其在更广泛的任务和领域中的潜力。GitHub:https://github.com/PRIME-RL/TTRL
论文及项目相关链接
Summary
强化学习(RL)在无标签数据上的大语言模型(LLM)推理任务研究。本论文解决的核心问题是推理过程中奖励估计的难题,即在缺乏真实标签的情况下进行推断。研究发现,测试时间缩放(TTS)的常见做法,如多数投票,能产生有效的奖励,适用于驱动RL训练。本研究介绍了一种新型训练LLM的方法——测试时间强化学习(TTRL),该方法利用预训练模型的先验知识实现自我进化。实验表明,TTRL在各种任务和模型上的表现均有所提升,例如仅用无标签的测试数据就将Qwen-2.5-Math-7B在AIME 2024上的pass@1提升约159%。潜在的应用领域广泛。
Key Takeaways
- 本研究解决了强化学习在无标签数据上的大语言模型推理任务的奖励估计难题。
- 测试时间缩放(TTS)的常见方法,如多数投票,能有效生成奖励信号用于RL训练。
- 研究提出了测试时间强化学习(TTRL)这一新型LLM训练方法。
- TTRL利用预训练模型的先验知识实现自我进化。
- 实验表明,TTRL在各种任务和模型上的表现均有所提升,例如在AIME 2024上将Qwen-2.5-Math-7B的pass@1提升约159%。
- TTRL的潜力不仅限于特定任务,可应用于更广泛的领域。
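针对上文TTRL"用多数投票在无标签测试数据上构造奖励"的核心思想,下面给出一个极简的示意代码(非论文官方实现):对同一问题采样多个回答,以出现次数最多的答案作为伪标签,与其一致的回答得到奖励1,否则为0;其中`extract_answer`只是假设的占位答案抽取函数。

```python
from collections import Counter

def extract_answer(response: str) -> str:
    """假设的答案抽取:这里简单取回答的最后一个词作为最终答案。"""
    return response.strip().split()[-1]

def majority_vote_rewards(responses: list[str]) -> tuple[str, list[float]]:
    """对同一问题的N个采样回答做多数投票,返回伪标签与每个回答的奖励。"""
    answers = [extract_answer(r) for r in responses]
    pseudo_label, _ = Counter(answers).most_common(1)[0]  # 多数答案作为伪标签
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards

if __name__ == "__main__":
    # 假设对同一道数学题采样了5个回答
    samples = [
        "... so the result is 42",
        "... therefore the answer is 42",
        "... I think it is 41",
        "... the final answer is 42",
        "... hence 40",
    ]
    label, rewards = majority_vote_rewards(samples)
    print(label, rewards)  # 42 [1.0, 1.0, 0.0, 1.0, 0.0]
```

得到的标量奖励随后可直接接入GRPO、PPO等任意RL算法进行训练。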
点此查看论文截图



PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models
Authors:Shi Qiu, Shaoyang Guo, Zhuo-Yang Song, Yunbo Sun, Zeyu Cai, Jiashen Wei, Tianyu Luo, Yixuan Yin, Haoxu Zhang, Yi Hu, Chenyang Wang, Chencheng Tang, Haoling Chang, Qi Liu, Ziheng Zhou, Tianyu Zhang, Jingtian Zhang, Zhangyi Liu, Minghao Li, Yuku Zhang, Boxuan Jing, Xianqi Yin, Yutong Ren, Zizhuo Fu, Weike Wang, Xudong Tian, Anqi Lv, Laifu Man, Jianxiang Li, Feiyu Tao, Qihua Sun, Zhou Liang, Yushu Mu, Zhongxuan Li, Jing-Jun Zhang, Shutao Zhang, Xiaotian Li, Xingqi Xia, Jiawei Lin, Zheyu Shen, Jiahang Chen, Qiuhao Xiong, Binran Wang, Fengyuan Wang, Ziyang Ni, Bohan Zhang, Fan Cui, Changkun Shao, Qing-Hong Cao, Ming-xing Luo, Muhan Zhang, Hua Xing Zhu
We introduce PHYBench, a novel, high-quality benchmark designed for evaluating reasoning capabilities of large language models (LLMs) in physical contexts. PHYBench consists of 500 meticulously curated physics problems based on real-world physical scenarios, designed to assess the ability of models to understand and reason about realistic physical processes. Covering mechanics, electromagnetism, thermodynamics, optics, modern physics, and advanced physics, the benchmark spans difficulty levels from high school exercises to undergraduate problems and Physics Olympiad challenges. Additionally, we propose the Expression Edit Distance (EED) Score, a novel evaluation metric based on the edit distance between mathematical expressions, which effectively captures differences in model reasoning processes and results beyond traditional binary scoring methods. We evaluate various LLMs on PHYBench and compare their performance with human experts. Our results reveal that even state-of-the-art reasoning models significantly lag behind human experts, highlighting their limitations and the need for improvement in complex physical reasoning scenarios. Our benchmark results and dataset are publicly available at https://phybench-official.github.io/phybench-demo/.
我们介绍了PHYBench,这是一个新型的高质量基准测试,旨在评估大型语言模型在物理环境中的推理能力。PHYBench包含500个精心策划的物理问题,基于现实世界的物理场景设计,旨在评估模型理解和推理现实物理过程的能力。这些问题涵盖了力学、电磁学、热力学、光学、现代物理学和高级物理学,难度从高中练习到大学问题和物理奥林匹克挑战不等。此外,我们提出了表达式编辑距离(EED)分数,这是一种基于数学表达式之间编辑距离的新型评估指标,可以有效地捕获模型推理过程和结果之间的差异,而传统的二元评分方法则无法做到。我们在PHYBench上评估了各种大型语言模型,并与人类专家的表现进行了比较。我们的结果表明,即使是最新的人工智能推理模型也远远落后于人类专家,这突显了它们在复杂物理推理场景中的局限性以及改进的必要性。我们的基准测试结果和数据集可在https://phybench-official.github.io/phybench-demo/公开访问。
论文及项目相关链接
PDF 21 pages, 8 figures, 4 tables
Summary
PHYBench是一个针对物理场景的大型语言模型(LLM)推理能力评估的新型高质量基准测试。它包含500个精心策划的物理问题,旨在评估模型理解和推理现实物理过程的能力,涵盖从高中习题到大学物理问题和物理奥林匹克竞赛的挑战。此外,还提出了基于数学表达式编辑距离的新型评估指标——表达编辑距离(EED)分数,该方法有效地捕捉了模型推理过程和结果的差异,超越了传统的二元评分方法。评估结果显示,即使是最先进的推理模型,其表现也显著落后于人类专家,这突显了复杂物理推理场景中的局限性和改进的必要性。
Key Takeaways
- PHYBench是一个新型的基准测试,专门用于评估大型语言模型在物理场景中的推理能力。
- 它包含500个基于真实世界物理场景的问题,旨在评估模型理解和推理现实物理过程的能力。
- 涵盖多个物理领域,包括力学、电磁学、热力学、光学、现代物理学和高级物理学。
- 提出了一个新的评估指标——表达编辑距离(EED)分数,用于更准确地评估模型的推理过程。
- 通过PHYBench评估的LLM表现普遍落后于人类专家,这突显了复杂物理推理领域的挑战和模型局限性。
- 这个基准测试和相关的数据集是公开可用的,以便研究人员进行进一步的评估和比较。
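为帮助理解"表达式编辑距离(EED)"这类指标的基本思路,下面给出一个简化示意(并非论文的官方EED实现):用SymPy把表达式按先序遍历展开成节点序列,再用序列编辑距离近似树编辑距离,并归一化到0~1;满分条件与归一化方式均为本示例的假设。

```python
import sympy as sp

def tokens(expr: sp.Expr) -> list[str]:
    """把SymPy表达式先序遍历成节点序列(简化的树表示)。"""
    out = [type(expr).__name__ if expr.args else str(expr)]
    for a in expr.args:
        out.extend(tokens(a))
    return out

def levenshtein(a: list[str], b: list[str]) -> int:
    """经典的序列编辑距离(此处用它近似树编辑距离)。"""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

def eed_score(pred: str, gold: str) -> float:
    """把编辑距离归一化成0~1的分数:数学上等价得满分,差异越大越接近0。"""
    p, g = sp.sympify(pred), sp.sympify(gold)
    if sp.simplify(p - g) == 0:
        return 1.0
    tp, tg = tokens(p), tokens(g)
    return max(0.0, 1.0 - levenshtein(tp, tg) / max(len(tp), len(tg)))

if __name__ == "__main__":
    print(eed_score("m*g*h", "m*g*h"))             # 1.0
    print(eed_score("m*g*h + m*v**2/2", "m*g*h"))  # 小于1:答案结构部分重合
```

与传统的对/错二元打分相比,这类分数能区分"差一点"与"完全跑偏"的推理结果。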
点此查看论文截图




Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation
Authors:Ziqiao Ma, Jing Ding, Xuejun Zhang, Dezhi Luo, Jiahe Ding, Sihan Xu, Yuchen Huang, Run Peng, Joyce Chai
Referring Expression Generation (REG) is a core task for evaluating the pragmatic competence of vision-language systems, requiring not only accurate semantic grounding but also adherence to principles of cooperative communication (Grice, 1975). However, current evaluations of vision-language models (VLMs) often overlook the pragmatic dimension, reducing REG to a region-based captioning task and neglecting Gricean maxims. In this work, we revisit REG from a pragmatic perspective, introducing a new dataset (RefOI) of 1.5k images annotated with both written and spoken referring expressions. Through a systematic evaluation of state-of-the-art VLMs, we identify three key failures of pragmatic competence: (1) failure to uniquely identify the referent, (2) inclusion of excessive or irrelevant information, and (3) misalignment with human pragmatic preference, such as the underuse of minimal spatial cues. We also show that standard automatic evaluations fail to capture these pragmatic violations, reinforcing superficial cues rather than genuine referential success. Our findings call for a renewed focus on pragmatically informed models and evaluation frameworks that align with real human communication.
指代表达式生成(REG)是评估视觉语言系统语用能力的核心任务,这不仅要求准确的语义定位,还要求遵循合作沟通的原则(Grice,1975)。然而,对视觉语言模型(VLM)的当前评估往往忽略了语用维度,将REG简化为基于区域的描述任务,并忽视了格赖斯会话准则(Gricean maxims)。在这项工作中,我们从语用学的角度重新审视REG,介绍了一个包含1500张图像的新数据集(RefOI),这些图像均标注了书面和口语形式的指代表达。通过对最先进的VLM的系统评估,我们确定了三种关键的语用能力缺失:(1)无法唯一地识别参照物,(2)包含过多或无关的信息,以及(3)与人类语用偏好不一致,例如对最简空间线索的使用不足。我们还表明,标准的自动评估无法捕捉到这些语用违规行为,反而会强化表层线索,而不是真正的指代成功。我们的研究结果呼吁人们重新关注以语用为导向的模型和评估框架,使其与真实的人类沟通相一致。
论文及项目相关链接
PDF Homepage: https://vlm-reg.github.io/
Summary
本文关注视觉语言系统的核心任务——指代表达生成(REG),指出当前的评价体系忽视了语用维度和合作沟通原则。作者从语用学角度重新审视REG,并引入新的数据集RefOI,包含1.5k张图像及其书面和口语指代表达标注。通过对现有视觉语言模型的全面评估,发现三大语用能力缺陷:无法唯一识别指代对象、包含过多或无关信息,以及与人类语用偏好不符(如对最简空间线索使用不足)。同时指出标准自动评估难以捕捉这些语用违例,反而强化表层线索,因此需要重新关注融入语用考量的模型和评估框架。
Key Takeaways
- 当前对视觉语言模型(VLM)的评价体系忽视了语用维度和合作沟通原则。
- 指代表达生成(REG)是评估视觉语言系统语用能力的重要任务之一。
- 介绍了新的数据集RefOI,用于从语用学视角研究REG任务。
- 现有视觉语言模型(VLM)在REG任务中存在三大问题:无法唯一识别指代对象、包含过多或无关信息,以及与人类语用偏好不符。
- 标准自动评估方法难以捕捉模型在语用方面的缺陷。
- 模型和评估框架需要重新纳入语用考量,以更贴近真实的人类沟通。
点此查看论文截图





Benchmarking LLM for Code Smells Detection: OpenAI GPT-4.0 vs DeepSeek-V3
Authors:Ahmed R. Sadik, Siddhata Govind
Determining the most effective Large Language Model for code smell detection presents a complex challenge. This study introduces a structured methodology and evaluation matrix to tackle this issue, leveraging a curated dataset of code samples consistently annotated with known smells. The dataset spans four prominent programming languages: Java, Python, JavaScript, and C++, allowing for cross-language comparison. We benchmark two state-of-the-art LLMs, OpenAI GPT 4.0 and DeepSeek-V3, using precision, recall, and F1 score as evaluation metrics. Our analysis covers three levels of detail: overall performance, category-level performance, and individual code smell type performance. Additionally, we explore cost-effectiveness by comparing the token-based detection approach of GPT 4.0 with the pattern-matching techniques employed by DeepSeek-V3. The study also includes a cost analysis relative to traditional static analysis tools such as SonarQube. The findings offer valuable guidance for practitioners in selecting an efficient, cost-effective solution for automated code smell detection.
确定用于代码异味检测的最有效的大型语言模型是一个复杂的挑战。本研究引入了一种结构化的方法和评估矩阵来解决这个问题,利用一组经过筛选的代码样本数据集,这些代码样本均以已知的代码异味进行一致标注。数据集涵盖了四种流行的编程语言,包括Java、Python、JavaScript和C++,可实现跨语言比较。我们使用精确度、召回率和F1分数作为评价指标,对两种最新的大型语言模型OpenAI GPT 4.0和DeepSeek-V3进行基准测试。我们的分析涵盖了三个层次的细节:总体性能、类别级别的性能和单个代码异味类型的性能。此外,我们还通过比较GPT 4.0的基于令牌的检测方法与DeepSeek V3所采用的模式匹配技术,探讨了成本效益。该研究还包括相对于传统静态分析工具(如SonarQube)的成本分析。研究结果为从业者选择高效、经济的自动化代码异味检测解决方案提供了宝贵的指导。
论文及项目相关链接
Summary
这篇论文研究了大语言模型在代码异味检测中的有效性,引入了一种结构化的方法和评估矩阵来解决这个问题。论文使用了一种经过整理的代码样本数据集,这些数据集已用已知的异味进行标注,涵盖了四种流行的编程语言:Java、Python、JavaScript和C++。研究分析了两款先进的LLM模型——OpenAI GPT 4.0和DeepSeek-V3的性能,采用精确度、召回率和F1分数作为评价指标。研究还涵盖了成本效益分析,将GPT 4.0的基于令牌检测方法与DeepSeek V3的模式匹配技术进行比较,并将结果与传统的静态分析工具如SonarQube进行了对比。本研究的发现可为从业人员在选择高效、经济的自动化代码异味检测解决方案时提供宝贵指导。
Key Takeaways
- 研究提出了一个结构化的方法和评估矩阵来确定最有效的语言模型进行代码异味检测。
- 使用的数据集包含四种主要编程语言的代码样本,实现了跨语言的比较。
- GPT 4.0和DeepSeek-V3两款LLM模型被选为基准测试,采用精确度、召回率和F1分数作为评价指标。
- 研究分析了这两款模型的整体性能、分类级别性能和单个代码异味类型的性能。
- 研究比较了GPT 4.0的基于令牌检测方法和DeepSeek V3的模式匹配技术,并进行了成本效益分析。
- 研究还将LLM模型与传统静态分析工具如SonarQube进行了成本分析对比。
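论文在整体、类别和单个异味类型三个层次上用precision、recall和F1进行比较。下面是一个通用的小示意(与论文代码无关,异味名称仅为假设):按异味类型累计TP/FP/FN后分别计算三项指标。

```python
from collections import defaultdict

def per_smell_prf1(gold: list[set[str]], pred: list[set[str]]) -> dict:
    """gold[i]、pred[i]分别是第i个代码样本的真实与预测异味集合。
    按异味类型统计TP/FP/FN,再计算precision/recall/F1。"""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        for s in p & g:
            tp[s] += 1
        for s in p - g:
            fp[s] += 1
        for s in g - p:
            fn[s] += 1
    scores = {}
    for s in set(tp) | set(fp) | set(fn):
        prec = tp[s] / (tp[s] + fp[s]) if tp[s] + fp[s] else 0.0
        rec = tp[s] / (tp[s] + fn[s]) if tp[s] + fn[s] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[s] = (prec, rec, f1)
    return scores

if __name__ == "__main__":
    gold = [{"long_method"}, {"god_class", "long_method"}, set()]
    pred = [{"long_method"}, {"god_class"}, {"feature_envy"}]
    for smell, (p, r, f) in per_smell_prf1(gold, pred).items():
        print(f"{smell}: P={p:.2f} R={r:.2f} F1={f:.2f}")
```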
点此查看论文截图






Navigating the State of Cognitive Flow: Context-Aware AI Interventions for Effective Reasoning Support
Authors:Dinithi Dissanayake, Suranga Nanayakkara
Flow theory describes an optimal cognitive state where individuals experience deep focus and intrinsic motivation when a task’s difficulty aligns with their skill level. In AI-augmented reasoning, interventions that disrupt the state of cognitive flow can hinder rather than enhance decision-making. This paper proposes a context-aware cognitive augmentation framework that adapts interventions based on three key contextual factors: type, timing, and scale. By leveraging multimodal behavioral cues (e.g., gaze behavior, typing hesitation, interaction speed), AI can dynamically adjust cognitive support to maintain or restore flow. We introduce the concept of cognitive flow, an extension of flow theory in AI-augmented reasoning, where interventions are personalized, adaptive, and minimally intrusive. By shifting from static interventions to context-aware augmentation, our approach ensures that AI systems support deep engagement in complex decision-making and reasoning without disrupting cognitive immersion.
流动理论描述了一种最佳认知状态,当任务的难度与个人的技能水平相匹配时,个人会体验到深度集中和内在动机。在人工智能辅助推理中,破坏认知流动状态的干预措施可能会阻碍而非促进决策制定。本文提出了一个基于上下文感知的认知增强框架,该框架根据三个关键上下文因素调整干预措施:类型、时间和规模。通过利用多模式行为线索(如目光行为、打字犹豫、交互速度),人工智能可以动态调整认知支持,以维持或恢复流动状态。我们介绍了认知流动的概念,这是流动理论在人工智能辅助推理中的扩展,其中干预措施具有个性化、自适应和最少干扰的特点。通过从静态干预转向基于上下文感知的增强,我们的方法确保人工智能系统在支持复杂决策和推理的深入参与时,不会破坏认知沉浸。
论文及项目相关链接
PDF Presented at the 2025 ACM Workshop on Human-AI Interaction for Augmented Reasoning, Report Number: CHI25-WS-AUGMENTED-REASONING
Summary
本文介绍了认知流的最佳状态,即个体在任务难度与技能水平相匹配时体验到深度专注和内在动机的状态。在AI辅助推理中,干预措施如打断认知流可能会阻碍而非促进决策制定。本文提出了一个基于上下文感知的认知增强框架,该框架根据类型、时间和规模三个关键上下文因素调整干预措施。借助多模式行为线索(如目光行为、打字犹豫、交互速度),AI可以动态调整认知支持以维持或恢复认知流。我们介绍了认知流的概念,它是AI辅助推理中流理论的扩展,其中干预措施具有个性化、自适应和最少干扰的特点。通过从静态干预转向基于上下文的增强,我们的方法确保AI系统在支持复杂决策和推理的深度参与时不会破坏认知沉浸。
Key Takeaways
- 流理论描述了一种最佳认知状态,即个体在任务难度与技能水平相匹配时体验深度专注和内在动机。
- 在AI辅助推理中,打破认知流状态的干预可能会阻碍决策制定。
- 提出了一个基于上下文感知的认知增强框架,该框架根据类型、时间和规模三个关键上下文因素调整干预。
- AI可以利用多模式行为线索来动态调整认知支持,以维持或恢复认知流。
- 认知流是AI辅助推理中流理论的扩展,强调干预措施的个性化、自适应和最少干扰特点。
- 静态干预转向基于上下文的增强,确保AI在支持复杂决策和推理时不会破坏认知沉浸。
点此查看论文截图


StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation
Authors:Yinmin Zhong, Zili Zhang, Xiaoniu Song, Hanpeng Hu, Chao Jin, Bingyang Wu, Nuo Chen, Yukun Chen, Yu Zhou, Changyi Wan, Hongyu Zhou, Yimin Jiang, Yibo Zhu, Daxin Jiang
Reinforcement learning (RL) has become the core post-training technique for large language models (LLMs). RL for LLMs involves two stages: generation and training. The LLM first generates samples online, which are then used to derive rewards for training. The conventional view holds that the colocated architecture, where the two stages share resources via temporal multiplexing, outperforms the disaggregated architecture, in which dedicated resources are assigned to each stage. However, in real-world deployments, we observe that the colocated architecture suffers from resource coupling, where the two stages are constrained to use the same resources. This coupling compromises the scalability and cost-efficiency of colocated RL in large-scale training. In contrast, the disaggregated architecture allows for flexible resource allocation, supports heterogeneous training setups, and facilitates cross-datacenter deployment. StreamRL is designed with disaggregation from first principles and fully unlocks its potential by addressing two types of performance bottlenecks in existing disaggregated RL frameworks: pipeline bubbles, caused by stage dependencies, and skewness bubbles, resulting from long-tail output length distributions. To address pipeline bubbles, StreamRL breaks the traditional stage boundary in synchronous RL algorithms through stream generation and achieves full overlapping in asynchronous RL. To address skewness bubbles, StreamRL employs an output-length ranker model to identify long-tail samples and reduces generation time via skewness-aware dispatching and scheduling. Experiments show that StreamRL improves throughput by up to 2.66x compared to existing state-of-the-art systems, and improves cost-effectiveness by up to 1.33x in a heterogeneous, cross-datacenter setting.
强化学习(RL)已成为大型语言模型(LLM)的核心后训练技术。LLM的RL涉及两个阶段:生成和训练。LLM首先在线生成样本,然后用于推导训练奖励。传统观点认为,共置架构(两个阶段通过时间复用共享资源)优于分散架构(为每个阶段分配专用资源)。但在实际部署中,我们发现共置架构存在资源耦合的问题,两个阶段被约束为使用相同资源。这种耦合降低了共置RL在大规模训练中的可扩展性和成本效益。相比之下,分散架构允许灵活的资源分配,支持异构训练设置,并便于跨数据中心部署。StreamRL从第一性原则出发进行分散设计,并通过解决现有分散RL框架中的两种性能瓶颈来充分发挥其潜力:由阶段依赖性引起的管道气泡和由长尾输出长度分布导致的偏态气泡。为解决管道气泡问题,StreamRL通过流式生成打破了同步RL算法中传统的阶段边界,并在异步RL中实现完全重叠。为解决偏态气泡问题,StreamRL采用输出长度排名模型来识别长尾样本,并通过偏态感知的分派和调度减少生成时间。实验表明,与现有最先进的系统相比,StreamRL的吞吐量提高了高达2.66倍,在异构、跨数据中心设置中,成本效益提高了高达1.33倍。
论文及项目相关链接
Summary
该文探讨了强化学习(RL)在大规模语言模型(LLM)后训练中的核心作用,介绍了RL for LLM的生成和训练两个阶段。文章指出,传统的共置架构在资源分配上存在耦合问题,影响可扩展性和成本效益。相比之下,分散式架构允许灵活的资源分配,并支持跨数据中心部署。StreamRL从基本原则上采用分散式设计,解决了现有分散式RL框架中的两种性能瓶颈问题,包括管道气泡和偏斜气泡。StreamRL通过流生成打破了传统阶段的边界,并采用了输出长度排名模型来识别长尾样本,从而减少生成时间。实验表明,StreamRL在吞吐量方面提高了最多2.66倍,在异构、跨数据中心环境中成本效益提高了最多1.33倍。
Key Takeaways
- 强化学习已成为大规模语言模型的核心后训练技术。
- 传统共置架构在资源分配上存在耦合问题,影响可扩展性和成本效益。
- 分散式架构允许灵活的资源分配,并支持跨数据中心部署。
- StreamRL解决了现有分散式RL框架中的管道气泡和偏斜气泡问题。
- StreamRL通过流生成打破了传统阶段的边界,提高了性能。
- StreamRL采用输出长度排名模型来识别长尾样本,优化生成时间。
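针对"输出长度排名模型 + 偏斜感知分派"的思路,下面是一个极简示意(并非StreamRL的实际调度器):假设排名模型已给出每个样本的预测输出长度,用"最长优先"的贪心策略把样本分配到多个生成worker,让长尾样本尽早开始生成,从而缓解偏斜气泡。

```python
import heapq

def skewness_aware_dispatch(pred_lengths: list[int], num_workers: int) -> list[list[int]]:
    """按预测输出长度从长到短贪心分配(LPT调度),返回每个worker分到的样本下标。"""
    heap = [(0, w) for w in range(num_workers)]   # (当前负载, worker编号)
    heapq.heapify(heap)
    assign = [[] for _ in range(num_workers)]
    for i in sorted(range(len(pred_lengths)), key=lambda i: -pred_lengths[i]):
        load, w = heapq.heappop(heap)             # 取当前负载最小的worker
        assign[w].append(i)
        heapq.heappush(heap, (load + pred_lengths[i], w))
    return assign

if __name__ == "__main__":
    lengths = [8000, 300, 250, 220, 200, 180, 150, 120]  # 含一个长尾样本
    print(skewness_aware_dispatch(lengths, num_workers=2))
    # 长尾样本被单独放在一个worker上,其余短样本打包给另一个worker
```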
点此查看论文截图





Towards Test Generation from Task Description for Mobile Testing with Multi-modal Reasoning
Authors:Hieu Huynh, Hai Phung, Hao Pham, Tien N. Nguyen, Vu Nguyen
In Android GUI testing, generating an action sequence for a task that can be replayed as a test script is common. Generating sequences of actions and respective test scripts from task goals described in natural language can eliminate the need for manually writing test scripts. However, existing approaches based on large language models (LLM) often struggle with identifying the final action, and either end prematurely or continue past the final screen. In this paper, we introduce VisiDroid, a multi-modal, LLM-based, multi-agent framework that iteratively determines the next action and leverages visual images of screens to detect the task’s completeness. The multi-modal approach enhances our model in two significant ways. First, this approach enables it to avoid prematurely terminating a task when textual content alone provides misleading indications of task completion. Additionally, visual input helps the tool avoid errors when changes in the GUI do not directly affect functionality toward task completion, such as adjustments to font sizes or colors. Second, the multi-modal approach also ensures the tool not progress beyond the final screen, which might lack explicit textual indicators of task completion but could display a visual element indicating task completion, which is common in GUI apps. Our evaluation shows that VisiDroid achieves an accuracy of 87.3%, outperforming the best baseline relatively by 23.5%. We also demonstrate that our multi-modal framework with images and texts enables the LLM to better determine when a task is completed.
在Android GUI测试中,为任务生成可重播为测试脚本的动作序列是很常见的。从自然语言描述的任务目标中生成动作序列和相应的测试脚本,可以消除手动编写测试脚本的需求。然而,基于大型语言模型(LLM)的现有方法往往难以识别最终动作,并且要么过早结束,要么继续超过最终屏幕。在本文中,我们介绍了VisiDroid,这是一个基于LLM的多模式、多代理框架,它迭代地确定下一个动作并利用屏幕视觉图像来检测任务的完成程度。多模式方法在两个重要方面增强了我们的模型。首先,这种方法使其能够避免仅凭文本内容提供误导任务完成的迹象时过早终止任务。此外,视觉输入有助于工具在GUI更改不直接影响任务完成的功能时避免错误,例如调整字体大小或颜色。其次,多模式方法还确保工具不会超越最终屏幕,这可能缺少明确的文本指示任务完成,但可能会显示表示任务完成的视觉元素,这在GUI应用程序中很常见。我们的评估表明,VisiDroid的准确度达到87.3%,相对于最佳基线提高了23.5%。我们还证明,我们的具有图像和文本的多模式框架使LLM能够更好地确定任务何时完成。
论文及项目相关链接
PDF Under review for a conference
Summary
本文介绍了VisiDroid,一种基于多模态和多代理的框架,用于在Android GUI测试中生成可重播的测试脚本。该框架采用大型语言模型(LLM)迭代确定下一步操作,并利用屏幕视觉图像检测任务的完成情况。其多模态方法增强了模型的准确性:既能在文本信息给出误导性完成信号时避免提前终止任务,也能在GUI变化(如字体大小或颜色调整)不影响任务完成时避免误判。此外,它还能确保工具不会越过最终屏幕:最终屏幕可能缺少明确的任务完成文字提示,但通常会显示表示任务完成的视觉元素。评估显示,VisiDroid的准确度达到87.3%,相对于最佳基线提高了23.5%。
Key Takeaways
1. VisiDroid是一个基于多模态和多代理的框架,用于在Android GUI测试中生成可重播的测试脚本。
2. 该框架采用大型语言模型(LLM)来迭代确定下一步操作,并结合屏幕视觉图像检测任务的完成情况。
3. 多模态方法增强了模型的准确性,避免提前终止任务或错误地判断任务完成情况。
4. 视觉输入有助于工具避免在GUI变化不影响任务完成时产生错误判断,如字体大小或颜色的调整。
5. 该框架能够识别任务的最终动作并确保不会越过最终屏幕,即使缺少文本指示也能通过视觉元素判断任务是否完成。
6. 评估显示VisiDroid的准确度达到87.3%,相对于最佳基线提高了23.5%。这一准确性反映出了其在处理Android GUI测试时的优异表现。
点此查看论文截图





SARI: Structured Audio Reasoning via Curriculum-Guided Reinforcement Learning
Authors:Cheng Wen, Tingwei Guo, Shuaijiang Zhao, Wei Zou, Xiangang Li
Recent work shows that reinforcement learning(RL) can markedly sharpen the reasoning ability of large language models (LLMs) by prompting them to “think before answering.” Yet whether and how these gains transfer to audio-language reasoning remains largely unexplored. We extend the Group-Relative Policy Optimization (GRPO) framework from DeepSeek-R1 to a Large Audio-Language Model (LALM), and construct a 32k sample multiple-choice corpus. Using a two-stage regimen supervised fine-tuning on structured and unstructured chains-of-thought, followed by curriculum-guided GRPO, we systematically compare implicit vs. explicit, and structured vs. free form reasoning under identical architectures. Our structured audio reasoning model, SARI (Structured Audio Reasoning via Curriculum-Guided Reinforcement Learning), achieves a 16.35% improvement in average accuracy over the base model Qwen2-Audio-7B-Instruct. Furthermore, the variant built upon Qwen2.5-Omni reaches state-of-the-art performance of 67.08% on the MMAU test-mini benchmark. Ablation experiments show that on the base model we use: (i) SFT warm-up is important for stable RL training, (ii) structured chains yield more robust generalization than unstructured ones, and (iii) easy-to-hard curricula accelerate convergence and improve final performance. These findings demonstrate that explicit, structured reasoning and curriculum learning substantially enhances audio-language understanding.
最新研究表明,通过提示"先思考再回答",强化学习(RL)可以显著提高大语言模型(LLM)的推理能力。然而,这些收益是否以及如何转移到音频语言推理领域仍鲜有探索。我们将DeepSeek-R1的Group-Relative Policy Optimization(GRPO)框架扩展到一个大型音频语言模型(LALM),并构建了一个包含32k样本的选择题语料库。我们采用两阶段训练方案:先在结构化与非结构化思维链上进行监督微调,再进行课程指导的GRPO,从而在相同架构下系统地比较了隐式与显式、结构化与自由形式的推理。我们的结构化音频推理模型SARI(通过课程指导强化学习的结构化音频推理)在基础模型Qwen2-Audio-7B-Instruct的基础上实现了平均精度16.35%的提升。此外,基于Qwen2.5-Omni的变体在MMAU test-mini基准上达到了最先进的性能,准确率为67.08%。消融实验表明,在我们使用的基础模型上:(i)SFT预热对于稳定的RL训练很重要,(ii)结构化链比非结构化链产生更稳健的泛化性能,(iii)从易到难的课程加速收敛并提高了最终性能。这些发现表明,明确的、结构化的推理和课程学习显著增强了音频语言理解能力。
论文及项目相关链接
Summary
强化学习(RL)能提高大型语言模型(LLM)的推理能力,但其在音频语言推理中的应用尚未得到充分探索。本研究将Group-Relative Policy Optimization(GRPO)框架扩展到大型音频语言模型(LALM),构建了一个包含3.2万个样本的选择题语料库。通过在结构化与非结构化思维链上的监督微调,以及课程指导的GRPO,本研究系统地比较了隐式和显式、结构化与非结构化推理。结果显示,结构化音频推理模型SARI在基础模型Qwen2-Audio-7B-Instruct的基础上平均准确率提高了16.35%。此外,基于Qwen2.5-Omni的版本在MMAU test-mini基准上达到了最先进的性能水平。实验表明:(i)SFT预热对于稳定的RL训练至关重要;(ii)结构化思维链比非结构化思维链更具稳健性;(iii)从易到难的课程能加速收敛并提高最终性能。这些发现表明,显式、结构化的推理和课程学习能显著提高音频语言理解的能力。
Key Takeaways
- 强化学习提高了大型语言模型的推理能力,但在音频语言推理中的应用尚未广泛研究。
- 研究人员将GRPO框架扩展到大型音频语言模型,并构建了一个包含3.2万个样本的选择题语料库。
- 通过精细化调整,结构化思维链的推理模型表现出更高的性能。
- SARI模型在基准模型上的平均准确率提高了16.35%。
- Qwen2.5-Omni版本在MMAU测试小数据集上表现最佳。
- 实验表明,SFT预热对稳定RL训练至关重要。
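GRPO的关键在于"组相对"的优势估计:同一问题采样一组回答,用组内奖励均值作基线、标准差做归一化,因此不需要额外的价值网络。下面是该计算的一个最小PyTorch示意(课程调度与策略更新部分省略,数值仅为示例)。

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards形状为[问题数, 每题采样数G];返回组内标准化后的优势。"""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

if __name__ == "__main__":
    # 两道选择题,每题采样4个回答,答对记1分
    r = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                      [1.0, 1.0, 1.0, 0.0]])
    print(grpo_advantages(r))
```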
点此查看论文截图



Insights from Verification: Training a Verilog Generation LLM with Reinforcement Learning with Testbench Feedback
Authors:Ning Wang, Bingkun Yao, Jie Zhou, Yuchen Hu, Xi Wang, Nan Guan, Zhe Jiang
Large language models (LLMs) have shown strong performance in Verilog generation from natural language description. However, ensuring the functional correctness of the generated code remains a significant challenge. This paper introduces a method that integrates verification insights from testbench into the training of Verilog generation LLMs, aligning the training with the fundamental goal of hardware design: functional correctness. The main obstacle in using LLMs for Verilog code generation is the lack of sufficient functional verification data, particularly testbenches paired with design specifications and code. To address this problem, we introduce an automatic testbench generation pipeline that decomposes the process and uses feedback from the Verilog compiler simulator (VCS) to reduce hallucination and ensure correctness. We then use the testbench to evaluate the generated codes and collect them for further training, where verification insights are introduced. Our method applies reinforcement learning (RL), specifically direct preference optimization (DPO), to align Verilog code generation with functional correctness by training preference pairs based on testbench outcomes. In evaluations on VerilogEval-Machine, VerilogEval-Human, RTLLM v1.1, RTLLM v2, and VerilogEval v2, our approach consistently outperforms state-of-the-art baselines in generating functionally correct Verilog code. We open source all training code, data, and models at https://anonymous.4open.science/r/VeriPrefer-E88B.
大型语言模型(LLMs)在根据自然语言描述生成Verilog代码方面表现出强大的性能。然而,确保生成代码的功能正确性仍然是一个巨大的挑战。本文介绍了一种方法,该方法将测试平台中的验证见解集成到Verilog生成LLM的训练中,使训练与硬件设计的根本目标——功能正确性——保持一致。使用LLM进行Verilog代码生成的主要障碍是缺乏足够的功能验证数据,特别是与设计规范和代码配对的测试平台。为了解决这个问题,我们引入了一个自动测试平台生成流程,该流程分解了过程并使用Verilog编译器模拟器(VCS)的反馈来减少幻觉并确保正确性。然后,我们使用测试平台来评估生成的代码并将其收集用于进一步的训练,其中引入了验证见解。我们的方法应用强化学习(RL),特别是直接偏好优化(DPO),通过基于测试平台结果的偏好对来使Verilog代码生成与功能正确性保持一致。在VerilogEval-Machine、VerilogEval-Human、RTLLM v1.1、RTLLM v2和VerilogEval v2上的评估表明,我们的方法在生成功能正确的Verilog代码方面始终优于最新基线。我们在https://anonymous.4open.science/r/VeriPrefer-E88B公开所有训练代码、数据和模型。
论文及项目相关链接
Summary
大型语言模型在根据自然语言描述生成Verilog代码方面表现出强大的性能,但保证生成代码的功能正确性仍然是一个重大挑战。本文介绍了一种方法,该方法将测试平台中的验证见解集成到Verilog生成大型语言模型的训练中,使训练符合硬件设计的根本目标:功能正确性。我们面临的主要问题是缺乏足够的功能验证数据,特别是与设计规范和代码配套的测试平台。为解决此问题,我们引入了自动测试平台生成流程,该流程分解过程并使用Verilog编译器模拟器(VCS)的反馈来减少幻觉并确保正确性。然后,我们使用测试平台评估生成的代码并收集它们以进行进一步的训练,同时引入验证见解。我们的方法采用强化学习(RL),特别是基于测试平台结果的偏好对(preference pairs)来优化Verilog代码生成,使其符合功能正确性。在VerilogEval-Machine、VerilogEval-Human、RTLLMv1.1、RTLLMv2和VerilogEvalv2上的评估表明,我们的方法在生成功能正确的Verilog代码方面始终优于最新基线。
Key Takeaways
- 大型语言模型在Verilog代码生成中表现出强大性能,但功能正确性是一大挑战。
- 缺乏足够的功能验证数据,特别是与设计规范和代码配套的测试平台。
- 引入自动测试平台生成流程,使用Verilog编译器模拟器的反馈确保代码正确性。
- 采用强化学习中的直接偏好优化(DPO)来优化Verilog代码生成,使其符合功能正确性。
- 方法通过基于测试平台结果的偏好对进行训练。
- 在多个评估中,该方法在生成功能正确的Verilog代码方面优于最新基线。
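下面用一个简化示意说明"以测试平台结果构造偏好对,再用DPO对齐"的思路(并非论文官方代码):把通过测试平台的代码作为偏好样本、未通过的作为非偏好样本两两配对,然后按标准DPO损失计算;示例中的对数似然数值是假设的输入。

```python
import torch
import torch.nn.functional as F

def build_preference_pairs(candidates: list[tuple[str, bool]]) -> list[tuple[str, str]]:
    """candidates: [(代码, 是否通过测试平台), ...],返回(优选, 劣选)代码对。"""
    winners = [c for c, passed in candidates if passed]
    losers = [c for c, passed in candidates if not passed]
    return [(w, l) for w in winners for l in losers]

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1) -> torch.Tensor:
    """标准DPO损失:鼓励策略相对参考模型更偏向通过测试平台的代码。"""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()

if __name__ == "__main__":
    pairs = build_preference_pairs([("module a; ... endmodule", True),
                                    ("module b; ... endmodule", False)])
    print(len(pairs))  # 1对偏好数据
    # 假设已分别算出两段代码在策略模型与参考模型下的对数似然
    loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-11.0]),
                    torch.tensor([-12.5]), torch.tensor([-10.5]))
    print(loss)
```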
点此查看论文截图







WALL-E 2.0: World Alignment by NeuroSymbolic Learning improves World Model-based LLM Agents
Authors:Siyu Zhou, Tianyi Zhou, Yijun Yang, Guodong Long, Deheng Ye, Jing Jiang, Chengqi Zhang
Can we build accurate world models out of large language models (LLMs)? How can world models benefit LLM agents? The gap between the prior knowledge of LLMs and the specified environment’s dynamics usually bottlenecks LLMs’ performance as world models. To bridge the gap, we propose a training-free “world alignment” that learns an environment’s symbolic knowledge complementary to LLMs. The symbolic knowledge covers action rules, knowledge graphs, and scene graphs, which are extracted by LLMs from exploration trajectories and encoded into executable codes to regulate LLM agents’ policies. We further propose an RL-free, model-based agent “WALL-E 2.0” through the model-predictive control (MPC) framework. Unlike classical MPC requiring costly optimization on the fly, we adopt an LLM agent as an efficient look-ahead optimizer of future steps’ actions by interacting with the neurosymbolic world model. While the LLM agent’s strong heuristics make it an efficient planner in MPC, the quality of its planned actions is also secured by the accurate predictions of the aligned world model. They together considerably improve learning efficiency in a new environment. On open-world challenges in Mars (Minecraft like) and ALFWorld (embodied indoor environments), WALL-E 2.0 significantly outperforms existing methods, e.g., surpassing baselines in Mars by 16.1%-51.6% of success rate and by at least 61.7% in score. In ALFWorld, it achieves a new record 98% success rate after only 4 iterations.
我们能否利用大型语言模型(LLM)构建精确的世界模型?世界模型如何对LLM代理产生益处?大型语言模型(LLM)的先验知识与特定环境动态之间的差距通常会限制LLM作为世界模型的表现。为了弥补这一差距,我们提出了一种无需训练的"世界对齐"方法,该方法可以学习与LLM互补的环境符号知识。符号知识包括行动规则、知识图谱和场景图,这些是由LLM从探索轨迹中提取并编码成可执行代码,以调节LLM代理的策略。我们进一步提出了一个基于模型的代理"WALL-E 2.0",该代理基于模型预测控制(MPC)框架,并且无需强化学习。与传统MPC需要即时进行昂贵的在线优化不同,我们让LLM代理通过与神经符号世界模型交互,充当对未来若干步行动的高效前瞻优化器。LLM代理的强大启发式使其成为MPC中的高效规划器,而其计划的行动质量也得到了对齐的世界模型的准确预测保障。它们共同作用,大大提高了新环境下的学习效率。在Mars(类Minecraft的开放世界)和ALFWorld(具身室内环境)的开放世界挑战中,WALL-E 2.0显著优于现有方法,例如在Mars上的成功率超过基线16.1%~51.6%,得分至少提高61.7%。在ALFWorld中,它仅经过4次迭代就达到了98%的新纪录成功率。
论文及项目相关链接
PDF Code is available at https://github.com/elated-sawyer/WALL-E
Summary
本文探讨了如何构建基于大型语言模型(LLM)的世界模型,并解决了LLM与特定环境动态之间的知识差距问题。提出了无需训练的“世界对齐”方法,该方法可以学习环境的符号知识并与LLM互补。通过提取动作规则、知识图和场景图等信息,并将其编码为可执行代码,以调节LLM代理的策略。此外,介绍了无需强化学习(RL)的基于模型预测控制(MPC)框架的“WALL-E 2.0”代理。与需要即时优化的经典MPC不同,我们采用LLM代理作为未来步骤动作的高效前瞻优化器,通过与神经符号世界模型的交互来实现。在火星(Minecraft风格)和ALFWorld(室内环境)的开放世界挑战中,WALL-E 2.0显著优于现有方法,如在火星上的成功率超过基线16.1%~51.6%,得分至少提高61.7%。在ALFWorld中,仅经过4次迭代就达到了98%的成功率,创造了新纪录。
Key Takeaways
- 大型语言模型(LLMs)在构建世界模型时存在与特定环境动态的知识差距问题。
- 提出了无需训练的“世界对齐”方法来解决这一知识差距,学习环境的符号知识与LLMs互补。
- 符号知识包括动作规则、知识图和场景图,由LLMs从探索轨迹中提取并编码为可执行代码,以调节LLM代理的策略。
- 介绍了基于模型预测控制(MPC)框架的“WALL-E 2.0”代理,无需强化学习(RL)。
- LLM代理作为未来步骤动作的高效前瞻优化器,通过与神经符号世界模型的交互来提高学习效率。
- 在开放世界挑战中,如火星(Minecraft风格)和ALFWorld(室内环境),WALL-E 2.0显著优于现有方法,成功率和得分均有明显提高。
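为说明"符号化动作规则 + MPC前瞻"如何配合,下面给出一个玩具示意(环境、规则与候选计划均为假设,和论文代码无关):用学到的前置条件/效果规则在"世界模型"中模拟每条候选计划,选出第一条被预测可行的计划。

```python
# 学到的符号化动作规则(示意):前置条件不满足的动作会被世界模型判为失败
ACTION_RULES = {
    "craft_plank": lambda s: s["wood"] >= 1,
    "craft_table": lambda s: s["plank"] >= 4,
}
EFFECTS = {
    "craft_plank": lambda s: {**s, "wood": s["wood"] - 1, "plank": s["plank"] + 4},
    "craft_table": lambda s: {**s, "plank": s["plank"] - 4, "table": s["table"] + 1},
}

def rollout(state: dict, plan: list[str]):
    """用符号规则模拟整条计划;任一步前置条件不满足即预测失败。"""
    for act in plan:
        if not ACTION_RULES[act](state):
            return False, state
        state = EFFECTS[act](state)
    return True, state

def mpc_select(state: dict, candidate_plans: list[list[str]]):
    """示意性的MPC:在世界模型上逐一前瞻候选计划,返回第一条预测可行的。"""
    for plan in candidate_plans:
        ok, _ = rollout(dict(state), plan)
        if ok:
            return plan
    return None

if __name__ == "__main__":
    state = {"wood": 1, "plank": 0, "table": 0}
    # 假设这些候选计划由LLM规划器提出(此处直接写死)
    plans = [["craft_table"], ["craft_plank", "craft_table"]]
    print(mpc_select(state, plans))  # ['craft_plank', 'craft_table']
```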
点此查看论文截图



Tina: Tiny Reasoning Models via LoRA
Authors:Shangshang Wang, Julian Asilis, Ömer Faruk Akgül, Enes Burak Bilgin, Ollie Liu, Willie Neiswanger
How cost-effectively can strong reasoning abilities be achieved in language models? Driven by this fundamental question, we present Tina, a family of tiny reasoning models achieved with high cost-efficiency. Notably, Tina demonstrates that substantial reasoning performance can be developed using only minimal resources, by applying parameter-efficient updates during reinforcement learning (RL), using low-rank adaptation (LoRA), to an already tiny 1.5B parameter base model. This minimalist approach produces models that achieve reasoning performance which is competitive with, and sometimes surpasses, SOTA RL reasoning models built upon the same base model. Crucially, this is achieved at a tiny fraction of the computational post-training cost employed by existing SOTA models. In fact, the best Tina model achieves a >20% reasoning performance increase and 43.33% Pass@1 accuracy on AIME24, at only $9 USD post-training and evaluation cost (i.e., an estimated 260x cost reduction). Our work reveals the surprising effectiveness of efficient RL reasoning via LoRA. We validate this across multiple open-source reasoning datasets and various ablation settings starting with a single, fixed set of hyperparameters. Furthermore, we hypothesize that this effectiveness and efficiency stem from LoRA rapidly adapting the model to the structural format of reasoning rewarded by RL, while largely preserving the base model’s underlying knowledge. In service of accessibility and open research, we fully open-source all code, training logs, and model weights & checkpoints.
如何以高性价比让语言模型获得强大的推理能力?在这个基本问题的驱动下,我们推出了Tina,这是一款具有高成本效益的小型推理模型系列。值得注意的是,Tina展示了在强化学习(RL)过程中,对一个仅有1.5B参数的小型基础模型应用基于低秩适应(LoRA)的参数高效更新,只需极少的资源就能实现显著的推理性能。这一极简的方法产生的模型,其推理性能与基于同一基础模型的最先进的RL推理模型相竞争,有时甚至更胜一筹。关键的是,这一切都是在现有最先进的模型所使用的一小部分训练后计算成本下实现的。事实上,最好的Tina模型在AIME24上实现了超过20%的推理性能提升和43.33%的Pass@1准确率,而训练后与评估成本合计仅为9美元(即估计的成本降低了约260倍)。我们的工作揭示了通过LoRA实现高效RL推理的惊人效果。我们在多个开源推理数据集和各种消融设置上进行了验证,起始于一组固定的超参数。此外,我们假设这种有效性和效率源于LoRA能够迅速适应RL所奖励的推理的结构化格式,同时保留基础模型的底层知识。为了可访问性和开放研究,我们完全开源所有代码、训练日志和模型权重及检查点。
论文及项目相关链接
Summary
本文介绍了Tina系列微小推理模型的高成本效益实现方式。通过应用参数有效的更新和强化学习(RL)以及低秩适应(LoRA)技术,在仅有极小资源的情况下,Tina展现出强大的推理能力。相较于当前最佳实践(SOTA)模型,Tina不仅实现了竞争性的推理性能,有时甚至能超越它们,同时大幅降低了训练后的计算成本。最佳Tina模型在AIME24上实现了超过20%的推理性能提升和43.33%的Pass@1准确率,而训练后及评估成本仅为9美元。我们的研究揭示了通过LoRA技术实现高效RL推理的惊人效果。我们跨多个开源推理数据集和不同设置进行了验证,并假设这种效果和效率来源于LoRA能快速适应RL奖励的结构化格式,同时保留基础模型的底层知识。我们公开了所有代码、训练日志和模型权重及检查点,以促进研究的开放性和可访问性。
Key Takeaways
- Tina是微小推理模型,以高成本效益实现强大推理能力。
- 通过参数有效的更新和强化学习(RL)以及低秩适应(LoRA)技术,Tina在少量资源下展现出竞争力。
- Tina在某些情况下超越了当前最佳实践(SOTA)模型的推理性能。
- 最佳Tina模型在AIME24上实现了显著的性能提升和准确率。
- LoRA技术被认为是实现高效RL推理的关键。
- LoRA能快速适应RL奖励的结构化格式,同时保留基础模型的底层知识。
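下面给出LoRA低秩适应的一个最小PyTorch示意,直观说明为何Tina这类方法的后训练成本可以很低:冻结原线性层权重,只训练秩为r的两个小矩阵A、B(r、alpha的取值仅为示例,非论文设置)。

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """在冻结的线性层上叠加低秩增量B@A,训练时只更新A、B两个小矩阵。"""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # 冻结原权重
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # 初始为0,训练前不改变输出
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

if __name__ == "__main__":
    layer = LoRALinear(nn.Linear(1024, 1024), r=8)
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    print(trainable)  # 2 * 8 * 1024 = 16384个可训练参数,远小于原层的约105万
```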
点此查看论文截图




IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs
Authors:David Ma, Yuanxing Zhang, Jincheng Ren, Jarvis Guo, Yifan Yao, Zhenlin Wei, Zhenzhu Yang, Zhongyuan Peng, Boyu Feng, Jun Ma, Xiao Gu, Zhoufutu Wen, King Zhu, Yancheng He, Meng Cao, Shiwen Ni, Jiaheng Liu, Wenhao Huang, Ge Zhang, Xiaojie Jin
Existing evaluation frameworks for Multimodal Large Language Models (MLLMs) primarily focus on image reasoning or general video understanding tasks, largely overlooking the significant role of image context in video comprehension. To bridge this gap, we propose IV-Bench, the first comprehensive benchmark for evaluating Image-Grounded Video Perception and Reasoning. IV-Bench consists of 967 videos paired with 2,585 meticulously annotated image-text queries across 13 tasks (7 perception and 6 reasoning tasks) and 5 representative categories. Extensive evaluations of state-of-the-art open-source (e.g., InternVL2.5, Qwen2.5-VL) and closed-source (e.g., GPT-4o, Gemini2-Flash and Gemini2-Pro) MLLMs demonstrate that current models substantially underperform in image-grounded video perception and reasoning, merely achieving at most 28.9% accuracy. Further analysis reveals key factors influencing model performance on IV-Bench, including inference pattern, frame number, and resolution. Additionally, through a simple data synthesis approach, we demonstrate that the challenges of IV-Bench extend beyond merely aligning the data format in the training process. These findings collectively provide valuable insights for future research. Our codes and data are released at https://github.com/multimodal-art-projection/IV-Bench.
现有的多模态大型语言模型(MLLM)评估框架主要关注图像推理或通用视频理解任务,很大程度上忽视了图像上下文在视频理解中的重要作用。为了弥补这一差距,我们提出了IV-Bench,这是第一个用于评估基于图像的视频感知与推理的综合性基准测试。IV-Bench包含967个视频和与之配对的2585个精心标注的图像文本查询,涵盖13个任务(7个感知任务和6个推理任务),以及5个代表性类别。对最先进的开源(例如InternVL2.5、Qwen2.5-VL)和闭源(例如GPT-4o、Gemini2-Flash和Gemini2-Pro)MLLM的广泛评估表明,当前模型在基于图像的视频感知与推理方面表现不佳,最多仅达到28.9%的准确率。进一步的分析揭示了影响IV-Bench上模型性能的关键因素,包括推理模式、帧数以及分辨率。此外,通过简单的数据合成方法,我们证明了IV-Bench的挑战不仅仅在于训练过程中的数据格式对齐。这些发现共同为未来的研究提供了宝贵的见解。我们的代码和数据已发布在https://github.com/multimodal-art-projection/IV-Bench。
论文及项目相关链接
Summary
本文提出了一项名为IV-Bench的新评估基准,旨在评估图文结合的多媒体大型语言模型在图像基础视频感知和推理方面的能力。它包含大量视频和配对图像文本查询,涉及多种任务和类别。对现有模型的评估表明,当前模型在此类任务上的表现严重不足,而一些关键因素如推理模式、帧数、分辨率等影响了模型在IV-Bench上的性能。通过数据合成方法,作者展示了IV-Bench的挑战不仅在于数据格式的对齐,而且具有为未来研究提供有价值见解的潜力。相关代码和数据已公开发布。
Key Takeaways
- IV-Bench是首个针对图像基础视频感知和推理的评估基准。
- 它包含967个视频和2,585个精心标注的图像文本查询。
- IV-Bench涵盖13个任务,分为感知和推理任务两大类,以及五个代表性类别。
- 当前多模态大型语言模型在IV-Bench上的表现不佳,最高准确率仅为28.9%。
- 模型性能受到推理模式、帧数、分辨率等因素的影响。
- 通过数据合成方法,展示了IV-Bench的挑战不仅限于训练过程中的数据格式对齐。
点此查看论文截图


PolicyEvol-Agent: Evolving Policy via Environment Perception and Self-Awareness with Theory of Mind
Authors:Yajie Yu, Yue Feng
Multi-agents has exhibited significant intelligence in real-word simulations with Large language models (LLMs) due to the capabilities of social cognition and knowledge retrieval. However, existing research on agents equipped with effective cognition chains including reasoning, planning, decision-making and reflecting remains limited, especially in the dynamically interactive scenarios. In addition, unlike human, prompt-based responses face challenges in psychological state perception and empirical calibration during uncertain gaming process, which can inevitably lead to cognition bias. In light of above, we introduce PolicyEvol-Agent, a comprehensive LLM-empowered framework characterized by systematically acquiring intentions of others and adaptively optimizing irrational strategies for continual enhancement. Specifically, PolicyEvol-Agent first obtains reflective expertise patterns and then integrates a range of cognitive operations with Theory of Mind alongside internal and external perspectives. Simulation results, outperforming RL-based models and agent-based methods, demonstrate the superiority of PolicyEvol-Agent for final gaming victory. Moreover, the policy evolution mechanism reveals the effectiveness of dynamic guideline adjustments in both automatic and human evaluation.
多智能体在现实世界的模拟中展现出了显著智能,这归功于大型语言模型(LLM)的社会认知和知识检索能力。然而,关于配备有效认知链的智能体的研究仍然有限,这些认知链包括推理、规划、决策和反思,特别是在动态交互场景中。此外,与人类不同,基于提示的回应在不确定的博弈过程中面临心理状态感知和经验校准的挑战,这不可避免地会导致认知偏差。鉴于此,我们引入了PolicyEvol-Agent,这是一个全面的LLM赋能框架,其特点是系统地获取他人的意图并自适应优化非理性策略以实现持续增强。具体来说,PolicyEvol-Agent首先获得反思专家模式,然后将一系列认知操作与心智理论以及内外视角相结合。仿真结果证明PolicyEvol-Agent优于基于RL的模型和基于代理的方法,为最终的游戏胜利带来了优势。此外,策略进化机制在自动评估和人工评估中都体现了动态调整指导原则的有效性。
论文及项目相关链接
Summary
多智能体在大规模语言模型(LLM)的加持下,在模拟场景中展现了社会认知和知识检索的能力。然而,目前对于包含推理、规划、决策和反思的认知链研究仍然有限,尤其在动态交互场景中。为解决此问题,我们提出PolicyEvol-Agent框架,具备获取他人意图和系统优化非理性策略的能力。该框架结合了心智理论、内部和外部视角,通过模拟验证其优越性,超越基于强化学习模型和智能体的方法。此外,策略进化机制显示出动态指导调整的有效性。在不确定的游戏过程中面临着心理状态感知和经验校准的挑战,可能导致认知偏见。PolicyEvol-Agent能够获取反思专家模式,适应性地优化非理性策略以实现持续改进。总体而言,PolicyEvol-Agent框架有助于提升智能体的认知能力和游戏胜利概率。
Key Takeaways
- 多智能体在模拟场景中展现出社会认知和知识检索能力。
- 目前对于智能体的认知链研究有限,特别是在动态交互场景中。
- PolicyEvol-Agent框架具备获取他人意图和系统优化非理性策略的能力。
- 该框架结合了心智理论、内部和外部视角。
- 模拟验证PolicyEvol-Agent框架的优越性,超越基于强化学习模型和智能体的方法。
- 策略进化机制显示动态指导调整的有效性。
点此查看论文截图


LogicTree: Structured Proof Exploration for Coherent and Rigorous Logical Reasoning with Large Language Models
Authors:Kang He, Kaushik Roy
Large language models (LLMs) have achieved remarkable multi-step reasoning capabilities across various domains. However, LLMs still face distinct challenges in complex logical reasoning, as (1) proof-finding requires systematic exploration and the maintenance of logical coherence and (2) searching the right combination of premises at each reasoning step is inherently challenging in tasks with large premise space. To address this, we propose LogicTree, an inference-time modular framework employing algorithm-guided search to automate structured proof exploration and ensure logical coherence. Advancing beyond tree-of-thought (ToT), we incorporate caching mechanism into LogicTree to enable effective utilization of historical knowledge, preventing reasoning stagnation and minimizing redundancy. Furthermore, we address the combinatorial complexity of premise search by decomposing it into a linear process. The refined premise selection restricts subsequent inference to at most one derivation per step, enhancing reasoning granularity and enforcing strict step-by-step reasoning. Additionally, we introduce two LLM-free heuristics for premise prioritization, enabling strategic proof search. Experimental results on five datasets demonstrate that LogicTree optimally scales inference-time computation to achieve higher proof accuracy, surpassing chain-of-thought (CoT) and ToT with average gains of 23.6% and 12.5%, respectively, on GPT-4o. Moreover, within LogicTree, GPT-4o outperforms o3-mini by 7.6% on average.
大型语言模型(LLMs)在各个领域都取得了显著的多步推理能力。然而,LLMs在复杂逻辑推理方面仍面临独特的挑战,因为(1)证明寻找需要系统的探索和维持逻辑连贯性;(2)在具有大量前提空间的任务中,搜索每一步推理的正确前提组合本质上是具有挑战性的。为了解决这一问题,我们提出了LogicTree,这是一个推理时模块化框架,采用算法指导的搜索来自动化结构化证明探索并确保逻辑连贯性。超越思维树(ToT),我们在LogicTree中引入了缓存机制,以有效利用历史知识,防止推理停滞并尽量减少冗余。此外,我们将前提搜索的组合复杂性分解为线性过程来解决。精炼的前提选择将随后的推理限制在每一步最多只有一个推导,提高了推理的粒度并实施了严格的逐步推理。此外,我们引入了两种无需LLM的启发式方法用于前提优先级排序,以实现战略性的证明搜索。在五个数据集上的实验结果表明,LogicTree能够最优地扩展推理时计算以获得更高的证明准确率,在GPT-4o上分别以平均23.6%和12.5%的优势超过思维链(CoT)与ToT。此外,在LogicTree框架内,GPT-4o平均比o3-mini高出7.6%。
论文及项目相关链接
Summary
大型语言模型(LLMs)在多步推理领域展现出显著能力,但在复杂逻辑推理方面仍面临挑战。为此,本文提出了LogicTree框架,采用算法引导搜索,实现结构化证明探索并确保逻辑连贯性。LogicTree融入缓存机制,有效利用历史知识,防止推理停滞并减少冗余。此外,它通过分解前提搜索为线性过程,解决组合复杂性。实验结果表明,LogicTree在推理时间计算上实现优化规模,提高证明准确性,优于链式思维(CoT)和树状思维(ToT),并在GPT-4o上平均提升23.6%。同时,在LogicTree内部,GPT-4o较o3-mini平均高出7.6%。
Key Takeaways
- 大型语言模型(LLMs)在多步推理方面表现出色,但在复杂逻辑推理中仍存挑战。
- LogicTree框架采用算法引导搜索,实现结构化证明探索,确保逻辑连贯性。
- LogicTree融入缓存机制,有效利用历史知识,防止推理停滞。
- 通过分解前提搜索为线性过程,解决组合复杂性。
- LogicTree优化推理时间计算,提高证明准确性。
- LogicTree较链式思维(CoT)和树状思维(ToT)表现更优,在GPT-4o上平均提升23.6%。
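为说明"结构化证明搜索 + 缓存 + 前提优先级启发式 + 每步至多一次推导"的组合思路,下面给出一个与论文无关的极简前向链式证明示意,其中事实、规则与启发式均为假设。

```python
def prove(facts: set[str], rules: list[tuple[frozenset, str]], goal: str, max_steps: int = 100):
    """每步只应用一条规则得到一个新结论(每步至多一次推导);
    derived缓存已得到的事实以避免重复推导;用与目标的词重叠数作为前提排序启发式。"""
    derived = set(facts)
    trace = []
    for _ in range(max_steps):
        if goal in derived:
            return True, trace
        candidates = [(p, c) for p, c in rules
                      if c not in derived and p <= derived]   # 前提已全部满足
        if not candidates:
            return False, trace
        premises, conclusion = max(
            candidates, key=lambda rc: len(set(rc[1].split()) & set(goal.split())))
        derived.add(conclusion)
        trace.append((premises, conclusion))
    return goal in derived, trace

if __name__ == "__main__":
    facts = {"Alan is a dog", "dogs are mammals"}
    rules = [
        (frozenset({"Alan is a dog", "dogs are mammals"}), "Alan is a mammal"),
        (frozenset({"Alan is a mammal"}), "Alan is warm blooded"),
    ]
    ok, steps = prove(facts, rules, goal="Alan is warm blooded")
    print(ok, steps)  # True,以及两步推导轨迹
```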
点此查看论文截图



DoomArena: A framework for Testing AI Agents Against Evolving Security Threats
Authors:Leo Boisvert, Mihir Bansal, Chandra Kiran Reddy Evuru, Gabriel Huang, Abhay Puri, Avinandan Bose, Maryam Fazel, Quentin Cappart, Jason Stanley, Alexandre Lacoste, Alexandre Drouin, Krishnamurthy Dvijotham
We present DoomArena, a security evaluation framework for AI agents. DoomArena is designed on three principles: 1) It is a plug-in framework and integrates easily into realistic agentic frameworks like BrowserGym (for web agents) and $\tau$-bench (for tool calling agents); 2) It is configurable and allows for detailed threat modeling, allowing configuration of specific components of the agentic framework being attackable, and specifying targets for the attacker; and 3) It is modular and decouples the development of attacks from details of the environment in which the agent is deployed, allowing for the same attacks to be applied across multiple environments. We illustrate several advantages of our framework, including the ability to adapt to new threat models and environments easily, the ability to easily combine several previously published attacks to enable comprehensive and fine-grained security testing, and the ability to analyze trade-offs between various vulnerabilities and performance. We apply DoomArena to state-of-the-art (SOTA) web and tool-calling agents and find a number of surprising results: 1) SOTA agents have varying levels of vulnerability to different threat models (malicious user vs malicious environment), and there is no Pareto dominant agent across all threat models; 2) When multiple attacks are applied to an agent, they often combine constructively; 3) Guardrail model-based defenses seem to fail, while defenses based on powerful SOTA LLMs work better. DoomArena is available at https://github.com/ServiceNow/DoomArena.
我们推出DoomArena,这是一个针对AI代理的安全评估框架。DoomArena基于三个原则设计:1)它是一个插件框架,可以轻松地集成到实际的代理框架中,例如BrowserGym(用于网络代理)和τ-bench(用于工具调用代理);2)它是可配置的,允许进行详细威胁建模,允许配置代理框架的特定组件受到攻击,并为攻击者指定目标;3)它是模块化的,将攻击的发展与代理部署的环境细节分开,使得相同的攻击可以应用于多个环境。我们展示了我们框架的几个优点,包括能够轻松适应新的威胁模型和环境、能够轻松组合之前发布的多个攻击以实现全面和精细的安全测试,以及分析各种漏洞和性能之间的权衡。我们将DoomArena应用于最先进的网络和工具调用代理,并发现了一些意想不到的结果:1)最先进的代理对不同威胁模型(恶意用户与恶意环境)的脆弱性程度不同,没有在所有威胁模型中占据绝对优势的代理;2)当多个攻击同时攻击一个代理时,它们往往会结合建设性地进行攻击;3)基于护栏模型的防御似乎失败,而基于先进的大型语言模型的防御效果更好。你可以在https://github.com/ServiceNow/DoomArena获取DoomArena。
论文及项目相关链接
Summary
本文介绍了一项名为DoomArena的AI代理安全评估框架,其基于三个设计原则:插件化、可配置和模块化。该框架可以轻松地集成到真实的代理框架中,允许详细的威胁建模,模块化设计使得攻击的开发独立于代理部署的环境细节。文章还介绍了该框架的几个优势,如易于适应新的威胁模型和多种环境,能够组合多个先前发布的攻击进行细致全面的安全测试,并分析了各种漏洞和性能之间的权衡。通过应用该框架于当前先进的Web和工具调用代理,发现了一些意外结果。对于不同的威胁模型(恶意用户和恶意环境),先进代理的脆弱性程度各不相同,并且不存在对所有威胁模型都表现最好的代理;当对代理发起多重攻击时,它们往往会协同作用;基于Guardrail模型的防御似乎失败,而基于先进的大型语言模型的防御则表现更好。有关详细信息,请访问其GitHub页面。
Key Takeaways
- DoomArena是一个针对AI的安全评估框架,具有插件化、可配置和模块化三大特点。
- 可以轻松集成到真实的代理框架中,允许详细的威胁建模。
- 该框架具有适应新威胁模型和多种环境的能力。
- 能够组合多种攻击进行安全测试,分析不同漏洞和性能之间的权衡。
点此查看论文截图






Think Deep, Think Fast: Investigating Efficiency of Verifier-free Inference-time-scaling Methods
Authors:Junlin Wang, Shang Zhu, Jon Saad-Falcon, Ben Athiwaratkun, Qingyang Wu, Jue Wang, Shuaiwen Leon Song, Ce Zhang, Bhuwan Dhingra, James Zou
There is intense interest in investigating how inference time compute (ITC) (e.g. repeated sampling, refinements, etc) can improve large language model (LLM) capabilities. At the same time, recent breakthroughs in reasoning models, such as Deepseek-R1, unlock the opportunity for reinforcement learning to improve LLM reasoning skills. An in-depth understanding of how ITC interacts with reasoning across different models could provide important guidance on how to further advance the LLM frontier. This work conducts a comprehensive analysis of inference-time scaling methods for both reasoning and non-reasoning models on challenging reasoning tasks. Specifically, we focus our research on verifier-free inference time-scaling methods due to its generalizability without needing a reward model. We construct the Pareto frontier of quality and efficiency. We find that non-reasoning models, even with an extremely high inference budget, still fall substantially behind reasoning models. For reasoning models, majority voting proves to be a robust inference strategy, generally competitive or outperforming other more sophisticated ITC methods like best-of-N and sequential revisions, while the additional inference compute offers minimal improvements. We further perform in-depth analyses of the association of key response features (length and linguistic markers) with response quality, with which we can improve the existing ITC methods. We find that correct responses from reasoning models are typically shorter and have fewer hedging and thinking markers (but more discourse markers) than the incorrect responses.
关于如何运用推理时间计算(ITC)(例如重复采样、精细化等)来提升大型语言模型(LLM)的能力,人们表现出浓厚的兴趣。同时,像Deepseek-R1这样的推理模型的突破,开启了强化学习提升LLM推理能力的可能性。深入了解ITC在不同模型中与推理的交互作用,可以为进一步推动LLM前沿提供重要指导。本研究对推理和非推理模型在具有挑战性的推理任务上的推理时间缩放方法进行了综合分析。我们特别关注无需验证器(verifier)的推理时间缩放方法,因为它具有不需要奖励模型的通用性。我们构建了质量与效率之间的帕累托前沿。我们发现,即使拥有极高的推理预算,非推理模型仍然远远落后于推理模型。对于推理模型来说,多数投票证明是一种稳健的推理策略,通常可与best-of-N、顺序修订等更复杂的ITC方法相媲美甚至更优,而额外的推理计算带来的提升微乎其微。我们还深入分析了关键响应特征(长度和语言标记)与响应质量的关联,以此改进现有的ITC方法。我们发现,与错误响应相比,推理模型的正确响应通常更短,含有更少的模糊(hedging)和思考标记,但含有更多的篇章标记。
论文及项目相关链接
Summary
本文探讨了推理模型和非推理模型在挑战性推理任务上的推理时间计算(ITC)缩放方法。研究重点关注无需奖励模型的通用性强的无验证器推理时间缩放方法。研究发现,非推理模型即使在极高的推理预算下,与推理模型的性能差距仍然显著。对于推理模型,多数投票证明是一种稳健的推理策略,通常与其他更复杂的ITC方法竞争或表现更好,而额外的推理计算只提供微小的改进。此外,还对关键响应特征与响应质量的关联进行了深入分析,以期改进现有的ITC方法。
Key Takeaways
- 推理时间计算(ITC)对提高大型语言模型(LLM)的能力有浓厚兴趣,与模型推理能力有关。
- Deepseek-R1等突破性的推理模型为强化学习提升LLM推理技能创造了机会。
- 在挑战性推理任务上,对推理和非推理模型的推理时间缩放方法进行了深入分析。
- 研究重点是无验证器推理时间缩放方法,因其具有普遍性,无需奖励模型。
- 非推理模型在极高推理预算下仍显著落后于推理模型。
- 多数投票是一种稳健的推理策略,通常可与其他更复杂的ITC方法相媲美甚至更优。
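论文用"质量-效率"帕累托前沿来比较不同推理时扩展方法。下面是一个通用的小示意:给定一组(推理开销, 准确率)测量点,筛出帕累托最优点;示例数据为假设值。

```python
def pareto_frontier(points: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """points为(推理开销, 准确率);保留不被任何点"又便宜又准"支配的点。"""
    frontier, best_acc = [], float("-inf")
    for cost, acc in sorted(points):          # 按开销升序扫描
        if acc > best_acc:                    # 准确率严格提升才保留
            frontier.append((cost, acc))
            best_acc = acc
    return frontier

if __name__ == "__main__":
    # 假设性的测量结果:(相对推理开销, 准确率)
    runs = [(1, 0.42), (4, 0.55), (8, 0.56), (8, 0.50), (16, 0.56), (32, 0.61)]
    print(pareto_frontier(runs))  # [(1, 0.42), (4, 0.55), (8, 0.56), (32, 0.61)]
```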
点此查看论文截图


Open-Medical-R1: How to Choose Data for RLVR Training at Medicine Domain
Authors:Zhongxi Qiu, Zhang Zhang, Yan Hu, Heng Li, Jiang Liu
This paper explores optimal data selection strategies for Reinforcement Learning with Verified Rewards (RLVR) training in the medical domain. While RLVR has shown exceptional potential for enhancing reasoning capabilities in large language models, most prior implementations have focused on mathematics and logical puzzles, with limited exploration of domain-specific applications like medicine. We investigate four distinct data sampling strategies from MedQA-USMLE: random sampling (baseline), and filtering using Phi-4, Gemma-3-27b-it, and Gemma-3-12b-it models. Using Gemma-3-12b-it as our base model and implementing Group Relative Policy Optimization (GRPO), we evaluate performance across multiple benchmarks including MMLU, GSM8K, MMLU-Pro, and CMMLU. Our findings demonstrate that models trained on filtered data generally outperform those trained on randomly selected samples. Notably, training on self-filtered samples (using Gemma-3-12b-it for filtering) achieved superior performance in medical domains but showed reduced robustness across different benchmarks, while filtering with larger models from the same series yielded better overall robustness. These results provide valuable insights into effective data organization strategies for RLVR in specialized domains and highlight the importance of thoughtful data selection in achieving optimal performance. You can access our repository (https://github.com/Qsingle/open-medical-r1) to get the codes.
本文探讨了针对医学领域强化学习验证奖励(RLVR)训练的最优数据选择策略。尽管RLVR在增强大型语言模型的推理能力方面表现出巨大潜力,但大多数先前的研究都集中在数学和逻辑谜题上,对特定领域的应用程序(如医学)的探索有限。我们从MedQA-USMLE调查了四种不同的数据采样策略:随机采样(基线),以及使用Phi-4、Gemma-3-27b-it和Gemma-3-12b-it模型的过滤。我们以Gemma-3-12b-it为基础模型,采用群体相对策略优化(GRPO),在多个基准测试(包括MMLU、GSM8K、MMLU-Pro和CMMLU)上评估性能。我们的研究结果表明,经过过滤的数据训练的模型通常比经过随机选择样本训练的模型表现更好。值得注意的是,使用自我过滤样本(使用Gemma-3-12b-it进行过滤)在医学领域取得了卓越的性能,但在不同基准测试上表现出较低的稳健性,而使用同一系列中更大的模型进行过滤则产生了更好的总体稳健性。这些结果提供了关于RLVR在特定领域进行有效数据组织策略的宝贵见解,并强调了深思熟虑的数据选择在实现最佳性能方面的重要性。您可以访问我们的仓库(https://github.com/Qsingle/open-medical-r1)获取代码。
论文及项目相关链接
PDF 15 figures
Summary
数据选择策略对于强化学习在医疗领域的表现至关重要。本研究探讨了四种不同的数据采样策略,发现经过过滤的数据训练的模型通常优于随机采样数据训练的模型。使用Gemma-3-12b-it模型进行过滤的自过滤样本在医疗领域表现优越,但跨不同基准测试的稳健性较低。来自同一系列的大型模型的过滤可带来更好的整体稳健性。此研究对强化学习在特定领域的数据组织策略提供了宝贵的见解。
Key Takeaways
- 强化学习在医疗领域的数据选择策略对模型性能有重要影响。
- 研究了四种数据采样策略,包括随机采样和三种过滤方法。
- 过滤数据训练的模型通常优于随机采样数据训练的模型。
- 自过滤样本在医疗领域表现优越,但稳健性较低。
- 使用大型模型进行过滤可提高模型的总体稳健性。
- 研究结果对强化学习在医疗领域的有效数据组织策略提供了宝贵见解。
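下面用一个示意流程说明"用筛选模型估计题目难度来选择RLVR训练数据"的思路(具体筛选准则以论文为准,此处仅为假设):对每道题用筛选模型采样k次、估计通过率,并去掉模型已能稳定答对的"过易"样本;`sample_answer`是占位函数,实际应调用Phi-4或Gemma-3系列模型。

```python
import random

def sample_answer(model, question: str) -> str:
    """占位实现:实际应调用筛选用的LLM;这里随机返回一个选项以便示例可运行。"""
    return random.choice(["A", "B", "C", "D"])

def filter_dataset(dataset: list[dict], model=None, k: int = 4,
                   keep_if_pass_rate_below: float = 0.75) -> list[dict]:
    """对每道题采样k次估计通过率,仅保留通过率低于阈值的题目。"""
    kept = []
    for item in dataset:
        answers = [sample_answer(model, item["question"]) for _ in range(k)]
        pass_rate = sum(a == item["answer"] for a in answers) / k
        if pass_rate < keep_if_pass_rate_below:
            kept.append({**item, "est_pass_rate": pass_rate})
    return kept

if __name__ == "__main__":
    random.seed(0)
    data = [{"question": f"示例医学选择题{i}", "answer": "A"} for i in range(5)]
    print(len(filter_dataset(data)))  # 被保留下来用于RLVR训练的题目数
```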
点此查看论文截图





Evaluating Menu OCR and Translation: A Benchmark for Aligning Human and Automated Evaluations in Large Vision-Language Models
Authors:Zhanglin Wu, Tengfei Song, Ning Xie, Weidong Zhang, Mengli Zhu, Shuang Wu, Shiliang Sun, Hao Yang
The rapid advancement of large vision-language models (LVLMs) has significantly propelled applications in document understanding, particularly in optical character recognition (OCR) and multilingual translation. However, current evaluations of LVLMs, like the widely used OCRBench, mainly focus on verifying the correctness of their short-text responses and long-text responses with simple layout, while the evaluation of their ability to understand long texts with complex layout design is highly significant but largely overlooked. In this paper, we propose Menu OCR and Translation Benchmark (MOTBench), a specialized evaluation framework emphasizing the pivotal role of menu translation in cross-cultural communication. MOTBench requires LVLMs to accurately recognize and translate each dish, along with its price and unit items on a menu, providing a comprehensive assessment of their visual understanding and language processing capabilities. Our benchmark is comprised of a collection of Chinese and English menus, characterized by intricate layouts, a variety of fonts, and culturally specific elements across different languages, along with precise human annotations. Experiments show that our automatic evaluation results are highly consistent with professional human evaluation. We evaluate a range of publicly available state-of-the-art LVLMs, and through analyzing their output to identify the strengths and weaknesses in their performance, offering valuable insights to guide future advancements in LVLM development. MOTBench is available at https://github.com/gitwzl/MOTBench.
大型视觉语言模型(LVLMs)的快速发展极大地推动了文档理解应用程序的应用,特别是在光学字符识别(OCR)和多语种翻译领域。然而,当前对LVLMs的评估,如广泛使用的OCRBench,主要侧重于验证其简短文本回应和简单布局长文本回应的正确性,而对于他们理解具有复杂布局设计的长文本的评估至关重要,却被忽视了。在本文中,我们提出了Menu OCR and Translation Benchmark(MOTBench)这一专门的评估框架,强调菜单翻译在跨文化交流中的关键作用。MOTBench要求LVLMs准确识别并翻译菜单上的每一道菜及其价格和单位项,从而全面评估其视觉理解和语言处理能力。我们的基准测试包含中英文菜单的集合,这些菜单具有复杂的布局、多种字体以及不同语言的特定文化元素,同时配备精确的人工注释。实验表明,我们的自动评估结果与专业的人工评估结果高度一致。我们评估了一系列公开的先进LVLMs,通过分析他们的输出来确定其性能的优势和劣势,为未来的LVLM发展提供了宝贵的见解。MOTBench可在https://github.com/gitwzl/MOTBench获取。
论文及项目相关链接
PDF 12 pages, 5 figures, 5 Tables
Summary
文本提出了一种新的评估框架MOTBench,用于评估大型视觉语言模型(LVLMs)在菜单翻译方面的性能。该框架强调菜单翻译在跨文化交流中的重要性,要求LVLMs准确识别并翻译菜单上的菜品、价格及单位,全面评估其视觉理解和语言处理能力。
Key Takeaways
- LVLMs在文档理解应用,如光学字符识别(OCR)和多语种翻译方面取得快速进展。
- 当前对LVLMs的评估主要关注简单文本布局,但对其处理复杂文本布局的能力评估不足。
- 文本提出了一种新的评估框架MOTBench,专注于菜单翻译,强调跨文化交流中的重要性。
- MOTBench要求LVLMs准确识别并翻译菜单上的详细信息,全面评估其视觉理解和语言处理能力。
- MOTBench包含中英文菜单,具有复杂布局、多种字体和跨文化元素,且有人类精确注释。
- 实验显示,自动评估结果与专业人工评估高度一致。
点此查看论文截图






SolarZip: An Efficient and Adaptive Compression Framework for Solar EUV Imaging Data
Authors:Zedong Liu, Song Tan, Alexander Warmuth, Frédéric Schuller, Yun Hong, Wenjing Huang, Yida Gu, Bojing Zhu, Guangming Tan, Dingwen Tao
Context: With the advancement of solar physics research, next-generation solar space missions and ground-based telescopes face significant challenges in efficiently transmitting and/or storing large-scale observational data. Aims: We develop an efficient compression and evaluation framework for solar EUV data, specifically optimized for Solar Orbiter Extreme Ultraviolet Imager (EUI) data, significantly reducing data volume while preserving scientific usability. Methods: We systematically evaluated four error-bounded lossy compressors across two EUI datasets. However, the existing methods cannot perfectly handle the EUI datasets (with continuously changing distance and significant resolution differences). Motivated by this, we develop an adaptive hybrid compression strategy with optimized interpolation predictors. Moreover, we designed a two-stage evaluation framework integrating distortion analysis with downstream scientific workflows, ensuring that observational analysis is not affected at high compression ratios. Results: Our framework SolarZip achieved up to 800x reduction for Full Sun Imager (FSI) data and 500x for High Resolution Imager (HRI$_{\text{EUV}}$) data. It significantly outperformed both traditional and advanced algorithms, achieving 3-50x higher compression ratios than traditional algorithms, surpassing the second-best algorithm by up to 30%. Simulation experiments verified that SolarZip can reduce data transmission time by up to 270x while ensuring the preservation of scientific usability. Conclusions: The SolarZip framework significantly enhances solar observational data compression efficiency while preserving scientific usability by dynamically selecting optimal compression methods based on observational scenarios and user requirements. This provides a promising data management solution for deep space missions like Solar Orbiter.
背景:随着太阳物理学研究的进步,下一代太阳空间任务和地面望远镜在有效地传输和/或存储大规模观测数据方面面临重大挑战。目标:我们为太阳轨道飞行器(Solar Orbiter)极紫外成像仪(EUI)数据开发了一个高效的压缩和评估框架,能在保持科学可用性的同时显著减少数据量。方法:我们系统地评价了四个误差有界的有损压缩器在两套EUI数据集上的应用。然而,现有的方法不能完美处理EUI数据集(具有连续变化的距离和显著的分辨率差异)。受此启发,我们开发了一种带有优化插值预测器的自适应混合压缩策略。此外,我们设计了一个两阶段的评估框架,将失真分析与下游科学工作流程相结合,确保在高压缩比下观测分析不受影响。结果:我们的SolarZip框架对全日成像仪(FSI)数据实现了高达800倍的压缩比,对高分辨率成像仪(HRI$_{EUV}$)数据实现了高达500倍的压缩比。SolarZip显著优于传统算法和先进算法:压缩比较传统算法高3至50倍,并比次优算法最多高出30%。模拟实验证实,SolarZip能在确保科学可用性的同时,减少数据传输时间高达270倍。结论:SolarZip框架通过动态选择最佳压缩方法,显著提高了太阳观测数据压缩效率,同时保持科学可用性。这为Solar Orbiter等深空任务提供了有前景的数据管理解决方案。
论文及项目相关链接
摘要
随着太阳物理学研究的进步,下一代太阳空间任务和地面望远镜在有效传输和/或存储大规模观测数据方面面临巨大挑战。针对这些问题,我们开发了一个针对太阳轨道飞行器(Solar Orbiter)极紫外成像仪(EUI)数据的压缩和评估框架,可在保持科学可用性的同时大幅减少数据量。通过系统评估四种误差有界有损压缩器,我们发现现有方法无法完美处理EUI数据集。因此,我们开发了一种自适应混合压缩策略,采用优化后的插值预测器。此外,我们设计了一个两阶段评估框架,结合失真分析与下游科学工作流程,确保高压缩比下观测分析不受影响。我们的框架SolarZip对全太阳成像仪(FSI)数据实现了高达800倍的压缩,对高分辨率成像仪(HRI)数据实现了高达500倍的压缩。SolarZip的压缩比较传统算法高3-50倍,并比次优算法最多高出30%。模拟实验表明,SolarZip能减少数据传输时间高达270倍,同时确保科学可用性的保持。结论:SolarZip框架通过动态选择最佳压缩方法,根据观测场景和用户要求大大提高太阳观测数据压缩效率并保持了科学可用性。这为Solar Orbiter等深空任务提供了有前途的数据管理解决方案。
关键见解
- SolarZip框架针对太阳轨道飞行器(Solar Orbiter)极紫外成像仪(EUI)数据开发,旨在提高压缩效率并减少数据存储需求。
- SolarZip框架在保持科学可用性的同时实现了显著的数据压缩效果,对全太阳成像仪(FSI)数据实现高达800倍压缩,对高分辨率成像仪(HRI)数据实现高达500倍压缩。
- SolarZip框架显著优于传统和先进算法,压缩比较传统算法高3-50倍,并比次优算法最多高出30%。
- SolarZip框架通过减少数据传输时间来提高数据管理的效率,模拟实验表明其能减少数据传输时间高达270倍。
- SolarZip框架采用自适应混合压缩策略和优化插值预测器来适应不同的观测数据和用户需求。
- SolarZip框架的动态压缩策略对不同的观测场景和用户要求能够保持较高的适用性,为后续的太阳物理研究提供了重要的数据支持。
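为说明"误差有界有损压缩 + 压缩比/失真评估"的基本流程,下面给出一个与SolarZip无关的极简示意:用均匀量化加zlib熵编码实现绝对误差有界的压缩,并验证重建误差不超过给定误差界;真实的SolarZip采用的是带优化插值预测器的自适应混合策略,远比这复杂。

```python
import zlib
import numpy as np

def compress_error_bounded(data: np.ndarray, abs_err: float) -> bytes:
    """示意性的误差有界压缩:按2*abs_err步长均匀量化后用zlib编码。"""
    q = np.round(data / (2 * abs_err)).astype(np.int32)
    return zlib.compress(q.tobytes(), level=9)

def decompress(buf: bytes, shape: tuple, abs_err: float) -> np.ndarray:
    q = np.frombuffer(zlib.decompress(buf), dtype=np.int32).reshape(shape)
    return q.astype(np.float64) * (2 * abs_err)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.normal(size=(512, 512)).cumsum(axis=1)   # 模拟一幅较平滑的图像
    bound = 1e-2
    buf = compress_error_bounded(img, bound)
    rec = decompress(buf, img.shape, bound)
    ratio = img.nbytes / len(buf)
    print(f"压缩比 {ratio:.1f}x, 最大误差 {np.abs(img - rec).max():.2e} <= {bound}")
```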
点此查看论文截图





SkyReels-V2: Infinite-length Film Generative Model
Authors:Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, Weiming Xiong, Wei Wang, Nuo Pang, Kang Kang, Zhiheng Xu, Yuzhe Jin, Yupeng Liang, Yubing Song, Peng Zhao, Boyuan Xu, Di Qiu, Debang Li, Zhengcong Fei, Yang Li, Yahui Zhou
Recent advances in video generation have been driven by diffusion models and autoregressive frameworks, yet critical challenges persist in harmonizing prompt adherence, visual quality, motion dynamics, and duration: compromises in motion dynamics to enhance temporal visual quality, constrained video duration (5-10 seconds) to prioritize resolution, and inadequate shot-aware generation stemming from general-purpose MLLMs’ inability to interpret cinematic grammar, such as shot composition, actor expressions, and camera motions. These intertwined limitations hinder realistic long-form synthesis and professional film-style generation. To address these limitations, we propose SkyReels-V2, an Infinite-length Film Generative Model, that synergizes Multi-modal Large Language Model (MLLM), Multi-stage Pretraining, Reinforcement Learning, and Diffusion Forcing Framework. Firstly, we design a comprehensive structural representation of video that combines the general descriptions by the Multi-modal LLM and the detailed shot language by sub-expert models. Aided with human annotation, we then train a unified Video Captioner, named SkyCaptioner-V1, to efficiently label the video data. Secondly, we establish progressive-resolution pretraining for the fundamental video generation, followed by a four-stage post-training enhancement: Initial concept-balanced Supervised Fine-Tuning (SFT) improves baseline quality; Motion-specific Reinforcement Learning (RL) training with human-annotated and synthetic distortion data addresses dynamic artifacts; Our diffusion forcing framework with non-decreasing noise schedules enables long-video synthesis in an efficient search space; Final high-quality SFT refines visual fidelity. All the code and models are available at https://github.com/SkyworkAI/SkyReels-V2.
近期视频生成技术的进展主要得益于扩散模型和自回归框架的推动,然而,在提示遵循、视觉质量、运动动态和持续时间等方面仍存在关键挑战:为提升时间视觉质量而妥协运动动态、为优先保证分辨率而限制视频时长(5-10秒)、以及由通用多模态大型语言模型(MLLM)无法解释电影语法(如镜头构图、演员表达和摄影机运动)导致的镜头感知生成不足。这些交织的限制阻碍了真实的长形式合成和专业的电影风格生成。为了解决这些局限,我们提出了SkyReels-V2,一种无限长电影生成模型,该模型协同多模态大型语言模型(MLLM)、多阶段预训练、强化学习和扩散强制框架。首先,我们设计了一种全面的视频结构表示,结合多模态LLM的一般描述和子专家模型的详细镜头语言。借助人工标注,我们随后训练了一个统一的视频字幕器,名为SkyCaptioner-V1,以有效地标注视频数据。其次,我们为基本的视频生成建立了渐进式分辨率预训练,随后是四个阶段的后训练增强:初始概念平衡的监督微调(SFT)提高基线质量;使用人工注释和合成失真数据的特定运动强化学习(RL)训练解决动态伪影问题;我们的扩散强制框架和非递减噪声时间表使长视频合成在有效的搜索空间中进行;最后的高质量SFT细化视觉保真度。所有代码和模型可在https://github.com/SkyworkAI/SkyReels-V2找到。
论文及项目相关链接
PDF 31 pages,10 figures
Summary
本文介绍了视频生成领域面临的挑战,如提示遵循、视觉质量、运动动力学和持续时间的平衡问题。为解决这些问题,提出了SkyReels-V2,一个无限长度的电影生成模型,融合了多模态大型语言模型(MLLM)、多阶段预训练、强化学习和扩散强制框架。通过设计全面的视频结构表示、训练统一视频标注器、建立渐进式分辨率预训练等关键方法,实现了长视频合成和电影风格生成的专业性提升。
Key Takeaways
- 视频生成领域存在提示遵循、视觉质量、运动动力学和持续时间平衡的挑战。
- SkyReels-V2模型融合了MLLM、多阶段预训练、强化学习和扩散强制框架。
- 设计了全面的视频结构表示,结合多模态LLM和详细镜头语言的子专家模型。
- 训练了统一视频标注器SkyCaptioner-V1,以高效标注视频数据。
- 建立渐进式分辨率预训练,采用四阶段后训练增强,包括初始概念平衡的监督微调、运动特定的强化学习训练、扩散强制框架和非递减噪声调度以及最终高质量监督微调。
点此查看论文截图




