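Published: 2025-11-26

Updated: 2025-11-27

Word count: 20.5k

Reading time: 84 min

⚠️ All summaries below were generated by a large language model and may contain errors; treat them as a reference only and use them with caution.
🔴 Note: do not use these summaries in serious academic work; they are intended only for initial screening before reading the papers.

2025-11-26 Update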

VDC-Agent: When Video Detailed Captioners Evolve Themselves via Agentic Self-Reflection

Authors:Qiang Wang, Xinyuan Gao, SongLin Dong, Jizhou Han, Jiangyang Li, Yuhang He, Yihong Gong

We present VDC-Agent, a self-evolving framework for Video Detailed Captioning that requires neither human annotations nor larger teacher models. The agent forms a closed loop of caption generation, principle-guided scoring (score and textual suggestions), and prompt refinement. When caption quality regresses, a self-reflection path leverages the previous chain-of-thought to amend the update. Running this process on unlabeled videos produces trajectories of (caption, score) pairs. We convert the trajectories into preference tuples and filter out samples with JSON parsing errors, resulting in VDC-Agent-19K, which contains 18,886 automatically constructed pairs. We then fine-tune the base MLLM on this dataset using an easy-to-hard curriculum direct preference optimization. Built on Qwen2.5-VL-7B-Instruct, our VDC-Agent-7B attains state-of-the-art performance on the VDC benchmark with 49.08% average accuracy and 2.50 score, surpassing specialized video captioners and improving over the base model by +5.13% accuracy and +0.27 score at similar inference cost.

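Summary

VDC-Agent is a self-evolving framework for video detailed captioning that needs neither human annotations nor larger teacher models. It runs a closed loop of caption generation, principle-guided scoring (a score plus textual suggestions), and prompt refinement; when caption quality regresses, a self-reflection path uses the previous chain-of-thought to amend the update. Running this loop on unlabeled videos yields trajectories of (caption, score) pairs, which are converted into preference tuples and filtered for JSON parsing errors to produce VDC-Agent-19K, a dataset of 18,886 automatically constructed pairs. The base MLLM is then fine-tuned on this dataset with easy-to-hard curriculum direct preference optimization. Built on Qwen2.5-VL-7B-Instruct, VDC-Agent-7B reaches state-of-the-art results on the VDC benchmark (49.08% average accuracy, 2.50 score), surpassing specialized video captioners and improving on the base model by +5.13% accuracy and +0.27 score at similar inference cost.

Key Takeaways

VDC-Agent is a self-evolving framework for video detailed captioning that requires neither human annotations nor larger teacher models.
The framework closes a loop of caption generation, principle-guided scoring, and prompt refinement.
A self-reflection path amends the prompt update when caption quality regresses.
Running the loop on unlabeled videos produces trajectories of (caption, score) pairs.
The trajectories are converted into preference tuples and samples with JSON parsing errors are filtered out, yielding the VDC-Agent-19K dataset.
The base MLLM is fine-tuned with curriculum direct preference optimization.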
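
The step from (caption, score) trajectories to preference tuples, followed by an easy-to-hard ordering, can be sketched roughly as below. The abstract does not say how pairs are formed or how difficulty is measured, so pairing every two captions of a video and treating a larger score gap as "easier" are assumptions made for illustration, as are the function names and the score scale.

```python
from itertools import combinations

def build_preference_pairs(trajectory, min_gap=0.0):
    """Turn one video's (caption, score) trajectory into preference tuples.

    Assumption: every two captions with different scores form one pair.
    """
    pairs = []
    for (cap_a, s_a), (cap_b, s_b) in combinations(trajectory, 2):
        gap = abs(s_a - s_b)
        if gap <= min_gap:
            continue  # equal scores carry no preference signal
        chosen, rejected = (cap_a, cap_b) if s_a > s_b else (cap_b, cap_a)
        pairs.append({"chosen": chosen, "rejected": rejected, "gap": gap})
    return pairs

def easy_to_hard(pairs):
    """Order pairs so that large score gaps (assumed easier) come first."""
    return sorted(pairs, key=lambda p: p["gap"], reverse=True)

# Hypothetical trajectory with made-up scores.
trajectory = [
    ("a dog runs", 55.0),
    ("a brown dog runs on a beach", 80.0),
    ("a dog", 40.0),
]
curriculum = easy_to_hard(build_preference_pairs(trajectory))
```

Each resulting tuple has the chosen/rejected shape that direct preference optimization consumes; the ordering only decides which tuples are seen first.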

SLMFix: Leveraging Small Language Models for Error Fixing with Reinforcement Learning

Authors:David Jiahao Fu, Aryan Gupta, Aaron Councilman, David Grove, Yu-Xiong Wang, Vikram Adve

Recent advancements in large language models (LLMs) have shown very impressive capabilities in code generation across many programming languages. However, even state-of-the-art LLMs generate programs that contain syntactic errors and fail to complete the given tasks, especially for low-resource programming languages (LRPLs). In addition, high training cost makes finetuning LLMs unaffordable with constrained computational resources, further undermining the effectiveness of LLMs for code generation. In this work, we propose SLMFix, a novel code generation pipeline that leverages a small language model (SLM) finetuned using reinforcement learning (RL) techniques to fix syntactic errors in LLM-generated programs to improve the quality of LLM-generated programs for domain-specific languages (DSLs). Specifically, we applied RL on the SLM for the program repair task using a reward calculated using both a static validator and a static semantic similarity metric. Our experimental results demonstrate the effectiveness and generalizability of our approach across multiple DSLs, achieving more than 95% pass rate on the static validator. Notably, SLMFix brings substantial improvement to the base model and outperforms the supervised finetuning approach even for 7B models on an LRPL, showing the potential of our approach as an alternative to traditional finetuning approaches.

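Summary

LLMs have shown impressive code generation ability across many programming languages, yet even state-of-the-art models produce programs that contain syntactic errors and fail their tasks, especially for low-resource programming languages (LRPLs). High training cost also makes fine-tuning LLMs impractical under constrained compute. SLMFix is a code generation pipeline in which a small language model (SLM), fine-tuned with reinforcement learning (RL), repairs syntactic errors in LLM-generated programs to improve program quality for domain-specific languages (DSLs). Experiments show that the approach is effective and generalizes across multiple DSLs, with a pass rate above 95% on the static validator. SLMFix substantially improves the base model and outperforms supervised fine-tuning even for 7B models on an LRPL, which suggests it is a viable alternative to traditional fine-tuning.

Key Takeaways

LLMs are strong code generators but still produce syntactic errors and fail tasks, especially in low-resource programming languages.
High training cost limits fine-tuning LLMs for code generation when compute is constrained.
SLMFix is a code generation pipeline that uses an RL-fine-tuned small language model (SLM) to fix syntactic errors in LLM-generated programs.
Experiments show SLMFix is effective and generalizes across multiple domain-specific languages, with a static-validator pass rate above 95%.
SLMFix substantially improves the base model and outperforms supervised fine-tuning even for 7B models on a low-resource programming language.
SLMFix shows potential as an alternative to traditional fine-tuning approaches.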
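
A reward that combines a static validator with a similarity term, as the SLMFix abstract describes, might look like the sketch below. The abstract gives neither the validator nor the metric nor the weighting: the balanced-parenthesis check, the use of `difflib.SequenceMatcher` (a textual rather than semantic similarity), and the equal weights are all stand-in assumptions.

```python
from difflib import SequenceMatcher

def static_validator(program: str) -> bool:
    """Hypothetical validator: accept programs with balanced parentheses."""
    depth = 0
    for ch in program:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # closing parenthesis without a matching open
    return depth == 0

def repair_reward(candidate: str, reference: str, w_valid=0.5, w_sim=0.5) -> float:
    """Weighted sum of validity (0 or 1) and similarity to a reference program."""
    valid = 1.0 if static_validator(candidate) else 0.0
    # Stand-in for the paper's static semantic similarity metric.
    similarity = SequenceMatcher(None, candidate, reference).ratio()
    return w_valid * valid + w_sim * similarity

good = repair_reward("f(x)", "f(x)")
broken = repair_reward("f(x", "f(x)")
```

Mixing the two terms means a repair that merely passes validation but drifts far from the intended program earns less than one that is both valid and close to it.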

Efficiency vs. Fidelity: A Comparative Analysis of Diffusion Probabilistic Models and Flow Matching on Low-Resource Hardware

Authors:Srishti Gupta, Yashasvee Taiwade

Denoising Diffusion Probabilistic Models (DDPMs) have established a new state-of-the-art in generative image synthesis, yet their deployment is hindered by significant computational overhead during inference, often requiring up to 1,000 iterative steps. This study presents a rigorous comparative analysis of DDPMs against the emerging Flow Matching (Rectified Flow) paradigm, specifically isolating their geometric and efficiency properties on low-resource hardware. By implementing both frameworks on a shared Time-Conditioned U-Net backbone using the MNIST dataset, we demonstrate that Flow Matching significantly outperforms Diffusion in efficiency. Our geometric analysis reveals that Flow Matching learns a highly rectified transport path (Curvature $\mathcal{C} \approx 1.02$), which is near-optimal, whereas Diffusion trajectories remain stochastic and tortuous ($\mathcal{C} \approx 3.45$). Furthermore, we establish an "efficiency frontier" at $N=10$ function evaluations, where Flow Matching retains high fidelity while Diffusion collapses. Finally, we show via numerical sensitivity analysis that the learned vector field is sufficiently linear to render high-order ODE solvers (Runge-Kutta 4) unnecessary, validating the use of lightweight Euler solvers for edge deployment. This work concludes that Flow Matching is the superior algorithmic choice for real-time, resource-constrained generative tasks.

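Summary

A comparison of DDPMs and Flow Matching shows that Flow Matching has the advantage in both geometry and efficiency. With both frameworks implemented on a shared Time-Conditioned U-Net backbone and trained on MNIST, Flow Matching significantly outperforms Diffusion in efficiency. The geometric analysis shows that Flow Matching learns a near-optimal transport path (curvature $\mathcal{C} \approx 1.02$), whereas Diffusion trajectories remain stochastic and tortuous ($\mathcal{C} \approx 3.45$). At $N=10$ function evaluations, Flow Matching retains high fidelity while Diffusion collapses. The study concludes that Flow Matching is the better choice for real-time, resource-constrained generative tasks.

Key Takeaways

DDPMs are state of the art in generative image synthesis, but inference is computationally expensive, often requiring up to 1,000 iterative steps.
The study rigorously compares DDPMs with Flow Matching, isolating their geometric and efficiency properties on low-resource hardware.
Both frameworks are implemented on a shared Time-Conditioned U-Net backbone using MNIST, and Flow Matching outperforms Diffusion in efficiency.
The geometric analysis shows that Flow Matching learns a near-optimal transport path with low curvature.
At 10 function evaluations, Flow Matching retains high fidelity while Diffusion collapses.
A numerical sensitivity analysis shows the learned vector field is linear enough that high-order ODE solvers (Runge-Kutta 4) are unnecessary, which validates lightweight Euler solvers for edge deployment.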
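
To make the link between path straightness and cheap Euler sampling concrete, here is a toy sketch with a fixed-step Euler integrator and a straightness measure. The abstract does not define its curvature $\mathcal{C}$; the ratio of path length to start-to-end distance used here is an assumed proxy, and the two velocity fields are hand-written examples rather than learned models.

```python
import math

def euler_sample(velocity, x0, n_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    path = [tuple(x)]
    dt = 1.0 / n_steps
    for k in range(n_steps):
        v = velocity(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        path.append(tuple(x))
    return path

def path_to_chord_ratio(path):
    """Path length divided by start-to-end distance; 1.0 means a straight line."""
    length = sum(math.dist(a, b) for a, b in zip(path, path[1:]))
    return length / math.dist(path[0], path[-1])

# Constant velocity: the ideal straight transport path.
straight = euler_sample(lambda x, t: (3.0, 4.0), (0.0, 0.0), n_steps=10)
# Time-varying velocity: a bent path.
curved = euler_sample(
    lambda x, t: (5.0 * math.cos(math.pi * t), 4.0), (0.0, 0.0), n_steps=10
)
```

When the velocity is constant along the path, a single Euler step would already land on the endpoint, which is the intuition behind preferring a lightweight solver for a nearly linear vector field.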

LLM-Driven Stationarity-Aware Expert Demonstrations for Multi-Agent Reinforcement Learning in Mobile Systems

Authors:Tianyang Duan, Zongyuan Zhang, Zheng Lin, Songxiao Guo, Xiuxian Guan, Guangyu Wu, Zihan Fang, Haotian Meng, Xia Du, Ji-Zhe Zhou, Heming Cui, Jun Luo, Yue Gao

Multi-agent reinforcement learning (MARL) has been increasingly adopted in many real-world applications. While MARL enables decentralized deployment on resource-constrained edge devices, it suffers from severe non-stationarity due to the synchronous updates of agent policies. This non-stationarity results in unstable training and poor policy convergence, especially as the number of agents increases. In this paper, we propose RELED, a scalable MARL framework that integrates large language model (LLM)-driven expert demonstrations with autonomous agent exploration. RELED incorporates a Stationarity-Aware Expert Demonstration module, which leverages theoretical non-stationarity bounds to enhance the quality of LLM-generated expert trajectories, thus providing high-reward and training-stable samples for each agent. Moreover, a Hybrid Expert-Agent Policy Optimization module adaptively balances each agent’s learning from both expert-generated and agent-generated trajectories, accelerating policy convergence and improving generalization. Extensive experiments with real city networks based on OpenStreetMap demonstrate that RELED achieves superior performance compared to state-of-the-art MARL methods.

Paper and project links

PDF 15 pages, 9 figures

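Summary

Multi-agent reinforcement learning (MARL) is increasingly used in real-world applications and enables decentralized deployment on resource-constrained edge devices, but synchronous updates of agent policies cause severe non-stationarity. The paper proposes RELED, a scalable MARL framework that combines LLM-driven expert demonstrations with autonomous agent exploration. RELED consists of a Stationarity-Aware Expert Demonstration module, which uses theoretical non-stationarity bounds to improve the quality of LLM-generated expert trajectories and supply high-reward, training-stable samples, and a Hybrid Expert-Agent Policy Optimization module. Experiments on real city networks based on OpenStreetMap show that RELED outperforms state-of-the-art MARL methods.

Key Takeaways

MARL suffers from severe non-stationarity caused by synchronous policy updates, which destabilizes training, especially as the number of agents grows.
RELED addresses this by combining LLM-driven expert demonstrations with autonomous agent exploration.
Its Stationarity-Aware Expert Demonstration module uses theoretical non-stationarity bounds to improve the quality of expert trajectories.
Its Hybrid Expert-Agent Policy Optimization module adaptively balances learning from expert-generated and agent-generated trajectories.
RELED achieves superior performance on real city networks based on OpenStreetMap.
RELED accelerates policy convergence and improves generalization.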

An Anatomy Aware Hybrid Deep Learning Framework for Lung Cancer Tumor Stage Classification

Authors:Saniah Kayenat Chowdhury, Rusab Sarmun, Muhammad E. H. Chowdhury, Sohaib Bassam Zoghoul, Israa Al-Hashimi, Adam Mushtak, Amith Khandakar

Accurate lung cancer tumor staging is crucial for prognosis and treatment planning. However, it remains challenging for end-to-end deep learning approaches, as such approaches often overlook spatial and anatomical information that are central to the tumor-node-metastasis system. The tumor stage depends on multiple quantitative criteria, including the tumor size and its proximity to the nearest anatomical structures, and small variations can alter the staging outcome. We propose a medically grounded hybrid pipeline that performs staging by explicitly measuring the tumor’s size and distance properties rather than treating it as a pure image classification task. Our method employs specialized encoder-decoder networks to precisely segment the lung and adjacent anatomy, including the lobes, tumor, mediastinum, and diaphragm. Subsequently, we extract the necessary tumor properties, i.e. measure the largest tumor dimension and calculate the distance between the tumor and neighboring anatomical structures by a quantitative analysis of the segmentation masks. Finally, we apply rule-based tumor staging aligned with the medical guidelines. This novel framework has been evaluated on the Lung-PET-CT-Dx dataset, demonstrating superior performance compared to traditional deep learning models, achieving an overall classification accuracy of 91.36%. We report the per-stage F1-scores of 0.93 (T1), 0.89 (T2), 0.96 (T3), and 0.90 (T4), a critical evaluation aspect often omitted in prior literature. To our knowledge, this is the first study that embeds explicit clinical context into tumor stage classification. Unlike standard convolutional neural networks that operate in an uninterpretable “black box” manner, our method offers both state-of-the-art performance and transparent decision support.

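Summary

Accurate lung cancer tumor staging is crucial for prognosis and treatment planning, but end-to-end deep learning approaches often overlook the spatial and anatomical information central to the tumor-node-metastasis system. The paper proposes a medically grounded hybrid pipeline that stages tumors by explicitly measuring size and distance properties rather than treating staging as pure image classification. The method first segments the lung and adjacent anatomy, then extracts tumor properties from the segmentation masks, and finally applies rule-based staging. On the Lung-PET-CT-Dx dataset it outperforms traditional deep learning models, with an overall classification accuracy of 91.36%.

Key Takeaways

Accurate lung cancer tumor staging is crucial for prognosis and treatment planning.
End-to-end deep learning approaches often overlook spatial and anatomical information, which makes staging difficult for them.
The proposed medically grounded hybrid pipeline stages tumors by measuring tumor size and distance properties.
The method segments the lung and adjacent structures, extracts tumor properties, and applies rule-based staging.
Evaluation on the Lung-PET-CT-Dx dataset shows superior performance over traditional deep learning models.
The method reaches 91.36% overall classification accuracy and provides transparent decision support.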
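
The final rule-based step could take roughly the following form once the tumor's largest dimension and its distances to neighboring structures have been measured. The abstract does not list the rules it applies; the size cut-offs below (3, 5, and 7 cm) are the ones commonly cited from the 8th-edition TNM system, and the contact threshold and parameter names are assumptions for illustration only.

```python
def t_stage(largest_dim_cm, dist_to_mediastinum_mm=None,
            dist_to_diaphragm_mm=None, contact_mm=0.0):
    """Assign a T stage from measured tumor size and distances.

    Assumed rules: contact with the mediastinum or diaphragm, or a size
    above 7 cm, gives T4; otherwise the size cut-offs 5 cm and 3 cm apply.
    A distance of None means the structure was not measured.
    """
    def touches(distance_mm):
        return distance_mm is not None and distance_mm <= contact_mm

    if (touches(dist_to_mediastinum_mm) or touches(dist_to_diaphragm_mm)
            or largest_dim_cm > 7.0):
        return "T4"
    if largest_dim_cm > 5.0:
        return "T3"
    if largest_dim_cm > 3.0:
        return "T2"
    return "T1"

stages = [t_stage(2.5), t_stage(3.1), t_stage(6.0), t_stage(8.2)]
invading = t_stage(2.0, dist_to_mediastinum_mm=0.0)
```

Because each decision is an explicit comparison against a measured quantity, the reason for any predicted stage can be read directly from the inputs, which is the transparency argument the abstract makes.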

HeLEx: A Heterogeneous Layout Explorer for Spatial Elastic Coarse-Grained Reconfigurable Arrays

Authors:Alan Jia Bao Du, Tarek S. Abdelrahman

We present HeLEx, a framework for determining the functional layout of heterogeneous spatially-configured elastic Coarse-Grained Reconfigurable Arrays (CGRAs). Given a collection of input data flow graphs (DFGs) and a target CGRA, the framework starts with a full layout in which every processing element (PE) supports every operation in the DFGs. It then employs a branch-and-bound (BB) search to eliminate operations out of PEs, ensuring that the input DFGs successfully map onto the resulting CGRAs, eventually returning an optimized heterogeneous CGRA. Experimental evaluation with 12 DFGs and 9 target CGRA sizes reveals that the framework reduces the number of operations by 68.7% on average, resulting in a reduction of CGRA area by almost 70% and of power by over 51%, all compared to the initial full layout. HeLEx generates CGRAs that are on average only within 6.2% of theoretically minimum CGRAs that support exactly the number of operations needed by the input DFGs. A comparison with functional layouts produced by two state-of-the-art frameworks indicates that HeLEx achieves better reduction in the number of operations, by up to 2.6X.

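Summary

HeLEx is a framework for determining the functional layout of heterogeneous spatially-configured elastic Coarse-Grained Reconfigurable Arrays (CGRAs). Starting from a full layout, it uses a branch-and-bound (BB) search to remove operations from processing elements while ensuring the input data flow graphs (DFGs) still map onto the resulting CGRA. Experiments show an average reduction of 68.7% in the number of operations, which cuts CGRA area by almost 70% and power by over 51%. Compared with two state-of-the-art frameworks, HeLEx achieves a better reduction in the number of operations, by up to 2.6X.

Key Takeaways

HeLEx is a framework for determining the functional layout of heterogeneous spatially-configured elastic CGRAs.
It takes a collection of data flow graphs (DFGs) and a target CGRA as input.
A branch-and-bound (BB) search removes operations from processing elements while keeping the input DFGs mappable.
Experiments show an average 68.7% reduction in operations, with CGRA area reduced by almost 70% and power by over 51%.
The generated CGRAs are on average within 6.2% of the theoretical minimum CGRAs that support exactly the operations needed by the input DFGs.
Compared with two state-of-the-art frameworks, HeLEx achieves a better reduction in the number of operations.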

Leveraging LLMs for reward function design in reinforcement learning control tasks

Authors:Franklin Cardenoso, Wouter Caarls

The challenge of designing effective reward functions in reinforcement learning (RL) represents a significant bottleneck, often requiring extensive human expertise and being time-consuming. Previous work and recent advancements in large language models (LLMs) have demonstrated their potential for automating the generation of reward functions. However, existing methodologies often require preliminary evaluation metrics, human-engineered feedback for the refinement process, or the use of environmental source code as context. To address these limitations, this paper introduces LEARN-Opt (LLM-based Evaluator and Analyzer for Reward functioN Optimization). This LLM-based, fully autonomous, and model-agnostic framework eliminates the need for preliminary metrics and environmental source code as context to generate, execute, and evaluate reward function candidates from textual descriptions of systems and task objectives. LEARN-Opt’s main contribution lies in its ability to autonomously derive performance metrics directly from the system description and the task objective, enabling unsupervised evaluation and selection of reward functions. Our experiments indicate that LEARN-Opt achieves performance comparable to or better than that of state-of-the-art methods, such as EUREKA, while requiring less prior knowledge. We find that automated reward design is a high-variance problem, where the average-case candidate fails, requiring a multi-run approach to find the best candidates. Finally, we show that LEARN-Opt can unlock the potential of low-cost LLMs to find high-performing candidates that are comparable to, or even better than, those of larger models. This demonstrated performance affirms its potential to generate high-quality reward functions without requiring any preliminary human-defined metrics, thereby reducing engineering overhead and enhancing generalizability.

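Summary

Designing effective reward functions is a major bottleneck in reinforcement learning, demanding extensive human expertise and time. LLMs have shown potential for automating reward function generation, but existing methods often need preliminary evaluation metrics, human-engineered feedback for refinement, or environment source code as context. LEARN-Opt is an LLM-based framework that generates, executes, and evaluates reward function candidates from textual descriptions of the system and task objective, without preliminary metrics or environment source code. Its main contribution is deriving performance metrics autonomously from the system description and task objective, which enables unsupervised evaluation and selection of reward functions. Experiments show performance comparable to or better than state-of-the-art methods, and reveal that automated reward design is a high-variance problem that requires multiple runs to find the best candidates. LEARN-Opt also lets low-cost LLMs find candidates comparable to, or even better than, those of larger models. These results point to its potential to generate high-quality reward functions without human-defined metrics, reducing engineering overhead and improving generalizability.

Key Takeaways

LEARN-Opt is an LLM-based, fully autonomous, model-agnostic framework that generates, executes, and evaluates reward function candidates.
It needs no preliminary evaluation metrics or environment source code, and derives performance metrics directly from the system description and task objective.
LEARN-Opt enables unsupervised evaluation and selection of reward functions.
Experiments show performance comparable to or better than state-of-the-art methods, and that automated reward design is a high-variance problem.
Finding the best reward function candidates requires a multi-run approach.
Low-cost LLMs can find high-quality reward function candidates with LEARN-Opt.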

CellFMCount: A Fluorescence Microscopy Dataset, Benchmark, and Methods for Cell Counting

Authors:Abdurahman Ali Mohammed, Catherine Fonder, Ying Wei, Wallapak Tavanapong, Donald S Sakaguchi, Qi Li, Surya K. Mallapragada

Accurate cell counting is essential in various biomedical research and clinical applications, including cancer diagnosis, stem cell research, and immunology. Manual counting is labor-intensive and error-prone, motivating automation through deep learning techniques. However, training reliable deep learning models requires large amounts of high-quality annotated data, which is difficult and time-consuming to produce manually. Consequently, existing cell-counting datasets are often limited, frequently containing fewer than $500$ images. In this work, we introduce a large-scale annotated dataset comprising $3{,}023$ images from immunocytochemistry experiments related to cellular differentiation, containing over $430{,}000$ manually annotated cell locations. The dataset presents significant challenges: high cell density, overlapping and morphologically diverse cells, a long-tailed distribution of cell count per image, and variation in staining protocols. We benchmark three categories of existing methods: regression-based, crowd-counting, and cell-counting techniques on a test set with cell counts ranging from $10$ to $2{,}126$ cells per image. We also evaluate how the Segment Anything Model (SAM) can be adapted for microscopy cell counting using only dot-annotated datasets. As a case study, we implement a density-map-based adaptation of SAM (SAM-Counter) and report a mean absolute error (MAE) of $22.12$, which outperforms existing approaches (second-best MAE of $27.46$). Our results underscore the value of the dataset and the benchmarking framework for driving progress in automated cell counting and provide a robust foundation for future research and development.

Paper and project links

PDF The IEEE International Conference on Data Mining (ICDM) 2025

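Summary

The paper introduces a large-scale annotated dataset of 3,023 images from immunocytochemistry experiments related to cellular differentiation, with over 430,000 manually annotated cell locations. The dataset is challenging: high cell density, overlapping and morphologically diverse cells, a long-tailed distribution of cell counts per image, and variation in staining protocols. Three categories of existing methods are benchmarked (regression-based, crowd-counting, and cell-counting techniques), and the paper evaluates how the Segment Anything Model (SAM) can be adapted for microscopy cell counting using only dot-annotated datasets. As a case study, a density-map-based adaptation of SAM (SAM-Counter) reaches a mean absolute error (MAE) of 22.12, outperforming existing approaches. The results underscore the value of the dataset and benchmarking framework for automated cell counting and provide a robust foundation for future research.

Key Takeaways

Accurate cell counting is essential in biomedical research and clinical applications such as cancer diagnosis, stem cell research, and immunology.
Manual counting is labor-intensive and error-prone, which motivates automation with deep learning.
Training reliable deep learning models requires large amounts of high-quality annotated data, which is difficult and time-consuming to produce manually.
Existing cell-counting datasets are often limited, frequently containing fewer than 500 images.
The paper introduces a large-scale annotated dataset of 3,023 images from immunocytochemistry experiments related to cellular differentiation.
The dataset is challenging because of high cell density, overlapping and morphologically diverse cells, and variation in staining protocols.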
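
For readers unfamiliar with density-map counting and the MAE metric quoted for SAM-Counter, a minimal sketch follows. It assumes the usual convention that each annotated cell contributes a total mass of 1 to the density map, so the predicted count is the sum over all pixels; the example values are made up.

```python
def count_from_density(density_map):
    """Predicted cell count: the sum of all values in a 2D density map."""
    return sum(sum(row) for row in density_map)

def mean_absolute_error(predicted_counts, true_counts):
    """Average absolute difference between predicted and true per-image counts."""
    errors = [abs(p - t) for p, t in zip(predicted_counts, true_counts)]
    return sum(errors) / len(errors)

# One cell whose unit mass is spread over four pixels (hypothetical values).
single_cell_map = [[0.25, 0.25],
                   [0.25, 0.25]]
predicted = count_from_density(single_cell_map)

# Hypothetical per-image counts for two test images.
mae = mean_absolute_error([10, 20], [12, 17])
```

Because MAE is measured in cells per image, it should be read against the count range of the test set, which the abstract gives as 10 to 2,126 cells per image.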

Optimizing Weak Orders via Integer Linear Programming

Authors:Juan A. Aledo, Concepción Domínguez, Juan de Dios Jaime-Alcántara, Mercedes Landete

Rank aggregation problems aim to combine multiple individual orderings of a common set of items into a consensus ranking that best reflects the collective preferences. This paper introduces a general Integer Linear Programming (ILP) framework to model and solve, in an exact way, problems whose solutions are weak orders (a.k.a. bucket orders). Within this framework, we consider additional relevant constraints to produce the consensus bucket order, considering configurations with a fixed number of buckets, predefined bucket sizes, top-$k$ type problems, and fairness constraints. All these formulations are developed in a general setting, allowing their application to different rank aggregation contexts. One of these problems is the Optimal Bucket Order Problem (OBOP), for which we propose for the first time an exact formulation and test the variants proposed. The computational study includes, on the one hand, a comparison between the exact results obtained by our models and the heuristic methods proposed by Aledo et al. (2018), and on the other hand, an additional evaluation of their performance on a representative set of instances from the PrefLib library. The results confirm the validity and efficiency of the proposed approach, providing a solid foundation for future research on rank aggregation problems with weak orders as consensus rankings.

Paper and project links

PDF 33 pages, 7 figures

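Summary

The paper introduces a general Integer Linear Programming (ILP) framework for modeling and exactly solving rank aggregation problems whose consensus ranking is a weak order (bucket order) reflecting multiple individual orderings of a common set of items. It covers several variants: a fixed number of buckets, predefined bucket sizes, top-k type problems, and fairness constraints. For one of these problems, the Optimal Bucket Order Problem (OBOP), the paper proposes the first exact formulation and tests the proposed variants. A comparison with the heuristic methods of Aledo et al. (2018) and an evaluation on representative instances from the PrefLib library confirm the validity and efficiency of the approach, providing a solid foundation for future work on rank aggregation with weak orders as consensus rankings.

Key Takeaways

An Integer Linear Programming framework is introduced for rank aggregation problems whose solutions are weak orders.
Additional constraints are considered: a fixed number of buckets, predefined bucket sizes, top-k type problems, and fairness constraints.
The first exact formulation is proposed for the Optimal Bucket Order Problem (OBOP).
The exact results are compared with those of the heuristic methods of Aledo et al. (2018).
Model performance is also evaluated on a representative set of instances from the PrefLib library.
The results confirm the validity and efficiency of the proposed approach.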
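
To show what an exact solution of the bucket order problem means on a tiny instance, the sketch below enumerates every weak order instead of solving an ILP, so it is a brute-force stand-in and not the paper's formulation. It assumes the common OBOP objective: a bucket order is encoded as a matrix with 1 where item i precedes item j, 0.5 where they share a bucket, and 0 otherwise, and the goal is to minimize the summed absolute difference to a given pair order matrix. The matrix `C` is a made-up example.

```python
from itertools import product

def weak_orders(n):
    """Yield every weak order on n items as a tuple of bucket indices (0 = top)."""
    for ranks in product(range(n), repeat=n):
        # Bucket indices must be contiguous so each weak order appears once.
        if set(ranks) == set(range(max(ranks) + 1)):
            yield ranks

def bucket_matrix(ranks):
    """Encode a weak order: 1 if i precedes j, 0.5 if same bucket, 0 otherwise."""
    n = len(ranks)
    return [[1.0 if ranks[i] < ranks[j] else 0.5 if ranks[i] == ranks[j] else 0.0
             for j in range(n)] for i in range(n)]

def obop_brute_force(pair_matrix):
    """Return the weak order closest (L1 distance, off-diagonal) to pair_matrix."""
    n = len(pair_matrix)

    def cost(ranks):
        b = bucket_matrix(ranks)
        return sum(abs(b[i][j] - pair_matrix[i][j])
                   for i in range(n) for j in range(n) if i != j)

    return min(weak_orders(n), key=cost)

# Hypothetical pair order matrix: item 0 is clearly preferred, items 1 and 2 tie.
C = [[0.5, 0.9, 0.9],
     [0.1, 0.5, 0.5],
     [0.1, 0.5, 0.5]]
best = obop_brute_force(C)
```

The number of weak orders grows very quickly with the number of items (13 for three items, 75 for four), which is why an exact ILP model is of interest beyond toy sizes.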

Syn-GRPO: Self-Evolving Data Synthesis for MLLM Perception Reasoning

Authors:Qihan Huang, Haofei Zhang, Rong Wei, Yi Wang, Rui Tang, Mingli Song, Jie Song

RL (reinforcement learning) methods (e.g., GRPO) for MLLM (Multimodal LLM) perception ability have attracted wide research interest owing to their remarkable generalization ability. Nevertheless, existing reinforcement learning methods still face the problem of low data quality, where data samples cannot elicit diverse responses from MLLMs, thus restricting the exploration scope for MLLM reinforcement learning. Some methods attempt to mitigate this problem by imposing constraints on entropy, but none address it at its root. Therefore, to tackle this problem, this work proposes Syn-GRPO (Synthesis-GRPO), which employs an online data generator to synthesize high-quality training data with diverse responses in GRPO training. Specifically, Syn-GRPO consists of two components: (1) data server; (2) GRPO workflow. The data server synthesizes new samples from existing ones using an image generation model, featuring a decoupled and asynchronous scheme to achieve high generation efficiency. The GRPO workflow provides the data server with the new image descriptions, and it leverages a diversity reward to supervise the MLLM to predict image descriptions for synthesizing samples with diverse responses. Experiment results across three visual perception tasks demonstrate that Syn-GRPO improves the data quality by a large margin, achieving significantly superior performance to existing MLLM perception methods, and Syn-GRPO presents promising potential for scaling long-term self-evolving RL. Our code is available at https://github.com/hqhQAQ/Syn-GRPO.

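Summary

Reinforcement learning (RL) methods for MLLM perception generalize well but suffer from low data quality: samples that cannot elicit diverse responses restrict exploration. Existing methods try to mitigate this with entropy constraints but do not address the root cause. Syn-GRPO (Synthesis-GRPO) uses an online data generator to synthesize high-quality training data with diverse responses during GRPO training. It has two components: a data server, which synthesizes new samples from existing ones with an image generation model, and a GRPO workflow, which supplies new image descriptions and uses a diversity reward to supervise the MLLM to predict image descriptions for synthesizing samples with diverse responses. Experiments on three visual perception tasks show that Syn-GRPO improves data quality by a large margin, significantly outperforms existing MLLM perception methods, and shows promise for scaling long-term self-evolving RL. The code is publicly available.

Key Takeaways

RL methods for MLLM perception show strong generalization ability.
Existing RL methods suffer from low data quality, which restricts the exploration scope of MLLM reinforcement learning.
Syn-GRPO addresses this by synthesizing high-quality training data with an online data generator.
Syn-GRPO consists of a data server, which synthesizes samples, and a GRPO workflow, which provides new image descriptions.
The data server synthesizes new samples with an image generation model, using a decoupled and asynchronous scheme for high generation efficiency.
The GRPO workflow uses a diversity reward to supervise the MLLM to predict image descriptions that yield samples with diverse responses.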
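
Why samples that fail to elicit diverse responses are a problem for GRPO can be seen from its group-relative advantage, sketched below. This follows the standard GRPO normalization (reward minus group mean, divided by group standard deviation); whether a population or sample standard deviation is used, and the `eps` value, vary between implementations and are assumptions here. The diversity reward of Syn-GRPO itself is not specified in the abstract and is not modeled.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each response's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# A sample for which every response gets the same reward: no learning signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
# A sample that elicits mixed outcomes: informative advantages.
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

When all responses in a group receive the same reward, every advantage is zero and the sample contributes no gradient, which is one way to read the abstract's point about restricted exploration.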

Explicit Tonal Tension Conditioning via Dual-Level Beam Search for Symbolic Music Generation

Authors:Maral Ebrahimzadeh, Gilberto Bernardes, Sebastian Stober

State-of-the-art symbolic music generation models have recently achieved remarkable output quality, yet explicit control over compositional features, such as tonal tension, remains challenging. We propose a novel approach that integrates a computational tonal tension model, based on tonal interval vector analysis, into a Transformer framework. Our method employs a two-level beam search strategy during inference. At the token level, generated candidates are re-ranked using model probability and diversity metrics to maintain overall quality. At the bar level, a tension-based re-ranking is applied to ensure that the generated music aligns with a desired tension curve. Objective evaluations indicate that our approach effectively modulates tonal tension, and subjective listening tests confirm that the system produces outputs that align with the target tension. These results demonstrate that explicit tension conditioning through a dual-level beam search provides a powerful and intuitive tool to guide AI-generated music. Furthermore, our experiments demonstrate that our method can generate multiple distinct musical interpretations under the same tension condition.

Paper and project links

PDF 12 pages, 2 Figures, Accepted at the 17th International Symposium on Computer Music Multidisciplinary Research (CMMR) 2025

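Summary

The paper integrates a computational tonal tension model, based on tonal interval vector analysis, into a Transformer framework to control compositional features in symbolic music generation. It uses a two-level beam search during inference: at the token level, candidates are re-ranked by model probability and diversity metrics to maintain overall quality; at the bar level, a tension-based re-ranking keeps the generated music aligned with a desired tension curve. Objective evaluations and subjective listening tests both indicate that the method effectively modulates tonal tension and produces music that matches the target tension.

Key Takeaways

A tonal tension model is integrated into a Transformer-based symbolic music generation framework.
A two-level beam search strategy is proposed to control compositional features during generation.
Token-level re-ranking by model probability and diversity metrics maintains overall quality.
Bar-level tension-based re-ranking aligns the generated music with the desired tension curve.
Objective evaluations show that the method effectively modulates tonal tension.
Subjective listening tests confirm that the generated music matches the target tension.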
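
The bar-level re-ranking step can be pictured with the sketch below, which keeps the candidate bars whose tension is closest to the target value for that bar. The tension function is a deliberately crude toy (the share of notes outside C major) standing in for the paper's tonal interval vector model, and the candidate bars, given as lists of MIDI pitches, are invented for illustration.

```python
C_MAJOR = {0, 2, 4, 5, 7, 9, 11}  # pitch classes of the C major scale

def toy_tension(bar):
    """Toy stand-in: share of notes outside C major (not the paper's model)."""
    return sum(pitch % 12 not in C_MAJOR for pitch in bar) / len(bar)

def rerank_bars(candidates, target_tension, tension_fn=toy_tension, keep=2):
    """Keep the `keep` candidate bars whose tension is closest to the target."""
    ranked = sorted(candidates, key=lambda bar: abs(tension_fn(bar) - target_tension))
    return ranked[:keep]

# Hypothetical candidate bars as MIDI pitch lists.
candidates = [[60, 64, 67], [60, 61, 66], [60, 63, 67, 70]]
best = rerank_bars(candidates, target_tension=0.5, keep=1)
```

Running the same re-ranking with `keep` greater than one leaves several bars on the beam, which matches the abstract's remark that different musical interpretations can satisfy the same tension condition.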

PRInTS: Reward Modeling for Long-Horizon Information Seeking

Authors:Jaewoo Lee, Archiki Prasad, Justin Chih-Yao Chen, Zaid Khan, Elias Stengel-Eskin, Mohit Bansal

Information-seeking is a core capability for AI agents, requiring them to gather and reason over tool-generated information across long trajectories. However, such multi-step information-seeking tasks remain challenging for agents backed by language models. While process reward models (PRMs) can guide agents by ranking candidate steps at test-time, existing PRMs, designed for short reasoning with binary judgment, cannot capture richer dimensions of information-seeking steps, such as tool interactions and reasoning over tool outputs, nor handle the rapidly growing context in long-horizon tasks. To address these limitations, we introduce PRInTS, a generative PRM trained with dual capabilities: (1) dense scoring based on the PRM’s reasoning across multiple step quality dimensions (e.g., interpretation of tool outputs, tool call informativeness) and (2) trajectory summarization that compresses the growing context while preserving essential information for step evaluation. Extensive evaluations across FRAMES, GAIA (levels 1-3), and WebWalkerQA (easy-hard) benchmarks on multiple models, along with ablations, reveal that best-of-n sampling with PRInTS enhances information-seeking abilities of open-source models as well as specialized agents, matching or surpassing the performance of frontier models with a much smaller backbone agent and outperforming other strong reward modeling baselines.

Paper and project links

PDF 18 pages, code: https://github.com/G-JWLee/PRInTS

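Summary

Information seeking requires AI agents to gather and reason over tool-generated information across long trajectories, and such multi-step tasks remain challenging for agents backed by language models. PRInTS is a generative process reward model (PRM) with two capabilities: dense scoring across multiple step-quality dimensions, and trajectory summarization that compresses the growing context while preserving the information needed for step evaluation. Evaluations show that best-of-n sampling with PRInTS improves the information-seeking abilities of open-source models and specialized agents, matching or surpassing frontier models with a much smaller backbone agent.

Key Takeaways

Information seeking is a core capability for AI agents and requires reasoning over tool-generated information across long trajectories.
Process reward models (PRMs) can guide agents by ranking candidate steps at test time, but existing PRMs cannot capture the richer dimensions of information-seeking steps or handle the rapidly growing context of long-horizon tasks; PRInTS is introduced to address these limitations.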
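
Best-of-n sampling guided by a step scorer, with a running summary in place of the full history, can be sketched as the loop below. The `propose`, `score`, and `summarize` callables are hypothetical placeholders for the agent, the reward model's dense scoring, and its trajectory summarization; the toy versions only make the loop runnable and say nothing about how PRInTS computes either.

```python
def run_best_of_n(propose, score, summarize, n=4, max_steps=3):
    """At each step, sample n candidates, keep the best-scored one, update the summary."""
    summary, chosen = "", []
    for _ in range(max_steps):
        candidates = [propose(summary, i) for i in range(n)]
        best = max(candidates, key=lambda step: score(summary, step))
        chosen.append(best)
        # The scorer sees only the compressed summary, not the full trajectory.
        summary = summarize(summary, best)
    return chosen, summary

def toy_propose(summary, i):
    """Hypothetical agent: the i-th candidate step."""
    return f"step{i}"

def toy_score(summary, step):
    """Hypothetical scorer: prefers higher-numbered candidates."""
    return int(step[-1])

def toy_summarize(summary, step):
    """Hypothetical summarizer: append the step and cap the length."""
    return (summary + " " + step).strip()[-20:]

chosen, summary = run_best_of_n(toy_propose, toy_score, toy_summarize)
```

Capping the summary length is what keeps the scorer's input bounded as the trajectory grows, which is the role the abstract assigns to trajectory summarization.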

Diagnosis of mixed-state topological phases in strongly correlated systems via disorder parameters

Authors:Shao-Hang Shi, Xiao-Qi Sun, Zi-Xiang Li

Characterizing topological phases for strongly interacting fermions in the mixed-state regime remains a major challenge. Here we introduce a general and numerically efficient framework to diagnose mixed-state topological phases in strongly interacting systems via the disorder parameter (DP) of the U(1) charge operator. Specifically, from the finite-size scaling of the second derivative of the DP generating function, we introduce the topological scaling indicator, which exhibits a characteristic linear scaling with the system’s linear dimension for topological phases, a signature that vanishes upon transition into a topologically trivial phase. Crucially, we develop an efficient determinant Quantum Monte Carlo algorithm that facilitates the evaluation of this indicator in interacting systems. We apply our approach to two paradigmatic models: for the Kane-Mele-Hubbard model, we successfully map the interaction-driven transition from a quantum spin Hall insulator to a trivial Mott insulator. Furthermore, our method circumvents the limitations imposed by the severe sign problem in the Haldane-Hubbard model, enabling robust identification of the quantum anomalous Hall phase at accessible temperatures. This work provides a powerful and accessible tool for the numerical exploration of topological phenomena in interacting mixed states, opening a pathway to study systems previously inaccessible due to computational obstacles.

Paper and project links

PDF 6+8 pages, 4+4 figures

Summary

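The paper introduces an efficient framework for diagnosing mixed-state topological phases via the disorder parameter (DP) of the U(1) charge operator. The authors define a topological scaling indicator that identifies topological phases through its finite-size scaling: it grows linearly with the system's linear dimension in a topological phase. They also develop an efficient determinant Quantum Monte Carlo algorithm for evaluating this indicator in interacting systems. Applied to two paradigmatic models, the method maps the interaction-driven transition in the Kane-Mele-Hubbard model and circumvents the severe sign problem of the Haldane-Hubbard model, enabling robust identification of the quantum anomalous Hall phase at accessible temperatures. The work provides a powerful tool for the numerical study of topological phenomena in interacting mixed states and opens a path to systems previously out of reach because of computational obstacles. In short: a new framework and algorithm for diagnosing mixed-state topological phases in interacting systems.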

Key Takeaways

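Introduces a general and numerically efficient framework for diagnosing mixed-state topological phases via the disorder parameter.
Identifies topological phases through the finite-size scaling of a topological scaling indicator.
Develops an efficient determinant Quantum Monte Carlo algorithm to evaluate the indicator in interacting systems.
Applies the method to the Kane-Mele-Hubbard model and maps its interaction-driven quantum phase transition.
Circumvents the limitations imposed by the sign problem in the Haldane-Hubbard model.
Achieves robust identification of the quantum anomalous Hall phase at accessible temperatures.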

AutoEnv: Automated Environments for Measuring Cross-Environment Agent Learning

Authors:Jiayi Zhang, Yiran Peng, Fanqi Kong, Yang Cheng, Yifan Wu, Zhaoyang Yu, Jinyu Xiang, Jianhao Ruan, Jinlin Wang, Maojia Song, HongZhang Liu, Xiangru Tang, Bang Liu, Chenglin Wu, Yuyu Luo

Humans naturally adapt to diverse environments by learning underlying rules across worlds with different dynamics, observations, and reward structures. In contrast, existing agents typically demonstrate improvements via self-evolving within a single domain, implicitly assuming a fixed environment distribution. Cross-environment learning has remained largely unmeasured: there is no standard collection of controllable, heterogeneous environments, nor a unified way to represent how agents learn. We address these gaps in two steps. First, we propose AutoEnv, an automated framework that treats environments as factorizable distributions over transitions, observations, and rewards, enabling low-cost (4.12 USD on average) generation of heterogeneous worlds. Using AutoEnv, we construct AutoEnv-36, a dataset of 36 environments with 358 validated levels, on which seven language models achieve 12-49% normalized reward, demonstrating the challenge of AutoEnv-36. Second, we formalize agent learning as a component-centric process driven by three stages of Selection, Optimization, and Evaluation applied to an improvable agent component. Using this formulation, we design eight learning methods and evaluate them on AutoEnv-36. Empirically, the gain of any single learning method quickly decreases as the number of environments increases, revealing that fixed learning methods do not scale across heterogeneous environments. Environment-adaptive selection of learning methods substantially improves performance but exhibits diminishing returns as the method space expands. These results highlight both the necessity and the current limitations of agent learning for scalable cross-environment generalization, and position AutoEnv and AutoEnv-36 as a testbed for studying cross-environment agent learning. The code is available at https://github.com/FoundationAgents/AutoEnv.

Paper and project links

PDF

Summary

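The paper starts from the observation that humans adapt to different environments by learning the underlying rules of each. Existing agents usually self-evolve and improve within a single domain, and there has been no standard way to measure cross-environment learning. To fill this gap, the paper proposes the AutoEnv automated framework and the AutoEnv-36 dataset for studying how agents learn and adapt across environments. Experiments show that the gain from any single learning method drops quickly as environments become more heterogeneous, and that environment-adaptive selection of learning methods improves performance but still has clear limits. The results and the open-source code provide a basis for research on cross-environment agent learning.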

Key Takeaways

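The key points of the paper are:

Humans adapt to varied environments by learning each environment's underlying rules, whereas existing agents mostly self-evolve within a single domain.
Cross-environment learning lacks both a standard collection of controllable, heterogeneous environments and a unified way to represent how agents learn.
The AutoEnv framework addresses these gaps by treating environments as factorizable distributions over transitions, observations, and rewards; it was used to build the AutoEnv-36 dataset.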

Open-weight genome language model safeguards: Assessing robustness via adversarial fine-tuning

Authors:James R. M. Black, Moritz S. Hanke, Aaron Maiwald, Tina Hernandez-Boussard, Oliver M. Crook, Jaspreet Pannu

Novel deep learning architectures are increasingly being applied to biological data, including genetic sequences. These models, referred to as genomic language models (gLMs), have demonstrated impressive predictive and generative capabilities, raising concerns that such models may also enable misuse, for instance via the generation of genomes for human-infecting viruses. These concerns have catalyzed calls for risk mitigation measures. The de facto mitigation of choice is filtering of pretraining data (i.e., removing viral genomic sequences from training datasets) in order to limit gLM performance on virus-related tasks. However, it is not currently known how robust this approach is for securing open-source models that can be fine-tuned using sensitive pathogen data. Here, we evaluate a state-of-the-art gLM, Evo 2, and perform fine-tuning using sequences from 110 harmful human-infecting viruses to assess the rescue of misuse-relevant predictive capabilities. The fine-tuned model exhibited reduced perplexity on unseen viral sequences relative to 1) the pretrained model and 2) a version fine-tuned on bacteriophage sequences. The model fine-tuned on human-infecting viruses also identified immune escape variants from SARS-CoV-2 (achieving an AUROC of 0.6), despite having no exposure to SARS-CoV-2 sequences during fine-tuning. This work demonstrates that data exclusion might be circumvented by fine-tuning approaches that can, to some degree, rescue misuse-relevant capabilities of gLMs. We highlight the need for safety frameworks for gLMs and outline further work needed on evaluations and mitigation measures to enable the safe deployment of gLMs.

Paper and project links

PDF 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Biosecurity Safeguards for Generative AI

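Summary
Novel deep learning architectures known as genomic language models (gLMs) are increasingly applied to biological data, including genetic sequences. They show impressive predictive and generative capabilities, which also raises misuse concerns such as the generation of genomes for human-infecting viruses. This has prompted calls for risk mitigation, with pretraining data filtering as the main choice. It is unclear, however, how robust that approach is for open-source models that can be fine-tuned on sensitive pathogen data. The study evaluates a state-of-the-art gLM, Evo 2, and fine-tunes it on sequences from 110 harmful human-infecting viruses to assess whether misuse-relevant predictive capabilities are rescued. The fine-tuned model achieved lower perplexity on unseen viral sequences than both the pretrained model and a version fine-tuned on bacteriophage sequences. Despite no exposure to SARS-CoV-2 sequences during fine-tuning, it reached an AUROC of 0.6 when identifying SARS-CoV-2 immune escape variants. The work shows that data exclusion may be circumvented by fine-tuning, which can to some degree rescue misuse-relevant capabilities of gLMs. The authors stress the need for safety frameworks for gLMs and outline further work on evaluations and mitigation measures to enable safe deployment.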

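Key Takeaways

Novel deep learning architectures are increasingly applied to biological data, including genetic sequences.
Genomic language models (gLMs) show strong predictive and generative capabilities, raising concerns about potential misuse.
Data filtering is currently the main mitigation, but its robustness for open-source models is unknown.
A state-of-the-art gLM (Evo 2) was evaluated; after fine-tuning on human-infecting virus sequences it performed better on unseen viral sequences.
Even without exposure to a specific virus (SARS-CoV-2), the fine-tuned model could still identify that virus's immune escape variants.
Data exclusion may not be a fully effective mitigation strategy, since fine-tuning can rescue misuse-relevant model capabilities.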
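
For readers unfamiliar with the perplexity metric mentioned in the abstract, the following minimal sketch shows the standard definition: the exponential of the negative mean per-token log-probability. The function name and the toy inputs are illustrative assumptions and are not taken from the paper's evaluation code.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity = exp(-mean(log p(token))), using natural logarithms."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that assigns probability 1/4 to every token of a four-letter
# alphabet is exactly as uncertain as a uniform guess.
uniform = [math.log(0.25)] * 8
print(perplexity(uniform))

# More confident (and correct) predictions give lower perplexity.
confident = [math.log(0.5)] * 8
print(perplexity(confident))
```

Lower values mean the model finds the sequence less surprising, which is why the abstract reads reduced perplexity on held-out sequences as recovered predictive capability.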

ReMatch: Boosting Representation through Matching for Multimodal Retrieval

Authors:Qianying Liu, Xiao Liang, Zhiqiang Zhang, Yibo Chen, Xu Tang, Zhongfei Qing, Fengfan Zhou, Yao Hu, Paul Henderson

We present ReMatch, a framework that leverages the generative strength of MLLMs for multimodal retrieval. Previous approaches treated an MLLM as a simple encoder, ignoring its generative nature, and under-utilising its compositional reasoning and world knowledge. We instead train the embedding MLLM end-to-end with a chat-style generative matching stage. The matching stage uses the same MLLM to autoregressively decide relevance from multi-view inputs, including both raw data and its own projected embeddings for each query and document. It provides instance-wise discrimination supervision that complements a standard contrastive loss, offering stronger gradients on hard negatives and preserving the compositional strengths of the original MLLM. To obtain semantically richer multimodal embeddings, we use multiple learnable tokens to augment each input, generating fine-grained contextual, mutually orthogonal embeddings with low inference cost. Leveraging our established high-performance baseline, we assemble the ideas mentioned above into a powerful training recipe and achieve a new state-of-the-art on the Massive Multimodal Embedding Benchmark (MMEB). Our experiments show particularly strong zero-shot generalization results on five datasets, highlighting the robustness and transferability of ReMatch.

Paper and project links

PDF

Summary

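The paper presents ReMatch, a framework that uses the generative strength of MLLMs for multimodal retrieval. Unlike earlier approaches that treat an MLLM as a simple encoder, ReMatch exploits its generative nature, compositional reasoning, and world knowledge. The embedding MLLM is trained end-to-end together with a chat-style generative matching stage, in which the same model autoregressively decides relevance from multi-view inputs that include both the raw data and the projected embeddings of each query and document. ReMatch also augments each input with multiple learnable tokens, producing fine-grained contextual, mutually orthogonal embeddings at low inference cost. It sets a new state of the art on the Massive Multimodal Embedding Benchmark (MMEB), with particularly strong zero-shot generalization on five datasets, which underlines its robustness and transferability.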

Key Takeaways

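ReMatch uses the generative strength of MLLMs for multimodal retrieval, drawing on their compositional reasoning and world knowledge.
Through end-to-end training, the same MLLM serves as both the embedding model and the matcher.
The matching stage uses the MLLM to autoregressively judge relevance from multi-view inputs, including raw data and projected embeddings.
ReMatch provides instance-wise discrimination supervision that complements the standard contrastive loss and gives stronger gradients on hard negatives.
Multiple learnable tokens augment each input, producing fine-grained contextual, mutually orthogonal embeddings.
ReMatch achieves state-of-the-art results on the Massive Multimodal Embedding Benchmark.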
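
The "standard contrastive loss" that the matching stage complements is commonly an InfoNCE objective over in-batch negatives. The sketch below is a plain-Python illustration under that assumption; the temperature of 0.05 and the function names are illustrative choices, and the generative matching loss itself is not shown.

```python
import math
from typing import List, Sequence


def _normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(
    queries: Sequence[Sequence[float]],
    docs: Sequence[Sequence[float]],
    temperature: float = 0.05,
) -> float:
    """Mean InfoNCE loss where docs[i] is the positive for queries[i]
    and every other document in the batch acts as a negative."""
    q = [_normalize(v) for v in queries]
    d = [_normalize(v) for v in docs]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of this query to every document, scaled.
        logits = [sum(a * b for a, b in zip(qi, dj)) / temperature for dj in d]
        # Log-sum-exp with max subtraction for numerical stability.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_z - logits[i]
    return total / len(q)


eye = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shifted = eye[1:] + eye[:1]  # every query is paired with a wrong document
print(info_nce(eye, eye), info_nce(eye, shifted))
```

Correctly paired embeddings give a loss near zero, while mismatched pairs give a large one. A hard negative whose similarity sits close to the positive's contributes comparatively little signal here, which is the gap the abstract says the matching stage helps close.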

BideDPO: Conditional Image Generation with Simultaneous Text and Condition Alignment

Authors:Dewei Zhou, Mingwei Li, Zongxin Yang, Yu Lu, Yunqiu Xu, Zhizhong Wang, Zeyi Huang, Yi Yang

Conditional image generation enhances text-to-image synthesis with structural, spatial, or stylistic priors, but current methods face challenges in handling conflicts between sources. These include 1) input-level conflicts, where the conditioning image contradicts the text prompt, and 2) model-bias conflicts, where generative biases disrupt alignment even when conditions match the text. Addressing these conflicts requires nuanced solutions, which standard supervised fine-tuning struggles to provide. Preference-based optimization techniques like Direct Preference Optimization (DPO) show promise but are limited by gradient entanglement between text and condition signals and lack disentangled training data for multi-constraint tasks. To overcome this, we propose a bidirectionally decoupled DPO framework (BideDPO). Our method creates two disentangled preference pairs, one for the condition and one for the text, to reduce gradient entanglement. The influence of pairs is managed using an Adaptive Loss Balancing strategy for balanced optimization. We introduce an automated data pipeline to sample model outputs and generate conflict-aware data. This process is embedded in an iterative optimization strategy that refines both the model and the data. We construct a DualAlign benchmark to evaluate conflict resolution between text and condition. Experiments show BideDPO significantly improves text success rates (e.g., +35%) and condition adherence. We also validate our approach using the COCO dataset. Project Pages: https://limuloo.github.io/BideDPO/.

Paper and project links

PDF 29 pages

Summary

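The paper covers conditional image generation for text-to-image synthesis and the challenges it faces, namely input-level conflicts and model-bias conflicts. To address them, it proposes a bidirectionally decoupled preference optimization framework (BideDPO), which builds two disentangled preference pairs to reduce gradient entanglement and uses an Adaptive Loss Balancing strategy for balanced optimization. It also introduces a conflict-aware data sampling pipeline and the DualAlign benchmark for evaluating conflict resolution between text and condition. Experiments show that BideDPO significantly improves text success rates (e.g., +35%) and strengthens condition adherence.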

Key Takeaways

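Conditional image generation for text-to-image synthesis faces input-level conflicts and model-bias conflicts.
Standard supervised fine-tuning struggles to resolve these conflicts.
A bidirectionally decoupled preference optimization framework (BideDPO) is proposed to address gradient entanglement.
BideDPO builds two disentangled preference pairs to reduce entanglement and applies an Adaptive Loss Balancing strategy for balanced optimization.
A conflict-aware data sampling pipeline and the DualAlign benchmark are introduced to evaluate model performance.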
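
To make the decoupled-pair idea concrete, here is a minimal sketch of the standard DPO loss applied separately to a text pair and a condition pair. It assumes sequence-level log-probabilities are available as inputs, and it combines the two losses with a fixed weight; `decoupled_loss` and `text_weight` are hypothetical stand-ins, since the digest does not specify how the paper's Adaptive Loss Balancing sets the weights.

```python
import math
from typing import Tuple

# (policy_chosen, policy_rejected, ref_chosen, ref_rejected) log-probabilities
Pair = Tuple[float, float, float, float]


def neg_log_sigmoid(x: float) -> float:
    """Numerically stable -log(sigmoid(x))."""
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen: float,
    policy_rejected: float,
    ref_chosen: float,
    ref_rejected: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for a single preference pair."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return neg_log_sigmoid(beta * margin)


def decoupled_loss(
    text_pair: Pair,
    condition_pair: Pair,
    text_weight: float = 0.5,  # fixed here; an assumption, not the paper's rule
    beta: float = 0.1,
) -> float:
    """Weighted sum of two independent DPO losses, one per constraint."""
    text_term = dpo_loss(*text_pair, beta=beta)
    condition_term = dpo_loss(*condition_pair, beta=beta)
    return text_weight * text_term + (1.0 - text_weight) * condition_term


text_pair = (-1.0, -5.0, -2.0, -2.0)    # policy already prefers the chosen sample
condition_pair = (0.0, 0.0, 0.0, 0.0)   # policy is indifferent
print(decoupled_loss(text_pair, condition_pair))
```

Because each pair differs along only one constraint, each term's gradient speaks to that constraint alone, which is the intuition behind reducing entanglement.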

Leveraging Spatiotemporal Graph Neural Networks for Multi-Store Sales Forecasting

Authors:Manish Singh, Arpita Dayama

This work evaluates the effectiveness of spatiotemporal Graph Neural Networks (GNNs) for multi-store retail sales forecasting and compares their performance against ARIMA, LSTM, and XGBoost baselines. Using weekly sales data from 45 Walmart stores, we construct a relational forecasting framework that models inter-store dependencies through a learned adaptive graph. The proposed STGNN predicts log-differenced sales and reconstructs final values through a residual path, enabling stable training and improved generalisation. Experiments show that STGNN achieves the lowest overall forecasting error, outperforming all baselines in Normalised Total Absolute Error, P90 MAPE, and variance of MAPE across stores. Analysis of the learned adjacency matrix reveals meaningful functional store clusters and high-influence nodes that emerge without geographic metadata. These results demonstrate that relational structure significantly improves forecast quality in interconnected retail environments and establishes STGNNs as a robust modelling choice for multi-store demand prediction.

Paper and project links

PDF 6 pages, 4 figures, 1 table

Summary

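The paper evaluates spatiotemporal graph neural networks (STGNNs) for multi-store retail sales forecasting. Using weekly sales data from 45 Walmart stores, it builds a relational forecasting framework that learns inter-store dependencies through an adaptive graph. Experiments show that the STGNN achieves the lowest overall forecasting error, outperforming all baselines on Normalised Total Absolute Error, P90 MAPE, and the variance of MAPE across stores. Analysis of the learned adjacency matrix reveals meaningful functional store clusters and high-influence nodes, further supporting the importance of relational structure for forecast quality in interconnected retail environments and establishing STGNNs as a robust modelling choice for multi-store demand prediction.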

Key Takeaways

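Evaluates spatiotemporal graph neural networks (STGNNs) for multi-store retail sales forecasting.
Builds a relational forecasting framework from weekly sales data of 45 Walmart stores.
The STGNN learns inter-store dependencies through an adaptive graph.
The STGNN achieves the lowest forecasting error on several metrics, outperforming the baseline models.
Analysis of the adjacency matrix reveals functional store clusters and high-influence nodes.
Relational structure is important for forecast quality in interconnected retail environments.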
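
The abstract states that the model predicts log-differenced sales and reconstructs final values through a residual path. One plausible reading, shown below as an assumption rather than the authors' code, is that the predicted log-difference is added back to the last observed log level. The helper names and the sales figures are illustrative.

```python
import math
from typing import List, Sequence


def log_diff(sales: Sequence[float]) -> List[float]:
    """Week-over-week differences of log sales (requires positive sales)."""
    return [math.log(b) - math.log(a) for a, b in zip(sales, sales[1:])]


def reconstruct(last_sales: float, predicted_log_diff: float) -> float:
    """Residual reconstruction: exp(log(last) + predicted change)."""
    return last_sales * math.exp(predicted_log_diff)


weekly_sales = [100.0, 110.0, 121.0]   # toy series growing 10% per week
targets = log_diff(weekly_sales)       # what a model would be trained to predict
# Pretend the model predicts that the latest growth rate simply continues.
forecast = reconstruct(weekly_sales[-1], targets[-1])
print(targets, forecast)
```

Training on log-differences keeps the targets on a comparable scale across large and small stores, which is consistent with the stable training the abstract reports.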

RAVEN++: Pinpointing Fine-Grained Violations in Advertisement Videos with Active Reinforcement Reasoning

Authors:Deyi Ji, Yuekui Yang, Liqun Liu, Peng Shu, Haiyang Wu, Shaogang Tang, Xudong Chen, Shaoping Ma, Tianrun Chen, Lanyun Zhu

Advertising (Ad) is a cornerstone of the digital economy, yet the moderation of video advertisements remains a significant challenge due to their complexity and the need for precise violation localization. While recent advancements, such as the RAVEN model, have improved coarse-grained violation detection, critical gaps persist in fine-grained understanding, explainability, and generalization. To address these limitations, we propose RAVEN++, a novel framework that introduces three key innovations: 1) Active Reinforcement Learning (RL), which dynamically adapts training to samples of varying difficulty; 2) Fine-Grained Violation Understanding, achieved through hierarchical reward functions and reasoning distillation; and 3) Progressive Multi-Stage Training, which systematically combines knowledge injection, curriculum-based passive RL, and active RL. Extensive experiments on both public and proprietary datasets, on both offline scenarios and online deployed A/B Testing, demonstrate that RAVEN++ outperforms general-purpose LLMs and specialized models like RAVEN in terms of fine-grained violation understanding, reasoning capabilities, and generalization ability.

Paper and project links

PDF EMNLP 2025 (Oral, Industry Track)

Summary
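Advertising is a cornerstone of the digital economy, yet moderating video advertisements remains a major challenge because of their complexity and the need to localize violations precisely. Recent advances such as the RAVEN model have improved coarse-grained violation detection, but key gaps remain in fine-grained understanding, explainability, and generalization. To address them, the paper proposes RAVEN++, a framework with three key innovations: 1) Active Reinforcement Learning (RL), which dynamically adapts training to samples of varying difficulty; 2) Fine-Grained Violation Understanding, achieved through hierarchical reward functions and reasoning distillation; and 3) Progressive Multi-Stage Training, which systematically combines knowledge injection, curriculum-based passive RL, and active RL. Extensive experiments on public and proprietary datasets, in both offline settings and online A/B tests, show that RAVEN++ outperforms general-purpose LLMs and specialized models such as RAVEN in fine-grained violation understanding, reasoning capability, and generalization.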

Key Takeaways

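Moderating video advertisements is a significant challenge in the digital economy because of ad complexity and the need for precise violation localization.
The RAVEN model improved coarse-grained detection but remains limited in fine-grained understanding, explainability, and generalization.
RAVEN++ introduces three innovations to address these challenges: Active Reinforcement Learning, Fine-Grained Violation Understanding, and Progressive Multi-Stage Training.
Active Reinforcement Learning dynamically adapts training to samples of varying difficulty.
Fine-grained violation understanding is achieved through hierarchical reward functions and reasoning distillation.
Progressive Multi-Stage Training combines knowledge injection, curriculum-based passive RL, and active RL.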

Eliciting Chain-of-Thought in Base LLMs via Gradient-Based Representation Optimization

Authors:Zijian Wang, Yanxiang Ma, Chang Xu

Chain-of-Thought (CoT) reasoning is a critical capability for large language models (LLMs), enabling them to tackle complex multi-step tasks. While base LLMs, pre-trained on general text corpora, often struggle with reasoning due to a lack of specialized training, recent studies reveal their latent reasoning potential tied to hidden states. However, existing hidden state manipulation methods, such as linear activation steering, suffer from limitations due to their rigid and unconstrained nature, often leading to distribution shifts and degraded text quality. In this work, we propose a novel approach for eliciting CoT reasoning from base LLMs through hidden state manipulation grounded in probabilistic conditional generation. By reformulating the challenge as an optimization problem with a balanced likelihood and prior regularization framework, our method guides hidden states toward reasoning-oriented trajectories while preserving linguistic coherence. Extensive evaluations across mathematical, commonsense, and logical reasoning benchmarks demonstrate that our approach consistently outperforms existing steering methods, offering a theoretically principled and effective solution for enhancing reasoning capabilities in base LLMs.

Paper and project links

PDF AAAI2026

Summary

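Chain-of-Thought (CoT) reasoning is a key capability of large language models (LLMs) that lets them handle complex multi-step tasks. Base LLMs are pre-trained on general text corpora and, lacking specialized training, often struggle with reasoning, although recent studies tie their latent reasoning potential to hidden states. Existing hidden state manipulation methods such as linear activation steering are rigid and unconstrained, which often causes distribution shifts and degraded text quality. This work proposes a new approach, grounded in probabilistic conditional generation, that elicits CoT reasoning from base LLMs through hidden state manipulation. By recasting the task as an optimization problem that balances a likelihood term against prior regularization, the method steers hidden states toward reasoning-oriented trajectories while preserving linguistic coherence. Extensive evaluations on mathematical, commonsense, and logical reasoning benchmarks show that the approach consistently outperforms existing steering methods, offering a theoretically principled and effective way to strengthen reasoning in base LLMs.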

Key Takeaways

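Chain-of-Thought (CoT) reasoning is key for large language models (LLMs) to complete complex multi-step tasks.
Base LLMs are pre-trained on general text corpora without specialized training, which makes reasoning difficult for them.
Hidden states are tied to the latent reasoning potential of LLMs.
Existing hidden state manipulation methods suffer from distribution shifts and degraded text quality.
The study proposes a new approach, based on probabilistic conditional generation, that elicits reasoning through hidden state manipulation.
The approach guides hidden states by solving an optimization problem that balances likelihood against prior regularization, preserving linguistic coherence.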
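
The balance between a likelihood term and a prior regularizer can be illustrated with a toy problem. The sketch below is not the paper's method: it assumes a hypothetical linear probe `w` whose sigmoid output stands in for "reasoning likelihood", and a quadratic penalty that keeps the adjusted hidden state close to the original `h0`. The objective is J(h) = log sigmoid(w·h) − (λ/2)·||h − h0||², maximized by gradient ascent.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def optimize_hidden(
    h0: Sequence[float],
    w: Sequence[float],
    lam: float = 1.0,   # strength of the prior (stay near h0)
    lr: float = 0.1,
    steps: int = 200,
) -> List[float]:
    """Gradient ascent on log sigmoid(w.h) - (lam / 2) * ||h - h0||^2."""
    h = list(h0)
    for _ in range(steps):
        p = sigmoid(sum(wi * hi for wi, hi in zip(w, h)))
        # d/dh log sigmoid(w.h) = (1 - p) * w ; d/dh of the penalty = lam * (h - h0)
        h = [
            hi + lr * ((1.0 - p) * wi - lam * (hi - h0i))
            for hi, wi, h0i in zip(h, w, h0)
        ]
    return h


h0 = [0.0, 0.0]
w = [1.0, 0.0]  # hypothetical probe direction
h = optimize_hidden(h0, w)
print(h, sigmoid(sum(wi * hi for wi, hi in zip(w, h))))
```

Unlike unconstrained linear steering, which adds a fixed multiple of `w` regardless of how far the state drifts, the regularizer stops the update once the likelihood gain no longer outweighs the distance from `h0`.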

Kedreamix

https://kedreamix.github.io/Talk2Paper/Paper/2025-11-26/R1_Reasoning/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix as the source when reposting.
