发布日期: 2025-06-05

更新日期: 2025-07-06

文章字数: 21.1k

阅读时长: 86 分

阅读次数:

⚠️ 以下所有内容总结都来自于大语言模型的能力，如有错误，仅供参考，谨慎使用
🔴 请注意：千万不要用于严肃的学术场景，只能用于论文阅读前的初筛！
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ，还请您给我们一些鼓励！⭐️ HuggingFace免费体验

2025-06-05 更新

SVGenius: Benchmarking LLMs in SVG Understanding, Editing and Generation

Authors:Siqi Chen, Xinyu Dong, Haolei Xu, Xingyu Wu, Fei Tang, Hang Zhang, Yuchen Yan, Linjuan Wu, Wenqi Zhang, Guiyang Hou, Yongliang Shen, Weiming Lu, Yueting Zhuang

Large Language Models (LLMs) and Multimodal LLMs have shown promising capabilities for SVG processing, yet existing benchmarks suffer from limited real-world coverage, lack of complexity stratification, and fragmented evaluation paradigms. We introduce SVGenius, a comprehensive benchmark comprising 2,377 queries across three progressive dimensions: understanding, editing, and generation. Built on real-world data from 24 application domains with systematic complexity stratification, SVGenius evaluates models through 8 task categories and 18 metrics. We assess 22 mainstream models spanning different scales, architectures, training paradigms, and accessibility levels. Our analysis reveals that while proprietary models significantly outperform open-source counterparts, all models exhibit systematic performance degradation with increasing complexity, indicating fundamental limitations in current approaches; however, reasoning-enhanced training proves more effective than pure scaling for overcoming these limitations, though style transfer remains the most challenging capability across all model types. SVGenius establishes the first systematic evaluation framework for SVG processing, providing crucial insights for developing more capable vector graphics models and advancing automated graphic design applications. Appendix and supplementary materials (including all data and code) are available at https://zju-real.github.io/SVGenius.

大型语言模型（LLMs）和多模态LLMs在SVG处理方面表现出有前景的能力，然而现有的基准测试面临着现实世界覆盖有限、缺乏复杂性分层和评估范式碎片化的问题。我们推出了SVGenius，这是一个包含2377个查询的全面基准测试，涵盖三个渐进维度：理解、编辑和生成。SVGenius建立在来自24个应用领域的真实数据基础上，具有系统的复杂性分层，通过8个任务类别和18个指标对模型进行评估。我们对22个主流模型进行了评估，这些模型跨越不同规模、架构、训练模式和可访问性级别。我们的分析表明，专有模型显著优于开源模型，所有模型在复杂性增加时都表现出性能下降，表明当前方法存在根本性局限；然而，推理增强训练证明在克服这些局限性方面比单纯扩大规模更为有效，尽管风格转换仍然是所有模型类型中最具挑战的能力。SVGenius建立了SVG处理的首个系统评估框架，为开发更强大的矢量图形模型和推进自动化图形设计应用提供了关键见解。附录和补充材料（包括所有数据和代码）可在链接中找到。

论文及项目相关链接

PDF 19 pages,4 figures, Project page: https://zju-real.github.io/SVGenius, Code: https://github.com/ZJU-REAL/SVGenius-Bench

Summary

本文介绍了SVGenius，一个涵盖理解、编辑和生成三个渐进维度的新型SVG处理综合基准测试。该基准测试包含来自24个应用领域的真实数据，具有系统的复杂性分层，通过8个任务类别和18个指标评估模型性能。文章评估了不同规模、架构、训练模式和访问权限的22个主流模型，发现专有模型在性能上显著优于开源模型，随着复杂性的增加，所有模型的系统性能均有所下降，这表明当前方法存在基本局限性。同时，研究发现增强推理训练比单纯扩大规模更能克服这些局限性，但风格转换仍是所有模型类型中最具挑战性的能力。SVGenius为SVG处理建立了第一个系统评估框架，为开发更具能力的矢量图形模型和推动自动化图形设计应用提供了关键见解。

Key Takeaways

SVGenius是一个新型SVG处理综合基准测试，涵盖理解、编辑和生成三个维度。
该基准测试包含来自24个应用领域的真实数据，具有系统的复杂性分层。
评估结果显示专有模型性能显著优于开源模型。
随着复杂性的增加，所有模型的系统性能均有所下降，表明存在基本局限性。
增强推理训练比单纯扩大规模更能克服这些局限性。
风格转换是所有模型类型中最具挑战性的能力。

Cool Papers

点此查看论文截图

Co-Evolving LLM Coder and Unit Tester via Reinforcement Learning

Authors:Yinjie Wang, Ling Yang, Ye Tian, Ke Shen, Mengdi Wang

We propose CURE, a novel reinforcement learning framework with a dedicated reward design that co-evolves coding and unit test generation capabilities based on their interaction outcomes, without any ground-truth code as supervision. This approach enables flexible and scalable training and allows the unit tester to learn directly from the coder’s mistakes. Our derived ReasonFlux-Coder-7B and 14B models improve code generation accuracy by 5.3% and Best-of-N accuracy by 9.0% after optimization on Qwen2.5-Instruct models, outperforming similarly sized Qwen-Coder, DeepSeek-Coder, and Seed-Coder. They naturally extend to downstream tasks such as test-time scaling and agentic coding-achieving a 8.1% improvement over the base model. For the long-CoT model, our ReasonFlux-Coder-4B consistently outperforms Qwen3-4B while achieving 64.8% inference efficiency in unit test generation. Notably, we also find that our model can serve as an effective reward model for reinforcement learning on base models. Project: https://github.com/Gen-Verse/CURE

我们提出了CURE，这是一种新型强化学习框架，具有专门的奖励设计，能够根据交互结果共同进化编码和单元测试生成能力，而无需任何真实代码作为监督。这种方法使训练和测试更加灵活和可扩展，并允许单元测试人员直接从编码人员的错误中学习。我们在ReasonFlux-Coder-7B和14B模型上的优化后，提高了代码生成的准确性，在Qwen2.5-Instruct模型上的准确率提高了5.3%，Best-of-N准确率提高了9.0%，超越了同等规模的Qwen-Coder、DeepSeek-Coder和Seed-Coder。它们自然地扩展到下游任务，如测试时标度和代理编码，较基础模型提高了8.1%的改进。对于长CoT模型，我们的ReasonFlux-Coder-4B始终优于Qwen3-4B，同时在单元测试生成中达到64.8%的推理效率。值得注意的是，我们还发现我们的模型可以作为基础模型上强化学习的有效奖励模型。项目地址：https://github.com/Gen-Verse/CURE

Summary

该文提出了一种新型强化学习框架CURE，它通过专门设计的奖励机制，协同进化编码和单元测试生成能力，基于交互结果，无需任何真实代码作为监督。此方法实现了灵活且可扩展的训练，并允许测试者直接从编码错误中学习。优化后，ReasonFlux-Coder模型在代码生成和最佳N准确率上有所提升，超越了类似规模的模型。此外，该模型可自然扩展到下游任务，如测试时缩放和智能编码，并可作为基础模型的强化学习奖励模型。

Key Takeaways

CURE是一种新型强化学习框架，通过协同进化编码和单元测试生成能力，无需真实代码监督。
该方法实现了灵活且可扩展的训练，允许测试者从编码错误中学习。
ReasonFlux-Coder模型在代码生成和最佳N准确率上有所提升。
该模型超越了类似规模的模型，如Qwen-Coder、DeepSeek-Coder和Seed-Coder。
模型可自然扩展到下游任务，如测试时缩放和智能编码。
CURE框架中的模型可作为强化学习基础模型的奖励模型。

Cool Papers

点此查看论文截图

AUTOCIRCUIT-RL: Reinforcement Learning-Driven LLM for Automated Circuit Topology Generation

Authors:Prashanth Vijayaraghavan, Luyao Shi, Ehsan Degan, Vandana Mukherjee, Xin Zhang

Analog circuit topology synthesis is integral to Electronic Design Automation (EDA), enabling the automated creation of circuit structures tailored to specific design requirements. However, the vast design search space and strict constraint adherence make efficient synthesis challenging. Leveraging the versatility of Large Language Models (LLMs), we propose AUTOCIRCUIT-RL,a novel reinforcement learning (RL)-based framework for automated analog circuit synthesis. The framework operates in two phases: instruction tuning, where an LLM learns to generate circuit topologies from structured prompts encoding design constraints, and RL refinement, which further improves the instruction-tuned model using reward models that evaluate validity, efficiency, and output voltage. The refined model is then used directly to generate topologies that satisfy the design constraints. Empirical results show that AUTOCIRCUIT-RL generates ~12% more valid circuits and improves efficiency by ~14% compared to the best baselines, while reducing duplicate generation rates by ~38%. It achieves over 60% success in synthesizing valid circuits with limited training data, demonstrating strong generalization. These findings highlight the framework’s effectiveness in scaling to complex circuits while maintaining efficiency and constraint adherence, marking a significant advancement in AI-driven circuit design.

模拟电路拓扑合成是电子设计自动化（EDA）的重要组成部分，它能够实现针对特定设计要求的电路结构的自动化创建。然而，庞大的设计搜索空间和严格约束的存在，使得高效的合成面临挑战。我们利用大型语言模型（LLM）的通用性，提出AUTOCIRCUIT-RL，这是一个基于强化学习（RL）的自动化模拟电路合成新框架。该框架分为两个阶段：指令调整阶段，LLM从编码设计约束的结构化提示中学习生成电路拓扑；RL精炼阶段，使用评估有效性、效率和输出电压的奖励模型进一步改进指令调整模型。改进后的模型然后直接被用来生成满足设计约束的拓扑结构。实验结果表明，与最佳基线相比，AUTOCIRCUIT-RL生成的有效电路数量增加了约12%，效率提高了约14%，同时降低了约38%的重复生成率。在有限训练数据的情况下，它实现了超过60%的合成有效电路成功率，显示出强大的泛化能力。这些发现强调了该框架在扩展到复杂电路时保持效率和约束遵守的有效性，标志着人工智能驱动电路设计的一个重大进步。

论文及项目相关链接

PDF 9 Pages (Content), 4 Pages (Appendix), 7 figures, ICML’2025

Summary

本文介绍了一种基于强化学习（RL）的新型自动化模拟电路合成框架——AUTOCIRCUIT-RL。该框架包括指令调整和RL精炼两个阶段。指令调整阶段利用大型语言模型（LLM）从结构化提示中学习生成满足设计约束的电路拓扑结构；RL精炼阶段则通过奖励模型进一步改善模型性能，评估其有效性、效率和输出电压。该框架能够生成满足设计约束的电路拓扑结构，与现有最佳基线相比，生成的电路数量增加了约12%，效率提高了约14%，重复生成率降低了约38%。在有限训练数据的情况下，成功合成有效电路的比例达到了60%以上，展现出强大的泛化能力。此框架对于在保持效率的同时解决复杂电路的约束问题具有重要意义，标志着人工智能驱动电路设计的重要进展。

Key Takeaways

AUTOCIRCUIT-RL是一个基于强化学习的自动化模拟电路合成框架。
框架包括指令调整和RL精炼两个阶段，利用大型语言模型学习生成电路拓扑结构。
奖励模型用于评估生成电路的有效性、效率和输出电压。
与现有方法相比，AUTOCIRCUIT-RL生成的电路数量增加，效率提高，重复生成率降低。
在有限训练数据下，该框架成功合成有效电路的比例较高，显示出强大的泛化能力。

Cool Papers

点此查看论文截图

Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback

Authors:Xiaoying Zhang, Hao Sun, Yipeng Zhang, Kaituo Feng, Chao Yang, Helen Meng

Recent advances in reinforcement learning (RL) with numerical feedback, such as scalar rewards, have significantly enhanced the complex reasoning capabilities of large language models (LLMs). Despite this success, we identify three key challenges encountered by RL with solely numerical feedback: performance plateaus, limited effectiveness of self-reflection, and persistent failures. We then demonstrate that RL-finetuned models, even after exhibiting performance plateaus, can generate correct refinements on persistently failed problems by leveraging natural language feedback in the form of critiques. Building on this insight, we propose Critique-GRPO, an online RL framework that integrates both natural language and numerical feedback for effective policy optimization. Critique-GRPO enables LLMs to learn from initial responses and critique-guided refinements simultaneously while maintaining exploration. Extensive experiments using Qwen2.5-7B-Base and Qwen3-8B-Base show that Critique-GRPO consistently outperforms supervised learning-based and RL-based fine-tuning approaches across eight challenging mathematical, STEM, and general reasoning tasks, improving average pass@1 scores by approximately 4.5% and 5%, respectively. Notably, Critique-GRPO surpasses a strong baseline that incorporates expert demonstrations within online RL. Further analysis reveals two critical insights about policy exploration: (1) higher entropy does not always guarantee efficient learning from exploration, and (2) longer responses do not necessarily lead to more effective exploration.

最近，强化学习（RL）在具有数值反馈方面的进展，如标量奖励，已经显著提高了大型语言模型（LLM）的复杂推理能力。尽管如此，我们发现了仅使用数值反馈的RL所面临的三大挑战：性能瓶颈、自我反思的局限性以及持续失败。我们进而证明，即使在性能瓶颈期后，通过利用批判形式的自然语言反馈，RL微调模型也可以对持续失败的问题进行正确的改进。在此基础上，我们提出了批判性GRPO（Critique-GRPO）在线RL框架，它结合了自然语言反馈和数值反馈来进行有效的策略优化。Critique-GRPO使LLM能够同时从初始响应和批判指导的改进中学习，同时保持探索。使用Qwen2.5-7B-Base和Qwen3-8B-Base进行的广泛实验表明，Critique-GRPO在八个具有挑战性的数学、STEM和一般推理任务上始终优于基于监督学习和RL的微调方法，平均pass@1得分分别提高了约4.5%和5%。值得注意的是，Critique-GRPO超越了一个强大的基线，该基线结合了在线RL中的专家演示。进一步的分析揭示了关于策略探索的两个关键见解：（1）高熵并不总是保证从探索中有效学习，（2）更长的响应并不一定导致更有效的探索。

论文及项目相关链接

PDF 38 pages

Summary
强化学习（RL）在数值反馈上的最新进展，如标量奖励，已显著提高了大型语言模型（LLM）的复杂推理能力。然而，本文指出RL面临三大挑战：性能瓶颈、自我反思的局限性以及持续失败的问题。通过利用批判性反馈，即使在性能瓶颈后，RL微调模型仍能对持续失败的问题进行正确的改进。基于此，本文提出了整合自然语言与数值反馈的在线RL框架——Critique-GRPO，用于有效的策略优化。Critique-GRPO使LLM能够同时从初步响应和批判性指导的改进中学习，并保持探索。实验表明，Critique-GRPO在八个具有挑战性的数学、STEM和一般推理任务上，相较于基于监督学习和RL的微调方法，表现出更高的性能，平均提高约4.5%和5%。同时，对策略探索的进一步分析揭示了两个关键见解：高熵并不总是保证从探索中有效学习，长响应并不一定能带来更有效的探索。

Key Takeaways

强化学习（RL）结合数值反馈如标量奖励提高了大型语言模型（LLM）的推理能力。
RL面临性能瓶颈、自我反思的局限性及持续失败等挑战。
批判性反馈能够使RL模型在性能瓶颈后对持续失败的问题进行改进。
Critique-GRPO框架结合了自然语言与数值反馈，有效提高LLM的策略优化能力。
Critique-GRPO在多种挑战性任务上表现优越，相较于其他方法平均提高约4.5%和5%。
高熵并不总是保证从探索中有效学习。

Cool Papers

点此查看论文截图

EgoVLM: Policy Optimization for Egocentric Video Understanding

Authors:Ashwin Vinod, Shrey Pandit, Aditya Vavre, Linshen Liu

Emerging embodied AI applications, such as wearable cameras and autonomous agents, have underscored the need for robust reasoning from first person video streams. We introduce EgoVLM, a vision-language model specifically designed to integrate visual comprehension and spatial-temporal reasoning within egocentric video contexts. EgoVLM is fine-tuned via Group Relative Policy Optimization (GRPO), a reinforcement learning method adapted to align model outputs with human-like reasoning steps. Following DeepSeek R1-Zero’s approach, we directly tune using RL without any supervised fine-tuning phase on chain-of-thought (CoT) data. We evaluate EgoVLM on egocentric video question answering benchmarks and show that domain-specific training substantially improves performance over general-purpose VLMs. Our EgoVLM-3B, trained exclusively on non-CoT egocentric data, outperforms the base Qwen2.5-VL 3B and 7B models by 14.33 and 13.87 accuracy points on the EgoSchema benchmark, respectively. By explicitly generating reasoning traces, EgoVLM enhances interpretability, making it well-suited for downstream applications. Furthermore, we introduce a novel keyframe-based reward that incorporates salient frame selection to guide reinforcement learning optimization. This reward formulation opens a promising avenue for future exploration in temporally grounded egocentric reasoning.

新兴的身临其境的AI应用，如可穿戴相机和自主代理，强调了从第一人称视频流中进行稳健推理的必要性。我们推出了EgoVLM，这是一种专门设计的视觉语言模型，能够在以自我为中心的视频上下文中整合视觉理解和时空推理。EgoVLM通过群体相对策略优化（GRPO）进行微调，这是一种适应性强化的学习方法，旨在使模型输出与人类类似的推理步骤相一致。我们遵循DeepSeek R1-Zero的方法，直接使用强化学习进行调整，而无需在思维链（CoT）数据上进行任何监督微调阶段。我们在以自我为中心的视频问答基准测试上对EgoVLM进行了评估，结果表明，与通用VLM相比，特定领域的训练大大提高了性能。我们仅在非CoT以自我为中心的数据上训练的EgoVLM-3B，在EgoSchema基准测试上的表现优于基础Qwen2.5-VL 3B和7B模型，分别提高了14.33和13.87的准确率。通过明确生成推理轨迹，EgoVLM提高了可解释性，非常适合用于下游应用。此外，我们引入了一种基于关键帧的奖励方法，该方法结合了显著帧选择来指导强化学习的优化。这种奖励形式为未来的时间基础以自我为中心的推理探索开辟了一条有前途的道路。

论文及项目相关链接

PDF Our Code can be found at https://github.com/adityavavre/VidEgoVLM

Summary

新兴的身临其境的AI应用，如可穿戴相机和自主代理，强调需要从第一人称视频流中进行稳健推理的必要性。为此，我们推出了EgoVLM，一种专为第一人称视频语境设计的视觉语言模型，可整合视觉理解与时空推理。通过采用Group Relative Policy Optimization (GRPO)这一强化学习方法，EgoVLM与类人推理步骤的输出对齐。我们参考DeepSeek R1-Zero的方法，直接使用强化学习进行微调，无需在思维链数据上进行监督微调阶段。在针对第一人称视频问答的基准测试中，EgoVLM表现出色，显示特定领域的训练大幅提升了相较于通用VLM的性能。我们的EgoVLM-3B模型仅在非思维链的第一人称数据上训练，在EgoSchema基准测试中，相较于基准Qwen2.5-VL模型，其准确率提高了14.33和13.87个百分点。通过明确生成推理轨迹，EgoVLM增强了可解释性，非常适合用于下游应用。此外，我们还引入了一种新的基于关键帧的奖励，该奖励结合了显著帧选择来指导强化学习的优化。这种奖励的公式化为在时间上立足的第一人称推理探索打开了有前途的道路。

Key Takeaways

新兴的AI应用强调第一人称视频流中的稳健推理需求。
引入EgoVLM模型，结合视觉理解与时空推理。
采用Group Relative Policy Optimization (GRPO)强化学习方法进行模型微调。
直接使用强化学习进行微调，无需监督微调阶段。
EgoVLM在针对第一人称视频问答的基准测试中表现优异。
特定领域的训练显著提升性能，相较于通用VLM有大幅度提升。

Cool Papers

点此查看论文截图

StreamBP: Memory-Efficient Exact Backpropagation for Long Sequence Training of LLMs

Authors:Qijun Luo, Mengqi Li, Lei Zhao, Xiao Li

Training language models on long sequence data is a demanding requirement for enhancing the model’s capability on complex tasks, e.g., long-chain reasoning. However, as the sequence length scales up, the memory cost for storing activation values becomes huge during the Backpropagation (BP) process, even with the application of gradient checkpointing technique. To tackle this challenge, we propose a memory-efficient and exact BP method called StreamBP, which performs a linear decomposition of the chain rule along the sequence dimension in a layer-wise manner, significantly reducing the memory cost of activation values and logits. The proposed method is applicable to common objectives such as SFT, GRPO, and DPO. From an implementation perspective, StreamBP achieves less computational FLOPs and faster BP speed by leveraging the causal structure of the language model. Compared to gradient checkpointing, StreamBP scales up the maximum sequence length of BP by 2.8-5.5 times larger, while using comparable or even less BP time. Note that StreamBP’s sequence length scaling ability can be directly transferred to batch size scaling for accelerating training. We further develop a communication-efficient distributed StreamBP to effectively support multi-GPU training and broaden its applicability. Our code can be easily integrated into the training pipeline of any transformer models and is available at https://github.com/Ledzy/StreamBP.

训练语言模型进行长序列数据处理是一项艰巨要求，旨在提高模型在处理复杂任务（如长链推理）方面的能力。然而，随着序列长度的增加，反向传播（BP）过程中存储激活值的内存成本变得非常大，即使应用了梯度检查点技术也是如此。为了应对这一挑战，我们提出了一种高效内存且精确的反向传播方法，称为StreamBP。它通过逐层沿序列维度执行链式法则的线性分解，显著降低了激活值和逻辑值的内存成本。该方法适用于如SFT、GRPO和DPO等常见目标。从实现的角度来看，StreamBP利用语言模型的因果结构实现了较少的计算FLOPs和更快的BP速度。与梯度检查点相比，StreamBP将反向传播的最大序列长度扩大了2.8-5.5倍，同时使用的时间相当甚至更少。值得注意的是，StreamBP的序列长度缩放能力可以直接转移到批处理大小缩放，以加速训练。我们还开发了一种通信高效的分布式StreamBP，以有效支持多GPU训练并扩大其适用性。我们的代码可以轻松地集成到任何transformer模型的训练流程中，可在https://github.com/Ledzy/StreamBP找到。

论文及项目相关链接

PDF

Summary：针对语言模型在训练长序列数据时面临的内存挑战，提出了一种名为StreamBP的内存高效反向传播方法。它通过逐层线性分解链式法则，显著降低了激活值和logits的内存消耗。此方法适用于常见的目标函数，并提供了更少的计算浮点数操作和更快的反向传播速度。相比梯度检查点技术，StreamBP能将反向传播的最大序列长度增加2.8至5.5倍，同时保持相当的或更少的反向传播时间。此外，它还支持多GPU训练，并易于集成到任何transformer模型的训练流程中。

Key Takeaways：

语言模型在训练长序列数据时面临内存挑战。
StreamBP是一种内存高效的反向传播方法，通过逐层线性分解链式法则来降低内存消耗。
StreamBP适用于多种常见的目标函数。
StreamBP提供了更少的计算浮点数操作和更快的反向传播速度。
与梯度检查点技术相比，StreamBP在反向传播的最大序列长度方面有明显的优势。
StreamBP支持多GPU训练，提高了分布式训练的通信效率。

Cool Papers

点此查看论文截图

Cell-o1: Training LLMs to Solve Single-Cell Reasoning Puzzles with Reinforcement Learning

Authors:Yin Fang, Qiao Jin, Guangzhi Xiong, Bowen Jin, Xianrui Zhong, Siru Ouyang, Aidong Zhang, Jiawei Han, Zhiyong Lu

Cell type annotation is a key task in analyzing the heterogeneity of single-cell RNA sequencing data. Although recent foundation models automate this process, they typically annotate cells independently, without considering batch-level cellular context or providing explanatory reasoning. In contrast, human experts often annotate distinct cell types for different cell clusters based on their domain knowledge. To mimic this workflow, we introduce the CellPuzzles task, where the objective is to assign unique cell types to a batch of cells. This benchmark spans diverse tissues, diseases, and donor conditions, and requires reasoning across the batch-level cellular context to ensure label uniqueness. We find that off-the-shelf large language models (LLMs) struggle on CellPuzzles, with the best baseline (OpenAI’s o1) achieving only 19.0% batch-level accuracy. To fill this gap, we propose Cell-o1, a 7B LLM trained via supervised fine-tuning on distilled reasoning traces, followed by reinforcement learning with batch-level rewards. Cell-o1 achieves state-of-the-art performance, outperforming o1 by over 73% and generalizing well across contexts. Further analysis of training dynamics and reasoning behaviors provides insights into batch-level annotation performance and emergent expert-like reasoning. Code and data are available at https://github.com/ncbi-nlp/cell-o1.

细胞类型注释是分析单细胞RNA测序数据异质性的关键任务。尽管最近的基础模型自动执行此过程，但它们通常独立地注释细胞，而不会考虑批次级别的细胞上下文或提供解释性理由。相比之下，人类专家通常基于其领域知识为不同的细胞群注释不同的细胞类型。为了模仿这种工作流程，我们引入了CellPuzzles任务，其目标是为一批细胞分配唯一的细胞类型。此基准测试涵盖了各种组织、疾病和捐赠者条件，需要跨越批次级别的细胞上下文进行推理，以确保标签的唯一性。我们发现现成的大型语言模型（LLM）在CellPuzzles上表现挣扎，最佳基线（OpenAI的o1）仅达到19.0％的批次级别准确度。为了填补这一空白，我们提出了Cell-o1，这是一个通过蒸馏推理轨迹进行有监督微调训练的7B LLM，随后使用批次级奖励进行强化学习。Cell-o1达到了最先进的性能，较o1高出73％以上，并且在不同上下文中表现良好。对训练动态和推理行为的进一步分析提供了对批次级别注释性能和新兴专家级推理的见解。代码和数据位于https://github.com/ncbi-nlp/cell-o1。

论文及项目相关链接

PDF 28 pages; 16 tables; 7 figures; Code: https://github.com/ncbi-nlp/cell-o1

Summary

本文主要介绍了细胞类型注释在单细胞RNA测序数据异质性分析中的重要性。现有模型通常在独立于批量级别的细胞上下文的条件下对细胞进行注释。为了模拟人类专家的注释流程，引入了CellPuzzles任务，要求为批量细胞分配独特的细胞类型。研究发现，现有的大型语言模型在CellPuzzles任务上表现不佳，因此提出了通过监督微调蒸馏推理轨迹，再结合批量级别奖励进行强化学习的Cell-o1模型。Cell-o1实现了最先进的性能，在批量级别上比OpenAI的o1模型高出73%以上，并在不同上下文中具有良好的泛化能力。

Key Takeaways

细胞类型注释是分析单细胞RNA测序数据异质性的关键任务。
现有模型在注释细胞时通常不考虑批量级别的细胞上下文。
引入了CellPuzzles任务，要求根据批量细胞的特性分配独特的细胞类型。
大型语言模型（LLMs）在CellPuzzles任务上的表现不佳。
提出了Cell-o1模型，通过监督微调蒸馏推理轨迹和强化学习进行训练，实现先进性能。
Cell-o1在批量级别上显著优于OpenAI的o1模型。

Cool Papers

点此查看论文截图

It’s the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics

Authors:Matthew Kowal, Jasper Timm, Jean-Francois Godbout, Thomas Costello, Antonio A. Arechar, Gordon Pennycook, David Rand, Adam Gleave, Kellin Pelrine

Persuasion is a powerful capability of large language models (LLMs) that both enables beneficial applications (e.g. helping people quit smoking) and raises significant risks (e.g. large-scale, targeted political manipulation). Prior work has found models possess a significant and growing persuasive capability, measured by belief changes in simulated or real users. However, these benchmarks overlook a crucial risk factor: the propensity of a model to attempt to persuade in harmful contexts. Understanding whether a model will blindly ``follow orders’’ to persuade on harmful topics (e.g. glorifying joining a terrorist group) is key to understanding the efficacy of safety guardrails. Moreover, understanding if and when a model will engage in persuasive behavior in pursuit of some goal is essential to understanding the risks from agentic AI systems. We propose the Attempt to Persuade Eval (APE) benchmark, that shifts the focus from persuasion success to persuasion attempts, operationalized as a model’s willingness to generate content aimed at shaping beliefs or behavior. Our evaluation framework probes frontier LLMs using a multi-turn conversational setup between simulated persuader and persuadee agents. APE explores a diverse spectrum of topics including conspiracies, controversial issues, and non-controversially harmful content. We introduce an automated evaluator model to identify willingness to persuade and measure the frequency and context of persuasive attempts. We find that many open and closed-weight models are frequently willing to attempt persuasion on harmful topics and that jailbreaking can increase willingness to engage in such behavior. Our results highlight gaps in current safety guardrails and underscore the importance of evaluating willingness to persuade as a key dimension of LLM risk. APE is available at github.com/AlignmentResearch/AttemptPersuadeEval

说服能力是大型语言模型（LLM）的一项强大功能，它既能带来有益的应用（例如帮助人们戒烟），也会带来重大风险（例如大规模、有针对性的政治操纵）。之前的研究发现，模型的说服能力显著且不断增长，可以通过模拟或真实用户的信念变化来衡量。然而，这些基准测试忽略了一个关键的风险因素：模型在有害语境中尝试说服的倾向性。了解模型是否会在有害主题上盲目“遵循命令”进行说服（例如美化加入恐怖组织）是了解安全护栏有效性的关键。此外，了解模型是否以及何时为了达成某个目标而从事说服行为，对于理解代理人工智能系统的风险至关重要。我们提出了尝试说服评估（APE）基准测试，将重点从说服成功转向说服尝试，表现为模型生成旨在塑造信念或行为的内容的意愿。我们的评估框架使用多轮对话设置来探测前沿的LLM，模拟劝说者和被劝说者之间的对话。APE探索了包括阴谋论、有争议的问题和非争议性有害内容在内的各种主题。我们引入了一个自动化评估模型来识别说服的意愿，并测量说服尝试的频率和上下文。我们发现许多开放和封闭权重的模型都愿意在有害主题上尝试说服，而且越狱可以增加参与此类行为的意愿。我们的研究结果揭示了当前安全护栏的差距，并强调了评估说服意愿作为LLM风险关键维度的重要性。APE可在github.com/AlignmentResearch/AttemptPersuadeEval找到。

论文及项目相关链接

PDF

Summary

大型语言模型（LLM）具有强大的说服力，既能用于有益的应用（如帮助戒烟），也存在重大风险（如大规模、有针对性的政治操纵）。以往的研究主要关注模型的说服能力，但忽略了模型在有害情境下尝试说服的倾向性这一关键风险因素。本文提出尝试说服评估（APE）基准，侧重于模型的说服意愿，即生成旨在塑造信念或行为的内容的意愿。评估框架使用多回合对话设置来探索前沿LLM，并介绍了一个自动化评估模型来识别说服意愿并测量说服尝试的频率和上下文。研究结果表明，许多开源和封闭模型的权重经常在有害主题上表现出强烈的说服意愿，越狱技术可能增加这种行为的意愿。本文强调当前安全护栏的差距并突出评估说服意愿作为LLM风险的关键维度的重要性。APE基准可在github.com/AlignmentResearch/AttemptPersuadeEval获取。

Key Takeaways

大型语言模型（LLM）具有强大的说服力，能够应用于有益及存在风险的不同场景。
以往的研究主要关注模型的说服能力，但新研究强调了解模型在有害情境下尝试说服的倾向性的重要性。
提出了尝试说服评估（APE）基准，以评估模型的说服意愿，并不仅仅关注其说服的成功率。
APE评估框架使用多回合对话设置，并引入自动化评估模型来识别模型的说服意愿及测量其说服尝试的频率和上下文。
研究发现，许多LLM模型在有害主题上经常表现出强烈的说服意愿，并且某些技术如“越狱”可能增加这种行为的意愿。
当前的安全防护措施存在缺口，需要更全面地评估LLM的风险，其中说服意愿是一个关键维度。

Cool Papers

点此查看论文截图

Demystifying Reasoning Dynamics with Mutual Information: Thinking Tokens are Information Peaks in LLM Reasoning

Authors:Chen Qian, Dongrui Liu, Haochen Wen, Zhen Bai, Yong Liu, Jing Shao

Large reasoning models (LRMs) have demonstrated impressive capabilities in complex problem-solving, yet their internal reasoning mechanisms remain poorly understood. In this paper, we investigate the reasoning trajectories of LRMs from an information-theoretic perspective. By tracking how mutual information (MI) between intermediate representations and the correct answer evolves during LRM reasoning, we observe an interesting MI peaks phenomenon: the MI at specific generative steps exhibits a sudden and significant increase during LRM’s reasoning process. We theoretically analyze such phenomenon and show that as MI increases, the probability of model’s prediction error decreases. Furthermore, these MI peaks often correspond to tokens expressing reflection or transition, such as Hmm'', Wait’’ and ``Therefore,’’ which we term as the thinking tokens. We then demonstrate that these thinking tokens are crucial for LRM’s reasoning performance, while other tokens has minimal impacts. Building on these analyses, we propose two simple yet effective methods to improve LRM’s reasoning performance, by delicately leveraging these thinking tokens. Overall, our work provides novel insights into the reasoning mechanisms of LRMs and offers practical ways to improve their reasoning capabilities. The code is available at https://github.com/ChnQ/MI-Peaks.

大型推理模型（LRMs）在复杂问题解决方面展现出了令人印象深刻的能力，然而其内部推理机制仍然知之甚少。在本文中，我们从信息论的角度研究LRMs的推理轨迹。通过跟踪LRM推理过程中中间表示与正确答案之间的互信息（MI）的演变，我们观察到一个有趣的互信息峰值现象：在特定的生成步骤中，互信息会突然出现并显著增加。我们进行理论分析，并表明随着互信息的增加，模型预测错误的概率降低。此外，这些互信息峰值通常对应于表示反思或过渡的标记，如“嗯”、“等等”和“因此”，我们称之为思考标记。然后，我们证明这些思考标记对LRM的推理性能至关重要，而其他标记的影响最小。基于这些分析，我们提出两种简单有效的方法，通过巧妙利用这些思考标记来提高LRM的推理性能。总的来说，我们的工作为理解LRM的推理机制提供了新的见解，并为提高其推理能力提供了实用方法。相关代码可通过https://github.com/ChnQ/MI-Peaks获取。

论文及项目相关链接

PDF Preprint. Under review

Summary

本文利用信息论视角研究大型推理模型（LRMs）的推理轨迹。通过追踪模型推理过程中中间表示与正确答案之间的互信息（MI）变化，观察到MI峰现象：在LRM推理过程的特定生成步骤中，互信息会出现突然且显著的增加。理论上分析此现象，发现随着MI的增加，模型预测错误的概率降低。同时，这些MI峰通常对应于表示反思或过渡的标记词，如“Hmm”、“Wait”和“Therefore”，称为思考标记词。研究证明这些思考标记词对LRM的推理性能至关重要，而其他标记词影响较小。基于这些分析，本文提出两种简单有效的方法来提高LRM的推理性能，通过巧妙利用这些思考标记词。总体而言，本文为理解LRM的推理机制提供了新的视角，并为提高其推理能力提供了实用方法。

Key Takeaways

大型推理模型（LRMs）的推理机制尚不完全清楚，本文利用信息论视角进行研究。
通过追踪互信息（MI）变化观察到MI峰现象，即在特定生成步骤中互信息会突然显著增加。
MI峰与表示反思或过渡的标记词（思考标记词）相关，如“Hmm”，“Wait”和“Therefore”。
思考标记词对LRM的推理性能至关重要，而其他标记词影响较小。
本文提出两种简单有效的方法提高LRM的推理性能，通过巧妙利用思考标记词。
研究结果为理解LRM的推理机制提供了新的视角。

Cool Papers

点此查看论文截图

BNPO: Beta Normalization Policy Optimization

Authors:Changyi Xiao, Mengdi Zhang, Yixin Cao

Recent studies, including DeepSeek-R1 and Kimi-k1.5, have demonstrated that reinforcement learning with rule-based, binary-valued reward functions can significantly enhance the reasoning capabilities of large language models. These models primarily utilize REINFORCE-based policy optimization techniques, such as REINFORCE with baseline and group relative policy optimization (GRPO). However, a key limitation remains: current policy optimization methods either neglect reward normalization or employ static normalization strategies, which fail to adapt to the dynamic nature of policy updates during training. This may result in unstable gradient estimates and hinder training stability. To address this issue, we propose Beta Normalization Policy Optimization (BNPO), a novel policy optimization method that adaptively normalizes rewards using a Beta distribution with dynamically updated parameters. BNPO aligns the normalization with the changing policy distribution, enabling more precise and lower-variance gradient estimation, which in turn promotes stable training dynamics. We provide theoretical analysis demonstrating BNPO’s variance-reducing properties and show that it generalizes both REINFORCE and GRPO under binary-valued reward settings. Furthermore, we introduce an advantage decomposition mechanism to extend BNPO’s applicability to more complex reward systems. Experimental results confirm that BNPO achieves state-of-the-art performance among policy optimization methods on reasoning tasks. The code is available at https://github.com/changyi7231/BNPO.

最近的研究，包括DeepSeek-R1和Kimi-k1.5，已经证明基于规则、二值奖励函数的强化学习可以显著增强大型语言模型的推理能力。这些模型主要利用基于REINFORCE的策略优化技术，如带有基准的REINFORCE和群组相对策略优化（GRPO）。然而，仍存在一个关键局限性：当前的策略优化方法要么忽视奖励归一化，要么采用静态归一化策略，这些策略无法适应训练过程中策略更新的动态性质。这可能导致梯度估计不稳定，阻碍训练稳定性。为了解决这一问题，我们提出了Beta归一化策略优化（BNPO），这是一种新的策略优化方法，它利用具有动态更新参数的Beta分布自适应地归一化奖励。BNPO使归一化与变化的策略分布相一致，能够实现更精确、低方差的梯度估计，从而促进了稳定的训练动态。我们提供了理论分析，证明了BNPO的降方差属性，并表明它在二值奖励设置下可以概括REINFORCE和GRPO。此外，我们引入了一种优势分解机制，以将BNPO扩展到更复杂的奖励系统。实验结果证实，BNPO在推理任务上的策略优化方法方面达到了最新技术水平。代码可在https://github.com/changyi7231/BNPO找到。

论文及项目相关链接

PDF

Summary
强化学习利用基于规则的二元奖励函数能够显著提升大语言模型的推理能力。现有政策优化方法未适应动态政策更新，造成训练不稳定。因此，研究人员提出了Beta归一化政策优化（BNPO），利用动态更新的Beta分布参数自适应地归一化奖励，以减少方差并实现稳定训练。该方法具有强大的优势分解机制，能应用于更复杂的奖励系统。实验证明，BNPO在推理任务上的表现优于其他政策优化方法。代码已公开。

Key Takeaways

强化学习可以增强大语言模型的推理能力。
当前政策优化方法在奖励归一化上存在缺陷。
BNPO通过Beta归一化策略优化来解决此问题。
BNPO可实现低方差和高精度的梯度估计。
BNPO可适用于更复杂的奖励系统，并具有优势分解机制。
实验证明BNPO在推理任务上的表现优于其他政策优化方法。

Cool Papers

点此查看论文截图

Rethinking Machine Unlearning in Image Generation Models

Authors:Renyang Liu, Wenjie Feng, Tianwei Zhang, Wei Zhou, Xueqi Cheng, See-Kiong Ng

With the surge and widespread application of image generation models, data privacy and content safety have become major concerns and attracted great attention from users, service providers, and policymakers. Machine unlearning (MU) is recognized as a cost-effective and promising means to address these challenges. Despite some advancements, image generation model unlearning (IGMU) still faces remarkable gaps in practice, e.g., unclear task discrimination and unlearning guidelines, lack of an effective evaluation framework, and unreliable evaluation metrics. These can hinder the understanding of unlearning mechanisms and the design of practical unlearning algorithms. We perform exhaustive assessments over existing state-of-the-art unlearning algorithms and evaluation standards, and discover several critical flaws and challenges in IGMU tasks. Driven by these limitations, we make several core contributions, to facilitate the comprehensive understanding, standardized categorization, and reliable evaluation of IGMU. Specifically, (1) We design CatIGMU, a novel hierarchical task categorization framework. It provides detailed implementation guidance for IGMU, assisting in the design of unlearning algorithms and the construction of testbeds. (2) We introduce EvalIGMU, a comprehensive evaluation framework. It includes reliable quantitative metrics across five critical aspects. (3) We construct DataIGM, a high-quality unlearning dataset, which can be used for extensive evaluations of IGMU, training content detectors for judgment, and benchmarking the state-of-the-art unlearning algorithms. With EvalIGMU and DataIGM, we discover that most existing IGMU algorithms cannot handle the unlearning well across different evaluation dimensions, especially for preservation and robustness. Code and models are available at https://github.com/ryliu68/IGMU.

随着图像生成模型的涌现和广泛应用，数据隐私和内容安全成为主要关注点，并引起了用户、服务提供商和政策制定者的极大关注。机器遗忘（MU）被认为是一种经济高效且前景广阔的解决这些挑战的手段。尽管有一些进展，但图像生成模型的遗忘（IGMU）在实践中仍然面临显著的差距，例如任务辨别和遗忘指南不明确，缺乏有效的评估框架和不可靠的评估指标。这些可能阻碍对遗忘机制的理解和实用遗忘算法的设计。我们对现有的最新遗忘算法和评估标准进行了全面的评估，并发现了图像生成模型遗忘（IGMU）任务中的几个关键缺陷和挑战。受这些局限性的驱动，我们做出了几项核心贡献，以促进对IGMU的全面理解、标准化分类和可靠评估。具体来说，（1）我们设计了CatIGMU，这是一种新型分层任务分类框架。它为IGMU提供了详细的实施指南，有助于设计遗忘算法和构建测试平台。（2）我们介绍了EvalIGMU，这是一个全面的评估框架。它包括五个关键方面的可靠定量指标。（3）我们构建了DataIGM，这是一个高质量的遗忘数据集，可用于对IGMU进行全面评估、训练内容检测器进行判读和基准测试最新遗忘算法。借助EvalIGMU和DataIGM，我们发现大多数现有IGMU算法在不同的评估维度上无法很好地处理遗忘问题，尤其是在保持性和稳健性方面。代码和模型可在链接中找到。

论文及项目相关链接

PDF Accepted by ACM CCS 2025

Summary
在图像生成模型广泛应用的背景下，数据隐私和内容安全受到关注。机器遗忘（MU）被视为解决这些挑战的有前途且经济的手段。尽管有所进展，但图像生成模型的遗忘（IGMU）在实践中仍存在显著差距，如任务辨别不清、遗忘准则不明确、缺乏有效的评估框架和不可靠的评估指标等。我们对现有的最新遗忘算法和评估标准进行了全面评估，发现IGMU任务中存在几个关键缺陷和挑战。为解决这些问题，我们做出了几项核心贡献，以促进对IGMU的全面理解、标准化分类和可靠评估。具体来说：（1）我们设计了CatIGMU，一种新型层次任务分类框架，为IGMU的详细实施提供指导，协助遗忘算法的设计和测试平台的构建。（2）我们推出了EvalIGMU，一个全面的评估框架，包括五个关键方面的可靠定量指标。（3）我们构建了DataIGM，一个高质量的遗忘数据集，可用于对IGMU进行广泛评估、训练内容检测器进行判决以及评估最新的遗忘算法。通过EvalIGMU和DataIGM的发现表明大多数现有IGMU算法在评估维度上的表现不佳，尤其在保留性和鲁棒性方面。

Key Takeaways

随着图像生成模型的广泛应用，数据隐私和内容安全成为主要关注点。
机器遗忘被视为解决图像生成模型中的数据隐私和内容安全挑战的有效手段。
图像生成模型的遗忘（IGMU）在实践中存在显著差距和挑战，如任务辨别不清、遗忘准则不明确等。
团队针对IGMU任务提出了核心贡献：设计CatIGMU分类框架，引入EvalIGMU评估框架以及构建DataIGM数据集。
通过EvalIGMU和DataIGM发现现有IGMU算法在多个评估维度上的不足。
CatIGMU为IGMU的详细实施提供指导，帮助设计遗忘算法和构建测试平台。

Cool Papers

点此查看论文截图

BenLOC: A Benchmark for Learning to Configure MIP Optimizers

Authors:Hongpei Li, Ziyan He, Yufei Wang, Wenting Tu, Shanwen Pu, Qi Deng, Dongdong Ge

The automatic configuration of Mixed-Integer Programming (MIP) optimizers has become increasingly critical as the large number of configurations can significantly affect solver performance. Yet the lack of standardized evaluation frameworks has led to data leakage and over-optimistic claims, as prior studies often rely on homogeneous datasets and inconsistent experimental setups. To promote a fair evaluation process, we present BenLOC, a comprehensive benchmark and open-source toolkit, which not only offers an end-to-end pipeline for learning instance-wise MIP optimizer configurations, but also standardizes dataset selection, train-test splits, feature engineering and baseline choice for unbiased and comprehensive evaluations. Leveraging this framework, we conduct an empirical analysis on five well-established MIP datasets and compare classical machine learning models with handcrafted features against state-of-the-art deep-learning techniques. The results demonstrate the importance of datasets, features and baseline criteria proposed by BenLOC and the effectiveness of BenLOC in providing unbiased and comprehensive evaluations.

混合整数编程（MIP）优化器的自动配置变得日益关键，因为大量的配置会显著影响求解器的性能。然而，由于缺乏标准化的评估框架，导致数据泄露和过于乐观的声明，因为先前的研究往往依赖于同质的数据集和实验设置不一致。为了促进公平的评估过程，我们提出了BenLOC，这是一个全面的基准测试和开源工具包，它不仅提供了一个端到端的管道来学习实例化的MIP优化器配置，还标准化了数据集选择、训练测试分割、特征工程和基线选择，以进行客观全面的评估。利用该框架，我们对五个成熟的MIP数据集进行了实证分析，并将手工制作的特征的传统机器学习模型与最新的深度学习技术进行了比较。结果表明，BenLOC所提出的数据集、特征和基线标准是重要的，BenLOC在提供客观全面的评估方面是有效的。

论文及项目相关链接

PDF A Benchmark for learning to configurate MIP Optimizers (Solvers)

Summary：随着混合整数编程（MIP）优化器配置数量的增加，自动配置的重要性日益凸显。然而，由于缺乏标准化的评估框架，导致数据泄露和过于乐观的声明。为此，我们提出BenLOC，一个全面的基准测试和开源工具包，不仅提供端到端的MIP优化器配置学习流程，还标准化数据集选择、训练测试分割、特征工程和基线选择，以便进行客观全面的评估。借助该框架，我们在五个知名的MIP数据集上进行了实证分析，比较了带有手工特征的经典机器学习模型与最新深度学习方法，证明了BenLOC所提出的数据集、特征和基线标准的重要性及有效性。

Key Takeaways：

自动配置MIP优化器非常重要，因为大量的配置会显著影响求解器的性能。
缺乏标准化的评估框架导致数据泄露和过于乐观的研究结果。
BenLOC是一个全面的基准测试和开源工具包，提供端到端的MIP优化器配置学习流程。
BenLOC标准化数据集选择、训练测试分割、特征工程和基线选择。
通过实证分析比较了经典机器学习模型和最新深度学习方法在BenLOC框架下的表现。
BenLOC所提出的数据集、特征和基线标准对于评估MIP优化器配置至关重要。

Cool Papers

点此查看论文截图

RACE-Align: Retrieval-Augmented and Chain-of-Thought Enhanced Preference Alignment for Large Language Models

Authors:Qihang Yan, Xinyu Zhang, Luming Guo, Qi Zhang, Feifan Liu

Large Language Models (LLMs) struggle with accuracy, domain-specific reasoning, and interpretability in vertical domains. Traditional preference alignment methods like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) often overlook the underlying knowledge sources and reasoning logic. This paper introduces RACE-Align (Retrieval-Augmented and Chain-of-Thought Enhanced Alignment), a novel framework designed to address these limitations. RACE-Align systematically constructs a binary preference dataset incorporating external knowledge support and explicit Chain-of-Thought (CoT) reasoning, then aligns LLMs using the DPO algorithm. The core innovation lies in its preference data construction strategy: it integrates AI-driven retrieval for factual grounding, enhancing knowledgeability and accuracy, and emphasizes the optimization of domain-specific CoT, treating the reasoning process itself as a key preference dimension. A multi-stage, AI-driven refinement pipeline cost-effectively generates these preference pairs. Experimental validation in Traditional Chinese Medicine (TCM) using Qwen3-1.7B as the base model demonstrates that RACE-Align significantly outperforms the original base model and a model fine-tuned only with Supervised Fine-Tuning (SFT). Improvements were observed across multiple dimensions, including answer accuracy, information richness, application of TCM thinking patterns, logicality and depth of reasoning, and interpretability. These findings suggest RACE-Align offers an effective pathway to enhance LLMs’ knowledge application, reasoning reliability, and process transparency in complex vertical domains.

大型语言模型（LLM）在垂直领域面临准确性、领域特定推理和可解释性的挑战。传统的偏好对齐方法，如基于人类反馈的强化学习（RLHF）和直接偏好优化（DPO），往往忽视了潜在的知识来源和推理逻辑。本文介绍了RACE-Align（检索增强与思维链增强对齐）这一新型框架，旨在解决这些局限性。RACE-Align系统地构建了一个二进制偏好数据集，融入了外部知识支持和明确的思维链（CoT）推理，然后使用DPO算法对齐LLM。其核心创新之处在于其偏好数据构建策略：它结合了AI驱动的检索来实现事实基础，提高了知识性和准确性，并强调领域特定思维链的优化，将推理过程本身视为一个关键的偏好维度。多阶段、AI驱动的优化管道以成本效益的方式生成这些偏好对。以Qwen3-1.7B为基础模型，在传统中医领域进行实验验证，结果表明RACE-Align显著优于原始基础模型和仅使用监督微调（SFT）的模型。在多个维度上均观察到改进，包括答案准确性、信息丰富性、中医思维模式的运用、逻辑性和深度推理以及可解释性。这些发现表明，RACE-Align为增强LLM在复杂垂直领域的知识应用、推理可靠性和过程透明度提供了一条有效路径。

论文及项目相关链接

PDF

Summary

大型语言模型（LLMs）在垂直领域存在准确性、领域特定推理和可解释性的挑战。传统偏好对齐方法如强化学习从人类反馈（RLHF）和直接偏好优化（DPO）常常忽略底层知识来源和推理逻辑。本文提出RACE-Align框架，通过结合外部知识支持和明确的思维链（CoT）推理，来解决这些问题。RACE-Align利用DPO算法对齐LLMs，其核心创新之处在于其偏好数据构建策略，通过AI驱动检索为事实基础增强知识性和准确性，并强调领域特定思维链的优化。在中医领域的实验验证表明，RACE-Align显著优于基础模型，并优于仅通过监督微调（SFT）的模型。改进表现在多个维度，包括答案准确性、信息丰富性、中医思维模式的运用、逻辑推理性和深度以及可解释性。这表明RACE-Align能有效提升LLMs在复杂垂直领域的知识应用、推理可靠性和过程透明度。

Key Takeaways

大型语言模型（LLMs）在垂直领域存在挑战，包括准确性、领域特定推理和可解释性。
传统偏好对齐方法忽略底层知识来源和推理逻辑。
RACE-Align框架通过结合外部知识支持和明确的思维链（CoT）推理来解决这些问题。
RACE-Align利用AI驱动检索增强知识性和准确性，并强调领域特定思维链的优化。
RACE-Align显著优于基础模型，改进表现在多个维度，包括答案质量、信息丰富性、对中医思维模式的运用、逻辑推理性和深度以及可解释性。
RACE-Align提升LLMs在复杂垂直领域的知识应用、推理可靠性和过程透明度。

Cool Papers

点此查看论文截图

Heterogeneous Group-Based Reinforcement Learning for LLM-based Multi-Agent Systems

Authors:Guanzhong Chen, Shaoxiong Yang, Chao Li, Wei Liu, Jian Luan, Zenglin Xu

Large Language Models (LLMs) have achieved remarkable success across diverse natural language processing tasks, yet their deployment in real-world applications is hindered by fixed knowledge cutoffs and difficulties in generating controllable, accurate outputs in a single inference. Multi-agent systems (MAS) built from specialized LLM agents offer a promising solution, enabling dynamic collaboration and iterative reasoning. However, optimizing these systems remains a challenge, as conventional methods such as prompt engineering and supervised fine-tuning entail high engineering overhead and limited adaptability. Reinforcement learning (RL), particularly multi-agent reinforcement learning (MARL), provides a scalable framework by refining agent policies based on system-level feedback. Nevertheless, existing MARL algorithms, such as Multi-Agent Proximal Policy Optimization (MAPPO), rely on Critic networks, which can cause training instability and increase computational burden. To address these limitations and target the prototypical Multi-Agent Search System (MASS), we propose Multi-Agent Heterogeneous Group Policy Optimization (MHGPO), a novel Critic-free algorithm that guides policy updates by estimating relative reward advantages across heterogeneous groups of rollouts. MHGPO eliminates the need for Critic networks, enhancing stability and reducing computational overhead. Additionally, we introduce three group rollout sampling strategies that trade off between efficiency and effectiveness. Experiments on a multi-agent LLM-based search system demonstrate that MHGPO consistently outperforms MAPPO in both task performance and computational efficiency, without requiring warm-up, underscoring its potential for stable and scalable optimization of complex LLM-based MAS.

大规模语言模型（LLM）在不同自然语言处理任务中取得了显著的成功，但它们在现实世界应用中的部署受到固定知识截断和难以在单次推断中产生可控、准确输出的限制。由专业LLM代理构建的多智能体系统（MAS）提供了有前景的解决方案，能够实现动态协作和迭代推理。然而，优化这些系统仍然是一个挑战，因为传统的方法，如提示工程和监督微调，涉及高工程开销和有限的适应性。强化学习（RL），特别是多智能体强化学习（MARL），提供了一个可扩展的框架，可以根据系统级别的反馈来优化智能体策略。然而，现有的MARL算法，如多智能体近端策略优化（MAPPO），依赖于评论家网络，可能导致训练不稳定并增加计算负担。为了解决这些局限性并面向典型的多智能体搜索系统（MASS），我们提出了多智能体异质群组策略优化（MHGPO），这是一种全新的无评论家算法，它通过估计不同智能体之间的相对奖励优势来指导策略更新。MHGPO消除了对评论家网络的需求，提高了稳定性并降低了计算开销。此外，我们介绍了三种群组滚动采样策略，在效率和有效性之间进行权衡。在多智能体LLM基于搜索系统上的实验表明，MHGPO在任务性能和计算效率方面均持续优于MAPPO，且无需预热过程，突显其在稳定和可扩展优化复杂LLM基于MAS方面的潜力。

论文及项目相关链接

PDF

Summary

大型语言模型（LLM）在自然语言处理任务中取得了显著的成功，但在现实世界应用中的部署受到了固定知识截断和生成可控、准确输出的困难限制。多智能体系统（MAS）由专业LLM智能体构建，提供了一种有前景的解决方案，可实现动态协作和迭代推理。然而，优化这些系统仍然是一个挑战，传统方法如提示工程和监督微调涉及高工程开销和有限的适应性。强化学习（RL），特别是多智能体强化学习（MARL），提供了一个可扩展的框架，可以根据系统级别的反馈优化智能体策略。然而，现有的MARL算法，如多智能体近端策略优化（MAPPO），依赖于评论家网络，可能导致训练不稳定并增加计算负担。为了解决这个问题并面向典型的多智能体搜索系统（MASS），我们提出了多智能体异质组策略优化（MHGPO），这是一种新型的无需评论家的算法，它通过估计不同组rollouts之间的相对奖励优势来指导策略更新。MHGPO消除了对评论家网络的需要，提高了稳定性和计算效率。此外，我们介绍了三种组rollouts采样策略，在效率和效果之间寻求平衡。在基于多智能体LLM的搜索系统上的实验表明，MHGPO在任务性能和计算效率方面均优于MAPPO，不需要预热，突显其在稳定和可扩展优化复杂LLM基MAS方面的潜力。

Key Takeaways

大型语言模型（LLM）在自然语言处理任务中表现优异，但在实际应用中存在固定知识截断和输出控制难题。
多智能体系统（MAS）通过动态协作和迭代推理为解决这些问题提供了解决方案。
强化学习（RL）特别是多智能体强化学习（MARL）为优化智能体系统提供了一个可扩展框架。
现有MARL算法如MAPPO存在训练不稳定和计算负担重的问题。
提出的Multi-Agent Heterogeneous Group Policy Optimization (MHGPO) 算法是一种新型的评论家独立算法，通过估计不同组rollouts的相对奖励优势来优化策略更新。
MHGPO提高了稳定性和计算效率，消除了对评论家网络的需求。

Cool Papers

点此查看论文截图

Open-Set Living Need Prediction with Large Language Models

Authors:Xiaochong Lan, Jie Feng, Yizhou Sun, Chen Gao, Jiahuan Lei, Xinlei Shi, Hengliang Luo, Yong Li

Living needs are the needs people generate in their daily lives for survival and well-being. On life service platforms like Meituan, user purchases are driven by living needs, making accurate living need predictions crucial for personalized service recommendations. Traditional approaches treat this prediction as a closed-set classification problem, severely limiting their ability to capture the diversity and complexity of living needs. In this work, we redefine living need prediction as an open-set classification problem and propose PIGEON, a novel system leveraging large language models (LLMs) for unrestricted need prediction. PIGEON first employs a behavior-aware record retriever to help LLMs understand user preferences, then incorporates Maslow’s hierarchy of needs to align predictions with human living needs. For evaluation and application, we design a recall module based on a fine-tuned text embedding model that links flexible need descriptions to appropriate life services. Extensive experiments on real-world datasets demonstrate that PIGEON significantly outperforms closed-set approaches on need-based life service recall by an average of 19.37%. Human evaluation validates the reasonableness and specificity of our predictions. Additionally, we employ instruction tuning to enable smaller LLMs to achieve competitive performance, supporting practical deployment.

生活需求是人们日常生活中为生存和福祉所产生的需求。在美团等生活服务平台上，用户购买行为是由生活需求驱动的，因此进行准确的生活需求预测对于个性化服务推荐至关重要。传统方法将需求预测视为一个封闭集分类问题，这严重限制了其捕捉生活需求多样性和复杂性的能力。

在这项工作中，我们重新定义了生活需求预测为一个开放集分类问题，并提出了一种利用大型语言模型（LLM）进行无限制需求预测的新型系统PIGEON。PIGEON首先采用行为感知记录检索器帮助LLM理解用户偏好，然后结合马斯洛的需求层次理论，使预测与生活需求相符合。

论文及项目相关链接

PDF ACL 2025 Findings

Summary

本文介绍了美团等生活服务平台上用户需求的预测问题。传统方法将其视为封闭集分类问题，无法捕捉生活需求的多样性和复杂性。本文重新定义为开放集分类问题，并提出PIGEON系统，利用大型语言模型（LLMs）进行无限制的需求预测。PIGEON通过行为感知记录检索器帮助LLMs理解用户偏好，并结合马斯洛的需求层次理论进行预测。实验证明，PIGEON在基于需求的生活服务召回方面显著优于封闭集方法，平均提高19.37%。

Key Takeaways

生活需求预测在个性化服务推荐中具有重要性。
传统方法将生活需求预测视为封闭集分类问题，限制了其多样性和复杂性的捕捉。
PIGEON系统利用大型语言模型（LLMs）进行开放集分类的生活需求预测。
PIGEON通过行为感知记录检索器理解用户偏好，结合马斯洛需求层次理论进行预测。
PIGEON在真实数据集上的实验结果表明，其在基于需求的生活服务召回方面显著优于传统方法。
人类评估验证了PIGEON预测的合理性和特异性。

Cool Papers

点此查看论文截图

Computational Thinking Reasoning in Large Language Models

Authors:Kechi Zhang, Ge Li, Jia Li, Huangzhao Zhang, Jingjing Xu, Hao Zhu, Lecheng Wang, Jia Li, Yihong Dong, Jing Mai, Bin Gu, Zhi Jin

While large language models (LLMs) have demonstrated remarkable reasoning capabilities, they often struggle with complex tasks that require specific thinking paradigms, such as divide-and-conquer and procedural deduction, \etc Previous researches integrate external, reliable tools to alleviate logical inconsistencies and hallucinations in LLMs’ problem-solving processes. However, we argue that the root challenge is more profound: LLMs lack the complex thinking paradigms (\ie, computational thinking) during reasoning. In this paper, we propose Computational Thinking Model (CTM), a novel framework that incorporates computational thinking paradigms into LLMs. This framework enables LLMs to reformulate complex problems through decomposition, abstraction, reduction, and simulation, among other techniques. Specifically, live code execution is seamlessly integrated into the reasoning process, allowing CTM to think by computing. CTM directly instills computational thinking objectives into LLMs through tailored reinforcement learning rewards, which encourages problem simplification, modular planning, and iterative verification. We conduct extensive evaluations on multiple code generation and mathematical benchmarks. The results demonstrate that CTM outperforms conventional reasoning models and tool-augmented baselines in terms of accuracy, interpretability, and generalizability. We hope this study offers valuable insights for AI reasoning, where LLMs can transform problems into robust, verifiable, and scalable computational workflows, much like computer scientists do.

虽然大型语言模型（LLM）已经展现出惊人的推理能力，但它们在处理需要特定思维模式的复杂任务时常常遇到困难，例如分而治之和程序推理等。之前的研究通过整合外部可靠的工具来缓解LLM在解决问题过程中的逻辑不一致和幻觉。然而，我们认为真正的挑战更为深刻：LLM在推理过程中缺乏复杂的思维模式（即计算思维）。在本文中，我们提出了计算思维模型（CTM），这是一个将计算思维模式融入LLM的新型框架。该框架使LLM能够通过分解、抽象、简化和模拟等技术重新制定复杂问题。具体来说，实时代码执行无缝地融入到推理过程中，使CTM能够通过计算来思考。CTM通过定制的强化学习奖励直接将计算思维目标植入LLM，鼓励问题简化、模块化规划和迭代验证。我们在多个代码生成和数学基准测试上进行了广泛评估。结果表明，在准确性、可解释性和泛化性方面，CTM优于传统推理模型和工具增强基线。我们希望这项研究能为AI推理提供有价值的见解，使LLM能够将问题转化为稳健、可验证和可扩展的计算工作流程，就像计算机科学家所做的那样。

论文及项目相关链接

PDF

Summary

大型语言模型在复杂任务上展现出卓越的推理能力，但仍面临特定思维模式的挑战，如分而治之、程序性推理等。为缓解逻辑不一致和幻觉问题，先前研究将外部可靠工具集成到大型语言模型中。然而，本文认为问题的根源在于大型语言模型缺乏复杂思维模式（即计算思维）来进行推理。因此，本文提出了计算思维模型（CTM），该模型将计算思维模式融入大型语言模型，使其能够通过分解、抽象、简化、模拟等技术重新构建复杂问题。特别是通过无缝集成实时代码执行，使CTM能够通过计算进行推理。此外，CTM通过定制的强化学习奖励直接灌输计算思维目标，鼓励问题简化、模块化规划和迭代验证。在多个代码生成和数学基准测试上进行的广泛评估表明，CTM在准确性、可解释性和通用性方面优于传统的推理模型和工具增强基线。

Key Takeaways

大型语言模型在复杂任务上展现出推理能力，但仍面临特定思维模式的挑战。
逻辑不一致和幻觉问题是大型语言模型在解决复杂任务时面临的挑战。
先前的研究通过集成外部工具来缓解这些问题，但问题根源在于缺乏计算思维。
提出的计算思维模型（CTM）融入计算思维模式到大型语言模型中。
CTM能够重新构建复杂问题，通过分解、抽象、简化、模拟等技术进行推理。
CTM通过实时代码执行的无缝集成，能够通过计算进行推理。
CTM通过定制的强化学习奖励直接灌输计算思维目标，并在多个基准测试中表现出优异性能。

Cool Papers

点此查看论文截图

Truly Assessing Fluid Intelligence of Large Language Models through Dynamic Reasoning Evaluation

Authors:Yue Yang, MingKang Chen, Qihua Liu, Mengkang Hu, Qiguang Chen, Gengrui Zhang, Shuyue Hu, Guangtao Zhai, Yu Qiao, Yu Wang, Wenqi Shao, Ping Luo

Recent advances in large language models (LLMs) have demonstrated impressive reasoning capacities that mirror human-like thinking. However, whether LLMs possess genuine fluid intelligence (i.e., the ability to reason abstractly and generalize rules in novel situations) remains an open question. Existing reasoning benchmarks either focus on domain-specific knowledge (crystallized intelligence) or lack interpretability. To address these limitations, we propose DRE-Bench, a dynamic reasoning evaluation benchmark grounded in a hierarchical cognitive framework. DRE-Bench consists of 36 abstract reasoning tasks organized across four cognitive levels, with each task featuring multiple dynamic variants that test the same underlying latent rule. This design enables fine-grained, interpretable, and reliable assessments of fluid intelligence. We evaluate a range of state-of-the-art LLMs, including both general LLMs (GPT-4o, Claude 3.7) and reasoning LLMs (o1, DeepSeek-R1, QwQ, Skywork-OR1). Experimental results reveal that although most LLMs achieve competent and robust performance in low-level cognition, they struggle with high-level cognition and exhibit limited generalization as task complexity grows. Our findings highlight the gap between current LLMs and true human-like fluid intelligence and offer a new path for systematically tracking reasoning progress in LLMs.

近期大型语言模型（LLM）的进步展现出了令人印象深刻的推理能力，反映了人类思维。然而，LLM是否拥有真正的流体智力（即在新的情况下进行抽象推理和规则概括的能力）仍然是一个悬而未决的问题。现有的推理基准测试要么侧重于特定领域的知识（晶体智力），要么缺乏解释性。为了解决这些限制，我们提出了DRE-Bench，这是一个基于分层认知框架的动态推理评估基准。DRE-Bench由36个抽象推理任务组成，这些任务分为四个认知层次，每个任务都有多个动态变体，测试相同的潜在规则。这种设计实现了对流体智力的精细、可解释和可靠的评估。我们评估了一系列最先进的LLM，包括通用LLM（GPT-4o、Claude 3.7）和推理LLM（o1、DeepSeek-R1、QwQ、Skywork-OR1）。实验结果表明，虽然大多数LLM在低级认知方面达到了稳健和熟练的水平，但在高级认知方面却表现挣扎，随着任务复杂性的增长，其概括能力有限。我们的研究突出了当前LLM与真正的人类流体智力之间的差距，并为系统跟踪LLM的推理进步提供了新的途径。

论文及项目相关链接

PDF

Summary
大语言模型（LLM）的最新进展展现出令人印象深刻、与人类思考相仿的推理能力。然而，LLM是否拥有真正的流体智力（即在新的情境下抽象推理和规则应用的能力）仍是悬而未决的问题。当前推理基准测试集中在特定领域知识或缺乏解释性方面。为此，我们提出DRE-Bench，一个基于分层认知框架的动态推理评估基准。DRE-Bench包含36个抽象推理任务，分为四个认知层次，每个任务包含多个测试相同潜在规则的动态变体。这种设计可实现精细、可解释和可靠的流体智力评估。我们评估了一系列最先进的大型语言模型，包括通用的大型语言模型（如GPT-4o和Claude 3.7）和专门用于推理的大型语言模型（如o1、DeepSeek-R1、QwQ和Skywork-OR1）。实验结果表明，大多数大型语言模型在低层次认知方面表现良好且稳健，但在高层次认知方面表现欠佳，随着任务复杂性的增长，其表现有限。我们的研究揭示了当前大型语言模型与人类真正的流体智力之间的差距，并为系统地跟踪大型语言模型中推理的进步提供了新的途径。

Key Takeaways

大型语言模型展现出与人类相似的推理能力。
流体智力（在新的情境下抽象推理和规则应用的能力）仍是大型语言模型的一个开放问题。
当前推理基准测试存在局限性，主要集中在特定领域知识或缺乏解释性。
DRE-Bench是一个新的动态推理评估基准，具有分层认知框架和精细、可解释的评估特点。
大多数大型语言模型在低层次认知上表现良好，但在高层次认知上表现有限。
大型语言模型与人类真正的流体智力之间存在差距。

Cool Papers

点此查看论文截图

Beyond the Surface: Measuring Self-Preference in LLM Judgments

Authors:Zhi-Yuan Chen, Hao Wang, Xinyu Zhang, Enrui Hu, Yankai Lin

Recent studies show that large language models (LLMs) exhibit self-preference bias when serving as judges, meaning they tend to favor their own responses over those generated by other models. Existing methods typically measure this bias by calculating the difference between the scores a judge model assigns to its own responses and those it assigns to responses from other models. However, this approach conflates self-preference bias with response quality, as higher-quality responses from the judge model may also lead to positive score differences, even in the absence of bias. To address this issue, we introduce gold judgments as proxies for the actual quality of responses and propose the DBG score, which measures self-preference bias as the difference between the scores assigned by the judge model to its own responses and the corresponding gold judgments. Since gold judgments reflect true response quality, the DBG score mitigates the confounding effect of response quality on bias measurement. Using the DBG score, we conduct comprehensive experiments to assess self-preference bias across LLMs of varying versions, sizes, and reasoning abilities. Additionally, we investigate two factors that influence and help alleviate self-preference bias: response text style and the post-training data of judge models. Finally, we explore potential underlying mechanisms of self-preference bias from an attention-based perspective. Our code and data are available at https://github.com/zhiyuanc2001/self-preference.

最近的研究表明，大型语言模型（LLM）在作为评判者时表现出自我偏好偏见，即它们倾向于偏好自己的回答而不是其他模型生成的回答。现有的方法通常通过计算评判模型对自己回应的评分与其他模型回应的评分之间的差异来衡量这种偏见。然而，这种方法将自我偏好偏见与回答质量混淆在一起，因为即使不存在偏见，评判模型的高质量回答也可能导致正面得分差异。为了解决这个问题，我们引入金牌评判作为回应实际质量的代理，并提出DBG分数，该分数衡量自我偏好偏见为评判模型对自己回应的评分与相应的金牌评判之间的差值。由于金牌评判反映了真正的回应质量，因此DBG分数减轻了回应质量对偏见测量的混淆影响。使用DBG分数，我们进行了全面的实验，以评估不同版本、规模和推理能力的大型语言模型的自我偏好偏见。此外，我们还研究了影响并有助于缓解自我偏好偏见的两个因素：回应文本风格和评判模型的后续训练数据。最后，我们从注意力视角探索了自我偏好偏见潜在的根本机制。我们的代码和数据可在https://github.com/zhiyuanc20 0 1 /self-preference找到。

论文及项目相关链接

PDF

Summary

大型语言模型在作为评判者时存在自我偏好偏差，即倾向于青睐自己的回应而非其他模型产生的回应。现有方法通过计算评判模型对自己回应与其他模型回应的评分差异来衡量这种偏差，但这种方法将自我偏好偏差与回应质量混淆。为解决这个问题，我们引入金牌评判作为回应实际质量的代理，并提出DBG评分，该评分衡量自我偏好偏差为评判模型对自己回应的评分与金牌评判之间的差值。通过DBG评分，我们对不同版本、规模和推理能力的大型语言模型的自我偏好偏差进行了全面实验。同时，我们还探讨了影响并有助于缓解自我偏好偏差的两个因素：回应文本风格和评判模型的后续训练数据。最后，我们从注意力角度探讨了自我偏好偏差的潜在内在机制。

Key Takeaways

大型语言模型在作为评判者时存在自我偏好偏差。
现有方法衡量自我偏好偏差时存在与回应质量混淆的问题。
引入金牌评判作为回应质量的代理，提出DBG评分来衡量自我偏好偏差。
DBG评分有效区分了回应质量与自我偏好偏差。
通过实验评估了不同版本、规模和推理能力的大型语言模型的自我偏好偏差。
回应文本风格和评判模型的后续训练数据是影响并有助于缓解自我偏好偏差的两个重要因素。
从注意力角度探讨了自我偏好偏差的潜在内在机制。

Cool Papers

点此查看论文截图

V2X-UniPool: Unifying Multimodal Perception and Knowledge Reasoning for Autonomous Driving

Authors:Xuewen Luo, Fengze Yang, Fan Ding, Xiangbo Gao, Shuo Xing, Yang Zhou, Zhengzhong Tu, Chenxi Liu

Knowledge-driven autonomous driving systems(ADs) offer powerful reasoning capabilities, but face two critical challenges: limited perception due to the short-sightedness of single-vehicle sensors, and hallucination arising from the lack of real-time environmental grounding. To address these issues, this paper introduces V2X-UniPool, a unified framework that integrates multimodal Vehicle-to-Everything (V2X) data into a time-indexed and language-based knowledge pool. By leveraging a dual-query Retrieval-Augmented Generation (RAG) mechanism, which enables retrieval of both static and dynamic knowledge, our system enables ADs to perform accurate, temporally consistent reasoning over both static environment and dynamic traffic context. Experiments on a real-world cooperative driving dataset demonstrate that V2X-UniPool significantly enhances motion planning accuracy and reasoning capability. Remarkably, it enables even zero-shot vehicle-side models to achieve state-of-the-art performance by leveraging V2X-UniPool, while simultaneously reducing transmission cost by over 99.9% compared to prior V2X methods.

知识驱动的自动驾驶系统（ADs）提供了强大的推理能力，但面临两大挑战：由于单车传感器的目光短浅而导致的感知局限，以及由于缺乏实时环境基础而产生的幻觉。为了解决这些问题，本文引入了V2X-UniPool，这是一个统一框架，它将多模式的车对一切（V2X）数据集成到一个基于时间和语言的知识库中。通过利用双查询增强生成（RAG）机制，该机制能够检索静态和动态知识，我们的系统使自动驾驶系统能够在静态环境和动态交通环境中进行准确且时间一致的推理。在真实世界的协同驾驶数据集上的实验表明，V2X-UniPool显著提高了运动规划准确性和推理能力。值得注意的是，它甚至使零车辆模型通过利用V2X-UniPool实现最先进的性能，同时与先前的V2X方法相比，传输成本降低了99.9%以上。

论文及项目相关链接

PDF

Summary

知识驱动的自动驾驶系统（ADs）面临两大挑战：单一车辆传感器的视野受限导致的感知受限，以及由于缺乏实时环境基础而产生的幻觉。为解决这些问题，本文提出了V2X-UniPool统一框架，该框架将多模态的车对万物（V2X）数据集成到时间索引和语言基础的知识池中。通过利用双查询检索增强生成（RAG）机制，该机制能够检索静态和动态知识，我们的系统使自动驾驶系统能够在静态环境和动态交通环境中进行准确且时间一致的推理。在真实世界的合作驾驶数据集上的实验表明，V2X-UniPool显著提高了运动规划准确性和推理能力。值得注意的是，它甚至能使零车辆模型通过利用V2X-UniPool实现卓越性能，同时与先前的V2X方法相比，传输成本降低了99.9%以上。

Key Takeaways

自动驾驶系统面临感知受限和幻觉两大挑战。
V2X-UniPool框架集成了多模态的V2X数据。
V2X-UniPool采用双查询检索增强生成（RAG）机制进行知识检索。
V2X-UniPool提高了自动驾驶系统的运动规划准确性和推理能力。
V2X-UniPool使零车辆模型也能实现卓越性能。
V2X-UniPool降低了传输成本。

Cool Papers

点此查看论文截图

CyberGym: Evaluating AI Agents’ Cybersecurity Capabilities with Real-World Vulnerabilities at Scale

Authors:Zhun Wang, Tianneng Shi, Jingxuan He, Matthew Cai, Jialin Zhang, Dawn Song

Large language model (LLM) agents are becoming increasingly skilled at handling cybersecurity tasks autonomously. Thoroughly assessing their cybersecurity capabilities is critical and urgent, given the high stakes in this domain. However, existing benchmarks fall short, often failing to capture real-world scenarios or being limited in scope. To address this gap, we introduce CyberGym, a large-scale and high-quality cybersecurity evaluation framework featuring 1,507 real-world vulnerabilities found and patched across 188 large software projects. While it includes tasks of various settings, CyberGym primarily focuses on the generation of proof-of-concept (PoC) tests for vulnerability reproduction, based on text descriptions and corresponding source repositories. Solving this task is particularly challenging, as it requires comprehensive reasoning across entire codebases to locate relevant code fragments and produce effective PoCs that accurately trigger the target vulnerability starting from the program’s entry point. Our evaluation across 4 state-of-the-art agent frameworks and 9 LLMs reveals that even the best combination (OpenHands and Claude-3.7-Sonnet) achieves only a 11.9% reproduction success rate, mainly on simpler cases. Beyond reproducing historical vulnerabilities, we find that PoCs generated by LLM agents can reveal new vulnerabilities, identifying 15 zero-days affecting the latest versions of the software projects.

大型语言模型（LLM）代理在自主处理网络安全任务方面变得越来越熟练。考虑到网络安全领域的高风险，全面评估其网络安全能力至关重要且紧迫。然而，现有的基准测试存在不足，通常无法捕捉真实场景或范围有限。为了弥补这一空白，我们推出了CyberGym，这是一个大规模、高质量的网络安全评估框架，包含188个大型软件项目中发现的1007个真实漏洞。虽然它包含了各种设置的任务，但CyberGym主要侧重于基于文本描述和相应源代码仓库生成概念证明（PoC）测试以实现漏洞复现。解决这一任务特别具有挑战性，因为它需要全面推理整个代码库，以找到相关的代码片段并生成有效的PoC，从程序入口点开始准确触发目标漏洞。我们对4个最先进的代理框架和9个LLM的评估表明，即使是最佳组合（OpenHands和Claude-3.7-Sonnet）也仅实现了11.9%的复现成功率，主要是在较简单的情况下。除了复现历史漏洞外，我们发现LLM代理生成的PoC还可以揭示新的漏洞，识别出影响软件项目最新版本中的15个零日漏洞。

论文及项目相关链接

PDF

Summary：大型语言模型（LLM）在处理网络安全任务时展现出越来越多的自主能力。现有的网络安全评估工具无法完全满足需求，为此引入了CyberGym评估框架，涵盖大量真实世界中的漏洞并基于文本描述和对应的源代码库生成针对漏洞的复现测试。尽管挑战巨大，LLM模型依然展现了一定能力，但在面对更复杂的漏洞时，成功复现率仅为最高仅达百分之十一。此外，LLM模型生成的测试还能揭示新的漏洞。

Key Takeaways：

大型语言模型（LLM）在网络安全任务处理方面技能增强。
当前网络安全评估工具存在缺陷，无法充分捕捉真实场景或存在局限性。
CyberGym评估框架旨在解决这一缺陷，包含大量真实漏洞并生成针对漏洞的复现测试。
LLM模型在生成针对漏洞的复现测试时面临挑战，成功复现率较低。
LLM模型展现出的能力不仅能复现已知漏洞，还能揭示新的漏洞（如零日攻击）。

Cool Papers

点此查看论文截图

Kedreamix

https://kedreamix.github.io/Talk2Paper/Paper/2025-06-05/R1_Reasoning/

本博客所有文章除特別声明外，均采用 CC BY 4.0 许可协议。转载请注明来源 Kedreamix !

R1_Reasoning

LLM

LLM 方向最新论文已更新，请持续关注 Update in 2025-06-05 Entity-Augmented Neuroscience Knowledge Retrieval Using Ontology and Semantic Understanding Capability of LLM

2025-06-05 LLM

LLM

Text-to-Motion

Text-to-Motion 方向最新论文已更新，请持续关注 Update in 2025-06-04 Absolute Coordinates Make Motion Generation Easy

2025-06-04 Text-to-Motion

Text-to-Motion