LLM

Published: 2025-09-11

Updated: 2025-10-07

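⚠️ All summaries below were generated by a large language model. They may contain errors, are provided for reference only, and should be used with caution.
🔴 Note: do not use them in serious academic settings; they are suitable only for initial screening before reading the papers.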

2025-09-11 Update

Parallel-R1: Towards Parallel Thinking via Reinforcement Learning

Authors: Tong Zheng, Hongming Zhang, Wenhao Yu, Xiaoyang Wang, Xinyu Yang, Runpeng Dai, Rui Liu, Huiwen Bao, Chengsong Huang, Heng Huang, Dong Yu

Parallel thinking has emerged as a novel approach for enhancing the reasoning capabilities of large language models (LLMs) by exploring multiple reasoning paths concurrently. However, activating such capabilities through training remains challenging, as existing methods predominantly rely on supervised fine-tuning (SFT) over synthetic data, which encourages teacher-forced imitation rather than exploration and generalization. In contrast, we propose Parallel-R1, the first reinforcement learning (RL) framework that enables parallel thinking behaviors for complex real-world reasoning tasks. Our framework employs a progressive curriculum that explicitly addresses the cold-start problem in training parallel thinking with RL. We first use SFT on prompt-generated trajectories from easier tasks to instill the parallel thinking ability, then transition to RL to explore and generalize this skill on harder problems. Experiments on various math benchmarks, including MATH, AMC23, and AIME, show that Parallel-R1 successfully instills parallel thinking, leading to 8.4% accuracy improvements over the sequential thinking model trained directly on challenging tasks with RL. Further analysis reveals a clear shift in the model’s thinking behavior: at an early stage, it uses parallel thinking as an exploration strategy, while in a later stage, it uses the same capability for multi-perspective verification. Most significantly, we validate parallel thinking as a mid-training exploration scaffold, where this temporary exploratory phase unlocks a higher performance ceiling after RL, yielding a 42.9% improvement over the baseline on AIME25. Our model, data, and code will be open-sourced at https://github.com/zhengkid/Parallel-R1.

Summary

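To strengthen the reasoning ability of large language models, the paper proposes Parallel-R1, a reinforcement learning framework for training parallel thinking. Unlike conventional approaches that rely on supervised fine-tuning over synthetic data, Parallel-R1 uses reinforcement learning to elicit parallel thinking behaviors for complex real-world reasoning tasks. It first applies supervised training on prompt-generated trajectories from easier tasks to instill the parallel thinking ability, then transitions to reinforcement learning to explore and generalize this ability on harder problems. The framework shows clear gains on several math benchmarks, and parallel thinking proves especially effective as a mid-training exploration scaffold: this temporary exploratory phase raises the performance ceiling reached after reinforcement learning. The model and data will be released on a public code-hosting platform.

Key Takeaways

Parallel thinking is introduced as a novel approach to improving the reasoning ability of large language models.
Existing methods mainly train models through supervised fine-tuning on synthetic data, which limits exploration and generalization.
The Parallel-R1 framework uses reinforcement learning to elicit parallel thinking behaviors for complex real-world reasoning tasks.
The framework addresses the cold-start problem with a staged training strategy: it first builds the parallel thinking ability under supervision, then transitions to reinforcement learning for exploration and generalization.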

SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge

Authors: Lukas Haas, Gal Yona, Giovanni D’Antonio, Sasha Goldshtein, Dipanjan Das

We introduce SimpleQA Verified, a 1,000-prompt benchmark for evaluating Large Language Model (LLM) short-form factuality based on OpenAI’s SimpleQA. It addresses critical limitations in OpenAI’s benchmark, including noisy and incorrect labels, topical biases, and question redundancy. SimpleQA Verified was created through a rigorous multi-stage filtering process involving de-duplication, topic balancing, and source reconciliation to produce a more reliable and challenging evaluation set, alongside improvements in the autorater prompt. On this new benchmark, Gemini 2.5 Pro achieves a state-of-the-art F1-score of 55.6, outperforming other frontier models, including GPT-5. This work provides the research community with a higher-fidelity tool to track genuine progress in parametric model factuality and to mitigate hallucinations. The benchmark dataset, evaluation code, and leaderboard are available at: https://www.kaggle.com/benchmarks/deepmind/simpleqa-verified.
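
To make the headline metric concrete: in the original SimpleQA setup, each answer is graded as correct, incorrect, or not attempted, and the F-score is the harmonic mean of overall accuracy and accuracy on attempted questions. The sketch below assumes SimpleQA Verified keeps that definition; the function name and the counts are illustrative and are not taken from the paper.

```python
def simpleqa_f_score(correct: int, incorrect: int, not_attempted: int) -> float:
    """Harmonic mean of overall accuracy and accuracy on attempted questions.

    Assumption: this mirrors the F-score described for the original SimpleQA.
    """
    total = correct + incorrect + not_attempted
    attempted = correct + incorrect
    if total == 0 or attempted == 0 or correct == 0:
        return 0.0
    overall_correct = correct / total
    correct_given_attempted = correct / attempted
    return (2 * overall_correct * correct_given_attempted
            / (overall_correct + correct_given_attempted))


# Made-up counts for a 1,000-prompt run (not reported results).
print(round(simpleqa_f_score(correct=500, incorrect=300, not_attempted=200), 3))
```

Under this definition a model cannot raise its score simply by declining to answer: abstaining improves accuracy on attempted questions but lowers overall accuracy.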

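Summary

The paper introduces SimpleQA Verified, a 1,000-question benchmark based on OpenAI's SimpleQA for evaluating the short-form factual accuracy of large language models (LLMs). It addresses key limitations of the OpenAI benchmark, including noisy and incorrect labels, topical bias, and redundant questions. SimpleQA Verified was built through a rigorous multi-stage filtering process involving de-duplication, topic balancing, and source reconciliation, yielding a more reliable and more challenging evaluation set, together with an improved autorater prompt. On the new benchmark, Gemini 2.5 Pro achieves a state-of-the-art F1-score of 55.6, ahead of other frontier models including GPT-5. The work gives the research community a higher-fidelity tool for tracking genuine progress in parametric factuality and for mitigating hallucinations. The benchmark dataset, evaluation code, and leaderboard are available at https://www.kaggle.com/benchmarks/deepmind/simpleqa-verified.

Key Takeaways

SimpleQA Verified is a new benchmark for evaluating the short-form factual accuracy of LLMs.
It addresses several problems in existing benchmarks such as OpenAI's SimpleQA, including label quality and topical balance.
The dataset was built through a multi-stage filtering process to ensure data quality and fair evaluation.
Gemini 2.5 Pro achieves the best F1-score on the new benchmark, 55.6, ahead of other frontier models.
The work gives the community a more accurate tool for evaluating model factuality.
The benchmark dataset, evaluation code, and leaderboard are available on Kaggle.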

ImportSnare: Directed “Code Manual” Hijacking in Retrieval-Augmented Code Generation

Authors: Kai Ye, Liangcai Su, Chenxiong Qian

Code generation has emerged as a pivotal capability of Large Language Models (LLMs), revolutionizing development efficiency for programmers of all skill levels. However, the complexity of data structures and algorithmic logic often results in functional deficiencies and security vulnerabilities in generated code, reducing it to a prototype requiring extensive manual debugging. While Retrieval-Augmented Generation (RAG) can enhance correctness and security by leveraging external code manuals, it simultaneously introduces new attack surfaces. In this paper, we pioneer the exploration of attack surfaces in Retrieval-Augmented Code Generation (RACG), focusing on malicious dependency hijacking. We demonstrate how poisoned documentation containing hidden malicious dependencies (e.g., matplotlib_safe) can subvert RACG, exploiting dual trust chains: LLM reliance on RAG and developers’ blind trust in LLM suggestions. To construct poisoned documents, we propose ImportSnare, a novel attack framework employing two synergistic strategies: 1) Position-aware beam search optimizes hidden ranking sequences to elevate poisoned documents in retrieval results, and 2) Multilingual inductive suggestions generate jailbreaking sequences to manipulate LLMs into recommending malicious dependencies. Through extensive experiments across Python, Rust, and JavaScript, ImportSnare achieves significant attack success rates (over 50% for popular libraries such as matplotlib and seaborn) in general, and is also able to succeed even when the poisoning ratio is as low as 0.01%, targeting both custom and real-world malicious packages. Our findings reveal critical supply chain risks in LLM-powered development, highlighting inadequate security alignment for code generation tasks. To support future research, we will release the multilingual benchmark suite and datasets. The project homepage is https://importsnare.github.io.
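
The risk described above is that generated code imports a look-alike package such as matplotlib_safe. As a minimal defensive sketch, and not something the paper proposes, generated Python can be parsed with the standard `ast` module and its top-level imports compared against an allowlist. The allowlist contents and the function name `unknown_imports` are hypothetical.

```python
import ast

# Hypothetical allowlist; a real project might derive it from a vetted lockfile.
ALLOWED_PACKAGES = {"matplotlib", "seaborn", "numpy"}


def unknown_imports(source: str, allowed: set[str] = ALLOWED_PACKAGES) -> set[str]:
    """Return top-level package names imported by `source` that are not allowed."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import a.b as c" -> top-level package "a"
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Absolute "from a.b import c" -> top-level package "a"
            found.add(node.module.split(".")[0])
    return found - allowed


generated = "import matplotlib_safe.pyplot as plt\nimport numpy as np\n"
print(unknown_imports(generated))
```

A check like this only flags names for human review. It does not detect a malicious package that is already on the allowlist, and it does not catch dynamic imports.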

This paper has been accepted by the ACM Conference on Computer and Communications Security (CCS) 2025.

Summary

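This paper examines the central role of large language models (LLMs) in code generation and the resulting gains in development efficiency. However, complex data structures and algorithmic logic lead to functional defects and security vulnerabilities in generated code, leaving it closer to a prototype than to a finished, usable product. To improve correctness, researchers have brought Retrieval-Augmented Generation (RAG) into code generation, using external code manuals to strengthen correctness and security. This technique also opens new attack surfaces, particularly malicious dependency hijacking. The paper proposes ImportSnare, a new attack framework that optimizes hidden ranking sequences and generates inductive suggestion sequences to manipulate LLMs into recommending malicious dependencies, thereby attacking Retrieval-Augmented Code Generation (RACG). Experiments across Python, Rust, and JavaScript show high attack success rates, exceeding 50% for popular libraries such as matplotlib and seaborn. The paper exposes critical supply chain risks in LLM-powered development and stresses the importance of security in code generation tasks.

Key Takeaways

LLMs have become a key tool for code generation and have greatly improved development efficiency.
Generated code suffers from functional defects and security vulnerabilities, so its correctness and security need strengthening.
RAG improves correctness by drawing on external code manuals, but it also introduces new attack surfaces.
The paper shows how hidden dependencies in poisoned documentation can be used to attack RACG and reports the attack success rates.
The ImportSnare framework manipulates LLMs into recommending malicious dependencies by optimizing hidden ranking sequences and generating inductive suggestion sequences.
LLM-powered development carries critical supply chain risks and calls for stronger security measures.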

Guided Reasoning in LLM-Driven Penetration Testing Using Structured Attack Trees

Authors: Katsuaki Nakano, Reza Feyyazi, Shanchieh Jay Yang, Michael Zuzak

Recent advances in Large Language Models (LLMs) have driven interest in automating cybersecurity penetration testing workflows, offering the promise of faster and more consistent vulnerability assessment for enterprise systems. Existing LLM agents for penetration testing primarily rely on self-guided reasoning, which can produce inaccurate or hallucinated procedural steps. As a result, the LLM agent may undertake unproductive actions, such as exploiting unused software libraries or generating cyclical responses that repeat prior tactics. In this work, we propose a guided reasoning pipeline for penetration testing LLM agents that incorporates a deterministic task tree built from the MITRE ATT&CK Matrix, a proven penetration testing kill chain, to constrain the LLM’s reasoning process to explicitly defined tactics, techniques, and procedures. This anchors reasoning in proven penetration testing methodologies and filters out ineffective actions by guiding the agent towards more productive attack procedures. To evaluate our approach, we built an automated penetration testing LLM agent using three LLMs (Llama-3-8B, Gemini-1.5, and GPT-4) and applied it to navigate 10 HackTheBox cybersecurity exercises with 103 discrete subtasks representing real-world cyberattack scenarios. Our proposed reasoning pipeline guided the LLM agent through 71.8%, 72.8%, and 78.6% of subtasks using Llama-3-8B, Gemini-1.5, and GPT-4, respectively. Comparatively, the state-of-the-art LLM penetration testing tool using self-guided reasoning completed only 13.5%, 16.5%, and 75.7% of subtasks and required 86.2%, 118.7%, and 205.9% more model queries. This suggests that incorporating a deterministic task tree into LLM reasoning pipelines can enhance the accuracy and efficiency of automated cybersecurity assessments.
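
The idea of constraining an agent with a deterministic task tree can be sketched as follows. This is an illustrative toy, not the authors' pipeline: the `TaskNode` class, the rule that a child task opens once its parent is done, and the three-node tree are all assumptions. Only the tactic names come from the MITRE ATT&CK Matrix.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TaskNode:
    name: str
    children: list["TaskNode"] = field(default_factory=list)
    done: bool = False


def allowed_next(node: TaskNode) -> list[str]:
    """Unfinished tasks available now; a child opens only after its parent is done."""
    if not node.done:
        return [node.name]
    names: list[str] = []
    for child in node.children:
        names.extend(allowed_next(child))
    return names


def constrain(proposed: str, root: TaskNode) -> Optional[str]:
    """Accept the LLM's proposed step only if the tree currently allows it."""
    return proposed if proposed in allowed_next(root) else None


# Hypothetical three-step tree using ATT&CK tactic names.
root = TaskNode("Reconnaissance", [
    TaskNode("Initial Access", [TaskNode("Privilege Escalation")]),
])
print(constrain("Privilege Escalation", root))  # rejected: earlier steps not done
root.done = True
print(constrain("Initial Access", root))        # accepted
```

Because the set of permitted steps is computed from the tree rather than generated by the model, a hallucinated or out-of-order step is rejected before it costs further model queries.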

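Summary

Recent advances in large language models (LLMs) have raised hopes of automating cybersecurity penetration testing workflows, promising faster and more consistent vulnerability assessment for enterprise systems. Existing LLM agents rely mainly on self-guided reasoning, which can produce inaccurate or hallucinated procedural steps. As a result, an agent may take unproductive actions, such as exploiting unused software libraries or repeating earlier tactics. The paper proposes a guided reasoning pipeline for penetration testing LLM agents that uses a deterministic task tree built from the MITRE ATT&CK Matrix to constrain the LLM's reasoning to explicitly defined tactics, techniques, and procedures. This grounds reasoning in proven penetration testing methodology and filters out ineffective actions by steering the agent toward more productive attack procedures. For evaluation, the authors built an automated penetration testing agent on three LLMs (Llama-3-8B, Gemini-1.5, and GPT-4) and ran it on 10 HackTheBox cybersecurity exercises comprising 103 discrete subtasks that represent real-world attack scenarios. With the proposed pipeline, the agent completed 71.8%, 72.8%, and 78.6% of subtasks, respectively. By comparison, the state-of-the-art tool based on self-guided reasoning completed only 13.5%, 16.5%, and 75.7% of subtasks and required 86.2%, 118.7%, and 205.9% more model queries. This suggests that incorporating a deterministic task tree into LLM reasoning pipelines can improve the accuracy and efficiency of cybersecurity assessments.

Key Takeaways

LLMs show potential for automating cybersecurity penetration testing.
Current LLM agents rely mainly on self-guided reasoning, which leads to inaccurate steps and unproductive actions.
A deterministic task tree built from the MITRE ATT&CK Matrix effectively constrains the LLM's reasoning.
The proposed guided reasoning pipeline improves the accuracy and efficiency of LLMs in penetration testing.
In the evaluation, the agent using the guided pipeline completed more subtasks than the self-guided LLM tool.
Deterministic task trees can improve the accuracy of cybersecurity assessments.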

GENUINE: Graph Enhanced Multi-level Uncertainty Estimation for Large Language Models

Authors: Tuo Wang, Adithya Kulkarni, Tyler Cody, Peter A. Beling, Yujun Yan, Dawei Zhou

Uncertainty estimation is essential for enhancing the reliability of Large Language Models (LLMs), particularly in high-stakes applications. Existing methods often overlook semantic dependencies, relying on token-level probability measures that fail to capture structural relationships within the generated text. We propose GENUINE: Graph ENhanced mUlti-level uncertaINty Estimation for Large Language Models, a structure-aware framework that leverages dependency parse trees and hierarchical graph pooling to refine uncertainty quantification. By incorporating supervised learning, GENUINE effectively models semantic and structural relationships, improving confidence assessments. Extensive experiments across NLP tasks show that GENUINE achieves up to 29% higher AUROC than semantic entropy-based approaches and reduces calibration errors by over 15%, demonstrating the effectiveness of graph-based uncertainty modeling. The code is available at https://github.com/ODYSSEYWT/GUQ.
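
The abstract reports a reduction in calibration error without naming the metric. One common choice is expected calibration error (ECE) over equal-width confidence bins; the sketch below assumes that definition, which may differ from the one used in the paper. The function name and sample values are illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for bucket in bins:
        if bucket:
            avg_conf = sum(c for c, _ in bucket) / len(bucket)
            accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
            ece += len(bucket) / n * abs(accuracy - avg_conf)
    return ece


# Illustrative values: confident answers are right, unconfident ones are wrong.
print(expected_calibration_error([0.9, 0.9, 0.1, 0.1], [True, True, False, False]))
```

A perfectly calibrated estimator has an ECE of 0, meaning that among answers given confidence p, a fraction p are correct.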

Accepted by EMNLP 2025.

Summary

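Uncertainty estimation is essential for improving the reliability of large language models (LLMs), especially in high-stakes applications. Existing methods often overlook semantic dependencies and rely on token-level probability measures that fail to capture structural relationships in the generated text. The paper proposes GENUINE (Graph ENhanced mUlti-level uncertaINty Estimation for Large Language Models), a structure-aware framework that uses dependency parse trees and hierarchical graph pooling to refine uncertainty quantification. By incorporating supervised learning, GENUINE models semantic and structural relationships effectively and improves confidence assessment. Extensive experiments across NLP tasks show up to 29% higher AUROC than semantic entropy-based approaches and a reduction in calibration error of more than 15%, demonstrating the effectiveness of graph-based uncertainty modeling. The code is available at https://github.com/ODYSSEYWT/GUQ.

Key Takeaways

Uncertainty estimation is key to improving the reliability of large language models (LLMs), particularly in high-stakes applications.
Existing methods often overlook semantic dependencies and cannot fully capture structural relationships in text.
GENUINE is a new structure-aware framework that uses dependency parse trees and hierarchical graph pooling to refine uncertainty quantification.
By incorporating supervised learning, GENUINE models semantic and structural relationships effectively and improves confidence assessment.
On NLP tasks, GENUINE outperforms semantic entropy-based approaches, with up to 29% higher AUROC.
GENUINE also reduces calibration error by more than 15%.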

HiPhO: How Far Are (M)LLMs from Humans in the Latest High School Physics Olympiad Benchmark?

Authors: Fangchen Yu, Haiyuan Wan, Qianjia Cheng, Yuchen Zhang, Jiacheng Chen, Fujun Han, Yulun Wu, Junchi Yao, Ruilizhen Hu, Ning Ding, Yu Cheng, Tao Chen, Lei Bai, Dongzhan Zhou, Yun Luo, Ganqu Cui, Peng Ye

Recently, the physical capabilities of (M)LLMs have garnered increasing attention. However, existing benchmarks for physics suffer from two major gaps: they neither provide systematic and up-to-date coverage of real-world physics competitions such as physics Olympiads, nor enable direct performance comparison with humans. To bridge these gaps, we present HiPhO, the first benchmark dedicated to high school physics Olympiads with human-aligned evaluation. Specifically, HiPhO highlights three key innovations. (1) Comprehensive Data: It compiles 13 latest Olympiad exams from 2024-2025, spanning both international and regional competitions, and covering mixed modalities that encompass problems spanning text-only to diagram-based. (2) Professional Evaluation: We adopt official marking schemes to perform fine-grained grading at both the answer and step level, fully aligned with human examiners to ensure high-quality and domain-specific evaluation. (3) Comparison with Human Contestants: We assign gold, silver, and bronze medals to models based on official medal thresholds, thereby enabling direct comparison between (M)LLMs and human contestants. Our large-scale evaluation of 30 state-of-the-art (M)LLMs shows that: across 13 exams, open-source MLLMs mostly remain at or below the bronze level; open-source LLMs show promising progress with occasional golds; closed-source reasoning MLLMs can achieve 6 to 12 gold medals; and most models still have a significant gap from full marks. These results highlight a substantial performance gap between open-source models and top students, the strong physical reasoning capabilities of closed-source reasoning models, and the fact that there is still significant room for improvement. HiPhO, as a rigorous, human-aligned, and Olympiad-focused benchmark for advancing multimodal physical reasoning, is open-source and available at https://github.com/SciYu/HiPhO.
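
The medal comparison described in point (3) amounts to checking a model's exam score against that exam's official cutoffs. A minimal sketch follows; the function name, the cutoff values, and the model scores are hypothetical, and the real thresholds differ for each exam.

```python
def assign_medal(score: float, gold: float, silver: float, bronze: float) -> str:
    """Map an exam score to a medal using per-exam cutoffs (gold >= silver >= bronze)."""
    if score >= gold:
        return "gold"
    if score >= silver:
        return "silver"
    if score >= bronze:
        return "bronze"
    return "none"


# Hypothetical cutoffs and model scores for one exam.
cutoffs = {"gold": 30.0, "silver": 22.0, "bronze": 15.0}
for model, score in {"model_a": 31.5, "model_b": 18.0, "model_c": 9.0}.items():
    print(model, assign_medal(score, **cutoffs))
```

Counting the "gold" results across the 13 exams gives per-model medal tallies of the kind the abstract reports.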

Summary

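This paper examines how large language models perform in physics and introduces HiPhO, a new benchmark dedicated to high school physics Olympiads that enables direct comparison between models and humans. The benchmark offers comprehensive data coverage, with problems from international and regional physics Olympiads; it applies official marking schemes for fine-grained grading; and it awards models gold, silver, and bronze medals by comparison with human contestants. The evaluation shows that open-source MLLMs mostly remain at or below the bronze level, while closed-source reasoning MLLMs earn 6 to 12 gold medals across the 13 exams and demonstrate strong physical reasoning. Overall, HiPhO provides an important tool and resource for advancing multimodal physical reasoning.

Key Takeaways

HiPhO is the first benchmark dedicated to high school physics Olympiads.
It offers comprehensive data coverage, including problems from international and regional physics Olympiads.
It applies official marking schemes for fine-grained grading, consistent with human examiners.
It enables direct comparison between models and humans by awarding gold, silver, and bronze medals according to official thresholds.
The evaluation shows that open-source models still have room to improve, while closed-source reasoning models demonstrate strong physical reasoning.
HiPhO provides an important tool and resource for advancing multimodal physical reasoning.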

SCoder: Iterative Self-Distillation for Bootstrapping Small-Scale Data Synthesizers to Empower Code LLMs

Authors: Xinyu Zhang, Changzhi Zhou, Linmei Hu, Luhao Zhang, Xiancai Chen, Haomin Fu, Yang Yang, Mengdi Zhang

Existing code large language models (LLMs) often rely on large-scale instruction data distilled from proprietary LLMs for fine-tuning, which typically incurs high costs. In this paper, we explore the potential of small-scale open-source LLMs (e.g., 7B) as synthesizers for high-quality code instruction data construction. We first observe that the data synthesis capability of small-scale LLMs can be enhanced by training on a few superior data synthesis samples from proprietary LLMs. Building on this, we propose a novel iterative self-distillation approach to bootstrap small-scale LLMs, transforming them into powerful synthesizers that reduce reliance on proprietary LLMs and minimize costs. Concretely, in each iteration, to obtain diverse and high-quality self-distilled data, we design multi-checkpoint sampling and multi-aspect scoring strategies for initial data selection. Furthermore, to identify the most influential samples, we introduce a gradient-based influence estimation method for final data filtering. Based on the code instruction datasets from the small-scale synthesizers, we develop SCoder, a family of code generation models fine-tuned from DeepSeek-Coder. SCoder models achieve state-of-the-art code generation capabilities, demonstrating the effectiveness of our method.
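
The abstract does not spell out its gradient-based influence estimate. One common first-order approximation scores a candidate sample by the dot product between its loss gradient and the gradient on a reference set, then keeps the top-scoring samples. The sketch below illustrates that generic idea only; it is an assumption, not the paper's method, and the names and toy gradient vectors are hypothetical.

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def select_influential(sample_grads: dict[str, list[float]],
                       reference_grad: list[float],
                       k: int) -> list[str]:
    """Rank samples by gradient alignment with the reference set and keep the top k."""
    scores = {name: dot(grad, reference_grad) for name, grad in sample_grads.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy 2-dimensional "gradients"; real ones would come from a model's parameters.
grads = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [-1.0, 0.0]}
print(select_influential(grads, reference_grad=[1.0, 0.5], k=2))
```

A positive score suggests that a gradient step on the sample would also reduce loss on the reference set, which is the intuition behind keeping it.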

Summary

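Small-scale open-source LLMs show strong potential as synthesizers of code instruction data. Training them on a small number of high-quality synthesis samples and applying iterative self-distillation reduces reliance on proprietary LLMs and lowers cost. Multi-checkpoint sampling and multi-aspect scoring, combined with gradient-based influence estimation, yield high-quality self-distilled data. SCoder models built on this data demonstrate excellent code generation capability.

Key Takeaways

Small-scale open-source LLMs can serve as synthesizers of high-quality code instruction data.
Training improves the data synthesis capability of small-scale LLMs.
An iterative self-distillation approach makes small-scale LLMs stronger synthesizers, reducing reliance on proprietary LLMs and lowering cost.
Multi-checkpoint sampling and multi-aspect scoring are used for initial data selection to obtain diverse, high-quality self-distilled data.
A gradient-based influence estimation method is used for final data filtering to identify the most influential samples.
SCoder models built on data from the small-scale synthesizers demonstrate excellent code generation capability.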

Aligning LLMs for the Classroom with Knowledge-Based Retrieval – A Comparative RAG Study

Authors: Amay Jain, Liu Cui, Si Chen

Large language models like ChatGPT are increasingly used in classrooms, but they often provide outdated or fabricated information that can mislead students. Retrieval Augmented Generation (RAG) improves reliability of LLMs by grounding responses in external resources. We investigate two accessible RAG paradigms, vector-based retrieval and graph-based retrieval to identify best practices for classroom question answering (QA). Existing comparative studies fail to account for pedagogical factors such as educational disciplines, question types, and practical deployment costs. Using a novel dataset, EduScopeQA, of 3,176 questions across academic subjects, we measure performance on various educational query types, from specific facts to broad thematic discussions. We also evaluate system alignment with a dataset of systematically altered textbooks that contradict the LLM’s latent knowledge. We find that OpenAI Vector Search RAG (representing vector-based RAG) performs well as a low-cost generalist, especially for quick fact retrieval. On the other hand, GraphRAG Global excels at providing pedagogically rich answers to thematic queries, and GraphRAG Local achieves the highest accuracy with the dense, altered textbooks when corpus integrity is critical. Accounting for the 10-20x higher resource usage of GraphRAG (representing graph-based RAG), we show that a dynamic branching framework that routes queries to the optimal retrieval method boosts fidelity and efficiency. These insights provide actionable guidelines for educators and system designers to integrate RAG-augmented LLMs into learning environments effectively.
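
The dynamic branching idea, sending each query to the retrieval method best suited to it, can be sketched as a small router. The abstract does not say how the paper classifies queries, so the keyword heuristic, the cue list, and the `corpus_overrides_model` flag below are hypothetical stand-ins. Only the three backend names come from the abstract.

```python
from enum import Enum


class Backend(Enum):
    VECTOR = "OpenAI Vector Search"
    GRAPH_GLOBAL = "GraphRAG Global"
    GRAPH_LOCAL = "GraphRAG Local"


# Hypothetical cues that a query asks for a broad, thematic answer.
THEMATIC_CUES = ("theme", "overall", "compare", "why", "discuss")


def route(query: str, corpus_overrides_model: bool = False) -> Backend:
    """Pick a retrieval backend; defaults to the cheaper vector search."""
    if corpus_overrides_model:
        # The course text must win over the model's prior knowledge.
        return Backend.GRAPH_LOCAL
    lowered = query.lower()
    if any(cue in lowered for cue in THEMATIC_CUES):
        return Backend.GRAPH_GLOBAL
    return Backend.VECTOR


print(route("When was the Treaty of Versailles signed?").value)
print(route("Discuss the overall themes of the novel.").value)
```

Defaulting to vector search reflects the cost gap the abstract reports, with GraphRAG using 10-20x more resources, so the graph backends are reserved for queries that benefit from them.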

This work has been submitted to the IEEE for possible publication.

Summary

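Large language models such as ChatGPT are increasingly used in classrooms, but the information they provide is often outdated or fabricated and can mislead students. Retrieval Augmented Generation (RAG) improves the reliability of LLMs by grounding responses in external resources. The study examines two accessible RAG paradigms, vector-based retrieval and graph-based retrieval, to identify best practices for classroom question answering. It extends existing comparisons by accounting for pedagogical factors such as educational discipline, question type, and practical deployment cost. Using a new dataset, EduScopeQA, with 3,176 questions across academic subjects, the authors measure performance on a range of educational query types, from specific facts to broad thematic discussions. They also evaluate system alignment using systematically altered textbooks that contradict the LLM's latent knowledge. OpenAI Vector Search RAG performs well as a low-cost generalist, especially for quick fact retrieval; GraphRAG Global excels at pedagogically rich answers to thematic queries; and GraphRAG Local achieves the highest accuracy on the dense, altered textbooks when corpus integrity is critical. Given that GraphRAG uses 10-20x more resources, the study also shows that a dynamic branching framework that routes each query to the most suitable retrieval method improves fidelity and efficiency. These findings offer practical guidance for educators and system designers integrating RAG-augmented LLMs into learning environments.

Key Takeaways

Large language models (LLMs) used in classrooms can provide misleading information.
Retrieval Augmented Generation (RAG) improves LLM reliability by grounding responses in external resources.
The study compares two RAG paradigms, vector-based retrieval and graph-based retrieval, for classroom question answering.
It accounts for pedagogical factors such as educational discipline, question type, and deployment cost, which existing comparisons omit.
OpenAI Vector Search RAG performs well at quick fact retrieval and suits low-cost, general-purpose use.
GraphRAG Global excels at pedagogically rich answers, especially for thematic queries.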

Point Linguist Model: Segment Any Object via Bridged Large 3D-Language Model

Authors: Zhuoxu Huang, Mingqi Gao, Jungong Han

3D object segmentation with Large Language Models (LLMs) has become a prevailing paradigm due to its broad semantics, task flexibility, and strong generalization. However, this paradigm is hindered by representation misalignment: LLMs process high-level semantic tokens, whereas 3D point clouds convey only dense geometric structures. In prior methods, misalignment limits both input and output. At the input stage, dense point patches require heavy pre-alignment, weakening object-level semantics and confusing similar distractors. At the output stage, predictions depend only on dense features without explicit geometric cues, leading to a loss of fine-grained accuracy. To address these limitations, we present the Point Linguist Model (PLM), a general framework that bridges the representation gap between LLMs and dense 3D point clouds without requiring large-scale pre-alignment between 3D-text or 3D-images. Specifically, we introduce Object-centric Discriminative Representation (OcDR), which learns object-centric tokens that capture target semantics and scene relations under a hard negative-aware training objective. This mitigates the misalignment between LLM tokens and 3D points, enhances resilience to distractors, and facilitates semantic-level reasoning within LLMs. For accurate segmentation, we introduce the Geometric Reactivation Decoder (GRD), which predicts masks by combining OcDR tokens carrying LLM-inferred geometry with corresponding dense features, preserving comprehensive dense features throughout the pipeline. Extensive experiments show that PLM achieves significant improvements of +7.3 mIoU on ScanNetv2 and +6.0 mIoU on Multi3DRefer for 3D referring segmentation, with consistent gains across 7 benchmarks spanning 4 different tasks, demonstrating the effectiveness of comprehensive object-centric reasoning for robust 3D understanding.
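
The reported gains are in mIoU. As a reminder of what that measures, the sketch below computes intersection-over-union for one predicted mask and averages it over samples, representing each mask as a set of point indices. Per-sample averaging is an assumption here; benchmarks differ in exactly how they aggregate, and the function names are illustrative.

```python
def mask_iou(pred: set[int], target: set[int]) -> float:
    """Intersection-over-union of two point-index sets (1.0 if both are empty)."""
    union = pred | target
    return len(pred & target) / len(union) if union else 1.0


def mean_iou(pairs: list[tuple[set[int], set[int]]]) -> float:
    """Average IoU over (predicted, ground-truth) mask pairs."""
    return sum(mask_iou(p, t) for p, t in pairs) / len(pairs)


# Toy masks over point indices: two of four distinct points overlap.
print(mask_iou({1, 2, 3}, {2, 3, 4}))
```

On this scale, a gain of +7.3 mIoU means the average overlap between predicted and ground-truth masks rose by 7.3 percentage points.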

Preprint.

Summary

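3D object segmentation with large language models (LLMs) has become a prevailing paradigm, but it suffers from representation misalignment. The paper proposes the Point Linguist Model (PLM), which introduces Object-centric Discriminative Representation (OcDR) and the Geometric Reactivation Decoder (GRD) to narrow the representation gap between LLMs and dense 3D point clouds and improve the accuracy of 3D understanding.

Key Takeaways

Large language models (LLMs) are widely used for 3D object segmentation, but representation misalignment remains a problem.
The PLM framework addresses it by narrowing the representation gap between LLMs and dense 3D point clouds.
PLM introduces Object-centric Discriminative Representation (OcDR), which learns object-centric tokens that capture target semantics and scene relations.
The Geometric Reactivation Decoder (GRD) combines LLM-inferred geometry with the corresponding dense features for accurate segmentation.
PLM improves 3D referring segmentation by +7.3 mIoU on ScanNetv2 and +6.0 mIoU on Multi3DRefer.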

Dual Knowledge-Enhanced Two-Stage Reasoner for Multimodal Dialog Systems

Authors: Xiaolin Chen, Xuemeng Song, Haokun Wen, Weili Guan, Xiangyu Zhao, Liqiang Nie

Textual response generation is pivotal for multimodal task-oriented dialog systems, which aim to generate proper textual responses based on the multimodal context. While existing efforts have demonstrated remarkable progress, there still exist the following limitations: 1) neglect of unstructured review knowledge and 2) underutilization of large language models (LLMs). Inspired by this, we aim to fully utilize dual knowledge (i.e., structured attribute and unstructured review knowledge) with LLMs to promote textual response generation in multimodal task-oriented dialog systems. However, this task is non-trivial due to two key challenges: 1) dynamic knowledge type selection and 2) intention-response decoupling. To address these challenges, we propose a novel dual knowledge-enhanced two-stage reasoner by adapting LLMs for multimodal dialog systems (named DK2R). To be specific, DK2R first extracts both structured attribute and unstructured review knowledge from an external knowledge base given the dialog context. Thereafter, DK2R uses an LLM to evaluate each knowledge type’s utility by analyzing LLM-generated provisional probe responses. Moreover, DK2R separately summarizes the intention-oriented key clues via dedicated reasoning, which are further used as auxiliary signals to enhance LLM-based textual response generation. Extensive experiments conducted on a public dataset verify the superiority of DK2R. We have released the codes and parameters.

Summary

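This paper addresses textual response generation in multimodal task-oriented dialog systems, aiming to combine structured attribute knowledge and unstructured review knowledge with large language models (LLMs). It identifies the limitations and challenges of existing work and proposes DK2R, a dual knowledge-enhanced two-stage reasoner. DK2R extracts knowledge from an external knowledge base, evaluates the utility of each knowledge type, and summarizes key clues to enhance LLM-based response generation. Experiments on a public dataset verify the superiority of DK2R.

Key Takeaways

Textual response generation is a key component of multimodal task-oriented dialog systems and benefits from combining structured attribute knowledge with unstructured review knowledge.
Applying large language models (LLMs) to this task is an important innovation.
The main challenges are dynamic knowledge type selection and intention-response decoupling.
DK2R addresses these challenges by extracting and applying dual knowledge.
DK2R extracts knowledge from an external knowledge base given the dialog context.
DK2R evaluates the utility of each knowledge type by analyzing LLM-generated provisional probe responses.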

CAViAR: Critic-Augmented Video Agentic Reasoning

Authors: Sachit Menon, Ahmet Iscen, Arsha Nagrani, Tobias Weyand, Carl Vondrick, Cordelia Schmid

Video understanding has seen significant progress in recent years, with models’ performance on perception from short clips continuing to rise. Yet, multiple recent benchmarks, such as LVBench, Neptune, and ActivityNet-RTL, show that performance wanes for tasks requiring complex reasoning on videos as queries grow more complex and videos grow longer. In this work, we ask: can existing perception capabilities be leveraged to successfully perform more complex video reasoning? In particular, we develop a large language model agent given access to video modules as subagents or tools. Rather than following a fixed procedure to solve queries as in previous work such as Visual Programming, ViperGPT, and MoReVQA, the agent uses the results of each call to a module to determine subsequent steps. Inspired by work in the textual reasoning domain, we introduce a critic to distinguish between instances of successful and unsuccessful sequences from the agent. We show that the combination of our agent and critic achieves strong performance on the previously-mentioned datasets.
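
One way to read the agent-plus-critic combination is that the agent produces several candidate action sequences and the critic picks the one most likely to have succeeded. The sketch below shows only that selection step, with a hand-written scoring rule in place of a learned critic. The module names, the trajectory format, and `toy_critic` are hypothetical and are not taken from the paper.

```python
from typing import Callable, Sequence

Trajectory = list[str]  # ordered module calls, ending (ideally) in an answer step


def pick_with_critic(candidates: Sequence[Trajectory],
                     critic: Callable[[Trajectory], float]) -> Trajectory:
    """Return the candidate trajectory the critic scores highest."""
    return max(candidates, key=critic)


def toy_critic(trajectory: Trajectory) -> float:
    """Stand-in critic: reward reaching an answer, lightly penalize length."""
    reached_answer = bool(trajectory) and trajectory[-1].startswith("answer")
    return (1.0 if reached_answer else 0.0) - 0.01 * len(trajectory)


candidates = [
    ["retrieve_clip", "caption"],
    ["retrieve_clip", "ask_vqa", "answer: red car"],
]
print(pick_with_critic(candidates, toy_critic))
```

In practice the critic would be a model trained or prompted to judge sequences. The selection logic stays the same whatever sits behind `critic`.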

Summary

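This paper addresses complex reasoning tasks in video understanding, noting that the performance of existing models drops as videos grow longer and queries grow more complex. To address this, the authors build on existing perception capabilities and develop a large language model agent that can call video modules as subagents or tools. Unlike earlier methods that follow a fixed procedure to solve a query, the agent uses the result of each module call to decide its next step. A critic is introduced to distinguish successful from unsuccessful sequences produced by the agent, and the combination of agent and critic achieves strong performance on several datasets.

Key Takeaways

Video understanding has progressed markedly on short-clip perception, but existing models perform worse on complex queries and long videos.
The authors build on existing perception capabilities to develop a large language model agent that can call video modules as subagents or tools.
Unlike methods that follow a fixed procedure, the agent uses the results of module calls to determine subsequent steps.
A critic is introduced to distinguish successful from unsuccessful agent sequences.
The combination of agent and critic achieves strong performance on several datasets.
This work may advance complex reasoning in video understanding.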

Competitive Audio-Language Models with Data-Efficient Single-Stage Training on Public Data

Authors: Gokul Karthik Kumar, Rishabh Saraf, Ludovick Lepauloux, Abdul Muneer, Billel Mokeddem, Hakim Hacid

Large language models (LLMs) have transformed NLP, yet their integration with audio remains underexplored – despite audio’s centrality to human communication. We introduce Falcon3-Audio, a family of Audio-Language Models (ALMs) built on instruction-tuned LLMs and Whisper encoders. Using a remarkably small amount of public audio data – less than 30K hours (5K unique) – Falcon3-Audio-7B matches the best reported performance among open-weight models on the MMAU benchmark, with a score of 64.14, matching R1-AQA, while distinguishing itself through superior data and parameter efficiency, single-stage training, and transparency. Notably, our smallest 1B model remains competitive with larger open models ranging from 2B to 13B parameters. Through extensive ablations, we find that common complexities – such as curriculum learning, multiple audio encoders, and intricate cross-attention connectors – are not required for strong performance, even compared to models trained on over 500K hours of data.

Paper and project links

PDF Accepted at ASRU 2025

Summary

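Large language models (LLMs) have advanced NLP considerably, yet their integration with audio remains underexplored despite audio's central role in human communication. This work introduces Falcon3-Audio, a family of Audio-Language Models (ALMs) built on instruction-tuned LLMs and Whisper encoders. Trained on less than 30K hours of public audio data (5K unique), Falcon3-Audio-7B scores 64.14 on the MMAU benchmark, matching R1-AQA and the best reported performance among open-weight models, while standing out for data and parameter efficiency, single-stage training, and transparency. The smallest 1B model remains competitive with larger open models of 2B to 13B parameters. Extensive ablations show that common complexities such as curriculum learning, multiple audio encoders, and intricate cross-attention connectors are not required for strong performance, even compared with models trained on over 500K hours of data. Overall, the work shows that competitive audio-language models can be trained efficiently.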

Key Takeaways

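Large language models have advanced NLP considerably, but their integration with audio still needs deeper study.
Falcon3-Audio is a family of audio-language models built on instruction-tuned LLMs and Whisper encoders.
With less than 30K hours of public audio data, Falcon3-Audio-7B scores 64.14 on MMAU, matching the best reported result among open-weight models; the 1B model remains competitive with open models of 2B to 13B parameters.
The Falcon3-Audio family stands out for data and parameter efficiency, single-stage training, and transparency.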

Cool Papers

SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents

Authors:Xuan-Phi Nguyen, Shrey Pandit, Revanth Gangi Reddy, Austin Xu, Silvio Savarese, Caiming Xiong, Shafiq Joty

Equipping large language models (LLMs) with complex, interleaved reasoning and tool-use capabilities has become a key focus in agentic AI research, especially with recent advances in reasoning-oriented (``thinking’’) models. Such capabilities are key to unlocking a number of important applications. One such application is Deep Research (DR), which requires extensive search and reasoning over many sources. Our work in this paper focuses on the development of native Autonomous Single-Agent models for DR featuring minimal web crawling and Python tool integration. Unlike multi-agent systems, where agents take up pre-defined roles and are told what to do at each step in a static workflow, an autonomous single-agent determines its next action dynamically based on context, without manual directive. While prior work has proposed training recipes for base or instruction-tuned LLMs, we focus on continual reinforcement learning (RL) of reasoning-optimized models to further enhance agentic skills while preserving reasoning ability. Towards this end, we propose a simple RL recipe with entirely synthetic data, which we apply to various open-source LLMs. Our best variant SFR-DR-20B achieves up to 28.7% on Humanity’s Last Exam benchmark. In addition, we conduct key analysis experiments to provide more insights into our methodologies.

Paper and project links

PDF Technical Report

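Summary
Equipping large language models (LLMs) with complex, interleaved reasoning and tool-use capabilities has become a key focus of agentic AI research, especially given recent progress in reasoning-oriented ("thinking") models. These capabilities unlock important applications such as Deep Research (DR), which requires extensive search and reasoning over many sources. The paper develops native autonomous single-agent models for DR with minimal web crawling and Python tool integration. Unlike multi-agent systems, where agents take pre-defined roles and are told what to do at each step of a static workflow, an autonomous single agent decides its next action dynamically from context, without manual directives. Whereas prior work proposed training recipes for base or instruction-tuned LLMs, this work applies continual reinforcement learning (RL) to reasoning-optimized models to strengthen agentic skills while preserving reasoning ability. It proposes a simple RL recipe that uses entirely synthetic data and applies it to several open-source LLMs. The best variant, SFR-DR-20B, reaches up to 28.7% on the Humanity's Last Exam benchmark. Key analysis experiments provide further insight into the methodology.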

Key Takeaways

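Equipping LLMs with complex, interleaved reasoning and tool use has become a key focus of agentic AI research.
Autonomous single-agent models are promising for Deep Research (DR): they decide their actions dynamically from context, without manual directives.
Compared with multi-agent systems, which follow pre-defined roles in a static workflow, autonomous single agents are more flexible and adaptive.
Continual reinforcement learning (RL) is applied to reasoning-optimized models to strengthen agentic skills while preserving reasoning ability.
A simple RL recipe using entirely synthetic data is applied to several open-source LLMs.
The best variant, SFR-DR-20B, reaches up to 28.7% on the Humanity's Last Exam benchmark.
Key analysis experiments provide further insight into the methodology.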

Cool Papers

LM-Searcher: Cross-domain Neural Architecture Search with LLMs via Unified Numerical Encoding

Authors:Yuxuan Hu, Jihao Liu, Ke Wang, Jinliang Zhen, Weikang Shi, Manyuan Zhang, Qi Dou, Rui Liu, Aojun Zhou, Hongsheng Li

Recent progress in Large Language Models (LLMs) has opened new avenues for solving complex optimization problems, including Neural Architecture Search (NAS). However, existing LLM-driven NAS approaches rely heavily on prompt engineering and domain-specific tuning, limiting their practicality and scalability across diverse tasks. In this work, we propose LM-Searcher, a novel framework that leverages LLMs for cross-domain neural architecture optimization without the need for extensive domain-specific adaptation. Central to our approach is NCode, a universal numerical string representation for neural architectures, which enables cross-domain architecture encoding and search. We also reformulate the NAS problem as a ranking task, training LLMs to select high-performing architectures from candidate pools using instruction-tuning samples derived from a novel pruning-based subspace sampling strategy. Our curated dataset, encompassing a wide range of architecture-performance pairs, encourages robust and transferable learning. Comprehensive experiments demonstrate that LM-Searcher achieves competitive performance in both in-domain (e.g., CNNs for image classification) and out-of-domain (e.g., LoRA configurations for segmentation and generation) tasks, establishing a new paradigm for flexible and generalizable LLM-based architecture search. The datasets and models will be released at https://github.com/Ashone3/LM-Searcher.

Paper and project links

PDF EMNLP 2025 Main

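Summary

Recent progress in large language models (LLMs) has opened new avenues for complex optimization problems, including Neural Architecture Search (NAS). Existing LLM-driven NAS methods, however, rely heavily on prompt engineering and domain-specific tuning, which limits their practicality and scalability across tasks. This work proposes LM-Searcher, a framework that uses LLMs for cross-domain neural architecture optimization without extensive domain-specific adaptation. At its core is NCode, a universal numerical string representation for neural architectures that enables cross-domain architecture encoding and search. The work also reformulates NAS as a ranking task, training LLMs to select high-performing architectures from candidate pools using instruction-tuning samples derived from a pruning-based subspace sampling strategy. Experiments show that LM-Searcher is competitive on both in-domain tasks (e.g., CNNs for image classification) and out-of-domain tasks (e.g., LoRA configurations for segmentation and generation).

Key Takeaways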

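LLMs open new avenues for complex optimization problems such as NAS.
Existing LLM-driven NAS methods depend on prompt engineering and domain-specific tuning, which limits their practicality and scalability.
LM-Searcher uses LLMs for cross-domain neural architecture optimization without extensive domain-specific adaptation.
NCode, the core of LM-Searcher, is a universal numerical string representation that enables cross-domain architecture encoding and search.
LM-Searcher reformulates NAS as a ranking task and trains LLMs on instruction-tuning samples derived from pruning-based subspace sampling.
LM-Searcher is competitive on both in-domain and out-of-domain tasks.
The datasets and models will be released at https://github.com/Ashone3/LM-Searcher.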

Cool Papers

MaRVL-QA: A Benchmark for Mathematical Reasoning over Visual Landscapes

Authors:Nilay Pande, Sahiti Yerramilli, Jayant Sravan Tamarapalli, Rynaa Grover

A key frontier for Multimodal Large Language Models (MLLMs) is the ability to perform deep mathematical and spatial reasoning directly from images, moving beyond their established success in semantic description. Mathematical surface plots provide a rigorous testbed for this capability, as they isolate the task of reasoning from the semantic noise common in natural images. To measure progress on this frontier, we introduce MaRVL-QA (Mathematical Reasoning over Visual Landscapes), a new benchmark designed to quantitatively evaluate these core reasoning skills. The benchmark comprises two novel tasks: Topological Counting, identifying and enumerating features like local maxima; and Transformation Recognition, recognizing applied geometric transformations. Generated from a curated library of functions with rigorous ambiguity filtering, our evaluation on MaRVL-QA reveals that even state-of-the-art MLLMs struggle significantly, often resorting to superficial heuristics instead of robust spatial reasoning. MaRVL-QA provides a challenging new tool for the research community to measure progress, expose model limitations, and guide the development of MLLMs with more profound reasoning abilities.

Paper and project links

PDF

Summary
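A key frontier for Multimodal Large Language Models (MLLMs) is deep mathematical and spatial reasoning directly from images, beyond their established success in semantic description. To test this capability, the authors introduce MaRVL-QA (Mathematical Reasoning over Visual Landscapes), a benchmark built on mathematical surface plots and designed to evaluate these core reasoning skills quantitatively. It comprises two tasks: Topological Counting, which identifies and enumerates features such as local maxima, and Transformation Recognition, which recognizes applied geometric transformations. Evaluation on MaRVL-QA shows that even state-of-the-art MLLMs struggle significantly, often resorting to superficial heuristics instead of robust spatial reasoning. MaRVL-QA gives researchers a challenging tool to measure progress, expose model limitations, and guide the development of MLLMs with deeper reasoning abilities.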

Key Takeaways

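MLLMs still find deep mathematical and spatial reasoning directly from images challenging.
MaRVL-QA is a benchmark designed to evaluate these core reasoning skills quantitatively.
MaRVL-QA comprises two tasks: Topological Counting and Transformation Recognition.
Even state-of-the-art MLLMs struggle significantly on MaRVL-QA.
MLLMs often resort to superficial heuristics instead of robust spatial reasoning.
MaRVL-QA gives the research community a tool to measure progress and expose model limitations.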
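
To make the Topological Counting task concrete, here is a minimal sketch of how a ground-truth count of local maxima could be computed for a sampled surface. This is not the benchmark's own generation code: the function `two_bumps`, the grid range, and the 8-neighbour test are illustrative assumptions.

```python
import math

def sample_surface(f, lo, hi, n):
    """Sample f(x, y) on an n-by-n grid over [lo, hi] x [lo, hi]."""
    step = (hi - lo) / (n - 1)
    axis = [lo + k * step for k in range(n)]
    return [[f(x, y) for x in axis] for y in axis]

def count_local_maxima(z):
    """Count interior grid points strictly greater than all 8 neighbours."""
    rows, cols = len(z), len(z[0])
    count = 0
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            centre = z[i][j]
            neighbours = [
                z[i + di][j + dj]
                for di in (-1, 0, 1)
                for dj in (-1, 0, 1)
                if (di, dj) != (0, 0)
            ]
            if all(centre > value for value in neighbours):
                count += 1
    return count

def two_bumps(x, y):
    """Hypothetical test surface: two Gaussian bumps centred at (1, 0) and (-1, 0)."""
    return math.exp(-((x - 1) ** 2 + y ** 2)) + math.exp(-((x + 1) ** 2 + y ** 2))

surface = sample_surface(two_bumps, -3.0, 3.0, 61)
print(count_local_maxima(surface))  # prints 2
```

A strict-inequality test like this ignores plateaus and boundary maxima, which is one kind of ambiguity a benchmark would need to filter out.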

Cool Papers

Bhav-Net: Knowledge Transfer for Cross-Lingual Antonym vs Synonym Distinction via Dual-Space Graph Transformers

Authors:Samyak S. Sanghvi

Antonym vs synonym distinction across multiple languages presents unique computational challenges due to the paradoxical nature of antonymous relationships words that share semantic domains while expressing opposite meanings. This work introduces Bhav-Net, a novel dual-space architecture that enables effective knowledge transfer from complex multilingual models to simpler, language-specific architectures while maintaining robust cross-lingual antonym–synonym distinction capabilities. Our approach combines language-specific BERT encoders with graph transformer networks, creating distinct semantic projections where synonymous pairs cluster in one space while antonymous pairs exhibit high similarity in a complementary space. Through comprehensive evaluation across eight languages (English, German, French, Spanish, Italian, Portuguese, Dutch, and Russian), we demonstrate that semantic relationship modeling transfers effectively across languages. The dual-encoder design achieves competitive performance against state-of-the-art baselines while providing interpretable semantic representations and effective cross-lingual generalization.

Paper and project links

PDF Found some issues and need to correct them

Summary

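This paper introduces Bhav-Net, a dual-space architecture for distinguishing antonyms from synonyms across languages. It combines language-specific BERT encoders with graph transformer networks to build distinct semantic projections: synonymous pairs cluster in one space, while antonymous pairs exhibit high similarity in a complementary space. Comprehensive evaluation on eight languages shows that semantic relationship modeling transfers effectively across languages. The dual-encoder design is competitive with state-of-the-art baselines while providing interpretable semantic representations and effective cross-lingual generalization.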

Key Takeaways

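Bhav-Net is a dual-space architecture for cross-lingual antonym vs synonym distinction.
It combines language-specific BERT encoders with graph transformer networks to build distinct semantic projections.
Synonymous pairs cluster in one space, while antonymous pairs exhibit high similarity in a complementary space, which separates the two relations.
Evaluation on eight languages shows that semantic relationship modeling transfers effectively across languages.
The dual-encoder design is competitive with state-of-the-art baselines while providing interpretable semantic representations.
Bhav-Net supports effective cross-lingual generalization.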

Cool Papers

TreeReview: A Dynamic Tree of Questions Framework for Deep and Efficient LLM-based Scientific Peer Review

Authors:Yuan Chang, Ziyue Li, Hengyuan Zhang, Yuanbo Kong, Yanru Wu, Hayden Kwok-Hay So, Zhijiang Guo, Liya Zhu, Ngai Wong

While Large Language Models (LLMs) have shown significant potential in assisting peer review, current methods often struggle to generate thorough and insightful reviews while maintaining efficiency. In this paper, we propose TreeReview, a novel framework that models paper review as a hierarchical and bidirectional question-answering process. TreeReview first constructs a tree of review questions by recursively decomposing high-level questions into fine-grained sub-questions and then resolves the question tree by iteratively aggregating answers from leaf to root to get the final review. Crucially, we incorporate a dynamic question expansion mechanism to enable deeper probing by generating follow-up questions when needed. We construct a benchmark derived from ICLR and NeurIPS venues to evaluate our method on full review generation and actionable feedback comments generation tasks. Experimental results of both LLM-based and human evaluation show that TreeReview outperforms strong baselines in providing comprehensive, in-depth, and expert-aligned review feedback, while reducing LLM token usage by up to 80% compared to computationally intensive approaches. Our code and benchmark dataset are available at https://github.com/YuanChang98/tree-review.

Paper and project links

PDF Accepted to EMNLP2025 Main

Summary
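In LLM-assisted peer review, current methods often struggle to produce thorough and insightful reviews while staying efficient. This paper proposes TreeReview, a framework that models paper review as a hierarchical and bidirectional question-answering process. TreeReview builds a tree of review questions by recursively decomposing high-level questions into fine-grained sub-questions, then resolves the tree by aggregating answers from leaf to root to obtain the final review. A dynamic question expansion mechanism generates follow-up questions when deeper probing is needed. On a benchmark derived from ICLR and NeurIPS, both LLM-based and human evaluation show that TreeReview outperforms strong baselines in providing comprehensive, in-depth, and expert-aligned review feedback, while reducing LLM token usage by up to 80% compared with computationally intensive approaches. The code and benchmark dataset are available at https://github.com/YuanChang98/tree-review.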

Key Takeaways

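TreeReview is a framework that models peer review as a hierarchical and bidirectional question-answering process.
It builds a tree of review questions by recursively decomposing high-level questions into fine-grained sub-questions.
A dynamic question expansion mechanism generates follow-up questions to support deeper probing.
TreeReview outperforms strong baselines in producing comprehensive, in-depth review feedback.
Its feedback is better aligned with expert reviews, according to both LLM-based and human evaluation.
TreeReview reduces LLM token usage by up to 80% compared with computationally intensive approaches.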
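
The build-then-resolve flow described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: `decompose`, `answer_leaf`, `aggregate`, and `follow_up` are hypothetical callables standing in for LLM calls, and the toy versions below return canned strings.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class QuestionNode:
    question: str
    children: List["QuestionNode"] = field(default_factory=list)
    answer: str = ""

def build_tree(question: str, decompose: Callable[[str], List[str]],
               depth: int = 0, max_depth: int = 2) -> QuestionNode:
    """Top-down pass: recursively split a question into sub-questions."""
    node = QuestionNode(question)
    if depth < max_depth:
        for sub in decompose(question):
            node.children.append(build_tree(sub, decompose, depth + 1, max_depth))
    return node

def resolve(node: QuestionNode, answer_leaf, aggregate, follow_up) -> str:
    """Bottom-up pass: answer leaves, then aggregate answers toward the root."""
    if not node.children:
        node.answer = answer_leaf(node.question)
        return node.answer
    answers = [resolve(c, answer_leaf, aggregate, follow_up) for c in node.children]
    # Dynamic expansion: add follow-up questions suggested by the answers so far.
    for extra in follow_up(node.question, answers):
        child = QuestionNode(extra)
        node.children.append(child)
        answers.append(resolve(child, answer_leaf, aggregate, follow_up))
    node.answer = aggregate(node.question, answers)
    return node.answer

# Toy stand-ins for the LLM calls (hypothetical, for illustration only).
def toy_decompose(q: str) -> List[str]:
    table = {"Is the paper sound?": ["Are the claims supported?",
                                     "Is the evaluation fair?"]}
    return table.get(q, [])

def toy_answer(q: str) -> str:
    return f"answer({q})"

def toy_aggregate(q: str, answers: List[str]) -> str:
    return f"{q} <- {len(answers)} answers"

def toy_follow_up(q: str, answers: List[str]) -> List[str]:
    return ["Are the baselines tuned?"] if q == "Is the paper sound?" else []

root = build_tree("Is the paper sound?", toy_decompose)
print(resolve(root, toy_answer, toy_aggregate, toy_follow_up))
# prints: Is the paper sound? <- 3 answers
```

In this sketch the follow-up question is answered directly as a leaf; a fuller version might decompose follow-ups as well.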

Cool Papers

Is Your LLM Overcharging You? Tokenization, Transparency, and Incentives

Authors:Ander Artola Velasco, Stratis Tsirtsis, Nastaran Okati, Manuel Gomez-Rodriguez

State-of-the-art large language models require specialized hardware and substantial energy to operate. As a consequence, cloud-based services that provide access to large language models have become very popular. In these services, the price users pay for an output provided by a model depends on the number of tokens the model uses to generate it – they pay a fixed price per token. In this work, we show that this pricing mechanism creates a financial incentive for providers to strategize and misreport the (number of) tokens a model used to generate an output, and users cannot prove, or even know, whether a provider is overcharging them. However, we also show that, if an unfaithful provider is obliged to be transparent about the generative process used by the model, misreporting optimally without raising suspicion is hard. Nevertheless, as a proof-of-concept, we develop an efficient heuristic algorithm that allows providers to significantly overcharge users without raising suspicion. Crucially, we demonstrate that the cost of running the algorithm is lower than the additional revenue from overcharging users, highlighting the vulnerability of users under the current pay-per-token pricing mechanism. Further, we show that, to eliminate the financial incentive to strategize, a pricing mechanism must price tokens linearly on their character count. While this makes a provider’s profit margin vary across tokens, we introduce a simple prescription under which the provider who adopts such an incentive-compatible pricing mechanism can maintain the average profit margin they had under the pay-per-token pricing mechanism. Along the way, to illustrate and complement our theoretical results, we conduct experiments with several large language models from the $\texttt{Llama}$, $\texttt{Gemma}$ and $\texttt{Ministral}$ families, and input prompts from the LMSYS Chatbot Arena platform.

Paper and project links

PDF

Summary

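This paper examines providers' financial incentives in cloud LLM services that charge per token. Under this pricing mechanism, providers have an incentive to strategically misreport the number of tokens used to generate an output, and users cannot prove, or even know, whether they are being overcharged. As a proof of concept, the authors develop an efficient heuristic algorithm that overcharges users significantly without raising suspicion, and show that running it costs less than the additional revenue it brings. They also show that eliminating the incentive requires pricing tokens linearly in their character count. Experiments with several large language models illustrate and complement the theoretical results.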

Key Takeaways

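State-of-the-art LLMs need specialized hardware and substantial energy, so cloud services that provide access to them have become very popular.
In these services, pay-per-token pricing gives providers a financial incentive to strategically misreport the number of tokens used.
Users cannot prove, or even know, whether a provider is overcharging them.
The authors develop an efficient heuristic algorithm that lets providers overcharge significantly without raising suspicion, at a cost lower than the additional revenue.
To eliminate this incentive, a pricing mechanism must price tokens linearly in their character count.
Under a simple prescription, a provider that adopts such a mechanism can keep the average profit margin it had under pay-per-token pricing.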
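
A small worked example may help show why character-linear pricing removes the incentive. It is a sketch with made-up tokenizations and integer prices, not the paper's algorithm: the same output string is billed under two different reported tokenizations.

```python
def pay_per_token(tokens, price_per_token):
    """Bill depends on how many tokens the provider reports."""
    return price_per_token * len(tokens)

def pay_per_character(tokens, price_per_char):
    """Each token is priced linearly in its character count."""
    return price_per_char * sum(len(token) for token in tokens)

# Two hypothetical tokenizations of the same output string.
honest = ["token", "ization"]
padded = ["tok", "en", "iz", "ation"]
assert "".join(honest) == "".join(padded) == "tokenization"

print(pay_per_token(honest, 10), pay_per_token(padded, 10))        # prints 20 40
print(pay_per_character(honest, 2), pay_per_character(padded, 2))  # prints 24 24
```

Under per-token pricing the finer tokenization doubles the bill, whereas the character-linear bill depends only on the output string, so reporting a different tokenization of the same text gains the provider nothing.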

Cool Papers

Towards Visuospatial Cognition via Hierarchical Fusion of Visual Experts

Authors:Qi Feng

While Multimodal Large Language Models (MLLMs) excel at general vision-language tasks, visuospatial cognition - reasoning about spatial layouts, relations, and dynamics - remains a significant challenge. Existing models often lack the necessary architectural components and specialized training data for fine-grained spatial understanding. We introduce ViCA2 (Visuospatial Cognitive Assistant 2), a novel MLLM designed to enhance spatial reasoning. ViCA2 features a dual vision encoder architecture integrating SigLIP for semantics and Hiera for spatial structure, coupled with a token ratio control mechanism for efficiency. We also developed ViCA-322K, a new large-scale dataset with over 322,000 spatially grounded question-answer pairs for targeted instruction tuning. On the challenging VSI-Bench benchmark, our ViCA2-7B model achieves a state-of-the-art average score of 56.8, significantly surpassing larger open-source models (e.g., LLaVA-NeXT-Video-72B, 40.9) and leading proprietary models (Gemini-1.5 Pro, 45.4). This demonstrates the effectiveness of our approach in achieving strong visuospatial intelligence with a compact model. We release ViCA2, its codebase, and the ViCA-322K dataset to facilitate further research.

Paper and project links

PDF 26 pages, 19 figures, 4 tables

Summary

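This paper addresses the challenge that visuospatial cognition poses for Multimodal Large Language Models (MLLMs) and introduces ViCA2, a new MLLM designed to enhance spatial reasoning. ViCA2 uses a dual vision encoder architecture that integrates SigLIP for semantics and Hiera for spatial structure, together with a token ratio control mechanism for efficiency. The authors also build ViCA-322K, a large-scale dataset of over 322,000 spatially grounded question-answer pairs for targeted instruction tuning. On the challenging VSI-Bench benchmark, ViCA2-7B achieves a state-of-the-art average score of 56.8, well above larger open-source models (e.g., LLaVA-NeXT-Video-72B, 40.9) and leading proprietary models (Gemini-1.5 Pro, 45.4), showing that a compact model can reach strong visuospatial intelligence. ViCA2, its codebase, and the ViCA-322K dataset are released to support further research.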

Key Takeaways

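Visuospatial cognition remains a significant challenge for Multimodal Large Language Models.
ViCA2 is a new MLLM designed to enhance spatial reasoning.
ViCA2 uses a dual vision encoder architecture: SigLIP for semantics and Hiera for spatial structure, with a token ratio control mechanism for efficiency.
The ViCA-322K dataset provides over 322,000 spatially grounded question-answer pairs for targeted instruction tuning.
ViCA2-7B achieves a state-of-the-art average score of 56.8 on VSI-Bench, surpassing LLaVA-NeXT-Video-72B (40.9) and Gemini-1.5 Pro (45.4).
The results show that a compact model can achieve strong visuospatial intelligence.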

Cool Papers

Audio-centric Video Understanding Benchmark without Text Shortcut

Authors:Yudong Yang, Jimin Zhuang, Guangzhi Sun, Changli Tang, Yixuan Li, Peihan Li, Yifan Jiang, Wei Li, Zejun Ma, Chao Zhang

Audio often serves as an auxiliary modality in video understanding tasks of audio-visual large language models (LLMs), merely assisting in the comprehension of visual information. However, a thorough understanding of videos significantly depends on auditory information, as audio offers critical context, emotional cues, and semantic meaning that visual data alone often lacks. This paper proposes an audio-centric video understanding benchmark (AVUT) to evaluate the video comprehension capabilities of multimodal LLMs with a particular focus on auditory information. AVUT introduces a suite of carefully designed audio-centric tasks, holistically testing the understanding of both audio content and audio-visual interactions in videos. Moreover, this work points out the text shortcut problem that largely exists in other benchmarks where the correct answer can be found from question text alone without needing videos. AVUT addresses this problem by proposing a answer permutation-based filtering mechanism. A thorough evaluation across a diverse range of open-source and proprietary multimodal LLMs is performed, followed by the analyses of deficiencies in audio-visual LLMs. Demos and data are available at https://github.com/lark-png/AVUT.

Paper and project links

PDF Accepted for publication in the Proceedings of EMNLP 2025 (Main Conference)

Summary
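This paper proposes an audio-centric video understanding benchmark (AVUT) to evaluate the video comprehension capabilities of multimodal LLMs, with a particular focus on auditory information. AVUT introduces a suite of carefully designed audio-centric tasks that holistically test the understanding of both audio content and audio-visual interactions in videos. The paper also points out the text shortcut problem common in other benchmarks, where the correct answer can be found from the question text alone, and addresses it with an answer permutation-based filtering mechanism.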

Key Takeaways