LLM

发布日期: 2025-07-05

更新日期: 2025-07-17

文章字数: 21.1k

阅读时长: 86 分

阅读次数:

⚠️ 以下所有内容总结都来自于大语言模型的能力，如有错误，仅供参考，谨慎使用
🔴 请注意：千万不要用于严肃的学术场景，只能用于论文阅读前的初筛！
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ，还请您给我们一些鼓励！⭐️ HuggingFace免费体验

2025-07-05 更新

Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation

Authors:Jiaer Xia, Bingkui Tong, Yuhang Zang, Rui Shao, Kaiyang Zhou

Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in interpreting images using natural language. However, without using large-scale datasets for retraining, these models are difficult to adapt to specialized vision tasks, e.g., chart understanding. This problem is caused by a mismatch between pre-training and downstream datasets: pre-training datasets primarily concentrate on scenes and objects but contain limited information about specialized, non-object images, such as charts and tables. In this paper, we share an interesting finding that training an MLLM with Chain-of-Thought (CoT) reasoning data can facilitate model adaptation in specialized vision tasks, especially under data-limited regimes. However, we identify a critical issue within CoT data distilled from pre-trained MLLMs, i.e., the data often contains multiple factual errors in the reasoning steps. To address the problem, we propose Grounded Chain-of-Thought (GCoT), a simple bootstrapping-based approach that aims to inject grounding information (i.e., bounding boxes) into CoT data, essentially making the reasoning steps more faithful to input images. We evaluate our approach on five specialized vision tasks, which cover a variety of visual formats including charts, tables, receipts, and reports. The results demonstrate that under data-limited regimes our approach significantly improves upon fine-tuning and distillation.

多模态大型语言模型（MLLMs）在利用自然语言解释图像方面表现出了显著的能力。然而，如果不使用大规模数据集进行重新训练，这些模型很难适应特定的视觉任务，例如图表理解。这个问题是由预训练与下游数据集之间的不匹配造成的：预训练数据集主要集中在场景和对象上，但关于特定、非对象图像（如图表和表格）的信息有限。在本文中，我们分享了一个有趣的发现，即通过链式思维（CoT）推理数据训练MLLM可以促进模型在特定视觉任务中的适应，尤其是在数据有限的情况下。然而，我们发现了从预训练的MLLM中提炼出的CoT数据存在的一个关键问题，即数据中的推理步骤往往存在多个事实错误。为了解决这个问题，我们提出了基于接地链思维（GCoT）的方法，该方法旨在注入接地信息（即边界框）到CoT数据中，本质上使推理步骤更加忠于输入图像。我们在五个特定的视觉任务上评估了我们的方法，这些任务涵盖了各种视觉格式，包括图表、表格、收据和报告。结果表明，在数据有限的情况下，我们的方法显著改进了微调方法和提炼方法的结果。

论文及项目相关链接

PDF Accepted by ICCV2025

Summary
多模态大型语言模型（MLLMs）在利用自然语言解读图像方面表现出卓越的能力，但在不使用大规模数据集进行再训练的情况下，难以适应特定的视觉任务，如图表理解。本文提出一种基于思维链（CoT）推理数据训练MLLM的方法，有助于模型在特定视觉任务中的适应，特别是在数据有限的情况下。然而，发现CoT数据存在关键性问题，即推理步骤中常含有多处事实错误。为解决此问题，本文提出基于接地思维链（GCoT）的方法，旨在将接地信息（即边界框）注入CoT数据，使推理步骤更忠实于输入图像。评估结果表明，在数据有限的情况下，该方法显著改进了微调与蒸馏的效果。

Key Takeaways

MLLMs在解读图像方面表现出卓越能力，但适应特定视觉任务存在困难。
CoT数据有助于MLLM在特定视觉任务中的适应，但在推理步骤中存在事实错误。
GCoT方法旨在解决CoT数据中的问题，通过注入接地信息提高推理准确性。
GCoT方法在数据有限的情况下显著改进了MLLM的适应性和性能。
评估结果表明GCoT在多种特定视觉任务上表现优异。
接地信息的引入增强了MLLM对输入图像的理解能力。

Cool Papers

点此查看论文截图

Requirements Elicitation Follow-Up Question Generation

Authors:Yuchen Shen, Anmol Singhal, Travis Breaux

Interviews are a widely used technique in eliciting requirements to gather stakeholder needs, preferences, and expectations for a software system. Effective interviewing requires skilled interviewers to formulate appropriate interview questions in real time while facing multiple challenges, including lack of familiarity with the domain, excessive cognitive load, and information overload that hinders how humans process stakeholders’ speech. Recently, large language models (LLMs) have exhibited state-of-the-art performance in multiple natural language processing tasks, including text summarization and entailment. To support interviewers, we investigate the application of GPT-4o to generate follow-up interview questions during requirements elicitation by building on a framework of common interviewer mistake types. In addition, we describe methods to generate questions based on interviewee speech. We report a controlled experiment to evaluate LLM-generated and human-authored questions with minimal guidance, and a second controlled experiment to evaluate the LLM-generated questions when generation is guided by interviewer mistake types. Our findings demonstrate that, for both experiments, the LLM-generated questions are no worse than the human-authored questions with respect to clarity, relevancy, and informativeness. In addition, LLM-generated questions outperform human-authored questions when guided by common mistakes types. This highlights the potential of using LLMs to help interviewers improve the quality and ease of requirements elicitation interviews in real time.

访谈是广泛使用的技术，用于激发需求以收集软件系统的利益相关者的需求、偏好和期望。有效的访谈需要技能娴熟的访谈者在面对多个挑战时，实时制定适当的访谈问题，包括对领域的陌生、过度的认知负荷和信息过载，这妨碍了人类处理利益相关者的言语。最近，大型语言模型（LLM）在多个自然语言处理任务中表现出了最先进的性能，包括文本摘要和内涵推理。为了支持访谈者，我们调查了GPT-4o在需求激发过程中生成后续访谈问题的应用，这建立在常见的访谈者错误类型框架之上。此外，我们还介绍了基于受访者言语生成问题的方法。我们报告了一项控制实验，以评估LLM生成的问题和几乎无指导的人类创作的问题，以及第二项控制实验，以评估在生成过程中受到访谈者错误类型引导下的LLM生成的问题。我们的研究结果表明，对于这两项实验，LLM生成的问题在清晰度、相关性和信息量方面并不亚于人类创作的问题。此外，当受到常见错误类型的引导时，LLM生成的问题表现优于人类创作的问题。这突出了使用LLM帮助访谈者在实时中提高需求激发访谈的质量和便捷性的潜力。

论文及项目相关链接

PDF 13 pages, 2 figures, accepted at the 33rd IEEE International Requirements Engineering 2025

Summary
访谈是广泛使用的技术，用于激发需求以收集软件系统的利益相关者需求、偏好和期望。有效访谈需要熟练的问答师实时制定适当的问题，并面临多重挑战，包括不熟悉领域、过度认知负荷和信息过载，阻碍了人们对利益相关者话语的处理。我们探讨了基于GPT-4o的大型语言模型（LLM）在需求激发访谈中的应用潜力，旨在支持问答师生成跟进问题。本文通过建立常见的问答师错误类型框架来构建生成问题的方法。我们通过受控实验评估了在最小指导下LLM生成的问题与人类创作的问题之间的表现差异，以及根据问答师错误类型指导问题生成时的表现差异。研究结果表明，无论是在何种实验中，LLM生成的问题在清晰度、相关性和信息量方面都不亚于人类创作的问题。在指导常见错误类型的情况下，LLM生成的问题表现优于人类创作的问题。这突显了使用LLM实时帮助问答师提高需求激发访谈质量和便捷性的潜力。

Key Takeaways

访谈是软件需求收集的关键技术，用于获取利益相关者的需求、偏好和期望。
有效访谈需要面对多重挑战，包括领域不熟悉、认知负荷和信息过载。
大型语言模型（LLM）如GPT-4o在生成访谈问题方面具有潜力。
通过建立常见问答师错误类型框架来增强LLM生成问题的效果。
LLM生成的问题在清晰度、相关性和信息量方面与人类创作的问题相当。
在指导常见错误类型的情况下，LLM生成的问题表现优于人类创作的问题。

Cool Papers

点此查看论文截图

MOTIF: Modular Thinking via Reinforcement Fine-tuning in LLMs

Authors:Purbesh Mitra, Sennur Ulukus

Recent advancements in the reasoning capabilities of large language models (LLMs) show that employing group relative policy optimization (GRPO) algorithm for reinforcement learning (RL) training allows the models to use more thinking/reasoning tokens for generating better responses. However, LLMs can generate only a finite amount of tokens while maintaining attention to the previously generated tokens. This limit, also known as the context size of an LLM, is a bottleneck in LLM reasoning with arbitrarily large number of tokens. To think beyond the limit of context size, an LLM must employ a modular thinking strategy to reason over multiple rounds. In this work, we propose $\textbf{MOTIF: Modular Thinking via Reinforcement Finetuning}$ – an RL training method for generating thinking tokens in multiple rounds, effectively allowing the model to think with additional context size. We trained the open-source model Qwen2.5-3B-Instruct on GSM8K dataset via parameter efficient fine-tuning and tested its accuracy on MATH500 and AIME2024 benchmarks. Our experiments show 3.8% and 3.3% improvements over vanilla GRPO based training in the respective benchmarks. Furthermore, this improvement was achieved with only 15% of samples, thus demonstrating sample efficiency of MOTIF. Our code and models are available at https://github.com/purbeshmitra/MOTIF and https://huggingface.co/purbeshmitra/MOTIF, respectively.

最近，大型语言模型（LLM）的推理能力取得了进展，研究表明，采用群体相对策略优化（GRPO）算法进行强化学习（RL）训练，可以使模型利用更多的思考/推理标记来生成更好的回应。然而，LLM只能在保持对先前生成标记的注意的同时生成有限数量的标记。这个限制也被称为LLM的上下文大小，是在使用任意数量的标记进行LLM推理时的瓶颈。为了超越上下文大小的限制，LLM必须采用模块化思考策略来进行多轮推理。在这项工作中，我们提出了MOTIF：通过强化微调实现模块化思考——一种用于生成多轮思考标记的RL训练方法，有效地使模型能够用额外的上下文大小进行思考。我们在GSM8K数据集上对开源模型Qwen2.5-3B-Instruct进行了参数高效的微调训练，并在MATH500和AIME2024基准测试上对其准确性进行了测试。实验表明，与基于GRPO的常规训练相比，我们的方法在各自基准测试中的准确率提高了3.8%和3.3%。此外，这一改进仅使用了15%的样本，从而证明了MOTIF的样本效率。我们的代码和模型分别可在https://github.com/purbeshmitra/MOTIF和https://huggingface.co/purbeshmitra/MOTIF找到。

论文及项目相关链接

PDF

Summary

基于大型语言模型（LLM）的推理能力最新进展，研究采用群体相对策略优化（GRPO）算法进行强化学习（RL）训练，使模型能够利用更多思考/推理令牌生成更好的响应。然而，LLM在维持对先前生成令牌的注意时，只能生成有限数量的令牌。这一限制是LLM进行任意大量令牌推理的瓶颈。本研究提出了通过强化微调实现模块化思维的MOTIF方法——一种RL训练生成多轮思考令牌的方法，有效地使模型能够思考额外的上下文大小。我们在GSM8K数据集上训练了开源模型Qwen2.5-3B-Instruct，并在MATH500和AIME2024基准测试上测试了其准确性。实验表明，与基于GRPO的常规训练相比，MOTIF在相应基准测试中的准确率提高了3.8%和3.3%。此外，这一改进仅使用了15%的样本，显示了MOTIF的高样本效率。我们的代码和模型分别位于https://github.com/purbeshmitra/MOTIF和https://huggingface.co/purbeshmitra/MOTIF。

Key Takeaways

LLM采用GRPO算法强化训练可以提升生成响应质量。
LLM面临生成令牌数量有限的瓶颈。
MOTIF是一种多轮推理令牌生成的RL训练方法，增加模型的上下文处理能力。
在MATH500和AIME2024基准测试中，MOTIF较常规GRPO训练准确率有所提升。
MOTIF在样本效率方面表现优越，仅使用15%样本即取得显著改进。
研究采用了开源模型Qwen2.5-3B-Instruct进行训练，并公开了代码和模型供他人使用。

Cool Papers

点此查看论文截图

Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection

Authors:Ziqi Miao, Yi Ding, Lijun Li, Jing Shao

With the emergence of strong visual-language capabilities, multimodal large language models (MLLMs) have demonstrated tremendous potential for real-world applications. However, the security vulnerabilities exhibited by the visual modality pose significant challenges to deploying such models in open-world environments. Recent studies have successfully induced harmful responses from target MLLMs by encoding harmful textual semantics directly into visual inputs. However, in these approaches, the visual modality primarily serves as a trigger for unsafe behavior, often exhibiting semantic ambiguity and lacking grounding in realistic scenarios. In this work, we define a novel setting: visual-centric jailbreak, where visual information serves as a necessary component in constructing a complete and realistic jailbreak context. Building on this setting, we propose the VisCo (Visual Contextual) Attack. VisCo fabricates contextual dialogue using four distinct visual-focused strategies, dynamically generating auxiliary images when necessary to construct a visual-centric jailbreak scenario. To maximize attack effectiveness, it incorporates automatic toxicity obfuscation and semantic refinement to produce a final attack prompt that reliably triggers harmful responses from the target black-box MLLMs. Specifically, VisCo achieves a toxicity score of 4.78 and an Attack Success Rate (ASR) of 85% on MM-SafetyBench against GPT-4o, significantly outperforming the baseline, which performs a toxicity score of 2.48 and an ASR of 22.2%. The code is available at https://github.com/Dtc7w3PQ/Visco-Attack.

随着强大的视觉语言能力的出现，多模态大型语言模型（MLLMs）在真实世界应用中表现出了巨大的潜力。然而，视觉模式所展现的安全漏洞为在开放世界环境中部署此类模型带来了重大挑战。最近的研究已经成功诱导目标MLLMs产生有害响应，通过将有害文本语义直接编码到视觉输入中。然而，在这些方法中，视觉模式主要充当不安全行为的触发因素，通常表现出语义模糊，并且在现实场景中缺乏依据。在这项工作中，我们定义了一种新型场景：视觉为中心的越狱，其中视觉信息在构建完整且现实的越狱上下文中充当必要组成部分。基于这一场景，我们提出了VisCo（视觉上下文）攻击。VisCo使用四种不同的以视觉为中心的策略来制造上下文对话，在必要时动态生成辅助图像，以构建视觉为中心的越狱场景。为了最大化攻击效果，它结合了自动的毒性模糊处理和语义细化，以产生最终的攻击提示，从而可靠地触发目标黑盒MLLMs的有害响应。具体来说，VisCo在MM-SafetyBench上对GPT-4o实现的毒性得分为4.78，攻击成功率（ASR）为85%，显著优于基线，基线的毒性得分为2.48，ASR为22.2%。代码可在https://github.com/Dtc7w3PQ/Visco-Attack 获得。

论文及项目相关链接

PDF 16 pages

Summary

本文介绍了视觉为中心的多模态大语言模型（MLLMs）在现实世界应用中的巨大潜力及其安全漏洞挑战。针对现有研究中视觉模态主要用于触发不安全行为的问题，本文定义了一种新的攻击场景：视觉为中心越狱，并提出了一种基于视觉上下文的攻击方法VisCo。VisCo利用四种独特的视觉策略构建上下文对话，并在必要时动态生成辅助图像来构建真实的越狱场景。通过自动的毒性混淆和语义优化，VisCo能够有效地触发目标黑箱MLLMs的恶意响应。它在MM-SafetyBench上的毒性得分为4.78，攻击成功率为85%，显著优于基线方法。

Key Takeaways

多模态大语言模型（MLLMs）在现实世界应用中具有巨大潜力，但存在安全漏洞挑战。
现有研究主要通过视觉模态触发MLLMs的不安全行为，但这种方法存在语义模糊和缺乏实际场景基础的问题。
本文定义了新的攻击场景：视觉为中心越狱，并提出VisCo攻击方法，利用视觉上下文构建对话。
VisCo通过动态生成辅助图像构建真实的越狱场景，并采用自动毒性混淆和语义优化技术。
VisCo在MM-SafetyBench上的实验结果显示其毒性得分为4.78，攻击成功率为85%，显著优于基线方法。
VisCo攻击方法具有重要的实际意义，提醒我们需要注意多模态语言模型的安全性问题。

Cool Papers

点此查看论文截图

StepHint: Multi-level Stepwise Hints Enhance Reinforcement Learning to Reason

Authors:Kaiyi Zhang, Ang Lv, Jinpeng Li, Yongbo Wang, Feng Wang, Haoyuan Hu, Rui Yan

Reinforcement learning with verifiable rewards (RLVR) is a promising approach for improving the complex reasoning abilities of large language models (LLMs). However, current RLVR methods face two significant challenges: the near-miss reward problem, where a small mistake can invalidate an otherwise correct reasoning process, greatly hindering training efficiency; and exploration stagnation, where models tend to focus on solutions within their comfort zone,'' lacking the motivation to explore potentially more effective alternatives. To address these challenges, we propose StepHint, a novel RLVR algorithm that utilizes multi-level stepwise hints to help models explore the solution space more effectively. StepHint generates valid reasoning chains from stronger models and partitions these chains into reasoning steps using our proposed adaptive partitioning method. The initial few steps are used as hints, and simultaneously, multiple-level hints (each comprising a different number of steps) are provided to the model. This approach directs the model's exploration toward a promising solution subspace while preserving its flexibility for independent exploration. By providing hints, StepHint mitigates the near-miss reward problem, thereby improving training efficiency. Additionally, the external reasoning pathways help the model develop better reasoning abilities, enabling it to move beyond its comfort zone’’ and mitigate exploration stagnation. StepHint outperforms competitive RLVR enhancement methods across six mathematical benchmarks, while also demonstrating superior generalization and excelling over baselines on out-of-domain benchmarks.

强化学习与可验证奖励（RLVR）是提高大型语言模型（LLM）复杂推理能力的一种有前途的方法。然而，当前的RLVR方法面临两大挑战：一是近错奖励问题，即一个小错误可能会使正确的推理过程无效，极大地阻碍训练效率；二是探索停滞问题，模型往往局限于其“舒适区”内的解决方案，缺乏探索可能更有效的替代方案的动机。为了应对这些挑战，我们提出了StepHint这一新型RLVR算法，它利用多层次逐步提示来帮助模型更有效地探索解空间。StepHint从更强大的模型中生成有效的推理链，并使用我们提出的自适应分区方法将这些链分割成推理步骤。最初的几个步骤被用作提示，同时向模型提供多个级别的提示（每个级别包含不同数量的步骤）。这种方法可以引导模型的探索朝着有希望的解决方案子空间进行，同时保持其独立探索的灵活性。通过提供提示，StepHint缓解了近错奖励问题，从而提高了训练效率。此外，外部推理路径有助于模型发展更好的推理能力，使其能够超越其“舒适区”，并缓解探索停滞问题。在六个数学基准测试中，StepHint的表现优于其他竞争性RLVR增强方法，同时显示出更好的泛化能力，并在域外基准测试上超越基线。

论文及项目相关链接

PDF

Summary
强化学习与可验证奖励（RLVR）在提升大型语言模型（LLM）的复杂推理能力方面具有广阔前景。然而，现有RLVR方法面临两大挑战：近邻奖励问题，即一个微小错误可能破坏正确的推理过程，极大影响训练效率；以及探索停滞问题，模型倾向于专注于其“舒适区”内的解决方案，缺乏探索更多有效替代方案的动机。针对这些挑战，我们提出StepHint这一新型RLVR算法，利用多层次逐步提示帮助模型更有效地探索解空间。StepHint通过自适应分区方法将更强模型的合理链分割成推理步骤。初始的几个步骤作为提示，同时提供多级提示（每个包含不同数量的步骤）给模型。这一方法引导模型朝着有前景的解决方案子空间探索，同时保持其独立探索的灵活性。通过提供提示，StepHint缓解了近邻奖励问题，提高了训练效率。此外，外部推理途径帮助模型发展更好的推理能力，使其能够超越其“舒适区”，并缓解探索停滞问题。StepHint在六个数学基准测试中表现出优于其他RLVR增强方法的效果，同时在域外基准测试中展现出色的泛化性能。

Key Takeaways

RLVR对于提高LLM的复杂推理能力具有重要意义。
当前RLVR方法面临近邻奖励问题和探索停滞问题。
StepHint是一种新型的RLVR算法，通过多层次逐步提示帮助模型更有效地探索解空间。
StepHint利用自适应分区方法将推理步骤分割，并提供初始提示来引导模型训练。
StepHint缓解了近邻奖励问题并提高训练效率。
StepHint有助于模型超越其“舒适区”，发展更好的推理能力并缓解探索停滞问题。

Cool Papers

点此查看论文截图

Multimodal Mathematical Reasoning with Diverse Solving Perspective

Authors:Wenhao Shi, Zhiqiang Hu, Yi Bin, Yang Yang, See-Kiong Ng, Heng Tao Shen

Recent progress in large-scale reinforcement learning (RL) has notably enhanced the reasoning capabilities of large language models (LLMs), especially in mathematical domains. However, current multimodal LLMs (MLLMs) for mathematical reasoning often rely on one-to-one image-text pairs and single-solution supervision, overlooking the diversity of valid reasoning perspectives and internal reflections. In this work, we introduce MathV-DP, a novel dataset that captures multiple diverse solution trajectories for each image-question pair, fostering richer reasoning supervision. We further propose Qwen-VL-DP, a model built upon Qwen-VL, fine-tuned with supervised learning and enhanced via group relative policy optimization (GRPO), a rule-based RL approach that integrates correctness discrimination and diversity-aware reward functions. Our method emphasizes learning from varied reasoning perspectives and distinguishing between correct yet distinct solutions. Extensive experiments on the MathVista’s minitest and Math-V benchmarks demonstrate that Qwen-VL-DP significantly outperforms prior base MLLMs in both accuracy and generative diversity, highlighting the importance of incorporating diverse perspectives and reflective reasoning in multimodal mathematical reasoning.

近期强化学习（RL）在大规模语言模型（LLM）中的进展显著提高了其推理能力，特别是在数学领域。然而，当前用于数学推理的多模态LLM（MLLM）常常依赖于一对一的图像文本对和单一解决方案的监督，忽略了有效的推理视角的多样性和内部反思。在这项工作中，我们引入了MathV-DP，这是一个新型数据集，能够捕捉每个图像问题对的多个多样化解决方案轨迹，为更丰富的推理监督提供支持。我们进一步提出了Qwen-VL-DP模型，该模型基于Qwen-VL构建，通过监督学习进行微调，并通过基于组的相对策略优化（GRPO）增强。GRPO是一种基于规则的RL方法，集成了正确性判别和多样性感知奖励函数。我们的方法强调从多样化的推理视角学习，并区分正确但不同的解决方案。在MathVista的微小测试和Math-V基准测试上的广泛实验表明，Qwen-VL-DP在准确性和生成多样性方面显著优于先前的基础MLLM，突显了在多模态数学推理中融入多样视角和反思推理的重要性。

论文及项目相关链接

PDF 8 pages

Summary

近期强化学习在大型语言模型（LLM）中的进步显著提高了数学领域中的推理能力。然而，现有的多模态大型语言模型（MLLMs）在数学推理上往往依赖于一对一的图像文本对和单一解决方案的监督学习，忽视了有效推理视角的多样性和内部反思。本研究引入MathV-DP数据集，针对每对图像问题捕捉多个多样的解决方案轨迹，提供丰富的推理监督。进一步提出Qwen-VL-DP模型，基于Qwen-VL构建，通过监督学习微调，并采用基于规则的强化学习方法——组相对策略优化（GRPO）进行增强。该方法重视从不同推理视角学习并区分正确但不同的解决方案。在MathVista的Mini测试和Math-V基准测试上的广泛实验表明，Qwen-VL-DP在准确性和生成多样性方面显著优于先前的MLLM基础模型，强调在多媒体数学推理中融入多样视角和反思推理的重要性。

Key Takeaways

大型语言模型（LLM）在数学领域的推理能力得到增强。
当前多模态大型语言模型（MLLMs）在数学推理上缺乏多样性视角和内部反思。
引入MathV-DP数据集，包含多样化的解决方案轨迹，以促进丰富的推理监督。
提出Qwen-VL-DP模型，结合监督学习和基于规则的强化学习（GRPO）进行增强。
Qwen-VL-DP模型重视从不同推理视角学习并区分正确但不同的解决方案。
在基准测试上，Qwen-VL-DP模型表现出更高的准确性和生成多样性。

Cool Papers

点此查看论文截图

Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models

Authors:Riccardo Cantini, Nicola Gabriele, Alessio Orsino, Domenico Talia

Reasoning Language Models (RLMs) have gained traction for their ability to perform complex, multi-step reasoning tasks through mechanisms such as Chain-of-Thought (CoT) prompting or fine-tuned reasoning traces. While these capabilities promise improved reliability, their impact on robustness to social biases remains unclear. In this work, we leverage the CLEAR-Bias benchmark, originally designed for Large Language Models (LLMs), to investigate the adversarial robustness of RLMs to bias elicitation. We systematically evaluate state-of-the-art RLMs across diverse sociocultural dimensions, using an LLM-as-a-judge approach for automated safety scoring and leveraging jailbreak techniques to assess the strength of built-in safety mechanisms. Our evaluation addresses three key questions: (i) how the introduction of reasoning capabilities affects model fairness and robustness; (ii) whether models fine-tuned for reasoning exhibit greater safety than those relying on CoT prompting at inference time; and (iii) how the success rate of jailbreak attacks targeting bias elicitation varies with the reasoning mechanisms employed. Our findings reveal a nuanced relationship between reasoning capabilities and bias safety. Surprisingly, models with explicit reasoning, whether via CoT prompting or fine-tuned reasoning traces, are generally more vulnerable to bias elicitation than base models without such mechanisms, suggesting reasoning may unintentionally open new pathways for stereotype reinforcement. Reasoning-enabled models appear somewhat safer than those relying on CoT prompting, which are particularly prone to contextual reframing attacks through storytelling prompts, fictional personas, or reward-shaped instructions. These results challenge the assumption that reasoning inherently improves robustness and underscore the need for more bias-aware approaches to reasoning design.

推理语言模型（RLMs）通过思维链（CoT）提示或精细调整后的推理轨迹等机制，能够执行复杂的多步推理任务，因此受到了广泛关注。尽管这些功能提高了可靠性，但它们对社会偏见稳健性的影响仍然不清楚。在这项工作中，我们利用原本为大型语言模型（LLM）设计的CLEAR-Bias基准测试，来研究RLMs对抗偏见引发的稳健性。我们系统地评估了最先进的RLMs在多种社会文化维度上的表现，采用LLM作为法官的方法进行自动化安全评分，并利用越狱技术来评估内置安全机制的强度。我们的评估解决了三个关键问题：（i）推理能力的引入如何影响模型的公平性和稳健性；（ii）与依赖CoT提示进行推断的模型相比，经过精细调整的推理模型是否表现出更高的安全性；以及（iii）针对偏见引发的攻击，采用不同推理机制的越狱攻击成功率如何变化。我们的研究发现推理能力与偏见安全之间有着微妙的关系。令人惊讶的是，无论是通过CoT提示还是精细调整的推理轨迹，具有明确推理能力的模型通常更容易受到偏见的影响，这表明推理可能会无意中打开刻板印象强化的新途径。与依赖CoT提示的模型相比，具备推理能力的模型似乎更安全一些，尤其是在面对通过故事提示、虚构角色或奖励导向指令进行上下文重构攻击时表现出较高的脆弱性。这些结果挑战了推理固有提高稳健性的假设，并强调需要更多偏向理性的方法来设计推理。

论文及项目相关链接

PDF

摘要

本摘要主要探讨了Reasoning Language Models（RLMs）对社会偏见影响的稳健性。研究利用专为大型语言模型设计的CLEAR-Bias benchmark来评估RLMs对抗偏见激发的稳健性。研究系统地评估了最先进的RLMs在不同社会文化维度上的表现，利用LLM作为法官的方法来进行自动化安全评分并利用越狱技术来评估内置安全机制的强度。研究发现，引入推理能力如何影响模型的公平性和稳健性；经过推理微调后的模型是否比在推理时依赖Chain-of-Thought（CoT）提示展现出更高的安全性；针对偏见激发的越狱攻击成功率与所使用的推理机制的关系如何。研究发现推理能力与偏见安全之间存在微妙关系。令人惊讶的是，通过CoT提示或经过精细调整的推理轨迹显示出的明确推理模型通常比没有这些机制的基准模型更容易受到偏见激发的影响，这表明推理可能会无意中打开刻板印象强化的新途径。相较于依赖CoT提示的模型，使用推理功能的模型相对更安全一些，后者特别容易受到通过故事提示、虚构角色或奖励导向指令进行上下文重构攻击的影响。这些结果挑战了推理固有提高稳健性的假设，并强调了需要更多偏见意识的推理设计策略。

关键发现

引入推理能力会影响模型的公平性和稳健性。
通过Chain-of-Thought（CoT）提示或精细调整的推理轨迹显示的明确推理模型更容易受到偏见激发的影响。
与依赖CoT提示的模型相比，使用推理功能的模型相对更安全。
上下文重构攻击（如通过故事提示、虚构角色或奖励导向指令）对模型的稳健性构成威胁。
推理机制可能无意中强化刻板印象。
需要更多的研究来深入了解如何结合推理能力和模型的稳健性。

Cool Papers

点此查看论文截图

From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding

Authors:Xiangfeng Wang, Xiao Li, Yadong Wei, Xueyu Song, Yang Song, Xiaoqiang Xia, Fangrui Zeng, Zaiyi Chen, Liu Liu, Gu Xu, Tong Xu

The rapid growth of online video content, especially on short video platforms, has created a growing demand for efficient video editing techniques that can condense long-form videos into concise and engaging clips. Existing automatic editing methods predominantly rely on textual cues from ASR transcripts and end-to-end segment selection, often neglecting the rich visual context and leading to incoherent outputs. In this paper, we propose a human-inspired automatic video editing framework (HIVE) that leverages multimodal narrative understanding to address these limitations. Our approach incorporates character extraction, dialogue analysis, and narrative summarization through multimodal large language models, enabling a holistic understanding of the video content. To further enhance coherence, we apply scene-level segmentation and decompose the editing process into three subtasks: highlight detection, opening/ending selection, and pruning of irrelevant content. To facilitate research in this area, we introduce DramaAD, a novel benchmark dataset comprising over 800 short drama episodes and 500 professionally edited advertisement clips. Experimental results demonstrate that our framework consistently outperforms existing baselines across both general and advertisement-oriented editing tasks, significantly narrowing the quality gap between automatic and human-edited videos.

在线视频内容的快速增长，特别是在短视频平台上，已经产生了对高效视频编辑技术的日益增长的需求，这些技术能够将长视频浓缩成简洁吸引人的片段。现有的自动编辑方法主要依赖于ASR转录的文本线索和端到端的片段选择，往往忽视了丰富的视觉上下文，导致输出内容缺乏连贯性。在本文中，我们提出了一个受人类启发的自动视频编辑框架（HIVE），该框架利用多模式叙事理解来解决这些限制。我们的方法结合了字符提取、对话分析和叙事摘要，通过多模式大型语言模型，实现对视频内容的整体理解。为了进一步提高连贯性，我们应用场景级分割，并将编辑过程分解为三个子任务：高光检测、开头/结尾选择，以及删除无关内容。为了促进该领域的研究，我们引入了DramaAD，这是一个新的基准数据集，包含超过800个短剧片段和500个专业编辑的广告片段。实验结果表明，我们的框架在通用和面向广告的编辑任务上都优于现有基线，显著缩小了自动编辑视频和人类编辑视频之间的质量差距。

论文及项目相关链接

PDF

Summary
在线视频内容的迅速增长，尤其是短视频平台上的内容，对高效的视频编辑技术提出了更高的要求，需要将长视频浓缩成简洁、引人入胜的片段。现有自动编辑方法主要依赖ASR文本的文本线索和端到端片段选择，忽视了丰富的视觉语境，导致输出内容不连贯。本文提出了一种借鉴人类编辑思路的自动视频编辑框架（HIVE），该框架采用多模态叙事理解来解决这些问题。结合人物提取、对话分析和叙事摘要，通过多模态大型语言模型，实现对视频内容的整体理解。为提高连贯性，应用场景级分割，将编辑过程分解为三个子任务：高光检测、开头/结尾选择和不相关内容的修剪。为推进该领域研究，本文引入了DramaAD数据集，包含超过800个短片戏剧和500个专业编辑的广告片段。实验结果表明，该框架在一般和广告导向的编辑任务上均优于现有基线，大幅缩小了自动编辑和人类编辑视频的质量差距。

Key Takeaways

在线视频内容的增长推动了高效视频编辑技术的需求。
现有自动编辑方法主要依赖文本线索和端到端片段选择，存在视觉语境忽视的问题。
提出的HIVE框架结合多模态叙事理解，解决现有方法的局限性。
HIVE框架结合人物提取、对话分析和叙事摘要，实现视频内容的全面理解。
为提高连贯性，HIVE框架采用场景级分割，分解编辑任务为三个子任务。
引入DramaAD数据集，促进该领域的研究进展。

Cool Papers

点此查看论文截图

Authors:Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Sung-Feng Huang, Chih-Kai Yang, Chee-En Yu, Chun-Wei Chen, Wei-Chih Chen, Chien-yu Huang, Yi-Cheng Lin, Yu-Xiang Lin, Chi-An Fu, Chun-Yi Kuan, Wenze Ren, Xuanjun Chen, Wei-Ping Huang, En-Pei Hu, Tzu-Quan Lin, Yuan-Kuei Wu, Kuan-Po Huang, Hsiao-Ying Huang, Huang-Cheng Chou, Kai-Wei Chang, Cheng-Han Chiang, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee

We introduce DeSTA2.5-Audio, a general-purpose Large Audio Language Model (LALM) designed for robust auditory perception and instruction-following, without requiring task-specific audio instruction-tuning. Recent LALMs typically augment Large Language Models (LLMs) with auditory capabilities by training on large-scale, manually curated or LLM-synthesized audio-instruction datasets. However, these approaches have often suffered from the catastrophic forgetting of the LLM’s original language abilities. To address this, we revisit the data construction pipeline and propose DeSTA, a self-generated cross-modal alignment strategy in which the backbone LLM generates its own training targets. This approach preserves the LLM’s native language proficiency while establishing effective audio-text alignment, thereby enabling zero-shot generalization without task-specific tuning. Using DeSTA, we construct DeSTA-AQA5M, a large-scale, task-agnostic dataset containing 5 million training samples derived from 7,000 hours of audio spanning 50 diverse datasets, including speech, environmental sounds, and music. DeSTA2.5-Audio achieves state-of-the-art or competitive performance across a wide range of audio-language benchmarks, including Dynamic-SUPERB, MMAU, SAKURA, Speech-IFEval, and VoiceBench. Comprehensive comparative studies demonstrate that our self-generated strategy outperforms widely adopted data construction and training strategies in both auditory perception and instruction-following capabilities. Our findings underscore the importance of carefully designed data construction in LALM development and offer practical insights for building robust, general-purpose LALMs.

我们介绍了DeSTA2.5-Audio，这是一款通用的大型音频语言模型（LALM），旨在实现稳健的听觉感知和指令遵循，而无需针对特定任务的音频指令进行调整。最近的LALM通常通过在大规模、手工整理或LLM合成的音频指令数据集上进行训练，来增强大型语言模型（LLM）的听觉能力。然而，这些方法往往会导致LLM的原始语言能力的灾难性遗忘。为了解决这一问题，我们重新审视数据构建流程，提出了DeSTA，这是一种自我生成的跨模态对齐策略，其中骨干LLM生成自己的训练目标。这种方法保留了LLM的母语熟练程度，同时实现了有效的音频文本对齐，从而实现了零样本泛化，无需特定任务调整。使用DeSTA，我们构建了DeSTA-AQA5M，这是一个大规模、任务通用的数据集，包含500万训练样本，这些样本来源于7000小时音频，跨越50个不同数据集，包括语音、环境声音和音乐。DeSTA2.5-Audio在广泛的音频语言基准测试上达到了最新水平或具有竞争力，包括Dynamic-SUPERB、MMAU、SAKURA、Speech-IFEval和VoiceBench。综合对比研究表明，我们的自我生成策略在听觉感知和指令遵循能力方面优于广泛采用的数据构建和培训策略。我们的研究强调了精心设计的数据构建在LALM发展中的重要性，并为构建稳健、通用的LALM提供了实践见解。

论文及项目相关链接

PDF Model and code available at: https://github.com/kehanlu/DeSTA2.5-Audio

摘要

DeSTA2.5-Audio是一种通用的大型音频语言模型（LALM），旨在实现稳健的听觉感知和指令遵循，而无需针对特定任务的音频指令进行调整。该模型通过重新审视数据构建管道并提出DeSTA（一种自我生成的跨模态对齐策略），其中骨干LLM生成自己的训练目标，从而解决了LLM原有语言能力的灾难性遗忘问题。使用DeSTA构建的大型、任务无关数据集DeSTA-AQA5M，包含从7000小时音频中派生的500万训练样本，跨越50个不同的数据集，包括语音、环境声音和音乐。DeSTA2.5-Audio在多种音频语言基准测试上达到或具有竞争力，包括Dynamic-SUPERB、MMAU、SAKURA、Speech-IFEval和VoiceBench。综合对比研究表明，自我生成策略在听觉感知和指令遵循能力方面优于广泛采用的数据构建和培训策略。我们的研究强调了精心设计的数据构建在LALM发展中的重要性，并为构建稳健、通用的LALM提供了实际见解。

关键见解

DeSTA2.5-Audio是一种通用的大型音频语言模型（LALM），能进行稳健的听觉感知和指令遵循。
该模型通过自我生成的跨模态对齐策略（DeSTA）解决LLM原有语言能力的遗忘问题。
DeSTA-AQA5M数据集用于训练DeSTA2.5-Audio模型，包含从7000小时音频派生的500万训练样本。
DeSTA2.5-Audio在多个音频语言基准测试中表现优异。
综合对比研究表明，自我生成策略在听觉感知和指令遵循方面优于其他策略。
研究强调了精心设计的数据构建在LALM发展中的重要性。

Cool Papers

点此查看论文截图

Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks

Authors:Sizhe Chen, Arman Zharmagambetov, David Wagner, Chuan Guo

Prompt injection attacks pose a significant security threat to LLM-integrated applications. Model-level defenses have shown strong effectiveness, but are currently deployed into commercial-grade models in a closed-source manner. We believe open-source models are needed by the AI security community, where co-development of attacks and defenses through open research drives scientific progress in mitigation against prompt injection attacks. To this end, we develop Meta SecAlign, the first open-source and open-weight LLM with built-in model-level defense that achieves commercial-grade model performance. We provide complete details of our training recipe, which utilizes an improved version of the SOTA SecAlign defense. Evaluations on 9 utility benchmarks and 7 security benchmarks show that Meta SecAlign, despite being trained on a generic instruction-tuning dataset, confers security in unseen downstream tasks, including tool-calling and agentic web navigation, in addition general instruction-following. Our best model – Meta-SecAlign-70B – achieves state-of-the-art robustness against prompt injection attacks and comparable utility to closed-source commercial LLM with model-level defense.

指令注入攻击对集成大型语言模型（LLM）的应用构成重大安全威胁。模型层面的防御已经显示出强大的有效性，但目前是以闭源的方式部署在商业级模型中。我们相信AI安全社区需要开源模型，通过开放研究共同开发攻击和防御，推动缓解指令注入攻击的科研进展。为此，我们开发了Meta SecAlign，这是一个首个开源且公开权重的具有内置模型级防御的大型语言模型，实现了商业级模型性能。我们提供了训练食谱的完整细节，利用改进的版本SOTA SecAlign防御策略。在9个实用基准测试和7个安全基准测试上的评估表明，Meta SecAlign即使在通用指令调整数据集上进行训练，也能在未见过的下游任务中提供安全性，包括工具调用和代理网页导航，以及一般的指令遵循。我们最好的模型——Meta-SecAlign-70B——实现了针对指令注入攻击的业界最前沿稳健性，与具有模型级防御的闭源商业大型语言模型的实用性相当。

论文及项目相关链接

PDF

Summary

模型级别的防御对于LLM集成应用的安全至关重要。目前，模型级别的防御措施以闭源方式部署在商业级模型中，阻碍了AI安全社区的发展。因此，我们开发了一个名为Meta SecAlign的开源、公开权重LLM模型，该模型具有内置的模型级别防御功能，并实现了商业级模型性能。我们的训练配方经过改进，采用了最先进的SecAlign防御技术。评估表明，Meta SecAlign在未见过的下游任务中具有安全性，包括工具调用和自主网页导航，以及一般的指令遵循。最好的模型Meta-SecAlign-70B在抵御提示注入攻击方面达到了最新水平，并且在具有模型级别防御的闭源商业LLM中具有相当的实用性。

Key Takeaways

模型级别的防御对LLM集成应用的安全至关重要。
目前大多数模型级别的防御措施是闭源的，阻碍了AI安全社区的开放研究和发展。
Meta SecAlign是一个具有内置模型级别防御功能的开源LLM模型，具有商业级模型性能。
Meta SecAlign采用了最先进的SecAlign防御技术的改进版训练配方。
Meta SecAlign在未见过的下游任务中具有安全性，包括工具调用和自主网页导航等。
Meta-SecAlign-70B模型在抵御提示注入攻击方面达到了最新水平。

Cool Papers

点此查看论文截图

Bourbaki: Self-Generated and Goal-Conditioned MDPs for Theorem Proving

Authors:Matthieu Zimmer, Xiaotong Ji, Rasul Tutunov, Anthony Bordg, Jun Wang, Haitham Bou Ammar

Reasoning remains a challenging task for large language models (LLMs), especially within the logically constrained environment of automated theorem proving (ATP), due to sparse rewards and the vast scale of proofs. These challenges are amplified in benchmarks like PutnamBench, which contains university-level problems requiring complex, multi-step reasoning. To address this, we introduce self-generated goal-conditioned MDPs (sG-MDPs), a new framework in which agents generate and pursue their subgoals based on the evolving proof state. Given this more structured generation of goals, the resulting problem becomes more amenable to search. We then apply Monte Carlo Tree Search (MCTS)-like algorithms to solve the sG-MDP, instantiating our approach in Bourbaki (7B), a modular system that can ensemble multiple 7B LLMs for subgoal generation and tactic synthesis. On PutnamBench, Bourbaki (7B) solves 26 problems, achieving new state-of-the-art results with models at this scale.

推理对于大型语言模型（LLM）来说仍然是一项具有挑战性的任务，特别是在逻辑约束环境下的自动定理证明（ATP），这是由于奖励稀疏和证明的大规模性。在像PutnamBench这样的基准测试中，这些挑战更加放大，其中包含需要复杂多步骤推理的大学级别问题。为了解决这一问题，我们引入了自生成目标条件MDP（sG-MDP），这是一种新框架，其中代理根据不断变化的证明状态生成和追求其子目标。鉴于这种更有结构的生成目标，产生的问题变得更适合搜索。然后，我们应用蒙特卡洛树搜索（MCTS）等算法来解决sG-MDP问题，在我们的方法中应用Bourbaki（7B），这是一个模块化系统，可以用于子目标生成和战术合成，并能集成多个7B LLM。在PutnamBench上，Bourbaki（7B）解决了26个问题，在如此规模的模型中取得了最新的最先进的成果。

论文及项目相关链接

PDF

Summary

大型语言模型（LLM）在逻辑推理方面存在挑战，尤其在自动化定理证明（ATP）的逻辑约束环境中，因为证明的奖励稀疏且规模庞大。针对此类挑战，在像PutnamBench这样的基准测试中尤其突出，我们引入了自我生成的目标条件马尔可夫决策过程（sG-MDP），这是一个新的框架，代理可以根据不断变化的证明状态生成和追求其子目标。这使得问题更适合搜索解决。然后，我们应用蒙特卡洛树搜索（MCTS）等算法来解决sG-MDP问题，在Bourbaki（7B）系统中实现了这一方法，这是一个模块化系统，可以用于生成子目标和战术合成。在PutnamBench上，Bourbaki（7B）解决了26个问题，实现了此规模模型的新世界纪录。

Key Takeaways

大型语言模型（LLM）在自动化定理证明（ATP）中面临挑战，主要是由于证明的奖励稀疏和规模庞大。
sG-MDP框架被引入，允许代理根据证明状态的演变生成和追求子目标。
sG-MDP使得问题更易于搜索解决。
采用了类似于蒙特卡洛树搜索（MCTS）的算法来解决sG-MDP问题。
Bourbaki（7B）系统实现了这一方法，这是一个模块化系统，用于生成子目标和战术合成。
Bourbaki（7B）在PutnamBench基准测试中解决了26个问题。

Cool Papers

点此查看论文截图

AIGI-Holmes: Towards Explainable and Generalizable AI-Generated Image Detection via Multimodal Large Language Models

Authors:Ziyin Zhou, Yunpeng Luo, Yuanchen Wu, Ke Sun, Jiayi Ji, Ke Yan, Shouhong Ding, Xiaoshuai Sun, Yunsheng Wu, Rongrong Ji

The rapid development of AI-generated content (AIGC) technology has led to the misuse of highly realistic AI-generated images (AIGI) in spreading misinformation, posing a threat to public information security. Although existing AIGI detection techniques are generally effective, they face two issues: 1) a lack of human-verifiable explanations, and 2) a lack of generalization in the latest generation technology. To address these issues, we introduce a large-scale and comprehensive dataset, Holmes-Set, which includes the Holmes-SFTSet, an instruction-tuning dataset with explanations on whether images are AI-generated, and the Holmes-DPOSet, a human-aligned preference dataset. Our work introduces an efficient data annotation method called the Multi-Expert Jury, enhancing data generation through structured MLLM explanations and quality control via cross-model evaluation, expert defect filtering, and human preference modification. In addition, we propose Holmes Pipeline, a meticulously designed three-stage training framework comprising visual expert pre-training, supervised fine-tuning, and direct preference optimization. Holmes Pipeline adapts multimodal large language models (MLLMs) for AIGI detection while generating human-verifiable and human-aligned explanations, ultimately yielding our model AIGI-Holmes. During the inference stage, we introduce a collaborative decoding strategy that integrates the model perception of the visual expert with the semantic reasoning of MLLMs, further enhancing the generalization capabilities. Extensive experiments on three benchmarks validate the effectiveness of our AIGI-Holmes.

人工智能生成内容（AIGC）技术的快速发展导致高度逼真的AI生成图像（AIGI）被滥用，传播错误信息，对公众信息安全构成威胁。尽管现有的AIGI检测技术通常有效，但它们面临两个问题：1）缺乏可验证的人为解释；2）最新技术中缺乏通用性。为了解决这些问题，我们引入了一个大规模且综合的数据集Holmes-Set，其中包括Holmes-SFTSet，这是一个带有解释指令的数据集，解释图像是否是AI生成的，以及Holmes-DPOSet，这是一个与人类对齐的偏好数据集。我们的工作引入了一种高效的数据注释方法，称为多专家陪审团，通过结构化的MLLM解释和质量控制（包括跨模型评估、专家缺陷过滤和人类偏好修改）增强数据生成。此外，我们提出了精心设计的三阶段训练框架Holmes Pipeline，包括视觉专家预训练、监督微调以及直接偏好优化。Holmes Pipeline使多模态大型语言模型（MLLMs）适应AIGI检测，同时生成可验证的、与人类对齐的解释，最终产生我们的模型AIGI-Holmes。在推理阶段，我们引入了一种协同解码策略，将视觉专家的模型感知与MLLMs的语义推理相结合，进一步增强了模型的通用化能力。在三个基准上的大量实验验证了我们的AIGI-Holmes的有效性。

论文及项目相关链接

PDF Accepted to ICCV 2025

Summary
人工智能生成内容（AIGC）技术的快速发展导致人工智能生成图像（AIGI）被恶意用于传播虚假信息，威胁公众信息安全。为解决现有AIGI检测技术在人类可验证解释和最新技术通用性方面的问题，我们引入了大规模综合数据集Holmes-Set，包括带有图像是否AI生成解释的指令调整数据集Holmes-SFTSet和人类对齐偏好数据集Holmes-DPOSet。通过提出有效的数据注释方法——多专家陪审团，我们增强了数据的生成，通过结构化MLLM解释和质量控制（包括跨模型评估、专家缺陷过滤和人类偏好修改），我们推进了数据生成。此外，我们精心设计了Holmes管道，这是一个包含视觉专家预训练、监督微调以及直接偏好优化的三阶段训练框架。我们的模型AIGI-Holmes利用该管道适应了多模式大型语言模型（MLLMs）用于AIGI检测，生成了人类可验证且与人类对齐的解释。在推理阶段，我们引入了协作解码策略，将模型的视觉专家感知与MLLMs的语义推理相结合，增强了通用性。在三个基准上的广泛实验验证了AIGI-Holmes的有效性。

Key Takeaways

AI-generated content (AIGC)技术的发展导致AI生成的图像（AIGI）被用于传播误导性信息，带来公共安全风险。
现有AIGI检测技术缺乏人类可验证的解释和最新技术的通用性。
引入大规模综合数据集Holmes-Set以解决这些问题，包括指令调整数据集和人类对齐偏好数据集。
提出多专家陪审团数据注释方法，增强数据生成和质量控制。
精心设计的Holmes管道包含视觉专家预训练、监督微调和直接偏好优化等阶段。
模型AIGI-Holmes利用Holmes管道适应了多模式大型语言模型（MLLMs）用于AIGI检测，可以生成人类可验证且与人类对齐的解释。

Cool Papers

点此查看论文截图

Scaling LLM Planning: NL2FLOW for Parametric Problem Generation and Rigorous Evaluation

Authors:Jungkoo Kang

Progress in enhancing large language model (LLM) planning and reasoning capabilities is significantly hampered by the bottleneck of scalable, reliable data generation and evaluation. To overcome this, I introduce NL2FLOW, a fully automated system for parametrically generating planning problems - expressed in natural language, a structured intermediate representation, and formal PDDL - and rigorously evaluating the quality of generated plans. I demonstrate NL2FLOW’s capabilities by generating a dataset of 2296 problems in the automated workflow generation domain and evaluating multiple open-sourced, instruct-tuned LLMs. My results reveal that the highest performing models achieved 86% success in generating valid plans and 69% in generating optimal plans, specifically for problems with feasible solutions. Regression analysis shows that the influence of problem characteristics on plan generation is contingent on both model and prompt design. Notably, I observed that the highest success rate for translating natural language into a JSON representation of a plan was lower than the highest rate of generating a valid plan directly. This suggests that unnecessarily decomposing the reasoning task - introducing intermediate translation steps - may actually degrade performance, implying a benefit to models capable of reasoning directly from natural language to action. As I scale LLM reasoning to increasingly complex problems, the bottlenecks and sources of error within these systems will inevitably shift. Therefore, a dynamic understanding of these limitations - and the tools to systematically reveal them - will be crucial for unlocking the full potential of LLMs as intelligent problem solvers.

在提升大型语言模型（LLM）规划和推理能力方面，可扩展且可靠的数据生成与评估的瓶颈对其造成了重大阻碍。为了克服这一难题，我引入了NL2FLOW，这是一个参数化生成规划问题的全自动系统，能够以自然语言、结构化中间表示和正式PDDL表达问题，并严格评估生成计划的质量。我通过生成包含2296个问题的数据集，在自动化工作流程生成领域展示了NL2FLOW的能力，并评估了多个开源的、经过指令调整的大型语言模型。我的结果显示，表现最佳的模型在生成有效计划方面达到了86%的成功率，在生成最优计划方面达到了69%，这主要是针对有可行解决方案的问题。回归分析表明，问题特性对计划生成的影响取决于模型和提示设计。值得注意的是，我观察到将自然语言转化为计划JSON表示形式的最高成功率低于直接生成有效计划的最高成功率。这表明，不必要地分解推理任务（引入中间翻译步骤）可能会降低性能，暗示那些能够直接从自然语言进行推理的模型具有优势。随着我将大型语言模型的推理能力扩展到越来越复杂的问题，这些系统中的瓶颈和错误来源也必将发生变化。因此，对这些限制的动态理解以及揭示它们的工具，对于解锁大型语言模型作为智能问题求解器的潜力至关重要。

论文及项目相关链接

PDF 20 pages, 7 figures

Summary

本文介绍了NL2FLOW系统，该系统旨在通过参数化生成规划问题，克服大型语言模型（LLM）在规划和推理能力方面的提升瓶颈。作者通过生成包含自动化工作流程生成域的数据集并对多个开源的、经过指令调优的LLM进行评估，展示了NL2FLOW的能力。研究结果表明，最优模型在生成有效计划和最优计划方面的成功率分别为86%和69%，且回归分析显示问题特性对计划生成的影响取决于模型和提示设计。研究还发现，从自然语言直接进行推理的能力可能优于需要经过中间翻译步骤的方法。随着LLM推理应用于越来越复杂的问题，对其局限性的动态理解和揭示这些局限性的工具将是解锁LLM作为智能问题求解器潜力的重要关键。

Key Takeaways

NL2FLOW系统能够自动化生成规划问题，并以自然语言和正式PDDL表示表达，为LLM规划和推理能力的提升提供了解决方案。
通过生成包含自动化工作流程生成域的数据集，对LLM进行评估，展示了NL2FLOW的实际效果。
最优模型在生成有效计划和最优计划方面的成功率分别为86%和69%，表明LLM在规划任务中的性能表现。
回归分析揭示了问题特性、模型和提示设计对计划生成的影响。
研究发现直接从自然语言进行推理的能力可能优于需要经过中间翻译步骤的方法，这暗示了未来模型开发的方向。
随着问题的复杂性增加，了解LLM的局限性以及揭示这些局限性的工具将变得至关重要。

Cool Papers

点此查看论文截图

Latent Chain-of-Thought? Decoding the Depth-Recurrent Transformer

Authors:Wenquan Lu, Yuechuan Yang, Kyle Lee, Yanshu Li, Enqi Liu

Chain-of-thought (CoT) reasoning has enabled transformer-based language models to excel at complex mathematics and multi-step planning. However, in standard decoder-only architectures, these reasoning steps are externalized in natural language, improving interpretability at the cost of efficiency. To capture reasoning that is not easily represented in words, many works have explored recurrent architectures that aim to internalize reasoning in latent space, potentially supporting latent CoT. In this paper, we investigate whether such reasoning structures emerge in Huginn-3.5B, a depth-recurrent Transformer that reuses layers at inference time without increasing parameter count. We examine the model’s internal behavior on arithmetic tasks using a suite of probing techniques including the Logit Lens and Coda Lens. Our findings reveal limited evidence of interpretable latent CoT by tracking rank trajectories of final and intermediate result tokens. Furthermore, we uncover significant probing inconsistencies across recurrent blocks, where the interpretability of hidden states depends heavily on both the layer index and the decoding method. Finally, we empirically show that increasing recurrence depth yields only marginal gains and falls well short of models that explicitly externalize reasoning steps. The code is available at https://github.com/wenquanlu/huginn-latent-cot.

链式思维（CoT）推理使基于转换器的语言模型在复杂数学和多步规划方面表现出色。然而，在标准的仅解码器架构中，这些推理步骤以自然语言形式外在化，以提高可解释性，但牺牲了效率。为了捕捉不容易用文字表示的理由，许多工作已经探索了循环架构，旨在将推理内在化于潜在空间，可能支持潜在链式思维（CoT）。在本文中，我们调查了这种推理结构是否出现在Huginn-3.5B这一深度循环转换器中，该转换器在推理时间重新使用层而不会增加参数计数。我们使用包括Logit Lens和Coda Lens等一系列探测技术对模型在算术任务上的内部行为进行检查。我们的研究结果显示有限的证据表明存在可解释的潜在链式思维（CoT），通过跟踪最终结果令牌和中间结果令牌的排名轨迹来体现。此外，我们发现不同循环块之间存在显著的探测不一致性，隐藏状态的解释性在很大程度上取决于层索引和解码方法。最后，我们从实证上证明，增加循环深度只会带来微小的收益，远远落后于那些明确外在化推理步骤的模型。代码可在https://github.com/wenquanlu/huginn-latent-cot找到。

论文及项目相关链接

PDF

Summary

本文探讨了Huginn-3.5B深度循环Transformer模型在推理任务中的内部行为。通过一系列探测技术，研究发现在该模型中有限的证据表明其具备可解释的潜在链式思维（latent CoT）。然而，不同循环块间的探测结果存在不一致性，隐藏状态的解释性取决于层索引和解码方法。此外，增加循环深度带来的收益甚微，仍远低于显式外部化推理步骤的模型。

Key Takeaways

Huginn-3.5B模型结合了链式思维（CoT）推理，以深度循环Transformer架构执行复杂任务和多步骤规划。
该模型使用探测技术来研究其在算术任务中的内部行为。
研究发现有限证据表明Huginn-3.5B具备可解释的潜在链式思维（latent CoT）。
在不同循环块中，隐藏状态的解释性存在不一致性。
增加模型的循环深度带来的性能提升有限。
与显式外部化推理步骤的模型相比，Huginn-3.5B的表现仍有差距。

Cool Papers

点此查看论文截图

FinAI-BERT: A Transformer-Based Model for Sentence-Level Detection of AI Disclosures in Financial Reports

Authors:Muhammad Bilal Zafar

The proliferation of artificial intelligence (AI) in financial services has prompted growing demand for tools that can systematically detect AI-related disclosures in corporate filings. While prior approaches often rely on keyword expansion or document-level classification, they fall short in granularity, interpretability, and robustness. This study introduces FinAI-BERT, a domain-adapted transformer-based language model designed to classify AI-related content at the sentence level within financial texts. The model was fine-tuned on a manually curated and balanced dataset of 1,586 sentences drawn from 669 annual reports of U.S. banks (2015 to 2023). FinAI-BERT achieved near-perfect classification performance (accuracy of 99.37 percent, F1 score of 0.993), outperforming traditional baselines such as Logistic Regression, Naive Bayes, Random Forest, and XGBoost. Interpretability was ensured through SHAP-based token attribution, while bias analysis and robustness checks confirmed the model’s stability across sentence lengths, adversarial inputs, and temporal samples. Theoretically, the study advances financial NLP by operationalizing fine-grained, theme-specific classification using transformer architectures. Practically, it offers a scalable, transparent solution for analysts, regulators, and scholars seeking to monitor the diffusion and framing of AI across financial institutions.

人工智能（AI）在金融服务的普及引发了对企业文件系统中AI相关披露的系统检测工具的需求增长。尽管先前的方法经常依赖于关键词扩展或文档级别的分类，但在粒度、可解释性和稳健性方面存在不足。本研究引入了FinAI-BERT，这是一个基于域的转换器语言模型，旨在在美国银行（2015年至2023年）的年度报告中抽取的669份报告的文本数据基础上对金融文本中的句子级别进行AI相关内容的分类。该模型经过精细调整，采用人工编辑且平衡的包含一万五千八十六句的数据集。FinAI-BERT实现了近乎完美的分类性能（准确率为百分之九十九点三七，F1分数为百分之九十九点三），超过了逻辑回归、朴素贝叶斯分类器、随机森林和XGBoost等传统基线模型的性能。通过基于SHAP的令牌归属确保了可解释性，而偏见分析和稳健性检查则证实了该模型在句子长度、对抗性输入和时间样本方面的稳定性。在理论上，该研究通过利用转换器架构进行精细粒度的主题特定分类，推动了金融NLP的发展。在实践中，它为分析师、监管机构和学者提供了一种可伸缩的透明解决方案，用于监控金融机构中人工智能的扩散和框架构建。

论文及项目相关链接

PDF The FinAI-BERT model can be directly loaded via Hugging Face Transformers (https://huggingface.co/bilalzafar/FinAI-BERT) for sentence-level AI disclosure classification

Summary：

随着人工智能在金融服务的普及，对企业年报中AI相关披露内容的系统检测工具的需求日益增长。本研究引入了FinAI-BERT模型，该模型是基于域适应的转换器语言模型，可在金融文本中的句子级别对AI相关内容进行分类。该模型在手动编纂和平衡的1586个句子数据集上进行微调，这些句子来自2015年至2023年美国银行的669份年报。FinAI-BERT分类性能优异，准确率为99.37%，F1分数为0.993，优于逻辑回归、朴素贝叶斯、随机森林和XGBoost等传统基线模型。同时，通过SHAP基于标记的归因确保了模型的可解释性，并通过偏差分析和稳健性检查验证了模型在不同句子长度、对抗性输入和时间样本上的稳定性。研究不仅在金融NLP领域推动了以转换器架构进行精细粒度、主题特定的分类方法的发展，而且为分析师、监管机构和学者提供了一个可规模化、透明的解决方案，以监测金融机构中AI的扩散和框架。

Key Takeaways：

FinAI-BERT是一个基于转换器语言模型的金融文本中AI相关内容分类工具。
模型在手动编纂和平衡的句子数据集上进行微调，表现优异。
FinAI-BERT分类性能优于多种传统机器学习模型。
模型具有高度的可解释性，通过SHAP标记归因进行验证。
模型稳定性经过偏差分析和多种场景的稳健性检查。
研究推动了金融NLP领域中精细粒度主题特定分类方法的发展。

Cool Papers

点此查看论文截图

Large Language Model-Driven Closed-Loop UAV Operation with Semantic Observations

Authors:Wenhao Wang, Yanyan Li, Long Jiao, Jiawei Yuan

Recent advances in large Language Models (LLMs) have revolutionized mobile robots, including unmanned aerial vehicles (UAVs), enabling their intelligent operation within Internet of Things (IoT) ecosystems. However, LLMs still face challenges from logical reasoning and complex decision-making, leading to concerns about the reliability of LLM-driven UAV operations in IoT applications. In this paper, we propose a LLM-driven closed-loop control framework that enables reliable UAV operations powered by effective feedback and refinement using two LLM modules, i.e., a Code Generator and an Evaluator. Our framework transforms numerical state observations from UAV operations into natural language trajectory descriptions to enhance the evaluator LLM’s understanding of UAV dynamics for precise feedback generation. Our framework also enables a simulation-based refinement process, and hence eliminates the risks to physical UAVs caused by incorrect code execution during the refinement. Extensive experiments on UAV control tasks with different complexities are conducted. The experimental results show that our framework can achieve reliable UAV operations using LLMs, which significantly outperforms baseline approaches in terms of success rate and completeness with the increase of task complexity.

最近的大型语言模型（LLM）的进步已经彻底改变了移动机器人，包括无人机（UAVs），使其能够在物联网（IoT）生态系统内进行智能操作。然而，LLM仍面临逻辑推理和复杂决策的挑战，这引发了人们对LLM驱动的无人机在IoT应用程序中运行的可靠性的担忧。在本文中，我们提出了一种LLM驱动的闭环控制框架，该框架通过两个LLM模块即代码生成器和评估器的有效反馈和改进，使无人机能够可靠运行。我们的框架将无人机操作中的数值状态观察转化为自然语言轨迹描述，以提高评估LLM对无人机动力学的理解，从而生成精确的反馈。我们的框架还支持基于模拟的改进过程，因此消除了改进过程中由于代码执行不正确而对实体无人机造成的风险。在具有不同复杂性的无人机控制任务上进行了大量实验。实验结果表明，我们的框架能够利用LLM实现可靠的无人机操作，在任务复杂性增加的情况下，相较于基准方法，我们的框架在成功率和完整性方面表现出显著的优势。

论文及项目相关链接

PDF 9 pages, 7 figures

Summary

大型语言模型（LLM）在移动机器人领域的应用已取得突破性进展，尤其是在无人机（UAV）方面，使无人机在物联网（IoT）生态系统中的智能操作成为可能。然而，LLM在逻辑推理和复杂决策制定方面仍存在挑战，这引发了人们对LLM驱动的无人机在IoT应用中可靠性的担忧。本文提出了一种LLM驱动的闭环控制框架，通过有效的反馈和改进，使用两个LLM模块（即代码生成器和评估器）实现可靠的无人机操作。该框架将无人机的数值状态观测转化为自然语言轨迹描述，增强了评估LLM对无人机动力学的理解，以产生精确的反馈。该框架还启用了基于模拟的改进过程，从而消除了改进过程中因代码执行错误而对实体无人机造成的风险。通过对不同复杂度的无人机控制任务进行大量实验，实验结果表明，该框架可实现可靠的无人机LLM操作，在成功率和完整性方面显著优于基准方法，并随着任务复杂性的增加而表现出优势。

Key Takeaways

LLM在移动机器人领域，特别是无人机操作中实现了重大进展，推动了物联网生态系统中的智能操作。
LLM在逻辑推理和复杂决策制定方面仍存在挑战，影响了无人机操作的可靠性。
提出了一种LLM驱动的闭环控制框架，通过两个模块——代码生成器和评估器实现可靠无人机操作。
框架将无人机的数值状态转化为自然语言轨迹描述，增强了评估器对无人机动力学的理解。
框架支持基于模拟的改进过程，减少了对实体无人机的风险。
实验证明该框架在复杂任务中实现了可靠的无人机操作，并在成功率和完整性方面优于传统方法。

Cool Papers

点此查看论文截图

Symbolic or Numerical? Understanding Physics Problem Solving in Reasoning LLMs

Authors:Nifu Dan, Yujun Cai, Yiwei Wang

Navigating the complexities of physics reasoning has long been a difficult task for Large Language Models (LLMs), requiring a synthesis of profound conceptual understanding and adept problem-solving techniques. In this study, we investigate the application of advanced instruction-tuned reasoning models, such as Deepseek-R1, to address a diverse spectrum of physics problems curated from the challenging SciBench benchmark. Our comprehensive experimental evaluation reveals the remarkable capabilities of reasoning models. Not only do they achieve state-of-the-art accuracy in answering intricate physics questions, but they also generate distinctive reasoning patterns that emphasize on symbolic derivation. Furthermore, our findings indicate that even for these highly sophisticated reasoning models, the strategic incorporation of few-shot prompting can still yield measurable improvements in overall accuracy, highlighting the potential for continued performance gains.

对于大型语言模型（LLM）来说，长期以来，驾驭复杂的物理推理一直是一项艰巨的任务，需要深刻的概念理解和熟练的问题解决技术的结合。在这项研究中，我们调查了高级指令调整推理模型（如Deepseek-R1）在应对来自具有挑战性的SciBench基准的一系列多样化物理问题时的应用情况。我们的全面实验评估表明推理模型的卓越能力。这些模型不仅能够在回答复杂的物理问题时达到最新的精确度，而且会产生强调符号推导的独特推理模式。此外，我们的研究还发现，即使是这些高度复杂的推理模型，通过策略性地采用少量提示仍然能够提高整体精度，这突显了未来性能增长的可能性。

论文及项目相关链接

PDF

Summary：本研究探讨了先进的指令调整推理模型，如Deepseek-R1，在解决多样化物理问题中的应用。实验评估表明，这些模型不仅达到回答复杂物理问题的最先进水平，还能产生独特的强调符号推导的推理模式。研究发现，即使在高度先进的推理模型中，策略性使用小样本提示仍然可以进一步提高总体准确性，展示了性能改进的潜力。

Key Takeaways：

先进的指令调整推理模型，如Deepseek-R1，在解决物理问题方面表现出显著的能力。
这些模型能够在回答复杂物理问题时达到最先进水平。
推理模型能够产生独特的强调符号推导的推理模式。
在高度先进的推理模型中，使用小样本提示可以提高总体准确性。
这种策略性使用小样本提示的方法展示了性能改进的潜力。
实验中使用了SciBench基准测试中的多样化物理问题来评估模型性能。

Cool Papers

点此查看论文截图

GPAS: Accelerating Convergence of LLM Pretraining via Gradient-Preserving Activation Scaling

Authors:Tianhao Chen, Xin Xu, Zijing Liu, Pengxiang Li, Xinyuan Song, Ajay Kumar Jaiswal, Fan Zhang, Jishan Hu, Yang Wang, Hao Chen, Shizhe Diao, Shiwei Liu, Yu Li, Lu Yin, Can Yang

Modern Large Language Models, such as the LLaMA, Qwen and DeepSeek series, predominantly adopt the Pre-LayerNorm (Pre-LN) Transformer architecture. While being stable during pretraining and scalable to large model sizes, Pre-LN suffers from an exponential growth in activation variance across layers, causing the shortcut to dominate over sub-layer outputs in the residual connection and limiting the learning capacity of deeper layers. To mitigate this issue, we propose Gradient-Preserving Activation Scaling (GPAS), a simple technique that can be used in combination with existing approaches. GPAS works by scaling down the intermediate activations while keeping their gradients unchanged. This leaves information in the activations intact, and avoids the gradient vanishing problem associated with gradient downscaling. Extensive experiments across various model sizes from 71M to 1B show that GPAS achieves consistent performance gains. Beyond enhancing Pre-LN Transformers, GPAS also shows promise in improving alternative architectures such as Sandwich-LN and DeepNorm, demonstrating its versatility and potential for improving training dynamics in a wide range of settings. Our code is available at https://github.com/dandingsky/GPAS.

现代的大型语言模型，如LLaMA、Qwen和DeepSeek系列，主要采用了Pre-LayerNorm（Pre-LN）Transformer架构。虽然在预训练过程中表现稳定，且能够扩展到大型模型，但Pre-LN在层间激活方差上呈现指数增长，导致残差连接中的快捷方式主导子层输出，并限制深层层的学习能力。为了缓解这一问题，我们提出了梯度保持激活缩放（GPAS）技术，它可以与现有方法相结合使用。GPAS通过缩小中间激活值同时保持其梯度不变来发挥作用。这保留了激活值中的信息，避免了与梯度缩小相关的梯度消失问题。在71M到1B的各种模型大小上的大量实验表明，GPAS实现了性能的稳定提升。除了增强Pre-LN Transformer的性能外，GPAS在改进替代架构（如Sandwich-LN和DeepNorm）方面也显示出希望，证明了其在改进各种设置中的训练动力学的通用性和潜力。我们的代码位于https://github.com/dandingsky/GPAS。

论文及项目相关链接

PDF

Summary
现代大型语言模型多采用Pre-LayerNorm（Pre-LN）Transformer架构，它在预训练和大规模模型应用方面表现稳定。然而，Pre-LN面临激活方差随层数呈指数增长的挑战，导致残差连接中的快捷方式主导子层输出，限制了深层的学习能力。为缓解这一问题，本文提出一种名为Gradient-Preserving Activation Scaling（GPAS）的技术。它通过减小中间激活值的同时保持梯度不变，既保留了激活信息，又避免了梯度下降导致的梯度消失问题。实验证明，GPAS在不同规模的模型中都能实现性能的提升，并有望改进其他架构如Sandwich-LN和DeepNorm。相关代码已发布在https://github.com/dandingsky/GPAS上。

Key Takeaways

现代大型语言模型倾向于采用Pre-LayerNorm（Pre-LN）Transformer架构。
Pre-LN在预训练和大规模应用方面表现稳定，但存在激活方差随层数增长的问题。
GPAS技术能有效缓解这一问题，通过减小中间激活值同时保持梯度不变，避免梯度消失问题。
GPAS在不同规模的模型中都能实现性能提升。
GPAS不仅适用于Pre-LN架构，对其他模型架构如Sandwich-LN和DeepNorm也有改进潜力。
该技术已经公开发布在GitHub上供研究使用。

Cool Papers

点此查看论文截图

From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents

Authors:Weizhi Zhang, Yangning Li, Yuanchen Bei, Junyu Luo, Guancheng Wan, Liangwei Yang, Chenxuan Xie, Yuyao Yang, Wei-Chieh Huang, Chunyu Miao, Henry Peng Zou, Xiao Luo, Yusheng Zhao, Yankai Chen, Chunkit Chan, Peilin Zhou, Xinyang Zhang, Chenwei Zhang, Jingbo Shang, Ming Zhang, Yangqiu Song, Irwin King, Philip S. Yu

Information retrieval is a cornerstone of modern knowledge acquisition, enabling billions of queries each day across diverse domains. However, traditional keyword-based search engines are increasingly inadequate for handling complex, multi-step information needs. Our position is that Large Language Models (LLMs), endowed with reasoning and agentic capabilities, are ushering in a new paradigm termed Agentic Deep Research. These systems transcend conventional information search techniques by tightly integrating autonomous reasoning, iterative retrieval, and information synthesis into a dynamic feedback loop. We trace the evolution from static web search to interactive, agent-based systems that plan, explore, and learn. We also introduce a test-time scaling law to formalize the impact of computational depth on reasoning and search. Supported by benchmark results and the rise of open-source implementations, we demonstrate that Agentic Deep Research not only significantly outperforms existing approaches, but is also poised to become the dominant paradigm for future information seeking. All the related resources, including industry products, research papers, benchmark datasets, and open-source implementations, are collected for the community in https://github.com/DavidZWZ/Awesome-Deep-Research.

信息检索是现代知识获取的核心基石，每天可在多个领域处理数十亿次查询。然而，传统的基于关键词的搜索引擎越来越难以满足复杂的、多步骤的信息需求。我们的观点是，大型语言模型（LLM）具备推理和智能能力，正在催生一种名为智能深度研究的新范式。这些系统通过紧密集成自主推理、迭代检索和信息合成到一个动态反馈循环中，超越了传统信息搜索技术。我们追溯了从静态网页搜索到交互式、基于代理的系统的演变，这些系统可以计划、探索和学习的历程。我们还介绍了一种测试时缩放定律，以正式计算深度对推理和搜索的影响。在基准测试结果和开源实现兴起的支持下，我们证明智能深度研究不仅显著优于现有方法，而且还将成为未来信息搜索的主导范式。所有相关资源，包括工业产品、研究论文、基准数据集和开源实现，都收集在了https://github.com/DavidZWZ/Awesome-Deep-Research，以供社区使用。

论文及项目相关链接

PDF

Summary
在信息检索领域，传统基于关键词的搜索引擎在处理复杂、多步骤的信息需求时日益显得不足。大型语言模型（LLMs）通过结合自主推理、迭代检索和信息合成，开创了名为Agentic深度研究的新范式。Agentic深度研究不仅显著优于现有方法，而且将成为未来信息搜索的主导范式。

Key Takeaways

信息检索是现代知识获取的核心，每天处理数十亿次查询。
传统搜索引擎在处理复杂信息需求时存在不足。
大型语言模型（LLMs）具有推理和代理能力，开启了名为Agentic深度研究的新范式。
Agentic深度研究集成了自主推理、迭代检索和信息合成。
Agentic系统从静态网页搜索进化到交互式的、基于代理的系统，能够进行规划、探索和学习。
存在一个测试时的尺度定律，用于形式化计算深度对推理和搜索的影响。

Cool Papers

点此查看论文截图

Privacy-Preserving in Connected and Autonomous Vehicles Through Vision to Text Transformation

Authors:Abdolazim Rezaei, Mehdi Sookhak, Ahmad Patooghy

Connected and Autonomous Vehicles (CAVs) rely on a range of devices that often process privacy-sensitive data. Among these, roadside units play a critical role particularly through the use of AI-equipped (AIE) cameras for applications such as violation detection. However, the privacy risks associated with captured imagery remain a major concern, as such data can be misused for identity theft, profiling, or unauthorized commercial purposes. While traditional techniques such as face blurring and obfuscation have been applied to mitigate privacy risks, individual privacy remains at risk, as individuals can still be tracked using other features such as their clothing. This paper introduces a novel privacy-preserving framework that leverages feedback-based reinforcement learning (RL) and vision-language models (VLMs) to protect sensitive visual information captured by AIE cameras. The main idea is to convert images into semantically equivalent textual descriptions, ensuring that scene-relevant information is retained while visual privacy is preserved. A hierarchical RL strategy is employed to iteratively refine the generated text, enhancing both semantic accuracy and privacy. Evaluation results demonstrate significant improvements in both privacy protection and textual quality, with the Unique Word Count increasing by approximately 77% and Detail Density by around 50% compared to existing approaches.

自动驾驶和联网车辆（CAVs）依赖于一系列经常处理敏感隐私数据的设备。其中，路边单元发挥着关键作用，特别是在使用配备人工智能（AI）的摄像机进行违章检测等应用方面。然而，捕获图像所带来的隐私风险仍然是一个主要的担忧，因为这种数据可能会被用于身份盗窃、个人档案制作或未经授权的商业目的等不当用途。虽然传统的技术如面部模糊和模糊处理已被应用于减轻隐私风险，但个人隐私问题仍然存在，因为个体仍然可以通过他们的服装等其他特征进行跟踪。本文介绍了一种新型的隐私保护框架，它利用基于反馈的强化学习（RL）和视觉语言模型（VLMs）来保护被人工智能摄像机捕获的敏感视觉信息。主要思想是将图像转换为语义等效的文本描述，确保场景相关信息被保留的同时保护视觉隐私。采用分层RL策略来迭代优化生成的文本，提高语义准确性和隐私保护能力。评估结果表明，在隐私保护和文本质量方面都有显著提高，与现有方法相比，唯一字数增加了约77%，细节密度增加了约50%。

论文及项目相关链接

PDF

Summary
本文引入了一种新型的隐私保护框架，该框架结合反馈强化学习和视觉语言模型技术，将AIE相机捕捉到的敏感视觉信息转换为语义等效的文本描述，以保留场景相关信息并保护视觉隐私。采用分层强化学习策略对生成的文本进行迭代优化，提高语义准确性和隐私保护效果。评估结果显示，该方法在隐私保护和文本质量方面均有显著改善。

Key Takeaways

CAVs中使用的路边单元配备AI相机进行违章检测等应用，但捕捉的图像存在隐私泄露风险。
传统技术如面部模糊和伪装不能完全保护个人隐私，因为仍可通过其他特征如服装追踪个体。
新框架利用反馈强化学习和视觉语言模型保护敏感视觉信息，将图像转换为文本描述。
框架能保留场景相关信息，同时保护视觉隐私。
采用分层强化学习策略优化生成的文本，提高语义准确性和隐私保护效果。
评估结果显示，新框架在隐私保护和文本质量方面表现出显著改善。

Cool Papers

点此查看论文截图

Kedreamix

https://kedreamix.github.io/Talk2Paper/Paper/2025-07-05/LLM/

本博客所有文章除特別声明外，均采用 CC BY 4.0 许可协议。转载请注明来源 Kedreamix !

LLM

Agent

Agent 方向最新论文已更新，请持续关注 Update in 2025-07-05 Establishing Best Practices for Building Rigorous Agentic Benchmarks

2025-07-05 Agent

Agent

R1_Reasoning

R1_Reasoning 方向最新论文已更新，请持续关注 Update in 2025-07-05 MOTIF Modular Thinking via Reinforcement Fine-tuning in LLMs

2025-07-05 R1_Reasoning

R1_Reasoning

LLM

2025-07-05 更新

Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation

Requirements Elicitation Follow-Up Question Generation

MOTIF: Modular Thinking via Reinforcement Fine-tuning in LLMs

Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection

StepHint: Multi-level Stepwise Hints Enhance Reinforcement Learning to Reason

Multimodal Mathematical Reasoning with Diverse Solving Perspective

Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models

From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding

DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment

Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks

Bourbaki: Self-Generated and Goal-Conditioned MDPs for Theorem Proving

AIGI-Holmes: Towards Explainable and Generalizable AI-Generated Image Detection via Multimodal Large Language Models

Scaling LLM Planning: NL2FLOW for Parametric Problem Generation and Rigorous Evaluation

Latent Chain-of-Thought? Decoding the Depth-Recurrent Transformer

FinAI-BERT: A Transformer-Based Model for Sentence-Level Detection of AI Disclosures in Financial Reports

Large Language Model-Driven Closed-Loop UAV Operation with Semantic Observations

Symbolic or Numerical? Understanding Physics Problem Solving in Reasoning LLMs

GPAS: Accelerating Convergence of LLM Pretraining via Gradient-Preserving Activation Scaling

From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents

Privacy-Preserving in Connected and Autonomous Vehicles Through Vision to Text Transformation