LLM

Published: 2025-10-21

Updated: 2025-11-27


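⚠️ All summaries below were generated by a large language model. They may contain errors; treat them as reference only and use with caution.
🔴 Note: do not use these summaries in serious academic settings. They are intended only for initial screening before reading a paper.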

Updated 2025-10-21

Authors:Hanrong Ye, Chao-Han Huck Yang, Arushi Goel, Wei Huang, Ligeng Zhu, Yuanhang Su, Sean Lin, An-Chieh Cheng, Zhen Wan, Jinchuan Tian, Yuming Lou, Dong Yang, Zhijian Liu, Yukang Chen, Ambrish Dantrey, Ehsan Jahangiri, Sreyan Ghosh, Daguang Xu, Ehsan Hosseini-Asl, Danial Mohseni Taheri, Vidya Murali, Sifei Liu, Jason Lu, Oluwatobi Olabiyi, Frank Wang, Rafael Valle, Bryan Catanzaro, Andrew Tao, Song Han, Jan Kautz, Hongxu Yin, Pavlo Molchanov

Advancing machine intelligence requires developing the ability to perceive across multiple modalities, much as humans sense the world. We introduce OmniVinci, an initiative to build a strong, open-source, omni-modal LLM. We carefully study the design choices across model architecture and data curation. For model architecture, we present three key innovations: (i) OmniAlignNet for strengthening alignment between vision and audio embeddings in a shared omni-modal latent space; (ii) Temporal Embedding Grouping for capturing relative temporal alignment between vision and audio signals; and (iii) Constrained Rotary Time Embedding for encoding absolute temporal information in omni-modal embeddings. We introduce a curation and synthesis pipeline that generates 24M single-modal and omni-modal conversations. We find that modalities reinforce one another in both perception and reasoning. Our model, OmniVinci, outperforms Qwen2.5-Omni with +19.05 on DailyOmni (cross-modal understanding), +1.7 on MMAR (audio), and +3.9 on Video-MME (vision), while using just 0.2T training tokens - a 6 times reduction compared to Qwen2.5-Omni’s 1.2T. We finally demonstrate omni-modal advantages in downstream applications spanning robotics, medical AI, and smart factory.


Paper and project links

PDF Technical Report. Code: https://github.com/NVlabs/OmniVinci

Summary

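OmniVinci is an open-source omni-modal large language model intended to improve machine perception across modalities. Its architecture introduces three components: OmniAlignNet, Temporal Embedding Grouping, and Constrained Rotary Time Embedding. The paper also describes a data curation and synthesis pipeline and reports gains over Qwen2.5-Omni on cross-modal understanding, audio, and vision benchmarks. The authors further demonstrate omni-modal advantages in downstream applications spanning robotics, medical AI, and smart factories.

Key Takeaways

OmniVinci is an open-source omni-modal LLM aimed at perceiving across multiple modalities.
The architecture introduces OmniAlignNet, Temporal Embedding Grouping, and Constrained Rotary Time Embedding.
A curation and synthesis pipeline generates 24M single-modal and omni-modal conversations.
OmniVinci outperforms Qwen2.5-Omni by +19.05 on DailyOmni (cross-modal understanding), +1.7 on MMAR (audio), and +3.9 on Video-MME (vision).
The authors demonstrate omni-modal advantages in robotics, medical AI, and smart-factory applications.
Training uses 0.2T tokens, a 6x reduction compared with the 1.2T used by Qwen2.5-Omni.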


BiomedXPro: Prompt Optimization for Explainable Diagnosis with Biomedical Vision Language Models

Authors:Kaushitha Silva, Mansitha Eashwara, Sanduni Ubayasiri, Ruwan Tennakoon, Damayanthi Herath

The clinical adoption of biomedical vision-language models is hindered by prompt optimization techniques that produce either uninterpretable latent vectors or single textual prompts. This lack of transparency and failure to capture the multi-faceted nature of clinical diagnosis, which relies on integrating diverse observations, limits their trustworthiness in high-stakes settings. To address this, we introduce BiomedXPro, an evolutionary framework that leverages a large language model as both a biomedical knowledge extractor and an adaptive optimizer to automatically generate a diverse ensemble of interpretable, natural-language prompt pairs for disease diagnosis. Experiments on multiple biomedical benchmarks show that BiomedXPro consistently outperforms state-of-the-art prompt-tuning methods, particularly in data-scarce few-shot settings. Furthermore, our analysis demonstrates a strong semantic alignment between the discovered prompts and statistically significant clinical features, grounding the model’s performance in verifiable concepts. By producing a diverse ensemble of interpretable prompts, BiomedXPro provides a verifiable basis for model predictions, representing a critical step toward the development of more trustworthy and clinically-aligned AI systems.


Paper and project links

PDF 10 Pages + 15 Supplementary Material Pages, 5 figures

Summary

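Clinical adoption of biomedical vision-language models is held back by prompt optimization techniques that yield either uninterpretable latent vectors or a single textual prompt. BiomedXPro is an evolutionary framework that uses a large language model as both a biomedical knowledge extractor and an adaptive optimizer to automatically generate a diverse ensemble of interpretable natural-language prompt pairs for disease diagnosis. Across multiple biomedical benchmarks it outperforms state-of-the-art prompt-tuning methods, particularly in data-scarce few-shot settings. The analysis also shows strong semantic alignment between the discovered prompts and statistically significant clinical features, giving model predictions a verifiable basis.

Key Takeaways

Existing prompt optimization for biomedical vision-language models lacks transparency, producing uninterpretable latent vectors or single textual prompts.
BiomedXPro uses an LLM to generate a diverse ensemble of interpretable natural-language prompt pairs for disease diagnosis.
On multiple biomedical benchmarks, BiomedXPro outperforms state-of-the-art prompt-tuning methods, especially in few-shot settings.
The discovered prompts align semantically with statistically significant clinical features.
BiomedXPro provides a verifiable basis for model predictions, supporting more trustworthy AI systems.
The authors present the framework as a critical step toward clinically aligned AI systems.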


PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold

Authors:Yi Wan, Jiuqi Wang, Liam Li, Jinsong Liu, Ruihao Zhu, Zheqing Zhu

Tool-augmented large language models (LLMs) are emerging as deep research agents, systems that decompose complex queries, retrieve external evidence, and synthesize grounded responses. Yet current agents remain limited by shallow retrieval, weak alignment metrics, and brittle tool-use behavior. We introduce PokeeResearch-7B, a 7B-parameter deep research agent built under a unified reinforcement learning framework for robustness, alignment, and scalability. PokeeResearch-7B is trained by an annotation-free Reinforcement Learning from AI Feedback (RLAIF) framework to optimize policies using LLM-based reward signals that capture factual accuracy, citation faithfulness, and instruction adherence. A chain-of-thought-driven multi-call reasoning scaffold further enhances robustness through self-verification and adaptive recovery from tool failures. Among 10 popular deep research benchmarks, PokeeResearch-7B achieves state-of-the-art performance among 7B-scale deep research agents. This highlights that careful reinforcement learning and reasoning design can produce efficient, resilient, and research-grade AI agents. The model and inference code is open-sourced under MIT license at https://github.com/Pokee-AI/PokeeResearchOSS.


Paper and project links

PDF

Summary

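PokeeResearch-7B is a 7B-parameter, tool-augmented deep research agent built under a unified reinforcement learning framework for robustness, alignment, and scalability. It is trained with an annotation-free Reinforcement Learning from AI Feedback (RLAIF) framework whose LLM-based reward signals capture factual accuracy, citation faithfulness, and instruction adherence. A chain-of-thought-driven multi-call reasoning scaffold adds robustness through self-verification and adaptive recovery from tool failures. Across 10 popular deep research benchmarks, the model achieves state-of-the-art performance among 7B-scale deep research agents. The model and inference code are open-sourced under the MIT license.

Key Takeaways

PokeeResearch-7B is a 7B-parameter tool-augmented LLM trained with reinforcement learning as a deep research agent.
The design targets robustness, alignment, and scalability.
Training uses an annotation-free RLAIF framework to optimize the policy.
LLM-based reward signals capture factual accuracy, citation faithfulness, and instruction adherence.
A multi-call reasoning scaffold improves robustness through self-verification and adaptive recovery from tool failures.
On 10 popular deep research benchmarks, PokeeResearch-7B achieves state-of-the-art performance among 7B-scale deep research agents.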
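
The self-verification and recovery behaviour described for the reasoning scaffold can be illustrated with a small control loop. This is only a sketch under stated assumptions: `run_with_recovery`, `make_flaky_search`, `verify`, and `ToolError` are hypothetical names invented here, not part of the PokeeResearch code base, and the real scaffold is driven by an LLM rather than by fixed callables.

```python
from typing import Callable, Optional


class ToolError(Exception):
    """Raised by a (hypothetical) tool when a call fails."""


def run_with_recovery(
    tool: Callable[[str], str],
    verify: Callable[[str], bool],
    query: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Call a tool, retry on failure, and accept only verified answers."""
    for _ in range(max_attempts):
        try:
            answer = tool(query)
        except ToolError:
            # Recovery: a failed call does not end the episode, try again.
            continue
        if verify(answer):
            # Self-verification passed, so the answer is returned.
            return answer
    # No attempt produced a verified answer.
    return None


def make_flaky_search(failures: int) -> Callable[[str], str]:
    """Build a stand-in search tool that fails a fixed number of times."""
    state = {"calls": 0}

    def search(query: str) -> str:
        state["calls"] += 1
        if state["calls"] <= failures:
            raise ToolError("simulated tool failure")
        return f"evidence for: {query}"

    return search


def verify(answer: str) -> bool:
    """Toy verifier: accept any answer that starts with cited evidence."""
    return answer.startswith("evidence")


result = run_with_recovery(make_flaky_search(failures=2), verify, "RLAIF")
```

With two simulated failures and three allowed attempts, the third call succeeds and is verified. Raising `failures` above `max_attempts` makes the function return `None`, which a caller could treat as a signal to switch tools or rephrase the query.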


InfiMed-ORBIT: Aligning LLMs on Open-Ended Complex Tasks via Rubric-Based Incremental Training

Authors:Pengkai Wang, Qi Zuo, Pengwei Liu, Zhijie Sang, Congkai Xie, Hongxia Yang

Large Language Models (LLMs) have shown substantial advances through reinforcement learning (RL), particularly in domains where rewards can be programmatically verified, such as mathematics and code. In these areas, models benefit from a well-defined operational base guided by explicit rule-based objectives. However, this progress reveals a significant limitation: in open-ended domains where rewards are ambiguous, subjective, or context-dependent, such as creative writing, scientific reasoning, and notably medical consultation, robust reward functions are lacking, making these areas challenging for current RL strategies. To bridge this gap, we introduce ORBIT, an open-ended rubric-based incremental training framework specifically designed for high-stakes medical dialogue. ORBIT integrates synthetic dialogue generation with the dynamic creation of rubrics, employing these rubrics to direct an incremental RL process. In particular, this approach does not depend on external medical knowledge or manual rules, instead utilizing rubric-guided feedback to shape learning. When implemented on the Qwen3-4B-Instruct model, our method can greatly enhance its performance on the HealthBench-Hard benchmark from 7.0 to 27.2 using only 2k samples, thus achieving state-of-the-art results for models of this scale. Our analysis confirms that rubric-driven RL fosters consistent performance gains across diverse consultation scenarios, going beyond simple numerical improvements. These findings underscore rubric-based feedback as a scalable strategy for advancing LLMs in intricate, open-ended tasks.


Paper and project links

PDF 17 pages, 6 figures

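Summary

LLMs have advanced substantially through reinforcement learning (RL) in domains where rewards can be verified programmatically, such as mathematics and code. In open-ended domains where rewards are ambiguous, subjective, or context-dependent, such as creative writing, scientific reasoning, and medical consultation, robust reward functions are lacking. ORBIT is an open-ended rubric-based incremental training framework designed for high-stakes medical dialogue. It combines synthetic dialogue generation with dynamically created rubrics, which then direct an incremental RL process without relying on external medical knowledge or manual rules. Applied to Qwen3-4B-Instruct, it raises the HealthBench-Hard score from 7.0 to 27.2 using only 2k samples, a state-of-the-art result for models of this scale. The analysis indicates consistent gains across diverse consultation scenarios.

Key Takeaways

RL has driven clear LLM progress where rewards are rule-based and programmatically verifiable.
In open-ended domains such as medical consultation, current RL strategies lack robust reward functions.
ORBIT is an open-ended rubric-based incremental training framework for high-stakes medical dialogue.
It combines synthetic dialogue generation with dynamic rubric creation and uses the rubrics to direct incremental RL.
The method relies on rubric-guided feedback rather than external medical knowledge or manual rules.
On Qwen3-4B-Instruct, ORBIT raises HealthBench-Hard performance from 7.0 to 27.2 with 2k samples, state of the art for models of this scale.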
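
To make the idea of rubric-guided feedback concrete, the sketch below turns a list of rubric criteria into a scalar reward. It is an illustration under assumptions, not ORBIT's actual reward: the names `RubricItem`, `rubric_reward`, and `keyword_judge` are hypothetical, the point-weighted normalisation is one plausible scoring scheme, and the keyword matcher stands in for the LLM judge a real system would use.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RubricItem:
    description: str
    points: float  # positive for desired behaviour, negative for penalties


def rubric_reward(
    response: str,
    rubric: List[RubricItem],
    is_met: Callable[[str, RubricItem], bool],
) -> float:
    """Earned points divided by the maximum achievable points, clipped to [0, 1]."""
    max_points = sum(item.points for item in rubric if item.points > 0)
    if max_points == 0:
        return 0.0
    earned = sum(item.points for item in rubric if is_met(response, item))
    return min(1.0, max(0.0, earned / max_points))


def keyword_judge(response: str, item: RubricItem) -> bool:
    """Toy judge (assumption): a criterion is met if its text appears verbatim."""
    return item.description.lower() in response.lower()


# Made-up rubric for illustration only.
rubric = [
    RubricItem("see a doctor", 3.0),
    RubricItem("ask about symptoms", 2.0),
    RubricItem("guaranteed cure", -4.0),
]
good = rubric_reward("Please ask about symptoms and see a doctor.", rubric, keyword_judge)
partial = rubric_reward("You should see a doctor.", rubric, keyword_judge)
bad = rubric_reward("This is a guaranteed cure.", rubric, keyword_judge)
```

Here `good` earns all positive points, `partial` earns three of five, and `bad` triggers only the penalty and is clipped to zero. Clipping keeps the reward bounded, which is a design choice of this sketch rather than something stated in the abstract.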


Paper2Web: Let’s Make Your Paper Alive!

Authors:Yuhang Chen, Tianpeng Lv, Siyi Zhang, Yixiang Yin, Yao Wan, Philip S. Yu, Dongping Chen

Academic project websites can more effectively disseminate research when they clearly present core content and enable intuitive navigation and interaction. However, current approaches such as direct Large Language Model (LLM) generation, templates, or direct HTML conversion struggle to produce layout-aware, interactive sites, and a comprehensive evaluation suite for this task has been lacking. In this paper, we introduce Paper2Web, a benchmark dataset and multi-dimensional evaluation framework for assessing academic webpage generation. It incorporates rule-based metrics like Connectivity, Completeness and human-verified LLM-as-a-Judge (covering interactivity, aesthetics, and informativeness), and PaperQuiz, which measures paper-level knowledge retention. We further present PWAgent, an autonomous pipeline that converts scientific papers into interactive and multimedia-rich academic homepages. The agent iteratively refines both content and layout through MCP tools that enhance emphasis, balance, and presentation quality. Our experiments show that PWAgent consistently outperforms end-to-end baselines like template-based webpages and arXiv/alphaXiv versions by a large margin while maintaining low cost, achieving the Pareto-front in academic webpage generation.


Paper and project links

PDF Under Review. Check https://github.com/YuhangChen1/Paper2All for the unified platform to streamline all academic presentation

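Summary

Academic project websites disseminate research more effectively when they present core content clearly and support intuitive navigation and interaction. Current approaches, including direct LLM generation, templates, and direct HTML conversion, struggle to produce layout-aware, interactive sites, and a comprehensive evaluation suite has been lacking. Paper2Web is a benchmark dataset and multi-dimensional evaluation framework for academic webpage generation. It combines rule-based metrics such as Connectivity and Completeness, a human-verified LLM-as-a-Judge (covering interactivity, aesthetics, and informativeness), and PaperQuiz, which measures paper-level knowledge retention. The paper also presents PWAgent, an autonomous pipeline that converts scientific papers into interactive, multimedia-rich academic homepages, iteratively refining content and layout through MCP tools. In the reported experiments, PWAgent outperforms end-to-end baselines such as template-based webpages and arXiv/alphaXiv versions by a large margin at low cost, reaching the Pareto front.

Key Takeaways

Project websites spread research more effectively when they present core content clearly and support intuitive navigation and interaction.
Current approaches struggle to generate layout-aware, interactive academic sites.
Paper2Web provides a benchmark dataset and multi-dimensional evaluation framework, including rule-based metrics (Connectivity, Completeness) and a human-verified LLM-as-a-Judge.
PaperQuiz measures paper-level knowledge retention.
PWAgent is an autonomous pipeline that turns scientific papers into academic homepages, refining content and layout through MCP tools.
In experiments, PWAgent outperforms the end-to-end baselines by a large margin and reaches the Pareto front at low cost.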


Self-evolving expertise in complex non-verifiable subject domains: dialogue as implicit meta-RL

Authors:Richard M. Bailey

So-called 'wicked problems', those involving complex multi-dimensional settings, non-verifiable outcomes, heterogeneous impacts and a lack of single objectively correct answers, have plagued humans throughout history. Modern examples include decisions over justice frameworks, solving environmental pollution, planning for pandemic resilience and food security. The use of state-of-the-art artificial intelligence systems (notably Large Language Model-based agents) collaborating with humans on solving such problems is being actively explored. While the abilities of LLMs can be improved by, for example, fine-tuning, hand-crafted system prompts and scaffolding with external tools, LLMs lack endogenous mechanisms to develop expertise through experience in such settings. This work addresses this gap with Dialectica, a framework where agents engage in structured dialogue on defined topics, augmented by memory, self-reflection, and policy-constrained context editing. Formally, discussion is viewed as an implicit meta-reinforcement learning process. The 'dialogue-trained' agents are evaluated post-hoc using judged pairwise comparisons of elicited responses. Across two model architectures (locally run Qwen3:30b and OpenAI’s o4-mini) results show that enabling reflection-based context editing during discussion produces agents which dominate their baseline counterparts on Elo scores, normalized Bradley-Terry-Davidson ability, and AlphaRank mass. The predicted signatures of learning are observed qualitatively in statement and reflection logs, where reflections identify weaknesses and reliably shape subsequent statements. Agreement between quantitative and qualitative evidence supports dialogue-driven context evolution as a practical path to targeted expertise amplification in open non-verifiable domains.


Paper and project links

PDF 50 pages, 4 figures

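Summary

The paper addresses so-called 'wicked problems': settings that are complex and multi-dimensional, with non-verifiable outcomes, heterogeneous impacts, and no single objectively correct answer. Modern examples include decisions over justice frameworks, environmental pollution, pandemic resilience planning, and food security. LLM abilities can be improved by fine-tuning, hand-crafted system prompts, and scaffolding with external tools, but LLMs lack endogenous mechanisms for developing expertise through experience in such settings. Dialectica is a framework in which agents hold structured dialogue on defined topics, augmented by memory, self-reflection, and policy-constrained context editing; discussion is formally viewed as an implicit meta-reinforcement learning process. The 'dialogue-trained' agents are evaluated post hoc with judged pairwise comparisons of elicited responses. Across two models (locally run Qwen3:30b and OpenAI's o4-mini), agents with reflection-based context editing dominate their baselines on Elo scores, normalized Bradley-Terry-Davidson ability, and AlphaRank mass. Statement and reflection logs show the predicted signatures of learning, with reflections identifying weaknesses and shaping later statements.

Key Takeaways

Wicked problems, such as decisions over justice frameworks or environmental pollution, lack verifiable outcomes and single correct answers.
LLMs lack endogenous mechanisms for developing expertise through experience in such settings.
Dialectica addresses this gap with structured dialogue augmented by memory, self-reflection, and policy-constrained context editing.
Discussion is formally treated as an implicit meta-reinforcement learning process.
On both Qwen3:30b and o4-mini, agents with reflection-based context editing beat their baselines on Elo scores, normalized Bradley-Terry-Davidson ability, and AlphaRank mass.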
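
Elo scoring from judged pairwise comparisons, one of the evaluation measures named above, follows a standard update rule. The sketch below shows that rule in isolation; the agent names, starting ratings, and K-factor of 32 are assumptions chosen for illustration and are not taken from the paper.

```python
from typing import Dict, Iterable, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that player A beats player B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    ratings: Dict[str, float],
    comparisons: Iterable[Tuple[str, str, float]],
    k: float = 32.0,
) -> Dict[str, float]:
    """Apply judged comparisons in order; each item is (a, b, score_for_a)."""
    updated = dict(ratings)  # leave the caller's ratings untouched
    for a, b, score_a in comparisons:
        delta = k * (score_a - expected_score(updated[a], updated[b]))
        updated[a] += delta
        updated[b] -= delta  # zero-sum: the total rating is conserved
    return updated


# Hypothetical agents with equal starting ratings.
start = {"reflective_agent": 1000.0, "baseline_agent": 1000.0}
# 1.0 means the judge preferred the first agent, 0.0 the second, 0.5 a tie.
judged = [("reflective_agent", "baseline_agent", 1.0)]
final = update_elo(start, judged)
```

With equal starting ratings the expected score is 0.5, so a single win moves each rating by half the K-factor in opposite directions. Because Elo depends on the order of comparisons, a study would typically average over shuffles or use an order-free model such as Bradley-Terry alongside it.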


LLMs Judge Themselves: A Game-Theoretic Framework for Human-Aligned Evaluation

Authors:Gao Yang, Yuhang Liu, Siyu Miao, Xinyue Liang, Zhengyang Liu, Heyan Huang

Ideal or real - that is the question. In this work, we explore whether principles from game theory can be effectively applied to the evaluation of large language models (LLMs). This inquiry is motivated by the growing inadequacy of conventional evaluation practices, which often rely on fixed-format tasks with reference answers and struggle to capture the nuanced, subjective, and open-ended nature of modern LLM behavior. To address these challenges, we propose a novel alternative: automatic mutual evaluation, where LLMs assess each other’s output through self-play and peer review. These peer assessments are then systematically compared with human voting behavior to evaluate their alignment with human judgment. Our framework incorporates game-theoretic voting algorithms to aggregate peer reviews, enabling a principled investigation into whether model-generated rankings reflect human preferences. Empirical results reveal both convergences and divergences between theoretical predictions and human evaluations, offering valuable insights into the promises and limitations of mutual evaluation. To the best of our knowledge, this is the first work to jointly integrate mutual evaluation, game-theoretic aggregation, and human-grounded validation for evaluating the capabilities of LLMs.


Paper and project links

PDF

Summary

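This work asks whether principles from game theory can be applied effectively to LLM evaluation. To address the limits of conventional evaluation, which relies on fixed-format tasks with reference answers, it proposes automatic mutual evaluation, in which LLMs assess each other's outputs through self-play and peer review. The peer assessments are systematically compared with human voting behavior to measure alignment with human judgment. Game-theoretic voting algorithms aggregate the peer reviews, allowing a principled test of whether model-generated rankings reflect human preferences. Empirical results show both convergences and divergences between theoretical predictions and human evaluations, clarifying the promise and limitations of mutual evaluation.

Key Takeaways

The paper examines whether game-theoretic principles can be applied to LLM evaluation.
It proposes automatic mutual evaluation, where LLMs assess each other's outputs through self-play and peer review.
Game-theoretic voting algorithms aggregate the peer reviews.
The aggregated assessments are compared with human voting behavior to measure alignment with human judgment.
Results show both convergences and divergences between theoretical predictions and human evaluations.
The findings shed light on the promise and limitations of mutual evaluation.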
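
The aggregate-then-compare workflow can be sketched in a few lines. The abstract does not name the voting algorithms used, so this example substitutes a Borda count purely as a stand-in, and measures agreement with a human ranking by the fraction of model pairs ordered the same way. The function names, model labels, and rankings are all hypothetical.

```python
from typing import Dict, List, Sequence


def borda_aggregate(rankings: Sequence[Sequence[str]]) -> List[str]:
    """Combine peer rankings: a model in position p of n earns n - 1 - p points."""
    scores: Dict[str, int] = {}
    for ranking in rankings:
        n = len(ranking)
        for position, model in enumerate(ranking):
            scores[model] = scores.get(model, 0) + (n - 1 - position)
    # Highest score first; ties broken by name so the result is deterministic.
    return sorted(scores, key=lambda m: (-scores[m], m))


def pairwise_agreement(rank_a: Sequence[str], rank_b: Sequence[str]) -> float:
    """Fraction of model pairs that both rankings place in the same order."""
    pos_a = {m: i for i, m in enumerate(rank_a)}
    pos_b = {m: i for i, m in enumerate(rank_b)}
    models = list(rank_a)
    pairs = [(x, y) for i, x in enumerate(models) for y in models[i + 1:]]
    same = sum((pos_a[x] < pos_a[y]) == (pos_b[x] < pos_b[y]) for x, y in pairs)
    return same / len(pairs)


# Each inner list is one reviewer model's ranking, best first (made-up data).
peer_rankings = [["A", "B", "C"], ["A", "C", "B"], ["B", "A", "C"]]
peer_consensus = borda_aggregate(peer_rankings)
human_ranking = ["A", "C", "B"]
agreement = pairwise_agreement(peer_consensus, human_ranking)
```

In this toy data the peers agree with the human ranking on two of three pairs and disagree on the order of B and C. Other aggregation rules can produce different consensus rankings from the same votes, which is why the choice of rule matters when comparing against human preferences.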


FACE: A General Framework for Mapping Collaborative Filtering Embeddings into LLM Tokens

Authors:Chao Wang, Yixin Song, Jinhui Ye, Chuan Qin, Dazhong Shen, Lingfeng Liu, Xiang Wang, Yanyong Zhang

Recently, large language models (LLMs) have been explored for integration with collaborative filtering (CF)-based recommendation systems, which are crucial for personalizing user experiences. However, a key challenge is that LLMs struggle to interpret the latent, non-semantic embeddings produced by CF approaches, limiting recommendation effectiveness and further applications. To address this, we propose FACE, a general interpretable framework that maps CF embeddings into pre-trained LLM tokens. Specifically, we introduce a disentangled projection module to decompose CF embeddings into concept-specific vectors, followed by a quantized autoencoder to convert continuous embeddings into LLM tokens (descriptors). Then, we design a contrastive alignment objective to ensure that the tokens align with corresponding textual signals. Hence, the model-agnostic FACE framework achieves semantic alignment without fine-tuning LLMs and enhances recommendation performance by leveraging their pre-trained capabilities. Empirical results on three real-world recommendation datasets demonstrate performance improvements in benchmark models, with interpretability studies confirming the interpretability of the descriptors. Code is available in https://github.com/YixinRoll/FACE.


Paper and project links

PDF Accepted by NeurIPS 2025

Summary

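Integrating LLMs with collaborative filtering (CF)-based recommendation systems is being explored for personalizing user experiences, but LLMs struggle to interpret the latent, non-semantic embeddings that CF produces. FACE is a general interpretable framework that maps CF embeddings into pre-trained LLM tokens. A disentangled projection module decomposes CF embeddings into concept-specific vectors, a quantized autoencoder converts the continuous embeddings into LLM tokens (descriptors), and a contrastive alignment objective ties the tokens to corresponding textual signals. The framework achieves semantic alignment without fine-tuning the LLM. Results on three real-world recommendation datasets show improved performance for benchmark models, and interpretability studies support the descriptors' interpretability. Code is available at https://github.com/YixinRoll/FACE.

Key Takeaways

CF-based recommendation systems are crucial for personalization, and LLMs are being explored for integration with them.
LLMs struggle to interpret the latent, non-semantic embeddings produced by CF.
FACE maps CF embeddings into pre-trained LLM tokens.
It does so by decomposing CF embeddings into concept-specific vectors, quantizing them into LLM tokens, and aligning the tokens with textual signals through a contrastive objective.
Semantic alignment is achieved without fine-tuning the LLM.
Experiments on three real-world datasets show improved recommendation performance.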
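
The step of turning continuous vectors into tokens can be illustrated with a nearest-neighbour lookup against a token embedding table. This is a sketch of the general quantization idea only: FACE's quantized autoencoder is a trained module, whereas the code below assumes the concept vectors already live in the token embedding space. The function names, two-dimensional embeddings, and vocabulary are invented for the example.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity, returning 0.0 when either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def quantize_to_tokens(
    concept_vectors: Sequence[Sequence[float]],
    token_embeddings: Dict[str, Sequence[float]],
) -> List[str]:
    """Map each concept vector to the token whose embedding is most similar."""
    return [
        max(token_embeddings, key=lambda tok: cosine(vec, token_embeddings[tok]))
        for vec in concept_vectors
    ]


# Toy 2-d "token embeddings" (assumption; real LLM embeddings are far larger).
token_embeddings = {
    "comedy": [1.0, 0.0],
    "horror": [0.0, 1.0],
    "romance": [0.7, 0.7],
}
# Two concept-specific vectors, as if produced by a projection of one CF embedding.
concept_vectors = [[0.9, 0.1], [0.1, 0.8]]
descriptors = quantize_to_tokens(concept_vectors, token_embeddings)
```

Each continuous vector is replaced by a discrete, human-readable token, which is what makes the resulting descriptors inspectable. A trained quantizer would also need a mechanism for passing gradients through the discrete choice, which this lookup omits.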


The 3rd Place Solution of CCIR CUP 2025: A Framework for Retrieval-Augmented Generation in Multi-Turn Legal Conversation

Authors:Da Li, Zecheng Fang, Qiang Yan, Wei Huang, Xuanpu Luo

Retrieval-Augmented Generation has made significant progress in the field of natural language processing. By combining the advantages of information retrieval and large language models, RAG can generate relevant and contextually appropriate responses based on items retrieved from reliable sources. This technology has demonstrated outstanding performance across multiple domains, but its application in the legal field remains in its exploratory phase. In this paper, we introduce our approach for “Legal Knowledge Retrieval and Generation” in CCIR CUP 2025, which leverages large language models and information retrieval systems to provide responses based on laws in response to user questions.


Paper and project links

PDF CCIR2025

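Summary

Retrieval-Augmented Generation (RAG) has performed well across multiple domains, but its application in the legal field is still exploratory. This paper describes the authors' approach to the "Legal Knowledge Retrieval and Generation" task in CCIR CUP 2025, which uses large language models and information retrieval systems to answer user questions with responses grounded in law.

Key Takeaways

RAG has made significant progress in natural language processing.
It combines the strengths of information retrieval and large language models.
It generates relevant, contextually appropriate responses from items retrieved from reliable sources and has performed well across multiple domains.
The authors' approach to "Legal Knowledge Retrieval and Generation" uses LLMs and information retrieval systems to answer user questions based on laws.
Application of RAG in the legal field remains exploratory.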
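
A multi-turn retrieval-augmented pipeline can be reduced to two steps: retrieve articles using the whole conversation as the query, then assemble a prompt for the generator. The sketch below is a generic illustration and not the authors' competition system: retrieval is plain token overlap rather than a real search engine, the LLM call is left out, and the "articles" are placeholder sentences written for this example, not actual statutes.

```python
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase whitespace tokenizer; a real system would use a proper one."""
    return set(text.lower().split())


def retrieve(query: str, articles: List[str], top_k: int = 1) -> List[str]:
    """Rank articles by the number of tokens they share with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        articles,
        key=lambda article: len(query_tokens & tokenize(article)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(history: List[str], question: str, articles: List[str]) -> str:
    """Assemble retrieved articles, prior turns, and the question for an LLM."""
    # Earlier turns join the retrieval query so follow-up questions keep context.
    retrieved = retrieve(" ".join(history + [question]), articles)
    lines = ["Relevant articles:"] + retrieved
    lines += ["Conversation so far:"] + history
    lines += ["Question: " + question]
    return "\n".join(lines)


# Placeholder texts invented for this example, not real legal provisions.
ARTICLES = [
    "Article 1: theft of property is punished by a fine",
    "Article 2: a contract requires mutual consent",
    "Article 3: a tenant must pay rent on time",
]
history = ["My neighbour took my bicycle."]
question = "What is the punishment for theft"
prompt = build_prompt(history, question, ARTICLES)
```

Including earlier turns in the query is the main multi-turn consideration here, since a follow-up such as "what if it was borrowed?" carries little retrievable signal on its own.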


GraphMind: Interactive Novelty Assessment System for Accelerating Scientific Discovery

Authors:Italo Luis da Silva, Hanqi Yan, Lin Gui, Yulan He

Large Language Models (LLMs) show strong reasoning and text generation capabilities, prompting their use in scientific literature analysis, including novelty assessment. While evaluating novelty of scientific papers is crucial for peer review, it requires extensive knowledge of related work, something not all reviewers have. While recent work on LLM-assisted scientific literature analysis supports literature comparison, existing approaches offer limited transparency and lack mechanisms for result traceability via an information retrieval module. To address this gap, we introduce GraphMind, an easy-to-use interactive web tool designed to assist users in evaluating the novelty of scientific papers or drafted ideas. Specifically, GraphMind enables users to capture the main structure of a scientific paper, explore related ideas through various perspectives, and assess novelty by providing verifiable contextual insights. GraphMind enables users to annotate key elements of a paper, explore related papers through various relationships, and assess novelty with contextual insight. This tool integrates external APIs such as arXiv and Semantic Scholar with LLMs to support annotation, extraction, retrieval and classification of papers. This combination provides users with a rich, structured view of a scientific idea’s core contributions and its connections to existing work. GraphMind is available at https://oyarsa.github.io/graphmind and a demonstration video at https://youtu.be/wKbjQpSvwJg. The source code is available at https://github.com/oyarsa/graphmind.


Paper and project links

PDF 9 pages, 6 figures, 3 tables, EMNLP 2025 Demo paper

Summary

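LLMs show strong reasoning and text generation capabilities, which has prompted their use in scientific literature analysis, including novelty assessment. Judging novelty requires extensive knowledge of related work that not all reviewers have, and existing LLM-assisted approaches offer limited transparency and traceability. GraphMind is an easy-to-use interactive web tool for evaluating the novelty of scientific papers or drafted ideas. It integrates external APIs such as arXiv and Semantic Scholar with LLMs so that users can annotate key elements of a paper, explore related papers through various relationships, and assess novelty with verifiable contextual insights.

Key Takeaways

LLMs can support novelty assessment in scientific literature analysis.
GraphMind is an interactive web tool that helps users evaluate the novelty of scientific papers or drafted ideas.
By integrating LLMs with external APIs such as arXiv and Semantic Scholar, it gives a rich, structured view of an idea's core contributions and its connections to existing work.
Users can capture the main structure of a paper and explore related ideas from various perspectives.
Users can annotate key elements of a paper and explore related papers through various relationships.
Verifiable contextual insights help users assess novelty.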


MirrorFuzz: Leveraging LLM and Shared Bugs for Deep Learning Framework APIs Fuzzing

Authors:Shiwen Ou, Yuwei Li, Lu Yu, Chengkun Wei, Tingke Wen, Qiangpu Chen, Yu Chen, Haizhi Tang, Zulie Pan

Deep learning (DL) frameworks serve as the backbone for a wide range of artificial intelligence applications. However, bugs within DL frameworks can cascade into critical issues in higher-level applications, jeopardizing reliability and security. While numerous techniques have been proposed to detect bugs in DL frameworks, research exploring common API patterns across frameworks and the potential risks they entail remains limited. Notably, many DL frameworks expose similar APIs with overlapping input parameters and functionalities, rendering them vulnerable to shared bugs, where a flaw in one API may extend to analogous APIs in other frameworks. To address this challenge, we propose MirrorFuzz, an automated API fuzzing solution to discover shared bugs in DL frameworks. MirrorFuzz operates in three stages: First, MirrorFuzz collects historical bug data for each API within a DL framework to identify potentially buggy APIs. Second, it matches each buggy API in a specific framework with similar APIs within and across other DL frameworks. Third, it employs large language models (LLMs) to synthesize code for the API under test, leveraging the historical bug data of similar APIs to trigger analogous bugs across APIs. We implement MirrorFuzz and evaluate it on four popular DL frameworks (TensorFlow, PyTorch, OneFlow, and Jittor). Extensive evaluation demonstrates that MirrorFuzz improves code coverage by 39.92% and 98.20% compared to state-of-the-art methods on TensorFlow and PyTorch, respectively. Moreover, MirrorFuzz discovers 315 bugs, 262 of which are newly found, and 80 bugs are fixed, with 52 of these bugs assigned CNVD IDs.


Paper and project links

PDF Accepted for publication in IEEE Transactions on Software Engineering (TSE), 2025

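Summary

DL frameworks underpin many AI applications, and bugs in them can cascade into critical issues. Many frameworks expose similar APIs with overlapping parameters and functionality, so a flaw in one API may extend to analogous APIs elsewhere. MirrorFuzz is an automated API fuzzing solution for finding such shared bugs, and it works in three stages. First, it collects historical bug data for each API in a framework to identify potentially buggy APIs. Second, it matches each buggy API with similar APIs within and across other frameworks. Third, it uses LLMs to synthesize code for the API under test, using the historical bug data of similar APIs to trigger analogous bugs. Evaluated on TensorFlow, PyTorch, OneFlow, and Jittor, MirrorFuzz improves code coverage over state-of-the-art methods by 39.92% on TensorFlow and 98.20% on PyTorch. It discovers 315 bugs, 262 of them new; 80 have been fixed, and 52 of those were assigned CNVD IDs.

Key Takeaways

Bugs in DL frameworks can threaten the reliability and security of higher-level applications.
MirrorFuzz is an automated API fuzzing solution targeting shared bugs in DL frameworks.
It collects historical bug data to identify potentially buggy APIs and matches them with similar APIs within and across frameworks.
LLMs synthesize test code for the API under test to trigger analogous bugs.
Code coverage improves by 39.92% on TensorFlow and 98.20% on PyTorch over state-of-the-art methods.
MirrorFuzz discovers 315 bugs, 262 newly found; 80 are fixed, and 52 of these have CNVD IDs.
The evaluation covers four frameworks: TensorFlow, PyTorch, OneFlow, and Jittor.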
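
The second stage, matching a buggy API with similar APIs elsewhere, can be approximated by comparing parameter names. The abstract does not say how MirrorFuzz measures similarity, so the Jaccard overlap used below is an assumption, and the API names, parameter sets, and 0.5 threshold are hypothetical rather than taken from any real framework.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_similar_apis(
    buggy_params: Set[str],
    candidates: Dict[str, Set[str]],
    threshold: float = 0.5,
) -> List[Tuple[str, float]]:
    """Return candidate APIs whose parameter overlap meets the threshold."""
    scored = [(name, jaccard(buggy_params, params)) for name, params in candidates.items()]
    kept = [pair for pair in scored if pair[1] >= threshold]
    # Most similar first, so they can be fuzzed with the known bug pattern first.
    return sorted(kept, key=lambda pair: -pair[1])


# Hypothetical API with a known historical bug, described by its parameters.
buggy_params = {"input", "weight", "stride", "padding"}
# Hypothetical APIs from another framework.
candidates = {
    "framework_b.conv2d": {"input", "filters", "stride", "padding"},
    "framework_b.relu": {"input"},
}
matches = match_similar_apis(buggy_params, candidates)
```

Only the convolution-like candidate passes the threshold, so its historical counterpart's bug-triggering inputs would be the ones handed to the code synthesis stage. Parameter names alone miss APIs that are functionally similar but named differently, which is one reason a real matcher would likely use richer signals.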


HypoSpace: Evaluating LLM Creativity as Set-Valued Hypothesis Generators under Underdetermination

Authors:Tingting Chen, Beibei Lin, Zifeng Yuan, Qiran Zou, Hongyu He, Yew-Soon Ong, Anirudh Goyal, Dianbo Liu

As language models are increasingly used in scientific workflows, evaluating their ability to propose sets of explanations, not just a single correct answer, becomes critical. Many scientific problems are underdetermined: multiple, mechanistically distinct hypotheses are consistent with the same observations. We introduce HypoSpace, a diagnostic suite that treats LLMs as samplers of finite hypothesis sets and measures three complementary indicators: Validity (precision of proposals consistent with observations), Uniqueness (non-redundancy among proposals), and Recovery (coverage of the enumerated admissible set). We instantiate HypoSpace in three structured domains with deterministic validators and exactly enumerated hypothesis spaces: (i) causal graphs from perturbations, (ii) gravity-constrained 3D voxel reconstruction from top-down projections, and (iii) Boolean genetic interactions. Across instruction-tuned and reasoning-focused models, Validity often remains high while Uniqueness and Recovery degrade as the admissible space grows, revealing mode collapse that is invisible to correctness-only metrics. HypoSpace offers a controlled probe, rather than a leaderboard, for methods that explicitly explore and cover admissible explanation spaces. Code is available at: https://github.com/CTT-Pavilion/_HypoSpace.


Paper and project links

PDF

Summary

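As language models are used more in scientific workflows, it becomes critical to evaluate whether they can propose sets of explanations rather than a single correct answer. HypoSpace is a diagnostic suite that treats LLMs as samplers of finite hypothesis sets and measures three complementary indicators: Validity (precision of proposals consistent with observations), Uniqueness (non-redundancy among proposals), and Recovery (coverage of the enumerated admissible set). It is instantiated in three structured domains: causal graphs from perturbations, gravity-constrained 3D voxel reconstruction from top-down projections, and Boolean genetic interactions. Validity often stays high while Uniqueness and Recovery degrade as the admissible space grows, revealing mode collapse that correctness-only metrics do not show. HypoSpace is offered as a controlled probe for methods that explicitly explore and cover admissible explanation spaces.

Key Takeaways

As language models enter scientific workflows, their ability to propose multiple explanations becomes important to evaluate.
Many scientific problems are underdetermined: several mechanistically distinct hypotheses fit the same observations.
HypoSpace is a diagnostic suite for evaluating LLMs as hypothesis-set generators.
It measures three indicators: Validity, Uniqueness, and Recovery.
Uniqueness and Recovery degrade as the admissible space grows, while Validity often remains high.
This reveals mode collapse that is invisible to correctness-only metrics.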
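
The three indicators can be computed directly once the admissible hypothesis set is enumerated. The formulas below are one straightforward reading of the abstract's wording (precision, non-redundancy, coverage); the paper's exact definitions may differ, and the function name and hypothesis labels are made up for the example.

```python
from typing import Dict, Hashable, Sequence, Set


def hypospace_metrics(
    proposals: Sequence[Hashable],
    admissible: Set[Hashable],
) -> Dict[str, float]:
    """Score a list of sampled hypotheses against the enumerated admissible set."""
    if not proposals or not admissible:
        return {"validity": 0.0, "uniqueness": 0.0, "recovery": 0.0}
    distinct = set(proposals)
    return {
        # Share of proposals that are consistent with the observations.
        "validity": sum(p in admissible for p in proposals) / len(proposals),
        # Share of proposals that are not repeats of one another.
        "uniqueness": len(distinct) / len(proposals),
        # Share of the admissible set that the proposals actually reach.
        "recovery": len(distinct & admissible) / len(admissible),
    }


# Four hypotheses fit the observations; the sampler keeps repeating one of them.
admissible = {"h1", "h2", "h3", "h4"}
proposals = ["h1", "h1", "h1", "h5"]
scores = hypospace_metrics(proposals, admissible)
```

In this toy run three of four proposals are valid, yet only one of the four admissible hypotheses is ever found. That gap between high Validity and low Recovery is the mode-collapse pattern the summary describes.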


Can generative AI figure out figurative language? The influence of idioms on essay scoring by ChatGPT, Gemini, and Deepseek

Authors:Enis Oğuz

The developments in Generative AI technologies have paved the way for numerous innovations in different fields. Recently, Generative AI has been proposed as a competitor to AES systems in evaluating student essays automatically. Considering the potential limitations of AI in processing idioms, this study assessed the scoring performances of Generative AI models for essays with and without idioms by incorporating insights from Corpus Linguistics and Computational Linguistics. Two equal essay lists were created from 348 student essays taken from a corpus: one with multiple idioms present in each essay and another with no idioms in essays. Three Generative AI models (ChatGPT, Gemini, and Deepseek) were asked to score all essays in both lists three times, using the same rubric used by human raters in assigning essay scores. The results revealed excellent consistency for all models, but Gemini outperformed its competitors in interrater reliability with human raters. There was also no detectable bias for any demographic group in AI assessment. For essays with multiple idioms, Gemini followed the most similar pattern to human raters. While the models in the study demonstrated potential for a hybrid approach, Gemini was the best candidate for the task due to its ability to handle figurative language and showed promise for handling essay-scoring tasks alone in the future.

Paper and project links

PDF

Summary

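This study examines how generative AI models score student essays with and without idioms. Drawing on corpus linguistics and computational linguistics, it evaluates three generative AI models: ChatGPT, Gemini, and Deepseek. All models scored with excellent consistency, but Gemini achieved the best interrater reliability with human raters. For essays containing multiple idioms, Gemini's scoring pattern was the closest to that of human raters. Gemini in particular shows promise for handling essay scoring on its own in the future.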

Key Takeaways

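Generative AI technologies have driven innovation across many fields.
Generative AI has recently been proposed as a competitor to automated essay scoring (AES) systems for evaluating student essays.
The study assesses how generative AI models score essays with and without idioms.
Three generative AI models (ChatGPT, Gemini, and Deepseek) were evaluated.
All models showed excellent consistency, but Gemini had the best interrater reliability with human raters.
For essays with multiple idioms, Gemini's scoring pattern was the closest to that of human raters.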

Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers

Authors:Xin-Qiang Cai, Wei Wang, Feng Liu, Tongliang Liu, Gang Niu, Masashi Sugiyama

Reinforcement Learning with Verifiable Rewards (RLVR) trains policies against automated verifiers to avoid costly human labeling. To reduce vulnerability to verifier hacking, many RLVR systems collapse rewards to binary $\{0,1\}$ during training. This choice carries a cost: it introduces false negatives (FNs, rejecting correct answers) and false positives (FPs, accepting incorrect ones). For instance, a rule-based checker may mark the correct fraction $\frac{12}{36}$ as wrong when compared against the canonical $\frac{1}{3}$ due to brittle parsing/equivalence rules (FN), while large language model (LLM) judges can be gamed by superficial cues or even a single adversarial token, yielding inflated correctness for wrong solutions (FP). We formalize verifier unreliability by modeling the verifier as a stochastic reward channel with asymmetric noise rates. From this abstraction, we derive two correction algorithms for verifier errors. The first is a backward correction that de-biases the observed binary reward to recover an unbiased estimator of the clean policy gradient. The second is a forward correction that reweights score-function terms so that the expected update direction aligns with the clean gradient; notably, it requires only the FN rate. We implement both as lightweight hooks in a group relative policy optimization (GRPO)-based RLVR pipeline and evaluate them on math-reasoning models and benchmarks. Across models and datasets, both corrections improve over uncorrected training; the forward variant converges faster and remains stable under heavier noise. Finally, we show a practical appeal mechanism in which a lightweight LLM verifier estimates the FN rate online by rechecking rule-based negatives, outperforming other state-of-the-art contenders.

Paper and project links

PDF

Summary

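Reinforcement Learning with Verifiable Rewards (RLVR) trains policies against automated verifiers, avoiding costly human labeling. To reduce the risk of verifier hacking, many RLVR systems collapse rewards to binary {0,1} during training. This comes at a cost: it introduces false negatives (rejecting correct answers) and false positives (accepting incorrect ones). For example, a rule-based checker may mark the correct fraction 12/36 as wrong when comparing it with the canonical 1/3 (a false negative), while an LLM judge can be swayed by superficial cues or even a single adversarial token and rate a wrong solution as correct (a false positive). The paper formalizes verifier unreliability by modeling the verifier as a stochastic reward channel with asymmetric noise rates, and derives two correction algorithms from that abstraction. The backward correction de-biases the observed binary reward to recover an unbiased estimator of the clean policy gradient. The forward correction reweights the score-function terms so that the expected update direction aligns with the clean gradient; notably, it needs only the false negative rate. Both are implemented as lightweight hooks in a group relative policy optimization (GRPO)-based RLVR pipeline and evaluated on math-reasoning models and benchmarks. Across models and datasets, both corrections improve over uncorrected training; the forward correction converges faster and stays stable under heavier noise. Finally, the paper presents a practical appeal mechanism in which a lightweight LLM verifier estimates the false negative rate online by rechecking rule-based negatives, outperforming other state-of-the-art contenders.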

Key Takeaways

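RLVR trains policies against automated verifiers, reducing reliance on costly human labeling.
Verifiers can be gamed (verifier hacking), which is why many systems collapse rewards to binary values.
Collapsing rewards to binary {0,1} can introduce false negatives and false positives.
Verifier unreliability can be formalized by modeling the verifier as a stochastic reward channel.
Two algorithms are proposed to correct verifier errors: a backward correction and a forward correction.
Both are implemented and evaluated in a GRPO-based RLVR pipeline.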
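
To make the backward correction concrete, here is a minimal sketch of the standard de-biasing step for a binary reward observed through an asymmetric noise channel. It is the textbook noisy-label correction, not necessarily the paper's exact estimator, and the names `fp_rate` and `fn_rate` are assumptions for the two noise rates. If the verifier reports 1 with probability `fp_rate` for a wrong answer and 0 with probability `fn_rate` for a correct one, then the expected observed reward is `fp_rate + (1 - fp_rate - fn_rate) * r`, so inverting that affine map gives an estimate whose expectation equals the clean reward `r`.

```python
def backward_corrected_reward(observed: int, fp_rate: float, fn_rate: float) -> float:
    """De-bias one observed binary reward (illustrative, standard form).

    fp_rate: P(observed = 1 | clean = 0), the false positive rate (assumed known).
    fn_rate: P(observed = 0 | clean = 1), the false negative rate (assumed known).
    """
    denom = 1.0 - fp_rate - fn_rate
    if denom <= 0.0:
        raise ValueError("noise rates must satisfy fp_rate + fn_rate < 1")
    return (observed - fp_rate) / denom


def expected_corrected_reward(clean: int, fp_rate: float, fn_rate: float) -> float:
    """Exact expectation of the corrected reward given the clean reward."""
    # Probability that the noisy verifier outputs 1 for this clean label.
    p_one = (1.0 - fn_rate) if clean == 1 else fp_rate
    return (
        p_one * backward_corrected_reward(1, fp_rate, fn_rate)
        + (1.0 - p_one) * backward_corrected_reward(0, fp_rate, fn_rate)
    )


if __name__ == "__main__":
    # Hypothetical noise rates: 10% false positives, 30% false negatives.
    for clean in (0, 1):
        print(clean, expected_corrected_reward(clean, fp_rate=0.1, fn_rate=0.3))
```

The corrected reward can fall outside [0, 1] for a single sample; only its expectation matches the clean reward. The forward correction is not sketched here because the source does not spell out its weights.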

CCD: Mitigating Hallucinations in Radiology MLLMs via Clinical Contrastive Decoding

Authors:Xi Zhang, Zaiqiao Meng, Jake Lever, Edmond S. L. Ho

Multimodal large language models (MLLMs) have recently achieved remarkable progress in radiology by integrating visual perception with natural language understanding. However, they often generate clinically unsupported descriptions, known as medical hallucinations, which pose serious risks in medical applications that demand accuracy and image-grounded outputs. Through empirical analysis, we find that prompt-induced hallucinations remain prevalent in radiology MLLMs, largely due to over-sensitivity to clinical sections. To address this, we introduce Clinical Contrastive Decoding (CCD), a training-free and retrieval-free inference framework that integrates structured clinical signals from task-specific radiology expert models. CCD introduces a dual-stage contrastive mechanism to refine token-level logits during generation, thereby enhancing clinical fidelity without modifying the base MLLM. Experiments on three datasets and multiple models demonstrate that CCD consistently improves overall performance on radiology report generation (RRG). On the MIMIC-CXR dataset, it yields up to a 17% improvement in RadGraph-F1 when applied to state-of-the-art RRG models. Our approach provides a lightweight and generalisable solution for mitigating medical hallucinations, effectively bridging expert models and MLLMs in radiology.

Paper and project links

PDF Preprint, 27 pages, 3 figures

Summary

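Multimodal large language models (MLLMs) have made notable progress in radiology by combining visual perception with natural language understanding. However, they often generate clinically unsupported descriptions, known as medical hallucinations, which pose serious risks in medical applications that demand accurate, image-grounded outputs. The authors find that prompt-induced hallucinations remain prevalent in radiology MLLMs, largely because of over-sensitivity to clinical sections. To address this, the paper proposes Clinical Contrastive Decoding (CCD), a training-free and retrieval-free inference framework that integrates structured clinical signals from task-specific radiology expert models. CCD uses a dual-stage contrastive mechanism to refine token-level logits during generation, improving clinical fidelity without modifying the base MLLM. Experiments show that CCD consistently improves overall performance on radiology report generation (RRG). On the MIMIC-CXR dataset, it yields up to a 17% improvement in RadGraph-F1 when applied to state-of-the-art RRG models. The approach is a lightweight, generalisable way to mitigate medical hallucinations and to bridge expert models and MLLMs.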

Key Takeaways

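Multimodal large language models (MLLMs) have made notable progress in radiology by combining visual perception with natural language understanding.
MLLMs can generate clinically unsupported descriptions (medical hallucinations), which is risky in medical applications.
Prompt-induced hallucinations remain prevalent in radiology MLLMs, largely because of over-sensitivity to clinical sections.
To address this, the paper proposes Clinical Contrastive Decoding (CCD), a training-free and retrieval-free inference framework.
CCD integrates structured clinical signals from task-specific radiology expert models and improves clinical fidelity through a dual-stage contrastive mechanism.
Experiments show that CCD consistently improves overall performance on radiology report generation.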
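
The idea of refining token-level logits by contrast can be sketched in a few lines. This is a generic, single-stage contrastive-decoding form, not CCD's actual dual-stage mechanism, which the summary does not specify. The names `guided` (logits when the prompt carries an expert-model signal), `plain` (logits without it), and `alpha` are all assumptions made for illustration.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(guided: List[float], plain: List[float], alpha: float = 1.0) -> List[float]:
    """Amplify whatever the guidance signal changed in the next-token logits.

    Returns (1 + alpha) * guided - alpha * plain, element by element.
    alpha = 0 leaves the guided logits unchanged.
    """
    if len(guided) != len(plain):
        raise ValueError("logit vectors must have the same length")
    return [(1.0 + alpha) * g - alpha * p for g, p in zip(guided, plain)]


if __name__ == "__main__":
    # Hypothetical three-token vocabulary and made-up logits.
    vocab = ["normal", "effusion", "the"]
    plain = [2.0, 1.5, 0.0]    # base model without the expert signal
    guided = [2.0, 2.0, 0.0]   # same model with the expert signal in context
    adjusted = contrastive_logits(guided, plain, alpha=1.0)
    probs = softmax(adjusted)
    print(vocab[probs.index(max(probs))])
```

Because only the logits are modified at inference time, a scheme of this kind leaves the base model's weights untouched, which matches the training-free property described above.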

Imperative vs. Declarative Programming Paradigms for Open-Universe Scene Generation

Authors:Maxim Gumin, Do Heon Han, Seung Jean Yoo, Aditya Ganeshan, R. Kenny Jones, Rio Aguina-Kang, Stewart Morris, Daniel Ritchie

Current methods for generating 3D scene layouts from text predominantly follow a declarative paradigm, where a Large Language Model (LLM) specifies high-level constraints that are then resolved by a separate solver. This paper challenges that consensus by introducing a more direct, imperative approach. We task an LLM with generating a step-by-step program that iteratively places each object relative to those already in the scene. This paradigm simplifies the underlying scene specification language, enabling the creation of more complex, varied, and highly structured layouts that are difficult to express declaratively. To improve the robustness, we complement our method with a novel, LLM-free error correction mechanism that operates directly on the generated code, iteratively adjusting parameters within the program to resolve collisions and other inconsistencies. In forced-choice perceptual studies, human participants overwhelmingly preferred our imperative layouts, choosing them over those from two state-of-the-art declarative systems 82% and 94% of the time, demonstrating the significant potential of this alternative paradigm. Finally, we present a simple automated evaluation metric for 3D scene layout generation that correlates strongly with human judgment.

Paper and project links

PDF

Summary

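The paper proposes an imperative approach to generating 3D scene layouts: a large language model (LLM) writes a step-by-step program that iteratively places each object relative to those already in the scene. This simplifies the scene specification language and enables complex, varied, and highly structured layouts that are difficult to express declaratively. To improve robustness, the method is paired with an LLM-free error correction mechanism that operates directly on the generated code, adjusting program parameters to resolve collisions and other inconsistencies. In forced-choice perceptual studies, participants chose the imperative layouts over those from two state-of-the-art declarative systems 82% and 94% of the time. Finally, the paper presents a simple automated evaluation metric for 3D scene layout generation that correlates strongly with human judgment.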

Key Takeaways

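The paper proposes a new, imperative approach to generating 3D scene layouts.
An LLM writes a step-by-step program that places the objects in the scene.
The scene specification language is simpler and supports complex, varied, and highly structured layouts.
An LLM-free error correction mechanism improves the robustness of generation.
In perceptual studies, participants chose the imperative layouts 82% and 94% of the time.
In these studies the approach showed a clear advantage over two state-of-the-art declarative systems.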
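
The imperative style with parameter-level error correction can be sketched as follows. This is a toy 2D version under stated assumptions: the `Box` type, the `place_right_of` helper, and the `gap` parameter are hypothetical names, not the paper's scene language, and the correction loop here only handles collisions by widening one parameter.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """Axis-aligned footprint: center (x, y), width w along x, depth d along y."""
    name: str
    x: float
    y: float
    w: float
    d: float


def overlaps(a: Box, b: Box) -> bool:
    """True if the two footprints intersect (touching edges do not count)."""
    return abs(a.x - b.x) < (a.w + b.w) / 2 and abs(a.y - b.y) < (a.d + b.d) / 2


def place_right_of(scene: List[Box], name: str, w: float, d: float, anchor: str,
                   gap: float = 0.0, step: float = 0.25, max_iters: int = 100) -> Box:
    """Place a new box to the right of `anchor`, widening `gap` until collision-free."""
    ref = next(b for b in scene if b.name == anchor)
    for _ in range(max_iters):
        candidate = Box(name, ref.x + (ref.w + w) / 2 + gap, ref.y, w, d)
        if not any(overlaps(candidate, other) for other in scene):
            scene.append(candidate)
            return candidate
        gap += step  # LLM-free correction: adjust the parameter and retry
    raise RuntimeError(f"could not place {name!r} without a collision")


if __name__ == "__main__":
    # A hypothetical program of the kind an LLM might emit, one step per object.
    scene = [Box("bed", 0.0, 0.0, 2.0, 2.0)]
    place_right_of(scene, "nightstand", 0.5, 0.5, anchor="bed")
    place_right_of(scene, "plant", 0.5, 0.5, anchor="bed")  # collides first, then shifts
    for box in scene:
        print(box)
```

Each statement places one object relative to what is already in the scene, so no separate constraint solver is needed; the retry loop stands in for the paper's mechanism of adjusting parameters inside the generated program.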

Kedreamix

https://kedreamix.github.io/Talk2Paper/Paper/2025-10-21/LLM/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting.

LLM

LLM

Updated 2025-10-21

OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM

BiomedXPro: Prompt Optimization for Explainable Diagnosis with Biomedical Vision Language Models

PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold

InfiMed-ORBIT: Aligning LLMs on Open-Ended Complex Tasks via Rubric-Based Incremental Training

Paper2Web: Let’s Make Your Paper Alive!

Self-evolving expertise in complex non-verifiable subject domains: dialogue as implicit meta-RL

LLMs Judge Themselves: A Game-Theoretic Framework for Human-Aligned Evaluation

FACE: A General Framework for Mapping Collaborative Filtering Embeddings into LLM Tokens

The 3rd Place Solution of CCIR CUP 2025: A Framework for Retrieval-Augmented Generation in Multi-Turn Legal Conversation

GraphMind: Interactive Novelty Assessment System for Accelerating Scientific Discovery

MirrorFuzz: Leveraging LLM and Shared Bugs for Deep Learning Framework APIs Fuzzing

HypoSpace: Evaluating LLM Creativity as Set-Valued Hypothesis Generators under Underdetermination

Can generative AI figure out figurative language? The influence of idioms on essay scoring by ChatGPT, Gemini, and Deepseek

Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers

CCD: Mitigating Hallucinations in Radiology MLLMs via Clinical Contrastive Decoding

Imperative vs. Declarative Programming Paradigms for Open-Universe Scene Generation